1. https://scipython.com/blog/scraping-a-wikipedia-table-with-beautiful-soup/
2. https://beautiful-soup-4.readthedocs.io/en/latest/#quick-start

In [34]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# Question 1

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:
![alt text](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1591228800000&hmac=jE4j1xhA_kfXeMEO_FzkscrlzL1VJ9l6Pg6l72G8eGg)
3. To create the above dataframe:
    * The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    * Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    * More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
    * If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
    * Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
    * In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your Github repository. (10 marks)

Note: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.

Another way, which would help to learn for more complicated cases of web scraping is using the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.

## Answer 1

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipedia_html = requests.get(url).text

Parse the Wikipedia page into a BeautifulSoup object.

In [3]:
soup = BeautifulSoup(wikipedia_html, 'html.parser')
#print(soup.prettify())

If we take a look at the HTML source, we'll see that our table has the class `wikitable sortable`. So let's find it!

In [367]:
table_html = soup.find('table',{'class':'wikitable sortable'})

#you can see the table here, it's just too long to show
#table_html

Nice! Let's continue extracting the elements of this table. First we want the headers, which are in the `th` elements of the HTML table.

In [5]:
#headers
header = [th.text.rstrip() for th in table_html.find_all('th')]
print(header)

['Postal Code', 'Borough', 'Neighborhood']


Now we want the items of the table. Each row is a `tr` tag and its cells are `td` tags. We can build one list per column and then zip them together.

In [6]:
#table columns
pcode = []
br = []
neigh = []

for tr in table_html.find_all('tr'):
    tds = tr.find_all('td')
    if not tds:
        continue
    postal_code, borough, neighborhood = [td.text.strip() for td in tds[:3]]
    
    pcode.append(postal_code)
    br.append(borough)
    neigh.append(neighborhood)    

In [7]:
postal_list = list(zip(pcode, br, neigh))
postal_list[0:5]

[('M1A', 'Not assigned', 'Not assigned'),
 ('M2A', 'Not assigned', 'Not assigned'),
 ('M3A', 'North York', 'Parkwoods'),
 ('M4A', 'North York', 'Victoria Village'),
 ('M5A', 'Downtown Toronto', 'Regent Park, Harbourfront')]

Here's our DataFrame! But as we can see, it's not done yet.

In [8]:
df_canada_pcodes = pd.DataFrame(postal_list, columns = header)
print(df_canada_pcodes.shape)
df_canada_pcodes.head()

(180, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


The only cleanup we need is to drop the rows whose borough is Not assigned. Why?
1. Neighborhoods that share a postal code are already combined into one row in the table.
2. No row with an assigned borough has a neighborhood that is Not assigned.

In [9]:
df_canada_pcodes = df_canada_pcodes[df_canada_pcodes['Borough'] != 'Not assigned']
df_canada_pcodes.reset_index(inplace=True, drop=True)
print(df_canada_pcodes.shape)
df_canada_pcodes

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


There you go!
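The Wikipedia table had already combined neighborhoods per postal code, so only the borough filter was needed here. If the page listed one row per neighborhood instead, the remaining rules from the question could be applied roughly like this. It is a minimal sketch on made-up rows; `raw` and `clean` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy rows in a one-row-per-neighborhood layout (made-up sample)
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. a missing neighborhood takes the borough's name
missing = clean['Neighborhood'] == 'Not assigned'
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# 3. combine neighborhoods that share a postal code into one comma-separated row
clean = (clean.groupby(['Postal Code', 'Borough'], as_index=False)['Neighborhood']
              .agg(', '.join))

print(clean)
print(clean.shape)
```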

---

# Question 2

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:

```python
import geocoder # import geocoder

# postal code to look up (M5G, as in the example above)
postal_code = 'M5G'

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
```

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:

![alt text](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1591228800000&hmac=bRspFs7vQ1PPd_YZTEjeIW4TdqUsudvFiHfa49n7-tM)

Important Note: There is a limit on how many times you can call the geocoder.google function. It is 2500 times per day. This should be way more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in Toronto.

## Answer 2

In [10]:
import geocoder # import geocoder

In [11]:
##IMPORTANT
#In my attempts, the geocoder wasn't retrieving the lat and long, so I decided to use the csv file

# df_canada_pcodes['Latitude'] = ''
# df_canada_pcodes['Longitude'] = ''

# # loop through the postal code to find out the latitude and longitude
# for index, data in df_canada_pcodes.iterrows():
#     lat_lng_coords = None
#     while(lat_lng_coords is None):
#         g = geocoder.google('{}, Toronto, Ontario'.format(data['Postal Code']))
#         lat_lng_coords = g.latlng
#     data['Latitude'] = lat_lng_coords[0]
#     data['Longitude'] = lat_lng_coords[1]
#     print ('PostalCode:', data['Postal Code'], 'Latitude:', data['Latitude'], 'Longitude:', data['Longitude'])
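A `while` loop like the one above never finishes if the geocoder keeps returning `None`. One way around that is to cap the number of attempts and fall back to the csv file when they run out. The sketch below assumes nothing about the Geocoder package itself: `fetch_with_retries` and `flaky_lookup` are hypothetical helpers, and the coordinates are made up.

```python
import time

def fetch_with_retries(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay)  # brief pause before trying again
    return None  # the caller decides what to do next, e.g. use the csv file


# Stand-in for a flaky geocoder: fails twice, then answers (made-up coordinates)
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.65, -79.38]

coords = fetch_with_retries(flaky_lookup, 'M5G, Toronto, Ontario')
print(coords, calls['n'])
```

With a real geocoder, `lookup` could be a small function that makes the request and returns its `latlng` value.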

In [12]:
df = pd.read_csv('Geospatial_Coordinates.csv')

Let's iterate through the geospatial coordinates and add them to our data frame.

In [13]:
df_canada_pcodes['Latitude'] = ''
df_canada_pcodes['Longitude'] = ''

for index, data in df_canada_pcodes.iterrows():
    # look up this postal code in the csv and copy its coordinates
    q = df.query('`Postal Code` == "{}"'.format(data['Postal Code'])).values
    # write through the dataframe itself: rows yielded by iterrows() may be copies
    df_canada_pcodes.at[index, 'Latitude'] = q[0, 1]
    df_canada_pcodes.at[index, 'Longitude'] = q[0, 2]
    
df_canada_pcodes

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7533,-79.3297
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7185,-79.4648
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623,-79.3895
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6537,-79.5069
99,M4Y,Downtown Toronto,Church and Wellesley,43.6659,-79.3832
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.6627,-79.3216
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6363,-79.4985
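The same lookup can also be written as a single join instead of a row-by-row loop. This is only a sketch of that alternative on made-up stand-ins for the two tables: `codes`, `coords`, and the `M9Z` row are hypothetical, while the column names match the ones used above.

```python
import pandas as pd

# Made-up stand-ins for the postal-code table and the coordinates csv
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.7259, 43.7533],
    'Longitude': [-79.3156, -79.3297],
})

# a left join keeps every row of codes, in its original order
merged = codes.merge(coords, on='Postal Code', how='left')
print(merged)

# postal codes with no match show up as NaN instead of raising an error
print(merged['Latitude'].isna().sum())
```

A left join makes missing postal codes visible as NaN, whereas indexing `q[0, 1]` on an empty query result would raise an error.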


# Question 3

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.

## Answer 3

As we did with Manhattan, let's cluster a single borough.

In [321]:
# just focus on North York
ny_df = df_canada_pcodes[df_canada_pcodes['Borough'] == 'North York'].reset_index(drop=True)

#the index 12 gave me problems later, with NaN values. So I decided to drop it!
ny_df = ny_df.drop(index=12)
ny_df.reset_index(inplace=True, drop=True)
ny_df.head(30)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7533,-79.3297
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7185,-79.4648
3,M3B,North York,Don Mills,43.7459,-79.3522
4,M6B,North York,Glencairn,43.7096,-79.4451
5,M3C,North York,Don Mills,43.7259,-79.3409
6,M2H,North York,Hillcrest Village,43.8038,-79.3635
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.7543,-79.4423
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.7785,-79.3466
9,M3J,North York,"Northwood Park, York University",43.768,-79.4873


#### Define Foursquare Credentials and Version

In [238]:
CLIENT_ID = 'IUSR3EJDBDGC2EE5VXBYQMTMKN01HTP3TKAPLAJCA535L4UF' # your Foursquare ID
CLIENT_SECRET = 'AZ4REZIQGBZ4XJXL3I14KDMKO5GY5Z4RQREDY3G3WTJBEEDD' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: IUSR3EJDBDGC2EE5VXBYQMTMKN01HTP3TKAPLAJCA535L4UF
CLIENT_SECRET:AZ4REZIQGBZ4XJXL3I14KDMKO5GY5Z4RQREDY3G3WTJBEEDD


#### Explore Neighborhoods in North York

The problem I ran into with North York was that some neighborhood names appear under more than one postal code.
For example:
1. Downsview
2. Don Mills

So grouping and joining by neighborhood name collapsed those rows, shrank the data set, and caused problems. The correct operation has to be keyed on the Postal Code, because it is the identifier that works like a primary key.
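The effect is easy to reproduce on a few made-up venue rows. In this sketch the venues are invented, but M3B and M3C are both Don Mills, as in the table above.

```python
import pandas as pd

# Made-up venue rows: two postal codes share the neighborhood name "Don Mills"
venues = pd.DataFrame({
    'Postal Code': ['M3B', 'M3B', 'M3C', 'M3A'],
    'Neighborhood': ['Don Mills', 'Don Mills', 'Don Mills', 'Parkwoods'],
    'Venue': ['Gym', 'Cafe', 'Bakery', 'Park'],
})

by_name = venues.groupby('Neighborhood').size()
by_code = venues.groupby('Postal Code').size()

print(len(by_name))  # M3B and M3C are merged into one "Don Mills" group
print(len(by_code))  # one group per postal code
```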

In [239]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

def getNearbyVenues(pcodes, names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for pcode, name, lat, lng in zip(pcodes, names, latitudes, longitudes): 
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            pcode, 
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 'Neighborhood', 'Neighborhood Latitude', 
                  'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    
    return(nearby_venues)
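One fragile spot in `getNearbyVenues` is the chain of lookups on the response: it assumes there is always a first group and that every venue has at least one category. The sketch below shows a more defensive way to read the same fields. `extract_venues` is a hypothetical helper and `sample` is a made-up payload that only mimics the fields read above; it is not taken from the Foursquare documentation.

```python
def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of an explore-style response."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Made-up payload shaped like the fields getNearbyVenues reads
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Some Cafe',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.76, 'lng': -79.34},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({'response': {}}))  # an empty response yields no rows
```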

In [240]:
ny_venues = getNearbyVenues(pcodes = ny_df['Postal Code'], names=ny_df['Neighborhood'], 
                                   latitudes=ny_df['Latitude'], longitudes=ny_df['Longitude'])

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills
Glencairn
Don Mills
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview
Downsview
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale, Willowdale East
Downsview
York Mills West
Willowdale, Willowdale West


Let's check how many venues were returned for each neighborhood

In [241]:
ny_venues.groupby('Postal Code').count()

Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M2H,5,5,5,5,5,5,5
M2J,67,67,67,67,67,67,67
M2K,4,4,4,4,4,4,4
M2M,1,1,1,1,1,1,1
M2N,33,33,33,33,33,33,33
M2P,4,4,4,4,4,4,4
M2R,6,6,6,6,6,6,6
M3A,3,3,3,3,3,3,3
M3B,4,4,4,4,4,4,4
M3C,21,21,21,21,21,21,21


#### Analyze each neighborhood

In [242]:
# one hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
# and postal code too! 
ny_onehot['Neighborhood'] = ny_venues['Neighborhood'] 
ny_onehot['Postal Code'] = ny_venues['Postal Code']

# move the postal code column (added last) to the front
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

#group them
ny_grouped = ny_onehot.groupby(by=['Postal Code', 'Neighborhood']).mean().reset_index()

print(ny_grouped.shape)
ny_grouped.head()

(23, 103)


Unnamed: 0,Postal Code,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,...,Supermarket,Supplement Shop,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,M2H,Hillcrest Village,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M2J,"Fairview, Henry Farm, Oriole",0.014925,0.0,0.014925,0.0,0.0,0.014925,0.0,0.014925,...,0.0,0.014925,0.0,0.029851,0.0,0.014925,0.014925,0.014925,0.0,0.014925
2,M2K,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M2M,"Willowdale, Newtonbrook",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M2N,"Willowdale, Willowdale East",0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,...,0.0,0.0,0.060606,0.0,0.0,0.0,0.0,0.0,0.030303,0.0
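To see why the mean of the one-hot columns can be read as a frequency, here is the same idea on a handful of made-up venues. `venues`, `onehot`, and `freq` are hypothetical names for this sketch only.

```python
import pandas as pd

# Made-up venues for two postal codes
venues = pd.DataFrame({
    'Postal Code': ['M2K', 'M2K', 'M2K', 'M2K', 'M2M'],
    'Venue Category': ['Bank', 'Café', 'Bank', 'Park', 'Park'],
})

# one indicator column per category (floats so the mean is a plain fraction)
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Postal Code', venues['Postal Code'])

# the mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby('Postal Code').mean()
print(freq)
```

Each row of `freq` sums to 1, so a postal code with a single venue, like M2M here, ends up with one category at 1.0.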


#### Frequencies

In [243]:
num_top_venues = 5

for code, hood in zip(ny_grouped['Postal Code'], ny_grouped['Neighborhood']): 
    print('---({}), {}---'.format(code, hood))
    temp = ny_grouped[ny_grouped['Postal Code'] == code].T.reset_index() 
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]  # skip the 'Postal Code' and 'Neighborhood' rows
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---(M2H), Hillcrest Village---
                      venue  freq
0                   Dog Run   0.2
1               Golf Course   0.2
2                      Pool   0.2
3        Athletics & Sports   0.2
4  Mediterranean Restaurant   0.2


---(M2J), Fairview, Henry Farm, Oriole---
                  venue  freq
0        Clothing Store  0.15
1           Coffee Shop  0.07
2  Fast Food Restaurant  0.06
3   Japanese Restaurant  0.04
4            Restaurant  0.04


---(M2K), Bayview Village---
                 venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4    Accessories Store  0.00


---(M2M), Willowdale, Newtonbrook---
               venue  freq
0       Home Service   1.0
1  Accessories Store   0.0
2       Liquor Store   0.0
3               Park   0.0
4      Movie Theater   0.0


---(M2N), Willowdale, Willowdale East---
              venue  freq
0  Ramen Restaurant  0.09
1  Sushi Restaurant  0.06
2       Pizza 

#### Put into a Pandas DataFrame

In [244]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
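Here is what that kind of selection does on a single made-up row. `top_categories` is a hypothetical variant written for this sketch; it casts the values to float before sorting and uses distinct frequencies so the order is unambiguous. Categories with equal frequency may come out in either order.

```python
import pandas as pd

def top_categories(row, num_top_venues, skip=2):
    """Return the labels of the largest values, ignoring the first `skip` id fields."""
    values = row.iloc[skip:].astype(float)
    return list(values.sort_values(ascending=False).index[:num_top_venues])

# Made-up frequencies for one postal code
row = pd.Series({'Postal Code': 'M2K', 'Neighborhood': 'Bayview Village',
                 'Bank': 0.2, 'Café': 0.5, 'Park': 0.3, 'Pool': 0.0})

print(top_categories(row, 3))
```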

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [339]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code', 'Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = ny_grouped['Postal Code']
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']
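The suffix list above is enough for ten columns. If `num_top_venues` ever went past 20, it would label a column "21th". A small helper could build the labels instead; `ordinal` is a hypothetical function for this sketch, not something the notebook uses.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Postal Code', 'Neighborhood']
columns += ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[2], '|', columns[-1])
```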

In [340]:
for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

print('Shape: ', neighborhoods_venues_sorted.shape)    
neighborhoods_venues_sorted.head(6)

Shape:  (23, 12)


Unnamed: 0,Postal Code,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,Hillcrest Village,Golf Course,Pool,Athletics & Sports,Mediterranean Restaurant,Dog Run,Women's Store,Dim Sum Restaurant,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
1,M2J,"Fairview, Henry Farm, Oriole",Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Japanese Restaurant,Convenience Store,Bank,Tea Room,Food Court,Shoe Store
2,M2K,Bayview Village,Japanese Restaurant,Chinese Restaurant,Café,Bank,Distribution Center,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
3,M2M,"Willowdale, Newtonbrook",Home Service,Women's Store,Distribution Center,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
4,M2N,"Willowdale, Willowdale East",Ramen Restaurant,Pizza Place,Sandwich Place,Sushi Restaurant,Café,Coffee Shop,Juice Bar,Pet Store,Bubble Tea Shop,Movie Theater
5,M2P,York Mills West,Convenience Store,Bar,Electronics Store,Park,Women's Store,Diner,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop


### Clustering

In [247]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium # map rendering library

In [248]:
# set number of clusters
kclusters = 5

ny_grouped_clustering = ny_grouped.drop(['Postal Code','Neighborhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([0, 2, 4, 3, 2, 0, 2, 0, 4, 2, 2, 0, 0, 2, 1, 2, 2, 2, 0, 2, 0, 3,
       1], dtype=int32)
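The label numbers in this array are just identifiers: what matters is which rows share one. A tiny sketch on made-up frequencies shows the idea. `X` is hypothetical data, and `n_init` is passed explicitly only because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category frequencies: rows 0-1 lean one way, rows 2-3 the other
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# label values are arbitrary ids; only which rows share a label matters
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```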

In [365]:
# add clustering labels
cluster_df = neighborhoods_venues_sorted.copy()
cluster_df.insert(0, 'Cluster Labels', kmeans.labels_)

ny_merged = ny_df.copy()

cluster_df = cluster_df.drop("Neighborhood", axis=1) 
#ny_merged = ny_merged.drop("Neighborhood", 1)

# join the cluster labels and top venues onto ny_df by postal code
ny_merged = ny_merged.join(cluster_df.set_index('Postal Code'), on='Postal Code')

ny_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.7533,-79.3297,0,Pool,Food & Drink Shop,Park,Women's Store,Dim Sum Restaurant,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store
1,M4A,North York,Victoria Village,43.7259,-79.3156,2,Coffee Shop,Hockey Arena,Portuguese Restaurant,Intersection,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7185,-79.4648,0,Clothing Store,Furniture / Home Store,Accessories Store,Boutique,Event Space,Miscellaneous Shop,Coffee Shop,Vietnamese Restaurant,Arts & Crafts Store,Bakery
3,M3B,North York,Don Mills,43.7459,-79.3522,4,Gym / Fitness Center,Caribbean Restaurant,Café,Japanese Restaurant,Women's Store,Discount Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega
4,M6B,North York,Glencairn,43.7096,-79.4451,2,Japanese Restaurant,Pizza Place,Sushi Restaurant,Pub,Women's Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop
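A join like this leaves NaN in `Cluster Labels` for any postal code that is missing from the clustered table, which is presumably the kind of NaN problem that led to dropping a row from `ny_df` earlier. Instead of dropping by position, the unlabelled rows could be filtered after the join. This is a sketch on made-up stand-ins; `areas`, `labels`, and the unmatched `M3K` row are hypothetical.

```python
import pandas as pd

# Made-up stand-ins: M3K has no entry in the clustered table
areas = pd.DataFrame({'Postal Code': ['M3A', 'M3K', 'M4A'],
                      'Latitude': [43.7533, 43.7374, 43.7259]})
labels = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Cluster Labels': [0, 2]})

merged = areas.join(labels.set_index('Postal Code'), on='Postal Code')
print(merged['Cluster Labels'].tolist())  # the unmatched row is NaN, so the column is float

# drop unlabelled rows, then restore integer labels so they can index a color list
merged = (merged.dropna(subset=['Cluster Labels'])
                .astype({'Cluster Labels': int}))
print(merged)
```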


### Let's visualize our clusters in folium!

1. Get the latitude and longitude of North York
2. Plot the neighborhoods on the map

In [267]:
address = 'North York, Toronto Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


In [368]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Github says that my notebook 

![alt text](cluster_map.png)