## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto, Canada.

A) Download and Explore Dataset\
B) Explore Neighborhoods in Toronto\
C) Analyze Each Neighborhood\
D) Cluster Neighborhoods\
E) Examine Clusters

## Before we get the data and start exploring it, let's download all the dependencies that we will need and set the environment.

In [1]:
# pandas
import pandas as pd

# Matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors

# numpy
import numpy as np

# import k-means
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

# convert an address into latitude and longitude values
!pip install geopy 
from geopy.geocoders import Nominatim 

import json # library to handle JSON files
import requests 
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

print('Libraries imported.')

Libraries imported.


# <font color=blue>Part 1.0 Start - Download and Explore Dataset</font>
Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and load it into a pandas dataframe.

In [2]:
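# Wikipedia page URL, pinned to a specific revision through the oldid parameter
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'

# Parse every HTML table on the page into a list of dataframes
dfs = pd.read_html(url)

# The postal code table is the first table on the page
df = dfs[0]

# Keep the three columns of interest; copy so later edits do not touch df
df1 = df[['Postal Code','Borough','Neighbourhood']].copy()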

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
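
The rules above can be sketched on a small hand-made table. This is a minimal illustration, not the notebook's own cell: the helper name `clean_postal_table` is hypothetical, and the rows (including the postal code `M9Z`) are made up to exercise each rule.

```python
import pandas as pd

# Toy rows in the shape of the Wikipedia table; the values are illustrative only.
raw = pd.DataFrame({
    'Postal Code':   ['M1A', 'M3A', 'M5A', 'M5A', 'M9Z'],
    'Borough':       ['Not assigned', 'North York', 'Downtown Toronto',
                      'Downtown Toronto', 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                      'Regent Park', 'Not assigned'],
})

def clean_postal_table(table):
    """Drop unassigned boroughs, fill unassigned neighbourhoods with the
    borough name, and combine rows that share a postal code."""
    out = table[table['Borough'] != 'Not assigned'].copy()
    unassigned = out['Neighbourhood'] == 'Not assigned'
    out.loc[unassigned, 'Neighbourhood'] = out.loc[unassigned, 'Borough']
    # join the neighbourhood names of each postal code with a comma
    return (out.groupby(['Postal Code', 'Borough'], as_index=False, sort=False)
               ['Neighbourhood'].agg(', '.join))

cleaned = clean_postal_table(raw)
print(cleaned)
```

On a table that already lists one row per postal code, the grouping step leaves the rows unchanged.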

In [3]:
# keep only the rows with an assigned borough (avoids an in-place drop on a slice of df)
df1 = df1[df1['Borough'] != 'Not assigned']

df1.head(15)

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


### In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [4]:
print(df1.shape) 

(103, 3)


# <font color=blue>Part 1.0 End - Submit a link to the new Notebook on your Github repository. (10 marks)</font>

# <font color=blue>2.0 Start - Latitude and the longitude coordinates of each neighborhood.</font>

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

### Load Geospatial CSV Data File

In [5]:
!wget -O Geospatial_data.csv http://cocl.us/Geospatial_data

--2021-03-16 12:58:41--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 52.116.121.148, 52.116.127.82
Connecting to cocl.us (cocl.us)|52.116.121.148|:80... connected.
HTTP request sent, awaiting response... 308 Permanent Redirect
Location: https://cocl.us/Geospatial_data [following]
--2021-03-16 12:58:41--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|52.116.121.148|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-03-16 12:58:42--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.29.197
Connecting to ibm.box.com (ibm.box.com)|107.152.29.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-03-16 12:58:42--  https://ibm.box.com/public/static/9afz

### Load the geospatial data file into a dataframe with the latitude and longitude of each postal code.

In [6]:
df2 = pd.read_csv('Geospatial_data.csv')
df2

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


### Merge web page and geospatial dataframes based on Postal Code column

In [7]:
df_merge = df1.merge(df2, on="Postal Code", how = 'inner')

df_merge

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [8]:
print('The merged dataframe has {} boroughs and {} Postal Codes.'.format(
        len(df_merge['Borough'].unique()),
        df_merge.shape[0]
    )
)

The merged dataframe has 11 boroughs and 103 Postal Codes.
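
An inner merge drops any postal code that has no coordinates without reporting it. A minimal sketch of one way to surface such rows uses the `indicator` and `validate` options of `DataFrame.merge`. The frames below are toy data: the two coordinate pairs are copied from the tables above, and `M0X` is a hypothetical postal code added only to show an unmatched row.

```python
import pandas as pd

codes = pd.DataFrame({'Postal Code': ['M1B', 'M3A', 'M0X'],
                      'Borough': ['Scarborough', 'North York', 'Hypothetical']})
coords = pd.DataFrame({'Postal Code': ['M1B', 'M3A'],
                       'Latitude': [43.806686, 43.753259],
                       'Longitude': [-79.194353, -79.329656]})

# how='left' keeps every postal code; validate raises if a code is duplicated
# on either side; indicator adds a _merge column describing where each row matched
checked = codes.merge(coords, on='Postal Code', how='left',
                      validate='one_to_one', indicator=True)

# postal codes that found no coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

In this notebook the merged dataframe keeps all 103 rows, so no postal code was dropped.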


# <font color=blue>2.0 End - Submit a link to the new Notebook on your Github repository. (2 marks)</font>

# <font color=blue>3.0 Start - Explore and cluster the neighborhoods in Toronto, Canada.</font>
You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data.

### Get latitude and longitude for Toronto, Canada.

In [9]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer_test")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### Create map of Toronto using latitude and longitude values and plot boroughs and neighbourhoods

In [10]:

map_toronto = folium.Map(location=[latitude,longitude],zoom_start=10)

for lat,lng,borough,neighbourhood in zip(df_merge['Latitude'],df_merge['Longitude'],df_merge['Borough'],df_merge['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto

### Define Foursquare Credentials and Version

In [11]:
CLIENT_ID = 'UYMQKDZHLNXEGN5B0UHU3ULMGOSSODXT2KN4AGM0L2R3VR0H' # your Foursquare ID
CLIENT_SECRET = 'AXLMGPJ5IET1Q2WCV5K5GUVFQ4CPD1QNERHY4VUJGISBGAN5' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: UYMQKDZHLNXEGN5B0UHU3ULMGOSSODXT2KN4AGM0L2R3VR0H
CLIENT_SECRET:AXLMGPJ5IET1Q2WCV5K5GUVFQ4CPD1QNERHY4VUJGISBGAN5
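
Credentials written into a notebook end up in every copy that is shared. A minimal sketch of an alternative reads them from environment variables and masks them when printing. The helper names and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions for illustration, not part of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping of environment variables.

    The variable names used here are hypothetical; adapt them to your setup.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Demo with a placeholder mapping instead of the real environment.
demo_id, demo_secret = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
     'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456'})
print('CLIENT_ID: ' + mask(demo_id))
print('CLIENT_SECRET: ' + mask(demo_secret))
```

In the notebook, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would then replace the hard-coded strings.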


In [12]:
neighbourhood_latitude = df_merge.loc[0, 'Latitude'] # neighbourhood latitude value
neighbourhood_longitude = df_merge.loc[0, 'Longitude'] # neighbourhood longitude value

neighbourhood_name = df_merge.loc[0, 'Neighbourhood'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


In [13]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=UYMQKDZHLNXEGN5B0UHU3ULMGOSSODXT2KN4AGM0L2R3VR0H&client_secret=AXLMGPJ5IET1Q2WCV5K5GUVFQ4CPD1QNERHY4VUJGISBGAN5&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'
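
The same URL can be assembled from a dictionary of parameters, which avoids the stray `&` after `?` and handles escaping. This is a sketch using the standard library's `urlencode`; the function name `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble a venues/explore URL from keyword parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lng" instead of percent-encoding it
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605',
                                43.7532586, -79.3296565)
print(example_url)
```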

In [14]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '6050ab8462d9b434d709862b'},
 'response': {'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA'

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [16]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Brookbanks Pool,Pool,43.751389,-79.332184
2,Variety Store,Food & Drink Shop,43.751974,-79.333114
3,TTC stop - 44 Valley Woods,Bus Stop,43.755402,-79.333741
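
The flattening step can also be written without pandas, which makes the response structure explicit. The sketch below assumes the response shape printed above (`response` → `groups` → `items` → `venue`). The function name `extract_venues` is hypothetical. The first sample venue reuses the Brookbanks Park values from the output, and the second is invented to exercise the case of a venue with no category.

```python
def extract_venues(response_json):
    """Flatten a venues/explore response into a list of plain dicts.

    Walks every group in the response; a venue without categories gets
    category None instead of raising IndexError.
    """
    rows = []
    for group in response_json.get('response', {}).get('groups', []):
        for item in group.get('items', []):
            venue = item['venue']
            categories = venue.get('categories') or []
            rows.append({
                'name': venue['name'],
                'category': categories[0]['name'] if categories else None,
                'lat': venue['location']['lat'],
                'lng': venue['location']['lng'],
            })
    return rows

# Trimmed sample in the shape of the response above.
sample = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Brookbanks Park',
                           'location': {'lat': 43.751976046055574,
                                        'lng': -79.33214044722958},
                           'categories': [{'name': 'Park'}]}},
                {'venue': {'name': 'Hypothetical Venue',
                           'location': {'lat': 43.75, 'lng': -79.33},
                           'categories': []}},
            ],
        }],
    },
}

rows = extract_venues(sample)
for row in rows:
    print(row)
```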


## Explore Neighborhoods in Toronto

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [18]:
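# Keep the rows whose Neighbourhood name contains "Toronto" (this matches on Neighbourhood, not Borough);
# .copy() gives an independent dataframe so the cluster labels can be inserted later
df_merge_small = df_merge[df_merge['Neighbourhood'].str.contains("Toronto")].copy()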


In [19]:
toronto_venues = getNearbyVenues(names=df_merge_small['Neighbourhood'],
                                   latitudes=df_merge_small['Latitude'],
                                   longitudes=df_merge_small['Longitude']
                                  )

East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
North Toronto West, Lawrence Park
University of Toronto, Harbord
New Toronto, Mimico South, Humber Bay Shores
Business reply mail Processing Centre, South Central Letter Processing Plant Toronto


In [20]:
print(toronto_venues.shape)
toronto_venues

(284, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106,Aldwych Park,43.684901,-79.341091,Park
1,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106,The Path,43.683923,-79.335007,Park
2,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106,Donlands & Mortimer,43.687680,-79.340100,Intersection
3,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106,Sammon Convenience,43.686951,-79.335007,Convenience Store
4,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106,Donlands & Sammon,43.685787,-79.339516,Intersection
...,...,...,...,...,...,...,...
279,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,TTC Russell Division,43.664908,-79.322560,Light Rail Station
280,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,Jonathan Ashbridge Park,43.664702,-79.319898,Park
281,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,Toronto Yoga Mamas,43.664824,-79.324335,Yoga Studio
282,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,ONE Academy,43.662253,-79.326911,Gym / Fitness Center


In [21]:
toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"East Toronto, Broadview North (Old East York)",5,5,5,5,5,5
"Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100
"New Toronto, Mimico South, Humber Bay Shores",12,12,12,12,12,12
"North Toronto West, Lawrence Park",19,19,19,19,19,19
"Toronto Dominion Centre, Design Exchange",100,100,100,100,100,100
"University of Toronto, Harbord",32,32,32,32,32,32


In [22]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 107 unique categories.


## Analyze Each Neighborhood

In [23]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# the Neighbourhood column stays last here; groupby(...).reset_index() below moves it to the front

toronto_onehot.head()

Unnamed: 0,American Restaurant,Aquarium,Art Gallery,Asian Restaurant,Auto Workshop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,...,Taco Place,Tailor Shop,Tea Room,Theater,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Wine Bar,Yoga Studio,Neighbourhood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"East Toronto, Broadview North (Old East York)"
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"East Toronto, Broadview North (Old East York)"
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"East Toronto, Broadview North (Old East York)"
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"East Toronto, Broadview North (Old East York)"
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"East Toronto, Broadview North (Old East York)"


In [24]:
toronto_onehot.shape

(284, 108)

In [25]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Aquarium,Art Gallery,Asian Restaurant,Auto Workshop,Bakery,Bank,Bar,Baseball Stadium,...,Sushi Restaurant,Taco Place,Tailor Shop,Tea Room,Theater,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Wine Bar,Yoga Studio
0,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
1,"East Toronto, Broadview North (Old East York)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Harbourfront East, Union Station, Toronto Islands",0.0,0.05,0.01,0.0,0.0,0.02,0.01,0.02,0.02,...,0.01,0.0,0.0,0.01,0.01,0.01,0.01,0.0,0.01,0.0
3,"New Toronto, Mimico South, Humber Bay Shores",0.083333,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"North Toronto West, Lawrence Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632
5,"Toronto Dominion Centre, Design Exchange",0.02,0.0,0.01,0.01,0.0,0.03,0.01,0.01,0.0,...,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.0,0.01,0.0
6,"University of Toronto, Harbord",0.0,0.0,0.0,0.0,0.0,0.0625,0.03125,0.0625,0.0,...,0.03125,0.0,0.0,0.0,0.03125,0.0,0.0,0.03125,0.0,0.03125


In [26]:
toronto_grouped.shape

(7, 108)
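
Each value in the grouped table is the share of a neighbourhood's venues that fall in a category. For example, 0.0625 is 1 venue out of 16, and 0.4 is 2 venues out of 5. A toy sketch of the same one-hot and group-mean steps, with made-up neighbourhood names `A` and `B`:

```python
import pandas as pd

# Toy venue list; neighbourhood names are illustrative.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Coffee Shop', 'Pub',
                       'Coffee Shop', 'Coffee Shop'],
})

# one column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# the mean of a 0/1 column is the fraction of venues in that category
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```

Every row of `freq` sums to 1, because each venue contributes to exactly one category.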

In [27]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0           Yoga Studio  0.06
1               Brewery  0.06
2            Restaurant  0.06
3        Farmers Market  0.06
4  Fast Food Restaurant  0.06


----East Toronto, Broadview North (Old East York)----
               venue  freq
0       Intersection   0.4
1               Park   0.4
2  Convenience Store   0.2
3        Music Venue   0.0
4                Pub   0.0


----Harbourfront East, Union Station, Toronto Islands----
                 venue  freq
0          Coffee Shop  0.12
1             Aquarium  0.05
2                Hotel  0.04
3                 Café  0.04
4  Sporting Goods Shop  0.03


----New Toronto, Mimico South, Humber Bay Shores----
                 venue  freq
0  American Restaurant  0.08
1                 Café  0.08
2         Liquor Store  0.08
3                  Gym  0.08
4   Mexican Restaurant  0.08


----North Toronto West, Lawrence Park----
  

In [28]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Business reply mail Processing Centre, South C...",Yoga Studio,Brewery,Garden Center,Garden,Light Rail Station,Fast Food Restaurant,Farmers Market,Park,Pizza Place,Comic Shop
1,"East Toronto, Broadview North (Old East York)",Intersection,Park,Convenience Store,French Restaurant,Cosmetics Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner
2,"Harbourfront East, Union Station, Toronto Islands",Coffee Shop,Aquarium,Hotel,Café,Scenic Lookout,Brewery,Fried Chicken Joint,Restaurant,Italian Restaurant,Sporting Goods Shop
3,"New Toronto, Mimico South, Humber Bay Shores",American Restaurant,Pizza Place,Pet Store,Cosmetics Shop,Restaurant,Café,Mexican Restaurant,Fast Food Restaurant,Liquor Store,Pharmacy
4,"North Toronto West, Lawrence Park",Coffee Shop,Clothing Store,Sporting Goods Shop,Yoga Studio,Fast Food Restaurant,Gym / Fitness Center,Restaurant,Chinese Restaurant,Salon / Barbershop,Café
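
The column labels above rely on a three-item suffix list with a fallback to `th`. That is fine for ten columns but would yield "21th" if `num_top_venues` were raised past 20. A small sketch of a general helper; the name `ordinal` is hypothetical:

```python
def ordinal(n):
    """Return n followed by its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```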


## Cluster Neighbourhoods in Toronto

### Set kmeans clustering model

In [30]:
k = 4
# cluster on the coordinates only (Latitude, Longitude)
toronto_clustering = df_merge_small.drop(columns=['Postal Code','Borough','Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# add the cluster label of each row as the first column
df_merge_small.insert(0, 'C Labels', kmeans.labels_)

In [31]:
df_merge_small

Unnamed: 0,C Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
35,2,M4J,East York,"East Toronto, Broadview North (Old East York)",43.685347,-79.338106
36,0,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752
42,0,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576
73,3,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
80,0,M5S,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049
88,1,M8V,Etobicoke,"New Toronto, Mimico South, Humber Bay Shores",43.605647,-79.501321
100,2,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
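
The numbers in `C Labels` are arbitrary identifiers. What matters is which rows share a label. A minimal sketch, using the seven coordinate pairs from the table above, that groups postal codes by label so runs can be compared regardless of how the clusters are numbered. Setting `n_init=10` explicitly is an assumption made here so the behaviour does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

postal_codes = ['M4J', 'M5J', 'M5K', 'M4R', 'M5S', 'M8V', 'M7Y']
coords = np.array([
    [43.685347, -79.338106],
    [43.640816, -79.381752],
    [43.647177, -79.381576],
    [43.715383, -79.405678],
    [43.662696, -79.400049],
    [43.605647, -79.501321],
    [43.662744, -79.321558],
])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit(coords).labels_

# Collect the postal codes of each cluster; the label numbers carry no meaning.
groups = {}
for code, label in zip(postal_codes, labels):
    groups.setdefault(int(label), []).append(code)

# Sort members and clusters so the result does not depend on label numbering.
partition = sorted(sorted(members) for members in groups.values())
print(partition)
```

The printed partition can be checked against the `C Labels` column above.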


In [32]:
map_clusters = folium.Map(location=[latitude,longitude],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(df_merge_small['Latitude'], df_merge_small['Longitude'], df_merge_small['Neighbourhood'], df_merge_small['C Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
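
In the cell above, `rainbow[cluster-1]` sends label 0 to the last color in the list. The colors stay distinct, but the lookup is indirect. A sketch of a palette indexed directly by cluster label follows. It swaps in the standard library's `colorsys` for matplotlib so it runs without extra dependencies, and the function name `cluster_colors` is hypothetical.

```python
import colorsys

def cluster_colors(k):
    """Return k hex colors spaced evenly around the hue wheel."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_colors(4)
for cluster in range(4):
    # palette[cluster] is the color for that cluster label
    print(cluster, palette[cluster])
```

With such a list, `color=palette[cluster]` could be passed to `folium.CircleMarker`.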

## Examine Clusters


In [33]:
df_merge_small.loc[df_merge_small['C Labels'] == 0, df_merge_small.columns[[1] + list(range(5, df_merge_small.shape[1]))]]

Unnamed: 0,Postal Code,Longitude
36,M5J,-79.381752
42,M5K,-79.381576
80,M5S,-79.400049


In [34]:
df_merge_small.loc[df_merge_small['C Labels'] == 1, df_merge_small.columns[[1] + list(range(5, df_merge_small.shape[1]))]]

Unnamed: 0,Postal Code,Longitude
88,M8V,-79.501321


In [35]:
df_merge_small.loc[df_merge_small['C Labels'] == 2, df_merge_small.columns[[1] + list(range(5, df_merge_small.shape[1]))]]

Unnamed: 0,Postal Code,Longitude
35,M4J,-79.338106
100,M7Y,-79.321558


In [36]:
df_merge_small.loc[df_merge_small['C Labels'] == 3, df_merge_small.columns[[1] + list(range(5, df_merge_small.shape[1]))]]

Unnamed: 0,Postal Code,Longitude
73,M4R,-79.405678
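
The positional selection used in the cells above, `columns[[1] + list(range(5, ...))]`, returns only Postal Code and Longitude for this six-column dataframe. Selecting columns by name makes the intent explicit. This is a sketch on three rows copied from the `df_merge_small` output; the helper name `cluster_members` is hypothetical.

```python
import pandas as pd

# Three rows copied from the df_merge_small output above.
clustered = pd.DataFrame({
    'C Labels': [2, 0, 0],
    'Postal Code': ['M4J', 'M5J', 'M5K'],
    'Borough': ['East York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['East Toronto, Broadview North (Old East York)',
                      'Harbourfront East, Union Station, Toronto Islands',
                      'Toronto Dominion Centre, Design Exchange'],
    'Latitude': [43.685347, 43.640816, 43.647177],
    'Longitude': [-79.338106, -79.381752, -79.381576],
})

def cluster_members(frame, label,
                    columns=('Postal Code', 'Neighbourhood', 'Latitude', 'Longitude')):
    """Return the named columns for the rows assigned to one cluster label."""
    return frame.loc[frame['C Labels'] == label, list(columns)]

members = cluster_members(clustered, 0)
print(members)
```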


# <font color=blue>3.0 End - Explore and cluster the neighborhoods in Toronto, Canada.</font>
Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)