# Capstone Project Notebook
## This notebook contains the assignment "Segmenting and Clustering Neighborhoods in Toronto".

### First, we create a dataframe with all the neighborhoods in Toronto and print its shape.

In [1]:
import numpy as np
import pandas as pd

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
all_tables=pd.read_html(url)
print(all_tables) # Here we can see that the table we want is the first one

[    Postal Code           Borough  \
0           M1A      Not assigned   
1           M2A      Not assigned   
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
..          ...               ...   
175         M5Z      Not assigned   
176         M6Z      Not assigned   
177         M7Z      Not assigned   
178         M8Z         Etobicoke   
179         M9Z      Not assigned   

                                         Neighbourhood  
0                                         Not assigned  
1                                         Not assigned  
2                                            Parkwoods  
3                                     Victoria Village  
4                            Regent Park, Harbourfront  
..                                                 ...  
175                                       Not assigned  
176                                       Not assigned  
177                                      

In [4]:
tor_neigh=all_tables[0]
tor_neigh.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now we look for the `Not assigned` cells in the Borough column so that we can drop those rows.

In [5]:
indexes=tor_neigh[tor_neigh['Borough']=='Not assigned'].index # Here we pick up the indexes of the rows
tor_neigh.drop(indexes,inplace=True) # Here we drop them
tor_neigh.reset_index(drop=True,inplace=True) # Here we reset the indexes
tor_neigh.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


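Let's check the Neighbourhood column in the same way. Here, instead of dropping rows, any `Not assigned` neighbourhood takes the name of its borough.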

In [6]:
index=tor_neigh[tor_neigh['Neighbourhood']=='Not assigned'].index
# Assign through .loc; chained indexing (df[col][i]=...) may write to a temporary copy
tor_neigh.loc[index,'Neighbourhood']=tor_neigh.loc[index,'Borough']
tor_neigh.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
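
The two cleaning steps above (dropping rows with an unassigned borough, then copying the borough into unassigned neighbourhoods) can also be written with boolean masks. The following is a minimal sketch on a made-up three-row frame; `toy` and its values are illustrative and do not come from the Wikipedia table.

```python
import pandas as pd

# Hypothetical toy data with the same column names as tor_neigh
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with an assigned borough, then renumber the index
toy = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# Where the neighbourhood is unassigned, fall back to the borough name
unassigned = toy['Neighbourhood'] == 'Not assigned'
toy.loc[unassigned, 'Neighbourhood'] = toy.loc[unassigned, 'Borough']

print(toy)
```

In the real table the output of cells 5 and 6 is identical, which suggests that no assigned borough was left with an unassigned neighbourhood, so the second step is a safeguard.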


In [7]:
tor_neigh.shape

(103, 3)

### Now I'm going to include the latitude and longitude of all the postal codes.

In [8]:
try:
    import geocoder
except ImportError: # catch only the missing-package case
    print('There is no package named geocoder.')

There is no package named geocoder.


As we can see, the `geocoder` package is not installed, so I'm going to use the CSV file provided by the course.

In [9]:
coord_df=pd.read_csv('Geospatial_Coordinates.csv')

In [10]:
lat=[]
long=[]
index_list=pd.Index(list(coord_df['Postal Code'])) # Index of postal codes, used to look up row positions
for postal_code in tor_neigh['Postal Code']:
    index2=index_list.get_loc(postal_code) # Position of this postal code in coord_df
    lat.append(coord_df['Latitude'].iloc[index2]) # .iloc because index2 is a position, not a label
    long.append(coord_df['Longitude'].iloc[index2])

tor_neigh['Latitude']=lat
tor_neigh['Longitude']=long
tor_neigh.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


The `get_loc` method of `pd.Index` returns the position of each postal code of `tor_neigh` within `coord_df`. The coordinates found at that position are appended to the lists `lat` and `long`, which therefore follow the row order of `tor_neigh`.
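
The same lookup can be expressed as a join on the postal code. The sketch below is an alternative to the loop, not the method used in this notebook; `neigh` and `coords` are small stand-ins for `tor_neigh` and `coord_df`, with coordinate values copied from the table above.

```python
import pandas as pd

# Small stand-ins for tor_neigh and coord_df (note the different row order)
neigh = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Borough': ['North York', 'North York']})
coords = pd.DataFrame({'Postal Code': ['M4A', 'M1B', 'M3A'],
                       'Latitude': [43.725882, 43.806686, 43.753259],
                       'Longitude': [-79.315572, -79.194353, -79.329656]})

# A left merge keeps every row of neigh, in its original order, and
# attaches the coordinates whose 'Postal Code' matches
merged = neigh.merge(coords, on='Postal Code', how='left')
print(merged)
```

One difference in behaviour: `get_loc` raises a `KeyError` for a postal code that is missing from the coordinates file, whereas a left merge leaves `NaN` in the coordinate columns.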

### Now I'll do the clustering.

First I import the packages and modules that will be used (if necessary, installing them).


In [11]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium
import requests

import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

I'll focus on Central Toronto borough, so first I create a data frame with all its information.

In [12]:
# Filter and reset the index in one step, so we never modify a slice of tor_neigh in place
cent_tor_df=tor_neigh[tor_neigh['Borough']=='Central Toronto'].reset_index(drop=True)
cent_tor_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M5N,Central Toronto,Roselawn,43.711695,-79.416936
2,M4P,Central Toronto,Davisville North,43.712751,-79.390197
3,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307
4,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
5,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
6,M4S,Central Toronto,Davisville,43.704324,-79.38879
7,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
8,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


Now I'm going to define the function that gets the venue information through the Foursquare API.

In [13]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own secret and keep it out of shared notebooks
VERSION = '20180605'
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [14]:
cent_tor_venues=getNearbyVenues(cent_tor_df['Neighbourhood'],
                                cent_tor_df['Latitude'],
                                cent_tor_df['Longitude'])
cent_tor_venues.head(20)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.72802,-79.38879,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.72802,-79.38879,HYC Design Inc.,43.726793,-79.391681,Business Service
2,Lawrence Park,43.72802,-79.38879,Zodiac Swim School,43.728532,-79.38286,Swim School
3,Lawrence Park,43.72802,-79.38879,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
4,Roselawn,43.711695,-79.416936,Ceiling Champions,43.713891,-79.420702,Home Service
5,Roselawn,43.711695,-79.416936,Rosalind's Garden Oasis,43.712189,-79.411978,Garden
6,Roselawn,43.711695,-79.416936,Havergal College,43.712108,-79.41168,Music Venue
7,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop
8,Davisville North,43.712751,-79.390197,Sherwood Park,43.716551,-79.387776,Park
9,Davisville North,43.712751,-79.390197,Homeway Restaurant & Brunch,43.712641,-79.391557,Breakfast Spot
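
The one-line extraction `requests.get(url).json()["response"]['groups'][0]['items']` raises a `KeyError` or `IndexError` whenever the response lacks one of those fields, and `v['venue']['categories'][0]` fails for a venue without categories. Below is a minimal sketch of a more defensive extraction. It assumes only the payload shape that the function above already reads; `extract_venues` is a hypothetical helper, the sample payload is hand-written, and the shape of the error payload (`{'meta': {'code': 400}}`) is an assumption.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []  # e.g. an error response or no results
    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories') or []
        # Guard against venues that carry no category at all
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written sample mimicking only the fields the notebook reads;
# the second venue is made up to exercise the no-category branch
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Sherwood Park',
               'location': {'lat': 43.716551, 'lng': -79.387776},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorised place',
               'location': {'lat': 43.7, 'lng': -79.4},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({'meta': {'code': 400}}))  # assumed error shape, yields []
```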


Once I have the information, the analysis begins.

First, I convert the venue categories from strings to numerical indicators with the `pd.get_dummies` function. Then I add the Neighbourhood column, so that every row can be identified, and move it to the front.

Finally, I group the rows by the Neighbourhood column and take the mean, which gives the frequency of each venue category in each neighbourhood.

In [15]:
cent_tor_hot_encod=pd.get_dummies(cent_tor_venues[['Venue Category']], prefix="", prefix_sep="")

cent_tor_hot_encod['Neighbourhood']=cent_tor_venues['Neighbourhood']
fixed_columns = [cent_tor_hot_encod.columns[-1]] + list(cent_tor_hot_encod.columns[:-1])
cent_tor_hot_encod = cent_tor_hot_encod[fixed_columns]

cent_tor_grouped=cent_tor_hot_encod.groupby('Neighbourhood').mean().reset_index()
cent_tor_grouped

Unnamed: 0,Neighbourhood,American Restaurant,BBQ Joint,Bagel Shop,Bank,Breakfast Spot,Brewery,Burger Joint,Bus Line,Business Service,...,Sporting Goods Shop,Supermarket,Sushi Restaurant,Swim School,Tennis Court,Thai Restaurant,Toy / Game Store,Trail,Vietnamese Restaurant,Yoga Studio
0,Davisville,0.0,0.0,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,...,0.0,0.0,0.055556,0.0,0.027778,0.055556,0.027778,0.0,0.0,0.0
1,Davisville North,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Forest Hill North & West, Forest Hill Road Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0
3,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.25,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
4,"Moore Park, Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0
5,"North Toronto West, Lawrence Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
6,Roselawn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,0.071429,0.0
8,"The Annex, North Midtown, Yorkville",0.0,0.052632,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
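
To see why the mean of the indicator columns is a frequency, here is a minimal sketch on made-up data (neighbourhoods `A` and `B` and their venues are hypothetical).

```python
import pandas as pd

# Hypothetical venues: neighbourhood A has 2 parks and 1 bank, B has 1 bank
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Bank', 'Bank'],
})

# One indicator column per category, cast to float so the mean is numeric
onehot = pd.get_dummies(venues['Venue Category']).astype(float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby('Neighbourhood').mean()
print(freq)
```

Each row of `freq` sums to 1: in `A`, two of the three venues are parks, so `Park` is about 0.667 and `Bank` about 0.333.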


The following function returns the most common venue categories of a row from the previous analysis. We use it to fill the next data frame with the top 5 categories of each neighbourhood.

In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

The first `for` loop creates the names of the 5 venue columns of the final data frame `neigh_venues_sorted`, and the second loop fills in the rows.

In [17]:
indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']
for ind in np.arange(5):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no 'st'/'nd'/'rd' suffix left, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

neigh_venues_sorted = pd.DataFrame(columns=columns)
neigh_venues_sorted['Neighbourhood'] = cent_tor_grouped['Neighbourhood']

for ind in np.arange(cent_tor_grouped.shape[0]):
    neigh_venues_sorted.iloc[ind, 1:] = return_most_common_venues(cent_tor_grouped.iloc[ind, :], 5)

neigh_venues_sorted

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Davisville,Pizza Place,Dessert Shop,Sandwich Place,Café,Sushi Restaurant
1,Davisville North,Pizza Place,Sandwich Place,Hotel,Food & Drink Shop,Park
2,"Forest Hill North & West, Forest Hill Road Park",Trail,Mexican Restaurant,Sushi Restaurant,Jewelry Store,Yoga Studio
3,Lawrence Park,Park,Bus Line,Swim School,Business Service,Discount Store
4,"Moore Park, Summerhill East",Playground,Lawyer,Tennis Court,Diner,Discount Store
5,"North Toronto West, Lawrence Park",Coffee Shop,Clothing Store,Café,Gym / Fitness Center,Gift Shop
6,Roselawn,Home Service,Garden,Music Venue,History Museum,Gym
7,"Summerhill West, Rathnelly, South Hill, Forest...",Coffee Shop,American Restaurant,Restaurant,Vietnamese Restaurant,Fried Chicken Joint
8,"The Annex, North Midtown, Yorkville",Café,Sandwich Place,Coffee Shop,Pharmacy,Park
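
The ranking step can be illustrated on a single row. This is a sketch using `Series.nlargest` in place of `sort_values` followed by slicing; the row and its frequencies are made up.

```python
import pandas as pd

# Hypothetical frequency row, shaped like one row of cent_tor_grouped
row = pd.Series({'Neighbourhood': 'A', 'Bank': 0.1, 'Park': 0.6, 'Trail': 0.3})

# Drop the label, coerce to float, and take the labels of the largest values
top2 = row.iloc[1:].astype(float).nlargest(2).index.tolist()
print(top2)  # ['Park', 'Trail']
```

Note that a neighbourhood with fewer than five distinct categories is padded with categories of frequency 0. Lawrence Park, for example, has only four venues (each with frequency 0.25), so its 5th entry, Discount Store, does not correspond to any venue found there.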


Let's run the clustering method. The output shows the cluster label of each neighbourhood, in the row order of `cent_tor_grouped`.

In [18]:
kclusters = 5

cent_tor_grouped_clustering = cent_tor_grouped.drop(columns='Neighbourhood') # positional axis argument is not accepted by recent pandas

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cent_tor_grouped_clustering)

kmeans.labels_

array([1, 1, 4, 0, 2, 1, 3, 1, 1])
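
As a minimal sketch of what `KMeans` does with frequency vectors, consider four made-up rows that fall into two obvious groups. The data is hypothetical, and `n_init` is passed explicitly only because its default has changed in recent scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency vectors: two rows dominated by the first category,
# two rows dominated by the second
X = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary identifiers. What matters is which rows share a label: here the first two rows end up together, and so do the last two.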

Now we insert a column with the cluster labels into the data frame `neigh_venues_sorted` and join its columns onto a new data frame `cent_tor_merged`.

In [19]:
neigh_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

cent_tor_merged = cent_tor_df

# join neigh_venues_sorted onto cent_tor_df so each neighbourhood keeps its latitude/longitude
cent_tor_merged = cent_tor_merged.join(neigh_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

cent_tor_merged

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Park,Bus Line,Swim School,Business Service,Discount Store
1,M5N,Central Toronto,Roselawn,43.711695,-79.416936,3,Home Service,Garden,Music Venue,History Museum,Gym
2,M4P,Central Toronto,Davisville North,43.712751,-79.390197,1,Pizza Place,Sandwich Place,Hotel,Food & Drink Shop,Park
3,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307,4,Trail,Mexican Restaurant,Sushi Restaurant,Jewelry Store,Yoga Studio
4,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678,1,Coffee Shop,Clothing Store,Café,Gym / Fitness Center,Gift Shop
5,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,1,Café,Sandwich Place,Coffee Shop,Pharmacy,Park
6,M4S,Central Toronto,Davisville,43.704324,-79.38879,1,Pizza Place,Dessert Shop,Sandwich Place,Café,Sushi Restaurant
7,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,2,Playground,Lawyer,Tennis Court,Diner,Discount Store
8,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049,1,Coffee Shop,American Restaurant,Restaurant,Vietnamese Restaurant,Fried Chicken Joint


Finally, it's time to represent the neighbourhoods on a map. Also, I pick different colors to distinguish the cluster labels.

As the center of the map I choose the mean of the coordinates of the neighbourhoods.

In [20]:
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

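# Average only the two coordinate columns; the mean of the text columns is undefined
latitude,longitude=cent_tor_df[['Latitude','Longitude']].mean()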
cent_tor_map=folium.Map(location=[latitude,longitude],zoom_start=13)

In [21]:
markers_colors = []

for lat, lon, poi, cluster in zip(cent_tor_merged['Latitude'], cent_tor_merged['Longitude'], cent_tor_merged['Neighbourhood'], cent_tor_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(cent_tor_map)
       
cent_tor_map

Let's see the meaning of the clustering.

To do that, I'm going to separate the neighbourhoods into different data frames, based on the cluster labels.

In [22]:
cent_tor_merged.loc[cent_tor_merged['Cluster Labels'] == 0,
                    cent_tor_merged.columns[[2] + list(range(5, cent_tor_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Lawrence Park,0,Park,Bus Line,Swim School,Business Service,Discount Store


In [23]:
cent_tor_merged.loc[cent_tor_merged['Cluster Labels'] == 1,
                    cent_tor_merged.columns[[2] + list(range(5, cent_tor_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Davisville North,1,Pizza Place,Sandwich Place,Hotel,Food & Drink Shop,Park
4,"North Toronto West, Lawrence Park",1,Coffee Shop,Clothing Store,Café,Gym / Fitness Center,Gift Shop
5,"The Annex, North Midtown, Yorkville",1,Café,Sandwich Place,Coffee Shop,Pharmacy,Park
6,Davisville,1,Pizza Place,Dessert Shop,Sandwich Place,Café,Sushi Restaurant
8,"Summerhill West, Rathnelly, South Hill, Forest...",1,Coffee Shop,American Restaurant,Restaurant,Vietnamese Restaurant,Fried Chicken Joint


In [24]:
cent_tor_merged.loc[cent_tor_merged['Cluster Labels'] == 2,
                    cent_tor_merged.columns[[2] + list(range(5, cent_tor_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
7,"Moore Park, Summerhill East",2,Playground,Lawyer,Tennis Court,Diner,Discount Store


In [25]:
cent_tor_merged.loc[cent_tor_merged['Cluster Labels'] == 3,
                    cent_tor_merged.columns[[2] + list(range(5, cent_tor_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,Roselawn,3,Home Service,Garden,Music Venue,History Museum,Gym


In [26]:
cent_tor_merged.loc[cent_tor_merged['Cluster Labels'] == 4,
                    cent_tor_merged.columns[[2] + list(range(5, cent_tor_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
3,"Forest Hill North & West, Forest Hill Road Park",4,Trail,Mexican Restaurant,Sushi Restaurant,Jewelry Store,Yoga Studio


The clusters can be read as follows:

* Lawrence Park (label 0) is a residential area with several sports places, for both adults and children.
* The neighbourhoods with label 1 are areas with restaurants, cafés and other catering services. This suggests areas oriented towards commerce.
* Moore Park and Summerhill East (label 2) combine workplaces with other types of services.
* Roselawn (label 3) has services for families and places for recreation.
* Finally, Forest Hill North & West and Forest Hill Road Park (label 4) form an area with stores and commerce.

I conclude that Central Toronto is a residential area with commerce aimed at families.
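
When reading these clusters, it helps to count how many neighbourhoods each one contains. The short sketch below does this for the labels printed by `kmeans.labels_` earlier in the notebook; only the standard library is used.

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output of this notebook
labels = [1, 1, 4, 0, 2, 1, 3, 1, 1]

sizes = Counter(labels)
for label in sorted(sizes):
    print(f'cluster {label}: {sizes[label]} neighbourhood(s)')

# Clusters that contain a single neighbourhood
singletons = sorted(label for label, n in sizes.items() if n == 1)
print(singletons)  # [0, 2, 3, 4]
```

Cluster 1 holds five of the nine neighbourhoods, while each of the other four clusters contains exactly one, so the descriptions of labels 0, 2, 3 and 4 each rest on a single neighbourhood.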