# Segmenting and Clustering Neighborhoods in *Toronto*, **Canada**
###### For this assignment, we will be required to explore and cluster the neighborhoods in *Toronto*

###### In the below dataframe we will perform the below
    > The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    > Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned,
    > More than one neighborhood can exist in one postal code area. These two rows will be combined into one row with the neighborhoods separated with a comma,
    > If a cell has a borough but a Not assigned  neighborhood, then the neighborhood will be the same as the borough.

**1** Importing basic libraries:

In [4]:
import pandas as pd
import numpy as np

In [2]:
#Define table source address as 'url':
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#Read table into a data frame:
TPC_df = pd.read_html(url)[0]
TPC_df.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


In [3]:
#Drop all rows where 'Borough' shows a 'Not assigned' value:
TPC_df = TPC_df[TPC_df['Borough'] != 'Not assigned'].reset_index(drop=True)
#Replace 'Not assigned' values in 'Neighbourhood' column with existing values in 'Borough' column:
TPC_df.Neighbourhood.replace('Not assigned', TPC_df.Borough, inplace=True)

In [4]:

#Create a new data frame that is grouped by postal code and add a custom lambda function to join unique values in other columns:
#Please note that there where no duplicate Postal Codes in the original table, and thus, the resulting dataframe remained the same as the original:
TPC_grouped = TPC_df.groupby('Postal Code')['Borough'].apply(lambda x: ', '.join(np.unique(x))).reset_index()
TPC_grouped['Neighborhood'] = TPC_df.groupby('Postal Code',as_index=False)['Neighbourhood'].apply(lambda x: ', '.join(np.unique(x)))
TPC_df = TPC_grouped

In [5]:
#Show first 12 rows of the grouped data frame:
TPC_df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [6]:

#Show the dimensions of the resulting data frame:
TPC_df.shape

(103, 3)

**2**   Storing the .csv file containing the latitude and longitude coordinates and read it into a pandas dataframe:

In [7]:

latlng_df = pd.read_csv('http://cocl.us/Geospatial_data')
latlng_df.head(12)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


**2.1** Perform an outer join to add the Latitude and Longitude coordinates into the "TPC_df" data frame:

In [8]:
TPC_df = pd.merge(TPC_df, latlng_df,on='Postal Code',how='left')
TPC_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [9]:
TPC_df.shape

(103, 5)

**3.1** Importing other necessary libraries and packages:

In [10]:
!pip install geopy
!pip install folium
from geopy.geocoders import Nominatim
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
[K     |████████████████████████████████| 93 kB 3.9 MB/s  eta 0:00:01
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.11.0


**4** Specify the Foursquare credentials (the following hidden cell contains: "CLIENT_ID", "CLIENT_SECRET" and "VERSION"):

###### Sensitive Foursquare Credentials hidden

In [11]:
# @hidden_cell

CLIENT_ID = 'Z2J3DM0MFRRZ2GPBC3RLFUGHHCTVUYQ5OKYRPAFJRV20P44A'
CLIENT_SECRET = 'CKVA15FDA3S5WOKFYAMWPEO0PKTTQZLYFA2OP1BLFMX1NZ3I' 
VERSION = '20210102'
LIMIT = 100


**4.1** Check how many postal codes, boroughs and neighborhoods there are in Toronto:

In [12]:
print("There are: ", TPC_df['Postal Code'].unique().shape[0], 'postal codes, ',
      TPC_df['Borough'].unique().shape[0], 'boroughs and ', 
      len(np.unique(np.concatenate((TPC_df['Neighborhood'].str.split(', ')),axis=0))), 
      'neighborhoods in Toronto')

There are:  103 postal codes,  10 boroughs and  208 neighborhoods in Toronto


In [13]:
TPC_grouped1 = TPC_df.groupby('Borough')['Postal Code'].apply(lambda x: len(np.unique(x))).reset_index()
TPC_grouped2 = TPC_df.groupby('Borough')['Neighborhood'].apply(lambda x: len(np.unique(', '.join(np.unique(x)).split(', ')))).reset_index()
TPC_grouped = pd.DataFrame(columns=['Postal Code Count', 'Borough', 'Neighborhood Count'])
TPC_grouped[['Postal Code Count', 'Borough']] = TPC_grouped1[['Postal Code','Borough']]
TPC_grouped[['Neighborhood Count']] = TPC_grouped2[['Neighborhood']]
TPC_grouped
#Note: The Runnymede neighborhood is splitted between West Toronto and York. As a result, Runnymede is included twice in this table. Thus, 
#the total number of neighborhoods in this table is 209 instead of 208 as previously showed.

Unnamed: 0,Postal Code Count,Borough,Neighborhood Count
0,9,Central Toronto,17
1,19,Downtown Toronto,38
2,5,East Toronto,8
3,5,East York,7
4,12,Etobicoke,47
5,1,Mississauga,1
6,24,North York,32
7,17,Scarborough,38
8,6,West Toronto,13
9,5,York,8


*Etobicoke* has 47 neighborhoods, the most number of neighborhoods in any borough in Toronto. However, Etobicoke only has 12 postal codes. *North York* has 24 postal codes, but only 32 neighborhoods. On the other hand, *Downtown Toronto* has a good combination of a high number of postal codes and a high number of neighborhoods. For this reason, I will choose the Downtown Toronto borough in this analysis.

Since there can be many neighborhoods within a postal code, many postal codes within a neighborhood, and latitude and longitude coordinates are only available for postal codes, I will do my analysis based on postal codes instead of neighborhoods.

In [14]:
Downtown_Toronto_PCodes = TPC_df[TPC_df['Borough']=='Downtown Toronto'].reset_index(drop=True)
Downtown_Toronto_PCodes

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752


###### Borrow function from previous lab to loop through all of the available postal codes in Downtown Toronto and obtain a maximum of 100 venues within a specified radius of 500m from the coordinates of each postal code:

In [15]:
def getNearbyVenues(PCodes, latitudes, longitudes, radius=500, limit=100):
    
    venues_list=[]
    for PCode, lat, lng in zip(PCodes, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            PCode, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [16]:

toronto_venues = getNearbyVenues(Downtown_Toronto_PCodes['Postal Code'], Downtown_Toronto_PCodes['Latitude'], Downtown_Toronto_PCodes['Longitude'], radius=500, limit=100)
print(toronto_venues.shape)
toronto_venues.head()

(1233, 7)


Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4W,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,M4W,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,M4W,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,M4W,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,M4X,43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


In [17]:
#Count the number of resulting venues in each Downtown Toronto postal code:
toronto_venues[['Postal Code','Venue']].groupby('Postal Code').count()

Unnamed: 0_level_0,Venue
Postal Code,Unnamed: 1_level_1
M4W,4
M4X,44
M4Y,80
M5A,47
M5B,100
M5C,79
M5E,58
M5G,65
M5H,96
M5J,100


In [18]:
#Identify the postal codes that have less than 20 venues:
drop_Pcodes = []
Pcode_series = toronto_venues.groupby('Postal Code').apply(lambda x : len(x)>19)
for index, value in Pcode_series.iteritems():
    if value == False:
        drop_Pcodes.append(index)
        
#Drop postal codes with less than 20 venues from the "toronto_venues" data frame:
toronto_venues_adjusted = toronto_venues.groupby('Postal Code').filter(lambda x : len(x)>19)

#Drop postal codes with less than 20 venues from the "Downtown_Toronto_PCodes" data frame and name it "toronto_merged":
toronto_merged = Downtown_Toronto_PCodes[~Downtown_Toronto_PCodes['Postal Code'].isin(drop_Pcodes)].reset_index(drop=True)

#Show codes to be droped:
drop_Pcodes

['M4W', 'M5V', 'M6G']

In [19]:
print(len(toronto_venues_adjusted['Venue Category'].unique()), "unique venue categories!")

194 unique venue categories!


**4.2** Group all postal codes and calculate the mean for each category:

In [20]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues_adjusted[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Postal Code'] = toronto_venues_adjusted['Postal Code'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(1196, 195)


Unnamed: 0,Postal Code,Adult Boutique,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,...,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
4,M4X,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,M4X,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,M4X,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,M4X,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,M4X,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped.head()

(16, 195)


Unnamed: 0,Postal Code,Adult Boutique,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,...,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,M4X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4Y,0.0125,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0125,0.0125,0.0125,0.0,0.0,0.0,0.0125,0.0,0.025
2,M5A,0.0,0.0,0.021277,0.0,0.021277,0.0,0.0,0.0,0.0,...,0.0,0.0,0.042553,0.0,0.0,0.0,0.0,0.0,0.0,0.021277
3,M5B,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.01,0.01,0.02,0.0,0.0,0.0,0.01,0.01,0.01,0.0
4,M5C,0.0,0.037975,0.0,0.0,0.012658,0.0,0.0,0.012658,0.012658,...,0.0,0.012658,0.012658,0.0,0.0,0.012658,0.0,0.0,0.012658,0.0


In [22]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Postal Code', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 0, 1, 1, 1, 2, 1, 1, 1, 1, 4, 3, 1, 1, 2], dtype=int32)

In [23]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinates of Toronto are 43.6534817, -79.3839347.


In [24]:
# append the cluster labels to the "toronto_merged" dataframe:
toronto_merged['Cluster Labels'] = kmeans.labels_

In [28]:

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, neigh in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Postal Code'], toronto_merged['Cluster Labels'], toronto_merged['Neighborhood']):
    label = folium.Popup('Postal Code: ' + str(poi) + ' Neighborhood: ' + str(neigh) +' Cluster: ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [27]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code', 'Cluster']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Pcodes_venues_sorted = pd.DataFrame(columns=columns)
Pcodes_venues_sorted['Postal Code'] = toronto_merged['Postal Code']
Pcodes_venues_sorted['Cluster'] = toronto_merged['Cluster Labels']

for ind in np.arange(toronto_grouped.shape[0]):
    Pcodes_venues_sorted.iloc[ind, 2:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

Pcodes_venues_sorted.sort_values(by=['Cluster'], ascending=False).reset_index(drop=True)

Unnamed: 0,Postal Code,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5S,4,Café,Bookstore,Bar,Italian Restaurant,Japanese Restaurant,Bakery,Yoga Studio,French Restaurant,Beer Store,Sandwich Place
1,M5T,3,Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Coffee Shop,Bakery,Farmers Market,Caribbean Restaurant,Bar,Gaming Cafe
2,M5G,2,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Bubble Tea Shop,Thai Restaurant,Salad Place,Burger Joint,Portuguese Restaurant,Ramen Restaurant
3,M7A,2,Coffee Shop,Sushi Restaurant,Yoga Studio,Smoothie Shop,Burrito Place,Café,College Auditorium,College Cafeteria,Park,Creperie
4,M4X,1,Coffee Shop,Pub,Bakery,Italian Restaurant,Café,Pizza Place,Restaurant,Japanese Restaurant,Farmers Market,Jewelry Store
5,M4Y,1,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Gay Bar,Yoga Studio,Pub,Men's Store,Mediterranean Restaurant,Hotel
6,M5B,1,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Bubble Tea Shop,Japanese Restaurant,Hotel,Middle Eastern Restaurant,Fast Food Restaurant,Pizza Place
7,M5C,1,Coffee Shop,Café,Gastropub,American Restaurant,Cocktail Bar,Italian Restaurant,Restaurant,Cheese Shop,Farmers Market,Clothing Store
8,M5E,1,Coffee Shop,Cocktail Bar,Bakery,Beer Bar,Seafood Restaurant,Cheese Shop,Farmers Market,Restaurant,Diner,Comfort Food Restaurant
9,M5H,1,Coffee Shop,Café,Hotel,Restaurant,Thai Restaurant,Gym,Deli / Bodega,Bakery,Concert Hall,Cosmetics Shop


###### **Cluster #1**
By looking at the map we can observe that postal codes in *cluster 1* are located to the south of cluster 3 and to the East of Yonge Street in Downtown Toronto where coffe shops are the most common type of venue. Other common venues in this side of town include Cafes, Hotels, Pizza Places,  Seafood, Japanese, and Sushi restaurants. If you are looking for a place to eat seafood and/or japanese food, cluster 1 seems to be the right side of town to visit.
###### **Cluster #2**
The postal codes in *cluster 2* are situated to the North of Dundas Street West, right in between University Avenue and Yonge Street. Again, the most common venues in this cluster are coffee shops. Other common venues are italian restaurants and sandwich places. Maybe if you are looking to eat fast food (sandwiches) or Italian food this might be a good side of town to visit.
###### **Cluster #3**
There is just one postal code in *custer 3*. Chinatown is located in this cluster. Cluster 3 seems to be a good place to visit if you are looking for a vegetarian/vegan, Mexican or Vietnamese type of restaurant.
###### **Cluster #4**
Lastly, there is just one postal code in *cluster 4*. The University of Toronto is located in this area and the most common venues in this side of town are Cafes, Bookstores, Bars and Japanese and Italian restaurants.