# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we are tasked with studying data pertaining to the neighborhoods of Toronto.

## Part 1: Scraping Toronto postal codes from Wikipedia

We start by scraping data on Toronto's neighborhoods and putting it into a Pandas dataframe. For this part, we are only interested in obtaining the postal code, the borough, and the neighborhood name. We want our table to list all of the unique postal codes in the area. The data will be scraped from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [1]:
# Import necessary Python libraries to scrape web data and store it
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests

In [2]:
# Load wiki page
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050'
source = requests.get(url).text
soup = BeautifulSoup(source,'lxml')


In [3]:
# Parse Wiki table
neighborhood_dict = {'PostalCode':[],'Borough':[],'Neighborhood':[]}
table=soup.find('table')
for tr in table.find_all('tr'):
    tds=tr.find_all('td')
    if(not tds):
        continue
    pc, bor, nh = [td.text.strip() for td in tds]
    if bor=='Not assigned':
        continue
    elif nh == 'Not assigned':
        nh = bor
    neighborhood_dict['PostalCode'].append(pc)
    neighborhood_dict['Borough'].append(bor)
    neighborhood_dict['Neighborhood'].append(nh)
df_toronto = pd.DataFrame.from_dict(neighborhood_dict)

In [4]:
# Process table to merge rows with the same PostalCode
df_toronto=df_toronto.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [5]:
print(f"The shape of the dataframe is: {df_toronto.shape}")

The shape of the dataframe is: (103, 3)


Now we can see that there are 103 unique postal codes in Toronto. 
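
To make the merge step in cell 4 concrete, here is a minimal sketch on three rows. The values mirror the output above, but the assumption that "Rouge" and "Malvern" arrive as separate rows is ours, for illustration only.

```python
import pandas as pd

# Illustrative pre-merge rows: one row per (postal code, neighborhood) pair.
rows = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Group on postal code and borough, then join the neighborhood names.
merged = (rows.groupby(['PostalCode', 'Borough'])['Neighborhood']
              .apply(', '.join)
              .reset_index())

# After merging, each postal code should appear exactly once.
print(merged)
print(merged['PostalCode'].is_unique)
```

Checking `is_unique` on the real `df_toronto` is a cheap way to confirm that the row count equals the number of distinct postal codes.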

## Part 2: Getting Latitude and Longitude
Using the above data, we now want to get the latitude and longitude of each postal code and display them on a map of Toronto.

In [6]:
# import geopy
# import geocoder
# from geopy.geocoders import Nominatim
# pos={"PostalCode":[],"Latitude":[],"Longitude":[]}
# for code in df_toronto['PostalCode']:
#     lat_lng_coords = None
#     while(lat_lng_coords is None):
#         g = geocoder.google(f'{code}, Toronto, Ontario')
#         lat_lng_coords = g.latlng
#         print(f'{code}, Toronto, Ontario', lat_lng_coords)
#  This is unreliable, returning None almost all of the time. Other free geocoders via geopy are also unreliable.
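
As written, the commented-out loop above would never terminate if the geocoder keeps returning `None`. A minimal sketch of a bounded retry follows. `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the stand-in lookup replaces the real geocoder so the example runs offline.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_s=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_s)  # back off briefly before trying again
    return None  # give up instead of looping forever

# Stand-in for a flaky geocoder: fails twice, then succeeds.
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.806686, -79.194353]

coords = geocode_with_retry(flaky_lookup, 'M1B, Toronto, Ontario')
print(coords, calls['n'])
```

With a real geocoder, `lookup` could be something like `lambda q: geocoder.google(q).latlng`, though given the reliability problem noted above, the CSV file used below is the more practical route.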

In [7]:
# Import mapping library
import geopy
import geocoder
from geopy.geocoders import Nominatim
import folium # map rendering library

In [8]:
df_locations = pd.read_csv('data/Geospatial_Coordinates.csv')
df_locations.rename(columns={"Postal Code":"PostalCode"},inplace=True)
df_toronto=df_toronto.merge(df_locations, on='PostalCode')

In [9]:
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
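
One thing to watch: `merge` defaults to an inner join, so any postal code missing from the coordinates file would be dropped silently. The sketch below shows one way to check for that using the `how`, `indicator`, and `validate` parameters of `DataFrame.merge`. The postal code `M9Z` and its borough are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is hypothetical
                     'Borough': ['Scarborough', 'Scarborough', 'Example']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# A left join keeps every postal code; indicator adds a '_merge' column,
# and validate raises if either side has duplicate keys.
checked = left.merge(coords, on='PostalCode', how='left',
                     indicator=True, validate='one_to_one')
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that every postal code received coordinates.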


In [10]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, postalCode, borough in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['PostalCode'], df_toronto['Borough']):
    label = '{}, {}'.format(postalCode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

The geographical coordinates of Toronto are 43.653963, -79.387207.


## Part 3: Explore and Cluster

For this part, we want to explore the data we have collected and processed, with the goal of clustering the different locations by their venue category frequency. Since the location data we collected is per postal code rather than per neighborhood or per borough, we'll keep it simple for now and cluster by postal code.

In [11]:
# Import libraries for this section
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

In [12]:
# API credentials (placeholders; substitute your own and keep real keys out of shared notebooks)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# Function to get venues from given location
# Foursquare has an upper limit of 50 results per request
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit = 50):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PC Latitude', 
                  'PC Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

We apply the above function to every postal code we have location data for. The default radius of 500 m yields few results for many of the locations, so we'll try 1 km instead. Afterward, we'll need to remove duplicates since there may be overlap.

In [13]:
toronto_venues = getNearbyVenues(names=df_toronto['PostalCode'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude'],
                                   radius = 1000
                                  )
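
Each call inside `getNearbyVenues` indexes straight into the JSON response, so a response with no groups, or a venue with an empty `categories` list, would raise an exception partway through the loop. Below is a minimal sketch of a more defensive extraction. The payload shape is inferred only from the keys used in the function above, and `extract_venues` and the `'Uncategorized'` label are hypothetical.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # Fall back to a placeholder when a venue has no category.
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((venue.get('name'), location.get('lat'),
                     location.get('lng'), category))
    return rows

# Made-up payload with one complete venue and one without categories.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.80, 'lng': -79.19},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Example Lot',
               'location': {'lat': 43.81, 'lng': -79.20},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({}))  # an empty or error payload yields no rows
```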

In [14]:
toronto_venues.drop_duplicates(inplace=True)
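
Note that `drop_duplicates()` with no arguments only removes rows that match in every column, including `PostalCode`. A venue that lies within 1 km of two postal codes is therefore still kept once for each. The sketch below, on made-up rows, shows the difference between full-row deduplication and deduplication on the venue columns alone; the latter is an alternative we are only illustrating, not what this notebook does.

```python
import pandas as pd

# Made-up rows: the same venue returned twice for M1B and once for M1C.
venues = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C'],
    'Venue': ['Example Cafe', 'Example Cafe', 'Example Cafe'],
    'Venue Latitude': [43.80, 43.80, 43.80],
    'Venue Longitude': [-79.19, -79.19, -79.19],
    'Venue Category': ['Coffee Shop'] * 3,
})

# Full-row duplicates only: the M1C copy survives.
full_row = venues.drop_duplicates()
# Deduplicate on the venue itself: one row remains.
by_venue = venues.drop_duplicates(subset=['Venue', 'Venue Latitude', 'Venue Longitude'])
print(len(full_row), len(by_venue))
```

For clustering by postal code, keeping a shared venue under each nearby postal code is arguably what we want, since it is part of both surroundings.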

Now, we one-hot encode the venue category of each venue and then group by postal code to get the frequency of occurrence of each venue category.

In [15]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']],prefix='',prefix_sep='')
toronto_onehot.insert(0,'PostalCode',toronto_venues['PostalCode'])
toronto_venue_freq = toronto_onehot.groupby('PostalCode').mean().reset_index()

In [16]:
toronto_venue_freq.head()

Unnamed: 0,PostalCode,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.033333
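
To see why the mean of the one-hot columns is a frequency, here is a small sketch on made-up venues. Passing `dtype=float` to `get_dummies` is our choice for the example, so that the mean is taken over 0/1 floats regardless of pandas version.

```python
import pandas as pd

# Made-up venues: two for M1B, four for M1C.
venues = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C', 'M1C'],
    'Venue Category': ['Coffee Shop', 'Trail',
                       'Playground', 'Playground', 'Park', 'Coffee Shop'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'PostalCode', venues['PostalCode'])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby('PostalCode').mean()
print(freq)
```

Each row of `freq` sums to 1, so postal codes with many venues and postal codes with few are put on the same scale before clustering.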


With the data all nice and encoded, we can take a look at the most common types of venues in each postal code.

In [17]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
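
A quick sketch of what this function does to a single row, written with `Series.nlargest` as an equivalent alternative to sorting the whole row. The frequencies are made up.

```python
import pandas as pd

# One made-up row of the frequency table: label first, then frequencies.
row = pd.Series({'PostalCode': 'M1B', 'Bank': 0.10, 'Coffee Shop': 0.40,
                 'Trail': 0.20, 'Bakery': 0.05})

# Drop the label, cast to float, and keep the three largest frequencies.
top3 = row.iloc[1:].astype(float).nlargest(3).index.tolist()
print(top3)  # ['Coffee Shop', 'Trail', 'Bank']
```

Categories with equal frequency are ordered by tie-breaking rather than by anything meaningful, which matters for the lower ranks when a postal code has only a few venues.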

In [18]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_venues_sorted = pd.DataFrame(columns=columns)
toronto_venues_sorted['PostalCode'] = toronto_venue_freq['PostalCode']

for ind in np.arange(toronto_venue_freq.shape[0]):
    toronto_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_venue_freq.iloc[ind, :], num_top_venues)

toronto_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Coffee Shop,Fast Food Restaurant,Trail,Bus Station,Caribbean Restaurant,Bank,Bakery,Restaurant,Sandwich Place,Auto Workshop
1,M1C,Playground,Italian Restaurant,Breakfast Spot,Burger Joint,Park,Electronics Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,M1E,Pizza Place,Bank,Fast Food Restaurant,Coffee Shop,Liquor Store,Sports Bar,Juice Bar,Discount Store,Sandwich Place,Fried Chicken Joint
3,M1G,Coffee Shop,Park,Fast Food Restaurant,Mobile Phone Shop,Chinese Restaurant,Indian Restaurant,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant
4,M1H,Coffee Shop,Pharmacy,Bank,Bakery,Gas Station,Indian Restaurant,Yoga Studio,Music Store,Fried Chicken Joint,Chinese Restaurant
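
The column labels above come out right for `num_top_venues = 10`, but the same logic would produce "21th" if the count went past 20. A small helper handles the general case; `ordinal` is a hypothetical name, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th, and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'

# Same column names as above, built without the try/except.
columns = ['PostalCode'] + [f'{ordinal(i)} Most Common Venue' for i in range(1, 11)]
print(columns[1], '|', columns[-1])
```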


And now the last step, clustering.

In [19]:
# set number of clusters
kclusters = 5
toronto_clustering = toronto_venue_freq.drop(columns='PostalCode')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)
# add clustering labels
toronto_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
df_toronto_merged = df_toronto
# join toronto_venues_sorted onto df_toronto to add the cluster label and top venues for each postal code
df_toronto_merged = df_toronto_merged.join(toronto_venues_sorted.set_index('PostalCode'), on='PostalCode')
df_toronto_merged.head() # check the last columns!


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1.0,Coffee Shop,Fast Food Restaurant,Trail,Bus Station,Caribbean Restaurant,Bank,Bakery,Restaurant,Sandwich Place,Auto Workshop
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.0,Playground,Italian Restaurant,Breakfast Spot,Burger Joint,Park,Electronics Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1.0,Pizza Place,Bank,Fast Food Restaurant,Coffee Shop,Liquor Store,Sports Bar,Juice Bar,Discount Store,Sandwich Place,Fried Chicken Joint
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4.0,Coffee Shop,Park,Fast Food Restaurant,Mobile Phone Shop,Chinese Restaurant,Indian Restaurant,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1.0,Coffee Shop,Pharmacy,Bank,Bakery,Gas Station,Indian Restaurant,Yoga Studio,Music Store,Fried Chicken Joint,Chinese Restaurant
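
For intuition about what `KMeans` is doing with the frequency rows, here is a bare-bones version of Lloyd's algorithm in NumPy. This is a sketch for illustration, not scikit-learn's implementation, which uses smarter initialization and convergence checks. `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy frequency rows: two "coffee-heavy" and two "park-heavy" postal codes.
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9],
              [0.1, 0.0, 0.9]])
labels, centers = lloyd_kmeans(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label matters. That is also why the cluster numbers in the table above carry no meaning on their own.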


In [39]:
# assign a new cluster to postal codes that don't have one
df_toronto_merged['Cluster Labels'] = df_toronto_merged['Cluster Labels'].fillna(kclusters)
# Convert cluster label from float to int
df_toronto_merged=df_toronto_merged.astype({'Cluster Labels':'int32'})
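
The labels show up as floats (1.0, 0.0, ...) in the table above, presumably because the join leaves `NaN` for any postal code that returned no venues, and `NaN` forces a float column. A small sketch of the fix on a made-up label column:

```python
import numpy as np
import pandas as pd

kclusters = 5
# Made-up labels: the third postal code had no venues, hence no label.
labels = pd.Series([1.0, 0.0, np.nan, 4.0], name='Cluster Labels')

# Missing labels get their own cluster index, then the column becomes integer.
filled = labels.fillna(kclusters).astype('int32')
print(filled.tolist())  # [1, 0, 5, 4]
```

Using `kclusters` as the fill value works because k-means labels run from 0 to `kclusters - 1`, so the extra index cannot collide with a real cluster.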

In [41]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters+1)
ys = [i + x + (i*x)**2 for i in range(kclusters+1)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_toronto_merged['Latitude'], df_toronto_merged['Longitude'], df_toronto_merged['PostalCode'], df_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

We can now look at each cluster individually to try to spot any similarities.

Just from examining the map, we see that one cluster (3, green) is concentrated near the center of the city in the commercial region, while another (1, blue) sits on the outskirts. Postal codes in the orange cluster (4) are scattered throughout.

In [48]:
cluster = 1
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == cluster, df_toronto_merged.columns[[1] + list(range(5, df_toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,1,Coffee Shop,Fast Food Restaurant,Trail,Bus Station,Caribbean Restaurant,Bank,Bakery,Restaurant,Sandwich Place,Auto Workshop
2,Scarborough,1,Pizza Place,Bank,Fast Food Restaurant,Coffee Shop,Liquor Store,Sports Bar,Juice Bar,Discount Store,Sandwich Place,Fried Chicken Joint
4,Scarborough,1,Coffee Shop,Pharmacy,Bank,Bakery,Gas Station,Indian Restaurant,Yoga Studio,Music Store,Fried Chicken Joint,Chinese Restaurant
5,Scarborough,1,Ice Cream Shop,Bowling Alley,Train Station,Convenience Store,Coffee Shop,Restaurant,Sandwich Place,Fast Food Restaurant,Japanese Restaurant,Pizza Place
6,Scarborough,1,Discount Store,Chinese Restaurant,Coffee Shop,Fast Food Restaurant,Grocery Store,Hobby Shop,Light Rail Station,Bank,Metro Station,Intersection
7,Scarborough,1,Intersection,Bus Line,Bakery,Coffee Shop,Bank,Beer Store,Fast Food Restaurant,Metro Station,Mexican Restaurant,Soccer Field
8,Scarborough,1,Pizza Place,Ice Cream Shop,Beach,Cajun / Creole Restaurant,Hardware Store,Sports Bar,Park,Burger Joint,Eastern European Restaurant,Dog Run
10,Scarborough,1,Furniture / Home Store,Coffee Shop,Asian Restaurant,Chinese Restaurant,Pharmacy,Indian Restaurant,Restaurant,Automotive Shop,Fast Food Restaurant,Electronics Store
11,Scarborough,1,Pizza Place,Middle Eastern Restaurant,Grocery Store,Intersection,Fish Market,Badminton Court,Bakery,Bar,Supermarket,Korean Restaurant
12,Scarborough,1,Chinese Restaurant,Shopping Mall,Bakery,Pizza Place,Caribbean Restaurant,Breakfast Spot,Discount Store,Japanese Restaurant,Latin American Restaurant,Lounge


My guess as to what characterizes each cluster:

    0. Parks
    1. Coffee/Pizza
    2. Park
    3. Coffee/Cafe
    4. Coffee
    5. Nothing (postal codes that had no cluster assigned)
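
Rather than guessing from the top-venue tables, one option is to average the frequency rows within each cluster and read off the dominant categories. The sketch below does this on made-up data. With the real notebook, the inputs would be `toronto_venue_freq` (indexed by postal code) and `kmeans.labels_`.

```python
import pandas as pd

# Made-up frequency rows for four hypothetical postal codes A-D.
freq = pd.DataFrame({
    'Coffee Shop': [0.40, 0.30, 0.00, 0.10],
    'Pizza Place': [0.10, 0.20, 0.00, 0.05],
    'Park':        [0.00, 0.05, 0.60, 0.50],
}, index=['A', 'B', 'C', 'D'])
labels = pd.Series([1, 1, 0, 0], index=freq.index, name='Cluster Labels')

# Mean frequency of each category within each cluster.
profile = freq.groupby(labels).mean()
# Top two categories per cluster, highest first.
top = {int(c): profile.loc[c].nlargest(2).index.tolist() for c in profile.index}
print(top)
```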