# Segmenting and Clustering Toronto Neighborhoods

## Section 1 - Web Scraping to Obtain Toronto Postal Code DataFrame

In this section, I build a DataFrame of Toronto postal codes by scraping a table from a Wikipedia page. This section also completes the first task of the peer-graded assignment.

In [1]:
#Import Libraries
import numpy as np # Library to handle data in a vectorized manner
import pandas as pd # Library for data analysis
import json # Library to handle JSON files
import requests # Library to send HTTP requests using Python


from geopy.geocoders import Nominatim # Convert an address into longitude and latitude values
from pandas import json_normalize # Transform JSON data into a pandas dataframe (pandas >= 1.0; pandas.io.json.json_normalize is deprecated)

import matplotlib.pyplot as plt # Visualization
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans # Clustering

import folium # Map rendering

from bs4 import BeautifulSoup # Library to scrape information from web pages

print('Libraries imported!')

Libraries imported!


Use pandas to read the tables from the web page.

In [2]:
# Use pandas to scrape table from web page
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

request = requests.get(wiki_url)
Wiki_df_list = pd.read_html(request.text) # Parse every table on the page into a list of DataFrames
df = Wiki_df_list[0] # DataFrame
print(df.head())
print('\nDataframe Shape:', df.shape)

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront

Dataframe Shape: (180, 3)


Dataframe successfully collected!

The next step is to remove every row whose borough is **'Not assigned'**.

In [3]:
# Remove every row whose borough is 'Not assigned'
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


'If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.'
**Quoted from the Submission page.**

In [4]:
# Verify Dataframe to find any 'Not assigned' values left behind
print(df[df["Neighbourhood"] == 'Not assigned'])

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []


Based on the output, there are no **'Not assigned'** values left. So the dataframe is clean and ready to be processed further.
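Had the check above found any rows, the rule quoted from the Submission page could be applied with a boolean mask. The following is a minimal sketch on made-up rows (the `sample` Dataframe and its values are illustrative assumptions, not scraped data):

```python
import pandas as pd

# Toy rows (made up for illustration) that mimic the scraped table's columns
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M7A", "M9A"],
    "Borough": ["Not assigned", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Islington Avenue"],
})

# Drop rows whose borough is unassigned
sample = sample[sample["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is unassigned, fall back to the borough name
missing = sample["Neighbourhood"] == "Not assigned"
sample.loc[missing, "Neighbourhood"] = sample.loc[missing, "Borough"]

print(sample)
```

Taking a `.copy()` after filtering avoids pandas' chained-assignment warning when the fallback values are written.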

In [5]:
# Dataframe shape
print('Dataframe shape:', df.shape)

Dataframe shape: (103, 3)


<br> <br>

## Section 2 - Collecting Latitude and Longitude for Toronto Postal Codes

After collecting the Toronto postal codes, boroughs, and neighborhoods, the next step is to collect the latitude and longitude of each postal code.  
<br>
The first step in Section 2 is to load the geospatial data.

In [6]:
# Read CSV - Geospatial Data
geo_df = pd.read_csv('http://cocl.us/Geospatial_data')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
# Merge two dataframes based on postal code
merge_df = pd.merge(df, geo_df, on="Postal Code")
merge_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [8]:
# Verify the number of postal codes (should be 103 postal codes in total)
print('The Dataframe is successfully merged, and ready to be processed further.\nWith the total of {} postal codes.'.format(merge_df.shape[0]))


The Dataframe is successfully merged, and ready to be processed further.
With the total of 103 postal codes.
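Counting rows confirms the total, but it does not show which postal codes would be lost if the two tables disagreed. One possible check, sketched here on toy frames (`left`, `right`, and the deliberately missing `M5A` row are assumptions for illustration), is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy frames: the right-hand table deliberately lacks M5A
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code; indicator=True adds a "_merge" column
checked = pd.merge(left, right, on="Postal Code", how="left", indicator=True)

# Rows tagged "left_only" found no coordinates in the right-hand table
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print("Postal codes without coordinates:", unmatched)
```

The default inner merge used above would silently drop such rows, so this variant is mainly useful as a diagnostic.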


<br><br>

## Section 3 - Explore and Cluster the Neighborhoods in Toronto

The Dataframes have been merged successfully. The first task in this section is to explore what each neighborhood has to offer. To do that, I use the Foursquare API to collect data about the venues in each neighborhood.

### Part 1 - Exploration

In [9]:
# Setup Foursquare credentials
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # Redacted: never publish real credentials in a shared notebook
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # Redacted
VERSION = '20210101' # January 1st, 2021
LIMIT = 100 # default Foursquare API limit value
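Credentials are safer kept outside the notebook. A minimal sketch, assuming the two hypothetical environment variables `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are set before the notebook is launched:

```python
import os

# Hypothetical variable names; export them in the shell before starting Jupyter
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

# Fail early with a readable message instead of a confusing API error later
credentials_ready = bool(client_id and client_secret)
if not credentials_ready:
    print("Foursquare credentials are not set; API calls would fail.")
```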


In [10]:
# Create function to repeat the process of collecting venues for each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius = 500):
    
    venues_list=[] #Create empty list
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name) # Print each neighborhood
        
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Collect only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
    # Save it into Dataframe    
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
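The function above indexes straight into the response (`["response"]['groups'][0]['items']` and `['categories'][0]`), so a failed request or a venue without a category would raise an exception. The sketch below shows one defensive way to parse the same nested structure. `parse_explore_payload` is a hypothetical helper, and `fake_payload` is hand-written to imitate the shape the function expects; it is not real API output.

```python
def parse_explore_payload(payload, name, lat, lng):
    """Return one tuple per venue, tolerating missing keys (hypothetical helper)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Use None when a venue has no category instead of raising IndexError
        category = categories[0].get("name") if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hand-written payload imitating the nested structure indexed above
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed Venue", "location": {}, "categories": []}},
]}]}}

rows = parse_explore_payload(fake_payload, "Parkwoods", 43.753259, -79.329656)
empty_rows = parse_explore_payload({"response": {}}, "Nowhere", 0.0, 0.0)
print(rows)
print(empty_rows)
```

A neighborhood with no venues then simply contributes an empty list rather than stopping the loop.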

Save the newly acquired information about venues in Toronto into a new Dataframe.

In [13]:
# Venues Dataframe
toronto_venues = getNearbyVenues(names=merge_df['Neighbourhood'],
                                latitudes=merge_df['Latitude'],
                                longitudes=merge_df['Longitude']
                                )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Checking the size of the newly created Dataframe.

In [14]:
# Shape of the new Dataframe
print(toronto_venues.shape)
toronto_venues.head()

(2114, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


Checking how many venues were returned for each neighborhood and how many unique venue categories there are.

In [15]:
# Groupby 'Neighborhood'
toronto_venues.groupby('Neighborhood').count().head()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",6,6,6,6,6,6
"Bathurst Manor, Wilson Heights, Downsview North",22,22,22,22,22,22
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23


In [16]:
# Unique venues
print('There are {} unique venue categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 273 unique venue categories.


**The next step is to analyze each neighborhood in Toronto and find its five most common venue types.**

In [17]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

In [18]:
# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.shape

(2114, 274)

**The next step is to group the rows by neighborhood and take the mean of the one-hot columns, which gives the frequency of occurrence of each category.**

In [19]:
# Groupby 'Neighbourhood'
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,"Willowdale, Willowdale East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0
92,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
93,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
94,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
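To make the one-hot encoding and group-by-mean steps concrete, here is a minimal sketch on a six-row toy table (the `venues` data is made up for illustration). Each row of the grouped result sums to 1, because the mean of a 0/1 column is the share of venues in that category.

```python
import pandas as pd

# Toy venue table: neighborhood "A" has 4 venues, "B" has 2
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Coffee Shop", "Bank", "Bank", "Bank"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the 0/1 columns per neighborhood = category frequency
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```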


In [20]:
# Print neighborhoods along with the top 5 most common venues
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0             Clothing Store  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3                     Lounge  0.25
4                      Motel  0.00


----Alderwood, Long Branch----
          venue  freq
0   Pizza Place  0.33
1  Skating Rink  0.17
2           Pub  0.17
3           Gym  0.17
4   Coffee Shop  0.17


----Bathurst Manor, Wilson Heights, Downsview North----
           venue  freq
0           Bank  0.09
1    Coffee Shop  0.09
2      Gift Shop  0.05
3  Shopping Mall  0.05
4    Bridal Shop  0.05


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.25
1                 Bank  0.25
2                 Café  0.25
3   Chinese Restaurant  0.25
4        Moving Target  0.00


----Bedford Park, Lawrence Manor East----
                     venue  freq
0           Sandwich Place  0.09
1              Coffee Shop  0.09
2       Italian Restaurant  0.09
3  Comfort Food Restaurant  0.04
...


----Northwood Park, York University----
                  venue  freq
0        Massage Studio  0.25
1  Caribbean Restaurant  0.25
2                   Bar  0.25
3           Coffee Shop  0.25
4     Accessories Store  0.00


----Old Mill South, King's Mill Park, Sunnylea, Humber Bay, Mimico NE, The Queensway East, Royal York South East, Kingsway Park South East----
                 venue  freq
0       Baseball Field   1.0
1    Accessories Store   0.0
2   Miscellaneous Shop   0.0
3  Moroccan Restaurant   0.0
4  Monument / Landmark   0.0


----Parkdale, Roncesvalles----
            venue  freq
0  Breakfast Spot  0.14
1       Gift Shop  0.14
2    Dessert Shop  0.07
3   Movie Theater  0.07
4         Dog Run  0.07


----Parkview Hill, Woodbine Gardens----
                venue  freq
0         Pizza Place   0.2
1  Athletics & Sports   0.1
2         Flea Market   0.1
3           Pet Store   0.1
4                Bank   0.1


----Parkwoods----
...

**The next step is to sort venue frequency data and save it to a new Dataframe.**

In [21]:
# Function to sort the venue in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [22]:
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # 'indicators' only covers 1st-3rd; later ranks use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Clothing Store,Latin American Restaurant,Breakfast Spot,Lounge,Motel
1,"Alderwood, Long Branch",Pizza Place,Skating Rink,Pub,Gym,Coffee Shop
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Gift Shop,Shopping Mall,Bridal Shop
3,Bayview Village,Japanese Restaurant,Bank,Café,Chinese Restaurant,Moving Target
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Comfort Food Restaurant,Indian Restaurant
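The row-by-row loop above can also be expressed as a single vectorized ranking. The following is a sketch under assumptions, not a drop-in replacement: the `freq` table is a toy stand-in for `toronto_grouped` with the neighborhood as index, and `top_n` plays the role of `num_top_venues`.

```python
import numpy as np
import pandas as pd

# Toy frequency table: rows are neighborhoods, columns are categories
freq = pd.DataFrame(
    {"Bank": [0.25, 1.0], "Coffee Shop": [0.25, 0.0], "Park": [0.5, 0.0]},
    index=["A", "B"],
)
top_n = 2

# argsort on the negated values ranks columns by descending frequency;
# kind="stable" keeps the original column order among ties
order = np.argsort(-freq.to_numpy(), axis=1, kind="stable")[:, :top_n]

# Translate column positions back into category names
top = pd.DataFrame(
    freq.columns.to_numpy()[order],
    index=freq.index,
    columns=[f"Top {i + 1}" for i in range(top_n)],
)
print(top)
```

Note that ties are broken arbitrarily by column order, and zero-frequency categories still fill the remaining slots. This is why a category such as "Motel" with a frequency of 0.00 can appear among Agincourt's top five above.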


### Part 2 - Clustering

In the second part of Section 3, I fit a k-means clustering model with k = 4, which yields four clusters.

In [23]:
# Set the number of clusters
kclusters = 4

# Drop column 'Neighbourhood' from toronto_grouped Dataframe and save it to new variable
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood') # positional axis argument was removed in pandas 2.0

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
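The value k = 4 is fixed by hand here. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below uses synthetic two-dimensional blobs as a stand-in for `toronto_grouped_clustering` (the data, the range of k, and `n_init=10` are all assumptions for illustration), so the result says nothing about the Toronto data itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: three well-separated blobs of 30 points each
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On the real frequency table, it would also be worth inspecting the cluster sizes (for example with `np.bincount(kmeans.labels_)`), since the first ten labels above all fall into cluster 0.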

Once the labels have been generated, the next step is to create a new Dataframe that combines information from the two Dataframes.

In [24]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = merge_df

# join the sorted-venue table (with cluster labels) onto merge_df, which carries latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,3.0,Park,Food & Drink Shop,Accessories Store,Middle Eastern Restaurant,Monument / Landmark
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Hockey Arena,Coffee Shop,Portuguese Restaurant,Men's Store
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Bakery,Park,Pub,Breakfast Spot
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Clothing Store,Accessories Store,Furniture / Home Store,Gift Shop,Coffee Shop
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0.0,Coffee Shop,Sushi Restaurant,Yoga Studio,Distribution Center,Burrito Place


Check to see any NaN values.

In [25]:
Check = toronto_merged["Cluster Labels"].isnull().values.any()

# Check to see any NaN values
print('Is there any NaN values?')
if Check:
    print('Yes.')
else:
    print('No')

# Check the amount of NaN values
print('\nThe amount of NaN values:', toronto_merged["Cluster Labels"].isnull().values.sum())

Is there any NaN values?
Yes.

The amount of NaN values: 3


These NaN values arise because Foursquare returned no venues within the 500-meter radius of three neighborhoods' coordinates, so those neighborhoods have no row in the sorted-venue table to join against.

In [26]:
# Drop NaN values
toronto_merged = toronto_merged.dropna()
print(toronto_merged.shape)

(100, 11)
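The join behaviour behind those NaN values can be reproduced on a toy example. In this sketch (the `areas` and `labels` frames are made up), neighborhood "B" has no venue row, so its label becomes NaN and the whole column is upcast to float. This is why the labels display as `3.0` and `0.0` in the merged table above. Casting back to `int` after dropping the missing rows is an optional tidy-up, not something the notebook does.

```python
import pandas as pd

# Toy frames: "B" has no entry in the labels table
areas = pd.DataFrame({"Neighbourhood": ["A", "B", "C"],
                      "Latitude": [43.70, 43.71, 43.72]})
labels = pd.DataFrame({"Neighborhood": ["A", "C"], "Cluster Labels": [0, 3]})

# Left join on the neighborhood name; unmatched rows get NaN labels
joined = areas.join(labels.set_index("Neighborhood"), on="Neighbourhood")
n_missing = int(joined["Cluster Labels"].isna().sum())

# Drop only rows lacking a label, then restore the integer dtype
cleaned = joined.dropna(subset=["Cluster Labels"]).copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)

print("Rows without a label:", n_missing)
print(cleaned)
```

Restricting `dropna` with `subset=` avoids discarding rows that are missing some unrelated column.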


### Part 3 - Visualization

**The final step is to visualize the resulting clusters.**

In [27]:
# Set variable for Toronto's latitude and longitude
toronto_latitude = 43.6532
toronto_longitude = -79.3832

# Create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
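The colour scheme above samples the `rainbow` colormap once per cluster and then indexes it with `int(cluster)-1`, so cluster 0 wraps around to the last colour. A slightly more direct mapping, sketched here as an alternative rather than as the notebook's method, builds an explicit label-to-colour dictionary (`palette` and `color_for` are illustrative names):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# Sample one evenly spaced colour per cluster and convert RGBA to hex strings
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Explicit mapping from cluster label to colour, with no index arithmetic
color_for = {label: palette[label] for label in range(kclusters)}
print(color_for)
```

With this mapping, a marker would use `color_for[int(cluster)]` for both `color` and `fill_color`.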