## 1. Load data from web page

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and load it into a pandas dataframe.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [2]:
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
postal_codes = tables[0]
postal_codes.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Filter the dataframe by dropping the rows whose Borough is Not assigned. Rename the Postal code column to PostalCode.

In [3]:
postal_codes = postal_codes[postal_codes['Borough'] != 'Not assigned'].reset_index(drop=True)
postal_codes.rename(columns={'Postal code': 'PostalCode'}, inplace=True)

postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


To check for duplicate postal codes, we can count the unique items in each column.

In [4]:
postal_codes.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M5M,North York,Downsview
freq,1,24,4
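A more direct check than reading the `unique` row of `describe()` is to ask pandas explicitly. The sketch below uses a small made-up dataframe (the values are illustrative, not the scraped data) and the standard `Series.is_unique` and `Series.duplicated` methods.

```python
import pandas as pd

# Toy stand-in for the cleaned table; values are illustrative only.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M3B", "M3C"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Don Mills", "Don Mills"],
})

# True when no postal code appears twice.
codes_are_unique = sample["PostalCode"].is_unique

# Names that occur on more than one row (keep=False flags every occurrence).
repeated_names = sample.loc[
    sample["Neighborhood"].duplicated(keep=False), "Neighborhood"
].unique()

print(codes_are_unique, list(repeated_names))
```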


No duplicate postal codes! It seems that neighborhoods are already grouped by postal code.<br/>
But some neighborhoods are listed more than once, since there are only 98 unique neighborhood values across 103 postal codes. We will handle this issue later.<br/>
Let's check whether there is any NaN or Not assigned value in the Neighborhood column.

In [5]:
postal_codes['Neighborhood'].isna().sum()

0

In [6]:
(postal_codes['Neighborhood']=='Not assigned').sum()

0

Multiple neighborhoods sharing the same postal code are separated by ' /'. Replace that separator with a single comma (,).

In [7]:
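# assign the result back instead of calling replace(..., inplace=True) on a column selection
postal_codes['Neighborhood'] = postal_codes['Neighborhood'].str.replace(' /', ',', regex=False)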
postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [8]:
postal_codes.shape

(103, 3)

The first part of the project is complete.

## 2. Build Dataset

Let's add coordinates columns to the dataframe.

In [9]:
postal_codes['Latitude'] = 0.0
postal_codes['Longitude'] = 0.0

postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,0.0,0.0
1,M4A,North York,Victoria Village,0.0,0.0
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",0.0,0.0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",0.0,0.0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",0.0,0.0


Let's find out which neighborhoods appear more than once.

In [10]:
postal_codes_grouped = postal_codes.groupby('Neighborhood').count()

postal_codes_grouped[postal_codes_grouped['PostalCode']>1]

Neighborhood,PostalCode,Borough,Latitude,Longitude
Don Mills,2,2,2,2
Downsview,4,4,4,4
Willowdale,2,2,2,2


In the next loop over the dataframe we will append the postal code to these neighborhood names so that every neighborhood is unique.
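As a side note, the same disambiguation can be done without a row loop. This is only a sketch on a made-up dataframe; it flags repeated names with `duplicated(keep=False)` instead of hard-coding the three names found above.

```python
import pandas as pd

# Toy stand-in; the postal codes and names are illustrative only.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M3B", "M3C", "M3K", "M3L"],
    "Neighborhood": ["Parkwoods", "Don Mills", "Don Mills", "Downsview", "Downsview"],
})

# Mark every row whose neighborhood name occurs more than once.
is_repeated = sample["Neighborhood"].duplicated(keep=False)

# Append the postal code only to the repeated names.
sample.loc[is_repeated, "Neighborhood"] = (
    sample.loc[is_repeated, "Neighborhood"] + " " + sample.loc[is_repeated, "PostalCode"]
)

print(sample)
```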

Use geocoder to get the coordinates and store them in the dataframe. Use the arcgis provider instead of google.

In [12]:
!conda install -c conda-forge geocoder --yes

import geocoder # library to convert an address into coordinates

print('geocoder imported!')

geocoder imported!


In [13]:
# iterate over dataframe and add values for latitude and longitude
for index, row in postal_codes.iterrows():
    postal_code = row['PostalCode']
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
    postal_codes.at[index,'Latitude']= g.lat
    postal_codes.at[index,'Longitude'] = g.lng
    if  row['Neighborhood'] in ['Don Mills','Downsview','Willowdale']:
        postal_codes.at[index,'Neighborhood'] = postal_codes.at[index,'Neighborhood'] + ' ' + postal_codes.at[index,'PostalCode']
    
# postal_codes contains all neighborhoods information now
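Geocoding services occasionally return nothing for a query, in which case the loop above would store missing coordinates. One possible guard, sketched here under assumptions, is a small retry wrapper. `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the coordinates in the stand-in are made up; `lookup` is any callable that returns a `(lat, lng)` pair or `None`.

```python
import time

def geocode_with_retry(lookup, query, attempts=3, pause_seconds=0.0):
    """Call lookup(query) until it yields a complete (lat, lng) pair.

    Returns None if every attempt fails.
    """
    for _ in range(attempts):
        coords = lookup(query)
        # Accept only a non-empty result with no missing component.
        if coords and None not in coords:
            return tuple(coords)
        time.sleep(pause_seconds)
    return None

# Stand-in lookup that fails twice before answering (made-up coordinates).
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else (43.0, -79.0)

result = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario")
print(result, calls["count"])
```

In the notebook, `lookup` could be a small function that calls `geocoder.arcgis(query)` once and returns `(g.lat, g.lng)`, which are the same attributes the loop above reads.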

Let's take a look at the dataframe now

In [14]:
postal_codes.shape

(103, 5)

In [15]:
postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939


Dataframe is now ready for further analysis!


## 3. Analyze Data

### 3.1 Explore Neighborhoods

Import libraries first.

In [16]:
import json # library to handle JSON files

import requests # library to handle requests

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Get Toronto coordinates and create a map with neighborhoods

In [17]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [18]:
# create map of Toronto using latitude and longitude values
map_toronto_full = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(postal_codes['Latitude'], postal_codes['Longitude'], postal_codes['Borough'], postal_codes['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_full)  
    
map_toronto_full

In [19]:
toronto_data = postal_codes
toronto_data.shape

(103, 5)

Foursquare Credentials

In [20]:
CLIENT_ID = 'QJ1PBI3IOFN5VJL5UCNGYC5NM5JVEXMUJP5VLXIE0V4VAT4S' #  Foursquare ID
CLIENT_SECRET = 'EVQVSSLLIO0RMLFEZMGBRZ2PTE4V2O2ZZ2JYDDYMOPZ5HCAP' #  Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

credentials:
CLIENT_ID: QJ1PBI3IOFN5VJL5UCNGYC5NM5JVEXMUJP5VLXIE0V4VAT4S
CLIENT_SECRET:EVQVSSLLIO0RMLFEZMGBRZ2PTE4V2O2ZZ2JYDDYMOPZ5HCAP
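Hard-coding and printing API credentials makes them visible to anyone who reads the notebook. A possible alternative, sketched below, is to read them from environment variables and print only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not something Foursquare prescribes.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from a mapping, os.environ by default."""
    env = os.environ if env is None else env
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")  # hypothetical names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]

def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "abc", "FOURSQUARE_CLIENT_SECRET": "abcdefgh"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_SECRET: " + mask(client_secret))
```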


Define Functions

In [21]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
  
# function that gets up to LIMIT (default 100) venues for each neighborhood
# names, latitudes, longitudes are the columns of the dataframe
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
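`getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue returned without categories. The sketch below shows one way the row extraction could tolerate that case. `extract_venue_rows` is a hypothetical helper, and `sample_items` is a hand-written payload that only mirrors the keys the function above reads, not a real API response.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn explore items into rows, using None when a venue has no category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Hand-written items shaped like the fields accessed in getNearbyVenues.
sample_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Spot", "location": {"lat": 43.76, "lng": -79.34},
               "categories": []}},
]

rows = extract_venue_rows("Parkwoods", 43.752935, -79.335641, sample_items)
for row in rows:
    print(row)
```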

Now let's build a new dataframe with all the venues

In [22]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'])

Let's check the results

In [23]:
print('{} venues found in {} neighborhoods.'.format(toronto_venues.shape[0], toronto_venues['Neighborhood'].nunique()))
print('Total number of neighborhoods in Toronto is {}'.format(toronto_data.shape[0]))

2268 venues found in 101 neighborhoods.
Total number of neighborhoods in Toronto is 103
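To see which neighborhoods account for the gap between the two counts, the full list of names can be compared against the names present in the venues dataframe. A minimal sketch with made-up names:

```python
import pandas as pd

# Toy stand-ins for toronto_data['Neighborhood'] and toronto_venues.
all_neighborhoods = pd.Series(
    ["Neighborhood A", "Neighborhood B", "Neighborhood C"], name="Neighborhood"
)
venues = pd.DataFrame({
    "Neighborhood": ["Neighborhood A", "Neighborhood A", "Neighborhood B"],
    "Venue": ["v1", "v2", "v3"],
})

# Names that never show up in the venues dataframe.
without_venues = all_neighborhoods[
    ~all_neighborhoods.isin(venues["Neighborhood"])
].tolist()

print(without_venues)
```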


There are no nearby venues for 2 neighborhoods. Let's take a look at the dataframe.

In [None]:
grouped_toronto_venues = toronto_venues.groupby('Neighborhood').count().reset_index()
grouped_toronto_venues

And take a look at the categories

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

Some neighborhoods have too few venues for a meaningful cluster analysis. I will exclude neighborhoods with fewer than 20 venues.

In [None]:
too_few = grouped_toronto_venues[grouped_toronto_venues['Venue'] < 20].reset_index(drop=True)
print('There are ' + str(too_few.shape[0]) + ' neighborhoods with fewer than 20 nearby venues')

In [None]:
indexes = toronto_venues[toronto_venues["Neighborhood"].isin(too_few['Neighborhood'])].index

toronto_venues_filtered = toronto_venues.drop(indexes)
toronto_venues_filtered.reset_index(inplace=True, drop=True)
toronto_venues_filtered.shape
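The same filter can also be written in one step with a per-row group size, which avoids building the intermediate `too_few` table. This is a sketch on toy data, with the threshold lowered to 2 so the example stays small (the notebook uses 20).

```python
import pandas as pd

# Toy venues table; names are illustrative only.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue": ["a1", "a2", "a3", "b1"],
})
min_venues = 2  # the notebook uses 20

# Number of venues in each row's neighborhood, aligned with the original rows.
counts = venues.groupby("Neighborhood")["Venue"].transform("size")

# Keep only rows from neighborhoods that reach the threshold.
filtered = venues[counts >= min_venues].reset_index(drop=True)

print(filtered)
```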

In [None]:
toronto_venues_filtered.head()

In [None]:
# do the same for toronto_data

toronto_data_filtered = toronto_data.copy()

indexes = toronto_data_filtered[toronto_data_filtered["Neighborhood"].isin(too_few['Neighborhood'])].index
toronto_data_filtered.drop(indexes , inplace=True)
toronto_data_filtered.reset_index(inplace=True, drop=True)
toronto_data_filtered.shape

# the 2 neighborhoods for which the search returned no venues are still here: they never appear in too_few

In [None]:
toronto_data_filtered.shape

Now let's check that the neighborhoods with too few venues have been correctly deleted.

In [None]:
toronto_venues_filtered.groupby('Neighborhood').count()

Create a new map with only the neighborhoods selected in the new dataframe.
Same longitude and latitude as before, with a little more zoom.

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto_filtered = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data_filtered['Latitude'], toronto_data_filtered['Longitude'], toronto_data_filtered['Borough'], toronto_data_filtered['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_filtered)  
    
map_toronto_filtered

Data preprocessing

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues_filtered[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues_filtered['Neighborhood'] 

toronto_onehot.shape

In [None]:
# move neighborhood column to the first column
nb = toronto_onehot.columns.get_loc('Neighborhood')
fixed_columns = [toronto_onehot.columns[nb]] + list(toronto_onehot.columns[: nb]) + list(toronto_onehot.columns[nb+1 :]) 

toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

And let's examine the new dataframe size.

In [None]:
toronto_onehot.shape

Calculate the mean of the occurrence of each category grouped by neighborhood

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped
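To make the two preprocessing steps concrete, here is a small sketch on invented data. One-hot encoding turns each venue into a row of 0/1 flags, and the per-neighborhood mean of those flags is the share of venues in each category, so every row of the result sums to 1.

```python
import pandas as pd

# Toy venues table; names and categories are illustrative only.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the flags = frequency of each category within the neighborhood.
profile = onehot.groupby("Neighborhood").mean()

print(profile)
```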

In [None]:
toronto_grouped.shape

Function for returning most common values

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
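The `indicators` list with a fallback works for the ten columns used here. If `num_top_venues` were raised beyond 20, a small helper that handles the English ordinal rules in general could be used instead. `ordinal` is a hypothetical helper written for this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]

print(columns)
```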

### 3.2 Cluster Analysis

Run *k*-means to group the neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_
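The number of clusters is fixed by hand above. One common way to sanity-check such a choice is the elbow heuristic: fit k-means for several values of k and look at where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below runs on synthetic points rather than the notebook's data, so it only illustrates the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in: three well-separated blobs in 2D.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Fit k-means for k = 1..6 and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

With the notebook's data, `features` would be `toronto_grouped_clustering`; the curve may not show an elbow as clear as the one produced by the synthetic blobs.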

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data_filtered.drop(columns='PostalCode')

# join neighborhoods_venues_sorted onto toronto_data_filtered to add cluster labels and top venues to each neighborhood's latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged

In [None]:
# the 2 neighborhoods with no venues are still here...

toronto_merged.dropna(inplace=True)

In [None]:
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

#### Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
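Reading five separate tables makes it hard to compare clusters at a glance. A compact summary could list the size of each cluster together with its most frequent top venue category. The sketch below uses an invented dataframe with the same column names as `toronto_merged`.

```python
import pandas as pd

# Toy stand-in for toronto_merged; values are illustrative only.
merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "Cluster Labels": [0, 0, 1, 2, 0],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Park", "Bar", "Cafe"],
})

# Number of neighborhoods per cluster, ordered by cluster label.
sizes = merged["Cluster Labels"].value_counts().sort_index()

# Most frequent first-ranked category within each cluster.
top_category = merged.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    lambda s: s.mode().iloc[0]
)

summary = pd.DataFrame({"Neighborhoods": sizes, "Typical top venue": top_category})
print(summary)
```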