# Scrape Toronto's Postal Codes from Wikipedia

## Section 1 - Scraping

### 1. Import the required libraries and set the Wikipedia URL to scrape

In [2]:
! pip install lxml

import requests
import lxml.html as lh
import pandas as pd

wikipedia_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Collecting lxml
  Downloading lxml-4.6.2-cp38-cp38-manylinux1_x86_64.whl (5.4 MB)
Installing collected packages: lxml
Successfully installed lxml-4.6.2


### 2. Fetch the HTML content from the URL, parse it into an HTML document, and select the table using XPath

In [3]:
response = requests.get(wikipedia_url)

doc = lh.fromstring(response.content)
table = doc.xpath('//table[@class=\'wikitable sortable\']')

### 3. Convert the table into a pandas DataFrame

The table on the Wikipedia page has no ID attribute, so a small workaround is needed.
The loop stops at the first row whose first cell is empty, which indicates that the iterator has moved past the postal code table into a following table that we don't need.
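One way to keep the row search inside a single table is a relative XPath (`.//tr`), which matches only descendants of the element it is called on. The following is a minimal sketch, using the standard library's `xml.etree.ElementTree` as a stand-in for lxml and a made-up two-table snippet rather than the real Wikipedia markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet with two tables (not the real Wikipedia markup)
SNIPPET = """
<body>
  <table class="wikitable sortable">
    <tr><th>Postal Code</th><th>Borough</th></tr>
    <tr><td>M3A</td><td>North York</td></tr>
  </table>
  <table class="navbox">
    <tr><td></td><td>Other content</td></tr>
  </table>
</body>
"""

root = ET.fromstring(SNIPPET)
first_table = root.find(".//table[@class='wikitable sortable']")

# Relative search: only rows that are descendants of first_table
rows_in_table = first_table.findall(".//tr")
# Document-wide search: rows from every table
rows_in_document = root.findall(".//tr")

print(len(rows_in_table), len(rows_in_document))
```

With the search confined to one table, the empty-cell check acts only as a safety net.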

In [4]:
# './/tr' keeps the search inside this table; a bare '//tr' matches every row in the document
rows = table[0].xpath('.//tr')
columns = []
records = []

for i, r in enumerate(rows):
    # Strip trailing whitespace and map 'Not assigned' to None for Borough and Neighbourhood
    cells = [c.text_content().rstrip() for c in r[:3]]
    data_row = [cells[0]] + [c if c != 'Not assigned' else None for c in cells[1:]]
    if i == 0:
        # The header row supplies the column names
        columns = data_row
    else:
        if cells[0] == '':
            # An empty first cell means we have left the postal code data
            break
        records.append(data_row)

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
toronto_df = pd.DataFrame(records, columns=columns)

toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,,
176,M6Z,,
177,M7Z,,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


## Section 2 - Geocoder

### 1. Download the geospatial data and load it into a DataFrame

In [4]:
!wget -O geospatial_data.csv https://cocl.us/Geospatial_data

--2020-12-24 16:25:41--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.63.96.194, 169.63.96.176
Connecting to cocl.us (cocl.us)|169.63.96.194|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-12-24 16:25:42--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 185.235.236.197
Connecting to ibm.box.com (ibm.box.com)|185.235.236.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-12-24 16:25:42--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following

In [5]:
toronto_geo_df = pd.read_csv('geospatial_data.csv')
toronto_geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### 2. Cross-reference the geospatial data with the postal code data and clean up the result

In [6]:
complete_toronto_df = toronto_df.copy()
complete_toronto_df = complete_toronto_df.join(toronto_geo_df.set_index('Postal Code'), on='Postal Code', how='left')
complete_toronto_df = complete_toronto_df.dropna(subset=['Latitude'])
# notna() actually drops missing neighbourhoods; an elementwise '!= None' is True for every row
complete_toronto_df = complete_toronto_df[complete_toronto_df['Neighbourhood'].notna()]
complete_toronto_df = complete_toronto_df.reset_index(drop=True)

complete_toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
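When filtering out rows with a missing neighbourhood, note that an elementwise comparison against `None` in pandas does not behave like Python's `is not None`. The sketch below uses made-up rows (not the scraped data) to contrast `!= None` with `notna()`:

```python
import pandas as pd

# Made-up rows: the first one has no neighbourhood
df = pd.DataFrame({'Postal Code': ['M1A', 'M3A'],
                   'Neighbourhood': [None, 'Parkwoods']}, dtype=object)

# Elementwise '!= None' is True for every row, so nothing is filtered out
kept_by_comparison = df[df['Neighbourhood'] != None]

# notna() is False for None/NaN, so the row with the missing value is dropped
kept_by_notna = df[df['Neighbourhood'].notna()]

print(len(kept_by_comparison), len(kept_by_notna))
```

In this notebook the earlier `dropna(subset=['Latitude'])` already removes the unassigned postal codes, since they have no match in the geospatial file, so the displayed result is unaffected either way.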


## Section 3 - Segmenting and Clustering

### 1. Initialize Geopy and Folium libraries

In [7]:
!pip install geopy
!pip install folium

import folium
from geopy.geocoders import Nominatim

print('Folium and Geopy are installed')

Collecting geopy
  Downloading geopy-2.0.0-py3-none-any.whl (111 kB)
Collecting geographiclib<2,>=1.49
  Downloading geographiclib-1.50-py3-none-any.whl (38 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.0.0
Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Folium and Geopy are installed


### 2. Look up Toronto's geographical coordinates

In [8]:
toronto = 'Toronto, ON'

geolocator = Nominatim(user_agent="ds_toronto_expl")
location = geolocator.geocode(toronto)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### 3. Test the Folium visualization by plotting all of Toronto's neighbourhoods

In [9]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, label in zip(complete_toronto_df['Latitude'], complete_toronto_df['Longitude'], complete_toronto_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In [10]:
import time

CLIENT_ID = '<MASKED>'
CLIENT_SECRET = '<MASKED>'
VERSION = '20201224'
DEFAULT_RADIUS = 500
DEFAULT_LIMIT = 100

FOURSQUARE_BASE_EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'

def explore_venues(toronto_dataframe):
    
    columns = ['Neighbourhood',
                'Neighbourhood Latitude',
                'Neighbourhood Longitude',
                'Venue',
                'Venue Latitude',
                'Venue Longitude',
                'Category']

    venue_rows = []
    
    for i, n in toronto_dataframe.iterrows():
        neighbourhood = n['Neighbourhood']
        latitude = n['Latitude']
        longitude = n['Longitude']

        explore_url = FOURSQUARE_BASE_EXPLORE_URL.format(CLIENT_ID,
                                                         CLIENT_SECRET,
                                                         VERSION,
                                                         latitude,
                                                         longitude,
                                                         DEFAULT_RADIUS,
                                                         DEFAULT_LIMIT)

        response = requests.get(explore_url).json()['response']
        
        if 'groups' in response:
            results = response['groups'][0]['items']
        else:
            # Log the unexpected payload and skip this neighbourhood instead of raising KeyError
            print(response)
            results = []

        for r in results:
            venue_rows.append([neighbourhood,
                               latitude,
                               longitude,
                               r['venue']['name'],
                               r['venue']['location']['lat'],
                               r['venue']['location']['lng'],
                               r['venue']['categories'][0]['name']])
        
        # Pause between requests to stay under the Foursquare API rate limit
        time.sleep(1)

    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(venue_rows, columns=columns)
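The request URL above is assembled with `str.format`. As an alternative, the same query string can be built with the standard library's `urlencode`, which percent-encodes each value. This is a sketch only: `build_explore_url` is a hypothetical helper, and the parameter names are copied from the URL template in this notebook.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return the explore URL with all query parameters percent-encoded."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Placeholder credentials; the coordinates are those of Parkwoods from the table above
url = build_explore_url('ID', 'SECRET', '20201224', 43.753259, -79.329656)
print(url)
```

Passing a `params` dictionary to `requests.get` achieves the same encoding without building the string by hand.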

### 4. Explore interesting venues in Toronto's 103 neighbourhoods

In [11]:
venues_list = explore_venues(complete_toronto_df)
venues_list.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
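Each row above is flattened from one item of the API response, and the category is read from `categories[0]`. If a venue were ever returned with an empty category list, that lookup would raise `IndexError`; whether Foursquare does this in practice is an assumption here. A defensive sketch follows, with a hypothetical `flatten_items` helper and a partly made-up payload in the shape the notebook reads:

```python
def flatten_items(neighbourhood, items):
    """Turn explore 'items' into flat rows; venues without a category get None."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append([neighbourhood,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category])
    return rows

# First item mirrors a row shown above; the second is invented to exercise the fallback
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Venue',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

rows = flatten_items('Parkwoods', sample_items)
print(rows)
```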


### 5. Group the results by Foursquare category

In [13]:
toronto_onehot = pd.get_dummies(venues_list[['Category']], prefix="", prefix_sep="")
toronto_onehot['Neighbourhood'] = venues_list['Neighbourhood']
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
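Each value in the grouped table is the share of a neighbourhood's venues that fall into a category: one-hot encoding turns every venue into a row of 0s and 1s, and the per-neighbourhood mean of those rows is the relative frequency. A small sketch with made-up venues (neighbourhoods "A" and "B" are placeholders):

```python
import pandas as pd

# Made-up venues: four in neighbourhood A, two in neighbourhood B
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Café', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[['Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of the 0/1 indicators is the category's share within each neighbourhood
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Here half of A's venues are coffee shops, so its `Coffee Shop` value is 0.5, while every venue in B is a park.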


In [14]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [17]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Breakfast Spot,Latin American Restaurant,Skating Rink,Yoga Studio,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
1,"Alderwood, Long Branch",Pizza Place,Pharmacy,Athletics & Sports,Coffee Shop,Pub,Sandwich Place,Skating Rink,Gym,Airport Terminal,Dim Sum Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Pet Store,Restaurant,Mobile Phone Shop,Deli / Bodega,Supermarket,Ice Cream Shop,Middle Eastern Restaurant,Diner
3,Bayview Village,Café,Japanese Restaurant,Bank,Chinese Restaurant,Diner,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Thai Restaurant,Pet Store,Pharmacy,Pub,Restaurant,Café
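The column labels use "st", "nd" and "rd" for the first three positions and "th" for everything after, which is correct for ten venues but would produce "21th" for a longer list. A sketch of a general suffix helper (`ordinal` is a hypothetical name, not part of the notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the last-digit rule
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '|', columns[4], '|', columns[10])
```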


### 6. Run K-Means clustering

In [100]:
from sklearn.cluster import KMeans

kclusters = 9
# Drop the label column so that only the category frequencies are clustered
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')
kmeans = KMeans(n_clusters=kclusters).fit(toronto_grouped_clustering)

neighborhoods_venues_cluster = neighborhoods_venues_sorted.copy()
neighborhoods_venues_cluster.insert(0, 'Cluster Labels', kmeans.labels_)

final_df = neighborhoods_venues_cluster.merge(
    complete_toronto_df[['Neighbourhood', 'Latitude', 'Longitude']], how='inner')
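K-Means starts from randomly chosen centroids, so without a fixed seed the cluster numbering, and sometimes the grouping itself, can change between runs. The sketch below uses made-up points (not the Toronto data) to show that passing `random_state`, a standard `KMeans` parameter, gives identical labels on repeated fits; the value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two well-separated groups
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Fixing random_state makes the centroid initialization, and so the labels, repeatable
first = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
second = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

same_labels = bool((first.labels_ == second.labels_).all())
print(same_labels)
```

Applying a seed to this notebook would make reruns comparable, though the cluster numbers would not necessarily match the ones discussed here.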

In [71]:
import math
import random

### 7. Visualize the clusters on a Folium map

In [101]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
colors = ["#%06x" % random.randint(0, 0xFFFFFF) for x in range(0, kclusters)]

for lat, lon, poi, cluster in zip(final_df['Latitude'], final_df['Longitude'], final_df['Neighbourhood'], final_df['Cluster Labels']):
    cl = str(cluster) if not math.isnan(cluster) else 'outlier'
    label = folium.Popup(str(poi) + ' Cluster ' + cl, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#000000',
        fill=True,
        fill_color=colors[cluster],
        fill_opacity=1.0,
        weight=2).add_to(map_clusters)
       
map_clusters
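Randomly drawn hex colours can come out nearly identical for two clusters, which makes the map harder to read. One alternative, sketched here with the standard library's `colorsys`, is to space the hues evenly around the colour wheel. `distinct_hex_colors` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys

def distinct_hex_colors(n):
    """Return n hex colours with evenly spaced hues."""
    palette = []
    for i in range(n):
        # Hue runs over [0, 1); saturation and value are fixed
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = distinct_hex_colors(9)
print(palette)
```

The resulting list can be indexed by cluster label in the same way as the random `colors` list.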

### 8. Analysis

Based on the data retrieved from Foursquare on 24 December 2020, most neighbourhoods fall into Cluster 0.
The cells below show the five most frequent 1st and 2nd most common venue categories in this cluster:

In [134]:
final_df[final_df['Cluster Labels'] == 0] \
    [['Neighbourhood', '1st Most Common Venue']] \
    .groupby('1st Most Common Venue') \
    .count() \
    .sort_values(['Neighbourhood'], ascending=False) \
    .head()

1st Most Common Venue,Neighbourhood
Coffee Shop,21
Café,6
Grocery Store,5
Pub,2
Pharmacy,2


In [136]:
final_df[final_df['Cluster Labels'] == 0] \
    [['Neighbourhood', '2nd Most Common Venue']] \
    .groupby('2nd Most Common Venue') \
    .count() \
    .sort_values(['Neighbourhood'], ascending=False) \
    .head()

2nd Most Common Venue,Neighbourhood
Coffee Shop,7
Café,6
Park,6
Bakery,5
Breakfast Spot,3


Coffee Shop and Café dominate this cluster, and most parts of Toronto appear to have these two venue categories. Because coffee shops and cafés are so dominant across Toronto, the other clusters are less relevant.
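How strongly one cluster dominates can be checked by counting how many neighbourhoods carry each label. A minimal sketch with made-up labels (not the notebook's actual result):

```python
from collections import Counter

# Made-up cluster labels, one per neighbourhood
labels = [0, 0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 1]

sizes = Counter(labels)
total = len(labels)
for cluster, count in sizes.most_common():
    print('Cluster {}: {} neighbourhoods ({:.0%})'.format(cluster, count, count / total))

largest_cluster, largest_count = sizes.most_common(1)[0]
```

On the notebook's data, `final_df['Cluster Labels'].value_counts()` gives the same counts.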