## Scraping Toronto Postcodes from Wikipedia using BeautifulSoup

In [18]:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import xml

Request the Wikipedia page, then cache the HTML locally so that later cells do not need to fetch it again.

In [19]:
url = 'http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req = urllib.request.urlopen(url)
article = req.read().decode()
with open('postcode.html', 'w') as fo:
    fo.write(article)
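The caching step can also be made conditional, so the page is downloaded only when no local copy exists. The helper below is a minimal sketch of that idea; `load_cached` and its `fetch` argument are hypothetical names, not part of the original notebook.

```python
from pathlib import Path


def load_cached(path, fetch):
    """Return the text stored at `path`, calling `fetch()` only if the file is missing."""
    cache = Path(path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")
    text = fetch()  # e.g. a function that downloads and decodes the page
    cache.write_text(text, encoding="utf-8")
    return text
```

In this notebook it could be called as `load_cached('postcode.html', lambda: urllib.request.urlopen(url).read().decode())`.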

Read the cached file, then find the sortable tables in the HTML.

In [20]:
with open('postcode.html') as fi:  # close the file handle once the content is read
    article = fi.read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

Convert the table to a string and feed that into a pandas dataframe.

In [21]:
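from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(tables)))[0]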

Remove rows whose Borough is 'Not assigned'.

In [22]:
df = df[df.Borough != 'Not assigned']
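As a small illustration of the filter, the sketch below applies the same condition to a three-row sample. The `M1A` row is made up for the example; the other two rows match the table shown further down. Resetting the index is optional and is not done in the notebook itself.

```python
import pandas as pd

# Illustrative sample: the first row stands in for an unassigned postal code.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough and renumber them from zero.
cleaned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(cleaned)
```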

In [23]:
df.shape

(103, 3)

In [24]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


The Geocoder lookup was attempted first, but it appeared to get stuck in an infinite loop,
so the provided CSV of coordinates is used instead.
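The geocoder code is not shown here, so the following is only a guess at the cause: a lookup retried in a loop with no upper bound never finishes if the service keeps returning nothing. A bounded retry avoids that. `retry` is a hypothetical helper, and `func` stands for whatever lookup is being attempted.

```python
import time


def retry(func, attempts=5, delay=1.0):
    """Call func() until it returns a non-None value, at most `attempts` times."""
    for attempt in range(attempts):
        result = func()
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before trying again
    return None  # give up instead of looping forever
```

A `None` result can then trigger the fallback to the CSV file.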

In [43]:
postcode_df = pd.read_csv("https://cocl.us/Geospatial_data")

In [44]:
postcode_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the imported csv dataframe into the original df.

In [48]:
df = postcode_df.merge(df, on = 'Postal Code')

In [59]:
df.head()

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Malvern, Rouge"
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae


In [60]:
df.shape

(103, 5)
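The shape confirms that all 103 postal codes found a match. A merge can also be asked to check this directly. The sketch below uses a small sample with one deliberately missing borough row; the `validate` and `indicator` options are standard `DataFrame.merge` parameters, while the sample frames are illustrative.

```python
import pandas as pd

coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})
# Illustrative: M1E is left out on purpose to show an unmatched row.
areas = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a _merge column saying where each row came from.
merged = coords.merge(areas, on="Postal Code", how="left",
                      validate="one_to_one", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched["Postal Code"].tolist())
```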

## Part 3 - Clustering of the Data

Convert the locations in the df to a list of lat/long and descriptions.

In [134]:
locations = df[['Latitude', 'Longitude', 'Neighborhood']]
locationlist = locations.values.tolist()

In [92]:
import folium
import requests

Plot the neighborhoods on a map to visualise the locations as a starting point

In [84]:
toronto_map = folium.Map(location=[43.7238, -79.3886], zoom_start=10.5)  # avoid shadowing the built-in map()
for lat, lng, desc in locationlist:
    folium.Marker((lat, lng), popup=desc).add_to(toronto_map)

In [112]:
toronto_map

Prepare the credentials for Foursquare

In [87]:
CLIENT_ID = 'OK1NXP2WQ3OTIZKJWNUDANHTK12T5EC21HW5IQMJ3DLQO3TR' # your Foursquare ID
CLIENT_SECRET = 'OMBS5ZEOVIRDCHFRO4DSPVH2U2MXDDA4PSLQDDAKKJRJBFVR' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
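Credentials written into a notebook are visible to anyone who reads it. One common alternative is to read them from environment variables. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are chosen for this example and are not defined by Foursquare.

```python
import os


def read_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if missing."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")          # assumed variable name
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")  # assumed variable name
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first"
        )
    return client_id, client_secret
```

With this in place, the cell above would become `CLIENT_ID, CLIENT_SECRET = read_credentials()`.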

Test the request using the first location in the list.

In [89]:
locationlist[0]

[43.806686299999996, -79.19435340000001, 'Malvern, Rouge']

In [96]:
LIMIT = 100
radius = 800 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    locationlist[0][0], 
    locationlist[0][1], 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=OK1NXP2WQ3OTIZKJWNUDANHTK12T5EC21HW5IQMJ3DLQO3TR&client_secret=OMBS5ZEOVIRDCHFRO4DSPVH2U2MXDDA4PSLQDDAKKJRJBFVR&v=20180605&ll=43.806686299999996,-79.19435340000001&radius=800&limit=100'
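The same URL can be assembled from a dictionary of parameters, which keeps each value next to its name and handles URL encoding. This is a sketch using the standard library; `build_explore_url` is a hypothetical helper, and the endpoint and parameter names are copied from the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude as one value
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes the comma in `ll`; standard URL decoding reverses it.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)
```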

Fetch the results from the Foursquare API and parse the JSON response.

In [100]:
results = requests.get(url).json()

In [123]:
def getNearbyVenues(location_list, radius=800):
    
    venues_list=[]
    for location in location_list:
        name = location[2]
        lat = location[0]
        long = location[1]
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            long, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            long, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
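The function above indexes straight into the response, so a location with no groups or a venue with no categories would raise an exception. The sketch below parses the same structure (`response` → `groups[0]` → `items` → `venue`) more defensively. `extract_venues` is a hypothetical helper, and the sample payload in the usage is made up.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []  # nothing returned for this location
    rows = []
    for item in groups[0].get("items", []):
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]  # tolerate a missing category
        rows.append((
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name"),
        ))
    return rows


# Made-up payload with one complete venue and one without categories.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe X", "location": {"lat": 43.8, "lng": -79.2},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unknown", "location": {"lat": 43.7, "lng": -79.1},
               "categories": []}},
]}]}}
print(extract_venues(sample))
```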

The function was first tried with the 500 m radius used in the Manhattan example, but that returned few venues, so the radius was increased to 800 m and the venues were fetched again.

In [135]:
venuelist = getNearbyVenues(locationlist)

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale, Willowdale East
York Mills West
Willowdale, Willowdale West
Parkwoods
Don Mills
Don Mills
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto, Broadview North (Old East York)
The Danforth West, 

In [136]:
venuelist.shape

(3960, 7)

In [137]:
venuelist.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,16,16,16,16,16,16
"Alderwood, Long Branch",13,13,13,13,13,13
"Bathurst Manor, Wilson Heights, Downsview North",26,26,26,26,26,26
Bayview Village,11,11,11,11,11,11
"Bedford Park, Lawrence Manor East",38,38,38,38,38,38
...,...,...,...,...,...,...
"Willowdale, Willowdale West",8,8,8,8,8,8
Woburn,6,6,6,6,6,6
Woodbine Heights,12,12,12,12,12,12
York Mills West,5,5,5,5,5,5


In [138]:
# one hot encoding
venuelist_onehot = pd.get_dummies(venuelist[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
venuelist_onehot['Neighborhood'] = venuelist['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [venuelist_onehot.columns[-1]] + list(venuelist_onehot.columns[:-1])
venuelist_onehot = venuelist_onehot[fixed_columns]

venuelist_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
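The header above starts with `Yoga Studio` rather than `Neighborhood`. One possible explanation, which is an assumption and not confirmed by the output, is that a venue category is itself named "Neighborhood": assigning the label column would then overwrite that dummy column in place, and the last column (moved to the front) would be a different one. The sketch below reorders by name and renames a clashing category first; the sample data is illustrative.

```python
import pandas as pd

# Illustrative data, including a venue whose category is literally "Neighborhood".
venues = pd.DataFrame({
    "Neighborhood": ["Agincourt", "Agincourt", "Woburn"],
    "Venue Category": ["Neighborhood", "Yoga Studio", "Coffee Shop"],
})

onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
# Rename a clashing category so the label column does not overwrite it.
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})
# Insert the label column explicitly at position 0 instead of relying on column order.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot.columns.tolist())
```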


In [140]:
venuelist_onehot.shape

(3960, 330)

In [142]:
venue_grouped = venuelist_onehot.groupby('Neighborhood').mean().reset_index()
venue_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.026316,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
94,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
95,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
96,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
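Taking the mean of one-hot columns per neighborhood gives the share of that neighborhood's venues in each category. For example, 0.026316 for "Bedford Park, Lawrence Manor East" is 1/38, matching its 38 venues. A small sketch with made-up data:

```python
import pandas as pd

# Made-up one-hot rows: neighborhood A has four venues, B has one.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Coffee Shop":  [1, 1, 0, 0, 0],
    "Park":         [0, 0, 1, 0, 1],
    "Bank":         [0, 0, 0, 1, 0],
})

# Mean of 0/1 indicators = fraction of venues in each category.
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```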


In [144]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [148]:
import numpy as np

num_top_venues = 10

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in range(1, num_top_venues + 1):
    columns.append(str(ind) + ' most common')

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = venue_grouped['Neighborhood']

for ind in np.arange(venue_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venue_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1 most common,2 most common,3 most common,4 most common,5 most common,6 most common,7 most common,8 most common,9 most common,10 most common
0,Agincourt,Chinese Restaurant,Badminton Court,Clothing Store,Supermarket,Seafood Restaurant,Skating Rink,Lounge,Latin American Restaurant,Malay Restaurant,Pool Hall
1,"Alderwood, Long Branch",Pizza Place,Convenience Store,Athletics & Sports,Discount Store,Donut Shop,Sandwich Place,Gym,Park,Pub,Coffee Shop
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Pizza Place,Coffee Shop,Park,Shopping Mall,Diner,Sandwich Place,Sushi Restaurant,Fried Chicken Joint,Supermarket
3,Bayview Village,Japanese Restaurant,Bank,Chinese Restaurant,Dog Run,Shopping Mall,Grocery Store,Skating Rink,Café,Park,Women's Store
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Italian Restaurant,Restaurant,Sandwich Place,Liquor Store,Baby Store,Bagel Shop,Bakery,Bank,Sushi Restaurant
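The ranking step can be illustrated on a small frequency table. The sketch below uses `Series.nlargest` in place of a full sort; `top_categories` is a hypothetical name and the data is made up. Categories tied at zero are still listed once the non-zero ones run out, so alphabetical runs in the later columns are consistent with neighborhoods that have only a few distinct categories.

```python
import pandas as pd


def top_categories(row, n):
    """Return the n category names with the highest frequency in `row`."""
    freqs = row.drop("Neighborhood").astype(float)
    return list(freqs.nlargest(n).index)


# Made-up frequency table in the same layout as venue_grouped.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Bank": [0.25, 0.0],
    "Coffee Shop": [0.5, 0.0],
    "Park": [0.1, 1.0],
})

top = {r["Neighborhood"]: top_categories(r, 2) for _, r in grouped.iterrows()}
print(top)
```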


## Now cluster the neighborhoods based on venue category frequency

In [151]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

venue_grouped_clustering = venue_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venue_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)



ValueError: cannot insert Cluster Labels, already exists
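This error typically appears when the cell is run a second time, because `DataFrame.insert` refuses to add a column that already exists. A re-runnable version drops the old column first. `set_cluster_labels` is a hypothetical helper, shown here as a sketch.

```python
import pandas as pd


def set_cluster_labels(frame, labels, column="Cluster Labels"):
    """Return a copy of `frame` with `labels` as the first column, replacing any old one."""
    if column in frame.columns:
        frame = frame.drop(columns=column)  # discard labels from a previous run
    result = frame.copy()
    result.insert(0, column, labels)
    return result
```

In the notebook this would be called as `neighborhoods_venues_sorted = set_cluster_labels(neighborhoods_venues_sorted, kmeans.labels_)`.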

In [177]:
# join the top-venue table (with its cluster labels) onto df, which holds latitude/longitude for each neighborhood
venue_merged = df.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# drop neighborhoods that have no venue data and therefore no cluster label
venue_merged = venue_merged.dropna()
venue_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood,Cluster Labels,1 most common,2 most common,3 most common,4 most common,5 most common,6 most common,7 most common,8 most common,9 most common,10 most common
0,M1B,43.806686,-79.194353,Scarborough,"Malvern, Rouge",0.0,Fast Food Restaurant,Coffee Shop,Trail,Hobby Shop,Paper / Office Supplies Store,Martial Arts Dojo,Business Service,Restaurant,Chinese Restaurant,Spa
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill, Port Union, Highland Creek",0.0,Breakfast Spot,Burger Joint,Bar,Italian Restaurant,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill",2.0,Pizza Place,Fast Food Restaurant,Grocery Store,Beer Store,Supermarket,Bank,Fried Chicken Joint,Restaurant,Greek Restaurant,Sports Bar
3,M1G,43.770992,-79.216917,Scarborough,Woburn,2.0,Coffee Shop,Park,Business Service,Construction & Landscaping,Women's Store,Electronics Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae,0.0,Coffee Shop,Indian Restaurant,Yoga Studio,Chinese Restaurant,Burger Joint,Flower Shop,Bus Line,Music Store,Fried Chicken Joint,Bank


Finally, visualise the clusters in Toronto:

In [181]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[43.7238, -79.3886], zoom_start=10.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(venue_merged['Latitude'], venue_merged['Longitude'], venue_merged['Neighborhood'], venue_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters


This seems like quite a strange result: three single-dot 'clusters', then two different red / blue clusters for the centre of the city.

In [187]:
compare = ['North Toronto West, Lawrence Park', 'Roselawn', 'Northwest, West Humber - Clairville', 'Humewood-Cedarvale']
new_df = venue_merged.loc[venue_merged['Neighborhood'].isin(compare)]

In [188]:
new_df

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood,Cluster Labels,1 most common,2 most common,3 most common,4 most common,5 most common,6 most common,7 most common,8 most common,9 most common,10 most common
46,M4R,43.715383,-79.405678,Central Toronto,"North Toronto West, Lawrence Park",0.0,Coffee Shop,Restaurant,Skating Rink,Sporting Goods Shop,Diner,Café,Italian Restaurant,Track,Spa,Electronics Store
63,M5N,43.711695,-79.416936,Central Toronto,Roselawn,4.0,Playground,Pet Store,Garden,Women's Store,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
73,M6C,43.693781,-79.428191,York,Humewood-Cedarvale,2.0,Pizza Place,Restaurant,Trail,Field,Sandwich Place,Korean Restaurant,Sushi Restaurant,Bagel Shop,Frozen Yogurt Shop,Deli / Bodega
102,M9W,43.706748,-79.594054,Etobicoke,"Northwest, West Humber - Clairville",1.0,Rental Car Location,Women's Store,Electronics Store,Discount Store,Distribution Center,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant


It is hard to see the exact reason for the strange clustering from these four rows alone.
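One way to start investigating is to compare cluster sizes with the number of venues behind each neighborhood. A plausible but unverified hypothesis is that the single-point clusters are neighborhoods with very few venues, whose frequency vectors are dominated by one or two categories. `cluster_summary` is a hypothetical helper, sketched under that assumption.

```python
import pandas as pd


def cluster_summary(merged, venue_counts):
    """Per cluster: number of neighborhoods and median number of venues."""
    with_counts = merged.join(venue_counts.rename("Venue Count"), on="Neighborhood")
    return with_counts.groupby("Cluster Labels").agg(
        neighborhoods=("Neighborhood", "size"),
        median_venues=("Venue Count", "median"),
    )
```

With the notebook's data it could be called as `cluster_summary(venue_merged, venuelist.groupby('Neighborhood').size())`.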