In [None]:
$ apt-get install python3-bs4  # for Python 3

In [None]:
$ apt-get install python3-lxml  # for Python 3

# Scraping Toronto Neighborhoods

To build our table, we first import the libraries we need, including those required later in the analysis.

In [2]:
#Libraries

import pandas as pd
import numpy as np
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
#import folium # map rendering library
#from bs4 import BeautifulSoup #scrape data

We then scrape the postal code table from Wikipedia, which contains all the features we need.

In [3]:
df=pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
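
Indexing the result with `[0]` relies on the postal code table being the first table on the page. As a hedged alternative, `pd.read_html` accepts a `match` argument that keeps only the tables whose text contains a given string. The sketch below runs on a small inline HTML snippet made up for illustration rather than on the live page, and assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature page: an unrelated table followed by the one we want.
html = """
<table>
  <tr><th>Legend</th></tr>
  <tr><td>Not in use</td></tr>
</table>
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# match keeps only the tables whose text contains "Postal Code"
tables = pd.read_html(StringIO(html), match="Postal Code")
sample = tables[0]
print(sample)
```

Selecting by content rather than by position is less likely to break if the page layout changes.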


We count the 'Not assigned' values in the Neighbourhood and Borough columns, then drop the rows without an assigned Borough, as requested in the assignment.

In [4]:
pd.Series(df['Neighbourhood']=='Not assigned').value_counts()

False    103
True      77
Name: Neighbourhood, dtype: int64

In [5]:
pd.Series(df['Borough']=='Not assigned').value_counts()

False    103
True      77
Name: Borough, dtype: int64

In [6]:
df2 = df[df['Borough']!= 'Not assigned']
df2= df2.reset_index(drop=True)
df2

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
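
The filtering step can be reproduced on a few rows without scraping anything. This is a minimal sketch on a toy DataFrame (the name `toy` is illustrative, the values are copied from the output above): a boolean mask selects the rows with an assigned Borough, and `reset_index(drop=True)` renumbers them from 0.

```python
import pandas as pd

# Toy rows mimicking the first lines of the scraped table
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True where a borough is present
has_borough = toy["Borough"] != "Not assigned"

# Keep the matching rows and renumber the index from 0
kept = toy[has_borough].reset_index(drop=True)
print(kept)
```

Without `drop=True`, the old index would be kept as an extra column.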


Finally, we verify that the table has no duplicate Postal Codes (the assignment requires neighborhoods that share a postal code to be joined in a single row) and check for any remaining unassigned Neighbourhood values.

In [7]:
duplicated=df2[['Postal Code']].duplicated()
pd.Series(duplicated).value_counts()

False    103
dtype: int64

In [8]:
pd.Series(df2['Neighbourhood']=='Not assigned').value_counts()

False    103
Name: Neighbourhood, dtype: int64
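
The scraped table already has one row per postal code, so no joining is needed here. If duplicates had been present, one possible way to combine them is sketched below on hypothetical rows: group by postal code and borough, then join the neighbourhood names with a comma.

```python
import pandas as pd

# Hypothetical rows in which one postal code appears twice
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Count repeated postal codes (0 means every code is unique)
n_dupes = toy["Postal Code"].duplicated().sum()

# Collapse repeats: one row per code, neighbourhood names joined by ", "
merged = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(n_dupes)
print(merged)
```

Passing `sort=False` keeps the groups in order of first appearance instead of sorting them by key.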

We check the shape of our table and notice that it is much bigger than the one shown in the assignment: it includes every assigned postal code on the page, while the assignment shows only a subset of them.

In [9]:
print('DF shape is:',df.shape)
print('DF2 shape is:',df2.shape)

DF shape is: (180, 3)
DF2 shape is: (103, 3)


We then filter the table for the required postal codes, manually recreating the list shown in the assignment.

In [25]:
Toronto_Codes = pd.DataFrame(['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A'], columns=['Postal Code'])

In [26]:
Toronto_neigh = pd.merge(Toronto_Codes, df2, on='Postal Code', how='inner')
Toronto_neigh

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."
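
An inner merge silently drops any code that is missing from the table, so a typo in the hand-written list would go unnoticed. As a hedged sketch on toy data (the code `M0X` is made up, and `wanted` and `table` stand in for `Toronto_Codes` and `df2`), a left merge with `indicator=True` shows which codes failed to match.

```python
import pandas as pd

# Codes we want to keep; "M0X" is made up and will not match
wanted = pd.DataFrame({"Postal Code": ["M5G", "M2H", "M0X"]})

# Toy stand-in for df2
table = pd.DataFrame({
    "Postal Code": ["M2H", "M5G", "M4B"],
    "Borough": ["North York", "Downtown Toronto", "East York"],
    "Neighbourhood": ["Hillcrest Village", "Central Bay Street",
                      "Parkview Hill, Woodbine Gardens"],
})

# A left merge keeps every wanted code, in the order of `wanted`;
# indicator=True adds a "_merge" column saying where each row was found
checked = pd.merge(wanted, table, on="Postal Code", how="left", indicator=True)

# Codes that were not found in the table
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every requested code exists in the table.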


We check the shape of the table again and verify that it matches the requirements.

In [28]:
print('Toronto_neigh shape is:',Toronto_neigh.shape)

Toronto_neigh shape is: (12, 3)
