<h2>Segmenting and Clustering Neighborhoods in the city of Toronto, Canada</h2>

In [1]:
#import libraries
import requests
import pandas as pd
import numpy as np

Store the link in a variable, wikipedia_link. Then download the Wikipedia page by passing wikipedia_link to the get function of the requests library

In [2]:
# wikipedia link that contains the postal codes
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_wikipedia_page = requests.get(wikipedia_link)
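
A download can fail silently if the response status is never checked. The sketch below wraps the request in a small helper that sets a timeout and raises on HTTP errors. The name fetch_page and its get parameter are hypothetical (not part of the notebook); get exists only so the helper can be exercised without network access.

```python
def fetch_page(url, get=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors."""
    if get is None:
        # Imported lazily so a stand-in `get` can be used without the library.
        import requests
        get = requests.get
    response = get(url, timeout=timeout)
    # raise_for_status() raises an exception for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With the notebook's variable, a call might look like page = fetch_page(wikipedia_link).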

Use the response's text attribute to get the page's HTML as a string and assign it to page

In [3]:
page = raw_wikipedia_page.text

Extract the table from the article

In [4]:
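start_point = page.find('<table class="wikitable sortable">')
# search for the closing tag only after start_point; adding its length moves the slice end past the tag
end_point = page.find("</table>", start_point) + len("</table>")
table = page[start_point:end_point]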
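
Slicing by string position only works if the closing tag is searched for after the opening tag, and if a missing tag is handled. The following is a minimal sketch of that idea as a reusable function. The name extract_first_table and the sample HTML are made up for illustration, and the approach assumes the target table contains no nested tables.

```python
def extract_first_table(html, open_tag='<table class="wikitable sortable">'):
    """Return the first table that starts with `open_tag`, or None if absent."""
    close_tag = "</table>"
    start = html.find(open_tag)
    if start == -1:
        return None
    # Look for the closing tag only after the opening tag.
    end = html.find(close_tag, start)
    if end == -1:
        return None
    return html[start:end + len(close_tag)]


# Made-up sample: an unrelated table precedes the one we want.
sample = (
    '<table class="infobox"><tr><td>skip</td></tr></table>'
    '<table class="wikitable sortable"><tr><td>M3A</td></tr></table>'
)
extracted = extract_first_table(sample)
print(extracted)
```

On the sample, the unrelated first table is skipped because the search for the closing tag starts at the opening tag of the wanted table.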


Read the table into a dataframe

In [5]:
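from io import StringIO

# read_html returns a list of DataFrames, so take the first one;
# StringIO avoids the deprecation of passing literal HTML in newer pandas
df = pd.read_html(StringIO(table), header=0)[0]
df.head()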

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Replace "Not assigned" with NaN

In [6]:
df.replace("Not assigned", np.nan, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
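
To see how much data the placeholder was hiding, the missing values can be counted per column after the replacement. This is a small sketch on illustrative toy rows (not the full Wikipedia table); the variable names are made up.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Turn the placeholder string into a real missing value.
toy = toy.replace("Not assigned", np.nan)

# isna().sum() counts the missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```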


Drop all the rows that do not have an assigned borough

In [7]:
# simply drop whole row with NaN in "Borough" column
df.dropna(subset=["Borough"], axis=0, inplace=True)

# reset the index so it is contiguous again after dropping rows
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
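
The same drop-and-reindex step can be written without inplace=True, which makes it easy to check how many rows were removed. A sketch using the first four rows shown earlier; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": [np.nan, np.nan, "North York", "North York"],
    "Neighbourhood": [np.nan, np.nan, "Parkwoods", "Victoria Village"],
})

rows_before = len(toy)
# Drop rows without a Borough, then renumber the index from 0.
cleaned = toy.dropna(subset=["Borough"]).reset_index(drop=True)
rows_dropped = rows_before - len(cleaned)

print(rows_dropped)            # 2
print(cleaned.index.tolist())  # [0, 1]
```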


Replace NaN values in the Neighbourhood column with the name of the Borough

In [8]:
# fillna aligns on the index, so each missing Neighbourhood takes the Borough from its own row
df["Neighbourhood"] = df["Neighbourhood"].fillna(df["Borough"])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
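
The first five rows have no missing neighbourhoods, so the output above does not show the fill taking effect. A minimal sketch on two toy rows (illustrative values, not taken from the scraped table) makes the behaviour visible, using fillna with another column as the source of replacement values.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", np.nan],
})

# Missing neighbourhoods take the borough name from the same row.
toy["Neighbourhood"] = toy["Neighbourhood"].fillna(toy["Borough"])
filled = toy["Neighbourhood"].tolist()
print(filled)  # ['Parkwoods', "Queen's Park"]
```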


Merge rows that share the same Postcode and Borough, joining their neighbourhood names with commas

In [9]:
# Group by Postcode and Borough and join the Neighbourhoods into one comma-separated string
df = df.groupby(['Postcode', 'Borough'], sort=False, as_index=False).agg({'Neighbourhood': ', '.join})
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
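
The grouping step can be tried in isolation on three rows from the table above. This sketch assumes the frame has exactly the three columns shown; sort=False keeps the groups in order of first appearance and as_index=False keeps the keys as ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

merged = (
    toy.groupby(["Postcode", "Borough"], sort=False, as_index=False)
       .agg({"Neighbourhood": ", ".join})  # join the names within each group
)
print(merged)
```

The two M5A rows collapse into a single row whose Neighbourhood is "Harbourfront, Regent Park".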


Use the shape attribute to get the number of rows and columns in the dataframe

In [10]:
df.shape

(103, 3)
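
Because shape is a (rows, columns) tuple, it can be unpacked directly. After the grouping step, a further sanity check is that every postcode appears exactly once. A short sketch on toy rows taken from the output above; the variable names are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront, Regent Park"],
})

n_rows, n_cols = toy.shape                        # (rows, columns)
one_row_per_postcode = toy["Postcode"].is_unique  # True if no postcode repeats
print(n_rows, n_cols, one_row_per_postcode)
```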

<h3>Part II: Get coordinates for Postal Codes</h3>

In [11]:
# Read the Geospatial Coordinates into a new data frame
df_coords = pd.read_csv('Geospatial_Coordinates.csv')
df_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
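
Before merging, it can be worth confirming that the coordinate columns hold plausible values. The sketch below reads two of the rows above from an in-memory string instead of Geospatial_Coordinates.csv (an assumption made so the example is self-contained) and checks that latitude and longitude fall within their valid ranges.

```python
import io

import pandas as pd

csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

coords = pd.read_csv(io.StringIO(csv_text))

# Latitude must lie in [-90, 90] and longitude in [-180, 180].
valid = coords["Latitude"].between(-90, 90) & coords["Longitude"].between(-180, 180)
print(valid.all())
```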


Merge df_coords with the original dataframe based on Postal Code

In [12]:
# Rename 'Postal Code' in df_coords to 'Postcode' so both frames share the merge key
df_coords.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
combined_df = pd.merge(df,df_coords,on='Postcode')
combined_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
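
pd.merge defaults to an inner join, so any postcode missing from the coordinates file would be dropped without warning. One way to detect that, sketched here on toy frames, is a left join with indicator=True. The postcode "Z9Z" and the borough "Nowhere" are made up to force a mismatch.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "Z9Z"],  # "Z9Z" is a made-up code
    "Borough": ["North York", "North York", "Nowhere"],
})
coords = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a "_merge" column saying where each row came from.
checked = pd.merge(neighbourhoods, coords, on="Postcode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)  # ['Z9Z']
```

An empty unmatched list would indicate that every postcode received coordinates.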
