# Segmenting and Clustering Neighborhoods in Toronto

## Scraping the Wikipedia page and building the postcode dataframe
Let us scrape the Wikipedia page into a list of dataframes and pick the one we need for further processing. Note that the relevant dataframe is the first element of the list, and its header is in the first row (hence `header=0`).

In [1]:
# Import libraries and objects
import pandas as pd

# Scrape the Wikipedia page into a list of dataframes
pc_dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)
# The relevant dataframe is the first element of the list
pc_df = pc_dfs[0]

# Display the first few rows of the dataframe
pc_df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


There are a lot of rows where the _Borough_ column has the value _Not assigned_. Let us drop those rows and reset the index.

In [2]:
# Drop all rows where 'Borough' is 'Not assigned' and reset the index
pc_df = pc_df[pc_df['Borough'] != 'Not assigned'].reset_index(drop=True)

# Display the first few rows of the dataframe
pc_df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |

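To see the effect of the filter and the index reset in isolation, here is a minimal sketch on a toy dataframe. The three rows are taken from the table above; the name `toy` is only illustrative.

```python
import pandas as pd

# Toy data shaped like the scraped table (illustrative subset)
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront'],
})

# Keep only rows with an assigned borough, then renumber the index from 0
kept = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

print(kept['Postcode'].tolist())  # ['M3A', 'M5A']
print(kept.index.tolist())        # [0, 1]
```

Without `reset_index(drop=True)` the surviving rows would keep their old labels (1 and 2 here), which is why the reset is part of the same step.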

There are also rows where the _Neighbourhood_ column has the value _Not assigned_. Let us replace that value with the value of the _Borough_ column in the same row. Note that the replacement must take place before the neighbourhoods are concatenated, because afterwards the placeholder would be embedded in a joined string and harder to replace.

In [3]:
# Use 'Borough' as 'Neighbourhood' when 'Neighbourhood' is 'Not assigned'
# (assigning the result back avoids an in-place update on a column selection)
pc_df['Neighbourhood'] = pc_df['Neighbourhood'].mask(pc_df['Neighbourhood'] == 'Not assigned', pc_df['Borough'])

# Display the first few rows of the dataframe
pc_df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |

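The first rows shown above contain no _Not assigned_ neighbourhood, so the replacement is not visible there. A minimal sketch on a toy dataframe makes it visible; the second row (postcode `M9Z`, borough `Example Borough`) is invented purely for illustration.

```python
import pandas as pd

# Toy rows; the second one is made up to trigger the replacement
toy = pd.DataFrame({
    'Postcode': ['M3A', 'M9Z'],
    'Borough': ['North York', 'Example Borough'],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighbourhood is missing, take the borough name of the same row
is_missing = toy['Neighbourhood'] == 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].mask(is_missing, toy['Borough'])

print(toy['Neighbourhood'].tolist())  # ['Parkwoods', 'Example Borough']
```

`Series.mask` replaces values where the condition is true and leaves the others untouched, so rows with a real neighbourhood name are not affected.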

Many rows share the same _Postcode_ and _Borough_ values. Let us group the data by those two columns and concatenate the _Neighbourhood_ values, using a comma followed by a space as the separator. Note that the index is reset at the end.

In [4]:
# Group by 'Postcode' and 'Borough', joining the 'Neighbourhood' values with ', '
pc_df = pc_df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()

# Display the first few rows of the dataframe
pc_df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |

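The grouping step can be followed on a small example. The sketch below uses three rows consistent with the output above (two neighbourhoods for `M1B`, one for `M1G`) and shows how they collapse into one row per postcode.

```python
import pandas as pd

# Three ungrouped rows: two share the same postcode and borough
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (Postcode, Borough); neighbourhood names joined with ', '
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

print(grouped['Neighbourhood'].tolist())  # ['Rouge, Malvern', 'Woburn']
print(grouped.columns.tolist())           # ['Postcode', 'Borough', 'Neighbourhood']
```

The final `reset_index()` turns the group keys back into ordinary columns; without it, _Postcode_ and _Borough_ would stay in the index of the resulting series.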

Let us check the shape of the dataframe we have just built.

In [5]:
# Display the shape of the dataframe
print('Shape:', pc_df.shape)

Shape: (103, 3)

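The shape alone does not confirm that the cleaning steps did what they should. As a rough sketch, a few extra checks could be bundled into a helper; `check_postcode_frame` is a hypothetical name, not part of the original notebook, and it is shown here on a two-row toy frame.

```python
import pandas as pd

def check_postcode_frame(df):
    """Return simple integrity checks for the postcode dataframe (hypothetical helper)."""
    placeholders = (df[['Borough', 'Neighbourhood']] == 'Not assigned').any().any()
    return {
        # After grouping, each postcode should appear exactly once
        'unique_postcodes': bool(df['Postcode'].is_unique),
        # No placeholder values should survive the cleaning steps
        'no_unassigned': not bool(placeholders),
        'shape': df.shape,
    }

# Toy frame with the same columns as the grouped dataframe
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Woburn'],
})

report = check_postcode_frame(toy)
print(report)
```

Calling the helper on `pc_df` would be the natural use in the notebook.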

## Retrieving location coordinates
Let us retrieve the location coordinates for each postal code and add them to the dataframe. The coordinates are obtained through Google's Geocoding API.

In [6]:
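# Import libraries and objects
import os
import geocoder

# Google API key; the key is restricted to the Geocoding API and to specific IP addresses.
# It is read from the environment instead of being hardcoded in the notebook;
# the variable name GOOGLE_API_KEY is an assumption.
API_KEY = os.environ['GOOGLE_API_KEY']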

# Retrieve location coordinates using Google's Geocoding API
lats = []
lngs = []
for postcode in pc_df['Postcode']:
    ll = None
    # A request may come back without coordinates, so repeat it until it succeeds
    while not ll:
        g = geocoder.google('{}, Toronto, Ontario'.format(postcode), key=API_KEY)
        ll = g.latlng
    # Append in row order so the lists line up with the dataframe rows
    lats.append(ll[0])
    lngs.append(ll[1])

# Add location coordinates to the dataframe
pc_df['Latitude'] = lats
pc_df['Longitude'] = lngs

# Display the first few rows of the dataframe
pc_df.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |

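The retry loop in the cell above has no upper bound, so a postcode that never resolves would keep it running forever. One possible variant, sketched here under assumptions, caps the number of attempts. The helper `geocode_with_retry` and the stand-in `flaky_lookup` are hypothetical names; the stand-in replaces the real API call so the example runs without a key or network access.

```python
import time

def geocode_with_retry(lookup, query, max_tries=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or the tries run out.

    lookup is any callable returning [lat, lng] on success and None or []
    on failure. Returns (lat, lng), or (None, None) if every try fails.
    """
    for _ in range(max_tries):
        ll = lookup(query)
        if ll:
            return ll[0], ll[1]
        time.sleep(delay)  # optional pause between attempts
    return None, None

# Stand-in for a real geocoder call: fails twice, then succeeds
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.806686, -79.194353]

lat, lng = geocode_with_retry(flaky_lookup, 'M1B, Toronto, Ontario')
print(lat, lng, calls['n'])  # 43.806686 -79.194353 3
```

In the notebook, the stand-in could be swapped for `lambda q: geocoder.google(q, key=API_KEY).latlng`. Postcodes that come back as `(None, None)` could then be inspected separately.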