# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Getting the data

In [1]:
import pandas as pd

We don't need to call any parsing library directly for this job. `pd.read_html()` uses lxml or Beautiful Soup behind the scenes (one of them still has to be installed), so as long as the page we're scraping uses HTML `<table>` elements, which Wikipedia thankfully does, we don't need to write any parsing code of our own. The function returns a list with one DataFrame per table on the page, which is why we take element `[0]`.

In [2]:
postal_codes = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=0)[0]
postal_codes

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


Let's get rid of all those unassigned values...

In [3]:
postal_codes = postal_codes[postal_codes['Borough'] != 'Not assigned']

# Replace an unassigned Neighbourhood value with the row's Borough.
# This could just be a lambda, really, but I think it's clearer this way.
def replace_unassigned_neighbourhoods(row):
    if row['Neighbourhood'] == 'Not assigned':
        row['Neighbourhood'] = row['Borough']
    return row

# apply() returns a new DataFrame, so the result has to be assigned back
postal_codes = postal_codes.apply(replace_unassigned_neighbourhoods, axis=1)
postal_codes

,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
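
As a possible alternative to the row-wise function above, the same replacement can be sketched without `apply()` by using `Series.mask()`. The `sample` DataFrame here is a small hypothetical stand-in built from a few rows of the table, not the full scraped data.

```python
import pandas as pd

# Hypothetical sample with the same columns as the scraped table
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Harbourfront', 'Not assigned', 'Islington Avenue'],
})

# Boolean Series marking the rows whose Neighbourhood is unassigned
unassigned = sample['Neighbourhood'] == 'Not assigned'

# mask() replaces values where the condition is True with the Borough value
sample['Neighbourhood'] = sample['Neighbourhood'].mask(unassigned, sample['Borough'])
print(sample)
```

This operates on whole columns at once, which tends to be faster than `apply(axis=1)` on large tables, though for a table this small the difference is unlikely to matter.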


...And now we can group and aggregate our values using the `groupby()` and `agg()` methods. I used `value_counts()` as a quick hack to get only the most common value for "Borough", which should also be the only value, assuming each postal code belongs to one borough.

In [4]:
def comma_list(x):
    return ', '.join(x) # Function to make comma-separated lists

postal_codes_grouped = postal_codes.groupby(by='Postcode').agg({
    'Neighbourhood': comma_list,
    'Borough': lambda x: x.value_counts().index[0] # Assuming each postal code has one Borough, picking the most common value means picking the only value.
})
postal_codes_grouped

Postcode,Neighbourhood,Borough
M1B,"Rouge, Malvern",Scarborough
M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough
M1E,"Guildwood, Morningside, West Hill",Scarborough
M1G,Woburn,Scarborough
M1H,Cedarbrae,Scarborough
M1J,Scarborough Village,Scarborough
M1K,"East Birchmount Park, Ionview, Kennedy Park",Scarborough
M1L,"Clairlea, Golden Mile, Oakridge",Scarborough
M1M,"Cliffcrest, Cliffside, Scarborough Village West",Scarborough
M1N,"Birch Cliff, Cliffside West",Scarborough
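
The aggregation above relies on each postal code having exactly one borough. A minimal sketch of how that assumption could be checked before grouping is shown here, using `nunique()`; once it holds, the built-in `'first'` aggregation gives the same result as the `value_counts()` hack. The `sample` DataFrame is a hypothetical stand-in built from a few rows of the table.

```python
import pandas as pd

# Hypothetical sample with the same columns as the cleaned table
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Count the distinct boroughs per postal code; every count should be 1
boroughs_per_code = sample.groupby('Postcode')['Borough'].nunique()
assert (boroughs_per_code == 1).all(), "some postal codes span several boroughs"

# With the assumption verified, 'first' picks the single borough of each group
grouped = sample.groupby('Postcode').agg({
    'Neighbourhood': ', '.join,
    'Borough': 'first',
})
print(grouped)
```

If the assertion ever fails, the offending postal codes can be listed with `boroughs_per_code[boroughs_per_code > 1]`.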


In [5]:
postal_codes_grouped.shape

(103, 2)