# Segmenting and Clustering Neighbourhoods in Toronto
### First, let's import the pandas library and read the HTML tables from the Wikipedia URL.

In [38]:
import pandas as pd

# read_html returns a list of DataFrames, one per table found on the page;
# header=0 uses the first row of each table as the column names.
dfs = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969', header=0)

### read_html gives us a list of tables, so we select the first one with [0] and call it 'df'.

In [39]:
df = dfs[0]
df

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


As we can see above, there are a lot of 'Not assigned' values. We need to fix that!
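To see how many placeholder values there are before removing anything, we can count them per column. This is a minimal sketch on a small made-up table (the postal codes and borough names are hypothetical); only the column names are taken from the dataframe above.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table, using the same column names.
toy = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A", "X4A"],
    "Borough": ["Not assigned", "Not assigned", "Borough A", "Borough B"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Area 1", "Area 2"],
})

# Comparing to the placeholder gives a table of booleans;
# summing it counts the True values in each column.
not_assigned_counts = (toy[["Borough", "Neighbourhood"]] == "Not assigned").sum()
print(not_assigned_counts)
```

On the real data, the same expression with `df` in place of `toy` would report the counts for the scraped table.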

### To fix this, we will create a filtered dataframe that drops every row whose 'Borough' is 'Not assigned'.

In [40]:
df_filtered = df[df['Borough'] != 'Not assigned'] 
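The filter above works through a boolean mask: the comparison produces one True/False value per row, and indexing with it keeps only the True rows. The sketch below shows this on a made-up table (hypothetical codes and names). It also adds an optional `reset_index(drop=True)`, which the notebook itself does not use, for cases where a fresh 0-based index is preferred.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook's dataframe.
toy = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Borough": ["Not assigned", "Borough A", "Borough B"],
    "Neighbourhood": ["Not assigned", "Area 1", "Area 2"],
})

# One boolean per row: True where the borough is a real value.
mask = toy["Borough"] != "Not assigned"

# Keep the True rows. Without reset_index the surviving rows keep
# their original labels (1 and 2 here); drop=True discards them.
toy_filtered = toy[mask].reset_index(drop=True)
print(toy_filtered)
```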

In [41]:
df_filtered

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


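The neighbourhoods are already grouped by Postal Code: each code appears in a single row, with its neighbourhoods listed together and separated by commas. In the rows shown, there aren't any 'Not assigned' values left in the 'Neighbourhood' column either.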
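Since the output above is truncated, it may be worth checking the 'Neighbourhood' column directly rather than by eye. The sketch below does that on made-up rows (hypothetical codes and names). The fallback of copying the borough name into an unassigned neighbourhood is an assumption for illustration, not something this notebook needs or does.

```python
import pandas as pd

# Hypothetical rows: one neighbourhood is still a placeholder.
sample = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
    "Neighbourhood": ["Area 1", "Not assigned", "Area 3, Area 4"],
})

# Rows whose neighbourhood is still the placeholder.
missing = sample["Neighbourhood"] == "Not assigned"
n_missing = int(missing.sum())
print(f"{n_missing} row(s) with an unassigned neighbourhood")

# Assumed fallback: reuse the borough name for those rows.
sample.loc[missing, "Neighbourhood"] = sample.loc[missing, "Borough"]
print(sample)
```

If the count is zero on the real `df_filtered`, no fallback is needed.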

### Now we can finally check the dataframe shape by using the .shape attribute!

In [42]:
df_filtered.shape

(103, 3)

So the new dataframe has 103 rows and 3 columns!

### Let's import the Geospatial Coordinates data.
We also need to check whether its shape matches the number of rows and columns in our filtered dataframe.

In [43]:
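import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

# Helper from the auto-generated IBM Watson Studio snippet. The hidden cell
# typically attaches it to the storage object so pandas accepts it as file-like.
def __iter__(self): return 0

# @hidden_cell
# The hidden cell (credentials removed) is expected to define `body`,
# a file-like object for Geospatial_Coordinates.csv.

df_data_1 = pd.read_csv(body)
df_data_1.shape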

(103, 3)

As shown by the .shape attribute, Geospatial_Coordinates.csv has 103 rows and 3 columns too! That matches the shape of the filtered dataframe.
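Matching shapes are a good sign, but two tables can have the same number of rows and still contain different postal codes. A stricter check compares the key values themselves. This is a sketch on two made-up tables (all codes and coordinates are hypothetical) that deliberately disagree on one code.

```python
import pandas as pd

# Hypothetical tables with the same number of rows but different keys.
left = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
})
right = pd.DataFrame({
    "Postal Code": ["X3A", "X1A", "X4A"],
    "Latitude": [1.0, 2.0, 3.0],
    "Longitude": [-1.0, -2.0, -3.0],
})

same_row_count = left.shape[0] == right.shape[0]

# Set differences reveal codes present in only one of the tables.
left_codes = set(left["Postal Code"])
right_codes = set(right["Postal Code"])
only_left = sorted(left_codes - right_codes)
only_right = sorted(right_codes - left_codes)

print(same_row_count, only_left, only_right)
```

On the real data, both difference lists should come back empty if every postal code in `df_filtered` has coordinates in `df_data_1`.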

### We can proceed to merge the two dataframes.
Let's use the .merge() method!

In [44]:
new_df = df_filtered.merge(df_data_1, on='Postal Code')

In [45]:
new_df.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


There it is! The merged table is ready for use!
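As an optional sanity check on a merge like this one, pandas can verify that each key appears once on both sides and report which rows found a match. The sketch below uses made-up tables (hypothetical codes and coordinates). The `how='left'`, `validate` and `indicator` arguments are additions for illustration and are not part of the merge used in this notebook.

```python
import pandas as pd

# Hypothetical inputs standing in for the two dataframes.
boroughs = pd.DataFrame({
    "Postal Code": ["X1A", "X2A"],
    "Borough": ["Borough A", "Borough B"],
})
coords = pd.DataFrame({
    "Postal Code": ["X2A", "X1A"],
    "Latitude": [1.0, 2.0],
    "Longitude": [-1.0, -2.0],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
merged = boroughs.merge(
    coords,
    on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows without coordinates would be marked 'left_only'.
unmatched = merged[merged["_merge"] != "both"]
print(merged)
print(len(unmatched), "unmatched row(s)")
```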