# Segmenting and Clustering Neighborhoods in Toronto
## Applied data capstone

First let's import the necessary packages.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import folium
from sklearn.cluster import KMeans

### 1 - Building the Dataframe

Next I download the HTML source of the Wikipedia page into a variable and parse it with BeautifulSoup.

In [3]:
wikipage = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [4]:
soup = BeautifulSoup(wikipage, 'lxml')

Here I use BeautifulSoup to select only the table with the class "wikitable sortable", i.e. the table that holds the data I need.

In [18]:
toronto_table = soup.find('table', {'class':'wikitable sortable'})
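
One detail worth keeping in mind: passing "wikitable sortable" as the class matches the full attribute string, so the lookup returns `None` if the table ever carries an additional class. A minimal sketch of a more tolerant lookup with a CSS selector follows. The HTML snippet and the `extra` class are made up for illustration, and `'html.parser'` is used here only to keep the example independent of `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the table carries an extra class.
html = """
<table class="wikitable sortable extra">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: no match because of the extra class.
exact = demo_soup.find('table', {'class': 'wikitable sortable'})

# CSS selector: matches any table that has both classes, in any order.
tolerant = demo_soup.select_one('table.wikitable.sortable')

# The rows can then be collected exactly as in the notebook.
demo_rows = tolerant.find_all('tr')
```

Checking the result for `None` before calling `find_all` gives a clearer error than the `AttributeError` that would otherwise appear.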

Here I use BeautifulSoup to collect every row of the table ("tr") into the "row" list.

In [6]:
row = toronto_table.find_all('tr')
len(row)

288

Now I loop over all rows except the header, read the three cells ("td") of each row and append each value to the correct list. The "get_text(strip=True)" call returns the text of a cell without the html tags and without surrounding whitespace. If the neighborhood is not assigned, I use the borough name as the neighborhood, with an "if" statement.

At the end I create the dataframe and discard every row whose borough is not assigned.

In [7]:
postal_code=[]
borough=[]
neighborhood=[]

# Skip the header row (index 0) and read the three cells of each data row.
for i in range(1,len(row)):
    column = row[i].find_all('td')
    postal_code.append(column[0].get_text(strip=True))
    borough.append(column[1].get_text(strip=True))
    # A missing neighborhood falls back to the borough of the same row.
    if column[2].get_text(strip=True) == 'Not assigned':
        neighborhood.append(borough[i-1])
    else:
        neighborhood.append(column[2].get_text(strip=True))

In [8]:
merge={'postal_code':postal_code, 'borough':borough, 'neighborhood':neighborhood}
df=pd.DataFrame(merge)
df=df[df.borough!='Not assigned']
df.reset_index(inplace=True, drop=True)
df.head(10)

| | postal_code | borough | neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Queen's Park | Queen's Park |
| 6 | M9A | Queen's Park | Queen's Park |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |
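
The two rules described above (fall back to the borough when the neighborhood is missing, drop the row when the borough is missing) can also be applied after the dataframe is built. The following is only a sketch on made-up rows, with `raw`, `clean` and `missing` as illustrative names that are not part of the notebook.

```python
import pandas as pd

# Made-up rows for illustration, mimicking the three scraped columns.
raw = pd.DataFrame({
    'postal_code': ['M1A', 'M3A', 'M7A'],
    'borough': ['Not assigned', 'North York', "Queen's Park"],
    'neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Drop the rows whose borough is not assigned.
clean = raw[raw['borough'] != 'Not assigned'].copy()

# Where the neighborhood is not assigned, fall back to the borough name.
missing = clean['neighborhood'] == 'Not assigned'
clean['neighborhood'] = clean['neighborhood'].mask(missing, clean['borough'])
clean = clean.reset_index(drop=True)
```

Doing the cleanup in pandas keeps the scraping loop free of special cases, at the cost of holding the unfiltered rows in memory for a moment.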


Now I build a dictionary with the "groups" attribute of the pandas "groupby" method. The dictionary maps each postal code to the index labels of the rows that share it.

Finally, I build one string per postal code that lists all its neighborhoods, using the "join" function with a comma as separator.

In [9]:
dictio=df.groupby(['postal_code']).groups
dictio

{'M1B': Int64Index([7, 8], dtype='int64'),
 'M1C': Int64Index([20, 21, 22], dtype='int64'),
 'M1E': Int64Index([32, 33, 34], dtype='int64'),
 'M1G': Int64Index([38], dtype='int64'),
 'M1H': Int64Index([42], dtype='int64'),
 'M1J': Int64Index([53], dtype='int64'),
 'M1K': Int64Index([65, 66, 67], dtype='int64'),
 'M1L': Int64Index([78, 79, 80], dtype='int64'),
 'M1M': Int64Index([92, 93, 94], dtype='int64'),
 'M1N': Int64Index([107, 108], dtype='int64'),
 'M1P': Int64Index([116, 117, 118], dtype='int64'),
 'M1R': Int64Index([126, 127], dtype='int64'),
 'M1S': Int64Index([140], dtype='int64'),
 'M1T': Int64Index([146, 147, 148], dtype='int64'),
 'M1V': Int64Index([154, 155, 156, 157], dtype='int64'),
 'M1W': Int64Index([181], dtype='int64'),
 'M1X': Int64Index([187], dtype='int64'),
 'M2H': Int64Index([43], dtype='int64'),
 'M2J': Int64Index([54, 55, 56], dtype='int64'),
 'M2K': Int64Index([68], dtype='int64'),
 'M2L': Int64Index([81, 82], dtype='int64'),
 'M2M': Int64Index([95, 96], dty

In [10]:
s=','
neigh=[]
# For each postal code, collect its neighborhoods and join them into one string.
for code,ind in dictio.items():
    a=[]
    for i in ind:
        a.append(df.neighborhood[i])
    b=s.join(a)
    neigh.append(b)
neigh

['Rouge,Malvern',
 'Highland Creek,Rouge Hill,Port Union',
 'Guildwood,Morningside,West Hill',
 'Woburn',
 'Cedarbrae',
 'Scarborough Village',
 'East Birchmount Park,Ionview,Kennedy Park',
 'Clairlea,Golden Mile,Oakridge',
 'Cliffcrest,Cliffside,Scarborough Village West',
 'Birch Cliff,Cliffside West',
 'Dorset Park,Scarborough Town Centre,Wexford Heights',
 'Maryvale,Wexford',
 'Agincourt',
 "Clarks Corners,Sullivan,Tam O'Shanter",
 "Agincourt North,L'Amoreaux East,Milliken,Steeles East",
 "L'Amoreaux West",
 'Upper Rouge',
 'Hillcrest Village',
 'Fairview,Henry Farm,Oriole',
 'Bayview Village',
 'Silver Hills,York Mills',
 'Newtonbrook,Willowdale',
 'Willowdale South',
 'York Mills West',
 'Willowdale West',
 'Parkwoods',
 'Don Mills North',
 'Flemingdon Park,Don Mills South',
 'Bathurst Manor,Downsview North,Wilson Heights',
 'Northwood Park,York University',
 'CFB Toronto,Downsview East',
 'Downsview West',
 'Downsview Central',
 'Downsview Northwest',
 'Victoria Village',
 'Woodb

Now I create the list of postal codes with their borough, discarding the duplicates.

In [11]:
df_unique=df[['postal_code','borough']].drop_duplicates()
df_unique.head(10)

| | postal_code | borough |
|---|---|---|
| 0 | M3A | North York |
| 1 | M4A | North York |
| 2 | M5A | Downtown Toronto |
| 3 | M6A | North York |
| 5 | M7A | Queen's Park |
| 6 | M9A | Queen's Park |
| 7 | M1B | Scarborough |
| 9 | M3B | North York |
| 10 | M4B | East York |
| 12 | M5B | Downtown Toronto |


Here I create the new dataframe with the grouped neighborhoods, and the postal code as index.

In [12]:
df_merge=pd.DataFrame(neigh, index=list(dictio.keys()))
df_merge.head(10)

| | 0 |
|---|---|
| M1B | Rouge,Malvern |
| M1C | Highland Creek,Rouge Hill,Port Union |
| M1E | Guildwood,Morningside,West Hill |
| M1G | Woburn |
| M1H | Cedarbrae |
| M1J | Scarborough Village |
| M1K | East Birchmount Park,Ionview,Kennedy Park |
| M1L | Clairlea,Golden Mile,Oakridge |
| M1M | Cliffcrest,Cliffside,Scarborough Village West |
| M1N | Birch Cliff,Cliffside West |


Here I add the borough column with the pandas "merge" method, using the postal code as the common key so that each borough stays aligned with its postal code. After that I rename the columns with the correct names.

In [13]:
df_final = df_unique.merge(df_merge, left_on='postal_code', right_index=True)
df_final.reset_index(inplace=True, drop=True)
final_columns={'postal_code':'PostalCode','borough':'Borough',0:'Neighborhood'}
df_final.rename(columns=final_columns, inplace=True)
df_final.shape

(103, 3)
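
The grouping, joining and merging steps above could probably be collapsed into a single "groupby" call. The sketch below uses a few rows copied from the dataframe shown earlier and assumes that each postal code belongs to exactly one borough, which is why both columns can serve as grouping keys. The names `rows` and `grouped` are illustrative.

```python
import pandas as pd

# A few rows copied from the dataframe shown earlier, one neighborhood per row.
rows = pd.DataFrame({
    'postal_code': ['M3A', 'M6A', 'M6A', 'M1B', 'M1B'],
    'borough': ['North York', 'North York', 'North York',
                'Scarborough', 'Scarborough'],
    'neighborhood': ['Parkwoods', 'Lawrence Heights', 'Lawrence Manor',
                     'Rouge', 'Malvern'],
})

# sort=False keeps the postal codes in order of first appearance.
grouped = (
    rows.groupby(['postal_code', 'borough'], sort=False)['neighborhood']
        .agg(','.join)          # one comma-separated string per group
        .reset_index()
        .rename(columns={'postal_code': 'PostalCode',
                         'borough': 'Borough',
                         'neighborhood': 'Neighborhood'})
)
```

This avoids the intermediate dictionary and the separate merge, but the step-by-step version in the notebook makes each intermediate result easier to inspect.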

### 2 - Adding the geographical coordinates

Here I load the geographical coordinates of the postal codes into a new dataframe and merge it with the previous one, as before. The coordinates come from the csv file linked in the description page of the project.

In [14]:
geo=pd.read_csv('Geospatial_Coordinates.csv')

In [15]:
df_geo = df_final.merge(geo, left_on='PostalCode', right_on='Postal Code')
df_geo.drop(columns=['Postal Code'], inplace=True)
df_geo.head(10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
| 5 | M9A | Queen's Park | Queen's Park | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 8 | M4B | East York | Woodbine Gardens,Parkview Hill | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Ryerson,Garden District | 43.657162 | -79.378937 |
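
The default merge is an inner join, so a postal code without coordinates would silently disappear from the result. A minimal sketch of how that could be checked with the `how` and `indicator` parameters of "merge" follows. The code `M0X` is made up to stand in for an unmatched postal code, and `codes`, `coords`, `checked` and `unmatched` are illustrative names.

```python
import pandas as pd

# Illustrative data: 'M0X' is a made-up code with no coordinates.
codes = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X']})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every postal code; indicator=True adds a '_merge' column
# that records whether a match was found in the coordinates table.
checked = codes.merge(coords, how='left', left_on='PostalCode',
                      right_on='Postal Code', indicator=True)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
```

An empty `unmatched` list would confirm that the inner join used in the notebook loses no rows.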


### 3 - Clustering the postal codes

Finally I create a map of all the postal codes with the "folium" package, centering it on Toronto and adding one marker per postal code.

In [16]:
lat_toronto=43.716589
lon_toronto=-79.340686
map_neigh = folium.Map(location=[lat_toronto, lon_toronto], zoom_start=11)

In [17]:
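# radius, fill and color are circle options, so CircleMarker is used here.
for lat, lng, label in zip(df_geo['Latitude'], df_geo['Longitude'], df_geo['Neighborhood']):
    folium.CircleMarker(
        [lat, lng],
        popup=label,
        radius=5,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=0.6
        ).add_to(map_neigh)

# Display the map in the notebook.
map_neigh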
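
The notebook imports "KMeans" but this section stops at the map, so the following is only a sketch of how the clustering step might look. It assumes that the clusters are computed from latitude and longitude alone and that `k = 2`; neither choice comes from the notebook. The coordinates are copied from the first rows of the merged table above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Coordinates copied from the first rows of the merged table above.
points = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M9A'],
    'Latitude': [43.753259, 43.725882, 43.65426,
                 43.718518, 43.662301, 43.667856],
    'Longitude': [-79.329656, -79.315572, -79.360636,
                  -79.464763, -79.389494, -79.532242],
})

# k is an assumption made for this sketch, not a value from the notebook.
k = 2
model = KMeans(n_clusters=k, n_init=10, random_state=0)

# Assign each postal code to a cluster based on its coordinates only.
points['Cluster'] = model.fit_predict(points[['Latitude', 'Longitude']])
```

The resulting "Cluster" column could then drive the marker color in the folium loop, for instance by indexing into a list of color names.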
