# Segmenting and Clustering Neighborhoods in Toronto

##### by Luca Mancino 

In this notebook, I will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike NYC, the neighborhood data is not readily available on the web. For this purpose, I will use a Wikipedia page that contains all the information needed to explore and cluster the neighborhoods in Toronto.

**INSTRUCTIONS**:
    1) run '**pip install wikipedia**' to use the Wikipedia library; 
    2) run '**pip install lxml**'. 
    
These steps are required to create a Pandas dataframe from a Wikipedia table. Once the packages have been installed, there is no need to run the install cells again. 

In [None]:
# The Wikipedia library must be installed before it can be imported.
pip install wikipedia

In [None]:
pip install lxml

In [1]:
# Set up the imports
import pandas as pd 
import wikipedia as wp
from bs4 import BeautifulSoup

In [2]:
# Download the Wikipedia page and parse its first table into a dataframe
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
df = pd.read_html(html, header = 0)[0]

# Drop the rows whose 'Borough' value is 'Not assigned'
df_1 = df[df.Borough != 'Not assigned']
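
The filtering step can be tried in isolation on a small frame. The sketch below is only an illustration: the `M1A` row is hypothetical, and the other two rows are borrowed from the output table further down. The `.copy()` call is an optional addition, not part of the original cell; it makes the filtered frame independent of the original so that later assignments do not trigger a `SettingWithCopyWarning`.

```python
import pandas as pd

# Toy data for illustration only; the 'M1A' row is hypothetical.
toy = pd.DataFrame({
    "Postal code": ["M1A", "M1B", "M1G"],
    "Borough": ["Not assigned", "Scarborough", "Scarborough"],
    "Neighborhood": ["Not assigned", "Malvern / Rouge", "Woburn"],
})

# Keep only the rows whose borough is known.
toy_clean = toy[toy["Borough"] != "Not assigned"].copy()

print(toy.shape, toy_clean.shape)
```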

In [3]:
# Shape of the raw 'df' dataset (before cleaning)
df.shape

(180, 3)

In [4]:
# Shape of the 'df_1' dataset, obtained by dropping the rows containing 'Not assigned' values in the 'Borough' column
df_1.shape

(103, 3)

Note that more than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma. I will call the new dataset df_2.
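
The row-combining idea can be sketched on a toy frame. The M5A rows mirror the example above (the borough name 'Downtown Toronto' is an assumption added for illustration), and the M1G row comes from the output table further down. This is a minimal sketch of one way to do the grouping, not necessarily the only one.

```python
import pandas as pd

# Toy rows for illustration; the borough for M5A is an assumption.
toy = pd.DataFrame({
    "Postal code": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Woburn"],
})

# Group by postal code and borough, then join each group's
# neighborhood names into one comma-separated string.
combined = (
    toy.groupby(["Postal code", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(combined)
```

Rows that share a postal code and borough collapse into a single row, so the toy frame goes from three rows to two.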

In [5]:
df_2 = df_1.groupby(['Postal code', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()

In [6]:
df_2.iterrows()

<generator object DataFrame.iterrows at 0x7f78a8625ca8>

In [7]:
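# If a neighborhood is 'Not assigned', fall back to the borough name.
# Writes to rows yielded by iterrows() may not reach the dataframe, so assign through .loc.
not_assigned = df_2['Neighborhood'] == 'Not assigned'
df_2.loc[not_assigned, 'Neighborhood'] = df_2.loc[not_assigned, 'Borough']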
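
Replacing a 'Not assigned' neighborhood with its borough name can also be written as a vectorized assignment. The sketch below uses a toy frame in which the `M9Z` row is hypothetical. It relies on a boolean mask with `.loc`, because assignments to the rows yielded by `iterrows()` are not guaranteed to be written back to the dataframe.

```python
import pandas as pd

# Toy data for illustration only; the 'M9Z' row is hypothetical.
toy = pd.DataFrame({
    "Postal code": ["M1G", "M9Z"],
    "Borough": ["Scarborough", "Etobicoke"],
    "Neighborhood": ["Woburn", "Not assigned"],
})

# Select the rows with no assigned neighborhood and copy the borough name over.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

print(toy)
```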

In [8]:
df_2

,Postal code,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,Kingsview Village / St. Phillips / Martin Grov...
101,M9V,Etobicoke,South Steeles / Silverstone / Humbergate / Jam...


In [9]:
# Shape of the 'df_2' dataset 
df_2.shape

(103, 3)