# Segmenting and Clustering Neighborhoods in Toronto 

### Coursera Capstone Project 2

## 1. Loading and Extracting Data

Importing necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Fetching the HTML document from the Wikipedia page that lists the neighborhoods of Toronto, Canada:

In [2]:
html_doc = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

Using the BeautifulSoup library to parse the HTML document:

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')

In HTML, tabular data and its attributes are stored inside a "table" element, so we look for the table that carries the classes used on the Wikipedia page.

In [4]:
# Finding the table containing our data 
table = soup.find('table',{'class' : 'wikitable sortable'})
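One caveat when passing the class as a plain string: BeautifulSoup compares it with the exact value of the `class` attribute, so a table that carries an additional class (or lists the classes in another order) is not found. The sketch below illustrates this on a made-up HTML snippet, which is an assumption for demonstration and not the actual Wikipedia markup, and shows a CSS selector as one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the two classes used in the notebook plus one extra class
sample_html = """
<table class="wikitable sortable extra-class">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match on the class attribute: fails because of the extra class
exact_match = sample_soup.find('table', {'class': 'wikitable sortable'})

# CSS selector: matches any table that has both classes, in any order
selector_match = sample_soup.select_one('table.wikitable.sortable')

print(exact_match is None)         # True
print(selector_match is not None)  # True
```

If `table` ever comes back as `None` in the cell above, a selector of this kind may be worth trying before changing anything else.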

(Note: I've removed the print commands from the previous two cells because their outputs are extremely long.)  
Extracting the rows of the HTML table (stored in 'tr' elements) and collecting the text of each row's cells in a list called rows_list:

In [5]:
rows = table.find_all('tr')
rows_list = []

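for tr in rows:
    # text of every data cell ('td') in this row, with surrounding whitespace removed
    cells = [td.text.strip() for td in tr.find_all('td')]
    if cells:  # skip rows without 'td' cells, such as a header row
        rows_list.append(cells)
rows_list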

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M9B', 
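To make the row-extraction loop easier to follow, here is a minimal sketch that runs the same idea on a tiny inline table. The HTML is invented for illustration. It shows why the emptiness check matters: a header row built from 'th' cells yields an empty list of 'td' cells and is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table: one header row and two data rows
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
  <tr><td>M4A</td><td>North York</td><td> Victoria Village </td></tr>
</table>
"""

sample_rows = []
for tr in BeautifulSoup(sample_html, 'html.parser').find_all('tr'):
    # strip() removes the whitespace around each cell's text
    cells = [td.text.strip() for td in tr.find_all('td')]
    if cells:  # the header row has no 'td' cells, so its list is empty
        sample_rows.append(cells)

print(sample_rows)
# [['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village']]
```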

In [6]:
#converting the list to dataframe
df = pd.DataFrame(rows_list, columns = ['PostalCode', 'Borough', 'Neighborhood'])
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


Next, we need to deal with rows that do not have a neighborhood assigned. Listing them first:

In [7]:
df.loc[df['Neighborhood']=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
13,M2B,Not assigned,Not assigned
20,M7B,Not assigned,Not assigned
21,M8B,Not assigned,Not assigned
30,M2C,Not assigned,Not assigned
36,M7C,Not assigned,Not assigned
37,M8C,Not assigned,Not assigned


Some rows have a borough but no neighborhood assigned. In these rows the neighborhood should take the same name as the borough. Finding rows with an assigned borough but no assigned neighborhood:

In [8]:
df.loc[(df['Borough']!='Not assigned') & (df['Neighborhood']=='Not assigned')]

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Not assigned


Since there is only one such row (Queen's Park, at index 8), we can set its neighborhood to "Queen's Park" directly:

In [9]:
#assigning through .loc writes to df itself rather than to a temporary copy of the row
df.loc[8, 'Neighborhood'] = "Queen's Park"
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
9,M8A,Not assigned,Not assigned
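Looking up the row by its index works here because only one row qualifies. If more rows were affected, a boolean mask could fill them all at once. A minimal sketch on a small made-up frame (the `sample` data is an assumption for illustration, not the scraped table):

```python
import pandas as pd

# Hypothetical three-row frame with the same columns as the notebook's df
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M7A', 'M8A'],
    'Borough': ['North York', "Queen's Park", 'Not assigned'],
    'Neighborhood': ['Parkwoods', 'Not assigned', 'Not assigned'],
})

# Rows with a named borough but no named neighborhood
mask = (sample['Borough'] != 'Not assigned') & (sample['Neighborhood'] == 'Not assigned')

# Copy the borough name into the neighborhood column for those rows only
sample.loc[mask, 'Neighborhood'] = sample.loc[mask, 'Borough']

print(sample['Neighborhood'].tolist())
# ['Parkwoods', "Queen's Park", 'Not assigned']
```

The last row is left untouched because its borough is not assigned either.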


Now we can drop all the rows whose borough is still not assigned:

In [10]:
df.replace("Not assigned", np.nan, inplace=True) #converts all 'Not assigned' values to NaN values
df.dropna(subset=['Borough'], axis=0, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
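The two steps in this cell can be seen in isolation on a small made-up frame (the `sample` data below is an assumption for illustration). Note that `dropna` keeps the original index labels, which is why the output above starts at index 2 rather than 0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the first row has no borough assigned
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', "Queen's Park"],
})

# Turn the placeholder string into a real missing value,
# then drop the rows that lack a borough
cleaned = sample.replace('Not assigned', np.nan).dropna(subset=['Borough'])

print(cleaned['PostalCode'].tolist())  # ['M3A', 'M7A']
print(cleaned.index.tolist())          # [1, 2]
```

This variant returns a new frame instead of using `inplace=True`, which leaves `sample` unchanged.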


Some postal codes cover more than one neighborhood. Grouping the neighborhoods that share a postal code into a single comma-separated entry:

In [11]:
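#joins the neighborhood names of one postal code into a single comma-separated string
join_names = lambda names: ", ".join(names)
#'max' picks one borough name per postal code (assumes all rows of a postal code share the same borough)
df = df.groupby(by='PostalCode').agg({'Borough': 'max',
                                       'Neighborhood' : join_names}).reset_index()
df = df[['PostalCode', 'Borough', 'Neighborhood']] #just rearranging the columns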
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
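The grouping step can be checked on a small made-up frame. This sketch uses `'first'` for the borough instead of the `'max'` in the cell above. That is a substitution made here for readability, and it gives the same result whenever all rows of a postal code share one borough. The `sample` data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical frame: postal code M5A appears twice with different neighborhoods
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

grouped = (
    sample.groupby('PostalCode')
          # keep one borough per postal code and join the neighborhood names
          .agg({'Borough': 'first', 'Neighborhood': ', '.join})
          .reset_index()
)

print(grouped.values.tolist())
# [['M3A', 'North York', 'Parkwoods'], ['M5A', 'Downtown Toronto', 'Harbourfront, Regent Park']]
```

`groupby` sorts the result by postal code, which is why M3A comes before M5A here and why the output above begins with the M1 codes.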


In [12]:
#checking the dimensions of our dataframe
df.shape

(103, 3)

In [13]:
df.to_csv('Toronto Neighborhood Data.csv', index=False) #removing the index so it doesn't create problems when we read the csv later on
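A small round-trip sketch shows what `index=False` prevents. It writes to an in-memory string instead of a file, and the `sample` frame is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical one-row frame
sample = pd.DataFrame({'PostalCode': ['M1B'], 'Borough': ['Scarborough']})

# Default (index=True): the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'PostalCode', 'Borough']
print(without_index.columns.tolist())  # ['PostalCode', 'Borough']
```

Without `index=False`, the extra "Unnamed: 0" column would have to be dropped after every read.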