## Applied Data Science Capstone Notebook*
---
\* This notebook will be mainly used for the capstone project.

In [1]:
import pandas as pd
import numpy as np

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Segmenting and Clustering Neighborhoods in Toronto

Defining the URL of the Wikipedia page that lists the Canadian postal codes starting with M, which cover the neighborhoods in Toronto.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Using the `requests` library to download the web page

In [4]:
import requests
data = requests.get(url).text
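
A slightly more defensive variant of the download step is sketched below. The helper name `fetch_page` and the 10-second timeout are illustrative assumptions, not part of the original notebook; the sketch only adds a timeout and an HTTP status check before returning the page text.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    # A timeout keeps the notebook from hanging on a slow connection
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With such a helper, the download cell would read `data = fetch_page(url)`.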

Using `BeautifulSoup` to parse the HTML

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data,"html5lib")

Using sample code from [Hints for scraping Notebook](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/NewLinkWebscrapingHints.md) to scrape the HTML table and create a dataframe

In [6]:
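table_contents = []
table = soup.find('table')
# Each <td> cell holds one postal code together with its borough and neighborhoods
for td in table.find_all('td'):
    # Skip postal codes that have no borough assigned
    if td.span.text == 'Not assigned':
        continue
    text = td.span.text
    cell = {}
    cell['PostalCode'] = td.p.text[:3]
    # The text before the first '(' is the borough name
    cell['Borough'] = text.split('(')[0]
    # The text inside the parentheses lists the neighborhoods, separated by ' /'
    cell['Neighborhood'] = text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)

#print(table_contents)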
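
To see what the loop extracts, the sketch below runs the same parsing steps on a hand-written HTML fragment. The fragment is an assumption about the cell layout (a `<p>` holding the postal code and a `<span>` holding `Borough(Neighborhood / Neighborhood)`), inferred from the parsing code rather than copied from the live page. It uses the built-in `html.parser` so that it runs without `html5lib`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the assumed layout of two table cells
sample_html = """
<table><tr>
  <td><p><b>M5A</b></p><span>Downtown Toronto(Regent Park / Harbourfront)</span></td>
  <td><p><b>M1A</b></p><span>Not assigned</span></td>
</tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
parsed = []
for td in sample_soup.find('table').find_all('td'):
    text = td.span.text
    if text == 'Not assigned':
        continue  # unassigned postal codes are skipped
    parsed.append({
        'PostalCode': td.p.text[:3],
        'Borough': text.split('(')[0],
        'Neighborhood': text.split('(')[1].strip(')').replace(' /', ','),
    })

print(parsed)
```

Only the assigned cell survives, with the ` /` separator turned into a comma.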

In [7]:
df=pd.DataFrame(table_contents)

In [8]:
# Clean up borough names in which the scraper ran several text fragments together
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})
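
`Series.replace` with a dictionary only substitutes values that match a key exactly, which is why the full run-together strings are spelled out in the mapping. A small illustration on a three-value sample in the style of the data above:

```python
import pandas as pd

# Small sample: one run-together value and two clean values
boroughs = pd.Series(['EtobicokeNorthwest', 'Etobicoke', 'North York'])

# Only whole-value matches are replaced; 'Etobicoke' is left untouched
cleaned = boroughs.replace({'EtobicokeNorthwest': 'Etobicoke Northwest'})
print(cleaned.tolist())
```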

In [9]:
df.head()

|  | Borough | Neighborhood | PostalCode |
|---|---|---|---|
| 0 | North York | Parkwoods | M3A |
| 1 | North York | Victoria Village | M4A |
| 2 | Downtown Toronto | Regent Park, Harbourfront | M5A |
| 3 | North York | Lawrence Manor, Lawrence Heights | M6A |
| 4 | Queen's Park | Ontario Provincial Government | M7A |


Moving the `PostalCode` column to the first position

In [10]:
fixed_columns = [df.columns[-1]] + list(df.columns[:-1])
df = df[fixed_columns]
df.head()

|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
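
The reordering in this cell relies on `PostalCode` being the last column. An alternative sketch that does not depend on the initial column order is to name the columns explicitly; the sample row is taken from the output shown here.

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ['North York'],
    'Neighborhood': ['Parkwoods'],
    'PostalCode': ['M3A'],
})

# Select the columns by name in the desired order
sample = sample[['PostalCode', 'Borough', 'Neighborhood']]
print(list(sample.columns))
```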


In [11]:
print("Number of rows: {}".format(df.shape[0]))

Number of rows: 103


Loading the geographical coordinates of each postal code from the provided CSV file

In [12]:
geo_spatial_csv = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'

In [13]:
geo_df = pd.read_csv(geo_spatial_csv)
geo_df.head()

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Renaming `Postal Code` in `geo_df` to `PostalCode` so that it matches the column name in `df` and the two can be merged later

In [14]:
geo_df.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
geo_df.head()

|  | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Merging the two dataframes on `PostalCode`

In [15]:
df = df.merge(geo_df, on="PostalCode", how = 'inner')
df.head()

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
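
An inner merge silently drops postal codes that are missing from either table. As a hedged sketch, `merge` can be asked to check that the keys are unique and to flag unmatched rows. The two small frames below are stand-ins for `df` and `geo_df`; the `M0Z` row is invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0Z'],
                     'Borough': ['North York', 'North York', 'Nowhere']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# validate raises MergeError if PostalCode is duplicated on either side;
# indicator adds a '_merge' column showing where each row came from
checked = left.merge(right, on='PostalCode', how='left',
                     validate='one_to_one', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner merge loses no postal codes.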


Counting the postal codes in each `Borough`

In [16]:
df['Borough'].value_counts()

North York                24
Scarborough               17
Downtown Toronto          17
Etobicoke                 11
Central Toronto            9
West Toronto               6
York                       5
East Toronto               4
East York                  4
East Toronto Business      1
Downtown Toronto Stn A     1
Queen's Park               1
Mississauga                1
East York/East Toronto     1
Etobicoke Northwest        1
Name: Borough, dtype: int64

One-hot encoding `Borough`: each postal code gets a `1` in the column of its borough and a `0` in every other column

In [17]:
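# dtype=int keeps the 0/1 encoding shown below; recent pandas versions otherwise return booleans
df_onehot = pd.get_dummies(df[['Borough']], prefix="", prefix_sep="", dtype=int)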
df_onehot['PostalCode'] = df['PostalCode']
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]
df_onehot.head()

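|  | PostalCode | Central Toronto | Downtown Toronto | Downtown Toronto Stn A | East Toronto | East Toronto Business | East York | East York/East Toronto | Etobicoke | Etobicoke Northwest | Mississauga | North York | Queen's Park | Scarborough | West Toronto | York |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | M4A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | M5A | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M6A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | M7A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |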


In [18]:
df_onehot.shape

(103, 16)
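
Because every postal code belongs to exactly one borough, each one-hot row should contain a single `1`. The sketch below checks this property on three rows taken from the notebook's data; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M7A'],
    'Borough': ['North York', 'Downtown Toronto', "Queen's Park"],
})

# dtype=int yields 0/1 columns; recent pandas versions default to booleans
onehot = pd.get_dummies(sample[['Borough']], prefix="", prefix_sep="", dtype=int)

# Every row should sum to exactly 1
row_sums = onehot.sum(axis=1).tolist()
print(row_sums)
```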

Running k-means to group the postal codes into 5 clusters based on their one-hot encoded borough.

In [19]:
from sklearn.cluster import KMeans

In [20]:
kclusters = 5

# drop the non-numeric key column; the positional axis argument was removed in pandas 2.0
df_clustering = df_onehot.drop(columns='PostalCode')

# run k-means clustering (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([2, 2, 4, 2, 0, 3, 1, 2, 0, 4])
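
Since the only features are the one-hot borough columns, postal codes in the same borough have identical feature vectors and therefore always receive the same label. A minimal sketch on made-up data illustrates this; the sample frame and `n_clusters=2` are assumptions for the demo only.

```python
import pandas as pd
from sklearn.cluster import KMeans

sample = pd.DataFrame({
    'Borough': ['North York', 'North York', 'North York',
                'Scarborough', 'Scarborough', 'Etobicoke'],
})
features = pd.get_dummies(sample[['Borough']], prefix="", prefix_sep="", dtype=int)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
sample['Cluster Labels'] = model.labels_

# Number of distinct labels per borough: identical rows share one label
labels_per_borough = sample.groupby('Borough')['Cluster Labels'].nunique()
print(labels_per_borough.to_dict())
```

With fewer clusters than boroughs, some boroughs necessarily end up sharing a cluster.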

Adding the cluster labels to the one-hot dataframe

In [21]:
df_onehot.insert(0, 'Cluster Labels', kmeans.labels_)
df_onehot.head()

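|  | Cluster Labels | PostalCode | Central Toronto | Downtown Toronto | Downtown Toronto Stn A | East Toronto | East Toronto Business | East York | East York/East Toronto | Etobicoke | Etobicoke Northwest | Mississauga | North York | Queen's Park | Scarborough | West Toronto | York |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | M3A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | M4A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 4 | M5A | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | M6A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | M7A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |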


Merging the cluster labels into the initial dataframe

In [22]:
df = df.merge(df_onehot[['Cluster Labels','PostalCode']], on="PostalCode", how = 'inner')
df.head()

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude | Cluster Labels |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 2 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 2 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | 4 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 | 2 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 | 0 |
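
Because `df_onehot` was built from `df` without reordering rows, the labels could also be attached by direct assignment. The sketch below, on a few sample rows, shows that both routes agree when `PostalCode` is unique; the frame names and the hard-coded label list are assumptions for illustration.

```python
import pandas as pd

base = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M5A'],
                     'Borough': ['North York', 'North York', 'Downtown Toronto']})
labels = [2, 2, 4]  # illustrative labels, one per row of `base`

# Route 1: merge on the key, as in the notebook cell
lookup = pd.DataFrame({'Cluster Labels': labels, 'PostalCode': base['PostalCode']})
merged = base.merge(lookup, on='PostalCode', how='inner')

# Route 2: assign the labels positionally
assigned = base.assign(**{'Cluster Labels': labels})

print(merged.equals(assigned))
```

The merge is the safer choice whenever the row order of the two frames might differ.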


Examine `Downtown Toronto` cluster

In [23]:
df.loc[df['Cluster Labels'] == 4,  df.columns[:-1]]

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 24 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 25 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 30 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 36 | M5J | Downtown Toronto | Harbourfront East, Union Station, Toronto Islands | 43.640816 | -79.381752 |
| 42 | M5K | Downtown Toronto | Toronto Dominion Centre, Design Exchange | 43.647177 | -79.381576 |
| 48 | M5L | Downtown Toronto | Commerce Court, Victoria Hotel | 43.648198 | -79.379817 |


Examine `Etobicoke` cluster

In [24]:
df.loc[df['Cluster Labels'] == 3,  df.columns[:-1]]

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 11 | M9B | Etobicoke | West Deane Park, Princess Gardens, Martin Grov... | 43.650943 | -79.554724 |
| 17 | M9C | Etobicoke | Eringate, Bloordale Gardens, Old Burnhamthorpe... | 43.643515 | -79.577201 |
| 70 | M9P | Etobicoke | Westmount | 43.696319 | -79.532242 |
| 77 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... | 43.688905 | -79.554724 |
| 88 | M8V | Etobicoke | New Toronto, Mimico South, Humber Bay Shores | 43.605647 | -79.501321 |
| 89 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... | 43.739416 | -79.588437 |
| 93 | M8W | Etobicoke | Alderwood, Long Branch | 43.602414 | -79.543484 |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |


Examine `North York` cluster

In [25]:
df.loc[df['Cluster Labels'] == 2,  df.columns[:-1]]

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 10 | M6B | North York | Glencairn | 43.709577 | -79.445073 |
| 13 | M3C | North York | Don Mills South | 43.7259 | -79.340923 |
| 27 | M2H | North York | Hillcrest Village | 43.803762 | -79.363452 |
| 28 | M3H | North York | Bathurst Manor, Wilson Heights, Downsview North | 43.754328 | -79.442259 |
| 33 | M2J | North York | Fairview, Henry Farm, Oriole | 43.778517 | -79.346556 |
| 34 | M3J | North York | Northwood Park, York University | 43.76798 | -79.487262 |


Examine `Scarborough` cluster

In [26]:
df.loc[df['Cluster Labels'] == 1,  df.columns[:-1]]

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 12 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 18 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 22 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 26 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 32 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 38 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park | 43.727929 | -79.262029 |
| 44 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge | 43.711112 | -79.284577 |
| 51 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West | 43.716316 | -79.239476 |
| 58 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |


Examine the cluster that contains the remaining boroughs

In [27]:
df.loc[df['Cluster Labels'] == 0,  df.columns[:-1]]

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 14 | M4C | East York | Woodbine Heights | 43.695344 | -79.318389 |
| 16 | M6C | York | Humewood-Cedarvale | 43.693781 | -79.428191 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 21 | M6E | York | Caledonia-Fairbanks | 43.689026 | -79.453512 |
| 23 | M4G | East York | Leaside | 43.70906 | -79.363452 |
| 29 | M4H | East York | Thorncliffe Park | 43.705369 | -79.349372 |
| 31 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |
| 35 | M4J | East York/East Toronto | The Danforth East | 43.685347 | -79.338106 |
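
Instead of inspecting each label in a separate cell, a cross-tabulation summarizes which boroughs fall into which cluster in one table. A sketch on a handful of borough and label pairs that appear in the notebook's outputs:

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'Etobicoke', 'Scarborough', "Queen's Park"],
    'Cluster Labels': [2, 2, 4, 3, 1, 0],
})

# Rows are boroughs, columns are cluster labels, cells are postal-code counts
summary = pd.crosstab(sample['Borough'], sample['Cluster Labels'])
print(summary)
```

On the full data the equivalent call would be `pd.crosstab(df['Borough'], df['Cluster Labels'])`.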
