# Toronto Neighborhood Clustering

**Objectives:**

1. Scraping Toronto neighborhood data from Wikipedia.
2. Loading and merging coordinate data for neighborhoods.
3. Clustering neighborhoods using K-means Clustering.

## Scraping Toronto Neighborhood Data from Wikipedia

In [1]:
# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
import requests
from bs4 import BeautifulSoup

**Specifying the url and creating BeautifulSoup Object for parsing**

In [2]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(wiki_url).text
soup = BeautifulSoup(html, 'html5lib')

**Parsing and Loading data in a List**

In [3]:
table_contents = []
table = soup.find('table')
for row in table.find_all('td'):               # Each <td> holds one postal code entry
    cell = {}
    if row.span.text == 'Not assigned':
        pass                                   # Skip postal codes that have no borough assigned
    else:
        cell['PostalCode'] = row.p.text[:3]    # First three characters are the postal code
        cell['Borough'] = row.span.text.split('(')[0]   # Borough name precedes the first '('
        # Neighborhoods sit inside the brackets; drop the brackets and turn '/' separators into commas
        cell['Neighborhood'] = row.span.text.split('(')[1].strip(')').replace('/', ',').replace(')', ' ').strip(' ')
        table_contents.append(cell)

The Neighborhood values needed a fair amount of cleaning (stripping the brackets and replacing the `/` separators with commas). This is done while the data is being loaded.
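The chained string calls in the loop are easier to test when pulled into a small function. Here is a minimal sketch of that idea; `split_cell` is a hypothetical helper name, and the sample strings assume each cell's text looks like `Borough(Neighborhood / Neighborhood)`, which is the shape the loop expects.

```python
def split_cell(span_text):
    """Split 'Borough(Hood A / Hood B)' into (borough, neighborhoods).

    Hypothetical helper that mirrors the string clean-up done in the loop.
    Returns None for postal codes marked 'Not assigned'.
    """
    if span_text == 'Not assigned':
        return None
    parts = span_text.split('(')
    borough = parts[0]
    if len(parts) < 2:
        # No brackets at all: there is no neighborhood text to clean.
        return borough, ''
    # Drop the closing bracket and turn the '/' separators into commas.
    neighborhood = parts[1].strip(')').replace('/', ',').replace(')', ' ').strip(' ')
    return borough, neighborhood


print(split_cell('North York(Parkwoods)'))
print(split_cell('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell('Not assigned'))
```

Keeping the clean-up in one function means it can be checked against a few sample strings without fetching the page again.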

**Converting the Table into a Pandas Dataframe**

In [4]:
df = pd.DataFrame(table_contents)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,..."


Some entries in the Borough column have extra text fused onto the borough name. Let's clean it up!

In [5]:
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                        'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                        'EtobicokeNorthwest':'Etobicoke Northwest',
                                        'East YorkEast Toronto':'East York/East Toronto',
                                        'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
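Since `replace` with a dictionary only swaps values that match a key exactly, it can help to check which values still look fused afterwards. The sketch below uses a few sample values from this notebook and a heuristic of my own (a lowercase letter directly followed by an uppercase one); it is an assumption about what "fused" looks like, not something pandas provides for this purpose.

```python
import pandas as pd

# Sample values shaped like the scraped Borough column.
boroughs = pd.Series(['North York', 'EtobicokeNorthwest', 'East YorkEast Toronto'])

# Heuristic (an assumption): lowercase followed directly by uppercase
# suggests two strings were joined without a space.
fused_pattern = r'[a-z][A-Z]'
fused = boroughs[boroughs.str.contains(fused_pattern, regex=True)]
print(fused.tolist())

# Dictionary replacement matches whole cell values only.
cleaned = boroughs.replace({
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
})

# Anything still matching the pattern would need another dictionary entry.
still_fused = cleaned[cleaned.str.contains(fused_pattern, regex=True)]
print(cleaned.tolist())
print(still_fused.tolist())
```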

In [6]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern , Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill , Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


**Printing Results of Scraping and Cleaning.**

In [7]:
print(f'My dataframe has {df.shape[0]} rows and {df.shape[1]} columns')

My dataframe has 103 rows and 3 columns


## Getting Latitude and Longitude for Toronto Neighborhood Data

This step can be completed using the geocoder library or the provided CSV file. I chose the latter, as geocoder was taking a long time.

**Importing Latitude Longitude data and merging dataframes**

In [8]:
lat_lng_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
lat_lng_df = pd.read_csv(lat_lng_url)
lat_lng_df 

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [9]:
lat_lng_df.rename(columns = {'Postal Code':'PostalCode'}, inplace=True) # For merging operation later

In [10]:
final_df = pd.merge(df,lat_lng_df,on='PostalCode') # Final Data 
final_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern , Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill , Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
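`pd.merge` performs an inner join by default, so a postal code that is missing from the coordinates file would be dropped without any message. A small sketch of how that could be checked with `how='left'` and `indicator=True`; the two tiny frames stand in for `df` and `lat_lng_df`, and `M0X` is a made-up code added only to show an unmatched row.

```python
import pandas as pd

# Stand-ins for df and lat_lng_df; 'M0X' / 'Nowhere' are made up.
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Nowhere']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join keeps every scraped row; the _merge column says whether
# a matching coordinate row was found.
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If the list comes back empty for the real data, the inner join used above loses nothing.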


## Visualizing Neighborhoods and Clustering

We will use the Folium library to visualize the neighborhood data.

**Let us visualize all the neighborhoods in our dataframe**

In [11]:
neigh_map = folium.Map(location = [43.6532,-79.3832], zoom_start=10)

for lat, lng, neighborhood, borough in zip(final_df['Latitude'], final_df['Longitude'], final_df['Neighborhood'], final_df['Borough']):
    label = f'{neighborhood},{borough}'
    label = folium.Popup(label, parse_html=True)   # Escape any HTML in the label text
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.5).add_to(neigh_map)
    
neigh_map.save('neigh_map.html')
neigh_map

**Filtering boroughs with 'Toronto' in their name**

In [12]:
df_with_tor = final_df[final_df['Borough'].str.contains('Toronto',regex=False)]
df_with_tor

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond , Adelaide , King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin , Dovercourt Village",43.669005,-79.442259
35,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106


**Visualizing the filtered boroughs**

In [13]:
tor_neigh_map = folium.Map(location = [43.6532,-79.3832], zoom_start=11) # Increasing zoom as target area is smaller

for lat, lng, neighborhood, borough in zip(df_with_tor['Latitude'], df_with_tor['Longitude'], df_with_tor['Neighborhood'], df_with_tor['Borough']):
    label = f'{neighborhood},{borough}'
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.5).add_to(tor_neigh_map)
    
tor_neigh_map.save('tor_neigh_map.html')
tor_neigh_map
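Both maps use a hard-coded centre and a hand-picked zoom level. As an alternative, the centre and the bounding box can be derived from the data itself. The sketch below uses only the standard library and the first five rows of the filtered table; the resulting `bounds` list has the `[[south, west], [north, east]]` shape that Folium's `Map.fit_bounds` accepts, though wiring it into the map is left out here.

```python
from statistics import mean

# First five rows of the filtered table, as (latitude, longitude).
points = [
    (43.654260, -79.360636),   # M5A
    (43.657162, -79.378937),   # M5B
    (43.651494, -79.375418),   # M5C
    (43.676357, -79.293031),   # M4E
    (43.644771, -79.373306),   # M5E
]

lats = [lat for lat, _ in points]
lngs = [lng for _, lng in points]

# Centre of the map: the mean of the coordinates.
center = [mean(lats), mean(lngs)]
# Bounding box: south-west corner first, then north-east corner.
bounds = [[min(lats), min(lngs)], [max(lats), max(lngs)]]

print(center)
print(bounds)
```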

## Clustering the Neighborhoods

We cluster the neighborhoods with the **KMeans** algorithm, using latitude and longitude as the only features.

In [14]:
from sklearn.cluster import KMeans

k = 5 # Tentative value
df1 = df_with_tor.drop(columns=['PostalCode','Borough','Neighborhood']) # Keep only the numeric Latitude/Longitude columns
df1.head()

In [15]:
kmeans = KMeans(n_clusters=k, random_state=42).fit(df1) # Fixed random_state so the cluster labels are the same on every run
df_with_tor.insert(0, 'Cluster', kmeans.labels_)
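To make it clearer what `KMeans` does with the coordinates, here is a bare-bones sketch of Lloyd's algorithm in NumPy. It is a teaching version, not scikit-learn's implementation: `lloyd_kmeans` is a hypothetical name, the starting centres are simply evenly spaced rows (scikit-learn defaults to k-means++ initialisation), and the input is just the first five rows of the filtered table.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=20):
    """Lloyd's algorithm for k-means; a teaching sketch only."""
    points = np.asarray(points, dtype=float)
    # Deterministic start: k evenly spaced rows (an assumption made for
    # reproducibility; scikit-learn uses k-means++ by default).
    start = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[start].copy()
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Within-cluster sum of squared distances (scikit-learn calls this inertia_).
    inertia = float(((points - centers[labels]) ** 2).sum())
    return labels, centers, inertia


# (latitude, longitude) for M5A, M5B, M5C, M4E, M5E.
coords = np.array([
    [43.654260, -79.360636],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
    [43.644771, -79.373306],
])

labels, centers, inertia_k2 = lloyd_kmeans(coords, k=2)
_, _, inertia_k1 = lloyd_kmeans(coords, k=1)
print(labels)
print(inertia_k1, inertia_k2)
```

Comparing the inertia for several values of `k` is one common way to judge whether a tentative choice such as `k = 5` is reasonable, although on raw coordinates the result is still a purely geographic grouping.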

Let's look at the clustered dataset.

In [16]:
df_with_tor.head()

Unnamed: 0,Cluster,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,1,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
9,1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,1,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,0,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,1,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [17]:
cluster_map = folium.Map(location = [43.6532,-79.3832], zoom_start=11)

color_list = cm.rainbow(np.linspace(0, 1, k))  # to give each cluster a specific color
rainbow = [colors.rgb2hex(i) for i in color_list]

for lat, lng, neighborhood, borough, cluster in zip(df_with_tor['Latitude'], df_with_tor['Longitude'], df_with_tor['Neighborhood'], df_with_tor['Borough'], df_with_tor['Cluster']):
    label = f'Cluster {cluster}'
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        color=rainbow[cluster],        # One color per cluster label
        popup=label,
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.5).add_to(cluster_map)
    
cluster_map.save('cluster_map.html')
cluster_map