## WEEK 3. IBM Data Science Capstone project

### Part 1. Web-scraping and preparing data for the next assignments

Import the libraries.

In [142]:
import numpy as np
import pandas as pd

import requests

Scrape the Wikipedia page with the list of postal codes of Canada. `response` is the object that holds the page content we need.

In [143]:
response=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

We can parse the HTML table into a DataFrame with pandas alone, without the help of BeautifulSoup.

In [144]:
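from io import StringIO

# read_html returns a list of DataFrames, one per <table>; wrapping the HTML
# in StringIO avoids the literal-string deprecation warning in recent pandas
df = pd.read_html(StringIO(response.text))
df = df[0]  # the first table holds the postal codes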
df[1:10]

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,Not assigned,Not assigned
9,M9A,Downtown Toronto,Queen's Park


Remove rows with Borough 'Not assigned'

In [145]:
df = df[df['Borough'] != 'Not assigned']
df.shape

(210, 3)
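
The filtering step can be tried on a few toy rows. This is a minimal sketch, assuming rows shaped like the scraped table; the `toy` and `assigned` names are illustrative, and the fallback that copies the Borough into an unassigned Neighbourhood is an optional extra, not a step this notebook performs.

```python
import pandas as pd

# Toy rows shaped like the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose Borough is assigned; .copy() gives an independent frame.
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# Optional follow-up (an assumption, not part of this notebook):
# reuse the Borough name where the Neighbourhood is still unassigned.
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

print(assigned)
```

Taking a `.copy()` after the boolean filter keeps later in-place edits from raising chained-assignment warnings.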

Merge rows that share the same Postcode and Borough, joining their Neighbourhoods into a single comma-separated string.

In [146]:
df = df.groupby(['Postcode','Borough']).agg(', '.join)
df[20:30]

Postcode,Borough,Neighbourhood
M2L,North York,"Silver Hills, York Mills"
M2M,North York,"Newtonbrook, Willowdale"
M2N,North York,Willowdale South
M2P,North York,York Mills West
M2R,North York,Willowdale West
M3A,North York,Parkwoods
M3B,North York,Don Mills North
M3C,North York,"Flemingdon Park, Don Mills South"
M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights"
M3J,North York,"Northwood Park, York University"
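
To see what the group-and-join does, here is a small sketch on three toy rows taken from the table printed earlier. The `toy` and `combined` names are illustrative; selecting the `Neighbourhood` column before aggregating is one possible variant of the cell above, not the notebook's exact call.

```python
import pandas as pd

# Two rows share postcode M6A; one row has its own postcode.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Group on the key columns, join the neighbourhood names of each group,
# then turn the group keys back into ordinary columns.
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)

print(combined)
```

Each postcode now appears once, and the two M6A neighbourhoods end up in one string.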


We don't need the table to be indexed by Postcode and Borough, so reset the index.

In [147]:
df=df.reset_index()
df.shape

(103, 3)

### Part 2. Geotagging our postal codes and adding geodata to table

In [148]:
df_geo = pd.read_csv('https://cocl.us/Geospatial_data')
df_geo[20:30]

Unnamed: 0,Postal Code,Latitude,Longitude
20,M2L,43.75749,-79.374714
21,M2M,43.789053,-79.408493
22,M2N,43.77012,-79.408493
23,M2P,43.752758,-79.400049
24,M2R,43.782736,-79.442259
25,M3A,43.753259,-79.329656
26,M3B,43.745906,-79.352188
27,M3C,43.7259,-79.340923
28,M3H,43.754328,-79.442259
29,M3J,43.76798,-79.487262


To combine df and df_geo, use the pandas merge function, then drop the redundant 'Postal Code' column.

In [149]:
df=pd.merge(df, df_geo, left_on='Postcode', right_on='Postal Code').drop('Postal Code', axis=1)
df[20:30]

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
20,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714
21,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493
22,M2N,North York,Willowdale South,43.77012,-79.408493
23,M2P,North York,York Mills West,43.752758,-79.400049
24,M2R,North York,Willowdale West,43.782736,-79.442259
25,M3A,North York,Parkwoods,43.753259,-79.329656
26,M3B,North York,Don Mills North,43.745906,-79.352188
27,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923
28,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259
29,M3J,North York,"Northwood Park, York University",43.76798,-79.487262
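
A merge like this silently drops postcodes that have no coordinates, so it can help to check for them. The sketch below is one possible check, assuming toy frames named `codes` and `geo`; the postcode `M9Z` is a made-up value included only to show an unmatched row.

```python
import pandas as pd

# Toy inputs; "M9Z" is a hypothetical code with no geodata.
codes = pd.DataFrame({
    "Postcode": ["M2L", "M2M", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
geo = pd.DataFrame({
    "Postal Code": ["M2L", "M2M"],
    "Latitude": [43.75749, 43.789053],
    "Longitude": [-79.374714, -79.408493],
})

# A left join keeps every postcode; validate raises if a key repeats on
# either side; indicator adds a "_merge" column saying where each row came from.
merged = codes.merge(
    geo,
    how="left",
    left_on="Postcode",
    right_on="Postal Code",
    validate="one_to_one",
    indicator=True,
)

# Postcodes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()

# Drop the helper columns once the check is done.
merged = merged.drop(columns=["Postal Code", "_merge"])

print(unmatched)
```

If `unmatched` is empty, the default inner join used in the cell above loses nothing.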


### Part 3. Explore and cluster the neighborhoods in Toronto

In [150]:
# .copy() makes an independent frame so a column can be inserted later without warnings
dfNorthYork = df[df['Borough']=='North York'].copy()

In [151]:
dfNorthYork.reset_index(drop=True)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493
5,M2N,North York,Willowdale South,43.77012,-79.408493
6,M2P,North York,York Mills West,43.752758,-79.400049
7,M2R,North York,Willowdale West,43.782736,-79.442259
8,M3A,North York,Parkwoods,43.753259,-79.329656
9,M3B,North York,Don Mills North,43.745906,-79.352188


In [152]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

In [153]:
# Create a map centred near Toronto
map_NorthYork = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# Add one circle marker per North York postal code
for lat, lng, borough, neighbourhood in zip(dfNorthYork['Latitude'], dfNorthYork['Longitude'], dfNorthYork['Borough'], dfNorthYork['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_NorthYork)
map_NorthYork

## The map might not be visible on GitHub. Check out the README for the map.

Machine learning: KMeans clustering of the North York postal codes, using latitude and longitude as the only features

In [154]:
from sklearn.cluster import KMeans

k = 5
# Cluster on coordinates only: drop the non-numeric columns
clusters = dfNorthYork.drop(columns=['Postcode', 'Borough', 'Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(clusters)
# Put the cluster labels in the first column
dfNorthYork.insert(0, 'Cluster#', kmeans.labels_)
dfNorthYork

Unnamed: 0,Cluster#,Postcode,Borough,Neighbourhood,Latitude,Longitude
17,0,M2H,North York,Hillcrest Village,43.803762,-79.363452
18,0,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
19,0,M2K,North York,Bayview Village,43.786947,-79.385975
20,0,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714
21,4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493
22,4,M2N,North York,Willowdale South,43.77012,-79.408493
23,4,M2P,North York,York Mills West,43.752758,-79.400049
24,4,M2R,North York,Willowdale West,43.782736,-79.442259
25,3,M3A,North York,Parkwoods,43.753259,-79.329656
26,3,M3B,North York,Don Mills North,43.745906,-79.352188
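
To make the clustering step less of a black box, the sketch below runs plain Lloyd's algorithm in NumPy on a few coordinates from the table above. It is not scikit-learn's implementation (which adds refinements such as k-means++ initialisation); the function name `lloyd_kmeans` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct input points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Mean of each cluster's points; keep the old centroid if a cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stopped changing
        centroids = new_centroids
    return labels, centroids

# (latitude, longitude) pairs copied from the North York table.
coords = np.array([
    [43.803762, -79.363452],  # M2H
    [43.778517, -79.346556],  # M2J
    [43.789053, -79.408493],  # M2M
    [43.782736, -79.442259],  # M2R
    [43.753259, -79.329656],  # M3A
    [43.745906, -79.352188],  # M3B
])

labels, centroids = lloyd_kmeans(coords, k=3)
print(labels)
print(centroids)
```

Both this sketch and the cell above measure Euclidean distance on raw degrees. At Toronto's latitude a degree of longitude spans less ground than a degree of latitude, so east-west separation is weighted somewhat more than north-south separation; for an area as small as North York this is usually an acceptable approximation.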


In [155]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[43.737473, -79.464763],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(dfNorthYork['Latitude'], dfNorthYork['Longitude'], dfNorthYork['Neighbourhood'], dfNorthYork['Cluster#']):
    label = folium.Popup('{}, Cluster {}'.format(neighbourhood, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..k-1, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

## The map might not be visible on GitHub. Check out the README for the map.

The map shows the 5 clusters we requested with k=5 for North York. Because the clustering used only geolocation data (latitude and longitude), each cluster is simply a group of postal codes that lie close to one another. The geography of the region and the zoning by-law are likely the main reasons why North York's neighbourhoods group this way.
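
One way to inspect clusters like these is to summarise each one by its size and mean coordinates. The sketch below does this on a few labelled rows copied from the clustering output above; the `labelled` and `summary` names and the output column names are illustrative.

```python
import pandas as pd

# A few labelled rows copied from the clustering output.
labelled = pd.DataFrame({
    "Cluster#": [0, 0, 4, 4, 3, 3],
    "Postcode": ["M2H", "M2J", "M2M", "M2N", "M3A", "M3B"],
    "Latitude": [43.803762, 43.778517, 43.789053, 43.77012, 43.753259, 43.745906],
    "Longitude": [-79.363452, -79.346556, -79.408493, -79.408493, -79.329656, -79.352188],
})

# Per-cluster size and centre, using pandas named aggregation.
summary = labelled.groupby("Cluster#").agg(
    postcodes=("Postcode", "count"),
    mean_lat=("Latitude", "mean"),
    mean_lon=("Longitude", "mean"),
)

print(summary)
```

Well-separated mean coordinates would be consistent with the clusters being purely geographic groups.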