# Capstone Week 3

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Part 1</a>

2. <a href="#item2">Part 2</a>

3. <a href="#item3">Part 3</a>
</font>
</div>

<a id='item1'></a>

## Part 1

_Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M_

_in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe_

#### Make the request and store the table in a dataframe

In [33]:
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


#### Replace "Not assigned" with NaN

In [34]:
import numpy as np

df.replace("Not assigned", np.nan, inplace = True)
df.head(5)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | NaN | NaN |
| 1 | M2A | NaN | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


#### Drop every row that has NaN in the "Borough" column

In [36]:
df.dropna(subset=["Borough"], axis=0, inplace=True)
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


#### Replace empty neighbourhoods with the borough's name

In [59]:
df['Neighbourhood'] = np.where(df['Neighbourhood'].isnull(), df['Borough'], df['Neighbourhood'])
df.head(5)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
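The three cleaning steps above can be tried on a small hand-made frame before running them on the scraped table. The sketch below is only an illustration: the rows are toy values that mimic the table's shape, and `fillna` is used as an alternative to `np.where` for the borough fallback.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration) that mimic the shape of the scraped table.
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront", "Not assigned"],
})

cleaned = (
    raw.replace("Not assigned", np.nan)   # mark missing values as NaN
       .dropna(subset=["Borough"])        # drop rows that have no borough
       .reset_index(drop=True)
)

# Fall back to the borough name where the neighbourhood is missing.
cleaned["Neighbourhood"] = cleaned["Neighbourhood"].fillna(cleaned["Borough"])
print(cleaned)
```

Chaining the steps without `inplace=True` returns a new frame at each stage, which leaves the raw data untouched and makes the cell safe to re-run.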


#### Build a new dataframe with the unique set of postcodes and boroughs

In [96]:
# select the two columns, keep the first row per postcode, and index by postcode;
# chaining the calls avoids modifying a slice of df in place (SettingWithCopyWarning)
pcodes = df[["Postcode", "Borough"]].drop_duplicates(subset="Postcode", keep="first").set_index("Postcode")
pcodes.head(5)

| Postcode | Borough |
|---|---|
| M3A | North York |
| M4A | North York |
| M5A | Downtown Toronto |
| M6A | North York |
| M7A | Queen's Park |


#### Group the neighbourhoods per postcode 

In [121]:
# join all neighbourhood names that share a postcode into one comma-separated string
neighbourhoods = df.groupby('Postcode').agg({'Neighbourhood': ', '.join})
neighbourhoods.head()

| Postcode | Neighbourhood |
|---|---|
| M1B | Rouge, Malvern |
| M1C | Highland Creek, Rouge Hill, Port Union |
| M1E | Guildwood, Morningside, West Hill |
| M1G | Woburn |
| M1H | Cedarbrae |
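As a possible alternative to building the postcode/borough frame and the neighbourhood frame separately, grouping on both keys at once gives one row per postcode directly. This is a minimal sketch on toy rows, assuming each postcode belongs to exactly one borough.

```python
import pandas as pd

# Toy rows for illustration; M5A appears twice, once per neighbourhood.
rows = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# Group on both keys and join the neighbourhood names of each group.
grouped = (
    rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
)
print(grouped)
```

With this form the later join step is not needed, at the cost of relying on the one-borough-per-postcode assumption.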


#### Join the tables, perform some final cleaning and sorting

In [123]:
# sort=True keeps the sorted-index behaviour explicitly and avoids the pandas FutureWarning
final = pd.concat([pcodes, neighbourhoods], axis=1, sort=True)
# name the index before resetting it so the new column is always called "PostalCode"
final = final.rename_axis("PostalCode").reset_index()
final.sort_values(by=['Borough'], inplace=True)
final.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 47 | M4S | Central Toronto | Davisville |
| 63 | M5N | Central Toronto | Roselawn |
| 46 | M4R | Central Toronto | North Toronto West |
| 64 | M5P | Central Toronto | Forest Hill North, Forest Hill West |
| 65 | M5R | Central Toronto | The Annex, North Midtown, Yorkville |


#### Check the shape of the final dataframe

In [125]:
final.shape

(103, 3)

<a id='item2'></a>

## Part 2

#### Bring in the file as a new dataframe

In [132]:
filename = "https://cocl.us/Geospatial_data"
geo = pd.read_csv(filename)
geo.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
geo.set_index("PostalCode", inplace=True)
geo.head()

| PostalCode | Latitude | Longitude |
|---|---|---|
| M1B | 43.806686 | -79.194353 |
| M1C | 43.784535 | -79.160497 |
| M1E | 43.763573 | -79.188711 |
| M1G | 43.770992 | -79.216917 |
| M1H | 43.773136 | -79.239476 |


#### Concat our neighbourhoods dataframe with the new geo dataframe

In [142]:
final_geo = final.set_index("PostalCode")
# sort=True keeps the sorted-index behaviour explicitly and avoids the pandas FutureWarning
final_geo = pd.concat([final_geo, geo], axis=1, sort=True)
# name the index so the new column is called 'index' on any pandas version
final_geo = final_geo.rename_axis("index").reset_index()
final_geo.sort_values(by=['Borough'], inplace=True)
final_geo

|   | index | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 47 | M4S | Central Toronto | Davisville | 43.704324 | -79.388790 |
| 63 | M5N | Central Toronto | Roselawn | 43.711695 | -79.416936 |
| 46 | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 |
| 64 | M5P | Central Toronto | Forest Hill North, Forest Hill West | 43.696948 | -79.411307 |
| 65 | M5R | Central Toronto | The Annex, North Midtown, Yorkville | 43.672710 | -79.405678 |
| 48 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.383160 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.728020 | -79.388790 |
| 45 | M4P | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 49 | M4V | Central Toronto | Deer Park, Forest Hill SE, Rathnelly, South Hi... | 43.686412 | -79.400049 |
| 50 | M4W | Downtown Toronto | Rosedale | 43.679563 | -79.377529 |
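Joining the coordinates could also be written as a key-based `merge`, which keeps the postal code as an ordinary column and can check that each code appears only once on both sides. The frames below are toy data for illustration (coordinates copied from the output above), and the names `areas` and `coords` are hypothetical.

```python
import pandas as pd

# Toy frames for illustration only.
areas = pd.DataFrame({
    "PostalCode": ["M4S", "M5N"],
    "Borough": ["Central Toronto", "Central Toronto"],
    "Neighbourhood": ["Davisville", "Roselawn"],
})
coords = pd.DataFrame({
    "PostalCode": ["M5N", "M4S", "M1B"],
    "Latitude": [43.711695, 43.704324, 43.806686],
    "Longitude": [-79.416936, -79.388790, -79.194353],
})

# how="left" keeps every row of `areas`; validate raises a MergeError
# if a postal code is duplicated in either frame.
merged = areas.merge(coords, on="PostalCode", how="left", validate="one_to_one")
print(merged)
```

A left merge also makes missing coordinates visible as NaN instead of silently adding rows that exist only in the coordinates file.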


<a id='item3'></a>

## Part 3

#### Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [145]:
%pip install geopy

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/80/93/d384479da0ead712bdaf697a8399c13a9a89bd856ada5a27d462fb45e47b/geopy-1.20.0-py2.py3-none-any.whl (100kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/5b/ac/4f348828091490d77899bc74e92238e2b55c59392f21948f296e94e50e2b/geographiclib-1.49.tar.gz
Building wheels for collected packages: geographiclib
  Building wheel for geographiclib (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/99/45/d1/14954797e2a976083182c2e7da9b4e924509e59b6e5c661061
Successfully built geographiclib
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.49 geopy-1.20.0
Note: you may need to restart the kernel to use updated packages.


In [146]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


#### Use geopy library to get the latitude and longitude values of Toronto.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>to_explorer</em>, as shown below.

In [147]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
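geopy's `geocode` returns `None` when the service finds no match, so reading `.latitude` directly can fail with an `AttributeError`. The helper below is a hypothetical wrapper (the name `geocode_or_raise` is not part of geopy) that sketches one way to surface that case with a clearer error.

```python
def geocode_or_raise(geolocator, address):
    """Return (latitude, longitude) for `address`, or raise if nothing was found.

    `geolocator` is any object with a geopy-style `geocode(address)` method,
    for example a `Nominatim` instance. This helper is illustrative only.
    """
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude
```

In this notebook it would be called as `latitude, longitude = geocode_or_raise(geolocator, address)`.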


#### Create a Toronto dataframe containing only the boroughs whose name includes "Toronto"

In [172]:
toronto_data = final_geo[final_geo['Borough'].str.contains('Toronto', regex=False)].reset_index(drop=True)
toronto_data.head(80)

|   | index | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4S | Central Toronto | Davisville | 43.704324 | -79.38879 |
| 1 | M5N | Central Toronto | Roselawn | 43.711695 | -79.416936 |
| 2 | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 |
| 3 | M5P | Central Toronto | Forest Hill North, Forest Hill West | 43.696948 | -79.411307 |
| 4 | M5R | Central Toronto | The Annex, North Midtown, Yorkville | 43.67271 | -79.405678 |
| 5 | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 6 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 7 | M4P | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 8 | M4V | Central Toronto | Deer Park, Forest Hill SE, Rathnelly, South Hi... | 43.686412 | -79.400049 |
| 9 | M4W | Downtown Toronto | Rosedale | 43.679563 | -79.377529 |


#### Create a map of Toronto with neighbourhoods superimposed on top

In [173]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map; every column comes from toronto_data so the rows line up,
# and the loop variables do not overwrite the map centre (latitude, longitude)
for lat, lng, borough, neighbourhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Cluster Neighbourhoods

Run *k*-means to group the neighbourhoods into 5 clusters.

In [176]:
# set number of clusters
kclusters = 5

# keep only the numeric Latitude/Longitude columns for clustering
toronto_data_clustering = toronto_data.drop(columns=['Neighbourhood', 'Borough', 'index'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_data_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 4, 0, 0, 0, 0, 1], dtype=int32)
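A natural next step is to attach the labels to the rows they describe. The sketch below shows one way to do that on made-up coordinates; the column name `Cluster Labels` and the two-cluster setup are assumptions for illustration, not part of the notebook above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy coordinates (made up): two points near each of two centres.
points = pd.DataFrame({
    "Latitude": [43.65, 43.66, 43.75, 43.76],
    "Longitude": [-79.38, -79.39, -79.20, -79.21],
})

# n_init is set explicitly because its default differs between scikit-learn versions.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)

# labels_ follows the row order of the fitted data, so it can be inserted directly.
labelled = points.copy()
labelled.insert(0, "Cluster Labels", model.labels_)
print(labelled)
```

The cluster numbers themselves are arbitrary identifiers; only the grouping of rows carries meaning.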