## Toronto Neighbourhoods Segmentation and Clustering

<img src="https://typicalbritto.files.wordpress.com/2015/03/mapa-de-dosbarrios-11.jpg" alt="Toronto Neighborhoods" align="left">

<p>&nbsp;</p>
<p><strong>Step 1: Scrape the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a> to obtain the table of postal codes, and load that table into a pandas dataframe.</strong></p>
<p>&nbsp;</p>

In [47]:
#Install the packages if required (remove #)
#conda install -c conda-forge lxml
#conda install -c anaconda beautifulsoup4

In [69]:
from pandas.io.html import read_html

page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikitables = read_html(page, index_col=0, attrs={"class":"wikitable"})

print("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 1 wikitables


In [70]:
toronto_postal_codes = wikitables[0]
toronto_postal_codes.head()

Postcode,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront


In [71]:
toronto_postal_codes.shape

(288, 2)

<p>&nbsp;</p>
<p><strong>Step 2: Processing the dataframe according to the assignment instructions below:</strong></p>
<ul>
<li>The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood</li>
</ul>
<p>&nbsp;</p>

In [72]:
toronto_postal_codes.reset_index(inplace=True)

In [73]:
toronto_postal_codes.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

<p>&nbsp;</p><ul>
<li>Only process the cells that have an assigned borough. Ignore cells with a borough that is&nbsp;<strong>Not assigned.</strong></li>
</ul>
<p>&nbsp;</p>

In [74]:
condition = toronto_postal_codes[ toronto_postal_codes['Borough'] == 'Not assigned' ].index
 
# Delete these row indexes from dataFrame
toronto_postal_codes.drop(condition , inplace=True)

In [75]:
toronto_postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
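
The same filter can also be written as a boolean mask, which avoids collecting index labels first. The following is a minimal sketch on a small hand-made frame: the rows are copied from the outputs above, and the names `sample` and `assigned` are chosen here purely for illustration.

```python
import pandas as pd

# Toy rows mirroring the scraped table; this is not the full dataset.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned, then renumber the index.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Unlike `drop(..., inplace=True)`, this returns a new frame and leaves `sample` untouched, which makes the cell safe to re-run.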


<p>&nbsp;</p><ul>
<li>More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that&nbsp;<strong>M5A</strong>&nbsp;is listed twice and has two neighborhoods:&nbsp;<strong>Harbourfront&nbsp;</strong>and&nbsp;<strong>Regent Park</strong>. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in&nbsp;<strong>row 11&nbsp;</strong>in the above table.</li>
</ul><p>&nbsp;</p>

In [76]:
toronto_postal_codes = toronto_postal_codes.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()

In [77]:
toronto_postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
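
To see what the grouping step does to a duplicated postal code, here is a small sketch using the M5A rows mentioned in the instructions. The frame `sample` is hand-made for illustration, and `agg` is used here in place of `apply`; for a plain string join the two should give the same result.

```python
import pandas as pd

# Two rows share postal code M5A; one row (M3A) is unique.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group by postal code and borough, then join the neighbourhood names
# of each group into one comma-separated string.
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Note that `groupby` sorts by the group keys by default, which is why the real output above starts at M1B rather than M3A.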


<p>&nbsp;</p><ul>
<li>If a cell has a borough but a&nbsp;<strong>Not assigned&nbsp;</strong>neighborhood, then the neighborhood will be the same as the borough. So for the&nbsp;<strong>9th</strong>&nbsp;cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be&nbsp;<strong>Queen's Park.</strong></li>
</ul><p>&nbsp;</p>

In [78]:
condition = toronto_postal_codes[ toronto_postal_codes['Neighbourhood'] == 'Not assigned' ].index

In [79]:
for i in condition:
    # Use a single .loc[row, column] call; chained indexing may assign to a copy
    toronto_postal_codes.loc[i, 'Neighbourhood'] = toronto_postal_codes.loc[i, 'Borough']

In [80]:
toronto_postal_codes.loc[(toronto_postal_codes['Postcode'] == 'M7A')]

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park
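
The replacement can also be done without a loop. Below is a minimal sketch on a hand-made two-row frame (`sample` and `unassigned` are illustrative names) that assigns through a boolean mask in a single `.loc` call.

```python
import pandas as pd

# One row with an unassigned neighbourhood, one regular row.
sample = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask of the rows that need fixing.
unassigned = sample["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood column for those rows only.
sample.loc[unassigned, "Neighbourhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```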


<p>&nbsp;</p><ul>
<li>Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.</li>
<li>In the last cell of your notebook, use the&nbsp;<strong>.shape</strong>&nbsp;method to print the number of rows of your dataframe.</li>
</ul><p>&nbsp;</p>

In [81]:
toronto_postal_codes.shape

(103, 3)

<p>&nbsp;</p><p><strong>Step 3: Now we need to get the latitude and the longitude coordinates of each neighborhood in order to utilize the Foursquare location data.</strong></p><p>&nbsp;</p>

In [82]:
# Tried to use geocoder with no luck. Will use the CSV file offered by Coursera
import pandas as pd
df_coord = pd.read_csv("https://cocl.us/Geospatial_data")
df_coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [83]:
toronto_postal_codes = toronto_postal_codes.join(df_coord.set_index('Postal Code'), on='Postcode')

In [84]:
toronto_postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
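
A left join silently leaves `NaN` coordinates for any postal code that is missing from the CSV file. One way to check for that, sketched here on toy data, is `merge` with `indicator=True`. The code `M9Z` and its borough are made up for the demonstration, and `codes`, `coords`, and `missing` are illustrative names.

```python
import pandas as pd

# Toy postal codes; "M9Z" is a made-up code with no coordinates.
codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a "_merge" column saying where each row came from.
merged = codes.merge(
    coords, how="left", left_on="Postcode", right_on="Postal Code", indicator=True
)

# Rows marked "left_only" found no match in the coordinates table.
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```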


<p>&nbsp;</p><p><strong>Step 4: Finally, let's explore and cluster the neighborhoods in Toronto.</strong></p>
<ul>
<li> Getting the latitude and longitude values of Toronto using the geopy library.</li>
<li> Creating a map of Toronto with neighborhoods superimposed on top.</li>
</ul><p>&nbsp;</p>

In [64]:
#Install the packages if required (remove #)
#!conda install -c conda-forge geopy --yes 
#!conda install -c conda-forge folium=0.5.0 --yes

In [85]:
# Import the dependencies that will be required
import numpy as np  # library to handle data in a vectorized manner

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [86]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
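
geopy's `geocode` returns `None` when it finds no match, in which case reading `.latitude` raises an `AttributeError`. The sketch below wraps the lookup in a small helper that handles that case. The helper `lookup_coordinates` and the stub `fake_geocode` are hypothetical names introduced here so the example runs offline; in the notebook you would pass `geolocator.geocode` instead of the stub.

```python
from types import SimpleNamespace


def lookup_coordinates(geocode, address):
    """Return (latitude, longitude) for address, or None if nothing is found."""
    location = geocode(address)
    if location is None:
        return None
    return location.latitude, location.longitude


def fake_geocode(address):
    """Offline stand-in for a geocoder; knows a single address."""
    known = {
        "Toronto, Canada": SimpleNamespace(latitude=43.653963, longitude=-79.387207),
    }
    return known.get(address)


print(lookup_coordinates(fake_geocode, "Toronto, Canada"))
print(lookup_coordinates(fake_geocode, "Nowhere in particular"))
```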


In [87]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(toronto_postal_codes['Latitude'], toronto_postal_codes['Longitude'], toronto_postal_codes['Borough'], toronto_postal_codes['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<p>&nbsp;</p><ul>
<li> Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them. </li>
</ul><p>&nbsp;</p>

In [88]:
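# Read the credentials from a local file so they are not stored in the notebook
with open('C:/Users/lcuzacov/Desktop/config.json') as config_file:
    config = json.load(config_file)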
CLIENT_ID = config['CLIENT_ID']
CLIENT_SECRET = config['CLIENT_SECRET']
VERSION = config['VERSION']
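
If a key is missing from the config file, the lookups above fail with a bare `KeyError`. A small loader that checks all three keys up front may give a clearer message. This is only a sketch: `load_foursquare_config` and `REQUIRED_KEYS` are hypothetical names, and the demo writes placeholder values to a temporary file rather than using real credentials.

```python
import json
import os
import tempfile

# The three keys the notebook reads from the config file.
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "VERSION")


def load_foursquare_config(path):
    """Load the JSON config and fail early if any expected key is absent."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("Missing config keys: {}".format(", ".join(missing)))
    return config


# Demo with a temporary file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(
            {"CLIENT_ID": "your-id", "CLIENT_SECRET": "your-secret", "VERSION": "20180605"},
            f,
        )
    demo_config = load_foursquare_config(demo_path)

print(sorted(demo_config))
```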