<h1 style="text-align: center;">IBM Data Science Professional Certificate - Capstone</h1>
<p>This notebook contains the capstone project for the IBM Data Science Professional Certificate on Coursera. <a href="https://www.coursera.org/professional-certificates/ibm-data-science">View details</a></p>
<h2>Part I: Segmenting and Clustering Neighborhoods in Toronto</h2>

<p>In Part I of the project, we explore, segment, and cluster the neighborhoods of Toronto, Canada. Our dataset comes from a <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Wikipedia page</a> that has all the information we need. We will scrape the table of postal codes from that page, clean the data, and load it into a pandas dataframe with three columns: PostalCode, Borough, and Neighborhood.</p>

<p>Once the data is in a structured format, we will conduct an analysis to explore and cluster the neighborhoods in the city of Toronto. Specifically:
    <ul>
        <li>We will convert postal codes into their equivalent latitude and longitude values.</li>
        <li>We will use the Foursquare API to explore neighborhoods in Toronto, using its explore function to get the most common venue categories in each neighborhood.</li>
        <li>We will then use the most common venue categories as features to group the neighborhoods into clusters with the k-means clustering algorithm.</li>
        <li>Finally, we will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.</li>
    </ul>
</p>
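
To make the clustering step in the list above concrete, here is a minimal sketch of k-means applied to venue-category frequencies. The neighborhood names (`A` to `D`), the category columns, and all numbers are made up for illustration; they are not Foursquare results, and the choice of two clusters is an assumption for this toy data only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venue-category frequencies per neighborhood (each row sums to 1).
freq = pd.DataFrame(
    {
        "Coffee Shop": [0.8, 0.7, 0.1, 0.0],
        "Park": [0.1, 0.2, 0.1, 0.2],
        "Restaurant": [0.1, 0.1, 0.8, 0.8],
    },
    index=["A", "B", "C", "D"],  # hypothetical neighborhood names
)

# Fit k-means with a fixed seed so repeated runs give the same grouping.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)

# Attach the cluster label to each neighborhood.
clusters = pd.Series(kmeans.labels_, index=freq.index, name="Cluster")
print(clusters)
```

The label numbers themselves are arbitrary; what matters is which neighborhoods share a label. Here the coffee-heavy rows should land in one cluster and the restaurant-heavy rows in the other.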

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [8]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't downloaded the package
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't downloaded the package
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### 1. Scraping the Wikipedia page and building the dataset

#### Scrape the Wikipedia page and transform the data into a pandas dataframe

In [9]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_list = pd.read_html(url)
# our dataset is the first table
df_toronto = df_list[0]
df_toronto.shape
#df_toronto.head()

(288, 3)

#### Rename Postcode->PostalCode and Neighbourhood->Neighborhood

In [10]:
df_toronto.columns = ['PostalCode','Borough','Neighborhood']
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Drop rows with a borough that is Not assigned.

In [11]:
df_toronto.drop(df_toronto[df_toronto['Borough'] == 'Not assigned'].index, inplace=True)
df_toronto.shape
#df_toronto.head()

(211, 3)
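
The same filter can also be written as a boolean mask that keeps the wanted rows and returns a new dataframe instead of mutating in place. A minimal sketch on a made-up three-row frame (`toy` and `assigned` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative rows only).
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned; reset_index gives a clean 0..n-1 index.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned.shape)
```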

#### Combine rows with the same postal code, separating multiple neighborhoods with commas

In [12]:
df_toronto_postalcode = df_toronto.groupby('PostalCode', as_index=False).agg(lambda x: ', '.join(set(x.dropna())))
df_toronto_postalcode.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Morningside, West Hill, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
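
One caveat: a Python `set` does not guarantee ordering, so the joined neighborhood names can come out in a different order than on the source page. A sketch of an order-preserving variant, assuming a hypothetical helper `join_unique` and made-up rows:

```python
import pandas as pd

# Toy rows: one postal code with two neighborhoods, one with a single neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

def join_unique(values):
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return ", ".join(dict.fromkeys(values.dropna()))

# Apply the helper to every non-key column within each postal code group.
combined = toy.groupby("PostalCode", as_index=False).agg(join_unique)
print(combined)
```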


#### If a row has a borough but its neighborhood is Not assigned, set the neighborhood to the borough

In [13]:
# find row with no Neighborhood
df_toronto_postalcode[df_toronto_postalcode['Neighborhood'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


In [14]:
df_toronto_postalcode.loc[df_toronto_postalcode['Neighborhood'] == 'Not assigned', 'Neighborhood' ] = df_toronto_postalcode['Borough']
print(df_toronto_postalcode.loc[85])

PostalCode               M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 85, dtype: object
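
The assignment above relies on pandas aligning the right-hand `Borough` series to the selected rows by index. An equivalent way to express the same rule is `Series.mask`, sketched here on a made-up two-row frame:

```python
import pandas as pd

# Toy rows: one with a real neighborhood, one marked 'Not assigned'.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Where the neighborhood is 'Not assigned', take the value from Borough instead.
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)
print(toy)
```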


In [15]:
df_toronto_postalcode.shape[0]

103

### **** END OF ASSIGNMENT WEEK 3

### 2.  Data analysis to explore and cluster the neighborhoods in the city of Toronto

Next, let's loop through the data and add two new columns (Latitude and Longitude) to the dataframe, one row at a time.