# Coursera Capstone Project

<h2><center>Segmenting and Clustering Neighborhoods in Toronto</center></h2>

Scrape the Toronto neighborhood data from the table on the Wikipedia page. The resulting dataframe should meet the following criteria:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
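
The criteria above can be sketched in pandas as follows. This is a minimal illustration on a made-up four-row frame (`raw` and its values are assumptions for the example, not the scraped data), and it is one possible way to apply the rules rather than the approach this notebook takes.

```python
import pandas as pd

# Toy rows shaped like the Wikipedia table (illustrative values only).
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# 1. Keep only rows with an assigned borough.
df = raw[raw["Borough"] != "Not assigned"].copy()

# 2. Where the neighborhood is not assigned, reuse the borough name.
unassigned = df["Neighborhood"] == "Not assigned"
df.loc[unassigned, "Neighborhood"] = df.loc[unassigned, "Borough"]

# 3. Collapse repeated postal codes into one row, joining neighborhoods with a comma.
df = (
    df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(df)
print(df.shape)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.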


## Scraping the Wikipedia page

In [153]:
import pandas as pd
import numpy as np
import json
import csv
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

import requests
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium
print('Libraries imported.')

Libraries imported.


In [146]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'html.parser')  # name the parser explicitly to avoid a warning
soup.prettify

<bound method Tag.prettify of 
<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
...

In [128]:
# Find the table and its rows in the parsed HTML
table = soup.find('table', class_='wikitable sortable')
tr_elements = table.find_all('tr')[1:]  # search inside the table only; skip the header row

# Write the table headers and cells into a CSV
with open('toronto_boroughs.csv', 'w', newline='', encoding='utf-8') as f:
    column_headers = ['PostalCode','Borough','Neighborhood']
    writer = csv.writer(f)
    writer.writerow(column_headers)
    for cell in tr_elements:
        td = cell.find_all('td')
        row = [i.text.replace('\n', '').replace(' / ', ',') for i in td]
        writer.writerow(row)

In [129]:
# Read the CSV into a dataframe
toronto_boroughs = pd.read_csv('toronto_boroughs.csv', header=0)
toronto_boroughs.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park,Harbourfront"
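
The CSV round trip above is not strictly required. As a hedged alternative, the sketch below builds a dataframe directly from the parsed rows. The `html` string is a hypothetical three-row stand-in for the Wikipedia table, so the example runs without a network call, and `sample` is a name introduced only for this illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in markup, not the real page.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park / Harbourfront</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # header rows hold <th> cells, so they have no <td>
        continue
    # Strip surrounding whitespace and turn ' / ' separators into commas.
    rows.append([c.get_text(strip=True).replace(" / ", ",") for c in cells])

sample = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample)
```

Skipping rows without `<td>` cells removes the need to slice the row list by position.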


In [131]:
# Remove rows where Borough is 'Not assigned' and renumber the index
toronto_boroughs = toronto_boroughs[toronto_boroughs['Borough'] != 'Not assigned'].reset_index(drop=True)
toronto_boroughs

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park,Harbourfront"
3,M6A,North York,"Lawrence Manor,Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road ,Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South,King's Mill Park,Sunnylea,Humbe..."


In [147]:
# Check if there are duplicate postal codes in the PostalCode column
check_duplicate = not toronto_boroughs['PostalCode'].is_unique
check_duplicate

False

In [148]:
# Count missing values (NaN) in each column, including Neighborhood
toronto_boroughs.isna().sum()

PostalCode      0
Borough         0
Neighborhood    0
dtype: int64

In [149]:
toronto_boroughs.shape

(103, 3)

All criteria have been met: the resulting dataframe has no rows where Borough is 'Not assigned', no duplicate postal codes, and a value in every row of the Neighborhood column. The final processed dataframe has 103 rows and 3 columns.
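
The same checks can be gathered into one helper so they are easy to rerun. The function `check_criteria` and the two-row `demo` frame are hypothetical names introduced for this sketch. With the notebook's data, `toronto_boroughs` would be passed instead.

```python
import pandas as pd


def check_criteria(df: pd.DataFrame) -> dict:
    """Return one boolean per criterion (hypothetical helper, not part of the notebook)."""
    return {
        "columns": list(df.columns) == ["PostalCode", "Borough", "Neighborhood"],
        "no_unassigned_borough": not (df["Borough"] == "Not assigned").any(),
        "unique_postal_codes": bool(df["PostalCode"].is_unique),
        "no_missing_neighborhood": not df["Neighborhood"].isna().any(),
    }


# Illustrative frame using two rows that appear in the output above.
demo = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    }
)

results = check_criteria(demo)
print(results)
```

Returning a dictionary rather than raising on the first failure shows every unmet criterion at once.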

## Getting the neighborhood latitude and longitude from Foursquare

As suggested in the assignment instructions, I will use the provided CSV to add the latitude and longitude to my existing dataframe, because the Geocoder package is less reliable.

In [155]:
geo_coord = pd.read_csv('Geospatial_Coordinates.csv')
geo_coord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [162]:
toronto_boroughs_ll = toronto_boroughs.merge(geo_coord, left_on='PostalCode', right_on='Postal Code',
                                            how='outer')
toronto_boroughs_ll

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park,Harbourfront",M5A,43.654260,-79.360636
3,M6A,North York,"Lawrence Manor,Lawrence Heights",M6A,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government",M7A,43.662301,-79.389494
...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road ,Old Mill North",M8X,43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,M4Y,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing CentrE,M7Y,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South,King's Mill Park,Sunnylea,Humbe...",M8Y,43.636258,-79.498509
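
The merged frame above carries both `PostalCode` and `Postal Code`. One way to avoid the duplicate key column, sketched here under the assumption that each postal code appears exactly once in both frames, is to rename the key before merging and use a left join. The small `boroughs` and `coords` frames are illustrative and reuse two rows from the outputs above.

```python
import pandas as pd

boroughs = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M4A", "M3A"],  # row order need not match
        "Latitude": [43.725882, 43.753259],
        "Longitude": [-79.315572, -79.329656],
    }
)

merged = boroughs.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),  # align the key name
    on="PostalCode",
    how="left",             # keep every borough row, even without coordinates
    validate="one_to_one",  # raise if a postal code repeats on either side
)

# Confirm that every row received coordinates.
print(merged[["Latitude", "Longitude"]].notna().all())
print(merged)
```

A left join keeps the row count of the borough frame fixed, while `validate` turns an unexpected duplicate key into an error instead of silently adding rows.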
