# Coursera Capstone Project

<h2><center>Segmenting and Clustering Neighborhoods in Toronto</center></h2>

Scrape the Toronto neighborhood data from the Wiki page table. The resulting dataframe should meet the following criteria:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.


## Scraping the wikipedia page

In [2]:
import pandas as pd
import numpy as np
import json
import csv
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

import requests
from pandas.io.json import json_normalize

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium
print('Libraries imported.')

Libraries imported.


In [146]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source)
soup.prettify

<bound method Tag.prettify of 
<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of postal codes of Canada: M - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"Xq7GOApAIIIAA1fsphIAAACW","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":951325562,"wgRevisionId":951325562,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Communications in Ontario","Postal codes in Canada","Tor

In [128]:
# Find the table and the cells from the HTML script
table = soup.find('table',class_='wikitable sortable')
tr_elements = soup.find_all(['tr'])[1:181]
#print(tr_elements)

# Write the table headers and cells into a CSV
with open('toronto_boroughs.csv', 'w', newline='', encoding='utf-8') as f:
    column_headers = ['PostalCode','Borough','Neighborhood']
    writer = csv.writer(f)
    writer.writerow(column_headers)
    for cell in tr_elements:
            td = cell.find_all('td')
            row = [i.text.replace('\n','').replace(' / ',',') for i in td]
            writer.writerow(row)

In [8]:
# Read the CSV into a dataframe
toronto_boroughs = pd.read_csv('../data/raw/toronto_boroughs.csv', header=0)
toronto_boroughs.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park,Harbourfront"


In [9]:
# Remove rows where Borough is 'Not assigned'
indexName_notassigned = toronto_boroughs[toronto_boroughs['Borough'] == 'Not assigned'].index
toronto_boroughs.drop(indexName_notassigned, inplace=True)
toronto_boroughs.reset_index(drop=True)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park,Harbourfront"
3,M6A,North York,"Lawrence Manor,Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road ,Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South,King's Mill Park,Sunnylea,Humbe..."


In [10]:
# Check if there are duplicate postal codes in the PostalCode column
check_duplicate = not toronto_boroughs['PostalCode'].is_unique
check_duplicate

False

In [11]:
# Check rows where Neighborhood is NaN
toronto_boroughs.isna().sum()

PostalCode      0
Borough         0
Neighborhood    0
dtype: int64

In [12]:
toronto_boroughs.shape

(103, 3)

All criteria has been met, resulting dataframe has no rows where Borough is 'Not assigned', no duplicate postal codes and all rows in the Neighborhood columns has a value. The final processed dataframe has 103 rows and 3 columns.

## Getting the neighborhood latitude and longitude

As suggested in the assignment instructions, I will use the provided CSV to add the latitude and longitude to my existing dateframe as the Geocoder package is not as reliable.

In [5]:
geo_coord = pd.read_csv('../data/external/Geospatial_Coordinates.csv')
geo_coord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [23]:
toronto_boroughs_ll = toronto_boroughs.merge(geo_coord, left_on='PostalCode', right_on='Postal Code',
                                            how='outer').drop('Postal Code', axis = 1)
toronto_boroughs_ll

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park,Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor,Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road ,Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South,King's Mill Park,Sunnylea,Humbe...",43.636258,-79.498509


## Exploring & clustering the neighborhoods

Let's start by getting the coordinates of Toronto and plotting a map to visualise the Toronto neighborhoods.

In [25]:
city = 'Toronto, Canada'
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(city)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of {} are {}, {}.'.format(city, latitude, longitude))

The geographical coordinate of Toronto, Canada are 43.6534817, -79.3839347.


In [33]:
# create map of Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add neighborhoods to map
for lat, lng, label in zip(toronto_boroughs_ll['Latitude'], toronto_boroughs_ll['Longitude'], toronto_boroughs_ll['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

Let's explore the first neighborhood, Parkwoods.

In [39]:
secret_dict = {}
with open('../secrets/foursquare_secrets.txt') as f:
    for item in f:
        (key, val) = item.split(':')
        secret_dict[key] = val.strip('\n')

In [53]:
LIMIT = 100
radius = 500
VERSION = '20180605'
neighborhood_latitude = toronto_boroughs_ll.loc[0, 'Latitude']
neighborhood_longitude = toronto_boroughs_ll.loc[0, 'Longitude']
neighborhood_name = toronto_boroughs_ll.loc[0, 'Neighborhood']
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(secret_dict.get('client_id'), secret_dict.get('client_secret'), latitude, longitude, VERSION, radius, LIMIT)
results = requests.get(url).json()

In [45]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [46]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,name,categories,lat,lng
0,Downtown Toronto,Neighborhood,43.653232,-79.385296
1,Nathan Phillips Square,Plaza,43.65227,-79.383516
2,Indigo,Bookstore,43.653515,-79.380696
3,Chatime 日出茶太,Bubble Tea Shop,43.655542,-79.384684
4,Textile Museum of Canada,Art Museum,43.654396,-79.3865


In [47]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

71 venues were returned by Foursquare.


#### Now we repeat what we have done above for all the other neighborhoods by creating a function that repeat the same process.

In [48]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            secret_dict.get('client_id'), 
            secret_dict.get('client_secret'), 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [50]:
toronto_venues = getNearbyVenues(names=toronto_boroughs_ll['Neighborhood'],
                                latitudes=toronto_boroughs_ll['Latitude'],
                                longitudes=toronto_boroughs_ll['Longitude']
                                )

Parkwoods
Victoria Village
Regent Park,Harbourfront
Lawrence Manor,Lawrence Heights
Queen's Park,Ontario Provincial Government
Islington Avenue
Malvern,Rouge
Don Mills
Parkview Hill,Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park,Princess Gardens,Martin Grove,Islington,Cloverdale
Rouge Hill,Port Union,Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate,Bloordale Gardens,Old Burnhamthorpe,Markland Wood
Guildwood,Morningside,West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor,Wilson Heights,Downsview North
Thorncliffe Park
Richmond,Adelaide,King
Dufferin,Dovercourt Village
Scarborough Village
Fairview,Henry Farm,Oriole
Northwood Park,York University
East Toronto
Harbourfront East,Union Station,Toronto Islands
Little Portugal,Trinity
Kennedy Park,Ionview,East Birchmount Park
Bayview Village
Downsview
The Danforth West,Riverdale
Toronto Dominion Centr

In [51]:
print(toronto_venues.shape)
toronto_venues.head()

(2133, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [55]:
toronto_venues.groupby('Neighborhood')['Venue'].count()

Neighborhood
Agincourt                                         4
Alderwood,Long Branch                            10
Bathurst Manor,Wilson Heights,Downsview North    20
Bayview Village                                   4
Bedford Park,Lawrence Manor East                 23
                                                 ..
Willowdale                                       39
Woburn                                            4
Woodbine Heights                                 12
York Mills West                                   4
York Mills,Silver Hills                           1
Name: Venue, Length: 93, dtype: int64

In [56]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 269 unique categories.
