Applied Data Science Capstone Project
Week 3: Segmenting and Clustering Neighborhoods in Toronto

Prepare environment and import/install required packages...

In [4]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/e8/b5/7bb03a696f2c9b7af792a8f51b82974e51c268f15e925fc834876a4efa0b/beautifulsoup4-4.9.0-py3-none-any.whl (109kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/05/cf/ea245e52f55823f19992447b008bcbb7f78efc5960d77f6c34b5b45b36dd/soupsieve-2.0-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.0 soupsieve-2.0
Note: you may need to restart the kernel to use updated packages.


In [6]:
import pandas as pd

import requests

from bs4 import BeautifulSoup

In [7]:
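# Download the Wikipedia page listing the postal codes that start with "M" (Toronto)
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
req.raise_for_status()  # stop early if the page could not be fetched

soup = BeautifulSoup(req.content, 'lxml')

# Take the first <table> on the page, which holds the postal codes in this run
table = soup.find_all('table')[0]

# read_html returns a list of DataFrames, one per table it parses
df = pd.read_html(str(table))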

In [8]:
# df is a list; its first element is the postal-code table
neighborhood = pd.DataFrame(df[0])
neighborhood.set_index(['Postal code'], inplace=True)
neighborhood

Postal code,Borough,Neighborhood
M1A,Not assigned,
M2A,Not assigned,
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Regent Park / Harbourfront
...,...,...
M5Z,Not assigned,
M6Z,Not assigned,
M7Z,Not assigned,
M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


Clean up the dataframe by deleting rows with 'Not assigned' in the Borough column.

In [9]:
# Get names of indexes for which column Borough has value 'Not assigned'
indexNames = neighborhood[ neighborhood['Borough'] == 'Not assigned' ].index
# Delete these row indexes from dataFrame
neighborhood.drop(indexNames , inplace=True)
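
The same clean-up can also be written as a boolean mask, which skips the intermediate index lookup. A minimal sketch on a toy frame (three rows copied from the preview above, not the full table; the names `toy` and `cleaned` are illustrative only):

```python
import pandas as pd

# Toy stand-in for the scraped table (values copied from the preview above)
toy = pd.DataFrame(
    {
        "Borough": ["Not assigned", "North York", "Downtown Toronto"],
        "Neighborhood": [None, "Parkwoods", "Regent Park / Harbourfront"],
    },
    index=pd.Index(["M1A", "M3A", "M5A"], name="Postal code"),
)

# Keep only the rows whose Borough is assigned
cleaned = toy[toy["Borough"] != "Not assigned"]
print(cleaned.index.tolist())
```

Unlike `drop(..., inplace=True)`, this returns a new frame and leaves the original untouched, which makes the cell safe to re-run.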

Review the dataframe after clean-up...

In [10]:
neighborhood

Postal code,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Regent Park / Harbourfront
M6A,North York,Lawrence Manor / Lawrence Heights
M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...
M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
M4Y,Downtown Toronto,Church and Wellesley
M7Y,East Toronto,Business reply mail Processing CentrE
M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...


Review shape of dataframe...

In [11]:
neighborhood.shape

(103, 2)

Retrieve latitude and longitude data for each postal code from the provided CSV file, create a dataframe, and review...

In [12]:
# Read the latitude/longitude data from 'https://cocl.us/Geospatial_data'
LocationData = pd.read_csv("https://cocl.us/Geospatial_data")
LocationData.rename(columns = {'Postal Code':'Postal code'}, inplace = True)
LocationData.set_index(['Postal code'], inplace = True)
# Preview the first 5 lines of the loaded data 
LocationData.head()

Postal code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


Join the "neighborhood" and "LocationData" dataframes to create "TorontoDF" as the final dataset...

In [13]:
# Merge two Dataframes on index of both the dataframes
TorontoDF = neighborhood.merge(LocationData, left_index=True, right_index=True)
TorontoDF.head()

Postal code,Borough,Neighborhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
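
An index-on-index merge defaults to an inner join, so a postal code missing from either side would be dropped without warning. A hedged sketch of one way to check for that, using toy frames (the `M9Z` row is made up for illustration; the coordinates are copied from the output above) and the `how`, `validate`, and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

left = pd.DataFrame(
    {"Borough": ["North York", "North York", "Etobicoke"]},
    index=pd.Index(["M3A", "M4A", "M9Z"], name="Postal code"),  # M9Z is hypothetical
)
right = pd.DataFrame(
    {"Latitude": [43.753259, 43.725882], "Longitude": [-79.329656, -79.315572]},
    index=pd.Index(["M3A", "M4A"], name="Postal code"),
)

# A left join keeps every row of `left`; `indicator` records where each row matched,
# and `validate` raises if a postal code appears more than once on either side
merged = left.merge(
    right,
    left_index=True,
    right_index=True,
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Postal codes that found no coordinates
unmatched = merged.index[merged["_merge"] == "left_only"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join lost nothing.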


In [14]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(TorontoDF['Borough'].unique()),
        TorontoDF.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


Create map of Toronto, Canada...

In [15]:
conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          92 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.21.0-py_0



Downloading and Extracting Packages
geopy-1.21.0         | 58 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ##################################### |

In [16]:
import numpy as np # library to handle data in a vectorized manner

#import pandas as pd # library for data analysis (already imported above)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # no-op here: geopy was installed in the previous cell
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


In [19]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Canada are 43.6534817, -79.3839347.
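
The geocoder returns `None` when it finds no match, in which case reading `.latitude` fails with an `AttributeError`. A small sketch of a guard, assuming a hypothetical helper `coordinates_or_fail` (not part of geopy) that accepts any callable shaped like `geolocator.geocode`; an offline stand-in is used here so the example runs without network access:

```python
from types import SimpleNamespace


def coordinates_or_fail(geocode, address):
    """Return (latitude, longitude) for address, or raise if nothing was found."""
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude


def fake_geocode(address):
    # Offline stand-in for a real geocoder; knows only one address
    known = {
        "Toronto, Canada": SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)
    }
    return known.get(address)


print(coordinates_or_fail(fake_geocode, "Toronto, Canada"))
```

With the real geocoder the call would be `coordinates_or_fail(geolocator.geocode, address)`.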


In [21]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
# use 'name' as the loop variable so the 'neighborhood' dataframe is not overwritten
for lat, lng, borough, name in zip(TorontoDF['Latitude'], TorontoDF['Longitude'], TorontoDF['Borough'], TorontoDF['Neighborhood']):
    label = '{}, {}'.format(name, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [22]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


In [25]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=YOUR_FOURSQUARE_CLIENT_ID&client_secret=YOUR_FOURSQUARE_CLIENT_SECRET&v=20180605&ll=43.6534817,-79.3839347&radius=500&limit=100'
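
String formatting leaves the query string unescaped and ties the credentials to the notebook source. A minimal sketch of an alternative, assuming the credentials are stored in environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (hypothetical names, as is the helper `build_explore_url`), with the standard-library `urllib.parse.urlencode` assembling the query:

```python
import os
from urllib.parse import urlencode


def build_explore_url(lat, lng, radius=500, limit=100, version="20180605"):
    """Build the venues/explore URL with credentials read from the environment."""
    params = {
        # Assumed variable names; an empty string is used if they are unset
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", ""),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", ""),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "ll" readable instead of encoding it as %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")
```

Calling `build_explore_url(latitude, longitude)` would then give a URL of the same shape as the one above without the secret appearing in the notebook source.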