## Segmenting and Clustering Neighborhoods in the City of Toronto, Canada
#### This is part of the Coursera final capstone project. It involves exploration, segmentation, and clustering of the neighborhoods in the city of Toronto based on the postal code and borough information.

### Part 1: Install Required Packages, Web Scraping, Create and Clean Dataframe

#### Step 1: Install and import required packages and libraries.

In [1]:
!pip install beautifulsoup4
!pip install lxml

import requests # library to handle HTTP requests
import re
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML, display_html
import lxml.html as lh

# transforming a JSON file into a pandas dataframe
# (pandas.io.json.json_normalize is deprecated in favour of pandas.json_normalize)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup # for scraping webpage contents
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python-3.8-main

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2021.5.30          |   py38h578d9bd_0         141 KB  conda-forge
    geographiclib-1.52         |     pyhd8ed1ab_0          35 KB  conda-forge
    geopy-2.2.0                |     pyhd8ed1ab_0          67 KB  conda-forge
    openssl-1.1.1k             |       h7f98852_0         2.1 MB  conda-forge
    python_abi-3.8             |           2_cp38           4 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.4 MB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.52-pyhd8ed1ab_0
  geopy       

#### Step 2: Scrape the data from the source URL (Wikipedia), then wrangle, clean, and load it into a pandas dataframe so that it is in a structured format.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = []
for p in soup.select("td > p"):
    text = p.get_text(strip=True, separator=" ")
    post_code, borough, neighbourhood = re.search(
        r"^(M[^\s]+)\s*([^(]+)(?:\s*(.*))?", text
    ).groups()
    borough = borough.strip()
    neighbourhood = (neighbourhood or "Not Assigned").strip("() ")
    neighbourhood = neighbourhood.replace("(", "/").replace(")", "/")

    data.append((post_code, borough, neighbourhood))

df = pd.DataFrame(data, columns=["Postcode", "Borough", "Neighborhood"])
print(df)

    Postcode           Borough  \
0        M1A      Not assigned   
1        M2A      Not assigned   
2        M3A        North York   
3        M4A        North York   
4        M5A  Downtown Toronto   
..       ...               ...   
175      M5Z      Not assigned   
176      M6Z      Not assigned   
177      M7Z      Not assigned   
178      M8Z         Etobicoke   
179      M9Z      Not assigned   

                                          Neighborhood  
0                                         Not Assigned  
1                                         Not Assigned  
2                                            Parkwoods  
3                                     Victoria Village  
4                           Regent Park / Harbourfront  
..                                                 ...  
175                                       Not Assigned  
176                                       Not Assigned  
177                                       Not Assigned  
178  Mimico NW / The 
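
To see what the regular expression in the scraping cell captures, the following sketch applies the same pattern to two hand-written strings. The sample strings and the helper name `parse_cell` are assumptions made for illustration; they are not taken from the live page.

```python
import re

# Same pattern as in the scraping cell
PATTERN = re.compile(r"^(M[^\s]+)\s*([^(]+)(?:\s*(.*))?")

def parse_cell(text):
    """Split one table cell's text into (postcode, borough, neighbourhood)."""
    post_code, borough, neighbourhood = PATTERN.search(text).groups()
    borough = borough.strip()
    # An empty third group means no neighbourhood was listed for this code
    neighbourhood = (neighbourhood or "Not Assigned").strip("() ")
    neighbourhood = neighbourhood.replace("(", "/").replace(")", "/")
    return post_code, borough, neighbourhood

# Hypothetical cell texts written to resemble the table's layout
assigned = parse_cell("M5A Downtown Toronto (Regent Park / Harbourfront)")
unassigned = parse_cell("M1A Not assigned")
print(assigned)
print(unassigned)
```

The first group takes the postal code, the second takes everything up to the first opening parenthesis, and the optional third group takes the parenthesised remainder.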

In [3]:
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not Assigned
1,M2A,Not assigned,Not Assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


#### Step 3: Remove 'Not assigned' boroughs and fix dataframe index

In [4]:
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
df.index = range(len(df))
df

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business reply mail Processing Ce...,Enclave of M4L
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...
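
An equivalent way to express this step is a boolean mask followed by `reset_index`. The sketch below uses a small toy dataframe (the rows are illustrative, not the scraped data) to show the effect on the index.

```python
import pandas as pd

# Toy data standing in for the scraped table (values are illustrative)
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M2A", "M4A"],
    "Borough": ["Not assigned", "North York", "Not assigned", "North York"],
})

# Keep rows whose borough is assigned, then renumber the index from 0
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(cleaned)
```

This form avoids `inplace=True` and returns a new dataframe, which tends to be easier to reason about when cells are re-run out of order.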


#### Step 4: Print the number of rows and columns in the dataframe

In [5]:
df.shape

(103, 3)

### Part 2: Obtain the Latitude and Longitude Coordinates of each Neighbourhood.

#### Step 1: Use the Geocoder package to get the latitude and the longitude coordinates for all the neighborhoods in the dataframe. 

In [6]:
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [7]:
import geocoder
from geopy.geocoders import Nominatim # to convert an address into latitude and longitude values

#### Step 2: Run a while loop for each postal code to obtain the coordinates. Alternatively, use the Geospatial dataset to get the coordinates.

In [8]:
# Get latitude and longitude using geocoder

# initialize the variable to None
lat_lng_coords = None

# loop until the coordinates are returned
# NOTE: the literal string 'Postal Code' is a placeholder; substitute an actual
# postal code (for example a value from df['Postcode']) to geocode a specific area
while lat_lng_coords is None:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format('Postal Code'))
    lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

print(latitude, longitude)

43.648690000000045 -79.38543999999996
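
A `while` loop with no exit condition can spin forever if the geocoding service keeps returning nothing. One possible safeguard is a bounded retry, sketched below. The helper name `geocode_with_retry` and its parameters are hypothetical, and the lookup is injected as a callable so the example runs without network access; with the geocoder package the callable could be `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable that returns [lat, lng] or None.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before trying again
    return None  # give up instead of looping forever

# Stub lookup that fails twice before succeeding (illustrative values only)
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.0, -79.0]

coords = geocode_with_retry(flaky_lookup, "M5G, Toronto, Ontario")
print(coords, calls["n"])
```

Applied per postal code, the same helper could be mapped over `df['Postcode']`, with `None` results handled separately.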


In [9]:
#read geospatial data file

geos_data = pd.read_csv('https://cocl.us/Geospatial_data')
geos_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Step 3: Merge/append geospatial data with Canadian Neighborhoods data. 

In [11]:
geos_data.rename(columns={'Postal Code':'Postcode'},inplace=True)
df2 = pd.merge(df,geos_data,on='Postcode')
df2.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [12]:
df2.shape

(103, 5)
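
The default inner merge silently drops postal codes that have no match in the coordinates file. As a hedged sketch of how that could be checked, the example below merges two toy frames with pandas' `validate` and `indicator` options; the frames are illustrative stand-ins for `df` and `geos_data`.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
# Deliberately missing M5A to show how an unmatched code is reported
coordinates = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises if a postcode repeats on either side;
# indicator adds a _merge column recording where each row came from
merged = pd.merge(neighborhoods, coordinates, on="Postcode", how="left",
                  validate="one_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

In the notebook, the row count staying at 103 before and after the merge suggests every postal code found a match.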

### Part 3: Explore and Cluster the Neighbourhoods in Toronto

#### Step 1: Extract and create a data frame containing "only" Toronto Boroughs

In [13]:
df3 = df2[df2['Borough'].str.contains('Toronto',regex=False)]
df3

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
9,M5B,Downtown Toronto,"Garden District , Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,Richmond / Adelaide / King,43.650571,-79.384568
31,M6H,West Toronto,Dufferin / Dovercourt Village,43.669005,-79.442259
35,M4J,East York East Toronto,The Danforth East,43.685347,-79.338106


#### Step 2: Visualize the data using Folium

##### Step 2a: Visualize all Toronto neighborhoods, color-coded by cluster

In [19]:
# First, use k-means to cluster the neighborhoods by their coordinates

k = 5
toronto_clustering = df3.drop(columns=['Postcode', 'Borough', 'Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
df3 = df3.copy()
df3.insert(0, 'Cluster Labels', kmeans.labels_)

In [21]:
df3

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighborhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
9,0,M5B,Downtown Toronto,"Garden District , Ryerson",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,2,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,0,M5H,Downtown Toronto,Richmond / Adelaide / King,43.650571,-79.384568
31,4,M6H,West Toronto,Dufferin / Dovercourt Village,43.669005,-79.442259
35,3,M4J,East York East Toronto,The Danforth East,43.685347,-79.338106


In [22]:
# Create map

map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# Set the color scheme for the clusters created in the previous step: one color per cluster.
# This step can be skipped if you don't want the neighbourhoods color-coded.
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighborhood, cluster in zip(df3['Latitude'], df3['Longitude'], df3['Neighborhood'], df3['Cluster Labels']):
    label = folium.Popup(str(neighborhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to k-1, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
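
KMeans labels run from 0 to k-1, so a palette of k colours can be indexed by the label directly. Subtracting 1 still yields distinct colours, but label 0 then wraps around to the last palette entry through negative indexing. The sketch below illustrates this with a placeholder palette; the hex values are illustrative and are not the actual output of `cm.rainbow`.

```python
# Five placeholder colours standing in for the `rainbow` list
palette = ["#8000ff", "#00b5eb", "#80ffb4", "#ffb360", "#ff0000"]

labels = [0, 3, 2, 4, 1]  # KMeans-style labels in the range 0..k-1

direct = [palette[label] for label in labels]       # label i -> colour i
shifted = [palette[label - 1] for label in labels]  # label 0 -> index -1

print(direct[0])   # first colour in the palette
print(shifted[0])  # last colour, because index -1 wraps around
```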

##### Step 2b: Narrow Down the data to Downtown Toronto and create a map for Downtown Toronto only.

In [23]:
# Create a dataframe for Downtown Toronto only. You can also use "DowntownToronto = df3[df3['Borough'] == 'Downtown Toronto']" but it won't re-arrange the row numbers

Downtown_data = df3[df3['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
Downtown_data.head()

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighborhood,Latitude,Longitude
0,0,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
1,0,M5B,Downtown Toronto,"Garden District , Ryerson",43.657162,-79.378937
2,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
4,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


In [24]:
# Obtain coordinates for Downtown Toronto using geolocator

address = 'Downtown Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [25]:
# Create a Folium map of Downtown Toronto centered on the coordinates obtained above

map_DowntownToronto = folium.Map(location=[43.6563221, -79.3809161], zoom_start=13)

# add markers to map
for lat, lng, label in zip(Downtown_data['Latitude'], Downtown_data['Longitude'], Downtown_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_DowntownToronto)
    
map_DowntownToronto

#### Step 3: Utilize Foursquare API to Explore and Segment Neighborhoods

##### Define Foursquare credentials and versions

In [27]:
import os

# Read the Foursquare credentials from environment variables rather than hard-coding them.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are an assumed convention.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20210811' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
print('Credentials loaded.')

Credentials loaded.


##### Explore one of the neighborhoods in Downtown Toronto: Central Bay Street


In [28]:
# Obtain neighborhood name

Downtown_data.loc[4, 'Neighborhood']

'Central Bay Street'

In [30]:
# Obtain Central Bay Street's Coordinates

neighborhood_latitude = Downtown_data.loc[4, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = Downtown_data.loc[4, 'Longitude'] # neighborhood longitude value

neighborhood_name = Downtown_data.loc[4, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Central Bay Street are 43.6579524, -79.3873826.


In [31]:
# Obtain the top 100 venues in Central Bay Street within a 500-meter radius.

LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# display the URL with the credentials masked so they do not end up in the notebook output
url.replace(CLIENT_ID, '<CLIENT_ID>').replace(CLIENT_SECRET, '<CLIENT_SECRET>')

'https://api.foursquare.com/v2/venues/explore?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20210811&ll=43.6579524,-79.3873826&radius=500&limit=100'
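
String formatting works here because every value is URL-safe, but a query string can also be assembled from a dictionary so that each parameter is named and encoded. The following is a minimal sketch using the standard library; the helper name `build_explore_url` and the placeholder credentials `MY_ID` and `MY_SECRET` are hypothetical.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the venues/explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; the coordinates are those of Central Bay Street
demo_url = build_explore_url("MY_ID", "MY_SECRET", "20210811",
                             43.6579524, -79.3873826, 500, 100)
print(demo_url)
```

`urlencode` percent-encodes the comma in `ll` as `%2C`. Passing the same dictionary through the `params` argument of `requests.get` would achieve a similar result without building the string by hand.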

In [32]:
# Send a GET request and examine the results

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '61143f9b79eab071d745342a'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bay Street Corridor',
  'headerFullLocation': 'Bay Street Corridor, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 68,
  'suggestedBounds': {'ne': {'lat': 43.6624524045, 'lng': -79.38117421839567},
   'sw': {'lat': 43.6534523955, 'lng': -79.39359098160432}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '537d4d6d498ec171ba22e7fe',
       'name': "Jimmy's Coffee",
       'location': {'address': '82 Gerrard Street W',
        'crossStreet': 'Gerrard & LaPlante',
        'lat': 43.65842123574496,
        'lng': -79.38561319551111,
        'label
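
The response carries a status code under `meta` and the venue list under `response.groups[0].items`. Before indexing into the payload, it may be worth checking that code. The sketch below does so on two hand-written dictionaries shaped like the response shown here; the helper name `extract_items` and the error payload are assumptions for illustration.

```python
def extract_items(results):
    """Return the recommended venue items, or raise if the API reported an error."""
    code = results.get("meta", {}).get("code")
    if code != 200:
        raise RuntimeError(f"Foursquare request failed with code {code}")
    groups = results["response"].get("groups", [])
    return groups[0]["items"] if groups else []

# Trimmed-down payloads shaped like the response (contents are illustrative)
ok = {"meta": {"code": 200},
      "response": {"groups": [{"items": [{"venue": {"name": "Jimmy's Coffee"}}]}]}}
bad = {"meta": {"code": 429}, "response": {}}

items = extract_items(ok)
print(len(items))
```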

##### Create a 'get_category_type' function to extract the category of the venue

In [34]:
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
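
As a quick check of the function's three paths, the sketch below repeats the definition so that it runs on its own and applies it to hand-written rows. Plain dictionaries stand in for dataframe rows here, which is an assumption for illustration; the category names are sample values.

```python
def get_category_type(row):
    # Fall back to the flattened column name when 'categories' is absent
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# Hand-written rows covering both column names and the empty case
flattened = get_category_type({'venue.categories': [{'name': 'Coffee Shop'}]})
plain = get_category_type({'categories': [{'name': 'Bakery'}]})
empty = get_category_type({'categories': []})
print(flattened, plain, empty)
```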

##### Clean the JSON and structure it into a pandas dataframe

In [35]:
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng', 'venue.location.address']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng,address
0,Jimmy's Coffee,Coffee Shop,43.658421,-79.385613,82 Gerrard Street W
1,Tim Hortons,Coffee Shop,43.65857,-79.385123,70 Gerrard St West
2,Hailed Coffee,Coffee Shop,43.658833,-79.383684,44 Gerrard St W
3,Somethin' 2 Talk About,Middle Eastern Restaurant,43.658395,-79.385338,78 Gerrard St W
4,NEO COFFEE BAR,Coffee Shop,43.66013,-79.38583,770 Bay Street Unit 3


In [37]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

68 venues were returned by Foursquare.


##### Create a Dataframe containing the list of venues and their corresponding coordinates.

In [38]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        # build the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, radius, LIMIT)
        
        # send the GET request and keep only the recommended items
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # keep only the fields we need for each venue
        venues_list.append([(name, lat, lng,
                           v['venue']['name'],
                           v['venue']['location']['lat'],
                           v['venue']['location']['lng'],
                           v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    # NOTE: the right-hand side is a tuple of one-element lists, so each column label
    # becomes a one-element tuple such as ('Neighborhood',); a later cell resets them to plain strings
    nearby_venues.columns = ['Neighborhood'],['Neighborhood Latitude'],['Neighborhood Longitude'],['Venue'],['Venue Latitude'],['Venue Longitude'],['Venue Category']
            
    return nearby_venues

##### Obtain a list of Downtown venues

In [39]:
Downtown_venues = getNearbyVenues(names=Downtown_data['Neighborhood'], latitudes=Downtown_data['Latitude'], longitudes=Downtown_data['Longitude'])

Regent Park / Harbourfront
Garden District , Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond / Adelaide / King
Harbourfront East / Union Station / Toronto Islands
Toronto Dominion Centre / Design Exchange
Commerce Court / Victoria Hotel
University of Toronto / Harbord
Kensington Market / Chinatown / Grange Park
CN Tower / King and Spadina / Railway Lands / Harbourfront West / Bathurst Quay / South Niagara / Island airport
Rosedale
St. James Town / Cabbagetown
First Canadian Place / Underground city
Church and Wellesley


In [40]:
Downtown_venues.head()

Unnamed: 0,"(Neighborhood,)","(Neighborhood Latitude,)","(Neighborhood Longitude,)","(Venue,)","(Venue Latitude,)","(Venue Longitude,)","(Venue Category,)"
0,Regent Park / Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Regent Park / Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Regent Park / Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Regent Park / Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
4,Regent Park / Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


##### Format the column headers appropriately: replace the one-element tuple labels with plain strings

In [41]:
list(Downtown_venues.columns.values)

[('Neighborhood',),
 ('Neighborhood Latitude',),
 ('Neighborhood Longitude',),
 ('Venue',),
 ('Venue Latitude',),
 ('Venue Longitude',),
 ('Venue Category',)]

In [42]:
Downtown_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
Downtown_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park / Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Regent Park / Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Regent Park / Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,Regent Park / Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
4,Regent Park / Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
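
The tuple-style labels come from assigning comma-separated one-element lists, which Python reads as a tuple of lists. One way to avoid the extra renaming step is to pass a flat list of names to the `DataFrame` constructor, as in this sketch. The two rows are copied from the table above, and the name `COLUMNS` is introduced here for illustration.

```python
import pandas as pd

COLUMNS = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
           'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

rows = [
    ('Regent Park / Harbourfront', 43.65426, -79.360636,
     'Roselle Desserts', 43.653447, -79.362017, 'Bakery'),
    ('Regent Park / Harbourfront', 43.65426, -79.360636,
     'Tandem Coffee', 43.653559, -79.361809, 'Coffee Shop'),
]

# A flat list of strings gives plain string column labels
venues = pd.DataFrame(rows, columns=COLUMNS)
print(list(venues.columns))

# For contrast: comma-separated lists form a tuple of lists, not a list of names
tuple_of_lists = ['Neighborhood'], ['Venue']
print(type(tuple_of_lists).__name__)
```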


In [44]:
Downtown_venues.shape

(1117, 7)

##### View venues in the area of interest only: Central Bay Street

In [43]:

CentralBay_venues = Downtown_venues[Downtown_venues['Neighborhood'] == 'Central Bay Street']
CentralBay_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
288,Central Bay Street,43.657952,-79.387383,Jimmy's Coffee,43.658421,-79.385613,Coffee Shop
289,Central Bay Street,43.657952,-79.387383,Tim Hortons,43.658570,-79.385123,Coffee Shop
290,Central Bay Street,43.657952,-79.387383,Hailed Coffee,43.658833,-79.383684,Coffee Shop
291,Central Bay Street,43.657952,-79.387383,Somethin' 2 Talk About,43.658395,-79.385338,Middle Eastern Restaurant
292,Central Bay Street,43.657952,-79.387383,NEO COFFEE BAR,43.660130,-79.385830,Coffee Shop
...,...,...,...,...,...,...,...
351,Central Bay Street,43.657952,-79.387383,Teriyaki Experience,43.659884,-79.387879,Restaurant
352,Central Bay Street,43.657952,-79.387383,Anoush,43.660034,-79.388309,Middle Eastern Restaurant
353,Central Bay Street,43.657952,-79.387383,Mo'Ramyun,43.656148,-79.392282,Korean Restaurant
354,Central Bay Street,43.657952,-79.387383,Valens Restaurants,43.656096,-79.392839,Restaurant
