<center><h1>Segmenting and Clustering Neighborhoods of Toronto</h1></center>

<h2>Importing necessary libraries </h2>

In [20]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# While scraping the Wikipedia page, an error occurred indicating that lxml was missing.
# The commented-out conda command in the next cell installs it. A kernel restart was necessary after installation.


print('Libraries imported.')

Libraries imported.


In [1]:
#conda install -c anaconda lxml

<h2>Scraping the table in the Wikipedia URL</h2>

In [4]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
len(dfs)

3

<h2>It turns out that the page contains multiple tables. The first table has the information we need.</h2>

In [5]:
df_table1 = dfs[0]

<h2>Eliminating the rows whose Borough is 'Not assigned'. In the data, whenever the borough is not assigned, the neighborhood appears to be unassigned as well.</h2>

In [6]:
df_toronto = df_table1[df_table1['Borough']!='Not assigned'].reset_index(drop=True)
df_toronto.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
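
The observation that unassigned boroughs also have unassigned neighborhoods can be checked rather than assumed. The following is a minimal sketch on a toy frame; the rows are illustrative stand-ins for the scraped table, and the placeholder text 'Not assigned' in the Neighborhood column is an assumption (it may also appear as a missing value).

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative rows, not the full Wikipedia data)
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

unassigned_borough = sample["Borough"] == "Not assigned"
# Treat both a missing value and the placeholder text as "unassigned"
unassigned_neighborhood = (
    sample["Neighborhood"].isna() | (sample["Neighborhood"] == "Not assigned")
)

# True when every row without a borough also lacks a neighborhood
assumption_holds = bool(unassigned_neighborhood[unassigned_borough].all())

# Rows that keep a borough but still have no neighborhood after filtering
leftover = int((~unassigned_borough & unassigned_neighborhood).sum())

print(assumption_holds, leftover)
```

If `leftover` were greater than zero on the real table, those rows would need a neighborhood value before clustering.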

<h2>The dataframe of interest is shown below, truncated for presentability. For the full table, check the other Jupyter notebook.</h2>

In [7]:
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<h2>Shape attribute of the dataframe, showing the number of rows and columns</h2>

In [8]:
df_toronto.shape

(103, 3)

<h2>An attempt was made to use the geocoder library, but since it took a long time to return coordinates for even one postal code, the CSV file is used instead</h2>

In [9]:

# %pip install geocoder
# import geocoder

# initialize your variable to None
# lat_lng_coords = None
#postal_code = "M5G"

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
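
The commented-out loop above keeps retrying for as long as the service returns no coordinates, so it never exits if the lookup keeps failing. A bounded retry is one way to avoid that. The sketch below is an assumption-laden illustration: `lookup_with_retries` and `fake_lookup` are hypothetical names, and the fake lookup stands in for a real geocoder call so that the example runs without network access.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None  # the caller decides what to do when every attempt failed

# Stand-in for a geocoder call: fails twice, then returns made-up coordinates
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.0, -79.0]

coords = lookup_with_retries(fake_lookup, "M5G, Toronto, Ontario")
print(coords, calls["count"])
```

With a real geocoder, `lookup` would wrap the library call and return its latitude/longitude pair or None.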


<h2>The latitude and longitude data is loaded from the CSV file into a dataframe</h2>

In [10]:
df_lat_long = pd.read_csv("https://cocl.us/Geospatial_data")
df_lat_long.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df_lat_long.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h2>New Latitude and Longitude columns are created in the Toronto neighborhood dataframe and initialized to None</h2>

In [12]:
df_toronto['Latitude'] = None
df_toronto['Longitude'] = None
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,,
1,M4A,North York,Victoria Village,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",,
3,M6A,North York,"Lawrence Manor, Lawrence Heights",,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",,


<h2>In the following cells, the Latitude and Longitude columns of the Toronto neighborhood dataframe are populated</h2>

<h3>This is done by looking up each postal code in the latitude/longitude dataframe, collecting the latitudes and longitudes in lists, and assigning those lists to the corresponding columns of the Toronto neighborhood dataframe</h3>

In [13]:
LatList = []
LongList = []
for index, row in df_toronto.iterrows():
    pc = (df_toronto.at[index, "PostalCode"])
    LatList.append(df_lat_long.query("PostalCode == '"+str(pc)+"'")["Latitude"].tolist()[0])
    LongList.append(df_lat_long.query("PostalCode == '"+str(pc)+"'")["Longitude"].tolist()[0])


In [16]:
df_toronto['Latitude'] = pd.DataFrame(LatList, columns=['Latitude'])
df_toronto['Longitude'] = pd.DataFrame(LongList, columns=['Longitude'])
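
The same lookup can also be expressed as a single join on the postal code. The sketch below shows the idea with `DataFrame.merge` on two small frames; the frame names and the rounded coordinates are illustrative assumptions, not the values from the CSV file.

```python
import pandas as pd

# Small stand-ins for the neighborhood and coordinate dataframes
neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coordinates = pd.DataFrame({
    "PostalCode": ["M4A", "M5A", "M3A"],   # deliberately in a different order
    "Latitude": [43.7, 43.6, 43.8],        # illustrative values
    "Longitude": [-79.3, -79.4, -79.2],
})

# A left join keeps every neighborhood row in its original order;
# validate raises an error if a postal code is duplicated on either side
merged = neighborhoods.merge(
    coordinates, on="PostalCode", how="left", validate="one_to_one"
)

# Postal codes without a match would show up as missing latitudes
missing = int(merged["Latitude"].isna().sum())
print(merged)
print(missing)
```

A join avoids the row-by-row loop and makes unmatched postal codes visible as missing values instead of raising an error on the first one.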

<h2>Filtering Boroughs that contain the word 'Toronto'</h2>

In [33]:
toronto_data = df_toronto[df_toronto['Borough'].str.contains("Toronto", case=False)].reset_index(drop=True)
toronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


<h2>Mapping the filtered Boroughs from the previous step</h2>

In [37]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

<h2>Getting the coordinates of Toronto to center the map</h2>

In [38]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<h2>Mapping locations</h2>

In [39]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<h2>Foursquare credentials</h2>

In [40]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


<h2>Functions to obtain the top venues across all Toronto neighborhoods</h2>

In [41]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
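
A short usage sketch may make the fallback between the two keys clearer. The function is repeated here only so the example runs on its own, and the rows are hypothetical dictionaries shaped like flattened venue records rather than real API output.

```python
def get_category_type(row):
    """Same logic as the cell above, repeated so this sketch is self-contained."""
    try:
        categories_list = row['categories']
    except KeyError:
        # Fall back to the column name produced by flattening nested JSON
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows covering the three cases the function handles
rows = [
    {'categories': [{'name': 'Coffee Shop'}]},
    {'venue.categories': [{'name': 'Park'}, {'name': 'Trail'}]},
    {'categories': []},
]

names = [get_category_type(r) for r in rows]
print(names)
```

Only the first category of each venue is returned, and a venue with no categories yields None.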

In [43]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    LIMIT=100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
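
The function above indexes straight into the response, so a failed request or a venue without categories would raise a KeyError or IndexError. A more tolerant way to flatten one response is sketched below. `extract_venues` is a hypothetical helper, the payload is hand-written to imitate the nesting used above, and the shape of the error payload is an assumption.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Use None instead of failing when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows

# Hand-written payload that imitates the nesting used above (not real API output)
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Spot B",
               "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]}]}}

rows = extract_venues(payload, "Regent Park, Harbourfront", 43.65426, -79.360636)

# A payload without a "response" key yields no rows instead of an exception
empty = extract_venues({"meta": {"code": 429}}, "Christie", 43.669542, -79.422564)
print(rows)
print(empty)
```

The tuples have the same seven fields, in the same order, as the columns assigned to `nearby_venues`.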
    

<h2>Counting the number of interesting venues in every Toronto neighborhood</h2>

In [44]:
toronto_venue_aggregate = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  ).groupby('Neighborhood').count()


In [45]:
toronto_venue_count = toronto_venue_aggregate.filter(['Neighborhood', 'Venue'])
toronto_venue_count.rename(columns={'Venue':'# of interesting venues'}, inplace=True)
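
The count, filter, and rename steps can also be written as one chain. The following is a minimal sketch on a toy venue table; the rows are made up for illustration, and only the column names follow the ones used above.

```python
import pandas as pd

# Toy venue table with the same column names as nearby_venues (made-up rows)
venues = pd.DataFrame({
    "Neighborhood": ["Christie", "Christie", "Rosedale", "Christie"],
    "Venue": ["Cafe A", "Park B", "Trail C", "Bar D"],
    "Venue Category": ["Cafe", "Park", "Trail", "Bar"],
})

# Count venues per neighborhood and name the resulting column in one pass
venue_count = (
    venues.groupby("Neighborhood")["Venue"]
    .count()
    .rename("# of interesting venues")
    .to_frame()
)
print(venue_count)
```

Selecting the `Venue` column before counting avoids computing counts for columns that are discarded afterwards.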

In [46]:
toronto_venue_count

Neighborhood,# of interesting venues
Berczy Park,56
"Brockton, Parkdale Village, Exhibition Place",23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",18
Central Bay Street,65
Christie,16
Church and Wellesley,78
"Commerce Court, Victoria Hotel",100
Davisville,31
Davisville North,9


<h2>Neighborhoods to spend time at (100 venues, the limit set for each request)</h2>

In [47]:
toronto_venue_count.loc[(toronto_venue_count['# of interesting venues'] >= 100)]

Neighborhood,# of interesting venues
"Commerce Court, Victoria Hotel",100
"First Canadian Place, Underground city",100
"Garden District, Ryerson",100
"Harbourfront East, Union Station, Toronto Islands",100
"Toronto Dominion Centre, Design Exchange",100


<h2>Neighborhoods that can be avoided (10 or fewer venues)</h2>

In [48]:
toronto_venue_count.loc[(toronto_venue_count['# of interesting venues'] <= 10)]

Neighborhood,# of interesting venues
Davisville North,9
"Forest Hill North & West, Forest Hill Road Park",4
Lawrence Park,3
"Moore Park, Summerhill East",2
Rosedale,4
Roselawn,3
The Beaches,4
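
Because each request is capped at LIMIT=100 venues, a count of exactly 100 means "at least 100" rather than an exact total. The sketch below applies both thresholds to a few counts taken from the tables above; the variable names `capped` and `sparse` are illustrative.

```python
import pandas as pd

LIMIT = 100  # same cap as in getNearbyVenues
col = "# of interesting venues"

# A few counts copied from the outputs above
counts = pd.DataFrame(
    {col: [100, 56, 9, 3]},
    index=pd.Index(
        ["Commerce Court, Victoria Hotel", "Berczy Park",
         "Davisville North", "Lawrence Park"],
        name="Neighborhood",
    ),
)

# Neighborhoods that hit the request limit, so the true count may be higher
capped = counts[counts[col] >= LIMIT]

# Neighborhoods with 10 or fewer venues, fewest first
sparse = counts[counts[col] <= 10].sort_values(col)

print(capped)
print(sparse)
```

Raising the limit or paging through results would be needed to rank the capped neighborhoods against each other.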
