# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this lab, we explore neighborhoods in Toronto, Canada, using the Foursquare API, and segment them into clusters by their most common venues using k-means clustering.
The results are then visualized with Folium.

# 1. Download and Explore Dataset

## 1.1 Process Postal data from wikipedia 

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, then transform the data into a pandas dataframe.

In [1]:
#Download data
!wget https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M -O canada_postal_code.xml

--2019-05-26 13:59:39--  https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Resolving en.wikipedia.org (en.wikipedia.org)... 103.102.166.224
Connecting to en.wikipedia.org (en.wikipedia.org)|103.102.166.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79293 (77K) [text/html]
Saving to: ‘canada_postal_code.xml’


2019-05-26 13:59:39 (496 KB/s) - ‘canada_postal_code.xml’ saved [79293/79293]



In [2]:
#Install the BeautifulSoup library to parse the HTML page downloaded above
!pip install bs4



In [3]:
#import needed lib
from bs4 import BeautifulSoup
import pandas as pd
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
import requests
import numpy as np
from sklearn.cluster import KMeans 

import matplotlib.cm as cm
import matplotlib.colors as colors

from geopy.geocoders import Nominatim

!pip install folium
import folium

In [4]:
#Load and parse the downloaded page (HTML, despite the .xml file extension)
with open('canada_postal_code.xml') as f:
    soup=BeautifulSoup(f,'html.parser')

In [5]:
#Extract data from postal table (skip the header row)
rows=soup.table.find_all('tr')
L=[]
for row in rows[1:]:
    #Keep the non-empty text fragments of each row
    L.append([ x.rstrip('\n') for x in row.strings if x.rstrip('\n') != '' ])
print("Let's see first 5 values")
L[:5]

Let's see first 5 values


[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]

#### Create a dataframe from the data above with three columns: PostCode, Borough, and Neighborhood

In [6]:
#Import our lists to a dataframe
columns=['PostCode','Borough','Neighborhood']
df=pd.DataFrame(L,columns=columns)
df.head()

Unnamed: 0,PostCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Ignore cells with a borough that is Not assigned

In [7]:
#Drop rows with Borough == 'Not assigned'
df=df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,PostCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


#### Group neighborhoods by PostCode

In [8]:
df=df.groupby(['PostCode','Borough'],as_index=True)['Neighborhood'].apply(', '.join).reset_index()
df.head()

Unnamed: 0,PostCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
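
The grouping step above can be checked on a few rows. The following is a minimal sketch, assuming a toy dataframe (`toy`, a hypothetical name) built from three rows shown earlier in this notebook; it is an illustration of the technique, not a replacement for the cell above.

```python
import pandas as pd

# Toy rows copied from the table preview earlier in the notebook
toy = pd.DataFrame({
    'PostCode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# One row per (PostCode, Borough); neighborhood names are joined with ', '
grouped = (toy.groupby(['PostCode', 'Borough'])['Neighborhood']
              .apply(', '.join)
              .reset_index())
print(grouped)
```

Grouping on both `PostCode` and `Borough` keeps the borough column in the result; it relies on each postal code belonging to a single borough.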


#### Replace "Not assigned" Neighborhood by Borough name

In [9]:
#Use .loc with a boolean mask to avoid chained assignment
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask,'Neighborhood'] = df.loc[mask,'Borough']

#Verify result
try:
    df.set_index('Neighborhood').loc["Not assigned"]
except KeyError:
    print("There is no Not assigned neighborhood anymore\n")
    
print("The value of Neighborhood for Queen's Park borough now is: %s" % df.set_index('Borough').loc["Queen's Park"].Neighborhood)

There is no Not assigned neighborhood anymore

The value of Neighborhood for Queen's Park borough now is: Queen's Park


#### Our dataframe shape

In [10]:
df.shape

(103, 3)

## 1.2 Add latitude and longitude values to our dataframe

#### Download the csv file that contains the latitude and longitude of each postal code from https://cocl.us/Geospatial_data

In [14]:
!wget https://cocl.us/Geospatial_data -O Geospatial_data -q

#### Load the latitude/longitude data and merge it into our dataframe

In [15]:
#Load lat long
lat_long=pd.read_csv('Geospatial_data')
lat_long.head()
#Update column name
lat_long.rename(columns={'Postal Code':'PostCode'},inplace=True)

#Merge to DataFrame
full_df=pd.merge(df,lat_long,on='PostCode')
full_df.head()

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
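
`pd.merge` performs an inner join by default, so a postal code that is missing from the coordinates file would be dropped without warning. The sketch below shows one way to check for that, assuming toy inputs; the code `M9Z` and the borough `Hypothetical` are made up for illustration.

```python
import pandas as pd

# Toy inputs; 'M9Z' is a made-up code with no coordinates, for illustration
codes = pd.DataFrame({'PostCode': ['M1B', 'M1C', 'M9Z'],
                      'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
coords = pd.DataFrame({'PostCode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# how='left' keeps every postal code; indicator=True adds a '_merge' column
checked = pd.merge(codes, coords, on='PostCode', how='left', indicator=True)

# Rows marked 'left_only' found no match in the coordinates data
missing = checked.loc[checked['_merge'] == 'left_only', 'PostCode'].tolist()
print(missing)
```

An empty `missing` list would suggest that the inner join used above loses no rows.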


#### Keep only the boroughs whose name contains "Toronto"

In [13]:
toronto_data=full_df[full_df['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_data.head(5)

#### Shape of dataframe with lat-long

In [None]:
toronto_data.shape

# 2. Explore Neighborhoods in Toronto

#### Function to repeat the same process for all the neighborhoods in Toronto (the same approach used for Manhattan in the labs)

In [None]:
#Foursquare API info (placeholders: never commit real credentials to a shared notebook)
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20190506' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

In [None]:
LIMIT=100
radius=500
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
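
`getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. The sketch below shows one possible way to flatten the response items defensively. The helper name `extract_venue_rows` and the sample data are hypothetical; the sample only mirrors the fields that `getNearbyVenues` reads and is not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style 'items' into rows, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None instead of raising IndexError on an empty list
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written sample in the shape getNearbyVenues reads (illustrative values)
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]
rows = extract_venue_rows('Harbourfront', 43.65, -79.36, sample_items)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to test it without API credentials.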

#### Build the venues dataframe for Toronto

#### Make sure you fill in your Foursquare API credentials above, or this will fail!

In [None]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

In [None]:
#Review venues dataframe
toronto_venues.head(10)

In [None]:
#shape
toronto_venues.shape

# 3. Analyze Each Neighborhood

In [None]:
#One hot encoding
toronto_onehot=pd.get_dummies(toronto_venues[['Venue Category']],prefix="",prefix_sep="")

#Add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']


#Move neighborhood columns to the first columns

#Get position of Neighborhood column
c_index=toronto_onehot.columns.get_loc('Neighborhood')

#Move to the first position
fixed_columns = [toronto_onehot.columns[c_index]] + list(toronto_onehot.columns[:c_index]) \
+ list(toronto_onehot.columns[c_index+1:])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head()

#### Group rows by neighborhood, taking the mean frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()
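
To see why the mean of the one-hot columns is a frequency, consider a small example. This is a sketch on a made-up venue list (neighborhoods `A` and `B` and their categories are illustrative only); `dtype=float` is passed so the indicators are numeric regardless of the pandas version.

```python
import pandas as pd

# Toy venue list (names and categories are illustrative only)
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Bar', 'Park', 'Park'],
})

# One indicator column per category; dtype=float keeps the mean numeric
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with many venues and neighborhoods with few are compared on the same scale.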

### Build a dataframe with 10 most common venues for each neighborhood 

In [None]:

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues=10
indicators = ['st','nd','rd']
#Create columns according to the number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1,indicators[ind]))
    except IndexError:
        #No special suffix beyond 3rd; fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))
        
#Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind,1:] = return_most_common_venues(toronto_grouped.iloc[ind,:], num_top_venues)
neighborhoods_venues_sorted.head()
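
The row-by-row loop above could also be written as a single array operation. The following is a sketch of that alternative on a toy frequency table (values are illustrative); it is not the method used in this notebook, and ties may be ordered differently than with `sort_values`.

```python
import numpy as np
import pandas as pd

# Toy frequency table (illustrative values), indexed by neighborhood
freq = pd.DataFrame({'Bar': [0.25, 0.0], 'Cafe': [0.5, 0.2], 'Park': [0.25, 0.8]},
                    index=['A', 'B'])

top_n = 2
# argsort on the negated values orders each row from most to least frequent
order = np.argsort(-freq.to_numpy(), axis=1, kind='stable')[:, :top_n]
top = pd.DataFrame(freq.columns.to_numpy()[order], index=freq.index,
                   columns=['Top {}'.format(i + 1) for i in range(top_n)])
print(top)
```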

# 4. Cluster Neighborhoods and visualize by folium

## 4.1 Run k-means to cluster neighborhoods into 5 clusters

In [None]:
#set k=5
kcluster=5
toronto_grouped_clustering=toronto_grouped.drop('Neighborhood',axis=1)

#Run k-means clustering
kmeans = KMeans(n_clusters=kcluster,random_state=0).fit(toronto_grouped_clustering)
kmeans.labels_[0:10]
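
The value k=5 is fixed by hand here. One common way to sanity-check such a choice is to compare the inertia (within-cluster sum of squared distances) across several values of k. The sketch below does this on synthetic data standing in for the frequency table; the blob centers, the noise level, and the range of k are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the frequency table: three well-separated blobs
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.randn(20, 2) for c in centers])

# Inertia = sum of squared distances to the nearest centroid; lower is tighter
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k in sorted(inertias):
    print('{}: {:.2f}'.format(k, inertias[k]))
```

On data like this, the inertia typically drops sharply up to the true number of groups and flattens afterwards; on the real venue frequencies the curve may be much less clear-cut.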


In [None]:
#Add cluster labels to the sorted venues dataframe
try:
    neighborhoods_venues_sorted.insert(0,'Cluster Labels', kmeans.labels_)
except ValueError:
    print("Warn: cannot insert Cluster Labels, already exists. It looks like you're running this twice. Let's proceed")
    
#merge toronto_data to add long/lat to each neighborhood    
toronto_merged=toronto_data
toronto_merged=toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'),on='Neighborhood')
toronto_merged.head()

## 4.2 Visualize cluster

In [None]:
#create map 
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
map_clusters = folium.Map(location=[latitude,longitude],zoom_start=11)

#set color scheme for the cluster
x = np.arange(kcluster)

ys= [ i + x + (i*x)**2 for i in range(kcluster)]


colors_array=cm.rainbow(np.linspace(0,1,len(ys)))
rainbow= [colors.rgb2hex(i) for i in colors_array]

#add markers to the map 
markers_colors=[]
for lat,lon,poi,cluster in zip(toronto_merged['Latitude'],toronto_merged['Longitude'],toronto_merged['Neighborhood'],toronto_merged['Cluster Labels']):
    #Neighborhoods without venues get NaN labels after the join; skip them
    if pd.isnull(cluster):
        continue
    #The join can turn labels into floats, which cannot index a list
    cluster=int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' +str(cluster),parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
    

# 5. Examine clusters

The clustering does not work very well: some outliers remain, such as two nodes in the same cluster that are far from each other.