# Segmenting and Clustering Neighborhoods in Toronto

## Task 1: Data scraping

For this task we need to scrape borough data from the provided Wikipedia page and meet the following conditions:
- *include only cells with an assigned borough*
- *put neighborhoods with identical postal code into the same cell separated with comma*
- *neighborhood name needs to be the same as borough if missing*

The final DataFrame should look like this:
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1593475200000&hmac=ZLO37J4noDrOUqHCbiJJaIRgMbEcirsbqN8ILaIOWfg" alt="Toronto boroughs DF" width="400"/>



A closer look at the source page shows that the data table has been modified since this assignment was developed. It now meets most of the requirements:
- each postal code appears only once in the table
- neighborhoods within the same postal code are separated by commas
- there are no missing neighborhood cells (if a borough is present)

That makes the task rather straightforward. I chose the `pandas.read_html()` function, which searches an HTML page for table elements and returns a list of DataFrames. Picking the first item in that list gives us the desired DataFrame.

In [1]:
import pandas as pd

In [2]:
link = 'http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
boroughs_df = pd.read_html(link)[0]
boroughs_df.head(5)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


The only thing left to do is to drop the rows with *Not assigned* in the Borough column and reset the index.

In [3]:
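# keep assigned boroughs only; reset_index returns a new frame,
# so adding columns later does not write into a filtered view
boroughs_df = boroughs_df[boroughs_df['Borough'] != 'Not assigned'].reset_index(drop=True)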

boroughs_df.shape

(103, 3)
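
Had the table still been in a one-row-per-neighborhood layout, the three conditions from the task could be applied by hand. The following is only a minimal sketch on a small hand-made frame; the `raw` rows are illustrative assumptions, not the scraped data.

```python
import pandas as pd

# Hypothetical rows in a one-row-per-neighborhood layout (not the scraped data)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# 1. keep only rows with an assigned borough
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2. a missing neighborhood takes the borough name
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# 3. neighborhoods sharing a postal code are joined with a comma
clean = (
    clean.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(clean)
```

Filling the missing neighborhoods before grouping keeps the comma-joined cell free of *Not assigned* entries.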

## Task 2: Add Latitude and Longitude columns

Here we need to add two columns to our DataFrame, *Latitude* and *Longitude*, with the corresponding values.

The instructions for this assignment suggested the Geocoder Python package and also warned us about how inconsistent it is.
Well, naturally it didn't work for me at all. I kept receiving None values and got stuck in infinite loops.
But I didn't want to just give in and use the provided csv file.

So after a little bit of googling I found a nice Python package, `pgeocode`, which provides the desired values and also works offline.

In [4]:
# if you want to install this package, uncomment the line below

#!pip install pgeocode

In [5]:
import pgeocode

Using this package I loop through each postal code in the DataFrame and append its latitude and longitude to the corresponding lists. After that these lists are inserted into the DataFrame as new columns.

In [6]:
lat, long = [], []

# the Nominatim object is initialised with a two-letter country code
nomi = pgeocode.Nominatim('ca')

for code in boroughs_df['Postal Code']:
    # query_postal_code takes a postal code as an argument and returns location data for it
    location = nomi.query_postal_code(code)
    lat.append(location['latitude'])
    long.append(location['longitude'])

In [7]:
boroughs_df['Latitude'] = lat
boroughs_df['Longitude'] = long

In [8]:
boroughs_df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.3300
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.6518,-79.5076
99,M4Y,Downtown Toronto,Church and Wellesley,43.6656,-79.3830
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.6325,-79.4939
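
Since the earlier geocoder kept returning empty results, it may be worth checking that no postal code was left without coordinates. The sketch below assumes that a failed lookup shows up as NaN in the coordinate columns; the `sample` frame reuses the first two rows of the table above and adds one made-up code (`X0X`) purely for illustration.

```python
import numpy as np
import pandas as pd

# First two rows of the table above, plus one made-up code with no coordinates
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "X0X"],
    "Latitude": [43.7545, 43.7276, np.nan],
    "Longitude": [-79.3300, -79.3148, np.nan],
})

# rows where either coordinate is missing
unresolved = sample[sample[["Latitude", "Longitude"]].isna().any(axis=1)]
print(unresolved["Postal Code"].tolist())
```

An empty list here would mean every postal code was resolved.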


## Task 3: Area analysis

This task requires us to explore and cluster the neighborhoods in Toronto. First let's make all the imports we need.

In [9]:
#required imports
import folium
import json 
from geopy.geocoders import Nominatim
import requests
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import numpy as np

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

I chose to perform the analysis on neighborhoods in boroughs whose names contain the word *'Toronto'*, as was suggested in the instructions. For this I'm using the `pd.Series.str.contains` method to narrow the data down to *toronto_df*.

In [10]:
toronto_df = boroughs_df[boroughs_df['Borough'].str.contains('Toronto')]
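
As a small illustration of how this filter behaves, here is a sketch on a hand-made Series of borough names taken from the table above, with one missing value added as an assumption. Passing `regex=False` treats the pattern as plain text and `na=False` makes missing values count as non-matches.

```python
import pandas as pd

# Borough names from the table above, plus one missing value for illustration
boroughs = pd.Series(["North York", "Downtown Toronto", "Etobicoke", "East Toronto", None])

# True where the name contains the literal text 'Toronto'
mask = boroughs.str.contains("Toronto", regex=False, na=False)
print(boroughs[mask].tolist())
```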

Let's define Foursquare credentials and then borrow the `getNearbyVenues` function from a previous lab to build a *toronto_venues* dataframe with a list of venues for each neighborhood.

In [11]:
CLIENT_ID = 'EUQ5GMDZDOPFLRFOF4BGMDMNTYMS2YFMSKWEOHSTXAK5WL0P'
CLIENT_SECRET = 'FRA2QEUKL5RP3KMTVY5TG02LKOTRAA21MOK25AZW1YVFFHTY'
VERSION = '20180605'
LIMIT = 100
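
Keys written directly into a notebook are visible to anyone who opens it. One possible alternative, shown here only as a sketch, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not something Foursquare or the lab prescribes.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise a clear error."""
    try:
        # hypothetical variable names chosen for this sketch
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first") from None

# usage with an explicit mapping, so the sketch runs without touching the real environment
client_id, client_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
)
print(client_id)
```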

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
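
The function above indexes straight into the response, so a venue without categories or a response without groups would raise an error. A more tolerant way to flatten one response might look like the sketch below. The helper name `extract_venues` and the `'Unknown'` fallback are assumptions, and the payload is hand-written to mimic the structure the function reads, with the second venue's category list deliberately emptied.

```python
def extract_venues(payload):
    """Flatten one response into (name, lat, lng, category) tuples, tolerating gaps."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # fall back to a placeholder when the category list is missing or empty
        categories = venue.get("categories") or [{}]
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name", "Unknown"),
        ))
    return rows

# Hand-written payload in the shape the function above expects
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Roselle Desserts",
               "location": {"lat": 43.653447, "lng": -79.362017},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Tandem Coffee",
               "location": {"lat": 43.653559, "lng": -79.361809},
               "categories": []}},
]}]}}

print(extract_venues(payload))
```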

In [15]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

In [16]:
print(toronto_venues.shape)
toronto_venues.head()

(1537, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.6555,-79.3626,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.6555,-79.3626,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.6555,-79.3626,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.6555,-79.3626,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.6555,-79.3626,Dominion Pub and Kitchen,43.656919,-79.358967,Pub


We can see how many venues are located in each neighborhood.

In [17]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,92,92,92,92,92,92
"Brockton, Parkdale Village, Exhibition Place",39,39,39,39,39,39
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",14,14,14,14,14,14
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",59,59,59,59,59,59
Central Bay Street,63,63,63,63,63,63
Christie,11,11,11,11,11,11
Church and Wellesley,78,78,78,78,78,78
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,23,23,23,23,23,23
Davisville North,7,7,7,7,7,7


In [18]:
print('Neighborhood' in toronto_venues['Venue Category'].unique())

True


Apparently some venues in Toronto have a category named *'Neighborhood'*, which seems a bit odd. After one-hot encoding, that category would become a column called *'Neighborhood'* and clash with the real neighborhood column. So I'm going to rename it to *'Hood'* before adding a proper neighborhood column to a new dataframe, *toronto_onehot*, in which the categorical values of *'Venue Category'* are transformed into indicator columns.

In [19]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

#renaming Neighborhood category to Hood
toronto_onehot.rename({'Neighborhood':'Hood'}, axis=1, inplace=True)

# add neighborhood column back into the beginning of dataframe
toronto_onehot.insert(0, 'Neighborhood', toronto_venues['Neighborhood'])

toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Baby Store,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
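
To make the name clash concrete, here is a minimal sketch on a tiny hand-made frame (the venue rows are made up for illustration). Without the rename, `insert` would refuse to add a second column called `Neighborhood`.

```python
import pandas as pd

# Made-up venue rows; one of them has the category 'Neighborhood'
venues = pd.DataFrame({
    "Neighborhood": ["Regent Park, Harbourfront", "Regent Park, Harbourfront", "Christie"],
    "Venue Category": ["Bakery", "Neighborhood", "Bakery"],
})

# one indicator column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# rename the clashing category before adding the real neighborhood column
onehot = onehot.rename(columns={"Neighborhood": "Hood"})
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```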


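Let's see the mean occurrence of every venue category in each neighborhood. Because every row is a 0/1 indicator, this mean is simply the share of a neighborhood's venues that fall into each category.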

In [20]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Baby Store,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.01087,0.021739,0.0,0.0,0.0,0.01087,0.0,...,0.0,0.01087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01087
1,"Brockton, Parkdale Village, Exhibition Place",0.025641,0.0,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.016949,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033898,0.0,0.016949
4,Central Bay Street,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015873,0.015873,0.0,0.015873,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.012821,0.012821,0.0,0.0,0.012821,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,0.0,0.0,0.025641
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.03,0.01,0.0,0.0,0.03,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
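
The same kind of frequency table can also be obtained in one step with `pd.crosstab` and `normalize='index'`. This is an alternative sketch, not what the lab does. The neighborhoods `A` and `B` and their venues are made up for illustration.

```python
import pandas as pd

# Made-up venues for two hypothetical neighborhoods
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Bakery", "Park", "Pub", "Park", "Park"],
})

# share of each category among a neighborhood's venues
freq = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")
print(freq)
```

Each row sums to 1, just like the rows of the grouped one-hot means.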


Next we take the `return_most_common_venues` function from a lab to sort the venues. Then we create a new dataframe, *toronto_venues_sorted*, to display the most common categories in each neighborhood.

In [21]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [22]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_venues_sorted = pd.DataFrame(columns=columns)
toronto_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    toronto_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

toronto_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Café,Hotel,Seafood Restaurant,Cocktail Bar,Restaurant,Japanese Restaurant,Bakery,Beer Bar,Deli / Bodega
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Thrift / Vintage Store,Gift Shop,Brewery,Sandwich Place,Chiropractor,Restaurant,Cocktail Bar
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Restaurant,Yoga Studio,Bank,Breakfast Spot,Furniture / Home Store,Sushi Restaurant,Bookstore,Japanese Restaurant,Italian Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Italian Restaurant,Bar,Café,Gym / Fitness Center,Speakeasy,Park,Bakery,Bank,French Restaurant
4,Central Bay Street,Coffee Shop,Sandwich Place,Bubble Tea Shop,Middle Eastern Restaurant,Italian Restaurant,Japanese Restaurant,Clothing Store,Café,Poke Place,Breakfast Spot
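
The column names above rely on an `indicators` list and a fallback to *th*, which is fine for ten columns but would produce labels like *21th* for longer lists. A small ordinal helper avoids that. The sketch below is illustrative only: `ordinal` and `top_categories` are hypothetical names and the frequencies in `row` are made up.

```python
import pandas as pd

def ordinal(n):
    """Format an integer with its English ordinal suffix, e.g. 1st, 2nd, 11th, 22nd."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def top_categories(frequencies, n):
    """Names of the n most frequent categories in one row of frequencies."""
    return frequencies.sort_values(ascending=False).index[:n].tolist()

columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 4)]
row = pd.Series({"Coffee Shop": 0.5, "Park": 0.125, "Pub": 0.375})

print(columns)
print(top_categories(row, 2))
```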


Now let's cluster our neighborhoods using k-means. Unlike the lab, I chose to divide the neighborhoods into 3 clusters rather than 5, because this splits them a little more evenly, although one cluster still holds far more neighborhoods than the others.

In [23]:
# set number of clusters
kclusters = 3

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 0])
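
One way to compare how evenly different values of k split the data is to count the rows per cluster. The sketch below does this on synthetic data standing in for `toronto_grouped_clustering`; the random `features` array (38 rows, 5 columns) and the helper `cluster_sizes` are assumptions for illustration, so the printed counts say nothing about the real neighborhoods.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: 38 rows of frequency-like values that sum to 1 per row
rng = np.random.default_rng(0)
features = rng.dirichlet(np.ones(5), size=38)

def cluster_sizes(data, k):
    """Fit k-means and return how many rows fall into each cluster."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return np.bincount(labels, minlength=k)

for k in (3, 5):
    print(k, cluster_sizes(features, k))
```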

In [24]:
# add clustering labels
toronto_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_venues_sorted

# join toronto_df to add postal code, borough and latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(toronto_df.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Postal Code,Borough,Latitude,Longitude
0,2,Berczy Park,Coffee Shop,Café,Hotel,Seafood Restaurant,Cocktail Bar,Restaurant,Japanese Restaurant,Bakery,Beer Bar,Deli / Bodega,M5E,Downtown Toronto,43.6456,-79.3754
1,2,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Thrift / Vintage Store,Gift Shop,Brewery,Sandwich Place,Chiropractor,Restaurant,Cocktail Bar,M6K,West Toronto,43.6383,-79.4301
2,2,"Business reply mail Processing Centre, South C...",Coffee Shop,Restaurant,Yoga Studio,Bank,Breakfast Spot,Furniture / Home Store,Sushi Restaurant,Bookstore,Japanese Restaurant,Italian Restaurant,M7Y,East Toronto,43.7804,-79.2505
3,2,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Italian Restaurant,Bar,Café,Gym / Fitness Center,Speakeasy,Park,Bakery,Bank,French Restaurant,M5V,Downtown Toronto,43.6404,-79.3995
4,2,Central Bay Street,Coffee Shop,Sandwich Place,Bubble Tea Shop,Middle Eastern Restaurant,Italian Restaurant,Japanese Restaurant,Clothing Store,Café,Poke Place,Breakfast Spot,M5G,Downtown Toronto,43.6564,-79.386


In [25]:
toronto_merged.shape

(38, 16)
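
Joining on the neighborhood name assumes each name occurs only once on both sides; a repeated name would silently duplicate rows. A possible safeguard, sketched here with `DataFrame.merge` rather than the `join` used above, is the `validate` argument. The `labels` and `coords` frames are cut-down stand-ins built from two rows shown above.

```python
import pandas as pd

# Two rows taken from the merged table above
labels = pd.DataFrame({
    "Neighborhood": ["Berczy Park", "Central Bay Street"],
    "Cluster Labels": [2, 2],
})
coords = pd.DataFrame({
    "Neighborhood": ["Berczy Park", "Central Bay Street"],
    "Latitude": [43.6456, 43.6564],
    "Longitude": [-79.3754, -79.386],
})

# validate raises an error if a neighborhood name is repeated on either side
merged = labels.merge(coords, on="Neighborhood", how="left", validate="one_to_one")
print(merged)
```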

Finally we display each neighborhood and its cluster on a map.

In [26]:
#get Toronto coordinates to use for map creation
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [27]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
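
For reference, the colour scheme boils down to one evenly spaced colour per cluster. The sketch below builds such a palette directly (the name `palette` is an assumption) and prints which colour each label receives when indexed with `cluster - 1` as in the cell above: label 0 wraps around to the last colour, which still gives every cluster a distinct colour.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 3

# one evenly spaced colour per cluster, as hex strings usable by folium
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# colour assigned to each label when indexing with cluster - 1
for cluster in range(kclusters):
    print(cluster, palette[cluster - 1])
```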