# Capstone Project - The Battle of Neighborhoods
#### By Kristian Correia

### Imports
The cell below gathers every import used in this notebook in one place.

In [2]:
import pandas as pd
import numpy as np

#!pip install geocoder

import geocoder
from geopy.geocoders import Nominatim

import json
import requests

from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

print('Imports Completed Successfully ')

Imports Completed Successfully 


## Introduction

As two of the largest cities in Canada, Toronto and Vancouver are among the nation's main hubs for business and tourism. It is not uncommon for residents of one city to vacation in or move to the other. Given the large distance between the two, it is hard to scope out neighborhoods and get a solid understanding of the other city before committing to a prolonged stay there.

To understand an unfamiliar city, it helps to map it against a city you already know well. I live in Vancouver and would like to understand Toronto in greater depth. Relating the Vancouver neighborhoods I am familiar with to the neighborhoods of Toronto should improve my understanding of Toronto's layout and show me which of its neighborhoods interest me.

With this project I intend to leverage Foursquare data and the lessons taught in the previous modules to map out Toronto's neighborhoods according to how similar they are to the neighborhoods in Vancouver.

I intend to cluster the neighborhoods of Vancouver to build several general types of neighborhood in the city. I will then feed the neighborhoods of Toronto through the same model to determine which cluster each of them aligns with.

Using this data, I will be able to determine which parts of Vancouver and Toronto are similar to one another and which areas of Toronto I want to investigate further.

## Data

I will need a substantial amount of data for each of the two cities in order to complete this project.

For each city I will need a list of its neighborhoods. As in module three, I will use postal code forward sortation areas (FSAs) to define the neighborhoods. I will use web scraping to build tables of this data from the following sources:

Toronto: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Vancouver: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V

I will also need to associate each neighborhood with its coordinates. I will use the geocoder package to obtain this data, as I did in module 3.
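
Geocoding services occasionally return no result, so each lookup may need to be repeated. The sketch below shows one way to cap the number of attempts. The helper `geocode_with_retry` and its parameters are hypothetical names introduced for illustration, and the lookup is a stand-in function so the example runs without network access.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


def geocode_with_retry(
    query: str,
    lookup: Callable[[str], Optional[Sequence[float]]],
    max_attempts: int = 5,
    pause_seconds: float = 0.0,
) -> Optional[Tuple[float, float]]:
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return float(coords[0]), float(coords[1])
        time.sleep(pause_seconds)
    # Give up instead of looping forever
    return None


# Stand-in for a real call such as: lambda q: geocoder.arcgis(q).latlng
responses = iter([None, None, [43.75188, -79.33036]])


def fake_lookup(query):
    return next(responses)


print(geocode_with_retry("M3A, Toronto, Ontario", fake_lookup))
```

Returning `None` after the last attempt lets the caller decide whether to skip the postal code or stop the run.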

Lastly, I will need venue data to build profiles of the neighborhoods in both Toronto and Vancouver. I want fairly granular data, so I will request up to 300 venues per neighborhood from the Foursquare API. Specifically, I will one hot encode the venue categories, sum them for each neighborhood, and normalize the result to account for differences in the total number of venues. This gives a profile for each neighborhood, which I can ultimately use to cluster the Vancouver neighborhoods and to map Toronto onto those clusters.
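
The encode, sum and normalize steps can be sketched on a toy table as follows. The neighborhood names and categories here are made up. Taking the per-neighborhood mean of the one hot columns is equivalent to summing them and dividing by the number of venues.

```python
import pandas as pd

# Toy venue list; names and categories are illustrative only
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Cafe", "Cafe", "Bank", "Park", "Park"],
})

# One column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean per neighborhood = category count / total venues in that neighborhood
profile = onehot.groupby("Neighborhood").mean()
print(profile)
```

Each row of `profile` sums to 1, so a neighborhood with 4 venues and one with 100 venues can be compared on the same scale.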

#### Toronto Neighborhood Data

In [3]:
#Scrape, format and wrangle the table
torontoWikiPageHTML = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
dfTorontoN = torontoWikiPageHTML[0]
# .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning on rename
dfTorontoN = dfTorontoN[dfTorontoN.Borough != 'Not assigned'].copy()
dfTorontoN.reset_index(drop=True, inplace=True)
dfTorontoN.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"}, inplace=True)

#Find Coordinates for Neighborhoods
llData_lat = []
llData_lon = []
for n in dfTorontoN.index:
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        #print(dfTorontoN['PostalCode'][n])
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(dfTorontoN['PostalCode'][n]))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    llData_lat.append(latitude)
    llData_lon.append(longitude)

#Join Neighborhood data with Coordinate Data
llData = pd.DataFrame(data=np.array([llData_lat, llData_lon]).T, columns=['Latitude', 'Longitude'])
dfTorontoN = pd.concat([dfTorontoN, llData], axis=1)
dfTorontoN.head()


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.75188 | -79.33036 |
| 1 | M4A | North York | Victoria Village | 43.73042 | -79.31282 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65514 | -79.36265 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.72321 | -79.45141 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.66449 | -79.39302 |


#### Vancouver Neighborhood Data

In [4]:
#Scrape, format and wrangle the table
vancouverWikiPageHTML = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_V')
dfVancouverN = vancouverWikiPageHTML[0].values.tolist()
i = []
for sublists in dfVancouverN:
    for items in sublists:
        PC = []
        PC.append(items[:3])
        description = items[-len(items)+3:]
        if description.find('Vancouver') == -1:
            PC.append('Not assigned')
            PC.append('Not assigned')
        else:
            if description.find('West Vancouver') >= 0:
                PC.append('West Vancouver')
                description = description.replace('West Vancouver', 'West Vancouver - ')
            elif description.find('North Vancouver') >= 0:
                PC.append('North Vancouver')
                description = description.replace('(city)','')
                description = description.replace('(district municipality)','')
                description = description.replace('North Vancouver', 'North Vancouver -')
            else:
                PC.append('Central Vancouver')
                description = description.replace('Vancouver(','')
                description = description.replace(')','')
            PC.append(description)
        i.append(PC)
dfVancouverN = i
dfVancouverN = pd.DataFrame(data= dfVancouverN, columns = ['PostalCode', 'Borough', 'Neighborhood'])
dfVancouverN = dfVancouverN[dfVancouverN.Neighborhood != 'Not assigned']
dfVancouverN.reset_index(drop=True, inplace=True)

#Find Coordinates for Neighborhoods
llData_lat = []
llData_lon = []
for n in dfVancouverN.index:
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
      #print(dfVancouverN['PostalCode'][n])
      g = geocoder.arcgis('{}, Vancouver, British Columbia'.format(dfVancouverN['PostalCode'][n]))
      lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    #llData.append({'Latitude':latitude, 'Longitude':longitude}, ignore_index=True)
    llData_lat.append(latitude)
    llData_lon.append(longitude)

#Join Neighborhood data with Coordinate Data
llData = pd.DataFrame(data=np.array([llData_lat, llData_lon]).T, columns=['Latitude', 'Longitude'])
dfVancouverN = pd.concat([dfVancouverN, llData], axis=1)
dfVancouverN.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | V6A | Central Vancouver | Strathcona / Chinatown / Downtown Eastside | 49.27835 | -123.08965 |
| 1 | V6B | Central Vancouver | NE Downtown / Gastown / Harbour Centre / Inter... | 49.2804 | -123.11285 |
| 2 | V6C | Central Vancouver | Waterfront / Coal Harbour / Canada Place | 49.28572 | -123.11618 |
| 3 | V6E | Central Vancouver | SE West End / Davie Village | 49.28351 | -123.12952 |
| 4 | V6G | Central Vancouver | NW West End / Stanley Park | 49.29686 | -123.13759 |


#### Retrieve Venue Data

In [5]:
#Establish venue retrieval protocol
import os

# Credentials are read from environment variables instead of being hard-coded in the notebook.
# The variable names are this notebook's own choice and must be set before running the cell.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 300

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [6]:
#Retrieve Toronto Venues
Toronto_venues = getNearbyVenues(names=dfTorontoN['Neighborhood'],
                                   latitudes=dfTorontoN['Latitude'],
                                   longitudes=dfTorontoN['Longitude']
                                  )
Toronto_venues['City']='Toronto'

#Retrieve Vancouver Venues
Vancouver_venues = getNearbyVenues(names=dfVancouverN['Neighborhood'],
                                   latitudes=dfVancouverN['Latitude'],
                                   longitudes=dfVancouverN['Longitude']
                                  )
Vancouver_venues['City']='Vancouver'

print(Toronto_venues.head())
print(Vancouver_venues.head())

       Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0         Parkwoods               43.75188               -79.33036   
1         Parkwoods               43.75188               -79.33036   
2         Parkwoods               43.75188               -79.33036   
3         Parkwoods               43.75188               -79.33036   
4  Victoria Village               43.73042               -79.31282   

             Venue  Venue Latitude  Venue Longitude     Venue Category  \
0  Brookbanks Park       43.751976       -79.332140               Park   
1         PetSmart       43.748639       -79.333488          Pet Store   
2    Variety Store       43.751974       -79.333114  Food & Drink Shop   
3      649 Variety       43.754513       -79.331942  Convenience Store   
4     Wigmore Park       43.731023       -79.310771               Park   

      City  
0  Toronto  
1  Toronto  
2  Toronto  
3  Toronto  
4  Toronto  
                                 Neighborhood  Neighborh

In [55]:
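# Concat dfs from both cities; ignore_index gives every venue row a unique label
allVenueData = pd.concat([Toronto_venues, Vancouver_venues], ignore_index=True)

# One hot encode the venue categories
allVenue_onehot = pd.get_dummies(allVenueData[['Venue Category']], prefix="", prefix_sep="")

# A venue category may itself be named 'Neighborhood'; drop it if present so it
# does not clash with the neighborhood name column added below
allVenue_onehot.drop(['Neighborhood'], axis=1, inplace=True, errors='ignore')
allVenue_onehot['Neighborhood'] = allVenueData['Neighborhood']
allVenue_onehot['City'] = allVenueData['City']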

fixed_columns = list(allVenue_onehot.columns[-2:]) + list(allVenue_onehot.columns[:-2])
allVenue_onehot = allVenue_onehot[fixed_columns]
allVenue_onehot.head()

Unnamed: 0,Neighborhood,City,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,Toronto,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Toronto Venue Data

In [57]:
# Filter and drop in one chained step to avoid modifying a slice in place
toronto_onehot = allVenue_onehot.loc[allVenue_onehot['City'] == 'Toronto'].drop(columns=['City'])

#Create Normalized Aggregated Distribution of Venues for each Neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
1,"Alderwood, Long Branch",0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
2,"Bathurst Manor, Wilson Heights, Downsview North",0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
3,Bayview Village,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
4,"Bedford Park, Lawrence Manor East",0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
5,Berczy Park,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.015152,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.015152
6,"Birch Cliff, Cliffside West",0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
7,"Brockton, Parkdale Village, Exhibition Place",0.011628,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.011628,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.034884
8,"Business reply mail Processing Centre, South C...",0.000000,0.000000,0.000000,0.000000,0.0,0.020000,0.0,0.000000,0.0,...,0.020000,0.000000,0.000000,0.000000,0.0,0.010000,0.000000,0.000000,0.000000,0.000000
9,"CN Tower, King and Spadina, Railway Lands, Har...",0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,...,0.000000,0.013514,0.000000,0.000000,0.0,0.000000,0.000000,0.013514,0.000000,0.013514


#### Vancouver Venue Data

In [58]:
# Filter and drop in one chained step to avoid modifying a slice in place
vancouver_onehot = allVenue_onehot.loc[allVenue_onehot['City'] == 'Vancouver'].drop(columns=['City'])

#Create Normalized Aggregated Distribution of Venues for each Neighborhood
vancouver_grouped = vancouver_onehot.groupby('Neighborhood').mean().reset_index()
vancouver_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Terminal,American Restaurant,Amphitheater,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Bentall Centre,0.0,0.0,0.0,0.0,0.012346,0.037037,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012346
1,Central Kitsilano / Greektown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.022222,0.0,0.0,0.044444,0.0,0.0,0.022222,0.0,0.022222,0.022222
2,East Fairview / South Cambie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,East Mount Pleasant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0
4,Killarney,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,NE Downtown / Gastown / Harbour Centre / Inter...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.027778,0.0,0.0,0.013889,0.013889,0.0,0.0,0.0,0.0,0.0
6,NW Arbutus Ridge / NE Dunbar-Southlands,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,NW Dunbar-Southlands / Chaldecutt / South Univ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,NW Shaughnessy / East Kitsilano / Quilchena,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,NW West End / Stanley Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Methodology

In this project, we will focus on the 41 neighborhoods of Vancouver and determine which of them are similar to one another. We will also map Toronto in terms of its similarity to Vancouver and find which neighborhoods are most similar across the two cities. We will complete this objective in four steps:

1. We will retrieve the coordinate and venue data for all 41 postal code FSAs in Vancouver and 98 in Toronto. We will also obtain the venue category of up to 300 venues in each neighborhood of both cities and build a normalized distribution of the venue types in each neighborhood. This first step was completed in the Data section of the report.

2. We will train a k-means clustering model on the venue data from the Vancouver neighborhoods to group them into 8 categories. We will display these clusters on a map and create a profile of each cluster describing what its average neighborhood looks like.

3. We will use the k-means model fitted on Vancouver to assign the neighborhoods in Toronto to the same clusters. As with Vancouver, we will display these clusters on a map and create a profile of each cluster describing what its average neighborhood looks like.

4. Lastly, we will find which specific neighborhoods are closest to one another. We will do this by creating a distance matrix between all the neighborhoods in Toronto and all the neighborhoods in Vancouver. We will then bring forward the closest neighborhoods in each of the 8 clusters and review whether comparing the details of each pair yields any insights.
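
Assigning Toronto neighborhoods to clusters learned from Vancouver amounts to a nearest-centroid lookup, which is what `KMeans.predict` in scikit-learn does with Euclidean distance. The sketch below reproduces that assignment step with NumPy on made-up centroids and profiles; the numbers are illustrative and not taken from the project data.

```python
import numpy as np

# Hypothetical centroids learned on the first city (2 clusters, 3 categories)
centroids = np.array([
    [0.7, 0.2, 0.1],   # cluster 0: dominated by category 0
    [0.1, 0.1, 0.8],   # cluster 1: dominated by category 2
])

# Hypothetical venue profiles of neighborhoods in the second city
new_profiles = np.array([
    [0.6, 0.3, 0.1],
    [0.0, 0.2, 0.8],
    [0.5, 0.0, 0.5],
])

# Squared Euclidean distance of every profile to every centroid,
# shape (n_profiles, n_clusters)
diff = new_profiles[:, None, :] - centroids[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Each profile takes the label of its nearest centroid
labels = sq_dist.argmin(axis=1)
print(labels)
```

Because the centroids are fixed by the first city, nothing forces every cluster to receive a neighborhood from the second city.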

## Results

#### Clustering Vancouver Neighborhoods

In [59]:
# Create the model
kmeans = KMeans(n_clusters=8, random_state=0).fit(vancouver_grouped.drop(columns='Neighborhood'))

In [159]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
vancouver_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
vancouver_neighborhoods_venues_sorted['Neighborhood'] = vancouver_grouped['Neighborhood']

for ind in np.arange(vancouver_grouped.shape[0]):
    vancouver_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(vancouver_grouped.iloc[ind, :], num_top_venues)

vancouver_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
    
vancouver_merged = dfVancouverN

# merge the sorted venue table with dfVancouverN to add latitude/longitude for each neighborhood
vancouver_merged = vancouver_merged.join(vancouver_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

vancouver_merged = vancouver_merged[vancouver_merged['Cluster Labels'].notna()]


# Create map of Vancouver showing different clusters
map_clusters_vancouver = folium.Map(location=[49.2827, -123.1207], zoom_start=11)

# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(8)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(vancouver_merged['Latitude'], vancouver_merged['Longitude'], vancouver_merged['Neighborhood'], vancouver_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_vancouver)
       
map_clusters_vancouver

In [63]:
vancouver_consolidated = dfVancouverN
# merge vancouver_grouped with dfVancouverN to add latitude/longitude for each neighborhood
vancouver_consolidated = vancouver_consolidated.join(vancouver_grouped.set_index('Neighborhood'), on='Neighborhood')
vancouver_consolidated.dropna(axis=0, inplace=True)
vancouver_consolidated.sort_values('Neighborhood', inplace=True)
vancouver_consolidated.insert(0, 'Cluster Labels', kmeans.labels_)
# numeric_only=True keeps text columns such as PostalCode out of the cluster sums
vancouver_consolidated2 = vancouver_consolidated.groupby('Cluster Labels').size().to_frame().join(vancouver_consolidated.groupby('Cluster Labels').sum(numeric_only=True), on='Cluster Labels')
vancouver_consolidated2.rename(columns={0: "Number of Neighborhoods"}, inplace=True)
vancouver_consolidated2.drop(['Latitude', 'Longitude'], axis=1, inplace=True)
vancouver_consolidated2.reset_index(inplace=True)

columns = ['Cluster Labels', 'Number of Neighborhoods']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
vancouverClusterProfiles = pd.DataFrame(columns=columns)
vancouverClusterProfiles[['Cluster Labels', 'Number of Neighborhoods']] = vancouver_consolidated2[['Cluster Labels', 'Number of Neighborhoods']]
        
for ind in np.arange(vancouverClusterProfiles.shape[0]): 
    vancouverClusterProfiles.iloc[ind, 2:] = return_most_common_venues(vancouver_consolidated2.iloc[ind, 2:], num_top_venues)

vancouverClusterProfiles

| | Cluster Labels | Number of Neighborhoods | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Grocery Store | Gym / Fitness Center | Japanese Restaurant | Event Space | Dry Cleaner | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | Elementary School | Ethiopian Restaurant |
| 1 | 1 | 1 | Caribbean Restaurant | Italian Restaurant | Bakery | Donut Shop | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | Elementary School | Ethiopian Restaurant | Event Space |
| 2 | 2 | 1 | Food & Drink Shop | Yoga Studio | Falafel Restaurant | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | Elementary School | Ethiopian Restaurant | Event Space | Fair |
| 3 | 3 | 3 | Trail | Pet Store | Park | Coffee Shop | Mountain | Bus Stop | Ski Chairlift | Falafel Restaurant | Farm | Fair |
| 4 | 4 | 2 | Playground | Boat or Ferry | Park | Falafel Restaurant | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | Elementary School | Ethiopian Restaurant | Event Space |
| 5 | 5 | 1 | Construction & Landscaping | Elementary School | Yoga Studio | Farm | Eastern European Restaurant | Electronics Store | Ethiopian Restaurant | Event Space | Fair | Falafel Restaurant |
| 6 | 6 | 27 | Coffee Shop | Park | Café | Bus Stop | Bank | Sushi Restaurant | Hotel | Bakery | Indian Restaurant | Sandwich Place |
| 7 | 7 | 4 | Chinese Restaurant | Coffee Shop | Dessert Shop | Bus Stop | Sushi Restaurant | Asian Restaurant | Pizza Place | Field | Bubble Tea Shop | Sandwich Place |


#### Clustering Toronto Neighborhoods

In [67]:
torontoFitted = kmeans.predict(toronto_grouped.drop(columns='Neighborhood'))
len(set(torontoFitted))

5

In [72]:
print(toronto_grouped.shape[0])

98


In [160]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
toronto_neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    toronto_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

toronto_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', torontoFitted)
    
toronto_merged = dfTorontoN

# merge the sorted venue table with dfTorontoN to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(toronto_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged = toronto_merged[toronto_merged['Cluster Labels'].notna()]


# Create map of Toronto showing different clusters
map_clusters_toronto = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(8)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_toronto)
       
map_clusters_toronto

#### Distance Matrix of Neighborhoods

In [161]:
distanceofNeighborhoods = []
for i in np.arange(vancouver_grouped.shape[0]):
    for j in np.arange(toronto_grouped.shape[0]):
        a_name = vancouver_grouped[vancouver_grouped.columns[0]].iloc[i]
        b_name = toronto_grouped[toronto_grouped.columns[0]].iloc[j]
        a = vancouver_grouped[vancouver_grouped.columns[1:]].iloc[i]
        b = toronto_grouped[toronto_grouped.columns[1:]].iloc[j]
        # Manhattan (L1) distance: sum of absolute differences across venue categories
        results = np.abs(a - b).sum()
        distanceofNeighborhoods.append([a_name, b_name, results])
distanceMap = pd.DataFrame(data=distanceofNeighborhoods, columns= ['Vancouver Neighborhood', 'Toronto Neighborhood', 'Distance'])
distanceMap.sort_values(by='Distance', ascending=True, inplace=True)
print('The most similar neighborhoods between Vancouver and Toronto')
distanceMap.head()

The most similar neighborhoods between Vancouver and Toronto


| | Vancouver Neighborhood | Toronto Neighborhood | Distance |
|---|---|---|---|
| 1571 | North Vancouver - Northwest Central | Bayview Village | 0.666667 |
| 467 | Killarney | Steeles West, L'Amoreaux West | 0.8 |
| 3220 | Waterfront / Coal Harbour / Canada Place | Toronto Dominion Centre, Design Exchange | 0.94 |
| 3165 | Waterfront / Coal Harbour / Canada Place | First Canadian Place, Underground city | 0.98 |
| 3201 | Waterfront / Coal Harbour / Canada Place | Richmond, Adelaide, King | 0.99 |
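
The distance cell above loops over every Vancouver/Toronto pair and sums the absolute differences between their venue profiles, which is the Manhattan (L1) distance. As a rough sketch, the same quantity can be computed for all pairs at once with NumPy broadcasting. The frames below are toy data with made-up names and values, and reuse only the column labels of `distanceMap`.

```python
import numpy as np
import pandas as pd

# Toy venue profiles; names and values are illustrative only
city_a = pd.DataFrame(
    [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
    index=["A1", "A2"], columns=["Cafe", "Park", "Trail"],
)
city_b = pd.DataFrame(
    [[0.5, 0.25, 0.25], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],
    index=["B1", "B2", "B3"], columns=["Cafe", "Park", "Trail"],
)

# Broadcasting compares every row of city_a with every row of city_b:
# shape (2, 3, 3), reduced over categories to (2, 3)
diff = city_a.to_numpy()[:, None, :] - city_b.to_numpy()[None, :, :]
l1 = np.abs(diff).sum(axis=2)

# Flatten to one row per pair, closest pairs first
pairs = [
    (a, b, l1[i, j])
    for i, a in enumerate(city_a.index)
    for j, b in enumerate(city_b.index)
]
distance_table = (
    pd.DataFrame(pairs, columns=["Vancouver Neighborhood", "Toronto Neighborhood", "Distance"])
    .sort_values("Distance")
    .reset_index(drop=True)
)
print(distance_table)
```

For profiles that each sum to 1, this distance ranges from 0 (identical venue mix) to 2 (no venue category in common).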


#### Comparing North Vancouver - Northwest Central in Vancouver to Bayview Village in Toronto

In [152]:
display(vancouver_neighborhoods_venues_sorted.loc[vancouver_neighborhoods_venues_sorted['Neighborhood'] == 'North Vancouver - Northwest Central'])
display(toronto_neighborhoods_venues_sorted.loc[toronto_neighborhoods_venues_sorted['Neighborhood'] == 'Bayview Village'])

| | Cluster Labels | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 3 | North Vancouver - Northwest Central | Trail | Park | Yoga Studio | Dry Cleaner | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | Elementary School | Ethiopian Restaurant | Event Space |

| | Cluster Labels | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | Bayview Village | Construction & Landscaping | Park | Trail | Yoga Studio | Fair | Dry Cleaner | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | Elementary School |


#### Comparing Killarney in Vancouver to Steeles West, L'Amoreaux West in Toronto

In [155]:
display(vancouver_neighborhoods_venues_sorted.loc[vancouver_neighborhoods_venues_sorted['Neighborhood'] == 'Killarney'])
display(toronto_neighborhoods_venues_sorted.loc[toronto_neighborhoods_venues_sorted['Neighborhood'] == 'Steeles West, L\'Amoreaux West'])

| | Cluster Labels | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 6 | Killarney | Chinese Restaurant | Fast Food Restaurant | Coffee Shop | Sushi Restaurant | Farmers Market | Salon / Barbershop | Sandwich Place | Mobile Phone Shop | Bank | Shopping Mall |

| | Cluster Labels | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 6 | Steeles West, L'Amoreaux West | Fast Food Restaurant | Chinese Restaurant | Coffee Shop | Grocery Store | Sandwich Place | Discount Store | Breakfast Spot | Indian Restaurant | Bank | Pharmacy |


#### Comparing Waterfront / Coal Harbour / Canada Place in Vancouver to Toronto Dominion Centre, Design Exchange in Toronto

In [156]:
display(vancouver_neighborhoods_venues_sorted.loc[vancouver_neighborhoods_venues_sorted['Neighborhood'] == 'Waterfront / Coal Harbour / Canada Place'])
display(toronto_neighborhoods_venues_sorted.loc[toronto_neighborhoods_venues_sorted['Neighborhood'] == 'Toronto Dominion Centre, Design Exchange'])

| | Cluster Labels | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 6 | Waterfront / Coal Harbour / Canada Place | Coffee Shop | Hotel | Restaurant | Café | Steakhouse | Food Truck | New American Restaurant | Hotel Bar | Spa | Jewelry Store |

| | Cluster Labels | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | 6 | Toronto Dominion Centre, Design Exchange | Coffee Shop | Café | Hotel | Restaurant | Japanese Restaurant | Seafood Restaurant | Salad Place | American Restaurant | Deli / Bodega | Asian Restaurant |


#### Comparing Waterfront / Coal Harbour / Canada Place in Vancouver to First Canadian Place, Underground city in Toronto

In [157]:
display(vancouver_neighborhoods_venues_sorted.loc[vancouver_neighborhoods_venues_sorted['Neighborhood'] == 'Waterfront / Coal Harbour / Canada Place'])
display(toronto_neighborhoods_venues_sorted.loc[toronto_neighborhoods_venues_sorted['Neighborhood'] == 'First Canadian Place, Underground city'])

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,6,Waterfront / Coal Harbour / Canada Place,Coffee Shop,Hotel,Restaurant,Café,Steakhouse,Food Truck,New American Restaurant,Hotel Bar,Spa,Jewelry Store


Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,6,"First Canadian Place, Underground city",Coffee Shop,Café,Hotel,Restaurant,Gym,American Restaurant,Concert Hall,Seafood Restaurant,Japanese Restaurant,Salad Place


#### Comparing Waterfront / Coal Harbour / Canada Place in Vancouver to Richmond, Adelaide, King in Toronto

In [158]:
display(vancouver_neighborhoods_venues_sorted.loc[vancouver_neighborhoods_venues_sorted['Neighborhood'] == 'Waterfront / Coal Harbour / Canada Place'])
display(toronto_neighborhoods_venues_sorted.loc[toronto_neighborhoods_venues_sorted['Neighborhood'] == 'Richmond, Adelaide, King'])

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,6,Waterfront / Coal Harbour / Canada Place,Coffee Shop,Hotel,Restaurant,Café,Steakhouse,Food Truck,New American Restaurant,Hotel Bar,Spa,Jewelry Store


Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
65,6,"Richmond, Adelaide, King",Coffee Shop,Café,Gym,Hotel,Japanese Restaurant,Asian Restaurant,Steakhouse,Restaurant,Salad Place,American Restaurant


## Discussion

This exercise produced several notable observations. The first is that, at the highest level, most neighborhoods in both cities share a common thread. Of the 40 neighborhoods in Vancouver, 27 fell into a single cluster (cluster 6), and of the 98 neighborhoods in Toronto, 94 were assigned to that same cluster. We can conclude that many neighborhoods in both cities have a similar composition of venues: they tend to be dominated by coffee shops, cafes, green spaces, public transit routes and a variety of restaurants.
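The cluster sizes quoted above can be tallied with a short sketch like the one below. The two small DataFrames are made-up stand-ins for the real results, and `cluster_sizes` is a hypothetical helper name; only the `Cluster Labels` column name is taken from the tables shown earlier.

```python
import pandas as pd

# Hypothetical stand-ins for the clustered neighborhood tables.
# The rows are invented for illustration and are not the project's data.
vancouver_demo = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "Cluster Labels": [6, 6, 7, 6, 5],
})
toronto_demo = pd.DataFrame({
    "Neighborhood": ["P", "Q", "R", "S"],
    "Cluster Labels": [6, 6, 6, 7],
})


def cluster_sizes(df, label_col="Cluster Labels"):
    """Return the number of neighborhoods assigned to each cluster."""
    return df[label_col].value_counts()


# Align both cities on the cluster label; a cluster that is absent
# from one city is reported as 0 rather than NaN.
summary = (
    pd.DataFrame({
        "Vancouver": cluster_sizes(vancouver_demo),
        "Toronto": cluster_sizes(toronto_demo),
    })
    .fillna(0)
    .astype(int)
)
print(summary)
```

Running the same helper on the real tables would be one way to double-check the 27 of 40 and 94 of 98 figures.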

Although most neighborhoods did not stand out from one another, the data still lets us single out a few neighborhoods in Toronto that align with distinctive areas of Vancouver. Four Toronto neighborhoods were placed in the minority clusters derived from the Vancouver data. Three of these are discussed here as part of the categorization analysis, and the fourth is covered in the distance matrix review.

The Westmount neighborhood in Toronto was placed in cluster 7. This neighborhood is distinct from the others in Toronto in that it has a higher concentration of East Asian restaurants. Like the 4 Vancouver neighborhoods in this cluster, it appears to have a strong cultural influence that sets it apart.

The Malvern, Rouge neighborhood in Toronto was placed in cluster 5, which otherwise contained only the North Vancouver – Inner East neighborhood. Both neighborhoods show a distinct focus on green space and active living, along with cuisine from a broad range of cultures.

The Cedarbrae neighborhood in Toronto was placed in cluster 4. As with Malvern, Rouge, it resembles the other neighborhoods in its cluster through a strong focus on green space and active living, along with cuisine from a broad range of cultures. The cultural influences on the restaurants do, however, differ between these separate clusters.

In addition to the categorization model, we also created a distance matrix to measure how different each cross-city pair of neighborhoods is. Reviewing the 5 most similar pairs of neighborhoods leads to a number of conclusions.
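As a rough illustration of the distance matrix idea, the sketch below computes the distance between every Vancouver row and every Toronto row of a venue-frequency table, then lists the closest pairs. Euclidean distance is an assumption made for this example, the profile values are invented, and `cross_city_distances` and `closest_pairs` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Hypothetical venue-frequency profiles: one row per neighborhood,
# one column per venue category. The values are invented.
venue_cols = ["Coffee Shop", "Hotel", "Park"]
vancouver_profiles = pd.DataFrame(
    [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
    index=["Van A", "Van B"],
    columns=venue_cols,
)
toronto_profiles = pd.DataFrame(
    [[0.5, 0.25, 0.25], [0.0, 0.2, 0.8], [0.9, 0.1, 0.0]],
    index=["Tor X", "Tor Y", "Tor Z"],
    columns=venue_cols,
)


def cross_city_distances(left, right):
    """Euclidean distance between every row of `left` and every row of `right`."""
    # Compare only the venue categories present in both tables.
    cols = left.columns.intersection(right.columns)
    diff = left[cols].to_numpy()[:, None, :] - right[cols].to_numpy()[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return pd.DataFrame(dist, index=left.index, columns=right.index)


def closest_pairs(distances, k=5):
    """Return the k smallest entries as (row label, column label, distance)."""
    values = distances.to_numpy()
    order = np.argsort(values, axis=None)[:k]
    rows, cols = np.unravel_index(order, values.shape)
    return [
        (distances.index[r], distances.columns[c], float(values[r, c]))
        for r, c in zip(rows, cols)
    ]


distance_matrix = cross_city_distances(vancouver_profiles, toronto_profiles)
top_pairs = closest_pairs(distance_matrix, k=2)
for van, tor, d in top_pairs:
    print(f"{van} <-> {tor}: {d:.3f}")
```

A different metric, such as cosine distance, could be swapped in by changing only the body of `cross_city_distances`.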

The North Vancouver - Northwest Central neighborhood in Vancouver and the Bayview Village neighborhood in Toronto were found to be the most similar to one another. They were also both categorized in the same niche cluster. They have a number of commonalities, specifically a high proportion of yoga studios, schools, Eastern European restaurants, trails, electronics shops and parks. This is an excellent case of both the clustering model and the distance matrix finding two communities that have much in common.

The next closest pairing was the Killarney neighborhood in Vancouver and the Steeles West, L'Amoreaux West neighborhood in Toronto. Both have a high proportion of fast food restaurants, Chinese restaurants and coffee shops.

The next three closest pairings matched the Waterfront / Coal Harbour / Canada Place neighborhood in Vancouver with three Toronto neighborhoods: Toronto Dominion Centre, Design Exchange; First Canadian Place, Underground city; and Richmond, Adelaide, King. The common link between these four neighborhoods is that they serve as the central areas for commerce in their respective cities. In each case there is a large overlap in common venues, including coffee shops, hotels and North American style restaurants.

## Conclusion

While this exercise revealed a commonality that links many neighborhoods in both cities, it also shone a light on certain niche neighborhoods that are distinct from others in their own city and yet similar to a neighborhood in the other city. There are several unique pockets of Toronto that we can expect to be similar in experience to particular neighborhoods of Vancouver. These neighborhoods cover a substantial range: some are filled with nature and recreation, some show a strong multicultural influence, and others serve as large hubs for their city's thriving business sector. As a future extension of this project, it would be insightful to run a second iteration of this process in which the clusters are created from only the 27 neighborhoods in Vancouver that were assigned to cluster 6. This may allow us to build an even more granular understanding of these neighborhoods.
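The proposed second iteration could look roughly like the sketch below, which keeps only the rows assigned to the majority cluster and fits a fresh `KMeans` model on them. The feature table is randomly generated for illustration, `recluster_majority` is a hypothetical helper name, and the choice of 3 sub-clusters is an arbitrary assumption rather than a tuned value.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue-frequency table with first-pass cluster labels.
# The numbers are random and stand in for the real Vancouver features.
rng = np.random.default_rng(0)
profiles = pd.DataFrame(
    rng.random((12, 4)),
    columns=["Coffee Shop", "Park", "Hotel", "Sushi Restaurant"],
)
profiles.insert(0, "Cluster Labels", [6] * 9 + [7, 5, 4])


def recluster_majority(df, majority_label=6, n_subclusters=3,
                       label_col="Cluster Labels"):
    """Fit a new KMeans model on the rows of one first-pass cluster only."""
    subset = df[df[label_col] == majority_label].copy()
    features = subset.drop(columns=[label_col])
    model = KMeans(n_clusters=n_subclusters, n_init=10, random_state=0)
    model.fit(features)
    subset["Sub-cluster"] = model.labels_
    return subset, model


refined, sub_model = recluster_majority(profiles)
print(refined["Sub-cluster"].value_counts())
```

Toronto neighborhoods that landed in cluster 6 could then be passed to `sub_model.predict`, provided their feature columns match the columns used for fitting.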