<h1><center>Neighborhoods of Toronto</center></h1>

This notebook explores, segments, and clusters the neighborhoods of the city of Toronto. Web scraping, geolocation APIs, and a machine learning model are used to carry out the investigation. The results are then processed, discussed, and interpreted in light of the assumptions identified along the way.

<h2> Part 1 : Web Scrape - Wikipedia Article - List of postal codes of Canada </h2>

The BeautifulSoup Python library is used to scrape the table of Canadian postal codes from the article.

In [1]:
#Import BeautifulSoup and Requests libraries to obtain the website HTML
!pip install beautifulsoup4

from bs4 import BeautifulSoup
import requests

#Obtain HTML of website 
website_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
#Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(website_html, 'html.parser')
#soup.prettify()



The *html* block containing the table was located, and its repeating table-row (*tr*) elements were treated as rows. These rows were then looped through to extract the cells holding the postal codes, boroughs, and neighborhoods.

Within each iteration, the data was added as a new row to the empty dataframe <b>neighborhoods</b>. Rows whose borough is 'Not assigned' were skipped, and a neighborhood marked 'Not assigned' was given the name of its borough.

The resultant dataframe can be seen as output below.

In [2]:
#Find table and respective rows within beautifulsoup html output
myHoodTable = soup.find('table',{'class':'wikitable sortable'})
rows = myHoodTable.find_all('tr')

#Import pandas library, create dataframe, and loop through data to store in dataframe
import pandas as pd
import numpy as np

neighborhoods = pd.DataFrame(columns=['Postal Code', 'Borough', 'Neighborhood'])

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    cols = np.asarray(cols)
    if ((cols.size>0) and (cols[1]!="Not assigned")):
        postalcode = cols[0]
        borough = cols[1]
        if cols[2]=="Not assigned":
            neighborhood = cols[1]
        else:
            neighborhood = cols[2]
        neighborhoods.loc[len(neighborhoods.index)+1] = [postalcode,borough,neighborhood]

#Group neighborhoods with respect to same postal code using .join
neighborhoods = neighborhoods.groupby(['Postal Code','Borough'])['Neighborhood'].apply(','.join)
neighborhoods = pd.DataFrame(neighborhoods)
neighborhoods = neighborhoods.reset_index()

#View resultant dataframe
neighborhoods

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie..."
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."
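
The grooming and grouping rules applied in the cell above can be checked on a few rows without any scraping. The sketch below is a plain-Python restatement of those rules; the sample rows are hand-written for illustration and the helper name `groom` is an assumption, not part of the notebook.

```python
# Hand-written sample rows (postal code, borough, neighborhood); illustrative only
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M1B", "Scarborough", "Rouge"),
    ("M1B", "Scarborough", "Malvern"),
    ("M7A", "Queen's Park", "Not assigned"),
]

def groom(rows):
    """Apply the same rules as the scraping loop, without pandas."""
    grouped = {}
    for postal_code, borough, neighborhood in rows:
        if borough == "Not assigned":
            continue  # rows without a borough are skipped
        if neighborhood == "Not assigned":
            neighborhood = borough  # fall back to the borough name
        grouped.setdefault((postal_code, borough), []).append(neighborhood)
    # Join neighborhoods that share a postal code, as the groupby/join step does
    return [(pc, b, ",".join(names)) for (pc, b), names in grouped.items()]

groomed = groom(sample_rows)
for row in groomed:
    print(row)
```

The unassigned postal code is dropped, the two Scarborough rows collapse into one, and the borough name stands in for the missing neighborhood.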


In [3]:
neighborhoods.shape
print('The resultant dataframe contains '+str(neighborhoods.shape[0])+' tuples.')

The resultant dataframe contains 103 tuples.


<h2> Part 2 : Determine latitudes and longitudes of Canadian neighborhoods, using Geocoder </h2>

In [4]:
#Import geocoder library

!conda install -c conda-forge geocoder --yes
import geocoder


The postal codes scraped in the previous section are looped through in the <b>neighborhoods</b> dataframe, and each is passed to *geocoder.arcgis()* to obtain the corresponding latitude and longitude.

In each iteration the coordinates are appended to the initialised lists <b>latitudes</b> and <b>longitudes</b>, which are then added as columns to the <b>neighborhoods</b> dataframe.

<b>Note:</b> *geocoder.arcgis* is preferred over *geocoder.google* as the online geocoding service because it responds faster. Either service can be used to obtain the required neighborhood latitudes and longitudes.

In [5]:
latitudes = []
longitudes = []

for postal_code in neighborhoods['Postal Code']:
    # initialize your variable to None
    lat_lng_coords = None

    #loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    
    latitudes.append(latitude)
    longitudes.append(longitude)
neighborhoods['Latitude'] = latitudes
neighborhoods['Longitude'] = longitudes

neighborhoods

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.217590
4,M1H,Scarborough,Cedarbrae,43.769688,-79.239440
...,...,...,...,...,...
98,M9N,York,Weston,43.704845,-79.517546
99,M9P,Etobicoke,Westmount,43.696505,-79.530252
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie...",43.686810,-79.557284
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",43.743145,-79.584664


Neighborhoods grouped under one postal code are assumed to share a single pair of coordinates. Even across postal codes, the coordinates lie fairly close to one another. Given this proximity, it will be interesting to explore the venues of these neighborhoods with the Foursquare API in the next section. The proximity may also affect the clustering: an algorithm driven by the *dissimilarity* between neighborhoods, rather than their *similarity*, may give skewed results when the search areas sit so close together.
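
To put a number on this proximity, the distance between two postal codes can be estimated with the haversine formula. The sketch below uses the coordinates of M1B and M1C from the table above; the function name `haversine_km` and the mean Earth radius of 6371 km are choices made for this illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Coordinates of M1B and M1C as listed in the dataframe output
distance = haversine_km(43.811525, -79.195517, 43.785665, -79.158725)
print(round(distance, 2), "km")
```

Comparing such distances with the 500 metre search radius used later indicates whether the venue searches of two postal codes can overlap.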

<h2> Part 3: Clustering Neighborhoods in Toronto Using K-Means Clustering </h2>

In this section, the Foursquare API is used to explore the neighborhoods of Toronto, in particular the types of venues in each neighborhood and how they rank by frequency. K-means is then used as the clustering algorithm to group the neighborhoods based on their venue frequencies.

First, the dataframe from the previous sections is filtered to keep only the boroughs whose name contains 'Toronto'. The index of the new dataframe is reset and the old index is discarded.

<h3> Using Foursquare API to explore neighborhood venues </h3>

In [6]:
#Obtain data for only boroughs with the word 'Toronto' in them
toronto_data = neighborhoods[neighborhoods['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676531,-79.295425
1,M4K,East Toronto,"The Danforth West,Riverdale",43.683178,-79.355105
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.667965,-79.314667
3,M4M,East Toronto,Studio District,43.660629,-79.334855
4,M4N,Central Toronto,Lawrence Park,43.72842,-79.387133


The coordinates of the city of Toronto are obtained using *geocoder.arcgis* and later passed to *folium.Map()*, so that the map of neighborhood clusters is centered on the city.

In [7]:
#Obtain coordinates of Toronto, CA
address = 'Toronto, CA'

# initialize your variable to None
toronto_lat_lng_coords = None

#loop until the geocoder returns coordinates
while(toronto_lat_lng_coords is None):
    toronto_g = geocoder.arcgis(address)
    toronto_lat_lng_coords = toronto_g.latlng

#unpack only after the loop, once the coordinates are known to exist
toronto_latitude = toronto_lat_lng_coords[0]
toronto_longitude = toronto_lat_lng_coords[1]
print('The geographical coordinates of Toronto, CA are {}, {}.'.format(toronto_latitude, toronto_longitude))

The geographical coordinates of Toronto, CA are 43.648690000000045, -79.38543999999996.


To explore neighborhoods with the Foursquare API, the API credentials are defined first. These credentials belong to a Foursquare Developer account.

The number of venues retrieved per neighborhood is capped at 100, on the assumption that this is enough to produce a reasonably accurate clustering model. The cap also keeps the requests within the Developer account constraints.

In [8]:
#Define Foursquare credentials

CLIENT_ID = 'VCITPTWXT3TOCGPTNZCETKPK3FV4RMLXZVIZSZNWEASFFMZ5' # your Foursquare ID
CLIENT_SECRET = 'D2FEO4SDBK0PS2Z3SJX4KPWRK1YWF2LJAKRLLPNTAFPFZKZW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: VCITPTWXT3TOCGPTNZCETKPK3FV4RMLXZVIZSZNWEASFFMZ5
CLIENT_SECRET: D2FEO4SDBK0PS2Z3SJX4KPWRK1YWF2LJAKRLLPNTAFPFZKZW
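
Hard-coding and printing credentials exposes them to anyone who reads the notebook. One common alternative is to read them from environment variables and print only a masked form. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; these names and both helper functions are hypothetical and not part of the original notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read credentials from environment variables instead of source code.

    The variable names are assumptions made for this sketch.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters when confirming a value is loaded."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demonstration with a stand-in mapping rather than real credentials
demo_env = {"FOURSQUARE_CLIENT_ID": "abcdef", "FOURSQUARE_CLIENT_SECRET": "123456"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(demo_id))
```

Passing the environment as an argument keeps the function easy to exercise without touching the real process environment.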


<a id='item1'></a>

<a id='item2'></a>

The Foursquare API is called through a function *getNearbyVenues* for all neighborhoods in Toronto. The function takes the neighborhood names and coordinates as arguments, requests venue objects from the API (*/venues/explore?*), and appends the returned venue details (i.e. names, locations, and categories) to a new dataframe **nearby_venues** that lists each venue against its neighborhood.

The search radius is set to 500 metres, on the assumption that this captures enough nearby venues to produce a reasonably accurate clustering model. The radius also keeps the requests within the Developer account constraints.

In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
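
The function above assembles the request URL with positional `str.format` arguments, where the order is easy to get wrong. As a sketch of an alternative, the same query string can be built from named parameters with the standard library. The helper name `build_explore_url` is hypothetical; the endpoint and parameter names are taken from the URL in the function above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble the explore request URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes the values, including the comma in 'll'
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials; coordinates of The Beaches from the dataframe above
url = build_explore_url("ID", "SECRET", "20180605", 43.676531, -79.295425)
print(url)
```

With `requests`, the same dictionary could instead be passed through the `params` argument of `requests.get`, which performs the encoding itself.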

The function above can now be called, with respect to our Toronto neighborhoods dataframe **toronto_data**, and mapped to a new dataframe **toronto_venues**.

In [10]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Brockton,Exhibition Place,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvalles
Runnymede

In [11]:
print(toronto_venues.shape)
toronto_venues.head()

(1739, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676531,-79.295425,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676531,-79.295425,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676531,-79.295425,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676531,-79.295425,Domino's Pizza,43.679058,-79.297382,Pizza Place
4,The Beaches,43.676531,-79.295425,Upper Beaches,43.680563,-79.292869,Neighborhood


Some of the venues for the neighborhood **The Beaches** can be seen above. The categories seem fairly diverse. Interestingly, 'Neighborhood' is itself returned as a venue category. To improve the model, one option would be to remove this category before using the venue categories as predictor variables for clustering.

The venue count for each neighborhood is also calculated.

In [12]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,62,62,62,62,62,62
"Brockton,Exhibition Place,Parkdale Village",68,68,68,68,68,68
Business Reply Mail Processing Centre 969 Eastern,100,100,100,100,100,100
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",69,69,69,69,69,69
"Cabbagetown,St. James Town",40,40,40,40,40,40
Central Bay Street,98,98,98,98,98,98
"Chinatown,Grange Park,Kensington Market",79,79,79,79,79,79
Christie,11,11,11,11,11,11
Church and Wellesley,83,83,83,83,83,83


The number of venues varies widely between neighborhoods, from a minimum of 3 in the Moore Park neighborhood to a maximum of 100 in several neighborhoods (the limit of 100 defined earlier for the API call).

To understand how dissimilar the neighborhoods may be from one another, the number of unique venue categories is also calculated.

In [13]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 218 unique categories.


This seems to be an ample number of unique categories for a clustering model to capture the dissimilarity between neighborhoods.

Now that we have venue details obtained for each neighborhood, we can begin pre-processing the **toronto_venues** dataframe to prepare the details as effective predictors for the clustering model.

One-hot encoding is applied to the category of every venue. This converts the categorical labels into binary indicator columns, one per category, which can then be aggregated per neighborhood and used as numeric predictors.

In [14]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Toy / Game Store,Trail,Train Station,Tram Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
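
The first column of the output above is `Yoga Studio` rather than `Neighborhood`. One plausible cause (an inference from the output, not something the notebook states) is that a venue category is itself named 'Neighborhood', so `get_dummies` already creates a column with that name and the later assignment overwrites it instead of appending a new last column. The sketch below reproduces the situation on toy data and shows one way around it; the replacement column name `Neighborhood (category)` is an assumption made for this illustration.

```python
import pandas as pd

# Toy venues; one category is literally named 'Neighborhood'
venues = pd.DataFrame({
    "Neighborhood": ["The Beaches", "The Beaches", "Rosedale"],
    "Venue Category": ["Trail", "Neighborhood", "Park"],
})

onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")

# Rename the clashing category column before adding the neighborhood names
onehot = onehot.rename(columns={"Neighborhood": "Neighborhood (category)"})

# Insert the neighborhood names directly as the first column
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(list(onehot.columns))
```

Using `insert(0, ...)` also removes the need to reorder the columns afterwards.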


The one-hot encoded categories are then grouped by neighborhood, and the mean of each column is taken to give the frequency with which each category appears among a neighborhood's venues. (**Note:** each group corresponds to one postal code, whose neighborhoods were joined into a single name in Part 1.)

In [15]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Toy / Game Store,Trail,Train Station,Tram Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,"Adelaide,King,Richmond",0.0,0.0,0.03,0.0,0.01,0.0,0.03,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.016129,...,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.029412,0.014706,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.029412,0.0,0.014706,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.02,0.0,0.01,0.0,0.03,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.014493,0.0,0.0,0.0,0.0,0.0,0.014493,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014493
5,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.010204,0.0,0.010204,0.0,0.0,0.0,0.0,...,0.010204,0.0,0.0,0.0,0.0,0.010204,0.010204,0.010204,0.0,0.0
7,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.012658,0.012658,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.037975,0.0,0.050633,0.012658,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.012048,0.012048,0.012048,0.0,0.0,0.012048,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,0.0,0.0,0.012048
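
Each value in this table is simply the share of a neighborhood's venues that belong to a category. For example, Berczy Park has 62 venues, so a single art gallery gives 1/62. The sketch below checks that arithmetic in plain Python; the function name and the toy venue list are illustrative, not taken from the notebook.

```python
from collections import Counter

def category_frequencies(categories):
    """Share of each category among a neighborhood's venues.

    This equals the mean of the one-hot columns for that neighborhood.
    """
    counts = Counter(categories)
    total = len(categories)
    return {category: count / total for category, count in counts.items()}

# Toy list of 62 venues, one of which is an art gallery
toy_categories = ["Art Gallery"] + ["Coffee Shop"] * 5 + ["Other"] * 56
freqs = category_frequencies(toy_categories)
print(round(freqs["Art Gallery"], 6))  # 0.016129, the value shown for Berczy Park
```

Because the shares of one neighborhood sum to 1, the clustering compares the mix of venue types rather than the raw number of venues.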


Based on the venue mean frequency values, the top 5 most common venues for each neighborhood are output. Coffee shops seem to be a popular venue across Toronto neighborhoods.

In [16]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
         venue  freq
0  Coffee Shop  0.06
1         Café  0.06
2   Steakhouse  0.04
3        Hotel  0.04
4    Gastropub  0.03


----Berczy Park----
          venue  freq
0   Coffee Shop  0.08
1    Restaurant  0.05
2  Cocktail Bar  0.05
3   Cheese Shop  0.03
4        Bakery  0.03


----Brockton,Exhibition Place,Parkdale Village----
                    venue  freq
0             Coffee Shop  0.09
1  Furniture / Home Store  0.06
2                    Café  0.06
3              Restaurant  0.06
4                  Bakery  0.04


----Business Reply Mail Processing Centre 969 Eastern----
         venue  freq
0  Coffee Shop  0.09
1   Steakhouse  0.04
2        Hotel  0.04
3         Café  0.04
4          Bar  0.04


----CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara----
                  venue  freq
0           Coffee Shop  0.09
1    Italian Restaurant  0.07
2                   Bar  0.04
3  Gym / Fitness Center  0.

To provide a more reusable way of interpreting these venues, the function below takes one row of the neighborhood venue data and returns its *n* most common venue categories.

In [17]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Using ordinal suffixes (1st, 2nd, 3rd, and so on) for the column names, we loop through each neighborhood and store its *n* most common venues in a dataframe **neighborhoods_venues_sorted**.

It is interesting to see which neighborhoods have hotels among their most common venues. Seasonal variation in this popularity could inform a pricing model that lets hotels adjust prices to demand.

In [18]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Steakhouse,Hotel,Bar,Bakery,Gym,Burger Joint,Asian Restaurant,Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Seafood Restaurant,Cheese Shop,Bakery,Steakhouse,Farmers Market,Beer Bar,Hotel
2,"Brockton,Exhibition Place,Parkdale Village",Coffee Shop,Furniture / Home Store,Restaurant,Café,Bakery,Italian Restaurant,Sandwich Place,Vegetarian / Vegan Restaurant,Art Gallery,Hotel
3,Business Reply Mail Processing Centre 969 Eastern,Coffee Shop,Café,Hotel,Steakhouse,Bar,Restaurant,Pub,Pizza Place,Gym,Asian Restaurant
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Coffee Shop,Italian Restaurant,Bar,Café,Gym / Fitness Center,Intersection,Sandwich Place,Bakery,Pub,Restaurant
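
The column names above rely on a three-element suffix list with a fallback to 'th', which is correct up to 10 but would label an 11th or 21st column incorrectly. A general ordinal helper is sketched below as an alternative; the name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # covers 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

# Same column layout as the cell above, for the top 10 venues
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns[:4])
```

For the ten columns used here the result is identical to the notebook's; the helper only matters if `num_top_venues` is raised above 10.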


<a id='item4'></a>

<h3>Creating a K-means clustering model to group similar neighborhoods into clusters</h3>

The *k*-means clustering algorithm is used to group the neighborhoods into 5 clusters, based on each neighborhood's venue details.

The algorithm assigns each neighborhood to the nearest cluster centre, so that neighborhoods within a cluster have similar predictor values (i.e. similar venue frequencies) while *dissimilar* neighborhoods end up in different clusters. An effective use case of this model, then, would be to recommend areas to an individual looking to relocate, based on how similar an area's 'vibe' (i.e. its types of venues and facilities) is to that of their current neighborhood.
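
To make the idea concrete, the two alternating steps of basic *k*-means (Lloyd's algorithm) are sketched below in plain Python on toy two-dimensional points. This is an illustration only: it is not scikit-learn's implementation, which adds smarter initialisation and other refinements, and the function names and toy data are assumptions.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def simple_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Toy "venue frequency" vectors: two cafe-heavy and two park-heavy neighborhoods
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = simple_kmeans(points, k=2)
print(labels)
```

The two cafe-heavy points end up with one label and the two park-heavy points with the other; which cluster receives which number depends on the initialisation.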

In [19]:
# Import k-means from clustering library
from sklearn.cluster import KMeans

<h4>Model Generation</h4>

In [20]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:38] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 3, 0,
       1, 0, 0, 4, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

Most neighborhoods (33 of the 38) are assigned to Cluster 0.

<h4>Model Output Processing</h4>

The cluster labels will be joined with the common venues data, obtained previously, in a new dataframe, in order to interpret and get further insights into what differentiates the clusters.

In [21]:
# add clustering labels, dropping any existing label column first so the cell can be re-run
if 'Cluster Labels' in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted = neighborhoods_venues_sorted.drop(columns=['Cluster Labels'])
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

#join the cluster labels and most common venues onto toronto_data, which holds the latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676531,-79.295425,0.0,Health Food Store,Pizza Place,Trail,Pub,Wings Joint,Dumpling Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm
1,M4K,East Toronto,"The Danforth West,Riverdale",43.683178,-79.355105,0.0,Park,Bus Line,Grocery Store,Discount Store,Restaurant,Wings Joint,Eastern European Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.667965,-79.314667,0.0,Gym,Italian Restaurant,Park,Pet Store,Pizza Place,Movie Theater,Pub,Sandwich Place,Fast Food Restaurant,Liquor Store
3,M4M,East Toronto,Studio District,43.660629,-79.334855,0.0,Diner,Italian Restaurant,Café,Pizza Place,Brewery,Sushi Restaurant,Sandwich Place,Bar,Gastropub,Coffee Shop
4,M4N,Central Toronto,Lawrence Park,43.72842,-79.387133,1.0,Bus Line,Swim School,Wings Joint,Food,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm
5,M4P,Central Toronto,Davisville North,43.712755,-79.388514,0.0,Clothing Store,Gym,Food & Drink Shop,Hotel,Breakfast Spot,Park,Convenience Store,Fish & Chips Shop,Eastern European Restaurant,Fast Food Restaurant
6,M4R,Central Toronto,North Toronto West,43.714523,-79.40696,4.0,Gym Pool,Playground,Park,Garden,Wings Joint,Donut Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
7,M4S,Central Toronto,Davisville,43.703395,-79.385964,0.0,Dessert Shop,Coffee Shop,Café,Sandwich Place,Toy / Game Store,Italian Restaurant,Pizza Place,Gym,Fast Food Restaurant,Seafood Restaurant
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.690685,-79.382946,0.0,Convenience Store,Gym,Playground,Restaurant,Donut Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686074,-79.402265,0.0,Light Rail Station,Coffee Shop,Liquor Store,Supermarket,Wings Joint,Electronics Store,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market


At a glance, Cluster 0 neighborhoods are characterised by eateries such as coffee shops and Italian restaurants. Other clusters, such as Cluster 4, appear to be centred more on facilities such as gyms and public transport.

<h2>Part 4: Viewing Neighborhood Clusters on a Geospatial Map</h2>

This section uses the Folium library and its marker tools to show where each neighborhood is located and which cluster it falls into. 

Being able to geospatially distinguish these clusters provides further insight into how these neighborhoods are similar based on their venues, and ultimately whether any further patterns/predictors exist that have not yet been explored or appropriately addressed.

In [22]:
# Import Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes

# Import Folium, which is used below to render the map
import folium


In [23]:
#Pre-process the dataframe to remove any neighborhood data that contains N/A values.

toronto_merged.dropna(inplace=True)

The color scheme is generated from the number of clusters, and a marker is added to the map for each neighborhood. Each marker is colored according to the cluster its neighborhood falls into.

In [24]:
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

As anticipated earlier, the majority of the neighborhoods fall under Cluster 0 (let's label this the Foodie cluster). The most common venues in these areas tend to be eateries, ranging from coffee shops and general cafes to restaurants of particular cuisines such as Italian.
A large cluster of this type is to be expected. Many developed countries have a high volume of eat-out businesses, along with a 'foodie' trend of exploring new and innovative foods and beverages. Furthermore, Toronto is a fairly popular city for tourists, for whom a visit to the local Tim Hortons or a poutine takeaway is a common activity at any time of year. Eateries are therefore spread across the city, so no geospatial pattern can be discerned for this cluster, which is consistent with using venue data alone as predictors for the K-means clustering algorithm.