## Segmenting and Clustering Neighborhoods in Toronto

**Jay Gendron  -  July 2019**

In this project, we explore and cluster the neighborhoods in Toronto. The project has three parts:

### Table of Contents

1. <a href="#part1">Creating dataframe of Toronto's PostalCodes, Boroughs, and Neighborhoods</a>
2. <a href="#part2">Getting neighborhood latitude and longitude using Geocoder package</a>  
3. <a href="#part3">Exploring and clustering the neighborhoods in Toronto</a>


<a name="part1"></a> 
    
### Part 1. Creating dataframe of Toronto's PostalCodes, Boroughs, and Neighborhoods

Most data science projects begin with ETL (extract, transform, and load). In this project, Parts 1 and 2 retrieve the data and perform the needed transformations. We start from a Wikipedia page listing the postal codes that begin with M, which cover the city of Toronto (in the province of Ontario). That source is:

[Data Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)


In [1]:
#Import libraries
import numpy as np
import pandas as pd
import requests
import re #package for regular expressions
from geopy.geocoders import Nominatim 
import json
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium # map rendering library

We can now use the `get()` function from the **requests** library to download the raw HTML of the Wikipedia page that holds the table.

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

Pandas allows for reading data directly from HTML sources with the `read_html()` function.

In [3]:
df_raw = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',
                      flavor='bs4') #flavor bs4 uses BeautifulSoup as a parsing engine
df = df_raw[0] #extract the first element from the list returned from read_html
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
df.tail() #and check to see all table elements down to M9Z were read into the dataframe

Unnamed: 0,Postcode,Borough,Neighbourhood
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor
287,M9Z,Not assigned,Not assigned


Now that the raw data is available, there are six pre-processing requirements. They are reproduced here from the project instructions for reference.

#### Pre-Processing Requirements
> 1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
> 2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
> 3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
> 4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
> 5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
> 6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

**Requirement 2:** Use only boroughs assigned a name

In [5]:
#Take a copy() so that later assignments do not act on a view of df
print(f'Before eliminating rows with unnamed boroughs, there were {df.shape[0]} rows')
df_named = df[df['Borough']!='Not assigned'].copy()
print(f'After eliminating rows there were {df_named.shape[0]} rows')

Before eliminating rows with unnamed boroughs, there were 288 rows
After eliminating rows there were 211 rows


**Requirement 4:** Fill in unnamed neighborhoods with name of borough

In [6]:
#Use apply to assign the borough name when the neighbourhood is 'Not assigned'
df_named['Neighbourhood'] = df_named.apply(
    func=lambda x: x['Borough'] if x['Neighbourhood'] == 'Not assigned' else x['Neighbourhood'],
    axis=1) #axis=1 applies the function to each row

#verify by checking borough named Queen's Park
df_named[df_named['Borough']=="Queen's Park"]

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Queen's Park
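The same fill can also be written without a row-wise `apply()`. The following is a minimal sketch on a three-row toy frame (not the notebook's `df_named`), using `Series.mask()` to replace values where a condition holds:

```python
import pandas as pd

# Toy rows shaped like the Wikipedia table (not the full dataset)
toy = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M7A'],
    'Borough': ['North York', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Not assigned'],
})

# Where the neighbourhood is unassigned, take the borough name instead
unassigned = toy['Neighbourhood'] == 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].mask(unassigned, toy['Borough'])

print(toy)
```

On a table of a few hundred rows either approach is fine; the vectorized form mainly reads closer to the requirement as stated.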


**Requirement 3:** Combine postal code neighborhoods with comma-separated string for label

This step makes use of several Pandas and Python tools:
* `groupby()` to gather the rows having the same postal code
* `apply()` to run a user-defined function on each group
* a `lambda` expression to hold the user-defined function
* `to_list()` to gather the neighborhood names in the group into a list
* `', '.join()` to convert that list into a comma-separated string

The grouped neighborhoods are saved into an additional dataframe that will merge with the original data.

In [7]:
grouped_neighbourhoods = df_named.groupby('Postcode')['Neighbourhood'].apply(lambda x:', '.join(x.to_list()))
grouped_neighbourhoods = grouped_neighbourhoods.to_frame().reset_index()
grouped_neighbourhoods.head()

Unnamed: 0,Postcode,Neighbourhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae


Combine the grouped neighborhoods with the df_named dataframe using the `merge()` function - similar to a SQL join.

In [8]:
df_grouped = pd.merge(df_named, grouped_neighbourhoods, on='Postcode')
df_grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood_x,Neighbourhood_y
0,M3A,North York,Parkwoods,Parkwoods
1,M4A,North York,Victoria Village,Victoria Village
2,M5A,Downtown Toronto,Harbourfront,"Harbourfront, Regent Park"
3,M5A,Downtown Toronto,Regent Park,"Harbourfront, Regent Park"
4,M6A,North York,Lawrence Heights,"Lawrence Heights, Lawrence Manor"


Now we can select the three columns we want to keep and drop the duplicate rows, as seen above with Postcode M5A.

In [9]:
df_clean = df_grouped[['Postcode','Borough','Neighbourhood_y']].drop_duplicates()
df_clean = df_clean.sort_values(by='Postcode') #sort final dataframe by postal code

#Verify by checking that Postcode M5A is listed once with two neighborhoods: Harbourfront and Regent Park
df_clean[df_clean['Postcode']=='M5A']

Unnamed: 0,Postcode,Borough,Neighbourhood_y
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
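For comparison, the group, merge, and de-duplicate steps can be collapsed into a single aggregation. This is only a sketch on toy rows, and it assumes every postal code belongs to exactly one borough, so that grouping on both columns does not split a postal code:

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group on both key columns and join the neighbourhood names of each group
combined = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

print(combined)
```

The result has one row per postal code and is already sorted by the grouping keys.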


**Requirement 1:** Rename the three required columns

In [10]:
df_clean.columns = ['PostalCode', 'Borough', 'Neighborhood']
df_clean.reset_index(inplace=True, drop=True)
df_clean.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


**Requirement 6:** Use the `.shape` method to print the number of rows of your dataframe

In [11]:
print(f'After processing, the dataframe of 211 rows by 3 columns was reduced to {df_clean.shape}')

After processing, the dataframe of 211 rows by 3 columns was reduced to (103, 3)


<a id="part2"></a>

### Part 2. Getting neighborhood latitude and longitude using Geocoder package

**NOTE** 

> Neither google.geocoder nor Nominatim as demonstrated in the course labs provided successful geolocation information for the Canada postal codes as provided in the data.

>However, the pgeocode package (its `Nominatim` class looks codes up in GeoNames postal-code data rather than calling the Nominatim service) did provide results for international addresses. Unfortunately, the postal codes we have for this project are incomplete, with only 3 characters. The [Canada Post website](https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf) shows postal codes in a six-character alphanumeric format:

<center>$AnA\ nAn$</center>
<center>where A is a letter and n is a digit</center>

<p>
After installing the package and running the query for a Canadian postal code, we see the database does not return information:
</p>

In [12]:
#!pip install pgeocode #uncomment if pgeocode is not installed already
import pgeocode

In [13]:
nomi = pgeocode.Nominatim('ca')
nomi.query_postal_code('M1E 2Z2')

postal_code       2Z2
country code      NaN
place_name        NaN
state_name        NaN
state_code        NaN
county_name       NaN
county_code       NaN
community_name    NaN
community_code    NaN
latitude          NaN
longitude         NaN
accuracy          NaN
Name: 0, dtype: object

However, the example from the [pgeocode documentation](https://pypi.org/project/pgeocode/) does yield results:

In [14]:
nomi = pgeocode.Nominatim('fr') #France
nomi.query_postal_code("75013")

postal_code                 75013
country code                   FR
place_name        Paris 13, Paris
state_name          Île-de-France
state_code                     11
county_name                 Paris
county_code                    75
community_name              Paris
community_code                751
latitude                  48.8322
longitude                 2.35245
accuracy                        5
Name: 0, dtype: object
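To make the format issue concrete, here is a small sketch that checks a string against the six-character letter-digit pattern and separates the three-character prefix (the part the Wikipedia table provides) from the rest. The helper name `split_postal_code` is hypothetical, and the pattern only checks the shape of the code, not whether Canada Post has actually issued it:

```python
import re

# Letter-digit-letter, optional space, digit-letter-digit
POSTAL_CODE = re.compile(r'^([A-Z]\d[A-Z]) ?(\d[A-Z]\d)$')

def split_postal_code(code):
    """Return (prefix, suffix) for a full code, or None if the shape does not match."""
    match = POSTAL_CODE.match(code.strip().upper())
    return match.groups() if match else None

print(split_postal_code('M1E 2Z2'))  # full six-character code
print(split_postal_code('M1E'))      # three-character prefix only
```

The first call returns the pair `('M1E', '2Z2')`; the second returns `None` because a bare prefix is not a complete postal code.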

**Resolution:** Given the challenges with the Canadian postal code formatting and the <font color='red'>\[REQUEST_DENIED\] Google - Geocode \[empty\]</font> error code, the decision was to download the CSV file provided in the project instructions and merge that with the data frame created in Part 1.

In [15]:
!wget 'https://cocl.us/Geospatial_data'
print('Data downloaded!')

--2019-07-19 22:33:08--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-07-19 22:33:08--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197, 107.152.26.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-07-19 22:33:08--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [followin

In [16]:
df_latlng = pd.read_csv('Geospatial_data')

df_latlng.columns = ['PostalCode','Latitude','Longitude']
df_latlng.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
df_final = pd.merge(left=df_clean, right=df_latlng, on='PostalCode', how='inner') #join on the shared postal code column
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
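An inner join silently drops any postal code that is missing from the coordinates file. A minimal sketch of how one might check for that, using a left join with `indicator=True` on toy frames (the `M9Z` row is made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a _merge column; validate guards against duplicate keys
checked = pd.merge(left, right, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()

print(unmatched)
```

An empty `unmatched` list would confirm that the inner join lost nothing.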


<a id="part3"></a>


### Part 3. Exploring and clustering the neighborhoods in Toronto

This portion of the analysis uses the final, loaded data in the dataframe `df_final` to explore and cluster the Toronto neighborhoods using venue information from the Foursquare API. Final results are visualized using the **Folium** package. 

For readability, this analysis focuses on only those boroughs that contain the word Toronto. The first step is to create that subset from the dataframe.

In [18]:
borough = df_final['Borough']
df_toronto = df_final[borough.str.contains(r'Toronto')]
print(f'The data only contains boroughs containing the word Toronto.\n\
Specifically: {set(df_toronto["Borough"])}')
df_toronto.head()

The data only contains boroughs containing the word Toronto.
Specifically: {'East Toronto', 'West Toronto', 'Downtown Toronto', 'Central Toronto'}


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [19]:
address = 'Rosedale, Toronto'

geolocator = Nominatim(user_agent="my_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of Rosedale, Toronto are 43.6783556, -79.3807457.


In [20]:
# create map of downtown Toronto using latitude and longitude values from above
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'],
                                           df_toronto['Longitude'],
                                           df_toronto['Borough'],
                                           df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

**Map of the Toronto boroughs showing the neighborhoods** (rendered by the cell above)

Having the latitude and longitude of each neighborhood allows us to request Foursquare data for venues in its vicinity. First we insert our Foursquare credentials:

In [40]:
CLIENT_ID = '<hidden>' # your Foursquare ID
CLIENT_SECRET = '<hidden>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <hidden>
CLIENT_SECRET:<hidden>
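Rather than pasting credentials into a cell and blanking them before publishing, they can be read from environment variables. This is a sketch only; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical, not something the Foursquare API prescribes:

```python
import os

def load_credentials(env=os.environ):
    """Read API credentials from a mapping of environment variables (names are hypothetical)."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret

# Example with an explicit mapping instead of the real environment
demo_id, demo_secret = load_credentials({'FOURSQUARE_CLIENT_ID': 'demo-id',
                                         'FOURSQUARE_CLIENT_SECRET': 'demo-secret'})
print('CLIENT_ID loaded:', bool(demo_id))
```

Printing only whether a value was loaded, instead of the value itself, keeps the secret out of the saved notebook output.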


**Note**

> It appears that the Folium maps don't transfer from the live Jupyter Notebook over to GitHub (or to a PDF either). For that reason I have placed a static image of the map generated by the mapping cell above in the Git repo and provided a hyperlink here:

[Map showing Toronto neighborhoods (file: Toronto_neighborhoods.jpg)](https://github.com/jgendron/Coursera_Capstone/blob/master/Toronto_neighborhoods.jpg)


**Acknowledgement:**
> This analysis replicates many of the functions and API calls from the lab on New York City neighborhoods. A special thanks to the IBM data scientists who assembled and documented that analysis for our use.

The function `getNearbyVenues()` takes neighborhood names as well as their latitude and longitude. For each neighborhood it returns up to 100 venues (the `LIMIT` argument) within 500 meters (the `radius` argument) of the geolocation.

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
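The parsing half of that function can be exercised without calling the API. The sketch below uses a hypothetical helper, `parse_venues`, and a hand-written payload that mimics only the keys accessed in `getNearbyVenues()`; it also tolerates a venue with an empty `categories` list, which would raise an `IndexError` in the list comprehension above:

```python
def parse_venues(name, lat, lng, response_json):
    """Flatten one explore-style response into rows of venue details."""
    groups = response_json.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written payload; the second venue is invented to show the empty-category case
fake = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.676821, 'lng': -79.293942},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.0, 'lng': -79.0},
               'categories': []}},
]}]}}

rows = parse_venues('The Beaches', 43.676357, -79.293031, fake)
print(rows)
```

Separating parsing from the HTTP request also makes it easier to see what happens when a response comes back without any `groups`.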

We can pass `getNearbyVenues()` the Toronto neighborhood names and their latitude/longitude to get up to 100 venues for each neighborhood. The function prints each neighborhood as it is processed to give a sense of progress.

In [25]:
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

This results in a dataframe of venues. We can check the dimensions of the dataframe using its `shape` attribute.

In [26]:
print(toronto_venues.shape)
toronto_venues.head()

(1705, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Glen Stewart Ravine,43.6763,-79.294784,Other Great Outdoors
4,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood


We can get a sense of how populated each neighborhood is (in terms of venues) by grouping all venues by neighborhood and counting the grouped results.

In [27]:
toronto_venues.groupby('Neighborhood')['Venue'].count()

Neighborhood
Adelaide, King, Richmond                                                                                      100
Berczy Park                                                                                                    56
Brockton, Exhibition Place, Parkdale Village                                                                   24
Business Reply Mail Processing Centre 969 Eastern                                                              16
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara     16
Cabbagetown, St. James Town                                                                                    44
Central Bay Street                                                                                             89
Chinatown, Grange Park, Kensington Market                                                                     100
Christie                                                                   

The number of venues varies widely by neighborhood: some reach the 100-venue limit and others are sparse. That matters for clustering, because a sparse neighborhood is described by only a handful of venues.
Among the 1,705 venues we can check how many different venue categories are present in Toronto.

In [28]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 231 unique categories.


In order to cluster using venue similarity, we need a numeric vector describing the venue types in each neighborhood, from which distances between neighborhoods can be computed. One-hot encoding will transform the `Venue Category` column into 231 columns labeled with each venue category.

In [29]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


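Looking at the dimensions of the one-hot encoded dataframe shows the 1,705 venues coded across 231 columns. Note that one of the 231 categories is itself named `Neighborhood` (it appears as a `Venue Category` in the `toronto_venues.head()` output), so assigning the `Neighborhood` column overwrites that category's indicator instead of adding a new column. The frame therefore holds the neighborhood name plus 230 category indicators, which is also why `Yoga Studio`, originally the last column, ends up first.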

In [30]:
toronto_onehot.shape

(1705, 231)
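When a venue category shares its name with the label column (Foursquare has a category called `Neighborhood`), a plain column assignment replaces that indicator. A minimal sketch of one way to avoid this on toy data, assuming a label column name, here the hypothetical `Neighborhood Name`, that no category can take:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Rosedale'],
    'Venue Category': ['Trail', 'Neighborhood', 'Park'],
})

onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# insert() raises a ValueError if the column name already exists,
# so a collision is reported instead of silently overwriting a category
onehot.insert(0, 'Neighborhood Name', venues['Neighborhood'])

print(onehot.columns.tolist())
```

Adopting this in the notebook would also mean grouping by the new column name in the later cells.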

The feature vector for each neighborhood will be the mean of its one-hot rows, that is, the share of the neighborhood's venues that fall in each category. For instance, a neighborhood with five venues across five different venue categories would get a mean score of **0.2 (1/5)** for each of those categories.

In [31]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
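The averaging step can be checked on a small hand-made example. In this sketch, neighborhood `A` has five venues in five different categories and neighborhood `B` has two cafés; the names and venues are invented for illustration:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['A'] * 5 + ['B'] * 2,
    'Venue Category': ['Bar', 'Café', 'Park', 'Pub', 'Trail', 'Café', 'Café'],
})

onehot = pd.get_dummies(venues['Venue Category'], dtype=float)

# The mean of the one-hot rows per neighborhood is the relative frequency of each category
freq = onehot.groupby(venues['Neighborhood']).mean()

print(freq)
```

Each category scores 0.2 for `A`, while `B` scores 1.0 for Café and 0.0 elsewhere, so every row of `freq` sums to 1.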


All these transformations should result in one row for each of the 38 neighborhoods and 231 columns: the neighborhood name plus the venue category columns. We can check that with the `shape` attribute.

In [32]:
toronto_grouped.shape

(38, 231)

As done in the NYC lab, we can get a sense of the neighborhood venue distribution by looking at the top five venues (based on mean score) and printing those out for each neighborhood.

In [33]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.07
1             Café  0.05
2  Thai Restaurant  0.04
3       Steakhouse  0.04
4              Bar  0.04


----Berczy Park----
          venue  freq
0   Coffee Shop  0.11
1  Cocktail Bar  0.05
2   Cheese Shop  0.04
3    Steakhouse  0.04
4        Bakery  0.04


----Brockton, Exhibition Place, Parkdale Village----
               venue  freq
0     Breakfast Spot  0.08
1               Café  0.08
2        Coffee Shop  0.08
3  Convenience Store  0.04
4             Office  0.04


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0  Light Rail Station  0.12
1         Pizza Place  0.06
2          Restaurant  0.06
3          Smoke Shop  0.06
4       Burrito Place  0.06


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0    Airport Lounge  0.12
1   Airport Service  0.12
2  Airport Terminal  0.

Let's code that operation into a function and then use it to generate the top 10 venues for each neighborhood to use in our clustering.

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [35]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: #no special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Steakhouse,Thai Restaurant,Bar,Cosmetics Shop,Gym,Restaurant,Hotel,American Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Café,Farmers Market,Beer Bar,Bakery,Steakhouse,Cheese Shop,Seafood Restaurant,Belgian Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Café,Coffee Shop,Breakfast Spot,Grocery Store,Bar,Stadium,Office,Falafel Restaurant,Burrito Place,Caribbean Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Fast Food Restaurant,Garden,Farmers Market,Comic Shop,Brewery,Spa,Burrito Place,Restaurant,Skate Park
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Service,Airport Terminal,Harbor / Marina,Bar,Plane,Coffee Shop,Sculpture Garden,Boat or Ferry,Boutique
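The column labels rely on the `try`/`except` fallback to pick a suffix, which is enough for ten columns. A more general sketch, using a hypothetical helper `ordinal()`, would also handle values such as 11th or 21st if `num_top_venues` were raised:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'

columns = ['Neighborhood'] + [f'{ordinal(i)} Most Common Venue' for i in range(1, 11)]
print(columns[:4])
```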


After some experimentation, this analysis uses k-means with 10 clusters, which seemed to fit the data best. The algorithm takes the vector of venue-category frequencies for each neighborhood and assigns the neighborhood to one of 10 clusters. We can see the cluster assignments in the `kmeans.labels_` attribute.

In [36]:
# set number of cluster
kclusters = 10

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood') #keep only the numeric category columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([5, 5, 5, 0, 1, 5, 5, 1, 5, 5, 5, 5, 8, 5, 5, 1, 5, 2, 5, 5, 5, 1,
       4, 1, 9, 5, 1, 6, 3, 5, 5, 5, 5, 5, 5, 7, 5, 5], dtype=int32)
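The notebook does not record how the candidate values of k were compared. One common option, shown here purely as an illustration and not as the method used above, is to compute the silhouette score for several values of k. The data below is synthetic, a stand-in for the neighborhood-by-category matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three tight synthetic groups in four dimensions
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(20, 4)) for c in (0.0, 0.5, 1.0)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With only 38 neighborhoods and sparse frequency vectors, real scores are likely to be much less clear-cut than on synthetic blobs, so such a score is a guide rather than a decision rule.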

The cluster labels are then added as a column to the dataframe of top-10 venues by neighborhood.

In [37]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto

# join df_toronto with neighborhoods_venues_sorted to attach the cluster label and top venues to each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,7,Health Food Store,Other Great Outdoors,Trail,Pub,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,5,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Ice Cream Shop,Yoga Studio,Bookstore,Brewery,Bubble Tea Shop,Restaurant
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,5,Pet Store,Fast Food Restaurant,Sushi Restaurant,Brewery,Pub,Ice Cream Shop,Movie Theater,Italian Restaurant,Sandwich Place,Fish & Chips Shop
43,M4M,East Toronto,Studio District,43.659526,-79.340923,5,Café,Coffee Shop,Bakery,Italian Restaurant,Gastropub,American Restaurant,Yoga Studio,Fish Market,Bookstore,Latin American Restaurant
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Swim School,Bus Line,Wings Joint,Dive Bar,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


That is all we need to plot the neighborhoods on a map of Toronto, coloring each neighborhood marker according to the cluster assigned to it by k-means.

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' (Cluster ' + str(cluster) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='gray',
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
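In the color setup above, the `ys` list is only used for its length, and `rainbow[cluster-1]` shifts every label by one so that cluster 0 takes the last color. A slightly simpler sketch of the same idea builds the palette directly from `kclusters` and indexes it by label:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 10

# One evenly spaced rainbow color per cluster, converted to hex strings for folium
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Index directly by label so cluster 0 takes the first color
label_to_color = {label: palette[label] for label in range(kclusters)}
print(label_to_color[0], label_to_color[9])
```

Either indexing gives each cluster its own color; the direct mapping just makes it easier to write a legend.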

In closing this project, we can extract any cluster and view its neighborhoods along with the top-10 venues in each. Cluster 5 is the largest cluster. One sees the similarity in the types of venues as well as in their rank within each neighborhood.

Notice where **Coffee Shop** and **Café** appear among these neighborhoods, showing similarity.

**Note**

> It appears that the Folium maps don't transfer from the live Jupyter Notebook over to GitHub (or to a PDF either). For that reason I have placed a static image of the map generated by the code above in the Git repo and provided a hyperlink here:

[Map showing clustered neighborhoods (file: Clustered_neighborhoods.jpg)](https://github.com/jgendron/Coursera_Capstone/blob/master/Clustered_neighborhoods.jpg)


In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
41,East Toronto,5,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Ice Cream Shop,Yoga Studio,Bookstore,Brewery,Bubble Tea Shop,Restaurant
42,East Toronto,5,Pet Store,Fast Food Restaurant,Sushi Restaurant,Brewery,Pub,Ice Cream Shop,Movie Theater,Italian Restaurant,Sandwich Place,Fish & Chips Shop
43,East Toronto,5,Café,Coffee Shop,Bakery,Italian Restaurant,Gastropub,American Restaurant,Yoga Studio,Fish Market,Bookstore,Latin American Restaurant
46,Central Toronto,5,Clothing Store,Coffee Shop,Mexican Restaurant,Diner,Rental Car Location,Dessert Shop,Salon / Barbershop,Chinese Restaurant,Furniture / Home Store,Sporting Goods Shop
47,Central Toronto,5,Sandwich Place,Dessert Shop,Coffee Shop,Café,Italian Restaurant,Pizza Place,Pharmacy,Sushi Restaurant,Gourmet Shop,Seafood Restaurant
49,Central Toronto,5,Pub,Coffee Shop,Bagel Shop,Vietnamese Restaurant,Light Rail Station,Supermarket,Sushi Restaurant,Pizza Place,Liquor Store,American Restaurant
51,Downtown Toronto,5,Coffee Shop,Park,Café,Restaurant,Pub,Chinese Restaurant,Italian Restaurant,Pizza Place,Bakery,Liquor Store
52,Downtown Toronto,5,Coffee Shop,Gay Bar,Japanese Restaurant,Sushi Restaurant,Restaurant,Men's Store,Gym,Mediterranean Restaurant,Café,Bubble Tea Shop
53,Downtown Toronto,5,Coffee Shop,Café,Park,Bakery,Mexican Restaurant,Gym / Fitness Center,Restaurant,Breakfast Spot,Theater,Pub
54,Downtown Toronto,5,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Fast Food Restaurant,Tea Room,Restaurant,Ramen Restaurant,Bookstore,Diner
