# Segmenting and Clustering Neighborhoods in Toronto
This notebook contains all code for the peer-graded assignment *Segmenting and Clustering Neighborhoods in Toronto* from the Data Science Capstone Project from Coursera. 

In order to follow the submission instructions, the notebook is divided into three main sections: 
1. [Create the Neighborhood Dataframe](#1.-Create-the-Neighbourhoods-Dataframe)

2. [Get coordinates for each neighbourhood](#2.-Get-coordinates-for-each-neighbourhood)

3. [Exploring and Clustering](#3.-Exploring-and-clustering)

    3.1. [Replicating the NYC analysis](#3.1.-Replicating-the-NYC-analysis)
    
    3.2. [Clustering neighborhoods per venue prices](#3.2.-Clustering-neighborhoods-per-venue-prices)


In [49]:
# First we download some common use packages
import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler

## 1. Create the Neighborhoods Dataframe

First, we get the raw text from the Wikipedia page and store it in the ```website_text``` variable

In [2]:
website_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
website_text = requests.get(website_url).text

Then, we use the BeautifulSoup package to get all the data from the table

In [3]:
soup = BeautifulSoup(website_text,'lxml')
neigh_table = soup.find("table", {"class":"wikitable sortable"})

We use read_html to read the HTML table into a pandas-compatible format. We will use the ```na_values``` parameter in order to store the **Not assigned** values as NaN, so it is easier later to remove them from the dataframe

In [4]:
# We can then use the pandas read_html function to convert the table into a pandas-compatible format
dfs = pd.read_html(neigh_table.prettify(), flavor='bs4', na_values='Not assigned')

# read_html returns a list of dataframes; as we are only using one table, we take the first element in the list
df = dfs[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


INSTRUCTION: *The dataframe will consist of three columns: PostalCode, Borough and Neighborhood*

Let's first rename ```Postal Code``` to ```PostalCode``` and ```Neighbourhood``` to ```Neighborhood```.

In [5]:
df.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"}, errors="raise", inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


INSTRUCTION: *Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned***

When importing the data (in cell, we decided to store the **Not assigned** values as NaN (Not a Number). We can then use ```dropna``` to remove all rows where ```Borough``` is NaN (Not assigned)

In [6]:
# We are only considering rows that have an assigned borough. We drop all rows with a value of NaN in column 'Borough'
df.dropna(subset=['Borough'], inplace=True)

INSTRUCTION: *In the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with a comma*.

However, from taking a look at the [table in Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) on 12/15/20, it seems that this has already been taken care of. 

In [7]:
df[df['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In order to double-check, let's check if there are other duplicated postal code values:

In [8]:
# First, let's check if there any duplicates in column 'Postal Code'
df['PostalCode'].duplicated().sum()

0

INSTRUCTION: *If a cell has a borough but a **Not assigned** neighbourhood, then the neighborhood will be the same as the borough*

Let's check if there are any neighborhoods with a value of **Not assigned**. As during import we stored all **Not assigned** as **NaN**, let's check if there are any ```null``` values in the ```Neighborhood``` column.

In [9]:
# Let's check if there are any neighbourhoods not assigned
df['Neighborhood'].isnull().sum()

0

In [10]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


It seems that there are no null values in the ```Neighborhood``` column.

Finally, the last instruction commands to use the **```.shape```** method to pring the number of rows of the dataframe

In [11]:
print('The number of rows in the dataframe is {}.'.format(df.shape[0]))

The number of rows in the dataframe is 103.


## 2. Get coordinates for each neighbourhood

### Downloading the data
Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. 

In order to make sure that we are get the data, we will use the csv file available in [http://cocl.us/Geospacial_data](http://cocl.us/Geospacial_data)

In [12]:
# We get the data and store it in data.csv
!wget -O geo_data.csv http://cocl.us/Geospatial_data

--2020-12-28 21:24:23--  http://cocl.us/Geospatial_data
Resolving cocl.us... 169.63.96.194, 169.63.96.176
Connecting to cocl.us|169.63.96.194|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data [following]
--2020-12-28 21:24:24--  https://cocl.us/Geospatial_data
Connecting to cocl.us|169.63.96.194|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-12-28 21:24:25--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com... 185.235.236.197
Connecting to ibm.box.com|185.235.236.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-12-28 21:24:26--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection

Then, we use pandas' ```read_csv``` method to read the csv file into a pandas dataframe, and check the first 5 rows

In [13]:
geo_df = pd.read_csv('geo_data.csv')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Before merging, we get more information about the ```geo_df``` dataframe

In [14]:
geo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Postal Code  103 non-null    object 
 1   Latitude     103 non-null    float64
 2   Longitude    103 non-null    float64
dtypes: float64(2), object(1)
memory usage: 2.5+ KB


The ```geo_df``` dataframe has the same number of rows (103) as our ```df``` dataframe obtained in the previous section. Let's merge both 

In [15]:
df_final = pd.merge(df, geo_df, left_on='PostalCode', right_on='Postal Code', how='left')
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",M5A,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",M6A,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",M7A,43.662301,-79.389494


Let's remove the duplicated ```Postal Code```column

In [16]:
df_final.drop('Postal Code', axis=1, inplace=True)
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [17]:
df_final.shape

(103, 5)

## 3. Exploring and clustering
Explore and cluster the neighbourhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the NYC data.

### 3.1. Replicating the NYC analysis
First, let's get the coordinates of Toronto

#### 3.1.1. Explore data
We sill use the ```geopy``` library to get the latitude and longitude values of Toronto.

In [18]:
# Firs, import the Nominatim library to convert an address into latitude and longitude values.
from geopy.geocoders import Nominatim

In [19]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The geographical coordinates of Toronto are ({}, {})".format(latitude, longitude))


The geographical coordinates of Toronto are (43.6534817, -79.3839347)


##### Create a map of Toronto with neighborhoods superimposed to it

In [20]:
import folium

In [21]:
# create a map of Toronto using the latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start = 11)

for lat, lng, borough, neighborhood in zip(df_final['Latitude'], df_final['Longitude'], df_final['Borough'], df_final['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False
    ).add_to(map_toronto)

map_toronto

We select only boroughs that contain the word 'Toronto'

In [22]:
tb = df_final[df_final['Borough'].str.contains('Toronto')].reset_index(drop=True)
tb.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [23]:
tb.shape

(39, 5)

In [24]:
boroughs = list(tb['Borough'].unique())
boroughs

['Downtown Toronto', 'East Toronto', 'West Toronto', 'Central Toronto']

In [47]:
map_colors=['deeppink', 'green', 'blue', 'purple']

There are only 39 postal codes in the *Toronto area* (boroughs that contain the word *Toronto* in their names). Let's check their location in a map. As a curiosity, as there can be more that 1 neighborhood per postal code, the size of the marker for each postal code will be proportional to the number of neighborhoods pertaining to the postal code. 

In [48]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start = 12)

for lat, lng, borough, neighborhood in zip(tb['Latitude'], tb['Longitude'], tb['Borough'], tb['Neighborhood']):
    neighbourhoods = neighborhood.split(', ')
    label = folium.Popup('{} - <strong>{}</strong>'.format(neighborhood, borough), parse_html=False)
    folium.CircleMarker(
        [lat, lng],
        popup=label,
        fill=True,
        color=map_colors[boroughs.index(borough)],
        fill_color=map_colors[boroughs.index(borough)],
        fill_opacity=0.5,
        radius=2*len(neighbourhoods),
        parse_html=True
    ).add_to(map_toronto)
    
    
map_toronto

First, we will define a function for building an url that will take as parameters the latitude and longitude values and will return the url that will ask for the 100 top recommended places in a radius of 500 meters around that coordinates.

In [28]:
# Url without price
def build_url(lat, lng):
    url = 'https://api.foursquare.com/v2/venues/explore?' +'&client_id={}'.format(CLIENT_ID) +'&client_secret={}&v={}'.format(CLIENT_SECRET, VERSION) + '&radius=500&limit=100' + '&ll={},{}'.format(lat,lng)
    return url

Let's borrow the **get_category_type** function from previous FourSquare labs.

In [29]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### 3.1.2. Explore Neighborhoods in Toronto

Our first objective is to create a dataframe that contains all the recommended venues in a neighborhood.

In [30]:
def getNearbyVenues(names, latitudes, longitudes):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        # create the API request URL using the 
        url = build_url(lat, lng)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)

Now let's use the above function on each neighborhood and crate a new dataframe called *toronto_venues*

In [31]:
toronto_venues = getNearbyVenues(names=tb['Neighborhood'], 
                                 latitudes=tb['Latitude'], 
                                 longitudes=tb['Longitude'])

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

In [32]:
print('There are {} venues in the Toronto area'.format(toronto_venues.shape[0]))
toronto_venues.head()

There are 1619 venues in the Toronto area


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa


In [33]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,58,58,58,58,58,58
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,64,64,64,64,64,64
Christie,16,16,16,16,16,16
Church and Wellesley,80,80,80,80,80,80
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,33,33,33,33,33,33
Davisville North,9,9,9,9,9,9


Let's find out how many unique categories can be curated from all the returned values

In [34]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 235 unique categories.


#### 3.1.3.  Analyze Each Neighborhood

In [36]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
# fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
# toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
toronto_onehot['Neighborhood']

0                               Regent Park, Harbourfront
1                               Regent Park, Harbourfront
2                               Regent Park, Harbourfront
3                               Regent Park, Harbourfront
4                               Regent Park, Harbourfront
                              ...                        
1614    Business reply mail Processing Centre, South C...
1615    Business reply mail Processing Centre, South C...
1616    Business reply mail Processing Centre, South C...
1617    Business reply mail Processing Centre, South C...
1618    Business reply mail Processing Centre, South C...
Name: Neighborhood, Length: 1619, dtype: object

Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [38]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.015625,0.015625
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0125,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,...,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,0.0,0.025
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's print each neighborhood along with the top 5 most common venues

In [39]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2            Beer Bar  0.03
3            Pharmacy  0.03
4  Seafood Restaurant  0.03


----Brockton, Parkdale Village, Exhibition Place----
                venue  freq
0                Café  0.13
1      Breakfast Spot  0.09
2         Coffee Shop  0.09
3          Restaurant  0.04
4  Italian Restaurant  0.04


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0    Light Rail Station  0.12
1  Gym / Fitness Center  0.06
2         Auto Workshop  0.06
3            Skate Park  0.06
4            Restaurant  0.06


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0   Airport Service  0.19
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3             Plane  0.06
4   Harbor / Marina  0.06


----Central Bay Street----
            

Let's put that into a ```pandas``` dataframe.

First, let's write a function to sort the venues in descending order

In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now, let's create the new dataframe and display the top 10 venues for each neighborhood

In [53]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Beer Bar,Farmers Market,Seafood Restaurant,Cheese Shop,Pharmacy,Bakery,Japanese Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Coffee Shop,Gym,Bakery,Stadium,Burrito Place,Restaurant,Climbing Gym,Pet Store
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Auto Workshop,Park,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Farmers Market
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Terminal,Airport Lounge,Boat or Ferry,Rental Car Location,Bar,Plane,Harbor / Marina,Sculpture Garden,Airport Food Court
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Café,Salad Place,Thai Restaurant,Bubble Tea Shop,Burger Joint,Yoga Studio,Juice Bar


#### 3.1.4. Cluster Neighborhoods
Run k-means to cluster the neighborhood into 5 clusters

In [54]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [55]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_final

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,,,,,,,,,,,
1,M4A,North York,Victoria Village,43.725882,-79.315572,,,,,,,,,,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Park,Bakery,Café,Pub,Breakfast Spot,Theater,Mexican Restaurant,Shoe Store,Brewery
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,,,,,,,,,,,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0.0,Coffee Shop,Sushi Restaurant,Yoga Studio,Discount Store,Portuguese Restaurant,Park,Mexican Restaurant,Japanese Restaurant,Italian Restaurant,Hobby Shop


In [56]:
toronto_merged["Cluster Labels"] = toronto_merged["Cluster Labels"].fillna(0).astype(int)

In [57]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### 3.1.5. Examine Clusters
Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories
##### Cluster 1
Cluster 1 seems to include the outliers, the neighborhoods with no recommended values or with unique combinations of common venues

In [58]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0,,,,,,,,,,
1,North York,0,,,,,,,,,,
2,Downtown Toronto,0,Coffee Shop,Park,Bakery,Café,Pub,Breakfast Spot,Theater,Mexican Restaurant,Shoe Store,Brewery
3,North York,0,,,,,,,,,,
4,Downtown Toronto,0,Coffee Shop,Sushi Restaurant,Yoga Studio,Discount Store,Portuguese Restaurant,Park,Mexican Restaurant,Japanese Restaurant,Italian Restaurant,Hobby Shop
...,...,...,...,...,...,...,...,...,...,...,...,...
98,Etobicoke,0,,,,,,,,,,
99,Downtown Toronto,0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Pub,Fast Food Restaurant,Men's Store,Mediterranean Restaurant
100,East Toronto,0,Light Rail Station,Auto Workshop,Park,Comic Shop,Pizza Place,Restaurant,Burrito Place,Brewery,Skate Park,Farmers Market
101,Etobicoke,0,,,,,,,,,,


##### Cluster 2

In [59]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
83,Central Toronto,1,Gym,Park,Department Store,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant


##### Cluster 3

In [60]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,Central Toronto,2,Ice Cream Shop,Home Service,Music Venue,Garden,Dessert Shop,Diner,Discount Store,Distribution Center,Dog Run,Deli / Bodega


##### Cluster 4

In [61]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,Central Toronto,3,Jewelry Store,Trail,Mexican Restaurant,Sushi Restaurant,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant


##### Cluster 5

In [62]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,Downtown Toronto,4,Park,Playground,Trail,Dance Studio,Escape Room,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run


### 3.2. Clustering neighborhoods per venue prices

The FourSquare API allows you to explore the venues around you based on their price.. According to Foursquare's documentation, currently the valid range of price points are [1,2,3,4], 1 being the least expensive and 4 being the most expensive. For food venues in the United States, 1 is < \\$0 an entree, 2 is \\$10-20 an entree, 3 is \\$20-$30 an entree, 4 is > \\$30 an entree.


We will also make a request of all recommended places in the neighbourhood based on their price. In order to do that, we use the ```price``` argument in the url. In order to automate the process, we will use a function that will build the url for requesting at most 100 recommended prices in a radius of 500m based on their price range: 

In [63]:
# Url with price
def build_url_by_price(lat, lng, price):
    url = 'https://api.foursquare.com/v2/venues/explore?' +'&client_id={}'.format(CLIENT_ID) +'&client_secret={}&v={}'.format(CLIENT_SECRET, VERSION) + '&radius=500&limit=100' + '&ll={},{}'.format(lat,lng) + '&price={}'.format(price)
    return url

We will store the number of venues per price category in a dedicated dataframe, named ```price_df```. This dataframe will contain 6 columns: ```Neighborhood``` (for the neighborhood name), ```Total``` (for the total number of recommended venues, regardless their price range), and ```Price 1``` to ```4``` to store the number of venues per category price in each neighborhood 

In [64]:
prices_df_columns = ['Neighborhood','Total', 'Price 1', 'Price 2', 'Price 3', 'Price 4']
prices_df = pd.DataFrame(columns=prices_df_columns)
prices_df.head()

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4


Let's populate the ```prices_df``` dataframe

In [65]:
for neighbourhood_name, lat, lng in zip(tb['Neighborhood'], tb['Latitude'], df_final['Longitude']):
    prices = [neighbourhood_name, 0, 0, 0, 0, 0]
    url = build_url(lat, lng)
    result_wo_price = requests.get(url).json()
        
    prices[1] = result_wo_price['response']['totalResults']
    for i in range(4):
        url_by_price = build_url_by_price(lat, lng, i + 1)
        results_by_price = requests.get(url_by_price).json()
        prices[i + 2] = results_by_price['response']['totalResults']
    prices_df = prices_df.append(dict(zip(prices_df_columns, prices)), ignore_index=True)

Let's take a look at the first rows of our new dataframe

In [66]:
prices_df.head()

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4
0,"Regent Park, Harbourfront",5,0,0,0,0
1,"Queen's Park, Ontario Provincial Government",28,6,7,1,0
2,"Garden District, Ryerson",41,12,6,0,0
3,St. James Town,11,2,1,0,0
4,The Beaches,30,7,5,4,0


We will assume here (and this assumption might be wrong) that the total number of recommended venues in a neighborhood is the sum of all venues that have a price, plus other venues which do not have a price tag (i.e. a public beach, a park). We will store the difference between the total results (obtained via a query) and the sum of all the venues with a price tag in a column named ```No price```.

In [67]:
prices_df['No price'] = prices_df['Total'] - (prices_df['Price 1'] + prices_df['Price 2'] + prices_df['Price 3'] + prices_df['Price 4'])
prices_df.head()

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price
0,"Regent Park, Harbourfront",5,0,0,0,0,5
1,"Queen's Park, Ontario Provincial Government",28,6,7,1,0,14
2,"Garden District, Ryerson",41,12,6,0,0,23
3,St. James Town,11,2,1,0,0,8
4,The Beaches,30,7,5,4,0,14


#### 3.2.2. Clustering neighbourhoods per price range
Before starting to cluster the neighborhoods, we will standardize the prices so we get values between 0 and 1 for all values. Before that, we remove the 'neighborhood' column (as it is not numeric)

In [68]:
prices_df_values = prices_df[['Total', 'Price 1', 'Price 2', 'Price 3', 'Price 4', 'No price']]
prices_df_values.head()

Unnamed: 0,Total,Price 1,Price 2,Price 3,Price 4,No price
0,5,0,0,0,0,5
1,28,6,7,1,0,14
2,41,12,6,0,0,23
3,11,2,1,0,0,8
4,30,7,5,4,0,14


In [69]:
scaler = StandardScaler()
prices_std = scaler.fit_transform(prices_df_values)

In [70]:
kmeans_prices = KMeans(n_clusters=5, random_state=0).fit(prices_std)
kmeans_prices.labels_[0:10]

array([2, 0, 0, 2, 0, 2, 2, 0, 2, 4], dtype=int32)

In [71]:
prices_df['Label'] = kmeans_prices.labels_
prices_df.head()

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price,Label
0,"Regent Park, Harbourfront",5,0,0,0,0,5,2
1,"Queen's Park, Ontario Provincial Government",28,6,7,1,0,14,0
2,"Garden District, Ryerson",41,12,6,0,0,23,0
3,St. James Town,11,2,1,0,0,8,2
4,The Beaches,30,7,5,4,0,14,0


In [72]:
toronto_venues_by_price = prices_df.merge(tb[['Neighborhood', 'Latitude', 'Longitude']], on='Neighborhood', how='left')
toronto_venues_by_price.head()

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price,Label,Latitude,Longitude
0,"Regent Park, Harbourfront",5,0,0,0,0,5,2,43.65426,-79.360636
1,"Queen's Park, Ontario Provincial Government",28,6,7,1,0,14,0,43.662301,-79.389494
2,"Garden District, Ryerson",41,12,6,0,0,23,0,43.657162,-79.378937
3,St. James Town,11,2,1,0,0,8,2,43.651494,-79.375418
4,The Beaches,30,7,5,4,0,14,0,43.676357,-79.293031


In [73]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x=np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_venues_by_price['Latitude'], toronto_venues_by_price['Longitude'], toronto_venues_by_price['Neighborhood'], toronto_venues_by_price['Label']):
    label = folium.Popup(str(poi) + 'Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

#### 3.1.3. Analyzing each cluster
First, let's check the number of occurrences per label

In [74]:
toronto_venues_by_price.groupby('Label').count()

Unnamed: 0_level_0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price,Latitude,Longitude
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,9,9,9,9,9,9,9,9,9
1,1,1,1,1,1,1,1,1,1
2,21,21,21,21,21,21,21,21,21
3,2,2,2,2,2,2,2,2,2
4,6,6,6,6,6,6,6,6,6


It seems that labels 2 and 3 have fewer occurrences. Let's check the mean of the number of occurrences per label

In [75]:
toronto_venues_by_price['Total'] = toronto_venues_by_price['Total'].astype(int)
toronto_venues_by_price['Price 1'] = toronto_venues_by_price['Price 1'].astype(int)
toronto_venues_by_price['Price 2'] = toronto_venues_by_price['Price 2'].astype(int)
toronto_venues_by_price['Price 3'] = toronto_venues_by_price['Price 3'].astype(int)
toronto_venues_by_price['Price 4'] = toronto_venues_by_price['Price 4'].astype(int)
toronto_venues_by_price['No price'] = toronto_venues_by_price['No price'].astype(int)

toronto_venues_by_price.groupby('Label').mean()

Unnamed: 0_level_0,Total,Price 1,Price 2,Price 3,Price 4,No price,Latitude,Longitude
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,30.222222,9.444444,6.333333,1.0,0.0,13.444444,43.669649,-79.371983
1,162.0,53.0,38.0,16.0,7.0,48.0,43.648429,-79.38228
2,5.142857,1.952381,1.142857,0.047619,0.047619,1.952381,43.669746,-79.392779
3,73.0,13.0,14.5,7.0,2.5,36.0,43.662958,-79.402864
4,50.333333,15.5,18.166667,3.166667,0.0,13.5,43.658734,-79.403474


It seems like:
* Label 0 includes neighborhoods with average 50 recommended venues, with no 'Price 4' venues, and slighly more venues with 'Price 2' than really cheap venues.
* Label 1 includes neighborhoods with few recommended places
* Label 2 includes neighborhoods with many recommended places. It has the highest numbers of venues with 'Price 4', and have significantly more venues with a price tag than venues without tag.
* Label 3 include proportionally more 'No price' venues than label 2 (approximately half of them). These neighborhoods have also a higher mean number of 'Price 2' venues compared to 'Price 1' venues, so we can consider these are 'more expensive' neighborhoods. 
* Label 4 includes neighborhoods with an average 30 recommended venues, with no 'Price 4' venues, and with slighly more venues with 'Price 1' tags than venues with 'Price 2' venues

As we did in the previous clustering in section 3.1, let's check the different cluster

#### Cluster 0

In [76]:
toronto_venues_by_price[toronto_venues_by_price['Label'] == 0][['Neighborhood', 'Total', 'Price 1', 'Price 2', 'Price 3', 'Price 4', 'No price']]

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price
1,"Queen's Park, Ontario Provincial Government",28,6,7,1,0,14
2,"Garden District, Ryerson",41,12,6,0,0,23
4,The Beaches,30,7,5,4,0,14
7,Christie,24,11,7,0,0,6
15,"India Bazaar, The Beaches West",40,15,6,1,0,18
23,"North Toronto West, Lawrence Park",26,6,4,0,0,16
27,"University of Toronto, Harbord",37,10,10,0,0,17
34,Stn A PO Boxes,20,10,4,0,0,6
35,"St. James Town, Cabbagetown",26,8,8,3,0,7


#### Cluster 1

In [77]:
toronto_venues_by_price[toronto_venues_by_price['Label'] == 1][['Neighborhood', 'Total', 'Price 1', 'Price 2', 'Price 3', 'Price 4', 'No price']]

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price
36,"First Canadian Place, Underground city",162,53,38,16,7,48


#### Cluster 2

In [78]:
toronto_venues_by_price[toronto_venues_by_price['Label'] == 2][['Neighborhood', 'Total', 'Price 1', 'Price 2', 'Price 3', 'Price 4', 'No price']]

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price
0,"Regent Park, Harbourfront",5,0,0,0,0,5
3,St. James Town,11,2,1,0,0,8
5,Berczy Park,16,7,4,0,0,5
6,Central Bay Street,0,0,0,0,0,0
8,"Richmond, Adelaide, King",0,0,0,0,0,0
11,"Little Portugal, Trinity",4,5,1,0,0,-2
12,"The Danforth West, Riverdale",0,0,0,0,0,0
13,"Toronto Dominion Centre, Design Exchange",2,0,0,0,0,2
14,"Brockton, Parkdale Village, Exhibition Place",1,0,0,0,0,1
17,Studio District,4,2,0,0,1,1


#### Cluster 3

In [79]:
toronto_venues_by_price[toronto_venues_by_price['Label'] == 3][['Neighborhood', 'Total', 'Price 1', 'Price 2', 'Price 3', 'Price 4', 'No price']]

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price
24,"The Annex, North Midtown, Yorkville",71,14,15,7,2,33
30,"Kensington Market, Chinatown, Grange Park",75,12,14,7,3,39


#### Cluster 4

In [80]:
toronto_venues_by_price[toronto_venues_by_price['Label'] == 4][['Neighborhood', 'Total', 'Price 1', 'Price 2', 'Price 3', 'Price 4', 'No price']]

Unnamed: 0,Neighborhood,Total,Price 1,Price 2,Price 3,Price 4,No price
9,"Dufferin, Dovercourt Village",70,26,18,2,0,24
10,"Harbourfront East, Union Station, Toronto Islands",43,15,13,3,0,12
16,"Commerce Court, Victoria Hotel",44,11,20,3,0,10
25,"Parkdale, Roncesvalles",42,8,18,8,0,8
33,Rosedale,55,19,17,2,0,17
37,Church and Wellesley,48,14,23,1,0,10
