<h1 align="center"><strong>Segmenting and Clustering Neighborhoods in Toronto</strong></h1>
<h4 align="right">by <strong>Prafful Agrawal</strong></h4>

# Introduction

In this notebook, the neighborhoods of the city of Toronto are segmented and clustered. The following procedure is adopted:
- **Web Scraping**  - The given [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) is scraped to collect data about the neighborhood of Toronto.
- **Location Data** - For the above neighborhoods, location data, i.e. Latitude and Longitude, is generated using Geopy library.
- **Explore Neighborhoods** - Using [Foursquare API](https://developer.foursquare.com/), explore each neighborhood for popular venues.
- **Analyse Neighborhoods** - Analyze the type of venues which are popular in each neighborhood.
- **Cluster Neighborhoods** - Using K-Means Clustering, we cluster the various neighborhoods togther.

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis

import json # library to handle JSON files

#!conda install -c conda-forge BeautifulSoup4 --yes # uncomment this line to install Beautiful Soup package if not previously installed
from bs4 import BeautifulSoup # library to handle HTML files

#!conda install -c conda-forge geopy --yes # uncomment this line to install Geopy package if not previously installed
from geopy.geocoders import ArcGIS # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line to install Folium package if not previously installed
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Web Scraping

### Download and explore the data

Scrape the Wikipedia page using **requests** and **BeautifulSoup** libraries.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
website_html = requests.get(url).text
soup = BeautifulSoup(website_html,'html.parser')

Extract the first *table* (of class *wikitable sortable*) on the webpage. [NOTE: You can extract all tables using the **soup.find_all()** method]

In [3]:
table = soup.find('table',{'class':'wikitable sortable'})

Print the table using **prettify()** method from **BeautifulSoup**.

In [4]:
# print(table.prettify()) # uncomment this line to print the table

### Transform the data into a *pandas* dataframe

Extract the contents of the table.

In [5]:
postal_codes = []
boroughs = []
neighborhoods = []

# Extract rows from the table
rows = table.find_all('tr')

# Iterate over rows
for row in rows[1:]:
    # Extract all cells from the row
    cells = row.find_all('td')
    # Check if all three columns are available
    if len(cells) == 3:
        # Append 'postal_codes' with the new data
        postal_code = cells[0]
        postal_codes.append(postal_code.text.strip())
        # Append 'boroughs' with the new data
        borough = cells[1]
        boroughs.append(borough.text.strip())
        # Append 'neighborhoods' with the new data
        neighborhood = cells[2]
        neighborhoods.append(neighborhood.text.strip())

Convert the above data into a **pandas** dataframe.

In [6]:
df = pd.DataFrame({'PostalCode':postal_codes, 'Borough':boroughs, 'Neighborhood':neighborhoods})
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [7]:
df.tail()

Unnamed: 0,PostalCode,Borough,Neighborhood
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z,Not assigned,Not assigned


### Clean the dataset

Check the shape of the dataframe.

In [8]:
df.shape

(180, 3)

Check the number of *unique* **PostalCode**. 

In [9]:
len(df['PostalCode'].unique())

180

Since, number of unique postal codes is equal to the number of rows, hence, there is no duplicate postal code in the dataframe.

Drop the rows where the **Borough** is **Not assigned**.

In [10]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Check if there are rows where **Neighborhood** is **Not assigned**.

In [11]:
sum(df['Neighborhood']=='Not assigned')

0

Since, there are no such rows, we can proceed to the next step. Otherwise, we copy the value of **Borough** to the **Neighborhood** column.

Check the shape of the revised dataframe.

In [12]:
df.shape

(103, 3)

## 2. Location Data

### Use **geopy** library to get the latitude and longitude values of the city of Toronto and its neighborhoods

Use **ArcGIS()** method from **geopy.geocoders** library to get the *latitude* and *longitude* values of the city of Toronto.

In [13]:
# Initialize ArcGIS instance
geolocator = ArcGIS()

# Toronto Latitude and Longitude
tor_location = geolocator.geocode('Toronto, Canada')
tor_lat = tor_location[1][0]
tor_lng = tor_location[1][1]
print('Toronto: Latitude %.4f, Longitude %.4f'%(tor_lat, tor_lng))

Toronto: Latitude 43.6487, Longitude -79.3854


Similarly, get the *latitude* and *longitude* values of the various neighborhoods.

In [14]:
latitudes = []
longitudes = []

for postal_code in df['PostalCode'] :
    # Query the location address for each Postal Code
    location = geolocator.geocode('{}, Toronto, Ontario'.format(postal_code))
    # Extract the Latitude and append to the list
    latitude = location[1][0]
    latitudes.append(latitude)
    # Extract the Longitude and append to the list
    longitude = location[1][1]
    longitudes.append(longitude)

### Append the dataset with *Latitude* and *Longitude* values

Add the **Latitude** and **Longitude** data into the **pandas** dataframe.

In [15]:
df['Latitude'] = latitudes
df['Longitude'] = longitudes
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939


Check if **NAs** are introduced in the dataframe.

In [16]:
df.isna().values.any()

False

Since, there are no such rows, we can proceed to the next step. Otherwise, we find the row(s) with **NAs**, and then drop those row(s).

Check the shape of the revised dataframe.

In [17]:
df.shape

(103, 5)

### Create a map of Toronto with the neighborhoods superimposed on top.

Use **Folium** library to generate the map of Toronto along with neighborhood data.

In [18]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[tor_lat, tor_lng], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### **NOTE:** [Regarding Folium Maps.](#0)

## 3. Explore Neighborhoods

### Define Foursquare Credentials and Version

In [19]:
CLIENT_ID = 'PJVBZE0T3PCXYR3LFJZE4HB13SNKOMUM3ZLMEOCYBSS20WMJ' # your Foursquare ID
CLIENT_SECRET = 'GDM0LJTO5A03R3X23I0OBCTGX2KMMOMHBGV4LQYNPAJ11ZJ4' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: PJVBZE0T3PCXYR3LFJZE4HB13SNKOMUM3ZLMEOCYBSS20WMJ
CLIENT_SECRET:GDM0LJTO5A03R3X23I0OBCTGX2KMMOMHBGV4LQYNPAJ11ZJ4


**NOTE: The above credentials have been RESET. You may replace them with your own credentials to recreate the notebook.**

### Retrieve venue data for the neighborhoods

Create a function to retrieve nearby venue data for each of the neighborhoods.

In [20]:
def getNearbyVenues(postal_codes, boroughs, names, latitudes, longitudes, radius=500, LIMIT=50):
    
    venues_list=[]
    number_of_venues=[]
    for postal_code, borough, name, lat, lng in zip(postal_codes, boroughs, names, latitudes, longitudes):
        print('Postal Code: {}; Borough: {}; Neighbourhood: {}'.format(postal_code, borough, name))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(postal_code,
                             borough,
                             name,
                             lat,
                             lng,
                             v['venue']['name'],
                             v['venue']['location']['lat'],
                             v['venue']['location']['lng'],
                             v['venue']['categories'][0]['name']) for v in results])
        
        # Add the number of venues returned for the location to the list
        number_of_venues.append(len(results))
        print('Number of venues returned:', len(results))

    # Transform the venues data into a pandas dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code',
                             'Borough',
                             'Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    # Return the venue data
    return(nearby_venues, number_of_venues)

Call the function and retrieve the resulting dataset

In [21]:
toronto_venues, number_of_venues = getNearbyVenues(postal_codes=df['PostalCode'],
                                                   boroughs=df['Borough'],
                                                   names=df['Neighborhood'],
                                                   latitudes=df['Latitude'],
                                                   longitudes=df['Longitude'])

Postal Code: M3A; Borough: North York; Neighbourhood: Parkwoods
Number of venues returned: 3
Postal Code: M4A; Borough: North York; Neighbourhood: Victoria Village
Number of venues returned: 6
Postal Code: M5A; Borough: Downtown Toronto; Neighbourhood: Regent Park, Harbourfront
Number of venues returned: 25
Postal Code: M6A; Borough: North York; Neighbourhood: Lawrence Manor, Lawrence Heights
Number of venues returned: 50
Postal Code: M7A; Borough: Downtown Toronto; Neighbourhood: Queen's Park, Ontario Provincial Government
Number of venues returned: 38
Postal Code: M9A; Borough: Etobicoke; Neighbourhood: Islington Avenue, Humber Valley Village
Number of venues returned: 4
Postal Code: M1B; Borough: Scarborough; Neighbourhood: Malvern, Rouge
Number of venues returned: 1
Postal Code: M3B; Borough: North York; Neighbourhood: Don Mills
Number of venues returned: 5
Postal Code: M4B; Borough: East York; Neighbourhood: Parkview Hill, Woodbine Gardens
Number of venues returned: 14
Postal Code

### Explore the resulting dataset

Venue data.

In [22]:
toronto_venues.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,North York,Parkwoods,43.752935,-79.335641,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,North York,Parkwoods,43.752935,-79.335641,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,M3A,North York,Parkwoods,43.752935,-79.335641,Brookbanks Pool,43.751389,-79.332184,Pool
3,M4A,North York,Victoria Village,43.728102,-79.31189,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,M4A,North York,Victoria Village,43.728102,-79.31189,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [23]:
toronto_venues.shape

(1741, 9)

Append the original dataset `df` with **NumberOfRows**

In [24]:
df['NumberOfVenues'] = number_of_venues
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,NumberOfVenues
0,M3A,North York,Parkwoods,43.752935,-79.335641,3
1,M4A,North York,Victoria Village,43.728102,-79.31189,6
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041,25
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211,50
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939,38


In [25]:
df.shape

(103, 6)

Check the *maximum* **NumberOfVenues**.

In [26]:
df['NumberOfVenues'].max()

50

This is equal to the **LIMIT** as defined in the function head. You can change this **LIMIT** by passing the desired value when calling the function `getNearbyVenues`.

Check the *minimum* **NumberOfVenues**.

In [27]:
df['NumberOfVenues'].min()

0

This indicates there are some locations where no venue data was retrieved.

You can change the value of **radius** during function call to include a larger area to scan for nearby venues.

Retrieve the locations where no venue data was returned.

In [28]:
df[df['NumberOfVenues']==df['NumberOfVenues'].min()]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,NumberOfVenues
50,M9L,North York,Humber Summit,43.759381,-79.557174,0
53,M3M,North York,Downsview,43.73322,-79.4977,0
95,M1X,Scarborough,Upper Rouge,43.834768,-79.204101,0


Confirm if these locations are not present in the **toronto_venues** dataset.

In [29]:
# Check the postal codes which are different between the two datasets
set(df['PostalCode']) - set(toronto_venues['Postal Code'])

{'M1X', 'M3M', 'M9L'}

Create a modified dataframe `df` removing the above **PostalCode**.

In [30]:
df_mod = df[df['NumberOfVenues']!=df['NumberOfVenues'].min()]

Confirm the size of modified dataset.

In [31]:
df_mod.shape

(100, 6)

## 4. Analyze Neighborhoods 

### Analyze location data

Check the number of *unique* **Postal Code** in the **toronto_venues** dataset.

In [32]:
len(toronto_venues['Postal Code'].unique())

100

This is equal to the number of **PostalCode** in the **df** dataset minus the number of **PostalCode** for which no nearby venue data was returned.

Check the number of *unique* **Neighborhood** in the **toronto_venues** dataset.

In [33]:
len(toronto_venues['Neighborhood'].unique())

97

This is less than the number of **Postal Code** in the **toronto_venues** dataset.

Check for duplicate **Neighborhood** in the **df** dataset.

In [34]:
df[df['Neighborhood'].duplicated(keep=False)]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,NumberOfVenues
7,M3B,North York,Don Mills,43.7489,-79.35722,5
13,M3C,North York,Don Mills,43.722143,-79.352023,6
40,M3K,North York,Downsview,43.739101,-79.467631,4
46,M3L,North York,Downsview,43.729992,-79.512027,2
53,M3M,North York,Downsview,43.73322,-79.4977,0
60,M3N,North York,Downsview,43.755819,-79.519973,20


Since, there are *unique locations* where the **Borough** and **Neighborhood** values are duplicate but the **PostalCode** values are different, use **Postal Code** to group by and analyze the **toronto_venues** dataset.

#### **NOTE:** Solve the discrepancy in the **Venue Category** column of **toronto_venues** dataset.

Check the shape of **toronto_venues** dataset.

In [35]:
toronto_venues.shape

(1741, 9)

There are some venues where the **Venue Category** is **Neighborhood**.

In [36]:
toronto_venues[toronto_venues['Venue Category'] == 'Neighborhood']

Unnamed: 0,Postal Code,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
316,M4E,East Toronto,The Beaches,43.678148,-79.295349,Upper Beaches,43.680563,-79.292869,Neighborhood
422,M5G,Downtown Toronto,Central Bay Street,43.656072,-79.385653,Downtown Toronto,43.653232,-79.385296,Neighborhood
525,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650542,-79.384116,Downtown Toronto,43.653232,-79.385296,Neighborhood
768,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.639922,-79.43124,Parkdale,43.640524,-79.4322,Neighborhood


To eliminate the chance of future discrepancy, remove these rows.

In [37]:
toronto_venues = toronto_venues[toronto_venues['Venue Category'] != 'Neighborhood'].reset_index(drop=True)
toronto_venues.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,North York,Parkwoods,43.752935,-79.335641,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,North York,Parkwoods,43.752935,-79.335641,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,M3A,North York,Parkwoods,43.752935,-79.335641,Brookbanks Pool,43.751389,-79.332184,Pool
3,M4A,North York,Victoria Village,43.728102,-79.31189,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,M4A,North York,Victoria Village,43.728102,-79.31189,Portugril,43.725819,-79.312785,Portuguese Restaurant


Check the shape of revised **toronto_venues** dataset.

In [38]:
toronto_venues.shape

(1737, 9)

### One Hot Encode the Venue Categories

Check the number of *unique* **Venue Category**.

In [39]:
len(toronto_venues['Venue Category'].unique())

247

One hot encode the venue categories.

In [40]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add various columns back to dataframe
toronto_onehot.insert(loc=0, column='Postal Code', value=toronto_venues['Postal Code'])
toronto_onehot.insert(loc=1, column='Borough', value=toronto_venues['Borough'])
toronto_onehot.insert(loc=2, column='Neighborhood', value=toronto_venues['Neighborhood'])

toronto_onehot.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M3A,North York,Parkwoods,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A,North York,Parkwoods,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A,North York,Parkwoods,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4A,North York,Victoria Village,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4A,North York,Victoria Village,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the shape of the new dataset

In [41]:
toronto_onehot.shape

(1737, 250)

### Analyze the dataset

Next, group the rows of above dataset by **Postal Code** and calculate the mean of the frequencies of occurance of each category.

In [42]:
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Postal Code,Accessories Store,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Check the shape of resulting dataframe.

In [43]:
toronto_grouped.shape

(100, 248)

Print each neighborhood along with the top 3 most common venues.

In [44]:
num_top_venues = 3

for postal_code in toronto_grouped['Postal Code']:
    print("----- Postal Code: "+postal_code+" -----")
    temp = toronto_grouped[toronto_grouped['Postal Code'] == postal_code].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----- Postal Code: M1B -----
               venue  freq
0              Trail   1.0
1  Accessories Store   0.0
2       Optical Shop   0.0


----- Postal Code: M1C -----
                       venue  freq
0                        Bar   1.0
1          Accessories Store   0.0
2  Middle Eastern Restaurant   0.0


----- Postal Code: M1E -----
                  venue  freq
0           Coffee Shop   0.1
1           Pizza Place   0.1
2  Fast Food Restaurant   0.1


----- Postal Code: M1G -----
               venue  freq
0        Coffee Shop   0.4
1  Korean Restaurant   0.2
2           Pharmacy   0.2


----- Postal Code: M1H -----
                        venue  freq
0  Construction & Landscaping  0.25
1                 Gaming Cafe  0.25
2                      Lounge  0.25


----- Postal Code: M1J -----
                  venue  freq
0        Sandwich Place  0.25
1     Indian Restaurant  0.25
2  Fast Food Restaurant  0.25


----- Postal Code: M1K -----
              venue  freq
0  Department Store

### Create a new *pandas* dataframe containing top 10 most common venues 

Write a function to sort the venues in descending order of frequencies.

In [45]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new *pandas* dataframe.

In [46]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Trail,Yoga Studio,Distribution Center,Flower Shop,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
1,M1C,Bar,Yoga Studio,Dog Run,Flower Shop,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
2,M1E,Fast Food Restaurant,Pizza Place,Restaurant,Coffee Shop,Thrift / Vintage Store,Mexican Restaurant,Pharmacy,Sports Bar,Fried Chicken Joint,Supermarket
3,M1G,Coffee Shop,Convenience Store,Pharmacy,Korean Restaurant,Fish & Chips Shop,Fish Market,Field,Fast Food Restaurant,Dog Run,Farmers Market
4,M1H,Lounge,Gaming Cafe,Construction & Landscaping,Trail,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Falafel Restaurant,Farm,Yoga Studio


Check the shape of the dataframe

In [47]:
venues_sorted.shape

(100, 11)

## 5. Cluster Neighborhoods 

### K-means Clustering

Run *k*-means to cluster the neighborhood into 5 clusters.

In [48]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Postal Code', axis=1)

# run k-means clustering
kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=15, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 1, 1, 1, 4, 1, 1, 1, 1, 1], dtype=int32)

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [49]:
# initilize the dataset
toronto_merged = df_mod.copy()

# add clustering labels
toronto_merged.insert(len(toronto_merged.columns), 'Cluster Labels', kmeans.labels_)

# merge toronto_merged with venues_sorted to add the venues data for each neighborhood
toronto_merged = toronto_merged.join(venues_sorted.set_index('Postal Code'), on='PostalCode')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,NumberOfVenues,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.752935,-79.335641,3,4,Food & Drink Shop,Park,Pool,Health Food Store,Diner,Field,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
1,M4A,North York,Victoria Village,43.728102,-79.31189,6,1,Coffee Shop,Intersection,Pizza Place,Portuguese Restaurant,French Restaurant,Park,Donut Shop,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041,25,1,Pub,Café,Athletics & Sports,Trail,Bank,Chocolate Shop,Restaurant,Tech Startup,Bakery,Thai Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211,50,1,Clothing Store,Food Court,Restaurant,Bookstore,Toy / Game Store,Furniture / Home Store,Men's Store,American Restaurant,Cosmetics Shop,Mexican Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939,38,4,Coffee Shop,Sushi Restaurant,Café,Restaurant,Fried Chicken Joint,Bookstore,Smoothie Shop,Burger Joint,Burrito Place,Sandwich Place


### Visualize the results on a map 

Use **Folium** library to generate the map of Toronto along with neighborhood data.

In [50]:
# create map
map_clusters = folium.Map(location=[tor_lat, tor_lng], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, pos, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(pos) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### **NOTE:** [Regarding Folium Maps.](#0)

### Examine the clusters 

Examine each cluster and determine the discriminating venue categories that distinguish each cluster.

#### Cluster 1

In [51]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,NumberOfVenues,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,M6C,5,0,Hockey Arena,Park,Field,Grocery Store,Trail,Donut Shop,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Falafel Restaurant
22,M1G,5,0,Coffee Shop,Convenience Store,Pharmacy,Korean Restaurant,Fish & Chips Shop,Fish Market,Field,Fast Food Restaurant,Dog Run,Farmers Market
24,M5G,50,0,Coffee Shop,Clothing Store,Plaza,Bubble Tea Shop,Sandwich Place,Café,Japanese Restaurant,Middle Eastern Restaurant,Department Store,Modern European Restaurant
39,M2K,2,0,Construction & Landscaping,Trail,Yoga Studio,Dog Run,Flower Shop,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market
43,M6K,43,0,Café,Coffee Shop,Thrift / Vintage Store,Gift Shop,Restaurant,Dance Studio,Brewery,Sandwich Place,Caribbean Restaurant,Chiropractor
44,M1L,9,0,Intersection,Bus Line,Metro Station,Soccer Field,Bakery,Bus Station,Coffee Shop,Electronics Store,Ethiopian Restaurant,Falafel Restaurant
46,M3L,2,0,Pool,Caribbean Restaurant,Yoga Studio,Distribution Center,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
48,M5L,50,0,Café,Coffee Shop,Restaurant,Hotel,Gym,Gastropub,Japanese Restaurant,Deli / Bodega,American Restaurant,New American Restaurant
59,M2N,50,0,Ramen Restaurant,Korean Restaurant,Coffee Shop,Sushi Restaurant,Pizza Place,Fast Food Restaurant,Café,Sandwich Place,Lounge,Sports Bar
73,M4R,4,0,Playground,Garden,Gym Pool,Park,Yoga Studio,Ethiopian Restaurant,Donut Shop,Eastern European Restaurant,Electronics Store,Farm


#### Cluster 2

In [52]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,NumberOfVenues,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M4A,6,1,Coffee Shop,Intersection,Pizza Place,Portuguese Restaurant,French Restaurant,Park,Donut Shop,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant
2,M5A,25,1,Pub,Café,Athletics & Sports,Trail,Bank,Chocolate Shop,Restaurant,Tech Startup,Bakery,Thai Restaurant
3,M6A,50,1,Clothing Store,Food Court,Restaurant,Bookstore,Toy / Game Store,Furniture / Home Store,Men's Store,American Restaurant,Cosmetics Shop,Mexican Restaurant
5,M9A,4,1,Park,Skating Rink,Baseball Field,Yoga Studio,Dog Run,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market
6,M1B,1,1,Trail,Yoga Studio,Distribution Center,Flower Shop,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,M4X,49,1,Coffee Shop,Chinese Restaurant,Bakery,Italian Restaurant,Pub,Pizza Place,Restaurant,Café,Caribbean Restaurant,Diner
99,M4Y,50,1,Coffee Shop,Restaurant,Japanese Restaurant,Dance Studio,Burger Joint,Men's Store,Gay Bar,Café,Bookstore,Juice Bar
100,M7Y,50,1,Coffee Shop,Hotel,Concert Hall,Café,Restaurant,Steakhouse,Theater,Italian Restaurant,Mediterranean Restaurant,Pizza Place
101,M8Y,6,1,Bank,Park,Tennis Court,Baseball Field,Convenience Store,Construction & Landscaping,Flower Shop,Fish Market,Fish & Chips Shop,Field


#### Cluster 3

In [53]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,NumberOfVenues,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
64,M9N,4,2,Park,Pharmacy,Deli / Bodega,Thai Restaurant,Field,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Distribution Center
90,M1W,13,2,Chinese Restaurant,Pizza Place,Coffee Shop,Fast Food Restaurant,Electronics Store,Sandwich Place,Discount Store,Other Great Outdoors,Bank,Grocery Store


#### Cluster 4

In [54]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,NumberOfVenues,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,M4C,11,3,Breakfast Spot,Bar,Pharmacy,Pet Store,Coffee Shop,Café,Fast Food Restaurant,Sushi Restaurant,Gas Station,Pizza Place


#### Cluster 5

In [55]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[0] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,NumberOfVenues,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,3,4,Food & Drink Shop,Park,Pool,Health Food Store,Diner,Field,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
4,M7A,38,4,Coffee Shop,Sushi Restaurant,Café,Restaurant,Fried Chicken Joint,Bookstore,Smoothie Shop,Burger Joint,Burrito Place,Sandwich Place
18,M1E,20,4,Fast Food Restaurant,Pizza Place,Restaurant,Coffee Shop,Thrift / Vintage Store,Mexican Restaurant,Pharmacy,Sports Bar,Fried Chicken Joint,Supermarket
26,M1H,4,4,Lounge,Gaming Cafe,Construction & Landscaping,Trail,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Falafel Restaurant,Farm,Yoga Studio
35,M4J,1,4,Home Service,Food Court,Flower Shop,Fish Market,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant


## Thank You.

#### <a id="0"></a>**NOTE: Please access the notebook on [NBVIEWER](https://nbviewer.jupyter.org/github/prafful-agrawal/Coursera_Capstone/blob/master/01-Segmenting-and-Clustering-Neighborhoods-in-Toronto.ipynb) to interact with folium maps.**