<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>
<h1 align=right><font size = 4>by Pavel Milchev</font></h1>

<h1> Step 1 - Parse data from site and shape it</h1>

*Install the needed libraries.*

In [48]:
!pip install beautifulsoup4 # used to parse the html
print("BeautifulSoup 4 is installed!")
!pip install html5lib # already installed but was in the example ^_^
print("HTML5 library is installed!")
!pip install lxml # already installed but was in the example ^_^
print("lxml library is installed!")

BeautifulSoup 4 is installed!
HTML5 library is installed!
lxml library is installed!


*Import the needed libraries*

pd.read_html relies on an HTML parser behind the scenes (lxml, or BeautifulSoup together with html5lib)

In [49]:
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

*Take the complete text of the wiki-page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and create a data frame*

**I assume that:** the page contains at least one table and that the first table is the one I need

In [50]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

**I assume that:** the names of the columns from the html page are as follows: Postcode, Borough, Neighbourhood

In [51]:
# rename the column to match the name expected by the assignment
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)
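
The column-name assumption can be checked before renaming, so that a change in the page layout fails loudly instead of silently producing a wrong frame. This is a minimal sketch on a one-row toy frame; the helper `check_columns` and the constant `EXPECTED_COLUMNS` are hypothetical names introduced for illustration only.

```python
import pandas as pd

# Column names the notebook assumes the Wikipedia table to have
EXPECTED_COLUMNS = ['Postcode', 'Borough', 'Neighbourhood']

def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are missing from frame."""
    return [name for name in expected if name not in frame.columns]

# Toy frame standing in for the table returned by pd.read_html
sample = pd.DataFrame({'Postcode': ['M3A'],
                       'Borough': ['North York'],
                       'Neighbourhood': ['Parkwoods']})

missing = check_columns(sample)
if missing:
    raise ValueError('Unexpected table layout, missing columns: {}'.format(missing))

# rename returns a new frame here instead of modifying in place
sample = sample.rename(columns={'Postcode': 'PostalCode'})
print(list(sample.columns))
```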

*Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.*

In [52]:
#Ignore cells with a borough that is Not assigned.
indexNames = df[df['Borough'] == 'Not assigned'].index
df.drop(indexNames , inplace=True)

*If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.*

In [53]:
nonNeighbourhoods = df[df['Neighbourhood'] == 'Not assigned'].index
for k in nonNeighbourhoods:
    df.at[k,'Neighbourhood'] = df.at[k,'Borough']
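
The two cleaning rules above (drop rows without a borough, then fall back to the borough name when the neighbourhood is missing) can also be written with boolean masks instead of index lists and a loop. The sketch below runs on a three-row toy frame; the `M1A` row is made up for illustration.

```python
import pandas as pd

# Toy rows in the same shape as the Wikipedia table (illustrative values)
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M7A', 'M5A'],
    'Borough': ['Not assigned', "Queen's Park", 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Harbourfront'],
})

# keep only the rows that have an assigned borough
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# where the neighbourhood is 'Not assigned', copy the borough name instead
unassigned = toy['Neighbourhood'] == 'Not assigned'
toy.loc[unassigned, 'Neighbourhood'] = toy.loc[unassigned, 'Borough']
print(toy)
```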

***The procedure to fulfill the following requirement:*** More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [64]:
# create a new empty DataFrame to store the combined Neighbourhoods
refined_df = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighbourhood'])
# group by PostalCode to find the Neighbourhoods with the same postal code
grouped = df.groupby(['PostalCode'], sort=False)
for postalCode, postalCode_df in grouped:
    # work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
    postalCode_df = postalCode_df.copy()
    # transform the list of unique neighbourhoods to a comma separated single string
    postalCode_df['Neighbourhood'] = ",".join(postalCode_df['Neighbourhood'].unique())
    # drop duplicated rows based on the Borough column
    postalCode_df = postalCode_df.drop_duplicates(subset='Borough')
    # add the current dataframe with ONE LINE to the final refined_df
    refined_df = pd.concat([refined_df, postalCode_df])
    
# reindex the table starting the rows from 0
refined_df.reset_index( drop=True, inplace = True)
print(refined_df.shape)
refined_df


(103, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout..."
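
The same combination step can be expressed as a single aggregation, which avoids building the result row by row. This is a sketch on toy data rather than the notebook's own method, and it assumes that each postal code belongs to exactly one borough, so grouping by both columns yields one row per postal code.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A', 'M6A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# one row per (PostalCode, Borough); neighbourhoods joined with a comma
combined = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
       .agg(lambda names: ','.join(names.unique()))
       .reset_index()
)
print(combined)
```

With `sort=False` the rows keep the order in which each postal code first appears, which matches the ordering used by the loop-based version.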


In [65]:
# save the transformed data frame for the next steps of the assignment
refined_df.to_csv('toronto_postal_step1.csv', mode='w')
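
By default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference that `index=False` makes, using an in-memory buffer instead of a file so that it can run anywhere; the toy frame is illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                    'Borough': ['North York', 'North York']})

# default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```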


<h1> Step 2 - Add geolocation to the data </h1>

*Install the geocoder library.*

In [44]:
!pip install geocoder
print("The geocoder library is installed!")

The geocoder library is installed!


*Define a getLatLng function that takes a Toronto postal code as its argument and returns its latitude and longitude*

In [38]:
import geocoder # import geocoder
import pandas as pd

def getLatLng(postal_code):
    # initialize your variable to None
    lat_lng_coords = None
    attempts = 0
    address = '{}, Toronto, Ontario'.format(postal_code)
    # loop until you get the coordinates, giving up after 10 attempts
    while lat_lng_coords is None and attempts < 10:
        g = geocoder.arcgis(address)
        lat_lng_coords = g.latlng
        attempts += 1

    # None is returned if every attempt failed
    return lat_lng_coords
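
The bounded retry idea can be tried out without network access by passing the lookup in as a parameter. This is only a sketch: `lookup_with_retries` and `flaky_lookup` are hypothetical names, and `flaky_lookup` is a stand-in for a geocoder that fails twice before answering with fixed coordinates.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=10, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        # optional pause between attempts; 0.0 means retry immediately
        time.sleep(delay_seconds)
    return None

# Hypothetical stand-in for a geocoder: returns None twice, then coordinates
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.65, -79.38]

coords = lookup_with_retries(flaky_lookup, 'M5A, Toronto, Ontario')
print(coords, calls['count'])
```

A caller still has to handle the `None` case, since the helper gives up once `max_attempts` is reached.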

*Read the dataframe from the csv saved at the end of the previous step*

In [39]:
df = pd.read_csv('toronto_postal_step1.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout..."


*Create lists of latitudes and longitudes for each postal code from the data frame*

In [45]:
latList = []
lngList = []

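for i in df.index:
    #print("finding the coords of: ", df.at[i,'PostalCode'])
    coords = getLatLng(df.at[i,'PostalCode'])
    # getLatLng returns None when every attempt fails; store missing values in that case
    if coords is None:
        coords = (None, None)
    latList.append(coords[0])
    lngList.append(coords[1])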

*Add the columns Latitude and Longitude to the data frame*

df['Latitude'] = latList
df['Longitude'] = lngList
df

In [47]:
# save the transformed data frame for the next steps of the assignment
df.to_csv('toronto_postal_step2.csv', mode='w')

<h1> Step 3 - Explore and cluster </h1>
<h2> Step 3.1 - Gather the data for venues in Toronto </h2>

Start by getting the data frame from the last step

In [18]:
import pandas as pd
toronto_data = pd.read_csv('toronto_postal_step2.csv')
toronto_data.drop(columns = ['Unnamed: 0'], inplace = True)
toronto_data

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752420,-79.329242
1,M4A,North York,Victoria Village,43.730600,-79.313265
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.723270,-79.451286
4,M7A,Downtown Toronto,Queen's Park,43.661150,-79.391715
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North",43.653760,-79.510890
99,M4Y,Downtown Toronto,Church and Wellesley,43.666585,-79.381302
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.648690,-79.385440
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout...",43.632835,-79.489550


Let's get the function *getNearbyVenues()* from a previous lab session

Hide some of the sensitive information such as CLIENT_ID and CLIENT_SECRET

In [14]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [22]:
import requests
LIMIT = 100 # limit of number of venues returned by Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
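
The parsing half of the function can be looked at separately from the HTTP request. The sketch below flattens one response of the shape that `getNearbyVenues` indexes into (`response` → `groups[0]` → `items`). The helper name `extract_venue_rows` is hypothetical, the payload is hand-written for illustration, and the tolerance for missing groups or categories is an added assumption, not something the original function does.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into one tuple per venue."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # assumption: a venue without categories gets None instead of raising
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Hand-written payload; the second venue is made up to show the empty-category case
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Arvo',
               'location': {'lat': 43.649963, 'lng': -79.361442},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Example venue without category',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Harbourfront', 43.650295, -79.359166, payload)
for row in rows:
    print(row)
```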

Now let's reduce the amount of data to handle. I will use only the rows whose Borough column contains the string Toronto

In [110]:
reduced_toronto_data = toronto_data[toronto_data['Borough'].str.contains("Toronto")]

Let's gather the venues (up to 100) for each neighbourhood using the Foursquare API

In [111]:
toronto_venues = getNearbyVenues(names=reduced_toronto_data['Neighbourhood'],
                                   latitudes=reduced_toronto_data['Latitude'],
                                   longitudes=reduced_toronto_data['Longitude']
                                  )

Harbourfront
Queen's Park
Ryerson,Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide,King,Richmond
Dovercourt Village,Dufferin
Harbourfront East,Toronto Islands,Union Station
Little Portugal,Trinity
The Danforth West,Riverdale
Design Exchange,Toronto Dominion Centre
Brockton,Exhibition Place,Parkdale Village
The Beaches West,India Bazaar
Commerce Court,Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North,Forest Hill West
High Park,The Junction South
North Toronto West
The Annex,North Midtown,Yorkville
Parkdale,Roncesvalles
Davisville
Harbord,University of Toronto
Runnymede,Swansea
Moore Park,Summerhill East
Chinatown,Grange Park,Kensington Market
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown,St. James Town
First Canadian Place,Underground city

Let's check the size of the resulting dataframe

In [125]:
toronto_venues.shape

(1789, 7)

Let's save it for further analysis, just in case we hit the limits of the Foursquare API for some reason

In [113]:
# save the transformed data frame for the next steps of the assignment
toronto_venues.to_csv('toronto_venues_with_borough-name_Toronto.csv', mode='w')

<h2>3.2 Explore venues in Toronto neighborhoods</h2>

Let's start by getting the data frame from the last step

In [124]:
toronto_venues = pd.read_csv('toronto_venues_with_borough-name_Toronto.csv')
toronto_venues.drop(columns = ['Unnamed: 0'], inplace = True)
toronto_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.650295,-79.359166,The Distillery Historic District,43.650244,-79.359323,Historic Site
1,Harbourfront,43.650295,-79.359166,Distillery Sunday Market,43.650075,-79.361832,Farmers Market
2,Harbourfront,43.650295,-79.359166,Arvo,43.649963,-79.361442,Coffee Shop
3,Harbourfront,43.650295,-79.359166,SOMA chocolatemaker,43.650622,-79.358127,Chocolate Shop
4,Harbourfront,43.650295,-79.359166,Cacao 70,43.650067,-79.360723,Dessert Shop
...,...,...,...,...,...,...,...
1784,Business Reply Mail Processing Centre 969 Eastern,43.648690,-79.385440,Omg! Oh My Gyro!,43.650064,-79.391104,Souvlaki Shop
1785,Business Reply Mail Processing Centre 969 Eastern,43.648690,-79.385440,Sweet Lulu,43.650557,-79.381175,Asian Restaurant
1786,Business Reply Mail Processing Centre 969 Eastern,43.648690,-79.385440,A&W,43.645982,-79.389621,Fast Food Restaurant
1787,Business Reply Mail Processing Centre 969 Eastern,43.648690,-79.385440,The Shore Club,43.645823,-79.386781,Seafood Restaurant


Let's check what that looks like on a real map.

In [128]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.7184038, -79.5181399], zoom_start=11)

def addLabelsToMap(foliumMap, labelName, latitude, longitude):
    # add markers to map
    for lat, lng, label in zip(latitude, longitude, labelName):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(foliumMap)  

addLabelsToMap(map_toronto, 
               latitude = toronto_venues["Venue Latitude"], 
               longitude = toronto_venues["Venue Longitude"], 
               labelName = toronto_venues["Venue"])
map_toronto

Let's check how many venues were returned for each neighborhood

In [129]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,63,63,63,63,63,63
"Brockton,Exhibition Place,Parkdale Village",67,67,67,67,67,67
Business Reply Mail Processing Centre 969 Eastern,100,100,100,100,100,100
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",69,69,69,69,69,69
"Cabbagetown,St. James Town",43,43,43,43,43,43
Central Bay Street,100,100,100,100,100,100
"Chinatown,Grange Park,Kensington Market",76,76,76,76,76,76
Christie,11,11,11,11,11,11
Church and Wellesley,82,82,82,82,82,82


Let's find out how many unique categories there are among all the returned venues

In [130]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 226 unique categories.


<h2>3.3 Analyze each neighborhood</h2>

In [141]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
#fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
#toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()


Unnamed: 0,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,Bagel Shop,...,Train Station,Tram Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [142]:
toronto_onehot.shape

(1789, 226)

Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category

In [143]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Afghan Restaurant,American Restaurant,Antique Shop,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Train Station,Tram Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.03,0.0,0.01,0.0,0.03,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.015873,0.0,0.0,0.0,0.015873,0.0,...,0.0,0.0,0.015873,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.014925,0.014925,0.0,0.0,0.0,0.0,...,0.0,0.0,0.029851,0.0,0.014925,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.02,0.0,0.0,0.0,0.03,0.0,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0,0.0,0.0,0.0,0.014493,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014493
5,"Cabbagetown,St. James Town",0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0
7,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.039474,0.013158,0.0,0.0,0.0,0.0,...,0.0,0.0,0.039474,0.0,0.052632,0.013158,0.0,0.0,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.090909,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.012195,0.012195,0.0,0.0,0.012195,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.012195,0.0,0.012195,0.0,0.0,0.0
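
Why the mean of the one-hot columns equals a category frequency is easiest to see on a small example. The sketch below uses six made-up venues in two hypothetical neighborhoods, `A` and `B`; `dtype=float` is passed so that the dummy columns are numeric regardless of the pandas version.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bakery', 'Park', 'Park', 'Park'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues[['Venue Category']],
                        prefix="", prefix_sep="", dtype=float)

# insert raises ValueError if a column with this name already exists
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# the mean of a 0/1 column is the share of venues in that category
frequencies = onehot.groupby('Neighborhood').mean().reset_index()
print(frequencies)
```

Each row of `frequencies` sums to 1 across the category columns. Using `insert` rather than plain assignment also makes it visible if a venue category happens to be called `Neighborhood`, because the call fails instead of overwriting that column.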


Let's confirm the new size

In [144]:
toronto_grouped.shape

(38, 226)

Let's print each neighborhood along with the top 5 most common venues

In [145]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
                 venue  freq
0          Coffee Shop  0.07
1                 Café  0.06
2                Hotel  0.05
3           Steakhouse  0.04
4  American Restaurant  0.03


----Berczy Park----
          venue  freq
0   Coffee Shop  0.08
1  Cocktail Bar  0.05
2      Beer Bar  0.03
3    Restaurant  0.03
4          Café  0.03


----Brockton,Exhibition Place,Parkdale Village----
                    venue  freq
0             Coffee Shop  0.09
1              Restaurant  0.06
2  Furniture / Home Store  0.06
3                    Café  0.06
4          Sandwich Place  0.04


----Business Reply Mail Processing Centre 969 Eastern----
         venue  freq
0  Coffee Shop  0.10
1   Steakhouse  0.04
2        Hotel  0.04
3          Bar  0.04
4          Pub  0.03


----CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara----
                venue  freq
0         Coffee Shop  0.10
1  Italian Restaurant  0.07
2              

Let's put that into a *pandas* dataframe

First, let's write a function (taken from the last lab) to sort the venues in descending order.

In [147]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
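
To see what the function returns, it can be applied to a single hand-built row. The sketch repeats the function so that it runs on its own; the row holds three of the Berczy Park frequencies printed earlier, with the neighborhood name in the first position, which is why `iloc[1:]` is used to skip it.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry, which holds the neighborhood name
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# one row shaped like a row of toronto_grouped
row = pd.Series({'Neighborhood': 'Berczy Park',
                 'Beer Bar': 0.03,
                 'Coffee Shop': 0.08,
                 'Cocktail Bar': 0.05})

top_two = return_most_common_venues(row, 2)
print(list(top_two))
```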

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [198]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Hotel,Steakhouse,Gastropub,Bar,Gym,Breakfast Spot,Burger Joint,Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Beer Bar,Cheese Shop,Seafood Restaurant,Bakery,Farmers Market,Hotel,Café,Steakhouse
2,"Brockton,Exhibition Place,Parkdale Village",Coffee Shop,Café,Restaurant,Furniture / Home Store,Sandwich Place,Bakery,Bar,Juice Bar,Vegetarian / Vegan Restaurant,Hotel
3,Business Reply Mail Processing Centre 969 Eastern,Coffee Shop,Hotel,Steakhouse,Bar,Gym,Café,Pub,Italian Restaurant,Sushi Restaurant,Seafood Restaurant
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Coffee Shop,Italian Restaurant,Café,Bar,Spa,Electronics Store,Speakeasy,Restaurant,Bakery,Pub
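
The column labels rely on a lookup list for the first three suffixes and a fallback to `th`. That is enough for ten columns, but it would label the 21st column as `21th`. A small helper covers the general case; `ordinal` is a hypothetical name introduced for this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the other teens always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i))
    for i in range(1, num_top_venues + 1)
]
print(columns[:4])
```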


<h2> 3.4. Cluster Neighborhoods </h2>

Run k-means to cluster the neighborhoods into 5 clusters.

In [199]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[:] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
       1, 0, 2, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 4], dtype=int32)
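
Counting the labels shows how unevenly the neighborhoods are spread over the clusters. The sketch below copies the label array printed above into a plain list and tallies it with the standard library.

```python
from collections import Counter

# cluster labels copied from the k-means output above
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
          1, 0, 2, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 4]

sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print('cluster {}: {} neighborhoods'.format(cluster, size))
```

Of the 38 neighborhoods, 32 fall into cluster 0, three into cluster 2, and the remaining three clusters hold one neighborhood each.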

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [200]:
toronto_merged = reduced_toronto_data.rename(columns={'Neighbourhood':'Neighborhood'})
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join the sorted venues (with their cluster labels) onto the Toronto data, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.dropna(axis=0, subset=['Cluster Labels'], inplace =True)
toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166,0.0,Coffee Shop,Bakery,Boat or Ferry,Theater,French Restaurant,Gastropub,Breakfast Spot,Brewery,Spa,Café
4,M7A,Downtown Toronto,Queen's Park,43.66115,-79.391715,0.0,Coffee Shop,Café,Sandwich Place,Gym,Italian Restaurant,Falafel Restaurant,Bookstore,Sushi Restaurant,College Auditorium,Restaurant
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657363,-79.37818,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Japanese Restaurant,Fast Food Restaurant,Middle Eastern Restaurant,Plaza,Theater,Diner
15,M5C,Downtown Toronto,St. James Town,43.65121,-79.375481,0.0,Coffee Shop,Café,Restaurant,Seafood Restaurant,Italian Restaurant,Bakery,Hotel,Beer Bar,Cocktail Bar,Breakfast Spot
19,M4E,East Toronto,The Beaches,43.676531,-79.295425,0.0,Health Food Store,Pub,Trail,Yoga Studio,Electronics Store,Food,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop
20,M5E,Downtown Toronto,Berczy Park,43.64516,-79.373675,0.0,Coffee Shop,Cocktail Bar,Beer Bar,Cheese Shop,Seafood Restaurant,Bakery,Farmers Market,Hotel,Café,Steakhouse
24,M5G,Downtown Toronto,Central Bay Street,43.656091,-79.38493,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Plaza,Bakery,Tea Room,Restaurant,Café,Bookstore,Spa
25,M6G,Downtown Toronto,Christie,43.668781,-79.42071,0.0,Café,Grocery Store,Athletics & Sports,Coffee Shop,Italian Restaurant,Baby Store,Playground,Candy Store,Flower Shop,Food
30,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.6497,-79.382582,0.0,Coffee Shop,Café,Hotel,Steakhouse,Gastropub,Bar,Gym,Breakfast Spot,Burger Joint,Restaurant
31,M6H,West Toronto,"Dovercourt Village,Dufferin",43.665087,-79.438705,0.0,Park,Furniture / Home Store,Bus Line,Bakery,Pet Store,Middle Eastern Restaurant,Smoke Shop,Café,Brazilian Restaurant,Bar


Finally, let's visualize the resulting clusters

In [201]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[43.7184038, -79.5181399], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# cluster is only defined inside the loop below, so print the palette alone
print(rainbow)
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']


<h2> 3.5. Examine Clusters </h2>

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.


#### Cluster 1

In [203]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,0.0,Coffee Shop,Bakery,Boat or Ferry,Theater,French Restaurant,Gastropub,Breakfast Spot,Brewery,Spa,Café
4,Downtown Toronto,0.0,Coffee Shop,Café,Sandwich Place,Gym,Italian Restaurant,Falafel Restaurant,Bookstore,Sushi Restaurant,College Auditorium,Restaurant
9,Downtown Toronto,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Japanese Restaurant,Fast Food Restaurant,Middle Eastern Restaurant,Plaza,Theater,Diner
15,Downtown Toronto,0.0,Coffee Shop,Café,Restaurant,Seafood Restaurant,Italian Restaurant,Bakery,Hotel,Beer Bar,Cocktail Bar,Breakfast Spot
19,East Toronto,0.0,Health Food Store,Pub,Trail,Yoga Studio,Electronics Store,Food,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop
20,Downtown Toronto,0.0,Coffee Shop,Cocktail Bar,Beer Bar,Cheese Shop,Seafood Restaurant,Bakery,Farmers Market,Hotel,Café,Steakhouse
24,Downtown Toronto,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Plaza,Bakery,Tea Room,Restaurant,Café,Bookstore,Spa
25,Downtown Toronto,0.0,Café,Grocery Store,Athletics & Sports,Coffee Shop,Italian Restaurant,Baby Store,Playground,Candy Store,Flower Shop,Food
30,Downtown Toronto,0.0,Coffee Shop,Café,Hotel,Steakhouse,Gastropub,Bar,Gym,Breakfast Spot,Burger Joint,Restaurant
31,West Toronto,0.0,Park,Furniture / Home Store,Bus Line,Bakery,Pet Store,Middle Eastern Restaurant,Smoke Shop,Café,Brazilian Restaurant,Bar


#### Cluster 2

In [204]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
61,Central Toronto,1.0,Bus Line,Swim School,Lawyer,Yoga Studio,Falafel Restaurant,Food & Drink Shop,Food,Flower Shop,Flea Market,Fish Market


#### Cluster 3

In [206]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
73,Central Toronto,2.0,Playground,Gym Pool,Park,Garden,Yoga Studio,Ethiopian Restaurant,Food,Flower Shop,Flea Market,Fish Market
83,Central Toronto,2.0,Playground,Gym,Trail,Park,Electronics Store,Food,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop
91,Downtown Toronto,2.0,Playground,Candy Store,Grocery Store,Park,Yoga Studio,Ethiopian Restaurant,Food,Flower Shop,Flea Market,Fish Market


#### Cluster 4

In [208]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,Central Toronto,3.0,Home Service,Park,Ethiopian Restaurant,Food & Drink Shop,Food,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant


#### Cluster 5

In [210]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
41,East Toronto,4.0,Bus Line,Discount Store,Grocery Store,Park,Yoga Studio,Event Space,Food & Drink Shop,Food,Flower Shop,Flea Market
