<br></br>
<p style="font-family: Arial; font-size:3em; color:red">
    <strong>
        Segmenting and Clustering Neighborhoods in Toronto
    </strong>
</p>
<br></br>
<br></br>

<br></br>
<p style="font-family: Calibri; font-size:2em; padding:0.1em; color:black">
    <em>
        In this notebook I will explore, segment, and cluster the neighborhoods in the city of Toronto. <br><br>
        A Wikipedia page lists the postal codes, boroughs, and neighborhoods needed for this analysis. <br><br>
        To obtain this information I will scrape the Wikipedia page, clean and wrangle the data, <br><br>
        and then read it into a pandas dataframe so that it is in a structured format. <br><br>
        Once the data is in a structured format, I will explore and cluster the neighborhoods. <br><br>
    </em>
</p>
<br></br>

        
          

        



<p style="font-family: Calibri; font-size:1.5em; padding:1em; color:black">  
    Importing Required Libraries
</p>

In [33]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
    Assigning the Wikipedia page content to a BeautifulSoup object
</p>

In [2]:
# url of the website
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(URL)

# assign content of the page to BeautifulSoup object
page_content = BeautifulSoup(page.content, 'html.parser')

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
       Extracting the column headers from the first table and storing them in a list
</p>

In [3]:
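# the first row of the first table holds the column headers
headers = page_content.find_all('table')[0].find('tr')

# keep only the header cells, skipping the whitespace strings between tags
headers_list = [cell.get_text().strip() for cell in headers.find_all(['th', 'td'], recursive=False)]

# print(headers_list)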

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
       Extracting the content of each row and storing it in a list
</p>

In [4]:
rows = page_content.find_all('table')[0].find_all('tr')[1:]
rows_list = []


for row in rows:
    # collect the text of each cell (direct <td>/<th> children) in this row
    row_list = [cell.get_text().strip() for cell in row.find_all(['td', 'th'], recursive=False)]
    rows_list.append(row_list)

# print(rows_list)

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Dropping all rows with Borough = 'Not assigned'
</p>

In [5]:
# keep only the rows whose borough (second cell) has been assigned
rows_list = [row for row in rows_list if row[1] != 'Not assigned']

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Assigning Borough to Neighbourhood if Neighbourhood = 'Not assigned'
</p>

In [6]:
for row in rows_list:
    # fall back to the borough name when no neighbourhood is listed
    if row[2] == 'Not assigned':
        row[2] = row[1]
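
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      The two cleaning steps above can also be expressed in pandas once the rows are in a dataframe. The following is a minimal sketch on placeholder rows (the postal codes, boroughs, and neighbourhoods are made up for illustration), not the method used in this notebook.
</p>

```python
import pandas as pd

# Placeholder rows in the (postal code, borough, neighbourhood) layout; values are illustrative only.
toy = pd.DataFrame(
    [
        ["A1A", "Not assigned", "Not assigned"],
        ["B2B", "Borough B", "Hood Two"],
        ["C3C", "Borough C", "Not assigned"],
    ],
    columns=["Postal Code", "Borough", "Neighbourhood"],
)

# Step 1: drop rows whose borough is 'Not assigned'.
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2: where the neighbourhood is 'Not assigned', fall back to the borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```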

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Creating a pandas dataframe from the table information
</p>

In [7]:
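# build one record per row; the scraped header is spelled 'Neighbourhood',
# so the 'Neighborhood' key used here ends up in a separate fourth column
records = [{'Postal Code': row[0], 'Borough': row[1], 'Neighborhood': row[2]} for row in rows_list]

# DataFrame.append was removed in pandas 2.0, so create the frame in one step
# (dict.fromkeys keeps the column order and drops any duplicate name)
columns = list(dict.fromkeys(headers_list + ['Neighborhood']))
df_postal_codes = pd.DataFrame(records).reindex(columns=columns)

print('Shape of the dataframe: {}'.format(df_postal_codes.shape))
df_postal_codes.head(10)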

Shape of the dataframe: (103, 4)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Neighborhood
0,M3A,North York,,Parkwoods
1,M4A,North York,,Victoria Village
2,M5A,Downtown Toronto,,"Regent Park, Harbourfront"
3,M6A,North York,,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,,"Malvern, Rouge"
7,M3B,North York,,Don Mills
8,M4B,East York,,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,,"Garden District, Ryerson"
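
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      When the header and the row keys use one spelling, a list of rows can be passed to pandas in a single call. This is a small sketch with placeholder rows and column names chosen for illustration; it is not the cell that produced the output above.
</p>

```python
import pandas as pd

# Placeholder rows in the scraped layout (values are made up).
rows = [
    ["A1A", "Borough A", "Hood One"],
    ["B2B", "Borough B", "Hood Two, Hood Three"],
]

# Passing every row at once avoids growing the dataframe row by row.
df = pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighborhood"])

print("Shape of the dataframe: {}".format(df.shape))
```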


<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
       Installing geocoder library
</p>

In [8]:
import sys
!{sys.executable} -m pip install geocoder



<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Extracting latitude & longitude of the neighbourhood based on postal code
</p>

In [None]:
import geocoder # import geocoder

MAX_ATTEMPTS = 5  # stop retrying after a few failures instead of looping forever

latitude = []
longitude = []
# extracting coordinates of neighbourhood based on postal code
for postal_code in df_postal_codes['Postal Code'].values:

    # retry a bounded number of times until coordinates come back
    lat_lng_coords = None
    for _ in range(MAX_ATTEMPTS):
        coords = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = coords.latlng
        if lat_lng_coords is not None:
            break

    # record NaN when the geocoder never returned coordinates
    if lat_lng_coords is None:
        lat_lng_coords = (np.nan, np.nan)

    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])

for postal_code, lat, lng in zip(df_postal_codes['Postal Code'].values, latitude, longitude):
    print('Post Code: {}, Latitude: {}, Longitude: {}'.format(postal_code, lat, lng))

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Due to issues with the geocoder library, the coordinates are read from a CSV file and saved into a pandas dataframe
</p>

In [9]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Merging "df_postal_codes" & "df_lat_lng_coords" dataframes based on "Postal Code"
</p>

In [10]:
df_postal_codes = pd.merge(df_postal_codes, df_lat_lng_coords, on=['Postal Code'])
df_postal_codes = df_postal_codes.sort_values(by=['Postal Code'], ignore_index=True)
df_postal_codes.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,,Woburn,43.770992,-79.216917
4,M1H,Scarborough,,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,,"Birch Cliff, Cliffside West",43.692657,-79.264848
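
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      pd.merge defaults to an inner join, so a postal code without coordinates would be dropped silently. The sketch below, on placeholder data, uses the how, validate, and indicator parameters of pd.merge to list any unmatched codes; the frames and values are hypothetical.
</p>

```python
import pandas as pd

# Placeholder frames; the codes and coordinates are made up.
codes = pd.DataFrame({
    "Postal Code": ["A1A", "B2B", "C3C"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
})
coords = pd.DataFrame({
    "Postal Code": ["B2B", "A1A"],
    "Latitude": [43.1, 43.2],
    "Longitude": [-79.1, -79.2],
})

# how='left' keeps every postal code, validate checks that keys are unique on both sides,
# and indicator=True adds a '_merge' column saying where each row came from.
merged = pd.merge(codes, coords, on="Postal Code", how="left",
                  validate="one_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```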


<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Selecting neighbourhoods only from Toronto
</p>

In [11]:
# filter and reset the index in one step so the result is a new dataframe, not a modified slice
df_postal_codes_toronto = df_postal_codes[df_postal_codes['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_postal_codes_toronto.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,,"North Toronto West, Lawrence Park",43.715383,-79.405678
7,M4S,Central Toronto,,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Installing and Importing "folium" in order to depict neighbourhood locations on the map
</p>

In [12]:
import sys
!{sys.executable} -m pip install folium



In [13]:
import folium

<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Computing the average latitude and longitude of the neighbourhoods in df_postal_codes_toronto
</p>

In [14]:
# take the mean of each column as a plain float (not a one-element Series)
latitude_toronto_average = df_postal_codes_toronto['Latitude'].mean()
longitude_toronto_average = df_postal_codes_toronto['Longitude'].mean()
print('Average Latitude of neighbourhoods in Toronto: {:.6f}, Average Longitude of neighbourhoods in Toronto: {:.6f}'.format(latitude_toronto_average, longitude_toronto_average))

Average Latitude of neighbourhoods in Toronto: 43.667135, Average Longitude of neighbourhoods in Toronto: -79.389873


<p style="font-family: Calibri; font-size:1.5em; padding:2em; color:black">  
      Visualizing neighborhoods in Toronto
</p>

In [15]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude_toronto_average, longitude_toronto_average], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_postal_codes_toronto['Latitude'], 
                                           df_postal_codes_toronto['Longitude'], 
                                           df_postal_codes_toronto['Borough'], 
                                           df_postal_codes_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Next, the Foursquare API will be used to explore and segment the neighborhoods
</p>
<br></br>
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Define Foursquare Credentials and Version
</p>

In [16]:
# The code was removed by Watson Studio for sharing.

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Create a function to get the venues within a given radius of each neighborhood
</p>

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):   
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
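
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      The flattening step inside getNearbyVenues can be tried without calling the API. The sketch below uses a hand-written payload that contains only the keys the function reads; the payload values, the helper name venue_rows, and the 'Uncategorized' fallback label are assumptions made for illustration.
</p>

```python
import pandas as pd

# Hand-written stand-in for one parsed response. Only the keys that
# getNearbyVenues reads are present; every value is made up.
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Venue A",
                               "location": {"lat": 43.65, "lng": -79.38},
                               "categories": [{"name": "Bakery"}]}},
                    {"venue": {"name": "Venue B",
                               "location": {"lat": 43.66, "lng": -79.39},
                               "categories": []}},
                ]
            }
        ]
    }
}


def venue_rows(name, lat, lng, payload):
    """Turn one response payload into flat tuples, one per venue (hypothetical helper)."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumed fallback label for a venue that comes back without a category.
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


columns = ["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",
           "Venue", "Venue Latitude", "Venue Longitude", "Venue Category"]
demo = pd.DataFrame(venue_rows("Hood One", 43.65, -79.38, sample_response), columns=columns)
print(demo)
```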

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Run the above function on each neighborhood and create a new dataframe called df_toronto_venues.
</p>

In [18]:
df_toronto_venues = getNearbyVenues(names=df_postal_codes_toronto['Neighborhood'],
                                   latitudes=df_postal_codes_toronto['Latitude'],
                                   longitudes=df_postal_codes_toronto['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West,  Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Printing the shape and head of df_toronto_venues
</p>

In [19]:
print(df_toronto_venues.shape)
df_toronto_venues.head()

(1630, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Let's check how many venues were returned for each neighborhood
</p>

In [20]:
df_toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,58,58,58,58,58,58
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",17,17,17,17,17,17
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17,17,17,17,17,17
Central Bay Street,63,63,63,63,63,63
Christie,16,16,16,16,16,16
Church and Wellesley,83,83,83,83,83,83
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,37,37,37,37,37,37
Davisville North,8,8,8,8,8,8


<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Let's find out how many unique categories can be curated from all the returned venues
</p>

In [21]:
print('There are {} unique categories.'.format(df_toronto_venues['Venue Category'].nunique()))

There are 236 unique categories.


<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Analyze Each Neighborhood
</p>

In [22]:
# one hot encoding
df_toronto_venues_onehot = pd.get_dummies(df_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
# (one venue category is itself named 'Neighborhood', so this overwrites that dummy column in place)
df_toronto_venues_onehot['Neighborhood'] = df_toronto_venues['Neighborhood'] 

# move the last column to the front; because 'Neighborhood' was overwritten rather than appended,
# the column that moves here is the last category, not 'Neighborhood'
fixed_columns = [df_toronto_venues_onehot.columns[-1]] + list(df_toronto_venues_onehot.columns[:-1])
df_toronto_venues_onehot = df_toronto_venues_onehot[fixed_columns]

df_toronto_venues_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
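
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      A venue category named 'Neighborhood' collides with the neighborhood column when the dummies carry no prefix. One way to avoid the collision, sketched here on placeholder data, is to give the dummy columns a prefix and insert the neighborhood name explicitly. The 'cat' prefix is an assumption for illustration and is not used elsewhere in this notebook.
</p>

```python
import pandas as pd

# Placeholder venues; note the category that is literally called 'Neighborhood'.
venues = pd.DataFrame({
    "Neighborhood": ["Hood One", "Hood One", "Hood Two"],
    "Venue Category": ["Trail", "Neighborhood", "Pub"],
})

# A prefix guarantees that no dummy column is named exactly 'Neighborhood'.
onehot = pd.get_dummies(venues["Venue Category"], prefix="cat", dtype=int)

# Put the neighborhood name in front explicitly instead of relying on column order.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

print(onehot.columns.tolist())
```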


<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      New dataframe size
</p>

In [23]:
df_toronto_venues_onehot.shape

(1630, 236)

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Grouping rows by neighborhood and taking the mean frequency of occurrence of each category
</p>

In [24]:
df_toronto_venues_onehot_grouped = df_toronto_venues_onehot.groupby('Neighborhood').mean().reset_index()
df_toronto_venues_onehot_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.058824,0.058824,0.058824,0.117647,0.117647,0.117647,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.015873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.015873,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.024096,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,...,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,0.0,0.0
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
df_toronto_venues_onehot_grouped.shape

(39, 236)
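
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      Taking the mean of one-hot columns per neighborhood gives the share of that neighborhood's venues in each category, so every row sums to 1. A small sketch on placeholder venues:
</p>

```python
import pandas as pd

# Placeholder venues: four in 'Hood One', two in 'Hood Two'.
venues = pd.DataFrame({
    "Neighborhood": ["Hood One"] * 4 + ["Hood Two"] * 2,
    "Venue Category": ["Bakery", "Bakery", "Pub", "Trail", "Pub", "Pub"],
})

onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of 0/1 indicators = fraction of the neighborhood's venues in each category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```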

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
     Let's print each neighborhood along with the top 5 most common venues
</p>

In [26]:
num_top_venues = 5

for hood in df_toronto_venues_onehot_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = df_toronto_venues_onehot_grouped[df_toronto_venues_onehot_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.10
1        Cocktail Bar  0.05
2  Seafood Restaurant  0.03
3      Farmers Market  0.03
4              Bakery  0.03


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0            Café  0.14
1  Breakfast Spot  0.09
2     Coffee Shop  0.09
3    Intersection  0.05
4   Burrito Place  0.05


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                venue  freq
0  Light Rail Station  0.12
1         Yoga Studio  0.06
2       Auto Workshop  0.06
3          Skate Park  0.06
4          Restaurant  0.06


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0    Airport Lounge  0.12
1   Airport Service  0.12
2  Airport Terminal  0.12
3          Boutique  0.06
4   Harbor / Marina  0.06


----Central Bay Street----
                venue  freq
0         Coffee Sho

<p style="font-family: Calibri; font-size:2em; padding:0.2em; color:black">  
     Let's put that into a pandas dataframe
</p>

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
     First, let's write a function to sort the venues in descending order
</p>

In [27]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
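
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      To see what the function returns, here is a comparable sketch on a single made-up row. It uses Series.nlargest instead of a full sort; the helper name top_categories and the frequencies are assumptions for illustration.
</p>

```python
import pandas as pd

# One made-up row: the neighborhood name followed by category frequencies.
row = pd.Series({"Neighborhood": "Hood One", "Bakery": 0.10, "Pub": 0.50,
                 "Trail": 0.25, "Park": 0.15})


def top_categories(row, n):
    # Skip the first entry (the neighborhood name), then take the n largest frequencies.
    freqs = row.iloc[1:].astype(float)
    return freqs.nlargest(n).index.tolist()


print(top_categories(row, 3))
```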

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
     Now let's create the new dataframe and display the top 10 venues for each neighborhood
</p>

In [28]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for i in range(0, num_top_venues):
    try:
        # the first three columns take the suffixes 'st', 'nd', 'rd'
        columns.append('{}{} Most Common Venue'.format(i+1, indicators[i]))
    except IndexError:
        # everything from the 4th column onward takes 'th'
        columns.append('{}th Most Common Venue'.format(i+1))
        
# create a new dataframe
df_toronto_venues_onehot_grouped_sorted = pd.DataFrame(columns=columns)
df_toronto_venues_onehot_grouped_sorted['Neighborhood'] = df_toronto_venues_onehot_grouped['Neighborhood']

for i in range(0, df_toronto_venues_onehot_grouped.shape[0]):
    df_toronto_venues_onehot_grouped_sorted.iloc[i, 1:] = return_most_common_venues(df_toronto_venues_onehot_grouped.iloc[i, :], num_top_venues)

df_toronto_venues_onehot_grouped_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Farmers Market,Beer Bar,Restaurant,Seafood Restaurant,Basketball Stadium,Beach
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Grocery Store,Furniture / Home Store,Bar,Nightclub,Bakery,Gym,Italian Restaurant
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Gym / Fitness Center,Skate Park,Auto Workshop,Brewery,Burrito Place,Butcher,Comic Shop,Farmers Market,Fast Food Restaurant
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Boat or Ferry,Harbor / Marina,Rental Car Location,Sculpture Garden,Plane,Coffee Shop,Bar
4,Central Bay Street,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Salad Place,Bubble Tea Shop,Burger Joint,Thai Restaurant,Miscellaneous Shop,Japanese Restaurant
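
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      The ordinal column names can also be generated by a small helper that handles any rank, including 11th to 13th and 21st. The function name ordinal is hypothetical and is not part of the notebook above.
</p>

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns)
```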


<p style="font-family: Calibri; font-size:2em; padding:1em; color:black">  
    <strong>
    Cluster Neighborhoods
    </strong>
</p>

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
     Run k-means to cluster the neighborhoods into 5 clusters
</p>

In [29]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

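# keep only the numeric frequency columns (the positional axis argument was removed in pandas 2.0)
df_toronto_venues_onehot_grouped_clustering = df_toronto_venues_onehot_grouped.drop(columns='Neighborhood')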

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_toronto_venues_onehot_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)
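
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      For intuition about what k-means does with the frequency vectors, here is a plain NumPy sketch of Lloyd's algorithm on six made-up 2-D points. It is an illustration only and not scikit-learn's implementation, which by default adds refinements such as k-means++ initialization; the function name lloyd_kmeans and its parameters are hypothetical.
</p>

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate an assignment step and a centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids


# Two well-separated groups of made-up points.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centroids = lloyd_kmeans(points, k=2)
print(labels)
```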

<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
     Creating a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood
</p>

In [31]:
# add clustering labels
df_toronto_venues_onehot_grouped_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# start from a copy so the original Toronto dataframe is left untouched
df_postal_codes_toronto_merged = df_postal_codes_toronto.copy()

# join the cluster labels and top venues onto the Toronto dataframe, which holds latitude/longitude for each neighborhood
df_postal_codes_toronto_merged = df_postal_codes_toronto_merged.join(df_toronto_venues_onehot_grouped_sorted.set_index('Neighborhood'), on='Neighborhood')

df_postal_codes_toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,,The Beaches,43.676357,-79.293031,3,Pub,Health Food Store,Trail,Wine Shop,Dog Run,Dessert Shop,Diner,Discount Store,Distribution Center,Donut Shop
1,M4K,East Toronto,,"The Danforth West, Riverdale",43.679557,-79.352188,1,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Bookstore,Furniture / Home Store,Japanese Restaurant,Caribbean Restaurant,Indian Restaurant,Spa
2,M4L,East Toronto,,"India Bazaar, The Beaches West",43.668999,-79.315572,1,Park,Fast Food Restaurant,Pizza Place,Board Shop,Food & Drink Shop,Brewery,Restaurant,Burrito Place,Italian Restaurant,Pub
3,M4M,East Toronto,,Studio District,43.659526,-79.340923,1,Coffee Shop,American Restaurant,Bakery,Brewery,Café,Gastropub,Gym / Fitness Center,Fish Market,Park,Music Store
4,M4N,Central Toronto,,Lawrence Park,43.72802,-79.38879,3,Business Service,Bus Line,Swim School,Park,Comfort Food Restaurant,College Rec Center,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant
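
<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
      A join like this leaves NaN in 'Cluster Labels' for any neighborhood that has no row in the grouped venue table, and a NaN label cannot be used to index a list of colors. The sketch below, on placeholder frames, shows one way to check for and handle that case; whether it occurs in the Toronto data is not shown above.
</p>

```python
import pandas as pd

# Placeholder frames: 'Hood Three' has no cluster label.
hoods = pd.DataFrame({"Neighborhood": ["Hood One", "Hood Two", "Hood Three"]})
labels = pd.DataFrame({"Neighborhood": ["Hood One", "Hood Two"],
                       "Cluster Labels": [0, 1]})

merged = hoods.join(labels.set_index("Neighborhood"), on="Neighborhood")

# Rows without a match carry NaN, which also turns the column into floats.
unlabelled = merged[merged["Cluster Labels"].isna()]

# Drop those rows and restore the integer dtype so the labels can index a list.
merged = merged.dropna(subset=["Cluster Labels"]).astype({"Cluster Labels": int})
print(merged)
```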


<p style="font-family: Calibri; font-size:1.5em; padding:0.2em; color:black">  
    Let's visualize the resulting clusters
</p>

In [32]:
# create map
map_toronto_clusters = folium.Map(location=[latitude_toronto_average, longitude_toronto_average], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_postal_codes_toronto_merged['Latitude'],
                                  df_postal_codes_toronto_merged['Longitude'], 
                                  df_postal_codes_toronto_merged['Neighborhood'], 
                                  df_postal_codes_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_toronto_clusters)
       
map_toronto_clusters