# Capstone Project - Data Scientist

This notebook is the first submission for the capstone project.

### Organize relevant Imports

In [1]:
import datetime
import folium       # plotting library
import requests     # library to handle requests
import pandas as pd # library for data analysis
import numpy as np  # library to handle data in a vectorized manner
import lxml         # needed for html -> pandas conversion
import json         # library to handle JSON files

import geocoder     # library to geocode postal codes (ArcGIS provider is used below)

# module to convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
# !pip install scikit-learn
from sklearn.cluster import KMeans

print('Successfully Imported')


Successfully Imported


# 1. Task of Week 3

### Fetch Data from the wiki page

In [2]:
# Fetch the html file
url_wiki = r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
result = requests.get(url_wiki)
print("Result of request: " + result.headers['content-type'])

html = result.content
df_list = pd.read_html(html) # convert all tables in the html page to a list of data_frames

df_pbn = df_list[0] # the first table is the one required for this task

print("\nShape: " + str(df_pbn.shape))
print()
df_pbn.head()


Result of request: text/html; charset=UTF-8

Shape: (180, 3)



Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Processing of the Table

1. The dataframe will consist of three columns: **PostalCode, Borough, and Neighborhood**
2. Only process the cells that have an assigned borough. **Ignore cells with a borough that is Not assigned.**

In [3]:
df_pbn =  df_pbn[ df_pbn.Borough != 'Not assigned'] # ignore cells with borough 'Not assigned'

print(" Shape: " + str(df_pbn.shape))
df_pbn.head()

 Shape: (103, 3)


Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be **combined** into one row with the neighborhoods **separated with a comma** as shown in row 11 in the above table.

In [4]:
df_pbn_duplicates = df_pbn[df_pbn.duplicated(subset=['Postal Code'])]

print(" Shape: " + str(df_pbn_duplicates.shape)) # result 0, 3 -> no duplicated Postal Codes
# df_pbn_duplicates.head()

 Shape: (0, 3)
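The check above finds no repeated postal codes, so nothing has to be merged for the current version of the table. If duplicates did occur, the combination described in step 3 could be sketched as follows. The toy data and the helper name `combine_neighbourhoods` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def combine_neighbourhoods(df):
    """Collapse rows sharing a postal code into one row with comma-separated neighbourhoods."""
    return (
        df.groupby(['Postal Code', 'Borough'], as_index=False, sort=False)
          .agg({'Neighbourhood': ', '.join})
    )

# toy data (assumption): M5A appears twice, as in the assignment text
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

combined = combine_neighbourhoods(toy)
print(combined)
```

Grouping on both `Postal Code` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.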


4. If a **cell has a borough** but a **Not assigned neighborhood**, then the **neighborhood will be the same as the borough**.


In [5]:
df_pbn_neighbourhood_not_assigned = df_pbn[ df_pbn.Neighbourhood == 'Not assigned' ]

df_pbn_neighbourhood_not_assigned.shape # result 0, 3 -> no 'Not assigned' neighborhood entries in the dataset!
# df_pbn_neighbourhood_not_assigned.head()

(0, 3)
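No row has an unassigned neighbourhood, so step 4 is a no-op for this dataset. For completeness, a minimal sketch of the fallback on made-up rows (the toy frame is an assumption for illustration):

```python
import pandas as pd

# toy data (assumption): the first row has a borough but no neighbourhood
toy = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# where the neighbourhood is missing, fall back to the borough name
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']
print(toy)
```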

5. **Clean** your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6. In the last cell of your notebook, use the **.shape** method to print the number of rows of your dataframe.

In [6]:
print( "Shape of dataframe restored from wiki: " + str(df_pbn.shape))
print( "Therefore Number of ROWs: \t \t" + str(df_pbn.shape[0]))

# rename() and reset_index() return new frames; reassigning avoids the
# SettingWithCopyWarning that in-place edits on the filtered slice would raise
df_pbn = df_pbn.rename(columns={'Neighbourhood': 'Neighborhood', 'Postal Code': 'PostalCode'})
df_pbn = df_pbn.reset_index()
df_pbn.head()

Shape of dataframe restored from wiki: (103, 3)
Therefore Number of ROWs: 	 	103


Unnamed: 0,index,PostalCode,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


# 2. Task - get the Geospatial Data via Package

In [7]:
latitudes = list()
longitudes = list()

for code in df_pbn['PostalCode']:
    
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
    print(code, g.latlng)
    
    while (g.latlng is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
        print(code, g.latlng)
        
    latlng = g.latlng
    latitudes.append(latlng[0])
    longitudes.append(latlng[1])

M3A [43.75245000000007, -79.32990999999998]
M4A [43.73057000000006, -79.31305999999995]
M5A [43.65512000000007, -79.36263999999994]
M6A [43.72327000000007, -79.45041999999995]
M7A [43.66253000000006, -79.39187999999996]
M9A [43.662630000000036, -79.52830999999998]
M1B [43.811390000000074, -79.19661999999994]
M3B [43.74923000000007, -79.36185999999998]
M4B [43.70718000000005, -79.31191999999999]
M5B [43.65739000000008, -79.37803999999994]
M6B [43.70687000000004, -79.44811999999996]
M9B [43.65034000000003, -79.55361999999997]
M1C [43.78574000000003, -79.15874999999994]
M3C [43.72168000000005, -79.34351999999996]
M4C [43.68970000000007, -79.30681999999996]
M5C [43.65215000000006, -79.37586999999996]
M6C [43.69211000000007, -79.43035999999995]
M9C [43.64857000000006, -79.57824999999997]
M1E [43.765750000000025, -79.17469999999997]
M4E [43.67709000000008, -79.29546999999997]
M5E [43.64536000000004, -79.37305999999995]
M6E [43.68784000000005, -79.45045999999996]
M1G [43.76812000000007, -79.2
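The `while` loop in the cell above retries a lookup until it succeeds, which would spin forever if the service never answered for a given code. A bounded variant could look like the sketch below. `lookup_with_retry` and `flaky_lookup` are hypothetical names; in the notebook the lookup would be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def lookup_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or the attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # brief pause before the next attempt
    return None  # caller decides how to handle a permanently failing lookup

# stand-in for the geocoder call (assumption): fails twice, then succeeds
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.75, -79.33]

coords = lookup_with_retry(flaky_lookup, 'M3A, Toronto, Ontario')
print(coords)
```

Returning `None` after the last attempt lets the caller skip or log the postal code instead of blocking the whole loop.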

In [8]:
# add the geocoded coordinates as new 'Latitude' and 'Longitude' columns
df3 = df_pbn.assign(Latitude=latitudes, Longitude=longitudes)

# drop the helper 'index' column left over from reset_index()
df3 = df3.drop(labels='index', axis='columns')

df3.tail()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113
99,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.64869,-79.38544
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945
102,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.62513,-79.52681


In [9]:
df_pbnll = df3
print('Shape of dataframe: {}'.format(df_pbnll.shape))

Shape of dataframe: (103, 5)


# 3. Cluster the data

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

### 3.1. Get only Data of Toronto


In [10]:
substring = 'Toronto'
df_toronto = df_pbnll[ df_pbnll.Borough.str.contains(substring) ] # filter for only toronto
# df_toronto = df_pbnll # use all rows -> the variable name would then be misleading

print( df_toronto.shape )
df_toronto.head()

(39, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
15,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
19,M4E,East Toronto,The Beaches,43.67709,-79.29547


### 3.2. Show the map with Marked Lat/Lng

In [11]:
lati_mean = df_toronto.Latitude.mean()
longi_mean = df_toronto.Longitude.mean()

#create a map of Toronto
map_toronto = folium.Map(location=[lati_mean, longi_mean],zoom_start=12)
# map_toronto

In [12]:
#add markers
for lat, lng, label in zip( df_toronto.Latitude , df_toronto.Longitude , df_toronto.Neighborhood):
    
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

### 3.3. Create Foursquare-API calls ###
#### 3.3.1 Create Class ####

In [13]:
# 'https://api.foursquare.com/v2/venues/explore?&client_id=...&client_secret=...&v=...&ll=...,...&radius=...&limit=...'
class FS_API_Call:
    def __init__(self):
        self.URL_BASE = r"https://api.foursquare.com/v2/"
        self.CLIENT_ID = r"GNJLRYNKED4TBP323NKZHR0RDNDZN1K4LBME4ZBKPQRTC0MB"
        self.CLIENT_SECRET = r"TTI0U25YNTVRAYD0JDBTEW2STOSMG2VPBRGFZ5WDQR2AAAVP"
        self.VERSION = datetime.datetime.today().strftime ('%Y%m%d')

        self.CREDENTIALS = "&client_id={0}&client_secret={1}&v={2}".format(self.CLIENT_ID, self.CLIENT_SECRET, self.VERSION)

        self.current_url = self.URL_BASE

    # groups: venues, users
    # endpoints: search, explore
    
    def explore(self, group = 'venues/' ):
        self.current_url = self.URL_BASE + group + "explore?" + self.CREDENTIALS
        return self

    def search(self, group = 'venues/' ):
        self.current_url = self.URL_BASE + group + "search?" + self.CREDENTIALS
        return self

    def and_search_for(self, what_to_search_for):
        # append a query term to the current URL and keep the call chainable
        self.current_url = "{0}&query={1}".format(self.current_url, what_to_search_for)
        return self

    def location(self, latitude, longitude):
        self.current_url = "{0}&ll={1},{2}".format(self.current_url, latitude, longitude)
        return self

    def radius(self, radius):
        self.current_url = "{0}&radius={1}".format(self.current_url, radius)
        return self

    def limit(self, limit):
        self.current_url = "{0}&limit={1}".format(self.current_url, limit)
        return self

    def get_request(self):
        return self.current_url
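The class above embeds the client ID and secret directly in the source, which exposes them to anyone who can read the notebook. One possible alternative is to read them from the environment and let `urllib.parse.urlencode` assemble the query string. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are assumptions, not something the notebook or Foursquare prescribes.

```python
import os
from urllib.parse import urlencode

def build_explore_url(lat, lng, radius, limit, version, env=os.environ):
    """Assemble a venues/explore URL with credentials taken from an environment mapping."""
    params = {
        'client_id': env['FOURSQUARE_CLIENT_ID'],          # KeyError if the variable is missing
        'client_secret': env['FOURSQUARE_CLIENT_SECRET'],
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll parameter readable
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# placeholder credentials for demonstration only
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
example_url = build_explore_url(43.66253, -79.39188, 500, 30, '20201122', env=demo_env)
print(example_url)
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.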

In [14]:
new_api_call = FS_API_Call().explore().location(99,88).radius(212).limit(222).get_request()
print(new_api_call)

https://api.foursquare.com/v2/venues/explore?&client_id=GNJLRYNKED4TBP323NKZHR0RDNDZN1K4LBME4ZBKPQRTC0MB&client_secret=TTI0U25YNTVRAYD0JDBTEW2STOSMG2VPBRGFZ5WDQR2AAAVP&v=20201122&ll=99,88&radius=212&limit=222


#### 3.3.2 Test-Request

In [15]:
test_row = df_toronto.iloc[1] # second Toronto row (Queen's Park) serves as the test location
url_request = FS_API_Call().explore().location(test_row.Latitude, test_row.Longitude).radius(500).get_request()
test_result = requests.get( url_request ).json()
# test_result


In [16]:
with open('results.json', 'w') as outfile:
    json.dump(test_result, outfile)

### 3.4. Function for Extracting Category Info from JSON

In [17]:
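# extracts the category name of a venue row
def extract_category(row):
    # the column is called 'Categories' after renaming, 'venue.categories' before
    try:
        categories_list = row['Categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue without categories yields None
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']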


#### 3.4.1 Extract Categories

In [44]:
venues_json = test_result['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues_json) # flatten JSON
# print( nearby_venues.columns )

# only relevant columns
nearby_venues = nearby_venues.loc[:,[ 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(extract_category, axis=1)

nearby_venues.rename( columns= {
                                'venue.name':'Name', 
                                'venue.location.lat' : 'Latitude', 
                                'venue.location.lng' : 'Longitude', 
                                'venue.categories' : 'Categories'}, inplace=True)

# clean columns
# nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,Name,Categories,Latitude,Longitude
0,Queen's Park,Park,43.663946,-79.39218
1,Hart House Theatre,Theater,43.663571,-79.394616
2,Mercatto,Italian Restaurant,43.660391,-79.387664
3,Nando's,Portuguese Restaurant,43.661728,-79.386391
4,Starbucks,Coffee Shop,43.659456,-79.390411


In [45]:
nearby_venues.Categories.unique()

array(['Park', 'Theater', 'Italian Restaurant', 'Portuguese Restaurant',
       'Coffee Shop', 'Fried Chicken Joint', 'Gastropub', 'Burrito Place',
       'Bank', 'Sandwich Place', 'Café', 'Falafel Restaurant',
       'Mediterranean Restaurant'], dtype=object)

### 3.5. Get all neighboring venues

In [47]:
def get_nearby_venues(names, latitudes, longitudes, radius=500):

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = FS_API_Call().explore().location(lat,lng).radius(radius).get_request()
            
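        # make the GET request; a Response is falsy for 4xx/5xx status codes,
        # so retry a limited number of times instead of looping forever
        results = None
        attempts = 0
        while not results and attempts < 5:
            results = requests.get(url)
            attempts += 1
        results.raise_for_status() # fail with a clear error if every attempt was rejected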
        
        results = results.json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append(
            [
                (
                name, 
                lat, 
                lng, 
                venue['venue']['name'], 
                venue['venue']['location']['lat'], 
                venue['venue']['location']['lng'],  
                venue['venue']['categories'][0]['name']) for venue in results
            ]
        )

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
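`get_nearby_venues` reads `venue['venue']['categories'][0]['name']` directly, which would raise an `IndexError` for a venue that carries no categories (the case `extract_category` already guards against). A small guard could be sketched like this. The helper name `first_category_name` and the sample items are made up for illustration.

```python
def first_category_name(item):
    """Return the first category name of an explore item, or None when it has none."""
    categories = item.get('venue', {}).get('categories', [])
    return categories[0]['name'] if categories else None

# made-up items in the shape of the explore response used above
items = [
    {'venue': {'name': 'Tandem Coffee', 'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Place', 'categories': []}},
]

names = [first_category_name(item) for item in items]
print(names)
```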

In [48]:
# get venue information for all neighborhoods in the Toronto boroughs
df_toronto_venues = get_nearby_venues( names     = df_toronto['Neighborhood'],
                                       latitudes = df_toronto['Latitude'],
                                       longitudes= df_toronto['Longitude']
                                      )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

In [22]:
print('The shape of toronto_venues is  ', df_toronto_venues.shape)
df_toronto_venues.head()

The shape of toronto_venues is   (813, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.65512,-79.36264,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.65512,-79.36264,Body Blitz Spa East,43.654735,-79.359874,Spa


In [23]:
# df_toronto_venues.groupby('Neighborhood').count()

In [24]:
print("Number of unique venue categories: {}".format(df_toronto_venues['Venue Category'].nunique()))

Number of unique venue categories: 179


### 3.6. Prepare Data for Clustering

In [25]:
# one hot encoding
df_toronto_venues_hot = pd.get_dummies(df_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

In [26]:
# add neighborhood column back to dataframe -> Neighborhood_Names Like Regent Park
df_toronto_venues_hot['Neighborhood_name'] = df_toronto_venues['Neighborhood'] 
df_toronto_venues_hot.head()

Unnamed: 0,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio,Neighborhood_name
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Regent Park, Harbourfront"
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Regent Park, Harbourfront"
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Regent Park, Harbourfront"
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,"Regent Park, Harbourfront"
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"Regent Park, Harbourfront"


In [27]:
df_toronto_venues_hot.index = df_toronto_venues_hot['Neighborhood_name']
df_toronto_venues_hot.drop(labels=['Neighborhood_name'], axis='columns', inplace=True)
df_toronto_venues_hot.head()

Neighborhood_name,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Baby Store,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
df_toronto_venues_hot.shape

(813, 179)

In [29]:
# mean of the one-hot columns per neighborhood = share of each venue category among its venues
df_toronto_venues_hot_grpd = df_toronto_venues_hot.groupby('Neighborhood_name').mean().reset_index()
df_toronto_venues_hot_grpd

Unnamed: 0,Neighborhood_name,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,...,0.066667,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
4,Central Bay Street,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0
7,"Commerce Court, Victoria Hotel",0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
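The values in this table are relative frequencies: averaging one-hot columns within a neighborhood gives the share of its venues that fall into each category. A toy example, with made-up neighborhoods `A` and `B`, makes the arithmetic visible:

```python
import pandas as pd

# toy data (assumption): neighborhood A has three venues, B has one
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Bakery', 'Bakery', 'Park', 'Park'],
})

# one indicator column per category, then the per-neighborhood mean
one_hot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
one_hot['Neighborhood'] = toy_venues['Neighborhood']
freq = one_hot.groupby('Neighborhood').mean()
print(freq)
```

Two of the three venues in `A` are bakeries, so its `Bakery` share is 2/3, and each row of `freq` sums to 1.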


In [30]:
num_top_venues = 5

for hood in df_toronto_venues_hot_grpd['Neighborhood_name']:
    # print("======  "+hood)
    temp = df_toronto_venues_hot_grpd[df_toronto_venues_hot_grpd['Neighborhood_name'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    # print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    # print('\n')

In [31]:
def get_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [32]:
# create columns according to number of top venues -> 1st, 2nd, 3rd
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood_name']
for ind in np.arange(num_top_venues):
    if(ind<3):
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    else:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create new df and fill the values into the cells
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood_name'] = df_toronto_venues_hot_grpd['Neighborhood_name']

for ind in np.arange(df_toronto_venues_hot_grpd.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = get_most_common_venues(df_toronto_venues_hot_grpd.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood_name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Farmers Market,Coffee Shop,Seafood Restaurant,Cocktail Bar,Beer Bar,Comfort Food Restaurant,Park,Concert Hall,Food Truck,Café
1,"Brockton, Parkdale Village, Exhibition Place",Gift Shop,Supermarket,Coffee Shop,Restaurant,Furniture / Home Store,Italian Restaurant,Pet Store,Cocktail Bar,Ethiopian Restaurant,Café
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Theater,Café,Concert Hall,Restaurant,Japanese Restaurant,Smoke Shop,Salon / Barbershop,Pizza Place,Opera House
3,"CN Tower, King and Spadina, Railway Lands, Har...",Italian Restaurant,Gym / Fitness Center,Restaurant,Park,Sandwich Place,Ramen Restaurant,Peruvian Restaurant,New American Restaurant,Mexican Restaurant,Men's Store
4,Central Bay Street,Coffee Shop,Plaza,Poke Place,Shopping Mall,Italian Restaurant,Miscellaneous Shop,Seafood Restaurant,Sandwich Place,Japanese Restaurant,Ramen Restaurant


### 3.10. Clustering

In [33]:
# set number of clusters
kclusters = 8

df_toronto_2cluster = df_toronto_venues_hot_grpd.drop('Neighborhood_name', axis='columns')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_toronto_2cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[:]

array([0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3, 0, 2, 0, 0, 3, 0, 0, 1, 0, 6, 5,
       0, 4, 4, 0, 3, 0, 0, 0, 0, 0, 4, 0, 7, 3, 0, 0])
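A quick way to judge this result is to count how many neighborhoods land in each cluster. The sketch below copies the label array printed above and tallies it with `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# labels copied from the kmeans.labels_ output above
labels = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3, 0, 2, 0, 0, 3, 0, 0, 1, 0, 6, 5,
          0, 4, 4, 0, 3, 0, 0, 0, 0, 0, 4, 0, 7, 3, 0, 0]

# number of neighborhoods per cluster label
cluster_sizes = dict(sorted(Counter(labels).items()))
print(cluster_sizes)
```

Most neighborhoods fall into cluster 0 and several clusters hold a single neighborhood, which suggests that a smaller `kclusters` might describe this data equally well.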

In [34]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_toronto_merged = df_toronto

# merge sorted with DT_data to add latitude/longitude for each neighborhood
df_toronto_merged = df_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood_name'), on='Neighborhood')

df_toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,4.0,Coffee Shop,Breakfast Spot,Yoga Studio,Spa,Bakery,Distribution Center,Electronics Store,Event Space,Food Truck,Italian Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188,4.0,Coffee Shop,Sandwich Place,Portuguese Restaurant,Café,Burrito Place,Bank,Fried Chicken Joint,Mediterranean Restaurant,Theater,Gastropub
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804,0.0,Café,Ramen Restaurant,Theater,Clothing Store,Coffee Shop,College Rec Center,Diner,Sandwich Place,Burger Joint,Electronics Store
15,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587,0.0,Gastropub,Cosmetics Shop,Coffee Shop,Restaurant,Japanese Restaurant,American Restaurant,Beer Bar,Department Store,Creperie,Diner
19,M4E,East Toronto,The Beaches,43.67709,-79.29547,7.0,Health Food Store,Trail,Neighborhood,Pub,Yoga Studio,Distribution Center,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


### 3.11. Create the Map + Colorize Labeled Locations

In [35]:
# create map
map_clusters = folium.Map(location=[df_toronto_merged.Latitude.mean(), df_toronto_merged.Longitude.mean()], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

df_toronto_merged.dropna(subset=['Cluster Labels'],axis='index', inplace=True)
df_toronto_merged['Cluster Labels'] = df_toronto_merged['Cluster Labels'].astype(int)

# add markers to the map
for lat, lon, poi, cluster in zip(df_toronto_merged.Latitude, df_toronto_merged.Longitude, df_toronto_merged.Neighborhood, df_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [36]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 0, df_toronto_merged.columns[[2] + list(range(6, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,"Garden District, Ryerson",Café,Ramen Restaurant,Theater,Clothing Store,Coffee Shop,College Rec Center,Diner,Sandwich Place,Burger Joint,Electronics Store
15,St. James Town,Gastropub,Cosmetics Shop,Coffee Shop,Restaurant,Japanese Restaurant,American Restaurant,Beer Bar,Department Store,Creperie,Diner
20,Berczy Park,Farmers Market,Coffee Shop,Seafood Restaurant,Cocktail Bar,Beer Bar,Comfort Food Restaurant,Park,Concert Hall,Food Truck,Café
24,Central Bay Street,Coffee Shop,Plaza,Poke Place,Shopping Mall,Italian Restaurant,Miscellaneous Shop,Seafood Restaurant,Sandwich Place,Japanese Restaurant,Ramen Restaurant
30,"Richmond, Adelaide, King",Café,Coffee Shop,American Restaurant,Seafood Restaurant,Restaurant,Gym,Hotel,Plaza,Pizza Place,Monument / Landmark
36,"Harbourfront East, Union Station, Toronto Islands",Hotel,Coffee Shop,Plaza,Park,Aquarium,Neighborhood,Supermarket,Salad Place,Lake,Bubble Tea Shop
37,"Little Portugal, Trinity",Bar,Asian Restaurant,Cocktail Bar,Pizza Place,Brewery,Korean Restaurant,Record Shop,Coffee Shop,Yoga Studio,Beer Store
42,"Toronto Dominion Centre, Design Exchange",Coffee Shop,Restaurant,Café,Japanese Restaurant,Gym / Fitness Center,Gym,Pub,Pizza Place,Museum,Hotel
43,"Brockton, Parkdale Village, Exhibition Place",Gift Shop,Supermarket,Coffee Shop,Restaurant,Furniture / Home Store,Italian Restaurant,Pet Store,Cocktail Bar,Ethiopian Restaurant,Café
47,"India Bazaar, The Beaches West",Park,Pub,Sandwich Place,Liquor Store,Restaurant,Fast Food Restaurant,Italian Restaurant,Fish & Chips Shop,Steakhouse,Sushi Restaurant


In [37]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 1, df_toronto_merged.columns[[2] + list(range(6, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
61,Lawrence Park,Bus Line,Swim School,Yoga Studio,Donut Shop,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


In [38]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 2, df_toronto_merged.columns[[2] + list(range(6, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,"Forest Hill North & West, Forest Hill Road Park",Park,French Restaurant,Yoga Studio,Dog Run,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


In [39]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 3, df_toronto_merged.columns[[2] + list(range(6, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25,Christie,Café,Grocery Store,Coffee Shop,Italian Restaurant,Playground,Athletics & Sports,Baby Store,Candy Store,Falafel Restaurant,Event Space
31,"Dufferin, Dovercourt Village",Grocery Store,Park,Skating Rink,Bank,Music Venue,Smoke Shop,Brazilian Restaurant,Middle Eastern Restaurant,Pizza Place,Furniture / Home Store
41,"The Danforth West, Riverdale",Cosmetics Shop,Grocery Store,Park,Bus Line,Business Service,Discount Store,Coffee Shop,Ice Cream Shop,Intersection,Dumpling Restaurant
69,"High Park, The Junction South",Convenience Store,Residential Building (Apartment / Condo),Bowling Alley,Sandwich Place,Park,Dance Studio,Dumpling Restaurant,Falafel Restaurant,Creperie,Event Space
91,Rosedale,Park,Playground,Shop & Service,Bike Trail,Tennis Court,Yoga Studio,Dog Run,Event Space,Ethiopian Restaurant,Electronics Store


In [40]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 4, df_toronto_merged.columns[[2] + list(range(6, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Regent Park, Harbourfront",Coffee Shop,Breakfast Spot,Yoga Studio,Spa,Bakery,Distribution Center,Electronics Store,Event Space,Food Truck,Italian Restaurant
4,"Queen's Park, Ontario Provincial Government",Coffee Shop,Sandwich Place,Portuguese Restaurant,Café,Burrito Place,Bank,Fried Chicken Joint,Mediterranean Restaurant,Theater,Gastropub
86,"Summerhill West, Rathnelly, South Hill, Forest...",Light Rail Station,Coffee Shop,Supermarket,Liquor Store,Dumpling Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant


In [41]:
df_toronto_merged.loc[df_toronto_merged['Cluster Labels'] == 5, df_toronto_merged.columns[[2] + list(range(6, df_toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
73,"North Toronto West, Lawrence Park",Park,Gym Pool,Playground,Yoga Studio,Distribution Center,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
