# Overview:

#### This notebook scrapes postal code data for the city of Toronto from a Wikipedia page.


#### After the location data is explored, venue data is added from Foursquare.


#### The venue categories are one-hot encoded, averaged per neighborhood, and then clustered with k-means according to the types of venues.

# Creating and Exploring DataFrame

In [41]:
# imports

import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

from bs4 import BeautifulSoup

import json

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

import requests
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium

print("Done")

Done


### Fetching Data From Wikipedia Page

In [2]:
url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text #wikipedia url
soup = BeautifulSoup(url,'lxml')
#print(soup.prettify())
t = soup.find('table',{'class':'wikitable sortable'}) #finding the table in the page

Postal_Code = [] #creating three columns
Borough = []
Neighborhood = []

for r in t.find_all("tr"): #parsing through the wikipedia table and adding data in lists
    cells = r.find_all("td")
    if len(cells) < 3: #skipping the header row, which has no <td> cells
        continue
    Postal_Code.append(cells[0].text.strip())
    Borough.append(cells[1].text.strip())
    Neighborhood.append(cells[2].text.strip())
    
df = pd.DataFrame() #creating dataframe and adding columns
df["Postal Code"] = Postal_Code
df["Borough"] = Borough
df["Neighborhood"] = Neighborhood
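The row-by-row parsing above can be illustrated offline. The following is only a sketch: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without network access, and the HTML fragment and the `TableRowParser` class are made up for illustration. It shows the same idea of keeping only rows that contain `<td>` cells and stripping the trailing newline from each cell.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Made-up HTML fragment shaped like the table described above.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
parsed_rows = parser.rows
print(parsed_rows)
```

Skipping header rows explicitly, instead of relying on an exception, makes it easier to notice when the page layout changes.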

In [3]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Cleaning Dataframe

In [4]:
df = df[df["Borough"] != 'Not assigned'].reset_index() #removing all not assigned boroughs

In [5]:
df.drop("index", axis=1, inplace=True) #dropping the old index column left behind by reset_index()

In [6]:
df[df["Neighborhood"] == "Not assigned"] #checking for not assigned neighborhoods

Unnamed: 0,Postal Code,Borough,Neighborhood


In [7]:
df.sort_values(by=["Postal Code"], inplace = True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae


In [8]:
df.shape

(103, 3)

### Getting Co-ordinates

In [9]:
!wget -q -O 'toronto.csv' http://cocl.us/Geospatial_data
print("Done")

Done


In [10]:
coordinates = pd.read_csv('toronto.csv')
coordinates.sort_values(by=["Postal Code"],inplace=True)
coordinates.reset_index(drop=True,inplace=True)
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
#both dataframes are sorted by postal code and re-indexed so that the rows line up when latitude and longitude are added by position

df.reset_index(drop=True,inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [12]:
#adding latitude and longitude columns

df["Latitude"] = coordinates["Latitude"]
df["Longitude"] = coordinates["Longitude"]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
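Assigning the coordinate columns by position only works while both frames hold exactly the same postal codes in the same order. A key-based merge avoids that dependency. The sketch below uses small toy frames (`areas` and `coords` are illustrative names, with values copied from the tables above) and is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Toy frames; the row order of `coords` is deliberately shuffled to show
# that a merge does not depend on it.
areas = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})
coords = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# validate="one_to_one" raises if a postal code is duplicated on either side.
merged = areas.merge(coords, on="Postal Code", how="left", validate="one_to_one")

# Any postal code without a coordinate match would show up as NaN here.
missing = merged["Latitude"].isna().sum()
print(merged)
print("rows without coordinates:", missing)
```

A left merge keeps every row of the postal code table, so a non-zero `missing` count would flag codes absent from the coordinates file.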


In [13]:
df.shape

(103, 5)

### Exploring Dataset

In [14]:
#checking the number of postal districts in each borough

pd.DataFrame(df.groupby("Borough").count()["Postal Code"])

Borough,Postal Code
Central Toronto,9
Downtown Toronto,19
East Toronto,5
East York,5
Etobicoke,12
Mississauga,1
North York,24
Scarborough,17
West Toronto,6
York,5


#### North York has the most postal districts (24), so this borough is used for the exploration below

In [15]:
#checking all the postal districts in North York

ny_data = df[df["Borough"] == "North York"]
ny_data.reset_index(drop=True,inplace=True)
ny_data

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714
4,M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493
5,M2N,North York,"Willowdale, Willowdale East",43.77012,-79.408493
6,M2P,North York,York Mills West,43.752758,-79.400049
7,M2R,North York,"Willowdale, Willowdale West",43.782736,-79.442259
8,M3A,North York,Parkwoods,43.753259,-79.329656
9,M3B,North York,Don Mills,43.745906,-79.352188


# Creating Maps

In [16]:
#Getting geographic data for Toronto

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="Exploring_Toronto")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [17]:
#Creating Map for Toronto

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto

#### The map shows that the postal areas in Downtown Toronto are more closely packed than those in the other boroughs

In [20]:
# Getting geographic data for Downtown Toronto

geolocator = Nominatim(user_agent="Exploring_Toronto")
location_dt = geolocator.geocode("Downtown Toronto, Toronto, Ontario")
print("The geographical coordinates of Downtown Toronto are {}, {}.".format(location_dt.latitude,location_dt.longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [22]:
# Creating Map for Downtown Toronto

dt_df = df[df["Borough"] == "Downtown Toronto"]

map_dt = folium.Map(location=[location_dt.latitude, location_dt.longitude], zoom_start=13)

# add markers to map
for lat, lng, label in zip(dt_df['Latitude'], dt_df['Longitude'], dt_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt)
    
map_dt

#### The neighborhoods in Downtown Toronto are more clearly visible now

#### Moving on to North York, the borough with the most postal areas

In [18]:
# Getting geographic data for North York, Toronto

address_ny = "North York, Toronto, Ontario"

geolocator = Nominatim(user_agent="Exploring_Toronto")
location_ny = geolocator.geocode(address_ny)
latitude_ny = location_ny.latitude
longitude_ny = location_ny.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude_ny, longitude_ny))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


In [19]:
#Creating Map for North York

map_ny = folium.Map(location=[latitude_ny, longitude_ny], zoom_start=12)

# add markers to map
for lat, lng, label in zip(ny_data['Latitude'], ny_data['Longitude'], ny_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ny)
    
map_ny

# Exploring The Dataset Using Foursquare

In [23]:
# Defining Foursquare Credentials

CLIENT_ID = '' # Foursquare ID
CLIENT_SECRET = '' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [24]:
# The first neighborhood's name

ny_data.loc[0,"Neighborhood"]

'Hillcrest Village'

In [25]:
#Details of first Neighborhood

neighbor_lat = ny_data.loc[0, 'Latitude'] # neighborhood latitude value
neighbor_long = ny_data.loc[0, 'Longitude'] # neighborhood longitude value

neighbor_name = ny_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbor_name, 
                                                               neighbor_lat, 
                                                               neighbor_long))

Latitude and longitude values of Hillcrest Village are 43.8037622, -79.3634517.


### Now getting data from Foursquare

In [36]:
LIMIT = 100
radius = 1000

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighbor_lat,
    neighbor_long,
    radius,
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=43.8037622,-79.3634517&radius=1000&limit=100'
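Building the request URL with string formatting works, but a parameter dictionary is easier to read and percent-encodes each value. The sketch below uses the standard library's `urlencode`; the helper name `build_explore_url` and the placeholder credentials are assumptions for illustration. Note that displaying the full URL also displays the credentials, so it is safer to print only the non-secret parameters.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore URL with every parameter percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials, not real ones.
example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605",
                                43.8037622, -79.3634517, 1000, 100)

# Show only the part of the query string that carries no secrets.
print(example_url.split("&v=")[1])
```

The comma in the `ll` value is encoded as `%2C`, which the server decodes back to a comma.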

In [37]:
results = requests.get(url).json()

In [38]:
results

{'meta': {'code': 200, 'requestId': '5f2d6bb61db5810ffba7bc79'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 21,
  'suggestedBounds': {'ne': {'lat': 43.81276220900001,
    'lng': -79.35100467075661},
   'sw': {'lat': 43.79476219099999, 'lng': -79.37589872924339}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd9842be914a593adbd56fa',
       'name': 'Tastee',
       'location': {'address': '3913 Don Mills Rd.',
        'crossStreet': 'at Cliffwood Rd.',
        'lat': 43.80772211146167,
        'lng': -79.35679781099806,
        'labeledLatLngs': [{'label': 'display',
      

In [29]:
# function that extracts the category of the venue
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

### Extracting Details of Venues from the JSON response into a pandas DataFrame

In [39]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Tastee,Bakery,43.807722,-79.356798
1,고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap,Korean Restaurant,43.798391,-79.369187
2,Cummer Park,Park,43.799564,-79.371175
3,Galati,Grocery Store,43.797831,-79.36941
4,Tim Hortons,Coffee Shop,43.798945,-79.369644


In [42]:
nearby_venues.shape

(21, 4)
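The request above uses `radius = 1000`, which is assumed here to be a distance in metres around the neighborhood's coordinates. As a rough sanity check, the great-circle distance between the neighborhood centre and a returned venue can be computed with the haversine formula. The helper `haversine_m` is illustrative, and the coordinates are those of Hillcrest Village and the venue "Tastee" from the output above.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


# Hillcrest Village centre and the venue "Tastee".
distance = haversine_m(43.8037622, -79.3634517,
                       43.80772211146167, -79.35679781099806)
print("distance in metres:", round(distance))
print("within radius:", distance <= 1000)
```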

In [44]:
nearby_venues.groupby("categories").count()

categories,name,lat,lng
Bakery,1,1,1
Bank,1,1,1
Chinese Restaurant,1,1,1
Coffee Shop,2,2,2
Convenience Store,1,1,1
Fast Food Restaurant,1,1,1
Grocery Store,1,1,1
Ice Cream Shop,1,1,1
Intersection,1,1,1
Korean Restaurant,1,1,1


#### Counting all places that serve food

In [50]:
places = ['Bakery','Chinese Restaurant','Coffee Shop','Fast Food Restaurant','Ice Cream Shop','Korean Restaurant','Pizza Place','Restaurant','Sandwich Place']

nearby_venues[nearby_venues["categories"].isin(places)].shape[0]

10

### Doing the same for all neighborhoods in North York

In [51]:
# Function that repeats the Foursquare request for every neighborhood and collects the venues in one DataFrame

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
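The function above indexes straight into the response, so a failed request or a venue without categories would raise an exception partway through the loop. A more defensive extraction step could look like the sketch below. The helper `extract_venues` and the `fake_payload` dictionary are hypothetical; the payload only imitates the shape of the response shown earlier.

```python
def extract_venues(payload, name, lat, lng):
    """Turn one explore-style response dict into a list of flat row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hypothetical payload in the same shape as the response shown earlier.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tastee",
               "location": {"lat": 43.807722, "lng": -79.356798},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Unnamed spot",
               "location": {"lat": 43.8, "lng": -79.36},
               "categories": []}},
]}]}}

rows = extract_venues(fake_payload, "Hillcrest Village", 43.803762, -79.363452)
empty = extract_venues({"meta": {"code": 400}}, "Nowhere", 0.0, 0.0)
print(rows)
print(empty)
```

Keeping the extraction separate from the HTTP call also makes it possible to test it without credentials.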

In [52]:
ny_venues = getNearbyVenues(names=ny_data['Neighborhood'],
                                   latitudes=ny_data['Latitude'],
                                   longitudes=ny_data['Longitude']
                                  )
ny_venues

Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale, Willowdale East
York Mills West
Willowdale, Willowdale West
Parkwoods
Don Mills
Don Mills
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Bedford Park, Lawrence Manor East
Lawrence Manor, Lawrence Heights
Glencairn
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Humberlea, Emery


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Tastee,43.807722,-79.356798,Bakery
1,Hillcrest Village,43.803762,-79.363452,고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap,43.798391,-79.369187,Korean Restaurant
2,Hillcrest Village,43.803762,-79.363452,Cummer Park,43.799564,-79.371175,Park
3,Hillcrest Village,43.803762,-79.363452,Galati,43.797831,-79.369410,Grocery Store
4,Hillcrest Village,43.803762,-79.363452,Tim Hortons,43.798945,-79.369644,Coffee Shop
5,Hillcrest Village,43.803762,-79.363452,TD Canada Trust,43.798466,-79.368832,Bank
6,Hillcrest Village,43.803762,-79.363452,Subway,43.799059,-79.368946,Sandwich Place
7,Hillcrest Village,43.803762,-79.363452,Pizza Pizza,43.799079,-79.369449,Pizza Place
8,Hillcrest Village,43.803762,-79.363452,New York Fries,43.803664,-79.363905,Fast Food Restaurant
9,Hillcrest Village,43.803762,-79.363452,Shoppers Drug Mart,43.798341,-79.369804,Pharmacy


In [54]:
print(ny_venues.shape)
ny_venues.head()

(639, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Tastee,43.807722,-79.356798,Bakery
1,Hillcrest Village,43.803762,-79.363452,고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap,43.798391,-79.369187,Korean Restaurant
2,Hillcrest Village,43.803762,-79.363452,Cummer Park,43.799564,-79.371175,Park
3,Hillcrest Village,43.803762,-79.363452,Galati,43.797831,-79.36941,Grocery Store
4,Hillcrest Village,43.803762,-79.363452,Tim Hortons,43.798945,-79.369644,Coffee Shop


In [55]:
ny_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor, Wilson Heights, Downsview North",31,31,31,31,31,31
Bayview Village,15,15,15,15,15,15
"Bedford Park, Lawrence Manor East",44,44,44,44,44,44
Don Mills,74,74,74,74,74,74
Downsview,68,68,68,68,68,68
"Fairview, Henry Farm, Oriole",44,44,44,44,44,44
Glencairn,37,37,37,37,37,37
Hillcrest Village,21,21,21,21,21,21
Humber Summit,10,10,10,10,10,10
"Humberlea, Emery",10,10,10,10,10,10


In [57]:
print('There are {} unique categories.'.format(len(ny_venues['Venue Category'].unique())))

There are 156 unique categories.


# Clustering Data

### One-hot encoding

In [58]:
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ny_onehot['Neighborhood'] = ny_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

ny_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,Automotive Shop,...,Theater,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Video Game Store,Vietnamese Restaurant,Wings Joint,Women's Store,Yoga Studio
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [59]:
ny_onehot.shape

(639, 157)

In [60]:
ny_grouped = ny_onehot.groupby('Neighborhood').mean().reset_index()
ny_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,Automotive Shop,...,Theater,Toy / Game Store,Trail,Train Station,Turkish Restaurant,Video Game Store,Vietnamese Restaurant,Wings Joint,Women's Store,Yoga Studio
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.022727,0.0,0.0
3,Don Mills,0.0,0.0,0.013514,0.013514,0.0,0.027027,0.013514,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.0,0.013514,0.0
4,Downsview,0.0,0.014706,0.014706,0.0,0.0,0.0,0.029412,0.0,0.0,...,0.0,0.0,0.0,0.0,0.029412,0.0,0.073529,0.0,0.0,0.0
5,"Fairview, Henry Farm, Oriole",0.0,0.0,0.022727,0.0,0.0,0.022727,0.0,0.0,0.0,...,0.022727,0.022727,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0
6,Glencairn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Hillcrest Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Humber Summit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Humberlea, Emery",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
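Taking the mean of the one-hot columns per neighborhood gives the share of that neighborhood's venues that fall in each category. For example, Bathurst Manor has 31 venues, so its Trail value of 0.032258 corresponds to 1/31. The sketch below reproduces the idea on a toy frame whose neighborhood names and venues are made up.

```python
import pandas as pd

# Toy venue list; the names are illustrative, not taken from the real results.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Coffee Shop", "Coffee Shop", "Bank", "Park", "Park"],
})

onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of a 0/1 column is the fraction of rows equal to 1.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts become comparable before clustering.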


#### Returning most common venues

In [61]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [62]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions beyond the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Convenience Store,Pharmacy,Mobile Phone Shop,Sandwich Place,Bridal Shop,Restaurant,Pizza Place,Pet Store
1,Bayview Village,Grocery Store,Gas Station,Bank,Japanese Restaurant,Park,Dog Run,Chinese Restaurant,Trail,Café,Skating Rink
2,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Bank,Fast Food Restaurant,Pizza Place,Restaurant,Sandwich Place,Pet Store,Breakfast Spot,Skating Rink
3,Don Mills,Restaurant,Coffee Shop,Japanese Restaurant,Gym,Burger Joint,Bank,Café,Pizza Place,Supermarket,Asian Restaurant
4,Downsview,Vietnamese Restaurant,Coffee Shop,Pizza Place,Hotel,Park,Gas Station,Grocery Store,Chinese Restaurant,Fast Food Restaurant,Liquor Store
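The column labels above are built from a three-item suffix list with a fallback to "th". That is enough for ten columns, but it would label position 21 as "21th". A small helper, sketched here under the hypothetical name `ordinal`, handles any position.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```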


### Finally, Clustering Data according to types of venues

#### Using k-means clustering to divide data into 5 clusters

In [63]:
# set number of clusters
kclusters = 5

ny_grouped_clustering = ny_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 3, 3, 3, 3, 3, 3, 0, 4, 2], dtype=int32)
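The number of clusters is fixed at 5 here without a comparison against other values. One common heuristic is to fit k-means for several values of k and look at how the inertia (the within-cluster sum of squared distances) falls. The sketch below does this on synthetic two-dimensional data, which stands in for the frequency table and is not drawn from it.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the frequency table: three well-separated blobs.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

On real data the bend in this curve is often less clear, so the result is best read as a guide rather than a rule.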

#### Adding Top-10 most common venues

In [64]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ny_merged = ny_data

# joining the cluster labels and top venues onto ny_data, which holds the coordinates
ny_merged = ny_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

ny_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,0,Coffee Shop,Park,Pharmacy,Ice Cream Shop,Convenience Store,Chinese Restaurant,Recreation Center,Residential Building (Apartment / Condo),Restaurant,Sandwich Place
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,3,Coffee Shop,Clothing Store,Restaurant,Juice Bar,Bank,Bakery,Japanese Restaurant,Sandwich Place,Fast Food Restaurant,Electronics Store
2,M2K,North York,Bayview Village,43.786947,-79.385975,3,Grocery Store,Gas Station,Bank,Japanese Restaurant,Park,Dog Run,Chinese Restaurant,Trail,Café,Skating Rink
3,M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714,1,Park,Pool,Dessert Shop,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store,Diner,Dim Sum Restaurant
4,M2M,North York,"Willowdale, Newtonbrook",43.789053,-79.408493,3,Korean Restaurant,Café,Pizza Place,Park,Bus Station,Coffee Shop,Middle Eastern Restaurant,Bank,Shopping Mall,Diner


### Visualizing Clusters

In [65]:
# create map
map_clusters = folium.Map(location=[latitude_ny, longitude_ny], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examining Clusters

#### Cluster 1

In [66]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,0,Coffee Shop,Park,Pharmacy,Ice Cream Shop,Convenience Store,Chinese Restaurant,Recreation Center,Residential Building (Apartment / Condo),Restaurant,Sandwich Place
7,North York,0,Pharmacy,Bank,Convenience Store,Coffee Shop,Park,Pizza Place,Eastern European Restaurant,Bus Line,Bakery,Dumpling Restaurant
8,North York,0,Park,Bus Stop,Pharmacy,Convenience Store,Shopping Mall,Chinese Restaurant,Road,Café,Caribbean Restaurant,Pizza Place
11,North York,0,Bank,Coffee Shop,Convenience Store,Pharmacy,Mobile Phone Shop,Sandwich Place,Bridal Shop,Restaurant,Pizza Place,Pet Store
21,North York,0,Coffee Shop,Convenience Store,Athletics & Sports,Pizza Place,Dim Sum Restaurant,Bakery,Chinese Restaurant,Mediterranean Restaurant,Gas Station,Park


#### Cluster 2

In [67]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,North York,1,Park,Pool,Dessert Shop,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store,Diner,Dim Sum Restaurant


#### Cluster 3

In [68]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,North York,2,Convenience Store,Auto Workshop,Discount Store,Business Service,Storage Facility,Bakery,Intersection,Gas Station,Golf Course,Park


#### Cluster 4

In [69]:
ny_merged.loc[ny_merged['Cluster Labels'] == 3, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,3,Coffee Shop,Clothing Store,Restaurant,Juice Bar,Bank,Bakery,Japanese Restaurant,Sandwich Place,Fast Food Restaurant,Electronics Store
2,North York,3,Grocery Store,Gas Station,Bank,Japanese Restaurant,Park,Dog Run,Chinese Restaurant,Trail,Café,Skating Rink
4,North York,3,Korean Restaurant,Café,Pizza Place,Park,Bus Station,Coffee Shop,Middle Eastern Restaurant,Bank,Shopping Mall,Diner
5,North York,3,Coffee Shop,Bubble Tea Shop,Ramen Restaurant,Pizza Place,Japanese Restaurant,Korean Restaurant,Sandwich Place,Restaurant,Sushi Restaurant,Fast Food Restaurant
6,North York,3,Park,Restaurant,Coffee Shop,Bowling Alley,Grocery Store,Golf Course,Gas Station,French Restaurant,Intersection,Dog Run
9,North York,3,Restaurant,Coffee Shop,Japanese Restaurant,Gym,Burger Joint,Bank,Café,Pizza Place,Supermarket,Asian Restaurant
10,North York,3,Restaurant,Coffee Shop,Japanese Restaurant,Gym,Burger Joint,Bank,Café,Pizza Place,Supermarket,Asian Restaurant
12,North York,3,Coffee Shop,Furniture / Home Store,Pizza Place,Caribbean Restaurant,Sushi Restaurant,Sports Bar,Middle Eastern Restaurant,Fast Food Restaurant,Bar,Bank
13,North York,3,Vietnamese Restaurant,Coffee Shop,Pizza Place,Hotel,Park,Gas Station,Grocery Store,Chinese Restaurant,Fast Food Restaurant,Liquor Store
14,North York,3,Vietnamese Restaurant,Coffee Shop,Pizza Place,Hotel,Park,Gas Station,Grocery Store,Chinese Restaurant,Fast Food Restaurant,Liquor Store


#### Cluster 5

In [70]:
ny_merged.loc[ny_merged['Cluster Labels'] == 4, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,North York,4,Electronics Store,Pharmacy,Pizza Place,Park,Shopping Mall,Optical Shop,Italian Restaurant,Bakery,Bank,Dim Sum Restaurant


## Observations:

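#### The first cluster (label 0) mixes places that serve food, such as coffee shops and pizza places, with convenience stores, pharmacies, banks and parks

#### The second cluster (label 1) is a single neighborhood whose top venues are a park and a pool

#### The third cluster (label 2) is a single neighborhood led by a convenience store, an auto workshop, a discount store and service businesses

#### The fourth cluster (label 3) is the largest and consists mostly of places that serve food, along with a few gyms

#### The fifth cluster (label 4) is a single neighborhood led by stores: an electronics store, a pharmacy and a shopping mall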
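Observations like these are made by reading the tables, and a short tally can back them up. The sketch below counts how often each category is the top venue within a cluster. The rows are copied from the cluster tables above for labels 0, 1, 2 and 4. Label 3 is left out because its table shows ten rows and may be truncated. The frame name `top` is illustrative.

```python
import pandas as pd

# Index, cluster label and 1st most common venue, copied from the tables above.
top = pd.DataFrame(
    {
        "Cluster Labels": [0, 0, 0, 0, 0, 1, 2, 4],
        "1st Most Common Venue": [
            "Coffee Shop", "Pharmacy", "Park", "Bank", "Coffee Shop",
            "Park", "Convenience Store", "Electronics Store",
        ],
    },
    index=[0, 7, 8, 11, 21, 3, 23, 22],
)

# For each cluster, count how often every category is the top venue.
counts = (
    top.groupby("Cluster Labels")["1st Most Common Venue"]
    .value_counts()
    .unstack(fill_value=0)
)
leading = counts.idxmax(axis=1)
print(counts)
print(leading)
```

The same tally could be extended to the 2nd and 3rd most common venues to give a fuller picture of each cluster.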