In this project, we will learn about k-means clustering, which is a form of unsupervised learning. Then we will use clustering and the Foursquare API to segment and cluster the neighborhoods in the city of New York. Furthermore, we will learn how to scrape website and parse HTML code using the Python package Beautifulsoup, and convert data into a pandas dataframe.

This capstone project course will give us a taste of what data scientists go through in real life when working with data. 

We will learn about location data and different location data providers, such as Foursquare. You will learn how to make RESTful API calls to the Foursquare API to retrieve data about venues in different neighborhoods around the world. We will also learn how to be creative in situations where data are not readily available by scraping web data and parsing HTML code. We will utilize Python and its pandas library to manipulate data, which will help you help you refine your skills for exploring and analyzing data. 

Finally, We will be required to use the Folium library to great maps of geospatial data and to communicate your results and findings.

In [1]:
#part 1
import numpy as np 
import pandas as pd 
from bs4 import BeautifulSoup
from urllib.request import urlopen

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20181220094055-0000
KERNEL_ID = 742ba1e6-0e68-4fa2-8200-acb120a68d4f


In [2]:
wiki_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#query the website and return the html to the variable ‘page’
page = urlopen(wiki_page)
soup = BeautifulSoup(page, 'html.parser') #store in variable `soup`

#extract table and convert into dataframe
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))[0]
df=pd.DataFrame(df)
header = df.iloc[0]
df = df[1:]
df = df.rename(columns = header)
df.head(15)


Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned


In [3]:
df = df[df.Borough != 'Not assigned']
print(df.head())
print()

df['Neighbourhood'] = df.apply(lambda row: row['Borough'] if (row['Neighbourhood']=='Not assigned') else row['Neighbourhood'],axis=1)
print(df.head())
print()

df_grp = df.groupby(['Postcode','Borough'], sort=False)['Neighbourhood'].apply(','.join).reset_index()
print(df_grp.head())
print()

print(df_grp.shape)

  Postcode           Borough     Neighbourhood
3      M3A        North York         Parkwoods
4      M4A        North York  Victoria Village
5      M5A  Downtown Toronto      Harbourfront
6      M5A  Downtown Toronto       Regent Park
7      M6A        North York  Lawrence Heights

  Postcode           Borough     Neighbourhood
3      M3A        North York         Parkwoods
4      M4A        North York  Victoria Village
5      M5A  Downtown Toronto      Harbourfront
6      M5A  Downtown Toronto       Regent Park
7      M6A        North York  Lawrence Heights

  Postcode           Borough                    Neighbourhood
0      M3A        North York                        Parkwoods
1      M4A        North York                 Victoria Village
2      M5A  Downtown Toronto         Harbourfront,Regent Park
3      M6A        North York  Lawrence Heights,Lawrence Manor
4      M7A      Queen's Park                     Queen's Park

(103, 3)


In [4]:
#part 2
!pip install geocoder
import geocoder 
!pip install folium
import folium
import geopy
from geopy.geocoders import Nominatim

Collecting geocoder
[?25l  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
[K    100% |################################| 102kB 2.4MB/s a 0:00:01
[?25hCollecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Collecting click (from geocoder)
[?25l  Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
[K    100% |################################| 81kB 3.6MB/s eta 0:00:01
[?25hCollecting future (from geocoder)
[?25l  Downloading https://files.pythonhosted.org/packages/90/52/e20466b85000a181e1e144fd8305caf2cf475e2f9674e797b222f8105f5f/future-0.17.1.tar.gz (829kB)
[K    100% |################################| 829kB 2.9MB/s eta 0:00:01
[

In [5]:
for index, row in df_grp.iterrows():
    address_1 = row['Neighbourhood'] 
    address_2 = address_1.split(',')[-1]
    address_3 = address_2+","+"Toronto,Canada"
    #print(address_3) #-- It worked

column_names = ['Latitude', 'Longitude'] 
n_hood = pd.DataFrame(columns=column_names)
n_hood.shape

(0, 2)

In [8]:
for index, row in df_grp.iterrows():
    try:
        address_1 = row['Neighbourhood'] 
        address_2 = address_1.split(',')[-1]
        address = address_2+","+"Toronto,Canada"
        geolocator = Nominatim()
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        #print(row['Borough'],address, latitude, longitude)
        n_hood = n_hood.append({'Latitude': latitude,'Longitude': longitude}, ignore_index=True)
        n_hood
        pass
    except ValueError as error_message:
        print("Error")
    except AttributeError:
        #print("Problem with data or cannot Geocode.")
        address_3 = row['Borough']
        address = address_3+","+"Toronto,Canada"
        geolocator = Nominatim()
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        #print(address, latitude, longitude)
        n_hood = n_hood.append({'Latitude': latitude,'Longitude': longitude}, ignore_index=True)
       # print(row['Borough'],address, latitude, longitude)
        n_hood
        pass

n_hood.head()

GeocoderServiceError: [Errno 99] Cannot assign requested address

In [9]:
df = pd.concat([df_grp, n_hood[['Latitude', 'Longitude']]], axis=1)
df.head(25)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.761224,-79.323986
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.660706,-79.360457
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.722079,-79.437507
4,M7A,Queen's Park,Queen's Park,43.65998,-79.390369
5,M9A,Etobicoke,Islington Avenue,43.614058,-79.50851
6,M1B,Scarborough,"Rouge,Malvern",43.809196,-79.221701
7,M3B,North York,Don Mills North,43.737178,-79.343451
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706298,-79.321907
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.656502,-79.377128


In [10]:
#part3
print('We have {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)
df.shape
len(df)

We have 12 boroughs and 151 neighborhoods.


151

In [11]:
#address = 'New York City, NY'
#address = 'Manhattan, NY'
address = 'Toronto,Canada'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [12]:
# create map of New York using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

In [13]:
# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

In [14]:
#manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
#manhattan_data.head()
toronto_data = df[df['Borough'] == 'Scarborough'].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.809196,-79.221701
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.775504,-79.134976
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.768914,-79.187291
3,M1G,Scarborough,Woburn,43.759824,-79.225291
4,M1H,Scarborough,Cedarbrae,43.756467,-79.226692


In [15]:
address = 'Scarborough, Toronto,Canada'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Scarborough are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Scarborough are 43.773077, -79.257774.


In [16]:
# create map of Manhattan using latitude and longitude values
map_Scarborough = folium.Map(location=[latitude, longitude], zoom_start=11)
map_Scarborough

In [17]:
# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Scarborough)  
    
map_Scarborough

In [18]:
!pip install folium
from sklearn.cluster import KMeans
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

Collecting folium
  Using cached https://files.pythonhosted.org/packages/55/e2/7e523df8558b7f4b2ab4c62014fd378ccecce3fdc14c9928b272a88ae4cc/folium-0.7.0-py3-none-any.whl
Collecting branca>=0.3.0 (from folium)
  Using cached https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Collecting six (from folium)
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Collecting numpy (from folium)
  Using cached https://files.pythonhosted.org/packages/86/04/bd774106ae0ae1ada68c67efe89f1a16b2aa373cc2db15d974002a9f136d/numpy-1.15.4-cp35-cp35m-manylinux1_x86_64.whl
Collecting requests (from folium)
  Using cached https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl
Collecting jinja2 (from folium)
  Using cached https://files.pyt

In [19]:
address = 'Toronto,Canada'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Toronto City are {}, {}.'.format(latitude, longitude))

Toronto City are 43.653963, -79.387207.


In [20]:
CLIENT_ID = 'HDD0TFIAZXBAGTHQFUMMFD1TXISGGMXQPMJNNYE1SIT5Q1FO' # your Foursquare ID
CLIENT_SECRET = 'T5ZUW2WMHRJ2JDWKN4TOX1JITUJQ4UFOQV4E1U3M0GFJ1HVJ' # your Foursquare Secret
VERSION = '20180924' # Foursquare API version
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


Your credentails:
CLIENT_ID: HDD0TFIAZXBAGTHQFUMMFD1TXISGGMXQPMJNNYE1SIT5Q1FO
CLIENT_SECRET:T5ZUW2WMHRJ2JDWKN4TOX1JITUJQ4UFOQV4E1U3M0GFJ1HVJ


In [21]:
toronto_data.loc[0, 'Neighbourhood']

neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))


Latitude and longitude values of Rouge,Malvern are 43.8091955, -79.2217008.


In [22]:
# Now, let's get the top 500 venues that are in Rouge/Malvern within a radius of 1000 meters.
# Also create GET request url 

radius = 1000
LIMIT = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=HDD0TFIAZXBAGTHQFUMMFD1TXISGGMXQPMJNNYE1SIT5Q1FO&client_secret=T5ZUW2WMHRJ2JDWKN4TOX1JITUJQ4UFOQV4E1U3M0GFJ1HVJ&v=20180924&ll=43.8091955,-79.2217008&radius=1000&limit=500'

In [23]:
import requests
import json
from pandas.io.json import json_normalize
results = requests.get(url).json()

In [24]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [25]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]


In [26]:
nearby_venues.head()
len(nearby_venues)
print('{} venues by Foursquare.'.format(nearby_venues.shape[0]))


14 venues by Foursquare.


In [27]:
#function to repeat the same process to all the neighborhoods in Scarborough
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


In [28]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'])


toronto_venues.head()


Rouge,Malvern
Highland Creek,Rouge Hill,Port Union
Guildwood,Morningside,West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Ionview,Kennedy Park
Clairlea,Golden Mile,Oakridge
Cliffcrest,Cliffside,Scarborough Village West
Birch Cliff,Cliffside West
Dorset Park,Scarborough Town Centre,Wexford Heights
Maryvale,Wexford
Agincourt
Clarks Corners,Sullivan,Tam O'Shanter
Agincourt North,L'Amoreaux East,Milliken,Steeles East
L'Amoreaux West,Steeles West
Upper Rouge


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge,Malvern",43.809196,-79.221701,Shoppers Drug Mart,43.809202,-79.22332,Pharmacy
1,"Rouge,Malvern",43.809196,-79.221701,Subway,43.806805,-79.222515,Sandwich Place
2,"Rouge,Malvern",43.809196,-79.221701,Pizza Hut,43.808326,-79.220616,Pizza Place
3,"Rouge,Malvern",43.809196,-79.221701,Pizza Pizza,43.806613,-79.221243,Pizza Place
4,"Rouge,Malvern",43.809196,-79.221701,Shoppers Drug Mart,43.806489,-79.223024,Pharmacy


In [29]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues.groupby('Neighborhood').count()


There are 79 uniques categories.


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,8,8,8,8,8,8
"Agincourt North,L'Amoreaux East,Milliken,Steeles East",5,5,5,5,5,5
"Birch Cliff,Cliffside West",6,6,6,6,6,6
Cedarbrae,22,22,22,22,22,22
"Clairlea,Golden Mile,Oakridge",5,5,5,5,5,5
"Clarks Corners,Sullivan,Tam O'Shanter",30,30,30,30,30,30
"Cliffcrest,Cliffside,Scarborough Village West",7,7,7,7,7,7
"Dorset Park,Scarborough Town Centre,Wexford Heights",5,5,5,5,5,5
"East Birchmount Park,Ionview,Kennedy Park",4,4,4,4,4,4
"Guildwood,Morningside,West Hill",30,30,30,30,30,30


In [30]:
# one hot encoding
toronto_venues_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")


In [31]:
# add neighborhood column back to dataframe
toronto_venues_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 


In [32]:
# move neighborhood column to the first column
fixed_columns = [toronto_venues_onehot.columns[-1]] + list(toronto_venues_onehot.columns[:-1])
toronto_venues_onehot = toronto_venues_onehot[fixed_columns]

toronto_venues_onehot.head()


Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Asian Restaurant,Athletics & Sports,Automotive Shop,Bakery,Bank,Bar,Beer Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Thrift / Vintage Store,Toy / Game Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Shop,Women's Store
0,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
toronto_venues_onehot.shape

(260, 80)

In [35]:
# to club and display he content of all neighborhood
toronto_venues_grouped = toronto_venues_onehot.groupby('Neighborhood').mean().reset_index()
toronto_venues_grouped

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Asian Restaurant,Athletics & Sports,Automotive Shop,Bakery,Bank,Bar,Beer Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Thrift / Vintage Store,Toy / Game Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Shop,Women's Store
0,Agincourt,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0
2,"Birch Cliff,Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0
4,"Clairlea,Golden Mile,Oakridge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Clarks Corners,Sullivan,Tam O'Shanter",0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0
6,"Cliffcrest,Cliffside,Scarborough Village West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Dorset Park,Scarborough Town Centre,Wexford He...",0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"East Birchmount Park,Ionview,Kennedy Park",0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Guildwood,Morningside,West Hill",0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0


In [36]:
num_top_venues = 5

for hood in toronto_venues_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_venues_grouped[toronto_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Agincourt----
                venue  freq
0        Skating Rink  0.25
1        Intersection  0.12
2      Cosmetics Shop  0.12
3  Athletics & Sports  0.12
4        Dance Studio  0.12


----Agincourt North,L'Amoreaux East,Milliken,Steeles East----
            venue  freq
0      Restaurant   0.2
1       Wine Shop   0.2
2  Sandwich Place   0.2
3   Grocery Store   0.2
4            Park   0.2


----Birch Cliff,Cliffside West----
            venue  freq
0     Pizza Place  0.33
1             Pub  0.17
2  Breakfast Spot  0.17
3  Sandwich Place  0.17
4   Grocery Store  0.17


----Cedarbrae----
                    venue  freq
0    Fast Food Restaurant  0.14
1             Coffee Shop  0.09
2           Grocery Store  0.09
3  Furniture / Home Store  0.09
4           Shopping Mall  0.05


----Clairlea,Golden Mile,Oakridge----
                  venue  freq
0     Convenience Store   0.2
1              Bus Stop   0.2
2                  Park   0.2
3  Fast Food Restaurant   0.2
4            Restaurant

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))


In [39]:
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_venues_grouped['Neighborhood']

for ind in np.arange(toronto_venues_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_venues_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Skating Rink,Intersection,Park,Athletics & Sports,Bus Stop,Cosmetics Shop,Dance Studio,Farmers Market,Coffee Shop,Convenience Store
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Grocery Store,Wine Shop,Restaurant,Park,Sandwich Place,Discount Store,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop
2,"Birch Cliff,Cliffside West",Pizza Place,Breakfast Spot,Sandwich Place,Pub,Grocery Store,Gourmet Shop,Gift Shop,Chocolate Shop,Clothing Store,Cocktail Bar
3,Cedarbrae,Fast Food Restaurant,Grocery Store,Coffee Shop,Furniture / Home Store,Beer Store,Park,Clothing Store,Pizza Place,Liquor Store,Shopping Mall
4,"Clairlea,Golden Mile,Oakridge",Restaurant,Park,Bus Stop,Convenience Store,Fast Food Restaurant,Women's Store,Discount Store,Cocktail Bar,Coffee Shop,Cosmetics Shop


In [40]:
len(neighborhoods_venues_sorted)

17

In [41]:
toronto_venues_grouped.head()


Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Asian Restaurant,Athletics & Sports,Automotive Shop,Bakery,Bank,Bar,Beer Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Thrift / Vintage Store,Toy / Game Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Shop,Women's Store
0,Agincourt,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0
2,"Birch Cliff,Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.045455,0.0,0.045455,0.0,0.0,0.0
4,"Clairlea,Golden Mile,Oakridge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
# set number of clusters
kclusters = 5


In [43]:
toronto_grouped_clustering = toronto_venues_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([0, 0, 2, 2, 0, 2, 2, 3, 4, 2], dtype=int32)

In [44]:
toronto_merged = toronto_data


In [45]:
# add clustering labels
toronto_merged['Cluster Labels'] = kmeans.labels_


In [46]:
# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.809196,-79.221701,0,Pizza Place,Fast Food Restaurant,Pharmacy,Park,Sandwich Place,Bubble Tea Shop,Grocery Store,Food & Drink Shop,Fried Chicken Joint,Farmers Market
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.775504,-79.134976,0,Park,Women's Store,Indian Restaurant,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop,Dance Studio,Discount Store,Farmers Market
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.768914,-79.187291,2,Coffee Shop,Fast Food Restaurant,Pizza Place,Breakfast Spot,Gym,Food & Drink Shop,Grocery Store,Discount Store,Salon / Barbershop,Liquor Store
3,M1G,Scarborough,Woburn,43.759824,-79.225291,2,Fast Food Restaurant,Coffee Shop,Pizza Place,Beer Store,Indian Restaurant,Furniture / Home Store,Discount Store,Paper / Office Supplies Store,Pharmacy,Grocery Store
4,M1H,Scarborough,Cedarbrae,43.756467,-79.226692,0,Fast Food Restaurant,Grocery Store,Coffee Shop,Furniture / Home Store,Beer Store,Park,Clothing Store,Pizza Place,Liquor Store,Shopping Mall


In [48]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

In [49]:
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [50]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,0,Pizza Place,Fast Food Restaurant,Pharmacy,Park,Sandwich Place,Bubble Tea Shop,Grocery Store,Food & Drink Shop,Fried Chicken Joint,Farmers Market
1,Scarborough,0,Park,Women's Store,Indian Restaurant,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop,Dance Studio,Discount Store,Farmers Market
4,Scarborough,0,Fast Food Restaurant,Grocery Store,Coffee Shop,Furniture / Home Store,Beer Store,Park,Clothing Store,Pizza Place,Liquor Store,Shopping Mall
12,Scarborough,0,Skating Rink,Intersection,Park,Athletics & Sports,Bus Stop,Cosmetics Shop,Dance Studio,Farmers Market,Coffee Shop,Convenience Store
15,Scarborough,0,Fast Food Restaurant,Grocery Store,Coffee Shop,Furniture / Home Store,Beer Store,Park,Clothing Store,Pizza Place,Liquor Store,Shopping Mall


In [51]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Scarborough,1,Spa,Asian Restaurant,Bus Line,Thai Restaurant,Middle Eastern Restaurant,Women's Store,Coffee Shop,Convenience Store,Cosmetics Shop,Dance Studio


In [52]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Scarborough,2,Coffee Shop,Fast Food Restaurant,Pizza Place,Breakfast Spot,Gym,Food & Drink Shop,Grocery Store,Discount Store,Salon / Barbershop,Liquor Store
3,Scarborough,2,Fast Food Restaurant,Coffee Shop,Pizza Place,Beer Store,Indian Restaurant,Furniture / Home Store,Discount Store,Paper / Office Supplies Store,Pharmacy,Grocery Store
5,Scarborough,2,Coffee Shop,Pub,Chinese Restaurant,Fast Food Restaurant,Pharmacy,Gym,Bakery,American Restaurant,Asian Restaurant,Convenience Store
6,Scarborough,2,Fast Food Restaurant,Asian Restaurant,Grocery Store,Women's Store,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop,Dance Studio,Discount Store
9,Scarborough,2,Pizza Place,Breakfast Spot,Sandwich Place,Pub,Grocery Store,Gourmet Shop,Gift Shop,Chocolate Shop,Clothing Store,Cocktail Bar
11,Scarborough,2,Women's Store,American Restaurant,Coffee Shop,Restaurant,Movie Theater,Pizza Place,Pet Store,Clothing Store,Café,Cantonese Restaurant
13,Scarborough,2,Coffee Shop,Fast Food Restaurant,Pizza Place,Breakfast Spot,Gym,Food & Drink Shop,Grocery Store,Discount Store,Salon / Barbershop,Liquor Store
14,Scarborough,2,Grocery Store,Wine Shop,Restaurant,Park,Sandwich Place,Discount Store,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop
16,Scarborough,2,Café,Bar,Bakery,Coffee Shop,Cocktail Bar,Clothing Store,Mexican Restaurant,Beer Store,African Restaurant,Sandwich Place


In [53]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Scarborough,3,Restaurant,Park,Bus Stop,Convenience Store,Fast Food Restaurant,Women's Store,Discount Store,Cocktail Bar,Coffee Shop,Cosmetics Shop


In [54]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Scarborough,4,Coffee Shop,Pub,Chinese Restaurant,Fast Food Restaurant,Pharmacy,Gym,Bakery,American Restaurant,Asian Restaurant,Convenience Store
