
# Capstone Project - The Battle of the Neighborhoods : Report


##### Applied Data Science Capstone by IBM/Coursera


### Table of contents
- Introduction: Business Problem
- Data
- Methodology
- Analysis
- Results and Discussion
- Conclusion

### Introduction: Business Problem 

In this project we will try to find an optimal location for a restaurant. Specifically, this report is aimed at stakeholders interested in opening a restaurant in Toronto, Canada.

For someone who wants to open a new restaurant in the city, we will try to determine which location is best suited, keeping in mind the existing competitors and which income groups are most likely to be attracted, based on the population of the neighbourhood.

Since there are many restaurants in Toronto, we will try to detect locations that are not already crowded with restaurants. We would also prefer locations as close to the city center as possible, assuming that the first two conditions are met.

We will use data science techniques to identify a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly stated so that stakeholders can choose the best possible final location.

### Data 

Based on the definition of our problem, the factors that will influence our decision are:

- All existing restaurants in the neighborhood (any type of restaurant)
- Age group of people and their income
- Distance of the neighborhood from the city center
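
The distance factor can be sketched with the haversine formula. This is a minimal illustration rather than code from the notebook: the function name `haversine_km` and the Earth radius constant are assumptions, and the two coordinate pairs are the Toronto center and Harbourfront values that appear in outputs further down this report.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates taken from outputs in this report
toronto_center = (43.6534817, -79.3839347)
harbourfront = (43.64008, -79.38015)
distance = haversine_km(*toronto_center, *harbourfront)
print('Harbourfront is about {:.2f} km from the city center'.format(distance))
```

Applied to every neighborhood, a distance like this could serve as a tie-breaker between areas with similar restaurant density.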

We decided to use a regularly spaced grid of locations, centered on the city center, to define our neighborhoods.

The following data sources will be needed to extract or generate the required information:

- Centers of candidate areas will be generated algorithmically, and approximate addresses of those centers will be obtained using https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- The number of restaurants, and their type and location, in every neighborhood will be obtained using the Foursquare API
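
As a rough illustration of what "generated algorithmically" could mean, the sketch below builds a square grid of points around the city center. The helper `make_grid`, the 600 m spacing and the grid size are all assumptions made for this example, and the degree conversion is a flat-Earth approximation that is only reasonable over a small area.

```python
import math


def make_grid(center_lat, center_lon, spacing_m=600.0, steps=2):
    """Return (lat, lon) points on a square grid centred on the given point.

    The grid has (2 * steps + 1) ** 2 points, `spacing_m` metres apart.
    """
    meters_per_deg_lat = 111_320.0  # approximate length of one degree of latitude
    meters_per_deg_lon = meters_per_deg_lat * math.cos(math.radians(center_lat))
    points = []
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            lat = center_lat + i * spacing_m / meters_per_deg_lat
            lon = center_lon + j * spacing_m / meters_per_deg_lon
            points.append((lat, lon))
    return points


# Toronto center as reported by the geocoder later in this report
grid = make_grid(43.6534817, -79.3839347, spacing_m=600.0, steps=2)
print(len(grid), 'candidate centers')
```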

In [1]:

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)
import json # library to handle JSON files

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs # samples_generator was removed in newer scikit-learn

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

from bs4 import BeautifulSoup


In [2]:

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


In [3]:
import folium # map rendering library

In [4]:
# define the dataframe columns
column_names = ['Postalcode','Borough', 'Neighborhood'] 

df_1 = pd.DataFrame(columns=column_names)

##### 1. Download and Explore Dataset

In [5]:
#Reading the webpage
from urllib.request import urlopen
wikki = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=942851379"

webpage = urlopen(wikki)

from bs4 import BeautifulSoup
soup = BeautifulSoup(webpage, "html.parser")

In [6]:
#Extracting the table from the webpage
Toronto=soup.find('table', class_='wikitable sortable')

In [7]:
#Generate lists
Pos=[]
Bor=[]
Neig=[]

for row in Toronto.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==3: 
        Pos.append(cells[0].find(text=True))
        Bor.append(cells[1].find(text=True))
        Neig.append(cells[2].find(text=True))

        
#Add Data to our DataFrame
df_1['Postalcode']=Pos
df_1['Borough']=Bor
df_1['Neighborhood']=Neig

df_1.head(10)

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned\n
9,M9A,Etobicoke,Islington Avenue
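
Several scraped cells keep a trailing newline (for example `Not assigned\n`), which later shows up in neighborhood names. The sketch below, in plain Python, strips whitespace from each cell and drops unassigned boroughs in one pass. The function `clean_rows` is a hypothetical helper and the sample rows are copied from outputs in this report; in pandas, `Series.str.strip()` on each text column would do the stripping.

```python
def clean_rows(rows):
    """Strip whitespace from every cell and drop rows whose borough is 'Not assigned'."""
    cleaned = []
    for postal_code, borough, neighborhood in rows:
        postal_code = postal_code.strip()
        borough = borough.strip()
        neighborhood = neighborhood.strip()
        if borough == 'Not assigned':
            continue  # no borough, so the row carries no usable location
        cleaned.append((postal_code, borough, neighborhood))
    return cleaned


# Sample rows taken from outputs in this report
raw_rows = [
    ('M1A', 'Not assigned', 'Not assigned\n'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M3B', 'North York', 'Don Mills North\n'),
]
cleaned_rows = clean_rows(raw_rows)
print(cleaned_rows)
```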


##### Cleaning the Data
- Drop rows where Borough is 'Not assigned'.
- Reset the index.

In [8]:
df_1=df_1[df_1['Borough']!='Not assigned']
df_1.reset_index(inplace=True)
df_1.head(10)

Unnamed: 0,index,Postalcode,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,Harbourfront
3,5,M6A,North York,Lawrence Heights
4,6,M6A,North York,Lawrence Manor
5,7,M7A,Downtown Toronto,Queen's Park
6,9,M9A,Etobicoke,Islington Avenue
7,10,M1B,Scarborough,Rouge
8,11,M1B,Scarborough,Malvern
9,13,M3B,North York,Don Mills North\n


In [9]:
df_1.drop(columns={'index'})

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North\n


In [10]:
df_2 = df_1.groupby('Postalcode')['Neighborhood'].apply(lambda x: "%s" % ', '.join(x))
df_2 = df_2.reset_index(drop=False)
df_2.rename(columns={'Neighborhood':'Neighborhood_joined'},inplace=True)
df_2.head()

Unnamed: 0,Postalcode,Neighborhood_joined
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood\n, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae\n


In [11]:
df_3 = pd.merge(df_1, df_2, on='Postalcode')
df_3.drop(['Neighborhood'], axis=1, inplace = True)
df_3.drop_duplicates( inplace = True)
df_3.rename(columns={'Neighborhood_joined': 'Neighborhood'}, inplace = True)
df_3.head()

Unnamed: 0,index,Postalcode,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,Harbourfront
3,5,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,6,M6A,North York,"Lawrence Heights, Lawrence Manor"


In [12]:
df_3.drop(columns={'index'})

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M6A,North York,"Lawrence Heights, Lawrence Manor"
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,"Rouge, Malvern"
8,M1B,Scarborough,"Rouge, Malvern"
9,M3B,North York,Don Mills North\n
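
The output above still lists M6A and M1B twice. The earlier `df_1.drop(columns={'index'})` call returns a new frame that is not assigned back, so the `index` column survives the merge and makes otherwise identical rows look different to `drop_duplicates`. The intended result, one row per postal code, can be sketched in plain Python; `group_by_postal_code` is a hypothetical helper, and in pandas, dropping the `index` column before `drop_duplicates` should have the same effect.

```python
def group_by_postal_code(rows):
    """Collapse (postal_code, borough, neighborhood) rows to one row per postal code."""
    grouped = {}  # dicts keep insertion order in Python 3.7+
    for postal_code, borough, neighborhood in rows:
        entry = grouped.setdefault(postal_code, {'borough': borough, 'neighborhoods': []})
        entry['neighborhoods'].append(neighborhood)
    return [
        (code, entry['borough'], ', '.join(entry['neighborhoods']))
        for code, entry in grouped.items()
    ]


# Sample rows taken from the tables above
rows = [
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M6A', 'North York', 'Lawrence Heights'),
    ('M6A', 'North York', 'Lawrence Manor'),
    ('M1B', 'Scarborough', 'Rouge'),
    ('M1B', 'Scarborough', 'Malvern'),
]
one_per_code = group_by_postal_code(rows)
for row in one_per_code:
    print(row)
```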


In [13]:
# keep only the boroughs whose name contains "Toronto"
df_ungrp = df_1[df_1['Borough'].str.contains("Toronto")].copy()


df_ungrp.index = pd.RangeIndex(len(df_ungrp.index))
df_ungrp

Unnamed: 0,index,Postalcode,Borough,Neighborhood
0,4,M5A,Downtown Toronto,Harbourfront
1,7,M7A,Downtown Toronto,Queen's Park
2,16,M5B,Downtown Toronto,Ryerson\n
3,17,M5B,Downtown Toronto,Garden District\n
4,33,M5C,Downtown Toronto,St. James Town
5,46,M4E,East Toronto,The Beaches
6,47,M5E,Downtown Toronto,Berczy Park
7,56,M5G,Downtown Toronto,Central Bay Street\n
8,57,M6G,Downtown Toronto,Christie\n
9,67,M5H,Downtown Toronto,Adelaide\n


In [14]:
df_ungrp.drop(columns={'index'},inplace=True)

In [15]:
import time

In [16]:
geolocator = Nominatim(scheme='http', user_agent="ES1234")

for row_index, item in df_ungrp.iterrows():
    
    list1 = df_ungrp.loc[[row_index],['Neighborhood']].values.astype('str')
    loc = ' , Toronto, Ontario, Canada'
    list1.astype('str')
    list1 = np.append(list1, loc)
    latitude = None
    longitude = None
    location = None
    
    location = geolocator.geocode(list1 , limit = 15)
    #time.sleep(5)
    if(location is not None):
        df_ungrp.loc[df_ungrp.index[row_index], 'Latitude'] = location.latitude
        df_ungrp.loc[df_ungrp.index[row_index], 'Longitude'] = location.longitude
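
The loop above sends requests back to back. Nominatim's public usage policy asks for at most one request per second, so a pause between calls is advisable. The sketch below is one possible shape for that, not the notebook's code: `geocode_neighborhoods` is a hypothetical helper that takes any geocoding callable (for example `geolocator.geocode`), and the fake geocoder only exists so the example runs without network access. geopy also ships a `RateLimiter` wrapper in `geopy.extra.rate_limiter` that serves a similar purpose.

```python
import time
from collections import namedtuple


def geocode_neighborhoods(names, geocode, suffix=', Toronto, Ontario, Canada', pause_s=1.0):
    """Return {name: (lat, lon) or None} using the supplied geocode callable.

    `geocode` is assumed to take an address string and return an object with
    `.latitude` and `.longitude` attributes, or None when nothing is found.
    """
    coords = {}
    for name in names:
        query = name.strip() + suffix  # strip stray newlines left by the scrape
        location = geocode(query)
        coords[name] = None if location is None else (location.latitude, location.longitude)
        time.sleep(pause_s)  # pause between requests to respect the service
    return coords


# Stand-in geocoder so the sketch runs offline; values copied from this report
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])
KNOWN = {'Harbourfront, Toronto, Ontario, Canada': FakeLocation(43.64008, -79.38015)}


def fake_geocode(query):
    return KNOWN.get(query)


result = geocode_neighborhoods(['Harbourfront', 'Central Bay Street\n'], fake_geocode, pause_s=0)
print(result)
```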

In [17]:
df_ungrp.head(10)

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.64008,-79.38015
1,M7A,Downtown Toronto,Queen's Park,43.659659,-79.39034
2,M5B,Downtown Toronto,Ryerson\n,43.658469,-79.378993
3,M5B,Downtown Toronto,Garden District\n,43.6565,-79.377114
4,M5C,Downtown Toronto,St. James Town,43.669403,-79.372704
5,M4E,East Toronto,The Beaches,43.671024,-79.296712
6,M5E,Downtown Toronto,Berczy Park,43.647984,-79.375396
7,M5G,Downtown Toronto,Central Bay Street\n,,
8,M6G,Downtown Toronto,Christie\n,43.664111,-79.418405
9,M5H,Downtown Toronto,Adelaide\n,43.650486,-79.379498


In [18]:
print('We have {} boroughs and {} neighborhoods.'.format(
        len(df_ungrp['Borough'].unique()),
        df_ungrp.shape[0]
    )
)

df_ungrp.dropna(inplace =True)

address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="ES1234")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

We have 4 boroughs and 74 neighborhoods.
The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [19]:
df_ungrp.reset_index(inplace=True)
df_ungrp

Unnamed: 0,index,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,0,M5A,Downtown Toronto,Harbourfront,43.64008,-79.38015
1,1,M7A,Downtown Toronto,Queen's Park,43.659659,-79.39034
2,2,M5B,Downtown Toronto,Ryerson\n,43.658469,-79.378993
3,3,M5B,Downtown Toronto,Garden District\n,43.6565,-79.377114
4,4,M5C,Downtown Toronto,St. James Town,43.669403,-79.372704
5,5,M4E,East Toronto,The Beaches,43.671024,-79.296712
6,6,M5E,Downtown Toronto,Berczy Park,43.647984,-79.375396
7,8,M6G,Downtown Toronto,Christie\n,43.664111,-79.418405
8,9,M5H,Downtown Toronto,Adelaide\n,43.650486,-79.379498
9,10,M5H,Downtown Toronto,King\n,43.648949,-79.377754


In [20]:
df_ungrp.drop(columns={'index'},inplace=True)
df_ungrp

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.64008,-79.38015
1,M7A,Downtown Toronto,Queen's Park,43.659659,-79.39034
2,M5B,Downtown Toronto,Ryerson\n,43.658469,-79.378993
3,M5B,Downtown Toronto,Garden District\n,43.6565,-79.377114
4,M5C,Downtown Toronto,St. James Town,43.669403,-79.372704
5,M4E,East Toronto,The Beaches,43.671024,-79.296712
6,M5E,Downtown Toronto,Berczy Park,43.647984,-79.375396
7,M6G,Downtown Toronto,Christie\n,43.664111,-79.418405
8,M5H,Downtown Toronto,Adelaide\n,43.650486,-79.379498
9,M5H,Downtown Toronto,King\n,43.648949,-79.377754


In [21]:
df_ungrp.to_csv("output.csv") 

**Generating a map of Toronto**

In [22]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, borough, neighborhood in zip(df_ungrp['Latitude'], df_ungrp['Longitude'], df_ungrp['Borough'], df_ungrp['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Let's use the Foursquare API to explore the neighborhoods**


In [24]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of the report)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Successfully Logged-In')

Successfully Logged-In


In [25]:
df_ungrp.loc[0]
# np.float was removed in NumPy 1.24; the built-in float does the same job
neighborhood_latitude = float(df_ungrp.loc[0, 'Latitude'])
neighborhood_longitude = float(df_ungrp.loc[0, 'Longitude'])

**Now, let's get the top 100 venues within a radius of 500 meters of the first neighborhood, Harbourfront**

*First, let's create the GET request URL and name it url.*

In [26]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()
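
An alternative to formatting the query string by hand is to let `urllib.parse.urlencode` assemble and escape the parameters, and to read the credentials from the environment instead of the notebook. This is a sketch under assumptions: `build_explore_url` is a hypothetical helper, the environment variable names are made up for this example, and the `demo-...` fallbacks are placeholders, not working credentials.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180605', radius=500, limit=100):
    """Build the explore request URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Hypothetical environment variable names; the fallbacks are placeholders only
client_id = os.environ.get('FOURSQUARE_CLIENT_ID', 'demo-id')
client_secret = os.environ.get('FOURSQUARE_CLIENT_SECRET', 'demo-secret')
example_url = build_explore_url(43.64008, -79.38015, client_id, client_secret)
print(example_url)
```

The resulting URL can be passed to `requests.get` exactly as above.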

In [27]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

*Now let's clean the json and structure it into a pandas dataframe.*

In [28]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Harbour Square Park,Park,43.639253,-79.378395
1,Lake Ontario,Lake,43.638945,-79.379665
2,Harbourfront,Neighborhood,43.639526,-79.380688
3,Miku,Japanese Restaurant,43.641374,-79.377531
4,Natrel Pond/Rink,Skating Rink,43.638431,-79.382528


In [29]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


##### 2. Exploring the Neighborhoods in Toronto

In [30]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
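
One fragile spot in the function above is `v['venue']['categories'][0]['name']`, which raises an `IndexError` if a venue comes back with an empty category list. The earlier `get_category_type` guards against that case; the sketch below applies the same guard to the row extraction. `extract_venue_rows` is a hypothetical helper, and the second sample item is invented to exercise the empty-category path.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into flat rows, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


sample_items = [
    {'venue': {'name': 'Miku',
               'location': {'lat': 43.641374, 'lng': -79.377531},
               'categories': [{'name': 'Japanese Restaurant'}]}},
    # Invented venue with no category, to show the fallback
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.64, 'lng': -79.38},
               'categories': []}},
]
venue_rows = extract_venue_rows('Harbourfront', 43.64008, -79.38015, sample_items)
for row in venue_rows:
    print(row)
```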

In [32]:
toronto_venues = getNearbyVenues(names=df_ungrp['Neighborhood'],
                                   latitudes=df_ungrp['Latitude'],
                                   longitudes=df_ungrp['Longitude']
                                  )
print(toronto_venues.shape)
toronto_venues.head()

Harbourfront
Queen's Park
Ryerson

Garden District

St. James Town
The Beaches
Berczy Park
Christie

Adelaide

King

Richmond

Dovercourt Village
Dufferin

Harbourfront East

Toronto Islands
Union Station
Little Portugal
Trinity
The Danforth West

Riverdale
Design Exchange
Toronto Dominion Centre
Brockton

Exhibition Place
Parkdale Village
The Beaches West

India Bazaar
Commerce Court
Studio District

Lawrence Park
Roselawn

Davisville North

Forest Hill North
High Park
The Junction South

The Annex
Yorkville
Parkdale
Roncesvalles
Davisville

Harbord

University of Toronto
Runnymede
Swansea
Moore Park
Summerhill East

Chinatown
Grange Park
Kensington Market
Deer Park
Forest Hill SE

Rathnelly
South Hill
Summerhill West

CN Tower
Bathurst Quay

Harbourfront West

King and Spadina
South Niagara
Rosedale
Cabbagetown
St. James Town
First Canadian Place
Underground city
Church and Wellesley
(3392, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.64008,-79.38015,Harbour Square Park,43.639253,-79.378395,Park
1,Harbourfront,43.64008,-79.38015,Lake Ontario,43.638945,-79.379665,Lake
2,Harbourfront,43.64008,-79.38015,Harbourfront,43.639526,-79.380688,Neighborhood
3,Harbourfront,43.64008,-79.38015,Miku,43.641374,-79.377531,Japanese Restaurant
4,Harbourfront,43.64008,-79.38015,Natrel Pond/Rink,43.638431,-79.382528,Skating Rink


In [33]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Adelaide\n,100,100,100,100,100,100
Bathurst Quay\n,26,26,26,26,26,26
Berczy Park,100,100,100,100,100,100
Brockton\n,18,18,18,18,18,18
CN Tower,56,56,56,56,56,56
Cabbagetown,51,51,51,51,51,51
Chinatown,58,58,58,58,58,58
Christie\n,57,57,57,57,57,57
Church and Wellesley,77,77,77,77,77,77
Commerce Court,100,100,100,100,100,100


In [34]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 292 uniques categories.


##### 3. Analyzing Each Neighborhood

In [35]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.shape

(3392, 292)

In [36]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.shape

(64, 292)
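
Taking the mean of the one-hot rows per neighborhood gives each category's share of that neighborhood's venues. The same quantity can be computed directly by counting, which may make the meaning of the `freq` values clearer. The helper `category_frequencies` and the small sample below are made up for illustration.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map each neighborhood to {category: share of that neighborhood's venues}.

    `venues` is an iterable of (neighborhood, category) pairs.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }


# Made-up sample, not real Foursquare results
sample = [
    ('Harbourfront', 'Park'),
    ('Harbourfront', 'Lake'),
    ('Harbourfront', 'Park'),
    ('Harbourfront', 'Japanese Restaurant'),
    ("Queen's Park", 'Coffee Shop'),
]
freqs = category_frequencies(sample)
print(freqs['Harbourfront'])
```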

**Let's check the top venues**

In [37]:
Top_venues = 5
for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(Top_venues))
    print('\n')

----Adelaide
----
                 venue  freq
0                 Café  0.06
1          Coffee Shop  0.06
2  American Restaurant  0.04
3                  Gym  0.04
4           Restaurant  0.04


----Bathurst Quay
----
              venue  freq
0       Coffee Shop  0.15
1              Café  0.12
2              Park  0.08
3             Diner  0.04
4  Ramen Restaurant  0.04


----Berczy Park----
                 venue  freq
0          Coffee Shop  0.09
1                 Café  0.06
2           Restaurant  0.05
3   Italian Restaurant  0.04
4  Japanese Restaurant  0.04


----Brockton
----
                   venue  freq
0                    Bar  0.17
1  Vietnamese Restaurant  0.11
2                   Park  0.11
3  Portuguese Restaurant  0.06
4              Gastropub  0.06


----CN Tower----
            venue  freq
0           Hotel  0.11
1     Coffee Shop  0.07
2     Pizza Place  0.07
3             Bar  0.04
4  Ice Cream Shop  0.04


----Cabbagetown----
         venue  freq
0   Restaurant  0.1

----Union Station----
                 venue  freq
0          Coffee Shop  0.15
1                 Café  0.08
2  Japanese Restaurant  0.06
3           Restaurant  0.04
4       Breakfast Spot  0.04


----University of Toronto----
                 venue  freq
0                 Café  0.17
1   Italian Restaurant  0.07
2  Japanese Restaurant  0.07
3                 Park  0.07
4            Bookstore  0.07


----Yorkville----
                venue  freq
0            Boutique  0.05
1          Restaurant  0.05
2  Italian Restaurant  0.05
3                Café  0.04
4      Clothing Store  0.04




In [38]:
def return_most_common_venues(row, Top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:Top_venues]

In [39]:
Top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(Top_venues):
    try:
        columns.append('{}{} Popular Venues'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Popular Venues'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], Top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
0,Adelaide\n,Café,Coffee Shop,Japanese Restaurant,Gym,Restaurant,American Restaurant,Gastropub,Cosmetics Shop,Seafood Restaurant,Deli / Bodega
1,Bathurst Quay\n,Coffee Shop,Café,Park,Harbor / Marina,New American Restaurant,Ramen Restaurant,Garden,Bank,Grocery Store,Gym
2,Berczy Park,Coffee Shop,Café,Restaurant,Japanese Restaurant,Italian Restaurant,Breakfast Spot,Gastropub,Hotel,Cocktail Bar,Bakery
3,Brockton\n,Bar,Vietnamese Restaurant,Park,French Restaurant,Jazz Club,Pizza Place,Korean Restaurant,Gastropub,Portuguese Restaurant,Bakery
4,CN Tower,Hotel,Coffee Shop,Pizza Place,Aquarium,Concert Hall,Baseball Stadium,Ice Cream Shop,Bar,Gym,Scenic Lookout
5,Cabbagetown,Restaurant,Café,Coffee Shop,Japanese Restaurant,Bakery,Gastropub,Beer Store,Pizza Place,Indian Restaurant,Pub
6,Chinatown,Café,Dessert Shop,Coffee Shop,Mexican Restaurant,Clothing Store,Art Gallery,Bar,Bakery,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
7,Christie\n,Korean Restaurant,Coffee Shop,Japanese Restaurant,Grocery Store,Café,Sandwich Place,Karaoke Bar,Cocktail Bar,Gift Shop,Indian Restaurant
8,Church and Wellesley,Sushi Restaurant,Coffee Shop,Japanese Restaurant,Restaurant,Yoga Studio,Burger Joint,Café,Men's Store,Mediterranean Restaurant,Smoke Shop
9,Commerce Court,Coffee Shop,Restaurant,Café,Italian Restaurant,Hotel,Gym,American Restaurant,Japanese Restaurant,Seafood Restaurant,Deli / Bodega


In [40]:
neighborhoods_venues_sorted.to_csv("output2.csv")

##### 4. Cluster Neighborhoods using K-Means

In [41]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
#print(toronto_grouped_clustering)
#print(toronto_grouped)
# run k-means clustering
kmeans = KMeans(init = "k-means++", n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
labels = kmeans.labels_[0:66] 

In [42]:
toronto_merged = df_ungrp
print(toronto_merged.shape)
labels = np.append(labels,labels[0])
print(labels.shape)
# add clustering labels
toronto_merged['Cluster Labels'] = labels.tolist()

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

(65, 5)
(65,)


Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
0,M5A,Downtown Toronto,Harbourfront,43.64008,-79.38015,0,Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Chinese Restaurant,Bank,Brewery,Park,Pizza Place
1,M7A,Downtown Toronto,Queen's Park,43.659659,-79.39034,0,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Bubble Tea Shop,Vegetarian / Vegan Restaurant,Chinese Restaurant,Restaurant,French Restaurant,Japanese Restaurant
2,M5B,Downtown Toronto,Ryerson\n,43.658469,-79.378993,0,Coffee Shop,Clothing Store,Café,Italian Restaurant,Diner,Restaurant,Middle Eastern Restaurant,Burger Joint,Japanese Restaurant,Ramen Restaurant
3,M5B,Downtown Toronto,Garden District\n,43.6565,-79.377114,0,Restaurant,Clothing Store,Hotel,Coffee Shop,Theater,Lingerie Store,Japanese Restaurant,Sandwich Place,Tea Room,Bookstore
4,M5C,Downtown Toronto,St. James Town,43.669403,-79.372704,0,Coffee Shop,Pizza Place,Grocery Store,Market,Café,Caribbean Restaurant,Filipino Restaurant,Breakfast Spot,Bistro,Bike Rental / Bike Share
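
A caveat on the cell above: `kmeans.labels_` has one entry per row of `toronto_grouped` (64 neighborhoods, sorted by name), while `df_ungrp` has 65 rows in scrape order. Padding the labels to 65 and assigning them by position therefore likely attaches labels to the wrong neighborhoods. A safer approach is to match on the neighborhood name, sketched below in plain Python. `attach_labels` is a hypothetical helper and the labels are invented; in pandas the equivalent idea would be to store the labels alongside `toronto_grouped['Neighborhood']` and merge on `Neighborhood`.

```python
def attach_labels(neighborhood_rows, grouped_names, labels):
    """Append a cluster label to each row by neighborhood name, not by position.

    `neighborhood_rows` are tuples whose first element is the neighborhood name.
    Rows whose name has no label get None.
    """
    if len(grouped_names) != len(labels):
        raise ValueError('one label per grouped neighborhood is required')
    label_by_name = dict(zip(grouped_names, labels))
    return [row + (label_by_name.get(row[0]),) for row in neighborhood_rows]


# Names in alphabetical order, as a groupby returns them; labels are invented
grouped_names = ['Berczy Park', 'Harbourfront', "Queen's Park"]
labels = [2, 0, 1]

# Rows in scrape order, including a repeated neighborhood
rows = [
    ('Harbourfront', 43.64008, -79.38015),
    ("Queen's Park", 43.659659, -79.39034),
    ('Berczy Park', 43.647984, -79.375396),
    ('Harbourfront', 43.64008, -79.38015),
]
labelled = attach_labels(rows, grouped_names, labels)
for row in labelled:
    print(row)
```

With name-based matching, repeated neighborhoods receive the same label and no padding is needed.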


In [43]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

##### 5. Examine Clusters

**Cluster 1**

In [44]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
0,Downtown Toronto,0,Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Chinese Restaurant,Bank,Brewery,Park,Pizza Place
1,Downtown Toronto,0,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Bubble Tea Shop,Vegetarian / Vegan Restaurant,Chinese Restaurant,Restaurant,French Restaurant,Japanese Restaurant
2,Downtown Toronto,0,Coffee Shop,Clothing Store,Café,Italian Restaurant,Diner,Restaurant,Middle Eastern Restaurant,Burger Joint,Japanese Restaurant,Ramen Restaurant
3,Downtown Toronto,0,Restaurant,Clothing Store,Hotel,Coffee Shop,Theater,Lingerie Store,Japanese Restaurant,Sandwich Place,Tea Room,Bookstore
4,Downtown Toronto,0,Coffee Shop,Pizza Place,Grocery Store,Market,Café,Caribbean Restaurant,Filipino Restaurant,Breakfast Spot,Bistro,Bike Rental / Bike Share
5,East Toronto,0,Beach,Sandwich Place,Park,Salon / Barbershop,Tea Room,Bakery,Thai Restaurant,Japanese Restaurant,Pizza Place,Pub
6,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Japanese Restaurant,Italian Restaurant,Breakfast Spot,Gastropub,Hotel,Cocktail Bar,Bakery
7,Downtown Toronto,0,Korean Restaurant,Coffee Shop,Japanese Restaurant,Grocery Store,Café,Sandwich Place,Karaoke Bar,Cocktail Bar,Gift Shop,Indian Restaurant
8,Downtown Toronto,0,Café,Coffee Shop,Japanese Restaurant,Gym,Restaurant,American Restaurant,Gastropub,Cosmetics Shop,Seafood Restaurant,Deli / Bodega
9,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Gym,Seafood Restaurant,Gastropub,Japanese Restaurant,Hotel,Italian Restaurant,American Restaurant


**Cluster 2**

In [45]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
33,West Toronto,1,Convenience Store,Gym,Pizza Place,Pool,Pub,Sandwich Place,Pet Store,Food Truck,Tennis Court,Theme Park Ride / Attraction


**Cluster 3**

In [46]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
18,East Toronto,2,Coffee Shop,Pharmacy,Ice Cream Shop,Pizza Place,Bus Line,Fish & Chips Shop,Fried Chicken Joint,French Restaurant,Bank,Bakery
19,East Toronto,2,Vietnamese Restaurant,Chinese Restaurant,Bakery,Coffee Shop,Light Rail Station,Fast Food Restaurant,Café,Asian Restaurant,Trail,Fish Market
41,Downtown Toronto,2,Café,Park,Japanese Restaurant,Italian Restaurant,Bookstore,Bank,Bar,French Restaurant,Bubble Tea Shop,Museum
51,Central Toronto,2,Mexican Restaurant,Park,French Restaurant,Italian Restaurant,Pizza Place,Pub,BBQ Joint,Coffee Shop,Shoe Repair,American Restaurant


**Cluster 4**

In [47]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
38,West Toronto,3,Café,Restaurant,Gift Shop,Gourmet Shop,Bookstore,Sushi Restaurant,Gastropub,Gas Station,Sports Bar,Liquor Store


**Cluster 5**

In [48]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues,6th Popular Venues,7th Popular Venues,8th Popular Venues,9th Popular Venues,10th Popular Venues
58,Downtown Toronto,4,Pizza Place,Bakery,Dessert Shop,Yoga Studio,Café,Middle Eastern Restaurant,Furniture / Home Store,Coffee Shop,Spa,American Restaurant
