# Assignment_Week3: Segmenting and Clustering neighborhoods in Toronto

This notebook is dedicated to the capstone project of the IBM Data Science Professional Certificate. As part of the capstone, using a hypothetical research question, learners have to apply their knowledge and skills learnt from the courses to solve the real problem.

Introduction: Vietnamese cuisine is not strange to many people all over the world. Vietnam is famous for its diverse and tasty dishes that can satisfy mostly everyone. Globally Vietnamse cuisine is famous for its signature dishes namely Pho, the noodle soup with beef broth, beef, and many other herbs. When it comes to Vietnamese dishes, people think about Pho and many other signature dishes. In order to introduce Vietnamese cuisine to friends all over the world, it is necessary to represent our traditions by having some restaurants in cities globally. We can see now there are many Vietnamese restaurants in many big cities such as New York, Vancouver, etc. Recently, there have been investors, entrepreneurs, and chefs interested in opening a restaurant in such cities. This project will provide information of where to set up restaurants in Toronto, one of the biggest cities in Canada. This project will help to answer a question:"Where to open a Vietnamese restaurant in Toronto area?" Therefore, the objective of this project is to find the most suitable location to open a new Vietnamese Restaurant in Toronto, Canada.

Data: Regarding data, we need information of neighborhoods in Toronto, Canada as well as geographic information in terms of latitude and Longitude of these neighborhoods. We will also need information regarding venee data related to Vietnamese restaurants present in Toronto already.

Methodology
• Data will be collected from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and cleaned and processed into a dataframe.
• FourSquare be used to locate all venues and then filtered by Vietnamese restaurants. 
• Finally, the data be will be visually assessed using graphing from various Python libraries.

Step 1

Start by creating a new Notebook for this assignment.
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

In [3]:
!pip install BeautifulSoup4
!pip install requests



In [11]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [12]:
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(source, 'lxml')


table = soup.find("table")
table_rows = table.tbody.find_all("tr")

In [13]:
data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    data.append(row)

In [14]:
df = pd.DataFrame(data, columns = ["Postal Code", "Borough", "Neighborhood"])
df = df[~df['Postal Code'].isnull()]
df

Unnamed: 0,Postal Code,Borough,Neighborhood
1,M1A\n,Not assigned\n,Not assigned\n
2,M2A\n,Not assigned\n,Not assigned\n
3,M3A\n,North York\n,Parkwoods\n
4,M4A\n,North York\n,Victoria Village\n
5,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
...,...,...,...
176,M5Z\n,Not assigned\n,Not assigned\n
177,M6Z\n,Not assigned\n,Not assigned\n
178,M7Z\n,Not assigned\n,Not assigned\n
179,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."


In [15]:
#Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned

df.drop(df[df['Borough']=="Not assigned\n"].index,axis=0, inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
3,M3A\n,North York\n,Parkwoods\n
4,M4A\n,North York\n,Victoria Village\n
5,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
6,M6A\n,North York\n,"Lawrence Manor, Lawrence Heights\n"
7,M7A\n,Downtown Toronto\n,"Queen's Park, Ontario Provincial Government\n"


In [16]:
df["Postal Code"] = df["Postal Code"].str.replace("\n","")
df["Borough"] = df["Borough"].str.replace("\n","")
df["Neighborhood"] = df["Neighborhood"].str.replace("\n","")

In [17]:
# group more than one neighborhood in one postal code area, seperate with a comma
df = df.groupby(["Postal Code", "Borough"])["Neighborhood"].apply(", ".join).reset_index()

In [18]:
df.Neighborhood.str.count("Not assigned").sum()

0

In [19]:
print("Shape: ", df.shape)

Shape:  (103, 3)


In [20]:
pip install geocoder

Note: you may need to restart the kernel to use updated packages.


In [21]:
pip install folium

Note: you may need to restart the kernel to use updated packages.


In [22]:
import folium

In [23]:
import numpy as np
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize  # tranform JSON file into a pandas dataframe

import folium # map rendering library

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [24]:

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_bdbfafe8c47b4772b0aef3fb2a2056cd = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='om7WEZtW8n8df7CTNOkVwk_iHfb60MPAQQNBdnIFm-WA',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_bdbfafe8c47b4772b0aef3fb2a2056cd.get_object(Bucket='capstoneprojectcourse-donotdelete-pr-lxhgnnueu8ziwq',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_2 = pd.read_csv(body)
df_data_2.head()


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [25]:
df_combined = df.join(df_data_2.set_index('Postal Code'), on='Postal Code', how='inner')
df_combined

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [26]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto city are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto city are 43.6534817, -79.3839347.


In [27]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
for latitude, longitude, borough, neighborhood in zip(df_combined['Latitude'], df_combined['Longitude'], df_combined['Borough'], df_combined['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False,
        fill=True
        ).add_to(map_toronto)  
    
map_toronto

Map of a part of Toronto City: work with only boroughs that contain the word Toronto

In [38]:
toronto_data = df_combined[df_combined['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [29]:
map_toronto_2 = folium.Map(location=[latitude, longitude], zoom_start=12)
for lat, lng, borough, neighborhood in zip(
        df_combined['Latitude'], 
        df_combined['Longitude'], 
        df_combined['Borough'], 
        df_combined['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_2)  

map_toronto_2

In [30]:
CLIENT_ID = 'MK00CU2GDEXJNJOCZFA3XVFRVNNLUO0JE51EIXEMDJMEWAUR' # your Foursquare ID
CLIENT_SECRET = 'VBIG3IZCHESXLYIF4QZ0QFCQSS0SXFRETW40DOFZCUKRJENI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: MK00CU2GDEXJNJOCZFA3XVFRVNNLUO0JE51EIXEMDJMEWAUR
CLIENT_SECRET:VBIG3IZCHESXLYIF4QZ0QFCQSS0SXFRETW40DOFZCUKRJENI


In [31]:
neighborhood_name = df_combined.loc[0, 'Neighborhood']
print(f"The first neighborhood's name is '{neighborhood_name}'.")

The first neighborhood's name is 'Malvern, Rouge'.


In [32]:
neighborhood_latitude = df_combined.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_combined.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Malvern, Rouge are 43.806686299999996, -79.19435340000001.


In [33]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
     CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# get the result to a json file
results = requests.get(url).json()

In [34]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [35]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  app.launch_new_instance()


Unnamed: 0,name,categories,lat,lng
0,Wendy’s,Fast Food Restaurant,43.807448,-79.199056


In [36]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

1 venues were returned by Foursquare.


Explore neighborhoods in a part of Toronto City

In [37]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [39]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

In [40]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In [41]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,55,55,55,55,55,55
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,68,68,68,68,68,68
Christie,16,16,16,16,16,16
Church and Wellesley,75,75,75,75,75,75
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,33,33,33,33,33,33
Davisville North,9,9,9,9,9,9


In [42]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 236 uniques categories.


Analyze Each Neighborhood

transform collected information using the one-hot encoding method

In [43]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
toronto_onehot.shape

(1624, 236)

Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [45]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.014706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.014706,0.0,0.0,0.014706,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.026667,0.013333,0.0,0.0,0.0,0.0,0.0,0.0,0.013333,...,0.013333,0.013333,0.0,0.0,0.0,0.0,0.0,0.013333,0.0,0.0
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


 Exploring Toronto Restaurant

Collecting Restaurants
Let's explore the first neighborhood in our dataframe.

Capstone

In [46]:

len(toronto_grouped [toronto_grouped ["Vietnamese Restaurant"] > 0])

5

In [52]:
toronto_vietnam= toronto_grouped[['Neighborhood',"Vietnamese Restaurant"]]

In [53]:
toronto_vietnam.head()

Unnamed: 0,Neighborhood,Vietnamese Restaurant
0,Berczy Park,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0
2,"Business reply mail Processing Centre, South C...",0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0
4,Central Bay Street,0.0


Cluster Neighborhoods

In [55]:
# set number of clusters
toclusters = 3

to_clustering = toronto_vietnam.drop(["Neighborhood"], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=toclusters, random_state=0).fit(to_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [56]:
# create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
to_merged = toronto_vietnam.copy()

# add clustering labels
to_merged["Cluster Labels"] = kmeans.labels_

In [57]:

to_merged.rename(columns={"Neighborhood": "Neighborhood"}, inplace=True)
to_merged.head()

Unnamed: 0,Neighborhood,Vietnamese Restaurant,Cluster Labels
0,Berczy Park,0.0,0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0
2,"Business reply mail Processing Centre, South C...",0.0,0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0
4,Central Bay Street,0.0,0


In [58]:
# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
to_merged = to_merged.join(toronto_venues.set_index("Neighborhood"), on="Neighborhood")

print(to_merged.shape)
to_merged.head()

(1624, 9)


Unnamed: 0,Neighborhood,Vietnamese Restaurant,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Berczy Park,0.0,0,43.644771,-79.373306,LCBO,43.642944,-79.37244,Liquor Store
0,Berczy Park,0.0,0,43.644771,-79.373306,The Keg Steakhouse + Bar - Esplanade,43.646712,-79.374768,Restaurant
0,Berczy Park,0.0,0,43.644771,-79.373306,Fresh On Front,43.647815,-79.374453,Vegetarian / Vegan Restaurant
0,Berczy Park,0.0,0,43.644771,-79.373306,Meridian Hall,43.646292,-79.376022,Concert Hall
0,Berczy Park,0.0,0,43.644771,-79.373306,Goose Island Brewhouse,43.647329,-79.373541,Beer Bar


In [59]:
# sort the results by Cluster Labels
print(to_merged.shape)
to_merged.sort_values(["Cluster Labels"], inplace=True)
to_merged

(1624, 9)


Unnamed: 0,Neighborhood,Vietnamese Restaurant,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Berczy Park,0.000000,0,43.644771,-79.373306,LCBO,43.642944,-79.372440,Liquor Store
28,"Runnymede, Swansea",0.000000,0,43.651571,-79.484450,Book City (Bloor West),43.650211,-79.481220,Bookstore
28,"Runnymede, Swansea",0.000000,0,43.651571,-79.484450,Campo,43.655191,-79.487067,Italian Restaurant
28,"Runnymede, Swansea",0.000000,0,43.651571,-79.484450,Asa Sushi,43.649902,-79.484611,Sushi Restaurant
28,"Runnymede, Swansea",0.000000,0,43.651571,-79.484450,Fat Bastard Burrito,43.649779,-79.482894,Burrito Place
...,...,...,...,...,...,...,...,...,...
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,RBC Royal Bank,43.688058,-79.394478,Bank
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Sprout,43.687996,-79.394651,Vietnamese Restaurant
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,TTC Stop #,43.685826,-79.404981,Light Rail Station
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Tim Hortons,43.687682,-79.396840,Coffee Shop


In [60]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [62]:

# create map
map_clusters = folium.Map(location=[lat, lng], zoom_start=11)

# set color scheme for the clusters
x = np.arange(toclusters)
ys = [i+x+(i*x)**2 for i in range(toclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(to_merged['Neighborhood Latitude'], to_merged['Neighborhood Longitude'], to_merged['Neighborhood'], to_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [63]:

# save the map as HTML file
map_clusters.save('map_clusters.html')

In [64]:

#Cluster 0
to_merged.loc[to_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Vietnamese Restaurant,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Berczy Park,0.00,0,43.644771,-79.373306,LCBO,43.642944,-79.372440,Liquor Store
28,"Runnymede, Swansea",0.00,0,43.651571,-79.484450,Book City (Bloor West),43.650211,-79.481220,Bookstore
28,"Runnymede, Swansea",0.00,0,43.651571,-79.484450,Campo,43.655191,-79.487067,Italian Restaurant
28,"Runnymede, Swansea",0.00,0,43.651571,-79.484450,Asa Sushi,43.649902,-79.484611,Sushi Restaurant
28,"Runnymede, Swansea",0.00,0,43.651571,-79.484450,Fat Bastard Burrito,43.649779,-79.482894,Burrito Place
...,...,...,...,...,...,...,...,...,...
13,"Garden District, Ryerson",0.01,0,43.657162,-79.378937,Hokkaido Ramen Santouka らーめん山頭火,43.656435,-79.377586,Ramen Restaurant
13,"Garden District, Ryerson",0.01,0,43.657162,-79.378937,Jazz Bistro,43.655678,-79.379276,Music Venue
13,"Garden District, Ryerson",0.01,0,43.657162,-79.378937,Burrito Boyz,43.656265,-79.378343,Burrito Place
13,"Garden District, Ryerson",0.01,0,43.657162,-79.378937,Ed Mirvish Theatre,43.655102,-79.379768,Theater


In [65]:
#Cluster 1
to_merged.loc[to_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Vietnamese Restaurant,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
17,"Kensington Market, Chinatown, Grange Park",0.040541,1,43.653206,-79.400049,El Trompo,43.655832,-79.402561,Mexican Restaurant
17,"Kensington Market, Chinatown, Grange Park",0.040541,1,43.653206,-79.400049,The Burgernator,43.655642,-79.402440,Burger Joint
17,"Kensington Market, Chinatown, Grange Park",0.040541,1,43.653206,-79.400049,Poetry Jazz Cafe,43.654975,-79.402371,Jazz Club
17,"Kensington Market, Chinatown, Grange Park",0.040541,1,43.653206,-79.400049,Ozzy's Burger,43.655191,-79.402610,Burger Joint
17,"Kensington Market, Chinatown, Grange Park",0.040541,1,43.653206,-79.400049,Ronnie's Local 069,43.655104,-79.402675,Bar
...,...,...,...,...,...,...,...,...,...
19,"Little Portugal, Trinity",0.044444,1,43.647927,-79.419750,Pizzeria Libretto,43.648979,-79.420604,Pizza Place
19,"Little Portugal, Trinity",0.044444,1,43.647927,-79.419750,Bellwoods Brewery,43.647097,-79.419955,Brewery
17,"Kensington Market, Chinatown, Grange Park",0.040541,1,43.653206,-79.400049,House of Jaffle,43.652053,-79.404867,Snack Place
17,"Kensington Market, Chinatown, Grange Park",0.040541,1,43.653206,-79.400049,The Supermarket,43.656680,-79.402954,Bar


In [66]:
#Cluster 2
to_merged.loc[to_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Vietnamese Restaurant,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,LCBO,43.686991,-79.399238,Liquor Store
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,The Market By Longo’s,43.686711,-79.399536,Supermarket
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Union Social Eatery,43.687895,-79.394916,American Restaurant
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Daeco Sushi,43.687838,-79.395652,Sushi Restaurant
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Mary Be Kitchen,43.687708,-79.395062,Restaurant
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Starbucks,43.686756,-79.398292,Coffee Shop
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Popeyes Louisiana Kitchen,43.6893,-79.395302,Fried Chicken Joint
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Fionn MacCool's,43.687921,-79.394783,Pub
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,Pizzaiolo,43.687991,-79.394634,Pizza Place
33,"Summerhill West, Rathnelly, South Hill, Forest...",0.071429,2,43.686412,-79.400049,RBC Royal Bank,43.688058,-79.394478,Bank


Results/Recommendation

Most of Vietnamese restaurants are in Cluster 2 which is around Summerhill West, Rathnelly, South Hill. However it is fine to consider Cluster 1 areas which included Kensington Market, Chinatown, Grange Park. These locations should be considered to open a Vietnamese restaurants. It is noted that there are not so many Vietnamese restaurants in the Toronto areas, thus it will not be a big deal to seriously consider the best place. However, analyses from this project will help to visualize the reality of the distribution of Vietnamese restaurants in Toronto. Those who consider to open a Vietnamese restaurant should consider the findings of this project as a guide to start with. It is understood that in order to open a restaurant, there will be so many indicators to take into account. However, this analysis is helpful because it creates a general idea of what to start.

Limitations and Suggestions for Future Research

This project only consider the geospatial distribution of the Vietnamese restaurants. There are many factors such as the nearby restaurants and other competitives, renting cost, etc. If we can access data to work on these factors, the complete picture will be drawn.

Conclusion

This practical project is helpful to solve the real world problem which utilizes the knowledge from data science to step-by-step working with real data from identifying the business problem, using the appropriate methods, performing the machine learning by utilizing k-means clustering and providing recommendation to the stakeholder.