# Peer-graded Assignment: Capstone Project - The Battle of Neighborhoods

## Introduction

In this analysis we try to pinpoint the most tourist-oriented areas of Toronto by clustering coffee shop venues into centres of density. The resulting clusters identify potential points of interest, which can help validate whether a location is suitable when choosing a hotel.





## Imports & Data

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')


## Download and Explore Toronto Dataset

The dataset used here comes from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

It lists Toronto's postal codes together with their borough and neighborhood names.

In [2]:
#Obtain Postal Code, Borough, and Neighborhood information from Wikipedia
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header = 0)

#Obtain the first table
df_toronto = table[0]
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [3]:
df_toronto.rename(columns = {"Postal Code": "Postalcode", "Neighbourhood": "Neighborhood"}, inplace = True)

#Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df_toronto.drop(df_toronto[df_toronto.Borough == 'Not assigned'].index, inplace=True)
#df.head()

#Combine the neighborhoods that exist in one postal code
df_toronto = df_toronto.groupby(['Postalcode', 'Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()
#df.head()

#Change unassigned Neighborhood to its Borough's name
#(the row index 85 is hard-coded and depends on the current state of the Wikipedia table)
df_toronto.loc[85,'Neighborhood'] = "Queen's Park"

print (df_toronto.shape)

df_toronto.head()

(103, 3)


Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [4]:
#Create a dataframe of the latitude and longitudes of the Toronto Neighborhoods
latlong = pd.read_csv("http://cocl.us/Geospatial_data")
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [5]:
latlong.rename(columns = {"Postal Code": "Postalcode"}, inplace = True)
latlong.head()

Unnamed: 0,Postalcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [6]:
latlong.shape

(103, 3)

In [7]:
#Join the Lat and Long dataframe to Neighborhoods dataframe on the shared Postalcode column
#(set_index returns a new dataframe, so the earlier un-assigned calls had no effect and are dropped)
neighbor = pd.merge(df_toronto, latlong, on="Postalcode")
neighbor.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [8]:
print('Toronto has {} boroughs and {} neighborhoods.'.format(
        len(neighbor['Borough'].unique()),
        neighbor.shape[0]
    )
)

Toronto has 10 boroughs and 103 neighborhoods.


In [9]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="capstone_pro")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Canada are 43.6534817, -79.3839347.


In [10]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighbor['Latitude'], neighbor['Longitude'], neighbor['Borough'], neighbor['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [11]:
CLIENT_ID = '<your Foursquare client ID>' # your Foursquare ID
CLIENT_SECRET = '<your Foursquare client secret>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <your Foursquare client ID>
CLIENT_SECRET: <your Foursquare client secret>


In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print('Found {} venues in {} neighborhoods.'.format(nearby_venues.shape[0], len(venues_list)))
    
    return nearby_venues

In [14]:
toronto_venues = getNearbyVenues(names=neighbor['Neighborhood'],
                                   latitudes=neighbor['Latitude'],
                                   longitudes=neighbor['Longitude'])

Found 2129 venues in 103 neighborhoods.


In [15]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Sail Sushi,43.765951,-79.191275,Restaurant


In [22]:
toronto_coffee = toronto_venues.loc[(toronto_venues['Venue Category'] == 'Coffee Shop') | (toronto_venues['Venue Category'] == 'Café')]
toronto_coffee.shape

(273, 7)

In [29]:
# set number of clusters
kclusters = 15

toronto_coffee_cluster = toronto_coffee[['Venue Latitude','Venue Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=2).fit(toronto_coffee_cluster)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:10] 
kmeans.labels_

array([ 5,  5,  9,  9,  3,  3,  3,  3,  3,  3,  3,  3,  8,  8,  8,  8,  8,
       14, 14, 14,  8,  8, 12, 14, 14, 14, 14, 14, 14,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  6,  1,  1,  1,  1,  1,  1,  1, 13, 13,  6,  6,
        6,  6,  6, 10, 10, 10, 10, 10, 10, 10,  6,  6,  6,  0,  0,  6,  6,
        6,  6,  0, 10,  0, 10, 10,  0,  0, 10, 10,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10,  0,  0,  0,  0,  0,  0, 10, 10,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1, 13, 13, 13, 13, 13,
       10, 13, 13, 10, 13, 13, 10, 10, 10, 10, 10, 10, 10, 10,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0

In [30]:
# work on a copy of the filtered dataframe so that adding a column
# does not trigger pandas' SettingWithCopyWarning
toronto_coffee_merged = toronto_coffee.copy()

# add clustering labels
toronto_coffee_merged['Cluster Labels'] = kmeans.labels_

toronto_coffee_merged.head() # check the last column!


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster Labels
10,Woburn,43.770992,-79.216917,Starbucks,43.770037,-79.221156,Coffee Shop,5
11,Woburn,43.770992,-79.216917,Tim Hortons,43.770827,-79.223078,Coffee Shop,5
25,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029,Tim Hortons,43.726895,-79.266157,Coffee Shop,9
41,"Birch Cliff, Cliffside West",43.692657,-79.264848,The Birchcliff,43.691666,-79.264532,Café,9
84,"Steeles West, L'Amoreaux West",43.799525,-79.318389,Tim Hortons,43.799102,-79.318715,Coffee Shop,3


In [33]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_coffee_merged['Venue Latitude'], toronto_coffee_merged['Venue Longitude'], toronto_coffee_merged['Neighborhood'], toronto_coffee_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [38]:
toronto_coffee_merged_grp = toronto_coffee_merged.groupby('Cluster Labels').agg({'Venue Latitude':'mean','Venue Longitude':'mean','Venue':'count','Neighborhood':'-'.join}).sort_values(by='Venue', ascending=False)
toronto_coffee_merged_grp.head(5)

Unnamed: 0_level_0,Venue Latitude,Venue Longitude,Venue,Neighborhood
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,43.647621,-79.3796,123,"Regent Park, Harbourfront-Regent Park, Harbour..."
10,43.659308,-79.389663,47,Church and Wellesley-Church and Wellesley-Chur...
6,43.663027,-79.353859,23,"The Danforth West, Riverdale-The Danforth West..."
13,43.672061,-79.410682,16,"Summerhill West, Rathnelly, South Hill, Forest..."
1,43.716671,-79.401449,10,"North Toronto West, Lawrence Park-North Toront..."


The table shows the five clusters with the most coffee venues, taken here as the top 5 tourist areas, together with their mean coordinates and the neighborhoods they cover.
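Once k-means has converged, each centroid is the mean of the points assigned to it, so the mean coordinates per cluster should agree with `kmeans.cluster_centers_`. The following is a minimal sketch of that check. It uses synthetic coordinates rather than the Foursquare data, and the blob centres, blob sizes and `n_clusters=3` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic venue coordinates (hypothetical, NOT the Foursquare data):
# three well-separated blobs of different size around made-up centres.
rng = np.random.default_rng(0)
centres = np.array([[43.65, -79.38], [43.72, -79.40], [43.77, -79.22]])
sizes = [60, 25, 10]
points = np.vstack([c + rng.normal(scale=0.005, size=(n, 2))
                    for c, n in zip(centres, sizes)])
venues = pd.DataFrame(points, columns=['Venue Latitude', 'Venue Longitude'])

# Cluster the coordinates, then attach the labels as in the notebook.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2).fit(venues)
venues['Cluster Labels'] = kmeans.labels_

# Mean coordinates and venue count per cluster, densest cluster first.
grouped = venues.groupby('Cluster Labels')
summary = grouped[['Venue Latitude', 'Venue Longitude']].mean()
summary['Venue Count'] = grouped.size()
summary = summary.sort_values('Venue Count', ascending=False)

# Compare the per-cluster means (ordered by label) with the fitted centroids.
means_by_label = summary.sort_index()[['Venue Latitude', 'Venue Longitude']].to_numpy()
centres_match = np.allclose(means_by_label, kmeans.cluster_centers_)
print(summary)
print('Means match centroids:', centres_match)
```

If the check holds on the real data as well, `kmeans.cluster_centers_` could be read directly instead of recomputing the means; the `groupby` is still needed for the venue counts and neighborhood lists.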

## Results  

Following this method, we collected the locations of cafés and coffee shops identified by Foursquare and grouped them with k-means clustering. This let us pinpoint the neighborhoods that are likely to be most interesting for tourists, which gives a possible guideline for where to search for suitable hotel rooms.
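One way the cluster centres could guide a hotel search is to measure how far a candidate hotel is from each centre. The sketch below uses the five mean coordinates from the table above and the haversine great-circle formula. The hotel coordinates and the names `haversine_km`, `cluster_centres` and `hotel` are hypothetical and only serve as an illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Mean coordinates of the five largest clusters, copied from the table above
# (cluster label -> (latitude, longitude)).
cluster_centres = {
    0: (43.647621, -79.3796),
    10: (43.659308, -79.389663),
    6: (43.663027, -79.353859),
    13: (43.672061, -79.410682),
    1: (43.716671, -79.401449),
}

# Hypothetical candidate hotel location (an assumption, not from the data).
hotel = (43.6450, -79.3800)

# Distance from the hotel to every cluster centre, then the closest cluster.
distances = {label: haversine_km(hotel[0], hotel[1], lat, lon)
             for label, (lat, lon) in cluster_centres.items()}
nearest = min(distances, key=distances.get)

for label, km in sorted(distances.items(), key=lambda item: item[1]):
    print('Cluster {}: {:.2f} km'.format(label, km))
print('Nearest cluster:', nearest)
```

Ranking candidate hotels by their distance to the densest clusters is only one possible criterion; price and availability are not covered by this data.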