# Capstone Project - The Battle of Neighborhoods

### (Week 4-1)
## The Business Problem:

Section Description: Introduction where you discuss the business problem and who would be interested in this project.

The purpose of this study is to help a Chinese investor find a neighbourhood in which to open a bubble tea shop. The investor is based in Toronto and personally enjoys bubble tea, so he decided to open his own franchise.

### Criteria:

* Coffee shops and other bubble tea competitors should be present in the area, but not in large numbers
* Is near a gym or a school
* Is near restaurants (including fast food)
* Is near stores with heavy foot traffic

## The Data:

Section Description: Data where you describe the data that will be used to solve the problem and the source of the data.

The data consists of a list of Toronto postal codes, boroughs and neighbourhoods scraped from Wikipedia using BeautifulSoup, combined with the latitude and longitude of each postal code. Canadian postal codes that start with M are the ones for the City of Toronto in Ontario.
This is the link to the table on Wikipedia: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

This dataset is combined with data on nearby venues and amenities pulled using the FourSquare API.
The FourSquare query looks for venues within a radius of 500 meters of each neighbourhood's coordinates, which is about a 5 minute walk. Walking distance is an important consideration, especially in downtown Toronto.
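As a rough check of the "500 meters is about a 5 minute walk" figure, the sketch below computes a great-circle (haversine) distance and a walking time. The walking speed of 1.4 m/s is an assumption, not something taken from the FourSquare documentation; the two coordinates are The Beaches and one of its venues as they appear in the tables later in this notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def walking_minutes(distance_m, speed_m_per_s=1.4):
    """Walking time in minutes; the default speed is an assumed typical pace."""
    return distance_m / speed_m_per_s / 60


# The Beaches centre point and the Glen Manor Ravine venue
d = haversine_m(43.676357, -79.293031, 43.676821, -79.293942)
print('venue distance (m):', round(d))
print('minutes to walk 500 m:', round(walking_minutes(500), 1))
```

At the assumed pace, 500 m works out to somewhere between 5 and 6 minutes, so the radius is a reasonable proxy for a short walk.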

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

## Methodology:

Section description: Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and what machine learning methods were used and why.

As mentioned in the Data section, the venue data and the Toronto borough data are combined into one dataset. The combined dataset is then reduced to the boroughs containing "Toronto" in their name, and the remaining neighbourhoods are narrowed down to those that match the required criteria and then ranked.

The criteria are set to filter for high-traffic areas with interests that are relevant to bubble tea drinkers. For instance, a bubble tea shop is usually a go-to hangout spot for students, and a treat for people who have just been to the gym. Additionally, bubble tea is often regarded as a dessert stop after a meal, or as an alternative to a soft drink for those getting fast food.

The boroughs are filtered to those containing the word Toronto to narrow the study down to mostly the core downtown areas, for simplicity. Clustering is then used to group the neighbourhoods by which relevant venue groups are the most frequent, which helps narrow the choice down to a particular neighbourhood in which to open a bubble tea shop.


In [1]:
#Before we get the data and start exploring it, let's download all the dependencies that we will need.

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed (e.g. from the Foursquare API lab)
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0         conda-forge
    geopy:           1.21.0-py_0       conda-forge

The following packages will be UPDATED:

    ca-

In [2]:
#webscrape data from wikipedia
from bs4 import BeautifulSoup

#Access url and needed table
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table = soup.find('table',{'class':'wikitable sortable'})

#Putting data into dataframe
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])

#Delete empty rows
df = df[~df['Borough'].isnull()]

#Dropping rows without assigned Boroughs
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)
df.reset_index(drop=True, inplace=True)
df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()

#Set unassigned neighbourhoods to be the same as Borough (row-wise, via a boolean mask)
unassigned = df['Neighborhood'] == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [3]:
df.shape

(103, 3)
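The cleaning steps above can be tried in isolation on a few rows. This is a minimal sketch, assuming a frame with the same three columns; the sample rows are made up for illustration and are not taken from the Wikipedia table.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table
raw = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# 1. drop postal codes that have no borough
clean = raw[raw['Borough'] != 'Not assigned']

# 2. one row per postal code, neighbourhoods joined with commas
clean = (
    clean.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(','.join)
    .reset_index()
)

# 3. a neighbourhood that is still unassigned falls back to its borough name
mask = clean['Neighborhood'] == 'Not assigned'
clean.loc[mask, 'Neighborhood'] = clean.loc[mask, 'Borough']

print(clean)
```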

Adding Geospatial Longitude and Latitude Coordinates

In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.
Here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:

In [4]:
#Using the CSV file from http://cocl.us/Geospatial_data to get coordinate data
!wget -q -O 'Toronto_long_lat_data.csv'  http://cocl.us/Geospatial_data
df_lon_lat = pd.read_csv('Toronto_long_lat_data.csv')
df_lon_lat.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [5]:
#Changing df column name to merge
df_lon_lat.columns=['PostalCode','Latitude','Longitude']

#Merging dfs
dfm = pd.merge(df,
                 df_lon_lat[['PostalCode','Latitude', 'Longitude']],
                 on='PostalCode')
dfm.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [6]:
dfm.shape

(103, 5)
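The row count staying at 103 suggests every postal code found its coordinates. One way to make that check explicit is to let `pd.merge` validate the key relationship and flag unmatched rows; the sketch below uses the `validate` and `indicator` parameters on a small frame in which the postal code `M9Z` and its borough are made up.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
right = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print('postal codes without coordinates:', missing)
```

An inner merge, as used in this notebook, would silently drop such rows, so comparing shapes before and after is the equivalent manual check.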

Drop all rows that do not contain Toronto in the Borough name.

In [7]:
dfm = dfm[dfm['Borough'].str.contains('Toronto')].reset_index(drop=True)
dfm

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049


Clustering Neighborhoods in Toronto

Find spatial center of latitudes and longitudes from all boroughs with Toronto in their names as the starting point of the map.

In [8]:
latitude = (dfm['Latitude'].max() + dfm['Latitude'].min())/2
longitude = (dfm['Longitude'].max() + dfm['Longitude'].min())/2
latitude, longitude

(43.6784836, -79.38874055)

Create a map to visualize the neighbourhoods.

In [9]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(dfm['Latitude'], dfm['Longitude'], dfm['Borough'], dfm['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Use the Foursquare API to explore the neighborhoods and segment them.

In [10]:
CLIENT_ID = 'KQK00XKUP120U0HFVI0GM5UITEBRKZRUKB45FETLPMFP3VTQ' # your Foursquare ID
CLIENT_SECRET = 'XQRHIT4CAIDCXM3GR5V3GPLA0ZQHHHJ2BIX4YNP05FKQWYJU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KQK00XKUP120U0HFVI0GM5UITEBRKZRUKB45FETLPMFP3VTQ
CLIENT_SECRET:XQRHIT4CAIDCXM3GR5V3GPLA0ZQHHHJ2BIX4YNP05FKQWYJU
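Because a notebook is often shared, it can be safer to keep the credentials out of the code and the printed output. The sketch below reads them from environment variables and prints only a masked form; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions for illustration, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping such as os.environ.

    The two variable names are hypothetical; pick whatever names suit your setup.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set {} before running the notebook'.format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id-1234', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret-5678'}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID:', mask(client_id))
print('CLIENT_SECRET:', mask(client_secret))
```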


Get the neighborhood's name.

In [11]:
dfm.loc[0, 'Neighborhood']

'The Beaches'

Get, store, and print latitude, longitude, and name of neighborhood

In [12]:
neighborhood_latitude = dfm.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dfm.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = dfm.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


Find the top 100 venues within 500 meter radius from first neighborhood using Foursquare and create the get request URL.

In [13]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=KQK00XKUP120U0HFVI0GM5UITEBRKZRUKB45FETLPMFP3VTQ&client_secret=XQRHIT4CAIDCXM3GR5V3GPLA0ZQHHHJ2BIX4YNP05FKQWYJU&v=20180605&ll=43.67635739999999,-79.2930312&radius=500&limit=100'
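The same request URL can also be assembled from a dictionary of parameters, which avoids counting `{}` placeholders. This is a sketch using only the standard library; `build_explore_url` is a hypothetical helper and the credentials shown are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL with the same parameters used in this notebook."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes the comma in 'll' as %2C, which decodes to the same value
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 43.6763574, -79.2930312)
print(url)
```

With `requests`, passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=params)` should have the same effect without building the string by hand.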

Send the GET request and examine the results.

In [14]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e619e5814a126001b887141'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'labe

Let's reuse the get_category_type function from the Foursquare lab.

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a pandas dataframe

In [16]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Upper Beaches,Neighborhood,43.680563,-79.292869


In [17]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


Repeat above steps for all neighborhoods in Toronto.

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
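One fragile spot in the function above is `v['venue']['categories'][0]['name']`, which raises an `IndexError` if a venue comes back with an empty category list. The sketch below shows one defensive way to flatten a single item; `parse_venue` is a hypothetical helper, the first sample item follows the response shown earlier, and the second is made up.

```python
def parse_venue(item):
    """Flatten one entry of response['groups'][0]['items'] into a tuple.

    Returns None for the category when the venue has no categories.
    """
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (venue['name'], venue['location']['lat'], venue['location']['lng'], category)


sample_items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.67682094413784, 'lng': -79.29394208780985},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Uncategorised place (made up)',
               'location': {'lat': 0.0, 'lng': 0.0},
               'categories': []}},
]

rows = [parse_venue(item) for item in sample_items]
for row in rows:
    print(row)
```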

Find all the venues in Toronto

In [19]:
toronto_venues = getNearbyVenues(names=dfm['Neighborhood'],
                                   latitudes=dfm['Latitude'],
                                   longitudes=dfm['Longitude']
                                  )

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Brockton,Exhibition Place,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvalles
Runnymede

Check the size of the resulting dataframe and preview it

In [20]:
print(toronto_venues.shape)
toronto_venues.head()

(1714, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West,Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Check how many venues were returned for each neighborhood

In [21]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,56,56,56,56,56,56
"Brockton,Exhibition Place,Parkdale Village",25,25,25,25,25,25
Business Reply Mail Processing Centre 969 Eastern,18,18,18,18,18,18
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",16,16,16,16,16,16
"Cabbagetown,St. James Town",46,46,46,46,46,46
Central Bay Street,78,78,78,78,78,78
"Chinatown,Grange Park,Kensington Market",85,85,85,85,85,85
Christie,18,18,18,18,18,18
Church and Wellesley,86,86,86,86,86,86


In [22]:
toronto_venues.groupby('Neighborhood')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f210a6f9f28>

In [23]:
#Find how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 233 unique categories.


Analyzing each neighbourhood.

In [24]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
#examine the new dataframe size
toronto_onehot.shape

(1714, 233)

In [26]:
#group rows by neighborhood and take the mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.012821,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,...,0.0,0.0,0.0,0.012821,0.0,0.0,0.012821,0.0,0.0,0.0
7,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.035294,0.0,0.058824,0.011765,0.0,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.011628,0.011628,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,...,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,0.011628,0.011628,0.0


In [27]:
#confirm new size
toronto_grouped.shape

(39, 233)
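The one-hot step has a subtlety here: Foursquare returns a venue category that is itself called 'Neighborhood' (the 'Upper Beaches' row earlier has it), so assigning `toronto_onehot['Neighborhood']` overwrites that category column instead of appending a new one. That would explain why the frame has 233 columns rather than 234, and why 'Yoga Studio' is the column that gets moved to the front. A minimal sketch on made-up rows of one way to avoid the clash; the renamed label `'Neighborhood (category)'` is an assumption.

```python
import pandas as pd

# Made-up venues; note the venue category literally named 'Neighborhood'
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Trail', 'Neighborhood', 'Pub'],
})

# one-hot encode the categories as floats so the group mean is a frequency
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='', dtype=float)

# rename the clashing category before adding the real neighbourhood column
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# mean of the indicator columns = share of each category per neighbourhood
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```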

In [28]:
# save the grouped frequencies to CSV (to_csv returns None when given a path, so there is nothing to assign)
toronto_grouped.to_csv('example.csv')
print(toronto_grouped)

                                         Neighborhood  Yoga Studio  \
0                              Adelaide,King,Richmond     0.000000   
1                                         Berczy Park     0.000000   
2          Brockton,Exhibition Place,Parkdale Village     0.040000   
3   Business Reply Mail Processing Centre 969 Eastern     0.000000   
4   CN Tower,Bathurst Quay,Island airport,Harbourf...     0.000000   
5                          Cabbagetown,St. James Town     0.000000   
6                                  Central Bay Street     0.012821   
7             Chinatown,Grange Park,Kensington Market     0.000000   
8                                            Christie     0.000000   
9                                Church and Wellesley     0.011628   
10                      Commerce Court,Victoria Hotel     0.000000   
11                                         Davisville     0.000000   
12                                   Davisville North     0.000000   
13  Deer Park,Forest

Here we begin filtering for just the categories that are relevant.


Set up the filter groups based on terms that the category names contain, i.e. any sort of Gym or Studio belongs to the gym group, anything that contains the word Shop or Store is a store, and any venue category that contains Restaurant is a food spot.

In [None]:
to_regroup_gym1= pd.DataFrame(toronto_grouped.filter(like='Gym'))
to_regroup_gym2= pd.DataFrame(toronto_grouped.filter(like='Studio'))
to_regroup_food = pd.DataFrame(toronto_grouped.filter(like='Restaurant'))
to_regroup_store = pd.DataFrame(toronto_grouped.filter(like='Store'))
to_regroup_shop= pd.DataFrame(toronto_grouped.filter(like='Shop'))
to_regroup_bbt = pd.DataFrame(toronto_grouped.filter(like='Bubble Tea'))
to_regroup_cafe = pd.DataFrame(toronto_grouped.filter(like='Café'))
to_regroup_coffee = pd.DataFrame(toronto_grouped.filter(like='Coffee'))

Sum up the mean frequencies within each group and put each total into its own column in a new dataframe.

In [None]:
# combine the column subsets side by side (axis=1) so each neighbourhood stays on one row
to_regroup_gym = pd.concat([to_regroup_gym1, to_regroup_gym2], axis=1)
to_regroup_shopping = pd.concat([to_regroup_store, to_regroup_shop], axis=1)
to_regroup_coffee = pd.concat([to_regroup_cafe, to_regroup_coffee], axis=1)

headings = ['Neighborhood', 'Gym', 'Food', 'Shopping', 'BBT', 'Coffees']
to_regroup = pd.DataFrame(columns = headings) #creates a new dataframe

In [None]:
to_regroup_gym.loc[:,'Gyms'] = to_regroup_gym.sum(axis=1)
to_regroup_food.loc[:,'Foods'] = to_regroup_food.sum(axis=1)
to_regroup_shopping.loc[:,'BuyAllThings'] = to_regroup_shopping.sum(axis=1)
to_regroup_bbt.loc[:,'BBT'] = to_regroup_bbt.sum(axis=1)
to_regroup_coffee.loc[:,'Caffine'] = to_regroup_coffee.sum(axis=1)

In [None]:
to_regroup['Neighborhood'] = toronto_grouped['Neighborhood']
to_regroup['Gym'] = to_regroup_gym['Gyms']
to_regroup['Food'] = to_regroup_food['Foods']
to_regroup['Shopping'] = to_regroup_shopping['BuyAllThings']
to_regroup['BBT'] = to_regroup_bbt['BBT']
to_regroup['Coffees'] = to_regroup_coffee['Caffine']

to_regroup
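The regrouping above can be expressed more compactly with one regular expression per group. The sketch below is an alternative formulation, not the notebook's own code; the frequencies are made up and `GROUP_PATTERNS` is a hypothetical name, though the patterns mirror the filter terms used above.

```python
import pandas as pd

# Made-up mean frequencies for two neighbourhoods
grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Gym': [0.10, 0.00],
    'Yoga Studio': [0.05, 0.02],
    'Thai Restaurant': [0.20, 0.30],
    'Bubble Tea Shop': [0.00, 0.04],
    'Coffee Shop': [0.10, 0.08],
    'Café': [0.05, 0.00],
})

# group name -> regex matched against the category column names
GROUP_PATTERNS = {
    'Gym': 'Gym|Studio',
    'Food': 'Restaurant',
    'Shopping': 'Store|Shop',
    'BBT': 'Bubble Tea',
    'Coffees': 'Café|Coffee',
}

features = grouped.drop(columns='Neighborhood')
regrouped = pd.DataFrame({'Neighborhood': grouped['Neighborhood']})
for group, pattern in GROUP_PATTERNS.items():
    # sum every matching category column, row by row
    regrouped[group] = features.filter(regex=pattern).sum(axis=1)

print(regrouped)
```

Note that the 'Shop' pattern also matches 'Coffee Shop' and 'Bubble Tea Shop', so those venues count towards Shopping as well as their own group. The same overlap applies to the `filter(like='Shop')` call in the notebook.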

Print each neighborhood along with each category by frequency ranked from most to least.

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = to_regroup[to_regroup['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Define a helper that sorts a neighbourhood's venue groups in descending order of frequency.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the 5 venue groups for each neighborhood, ranked from most to least common.

In [None]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = to_regroup['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(to_regroup.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
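The `indicators` list with a fallback to 'th' works for the five columns used here, but the same pattern would label a 21st column '21th'. A small helper makes the suffix rule explicit; `ordinal` is a hypothetical name used only for this sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 5
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```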

Run k-means to group the neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = to_regroup.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
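The choice of 5 clusters is not justified in this notebook. One common heuristic is to compare the k-means inertia (within-cluster sum of squares) across several values of k and look for the point where the improvement levels off. The sketch below runs that loop on random stand-in data of the same shape as the regrouped features (39 rows, 5 columns); it is not the notebook's data and says nothing about which k is right here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(39, 5)  # stand-in for 39 neighbourhoods x 5 venue-group frequencies

inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 3))
```

On the real data, `X` would be `toronto_grouped_clustering`.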

Create a new dataframe that includes the cluster as well as the ranked venue category groupings for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_

toronto_merged = dfm

# join the ranked venue groups and cluster labels onto dfm, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Visualize the resulting clusters.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examine each cluster and determine what distinguishes it from the others. Based on the defining categories, we can narrow down which clusters to consider against the criteria, and then which neighbourhood within them to suggest to the client.

Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
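Besides reading the five tables one by one, the clusters can be compared side by side by averaging the venue-group frequencies within each cluster. This is a sketch on made-up numbers; in the notebook the input would be `to_regroup` with the `kmeans.labels_` column attached.

```python
import pandas as pd

# Made-up frequencies for four neighbourhoods in two clusters
profile_input = pd.DataFrame({
    'Cluster Labels': [0, 0, 1, 1],
    'Gym': [0.05, 0.05, 0.10, 0.10],
    'Food': [0.40, 0.60, 0.10, 0.10],
    'Coffees': [0.10, 0.10, 0.30, 0.50],
})

# mean frequency of each venue group per cluster
profile = profile_input.groupby('Cluster Labels').mean()

# the venue group with the highest mean frequency in each cluster
dominant = profile.idxmax(axis=1)

print(profile)
print(dominant)
```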

## Results:

Section Description: Results section where you discuss the results.

The analysis separated the 39 neighbourhood groups in boroughs containing "Toronto" in their name into 5 clusters, based on which of the venue groups of interest were the most and least frequent in each area.

The clustering results are broken down and analysed in the 'Discussion' section below.

## Discussion:

Section Description: Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.

In the first cluster, with the exception of the East Toronto and West Toronto rows (index 0 and 31 respectively), the most common venue groups in order of frequency are coffee places, bubble tea shops, shops and stores, food locations and then gyms. The second cluster is the largest and, like the remaining clusters, has a large concentration of restaurants. It also has many coffee shops, which is no surprise since half of its boroughs are Downtown Toronto. Given how scattered the last three 'common venue' columns are, this may be a catch-all cluster that happens to have more exercise-related facilities in the area.

The last three clusters all have food locations as their most common venue group. Again, this makes sense because 'Downtown Toronto' makes up the majority of these clusters. Clusters three and four are fairly similar: after food, both have many shopping locations and coffee spots, in that order. Cluster three, however, has more bubble tea locations than cluster four, while cluster four has more gyms than bubble tea shops.
Cluster five has the highest frequency of bubble tea shops and the fewest gyms, and its shopping and coffee frequencies fall in between the two.

Based on these observations, I would recommend that the investor look into neighbourhoods in the Downtown Toronto or Central Toronto boroughs that fall in clusters 3 and 4.
Specifically, he should look at the neighbourhoods where food and shopping are the two most common venue groups (in that order), since this is where the most relevant foot traffic will be. Ideally, coffee locations and bubble tea shops come next, since coffee and bubble tea are roughly substitute goods. Bubble tea shops also seem to cluster together, giving consumers a choice, more often than they open where few others exist. This may be important to consider, especially in an area of high population density like Toronto (compared to more rural areas).

## Conclusion:

Section Description: Conclusion section where you conclude the report.

In conclusion, we clustered locations in Toronto based on how many restaurants, gyms, bubble tea shops, coffee shops and stores there are in each area.
From these results we were able to identify what makes each cluster different from the others, and we created a map to visualize where these neighbourhoods are geographically.

Analysing the resulting clusters against the criteria outlined in the introduction shows that certain neighbourhoods in certain clusters fit the business problem better than others.

Happy bubble tea drinking!