Toronto, despite being one of the world's major cities, is often viewed as boring. I believe this is due to a lack of after-work activities. I hope to dig deeper into this problem, design a feasible solution, and maybe even put it into action.

If there really is a lack of 'fun' things to do in Toronto, certain metrics should be able to confirm it. Some questions I hope to answer using the data are:

Compared to other major cities, what kinds of venues are lacking? (A follow-up question is: why are they lacking? Answering this will require research beyond the Foursquare data and may depend on laws that affect alcohol distribution, the cost of living, etc.)

Once I have decided on the type of venue to pursue, I will have to decide where to locate it. I can scrape the internet for rent prices, income levels, and population density that suit the type of business I want to start. I would also like to look into accessibility to and from the venue. At this stage, I would have to consider other metrics that affect whether customers would come to that area or that type of venue.

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # install geopy if it is not already available
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # install folium if it is not already available
import folium # map rendering library

print('Libraries imported.')

/bin/sh: conda: command not found
/bin/sh: conda: command not found
Libraries imported.


In [56]:
import requests
import lxml.html as lh

In [57]:
url='https://en.wikipedia.org/wiki/List_of_cities_by_international_visitors'

In [58]:
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [59]:
#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]

In [60]:
#Create empty list
col=[]
i=0
#For each header cell in the first row, store its text and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d:%s'%(i,name))
    col.append((name,[]))

1:RankEuromonitor

2:RankMastercard

3:City

4:Country

5:Arrivals 2017Euromonitor

6:Arrivals 2016Mastercard

7:Growthin arrivalsEuromonitor

8:Income(billions $)Mastercard



In [61]:
#Since our first row is the header, data is stored from the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 8, the //tr data is not from our table 
    if len(T)!=8:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Skip the first column; try to convert the remaining values to integers
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [62]:
#Check the length of each column
[len(C) for (title,C) in col]

[137, 137, 137, 137, 137, 137, 137, 137]

In [105]:
#Create dictionary and DataFrame
Dict={title:column for (title,column) in col}
df_raw=pd.DataFrame(Dict)
df_raw.head()
df_raw.shape

(137, 8)

In [109]:
#Clean up the DataFrame
df = df_raw.copy() #work on a copy so df_raw stays untouched
df.rename(columns={'City\n':'City','Country\n':'Country','Arrivals 2017Euromonitor\n':'Arrivals','Growthin arrivalsEuromonitor\n':'Growth'}, inplace=True)
df = df.replace('\n','', regex=True) #get rid of all \n in the DataFrame
df = df.replace('\xa0','', regex=True) #get rid of all non-breaking spaces in the DataFrame
df = df.replace(r'^\s*$',np.nan, regex=True) #replace empty strings with NaN
df.drop(['RankMastercard\n','RankEuromonitor\n','Arrivals 2016Mastercard\n','Income(billions $)Mastercard\n'], axis=1,inplace=True)
df.dropna(inplace=True)
df.head()

Unnamed: 0,City,Country,Arrivals,Growth
0,Hong Kong,Hong Kong,25695800,−3.1 %
1,Bangkok,Thailand,23270600,9.5%
2,London,United Kingdom,19842800,3.4%
3,Singapore,Singapore,17681800,6.1%
4,Macau,Macau,16299100,5.9%
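
The `Growth` column is still text at this point: it mixes a Unicode minus sign, stray spaces, and a trailing `%`. Should the column be needed as a number later, a conversion could look roughly like the sketch below. `parse_growth` is a hypothetical helper that is not part of the original notebook.

```python
def parse_growth(text):
    """Convert a percentage string such as '\u22123.1 %' or '9.5%' to a float."""
    cleaned = (
        text.replace('\u2212', '-')  # Unicode minus sign -> ASCII hyphen-minus
            .replace('%', '')
            .replace('\xa0', '')     # non-breaking space, in case one survived
            .strip()
    )
    return float(cleaned)


# Sample values written in the same style as the table above
samples = ['\u22123.1 %', '9.5%', '3.4%']
print([parse_growth(s) for s in samples])
```

On the DataFrame this could be applied as `df['Growth'].map(parse_growth)`.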


In [110]:
df['City, Country'] = df['City'] + ', ' + df['Country']

In [387]:
Cities = df['City, Country'][:50].values

In [388]:
# Foursquare API credentials and request parameters
CLIENT_ID = 'Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX' # your Foursquare ID
CLIENT_SECRET = 'FFMOJY1UQXST5XXCTUIGWVLENGPG2VDDJIYF201YLOXUTG30' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

latlng = []
cols = ['City, Country','Latitude','Longitude','URL Link']

limit = 100
radius = 2500
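
The cell above hard-codes the API credentials, which then also appear in every printed URL. One possible alternative, sketched here under assumptions, is to read them from environment variables and let `urllib.parse.urlencode` assemble the query string. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are hypothetical, not something the notebook or Foursquare prescribes.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(latitude, longitude, radius=2500, limit=100, version='20180605'):
    """Build an explore URL, reading the credentials from the environment."""
    params = {
        # Assumed environment variable names; pick whatever suits your setup
        'client_id': os.environ['FOURSQUARE_CLIENT_ID'],
        'client_secret': os.environ['FOURSQUARE_CLIENT_SECRET'],
        'v': version,
        'll': '{},{}'.format(latitude, longitude),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lng" unescaped, as in the notebook's URLs
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')
```

With both variables set, `build_explore_url(51.5073219, -0.1276474)` would yield a URL of the same shape as the ones built in the next cell, without the secret living in the notebook source.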

In [389]:
geolocator = Nominatim(user_agent="explorer") # create the geocoder once and reuse it
for city in Cities:
    location = geolocator.geocode(city, timeout=10)
    latitude = location.latitude
    longitude = location.longitude
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION, 
        latitude, 
        longitude,
        radius,
        limit)
    latlng.append([city, latitude, longitude, url])
    print(city, latitude, longitude, url)
print('Done')

Hong Kong, Hong Kong 22.350627 114.1849161 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLENGPG2VDDJIYF201YLOXUTG30&v=20180605&ll=22.350627,114.1849161&radius=2500&limit=100
Bangkok, Thailand 13.7538929 100.8160803 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLENGPG2VDDJIYF201YLOXUTG30&v=20180605&ll=13.7538929,100.8160803&radius=2500&limit=100
London, United Kingdom 51.5073219 -0.1276474 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLENGPG2VDDJIYF201YLOXUTG30&v=20180605&ll=51.5073219,-0.1276474&radius=2500&limit=100
Singapore, Singapore 1.3408528 103.878446863736 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLEN

Denpasar, Indonesia -8.6524973 115.2191175 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLENGPG2VDDJIYF201YLOXUTG30&v=20180605&ll=-8.6524973,115.2191175&radius=2500&limit=100
Osaka, Japan 34.6937569 135.5014539 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLENGPG2VDDJIYF201YLOXUTG30&v=20180605&ll=34.6937569,135.5014539&radius=2500&limit=100
Los Angeles, United States 34.0536909 -118.2427666 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLENGPG2VDDJIYF201YLOXUTG30&v=20180605&ll=34.0536909,-118.2427666&radius=2500&limit=100
Vienna, Austria 48.2083537 16.3725042 https://api.foursquare.com/v2/venues/explore?&client_id=Y1PZQ4UQX0XRYB5YMWSHH05PWS2F0R4S4SI4F05KIUJZ1JDX&client_secret=FFMOJY1UQXST5XXCTUIGWVLENGPG2VD
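
The loop above assumes every lookup succeeds. geopy's `geocode` returns `None` when nothing is found, in which case `location.latitude` would raise an `AttributeError`. A minimal sketch of a more defensive loop follows; `geocode_cities` is a hypothetical helper, and the one-second pause is an assumption based on the public Nominatim usage policy of at most one request per second.

```python
import time


def geocode_cities(cities, geocode, pause_seconds=1.0):
    """Return (found, missing): coordinates for resolved cities and the names that failed.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude`, or None when nothing is found.
    """
    found, missing = [], []
    for city in cities:
        location = geocode(city)
        if location is None:
            missing.append(city)  # keep track of failures instead of crashing
        else:
            found.append((city, location.latitude, location.longitude))
        time.sleep(pause_seconds)  # pause between requests to respect the rate limit
    return found, missing
```

With the notebook's objects this could be called as `geocode_cities(Cities, lambda q: geolocator.geocode(q, timeout=10))`.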

In [391]:
df_top = pd.DataFrame(latlng, columns=cols)
df_top.shape

(50, 4)

In [392]:
df_new = pd.merge(df_top, df, on=['City, Country'])
df_new.drop(['City','Country'], axis=1, inplace=True)
df_new.head()

Unnamed: 0,"City, Country",Latitude,Longitude,URL Link,Arrivals,Growth
0,"Hong Kong, Hong Kong",22.350627,114.184916,https://api.foursquare.com/v2/venues/explore?&...,25695800,−3.1 %
1,"Bangkok, Thailand",13.753893,100.81608,https://api.foursquare.com/v2/venues/explore?&...,23270600,9.5%
2,"London, United Kingdom",51.507322,-0.127647,https://api.foursquare.com/v2/venues/explore?&...,19842800,3.4%
3,"Singapore, Singapore",1.340853,103.878447,https://api.foursquare.com/v2/venues/explore?&...,17681800,6.1%
4,"Macau, Macau",22.195629,113.548785,https://api.foursquare.com/v2/venues/explore?&...,16299100,5.9%


In [393]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [394]:
venues_list = pd.DataFrame([])
columns1 = ['Name','Category','Latitude','Longitude','City, Country']

for i in range(len(df_new)):
    city = df_new.loc[i,'City, Country']
    url = df_new.loc[i, 'URL Link']
    results = requests.get(url).json()

    venues = results['response']['groups'][0]['items']

    nearby_venues = json_normalize(venues) # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng','City, Country']
    nearby_venues['City, Country'] = city
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    venues_list = pd.concat([venues_list, nearby_venues], ignore_index=True)
    print(nearby_venues.shape) # number of venues returned for this city
print('Done')
venues_list.shape

(100, 5)
(23, 5)
(100, 5)
(100, 5)
(100, 5)
(50, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(12, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(13, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(7, 5)
(100, 5)
(100, 5)
(35, 5)
(97, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
(100, 5)
Done


(4537, 5)

In [396]:
venues_list.head()

Unnamed: 0,name,categories,lat,lng,"City, Country"
0,Lion Rock Country Park (獅子山郊野公園),Nature Preserve,22.35437,114.182578,"Hong Kong, Hong Kong"
1,Lion Rock Park (獅子山公園),Park,22.344922,114.183878,"Hong Kong, Hong Kong"
2,Lion Rock (獅子山),Mountain,22.358955,114.18847,"Hong Kong, Hong Kong"
3,Sik Sik Yuen Wong Tai Sin Temple (嗇色園黃大仙祠),Temple,22.342357,114.193646,"Hong Kong, Hong Kong"
4,Kowloon Walled City Park (九龍寨城公園),Park,22.331995,114.190401,"Hong Kong, Hong Kong"


In [439]:
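def format1(format1_df): 
    # Normalize the column at position 1 in place: trim whitespace and
    # capitalize only the first letter (the rest is lowercased)
    for i in range(len(format1_df)):
        format1_df.iat[i, 1] = format1_df.iat[i, 1].strip().capitalize()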

In [495]:
vg = venues_list.groupby('categories').count()
vg = vg.sort_values('name', ascending=False)
vg.reset_index(inplace=True)
vg.head(50)

Unnamed: 0,categories,name,lat,lng,"City, Country"
0,Hotel,353,353,353,353
1,Coffee shop,195,195,195,195
2,Café,160,160,160,160
3,Plaza,106,106,106,106
4,Italian restaurant,98,98,98,98
5,Ice cream shop,97,97,97,97
6,Bar,87,87,87,87
7,Restaurant,79,79,79,79
8,Indian restaurant,74,74,74,74
9,Cocktail bar,71,71,71,71


In [494]:
print('Hotel:', venues_list.categories.str.count('hotel|Hotel').sum())
print('Cafe:', venues_list.categories.str.count('Coffee|coffee|café|Café|cafe|Cafe|dessert|desert|Dessert').sum())
print('Plaza:', venues_list.categories.str.count('plaza|Plaza').sum())
print('Icecream:', venues_list.categories.str.count('Ice Cream|ice cream|Ice cream|ice Cream').sum())
print('Bar:', venues_list.categories.str.count('bar|Bar|club|Club|cocktail|Cocktail').sum())
print('Park:', venues_list.categories.str.count('Park|park').sum())
print('Gallery:', venues_list.categories.str.count('Art|Gallery|art|gallery|Museum|museum').sum())
print('Restaurant:', venues_list.categories.str.count('Restaurant|restaurant').sum())

Hotel: 376
Cafe: 427
Plaza: 121
Icecream: 97
Bar: 385
Park: 69
Gallery: 297
Restaurant: 1089
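
Note that `str.count` tallies regex matches rather than rows, so a category such as 'Cocktail bar' adds two to the Bar total (one for 'Cocktail', one for 'bar'). The sketch below contrasts that with counting each venue at most once, using a case-insensitive pattern. The helper `count_matching` and the sample list are illustrative only and are not taken from the Foursquare results.

```python
import re


def count_matching(categories, pattern):
    """Count entries that match `pattern` at least once, ignoring case."""
    rx = re.compile(pattern, flags=re.IGNORECASE)
    # Non-string entries (None, NaN) are skipped
    return sum(1 for name in categories if isinstance(name, str) and rx.search(name))


sample = ['Cocktail bar', 'Bar', 'Nightclub', 'Hotel bar', 'Park']
pattern = 'bar|club|cocktail'

# Every individual match, which is what str.count(...).sum() adds up
occurrences = sum(len(re.findall(pattern, name, flags=re.IGNORECASE)) for name in sample)
print('occurrences:', occurrences)
print('venues:', count_matching(sample, pattern))
```

The pandas counterpart of the per-venue count would be `venues_list.categories.str.contains('bar|club|cocktail', case=False, na=False).sum()`.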


In [454]:
vg_s = venues_list['categories']
vg_s.shape

(4537,)

In [462]:
searchfor = ['bar', 'Bar', 'club', 'Club']
[vg_s.str.contains('|'.join(searchfor))]

[0       False
 1       False
 2       False
 3       False
 4       False
 5       False
 6       False
 7       False
 8       False
 9       False
 10      False
 11      False
 12      False
 13      False
 14      False
 15      False
 16      False
 17      False
 18      False
 19      False
 20      False
 21      False
 22      False
 23      False
 24      False
 25      False
 26      False
 27      False
 28      False
 29      False
 30      False
 31      False
 32      False
 33      False
 34      False
 35      False
 36      False
 37      False
 38      False
 39      False
 40      False
 41      False
 42      False
 43      False
 44      False
 45      False
 46      False
 47      False
 48      False
 49      False
 50      False
 51      False
 52      False
 53      False
 54      False
 55      False
 56      False
 57      False
 58      False
 59      False
 60      False
 61      False
 62      False
 63      False
 64      False
 65      False
 66      F

314
1089


In [None]:
vg = vg.replace('\n','', regex=True) #get rid of all \n in the DataFrame

In [None]:
# replace returns a new DataFrame, so the result has to be assigned back
vg = vg.replace('restaurant','Restaurant', regex=True)
vg = vg.replace('bar','Bar', regex=True)

In [497]:
vg = vg.sort_values('name', ascending=False)
vg.head(10)

Unnamed: 0,categories,name,lat,lng,"City, Country"
0,Hotel,353,353,353,353
1,Coffee shop,195,195,195,195
2,Café,160,160,160,160
3,Plaza,106,106,106,106
4,Italian restaurant,98,98,98,98
5,Ice cream shop,97,97,97,97
6,Bar,87,87,87,87
7,Restaurant,79,79,79,79
8,Indian restaurant,74,74,74,74
9,Cocktail bar,71,71,71,71


Keep the extracted data there, stating why you went down this path but were ultimately unsuccessful.

Go back to the map of Toronto specifically and use folium to map out all the bars, lounges, clubs, etc.
Then use an algorithm to find a suitable place to start a new bar based on its distance to other venues.
Verify the result afterwards.
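
The notes above do not say which algorithm to use. As one possible starting point, the sketch below picks, from a list of candidate sites, the one whose nearest existing bar is farthest away, measured with the haversine formula. The selection rule is an assumption rather than a settled method, the function names are hypothetical, and the coordinates are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def farthest_from_existing(candidates, existing):
    """Return the candidate whose nearest existing venue is farthest away."""
    def nearest(point):
        return min(haversine_km(point[0], point[1], v[0], v[1]) for v in existing)
    return max(candidates, key=nearest)


# Illustrative (latitude, longitude) pairs only, not taken from the Foursquare data
existing_bars = [(43.6450, -79.3950), (43.6480, -79.3900), (43.6550, -79.3800)]
candidate_sites = [(43.6460, -79.3940), (43.6700, -79.3900), (43.6500, -79.3850)]
print(farthest_from_existing(candidate_sites, existing_bars))
```

Maximizing the distance to competitors is only one criterion. Clustering near existing nightlife is an equally defensible choice, so the verification step should compare both against the rent, income, and accessibility data mentioned at the start.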