# Restaurant "Likes" Prediction Using Foursquare API 

#### Hafsah Ahmed
Capstone Project

IBM Data Science Professional Certificate

## 1. Introduction

An important aspect of marketing for a modern restaurant, or any business for that matter, is social media, where the number of "likes" shapes a company's public image and reputation and serves as a rough indicator of customer satisfaction.

For a business owner planning to open a new restaurant (or expand an existing one) in a new city, knowing ahead of time which restaurant type, cuisine and location have the potential to do well, both physically and on social media, can be a game changer. This is because it reduces the uncertainty surrounding performance when breaking into new markets.

In this analysis we will reduce this uncertainty by leveraging data gathered from FourSquare's API, specifically the "likes" of different restaurants, their locations and their category of cuisine.

The problem: how accurately can we predict the number of "likes" a new restaurant can expect, based on its type of cuisine and the city in which it opens?

For the purpose of this analysis we will focus on three of the most popular and heavily populated cities in Texas, namely Houston, San Antonio and Dallas. All three cities boast a very diverse restaurant scene mostly due to their culturally diverse residents.

The goal of this analysis is to help a business owner decide whether it is feasible to open a restaurant in a certain area and expect a positive social media presence, which types of cuisine perform well in which areas, and which of the three cities is the best overall.

### 1.1 Let us begin by importing the necessary packages

In [1]:
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

from geopy.geocoders import Nominatim

import requests
from pandas import json_normalize


import matplotlib.cm as cm
import matplotlib.colors as colors

import folium

from urllib.request import urlopen
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import pylab as pl

from sklearn import linear_model
from sklearn.metrics import jaccard_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import log_loss
from sklearn.metrics import mean_squared_error, r2_score
import itertools

print('Libraries imported.')


Libraries imported.


## 2. Data 

### 2.1 Data Scraping and Cleaning

1. First we will retrieve the geographical coordinates of the 3 cities, i.e. Houston, San Antonio and Dallas.
2. We will then build FourSquare API URLs that return the raw venue data in JSON format.
3. It is important to note that the FourSquare API returns all types of venues, not just restaurants (e.g. concert halls). This means we will have to clean the data to remove all non-restaurant venues.
4. The "id" column is important because we use it to pull the "likes" for each venue. After retrieving the "likes" by restaurant "id", we will append them to our data frame.
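Steps 1 and 2 repeat the same URL pattern for every city, so they can be wrapped in a small helper. The sketch below is only an illustration: `build_explore_url`, `city_coords` and the placeholder credentials are hypothetical names that do not appear in the notebook, and the coordinates are rounded for brevity. The endpoint and query parameters are the ones used in the cells that follow.

```python
from urllib.parse import urlencode

# Hypothetical helper (not part of the original notebook): builds one
# v2 "explore" URL per city instead of repeating the format string.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=1000, limit=100):
    """Return the explore URL for a search centred on (lat, lng)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" unescaped
    return f"{EXPLORE_ENDPOINT}?{urlencode(params, safe=',')}"

# Placeholder coordinates and credentials, for illustration only
city_coords = {"Houston": (29.76, -95.37), "Dallas": (32.78, -96.80)}
urls = {
    city: build_explore_url(lat, lng, "YOUR_ID", "YOUR_SECRET")
    for city, (lat, lng) in city_coords.items()
}
print(urls["Houston"])
```

Keeping the credentials as function arguments also makes it easier to load them from outside the notebook rather than writing them into a cell.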

In [2]:
address1 = 'Houston, Texas'

geolocator = Nominatim(user_agent="smy-application")
location1 = geolocator.geocode(address1)
latitude1 = location1.latitude
longitude1= location1.longitude
print('The geographical coordinate of {} are {}, {}.'.format(address1, latitude1, longitude1))


address2 = 'San Antonio, Texas'

geolocator = Nominatim(user_agent="smy-application")
location2 = geolocator.geocode(address2)
latitude2 = location2.latitude
longitude2 = location2.longitude
print('The geographical coordinate of {} are {}, {}.'.format(address2, latitude2, longitude2))


address3 = 'Dallas, Texas'

geolocator = Nominatim(user_agent="smy-application")
location3 = geolocator.geocode(address3)
latitude3 = location3.latitude
longitude3 = location3.longitude
print('The geographical coordinate of {} are {}, {}.'.format(address3, latitude3, longitude3))

The geographical coordinate of Houston, Texas are 29.7589382, -95.3676974.
The geographical coordinate of San Antonio, Texas are 29.4246002, -98.4951405.
The geographical coordinate of Dallas, Texas are 32.7762719, -96.7968559.


In [3]:
CLIENT_ID = 'DOM1M1PK1OE5NKOXJDS05SFC24NYQOEFUSP5FH4ERTAOUINN' # your foursquare ID
CLIENT_SECRET = 'KLEUPSFNQAKCSRJFWUP2B4U5KZEYW35F4DQL15DL15YMRX0O' # your foursquare secret
VERSION = '20180605' # foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

LIMIT = 100 # limit of number of venues returned by foursquare API
radius = 1000 # define radius

# create URLs

url1 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude1, 
    longitude1, 
    radius, 
    LIMIT)

# create URLs
url2 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude2, 
    longitude2, 
    radius, 
    LIMIT)

# create URLs
url3 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude3, 
    longitude3, 
    radius, 
    LIMIT)

print(url1, url2, url3)


Your credentials:
CLIENT_ID: DOM1M1PK1OE5NKOXJDS05SFC24NYQOEFUSP5FH4ERTAOUINN
CLIENT_SECRET:KLEUPSFNQAKCSRJFWUP2B4U5KZEYW35F4DQL15DL15YMRX0O
https://api.foursquare.com/v2/venues/explore?&client_id=DOM1M1PK1OE5NKOXJDS05SFC24NYQOEFUSP5FH4ERTAOUINN&client_secret=KLEUPSFNQAKCSRJFWUP2B4U5KZEYW35F4DQL15DL15YMRX0O&v=20180605&ll=29.7589382,-95.3676974&radius=1000&limit=100 https://api.foursquare.com/v2/venues/explore?&client_id=DOM1M1PK1OE5NKOXJDS05SFC24NYQOEFUSP5FH4ERTAOUINN&client_secret=KLEUPSFNQAKCSRJFWUP2B4U5KZEYW35F4DQL15DL15YMRX0O&v=20180605&ll=29.4246002,-98.4951405&radius=1000&limit=100 https://api.foursquare.com/v2/venues/explore?&client_id=DOM1M1PK1OE5NKOXJDS05SFC24NYQOEFUSP5FH4ERTAOUINN&client_secret=KLEUPSFNQAKCSRJFWUP2B4U5KZEYW35F4DQL15DL15YMRX0O&v=20180605&ll=32.7762719,-96.7968559&radius=1000&limit=100


In [4]:
# scrape the data from the generated URLs
results1 = requests.get(url1).json()
results1

results2 = requests.get(url2).json()
results2

results3 = requests.get(url3).json()
results3

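# function to extract the category name of a venue
def get_category_type(row):
    # the column is called 'categories' or 'venue.categories'
    # depending on whether the frame has been flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']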

# First city
venues1 = results1['response']['groups'][0]['items']
nearby_venues1 = json_normalize(venues1) # flatten JSON

# filter columns
filtered_columns1 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues1 =nearby_venues1.loc[:, filtered_columns1]

# filter the category for each row
nearby_venues1['venue.categories'] = nearby_venues1.apply(get_category_type, axis=1)

# clean columns
nearby_venues1.columns = [col.split(".")[-1] for col in nearby_venues1.columns]


# Second city
venues2 = results2['response']['groups'][0]['items']
nearby_venues2 = json_normalize(venues2) # flatten JSON

# filter columns
filtered_columns2 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues2 =nearby_venues2.loc[:, filtered_columns2]

# filter category for each row
nearby_venues2['venue.categories'] = nearby_venues2.apply(get_category_type, axis=1)

# clean columns
nearby_venues2.columns = [col.split(".")[-1] for col in nearby_venues2.columns]


# Third City

venues3 = results3['response']['groups'][0]['items']
nearby_venues3 = json_normalize(venues3) # flatten JSON

# filter columns
filtered_columns3 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues3 =nearby_venues3.loc[:, filtered_columns3]


# filter category for each row
nearby_venues3['venue.categories'] = nearby_venues3.apply(get_category_type, axis=1)

# clean columns
nearby_venues3.columns = [col.split(".")[-1] for col in nearby_venues3.columns]


print('{} venues were returned by Foursquare.'.format(nearby_venues1.shape[0]))
print('{} venues were returned by Foursquare.'.format(nearby_venues2.shape[0]))
print('{} venues were returned by Foursquare.'.format(nearby_venues3.shape[0]))

100 venues were returned by Foursquare.
100 venues were returned by Foursquare.
100 venues were returned by Foursquare.


In [5]:
# add locations data to the data sets of each city
nearby_venues1['city'] = 'Houston'
nearby_venues2['city'] = 'San Antonio'
nearby_venues3['city'] = 'Dallas'

In [6]:
# combine the three cities into one data set
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
nearby_venues = pd.concat([nearby_venues1, nearby_venues2, nearby_venues3])

In [7]:
# check list and manually remove all non-restaurant data
nearby_venues['categories'].unique()
removal_list = ['Concert Hall', 'Opera House', 'Dance Studio',
                'Performing Arts Venue', 'Art Museum', 'Park',
                'Massage Studio', 'Music Venue', 'Bookstore', 'Clothing Store',
                'Boutique', 'Furniture/Home Store', 'Jazz Club',
                'Theater', 'Optical Shop', "Men's Store", 'Rock Club',
                'Gym / Fitness Center', 'Wine Shop', 'Indie Movie Theater',
                'Chocolate Shop', 'Dessert Shop', 'Recreation Center', 
                'Plaza', 'Hotel', 'Luggage Store', 'Farmers Market', 'Gym',
                'Jewelry Store', 'Furniture / Home Store', 'Butcher', 
                'Bakery', 'Marijuana Dispensary', 'Ice Cream Shop',
                'Comic Shop', 'Bagel Shop', 'Spa', 'Liquor Store', 'Bike Shop',
                'Yoga Studio', 'Pedestrian Plaza', 'Candy Store',
                'Park', 'Bookstore', 'Candy Store',  'Jazz Club', 'Art Gallery', 
                 'Supermarket', 'Museum', 'Boutique', 'Plaza', 'Building', 'Bakery',
                 'Historic Site', 'Ice Cream Shop', ' Concert Hall', 'Pharmacy', 
                 'Market', 'Movie Theater', 'Performing Arts Venue', 'Music Venue',
                 'Theater', 'Art Museum', 'Cheese Shop', 'Opera House',
                 'Pedestrian Plaza', 'School', 'Gift Shop', 'Athletics & Sports',
                 'Shoe Repair', 'General Entertainment', 'Stationery Store',
                 'Toy / Game Store', 'Brewery', 'Hotel', 'Theater', 'Music Venue', 'Business Service',
                 'Donut Shop', 'Liquor Store', 'Beer Store',
                 'Lounge', 'Plaza', 'Health Food Store', 'Concert Hall', 
                 'Lingerie Store', 'Gym', 'Mobile Phone Shop',
                 'Chocolate Shop', 'Ice Cream Shop', 'Hostel', 'Convenience Store', 
                 'Park', 'Farmers Market', 'Cosmetics Shop', 'Piano Bar',
                 'Nightclub', 'Massage Studio', 'Comedy Club', 'Concert Hall', 
                'Department Store', 'Fountain', 'Flower Shop', 'Souvenir Shop', 
                'Scenic Lookout', 'Miscellaneous Shop', 'Deli / Bodega', 'Cultural Center',
                'Memorial Site', 'Garden', 'Trail', 'Shopping Mall', 'High School', 'Train Station',
                'Smoke Shop', 'Church', 'Outdoor Sculpture', 'Multiplex', 'Shopping Plaza', 
                'Electronics Store', "Women's Store", 'Grocery Store', 'Arts & Crafts Store',
                'Neighborhood', 'Record Shop', 'Event Space', 'Palace', 'Monument / Landmark', 
                "Doctor's Office", 'Bank', 'Public Art', 'Resort', 'History Museum', 
                'General College & University', 'Music Store', 'Dive Bar', 'Discount Store', 'Pool', 'Hotel Bar',
                'IT Services', 'Garden Center']

# .copy() so that columns added later modify an independent data frame
nearby_venues = nearby_venues[~nearby_venues['categories'].isin(removal_list)].copy()

nearby_venues['categories'].unique()

array(['Thai Restaurant', 'Food Truck', 'Pizza Place',
       'Italian Restaurant', 'Burger Joint', 'Beer Bar', 'Food Court',
       'Empanada Restaurant', 'Southern / Soul Food Restaurant',
       'Beer Garden', 'Steakhouse', 'Sushi Restaurant', 'Taco Place',
       'Fried Chicken Joint', 'Japanese Restaurant', 'Bistro', 'Wine Bar',
       'Coffee Shop', 'Gastropub', 'Bar', 'Whisky Bar',
       'Seafood Restaurant', 'BBQ Joint', 'New American Restaurant',
       'Greek Restaurant', 'Sandwich Place', 'Mexican Restaurant',
       'American Restaurant', 'Cocktail Bar', 'French Restaurant',
       'Dumpling Restaurant', 'Spanish Restaurant', 'Café',
       'Mediterranean Restaurant', 'Restaurant', 'Theme Restaurant',
       'Noodle House', 'German Restaurant', 'Indian Restaurant',
       'Salad Place', 'Cupcake Shop', 'Sports Bar', 'Wings Joint',
       'Diner', 'Caribbean Restaurant', 'Tex-Mex Restaurant',
       'Vietnamese Restaurant', 'Chinese Restaurant',
       'Cajun / Creole Resta

In [8]:
# pull the likes from the API based on venue ID

like_list = []

for venue_id in nearby_venues['id']:
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(venue_url).json()
    # the like count is nested under response -> likes -> count
    like_list.append(result['response']['likes']['count'])
print(like_list)


nearby_venues['likes'] = like_list
nearby_venues.head()

[22, 22, 150, 25, 24, 648, 29, 5, 31, 252, 141, 47, 20, 133, 3, 0, 2, 13, 81, 33, 102, 127, 20, 95, 10, 61, 79, 104, 6, 202, 6, 0, 59, 8, 26, 51, 51, 7, 85, 25, 188, 136, 44, 231, 52, 40, 66, 7, 11, 45, 249, 24, 19, 31, 77, 18, 184, 137, 47, 26, 92, 460, 7, 124, 59, 12, 103, 752, 90, 174, 11, 182, 8, 7, 154, 452, 239, 12, 105, 91, 24, 15, 183, 70, 71, 12, 81, 36, 203, 28, 47, 43, 135, 126, 178, 83, 13, 22, 51, 28, 64, 16, 7, 144, 25, 67, 246, 19, 21, 23, 31, 22, 63, 16, 32, 80, 21, 16, 15, 61, 184, 46, 20, 45, 32, 104, 9, 32, 12, 13, 51, 24, 52, 32, 52, 7, 30, 8, 35, 7, 80, 47, 7, 6, 24, 25, 12, 19, 20, 38, 69, 10, 25, 1, 74, 73, 99, 55, 24]


Unnamed: 0,name,categories,lat,lng,id,city,likes
1,Mango Tree Thai Bistro,Thai Restaurant,29.758251,-95.365387,56cf664a498eec17aaf5f939,Houston,22
2,Jason's Deli,Food Truck,29.757464,-95.365543,4bbb8ef0e452952192d954a4,Houston,22
4,Bombay Pizza Co.,Pizza Place,29.7577,-95.364586,4b5a3ea4f964a52021b728e3,Houston,150
7,Perbacco,Italian Restaurant,29.760257,-95.364773,4b510fe4f964a520554027e3,Houston,25
10,Becks Prime,Burger Joint,29.758185,-95.366172,4bd9ae2d2e6f0f47c9730b08,Houston,24


In [9]:
# this is the raw data set
raw_dataset = nearby_venues
raw_dataset.head()

Unnamed: 0,name,categories,lat,lng,id,city,likes
1,Mango Tree Thai Bistro,Thai Restaurant,29.758251,-95.365387,56cf664a498eec17aaf5f939,Houston,22
2,Jason's Deli,Food Truck,29.757464,-95.365543,4bbb8ef0e452952192d954a4,Houston,22
4,Bombay Pizza Co.,Pizza Place,29.7577,-95.364586,4b5a3ea4f964a52021b728e3,Houston,150
7,Perbacco,Italian Restaurant,29.760257,-95.364773,4b510fe4f964a520554027e3,Houston,25
10,Becks Prime,Burger Joint,29.758185,-95.366172,4bd9ae2d2e6f0f47c9730b08,Houston,24


### 2.2 Data Preparation

The raw data set we have so far needs more processing before we can use it for modelling. For example, the 'categories' column contains so many distinct cuisines that it would be hard to draw meaningful results from it. Therefore, as part of data preparation, we will group these categories into six broader groups: Asian, European, Latin, North American, casual (e.g. cafés) and drinking establishments (e.g. bars).
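One way to picture the grouping is as a lookup from each FourSquare category to its broader group. The sketch below is a minimal illustration with only a handful of categories per group; `groups`, `category_to_group` and `classify` are hypothetical names, and the full lists are the ones defined in the notebook cell further down.

```python
# Illustrative subset of the groups; the full lists appear in the notebook cell.
groups = {
    "euro": ["French Restaurant", "Italian Restaurant", "Pizza Place"],
    "latino": ["Mexican Restaurant", "Taco Place"],
    "asian": ["Thai Restaurant", "Sushi Restaurant"],
    "casual": ["Coffee Shop", "Café", "Burger Joint"],
    "american": ["BBQ Joint", "Steakhouse"],
    "bar": ["Beer Bar", "Wine Bar"],
}

# Invert to category -> group so each lookup is a single dict access
category_to_group = {
    category: group
    for group, categories in groups.items()
    for category in categories
}

def classify(category):
    """Return the cuisine group, or None for an unlisted category."""
    return category_to_group.get(category)

print(classify("Thai Restaurant"))  # asian
print(classify("Opera House"))      # None
```

With a pandas column, the same dictionary could be passed to `Series.map`, which leaves unlisted categories as missing values.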

Our 3 cities of focus and 6 categories of cuisine are all categorical variables. Hence, we need dummy variables for a meaningful analysis, which we can create via one-hot encoding.
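To make the idea concrete, here is a plain-Python sketch of one-hot encoding. The notebook itself relies on `pd.get_dummies`; the `one_hot` function below is a hypothetical stand-in that only illustrates what the encoding produces.

```python
def one_hot(values):
    """Encode a list of labels as rows of 0/1 indicators, one column per label."""
    columns = sorted(set(values))
    rows = [[1 if value == column else 0 for column in columns]
            for value in values]
    return columns, rows

cities = ["Houston", "Dallas", "San Antonio", "Houston"]
columns, rows = one_hot(cities)
print(columns)  # ['Dallas', 'Houston', 'San Antonio']
for row in rows:
    print(row)  # each row contains exactly one 1
```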

In [10]:
# inspecting our raw data set shows us that there are too many different types of cuisines
raw_dataset['categories'].unique()

array(['Thai Restaurant', 'Food Truck', 'Pizza Place',
       'Italian Restaurant', 'Burger Joint', 'Beer Bar', 'Food Court',
       'Empanada Restaurant', 'Southern / Soul Food Restaurant',
       'Beer Garden', 'Steakhouse', 'Sushi Restaurant', 'Taco Place',
       'Fried Chicken Joint', 'Japanese Restaurant', 'Bistro', 'Wine Bar',
       'Coffee Shop', 'Gastropub', 'Bar', 'Whisky Bar',
       'Seafood Restaurant', 'BBQ Joint', 'New American Restaurant',
       'Greek Restaurant', 'Sandwich Place', 'Mexican Restaurant',
       'American Restaurant', 'Cocktail Bar', 'French Restaurant',
       'Dumpling Restaurant', 'Spanish Restaurant', 'Café',
       'Mediterranean Restaurant', 'Restaurant', 'Theme Restaurant',
       'Noodle House', 'German Restaurant', 'Indian Restaurant',
       'Salad Place', 'Cupcake Shop', 'Sports Bar', 'Wings Joint',
       'Diner', 'Caribbean Restaurant', 'Tex-Mex Restaurant',
       'Vietnamese Restaurant', 'Chinese Restaurant',
       'Cajun / Creole Resta

In [11]:
# we can group some cuisines together to make a better categorical variable

euro = ['Alsatian Restaurant', 'French Restaurant', 'Scandinavian Restaurant', 'Souvlaki Shop', 'Spanish Restaurant', 'German Restaurant',  
       'Mediterranean Restaurant', 'Italian Restaurant', 'Pizza Place', 'Portuguese Restaurant', 'Tapas Restaurant', 'Greek Restaurant']

latino = ['Mexican Restaurant', 'Latin American Restaurant', 
          'Brazilian Restaurant', 'Taco Place', 'Empanada Restaurant', 'Caribbean Restaurant', 'Tex-Mex Restaurant']

bar = ['Beer Bar', 'Cocktail Bar', 'Tiki Bar', 'Wine Bar', 'Hotel Bar',
       'Beer Garden', 'Speakeasy', 'Brewery', 'Pub', 'Bar', 'Gastropub',
       'Hookah Bar', 'Cocktail Bar', 'Whisky Bar']

asian = ['Ramen Restaurant', 'Chinese Restaurant', 'Sushi Restaurant', 'Vietnamese Restaurant',
         'Thai Restaurant', 'Poke Place', 'Indian Restaurant', 
         'Japanese Curry Restaurant', 'Japanese Restaurant', 
         'Indonesian Restaurant', 'Udon Restaurant', 'Noodle House',
         'Falafel Restaurant', 'Filipino Restaurant', 'Turkish Restaurant',
         'Yoshoku Restaurant', 'Israeli Restaurant', 'Jewish Restaurant', 'Lebanese Restaurant', 'North Indian Restaurant', 'Dumpling Restaurant']

casual = ['Coffee Shop', 'Café', 'Sandwich Place', 'Food Truck', 'Salad Place', 'Smoothie Shop', 
          'Juice Bar', 'Frozen Yogurt Shop', 'Deli / Bodega', 'Dessert Shop', 'Diner', 
          'Hot Dog Joint', 'Burger Joint', 'Breakfast Spot', 'Cupcake Shop', 'Sports Bar',  
          'Fondue Restaurant', 'Pastry Shop', 'Tea Room', 'Bistro', 'Bubble Tea Shop', 'Breakfast Spot', 'Gelato Shop', 'Food Court']

american = ['Southern / Soul Food Restaurant', 'Food & Drink Shop', 'Fast Food Restaurant',  
            'Restaurant', 'American Restaurant', 'BBQ Joint', 'Wings Joint',  
            'Theme Restaurant', 'New American Restaurant',
            'Vegetarian / Vegan Restaurant', 'Seafood Restaurant', 'Vegetarian / Vegan Restaurant',
            'Seafood Restaurant', 'Gourmet Shop', 'Cajun / Creole Restaurant', 'Fried Chicken Joint', 'Steakhouse', 'Theme Restaurant' ]

def conditions(s):
    if s['categories'] in euro:
        return 'euro'
    if s['categories'] in latino:
        return 'latino'
    if s['categories'] in asian:
        return 'asian'
    if s['categories'] in casual:
        return 'casual'
    if s['categories'] in american:
        return 'american'
    if s['categories'] in bar:
        return 'bar'
    
raw_dataset['categories_classified']=raw_dataset.apply(conditions, axis=1)
raw_dataset

Unnamed: 0,name,categories,lat,lng,id,city,likes,categories_classified
1,Mango Tree Thai Bistro,Thai Restaurant,29.758251,-95.365387,56cf664a498eec17aaf5f939,Houston,22,asian
2,Jason's Deli,Food Truck,29.757464,-95.365543,4bbb8ef0e452952192d954a4,Houston,22,casual
4,Bombay Pizza Co.,Pizza Place,29.7577,-95.364586,4b5a3ea4f964a52021b728e3,Houston,150,euro
7,Perbacco,Italian Restaurant,29.760257,-95.364773,4b510fe4f964a520554027e3,Houston,25,euro
10,Becks Prime,Burger Joint,29.758185,-95.366172,4bd9ae2d2e6f0f47c9730b08,Houston,24,casual
11,Flying Saucer Draught Emporium,Beer Bar,29.759116,-95.363216,41326e00f964a520f9111fe3,Houston,648,bar
12,Finn Hall,Food Court,29.758678,-95.363846,598a39f9123a19503bef4e46,Houston,29,casual
14,5411 Empanadas,Empanada Restaurant,29.758986,-95.36859,583491fb409f5673e5c8440f,Houston,5,latino
16,Treebeards - The Tunnel,Southern / Soul Food Restaurant,29.757903,-95.368972,4b42392df964a52060cf25e3,Houston,31,american
19,Frank's Pizza,Pizza Place,29.761393,-95.362721,4ada6409f964a520352221e3,Houston,252,euro


In [12]:
# check to make sure categories_classified has been created correctly

pd.crosstab(index=raw_dataset["categories_classified"],
            columns="count")

categories_classified,count
american,41
asian,10
bar,33
casual,41
euro,16
latino,18


In [13]:
# classify the likes into different ranking levels
# lets first see where to bin the data
# we can try different ways of binning the data, I find it yields substantially different results

print(np.percentile(raw_dataset['likes'], 20))
print(np.percentile(raw_dataset['likes'], 40))
print(np.percentile(raw_dataset['likes'], 60))
print(np.percentile(raw_dataset['likes'], 80))

print(np.percentile(raw_dataset['likes'], 33))
print(np.percentile(raw_dataset['likes'], 66))

print(np.percentile(raw_dataset['likes'], 50))

13.0
25.200000000000003
51.8
103.4
23.14
64.56
36.0


In [23]:
# create a function to bin for us
def rankings(s):
    # 1 if the venue has more than 111.5 likes, otherwise 0
    return 1 if s['likes'] > 111.5 else 0


raw_dataset['ranking']=raw_dataset.apply(rankings, axis=1)
raw_dataset 

Unnamed: 0,name,categories,lat,lng,id,city,likes,categories_classified,ranking
1,Mango Tree Thai Bistro,Thai Restaurant,29.758251,-95.365387,56cf664a498eec17aaf5f939,Houston,22,asian,0
2,Jason's Deli,Food Truck,29.757464,-95.365543,4bbb8ef0e452952192d954a4,Houston,22,casual,0
4,Bombay Pizza Co.,Pizza Place,29.7577,-95.364586,4b5a3ea4f964a52021b728e3,Houston,150,euro,1
7,Perbacco,Italian Restaurant,29.760257,-95.364773,4b510fe4f964a520554027e3,Houston,25,euro,0
10,Becks Prime,Burger Joint,29.758185,-95.366172,4bd9ae2d2e6f0f47c9730b08,Houston,24,casual,0
11,Flying Saucer Draught Emporium,Beer Bar,29.759116,-95.363216,41326e00f964a520f9111fe3,Houston,648,bar,1
12,Finn Hall,Food Court,29.758678,-95.363846,598a39f9123a19503bef4e46,Houston,29,casual,0
14,5411 Empanadas,Empanada Restaurant,29.758986,-95.36859,583491fb409f5673e5c8440f,Houston,5,latino,0
16,Treebeards - The Tunnel,Southern / Soul Food Restaurant,29.757903,-95.368972,4b42392df964a52060cf25e3,Houston,31,american,0
19,Frank's Pizza,Pizza Place,29.761393,-95.362721,4ada6409f964a520352221e3,Houston,252,euro,1


In [24]:
# create dummies for linear regression modelling
# one hot encoding
reg_dataset = pd.get_dummies(raw_dataset[['categories_classified', 
                                          'city',]], 
                               prefix="", 
                               prefix_sep="")

# add name, ranking and likes columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]


reg_dataset.head()

Unnamed: 0,name,american,asian,bar,casual,euro,latino,Dallas,Houston,San Antonio,ranking,likes
1,Mango Tree Thai Bistro,0,1,0,0,0,0,0,1,0,0,22
2,Jason's Deli,0,0,0,1,0,0,0,1,0,0,22
4,Bombay Pizza Co.,0,0,0,0,1,0,0,1,0,1,150
7,Perbacco,0,0,0,0,1,0,0,1,0,0,25
10,Becks Prime,0,0,0,1,0,0,0,1,0,0,24


## 3. Methodology

1. In our analysis, we will compare two machine learning models: linear regression and logistic regression. We will use linear regression to attempt to predict the number of "likes" a new restaurant will have.

2. Logistic regression will be used as a classification method to predict the range of "likes" a new restaurant will fall into. Because the "likes" can be binned into more than two classes, we will fit the model with the multinomial option.

## 4. Results

### 4.1 Linear Regression

The linear regression model was trained on a random subsample of 80% of the data and tested on the remaining 20%. The mean of the squared residuals and the variance score (R²) were calculated on the test set.
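Both scores can be written out directly. The sketch below is a plain-Python illustration on made-up numbers (not the notebook's data); `mse` and `r_squared` are hypothetical helper names. For scikit-learn regressors, `score` returns R², which is 1 for a perfect fit, 0 for a model that always predicts the mean of the test targets, and negative for a model that does worse than that.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, the coefficient of determination."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up numbers for illustration, not values from the dataset
y_true = [10, 20, 30, 40]
good = [12, 18, 33, 39]       # close to the targets
baseline = [25, 25, 25, 25]   # always predicts the mean
bad = [40, 10, 45, 5]         # worse than predicting the mean

for name, pred in [("good", good), ("baseline", baseline), ("bad", bad)]:
    print(name, mse(y_true, pred), round(r_squared(y_true, pred), 3))
```

A slightly negative score therefore means the model's predictions on the test set were no better than always guessing the average number of "likes".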

In [33]:
# Multiple Linear Regression

msk = np.random.rand(len(reg_dataset)) < 0.8
train = reg_dataset[msk]
test = reg_dataset[~msk]

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Houston', 
                         'San Antonio', 'Dallas']])
y = np.asanyarray(train[['likes']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)

Coefficients:  [[-14.78665871 -21.34966555  41.39870634 -27.60090558  12.62695107
    9.71157243  -4.85629773  32.73028715 -27.87398942]]


In [34]:
# multiple linear regression prediction capabilities

x = np.asanyarray(test[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Houston', 
                         'San Antonio', 'Dallas']])
y = np.asanyarray(test[['likes']])
y_hat = regr.predict(x)

# mean of the squared residuals on the test set
print("Mean squared error: %.2f"
      % np.mean((y_hat - y) ** 2))

# R^2 score: 1 is perfect prediction
print('Variance score (R^2): %.2f' % regr.score(x, y))

Mean squared error: 6905.32
Variance score (R^2): -0.09


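### 4.2 Logistic Regression Results

1. A multinomial logistic regression model was trained on a random subsample of 80% of the data and then tested on the remaining 20%. Although the classes are ordered, scikit-learn's `LogisticRegression` treats them as unordered labels, so this is not an ordinal model in the strict sense.
2. The Jaccard similarity score and log-loss were calculated.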
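As a rough guide to reading the log-loss, the sketch below computes it by hand for a two-class problem. It is a plain-Python illustration on made-up labels and probabilities; `binary_log_loss` is a hypothetical helper, and it assumes every probability lies strictly between 0 and 1.

```python
import math

def binary_log_loss(y_true, p_class1):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for t, p in zip(y_true, p_class1):
        total += -math.log(p if t == 1 else 1 - p)
    return total / len(y_true)

# Made-up labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
confident = [0.1, 0.2, 0.9, 0.8]   # mostly right and fairly sure
uninformed = [0.5, 0.5, 0.5, 0.5]  # a coin flip for every venue

print(binary_log_loss(y_true, confident))
print(binary_log_loss(y_true, uninformed))  # equals log(2), about 0.693
```

Lower is better: a model that outputs 0.5 for everything scores log(2), so values below that indicate the predicted probabilities carry some information about the true class.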

In [35]:
# multinomial logistic regression

x_train = np.asanyarray(train[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Houston', 
                         'San Antonio', 'Dallas']])
y_train = np.asanyarray(train['ranking'])

x_test = np.asanyarray(test[['american', 'asian', 'bar', 'casual',
                         'euro', 'latino', 'Houston', 
                         'San Antonio', 'Dallas']])
y_test = np.asanyarray(test['ranking'])


# LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
# LR

mul_ordinal = linear_model.LogisticRegression(multi_class='multinomial',
                                              solver='newton-cg',
                                              fit_intercept=True).fit(x_train,
                                                                      y_train)

mul_ordinal

coef = mul_ordinal.coef_[0]
print (coef)

[ 0.04509203 -0.39407612  0.39835394 -0.2093582   0.33196087 -0.17209966
  0.10254     0.33083368 -0.43350083]


In [37]:
# multinomial logistic regression prediction capabilities
yhat = mul_ordinal.predict(x_test)
yhat

yhat_prob = mul_ordinal.predict_proba(x_test)
yhat_prob


jaccard_score(y_test, yhat, average = 'micro')

0.75

In [38]:
log_loss(y_test, yhat_prob)

0.39192480930243734

In [39]:
# exploration of coefficient magnitudes on full dataset
x_all = np.asanyarray(reg_dataset[['american', 'asian', 'bar', 'casual',
                                   'euro', 'latino', 'Houston', 
                                   'San Antonio', 'Dallas']])
y_all = np.asanyarray(reg_dataset['ranking'])



LR = linear_model.LogisticRegression(multi_class='multinomial',
                                            solver='newton-cg',
                                            fit_intercept=True).fit(x_all,
                                                                    y_all)

LR

coef = LR.coef_[0]
print (coef)

[ 0.03589343 -0.11936171  0.31251363 -0.2031035   0.19605024 -0.22199273
  0.13706667  0.36204587 -0.49911318]


In [40]:
print (classification_report(y_test, yhat))

              precision    recall  f1-score   support

           0       0.86      1.00      0.92        30
           1       0.00      0.00      0.00         5

    accuracy                           0.86        35
   macro avg       0.43      0.50      0.46        35
weighted avg       0.73      0.86      0.79        35





## 5. Discussion

For the purpose of this project, we assume that "likes" are a good proxy for how well a new restaurant with a given type of cuisine will do in a given location. Whether this assumption holds in real-life scenarios depends on the data available. It is important to note, however, that this analysis is limited in scope by the amount of data that can be fetched from the FourSquare API.

Using logistic regression we obtained a Jaccard similarity score of 75%, which, although not perfect, is more reasonable than the negative variance score (-0.09) obtained from the linear regression. Given the data, logistic regression is therefore the better fit of the two. The score should be read with caution, however: the classification report shows a recall of 0.00 for class 1, meaning the model assigned every test restaurant to class 0. Different binning methods for the classes were attempted, but the use of 2 bins yielded by far the best Jaccard similarity score.
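The 75% figure can be reproduced from the classification report in section 4.2. The sketch below assumes the test labels match the report's supports (30 venues in class 0, 5 in class 1) and that every prediction was class 0, which is what a recall of 1.00 for class 0 and 0.00 for class 1 implies. `micro_jaccard` is a hypothetical helper written for illustration; it pools the counts over both classes in the way the `average='micro'` option is documented to do.

```python
def micro_jaccard(y_true, y_pred):
    """Micro-averaged Jaccard: pooled TP / (TP + FP + FN) over all classes."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if t == label and p == label:
                tp += 1   # correctly assigned to this class
            elif p == label:
                fp += 1   # assigned to this class but belongs elsewhere
            elif t == label:
                fn += 1   # belongs to this class but assigned elsewhere
    return tp / (tp + fp + fn)

# Supports taken from the classification report; all predictions are class 0
y_true = [0] * 30 + [1] * 5
y_pred = [0] * 35
print(micro_jaccard(y_true, y_pred))  # 0.75
```

In other words, a classifier that never predicts the high-"likes" class still reaches 0.75 on this test split, because 30 of the 35 test venues belong to class 0.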


## 6. Conclusion

In conclusion, after analyzing "likes" for the restaurants among the 300 venues retrieved from Texas's three largest cities (Houston, San Antonio and Dallas), we conclude that the best approach to maximizing business performance (as measured by "likes") is to set up an 'American' style restaurant in either Houston or San Antonio. The bar-style restaurant also performed exceptionally well in Houston and San Antonio, so it could be a viable second option. Dallas had the lowest "likes" for every cuisine, so we would not recommend setting up there. Additionally, the logistic regression model is most accurate when the data is binned into 2 classes, that is, when classifying whether a restaurant will fall into the higher or lower "likes" class.