# IBM Applied Data Science Capstone Project Intent:

This notebook will be used for the purposes of completing the capstone project needed to obtain the professional certification from Coursera's IBM Data Science Professional Certificate course located here: https://www.coursera.org/professional-certificates/ibm-data-science

## Introduction:

For this project I explored the restaurant scene in places where I have lived: Oakland, Emeryville, and San Diego. The goal is to offer suggestions to investors and entrepreneurs looking to make a sound investment in the restaurant business, specifically which type of restaurant to open and which city to focus on when scouting for property.

The question this project addresses is how accurately we can predict the number of "likes" a new restaurant in this region can expect, based on the type of cuisine it serves and the California city in which it opens. I analyzed and modeled the data with machine learning, comparing linear and logistic regression to see which method yielded better predictive capabilities after training and testing.

## Data:

First we retrieved the geographical coordinates of the three cities (Oakland, Emeryville, and San Diego). We then used the Foursquare API to build the URLs that return the raw data in JSON form. From each response we extracted the columns 'name', 'categories', 'latitude', 'longitude', and 'id', and added a 'city' column so we can tell which city each restaurant belongs to.

For this project, I focused on venues within a 1000 m (1 km) radius of the coordinates provided by the geolocator. The Foursquare API returns more venue categories than we need, so we cleaned the results by removing non-restaurant rows. We did this before pulling the 'likes' data, which is necessary for our final decisions, so that we do not pull information that would be discarded anyway and is of no value to the analysis.

We used the 'id' column to pull the 'likes' from the API and append them to the dataframe. Once this was complete, we named the dataframe 'raw_dataset', as it is the most complete compiled form of the data before any processing for machine learning.

## Methodology:

Both linear and logistic regression were used to train and test the data. Linear regression was used to predict the number of 'likes' a new restaurant in this region will acquire. scikit-learn was used for this stage.

Logistic regression was used as the classification method. Since we binned the number of 'likes' into classes, we made use of multinomial logistic regression. Although the bins are discrete categories, they are ordinal in nature. Note that scikit-learn's LogisticRegression fits a multinomial model and does not take the ordering of the classes into account, so the model in this notebook treats the three bins as unordered categories.
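
As a brief sketch of what a multinomial logistic regression computes (the symbols $x$, $\beta_k$, and $b_k$ are notation introduced here for illustration, not names from the notebook): for a restaurant with dummy-encoded features $x$, the probability of each ranking class $k \in \{1, 2, 3\}$ is a softmax over one linear score per class.

$$
P(y = k \mid x) = \frac{\exp(\beta_k^\top x + b_k)}{\sum_{j=1}^{3} \exp(\beta_j^\top x + b_j)}
$$

Each class has its own coefficient vector $\beta_k$, and nothing in this form ties the classes to an order.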

### Importing Libraries:

In [1]:
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np 
import json
import requests
# pd.json_normalize is used below, so the deprecated pandas.io.json import is not needed
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium 
import matplotlib.pyplot as plt
import pylab as pl
import itertools
import warnings
warnings.filterwarnings('ignore')

from urllib.request import urlopen
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim 
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             jaccard_score, log_loss,
                             mean_squared_error, r2_score)

print('All libraries have been imported')

All libraries have been imported


### Retrieving Foursquare City Coordinates:

In [2]:
# a single geolocator instance is reused for all three cities
geolocator = Nominatim(user_agent="foursquare_agent")

address1 = 'Oakland, California'
location1 = geolocator.geocode(address1)
latitude1 = location1.latitude
longitude1 = location1.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address1, latitude1, longitude1))

address2 = 'Emeryville, California'
location2 = geolocator.geocode(address2)
latitude2 = location2.latitude
longitude2 = location2.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address2, latitude2, longitude2))

address3 = 'San Diego, California'
location3 = geolocator.geocode(address3)
latitude3 = location3.latitude
longitude3 = location3.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address3, latitude3, longitude3))

The geographical coordinates of Oakland, California are 37.8044557, -122.2713563.
The geographical coordinates of Emeryville, California are 37.8314089, -122.2865266.
The geographical coordinates of San Diego, California are 32.7174209, -117.1627714.


### Foursquare Credentials:

In [None]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)


LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # search radius in meters

# create URLs
url1 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude1, 
    longitude1, 
    radius, 
    LIMIT)


# create URLs
url2 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude2, 
    longitude2, 
    radius, 
    LIMIT)


# create URLs
url3 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude3, 
    longitude3, 
    radius, 
    LIMIT)

print(url1, url2, url3)
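
The three URLs above differ only in their coordinates, so they could also be built by one helper. The following is a minimal sketch, assuming the same endpoint and query parameters as the cell above; `build_explore_url` is a hypothetical helper, and the credentials and version string are placeholders rather than real values.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=1000, limit=100):
    """Return the explore URL for one location (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,  # meters
        'limit': limit,
    }
    # safe=',' keeps the comma in the "ll" value unescaped
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')


# coordinates as printed by the geocoding cell
cities = {
    'Oakland': (37.8044557, -122.2713563),
    'Emeryville': (37.8314089, -122.2865266),
    'San Diego': (32.7174209, -117.1627714),
}

# 'ID', 'SECRET' and '20200101' are placeholders, not real credentials
urls = {city: build_explore_url('ID', 'SECRET', '20200101', lat, lng)
        for city, (lat, lng) in cities.items()}

for city, url in urls.items():
    print(city, url)
```

Letting `urlencode` assemble the query string avoids the stray `?&` at the start of the hand-written templates and keeps the parameter names in one place.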

### Data Exploration:

In [4]:
# scrape the data from the generated URLs

results1 = requests.get(url1).json()
results1

results2 = requests.get(url2).json()
results2

results3 = requests.get(url3).json()
results3

# function that extracts the category of the venue

def get_category_type(row):
    # the column is 'categories' after renaming and 'venue.categories' before
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    

# FIRST CITY   

venues1 = results1['response']['groups'][0]['items']
nearby_venues1 = pd.json_normalize(venues1) # flatten JSON

# filter columns
filtered_columns1 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues1 = nearby_venues1.loc[:, filtered_columns1]

# filter the category for each row
nearby_venues1['venue.categories'] = nearby_venues1.apply(get_category_type, axis=1)

# clean columns
nearby_venues1.columns = [col.split(".")[-1] for col in nearby_venues1.columns]

# SECOND CITY

venues2 = results2['response']['groups'][0]['items']
nearby_venues2 = pd.json_normalize(venues2) # flatten JSON

# filter columns
filtered_columns2 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues2 = nearby_venues2.loc[:, filtered_columns2]

# filter the category for each row
nearby_venues2['venue.categories'] = nearby_venues2.apply(get_category_type, axis=1)

# clean columns
nearby_venues2.columns = [col.split(".")[-1] for col in nearby_venues2.columns]

# THIRD CITY

venues3 = results3['response']['groups'][0]['items']
nearby_venues3 = pd.json_normalize(venues3) # flatten JSON

# filter columns
filtered_columns3 = ['venue.name', 'venue.categories', 'venue.location.lat', 
                    'venue.location.lng', 'venue.id']
nearby_venues3 = nearby_venues3.loc[:, filtered_columns3]

# filter the category for each row
nearby_venues3['venue.categories'] = nearby_venues3.apply(get_category_type, axis=1)

# clean columns
nearby_venues3.columns = [col.split(".")[-1] for col in nearby_venues3.columns]





print('{} venues for Oakland, California were returned by Foursquare.'.format(nearby_venues1.shape[0]))
print()
print('{} venues for Emeryville, California were returned by Foursquare.'.format(nearby_venues2.shape[0]))
print()
print('{} venues for San Diego, California were returned by Foursquare.'.format(nearby_venues3.shape[0]))

100 venues for Oakland, California were returned by Foursquare.

100 venues for Emeryville, California were returned by Foursquare.

100 venues for San Diego, California were returned by Foursquare.
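
The per-city blocks above repeat the same flattening steps three times. As a rough sketch of how that could be folded into one function, the example below works on plain dictionaries; `flatten_venues` is a hypothetical helper, and the `sample` response is hand-made to mimic the nesting of `results1`, not real API output.

```python
def flatten_venues(response, city):
    """Turn one explore response into a list of flat row dicts (hypothetical helper)."""
    rows = []
    for item in response['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories', [])
        rows.append({
            'name': venue['name'],
            # keep only the first category name, or None if there is none
            'categories': categories[0]['name'] if categories else None,
            'lat': venue['location']['lat'],
            'lng': venue['location']['lng'],
            'id': venue['id'],
            'city': city,
        })
    return rows


# hand-made sample with the same nesting as the explore response
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Oaklandish',
               'categories': [{'name': 'Clothing Store'}],
               'location': {'lat': 37.805075, 'lng': -122.270726},
               'id': '4dfb9c2c1f6eeef806ab898c'}},
    {'venue': {'name': 'Venue Without Category',
               'categories': [],
               'location': {'lat': 0.0, 'lng': 0.0},
               'id': 'placeholder-id'}},
]}]}}

rows = flatten_venues(sample, 'Oakland')
print(rows[0])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` to obtain the same columns as the cleaned dataframes above.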


In [5]:
# add locations data to the data sets of each city

nearby_venues1['city'] = 'Oakland'
nearby_venues2['city'] = 'Emeryville'
nearby_venues3['city'] = 'San Diego'

In [6]:
# combine the three cities into one data set
# (DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row labels)

nearby_venues = pd.concat([nearby_venues1, nearby_venues2, nearby_venues3])

In [7]:
nearby_venues

| | name | categories | lat | lng | id | city |
|---|---|---|---|---|---|---|
| 0 | Oaklandish | Clothing Store | 37.805075 | -122.270726 | 4dfb9c2c1f6eeef806ab898c | Oakland |
| 1 | Golden Lotus Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.80329 | -122.270473 | 49cebb1bf964a520785a1fe3 | Oakland |
| 2 | Cafe Van Kleef | Bar | 37.80666 | -122.270273 | 46884818f964a52056481fe3 | Oakland |
| 3 | Bar Shiru | Bar | 37.806378 | -122.270393 | 5c5b9abdf870fd002c35d291 | Oakland |
| 4 | Woods Bar & Brewery | Brewery | 37.806889 | -122.270415 | 5419f32c498e561ee5c2fa38 | Oakland |
| 5 | Cape & Cowl | Comic Shop | 37.806725 | -122.272747 | 56562410498ea43ab630819a | Oakland |
| 6 | Ume Yoga | Yoga Studio | 37.805493 | -122.270945 | 57348ac1498e7aae9899a535 | Oakland |
| 7 | Beauty’s Bagel Shop | Bagel Shop | 37.806082 | -122.268356 | 5bd0959cf1fdaf002ce03e11 | Oakland |
| 8 | Abura-Ya | Japanese Restaurant | 37.805959 | -122.267693 | 539a69a7498ee67090b2b285 | Oakland |
| 9 | Lucky Duck Bicycle Cafe | Café | 37.801684 | -122.268656 | 57bf810c498ee0a34d9f8ca1 | Oakland |


In [8]:
nearby_venues['categories'].unique()

array(['Clothing Store', 'Vegetarian / Vegan Restaurant', 'Bar',
       'Brewery', 'Comic Shop', 'Yoga Studio', 'Bagel Shop',
       'Japanese Restaurant', 'Café', 'Vietnamese Restaurant',
       'Coffee Shop', 'Tiki Bar', 'Music Venue', 'Sandwich Place',
       'Mexican Restaurant', 'Wine Bar', 'Chinese Restaurant',
       'Seafood Restaurant', 'Caribbean Restaurant', 'Cocktail Bar',
       'Fried Chicken Joint', 'Dance Studio', 'Gym / Fitness Center',
       'Beer Bar', 'Bubble Tea Shop', 'Nightclub', 'Food Court',
       'Hot Dog Joint', 'Taco Place', 'Ice Cream Shop', 'Cupcake Shop',
       'Skating Rink', 'Brazilian Restaurant', 'Dessert Shop',
       'Climbing Gym', 'Burger Joint', 'Bakery', 'American Restaurant',
       'New American Restaurant', 'Farmers Market', 'Gay Bar',
       'Hotpot Restaurant', 'Indian Restaurant', 'Beer Garden',
       'Tea Room', 'Thai Restaurant', 'Dumpling Restaurant',
       'Arts & Crafts Store', 'Grocery Store', 'Breakfast Spot',
       'Dim Sum R

### Data Cleaning:

In [9]:
# check list and manually remove all non-restaurant data

nearby_venues['categories'].unique()

removal_list = ['Clothing Store','Bar','Brewery', 
                'Comic Shop', 'Yoga Studio','Café', 
                'Coffee Shop', 'Tiki Bar', 'Music Venue', 
                'Wine Bar',  'Cocktail Bar', 'Dance Studio', 
                'Gym / Fitness Center','Beer Bar', 
                'Bubble Tea Shop', 'Nightclub', 'Food Court', 
                'Ice Cream Shop', 'Cupcake Shop', 'Skating Rink', 
                'Dessert Shop', 'Climbing Gym', 'Bakery', 
                'Farmers Market', 'Gay Bar','Beer Garden',
                'Tea Room','Arts & Crafts Store', 'Grocery Store', 
                'Sports Bar', 'Museum', 'Street Food Gathering', 
                'Library', 'Skate Park', 'Movie Theater','Park', 
                'Gym', 'Stadium', 'Furniture / Home Store', 'Discount Store', 
                'Playground', 'Cosmetics Shop', 'Casino', 
                'Pet Store','Electronics Store', 'Snack Place',
                'Salon / Barbershop', 'Shopping Plaza', 'Deli / Bodega', 
                'Candy Store', 'Liquor Store', 'Hotel', 
                'Shoe Store', 'Bookstore', 'Shopping Mall', 
                'Dive Bar', 'Video Game Store', 'Pharmacy', 
                'Accessories Store', 'Lingerie Store', 'Mobile Phone Shop', 
                'Pool Hall', 'Juice Bar', 'Kids Store', 
                'Supplement Shop', 'Big Box Store', 'Mattress Store', 
                'Hardware Store', 'Paper / Office Supplies Store', 'Theater', 
                'Business Service', 'Donut Shop', 'Beer Store', 
                'Lounge', 'Health Food Store', 'Pedestrian Plaza', 
                'Hookah Bar', 'Concert Hall', 'Chocolate Shop', 
                'Hostel', 'Convenience Store', 'Pub', 
                'Plaza', 'Comedy Club', 'Speakeasy', 
                'Tattoo Parlor', 'Massage Studio']

nearby_venues = nearby_venues[~nearby_venues['categories'].isin(removal_list)]

nearby_venues['categories'].unique().tolist()

['Vegetarian / Vegan Restaurant',
 'Bagel Shop',
 'Japanese Restaurant',
 'Vietnamese Restaurant',
 'Sandwich Place',
 'Mexican Restaurant',
 'Chinese Restaurant',
 'Seafood Restaurant',
 'Caribbean Restaurant',
 'Fried Chicken Joint',
 'Hot Dog Joint',
 'Taco Place',
 'Brazilian Restaurant',
 'Burger Joint',
 'American Restaurant',
 'New American Restaurant',
 'Hotpot Restaurant',
 'Indian Restaurant',
 'Thai Restaurant',
 'Dumpling Restaurant',
 'Breakfast Spot',
 'Dim Sum Restaurant',
 'Southern / Soul Food Restaurant',
 'Diner',
 'Mediterranean Restaurant',
 'Scandinavian Restaurant',
 'Asian Restaurant',
 'Filipino Restaurant',
 'Pizza Place',
 'Wings Joint',
 'Burrito Place',
 'Fast Food Restaurant',
 'Sushi Restaurant',
 'French Restaurant',
 'Falafel Restaurant',
 'Theme Restaurant',
 'Italian Restaurant',
 'Turkish Restaurant',
 'Ramen Restaurant']

### DataFrame Creation:

In [10]:
# set up to pull the likes from the API based on venue ID

url_list = []
like_list = []
json_list = []

for i in list(nearby_venues.id):
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(i, CLIENT_ID, CLIENT_SECRET, VERSION)
    url_list.append(venue_url)
for link in url_list:
    result = requests.get(link).json()
    likes = result['response']['likes']['count']
    like_list.append(likes)
print(like_list)


nearby_venues['likes'] = like_list
nearby_venues.head()

[77, 22, 156, 70, 33, 14, 51, 202, 94, 65, 104, 7, 369, 61, 177, 188, 43, 39, 40, 23, 56, 239, 73, 102, 259, 11, 24, 33, 13, 45, 99, 69, 230, 5, 39, 52, 120, 25, 55, 29, 247, 331, 133, 24, 35, 31, 56, 13, 61, 92, 18, 16, 61, 5, 68, 21, 65, 4, 17, 16, 1, 41, 0, 3, 31, 156, 76, 2, 78, 131, 171, 142, 26, 41, 9, 105, 34, 24, 35, 35, 30, 296, 18, 31, 94, 19, 21, 127, 102, 320, 104, 185, 132, 539, 31, 480, 54, 448, 20, 174]


| | name | categories | lat | lng | id | city | likes |
|---|---|---|---|---|---|---|---|
| 1 | Golden Lotus Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.80329 | -122.270473 | 49cebb1bf964a520785a1fe3 | Oakland | 77 |
| 7 | Beauty’s Bagel Shop | Bagel Shop | 37.806082 | -122.268356 | 5bd0959cf1fdaf002ce03e11 | Oakland | 22 |
| 8 | Abura-Ya | Japanese Restaurant | 37.805959 | -122.267693 | 539a69a7498ee67090b2b285 | Oakland | 156 |
| 10 | Tay Ho Restaurant & Bar | Vietnamese Restaurant | 37.802062 | -122.269573 | 4c8b16ec52a98cfad73533e9 | Oakland | 70 |
| 14 | Nature Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.802157 | -122.270983 | 4f52deb2e4b0ac6d0c91df05 | Oakland | 33 |
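
The loop above reads `result['response']['likes']['count']` directly, so one response without that field stops the whole run. The following is a small sketch of a more forgiving lookup; `extract_like_count` is a hypothetical helper, and both sample payloads are made up for illustration rather than copied from the API.

```python
def extract_like_count(payload):
    """Return the like count from a likes response, or None if it is missing."""
    try:
        return payload['response']['likes']['count']
    except (KeyError, TypeError):
        # the response did not contain the expected nesting
        return None


# made-up payloads: one complete, one without the 'likes' field
complete = {'response': {'likes': {'count': 77}}}
incomplete = {'response': {}}

print(extract_like_count(complete))    # 77
print(extract_like_count(incomplete))  # None
```

Rows that come back as `None` can then be inspected or dropped before modeling instead of interrupting the loop.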


In [11]:
nearby_venues.count()

name          100
categories    100
lat           100
lng           100
id            100
city          100
likes         100
dtype: int64

In [12]:
# this is really the raw dataset now so let us rename it something more appropriate

raw_dataset = nearby_venues.copy()  # copy so later column assignments do not touch nearby_venues
raw_dataset.head()

| | name | categories | lat | lng | id | city | likes |
|---|---|---|---|---|---|---|---|
| 1 | Golden Lotus Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.80329 | -122.270473 | 49cebb1bf964a520785a1fe3 | Oakland | 77 |
| 7 | Beauty’s Bagel Shop | Bagel Shop | 37.806082 | -122.268356 | 5bd0959cf1fdaf002ce03e11 | Oakland | 22 |
| 8 | Abura-Ya | Japanese Restaurant | 37.805959 | -122.267693 | 539a69a7498ee67090b2b285 | Oakland | 156 |
| 10 | Tay Ho Restaurant & Bar | Vietnamese Restaurant | 37.802062 | -122.269573 | 4c8b16ec52a98cfad73533e9 | Oakland | 70 |
| 14 | Nature Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.802157 | -122.270983 | 4f52deb2e4b0ac6d0c91df05 | Oakland | 33 |


### Data Preparation for Machine Learning:

In [13]:
# inspecting the raw dataset shows that there may be too many different types of cuisines

raw_dataset['categories'].unique().tolist()

['Vegetarian / Vegan Restaurant',
 'Bagel Shop',
 'Japanese Restaurant',
 'Vietnamese Restaurant',
 'Sandwich Place',
 'Mexican Restaurant',
 'Chinese Restaurant',
 'Seafood Restaurant',
 'Caribbean Restaurant',
 'Fried Chicken Joint',
 'Hot Dog Joint',
 'Taco Place',
 'Brazilian Restaurant',
 'Burger Joint',
 'American Restaurant',
 'New American Restaurant',
 'Hotpot Restaurant',
 'Indian Restaurant',
 'Thai Restaurant',
 'Dumpling Restaurant',
 'Breakfast Spot',
 'Dim Sum Restaurant',
 'Southern / Soul Food Restaurant',
 'Diner',
 'Mediterranean Restaurant',
 'Scandinavian Restaurant',
 'Asian Restaurant',
 'Filipino Restaurant',
 'Pizza Place',
 'Wings Joint',
 'Burrito Place',
 'Fast Food Restaurant',
 'Sushi Restaurant',
 'French Restaurant',
 'Falafel Restaurant',
 'Theme Restaurant',
 'Italian Restaurant',
 'Turkish Restaurant',
 'Ramen Restaurant']

In [14]:
# we can group some cuisines together to make a better categorical variable

european = ['Mediterranean Restaurant', 'Scandinavian Restaurant', 'Pizza Place',
       'French Restaurant', 'Falafel Restaurant', 'Italian Restaurant',
       'Turkish Restaurant']

latin = ['Mexican Restaurant', 'Taco Place', 'Brazilian Restaurant', 
          'Burrito Place']

asian = ['Japanese Restaurant', 'Vietnamese Restaurant', 'Chinese Restaurant',
         'Hot Dog Joint', 'Hotpot Restaurant', 'Indian Restaurant',
         'Thai Restaurant', 'Dumpling Restaurant', 'Dim Sum Restaurant',
         'Asian Restaurant', 'Filipino Restaurant', 'Sushi Restaurant',
         'Ramen Restaurant']

american = ['Vegetarian / Vegan Restaurant', 'Seafood Restaurant', 'Caribbean Restaurant',
           'Burger Joint', 'American Restaurant', 'New American Restaurant',
            'Southern / Soul Food Restaurant', 'Diner']

casual = ['Bagel Shop', 'Sandwich Place', 'Fried Chicken Joint', 
          'Breakfast Spot', 'Wings Joint', 'Fast Food Restaurant',
          'Theme Restaurant']

def conditions(s):
    if s['categories'] in european:
        return 'european'
    if s['categories'] in latin:
        return 'latin'
    if s['categories'] in asian:
        return 'asian'
    if s['categories'] in american:
        return 'american'
    if s['categories'] in casual:
        return 'casual'

raw_dataset['categories_classified'] = raw_dataset.apply(conditions, axis=1)
raw_dataset

| | name | categories | lat | lng | id | city | likes | categories_classified |
|---|---|---|---|---|---|---|---|---|
| 1 | Golden Lotus Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.80329 | -122.270473 | 49cebb1bf964a520785a1fe3 | Oakland | 77 | american |
| 7 | Beauty’s Bagel Shop | Bagel Shop | 37.806082 | -122.268356 | 5bd0959cf1fdaf002ce03e11 | Oakland | 22 | casual |
| 8 | Abura-Ya | Japanese Restaurant | 37.805959 | -122.267693 | 539a69a7498ee67090b2b285 | Oakland | 156 | asian |
| 10 | Tay Ho Restaurant & Bar | Vietnamese Restaurant | 37.802062 | -122.269573 | 4c8b16ec52a98cfad73533e9 | Oakland | 70 | asian |
| 14 | Nature Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.802157 | -122.270983 | 4f52deb2e4b0ac6d0c91df05 | Oakland | 33 | american |
| 15 | Anula's Cafe | Sandwich Place | 37.803583 | -122.270151 | 4b50d22df964a520a73327e3 | Oakland | 14 | casual |
| 16 | Binh Minh Quan | Vietnamese Restaurant | 37.80202 | -122.269396 | 4a75022cf964a52043e01fe3 | Oakland | 51 | asian |
| 17 | Cosecha | Mexican Restaurant | 37.801607 | -122.274889 | 4e179c7752b123a586cef176 | Oakland | 202 | latin |
| 20 | Spices! 3 | Chinese Restaurant | 37.802078 | -122.270343 | 4620cbacf964a52085451fe3 | Oakland | 94 | asian |
| 21 | The Cook And Her Farmer | Seafood Restaurant | 37.801583 | -122.27486 | 5376a879498e8eeb2402cd71 | Oakland | 65 | american |
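
The chain of `if` statements in `conditions` can also be expressed as a lookup table. This is a minimal sketch, assuming the same group names as above; the lists are shortened to a few entries per group for brevity, and `CUISINE_GROUPS`, `CATEGORY_TO_GROUP`, and `classify` are hypothetical names.

```python
# a few entries per group, taken from the lists in the cell above
CUISINE_GROUPS = {
    'european': ['Mediterranean Restaurant', 'Pizza Place', 'Italian Restaurant'],
    'latin': ['Mexican Restaurant', 'Taco Place'],
    'asian': ['Japanese Restaurant', 'Vietnamese Restaurant', 'Chinese Restaurant'],
    'american': ['Vegetarian / Vegan Restaurant', 'Seafood Restaurant', 'Diner'],
    'casual': ['Bagel Shop', 'Sandwich Place'],
}

# invert the groups into a category -> group lookup
CATEGORY_TO_GROUP = {category: group
                     for group, categories in CUISINE_GROUPS.items()
                     for category in categories}

# a category listed in two groups would silently overwrite itself, so check for that
n_listed = sum(len(categories) for categories in CUISINE_GROUPS.values())
assert n_listed == len(CATEGORY_TO_GROUP), 'a category appears in more than one group'


def classify(category):
    """Return the cuisine group for a category, or None if it is not listed."""
    return CATEGORY_TO_GROUP.get(category)


print(classify('Taco Place'))
print(classify('Bagel Shop'))
```

With pandas, `raw_dataset['categories'].map(CATEGORY_TO_GROUP)` would produce the same kind of column, with missing values for categories that are not listed.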


In [15]:
# double check to make sure categories_classified has been created correctly

pd.crosstab(index = raw_dataset["categories_classified"],
            columns="count")

| categories_classified | count |
|---|---|
| american | 23 |
| asian | 26 |
| casual | 16 |
| european | 17 |
| latin | 18 |


In [16]:
raw_dataset['likes'].mean()

90.93

In [17]:
# create a function to bin for us

def rankings(df):
    
    if df['likes'] <= 60:
        return 3
    
    elif df['likes'] <= 100:
        return 2
    
    elif df['likes'] > 100:
        return 1

In [18]:
# apply rankings function to dataset

raw_dataset['ranking'] = raw_dataset.apply(rankings, axis=1)
raw_dataset

| | name | categories | lat | lng | id | city | likes | categories_classified | ranking |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Golden Lotus Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.80329 | -122.270473 | 49cebb1bf964a520785a1fe3 | Oakland | 77 | american | 2 |
| 7 | Beauty’s Bagel Shop | Bagel Shop | 37.806082 | -122.268356 | 5bd0959cf1fdaf002ce03e11 | Oakland | 22 | casual | 3 |
| 8 | Abura-Ya | Japanese Restaurant | 37.805959 | -122.267693 | 539a69a7498ee67090b2b285 | Oakland | 156 | asian | 1 |
| 10 | Tay Ho Restaurant & Bar | Vietnamese Restaurant | 37.802062 | -122.269573 | 4c8b16ec52a98cfad73533e9 | Oakland | 70 | asian | 2 |
| 14 | Nature Vegetarian Restaurant | Vegetarian / Vegan Restaurant | 37.802157 | -122.270983 | 4f52deb2e4b0ac6d0c91df05 | Oakland | 33 | american | 3 |
| 15 | Anula's Cafe | Sandwich Place | 37.803583 | -122.270151 | 4b50d22df964a520a73327e3 | Oakland | 14 | casual | 3 |
| 16 | Binh Minh Quan | Vietnamese Restaurant | 37.80202 | -122.269396 | 4a75022cf964a52043e01fe3 | Oakland | 51 | asian | 3 |
| 17 | Cosecha | Mexican Restaurant | 37.801607 | -122.274889 | 4e179c7752b123a586cef176 | Oakland | 202 | latin | 1 |
| 20 | Spices! 3 | Chinese Restaurant | 37.802078 | -122.270343 | 4620cbacf964a52085451fe3 | Oakland | 94 | asian | 2 |
| 21 | The Cook And Her Farmer | Seafood Restaurant | 37.801583 | -122.27486 | 5376a879498e8eeb2402cd71 | Oakland | 65 | american | 2 |
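
The binning rule in `rankings` can be written compactly with the standard library. The sketch below assumes the same cut points (60 and 100) and the same convention that ranking 1 is the highest bin; `BIN_EDGES` and `ranking_from_likes` are hypothetical names.

```python
from bisect import bisect_left

# upper bounds (inclusive) of rankings 3 and 2, as in rankings()
BIN_EDGES = [60, 100]


def ranking_from_likes(likes):
    """Map a like count to ranking 3 (<= 60), 2 (61 to 100) or 1 (> 100)."""
    # bisect_left counts how many edges lie strictly below the value
    return 3 - bisect_left(BIN_EDGES, likes)


for likes in (22, 60, 61, 77, 100, 101, 156):
    print(likes, ranking_from_likes(likes))
```

Keeping the cut points in one list makes it easier to try other binning schemes without rewriting the branching logic.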


### Machine Learning | Linear Regression:

In [19]:
# create dummies for linear regression modelling

# one hot encoding
reg_dataset = pd.get_dummies(raw_dataset[['categories_classified', 
                                          'city',]], 
                               prefix="", 
                               prefix_sep="")

# add name, ranking, and likes columns back to dataframe
reg_dataset['ranking'] = raw_dataset['ranking']
reg_dataset['likes'] = raw_dataset['likes']
reg_dataset['name'] = raw_dataset['name']

# move name column to the first column
reg_columns = [reg_dataset.columns[-1]] + list(reg_dataset.columns[:-1])
reg_dataset = reg_dataset[reg_columns]


reg_dataset.head()

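| | name | american | asian | casual | european | latin | Emeryville | Oakland | San Diego | ranking | likes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Golden Lotus Vegetarian Restaurant | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 77 |
| 7 | Beauty’s Bagel Shop | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 3 | 22 |
| 8 | Abura-Ya | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 156 |
| 10 | Tay Ho Restaurant & Bar | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 70 |
| 14 | Nature Vegetarian Restaurant | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 33 |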


In [20]:
# Multiple Linear Regression

msk = np.random.rand(len(reg_dataset)) < 0.8
train = reg_dataset[msk]
test = reg_dataset[~msk]

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y = np.asanyarray(train[['likes']])
regr.fit (x, y)

# The coefficients

print ('Coefficients: ', regr.coef_)

Coefficients:  [[ 39.90533968 -19.50702492 -28.53229251 -24.33979385  32.4737716
    5.62737318 -35.51650283  29.88912965]]
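
The mask in the cell above assigns each row to the training set with probability 0.8, so the split sizes vary from run to run and the results cannot be reproduced without a seed. The following is a minimal sketch of a seeded split with exact sizes, using only the standard library; `split_indices` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=0):
    """Return (train, test) lists of row positions with a fixed random seed."""
    rng = random.Random(seed)
    positions = list(range(n_rows))
    rng.shuffle(positions)
    cut = int(round(n_rows * train_fraction))
    return sorted(positions[:cut]), sorted(positions[cut:])


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

The positions could then be used as `reg_dataset.iloc[train_idx]` and `reg_dataset.iloc[test_idx]`; positional indexing is the safer choice here because the row labels repeat across the three cities.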


In [21]:
# Multiple Linear Regression Prediction Capabilities

x = np.asanyarray(test[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y = np.asanyarray(test[['likes']])
y_hat = regr.predict(x)

# the mean of the squared residuals is the mean squared error, not the residual sum of squares
print("Mean squared error: %.2f"
      % np.mean((y_hat - y) ** 2))

# R^2 (coefficient of determination): 1 is perfect prediction
print('Variance score (R^2): %.2f' % regr.score(x, y))

Mean squared error: 10159.10
Variance score (R^2): 0.16
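
The cell above takes `np.mean` of the squared residuals, which is the mean squared error rather than the residual sum of squares, and `regr.score` returns the coefficient of determination (R²). As a small worked sketch of how the three quantities relate, the example below uses toy numbers that are not taken from the dataset; `regression_metrics` is a hypothetical helper.

```python
def regression_metrics(y_true, y_pred):
    """Return (residual sum of squares, mean squared error, R^2)."""
    n = len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = rss / n
    mean_true = sum(y_true) / n
    # total sum of squares around the mean of the observed values
    tss = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - rss / tss
    return rss, mse, r2


# toy values for illustration only
rss, mse, r2 = regression_metrics([1, 2, 3], [1, 2, 4])
print(rss, mse, r2)
```

Here the residual sum of squares is 1, the mean squared error is 1/3, and R² is 0.5, which shows that the first two differ by a factor of the sample size.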


### Machine Learning | Logistic Regression:

In [22]:
# Multinomial Logistic Regression on the binned 'ranking' classes (1 = most likes, 3 = fewest)

x_train = np.asanyarray(train[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y_train = np.asanyarray(train['ranking'])

x_test = np.asanyarray(test[['american', 'asian', 'casual',
                         'european', 'latin', 'Oakland', 
                         'Emeryville', 'San Diego']])

y_test = np.asanyarray(test['ranking'])


mul_ordinal = linear_model.LogisticRegression(multi_class='multinomial',
                                              solver='newton-cg',
                                              fit_intercept=True).fit(x_train,
                                                                      y_train)

mul_ordinal

# coef_[0] holds the coefficients for the first class in mul_ordinal.classes_ (ranking 1)
coef = mul_ordinal.coef_[0]
print (coef)

[ 0.37507133 -0.55016123 -0.10365649 -0.14115854  0.41992659 -0.08455777
 -0.74395246  0.82853189]


In [23]:
# Multinomial Ordinal Logistic Regression Prediction Capabilities

yhat = mul_ordinal.predict(x_test)
yhat

yhat_prob = mul_ordinal.predict_proba(x_test)
yhat_prob


# average = None, average = 'micro', average = 'macro', or average = 'weighted'
jaccard_score(y_test, yhat, average='weighted')

0.2619047619047619

In [24]:
log_loss(y_test, yhat_prob)

1.2777842705582076

In [25]:
# Exploration of Coefficient Magnitudes of Full Dataset

x_all = np.asanyarray(reg_dataset[['american', 'asian', 'casual',
                                   'european', 'latin', 'Oakland', 
                                   'Emeryville', 'San Diego']])

y_all = np.asanyarray(reg_dataset['ranking'])



LR = linear_model.LogisticRegression(multi_class='multinomial',
                                            solver='newton-cg',
                                            fit_intercept=True).fit(x_all,
                                                                    y_all)

LR

# coef_[0] holds the coefficients for the first class in LR.classes_ (ranking 1)
coef = LR.coef_[0]
print(coef)

[ 0.44107953 -0.2342529  -0.3039025   0.04138763  0.05568807  0.11780993
 -0.70485215  0.58704205]
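
The printed array is easier to read when each value is paired with its feature name. The sketch below copies the coefficients from the output above and assumes the column order used to build `x_all`. Since scikit-learn sorts class labels, `coef_[0]` belongs to ranking 1 (the highest-likes bin), so a negative value lowers the model's score for that class.

```python
feature_names = ['american', 'asian', 'casual', 'european', 'latin',
                 'Oakland', 'Emeryville', 'San Diego']

# values copied from the printed output of LR.coef_[0]
coef = [0.44107953, -0.2342529, -0.3039025, 0.04138763, 0.05568807,
        0.11780993, -0.70485215, 0.58704205]

# sort from most negative to most positive
ranked = sorted(zip(feature_names, coef), key=lambda pair: pair[1])
negative = [name for name, value in ranked if value < 0]

for name, value in ranked:
    print('{:<12}{:+.3f}'.format(name, value))
print('negative:', negative)
```

Only Emeryville, casual, and asian have negative coefficients for the top class, which is the pattern discussed in the results.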


In [26]:
print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           1       0.40      0.18      0.25        11
           2       0.00      0.00      0.00         5
           3       0.50      1.00      0.67        11

    accuracy                           0.48        27
   macro avg       0.30      0.39      0.31        27
weighted avg       0.37      0.48      0.37        27
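
To see how the weighted Jaccard score relates to this report, the per-class counts can be worked out from the rounded precision, recall, and support values. The counts below are an inference from the report, not notebook output: class 1 has 2 of 11 correct with 5 predictions in total, class 2 is never predicted, and class 3 has all 11 correct among 22 predictions.

```python
# (true positives, false positives, false negatives) per class, inferred from the report
counts = {1: (2, 3, 9), 2: (0, 0, 5), 3: (11, 11, 0)}


def jaccard(tp, fp, fn):
    """Jaccard index of one class: overlap divided by union."""
    union = tp + fp + fn
    return tp / union if union else 0.0


support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
total = sum(support.values())
per_class = {label: jaccard(*c) for label, c in counts.items()}

# weight each class by its number of true samples
weighted = sum(per_class[label] * support[label] for label in counts) / total
accuracy = sum(tp for tp, _, _ in counts.values()) / total

print(per_class)
print(round(weighted, 4), round(accuracy, 2))
```

The weighted result is 11/42, about 0.2619, which matches the `jaccard_score` printed earlier, and the accuracy of 13/27 rounds to the 0.48 in the report.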



## Results:

A linear regression model was trained on a random subsample of roughly 80% of the data, and the remaining rows were used for testing (each row is assigned to the training set with probability 0.8; in this run, 27 of the 100 rows ended up in the test set). To evaluate whether the model is reasonable, the mean of the squared residuals and the variance score (R²) were both calculated (10159.10 and 0.16, respectively). The variance score is quite low, which means that linear regression is not a good way of modeling our data. Therefore, we moved on to logistic regression for our analysis.

The multinomial logistic regression model was trained and tested on the same random split. The Jaccard score and log-loss were both calculated (26.19% and 1.278, respectively). Although the prediction is not promising, a weighted Jaccard score of 26% is somewhat reasonable. The classification report, which shows an overall accuracy of 48%, is included in the analysis.

Given the modest accuracy of this model, we also fit it on the complete dataset to inspect the coefficients. The coefficients for the highest-likes class show that opening a restaurant in Emeryville, or serving cuisine that is Asian or casual, is negatively associated with 'likes'.

## Discussion:

The first thing to note is that, given the data, logistic regression presents a better fit than linear regression. Using logistic regression we obtained a Jaccard score of 26.19%, which, although far from perfect, is more reasonable than the low variance score obtained from the linear regression. The two metrics measure different things, so this comparison is indicative rather than exact. Please note that for the purposes of this project, we are assuming that likes are a good proxy for how well a new restaurant will do in terms of brand and image and, by extension, how well it will perform business-wise. Whether these assumptions hold up in a real-life scenario is up for discussion, and the project is limited in scope by the amount of data that can be fetched from the Foursquare API.

To obtain insights into this data, we can break down the results of the logistic regression model. The precision scores for classifying whether a new restaurant would fall into classes 1, 2, or 3 (highest, medium, lowest) were 40%, 0%, and 50%. The model is therefore better at predicting whether a restaurant will fall into the best or worst bin of likes than into the middle bin. This is useful, as we are mostly concerned with whether the restaurant will perform well or not. The recall values (18%, 0%, and 100%) show, however, that the model assigns most restaurants to the lowest class, so its predictions give only a rough indication of the general performance of the business opportunity. Different binning methods for the classes were attempted, but the use of 3 bins yielded the best Jaccard score by far.

Additionally, we are attempting not only to predict general business performance but also to pull insights that inform business strategy. In this case, strategy insight can be gleaned from the coefficient values obtained by running the logistic regression on the full dataset. We can see that opening a restaurant in Emeryville, or serving cuisine that is Asian or casual in nature, is negatively associated with "likes." This suggests that opening a restaurant in either Oakland or San Diego, with a cuisine that is European, Latin, or American in nature, would be the best approach for maximizing likes.

## Conclusion:

In conclusion, after analyzing the 'likes' of the 100 restaurants found among the 300 venues retrieved for the three California cities, we conclude that the best approach for maximizing business performance (as measured by 'likes') is to open a restaurant that is European, Latin, or American, and to open it in either Oakland or San Diego rather than Emeryville. Additionally, the logistic regression model was most accurate at classifying whether a restaurant fell into the best or worst class when the data was binned into 3 classes.