# Restaurant Industry Consulting Firm
July 15, 2019<br>
Ngoc, Inferential Statistics<br>
Welch’s T-Tests for Top Cuisines in the DMV Area

-----------------

In this notebook, we want to:
- Find the two most popular cuisines in each area (DC, VA, and MD), then run a t-test on the number of reviews to determine which one does better in each area
- Compare restaurants that are open overnight against those that are not (by number of reviews and star rating)
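
The notebook does not show how the two cuisines per area were picked. One plausible approach, assuming popularity means the number of listed restaurants per cuisine, is sketched below. The `restaurants` frame is made-up toy data that only mirrors the `state` and `cuisine` columns, and `top_cuisines` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical toy data mirroring the `state` and `cuisine` columns used later.
restaurants = pd.DataFrame({
    "state": ["DC"] * 6 + ["VA"] * 6,
    "cuisine": ["Ramen", "Ramen", "Ramen", "Spanish", "Spanish", "Thai",
                "Bars", "Bars", "Bars", "Seafood", "Seafood", "Thai"],
})

def top_cuisines(df, k=2):
    """Return the k most frequent cuisines per state (assumed popularity measure)."""
    counts = df.groupby(["state", "cuisine"]).size().rename("n").reset_index()
    # Sort by count within each state, then keep the first k rows per state.
    counts = counts.sort_values(["state", "n"], ascending=[True, False])
    return counts.groupby("state").head(k).reset_index(drop=True)

top = top_cuisines(restaurants)
print(top)
```

If popularity were instead defined by total or mean review count, the `size()` call would be swapped for the corresponding aggregation on `review_count`.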

-------------------

## Import Needed Libraries

In [1]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats

sns.set_style("whitegrid")
warnings.filterwarnings("ignore")

## Load Needed Data

In [2]:
dc = pd.read_csv("data/dc_restaurants.csv")
va = pd.read_csv("data/va_restaurants.csv")
md = pd.read_csv("data/md_restaurants.csv")

## Inferential Statistics

Significance level: $\alpha$ = 0.05

Helper functions:

In [3]:
def bootstrap(sample, n):
    """Draw n values from sample with replacement (one bootstrap resample)."""
    return np.random.choice(sample, size=n, replace=True)

In [4]:
def sampling(samples, n, num):
    """Return the means of num bootstrap resamples, each of size n."""
    return [bootstrap(samples, n).mean() for _ in range(num)]
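
As a quick check of the resampling idea behind these helpers, the sketch below applies it to a small made-up, right-skewed sample. It uses a seeded `numpy.random.default_rng` generator for reproducibility, which the notebook itself does not do, and the names `toy_counts` and `bootstrap_means` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is reproducible (not in the notebook)

# Hypothetical right-skewed review counts, not taken from the real data.
toy_counts = np.array([3, 17, 52, 54, 60, 75, 120, 959])

def bootstrap_means(sample, n, num, rng):
    """Return `num` means, each from `n` draws with replacement from `sample`."""
    draws = rng.choice(sample, size=(num, n), replace=True)
    return draws.mean(axis=1)

means = bootstrap_means(toy_counts, n=len(toy_counts), num=1000, rng=rng)

# The bootstrap means centre on the sample mean; their spread shrinks as n grows.
print(toy_counts.mean(), means.mean(), means.std())
```

Drawing all resamples in one `rng.choice` call avoids the Python-level loop, which matters when `n` and `num` are large.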

### DC

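H0: the mean review counts of the two cuisines are equal, H1: the mean review counts of the two cuisines differ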
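
For reference, Welch's t-test compares two group means without assuming equal variances. With sample means $\bar{x}_1, \bar{x}_2$, sample variances $s_1^2, s_2^2$, and sample sizes $n_1, n_2$ (notation introduced here, not used elsewhere in the notebook), the statistic and its approximate degrees of freedom are:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

A positive $t$ means the first group has the larger sample mean. This is the test that `stats.ttest_ind(..., equal_var=False)` performs.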

The first cuisine:

In [5]:
dc_first = dc[dc.cuisine == "Ramen"].reset_index(drop=True)
print(len(dc_first))
dc_first.head()

21


| Unnamed: 0 | hours | id | name | price | rating | review_count | special_hours | state | cuisine |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | `[{'open': [{'is_overnight': False, 'start': '1...` | 5h4xi-s9sYQsa6YSf-LoeQ | Sushi Keiko | `$$` | 4.0 | 52 | | DC | Ramen |
| 1 | `[{'open': [{'is_overnight': False, 'start': '1...` | 637UzQJ7bQ_G8KXqzQVcdw | Paper Horse | `$$` | 3.0 | 17 | | DC | Ramen |
| 2 | `[{'open': [{'is_overnight': True, 'start': '15...` | 6dzlGp9EtHAIHHB8oVyGhw | Chaplin's | `$$` | 4.0 | 959 | | DC | Ramen |
| 3 | | FrUAxcRnUIvwSm9doarThQ | Shaku Ramen Bar | | 2.5 | 3 | | DC | Ramen |
| 4 | `[{'open': [{'is_overnight': False, 'start': '1...` | GMI7ecCz0Ylfw7hGRi56KA | Uzu | `$$` | 4.5 | 54 | | DC | Ramen |


In [6]:
dc_first_review_count = dc_first.review_count

The second cuisine:

In [7]:
dc_second = dc[dc.cuisine == "Spanish"].reset_index(drop=True)
print(len(dc_second))
dc_second.head()

20


| Unnamed: 0 | hours | id | name | price | rating | review_count | special_hours | state | cuisine |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | `[{'open': [{'is_overnight': False, 'start': '1...` | 4MD0XxgyG75VBpr3NZJjPA | Boveda | `$$` | 3.5 | 77 | | DC | Spanish |
| 1 | | 4SDuKYt9E9RF1mLHtQXIeg | Paellas & Cos | | 5.0 | 2 | | DC | Spanish |
| 2 | `[{'open': [{'is_overnight': True, 'start': '11...` | 5BBKzNPoAdkr_q3JWDO7oQ | Churreria Madrid | `$$` | 3.0 | 89 | | DC | Spanish |
| 3 | `[{'open': [{'is_overnight': False, 'start': '1...` | 8FGizWqfHi9XBL2R6J5uJg | Lauriol Plaza | `$$` | 3.5 | 1955 | | DC | Spanish |
| 4 | `[{'open': [{'is_overnight': False, 'start': '1...` | 8pWrgtRGn4BqcOWb8ExxcA | Boqueria Penn Quarter | | 4.5 | 103 | | DC | Spanish |


In [8]:
dc_second_review_count = dc_second.review_count

Check if the distributions of the two groups follow the normal distribution:

H0: the data was drawn from a normal distribution, H1: the data was not drawn from a normal distribution

In [9]:
stats.shapiro(dc_first_review_count)

(0.7741931080818176, 0.00027142345788888633)

In [10]:
stats.shapiro(dc_second_review_count)

(0.7949022054672241, 0.0007259448757395148)

Both p-values are below $\alpha$ = 0.05, so we reject the null hypothesis for both groups: the normality assumption is violated.
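
To illustrate how the Shapiro-Wilk test behaves, the sketch below runs `stats.shapiro` on two synthetic samples: one normal and one strongly right-skewed, as review counts typically are. The data are simulated with a fixed seed and are not the notebook's data; the `results` dictionary is only there to collect the output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05

samples = {
    "normal": rng.normal(loc=100, scale=15, size=50),
    "skewed": rng.lognormal(mean=4, sigma=1.2, size=50),  # long right tail
}

results = {}
for name, sample in samples.items():
    w, p = stats.shapiro(sample)  # returns (W statistic, p-value)
    results[name] = (w, p)
    verdict = "reject normality" if p < alpha else "fail to reject normality"
    print(f"{name}: W={w:.3f}, p={p:.4f} -> {verdict}")
```

A W statistic well below 1 together with a small p-value, as in the outputs above, is the typical signature of a long-tailed sample.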

Because the raw review counts are not normally distributed, we bootstrap the sample means instead:

In [11]:
dc_first_review_count_bs = sampling(dc_first_review_count, 100000, 1000)
dc_second_review_count_bs = sampling(dc_second_review_count, 100000, 1000)

Check for normality again:

In [12]:
stats.shapiro(dc_first_review_count_bs)

(0.9982810020446777, 0.42071017622947693)

In [13]:
stats.shapiro(dc_second_review_count_bs)

(0.9981014728546143, 0.32776689529418945)

Both p-values are above $\alpha$ = 0.05, so we fail to reject normality for the bootstrapped means.

In [14]:
stats.ttest_ind(dc_first_review_count_bs, dc_second_review_count_bs, equal_var=False)

Ttest_indResult(statistic=610.8826100740495, pvalue=0.0)

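Reject the null hypothesis (p < $\alpha$). The t statistic is positive, so the first cuisine, Ramen, has the higher mean review count in DC.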
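
Because the t-test above is run on 1,000 bootstrapped means rather than on the restaurants themselves, its p-value depends on the number of bootstrap replicates and on the resample size, not only on the 21 and 20 observed restaurants. As a cross-check, one common alternative is to run Welch's test on the raw counts and to bootstrap a confidence interval for the difference in means, resampling each group at its original size. A minimal sketch on made-up numbers follows; the arrays `first` and `second` are hypothetical stand-ins for the two review-count columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical review counts for two cuisines (not the notebook's data).
first = np.array([52, 17, 959, 3, 54, 210, 88, 131, 40, 375])
second = np.array([77, 2, 89, 1955, 103, 60, 25, 310, 48])

# Welch's t-test directly on the observed counts.
t_stat, p_value = stats.ttest_ind(first, second, equal_var=False)

# Percentile bootstrap CI for the difference in means,
# resampling each group at its own original size.
num = 5000
diffs = np.empty(num)
for i in range(num):
    diffs[i] = (rng.choice(first, size=first.size).mean()
                - rng.choice(second, size=second.size).mean())
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"Welch t={t_stat:.3f}, p={p_value:.3f}")
print(f"95% bootstrap CI for mean difference: [{ci_low:.1f}, {ci_high:.1f}]")
```

If the interval contains 0, the data are compatible with no difference in mean review count at roughly the 5% level.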

### VA

In [15]:
va_first = va[va.cuisine == "Bars"].reset_index(drop=True)
print(len(va_first))
va_first.head()

110


| Unnamed: 0 | hours | id | name | price | rating | review_count | special_hours | state | cuisine |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | `[{'open': [{'is_overnight': True, 'start': '15...` | -8TkrTNeebNwFGcfDKJgRQ | Viva Tequila | `$$` | 1.5 | 26 | | VA | Bars |
| 1 | `[{'open': [{'is_overnight': True, 'start': '11...` | -ZVeyeJEL0jHiUO_-u8CAw | Rock It Grill | `$$` | 3.0 | 268 | | VA | Bars |
| 2 | `[{'open': [{'is_overnight': False, 'start': '1...` | 07NB7vLf_z9Sm4VW1Xx2KQ | Asia Bistro | `$$` | 3.5 | 229 | | VA | Bars |
| 3 | | 085M24RRGm9PpcOV65xUhA | O My Hot Pot & Bar | | 5.0 | 1 | | VA | Bars |
| 4 | `[{'open': [{'is_overnight': False, 'start': '1...` | 22aD4k6tzbm5bIpiNBhkOg | Meridian Pint | | 4.5 | 17 | | VA | Bars |


In [16]:
va_first_review_count = va_first.review_count

In [17]:
va_second = va[va.cuisine == "Seafood"].reset_index(drop=True)
print(len(va_second))
va_second.head()

86


| Unnamed: 0 | hours | id | name | price | rating | review_count | special_hours | state | cuisine |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | -q13gJA2tAaLJl2RbxF56g | Chesapeake Bay Seafood House | | 5.0 | 1 | | VA | Seafood |
| 1 | `[{'open': [{'is_overnight': True, 'start': '11...` | 0H01dKPAmfjeMpI1xki6MA | William Jeffrey's Tavern | `$$` | 3.5 | 375 | | VA | Seafood |
| 2 | `[{'open': [{'is_overnight': False, 'start': '1...` | 0VuICpMMCjY4brvaqnXpPQ | Catch on the Ave | `$$` | 4.0 | 66 | | VA | Seafood |
| 3 | `[{'open': [{'is_overnight': False, 'start': '1...` | 0X8ubZT59d1c_mBUdFLNwA | Casa Tequila Bar & Grill | | 3.5 | 43 | | VA | Seafood |
| 4 | `[{'open': [{'is_overnight': False, 'start': '1...` | 25drr0ej_Lp0xQrvIU2rjg | Eddie V's Prime Seafood | `$$$` | 4.0 | 727 | | VA | Seafood |


In [18]:
va_second_review_count = va_second.review_count

In [19]:
stats.shapiro(va_first_review_count)

(0.8068469166755676, 1.0536710393083126e-10)

In [20]:
stats.shapiro(va_second_review_count)

(0.8101442456245422, 3.754873745265286e-09)

In [21]:
va_first_review_count_bs = sampling(va_first_review_count, 100000, 1000)
va_second_review_count_bs = sampling(va_second_review_count, 100000, 1000)

In [22]:
stats.shapiro(va_first_review_count_bs)

(0.9984657764434814, 0.5324946641921997)

In [23]:
stats.shapiro(va_second_review_count_bs)

(0.9985237717628479, 0.5702359676361084)

In [24]:
stats.ttest_ind(va_first_review_count_bs, va_second_review_count_bs, equal_var=False)

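Ttest_indResult(statistic=757.3375435769763, pvalue=0.0)

Reject the null hypothesis (p < $\alpha$). The t statistic is positive, so the first cuisine, Bars, has the higher mean review count in VA.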

### MD

In [25]:
md_first = md[md.cuisine == "Bars"].reset_index(drop=True)
print(len(md_first))
md_first.head()

56


| Unnamed: 0 | hours | id | name | price | rating | review_count | special_hours | state | cuisine |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | `[{'open': [{'is_overnight': True, 'start': '11...` | -HFG7EOMJ_no-rKdYxbojg | Clyde's of Chevy Chase | `$$` | 3.0 | 421 | | MD | Bars |
| 1 | `[{'open': [{'is_overnight': False, 'start': '1...` | 0MF_IBRkZoEAhA2TIYO10Q | Black's Bar & Kitchen | `$$$` | 3.5 | 547 | | MD | Bars |
| 2 | `[{'open': [{'is_overnight': False, 'start': '1...` | 14Z5fHYBRI8RLwTJGBbY6w | Hunter's Bar and Grill | `$$` | 3.0 | 70 | | MD | Bars |
| 3 | `[{'open': [{'is_overnight': True, 'start': '08...` | 3YK9x5aOrHxE27WKzGSw2g | Sheger Spring Cafe | `$$` | 2.5 | 16 | | MD | Bars |
| 4 | `[{'open': [{'is_overnight': True, 'start': '11...` | 3_JkcKgupFjr0Uo_3RWYjA | Tommy Joe's | `$$` | 3.0 | 96 | | MD | Bars |


In [26]:
md_first_review_count = md_first.review_count

In [27]:
md_second = md[md.cuisine == "American (New)"].reset_index(drop=True)
print(len(md_second))
md_second.head()

92


| Unnamed: 0 | hours | id | name | price | rating | review_count | special_hours | state | cuisine |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | `[{'open': [{'is_overnight': False, 'start': '1...` | 0MF_IBRkZoEAhA2TIYO10Q | Black's Bar & Kitchen | `$$$` | 3.5 | 547 | | MD | American (New) |
| 1 | `[{'open': [{'is_overnight': False, 'start': '0...` | 0Wczoa7CuBbicdNFiKEoqw | Beltsville Eatery | `$` | 4.5 | 10 | | MD | American (New) |
| 2 | `[{'open': [{'is_overnight': False, 'start': '1...` | 3-zpYxx6zmEMwiN4C3XOUA | Matchbox - Rockville | `$$` | 3.5 | 1046 | | MD | American (New) |
| 3 | `[{'open': [{'is_overnight': False, 'start': '1...` | 4LXoNMiPZCMyMiLV_wt-qA | Marketplace Cafe | `$$` | 3.5 | 61 | | MD | American (New) |
| 4 | `[{'open': [{'is_overnight': False, 'start': '1...` | 4_jyjExsUcWCltYdhzWrYw | Seasons 52 | `$$` | 3.5 | 613 | | MD | American (New) |


In [28]:
md_second_review_count = md_second.review_count

In [29]:
stats.shapiro(md_first_review_count)

(0.6900336742401123, 1.3818322042169484e-09)

In [30]:
stats.shapiro(md_second_review_count)

(0.6912879943847656, 1.3027578148194774e-12)

In [31]:
md_first_review_count_bs = sampling(md_first_review_count, 100000, 1000)
md_second_review_count_bs = sampling(md_second_review_count, 100000, 1000)

In [32]:
stats.shapiro(md_first_review_count_bs)

(0.9984540939331055, 0.5250107645988464)

In [33]:
stats.shapiro(md_second_review_count_bs)

(0.9984880089759827, 0.5468243360519409)

In [34]:
stats.ttest_ind(md_first_review_count_bs, md_second_review_count_bs, equal_var=False)

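Ttest_indResult(statistic=1058.4004930913563, pvalue=0.0)

Reject the null hypothesis (p < $\alpha$). The t statistic is positive, so the first cuisine, Bars, has the higher mean review count in MD.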
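
The DC, VA, and MD sections repeat the same steps, so they could be folded into one helper. The sketch below is one possible refactoring, not the notebook's code: `compare_cuisines` is a hypothetical name, it assumes a frame with `cuisine` and `review_count` columns, and it is demonstrated on simulated data with much smaller resample sizes than the notebook uses so that it runs quickly.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_cuisines(df, first, second, n=1000, num=200, seed=0):
    """Bootstrap mean review counts for two cuisines and run Welch's t-test."""
    rng = np.random.default_rng(seed)
    boot = {}
    for cuisine in (first, second):
        counts = df.loc[df.cuisine == cuisine, "review_count"].to_numpy()
        # `num` bootstrap means, each from `n` draws with replacement.
        boot[cuisine] = rng.choice(counts, size=(num, n)).mean(axis=1)
    result = stats.ttest_ind(boot[first], boot[second], equal_var=False)
    # A positive statistic means the first cuisine has the larger mean.
    winner = first if result.statistic > 0 else second
    return result.statistic, result.pvalue, winner

# Simulated stand-in for one of the regional frames.
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "cuisine": ["Bars"] * 40 + ["Seafood"] * 40,
    "review_count": np.concatenate([
        rng.poisson(300, size=40), rng.poisson(100, size=40)
    ]),
})
stat, p, winner = compare_cuisines(toy, "Bars", "Seafood")
print(winner, stat, p)
```

With the real data, the call would look like `compare_cuisines(md, "Bars", "American (New)", n=100000, num=1000)` to match the settings used above.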