# Effect of the FDA's Calorie Count Policy on the Restaurant Industry Using NLP

## Resources
1. [Comparison of pandas with SQL](https://pandas.pydata.org/pandas-docs/version/0.22.0/comparison_with_sql.html)

#### Import Packages

In [1]:
import pandas as pd
import nltk
import seaborn as sns
from collections import defaultdict
import matplotlib.pyplot as plt
from random import shuffle

%matplotlib inline

#### Data Exploration

In [2]:
df_2017 = pd.read_csv('./data/restuarant/Annual_Data_2017.csv', encoding = "ISO-8859-1")
df_2014 = pd.read_csv('./data/restuarant/Annual_Data_2014.csv', encoding = "ISO-8859-1")

FileNotFoundError: File b'./data/restuarant/Annual_Data_2017.csv' does not exist

In [None]:
# Rename the year-specific calorie columns to a common name and tag each frame with its year

df_2017 = df_2017.rename(columns={'Calories_2017': 'Calories'})
df_2014 = df_2014.rename(columns={'Calories_2014': 'Calories'})

df_2014['year'] = '2014'
df_2017['year'] = '2017'

In [None]:
df_2017.sample(5)

In [None]:
print(df_2017.shape)
print(df_2017.columns)

In [None]:
# Which restaurants are in the data?
df_2017['Restaurant'].unique()

In [None]:
# How many restaurants are we dealing with here?
len(set(df_2017['Restaurant']))

In [None]:
# Which categories do the dishes/menu items fall under?
df_2017['Food_Category'].unique()

## Merging the two dataframes for comparison

Vertical stack with `pd.concat` (UNION ALL in SQL terms: rows from both years are appended and duplicates are kept)

In [None]:
# Keep the same four columns from both years so the frames stack cleanly
cols = ['Menu_Item_ID', 'Calories', 'Food_Category', 'year']
verticalStack = pd.concat([df_2017[cols], df_2014[cols]])

In [None]:
#only on 4 columns
verticalStack.sample(5)

In [None]:
verticalStack.shape
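
As a minimal sketch of how the stacking above maps to SQL, the toy frames below (hypothetical data, not the real menu files) show that `pd.concat` behaves like UNION ALL, while adding `drop_duplicates()` gives UNION. In the notebook itself the `year` column differs between the two frames, so no row is an exact duplicate.

```python
import pandas as pd

# Hypothetical toy frames standing in for the 2014 and 2017 menu data.
toy_2014 = pd.DataFrame({"Menu_Item_ID": [1, 2], "Calories": [250, 400]})
toy_2017 = pd.DataFrame({"Menu_Item_ID": [2, 3], "Calories": [400, 310]})

# UNION ALL: stack the rows and keep duplicates.
union_all = pd.concat([toy_2017, toy_2014], ignore_index=True)

# UNION: stack the rows, then drop exact duplicate rows.
union = union_all.drop_duplicates().reset_index(drop=True)

print(len(union_all))  # 4
print(len(union))      # 3
```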

## Comparison using Box Plot:

Box plots are helpful for comparing different datasets!

Seaborn builds on top of Matplotlib and introduces additional plot types!
Compared to Matplotlib's defaults, its plots are also more visually appealing!

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))
sns.boxplot(x="Food_Category", y="Calories", hue="year", data=verticalStack, showfliers=False, ax=ax)
ax.set_title('Calories by Food Category')
plt.show()

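### What can we infer from the box plot:
1. The 2014 and 2017 distributions look almost the same.
2. Compare the average (the median, shown as the line inside each box) and the spread (the IQR, i.e. the height of the box, NOT the range!).

Note: *2018 data would give a much better comparison for seeing the effect on calorie variation.*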
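
The median and IQR that the box plot draws can also be computed directly. The sketch below uses a small hypothetical frame with the same column names as `verticalStack`; the numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical calorie values for one category in each year.
toy = pd.DataFrame({
    "Food_Category": ["Beverages"] * 8,
    "year": ["2014"] * 4 + ["2017"] * 4,
    "Calories": [100, 200, 300, 400, 110, 210, 290, 390],
})

grouped = toy.groupby(["Food_Category", "year"])["Calories"]

# Median is the line inside each box; IQR (Q3 - Q1) is the height of the box.
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})
print(summary)
```

Running the same `groupby` on `verticalStack` would give one row per food category and year, which makes small shifts easier to read than the plot alone.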

## Data Exploration on 2017

In [None]:
df_2017[(df_2017['Restaurant'] == 'Starbucks') & (df_2017[' Item_Name_2017'] == 'Iced Caffe Latte w/ Coconut, Grande')]

In [None]:
# Show the complete, non-truncated text of each cell (None replaces the deprecated -1 in pandas >= 1.0)
pd.set_option('display.max_colwidth', None)

df_2017[(df_2017['Restaurant'] == '7 Eleven') & (df_2017[' Item_Name_2017'] == 'Chicken Salad Sandwich')]

#### Pandas Groupby and Bar Plot

In [None]:
#https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
df_2017.groupby('Restaurant').Calories.mean().plot(figsize=(30, 30), fontsize=10, kind='bar', rot=90)
plt.xlabel('Restaurant')
plt.ylabel('Mean calories')
plt.show()


### High-Calorie Restaurants

In [None]:
#Top 10
df_new = df_2017.groupby('Restaurant').Calories.mean().reset_index()
df_new.sort_values('Calories', ascending = False).head(10)

In [None]:
# df_2017.groupby('Restaurant').Calories.agg(['count', 'max', 'min', 'mean'])

In [None]:

# df_2017.groupby('Restaurant').Calories.count().plot(kind = 'hist')
# plt.ylabel('Number of Restaurants')
# plt.xlabel('Net Calorie Count(cal)')
# plt.show()

## Identifying the Products That Get Affected Through Bigrams

More about [NLP Ngrams](https://en.wikipedia.org/wiki/N-gram)
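
As a quick illustration of what a bigram is, the sketch below pairs each word of an item name with the word that follows it. It uses plain `zip` so that it runs without extra packages; for a list of tokens this yields the same pairs as `nltk.bigrams`. The item name is a hypothetical example.

```python
def bigrams(tokens):
    """Return adjacent token pairs, as nltk.bigrams does for a token list."""
    return list(zip(tokens, tokens[1:]))

item_name = "Iced Caramel Macchiato"  # hypothetical menu item name
pairs = bigrams(item_name.split())

# Join each pair back into a string so it can serve as a dictionary key.
joined = [" ".join(pair) for pair in pairs]
print(joined)  # ['Iced Caramel', 'Caramel Macchiato']
```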

In [None]:
# Subset the dataframe to the restaurant of our choice!
df_ourchoice = df_2017[(df_2017['Restaurant'] == 'Starbucks')]

In [None]:
# List of (item name, menu ID) tuples
item_name_id = list(zip(df_ourchoice[' Item_Name_2017'], df_ourchoice['Menu_Item_ID']))
item_name_id[0]

In [None]:
type(item_name_id)

In [None]:
def WrapperBigramsIds(item_name_id):
    """
    Collects each item name and its menu ID from the input tuples and builds a
    mapping from every bigram in the item names to the IDs of the items containing it.

    Args:
        item_name_id - list of tuples with item name and its menu id

    Returns:
        bigram_dict - dictionary with bigram name as the key and list of IDs as value
    """
    bigram_dict = defaultdict(list)
    for item_name, idx in item_name_id:
        # nltk.bigrams yields adjacent word pairs, e.g. ('Caramel', 'Macchiato')
        for bigrm in nltk.bigrams(item_name.split()):
            joined_bigram = " ".join(bigrm)
            bigram_dict[joined_bigram].append(idx)
    return bigram_dict

In [None]:
bigram_dict = WrapperBigramsIds(item_name_id)
bigram_dict['Caramel Macchiato']
# bigram_dict[0]

In [None]:
def RecommenderList(user_choice):
    """
    Builds a recommendation list for the bigram chosen by the user.

    Relies on the notebook-level `bigram_dict` and `df_ourchoice` defined above.

    Args:
        user_choice - a string bigram input from user.

    Returns:
        a list of tuples with item name and
        calorie content value of that item.
    """
    item_list = []
    calorie_list = []

    for idx in bigram_dict[user_choice]:
        # Rows for this menu ID; the first match supplies the name and calories
        match = df_ourchoice[df_ourchoice['Menu_Item_ID'] == idx]
        item_list.append(match[' Item_Name_2017'].iloc[0])
        calorie_list.append(float(match['Calories'].iloc[0]))

    return list(zip(item_list, calorie_list))

## Example
I want to have a **Mocha Frappuccino** at **Starbucks**.

In [None]:
help(RecommenderList)

In [None]:
choice_lst = RecommenderList('Mocha Frappuccino')

shuffle(choice_lst)

In [None]:
len(choice_lst)

In [None]:
choice_lst

## Top 5 recommendations (lowest calories first):

In [None]:
sorted(choice_lst, key=lambda tup: tup[1])[:5]
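
To make the ordering explicit: `sorted` with `key=lambda tup: tup[1]` orders the tuples by their calorie value in ascending order, so the slice keeps the lowest-calorie options. A small sketch with hypothetical tuples (the names and values are placeholders, not dataset entries):

```python
# Hypothetical (item name, calories) tuples; not taken from the dataset.
choices = [("Item A", 410.0), ("Item B", 140.0), ("Item C", 280.0)]

# Sort by the second element (calories), ascending, and keep the first two.
lowest_first = sorted(choices, key=lambda tup: tup[1])[:2]
print(lowest_first)  # [('Item B', 140.0), ('Item C', 280.0)]
```

Because the result is fully sorted by calories, the earlier `shuffle` only affects how ties are ordered.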