## High-Level Guide to This Project

1. Web Scraping and Getting the Data
    
    a. Scrape the Home Chef website to get all the menu items and categories from current and future weeks
    
    b. Manipulate an existing restaurant ratings dataset to generate some "user data". 

2. Making Recommendations

    a. Build the Recommendation Engine (Cross-Collaborative Filter). Tune Recommendations to minimize RMSE.
    
    b. Make Predictions using the Engine



## Part 1a: Web Scraping

I need to build a matrix of customers and menu items before training the recommender. For customers and ratings, I'll adapt an existing restaurant ratings dataset (Part 1b). For the menu items, I'll scrape them from the Home Chef website for practice.

In [1]:
from bs4 import BeautifulSoup
import requests

url = "https://www.homechef.com/our-menu"
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
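# Note: prettify is a method. Without parentheses this prints the bound method's
# repr (which still embeds the whole document); use soup.prettify() for indented HTML.
print(soup.prettify)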

<bound method Tag.prettify of <!DOCTYPE html>

<!--[if lt IE 9 ]><html class="no-js ie ltie9 ltie10" lang="en"><![endif]-->
<!--[if IE 9 ]><html lang="en" class="no-js ie ie9 ltie10"> <![endif]-->
<!--[if (gt IE 9)|(gt IEMobile 7)|!(IEMobile)|!(IE)]><!-->
<html class="no-js no-ie" lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8">
<title>Meals for the Week of May 15 | Home Chef</title>
<meta content="Our weekly deliveries of fresh, perfectly-portioned ingredients have everything you need to prepare home-cooked meals in about 30 minutes." name="description">
<meta content="Our weekly deliveries of fresh, perfectly-portioned ingredients have everything you need to prepare home-cooked meals in about 30 minutes." name="DC.description">
<meta content="Meals for the Week of May 15 | Home Chef" name="DC.title">
<meta content="Copyright Home Chef, 2013 - 2017" name="copyright">
<meta content="Check Out The Home Chef Menu For the Week of May 15" property="og:title">
<meta content="Our we

In [2]:
#let's get links to all the other weeks of menu items available on the site
import re
weeks = [url]

for link in soup.find_all('a'):
    href = link.get('href')
    # skip anchors without an href, which would make re.match raise a TypeError
    if href and re.match('/our-menus/', href) and not re.search('standard$', href):
        weeks.append('https://www.homechef.com' + href)
print(weeks)

['https://www.homechef.com/our-menu', 'https://www.homechef.com/our-menus/22-may-2017', 'https://www.homechef.com/our-menus/29-may-2017', 'https://www.homechef.com/our-menus/05-jun-2017', 'https://www.homechef.com/our-menus/12-jun-2017']
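
To make the link filter easier to follow, here is a minimal sketch of the same two regular expressions applied to a handful of sample hrefs. The hrefs and the `is_week_link` helper are hypothetical and only modelled on the URL patterns seen above; the real page may contain other link shapes.

```python
import re

# Hypothetical hrefs, modelled on the link patterns seen in the output above.
hrefs = [
    "/our-menus/22-may-2017",
    "/our-menus/22-may-2017/standard",
    "/our-menu",
    "/blog/our-menus/",
    None,  # anchors without an href attribute yield None
]

def is_week_link(href):
    """True for weekly menu links, excluding variants that end in 'standard'."""
    if not href:
        return False
    # re.match only matches at the start of the string; re.search scans all of it.
    return bool(re.match("/our-menus/", href)) and not re.search("standard$", href)

week_links = [h for h in hrefs if is_week_link(h)]
print(week_links)
```

Only the first href passes: the second ends in `standard`, the third lacks the trailing `s/`, and the fourth does not start with `/our-menus/`.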


In [3]:
import time
meal_categs = {}

for url in weeks:
    time.sleep(1)
    
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html.parser')

    cards = soup.find_all(id='meal')
    for card in cards:
        values = []
        meal = card.h2.text
    
        categs = card.find_all('li')
        for categ in categs:
            values.append(categ.span.text)
    
        icons = card.find_all('i')
        for icon in icons:
            values.append(icon['data-tooltip'])
    
        meal_categs[meal] = values
    
print(meal_categs)

{'Steak Wellington': ['Milk', 'Eggs', 'Wheat'], 'Cod al Cartoccio': ['Fish', 'Calorie-Conscious', 'Carb-Conscious'], 'Chicken with Basil-Pecorino Cream Sauce': ['Milk', 'Soy', 'Tree Nuts'], 'Blue Cheese and Green Onion-Crusted Bone-In Pork Chop': ['Milk', 'Tree Nuts', 'Calorie-Conscious', 'Carb-Conscious'], 'Brown Butter Shrimp': ['Milk', 'Shellfish', 'Calorie-Conscious'], 'Adobo Chicken Enchiladas': ['Milk', 'Wheat'], 'Quick Turkey Meatloaf': ['Milk', 'Wheat'], 'Japanese Chicken': ['Peanuts', 'Soy', 'Calorie-Conscious', 'Carb-Conscious'], 'Veggie Sloppy Joes': ['Milk', 'Wheat', 'Tree Nuts', 'Vegetarian'], 'Burrata Risotto': ['Milk', 'Vegetarian'], 'Beet and Goat Cheese Farro Bowl': ['Milk', 'Wheat', 'Tree Nuts', 'Calorie-Conscious', 'Vegetarian'], 'Frutti Tutti Smoothie': ['Milk', 'Vegetarian'], 'Spring Fruit Basket': ['Vegetarian'], 'Steak au Poivre': ['Milk', 'Carb-Conscious'], 'Baja Fish Tacos': ['Eggs', 'Fish', 'Wheat'], 'BBQ-Rubbed Crispy Chicken': ['Milk', 'Wheat', 'Soy'], 'Swis

In [4]:
#Create a set of categories from the dictionary values
categories = set()
for i in list(meal_categs.values()):
    for j in i:
        categories.add(j)
print(categories)
print(len(categories))

{'Tree Nuts', 'Shellfish', 'Soy', 'Peanuts', 'Eggs', 'Wheat', 'Milk', 'Carb-Conscious', 'Vegetarian', 'Calorie-Conscious', 'Fish'}
11


In [5]:
#Create a set of menu items from the dictionary keys
meals = set(meal_categs.keys())
print(list(meals)[:5])
print(len(meals))

['Farmhouse Fried Chicken', 'BBQ-Rubbed Crispy Chicken', 'Al Pastor Pork Tacos', 'Chicken Chopped Salad', 'Shrimp Scampi']
62


In [6]:
#Next we need to make a table that shows what categories
#the menu item falls into.
import pandas as pd
import numpy as np

m = len(meals)
n = len(categories)

# sets are unordered, so convert them to lists before using them as labels
df_menu = pd.DataFrame(data=np.zeros((m,n)), columns=list(categories), index=list(meals))
df_menu.head()

,Tree Nuts,Shellfish,Soy,Peanuts,Eggs,Wheat,Milk,Carb-Conscious,Vegetarian,Calorie-Conscious,Fish
Farmhouse Fried Chicken,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BBQ-Rubbed Crispy Chicken,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Al Pastor Pork Tacos,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Chicken Chopped Salad,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Shrimp Scampi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
#Now let's populate the table based on the dictionary

for meal in meals:
    for categ in meal_categs[meal]:
        # .ix was removed in pandas 1.0; .loc with a (row, column) pair also avoids chained assignment
        df_menu.loc[meal, categ] = 1
df_menu.head()

,Tree Nuts,Shellfish,Soy,Peanuts,Eggs,Wheat,Milk,Carb-Conscious,Vegetarian,Calorie-Conscious,Fish
Farmhouse Fried Chicken,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
BBQ-Rubbed Crispy Chicken,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
Al Pastor Pork Tacos,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
Chicken Chopped Salad,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
Shrimp Scampi,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0


The table above could be used for Content-based filtering. However, for this exercise I'll be using Collaborative filtering (User-based). If I were to create a method for content-based filtering, I would want to create some more descriptive columns for each dish. These additional columns could be based on ingredients, region, or cooking methods.
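
To illustrate what content-based filtering on these tags might look like, here is a minimal sketch that ranks meals by how many category tags they share. Jaccard similarity is my choice for illustration only and is not part of this project; `jaccard` and `most_similar` are hypothetical helper names. The tags are copied from the scraped dictionary shown earlier.

```python
# Category tags copied from the scraped dictionary shown earlier.
meal_tags = {
    "Steak Wellington": {"Milk", "Eggs", "Wheat"},
    "Adobo Chicken Enchiladas": {"Milk", "Wheat"},
    "Cod al Cartoccio": {"Fish", "Calorie-Conscious", "Carb-Conscious"},
}

def jaccard(a, b):
    """Shared tags divided by all tags across both meals (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(meal, tags=meal_tags):
    """Rank the other meals by tag overlap with `meal`, highest first."""
    scores = [(jaccard(tags[meal], t), other) for other, t in tags.items() if other != meal]
    return sorted(scores, reverse=True)

print(most_similar("Steak Wellington"))
```

With only allergen and diet tags, many meals tie or score zero, which is why more descriptive columns would be needed for this approach to be useful.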

## Part 1b: Generating Random User Meal Ratings Data

Since I don't have access to the actual Home Chef ratings, I found another ratings data source online. This dataset holds ratings for restaurants, which is reasonably similar to what we are looking for. Since each review is broken out into three scores (overall, food, and service), I will do some quick algebra to convert them to a single rating on a 5-star scale (the Home Chef rating scale).

In [189]:
#we need a new table that will hold user_ids and reviews of menu items
d = pd.read_csv('RCData/rating_final.csv')
d['r'] = np.round((d.rating + d.food_rating + d.service_rating)*5/6)
d.head()

,userID,placeID,rating,food_rating,service_rating,r
0,U1077,135085,2,2,2,5.0
1,U1077,135038,2,2,1,4.0
2,U1077,132825,2,2,2,5.0
3,U1077,135060,1,2,2,4.0
4,U1068,135104,1,1,2,3.0
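
The conversion can be checked by hand. The sketch below assumes each of the three scores ranges from 0 to 2, which is consistent with the values shown, so their sum ranges from 0 to 6 and is rescaled by 5/6. The `to_five_star` helper is a hypothetical name, and plain Python `round` stands in for `np.round`; both round exact halves to the nearest even number.

```python
def to_five_star(rating, food_rating, service_rating):
    """Rescale the 0-6 sum of three 0-2 scores to a 0-5 scale."""
    total = rating + food_rating + service_rating
    # round(), like np.round, rounds exact halves to the nearest even number.
    return round(total * 5 / 6)

# How each possible sum maps onto the 5-star scale.
mapping = {total: round(total * 5 / 6) for total in range(7)}
print(mapping)

# Second row of the table above: scores 2, 2, 1.
print(to_five_star(2, 2, 1))
```

Sums of 2 and 3 both land on 2 stars (3 × 5/6 = 2.5 rounds down to 2), which may explain why 2.0 is the most common value in the distribution.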


Checking the distribution of reviews. Looks ok.

In [191]:
d.r.value_counts().sort_index()

0.0    193
1.0     43
2.0    312
3.0    178
4.0    142
5.0    293
Name: r, dtype: int64

I'll just grab the restaurants with the most reviews up to the number of meals on the available Home Chef menus (62 at the time I did this).

In [194]:
top_places = d.groupby('placeID').size().sort_values(ascending=False).index[:len(meals)]

Next I'll take just those top restaurants and the users that reviewed them. I'll put them into a pandas dataframe as below.

In [193]:
df_ratings = d[d.placeID.isin(top_places)]
df_ratings = df_ratings.pivot_table(values='r', columns='placeID', index='userID')
# arbitrary mapping: each restaurant simply stands in for one meal
df_ratings.columns = list(meals)
df_ratings.head()

userID,Farmhouse Fried Chicken,BBQ-Rubbed Crispy Chicken,Al Pastor Pork Tacos,Chicken Chopped Salad,Shrimp Scampi,Japanese Chicken,"Grilled Red Pepper, Roasted Fennel and Goat Cheese Salad",BBQ Shrimp Pizza,Neapolitan Pizza Margherita,Thai Fish Curry,...,Sun-Dried Tomato Pesto Spaghetti,Chicken and Roasted Beet Salad,Crispy Tofu with Chimichurri Aioli,Seasonal Fruit Basket,Pork Chop with Pine Nut & Parmesan Butter,Beet and Goat Cheese Farro Bowl,Frutti Tutti Smoothie,Chocolate Strawberry Coconut Smoothie,Korean Pork Medallions,Fuji Apple Salad with Everything Bagel Croutons
U1001,,,,,,,,4.0,2.0,,...,,,,,,2.0,,,,
U1002,,,,,,,,3.0,,,...,,,,,,2.0,,,2.0,
U1003,,4.0,,,,5.0,,4.0,,,...,5.0,,5.0,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,,5.0,
U1005,,,,,,,,,2.0,,...,,4.0,,,,,,,,


## Part 2a: Cross Collaborative Filtering, Tuning the Recommendation Engine

Now with the data ready, it's time to perform the recommendation prediction. There are several Python packages for doing this, but I'll build some functions by hand to do the work for the purposes of this project.

First I'll melt (unpivot) the data and drop the NaN values to make it easier to manipulate.

In [95]:
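df = df_ratings.reset_index()
# var_name must be a scalar string in recent pandas versions
df = pd.melt(df, id_vars=['userID'], var_name='meal')
# keep only the ratings that exist, and rename in one step to avoid
# modifying a filtered slice in place
df = df[df.value.notnull()].rename(columns={'value': 'rating'})
df.head()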

,userID,meal,rating
5,U1006,Farmhouse Fried Chicken,2.0
6,U1007,Farmhouse Fried Chicken,2.0
11,U1013,Farmhouse Fried Chicken,2.0
29,U1033,Farmhouse Fried Chicken,1.0
40,U1046,Farmhouse Fried Chicken,2.0


Here's a helper function to split the data into training and test sets.

In [93]:
def assign_to_set(df):
    # hold out ceil(20%) of each user's ratings for testing
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.loc[sampled_ids, 'for_testing'] = True  # .ix was removed in pandas 1.0
    return df

df['for_testing'] = False
grouped = df.groupby('userID', group_keys=False).apply(assign_to_set)
df_train = df[~grouped.for_testing]
df_test = df[grouped.for_testing]
print(df.shape)
print(df_train.shape)
print(df_test.shape)
# the two sets must not share any rows
assert len(df_train.index.intersection(df_test.index)) == 0

(834, 4)
(611, 4)
(223, 4)


We'll also need evaluation metrics:

In [96]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

In [97]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    
    ids_to_estimate = zip(df_test.userID, df_test.meal)
    estimated = np.array([estimate_f(u,m) for (u,m) in ids_to_estimate])
    real = df_test.rating.values
    return compute_rmse(estimated, real)

I'll check the evaluation function by just guessing 3 for every meal.

In [195]:
def my_estimate_function(userID, meal):
    return 3

In [196]:
print ('RMSE for my estimate function: %s' % evaluate(my_estimate_function))

RMSE for my estimate function: 1.70596421456
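
To make the metric concrete, here is a small worked example of RMSE for the same constant-3 baseline, written without NumPy. The four true ratings are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def rmse(y_pred, y_true):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical true ratings; the baseline always guesses 3.
y_true = [0, 2, 5, 5]
y_pred = [3] * len(y_true)

# Squared errors are 9, 1, 4, 4, so the mean is 4.5 and the RMSE is sqrt(4.5).
print(rmse(y_pred, y_true))
```

Because errors are squared before averaging, a guess of 3 for a meal rated 0 costs far more than the same guess for a meal rated 2.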


Next I'll create some different similarity functions to try:

In [100]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

In [101]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
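
One behaviour of the `euclidean` function is worth illustrating: pandas skips NaN values when summing, so two users with no meal in common end up with a distance of 0 and the maximum similarity of 1.0. The sketch below shows this on hypothetical profiles, next to a variant that returns 0.0 when there is no overlap. `euclidean_overlap` and its `min_overlap` parameter are my own illustration, not part of the project.

```python
import numpy as np
import pandas as pd

def euclidean(s1, s2):
    """Same definition as above: 1 / (1 + Euclidean distance)."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

def euclidean_overlap(s1, s2, min_overlap=1):
    """Hypothetical variant: return 0.0 when too few meals are co-rated."""
    diff = (s1 - s2).dropna()
    if len(diff) < min_overlap:
        return 0.0
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

meals = ["A", "B", "C", "D"]
u1 = pd.Series([5.0, 4.0, np.nan, np.nan], index=meals)
u2 = pd.Series([np.nan, np.nan, 1.0, 2.0], index=meals)  # no meal in common with u1
u3 = pd.Series([5.0, 2.0, np.nan, 3.0], index=meals)     # shares A and B with u1

print(euclidean(u1, u2))          # NaNs are skipped, so the distance collapses to 0
print(euclidean_overlap(u1, u2))  # guarded variant treats "no overlap" as no similarity
print(euclidean_overlap(u1, u3))  # distance 2 over the two shared meals
```

Whether such a guard improves RMSE on this dataset is untested here; it is shown only to make the edge case visible.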

Next, I'll create a class for the collaborative filtering recommender:

In [179]:
class CollabPearsonReco:
    """ Collaborative filtering using a custom sim(u,u'). """

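    def learn(self):
        """ Prepare datastructures for estimation. """
        
        # Note: profiles are built from the full df, so they include the
        # held-out test ratings; building them from df_train would avoid
        # leaking the target rating into the similarity.
        self.all_user_profiles = df.pivot_table('rating', index='meal', columns='userID')

    def estimate(self, userID, meal):
        """ Ratings weighted by correlation similarity. """
        
        user_condition = df_train.userID != userID
        meal_condition = df_train.meal == meal
        ratings_by_others = df_train.loc[user_condition & meal_condition]
        if ratings_by_others.empty: 
            return 3.0
        
        # set_index returns a new frame, which avoids mutating a slice of df_train
        ratings_by_others = ratings_by_others.set_index('userID')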
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[userID]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)
        
reco = CollabPearsonReco()
reco.learn()
print ('RMSE for CollabPearsonReco: %s' % evaluate(reco.estimate))



RMSE for CollabPearsonReco: 1.56373894206
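
The last step of `estimate` is a similarity-weighted average of other users' ratings. Here is a minimal sketch of that step in plain Python, assuming hypothetical neighbour ratings and similarities; `weighted_rating` is an illustrative name, not a function from the notebook.

```python
def weighted_rating(ratings, sims):
    """Similarity-weighted mean of other users' ratings for one meal."""
    # Keep only neighbours with positive similarity, as estimate() does.
    pairs = [(r, s) for r, s in zip(ratings, sims) if s > 0]
    if not pairs:
        # Fall back to the plain mean when no neighbour is similar.
        return sum(ratings) / len(ratings)
    return sum(r * s for r, s in pairs) / sum(s for _, s in pairs)

# Hypothetical neighbours: ratings for one meal and their similarity to the target user.
# (5 * 0.5 + 2 * 0.25 + 4 * 0.25) / (0.5 + 0.25 + 0.25) = 4.0
print(weighted_rating([5, 2, 4], [0.5, 0.25, 0.25]))
```

The most similar neighbour pulls the estimate toward their rating of 5, while the unweighted mean of the same ratings would be about 3.67.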


In [180]:
class CollabEuclidReco:
    """ Collaborative filtering using a custom sim(u,u'). """

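    def learn(self):
        """ Prepare datastructures for estimation. """
        
        # Note: profiles are built from the full df, so they include the
        # held-out test ratings; building them from df_train would avoid
        # leaking the target rating into the similarity.
        self.all_user_profiles = df.pivot_table('rating', index='meal', columns='userID')

    def estimate(self, userID, meal):
        """ Ratings weighted by Euclidean similarity. """
        
        user_condition = df_train.userID != userID
        meal_condition = df_train.meal == meal
        ratings_by_others = df_train.loc[user_condition & meal_condition]
        if ratings_by_others.empty: 
            return 3.0
        
        # set_index returns a new frame, which avoids mutating a slice of df_train
        ratings_by_others = ratings_by_others.set_index('userID')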
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[userID]
        sims = their_profiles.apply(lambda profile: euclidean(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)
    
    def recommend(self, userID, n=5):
        """ Recommend the top n meals this user has not rated yet. """
        
        # score every unrated meal with estimate() and keep the n best
        rated = set(df[df.userID == userID].meal)
        candidates = [m for m in self.all_user_profiles.index if m not in rated]
        scored = [(self.estimate(userID, m), m) for m in candidates]
        return sorted(scored, reverse=True)[:n]
        
reco = CollabEuclidReco()
reco.learn()
print ('RMSE for CollabEuclidReco: %s' % evaluate(reco.estimate))

RMSE for CollabEuclidReco: 1.37204082908


Euclidean similarity had a lower (better) RMSE than Pearson similarity (1.37 vs. 1.56), so I'll use the Euclidean function going forward. With a larger dataset, preferably of real ratings, I would tune the algorithm further to minimize RMSE.

## Part 2b: Using the Recommendation Engine to Make Predictions

Next, I'll look at a random user to see how well the model can predict something the user has already rated. In this case, we will look at the prediction for the "Pork Shumai Meatballs".

In [183]:
df[df.userID=='U1025']

,userID,meal,rating
884,U1025,BBQ Shrimp Pizza,0.0
1007,U1025,Neapolitan Pizza Margherita,5.0
2852,U1025,Coconut Jasmine Rice Bowl,2.0
4082,U1025,Pork Shumai Meatballs,2.0
5558,U1025,Strawberry Colada Smoothie,5.0
6173,U1025,Baja Fish Tacos,5.0


In [184]:
print("Estimated Rating: ", reco.estimate(userID='U1025',meal='Pork Shumai Meatballs'))
print("Actual Rating: ", df[(df.userID=='U1025') & (df.meal=='Pork Shumai Meatballs')].rating.item())

Estimated Rating:  1.96918767403
Actual Rating:  2.0


Pretty good: in this case, the model's estimate is very close to the user's actual rating. This is only one case of many, but it is an encouraging sign.

Now I'll take a user and produce a ranked list of recommendations for the meals they should try next. For this case, I'm going to convert the dataframe into a dictionary for easier manipulation.

In [None]:
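# pivot to meals x users, then build {userID: {meal: rating}} without NaN entries
rating_dict = df.pivot_table('rating', index='meal', columns='userID')
def to_dict_dropna(data):
    # DataFrame.items() yields (column name, Series) pairs;
    # pandas.compat.iteritems is not available in recent pandas versions
    return dict((k, v.dropna().to_dict()) for k, v in data.items())
rating_dict = to_dict_dropna(rating_dict)
rating_dict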
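
The resulting structure maps each user to a dictionary of the meals they rated. As a rough illustration of that shape, the sketch below builds the same kind of nested dictionary from a few (userID, meal, rating) rows without pandas. The three rows are taken from tables shown in this notebook; building the dictionary this way is an alternative for illustration, not what the cell above does.

```python
# A few (userID, meal, rating) rows in the same long format as df.
rows = [
    ("U1025", "Baja Fish Tacos", 5.0),
    ("U1025", "Pork Shumai Meatballs", 2.0),
    ("U1003", "Japanese Chicken", 5.0),
]

rating_dict = {}
for user, meal, rating in rows:
    # Missing ratings never appear as rows, so there are no NaN values to drop.
    rating_dict.setdefault(user, {})[meal] = rating

print(rating_dict)
```

Looking up `rating_dict["U1025"]` then gives only the meals that user actually rated, which is what the similarity function below relies on.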

I also need to rewrite the Euclidean distance function for the dictionary data structure.

In [168]:
from math import sqrt

def sim_distance(prefs, person1, person2):
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
            
    if len(si) == 0: return 0
    
    sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si])
    
    return 1/(1+sqrt(sum_of_squares))
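
Here is a short usage sketch of the dictionary-based similarity on a tiny hypothetical `prefs` dictionary. The function is restated in slightly condensed form so the block runs on its own; the users and ratings are made up for illustration.

```python
from math import sqrt

def sim_distance(prefs, person1, person2):
    """Euclidean similarity over the meals both people have rated."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[person1][i] - prefs[person2][i]) ** 2 for i in shared)
    return 1 / (1 + sqrt(sum_of_squares))

# Hypothetical users and ratings, for illustration only.
prefs = {
    "U1": {"Baja Fish Tacos": 5.0, "Shrimp Scampi": 2.0},
    "U2": {"Baja Fish Tacos": 5.0, "Shrimp Scampi": 5.0},
    "U3": {"Steak au Poivre": 4.0},
}

print(sim_distance(prefs, "U1", "U2"))  # distance 3 -> 1 / (1 + 3)
print(sim_distance(prefs, "U1", "U3"))  # no shared meals -> 0
```

Unlike the pandas version, this function returns 0 when two users share no meals, so such users contribute nothing to the weighted recommendations.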

Now I'll write a function to get a ranked list of recommendations.

In [186]:
# Gets recommendations for a person by using a weighted average
# of every other user's ratings
def getRecommendations(prefs,userID,similarity=sim_distance, n=5):
    totals={}
    simSums={}
    for other in prefs:
        # don't compare me to myself
        if other==userID: continue
        sim=similarity(prefs,userID,other)

        # ignore scores of zero or lower
        if sim <= 0: continue
        for item in prefs[other]:

            # only score meals the user hasn't rated yet
            # (0 is a valid rating on this scale, so it is not treated as "unrated")
            if item not in prefs[userID]:
                # Similarity * Score
                totals.setdefault(item,0)
                totals[item] += prefs[other][item]*sim
                # Sum of similarities
                simSums.setdefault(item,0)
                simSums[item]+=sim

    # Create the normalized list
    rankings=[(total/simSums[item],item) for item,total in totals.items(  )]

    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings[:n]

In [188]:
getRecommendations(rating_dict,'U1025', n=10)

[(4.2316027367034543, 'Steak Moutarde'),
 (3.942810114744542, 'Sun-Dried Tomato Pesto Spaghetti'),
 (3.8095238095238098, 'French Mustard-Thyme Butter Pork Chop'),
 (3.5625, 'Healthy Takeout Mongolian Beef'),
 (3.5263157894736841, 'Piedmont Chicken Breast'),
 (3.4630193016579667, 'Seasonal Fruit Basket'),
 (3.3250691052501495, 'Mango Tango Smoothie '),
 (3.318956467231482, 'Chicken and Roasted Beet Salad'),
 (3.2899375914569067, 'Roasted Chicken with Patatas Bravas'),
 (3.2601342338692798, 'Swiss Fondue Burger')]

And now we have a way to suggest which meals a user might like to try next. The higher the score, the more likely we believe they are to rate the meal highly. Since the ratings came from another dataset, the recommendations don't necessarily make sense given what the user previously rated. However, with actual user data, we would be able to predict preferences that make more sense.