In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
%matplotlib inline

In [2]:
raw_data = pd.read_csv('epi_r.csv')

In [3]:
list(raw_data.columns)

['title',
 'rating',
 'calories',
 'protein',
 'fat',
 'sodium',
 '#cakeweek',
 '#wasteless',
 '22-minute meals',
 '3-ingredient recipes',
 '30 days of groceries',
 'advance prep required',
 'alabama',
 'alaska',
 'alcoholic',
 'almond',
 'amaretto',
 'anchovy',
 'anise',
 'anniversary',
 'anthony bourdain',
 'aperitif',
 'appetizer',
 'apple',
 'apple juice',
 'apricot',
 'arizona',
 'artichoke',
 'arugula',
 'asian pear',
 'asparagus',
 'aspen',
 'atlanta',
 'australia',
 'avocado',
 'back to school',
 'backyard bbq',
 'bacon',
 'bake',
 'banana',
 'barley',
 'basil',
 'bass',
 'bastille day',
 'bean',
 'beef',
 'beef rib',
 'beef shank',
 'beef tenderloin',
 'beer',
 'beet',
 'bell pepper',
 'berry',
 'beverly hills',
 'birthday',
 'biscuit',
 'bitters',
 'blackberry',
 'blender',
 'blue cheese',
 'blueberry',
 'boil',
 'bok choy',
 'bon appétit',
 'bon app��tit',
 'boston',
 'bourbon',
 'braise',
 'bran',
 'brandy',
 'bread',
 'breadcrumbs',
 'breakfast',
 'brie',
 'brine',
 'brisk

In [4]:
raw_data.rating.describe()

count    20052.000000
mean         3.714467
std          1.340829
min          0.000000
25%          3.750000
50%          4.375000
75%          4.375000
max          5.000000
Name: rating, dtype: float64

In [5]:
# sns.set_style('darkgrid')
# raw_data.rating.hist(bins=20)
# plt.title('Histogram of Recipe Ratings');

So a few things are shown in this histogram. First, there are sharp discontinuities: the data are not continuous. No recipe has a 3.5 rating, for example. We also see the anticipated spike at 0.
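Since the plotting cell is commented out, here is a small sketch of another way to see the discreteness. It takes the distinct rating values that `value_counts()` reports further down in this notebook and checks whether they all sit on one evenly spaced grid. The names `observed`, `step` and `on_grid` are just illustrative.

```python
import numpy as np

# Distinct rating values, copied from the value_counts() output later in this notebook.
observed = np.array([0.0, 1.25, 1.875, 2.5, 3.125, 3.75, 4.375, 5.0])

# Gaps between neighbouring values; the smallest gap is a candidate grid step.
gaps = np.diff(np.sort(observed))
step = gaps.min()

# True if every observed rating is a whole multiple of that step.
on_grid = np.allclose(observed / step, np.round(observed / step))

# A rating of 3.5 would not be a whole multiple of the step.
is_3_5_possible = np.isclose(3.5 / step, np.round(3.5 / step))

print(step, on_grid, is_3_5_possible)
```

If the ratings really are averages snapped to a fixed grid, treating the target as a handful of classes is at least as natural as treating it as a continuous number.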

Let's try a naive approach again, this time using SVM Regressor. But first, we'll have to do a bit of data cleaning.

In [6]:
# Count nulls 
null_count = raw_data.isnull().sum()
null_count[null_count>0]

calories    4117
protein     4162
fat         4183
sodium      4119
dtype: int64

What we can see right away is that nutrition information is not available for all recipes. Now this would be an interesting data point, but let's focus on ingredients and keywords right now. So we'll drop the calories, protein, fat, and sodium columns entirely. We'll come back to nutrition information later.

In [7]:
# Took too long to keep restarting so I just commented out the long processing code. 
from sklearn.svm import SVR
svr = SVR()
# Pass axis by keyword; the positional form was removed in pandas 2.0.
X = raw_data.drop(['rating', 'title', 'calories', 'protein', 'fat', 'sodium'], axis=1)
Y = raw_data.rating
# svr.fit(X,Y)

Note that this actually takes quite a while to run, compared to some of the models we've done before. Be patient. It's because of the size of the data: kernel SVMs scale poorly with the number of rows, and we have a lot of features as well.
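If the full fit is too slow to iterate on, one option is to fit on a random subsample of rows first. The sketch below uses synthetic stand-in data, not the recipe file; `X_demo`, `y_demo` and the subsample size of 300 are assumptions chosen only to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 binary "tag" columns and a rating-like target.
X_demo = pd.DataFrame(
    rng.integers(0, 2, size=(2000, 50)).astype(float),
    columns=[f"tag_{i}" for i in range(50)],
)
y_demo = pd.Series(rng.choice([0.0, 3.75, 4.375, 5.0], size=2000))

# Draw a reproducible random subsample of rows and the matching targets.
X_small = X_demo.sample(n=300, random_state=0)
y_small = y_demo.loc[X_small.index]

# Fit the default SVR on the subsample only.
svr_demo = SVR()
svr_demo.fit(X_small, y_small)

# Predictions can still be made for any rows with the same columns.
preds = svr_demo.predict(X_demo.iloc[:5])
print(preds)
```

A model fit this way is only a rough preview, but it is usually enough to tell whether an approach is worth the full run.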

Let's see what a scatter plot looks like, comparing actuals to predicted.

In [8]:
#  plt.scatter(Y, svr.predict(X));

In [9]:
# svr.score(X,Y)

In [10]:
from sklearn.model_selection import cross_val_score
# cross_val_score(svr, X, Y, cv=5)

Oh dear, so this did not seem to work very well. In fact it is remarkably poor. Now there are many things that we could do here.

First, the overfit is a problem, even though performance was poor in the first place. We could go back and clean up our feature set. There might be some gains to be made by getting rid of the noise.

We could also see how removing the nulls but including dietary information performs. Though it's a slight change to the question, we could still possibly get some improvements there.

Lastly, we could take our regression problem and turn it into a classifier. With this number of features and a discontinuous outcome, we might have better luck thinking of this as a classification problem. We could make it simpler still: instead of classifying on each possible value, group the ratings into two buckets, high and low, at some chosen cut-off.

# And that is your challenge.

Transform this regression problem into a binary classifier and clean up the feature set. You can choose whether or not to include nutritional information, but try to cut your feature set down to the 30 most valuable features.

Good luck!

When you've finished that, also take a moment to think about bias. Is there anything in this dataset that makes you think it could be biased, perhaps extremely so?

There is. Several things in fact, but the most glaring is that we don't actually have a random sample. It could be, and probably is, that the people more likely to choose some kinds of recipes are more likely to give high reviews.

After all, people who eat chocolate might just be happier people.

In [11]:
# Drop every row that has a missing value (this removes recipes without nutrition information).
raw_data = raw_data.dropna()

In [12]:
# Check that we didn't lose too much data after dropna.
# Count is 15,864 compared to 20,052 without dropping data.
print(raw_data.rating.describe())

# Check the breakdown of the ratings.
print('\n')
print(raw_data['rating'].value_counts())

count    15864.000000
mean         3.760952
std          1.285518
min          0.000000
25%          3.750000
50%          4.375000
75%          4.375000
max          5.000000
Name: rating, dtype: float64


4.375    6552
3.750    4136
5.000    2106
0.000    1296
3.125    1165
2.500     405
1.250     123
1.875      81
Name: rating, dtype: int64


In [13]:
# Create a new categorical column called 'Good_Rating'
raw_data['Good_Rating'] = np.where(raw_data['rating']>4.0,1,0)

In [14]:
# Check to see data type.
raw_data.dtypes

title                     object
rating                   float64
calories                 float64
protein                  float64
fat                      float64
sodium                   float64
#cakeweek                float64
#wasteless               float64
22-minute meals          float64
3-ingredient recipes     float64
30 days of groceries     float64
advance prep required    float64
alabama                  float64
alaska                   float64
alcoholic                float64
almond                   float64
amaretto                 float64
anchovy                  float64
anise                    float64
anniversary              float64
anthony bourdain         float64
aperitif                 float64
appetizer                float64
apple                    float64
apple juice              float64
apricot                  float64
arizona                  float64
artichoke                float64
arugula                  float64
asian pear               float64
          

In [15]:
# Split the data.
# Pass axis by keyword; the positional form was removed in pandas 2.0.
X = raw_data.drop(['title', 'rating', 'Good_Rating'], axis=1)
Y = raw_data['Good_Rating']

In [34]:
from sklearn.feature_selection import SelectKBest, f_classif
# Use SelectKBest to obtain the top 30 features.
select_k = SelectKBest(f_classif, k=30)

# Fit the data
fit = select_k.fit(X, Y)

# Get the new x data points from the selection
selected_features = fit.get_support(indices=True)

# Match up points to the names
k_features = X[X.columns[selected_features]]

  f = msb / msw
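The `f = msb / msw` fragment above looks like the tail of a runtime warning from `f_classif`, which typically shows up when some columns are constant, so the F-statistic divides by zero. That reading is an assumption, since the full warning text isn't shown. If it is right, one way to avoid it is to drop zero-variance columns before selecting. A minimal sketch on made-up data (`X_demo`, `y_demo` and the column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

rng = np.random.default_rng(0)

# Made-up binary tag columns plus one column that never varies.
X_demo = pd.DataFrame(
    rng.integers(0, 2, size=(200, 6)).astype(float),
    columns=[f"tag_{i}" for i in range(6)],
)
X_demo["always_zero"] = 0.0
y_demo = rng.integers(0, 2, size=200)

# VarianceThreshold with its default threshold removes zero-variance columns.
vt = VarianceThreshold()
vt.fit(X_demo)
kept = X_demo.columns[vt.get_support()]
X_var = X_demo[kept]

# Univariate selection now only sees columns that actually vary.
selector = SelectKBest(f_classif, k=3).fit(X_var, y_demo)
top = list(kept[selector.get_support()])
print(top)
```

With the real data the same two steps would run on `X` and `Y` with `k=30`.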


In [35]:
from sklearn.svm import SVC

# Build the SVC model.
svc = SVC()

# Fit the data.
svc.fit(k_features, Y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
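One caveat before cross-validating: `select_k` above was fit on every row, so the rows later used as held-out folds already helped pick the 30 features. Scores computed that way can be a little optimistic. A sketch of the usual alternative is to put the selector and the classifier in one pipeline, so selection is redone inside each training fold. The data here is synthetic (`make_classification`), and `pipe`, `X_demo` and `y_demo` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the recipe features and the binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=100, n_informative=10, random_state=0
)

# Selection and classification are fit together on each training fold,
# so held-out rows never influence which features are kept.
pipe = make_pipeline(SelectKBest(f_classif, k=30), SVC())

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores)
```

On the real data this would mean passing the full `X` (not `k_features`) together with `Y` to `cross_val_score`.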

In [36]:
from sklearn.model_selection import cross_val_score

# Estimate accuracy with 5-fold cross-validation.
cross_val_score(svc, k_features, Y, cv=5)

array([0.56679269, 0.58808698, 0.56949259, 0.57755359, 0.57313997])
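These scores are easier to read next to a baseline. From the `value_counts()` output above, the ratings above 4.0 are 4.375 and 5.0, so 8,658 of the 15,864 recipes (about 54.6%) are in the "good" class. The sketch below rebuilds a target with those class counts and scores a classifier that always predicts the majority class. The feature matrix `X_dummy` is a placeholder, since the dummy model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Class counts taken from the value_counts() output in this notebook.
n_good = 6552 + 2106                              # ratings 4.375 and 5.0
n_other = 4136 + 1296 + 1165 + 405 + 123 + 81     # everything at or below 4.0

# Rebuild a target vector with the same class balance.
y_demo = np.array([1] * n_good + [0] * n_other)

# Placeholder features: the dummy classifier never looks at them.
X_dummy = np.zeros((len(y_demo), 1))

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_dummy, y_demo, cv=5)
print(baseline_scores.mean())
```

Any model worth keeping should clear this baseline by a comfortable margin.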

# Write Up 

Even after selecting the 30 best features with SelectKBest, the model's accuracy is consistent across folds but low, at around 57%. That is barely better than always predicting a good rating, since about 55% of the recipes have a rating above 4.0. Reducing dimensionality with PCA instead of univariate selection might have given the classifier better inputs. I probably should have plotted a correlation map to check that the selected columns weren't multicollinear.
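A rough sketch of that correlation check, on made-up columns rather than the real `k_features`: compute absolute pairwise correlations, keep the upper triangle so each pair appears once, and list the pairs above a cut-off. The 0.8 cut-off and the `tag_*` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up binary columns: tag_b is nearly a copy of tag_a, tag_c is independent.
base = rng.integers(0, 2, size=500).astype(float)
demo = pd.DataFrame({
    "tag_a": base,
    "tag_b": base.copy(),
    "tag_c": rng.integers(0, 2, size=500).astype(float),
})
# Flip the first ten values of tag_b so the two columns are not identical.
demo.loc[:9, "tag_b"] = 1 - demo.loc[:9, "tag_b"]

# Absolute pairwise correlations.
corr = demo.corr().abs()

# Keep only the upper triangle (above the diagonal) so each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the assumed cut-off are candidates for dropping one column.
high = pairs[pairs > 0.8]
print(high)
```

The same matrix `corr` could be passed to `sns.heatmap` for the visual version.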

Possible sources of bias in the dataset:
- The recipes aren't categorized by type (e.g. entree, snack, dessert).
- The dataset doesn't include a count of each type of recipe, so I don't know whether the different categories are equally represented.
- The dataset could also be biased by the author of the book, who may only know certain types of food and favor ingredients and recipes suited to their own taste.
