In [None]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Now it's time for another guided example. This time we're going to look at recipes. Specifically, we'll use the Epicurious dataset, which has a collection of recipes, key terms and ingredients, and their ratings.

What we want to see is if we can use the ingredient and keyword list to predict the rating. For someone writing a cookbook this could be really useful information that could help them choose which recipes to include because they're more likely to be enjoyed and therefore make the book more likely to be successful.

First let's load the dataset. It's [available on Kaggle](https://www.kaggle.com/hugodarwood/epirecipes). We'll use the csv file here and pull out the column names and some summary statistics for ratings.

In [None]:
raw_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/epi_r.csv')

In [None]:
list(raw_data.columns)

In [None]:
raw_data.rating.describe()

We learn a few things from this analysis. From a ratings perspective, there are just over 20,000 recipes with an average rating of 3.71. What is interesting is that the 25th percentile is actually above the mean. This suggests that a cluster of very low ratings is pulling the mean down. That makes sense when we think about reviews: some bad recipes may have only a few reviews, all of them very low.
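To see how a low-end cluster can push the mean below the 25th percentile, here is a minimal sketch on made-up numbers. The `toy_ratings` series is hypothetical and only mimics the shape described above; it is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ratings: mostly high scores plus a small cluster of zeros.
toy_ratings = pd.Series([0.0] * 2 + [3.75] * 3 + [4.375] * 4 + [5.0] * 1)

summary = toy_ratings.describe()

# The zeros drag the mean down while the quartiles barely move.
print(summary[['mean', '25%', '50%']])
print("Share of zero ratings:", (toy_ratings == 0).mean())
```

On the real data, the same check would be `(raw_data.rating == 0).mean()`.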

Let's validate the idea a bit further with a histogram.

In [None]:
raw_data.rating.hist(bins=20)
plt.title('Histogram of Recipe Ratings')
plt.show()

So a few things are shown in this histogram. Firstly, there are sharp discontinuities. We don't have continuous data. No recipe has a 3.5 rating, for example. We also see the anticipated spike at 0.
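A histogram can hide exactly which values occur. A quick way to confirm that the outcome is discrete is to count the distinct values. The sketch below uses a hypothetical `toy_ratings` series as a stand-in for `raw_data.rating`.

```python
import pandas as pd

# Hypothetical stand-in for raw_data.rating.
toy_ratings = pd.Series([0.0, 3.75, 4.375, 4.375, 5.0, 3.75, 4.375])

# Count how often each distinct rating value occurs, ordered by value.
counts = toy_ratings.value_counts().sort_index()
print(counts)
print("Distinct values:", toy_ratings.nunique())
```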

Let's try a naive approach again, this time using an SVM regressor. But first, we'll have to do a bit of data cleaning.

In [None]:
# Count nulls 
null_count = raw_data.isnull().sum()
null_count[null_count>0]

What we can see right away is that nutrition information is not available for all recipes. Now this would be an interesting data point, but let's focus on ingredients and keywords right now. So we'll drop the calories, protein, fat, and sodium columns entirely. We'll come back to nutrition information later.

In [None]:
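from sklearn.svm import SVR
svr = SVR()
# Drop the target, the title, and the nutrition columns, then work on a 30% sample.
# Using the same random_state for X and Y keeps the sampled rows aligned.
X = raw_data.drop(['rating', 'title', 'calories', 'protein', 'fat', 'sodium'], axis=1).sample(frac=0.3, replace=True, random_state=1)
Y = raw_data.rating.sample(frac=0.3, replace=True, random_state=1)
svr.fit(X,Y)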

__Note that this actually takes quite a while to run, compared to some of the models we've done before. Around 5-7 mins. Be patient.__ It's because of the number of features we have.

Let's see what a scatter plot looks like, comparing actuals to predicted.

In [None]:
plt.scatter(Y, svr.predict(X))

Now that is a pretty useless visualization. This is because of the discontinuous nature of our outcome variable: the points stack up in vertical bands at each rating value, and there's too much data to see what's going on. If you wanted to look more closely you could create histograms of the predictions. Here we'll move on to the scores of both our full fit model and with cross validation. Again, if you choose to rerun it, it will take some time, so you probably shouldn't.
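As an alternative to the scatter plot, the predictions can be summarized per actual rating value. The sketch below runs on synthetic data: `actual` and `predicted` are made-up arrays standing in for `Y` and `svr.predict(X)`, so the numbers say nothing about the real model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: discrete "actual" ratings and noisy "predictions".
actual = rng.choice([0.0, 3.75, 4.375, 5.0], size=200)
predicted = 4.0 + rng.normal(scale=0.2, size=200)

comparison = pd.DataFrame({'actual': actual, 'predicted': predicted})

# One row of summary statistics per distinct actual rating.
per_rating = comparison.groupby('actual')['predicted'].describe()
print(per_rating[['count', 'mean', 'std']])
```

If the model had learned anything useful, the mean prediction would rise with the actual rating. Calling `comparison.hist(column='predicted', by='actual')` would draw the corresponding histograms.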

In [None]:
svr.score(X, Y)

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(svr, X, Y, cv=5)

Oh dear, so this did not seem to work very well. In fact it is remarkably poor. Now there are many things that we could do here.

Firstly, the overfitting is a problem, even though the fit was poor in the first place. We could go back and clean up our feature set. There might be some gains to be made by getting rid of the noise.

We could also see how removing the nulls but including dietary information performs. Though it's a slight change to the question, we could still possibly get some improvements there.
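Keeping the nutrition columns means dropping the rows where they are missing. A minimal sketch of that step, assuming a small made-up frame (`toy`) with two of the nutrition columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names echo the real dataset.
toy = pd.DataFrame({
    'rating': [4.375, 0.0, 5.0, 3.75],
    'calories': [426.0, np.nan, 165.0, np.nan],
    'sodium': [559.0, 1439.0, 30.0, np.nan],
    'bake': [1.0, 0.0, 1.0, 0.0],
})

nutrition_cols = ['calories', 'sodium']

# Keep only the rows where every nutrition column is present.
complete = toy.dropna(subset=nutrition_cols)
print(f"Kept {len(complete)} of {len(toy)} rows")
```

It is worth checking how many rows survive before modeling, since the remaining recipes may no longer be representative of the full set.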

Lastly, we could take our regression problem and turn it into a classifier. With this number of features and a discontinuous outcome, we might have better luck thinking of this as a classification problem. We could make it simpler still: instead of classifying on each possible value, group the ratings into high and low classes at some chosen cutoff.

__And that is your challenge.__

Transform this regression problem into a binary classifier and clean up the feature set. You can choose whether or not to include nutritional information, but try to cut your feature set down to the 30 most valuable features.

Good luck!

### Robin's modeling

#### Problem statement

Transform this regression problem into a binary classifier and clean up the feature set. You can choose whether or not to include nutritional information, but try to cut your feature set down to the 30 most valuable features.

My client is a cookbook author. The author is looking for "...useful information that could help them choose which recipes to include because they're more likely to be enjoyed and therefore make the book more likely to be successful."

This implies that I need to know the keywords associated with favorite recipes (the keywords can't be obscured by something like PCA).

I am going to make a binary classification model. I will classify a recipe as having a score below 4.0 or a score of 4.0 and above.

Given the problem statement, keywords that describe geographic locations are not particularly valuable, so I will drop those. Keywords that are more likely to have value in this context are ingredients and recipe type.
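One way to cut down the feature set without obscuring keyword names is univariate selection. The sketch below is not the approach taken in this notebook. It assumes binary keyword columns and uses scikit-learn's `SelectKBest` with the `chi2` score on a synthetic frame; the column names and the toy target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
n = 500

# Synthetic 0/1 keyword matrix (hypothetical column names).
toy_X = pd.DataFrame(
    rng.integers(0, 2, size=(n, 6)),
    columns=['bake', 'winter', 'onion', 'egg', 'garlic', 'fruit'],
).astype(float)

# By construction, the toy target depends only on 'bake' and 'winter'.
toy_y = ((toy_X['bake'] + toy_X['winter'] + rng.random(n)) > 1.5).astype(int)

# Keep the k columns with the highest chi-squared score against the target.
selector = SelectKBest(chi2, k=2).fit(toy_X, toy_y)
kept = toy_X.columns[selector.get_support()].tolist()
print(kept)
```

Because `get_support()` returns a mask over the original columns, the selected features keep their keyword names, which is what the cookbook author needs.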

In [2]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
recipe_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/epi_r.csv', nrows = 6500 )

In [4]:
recipe_data['target'] = np.where(recipe_data['rating'] >= 4.0, 1, 0)

In [5]:
#Put columns names into a data frame to export to Excel for easier editing, then load edited version
#kwords_df = pd.DataFrame(recipe_data.columns, columns = ['keyword'])
#export_kwords = kwords_df.to_excel (r'C:\Users\Robin\src\bootcamp\unit_3\epi_keywords.xlsx', index = None, header = False) 
keep_cols_df = pd.read_excel('../Datafiles/epi_keywords_edited.xlsx')
keep_cols_list = keep_cols_df['keyword'].values.tolist()

In [6]:
ratings_df = recipe_data.reindex(columns = keep_cols_list)

In [7]:
ratings_df.head()

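| | target | almond | appetizer | apple | apricot | artichoke | arugula | asian pear | asparagus | avocado | ... | wine | winter | wok | yellow squash | yogurt | yuca | zucchini | leftovers | snack | turkey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |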


In [8]:
ratings_df.shape

(6500, 501)

In [9]:
assert pd.notnull(ratings_df).all().all()

In [10]:
#Set the target variable and drop it from the main dataset
y = ratings_df.target
X = ratings_df.drop('target', axis = 1)

In [11]:
# Drop keyword columns that never occur in this sample (all-zero columns)
to_drop = X.columns[X.sum() == 0]
X_reduced = X.drop(to_drop, axis=1)

In [12]:
X_reduced.shape

(6500, 476)

In [13]:
#Filter the columns further by dropping those that are correlated
# Calculate the correlation matrix and take the absolute value
corr_matrix = X_reduced.corr().abs()

# Create a True/False mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
X_corr = corr_matrix.mask(mask)


# List column names of highly correlated features (r > 0.85). 
to_drop = [c for c in X_corr.columns if any(X_corr[c] >  0.85)]

# Drop the features in the to_drop list
X_reduced = X_reduced.drop(to_drop, axis=1)

print("The reduced dataframe has {} columns.".format(X_reduced.shape[1]))

The reduced dataframe has 471 columns.
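The masking step can be hard to follow on 476 columns, so here is the same idea on a three-column toy frame. The frame `toy` is made up for illustration, and `np.ones(corr.shape, ...)` is used in place of `np.ones_like` only to keep the sketch explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: 'b' duplicates 'a', 'c' is only weakly related.
toy = pd.DataFrame({
    'a': [1, 0, 1, 0, 1, 0],
    'b': [1, 0, 1, 0, 1, 0],
    'c': [1, 1, 0, 0, 1, 0],
})

corr = toy.corr().abs()

# Hide the diagonal and upper triangle so each pair is inspected only once.
upper = np.triu(np.ones(corr.shape, dtype=bool))
lower_only = corr.mask(upper)

# A column is dropped if it is highly correlated with any later column.
to_drop = [c for c in lower_only.columns if (lower_only[c] > 0.85).any()]
print(to_drop)
```

Only one member of each correlated pair is removed, here the earlier column `a`, so the information it carries stays in the data through `b`.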


In [36]:
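from sklearn.feature_selection import VarianceThreshold 

# For a 0/1 feature with P(1) = p, the variance is p * (1 - p).
# This threshold drops keywords that take the same value in more than 92% of recipes.
thresholder = VarianceThreshold(threshold=(0.92 * (1 - 0.92)))
thresholder.fit(X_reduced)

mask = thresholder.get_support()
X_reduced_2 = X_reduced.loc[:, mask] 
print(X_reduced_2.shape)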


(6500, 30)
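The threshold `0.92 * (1 - 0.92)` comes from the variance of a binary feature, which is `p * (1 - p)` when a fraction `p` of rows are 1. A small sketch on made-up columns (`rare` and `common` are hypothetical names) shows which side of the cutoff each falls on:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical binary features: 'rare' is 1 in 5% of rows, 'common' in 30%.
toy = pd.DataFrame({
    'rare': [1.0] * 5 + [0.0] * 95,
    'common': [1.0] * 30 + [0.0] * 70,
})

p = 0.92
selector = VarianceThreshold(threshold=p * (1 - p)).fit(toy)

# Population variances: 0.05 * 0.95 = 0.0475 and 0.30 * 0.70 = 0.21.
print(toy.var(ddof=0).round(4).to_dict())

kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

This filter never looks at the target, so it keeps frequent keywords rather than predictive ones. Reaching exactly 30 columns depends on the 0.92 cutoff chosen for this sample.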


In [37]:
features = X_reduced_2
target = y

In [38]:
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

X = features
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 8)

model = svm.SVC()

model.fit(X_train, y_train)
predicted = model.predict(X_test)

print("SVC accuracy score: ")
print(accuracy_score(y_test, predicted))



SVC accuracy score: 
0.543846153846
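An accuracy figure is easier to judge next to a baseline that ignores the features entirely. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels. The class split is an assumption for illustration, not the split in this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 55% of recipes in the "high" class in both splits.
toy_y_train = np.array([1] * 220 + [0] * 180)
toy_y_test = np.array([1] * 55 + [0] * 45)

# The dummy model ignores features, so placeholder columns are enough.
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predict the most frequent training class.
baseline = DummyClassifier(strategy='most_frequent').fit(toy_X_train, toy_y_train)
baseline_acc = accuracy_score(toy_y_test, baseline.predict(toy_X_test))
print("Baseline accuracy:", baseline_acc)
```

On the real data, fitting the same baseline on `X_train, y_train` and scoring it on `X_test, y_test` would show how much of the SVC's accuracy comes from the features rather than from the class balance.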


In [39]:
#Now try random forest
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_predicted = rfc.predict(X_test)

print("Random forest Accuracy score: ")
print(accuracy_score(y_test, rfc_predicted))

Random forest Accuracy score: 
0.560769230769
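A single train/test split gives one noisy estimate. The cell above imports `cross_val_score`, and a sketch of how it might be used with a random forest follows. The data come from `make_classification`, so they are synthetic, and `n_estimators=50` and `random_state=8` are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 30-feature keyword matrix and binary target.
toy_X, toy_y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=8
)

rfc_cv = RandomForestClassifier(n_estimators=50, random_state=8)

# Five-fold cross-validated accuracy: one score per held-out fold.
scores = cross_val_score(rfc_cv, toy_X, toy_y, cv=5, scoring='accuracy')
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

The spread across folds gives a sense of whether a gap like the one between the SVC and the random forest is larger than fold-to-fold noise.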




In [40]:
feature_importances = pd.DataFrame(
    rfc.feature_importances_,
    index=X_train.columns,
    columns=['importance'],
).sort_values('importance', ascending=False)

In [41]:
# Take the running total of the importance column only, so the cell can be re-run safely
feature_importances['cum_importance'] = feature_importances['importance'].cumsum()
print(feature_importances.head(40))

                   importance  cum_importance
gourmet              0.067099        0.067099
quick & easy         0.056807        0.123906
winter               0.045502        0.169408
bake                 0.043164        0.212572
fall                 0.039938        0.252510
onion                0.039672        0.292182
vegetarian           0.039437        0.331619
vegetable            0.039078        0.370697
kid-friendly         0.037446        0.408143
wheat/gluten-free    0.036886        0.445029
fruit                0.035669        0.480698
sauté                0.035189        0.515887
side                 0.035114        0.551001
spring               0.033222        0.584224
summer               0.033192        0.617415
healthy              0.033178        0.650593
egg                  0.032185        0.682778
garlic               0.031650        0.714428
sugar conscious      0.030124        0.744552
milk/cream           0.029705        0.774257
kidney friendly      0.026699     

### Discussion

When you've finished that, also take a moment to think about bias. Is there anything in this dataset that makes you think it could be biased, perhaps extremely so?

There is. Several things in fact, but the most glaring is that we don't actually have a random sample. It could be, and probably is, that the people more likely to choose some kinds of recipes are more likely to give high reviews.

After all, people who eat chocolate _might_ just be happier people.

#### Other sources of bias

The act of choosing some Tags and excluding others (tags are referred to as keywords above) to describe a recipe also introduces some bias.

It's worth noting that the Epicurious search feature has suggested categories that are included in the Tags listed above, such as 'Healthy', 'Quick and Easy', 'Gluten-free' and 'Vegetarian'.

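Reviewing the feature importances, 'gourmet' is the most important feature in the random forest. Importance measures how useful a feature is for splitting, not the direction of its effect, so whether people who choose 'gourmet' recipes are more likely to give higher ratings still needs to be checked directly.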
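One simple way to look at the direction of a tag's association with the target is to compare the share of high ratings with and without the tag. The sketch below uses a made-up frame, so its values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: a binary tag and the binary target.
toy = pd.DataFrame({
    'gourmet': [1, 1, 1, 0, 0, 0, 0, 1],
    'target':  [1, 1, 0, 0, 1, 0, 0, 1],
})

# Share of high-rated recipes among those without (0) and with (1) the tag.
share_high = toy.groupby('gourmet')['target'].mean()
print(share_high)
```

With the notebook's variables, the equivalent would be `ratings_df.groupby('gourmet')['target'].mean()`. A difference there is still only an association and is subject to the sampling bias discussed above.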