# Does this have nuts in it?

The CDC estimates that 1-2% of Americans are afflicted with a nut allergy, and I'm one of them. Specifically, I am allergic to walnuts, pecans, hazelnuts, macadamia nuts, and Brazil nuts.

After living with a nut allergy for several decades, I've developed a sixth sense for detecting when nuts are likely to be present in a particular recipe, but I still miss things.

Accordingly, I wanted to see if I could train a model to predict whether a recipe contains one of my allergens based on the recipe name alone.

Data was sourced from Kaggle. I started with a set of Epicurious recipes but then needed to expand the set to improve the model's performance.
* https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions
* https://www.kaggle.com/datasets/paultimothymooney/recipenlg

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy
from scipy.sparse import csr_matrix

I start by importing the data, which was sourced from Kaggle. It's a set of 13,501 Epicurious recipes.

*Source: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions*

In [2]:
raw = pd.read_csv('recipes.csv', index_col=0)
raw

Unnamed: 0,Title,Ingredients,Instructions,Image_Name,Cleaned_Ingredients
0,Miso-Butter Roast Chicken With Acorn Squash Pa...,"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher...","Pat chicken dry with paper towels, season all ...",miso-butter-roast-chicken-acorn-squash-panzanella,"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher..."
1,Crispy Salt and Pepper Potatoes,"['2 large egg whites', '1 pound new potatoes (...",Preheat oven to 400°F and line a rimmed baking...,crispy-salt-and-pepper-potatoes-dan-kluger,"['2 large egg whites', '1 pound new potatoes (..."
2,Thanksgiving Mac and Cheese,"['1 cup evaporated milk', '1 cup whole milk', ...",Place a rack in middle of oven; preheat to 400...,thanksgiving-mac-and-cheese-erick-williams,"['1 cup evaporated milk', '1 cup whole milk', ..."
3,Italian Sausage and Bread Stuffing,"['1 (¾- to 1-pound) round Italian loaf, cut in...",Preheat oven to 350°F with rack in middle. Gen...,italian-sausage-and-bread-stuffing-240559,"['1 (¾- to 1-pound) round Italian loaf, cut in..."
4,Newton's Law,"['1 teaspoon dark brown sugar', '1 teaspoon ho...",Stir together brown sugar and hot water in a c...,newtons-law-apple-bourbon-cocktail,"['1 teaspoon dark brown sugar', '1 teaspoon ho..."
...,...,...,...,...,...
13496,Brownie Pudding Cake,"['1 cup all-purpose flour', '2/3 cup unsweeten...",Preheat the oven to 350°F. Into a bowl sift to...,brownie-pudding-cake-14408,"['1 cup all-purpose flour', '2/3 cup unsweeten..."
13497,Israeli Couscous with Roasted Butternut Squash...,"['1 preserved lemon', '1 1/2 pound butternut s...",Preheat oven to 475°F.\nHalve lemons and scoop...,israeli-couscous-with-roasted-butternut-squash...,"['1 preserved lemon', '1 1/2 pound butternut s..."
13498,Rice with Soy-Glazed Bonito Flakes and Sesame ...,['Leftover katsuo bushi (dried bonito flakes) ...,"If using katsuo bushi flakes from package, moi...",rice-with-soy-glazed-bonito-flakes-and-sesame-...,['Leftover katsuo bushi (dried bonito flakes) ...
13499,Spanakopita,['1 stick (1/2 cup) plus 1 tablespoon unsalted...,Melt 1 tablespoon butter in a 12-inch heavy sk...,spanakopita-107344,['1 stick (1/2 cup) plus 1 tablespoon unsalted...


A few of the columns are unnecessary. I take the two we need and make everything lowercase for consistency.

In [3]:
df = raw[['Title', 'Cleaned_Ingredients']]
df = df.rename(columns={'Title': 'name', 'Cleaned_Ingredients': 'ingredients'})
df = df.apply(lambda x: x.str.lower())
df

Unnamed: 0,name,ingredients
0,miso-butter roast chicken with acorn squash pa...,"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher..."
1,crispy salt and pepper potatoes,"['2 large egg whites', '1 pound new potatoes (..."
2,thanksgiving mac and cheese,"['1 cup evaporated milk', '1 cup whole milk', ..."
3,italian sausage and bread stuffing,"['1 (¾- to 1-pound) round italian loaf, cut in..."
4,newton's law,"['1 teaspoon dark brown sugar', '1 teaspoon ho..."
...,...,...
13496,brownie pudding cake,"['1 cup all-purpose flour', '2/3 cup unsweeten..."
13497,israeli couscous with roasted butternut squash...,"['1 preserved lemon', '1 1/2 pound butternut s..."
13498,rice with soy-glazed bonito flakes and sesame ...,['leftover katsuo bushi (dried bonito flakes) ...
13499,spanakopita,['1 stick (1/2 cup) plus 1 tablespoon unsalted...


I define a list of nuts to which I am allergic and a simple function to determine if any of them are present in an input string. This list is easily adjustable, which will enable the model to be flexible should I want to adapt it to a different set of allergens.

In [4]:
allergens = ['walnut', 'pecan', 'macadamia', 'hazelnut', 'brazil nut', 'wal nut']

def find_allergens(string):
    return any(word in string for word in allergens)
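To make the matching behavior concrete, here is a small check of the same function on a few ingredient strings. The sample strings are made up for illustration and are not taken from the dataset. The sketch shows that substring matching catches plurals and hyphenated compounds, and that it is case-sensitive, which is why the lowercasing step earlier matters.

```python
allergens = ['walnut', 'pecan', 'macadamia', 'hazelnut', 'brazil nut', 'wal nut']

def find_allergens(string):
    return any(word in string for word in allergens)

# Made-up ingredient strings, for illustration only
samples = [
    "['1/4 cup hazelnuts', '1 pound pasta']",  # plural still contains 'hazelnut'
    "['honey-pecan butter']",                  # hyphenated compound contains 'pecan'
    "['1 cup Walnuts']",                       # capital W does not match 'walnut'
    "['1 cup Walnuts']".lower(),               # lowercasing first fixes that
    "['2 large egg whites']",                  # no allergen present
]

results = [find_allergens(s) for s in samples]
for sample, result in zip(samples, results):
    print(result, sample)
```

Because the check is a plain substring test, any ingredient text that merely mentions one of these words (for example as an optional substitution) is also labeled True.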

I label the data in a new column, 'allergen', that shows True when an allergen is present and False otherwise. One initial observation: only 825 of 13,501 entries (~6%) contain an allergen, so the dataset is skewed. That may present issues later that we could try to address with resampling or other methods.

In [5]:
df['allergen'] = df.ingredients.apply(lambda x: find_allergens(x))
df.allergen.value_counts()

False    12676
True       825
Name: allergen, dtype: int64

A quick eyeball of the data labeled as containing allergens - looks reasonable.

In [6]:
df[df['allergen']]

Unnamed: 0,name,ingredients,allergen
62,nut butter granola bars,"['2 cups raw nuts (such as almonds, walnuts, p...",True
69,chocolate zucchini cake,"['2 1/4 cups sifted all purpose flour', '1/2 c...",True
70,swiss chard pasta with toasted hazelnuts and p...,"['¼ cup hazelnuts', '1 pound bow tie pasta (fa...",True
81,pear and hazelnut frangipane tart,"['1 cup hazelnuts, toasted, loose skins rubbed...",True
103,tahini-walnut magic shell,"['¼ cup raw walnuts', '3 oz. white chocolate, ...",True
...,...,...,...
13477,frisée and endive salad with warm brussels spr...,"['3 tablespoons white-wine vinegar', '2 tables...",True
13480,hazelnut-butter cookies with mini chocolate chips,"['1 1/2 cups all purpose flour', '3/4 teaspoon...",True
13492,cornmeal pancakes with honey-pecan butter,['1/2 cup (1 stick) unsalted european-style bu...,True
13494,ginger-pecan roulade with honey-glazed pecans,"['1/2 stick (1/4 cup) unsalted butter, melted,...",True


Now I define our variables for the model. I'd like to start with a Random Forest. Since the input feature is string-based and the algorithm requires numerical input, I start by vectorizing the names of the recipes and then split them into train, validation, and test sets in a 60/20/20 ratio. The training data will be used to train the model, the validation data will be used to evaluate the model and subsequently tweak the parameters, and the test data will be used to evaluate the final model.

In [7]:
vectorizer = CountVectorizer()
X = [str(x) for x in df.name]
X = vectorizer.fit_transform(X)
y = df.allergen

X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, test_size=0.4, random_state=3)
X_validation, X_test, y_validation, y_test = train_test_split(X_remaining, y_remaining, test_size=0.5, random_state=3)
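For intuition about what the vectorizing step produces, here is a simplified stand-in written in plain Python. It is not CountVectorizer's implementation; it only mimics two of its defaults as I understand them (lowercasing, and keeping tokens of two or more word characters). The helper names build_vocab and vectorize are hypothetical.

```python
import re
from collections import Counter

# Tokens of two or more word characters, similar to CountVectorizer's default pattern
TOKEN = re.compile(r"\b\w\w+\b")

def build_vocab(names):
    """Map every token seen in the names to a column index (alphabetical order)."""
    tokens = sorted({tok for name in names for tok in TOKEN.findall(name.lower())})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(name, vocab):
    """Return one row of token counts; tokens outside the vocabulary are ignored."""
    row = [0] * len(vocab)
    for tok, count in Counter(TOKEN.findall(name.lower())).items():
        if tok in vocab:
            row[vocab[tok]] = count
    return row

names = ["crispy salt and pepper potatoes", "thanksgiving mac and cheese", "newton's law"]
vocab = build_vocab(names)
row = vectorize("mac and cheese and walnuts", vocab)
print(sorted(vocab))
print(row)
```

Each recipe name becomes a row of word counts, one column per vocabulary word. Note that 'walnuts' is dropped in the last call because it was never seen when the vocabulary was built.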


Quick check on the sizes of the Train, Validation, and Test sets for features and labels. Looks right and in the expected 60/20/20 ratio.

In [8]:
print(X_train.shape[0], X_validation.shape[0], X_test.shape[0], y_train.shape[0], y_validation.shape[0], y_test.shape[0])

8100 2700 2701 8100 2700 2701
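The sizes above follow from simple arithmetic. The sketch below assumes, as I understand scikit-learn's behavior, that the held-out share is rounded up and the remainder stays on the training side; split_sizes is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (kept, held_out) sizes, rounding the held-out share up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

# First split: 60% train, 40% remaining
n_train, n_remaining = split_sizes(13501, 0.4)
# Second split: the remaining 40% is halved into validation and test
n_validation, n_test = split_sizes(n_remaining, 0.5)

print(n_train, n_validation, n_test)
```

Splitting 40% and then halving it is what yields the 60/20/20 ratio; the odd row count explains why the test set ends up one row larger than the validation set.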


Another quick check to ensure that the proportion of True in each y set is roughly similar to the population proportion of ~6%. Looks good.

In [9]:
y_train_pct = y_train.sum() / y_train.count()
y_validation_pct = y_validation.sum() / y_validation.count()
y_test_pct = y_test.sum() / y_test.count()

print("Training Set Pct True %.2f%%" % (y_train_pct*100))
print("Validation Set Pct True %.2f%%" % (y_validation_pct*100))
print("Test Set Pct True %.2f%%" % (y_test_pct*100))

Training Set Pct True 5.98%
Validation Set Pct True 6.63%
Test Set Pct True 6.00%


Now let's create and train the Random Forest classifier.

In [10]:
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

Now let's see how the model does on the training set. It's extremely accurate, which is great but might just mean an overfit model.

In [11]:
train_predictions = rf_classifier.predict(X_train)
print(classification_report(y_train, train_predictions))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00      7616
        True       1.00      0.99      1.00       484

    accuracy                           1.00      8100
   macro avg       1.00      1.00      1.00      8100
weighted avg       1.00      1.00      1.00      8100



On the validation set, recall for the True class is quite low (0.53) despite high precision (0.92), leading to a low F1 score (0.67). Accuracy is still high at 0.97, only slightly below the training set, but because roughly 94% of recipes are labeled False, accuracy mostly reflects the majority class and says little about how well allergens are caught. The drop in True recall from 0.99 on the training set to 0.53 here suggests the model has overfit the minority class. We will need to raise the recall before this model is useful.

In [12]:
val_predictions = rf_classifier.predict(X_validation)
print(classification_report(y_validation, val_predictions))


              precision    recall  f1-score   support

       False       0.97      1.00      0.98      2521
        True       0.92      0.53      0.67       179

    accuracy                           0.97      2700
   macro avg       0.94      0.76      0.83      2700
weighted avg       0.96      0.97      0.96      2700
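As a quick sanity check on the report above, the F1 score for the True class can be recomputed from its precision and recall, and the accuracy can be compared with a trivial classifier that always predicts False. The numbers are copied from the report; the always-False baseline is a thought experiment, not a model that was trained here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall for the True class, from the validation report
f1_true = f1_score(0.92, 0.53)

# Support figures from the same report
n_false, n_total = 2521, 2700
baseline_accuracy = n_false / n_total  # accuracy of always predicting False

print(f"F1 for True: {f1_true:.2f}")
print(f"Always-False accuracy: {baseline_accuracy:.3f}")
```

Since always predicting False already reaches about 93% accuracy, recall and F1 for the True class are the more informative numbers for this problem.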



Given the skewed nature of the sample, I try to solve the low recall problem by applying class weights. Since True values make up about 6% of the population, I use a 1:15 class weight ratio to balance the influence of the True and False classes. Unfortunately, this adjustment does not help the recall issue.

In [13]:
# Define class weights
class_weights = {0:1.0, 1: 15}

# Create, train, and evaluate the Random Forest classifier
rf_classifier = RandomForestClassifier(class_weight=class_weights)
rf_classifier.fit(X_train, y_train)
val_predictions = rf_classifier.predict(X_validation)
print(classification_report(y_validation, val_predictions))

              precision    recall  f1-score   support

       False       0.96      1.00      0.98      2521
        True       0.93      0.49      0.64       179

    accuracy                           0.96      2700
   macro avg       0.95      0.74      0.81      2700
weighted avg       0.96      0.96      0.96      2700
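For reference, the 1:15 ratio can be derived from the class counts directly. The sketch below uses the training-set support figures from the earlier report and also computes the weights given by the formula n_samples / (n_classes * class_count), which is the formula scikit-learn documents for class_weight='balanced'. The helper name balanced_weight is hypothetical.

```python
# Training-set support from the earlier classification report
n_false, n_true = 7616, 484
n_samples = n_false + n_true
n_classes = 2

# How many False rows there are for each True row
ratio = n_false / n_true

def balanced_weight(class_count):
    """Weight that makes each class contribute equally in total."""
    return n_samples / (n_classes * class_count)

weights = {False: balanced_weight(n_false), True: balanced_weight(n_true)}

print(f"False-to-True ratio: {ratio:.2f}")
print({label: round(w, 3) for label, w in weights.items()})
```

The ratio of the two balanced weights equals the False-to-True ratio, so a hand-set 1:15 weighting is close to what the balanced formula would produce.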



Since the class weighting actually made recall slightly worse (0.49 versus 0.53), as a next step I'll try resampling. I define a combination resampling pipeline (over-sampling the Trues and under-sampling the Falses) and re-run the Random Forest. Unfortunately, this only brings recall back to the original 0.53, while precision drops to 0.85.

In [14]:
# Start by defining the combination resampling pipeline
resampling_pipeline = Pipeline([
    ('over_sampler', RandomOverSampler()),
    ('under_sampler', RandomUnderSampler()),
])

# Apply combination resampling to the training data
X_resampled, y_resampled = resampling_pipeline.fit_resample(X_train, y_train)

# Create, train, and evaluate the Random Forest classifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_resampled, y_resampled)
val_predictions = rf_classifier.predict(X_validation)
print(classification_report(y_validation, val_predictions))

              precision    recall  f1-score   support

       False       0.97      0.99      0.98      2521
        True       0.85      0.53      0.65       179

    accuracy                           0.96      2700
   macro avg       0.91      0.76      0.81      2700
weighted avg       0.96      0.96      0.96      2700
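One thing worth checking in the pipeline above: as far as I know, both samplers default to sampling_strategy='auto', under which the over-sampler already grows the minority class to match the majority, leaving the under-sampler nothing to remove. The sketch below is plain class-count arithmetic, not imbalanced-learn code. It assumes that a float sampling_strategy means the target minority-to-majority ratio after that step, and the ratios 0.5 and 1.0 are illustrative values that were not tried on this data. resampled_counts is a hypothetical helper.

```python
def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Class counts after over-sampling the minority, then under-sampling the majority.

    Each ratio is the target minority/majority ratio after its step.
    """
    # Over-sampling only ever adds minority rows
    n_minority = max(n_minority, int(n_majority * over_ratio))
    # Under-sampling only ever removes majority rows
    n_majority = min(n_majority, int(n_minority / under_ratio))
    return n_majority, n_minority

# Training-set counts: 7616 False, 484 True
default_counts = resampled_counts(7616, 484, over_ratio=1.0, under_ratio=1.0)
combined_counts = resampled_counts(7616, 484, over_ratio=0.5, under_ratio=1.0)

print("over-sample to 1:1 first:", default_counts)
print("over to 0.5, then under to 1.0:", combined_counts)
# With imbalanced-learn this would correspond to something like
# RandomOverSampler(sampling_strategy=0.5) followed by
# RandomUnderSampler(sampling_strategy=1.0).
```

In the first case every False row is kept and the True rows are duplicated roughly 15 times; in the second, both steps do some of the work and the resulting training set is half the size.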



Since class weighting and resampling were minimally effective in improving recall, let's see if a more complex model might work. I use a deep learning model via TensorFlow, with a sigmoid activation in the final layer given the desired binary output.

In [15]:
# Since the output of the vectorizer I used earlier is a sparse matrix, I convert to a dense matrix.
# This consumes a lot of memory but it should be fine for this amount of data

X_train_dense = X_train.toarray()

# Now I create and compile the model
nn_model = Sequential([
    Dense(units = 128, activation = 'relu'),
    Dense(units = 64, activation = 'relu'),
    Dense(units = 32, activation = 'relu'),
    Dense(units = 16, activation = 'relu'),
    Dense(units = 8, activation = 'relu'),
    Dense(units = 1, activation = 'sigmoid')
])

nn_model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(learning_rate = 0.01)
)

nn_model.fit(X_train_dense, y_train, epochs=20)


Epoch 1/20




Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x15d94a040>

Similar to the random forest, the model does well on the training set.

In [16]:
X_train_dense = X_train.toarray()

train_predictions = nn_model.predict(X_train_dense)
train_predictions = (train_predictions >0.5)
print(classification_report(y_train, train_predictions))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00      7616
        True       0.97      1.00      0.99       484

    accuracy                           1.00      8100
   macro avg       0.99      1.00      0.99      8100
weighted avg       1.00      1.00      1.00      8100



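Unfortunately, evaluation on the validation set shows low recall (0.53) and much lower precision (0.63) for the True class. Accuracy is still high, but with roughly 94% of recipes labeled False it mostly reflects the majority class; the gap between training and validation results on the True class points to overfitting.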

In [17]:
X_validation_dense = X_validation.toarray()

val_predictions = nn_model.predict(X_validation_dense)
val_predictions = (val_predictions >0.5)
print(classification_report(y_validation, val_predictions))

              precision    recall  f1-score   support

       False       0.97      0.98      0.97      2521
        True       0.63      0.53      0.58       179

    accuracy                           0.95      2700
   macro avg       0.80      0.75      0.78      2700
weighted avg       0.94      0.95      0.95      2700
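The 0.5 cutoff applied to the sigmoid output is itself a choice. For an allergy use case, where a missed allergen is worse than a false alarm, a lower cutoff trades precision for recall. The sketch below illustrates that trade-off on made-up scores and labels; they are not outputs of the model above, and precision_recall is a hypothetical helper.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the positive class at a given cutoff."""
    predictions = [score > threshold for score in scores]
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    return tp / (tp + fp), tp / (tp + fn)

# Made-up model scores and true labels, for illustration only
scores = [0.95, 0.80, 0.60, 0.45, 0.35, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, False, False]

at_half = precision_recall(scores, labels, threshold=0.5)
at_lower = precision_recall(scores, labels, threshold=0.3)

print("threshold 0.5: precision %.2f, recall %.2f" % at_half)
print("threshold 0.3: precision %.2f, recall %.2f" % at_lower)
```

Any cutoff should be picked using the validation set rather than the test set, so that the final evaluation stays unbiased.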



It seems I'll need to add more data in order to improve the model's recall. This dataset from Kaggle (link below) contains data similar to the original dataset.

*https://www.kaggle.com/datasets/paultimothymooney/recipenlg*

In [18]:
recipenlg = pd.read_csv('RecipeNLG_dataset.csv', index_col=0)
recipenlg.head()

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


The new dataset includes more than 2 million entries.

In [19]:
recipenlg.title.count()

2231142

Clean up the column names and add the allergen column in line with the first dataset.

In [20]:
# I take only the columns we need and rename them to be consistent with the original dataset column names

df2 = recipenlg[['title', 'NER']]
df2 = df2.rename(columns={'title': 'name', 'NER': 'ingredients'})

# Similar to the original dataset, I add a column that labels the data where an allergen is present
# Roughly 6.6% of the recipes contain allergens, broadly consistent with the original set

df2['allergen'] = df2.ingredients.apply(lambda x: find_allergens(x))
df2.allergen.value_counts()

False    2083920
True      147222
Name: allergen, dtype: int64

Combine the two datasets.

In [21]:
df_full = pd.concat([df, df2])
df_full.name.count()

2244638

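Vectorize the recipe names and split the train / validation / test sets. Note that the cell below vectorizes df2 (the RecipeNLG data on its own) rather than df_full, which is why the set sizes sum to 2,231,142 rather than 2,244,638. Output the size of each set as we did before to ensure they are correctly split.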

In [22]:
X = [str(x) for x in df2.name]
X = vectorizer.fit_transform(X)
y = df2.allergen

X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, test_size=0.4, random_state=4)
X_validation, X_test, y_validation, y_test = train_test_split(X_remaining, y_remaining, test_size=0.5, random_state=4)
print(X_train.shape[0], X_validation.shape[0], X_test.shape[0], y_train.shape[0], y_validation.shape[0], y_test.shape[0])

1338685 446228 446229 1338685 446228 446229


The proportion of Trues is roughly aligned with the broader population.

In [23]:
y_train_pct = y_train.sum() / y_train.count()
y_validation_pct = y_validation.sum() / y_validation.count()
y_test_pct = y_test.sum() / y_test.count()

print("Training Set Pct True %.2f%%" % (y_train_pct*100))
print("Validation Set Pct True %.2f%%" % (y_validation_pct*100))
print("Test Set Pct True %.2f%%" % (y_test_pct*100))

Training Set Pct True 6.59%
Validation Set Pct True 6.66%
Test Set Pct True 6.56%


Create and train the Random Forest classifier using the new dataset, then test it on the training set. Accuracy is still good. Recall for the True class (0.68) is higher than the earlier validation results but still not great.

In [92]:
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

train_predictions = rf_classifier.predict(X_train)
print(classification_report(y_train, train_predictions))

              precision    recall  f1-score   support

       False       0.98      0.99      0.99   1250470
        True       0.90      0.68      0.78     88215

    accuracy                           0.97   1338685
   macro avg       0.94      0.84      0.88   1338685
weighted avg       0.97      0.97      0.97   1338685



Evaluate the Random Forest model on the validation set. Relative to the training set, recall has declined drastically (0.68 to 0.37), as has precision (0.90 to 0.67).

In [93]:
val_predictions = rf_classifier.predict(X_validation)
print(classification_report(y_validation, val_predictions))

              precision    recall  f1-score   support

       False       0.96      0.99      0.97    416489
        True       0.67      0.37      0.48     29739

    accuracy                           0.95    446228
   macro avg       0.82      0.68      0.73    446228
weighted avg       0.94      0.95      0.94    446228



Adding the new dataset did not help the recall, and the dataset is now so large that running the model takes a very long time. As a next step, instead of adding the entire new dataset, I will add True and False values in similar proportions to correct the skew of the data. I suspect this may improve recall (in addition to shrinking the set to make the model more efficient). I will add all 147,222 True values and an equal number of randomly sampled False values.

In [24]:
df2_true = df2[df2.allergen == True]
df2_true.shape[0]

147222

In [25]:
df2_false = df2[df2.allergen == False]
df2_false_sample = df2_false.sample(n = 147222, random_state = 4)
df2_false_sample.shape[0]

147222

Combine these two new, equally sized True and False datasets with the original. Note that the proportion of Trues and Falses is now much more even. Additionally, the size of the dataset (~300k) is much smaller than in the last iteration (more than 2 million).

In [26]:
df3 = pd.concat([df2_true, df2_false_sample, df])

print(df3.shape[0])
print(df3[df3.allergen == True].shape[0])
print(df3[df3.allergen == False].shape[0])


307945
148047
159898


Once again, vectorize and split the sets.

In [27]:
X = [str(x) for x in df3.name]
X = vectorizer.fit_transform(X)
y = df3.allergen

X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, test_size=0.4, random_state=4)
X_validation, X_test, y_validation, y_test = train_test_split(X_remaining, y_remaining, test_size=0.5, random_state=4)

Check the proportion of Trues, which should now be a bit less than 50%.

In [28]:
y_train_pct = y_train.sum() / y_train.count()
y_validation_pct = y_validation.sum() / y_validation.count()
y_test_pct = y_test.sum() / y_test.count()

print("Training Set Pct True %.2f%%" % (y_train_pct*100))
print("Validation Set Pct True %.2f%%" % (y_validation_pct*100))
print("Test Set Pct True %.2f%%" % (y_test_pct*100))

Training Set Pct True 47.94%
Validation Set Pct True 48.35%
Test Set Pct True 48.20%


Let's try the neural network first this time.

In [29]:
X_train_dense = X_train.toarray()

# Create and compile the model
nn_model = Sequential([
    Dense(units = 128, activation = 'relu'),
    Dense(units = 64, activation = 'relu'),
    Dense(units = 32, activation = 'relu'),
    Dense(units = 16, activation = 'relu'),
    Dense(units = 8, activation = 'relu'),
    Dense(units = 1, activation = 'sigmoid')
])

nn_model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(learning_rate = 0.01)
)

nn_model.fit(X_train_dense, y_train, epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x168c6a850>
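The dense conversion is worth sizing up at this scale. The estimate below uses the 184,767 training rows shown in the classification reports and the 26,133 vocabulary terms that the logistic model later reports as its coefficient count for the same matrix. It assumes 8 bytes per entry (the int64 default I believe CountVectorizer uses); dense_bytes is a hypothetical helper.

```python
def dense_bytes(n_rows, n_columns, bytes_per_entry=8):
    """Memory needed to hold a fully dense matrix of the given shape."""
    return n_rows * n_columns * bytes_per_entry

# Shape of the training matrix for the rebalanced dataset
n_rows, n_columns = 184767, 26133

total_bytes = dense_bytes(n_rows, n_columns)
print(f"{total_bytes:,} bytes (~{total_bytes / 1e9:.1f} GB)")
```

Under these assumptions the dense training matrix alone is far larger than the sparse one, since almost every entry is zero for short recipe names.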

Evaluate the neural network model on the training set - finally recall and precision are both >0.90!

In [30]:
X_train_dense = X_train.toarray()

train_predictions = nn_model.predict(X_train_dense)
train_predictions = (train_predictions >0.5)
print(classification_report(y_train, train_predictions))

              precision    recall  f1-score   support

       False       0.94      0.92      0.93     96184
        True       0.92      0.94      0.93     88583

    accuracy                           0.93    184767
   macro avg       0.93      0.93      0.93    184767
weighted avg       0.93      0.93      0.93    184767



Evaluate the model on the validation set. The stats aren't quite as good as the training set results, but they are still much improved over previous iterations. Given the divergence between training and validation results, there is certainly some level of overfitting here.

In [31]:
X_validation_dense = X_validation.toarray()

val_predictions = nn_model.predict(X_validation_dense)
val_predictions = (val_predictions >0.5)
print(classification_report(y_validation, val_predictions))

              precision    recall  f1-score   support

       False       0.86      0.83      0.84     31809
        True       0.82      0.85      0.84     29780

    accuracy                           0.84     61589
   macro avg       0.84      0.84      0.84     61589
weighted avg       0.84      0.84      0.84     61589



Trying a simpler model like the Random Forest may help the overfitting issue. Let's fit and then test the Random Forest model on the training set. Results look largely similar to the Neural Network (slightly better).

In [32]:
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_classifier.fit(X_train, y_train)

train_predictions = rf_classifier.predict(X_train)
print(classification_report(y_train, train_predictions))

              precision    recall  f1-score   support

       False       0.98      0.95      0.96     96184
        True       0.94      0.97      0.96     88583

    accuracy                           0.96    184767
   macro avg       0.96      0.96      0.96    184767
weighted avg       0.96      0.96      0.96    184767



Let's try the Random Forest now on the validation set. The results are similar to the Neural Network, with stats worse than the training set but still materially improved over previous iterations. The difference in statistics between the training and validation sets suggests some level of overfitting.

In [33]:
val_predictions = rf_classifier.predict(X_validation)
print(classification_report(y_validation, val_predictions))

              precision    recall  f1-score   support

       False       0.86      0.84      0.85     31809
        True       0.83      0.86      0.84     29780

    accuracy                           0.85     61589
   macro avg       0.85      0.85      0.85     61589
weighted avg       0.85      0.85      0.85     61589



Since it still appears that the model is overfitting, let's try an even simpler model: logistic regression. I fit the model and evaluate it on the training set. Note that the solver does not converge within the default 100 iterations, but raising max_iter, even substantially, did not change the quality of the model, so I keep the default. The Logistic model underperforms the Random Forest and Neural Network on the training set.

In [41]:
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

train_predictions = logistic_model.predict(X_train)
print(classification_report(y_train, train_predictions))

              precision    recall  f1-score   support

       False       0.86      0.85      0.86     96184
        True       0.84      0.85      0.85     88583

    accuracy                           0.85    184767
   macro avg       0.85      0.85      0.85    184767
weighted avg       0.85      0.85      0.85    184767



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Now let's evaluate the Logistic on the validation set. It performs very similarly on the validation and training sets, suggesting that the overfitting seen in the previous models may have been solved by the simpler Logistic. The evaluation statistics are also very similar to the validation results of the Random Forest and Neural Network. The Logistic also took significantly less time to run than the other models.

In summary, the Logistic appears to be less (or not) overfitted, similarly effective, and much faster than the Random Forest and Neural Network. Based on this, I will select the Logistic as the optimal model for this use case.

In [39]:
val_predictions = logistic_model.predict(X_validation)
print(classification_report(y_validation, val_predictions))

              precision    recall  f1-score   support

       False       0.84      0.84      0.84     31809
        True       0.83      0.84      0.83     29780

    accuracy                           0.84     61589
   macro avg       0.84      0.84      0.84     61589
weighted avg       0.84      0.84      0.84     61589



Now that we've chosen the model, let's evaluate it on the test set to produce an unbiased assessment of its accuracy. The results of the test set are practically identical to those of the validation set and satisfactory for my purposes.

In [42]:
test_predictions = logistic_model.predict(X_test)
print(classification_report(y_test, test_predictions))

              precision    recall  f1-score   support

       False       0.85      0.84      0.84     31905
        True       0.83      0.84      0.83     29684

    accuracy                           0.84     61589
   macro avg       0.84      0.84      0.84     61589
weighted avg       0.84      0.84      0.84     61589
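One caveat when reading these numbers: the test set comes from the rebalanced data (about 48% True), whereas the full RecipeNLG set had roughly 6.6% True. Recall per class does not depend on the class mix, but precision does. The sketch below estimates how the True-class precision would change at the natural mix, under the assumption that the rounded per-class recalls from the report (0.84 each) carry over unchanged. precision_at_prevalence is a hypothetical helper.

```python
def precision_at_prevalence(recall_true, recall_false, prevalence):
    """Expected True-class precision for a given share of True recipes."""
    true_positives = recall_true * prevalence
    false_positives = (1 - recall_false) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Share of True recipes in the rebalanced test set (support figures from the report)
balanced_prevalence = 29684 / 61589

balanced_precision = precision_at_prevalence(0.84, 0.84, balanced_prevalence)
natural_precision = precision_at_prevalence(0.84, 0.84, 0.066)

print(f"Precision at ~48% True: {balanced_precision:.2f}")
print(f"Precision at 6.6% True: {natural_precision:.2f}")
```

The first figure reproduces the 0.83 in the report, which suggests the arithmetic is consistent. The second suggests that on an unfiltered stream of recipes most flagged names would be false alarms, which is tolerable for an allergy screen since recall is unaffected.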



Another benefit of the Logistic model is its simplicity and interpretability - by analyzing the coefficients we can see which words, when present, contribute the most to a prediction of allergens present. 

Here I pull the coefficients from the model and place them in a dataframe alongside the un-vectorized words from the input data. I print a list of the top contributors based on their coefficients - note that all of these top words contribute significantly when compared to the average coefficient, which is near zero.

It's not surprising that the top words are the allergens themselves, as I would expect any recipe with an allergen in its name to contain that allergen. Some of the other top names such as "Muhammara" are specific dishes that frequently contain allergens (walnuts in the case of Muhammara), which is particularly useful as those would be harder for a human to detect unless they were already familiar with that dish.

In [68]:
coefficients = logistic_model.coef_[0]
feature_names = vectorizer.get_feature_names_out()
coefficients_df = pd.DataFrame({'Word': feature_names, 'Coefficient': coefficients})
coefficients_df.Coefficient = coefficients_df.Coefficient.apply(lambda x: round(x, 2))
top_words = coefficients_df.sort_values(by='Coefficient', ascending=False).head(50)
print('Coefficient Summary Stats:')
print(coefficients_df.Coefficient.describe())
print()

print(top_words.to_string(index=False))


Coefficient Summary Stats:
count    26133.000000
mean        -0.021752
std          0.418596
min         -4.190000
25%         -0.150000
50%          0.000000
75%          0.020000
max          6.100000
Name: Coefficient, dtype: float64

        Word  Coefficient
      pecans         6.10
     walnuts         6.10
      walnut         5.76
    hazelnut         5.63
       pecan         5.33
   hazelnuts         5.29
    pralines         4.71
   fruitcake         4.36
millionaires         3.83
     romesco         3.78
    rugelach         3.61
     baklava         3.61
     turtles         3.49
     waldorf         3.37
     haroset         3.19
     praline         3.19
     lizzies         3.19
   muhammara         3.18
        ball         3.05
         log         3.03
        lush         2.99
        nuts         2.98
    charoset         2.97
    whiskers         2.93
       ozark         2.89
       rocks         2.87
  fruitcakes         2.82
 millionaire         2.81
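To read these coefficients more concretely: in a logistic regression each coefficient is added to the log-odds every time the word appears, so exponentiating it gives the factor by which the odds of "contains allergens" are multiplied. The sketch below uses three coefficients copied from the table above. The intercept is not shown in the notebook output, so the value of 0.0 used for the probability is purely an assumption for illustration.

```python
import math

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

# Coefficients copied from the table above
top_coefficients = {'pecans': 6.10, 'muhammara': 3.18, 'ball': 3.05}

# Factor by which each word multiplies the odds of the True class
odds_multipliers = {word: math.exp(coef) for word, coef in top_coefficients.items()}

# Assumed intercept (not shown in the output above), for illustration only
assumed_intercept = 0.0
p_muhammara = sigmoid(assumed_intercept + top_coefficients['muhammara'])

for word, multiplier in odds_multipliers.items():
    print(f"{word}: odds x {multiplier:.0f}")
print(f"P(True) for a name containing only 'muhammara': {p_muhammara:.2f}")
```

A real recipe name usually contains several words, so the final log-odds is the intercept plus the sum of all matching coefficients, including negative ones.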
        

At last we have a satisfactory model - now we can productionize the prediction model and I can start eating some food! 

Here I define a function to use our model to predict whether a single recipe contains allergens.

In [43]:
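def test_recipe(recipe):
    # Lowercase the name (CountVectorizer also lowercases by default)
    recipe_lowercase = recipe.lower()
    vector = vectorizer.transform([recipe_lowercase])
    # predict() returns one entry per input row; take the single result
    prediction = logistic_model.predict(vector)[0]
    if prediction:
        return recipe + ' likely contains allergens'
    return recipe + ' is likely allergen-free!'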


In [69]:
tester = 'Pear and Endive Salad'
print(test_recipe(tester))

Pear and Endive Salad likely contains allergens
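A hard yes/no answer hides how confident the model is. A possible extension is to report the predicted probability instead, using the predict_proba method of scikit-learn's LogisticRegression. The sketch below is self-contained and therefore trains a tiny toy model on made-up recipe names; the toy data, toy_vectorizer, toy_model, and allergen_probability are all hypothetical and stand in for the notebook's real vectorizer and logistic_model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data, for illustration only
toy_names = [
    "pecan pie", "walnut brownies", "hazelnut torte", "maple pecan granola",
    "chicken soup", "garlic bread", "tomato salad", "beef stew",
]
toy_labels = [True, True, True, True, False, False, False, False]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_names), toy_labels)

def allergen_probability(recipe):
    """Probability the toy model assigns to the True (allergen) class."""
    vector = toy_vectorizer.transform([recipe])
    # Find which probability column corresponds to the True class
    true_column = list(toy_model.classes_).index(True)
    return toy_model.predict_proba(vector)[0, true_column]

p_nut = allergen_probability("pecan tart")
p_plain = allergen_probability("chicken stew")
print(f"pecan tart: {p_nut:.2f}")
print(f"chicken stew: {p_plain:.2f}")
```

With the real model, a borderline probability for a name like the one above would be a prompt to check the ingredient list rather than trust the label either way.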
