# Let's cook model

Let's combine what we've found so far.

- [What are ingredients?](https://www.kaggle.com/rejasupotaro/what-are-ingredients) (Preprocessing & Feature extraction)
- [Representations for ingredients](https://www.kaggle.com/rejasupotaro/representations-for-ingredients)

The steps are as follows.

1. Load dataset
2. Remove outliers
3. Preprocess
4. Create model
5. Check local CV
6. Train model
7. Check predicted values
8. Make submission

In [1]:
import json
import re
import unidecode
import numpy as np
import pandas as pd
from collections import defaultdict
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, LabelEncoder
from tqdm import tqdm
tqdm.pandas()

## 1. Load dataset

In [2]:
train = pd.read_json('train.json')
test = pd.read_json('test.json')
train1 = pd.read_json('cooking_train.json')

In [3]:
frames = [train, train1]
train = pd.concat(frames)

## 2. Remove outliers

I saw some weird recipes in the dataset.

- water => Japanese
- butter => Indian
- butter => French

Let's filter such single-ingredient recipes and see how it goes.

In [4]:
train['num_ingredients'] = train['ingredients'].apply(len)
train = train[train['num_ingredients'] > 1]
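
Before dropping rows, it can help to look at what is being removed. The sketch below runs the same filter on a made-up toy frame; the rows are hypothetical and only mimic the shape of `train.json`.

```python
import pandas as pd

# Hypothetical toy data that mimics the shape of train.json.
toy = pd.DataFrame({
    'cuisine': ['japanese', 'indian', 'french', 'italian'],
    'ingredients': [['water'], ['butter'], ['butter'], ['pasta', 'tomatoes', 'basil']],
})

toy['num_ingredients'] = toy['ingredients'].apply(len)

# Recipes with a single ingredient carry almost no signal about the cuisine.
singles = toy[toy['num_ingredients'] == 1]
print(singles[['cuisine', 'ingredients']])

# Keep only recipes with more than one ingredient, as in the cell above.
kept = toy[toy['num_ingredients'] > 1]
print(f'dropped {len(toy) - len(kept)} of {len(toy)} recipes')
```

Printing the dropped rows first makes it easy to confirm that the filter only removes recipes like "water" or "butter".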

## 3. Preprocess

Currently, preprocessing does the following.

- convert to lowercase
- replace hyphens with spaces
- drop words that contain digits
- drop words of 2 characters or fewer
- drop words that contain a typographic apostrophe (’)
- lemmatize

This process could be improved.

In [5]:
lemmatizer = WordNetLemmatizer()
def preprocess(ingredients):
    # Join all ingredients of a recipe into one lowercase string.
    ingredients_text = ' '.join(ingredients)
    ingredients_text = ingredients_text.lower()
    ingredients_text = ingredients_text.replace('-', ' ')
    words = []
    for word in ingredients_text.split():
        if re.search('[0-9]', word): continue  # drop quantities such as "1%"
        if len(word) <= 2: continue  # drop very short tokens such as "&"
        if '’' in word: continue  # drop words with a typographic apostrophe
        word = lemmatizer.lemmatize(word)
        if len(word) > 0: words.append(word)
    return ' '.join(words)

for ingredient, expected in [
    ('Eggs', 'egg'),
    ('all-purpose flour', 'all purpose flour'),
    ('purée', 'purée'),
    ('1% low-fat milk', 'low fat milk'),
    ('half & half', 'half half'),
    ('safetida (powder)', 'safetida (powder)')
]:
    actual = preprocess([ingredient])
    assert actual == expected, f'"{expected}" is expected but got "{actual}"'
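
One possible improvement, as a sketch only: the checks above show that accents and parentheses survive preprocessing (`purée`, `(powder)`). The hypothetical helper `normalize_word` below strips both using only the standard library; `unidecode.unidecode` (imported above but unused) would be an alternative for the accent part. Whether this helps the score is untested here.

```python
import re
import unicodedata

def normalize_word(word):
    """Hypothetical helper: strip accents and punctuation from one word."""
    # Decompose accented characters (é -> e + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', word)
    without_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop anything that is not a letter, digit, or underscore.
    return re.sub(r'[^\w]', '', without_accents)

for raw in ['purée', '(powder)', 'jalapeño', 'half']:
    print(raw, '->', normalize_word(raw))
```

A word made only of punctuation, such as `&`, becomes an empty string, so the existing `len(word) > 0` check would still be needed after this step.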

In [6]:
train['x'] = train['ingredients'].progress_apply(preprocess)
test['x'] = test['ingredients'].progress_apply(preprocess)
train.head()

100%|██████████| 69732/69732 [00:15<00:00, 4562.08it/s]
100%|██████████| 9944/9944 [00:02<00:00, 4700.89it/s]


| | cuisine | id | ingredients | num_ingredients | x |
|---|---|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... | 9 | romaine lettuce black olive grape tomato garli... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... | 11 | plain flour ground pepper salt tomato ground b... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... | 12 | egg pepper salt mayonaise cooking oil green ch... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] | 4 | water vegetable oil wheat salt |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... | 20 | black pepper shallot cornflour cayenne pepper ... |


I need to tune the parameters of TfidfVectorizer later.

In [7]:
vectorizer = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    FunctionTransformer(lambda x: x.astype('float16'), validate=False)
)

x_train = vectorizer.fit_transform(train['x'].values)
x_train.sort_indices()
x_test = vectorizer.transform(test['x'].values)
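
As a small illustration of what `sublinear_tf=True` changes: the raw term count `tf` is replaced by `1 + ln(tf)`, so a word repeated across several ingredients counts for more than once but less than proportionally. The sketch below uses only the standard library on a made-up recipe string and shows just this step, before IDF weighting and normalization.

```python
import math
from collections import Counter

# Made-up preprocessed recipe in which "pepper" appears in three ingredients.
recipe = 'black pepper cayenne pepper ground pepper salt'
counts = Counter(recipe.split())

for word, tf in counts.most_common():
    # sublinear_tf=True replaces the raw count tf with 1 + ln(tf).
    sublinear = 1 + math.log(tf)
    print(f'{word:<8} raw tf={tf}  sublinear tf={sublinear:.3f}')
```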

Encode cuisines to numeric values using LabelEncoder.

In [8]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train['cuisine'].values)
dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

{'brazilian': 0,
 'british': 1,
 'cajun_creole': 2,
 'chinese': 3,
 'filipino': 4,
 'french': 5,
 'greek': 6,
 'indian': 7,
 'irish': 8,
 'italian': 9,
 'jamaican': 10,
 'japanese': 11,
 'korean': 12,
 'mexican': 13,
 'moroccan': 14,
 'russian': 15,
 'southern_us': 16,
 'spanish': 17,
 'thai': 18,
 'vietnamese': 19}
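
The mapping above is alphabetical because `LabelEncoder` numbers the unique labels in sorted order. A plain-Python imitation of that behavior (not the library code), using the cuisines of the first five rows shown earlier:

```python
labels = ['greek', 'southern_us', 'filipino', 'indian', 'indian']

# Assign integer ids in sorted order of the unique labels.
classes = sorted(set(labels))
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in labels]
decoded = [classes[i] for i in encoded]  # the equivalent of inverse_transform

print(to_id)
print(encoded)
print(decoded == labels)
```

The ids here differ from the full mapping because this toy list contains only four of the twenty cuisines.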

## 4. Create model

I've tried LogisticRegression, GaussianProcessClassifier, GradientBoostingClassifier, MLPClassifier, LGBMClassifier, SGDClassifier, and Keras, but SVC has worked best so far.

I need to look at the models and their parameters more closely.

In [9]:
estimator = SVC(
    C=50,
    kernel='rbf',
    gamma=1.4,
    coef0=1,  # has no effect with the 'rbf' kernel (only used by 'poly' and 'sigmoid')
    cache_size=500,
)
classifier = OneVsRestClassifier(estimator, n_jobs=-1)
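
To get a feel for `gamma=1.4`: the RBF kernel is `exp(-gamma * ||x - y||^2)`. Assuming the TF-IDF rows are unit length (the default `norm='l2'` of `TfidfVectorizer`), the squared distance equals `2 - 2 * cosine`, so the kernel value depends only on the cosine similarity of two recipes. A minimal sketch, with `rbf_kernel_from_cosine` as a hypothetical helper name:

```python
import math

GAMMA = 1.4  # same value as in the SVC above

def rbf_kernel_from_cosine(cosine, gamma=GAMMA):
    """RBF kernel value for two unit-length vectors with the given cosine similarity."""
    squared_distance = 2.0 - 2.0 * cosine  # ||x - y||^2 when ||x|| = ||y|| = 1
    return math.exp(-gamma * squared_distance)

for cosine in (1.0, 0.5, 0.0):
    print(f'cosine={cosine:.1f} -> kernel={rbf_kernel_from_cosine(cosine):.4f}')
```

Identical recipes give a kernel value of 1, and recipes with no words in common give `exp(-2.8)`, so a larger `gamma` makes the similarity drop off faster.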

## 5. Train model

Once I am confident in the model, I train it on the whole training set for submission.
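
Local cross-validation is one way to build that confidence. The sketch below is not the CV setup used for this notebook; it runs `cross_validate` on a tiny made-up dataset (the recipes and the 3-fold split are assumptions) with the vectorizer inside the pipeline, so that TF-IDF is fit on the training folds only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for train['x'] and train['cuisine'].
toy_x = [
    'spaghetti tomato basil parmesan', 'tomato mozzarella basil olive oil',
    'penne tomato garlic parmesan', 'risotto parmesan butter white wine',
    'soy sauce mirin rice nori', 'miso tofu dashi scallion',
    'rice nori soy sauce wasabi', 'dashi mirin soy sauce udon',
    'tortilla black bean salsa cilantro', 'avocado lime cilantro jalapeno',
    'corn tortilla salsa cheddar', 'black bean cumin lime tortilla',
]
toy_y = ['italian'] * 4 + ['japanese'] * 4 + ['mexican'] * 4

# Same vectorizer option and SVC parameters as in this notebook.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    OneVsRestClassifier(SVC(C=50, kernel='rbf', gamma=1.4)),
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_validate(model, toy_x, toy_y, cv=cv, scoring='accuracy')

print('fold accuracies:', scores['test_score'])
print('mean accuracy:', scores['test_score'].mean())
```

On the real data the same call would take `train['x']` and `train['cuisine']`, at the cost of one full SVC fit per fold.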

In [11]:
%%time
classifier.fit(x_train, y_train)

Wall time: 37min 5s


OneVsRestClassifier(estimator=SVC(C=50, cache_size=500, class_weight=None, coef0=1,
  decision_function_shape='ovr', degree=3, gamma=1.4, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          n_jobs=-1)

## 6. Check predicted values

Check whether the model has fit the training data well enough.

In [12]:
y_pred = label_encoder.inverse_transform(classifier.predict(x_train))
y_true = label_encoder.inverse_transform(y_train)

print(f'accuracy score on train data: {accuracy_score(y_true, y_pred)}')

def report2dict(cr):
    rows = []
    for row in cr.split("\n"):
        parsed_row = [x for x in row.split("  ") if len(x) > 0]
        if len(parsed_row) > 0: rows.append(parsed_row)
    measures = rows[0]
    classes = defaultdict(dict)
    for row in rows[1:]:
        class_label = row[0]
        for j, m in enumerate(measures):
            classes[class_label][m.strip()] = float(row[j + 1].strip())
    return classes
report = classification_report(y_true, y_pred)
pd.DataFrame(report2dict(report)).T


accuracy score on train data: 0.9996701657775483


| | f1-score | precision | recall | support |
|---|---|---|---|---|
| avg / total | 1.0 | 1.0 | 1.0 | 69732.0 |
| brazilian | 1.0 | 1.0 | 1.0 | 819.0 |
| british | 1.0 | 1.0 | 1.0 | 1404.0 |
| chinese | 1.0 | 1.0 | 1.0 | 4687.0 |
| greek | 1.0 | 1.0 | 1.0 | 2071.0 |
| irish | 1.0 | 1.0 | 1.0 | 1165.0 |
| italian | 1.0 | 1.0 | 1.0 | 13730.0 |
| mexican | 1.0 | 1.0 | 1.0 | 11320.0 |
| russian | 1.0 | 1.0 | 1.0 | 849.0 |
| southern_us | 1.0 | 1.0 | 1.0 | 7598.0 |
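
As an alternative to parsing the text report by hand, recent scikit-learn versions let `classification_report` return a nested dict directly via `output_dict=True`. A minimal sketch on made-up labels (the toy values are assumptions, only there to show the shape of the result):

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels, only to show the shape of the result.
y_true_toy = ['italian', 'italian', 'mexican', 'mexican', 'thai']
y_pred_toy = ['italian', 'mexican', 'mexican', 'mexican', 'thai']

report_dict = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Per-class and average rows are nested dicts of metrics.
for label, metrics in report_dict.items():
    if isinstance(metrics, dict):
        print(label, metrics)

# Newer versions also include an overall 'accuracy' entry as a single float.
print('accuracy:', report_dict.get('accuracy'))
```

This avoids depending on the exact spacing of the printed report, which a hand-written parser like `report2dict` relies on.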


## 7. Make submission

The model fits the training data almost perfectly, though that alone says little about test performance. Let's make a submission.

In [13]:
y_pred = label_encoder.inverse_transform(classifier.predict(x_test))
test['cuisine'] = y_pred
test[['id', 'cuisine']].to_csv('submission_external.csv', index=False)
test[['id', 'cuisine']].head()


| | id | cuisine |
|---|---|---|
| 0 | 18009 | irish |
| 1 | 28583 | southern_us |
| 2 | 41580 | italian |
| 3 | 29752 | cajun_creole |
| 4 | 35687 | italian |


That's it! Don't trust what I've done here. The score can be better. Please let me know if you find a better approach.