# "What's Cooking?"
### Cuisine classification from list of ingredients  
`March 31 2018 - current`

## Basics

In [1]:
import numpy as np
import json
import re
from collections import Counter

with open('../_data/train.json', 'r') as f:
    train = json.load(f)
with open('../_data/test.json', 'r') as f:
    test = json.load(f)

In [2]:
len(train), len(test)

(39774, 9944)

In [3]:
train[0]

{'cuisine': 'greek',
 'id': 10259,
 'ingredients': ['romaine lettuce',
  'black olives',
  'grape tomatoes',
  'garlic',
  'pepper',
  'purple onion',
  'seasoning',
  'garbanzo beans',
  'feta cheese crumbles']}

## Data cleaning

In [21]:
SPEC_CHARS = re.compile(r'[^\w\s_]')

# Collect every special character found in any ingredient string
chars = [SPEC_CHARS.findall(x)
         for recipe in train + test
         for x in recipe['ingredients']
         if SPEC_CHARS.search(x)]

Counter([x for charlist in chars for x in charlist])

Counter({'!': 34,
         '%': 394,
         '&': 479,
         "'": 240,
         '(': 55,
         ')': 55,
         ',': 814,
         '-': 14123,
         '.': 57,
         '/': 2,
         '®': 244,
         '’': 8,
         '€': 1,
         '™': 79})

### Rules:
* remove `'`, `’`, parenthesized size notes containing `oz` (e.g. `(10 oz.)`), and any remaining `(` or `)`
* replace `&` with `and`
* replace everything else (`!`, `,`, `-`, `.`, `/`, `®`, `€`, `™`) with `' '`
* keep `%`

In [33]:
SPEC_REMOVE = re.compile(r'(\'|\’|\(.*oz.*\)|(\()|(\)))')
SPEC_AND = re.compile(r'\&')
SPEC_ELSE = re.compile(r'[^\w\s\%_]')

def clean_ingr(ingr):
    ingr = re.sub(SPEC_REMOVE, '', ingr)
    ingr = re.sub(SPEC_AND, 'and', ingr)
    ingr = re.sub(SPEC_ELSE, ' ', ingr)
    return ' '.join(ingr.split())

def get_ingrs(given):
    ingrs = [[clean_ingr(i).lower() for i in recipe['ingredients']] for recipe in given]
    return ingrs

def get_labels(given):
    return [r['cuisine'] for r in given]

In [34]:
train_ingrs = get_ingrs(train)
train_labels = get_labels(train)
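As a quick sanity check of the rules above, here is a minimal sketch that restates the patterns in a slightly simplified but equivalent form and applies them to a few ingredient strings. The sample strings are made up for illustration and are not taken from the dataset.

```python
import re

# Simplified restatements of the notebook's patterns (same behavior:
# the underscore is already covered by \w, and the extra escapes are optional).
SPEC_REMOVE = re.compile(r"'|’|\(.*oz.*\)|\(|\)")
SPEC_AND = re.compile(r"&")
SPEC_ELSE = re.compile(r"[^\w\s%]")

def clean_ingr(ingr):
    ingr = SPEC_REMOVE.sub('', ingr)    # drop apostrophes, "(.. oz ..)" notes, stray parentheses
    ingr = SPEC_AND.sub('and', ingr)    # "&" -> "and"
    ingr = SPEC_ELSE.sub(' ', ingr)     # every other special character except "%" -> space
    return ' '.join(ingr.split())       # collapse repeated whitespace

# Hypothetical inputs, chosen to exercise each rule once.
samples = [
    "(10 oz.) frozen chopped spinach",
    "Hellmann's® Real Mayonnaise",
    "2% reduced-fat milk",
    "macaroni & cheese",
]
cleaned = [clean_ingr(s).lower() for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Note that the `oz` pattern is greedy, so an ingredient with two parenthesized groups would lose everything between the first `(` and the last `)`.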

## Pipeline
0. Cleaning data (removing and replacing special characters)
1. TF-IDF weighting of whole ingredients and of word n-grams
  * Parameter tuning through grid search
    * `ngram_range = (1, 4)`
    * `stop_words = None` (No use of stop words)
2. Linear SVM fitted on the concatenated TF-IDF matrix
  * Parameter tuning through grid search
    * `loss = 'hinge'`
    * $C = 10^{0.1} \approx 1.26$
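The grid search itself is not shown in this notebook. A minimal sketch of how such a search could be set up with `GridSearchCV` follows; the toy documents, the step names `words` and `svc`, and the candidate values are all assumptions for illustration, not the grid that was actually searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical); the real search would use the cleaned recipes.
docs = [
    "soy sauce ginger rice", "sesame oil soy sauce scallions",
    "rice vinegar ginger garlic", "soy sauce rice noodles",
    "feta cheese olives oregano", "olive oil feta cheese tomatoes",
    "oregano lemon olive oil", "olives tomatoes cucumber feta cheese",
]
labels = ["chinese"] * 4 + ["greek"] * 4

pipe = Pipeline([
    ("words", TfidfVectorizer()),
    ("svc", LinearSVC()),
])

# Candidate values are illustrative; 10 ** 0.1 is roughly 1.26.
param_grid = {
    "words__ngram_range": [(1, 1), (1, 2), (1, 4)],
    "words__stop_words": [None, "english"],
    "svc__loss": ["hinge", "squared_hinge"],
    "svc__C": [10 ** e for e in (-0.5, 0.0, 0.1, 0.5)],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

Searching `C` on a log scale, as in the exponent list here, is a common choice because the useful range of the regularization strength spans orders of magnitude.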

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

In [36]:
# Dummy preprocessor/tokenizer for ingredient counting
def itself(x):
    return x

# Processor to treat list of ingredients as one collection of words
# For ngram counting
def combine_words(ilist):
    return ' '.join(ilist)
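To make the two views of a recipe concrete, the sketch below shows roughly what each one produces for a two-ingredient recipe. It is a plain-Python approximation rather than scikit-learn's own code: `word_ngrams` is a hypothetical helper, and the regular expression is assumed to match `TfidfVectorizer`'s default `token_pattern`, which keeps tokens of two or more word characters.

```python
import re

def itself(x):
    return x

def combine_words(ilist):
    return ' '.join(ilist)

# Assumption: same pattern as TfidfVectorizer's default token_pattern.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, ngram_range=(1, 4)):
    """Hypothetical helper: list all word n-grams within ngram_range."""
    tokens = TOKEN.findall(text)
    lo, hi = ngram_range
    return [' '.join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

recipe = ['feta cheese crumbles', 'purple onion']

# Ingredient-level view: each full ingredient string is one token.
ingr_tokens = itself(recipe)

# Word-level view: the ingredients are joined first, so n-grams can
# span the boundary between two ingredients.
word_tokens = word_ngrams(combine_words(recipe))

print(ingr_tokens)
print(word_tokens)
```

Because the ingredients are joined before tokenizing, n-grams such as `crumbles purple` cross ingredient boundaries, and under the assumed token pattern single characters and `%` never appear in word-level tokens.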

### Combining ingredient-level and word-level features yields better CV accuracy

In [37]:
ingr_word = Pipeline([
    ('union', FeatureUnion([
        ("ingrs", TfidfVectorizer(strip_accents='unicode',
                                  tokenizer=itself,
                                  preprocessor=itself)),
        ("words", TfidfVectorizer(strip_accents='unicode',
                                  preprocessor=combine_words,
                                  stop_words=None,
                                  ngram_range=(1, 4)))
    ])),
    ("linear svc", LinearSVC(loss='hinge', C=10**0.1))
])

### Cross-validation score

In [11]:
%%time
scores = cross_val_score(ingr_word, train_ingrs, train_labels, cv=5, n_jobs=-1)

CPU times: user 787 ms, sys: 79.7 ms, total: 867 ms
Wall time: 33.8 s


In [12]:
scores

array([ 0.78522984,  0.79555165,  0.79122675,  0.78732235,  0.79891783])

In [13]:
scores.mean()

0.79164968300351501

In [17]:
scores.mean()

0.79157428085768422
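The mean alone hides how much the folds disagree. As a small sketch, the five fold scores printed above can be summarized with their sample standard deviation as well; this uses only the standard library and copies the scores by hand.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
scores = [0.78522984, 0.79555165, 0.79122675, 0.78732235, 0.79891783]

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation across the 5 folds

print(f"CV accuracy: {avg:.4f} +/- {spread:.4f}")
```

A fold-to-fold spread of this size is also a plausible explanation for the small difference between the two `scores.mean()` values from separate runs.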

## Prediction
Fit the TF-IDF vectorizers on the ingredients of train+test combined, then fit the classifier on the training data only.

In [46]:
%%time
dvec_all = FeatureUnion([
        ("ingrs", TfidfVectorizer(strip_accents='unicode',
                                  tokenizer=itself,
                                  preprocessor=itself)),
        ("words", TfidfVectorizer(strip_accents='unicode',
                                  preprocessor=combine_words,
                                  ngram_range=(1, 4),
                                  stop_words=None)),
        ]).fit(get_ingrs(train+test))

CPU times: user 7.21 s, sys: 64 ms, total: 7.28 s
Wall time: 7.27 s


In [47]:
test_bag = dvec_all.transform(get_ingrs(test))

In [48]:
test_bag.shape

(9944, 767320)

In [49]:
svc_linear = LinearSVC(loss='hinge', C=10**0.1)

In [50]:
%%time
svc_linear = svc_linear.fit(dvec_all.transform(train_ingrs), train_labels)

CPU times: user 16.1 s, sys: 79.9 ms, total: 16.2 s
Wall time: 16.2 s


In [51]:
test_ids = [r['id'] for r in test]
test_preds = svc_linear.predict(dvec_all.transform(get_ingrs(test)))

In [52]:
import pandas as pd  # not imported earlier in the notebook

df_test = pd.DataFrame({'id': test_ids, 'cuisine': test_preds}, columns=['id', 'cuisine'])

In [53]:
df_test.to_csv('../_data/submission_trainonly.csv', index=False)

* Accuracy: 0.79605
* Rank: 270/1388
![Kaggle](https://raw.githubusercontent.com/isnbh0/whats_cooking/master/_images/180406_14ngram_ingr_linearsvc.png)
![Kaggle](https://raw.githubusercontent.com/isnbh0/whats_cooking/master/_images/180406_14ngram_ingr_linearsvc_standing.png)

## Improvements
* Capture nonlinear structure with nonlinear models (with tuned hyperparameters)
  * Nonlinear kernel
  * Random forest
  * XGBoost
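
As a starting point for the ideas listed above, the sketch below compares an RBF-kernel SVM and a random forest on TF-IDF features with cross-validation. The toy data, step names, and hyperparameter values are assumptions for illustration and are untuned. XGBoost is left out because it requires a separate package.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in data (hypothetical); replace with the cleaned recipes and labels.
docs = [
    "soy sauce ginger rice", "sesame oil soy sauce scallions",
    "rice vinegar ginger garlic", "soy sauce rice noodles",
    "feta cheese olives oregano", "olive oil feta cheese tomatoes",
    "oregano lemon olive oil", "olives tomatoes cucumber feta cheese",
]
labels = ["chinese"] * 4 + ["greek"] * 4

# Untuned candidate models; the values are placeholders.
candidates = {
    "rbf_svc": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, clf in candidates.items():
    pipe = Pipeline([
        ("words", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", clf),
    ])
    results[name] = cross_val_score(pipe, docs, labels, cv=2).mean()

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

On the full feature matrix (hundreds of thousands of sparse columns), kernel SVMs and tree ensembles are likely to train far more slowly than `LinearSVC`, so reducing the n-gram range or the vocabulary first may be worth considering.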