# Kaggle's What's Cooking competition
https://www.kaggle.com/c/whats-cooking

Below I present the code that took me to 5th place on the leaderboard. I experimented with various feature engineering approaches and dozens of classifiers, but the final submission is just a single [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). The key idea was to exploit the relations between ingredients: they boosted the power of the classifier a lot!

## Competition
The competition was about predicting the type of cuisine of a dish based on its ingredients. The training set consisted of 39774 recipes that look like this (I omitted the id):

In [1]:
train_sample = {
    'cuisine': 'greek',
    'ingredients': [
        'romaine lettuce',
        'black olives',
        'grape tomatoes',
        'garlic',
        'pepper',
        'purple onion',
        'seasoning',
        'garbanzo beans',
        'feta cheese crumbles'
    ]
}

The task was to correctly predict the cuisines of another 9944 recipes. There were 20 cuisines:

In [2]:
import numpy as np
import pandas as pd

data = pd.read_json('data/train.json')
print('recipes: ', data.shape[0])
print('cuisines', sorted(set(data.cuisine)))

recipes:  39774
cuisines ['brazilian', 'british', 'cajun_creole', 'chinese', 'filipino', 'french', 'greek', 'indian', 'irish', 'italian', 'jamaican', 'japanese', 'korean', 'mexican', 'moroccan', 'russian', 'southern_us', 'spanish', 'thai', 'vietnamese']


## Bag-of-ingredients

In order to build a classifier we have to somehow translate ingredients into numbers. A common approach is the bag-of-words model. First, all words from the recipes are collected to build a vocabulary, i.e. a list of unique words. Using the index of each word in that list, every recipe is converted into a vector where each entry represents the number of appearances of a word in the recipe. I use the scikit-learn implementation, [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html); here is an example:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

recipes = [
    'salt, sugar, black pepper',
    'cucumber, carrot, salt'
]

vect = CountVectorizer()
vectors = vect.fit_transform(recipes).todense()

pd.DataFrame(data=vectors, columns=sorted(vect.vocabulary_))

|   | black | carrot | cucumber | pepper | salt | sugar |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |


In our case the ingredients already come as a list, so we have to slightly modify the *preprocessor* of CountVectorizer: we'll join all the ingredients into a single string. Also, since we shouldn't differentiate between various forms of the same word, e.g. 'egg' and 'eggs' or 'fry' and 'fried', we'll perform stemming in a custom *tokenizer*. The helper class **StemmerTokenizer** below deals with both issues:

In [4]:
from nltk import regexp_tokenize
from nltk.stem.snowball import SnowballStemmer


class StemmerTokenizer(object):
    """
    Joins all ingredients into a single string and provides
    a list of stems of all words of at least 2 letters.
    
    Example:
    >>> tok = StemmerTokenizer()
    >>> tok.tokenizer(
            tok.preprocessor([
                'romaine lettuce', 'black olives', 'grape tomatoes',
                'garlic', 'pepper', 'purple onion', 'seasoning',
                'garbanzo beans', 'feta cheese crumbles'
            ])
        )
    ['romain', 'lettuc', 'black', 'oliv', 'grape', 'tomato',
     'garlic', 'pepper', 'purpl', 'onion', 'season',
     'garbanzo', 'bean', 'feta', 'chees', 'crumbl']
    """

    def __init__(self):
        self.pattern = r'(?u)\b[a-zA-Z_][a-zA-Z_]+\b'
        self.stemmer = SnowballStemmer('english')

    def mapper(self, word):
        return self.stemmer.stem(word)

    def tokenizer(self, doc):
        return [self.mapper(t) for t in regexp_tokenize(doc, pattern=self.pattern)]

    def preprocessor(self, line):
        return ' '.join(line).lower()

And here is an example use:

In [5]:
recipes = [
    ['tomatoes', 'fresh basil', 'garlic', 'extra-virgin olive oil', 'salt', 'black pepper'],
    ['olive oil', 'onion', 'pork', 'cheddar cheese', 'ground black pepper', 'salt', 'lime', 'jalapeno chilies'],
]

vect = CountVectorizer(
    preprocessor=StemmerTokenizer().preprocessor,
    tokenizer=StemmerTokenizer().tokenizer)
vectors = vect.fit_transform(recipes)

pd.DataFrame(data=vectors.todense(), columns=sorted(vect.vocabulary_))

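|   | basil | black | cheddar | chees | chili | extra | fresh | garlic | ground | jalapeno | lime | oil | oliv | onion | pepper | pork | salt | tomato | virgin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |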


Just by looking at the ingredients one could guess that the first recipe is probably Italian or French: we can pick out typical products such as extra-virgin olive oil and fresh basil. The second one is probably Mexican: jalapeno chilies are a good indicator. So the first thing we do to recognize a cuisine is to filter out the most common ingredients, such as salt, black pepper and olive oil, and focus on the most distinctive ones. It would help to incorporate this information about 'commonness' into the above vectors.

The answer is term frequency–inverse document frequency, or [tf-idf](https://en.wikipedia.org/wiki/Tf-idf). I'm not going to go into details, as it is a common approach when working with bags-of-words; we will rather focus on the outcomes. Again, we use the scikit-learn implementation, [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), with the variables from the previous cell:

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer

trans = TfidfTransformer()
tfidf = trans.fit_transform(vectors)

pd.DataFrame(data=tfidf.todense(), columns=sorted(vect.vocabulary_)).applymap(lambda x: '%.3f' % x)

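|   | basil | black | cheddar | chees | chili | extra | fresh | garlic | ground | jalapeno | lime | oil | oliv | onion | pepper | pork | salt | tomato | virgin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.342 | 0.244 | 0.0 | 0.0 | 0.0 | 0.342 | 0.342 | 0.342 | 0.0 | 0.0 | 0.0 | 0.244 | 0.244 | 0.0 | 0.244 | 0.0 | 0.244 | 0.342 | 0.342 |
| 1 | 0.0 | 0.219 | 0.308 | 0.308 | 0.308 | 0.0 | 0.0 | 0.0 | 0.308 | 0.308 | 0.308 | 0.219 | 0.219 | 0.308 | 0.219 | 0.308 | 0.219 | 0.0 | 0.0 |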
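
For readers who do want the details, the weights in this table can be reproduced by hand. The sketch below assumes the scikit-learn defaults for TfidfTransformer as I understand them: a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper name `tfidf_row` is made up for illustration.

```python
import math

# Stems of the two example recipes, as listed in the table header above.
docs = [
    ['tomato', 'fresh', 'basil', 'garlic', 'extra', 'virgin',
     'oliv', 'oil', 'salt', 'black', 'pepper'],
    ['oliv', 'oil', 'onion', 'pork', 'cheddar', 'chees', 'ground',
     'black', 'pepper', 'salt', 'lime', 'jalapeno', 'chili'],
]

n_docs = len(docs)
vocabulary = sorted(set(stem for doc in docs for stem in doc))

# Document frequency: in how many recipes each stem appears.
df = {stem: sum(stem in doc for doc in docs) for stem in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {stem: math.log((1 + n_docs) / (1 + df[stem])) + 1 for stem in vocabulary}


def tfidf_row(doc):
    """Hypothetical helper: tf * idf for one recipe, L2-normalized."""
    raw = {stem: doc.count(stem) * idf[stem] for stem in set(doc)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {stem: w / norm for stem, w in raw.items()}


weights = [tfidf_row(doc) for doc in docs]
print('%.3f' % weights[0]['basil'])  # stem unique to the first recipe
print('%.3f' % weights[0]['salt'])   # stem shared by both recipes
```

Under these assumptions the two printed values are 0.342 and 0.244, the same as the first row of the table: a stem found in only one recipe gets an idf of about 1.405, a stem found in both gets exactly 1.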


We can see that ingredients appearing in a single recipe have larger weights than the common ones, which is what we wanted! To make CountVectorizer and TfidfTransformer easier to use, we will chain them in a [pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) (we could use a [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) instead, but we keep them separate so it's clearer what's going on). We'll now reproduce the above table:

In [7]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('vectorizer', CountVectorizer(
            preprocessor=StemmerTokenizer().preprocessor,
            tokenizer=StemmerTokenizer().tokenizer)),
        ('transformer', TfidfTransformer()),
    ])

vectors = pipeline.fit_transform(recipes)
pd.DataFrame(data=vectors.todense(),
             columns=sorted(pipeline.named_steps['vectorizer'].vocabulary_)
            ).applymap(lambda x: '%.3f' % x)

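|   | basil | black | cheddar | chees | chili | extra | fresh | garlic | ground | jalapeno | lime | oil | oliv | onion | pepper | pork | salt | tomato | virgin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.342 | 0.244 | 0.0 | 0.0 | 0.0 | 0.342 | 0.342 | 0.342 | 0.0 | 0.0 | 0.0 | 0.244 | 0.244 | 0.0 | 0.244 | 0.0 | 0.244 | 0.342 | 0.342 |
| 1 | 0.0 | 0.219 | 0.308 | 0.308 | 0.308 | 0.0 | 0.0 | 0.0 | 0.308 | 0.308 | 0.308 | 0.219 | 0.219 | 0.308 | 0.219 | 0.308 | 0.219 | 0.0 | 0.0 |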
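
As a side note on the TfidfVectorizer alternative mentioned above, here is a minimal sketch suggesting that the one-step and two-step variants give the same matrix. To keep it self-contained it uses plain strings and the default tokenizer (no stemming), so the made-up `docs` below are only an illustration, not the competition data.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Plain strings instead of ingredient lists, so the default
# preprocessor and tokenizer can be used (no stemming here).
docs = [
    'tomatoes fresh basil garlic olive oil salt black pepper',
    'olive oil onion pork cheddar cheese black pepper salt lime',
]

two_step = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('transformer', TfidfTransformer()),
])
one_step = TfidfVectorizer()

a = two_step.fit_transform(docs).toarray()
b = one_step.fit_transform(docs).toarray()

# Both variants should produce the same weights in the same column order.
print(np.allclose(a, b))
```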


## Relations between words

My first classifiers were based on the above vectorization of recipes. However, I soon realized that listing all words without preserving the relations between them brings a significant loss of information. A simple example: if we vectorize these two ingredient lists:

 - 'red pepper, black olives'   -> ['black', 'oliv', 'pepper', 'red']
 - 'black pepper, green olives' -> ['black', 'oliv', 'pepper', 'green']

The only difference now is 'red' versus 'green'; we can no longer tell 'red pepper' from 'black pepper'. This is not a rare case: there are plenty of such overlaps.
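
The overlap is easy to check in a few lines. This is a simplified sketch: it splits on whitespace and skips stemming, and the helper name `bag_of_words` is made up for illustration.

```python
def bag_of_words(ingredients):
    """Split every ingredient into words and forget which words belonged together."""
    return sorted(set(word for ingr in ingredients for word in ingr.split()))


recipe_a = ['red pepper', 'black olives']
recipe_b = ['black pepper', 'green olives']

bag_a = bag_of_words(recipe_a)
bag_b = bag_of_words(recipe_b)

print(bag_a)  # ['black', 'olives', 'pepper', 'red']
print(bag_b)  # ['black', 'green', 'olives', 'pepper']

# Words that distinguish the two bags: only the colors survive,
# and nothing says which color went with 'pepper'.
print(sorted(set(bag_a) ^ set(bag_b)))  # ['green', 'red']
```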

However, this is not the only argument for preserving the relations between words. We can see another one by examining recipes with ingredients that are often present in certain cuisines but not unique to them. Let's count the recipes with **dijon** and **wine** in French cuisine:

In [8]:
def recipes_with(ingrs, df):
    df = df.copy()
    # `~` is bitwise NOT and is always truthy on a bool; `not` is what we want
    if not isinstance(ingrs, list):
        ingrs = [ingrs]
    for ingr in ingrs:
        df = df[df.ingredients.apply(lambda row: ingr in row)]
    return df
  
def recipes_of(cuisine, df):
    df_in = df[df.cuisine == cuisine].copy()
    df_out = df[df.cuisine != cuisine].copy()
    return df_in, df_out

def fraction(df0, df1):
    return '%.2f' % (len(df0) / (len(df0) + len(df1)))

ingrs = ('dijon', 'wine')
cuisine = 'french'

data = pd.read_json('data/train.json')
data['ingredients'] = data.ingredients.apply(lambda ingrs: ' '.join(ingrs))
df0in, df0out = recipes_of(cuisine, recipes_with(ingrs[0], data))
df1in, df1out = recipes_of(cuisine, recipes_with(ingrs[1], data))
df01in = pd.merge(df0in, df1in, how='inner')
df01out = pd.merge(df0out, df1out, how='inner')

pd.DataFrame(
    data=[
        [len(df0in), len(df0out), fraction(df0in, df0out)],
        [len(df1in), len(df1out), fraction(df1in, df1out)],
        [len(df01in), len(df01out), fraction(df01in, df01out)]],
    index=list(ingrs) + ['%s + %s' % ingrs],
    columns=[cuisine, 'non-' + cuisine, 'fraction in cuisine'])

|   | french | non-french | fraction in cuisine |
|---|---|---|---|
| dijon | 190 | 380 | 0.33 |
| wine | 613 | 3551 | 0.15 |
| dijon + wine | 87 | 86 | 0.5 |


We can see from the above table that:
 - 33% of recipes with dijon are French
 - 15% of recipes with wine are French

This is very important information when classifying. However, if we consider the simultaneous appearance of both dijon and wine in a recipe, we get a **50% chance that the recipe is French!** This is a significant increase in our odds!

**Conclusion**

We should use not only the single words derived from a recipe but also pairs of words, which preserves some valuable information.

## DupleTokenizer

Below I introduce *DupleTokenizer*, a class inheriting from *StemmerTokenizer* with an additional method that creates pairs of words.

In [9]:
class DupleTokenizer(StemmerTokenizer):
    """
    Builds upon the StemmerTokenizer: after all words in a recipe are stemmed
    they are grouped in all possible combinations of two words and added
    to the words' list.
    
    Example (simplified):
    >>> tok = DupleTokenizer()
    >>> tok.tokenizer(
            tok.preprocessor([
                'sugar', 'salt', 'black pepper'
            ])
        )
    ['black', 'pepper', 'salt', 'sugar',
     'black pepper', 'black salt', 'black sugar', 'pepper salt', 'pepper sugar', 'salt sugar']
    """

    def duple(self, x):
        # All unordered pairs (x[i], x[j]) with i < j, joined by a space
        return np.array(['%s %s' % (x[i], x[j])
                         for i in range(len(x) - 1)
                         for j in range(i + 1, len(x))])

    def tokenizer(self, doc):
        words = np.array([self.mapper(t) for t in regexp_tokenize(doc, pattern=self.pattern)])
        words = sorted(set(words))
        words = np.hstack([words, self.duple(words)])
        return words

By using the above tokenizer, we get over 100 times more features:
 - StemmerTokenizer: 2598
 - DupleTokenizer: 292246
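
The growth comes from the pair construction itself: a recipe with n distinct stems contributes n(n-1)/2 pair tokens on top of its n single words. The sketch below illustrates this with `itertools.combinations` from the standard library; it is an alternative way to write the pairing, not the code used in the class above, and `word_pairs` is a made-up name.

```python
from itertools import combinations


def word_pairs(stems):
    """All unordered pairs of distinct stems, in sorted order."""
    unique = sorted(set(stems))
    return ['%s %s' % pair for pair in combinations(unique, 2)]


stems = ['sugar', 'salt', 'black', 'pepper']
pairs = word_pairs(stems)
print(pairs)

# The number of pairs is n * (n - 1) / 2 for n distinct stems.
n = len(set(stems))
print(len(pairs) == n * (n - 1) // 2)
```

For the four stems above this gives the same six pairs as in the DupleTokenizer docstring.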

In [10]:
def features_count(tokenizer, df):
    vect = CountVectorizer(
        preprocessor=tokenizer().preprocessor,
        tokenizer=tokenizer().tokenizer)
    return len(vect.fit(df.ingredients).vocabulary_)

# It takes around 40s, so I commented it out
# data = pd.read_json('data/train.json')
# print('single words:           ', features_count(StemmerTokenizer, data))
# print('single words and pairs: ', features_count(DupleTokenizer, data))

## Classifier

After extracting features we can focus on the classifier. We have around 40k training samples with around 300k features, so there are many more features than observations. In such a situation we can run into the curse of dimensionality, and there is a high risk of overfitting. One way to deal with this problem would be to reduce the number of features. However, it turns out that directly using an SVM with a linear kernel, [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html), works very well. This classifier, if properly regularized by choosing a suitable penalty parameter, is highly resistant to overfitting. To deal with multiple classes, the model is trained in a one-vs-rest regime. It is also worth noting that LinearSVC is pretty fast.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
    
tokenizerClass = DupleTokenizer

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        preprocessor=tokenizerClass().preprocessor,
        tokenizer=tokenizerClass().tokenizer,
        stop_words='english',
        max_df=1.0,
        min_df=1,
        binary=True,
    )),
    ('transformer', TfidfTransformer()),
    ('classifier', LinearSVC(
        C=0.78, penalty='l2', loss='squared_hinge', dual=True, max_iter=1000, random_state=0))
])
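
This write-up does not describe how the penalty parameter `C=0.78` was found. One common way to pick such a value is a cross-validated grid search. The sketch below is only an assumption about a workable approach, run on a made-up toy dataset with an arbitrary grid; it is not the procedure actually used for the submission.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy recipes, only to make the sketch runnable.
X = [
    'soy sauce ginger rice', 'soy sauce sesame oil scallions',
    'rice vinegar ginger garlic', 'sesame oil soy sauce noodles',
    'olive oil basil tomatoes', 'parmesan basil garlic pasta',
    'olive oil mozzarella tomatoes', 'pasta parmesan olive oil',
    'tortillas salsa cilantro', 'jalapeno lime cilantro avocado',
    'black beans tortillas salsa', 'avocado lime jalapeno corn',
]
y = ['chinese'] * 4 + ['italian'] * 4 + ['mexican'] * 4

toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LinearSVC(dual=True, random_state=0)),
])

# Arbitrary grid of penalty values; 0.78 is the value used in this post.
search = GridSearchCV(
    toy_pipeline,
    param_grid={'classifier__C': [0.01, 0.1, 0.78, 10.0]},
    cv=2,
    scoring='accuracy',
)
search.fit(X, y)

print(search.best_params_)
print(search.predict(['basil tomatoes pasta']))
```

On the real data the same pattern would wrap the full pipeline, with a grid centered on values that looked promising in earlier experiments.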

In [12]:
# Takes around 50s
train = pd.read_json('data/train.json')
X_train = train['ingredients']
y_train = train['cuisine']

test = pd.read_json('data/test.json')
X_test = test['ingredients']

pipeline.fit(X_train, y_train)
prediction = pipeline.predict(X_test)

In [13]:
import csv

submission = test.copy()
submission['cuisine'] = prediction
# csv.QUOTE_NONE has the value 3; the named constant is easier to read
submission.to_csv(
    'submission.csv', index=False, quoting=csv.QUOTE_NONE,
    columns=['id', 'cuisine'])

This is a recipe to jump to 5th place!
![score](score.png)