<h3>W207 Final Project</h3>

Team Pacific Knights<br/>
Members: Alan Wang, Daniel Sheinin, Kuan Lin, Michael Andrew Kennedy, Saru Mehta

Competition Description:<br/>
“What’s Cooking” - https://www.kaggle.com/c/whats-cooking<br/>
The goal of this competition is to classify recipes into one of twenty cuisines (regions of origin) based on the ingredients they use. The competition provides a labeled training set containing each recipe's list of raw ingredients and its cuisine. A second, unlabeled data set is provided for scoring, and competitors are ranked by how accurately they label it. Each competitor is allowed up to 5 scoring attempts per day.
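
For orientation, here is a minimal sketch of the record shape that the loading code in this notebook relies on: a JSON list of objects, each with an `ingredients` list and a `cuisine` label. The two records are made up for illustration and are not taken from the data set, and `to_document` is a hypothetical helper name.

```python
import json

# Two made-up records in the shape the notebook's code relies on
# (d["ingredients"] and d["cuisine"]); the values are illustrative only.
sample_json = """
[
  {"cuisine": "italian", "ingredients": ["olive oil", "garlic", "plum tomatoes"]},
  {"cuisine": "french", "ingredients": ["butter", "shallots", "dry white wine"]}
]
"""

def to_document(record):
    # One text document per recipe: ingredients joined by newlines.
    return "\n".join(record["ingredients"])

records = json.loads(sample_json)
documents = [to_document(r) for r in records]
labels = [r["cuisine"] for r in records]

print(documents[0])
print(labels)
```

Each recipe becomes a single newline-separated string, which is the form later passed to the text vectorizer.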

In [8]:
import os
import json
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [9]:
train_file = os.path.join(".","train.json","train.json")
with open(train_file) as data_file:
    train_data = json.loads(data_file.read())

# hold out the first 1/3 of the training data for dev testing;
# integer division (//) keeps the slice index an int
split = len(train_data) // 3
dev_test_data = ['\n'.join(d["ingredients"]) for d in train_data[:split]]
dev_test_label = [d["cuisine"] for d in train_data[:split]]
dev_train_data = ['\n'.join(d["ingredients"]) for d in train_data[split:]]
dev_train_label = [d["cuisine"] for d in train_data[split:]]

Initial Model:<br/>
<ul>
<li>Text model with a TF-IDF vectorizer</li>
<li>n-gram range (1,2) with maximum document frequency set to 0.5</li>
<li>Text preprocessor: lowercase everything and replace "-" and "_" with spaces, since they are commonly used as delimiters in the recipes</li>
<li>Grid search over logistic regression and multinomial naive Bayes</li>
</ul>
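
To make the preprocessing and n-gram settings concrete, the sketch below approximates in plain Python which unigram and bigram features come out of one recipe string. It is an illustration, not the vectorizer itself: `TOKEN_PATTERN` assumes the vectorizer's default word pattern (tokens of two or more word characters), `word_ngrams` is a hypothetical helper, and TF-IDF weighting and the `max_df` cutoff are left out.

```python
import re

def text_preprocessor(s):
    # Same preprocessing as in the notebook: lowercase, "-" and "_" to spaces.
    return s.lower().replace("-", " ").replace("_", " ")

# Assumption: mimics the default word pattern (tokens of 2+ word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(doc, max_n=2):
    # Unigrams first, then longer n-grams built over the whole token sequence.
    tokens = TOKEN_PATTERN.findall(text_preprocessor(doc))
    grams = list(tokens)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

recipe = "Extra-Virgin Olive_Oil\nsea salt"
print(word_ngrams(recipe))
```

Two details are worth noticing. Because `_` counts as a word character, `Olive_Oil` would stay a single token without the replacement step. And because ingredients are joined with newlines into one string, bigrams can span two neighboring ingredients (here `oil sea`).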

In [10]:
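from sklearn.feature_extraction.text import TfidfVectorizer
# GridSearchCV lives in sklearn.model_selection since scikit-learn 0.18;
# the older sklearn.grid_search module has since been removed
from sklearn.model_selection import GridSearchCV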

def text_preprocessor(s):
    return s.lower().replace("-", " ").replace("_", " ")

print("vectorizing texts...")
vec = TfidfVectorizer(preprocessor=text_preprocessor, ngram_range=(1,2), max_df=0.5, strip_accents='unicode')
dev_train_vec = vec.fit_transform(dev_train_data)
dev_test_vec = vec.transform(dev_test_data)

print("grid search on logistic regression:")
# the liblinear solver supports both the 'l1' and 'l2' penalties in the grid
params = {'C': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 1.5, 2.0, 5.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0], 'penalty': ['l1', 'l2']}
model_logistic = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=params, scoring='accuracy')
model_logistic.fit(dev_train_vec, dev_train_label)
print("best parameters:")
print(model_logistic.best_params_)
print("accuracy: %.4f" % model_logistic.score(dev_test_vec, dev_test_label))
print("")
print("grid search on MultinomialNB:")
# alpha=0.0 disables smoothing, so log(0) can occur and trigger a runtime warning
alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0]}
model_MNB = GridSearchCV(MultinomialNB(), param_grid=alphas)
model_MNB.fit(dev_train_vec, dev_train_label)
print("best parameters:")
print(model_MNB.best_params_)
print("accuracy: %.4f" % model_MNB.score(dev_test_vec, dev_test_label))

vectorizing texts...
grid search on logistic regression:
best parameters:
{'penalty': 'l2', 'C': 11.0}
accuracy: 0.7859

grid search on MultinomialNB:
best parameters:
{'alpha': 0.01}
accuracy: 0.7385


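The best initial model is logistic regression with an L2 penalty and C = 11.0 (C is the inverse of the regularization strength). It reaches 0.7859 accuracy on the dev set, versus 0.7385 for multinomial naive Bayes.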

Error Analysis #1:
<ul>
<li>Use confusion matrix to identify common errors</li>
<li>Look at which recipes are causing the confusion</li>
</ul>
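
One way to read a 20x20 confusion matrix quickly is to rank its off-diagonal cells. The sketch below does this on a small made-up matrix; `top_confusions` is a hypothetical helper, and the toy counts are not results from this notebook. Rows are taken to be true classes and columns predicted classes, matching `sklearn.metrics.confusion_matrix`.

```python
def top_confusions(cm, classes, k=3):
    # Collect every off-diagonal cell as (count, true class, predicted class).
    pairs = []
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((count, classes[i], classes[j]))
    # Largest counts first.
    pairs.sort(reverse=True)
    return pairs[:k]

# Made-up 3-class example, for illustration only.
toy_classes = ["french", "italian", "mexican"]
toy_cm = [
    [50, 12, 1],
    [9, 80, 2],
    [0, 3, 70],
]

for count, true_label, predicted in top_confusions(toy_cm, toy_classes):
    print("%4d  true=%s  predicted=%s" % (count, true_label, predicted))
```

The same function should work on the matrix and the `classes_` array produced in the next cell, since it only iterates over rows and indexes the class list.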

In [24]:
from sklearn.metrics import confusion_matrix
import sys

print("make a confusion matrix on the best initial model")
# rows are true cuisines, columns are predicted cuisines
cm = confusion_matrix(dev_test_label, model_logistic.predict(dev_test_vec))
for row in cm:
    sys.stdout.write("".join("%4d " % col for col in row) + "\n")
print("")
print("classes:")
print(model_logistic.best_estimator_.classes_)

make a confusion matrix on the best initial model
  84    0    2    0    3    6    0    4    0   13    0    0    0   23    0    2   16    5    6    0 
   0  105    3    1    1   44    3    7   19   21    5    1    0    6    0    4   56    1    1    0 
   1    2  379    0    1   29    0    2    1   27    1    0    0   18    1    4   73    2    0    0 
   1    3    3  769    6    2    0    7    0   14    0   15   16    4    0    2   17    1   23   12 
   2    3    1   31  153    6    0    2    0   10    1    2    1   11    1    0   14    2    3    6 
   0    4    6    1    3  565    6    7    8  166    0    2    0    8    2    8   65   16    0    1 
   0    0    0    2    0    9  272    9    1   65    0    1    0    3   12    1    4    8    0    0 
   2    0    0    2    3    4    9  882    0    6    1    0    1   21   15    0   11    0    8    0 
   1   12    0    1    0   31    3    1   97   15    3    0    0    4    2    1   34    3    0    0 
   1    6    4    0    0  115   39    3  

The confusion matrix shows that Italian and French recipes are frequently confused with each other. Let's try treating some of the ingredients common to both cuisines as stop words. We also lemmatize the words so that several variants of the same ingredient map to a single feature.
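
The toy sketch below illustrates the order of operations this idea depends on: tokens are lemmatized first, and the stop-word list is then matched against the lemmatized tokens, so `onions` is dropped by the stop word `onion`. As far as we understand, scikit-learn's vectorizer also filters stop words on the tokenizer's output before building n-grams. `TOY_LEMMAS` is a hypothetical lookup table standing in for `WordNetLemmatizer`, and whitespace splitting stands in for `word_tokenize`.

```python
# Hypothetical lemma table standing in for nltk's WordNetLemmatizer.
TOY_LEMMAS = {"onions": "onion", "tomatoes": "tomato", "cloves": "clove"}
# A subset of stop words, chosen for illustration.
STOP_WORDS = {"salt", "onion", "olive", "oil"}

def tokenize_and_lemmatize(doc):
    # Whitespace split stands in for word_tokenize; unknown words pass through.
    return [TOY_LEMMAS.get(t, t) for t in doc.lower().split()]

def filtered_tokens(doc):
    # Stop words are compared with the lemmatized form of each token.
    return [t for t in tokenize_and_lemmatize(doc) if t not in STOP_WORDS]

print(filtered_tokens("Onions\nolive oil\nplum Tomatoes"))
```

A consequence of this ordering is that the stop-word list only needs the base form of each ingredient, provided the lemmatizer maps the variants onto it.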

In [27]:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [37]:
# common ingredients to drop; stop words are matched against the lemmatized tokens
unwanted_features = ['salt', 'water', 'onion', 'garlic', 'olive', 'oil', 'clove']
vec = TfidfVectorizer(preprocessor=text_preprocessor, ngram_range=(1,2), max_df=0.5, stop_words=unwanted_features, strip_accents='unicode', tokenizer=LemmaTokenizer())
dev_train_vec = vec.fit_transform(dev_train_data)
dev_test_vec = vec.transform(dev_test_data)
# train with the best parameters found for the initial model
model_logistic2 = LogisticRegression(penalty='l2', C=11.0, solver='liblinear')
model_logistic2.fit(dev_train_vec, dev_train_label)
print("accuracy: %.4f" % model_logistic2.score(dev_test_vec, dev_test_label))

accuracy: 0.7842
