# Data Exploration For Recipe Classification

Since this is for a class project, and not the Kaggle "What's Cooking" competition, the goals are a little different. While accuracy is important, since this is based on the fictional scenario of "ingredients on hand", there won't be exactly right and wrong answers. A model that trains and predicts quickly would be valuable here, so that the model would not need to be stored anywhere, and a model that has a method to predict the most similar recipes to the input would make things simpler (looking at you, kNN). Additionally, since a required output is some sort of "similarity" or "probability" to the predicted cuisine, a classifier with a predict_proba() function would be desirable.

To achieve this, I'm going to look at a few different ways to vectorize the data (along with some basic "cleaning" of the recipes), and then test the data with different models and different sets of hyperparameters. Since speed is valuable here as well, I'll also be looking at which models perform well with smaller fractions of the dataset. cross_val_score with the accuracy metric from sklearn will be used to assess each model. Since the "test" here is with user input data, 100% of the data will be used for training/cross-validation.
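
The two properties described above (a probability for the predicted cuisine and a lookup of the most similar recipes) can be sketched on a toy corpus. The recipes, cuisines, and query below are made up for illustration and are not taken from the dataset; this is a minimal sketch, not the pipeline used in the rest of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data, not taken from the real dataset
recipes = ['soy sauce ginger rice',
           'soy sauce sesame oil scallions',
           'tomato basil mozzarella olive oil',
           'pasta tomato garlic parmesan',
           'tortilla beans salsa cilantro',
           'avocado lime cilantro tortilla']
cuisines = ['chinese', 'chinese', 'italian', 'italian', 'mexican', 'mexican']

vec = TfidfVectorizer()
X = vec.fit_transform(recipes)  # one TF-IDF row per recipe

knn = KNeighborsClassifier(n_neighbors=2).fit(X, cuisines)

query = vec.transform(['ginger soy sauce scallions'])  # the "ingredients on hand"
probs = dict(zip(knn.classes_, knn.predict_proba(query)[0]))  # per-cuisine probability
distances, indices = knn.kneighbors(query)  # nearest training recipes
similar = [recipes[i] for i in indices[0]]

print(probs)
print(similar)
```

With kNN, the same fitted object answers both questions, which is why it is attractive for this scenario.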

In [117]:
import os
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from gensim.models import Word2Vec
from sklearn.model_selection import cross_val_score, GridSearchCV
import spacy
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import time
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB
from sklearn.calibration import CalibratedClassifierCV

# Read In Data

Use the pandas .read_json() functionality to simplify reading in the training data.

In [2]:
filename = 'yummly.json'
df = pd.read_json(filename)

# De-Tokenize for Cleaning

The cleaning and some of the vectorization steps work better when all of the ingredients in a recipe are joined into one string, so a column is added to the dataframe with the joined recipes.

In [3]:
def de_tokenize(df):
    '''
    takes a dataframe as argument
    returns same dataframe with additional column for joined ingredient lists
    '''
    
    l = [' '.join(recipe) for recipe in df['ingredients']] #get list of joined ingredient lists
    
    df['ing_join'] = l #create new column in dataframe to hold them
    
    return df

In [4]:
df = de_tokenize(df)

# Clean Recipes

Remove some stopwords, and do some basic cleaning. Since word2vec likes the tokenized recipes, while the other vectorizers like the joined recipes, the functions will add both tokenized and non-tokenized columns to the dataframe.

In [5]:
def remove_stopwords(text,stopwords):
    '''
    Takes text and a set of stopwords as arguments and returns word tokens without stopwords
    '''
    
    tokens = word_tokenize(text) #tokenize the words in the recipe
    cleaned_tokens = [word for word in tokens if word not in stopwords] #remove stopwords
    
    return cleaned_tokens

In [6]:
def clean_recipes(df):
    
    '''
    takes the recipe dataframe as an argument
    removes stopwords, does basic cleaning and adds cleaned columns to dataframe
    '''
    
    stops = set(stopwords.words('english')) #get stopwords
    tokl = [] #initialize tokenized recipes list
    sentl = [] #initialize joined list
    
    for sent in df['ing_join']:
        
        sent = re.sub(r'\bmayonnais\b', 'mayonnaise', sent) #a bunch of recipes have mayonnaise misspelled for some reason; strings are immutable, so the result must be assigned
    
        sent=re.sub(r'\([^)]*\)', '', sent) #remove parentheticals
    
        sent=re.sub(r'[^a-zA-Z ]+','',sent) #remove punctuation and numbers
    
        sent=sent.lower() #convert to lowercase
    
        tok=remove_stopwords(sent,stops) #get stopword-free tokens
        
        tokl.append(tok) #append to the tokenized list
        
        sent=' '.join(tok) #get joined sentence
        
        sentl.append(sent) #append to sentence list
    
    df['ing_clean']=sentl #add column to df
    df['ing_clean_tok'] = tokl #add column to df
    
    return df

In [7]:
df = clean_recipes(df)
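
To see what the cleaning steps do to a single string, here is a minimal sketch that applies the same two regular expressions and the lowercasing. The input string and the tiny stopword set are hypothetical, and str.split() stands in for NLTK's word_tokenize so the example runs without downloading any corpora.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list
TOY_STOPWORDS = frozenset({'and', 'or', 'of', 'with'})

def clean_text(text, stopwords=TOY_STOPWORDS):
    text = re.sub(r'\([^)]*\)', '', text)    # remove parentheticals
    text = re.sub(r'[^a-zA-Z ]+', '', text)  # remove punctuation and numbers
    text = text.lower()                      # convert to lowercase
    return [tok for tok in text.split() if tok not in stopwords]

tokens = clean_text('2 cans (14.5 oz.) Diced Tomatoes and 1% low-fat milk')
print(tokens)
```

Note that stripping punctuation also removes hyphens, so "low-fat" becomes the single token "lowfat".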

Check the results of the cleaning. Notice that the cleaned recipes are tokenized differently from the un-cleaned ones: ing_clean_tok holds single words, while ingredients holds whole ingredient phrases. The experiments below check whether this makes a difference.

In [8]:
df.head()

|  | id | cuisine | ingredients | ing_join | ing_clean | ing_clean_tok |
|---|---|---|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce black olives grape tomatoes ga... | romaine lettuce black olives grape tomatoes ga... | [romaine, lettuce, black, olives, grape, tomat... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... | plain flour ground pepper salt tomatoes ground... | plain flour ground pepper salt tomatoes ground... | [plain, flour, ground, pepper, salt, tomatoes,... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs pepper salt mayonaise cooking oil green c... | eggs pepper salt mayonaise cooking oil green c... | [eggs, pepper, salt, mayonaise, cooking, oil, ... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] | water vegetable oil wheat salt | water vegetable oil wheat salt | [water, vegetable, oil, wheat, salt] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... | black pepper shallots cornflour cayenne pepper... | black pepper shallots cornflour cayenne pepper... | [black, pepper, shallots, cornflour, cayenne, ... |
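
The tokenization difference can be shown with the first few ingredients of the first recipe. This is only an illustration of the two token granularities; the variable names are hypothetical.

```python
# First few ingredients of the first recipe shown above
ingredients = ['romaine lettuce', 'black olives', 'grape tomatoes']

phrase_tokens = list(ingredients)           # un-cleaned: one token per ingredient phrase
word_tokens = ' '.join(ingredients).split() # cleaned: one token per word

print(phrase_tokens)
print(word_tokens)
```

A Word2Vec model trained on the phrase tokens learns one vector per full ingredient name, while one trained on the word tokens learns one vector per word.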


# Vectorize Recipes
In this section, the recipe ingredients will be vectorized using CountVectorizer, TfidfVectorizer, Word2Vec (from gensim) and GloVe (from SpaCy). Both the cleaned and uncleaned versions of the recipes will be vectorized.

## Countvec

In [9]:
clean_cvec = CountVectorizer(min_df=2) #tried to capture as many ingredients as possible while filtering for typos
clean_cmat = clean_cvec.fit_transform(df['ing_clean']) #vectorize the cleaned versions of recipes
clean_cmat.shape #check shape of matrix

(39774, 2475)

In [10]:
dirt_cvec = CountVectorizer(min_df=2) #tried to capture as many ingredients as possible while filtering for typos
dirt_cmat = dirt_cvec.fit_transform(df['ing_join']) #vectorized the un-cleaned versions of recipes
dirt_cmat.shape #check shape of matrix

(39774, 2459)
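
As a small illustration of what min_df=2 does, the sketch below fits a CountVectorizer on three made-up recipes. Terms that appear in only one document are dropped from the vocabulary, which is how rare typos get filtered out.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical recipes; 'pepper', 'onion' and 'saffron' each appear only once
docs = ['salt pepper garlic', 'salt garlic onion', 'salt saffron']

vec = CountVectorizer(min_df=2)  # keep terms found in at least 2 documents
mat = vec.fit_transform(docs)
kept = sorted(vec.vocabulary_)

print(kept, mat.shape)
```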

## TfIdf

In [11]:
clean_vec = TfidfVectorizer(min_df=2) #tried to capture as many ingredients as possible while filtering for typos
clean_mat = clean_vec.fit_transform(df['ing_clean']) #vectorize the cleaned versions of recipes
clean_mat.shape #check shape of matrix

(39774, 2475)

In [12]:
dirt_vec = TfidfVectorizer(min_df=2) #tried to capture as many ingredients as possible while filtering for typos
dirt_mat = dirt_vec.fit_transform(df['ing_join']) #vectorized the un-cleaned versions of recipes
dirt_mat.shape #check shape of matrix

(39774, 2459)
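
The difference between the count and TF-IDF matrices can be seen on a toy corpus. In this sketch (made-up recipes again), the rarer term gets a larger inverse-document-frequency weight, and each row is scaled to unit length, which is the TfidfVectorizer default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical recipes: 'salt' is in all three, 'garlic' in two
docs = ['salt pepper garlic', 'salt garlic onion', 'salt saffron']

vec = TfidfVectorizer(min_df=2)
mat = vec.fit_transform(docs)

# Map each kept term to its IDF weight
idf = {term: vec.idf_[idx] for term, idx in vec.vocabulary_.items()}
# L2 norm of every row
row_norms = np.asarray(mat.multiply(mat).sum(axis=1)).ravel() ** 0.5

print(idf)
print(row_norms)
```

The unit-length rows are one plausible reason distance-based models such as kNN score better on the TF-IDF data than on raw counts, although the notebook does not test that explanation directly.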

## Word2Vec

In [13]:
def get_model(df, size=300, mincount=2, clean=False):
    
    '''
    takes the recipe dataframe along with some word2vec parameters as arguments
    returns the vectors of a trained word2vec model
    '''
    
    if clean:
        model = Word2Vec(np.array(df['ing_clean_tok']), size=size, min_count=mincount)
    else:
        model = Word2Vec(np.array(df['ingredients']), size=size, min_count=mincount)
    
    return model.wv #returning the .wv portion (i.e. just vectors) of the model saves memory

In [14]:
def get_recipe_vectors(df,mod_vecs,zeros=False, clean=False):
    
    '''
    takes the recipe dataframe and word2vec vectors as arguments
    returns the vectorized recipe list
    '''

    if clean: #determining which column to use for vectorization
        
        ingredients = np.array(df['ing_clean_tok']) #get the ingredients to vectorize
        
    else:
        
        ingredients = np.array(df['ingredients']) #get the ingredients to vectorize
    
    if zeros: #i.e. any non-found ingredient is given a vector of all zeroes
        
        length = mod_vecs.vectors.shape[1] #get length of potential zeroes vectors
        
        #membership is tested against the keyed vectors directly, which avoids scanning the index2word list
        vectors = np.array([np.mean([mod_vecs[word] if word in mod_vecs else np.zeros(length) for word in recipe], 
                                axis=0) for recipe in ingredients])
        #average all of the individual ingredient vectors, so that each recipe has one vector
    
    else: #i.e. ignore any non-found ingredients
        vectors = np.array([np.mean([mod_vecs[word] for word in recipe if word in mod_vecs], 
                                    axis=0) for recipe in ingredients])
        #average all of the individual ingredient vectors, so that each recipe has one vector
    
    return vectors
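
The effect of the zeros flag can be illustrated without training a model. The sketch below uses a hypothetical dictionary of 2-dimensional vectors in place of the Word2Vec vectors and averages a recipe both ways.

```python
import numpy as np

# Hypothetical 2-d word vectors standing in for a trained Word2Vec model
word_vecs = {'salt': np.array([1.0, 0.0]), 'pepper': np.array([0.0, 1.0])}
recipe = ['salt', 'pepper', 'dragonfruit']  # the last token is out of vocabulary
dim = 2

# zeros=True: unknown tokens contribute a zero vector and pull the mean toward the origin
with_zeros = np.mean([word_vecs.get(w, np.zeros(dim)) for w in recipe], axis=0)

# zeros=False: unknown tokens are skipped entirely
ignoring = np.mean([word_vecs[w] for w in recipe if w in word_vecs], axis=0)

print(with_zeros, ignoring)
```

One caveat of the second approach: if none of a recipe's tokens are in the vocabulary, np.mean is taken over an empty list and returns nan.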

In [15]:
clean_mod = get_model(df, clean=True) #get model vectors for cleaned recipes
clean_w2vs = get_recipe_vectors(df,clean_mod, zeros=True, clean=True) #get recipe vectors
clean_w2vs.shape #check shape of matrix

(39774, 300)

In [16]:
dirt_mod = get_model(df, clean=False) #get model vectors for un-cleaned recipes
dirt_w2vs = get_recipe_vectors(df,dirt_mod, zeros=True, clean=False) #get recipe vectors
dirt_w2vs.shape #check shape of matrix

(39774, 300)

## SpaCy

In [17]:
nlp = spacy.load('en_core_web_lg') #load spacy model

In [18]:
%time dirt_spacy = np.array([nlp(recipe).vector for recipe in df['ing_join']]) #get pre-trained GloVe vectors 
dirt_spacy.shape #check matrix shape

Wall time: 7min 40s


(39774, 300)

In [19]:
%time clean_spacy = np.array([nlp(recipe).vector for recipe in df['ing_clean']]) #get pre-trained GloVe vectors 
clean_spacy.shape #check matrix shape

Wall time: 7min 37s


(39774, 300)

In terms of speed, the spaCy vectorization is already pretty slow: at roughly seven and a half minutes per pass, it is the only vectorization step to take more than a minute.

# Find Best Model/Data Combination

## Initialize Models
Test a few different classification models, with a few different hyperparameter settings. If any model looks particularly promising, further hyperparameter tuning can be performed.

In [50]:
knn5 = KNeighborsClassifier()
knn10 = KNeighborsClassifier(n_neighbors=10)
svc1 = LinearSVC(max_iter=2500)
svc05 = LinearSVC(C=0.5, max_iter=2500)
logreg1 = LogisticRegression(max_iter=1500)
logreg05 = LogisticRegression(C=0.5,max_iter=1500)
etree200 = ExtraTreesClassifier(n_estimators=200)
rfor200 = RandomForestClassifier(n_estimators=200)
bnb = BernoulliNB()

## Create Dictionaries for Models and Data

In [51]:
models = {'KNN-5': knn5, 'KNN-10': knn10, 'SVC-1': svc1, 'SVC-0.5': svc05, 'LogReg-1': logreg1, 'LogReg-0.5': logreg05,
          'ExTrees': etree200, 'RandomForest': rfor200, 'BernoulliNB': bnb}

In [52]:
data = {'CleanCountVec': clean_cmat, 'DirtyCountVec': dirt_cmat, 'CleanTfIdf': clean_mat, 'DirtyTfIdf': dirt_mat, 
        'CleanW2V':clean_w2vs, 'DirtyW2V': dirt_w2vs, 'CleanSpaCy': clean_spacy, 'DirtySpaCy': dirt_spacy}

In [53]:
y = df['cuisine']

## Define Function for Analysis
This will return two dataframes: for each combination of model and data formulation, one stores the mean accuracy from 5-fold cross-validation and the other stores the time that cross-validation took. Since time is merely an issue of user experience here, and we aren't trying to squeeze out every iota of performance, wall-clock time suffices.

In [55]:
def approach_analyzer(models, data, y, pickle=True, verbose=False, accname='', timename=''):
    
    '''
    accepts dictionaries of models and data, along with classification labels, flags and filenames if writing to pickle
    cross validates all models using all data formats, storing accuracy scores and runtimes
    returns dataframes with the accuracy scores and runtimes of all model-data combinations
    '''
    
    acc_dict = {} #initialize dictionaries to hold accuracy and time data
    time_dict = {}
    for model in models: #iterate through model dictionaries
        
        mod = models[model] #get model
        
        mod_acc_dict = {} #initialize accuracy and time dictionaries for selected model
        mod_time_dict = {}
        
        for datum in data: #iterate through each data formulation
            
            dat = data[datum] #get vectors
            start = time.time() #timestamp
            x = cross_val_score(mod, dat, y, cv=5, scoring='accuracy').mean() #cross-validate
            mod_acc_dict[datum] = x #store mean cv accuracy score
            
            elapse = time.time() - start #get elapsed time
            
            mod_time_dict[datum] = round(elapse,1) #store elapsed time
            
            if verbose: #print result of cross-validation
                print(f'{model} with {datum} data has acc score of {x} and took {round(elapse/60,1)} minutes to cross-validate')
        
        acc_dict[model] = mod_acc_dict #insert model dictionaries into general dictionaries
        time_dict[model] = mod_time_dict
    
    df_acc = pd.DataFrame(acc_dict) #create dataframes from dictionaries
    df_time = pd.DataFrame(time_dict)
    
    if pickle: #optionally store results in pickles due to time-intensive nature of function
        df_acc.to_pickle(accname)
        df_time.to_pickle(timename)
    
    return df_acc,df_time #return dataframes
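
As an aside, scikit-learn's cross_validate reports per-fold fit and score times alongside the scores, which is an alternative to timing the call by hand. The sketch below uses synthetic data from make_classification (an assumption for illustration) rather than the recipe vectors.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the recipe vectors and cuisine labels
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

res = cross_validate(LogisticRegression(max_iter=1500), X_toy, y_toy,
                     cv=5, scoring='accuracy')

mean_acc = res['test_score'].mean()                           # mean accuracy over the 5 folds
total_time = res['fit_time'].sum() + res['score_time'].sum()  # seconds spent fitting and scoring

print(round(mean_acc, 3), round(total_time, 3))
```

Splitting fit time from score time matters for kNN in particular, since nearly all of its cost is paid at prediction time.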

Since this step can be very time-consuming, the results are stored in a pickle.

In [56]:
if os.path.isfile('accuracy1.pickle') and os.path.isfile('time1.pickle'):
    accuracy_df = pd.read_pickle('accuracy1.pickle')
    time_df = pd.read_pickle('time1.pickle')
    
else:
    accuracy_df, time_df = approach_analyzer(models,data,y,True,True,accname='accuracy1.pickle',timename='time1.pickle')

KNN-5 with CleanCountVec data has acc score of 0.6379797822393913 and took 1.6 minutes to cross-validate
KNN-5 with DirtyCountVec data has acc score of 0.6356918276317612 and took 1.4 minutes to cross-validate
KNN-5 with CleanTfIdf data has acc score of 0.7323630074689363 and took 1.4 minutes to cross-validate
KNN-5 with DirtyTfIdf data has acc score of 0.72999967285177 and took 1.4 minutes to cross-validate
KNN-5 with CleanW2V data has acc score of 0.6970635870270396 and took 16.8 minutes to cross-validate
KNN-5 with DirtyW2V data has acc score of 0.6020014043667493 and took 4.8 minutes to cross-validate
KNN-5 with CleanSpaCy data has acc score of 0.701664495424429 and took 21.2 minutes to cross-validate
KNN-5 with DirtySpaCy data has acc score of 0.6979184585407576 and took 20.9 minutes to cross-validate
KNN-10 with CleanCountVec data has acc score of 0.6470056596643776 and took 1.5 minutes to cross-validate
KNN-10 with DirtyCountVec data has acc score of 0.6445165452451534 and took 

SVC-1 with CleanCountVec data has acc score of 0.7735204895148993 and took 2.5 minutes to cross-validate
SVC-1 with DirtyCountVec data has acc score of 0.7748278433803926 and took 2.5 minutes to cross-validate
SVC-1 with CleanTfIdf data has acc score of 0.7885806239427937 and took 0.2 minutes to cross-validate
SVC-1 with DirtyTfIdf data has acc score of 0.789837698128159 and took 0.2 minutes to cross-validate
SVC-1 with CleanW2V data has acc score of 0.7056368999180865 and took 19.0 minutes to cross-validate
SVC-1 with DirtyW2V data has acc score of 0.6277468542801181 and took 19.6 minutes to cross-validate
SVC-1 with CleanSpaCy data has acc score of 0.7624829886871509 and took 12.7 minutes to cross-validate
SVC-1 with DirtySpaCy data has acc score of 0.7610247641727488 and took 12.6 minutes to cross-validate
SVC-0.5 with CleanCountVec data has acc score of 0.778775182314019 and took 1.4 minutes to cross-validate
SVC-0.5 with DirtyCountVec data has acc score of 0.7792025706580912 and took 1.4 minutes to cross-validate
SVC-0.5 with CleanTfIdf data has acc score of 0.7900639740734238 and to

In [57]:
accuracy_df #check the accuracy scores

|  | KNN-5 | KNN-10 | SVC-1 | SVC-0.5 | LogReg-1 | LogReg-0.5 | ExTrees | RandomForest | BernoulliNB |
|---|---|---|---|---|---|---|---|---|---|
| CleanCountVec | 0.63798 | 0.647006 | 0.77352 | 0.778775 | 0.782647 | 0.78315 | 0.777895 | 0.760698 | 0.716926 |
| DirtyCountVec | 0.635692 | 0.644517 | 0.774828 | 0.779203 | 0.783401 | 0.783552 | 0.77782 | 0.759466 | 0.715819 |
| CleanTfIdf | 0.732363 | 0.742948 | 0.788581 | 0.790064 | 0.782421 | 0.768517 | 0.773822 | 0.75049 | 0.716926 |
| DirtyTfIdf | 0.73 | 0.741062 | 0.789838 | 0.791095 | 0.782546 | 0.769246 | 0.773219 | 0.751622 | 0.715819 |
| CleanW2V | 0.697064 | 0.703148 | 0.705637 | 0.702117 | 0.710892 | 0.708604 | 0.696636 | 0.698421 | 0.509378 |
| DirtyW2V | 0.602001 | 0.617866 | 0.627747 | 0.618696 | 0.630387 | 0.622542 | 0.65309 | 0.651908 | 0.483758 |
| CleanSpaCy | 0.701664 | 0.706969 | 0.762483 | 0.7611 | 0.758737 | 0.752099 | 0.676372 | 0.682029 | 0.47433 |
| DirtySpaCy | 0.697918 | 0.701614 | 0.761025 | 0.759567 | 0.758058 | 0.750742 | 0.671142 | 0.678785 | 0.473148 |


In [58]:
time_df #check the time scores

| time (s) | KNN-5 | KNN-10 | SVC-1 | SVC-0.5 | LogReg-1 | LogReg-0.5 | ExTrees | RandomForest | BernoulliNB |
|---|---|---|---|---|---|---|---|---|---|
| CleanCountVec | 95.1 | 89.1 | 151.1 | 83.2 | 122.6 | 100.4 | 886.6 | 621.5 | 2.4 |
| DirtyCountVec | 83.0 | 82.5 | 149.4 | 85.2 | 130.7 | 89.0 | 870.8 | 680.9 | 3.2 |
| CleanTfIdf | 82.5 | 83.3 | 11.6 | 9.3 | 72.1 | 51.6 | 833.8 | 635.5 | 2.4 |
| DirtyTfIdf | 84.2 | 84.1 | 11.5 | 9.0 | 69.9 | 51.4 | 829.0 | 626.2 | 2.1 |
| CleanW2V | 1006.0 | 971.4 | 1139.9 | 586.9 | 194.3 | 156.4 | 203.9 | 1088.5 | 5.3 |
| DirtyW2V | 290.5 | 358.9 | 1176.5 | 615.6 | 226.3 | 165.1 | 207.2 | 1121.2 | 4.4 |
| CleanSpaCy | 1272.7 | 1204.3 | 763.7 | 423.5 | 469.1 | 388.5 | 211.1 | 935.2 | 4.1 |
| DirtySpaCy | 1256.7 | 1140.8 | 758.3 | 420.2 | 411.9 | 321.1 | 210.3 | 986.3 | 3.7 |


# Attempt with Smaller Datasets
Again, since a faster model improves the user experience, the cross-validation step will be attempted with a smaller dataset, to see what kind of accuracy tradeoff is made for the faster time. First, we'll define a function to get the data vectors for the specified fraction of the dataset.

In [60]:
def get_vec_dict(df, nlp, s_frac):
    
    '''
    accepts the recipe dataframe, the SpaCy model and desired sample fraction as arguments
    gets dataframe of desired sample size, and vectorizes the recipe ingredients as shown above
    returns dictionary of data vectors, along with cuisine labels
    '''
    
    samp_df = df.sample(frac=s_frac)
    
    y = samp_df['cuisine']
    
    clean_cmat = CountVectorizer(min_df=2).fit_transform(samp_df['ing_clean'])
 
    dirt_cmat = CountVectorizer(min_df=2).fit_transform(samp_df['ing_join'])
 
    clean_mat = TfidfVectorizer(min_df=2).fit_transform(samp_df['ing_clean'])

    dirt_mat = TfidfVectorizer(min_df=2).fit_transform(samp_df['ing_join'])
    
    clean_mod = get_model(samp_df, clean=True) 
    clean_w2vs = get_recipe_vectors(samp_df,clean_mod, zeros=True, clean=True)
    
    dirt_mod = get_model(samp_df, clean=False)
    dirt_w2vs = get_recipe_vectors(samp_df,dirt_mod, zeros=True, clean=False)
    
    dirt_spacy = np.array([nlp(recipe).vector for recipe in samp_df['ing_join']])
    
    clean_spacy = np.array([nlp(recipe).vector for recipe in samp_df['ing_clean']])
    
    ret_dict = {'CleanCountVec': clean_cmat, 'DirtyCountVec': dirt_cmat, 'CleanTfIdf': clean_mat, 'DirtyTfIdf': dirt_mat, 
                'CleanW2V':clean_w2vs, 'DirtyW2V': dirt_w2vs, 'CleanSpaCy': clean_spacy, 'DirtySpaCy': dirt_spacy}
    
    return ret_dict,y
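
The function above draws a plain random sample with no fixed seed, so each run sees different recipes and rare cuisines can end up with very few examples. A possible alternative, shown here only as a sketch on a made-up dataframe, is a seeded per-cuisine sample via pandas' groupby(...).sample (available in pandas 1.1 and later); this is not what the notebook's results were produced with.

```python
import pandas as pd

# Hypothetical dataframe with an imbalanced cuisine column
toy = pd.DataFrame({'cuisine': ['italian'] * 100 + ['thai'] * 20,
                    'ing_join': ['placeholder'] * 120})

# Sample 10% of each cuisine, with a fixed seed for reproducibility
samp = toy.groupby('cuisine').sample(frac=0.1, random_state=0)
counts = samp['cuisine'].value_counts().to_dict()

print(counts)
```

Sampling within each group keeps the class proportions of the sample the same as in the full data.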

## Repeat Experiment with 10% of Dataset

In [71]:
%time data01,y01 = get_vec_dict(df, nlp, 0.1)

Wall time: 1min 40s


In [72]:
if os.path.isfile('accuracy01.pickle') and os.path.isfile('time01.pickle'):
    accuracy_df01 = pd.read_pickle('accuracy01.pickle')
    time_df01 = pd.read_pickle('time01.pickle')
    
else:
    accuracy_df01, time_df01 = approach_analyzer(models,data01,y01,True,True,
                                                 accname='accuracy01.pickle',timename='time01.pickle')

KNN-5 with CleanCountVec data has acc score of 0.5119446288043994 and took 0.0 minutes to cross-validate
KNN-5 with DirtyCountVec data has acc score of 0.5101848866976393 and took 0.0 minutes to cross-validate
KNN-5 with CleanTfIdf data has acc score of 0.6590458582219273 and took 0.0 minutes to cross-validate
KNN-5 with DirtyTfIdf data has acc score of 0.6580411491419361 and took 0.0 minutes to cross-validate
KNN-5 with CleanW2V data has acc score of 0.40683259062608645 and took 0.1 minutes to cross-validate
KNN-5 with DirtyW2V data has acc score of 0.2763326064283682 and took 0.0 minutes to cross-validate
KNN-5 with CleanSpaCy data has acc score of 0.5986928352454094 and took 0.2 minutes to cross-validate
KNN-5 with DirtySpaCy data has acc score of 0.5939161214879428 and took 0.2 minutes to cross-validate
KNN-10 with CleanCountVec data has acc score of 0.5240156758635947 and took 0.0 minutes to cross-validate
KNN-10 with DirtyCountVec data has acc score of 0.5320615656900857 and took

In [73]:
accuracy_df01

|  | KNN-5 | KNN-10 | SVC-1 | SVC-0.5 | LogReg-1 | LogReg-0.5 | ExTrees | RandomForest | BernoulliNB |
|---|---|---|---|---|---|---|---|---|---|
| CleanCountVec | 0.511945 | 0.524016 | 0.672879 | 0.689473 | 0.703806 | 0.702297 | 0.68947 | 0.667843 | 0.632137 |
| DirtyCountVec | 0.510185 | 0.532062 | 0.674137 | 0.689223 | 0.704058 | 0.70431 | 0.689972 | 0.666336 | 0.632891 |
| CleanTfIdf | 0.659046 | 0.671618 | 0.724424 | 0.727441 | 0.682181 | 0.636417 | 0.689215 | 0.656526 | 0.632137 |
| DirtyTfIdf | 0.658041 | 0.673882 | 0.723921 | 0.72543 | 0.682434 | 0.634656 | 0.685444 | 0.658037 | 0.632891 |
| CleanW2V | 0.406833 | 0.433493 | 0.470453 | 0.459893 | 0.455873 | 0.441288 | 0.473222 | 0.491581 | 0.300728 |
| DirtyW2V | 0.276333 | 0.284635 | 0.263264 | 0.232841 | 0.212975 | 0.210711 | 0.41312 | 0.429468 | 0.283635 |
| CleanSpaCy | 0.598693 | 0.612525 | 0.711846 | 0.711595 | 0.684943 | 0.67564 | 0.60347 | 0.614535 | 0.474976 |
| DirtySpaCy | 0.593916 | 0.602968 | 0.709082 | 0.710338 | 0.684692 | 0.670862 | 0.602715 | 0.609505 | 0.464419 |


In [74]:
time_df01

| time (s) | KNN-5 | KNN-10 | SVC-1 | SVC-0.5 | LogReg-1 | LogReg-0.5 | ExTrees | RandomForest | BernoulliNB |
|---|---|---|---|---|---|---|---|---|---|
| CleanCountVec | 1.1 | 1.2 | 3.0 | 1.8 | 5.9 | 4.8 | 29.9 | 23.5 | 0.1 |
| DirtyCountVec | 1.0 | 1.2 | 2.7 | 1.9 | 5.8 | 4.8 | 32.2 | 25.0 | 0.1 |
| CleanTfIdf | 1.0 | 1.1 | 0.6 | 0.5 | 3.9 | 3.1 | 32.5 | 23.7 | 0.1 |
| DirtyTfIdf | 1.0 | 1.2 | 0.6 | 0.5 | 3.9 | 3.0 | 30.9 | 25.4 | 0.1 |
| CleanW2V | 3.9 | 4.7 | 87.7 | 40.5 | 9.7 | 8.2 | 20.6 | 64.5 | 0.3 |
| DirtyW2V | 2.0 | 2.2 | 111.3 | 48.4 | 5.3 | 5.1 | 22.4 | 66.8 | 0.3 |
| CleanSpaCy | 11.7 | 11.5 | 50.3 | 29.4 | 19.9 | 20.5 | 20.1 | 57.1 | 0.3 |
| DirtySpaCy | 11.7 | 11.6 | 49.8 | 28.8 | 18.7 | 18.1 | 19.4 | 58.7 | 0.3 |


## Repeat Experiment with 1% of Dataset

In [76]:
%time data001,y001 = get_vec_dict(df, nlp, 0.01)

Wall time: 10.1 s


In [77]:
if os.path.isfile('accuracy001.pickle') and os.path.isfile('time001.pickle'):
    accuracy_df001 = pd.read_pickle('accuracy001.pickle')
    time_df001 = pd.read_pickle('time001.pickle')
    
else:
    accuracy_df001, time_df001 = approach_analyzer(models,data001,y001,True,True,
                                                   accname='accuracy001.pickle',timename='time001.pickle')

KNN-5 with CleanCountVec data has acc score of 0.37443037974683546 and took 0.0 minutes to cross-validate
KNN-5 with DirtyCountVec data has acc score of 0.3543670886075949 and took 0.0 minutes to cross-validate
KNN-5 with CleanTfIdf data has acc score of 0.5279746835443038 and took 0.0 minutes to cross-validate
KNN-5 with DirtyTfIdf data has acc score of 0.5355379746835444 and took 0.0 minutes to cross-validate
KNN-5 with CleanW2V data has acc score of 0.11813291139240507 and took 0.0 minutes to cross-validate
KNN-5 with DirtyW2V data has acc score of 0.29905063291139244 and took 0.0 minutes to cross-validate
KNN-5 with CleanSpaCy data has acc score of 0.4800316455696202 and took 0.0 minutes to cross-validate
KNN-5 with DirtySpaCy data has acc score of 0.4925316455696202 and took 0.0 minutes to cross-validate
KNN-10 with CleanCountVec data has acc score of 0.3894936708860759 and took 0.0 minutes to cross-validate
KNN-10 with DirtyCountVec data has acc score of 0.39958860759493675 and took 0.0 minutes to cross-validate
KNN-10 with CleanTfIdf data has acc score of 0.5579746835443038 and took 0.0 minutes to cross-validate
KNN-10 with DirtyTfIdf data has acc score of 0.5504430379746834 and took 0.0 minutes to cross-validate
KNN-10 with CleanW2V data has acc score of 0.10813291139240506 and took 0.0 minutes to cross-validate
KNN-10 with DirtyW2V data has acc score of 0.309240506329114 and took 0.0 minutes to cross-validate
KNN-10 with CleanSpaCy data has acc score of 0.507626582278481 and took 0.0 minutes to cross-validate
KNN-10 with DirtySpaCy data has acc score of 0.5378481012658227 and took 0.0 minutes to cross-validate
SVC-1 with CleanCountVec data has acc score of 0.5203797468354431 and took 0.0 minutes to cross-validate
SVC-1 with DirtyCountVec data has acc score of 0.5253797468354431 and took 0.0 minutes to cross-validate
SVC-1 with CleanTfIdf data has acc score of 0.5780379746835445 and took 0.0 minutes to cross-validate
SVC-1 with DirtyTfIdf data has acc score of 0.5756012658227848 and took 0.0 minutes to cross-validate
SVC-1 with CleanW2V data has acc score of 0.20604430379746835 and took 0.0 minutes to cross-validate
SVC-1 with DirtyW2V data has acc score of 0.19851265822784808 and took 0.0 minutes to cross-validate
SVC-1 with CleanSpaCy data has acc score of 0.5829113924050633 and took 0.1 minutes to cross-validate
SVC-1 with DirtySpaCy data has acc score of 0.5779430379746835 and took 0.1 minutes to cross-validate
SVC-0.5 with CleanCountVec data has acc score of 0.527879746835443 and took 0.0 minutes to cross-validate
SVC-0.5 with DirtyCountVec data has acc score of 0.530379746835443 and took 0.0 minutes to cross-validate
SVC-0.5 with CleanTfIdf data has acc score of 0.5779746835443038 and took 0.0 minutes to cross-validate
SVC-0.5 with DirtyTfIdf data has acc score of 0.5805063291139241 and took 0.0 minutes to cross-validate
SVC-0.5 with CleanW2V data has acc score of 0.20354430379746832 and took 0.0 minutes to cross-validate
SVC-0.5 with DirtyW2V data has acc score of 0.19851265822784808 and took 0.0 minutes to cross-validate
SVC-0.5 with CleanSpaCy data has acc score of 0.5904430379746836 and took 0.0 minutes to cross-validate
SVC-0.5 with DirtySpaCy data has acc score of 0.5905696202531646 and took 0.0 minutes to cross-validate
LogReg-1 with CleanCountVec data has acc score of 0.5379113924050632 and took 0.0 minutes to cross-validate
LogReg-1 with DirtyCountVec data has acc score of 0.535379746835443 and took 0.0 minutes to cross-validate
LogReg-1 with CleanTfIdf data has acc score of 0.515126582278481 and took 0.0 minutes to cross-validate
LogReg-1 with DirtyTfIdf data has acc score of 0.5050949367088607 and took 0.0 minutes to cross-validate
LogReg-1 with CleanW2V data has acc score of 0.20104430379746838 and took 0.0 minutes to cross-validate
LogReg-1 with DirtyW2V data has acc score of 0.19851265822784808 and took 0.0 minutes to cross-validate
LogReg-1 with CleanSpaCy data has acc score of 0.547879746835443 and took 0.0 minutes to cross-validate
LogReg-1 with DirtySpaCy data has acc score of 0.5554430379746835 and took 0.0 minutes to cross-validate
LogReg-0.5 with CleanCountVec data has acc score of 0.5454430379746835 and took 0.0 minutes to cross-validate
LogReg-0.5 with DirtyCountVec data has acc score of 0.5329113924050632 and took 0.0 minutes to cross-validate
LogReg-0.5 with CleanTfIdf data has acc score of 0.46746835443037976 and took 0.0 minutes to cross-validate
LogReg-0.5 with DirtyTfIdf data has acc score of 0.4649683544303797 and took 0.0 minutes to cross-validate
LogReg-0.5 with CleanW2V data has acc score of 0.19854430379746835 and took 0.0 minutes to cross-validate
LogReg-0.5 with DirtyW2V data has acc score of 0.19851265822784808 and took 0.0 minutes to cross-validate
LogReg-0.5 with CleanSpaCy data has acc score of 0.5201898734177215 and took 0.0 minutes to cross-validate
LogReg-0.5 with DirtySpaCy data has acc score of 0.5177215189873419 and took 0.0 minutes to cross-validate
ExTrees with CleanCountVec data has acc score of 0.5905379746835442 and took 0.0 minutes to cross-validate
ExTrees with DirtyCountVec data has acc score of 0.5905379746835442 and took 0.0 minutes to cross-validate
ExTrees with CleanTfIdf data has acc score of 0.5704746835443039 and took 0.0 minutes to cross-validate
ExTrees with DirtyTfIdf data has acc score of 0.5779746835443038 and took 0.0 minutes to cross-validate
ExTrees with CleanW2V data has acc score of 0.18348101265822786 and took 0.1 minutes to cross-validate
ExTrees with DirtyW2V data has acc score of 0.34436708860759496 and took 0.0 minutes to cross-validate
ExTrees with CleanSpaCy data has acc score of 0.5252848101265822 and took 0.0 minutes to cross-validate
ExTrees with DirtySpaCy data has acc score of 0.5253481012658228 and took 0.1 minutes to cross-validate
RandomForest with CleanCountVec data has acc score of 0.5830696202531646 and took 0.1 minutes to cross-validate
RandomForest with DirtyCountVec data has acc score of 0.5579113924050633 and took 0.1 minutes to cross-validate
RandomForest with CleanTfIdf data has acc score of 0.5654430379746835 and took 0.1 minutes to cross-validate
RandomForest with DirtyTfIdf data has acc score of 0.547879746835443 and took 0.0 minutes to cross-validate
RandomForest with CleanW2V data has acc score of 0.20860759493670886 and took 0.1 minutes to cross-validate
RandomForest with DirtyW2V data has acc score of 0.3543670886075949 and took 0.1 minutes to cross-validate
RandomForest with CleanSpaCy data has acc score of 0.5277848101265823 and took 0.1 minutes to cross-validate
RandomForest with DirtySpaCy data has acc score of 0.5202531645569619 and took 0.1 minutes to cross-validate
BernoulliNB with CleanCountVec data has acc score of 0.5151582278481012 and took 0.0 minutes to cross-validate
BernoulliNB with DirtyCountVec data has acc score of 0.5101582278481013 and took 0.0 minutes to cross-validate
BernoulliNB with CleanTfIdf data has acc score of 0.5151582278481012 and took 0.0 minutes to cross-validate
BernoulliNB with DirtyTfIdf data has acc score of 0.5101582278481013 and took 0.0 minutes to cross-validate
BernoulliNB with CleanW2V data has acc score of 0.19851265822784808 and took 0.0 minutes to cross-validate
BernoulliNB with DirtyW2V data has acc score of 0.3141772151898734 and took 0.0 minutes to cross-validate
BernoulliNB with CleanSpaCy data has acc score of 0.5126265822784811 and took 0.0 minutes to cross-validate
BernoulliNB with DirtySpaCy data has acc score of 0.5025316455696203 and took 0.0 minutes to cross-validate

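As expected, the small training set size caused some problems with fitting. Several model and data combinations on the W2V vectors land on an identical score of about 0.1985, which is consistent with the model predicting a single cuisine for every recipe.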
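
One way to put low scores like these in context is a majority-class baseline. The sketch below uses made-up labels (not the recipe data) and scikit-learn's DummyClassifier, which ignores the features and always predicts the most frequent class, so its cross-validated accuracy equals the share of that class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels: class 'a' makes up 30% of the samples
y_toy = np.array(['a'] * 30 + ['b'] * 25 + ['c'] * 25 + ['d'] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

baseline = cross_val_score(DummyClassifier(strategy='most_frequent'),
                           X_toy, y_toy, cv=5, scoring='accuracy').mean()

print(baseline)
```

Any model whose accuracy sits at the baseline has effectively learned nothing from the features.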

In [78]:
accuracy_df001

|  | KNN-5 | KNN-10 | SVC-1 | SVC-0.5 | LogReg-1 | LogReg-0.5 | ExTrees | RandomForest | BernoulliNB |
|---|---|---|---|---|---|---|---|---|---|
| CleanCountVec | 0.37443 | 0.389494 | 0.52038 | 0.52788 | 0.537911 | 0.545443 | 0.590538 | 0.58307 | 0.515158 |
| DirtyCountVec | 0.354367 | 0.399589 | 0.52538 | 0.53038 | 0.53538 | 0.532911 | 0.590538 | 0.557911 | 0.510158 |
| CleanTfIdf | 0.527975 | 0.557975 | 0.578038 | 0.577975 | 0.515127 | 0.467468 | 0.570475 | 0.565443 | 0.515158 |
| DirtyTfIdf | 0.535538 | 0.550443 | 0.575601 | 0.580506 | 0.505095 | 0.464968 | 0.577975 | 0.54788 | 0.510158 |
| CleanW2V | 0.118133 | 0.108133 | 0.206044 | 0.203544 | 0.201044 | 0.198544 | 0.183481 | 0.208608 | 0.198513 |
| DirtyW2V | 0.299051 | 0.309241 | 0.198513 | 0.198513 | 0.198513 | 0.198513 | 0.344367 | 0.354367 | 0.314177 |
| CleanSpaCy | 0.480032 | 0.507627 | 0.582911 | 0.590443 | 0.54788 | 0.52019 | 0.525285 | 0.527785 | 0.512627 |
| DirtySpaCy | 0.492532 | 0.537848 | 0.577943 | 0.59057 | 0.555443 | 0.517722 | 0.525348 | 0.520253 | 0.502532 |


In [79]:
time_df001

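| time (s) | KNN-5 | KNN-10 | SVC-1 | SVC-0.5 | LogReg-1 | LogReg-0.5 | ExTrees | RandomForest | BernoulliNB |
|---|---|---|---|---|---|---|---|---|---|
| CleanCountVec | 0.1 | 0.0 | 0.1 | 0.1 | 1.2 | 1.4 | 2.7 | 3.2 | 0.0 |
| DirtyCountVec | 0.0 | 0.0 | 0.1 | 0.1 | 1.3 | 1.1 | 2.7 | 3.0 | 0.0 |
| CleanTfIdf | 0.0 | 0.0 | 0.1 | 0.1 | 0.6 | 0.5 | 2.8 | 3.3 | 0.0 |
| DirtyTfIdf | 0.1 | 0.1 | 0.1 | 0.1 | 0.6 | 0.5 | 2.8 | 3.0 | 0.0 |
| CleanW2V | 0.1 | 0.1 | 1.7 | 0.9 | 0.7 | 0.4 | 3.1 | 6.6 | 0.0 |
| DirtyW2V | 0.2 | 0.3 | 1.1 | 0.6 | 0.2 | 0.3 | 2.8 | 5.9 | 0.1 |
| CleanSpaCy | 0.2 | 0.3 | 4.4 | 2.5 | 2.9 | 2.7 | 3.0 | 5.8 | 0.0 |
| DirtySpaCy | 0.2 | 0.2 | 3.9 | 2.5 | 2.9 | 2.1 | 3.1 | 5.9 | 0.0 |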


# Tuning to Find Best Model
The best-scoring model was the linear SVM (LinearSVC), which reached about 0.79 accuracy on the TF-IDF data. In the initial runs C=0.5 scored slightly higher than C=1, so it looks like C < 1 would be the ideal regularization parameter, but we can do a search to confirm.

In [87]:
model = LinearSVC(max_iter=2500)
params = {'loss': ['hinge','squared_hinge'],
          'C': [0.25,0.5,0.75,1,1.25,1.5]}
grid_search = GridSearchCV(model,params,scoring='accuracy') #scoring is passed by keyword, which newer scikit-learn versions require

In [88]:
grid_search.fit(dirt_mat,y)
grid_search.best_estimator_



GridSearchCV(cv=None, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=2500,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.25, 0.5, 0.75, 1, 1.25, 1.5],
                         'loss': ['hinge', 'squared_hinge']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
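
The printed GridSearchCV object does not show which parameters won. As a minimal sketch on synthetic data (make_classification stands in for the TF-IDF matrix, and the grid is shortened), the winning setting and the per-candidate scores can be read from best_params_, best_score_ and cv_results_:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF matrix and cuisine labels
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

toy_params = {'C': [0.25, 0.5, 1.0]}
search = GridSearchCV(LinearSVC(max_iter=2500), toy_params, scoring='accuracy', cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
# Mean cross-validated accuracy for every candidate value of C
for c, score in zip(search.cv_results_['param_C'], search.cv_results_['mean_test_score']):
    print(c, round(score, 3))
```

Looking at all candidate scores, rather than only the winner, shows how flat the accuracy is around the optimum before the grid is refined further.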

Increase the resolution of the search over C in the 0.3 to 0.7 range.

In [91]:
params2 = {'C':[0.3,0.4,0.5,0.6,0.7]}
grid_search2 = GridSearchCV(model,params2,scoring='accuracy')
grid_search2.fit(dirt_mat,y)
grid_search2.best_estimator_

LinearSVC(C=0.5, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=2500,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

One last grid search, narrowing in around C = 0.5.

In [131]:
model = LinearSVC(max_iter=2500)
params3 = {'C':[0.43,0.47,0.5,0.53,0.57]}
grid_search3 = GridSearchCV(model,params3,scoring='accuracy')
grid_search3.fit(dirt_mat,y)
bestsvc = grid_search3.best_estimator_

In [132]:
cross_val_score(bestsvc, dirt_mat, y, cv=5, scoring='accuracy').mean()

0.7912707717395134

Tuning for Logistic Regression

In [114]:
model = LogisticRegression(max_iter=1500)
params = {'C': [1,2,3,4,5,6,7,8]}
grid_search = GridSearchCV(model,params,scoring='accuracy')
grid_search.fit(dirt_mat,y)
bestlogreg = grid_search.best_estimator_

In [98]:
cross_val_score(bestlogreg, dirt_mat, y, cv=5, scoring='accuracy').mean()

0.791295928964266

In [115]:
model = KNeighborsClassifier()
params = {'n_neighbors': [11,12,13,14,15,16,17,18,19]}
grid_search = GridSearchCV(model,params,scoring='accuracy')
%time grid_search.fit(dirt_mat,y)
bestknn = grid_search.best_estimator_

Wall time: 12min 36s


In [133]:
cross_val_score(bestknn, dirt_mat, y, cv=5, scoring='accuracy').mean()

0.744632156584838

# Ensemble Classifier
Now we'll try to see if using a voting classifier with some of our better models will increase the accuracy score.

Since LinearSVC doesn't have a predict_proba() function, we'll need to wrap it in a CalibratedClassifierCV.

In [116]:
bestsvc = CalibratedClassifierCV(LinearSVC(C=0.53, max_iter=2500), method='sigmoid') #estimator passed positionally, since its keyword name differs between scikit-learn versions
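
To show what the wrapper adds, here is a minimal sketch on synthetic data (an assumption for illustration, with cv=3 chosen arbitrarily). The bare LinearSVC has no predict_proba, while the calibrated wrapper returns one probability per class.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF matrix and cuisine labels
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

svc = LinearSVC(max_iter=2500)
has_proba_before = hasattr(svc, 'predict_proba')  # LinearSVC alone only has decision_function

calibrated = CalibratedClassifierCV(svc, method='sigmoid', cv=3).fit(X_toy, y_toy)
proba = calibrated.predict_proba(X_toy[:5])  # shape: (5 samples, 3 classes)

print(has_proba_before)
print(np.round(proba, 3))
```

The sigmoid method fits a logistic curve to the SVM's decision scores on held-out folds, so the probabilities are an approximation layered on top of the margin-based model.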

In [129]:
vote = VotingClassifier(estimators=[('svc',bestsvc),('logreg',bestlogreg),('knn',bestknn)],voting='soft')
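
With voting='soft', the ensemble averages the class probabilities of its members and picks the class with the highest mean. The arithmetic can be sketched with made-up probabilities for three models, two samples and two classes (equal weights assumed):

```python
import numpy as np

# Hypothetical predict_proba outputs; rows are samples, columns are classes
p_svc = np.array([[0.60, 0.40], [0.30, 0.70]])
p_logreg = np.array([[0.20, 0.80], [0.40, 0.60]])
p_knn = np.array([[0.55, 0.45], [0.90, 0.10]])

avg = np.mean([p_svc, p_logreg, p_knn], axis=0)  # average the probabilities per class
soft_pred = avg.argmax(axis=1)                   # class with the highest mean probability

print(np.round(avg, 3))
print(soft_pred)
```

In this example a simple majority vote over each model's top class would pick the opposite class for both samples, since soft voting lets one confident model outweigh two lukewarm ones.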

In [130]:
%time cross_val_score(vote,dirt_mat,y,cv=5,scoring='accuracy').mean()

Wall time: 4min 5s


0.7977825197588838

Looks like the accuracy is marginally improved (about 0.798, versus about 0.791 for the best single models), but still not over the 0.8 hump.