# w207 Final Project
#### *Random Acts of Pizza Kaggle Competition Project* - by Marcus DeMaster and Chuqing He

Using a dataset of 4040 Reddit requests for pizza, we attempt to build a machine learning model that predicts whether a request results in the requester receiving a pizza.

We investigate several types of models that we import below from the sklearn library.

In [6]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import *

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

Below we use pandas to read the Kaggle training set and test set, which are provided as JSON files.

In [7]:
#Download training and test datasets.
pd_train = pd.read_json('https://raw.githubusercontent.com/mdemaster/w207_Final_Project/master/train.json', orient='columns')
pd_test = pd.read_json('https://raw.githubusercontent.com/mdemaster/w207_Final_Project/master/test.json', orient='columns')

#Move label field 'requester_received_pizza' to first field.
pizza = pd_train['requester_received_pizza']
pd_train.drop(labels=['requester_received_pizza'], axis=1,inplace = True)
pd_train.insert(0, 'requester_received_pizza', pizza)

#Create numpy arrays datasets
np_test = np.array(pd_test)
np_train = np.array(pd_train)

We discovered that the test set has only 17 fields, compared with the 32 present in the training set.  The test set also lacks the "requester_received_pizza" field, which is the field we are using for the label.

In [8]:
#Print illustration of field differences in training and test sets.
train_cols=str(len(pd_train.columns.values))
test_cols=str(len(pd_test.columns.values))

print 'Number of training set fields:',train_cols.ljust(31,' '),' Number of test set fields: ',test_cols.ljust(55,' ')
print '==============================================================================================================='
for i in range(len(pd_train.columns.values)):
    train=pd_train.columns.values[i]
    if i<len(pd_test.columns.values):
        test=pd_test.columns.values[i]
    else:
        test=''
    print train.ljust(55,' '), '       ', test

Number of training set fields: 32                               Number of test set fields:  17                                                     
requester_received_pizza                                        giver_username_if_known
giver_username_if_known                                         request_id
number_of_downvotes_of_request_at_retrieval                     request_text_edit_aware
number_of_upvotes_of_request_at_retrieval                       request_title
post_was_edited                                                 requester_account_age_in_days_at_request
request_id                                                      requester_days_since_first_post_on_raop_at_request
request_number_of_comments_at_retrieval                         requester_number_of_comments_at_request
request_text                                                    requester_number_of_comments_in_raop_at_request
request_text_edit_aware                                         requester_number_of_pos
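
Because the two column lists are in different orders, the side-by-side printout does not line up field by field. A set difference gives a more direct view. The sketch below uses only a handful of the field names visible in the output above as stand-in lists (`train_cols` and `test_cols` are illustrative, not the full schema); with the real data they would come from `pd_train.columns` and `pd_test.columns`.

```python
# Illustrative subsets of the column names printed above (not the full schema).
train_cols = [
    'requester_received_pizza',
    'giver_username_if_known',
    'number_of_downvotes_of_request_at_retrieval',
    'request_id',
    'request_text',
    'request_text_edit_aware',
]
test_cols = [
    'giver_username_if_known',
    'request_id',
    'request_text_edit_aware',
]

# Fields present in one set but missing from the other.
only_in_train = sorted(set(train_cols) - set(test_cols))
only_in_test = sorted(set(test_cols) - set(train_cols))

print(only_in_train)
print(only_in_test)
```

Listing `only_in_train` makes it easy to see which predictors would be unavailable when scoring the Kaggle test file.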

Because of these discrepancies in the given test data, we decided to split the Kaggle training data into a smaller training set, a dev set, and a test set.

In [80]:
#Split dataset into predictor data,X, and labels,Y.
X = np_train[:,1:]
Y = np_train[:,0]

shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

#Print data and label array shapes to confirm they are the same.
print 'data shape: ', X.shape
print 'label shape:', Y.shape

#Assign first half of data to training set, 3rd quarter of data to dev set, and 4th quarter of data to test set.
l=len(X)
train_data, train_labels = X[:l/2], np.where(Y[:l/2][:]==True,1,0)
dev_data, dev_labels = X[l/2:(3*l)/4], np.where(Y[l/2:(3*l)/4][:]==True,1,0)
test_data, test_labels = X[(3*l)/4:], np.where(Y[(3*l)/4:][:]==True,1,0)

print 'train data shape:', train_data.shape
print 'train label shape:', train_labels.shape
print 'dev data shape:', dev_data.shape
print 'dev label shape:', dev_labels.shape
print 'test data shape:', test_data.shape
print 'test label shape:', test_labels.shape


data shape:  (4040, 31)
label shape: (4040,)
train data shape: (2020, 31)
train label shape: (2020,)
dev data shape: (1010, 31)
dev label shape: (1010,)
test data shape: (1010, 31)
test label shape: (1010,)
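
The split above relies on Python 2 integer division (`l/2`) and an unseeded shuffle, so the row assignment changes on every run. A minimal sketch of the same half/quarter/quarter split, assuming a fixed seed (the value 0 is arbitrary) and a stand-in array in place of `X`:

```python
import numpy as np

# Stand-in for the 4040-row feature matrix; only the row count matters here.
n_rows = 4040
X_demo = np.arange(n_rows).reshape(n_rows, 1)

# Seeded shuffle so the same rows land in the same split on every run.
rng = np.random.RandomState(0)
order = rng.permutation(n_rows)
X_demo = X_demo[order]

# Floor division keeps the slice bounds integral under Python 2 and 3.
half, three_quarters = n_rows // 2, (3 * n_rows) // 4
train_part = X_demo[:half]
dev_part = X_demo[half:three_quarters]
test_part = X_demo[three_quarters:]

print(train_part.shape)
print(dev_part.shape)
print(test_part.shape)
```

With a fixed seed, accuracy differences between experiments reflect the model change rather than a different shuffle.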


First attempt (baseline approach): using CountVectorizer with Multinomial Naive Bayes and Logistic Regression.

In [10]:

categories = ['Didn\'t get pizza','Got Pizza']
mnb_clf = Pipeline([('vect', CountVectorizer()), 
                        ('mnclf', MultinomialNB())])
mnb_clf = mnb_clf.fit(train_data[:,7], train_labels)
pred = mnb_clf.predict(test_data[:,7])
acc = metrics.accuracy_score(test_labels,pred)
print('Naive Bayes Baseline:')
print('Pred sum(got pizza):',sum(pred))
print('Acutal sum(got pizza):',sum(test_labels))
print('accuracy:', acc)
print metrics.classification_report(test_labels, pred,
               target_names=categories)
print('')

log_clf = Pipeline([('vect', CountVectorizer()),
                     ('lgclf', LogisticRegression(C=100, tol=0.1))]);
log_clf = log_clf.fit(train_data[:,7], train_labels)
pred = log_clf.predict(test_data[:,7])
acc = metrics.accuracy_score(test_labels,pred)

print('Logistic Regression Baseline:')
print('Pred sum(got pizza):',sum(pred))
print('Acutal sum(got pizza):',sum(test_labels))
print('accuracy:', acc)
print metrics.classification_report(test_labels, pred,
               target_names=categories)
print('')


Naive Bayes Baseline:
('Pred sum(got pizza):', 18)
('Acutal sum(got pizza):', 260)
('accuracy:', 0.73465346534653464)
                  precision    recall  f1-score   support

Didn't get pizza       0.74      0.98      0.85       750
       Got Pizza       0.28      0.02      0.04       260

     avg / total       0.62      0.73      0.64      1010


Logistic Regression Baseline:
('Pred sum(got pizza):', 197)
('Acutal sum(got pizza):', 260)
('accuracy:', 0.68217821782178223)
                  precision    recall  f1-score   support

Didn't get pizza       0.76      0.83      0.79       750
       Got Pizza       0.35      0.26      0.30       260

     avg / total       0.66      0.68      0.67      1010




We used CountVectorizer to transform the request text into word counts and build a vocabulary, then used Multinomial Naive Bayes and Logistic Regression classifiers to predict the test data. Overall our accuracy was around 70% (0.73 and 0.68); however, the f1-score for the "Got Pizza" class is very low (0.04 and 0.30), so we knew we had to try other methods to improve our results.
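
To put those numbers in context, the short calculation below uses only figures copied from the classification reports above. It computes the accuracy of always predicting "Didn't get pizza" and re-derives the "Got Pizza" f1-scores from the reported precision and recall. Both baseline accuracies (0.73 and 0.68) sit below the majority-class rate. The helper name `f1` is ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures copied from the classification reports above.
n_test, n_negative = 1010, 750

# Accuracy of always predicting "Didn't get pizza".
majority_baseline = n_negative / float(n_test)

nb_f1 = f1(0.28, 0.02)    # Naive Bayes, "Got Pizza" class
lr_f1 = f1(0.35, 0.26)    # Logistic Regression, "Got Pizza" class

print(round(majority_baseline, 4))
print(round(nb_f1, 2))
print(round(lr_f1, 2))
```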

We decided to use a TF-IDF vectorizer (TfidfVectorizer) and fit a logistic regression model to its output. We also used a preprocessor to sanitize the text. This way we can identify the top positively and negatively weighted features (words and phrases). These weights will later be used to binarize the text for the GMM. To binarize the user input text, we check whether it contains any of the features in the feature array; for each feature it contains, we add that feature's weight to the weight sum. If the weight sum is greater than 0, we tag the user input with 1; otherwise we tag it with 0.

In [11]:
import re 

def better_preprocessor(s):
    repl = re.sub('&', ' and ', s)
    repl = repl.lower()
    repl = repl.replace('0',' zero ')
    repl = repl.replace('1',' one ')
    repl = repl.replace('2',' two ')
    repl = repl.replace('3',' three ')
    repl = repl.replace('4',' four ')
    repl = repl.replace('5',' five ')
    repl = repl.replace('6',' six ')
    repl = repl.replace('7',' seven ')
    repl = repl.replace('8',' eight ')
    repl = repl.replace('9',' nine ')
    repl = re.sub('[^a-z]+',' ', repl)
    return repl

# TF-IDF and logistic regression: get the top positive and top negative weights
tfidvec = TfidfVectorizer(preprocessor=better_preprocessor, ngram_range=(1,3),max_df=0.5, min_df=3);
train_X = tfidvec.fit_transform(train_data[:,7])
log_clf = LogisticRegression(C=100, tol=0.1)
log_clf = log_clf.fit(train_X, train_labels)
features = tfidvec.get_feature_names();
weights = log_clf.coef_

print 'positive weights:'
weight_indexes = []
positive_features = []
weight_index = weights[0].argsort()[-5:][::-1].tolist()
weight_indexes += (weight_index)    
for i in range(len(weight_indexes)):
    index = weight_indexes[i]
    positive_features.append((features[index], weights[0][index]))
#             print 'Feature Name:', features[index]
#             print weights[0][index]
#             print ''
print positive_features
print ''
print 'negative weights:'
weight_indexes = []
negative_features = []
weight_index = weights[0].argsort()[:5].tolist()
weight_indexes += (weight_index)    
for i in range(len(weight_indexes)):
    index = weight_indexes[i]
    negative_features.append((features[index], weights[0][index]))
#             print 'Feature Name:', features[index]
#             print weights[0][index]
#             print ''
print negative_features
weight_features = positive_features + negative_features
# print 'total features:'
# print weight_features
print ''

positive weights:
[(u'thanks', 4.681132086421667), (u'my paycheck', 4.4402567754003037), (u'major', 4.2705901237262918), (u'well', 4.2373923370888393), (u'last two', 4.1505433256400268)]

negative weights:
[(u'or', -4.7296506481157401), (u'free', -4.4018368412321358), (u'say', -3.9620912701616957), (u'to say', -3.9309081988072769), (u'friend', -3.9305629444589187)]



After examining the dataset fields, we see that there are a number of scalar fields with highly variable distributions.  In order to normalize this information and utilize it for a Gaussian Mixture Model, we create new train, dev, and test sets with binarized fields with a uniform distribution binning criteria.

For the first binarize function, Binarize1, we started off with three bins per scalar field, placing the bin boundaries at the values that split the sorted data into three equal-sized groups.

In [12]:
#Create dictionary of scalar field numbers and their corresponding number of bins.
fields_bins = {
1:3, 2:3, 5:3, 9:3, 10:3, 11:3, 12:3, 13:3, 14:3, 15:3, 16:3, 17:3, 18:3, 19:3, 20:3, 21:3, 23:3, 24:3, 25:3, 26:3
}

In [13]:
def Binarize1(fields_bins,X):
    #Binarize Training Data
    train_data=X[:l/2]
    s=train_data.shape[0]
    bin_train=np.where(train_data[:,0]==u'N/A',0.,1.).reshape(s,1)

    for f, b in fields_bins.items():
        col=train_data[:,f]
        sort=np.sort(col)
        bin_train = np.column_stack((bin_train,np.where(col<sort[s/b],1.,0.).reshape(s,1)))
        hist_bins=[0]
        for i in range(b-2):
            x=sort[(i+1)*s/b]
            y=sort[(i+2)*s/b]
            bin_train = np.column_stack((bin_train,np.where((col>=x) & (col<y),1.,0.).reshape(s,1)))
            hist_bins=np.append(hist_bins,x)
        hist_bins=np.append(hist_bins,sort[-1])
        bin_train = np.column_stack((bin_train,np.where(col>=sort[(b-1)*s/b],1.,0.).reshape(s,1)))
       
    #Binarize Dev Data
    dev_data = X[l/2:(3*l)/4]
    s1=dev_data.shape[0]
    bin_dev=np.where(dev_data[:,0]==u'N/A',0.,1.).reshape(s1,1)

    for f, b in fields_bins.items(): 
        col=dev_data[:,f]
        sort=np.sort(col)
        bin_dev = np.column_stack((bin_dev,np.where(col<sort[s1/b],1.,0.).reshape(s1,1)))
        for i in range(b-2):
            x=sort[(i+1)*s1/b]
            y=sort[(i+2)*s1/b]
            bin_dev = np.column_stack((bin_dev,np.where((col>=x) & (col<y),1.,0.).reshape(s1,1)))
        bin_dev = np.column_stack((bin_dev,np.where(col>=sort[(b-1)*s1/b],1.,0.).reshape(s1,1)))  
        
    #Binarize Test Data
    test_data = X[(3*l)/4:]
    s2=test_data.shape[0]
    bin_test=np.where(test_data[:,0]==u'N/A',0.,1.).reshape(s2,1)

    for f, b in fields_bins.items():
        col=test_data[:,f]
        sort=np.sort(col)
        bin_test = np.column_stack((bin_test,np.where(col<sort[s2/b],1.,0.).reshape(s2,1)))
        for i in range(b-2):
            x=sort[(i+1)*s2/b]
            y=sort[(i+2)*s2/b]
            bin_test = np.column_stack((bin_test,np.where((col>=x) & (col<y),1.,0.).reshape(s2,1)))
        bin_test = np.column_stack((bin_test,np.where(col>=sort[(b-1)*s2/b],1.,0.).reshape(s2,1)))
    
    return [bin_train,bin_dev,bin_test]
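
Binarize1 recomputes the bin boundaries separately for the train, dev, and test slices, so the same raw value can fall into different bins depending on which slice it is in. One alternative, sketched here as an assumption rather than as the approach used in this notebook, is to compute equal-frequency boundaries on the training column only and reuse them everywhere. The helper names `quantile_edges` and `one_hot_bins` are hypothetical.

```python
import numpy as np

def quantile_edges(train_col, n_bins):
    """Interior boundaries that split train_col into n_bins equal-frequency bins."""
    qs = [100.0 * k / n_bins for k in range(1, n_bins)]
    return np.percentile(train_col, qs)

def one_hot_bins(col, edges):
    """One indicator column per bin, using boundaries computed elsewhere."""
    idx = np.digitize(col, edges)          # bin index 0..len(edges) for each value
    return np.eye(len(edges) + 1)[idx]     # rows of the identity matrix as one-hot codes

# Toy columns standing in for one scalar field.
train_col = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8.])
dev_col = np.array([0.5, 4.0, 100.0])

edges = quantile_edges(train_col, 3)
print(edges)
print(one_hot_bins(dev_col, edges))
```

Each row gets exactly one indicator set, and a dev value outside the training range simply lands in the first or last bin.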

Below we built a function to take the binned training data and binned dev data and run a Gaussian mixture model on this data.  The function loops through the four covariance types, a range of PCA components, and a range of GMM components to find the model with the most accurate fit. The top 5 models run on the binned dev data and the top 5 models run on the binned test data are printed.

In [39]:
def mixture_model(bin_train,bin_dev,bin_test):
    
    dev_test=[[bin_dev,dev_labels],[bin_test,test_labels]]
    for i,j in enumerate(dev_test):
        experiments=[]
        for c_type in ['spherical', 'diag', 'tied', 'full']:
            for p_comp in range(1,60):
                for g_comp in range(1,9):

                    params=((3+p_comp)*g_comp)*2
                    if params<=100:
                        #Run PCA with p_comp components
                        pca = PCA(p_comp)

                        #Assign train data and labels to x and y for fitting and transforming.
                        x=bin_train
                        y=train_labels

                        #Fit the PCA model on the train data and project the train data onto p_comp components.
                        proj_train=pca.fit(x,y).transform(x)

                        #Project the evaluation data (dev or test) with the PCA model fitted on the train data.
                        proj_test=pca.transform(j[0])

                        #Filter Projected data by positive examples.
                        pos=proj_train[y == 1]
                        neg=proj_train[y == 0]

                        #Fit GMM model to positive and negative train datasets.
                        gmm_pos = GMM(n_components=g_comp, covariance_type=c_type).fit(pos)
                        gmm_neg = GMM(n_components=g_comp, covariance_type=c_type).fit(neg)

                        #Get positive and negative GMM model scores for test data.
                        gmm_pos_score=np.array(gmm_pos.score(proj_test))
                        gmm_neg_score=np.array(gmm_neg.score(proj_test))

                        #Merge score arrays and check which score is greater to determine positive/negative classes.
                        point_scores=np.column_stack((gmm_pos_score,gmm_neg_score))
                        gmm_pred=np.where(point_scores[:,0]>point_scores[:,1], 1, 0)

                        #Calculate accuracy of merged scoring model method by comparing predicted classes with test_labels.  
                        accuracy = 1-np.mean(gmm_pred != j[1])

                        #Append model parameters and scores to experiments list
                        experiments.append([p_comp,g_comp,c_type,accuracy,params])

        #Make experiments an array and sort by descending accuracy
        experiments = np.array(experiments)
        experiments = experiments[np.argsort(experiments[:, 3])[::-1]]
        if i == 0:set_='Dev Set'
        else:set_='Test Set'
        #Print results of experiments
        print 'Top Scoring GMM Models for ',set_
        print 'PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters'

        for i in  experiments[:5]:
            print '%14s  ||' %(i[0])+'%16s  ||' %(i[1])+'%17s  ||' %(i[2])+'  %.4f    ||' %(float(i[3]))+'%10s ' %(i[4])
        print
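
The decision rule inside mixture_model is: fit one density to the positive training examples and one to the negative ones, then label each new point by whichever density gives it the higher log-likelihood. The sketch below shows that rule in isolation with a single diagonal-covariance Gaussian per class (the `g_comp=1`, `'diag'` corner of the grid), written in plain NumPy on synthetic data. The function names, the synthetic data, and the variance floor `1e-6` are assumptions for illustration.

```python
import numpy as np

def fit_diag_gaussian(points):
    """Mean and per-dimension variance of one class."""
    return points.mean(axis=0), points.var(axis=0) + 1e-6   # small floor avoids zero variance

def log_likelihood(points, mean, var):
    """Per-row log density under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (points - mean) ** 2 / var, axis=1)

def predict(points, pos_params, neg_params):
    """1 where the positive-class density scores higher, else 0."""
    return np.where(log_likelihood(points, *pos_params) >
                    log_likelihood(points, *neg_params), 1, 0)

# Synthetic two-class data standing in for the PCA-projected features.
rng = np.random.RandomState(0)
neg = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
pos = rng.normal(loc=4.0, scale=1.0, size=(200, 2))

# Fit on the first half of each class, evaluate on the second half.
pos_params = fit_diag_gaussian(pos[:100])
neg_params = fit_diag_gaussian(neg[:100])

held_out = np.vstack([pos[100:], neg[100:]])
held_out_labels = np.array([1] * 100 + [0] * 100)
pred = predict(held_out, pos_params, neg_params)
accuracy = np.mean(pred == held_out_labels)
print(accuracy)
```

This rule compares likelihoods only and does not weight them by class frequency, which matches the comparison in mixture_model. In newer scikit-learn releases the `GMM` class is replaced by `GaussianMixture`, whose `score_samples` method returns the per-sample log-likelihood.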


Below we see that the GMM models on the binned datasets show improved accuracy, at around 82% on the dev set and a little lower (around 80%) on the test set.

In [40]:
#Run mixture model on first binarized train set
d = Binarize1(fields_bins,X)
bin_train,bin_dev,bin_test = d[0],d[1],d[2]
mixture_model(bin_train,bin_dev,bin_test)

Top Scoring GMM Models for  Dev Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            27  ||               1  ||             tied  ||  0.8267    ||        60 
            27  ||               1  ||             full  ||  0.8267    ||        60 
            28  ||               1  ||             tied  ||  0.8228    ||        62 
            28  ||               1  ||             full  ||  0.8228    ||        62 
            29  ||               1  ||             tied  ||  0.8208    ||        64 

Top Scoring GMM Models for  Test Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            29  ||               1  ||             tied  ||  0.7960    ||        64 
            29  ||               1  ||             full  ||  0.7960    ||        64 
            28  ||               1  ||             full  ||  0.7941    ||        62 
            28  ||               1  ||             tied  ||  0.7941    || 

Below is a second binarize function that uses the numpy histogram function to obtain the bin boundaries, which gives equal-width bins across each field's range.

In [90]:
def Binarize2(fields_bins,X):
    binnum=3
    
    #Binarize Training Data
    train_data=X[:l/2]
    s=train_data.shape[0]
    bin_train=np.where(train_data[:,0]==u'N/A',0.,1.).reshape(s,1)
       
    for f, b in fields_bins.items():
        col=train_data[:,f]
        h=np.histogram(col,bins=binnum,normed=True)[1]
        bin_train = np.column_stack((bin_train,np.where(col<h[1],1.,0.).reshape(s,1)))
        for i in range(len(h)-2):
            bin_train = np.column_stack((bin_train,np.where((col>=h[i+1]) & (col<h[i+2]),1.,0.).reshape(s,1)))
        bin_train = np.column_stack((bin_train,np.where(col>=h[-2],1.,0.).reshape(s,1)))
    

    #Binarize Dev Data
    dev_data = X[l/2:(3*l)/4]
    s1=dev_data.shape[0]
    bin_dev=np.where(dev_data[:,0]==u'N/A',0.,1.).reshape(s1,1)

    for f, b in fields_bins.items():
        col=dev_data[:,f]
        h=np.histogram(col,bins=binnum,normed=True)[1]
        bin_dev = np.column_stack((bin_dev,np.where(col<h[1],1.,0.).reshape(s1,1)))
        for i in range(len(h)-2):
            bin_dev = np.column_stack((bin_dev,np.where((col>=h[i+1]) & (col<h[i+2]),1.,0.).reshape(s1,1)))
        bin_dev = np.column_stack((bin_dev,np.where(col>=h[-2],1.,0.).reshape(s1,1)))
        
    #Binarize Test Data
    test_data = X[(3*l)/4:]
    s2=test_data.shape[0]
    bin_test=np.where(test_data[:,0]==u'N/A',0.,1.).reshape(s2,1)
    
    for f, b in fields_bins.items():
        col=test_data[:,f]
        h=np.histogram(col,bins=binnum,normed=True)[1]
        bin_test = np.column_stack((bin_test,np.where(col<h[1],1.,0.).reshape(s2,1)))
        for i in range(len(h)-2):
            bin_test = np.column_stack((bin_test,np.where((col>=h[i+1]) & (col<h[i+2]),1.,0.).reshape(s2,1)))
        bin_test = np.column_stack((bin_test,np.where(col>=h[-2],1.,0.).reshape(s2,1)))
        
    return [bin_train,bin_dev,bin_test]
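
With an integer `bins` argument, `np.histogram` returns equal-width bin edges spanning the column's minimum to maximum, so on a skewed field most rows can end up in the first bin. A small sketch on made-up values (the array `col` is illustrative) shows the edges and the resulting bin counts:

```python
import numpy as np

# Made-up skewed column: many small values and one large outlier.
col = np.array([0., 1., 1., 2., 2., 3., 3., 4., 90.])

counts, edges = np.histogram(col, bins=3)
print(edges)    # four edges for three equal-width bins
print(counts)   # rows per bin

# Assign each value to a bin using the two interior edges.
bin_index = np.digitize(col, edges[1:-1])
print(bin_index)
```

Three bins are described by four edges, so only the two interior edges are needed to assign a bin to each row.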

Running Binarize2 followed by mixture_model, we see an improvement in the top accuracy on both the dev set (0.8267 to 0.8366) and the test set (0.7960 to 0.8198).

In [41]:
#Run mixture model on second binned train set
d = Binarize2(fields_bins,X)
bin_train,bin_dev,bin_test = d[0],d[1],d[2]
mixture_model(bin_train,bin_dev,bin_test)

Top Scoring GMM Models for  Dev Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            12  ||               3  ||             tied  ||  0.8366    ||        90 
             7  ||               3  ||             tied  ||  0.8356    ||        60 
             8  ||               3  ||             tied  ||  0.8356    ||        66 
            12  ||               1  ||             tied  ||  0.8356    ||        30 
            12  ||               1  ||             full  ||  0.8356    ||        30 

Top Scoring GMM Models for  Test Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            13  ||               3  ||             tied  ||  0.8198    ||        96 
            12  ||               3  ||             tied  ||  0.8188    ||        90 
            16  ||               1  ||             full  ||  0.8188    ||        38 
            16  ||               2  ||             tied  ||  0.8188    || 

To binarize the user input text, we check whether it contains any of the features in the feature array; for each feature it contains, we add that feature's weight to the weight sum.  If the weight sum is greater than 0, we assign 1 to the column; otherwise we assign 0.  This is done below in Binarize3.

In [42]:
def Binarize3(bin_train,bin_dev,bin_test):
    
    #Add text weight field to binarized train data
    text_weights_f_arr = []
    for line in train_data[:,7]:
        weightsum = 0
        for feat in weight_features: 
            if feat[0] in line.lower(): 
                weightsum += feat[1]
        if weightsum > 0: 
            text_weights_f_arr.append(1)
        else:
            text_weights_f_arr.append(0)
    bin_train = np.column_stack((bin_train, np.asarray(text_weights_f_arr)))

    #Add text weight field to binarized dev data
    text_weights_f_arr = []
    for line in dev_data[:,7]:
        weightsum = 0
        for feat in weight_features: 
            if feat[0] in line.lower(): 
                weightsum += feat[1]
        if weightsum > 0: 
            text_weights_f_arr.append(1)
        else:
            text_weights_f_arr.append(0)
    bin_dev = np.column_stack((bin_dev, np.asarray(text_weights_f_arr)))
        
    #Add text weight field to binarized test data

    text_weights_f_arr = []
    for line in test_data[:,7]:
        weightsum = 0
        for feat in weight_features: 
            if feat[0] in line.lower(): 
                weightsum += feat[1]
        if weightsum > 0: 
            text_weights_f_arr.append(1)
        else:
            text_weights_f_arr.append(0)
    bin_test = np.column_stack((bin_test, np.asarray(text_weights_f_arr)))
        
    return [bin_train,bin_dev,bin_test]
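
The test `feat[0] in line.lower()` is a substring match, so a short feature such as `'or'` also fires inside words like `'for'` or `'work'`. The sketch below contrasts that with a whole-word match on a made-up sentence, using two of the weights printed earlier (rounded). The helper names and the example sentence are hypothetical, and the whole-word variant is an alternative to consider, not what Binarize3 does.

```python
import re

# Two of the learned weights reported earlier, rounded for readability.
weight_features = [('thanks', 4.68), ('or', -4.73)]

def weight_sum_substring(text, features):
    """Sum weights of features found anywhere in the text (the rule used in Binarize3)."""
    lowered = text.lower()
    return sum(w for name, w in features if name in lowered)

def weight_sum_whole_word(text, features):
    """Sum weights of features that appear as whole words or phrases."""
    lowered = text.lower()
    return sum(w for name, w in features
               if re.search(r'\b' + re.escape(name) + r'\b', lowered))

text = 'Thanks for reading, I start work on Monday.'

sub = weight_sum_substring(text, weight_features)
word = weight_sum_whole_word(text, weight_features)

print(1 if sub > 0 else 0)    # 'or' matches inside 'for' and 'work', so the sum is negative
print(1 if word > 0 else 0)   # only 'thanks' matches as a whole word
```

Here the two rules give different tags for the same sentence, which may be one reason the text field adds little signal.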

Starting with the data output from Binarize2, we run Binarize3 to add the text feature weights as a binarized field.  Unfortunately this doesn't improve the scores at all.  If anything, they decrease a little.

In [43]:
#Run mixture model on second binned train set with appended text weights field
d = Binarize2(fields_bins,X)
bin_train,bin_dev,bin_test = d[0],d[1],d[2]

d = Binarize3(bin_train,bin_dev,bin_test)
bin_train,bin_dev,bin_test = d[0],d[1],d[2]

mixture_model(bin_train,bin_dev,bin_test)

Top Scoring GMM Models for  Dev Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            11  ||               3  ||             tied  ||  0.8356    ||        84 
             8  ||               3  ||             tied  ||  0.8356    ||        66 
             7  ||               3  ||             tied  ||  0.8356    ||        60 
            10  ||               3  ||             tied  ||  0.8347    ||        78 
            13  ||               3  ||             tied  ||  0.8347    ||        96 

Top Scoring GMM Models for  Test Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            11  ||               3  ||             tied  ||  0.8188    ||        84 
            13  ||               3  ||             tied  ||  0.8178    ||        96 
            12  ||               3  ||             tied  ||  0.8158    ||        90 
            20  ||               2  ||             tied  ||  0.8129    || 

We hypothesize that splitting the feature weights into one binarized field per feature (essentially adding 10 fields) might help create better principal components for the GMM.  We perform this operation in Binarize4.

In [36]:
def Binarize4(bin_train,bin_dev,bin_test):
    
    ###Create binary features for train data
    for pos_f in positive_features:
        pos_f_arr = []
        for line in train_data[:,7]:
            pos_f_arr.append(np.where(pos_f[0] in line, 1, 0))
        bin_train = np.column_stack((bin_train, pos_f_arr))
    for neg_f in negative_features:
        neg_f_arr = []
        for line in train_data[:,7]:
            neg_f_arr.append(np.where(neg_f[0] in line, 1, 0))
        bin_train = np.column_stack((bin_train, neg_f_arr))
    
    ###Create binary features for dev data
    for pos_f in positive_features:
        pos_f_arr = []
        for line in dev_data[:,7]:
            pos_f_arr.append(np.where(pos_f[0] in line, 1, 0))
        bin_dev = np.column_stack((bin_dev, pos_f_arr))
    for neg_f in negative_features:
        neg_f_arr = []
        for line in dev_data[:,7]:
            neg_f_arr.append(np.where(neg_f[0] in line, 1, 0))
        bin_dev = np.column_stack((bin_dev, neg_f_arr))
    
    ###Create binary features for test data
    for pos_f in positive_features:
        pos_f_arr = []
        for line in test_data[:,7]:
            pos_f_arr.append(np.where(pos_f[0] in line, 1, 0))
        bin_test = np.column_stack((bin_test, pos_f_arr))
    for neg_f in negative_features:
        neg_f_arr = []
        for line in test_data[:,7]:
            neg_f_arr.append(np.where(neg_f[0] in line, 1, 0))
        bin_test = np.column_stack((bin_test, neg_f_arr))
          
    return [bin_train,bin_dev,bin_test]

Unfortunately, this new modification didn't improve the accuracy either.  

In [44]:
#Run mixture model on second binned train set with appended binary fields for each text feature
d = Binarize2(fields_bins,X)
bin_train,bin_dev,bin_test = d[0],d[1],d[2]

d = Binarize4(bin_train,bin_dev,bin_test)
bin_train,bin_dev,bin_test = d[0],d[1],d[2]

mixture_model(bin_train,bin_dev,bin_test)

Top Scoring GMM Models for  Dev Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            11  ||               3  ||             tied  ||  0.8356    ||        84 
             9  ||               3  ||             tied  ||  0.8356    ||        72 
            13  ||               3  ||             tied  ||  0.8356    ||        96 
            10  ||               3  ||             tied  ||  0.8356    ||        78 
            12  ||               3  ||             tied  ||  0.8347    ||        90 

Top Scoring GMM Models for  Test Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            13  ||               3  ||             tied  ||  0.8149    ||        96 
            20  ||               2  ||             full  ||  0.8089    ||        92 
            19  ||               2  ||             full  ||  0.8069    ||        88 
            17  ||               2  ||             full  ||  0.8069    || 

Finally, we add a calculated field to the dataset: the character length of the user's message.

In [81]:
#Compute the character length of each request's text in a single pass.
chars = np.array([len(i) for i in X[:,7]]).reshape(X.shape[0],1)

#Append the length as a new last column of X.
X=np.column_stack((X,chars))

We then binarize this field as well with the histogram method from Binarize2.  This doesn't appear to improve the accuracy significantly either.  After trying a number of modifications to the data and to our models, it appears we have hit a ceiling in the accuracy we can achieve with GMMs.  While we didn't try everything, the GMM appears to be our best choice compared with Multinomial Naive Bayes or Logistic Regression.  An accuracy of around 83% is a relatively good score for this dataset, good for 23rd place out of 464 entrants in the Kaggle competition.

In [88]:
#Run mixture model on second binned train set
d = Binarize2(fields_bins,X)
bin_train,bin_dev,bin_test = d[0],d[1],d[2]
mixture_model(bin_train,bin_dev,bin_test)

Top Scoring GMM Models for  Dev Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            11  ||               1  ||             full  ||  0.8416    ||        28 
            11  ||               1  ||             tied  ||  0.8416    ||        28 
            11  ||               2  ||             tied  ||  0.8416    ||        56 
            32  ||               1  ||             tied  ||  0.8406    ||        70 
            37  ||               1  ||             tied  ||  0.8406    ||        80 

Top Scoring GMM Models for  Test Set
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            16  ||               1  ||             tied  ||  0.8248    ||        38 
            17  ||               2  ||             tied  ||  0.8248    ||        80 
            19  ||               2  ||             tied  ||  0.8248    ||        88 
            19  ||               1  ||             tied  ||  0.8248    || 