w207 Final Project
Random Acts of Pizza Kaggle Competition Project

Using a dataset of 4040 Reddit requests for pizza, we attempt to build a machine learning model that predicts whether a request results in the requester receiving a pizza.

We investigate several types of models that we import below from the sklearn library.

In [198]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import *

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

Below we use pandas to read the Kaggle training and test sets, which are provided as JSON files.

In [303]:
#Download training and test datasets.
pd_train = pd.read_json('https://raw.githubusercontent.com/mdemaster/w207_Final_Project/master/train.json', orient='columns')
pd_test = pd.read_json('https://raw.githubusercontent.com/mdemaster/w207_Final_Project/master/test.json', orient='columns')

#Move label field 'requester_received_pizza' to first field.
pizza = pd_train['requester_received_pizza']
pd_train.drop(labels=['requester_received_pizza'], axis=1,inplace = True)
pd_train.insert(0, 'requester_received_pizza', pizza)

#Create numpy arrays datasets
np_test = np.array(pd_test)
np_train = np.array(pd_train)

We discovered that the test set has only 17 fields, compared to the 32 present in the training set. The test set also lacks the field "requester_received_pizza", which is the field we use as the label.

In [305]:
#Print illustration of field differences in training and test sets.
train_cols=str(len(pd_train.columns.values))
test_cols=str(len(pd_test.columns.values))

print 'Number of training set fields:',train_cols.ljust(31,' '),' Number of test set fields: ',test_cols.ljust(55,' ')
print '==============================================================================================================='
for i in range(len(pd_train.columns.values)):
    train=pd_train.columns.values[i]
    if i<len(pd_test.columns.values):
        test=pd_test.columns.values[i]
    else:
        test=''
    print train.ljust(55,' '), '       ', test

Number of training set fields: 32                               Number of test set fields:  17                                                     
requester_received_pizza                                        giver_username_if_known
giver_username_if_known                                         request_id
number_of_downvotes_of_request_at_retrieval                     request_text_edit_aware
number_of_upvotes_of_request_at_retrieval                       request_title
post_was_edited                                                 requester_account_age_in_days_at_request
request_id                                                      requester_days_since_first_post_on_raop_at_request
request_number_of_comments_at_retrieval                         requester_number_of_comments_at_request
request_text                                                    requester_number_of_comments_in_raop_at_request
request_text_edit_aware                                         requester_number_of_pos
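
The side-by-side listing above pairs fields by position only, so it does not directly show which fields are missing. As a rough sketch (written for Python 3, with `toy_train` and `toy_test` as hypothetical stand-ins for `pd_train` and `pd_test`), a set difference on the column names gives the missing fields directly:

```python
import pandas as pd

# Hypothetical stand-ins for pd_train and pd_test, using a few of the real field names.
toy_train = pd.DataFrame(columns=['requester_received_pizza', 'request_id',
                                  'request_text', 'request_text_edit_aware'])
toy_test = pd.DataFrame(columns=['request_id', 'request_text_edit_aware'])

# Fields present in the training frame but absent from the test frame.
train_only = sorted(set(toy_train.columns) - set(toy_test.columns))
print(train_only)
```

Applied to the real frames, the same expression would list every training field that is absent from the test set.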

Because of these discrepancies in the given test data, we decided to split the labeled training data into a smaller training set, a dev set, and a test set.

In [323]:
#Split dataset into predictor data,X, and labels,Y.
X = np_train[:,1:]
Y = np_train[:,0]

shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

#Print data and label array shapes to confirm they are the same.
print 'data shape: ', X.shape
print 'label shape:', Y.shape

#Assign first half of data to training set, 3rd quarter of data to dev set, and 4th quarter of data to test set.
l=len(X)
train_data, train_labels = X[:l/2], np.where(Y[:l/2][:]==True,1,0)
dev_data, dev_labels = X[l/2:(3*l)/4], np.where(Y[l/2:(3*l)/4][:]==True,1,0)
test_data, test_labels = X[(3*l)/4:], np.where(Y[(3*l)/4:][:]==True,1,0)


data shape:  (4040, 31)
label shape: (4040,)
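
The cell above relies on Python 2 integer division (`l/2`) and an unseeded shuffle, so the split changes on every run. A minimal sketch of the same half/quarter/quarter split for Python 3 follows. The fixed seed and the toy arrays `X_toy` and `Y_toy` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

# Fixed seed so the shuffle is repeatable (an assumption; the notebook sets no seed).
rng = np.random.RandomState(0)

# Toy stand-ins for X and Y: 20 rows, 2 fields, boolean labels.
X_toy = np.arange(40).reshape(20, 2)
Y_toy = np.array([True, False] * 10)

# Shuffle rows and labels with the same permutation.
shuffle = rng.permutation(X_toy.shape[0])
X_toy, Y_toy = X_toy[shuffle], Y_toy[shuffle]

# Floor division keeps the slice boundaries integers in Python 3.
n = len(X_toy)
half, three_q = n // 2, (3 * n) // 4

train_X, train_y = X_toy[:half], Y_toy[:half].astype(int)
dev_X, dev_y = X_toy[half:three_q], Y_toy[half:three_q].astype(int)
test_X, test_y = X_toy[three_q:], Y_toy[three_q:].astype(int)

print(train_X.shape, dev_X.shape, test_X.shape)
```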


After examining the dataset fields, we see that there are a number of scalar fields with highly variable distributions. To normalize this information for use in a Gaussian mixture model, we create new train, dev, and test sets in which each scalar field is replaced by binary indicator columns, one per bin.

For the first binarize function, Binarize1(), we start with three bins per scalar field and place the bin boundaries at even splits of the sorted data, so that each bin receives roughly a third of the examples.
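
To make the equal-frequency idea concrete, here is a small sketch for Python 3. It is a variation rather than the notebook's method: the cut points come from the training column only and are then reused for other data, whereas Binarize1() recomputes boundaries within each split. The helper name `quantile_one_hot` and the toy columns are hypothetical.

```python
import numpy as np

def quantile_one_hot(train_col, other_col, n_bins=3):
    """One-hot encode other_col using equal-frequency cut points from train_col."""
    # Interior cut points at the 1/n_bins, 2/n_bins, ... quantiles of the training column.
    cuts = np.percentile(train_col, [100.0 * k / n_bins for k in range(1, n_bins)])
    # np.digitize maps each value to a bin index between 0 and n_bins - 1.
    idx = np.digitize(other_col, cuts)
    # Row i of the identity matrix is the indicator vector for bin i.
    return np.eye(n_bins)[idx]

train_col = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8.])
dev_col = np.array([0.5, 4.0, 100.0])

one_hot = quantile_one_hot(train_col, dev_col)
print(one_hot)
```

Each row contains exactly one 1, and values beyond the training range (such as 100.0 here) fall into the last bin.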

In [None]:
#Create dictionary of scalar field numbers and their corresponding number of bins.
fields_bins = {
1:3, 2:3, 5:3, 9:3, 10:3, 11:3, 12:3, 13:3, 14:3, 15:3, 16:3, 17:3, 18:3, 19:3, 20:3, 21:3, 23:3, 24:3, 25:3, 26:3
}

In [393]:
def Binarize1(fields_bins,X):
    #Declare the outputs as globals up front, before they are assigned.
    global bin_train,bin_dev,bin_test

    #Binarize Training Data
    train_data=X[:l/2]
    s=train_data.shape[0]
    #First column: 1 if giver_username_if_known is set, 0 if it is 'N/A'.
    bin_train=np.where(train_data[:,0]==u'N/A',0.,1.).reshape(s,1)

    for f, b in fields_bins.items():
        col=train_data[:,f]
        sort=np.sort(col)
        #Lowest bin: values below the first boundary.
        bin_train = np.column_stack((bin_train,np.where(col<sort[s/b],1.,0.).reshape(s,1)))
        hist_bins=[0]
        #Middle bins: values between consecutive boundaries.
        for i in range(b-2):
            x=sort[(i+1)*s/b]
            y=sort[(i+2)*s/b]
            bin_train = np.column_stack((bin_train,np.where((col>=x) & (col<y),1.,0.).reshape(s,1)))
            hist_bins=np.append(hist_bins,x)
        hist_bins=np.append(hist_bins,sort[-1])
        #Highest bin: values at or above the last boundary.
        bin_train = np.column_stack((bin_train,np.where(col>=sort[(b-1)*s/b],1.,0.).reshape(s,1)))
        #print pd_train.columns.values[f+1],'__histogram bins: ',hist_bins

    #Binarize Dev Data (boundaries are recomputed from the dev data itself)
    dev_data = X[l/2:(3*l)/4]
    s1=dev_data.shape[0]
    bin_dev=np.where(dev_data[:,0]==u'N/A',0.,1.).reshape(s1,1)

    for f, b in fields_bins.items():
        col=dev_data[:,f]
        sort=np.sort(col)
        bin_dev = np.column_stack((bin_dev,np.where(col<sort[s1/b],1.,0.).reshape(s1,1)))
        for i in range(b-2):
            x=sort[(i+1)*s1/b]
            y=sort[(i+2)*s1/b]
            bin_dev = np.column_stack((bin_dev,np.where((col>=x) & (col<y),1.,0.).reshape(s1,1)))
        bin_dev = np.column_stack((bin_dev,np.where(col>=sort[(b-1)*s1/b],1.,0.).reshape(s1,1)))

    #Binarize Test Data (boundaries are recomputed from the test data itself)
    test_data = X[(3*l)/4:]
    s2=test_data.shape[0]
    bin_test=np.where(test_data[:,0]==u'N/A',0.,1.).reshape(s2,1)

    for f, b in fields_bins.items():
        col=test_data[:,f]
        sort=np.sort(col)
        bin_test = np.column_stack((bin_test,np.where(col<sort[s2/b],1.,0.).reshape(s2,1)))
        for i in range(b-2):
            x=sort[(i+1)*s2/b]
            y=sort[(i+2)*s2/b]
            bin_test = np.column_stack((bin_test,np.where((col>=x) & (col<y),1.,0.).reshape(s2,1)))
        bin_test = np.column_stack((bin_test,np.where(col>=sort[(b-1)*s2/b],1.,0.).reshape(s2,1)))
    




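Below we build a function that takes the binned training and dev data and fits Gaussian mixture models (GMMs) to it. For each combination of settings, it projects the data with PCA, fits one GMM to the positive training examples and one to the negative examples, and labels each dev example by whichever model gives it the higher score. The function loops over the four covariance types, a range of PCA components, and a range of GMM components, skipping any combination whose estimated parameter count (params in the code) exceeds 100. The 10 models with the highest dev-set accuracy are printed.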

In [420]:
def mixture_model(bin_train,bin_dev,bin_test):
    experiments=[]

    for c_type in ['spherical', 'diag', 'tied', 'full']:
        for p_comp in range(1,60):
            for g_comp in range(1,9):

                params=((3+p_comp)*g_comp)*2
                if params<=100:
                    #Run PCA with p_comp components
                    pca = PCA(p_comp)

                    #Assign train data and labels to x and y for fitting and transforming.
                    x=bin_train
                    y=train_labels

                    #Fit PCA on the train data (PCA is unsupervised, so y is ignored) and project the train data.
                    proj_train=pca.fit(x,y).transform(x)

                    #Project the dev data with the same fitted PCA model.
                    proj_test=pca.transform(bin_dev)

                    #Filter Projected data by positive examples.
                    pos=proj_train[y == 1]
                    neg=proj_train[y == 0]

                    #Fit GMM model to positive and negative train datasets.
                    gmm_pos = GMM(n_components=g_comp, covariance_type=c_type).fit(pos)
                    gmm_neg = GMM(n_components=g_comp, covariance_type=c_type).fit(neg)

                    #Get positive and negative GMM model scores for the projected dev data.
                    gmm_pos_score=np.array(gmm_pos.score(proj_test))
                    gmm_neg_score=np.array(gmm_neg.score(proj_test))

                    #Merge score arrays and check which score is greater to determine positive/negative classes.
                    point_scores=np.column_stack((gmm_pos_score,gmm_neg_score))
                    gmm_pred=np.where(point_scores[:,0]>point_scores[:,1], 1, 0)

                    #Calculate accuracy of merged scoring model method by comparing predicted classes with dev_labels.
                    accuracy = 1-np.mean(gmm_pred != dev_labels)

                    #Append model parameters and scores to experiments list
                    experiments.append([p_comp,g_comp,c_type,accuracy,params])

    #Make experiments an array and sort by descending accuracy
    experiments = np.array(experiments)
    experiments = experiments[np.argsort(experiments[:, 3])[::-1]]

    #Print results of experiments
    print 'Top Scoring GMM Models'
    print 'PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters'

    for i in  experiments[:10]:
        print '%14s  ||' %(i[0])+'%16s  ||' %(i[1])+'%17s  ||' %(i[2])+'  %.4f    ||' %(float(i[3]))+'%10s ' %(i[4])
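
The function above uses `sklearn.mixture.GMM`, which was removed in later scikit-learn releases in favor of `GaussianMixture`. As a hedged sketch of the same per-class scoring idea with the newer class (Python 3, synthetic data rather than the pizza data, and a single fixed setting instead of the full search), one might write:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)

# Synthetic two-class data (an assumption for illustration): two shifted Gaussian blobs.
X_neg = rng.normal(loc=-2.0, size=(100, 5))
X_pos = rng.normal(loc=2.0, size=(100, 5))
X_toy = np.vstack([X_neg, X_pos])
y_toy = np.array([0] * 100 + [1] * 100)

# Project onto 2 principal components.
pca = PCA(n_components=2).fit(X_toy)
proj = pca.transform(X_toy)

# Fit one mixture model per class.
gmm_pos = GaussianMixture(n_components=1, covariance_type='full',
                          random_state=0).fit(proj[y_toy == 1])
gmm_neg = GaussianMixture(n_components=1, covariance_type='full',
                          random_state=0).fit(proj[y_toy == 0])

# score_samples returns one log-likelihood per row; GaussianMixture.score
# returns only the average, so it cannot be used for per-example comparison.
pred = (gmm_pos.score_samples(proj) > gmm_neg.score_samples(proj)).astype(int)

# Scored on the same rows used for fitting, purely to keep the sketch short.
accuracy = np.mean(pred == y_toy)
print(accuracy)
```

In the notebook's setting, `proj` for scoring would come from transforming the dev data, and the comparison would be against `dev_labels`.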
        


Below we see that the GMM on the first binned dataset shows improved accuracy: about 81% on the dev set (the figures printed below) and around 80% on the test set.

In [421]:
#Run mixture model on first binarized train set
Binarize1(fields_bins,X)
mixture_model(bin_train,bin_dev,bin_test)

Top Scoring GMM Models
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
            27  ||               1  ||             tied  ||  0.8119    ||        60 
            27  ||               1  ||             full  ||  0.8119    ||        60 
            24  ||               1  ||             tied  ||  0.8099    ||        54 
            24  ||               1  ||             full  ||  0.8099    ||        54 
            25  ||               1  ||             tied  ||  0.8069    ||        56 
            25  ||               1  ||             full  ||  0.8069    ||        56 
            26  ||               1  ||             tied  ||  0.8059    ||        58 
            26  ||               1  ||             full  ||  0.8059    ||        58 
            28  ||               1  ||             tied  ||  0.8030    ||        62 
            28  ||               1  ||             full  ||  0.8030    ||        62 


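Below is a second binarize function, Binarize2(), which takes its bin boundaries from the numpy histogram function. With an integer number of bins, np.histogram splits each field's range into equal-width bins, so skewed fields end up with most examples in the lowest bin.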
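
As a quick illustration of what those boundaries look like, the sketch below (Python 3, with a made-up skewed column `col`) prints the edges and counts that `np.histogram` produces for three bins:

```python
import numpy as np

# Made-up skewed column: most values are small, a few are large.
col = np.array([0., 1., 2., 3., 50., 90.])

# With an integer bin count, the edges split [min, max] into equal-width intervals.
counts, edges = np.histogram(col, bins=3)
widths = np.diff(edges)

print(edges)
print(counts)
```

Here the edges are 0, 30, 60, and 90, and four of the six values land in the first bin, which is the opposite of the equal-frequency behavior of Binarize1().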

In [418]:
def Binarize2(fields_bins,X):
    #Declare the outputs as globals up front, before they are assigned.
    global bin_train,bin_dev,bin_test

    #Number of histogram bins per field (the bin counts in fields_bins are not used here).
    binnum=3

    #Binarize Training Data
    train_data=X[:l/2]
    s=train_data.shape[0]
    #First column: 1 if giver_username_if_known is set, 0 if it is 'N/A'.
    bin_train=np.where(train_data[:,0]==u'N/A',0.,1.).reshape(s,1)

    for f, b in fields_bins.items():
        col=train_data[:,f]
        #Only the bin edges (element [1]) are used, so no normalization option is needed.
        h=np.histogram(col,bins=binnum)[1]
        bin_train = np.column_stack((bin_train,np.where(col<h[1],1.,0.).reshape(s,1)))
        #Bins [h[i+1], h[i+2]); with 3 bins the last of these overlaps the final column below.
        for i in range(len(h)-2):
            bin_train = np.column_stack((bin_train,np.where((col>=h[i+1]) & (col<h[i+2]),1.,0.).reshape(s,1)))
        bin_train = np.column_stack((bin_train,np.where(col>=h[-2],1.,0.).reshape(s,1)))

    #Binarize Dev Data (edges are recomputed from the dev data itself)
    dev_data = X[l/2:(3*l)/4]
    s1=dev_data.shape[0]
    bin_dev=np.where(dev_data[:,0]==u'N/A',0.,1.).reshape(s1,1)

    for f, b in fields_bins.items():
        col=dev_data[:,f]
        h=np.histogram(col,bins=binnum)[1]
        bin_dev = np.column_stack((bin_dev,np.where(col<h[1],1.,0.).reshape(s1,1)))
        for i in range(len(h)-2):
            bin_dev = np.column_stack((bin_dev,np.where((col>=h[i+1]) & (col<h[i+2]),1.,0.).reshape(s1,1)))
        bin_dev = np.column_stack((bin_dev,np.where(col>=h[-2],1.,0.).reshape(s1,1)))

    #Binarize Test Data (edges are recomputed from the test data itself)
    test_data = X[(3*l)/4:]
    s2=test_data.shape[0]
    bin_test=np.where(test_data[:,0]==u'N/A',0.,1.).reshape(s2,1)

    for f, b in fields_bins.items():
        col=test_data[:,f]
        h=np.histogram(col,bins=binnum)[1]
        bin_test = np.column_stack((bin_test,np.where(col<h[1],1.,0.).reshape(s2,1)))
        for i in range(len(h)-2):
            bin_test = np.column_stack((bin_test,np.where((col>=h[i+1]) & (col<h[i+2]),1.,0.).reshape(s2,1)))
        bin_test = np.column_stack((bin_test,np.where(col>=h[-2],1.,0.).reshape(s2,1)))



Running Binarize2() and then mixture_model(), we see dev-set accuracy improve to about 85%, but no improvement on the test set.

In [422]:
#Run mixture model on second binned train set
Binarize2(fields_bins,X)
mixture_model(bin_train,bin_dev,bin_test)

Top Scoring GMM Models
PCA Components  ||  GMM Components  ||  Covariance Type  ||  Accuracy  ||  Parameters
             6  ||               5  ||             diag  ||  0.8475    ||        90 
            14  ||               1  ||             full  ||  0.8446    ||        34 
             5  ||               5  ||             diag  ||  0.8446    ||        80 
            14  ||               1  ||             tied  ||  0.8446    ||        34 
            15  ||               1  ||             full  ||  0.8426    ||        36 
            15  ||               1  ||             tied  ||  0.8426    ||        36 
            13  ||               3  ||             tied  ||  0.8426    ||        96 
            14  ||               2  ||             tied  ||  0.8426    ||        68 
             4  ||               7  ||             tied  ||  0.8416    ||        98 
            15  ||               2  ||             tied  ||  0.8406    ||        72 
