# Random Acts of Pizza
#### Mohammad J. Habib - W207

## Overview

I picked Kaggle's Random Accts of Pizza (RAOP) classification task for my W207 final project. 
https://www.kaggle.com/c/random-acts-of-pizza

The goal is to classify Reddit posts in two categories: whether a post got a free pizza from a stranger (True) or not (False). About 75% of the posts in the RAOP dataset did not get a pizza. Randomly predicting "no pizza" or "False" for all classes would give an accuracy score of ~75% and a ROC AUC score of 0.50. 

I tried various sets of features and classifiers before I came up with this final model. You can read more about that in my final project presentation and review the experimentation in the notebooks on jhabib/w207/w207_final_project/background_work folder (github.com).

To summarize, a bag of words approach performs really poorly on this task no matter the classifier used (ROC AUC ~0.52 at best). Ensembles, Neural Nets, Gaussian Mixture Models etc. all perform similarly poorly with bag of words. PCA on the bag of words did not help either.  It was not until I looked at numeric features in the data that the ROC AUC score started to show some improvement. It turned out that numeric features alone, without any meta features inferred from the data e.g. length of post, gave me the best result. I also found that sub-sampling the training dataset to get an equal ratio of pizza-getting (True) and no-pizza (False) posts helped improve the score.

I used **xgboost's XGBClassifier** for the best result. You will need to install that on your machine before you can run this notebook. sklearn's GradientBoostingClassifier was not too far behind either but I left that be.

My team consisted of me, myself and I.


## Data, and the features used in the model

I used data available from the Standford website for this model. Link: https://cs.stanford.edu/~althoff/raop-dataset/

I did not use data available from Kaggle for the submission because it does not provide labels with the test data. I did create a Kaggle submission for kicks which can be found in _kaggle_notebook.ipynb.

### Features not used
The Stanford dataset includes some columns (features) that literally tell you which post recevied a Pizza. I obviously avoided these features:

+ giver_username_if_known (col. index 0)
+ requester_received_pizza (23)
+ requester_user_flair (29) - turns out requesters get flair when the get a pizza

### Features used in this model 
+ number_of_downvotes_of_request_at_retrieval
+ number_of_upvotes_of_request_at_retrieval
+ post_was_edited
+ request_number_of_comments_at_retrieval
+ requester_account_age_in_days_at_request
+ requester_account_age_in_days_at_retrieval
+ requester_days_since_first_post_on_raop_at_request
+ requester_days_since_first_post_on_raop_at_retrieval
+ requester_number_of_comments_at_request
+ requester_number_of_comments_at_retrieval
+ requester_number_of_comments_in_raop_at_request
+ requester_number_of_comments_in_raop_at_retrieval
+ requester_number_of_posts_at_request
+ requester_number_of_posts_at_retrieval
+ requester_number_of_posts_on_raop_at_request
+ requester_number_of_posts_on_raop_at_retrieval
+ requester_number_of_subreddits_at_request
+ requester_upvotes_minus_downvotes_at_request
+ requester_upvotes_minus_downvotes_at_retrieval
+ requester_upvotes_plus_downvotes_at_request
+ requester_upvotes_plus_downvotes_at_retrieval

OK, now that we have that out of the way, let's start by importing some packages and loading the data.

## The Model

In [7]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn import cross_validation

# xgboost
import xgboost as xgb

from collections import defaultdict

## Load the data

In [305]:
# Load the data
import urllib
import tarfile
import pandas as pd
import random

# download the data and extract the tarball 
# NOTE: change the url to http from https if you get a urllib error 
tf = urllib.URLopener()
tf.retrieve("https://cs.stanford.edu/~althoff/raop-dataset/pizza_request_dataset.tar.gz", "pizza.tar.gz")

tar = tarfile.open("pizza.tar.gz", "r:gz")
for name in tar.getnames():
    if name == "pizza_request_dataset/pizza_request_dataset.json":
        member = tar.getmember(name)
        f = tar.extractfile(member)
        if f is not None:
            json_data = f.read()

# convert data to a pandas dataframe
pizza_df = pd.read_json(json_data)
feature_names = np.asarray([x for x in pizza_df[:0]])
pizza_df = np.asarray(pizza_df)

# shuffle the data
np.random.seed(0)
shuffle = np.random.permutation(np.arange(pizza_df.shape[0]))
pizza_df = pizza_df[shuffle]

# keep only the features we need
# and separate the labels
features_to_keep = [2, 3, 4, 6, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 28]
X, y = pizza_df[:, features_to_keep], pizza_df[:, 23]

# separate out train, dev and test data and labels
# we've already shuffled this around before
dev_data, dev_labels = X[:500], y[:500]
test_data, test_labels = X[500:1000], y[500:1000]
train_data, train_labels = X[1000:], y[1000:]


# let's create a new training dataset with equal parts True and False posts
train_pos = train_data[np.where(train_labels==1)]
train_labels_pos = train_labels[np.where(train_labels==1)]
train_neg = train_data[np.where(train_labels==0)]
train_labels_neg = train_labels[np.where(train_labels==0)]

train_neg_wanted = []
train_labels_wanted = []

for i in range(train_pos.shape[0]):
    rand_index = random.randint(0, train_pos.shape[0])
    train_neg_wanted.append(train_neg[int(rand_index)])
    train_labels_wanted.append(train_labels_neg[int(rand_index)])

train_neg_wanted = np.asarray(train_neg_wanted)
train_labels_wanted = np.asarray(train_labels_wanted)

train_new = np.concatenate((train_pos, train_neg_wanted), axis=0)
train_labels_new = np.concatenate((train_labels_pos, train_labels_wanted), axis=0)

# we need the labels in a binary format
# Python can convert True, False to 1, 0
train_labels_new = np.asarray(train_labels_new, dtype=int)
dev_labels = np.asarray(dev_labels, dtype=int)
test_labels = np.asarray(test_labels, dtype=int)

# reshuffle train_new and train_labels_new
shuffle = np.random.permutation(np.arange(train_new.shape[0]))
train_new, train_labels_new = train_new[shuffle], train_labels_new[shuffle]

## Create helper functions

Create some helper functions that will reduce typing.

In [236]:
def score_classifier(clf, train, train_labels, test, test_labels):
    clf.fit(train, train_labels)
    train_accuracy = metrics.accuracy_score(train_labels, clf.predict(train))
    train_rocauc = metrics.roc_auc_score(train_labels, clf.predict(train))
    test_accuracy = metrics.accuracy_score(test_labels, clf.predict(test))    
    test_rocauc = metrics.roc_auc_score(test_labels, clf.predict(test))
    print("Train Accuracy: %.4f, Train AUC: %.4f \nTest Accuracy: %.4f, Test AUC: %.4f\n" 
          % (train_accuracy, train_rocauc, test_accuracy, test_rocauc))


In [237]:
# let's run a simple analysis at first
gbm = xgb.XGBClassifier(objective='binary:logistic', seed=0)

# Note that we are using dev_data here (not the held out test data)
score_classifier(gbm, train_new, train_labels_new, dev_data, dev_labels)

print("First baseline for the test data:")
score_classifier(gbm, train_new, train_labels_new, test_data, test_labels)

Train Accuracy: 0.8735, Train AUC: 0.8735 
Test Accuracy: 0.7800, Test AUC: 0.7881

First baseline for the test data:
Train Accuracy: 0.8735, Train AUC: 0.8735 
Test Accuracy: 0.7820, Test AUC: 0.7822



That is not too shabby on the dev_data (the AUC is already ~0.28 points better than bag of words and coin toss). I think we can do a better job by tuning some parameters with GridSearchCV.

In [238]:
# Tuning round one
params_ = {
    'n_estimators': [100, 250, 500],
    'learning_rate': [0.05, 0.07, 0.10],
    'max_depth': np.arange(5, 35, 10), 
}

gbm = xgb.XGBClassifier(objective='binary:logistic', seed=0)
gsc = GridSearchCV(gbm, params_, cv=10, verbose=1, scoring='roc_auc', n_jobs=-1)
gsc.fit(train_new, train_labels_new)

print ("Train Best AUC: %.4f" % (gsc.best_score_))
print ("Best Params: %s" % (gsc.best_params_))
score_classifier(gsc.best_estimator_, train_new, train_labels_new, dev_data, dev_labels)

Fitting 10 folds for each of 27 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   24.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:  3.3min finished


Train Best AUC: 0.9371
Best Params: {'n_estimators': 500, 'learning_rate': 0.05, 'max_depth': 15}
Train Accuracy: 1.0000, Train AUC: 1.0000 
Test Accuracy: 0.7720, Test AUC: 0.8032



That's not too goot but may be we can do better.

In [239]:
params_ = {
    'min_child_weight':np.arange(2, 10, 4), 
    'gamma': np.arange(0, 1, 0.25), 
}

gbm = xgb.XGBClassifier(n_estimators=500, learning_rate=0.07, max_depth=15, subsample=0.9, 
                        scale_pos_weight = 0.75, reg_alpha = 0.03, colsample_bytree=0.4, 
                        objective='binary:logistic', seed=0)

gsc = GridSearchCV(gbm, params_, cv=10, verbose=1, scoring='roc_auc', n_jobs=-1)
gsc.fit(train_new, train_labels_new)

print ("Train Best AUC: %.4f" % (gsc.best_score_))
print ("Best Params: %s" % (gsc.best_params_))
score_classifier(gsc.best_estimator_, train_new, train_labels_new, dev_data, dev_labels)

Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   29.1s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   53.3s finished


Train Best AUC: 0.9372
Best Params: {'min_child_weight': 2, 'gamma': 0.25}
Train Accuracy: 0.9987, Train AUC: 0.9987 
Test Accuracy: 0.7680, Test AUC: 0.8005



## Stacking xgboost

Train Accuracy: 0.9480, Train AUC: 0.9480 
Test Accuracy: 0.7920, Test AUC: 0.8090

In [240]:
from copy import deepcopy
def stack_clf(clf, train, train_labels, test):
    
    train_one, train_one_labels = train[train.shape[0]/2:,], train_labels[train_labels.shape[0]/2:,]
    train_two, train_two_labels = train[:train.shape[0]/2,], train_labels[:train_labels.shape[0]/2,]
    
    clf_one = deepcopy(clf)
    clf_one.fit(train_one, train_one_labels)
    
    preds_two = clf_one.predict_proba(train_two)
    preds_test_one = gbm_one.predict_proba(test_data)
    
    clf_two = deepcopy(clf)
    clf_two.fit(train_two, train_two_labels)

    preds_one = clf_one.predict_proba(train_one)
    preds_test_two = clf_one.predict_proba(test_data)
    
    # meta_train
    train_stack = np.concatenate((preds_one, preds_two), axis=0)
    # meta_dev
    test_stack = 0.5*(preds_test_one + preds_test_two)
    
    return np.column_stack((train, train_stack)), np.column_stack((test, test_stack))

In [241]:
gbm = xgb.XGBClassifier(n_estimators=500, learning_rate=0.07, max_depth=15, subsample=0.9, 
                        scale_pos_weight=0.75, reg_alpha=0.03, colsample_bytree=0.4, 
                        min_child_weight=2, gamma=0.5, 
                        objective='binary:logistic', seed=0)

train_stack, dev_stack = stack_clf(gbm, train_new, train_labels_new, dev_data)

params_ = {
    'min_child_weight':np.arange(2, 10, 4), 
    'gamma': np.arange(0, 1, 0.25), 
}

gbm = xgb.XGBClassifier(n_estimators=500, learning_rate=0.07, max_depth=15, subsample=0.9, 
                        scale_pos_weight = 0.75, reg_alpha = 0.03, colsample_bytree=0.4, 
                        objective='binary:logistic', seed=0)

gsc = GridSearchCV(gbm, params_, cv=10, verbose=1, scoring='roc_auc', n_jobs=-1)
gsc.fit(train_stack, train_labels_new)

print ("Train Best AUC: %.4f" % (gsc.best_score_))
print ("Best Params: %s" % (gsc.best_params_))
score_classifier(gsc.best_estimator_, train_stack, train_labels_new, dev_stack, dev_labels)

Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   37.4s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  1.1min finished


Train Best AUC: 0.9306
Best Params: {'min_child_weight': 2, 'gamma': 0.5}
Train Accuracy: 0.9996, Train AUC: 0.9996 
Test Accuracy: 0.7760, Test AUC: 0.7905



OK, well that made things worse as before. Let's try one last thing.

## BaggingClassifier with xgboost

In [242]:
from sklearn.ensemble import BaggingClassifier
bc = BaggingClassifier(base_estimator=xgb.XGBClassifier(n_estimators=500, learning_rate=0.07, max_depth=15, subsample=0.9, 
                        scale_pos_weight = 0.75, reg_alpha = 0.03, colsample_bytree=0.4, 
                        objective='binary:logistic', seed=0),
                           n_estimators=10,
                           bootstrap=True,
                           oob_score=True,
                           random_state=0)
score_classifier(bc, train_stack, train_labels_new, dev_stack, dev_labels)

Train Accuracy: 0.9905, Train AUC: 0.9905 
Test Accuracy: 0.7800, Test AUC: 0.7932



In [243]:
bc = BaggingClassifier(base_estimator=xgb.XGBClassifier(n_estimators=500, learning_rate=0.07, max_depth=15, subsample=0.9, 
                        scale_pos_weight = 0.75, reg_alpha = 0.03, colsample_bytree=0.4, 
                        objective='binary:logistic', seed=0),
                           n_estimators=10,
                           bootstrap=True,
                           oob_score=True,
                           random_state=0)
score_classifier(bc, train_new, train_labels_new, dev_data, dev_labels)

Train Accuracy: 0.9879, Train AUC: 0.9879 
Test Accuracy: 0.7820, Test AUC: 0.7997



## Results for the test data

We can now get a score for the test data that we've held out. Let's quickly rehash the plan:
    - Stack the training and test data
    - Use BaggingClassifier with XGBClassifier
    
Let's get to it then.

In [244]:
bc = BaggingClassifier(base_estimator=xgb.XGBClassifier(n_estimators=500, learning_rate=0.07, max_depth=15, subsample=0.9, 
                        scale_pos_weight = 0.75, reg_alpha = 0.03, colsample_bytree=0.4, 
                        objective='binary:logistic', seed=0),
                           n_estimators=10,
                           bootstrap=True,
                           oob_score=True,
                           random_state=0)
train_stack, test_stack = stack_clf(bc, train_new, train_labels_new, test_data)

score_classifier(bc, train_stack, train_labels_new, test_stack, test_labels)

Train Accuracy: 0.9909, Train AUC: 0.9909 
Test Accuracy: 0.7600, Test AUC: 0.7771



# Preparing a Kaggle submission


In [300]:
import json
from pandas import read_json

In [365]:
# Note that the Kaggle test set is missing a lot of features from before
kfeatures = [10, 12, 14, 16, 18, 20, 22]
ktrain = pd.read_json('C:/users/sp4/train.json')
ktrain, ktrain_labels = ktrain[feature_names[kfeatures]], ktrain['requester_received_pizza'] 
ktrain, ktrain_labels = np.asarray(ktrain), np.asarray(ktrain_labels)

ktest = pd.read_json('C:/users/sp4/test.json')
ktest_sub = np.asarray(ktest[feature_names[kfeatures]])

In [359]:
bc = BaggingClassifier(base_estimator=xgb.XGBClassifier(n_estimators=500, learning_rate=0.07, max_depth=15, subsample=0.9, 
                        scale_pos_weight = 0.75, reg_alpha = 0.03, colsample_bytree=0.4, 
                        objective='binary:logistic', seed=0),
                           n_estimators=10,
                           bootstrap=True,
                           oob_score=True,
                           random_state=0)
bc.fit(ktrain, ktrain_labels)

BaggingClassifier(base_estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.4,
       gamma=0, learning_rate=0.07, max_delta_step=0, max_depth=15,
       min_child_weight=1, missing=None, n_estimators=500, nthread=-1,
       objective='binary:logistic', reg_alpha=0.03, reg_lambda=1,
       scale_pos_weight=0.75, seed=0, silent=True, subsample=0.9),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=True,
         random_state=0, verbose=0, warm_start=False)

In [378]:
results = np.column_stack((ktest['request_id'], bc.predict(ktest_sub)))

In [380]:
np.savetxt("kaggle_raop_sub.csv", results, delimiter=',', fmt='%s', 
           header='request_id,requester_received_pizza')