# Random Acts of Pizza

The objective of this project is to predict whether a requester on Reddit will receive a pizza as an act of altruism from another Reddit user.
The train and test data files are available on kaggle.com and were downloaded to the location below before executing this notebook.

In this notebook we look at only the non-text fields and build a classification model.

In [2]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from datetime import datetime
import re
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from scipy.optimize import minimize
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

We are going to add year, month, day and dayofweek fields and drop all text fields as well as user-specific fields such as giver_username_if_known and requester_username.

In [3]:
data = pd.read_json("/Users/gautamkarnataki/MIDS/train.json")
data['year']=data.apply(lambda x: datetime.utcfromtimestamp(x['unix_timestamp_of_request_utc']).strftime('%Y'), axis=1).astype(int)
data['month']=data.apply(lambda x: datetime.utcfromtimestamp(x['unix_timestamp_of_request_utc']).strftime('%m'), axis=1).astype(int)
data['day']=data.apply(lambda x: datetime.utcfromtimestamp(x['unix_timestamp_of_request_utc']).strftime('%d'), axis=1).astype(int)
data['dayofweek']=data.apply(lambda x: datetime.utcfromtimestamp(x['unix_timestamp_of_request_utc']).weekday(), axis=1).astype(int)
# Drop identifiers, raw timestamps, free-text columns and user-specific columns in one call.
data=data.drop(['unix_timestamp_of_request',
                'request_id',
                'unix_timestamp_of_request_utc',
                'request_text',
                'request_text_edit_aware',
                'requester_subreddits_at_request',
                'requester_username',
                'giver_username_if_known',
                'requester_user_flair'], axis=1)
# request_title is kept for now.
#data=data.drop(['request_title'], axis=1)

data['request_title']

0                 Request Colorado Springs Help Us Please
1       [Request] California, No cash and I could use ...
2       [Request] Hungry couple in Dundee, Scotland wo...
3       [Request] In Canada (Ontario), just got home f...
4       [Request] Old friend coming to visit. Would LO...
5       [REQUEST] I'll give a two week xbox live code ...
6       [Request] Help me give back to my roomies on F...
7       random acts of pizza, i have a request, if not...
8       [Request] Queensland Australia, Recently moved...
9               [REQUEST]We're in need of some om noms...
10      [REQUEST] Bummed out in Chicago. Too broke to ...
11                   [Request] Would love a pizza tonight
12      [REQUEST] Georgia, USA Please help me family o...
13                             [Request]  Broke in ATL.  
14               [Request] Make my bro in law a believer!
15      [Request] I am not a pothead nor a beggar, but...
16      [request] Cookeville, TN. My dog recently died...
17      [Reque
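
The date fields above are built with a row-wise apply. As a minimal sketch of an alternative, assuming the timestamp column holds seconds since the epoch in UTC, the same four fields can be derived in a vectorized way with pandas. The helper name add_date_parts and the toy timestamps are illustrative only and do not come from train.json.

```python
import pandas as pd

def add_date_parts(df, ts_col="unix_timestamp_of_request_utc"):
    """Return a copy of df with year, month, day and dayofweek columns added."""
    out = df.copy()
    # Interpret the column as seconds since the Unix epoch (UTC).
    ts = pd.to_datetime(out[ts_col], unit="s")
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    # Monday=0 ... Sunday=6, the same convention as datetime.weekday().
    out["dayofweek"] = ts.dt.dayofweek
    return out

# Toy frame with made-up timestamps (hypothetical data).
toy = pd.DataFrame({"unix_timestamp_of_request_utc": [0, 31536000]})
toy_dates = add_date_parts(toy)
print(toy_dates)
```

This avoids formatting each date to a string and casting it back to an integer.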

Let's cast the non-integer fields to integers. These fields represent days and having days as integers is sufficient for our analysis.

In [4]:
data.requester_account_age_in_days_at_request=data.requester_account_age_in_days_at_request.astype(int)
data.requester_account_age_in_days_at_retrieval=data.requester_account_age_in_days_at_retrieval.astype(int)
data.requester_days_since_first_post_on_raop_at_retrieval=data.requester_days_since_first_post_on_raop_at_retrieval.astype(int)
data.requester_days_since_first_post_on_raop_at_request=data.requester_days_since_first_post_on_raop_at_request.astype(int)

## Classification using Non Text Fields

In [5]:
predictions = []

In [6]:
y = data["requester_received_pizza"]
data=data.drop(['requester_received_pizza'], axis=1)
X = data

X.dtypes

number_of_downvotes_of_request_at_retrieval              int64
number_of_upvotes_of_request_at_retrieval                int64
post_was_edited                                          int64
request_number_of_comments_at_retrieval                  int64
request_title                                           object
requester_account_age_in_days_at_request                 int64
requester_account_age_in_days_at_retrieval               int64
requester_days_since_first_post_on_raop_at_request       int64
requester_days_since_first_post_on_raop_at_retrieval     int64
requester_number_of_comments_at_request                  int64
requester_number_of_comments_at_retrieval                int64
requester_number_of_comments_in_raop_at_request          int64
requester_number_of_comments_in_raop_at_retrieval        int64
requester_number_of_posts_at_request                     int64
requester_number_of_posts_at_retrieval                   int64
requester_number_of_posts_on_raop_at_request           

In [9]:
sss = StratifiedShuffleSplit(n_splits=5, random_state=1234)
for train_index, dev_index in sss.split(X,y):
    break

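# request_title is still a string column, so keep only the numeric columns.
X_num = X.select_dtypes(include=[np.number])
train_data,dev_data = X_num.values[train_index],X_num.values[dev_index]
train_labels,dev_labels = y.values[train_index],y.values[dev_index]

clf = LogisticRegression()
# Fit on the training split only so that the dev set stays unseen.
clf.fit(train_data, train_labels)
y_pred = clf.predict(dev_data)
y_proba = clf.predict_proba(dev_data)
print('Accuracy of classifier on dev set: {:.2f}'.format(clf.score(dev_data, dev_labels)))
# Log loss is computed on the predicted probabilities rather than on the hard labels.
print('LogLoss : {score}'.format(score=log_loss(dev_labels, y_proba)))
print(classification_report(dev_labels, y_pred))
predictions.append(y_proba)

Without excluding request_title, fitting fails with:

ValueError: could not convert string to float: [Request] USA WA. Unexpected bill, couldn't go grocery shopping this week.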

## Classification using only text fields (request text and request title)

In [10]:
# Use basic pre-processing techniques
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

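def text_preprocessor(s):
    # Lowercase, then replace "request" at the start of a word and common punctuation with spaces.
    message = s.lower()
    message = re.sub(r"\brequest|\[|\]|\(|\)|\$|\!|\/|\.|\*|\+|\&|\=|\%|\:|\?|\"|\,|\;|\@|\_|\\|\}|\{|\||\~", " ", message)
    # Replace digits with spaces.
    message = re.sub(r"[0-9]", " ", message)
    # Delete hyphens, which joins the parts of hyphenated words.
    message = re.sub(r"-+", "", message)
    # Keep only words longer than 3 characters, truncated to 20 characters.
    message = ' '.join([word[0:20] for word in message.split() if len(word)>3])
    return message

def stemming_tokenizer(str_input):
    # Split on whitespace and reduce each word to its Porter stem.
    words = str_input.split()
    words = [porter_stemmer.stem(word) for word in words]
    return words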
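
To see what this kind of cleaning does to a title, here is a simplified sketch that needs no NLTK. It is not the same function as text_preprocessor: the name clean_text is hypothetical, and it replaces every non-letter character with a space instead of listing punctuation marks one by one, so hyphens and apostrophes split words here.

```python
import re

def clean_text(s):
    """Simplified cleaning: lowercase, drop 'request', keep letters, filter short words."""
    s = s.lower()
    # Remove "request" when it starts a word (e.g. the "[Request]" tag).
    s = re.sub(r"\brequest", " ", s)
    # Replace anything that is not a lowercase letter or whitespace with a space.
    s = re.sub(r"[^a-z\s]", " ", s)
    # Keep words longer than 3 characters, truncated to 20 characters.
    return " ".join(w[:20] for w in s.split() if len(w) > 3)

cleaned_title = clean_text("[Request] Hungry couple in Dundee, Scotland!")
print(cleaned_title)
```

Short words such as "in" are dropped by the length filter, which acts as a rough stop-word removal before the vectorizer applies its own stop-word list.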

In [11]:
data = pd.read_json("/Users/gautamkarnataki/MIDS/train.json")
text = data.loc[data["requester_received_pizza"]==True,["request_text"]]
text.iloc[1]

request_text    Austin, Texas\n\nMy two roommates and I are hu...
Name: 9, dtype: object

In [81]:
phrase = ""

# Percentage of successful requests whose text contains the phrase.
text = data.loc[data["requester_received_pizza"]==True,["request_text","requester_received_pizza"]]
print(text[text["request_text"].str.contains(phrase)==True].count()/len(text) * 100.0)

# Percentage of unsuccessful requests whose text contains the phrase.
text = data.loc[data["requester_received_pizza"]==False,["request_text","requester_received_pizza"]]
print(text[text["request_text"].str.contains(phrase)==True].count()/len(text) * 100.0)


request_text                14.285714
requester_received_pizza    14.285714
dtype: float64
request_text                12.212738
requester_received_pizza    12.212738
dtype: float64
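
The comparison above filters the frame once per outcome. A minimal sketch of the same idea with a single groupby is shown below; the helper name phrase_rate_by_outcome, the phrase "job" and the toy rows are assumptions made for illustration, not values from the notebook.

```python
import pandas as pd

def phrase_rate_by_outcome(df, phrase,
                           text_col="request_text",
                           label_col="requester_received_pizza"):
    """Percentage of requests containing `phrase`, grouped by outcome."""
    # Literal, case-insensitive substring match.
    contains = df[text_col].str.contains(phrase, case=False, regex=False)
    # The mean of a boolean Series is the fraction of True values.
    return contains.groupby(df[label_col]).mean() * 100.0

# Toy rows (hypothetical data).
toy = pd.DataFrame({
    "request_text": ["lost my job", "new job soon", "job hunting", "hungry"],
    "requester_received_pizza": [True, True, False, False],
})
rates = phrase_rate_by_outcome(toy, "job")
print(rates)
```

Returning one Series indexed by outcome makes it easy to loop over several candidate phrases and compare them side by side.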


In [33]:
data = pd.read_json("/Users/gautamkarnataki/MIDS/train.json")

X = data["request_text"]
y = data["requester_received_pizza"]
sss = StratifiedShuffleSplit(n_splits=5, random_state=1000)
for train_index, dev_index in sss.split(X,y):
    break

train_data,dev_data = X[train_index],X[dev_index]
train_labels,dev_labels = y[train_index],y[dev_index]
cv = CountVectorizer(stop_words='english',
                         preprocessor=text_preprocessor,
                         lowercase=True,
                         tokenizer=stemming_tokenizer,
                         min_df=10, 
                         max_df=0.2, 
                         ngram_range=(1,1))
transformer = cv.fit_transform(train_data)
logreg_pizza_text = LogisticRegression(C=0.5)
logreg_pizza_text.fit(transformer,train_labels)
dev_data_trans = cv.transform(dev_data)
y_pred = logreg_pizza_text.predict(dev_data_trans)
print ("Accuracy (on dev set): %.4f" % metrics.accuracy_score(y_true=dev_labels, y_pred=y_pred))
print (metrics.classification_report(y_true=dev_labels, y_pred=y_pred))
print('LogLoss {score}'.format(score=log_loss(dev_labels, y_pred)))
predictions.append(logreg_pizza_text.predict_proba(dev_data_trans))

coeff = logreg_pizza_text.coef_
print(coeff.shape)
ind_features = []
feat = cv.get_feature_names()
for i in range(0,1):
    idx = (-coeff[i]).argsort()[:800]
    for ind in idx:
        ind_features.append(feat[ind])

cv = CountVectorizer(stop_words='english',
                         preprocessor=text_preprocessor,
                         lowercase=True,
                         tokenizer=stemming_tokenizer,
                         vocabulary=ind_features,
                         min_df=10, 
                         max_df=0.2, 
                         ngram_range=(1,1))
transformer = cv.fit_transform(train_data)
logreg_pizza_text = LogisticRegression(C=0.5)
logreg_pizza_text.fit(transformer,train_labels)
dev_data_trans = cv.transform(dev_data)
y_pred = logreg_pizza_text.predict(dev_data_trans)
print ("Accuracy (on dev set): %.4f" % metrics.accuracy_score(y_true=dev_labels, y_pred=y_pred))
print (metrics.classification_report(y_true=dev_labels, y_pred=y_pred))
print('LogLoss {score}'.format(score=log_loss(dev_labels, y_pred)))
#predictions.append(logreg_pizza_text.predict_proba(dev_data_trans))

X = data["request_title"]
y = data["requester_received_pizza"]
sss = StratifiedShuffleSplit(n_splits=5, random_state=1000)
for train_index, dev_index in sss.split(X,y):
    break
    
train_data,dev_data = X[train_index],X[dev_index]
train_labels,dev_labels = y[train_index],y[dev_index]
cv = CountVectorizer(stop_words='english',
                         preprocessor=text_preprocessor,
                         lowercase=True,
                         tokenizer=stemming_tokenizer,
                         min_df=10, 
                         max_df=0.2, 
                         ngram_range=(1,1))
transformer = cv.fit_transform(train_data)
logreg_pizza_title = LogisticRegression(C=1.2)
#logreg_pizza_title = RandomForestClassifier(random_state=0)
logreg_pizza_title.fit(transformer,train_labels)
dev_data_trans = cv.transform(dev_data)
y_pred = logreg_pizza_title.predict(dev_data_trans)
print ("Accuracy (on dev set): %.4f" % metrics.accuracy_score(y_true=dev_labels, y_pred=y_pred))
print (metrics.classification_report(y_true=dev_labels, y_pred=y_pred))
print('LogLoss {score}'.format(score=log_loss(dev_labels, y_pred)))
predictions.append(logreg_pizza_title.predict_proba(dev_data_trans))

Accuracy (on dev set): 0.7351
              precision    recall  f1-score   support

       False       0.78      0.90      0.84       305
        True       0.43      0.23      0.30        99

   micro avg       0.74      0.74      0.74       404
   macro avg       0.60      0.57      0.57       404
weighted avg       0.70      0.74      0.71       404

LogLoss 9.14770757865
(1, 1288)
Accuracy (on dev set): 0.7450
              precision    recall  f1-score   support

       False       0.77      0.94      0.85       305
        True       0.44      0.14      0.21        99

   micro avg       0.75      0.75      0.75       404
   macro avg       0.60      0.54      0.53       404
weighted avg       0.69      0.75      0.69       404

LogLoss 8.80571376591
Accuracy (on dev set): 0.7327
              precision    recall  f1-score   support

       False       0.75      0.96      0.84       305
        True       0.24      0.04      0.07        99

   micro avg       0.73      0.73     
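
The LogLoss values printed above are computed from the hard True/False output of predict. Log loss is normally evaluated on predicted probabilities, and every wrong hard label is penalized as a fully confident mistake, which is why the values are so large. The sketch below contrasts the two on made-up numbers; compare_log_loss is a hypothetical helper, and in the notebook the probabilities would presumably come from something like logreg_pizza_text.predict_proba(dev_data_trans)[:, 1].

```python
import numpy as np
from sklearn.metrics import log_loss

def compare_log_loss(y_true, proba_pos, threshold=0.5):
    """Return (log loss on probabilities, log loss on thresholded labels)."""
    proba_pos = np.asarray(proba_pos, dtype=float)
    # Thresholding turns every prediction into a 0.0 or 1.0 "probability".
    hard = (proba_pos > threshold).astype(float)
    return log_loss(y_true, proba_pos), log_loss(y_true, hard)

# Toy labels and positive-class probabilities (hypothetical data).
y_true = [0, 0, 1, 1]
proba_pos = [0.1, 0.6, 0.4, 0.9]
soft_loss, hard_loss = compare_log_loss(y_true, proba_pos)
print(soft_loss, hard_loss)
```

With two of the four toy predictions on the wrong side of the threshold, the hard-label loss is an order of magnitude larger than the probability-based loss.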

In [34]:
def log_loss_func(weights):
    ''' scipy minimize will pass the weights as a numpy array '''
    final_prediction = 0
    for weight, prediction in zip(weights, predictions):
            final_prediction += weight*prediction

    return log_loss(dev_labels, final_prediction)
    
#the algorithm needs a starting value; here we use 0.3333 for every weight
#it's better to choose many random starting points and run minimize a few times
starting_values = [0.3333]*len(predictions)
cons = ({'type':'eq','fun':lambda w: 1-sum(w)})

#our weights are bound between 0 and 1
bounds = [(0,1)]*len(predictions)
res = minimize(log_loss_func, starting_values, method='SLSQP', bounds=bounds, constraints=cons)

print('Ensemble Score: {best_score}'.format(best_score=res['fun']))
print('Best Weights: {weights}'.format(weights=res['x']))

weights=res['x']
# Note: only the first two models' probabilities are combined here, even though a weight is fitted for every entry in predictions.
y_pred=[weights[0]*predictions[0][k][1]+weights[1]*predictions[1][k][1] for k in range(len(dev_data))]
y_pred=[True if k > 0.5 else False for k in y_pred]
print (metrics.classification_report(y_true=dev_labels, y_pred=y_pred))
print('LogLoss : {score}'.format(score=log_loss(dev_labels, y_pred)))

Ensemble Score: 0.547287548424
Best Weights: [0.14580708 0.28254264 0.57165029]
              precision    recall  f1-score   support

       False       0.75      1.00      0.86       305
        True       0.00      0.00      0.00        99

   micro avg       0.75      0.75      0.75       404
   macro avg       0.38      0.50      0.43       404
weighted avg       0.57      0.75      0.65       404

LogLoss : 8.46371005717
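
The final prediction above combines only the first two probability arrays, while three weights were fitted. A minimal sketch of a blend that uses every model is given below, assuming each entry is an array of shape (n_samples, n_classes) as returned by predict_proba. The helper name blend_probabilities and the toy probabilities are illustrative.

```python
import numpy as np

def blend_probabilities(prob_list, weights):
    """Weighted sum of per-model class-probability arrays."""
    probs = np.stack(prob_list)  # shape: (n_models, n_samples, n_classes)
    w = np.asarray(weights, dtype=float)
    # Contract the model axis: result has shape (n_samples, n_classes).
    return np.tensordot(w, probs, axes=1)

# Toy probabilities from three hypothetical models for two samples.
p1 = np.array([[0.8, 0.2], [0.3, 0.7]])
p2 = np.array([[0.6, 0.4], [0.5, 0.5]])
p3 = np.array([[0.9, 0.1], [0.2, 0.8]])
blended = blend_probabilities([p1, p2, p3], [0.2, 0.3, 0.5])
# Threshold the positive-class column to obtain hard labels.
blended_labels = blended[:, 1] > 0.5
print(blended, blended_labels)
```

Because the weights sum to 1, each blended row is still a valid probability distribution and can be passed to log_loss directly.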


## Tuning Hyperparameters

In [12]:
def find_optimal(classifier, parameters, train_data, train_labels, param_label):
    # Score each candidate with the weighted F1, the metric used to compare models here.
    clf = GridSearchCV(classifier, parameters, scoring='f1_weighted')
    clf.fit(train_data, train_labels)
    print("\nBest value of {0} = {1} [Mean F1-score = {2}]".format(param_label, clf.best_params_, clf.best_score_))

In [13]:
X = data["request_text"]
y = data["requester_received_pizza"]
sss = StratifiedShuffleSplit(n_splits=5, random_state=1000)
for train_index, dev_index in sss.split(X,y):
    break

train_data,dev_data = X[train_index],X[dev_index]
train_labels,dev_labels = y[train_index],y[dev_index]
cv = CountVectorizer(stop_words='english',
                         preprocessor=text_preprocessor,
                         lowercase=True,
                         tokenizer=stemming_tokenizer,
                         min_df=10, 
                         max_df=0.2, 
                         ngram_range=(1,1))
transformer = cv.fit_transform(train_data)
logreg_pizza_text = LogisticRegression(C=1.0, class_weight='balanced')
#logreg_pizza_text = RandomForestClassifier(random_state=0)
logreg_pizza_text.fit(transformer,train_labels)
dev_data_trans = cv.transform(dev_data)
y_pred = logreg_pizza_text.predict(dev_data_trans)

parameters = {'C': [0.001,0.01,0.1,0.5,0.75,1.0,1.1,1.2]}
find_optimal(LogisticRegression(),parameters,transformer,train_labels,'C')

Best value of C = {'C': 0.5} [Mean F1-score = 0.688603082912]


In [14]:
X = data["request_title"]
y = data["requester_received_pizza"]
sss = StratifiedShuffleSplit(n_splits=5, random_state=1000)
for train_index, dev_index in sss.split(X,y):
    break
    
train_data,dev_data = X[train_index],X[dev_index]
train_labels,dev_labels = y[train_index],y[dev_index]
cv = CountVectorizer(stop_words='english',
                         preprocessor=text_preprocessor,
                         lowercase=True,
                         tokenizer=stemming_tokenizer,
                         min_df=10, 
                         max_df=0.2, 
                         ngram_range=(1,1))
transformer = cv.fit_transform(train_data)
logreg_pizza_title = LogisticRegression()
#logreg_pizza_title = RandomForestClassifier(random_state=0)
logreg_pizza_title.fit(transformer,train_labels)
dev_data_trans = cv.transform(dev_data)
y_pred = logreg_pizza_title.predict(dev_data_trans)

parameters = {'C': [0.001,0.01,0.1,0.5,0.75,1.0,1.1,1.2]}
find_optimal(LogisticRegression(),parameters,transformer,train_labels,'C')


Best value of C = {'C': 1.2} [Mean F1-score = 0.663180592487]
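
In the cells above, the vectorizer is fit once on the whole training split before GridSearchCV runs, so each cross-validation fold uses a vocabulary built partly from its own held-out rows. One possible way to avoid that, sketched here under the assumption that default CountVectorizer settings are acceptable, is to put the vectorizer and the classifier in a Pipeline and search over the classifier's C. The helper name tune_c and the toy corpus are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def tune_c(texts, labels, c_values, cv=3):
    """Grid-search C with the vectorizer refit inside every CV fold."""
    pipe = Pipeline([
        ("vec", CountVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    search = GridSearchCV(pipe, {"clf__C": list(c_values)},
                          scoring="f1_weighted", cv=cv)
    search.fit(texts, labels)
    return search.best_params_["clf__C"], search.best_score_

# Toy corpus (hypothetical data): six examples per class.
toy_texts = [
    "lost my job and rent is due", "student with no money this week",
    "kids are hungry tonight", "paycheck comes next friday",
    "unexpected bill wiped me out", "will pay it forward next month",
    "just want a pizza", "bored and craving pizza",
    "pizza would be cool", "anyone feel like sending pizza",
    "pizza for my party", "pizza because why not",
]
toy_labels = [True] * 6 + [False] * 6
c_grid = [0.1, 0.5, 1.0]
best_c, best_score = tune_c(toy_texts, toy_labels, c_grid)
print(best_c, best_score)
```

The toy corpus is far too small for the selected value to mean anything; the point is only the structure of the search.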


## Try with an even split of positive and negative cases

In [15]:
data = pd.read_json("/Users/gautamkarnataki/MIDS/train.json")
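# Take up to 1000 positive and 1000 negative cases, then shuffle the rows.
positives = data[data["requester_received_pizza"]==True].iloc[0:1000]
negatives = data[data["requester_received_pizza"]==False].iloc[0:1000]
all_data = pd.concat([positives, negatives])
data = all_data.sample(frac=1)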

In [16]:
train_data, dev_data = train_test_split(data, test_size=0.2)
train_labels = train_data["requester_received_pizza"]
dev_labels = dev_data["requester_received_pizza"]
train_data = train_data['request_text']
dev_data = dev_data['request_text']

In [17]:
cv = CountVectorizer(stop_words='english',
                         preprocessor=text_preprocessor,
                         lowercase=True,
                         tokenizer=stemming_tokenizer,
                         min_df=5, 
                         max_df=0.2, 
                         ngram_range=(1,1))
transformer = cv.fit_transform(train_data)
logreg_pizza_title = LogisticRegression(C=0.5)
logreg_pizza_title.fit(transformer,train_labels)
dev_data_trans = cv.transform(dev_data)
y_pred = logreg_pizza_title.predict(dev_data_trans)
print ("Accuracy (on dev set): %.4f" % metrics.accuracy_score(y_true=dev_labels, y_pred=y_pred))
print (metrics.classification_report(y_true=dev_labels, y_pred=y_pred))
print('LogLoss {score}'.format(score=log_loss(dev_labels, y_pred)))

Accuracy (on dev set): 0.5915
              precision    recall  f1-score   support

       False       0.60      0.60      0.60       204
        True       0.58      0.58      0.58       195

   micro avg       0.59      0.59      0.59       399
   macro avg       0.59      0.59      0.59       399
weighted avg       0.59      0.59      0.59       399

LogLoss 14.1099902741
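
The subset above takes the first rows of each class, and the support counts in the report (204 vs 195) show that the split is not exactly even. As a sketch of an alternative, the majority class can be randomly downsampled to the size of the minority class. The helper name balance_classes, the fixed random_state and the toy frame are assumptions for illustration.

```python
import pandas as pd

def balance_classes(df, label_col, random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    n = df[label_col].value_counts().min()
    parts = [group.sample(n=n, random_state=random_state)
             for _, group in df.groupby(label_col)]
    # Shuffle the combined rows so the classes are interleaved.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)

# Toy frame (hypothetical data): 3 positive and 7 negative rows.
toy = pd.DataFrame({
    "request_text": ["text %d" % i for i in range(10)],
    "requester_received_pizza": [True] * 3 + [False] * 7,
})
balanced = balance_classes(toy, "requester_received_pizza")
print(balanced["requester_received_pizza"].value_counts())
```

Fixing random_state keeps the sample reproducible between runs, which the unseeded sample(frac=1) and train_test_split calls above do not guarantee.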
