**Pizza Project**

Andrew Mamroth,
Colby Carter,
Matt Adereth,
Rob Deng


Kaggle: https://www.kaggle.com/c/random-acts-of-pizza

Team Github: https://github.com/mamrotha/2017_Fall_207_KaggleProj



**Data fields:**

"**giver_username_if_known**": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).

"number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.

"number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.

"post_was_edited": Boolean indicating whether this post was edited (from Reddit).

"request_id": Identifier of the post on Reddit, e.g. "t3_w5491".

"request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.

"request_text": Full text of the request.

"request_text_edit_aware": Edit aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request such as "EDIT: Thanks /u/foo, the pizza was delicous".

"request_title": Title of the request.

"requester_account_age_in_days_at_request": Account age of requester in days at time of request.

"requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.

"requester_days_since_first_post_on_raop_at_request": Number of days between requesters first post on RAOP and this request (zero if requester has never posted before on RAOP).

"requester_days_since_first_post_on_raop_at_retrieval": Number of days between requesters first post on RAOP and time of retrieval.

"requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.

"requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.

"requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.

"requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.

"requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.

"requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.

"requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.

"requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.

"requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted in at the time of request.

"requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.

"requester_subreddits_at_request": The list of subreddits in which the author had already posted in at the time of request.

"requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.

"requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.

"**requester_user_flair**": Users on RAOP receive badges (Reddit calls them flairs) which is a small picture next to their username. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).

"requester_username": Reddit username of requester.

"unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP users are from the USA).

"unix_timestamp_of_request_utc": Unit timestamp of request in UTC.

In [3]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

import pandas as pd

# ADD METRICS
from sklearn import metrics
from sklearn.metrics import classification_report

#NLTK - NLP Tokenizing and Cleaning
import nltk
#nltk.download()
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [5]:
# Load raw data and create labels
raw_train = pd.read_json('./data/train.json')
raw_test = pd.read_json('./data/test.json')

# Summarize raw data
print(raw_train.shape)
print(list(raw_train.columns.values))
print(raw_test.shape)
print("Test columns:",list(raw_test.columns.values))
# no "retrieval" variables in test data

(4040, 32)
['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester_subreddits_at_request', 'requester_upvotes_minus_downvotes_at_request', 'requ

In [6]:
#TRANSFORMATIONS & NEW FEATURES

# split labels
train_labels = raw_train["requester_received_pizza"]
train_data = raw_train.drop(['post_was_edited','requester_received_pizza'], axis=1) #edits not available in test data
test_data = raw_test 
print("orignial train shape:",train_data.shape)
print("test shape:",test_data.shape)


#get length of post
train_data["post_length"] = train_data["request_text_edit_aware"].apply(len)
test_data["post_length"] = test_data["request_text_edit_aware"].apply(len)
#print(train_data["post_length"].head(6))


# Create key quadratic terms for the following numeric variables:
# NOTE: variables "at_retrieval" are NOT available in the test dataset

#remove numeric vars pulled "at_retrieval"
retrieval_cols = [col for col in train_data.columns.values if "retrieval" in col]
train_data = train_data.drop(retrieval_cols, axis=1)


# form quadratic terms to capture curved relationship with log-odds
for_quad = ["requester_account_age_in_days_at_request", "requester_days_since_first_post_on_raop_at_request",
            "requester_number_of_comments_at_request", "requester_number_of_comments_in_raop_at_request",
            "requester_number_of_posts_at_request", "requester_number_of_subreddits_at_request",
            "requester_upvotes_minus_downvotes_at_request", "post_length"]

for col in for_quad:
    train_data[col + "_2"] = train_data[col]**2
    test_data[col + "_2"] = test_data[col]**2
print("new shapes:", train_data.shape, test_data.shape)


#transform requester_user_flair into two dummy variables for logistic
#TEST DATA DOES NOT CONTAIN THIS VARIABLE
#flair = ["shroom","PIF"]
#for f in flair:
    #train_data[f] = train_data["requester_user_flair"].apply(lambda x: 1 if x==f else 0)
    #test_data[f] = test_data["requester_user_flair"].apply(lambda x: 1 if x==f else 0)

    
#flag givers (very few givers who are also requesters)
successes = raw_train[raw_train['requester_received_pizza']==True]
givers = successes['giver_username_if_known']
known_givers = list(givers[givers != 'N/A'])

train_data["requester_giver"] = train_data["requester_username"].apply(lambda x: 1 if x in known_givers else 0)
test_data["requester_giver"] = test_data["requester_username"].apply(lambda x: 1 if x in known_givers else 0)
print(train_data[train_data["requester_giver"] == 1].shape)


#Here we split the number variables from the string type variables
text_columns = ['giver_username_if_known','request_id','request_text','request_text_edit_aware','request_title',
              'requester_subreddits_at_request','requester_user_flair','requester_username']
#for logistic regression
num_columns = [i for i in train_data.columns.values if i not in text_columns]
print("Numeric features:",num_columns)
col_names = list(train_data.columns.values)


orignial train shape: (4040, 30)
test shape: (1631, 17)
new shapes: (4040, 28) (1631, 26)
(3, 29)
Numeric features: ['requester_account_age_in_days_at_request', 'requester_days_since_first_post_on_raop_at_request', 'requester_number_of_comments_at_request', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_posts_at_request', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_subreddits_at_request', 'requester_upvotes_minus_downvotes_at_request', 'requester_upvotes_plus_downvotes_at_request', 'unix_timestamp_of_request', 'unix_timestamp_of_request_utc', 'post_length', 'requester_account_age_in_days_at_request_2', 'requester_days_since_first_post_on_raop_at_request_2', 'requester_number_of_comments_at_request_2', 'requester_number_of_comments_in_raop_at_request_2', 'requester_number_of_posts_at_request_2', 'requester_number_of_subreddits_at_request_2', 'requester_upvotes_minus_downvotes_at_request_2', 'post_length_2', 'requester_giver']
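The feature steps above can also be wrapped in small helpers so that train and test data go through identical transformations. This is a sketch on a made-up two-row frame; the helper names `drop_retrieval_columns` and `add_squared_terms` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def drop_retrieval_columns(df):
    """Return a copy of df without any column whose name contains 'retrieval'."""
    keep = [c for c in df.columns if "retrieval" not in c]
    return df[keep].copy()

def add_squared_terms(df, columns):
    """Return a copy of df with a '<name>_2' squared column for each listed column."""
    out = df.copy()
    for col in columns:
        out[col + "_2"] = out[col] ** 2
    return out

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "requester_number_of_posts_at_request": [0, 3],
    "requester_number_of_posts_at_retrieval": [1, 5],
})
toy = add_squared_terms(drop_retrieval_columns(toy),
                        ["requester_number_of_posts_at_request"])
print(list(toy.columns))
```

Working on copies also avoids assigning into a slice of another DataFrame.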


In [7]:
# Create mini train and development set
dev_size = int(round(train_data.shape[0]*.15))

mini_train_data, mini_train_labels = train_data[dev_size:], train_labels[dev_size:]
print("Mini train size:",mini_train_data.shape, mini_train_labels.shape)

dev_data, dev_labels = train_data[:dev_size], train_labels[:dev_size]
print("Development set size:", dev_data.shape, dev_labels.shape)

Mini train size: (3434, 29) (3434,)
Development set size: (606, 29) (606,)
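The split above takes the first 15% of rows as the dev set. If the row order carries any pattern, a shuffled, stratified split may give a more representative dev set. The sketch below uses synthetic data and scikit-learn's `train_test_split`; the `random_state` value is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, one quarter positive labels.
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 50 + [0] * 150)

# stratify=y keeps the class ratio similar in both parts.
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

print(X_train.shape, X_dev.shape)
print(y_train.mean(), y_dev.mean())
```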


In [8]:
#Rob's Section

#Preprocess
def nltk_preprocess(data):
    stop = set(stopwords.words('english'))
    
    #Merge request text and title together, lower string
    #NOTE: the two fields are joined without a separator, so the last word of the text runs into the first word of the title
    data['title_and_request'] = data[['request_text_edit_aware', 'request_title']].apply(lambda x: ''.join(x), axis=1).str.lower()
    #replace digit runs, other non-word characters, and underscores with a single space
    data['title_and_request'] = data['title_and_request'].apply(lambda x: re.sub(r'\d+', r' ', x)).apply(lambda y: re.sub(r'\W+', r' ', y)).apply(lambda z: re.sub(r"_+",r" ",z))
    #NLTK Tokenize
    data['tokenized_requests'] = data['title_and_request'].apply(word_tokenize)
    #NLTK Remove Stop Words i.e. the, an, etc
    data['tokenized_requests'] = data['tokenized_requests'].apply(lambda x: [item for item in x if item not in stop])
    #Word count after stop-word removal; not used yet
    data['word_count'] = data['tokenized_requests'].apply(len)
    #Rejoin after str split
    data['tokenized_requests'] = data['tokenized_requests'].apply(lambda x: ' '.join(x))
    return data['tokenized_requests']


def classify(model, model_parameters = False, use_tfidf=False):
    """Takes a model class and fits it on the mini train set inside a
       CountVectorizer (optionally TF-IDF) pipeline with default settings.
       Prints a classification report on the dev data.
       model_parameters is accepted but not applied."""
    if use_tfidf:
        pipeliner = Pipeline([('cv', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('model', model())])
    else:
        pipeliner = Pipeline([("cv", CountVectorizer()), 
                              ("model", model())])

    #Make a simple prediction
    pipeliner.fit(processed_mini_train_data, mini_train_labels)
    pipeliner_pred = pipeliner.predict(processed_dev_data)
    print(model, "\n\n", "TFIDF = ", use_tfidf, "\n")
    #NOTE: predictions are passed in the y_true position, so the support column counts predicted labels
    #and the precision and recall columns are swapped relative to the (y_true, y_pred) convention
    print(classification_report(pipeliner_pred, dev_labels))
    return None


#process text entries
processed_mini_train_data = nltk_preprocess(mini_train_data)
processed_dev_data = nltk_preprocess(dev_data)
processed_test_data = nltk_preprocess(test_data)

classify(MultinomialNB, model_parameters = {"model__analyzer":"word", "model__ngram_range":(1,2)}, use_tfidf = True)
classify(MultinomialNB, model_parameters = {"model__analyzer":"word", "model__ngram_range":(1,2)}, use_tfidf = False)
classify(LogisticRegression, use_tfidf = True)
classify(LogisticRegression, use_tfidf = False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

<class 'sklearn.naive_bayes.MultinomialNB'> 

 TFIDF =  True 

             precision    recall  f1-score   support

      False       1.00      0.74      0.85       606
       True       0.00      0.00      0.00         0

avg / total       1.00      0.74      0.85       606

<class 'sklearn.naive_bayes.MultinomialNB'> 





 TFIDF =  False 

             precision    recall  f1-score   support

      False       0.96      0.75      0.84       577
       True       0.07      0.38      0.12        29

avg / total       0.92      0.73      0.81       606

<class 'sklearn.linear_model.logistic.LogisticRegression'> 

 TFIDF =  True 

             precision    recall  f1-score   support

      False       0.99      0.74      0.85       598
       True       0.03      0.50      0.05         8

avg / total       0.98      0.74      0.84       606

<class 'sklearn.linear_model.logistic.LogisticRegression'> 

 TFIDF =  False 

             precision    recall  f1-score   support

      False       0.86      0.75      0.80       518
       True       0.17      0.30      0.21        88

avg / total       0.76      0.68      0.72       606
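scikit-learn's metric functions take the true labels first and the predictions second. The calls above pass the predictions first, so the support column in these reports counts predicted labels, and the precision and recall columns trade places (F1 is unaffected). A small sketch on made-up labels shows the effect:

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels (made up): 3 actual positives, 2 predicted positives, 1 overlap.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Conventional order: (y_true, y_pred).
p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print("precision:", p, "recall:", r)

# Swapped order: the two metrics trade places.
p_swapped, r_swapped = precision_score(y_pred, y_true), recall_score(y_pred, y_true)
print("precision:", p_swapped, "recall:", r_swapped)
```

Read this way, the first report above (support 606 for False, 0 for True) describes a model that predicted False for every dev example.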



In [12]:
# Explore terms with the largest coefficients

def classify2(model, model_parameters = False, use_tfidf=False):
    """Fits a default LogisticRegression on count features of the preprocessed text.
       Prints the vocabulary size, the micro-averaged F1 and a classification report
       on the dev data, then the 30 terms with the largest coefficients.
       The arguments are accepted but not used."""
    vec = CountVectorizer()
    train_feats = vec.fit_transform(processed_mini_train_data)
    train_vocab = vec.get_feature_names()
    print(len(train_vocab))
    dev_feats = vec.transform(processed_dev_data)
    
    lr =  LogisticRegression()
    #lr =  LogisticRegression(penalty="l1")
    lr.fit(train_feats, mini_train_labels)
    lr_preds = lr.predict(dev_feats)
    print(metrics.f1_score(dev_labels, lr_preds, average='micro'))
    #NOTE: predictions are passed in the y_true position here as well
    print(classification_report(lr_preds, dev_labels))

    coefs = lr.coef_
    #indices of the 30 largest coefficients in ascending order (strongest term is printed last)
    max_coefs = np.argsort(coefs, axis=1)[:,-30:]
    for i in range(max_coefs.shape[1]):
        print(train_vocab[max_coefs[0][i]])


classify2(LogisticRegression, use_tfidf = False)

11509
0.681518151815
             precision    recall  f1-score   support

      False       0.86      0.75      0.80       518
       True       0.17      0.30      0.21        88

avg / total       0.76      0.68      0.72       606

expected
mentioned
projects
redding
relatives
state
denmark
christmas
gi
tucson
receive
drink
including
steam
lift
oatmeal
weather
ranch
loves
rather
cookout
eye
except
losing
topping
sunday
leg
zolo
hurting
father


In [13]:
#join numeric and vocabulary features

#use preprocessed vocab features
vec_full = CountVectorizer(stop_words = "english")
train_feats = vec_full.fit_transform(processed_mini_train_data)
train_vocab = vec_full.get_feature_names()
dev_feats = vec_full.transform(processed_dev_data)
test_feats = vec_full.transform(processed_test_data)

#make vocab arrays
train_vocab_ar = train_feats.toarray()
dev_vocab_ar = dev_feats.toarray()
test_vocab_ar = test_feats.toarray()
#new_array = np.array(train_feats)
print("Vocab train:", train_vocab_ar.shape)

#make numeric arrays
train_num_ar = mini_train_data[num_columns].values
dev_num_ar = dev_data[num_columns].values
test_num_ar = test_data[num_columns].values
print("Numeric train:",train_num_ar.shape)

#join arrays into final feature sets
combined_train_feats = np.concatenate((train_vocab_ar,train_num_ar), axis = 1)
combined_dev_feats = np.concatenate((dev_vocab_ar,dev_num_ar), axis = 1)
combined_test_feats = np.concatenate((test_vocab_ar,test_num_ar), axis = 1)
print("Combined:",combined_train_feats.shape)
print("Test:",combined_test_feats.shape)

Vocab train: (3434, 11336)
Numeric train: (3434, 21)
Combined: (3434, 11357)
Test: (1631, 11357)
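The cell above converts the sparse count matrices to dense arrays before joining them with the numeric columns. A possible alternative, sketched here on made-up data, is to keep everything sparse with `scipy.sparse.hstack` (SciPy is already a scikit-learn dependency). `LogisticRegression` accepts sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = ["need pizza please", "lost my job need help"]   # made-up requests
numeric = np.array([[10.0, 2.0], [250.0, 7.0]])           # made-up numeric columns

vocab_feats = CountVectorizer().fit_transform(texts)      # sparse, one column per term

# hstack keeps the result sparse instead of materializing a dense array.
combined = hstack([vocab_feats, csr_matrix(numeric)]).tocsr()
print(combined.shape)
```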


In [16]:
#Build a logistic model based on number value columns + preprocessed text field
# test with various values for C
full_lr = LogisticRegression(penalty="l1", C = .5, solver="liblinear") #liblinear supports the L1 penalty
full_lr.fit(combined_train_feats, mini_train_labels)
#full_lr.fit(train_feats, mini_train_labels)

#print(full_lr.coef_[0:20])

dev_preds = full_lr.predict(combined_dev_feats)
test_preds = full_lr.predict(combined_test_feats)
#dev_preds = full_lr.predict(dev_feats)
print(metrics.f1_score(dev_labels, dev_preds, average='micro'))
print(classification_report(dev_preds, dev_labels))

#test_preds = full_lr.predict(combined_test_feats)

test_out = pd.DataFrame()
test_out['request_id'] = test_data['request_id']
test_out['requester_received_pizza'] = test_preds.astype(int)
#num = sum(preds['requester_received_pizza'])

#print(num, sum(train_labels))

test_out.to_csv('./data/submission2.csv', index=False)

0.737623762376
             precision    recall  f1-score   support

      False       0.94      0.76      0.84       550
       True       0.17      0.48      0.25        56

avg / total       0.86      0.74      0.79       606
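The comment in the cell above mentions testing various values for C. One way to do that is a small loop that records, for each C, how many coefficients the L1 penalty leaves non-zero and the dev accuracy. The data here is synthetic (`make_classification`) and the C values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined feature matrix.
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

results = {}
for C in [0.01, 0.1, 0.5, 1.0]:
    # Smaller C means stronger regularization and usually fewer non-zero weights.
    lr = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    lr.fit(X_train, y_train)
    nonzero = int((lr.coef_ != 0).sum())
    results[C] = (nonzero, lr.score(X_dev, y_dev))
    print(C, nonzero, round(results[C][1], 3))
```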



In [40]:
#Try looking at request only
train_text = mini_train_data['request_text_edit_aware']
dev_text = dev_data['request_text_edit_aware']
vec = CountVectorizer()
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names()
print(len(train_vocab))
vec2 = CountVectorizer(vocabulary=train_vocab)
dev_feats = vec2.transform(dev_text)

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))
print(classification_report(nb_preds, dev_labels))

11244
0.729372937294
             precision    recall  f1-score   support

      False       0.97      0.74      0.84       589
       True       0.03      0.29      0.06        17

avg / total       0.95      0.73      0.82       606



In [41]:
#Try looking at request unigrams and BIGRAMS
train_text = mini_train_data['request_text_edit_aware']
dev_text = dev_data['request_text_edit_aware']
vec = CountVectorizer(analyzer="word", ngram_range=(1,2))
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names()
print(len(train_vocab))
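#NOTE: vec2 keeps the default ngram_range=(1, 1), so the bigram columns are always zero for the dev set;
#vec.transform(dev_text) would apply the same n-gram settings as training
vec2 = CountVectorizer(vocabulary=train_vocab)
dev_feats = vec2.transform(dev_text)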

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))
print(classification_report(nb_preds, dev_labels))

103097
0.740924092409
             precision    recall  f1-score   support

      False       1.00      0.74      0.85       606
       True       0.00      0.00      0.00         0

avg / total       1.00      0.74      0.85       606





In [42]:
#Try looking at request TITLE only
train_text = mini_train_data['request_title']
dev_text = dev_data['request_title']
vec = CountVectorizer(analyzer="word", ngram_range=(1,2))
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names()
print(len(train_vocab))
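#NOTE: vec2 keeps the default ngram_range=(1, 1), so the bigram columns are always zero for the dev set;
#vec.transform(dev_text) would apply the same n-gram settings as training
vec2 = CountVectorizer(vocabulary=train_vocab)
dev_feats = vec2.transform(dev_text)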

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))
print(classification_report(nb_preds, dev_labels))

23071
0.742574257426
             precision    recall  f1-score   support

      False       1.00      0.74      0.85       605
       True       0.01      1.00      0.01         1

avg / total       1.00      0.74      0.85       606



In [43]:
#Try looking at request only with TfidfVectorizer
train_text = mini_train_data['request_text_edit_aware']
dev_text = dev_data['request_text_edit_aware']
vec = TfidfVectorizer()
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names()
print(len(train_vocab))
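#NOTE: the dev features here are raw counts while the model is trained on TF-IDF weights;
#vec.transform(dev_text) would weight the dev text the same way as the training text
vec2 = CountVectorizer(vocabulary=train_vocab)
dev_feats = vec2.transform(dev_text)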

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))
print(classification_report(nb_preds, dev_labels))

11244
0.740924092409
             precision    recall  f1-score   support

      False       1.00      0.74      0.85       606
       True       0.00      0.00      0.00         0

avg / total       1.00      0.74      0.85       606
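The last few cells transform the dev text with a second `CountVectorizer` built from the training vocabulary. That second vectorizer uses default settings, so it does not reproduce the n-gram range (or the TF-IDF weighting) of the vectorizer that was fitted on the training text. The sketch below, on made-up texts, shows the effect for bigrams and the simpler alternative of reusing the fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts standing in for the request data.
train_text = ["need pizza please", "pizza please help"]
dev_text = ["need pizza please now"]

vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))
vec.fit(train_text)
terms = sorted(vec.vocabulary_)            # same order as the feature columns
bigram_cols = [i for i, t in enumerate(terms) if " " in t]

# Second vectorizer with default settings: it only emits unigrams,
# so the bigram columns of its output are always zero.
vec2 = CountVectorizer(vocabulary=terms)

reused = vec.transform(dev_text).toarray()
rebuilt = vec2.transform(dev_text).toarray()
print("bigram counts, fitted vectorizer reused:", reused[:, bigram_cols].sum())
print("bigram counts, second vectorizer:", rebuilt[:, bigram_cols].sum())
```

Calling `transform` on the already fitted vectorizer keeps train and dev features on the same footing.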





In [148]:
#NB weights of vocab without preprocessing step
# feature_log_prob_
print(nb.feature_log_prob_.shape)
max_weights = np.argsort(nb.feature_log_prob_, axis=1)[:,-20:]
print(max_weights)
print(max_weights[0])
for i in max_weights[0]:
    print(train_vocab[i])
print("**********")
for i in max_weights[1]:
    print(train_vocab[i])

(2, 11566)
[[ 9406  7114  1719 10297 11497  1238  5516  6370 11157 11400  4846  5526
   5295  7635  7056  4159  6745 10247   776 10405]
 [11497  7114  9406 10297  5516  1719  6370  1238 11400 11157  7635  4846
   5526  5295  7056  4159  6745 10247   776 10405]]
[ 9406  7114  1719 10297 11497  1238  5516  6370 11157 11400  4846  5526
  5295  7635  7056  4159  6745 10247   776 10405]
so
on
but
this
you
be
is
me
we
would
have
it
in
pizza
of
for
my
the
and
to
**********
you
on
so
this
is
but
me
be
would
we
pizza
have
it
in
of
for
my
the
and
to
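Sorting `feature_log_prob_` within each class mostly surfaces the most frequent words, which is why both lists above are dominated by the same stop words. One way to look for words that separate the classes is to compare the two rows instead. This is a sketch on a made-up corpus, not a result from the pizza data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: "job" appears only in successful requests (label 1),
# "party" only in unsuccessful ones (label 0); "pizza" appears in all.
texts = ["pizza lost job", "pizza job gone", "pizza party tonight", "pizza party fun"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
feats = vec.fit_transform(texts)
terms = sorted(vec.vocabulary_)          # column order of the count matrix

nb = MultinomialNB().fit(feats, labels)

# Difference of per-class log probabilities: positive values favor class 1.
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
order = np.argsort(log_ratio)
print("favors class 1:", terms[order[-1]])
print("favors class 0:", terms[order[0]])
```

A word such as "pizza" that is equally common in both classes gets a ratio near zero and drops out of both ends of the ranking.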
