https://www.kaggle.com/c/random-acts-of-pizza

**Data fields:**

"giver_username_if_known": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).

"number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.

"number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.

"post_was_edited": Boolean indicating whether this post was edited (from Reddit).

"request_id": Identifier of the post on Reddit, e.g. "t3_w5491".

"request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.

"request_text": Full text of the request.

"request_text_edit_aware": Edit aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request such as "EDIT: Thanks /u/foo, the pizza was delicous".

"request_title": Title of the request.

"requester_account_age_in_days_at_request": Account age of requester in days at time of request.

"requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.

"requester_days_since_first_post_on_raop_at_request": Number of days between requesters first post on RAOP and this request (zero if requester has never posted before on RAOP).

"requester_days_since_first_post_on_raop_at_retrieval": Number of days between requesters first post on RAOP and time of retrieval.

"requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.

"requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.

"requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.

"requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.

"requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.

"requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.

"requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.

"requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.

"requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted in at the time of request.

"requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.

"requester_subreddits_at_request": The list of subreddits in which the author had already posted in at the time of request.

"requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.

"requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.

"requester_user_flair": Users on RAOP receive badges (Reddit calls them flairs) which is a small picture next to their username. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).

"requester_username": Reddit username of requester.

"unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP users are from the USA).

"unix_timestamp_of_request_utc": Unit timestamp of request in UTC.

In [110]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import pandas as pd

# ADD METRICS
from sklearn import metrics

**Julia sample code**

using DataFrames
using MachineLearning
using JSON

function read_data(file_name)
    f = open(file_name)
    json = JSON.parse(readall(f))
    close(f)

    colnames = keys(json[1])
    columns  = Any[[json[i][name] for i=1:length(json)] for name=colnames]
    DataFrame(columns, Symbol[name for name=colnames])
end

train = read_data("../data/train.json")
test  = read_data("../data/test.json")

println(@sprintf("There are %d rows in the training set", nrow(train)))
println(@sprintf("There are %d rows in the test set", nrow(test)))

feature_names = Symbol["requester_account_age_in_days_at_request",
                       "requester_days_since_first_post_on_raop_at_request",
                       "requester_number_of_comments_at_request",
                       "requester_number_of_comments_in_raop_at_request",
                       "requester_number_of_posts_at_request",
                       "requester_number_of_posts_on_raop_at_request",
                       "requester_number_of_subreddits_at_request",
                       "requester_upvotes_minus_downvotes_at_request",
                       "requester_upvotes_plus_downvotes_at_request",
                       "unix_timestamp_of_request_utc"]

for feature = feature_names
    train[feature] = float64(train[feature])
    test[feature]  = float64(test[feature])
end

columns_to_keep = cat(1, feature_names, [:requester_received_pizza])

rf = fit(train[columns_to_keep], :requester_received_pizza, classification_forest_options(num_trees=200, display=true))
println("")
println(rf)
println("")
predictions = predict_probs(rf, test)[:,2]
submission = DataFrame(request_id=test[:request_id], requester_received_pizza=predictions)
writetable("simple_julia_benchmark.csv", submission)
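
For readers following the Python cells below, here is a rough scikit-learn counterpart of the Julia benchmark: a 200-tree random forest on the same ten numeric request-time features. It is a sketch, not a port verified against the Julia output. The function name `forest_benchmark` and the synthetic frames used to exercise it are assumptions; on the real data you would pass frames loaded with `pd.read_json`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same feature list as the Julia sample.
FEATURE_NAMES = [
    "requester_account_age_in_days_at_request",
    "requester_days_since_first_post_on_raop_at_request",
    "requester_number_of_comments_at_request",
    "requester_number_of_comments_in_raop_at_request",
    "requester_number_of_posts_at_request",
    "requester_number_of_posts_on_raop_at_request",
    "requester_number_of_subreddits_at_request",
    "requester_upvotes_minus_downvotes_at_request",
    "requester_upvotes_plus_downvotes_at_request",
    "unix_timestamp_of_request_utc",
]

def forest_benchmark(train, test, n_trees=200, seed=0):
    """Fit a random forest on the numeric features and return a submission frame."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(train[FEATURE_NAMES].astype(float), train["requester_received_pizza"])
    # Find the predict_proba column that corresponds to the positive class.
    pos = list(rf.classes_).index(True)
    probs = rf.predict_proba(test[FEATURE_NAMES].astype(float))[:, pos]
    return pd.DataFrame({"request_id": test["request_id"].values,
                         "requester_received_pizza": probs})

# Synthetic stand-in data so the sketch runs without the Kaggle files.
rng = np.random.default_rng(0)

def fake_frame(n, with_label):
    df = pd.DataFrame(rng.random((n, len(FEATURE_NAMES))), columns=FEATURE_NAMES)
    df["request_id"] = ["t3_%d" % i for i in range(n)]
    if with_label:
        df["requester_received_pizza"] = rng.random(n) < 0.25
    return df

submission = forest_benchmark(fake_frame(60, True), fake_frame(10, False), n_trees=20)
print(submission.head())
```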

In [103]:
# Load the raw train and test data
raw_train = pd.read_json('./data/train.json')
raw_test = pd.read_json('./data/test.json')

# Summarize raw data
col_names = list(raw_train.columns.values)
print(col_names)
print(raw_train.shape)
print(raw_train['requester_received_pizza'].shape)  # label column; train_labels is only defined in a later cell
print(raw_test.shape)
#print(dev_data.shape)
print(list(raw_test.columns.values))
#print(dev_size)
#print(train_data.dtypes)

<class 'pandas.core.frame.DataFrame'>
['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester_subreddits_at_request', 'requester_upvotes_minus_d

In [107]:
# Split off the labels and hold out a development set
train_labels = raw_train["requester_received_pizza"]
train_data = raw_train.drop('requester_received_pizza', axis=1)

dev_size = int(round(train_data.shape[0]*.1))

mini_train_data, mini_train_labels = train_data[dev_size:], train_labels[dev_size:]
print(mini_train_data.shape, mini_train_labels.shape)
dev_data, dev_labels = train_data[:dev_size], train_labels[:dev_size]
print(dev_data.shape, dev_labels.shape)

(3636, 31) (3636,)
(404, 31) (404,)


**Feature development:**
- request_title: parse word vocabulary
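
To make "parse word vocabulary" concrete, here is a minimal sketch using `CountVectorizer` on two made-up titles written in the style of RAOP requests. The titles are hypothetical; the vectorizer settings are the scikit-learn defaults, which lowercase the text and keep tokens of two or more word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles, not taken from the data set.
titles = [
    "[Request] Broke student, would love a pizza",
    "[REQUEST] Pizza for a hungry family",
]

vec = CountVectorizer()             # default tokenizer: lowercase, 2+ character tokens
counts = vec.fit_transform(titles)  # sparse matrix, one row per title
vocab = sorted(vec.vocabulary_)     # learned words; columns follow this sorted order

print(vocab)
print(counts.toarray())
```

Single-character tokens such as "a" are dropped by the default token pattern, and the bracketed tag collapses to the single word "request" in both titles.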


In [108]:
#Here we split the number variables from the object or string type variables
obj_columns = ['giver_username_if_known','request_id','request_text','request_text_edit_aware','request_title',
              'requester_subreddits_at_request','requester_user_flair','requester_username']
num_columns = [i for i in col_names if i not in obj_columns]

# Split the test data columns the same way (the test set has fewer columns than train)
test_names = list(raw_test.columns.values)
test_num_columns = [i for i in test_names if i not in obj_columns]
test_obj_columns = [i for i in test_names if i not in test_num_columns]

print(len(test_num_columns))
print(len(test_obj_columns))

11
6
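
As an alternative to maintaining `obj_columns` by hand, the numeric columns can be picked out by dtype. The sketch below uses a small hypothetical frame, not the real data, and assumes string and list-valued columns load with a non-numeric dtype. Note that boolean columns are not matched by `include="number"`.

```python
import pandas as pd

# Hypothetical rows that reuse a few column names from the data set.
toy = pd.DataFrame({
    "request_id": ["t3_aaa", "t3_bbb"],
    "request_title": ["[Request] pizza please", "[Request] hungry"],
    "requester_number_of_posts_at_request": [3, 0],
    "requester_account_age_in_days_at_request": [12.5, 0.0],
})

# Integer and float columns.
num_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (strings, lists, booleans).
obj_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(obj_cols)
```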


In [115]:
#Build an initial model based on number value columns
lr = LogisticRegression()
print("shape of train:",mini_train_data[test_num_columns].shape, mini_train_labels.shape)
print("shape of dev:",dev_data[test_num_columns].shape, dev_labels.shape)
lr.fit(mini_train_data[test_num_columns], mini_train_labels)

dev_preds = lr.predict(dev_data[test_num_columns])
print(metrics.f1_score(dev_labels, dev_preds, average='micro'))

shape of train: (3636, 11) (3636,)
shape of dev: (404, 11) (404,)
0.752475247525
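
A note on reading this score: for a single-label problem, micro-averaged F1 counts every correct prediction as a true positive, so it equals plain accuracy. A model that never predicts the positive class would therefore still score the fraction of negatives in the dev set. The sketch below uses made-up labels, not the competition data, to show the effect; whether the model above behaves this way would need to be checked against `dev_labels`.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 3 positives out of 12, and a "model" that always predicts 0.
y_true = np.array([1] * 3 + [0] * 9)
y_pred = np.zeros_like(y_true)

micro_f1 = metrics.f1_score(y_true, y_pred, average="micro")
accuracy = metrics.accuracy_score(y_true, y_pred)
# F1 for the positive class only; zero_division=0 silences the undefined-precision warning.
binary_f1 = metrics.f1_score(y_true, y_pred, average="binary", zero_division=0)

print(micro_f1, accuracy, binary_f1)
```

Comparing the dev score with `1 - dev_labels.mean()` (the share of negatives) is a quick way to see whether a model has learned anything beyond the majority class.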


In [140]:
# Try a bag-of-words model on the request text only
train_text = mini_train_data['request_text_edit_aware']
dev_text = dev_data['request_text_edit_aware']
vec = CountVectorizer()
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names_out()  # scikit-learn >= 1.0; older releases use get_feature_names()
print(len(train_vocab))
# Reuse the fitted vectorizer so dev features share the training vocabulary
dev_feats = vec.transform(dev_text)

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))

11566
0.740099009901


In [143]:


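# Try the request text with unigrams and bigrams
train_text = mini_train_data['request_text_edit_aware']
dev_text = dev_data['request_text_edit_aware']
vec = CountVectorizer(analyzer="word", ngram_range=(1,2))
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names_out()  # scikit-learn >= 1.0; older releases use get_feature_names()
print(len(train_vocab))
# Reuse the fitted vectorizer so dev text is also tokenized into unigrams and bigrams;
# a fresh CountVectorizer(vocabulary=...) defaults to ngram_range=(1,1) and never emits bigrams
dev_feats = vec.transform(dev_text)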

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))

107160
0.752475247525


In [144]:
# Try the request TITLE only, with unigrams and bigrams
train_text = mini_train_data['request_title']
dev_text = dev_data['request_title']
vec = CountVectorizer(analyzer="word", ngram_range=(1,2))
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names_out()  # scikit-learn >= 1.0; older releases use get_feature_names()
print(len(train_vocab))
# Reuse the fitted vectorizer so dev titles are also tokenized into unigrams and bigrams
dev_feats = vec.transform(dev_text)

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))

24109
0.752475247525


In [147]:
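# Try the request text only, with TfidfVectorizer
train_text = mini_train_data['request_text_edit_aware']
dev_text = dev_data['request_text_edit_aware']
vec = TfidfVectorizer()
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names_out()  # scikit-learn >= 1.0; older releases use get_feature_names()
print(len(train_vocab))
# Reuse the fitted vectorizer so dev text gets the same tf-idf weighting as the training text
# (a CountVectorizer here would feed raw counts to a model trained on tf-idf values)
dev_feats = vec.transform(dev_text)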

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))

11566
0.752475247525


In [149]:
# Try the request text only, with TfidfVectorizer over unigrams and bigrams
train_text = mini_train_data['request_text_edit_aware']
dev_text = dev_data['request_text_edit_aware']
vec = TfidfVectorizer(analyzer="word", ngram_range=(1,2))
train_feats = vec.fit_transform(train_text)
train_vocab = vec.get_feature_names_out()  # scikit-learn >= 1.0; older releases use get_feature_names()
print(len(train_vocab))
# Reuse the fitted vectorizer so dev text gets the same n-grams and tf-idf weighting
dev_feats = vec.transform(dev_text)

nb =  MultinomialNB()
nb.fit(train_feats, mini_train_labels)
nb_preds = nb.predict(dev_feats)
print(metrics.f1_score(dev_labels, nb_preds, average='micro'))

107160
0.752475247525
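
One way to guarantee that training and dev text always pass through the same fitted vectorizer is to wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below is an illustration on made-up texts and labels, not a rerun of the cells above; the variable names simply mirror those used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up requests and outcomes, for illustration only.
train_text = [
    "lost my job and the kids are hungry",
    "job interview tomorrow and no money for food",
    "the pizza would be nice",
    "i want the pizza for a party",
]
train_labels = [True, True, False, False]
dev_text = ["no money and hungry kids", "pizza party tonight"]

# fit() learns the vocabulary and idf weights on train_text only;
# predict() applies that same fitted transformation to dev_text.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_text, train_labels)

dev_preds = model.predict(dev_text)
dev_probs = model.predict_proba(dev_text)[:, 1]  # column 1 is the True class
print(dev_preds, dev_probs)
```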


In [148]:

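# Top 20 features per class by feature_log_prob_.
# Uses nb and train_vocab from the unigram TfidfVectorizer cell (In [147]), hence the (2, 11566) shape below.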
print(nb.feature_log_prob_.shape)
max_weights = np.argsort(nb.feature_log_prob_, axis=1)[:,-20:]
print(max_weights)
print(max_weights[0])
for i in max_weights[0]:
    print(train_vocab[i])
print("**********")
for i in max_weights[1]:
    print(train_vocab[i])

(2, 11566)
[[ 9406  7114  1719 10297 11497  1238  5516  6370 11157 11400  4846  5526
   5295  7635  7056  4159  6745 10247   776 10405]
 [11497  7114  9406 10297  5516  1719  6370  1238 11400 11157  7635  4846
   5526  5295  7056  4159  6745 10247   776 10405]]
[ 9406  7114  1719 10297 11497  1238  5516  6370 11157 11400  4846  5526
  5295  7635  7056  4159  6745 10247   776 10405]
so
on
but
this
you
be
is
me
we
would
have
it
in
pizza
of
for
my
the
and
to
**********
you
on
so
this
is
but
me
be
would
we
pizza
have
it
in
of
for
my
the
and
to
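
Both lists are dominated by the same common words because `feature_log_prob_` ranks words by how frequent they are within a class, not by how well they separate the classes. A possible follow-up, sketched here on made-up texts rather than the real data, is to rank words by the difference of the two rows, that is, the log ratio log P(word | True) - log P(word | False).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up requests and outcomes, for illustration only.
texts = [
    "lost my job and the kids are hungry",
    "job interview tomorrow and no money for food",
    "the pizza would be nice",
    "i want the pizza for a party",
]
labels = np.array([True, True, False, False])

vec = CountVectorizer()
X = vec.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

# Feature names ordered by column index.
vocab = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# Row 0 is the False class, row 1 the True class (classes_ is sorted).
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
order = np.argsort(log_ratio)

print("most indicative of False:", vocab[order[:3]])
print("most indicative of True: ", vocab[order[-3:]])
```

Words that are equally common in both classes end up near zero under this ranking, regardless of how frequent they are.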
