# Random Acts of Pizza (RAOP) Notes

**Source**: Althoff, T., Danescu-Niculescu-Mizil, C., & Jurafsky, D. (2014). *How to Ask for a Favor: A Case Study on the Success of Altruistic Requests*. Association for the Advancement of Artificial
Intelligence (www.aaai.org).

- "The community only publishes which users have given or received pizzas but not which requests were successful. 
In the case of successful users posting multiple times it is unclear which of the requests was actually successful. 
Therefore, we restrict our analysis to users with a single request for which we can be certain whether or not 
it was successful, leaving us with 5728 pizza requests. We split this dataset into development(70%) and test set (30%) 
such that both sets mirror the average success rate in our dataset of 24.6%. All features are developed on the 
development test only while the test set is used only once to evaluate the prediction accuracy of our proposed model on held-out data. For a small number of requests (379) we further observe the identity of the benefactor through a 
'thank you' post by the beneficiary after the successful request. This enables us to reason about the impact of 
user similarity on giving."


- "It is extremely difficult to disentangle the effects of all these factors in determining what makes people satisfy requests, and what makes them select some requests over others. . . In this paper, we develop a framework for controlling for each of these potential confounds while studying the role of two aspects that characterize compelling requests: **social factors** (who is asking and how the recipient is related to the donor and community) and **linguistic factors** (how they are asking and what linguistic devices accompany successful requests). With the notable exception of Mitra and Gilbert (2014), the effect of language on the success of requests has largely been ignored thus far."


- "[Their] goal is to understand what motivates people to give when they do not receive anything tangible in return. That is, [they] focus on the important special case of altruistic requests in which the giver receives no rewards." **DSC**: But how do you know people don't want something in return, especially if they are more likely to help requesters who have high status or are more similar to them?

-----

Temporal Factors
- Specific months
- Weekdays
- **Days of the month (first half of the month)**
- Hour of the day
- **Community age of the request (earlier the better)**

Textual Factors
- Politeness (e.g., **gratitude**)
- **Evidentiality** (2nd largest parameter estimate)
- Reciprocity (respond to a positive action with another positive action, **pay it forward**)
- Sentiment (e.g., **urgency**)
- **Length**

Social Factors
- **Status**
    - karma points (up-votes minus down-votes) that Reddit counts on link submissions and comments,
    - user has posted on RAOP before and thus could be considered a member of the sub-community. 
    - **user account age based on the hypothesis that “younger” accounts might be less trusted**


- Similarity: intersection size between the set of the giver and receiver, and the Jaccard similarity (intersection
over union) of the two. NOT included in logistic regression model.

Narratives (identified through topic modeling)
- **Desire**
- **Family**
- **Job**
- **Money**
- Student

-----

Conclusion
- Drawing from social psychology literature [they] extract high-level social features from text that operationalize the relation between recipient and donor and demonstrate that these extracted relations are predictive of success. 
- [They] show that [they] can detect key narratives automatically that have significant impact on the success of the request. 
- [They] further demonstrate that linguistic indications of gratitude, evidentiality, and reciprocity, as well as the high status of the asker, all increase the likelihood of success, while neither politeness nor positive sentiment seem to be associated with success in [the] setting.

Limitations
- A shortcoming of any case study is that findings might be specific to the scenario at hand. While [they] have shown that particular linguistic and social factors differentiate between successful and unsuccessful requests [they] cannot claim a causal relationship between the proposed factors and success that would guarantee success. 
- Furthermore, the set of success factors studied in this work is likely to be incomplete as well and excludes,
for instance, group behavior dynamics. 
- Despite these limitations, [they] hope that this work and the data [they] make available will provide a basis for further research on success factors and helping behavior in other online communities.

-----

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import codecs
import json
import csv

import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

In [2]:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
# Convert a JSON string to pandas object

X = pd.read_json('./pizza_request_dataset.json')
#print X.head()
#print X.describe()
#print

'''
shuffle = np.random.permutation(np.arange(d.shape[0]))

print shuffle.max()

print X['giver_username_if_known'][:2]
X = X.sample(frac=1) #.reset_index(drop=True)
print X['giver_username_if_known'][:2]

#X, Y = X[shuffle], Y[shuffle]
'''

# set random seed
np.random.seed(0)

# randomly assign 70% to train_data, and 30% to dev_data
msk = np.random.rand(len(X)) <= 0.7
train_data = X[msk]
dev_data = X[~msk]

# create output dataframe Y of train_labels
train_labels = train_data[["requester_received_pizza"]]

# delete train_labels from input dataframe of train_data
del train_data["requester_received_pizza"]

# create output dataframe of dev_labels
dev_labels = dev_data[["requester_received_pizza"]]
#Y.describe()

# delete dev_labels from input dataframe of dev_data
del dev_data["requester_received_pizza"]

# print labels and data
print "train_labels" 
print "----------"
print list(train_labels)
print train_labels.shape
print
print "train_data" 
print "----------"
print list(train_data)
print train_data.shape
print

print "dev_labels" 
print "----------"
print list(dev_labels)
print dev_labels.shape
print
print "dev_data"
print "----------"
print list(dev_data)
print dev_data.shape
print

# print percent of train_data and dev_data whose posts led to receipt of pizza
print "train labels"
print "----------"
print np.mean(train_labels)
print
print "dev labels"
print "----------"
print np.mean(dev_labels)
print

train_labels
----------
[u'requester_received_pizza']
(3975, 1)

train_data
----------
[u'giver_username_if_known', u'in_test_set', u'number_of_downvotes_of_request_at_retrieval', u'number_of_upvotes_of_request_at_retrieval', u'post_was_edited', u'request_id', u'request_number_of_comments_at_retrieval', u'request_text', u'request_text_edit_aware', u'request_title', u'requester_account_age_in_days_at_request', u'requester_account_age_in_days_at_retrieval', u'requester_days_since_first_post_on_raop_at_request', u'requester_days_since_first_post_on_raop_at_retrieval', u'requester_number_of_comments_at_request', u'requester_number_of_comments_at_retrieval', u'requester_number_of_comments_in_raop_at_request', u'requester_number_of_comments_in_raop_at_retrieval', u'requester_number_of_posts_at_request', u'requester_number_of_posts_at_retrieval', u'requester_number_of_posts_on_raop_at_request', u'requester_number_of_posts_on_raop_at_retrieval', u'requester_number_of_subreddits_at_request', u'

In [3]:
import operator
# http://stackoverflow.com/questions/209840/map-two-lists-into-a-dictionary-in-python
# http://stackoverflow.com/questions/268272/getting-key-with-maximum-value-in-dictionary

# Notes
# Classifier precision--when a positive value is predicted, proportion of time the prediction is correct--equals (TP) / (TP + FP)
# Classifier recall--when the actual value is positive, the proportion of time the prediction is correct--equals (TP) / (TP + FN)

train_data = train_data["request_text"]
dev_data = dev_data["request_text"]

def iterate():
### STUDENT START ###

    # create empty vector
    accuracies = []

    # Source: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
    # The F1 score can be interpreted as a weighted average of the precision and recall, 
    # where an F1 score reaches its best value at 1 and worst score at 0. 
    # The relative contribution of precision and recall to the F1 score are equal. 
    # The formula for the F1 score is: F1 = 2 * (precision * recall) / (precision + recall)
    
    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
    #vectorizer = CountVectorizer()
    train_vectors = vectorizer.fit_transform(train_data)
    print "train_vectors.shape:", train_vectors.shape
    
    dev_vectors = vectorizer.transform(dev_data)
    print "dev_vectors.shape:", dev_vectors.shape
    print
    
    #------------------------
    # K Nearest Neighbors
    #------------------------
    
    print "------------------------------"
    print "K Nearest Neighbors (K-NN)"
    print "------------------------------"
    
    # Euclidean distance, when you go to 10 to 20+ dimensions, too many examples can be close to each other
    # With K-NN on text, Cosine or Manhattan distance might be better. Cosine distance measures the angle between examples,
    # more robust for high-dimensional problems. 
    # Dot product measures length of vectors AND angle between these vectors. 
    # With Cosine distance, you can get a value 0 to 1.
    
    # create two vectors
    # ks refers to a vector of k nearest neighbor values
    
    ks = [1, 5, 15, 16, 17, 18, 19, 20, 28, 29, 30, 31, 32, 150, 300]
    f1_scores = []
    
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k, distance='cosine', algorithm='brute')
        knn.fit(train_vectors, train_labels)
        pred_1 = knn.predict(dev_vectors)
        
        # http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
        # f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)[source]¶
            # y_true = Ground truth (correct) target values 
            # y_pred = Estimated targets as returned by a classifier.
            # average = required for multiclass/multilabel targets.
                # 'weighted': Calculate metrics for each label, and find their average, weighted by 
                # the number of true instances for each label. This alters ‘macro’ to account for label imbalance; 
                # it can result in an F-score that is not between precision and recall.
            
        print "K-NN: f1_score = %s, k = %s" % (round(metrics.f1_score(dev_labels, pred_1, average='binary'),4), k)

        # append f1_scores to vector
        f1_scores.append(metrics.f1_score(dev_labels, pred_1))
    
    print
    
    # map two vectors into a dictionary
    results_knn = dict(zip(ks, f1_scores))
    #print results_knn
    
    # print the key with the max fl_score
    print "K-NN: optimal k =", max(results_knn.iteritems(), key=operator.itemgetter(1))[0]
    print

    #------------------------
    # Multinomial Naive Bayes
    #------------------------
    
    print "-----------------------------"
    print "Multinomial Naive Bayes (MNB)"
    print "-----------------------------"
    
    # create two vectors
    
    alphas = [0.0, 0.00001, 0.0001, 0.001, 0.01, 0.094, 0.095, 0.096, 0.1, 0.105, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0, 10.0]
    f1_scores = []
    
    for a in alphas:
        mnb = MultinomialNB(alpha=a)
        mnb.fit(train_vectors, train_labels)
        pred_2 = mnb.predict(dev_vectors)
        print "MNB: f1_score = %s, alpha = %s" % (round(metrics.f1_score(dev_labels, pred_2, average='binary'), 4), a)
        
        # append f1_scores to vector
        f1_scores.append(metrics.f1_score(dev_labels, pred_2))
        
    print
    
    # map two vectors into a dictionary
    results_mnb = dict(zip(alphas, f1_scores))
    #print results_mnb
    
    # print the key with the max fl_score
    print "Multinomial Naive Bayes: optimal alpha =", max(results_mnb.iteritems(), key=operator.itemgetter(1))[0]
    print
    
    #------------------------
    # Logistic Regression
    #------------------------
    
    print "------------------------"
    print "Logistic Regression (LR)"
    print "------------------------"
    print
    
    # create two vectors
    # cs refers to the vector of C (inverse of regularization strength) values
    
    cs = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, \
          10, 20, 30, 40, 50, 100, 1000]
    f1_scores = []
    
    for c in cs:
        
        # logistic regression fits a line like linear regression, but instead of predicting any number, 
        # it predicts a number between 0 and 1 (sigmoid function).
        
        # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
        # C (inverse of regularization strength) controls how much the weights influence the loss, and
        # penalizes the sum of squared weights if very different weights exist between different tokens.
  
        # use l2 regularization, per instructions
        lr = LogisticRegression(penalty='l2',C=c)
        lr.fit(train_vectors, train_labels)
        pred_3 = lr.predict(dev_vectors)
        
        print "-------------------------------"
        print "LR: f1_score = %s, C = %s" % (round(metrics.f1_score(dev_labels, pred_3, average='binary'),4), c)
        print "-------------------------------"
        
        # append f1_scores to vector
        f1_scores.append(metrics.f1_score(dev_labels, pred_3, average='binary'))
        
        accuracies.append((lr.score(dev_vectors, dev_labels))*100) 
        
        '''
        # first define function that squares a given value, for later use in the 'for loop' below
        fun_sq_wts = lambda x: x**2
        
        for label in range(0,4):
         
            # use map function, likely faster (because written in C) than list comprehension.
            # map function itself applies a function, specifically the first argument on the second argument.
            # from coef_, take raw weights (coefficient of the features in the decision function), 
            # and sum the squares of these weights.
            
            # note: averege=weight vs. average=default should be about same score if similar number of examples across classes
            sq_wts = map(fun_sq_wts, lr.coef_[label])
            sum_sq_wts = round(sum(sq_wts),2)
            print "Label = %s, sum of squared weights = %s" % (label, sum_sq_wts)
        
        print
        '''
        
    # map two vectors into a dictionary
    results_lr = dict(zip(cs, f1_scores))
    #print results_lr
    
    # print the key with the max fl_score
    print "Logistic Regression: optimal C =", max(results_lr.iteritems(), key=operator.itemgetter(1))[0]
    print
    print "accuracy =", max(accuracies)
        
### STUDENT END ###

iterate()

train_vectors.shape: (3975, 12313)
dev_vectors.shape: (1696, 12313)

------------------------------
K Nearest Neighbors (K-NN)
------------------------------
K-NN: f1_score = 0.4133, k = 1




K-NN: f1_score = 0.0259, k = 5
K-NN: f1_score = 0.0, k = 15




K-NN: f1_score = 0.0, k = 16
K-NN: f1_score = 0.0, k = 17




K-NN: f1_score = 0.0, k = 18
K-NN: f1_score = 0.0, k = 19

  'precision', 'predicted', average, warn_for)



K-NN: f1_score = 0.0, k = 20
K-NN: f1_score = 0.0, k = 28




K-NN: f1_score = 0.0, k = 29
K-NN: f1_score = 0.0, k = 30




K-NN: f1_score = 0.0, k = 31
K-NN: f1_score = 0.0, k = 32




K-NN: f1_score = 0.0, k = 150
K-NN: f1_score = 0.0, k = 300

  y = column_or_1d(y, warn=True)




K-NN: optimal k = 1

-----------------------------
Multinomial Naive Bayes (MNB)
-----------------------------
MNB: f1_score = 0.0907, alpha = 0.0
MNB: f1_score = 0.1261, alpha = 1e-05
MNB: f1_score = 0.1263, alpha = 0.0001
MNB: f1_score = 0.1201, alpha = 0.001
MNB: f1_score = 0.1032, alpha = 0.01
MNB: f1_score = 0.0377, alpha = 0.094
MNB: f1_score = 0.0377, alpha = 0.095
MNB: f1_score = 0.0335, alpha = 0.096
MNB: f1_score = 0.0337, alpha = 0.1
MNB: f1_score = 0.0339, alpha = 0.105
MNB: f1_score = 0.0045, alpha = 0.2
MNB: f1_score = 0.0045, alpha = 0.3
MNB: f1_score = 0.0, alpha = 0.4
MNB: f1_score = 0.0, alpha = 0.5
MNB: f1_score = 0.0, alpha = 0.6

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



MNB: f1_score = 0.0, alpha = 0.7
MNB: f1_score = 0.0, alpha = 1.0
MNB: f1_score = 0.0, alpha = 10.0

Multinomial Naive Bayes: optimal alpha = 0.0001

------------------------
Logistic Regression (LR)
------------------------

-------------------------------
LR: f1_score = 0.0, C = 0.01
-------------------------------
-------------------------------
LR: f1_score = 0.0, C = 0.1
-------------------------------
-------------------------------
LR: f1_score = 0.0045, C = 0.2
-------------------------------
-------------------------------

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



LR: f1_score = 0.0135, C = 0.3
-------------------------------
-------------------------------
LR: f1_score = 0.0178, C = 0.4
-------------------------------
-------------------------------
LR: f1_score = 0.0177, C = 0.5
-------------------------------
-------------------------------
LR: f1_score = 0.0177, C = 0.54
-------------------------------
-------------------------------

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



LR: f1_score = 0.0221, C = 0.55
-------------------------------
-------------------------------
LR: f1_score = 0.0221, C = 0.56
-------------------------------
-------------------------------
LR: f1_score = 0.022, C = 0.57
-------------------------------
-------------------------------
LR: f1_score = 0.022, C = 0.58
-------------------------------
-------------------------------

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



LR: f1_score = 0.0263, C = 0.59
-------------------------------
-------------------------------
LR: f1_score = 0.0306, C = 0.6
-------------------------------
-------------------------------
LR: f1_score = 0.0475, C = 0.7
-------------------------------
-------------------------------
LR: f1_score = 0.0641, C = 0.8
-------------------------------
-------------------------------
LR: f1_score = 0.0761, C = 0.9
-------------------------------
-------------------------------
LR: f1_score = 0.0795, C = 1.0
-------------------------------
-------------------------------
LR: f1_score = 0.0828, C = 1.1
-------------------------------
-------------------------------

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



LR: f1_score = 0.2324, C = 10
-------------------------------
-------------------------------
LR: f1_score = 0.2415, C = 20
-------------------------------
-------------------------------

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



LR: f1_score = 0.2541, C = 30
-------------------------------
-------------------------------
LR: f1_score = 0.2638, C = 40
-------------------------------
-------------------------------
LR: f1_score = 0.2701, C = 50
-------------------------------
-------------------------------
LR: f1_score = 0.2739, C = 100
-------------------------------
-------------------------------
LR: f1_score = 0.2755, C = 1000
-------------------------------
Logistic Regression: optimal C = 1000

accuracy = 74.233490566


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [4]:
from pandas import *
from sklearn.feature_selection import SelectFromModel

# Feature Selection Notes:
'''
http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
http://scikit-learn.org/stable/modules/feature_selection.html
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py

These objects take as input a scoring function that returns univariate p-values:
-For regression: f_regression
-For classification: chi2 or f_classif

Feature selection with sparse data:
-If you use sparse data (i.e. data represented as sparse matrices), 
only chi2 will deal with the data without making it dense.
-Warning: Beware not to use a regression scoring function with a classification problem, 
you will get useless results.

With SVMs and logistic-regression, the parameter C controls the sparsity: 
the smaller C the less features selected. 
'''

def top20(type):
### STUDENT START ###

    if type == "unigram":
        
        # use stop_words='english' to remove less meaningful words. 
        # only applies if default analyzer='word'.
        vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
        #vectorizer = CountVectorizer(min_df=1, stop_words='english')
        train_vectors = vectorizer.fit_transform(train_data)
        print
        print "----------"
        print "unigram"
        print "----------"
        print
        print "train_vectors.shape:", train_vectors.shape
        print
        
    elif type == "bigram":
        
        # use stop_words='english' to remove less meaningful words from the resulting tokens. 
        # only applies if default analyzer='word'.
        # set bigrams to be 2 words only
        vectorizer = TfidfVectorizer(min_df=1, stop_words='english', ngram_range=(2, 2))
        #vectorizer = CountVectorizer(min_df=1, stop_words='english', ngram_range=(2, 2))
        train_vectors = vectorizer.fit_transform(train_data)
        print
        print "----------"
        print "bigram"
        print "----------"
        print
        print "train_vectors.shape:", train_vectors.shape
        print
      
    # use C=12
    for c in [12]:
        
        # in the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the default ‘multi_class’ option is set to ‘ovr’ 
        lr = LogisticRegression(penalty='l2',C=c)
        #print lr
        
        # fit the model and generate coef_
        lr.fit(train_vectors, train_labels)
         
        # interested in magnitude of the weights (coefficients), so take absolute value.
        # sort absolute values in descending order.
        # important to know if negative or positive weight, so still output the positive/negative sign.
        # after fitting logistic regression for class vs. all other classes, negative weight of a token 
        # indicates a class other than class of interest.
        # (visual example of negative and positive on a sigmoid function helps) 
        
        print lr.coef_
        
        # for each label, store the column indices of the top 5 weights 
        top20 = sorted(range(len(lr.coef_[0])), key=lambda i: abs(lr.coef_[0][i]), reverse=True)[:20]
       
        col_1 = []
        
        # for each label, access and store weights via column indices
        for index in (top20):

            col_1.append(lr.coef_[0][index])
           
        print top20

        # store feature names, after converting to an array
        feature_names = np.asarray(vectorizer.get_feature_names())
       
        # create a Pandas dataframe with 20 rows and 4 columns, plus descriptive headers
        df = DataFrame({'Feature': feature_names[top20], 'word': col_1})
        print df    

    print

#-----
         
### STUDENT END ###
top20("unigram")
top20("bigram")


----------
unigram
----------

train_vectors.shape: (3975, 12313)

[[-1.48092905 -0.39311615 -0.12825177 ...,  0.44453487  1.63180069
  -0.87946286]]
[3697, 4147, 6786, 8221, 1039, 6858, 3236, 1286, 6712, 10159, 1523, 10688, 4918, 10689, 9907, 7050, 3901, 2573, 2155, 9349]
       Feature      word
0         edit  4.824324
1       father  4.627494
2         mean  4.376547
3      pockets  4.057164
4          ass  3.995387
5    mentioned  3.977975
6          die  3.930683
7        basic  3.899384
8      married  3.850098
9     southern -3.846223
10       bloke  3.807984
11    surprise  3.805289
12   graveyard  3.749854
13   surprised  3.697465
14     sitting -3.697393
15       mommy  3.658543
16  especially  3.627059
17  constantly  3.593246
18      cheesy  3.548595
19     running  3.539010




  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


----------
bigram
----------

train_vectors.shape: (3975, 90693)

[[-0.31775435  0.42001296 -0.29407174 ..., -0.26401881 -0.22234693
   0.57766744]]
[23002, 38997, 37806, 32065, 28212, 73029, 33880, 44381, 54471, 23001, 81228, 41801, 77046, 2781, 47597, 9433, 84630, 16480, 57349, 59245]
                Feature      word
0           edit thanks  3.882043
1             imgur com  3.879634
2            http imgur  3.547851
3             got pizza  3.038731
4         forward money  2.984111
5        sounds amazing  2.960338
6        happy birthday  2.947813
7   letsfytinglove best  2.947813
8        north carolina  2.936326
9            edit thank  2.927036
10      tonight greatly  2.813269
11           just spent  2.809594
12         surprise son  2.768559
13         afford ramen  2.674511
14             love pie  2.633101
15         broke payday  2.627687
16               ve got  2.625031
17        craving pizza -2.599092
18          pay forward  2.518712
19       pizza actually  2.51293

In [15]:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html
# transform from pandas dataframe to numpy array
train_data_np = train_data.as_matrix(columns=None)
dev_data_np = dev_data.as_matrix(columns=None)

train_labels_np = train_labels.as_matrix(columns=None)
dev_labels_np = dev_labels.as_matrix(columns=None)

#train_labels_np = np.ravel(train_labels_np)
#dev_labels_np = np.ravel(dev_labels_np)

# define function fs (feature selection)
def fs():
    # Keep this random seed here to make comparison easier.
    
    # create two empty vectors
    accuracies = []
    vocab_size = []
    
    ### STUDENT START ###

    ### Logistic regression seeks the set of weights that minimizes errors in the training data AND has a small size.
    ### For this size, the default regularization, L2, computes the sum of the squared weights (see P3, above), while 
    ### L1 regularization computes the sum of the absolute values of the weights. 
    ### L2 regularization makes all the weights relatively small, whereas
    ### L1 regularization drives lots of the weights to 0, effectively removing unimportant features [for feature selection].

    ### http://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_l1_l2_sparsity.html
     
    # set min_df=10 to ignore words that appear in less than 10 documents
    # use stop_words='english' to remove less meaningful words from the resulting tokens, only applies if default analyzer='word'.

    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
    #vectorizer = CountVectorizer(min_df=1, stop_words='english')
    train_vectors = vectorizer.fit_transform(train_data_np)
    dev_vectors = vectorizer.transform(dev_data_np)    
    
    cs = [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.57, 0.7, 1, 10, 12, 30, 50, 70, 100, 200, 300]
    # no longer use np.linspace to return evenly spaced numbers over a specified interval.
    # it offers less control.
    
    for c in cs:

        # fit l1 and l2 models
        lr_l1 = LogisticRegression(C=c, penalty='l1', tol=0.01)
        lr_l2 = LogisticRegression(C=c, penalty='l2', tol=0.01)
        lr_l1.fit(train_vectors, train_labels)
        lr_l2.fit(train_vectors, train_labels)
        
        # store predictions
        pred_l1 = lr_l1.predict(dev_vectors)
        pred_l2 = lr_l2.predict(dev_vectors)
        
        print "-----------------"
        print "C = ", round(c,3)
        print "-----------------"
        
        print "LR L1 regularization: f1_score = %s" % (round(metrics.f1_score(dev_labels, pred_l1, average='binary'),4))
        print "LR L2 regularization: f1_score = %s" % (round(metrics.f1_score(dev_labels, pred_l2, average='binary'),4))
        print
        
        #print "lr_l1.coef_:", lr_l1.coef_
        #print "lr_l2.coef_:", lr_l2.coef_
        
        # take mean weight for each class
        # axis=0 refers to mean of each column across 4 rows in coef_
        # use as definition of sparsity
        vec1 = np.mean(lr_l1.coef_, axis=0)
        vec2 = np.mean(lr_l2.coef_, axis=0)
        
        #print "vec1:", vec1
        #print "vec2:", vec2
        
        print "LR L1 regularization: number of non-zero weights =", (vec1 != 0).sum()
        print "LR L2 regularization: number of non-zero weights =", (vec2 != 0).sum()
        print 
        
        # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score
        # score(X, y, sample_weight=None)
        # Returns the mean accuracy on the given dev or test data and labels
        # In multi-label classification, this is the subset accuracy which is a harsh metric 
        # since you require for each sample that each label set be correctly predicted.
        
        print "LR L1 regularization: accuracy = %.2f%%" % ((lr_l1.score(dev_vectors, dev_labels))*100)
        print "LR L2 regularization: accuracy = %.2f%%" % ((lr_l2.score(dev_vectors, dev_labels))*100)
        print
        
        #print "recheck", train_vectors.shape
        #print "recheck", train_labels.shape
        
        #---------------
        # re-train model
        #---------------
        
        # likely no need to use fit_transform again, as we still have our vocabulary in matrix format with token counts.
        # we simply select non-zero weighted features (from columns), and leave documents (from rows) as is.
        
        # first, only select features that have non-zero weights from L1 regularization.
        # vec1 includes weights for each feature (column).
        train_vectors_rt = train_vectors[:, vec1 != 0]
        dev_vectors_rt = dev_vectors[:, vec1 != 0]
        
        print train_vectors_rt
        
        #print "recheck", train_vectors_rt.shape
        #print "recheck", train_labels.shape
        
        #############
        # ERROR below due to 0 features
        #############
        
        '''
        lr_l2_rt = LogisticRegression(C=c, penalty='l2', tol=0.1)
    
        # refit our classifier to the model, so it can learn from the model
        lr_l2_rt.fit(train_vectors_rt, train_labels)
        pred_l2_rt = lr_l2_rt.predict(dev_vectors_rt)
        
        # take mean weight for each class
        # axis=0 refers to mean of each column across 4 rows in coef_
        # use as definition of sparsity
        vec_rt = np.mean(lr_l2_rt.coef_, axis=0)
        
        # append to vectors
        # note: try .score method (mean accuracy on the given test data and labels) rather than f1_score method,
        #        partly because sometimes the output cell shows a system automated warning about the f1_score
        accuracies.append((lr_l2_rt.score(dev_vectors_rt, dev_labels))*100)  
        vocab_size.append(train_vectors_rt.shape[1])
        
        print "***Re-trained model w/ L1 non-zero features***" 
        print "LR L2 regularization: f1_score = %s" % (round(metrics.f1_score(dev_labels, pred_l2_rt, average='binary'),4))
        print "LR L2 regularization: number of non-zero weights:", (vec_rt != 0).sum()
        print "LR L2 regularization: accuracy = %.2f%%" % ((lr_l2_rt.score(dev_vectors_rt, dev_labels))*100)
        print
        print "LR L2 regularization: vocab size:", (train_vectors_rt.shape[1])
        print
    
    #print accuracies
    #print vocab_size
    #print
    
    plt.scatter(vocab_size, accuracies)
    plt.ylabel('Accuracy')
    plt.xlabel('Vocabulary Size')
    plt.title('Relationship between Accuracy and Vocabulary Size')
    '''
    ### STUDENT END ###
fs()

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


-----------------
C =  0.01
-----------------
LR L1 regularization: f1_score = 0.0
LR L2 regularization: f1_score = 0.0

LR L1 regularization: number of non-zero weights = 0
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 74.12%
LR L2 regularization: accuracy = 74.12%


-----------------
C =  0.03
-----------------
LR L1 regularization: f1_score = 0.0
LR L2 regularization: f1_score = 0.0

LR L1 regularization: number of non-zero weights = 0
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 74.12%
LR L2 regularization: accuracy = 74.12%


-----------------
C =  0.05
-----------------
LR L1 regularization: f1_score = 0.0
LR L2 regularization: f1_score = 0.0

LR L1 regularization: number of non-zero weights = 0
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 74.12%
LR L2 regularization: accuracy = 74.12%


-----------------
C =  0.07
-----------------
LR L1 regula

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



LR L2 regularization: accuracy = 73.88%

  (0, 18)	0.131488281208
  (0, 33)	0.128616157726
  (0, 16)	0.0703176456145
  (0, 1)	0.0854815153819
  (0, 9)	0.0592093216538
  (0, 12)	0.072392176734
  (0, 36)	0.0684814651588
  (0, 29)	0.0769481275445
  (1, 18)	0.0773020500597
  (1, 12)	0.0851188199955
  (1, 23)	0.105034099048
  (1, 30)	0.143631285742
  (1, 8)	0.0930310837977
  (1, 6)	0.0909178660832
  (2, 33)	0.151850942384
  (2, 16)	0.166041358124
  (2, 32)	0.261775215436
  (2, 20)	0.215739055306
  (3, 18)	0.0791984096001
  (3, 8)	0.0953133050736
  (3, 34)	0.0796062520293
  (3, 19)	0.084367646913
  (5, 18)	0.0731041305069
  (5, 33)	0.0715073031098
  (5, 1)	0.0950510843857
  :	:
  (3965, 20)	0.092275179321
  (3965, 34)	0.100112224021
  (3965, 0)	0.0426428230597
  (3965, 7)	0.0388307252242
  (3965, 24)	0.0450654491918
  (3965, 31)	0.0779315345366
  (3965, 4)	0.0990794605162
  (3966, 33)	0.0780980246048
  (3966, 16)	0.085396256807
  (3966, 9)	0.0719059119959
  (3966, 20)	0.110956137545
  (3966

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



C =  0.7
-----------------
LR L1 regularization: f1_score = 0.043
LR L2 regularization: f1_score = 0.0475

LR L1 regularization: number of non-zero weights = 64
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 73.76%
LR L2 regularization: accuracy = 74.00%

  (0, 29)	0.131488281208
  (0, 57)	0.128616157726
  (0, 62)	0.0882030199794
  (0, 27)	0.0703176456145
  (0, 35)	0.0833097087161
  (0, 1)	0.0854815153819
  (0, 14)	0.0592093216538
  (0, 20)	0.072392176734
  (0, 61)	0.0684814651588
  (0, 51)	0.0769481275445
  (1, 29)	0.0773020500597
  (1, 20)	0.0851188199955
  (1, 40)	0.105034099048
  (1, 63)	0.128159558573
  (1, 3)	0.0875019879519
  (1, 52)	0.143631285742
  (1, 32)	0.086776955993
  (1, 12)	0.0930310837977
  (1, 9)	0.0909178660832
  (2, 57)	0.151850942384
  (2, 27)	0.166041358124
  (2, 56)	0.261775215436
  (2, 36)	0.215739055306
  (3, 29)	0.0791984096001
  (3, 63)	0.13130354507
  :	:
  (3966, 36)	0.110956137545
  (3966, 58)	0.0802531860333
  

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



C =  10.0
-----------------
LR L1 regularization: f1_score = 0.2674
LR L2 regularization: f1_score = 0.2371

LR L1 regularization: number of non-zero weights = 1794
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 67.04%
LR L2 regularization: accuracy = 70.40%

  (0, 317)	0.0864076489886
  (0, 112)	0.148521746893
  (0, 924)	0.131488281208
  (0, 1692)	0.128616157726
  (0, 867)	0.101201982341
  (0, 897)	0.120502336384
  (0, 32)	0.172363809879
  (0, 1781)	0.0882030199794
  (0, 1101)	0.0942274380032
  (0, 1426)	0.123566142645
  (0, 719)	0.113112789872
  (0, 858)	0.0703176456145
  (0, 172)	0.0924492814871
  (0, 1586)	0.102104460156
  (0, 57)	0.0854815153819
  (0, 1074)	0.120502336384
  (0, 618)	0.126248490588
  (0, 569)	0.159946626501
  (0, 139)	0.143097558545
  (0, 1730)	0.127495331414
  (0, 73)	0.144774641873
  (0, 1139)	0.0761613900474
  (0, 1024)	0.106899158437
  (0, 1273)	0.174978636747
  (0, 1211)	0.114075611446
  :	:
  (3971, 74)	0.192312101

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



C =  12.0
-----------------
LR L1 regularization: f1_score = 0.2753
LR L2 regularization: f1_score = 0.2376

LR L1 regularization: number of non-zero weights = 1851
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 67.10%
LR L2 regularization: accuracy = 70.11%

  (0, 330)	0.0864076489886
  (0, 117)	0.148521746893
  (0, 958)	0.131488281208
  (0, 1751)	0.128616157726
  (0, 899)	0.101201982341
  (0, 931)	0.120502336384
  (0, 33)	0.172363809879
  (0, 1839)	0.0882030199794
  (0, 1144)	0.0942274380032
  (0, 1475)	0.123566142645
  (0, 744)	0.113112789872
  (0, 889)	0.0703176456145
  (0, 180)	0.0924492814871
  (0, 1642)	0.102104460156
  (0, 60)	0.0854815153819
  (0, 1116)	0.120502336384
  (0, 647)	0.126248490588
  (0, 596)	0.159946626501
  (0, 145)	0.143097558545
  (0, 1789)	0.127495331414
  (0, 76)	0.144774641873
  (0, 1184)	0.0761613900474
  (0, 1062)	0.106899158437
  (0, 623)	0.127925573916
  (0, 1324)	0.174978636747
  :	:
  (3972, 939)	0.162926480

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



C =  50.0
-----------------
LR L1 regularization: f1_score = 0.2815
LR L2 regularization: f1_score = 0.2642

LR L1 regularization: number of non-zero weights = 2517
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 65.68%
LR L2 regularization: accuracy = 67.81%

  (0, 2085)	0.106899158437
  (0, 175)	0.148521746893
  (0, 1286)	0.131488281208
  (0, 2374)	0.128616157726
  (0, 1208)	0.101201982341
  (0, 1251)	0.120502336384
  (0, 40)	0.172363809879
  (0, 2500)	0.0882030199794
  (0, 1555)	0.0942274380032
  (0, 2003)	0.123566142645
  (0, 1001)	0.113112789872
  (0, 1192)	0.0703176456145
  (0, 1838)	0.0906046711917
  (0, 265)	0.0924492814871
  (0, 1439)	0.0833097087161
  (0, 2225)	0.102104460156
  (0, 914)	0.111514500995
  (0, 82)	0.0854815153819
  (0, 1513)	0.120502336384
  (0, 881)	0.126248490588
  (0, 1440)	0.156904152464
  (0, 820)	0.159946626501
  (0, 2021)	0.112645312642
  (0, 216)	0.143097558545
  (0, 104)	0.144774641873
  :	:
  (3972, 2484)	0.2

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



C =  70.0
-----------------
LR L1 regularization: f1_score = 0.2885
LR L2 regularization: f1_score = 0.2703

LR L1 regularization: number of non-zero weights = 2673
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 65.98%
LR L2 regularization: accuracy = 67.22%

  (0, 509)	0.0864076489886
  (0, 2217)	0.106899158437
  (0, 188)	0.148521746893
  (0, 1363)	0.131488281208
  (0, 2523)	0.128616157726
  (0, 1277)	0.101201982341
  (0, 1326)	0.120502336384
  (0, 48)	0.172363809879
  (0, 2653)	0.0882030199794
  (0, 1646)	0.0942274380032
  (0, 2126)	0.123566142645
  (0, 1066)	0.113112789872
  (0, 1260)	0.0703176456145
  (0, 1954)	0.0906046711917
  (0, 285)	0.0924492814871
  (0, 1526)	0.0833097087161
  (0, 2367)	0.102104460156
  (0, 973)	0.111514500995
  (0, 92)	0.0854815153819
  (0, 1605)	0.120502336384
  (0, 936)	0.126248490588
  (0, 1527)	0.156904152464
  (0, 871)	0.159946626501
  (0, 2147)	0.112645312642
  (0, 235)	0.143097558545
  :	:
  (3972, 1373)	0.

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)



-----------------
C =  300.0
-----------------
LR L1 regularization: f1_score = 0.2459
LR L2 regularization: f1_score = 0.2727

LR L1 regularization: number of non-zero weights = 4471
LR L2 regularization: number of non-zero weights = 12313

LR L1 regularization: accuracy = 64.56%
LR L2 regularization: accuracy = 66.04%

  (0, 853)	0.0864076489886
  (0, 3755)	0.106899158437
  (0, 328)	0.148521746893
  (0, 2292)	0.131488281208
  (0, 4228)	0.128616157726
  (0, 2154)	0.101201982341
  (0, 2235)	0.120502336384
  (0, 78)	0.172363809879
  (0, 4434)	0.0882030199794
  (0, 2766)	0.0942274380032
  (0, 3604)	0.123566142645
  (0, 1784)	0.113112789872
  (0, 2129)	0.0703176456145
  (0, 492)	0.0924492814871
  (0, 2188)	0.192075955732
  (0, 2563)	0.0833097087161
  (0, 3992)	0.102104460156
  (0, 1631)	0.111514500995
  (0, 158)	0.0854815153819
  (0, 3532)	0.139350453525
  (0, 2700)	0.120502336384
  (0, 1577)	0.126248490588
  (0, 2564)	0.156904152464
  (0, 1481)	0.159946626501
  (0, 3637)	0.112645312642


In [14]:
def fs2():
    
    ### STUDENT START ###
    
    # CountVectorizer:
    # Tokenize the documents and count the occurrences of token and return them as a sparse matrix

    # TfidfTransformer:
    # Apply Term Frequency Inverse Document Frequency normalization to a sparse matrix of occurrence counts
    
    # Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency
    # This is a common term weighting scheme in information retrieval, 
    # that has also found good use in document classification.
    # The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to 
    # scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically 
    # less informative than features that occur in a small fraction of the training corpus.

    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
    
    # apply the CountVectorizer fit_transform method -- which includes two methods in one -- on the train_data.
    # learn the vocabulary dictionary (all tokens from the raw documents) and return a matrix, 
    # extracting token counts to the cells.
    train_vectors = vectorizer.fit_transform(train_data)
    
    # apply the transform method to the dev_data
    dev_vectors = vectorizer.transform(dev_data)
    
    # transform train_data to matrix and print count of rows and columns
    # 2034 documents, 3064 words
    print "vocabulary size:", train_vectors.toarray().shape[1] 
    print
    
    lr = LogisticRegression(penalty='l2', C=100)
    lr.fit(train_vectors, train_labels)
    pred_4 = lr.predict(dev_vectors)

    # for each documents, store and print predicted probabilities that it belongs to each class
    
    # use the method, predict_proba
    
    '''
    Probability estimates.
    The returned estimates for all classes are ordered by the label of classes.
    For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class. 
    Else use a one-vs-rest approach, i.e., calculate the probability of each class assuming it to be positive using the logistic function 
    and normalize these values across all the classes.
    '''
    
    # for each document in dev_vectors, get their probability estimates for all classes 
    p = lr.predict_proba(dev_vectors)
    print p

    # create an empty vector
    
    p_max_rates = []
    R_rates = []
    
    # iterate over each row (document) of p
    for i, p_docs in enumerate(p):
        # p_docs is a 1x2 vector from p with a document's probability to each class on one row
        # take the document's probability of the correct label
        # dev_labels[10] will give dev_label of document 9
        p_correct_class = p_docs[dev_labels]
        # take the document's max probability across the 4 labels
        p_max = p_docs.max()

        p_max_rates.append(p_max)
        
        # calculate R
        R = p_max / p_correct_class

        # append to the R_rates vector
        R_rates.append(R)

    # create vector that have indices of top 3 R_rates
    
    #############
    # ERROR below due to shape of vectors
    #############
    
    print
    print dev_vectors.shape
    print R_rates[:5]
        
    '''
    
    #print sorted(range(dev_vectors.shape[0]), key=lambda i: R_rates[i], reverse=True)
            
    #print "dev indices of top 3 R_rates:", top3_index
    print
    print "dev labels"
    print "0: 'alt.atheism', 1: 'comp.graphics', 2: 'sci.space', 3: 'talk.religion.misc'"
    print
    for i in top3_index:
        
        # find index (0, 1, 2, or 3) of max probability within each row
        # np.argmax returns the indices of the maximum values along an axis
        index_max_prob = np.argmax(p[i,:])
                                   
        print "---------------------------------------------------------------------"
        print "W207 Results"
        print "------------"
        print "R_rate:", R_rates[i]
        print "label probabilities:", p[i,:]
        print "Max probability dev_label -> %s: %s" % (index_max_prob, dev_labels[index_max_prob])
        print "Correct dev_label -> %s" % (dev_labels)
        print "dev_data below:"
        print "---------------------------------------------------------------------"
        print
        print dev_data
        print
    '''

    ### STUDENT END ###
fs2()

vocabulary size: 12313

[[ 0.21100575  0.78899425]
 [ 0.99503696  0.00496304]
 [ 0.93736369  0.06263631]
 ..., 
 [ 0.98865797  0.01134203]
 [ 0.48297265  0.51702735]
 [ 0.18677316  0.81322684]]

(1696, 12313)
[array([[ 1.        ],
       [ 1.        ],
       [ 1.        ],
       ..., 
       [ 3.73920741],
       [ 3.73920741],
       [ 3.73920741]]), array([[ 200.48957473],
       [ 200.48957473],
       [ 200.48957473],
       ..., 
       [   1.        ],
       [   1.        ],
       [   1.        ]]), array([[ 14.96518094],
       [ 14.96518094],
       [ 14.96518094],
       ..., 
       [  1.        ],
       [  1.        ],
       [  1.        ]]), array([[ 23.72653157],
       [ 23.72653157],
       [ 23.72653157],
       ..., 
       [  1.        ],
       [  1.        ],
       [  1.        ]]), array([[ 241.09908341],
       [ 241.09908341],
       [ 241.09908341],
       ..., 
       [   1.        ],
       [   1.        ],
       [   1.        ]])]


  y = column_or_1d(y, warn=True)
