<center><H1>Final Project: Detecting Sarcasm on Reddit</H1>
<H3>Karina Lin, Joyce Zhao</H3></center>

In [151]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score


<p>First, we will import the raw dataset. Observe the first 10 entries.</p>

In [152]:
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head(10)

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...
5,0,"I don't pay attention to her, but as long as s...",only7inches,AskReddit,0,0,0,2016-09,2016-09-02 10:35:08,do you find ariana grande sexy ?
6,0,Trick or treating in general is just weird...,only7inches,AskReddit,1,-1,-1,2016-10,2016-10-23 21:43:03,What's your weird or unsettling Trick or Treat...
7,0,Blade Mastery+Masamune or GTFO!,P0k3rm4s7,FFBraveExvius,2,-1,-1,2016-10,2016-10-13 21:13:55,Probably Sephiroth. I refuse to taint his grea...
8,0,"You don't have to, you have a good build, buy ...",SoupToPots,pcmasterrace,1,-1,-1,2016-10,2016-10-27 19:11:06,What to upgrade? I have $500 to spend (mainly ...
9,0,I would love to see him at lolla.,chihawks,Lollapalooza,2,-1,-1,2016-11,2016-11-21 23:39:12,Probably count Kanye out Since the rest of his...


<p>This dataset, along with the label, includes 9 different features. For our purposes, we will only use comments and scores. There are 1,010,826 total raw entries. A label of 0 indicates a non-sarcastic comment, while 1 indicates a sarcastic comment.</p>

In [153]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
label             1010826 non-null int64
comment           1010773 non-null object
author            1010826 non-null object
subreddit         1010826 non-null object
score             1010826 non-null int64
ups               1010826 non-null int64
downs             1010826 non-null int64
date              1010826 non-null object
created_utc       1010826 non-null object
parent_comment    1010826 non-null object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB


<p>We will drop the entries with no comments.</p>

In [154]:
train_df.dropna(subset=['comment'], inplace=True)

<p>Then we will create a new data frame that includes only the features we need (comments and scores).</p>

In [155]:
clean_df = train_df.drop(['author', 'subreddit', 'ups', 'downs','date','created_utc', 'parent_comment','label'], axis=1)

In [156]:
print(clean_df.shape)

(1010773, 2)


<p>We will be using a subset of this large dataset, so take 5% of the entire dataset. We are using train_test_split to isolate a subset of the entire dataset, because train_test_split returns a random subset.</p>

In [157]:
_, data_X, _, data_y = train_test_split(clean_df, train_df['label'], test_size=0.05, random_state=24)

In [158]:
print (data_X.shape)

(50539, 2)


<p>From the 5% cut of the entire dataset, split into testing and training data.  We'll be using a ratio of 75% training and 25% testing.</p>

In [159]:
train_X, test_X, train_y, test_y = train_test_split(data_X, data_y, test_size=0.25, random_state=24)

<p>Observe the number of datapoints found in the training data.  The training data has two features (comment and score).</p>

In [160]:
print(train_X.shape)

(37904, 2)


<p>Next, we will featurize each comment with the Bag of Words approach using uni- and bi-grams. The TF-IDF Vectorizer tokenizes, counts, and normalizes the text. </p>


In [161]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2)) 
X_train = vectorizer.fit_transform(train_X.values[:,0])
X_test = vectorizer.transform(test_X.values[:,0])

In [162]:
print(X_train.shape)
print(X_test.shape)

(37904, 205433)
(12635, 205433)


Observe now that there are now hundreds of thousands of features for each comment! The matrix of featurized text is sparse, because most comments will not use every word from other comments.

In [163]:
y_train = np.array(train_y)
y_test = np.array(test_y)

In [164]:
print(y_train.shape)
print(y_test.shape)

(37904,)
(12635,)


<h1>Logistic Regression</h1>

The first model we will use is a logistic regression classification. We will tune the regularization parameter <code>C</code> with values of to locate the value that yields the highest accuracy. In <code>sk-learn</code>'s logistic regression, the parameter of <code>C</code> refers to the <em>inverse</em> of regularization strength, so that smaller values of <code>C</code> specify stronger regularization. We will also perform 5-fold cross validation to reduce the effects of overfitting on our data. For logistic regression, we will not be performing dimensionality reduction.

In [165]:
from scipy.sparse import hstack

scores_train = train_X.values[:,1]
scores_test = test_X.values[:,1]

scores_train = scores_train.reshape((scores_train.shape[0],1))
features_train = hstack([X_train, scores_train.astype(float)])

scores_test = scores_test.reshape((scores_test.shape[0],1))
features_test = hstack([X_test, scores_test.astype(float)])

print(X_train.shape)
print(features_train.shape)

(37904, 205433)
(37904, 205434)


In [38]:
from sklearn.linear_model import LogisticRegression
from statistics import mean

reg = [0.1, 1, 3, 10, 30, 100, 300, 1000]
for i in reg: 
    logic = LogisticRegression(C=i)
    logic.fit(features_train, y_train)
    scores = cross_val_score(logic, features_train, y_train, cv=5)
    print("A regularization parameter of " + str(i) + " has an average accuracy of " + str(mean(scores)))

A regularization parameter of 0.1 has an average accuracy of 0.661830006658
A regularization parameter of 1 has an average accuracy of 0.679558904074
A regularization parameter of 3 has an average accuracy of 0.679585150101
A regularization parameter of 10 has an average accuracy of 0.674598903024
A regularization parameter of 30 has an average accuracy of 0.668082281624
A regularization parameter of 100 has an average accuracy of 0.663386386859
A regularization parameter of 300 has an average accuracy of 0.661011995084
A regularization parameter of 1000 has an average accuracy of 0.658611186775


From our results, we see that <code>C</code> = 3 yields the highest accuracy of 67.958%. We will then score our testing data with a regularization parameter of 1 using the F1 score (since our data set is unbalanced).

In [166]:
from sklearn.metrics import f1_score

logic = LogisticRegression(C=3)
logic.fit(features_train, y_train)
prediction = logic.predict(features_test)
f1 = f1_score(y_test, prediction)  
print(f1)

0.674065793676


The logistic regression yields an F1 score on 67.407% on the testing data. We will show the weights learned by the logistic regression (using the eli5 library).

In [167]:
import eli5

In [168]:
eli5.show_weights(estimator=logic, vec=vectorizer, top=(20,20))

Weight?,Feature
+8.831,obviously
+8.015,because
+6.659,clearly
+6.429,totally
+6.011,yeah
+5.040,yeah because
+4.969,duh
+4.671,yes because
+4.494,yea
+4.297,forgot


<h1>Neural Network</h1>

Next, we explore the use of a neural network to classify our data with a reduced number of dimensions. We will train an MLP Classifier using 1 hidden layer with nodes equal to the number of features in the dataset and the default activation rectified linear unit function. However, in this use case, we will use only uni-grams for the sake of runtime and number of dimensions. 

First, reduce the number of features to 1000 using TruncatedSVD, which is an alternative to PCA for matrices that are sparse.

In [18]:
%%time
from sklearn.decomposition import TruncatedSVD 
n_components = 1000
svd_train = TruncatedSVD(n_components=n_components).fit(X_train)
svd_test = TruncatedSVD(n_components=n_components).fit(X_test)

CPU times: user 8min 40s, sys: 1min 3s, total: 9min 43s
Wall time: 7min 14s


In [19]:
print(svd_train.explained_variance_ratio_.sum())

0.23628872447


Observe a mediocre variance of 0.2363, as an ideal number is between .95 and .99.  However, we've chosen to sacrifice this for the sake of runtime.

In [20]:
X_train_svd = svd_train.transform(X_train)
X_test_svd = svd_test.transform(X_test)

print(X_train_svd.shape)

(37904, 1000)


In [21]:
scores_train = train_X.values[:,1]
scores_test = test_X.values[:,1]
print(scores_train.shape)

(37904,)


Reshape scores so that we can hstack scores with featurized comments.

In [22]:
scores_train = scores_train.reshape((scores_train.shape[0],1))
features_train = np.hstack([X_train_svd, scores_train.astype(float)])

scores_test = scores_test.reshape((scores_test.shape[0],1))
features_test = np.hstack([X_test_svd, scores_test.astype(float)])
print(features_train.shape)

(37904, 1001)


We will scale the featurized comments with the attached comment scores.

In [23]:
scaler = StandardScaler(with_mean=False)
scaler.fit_transform(features_train)
scale_train = scaler.transform(features_train)

scaler.fit(features_test)
scale_test = scaler.transform(features_test)

In [24]:
from sklearn.neural_network import MLPClassifier

Our first neural network yields an F1 score of 50.509%. We will try to increase the performance by cross validating the training data and tuning the regularization parameter. 

In [None]:
%%time
from statistics import mean
reg = [0.001, 0.01, 0.1, 1, 10, 100]
for i in reg: 
    nn_clf = MLPClassifier(hidden_layer_sizes=(1001,), alpha=i)
    nn_clf.fit(scale_train, y_train)
    scores = cross_val_score(nn_clf, scale_train, y_train, cv=5)
    print(scores)
    print("An alpha " + str(i) + " has an average accuracy of " + str(mean(scores)))

[ 0.6549723   0.66363277  0.65914787  0.65065963  0.6525066 ]
An alpha 0.001 has an average accuracy of 0.656183833107
[ 0.64758639  0.65030999  0.65268434  0.65343008  0.63825858]
An alpha 0.01 has an average accuracy of 0.648453874219
[ 0.65549987  0.64490173  0.64503364  0.6378628   0.64709763]
An alpha 0.1 has an average accuracy of 0.646079131


From the observations above, a regularization constant of alpha = 0.001 (the default), yields the best cross-validation accuracy. We will then fit our model with this alpha.

In [25]:
%%time
nn_clf = MLPClassifier(hidden_layer_sizes=(1001,))
nn_clf.fit(scale_train, y_train)

CPU times: user 6min 58s, sys: 1min 25s, total: 8min 23s
Wall time: 4min 16s


In [26]:
from sklearn.metrics import f1_score
prediction = nn_clf.predict(scale_test)
f1 = f1_score(y_test, prediction)  
print(f1)

0.532210263609


<h1>Full-Dataset Logistic Regression</h1>

In [175]:
print(clean_df.shape)
full_train_X, full_test_X, full_train_y, full_test_y = train_test_split(clean_df, train_df['label'], test_size=0.25, random_state=24)

(1010773, 2)


In [176]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2)) 
full_X_train = vectorizer.fit_transform(full_train_X.values[:,0])
full_X_test = vectorizer.transform(full_test_X.values[:,0])

full_y_train = np.array(train_y)
full_y_test = np.array(test_y)

In [177]:
from scipy.sparse import hstack
scores_train = full_train_X.values[:,1]
scores_test = full_test_X.values[:,1]

scores_train = scores_train.reshape((scores_train.shape[0],1))
features_train = hstack([full_X_train, scores_train.astype(float)])

scores_test = scores_test.reshape((scores_test.shape[0],1))
features_test = hstack([full_X_test, scores_test.astype(float)])

print(scores_test.shape)
print(features_test.shape)


(252694, 1)
(252694, 1985175)


In [178]:
from sklearn.linear_model import LogisticRegression
from statistics import mean

print(features_train.shape)
print(full_train_y.shape)
reg = [0.1, 1, 3, 10]
for i in reg: 
    logic = LogisticRegression(C=i)
    logic.fit(features_train, full_train_y)
    scores = cross_val_score(logic, features_train, full_train_y, cv=5)
    print("A regularization parameter of " + str(i) + " has an average accuracy of " + str(mean(scores)))

(758079, 1985175)
(758079,)
A regularization parameter of 0.1 has an average accuracy of 0.701779102263
A regularization parameter of 1 has an average accuracy of 0.721340387328
A regularization parameter of 3 has an average accuracy of 0.720885290563
A regularization parameter of 10 has an average accuracy of 0.712573492953


We observe from our cross-validation that a regularization parameter of C = 1 yields the highest accuracy. We will then apply this value to our model and test.

In [179]:
logic = LogisticRegression(C=1)
logic.fit(features_train, full_train_y)
score = logic.score(features_test, full_test_y)
print(score)

0.724370186866
