<center><H1>Final Project: Detecting Sarcasm on Reddit</H1>
<H3>Karina Lin, Joyce Zhao</H3></center>

In [13]:
import os
import numpy as np
import pandas as pd

First, we will import the raw dataset. Observe the first 10 entries.

In [14]:
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head(10)

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...
5,0,"I don't pay attention to her, but as long as s...",only7inches,AskReddit,0,0,0,2016-09,2016-09-02 10:35:08,do you find ariana grande sexy ?
6,0,Trick or treating in general is just weird...,only7inches,AskReddit,1,-1,-1,2016-10,2016-10-23 21:43:03,What's your weird or unsettling Trick or Treat...
7,0,Blade Mastery+Masamune or GTFO!,P0k3rm4s7,FFBraveExvius,2,-1,-1,2016-10,2016-10-13 21:13:55,Probably Sephiroth. I refuse to taint his grea...
8,0,"You don't have to, you have a good build, buy ...",SoupToPots,pcmasterrace,1,-1,-1,2016-10,2016-10-27 19:11:06,What to upgrade? I have $500 to spend (mainly ...
9,0,I would love to see him at lolla.,chihawks,Lollapalooza,2,-1,-1,2016-11,2016-11-21 23:39:12,Probably count Kanye out Since the rest of his...


This dataset, along with the label, includes 9 different features. For our purposes, we will only use comments and scores. There are 1,010,826 total raw entries. A label of 0 indicates a non-sarcastic comment, while 1 indicates a sarcastic comment.

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
label             1010826 non-null int64
comment           1010773 non-null object
author            1010826 non-null object
subreddit         1010826 non-null object
score             1010826 non-null int64
ups               1010826 non-null int64
downs             1010826 non-null int64
date              1010826 non-null object
created_utc       1010826 non-null object
parent_comment    1010826 non-null object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB


Next, we drop the 53 entries whose comment is missing.

In [16]:
train_df.dropna(subset=['comment'], inplace=True)

Observe that the dataset is almost perfectly balanced: the counts of sarcastic and non-sarcastic comments differ by only a few dozen.

In [17]:
train_df['label'].value_counts()

0    505405
1    505368
Name: label, dtype: int64
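The two cleaning steps above (dropping rows with a missing comment, then checking the class balance) can be tried on a tiny frame first. This is only a sketch: `toy_df` and its contents are made up for illustration and are not part of the Reddit dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; one comment is missing on purpose.
toy_df = pd.DataFrame({
    "label": [0, 1, 0, 1, 1],
    "comment": ["fine", "oh great", np.nan, "sure, totally", "nice one"],
})

# Drop only the rows whose 'comment' is NaN; other columns are not inspected.
toy_clean = toy_df.dropna(subset=["comment"])

# normalize=True reports class proportions instead of raw counts,
# which makes the balance easier to read at a glance.
balance = toy_clean["label"].value_counts(normalize=True)
print(balance)
```

On the real data, the same `value_counts(normalize=True)` call should show both classes very close to 0.5.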

Then we will create a new data frame that includes only the features we need (comments and scores).

In [18]:
clean_df = train_df.drop(['author', 'subreddit', 'ups', 'downs','date','created_utc', 'parent_comment','label'], axis=1)

Split the cleaned data into training and testing sets. We first keep a random 5% subsample of the data, then split that subsample into 80% training and 20% testing.

In [71]:
from sklearn.model_selection import train_test_split
_, data_X, _, data_y = \
        train_test_split(clean_df, train_df['label'], test_size=0.05, random_state=24)

In [72]:
print(data_X.shape)

(50539, 2)


In [73]:
train_X, test_X, train_y, test_y = train_test_split(data_X, data_y, test_size=0.2, random_state=24)

In [74]:
print(train_X.shape)
print(train_X.values)

(40431, 2)
[['You fucking white racist.' 0]
 ["You're one of the few to see forest Gump" 27]
 [ "Great thing about twitter, you don't have to follow her if you find her intolerable :)."
  8]
 ..., 
 ['because Lynchburg college is so beautiful.....' 1]
 ["Because It doesn't have a lot of support?" 0]
 ['BUT WHAT ABOUT THE TRUCK?' 2]]
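The two `train_test_split` calls above compose: the first keeps a 5% subsample, the second splits that subsample 80/20, so the model trains on roughly 4% of the full data and holds out roughly 1%. A minimal sketch of the same arithmetic on synthetic arrays (the `toy_*` names and the 1,000-row size are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data: 1,000 rows, alternating labels.
toy_X = np.arange(1000).reshape(-1, 1)
toy_y = np.array([0, 1] * 500)

# Stage 1: discard the "train" part and keep only the 5% "test" part.
_, toy_small_X, _, toy_small_y = train_test_split(
    toy_X, toy_y, test_size=0.05, random_state=24)

# Stage 2: split the 5% subsample into 80% train / 20% test.
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_small_X, toy_small_y, test_size=0.2, random_state=24)

print(toy_small_X.shape, toy_train_X.shape, toy_test_X.shape)
```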


Next, we will featurize each comment with the Bag of Words approach. The TF-IDF Vectorizer tokenizes, counts, and normalizes the text. 

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer() 
X_train = vectorizer.fit_transform(train_X.values[:,0])

Observe now that there are 30,279 features for each comment! The matrix of featurized text is sparse, because each comment uses only a small fraction of the words in the whole vocabulary.

In [76]:
print(X_train.shape)

(40431, 30279)
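To see what the vectorizer produces, here is a small sketch on three made-up comments (the `toy_*` names are hypothetical). Each column is one vocabulary word, each row is L2-normalized by default, and most entries are zero.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the dataset.
toy_comments = [
    "oh great another monday",
    "great game last night",
    "another great idea",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_comments)  # scipy sparse matrix

# One column per distinct token (default tokenizer keeps words of 2+ characters).
toy_vocab = sorted(toy_vectorizer.vocabulary_)

# Fraction of entries that are actually stored (non-zero).
toy_density = toy_matrix.nnz / (toy_matrix.shape[0] * toy_matrix.shape[1])

# Row norms: TfidfVectorizer uses norm='l2' by default.
toy_row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))

print(toy_matrix.shape, toy_vocab, toy_density)
```

With tens of thousands of real comments the vocabulary grows much faster than the length of any single comment, which is why the density drops so low.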


In [77]:
y_train = np.array(train_y)

In [78]:
y_train.shape

(40431,)

In [79]:
%%time
from sklearn.decomposition import TruncatedSVD # PCA does not support sparse input
n_components = 1000
svd = TruncatedSVD(n_components=n_components).fit(X_train)

CPU times: user 1min 16s, sys: 8.83 s, total: 1min 25s
Wall time: 1min 3s


In [80]:
svd.explained_variance_ratio_.sum() 

0.51065567283490676

In [81]:
X_train_svd = svd.transform(X_train)
print(X_train_svd.shape)

(40431, 1000)
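The explained variance ratio above says that 1,000 components retain about half of the variance in the TF-IDF matrix. A rough way to explore that trade-off is to fit `TruncatedSVD` for several values of `n_components` and compare the sums. The sketch below does this on a small random sparse matrix; the matrix, its size, and the candidate values of `k` are all assumptions chosen so it runs quickly.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix as a stand-in for TF-IDF features (about 10% non-zero).
rng = np.random.RandomState(0)
toy_dense = rng.rand(200, 50) * (rng.rand(200, 50) < 0.1)
toy_sparse = csr_matrix(toy_dense)

# Fit one SVD per candidate k and record the retained variance.
toy_ratios = {}
for k in (5, 20, 40):
    toy_svd = TruncatedSVD(n_components=k, random_state=0).fit(toy_sparse)
    toy_ratios[k] = toy_svd.explained_variance_ratio_.sum()
    print(k, toy_ratios[k])

# transform() returns a dense array with k columns.
toy_reduced = TruncatedSVD(n_components=5, random_state=0).fit_transform(toy_sparse)
print(toy_reduced.shape)
```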


In [82]:
print(X_train_svd.shape)
scores = train_X.values[:,1]
print(scores.shape)

(40431, 1000)
(40431,)


In [83]:
print(X_train_svd.shape)
print(scores[:, None])
scores1 = scores.reshape((scores.shape[0],1))

(40431, 1000)
[[0]
 [27]
 [8]
 ..., 
 [1]
 [0]
 [2]]


In [84]:
print(type(X_train_svd))
print(type(X_train))

<class 'numpy.ndarray'>
<class 'scipy.sparse.csr.csr_matrix'>


In [85]:
# X_train_svd is a dense ndarray, so we use np.hstack (scipy.sparse.hstack is only needed for sparse inputs)
print(X_train_svd.shape)
print(scores1.shape)
features = np.hstack([X_train_svd,scores1.astype(float)])

(40431, 1000)
(40431, 1)


In [86]:
print(features.shape)

(40431, 1001)
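Appending the score as one extra column works with `np.hstack` here because the SVD output is a dense array. If the features were still sparse (for example, the raw TF-IDF matrix), `scipy.sparse.hstack` would be the analogous call. A small sketch with made-up numbers showing that both give the same result:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack as sparse_hstack

# Hypothetical reduced text features (3 comments, 2 components).
toy_text = np.array([[0.1, 0.2],
                     [0.3, 0.4],
                     [0.5, 0.6]])

# Scores come out of DataFrame.values as an object array; cast to float
# and reshape to a column so the shapes line up for stacking.
toy_scores = np.array([0, 27, 8], dtype=object)
toy_score_col = toy_scores.astype(float).reshape(-1, 1)

# Dense route (what the notebook does).
toy_dense_features = np.hstack([toy_text, toy_score_col])

# Sparse route (would apply if the text features were kept sparse).
toy_sparse_features = sparse_hstack(
    [csr_matrix(toy_text), csr_matrix(toy_score_col)]).tocsr()

print(toy_dense_features.shape, toy_sparse_features.shape)
```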


In [87]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)
scaler.fit(features)
scaleTrain = scaler.transform(features)

In [88]:
print(scaleTrain.shape)

(40431, 1001)
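Note that `with_mean=False` only rescales each column to unit variance; it does not center it. That option is required for sparse input, but our `features` array is dense, so full standardization would also be possible. The sketch below contrasts the two settings on a made-up array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with two columns on very different scales.
toy_features = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 30.0]])

# Scale only: divide each column by its standard deviation.
toy_scaled_only = StandardScaler(with_mean=False).fit_transform(toy_features)

# Center and scale: subtract the column mean first, then divide.
toy_standardized = StandardScaler().fit_transform(toy_features)

print(toy_scaled_only.mean(axis=0), toy_scaled_only.std(axis=0))
print(toy_standardized.mean(axis=0), toy_standardized.std(axis=0))
```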


In [90]:
%%time
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(1001,))
clf.fit(scaleTrain, y_train) 

CPU times: user 8min 48s, sys: 1min, total: 9min 48s
Wall time: 5min 8s


In [91]:
clf.score(scaleTrain,y_train)

0.97078974054562095

In [None]:
from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(clf, scaleTrain, y_train, cv=5)

In [69]:
print(accuracy)

[ 0.59147095  0.58714462  0.59431045  0.58750773  0.58910891]
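The cross-validated accuracy (about 0.59) is far below the training accuracy (about 0.97), which suggests the network is overfitting. Note also that the vectorizer, SVD, and scaler above were fit on the full training set before cross-validation, so each validation fold had already influenced the preprocessing. One possible way to keep every fold separate is to wrap all steps in a scikit-learn `Pipeline`. The sketch below is an assumption-laden toy version: the corpus is invented, the layer sizes are tiny so it runs in seconds, and it uses the comment text only (the score column is left out for brevity).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-corpus: 6 "sarcastic" (1) and 6 "plain" (0) comments.
toy_comments = [
    "oh great another monday morning",
    "wow what a totally brilliant plan",
    "yeah because that always works so well",
    "sure that sounds like a fantastic idea",
    "oh wonderful more homework tonight",
    "great job breaking the build again",
    "the game starts at seven tonight",
    "i think the second season was better",
    "you can find the manual on their website",
    "that team played well last night",
    "the update fixed the crash for me",
    "this recipe needs a little more salt",
]
toy_labels = [1] * 6 + [0] * 6

# Every step is re-fit on the training part of each fold only.
toy_pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=5, random_state=0),
    StandardScaler(with_mean=False),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
)

toy_cv_scores = cross_val_score(toy_pipe, toy_comments, toy_labels, cv=3)
print(toy_cv_scores)
```

With a corpus this small the scores themselves are not meaningful; the point is only the structure, which would carry over to `train_X` and `train_y` with larger `n_components` and hidden layers.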
