<center><H1>Final Project: Detecting Sarcasm on Reddit</H1>
<H3>Karina Lin, Joyce Zhao</H3></center>

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD 
from scipy.sparse import hstack
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

<p>First, we will import the raw dataset. Observe the first 10 entries.</p>

In [2]:
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head(10)

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...
5,0,"I don't pay attention to her, but as long as s...",only7inches,AskReddit,0,0,0,2016-09,2016-09-02 10:35:08,do you find ariana grande sexy ?
6,0,Trick or treating in general is just weird...,only7inches,AskReddit,1,-1,-1,2016-10,2016-10-23 21:43:03,What's your weird or unsettling Trick or Treat...
7,0,Blade Mastery+Masamune or GTFO!,P0k3rm4s7,FFBraveExvius,2,-1,-1,2016-10,2016-10-13 21:13:55,Probably Sephiroth. I refuse to taint his grea...
8,0,"You don't have to, you have a good build, buy ...",SoupToPots,pcmasterrace,1,-1,-1,2016-10,2016-10-27 19:11:06,What to upgrade? I have $500 to spend (mainly ...
9,0,I would love to see him at lolla.,chihawks,Lollapalooza,2,-1,-1,2016-11,2016-11-21 23:39:12,Probably count Kanye out Since the rest of his...


<p>This dataset, along with the label, includes 9 different features. For our purposes, we will only use comments and scores. There are 1,010,826 total raw entries. A label of 0 indicates a non-sarcastic comment, while 1 indicates a sarcastic comment.</p>

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
label             1010826 non-null int64
comment           1010773 non-null object
author            1010826 non-null object
subreddit         1010826 non-null object
score             1010826 non-null int64
ups               1010826 non-null int64
downs             1010826 non-null int64
date              1010826 non-null object
created_utc       1010826 non-null object
parent_comment    1010826 non-null object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB


<p>Then we will drop the entries with no comments</p>

In [4]:
train_df.dropna(subset=['comment'], inplace=True)

<p>Observe that the dataset is relatively balanced (even number of sarcastic vs. normal comments). </p>

In [5]:
train_df['label'].value_counts()

0    505405
1    505368
Name: label, dtype: int64

<p>Then we will create a new data frame that includes only the features we need (comments and scores).</p>

In [6]:
clean_df = train_df.drop(['author', 'subreddit', 'ups', 'downs','date','created_utc', 'parent_comment','label'], axis=1)

In [7]:
print(clean_df.shape)

(1010773, 2)


<p>We will be using a subset of this large dataset, so take 5% of the entire dataset</p>

In [109]:
_, data_X, _, data_y = train_test_split(clean_df, train_df['label'], test_size=0.01, random_state=24)

In [110]:
print (data_X.shape)

(10108, 2)


<p>From the 5% cut of the entire dataset, split into testing and training data.  We'll be using a ratio of 75% training and 25% testing.</p>

In [111]:
train_X, test_X, train_y, test_y = train_test_split(data_X, data_y, test_size=0.25, random_state=24)

<p>Observe the number of datapoints found in the training data.  The training data has two features (comment and score).</p>

In [112]:
print(train_X.shape)
print(train_X.values)

(7581, 2)
[[ "Wouldn't say he doesn't know shit lol and I'll tell u what everyone tells me when I make a joke on here...u forgot the"
  -3]
 [ "It wasn't supposed to be a salty tone, it was more a sarcastic one, but I forgot the"
  1]
 ['landing is launching from an acceleration perspective.' 8]
 ..., 
 ['The sounds still haunts me.' 2]
 ['Shame on you for asking an insightful question...' 4]
 ["Because It doesn't have a lot of support?" 0]]


<p>Next, we will featurize each comment with the Bag of Words approach. The TF-IDF Vectorizer tokenizes, counts, and normalizes the text. </p>


In [113]:
vectorizer = TfidfVectorizer() 
X_train = vectorizer.fit_transform(train_X.values[:,0])
X_test = vectorizer.fit_transform(test_X.values[:,0])


Observe now that there are now tens of thousands of features for each comment! The matrix of featurized text is sparse, because most comments will not use every word from other comments.

In [114]:
print(X_train.shape)

(7581, 11617)


In [115]:
y_train = np.array(train_y)
y_test = np.array(test_y)

In [116]:
print(y_train.shape)

(7581,)


Reduce the number of features using TruncatedSVD, which is an alternative to PCA for matrices that are sparse.

In [118]:
%%time
n_components = 1000
svd_train = TruncatedSVD(n_components=n_components).fit(X_train)
svd_test = TruncatedSVD(n_components=n_components).fit(X_test)

CPU times: user 32.3 s, sys: 3.35 s, total: 35.7 s
Wall time: 24.9 s


Observe a mediocre variance, as an ideal number is between .95 and .99.  However, we've chosen to sacrifice this for the sake of runtime and the ability to utilize a larger dataset.

In [119]:
print(svd_train.explained_variance_ratio_.sum() )
print(svd_test.explained_variance_ratio_.sum() )

0.597176461794
0.759673933312


In [120]:
X_train_svd = svd_train.transform(X_train)
X_test_svd = svd_test.transform(X_test)

print(X_train_svd.shape)

(7581, 1000)


In [121]:
print(X_train_svd.shape)
scores_train = train_X.values[:,1]
scores_test = test_X.values[:,1]
print(scores_train.shape)

(7581, 1000)
(7581,)


In [122]:
print(type(X_train_svd))
print(type(X_train))
print(X_train_svd.shape)
print(scores_train.shape)

<class 'numpy.ndarray'>
<class 'scipy.sparse.csr.csr_matrix'>
(7581, 1000)
(7581,)


Reshape scores so that we can hstack scores with featurized comments.

In [123]:
scores_train = scores_train.reshape((scores_train.shape[0],1))
features_train = np.hstack([X_train_svd, scores_train.astype(float)])

scores_test = scores_test.reshape((scores_test.shape[0],1))
features_test = np.hstack([X_test_svd, scores_test.astype(float)])

In [124]:
print(features_train.shape)

(7581, 1001)


In [125]:
scaler = StandardScaler(with_mean=False)
scaler.fit(features_train)
scale_train = scaler.transform(features_train)

scaler.fit(features_test)
scale_test = scaler.transform(features_test)

Train an MLP Classifier using 1 hidden layer with nodes equal to the number of features in the dataset.

In [None]:
%%time
clf = MLPClassifier(hidden_layer_sizes=(1001,))
clf.fit(scale_train, y_train)

In [None]:
clf.score(scale_train,y_train)
clf.score(scale_test,y_test)

In [None]:
accuracy = cross_val_score(clf, scaleTrain, y_train, cv=5)

In [None]:
print(accuracy)

<h1>SVM</h1>

In [126]:
print(scale_train.shape)

(7581, 1001)


In [127]:
from sklearn.svm import SVC

train_X_SVM, test_X_SVM, train_y_SVM, test_y_SVM = train_test_split(scale_train, y_train, test_size=0.25, random_state=24)

In [107]:
%%time
SVM = SVC()
SVM.fit(train_X_SVM, train_y_SVM)

CPU times: user 2h 34min 2s, sys: 59.5 s, total: 2h 35min 1s
Wall time: 2h 36min 29s


In [108]:
print(SVM.score(scale_train, y_train))
print(SVM.score(scale_test, y_test))

0.827749050232
0.50288880095


In [128]:
parameters = [1.0, 10.0, 100.0, 1000.0]

for x in parameters:
    for y in parameters:
        SVM = SVC(C=x, gamma=y)
        SVM.fit(train_X_SVM, train_y_SVM)
        print("******")
        print(x)
        print(y)
        print(SVM.score(scale_train, y_train))
        print(SVM.score(scale_test, y_test))

******
1.0
1.0
0.881018335312
0.510091017016
******
1.0
10.0
0.881414061469
0.510091017016
******
1.0
100.0
0.881677878908
0.510091017016
******
1.0
1000.0
0.881545970189
0.510091017016
******
10.0
1.0
0.881150244031
0.510091017016
******
10.0
10.0
0.881809787627
0.510091017016
******
10.0
100.0
0.881809787627
0.510091017016
******
10.0
1000.0
0.881545970189
0.510091017016
******
100.0
1.0
0.881545970189
0.510091017016
******
100.0
10.0
0.881809787627
0.510091017016
******
100.0
100.0
0.881809787627
0.510091017016
******
100.0
1000.0
0.881545970189
0.510091017016
******
1000.0
1.0
0.881677878908
0.510091017016
******
1000.0
10.0
0.881941696346
0.510091017016
******
1000.0
100.0
0.881809787627
0.510091017016
******
1000.0
1000.0
0.881545970189
0.510091017016
