**Juhwan Lee**

CS410: Natural Language Processing

Assignment 1: Sarcasm Classifier



Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

Data Loading

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [2]:
df = pd.read_json('Sarcasm_Headlines.json', lines=True)
df.rename(columns={'headline': 'text'}, inplace=True)
df.head()

Unnamed: 0,is_sarcastic,text,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...


Feature Engineering

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [3]:
# regex=False treats each pattern as a literal string ('?' is a regex metacharacter)
df['text'] = df['text'].str.replace('!', ' exclamation ', regex=False)
df['text'] = df['text'].str.replace('?', ' question ', regex=False)
df['text'] = df['text'].str.replace("'", ' singlequote ', regex=False)
df['text'] = df['text'].str.replace('"', ' doublequote ', regex=False)
df['text'] = df['text'].str.replace(':', ' colon ', regex=False)
df['text'] = df['text'].str.replace(';', ' semicolon ', regex=False)
df.head()

Unnamed: 0,is_sarcastic,text,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies colon 9 deliciously differen...,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word singl...,https://www.theonion.com/mother-comes-pretty-c...


Here we replace punctuation marks with words. With its default settings, CountVectorizer discards punctuation during tokenization, so marks such as '!' and '?' never become features. Replacing each mark with a word (for example, ' exclamation ') lets the vectorizer keep it as a token, which gives the model more features. With this step, the F1 score and accuracy both went up by approximately 2%.
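
To make the effect concrete, the sketch below applies CountVectorizer's default token pattern, `(?u)\b\w\w+\b`, through Python's `re` module to one headline from the data set, before and after the replacement. It only imitates the tokenization step (lowercasing plus the regex), and the helper names `replace_punctuation` and `tokenize` are illustrative assumptions, not part of the notebook.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# a token is a run of two or more word characters, so punctuation is dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Same replacements as in cell In [3].
REPLACEMENTS = {
    "!": " exclamation ",
    "?": " question ",
    "'": " singlequote ",
    '"': " doublequote ",
    ":": " colon ",
    ";": " semicolon ",
}

def replace_punctuation(text):
    """Replace each punctuation mark with its word form (hypothetical helper)."""
    for mark, word in REPLACEMENTS.items():
        text = text.replace(mark, word)
    return text

def tokenize(text):
    """Imitate CountVectorizer's default lowercasing and tokenization."""
    return TOKEN_PATTERN.findall(text.lower())

headline = "eat your veggies: 9 deliciously different recipes"
before = tokenize(headline)
after = tokenize(replace_punctuation(headline))
print(before)  # the colon is lost
print(after)   # 'colon' survives as a token
```

Note that the single-character token "9" is dropped in both cases, because the default pattern requires at least two word characters.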

Text To Numeric Data (N-grams)

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [4]:
#vectorizer = CountVectorizer(ngram_range=(1,1)) #unigram
vectorizer = CountVectorizer(ngram_range=(1,2)) #unigram + bigram
#vectorizer = CountVectorizer(ngram_range=(1,3)) #unigram + bigram + trigram
bow = vectorizer.fit_transform(df['text'])
bow.shape

(28619, 197074)

Here we use N-grams. There are three options: unigram, unigram + bigram, and unigram + bigram + trigram. A wider N-gram range adds word-sequence features, so the model has more context to work with, and we expect the F1 score and accuracy to increase. However, more features do not necessarily mean higher accuracy.
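
As a rough illustration of how the feature count grows with the N-gram range, the sketch below builds word N-grams for a single tokenized headline in plain Python. The function `ngrams` is a hypothetical helper that mirrors the idea behind `ngram_range`; the real vocabulary is built from the unique N-grams over the whole corpus.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "eat your veggies colon deliciously different recipes".split()

for n_max in (1, 2, 3):
    feats = ngrams(tokens, 1, n_max)
    print(f"ngram_range=(1,{n_max}): {len(feats)} n-grams")
# ngram_range=(1,1): 7 n-grams
# ngram_range=(1,2): 13 n-grams
# ngram_range=(1,3): 18 n-grams
```

Each step adds fewer new N-grams per headline, and longer N-grams tend to be rarer across the corpus, which is one reason the extra features may not translate into higher scores.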

Features and Labels

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [5]:
X = bow
y = df['is_sarcastic']
y.shape

(28619,)

Create variables for 10-fold cross validation

Source
*   https://github.com/codebasics/py/blob/master/ML/12_KFold_Cross_Validation/12_k_fold.ipynb



In [6]:
folds = StratifiedKFold(n_splits=10)

f1_NB = []
accuracy_NB = []
f1_SVM = []
accuracy_SVM = []

10-fold cross validation (Naive Bayes)

Source
*   https://github.com/codebasics/py/blob/master/ML/12_KFold_Cross_Validation/12_k_fold.ipynb



In [7]:
clf = MultinomialNB()

for train_index, test_index in folds.split(X, y):
    # .iloc selects labels by position, matching the positional indices from split()
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y.iloc[train_index], y.iloc[test_index]
    
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_NB.append(f1_score(y_test, y_pred))
    accuracy_NB.append(accuracy_score(y_test, y_pred))

Print results for Naive Bayes

In [8]:
print("Naive Bayes: ")
print("F1 score: ", np.average(f1_NB))
print("Accuracy score: ", np.average(accuracy_NB))

Naive Bayes: 
F1 score:  0.8586560554105048
Accuracy score:  0.8678499207760648


10-fold cross validation (SVM)

Source
*   https://github.com/codebasics/py/blob/master/ML/12_KFold_Cross_Validation/12_k_fold.ipynb



In [9]:
clf = SVC(kernel='linear')

for train_index, test_index in folds.split(X, y):
    # .iloc selects labels by position, matching the positional indices from split()
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y.iloc[train_index], y.iloc[test_index]
    
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_SVM.append(f1_score(y_test, y_pred))
    accuracy_SVM.append(accuracy_score(y_test, y_pred))

Print results for SVM

In [10]:
print("SVM: ")
print("F1 score: ", np.average(f1_SVM))
print("Accuracy score: ", np.average(accuracy_SVM))

SVM: 
F1 score:  0.8614098850392375
Accuracy score:  0.8684090680935036


Evaluation

Adding bigrams to the unigram features improved the F1 score and accuracy of both models. Adding trigrams on top changed little: Naive Bayes improved slightly, while SVM dropped slightly. With unigrams only, Naive Bayes performed better than SVM. With the unigram + bigram combination, SVM performed better than Naive Bayes on both metrics. With the unigram + bigram + trigram combination, SVM had the higher F1 score, but Naive Bayes had a marginally higher accuracy. The best performance came from the unigram + bigram combination with the SVM model. The detailed results are below.

Unigram
- NB:
	F1 score:  0.8457815371727246
	Accuracy score:  0.8554807648388861
- SVM:
	F1 score:  0.8243749896980166
	Accuracy score:  0.8351092220470917

Unigram + Bigram
- NB:
	F1 score:  0.8586560554105048
	Accuracy score:  0.8678499207760648
- SVM:
	F1 score:  0.8614098850392375
	Accuracy score:  0.8684090680935036

Unigram + Bigram + Trigram
- NB:
	F1 score:  0.8593461711938227
	Accuracy score:  0.8682692690514207
- SVM:
	F1 score:  0.8612576227287649
	Accuracy score:  0.8681994000621872
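
The comparison can be checked with a small script. The sketch below copies the mean 10-fold scores listed above into a dictionary (the variable names are illustrative) and reports the best configuration per metric, along with the SVM minus Naive Bayes gap for each feature set.

```python
# Mean 10-fold scores copied from the results listed above: (F1, accuracy).
scores = {
    ("unigram", "NB"): (0.8457815371727246, 0.8554807648388861),
    ("unigram", "SVM"): (0.8243749896980166, 0.8351092220470917),
    ("unigram + bigram", "NB"): (0.8586560554105048, 0.8678499207760648),
    ("unigram + bigram", "SVM"): (0.8614098850392375, 0.8684090680935036),
    ("unigram + bigram + trigram", "NB"): (0.8593461711938227, 0.8682692690514207),
    ("unigram + bigram + trigram", "SVM"): (0.8612576227287649, 0.8681994000621872),
}

# Best configuration for each metric.
best_f1 = max(scores, key=lambda k: scores[k][0])
best_acc = max(scores, key=lambda k: scores[k][1])
print("Best F1:      ", best_f1)
print("Best accuracy:", best_acc)

# SVM minus NB for each feature set; a positive value means SVM is ahead.
gaps = {}
for features in ("unigram", "unigram + bigram", "unigram + bigram + trigram"):
    nb_f1, nb_acc = scores[(features, "NB")]
    svm_f1, svm_acc = scores[(features, "SVM")]
    gaps[features] = (svm_f1 - nb_f1, svm_acc - nb_acc)
    print(f"{features}: F1 gap {gaps[features][0]:+.4f}, accuracy gap {gaps[features][1]:+.4f}")
```

The gaps in the bigram and trigram settings are well below one percentage point, so they may fall within fold-to-fold variation.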


Error Analysis

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [11]:
# y_pred and y_test hold only the last fold of the most recent loop (SVM),
# so this analysis covers that single test fold.
results = pd.DataFrame(data={'predicted': y_pred, 'actual': y_test})
predictions = results.join(df)

# A prediction is correct when it matches the actual label
predictions['correct'] = predictions['predicted'] == predictions['actual']
predictions = predictions[['text','predicted','actual','correct']]

# Show 10 randomly chosen misclassified headlines
predictions[~predictions['correct']].sample(10)

Unnamed: 0,text,predicted,actual,correct
27584,mark zuckerberg prepares for congressional tes...,0,1,False
28163,new jersey supreme court rules the bastard had...,0,1,False
28609,bakery owner vows to stop making wedding cakes...,1,0,False
27833,damon albarn gets carried off stage in denmark...,1,0,False
28230,postmaster general colon singlequote letter ...,0,1,False
26802,obama calls for turret-mounted video cameras o...,0,1,False
26758,florida resort allows guests to swim with miam...,0,1,False
27017,gorgeous new nasa image shows earth singlequo...,1,0,False
25806,high school freshman thinks singlequote romeo...,0,1,False
26513,dick van dyke surprises denny singlequote s pa...,1,0,False


Here are 10 samples from the wrong predictions. In my opinion, counting how many times a word appears in a headline is not enough for sarcasm detection. To give the learning model more information, N-grams and punctuation-to-text replacement were used. They gave a slight improvement in accuracy and F1 score, but not enough to reach performance in the high 90% range. Since sarcasm detection is closely related to emotion, accuracy and F1 score would likely improve if NRC emotion features were used to provide emotion words and emotion intensity as additional features to the learning model.
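
A minimal sketch of what such emotion features might look like is given below. It is not an implementation of the NRC lexicons: `TOY_LEXICON` is a hypothetical stand-in with invented entries and scores, and `emotion_features` is an illustrative helper. The idea is only to show how per-emotion word counts and summed intensities could be derived for each headline.

```python
# Hypothetical toy lexicon: word -> {emotion: intensity in [0, 1]}.
# These entries are invented for illustration and are NOT taken from the NRC lexicons.
TOY_LEXICON = {
    "doomsday": {"fear": 0.9},
    "liar": {"anger": 0.6, "disgust": 0.5},
    "gorgeous": {"joy": 0.8},
}
EMOTIONS = ("anger", "disgust", "fear", "joy")

def emotion_features(text):
    """Return [count, summed intensity] for each emotion, in EMOTIONS order."""
    counts = {e: 0 for e in EMOTIONS}
    intensity = {e: 0.0 for e in EMOTIONS}
    for token in text.lower().split():
        for emotion, score in TOY_LEXICON.get(token, {}).items():
            counts[emotion] += 1
            intensity[emotion] += score
    features = []
    for e in EMOTIONS:
        features.extend([counts[e], intensity[e]])
    return features

example = emotion_features("inclement weather prevents liar")
print(example)  # [1, 0.6, 1, 0.5, 0, 0.0, 0, 0.0]
```

Columns like these could then be appended to the bag-of-words matrix, for example with `scipy.sparse.hstack`, before running the same cross-validation loops. Whether this helps on this data set would need to be tested.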