**Juhwan Lee**

CS410: Natural Language Processing

Assignment 1: Sarcasm Classifier



Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

Data Loading

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [2]:
df = pd.read_json('Sarcasm_Headlines.json', lines=True)
df.rename(columns={'headline': 'text'}, inplace=True)
df.head()

Unnamed: 0,is_sarcastic,text,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...


Feature Engineering

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [3]:
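# Spell out punctuation as words so the vectorizer keeps it as features.
# regex=False treats each pattern as a literal string ('?' is a regex metacharacter).
df['text'] = df['text'].str.replace('!', ' exclamation ', regex=False)
df['text'] = df['text'].str.replace('?', ' question ', regex=False)
df['text'] = df['text'].str.replace('\'', ' singlequote ', regex=False)
df['text'] = df['text'].str.replace('\"', ' doublequote ', regex=False)
df['text'] = df['text'].str.replace(':', ' colon ', regex=False)
df['text'] = df['text'].str.replace(';', ' semicolon ', regex=False)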
df.head()

Unnamed: 0,is_sarcastic,text,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies colon 9 deliciously differen...,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word singl...,https://www.theonion.com/mother-comes-pretty-c...


Replacing punctuation with words improves the performance of the model. The vectorizer's default tokenizer discards punctuation, so spelling each mark out as a word preserves it as a feature. This resulted in an increase of about 2% in both the F-score and accuracy.
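
The effect can be illustrated without the dataset. The sketch below is not the notebook's code: it emulates `CountVectorizer`'s documented default `token_pattern`, `(?u)\b\w\w+\b`, with the standard `re` module, and the helper names `tokenize` and `spell_out_punctuation` are hypothetical.

```python
import re

# CountVectorizer's default token_pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# The same punctuation-to-word replacements used in the cell above.
REPLACEMENTS = {
    "!": " exclamation ",
    "?": " question ",
    "'": " singlequote ",
    '"': " doublequote ",
    ":": " colon ",
    ";": " semicolon ",
}


def tokenize(text):
    """Lowercase and tokenize roughly the way CountVectorizer does by default."""
    return TOKEN_PATTERN.findall(text.lower())


def spell_out_punctuation(text):
    """Replace each punctuation mark with its word form."""
    for mark, word in REPLACEMENTS.items():
        text = text.replace(mark, word)
    return text


headline = "eat your veggies: 9 deliciously different recipes"
raw_tokens = tokenize(headline)
spelled_tokens = tokenize(spell_out_punctuation(headline))

print(raw_tokens)      # the colon is lost
print(spelled_tokens)  # 'colon' survives as a token
```

The single-character token `9` is dropped in both cases because the default pattern requires at least two word characters.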

Text To Numeric Data (N-grams)

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [4]:
#vectorizer = CountVectorizer(ngram_range=(1,1)) #unigram
vectorizer = CountVectorizer(ngram_range=(1,2)) #unigram + bigram
#vectorizer = CountVectorizer(ngram_range=(1,3)) #unigram + bigram + trigram
bow = vectorizer.fit_transform(df['text'])
bow.shape

(28619, 197074)

Widening the N-gram range from unigrams to unigrams + bigrams, and then to unigrams + bigrams + trigrams, increases the number of features the model has to work with, so an increase in F-score and accuracy is expected. However, more features do not guarantee higher accuracy.
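
To make the growth in features concrete, here is a minimal sketch of word N-gram extraction in plain Python. The function `word_ngrams` is a hypothetical helper, not part of scikit-learn; its `min_n` and `max_n` arguments mirror the two values of `ngram_range`.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["mother", "comes", "pretty", "close"]

unigrams = word_ngrams(tokens, 1, 1)    # like ngram_range=(1,1)
uni_bi = word_ngrams(tokens, 1, 2)      # like ngram_range=(1,2)
uni_bi_tri = word_ngrams(tokens, 1, 3)  # like ngram_range=(1,3)

print(len(unigrams), len(uni_bi), len(uni_bi_tri))
print(uni_bi)
```

Each wider range keeps all the shorter N-grams and adds longer ones, which is why the vocabulary over the whole corpus grows so quickly.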

Features and Labels

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [5]:
X = bow
y = df['is_sarcastic']
y.shape

(28619,)

Create variables for 10-fold cross validation

Source
*   https://github.com/codebasics/py/blob/master/ML/12_KFold_Cross_Validation/12_k_fold.ipynb



In [6]:
folds = StratifiedKFold(n_splits=10)

f1_NB = []
accuracy_NB = []
f1_SVM = []
accuracy_SVM = []
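
Stratification means every fold keeps roughly the same class ratio as the full dataset. The sketch below shows the idea with a simple round-robin assignment per class. It is a simplified illustration under that assumption, not scikit-learn's actual allocation, and `stratified_folds` and the toy labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so each fold keeps roughly the overall class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    assigned = [[] for _ in range(n_splits)]
    # Deal the indices of each class out to the folds in turn.
    for indices in by_class.values():
        for position, index in enumerate(indices):
            assigned[position % n_splits].append(index)
    return assigned


# Toy labels: 10 sarcastic (1) and 20 non-sarcastic (0) samples.
toy_labels = [1, 0] * 10 + [0] * 10
toy_folds = stratified_folds(toy_labels, n_splits=5)

for fold in toy_folds:
    positives = sum(toy_labels[i] for i in fold)
    print(len(fold), positives)
```

Each toy fold holds 6 samples, 2 of them sarcastic, matching the 1:2 ratio of the full toy label list.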

10-fold cross validation (Naive Bayes)

Source
*   https://github.com/codebasics/py/blob/master/ML/12_KFold_Cross_Validation/12_k_fold.ipynb



In [7]:
clf = MultinomialNB()

for train_index, test_index in folds.split(X, y):
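    # Select rows by position: X is a sparse matrix, and .iloc does the same for the label Series.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]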
    
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_NB.append(f1_score(y_test, y_pred))
    accuracy_NB.append(accuracy_score(y_test, y_pred))

Print results for Naive Bayes

In [8]:
print("Naive Bayes: ")
print("F1 score: ", np.average(f1_NB))
print("Accuracy score: ", np.average(accuracy_NB))

Naive Bayes: 
F1 score:  0.8586560554105048
Accuracy score:  0.8678499207760648
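
For reference, the two metrics can be computed by hand from the prediction counts. With its default arguments, `f1_score` scores the positive class (label 1, here "sarcastic"). The sketch below uses made-up toy labels, and `binary_f1_and_accuracy` is a hypothetical helper, not a scikit-learn function.

```python
def binary_f1_and_accuracy(y_true, y_pred):
    """Compute F1 for the positive class (label 1) and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return f1, accuracy


# Toy labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

toy_f1, toy_accuracy = binary_f1_and_accuracy(toy_true, toy_pred)
print(toy_f1, toy_accuracy)
```

The notebook reports the mean of these per-fold values over the 10 folds.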


10-fold cross validation (SVM)

Source
*   https://github.com/codebasics/py/blob/master/ML/12_KFold_Cross_Validation/12_k_fold.ipynb



In [9]:
clf = SVC(kernel='linear')

for train_index, test_index in folds.split(X, y):
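    # Select rows by position: X is a sparse matrix, and .iloc does the same for the label Series.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]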
    
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1_SVM.append(f1_score(y_test, y_pred))
    accuracy_SVM.append(accuracy_score(y_test, y_pred))

Print results for SVM

In [10]:
print("SVM: ")
print("F1 score: ", np.average(f1_SVM))
print("Accuracy score: ", np.average(accuracy_SVM))

SVM: 
F1 score:  0.8614098850392375
Accuracy score:  0.8684090680935036


Evaluation

Adding bigrams to the unigram features increased the F-score and accuracy for both models. Adding trigrams on top of that gave a further, very small gain for Naive Bayes and a very small drop for SVM. When using only unigrams, the Naive Bayes model performed better than the SVM model. With the unigram + bigram combination, the SVM model performed better than Naive Bayes on both metrics. With the unigram + bigram + trigram combination, the SVM model had the higher F-score, while Naive Bayes had a marginally higher accuracy. The best performance was achieved with the unigram + bigram combination on the SVM model. The detailed results are provided below.

Unigram
- NB:
	F1 score:  0.8457815371727246
	Accuracy score:  0.8554807648388861
- SVM:
	F1 score:  0.8243749896980166
	Accuracy score:  0.8351092220470917

Unigram + Bigram
- NB:
	F1 score:  0.8586560554105048
	Accuracy score:  0.8678499207760648
- SVM:
	F1 score:  0.8614098850392375
	Accuracy score:  0.8684090680935036

Unigram + Bigram + Trigram
- NB:
	F1 score:  0.8593461711938227
	Accuracy score:  0.8682692690514207
- SVM:
	F1 score:  0.8612576227287649
	Accuracy score:  0.8681994000621872
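
The scores listed above can also be compared programmatically. The sketch below copies the reported values into a dictionary (the variable names are hypothetical) and finds the best configuration for each metric.

```python
# (n-gram setting, model) -> (F1 score, accuracy), copied from the results above.
scores = {
    ("Unigram", "NB"): (0.8457815371727246, 0.8554807648388861),
    ("Unigram", "SVM"): (0.8243749896980166, 0.8351092220470917),
    ("Unigram + Bigram", "NB"): (0.8586560554105048, 0.8678499207760648),
    ("Unigram + Bigram", "SVM"): (0.8614098850392375, 0.8684090680935036),
    ("Unigram + Bigram + Trigram", "NB"): (0.8593461711938227, 0.8682692690514207),
    ("Unigram + Bigram + Trigram", "SVM"): (0.8612576227287649, 0.8681994000621872),
}

# Best configuration overall, by each metric.
best_by_f1 = max(scores, key=lambda key: scores[key][0])
best_by_accuracy = max(scores, key=lambda key: scores[key][1])
print("Best F1:", best_by_f1)
print("Best accuracy:", best_by_accuracy)

# Which model wins on each metric within each n-gram setting.
for setting in ["Unigram", "Unigram + Bigram", "Unigram + Bigram + Trigram"]:
    f1_winner = max(["NB", "SVM"], key=lambda model: scores[(setting, model)][0])
    acc_winner = max(["NB", "SVM"], key=lambda model: scores[(setting, model)][1])
    print(setting, "| F1:", f1_winner, "| accuracy:", acc_winner)
```

The differences between the top configurations are in the third or fourth decimal place, so they should be read as small.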


Error Analysis

Source
*   https://practicaldatascience.co.uk/machine-learning/how-to-detect-sarcasm-using-machine-learning



In [11]:
# y_pred and y_test still hold the last fold of the SVM loop above,
# so this analysis covers only that final test fold.
results = pd.DataFrame(data={'predicted': y_pred, 'actual': y_test})
# results carries y_test's index, so the join attaches each row's headline.
predictions = results.join(df)

# Flag each row where the prediction matches the label.
predictions['correct'] = predictions['predicted'] == predictions['actual']
predictions = predictions[['text', 'predicted', 'actual', 'correct']]

# Show 10 random misclassified headlines.
predictions[~predictions['correct']].sample(10)

Unnamed: 0,text,predicted,actual,correct
27584,mark zuckerberg prepares for congressional tes...,0,1,False
28163,new jersey supreme court rules the bastard had...,0,1,False
28609,bakery owner vows to stop making wedding cakes...,1,0,False
27833,damon albarn gets carried off stage in denmark...,1,0,False
28230,postmaster general colon singlequote letter ...,0,1,False
26802,obama calls for turret-mounted video cameras o...,0,1,False
26758,florida resort allows guests to swim with miam...,0,1,False
27017,gorgeous new nasa image shows earth singlequo...,1,0,False
25806,high school freshman thinks singlequote romeo...,0,1,False
26513,dick van dyke surprises denny singlequote s pa...,1,0,False


The table above shows 10 samples from the list of incorrect predictions. Simply counting the number of times a certain word appears in a headline does not appear to be sufficient for sarcasm detection. To provide additional information to the learning model, N-grams and punctuation-to-text replacement were used. Although these gave a slight improvement in accuracy and F-score, it was not enough to reach a performance level of 90%. Sarcasm detection is closely related to emotion, so it is likely that better accuracy and F-score can be achieved by using NRC emotion features, which would provide emotion words and emotion intensity as additional features to the learning model.
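
As a rough sketch of what such emotion features could look like, the code below turns a headline into per-emotion word counts and summed intensities. Everything in it is an assumption for illustration: the lexicon entries and intensity values are made up and are not taken from the NRC lexicons, and `emotion_features` is a hypothetical helper.

```python
# Hypothetical toy lexicon: word -> (emotion, intensity).
# These entries are invented for illustration, NOT real NRC lexicon values.
TOY_EMOTION_LEXICON = {
    "gorgeous": ("joy", 0.8),
    "doomsday": ("fear", 0.9),
    "liar": ("anger", 0.7),
}
EMOTIONS = ["anger", "fear", "joy"]


def emotion_features(text):
    """Return per-emotion word counts followed by per-emotion summed intensities."""
    counts = {emotion: 0 for emotion in EMOTIONS}
    intensity = {emotion: 0.0 for emotion in EMOTIONS}
    for token in text.lower().split():
        if token in TOY_EMOTION_LEXICON:
            emotion, score = TOY_EMOTION_LEXICON[token]
            counts[emotion] += 1
            intensity[emotion] += score
    return [counts[e] for e in EMOTIONS] + [intensity[e] for e in EMOTIONS]


features = emotion_features("inclement weather prevents liar from getting to work")
print(features)
```

In a full pipeline, columns like these could be appended to the bag-of-words matrix (for example with `scipy.sparse.hstack`) before training. Whether they would actually help is untested here.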