# CS289 Final Project: Irony Detection in English Tweets

Team: Jayanth, Sudharsan Krishnaswamy, Debleena Sengupta, Shadi Shahsavari

# Abstract

The advent of social media such as Twitter and Facebook has led to a rise in creative and figurative language, such as irony, sarcasm, and hyperbole, used to catch the social network's attention for more likes and retweets. Natural language processing tasks on such social media datasets, such as sentiment analysis, opinion mining, and argument analysis, struggle to maintain high performance when applied to ironic text. We tackle this hard problem of irony detection using recent advances in deep learning. Our work is based on the SemEval 2018 dataset and addresses its first task: detecting whether a tweet is "ironic" or not (binary classification, 0 or 1).

# SemEval Dataset Preprocessing and Corpus Analysis


The dataset was preprocessed with regular expressions to replace URLs with URL and usernames with USR. Emojis were also converted to their text aliases, following the names defined by the [Unicode Consortium](http://www.unicode.org/emoji/charts/full-emoji-list.html).

In [37]:
import re
import emoji
sent = u'@SincerelyTumblr: One day I want to travel with my bestfriend 🌏 ✈️ http://t.co/AXD3Ax5qC1 DONE DID TRAVELED DA WORLD!! @Bethanycsmithh 🖤 '
sent = emoji.demojize(sent)  # replace emojis with their text aliases
sent = re.sub(r'https?:\/\/[^ \n\t\r]*', 'URL', sent)  # mask URLs
sent = re.sub(r'@[a-zA-Z0-9_]+', 'USR', sent)  # mask usernames
print(sent)

USR: One day I want to travel with my bestfriend :globe_showing_Asia-Australia: :airplane:️ URL DONE DID TRAVELED DA WORLD!! USR :black_heart: 


Collocations are multi-word expressions that commonly co-occur. They give important insight into the common patterns in both classes. NLTK provides a number of measures for scoring collocations and other associations. The collocations in the corpus were scored with those metrics and ranked.

In [None]:
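from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True).tokenize
tokens = tokenizer('\n'.join(corpus))  # corpus: list of tweet strings
finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures()
# score_ngrams returns (bigram, score) pairs ordered from highest to lowest score
scored = finder.score_ngrams(bigram_measures.chi_sq)
for bigram, score in scored[:10]:
    print(' '.join(bigram), score)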

## Collocation Scoring

| Raw Freq | Chi_Sq/Dice/Jaccard/Phi_Sq/pmi     | Likelihood Ratio          |
|----------|------------------------------------|---------------------------|
| ! !      | #034i #100                         | ! !                       |
| in the   | #100 #glitter                      | i love                    |
| . i      | #1stphoto #rektek                  | I I                       |
| i love   | #2003 2003                         | going to                  |
| I I      | #2015isthenewturnup #myboos        | to be                     |
| to be    | #2015season #2014sucks             | face_with_tears_of_joy :: |
| of the   | #2am http://t.co/49XwyrlADo        | in the                    |
| . I      | #2n1edition http://t.co/oRT6ZYfGhx | can't wait                |
| for the  | #2o14 #bestie                      | :: face_with_tears_of_joy |
| to the   | #2of6 #6daystretch                 | : ️                        |


Refer to [this spreadsheet](https://docs.google.com/spreadsheets/d/1SzHaj5J7IedYPns2vLK2_3aX85TJNJwuLoMNe5Vz670/edit#gid=1691220959) for the top 10 collocations under the different metrics.
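
The middle column of the table above is dominated by rare hashtag pairs. A minimal sketch of why this can happen, assuming the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))): a pair of tokens that each occur once, and only together, scores higher than a frequent pair such as "i love". The function name bigram_pmi and the toy token stream are illustrative assumptions, not part of the project code or the dataset.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent bigram by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy token stream (made up for illustration, not taken from the dataset).
tokens = "i love mondays . i love rain . i love #2am #bestie".split()
scores = bigram_pmi(tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], round(scores[ranked[0]], 2))
```

Frequency-based measures such as raw frequency or the likelihood ratio weight the evidence by how often a pair occurs, which is consistent with the common word pairs seen in the other two columns.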

# Baseline Implementations of traditional ML algorithms

We tried out two different N-gram language models as features:
1. [Unigram](http://www.nltk.org/_modules/nltk/model/ngram.html)
2. [Bigram](http://www.nltk.org/_modules/nltk/model/ngram.html)

Those features were then converted to one of the following encodings for training:
1. [Bag of Words](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer)
2. [Count](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
3. [Tf-idf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
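
To make the difference between the three encodings concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It treats bag of words as binary presence (an assumption) and uses the textbook tf-idf weight tf × ln(N / df). scikit-learn's TfidfVectorizer applies a smoothed idf and L2 normalization by default, so its numbers would differ. The function name encode and the toy documents are hypothetical.

```python
import math
from collections import Counter


def encode(docs):
    """Return the vocabulary and, per document, (binary, count, tfidf) vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        binary = [1 if counts[t] else 0 for t in vocab]
        count = [counts[t] for t in vocab]
        # Textbook tf-idf (assumption): term count times ln(N / df).
        tfidf = [counts[t] * math.log(n_docs / df[t]) for t in vocab]
        rows.append((binary, count, tfidf))
    return vocab, rows


docs = ["great great monday", "great weather", "monday again"]
vocab, rows = encode(docs)
binary, count, tfidf = rows[0]
print(vocab)
print(binary, count, [round(v, 3) for v in tfidf])
```

In the first document, "great" is indistinguishable from "monday" under the binary encoding, counts twice under the count encoding, and is down-weighted under tf-idf because it also appears in another document.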

The following code featurizes the input:

In [10]:
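from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

def featurize(corpus):
    # Tweet-aware tokenizer: lowercases, shortens elongated words, strips @handles
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True).tokenize
    vectorizer = TfidfVectorizer(strip_accents="unicode", analyzer="word", tokenizer=tokenizer, stop_words="english")
    X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
    return X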

We tried multiple ML classifiers with the different features and carried out a comparative study.
The following classifiers were tried:

1. [Naive Bayes](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.naivebayes)
2. [Decision Tree](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.decisiontree)
3. [MaxEnt](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent)
4. [GIS](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent)
5. [IIS](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent)
6. [Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)
7. [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
8. [Extra Trees Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
9. [Gaussian Naïve Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
10. [Gradient Boosting Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
11. [K Nearest Neighbours Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
12. [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
13. [Linear Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
14. [Multinomial Naïve Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
15. [Nu-Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html)
16. [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
17. [Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

Following is the code for using the classifiers and doing [10-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html):

In [None]:
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

K_FOLDS = 10  # 10-fold cross-validation
CLF = LinearSVC()  # the default, non-parameter-optimized linear-kernel SVM
# parse_dataset, DATASET_FP and TASK are defined elsewhere in the project code
corpus, y = parse_dataset(DATASET_FP)  # load the tweets and their labels
X = featurize(corpus)  # simple tf-idf bag-of-words features
# Returns an array of the same size as 'y' where each entry is a prediction obtained by cross-validation
predicted = cross_val_predict(CLF, X, y, cv=K_FOLDS)
score = metrics.f1_score(y, predicted, pos_label=1)
print("F1-score Task", TASK, score)
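
The F1 score computed in the cell above is the harmonic mean of precision and recall for the ironic class (pos_label=1). As a minimal sketch of that computation, the function below reproduces it in plain Python. The name f1_score_binary and the label lists are made up for illustration.

```python
def f1_score_binary(y_true, y_pred, pos_label=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions (1 = ironic, 0 = not ironic).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
f1 = f1_score_binary(y_true, y_pred)
print(round(f1, 4))
```

Because F1 ignores true negatives, it reflects how well the ironic class is recovered rather than overall accuracy.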

## Baseline Comparison

Refer to [this spreadsheet](https://docs.google.com/spreadsheets/d/1x-MsT3iaUSs85UQPOhBzIOWSe4DOmhXs7M9TqUs_QJ4/edit#gid=256991540) for the complete comparison.

### Bigram does not improve much

### Tf-idf performs the best

### Nu-SVC, Naive Bayes and Logistic Regression give comparatively better performance than the rest

We also extracted the most informative features of the classifiers to see which words were important in distinguishing the two classes.

In [39]:
def most_informative_feature_for_binary_classification(vectorizer, classifier, n=10):
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = vectorizer.get_feature_names_out()
    # Sort features by coefficient: most negative favor class 0, most positive favor class 1
    ranked = sorted(zip(classifier.coef_[0], feature_names))
    topn_class1 = ranked[:n]
    topn_class2 = ranked[-n:]

    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print()
    for coef, feat in reversed(topn_class2):
        print(class_labels[1], coef, feat)

| Naive Bayes                                                     | SVC                               | Linear SVC                        | MaxEnt                                        |
|-----------------------------------------------------------------|-----------------------------------|-----------------------------------|-----------------------------------------------|
| waking = 1               irony : non-ir =     14.4 : 1.0        | irony 3.43705831007 love          | irony 2.09798273149 unamused_face | 0.892 swag==1 and label is 'irony'            |
| final = 1               irony : non-ir =      9.7 : 1.0         | irony 3.14001851865 great         | irony 1.9912945247 love           | 0.892 illridewithyou==1 and label is 'irony'  |
| understand = 1              non-ir : irony  =      8.3 : 1.0    | irony 2.68216438685 fun           | irony 1.95849441798 great         | 0.866 speakeasy==1 and label is 'non-irony'   |
| yay = 1               irony : non-ir =      7.6 : 1.0           | irony 2.10512791389 unamused_face | irony 1.89300320932 yay           | 0.631 socialist==1 and label is 'irony'       |
| unamused_face = 1               irony : non-ir =      7.0 : 1.0 | irony 1.94194114863 yay           | irony 1.56713777847 joy           | -0.609 waking==1 and label is 'non-irony'     |
| fix = 1               irony : non-ir =      6.4 : 1.0           | irony 1.75237273405 nice          | irony 1.55896446035 monday        | 0.604 amazeballs==1 and label is 'irony'      |
| have = 2               irony : non-ir =      6.4 : 1.0          | irony 1.7355600331 thanks         | irony 1.52283732966 nice          | 0.604 peak==1 and label is 'non-irony'        |
| check = 1              non-ir : irony  =      6.1 : 1.0         | irony 1.6065307335 wow            | irony 1.5100562397 fun            | 0.600 kmyb19hr==1 and label is 'non-irony'    |
| wonderful = 1               irony : non-ir =      5.8 : 1.0     | irony 1.60091530592 oh            | non-irony -1.50574938325 check    | 0.597 subtweeting==1 and label is 'non-irony' |
| dont = 1              non-ir : irony  =      5.8 : 1.0          | non-irony -1.57302071027 :        | irony 1.47746195477 test          | 0.593 bliss==1 and label is 'irony'           |

# DNN Model to Capture Linguistic Property of Irony

# Deep Learning Algorithm Implementation and Model Tuning

# Boosting Implementation and Model Tuning

### Overview of Boosting:
Given boosting's strong track record, we hypothesized that, with proper feature selection, boosting would achieve high accuracy. In addition to the deep learning model, we therefore explored a boosting implementation. The idea was to feed the same word embedding features used in the DNN to the boosting algorithm to see whether it could achieve better performance. A variety of feature sets were tested: sentiment score extractions, word embeddings, and a combination of the two.

### Sentiment Score Features:

We hypothesized that ironic tweets would contain words with starkly contrasting sentiment scores. Each tweet in the corpus was tokenized, and each token was passed through a sentiment analysis tool (NLTK's SentiWordNet interface) to record its positive and negative scores. Sliding over the tweet three words at a time, we took the maximum positive score and the maximum negative score within each 3-word window and subtracted the latter from the former to represent how "far away" the sentiments of these words were from each other. The number of 3-word windows kept per tweet was a tunable hyperparameter, MAX_LEN_2 (shown in the code below).

The full code for boosting with sentiment score features can be found at: https://github.com/jayanthjaiswal/SemEval2018-Task3/blob/master/xgboost_test/boost_test.py

The code listed below demonstrates how the sentiment features were constructed for each tweet:


In [None]:
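import re
from nltk.corpus import sentiwordnet as swn

MAX_LEN_2 = 6  # number of 3-word windows kept per tweet
# tokenizer is the TweetTokenizer-based tokenizer defined earlier
def get_sent_max_pos_neg(corpus):
    inp_X = []  # one fixed-length feature list per tweet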
    for curr_sentence in corpus:
        curr_sentence = re.sub(r'^https?:\/\/.*[\r\n]*', ' ', curr_sentence)
        curr_sentence = re.sub(r'@[a-zA-Z0-9_]+', ' ', curr_sentence)
        sentence_tokenized = tokenizer(curr_sentence)
        sent_list = []
        for i in range(len(sentence_tokenized)-2):
            curr_word = sentence_tokenized[i]
            next_word = sentence_tokenized[i+1]
            next_next = sentence_tokenized[i+2]
            if curr_word.startswith('#'):
                curr_word = curr_word[1:]
            if next_word.startswith('#'):
                next_word = next_word[1:]
            if next_next.startswith('#'):
                next_next = next_next[1:]
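            # senti_synsets returns a lazy iterable in Python 3; materialize it so len() and indexing work
            curr_senti_synsets = list(swn.senti_synsets(curr_word))
            next_senti_synsets = list(swn.senti_synsets(next_word))
            next_next_senti_synsets = list(swn.senti_synsets(next_next))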
            if len(curr_senti_synsets) > 0 and len(next_senti_synsets) > 0 and len(next_next_senti_synsets) > 0:
                curr_pos = curr_senti_synsets[0].pos_score()
                curr_neg = curr_senti_synsets[0].neg_score()
                next_pos = next_senti_synsets[0].pos_score()
                next_neg = next_senti_synsets[0].neg_score()
                next_next_pos = next_next_senti_synsets[0].pos_score()
                next_next_neg = next_next_senti_synsets[0].neg_score()
                max_pos = max(curr_pos, next_pos, next_next_pos)
                max_neg = max(curr_neg, next_neg, next_next_neg)
                curr_sent = max_pos - max_neg
                if curr_sent != 0:
                    sent_list.append(curr_sent)
        if len(sent_list) > MAX_LEN_2:
            sent_list = sent_list[:MAX_LEN_2]
        for i in range(MAX_LEN_2 - len(sent_list)):
            sent_list.append(0.0)
        inp_X.append(sent_list)
    return inp_X
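
As a worked illustration of the windowed contrast feature above, the sketch below applies the same steps (skip windows with unknown words, drop zero differences, truncate or zero-pad to a fixed length) using a small hand-made lexicon instead of SentiWordNet. The scores in TOY_LEXICON and the name window_contrast are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical (positive, negative) scores; these are NOT SentiWordNet values.
TOY_LEXICON = {
    "i": (0.0, 0.0),
    "love": (0.75, 0.0),
    "waking": (0.0, 0.125),
    "up": (0.0, 0.0),
    "early": (0.0, 0.25),
}


def window_contrast(tokens, lexicon, max_windows=6):
    """Max positive minus max negative score in each 3-word window."""
    feats = []
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        if not all(word in lexicon for word in window):
            continue  # skip windows containing a word without a score
        max_pos = max(lexicon[word][0] for word in window)
        max_neg = max(lexicon[word][1] for word in window)
        diff = max_pos - max_neg
        if diff != 0:
            feats.append(diff)
    feats = feats[:max_windows]  # truncate long tweets
    return feats + [0.0] * (max_windows - len(feats))  # zero-pad short ones


tokens = ["i", "love", "waking", "up", "early"]
features = window_contrast(tokens, TOY_LEXICON)
print(features)
```

Each tweet thus yields exactly max_windows values, so tweets of different lengths can be stacked into one feature matrix.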

### Word Embedding Features:

Since sentiment scores alone did not perform very well, the next approach was to explore word embedding features. We used the pre-trained GloVe word embeddings from Stanford and produced two sets of features. For the first set, each tweet corresponded to a text file in which each row represented a word in the tweet as a 25d word embedding. For the second set, each tweet corresponded to a text file in which each row represented a word in the tweet as a 50d word embedding.

The first task was to preprocess the text files by flattening each one into a 1-dimensional array holding the word embeddings of approximately the first 10 words in a tweet. The number of values kept in the flattened word embedding vector was a hyperparameter, MAX_LEN (shown in the code below). Each tweet example was represented as a 1 x 250 dimensional vector. The examples were then stacked to form the full input training data matrix (data_x in the code below) with dimensions 2834 x 251 (the last column holds the labels).

The same was done for the 50d word embeddings, resulting in an input training data matrix (data_x in the code below) with dimensions 2834 x 501.
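
The flattening step can be sketched as follows, assuming that tweets shorter than the cutoff are zero-padded (the padding scheme is an assumption, as are the function name flatten_tweet and the tiny 2d vectors used in place of real GloVe embeddings).

```python
def flatten_tweet(word_vectors, max_len, label):
    """Flatten per-word vectors into one row of max_len values plus the label."""
    flat = [value for vector in word_vectors for value in vector]
    flat = flat[:max_len]  # keep only the first max_len values
    flat += [0.0] * (max_len - len(flat))  # zero-pad short tweets (assumption)
    return flat + [label]


# Toy 2d "embeddings" for two tweets (made up for illustration).
tweet_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three words
tweet_b = [[0.7, 0.8]]  # one word
rows = [
    flatten_tweet(tweet_a, max_len=4, label=1),
    flatten_tweet(tweet_b, max_len=4, label=0),
]
print(rows)
```

With 25d embeddings and a cutoff of 250 values, each row would hold 250 features followed by the label, matching the 251 columns described above.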


### Boosting Implementation
Once the data was preprocessed into sentiment score features and word embedding features (stored in data_x in the code below), it was split into X_train, X_test, y_train and y_test (as shown in the code below). XGBoost was then fit to the training data. We used the XGBRegressor from the xgboost library and rounded its real-valued predictions to the nearest integer to perform the binary classification task.


In [None]:
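from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

MAX_LEN = 350  # length of the flattened word-embedding vector (25d embeddings)
test_size = 0.1  # fraction of the data held out for testing
accuracies = []  # accuracy of each run
# data_x (features plus label column) and MAX_LEN_2 come from the preprocessing above
X = data_x[:, 0:MAX_LEN + MAX_LEN_2]
Y = data_x[:, MAX_LEN + MAX_LEN_2]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size)
# fit model on training data
model = XGBRegressor(max_depth=3)  # gave 56.51%
model.fit(X_train, y_train)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]  # round regression outputs to 0/1
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
accuracies.append(accuracy)
print("Accuracy: %.2f%%" % (accuracy * 100.0))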

The full code for boosting with word embeddings and sentiment score features can be found at: https://github.com/jayanthjaiswal/SemEval2018-Task3/blob/master/xgboost_test/boost_test_2.py
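
As a rough illustration of what fitting boosted shallow trees to 0/1 labels and rounding the outputs involves, the sketch below implements plain gradient boosting with one-split regression stumps under squared error. It is not XGBoost's algorithm, which adds regularization and second-order gradient information. The names fit_stump, boost and predict, the learning rate, and the toy data are all assumptions made for illustration.

```python
def fit_stump(X, residuals):
    """Return (feature, threshold, left_mean, right_mean) with least squared error, or None."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue  # a split must send samples to both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            err = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, thr, left_mean, right_mean)
    return None if best is None else best[1:]


def boost(X, y, n_rounds=20, learning_rate=0.5):
    """Fit stumps sequentially, each to the residuals left by the previous ones."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, thr, left_mean, right_mean = stump
        stumps.append(stump)
        preds = [p + learning_rate * (left_mean if row[j] <= thr else right_mean)
                 for row, p in zip(X, preds)]
    return base, learning_rate, stumps


def predict(model, row):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    value = base
    for j, thr, left_mean, right_mean in stumps:
        value += learning_rate * (left_mean if row[j] <= thr else right_mean)
    return value


# Toy data: the first feature separates the classes, the second is noise.
X = [[0.1, 1.0], [0.2, 0.0], [0.8, 1.0], [0.9, 0.0]]
y = [0, 0, 1, 1]
model = boost(X, y)
labels = [round(predict(model, row)) for row in X]
print(labels)
```

Each round shrinks the remaining residual, so the real-valued outputs approach 0 and 1 and rounding recovers the class labels, which mirrors the rounding step applied to the regressor's predictions above.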


### Feature Engineering, Hyperparameters, Accuracy Analysis

Combinations of word embedding features and sentiment score features were compared to determine which produced the best accuracy. The main parameters tuned in the model were MAX_LEN, MAX_LEN_2, test_size (the fraction of the data held out for testing), and the max_depth of the decision trees in the boosting model. The best results were achieved with test_size = 0.1 and max_depth = 3 for all models. For the models involving the 25d word embedding features, MAX_LEN = 350. For the models involving the 50d word embedding features, MAX_LEN = 550. For models involving sentiment scores, MAX_LEN_2 = 6.

The findings of the best scores along with the corresponding hyperparameters are summarized in the chart below:


### Boosting Summary
Even though we hypothesized that boosting would be a simple way to increase performance, it was unable to beat the DNN approaches. The highest accuracy achieved with boosting was 62.50%, using 25d word embedding and sentiment score features. We concluded that proper feature selection is key to good performance with a boosting algorithm. Feature selection for irony classification is complex and requires more sophisticated systems such as DNNs, as opposed to the boosting approach.


# Result Analysis and Test Data Evaluation Submission on SemEval Dataset