# CS269 Final Project: Irony Detection in English Tweets

Team: Jayanth, Sudharsan Krishnaswamy, Debleena Sengupta, Shadi Shahsavari

# Abstract

For our final project, we explored the task of irony detection in tweets (SemEval 2018, Task 3). The rise of social media such as Twitter and Facebook has led people to use more creative and figurative language, such as irony, sarcasm, and hyperbole, to catch their network's attention and earn more likes and retweets. Natural language processing tasks on social media data, such as sentiment analysis, opinion mining, and argument analysis, struggle to maintain high performance when applied to ironic text. We tackle this hard problem of irony detection using boosting and recent advances in deep learning. Our work addresses the first subtask of the SemEval 2018 dataset: detecting whether a tweet is ironic or not (binary classification, 0 or 1).

# SemEval Dataset Preprocessing and Corpus Analysis


The dataset was preprocessed with regular expressions to replace URLs with the placeholder URL and usernames with the placeholder USR. Emojis were also converted to their text aliases, as defined by the [Unicode Consortium](http://www.unicode.org/emoji/charts/full-emoji-list.html).

In [37]:
import re
import emoji
sent = u'@SincerelyTumblr: One day I want to travel with my bestfriend 🌏 ✈️ http://t.co/AXD3Ax5qC1 DONE DID TRAVELED DA WORLD!! @Bethanycsmithh 🖤 '
sent = emoji.demojize(sent)
sent = re.sub(r'https?:\/\/[^ \n\t\r]*', 'URL', sent)
sent = re.sub(r'@[a-zA-Z0-9_]+', 'USR', sent)
print(sent)

USR: One day I want to travel with my bestfriend :globe_showing_Asia-Australia: :airplane:️ URL DONE DID TRAVELED DA WORLD!! USR :black_heart: 
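
The same normalization can be wrapped in a small helper so that it is applied identically to every tweet. The sketch below is illustrative only: `normalize_tweet` is a hypothetical name, it reuses the two regular expressions from the cell above, and it leaves out the `emoji.demojize` step so that it runs with the standard library alone.

```python
import re

# Same patterns as in the cell above, compiled once for reuse.
URL_RE = re.compile(r'https?://[^ \n\t\r]*')
USER_RE = re.compile(r'@[a-zA-Z0-9_]+')

def normalize_tweet(text):
    """Replace URLs with the placeholder URL and user mentions with USR."""
    text = URL_RE.sub('URL', text)
    return USER_RE.sub('USR', text)

print(normalize_tweet('@SincerelyTumblr: so fun http://t.co/AXD3Ax5qC1 @Bethanycsmithh'))
```

URLs are replaced before mentions so that an `@` inside a link is never rewritten separately.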


Collocations are multi-word expressions whose words commonly co-occur. They give important insight into the common patterns in both classes. NLTK provides a number of measures for scoring collocations and other associations. We scored the collocations in the corpus with these metrics and ranked them.

In [None]:
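from nltk.tokenize import TweetTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# corpus is the list of preprocessed tweets loaded earlier
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True).tokenize
tokens = tokenizer('\n'.join(corpus))
finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures()
# score_ngrams returns (bigram, score) pairs ordered from highest to lowest score
scored = finder.score_ngrams(bigram_measures.chi_sq)
for bigram, score in scored[:10]:
    print(' '.join(bigram), score)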

## Collocation Scoring

| Raw Freq | Chi_Sq/Dice/Jaccard/Phi_Sq/pmi     | Likelihood Ratio          |
|----------|------------------------------------|---------------------------|
| ! !      | #034i #100                         | ! !                       |
| in the   | #100 #glitter                      | i love                    |
| . i      | #1stphoto #rektek                  | I I                       |
| i love   | #2003 2003                         | going to                  |
| I I      | #2015isthenewturnup #myboos        | to be                     |
| to be    | #2015season #2014sucks             | face_with_tears_of_joy :: |
| of the   | #2am http://t.co/49XwyrlADo        | in the                    |
| . I      | #2n1edition http://t.co/oRT6ZYfGhx | can't wait                |
| for the  | #2o14 #bestie                      | :: face_with_tears_of_joy |
| to the   | #2of6 #6daystretch                 | : ️                        |


Refer to [this spreadsheet](https://docs.google.com/spreadsheets/d/1SzHaj5J7IedYPns2vLK2_3aX85TJNJwuLoMNe5Vz670/edit#gid=1691220959) for the top 10 collocations under the different metrics.
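
The middle column of the table is dominated by hashtag and link pairs, while the raw-frequency column lists common function-word pairs. One likely reason is that association measures such as PMI reward pairs whose words rarely occur apart, so a pair seen only once can outrank a frequent one. The sketch below illustrates this on a made-up token list; `bigram_pmi` is a hypothetical helper that computes roughly the quantity behind NLTK's `BigramAssocMeasures.pmi`, not the code used for the table.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return ({bigram: PMI in bits}, bigram counts) for adjacent token pairs."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pmi = {
        pair: math.log2(count * n / (unigrams[pair[0]] * unigrams[pair[1]]))
        for pair, count in bigrams.items()
    }
    return pmi, bigrams

# Toy corpus: one frequent pair and one pair of words that occur only once.
tokens = "i love it i love it i love #2am #bestie".split()
pmi, counts = bigram_pmi(tokens)
most_frequent = counts.most_common(1)[0][0]
highest_pmi = max(pmi, key=pmi.get)
print("most frequent:", most_frequent)
print("highest PMI:  ", highest_pmi)
```

Here the most frequent bigram is `i love`, but the highest PMI goes to the one-off hashtag pair, mirroring the pattern in the table.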

# Baseline Implementations of traditional ML algorithms

We tried out two different N-gram language models as features:
1. [Unigram](http://www.nltk.org/_modules/nltk/model/ngram.html)
2. [Bigram](http://www.nltk.org/_modules/nltk/model/ngram.html)

These features were then converted to one of the following encodings for training:
1. [Bag of Words](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer)
2. [Count](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
3. [Tf-idf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
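
To make the difference between the three encodings concrete, here is a minimal pure-Python sketch on a two-tweet toy corpus. It is an illustration under stated assumptions, not the project's pipeline: `encode` is a hypothetical helper, tokenization is plain whitespace splitting, and the tf-idf variant uses the smoothed idf and L2 normalization that scikit-learn documents as its defaults.

```python
import math
from collections import Counter

def encode(corpus):
    """Return vocabulary plus presence, count and tf-idf rows for each document."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # document frequency: in how many documents each token appears
    df = Counter(tok for doc in docs for tok in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    presence, counts, tfidf = [], [], []
    for doc in docs:
        tf = Counter(doc)
        presence.append([1 if tok in tf else 0 for tok in vocab])
        counts.append([tf[tok] for tok in vocab])
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        tfidf.append([v / norm for v in row])  # L2-normalized row
    return vocab, presence, counts, tfidf

vocab, presence, counts, tfidf = encode(["so so fun", "fun day"])
print(vocab)
print(presence[0], counts[0], [round(v, 3) for v in tfidf[0]])
```

Presence discards how often a word occurs, count keeps it, and tf-idf additionally down-weights words such as "fun" that appear in every document.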

Following is the code for featurizing the input:

In [10]:
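from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

def featurize(corpus):
    # Tokenize with the Twitter-aware tokenizer and build tf-idf vectors
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True).tokenize
    vectorizer = TfidfVectorizer(strip_accents="unicode", analyzer="word", tokenizer=tokenizer, stop_words="english")
    X = vectorizer.fit_transform(corpus)
    return X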

We tried multiple ML classifiers with the different features and carried out a comparative study.
The following classifiers were tried:

1. [Naive Bayes](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.naivebayes)
2. [Decision Tree](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.decisiontree)
3. [MaxEnt](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent)
4. [GIS](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent)
5. [IIS](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent)
6. [Bernoulli Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)
7. [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
8. [Extra Trees Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
9. [Gaussian Naïve Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
10. [Gradient Boosting Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
11. [K Nearest Neighbours Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
12. [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
13. [Linear Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
14. [Multinomial Naïve Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
15. [Nu-Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html)
16. [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
17. [Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

Following is the code for using the classifiers and doing [10-fold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html):

In [None]:
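from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

K_FOLDS = 10  # 10-fold cross-validation
CLF = LinearSVC()  # the default linear-kernel SVM, without parameter optimization
# parse_dataset, DATASET_FP and TASK are assumed to be defined in the benchmark script
corpus, y = parse_dataset(DATASET_FP)  # load the tweets and their labels
X = featurize(corpus)  # simple tf-idf bag-of-words features
# cross_val_predict returns an array of the same size as y, where each entry is
# the prediction made for that sample while it was in the held-out fold
predicted = cross_val_predict(CLF, X, y, cv=K_FOLDS)
score = metrics.f1_score(y, predicted, pos_label=1)
print("F1-score Task", TASK, score)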

Refer to the code [here](../benchmark_system/example.py) for baselines and corpus analysis.
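
For reference, the F1 score reported by the snippet above is the harmonic mean of precision and recall on the positive (ironic) class. The following is a small hand-rolled sketch of that computation; `f1_score_binary` is a hypothetical helper for illustration, and the labels are made up.

```python
def f1_score_binary(y_true, y_pred, pos_label=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(f1_score_binary(y_true, y_pred))  # 2 TP, 1 FP, 1 FN
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, and so is F1.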

## Baseline Comparison

To establish baselines, we trained and evaluated many classifiers. For almost every classifier we used three feature sets: bag of words (presence), count, and tf-idf. For every feature set we tried both the unigram and the bigram model. For some classifiers we also tuned hyperparameters to get the best accuracy.
Comparing the results, we observe the following:
Bag of words is in most cases not a good choice, while tf-idf is probably the best feature. This suggests that less sparse, better embeddings should work well as features.
Unigrams, according to the figure below, show better performance than bigrams. This holds for almost every classifier.
Finally, when comparing the classifiers, the best performance was obtained with the Nu-Support Vector Classifier at 66.92%. The lowest scores were observed with the Random Forest classifier, which requires more parameter tuning to achieve comparable results.

Refer to [this spreadsheet](https://docs.google.com/spreadsheets/d/1x-MsT3iaUSs85UQPOhBzIOWSe4DOmhXs7M9TqUs_QJ4/edit#gid=256991540) for the complete comparison.


### Bigram does not improve much
<img src="Bigram_Unigram.png" alt="Bigram vs Unigram" title="Bigram vs Unigram" />

### Tf-idf performs the best

<img src="Features_Extraction.png" alt="Features Extraction" title="Features Extraction"/>

### Nu-SVC, Naive Bayes and Logistic Regression give comparatively better performance than the rest

<img src="Baseline_Comparisions.png" alt="Baseline Comparison" title="Baseline Comparison"/>

We also extracted the most informative features of the classifiers to see which words were important in distinguishing the two classes.

In [39]:
def most_informative_feature_for_binary_classification(vectorizer, classifier, n=10):
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; older versions need that name instead
    feature_names = vectorizer.get_feature_names_out()
    # sort once by coefficient: the most negative favor class 0, the most positive class 1
    ranked = sorted(zip(classifier.coef_[0], feature_names))
    topn_class1 = ranked[:n]
    topn_class2 = ranked[-n:]

    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print()
    for coef, feat in reversed(topn_class2):
        print(class_labels[1], coef, feat)

| Naive Bayes                                                     | Logistic Regression                                | Linear SVC                        | MaxEnt                                        |
|-----------------------------------------------------------------|-----------------------------------|-----------------------------------|-----------------------------------------------|
| waking = 1               irony : non-ir =     14.4 : 1.0        | irony 3.43705831007 love          | irony 2.09798273149 unamused_face | 0.892 swag==1 and label is 'irony'            |
| final = 1               irony : non-ir =      9.7 : 1.0         | irony 3.14001851865 great         | irony 1.9912945247 love           | 0.892 illridewithyou==1 and label is 'irony'  |
| understand = 1              non-ir : irony  =      8.3 : 1.0    | irony 2.68216438685 fun           | irony 1.95849441798 great         | 0.866 speakeasy==1 and label is 'non-irony'   |
| yay = 1               irony : non-ir =      7.6 : 1.0           | irony 2.10512791389 unamused_face | irony 1.89300320932 yay           | 0.631 socialist==1 and label is 'irony'       |
| unamused_face = 1               irony : non-ir =      7.0 : 1.0 | irony 1.94194114863 yay           | irony 1.56713777847 joy           | -0.609 waking==1 and label is 'non-irony'     |
| fix = 1               irony : non-ir =      6.4 : 1.0           | irony 1.75237273405 nice          | irony 1.55896446035 monday        | 0.604 amazeballs==1 and label is 'irony'      |
| have = 2               irony : non-ir =      6.4 : 1.0          | irony 1.7355600331 thanks         | irony 1.52283732966 nice          | 0.604 peak==1 and label is 'non-irony'        |
| check = 1              non-ir : irony  =      6.1 : 1.0         | irony 1.6065307335 wow            | irony 1.5100562397 fun            | 0.600 kmyb19hr==1 and label is 'non-irony'    |
| wonderful = 1               irony : non-ir =      5.8 : 1.0     | irony 1.60091530592 oh            | non-irony -1.50574938325 check    | 0.597 subtweeting==1 and label is 'non-irony' |
| dont = 1              non-ir : irony  =      5.8 : 1.0          | non-irony -1.57302071027 :        | irony 1.47746195477 test          | 0.593 bliss==1 and label is 'irony'           |

# Linguistic Property in Irony

[Gibbs [1994]](http://psycnet.apa.org/record/1994-98510-000) states that verbal irony is a technique of using incongruity to suggest a distinction between reality and expectation. Incongruity is defined as the state of being incongruous (i.e. lacking in harmony; not in agreement with principles).

<img src="CS269-diagram.png" alt="Incongruity in Irony" title="Incongruity in Irony" />

We try to capture this principle in our deep neural network models.


# DNN Model to Capture Linguistic Property of Irony

Our first approach is to use a neural network for the classification task. We use a simple multi-layer perceptron of 3 layers with ReLU and tanh activations. The model was trained on both sentiment scores and word embeddings as features. The maximum length of the input sentences was fixed at 20 words, which allows us to use a vanilla feed-forward network for the task. To extract sentiment features, we used SentiWordNet to get an objective and a subjective score for the sentiment of each word. We omitted words with neutral sentiment, which helps make the model more robust. The sentiment scores are concatenated into a single vector that serves as the sentiment feature. Similarly, we add the word embeddings of the individual words to incorporate semantic features.

The idea here is that a sarcastic or ironic utterance exhibits various forms of incongruity. There can be semantic incongruity, where the meaning of one part of the sentence contradicts another. Likewise, there can be sentiment incongruity, where the sentiment of one word or phrase contradicts that of another. For example, consider "I love bad news". Here, "love" has positive sentiment, while "bad" has negative sentiment. Thus, capturing the difference between the sentiment polarities of various parts of the utterance can provide valuable information about the presence of sarcasm or irony.
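
As a rough illustration of sentiment incongruity, the sketch below scores the "I love bad news" example with a tiny hand-made lexicon. Everything in it is an assumption for demonstration purposes: the scores in `TOY_LEXICON` are invented (real values would come from SentiWordNet), and `sentiment_incongruity` is a hypothetical measure, not the feature fed to our network.

```python
# Hypothetical (positive, negative) scores; not taken from SentiWordNet.
TOY_LEXICON = {
    "love": (0.6, 0.0),
    "great": (0.75, 0.0),
    "bad": (0.0, 0.6),
    "news": (0.0, 0.0),
}

def sentiment_incongruity(tokens, lexicon=TOY_LEXICON):
    """Return the weaker of the strongest positive and strongest negative score.

    The result is high only when the utterance contains both a clearly
    positive and a clearly negative word.
    """
    scores = [lexicon.get(tok.lower(), (0.0, 0.0)) for tok in tokens]
    max_pos = max((pos for pos, _ in scores), default=0.0)
    max_neg = max((neg for _, neg in scores), default=0.0)
    return min(max_pos, max_neg)

print(sentiment_incongruity("I love bad news".split()))
print(sentiment_incongruity("I love great news".split()))
```

The first sentence mixes a positive and a negative word and receives a non-zero score, while the second is uniformly positive and scores zero.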

In our case, we considered sentiment incongruity between words, but one can also calculate phrase-level and sentence-level incongruity. With these features, a vanilla neural network of 3 layers was trained. The number of inputs was 40 (20 for word embeddings and 20 for sentiments). The hidden layers learn multiple levels of abstraction (feature representations), which are used by the softmax layer at the top to predict the output class. Binary cross-entropy was used as the loss function; other loss functions were found to yield poor results. The number of hidden units and the number of hidden layers were treated as hyperparameters and optimized using cross-validation. The overall implementation is [here](../NN_system). Parts of our implementation are shown below:

The features are extracted with a tf-idf vectorizer and a Twitter tokenizer, using the function below:

In [None]:
# feature extraction using tf-idf
def featurize(corpus):
    # Tokenizes and creates TF-IDF BoW vectors.
    # param corpus: A list of strings each string representing sentence.
    # return: X: A sparse csr matrix of TFIDF-weighted ngram counts.

    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True).tokenize
    vectorizer = TfidfVectorizer(strip_accents="unicode", analyzer="word", tokenizer=tokenizer, stop_words="english")
    X = vectorizer.fit_transform(corpus)
    # print(vectorizer.get_feature_names()) # to manually check if the tokens are reasonable
    return X

The overall architecture of our model is as follows:

In [None]:
# Our model
def neural_net(x):
    # Hidden fully connected layer with 20 neurons and relu activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1, name="Hidden1_activations")
    # Hidden fully connected layer with 10 neurons and tanh activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.tanh(layer_2, name="Hidden2_activations")
    # Output fully connected layer with a neuron for each class
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
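
The `weights` and `biases` dictionaries are not shown in the snippet above. The NumPy sketch below mirrors the same forward pass to make the tensor shapes explicit. The layer sizes are assumptions pieced together from the text and the code comments (40 inputs, 20 and 10 hidden units, 2 output logits), and the random initialization is for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layer sizes: 40 inputs -> 20 (ReLU) -> 10 (tanh) -> 2 output logits
weights = {
    'h1': rng.normal(scale=0.1, size=(40, 20)),
    'h2': rng.normal(scale=0.1, size=(20, 10)),
    'out': rng.normal(scale=0.1, size=(10, 2)),
}
biases = {'b1': np.zeros(20), 'b2': np.zeros(10), 'out': np.zeros(2)}

def forward(x):
    """Forward pass with the same structure as neural_net, returning logits."""
    layer_1 = np.maximum(x @ weights['h1'] + biases['b1'], 0.0)  # ReLU
    layer_2 = np.tanh(layer_1 @ weights['h2'] + biases['b2'])    # tanh
    return layer_2 @ weights['out'] + biases['out']

batch = rng.normal(size=(5, 40))  # 5 made-up examples with 40 features each
logits = forward(batch)
print(logits.shape)
```

Each example yields one logit per class; the softmax is applied later, inside the loss.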

The above model is trained with AdamOptimizer using the TensorFlow snippet below.

In [None]:
# Training the model using AdamOptimizer and softmax_cross_entropy loss
logits = neural_net(X)
# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
# Evaluate model (with test logits, for dropout to be disabled)
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
generator = gen(permute, batch_size)
# Start training
with tf.Session() as sess:
    # Run the initializer
    sess.run(init)
    for step in range(1, num_steps + 1):
        batch_x, batch_y = next(generator)
        # Run optimization op (backprop)
        sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
        if step % display_step == 0 or step == 1:
            # Calculate batch loss and accuracy
            loss, acc = sess.run([loss_op, accuracy], feed_dict={X: batch_x,Y: batch_y})
            print("Step "+str(step)+", Minibatch Loss= "+"{:.4f}".format(loss)+", Training Accuracy= "+"{:.3f}".format(acc))
    print("Optimization Finished!")
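
The text describes the loss as binary cross-entropy, while the snippet calls `softmax_cross_entropy_with_logits` on two output logits. For two classes with one-hot labels these are the same quantity, because a two-way softmax reduces to a sigmoid of the logit difference. The following standard-library sketch checks this numerically; the function names and logit values are made up for the illustration.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a two-class softmax; label is the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

def binary_cross_entropy(logit_diff, label):
    """Binary cross-entropy of sigmoid(logit_diff); label is 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit_diff))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

logits = [0.3, 1.7]  # arbitrary example logits for class 0 and class 1
losses = []
for label in (0, 1):
    ce = softmax_cross_entropy(logits, label)
    bce = binary_cross_entropy(logits[1] - logits[0], label)
    losses.append((ce, bce))
    print(label, round(ce, 6), round(bce, 6))
```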

Training accuracy   : ~94%

Validation accuracy : ~72% (second position in the shared task SemEval Task3)

The final accuracy obtained by our model on the validation set is 72%. Note that the state-of-the-art accuracy for irony detection is ~88%. That model was an attention-based RNN trained on around 100k examples, while we had only 3834 examples in our corpus. Hence, we plan to increase our model complexity as more data is added.

# Recurrent Neural Networks with Attention

The next idea was to use a recurrent neural network coupled with attention. The benefit of RNNs is that they can model the context of the inputs, which can be extremely helpful for sarcasm detection. For example, consider this scenario: 'A student forgot to complete his homework. The teacher says "And that's how a homework should be done!"' The second utterance, taken alone, conveys a positive sentiment. But when we take the context into account, we can infer the underlying sarcasm. Thus, we used a recurrent neural network to model the overall context of our input sentences. The inputs to our recurrent neural network are word embeddings, which, as we have seen, can capture the lexical and semantic incongruities present in the inputs.

The RNN is made up of LSTM units. The memory size of the LSTM unit was chosen to be 32 (by trial and error). The maximum sequence length was chosen to be 20 (as it covers most of the examples). Word embeddings of dimensions 25 and 50 were tried. We then applied attention to the LSTM outputs, which assigns different weights to the outputs of the LSTM and uses this weighted combination as features for classification. The outputs of the LSTM are fed to a small neural network that outputs the attention weights as a probability distribution. The weighted combination is taken as a feature for the classification, along with the output state from the LSTM, which is used independently of the attention. The overall attention model is shown below:

<img src="files/LSTM-attention.JPG">

In the above diagram, the h's are the outputs of the LSTM units, which are not shown in the image. Some snippets from the code follow; the more elaborate version can be found [here](https://github.com/jayanthjaiswal/SemEval2018-Task3/tree/master/attention_lstm). The snippet for applying attention over the LSTM outputs is shown below:

In [None]:
# apply attention
from keras import backend as K
from keras.layers import Dense, Flatten, Input, Lambda, LSTM, Multiply, Permute, RepeatVector
from keras.models import Model

def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)
    a = Dense(TIME_STEPS, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:    # if True, assumes uniform weight over all dimensions of a particular LSTM output
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    # multiply element-wise and retain the shape; the Multiply layer replaces the
    # legacy merge(..., mode='mul') call, which newer Keras versions no longer provide
    output_attention_mul = Multiply(name='attention_mul')([inputs, a_probs])
    return output_attention_mul

# extract the model
def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32 # dimensionality of the output space / #LSTM units
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(inputs=[inputs], outputs=output)
    return model
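
To make the tensor manipulations in `attention_3d_block` easier to follow, here is a NumPy sketch of the same computation for a single example (no batch axis). It is a simplified illustration, not the Keras model: `attention_over_time` is a hypothetical helper, and the weight matrix `W` and bias `b` stand in for the parameters that the `Dense(TIME_STEPS)` layer would learn.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_time(h, W, b, single_vector=True):
    """h: (time_steps, dim) LSTM outputs; W: (time_steps, time_steps); b: (time_steps,)."""
    # Permute + Dense(softmax): one distribution over time steps per feature dimension
    a = softmax(h.T @ W + b, axis=-1)                       # (dim, time_steps)
    if single_vector:
        # average the distributions and share the result across all dimensions
        a = np.repeat(a.mean(axis=0, keepdims=True), h.shape[1], axis=0)
    weights = a.T                                           # (time_steps, dim)
    return weights, h * weights                             # element-wise re-weighting

rng = np.random.default_rng(0)
h = rng.normal(size=(20, 32))   # 20 time steps, 32 LSTM units
W = rng.normal(size=(20, 20))   # stand-in for learned Dense weights
b = np.zeros(20)
weights, attended = attention_over_time(h, W, b)
print(weights.shape, attended.shape)
```

Each column of `weights` sums to one over the time steps, so the attended output is a re-weighted copy of the LSTM outputs with the same shape.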

The key component of the code for performing model training and validation is as follows:

In [None]:
# number of training examples (3834 in our case)
N = Num_Training_Inputs
# get inputs from a data generator.
inputs, outputs = get_data_recurrent(Num_Training_Inputs, TIME_STEPS, INPUT_DIM)
if APPLY_ATTENTION_BEFORE_LSTM:
    m = model_attention_applied_before_lstm()
else:
    m = model_attention_applied_after_lstm()

# compile the model and train it.
m.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
m.fit([inputs], outputs, epochs=60, batch_size=20, validation_split=0.1,callbacks=[history])

# placeholder for attention weights, kept for later visualization
attention_vectors = []
for i in range(10):
    testing_inputs, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
    attention_vector = np.mean(get_activations(m,testing_inputs,print_shape_only=True,
                                               layer_name='attention_vec')[0], axis=2).squeeze()
    print('attention =', attention_vector)
    # the attention weights should sum to 1; abs() also catches sums below 1
    assert abs(np.sum(attention_vector) - 1.0) < 1e-5
    attention_vectors.append(attention_vector)
attention_vector_final = np.mean(np.array(attention_vectors), axis=0)

Results:

Training accuracy   : 90%

Validation Accuracy : 62% (Overfitting, even with the simplest attention network)

### Discussion:
While training the model, we observed that even the simplest attention-based RNNs perform poorly compared to the vanilla versions. We attribute this to the complexity of the model. The simplest of our attention networks has ~12k parameters, while we have only ~4k examples, so there is a high risk of overfitting. In fact, that is what we observed when analysing the evolution of the validation loss. The final accuracy we obtained on the validation set was 62%, which is much lower than many traditional baselines. In the future, we intend to exploit additional datasets so that we can increase our model complexity.
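
The parameter count quoted above can be approximated by hand. The sketch below assumes the architecture from the attention snippet (one LSTM layer with 32 units, a `Dense(TIME_STEPS)` attention layer with 20 time steps, and a single sigmoid output after flattening). The helper names are hypothetical, and the source does not state which embedding size the ~12k figure refers to.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output
    return n_in * n_out + n_out

def attention_lstm_params(input_dim, time_steps=20, units=32):
    return (lstm_params(input_dim, units)
            + dense_params(time_steps, time_steps)   # attention Dense layer
            + dense_params(time_steps * units, 1))   # sigmoid output after Flatten

for dim in (25, 50):
    print(dim, attention_lstm_params(dim))
```

Under these assumptions the model has 8,485 parameters with 25-dimensional embeddings and 11,685 with 50-dimensional ones, which is consistent with the ~12k figure and roughly three times the number of training tweets.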

# Boosting Implementation and Model Tuning

### Overview of Boosting:
In addition to the deep learning models, we explored a boosting implementation. Based on boosting's track record, we hypothesized that with proper feature selection it would achieve a high accuracy score. The idea was to use the same word embedding features as in the DNN to achieve better performance. A variety of feature sets were tested: sentiment scores, word embeddings, and a combination of the two.

### Sentiment Score Features:

We hypothesized that ironic tweets would contain words with starkly contrasting sentiment scores. Each tweet in the corpus was tokenized, and each word in the tokenized list was passed through a sentiment analysis tool (NLTK's SentiWordNet interface) to obtain its positive and negative scores. Looking at groups of 3 words at a time, we determined the maximum positive score and the maximum negative score within each 3-word window and subtracted the two values to represent how "far away" the sentiments of these words were from each other. The number of 3-word windows kept per tweet was a hyperparameter, MAX_LEN_2, that was tuned (shown in the code below).

The full code for boosting with sentiment score features can be found [here](https://github.com/jayanthjaiswal/SemEval2018-Task3/blob/master/xgboost_test/boost_test.py)

The code listed below demonstrates how the sentiment features were constructed for each tweet:


In [None]:
MAX_LEN_2 = 6
def get_sent_max_pos_neg(corpus):
    inp_X = []  # one fixed-length sentiment feature vector per tweet
    for curr_sentence in corpus:
        curr_sentence = re.sub(r'^https?:\/\/.*[\r\n]*', ' ', curr_sentence)
        curr_sentence = re.sub(r'@[a-zA-Z0-9_]+', ' ', curr_sentence)
        sentence_tokenized = tokenizer(curr_sentence)
        sent_list = []
        for i in range(len(sentence_tokenized)-2):
            curr_word = sentence_tokenized[i]
            next_word = sentence_tokenized[i+1]
            next_next = sentence_tokenized[i+2]
            if curr_word.startswith('#'):
                curr_word = curr_word[1:]
            if next_word.startswith('#'):
                next_word = next_word[1:]
            if next_next.startswith('#'):
                next_next = next_next[1:]
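            # senti_synsets may return a lazy iterator, so materialize it before len() and indexing
            curr_senti_synsets = list(swn.senti_synsets(curr_word))
            next_senti_synsets = list(swn.senti_synsets(next_word))
            next_next_senti_synsets = list(swn.senti_synsets(next_next))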
            if len(curr_senti_synsets) > 0 and len(next_senti_synsets) > 0 and len(next_next_senti_synsets) > 0:
                curr_pos = curr_senti_synsets[0].pos_score()
                curr_neg = curr_senti_synsets[0].neg_score()
                next_pos = next_senti_synsets[0].pos_score()
                next_neg = next_senti_synsets[0].neg_score()
                next_next_pos = next_next_senti_synsets[0].pos_score()
                next_next_neg = next_next_senti_synsets[0].neg_score()
                max_pos = max(curr_pos, next_pos, next_next_pos)
                max_neg = max(curr_neg, next_neg, next_next_neg)
                curr_sent = max_pos - max_neg
                if curr_sent != 0:
                    sent_list.append(curr_sent)
        if len(sent_list) > MAX_LEN_2:
            sent_list = sent_list[:MAX_LEN_2]
        for i in range(MAX_LEN_2 - len(sent_list)):
            sent_list.append(0.0)
        inp_X.append(sent_list)
    return inp_X

### Word Embedding Features:

Since simply using sentiment scores did not perform very well, exploring word embedding features was the next approach. We utilized the pre-trained word embeddings called GloVe from Stanford and outputted two sets of word embeddings. For the first set, each tweet corresponded to a text file where each row represented a word in the tweet using a 25d word embedding. For the second set, each tweet corresponded to a text file where each row represented a word in the tweet using a 50d word embedding.  

The first task was to preprocess the text files by flattening each one into a 1-dimensional array representing the word embeddings of approximately the first 10 words in a tweet. The number of words included in the flattened word embedding vector was a hyperparameter, MAX_LEN (shown in the code below). Each tweet was represented as a 1 x 250 dimensional vector. The examples were then stacked to form the full training data matrix (data_x in the code below) with dimensions 3834 x 251 (the last column holds the labels).

The same was done for the 50d word embeddings, resulting in an input training data matrix (data_x in the code below) with dimensions 3834 x 501.
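
The flattening step can be sketched as follows. This is a minimal illustration, not the project's preprocessing script: `build_row` is a hypothetical helper, and padding short tweets with zeros is an assumption, since the padding scheme is not described here.

```python
def build_row(word_vectors, label, max_words=10, dim=25):
    """Flatten per-word embeddings into one fixed-length row ending with the label."""
    flat = []
    for vec in word_vectors[:max_words]:  # keep at most max_words words
        flat.extend(vec)
    # assumed zero padding for tweets shorter than max_words
    flat.extend([0.0] * (max_words * dim - len(flat)))
    return flat + [label]

short_tweet = [[0.1] * 25 for _ in range(3)]    # 3 words, 25d each (made-up values)
long_tweet = [[0.2] * 25 for _ in range(14)]    # 14 words, truncated to 10
rows = [build_row(short_tweet, 1), build_row(long_tweet, 0)]
print(len(rows), len(rows[0]), len(rows[1]))
```

Every row has 251 entries (250 embedding values plus the label), so the rows can be stacked directly into the training matrix.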


### Boosting Implementation
Once the data was preprocessed into sentiment score features and word embedding features (stored in data_x in the code below), it was split into X_train, X_test, y_train and y_test (as shown in the code below). XGBoost was then run to fit a model to the training data. We used the XGBRegressor from the xgboost library and rounded its real-valued predictions to 0 or 1 to perform the binary classification task.


In [None]:
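from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

MAX_LEN = 350
test_size = 0.1
# data_x holds the features first, followed by the label column
X = data_x[:, 0:MAX_LEN + MAX_LEN_2]
Y = data_x[:, MAX_LEN + MAX_LEN_2]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size)
# fit model on training data
model = XGBRegressor(max_depth=3)  # gave 56.51%
model.fit(X_train, y_train)
# make predictions for test data and round them to the 0/1 labels
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
accuracies.append(accuracy)  # accuracies is assumed to be a list defined earlier
print("Accuracy: %.2f%%" % (accuracy * 100.0))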

The full code for boosting with word embeddings and sentiment score features can be found [here](https://github.com/jayanthjaiswal/SemEval2018-Task3/blob/master/xgboost_test/boost_test_2.py)
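
One caveat when turning regressor outputs into class labels with `round`: a regressor is not constrained to the range [0, 1], and Python rounds exact halves to the nearest even integer. An explicit threshold avoids both issues. The sketch below shows the difference on made-up prediction values; `to_labels` is a hypothetical helper, not part of the linked script.

```python
def to_labels(y_pred, threshold=0.5):
    """Map real-valued predictions to 0/1 labels with an explicit threshold."""
    return [1 if value >= threshold else 0 for value in y_pred]

raw = [-0.7, 0.2, 0.5, 1.6]  # made-up regressor outputs
rounded = [round(value) for value in raw]
labels = to_labels(raw)
print(rounded)
print(labels)
```

Rounding yields the out-of-range labels -1 and 2 for the first and last values, whereas thresholding always returns 0 or 1.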


### Feature Engineering, Hyperparameters, Accuracy Analysis

Combinations of word embedding features and sentiment score features were compared to see which would produce the best accuracy. The main parameters tuned in the model were MAX_LEN, MAX_LEN_2, test_size (the fraction of the data held out for testing, which plays the role of the k-fold setting), and the max_depth of the decision trees in the boosting model. The best results were achieved with test_size = 0.1 and max_depth = 3 for all models. For the models involving the 25d word embedding features, MAX_LEN = 350. For the models involving the 50d word embedding features, MAX_LEN = 550. For the models involving sentiment scores, MAX_LEN_2 = 6.

The best scores, along with the corresponding hyperparameters, are summarized in the table below:


| Feature Types                       | Accuracy | MAX_LEN | MAX_LEN_2 | test_size | max_depth |
|-------------------------------------|----------|---------|-----------|-----------|-----------|
| Only_Sentiment_Scores               | 54.95%   | n/a     | 6         | 0.1       | 3         |
| Only_25d_Word_Embedding             | 61.20%   | 350     | n/a       | 0.1       | 3         |
| Only_50d_Word_Embedding             | 61.98%   | 550     | n/a       | 0.1       | 3         |
| 25d_Word_Embedding_Sentiment_Scores | 62.50%   | 350     | 6         | 0.1       | 3         |
| 50d_Word_Embedding_Sentiment_Scores | 60.16%   | 550     | 6         | 0.1       | 3         |

### Discussion
Even though we hypothesized that boosting would be a simple way to increase performance, it was unable to beat the DNN approaches. The highest accuracy achieved with boosting was 62.50%, using 25d word embedding and sentiment score features. We concluded that proper feature selection is key to good performance with a boosting algorithm, and feature selection for classifying irony is complex. With more basic feature extraction, sophisticated systems such as DNNs perform better than the boosting approach.


# Result Analysis and Future Work

Here is a final look at the results:

| Model   Types                                   | Accuracy |
|-------------------------------------------------|----------|
| Naïve_Bayes_Unigram_Count                       | 65.66%   |
| Logistic_Regression_Unigram_TFIDF               | 64.44%   |
| Nu-SVC_RBF_Bigram_TFIDF                         | 66.92%   |
| XGBoost_Sentiment_Scores                        | 54.95%   |
| XGBoost_25d_Word_Embedding                      | 61.20%   |
| XGBoost_50d_Word_Embedding                      | 61.98%   |
| XGBoost_25d_Word_Embedding_Sentiment_Scores     | 62.50%   |
| XGBoost_50d_Word_Embedding_Sentiment_Scores     | 60.16%   |
| NN_Word_Embeddings                              | 71.52%   |
| NN_Sentiment_Scores                             | 73.01%   |
| **NN_Word_Embeddings_Sentiment_Scores**         |**73.67%**|
| RNNs_Attention_Word_Embeddings_GloVe            | 65.05%   |
| RNNs_Attention_Word_Embeddings_Word2Vec         | 62.80%   |
| RNNs_Attention_Sentiment_Scores                 | 66.89%   |
| RNNs_Attention_Word_Embeddings_Sentiment_Scores | 65.19%   |

### Key Features That Were Observed Through This Study: 

* The dataset was too small (3834 tweets), often leading to overfitting
* Neural Nets with word embeddings and sentiment scores perform the best
* For future work, we need to exploit tweet markers such as hashtags and user anchors as specific features for more accuracy
* Language incongruity can also be incorporated in the neural net model by training a tweet language model and comparing the model’s output with the word actually present in the tweet
* Context incongruity can also be incorporated in the models by scraping the URL to get the context in tweet dialogues or contextual references.
* Dataset has ambiguous and highly contextual irony labels like:
    - [We want turkey!!](../datasets/train/irony-corpus-taskA/irony/116.txt)
    - [@JordanNoftall that’s funny](../datasets/train/irony-corpus-taskA/irony/147.txt)

Finally, we need to submit all the models to the SemEval system to see their performance on the hidden dataset, which will be released on January 15th.