This project applies statistical approaches, both conventional machine learning and deep learning, to identify which one of four emotions a tweet expresses: anger, fear, joy, or sadness.

The datasets used for this project are the modified datasets provided in the Kaggle competition, together with additional data from the [Affect in Tweets task at SemEval in 2018](https://competitions.codalab.org/competitions/17751#learn_the_details).

The **Support Vector Classification** model is the conventional machine learning model chosen here. Support vector classifiers are memory efficient, because the decision function depends only on a subset of the training points (the support vectors), and they work relatively well when there is a clear margin of separation between the classes. This is because an SVM takes the data as input and finds the hyperplane (a line, in two dimensions) that separates the classes with the largest margin.

The SVM consistently achieved good performance for emotion prediction, outperforming the other conventional models tried here. Because SVMs generalize well in high-dimensional feature spaces, they reduce the need for feature selection, which makes text categorization considerably easier.

> TfidfVectorizer was used to transform text to feature vectors that can be used as input to estimator.

The **Convolutional Neural Network** model is the deep learning model selected here. Each tweet is passed to the CNN as a sequence. The input is first passed to an embedding layer, then to a convolutional layer and a pooling layer. Finally, a Dense layer with four output units (one per class) and a sigmoid activation is applied.

> Tokenizer utility class is used to vectorize a text corpus into a list of integers.

> To handle text sequences of different lengths, pad_sequences() is used, which pads each sequence of word indexes with zeros.

The convolutional neural network was the better performing of the two models, with an accuracy of **71.751%** on the public dataset, while Support Vector Classification gave a slightly lower accuracy of **70.300%**.



In [0]:
#importing required libraries and modules

import numpy as np
import pandas as pd
import pickle
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk import wordnet
from nltk.corpus import stopwords
import string
from nltk.corpus import wordnet as part_of_speech
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection , naive_bayes , svm
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from nltk.stem.porter import PorterStemmer
from collections import defaultdict
from sklearn.linear_model import LogisticRegression
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Using TensorFlow backend.


The dataset consists of three groups of files, for the training data, the validation data and the public test data, in npy and xlsx format. One type of file contains the content of the tweets, with words represented by indexes; the other type contains the labels. There is also a single file mapping words to indexes, in Python pickle format.

In [0]:
#first set of train data
train_labels_main = np.load('text_train_labels.npy')
with open('text_word_to_idx.pkl', 'rb') as f:
    data = pickle.load(f)
train_tweets_main = np.load('text_train_tweets.npy')

#second set of train data
df_train_tweet2_main = pd.read_excel ('Trail Tweets.xlsx')
df_train_label2_main  = pd.read_excel ('Trial labels.xlsx', header = None)

#validation set
Val_labels_main = np.load('text_val_labels.npy')
Val_tweets_main = np.load('text_val_tweets.npy')

#public tweets
public_tweets = np.load('text_test_public_tweets_rand.npy')

The code below handles the data preprocessing, followed by feature extraction.

In [0]:
#Interchanging dictionary key and values
new_dict = dict([(value, key) for key, value in data.items()])

#decoding the tweets in train_tweets
xyz = []
for i in train_tweets_main:
    new_data = []
    for j in i:
         new_data.append(new_dict[j] )
    xyz.append(new_data)
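
As a minimal illustration of the decoding step above, the sketch below inverts a toy word-to-index mapping and decodes two index-encoded tweets. The mapping and the tweets are invented values for illustration; they are not taken from `text_word_to_idx.pkl`.

```python
# Toy stand-in for the pickled word-to-index mapping (values are invented).
word_to_idx = {'<START>': 0, '<END>': 1, 'i': 2, 'am': 3, 'happy': 4, 'sad': 5}

# Invert the mapping so that indexes can be looked up as words.
idx_to_word = {index: word for word, index in word_to_idx.items()}

# Two toy tweets encoded as lists of indexes.
encoded_tweets = [[0, 2, 3, 4, 1], [0, 2, 3, 5, 1]]

# Decode every index of every tweet back to its word.
decoded_tweets = [[idx_to_word[i] for i in tweet] for tweet in encoded_tweets]

print(decoded_tweets[0])
```

Inverting the dictionary once keeps each lookup constant-time, which matters when every word of every tweet has to be decoded.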

In [0]:
#part of speech tagging 

part_of_speech_tag = defaultdict (lambda : part_of_speech.NOUN)
part_of_speech_tag['J'] = part_of_speech.ADJ
part_of_speech_tag['V'] = part_of_speech.VERB
part_of_speech_tag['R'] = part_of_speech.ADV
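
The mapping above can be tried without NLTK installed. The sketch below assumes the single-letter codes that WordNet uses for parts of speech ('n', 'a', 'v', 'r') and a hand-written list of tagged words in place of real `pos_tag` output; the name `tag_map` is hypothetical.

```python
from collections import defaultdict

# WordNet identifies parts of speech with single letters:
# 'n' (noun), 'a' (adjective), 'v' (verb), 'r' (adverb).
NOUN, ADJ, VERB, ADV = 'n', 'a', 'v', 'r'

# Any tag prefix that is not listed falls back to NOUN.
tag_map = defaultdict(lambda: NOUN)
tag_map['J'] = ADJ
tag_map['V'] = VERB
tag_map['R'] = ADV

# Penn Treebank style tags, written by hand for this example.
tagged = [('running', 'VBG'), ('quickly', 'RB'), ('happy', 'JJ'), ('dog', 'NN')]

# Only the first letter of each tag is used for the lookup.
wordnet_tags = [(word, tag_map[tag[0]]) for word, tag in tagged]
print(wordnet_tags)
```

Using a `defaultdict` means unseen tag prefixes never raise a `KeyError`; they are simply treated as nouns by the lemmatizer.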

Function to preprocess .npy data

In [0]:
def preprocessinng(x):
    xyz = []
    for i in x:
        new_data = []
        for j in i:
             new_data.append(new_dict[j] )
        xyz.append(new_data)

    df = pd.DataFrame(xyz)

    #Some initial pre-processing
    df.replace('<START>' , np.nan , inplace= True)
    df.replace('<NULL>' , np.nan , inplace= True)
    df.replace('<END>' , np.nan , inplace= True)
    df.replace('<user>' , np.nan , inplace= True)
    df['ColumnA'] = df[df.columns[:]].apply( lambda x: ','.join(x.dropna().astype(str)),axis=1)
    df2 = pd.DataFrame(df['ColumnA']) 
    df2.replace('#' , "", inplace= True, regex = True)
    #df2.replace('_' ," " , inplace= True, regex = True)
    df2['ColumnA'] = [entry.lower () for entry in df2['ColumnA']]
    df2.replace('angry','anger',inplace=True, regex = True)
    df2.replace('furious','anger',inplace=True, regex = True)
    df2.replace('irritated','anger',inplace=True, regex = True)
    df2.replace('enraged','anger',inplace=True, regex = True)
    df2.replace('annoyed','anger',inplace=True, regex = True)
    df2.replace('sad','sadness',inplace=True, regex = True)
    df2.replace('depressed','sadness',inplace=True, regex = True)
    df2.replace('devastated','sadness',inplace=True, regex = True)
    df2.replace('miserable','sadness',inplace=True, regex = True)
    df2.replace('disappointed','sadness',inplace=True, regex = True)
    df2.replace('terrified','fear',inplace=True, regex = True)
    df2.replace('discouraged','fear',inplace=True, regex = True)
    df2.replace('scared','fear',inplace=True, regex = True)
    df2.replace('anxious','fear',inplace=True, regex = True)
    df2.replace('fearful','fear',inplace=True, regex = True)
    df2.replace('happy','joy',inplace=True, regex = True)
    df2.replace('ecstatic','joy',inplace=True, regex = True)
    df2.replace('glad','joy',inplace=True, regex = True)
    df2.replace('relieved','joy',inplace=True, regex = True)
    df2.replace('excited','joy',inplace=True, regex = True)
    df2.replace('irritating','anger',inplace=True, regex = True)
    df2.replace('vexing','anger',inplace=True, regex = True)
    df2.replace('outrageous','anger',inplace=True, regex = True)
    df2.replace('annoying','anger',inplace=True, regex = True)
    df2.replace('displeasing','sadness',inplace=True, regex = True)
    df2.replace('depressing','sadness',inplace=True, regex = True)
    df2.replace('serious','sadness',inplace=True, regex = True)
    df2.replace('grim','sadness',inplace=True, regex = True)
    df2.replace('heartbreaking','sadness',inplace=True, regex = True)
    df2.replace('gloomy','fear',inplace=True, regex = True)
    df2.replace('horrible','fear',inplace=True, regex = True)
    df2.replace('threatening','fear',inplace=True, regex = True)
    df2.replace('terrifying','fear',inplace=True, regex = True)
    df2.replace('shocking','fear',inplace=True, regex = True)
    df2.replace('dreadful','joy',inplace=True, regex = True)
    df2.replace('funny','joy',inplace=True, regex = True)
    df2.replace('hilarious','joy',inplace=True, regex = True)
    df2.replace('amazing','joy',inplace=True, regex = True)
    df2.replace('wonderful','joy',inplace=True, regex = True)
    
    # Remove blank rows.
    df2['ColumnA'].dropna(inplace = True)

    # Convert the text to lowercase
    df2['ColumnA'] = [entry.lower () for entry in df2['ColumnA']]

    # Tokenise the text 
    df2['ColumnA']= [word_tokenize (entry) for entry in df2['ColumnA']]

    # Remove stop words and non-alphabetic words and perform lemmatisation.
    # The stop-word set and the lemmatizer are built once, outside the loops.
    stop_words = set(stopwords.words('english'))
    word_lemmatized = WordNetLemmatizer()
    for index, entry in enumerate(df2['ColumnA']):
        final_words = []
        for word, tag in pos_tag(entry):
            if word not in stop_words and word.isalpha():
                word_final = word_lemmatized.lemmatize(word, part_of_speech_tag[tag[0]])
                final_words.append(word_final)
        # As before, rows with no surviving words are left unset (NaN).
        if final_words:
            df2.loc[index, 'text_final'] = str(final_words)


    del df2['ColumnA']
    return df2
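
One thing to be aware of in the synonym replacements above: with `regex = True`, each pattern matches anywhere in the text, including inside longer words. The sketch below shows the effect on a made-up sentence and one possible refinement using word boundaries. It assumes whole-word matching is what is wanted; it is not the code that produced the reported results, and the helper names are hypothetical.

```python
import re

# A small subset of the synonym-to-label mapping, for illustration only.
synonyms = {'sad': 'sadness', 'happy': 'joy', 'angry': 'anger'}

def replace_substring(text, mapping):
    """Replace every occurrence, even inside longer words."""
    for word, label in mapping.items():
        text = re.sub(word, label, text)
    return text

def replace_whole_word(text, mapping):
    """Replace a synonym only when it appears as a complete word."""
    for word, label in mapping.items():
        text = re.sub(r'\b' + re.escape(word) + r'\b', label, text)
    return text

sample = 'sadness makes me unhappy and sad'
print(replace_substring(sample, synonyms))   # sadnessness makes me unjoy and sadness
print(replace_whole_word(sample, synonyms))  # sadness makes me unhappy and sadness
```

Whether the whole-word variant improves accuracy is not something tested here; it only avoids producing tokens such as "sadnessness".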

Function to preprocess excel data (additional dataset to train the models)

In [0]:
def preprocessinng2(x):
    
    df = pd.DataFrame(x)

    df2 = pd.DataFrame(df['ColumnA'])
    df2 = df2.astype(str)
    df2.replace('#' , "", inplace= True, regex = True)
    #df2.replace('_' ," " , inplace= True, regex = True)
    df2['ColumnA'] = [entry.lower () for entry in df2['ColumnA']]
    df2.replace('angry','anger',inplace=True, regex = True)
    df2.replace('furious','anger',inplace=True, regex = True)
    df2.replace('irritated','anger',inplace=True, regex = True)
    df2.replace('enraged','anger',inplace=True, regex = True)
    df2.replace('annoyed','anger',inplace=True, regex = True)
    df2.replace('sad','sadness',inplace=True, regex = True)
    df2.replace('depressed','sadness',inplace=True, regex = True)
    df2.replace('devastated','sadness',inplace=True, regex = True)
    df2.replace('miserable','sadness',inplace=True, regex = True)
    df2.replace('disappointed','sadness',inplace=True, regex = True)
    df2.replace('terrified','fear',inplace=True, regex = True)
    df2.replace('discouraged','fear',inplace=True, regex = True)
    df2.replace('scared','fear',inplace=True, regex = True)
    df2.replace('anxious','fear',inplace=True, regex = True)
    df2.replace('fearful','fear',inplace=True, regex = True)
    df2.replace('happy','joy',inplace=True, regex = True)
    df2.replace('ecstatic','joy',inplace=True, regex = True)
    df2.replace('glad','joy',inplace=True, regex = True)
    df2.replace('relieved','joy',inplace=True, regex = True)
    df2.replace('excited','joy',inplace=True, regex = True)
    df2.replace('irritating','anger',inplace=True, regex = True)
    df2.replace('vexing','anger',inplace=True, regex = True)
    df2.replace('outrageous','anger',inplace=True, regex = True)
    df2.replace('annoying','anger',inplace=True, regex = True)
    df2.replace('displeasing','sadness',inplace=True, regex = True)
    df2.replace('depressing','sadness',inplace=True, regex = True)
    df2.replace('serious','sadness',inplace=True, regex = True)
    df2.replace('grim','sadness',inplace=True, regex = True)
    df2.replace('heartbreaking','sadness',inplace=True, regex = True)
    df2.replace('gloomy','fear',inplace=True, regex = True)
    df2.replace('horrible','fear',inplace=True, regex = True)
    df2.replace('threatening','fear',inplace=True, regex = True)
    df2.replace('terrifying','fear',inplace=True, regex = True)
    df2.replace('shocking','fear',inplace=True, regex = True)
    df2.replace('dreadful','joy',inplace=True, regex = True)
    df2.replace('funny','joy',inplace=True, regex = True)
    df2.replace('hilarious','joy',inplace=True, regex = True)
    df2.replace('amazing','joy',inplace=True, regex = True)
    df2.replace('wonderful','joy',inplace=True, regex = True)

    # Remove blank rows.
    df2['ColumnA'].dropna(inplace = True)

    # Convert the text to lowercase
    df2['ColumnA'] = [entry.lower () for entry in df2['ColumnA']]

    # Tokenise the text 
    df2['ColumnA']= [word_tokenize (entry) for entry in df2['ColumnA']]

    # Remove stop words and non-alphabetic words and perform lemmatisation.
    # The stop-word set and the lemmatizer are built once, outside the loops.
    stop_words = set(stopwords.words('english'))
    word_lemmatized = WordNetLemmatizer()
    for index, entry in enumerate(df2['ColumnA']):
        final_words = []
        for word, tag in pos_tag(entry):
            if word not in stop_words and word.isalpha():
                word_final = word_lemmatized.lemmatize(word, part_of_speech_tag[tag[0]])
                final_words.append(word_final)
        # As before, rows with no surviving words are left unset (NaN).
        if final_words:
            df2.loc[index, 'text_final'] = str(final_words)


    del df2['ColumnA']
    return df2

In [0]:
#preprocessing train dataset
X_train_initial = preprocessinng(train_tweets_main)
train_labels_df = pd.DataFrame(train_labels_main)

#preprocessing external dataset for training
X_train_initial2 = preprocessinng2(df_train_tweet2_main)
train_labels_df_2 = pd.DataFrame(df_train_label2_main)

In [0]:
#preprocessing public tweets
public_tweets =preprocessinng(public_tweets)

#preprocessing validation data
Val_tweets = preprocessinng(Val_tweets_main)   
val_labels = pd.DataFrame(Val_labels_main)

In [0]:
#concatenating the tweets to split into test and train

initial_tweet = pd.concat([X_train_initial,X_train_initial2])
initial_labels = pd.concat([train_labels_df,train_labels_df_2])

**Conventional Machine Learning Model**

The final model that produced the best-performing predictions for the Kaggle submission (accuracy 70.300%) was an SVM with a linear kernel.


In [0]:
combined_df = pd.concat([initial_tweet, Val_tweets, public_tweets])
combined_df = combined_df.astype(str)

X_train_initial_combined = pd.concat([initial_tweet, Val_tweets])
y_train = pd.concat([initial_labels, val_labels])
X_test_initial = public_tweets

#converting the tweets to string datatype
X_train_initial = X_train_initial_combined.astype(str)
X_test_initial = X_test_initial.astype(str)

#selecting the text column of the train and test sets as pandas Series
X_train = X_train_initial.text_final
X_test = X_test_initial.text_final

# Vectorize the words by using TF IDF Vectorizer
tfidf_vect= TfidfVectorizer(max_features=50000)
tfidf_vect.fit(combined_df['text_final'])


TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=50000,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [0]:
#inspecting the learned vocabulary 
print(tfidf_vect.vocabulary_)

In [0]:
#Transform X_train and X_test to vectorized X_train_tfidf and X_test_tfidf
X_train_tfidf= tfidf_vect.transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)

#print(X_train_tfidf)
#print(X_test_tfidf)
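
To make the TF-IDF weighting more concrete, here is a small pure-Python sketch of the computation on three toy documents. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which corresponds to the `smooth_idf=True` and `norm='l2'` settings shown in the output above. It splits on whitespace only, so it does not reproduce the `token_pattern` filtering, and all names in it are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = ['joy joy happy', 'sad fear', 'joy fear']

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_vector(tokens):
    """Raw term count times idf, then scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

Words that occur in fewer documents receive a larger idf, so a rare word such as "happy" weighs more per occurrence than a common one such as "joy".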

The model used here is a support vector classifier with regularization parameter C = 1 and a linear kernel. The `degree` and `gamma='scale'` arguments are also passed, but they only apply to the non-linear kernels and have no effect with a linear kernel. These were the best parameters found for this model, giving an accuracy of 70.300% on the public tweet dataset in Kaggle.

In [0]:
#SVM classifier#

SVM = svm.SVC(C=1, kernel='linear', degree=3, gamma='scale')

# Fit the training dataset.
SVM.fit(X_train_tfidf, y_train)

# Predict the labels on the public test dataset.
predictions_SVM= SVM.predict(X_test_tfidf)

#convert the prediction to DataFrame
predictions_SVM = pd.DataFrame(predictions_SVM)

#convert the DataFrame of predictions to csv format
predictions_SVM.to_csv('45575657-conv.csv', index = True)

  y = column_or_1d(y, warn=True)


In addition to this model, I also tried a logistic regression model. Its accuracy (68.9%) was slightly lower than that of the SVC. This might be because the SVM tries to find the maximum-margin boundary (the largest distance between the separating hyperplane and the support vectors), which reduces the risk of error on unseen data. Logistic regression does not maximize the margin; it fits class probabilities using all training points, so its decision boundary can differ from the maximum-margin one.
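
One common way to frame the difference between the two models is through their loss functions. The sketch below compares the hinge loss (used by a standard SVM) with the logistic loss for a few signed margins y · f(x); the margin values are arbitrary illustration points, not results from this project.

```python
import math

def hinge_loss(margin):
    """SVM loss for a signed margin y * f(x)."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss for the same signed margin."""
    return math.log(1.0 + math.exp(-margin))

# Negative margins are misclassified points, large positive ones are
# confidently correct points.
for margin in (-1.0, 0.0, 0.5, 1.0, 3.0):
    print(margin, round(hinge_loss(margin), 4), round(logistic_loss(margin), 4))
```

The hinge loss is exactly zero once a point lies beyond the margin, so only points near the boundary influence the SVM, whereas the logistic loss stays positive for every point.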

**DEEP LEARNING**

The final model that produced the best-performing predictions for the Kaggle submission (accuracy 71.751%) was a convolutional neural network. A few further steps are required before the preprocessed data can be fed to the deep learning model, as follows.

In [0]:
#Converting the labels into categorical variables

initial_labels = to_categorical(initial_labels)
val_labels = to_categorical(val_labels)
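
For readers unfamiliar with one-hot encoding, the following pure-Python sketch shows the kind of transformation that `to_categorical` performs. It assumes the labels are integers from 0 to 3; the function name `one_hot` and the sample labels are hypothetical.

```python
def one_hot(labels, num_classes=None):
    """Turn integer class labels into one-hot rows."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

# Assumed label encoding for illustration: 0..3 for the four emotions.
labels = [0, 2, 1, 3]
encoded = one_hot(labels)
for label, row in zip(labels, encoded):
    print(label, row)
```

Each row contains a single 1 in the position of its class, which is the target format expected by a categorical cross-entropy loss.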

In [0]:
#datatype conversion
initial_tweet = initial_tweet.astype(str)
Val_tweets = Val_tweets.astype(str)

In [0]:
#convert the tweets to arrays

sentences = initial_tweet['text_final'].values
val_tweets = Val_tweets['text_final'].values   

The Tokenizer utility class provided by Keras is used to vectorize the sentences. It builds a vocabulary of all the unique words in the sentences and assigns each word an integer index. This vocabulary is then used to turn each sentence into a sequence of integers.

In [0]:
tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(sentences) #the vocabulary is fitted on the training sentences only

X_train = tokenizer.texts_to_sequences(sentences)
X_test = tokenizer.texts_to_sequences(val_tweets)
#val_tweets = tokenizer.texts_to_sequences(val_tweets)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
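
The sketch below imitates, in simplified form, what the tokenizer step does: words are numbered from 1 in order of decreasing frequency, and words that were not seen during fitting are dropped when texts are converted. It only lowercases and splits on whitespace, so it is an approximation rather than the Keras implementation, and the function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Number words from 1, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: index
            for index, (word, _count) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[word] for word in text.lower().split() if word in word_index]
            for text in texts]

train_texts = ['joy joy happy', 'sad and happy']
word_index = fit_word_index(train_texts)
vocab_size = len(word_index) + 1  # +1 for the reserved 0 index

sequences = texts_to_sequences(['happy unknown joy'], word_index)
print(word_index)
print(sequences)
```

Because unseen words are silently dropped, tweets from the validation or public sets can end up shorter than their original word count.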

One problem is that the text sequences usually differ in length. To counter this, pad_sequences() is used, which pads each sequence with zeros. By default it prepends zeros; here padding='post' is passed, so the zeros are appended instead.

Additionally, a maxlen parameter specifies how long the sequences should be. Sequences longer than maxlen are truncated (by default from the beginning of the sequence).

In [0]:
#sentence padding
maxlen = 250

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
#val_tweets = pad_sequences(val_tweets, padding='post', maxlen=maxlen)
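
The effect of post-padding with a maximum length can be sketched in a few lines of plain Python. The helper below is a hypothetical stand-in written for this example; it assumes that over-long sequences keep their last `maxlen` items, matching the default truncation from the front.

```python
def pad_post(sequences, maxlen, value=0):
    """Append zeros up to maxlen; over-long sequences keep their last maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy_sequences = [[1, 2], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)  # [[1, 2, 0, 0], [3, 4, 5, 6]]
```

After this step every row has the same length, so the batch can be passed to the embedding layer as a rectangular array.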

In [0]:
#Preprocessing public tweets for deep learning model

public_tweets = public_tweets.astype(str)
public_tweets = public_tweets['text_final'].values  
public_tweets = tokenizer.texts_to_sequences(public_tweets)
public_tweets = pad_sequences(public_tweets, padding='post', maxlen=maxlen)

A Keras Embedding layer is used here; it takes the previously calculated integer indexes and maps each one to a dense embedding vector.

The output of the embedding layer is fed into a convolutional layer. A pooling layer is added after this to downsample the feature maps.

This is finally connected, through a small Dense layer with 10 units, to a Dense output layer with 4 units (one per class) and a sigmoid activation. Softmax is the more usual output activation with categorical_crossentropy for single-label classification; sigmoid is what was used for the reported results.

The model is then configured with categorical_crossentropy as the loss function and categorical_accuracy as the metric.
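
As a side note on the output activation, the sketch below compares sigmoid and softmax on one made-up vector of four scores and computes a categorical cross-entropy for the softmax output. It is a plain-Python illustration under these assumptions, not the Keras implementation.

```python
import math

def sigmoid(logits):
    """Squash each score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    """Turn scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0, 0.0]   # invented scores for the four classes
target = [1.0, 0.0, 0.0, 0.0]    # the true class is the first one

sig = sigmoid(logits)
soft = softmax(logits)
print(round(sum(sig), 4), round(sum(soft), 4))
print(round(categorical_crossentropy(target, soft), 4))
```

Both activations rank the classes identically, so the predicted class is the same; the difference is that only the softmax outputs form a probability distribution over the four emotions.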

In [0]:
embedding_dim = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(4, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 250, 100)          1867200   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 246, 128)          64128     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290      
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 44        
Total params: 1,932,662
Trainable params: 1,932,662
Non-trainable params: 0
_________________________________________________________________
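
The parameter counts in the summary above can be checked by hand. The sketch below redoes the arithmetic; the vocabulary size of 18,672 is inferred from the embedding parameter count (1,867,200 / 100) rather than stated anywhere in the notebook.

```python
vocab_size = 18672      # inferred: 1,867,200 embedding parameters / 100
embedding_dim = 100
maxlen = 250
filters, kernel_size = 128, 5

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Conv1D: kernel_size * input_channels weights per filter, plus one bias each.
conv_params = kernel_size * embedding_dim * filters + filters

# Without padding, the convolution shortens the sequence by kernel_size - 1.
conv_output_length = maxlen - kernel_size + 1

# Dense layers: inputs * units weights, plus one bias per unit.
dense1_params = filters * 10 + 10
dense2_params = 10 * 4 + 4

total_params = embedding_params + conv_params + dense1_params + dense2_params
print(embedding_params, conv_params, conv_output_length,
      dense1_params, dense2_params, total_params)
```

Almost all of the parameters sit in the embedding layer, which is typical for text models with a large vocabulary and a small convolutional head.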


The model is trained for 5 epochs with a batch size of 10, using the validation set to monitor performance:

In [0]:
history = model.fit(X_train, initial_labels,
                    epochs=5,
                    verbose=False,
                    validation_data=(X_test, val_labels),
                    batch_size=10)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [0]:
loss, accuracy = model.evaluate(X_train, initial_labels, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, val_labels, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Training Accuracy: 0.9602
Testing Accuracy:  0.7988


In addition to the final model, I also tried a CNN without the pooling layer. This gave a noticeably lower accuracy (63.902%) than the model with the pooling layer. The pooling layer used here is global max pooling, which takes the maximum value of each feature (filter) across all positions in the sequence.
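
A minimal sketch of global max pooling on a toy feature map is shown below. The feature map has 4 positions and 3 filters with made-up activations; the function name is hypothetical.

```python
def global_max_pool(feature_map):
    """feature_map: list of positions, each a list of filter activations."""
    # zip(*...) regroups the values by filter, then the maximum of each is taken.
    return [max(column) for column in zip(*feature_map)]

# Rows are positions in the sequence, columns are filters.
feature_map = [
    [0.1, 0.0, 0.7],
    [0.9, 0.2, 0.3],
    [0.4, 0.8, 0.1],
    [0.0, 0.5, 0.6],
]

pooled = global_max_pool(feature_map)
print(pooled)  # [0.9, 0.8, 0.7]
```

The pooled vector has one value per filter regardless of sequence length, which is why the summary above shows an output shape of (None, 128) after this layer.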

In [0]:
#predict the class scores for the public tweets
tweets = model.predict(public_tweets , verbose=0)
#pick the index of the highest-scoring class for each tweet
out = np.argmax(tweets, axis = 1)
out = pd.DataFrame(out)
out.to_csv('45575657-deep.csv', index = True)

Comparing my final conventional ML and deep learning models, the deep learning one performed better by **1.451** percentage points on the public test set (71.751% versus 70.300% accuracy).

Developer: **Midhun MJ**
Date:  **July 23 2020**
Place:  **Sydney, Australia**