
# ULMFiT for Airline Tweet Sentiment Analysis

This notebook demonstrates how to apply a supervised ULMFiT model to the "Twitter US Airline Sentiment" dataset, available at https://www.kaggle.com/crowdflower/twitter-airline-sentiment#Tweets.csv

## Environment Setup 

In [3]:
# ! conda create -n fastai
# ! conda activate fastai
# ! conda install jupyter notebook
# ! conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
# ! conda install nltk
# ! conda install pandas

In [59]:
"""
Author: Manoj Pravakar Saha
Email: hello@manojsaha.com
License: Apache License 2.0
"""

import re
import os
from functools import partial
from collections import Counter
import string
import pandas as pd
import nltk
from nltk.corpus import wordnet
from fastai.text import *

## Pre-processing Data

In this step, we'll pre-process the data before feeding it into the model. For brevity, I am jumping directly to pre-processing and skipping Exploratory Data Analysis (EDA). To ensure better model performance, we should perform EDA to understand the data before moving on to the model.

For pre-processing I am using a subset of techniques discussed in the paper titled "A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis". The code is available at https://github.com/Deffro/text-preprocessing-techniques. I am using the provided code with some minor modifications.

In [97]:
# Import tweets from csv file and view the first few lines
df = pd.read_csv('Tweets.csv', sep=',')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [39]:
# get the feature names
features = df.columns.tolist()
print(features)

['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence', 'airline', 'airline_sentiment_gold', 'name', 'negativereason_gold', 'retweet_count', 'text', 'tweet_coord', 'tweet_created', 'tweet_location', 'user_timezone']


In [40]:
# Number of tweets and features
df.shape

(14640, 15)

We have 14,640 tweets in the dataset, each with 15 features. Since I am using ULMFiT, I will only use the text (or rather its contextual embeddings) as the input feature for fine-tuning the language model and the supervised text classifier.


### Pre-processing techniques
I am using the following tweet pre-processing techniques.
1. Remove unicode strings
2. Replace urls with empty string
3. Replace user mentions with empty string
4. Replace hashtags
5. Replace slang and abbreviations
6. Replace contractions
7. Remove numbers
8. Remove punctuation marks and special characters
9. Replace emoticons
10. Lowercase text
11. Replace negations

I have tested the spell correction feature but omitted it, since the implementation is not very efficient and takes too long. For the same reason, I have not applied stopword removal here.

In [41]:
# A subset of techniques for tweet pre-processing
# Originally published by Dimitrios Effrosynidis at 
# https://github.com/Deffro/text-preprocessing-techniques

def removeUnicode(text):
    """ Removes unicode strings like "\u002c" and "x96" """
    text = re.sub(r'(\\u[0-9A-Fa-f]+)',r'', text)       
    text = re.sub(r'[^\x00-\x7f]',r'',text)
    return text

def replaceURL(text):
    """ Replaces URLs with an empty string (and strips the "#" from hashtags) """
    # text = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','url',text)
    text = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','',text)
    text = re.sub(r'#([^\s]+)', r'\1', text)
    return text

def replaceAtUser(text):
    """ Replaces "@user" mentions with an empty string """
    # text = re.sub('@[^\s]+','atUser',text)
    text = re.sub('@[^\s]+','',text)
    return text

def removeHashtagInFrontOfWord(text):
    """ Removes hashtag in front of a word """
    text = re.sub(r'#([^\s]+)', r'\1', text)
    return text

def removeNumbers(text):
    """ Removes integers """
    text = ''.join([i for i in text if not i.isdigit()])         
    return text

def removeEmoticons(text):
    """ Removes emoticons from text """
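    # '*' is escaped so that ':-*' matches the emoticon rather than every ':',
    # and ':S' / ':-S' are literal instead of matching any non-space character
    text = re.sub(r':\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-\*|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:S|T\.T|\.\_\.|:<|:-S|:-<|\*\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:', '', text)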
    return text

""" Creates a dictionary with slangs and their equivalents and replaces them """
with open('slang.txt', encoding='utf8', errors='ignore') as file:
    slang_map = dict(map(str.strip, line.partition('\t')[::2])
    for line in file if line.strip())

slang_words = sorted(slang_map, key=len, reverse=True) # longest first for regex
regex = re.compile(r"\b({})\b".format("|".join(map(re.escape, slang_words))))
replaceSlang = partial(regex.sub, lambda m: slang_map[m.group(1)])

def replaceElongated(word):
    """ Replaces an elongated word with its basic form, unless the word exists in the lexicon """

    repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
    repl = r'\1\2\3'
    if wordnet.synsets(word):
        return word
    repl_word = repeat_regexp.sub(repl, word)
    if repl_word != word:      
        return replaceElongated(repl_word)
    else:       
        return repl_word

""" Replaces contractions from a string to their equivalents """
contraction_patterns = [ (r'won\'t', 'will not'), (r'can\'t', 'cannot'), (r'i\'m', 'i am'), (r'ain\'t', 'is not'), (r'(\w+)\'ll', '\g<1> will'), (r'(\w+)n\'t', '\g<1> not'),
                         (r'(\w+)\'ve', '\g<1> have'), (r'(\w+)\'s', '\g<1> is'), (r'(\w+)\'re', '\g<1> are'), (r'(\w+)\'d', '\g<1> would'), (r'&', 'and'), (r'dammit', 'damn it'), (r'dont', 'do not'), (r'wont', 'will not') ]
def replaceContraction(text):
    patterns = [(re.compile(regex), repl) for (regex, repl) in contraction_patterns]
    for (pattern, repl) in patterns:
        (text, count) = re.subn(pattern, repl, text)
    return text


def lowercase(text):
    """ Make all characters lowercase """
    return text.lower()
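
The slang replacement above builds one regular expression from all slang keys and substitutes each match through a dictionary lookup. A minimal, self-contained sketch of the same idea follows. The tiny `slang_map` here is hypothetical and stands in for the contents of `slang.txt`, which are not shown in this notebook.

```python
import re
from functools import partial

# Hypothetical slang dictionary (stand-in for the entries loaded from slang.txt)
slang_map = {"thx": "thanks", "u": "you", "ur": "your", "asap": "as soon as possible"}

# Longest keys first, so longer alternatives are tried before shorter ones
slang_words = sorted(slang_map, key=len, reverse=True)

# \b anchors make sure only whole words are replaced ("rebook" keeps its "u"-free form)
pattern = re.compile(r"\b({})\b".format("|".join(map(re.escape, slang_words))))

# partial() turns pattern.sub into a one-argument function usable with Series.apply
replace_slang = partial(pattern.sub, lambda m: slang_map[m.group(1)])

print(replace_slang("thx, rebook ur flight asap"))
```

Because the pattern is anchored on word boundaries, ordinary words that merely contain a slang key are left untouched.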

In [42]:
### Spell Correction begin ###
""" Spell Correction http://norvig.com/spell-correct.html """
def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('corporaForSpellCorrection.txt').read()))

def P(word, N=sum(WORDS.values())): 
    """ Probability of `word`. """
    return WORDS[word] / N

def spellCorrection(word): 
    """ Most probable spelling correction for word. """
    return max(candidates(word), key=P)

def candidates(word): 
    """ Generate possible spelling corrections for word. """
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    """ The subset of `words` that appear in the dictionary of WORDS. """
    return set(w for w in words if w in WORDS)

def edits1(word):
    """ All edits that are one edit away from `word`. """
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    """ All edits that are two edits away from `word`. """
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

### Spell Correction End ###

In [43]:
### Replace Negations Begin ###

def replace(word, pos=None):
    """ Creates a set of all antonyms for the word and if there is only one antonym, it returns it """
    antonyms = set()
    for syn in wordnet.synsets(word, pos=pos):
      for lemma in syn.lemmas():
        for antonym in lemma.antonyms():
          antonyms.add(antonym.name())
    if len(antonyms) == 1:
      return antonyms.pop()
    else:
      return None

def replaceNegations(text):
    """ Finds "not" and antonym for the next word and if found, replaces not and the next word with the antonym """
    i, l = 0, len(text)
    words = []
    while i < l:
      word = text[i]
      if word == 'not' and i+1 < l:
        ant = replace(text[i+1])
        if ant:
          words.append(ant)
          i += 2
          continue
      words.append(word)
      i += 1
    return words

### Replace Negations End ###
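
To make the negation step easier to follow without downloading the WordNet corpus, here is a minimal sketch of the same token-level logic. The `ANTONYMS` dictionary is a hypothetical stand-in for the WordNet lookup done by `replace()` above, and `replace_negations` is an illustrative name, not part of the original code.

```python
# Hypothetical antonym table (the notebook queries WordNet instead)
ANTONYMS = {"happy": "unhappy", "good": "bad", "comfortable": "uncomfortable"}

def replace_negations(tokens, antonyms=ANTONYMS):
    """Replace 'not <word>' with the antonym of <word> when one is known."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "not" and i + 1 < len(tokens):
            antonym = antonyms.get(tokens[i + 1])
            if antonym:
                out.append(antonym)  # emit the antonym in place of both tokens
                i += 2
                continue
        out.append(tokens[i])        # otherwise keep the token unchanged
        i += 1
    return out

print(replace_negations("the seat was not comfortable".split()))
print(replace_negations("i did not fly today".split()))
```

When no antonym is known for the word after "not", both tokens are kept, which mirrors the behaviour of the WordNet-based version when it finds zero or several antonyms.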

In [9]:
# Some more methods for pre-processing
# Author: Manoj Pravakar Saha

def removeSpecialCharacters(text):
    """ Removes punctuation from text """
    # translator = str.maketrans('', '', string.punctuation)
    # return text.translate(translator)
    return re.sub(r'[^\w\s]',' ',text)

def replaceNegationsText(text):
    """ Replace negations from the entire text string (not a single token) """
    tokens = nltk.word_tokenize(text)
    tokens = replaceNegations(tokens) # Technique 6: finds "not" and antonym 
                                      # for the next word and if found, replaces not 
                                      # and the next word with the antonym
    onlyOneSentence = " ".join(tokens) # form again the sentence from the list of tokens
    return onlyOneSentence

def spellCorrectionText(text):
    """ Correct misspelled words in entire text """
    onlyOneSentenceTokens = [] # tokens of one sentence each time
    tokens = nltk.word_tokenize(text)
    for token in tokens:
        final_word = spellCorrection(token)
        onlyOneSentenceTokens.append(final_word)
    return " ".join(onlyOneSentenceTokens)

In [98]:
# Pre-processing techniques applied sequentially
df.text = df.text.apply(removeUnicode) # Remove Unicode characters
df.text = df.text.apply(lowercase) # Lowercase the text
df.text = df.text.apply(replaceURL) # Replace URLs with empty string
df.text = df.text.apply(replaceAtUser) # Replace @user with empty string
df.text = df.text.apply(removeHashtagInFrontOfWord) # Remove hashtags
df.text = df.text.apply(replaceSlang) # Replace slang and abbreviations
df.text = df.text.apply(replaceContraction) # Replace contractions with equivalent words
df.text = df.text.apply(removeNumbers) # Remove numbers from text
df.text = df.text.apply(removeEmoticons) # Remove emoticons from text
df.text = df.text.apply(removeSpecialCharacters) # Remove special characters
df.text = df.text.apply(replaceNegationsText) # Replace negations with antonyms
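
The sequential `apply` calls above amount to composing a list of string-to-string functions. The sketch below shows that composition on a single made-up tweet, using simplified versions of a few of the steps. The helper names (`strip_urls`, `strip_mentions`, `strip_hashtag_sign`, `strip_punct`, `preprocess`) and the sample text are illustrative assumptions, not the functions defined earlier.

```python
import re
from functools import reduce

def strip_urls(text):
    return re.sub(r'(www\.\S+)|(https?://\S+)', '', text)

def strip_mentions(text):
    return re.sub(r'@\S+', '', text)

def strip_hashtag_sign(text):
    return re.sub(r'#(\S+)', r'\1', text)

def strip_punct(text):
    return re.sub(r'[^\w\s]', ' ', text)

# Order matters: URLs must be removed before punctuation is stripped
STEPS = [str.lower, strip_urls, strip_mentions, strip_hashtag_sign, strip_punct]

def preprocess(text, steps=STEPS):
    """Apply each step in turn, then collapse repeated whitespace."""
    cleaned = reduce(lambda acc, step: step(acc), steps, text)
    return ' '.join(cleaned.split())

sample = "@VirginAmerica Loved the #flight! Details: https://example.com/x"
print(preprocess(sample))
```

Keeping the steps in a list makes it simple to switch individual techniques on or off when testing their effect on accuracy.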


In [99]:
# Create new dataframe with text and labels
df = df[['airline_sentiment', 'text']]
df = df.rename(columns={'airline_sentiment':'labels'})
df.head()

Unnamed: 0,labels,text
0,neutral,what said
1,positive,plus you have added commercials to the experie...
2,neutral,i did not today must mean i need to take anoth...
3,negative,it is really aggressive to blast obnoxious ent...
4,negative,and it is a really big bad thing about it


In [50]:
# Change labels into integers 
# df.loc[df['labels'] == 'positive', 'labels'] = 0
# df.loc[df['labels'] == 'neutral', 'labels'] = 1
# df.loc[df['labels'] == 'negative', 'labels'] = 2

In [100]:
# Divide data into training and test sets
test_df = df.sample(frac=0.2) # Randomly select 20% as test set
train_df = df.drop(test_df.index) # Keep the rest as training set

In [101]:
# Print the number of samples in each set
print('Trainset-sample size: {} \nTestset-sample size: {}'.\
      format(train_df.shape[0], test_df.shape[0]))

Trainset-sample size: 11712 
Testset-sample size: 2928


Now that we have split the data into training and test sets at an 80-20 ratio, we should verify that both sets contain a similar distribution of sentiments.

In [53]:
def column_value_counts(df, target_column, new_column):
    '''
    Get value counts of each categorical variable. Store this data in 
    a dataframe. Also add a column with relative percentage of each 
    categorical variable.
    
    :param df: A Pandas dataframe
    :param target_column: Name of the column in the original dataframe (string)
    :param new_column: Name of the new column where the frequency counts are stored 
    :type df: pandas.core.frame.DataFrame
    :type target_column: str
    :type new_column: str
    :return: A Pandas dataframe containing the frequency counts
    :rtype: pandas.core.frame.DataFrame
    '''
    df_value_counts = df[target_column].value_counts()
    df = pd.DataFrame(df_value_counts)
    df.columns = [new_column]
    df[new_column+'_%'] = 100*df[new_column] / df[new_column].sum()
    return df

# Get frequency distribution of labels in each set
df_train = column_value_counts(train_df, 'labels', 'Train')
df_test = column_value_counts(test_df, 'labels', 'Test')

label_count = pd.concat([df_train, df_test], axis=1) # Merge dataframes by index
label_count = label_count.fillna(0) # Replace Nan with 0 (zero)
label_count = label_count.round(2) # Rounding decimals to two digits after .
print(label_count.sort_values(by=['Train'], ascending=False))

          Train  Train_%  Test  Test_%
negative   7333    62.61  1845   63.01
neutral    2497    21.32   602   20.56
positive   1882    16.07   481   16.43


The above table shows that the distribution of negative, positive and neutral tweets is similar in both sets. Hence, we can now save them for future use.

In [56]:
# Save training and test sets into CSV files
train_df.to_csv('train.csv', header=False, index=False, encoding='utf-8')
test_df.to_csv('test.csv', header=False, index=False, encoding='utf-8')
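
The random 80-20 split used earlier happened to preserve the label distribution. As an alternative, a stratified split guarantees this by sampling the same fraction from each label. The sketch below is one way to do it with pandas (it assumes pandas 1.1 or later for `GroupBy.sample`). The helper name `stratified_split`, the fixed seed, and the toy data are assumptions for illustration only.

```python
import pandas as pd

def stratified_split(df, label_col, test_frac=0.2, seed=42):
    """Sample test_frac of the rows from every label, keep the rest for training."""
    test = df.groupby(label_col, group_keys=False).sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test

# Toy data with an imbalanced label distribution
toy = pd.DataFrame({
    "labels": ["negative"] * 60 + ["neutral"] * 25 + ["positive"] * 15,
    "text": [f"tweet {i}" for i in range(100)],
})

train, test = stratified_split(toy, "labels")
print(test["labels"].value_counts().to_dict())
```

Fixing `random_state` also makes the split reproducible across notebook runs, which the plain `df.sample(frac=0.2)` call does not.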

## Language Model and Supervised Classifier

To complete this part, I have taken help from three different sources.
- https://docs.fast.ai/text.html#Fine-tuning-a-language-model
- https://www.analyticsvidhya.com/blog/2018/11/tutorial-text-classification-ulmfit-fastai-library/
- https://github.com/estorrs/twitter-celebrity-tweet-sentiment/blob/master/celebrity-twitter-sentiment.ipynb

I have used fastai version 1.0 for this demo. The last example above is based on version 0.7. However, it helped me understand some of the issues related to retraining the ULMFiT model for a new dataset.

Before we can proceed with retraining and fine-tuning the language model on our dataset, we need to download the ULMFiT model weights pre-trained on Wikipedia (the WikiText-103 corpus) using the following command.

In [114]:
# ! wget -nH -r -np -P http://files.fast.ai/models/wt103/

We will also use the LSTM model weights pre-trained on the same dataset.

In [115]:
# ! wget -nH -r -np -P http://files.fast.ai/models/wt103_v1/lstm_wt103.pth
# ! wget -nH -r -np -P http://files.fast.ai/models/wt103_v1/itos_wt103.pkl

Now that we have downloaded the pretrained models, we can reload the training and test sets.

In [76]:
# Load training set in Pandas dataframe
train_df = pd.read_csv('train.csv', header=None,  encoding='latin-1') 

# Load test set in Pandas dataframe
val_df = pd.read_csv('test.csv', header=None, encoding='latin-1') 

Now that we have both the pretrained model and the datasets, we can prepare the data for our language model and classifier. Note that we need two different data objects. The DataBunch classes in fast.ai version 1.0 make it easy to read and prepare the data for training the models; basic pre-processing tasks such as tokenization and numericalization are handled internally.

In [77]:
# Prepare data for language model
# val_df is the test set reloaded from test.csv above
data_lm = TextLMDataBunch.from_df(train_df = train_df, valid_df = val_df, path = "")

# Prepare data for classifier model
# I am using a batch size of 16
data_clas = TextClasDataBunch.from_df(path = "", train_df = train_df, valid_df = val_df, 
                                      vocab=data_lm.train_ds.vocab, bs=16)

### Language model
Since we have the data ready, we can now re-train and fine-tune the language model. The AWD_LSTM learner is initialized from the pretrained weights, which is probably why it provides the best downstream performance. I'll use it, with the pre-trained weights, to train and fine-tune my model.

In [78]:
# Initialize the learner object with the AWD_LSTM model
# drop_mult=0.5 scales all dropout probabilities of the model by 0.5
learn = language_model_learner(data_lm, arch=AWD_LSTM, 
                               pretrained_fnames=['lstm_wt103', 'itos_wt103'], drop_mult=0.5)

Fast.ai provides two different methods to train the model: fit() and fit_one_cycle(). I have tested both. For re-training and fine-tuning, I'll stick to fit_one_cycle(), which follows the one-cycle learning rate policy. To learn more, see https://arxiv.org/abs/1803.09820

In [79]:
# train the learner object with learning rate = 1e-2
learn.fit_one_cycle(1, 1e-2)
#learn.fit(10)

epoch,train_loss,valid_loss,accuracy,time
0,5.448481,4.554461,0.189872,00:08
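
To give an intuition for what fit_one_cycle() does with the learning rate, here is a simplified sketch of a one-cycle schedule: the rate warms up to a maximum and is then annealed to a very small value. The function name `one_cycle_lr` and the default values (`div_factor=25`, `pct_start=0.3`, `final_div=1e4`, cosine interpolation) are assumptions chosen for illustration and may differ from what fastai actually uses.

```python
import math

def one_cycle_lr(step, total_steps, lr_max=1e-2, div_factor=25.0,
                 pct_start=0.3, final_div=1e4):
    """Learning rate at `step` for a simplified one-cycle schedule."""
    warmup_steps = int(total_steps * pct_start)
    lr_start = lr_max / div_factor   # rate at the very first step
    lr_end = lr_max / final_div      # rate at the very last step

    def cos_interp(start, end, pct):
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1)
        return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

    if step < warmup_steps:
        return cos_interp(lr_start, lr_max, step / warmup_steps)
    remaining = max(1, total_steps - warmup_steps - 1)
    return cos_interp(lr_max, lr_end, (step - warmup_steps) / remaining)

# Print the rate at a few points of a 100-step cycle
for step in (0, 15, 30, 65, 99):
    print(step, f"{one_cycle_lr(step, 100):.6f}")
```

The value passed to fit_one_cycle() (1e-2 in this notebook) plays the role of `lr_max` in this sketch.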


Let's start our fine-tuning process now. I'll use gradual unfreezing of the last layers before fine-tuning all layers.

In [80]:
# unfreeze the last layer
learn.freeze_to(-1)
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.660953,4.218702,0.22454,00:08


In [81]:
# unfreeze one more layer
learn.freeze_to(-2)
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.254465,3.933921,0.262137,00:08


In [82]:
# unfreeze one more layer
learn.freeze_to(-3)
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.981843,3.774588,0.281864,00:10


In [83]:
# unfreeze all layers
learn.unfreeze()
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.805899,3.721997,0.287291,00:11
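
The freeze_to() calls above can be read as slicing a list of layer groups: everything before index n is frozen and everything from n onwards is trainable. The sketch below mimics that rule in plain Python. The helper `trainable_mask` and the group names are hypothetical and only illustrate the indexing; the real learner holds PyTorch modules.

```python
def trainable_mask(n_groups, n):
    """One flag per layer group: True if the group is trainable after freeze_to(n)."""
    frozen = set(range(n_groups)[:n])  # groups before index n are frozen
    return [i not in frozen for i in range(n_groups)]

# Hypothetical layer groups, ordered from input to output
GROUPS = ["embedding", "lstm_1", "lstm_2", "head"]

# n = -1, -2, -3 unfreeze progressively more groups; n = 0 corresponds to unfreeze()
for n in (-1, -2, -3, 0):
    mask = trainable_mask(len(GROUPS), n)
    print(n, [group for group, trainable in zip(GROUPS, mask) if trainable])
```

Under this reading, a freshly created learner that is frozen up to the last group behaves like `freeze_to(-1)`, so the first two training cells update the same part of the model.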


We are done with the fine-tuning for now. We can now save the encoder of the language model for use in the classifier.

In [84]:
# Save the encoder of the fine-tuned language model
learn.save_encoder('tweet_lm')

### Classifier model
We have fine-tuned the language model. Now we can use its encoder to build our sentiment classifier. I am using an LSTM-based classifier. However, we could also have gone for the RNN classifier. In that case, we would need to train our language model differently.

In [85]:
# Initialize classifier model using the fine-tuned language model
# I am using the AWD_LSTM model with drop_mult=0.5 (dropout probabilities scaled by 0.5)
learn_c = text_classifier_learner(data_clas, arch=AWD_LSTM, drop_mult=0.5)
learn_c.load_encoder('tweet_lm')

Now we go through a similar process of re-training and fine-tuning for the classifier as we did for the language model.

In [86]:
learn_c.fit_one_cycle(1, 1e-2)


epoch,train_loss,valid_loss,accuracy,time
0,0.640346,0.539703,0.776639,00:16


In [87]:
learn_c.freeze_to(-1)
learn_c.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.649272,0.54579,0.771516,00:16


In [88]:
learn_c.unfreeze()
learn_c.fit_one_cycle(1, slice(2e-3/100, 2e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.583233,0.498302,0.803962,00:32
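
The `slice(low, high)` arguments used above give different learning rates to different layer groups. As far as I understand fastai v1, the first group gets `low`, the last group gets `high`, and the groups in between get geometrically spaced values. The sketch below reproduces that spacing in plain Python. The function name `spread_learning_rates` and the choice of five layer groups are assumptions for illustration.

```python
def spread_learning_rates(low, high, n_groups):
    """Geometrically spaced learning rates from `low` to `high`, one per layer group."""
    if n_groups == 1:
        return [high]
    ratio = (high / low) ** (1 / (n_groups - 1))  # constant multiplier between groups
    return [low * ratio ** i for i in range(n_groups)]

# Rates for slice(2e-3/100, 2e-3), assuming five layer groups
rates = spread_learning_rates(2e-3 / 100, 2e-3, 5)
print([f"{lr:.2e}" for lr in rates])
```

Earlier groups hold more general features and therefore receive smaller updates, while the classifier head at the end is trained with the full rate.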


Our classifier is now trained, and we can use it to predict classes. For a single tweet, we use the predict() method. For batch prediction, we use the get_preds() method.

In [110]:
# Example of prediction on a single tweet
learn_c.predict('your ticket prices are bad')

(Category negative, tensor(0), tensor([0.9728, 0.0239, 0.0033]))

In [105]:
# Example of batch prediction on the validation set
# It outputs class probabilities, which we would need to process 
# to get the final class value
learn_c.get_preds(ds_type=DatasetType.Valid)

[tensor([[0.9796, 0.0153, 0.0051],
         [0.8600, 0.1064, 0.0336],
         [0.7035, 0.2479, 0.0486],
         ...,
         [0.0692, 0.1977, 0.7331],
         [0.3106, 0.5750, 0.1144],
         [0.1814, 0.6975, 0.1211]]), tensor([0, 0, 1,  ..., 2, 1, 1])]
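
Turning these probabilities into class labels is a matter of taking the index of the largest value in each row. The sketch below does this in plain Python for the six rows visible in the output above, together with their targets. The class order in `CLASSES` is an assumption: index 0 is "negative" according to the predict() output, and the other two are assumed to follow alphabetically.

```python
CLASSES = ["negative", "neutral", "positive"]  # assumed class order

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def to_labels(probs, classes=CLASSES):
    """Map each row of class probabilities to its most likely class name."""
    return [classes[argmax(row)] for row in probs]

def accuracy(probs, targets):
    """Fraction of rows whose most likely class equals the target index."""
    hits = sum(argmax(row) == target for row, target in zip(probs, targets))
    return hits / len(targets)

# The six rows and targets shown in the truncated output above
probs = [
    [0.9796, 0.0153, 0.0051],
    [0.8600, 0.1064, 0.0336],
    [0.7035, 0.2479, 0.0486],
    [0.0692, 0.1977, 0.7331],
    [0.3106, 0.5750, 0.1144],
    [0.1814, 0.6975, 0.1211],
]
targets = [0, 0, 1, 2, 1, 1]

print(to_labels(probs))
print(accuracy(probs, targets))
```

On the actual tensors returned by get_preds(), the same step can be done with `probs.argmax(dim=1)`.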

## Remarks

The classifier achieved around 80% accuracy (0.804 on the validation set after the final epoch). This result can probably be improved in the following ways.
- We should test how different pre-processing techniques affect the accuracy.
- I have noticed that gradual unfreezing improves accuracy by a significant amount. This should be explored further.
- However, I have not systematically explored two other prominent features of ULMFiT: discriminative fine-tuning and slanted triangular learning rates. I believe the language and classifier models can be improved considerably by tuning these two.
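
For reference, the slanted triangular learning rate mentioned in the last point increases the rate linearly for a short fraction of training and then decreases it linearly for the rest. The sketch below follows the formula from the ULMFiT paper, with the paper's suggested values `cut_frac=0.1`, `ratio=32` and `lr_max=0.01` as defaults. The function name is my own, and this is a plain-Python illustration rather than something wired into the fastai training loop.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t of a slanted triangular schedule."""
    cut = math.floor(total_steps * cut_frac)  # iteration at which the rate peaks
    if t < cut:
        p = t / cut                                        # linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))     # linear decrease
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Rate at the start, at the peak, halfway through and at the end of 1000 iterations
for t in (0, 100, 500, 1000):
    print(t, f"{slanted_triangular_lr(t, 1000):.6f}")
```

The schedule starts and ends at `lr_max / ratio` and reaches `lr_max` after `cut_frac` of the iterations.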

### END