## ULMFiT Sentiment

### Problem: Apply a supervised or semi-supervised ULMFiT model to the Twitter US Airline Sentiment dataset
#### The objective is to fine-tune the pretrained model and use it for sentiment analysis

In [1]:
import pandas as pd
import fastai
import nltk
import sklearn

#### Reading the data into a pandas DataFrame

In [2]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('tweets.csv')

In [3]:
# Printing out all columns and renaming the sentiment column, which will be the label
df = df.rename(columns={"airline_sentiment": "sentiment"})
df.columns

Index(['tweet_id', 'sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [4]:
# keep only the label column and the tweet text
df = pd.DataFrame({'sentiment':df.sentiment, 'text':df.text})
df

,sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...
...,...,...
14635,positive,@AmericanAir thank you we got on a different f...
14636,negative,@AmericanAir leaving over 20 minutes Late Flig...
14637,neutral,@AmericanAir Please bring American Airlines to...
14638,negative,"@AmericanAir you have my money, you change my ..."


#### The dataset contains 9178 negative, 3099 neutral and 2363 positive tweets

In [5]:
df['sentiment'].value_counts()

negative    9178
neutral     3099
positive    2363
Name: sentiment, dtype: int64
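
The counts above show a clear class imbalance. As a rough reference point for the accuracies reported later, the sketch below computes each class share and the accuracy of a trivial classifier that always predicts the most frequent class. The variable names are illustrative and the counts are copied from the output above.

```python
# Class counts copied from the value_counts() output above.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:<9}{share:.3f}")
print(f"majority baseline ({majority_label}): {majority_baseline:.3f}")
```

Any accuracy figure for this dataset is easier to interpret next to this baseline of roughly 0.63.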

### Preprocessing

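#### Clean the text column by keeping only letters, then remove English stop words using the NLTK stop-word list

In [6]:
# replace every non-letter with a space (regex=True is needed in recent pandas, where the default is a literal match)
df['text'] = df['text'].str.replace("[^a-zA-Z]", " ", regex=True)
# the stop-word corpus must be available locally, e.g. via nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')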

In [7]:
# tokenization: split each tweet on whitespace
tokenized_doc = df['text'].apply(lambda x: x.split())

# remove stop words; the match is case-sensitive and NLTK's list is lowercase,
# so capitalized forms such as "I" or "Your" are kept
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization: join the remaining tokens back into one string per tweet
detokenized_doc = tokenized_doc.apply(' '.join)
# fastai's ULMFiT pipeline tokenizes the text itself, so the column is stored as plain strings
df['text'] = detokenized_doc
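
To make the effect of these steps concrete, here is a minimal standard-library sketch of the same cleaning applied to a single string. `clean_tweet` and the abbreviated `STOP_WORDS` set are hypothetical stand-ins for illustration; the notebook itself uses pandas and the full NLTK list.

```python
import re

# Hypothetical, abbreviated stop-word set; NLTK's English list is longer
# and, like this one, all lowercase.
STOP_WORDS = {"i", "t", "didn", "the", "a", "to", "and"}

def clean_tweet(text, stop_words=STOP_WORDS):
    """Keep letters only, split on whitespace, drop stop words, re-join."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.split()
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(kept)

print(clean_tweet("@VirginAmerica I didn't today"))
# prints: VirginAmerica I today
```

The capital "I" survives because the comparison is case-sensitive. Comparing `tok.lower()` against the set would remove it as well, at the cost of changing the text the model sees.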


#### The dataset needs to be divided into training and testing sets. I took 80% of the data for training and 20% for testing

In [8]:
from sklearn.model_selection import train_test_split

# stratified split: 80% for training, 20% held out (this held-out part serves as the validation set below)
df_trn, df_test = train_test_split(df, stratify = df['sentiment'], test_size = 0.2)
df_trn.shape, df_test.shape

((11712, 2), (2928, 2))
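
The `stratify` argument keeps the class proportions roughly equal in both parts. The sketch below illustrates the idea on toy labels with a simplified per-class split. It is not scikit-learn's implementation, and `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly test_size of each class held out."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with an imbalance similar in spirit to the tweets (hypothetical data).
labels = ["negative"] * 10 + ["neutral"] * 5 + ["positive"] * 5
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Without stratification, a random 20% sample of an imbalanced dataset can under-represent the smaller classes, which makes validation accuracy harder to compare across runs.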

In [9]:
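# language model data (fastai v1 API)
from fastai.text import *
from fastai import *
# the sentiment labels are ignored here: the language model only learns to predict the next token
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_test, path = "")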

In [10]:
# checking language model
data_lm

TextLMDataBunch;

Train: LabelList (11712 items)
x: LMTextList
xxbos americanair delay xxmaj your counter take valid xxup xxunk card valid xxup id needed xxup tsa precheck pass,xxbos united i flying st class one leg xxmaj chicago long flight xxmaj china i still able use lounge xxmaj chicago,xxbos united staff rather efficient got us solutions xxunk little xxunk air,xxbos united xxmaj hmmm seems like could something changed xxunk,xxbos usairways well going help i xxunk hold hrs amp client wants ff nbr record elite xxunk
y: LMLabelList
,,,,
Path: .;

Valid: LabelList (2928 items)
x: LMTextList
xxbos i left xxunk plane way get back united,xxbos united traveling xxunk xxunk xxmaj gate agent xxmaj chicago awesome helping xxup ty xxunk,xxbos americanair xxmaj only xxunk aa took hrs call back xxmaj ended paying extra xxup us statement true,xxbos virginamerica xxmaj you flights flying xxmaj boston tomorrow i need home xxmaj cancelled xxmaj flightled flight anything,xxbos united xxmaj surprised

In [11]:
# classification model data; the language model's vocabulary is reused so token ids match the fine-tuned encoder
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn, valid_df = df_test, vocab=data_lm.train_ds.vocab, bs=32)

In [12]:
#checking classification model
data_clas

TextClasDataBunch;

Train: LabelList (11712 items)
x: TextList
xxbos americanair delay xxmaj your counter take valid xxup xxunk card valid xxup id needed xxup tsa precheck pass,xxbos united i flying st class one leg xxmaj chicago long flight xxmaj china i still able use lounge xxmaj chicago,xxbos united staff rather efficient got us solutions xxunk little xxunk air,xxbos united xxmaj hmmm seems like could something changed xxunk,xxbos usairways well going help i xxunk hold hrs amp client wants ff nbr record elite xxunk
y: CategoryList
negative,neutral,positive,positive,negative
Path: .;

Valid: LabelList (2928 items)
x: TextList
xxbos i left xxunk plane way get back united,xxbos united traveling xxunk xxunk xxmaj gate agent xxmaj chicago awesome helping xxup ty xxunk,xxbos americanair xxmaj only xxunk aa took hrs call back xxmaj ended paying extra xxup us statement true,xxbos virginamerica xxmaj you flights flying xxmaj boston tomorrow i need home xxmaj cancelled xxmaj flightled flight

##### data_lm and data_clas did all the necessary preprocessing (tokenization and numericalization) automatically
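
The `xx...` tokens in the outputs above are special markers added during tokenization: `xxbos` starts each text, `xxmaj` precedes a word that was capitalized, `xxup` precedes a word that was all uppercase, and `xxunk` stands in for words outside the vocabulary. The sketch below is a simplified re-implementation of the two capitalization rules, inferred from the output above rather than taken from the library's source; `mark_caps` is a hypothetical name.

```python
BOS, MAJ, UP = "xxbos", "xxmaj", "xxup"  # marker strings as they appear in the output above

def mark_caps(tokens):
    """Lowercase tokens, inserting a marker before capitalized or all-caps words."""
    out = [BOS]
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out.append(UP)   # e.g. "TSA" becomes "xxup tsa"
        elif tok[0].isupper() and len(tok) > 1 and tok[1:].islower():
            out.append(MAJ)  # e.g. "Your" becomes "xxmaj your"
        out.append(tok.lower())
    return out

print(" ".join(mark_caps("AmericanAir delay Your TSA I".split())))
# prints: xxbos americanair delay xxmaj your xxup tsa i
```

This matches the first training example shown above: mixed-case "AmericanAir" and single-letter "I" are lowercased without a marker, while "Your" and "TSA" receive one.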

## Fine-tuning a language model
#### The pretrained language model from fastai was trained on the WikiText-103 dataset

In [13]:
# AWD_LSTM is ASGD Weight-Dropped LSTM
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)


In [14]:
# train the learner for one epoch with a maximum learning rate of 1e-2 (one-cycle policy)
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,6.365916,5.326722,0.164063,00:07
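
`fit_one_cycle` varies the learning rate during training: it warms up to the maximum value passed in (1e-2 here) and then anneals to a much smaller value. The sketch below shows the general shape of such a schedule. The constants `div_factor=25`, `pct_start=0.3` and the final divisor, as well as the cosine interpolation, are assumptions about the defaults and not values confirmed by this notebook; the momentum schedule that usually accompanies it is omitted.

```python
import math

def cosine_anneal(start, end, pct):
    """Move from start to end along half a cosine wave as pct goes from 0 to 1."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr(step, total_steps, lr_max=1e-2, div_factor=25.0,
                 pct_start=0.3, final_div=25.0 * 1e4):
    """Learning rate at a given step of one cycle (all constants are assumed)."""
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        # Warm-up phase: rise from lr_max / div_factor to lr_max.
        return cosine_anneal(lr_max / div_factor, lr_max, step / warmup_steps)
    # Annealing phase: fall from lr_max to lr_max / final_div.
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return cosine_anneal(lr_max, lr_max / final_div, pct)

for step in (0, 15, 30, 65, 99):
    print(step, f"{one_cycle_lr(step, total_steps=100):.2e}")
```

With 100 steps, the rate starts at `lr_max / 25`, reaches `lr_max` at step 30 and ends several orders of magnitude lower.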


In [15]:
# unfreeze all layers and fine-tune the whole language model with a lower learning rate
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,5.121527,4.889538,0.19846,00:07
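
Language-model losses are often easier to read as perplexity. Assuming the reported loss is the mean cross-entropy per token in nats, perplexity is simply the exponential of the loss. The sketch below converts the two validation losses logged above.

```python
import math

# Validation losses copied from the two language-model runs above.
valid_losses = {"frozen, lr=1e-2": 5.326722, "unfrozen, lr=1e-3": 4.889538}

# Assumption: the loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexities = {run: math.exp(loss) for run, loss in valid_losses.items()}

for run, ppl in perplexities.items():
    print(f"{run}: perplexity ~ {ppl:.1f}")
```

A lower perplexity after unfreezing indicates that the language model is adapting to the vocabulary and style of the tweets.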


In [16]:
#save this encoder to use it for classification later

learn.save_encoder('ft_enc')

## Building Classifier

In [17]:
# creating classifier
learn = text_classifier_learner(data_clas, drop_mult=0.5, arch=AWD_LSTM)
# loading previously saved encoder
learn.load_encoder('ft_enc')
# printing a batch from classification model
data_clas.show_batch()

text,target
xxbos usairways xxup fuk u xxup us xxup airways xxup with xxup yo xxup shitty xxup chicken xxup xxunk xxup sandwich xxup that xxup so xxup overpriced xxup and u xxup xxunk xxup make xxup me xxup wait xxup in a xxup hr xxup layover xxup fuk u xxup and,negative
xxbos united i xxup just xxup asked xxup my xxup boyfriend xxup to xxup prom xxup over xxup the xxup xxunk xxup on xxup flight xxup he xxup said xxup yes xxup best xxup day xxup ever xxup thank u xxup so xxup much,positive
xxbos united xxmaj hi question future xxmaj flight xxmaj booking xxmaj problems xxup dub xxup jac xxup jac xxup lax xxup lax xxup dub i g xxmaj what checked bag allowance xxup jac xxup lax,neutral
xxbos usairways xxup is xxup this xxup xxunk xxup brothers xxup xxunk xxup and xxup xxunk xxup should i xxup keep xxup my xxup eyes xxup xxunk xxup for xxup the xxup clown xxup car,negative
xxbos usairways xxmaj customer service dead xxmaj last wk flts delayed xxmaj cancelled xxmaj flighted xxmaj bags lost days xxmaj last nt flt delayed xxmaj cancelled xxmaj flighted xxmaj no meal voucher,negative


In [18]:
# first pass: one epoch on the freshly created classifier with a maximum learning rate of 1e-2
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.736122,0.681485,0.711066,01:20


In [19]:
# make the last two layer groups trainable and use a range of learning rates across layer groups
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.640473,0.636522,0.731216,01:21
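
Passing a `slice` as the learning rate assigns smaller rates to earlier layer groups and larger rates to later ones (discriminative learning rates). The sketch below spreads the rates geometrically between the two ends of the slice. `spread_lrs` is a hypothetical helper, and the number of layer groups (5) is an assumption, not something this notebook prints.

```python
def spread_lrs(lr_low, lr_high, n_groups):
    """Geometrically spaced rates from lr_low (earliest group) to lr_high (last group)."""
    if n_groups == 1:
        return [lr_high]
    ratio = (lr_high / lr_low) ** (1 / (n_groups - 1))
    return [lr_low * ratio ** i for i in range(n_groups)]

# n_groups=5 is an assumption about how many layer groups the classifier has.
for lr in spread_lrs(5e-3 / 2.0, 5e-3, n_groups=5):
    print(f"{lr:.5f}")
```

Layer groups that are still frozen are not updated, whatever rate they are assigned. The idea behind the spread is that early layers hold more general features and should change less than the task-specific head.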


In [23]:
# Train further with all layers unfrozen. The run was set to 50 epochs but I interrupted it after 10 because of limited resources.
learn.unfreeze()
learn.fit(50, slice(2e-3/100, 2e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.623444,0.585113,0.758197,01:23
1,0.544833,0.559686,0.77015,01:25
2,0.498273,0.572858,0.762637,01:22
3,0.484877,0.565175,0.768101,01:22
4,0.442247,0.553177,0.778347,01:23
5,0.425883,0.556719,0.78347,01:23
6,0.395695,0.582328,0.767418,01:23
7,0.335035,0.614729,0.76571,01:22
8,0.313135,0.616616,0.771516,01:24
9,0.2883,0.611852,0.770833,01:22


KeyboardInterrupt: 
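
Because the run was interrupted, the model in memory is simply the one from the last completed epoch, which is not necessarily the best one. The sketch below scans the logged values (copied from the table above) for the epoch with the lowest validation loss and the one with the highest accuracy.

```python
# (epoch, train_loss, valid_loss, accuracy) rows copied from the log above.
history = [
    (0, 0.623444, 0.585113, 0.758197),
    (1, 0.544833, 0.559686, 0.770150),
    (2, 0.498273, 0.572858, 0.762637),
    (3, 0.484877, 0.565175, 0.768101),
    (4, 0.442247, 0.553177, 0.778347),
    (5, 0.425883, 0.556719, 0.783470),
    (6, 0.395695, 0.582328, 0.767418),
    (7, 0.335035, 0.614729, 0.765710),
    (8, 0.313135, 0.616616, 0.771516),
    (9, 0.288300, 0.611852, 0.770833),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_acc = max(history, key=lambda row: row[3])

print("lowest validation loss at epoch", best_by_loss[0], "with loss", best_by_loss[2])
print("highest accuracy at epoch", best_by_acc[0], "with accuracy", best_by_acc[3])

# Gap between validation and training loss; a growing gap is a common sign of overfitting.
for epoch, train_loss, valid_loss, _ in history:
    print(epoch, round(valid_loss - train_loss, 3))
```

Training loss keeps falling through epoch 9 while validation loss is lowest at epoch 4, so stopping early or saving a checkpoint at the best epoch would likely have been preferable to training for all 50 epochs.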

In [20]:
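# get predictions on the validation set (this cell ran before the extended training in In [23], so they come from the model after In [19])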
preds, targets = learn.get_preds()

In [21]:
import numpy as np

##### Confusion matrix of model predictions (rows) against true labels (columns); 0 = negative, 1 = neutral, 2 = positive sentiment

In [22]:
predictions = np.argmax(preds, axis = 1)
pd.crosstab(predictions, targets)

predicted \ actual,0,1,2
0,1712,413,154
1,68,143,32
2,56,64,286
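
Overall accuracy hides how unevenly the classes are handled. The sketch below derives accuracy and per-class precision and recall from the table above, reading rows as predictions and columns as true labels. The variable names are illustrative.

```python
labels = ["negative", "neutral", "positive"]

# confusion[p][t] = number of tweets predicted as class p whose true class is t
confusion = [
    [1712, 413, 154],
    [68, 143, 32],
    [56, 64, 286],
]

n = len(labels)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(n)) / total

precision, recall = {}, {}
for i, label in enumerate(labels):
    predicted_i = sum(confusion[i])                    # row sum: everything predicted as class i
    actual_i = sum(confusion[p][i] for p in range(n))  # column sum: everything truly class i
    precision[label] = confusion[i][i] / predicted_i
    recall[label] = confusion[i][i] / actual_i

print(f"accuracy: {accuracy:.4f}")
for label in labels:
    print(f"{label:<9} precision={precision[label]:.3f} recall={recall[label]:.3f}")
```

The accuracy computed this way is about 0.731, which matches the value logged after In [19]. Neutral tweets are the weak spot: 413 of the 620 neutral tweets are predicted as negative, so recall for that class is far lower than for the other two.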


### Summary
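After one epoch of classifier training, ULMFiT reached a validation accuracy of approximately 0.71, rising to 0.73 once the last two layer groups were unfrozen. With all layers unfrozen, accuracy peaked at 0.78 in epoch 5 and stood at 0.77 when training was interrupted after 10 epochs. Validation loss was lowest in epoch 4 and rose afterwards while training loss kept falling, which suggests the model had started to overfit. The confusion matrix above was computed before the extended training and therefore reflects the 0.73 model.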