The notebook is divided into the following steps:

* Import libraries
* Data load and pre-processing
* Text pre-processing
* Universal Language Model Fine-tuning (ULMFiT) application
  * General-domain LM pretraining
  * Target task LM fine-tuning
  * Target task classifier fine-tuning
* Analysis of results
* Conclusions

## Import libraries

In [81]:
import os
import re
import pandas as pd
import collections

In [5]:
from fastai.text import *

## Data load and pre-processing

In [41]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [56]:
# Load dataset with tweets into a dataframe
#path = "data/Tweets.csv"
#path = "https://raw.githubusercontent.com/kwulffert/ULMFiT_Sentiment_Analysis/master/data/Tweets.csv?token=AI4NJNANKTNV52E4BUAC6JK7BQ7IW"
# Folder used later as `path` when the language-model data is saved and loaded
folder = os.path.join("/content/gdrive/My Drive/Colab Notebooks", "model")
path = os.path.join("/content/gdrive/My Drive/Colab Notebooks/data", "Tweets.csv")
df = pd.read_csv(path)

In [57]:
# Look at the first 5 rows of the dataset
df.head()

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials to the experience... tacky. | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I need to take another trip! | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing about it | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


In [58]:
# Size of the dataset
df.shape

(14640, 15)

In [59]:
# List the column names and the data type of each column
df.dtypes

tweet_id                          int64
airline_sentiment                object
airline_sentiment_confidence    float64
negativereason                   object
negativereason_confidence       float64
airline                          object
airline_sentiment_gold           object
name                             object
negativereason_gold              object
retweet_count                     int64
text                             object
tweet_coord                      object
tweet_created                    object
tweet_location                   object
user_timezone                    object
dtype: object

In [60]:
# For sentiment analysis we only need the text of the tweet. Let's look at one example
pd.options.display.max_colwidth = 100
df["text"][df["tweet_id"] == 569987622484848640]

62    @VirginAmerica @ladygaga @carrieunderwood all are great , but I have to go with #CarrieUnderwood 😍👌
Name: text, dtype: object
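
The example tweet mixes mentions, a hashtag and emojis. As a minimal sketch of how such elements could be inspected with the `re` module imported earlier, the snippet below extracts mentions and hashtags. The helper name `tweet_features` and both patterns are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Hypothetical helper: these patterns are simple approximations,
# not the rules Twitter itself uses to parse mentions and hashtags.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def tweet_features(text):
    """Return the mentions and hashtags found in a tweet."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }

example = ("@VirginAmerica @ladygaga @carrieunderwood all are great , "
           "but I have to go with #CarrieUnderwood 😍👌")
features = tweet_features(example)
print(features["mentions"])
print(features["hashtags"])
```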

In [70]:
# Define a reduced dataframe with only the tweet text and the airline_sentiment column, renamed to "label"
df_sent = df.copy()
df_sent = df_sent[["airline_sentiment","text"]]
df_sent = df_sent.rename(columns={"airline_sentiment":"label"})
df_sent.head(2)
                      

| | label | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials to the experience... tacky. |


In [71]:
df_sent["label"].value_counts()

negative    9178
neutral     3099
positive    2363
Name: label, dtype: int64
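
The classes are clearly imbalanced, which gives a useful reference point: a classifier that always predicts the most frequent label. A small sketch using only the counts printed above (the variable names are illustrative):

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
label_counts = Counter({"negative": 9178, "neutral": 3099, "positive": 2363})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of always predicting the majority class.
baseline_accuracy = majority_count / total

print(majority_label, round(baseline_accuracy, 4))
```

Any classifier trained later should beat this baseline to be considered useful.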

In [72]:
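# Random (non-stratified) 90/10 split with a fixed seed; the test set is whatever is left after sampling
da_train = df_sent.sample(frac=0.9, random_state=23)
da_test = df_sent.drop(da_train.index)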

In [75]:
# Class share (in %) in the full dataset and in the train and test splits
da_summary = pd.DataFrame({"original[%]": (100*df_sent["label"].value_counts()/len(df_sent)).round(2)})
da_summary["train[%]"] = (100*da_train["label"].value_counts()/len(da_train)).round(2)
da_summary["test[%]"] = (100*da_test["label"].value_counts()/len(da_test)).round(2)
da_summary

| | original[%] | train[%] | test[%] |
|---|---|---|---|
| negative | 62.69 | 62.88 | 61.0 |
| neutral | 21.17 | 21.01 | 22.61 |
| positive | 16.14 | 16.11 | 16.39 |
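
The random split leaves the test shares slightly off the original ones (for example, neutral is 22.61% in the test set versus 21.17% overall). If a closer match is wanted, a stratified split is one option. The sketch below is plain Python and is not what the notebook does; `stratified_split` is a hypothetical helper, and the toy labels only reproduce the class counts shown earlier.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.1, seed=23):
    """Return (train_idx, test_idx) with roughly equal class shares in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_frac)  # per-class test size
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the dataset above.
labels = ["negative"] * 9178 + ["neutral"] * 3099 + ["positive"] * 2363
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```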


## Universal Language Model Fine-tuning (ULMFiT) application

In [76]:
# Each tweet is a single line of text. Some tweets contain emojis, tags and possibly spelling mistakes.
# They contain informal phrasing too.
# So the first fine-tuning will be done on a social media corpus rather than on the wikitext one.

In [77]:
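# Note: the training set is also passed as the validation set, so the validation loss is not a held-out estimate
data_lm = TextLMDataBunch.from_df(train_df=da_train, valid_df=da_train, path=folder)
data_lm.save('data_lm.pkl')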

In [78]:
bs = 48
data_lm = load_data(folder, 'data_lm.pkl', bs=bs)
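
The batch size `bs` has a specific meaning for language-model data. To the best of my understanding of fastai v1, the texts are joined into one long token stream, the stream is cut into `bs` parallel rows, and the model is trained to predict each next token. The sketch below illustrates that general idea in plain Python. `batchify`, `lm_windows` and the `bptt` window length are illustrative names and values, and the code is a simplification rather than fastai's implementation.

```python
def batchify(token_ids, bs):
    """Split one long token stream into `bs` equal-length rows."""
    n_per_row = len(token_ids) // bs            # tokens that do not fit are dropped
    trimmed = token_ids[: n_per_row * bs]
    return [trimmed[i * n_per_row:(i + 1) * n_per_row] for i in range(bs)]

def lm_windows(rows, bptt):
    """Yield (input, target) windows; the target is the input shifted by one token."""
    row_len = len(rows[0])
    for start in range(0, row_len - 1, bptt):
        end = min(start + bptt, row_len - 1)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = list(range(20))         # stand-in for concatenated tweet token ids
rows = batchify(stream, bs=4)    # 4 rows of 5 tokens each
batches = list(lm_windows(rows, bptt=2))
first_x, first_y = batches[0]
print(first_x)
print(first_y)
```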

In [79]:
data_lm.show_batch()

| idx | text |
|---|---|
| 0 | cheese xxunk and xxunk of entertainment options . xxmaj time just flew by . xxbos @southwestair just announced non - stop flights to xxmaj dallas from xxmaj columbus . xxmaj well next time xxunk best you 'll have less time airport xxunk 😂 😂 xxbos @united xxup thank u ! xxmaj secured room for the night xxmaj thx to xxup very helpful customer service rep xxup n. xxmaj xxunk .. |
| 1 | numbers auto rebooked flights to non connecting cities xxbos @united do n't see a xxunk cost to get on an early flight with seats . xxmaj no airline charges to conveniently get their passengers in early xxbos @southwestair xxmaj are there discounts every tuesday cause i m leaving fron xxmaj birmingham xxmaj airport to xxmaj san fran xxmaj next week in march sometime xxbos @jetblue i cheated on you , |
| 2 | houston until tomorrow morning . pretty sure overflight xxmaj booking xxmaj problems and maintenance are n't our fault . xxbos @usairways i was completely ripped of by xxup us xxmaj airways today never fly this airline i am contacting my local news xxbos @southwestair it 's not disappointment , it 's a blatant disregard for your business select customers , it 's becoming a problem that 's pushing xxbos @jetblue |
| 3 | rebook , but wondering if there will be other issues getting out . xxbos @united xxmaj so excited i was put on an earlier flight to get home ! xxmaj woo xxmaj hoo ! # travel 🎉 🎉 🎉 xxbos @southwestair xxmaj twitter says i ca n't xxup dm someone unless they follow me . xxmaj can @southwestair follows my twitter ? thanks you . xxbos @americanair xxmaj somehow between |
| 4 | the accounts is denied xxbos @southwestair xxmaj tough i can take . xxmaj zero meaningful assistance while stranded for 2 days is another matter . xxmaj looking for signs you care abt cust . xxbos @jetblue xxmaj flight i want to book was $ 320 one day ; went to purchase next day & & price xxunk to $ 737 . xxmaj xxunk on chance it may go down ? |
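
The batch shows fastai's special tokens: `xxbos` marks the beginning of a text, `xxunk` replaces a token that is not in the vocabulary, `xxmaj` signals that the next word was capitalised, and `xxup` that it was written in all caps. The sketch below approximates the two capitalisation rules to show why the text appears lowercased. `mark_caps` is a hypothetical helper, not fastai's own code.

```python
# Simplified approximation of two capitalisation rules; not fastai's own code.
TK_UP, TK_MAJ = "xxup", "xxmaj"

def mark_caps(tokens):
    """Lowercase tokens, prefixing a marker for ALL-CAPS or Capitalised words."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out.append(TK_UP)        # e.g. "THANK" -> xxup thank
        elif tok[:1].isupper() and tok[1:].islower():
            out.append(TK_MAJ)       # e.g. "Secured" -> xxmaj secured
        out.append(tok.lower())
    return out

marked = mark_caps(["THANK", "u", "!", "Secured", "room"])
print(" ".join(marked))
```

Keeping the markers lets the model share one embedding per word while still seeing the capitalisation signal, which can carry sentiment in tweets.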


In [84]:
# Number of tokens in the data_lm vocabulary
len(data_lm.vocab.itos)

6312
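
The vocabulary is smaller than the number of distinct strings in the tweets because rare tokens are dropped and later mapped to `xxunk`. A minimal sketch of that idea with `collections.Counter` follows. The `min_freq=2` and `max_vocab=60000` defaults are what I recall fastai v1 using and should be treated as assumptions. `build_vocab` and `numericalize` are illustrative helpers, not the library's functions.

```python
from collections import Counter

UNK = "xxunk"

def build_vocab(tokenized_texts, min_freq=2, max_vocab=60000):
    """Return (itos, stoi), keeping only tokens seen at least `min_freq` times."""
    freq = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK] + [tok for tok, n in freq.most_common(max_vocab)
                    if n >= min_freq and tok != UNK]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids; anything outside the vocabulary becomes UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

texts = [["flight", "was", "late"], ["flight", "was", "great"]]
itos, stoi = build_vocab(texts)
ids = numericalize(["flight", "was", "cancelled"], stoi)
print(itos, ids)
```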

### General-domain LM pretraining

### Target task LM fine-tuning

### Target task classifier fine-tuning

## Analysis of results

## Conclusions