<a href="https://colab.research.google.com/github/imtiazBDSgit/Deep-Learning-Projects/blob/master/Sentiment_Analysis_on_Twitter_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis on Twitter Reviews
In this notebook, we show how transfer learning can be applied to classify the sentiment of Twitter reviews as negative, neutral, or positive.

This notebook builds on the work of [Howard and Ruder, ULMFiT](https://arxiv.org/pdf/1801.06146.pdf).
The idea of the paper (and its implementation, explained in the [fast.ai deep learning course](http://course.fast.ai/lessons/lesson10.html)) is to start from a language model trained on a very large dataset, e.g. a Wikipedia dump. The intuition is that if a model can predict the next word at each position in a text, it has learnt something about the structure of the language we are using.
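
To make the next-word intuition concrete, here is a toy sketch. It is a simple counting (bigram) model, not the AWD-LSTM used later in this notebook, and the corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical mini-corpus, for illustration only
corpus = [
    "the flight was late",
    "the flight was cancelled",
    "the flight was late again",
]
model = train_bigram(corpus)
print(predict_next(model, "was"))  # late
```

Even this tiny model has picked up a regularity of its corpus. A neural language model trained on a large corpus captures far richer structure, which is what the fine-tuning steps below reuse.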

# Content of this notebook

The notebook is organized as follows:

- Tokenize the reviews and create dictionaries
- Download a pre-trained model and link the dictionary to the embedding layer of the model
- Fine-tune the language model on the Twitter review texts

This gives us the backbone of our algorithm: a pre-trained language model fine-tuned on Twitter reviews.

- Add a classifier to the language model and train the classifier layer only
- Gradually unfreeze successive layers to train different layers on the Twitter reviews
- Run a full classification task for several epochs
- Use the model for inference!
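
The freezing and unfreezing steps in the list above can be pictured with a small sketch. It assumes, as in fastai v1, that the model is split into ordered layer groups and that `freeze_to(n)` leaves only the groups from index `n` onward trainable. The group names below are hypothetical.

```python
# Hypothetical layer groups, ordered from the input side to the output side
layer_groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]

def trainable_after_freeze_to(groups, n):
    """Return the groups left trainable when everything before index n is frozen."""
    return groups[n:]

# Stages used in this notebook: head only, last two groups, then everything
for n in (-1, -2, 0):
    print(n, trainable_after_freeze_to(layer_groups, n))
```

Training the head first and only later opening up the earlier groups is meant to protect the pre-trained weights from large, noisy updates early on.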

We end this notebook by looking at the specific effect of training size on the overall performance. This is to test the hypothesis that the ULMFiT model does not need much labeled data to perform well.
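
One way to set up such a training-size experiment is to draw subsets of the training rows that keep the class proportions intact. The helper below is a minimal sketch using only the standard library. The function name, the fractions, and the toy labels are assumptions made for illustration, not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction, seed=42):
    """Pick row indices so that each class keeps roughly `fraction` of its rows."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for indices in by_label.values():
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy labels, for illustration only
labels = ["negative"] * 60 + ["neutral"] * 20 + ["positive"] * 20
subset = stratified_subset(labels, 0.5)
print(len(subset))
```

The classifier could then be retrained on each subset and scored on the same validation set, so that only the amount of labeled data changes between runs.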

# Data

Before starting, upload the data to a folder of your choice and use that folder as the path in this notebook.

We also recommend working in a dedicated environment (e.g. `mkvirtualenv fastai`). Then clone the fastai GitHub repo https://github.com/fastai/fastai and install its requirements. The code below uses the fastai v1 API (`fastai.text`, `TextLMDataBunch`), so a 1.x release is needed.

In [0]:
from fastai.text import *
import html
import os
import pandas as pd
import pickle
import re
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, \
confusion_matrix
from sklearn.model_selection import train_test_split
from time import time

In [0]:
# Folder that holds Data.csv; the same PATH is reused below by the fastai data bunches
PATH = '/content/sample_data/'
data=pd.read_csv(PATH+'Data.csv')

In [0]:
data[['text','sentiment']].head()

Unnamed: 0,text,sentiment
0,What @dhepburn said.,neutral
1,plus you've added commercials to the experienc...,positive
2,I didn't today... Must mean I need to take ano...,neutral
3,"it's really aggressive to blast obnoxious ""ent...",negative
4,and it's a really big bad thing about it,negative


In [0]:
train, test, train_label, test_label = train_test_split( data['text'], data['sentiment'],
                                                        test_size=0.20, random_state=42,stratify=data['sentiment'])

In [0]:
print("train shape",train.shape)
print("validation shape",test.shape)

train shape (11712,)
validation shape (2928,)


In [0]:
print("train sentiment distribution")
train_label.value_counts(normalize=True)

train sentiment distribution


negative    0.626878
neutral     0.211663
positive    0.161458
Name: sentiment, dtype: float64

In [0]:
print("valid sentiment distribution")
test_label.value_counts(normalize=True)

valid sentiment distribution


negative    0.627049
neutral     0.211749
positive    0.161202
Name: sentiment, dtype: float64

In [0]:
valid_index=test.index
valid_index

Int64Index([ 4839,  7719, 13337,  3764,  5657,  2973,   336, 10504,  2923,
            11763,
            ...
             8627, 10487, 12100,   712,  9732, 10313,  2099, 10317,  8658,
            14210],
           dtype='int64', length=2928)

In [0]:
# Flag the rows that belong to the validation split
data['is_valid'] = data.index.isin(valid_index)

In [0]:
data_prep=data[['sentiment','text','is_valid']]
data_prep.columns=['label','text','is_valid']
data_prep.head()

Unnamed: 0,label,text,is_valid
0,neutral,What @dhepburn said.,False
1,positive,plus you've added commercials to the experienc...,False
2,neutral,I didn't today... Must mean I need to take ano...,False
3,negative,"it's really aggressive to blast obnoxious ""ent...",False
4,negative,and it's a really big bad thing about it,True


In [0]:
# Split the prepared frame into training and validation rows
train_df = data_prep[~data_prep.is_valid]
valid_df = data_prep[data_prep.is_valid]
# Language model data
data_lm = TextLMDataBunch.from_df(PATH, train_df, valid_df)
# Classifier model data (reuses the language model vocabulary)
data_clas = TextClasDataBunch.from_df(PATH, train_df, valid_df, vocab=data_lm.train_ds.vocab, bs=32, text_cols=['text'])

In [0]:
data_lm.save('data_lm_export.pkl')
data_clas.save('data_clas_export.pkl')

In [0]:
data_lm = load_data(PATH,'data_lm_export.pkl')
data_clas = load_data(PATH,'data_clas_export.pkl', bs=32)

In [0]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.769918,4.180489,0.243359,12:20


In [0]:
# Run one epoch of fine-tuning 
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,4.077415,3.933756,0.278181,13:30


In [0]:
learn.predict("and it's a really big", n_words=10)

"and it's a really big deal ! ! No help i m supposed to"

In [0]:
learn.save_encoder('ft_enc')


# Going back to classification!

Now that we have spent some time fine-tuning the language model on our Twitter data, let's see how well we can classify these reviews.
As before, some cells only need to be run once; afterwards the saved data loaders can be reloaded instead.

In [0]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')

RNNLearner(data=TextClasDataBunch;

Train: LabelList (11712 items)
x: TextList
xxbos xxmaj what xxunk said .,xxbos plus you 've added commercials to the experience ... xxunk .,xxbos i did n't today ... xxmaj must mean i need to take another trip !,xxbos it 's really aggressive to blast obnoxious " entertainment " in your guests ' faces & & they have little recourse,xxbos yes , nearly every time i fly xxup vx this _ _ ar xxunk _ won _ _ go away :)
y: CategoryList
neutral,positive,neutral,negative,positive
Path: /content/sample_data;

Valid: LabelList (2928 items)
x: TextList
xxbos and it 's a really big bad thing about it,xxbos seriously would pay $ 30 a flight for seats that did n't have this playing . 
  it 's really the only bad thing about flying xxup va,xxbos xxmaj really missed a prime opportunity for xxmaj xxunk xxmaj without xxmaj xxunk xxunk , there . https : / / t.co / xxunk,xxbos xxmaj well , i xxunk xxup now i xxup do ! xxup xxunk,xxbos xxunk i 'm flying your # fabulous # xx
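
The `xx`-prefixed tokens in the output above are special markers added by the fastai tokenizer: `xxbos` marks the beginning of a text, `xxmaj` and `xxup` flag that the next word was capitalized or written in all caps, and `xxunk` replaces words missing from the vocabulary. The sketch below imitates these rules in a simplified form. Whitespace splitting, the `min_freq` threshold, and all function names are assumptions for illustration, not fastai's actual implementation.

```python
from collections import Counter

BOS, UNK, MAJ, UP = "xxbos", "xxunk", "xxmaj", "xxup"

def mark_case(tokens):
    """Lowercase tokens, keeping the casing information as marker tokens."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out += [UP, tok.lower()]
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out += [MAJ, tok.lower()]
        else:
            out.append(tok.lower())
    return out

def tokenize(text):
    """Simplified tokenizer: whitespace split plus case markers."""
    return [BOS] + mark_case(text.split())

def build_vocab(token_lists, min_freq=2):
    """Map every sufficiently frequent token to an integer id; id 0 is xxunk."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [UNK] + sorted(t for t, c in counts.items() if c >= min_freq and t != UNK)
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    """Replace tokens by ids, falling back to the xxunk id for unseen tokens."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

# Hypothetical texts, for illustration only
texts = ["Flight to LAX", "Flight to Boston"]
token_lists = [tokenize(t) for t in texts]
stoi = build_vocab(token_lists)
print(token_lists[0])
print(numericalize(token_lists[0], stoi))
```

The token-to-id mapping plays the role of the "dictionary" mentioned in the outline, and it is what links the tweets to the embedding layer of the pre-trained model.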

In [0]:
data_clas.show_batch()

text,target
xxbos xxmaj hi have a question re future xxmaj flight xxmaj booking xxmaj problems . xxup dub - xxup jac 29 / 9 xxup jac - xxup lax 8 / 10 xxup lax - xxup dub 13 / 10 . i 'm * xxup g. xxmaj what is checked bag allowance for xxup jac - xxup lax ?,neutral
xxbos i keep calling your help line ( 800 ) 428 - 4322 and it keeps saying they are too busy to help & & to call back xxmaj late xxmaj flightr . xxup for a xxup week xxup now !,negative
xxbos xxmaj flight 830 xxup clt to xxmaj phl . i was 1st on list . xxmaj someone else got spot . xxmaj rude employee in coach . xxmaj xxunk give xxup id . xxmaj said he was cute red head,negative
"xxbos not happy w / app xxmaj late xxmaj flightly . xxmaj last time i flew would n't let me check in , xxmaj this time i checked in went on xxmaj late xxmaj flightr says i never checked in",negative
"xxbos xxmaj it 's for me , i spoke with a rep on the phone who suggested i "" xxmaj voice a concern "" via "" xxmaj email us "" on your site . i did a few moments ago",neutral


In [0]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.685052,0.562647,0.772541,14:24


In [0]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.62317,0.53107,0.780738,14:33


In [0]:
learn.unfreeze()
learn.fit_one_cycle(1, slice(2e-3/100, 2e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.581099,0.491195,0.803962,16:40
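
The `slice(low, high)` arguments passed to `fit_one_cycle` above request discriminative learning rates: the earliest layer group trains with the lowest rate and the last group with the highest. To our understanding, fastai v1 spaces the rates in between geometrically. The sketch below reproduces that spacing under this assumption; the helper name and the number of layer groups are hypothetical.

```python
def geometric_lrs(start, stop, n):
    """Return n values from start to stop with a constant ratio between neighbours."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

n_groups = 5  # assumed number of layer groups, for illustration only
lrs = geometric_lrs(2e-3 / 100, 2e-3, n_groups)
for lr in lrs:
    print(f"{lr:.2e}")
```

Lower rates for the early groups keep the general language knowledge mostly intact, while the later, more task-specific groups adapt faster.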


# Inference
Now, let's play with the model we've just trained!

In [0]:
learn.predict("seriously would pay $30 a flight for seats that didn't have this playing.")

(Category negative, tensor(0), tensor([0.9379, 0.0568, 0.0052]))

In [161]:
learn.predict(data_prep[data_prep.is_valid==True]['text'].iloc[14])

(Category positive, tensor(2), tensor([0.3009, 0.1566, 0.5425]))

In [150]:
learn.predict(data_prep[data_prep.is_valid==True]['text'].iloc[9])

(Category neutral, tensor(1), tensor([0.2564, 0.6093, 0.1342]))
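
Each `learn.predict` call above returns a tuple: the predicted category, its class index, and one probability per class. The sketch below shows how the three parts relate, using plain lists in place of tensors. The class order is an assumption read off the outputs above.

```python
classes = ["negative", "neutral", "positive"]  # order assumed from the outputs above

def top_class(probs, classes):
    """Return (class name, class index, probability) for the most likely class."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    return classes[idx], idx, probs[idx]

print(top_class([0.9379, 0.0568, 0.0052], classes))
print(top_class([0.3009, 0.1566, 0.5425], classes))
```

The second example is a far less confident prediction than the first, which is worth keeping in mind when reading individual results.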

# Conclusions
Let's see the evolution of the accuracy as we increase the size of the training data.
For each training size, we report the best accuracy across the different epochs.

In [0]:
predictions=[str((learn.predict(x))[0]) for x in data_prep[data_prep.is_valid==True]['text']]

In [0]:
def returnLabel(x):
  if x =='negative':
    return 0
  elif x=='neutral':
    return 1
  else:
    return 2
y_pred=[returnLabel(x) for x in predictions]

In [0]:
y_true=[returnLabel(x) for x in data_prep[data_prep.is_valid==True]['label']]

In [169]:
target_names = ['negative', 'neutral', 'positive']
print(classification_report(y_true,y_pred,target_names=target_names))

              precision    recall  f1-score   support

    negative       0.82      0.94      0.88      1836
     neutral       0.72      0.51      0.60       620
    positive       0.81      0.66      0.73       472

    accuracy                           0.80      2928
   macro avg       0.78      0.70      0.73      2928
weighted avg       0.80      0.80      0.79      2928
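
To help read the report above, the sketch below recomputes the per-class F1 scores from the rounded precision and recall values, and compares the overall accuracy with a baseline that always predicts the most frequent class. The numbers are copied from the report; the variable names are ours.

```python
# (precision, recall, support) per class, copied from the report above
report = {
    "negative": (0.82, 0.94, 1836),
    "neutral": (0.72, 0.51, 620),
    "positive": (0.81, 0.66, 472),
}

def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, (p, r, _) in report.items():
    print(name, round(f1(p, r), 2))

# Accuracy obtained by always predicting the largest class
total = sum(support for _, _, support in report.values())
majority_baseline = max(support for _, _, support in report.values()) / total
print(round(majority_baseline, 3))
```

Against a majority-class baseline of about 0.63, the reported accuracy of 0.80 is a clear gain. The low recall on `neutral` (0.51) shows where most of the remaining errors are.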

