# Sentiment Analysis of Tweets using BERT
> In this notebook we walk through the process of classifying tweets (or any other text data, for that matter) as positive, negative or neutral.
The dataset we use for this task is the [Airline Tweets Dataset](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

> We will be using [MLflow](https://mlflow.org/) to track our training process.

> If you are running this notebook locally or from a clone of the repository rather than from a JupyterHub image, please refer to this [doc](https://docs.google.com/document/d/1BUEzAeymOr1NyWQT4_vY22dFlMinjcbeV6iFBZhBTYY/edit) and the requirements.txt in the repository to set up the environment.


- toc: false
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter, sentimentanalysis, machinelearning, naturallanguageprocessing, deeplearing, interpret-text, interpretability, bert]
- hide: false
- search_exclude: true


# Abstract

Sentiment Analysis is the automated process of analyzing text data and sorting it into sentiments depending on the problem statement. The ability to extract insights from this type of data is a practice that is widely adopted by many organisations across the world. Its applications are broad and powerful. A very important use case for sentiment analysis is brand reputation management.

Red Hat has a variety of text-based artifacts, coming from sources that range from partner and customer engagements to documentation and communication logs. These artifacts are valuable and, if appropriately mined, can be used to generate business insights and inform decisions. The goal of this project is to give other teams across Red Hat a tool for analyzing their text data and making informed decisions based on the insights gained. In this blog post we take a public dataset as an example to walk through the workflow.

# Problem Statement
The goal is to make a deep learning model which can classify emotion in a given sentence.

We do this by making use of transfer learning on the BERT model architecture. So that this can be used as a sample workflow, we take publicly available data as an example, as the original workflow consists of sensitive data. We also discuss interpreting BERT using the Unified Information Explainer algorithm.

# Methodology
Given a text, our goal is to predict whether it conveys a positive, negative or neutral emotion. Hence we want to build a text classifier for our data. There are various approaches to this task, but for our project we pick the approach used in most state-of-the-art text analysis systems, i.e. deep learning.

Constructing a highly accurate deep learning model from scratch requires huge amounts of data and compute resources. Luckily for us, models like [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) are pre-trained on large amounts of data and made publicly available. We can therefore fine-tune a pre-trained model like BERT on our own data to leverage what the model has already learnt. This process is called [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning).

First, we download the pre-trained model and the files that allow us to use it easily.

In [1]:
#collapse-hide

!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip -P models/bert

--2020-07-17 16:28:07--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.13.80, 172.217.13.240, 172.217.12.240, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.13.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘models/bert/uncased_L-12_H-768_A-12.zip’



In [18]:
#hide

import tensorflow as tf
import numpy as np 
import pandas as pd
import re
import gc
import os
import fileinput
import string
import zipfile
import datetime
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.metrics import classification_report
import mlflow
from pandas import DataFrame
sys.path.insert(0, 'models/bert')
from models.bert import modeling
from models.bert import optimization
from models.bert import run_classifier
from models.bert import tokenization

#extracting the downloaded model
folder = 'models/bert'
with zipfile.ZipFile("models/bert/uncased_L-12_H-768_A-12.zip","r") as zip_ref:
    zip_ref.extractall(folder)

Here we initialize the MLflow client so that we can track our run and its results.

In [24]:
MLFLOW_CLIENT = mlflow.tracking.MlflowClient(tracking_uri='http://mlflow-server-route-aiops-prod-prometheus-scrape.cloud.paas.psi.redhat.com')
mlflow.set_tracking_uri("http://mlflow-server-route-aiops-prod-prometheus-scrape.cloud.paas.psi.redhat.com")

In [26]:
mlflow.set_experiment('sentiment_analysis_test_0.1')
mlflow.start_run(run_name="airline_tweets-trialrun-same-artifacts")

In [27]:
mlflow_run_id = mlflow.active_run().info.run_id

# BERT implementation

We are going to use Google's pre-trained BERT for our classification task.
Apart from the model itself, we also directly use Google's scripts to run our classifier, which lets us apply the model to our own data.


## Exploring the dataset

For our demo we use the public Twitter US Airline Sentiment [dataset](https://www.kaggle.com/crowdflower/twitter-airline-sentiment). It consists of tweets directed at six US airlines, each classified as neutral, positive or negative.

### Loading and Cleaning data 

First we load our data from a CSV file into a pandas DataFrame.

In [3]:
#collapse-show

tweets = pd.read_csv('dataset/Tweets.csv')

# Display 10 randomly sampled rows (sample(frac=1) shuffles a copy; tweets itself is unchanged)
tweets.sample(frac=1).head(10)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
77,569940323746516993,neutral,1.0,,,Virgin America,,TaylorLumsden,,0,@VirginAmerica first time flying you all. do y...,,2015-02-23 11:22:16 -0800,"Dallas, Texas",Mountain Time (US & Canada)
9418,569943857418399746,negative,1.0,Customer Service Issue,0.6772,US Airways,,thomashoward88,,0,@USAirways US 728 stated their issues as: no o...,,2015-02-23 11:36:19 -0800,,
8703,567935527481188352,neutral,1.0,,,Delta,,ToTravelToLive,,0,@JetBlue Anywhere warm cause its freezing in NYC,,2015-02-17 22:35:56 -0800,NYC,
75,569941957490774016,positive,1.0,,,Virgin America,,TaylorLumsden,,0,@VirginAmerica awesome. I flew yall Sat mornin...,,2015-02-23 11:28:46 -0800,"Dallas, Texas",Mountain Time (US & Canada)
13267,569900784965554176,negative,0.7123,Flight Booking Problems,0.7123,American,,milz02315,,0,@AmericanAir Can you add my KTN to an existing...,,2015-02-23 08:45:10 -0800,,
12962,569972985521532929,negative,1.0,Customer Service Issue,1.0,American,,T_Lubinski,,0,@AmericanAir Trying to get my flight changed t...,,2015-02-23 13:32:03 -0800,"Boston, MA",Eastern Time (US & Canada)
11773,567767738886545408,negative,1.0,Customer Service Issue,1.0,US Airways,,izzyflan,,0,@USAirways never in my life have I dealt with ...,,2015-02-17 11:29:12 -0800,,
9205,570068659193950208,neutral,1.0,,,US Airways,,EverettWJones,,0,@USAirways thank you.,,2015-02-23 19:52:14 -0800,,Quito
1283,569853646411661314,negative,1.0,Late Flight,1.0,United,,crog,,0,".@united You may ""dislike delays"" but I paid y...",,2015-02-23 05:37:51 -0800,,Mountain Time (US & Canada)
11899,570304633048047616,neutral,0.6529,,0.0,American,,AesaGaming,,0,@AmericanAir Do you have any sort of live chat...,,2015-02-24 11:29:54 -0800,,


We have the following columns in our data:

In [4]:
list(tweets.columns) 

['tweet_id',
 'airline_sentiment',
 'airline_sentiment_confidence',
 'negativereason',
 'negativereason_confidence',
 'airline',
 'airline_sentiment_gold',
 'name',
 'negativereason_gold',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_created',
 'tweet_location',
 'user_timezone']

We are only concerned with the `text` and `airline_sentiment` columns, as the purpose of this blog is to walk through a basic sentiment analysis pipeline. Of course, we could make use of the other features to extract more information from the data if we wished to.

Looking at the `text` column, this is what some of the entries look like:

In [11]:
#hide
pd.set_option('display.max_colwidth', None)  # None shows full cell contents; -1 is deprecated in recent pandas

In [12]:
tweets['text'][0:5]

0    @VirginAmerica What @dhepburn said.                                                                                           
1    @VirginAmerica plus you've added commercials to the experience... tacky.                                                      
2    @VirginAmerica I didn't today... Must mean I need to take another trip!                                                       
3    @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
4    @VirginAmerica and it's a really big bad thing about it                                                                       
Name: text, dtype: object

A quick breakdown of the data by sentiment is shown in the image below:

![](assets/img_2.png)

Although the data contains a much larger percentage of negative tweets, the other categories still have enough data in them. Hence we don’t have to perform any undersampling/oversampling operations.
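
The class balance described above can be checked directly from the labels. The following is a minimal sketch that uses a small made-up list of labels in place of the `airline_sentiment` column; the values are illustrative and are not the real counts, and `class_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical labels standing in for tweets['airline_sentiment'].
labels = ["negative", "negative", "negative", "neutral", "positive",
          "negative", "neutral", "negative", "positive", "negative"]


def class_shares(labels):
    """Return {label: fraction of examples} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


shares = class_shares(labels)
for label, share in sorted(shares.items()):
    print(f"{label:>8}: {share:.0%}")
```

On the real DataFrame, `tweets['airline_sentiment'].value_counts(normalize=True)` should give the same kind of summary in a single call.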

We also perform some preprocessing to clean our data: getting rid of special characters, removing single characters that carry no value for us, and eliminating extra spaces.

<h2>Preprocessing tweets</h2>

We perform some basic cleaning on our text data using regular expressions.
We then split our data into training and test sets.

In [19]:
#collapse-show

from sklearn.model_selection import train_test_split
features = tweets.iloc[:, 10].values
labels = tweets.iloc[:, 1].values
#preprocessing 
processed_features = []

for sentence in range(0, len(features)):
    #Getting rid of special characters
    processed_feature = re.sub(r'\W', ' ', str(features[sentence]))
    # remove all single characters
    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # Remove single characters from the start
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature) 
    # Substituting multiple spaces with single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    # Removing prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    # Converting to Lowercase
    processed_feature = processed_feature.lower()
    processed_features.append(processed_feature)

#Splitting the data 
X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)
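
To see what the cleaning steps do to a single tweet, here is a small self-contained sketch. It mirrors the substitutions in the loop above, with the start-of-string pattern written using a plain `^` anchor; `clean_tweet` is a hypothetical helper name and the sample string is an abridged version of one of the tweets shown earlier.

```python
import re


def clean_tweet(text):
    """Apply the cleaning substitutions to one string and lowercase it."""
    text = re.sub(r'\W', ' ', str(text))          # non-word characters -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # drop isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # drop a single letter at the start
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    text = re.sub(r'^b\s+', '', text)             # drop a leading 'b' prefix
    return text.lower()


sample = "@VirginAmerica it's really aggressive &amp; they have little recourse"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

Note that the `@` becomes a leading space and the `s` left over from `it's` is removed as an isolated single letter, which is why cleaned tweets start with a space and lose contractions.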

We map the sentiment labels to integers for the training and inference steps:

In [20]:
#collapse-show

d = {"positive":2,"negative":0,"neutral":1}
y_train = [d[x] for x in y_train]
y_test = [d[x] for x in y_test]

print(X_test[10],y_test[10])

 united your announcement for pre boarding only addresses mobility my disability requires me to travel with lot of stuff do preboard  1
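
Since the model works with integer ids, it is handy to keep the inverse mapping around for turning predictions back into readable labels. A minimal sketch, reusing the same mapping as above; `encode` and `decode` are hypothetical helper names.

```python
# Same mapping as in the cell above.
label_to_id = {"positive": 2, "negative": 0, "neutral": 1}
# Inverse mapping, for turning predicted ids back into readable labels.
id_to_label = {idx: label for label, idx in label_to_id.items()}


def encode(labels):
    """Map sentiment strings to integer class ids."""
    return [label_to_id[label] for label in labels]


def decode(ids):
    """Map integer class ids back to sentiment strings."""
    return [id_to_label[int(idx)] for idx in ids]


encoded = encode(["neutral", "negative", "positive"])
decoded = decode(encoded)
print(encoded, decoded)
```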


## Applying Deep Learning using BERT

As mentioned before, we will use BERT and fine-tune it to make predictions on our data.

The diagram below shows how BERT fits into our workflow.



![](assets/img_3.png)

Once we have cleaned our text data, all we have to do is prepare it for consumption by the model. Depending on which implementation of BERT you use, this step may differ, but all approaches require us to encode our labels and tokenize the text. Both of these functions are generally provided by the library offering the BERT implementation.
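
To make "tokenize the text" more concrete, the sketch below shows the general idea behind WordPiece-style tokenization and the fixed-length features BERT expects. It is a simplified illustration rather than the tokenizer shipped with BERT (which also handles punctuation and Unicode normalization); the tiny vocabulary and the helper names `wordpiece` and `to_features` are made up for this example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def to_features(text, vocab, max_seq_length):
    """Build tokens, input_ids, input_mask and segment_ids for a single sentence."""
    tokens = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)                  # 1 marks real tokens
    padding = [0] * (max_seq_length - len(input_ids))  # 0 marks padding
    return tokens, input_ids + padding, input_mask + padding, [0] * max_seq_length


# Tiny made-up vocabulary; the real vocab.txt has roughly 30,000 entries.
vocab = {t: i for i, t in enumerate(
    ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "flight", "delay", "##ed", "again", "my"])}

tokens, input_ids, input_mask, segment_ids = to_features(
    "my flight delayed again", vocab, max_seq_length=10)
print(tokens)
print(input_ids)
print(input_mask)
```

The word "delayed" is not in the toy vocabulary, so it is split into `delay` and `##ed`; the sequence is then padded with zeros up to the maximum length, and the mask tells the model which positions hold real tokens.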

Since we just want to fine-tune the model, we don’t have to put a lot of resources into training. A couple of epochs are enough to give us good results.

Loading the model

In [28]:
#collapse-show

folder = 'models/bert'
BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_PRETRAINED_DIR = f'{folder}/uncased_L-12_H-768_A-12'
OUTPUT_DIR = f'{folder}/outputs'
print(f'>> Model output directory: {OUTPUT_DIR}')
print(f'>>  BERT pretrained directory: {BERT_PRETRAINED_DIR}')

In [31]:
#collapse-show

# keep track of the model name as a mlflow run tag
mlflow.set_tag("model", OUTPUT_DIR)

<h2> Training the model</h2>

Now that we have our data ready for use, we move on to the next step, i.e. training the model on our data. Since we already have the pre-learned weights in the model, we can get good results by training on our data for just a few epochs.

We start by initializing our model and transforming our data so that it is ready for consumption by the model.

In [None]:
#collapse-hide

def create_examples(lines, set_type, labels=None):
    # Generate data for the BERT model. We need the data in this format before it is fed in for training.
    guid = f'{set_type}'
    examples = []
    if guid == 'train':
        for line, label in zip(lines, labels):
            text_a = line
            label = str(label)
            examples.append(
              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    else:
        for line in lines:
            text_a = line
            label = '0'
            examples.append(
              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

# Model Hyper Parameters
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 1e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1
#We need this to be a little lower than the max length of tweets we have 
MAX_SEQ_LENGTH = 50
# Model configs
SAVE_CHECKPOINTS_STEPS = 100000 #if you wish to fine-tune a model on a larger dataset, use a larger interval
# each checkpoint weighs about 1.5 GB
ITERATIONS_PER_LOOP = 100000
NUM_TPU_CORES = 8
VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

label_list = [str(num) for num in range(3)]
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)
train_examples = create_examples(X_train, 'train', labels=y_train)

tpu_cluster_resolver = None #Since training will happen on GPU, we won't need a cluster resolver
#TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=OUTPUT_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

num_train_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False, #If False, training falls back to CPU or GPU, depending on what is available  
    use_one_hot_embeddings=True)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False, #If False, training falls back to CPU or GPU, depending on what is available 
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
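
The step counts above are plain arithmetic, and a worked example may help. The sketch below recomputes them for a made-up training-set size and also shows the shape of learning-rate schedule that a warmup proportion implies: a linear ramp up to the base rate followed by a linear decay to zero. The schedule shape is an assumption made for illustration, and `training_steps` and `learning_rate_at` are hypothetical helper names.

```python
def training_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Total optimizer steps and warmup steps, computed as in the cell above."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


def learning_rate_at(step, base_lr, num_train_steps, num_warmup_steps):
    """Linear warmup towards base_lr, then linear decay to zero (assumed shape)."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr * (1.0 - step / num_train_steps)


# 10,000 is a made-up training-set size used only for illustration.
total, warmup = training_steps(10_000, batch_size=32, num_epochs=3.0,
                               warmup_proportion=0.1)
print(total, warmup)
for step in (0, warmup // 2, warmup, total // 2, total):
    print(step, learning_rate_at(step, 1e-5, total, warmup))
```

With 10,000 examples, a batch size of 32 and 3 epochs this gives 937 training steps, of which the first 93 are warmup steps.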

Logging parameters into MLflow

In [None]:
#collapse-hide

# log parameters before run
mlflow.log_param("TRAIN_BATCH_SIZE", TRAIN_BATCH_SIZE)
mlflow.log_param("EVAL_BATCH_SIZE", EVAL_BATCH_SIZE)
mlflow.log_param("LEARNING_RATE", LEARNING_RATE)
mlflow.log_param("NUM_TRAIN_EPOCHS", NUM_TRAIN_EPOCHS)
mlflow.log_param("WARMUP_PROPORTION", WARMUP_PROPORTION)
mlflow.log_param("MAX_SEQ_LENGTH", MAX_SEQ_LENGTH)
mlflow.log_param("SAVE_CHECKPOINTS_STEPS", SAVE_CHECKPOINTS_STEPS)
mlflow.log_param("ITERATIONS_PER_LOOP", ITERATIONS_PER_LOOP)

<h2>Training</h2>

We now train our model according to the hyper-parameters defined above.

In [None]:
#collapse-show

print('Please wait...')
train_features = run_classifier.convert_examples_to_features(
    train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('>> Started training at {} '.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(train_examples)))
print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
tf.logging.info("  Num steps = %d", num_train_steps)
train_input_fn = run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print('>> Finished training at {}'.format(datetime.datetime.now()))

### Export and save model variables and protobuf

In [None]:
#collapse-hide

serving_model_save_path = 'models/saved_models'


def serving_input_receiver_fn():
    input_ids = tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_ids')
    input_mask = tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_mask')
    segment_ids = tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='segment_ids')
    label_ids = tf.placeholder(dtype=tf.int64, shape=[None, ], name='unique_ids')

    receive_tensors = {'input_ids': input_ids, 'input_mask': input_mask, 'segment_ids': segment_ids,
                       'label_ids': label_ids}
    features = {'input_ids': input_ids, 'input_mask': input_mask, 'segment_ids': segment_ids, "label_ids": label_ids}
    return tf.estimator.export.ServingInputReceiver(features, receive_tensors)

estimator._export_to_tpu = False
estimator.export_saved_model(serving_model_save_path, serving_input_receiver_fn)


<h2>Predicting and Evaluating</h2>

Now that the training step is complete, we use what our model has learned to make predictions on the test set. We then evaluate our results.

In [None]:
#collapse-hide

def input_fn_builder(features, seq_length, is_training, drop_remainder):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  all_input_ids = []
  all_input_mask = []
  all_segment_ids = []
  all_label_ids = []

  for feature in features:
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_segment_ids.append(feature.segment_ids)
    all_label_ids.append(feature.label_id)

  def input_fn(params):
    """The actual input function."""
    print(params)
    batch_size = 500

    num_examples = len(features)

    d = tf.data.Dataset.from_tensor_slices({
        "input_ids":
            tf.constant(
                all_input_ids, shape=[num_examples, seq_length],
                dtype=tf.int32),
        "input_mask":
            tf.constant(
                all_input_mask,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "segment_ids":
            tf.constant(
                all_segment_ids,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "label_ids":
            tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
    })

    if is_training:
      d = d.repeat()
      d = d.shuffle(buffer_size=100)

    d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
    return d

  return input_fn
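
The `drop_remainder` flag matters at prediction time: if the last, shorter batch were dropped, some test examples would receive no prediction and the predictions would no longer line up with the labels. The following pure-Python sketch illustrates the effect with a made-up number of examples; `batch` is a hypothetical helper, not part of TensorFlow.

```python
def batch(items, batch_size, drop_remainder):
    """Split items into consecutive batches, optionally dropping a short final batch."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


examples = list(range(1_234))  # stand-in for 1,234 hypothetical test examples

kept = batch(examples, batch_size=500, drop_remainder=False)
dropped = batch(examples, batch_size=500, drop_remainder=True)

print([len(b) for b in kept])     # batch sizes when the remainder is kept
print([len(b) for b in dropped])  # batch sizes when the remainder is dropped
```

With the remainder kept, all 1,234 examples are covered; with it dropped, the final 234 are lost.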

In [None]:
#collapse-show

predict_examples = create_examples(X_test, 'test')

predict_features = run_classifier.convert_examples_to_features(
    predict_examples, label_list, MAX_SEQ_LENGTH, tokenizer)

predict_input_fn = input_fn_builder(
    features=predict_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

result = estimator.predict(input_fn=predict_input_fn)

## Results

We get the following results for our model after training:



![](assets/img_4.png)

Our model doesn’t perform well on neutral sentiment. A possible reason is the general ambiguity involved in classifying a neutral emotion. That is not to say the performance can’t be improved with some tweaking!

In [None]:
#collapse-hide

preds = []
for prediction in result:
    preds.append(np.argmax(prediction['probabilities']))

In [None]:
#collapse-hide

print("Accuracy of BERT is:",accuracy_score(y_test, preds))

Accuracy of BERT is: 0.7990654205607477

In [None]:
#collapse-hide

print("F1 Score of BERT is:",f1_score(y_test, preds, average='macro'))

F1 Score of BERT is: 0.660558251784892


In [None]:
#collapse-hide

metrics = classification_report(y_test, preds, output_dict=True)

In [None]:
#collapse-hide

# preds were computed on the test set, so pair them with X_test and y_test
outputframe = DataFrame(dict(sentence = pd.Series(X_test), old_model_label = pd.Series(y_test), pred_label = pd.Series(preds))).reset_index()

Saving our output into a csv for further analysis.

In [None]:
#collapse-hide

outputframe.to_csv('output/airline_tweets.csv')

In [None]:
#collapse-hide

MLFLOW_CLIENT.log_metric(mlflow_run_id, "Avg_Precision ", metrics['macro avg']['precision'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "Avg_recall ", metrics['macro avg']['recall']) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "Avg_f1-score ",  metrics['macro avg']['f1-score'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "Avg_support ",  metrics['macro avg']['support']) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "Accuracy ",  accuracy_score(y_test, preds)) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "0_Precision ",  metrics['0']['precision'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "0_recall ",  metrics['0']['recall']) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "0_f1-score ",  metrics['0']['f1-score'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "0_support ",  metrics['0']['support']) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "1_Precision ",  metrics['1']['precision'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "1_recall ",  metrics['1']['recall']) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "1_f1-score ",  metrics['1']['f1-score'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "1_support ",  metrics['1']['support']) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "2_Precision ",  metrics['2']['precision'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "2_recall ",  metrics['2']['recall']) 
MLFLOW_CLIENT.log_metric(mlflow_run_id, "2_f1-score ",  metrics['2']['f1-score'])
MLFLOW_CLIENT.log_metric(mlflow_run_id, "2_support ",  metrics['2']['support']) 
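
The repeated calls above all follow one pattern, so the nested report can also be flattened once and logged in a loop. A minimal sketch of the flattening step, using a hand-written stand-in with the same shape as the output of `classification_report(..., output_dict=True)`; `flatten_report` is a hypothetical helper name and the numbers are made up.

```python
def flatten_report(report):
    """Turn a nested classification-report dict into a flat {name: value} dict."""
    flat = {}
    for group, values in report.items():
        if isinstance(values, dict):
            for metric, value in values.items():
                flat[f"{group.replace(' ', '_')}_{metric}"] = float(value)
        else:
            flat[group] = float(values)  # e.g. the top-level 'accuracy' entry
    return flat


# Hand-written stand-in for a classification report; values are illustrative only.
report = {
    "0": {"precision": 0.5, "recall": 0.5, "f1-score": 0.5, "support": 4},
    "macro avg": {"precision": 0.5, "recall": 0.5, "f1-score": 0.5, "support": 4},
    "accuracy": 0.5,
}

flat = flatten_report(report)
for name, value in flat.items():
    print(name, value)
```

Each `(name, value)` pair could then be passed to `MLFLOW_CLIENT.log_metric(mlflow_run_id, name, value)`. The metric names produced this way differ slightly from the ones used above (for example, they carry no trailing space).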

In [None]:
mlflow.end_run()

## Interpreting The Model

To better understand and improve our model, we need some insight into how its decisions are being made. One way to get this is to use interpretability techniques.

For our case, we make use of the library [interpret-text](https://github.com/interpretml/interpret-text). As this library supports only PyTorch, we retrain our model using PyTorch. We then use this trained BERT model to run our interpretability algorithm.

We use the [Unified Information Explainer](https://www.microsoft.com/en-us/research/publication/towards-a-deep-and-unified-understanding-of-deep-neural-models-in-nlp/) for the task. Let us look at an example.

The dashboard lets us move a slider to pick the ‘n’ most important features, according to the model, for making a certain prediction. It considers not just the word itself but its surrounding words as well.

For this example we focus on what the model sees at the 12th and final classification layer.

**Sentence**: @united yup it just happens way too often 5 times in the last 12 months

**True Label**: negative

**Prediction**: negative

![](assets/img_5.png)

We see that our model focuses most on the phrase ‘way too often’ and correctly predicts that the sentence conveys a negative emotion.
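
The Unified Information Explainer itself does not fit into a short snippet. To convey the general idea of per-word importance, the sketch below uses a much simpler stand-in technique, leave-one-out occlusion: remove one word at a time and measure how much the score drops. The scoring function is a toy with invented weights, not a trained model, and `negative_score` and `occlusion_importance` are hypothetical names.

```python
# Toy stand-in for a model: sums hand-picked word weights into a "negative" score.
# These weights are invented for illustration and come from no trained model.
TOY_WEIGHTS = {"too": 0.3, "often": 0.4, "way": 0.2, "happens": 0.1}


def negative_score(tokens):
    """Return the toy 'negative sentiment' score for a list of tokens."""
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)


def occlusion_importance(tokens, score_fn):
    """Importance of each token = drop in score when that token is removed."""
    base = score_fn(tokens)
    return [(tok, base - score_fn(tokens[:i] + tokens[i + 1:]))
            for i, tok in enumerate(tokens)]


tokens = "yup it just happens way too often".split()
ranked = sorted(occlusion_importance(tokens, negative_score),
                key=lambda pair: pair[1], reverse=True)
for tok, importance in ranked[:3]:
    print(f"{tok:>8}  {importance:.2f}")
```

With a real classifier, `score_fn` would be the model's predicted probability for the class of interest, at the cost of one forward pass per removed word.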

We can effectively use this tool to look at a subset of sentences and tweak our model by looking at how it processes the sentences.