# BERT Sentiment Analysis PoC
__Leslie A. McFarlin, Principal UX Architect @ Wheels__, Created 03 Feb 2022.

This proof of concept (PoC) uses the HuggingFace Transformers library to access BERT. If you are not running in Google Colab, run this notebook in a virtual environment with TensorFlow, Keras, and Transformers installed.

In [None]:
## MOUNT GOOGLE DRIVE
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
## INSTALLS - ONLY FOR GOOGLE COLAB
!pip install transformers



In [None]:
## IMPORTS

# File access
import os

# Data handling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from collections import Counter

# TensorFlow
import tensorflow as tf

# BERT
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

# Model management
import pickle

In [None]:
## INITIALIZATIONS

# NOTE - You may receive an ImportError asking you to update Jupyter and ipywidgets.
# If so, activate the relevant virtual environment in a terminal and run the following line (without the comment mark):
# pip install ipywidgets
# Then, in Anaconda Navigator, open the Environments tab and select your environment.
# With the package filter set to Not Installed, search for ipywidgets and click to install it.
# Restart the kernel and re-run all cells.
# https://ipywidgets.readthedocs.io/en/stable/user_install.html

# Model Initialization - Pretrained BERT Sequence Classification
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

# Tokenizer Initialization - Pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Pre-trained BERT suits this task because TFBertForSequenceClassification places a dropout layer, which helps limit overfitting, and a final fully-connected classification layer on top of the pre-trained encoder. As the warning above notes, that classification layer is newly initialized, so the model must be fine-tuned before it can be used on our Wheels data.

## Phase I - Training with Twitter Data

### Data Handling and Processing
Because little Wheels data is available (fewer than 2,000 responses), tweets will be used to train the model. These tweets are stored in the data folder as a CSV file, Tweets.csv.

In [None]:
## GRAB THE DATA

# Import the CSV
tweets_df = pd.read_csv("/content/drive/MyDrive/Python/SentimentAnalysis/data/Tweets.csv")

tweets_df.head(5)

|  | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 570306133677760513 | neutral | 1.0 |  |  | Virgin America |  | cairdin |  | 0 | @VirginAmerica What @dhepburn said. |  | 2015-02-24 11:35:52 -0800 |  | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 |  | 0.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica plus you've added commercials t... |  | 2015-02-24 11:15:59 -0800 |  | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 |  |  | Virgin America |  | yvonnalynn |  | 0 | @VirginAmerica I didn't today... Must mean I n... |  | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America |  | jnardino |  | 0 | @VirginAmerica it's really aggressive to blast... |  | 2015-02-24 11:15:36 -0800 |  | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America |  | jnardino |  | 0 | @VirginAmerica and it's a really big bad thing... |  | 2015-02-24 11:14:45 -0800 |  | Pacific Time (US & Canada) |


In [None]:
## GET THE SIZES

# Number of tweets
tweet_count = len(tweets_df.text)

# Number of labels
label_count = len(tweets_df.airline_sentiment)

# Check if they're equal
print("Total number of tweets is:", tweet_count)
print("Total number of labels is:", label_count)

Total number of tweets is: 14640
Total number of labels is: 14640


### Training and Testing Data Splits
Begin by creating a new dataframe that contains only the tweet and sentiment label.

In [None]:
## DATAFRAME
# Specify the columns - use column names that are intuitive to figure out which is which as they carry over to split set structures
data = {'tweets': tweets_df.text, 'labels': tweets_df.airline_sentiment}
# Create the dataframe
tweet_dataset = pd.DataFrame(data = data)

# Check dataframe creation
tweet_dataset.head(5)

|  | tweets | labels |
| --- | --- | --- |
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials t... | positive |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral |
| 3 | @VirginAmerica it's really aggressive to blast... | negative |
| 4 | @VirginAmerica and it's a really big bad thing... | negative |


In [None]:
## CONVERT LABELS TO NUMERIC VALUES

# Create label dictionary
label_dict = {'negative': 0, 'neutral': 0.5, 'positive': 1}

# Replace text labels with their numeric values from the dictionary
tweet_dataset.labels = tweet_dataset.labels.replace(label_dict)
tweet_dataset.head(5)

|  | tweets | labels |
| --- | --- | --- |
| 0 | @VirginAmerica What @dhepburn said. | 0.5 |
| 1 | @VirginAmerica plus you've added commercials t... | 1.0 |
| 2 | @VirginAmerica I didn't today... Must mean I n... | 0.5 |
| 3 | @VirginAmerica it's really aggressive to blast... | 0.0 |
| 4 | @VirginAmerica and it's a really big bad thing... | 0.0 |


In [None]:
## SPLIT THE DATASET
# Reserve 20% for test

# Creates train and test streams
train, test = train_test_split(tweet_dataset, test_size = 0.2, random_state = 42, shuffle = True)

# Create train dataframe - transform
train_df = pd.DataFrame(train)
# Size of training dataframe
train_len = len(train_df)


# Create test dataframe
test_df = pd.DataFrame(test)
# Size of test dataframe
test_len = len(test_df)

test_df.head(5)

|  | tweets | labels |
| --- | --- | --- |
| 4794 | @SouthwestAir you're my early frontrunner for ... | 1.0 |
| 10480 | @USAirways how is it that my flt to EWR was Ca... | 0.0 |
| 8067 | @JetBlue what is going on with your BDL to DCA... | 0.0 |
| 8880 | @JetBlue do they have to depart from Washingto... | 0.5 |
| 8292 | @JetBlue I can probably find some of them. Are... | 0.0 |


In [None]:
## CONVERT TEST AND TRAIN TO A FORMAT USABLE BY BERT
# 4 arguments: train data, test data, data column, label column
# Use InputExample() on train and test data, https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/processors#transformers.DataProcessor.get_dev_examples
# also use apply() with axis = 1, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
def convertData(train, test, data_column, label_column):
    # Training data
    train_examples = train.apply(lambda x: InputExample(guid = None,
                                                        text_a = x[data_column],
                                                        text_b = None, 
                                                        label = x[label_column]), 
                                 axis = 1)
    # Testing data
    test_examples = test.apply(lambda x: InputExample(guid = None,
                                                        text_a = x[data_column],
                                                        text_b = None, 
                                                        label = x[label_column]), 
                                 axis = 1)
    # Return converted inputs
    return train_examples, test_examples

data_column = 'tweets'
label_column = 'labels'

In [None]:
## PRE-PROCESS TEST SAMPLES FOR MODEL USE

# Find the maximum character count among the test tweets
# This character count is then used as max_length (a token count) when padding
test_max_length = max(len(l) for l in test_df.tweets)

# Convert samples to a dataframe
# 3 arguments: examples, tokenizer, max_length
# max_length value should change depending upon 
def convertToDataSet(samples, tokenizer, max_length = test_max_length):
    # For InputFeatures
    features = [] # Initialize
    
    # Iterate through the samples list to clean the input and pad it
    for s in samples: 
        input_dict = tokenizer.encode_plus(s.text_a,
                                            add_special_tokens = True,
                                            max_length = max_length,
                                            return_token_type_ids=True,
                                            return_attention_mask=True,
                                            padding = 'max_length', # pads on the right by default; replaces the deprecated pad_to_max_length=True
                                            truncation=True)
        # Attach IDs
        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"], input_dict["token_type_ids"], input_dict['attention_mask'])
        
        # Add to features
        features.append(InputFeatures(input_ids=input_ids, 
                                      attention_mask=attention_mask, 
                                      token_type_ids=token_type_ids, 
                                      label=s.label))
     
    # Generates the individual features associated with each sample input
    def gen():
        # Iterate through each feature
        for f in features:
            yield ({"input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids},
                    f.label)

    # Return a dataset based on the output of gen()
    return tf.data.Dataset.from_generator(gen,({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
                                               ({"input_ids": tf.TensorShape([None]),
                                                 "attention_mask": tf.TensorShape([None]),
                                                 "token_type_ids": tf.TensorShape([None])},
                                              tf.TensorShape([])))     

In [None]:
## PROCESS DATA FOR BERT
# Input conversions
training_input, test_input = convertData(train_df, test_df, data_column, label_column)

# Convert to dataset
train_data = convertToDataSet(list(training_input), tokenizer)
train_data = train_data.shuffle(100).batch(16).repeat(2)

test_data = convertToDataSet(list(test_input), tokenizer)
test_data = test_data.batch(16)

### Fine Tuning BERT with Twitter Data

In [None]:
## FINE TUNE BERT

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

# Fit the model - https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
model.fit(train_data, epochs=2, verbose = 1, validation_data=test_data, steps_per_epoch = 40)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fc2356552d0>
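
As a rough check on how much of the training split the fit call above draws on per epoch, the arithmetic can be worked through directly. This is a sketch under stated assumptions: it assumes train_test_split rounds the test share up and that each step consumes one batch of 16. The variable names are illustrative and not part of the notebook.

```python
import math

# Values taken from the cells above
total_rows = 14640       # tweets in Tweets.csv
test_percent = 20        # test_size = 0.2
batch_size = 16          # .batch(16)
steps_per_epoch = 40     # steps_per_epoch = 40

# Assumption: the test share is rounded up to a whole number of rows
test_rows = math.ceil(total_rows * test_percent / 100)
train_rows = total_rows - test_rows

# Batches needed for one complete pass over the training split
batches_for_full_pass = math.ceil(train_rows / batch_size)

# Examples drawn in one epoch when the step count is capped
examples_per_epoch = steps_per_epoch * batch_size
share_per_epoch = examples_per_epoch / train_rows

print("Training rows:", train_rows)
print("Batches for a full pass:", batches_for_full_pass)
print("Examples per epoch:", examples_per_epoch)
print(f"Share of training rows per epoch: {share_per_epoch:.1%}")
```

Under these assumptions each epoch draws on only a small share of the training tweets, so raising or removing steps_per_epoch is one option to consider when the fine-tuning is repeated.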

In [None]:
## SAVE THE MODEL OUTPUT - Use pickle
# Be aware that sometimes functions won't be traceable.
# This has happened on Google Colab using CPU, GPU, and TPU.
data_path = "/content/drive/MyDrive/Python/SentimentAnalysis/"
# Write the pickle into data_path so the load cell further down can find it
with open(os.path.join(data_path, 'model.pkl'), 'wb') as outfile:
    pickle.dump(model, outfile)



INFO:tensorflow:Assets written to: ram://ee587b7f-408a-4016-8dda-a961804e684e/assets
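
As an alternative to pickle, Transformers models can be written with save_pretrained and read back with from_pretrained. The helper below is only a sketch: save_and_reload and its out_dir argument are hypothetical names, and it expects the model and tokenizer objects created earlier in this notebook.

```python
def save_and_reload(model, tokenizer, out_dir):
    """Save a fine-tuned model and tokenizer to out_dir, then load them back.

    Hypothetical helper for illustration; out_dir is any writable folder.
    """
    # Deferred import: transformers is only needed when the helper is called
    from transformers import BertTokenizer, TFBertForSequenceClassification

    # Write the model configuration and TensorFlow weights
    model.save_pretrained(out_dir)
    # Write the vocabulary and tokenizer configuration alongside the model
    tokenizer.save_pretrained(out_dir)

    # Reload both from the same folder
    reloaded_model = TFBertForSequenceClassification.from_pretrained(out_dir)
    reloaded_tokenizer = BertTokenizer.from_pretrained(out_dir)
    return reloaded_model, reloaded_tokenizer
```

In Colab this could be called with a folder on the mounted Drive, for example a subfolder of the SentimentAnalysis directory used above.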



In [None]:
## TREAT WHEELS DATA

# Create the data frame
wheels_df = pd.read_csv("/content/drive/MyDrive/Python/SentimentAnalysis/data/wheels_fvus_all.csv")



__Note:__ Some clients entered variations of N/A, which pandas reads as missing values (NaN). These have to be removed. The fastest and easiest way to do this without changing the contents of the original dataframe is to build a list from the text column based on the type of each cell value: strings are collected into the list, and floats (which is how NaN is stored) are left out.
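
The type-based filter described in this note can be illustrated without pandas. The sample values below are made up; they stand in for a column in which missing entries have already been turned into float NaN values.

```python
# Hypothetical column contents after loading: strings plus float NaN for missing entries
raw_values = ["Easy to use", float("nan"), "Too many clicks", float("nan")]

# Keep only the string entries; the original list is left unchanged
texts_only = [value for value in raw_values if isinstance(value, str)]

print(texts_only)
print(len(raw_values), "values before filtering,", len(texts_only), "after")
```

Checking for strings rather than excluding floats gives the same result here and would also skip any other non-text value that might appear in the column.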

### Evaluating Wheels Data

In [None]:
## PROCESSING VERBATIMS
# Create text list from dataframe column, keeping only string values
texts = [t for t in wheels_df.text if isinstance(t, str)]

# Get length of new string-only list
text_counts = len(texts)
print(f"There are {text_counts} items for evaluation.")

# Find the maximum item length in characters
text_max_length = max(len(t) for t in texts)
print(f"The maximum item length is {text_max_length} characters.")

# Tokenize - USE MULTIPLE BATCHES TO AVOID CRASHING AND TIME OUTS
tf_batch = tokenizer(texts, max_length=text_max_length, padding=True, truncation=True, return_tensors='tf')
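
The comment above recommends multiple batches, while the cell tokenizes everything in one call. One way to split the work is sketched here in plain Python; batched is a hypothetical helper and the sample texts are placeholders.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each holding at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder data standing in for the survey verbatims
sample_texts = [f"comment {i}" for i in range(10)]
batches = list(batched(sample_texts, 4))

for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {len(batch)} items")
```

In the notebook, each batch could be passed to the tokenizer and then to the model in turn, with the per-batch predictions collected into one list afterwards.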

In [None]:
# Recall the model
# load saved model
#model = pickle.load(open('/content/drive/MyDrive/Python/SentimentAnalysis/model.pkl','rb'))

In [None]:
## RUN THE MODEL
tf_outputs = model(tf_batch)

# Predictions
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

# Labels
#labels = ['negative','neutral','positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()

# Make and capture predictions
# Create the dataframe pairing each text with its predicted label
predictions = pd.DataFrame({'texts': texts, 'predictions': label})
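
The prediction step above turns raw model outputs (logits) into probabilities with softmax and then takes the index of the largest one. The following is a minimal plain-Python sketch of that calculation; the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps the exponentials numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for one text from a model with two output units
example_logits = [1.2, -0.3]
probabilities = softmax(example_logits)
predicted_index = argmax(probabilities)

print("Probabilities:", probabilities)
print("Predicted class index:", predicted_index)
```

Because argmax returns a class index, a model with two output units (the default when num_labels is not set) can only return 0 or 1, so a value of 0.5 can never be predicted.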

In [None]:
# Make a copy of the original data frame housing the predictions
preds_from_tweets = predictions.copy(deep = True)

# Replace numeric predictions with sentiment labels
preds_from_tweets.predictions = preds_from_tweets.predictions.replace([0, 0.5, 1], ["negative", "neutral", "positive"])
preds_from_tweets.head(5)

In [None]:
## SUMMARIZE FINDINGS

# Create a counter for the labels specifically
label_counts = Counter(preds_from_tweets.predictions)

print(label_counts)

Counter({'negative': 1380, 'positive': 30})


### Observations
The BERT model fine-tuned on Twitter data performed about as well as VADER did in the first experiment, conducted on the 2021 FleetView Yearly Usability Survey. This suggests that Twitter data is not the best training data set for this task. Fine-tuning should therefore be repeated with a data set of website, software, or app reviews, both because of the type of information such reviews typically mention (for example, system speed or number of clicks) and because of the variability in their lengths.

Next, the Wheels data should undergo further cleaning. There are some single-character responses that don't make sense to keep.

The final issue was the dichotomy of results. Predictions were only positive or negative, yet it is clear that some comments were neutral, as they offered neither praise nor criticism of the platform. In sentiment analysis, neutral valence is denoted by scores that are neither strongly positive nor strongly negative (a middle range of values set by the researcher). This was reflected in the VADER work from 2021 but was not appropriately planned for during this phase of the PoC.
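
The idea of a researcher-set middle range can be sketched as a simple threshold rule on a positive-class score between 0 and 1. The function name and the 0.4 and 0.6 cut-offs are hypothetical; they are not values used in this PoC or in the VADER work.

```python
def valence_from_score(p_positive, lower=0.4, upper=0.6):
    """Map a positive-class score in [0, 1] to a three-way valence label.

    The lower and upper cut-offs are illustrative and would be set by the researcher.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be between 0 and 1")
    if p_positive < lower:
        return "negative"
    if p_positive > upper:
        return "positive"
    # Scores inside the middle range count as neutral
    return "neutral"

for score in (0.1, 0.5, 0.9):
    print(score, "->", valence_from_score(score))
```

A rule like this only relabels the output of a two-class model; it does not replace training the model on neutral examples.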

Therefore, for Phase II of this PoC, the following will be done:
- Employ new training data.
- Remove evaluation data shorter than 3 characters.
- Train for neutral valence.
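
Two of the steps listed above can be sketched in plain Python: giving each sentiment its own integer class, so that neutral is a separate label, and dropping very short responses. The names PHASE_TWO_LABELS, encode_labels, and drop_short_texts are hypothetical, and the sample data is made up.

```python
# Hypothetical integer class mapping: one class index per sentiment
PHASE_TWO_LABELS = {"negative": 0, "neutral": 1, "positive": 2}
MIN_CHARS = 3  # responses shorter than this are dropped

def encode_labels(labels):
    """Convert text labels into integer class indices."""
    return [PHASE_TWO_LABELS[label] for label in labels]

def drop_short_texts(texts, min_chars=MIN_CHARS):
    """Keep only texts with at least min_chars characters after trimming whitespace."""
    return [text for text in texts if len(text.strip()) >= min_chars]

sample_labels = ["neutral", "positive", "negative"]
sample_texts = ["ok", "Search is slow", "x", "Fine"]

print(encode_labels(sample_labels))
print(drop_short_texts(sample_texts))
```

With three integer classes, the model would also need three output units, which in Transformers can be requested by passing num_labels=3 to from_pretrained.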

In [None]:
# Output to CSV
preds_from_tweets.to_csv(r'/content/drive/MyDrive/Python/SentimentAnalysis/BERT_SA_11Feb2022.csv')