# Airline Tweets - Three Generations (👵 👩 👧) of Sentiment Analysis (😊 😟)
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load required packages. Set up workspace, e.g., set theme for plotting and initialize the random number generator.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline
import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

## Problem description

We have a collection of 10,000 tweets directed at airlines in the US. Originally, this dataset came from Crowdflower's Data for Everyone library (discontinued). The data was collected in February 2015, and multiple human annotators were asked to classify the tweets into the classes `positive` and `negative`.

## Load data

In [None]:
tweets = pd.read_csv("https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/airline_tweets/data/airlinetweets.csv")
tweets.head()

## Prepare data

Perform the typical splits into features and labels and training and test sets.

In [None]:
X = tweets[["tweet_id", "airline", "text"]]
y = tweets[["sentiment_groundtruth"]]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

## Exploratory data analysis

Let's explore the training data a little bit. It's always a good idea to first read a couple of texts.

In [None]:
print(X_train.iloc[0]["text"])
print("===")
print(y_train.iloc[0]["sentiment_groundtruth"])

In [None]:
print(X_train.iloc[1]["text"])
print("===")
print(y_train.iloc[1]["sentiment_groundtruth"])

Visualize the distribution of tweets across the positive and negative classes.

In [None]:
sns.countplot(data=y_train, x="sentiment_groundtruth")

In [None]:
y_train["sentiment_groundtruth"].value_counts()

In [None]:
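# Share of the less frequent class; .iloc selects by position (value_counts sorts by frequency)
y_train["sentiment_groundtruth"].value_counts().iloc[1] / y_train.shape[0]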
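
The class shares matter for everything that follows: a model that always predicts the most frequent class already reaches an accuracy equal to that class's share. The following is a minimal sketch of this baseline. The function name `majority_baseline_accuracy` and the toy labels are assumptions for illustration, not part of the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Hypothetical toy labels (NOT the real tweet data)
toy_labels = ["negative"] * 8 + ["positive"] * 2
label, acc = majority_baseline_accuracy(toy_labels)
print(label, acc)
```

Any classifier in this notebook should be judged against this baseline rather than against 50%.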

## 1st Generation 👵 - Dictionary-based Sentiment Analysis with VADER

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/gen1.png"/><br></center>

Let's start by trying a dictionary-based approach to sentiment analysis. In the following, we will use `VADER` (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media (https://github.com/cjhutto/vaderSentiment).

Initialize the VADER sentiment analyzer and try it out on a single example from the training set.

In [None]:
vader_sa_classifier = SentimentIntensityAnalyzer()

In [None]:
print(X_train.iloc[1]["text"])
vader_sa_classifier.polarity_scores(X_train.iloc[1]["text"])
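
To make the dictionary-based idea concrete, here is a heavily simplified sketch. It looks up each word in a small lexicon, sums the valence scores, and squashes the sum into the range (-1, 1), similar in spirit to how VADER derives its `compound` score. The lexicon `TOY_LEXICON`, its scores, and the function names are made up for illustration. The real VADER lexicon is far larger and adds rules for negation, punctuation, capitalization, and intensifiers.

```python
import math

# Hypothetical mini-lexicon with made-up valence scores (NOT VADER's real lexicon)
TOY_LEXICON = {
    "great": 3.0, "thanks": 2.0, "love": 3.0,
    "delayed": -2.0, "lost": -2.0, "worst": -3.0,
}

def toy_compound(text, alpha=15.0):
    """Sum the word valences and squash the sum into (-1, 1)."""
    tokens = [tok.strip(".,!?") for tok in text.lower().split()]
    total = sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)
    return total / math.sqrt(total * total + alpha)

def toy_label(text):
    """Label a text positive only if its compound score is above zero."""
    return "positive" if toy_compound(text) > 0 else "negative"

for text in ["Thanks for the great flight!",
             "Worst airline, my bag is lost",
             "Flight at 5pm"]:
    print(text, "->", round(toy_compound(text), 3), toy_label(text))
```

Note that a text without any lexicon words gets a score of exactly 0 and is therefore labeled `negative` under the `> 0` cutoff, the same cutoff this notebook applies to VADER's `compound` score.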

Use VADER to analyze all tweets from the test set.

In [None]:
y_test_vader = []
for index, row in X_test.iterrows():
    vs = vader_sa_classifier.polarity_scores(row["text"])
    if vs["compound"] > 0:
        sentiment = "positive"
    else:
        sentiment = "negative"
    y_test_vader.append(sentiment)

Let's look at some example predictions and calculate the accuracy using all observations from the test set.

In [None]:
y_test_vader[0:10]

In [None]:
accuracy_score(y_test, y_test_vader)

Looking at the confusion matrix gives us an idea what kind of errors the model makes.

In [None]:
cm = confusion_matrix(y_test, y_test_vader)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["neg", "pos"])
disp.plot()
plt.show()
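
The four cells of the confusion matrix can be condensed into precision and recall per class, which are often more informative than accuracy when the classes are imbalanced. Below is a minimal sketch in plain Python on made-up labels. The function name `precision_recall` and the toy data are assumptions for illustration. The `classification_report` function imported at the top of this notebook reports the same kind of figures.

```python
def precision_recall(y_true, y_pred, target):
    """Precision and recall for one class, counted directly from the labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)  # correctly predicted target
    fp = sum(t != target and p == target for t, p in pairs)  # wrongly predicted target
    fn = sum(t == target and p != target for t, p in pairs)  # missed target
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy labels (NOT the real predictions)
toy_true = ["negative", "negative", "negative", "positive", "positive", "negative"]
toy_pred = ["negative", "positive", "negative", "positive", "negative", "negative"]

for target in ["negative", "positive"]:
    print(target, precision_recall(toy_true, toy_pred, target))
```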

## 2nd Generation 👩 - Statistical Sentiment Analysis with Bag-of-Words Representation and Regularized Logistic Regression

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/gen2.png"/><br></center>

Next, we will use a statistical approach. More precisely, we create a bag-of-words representation of our tweets and then train a regularized logistic regression model to learn linear relationships between single words and sentiment.

Before we create a term-document matrix, we perform some standard preprocessing: we keep only alphabetic tokens, lemmatize them, and convert them to lowercase. We will use the excellent `spaCy` (https://spacy.io/) library for this.

In [None]:
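# Load the small English model; NER and the parser are not needed for lemmatization
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def spacy_prep(dataset):
    """Add a `text_prep` column with the lowercased lemmas of all alphabetic tokens."""
    dataset = dataset.to_dict("records")
    for entry in dataset:
        doc = nlp(entry["text"])
        # Keep alphabetic tokens only (drops numbers and punctuation)
        tokens_to_keep = [token.lemma_.lower() for token in doc if token.is_alpha]
        entry["text_prep"] = " ".join(tokens_to_keep)
    # Note: the returned DataFrame has a fresh 0..n-1 index
    return pd.DataFrame(dataset)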

In [None]:
X_train = spacy_prep(X_train)

In [None]:
X_train.head()

Now we are ready to create the term-document matrix for the training set. The `CountVectorizer` from the sklearn package performs this job for us.

In [None]:
count_vect = CountVectorizer(min_df=2)
X_train_matrix = count_vect.fit_transform(X_train["text_prep"].tolist())

In [None]:
X_train_matrix.shape

In [None]:
X_train_matrix[0:10,0:10].todense()
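
To see what the vectorizer computes, here is a simplified sketch in plain Python. It builds a vocabulary of words that occur in at least `min_df` documents and then counts how often each vocabulary word appears in each document. The function names and the toy documents are assumptions for illustration. The real `CountVectorizer` uses its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter

def build_vocabulary(docs, min_df=2):
    """Keep words that occur in at least `min_df` documents, sorted alphabetically."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    return sorted(word for word, df in doc_freq.items() if df >= min_df)

def to_count_matrix(docs, vocabulary):
    """One row per document, one column per vocabulary word, cells hold word counts."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in index:  # words outside the vocabulary are ignored
                row[index[word]] += 1
        matrix.append(row)
    return matrix

# Hypothetical preprocessed toy documents (NOT the real tweets)
toy_docs = ["flight be late", "great flight great crew", "crew be great"]
vocab = build_vocabulary(toy_docs, min_df=2)
matrix = to_count_matrix(toy_docs, vocab)
print(vocab)
print(matrix)
```

The word `late` appears in only one document and is therefore dropped, which is the effect of `min_df=2`.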

Finally, we fit a logistic regression model on the training set.

In [None]:
bow_sa_classifier = LogisticRegression(max_iter=1000, penalty="l1", solver="liblinear")
bow_sa_classifier.fit(X_train_matrix, np.ravel(y_train))

Now that we have a sentiment classifier trained specifically on our data, we can apply it to the test set and evaluate its predictive accuracy. Note that the test set must go through exactly the same preprocessing steps as the training set, and that we reuse the fitted vectorizer (`transform`, not `fit_transform`) so that the columns match.

In [None]:
X_test = spacy_prep(X_test)
X_test_matrix = count_vect.transform(X_test["text_prep"])

In [None]:
y_test_bow = bow_sa_classifier.predict(X_test_matrix)

In [None]:
accuracy_score(y_test, y_test_bow)

Let's look at the confusion matrix.

In [None]:
cm = confusion_matrix(y_test, y_test_bow)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["neg", "pos"])
disp.plot()
plt.show()

The performance is impressive. In addition, we can inspect the coefficients of our logistic regression model to understand how it makes predictions.

In [None]:
coeffs = bow_sa_classifier.coef_[0].tolist()
words = count_vect.get_feature_names_out()
words_with_coeffs = pd.DataFrame(coeffs, words, columns=["coeff"])

In [None]:
words_with_coeffs.sort_values("coeff", ascending=False).head(20)

In [None]:
words_with_coeffs.sort_values("coeff", ascending=True).head(20)
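
The coefficients turn into a prediction as follows: the model adds the intercept to the sum of each word's count times its coefficient, and passes the result through the logistic (sigmoid) function to obtain a probability. In scikit-learn's binary case, the coefficients refer to the second entry of `classes_`, which with alphabetical ordering should be `positive` here. The sketch below uses made-up coefficients and an invented intercept. The names `toy_coeffs`, `toy_intercept`, and `predict_proba_positive` are assumptions for illustration.

```python
import math

# Hypothetical coefficients and intercept (NOT the fitted values from this notebook)
toy_coeffs = {"great": 2.0, "thank": 1.5, "delay": -1.0, "cancel": -2.0}
toy_intercept = -0.5

def predict_proba_positive(text_prep):
    """Probability of the positive class for a preprocessed text."""
    # Summing over tokens equals count * coefficient for repeated words
    z = toy_intercept + sum(toy_coeffs.get(word, 0.0) for word in text_prep.split())
    return 1.0 / (1.0 + math.exp(-z))

for text in ["thank for the great flight", "flight cancel and delay"]:
    print(text, "->", round(predict_proba_positive(text), 3))
```

Words with a coefficient of zero, which the L1 penalty encourages, have no influence on the prediction.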

## 3rd Generation 👧 - Neural Sentiment Analysis with Pre-trained BERT

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/transfer_learning.png"/><br></center>

Lastly, we will try a pre-trained BERT sentiment analysis model on our data. The model has been pre-trained on Wikipedia and BookCorpus using a language modeling task and then fine-tuned for sentiment analysis on more than 200,000 labeled sentences from movie reviews.

The [Huggingface](https://huggingface.co/) model hub is a great place to find pre-trained models.

In [None]:
transformer_sa_classifier = pipeline("sentiment-analysis")

As the model is already trained, we can directly apply it to our test data and assess its predictive accuracy.

In [None]:
results = transformer_sa_classifier(X_test["text"].to_list())

y_test_transformer = []
for result in results:
    # Pipeline labels are uppercase; lowercase them to match the ground truth
    y_test_transformer.append(result["label"].lower())
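
The pipeline returns one dictionary per input text, holding a `label` and a `score`. The sketch below mimics that format with made-up values to show the label conversion in isolation, and additionally flags low-confidence predictions. The `mock_results` values and the `CONFIDENCE_CUTOFF` are assumptions for illustration, not output from the real model.

```python
# Hypothetical results in the list-of-dicts format the pipeline returns
mock_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.55},
]

# Lowercase the labels so they match the ground-truth classes
labels = [r["label"].lower() for r in mock_results]

# Flag predictions whose score falls below a hypothetical cutoff
CONFIDENCE_CUTOFF = 0.6
uncertain = [i for i, r in enumerate(mock_results) if r["score"] < CONFIDENCE_CUTOFF]

print(labels)
print(uncertain)
```

Reading a few of the flagged tweets by hand is a quick way to see where a model trained on movie reviews struggles with airline language.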

In [None]:
accuracy_score(y_test, y_test_transformer)

And the confusion matrix...

In [None]:
cm = confusion_matrix(y_test, y_test_transformer)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["neg", "pos"])
disp.plot()
plt.show()

Not bad, given that this model has been trained on 🎥 reviews and we applied it to 🛫 tweets.