# Airline Tweets - Topic Modeling
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load the required packages and suppress warnings to keep the notebook output readable.

In [None]:
# Install packages that are not already installed on Colab
#!pip install pyLDAvis

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import spacy
from gensim.models import LdaModel
from gensim import corpora

import pyLDAvis
# Note: recent pyLDAvis releases (3.x) expose this module as pyLDAvis.gensim_models
import pyLDAvis.gensim

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression



## Problem description

We have a collection of 10,000 tweets directed at airlines in the US. The dataset originally came from Crowdflower's Data for Everyone library (discontinued). The data was collected in February 2015, and multiple human annotators were asked to classify the tweets into the classes `positive` and `negative`.

## Load data

In [None]:
tweets = pd.read_csv("https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/airline_tweets/data/airlinetweets.csv")
tweets.head()

## Prepare data

Perform the typical splits into features and labels and training and test sets.

In [None]:
X = tweets[["tweet_id", "airline", "text"]]
y = tweets[["sentiment_groundtruth"]]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Perform standard text preprocessing steps, such as lemmatization, stop word removal, and lowercasing.

In [None]:
# Load the small English pipeline; NER and the parser are not needed here
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def spacy_prep(dataset):
    """Add a `text_prep` column with the lowercased lemmas of all alphabetic, non-stop-word tokens."""
    records = dataset.to_dict("records")
    for entry in records:
        doc = nlp(entry["text"])
        tokens_to_keep = [
            token.lemma_.lower()
            for token in doc
            if token.is_alpha and not token.is_stop
        ]
        entry["text_prep"] = " ".join(tokens_to_keep)
    return pd.DataFrame(records)
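
To make the filtering logic in `spacy_prep` easier to follow, here is a minimal sketch that applies the same three filters (lowercasing, alphabetic tokens only, stop word removal) without spaCy. It is an illustration under simplifying assumptions: the function `toy_prep` and the stop word list `TOY_STOP_WORDS` are made up for this example, tokens are split on whitespace only, and no lemmatization takes place.

```python
# Hypothetical, deliberately tiny stop word list; spaCy's built-in list is much longer.
TOY_STOP_WORDS = {"the", "a", "an", "is", "was", "to", "my", "and", "for", "i"}

def toy_prep(text):
    """Lowercase, keep purely alphabetic tokens, and drop stop words (no lemmatization)."""
    tokens = text.lower().split()
    kept = [tok for tok in tokens if tok.isalpha() and tok not in TOY_STOP_WORDS]
    return " ".join(kept)

print(toy_prep("The flight was delayed for 3 hours and my bag is lost"))
# flight delayed hours bag lost
```

Note that the number `3` is dropped by the alphabetic filter, just as `token.is_alpha` would drop it in the spaCy version.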

In [None]:
X_train = spacy_prep(X_train)
X_test = spacy_prep(X_test)

In [None]:
X_train.head()

## Topic modeling

The `gensim` topic modeling library expects its input as a list of lists, where each inner list contains the tokens of one document.

In [None]:
corpus_train = [doc.split() for doc in X_train["text_prep"].to_numpy()]
corpus_test = [doc.split() for doc in X_test["text_prep"].to_numpy()]

In [None]:
corpus_train[0:3]

Next, we have to construct a dictionary comprising all the tokens in the corpus.

In [None]:
dictionary = corpora.Dictionary(corpus_train, prune_at=10000)
len(dictionary)

The dictionary is now used to convert the tokenized documents into a document-term matrix (bag-of-words) representation.

In [None]:
doc_term_matrix_train = [dictionary.doc2bow(doc) for doc in corpus_train]
doc_term_matrix_test = [dictionary.doc2bow(doc) for doc in corpus_test]
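
The following sketch shows, in plain Python, what the two steps above roughly amount to: mapping each distinct token to an integer id and then representing each document as a list of `(token_id, count)` pairs. It is a toy illustration, not gensim's implementation; the helper names `build_vocab` and `to_bow` are made up, and the ids gensim assigns may differ. One detail worth noticing is that tokens which do not occur in the training vocabulary are silently skipped, which is also what happens to unseen test tokens in `doc2bow`.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the vocabulary are skipped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_train = [["flight", "delay", "flight"], ["bag", "lose", "delay"]]
toy_vocab = build_vocab(toy_train)
print(toy_vocab)  # {'flight': 0, 'delay': 1, 'bag': 2, 'lose': 3}

# "upgrade" never appeared in the training documents, so it is dropped
print(to_bow(["flight", "flight", "upgrade"], toy_vocab))  # [(0, 2)]
```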

Finally, we can train the LDA model by passing the document-term matrix and the dictionary to the `LdaModel` class. The most important parameter is `num_topics`, the number of topics we want to extract.

In [None]:
model_lda = LdaModel(doc_term_matrix_train, num_topics=30, minimum_probability=0.0, id2word = dictionary, random_state=42)

The following loop extracts, for each topic, its 10 most likely words.

In [None]:
# Build one small frame per topic and concatenate them once at the end
topic_frames = []
for i in range(model_lda.num_topics):
    this_term_dist = pd.DataFrame(model_lda.get_topic_terms(i, topn=10), columns=["word_id", "prob"])
    this_term_dist["word"] = this_term_dist["word_id"].apply(lambda x: dictionary[x])
    this_term_dist["topic_id"] = i
    topic_frames.append(this_term_dist)
per_topic_term_dist = pd.concat(topic_frames, ignore_index=True)

In [None]:
per_topic_term_dist

Let's look at some topics and their most likely words. 

In [None]:
per_topic_term_dist[per_topic_term_dist["topic_id"] == 1].sort_values(by="prob", ascending=False)

In [None]:
per_topic_term_dist[per_topic_term_dist["topic_id"] == 2].sort_values(by="prob", ascending=False)

We can infer the topic distribution of a single new document by passing the document to the trained LDA model.

In [None]:
X_test["text"].iloc[666]

In [None]:
doc_topic_vector = model_lda[doc_term_matrix_test[666]]
doc_topic_vector = pd.DataFrame(doc_topic_vector, columns=["topic_id", "prob"])
doc_topic_vector

Plot the topic distribution of this document as a bar chart.

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x="topic_id", y="prob", data=doc_topic_vector)

Topic 24 is the most likely topic for this document.

In [None]:
per_topic_term_dist[per_topic_term_dist["topic_id"] == 24].sort_values(by="prob", ascending=False)
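
Instead of reading the most likely topic off the bar chart, we can also rank the `(topic_id, prob)` pairs directly. The sketch below assumes the topic distribution is available as a list of such pairs; the helper `top_topics` and the numbers in `toy_doc_topics` are made up for illustration and are not output of the model above.

```python
def top_topics(topic_probs, n=3):
    """Return the n (topic_id, prob) pairs with the highest probability."""
    return sorted(topic_probs, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up topic distribution over four topics
toy_doc_topics = [(0, 0.05), (1, 0.10), (2, 0.60), (3, 0.25)]
print(top_topics(toy_doc_topics, n=2))  # [(2, 0.6), (3, 0.25)]
```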

We can also use the `pyLDAvis` package to visualize this topic model.

In [None]:
vis = pyLDAvis.gensim.prepare(topic_model=model_lda, 
                              corpus=doc_term_matrix_train, 
                              dictionary=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

## Predict sentiment from topics

Finally, we can use the per-document topic distribution as features for training a classifier to predict the sentiment of the tweets.

First, we extract the topic distributions of all documents in the training and test set and store them in dataframes.

In [None]:
# Collect one small frame per document and concatenate once at the end,
# which avoids repeatedly copying a growing dataframe inside the loop
train_frames = []
for j, bow in enumerate(doc_term_matrix_train):
    doc_topic_vector = pd.DataFrame(model_lda[bow], columns=["topic_id", "prob"])
    doc_topic_vector["doc_id"] = j
    train_frames.append(doc_topic_vector)

# Reshape from long format (one row per document-topic pair) to wide format (one row per document)
X_train_doc_topic_vectors = pd.concat(train_frames).pivot(columns="topic_id", values="prob", index="doc_id")


In [None]:
X_train_doc_topic_vectors.shape

In [None]:
# Same procedure for the test set
test_frames = []
for j, bow in enumerate(doc_term_matrix_test):
    doc_topic_vector = pd.DataFrame(model_lda[bow], columns=["topic_id", "prob"])
    doc_topic_vector["doc_id"] = j
    test_frames.append(doc_topic_vector)

# Reshape from long format (one row per document-topic pair) to wide format (one row per document)
X_test_doc_topic_vectors = pd.concat(test_frames).pivot(columns="topic_id", values="prob", index="doc_id")


In [None]:
X_test_doc_topic_vectors.shape
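
The pivot step turns sparse `(topic_id, prob)` pairs into one fixed-length row per document. The sketch below shows the same idea in plain Python, with one difference: topics that are absent for a document are filled with `0.0`. With `pivot`, such gaps would show up as `NaN`, which scikit-learn's `LogisticRegression` does not accept, so if the shapes above look fine but fitting fails, calling `fillna(0)` on the pivoted dataframes may be worth trying. The helper `to_dense` and the toy data are assumptions made for this example.

```python
def to_dense(topic_probs, num_topics):
    """Expand sparse (topic_id, prob) pairs into a fixed-length list, using 0.0 for absent topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        dense[topic_id] = prob
    return dense

# Made-up sparse outputs for two documents and four topics
toy_sparse = [[(0, 0.7), (2, 0.3)], [(1, 0.5), (2, 0.25), (3, 0.25)]]
toy_matrix = [to_dense(doc, num_topics=4) for doc in toy_sparse]
print(toy_matrix)  # [[0.7, 0.0, 0.3, 0.0], [0.0, 0.5, 0.25, 0.25]]
```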

Next, like we did for the bag-of-words model, we train a logistic regression model on the topic distributions and evaluate its performance.

In [None]:
tm_sa_classifier = LogisticRegression(max_iter=1000, penalty="l1", solver="liblinear")
tm_sa_classifier.fit(X_train_doc_topic_vectors, np.ravel(y_train))

In [None]:
pred = tm_sa_classifier.predict(X_test_doc_topic_vectors)

In [None]:
accuracy_score(y_test, pred)
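
An accuracy value is easier to interpret next to a simple reference point. One common choice is the accuracy obtained by always predicting the most frequent training label. The sketch below computes both numbers in plain Python; the helper names and the toy labels are made up for illustration and say nothing about the actual class balance of the airline tweets.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_train_labels, y_test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train_labels).most_common(1)[0][0]
    return accuracy(y_test_labels, [majority_label] * len(y_test_labels))

# Made-up labels for illustration only
toy_train_labels = ["negative", "negative", "negative", "positive"]
toy_test_labels = ["negative", "positive", "negative", "negative"]
toy_pred = ["negative", "negative", "negative", "positive"]

print(accuracy(toy_test_labels, toy_pred))                            # 0.5
print(majority_baseline_accuracy(toy_train_labels, toy_test_labels))  # 0.75
```

In this toy case the classifier does worse than the baseline, which is exactly the kind of situation a plain accuracy score would not reveal on its own.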