# BERTopic Modeling

What interests us most in this project is the content of speeches that were given by Members of the European Parliament (MEPs) who are not right-wing populists, but that our deep learning model classified as given by a far-right MEP. What do these speeches contain? Do they exhibit right-wing talking points, or do they pertain to topics that are discursively dominated by the far right?


In the following, we perform BERTopic modeling on the subset of speeches that are false positives (i.e. where `y` = 0 / `far_right` = 0 but `y_pred` = 1) to gain a better understanding of their contents. Because topic modeling is computationally intensive, we recommend running this notebook on [Google Colab](https://colab.research.google.com/). It is also crucial to perform the relevant cleaning and preprocessing steps first; the code for this can be found in the [`data_cleaning`](../data_cleaning) folder of this repository. Finally, to obtain the dataset containing all speeches labeled as false positives, you need to run the code in [`BERT_classifier.ipynb`](BERT_classifier.ipynb) before running this notebook.
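
The four possible outcomes for a single speech can be sketched as follows. This is a minimal illustration, not the code used in `BERT_classifier.ipynb`: the helper `error_type` and the example pairs are hypothetical, and only the meaning of `y` and `y_pred` is taken from the description above.

```python
def error_type(y: int, y_pred: int) -> str:
    """Return the confusion-matrix cell for one speech (1 = far-right, 0 = not)."""
    if y_pred == 1:
        return "true_positive" if y == 1 else "false_positive"
    return "false_negative" if y == 1 else "true_negative"


# Hypothetical (true label, predicted label) pairs, for illustration only
pairs = [(0, 1), (1, 1), (1, 0), (0, 0)]
labels = [error_type(y, y_pred) for y, y_pred in pairs]
print(labels)
```

The first pair, `(0, 1)`, is the case this notebook focuses on: a speech by a non-far-right MEP that the model predicted to be far-right.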

## Setup

Run the following few code cells to load all required packages and mount your Google Drive to this notebook.

In [None]:
#uncomment the next line to install bertopic visualizations if you haven't done so yet
#!pip install bertopic[visualization] --quiet

In [None]:
#import packages
import numpy as np
import pandas as pd
import os
import random
from copy import deepcopy
from bertopic import BERTopic
from google.colab import drive

In [None]:
#mount google drive if using colab
drive.mount('/content/gdrive')

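#set seed for Python's built-in random module
#note: this alone does not make BERTopic deterministic, since its default UMAP step is stochastic
random.seed(42)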

In [None]:
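#define cwd (assumes the CSV files are in the current working directory;
#on Colab this is /content unless you change it, e.g. with os.chdir)
cwd = os.getcwd()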

#load data
fp = pd.read_csv(f'{cwd}/false_positives.csv', index_col = [0])
fn = pd.read_csv(f'{cwd}/false_negatives.csv', index_col = [0])
full = pd.read_csv(f'{cwd}/clean_fullsample_test.csv', index_col = [0])

## Topic Modeling on False Positives

The code below performs topic modeling on all speeches that our deep learning model classified as given by right-wing populists when in reality they were not. We first create a list of documents (i.e. speeches), then fit BERTopic on that list and assign each speech to a topic. Finally, we visualize the resulting topics on an intertopic distance map, which also lets us explore individual topics by hovering over them.

In [None]:
#create list of contribution texts used for topic modeling
docs_fp = list(fp.loc[:, "contribution_text"].values)

In [None]:
#specify the model
model_fp = BERTopic(language="english")

In [None]:
#fit the model on the data and assign a topic to each speech
topics_fp, probs_fp = model_fp.fit_transform(docs_fp)

In [None]:
#display topic frequency
model_fp.get_topic_freq()

In [None]:
#visualize topic modeling, create intertopic distance map
model_fp.visualize_topics()
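
To make the frequency table easier to read, the sketch below counts documents per topic from a plain list of topic assignments, such as the first value returned by `fit_transform`. The helper `topic_frequencies` and the example assignments are hypothetical; the one BERTopic-specific fact it relies on is that topic `-1` collects outlier documents that were not assigned to any cluster.

```python
from collections import Counter


def topic_frequencies(topics):
    """Count documents per topic, most frequent topic first."""
    return Counter(topics).most_common()


# Hypothetical topic assignments, for illustration only
example_topics = [0, 1, -1, 0, 2, 0, -1, 1]

freq = topic_frequencies(example_topics)

# Share of documents that ended up in the outlier topic (-1)
outlier_share = example_topics.count(-1) / len(example_topics)

print(freq)
print(outlier_share)
```

A large outlier share means that many speeches could not be placed in any topic, which is worth keeping in mind when reading the intertopic distance map.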

## Topic Modeling on False Negatives

Do the topics in our false positives differ noticeably from those in speeches that were wrongly classified as not far-right? To check whether our findings from above are specific to false positives, we repeat the same steps using the false negatives dataset.

In [None]:
#create list of contribution texts used for topic modeling
docs_fn = list(fn.loc[:, "contribution_text"].values)

In [None]:
#specify the model
model_fn = BERTopic(language="english")

In [None]:
#fit and transform the model on the data
topics_fn, probs_fn = model_fn.fit_transform(docs_fn)

In [None]:
#display topic frequency
model_fn.get_topic_freq()

In [None]:
#visualize topic modeling, create intertopic distance map
model_fn.visualize_topics()

## Topic Modeling on Actual Far-Right Speeches

Finally, we are also interested in which topics actual right-wing populists in the European Parliament discuss in their speeches, and in particular in how much these topics overlap with those of the false positives. To obtain these insights, we perform the same steps once more, this time on all speeches in our dataset that were given by right-wing populist MEPs, regardless of how the model classified them.

In [None]:
#create dataset of all speeches actually given by far-right MEPs (regardless of the model's prediction)
tp = full[full['far_right'] == 1]

In [None]:
#create list of contribution texts used for topic modeling
docs_tp = list(tp.loc[:, "contribution_text"].values)

In [None]:
#specify the model
model_tp = BERTopic(language="english")

In [None]:
#fit and transform the model on the data
topics_tp, probs_tp = model_tp.fit_transform(docs_tp)

In [None]:
#display topic frequency
model_tp.get_topic_freq()

In [None]:
#visualize topic modeling, create intertopic distance map
model_tp.visualize_topics()
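
One simple way to quantify the overlap between two topics is the Jaccard similarity of their top words. The sketch below is an illustration under stated assumptions, not part of the original analysis: the helper `jaccard` and the word lists are hypothetical. With fitted models, the word lists could be taken from `get_topic`, which returns `(word, score)` pairs for a given topic.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity of two word collections: |A & B| / |A | B|."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0  # define the similarity of two empty sets as 0
    return len(a & b) / len(a | b)


# Hypothetical top words, for illustration only. With fitted models they could
# come from e.g. [word for word, _ in model_fp.get_topic(0)]
fp_topic = ["migration", "border", "asylum", "security"]
tp_topic = ["migration", "border", "sovereignty", "asylum"]
unrelated_topic = ["budget", "agriculture", "subsidy", "farmers"]

overlap_related = jaccard(fp_topic, tp_topic)
overlap_unrelated = jaccard(fp_topic, unrelated_topic)

print(overlap_related, overlap_unrelated)
```

Values close to 1 indicate that two topics share most of their top words, while values close to 0 indicate little lexical overlap. Comparing every false positive topic with every far-right topic in this way would give a rough overlap matrix.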