# Balancing & Sampling

To train a deep learning model that predicts accurately, it is often not enough to train it on the complete dataset as it is. It is just as important to account for potential biases in the underlying data and to train the model on a more balanced dataset.

Because most Members of the European Parliament (MEPs) belong to conservative, left-wing, environmentalist or liberal parties, most speeches in the European Parliament are given not by right-wing populists but by politicians from a variety of other ideological backgrounds. Our goal is to train a deep learning classifier that predicts whether a given speech was given by a right-wing populist MEP, so training it on the original, unbalanced data would likely not yield satisfying results. Since only about 15% of the speeches in our dataset were given by right-wing populists, a model trained on the unbalanced data could reach an accuracy of about 85% simply by predicting that no speech was given by a right-wing populist MEP.
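
The accuracy problem described above can be illustrated with a small sketch. The labels here are made up (1,000 hypothetical speeches with the roughly 15% share quoted in the text), and the "classifier" is a stand-in that always predicts the majority class, not our actual model.

```python
import numpy as np

# Hypothetical labels: 15% far-right (1) and 85% other (0), mirroring the share quoted above.
y_true = np.array([1] * 150 + [0] * 850)

# A stand-in "classifier" that always predicts the majority class (0).
y_pred = np.zeros(len(y_true), dtype=int)

# Accuracy: share of predictions that match the true label.
accuracy = np.mean(y_pred == y_true)

# Recall on the far-right class: share of true far-right speeches that were found.
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy: {accuracy:.2f}, recall on far_right = 1: {recall:.2f}")
# prints: accuracy: 0.85, recall on far_right = 1: 0.00
```

The accuracy looks respectable, yet not a single far-right speech is identified, which is why accuracy alone is a poor guide on unbalanced data.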

To avoid such an outcome, we instead train the model on more balanced data. In this notebook, we balance our dataset using three different approaches: downsampling, upsampling, and synthetic sampling. To reproduce the results of our project, we recommend running the code below to retrace our steps and understand the conceptual differences between the three approaches before moving on to the notebooks that train our BERT classifier and perform topic modeling.

## Preparing the Dataset

In the first few cells of code, we load the required packages as well as the dataset produced by the cleaning and preprocessing steps in the [`data_cleaning` notebook](data_cleaning_preprocessing.ipynb). Please make sure to read in the cleaned and preprocessed data here. After reading in the data frame, we split the data into a training set and a test set, because we do **not** want to balance the test data, only the training set. If you want to keep the test set, you can save it as a `.csv` file by uncommenting the relevant line of code below.

In [1]:
#load packages
import pandas as pd
import numpy as np
import os
import random
import nltk
from sklearn.model_selection import train_test_split

In [2]:
#read in data frame
path = os.getcwd()
data = pd.read_csv(f"{path}/final_data.csv")

In [3]:
#split data into train and test sets
X = data[['contribution_text']]
y = data['far_right']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

#join X_train and y_train
data = X_train.join(y_train)

#join X_test and y_test
test = X_test.join(y_test)

In [4]:
#uncomment the following line to save test set to csv file
#test.to_csv(f"{path}/test.csv")
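
The split above is random on every run and does not control the class ratio in each part. As a hedged variant, the sketch below uses two optional `train_test_split` arguments: `stratify`, which keeps the class ratio equal in both parts, and `random_state`, which makes the split repeatable. The toy data frame and the seed value 42 are assumptions for illustration only; this is not the split we used for our results.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned data: column names match the notebook, rows are made up.
toy = pd.DataFrame({
    "contribution_text": [f"speech {i}" for i in range(100)],
    "far_right": [1] * 20 + [0] * 80,
})

# stratify keeps the 20/80 class ratio in both parts;
# random_state (an arbitrary value) fixes the shuffle so the split is repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["far_right"], random_state=42
)

# Share of far-right speeches in each part.
print(toy_train["far_right"].mean(), toy_test["far_right"].mean())
```

With a stratified split, the untouched test set reflects the same class imbalance as the full dataset, which is the distribution the model is ultimately evaluated on.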

## Downsampling

One approach to balancing a dataset is to downsample it. After performing a value count of the target variable, `far_right`, we see that there are roughly 17,500 speeches given by right-wing populists in the training data. To balance the data, we take a random sample of roughly 35,000 speeches where `far_right` = 0 (twice the number of far-right speeches) and create a new data frame in which 1/3 of the speeches come from far-right MEPs and 2/3 do not. This way, we preserve the fact that the majority of speeches are not given by far-right Members of the European Parliament, while discouraging the model from simply predicting 0 across the board. You can save the resulting data frame as a `.csv` file by uncommenting the relevant line of code below.

In [5]:
#split data into two groups based on y variable
category_0 = data[data['far_right'] == 0]
category_1 = data[data['far_right'] == 1]
print(f"The number of speeches given by non-far-right MEPs is {len(category_0)}, while the number of speeches given by far-right MEPs is {len(category_1)}.")

The number of speeches given by non-far-right MEPs is 82339, while the number of speeches given by far-right MEPs is 17520.


In [6]:
#reduce number of speeches where far_right = 0 -- create a dataset with 1/3 far_right = 1, 2/3 far_right = 0
cat_0_sample = category_0.sample(len(category_1)*2)
d_sample = pd.concat([cat_0_sample, category_1], axis=0)
d_sample['far_right'].value_counts()

0    35040
1    17520
Name: far_right, dtype: int64

In [7]:
#uncomment the following line to save downsampled data to csv file
#d_sample.to_csv(f"{path}/downsample.csv")
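
The downsampling step can also be wrapped in a small reusable function. This is a minimal sketch under stated assumptions: the function name `downsample`, its `ratio` and `seed` parameters, and the toy data are all hypothetical and not part of the notebook.

```python
import pandas as pd

def downsample(df, target="far_right", ratio=2, seed=0):
    """Keep all minority rows and draw `ratio` majority rows per minority row."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]
    # Never ask for more majority rows than exist.
    n_majority = min(len(majority), ratio * len(minority))
    sampled = majority.sample(n=n_majority, random_state=seed)
    # Shuffle so the two classes are not stored in separate blocks.
    return pd.concat([sampled, minority]).sample(frac=1, random_state=seed)

# Toy data: 10 far-right speeches and 90 others.
toy = pd.DataFrame({
    "contribution_text": [f"speech {i}" for i in range(100)],
    "far_right": [1] * 10 + [0] * 90,
})

balanced = downsample(toy)
# 20 rows with far_right = 0 and 10 rows with far_right = 1
print(balanced["far_right"].value_counts())
```

Passing a fixed `seed` makes the random draw repeatable, which the cell above does not guarantee.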

## Upsampling 

Upsampling is another approach to balancing a dataset. Instead of removing observations of the majority category from the data frame, observations from the minority category are duplicated in order to create a better balance in the target variable. The resulting data frame contains 70,080 speeches given by far-right MEPs (each original far-right speech appears four times) and 82,339 speeches given by MEPs from other ideological backgrounds. Again, you can save the resulting data frame as a `.csv` file by uncommenting the relevant line of code below.

In [8]:
#duplicate the far-right speeches: two copies first, then doubled again to four copies of each
cat_1_usample = pd.concat([category_1, category_1], axis = 0)
u_sample_1 = pd.concat([cat_1_usample, cat_1_usample])

In [9]:
u_sample = pd.concat([u_sample_1, category_0], axis = 0)
u_sample['far_right'].value_counts()

0    82339
1    70080
Name: far_right, dtype: int64

In [10]:
#uncomment the following line to save upsampled data frame to csv file
#u_sample.to_csv(f"{path}/upsample.csv")
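
A common variant of upsampling draws minority rows at random with replacement instead of copying every row a fixed number of times. The sketch below shows that variant, which is not the exact duplication used in the cells above. The function name `upsample`, its `factor` and `seed` parameters, and the toy data are hypothetical.

```python
import pandas as pd

def upsample(df, target="far_right", factor=4, seed=0):
    """Grow the minority class to `factor` times its size by sampling with replacement."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]
    # replace=True allows the same speech to be drawn more than once.
    resampled = minority.sample(n=factor * len(minority), replace=True, random_state=seed)
    # Shuffle and renumber the rows, since duplicated rows share index labels.
    combined = pd.concat([majority, resampled]).sample(frac=1, random_state=seed)
    return combined.reset_index(drop=True)

# Toy data: 10 far-right speeches and 90 others.
toy = pd.DataFrame({
    "contribution_text": [f"speech {i}" for i in range(100)],
    "far_right": [1] * 10 + [0] * 90,
})

upsampled = upsample(toy)
# 90 rows with far_right = 0 and 40 rows with far_right = 1
print(upsampled["far_right"].value_counts())
```

Either way, duplicates should only ever be created within the training set, as in this notebook; duplicating before the split would let copies of the same speech land in both the training and the test data.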

## Synthetic Sampling

The final approach to balancing the dataset is synthetic sampling. We split each far-right contribution into individual sentences, which are shorter and far more plentiful than the speeches themselves, and reassemble randomly drawn sentences into new observations for which `far_right` = 1. Keep in mind that this is done only to train the model, not to test its performance. Although the synthesized observations lose the context and metadata of the original speeches, this affects only the training data; the model is still evaluated on real speeches it has not seen before.

In [11]:
#get a list of all sentences in the far-right speeches of the training set (sent_tokenize needs NLTK's Punkt tokenizer data to be installed)
list_of_sentences = []

for i in category_1.contribution_text:
    k = nltk.tokenize.sent_tokenize(i)
    for j in k:
        list_of_sentences.append(j)

In [12]:
#check how many sentences are present in the data set
len(list_of_sentences)

185870

In [13]:
#check average number of sentences per contribution
len(list_of_sentences) / len(category_1)

10.609018264840183

In [14]:
#check average number of characters per contribution
avg_length = np.mean(category_1.contribution_text.map(len))
avg_length

1478.615011415525

In [15]:
#randomize list of sentences
random_list_of_sentences = random.sample(list_of_sentences, len(list_of_sentences))

In [16]:
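#create a dictionary that reassembles sentences and synthesizes new observations -- triple number of obs where far_right = 1

synth_dict = {}
list_of_charcounts = []
#highest valid index into the sentence list (indexing by len(category_1) would only ever reach the first 17,520 sentences)
max_idx = len(random_list_of_sentences) - 1

for i in range(len(category_1) * 3):
    sample = []
    char_count = 0
    #draw ten random sentences
    for j in range(10):
        rand_idx = random.randint(0, max_idx)
        sentence = random_list_of_sentences[rand_idx]
        sample.append(sentence)
        char_count += len(sentence)
    #add one more sentence if the observation is still shorter than the average speech
    if char_count < avg_length:
        rand_idx = random.randint(0, max_idx)
        sample.append(random_list_of_sentences[rand_idx])
        char_count += len(random_list_of_sentences[rand_idx])
    list_of_charcounts.append(char_count)
    #join the sentences into one string so the column holds plain text like the original speeches
    synth_dict[i] = [" ".join(sample)]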

In [17]:
#check difference in average number of characters per observation
avg_length - np.mean(list_of_charcounts)

12.702245053272463

In [18]:
#create dataframe based on the synth sample dictionary
s_sample_1 = pd.DataFrame.from_dict(synth_dict).T
s_sample_1['far_right'] = 1
s_sample_1 = s_sample_1.rename(columns = {0: 'contribution_text'})

In [19]:
#keep only the two columns shared with the synthesized data so the frames can be concatenated
cat_0_short = category_0[['contribution_text', 'far_right']]

In [20]:
#concatenate newly synthesized rows with previously existing rows where far_right = 0, shuffle and reset index
s_sample = pd.concat([cat_0_short, s_sample_1], axis = 0)
s_sample = s_sample.sample(frac = 1)
s_sample.reset_index(drop=True, inplace = True)

In [21]:
#check value counts of far_right column
s_sample['far_right'].value_counts()

0    82339
1    52560
Name: far_right, dtype: int64

In [22]:
#uncomment the following line to save synthetically sampled data frame to csv file
#s_sample.to_csv(f"{path}/synthsample.csv")
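
The core idea of the synthetic sampling cells can be condensed into one function. This is a simplified sketch, not the notebook's exact procedure: it uses a naive regular-expression sentence splitter instead of `nltk.tokenize.sent_tokenize` to stay dependency-free, and it omits the step that adds an extra sentence to short observations. The function name `synthesize`, its parameters, and the toy speeches are hypothetical.

```python
import random
import re

def synthesize(texts, n_new, sentences_per_obs=10, seed=0):
    """Build n_new synthetic texts by joining randomly drawn sentences from texts."""
    rng = random.Random(seed)
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [
        s
        for text in texts
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s
    ]
    # Each new observation is a random draw (with repetition) of sentences, joined by spaces.
    return [
        " ".join(rng.choice(sentences) for _ in range(sentences_per_obs))
        for _ in range(n_new)
    ]

# Two made-up speeches standing in for the far-right contributions.
toy_speeches = [
    "We must act now. The budget is too large. Thank you.",
    "Mr President, I disagree. This report ignores the facts!",
]

# Triple the number of observations, as in the notebook.
synthetic = synthesize(toy_speeches, n_new=3 * len(toy_speeches), sentences_per_obs=3)
print(len(synthetic))
print(synthetic[0])
```

Because sentences are drawn independently, a synthetic observation mixes sentences from different speeches and speakers, which is the loss of context noted at the start of this section.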