# Ideas behind Sampling

In this notebook, we try three different approaches to creating a training set for our deep learning model. First, we downsample the majority class to counter the class imbalance in the data. Second, we upsample, essentially creating copies of the observations in the minority class. Finally, we synthesize a new training set by splitting each minority-class observation into sentences and recombining randomly drawn sentences into new observations.

In [1]:
#load packages
import pandas as pd
import numpy as np
import os
import random
import nltk
from sklearn.model_selection import train_test_split

In [4]:
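#read in data frame
#default to the current working directory
path = f"{os.getcwd()}"
#override with the local absolute path to the cleaned data; adjust for your machine
path = "/Users/konratpekkip/Desktop/Detecting-Far-Right-Talking-Points/raw_data/clean_data"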
data = pd.read_csv(f"{path}/final_data.csv", index_col = [0])

In [5]:
data

Unnamed: 0,contribution,chapter,chapter_title,contribution_id,speaker_name_lower,contribution_text,date,party,language,far_right
0,"KER ID=""141"" NAME=""Jean-Paul Gauzès"" AFFILIATI...","""009-06""",6. Innovative financing at a global and Europe...,"11-03-08-""141""",jean-paul gauzès,"(FR) Madam President, I regret that there is s...",11-03-08,PPE,,0
1,"KER ID=""142"" NAME=""Martin Schulz"" AFFILIATION=...","""009-06""",6. Innovative financing at a global and Europe...,"11-03-08-""142""",martin schulz,"(DE) Madam President, I can tell Mr Gauzès thi...",11-03-08,S&D,,0
2,"KER ID=""143"" NAME=""Rodi Kratsa-Tsagaropoulou"" ...","""009-06""",6. Innovative financing at a global and Europe...,"11-03-08-""143""",rodi kratsa-tsagaropoulou,"(EL) Madam President, yesterday in plenary, th...",11-03-08,PPE,,0
3,"KER ID=""183"" NAME=""Sepp Kusstatscher"" AFFILIAT...","""011""",Development of the Community's railways Certi...,"07-01-17-""183""",sepp kusstatscher,"- (DE) Mr President, ladies and gentlemen, I a...",07-01-17,Verts/ALE,,0
4,"KER ID=""184"" NAME=""Pedro Guerreiro"" AFFILIATIO...","""011""",Development of the Community's railways Certi...,"07-01-17-""184""",pedro guerreiro,"(PT) Once again, the majority in Parliament is...",07-01-17,GUE/NGL,,0
...,...,...,...,...,...,...,...,...,...,...
142651,"KER ID=244 NAME=""Gillis"">\nMadam President, I ...",10,Marketing of seeds - Implementation of Regulat...,98-05-14-244,gillis,"Madam President, I would first of all like to ...",98-05-14,"[""European People's Party"", ""European People's...",,0
142652,"KER ID=245 LANGUAGE=""DA"" NAME=""Iversen"">\nMada...",10,Marketing of seeds - Implementation of Regulat...,98-05-14-245,iversen,"Madam President, I would like to begin by than...",98-05-14,"['Socialist Group', 'Party of European Sociali...",DA,0
142653,"KER ID=246 LANGUAGE=""SV"" NAME=""Wibe"">\nMadam P...",10,Marketing of seeds - Implementation of Regulat...,98-05-14-246,wibe,"Madam President, in Amendment No 1 the rapport...",98-05-14,"['Socialist Group', 'Party of European Sociali...",SV,0
142654,"KER ID=249 NAME=""Graefe zu Baringdorf"">\nMadam...",10,Marketing of seeds - Implementation of Regulat...,98-05-14-249,graefe zu baringdorf,"Madam President, I should like briefly to pick...",98-05-14,"['Green Group', 'Greens/European Free Alliance']",,0


In [6]:
#split data into train and test sets
X = data[['contribution_text']]
y = data['far_right']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
train = X_train.join(y_train)
test = X_test.join(y_test)
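
The split above is unseeded and not stratified, so the class counts printed in the cells below will vary slightly from run to run. A minimal sketch, assuming a toy data frame that only borrows the two column names used here, of how the standard `stratify` and `random_state` parameters of `train_test_split` could keep the class ratio identical in both splits and make the split repeatable:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 80 negatives, 20 positives (hypothetical values)
toy = pd.DataFrame({
    "contribution_text": [f"speech {i}" for i in range(100)],
    "far_right": [0] * 80 + [1] * 20,
})

X_toy = toy[["contribution_text"]]
y_toy = toy["far_right"]

# stratify=y_toy keeps the 80/20 class ratio in both splits;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

toy_train = X_tr.join(y_tr)
toy_test = X_te.join(y_te)
print(toy_train["far_right"].value_counts().to_dict())
print(toy_test["far_right"].value_counts().to_dict())
```

Whether stratification is desirable here depends on how the test set is meant to be used; the sketch only shows the mechanics.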

## Downsampling test data

In [8]:
#rename test data as data and run the following code
data = test

In [10]:
category_0 = data[data['far_right'] == 0]
category_1 = data[data['far_right'] == 1]
print(f"{len(category_0)},{len(category_1)}")

35316,7481


In [12]:
cat_0_sample = category_0.sample(len(category_1))
d_sample = pd.concat([cat_0_sample, category_1], axis=0)
d_sample['far_right'].value_counts()

0    7481
1    7481
Name: far_right, dtype: int64

In [13]:
#save the full test set to csv file (the downsampled d_sample from the previous cell is not saved here)
test.to_csv(f"{path}/test.csv")

## Downsampling

In [None]:
#rename train data as data and run the following code
data = train

In [197]:
#split data into two groups based on y variable
category_0 = data[data['far_right'] == 0]
category_1 = data[data['far_right'] == 1]
print(f"{len(category_0)},{len(category_1)}")

82341,17518


In [198]:
#take a sample of y=0 twice the size of y=1 in order to downsample and create a better balance between classes (2:1)
cat_0_sample = category_0.sample(len(category_1)*2)
d_sample = pd.concat([cat_0_sample, category_1], axis=0)
d_sample['far_right'].value_counts()

0    35036
1    17518
Name: far_right, dtype: int64

In [217]:
#save to csv
#d_sample.to_csv(f"{path}/downsample.csv")
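
The downsampling steps above can be wrapped in a small helper so that the majority-to-minority ratio and the random seed are explicit. This is a sketch on toy data, not the notebook's actual pipeline; the function name `downsample_majority` and its parameters are hypothetical.

```python
import pandas as pd

def downsample_majority(df, label_col="far_right", ratio=1.0, random_state=0):
    """Keep all minority rows and at most `ratio` times as many majority rows."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label]
    # Never ask for more majority rows than exist
    n_keep = min(len(majority), int(len(minority) * ratio))
    majority_sample = majority.sample(n=n_keep, random_state=random_state)
    # Shuffle so the two classes are not stored as two contiguous blocks
    return pd.concat([majority_sample, minority]).sample(frac=1, random_state=random_state)

# Toy stand-in for the training data (hypothetical values)
toy = pd.DataFrame({
    "contribution_text": [f"speech {i}" for i in range(100)],
    "far_right": [0] * 80 + [1] * 20,
})

balanced = downsample_majority(toy, ratio=2.0)
print(balanced["far_right"].value_counts().to_dict())
```

With `ratio=2.0` the helper reproduces the 2:1 balance used for the training set; `ratio=1.0` gives the 1:1 balance used on the test data.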

## Upsampling 

In [200]:
#duplicate the y=1 rows: two copies first, then doubled again, for four copies in total
cat_1_usample = pd.concat([category_1, category_1], axis = 0)
u_sample_1 = pd.concat([cat_1_usample, cat_1_usample])

In [201]:
#combine the upsampled y=1 rows with all y=0 rows
u_sample = pd.concat([u_sample_1, category_0], axis = 0)

In [216]:
#save to csv
#u_sample.to_csv(f"{path}/upsample.csv")
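
The two `pd.concat` calls above amount to repeating every minority row four times. A minimal sketch of the same idea with an explicit repetition factor, shown on toy data; the function name `upsample_minority` and the `factor` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd

def upsample_minority(df, label_col="far_right", factor=4, random_state=0):
    """Repeat every minority row `factor` times and keep the majority rows unchanged."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label]
    repeated = pd.concat([minority] * factor)
    out = pd.concat([repeated, majority])
    # Shuffle and rebuild the index, since the copies share index labels
    return out.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy stand-in for the training data (hypothetical values)
toy = pd.DataFrame({
    "contribution_text": [f"speech {i}" for i in range(100)],
    "far_right": [0] * 80 + [1] * 20,
})

upsampled = upsample_minority(toy, factor=4)
print(upsampled["far_right"].value_counts().to_dict())
```

Because the copies are exact duplicates, upsampling should only be applied to the training split, as is done here, so that no duplicated row can end up in both train and test.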

## Synthetic Sampling

In this step, we split each contribution for which far_right = 1 into individual sentences and assemble new observations from randomly drawn sentences. Each synthetic observation contains 10 or 11 sentences, close to the average contribution, and we generate three times as many of them as there are original far_right = 1 observations.
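
As a small, self-contained illustration of the idea before it is applied to the full data: the sketch below pools the sentences of a few toy texts and joins random draws into new texts. It uses a naive regular-expression splitter purely to avoid external dependencies, which is an assumption of this sketch; the notebook itself uses `nltk.tokenize.sent_tokenize`. The function name `synthesize` and its parameters are hypothetical.

```python
import random
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def synthesize(texts, n_new, sentences_per_obs=3, seed=0):
    """Build `n_new` synthetic texts by joining randomly drawn sentences."""
    rng = random.Random(seed)
    # Pool every sentence from every input text
    pool = [s for t in texts for s in split_sentences(t)]
    # Draw with replacement, so a sentence may appear more than once
    return [
        " ".join(rng.choice(pool) for _ in range(sentences_per_obs))
        for _ in range(n_new)
    ]

toy_texts = [
    "First point. Second point.",
    "Another speech! With two sentences.",
]
synthetic = synthesize(toy_texts, n_new=6)
print(synthetic[0])
```

Sentences drawn this way lose their original context and ordering, so the synthetic texts are best seen as bags of class-typical sentences rather than coherent speeches.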

In [203]:
#get a list of all sentences in the contributions where far_right = 1
list_of_sentences = []

for i in category_1.contribution_text:
    k = nltk.tokenize.sent_tokenize(i)
    for j in k:
        list_of_sentences.append(j)

In [204]:
#check how many sentences are present in the data set
len(list_of_sentences)

185428

In [205]:
#check average number of sentences per contribution
len(list_of_sentences) / len(category_1)

10.58499828747574

In [206]:
#check average number of characters per contribution
avg_length = np.mean(category_1.contribution_text.map(len))
avg_length

1472.9630665601096

In [207]:
#randomize list of sentences
random_list_of_sentences = random.sample(list_of_sentences, len(list_of_sentences))

In [208]:
#reassemble sentences into synthetic observations: three times the number of obs where far_right = 1

synth_dict = {}
list_of_charcounts = []
n_sentences = len(random_list_of_sentences)

for i in range(len(category_1) * 3):
    sample = []
    char_count = 0
    #draw 10 sentences at random (with replacement) from the full sentence pool
    for j in range(10):
        rand_idx = random.randint(0, n_sentences - 1)
        sentence = random_list_of_sentences[rand_idx]
        sample.append(sentence)
        char_count += len(sentence)
    #add one more sentence if the observation is still shorter than the average contribution
    if char_count < avg_length:
        rand_idx = random.randint(0, n_sentences - 1)
        sample.append(random_list_of_sentences[rand_idx])
        char_count += len(random_list_of_sentences[rand_idx])
    list_of_charcounts.append(char_count)
    #store the observation as a single string so it matches the original contribution_text values
    synth_dict[i] = [" ".join(sample)]

In [209]:
#check difference in average number of characters per observation
avg_length - np.mean(list_of_charcounts)

1.2887696464588316

In [210]:
#create dataframe based on the synth sample dictionary
s_sample_1 = pd.DataFrame.from_dict(synth_dict).T
s_sample_1['far_right'] = 1
s_sample_1 = s_sample_1.rename(columns = {0: 'contribution_text'})

In [211]:
#keep only the text and label columns of category_0 so it can be concatenated with the synthetic rows
cat_0_short = category_0[['contribution_text', 'far_right']]

In [212]:
#concatenate newly synthesized rows with previously existing rows where far_right = 0, shuffle and reset index
s_sample = pd.concat([cat_0_short, s_sample_1], axis = 0)
s_sample = s_sample.sample(frac = 1)
s_sample.reset_index(drop=True, inplace = True)

In [213]:
#check value counts of far_right column
s_sample['far_right'].value_counts()

0    82341
1    52554
Name: far_right, dtype: int64

In [215]:
#save dataframe to csv
#s_sample.to_csv(f"{path}/synthsample.csv")