# Ideas behind Sampling

In this notebook, we try three different approaches to creating a training set for our deep learning model. First, we downsample the majority class to counter the class imbalance in the data. Second, we upsample the minority class by creating copies of its observations. Finally, we synthesize a new training set by splitting each minority-class observation into sentences and recombining randomly drawn sentences into new observations.

In [58]:
#load packages
import pandas as pd
import numpy as np
import os
import random
import nltk
from sklearn.model_selection import train_test_split

In [3]:
#read in data frame
path = os.getcwd()
data = pd.read_csv(f"{path}/final_data.csv")

In [192]:
#split data into train and test sets
X = data[['contribution_text']]
y = data['far_right']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
train = X_train.join(y_train)
test = X_test.join(y_test)
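The split above is random and unseeded, so the class proportions in `train` and `test` can drift slightly from run to run. As a hedged alternative, the sketch below draws the test set per class with a fixed seed, using pandas only. The toy data frame and the names `toy`, `toy_train` and `toy_test` are made up for illustration; scikit-learn's `train_test_split` also accepts `stratify` and `random_state` arguments that serve the same purpose.

```python
import pandas as pd

# Toy stand-in for the real data frame; the column names follow the notebook,
# the rows are invented for illustration.
toy = pd.DataFrame({
    "contribution_text": [f"text {i}" for i in range(100)],
    "far_right": [1] * 20 + [0] * 80,
})

# Draw 30% of each class for the test set, with a fixed seed for repeatability.
toy_test = toy.groupby("far_right", group_keys=False).sample(frac=0.3, random_state=42)

# Every row that was not drawn for the test set goes into the training set.
toy_train = toy.drop(toy_test.index)

# Share of far_right = 1 rows in each split.
print(toy_train["far_right"].mean(), toy_test["far_right"].mean())
```

Because the sample is taken within each class, both splits keep the same share of `far_right = 1` rows as the full toy data.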

In [196]:
#from here on, work only with the training set under the name data
data = train

In [218]:
#save test set to csv file
#test.to_csv(f"{path}/test.csv")

In [220]:
pwd

'/Users/konratpekkip/code/konratp/Detecting-Far-Right-Talking-Points'

## Downsampling

In [197]:
#split data into two groups based on y variable
category_0 = data[data['far_right'] == 0]
category_1 = data[data['far_right'] == 1]
print(f"{len(category_0)},{len(category_1)}")

82341,17518


In [198]:
#sample twice as many y=0 rows as there are y=1 rows to downsample the majority class and improve the balance between classes
cat_0_sample = category_0.sample(len(category_1)*2)
d_sample = pd.concat([cat_0_sample, category_1], axis=0)
d_sample['far_right'].value_counts()

0    35036
1    17518
Name: far_right, dtype: int64
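The downsampling step can be sketched as a small reusable function. This is a minimal sketch, assuming a binary label column with unequal class counts; the function name `downsample`, its `ratio` and `random_state` parameters, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def downsample(df, label_col, ratio=2, random_state=None):
    """Keep all minority rows and at most `ratio` times as many majority rows."""
    counts = df[label_col].value_counts()
    minority = df[df[label_col] == counts.idxmin()]
    majority = df[df[label_col] == counts.idxmax()]
    # Never ask for more majority rows than exist.
    n_keep = min(len(majority), ratio * len(minority))
    kept = majority.sample(n=n_keep, random_state=random_state)
    return pd.concat([kept, minority], axis=0)

# Toy data: 10 rows with far_right = 1 and 90 rows with far_right = 0.
toy = pd.DataFrame({
    "contribution_text": [f"text {i}" for i in range(100)],
    "far_right": [1] * 10 + [0] * 90,
})
toy_down = downsample(toy, "far_right", ratio=2, random_state=0)
print(toy_down["far_right"].value_counts())
```

Passing a `random_state` makes the selection of majority rows repeatable, which the unseeded `sample` call above does not guarantee.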

In [217]:
#save to csv
#d_sample.to_csv(f"{path}/downsample.csv")

## Upsampling 

In [200]:
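#stack category_1 twice, then stack the result twice: four copies of every minority-class row
cat_1_usample = pd.concat([category_1, category_1], axis = 0)
u_sample_1 = pd.concat([cat_1_usample, cat_1_usample], axis = 0)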

In [201]:
u_sample = pd.concat([u_sample_1, category_0], axis = 0)
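With the counts printed earlier, four copies of the minority class give 4 × 17518 = 70072 rows against 82341 majority rows, so the classes end up close to balanced but not equal. A hedged alternative, sketched below, samples minority rows with replacement until both classes have the same size. The function name `upsample`, its parameters, and the toy data are assumptions for illustration and not the notebook's own method.

```python
import pandas as pd

def upsample(df, label_col, random_state=None):
    """Add randomly repeated minority rows until both classes are the same size."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label]
    # Number of extra minority rows needed to match the majority class.
    n_extra = len(majority) - len(minority)
    extra = minority.sample(n=n_extra, replace=True, random_state=random_state)
    return pd.concat([majority, minority, extra], axis=0)

# Toy data: 10 rows with far_right = 1 and 90 rows with far_right = 0.
toy = pd.DataFrame({
    "contribution_text": [f"text {i}" for i in range(100)],
    "far_right": [1] * 10 + [0] * 90,
})
toy_up = upsample(toy, "far_right", random_state=0)
print(toy_up["far_right"].value_counts())
```

As in the notebook, any copying of rows should happen only on the training set, after the test set has been split off, so that duplicated rows cannot appear on both sides of the split.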

In [216]:
#save to csv
#u_sample.to_csv(f"{path}/upsample.csv")

## Synthetic Sampling

In this step, we split each contribution for which far_right = 1 into individual sentences and assemble new observations from randomly drawn sentences. Each new observation contains 10 or 11 sentences, close to the average contribution, and we generate three times as many observations as the minority class originally contains.

In [203]:
#get a list of all sentences in the contributions where far_right = 1
list_of_sentences = []

for text in category_1.contribution_text:
    #sent_tokenize splits one contribution into a list of sentences
    list_of_sentences.extend(nltk.tokenize.sent_tokenize(text))

In [204]:
#check how many sentences are present in the data set
len(list_of_sentences)

185428

In [205]:
#check average number of sentences per contribution
len(list_of_sentences) / len(category_1)

10.58499828747574

In [206]:
#check average number of characters per contribution
avg_length = np.mean(category_1.contribution_text.map(len))
avg_length

1472.9630665601096

In [207]:
#randomize list of sentences
random_list_of_sentences = random.sample(list_of_sentences, len(list_of_sentences))

In [208]:
#reassemble randomly drawn sentences into synthetic observations: three times as many as there are obs where far_right = 1

synth_dict = {}
list_of_charcounts = []

for i in range(len(category_1) * 3):
    sample = []
    char_count = 0
    #draw 10 sentences at random (with replacement) from the whole sentence pool
    for _ in range(10):
        rand_idx = random.randint(0, len(random_list_of_sentences)-1)
        sentence = random_list_of_sentences[rand_idx]
        sample.append(sentence)
        char_count += len(sentence)
    #add an 11th sentence if the observation is still shorter than the average contribution
    if char_count < avg_length:
        rand_idx = random.randint(0, len(random_list_of_sentences)-1)
        sample.append(random_list_of_sentences[rand_idx])
        char_count += len(random_list_of_sentences[rand_idx])
    list_of_charcounts.append(char_count)
    #join the sentences into one string so contribution_text has the same format as the original rows
    synth_dict[i] = [" ".join(sample)]

In [209]:
#check difference in average number of characters per observation
avg_length - np.mean(list_of_charcounts)

1.2887696464588316
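The synthesis loop above can also be sketched as a function that uses only the standard library. This is a minimal sketch under stated assumptions: the function name `synthesize`, its parameters, and the toy sentences are hypothetical, sentences are drawn with replacement from the whole pool, and a seeded `random.Random` instance is used so that repeated runs give the same observations.

```python
import random

def synthesize(sentences, n_obs, n_sentences=10, target_chars=None, seed=None):
    """Build n_obs synthetic texts from randomly drawn sentences."""
    rng = random.Random(seed)
    observations = []
    for _ in range(n_obs):
        # Draw n_sentences sentences with replacement from the whole pool.
        picked = rng.choices(sentences, k=n_sentences)
        # Add one more sentence if the text is still shorter than the target length.
        if target_chars is not None and sum(len(s) for s in picked) < target_chars:
            picked.append(rng.choice(sentences))
        observations.append(" ".join(picked))
    return observations

# Toy sentence pool, invented for illustration.
toy_sentences = ["One.", "Two.", "Three.", "Four."]
toy_obs = synthesize(toy_sentences, n_obs=3, n_sentences=2, target_chars=10, seed=1)
print(toy_obs)
```

In the notebook's setting, `sentences` would be `list_of_sentences`, `n_obs` would be `len(category_1) * 3`, and `target_chars` would be `avg_length`.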

In [210]:
#create dataframe based on the synth sample dictionary
s_sample_1 = pd.DataFrame.from_dict(synth_dict).T
s_sample_1['far_right'] = 1
s_sample_1 = s_sample_1.rename(columns = {0: 'contribution_text'})

In [211]:
#keep only the text and label columns of category_0 so that its columns match those of s_sample_1
cat_0_short = category_0[['contribution_text', 'far_right']]

In [212]:
#concatenate newly synthesized rows with previously existing rows where far_right = 0, shuffle and reset index
s_sample = pd.concat([cat_0_short, s_sample_1], axis = 0)
s_sample = s_sample.sample(frac = 1)
s_sample.reset_index(drop=True, inplace = True)

In [213]:
#check value counts of far_right column
s_sample['far_right'].value_counts()

0    82341
1    52554
Name: far_right, dtype: int64

In [215]:
#save dataframe to csv
#s_sample.to_csv(f"{path}/synthsample.csv")