### Data pre-processing and augmentation

In this part of the project, ...

In [None]:
import pandas as pd
import numpy as np
import importlib

from sklearn.model_selection import train_test_split

import pre_processing
#importlib.reload(pre_processing)

import data_augmentation
#importlib.reload(data_augmentation)

#### *Reading initial dataset*

The ArXiv dataset used throughout the deep learning project is read from a CSV file.\
The initial dataset has **77208 rows** (articles), with the following columns:
- **text**, a string with the article text that will be analyzed and manipulated in our text classification task;
- **label**, the level 1 category of the text (for example, physics), which is the target variable of our task.

In [None]:
arxiv_data = pd.read_csv('data/arxiv-dataset-cat1.csv')

#### *Splitting in training and test set*

The initial dataset is partitioned into training and test sets, with the following strategy:
- **70%** training and **30%** test;
- **Stratified sampling** based on the value of the label (category).

*Training and test sets have been stored in two CSV files*

In [None]:
train_set, test_set = train_test_split(arxiv_data, test_size = 0.3, stratify = arxiv_data['label'], random_state = 19)
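
As a quick check of what stratification does, the sketch below repeats the same split on a small toy dataframe and compares the class shares in each partition. The toy data and the names `toy`, `toy_train` and `toy_test` are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 'physics' and 20 'math' articles
toy = pd.DataFrame({
    'text': [f'article {i}' for i in range(100)],
    'label': ['physics'] * 80 + ['math'] * 20,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy['label'], random_state=19
)

# Class shares in the full data and in each partition
shares = pd.DataFrame({
    'full': toy['label'].value_counts(normalize=True),
    'train': toy_train['label'].value_counts(normalize=True),
    'test': toy_test['label'].value_counts(normalize=True),
})
print(shares.round(2))
```

With stratified sampling, the three columns should show nearly the same share for every class.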

In [None]:
# train_set.to_csv('data/train-set-cat1.csv', index = False)
# test_set.to_csv('data/test-set-cat1.csv', index = False)

#### *Pre-processing on training and test set*

A pre-processing phase is applied to the text of each observation in both the training and test sets, with the following operations:
- Converting to **lower case**;
- Removing **special characters and symbols**;
- Removing **stop words**;
- **Lemmatization**.

*The processed datasets have been saved in CSV files, so that the time-consuming procedure does not need to be repeated each time the code is executed.*
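
The four operations listed above can be sketched as follows. This is not the code of the `pre_processing` module, which is not shown here: the function name `preprocess_text`, the tiny stop-word set and the lemma lookup table are all assumptions made for illustration. A real pipeline would more likely rely on a full stop-word list and a proper lemmatizer, such as NLTK's `WordNetLemmatizer`.

```python
import re

# Assumption: a tiny stop-word set and lemma table, for illustration only.
# The project's pre_processing module may rely on different resources.
STOP_WORDS = {'the', 'a', 'an', 'of', 'in', 'and', 'is', 'are', 'to'}
LEMMAS = {'particles': 'particle', 'studies': 'study', 'models': 'model'}

def preprocess_text(text):
    # 1. Lower case
    text = text.lower()
    # 2. Replace everything except letters, digits and whitespace with a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # 3. Drop stop words
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # 4. Lemmatize through the toy lookup table
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return ' '.join(tokens)

print(preprocess_text('The Models of dark-matter particles!'))
# prints: model dark matter particle
```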

In [None]:
# train_set_processed = pre_processing.dataPreProcessing(train_set)
# train_set_processed.to_csv('data/train-set-cat1-processed.csv', index = False)

train_set = pd.read_csv('data/train-set-cat1-processed.csv')

In [None]:
#test_set_processed = pre_processing.dataPreProcessing(test_set)
#test_set_processed.to_csv('data/test-set-cat1-processed.csv', index = False)

test_set = pd.read_csv('data/test-set-cat1-processed.csv')
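
The "compute once, then read the CSV" pattern used in the cells above (by commenting and uncommenting lines) could also be written as a small helper. The sketch below is one possible way to do it; `load_or_compute` and `expensive_step` are hypothetical names, and the dataframe is a stand-in for the real pre-processing output.

```python
import os
import tempfile
import pandas as pd

def load_or_compute(csv_path, compute):
    # Reuse the stored CSV when it exists, otherwise compute and store it
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    result = compute()
    result.to_csv(csv_path, index=False)
    return result

# Usage with a stand-in for the expensive step (hypothetical data)
calls = []

def expensive_step():
    calls.append(1)
    return pd.DataFrame({'text': ['dark matter halo'], 'label': ['physics']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train-set-processed.csv')
    first = load_or_compute(path, expensive_step)   # computes and writes
    second = load_or_compute(path, expensive_step)  # reads the cached CSV

print(len(calls))
# prints: 1
```

One caveat of this design is that the cached file must be deleted by hand whenever the pre-processing logic changes.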

#### *Data augmentation*

We have decided to use some **data augmentation techniques** to balance the training set, addressing these issues:
- **Number of words per observation**

    For this problem, a **threshold** of at least **15 words per observation** has been set. To reach it, the **Random Insertion** technique has been used.\
    Our **algorithm** follows these steps:
    
    1. For each observation which has fewer words than the threshold, get a **random sample of 5 words** from the initial text;
    2. Then, get the **5 most similar words** for each of the previously retrieved tokens and, from this set, get **n random words** (we set *n* equal to *15*);
    3. Insert the **n words** in random positions within the starting text and substitute the augmented text in the train set;
    4. Redo the **pre-processing phase**, because the random insertion could have included words we do not accept in our context (like stop words). In this way, we also apply the formatting;
    5. **Filter** the train set again to find "under-threshold" observations (pre-processing could have deleted some words);
    6. Repeat the algorithm until there are no more observations to augment.

*This procedure is expensive in both time and computation, so the resulting dataframe has been saved in a CSV file, which is read back to continue the project.*
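
Steps 1 to 3 of the algorithm can be sketched as below. This is not the code of `data_augmentation.randomInsert`, which is not shown here: the function `random_insert` and the `SIMILAR_WORDS` lookup are hypothetical, and the lookup stands in for whatever similarity source the project actually uses (for example word embeddings). Because the toy pool is smaller than *n*, the sketch draws the new words with replacement.

```python
import random

# Assumption: a toy lookup of similar words; a real setup would query
# word embeddings or a thesaurus instead.
SIMILAR_WORDS = {
    'quantum': ['qubit', 'entanglement', 'photon'],
    'field': ['gauge', 'potential', 'tensor'],
    'theory': ['model', 'framework', 'hypothesis'],
}

def random_insert(text, n, sample_size=5, seed=None):
    rng = random.Random(seed)
    words = text.split()
    # Step 1: random sample of up to sample_size words from the text
    sampled = rng.sample(words, min(sample_size, len(words)))
    # Step 2: pool the similar words of the sampled tokens,
    # then draw n of them (with replacement, since the toy pool is small)
    pool = [s for w in sampled for s in SIMILAR_WORDS.get(w, [])]
    if not pool:
        return text
    new_words = rng.choices(pool, k=n)
    # Step 3: insert each new word at a random position
    for w in new_words:
        words.insert(rng.randint(0, len(words)), w)
    return ' '.join(words)

augmented = random_insert('quantum field theory', n=15, seed=19)
print(len(augmented.split()))
# prints: 18
```

The original words keep their relative order, since the new words are only inserted around them.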

In [None]:
train_set_augmented = train_set.copy()

# Count the number of words for each observation
train_set_augmented['words'] = train_set_augmented['text'].apply(lambda row: len(row.split()))

# Min number of words per observation
min_number_words = 15

# Observations with fewer words than the threshold (they need data augmentation)
observations_to_augment = train_set_augmented[train_set_augmented['words'] < min_number_words]

# Repeat while there are still observations to augment
while len(observations_to_augment) != 0:

    # Creating the augmented observations
    obs_augmented = observations_to_augment['text'].apply(lambda row: data_augmentation.randomInsert(row, min_number_words))

    # For each augmented observation replace the text in the training set and redo the pre-processing
    for i in obs_augmented.index:

        train_set_augmented.loc[i,'text'] = obs_augmented[i]
        train_set_augmented.loc[i,'text'] = pre_processing.dataPreProcessing(pd.DataFrame(train_set_augmented.loc[i,:]).transpose()).loc[i,'text']
        train_set_augmented.loc[i,'words'] = len(train_set_augmented.loc[i,'text'].split())

    observations_to_augment = train_set_augmented[train_set_augmented['words'] < min_number_words]

# Drop the helper column and write the CSV (the previous operations are computationally expensive)
train_set_augmented.drop('words', axis = 1, inplace = True)

# train_set_augmented.to_csv('data/train-set-cat1-augmented-words-obs.csv', index = False)

In [None]:
# Train set after the first phase of the data augmentation
train_set_augmented = pd.read_csv('data/train-set-cat1-augmented-words-obs.csv')
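
The word count used to find "under-threshold" observations can also be computed with pandas string methods. The sketch below uses hypothetical toy data and a lower threshold for readability. It treats missing texts as having 0 words, which may matter because pandas reads empty CSV fields back as missing values by default.

```python
import pandas as pd

# Toy data (hypothetical); the missing value mimics an empty text read back from CSV
toy = pd.DataFrame({'text': ['dark matter halo', 'quantum field', None]})

# Word count per observation; missing texts count as 0 words
toy['words'] = toy['text'].fillna('').str.split().str.len()

min_number_words = 3
under_threshold = toy[toy['words'] < min_number_words]
print(under_threshold.index.tolist())
# prints: [1, 2]
```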

- **Class imbalance management**

    We noticed a **class imbalance** problem in the articles' labels, so we decided to apply a strategy based on the **Synonym Replacement** technique, with the aim of balancing the dataset and also increasing its variability before the actual training phase.\
    Our **approach** follows these steps:
    1. Select only the classes which need more observations; we opted for a threshold of fewer than *5000 observations per class*;
    2. For each of the identified classes, the number **n** of new (augmented) texts to create is defined as the **number of initial observations of the class divided by two**: we wanted to keep the *number of "original" observations higher than the number of augmented ones*. With this choice, each "challenged" class ends up with *1/3* augmented data and *2/3* original data;
    3. A sample of **n** elements is retrieved from the class's initial observations, as the starting point for the data augmentation technique;
    4. For each of these observations, an augmented text is created using the **Synonym Replacement algorithm**, which substitutes each word of the text with one synonym (we used the NLTK library);
    5. The new data is concatenated to the train set, increasing the number of observations for the considered classes and thus improving the balance of the dataset.

*This procedure is expensive in both time and computation, so the resulting dataframe has been saved in a CSV file, which is read back to continue the project.*
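
The idea behind step 4 can be sketched as below. This is not the code of `data_augmentation.synonymReplacement`, which is not shown here: the function `synonym_replacement` and the `SYNONYMS` table are hypothetical, and the table stands in for the synonyms that the project obtains through NLTK. Words without a known synonym are left unchanged in this sketch.

```python
import random

# Assumption: a toy synonym table; the project uses NLTK for synonyms
SYNONYMS = {
    'galaxy': ['nebula'],
    'big': ['large', 'huge'],
    'study': ['analysis', 'investigation'],
}

def synonym_replacement(text, seed=None):
    rng = random.Random(seed)
    # Replace each word that has a known synonym; keep the others unchanged
    return ' '.join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
        for w in text.split()
    )

result = synonym_replacement('study big galaxy cluster', seed=19)
print(result)
```

The augmented text has the same number of words as the input, so it does not interact with the 15-word threshold of the previous phase.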

In [None]:
# Ratio of observations for each class before data augmentation (class imbalance detection)
train_set_augmented['label'].value_counts(normalize = True).round(2)
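
The sizing rule of step 2 above can be checked with a short worked example. The class names and counts below are hypothetical and only illustrate the arithmetic: a class under the threshold with N observations receives N / 2 augmented texts, so the augmented share is (N / 2) / (N + N / 2) = 1/3.

```python
import pandas as pd

# Hypothetical class counts, for illustration only
counts = pd.Series({'physics': 20000, 'math': 4000, 'stat': 1000})
class_aug_threshold = 5000

# Classes under the threshold and the number of augmented texts for each
small = counts[counts < class_aug_threshold]
n_augmented = (small / 2).round().astype(int)

# Share of augmented data in each class after the augmentation
share_augmented = n_augmented / (small + n_augmented)

print(pd.DataFrame({
    'original': small,
    'augmented': n_augmented,
    'share_augmented': share_augmented.round(2),
}))
```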

In [None]:
class_aug_threshold = 5000

obs_per_class = train_set_augmented['label'].value_counts()
class_to_balance = np.array(obs_per_class[obs_per_class < class_aug_threshold].index)

# For each class to augment
for class_label in class_to_balance:

    # Dataframe with only the observations of the considered class
    class_obs = train_set_augmented[train_set_augmented['label'] == class_label]
    num_aug_obs = round(len(class_obs) / 2)

    # Get the sample of observations used as the starting point
    random_class_obs = class_obs.sample(num_aug_obs)

    # Create one augmented text for each sampled observation
    aug_texts = [data_augmentation.synonymReplacement(text, 1)[0] for text in random_class_obs['text']]

    # Append all the augmented observations of the class with a single concat
    new_obs = pd.DataFrame({ 'text': aug_texts, 'label': class_label })
    train_set_augmented = pd.concat([ train_set_augmented, new_obs ], ignore_index = True)

# train_set_augmented.to_csv('data/train-set-cat1-augmented.csv', index = False)

In [None]:
# Train set after the data augmentation
train_set_augmented = pd.read_csv('data/train-set-cat1-augmented.csv')

In [None]:
# Ratio of observations for each class after data augmentation
train_set_augmented['label'].value_counts(normalize = True).round(2)