# Data Augmentation for EDOS

In this notebook, we explore data augmentation techniques for generating additional training examples from the EDOS training set.

First, we install the libraries that we will use:

In [1]:
!pip install datasets textaugment transformers nlpaug translators

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
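import os
from datasets import load_dataset
# The dataset is private, so an access token is required (a public dataset needs none).
# Read the token from the environment instead of hard-coding it in the notebook.
# HF_TOKEN is an assumed variable name; set it before running this cell.
access_token = os.environ["HF_TOKEN"]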
dataset_dict = load_dataset("ISEGURA/edos", use_auth_token=access_token)
dataset_dict




DatasetDict({
    train: Dataset({
        features: ['rewire_id', 'text', 'label_sexist', 'label_category', 'label_vector'],
        num_rows: 9800
    })
    test: Dataset({
        features: ['rewire_id', 'text', 'label_sexist', 'label_category', 'label_vector'],
        num_rows: 2814
    })
    validation: Dataset({
        features: ['rewire_id', 'text', 'label_sexist', 'label_category', 'label_vector'],
        num_rows: 1386
    })
})

Data augmentation techniques should be applied only to the training data, so we will work only with this split:

In [9]:
training_data = dataset_dict['train']
# free some memory: the other splits are not needed here
del dataset_dict
training_data

Dataset({
    features: ['rewire_id', 'text', 'label_sexist', 'label_category', 'label_vector'],
    num_rows: 9800
})
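
To make the train-only rule concrete, here is a minimal, library-free sketch. The `splits` dictionary and the `augment` function are hypothetical stand-ins for the real dataset and augmenters; the only point is that validation and test texts pass through unchanged.

```python
def augment(text):
    # Hypothetical placeholder: a real augmenter would return a paraphrase.
    return text + " (augmented)"


def augment_train_only(splits):
    """Return a copy of `splits` where only the 'train' split gains augmented copies."""
    result = {}
    for name, texts in splits.items():
        if name == "train":
            # Keep the originals and append one augmented copy of each text.
            result[name] = list(texts) + [augment(t) for t in texts]
        else:
            # Evaluation splits are copied as they are.
            result[name] = list(texts)
    return result


splits = {"train": ["a", "b"], "validation": ["c"], "test": ["d"]}
augmented = augment_train_only(splits)
```

Keeping the evaluation splits untouched means that reported scores reflect original text only.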

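Next, we create two augmented versions of each training text: one with EDA synonym replacement (textaugment) and one with contextual word insertion using BERT (nlpaug):

In [None]: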
import nltk
# EDA needs these NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from textaugment import EDA
# word-level augmenters, including ones based on contextual language models such as BERT
import nlpaug.augmenter.word as naw

t = EDA()
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")

def generate(example):
    original_text = example['text']
    # textaugment: EDA synonym replacement
    example['text_eda'] = t.synonym_replacement(original_text)
    # nlpaug: BERT-based word insertion; keep the first augmented sentence
    example['text_nlpaug'] = aug.augment(original_text)[0]

    return example


training_data = training_data.map(generate)
training_data

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


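
The `map` call above stores the augmented texts in new columns (`text_eda`, `text_nlpaug`) rather than in new rows. To actually enlarge the training set, each augmented text has to become its own example that carries the same labels. A minimal sketch in plain Python, assuming rows are dictionaries shaped like the dataset's records; the sample rows and the `expand_rows` helper are hypothetical and not part of any library:

```python
AUGMENTED_COLUMNS = ("text_eda", "text_nlpaug")


def expand_rows(rows, augmented_columns=AUGMENTED_COLUMNS):
    """Turn each augmented column into an extra row that keeps the original labels."""
    expanded = []
    for row in rows:
        # Original example without the helper columns.
        base = {k: v for k, v in row.items() if k not in augmented_columns}
        expanded.append(base)
        for column in augmented_columns:
            new_text = row.get(column)
            # Skip missing augmentations and ones identical to the original text.
            if new_text and new_text != row["text"]:
                expanded.append({**base, "text": new_text})
    return expanded


# Hypothetical sample rows for illustration only.
rows = [
    {"rewire_id": "example-1", "text": "a short sentence", "label_sexist": "not sexist",
     "text_eda": "a brief sentence", "text_nlpaug": "a very short sentence"},
    {"rewire_id": "example-2", "text": "hello", "label_sexist": "not sexist",
     "text_eda": "hello", "text_nlpaug": "hello there"},
]
expanded = expand_rows(rows)
```

In this sketch, augmented rows share the `rewire_id` of their source example, which makes them easy to trace back but means the id is no longer unique.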

In [None]:
from google.colab import drive
# mount Google Drive so the file persists after the Colab session ends
drive.mount('/content/drive')
path = "/content/drive/My Drive/Colab Notebooks/data/edos/"
# save to CSV without the index column
training_data.to_csv(path + "edos_agumented_train.csv", index=False)
print('file saved')
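
After exporting, it is worth checking that the file contains the expected columns and number of rows. A minimal sketch using only the standard library; it writes a tiny hypothetical sample to a temporary file instead of Google Drive so that it can run anywhere:

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the augmented training data.
rows = [
    {"text": "a short sentence", "text_eda": "a brief sentence",
     "text_nlpaug": "a very short sentence"},
    {"text": "hello", "text_eda": "hello", "text_nlpaug": "hello there"},
]
expected_columns = ["text", "text_eda", "text_nlpaug"]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample_augmented_train.csv")

    # Write the sample with a header row.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=expected_columns)
        writer.writeheader()
        writer.writerows(rows)

    # Read it back and collect what was actually stored.
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        loaded_columns = reader.fieldnames
        loaded_rows = list(reader)

# The round trip should preserve the column names and the row count.
assert loaded_columns == expected_columns
assert len(loaded_rows) == len(rows)
```

The same two checks can be run against the real export by pointing `csv_path` at the saved file and listing the dataset's actual column names.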