# Data Augmentation

This notebook covers the augmentation of the training data with techniques including:

- **Back Translation**: A translation model is used to paraphrase a sentence. The sentence is first translated from its original language into a second language (English in this notebook) and then translated back into the original language. Note: entries that became duplicates after translation were removed.

- **Synonym Replacement**: Tokens in the original input sequence are replaced with synonyms. For each word in the corpus, a list of synonym candidates is built from word2vec distance. This list is then filtered so that only candidates with the same POS tag as the original word remain.
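The POS filtering step described above could look roughly like the following. This is a minimal sketch, not the code used in this notebook: `filter_candidates_by_pos`, the `pos_of` callable, and the toy tags and similarity scores are all hypothetical stand-ins for a real tagger and real embedding neighbours.

```python
from typing import Callable, List, Optional, Tuple


def filter_candidates_by_pos(
    word: str,
    candidates: List[Tuple[str, float]],
    pos_of: Callable[[str], Optional[str]],
) -> List[Tuple[str, float]]:
    """Keep only candidates whose POS tag equals the tag of the original word."""
    target = pos_of(word)
    if target is None:
        # Without a tag for the original word there is nothing to compare against.
        return []
    return [(cand, score) for cand, score in candidates if pos_of(cand) == target]


# Toy tag lookup standing in for a real POS tagger (made-up illustration data).
TOY_TAGS = {
    "schöne": "ADJ",
    "wunderschöne": "ADJ",
    "Schönheit": "NOUN",
    "hübsche": "ADJ",
}

# (candidate, similarity) pairs in the shape returned by an embedding lookup;
# the scores are invented for the example.
candidates = [("wunderschöne", 0.81), ("Schönheit", 0.74), ("hübsche", 0.70)]
kept = filter_candidates_by_pos("schöne", candidates, TOY_TAGS.get)
print(kept)
```

Passing the tagger in as a callable keeps the filter independent of any particular tagging library.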

In [1]:
!pip install "transformers[sentencepiece]" -q
!pip install gensim -q

import pandas as pd
from transformers import pipeline
from gensim.models.keyedvectors import KeyedVectors

In [10]:
# read training df
df_train = pd.read_csv('df_competition.csv')
ANNOTATOR_COLUMNS = ['A001', 'A002', 'A003', 'A004', 'A005', 'A007', 'A008', 'A009', 'A010', 'A012']

### Synonym Replacement

Word vectors for German were obtained from [fastText](https://fasttext.cc/docs/en/crawl-vectors.html). Due to capacity constraints, only the first 2,000,000 vectors were loaded. The words 'Frau', 'Frauen', 'Mann', and 'Männer' (woman, women, man, men) were retained without replacement, because altering them would significantly change the meaning of the comments. This keeps the augmented comments relevant to the topic of sexism and misogyny.

In [2]:
vecs = KeyedVectors.load_word2vec_format('cc.de.300.vec', limit=2000000)

In [3]:
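def replace_with_synonym(word, vecs, num_synonyms=1):
    """
    Replaces the given word with its most similar synonym from word embeddings.
    Protected words and out-of-vocabulary words are returned unchanged.
    """
    # Gendered key terms are kept as-is to preserve the meaning of the comment.
    if word.lower() in {'frau', 'mann', 'männer', 'frauen'}:
        return word
    try:
        synonyms = vecs.most_similar(word, topn=num_synonyms)
    except KeyError:
        # The word is not in the embedding vocabulary.
        return word
    return synonyms[0][0] if synonyms else word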
    
def synonym_replacement(text, vecs, num_synonyms=1):
    words = text.split()
    replaced_words = [replace_with_synonym(word, vecs, num_synonyms) for word in words]
    return ' '.join(replaced_words)
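Because the text is split on whitespace only, a token such as `Frauen,` keeps its punctuation. It then neither matches the protected list nor the embedding vocabulary. A possible variant, sketched here under assumptions, tokenizes with a regular expression so punctuation is handled separately. The names `replace_tokens`, `PROTECTED`, and `lookup` are hypothetical, and the toy dictionary stands in for the embedding lookup.

```python
import re
from typing import Callable, Optional

# Words that must never be replaced (same list as in the notebook).
PROTECTED = {"frau", "frauen", "mann", "männer"}

# Alternating runs of word characters and non-word characters, so that
# joining the tokens reproduces the input text exactly.
TOKEN_RE = re.compile(r"\w+|\W+")


def replace_tokens(text: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each word token via `lookup`; keep punctuation and protected words."""
    out = []
    for tok in TOKEN_RE.findall(text):
        is_word = re.fullmatch(r"\w+", tok) is not None
        if not is_word or tok.lower() in PROTECTED:
            out.append(tok)
            continue
        replacement = lookup(tok)
        # Fall back to the original token when no synonym is available.
        out.append(replacement if replacement is not None else tok)
    return "".join(out)


# Toy synonym table standing in for the word-vector lookup (illustration only).
toy_synonyms = {"schöne": "wunderschöne", "Gesicht": "Antlitz", "Frauen": "Damen"}
result = replace_tokens("Das schöne Gesicht, sagen die Frauen.", toy_synonyms.get)
print(result)
```

With the loaded vectors, `lookup` could be a small wrapper that returns the top `most_similar` hit when the word is in the vocabulary and `None` otherwise.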

In [9]:
# EXAMPLE / TESTING

text = df_train['text'][27]
print(text + '\n')
replaced_text = synonym_replacement(text, vecs)
print(replaced_text)

Das schöne Gesicht der Frauenquote

.Das wunderschöne Antlitz die Frauenquoten


In [24]:
def replace_text_with_synonyms(row, vecs):
    """
    Replaces multiple words in the given text with their most similar synonyms using word embeddings.
    """
    text = row['text']
    replaced_text = synonym_replacement(text, vecs)
    new_row = row.copy()
    new_row['text'] = replaced_text
    #new_row.loc[new_row == 0] = -1
    return new_row

df_aug = df_train.apply(replace_text_with_synonyms, axis=1, args=(vecs,))

In [25]:
# DataFrame.append is deprecated and was removed in pandas 2.0; use pd.concat instead.
df_augmented_1 = pd.concat([df_train, df_aug], ignore_index=True)


### Back Translation

To translate the entries from German into English and back, we used the Helsinki-NLP models ([de-en](https://huggingface.co/Helsinki-NLP/opus-mt-de-en) and [en-de](https://huggingface.co/Helsinki-NLP/opus-mt-en-de)). These models were not explicitly pre-trained or fine-tuned on sexism-related content.

In [None]:
to_english = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
to_german = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

In [None]:
def translate_to_english(text):
    """
    Translates text from German to English.
    """
    return to_english(text)[0]['translation_text']

def translate_to_german(text):
    """
    Translates text from English to German.
    """
    return to_german(text)[0]['translation_text']

def back_translation(text):
    """
    Performs back translation from German to English and back to German.
    """
    english = translate_to_english(text)
    german = translate_to_german(english)
    return german

In [29]:
df_train['back_translated_text'] = df_train['text'].apply(back_translation)
new_rows = df_train[df_train['text'] != df_train['back_translated_text']].copy()
new_rows['text'] = new_rows['back_translated_text']
new_rows = new_rows.drop(columns=['back_translated_text'])

#df_augmented_2 = pd.concat([df, new_rows], ignore_index=True)

Your input_length: 511 is bigger than 0.9 * max_length: 512. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
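The cell above drops rows whose back translation equals the original text. The sketch below extends that idea so that it also drops back translations that duplicate each other. It is only an illustration under assumptions: `back_translate_unique` is a hypothetical helper, and the two dictionary lookups are toy stand-ins for the `to_english` and `to_german` pipelines.

```python
from typing import Callable, Iterable, List


def back_translate_unique(
    texts: Iterable[str],
    to_pivot: Callable[[str], str],
    from_pivot: Callable[[str], str],
) -> List[str]:
    """Back-translate texts and keep only results that are new and unique."""
    texts = list(texts)
    seen = set(texts)  # originals count as already present
    augmented = []
    for text in texts:
        candidate = from_pivot(to_pivot(text))
        if candidate in seen:
            # Unchanged by the round trip, or already produced by another row.
            continue
        seen.add(candidate)
        augmented.append(candidate)
    return augmented


# Toy translators standing in for the real translation models (illustration only).
de_en = {"Guten Morgen": "Good morning", "Schönen guten Morgen": "Good morning", "Hallo": "Hello"}
en_de = {"Good morning": "Guten Morgen!", "Hello": "Hallo"}

originals = ["Guten Morgen", "Schönen guten Morgen", "Hallo"]
new_texts = back_translate_unique(originals, de_en.__getitem__, en_de.__getitem__)
print(new_texts)
```

Injecting the translators as plain callables makes the filtering logic testable without loading the models.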


Merge the original data and both augmented sets into one DataFrame.

In [30]:
df_augmented = pd.concat([df_augmented_1, new_rows], ignore_index=True)

In [57]:
#df_augmented.to_csv('df_train_original.csv')

We experimented with downsampling label 0 after the first split: in the augmented data, entries with label 0 were deleted to improve class balance. This resulted in worse performance, so all data entries were kept (more details in the paper).

In [58]:
df_augmented.loc[11995:, ANNOTATOR_COLUMNS] = df_augmented.loc[11995:, ANNOTATOR_COLUMNS].replace(0, -1)
df_augmented

Unnamed: 0,text,A012,A003,A005,A001,A008,A004,A007,A010,A002,A009
0,"Wen man nicht reinläßt, den muss man auch nich...",0,0,0,0,0,0,0,0,0,0
1,Und eine Katze die schnurrt genügt Ihnen? \nUn...,1,1,0,0,1,0,1,0,0,0
2,"Des Oaschloch is eh scho berühmt, de virz‘g Ju...",3,3,3,0,4,3,2,4,3,4
3,Trump hat 2 Dinge übersehen:\nevery vote count...,3,3,3,0,1,3,3,0,3,2
4,Mit der Fo×÷e hat er sich keinen Gefallen geta...,2,4,3,4,2,4,4,4,4,3
...,...,...,...,...,...,...,...,...,...,...,...
17908,"Ich bin nur jeden Tag wirklich überrascht, wen...",-1,-1,-1,-1,-1,-1,-1,3,2,3
17909,Ein Kameramann bekommt in der Regel Anweisunge...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
17910,Aber warum stimmen 36 % der Frauen in Österrei...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
17911,"Ich denke, es ist schwer, sich die Realitäten ...",1,-1,-1,-1,-1,-1,-1,-1,-1,1


In [59]:
# delete rows with only -1 entries
df_augmented = df_augmented.loc[~(df_augmented[ANNOTATOR_COLUMNS].eq(-1).all(axis=1))]

In [61]:
#df_augmented.to_csv('df_train_0_after_2.csv')