## Importing Relevant Packages and Datasets

In [2]:
import pandas as pd

In [3]:
all_data = pd.read_csv('../final_fixed_data.csv', index_col=0)

The 'final_fixed_data.csv' file loaded above is a modified version of the CrowS-Pairs dataset, which can be found on [their GitHub](https://github.com/nyu-mll/crows-pairs). The modifications were made manually and remove some instances where the bias is not adequately represented in the dataset.

## Modifying the Structure of the Dataset

The dataset has to be structured in a specific way before BERT can be trained on it.

Only two columns are required, namely the sentence and the label.

As CrowS-Pairs is structured in sentence pairs, I have labelled the 'sent_more' sentence as the stereotype or antistereotype sentence, and the 'sent_less' sentence as the non-stereotype one. In this model, 'stereotype' is an umbrella term covering sentences from both stereotype and antistereotype pairs. Because this is a binary classification model, stereotype and antistereotype sentences are represented with the label 1, and non-stereotype sentences with the label 0. The stereotype and antistereotype directions are described [here](https://github.com/nyu-mll/crows-pairs/blob/master/README.md) as:

```
The stereotypical direction of the pair: A stereo direction denotes that sent_more is a sentence that demonstrates a stereotype of a historically disadvantaged group. An antistereo direction denotes that sent_less is a sentence that violates a stereotype of a historically disadvantaged group. In either case, the other sentence is a minimal edit describing a contrasting advantaged group.
```

In [5]:
finetune_data = {
    'sentence' : [],
    'label' : []
}

In [6]:
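for _, row in all_data.iterrows():
    # 'sent_more' is the stereotype/antistereotype sentence: label 1
    finetune_data['sentence'].append(row['sent_more'])
    finetune_data['label'].append(1)

    # 'sent_less' is the non-stereotype sentence: label 0
    finetune_data['sentence'].append(row['sent_less'])
    finetune_data['label'].append(0)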

In [7]:
finetune_dataframe = pd.DataFrame(finetune_data)
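
The loop above can also be written without `iterrows`. The following is only a minimal sketch on toy data, not the code used in this notebook: `build_finetune_frame` and `toy_pairs` are hypothetical names, and the only assumption about the input is that it has the `sent_more` and `sent_less` columns used above.

```python
import pandas as pd

def build_finetune_frame(pairs: pd.DataFrame) -> pd.DataFrame:
    """Interleave sent_more (label 1) and sent_less (label 0), pair by pair."""
    more = pd.DataFrame({'sentence': pairs['sent_more'].to_numpy(), 'label': 1})
    less = pd.DataFrame({'sentence': pairs['sent_less'].to_numpy(), 'label': 0})
    # Both frames are indexed 0..n-1, so a stable sort on the index
    # places each sent_more row directly before its sent_less partner.
    combined = pd.concat([more, less])
    return combined.sort_index(kind='mergesort').reset_index(drop=True)

# Hypothetical two-pair example (placeholder strings, not real CrowS-Pairs rows)
toy_pairs = pd.DataFrame({
    'sent_more': ['A1', 'B1'],
    'sent_less': ['A0', 'B0'],
})
toy_frame = build_finetune_frame(toy_pairs)
# toy_frame['sentence'] is A1, A0, B1, B0 and toy_frame['label'] is 1, 0, 1, 0
```

The row order matches the loop: each pair occupies two consecutive rows, which matters for the split in the next section.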

## Splitting the Dataset into Training and Testing (80/20 split)

I've divided the data 80/20 into training and testing sets.

I have not sampled the rows at random, as I felt it would affect the models if the two sentences of a pair ended up in different sets.

For example, if a stereotype sentence is in the training set, the model will learn to identify it as a stereotype. If its non-stereotype counterpart is then in the test set, the model is likely to identify that as a stereotype too, because the two sentences are so similar.
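
Keeping pairs together does not by itself rule out shuffling: whole pairs can be shuffled instead of single rows. The sketch below shows that alternative on toy data. It is not what this notebook does, and `shuffled_pair_split`, `toy_rows`, and the `seed` value are hypothetical. It assumes the rows are interleaved so that each pair occupies two consecutive positions.

```python
import random

def shuffled_pair_split(rows, train_fraction=0.8, seed=0):
    """Split interleaved rows into train/test by shuffling whole pairs."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (two per pair)")
    # Group consecutive rows into [sent_more, sent_less] pairs.
    pairs = [rows[i:i + 2] for i in range(0, len(rows), 2)]
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_fraction)
    # Flatten the pairs back into rows; partners always stay in the same set.
    train = [row for pair in pairs[:n_train] for row in pair]
    test = [row for pair in pairs[n_train:] for row in pair]
    return train, test

# Five toy pairs of (sentence, label) tuples
toy_rows = []
for i in range(5):
    toy_rows.append((f"pair {i} more", 1))
    toy_rows.append((f"pair {i} less", 0))

train_rows, test_rows = shuffled_pair_split(toy_rows)
```

With five pairs and an 80/20 split, four pairs (eight rows) land in training and one pair (two rows) in testing, and no pair is divided between the two.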

In [11]:
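# Rows are stored in consecutive pairs and 2412 is even,
# so no sentence pair is divided between the two sets
train_finetune_dataframe = finetune_dataframe[:2412]
test_finetune_dataframe = finetune_dataframe[2412:]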
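
The cut index can also be derived from the number of rows rather than typed in by hand. This is a rough sketch under assumptions: `pair_aligned_cut` is a hypothetical helper, and the row count of 3016 is only illustrative (two rows for each of the 1,508 pairs in the unmodified CrowS-Pairs release), since the actual length of `finetune_dataframe` is not shown here.

```python
def pair_aligned_cut(n_rows: int, train_fraction: float = 0.8) -> int:
    """Return a row index that keeps every two-row pair on one side of the cut."""
    if n_rows % 2 != 0:
        raise ValueError("expected two rows per sentence pair")
    n_pairs = n_rows // 2
    # Round down at the pair level, then convert back to a row index.
    train_pairs = int(n_pairs * train_fraction)
    return 2 * train_pairs

cut_small = pair_aligned_cut(10)    # 5 pairs, 4 go to training: index 8
cut_full = pair_aligned_cut(3016)   # 1508 pairs, 1206 go to training: index 2412
```

Because the result is always even, slicing with `finetune_dataframe[:cut]` and `finetune_dataframe[cut:]` would never divide a pair.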

In [12]:
train_finetune_dataframe.to_csv('training_CrowS-Pairs.csv', index=False)
test_finetune_dataframe.to_csv('testing_CrowS-Pairs.csv', index=False)