### Making a Hugging Face dataset

### Preprocessing

In [1]:
import pandas as pd

Read the .tsv file from the "data/raw" folder.
I also delete the first (unnamed) column and the columns for similarity and length, since I will not use them for now.

In [2]:
df = pd.read_csv('../data/raw/filtered.tsv', sep='\t')

# delete unnamed column and columns for similarity and length
df = df.drop(df.columns[0], axis=1)
df = df.drop(df.columns[2], axis=1)
df = df.drop(df.columns[2], axis=1)
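
Dropping by position depends on the exact column order of the file. As a minimal sketch, the same step can select the columns to keep by name. The toy frame and the names `similarity` and `length_diff` are assumptions for illustration. Only `reference`, `translation`, `ref_tox` and `trn_tox` are confirmed by the notebook output.

```python
import pandas as pd

# Toy frame mimicking the raw layout. "similarity" and "length_diff" are
# assumed names for the two feature columns that get dropped.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "reference": ["sentence a", "sentence b"],
    "translation": ["sentence c", "sentence d"],
    "similarity": [0.8, 0.7],
    "length_diff": [0.1, 0.2],
    "ref_tox": [0.01, 0.90],
    "trn_tox": [0.98, 0.05],
})

# Keep columns by name, so a reordered file cannot silently drop the wrong ones.
keep = ["reference", "translation", "ref_tox", "trn_tox"]
slim = raw[keep].copy()

print(slim.columns.tolist())
```

Selecting by name raises a `KeyError` if an expected column is missing, which is usually easier to debug than a wrong column being dropped.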

Sort each pair so that the less toxic sentence comes first: `reference` should hold the less toxic sentence and `translation` the more toxic one.

In [3]:
temp_df = df.copy()

# rows where the reference is more toxic than the translation
swap = temp_df.ref_tox > temp_df.trn_tox

# swap sentences and toxicity scores in those rows, reading from the untouched copy
df.loc[swap, 'reference'] = temp_df.loc[swap, 'translation']
df.loc[swap, 'translation'] = temp_df.loc[swap, 'reference']
df.loc[swap, 'trn_tox'] = temp_df.loc[swap, 'ref_tox']
df.loc[swap, 'ref_tox'] = temp_df.loc[swap, 'trn_tox']

df.head(6)

Unnamed: 0,reference,translation,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.009402,0.999348
5,I'm not going to breed kids with a genetic dis...,I'm not gonna have a child... ...with the same...,0.035846,0.950956
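
The pairwise swap can also be written without a temporary copy. The sketch below runs on a toy frame (the sentences and scores are made up) and exchanges two columns at once by assigning a plain NumPy array, so pandas matches values by position instead of by column label.

```python
import pandas as pd

# Toy pairs: row 0 is in the wrong order, row 1 is already correct.
pairs = pd.DataFrame({
    "reference": ["rude text", "polite text"],
    "translation": ["calm text", "harsh text"],
    "ref_tox": [0.9, 0.1],
    "trn_tox": [0.2, 0.8],
})

# Rows where the reference is the more toxic sentence of the pair.
swap = pairs["ref_tox"] > pairs["trn_tox"]

# .to_numpy() strips the column labels, so the right-hand side is assigned
# positionally and the two columns exchange their values.
pairs.loc[swap, ["reference", "translation"]] = (
    pairs.loc[swap, ["translation", "reference"]].to_numpy()
)
pairs.loc[swap, ["ref_tox", "trn_tox"]] = (
    pairs.loc[swap, ["trn_tox", "ref_tox"]].to_numpy()
)

print(pairs)
```

Without `.to_numpy()` pandas would align the right-hand side on column names and the assignment would leave the frame unchanged.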


The check below confirms that every pair is now in the right order.

In [4]:
print((df['trn_tox'] >= df['ref_tox']).value_counts())

True    577777
Name: count, dtype: int64


In addition, the toxicity scores of the reference (non-toxic) sentences now lie in the range [0, 0.5], and those of the translation (toxic) sentences lie in the range [0.5, 1].

In [151]:
print("Max and min values for reference:")
print("max:", df.ref_tox.max(), ", ", "min: ", df.ref_tox.min())
print("Max and min values for translation:")
print("max:", df.trn_tox.max(), ", ", "min: ", df.trn_tox.min())

Max and min values for reference:
max: 0.4994940161705017 ,  min:  3.283871046733111e-05
Max and min values for translation:
max: 0.9997304081916808 ,  min:  0.5001394152641296


The raw data consists of pairs of non-toxic and toxic sentences along with 4 numeric features (similarity, length, and the two toxicity scores).
For now I propose to use only the first two columns and label them as "non-toxic" and "toxic".

In [152]:
sentences = df.iloc[:, 0:2]
sentences.columns = ["non-toxic", "toxic"]
sentences.head()

Unnamed: 0,non-toxic,toxic
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ..."
1,Now you're getting nasty.,you're becoming disgusting.
2,"Well, we could spare your life, for one.","well, we can spare your life."
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up."
4,I've got orders to put her down.,I have orders to kill her.


### Making dataset
To make the dataset I will use the Hugging Face Datasets library.

In [153]:
from tqdm import tqdm
from datasets import DatasetDict, Dataset, Value

I used this [tutorial](https://kl1p.com/huggingface-dataset-from-pandas-with-code-examples/) to make a dataset from a pandas DataFrame.

First we need to define the schema of the dataset. I will use the structure of the wmt16 dataset, as in Lab 4 of the PMLDL course, but with "non-toxic" sentences as the source and "toxic" sentences as the target.

In [154]:
# Define the schema of the dataset
schema = {
    "train": {
        "translation" : {
            "non-toxic": Value("string"),
            "toxic": Value("string"),
        },
    },
    "validation": {
        "translation" : {
            "non-toxic": Value("string"),
            "toxic": Value("string"),
        },
    },
    "test": {
        "translation" : {
            "non-toxic": Value("string"),
            "toxic": Value("string"),
        },
    },
}

I will split the data into train, validation and test sets. As the dataset is quite big, I will use 90% for training, 5% for validation and 5% for testing.

In [155]:
train_len = int(len(sentences)*0.9)
val_len = int(len(sentences)*0.05)
test_len = int(len(sentences)*0.05)

# To get same structure as in wmt16 dataset, I will use pairwise split.
train_pairs = []
val_pairs = []
test_pairs = []

# Create dataset dict
dataset = DatasetDict(schema)


# Add pairs to lists
for i in tqdm(range(train_len)):
    train_pairs.append({"non-toxic": sentences.iloc[i, 0], "toxic": sentences.iloc[i, 1]})
    
for i in tqdm(range(train_len, train_len+val_len)):
    val_pairs.append({"non-toxic": sentences.iloc[i, 0], "toxic": sentences.iloc[i, 1]})
    
for i in tqdm(range(train_len+val_len, train_len+val_len+test_len)):
    test_pairs.append({"non-toxic": sentences.iloc[i, 0], "toxic": sentences.iloc[i, 1]})

100%|██████████| 519999/519999 [00:18<00:00, 28554.97it/s]
100%|██████████| 28888/28888 [00:00<00:00, 30084.92it/s]
100%|██████████| 28888/28888 [00:00<00:00, 30164.41it/s]
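
The three split sizes add up to 519999 + 28888 + 28888 = 577775, two rows fewer than the 577777 rows in the frame, because each `int()` call truncates. A minimal sketch of an alternative on toy data (the sentences are placeholders): it builds the pair lists with `to_dict("records")` instead of a row loop and lets the test slice run to the end, so no rows are lost. This is a variation on the loop above, not the same split.

```python
import pandas as pd

# 21 placeholder rows, chosen so that 90% / 5% do not divide evenly.
sentences = pd.DataFrame({
    "non-toxic": [f"polite {i}" for i in range(21)],
    "toxic": [f"rude {i}" for i in range(21)],
})

n = len(sentences)
train_end = int(n * 0.9)               # 18 rows for training
val_end = train_end + int(n * 0.05)    # 1 row for validation

# to_dict("records") turns each row into {"non-toxic": ..., "toxic": ...}.
train_pairs = sentences.iloc[:train_end].to_dict("records")
val_pairs = sentences.iloc[train_end:val_end].to_dict("records")
# The open-ended slice keeps the rows that truncation would otherwise drop.
test_pairs = sentences.iloc[val_end:].to_dict("records")

print(len(train_pairs), len(val_pairs), len(test_pairs))
```

The resulting lists have the same shape as `train_pairs`, `val_pairs` and `test_pairs` in the notebook, so they can be passed to `Dataset.from_dict` in the same way.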


In [156]:
# Add pairs through dataset dict
dataset["train"] = Dataset.from_dict({"translation": train_pairs})
dataset["validation"] = Dataset.from_dict({"translation": val_pairs})
dataset["test"] = Dataset.from_dict({"translation": test_pairs})

In [157]:
dataset["train"][:2]

{'translation': [{'non-toxic': 'If Alkar is flooding her with psychic waste, that explains the high level of neurotransmitters.',
   'toxic': 'if Alkar floods her with her mental waste, it would explain the high levels of neurotransmitter.'},
  {'non-toxic': "Now you're getting nasty.",
   'toxic': "you're becoming disgusting."}]}

Save the dataset to "data/interim/justification_dataset".

In [158]:
dataset.save_to_disk("../data/interim/justification_dataset")

Saving the dataset (0/1 shards):   0%|          | 0/519999 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/28888 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/28888 [00:00<?, ? examples/s]

**Note:** This Python notebook only demonstrates the data preprocessing and my way of making the dataset. The script that builds the dataset is "make_dataset.py" in the "src/data" folder. Instructions for running it are in the README.md file.