In [1]:
import pandas as pd

Let's load the data:

In [2]:
df = pd.read_csv('../data/raw/filtered.tsv', delimiter='\t', index_col=0)
df

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
...,...,...,...,...,...,...
577772,You didn't know that Estelle had stolen some f...,you didn't know that Estelle stole your fish f...,0.870322,0.030769,0.000121,0.949143
577773,It'il suck the life out of you!,you'd be sucked out of your life!,0.722897,0.058824,0.996124,0.215794
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.617511,0.212121,0.984538,0.000049
577775,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care.",0.679613,0.358209,0.991945,0.000124


# Source and Target disambiguation

The `reference` and `translation` columns are not ordered by toxicity: in some rows `reference` is the more toxic sentence (`ref_tox > trn_tox`), in others it is `translation`. We therefore introduce `source` and `target` columns and assign each pair so that `toxicity(source) > toxicity(target)`.

With the more toxic sentence always in `source` and its less toxic counterpart in `target`, these columns are a better choice for training and fine-tuning the models.
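One way to quantify how mixed the two columns are is to compute the share of rows in which `reference` is the more toxic side. The sketch below runs on a small toy frame with invented scores, not on the real dataset; `toy` and `share_ref_more_toxic` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; the scores are invented for illustration.
toy = pd.DataFrame({
    'ref_tox': [0.01, 0.98, 0.07, 0.99],
    'trn_tox': [0.95, 0.02, 0.99, 0.01],
})

# The mean of a boolean Series is the fraction of True values.
share_ref_more_toxic = (toy['ref_tox'] > toy['trn_tox']).mean()
print(share_ref_more_toxic)
```

A share far from both 0 and 1 would indicate that neither column can be used directly as the toxic side.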

In [3]:
# True where `reference` is the more toxic sentence of the pair.
ref_is_source = df['ref_tox'] > df['trn_tox']
# Vectorized swap: keep `reference` where the mask holds, otherwise take `translation` (and vice versa).
df['source'] = df['reference'].where(ref_is_source, df['translation'])
df['target'] = df['translation'].where(ref_is_source, df['reference'])
df.sample(5)

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox,source,target
473673,"And I hope you like the taste of skate, dorko!","I hope you like your skates, idiot!",0.678681,0.234043,0.381982,0.999682,"I hope you like your skates, idiot!","And I hope you like the taste of skate, dorko!"
358109,"Oh, shit, from the party.",from the party.,0.646242,0.384615,0.999491,0.000133,"Oh, shit, from the party.",from the party.
98177,"Look, this fucking recording studio ain't big ...","look, this recording studio isn't big enough f...",0.802288,0.037037,0.999136,4.5e-05,"Look, this fucking recording studio ain't big ...","look, this recording studio isn't big enough f..."
259257,"Well, maybe those fat cracker bears should lea...",maybe your fat white bears should learn to swim.,0.724546,0.209677,0.017823,0.979685,maybe your fat white bears should learn to swim.,"Well, maybe those fat cracker bears should lea..."
303876,What the he** are you doing?,what the hell are you doing?,0.84464,0.0,0.109258,0.975427,what the hell are you doing?,What the he** are you doing?
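
The swap can be sanity-checked by confirming that the toxicity score attached to `source` is never lower than the one attached to `target`. The following is a minimal sketch on a toy frame with made-up rows; the helper columns `src_tox` and `tgt_tox` are assumptions introduced here and do not exist in the original data.

```python
import pandas as pd

# Toy stand-in for the real data; sentences and scores are invented.
toy = pd.DataFrame({
    'reference':   ['you are nice', 'shut up, idiot'],
    'translation': ['you are awful scum', 'please be quiet'],
    'ref_tox':     [0.01, 0.98],
    'trn_tox':     [0.95, 0.02],
})

# Same rule as in the notebook: the more toxic sentence becomes `source`.
ref_is_source = toy['ref_tox'] > toy['trn_tox']
toy['source'] = toy['reference'].where(ref_is_source, toy['translation'])
toy['target'] = toy['translation'].where(ref_is_source, toy['reference'])

# Hypothetical helper columns: the scores that travel with each side.
toy['src_tox'] = toy['ref_tox'].where(ref_is_source, toy['trn_tox'])
toy['tgt_tox'] = toy['trn_tox'].where(ref_is_source, toy['ref_tox'])

# Every pair should now be ordered from more toxic to less toxic.
assert (toy['src_tox'] >= toy['tgt_tox']).all()
```

The check uses `>=` rather than `>` because rows with equal scores, if any exist, cannot be ordered strictly.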


## Save the data (to be moved to /src later)

In [40]:
df.to_csv('../data/interim/distinguished.tsv', sep='\t', index=False)

In [42]:
# Read the saved file back to confirm that the new columns survived the round trip.
df_dvizh = pd.read_csv('../data/interim/distinguished.tsv', delimiter='\t')
df_dvizh

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox,source,target
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983,"if Alkar floods her with her mental waste, it ...","If Alkar is flooding her with psychic waste, t..."
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039,you're becoming disgusting.,Now you're getting nasty.
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068,"well, we can spare your life.","Well, we could spare your life, for one."
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215,"monkey, you have to wake up.","Ah! Monkey, you've got to snap out of it."
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348,I have orders to kill her.,I've got orders to put her down.
...,...,...,...,...,...,...,...,...
577772,You didn't know that Estelle had stolen some f...,you didn't know that Estelle stole your fish f...,0.870322,0.030769,0.000121,0.949143,you didn't know that Estelle stole your fish f...,You didn't know that Estelle had stolen some f...
577773,It'il suck the life out of you!,you'd be sucked out of your life!,0.722897,0.058824,0.996124,0.215794,It'il suck the life out of you!,you'd be sucked out of your life!
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.617511,0.212121,0.984538,0.000049,"I can't fuckin' take that, bruv.",I really can't take this.
577775,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care.",0.679613,0.358209,0.991945,0.000124,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care."
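
Instead of comparing the two outputs by eye, the round trip could also be checked programmatically. The sketch below writes a toy frame to an in-memory buffer rather than to `../data/interim/`, so it can be run without touching the project files; the rows are invented and `restored` is a hypothetical name.

```python
import io

import pandas as pd

# Toy stand-in for the processed frame; the rows are invented.
toy = pd.DataFrame({
    'source': ['shut up, idiot', 'what the hell'],
    'target': ['please be quiet', 'what is going on'],
    'ref_tox': [0.98, 0.91],
})

# Write as TSV without the index, mirroring the to_csv call above.
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False)

# Rewind and read back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')

# Raises AssertionError if columns, dtypes, or values differ
# (floats are compared with a small tolerance by default).
pd.testing.assert_frame_equal(toy, restored)
```

Because the file is written with `index=False`, the frame that is read back gets a fresh default index starting at 0.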
