In [36]:
import pandas as pd
import string

First, take a look at the data:

In [37]:
df = pd.read_csv('../data/raw/filtered.tsv', delimiter='\t', index_col=0)
df

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
...,...,...,...,...,...,...
577772,You didn't know that Estelle had stolen some f...,you didn't know that Estelle stole your fish f...,0.870322,0.030769,0.000121,0.949143
577773,It'il suck the life out of you!,you'd be sucked out of your life!,0.722897,0.058824,0.996124,0.215794
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.617511,0.212121,0.984538,0.000049
577775,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care.",0.679613,0.358209,0.991945,0.000124


# Basic Preprocessing

We remove punctuation but keep the apostrophe, since it carries meaning in contractions such as "you're".

In [30]:
def basic_preprocessing(text: str) -> str:
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation.replace("'", "")))
    # lowercase
    text = text.lower()
    return text

# apply basic preprocessing
for column in ['reference', 'translation']:
    df[f'{column}_processed'] = df[column].apply(basic_preprocessing)

df.sample(5)

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox,reference_processed,translation_processed,source,target
407555,"I got dominoes, and what the hell is going on?",what the hell is going on here?,0.720545,0.319149,0.023868,0.877907,i got dominoes and what the hell is going on,what the hell is going on here,"I got dominoes, and what the hell is going on?",what the hell is going on here?
454088,"You bury yours, Marcus.","you're killing yourself, Marcus.",0.662741,0.272727,0.015029,0.990032,you bury yours marcus,you're killing yourself marcus,"You bury yours, Marcus.","you're killing yourself, Marcus."
247574,I like your swagger.,I like your guts.,0.728945,0.142857,0.000186,0.971967,i like your swagger,i like your guts,I like your swagger.,I like your guts.
98678,We get to hang ourselves?,are we hanging around?,0.641338,0.115385,0.974838,4.6e-05,we get to hang ourselves,are we hanging around,are we hanging around?,We get to hang ourselves?
497575,White.,White!,0.949085,0.0,0.002257,0.800949,white,white,White.,White!
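
For a frame of this size, the same cleaning can also be expressed with the pandas `.str` accessor, building the deletion table only once. This is a minimal sketch on a few sentences taken from the data, not a replacement for the cell above; the names `table`, `sample` and `processed` are illustrative.

```python
import string

import pandas as pd

# Build the deletion table once: every punctuation mark except the apostrophe
table = str.maketrans('', '', string.punctuation.replace("'", ""))

# A few sentences from the dataset, used here only for illustration
sample = pd.Series([
    "Ah! Monkey, you've got to snap out of it.",
    "Now you're getting nasty.",
    "White!",
])

# Series.str.translate applies the table to each string,
# Series.str.lower then lowercases the result
processed = sample.str.translate(table).str.lower()
print(processed.tolist())
```

Contractions such as "you've" and "you're" keep their apostrophe, while commas, periods and exclamation marks are dropped.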


# Source and Target disambiguation

The `reference` and `translation` fields are not ordered by toxicity: in some pairs the reference is the more toxic text, in others the translation is. We introduce the fields `source` and `target` and assign the texts of each pair so that `toxicity(source) > toxicity(target)`.

In [38]:
(df.ref_tox > df.trn_tox).mean()

0.5523618974102465

In about 55% of the pairs the more toxic text is in `reference`; in the rest, `translation` has to become the source.

In [39]:
# True where the reference is the more toxic text of the pair
ref_more_toxic = df['ref_tox'] > df['trn_tox']
# The more toxic text becomes the source, the other one the target
df['source'] = df['reference'].where(ref_more_toxic, df['translation'])
df['target'] = df['translation'].where(ref_more_toxic, df['reference'])
df.sample(5)

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox,source,target
220179,Let's have a toast - to the future generation ...,let's toast to a new generation of consumerism...,0.637896,0.057692,0.995674,0.000467,Let's have a toast - to the future generation ...,let's toast to a new generation of consumerism...
489073,You got him in the left shoulder.,I shot him in the shoulder.,0.623977,0.176471,0.004725,0.982798,I shot him in the shoulder.,You got him in the left shoulder.
235042,"(rapid gunfire) CALLEN: Deeks, Kensi, shooters...","Deeks, Kensi, shooters are going your way!",0.773226,0.328125,0.02915,0.718189,"Deeks, Kensi, shooters are going your way!","(rapid gunfire) CALLEN: Deeks, Kensi, shooters..."
467105,In the spring and summer they ate all the godd...,they ate all the flowers in the spring and sum...,0.887404,0.116667,0.999108,7.1e-05,In the spring and summer they ate all the godd...,they ate all the flowers in the spring and sum...
254415,They're smug and they are condescending.,they're arrogant and arrogant.,0.654547,0.243902,0.076535,0.998256,they're arrogant and arrogant.,They're smug and they are condescending.
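
The assignment rule can be sanity-checked on a small hand-made frame. The sketch below assumes the same column names as the dataset; the toy rows and the helper columns `src_tox` and `tgt_tox` are hypothetical and are not part of the original data.

```python
import pandas as pd

# Toy pairs: reference more toxic, translation more toxic, and a tie
toy = pd.DataFrame({
    'reference':   ['rude text', 'polite text', 'same text'],
    'translation': ['mild text', 'harsh text', 'same text too'],
    'ref_tox': [0.99, 0.01, 0.50],
    'trn_tox': [0.02, 0.98, 0.50],
})

# Same rule as in the notebook: the more toxic text becomes the source
ref_more_toxic = toy['ref_tox'] > toy['trn_tox']
toy['source'] = toy['reference'].where(ref_more_toxic, toy['translation'])
toy['target'] = toy['translation'].where(ref_more_toxic, toy['reference'])

# Hypothetical helper columns holding the toxicity of source and target
toy['src_tox'] = toy[['ref_tox', 'trn_tox']].max(axis=1)
toy['tgt_tox'] = toy[['ref_tox', 'trn_tox']].min(axis=1)

# The source is never less toxic than the target
assert (toy['src_tox'] >= toy['tgt_tox']).all()
print(toy[['source', 'target', 'src_tox', 'tgt_tox']])
```

Note the tie in the last row: because the comparison is strict, a pair with equal scores keeps `translation` as the source, so the inequality between source and target toxicity holds only as `>=` for such rows.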


## Save the data (to be moved to /src later)

In [40]:
df.to_csv('../data/interim/distinguished.tsv', sep='\t', index=False)

In [42]:
df_dvizh = pd.read_csv('../data/interim/distinguished.tsv', delimiter='\t')
df_dvizh

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox,source,target
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983,"if Alkar floods her with her mental waste, it ...","If Alkar is flooding her with psychic waste, t..."
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039,you're becoming disgusting.,Now you're getting nasty.
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068,"well, we can spare your life.","Well, we could spare your life, for one."
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215,"monkey, you have to wake up.","Ah! Monkey, you've got to snap out of it."
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348,I have orders to kill her.,I've got orders to put her down.
...,...,...,...,...,...,...,...,...
577772,You didn't know that Estelle had stolen some f...,you didn't know that Estelle stole your fish f...,0.870322,0.030769,0.000121,0.949143,you didn't know that Estelle stole your fish f...,You didn't know that Estelle had stolen some f...
577773,It'il suck the life out of you!,you'd be sucked out of your life!,0.722897,0.058824,0.996124,0.215794,It'il suck the life out of you!,you'd be sucked out of your life!
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.617511,0.212121,0.984538,0.000049,"I can't fuckin' take that, bruv.",I really can't take this.
577775,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care.",0.679613,0.358209,0.991945,0.000124,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care."
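
A quick way to gain confidence in the TSV round trip is to write a small frame and compare it with what is read back. This is a sketch on made-up rows and a temporary file (`example.tsv` is a hypothetical name), not on the real dataset.

```python
import os
import tempfile

import pandas as pd

# Made-up rows with an apostrophe and a comma, which must survive the round trip
frame = pd.DataFrame({
    'source': ["I can't take that, bruv.", 'text with, a comma'],
    'target': ["I really can't take this.", 'plain text'],
    'score': [0.5, 0.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.tsv')
    # index=False leaves the index out of the file
    frame.to_csv(path, sep='\t', index=False)
    # On reading, pandas creates a fresh 0..N-1 index
    loaded = pd.read_csv(path, sep='\t')

# Raises an AssertionError if values, dtypes or column names differ
pd.testing.assert_frame_equal(frame, loaded)
```

Because the file is written with `index=False`, the original row labels are not stored; this is harmless here since the index is a plain running number.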
