## This notebook addresses the task of creating a vocabulary of toxic words for the dataset.

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/filtered.tsv", sep="\t", index_col=0)

df.head()

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348


## Try to find which words are removed in translation

In [6]:
def diff(reference: str, translation: str):
    """
    Return the words that appear in the reference but not in the translation.
    The result comes from a set difference, so its order is arbitrary.
    """

    # lower case all strings
    reference = reference.lower()
    translation = translation.lower()

    reference = reference.split()
    translation = translation.split()

    return list(set(reference) - set(translation))

In [7]:
df["reference"][0], df["translation"][0], diff(df["reference"][0], df["translation"][0])

('If Alkar is flooding her with psychic waste, that explains the high level of neurotransmitters.',
 'if Alkar floods her with her mental waste, it would explain the high levels of neurotransmitter.',
 ['is',
  'level',
  'that',
  'psychic',
  'flooding',
  'explains',
  'neurotransmitters.'])
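
The output above still carries punctuation: `'neurotransmitters.'` keeps its trailing period because `diff` splits on whitespace only. A minimal sketch of one way to handle this, assuming it is acceptable to strip punctuation from both ends of each token, is shown below. The name `clean_diff` is hypothetical and is not part of the notebook or the repository.

```python
import string
from typing import List, Set


def clean_diff(reference: str, translation: str) -> List[str]:
    """Return words in `reference` but not in `translation`,
    ignoring case and punctuation at the edges of each token."""

    def words(text: str) -> Set[str]:
        # strip punctuation from both ends of each whitespace-separated token
        stripped = (token.strip(string.punctuation) for token in text.lower().split())
        # drop tokens that consisted of punctuation only
        return {token for token in stripped if token}

    # sort so that the output order is reproducible
    return sorted(words(reference) - words(translation))


print(clean_diff("Now you're getting nasty.", "you're becoming disgusting."))
# → ['getting', 'nasty', 'now']
```

Apostrophes inside a word (as in `you're`) are kept, since only leading and trailing punctuation is stripped.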

In [8]:
import spacy
!python -m spacy download en_core_web_md

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_md")

In [9]:
doc1 = nlp(df["reference"][0])
doc2 = nlp(df["translation"][0])

doc1, doc2

(If Alkar is flooding her with psychic waste, that explains the high level of neurotransmitters.,
 if Alkar floods her with her mental waste, it would explain the high levels of neurotransmitter.)

In [16]:
# use spacy to get the lemmas
from typing import List


def get_lemmas(text: str) -> List[str]:
    """
    get_lemmas returns a list of lemmas of a given text
    """

    # lower case all strings
    text = text.lower()

    # tokenize
    doc = nlp(text)

    # get lemmas
    lemmas = [token.lemma_ for token in doc]

    return lemmas


def lemma_diff(l1: List[str], l2: List[str]) -> List[str]:
    """
    lemma_diff returns which lemmas were removed in translation compared to reference
    """

    return list(set(l1) - set(l2))


res = lemma_diff(get_lemmas(df["reference"][0]), get_lemmas(df["translation"][0]))

res

['that', 'psychic', 'be']

## Check the part of speech of the removed words


In [22]:
# get the part-of-speech tag of each removed lemma
doc = nlp(" ".join(res))

[(token.pos_, token) for token in doc]

[('SCONJ', that), ('ADJ', psychic), ('VERB', be)]

## Conclusion

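As the example shows, this is a rather rough method of collecting toxic words because it produces many false positives: for the first row, the lemma difference is `that`, `psychic`, and `be`, none of which is toxic. Still, it is a good start.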
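
One cheap way to reduce such false positives is to discard function words before adding candidates to the vocabulary. The sketch below is an illustration under assumptions, not the approach used in the repository: `STOP_WORDS` is a deliberately tiny, hypothetical list, and `drop_stop_words` is a hypothetical helper.

```python
from typing import Iterable, List

# Hypothetical, deliberately tiny stop-word list, for illustration only
STOP_WORDS = {"a", "an", "the", "is", "be", "that", "it", "of", "to", "with"}


def drop_stop_words(candidates: Iterable[str]) -> List[str]:
    """Keep only the candidates that are not in STOP_WORDS."""
    return [word for word in candidates if word.lower() not in STOP_WORDS]


print(drop_stop_words(["that", "psychic", "be"]))
# → ['psychic']
```

Note that `psychic` survives the filter, so a stop-word list only removes part of the noise.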

The code that creates the unfiltered vocabulary of toxic words is in [make_toxic_set.py](../src/data/make_toxic_set.py). Computing the differences on lemmas turned out to be too computationally heavy, so that script uses the simple difference between word sets.
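
To make the word-set approach concrete, here is a minimal sketch of how per-row differences could be accumulated into a vocabulary. It is not the contents of `make_toxic_set.py`: the function `build_vocabulary` and the `pairs` list are hypothetical. It also assumes that each pair is ordered as (more toxic text, less toxic text); in the five rows shown by `df.head()`, `trn_tox` is higher than `ref_tox`, so the example puts the translation first.

```python
from collections import Counter
from typing import Iterable, Tuple


def build_vocabulary(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each word occurs in the first text of a pair
    but not in the second one."""
    counts: Counter = Counter()
    for toxic, neutral in pairs:
        # same whitespace-based set difference as in the notebook's diff()
        removed = set(toxic.lower().split()) - set(neutral.lower().split())
        counts.update(removed)
    return counts


# (more toxic text, less toxic text), taken from the rows shown by df.head()
pairs = [
    ("you're becoming disgusting.", "Now you're getting nasty."),
    ("I have orders to kill her.", "I've got orders to put her down."),
]

vocab = build_vocabulary(pairs)
print(vocab.most_common())
```

Keeping counts rather than a plain set makes it possible to apply a frequency threshold later, when the unfiltered vocabulary is cleaned up.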
