# Field Exploration

## 1. Importing Libraries

In [13]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
import pickle
import sys
import warnings
from src.data.make_dataset import TextDataset
warnings.filterwarnings('ignore')

## 2. Importing Dataset

In [129]:
# Unpickling needs the TextDataset class to be importable (see the import above).
with open('../data/interim/text_dataset.pkl', 'rb') as f:
    df = pickle.load(f).data

## 3. Exploring Dataset

After preprocessing, the dataset has three columns:
- `toxic`: tokens of the toxic version of the sentence
- `normal`: tokens of the non-toxic version of the sentence
- `toxic_reduction`: difference in toxicity between the toxic and the normal version of the sentence

In [130]:
df.sort_values('toxic_reduction', ascending=True).head()

| | toxic | normal | toxic_reduction |
|---|---|---|---|
| 302957 | [going, let, pas, mass, spectrometer, gon, na,... | [fine, gon, na, run, mass, spec, find, making,... | 0.500002 |
| 274475 | [tie, put, cuff, control] | [going, tie, going, handcuff, going, take, con... | 0.500002 |
| 368899 | [smothered, death] | [suffocated] | 0.500002 |
| 321294 | [kill, fire, fire, forever, worship] | [kill, fire, plant, worship] | 0.500002 |
| 128885 | [remember, must, harmed, worthless] | [remember, harmed, worthless, trade] | 0.500004 |


Sentences with a `toxic_reduction` close to 0.5 are almost the same in both versions. The smallest value in the dataset is just above 0.5, so no sentence has a `toxic_reduction` below 0.5.
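The lower bound can be checked directly instead of being read off the sorted head. The following is a minimal sketch on a toy dataframe; `toy_df` and its values are made up for illustration and only mimic the column layout of `df`.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the rows and scores are invented.
toy_df = pd.DataFrame({
    "toxic": [["smothered", "death"], ["stupid", "game"], ["think", "shit", "dawg"]],
    "normal": [["suffocated"], ["game"], ["think", "buddy"]],
    "toxic_reduction": [0.50, 0.93, 0.99],
})

# Smallest toxicity reduction in the data.
min_reduction = toy_df["toxic_reduction"].min()

# Number of rows whose reduction falls below 0.5.
n_below_half = int((toy_df["toxic_reduction"] < 0.5).sum())

print(min_reduction, n_below_half)
```

Running the same two expressions on the real `df` would confirm whether 0.5 is a hard lower bound.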

In [131]:
df.sort_values('toxic_reduction', ascending=False).head()

| | toxic | normal | toxic_reduction |
|---|---|---|---|
| 513596 | [stupid, meaningful, relationship, stella] | [meaningful, relationship, stella] | 0.999681 |
| 155243 | [started, firm, day, said, goodbye, stupid, bos] | [day, started, business, said, goodbye] | 0.999678 |
| 506123 | [idiot, station, sure] | [type, station] | 0.999677 |
| 336425 | [think, shit, dawg] | [think, buddy] | 0.999677 |
| 429942 | [like, stupid, game, tom] | [like, game, tom] | 0.999677 |


In the sentences with the largest toxicity reduction, the normal version is almost the same as the toxic version, except that some words are replaced by synonyms or removed entirely. This suggests a simple baseline model: remove the toxic words from the sentence.
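The baseline idea can be sketched in a few lines. This is only an illustration, not the project's actual model: the function name `remove_toxic_words` and the hand-picked `example_vocab` are assumptions made for this example.

```python
def remove_toxic_words(tokens, toxic_vocab):
    """Return the tokens that are not in toxic_vocab, preserving their order."""
    return [token for token in tokens if token not in toxic_vocab]


# Hypothetical vocabulary, chosen by hand for this example.
example_vocab = {"stupid", "idiot", "shit"}

cleaned = remove_toxic_words(["like", "stupid", "game", "tom"], example_vocab)
print(cleaned)  # ['like', 'game', 'tom']
```

Using a set for the vocabulary keeps each membership test constant-time, which matters when the function is applied to every row of the dataset.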

### Finding the difference between the toxic and normal version of the sentence

In [133]:
# Vocabulary of each column: every distinct token that appears in it.
all_words_from_toxic_sentences = {word for sentence in df['toxic'].values for word in sentence}
all_words_from_normal_sentences = {word for sentence in df['normal'].values for word in sentence}

In [134]:
toxic_words = all_words_from_toxic_sentences.difference(all_words_from_normal_sentences)

In [135]:
len(toxic_words)

18936

In [150]:
import random

# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+.
random.sample(sorted(toxic_words), 10)

['masteryoda',
 'nosiest',
 'fuckery',
 'stuporous',
 'celestic',
 'ofbones',
 'stanky',
 'vanja',
 'somethingmore',
 'dongrangov']

Many words in this set are not toxic at all: they are misspellings, merged tokens, or rare words that simply never occur in the normal sentences.
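One possible way to reduce this noise, assuming that misspellings and one-off tokens are rare, is to keep only the words that occur at least a few times in the toxic sentences. The sketch below uses made-up toy sentences, and the threshold `min_count` is a hypothetical parameter that is not part of the original analysis.

```python
from collections import Counter

# Toy token lists standing in for df['toxic'] and df['normal'].
toy_toxic = [["stupid", "game"], ["stupid", "bos"], ["fuckery", "game"], ["stoopid", "station"]]
toy_normal = [["game"], ["bos"], ["station"]]

toxic_counts = Counter(word for sentence in toy_toxic for word in sentence)
normal_vocab = {word for sentence in toy_normal for word in sentence}

# Words seen only in toxic sentences (same set difference as in the notebook).
toxic_only = set(toxic_counts) - normal_vocab

# Hypothetical threshold: drop words that occur fewer than min_count times.
min_count = 2
frequent_toxic_only = {word for word in toxic_only if toxic_counts[word] >= min_count}

print(sorted(toxic_only))
print(sorted(frequent_toxic_only))
```

The trade-off is visible even in the toy data: the filter drops the misspelling `stoopid`, but it also drops `fuckery`, a genuinely toxic word that happens to be rare.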

### Finding the most common toxic words

In [151]:
from collections import Counter

toxic_words_counter = Counter([word for sentence in df['toxic'].values for word in sentence])

In [152]:
toxic_words_counter.most_common(10)

[('like', 34989),
 ('shit', 32183),
 ('fucking', 30957),
 ('get', 26970),
 ('know', 22888),
 ('kill', 21720),
 ('want', 21317),
 ('damn', 21188),
 ('fuck', 20768),
 ('hell', 20626)]

In contrast, most of the most common words in toxic sentences are in fact toxic. Some of them, such as 'like', are ordinary words that are just used more often in toxic sentences than in normal ones.
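Whether a frequent word is toxic or merely common can be estimated by comparing its counts in the two columns. The following is a minimal sketch on invented toy sentences; the helper `toxic_share` is a hypothetical name introduced here, not something defined in the notebook.

```python
from collections import Counter

# Toy token lists standing in for df['toxic'] and df['normal'].
toy_toxic = [["like", "stupid", "game"], ["like", "shit"], ["damn", "game"]]
toy_normal = [["like", "game"], ["game", "tom"], ["think", "buddy"]]

toxic_counter = Counter(word for sentence in toy_toxic for word in sentence)
normal_counter = Counter(word for sentence in toy_normal for word in sentence)


def toxic_share(word):
    """Fraction of a word's occurrences that come from toxic sentences."""
    total = toxic_counter[word] + normal_counter[word]
    return toxic_counter[word] / total if total else 0.0


for word in ["shit", "like", "game"]:
    print(word, round(toxic_share(word), 2))
```

A share close to 1.0 points to a word that appears almost exclusively in toxic sentences, while a share near 0.5 points to a word that is equally common in both columns.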

### Saving the set of toxic words

In [153]:
with open('../data/interim/toxic_words.pkl', 'wb') as f:
    pickle.dump(toxic_words, f)