# Baseline solution

This notebook presents the most elementary solution to the problem: remove toxic words from each sentence, without replacing them with anything and without paying attention to the context. The algorithm relies on a ready-made list of toxic words.

## Data Loading

#### 1. Load the whole dataset (this algorithm needs no train/test split)

In [1]:
import pandas as pd

df = pd.read_csv('../data/interim/dataset.csv')

X = df['source'].to_list()
y = df['target'].to_list()

#### 2. Load a list of toxic words that was compiled in 1.0-initial_data_exploration

In [2]:
with open('../data/interim/toxic_word.txt', "r") as f:
    toxic_words_line = f.readlines()

# Only the first line is used: split it on whitespace and drop duplicates
my_toxic_words = list(set(toxic_words_line[0].split()))

#### 3. Load a list of toxic words that was found on the internet

In [3]:
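external_toxic_words = []

with open('../data/external/profanity_words_en.txt', "r") as f:
    for word in f.readlines():
        # Strip only the trailing newline; word[:-1] would cut a letter
        # off the last word if the file does not end with a newline
        external_toxic_words.append(word.rstrip('\n'))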
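
A line-based list such as the external one can also be loaded with a small helper. The following is a minimal sketch, assuming a file with one word per line; `load_word_list` is a hypothetical name that does not appear in this notebook, and the demo file contents are made up. It lowercases the words, skips blank lines, and returns a set so that membership checks are fast.

```python
import os
import tempfile

def load_word_list(path):
    """Read one word per line into a lowercase set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Demonstration on a throwaway file (the contents are made up)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Damn\nhell\n\nidiot")  # note: no newline after the last word
    demo_path = tmp.name

demo_words = load_word_list(demo_path)
os.remove(demo_path)
print(sorted(demo_words))  # ['damn', 'hell', 'idiot']
```

Because the words are stored in lowercase, this pairs naturally with a `word.lower()` lookup like the one used by the algorithm in this notebook.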

## Base algorithm

In [4]:
import re
from nltk.tokenize import RegexpTokenizer

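def baseline_detoxic_text(X, toxic_words):
    # A set gives constant-time membership checks for every token
    toxic_words = set(toxic_words)

    # Punctuation that is dropped when it directly follows a removed toxic word
    p = '.,?!'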
    
    # Create a tokenizer that preserves words and some punctuation
    tokenizer = RegexpTokenizer(r"\b\w+\b|[.,!?'\"]")
    
    detox_sentences = []
    
    for sentence in X:
        
        # Tokenize the sentence into words
        words = tokenizer.tokenize(sentence)
        
        result_sentence = []
        
        # Set right after a toxic word has been dropped
        flag = False
        
        for word in words: 
            
            # The previous token was toxic and has been dropped
            if flag:
                # Drop punctuation that directly followed the toxic word;
                # any other token is kept as-is (it is not re-checked
                # against toxic_words)
                if word not in p:
                    result_sentence.append(word)
                flag = False
                
            # Keep words that are not in the list of toxic words    
            elif word.lower() not in toxic_words:
                result_sentence.append(word)
            else:
                # Toxic word: drop it and remember that for the next token
                flag = True
        
        result_sentence = ' '.join(result_sentence)

        # Correctly handle contractions like "It's"
        result_sentence = re.sub(r"(\w+) ' (\w+)", r"\1'\2", result_sentence)

        # Remove spaces before punctuation
        result_sentence = re.sub(r" ([.,!?])", r"\1", result_sentence)
        
        detox_sentences.append(result_sentence)
        
    return detox_sentences
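
The tokenize-and-rejoin steps can be examined in isolation. The sketch below is a standard-library approximation, assuming that `RegexpTokenizer` in its default matching mode returns the same tokens as `re.findall` for this pattern; `tokenize` and `rejoin` are hypothetical helper names, not part of the notebook. It shows that contractions survive the round trip, while a space appears after an opening quotation mark and before a closing one.

```python
import re

# Same pattern as the tokenizer in baseline_detoxic_text
TOKEN_PATTERN = r"\b\w+\b|[.,!?'\"]"

def tokenize(sentence):
    # The pattern has no capture groups, so findall returns the full matches
    return re.findall(TOKEN_PATTERN, sentence)

def rejoin(tokens):
    text = ' '.join(tokens)
    # Glue contractions back together: "don ' t" -> "don't"
    text = re.sub(r"(\w+) ' (\w+)", r"\1'\2", text)
    # Remove the space that the join put before punctuation
    text = re.sub(r" ([.,!?])", r"\1", text)
    return text

plain = "I don't have to do shit."
quoted = '"Thanks, ass hole," Case said.'

print(tokenize(plain))           # ['I', 'don', "'", 't', 'have', 'to', 'do', 'shit', '.']
print(rejoin(tokenize(plain)))   # I don't have to do shit.
print(rejoin(tokenize(quoted)))  # " Thanks, ass hole, " Case said.
```

This means a sentence containing quotation marks can differ from its original even when no toxic word was removed, which matters when outputs are compared with `!=`.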

### Functions for comparing sentences

In [5]:
def basic_comparison(X, output):
    same = 0
    for i in range(len(X)):
        
        # Check if the output sentence differs from the original
        if X[i] != output[i]:
            if len(X[i]) <= 50:
                print(f'{i + 1}. Before the algorithm: {X[i]}\nAfter the algorithm: {output[i]}')
        else:
            # Count identical sentences
            same += 1

    print('Number of identical sentences:', same)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similar(sentence1, sentence2):
    
    # Create a CountVectorizer to convert sentences into a bag of words representation
    vectorizer = CountVectorizer()
    sentences = [sentence1, sentence2]
    v = vectorizer.fit_transform(sentences)

    # Calculate cosine similarity between the sentences
    cosine_similarities = cosine_similarity(v)
    
    return cosine_similarities[0][1]
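
To see what this score measures, the same quantity can be computed by hand. This is a minimal sketch under stated assumptions: `bag_of_words` and `cosine_by_hand` are hypothetical names, and the tokenization imitates `CountVectorizer`'s defaults (lowercasing, tokens of at least two word characters). Unlike `CountVectorizer`, which raises an error on an empty vocabulary, the sketch returns 0.0 when either sentence has no tokens.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    # Lowercase and keep tokens of two or more word characters
    return Counter(re.findall(r"\b\w\w+\b", sentence.lower()))

def cosine_by_hand(sentence1, sentence2):
    a, b = bag_of_words(sentence1), bag_of_words(sentence2)
    # Dot product over the words the two sentences share
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

score = cosine_by_hand("I have orders to kill her.", "I have orders to her.")
print(round(score, 3))  # 0.894
```

Removing one word out of five gives 4 / (sqrt(5) * sqrt(4)) ≈ 0.894, so short sentences fall below a 0.9 threshold after a single deletion, while the same deletion in a longer sentence leaves the score higher.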

In [7]:
def comparison_with_cosine(X, output):
    same = 0
    
    for i in range(len(X)):
        
        # Calculate the cosine similarity between the original and modified sentences
        cs = cosine_similar(X[i], output[i])
        
        if cs >= 0.99:
            # Count sentences whose bag-of-words representation is (almost) unchanged
            same += 1
            
        if cs <= 0.9:
            # Print the index, original, and modified sentences for low cosine similarity
            print(f'{i + 1}. Before the algorithm: {X[i]}\nAfter the algorithm: {output[i]}')
            
    print('Number of identical sentences:', same)
        

### Solution

#### 1. Running the algorithm with a self-compiled list of toxic words

In [8]:
output_1 = baseline_detoxic_text(X, my_toxic_words)

In [9]:
basic_comparison(X[:50], output_1[:50])

3. Before the algorithm: I have orders to kill her.
After the algorithm: I have orders to her.
10. Before the algorithm: Real life starts the first time you fuck, kid.
After the algorithm: Real life starts the first time you kid.
13. Before the algorithm: Shit, this one I can't even pronounce.
After the algorithm: this one I can't even pronounce.
15. Before the algorithm: Hey, leave the poor bastard alone!
After the algorithm: Hey, leave the poor alone!
21. Before the algorithm: It told you this was a waste of my fucking time.
After the algorithm: It told you this was a waste of my time.
24. Before the algorithm: 'Shut up, you two, 'said Granny.
After the algorithm: ' up, you two, ' said Granny.
26. Before the algorithm: Does anal...
After the algorithm: Does..
32. Before the algorithm: I don't have to do shit.
After the algorithm: I don't have to do
33. Before the algorithm: God damn, this is gonna be a long night.
After the algorithm: God this is gonna be a long night.
39. Before the

In [11]:
comparison_with_cosine(X[:50], output_1[:50])

3. Before the algorithm: I have orders to kill her.
After the algorithm: I have orders to her.
26. Before the algorithm: Does anal...
After the algorithm: Does..
32. Before the algorithm: I don't have to do shit.
After the algorithm: I don't have to do
40. Before the algorithm: Fuck! Get out of the fucking way!
After the algorithm: Get out of the way!
41. Before the algorithm: Trying to kill Ethan.
After the algorithm: Trying to Ethan.
42. Before the algorithm: "Thanks, ass hole," Case said.
After the algorithm: " Thanks, hole, " Case said.
44. Before the algorithm: Really fucking annoying.
After the algorithm: Really annoying.
Number of identical sentences: 25


##### Conclusion

Looking at the results, the algorithm leaves about half of the sentences examined unchanged (25 of the first 50 according to the cosine check), because it does not look at the context of the sentence. The list of toxic words could also be part of the problem, so for a fairer evaluation the algorithm was also run with a different list.

#### 2. Running the algorithm with a ready-made list of toxic words

In [12]:
output_2 = baseline_detoxic_text(X, external_toxic_words)

In [13]:
basic_comparison(X[:50], output_2[:50])

3. Before the algorithm: I have orders to kill her.
After the algorithm: I have orders to her.
10. Before the algorithm: Real life starts the first time you fuck, kid.
After the algorithm: Real life starts the first time you kid.
13. Before the algorithm: Shit, this one I can't even pronounce.
After the algorithm: this one I can't even pronounce.
15. Before the algorithm: Hey, leave the poor bastard alone!
After the algorithm: Hey, leave the poor alone!
21. Before the algorithm: It told you this was a waste of my fucking time.
After the algorithm: It told you this was a waste of my time.
24. Before the algorithm: 'Shut up, you two, 'said Granny.
After the algorithm: ' up, you two, ' said Granny.
26. Before the algorithm: Does anal...
After the algorithm: Does..
30. Before the algorithm: What the hell is going on?
After the algorithm: What the is going on?
32. Before the algorithm: I don't have to do shit.
After the algorithm: I don't have to do
33. Before the algorithm: God damn, this 

In [14]:
comparison_with_cosine(X[:50], output_2[:50])

3. Before the algorithm: I have orders to kill her.
After the algorithm: I have orders to her.
26. Before the algorithm: Does anal...
After the algorithm: Does..
32. Before the algorithm: I don't have to do shit.
After the algorithm: I don't have to do
40. Before the algorithm: Fuck! Get out of the fucking way!
After the algorithm: Get out of the way!
41. Before the algorithm: Trying to kill Ethan.
After the algorithm: Trying to Ethan.
42. Before the algorithm: "Thanks, ass hole," Case said.
After the algorithm: " Thanks, hole, " Case said.
44. Before the algorithm: Really fucking annoying.
After the algorithm: Really annoying.
Number of identical sentences: 22


##### Conclusion

In this case more sentences are changed (22 of the first 50 remain identical, compared with 25 before), so the result of the algorithm depends on the list of toxic words. The external list is therefore the better choice for further work.
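
The dependence on the word list can be illustrated on a toy example. The sketch below is only an illustration: `count_unchanged` is a hypothetical helper, the two lists are made-up stand-ins rather than the contents of the real files, and a sentence counts as unchanged when none of its lowercase tokens is in the list.

```python
import re

def count_unchanged(sentences, toxic_words):
    toxic = {w.lower() for w in toxic_words}
    # A sentence is untouched when it contains no listed word
    return sum(
        1 for s in sentences
        if not any(tok in toxic for tok in re.findall(r"\w+", s.lower()))
    )

sample = [
    "What the hell is going on?",
    "I have orders to kill her.",
    "This is fine.",
]
list_a = ["kill"]          # made-up stand-in for a shorter list
list_b = ["kill", "hell"]  # made-up stand-in for a broader list

print(count_unchanged(sample, list_a))  # 2
print(count_unchanged(sample, list_b))  # 1
```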

### Conclusion

As noted above, a significant number of sentences remain unaltered, largely because their toxicity comes from the context rather than from individual listed words. The cosine similarity comparison shows that most of the sentences that are altered change only slightly. However, other problems remain. Removing toxic words can destroy the intended meaning of a sentence or leave it ungrammatical, as when "I have orders to kill her." becomes "I have orders to her." A better approach is therefore to replace toxic words with more neutral alternatives rather than simply deleting them, which preserves the intended meaning and keeps the sentence grammatical.
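
The substitution idea can be sketched with a lookup table. This is a minimal illustration under stated assumptions, not the approach used later in the project: `REPLACEMENTS` is a tiny made-up mapping and `substitute_toxic_words` is a hypothetical name. Each listed word is swapped for a milder one while the rest of the sentence, including its punctuation and spacing, is left untouched.

```python
import re

# Made-up mapping for illustration only
REPLACEMENTS = {"hell": "heck", "damn": "darn"}

def substitute_toxic_words(sentence, replacements):
    def swap(match):
        word = match.group(0)
        replacement = replacements.get(word.lower())
        if replacement is None:
            return word  # not a listed word: keep it unchanged
        # Preserve a leading capital letter
        return replacement.capitalize() if word[0].isupper() else replacement

    # Substituting in place avoids the tokenize/rejoin spacing issues
    return re.sub(r"\b\w+\b", swap, sentence)

print(substitute_toxic_words("What the hell is going on?", REPLACEMENTS))
# What the heck is going on?
print(substitute_toxic_words("Damn, this is gonna be a long night.", REPLACEMENTS))
# Darn, this is gonna be a long night.
```

A fixed table still ignores context, so it shares that limitation with the baseline; it only avoids the broken sentences that plain deletion produces.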