# Project: Toxic Comment Filter

Build a model that filters user comments according to how harmful their language is:
1. Preprocess the text by removing the tokens that contribute little semantic meaning
2. Convert the text corpus into sequences
3. Build a Deep Learning model with recurrent layers for a multilabel classification task
4. At prediction time, the model must return a vector containing a 1 or a 0 for each label in the dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate). A harmless comment is thus classified as an all-zero vector [0,0,0,0,0,0], while a harmful comment has at least one 1 among the 6 labels.
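
To make the target format in step 4 concrete, here is a minimal sketch of how a 6-element output vector maps back to label names. The helper names `decode_labels` and `is_harmful` are hypothetical and are not used elsewhere in this notebook.

```python
# Label order as listed in step 4 above.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def decode_labels(vector):
    """Return the names of the labels whose flag is 1 (hypothetical helper)."""
    return [label for label, flag in zip(LABELS, vector) if flag == 1]

def is_harmful(vector):
    """A comment counts as harmful if at least one label is set."""
    return any(flag == 1 for flag in vector)

harmless = [0, 0, 0, 0, 0, 0]
harmful = [1, 0, 1, 0, 1, 0]

print(decode_labels(harmless), is_harmful(harmless))
print(decode_labels(harmful), is_harmful(harmful))
```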

In [None]:
import pandas as pd
df = pd.read_csv("Filter_Toxic_Comments_dataset.csv")

In [None]:
df.head()

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,sum_injurious
0,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,0
1,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,0
2,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,0
3,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,0
4,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,0


In [None]:
df.shape

(159571, 8)

The dataframe contains 159,571 rows and 8 columns.

Since my machine is rather slow and the cleaning step takes a very long time, I randomly select a tenth of the dataframe to speed up both the cleaning and the subsequent training.

In [None]:
rows_to_mantain = len(df)//10
rows_to_mantain

15957

In [None]:
import random

# Randomly draw the indices of the rows to keep
selected_indices = random.sample(range(len(df)), rows_to_mantain)

In [None]:
# Select the rows from the dataframe using the drawn indices
df = df.iloc[selected_indices]

In [None]:
df.shape

(15957, 8)
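
The random subsample taken above is not seeded, so each run keeps different rows. As a possible alternative (not what this notebook actually ran), pandas' `DataFrame.sample` draws the rows in a single call and accepts a `random_state` for reproducibility. The toy dataframe and the seed value 42 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 100 rows, two columns.
toy_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(100)],
    "toxic": [i % 2 for i in range(100)],
})

# Keep one tenth of the rows; fixing random_state makes the draw repeatable.
sampled = toy_df.sample(n=len(toy_df) // 10, random_state=42)
sampled_again = toy_df.sample(n=len(toy_df) // 10, random_state=42)

print(sampled.shape)
print(sampled.index.equals(sampled_again.index))
```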

## Text preprocessing
Removing tokens that contribute little semantic meaning

In [None]:
import spacy

# to handle punctuation
import string

# to handle stopwords
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

# load the English stopwords
english_stopwords = stopwords.words("english")

# to handle digits and repeated whitespace
import re

# load the spaCy NLP model
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
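def data_cleaner(sentence):
    # Lowercase and replace every punctuation character with a space
    sentence = sentence.lower()
    for c in string.punctuation:
        sentence = sentence.replace(c, " ")
    # Lemmatize each token with spaCy
    document = nlp(sentence)
    sentence = ' '.join(token.lemma_ for token in document)
    # Drop English stopwords (split/join also collapses repeated whitespace)
    sentence = ' '.join(word for word in sentence.split() if word not in english_stopwords)
    # Remove digits; the raw string avoids an invalid-escape warning for \d
    sentence = re.sub(r'\d', '', sentence)

    return sentence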
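
To see what the non-spaCy steps of the cleaner do to a single sentence, here is a simplified sketch that skips lemmatization and uses a tiny hypothetical stopword set (`TOY_STOPWORDS`) in place of NLTK's full English list. It is not a replacement for `data_cleaner`, and it removes digits before the stopword filter so that number-only tokens disappear entirely.

```python
import re
import string

# Hypothetical, tiny stopword set used only for this illustration.
TOY_STOPWORDS = {"the", "a", "is", "and", "in"}

# Translation table mapping every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})

def toy_cleaner(sentence):
    # Lowercase, then turn punctuation into spaces in a single pass.
    sentence = sentence.lower().translate(PUNCT_TO_SPACE)
    # Strip digits.
    sentence = re.sub(r"\d", "", sentence)
    # Drop stopwords; split/join also collapses repeated whitespace.
    return " ".join(w for w in sentence.split() if w not in TOY_STOPWORDS)

cleaned = toy_cleaner("The cat, in 2023, is NOT amused!!!")
print(cleaned)  # cat not amused
```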

In [None]:
X = df.comment_text
X.shape

(15957,)

In [None]:
y = df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
y.shape

(15957, 6)

In [None]:
%%time
X_cleaned = []
for text in X:
    X_cleaned.append(data_cleaner(text))

CPU times: user 4min 56s, sys: 640 ms, total: 4min 57s
Wall time: 4min 58s


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_cleaned,y,test_size=.2, random_state=1)

In [None]:
pip install keras_preprocessing



In [None]:
from keras_preprocessing.text import Tokenizer
tokenizer = Tokenizer()

In [None]:
tokenizer.fit_on_texts(X_train)

In [None]:
train_sequences = tokenizer.texts_to_sequences(X_train)
test_sequences = tokenizer.texts_to_sequences(X_test)

In [None]:
vocabulary = len(tokenizer.index_word)+1

In [None]:
vocabulary

34283

In [None]:
maxlen = len(max(train_sequences, key=len))

In [None]:
maxlen

1078
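
Taking `maxlen` from the single longest sequence means that one outlier comment determines the padded length of every input. A common alternative, shown here only as a sketch and not something this notebook does, is to pick a high percentile of the sequence lengths and let longer sequences be truncated. The lengths below are made-up values and the 90th percentile is an arbitrary choice.

```python
import numpy as np

# Hypothetical sequence lengths: mostly short comments plus one outlier.
lengths = np.array([5] * 10 + [10] * 9 + [1078])

maxlen_max = int(lengths.max())               # what the notebook uses
maxlen_p90 = int(np.percentile(lengths, 90))  # percentile-based alternative

print(maxlen_max, maxlen_p90)
```

With a percentile-based value, sequences longer than `maxlen` lose tokens during padding, so this trades a little information for much smaller inputs.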

In [None]:
from keras_preprocessing.sequence import pad_sequences
padded_train_sequences = pad_sequences(train_sequences, maxlen=maxlen)
padded_test_sequences = pad_sequences(test_sequences, maxlen=maxlen)
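
To illustrate what the tokenization and padding cells above produce, here is a simplified pure-Python imitation of the default behaviour: words are indexed by descending frequency starting at 1, words unseen at fit time are dropped, and sequences are left-padded with zeros. The function names are hypothetical and this is not the `keras_preprocessing` implementation.

```python
from collections import Counter

def fit_word_index(texts):
    # Most frequent word gets index 1; index 0 is reserved for padding.
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    # Words that were not seen during fitting are silently dropped.
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_left(sequences, maxlen):
    # Keep at most the last `maxlen` ids and left-pad with zeros.
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

train_texts = ["good article thanks", "bad bad edit"]
test_texts = ["bad article unknown"]

word_index = fit_word_index(train_texts)
vocab_size = len(word_index) + 1  # +1 for the padding index 0

padded_train = pad_left(texts_to_ids(train_texts, word_index), maxlen=3)
padded_test = pad_left(texts_to_ids(test_texts, word_index), maxlen=3)

print(word_index)
print(padded_train, padded_test, vocab_size)
```

Fitting the index on the training texts only, as the notebook does, keeps test-set vocabulary from leaking into the model's input space.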

## **Model**

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

In [None]:
model = Sequential()

embedding_dim = 100
model.add(Embedding(input_dim=vocabulary, output_dim=embedding_dim, input_length=maxlen))

lstm_units = 64
# apply two regularization techniques (dropout and recurrent dropout), also given that the dataframe was cut to a tenth
model.add(LSTM(units=lstm_units, dropout=0.2, recurrent_dropout=0.2))

num_labels = 6
model.add(Dense(num_labels, activation='sigmoid'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 1078, 100)         3428300   
                                                                 
 lstm_1 (LSTM)               (None, 64)                42240     
                                                                 
 dense_1 (Dense)             (None, 6)                 390       
                                                                 
Total params: 3,470,930
Trainable params: 3,470,930
Non-trainable params: 0
_________________________________________________________________
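
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: an embedding has one vector per vocabulary entry, an LSTM has four gates that each hold an input kernel, a recurrent kernel and a bias, and a dense layer has a weight matrix plus a bias.

```python
vocabulary = 34283
embedding_dim = 100
lstm_units = 64
num_labels = 6

# Embedding: one vector of size embedding_dim per vocabulary index.
embedding_params = vocabulary * embedding_dim

# LSTM: 4 gates, each with (input + recurrent) weights and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output label.
dense_params = lstm_units * num_labels + num_labels

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 3428300 42240 390 3470930
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.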


In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### Fit and predict

In [None]:
batch_size = 32
epochs = 5

model.fit(padded_train_sequences, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f139adbd1b0>

In [None]:
predictions = model.predict(padded_test_sequences)



In [None]:
threshold = 0.5

binary_predictions = (predictions > threshold).astype(int)

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, binary_predictions)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9145
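
For multilabel targets, scikit-learn's `accuracy_score` reports subset accuracy: a sample counts as correct only if all of its labels match. Since most comments in this dataset are all-zero, it can help to look at per-label figures as well. The sketch below uses small made-up arrays, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical ground truth and predictions for 4 samples and 3 labels.
y_true = np.array([[1, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]])

# Subset accuracy: fraction of rows where every label matches.
subset_accuracy = accuracy_score(y_true, y_pred)

# Per-label accuracy: fraction of correct predictions in each column.
per_label_accuracy = (y_true == y_pred).mean(axis=0)

# Hamming loss: fraction of individual label entries that are wrong.
hamming = hamming_loss(y_true, y_pred)

print(subset_accuracy, per_label_accuracy, hamming)
```

Here one wrong entry out of twelve makes a whole row count as incorrect, so subset accuracy is stricter than the per-label view.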


I create a new dataframe that pairs each sentence with its predicted values, so that I can check in one go:
- that the final sum column has a minimum of 0 and a different number as its maximum
- that the predictions are accurate (code debugging)

In [None]:
predicted_df = pd.DataFrame()

In [None]:
sum_injurious = binary_predictions.sum(axis=1)

In [None]:
min(sum_injurious)

0

In [None]:
max(sum_injurious)

4

In [None]:
predicted_df['comment_text'] = X_test
predicted_df['predicted_toxic'] = binary_predictions[:, 0]
predicted_df['predicted_severe_toxic'] = binary_predictions[:, 1]
predicted_df['predicted_obscene'] = binary_predictions[:, 2]
predicted_df['predicted_threat'] = binary_predictions[:, 3]
predicted_df['predicted_insult'] = binary_predictions[:, 4]
predicted_df['predicted_identity_hate'] = binary_predictions[:, 5]
predicted_df['sum_injurious'] = sum_injurious

In [None]:
predicted_df.head()

Unnamed: 0,comment_text,predicted_toxic,predicted_severe_toxic,predicted_obscene,predicted_threat,predicted_insult,predicted_identity_hate,sum_injurious
0,stop crack whore go die fire dick suck fag,1,0,1,0,1,0,3
1,article take ninjutsu seriously fucking fake g...,1,0,0,0,0,0,1
2,change title article know name age conan hybor...,0,0,0,0,0,0,0
3,previous contribution special contribution ...,0,0,0,0,0,0,0
4,– germany weimar republic way similar present ...,0,0,0,0,0,0,0
