# Dataset cleaning

**Dataset** : [Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)

The dataset was extracted from Wikipedia comments that were manually labeled according to their 'toxicity'. It consists of:

* training set (train.csv)
* unlabeled test set (test.csv)
* test set labels (test_labels.csv); the value '-1' indicates that the corresponding comment was not labeled

In [None]:
from google.colab import drive
import re
import pandas as pd

In [None]:
drive.mount('/content/drive', force_remount = True)
root_dir = '/content/drive/My Drive/Toxic_comment_classification_Maggio_Monti/dataset/'
train = pd.read_csv(root_dir+'train.csv')  # Load the training set
test = pd.read_csv(root_dir+'test.csv')    # Load the test set
test_labels = pd.read_csv(root_dir+'test_labels.csv')  # Load the test set labels

Mounted at /content/drive


In [None]:
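def get_only_chars(line):
    """Lowercase a comment and keep only the letters a-z separated by single spaces."""
    # Non-string values (e.g. NaN) are returned unchanged
    if not isinstance(line, str):
        return line

    line = line.replace("’", "")   # drop typographic apostrophes
    line = line.replace("'", "")   # drop straight apostrophes
    line = line.replace("-", " ")  # replace hyphens with spaces
    line = line.replace("\t", " ")
    line = line.replace("\n", " ")
    line = line.lower()

    # Any character that is not a lowercase letter or a space becomes a space
    clean_line = ""
    for char in line:
        if char in 'qwertyuiopasdfghjklzxcvbnm ':
            clean_line += char
        else:
            clean_line += ' '

    clean_line = re.sub(' +', ' ', clean_line)  # collapse runs of spaces
    if clean_line.startswith(' '):              # drop the leading space, if any
        clean_line = clean_line[1:]
    return clean_line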
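
To make the effect of the cleaning step concrete, here is a minimal sketch of a regex-based function that should behave like `get_only_chars`. The name `get_only_chars_regex` is hypothetical and is not part of the notebook; it is only meant to illustrate the rules (apostrophes removed, everything outside a-z replaced by a space, runs of spaces collapsed, one leading space dropped).

```python
import re

def get_only_chars_regex(line):
    """Hypothetical regex-based counterpart of get_only_chars (illustration only)."""
    if not isinstance(line, str):
        return line  # non-string values such as NaN are left untouched
    line = line.replace("’", "").replace("'", "").lower()
    line = re.sub(r"[^a-z ]", " ", line)  # hyphens, tabs, newlines, digits, punctuation
    line = re.sub(r" +", " ", line)       # collapse runs of spaces
    return line[1:] if line.startswith(" ") else line

sample = "D'aww! He matches\nthis background-colour"
cleaned_sample = get_only_chars_regex(sample)
print(cleaned_sample)
```

Note that, like the original function, this sketch removes a leading space but keeps a trailing one, so `"hello!"` becomes `"hello "`.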

## Test set cleaning

Removal of the rows labeled '-1' (which mark unlabeled comments):

In [None]:
print(test_labels.shape)
test_labels.head(5)

(153164, 7)


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,-1,-1,-1,-1,-1,-1


In [None]:
test_labels = test_labels[(test_labels[['toxic','severe_toxic', 'obscene', 'threat', 
                                        'insult', 'identity_hate']] != -1).all(axis=1)]
print(test_labels.shape)
test_labels.head(5)

(63978, 7)


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
5,0001ea8717f6de06,0,0,0,0,0,0
7,000247e83dcc1211,0,0,0,0,0,0
11,0002f87b16116a7f,0,0,0,0,0,0
13,0003e1cccfd5a40a,0,0,0,0,0,0
14,00059ace3e3e9a53,0,0,0,0,0,0
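
The filter above builds a boolean mask with one entry per row. A small sketch on a toy frame (the rows below are invented for illustration, not taken from the dataset) shows how `(... != -1).all(axis=1)` keeps only the rows in which every label column is different from -1:

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy data: row "a" is unlabeled, rows "b" and "c" carry real labels
toy_labels = pd.DataFrame(
    [["a", -1, -1, -1, -1, -1, -1],
     ["b", 0, 0, 0, 0, 0, 0],
     ["c", 1, 0, 1, 0, 1, 0]],
    columns=["id"] + label_cols,
)

# True where all six label columns differ from -1
is_labeled = (toy_labels[label_cols] != -1).all(axis=1)
labeled = toy_labels[is_labeled]
print(labeled["id"].tolist())
```

The filtered frame keeps the original row index, which is why the output above starts at index 5 rather than 0.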


Cleaning of the comment text, then joining of the labels to the test set (through an inner join, which drops the comments without labels):

In [None]:
test['comment_text'] = test['comment_text'].apply(get_only_chars)


In [None]:
test = pd.merge(test_labels, test, on='id', how='inner')
print(test.shape)
test.head(5)

(63978, 8)


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate,comment_text
0,0001ea8717f6de06,0,0,0,0,0,0,thank you for understanding i think very highl...
1,000247e83dcc1211,0,0,0,0,0,0,dear god this site is horrible
2,0002f87b16116a7f,0,0,0,0,0,0,somebody will invariably try to add religion r...
3,0003e1cccfd5a40a,0,0,0,0,0,0,it says it right there that it is a type the t...
4,00059ace3e3e9a53,0,0,0,0,0,0,before adding a new product to the list make s...
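
As a sketch of why the inner join drops unlabeled comments, the toy example below (invented ids and texts) merges a label table that lacks one id. The `validate="one_to_one"` argument is an optional extra check, not used in the notebook, that raises an error if an id appears more than once on either side.

```python
import pandas as pd

toy_labels = pd.DataFrame({"id": ["b", "c"], "toxic": [0, 1]})
toy_comments = pd.DataFrame({
    "id": ["a", "b", "c"],
    "comment_text": ["unlabeled", "fine", "rude"],
})

# Only ids present in both frames survive an inner join
merged = pd.merge(toy_labels, toy_comments, on="id", how="inner",
                  validate="one_to_one")
print(merged.shape)
print(merged["id"].tolist())
```

The merged frame gets a fresh 0-based index, matching the output above.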


Check for missing values:

In [None]:
# isnull() and isna() are aliases in pandas, so both calls report the same counts
print('Missing values per column (isnull): ')
print(test.isnull().sum())

print('\nMissing values per column (isna): ')
print(test.isna().sum())

Missing values per column (isnull): 
id               0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
comment_text     0
dtype: int64

Missing values per column (isna): 
id               0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
comment_text     0
dtype: int64
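
A missing-value check does not catch comments that became empty during cleaning: `get_only_chars` turns a comment made only of digits or punctuation into an empty string, which is not NaN. The sketch below, on invented toy data, shows that `isnull()` and `isna()` agree and adds a separate count of empty comments; this extra check is a suggestion, not part of the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "comment_text": ["fine", "", np.nan],  # one empty string, one real NaN
})

null_counts = toy.isnull().sum()
na_counts = toy.isna().sum()
print(null_counts.equals(na_counts))  # both methods give identical counts

# Empty (or whitespace-only) comments are counted separately from NaN
empty_comments = int(toy["comment_text"].str.strip().eq("").sum())
print(empty_comments)
```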


Export of the cleaned test set:

In [None]:
test.to_csv(root_dir + 'dataset_clean/test_clean.csv')
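
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below uses an in-memory buffer and invented rows to compare the default with `index=False`; whether the index is wanted in the exported file is a choice left to the reader.

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": ["b", "c"], "comment_text": ["fine", "rude"]})

# Default: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```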

## Training set cleaning

In [None]:
print(train.shape)
train.head(5)

(159571, 8)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [None]:
train['comment_text'] = train['comment_text'].apply(get_only_chars)

In [None]:
train.head(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanation why the edits made under my userna...,0,0,0,0,0,0
1,000103f0d9cfb60f,daww he matches this background colour im seem...,0,0,0,0,0,0
2,000113f07ec002fd,hey man im really not trying to edit war its j...,0,0,0,0,0,0
3,0001b41b1c6bb37e,more i cant make any real suggestions on impro...,0,0,0,0,0,0
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0
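
The same cleaning can also be written with pandas string methods instead of `apply`. The function `clean_comments` below is a hypothetical sketch, not the notebook's method, and it differs in one respect: non-string values come back as missing values rather than unchanged.

```python
import pandas as pd

def clean_comments(comments: pd.Series) -> pd.Series:
    """Hypothetical vectorized counterpart of get_only_chars (illustration only)."""
    return (
        comments
        .str.replace("[’']", "", regex=True)      # drop apostrophes
        .str.lower()
        .str.replace("[^a-z ]", " ", regex=True)  # other characters become spaces
        .str.replace(" +", " ", regex=True)       # collapse runs of spaces
        .str.replace("^ ", "", regex=True)        # drop a single leading space
    )

toy_comments = pd.Series([
    "Hey man, I'm really not trying to edit war.",
    "\nMore\nI can't",
])
cleaned = clean_comments(toy_comments)
print(cleaned.tolist())
```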


Check for missing values:

In [None]:
# isnull() and isna() are aliases in pandas, so both calls report the same counts
print('Missing values per column (isnull): ')
print(train.isnull().sum())

print('\nMissing values per column (isna): ')
print(train.isna().sum())

Missing values per column (isnull): 
id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

Missing values per column (isna): 
id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64


Export of the cleaned training set:

In [None]:
train.to_csv(root_dir + 'dataset_clean/train_clean.csv')
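
One thing to watch when the exported files are loaded again: a comment that the cleaning reduced to an empty string is read back by `read_csv` as NaN under the default settings. The sketch below uses an in-memory buffer and invented rows; passing `keep_default_na=False` is one possible way to preserve empty strings, with the caveat that it applies to every column.

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": ["a", "b"], "comment_text": ["hello", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default settings: the empty comment comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)
missing_after_default = int(default_read["comment_text"].isna().sum())

# keep_default_na=False: the empty comment stays an empty string
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)

print(missing_after_default)
print(kept_read["comment_text"].tolist())
```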