#Text preprocessing

Text preprocessing is a fundamental step in natural language processing (NLP): it prepares raw text data for the analysis that follows.

This step can include a variety of operations, such as punctuation removal, tokenization, stopword removal, stemming, lemmatization, and more.

Text preprocessing matters because raw text can contain irrelevant information, noise, or various kinds of inconsistency that can hurt the accuracy and effectiveness of NLP models.
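
To make the operations listed above concrete, here is a minimal sketch that uses only the Python standard library. The stopword set, the suffix list, and the helper names `naive_stem` and `preprocess` are hypothetical and chosen for illustration; real stemmers are far more careful, and lemmatization is left out because it needs a dictionary.

```python
import string

# Hypothetical mini stopword list, for illustration only
STOPWORDS = {"the", "a", "is", "to", "in", "and", "are"}

# Hypothetical suffix list for a deliberately naive stemmer
SUFFIXES = ("ing", "ed", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop ASCII punctuation, tokenize, remove stopwords, stem."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace tokenization
    kept = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in kept]


print(preprocess("The winners are claiming prizes in the lottery!"))
# ['winner', 'claim', 'prize', 'lottery']
```

Each stage feeds the next, so the order matters: removing stopwords before lowercasing, for example, would miss "The".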

##Installing/importing the dependencies

In [None]:
!pip install cleantext

In [None]:
import pandas as pd
import numpy as np
import cleantext
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from google.colab import drive

##Loading and preliminary exploration of the dataset

In [None]:
# Mount the Google Drive folder
drive.mount('/content/drive')

# Read the data from the file
my_path = "/content/drive/MyDrive/Lavoro/Test/Text Processing/Datasets"
data = pd.read_csv(f"{my_path}/spam_text_messages.csv", dtype="category")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The dataset consists of a ***Message*** column, which contains the body of each message, and a ***Category*** column, which distinguishes spam messages (negative) from ham messages (positive).

In [None]:
# Increase the maximum displayed column width of the dataframe to 200 characters
pd.set_option('display.max_colwidth', 200)

data[:10]

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030


How are ham and spam messages distributed?

In [None]:
values = data['Category'].value_counts()
total = values['ham'] + values['spam']

# Message distribution
print(values)

# Ham messages (label-based lookup instead of positional indexing)
print(f"\nHam: {round(values['ham'] / total * 100, 1)} %")

# Spam messages
print(f"Spam: {round(values['spam'] / total * 100, 1)} %")

# Total number of messages
print(f"\nMessages: {total}")

ham     4825
spam     747
Name: Category, dtype: int64

Ham: 86.6 %
Spam: 13.4 %

Messages: 5572
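
The imbalance shown above has a practical consequence worth keeping in mind: a classifier that always answers "ham" would already look accurate. The sketch below works through that arithmetic with plain Python, using the counts copied from the output above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: round(n / total * 100, 1) for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(shares)
print(majority_label, round(baseline_accuracy, 3))
```

Any model trained on this data should be judged against this majority-class baseline rather than against 50 %.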


Are there any null values in the dataset?

In [None]:
data.isnull().sum()

Category    0
Message     0
dtype: int64

Remove any null values

In [None]:
data = data.dropna()

##Text cleaning

Let's define the function that applies the preprocessing to the message bodies

In [None]:
def clean_text(text_list):
    """Clean each message with cleantext and return one list of tokens per message."""
    messages = []

    for message in text_list:
        corpus = cleantext.clean_words(
            message,
            clean_all=False,     # Do not apply every operation; select them below
            extra_spaces=True,   # Remove extra white spaces
            stemming=True,       # Stem the words
            stopwords=True,      # Remove stop words
            lowercase=True,      # Convert to lowercase
            numbers=False,       # Keep digits
            punct=True,          # Remove all punctuation
            stp_lang='english'   # Stop word language
        )
        messages.append(corpus)

    return messages

Apply the text cleaning

In [None]:
cleaned = clean_text(data['Message'])

Compare the first results of the processed text with the unprocessed text

In [None]:
data[:9]

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.


In [None]:
for index, phrase in enumerate(cleaned[:9]):
    # Each cleaned message is a list of tokens; join them for display
    print(f"{index}  {' '.join(phrase)}")

0  go jurong point crazi avail bugi n great world la e buffet cine got amor wat 
1  ok lar joke wif u oni 
2  free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18 
3  u dun say earli hor u c alreadi say 
4  nah dont think goe usf live around though 
5  freemsg hey darl 3 week word back id like fun still tb ok xxx std chg send £150 rcv 
6  even brother like speak treat like aid patent 
7  per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun 
8  winner valu network custom select receivea £900 prize reward claim call 09061701461 claim code kl341 valid 12 hour 
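
The cleaned output above shows a side effect of removing punctuation: "£1.50" becomes "£150" and "question(std" becomes "questionstd", because the marks are removed without leaving a gap. The sketch below contrasts deleting punctuation with replacing it by a space. It uses only the standard library and does not reproduce how cleantext works internally; the helper names and the sample string (assembled from fragments of rows 2 and 5) are illustrative.

```python
import string

# Translation tables: delete ASCII punctuation vs. map each mark to a space
DELETE = str.maketrans("", "", string.punctuation)
TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def delete_punct(text):
    """Remove punctuation outright, then collapse repeated whitespace."""
    return " ".join(text.translate(DELETE).split())


def space_punct(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(TO_SPACE).split())


sample = "entry question(std txt rate)T&C's apply, £1.50 to rcv"
print(delete_punct(sample))
print(space_punct(sample))
```

Neither variant is strictly better: deletion fuses neighbouring words, while replacement splits prices and contractions into fragments. Which one suits a spam classifier is a judgment call that depends on the features used downstream.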
