<a href="https://colab.research.google.com/github/izaleme/CienciaDeDados/blob/main/Naive_Bayes_SK_Learn_Spam_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A very simple spam classifier using Naive Bayes and the UCI dataset

In [2]:
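import nltk
import ssl

# Workaround for environments where certificate verification blocks the NLTK download.
# Note: this disables HTTPS certificate verification for the whole process.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Older Python builds without this helper: keep the default context
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Fetch the stopword lists used later by the tokenizer
nltk.download('stopwords')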

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
from nltk.corpus import stopwords
import string
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report,confusion_matrix

### PART 1: DATA PREPROCESSING

Upload the dataset to the Colab environment.

In [4]:
from google.colab import files
uploaded = files.upload()


In [None]:
import io
messages = pd.read_csv(io.BytesIO(uploaded['spam.csv']), encoding='latin-1')

The dataset has three extra unnamed columns (`Unnamed: 2`, `Unnamed: 3`, `Unnamed: 4`), so I need to drop them first.

In [None]:
messages.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)


Renaming the columns

In [None]:
messages = messages.rename(columns={'v1': 'class','v2': 'text'})
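
The loading, dropping, and renaming steps above can be tried without the Colab upload widget. The sketch below is only an illustration: the bytes in `raw` are a hypothetical two-row stand-in that mimics the layout the cells above assume (columns `v1`, `v2`, and three empty unnamed columns), not the real `spam.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the uploaded file: same header layout, two made-up rows.
# \xa3 is the pound sign in latin-1, which is why the encoding argument matters.
raw = b"v1,v2,,,\nham,Ok lar,,,\nspam,Win \xa3100 now,,,\n"

messages = pd.read_csv(io.BytesIO(raw), encoding='latin-1')

# pandas names the empty header cells by position: 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'
messages = messages.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
messages = messages.rename(columns={'v1': 'class', 'v2': 'text'})

print(messages)
```

Assigning the result of `drop` back to `messages` is an alternative to `inplace=True`; both leave the same two columns.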

In [None]:
messages.head()

| | class | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### PART 2: CREATING A "TOKENIZER"

In [None]:
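# Build the stopword set once; the original looked the list up again for every word
STOPWORDS = set(stopwords.words('english'))

def process_text(text):
    '''
    What this does:
    1. Remove punctuation
    2. Remove stopwords
    3. Return the cleaned list of words from the text
    '''

    # 1. Drop every punctuation character and rebuild the string
    nopunc = ''.join(char for char in text if char not in string.punctuation)

    # 2. Keep only words that are not English stopwords (case-insensitive check)
    clean_words = [word for word in nopunc.split() if word.lower() not in STOPWORDS]

    # 3
    return clean_words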
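
To see what this tokenizer returns, here is a minimal sketch that runs without the NLTK download. `TOY_STOPWORDS` and `process_text_demo` are hypothetical names, and the tiny stopword set is an assumption made for illustration; the notebook itself uses the full NLTK English list.

```python
import string

# Assumption: a tiny hand-picked stopword set standing in for NLTK's English list
TOY_STOPWORDS = {"in", "a", "to", "the", "and"}

def process_text_demo(text):
    # 1. Remove punctuation characters
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # 2. Remove stopwords (compared in lowercase), 3. return the remaining words
    return [word for word in nopunc.split() if word.lower() not in TOY_STOPWORDS]

tokens = process_text_demo("Free entry in 2 a wkly comp to win FA Cup!")
print(tokens)
# ['Free', 'entry', '2', 'wkly', 'comp', 'win', 'FA', 'Cup']
```

The words keep their original casing; only the stopword comparison is lowercased.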

### PART 3: SPLITTING THE DATASET

In [None]:
msg_train, msg_test, class_train, class_test = train_test_split(messages['text'],messages['class'],test_size=0.2)
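
The split above is random on every run and does not control the ham/spam ratio in each part. One option, not used in this notebook, is to pass `random_state` and `stratify` to `train_test_split`. The sketch below uses made-up messages (16 ham, 4 spam) purely to show the effect.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham and 4 spam messages
texts = [f"ham message {i}" for i in range(16)] + [f"spam message {i}" for i in range(4)]
labels = ["ham"] * 16 + ["spam"] * 4

# random_state makes the split repeatable; stratify keeps the class ratio in both parts
msg_train, msg_test, class_train, class_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(msg_train), "training messages,", len(msg_test), "test messages")
print(class_test.count("spam"), "spam message(s) in the test set")
```

With a dataset as imbalanced as this one, stratifying helps keep the spam share of the test set close to that of the full data.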

### PART 4: DATA PREPROCESSING

Wait, we already created the tokenizer, right? A pipeline handles the rest: counting tokens, weighting them with TF-IDF, and classifying.

### PART 5: CREATING THE MODEL

In [None]:
pipeline = Pipeline([
    ('bow',CountVectorizer(analyzer=process_text)), # converts strings to integer counts
    ('tfidf',TfidfTransformer()), # converts integer counts to weighted TF-IDF scores
    ('classifier',MultinomialNB()) # train on TF-IDF vectors with Naive Bayes classifier
])
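
To make the first two pipeline steps concrete, the sketch below runs them by hand on two made-up sentences. It uses `CountVectorizer` with its default analyzer rather than `process_text`, which is a simplification for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical two-document corpus
docs = ["free prize now", "see you now"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)  # sparse matrix of integer word counts

# Vocabulary terms ordered by their column index
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocab)
print(counts.toarray())

# Reweight the counts: words that appear in every document get a lower weight
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))
```

In the first row, "now" appears in both documents, so it ends up with a smaller weight than "free" or "prize".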

### PART 6: TESTING

In [None]:
pipeline.fit(msg_train,class_train)

Pipeline(memory=None,
         steps=[('bow',
                 CountVectorizer(analyzer=<function process_text at 0x7f4f82ed1f28>,
                                 binary=False, decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('classifier',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [None]:
predictions = pipeline.predict(msg_test)

In [None]:
print(classification_report(class_test,predictions))

              precision    recall  f1-score   support

         ham       0.97      1.00      0.98       973
        spam       1.00      0.77      0.87       142

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.93      1115
weighted avg       0.97      0.97      0.97      1115
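
In the report above, spam has precision 1.00 and recall 0.77: every message flagged as spam really was spam, but roughly a quarter of the actual spam was missed. The sketch below shows how these two numbers come out of a confusion matrix, using made-up labels (not the notebook's predictions) chosen to show the same pattern.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 ham and 4 spam, with one spam message predicted as ham
y_true = ['ham'] * 6 + ['spam'] * 4
y_pred = ['ham'] * 6 + ['spam', 'spam', 'spam', 'ham']

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])
tn, fp, fn, tp = cm.ravel()

spam_precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)     # of the real spam, how much was caught

print(cm)
print(spam_precision, spam_recall)
```

In the notebook, the same call would be `confusion_matrix(class_test, predictions)`, which is already imported at the top.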



In [None]:
pred = pipeline.predict(["Free entry in 2 a wkly comp to win FA Cup"])
print(pred)
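
The cell above prints only the predicted label. If a confidence score is also useful, `predict_proba` returns one probability per class. The sketch below is self-contained and trains on four made-up messages with the default analyzer, so its numbers say nothing about the real model; it only shows the calls.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, far too small for a real classifier
train_texts = ["Free prize, claim now", "Win cash now",
               "See you at lunch", "Are we still meeting today"]
train_labels = ["spam", "spam", "ham", "ham"]

demo_pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
demo_pipeline.fit(train_texts, train_labels)

# Column order of the probabilities follows the classifier's classes_ attribute
classes = demo_pipeline.named_steps['classifier'].classes_
proba = demo_pipeline.predict_proba(["Free entry in 2 a wkly comp to win FA Cup"])

for label, p in zip(classes, proba[0]):
    print(label, round(p, 3))
```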