#### Objective: Build a spam classifier using Bayes' Theorem and a data set of emails labeled as spam or non-spam.


#### Import libraries

In [1]:
import csv
import pandas as pd
import numpy as np
import string


##### Load the CSV file

In [2]:
spam_or_ham = pd.read_csv("spam.csv", encoding='latin-1')[["v1", "v2"]]
spam_or_ham.columns = ["label", "text"]
spam_or_ham.head()

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


##### Count the spam and ham messages

In [3]:
spam_or_ham["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
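
These counts already give the prior probabilities that Bayes' Theorem needs. As a minimal sketch, the snippet below combines them with two made-up likelihoods to compute the probability that a message is spam given that it contains a certain word. The values 0.20 and 0.01 are assumptions chosen for illustration; they are not measured from the data set.

```python
# Class counts reported by value_counts().
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

# Priors: probability of each class before looking at the text.
p_spam = spam_count / total
p_ham = ham_count / total

# ASSUMED likelihoods (illustrative only, not taken from the data):
# how often some word appears in spam and in ham messages.
p_word_given_spam = 0.20
p_word_given_ham = 0.01

# Bayes' Theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam)        = {p_spam:.4f}")
print(f"P(spam | word) = {p_spam_given_word:.4f}")
```

Even though only about 13% of the messages are spam, a word that is much more common in spam than in ham raises the posterior probability well above the prior.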

#### Tokenization

In [4]:
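import string
punctuation = set(string.punctuation)

def tokenize(sentence):
    """Split a sentence into lowercase tokens with punctuation removed."""
    tokens = []
    for token in sentence.split():
        # Keep only the non-punctuation characters, lowercased.
        new_token = []
        for character in token:
            if character not in punctuation:
                new_token.append(character.lower())
        # Skip tokens that consisted entirely of punctuation.
        if new_token:
            tokens.append("".join(new_token))
    return tokens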
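
The same idea can be written more compactly with `str.translate`. This is only a sketch of an alternative, and `tokenize_compact` is a hypothetical name; for ordinary text it should return the same tokens as `tokenize`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_strip_punctuation = str.maketrans("", "", string.punctuation)

def tokenize_compact(sentence):
    """Lowercase each whitespace-separated word and drop its punctuation."""
    words = (word.translate(_strip_punctuation).lower() for word in sentence.split())
    # Words made only of punctuation become empty strings and are discarded.
    return [word for word in words if word]

print(tokenize_compact("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```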

##### Apply the tokenization function

In [5]:
spam_or_ham.head()["text"].apply(tokenize)

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: text, dtype: object

#### We use the scikit-learn library to do the heavy lifting of training and testing. We tell the vectorizer which function to use for tokenization and that its output must be binary: each word is recorded as present or absent, not counted.
#### We then split the data into training and test sets.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
demo_vectorizer = CountVectorizer(
    tokenizer = tokenize,
    binary = True
)
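
To make the effect of `binary = True` concrete, here is a small plain-Python illustration. It is not scikit-learn's implementation, and the message is made up. It compares a count vector with a binary vector for one tokenized message.

```python
from collections import Counter

# Hypothetical message, already tokenized.
tokens = ["free", "entry", "free", "prize", "call", "free"]

vocabulary = sorted(set(tokens))  # one column per distinct word
counts = Counter(tokens)

# Count encoding: how many times each word occurs.
count_vector = [counts[word] for word in vocabulary]
# Binary encoding: only whether each word occurs.
binary_vector = [1 if counts[word] else 0 for word in vocabulary]

print(vocabulary)     # ['call', 'entry', 'free', 'prize']
print(count_vector)   # [1, 1, 3, 1]
print(binary_vector)  # [1, 1, 1, 1]
```

With the binary encoding, repeating "free" three times carries no more weight than mentioning it once.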

In [7]:
from sklearn.model_selection import train_test_split
train_text, test_text, train_labels, test_labels = train_test_split(spam_or_ham["text"], spam_or_ham["label"], stratify=spam_or_ham["label"])
print(f"Training examples: {len(train_text)}, testing examples {len(test_text)}")

# stratify keeps the proportion of spam and ham the same in the training and test sets

Training examples: 4179, testing examples 1393
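
`train_test_split` was called without `test_size`, so scikit-learn's default of 25% applies, which matches 1393 test examples out of 5572. The sketch below shows, in plain Python, what a stratified split means: each class is split separately so that both sets keep the same class proportions. `stratified_split` is a hypothetical helper written for illustration, not scikit-learn's algorithm.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)
    train_indices, test_indices = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return train_indices, test_indices

# Synthetic labels with the same class counts as the data set.
labels = ["ham"] * 4825 + ["spam"] * 747
train_indices, test_indices = stratified_split(labels)

spam_share_all = labels.count("spam") / len(labels)
spam_share_test = sum(labels[i] == "spam" for i in test_indices) / len(test_indices)
print(f"train: {len(train_indices)}, test: {len(test_indices)}")
print(f"spam share overall: {spam_share_all:.3f}, in test set: {spam_share_test:.3f}")
```

Without stratification, a random split of an imbalanced data set can leave the test set with noticeably more or less spam than the training set.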


#### That leaves 4179 examples for training and 1393 for testing. We create a new vectorizer from scratch and fit it using only the training data.
#### The test data is NOT used to fit the vectorizer; it is only transformed with the vocabulary learned from the training data.

In [8]:
real_vectorizer = CountVectorizer(tokenizer = tokenize, binary=True)
train_X = real_vectorizer.fit_transform(train_text)
test_X = real_vectorizer.transform(test_text)
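
The difference between fitting on the training data and merely transforming the test data can be sketched in plain Python. `fit_vocabulary` and `transform` are hypothetical helpers and the messages are made up; this is an illustration of the idea, not scikit-learn's code.

```python
def fit_vocabulary(token_lists):
    """Learn a word -> column index mapping from the training messages only."""
    words = sorted({token for tokens in token_lists for token in tokens})
    return {word: column for column, word in enumerate(words)}

def transform(token_lists, vocabulary):
    """Encode messages as binary rows; words outside the vocabulary are ignored."""
    rows = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for token in tokens:
            column = vocabulary.get(token)
            if column is not None:
                row[column] = 1
        rows.append(row)
    return rows

train_tokens = [["free", "prize", "now"], ["see", "you", "later"]]
test_tokens = [["free", "pizza", "later"]]  # "pizza" never appears in training

vocabulary = fit_vocabulary(train_tokens)
train_rows = transform(train_tokens, vocabulary)
test_rows = transform(test_tokens, vocabulary)

print(vocabulary)
print(test_rows)  # [[1, 1, 0, 0, 0, 0]]
```

The test message keeps the same six columns as the training data, and the unseen word "pizza" simply contributes nothing.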

#### We create the classifier (a linear support vector machine, LinearSVC) and call its fit() method, which trains it for later use. Again, we train it on the training data, not the test data.



In [9]:
from sklearn.svm import LinearSVC
classifier = LinearSVC()
classifier.fit(train_X, train_labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
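
The objective of this notebook mentions Bayes' Theorem, whereas the classifier trained here is a LinearSVC. For comparison, the following is a minimal plain-Python sketch of a Bernoulli naive Bayes classifier over binary word features. `TinyBernoulliNB` is a hypothetical class and the training messages are made up; it is meant to show how Bayes' Theorem turns word presence into a class decision, not to replace the model above.

```python
import math
from collections import Counter, defaultdict

class TinyBernoulliNB:
    """Hypothetical helper: Bernoulli naive Bayes with Laplace smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.vocabulary = {token for tokens in token_lists for token in tokens}
        # label -> word -> number of messages of that label containing the word
        self.doc_counts = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            self.doc_counts[label].update(set(tokens))
        return self

    def _log_posterior(self, present, label):
        n_docs = self.class_counts[label]
        # Log prior: share of training messages with this label.
        score = math.log(n_docs / sum(self.class_counts.values()))
        for word in self.vocabulary:
            # Smoothed estimate of P(word present | label).
            p = (self.doc_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p if word in present else 1 - p)
        return score

    def predict(self, token_lists):
        predictions = []
        for tokens in token_lists:
            present = set(tokens)
            best = max(self.class_counts, key=lambda c: self._log_posterior(present, c))
            predictions.append(best)
        return predictions

# Made-up training messages, already tokenized.
train_tokens = [
    ["win", "free", "prize"],
    ["free", "entry", "win"],
    ["ok", "see", "you", "later"],
    ["are", "you", "free", "later"],
    ["call", "me", "later"],
]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

model = TinyBernoulliNB().fit(train_tokens, train_labels)
predictions = model.predict([["win", "prize", "now"], ["see", "you", "later"]])
print(predictions)  # ['spam', 'ham']
```

The scores are summed in log space to avoid multiplying many small probabilities together; the "naive" part is the assumption that words occur independently given the class.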

#### The classifier is ready to use.
#### We call predict() on the test data and compare the predictions with the true labels to measure the model's accuracy.
#### The scikit-learn function accuracy_score() computes the score for us.

In [10]:
from sklearn.metrics import accuracy_score
predicciones = classifier.predict(test_X)
accuracy = accuracy_score(test_labels, predicciones)
print(f"Accuracy: {accuracy:.4%}")

Accuracy: 98.2053%
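
Accuracy is the fraction of predictions that match the true labels. Because ham is far more common than spam in this data set, it helps to compare the result with a trivial baseline. The sketch below uses synthetic labels with the same class counts as the full data set and a hypothetical baseline that always answers "ham"; the helper functions are written for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

def recall(true_labels, predicted_labels, positive):
    """Fraction of the truly positive examples that were predicted positive."""
    for_positives = [p for t, p in zip(true_labels, predicted_labels) if t == positive]
    return sum(p == positive for p in for_positives) / len(for_positives)

# Labels with the same class counts as the full data set (4825 ham, 747 spam).
true_labels = ["ham"] * 4825 + ["spam"] * 747
# Baseline that ignores the text and always predicts "ham".
baseline = ["ham"] * len(true_labels)

baseline_accuracy = accuracy(true_labels, baseline)
baseline_spam_recall = recall(true_labels, baseline, "spam")
print(f"Baseline accuracy:    {baseline_accuracy:.4%}")
print(f"Baseline spam recall: {baseline_spam_recall:.4%}")
```

The baseline reaches roughly 86.6% accuracy while catching no spam at all, so the classifier's 98.2% is best read as an improvement over that figure, not over 0%.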


In [13]:
frases = [
  'Are you looking to redesign your website with new modern look and feel?',
  'Please send me a confirmation of complete and permanent erasure of the personal data',
  'You have been selected to win a FREE suscription to our service',
  'We’re contacting you because the webhook endpoint associated with your account in test mode has been failing',
  'Confirma tu cuenta de Facebook en el siguiente link',
  'You have been selected to participate in a free service'
]

#### We apply the vectorizer fitted on the training data to the new phrases, then pass the result to the classifier to obtain its predictions

In [14]:
frases_X = real_vectorizer.transform(frases)
predicciones = classifier.predict(frases_X)

#### Show each phrase with its predicted label

In [15]:
for text, label in zip(frases, predicciones):
  print(f"{label:5} - {text}")

spam  - Are you looking to redesign your website with new modern look and feel?
ham   - Please send me a confirmation of complete and permanent erasure of the personal data
spam  - You have been selected to win a FREE suscription to our service
ham   - We’re contacting you because the webhook endpoint associated with your account in test mode has been failing
ham   - Confirma tu cuenta de Facebook en el siguiente link
spam  - You have been selected to participate in a free service
