# Spam Detection

**D3TOP – Topics in Data Science** <br />
**D3APL – Applications in Data Science** <br />
Specialization in Data Science - IFSP Campinas <br />

Group:
- Michelle Melo Cavalcante

## 1. Overview

### 1.1. Business view

Detecting SMS spam protects end users from malicious links and fraud, saves them time and money, improves service quality, and avoids network overload. Filtering ensures that only legitimate, relevant messages are delivered, which improves the user experience and satisfaction with the service.

### 1.2. Dataset

The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. It provides two fields:
- `Category` - label indicating whether the message is spam (`spam`) or not (`ham`),
- `Message` - text of the message.

For more information about the dataset's features, see the SMS Spam Collection Data Set at https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection#.
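
As a quick sanity check on a table with these two columns, the class balance can be inspected with pandas. The sketch below uses a tiny hand-made DataFrame (hypothetical rows, not taken from the real dataset) only to illustrate the calls.

```python
import pandas as pd

# Hypothetical sample rows with the same two columns as the dataset.
sample = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham"],
    "Message": [
        "See you at lunch?",
        "Ok, call me later",
        "WIN a free prize now!!!",
        "On my way home",
    ],
})

# Absolute and relative frequency of each label.
counts = sample["Category"].value_counts()
shares = sample["Category"].value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(2).to_dict())
```

On the real data, a strongly skewed share of `ham` would suggest looking beyond accuracy when evaluating a classifier.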

### 1.3. Objectives

The objectives of this notebook are to:
- State the problem to be solved
- Describe the dataset
- Perform exploratory data analysis (EDA)
- Clean and preprocess the data
- Extract features and apply ML models
- Discuss results and future work
- Deploy to production


## 2. Exploratory Data Analysis

### 2.1. Dataset import and data cleaning

In [2]:
pip install keras

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, Dropout



In [5]:
df = pd.read_csv("data/spam.csv", encoding="latin-1")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(df['Category'])
print(y)

[0 0 1 ... 0 0 0]
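
To make the encoding explicit, the following minimal sketch shows how `LabelEncoder` assigns integers. The label list is hypothetical; the point is that `classes_` is sorted, so `ham` maps to 0 and `spam` to 1, which matches the output printed for the real data.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["ham", "ham", "spam", "ham"]  # hypothetical labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is "ham" and index 1 is "spam".
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original strings from the integer codes.
decoded = encoder.inverse_transform(encoded).tolist()
```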


In [7]:
mensagens = df['Message'].values 
X_train, X_test, y_train, y_test = train_test_split(mensagens, y, test_size=0.3)
print(X_train)

['ok....take care.umma to you too...'
 'You stayin out of trouble stranger!!saw Dave the other day heÂ\x92s sorted now!still with me bloke when u gona get a girl MR!ur mum still Thinks we will get 2GETHA!'
 'That depends. How would you like to be treated? :)' ...
 'Where are you ? You said you would be here when I woke ... :-('
 'I am late. I will be there at'
 '&lt;#&gt;  great loxahatchee xmas tree burning update: you can totally see stars here']
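
The split in this notebook is random and unseeded, so each run produces a different partition. A possible variation, sketched here on hypothetical data, fixes `random_state` for reproducibility and passes `stratify` so that both subsets keep the same ham/spam ratio. This is a suggestion, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 ham (0) and 20 spam (1).
messages = np.array([f"message {i}" for i in range(100)])
labels = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels,
    test_size=0.3,
    random_state=42,   # fixed seed: same split on every run
    stratify=labels,   # keep the ham/spam ratio in both subsets
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```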


In [8]:
vetorizador = CountVectorizer()
vetorizador.fit(X_train)                         # fit learns the vocabulary from the training set without transforming it
X_train = vetorizador.transform(X_train)         # for X_train alone, fit_transform could be called directly
X_test = vetorizador.transform(X_test)
print(X_train.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [9]:
np.set_printoptions(threshold=np.inf)
X_train.toarray()[0]   # nonzero counts appear at the indices of the vocabulary words present in the message

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [10]:
# stack the layers
modelo = Sequential()
modelo.add(Dense(units=10,activation="relu",input_dim=X_train.shape[1]))   # number of neurons / activation function / input size = number of columns of the vectorized matrix
modelo.add(Dropout(0.1))                                                   # reduces overfitting: randomly zeroes 10% of the previous layer's outputs during training
modelo.add(Dense(units=8,activation="relu"))                               # the number of units is a free design choice
modelo.add(Dropout(0.1))
modelo.add(Dense(units=1,activation="sigmoid"))                            # binary problem (0 or 1): a single sigmoid unit outputs a score between 0 and 1
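
To illustrate what a dropout rate of 0.1 means, here is a NumPy sketch of inverted dropout: each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, which is the behavior the Keras `Dropout` layer documents for training time. The helper `inverted_dropout` and the input array are hypothetical and are not Keras code.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)        # fixed seed, for illustration only
layer_output = np.ones((1000, 10))    # hypothetical activations of a 10-unit layer
dropped = inverted_dropout(layer_output, rate=0.1, rng=rng)

# Roughly 10% of the values should have been zeroed.
zero_fraction = float((dropped == 0).mean())
print(round(zero_fraction, 2))
```

At inference time no values are dropped, so the rescaling during training keeps the expected activation magnitude the same in both modes.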

In [11]:
# compile the model (assemble its structure)
modelo.compile(loss="mean_squared_error", optimizer="adam", metrics=["accuracy"])       # loss = gap between predicted and actual values / optimizer = how the network weights are updated / metrics = how the network reports its performance
modelo.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 10)                72480     
                                                                 
 dropout (Dropout)           (None, 10)                0         
                                                                 
 dense_1 (Dense)             (None, 8)                 88        
                                                                 
 dropout_1 (Dropout)         (None, 8)                 0         
                                                                 
 dense_2 (Dense)             (None, 1)                 9         
                                                                 
Total params: 72,577
Trainable params: 72,577
Non-trainable params: 0
_________________________________________________________________
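
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. In the sketch below, the vocabulary size of 7247 is not stated in the notebook; it is inferred from the first layer's count, since 72480 = (7247 + 1) × 10.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

vocabulary_size = 7247  # inferred from the summary, not printed by the notebook
layer_sizes = [vocabulary_size, 10, 8, 1]

# Pair each layer's input size with its number of units.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, because its input is the full bag-of-words vector.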


In [12]:
# training
modelo.fit(X_train, y_train, epochs=30, batch_size=10, verbose=True, validation_data=(X_test,y_test))

Epoch 1/30




Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7f07d5b51060>

In [13]:
loss, accuracy = modelo.evaluate(X_test, y_test)
print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.01124043669551611
Accuracy:  0.9868420958518982
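
The loss reported here is mean squared error, as set in `compile`. Binary cross-entropy is the more common choice for a single sigmoid output because it penalizes confident mistakes more heavily. The NumPy sketch below compares the two on hypothetical labels and scores; it does not reproduce the notebook's numbers.

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0, 1.0])   # hypothetical labels
y_prob = np.array([0.1, 0.4, 0.9, 0.2])   # hypothetical sigmoid outputs

# Mean squared error between labels and scores.
mse = float(np.mean((y_true - y_prob) ** 2))

# Binary cross-entropy, clipping to avoid log(0).
eps = 1e-7
p = np.clip(y_prob, eps, 1 - eps)
bce = float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

print(round(mse, 4), round(bce, 4))
```

The last example (a spam message scored 0.2) dominates the cross-entropy far more than it dominates the squared error, which is why the two losses are not directly comparable in magnitude.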




In [14]:
# predict scores for the test messages
previsao = modelo.predict(X_test)
print(previsao)

[[8.25742740e-14]
 [2.48471483e-08]
 [6.85519539e-04]
 [3.98535849e-06]
 [1.14292177e-07]
 [3.77528758e-10]
 [9.77491960e-04]
 [3.51581235e-08]
 [1.22445609e-09]
 [1.48151317e-04]
 [1.03290243e-09]
 [4.60116235e-06]
 [1.17959687e-04]
 [1.95173961e-05]
 [6.96026680e-19]
 [2.85855476e-05]
 [1.42993383e-06]
 [1.66401105e-16]
 [3.13733691e-17]
 [9.99980330e-01]
 [2.69913536e-11]
 [3.78666073e-06]
 [9.99922514e-01]
 [9.99984205e-01]
 [6.71558226e-11]
 [1.76530248e-05]
 [2.96516555e-06]
 [3.43024622e-05]
 [3.13486339e-08]
 [1.20575055e-02]
 [9.87885142e-06]
 [6.35115284e-05]
 [2.15746766e-07]
 [9.69352168e-11]
 [2.28113277e-06]
 [1.33470655e-03]
 [8.29082855e-05]
 [5.19736372e-11]
 [1.03266258e-08]
 [1.79478411e-11]
 [7.31604176e-13]
 [2.71761563e-13]
 [3.29691395e-02]
 [1.79676898e-13]
 [2.01593450e-10]
 [1.89369104e-07]
 [7.18424189e-07]
 [6.70684912e-08]
 [1.25363549e-05]
 [1.71969719e-12]
 [1.69797332e-09]
 [5.93062225e-07]
 [8.30865545e-08]
 [0.00000000e+00]
 [2.34239224e-06]
 [9.086696



In [15]:
prev = (previsao > 0.5)
print(prev)

[[False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [ True]
 [False]
 [False]
 [ True]
 [ True]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [ True]
 [ True]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [ True]
 [ True]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [ True]
 [False]
 [False]
 [ True]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [ True]
 [False]
 [ True]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
 [False]
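
The boolean column can be turned into readable labels. The sketch below uses a few hypothetical scores in the same `(n, 1)` shape that `predict` returns, and assumes the encoding printed earlier in the notebook (0 = ham, 1 = spam).

```python
import numpy as np

# Hypothetical scores in the (n, 1) shape returned by predict.
probabilities = np.array([[8.3e-14], [0.99998], [0.0121], [0.64]])

threshold = 0.5  # same cut-off used in the notebook
predicted = (probabilities > threshold).astype(int).ravel()

# Map integer codes back to label names (0 = ham, 1 = spam).
names = np.where(predicted == 1, "spam", "ham")

print(predicted.tolist())
print(names.tolist())
```

Raising the threshold makes the filter more conservative about flagging messages as spam, at the cost of letting more spam through.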
 

In [16]:
cm = confusion_matrix(y_test, prev)
print(cm)

[[1440    1]
 [  21  210]]
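
Accuracy alone can hide how the errors are distributed, so it may help to derive precision, recall, and F1 from the matrix. The sketch below reuses the four numbers printed by the notebook and relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

cm = np.array([[1440, 1],
               [21, 210]])  # rows = true class, columns = predicted class

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Only 1 legitimate message was flagged as spam, but 21 of the 231 spam messages in the test set passed as legitimate, so recall is noticeably lower than precision.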
