# SPAM Detection

**D3TOP – Topics in Data Science** <br />
**D3APL – Applications in Data Science** <br />
Specialization in Data Science - IFSP Campinas  <br />

Group:
- Michelle Melo Cavalcante

## 1. Overview

### 1.1. Business view

SMS spam detection matters because it protects end users from malicious links and fraud, saves time and money, improves service quality, and avoids network overload. It ensures that only legitimate, relevant messages are delivered, which improves the user experience and satisfaction with the service.

### 1.2. Dataset

The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. The data contains:
- `Category` - Label indicating whether the message is spam or not (`spam` or `ham`),
- `Message` - The message that was sent.

For more information about the dataset's features, see the SMS Spam Collection Data Set at https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection#.

### 1.3. Objectives

The objectives of this notebook are to:
- State the problem to be solved
- Describe the dataset
- Perform exploratory data analysis (EDA)
- Clean and preprocess the data
- Extract features and apply ML models
- Discuss results and future work
- Deploy to production


## 2. Exploratory Data Analysis

### 2.1. Dataset import and data cleaning

In [None]:
pip install keras

In [None]:
pip install tensorflow

In [None]:
pip install keras-preprocessing

In [3]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.7 (from pyspark)
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=c17cad7bb7696cfa6381bc8e206e9b9e569ec765089159a7cfd2522aaa24936c
  Stored in directory: /home/codespace/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.4.0
No

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
#from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
#from keras.models import Sequential
#from keras.layers import Dense, Dropout, Flatten, Embedding
#from keras.preprocessing.text import Tokenizer
#from keras_preprocessing.sequence import pad_sequences

from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import Tokenizer, StringIndexer, Word2Vec

In [6]:
# Create the Spark session
spark = SparkSession.builder.appName("nlp").getOrCreate()

23/04/21 23:25:41 WARN Utils: Your hostname, codespaces-98f670 resolves to a loopback address: 127.0.0.1; using 172.16.5.4 instead (on interface eth0)
23/04/21 23:25:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/21 23:25:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [9]:
spam = spark.read.csv("data/spam.csv", encoding="latin1", header=True, inferSchema=True)  # Read the CSV file into Spark
spam.createOrReplaceTempView("spam")                                                      # Register the 'spam' table in Spark SQL
spam_df = spark.sql("SELECT * FROM spam")                                                 # Query 'spam' through Spark SQL
spam_df.head(5)

[Row(Category='ham', Message='Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
 Row(Category='ham', Message='Ok lar... Joking wif u oni...'),
 Row(Category='spam', Message="Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"),
 Row(Category='ham', Message='U dun say so early hor... U c already then say...'),
 Row(Category='ham', Message="Nah I don't think he goes to usf, he lives around here though")]

In [10]:
df = spam_df.toPandas()     # Convert 'spam_df' to a pandas DataFrame
df.head(5)

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [14]:
# Convert the Category column to a numeric index so that the model can process it
stringmodel = StringIndexer(inputCol="Category", outputCol="CategoryIndex")
spamnew = stringmodel.fit(spam).transform(spam)                                      # fit learns the label-to-index mapping; transform returns the new DataFrame 'spamnew'
spamnew_df = spamnew.toPandas()
spamnew_df.head(5)

| | Category | Message | CategoryIndex |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0.0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0.0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1.0 |
| 3 | ham | U dun say so early hor... U c already then say... | 0.0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0.0 |
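To make the indexing step concrete, here is a minimal plain-Python sketch of the default ordering `StringIndexer` uses (most frequent label gets index 0, with ties broken alphabetically). The helper `index_labels` and the toy label list are illustrative assumptions, not part of the notebook or of the PySpark API.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The five labels visible in the output above, used here as a toy sample.
toy_labels = ["ham", "ham", "spam", "ham", "ham"]
mapping = index_labels(toy_labels)
print(mapping)
```

Because the index depends on label frequency, any unexpected extra value in `Category` (for example from a badly parsed CSV row) would receive its own index such as 2.0, so it may be worth inspecting the distinct labels before training.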


In [17]:
# Tokenization
tokens = Tokenizer(inputCol="Message", outputCol="MessageToken")
spamtoken = tokens.transform(spamnew)
spamtoken_df = spamtoken.toPandas()
spamtoken_df.head(5)

| | Category | Message | CategoryIndex | MessageToken |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0.0 | [go, until, jurong, point,, crazy.., available... |
| 1 | ham | Ok lar... Joking wif u oni... | 0.0 | [ok, lar..., joking, wif, u, oni...] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1.0 | [free, entry, in, 2, a, wkly, comp, to, win, f... |
| 3 | ham | U dun say so early hor... U c already then say... | 0.0 | [u, dun, say, so, early, hor..., u, c, already... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0.0 | [nah, i, don't, think, he, goes, to, usf,, he,... |
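The tokens above keep their punctuation (`lar...`, `point,`) because the tokenizer only lowercases and splits on whitespace. The following plain-Python sketch approximates that behavior and contrasts it with a hypothetical regex-based alternative; both function names are assumptions made for illustration. In PySpark, `RegexTokenizer` is the usual tool for pattern-based splitting.

```python
import re


def whitespace_tokenize(text):
    """Lowercase and split on whitespace; punctuation stays attached."""
    return text.lower().split()


def word_tokenize(text):
    """Hypothetical alternative: keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


message = "Ok lar... Joking wif u oni..."
print(whitespace_tokenize(message))
print(word_tokenize(message))
```

Stripping punctuation would merge variants such as `lar...` and `lar` into one vocabulary entry, which may or may not help a spam model, since punctuation patterns can themselves be informative.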


In [19]:
# Vector representation (embedding creation)
word2vec = Word2Vec(inputCol="MessageToken", outputCol="Messagew2v")
spamresult = word2vec.fit(spamtoken).transform(spamtoken)
spamresult_df = spamresult.toPandas()
spamresult_df.head(5)

                                                                                

| | Category | Message | CategoryIndex | MessageToken | Messagew2v |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0.0 | [go, until, jurong, point,, crazy.., available... | [0.005808000502292999, -0.010057617016718723, ... |
| 1 | ham | Ok lar... Joking wif u oni... | 0.0 | [ok, lar..., joking, wif, u, oni...] | [-0.03447391729181011, 0.07259231914455691, -0... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1.0 | [free, entry, in, 2, a, wkly, comp, to, win, f... | [0.0336706693425575, -0.04170963670393186, -0.... |
| 3 | ham | U dun say so early hor... U c already then say... | 0.0 | [u, dun, say, so, early, hor..., u, c, already... | [-0.042305203757926145, 0.0924930994144895, -0... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0.0 | [nah, i, don't, think, he, goes, to, usf,, he,... | [-0.03323681803885847, 0.0245418526017322, 0.0... |
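Each `Messagew2v` entry is a single vector per message: the Spark `Word2Vec` model represents a document as the average of its word vectors. The sketch below shows that averaging idea in plain Python. The two-dimensional `toy_embeddings` and the helper `average_vectors` are made up for illustration, and skipping unknown words is a simplification that may differ from what Spark does internally.

```python
def average_vectors(tokens, embeddings):
    """Average the vectors of the tokens that appear in `embeddings`."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return []
    dim = len(known[0])
    # Component-wise mean over all known word vectors.
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings, not values learned by the notebook.
toy_embeddings = {"free": [1.0, 2.0], "entry": [3.0, 4.0]}
doc_vector = average_vectors(["free", "entry"], toy_embeddings)
print(doc_vector)
```

Averaging discards word order, so two messages with the same words in a different order map to the same vector.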


In [21]:
# Train/test split
spamtrain, spamtest = spamresult.randomSplit([0.7,0.3])
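The split above is random, so each run can produce different train and test sets and therefore a different score. `randomSplit` accepts an optional `seed` argument (for example `randomSplit([0.7, 0.3], seed=42)`) to make it repeatable. The plain-Python sketch below illustrates the idea with a per-row random draw; `bernoulli_split` is a hypothetical helper, and the per-row sampling is an assumption about how such a split behaves, which is why the resulting sizes are only approximately 70/30.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Send each row to train with probability `train_fraction`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train_a, test_a = bernoulli_split(rows, 0.7, seed=42)
train_b, test_b = bernoulli_split(rows, 0.7, seed=42)
print(len(train_a), len(test_a))
```

With the same seed the two calls return identical partitions, which makes later metrics comparable across runs.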

In [22]:
# Model creation
rf = RandomForestClassifier(labelCol="CategoryIndex", featuresCol="Messagew2v", numTrees=500)  # label column / feature column (Word2Vec vectors) / number of trees
modelo = rf.fit(spamtrain)

23/04/22 00:05:02 WARN DAGScheduler: Broadcasting large task binary with size 1394.4 KiB
23/04/22 00:05:05 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB
                                                                                

In [23]:
# Predictions
previsao = modelo.transform(spamtest)

In [26]:
previsao_df = previsao.toPandas()
previsao_df.head()

23/04/22 00:06:58 WARN DAGScheduler: Broadcasting large task binary with size 2.7 MiB
                                                                                

| | Category | Message | CategoryIndex | MessageToken | Messagew2v | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|
| 0 | ham | "7 wonders in My WORLD 7th You 6th Ur style 5t... | 0.0 | ["7, wonders, in, my, world, 7th, you, 6th, ur... | [0.00972666570014553, -0.0020479442086070777, ... | [446.73802604262977, 53.016530125512766, 0.0, ... | [0.8934760520852594, 0.1060330602510255, 0.0, ... | 0.0 |
| 1 | ham | "A cute thought for friendship: ""Its not nece... | 0.0 | ["a, cute, thought, for, friendship:, ""its, n... | [0.023858107529122208, -0.02429645996016916, -... | [319.7755678802016, 179.94379503934144, 0.0, 0... | [0.639551135760403, 0.35988759007868276, 0.0, ... | 0.0 |
| 2 | ham | "Awesome question with a cute answer: Someone ... | 0.0 | ["awesome, question, with, a, cute, answer:, s... | [0.019155969574617654, -0.0189708280018889, 0.... | [487.7606197234332, 11.054470026677354, 0.0, 1... | [0.9755212394468662, 0.022108940053354703, 0.0... | 0.0 |
| 3 | ham | "Beautiful Truth against Gravity.. Read carefu... | 0.0 | ["beautiful, truth, against, gravity.., read, ... | [-0.0056389791414123746, -0.000313028388728316... | [491.7353442163703, 8.227821031760987, 0.0, 0.... | [0.9834706884327407, 0.016455642063521975, 0.0... | 0.0 |
| 4 | ham | "Best line said in Love: . ""I will wait till ... | 0.0 | ["best, line, said, in, love:, ., ""i, will, w... | [-0.02265100063825095, 0.02101879501370368, -0... | [492.38527827023375, 7.576961791067708, 0.0, 0... | [0.9847705565404674, 0.015153923582135414, 0.0... | 0.0 |
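The `rawPrediction` and `probability` columns are closely related. In the first row above, dividing the first `rawPrediction` entry by the 500 trees reproduces the first `probability` entry, which is consistent with `probability` being `rawPrediction` normalized to sum to one. The sketch below shows that normalization in plain Python; `normalize` and the toy vector are illustrative assumptions.

```python
def normalize(raw_prediction):
    """Scale a vector of non-negative scores so that it sums to one."""
    total = sum(raw_prediction)
    return [value / total for value in raw_prediction]


# Hypothetical two-class raw prediction (for example, accumulated tree votes).
toy_raw = [4.0, 1.0]
toy_prob = normalize(toy_raw)
print(toy_prob)

# Check against the first row shown above: numTrees=500 in the model.
first_entry = 446.73802604262977 / 500
print(first_entry)
```

Note that the vectors in the output have at least three entries, which indicates that more than two distinct `Category` values were indexed. That may point to rows that were parsed incorrectly when the CSV was read, so the label column is worth checking before interpreting the model as a binary classifier.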


In [28]:
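# Model performance
# We use the evaluator for binary classification

from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Note: rawPredictionCol points at the hard "prediction" column, so the ROC curve is built from 0/1 predictions rather than from the scores in "rawPrediction"
avaliar = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="CategoryIndex", metricName="areaUnderROC")    # the closer to 1, the better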
areaUnderRoc = avaliar.evaluate(previsao)
print(areaUnderRoc)

23/04/22 00:13:17 WARN DAGScheduler: Broadcasting large task binary with size 2.6 MiB


0.8545848091302637
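The area under the ROC curve can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counting as one half. The plain-Python sketch below computes it that way and shows how the value can change when hard 0/1 predictions are used in place of continuous scores. The function `auc`, the toy data, and the 0.5 threshold are assumptions for illustration; this is not how Spark computes the metric internally.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.8]                      # hypothetical model scores
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]   # thresholded predictions

auc_scores = auc(labels, scores)
auc_hard = auc(labels, hard)
print(auc_scores, auc_hard)
```

In this toy case the scores rank every positive above every negative, but thresholding creates ties that lower the value. Evaluating on the `rawPrediction` column instead of `prediction` would use the full score information, although the number printed above would then change.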


In [31]:
# Accuracy and confusion matrix?
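One possible way to answer the open question above is to count the four outcomes of a binary classifier and derive accuracy from them. The plain-Python sketch below does this for toy data; `confusion_counts`, `accuracy`, and the sample lists are illustrative assumptions, and label 1.0 is treated as spam to match the `CategoryIndex` values shown earlier.

```python
def confusion_counts(labels, predictions, positive=1.0):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if p == positive and y == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif y == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def accuracy(counts):
    """Share of examples that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical labels and predictions, not results from the notebook.
toy_labels = [0.0, 0.0, 1.0, 1.0, 0.0]
toy_preds = [0.0, 1.0, 1.0, 0.0, 0.0]
counts = confusion_counts(toy_labels, toy_preds)
print(counts, accuracy(counts))
```

In the notebook, the same functions could be fed with `previsao_df["CategoryIndex"]` and `previsao_df["prediction"]`. Because ham messages dominate the sample, accuracy alone can look high even when many spam messages are missed, so the false-negative count deserves attention. Any label other than 1.0 is counted as negative here.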