### LSTM Neural Network
   The goal of the network is to classify an e-mail as spam or ham based on its content.
   An LSTM recurrent neural network (RNN) was used.

In [1]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.optimizers import Adam
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

Using TensorFlow backend.


In [3]:
MAX_SENTENCE_LENGTH = 150
VOCABULARY_SIZE = 2000
VECTOR_DIM = 100
TRAIN_TEST_SPLIT = 0.2
BATCH_SIZE = 64
EPOCHS = 10

In [4]:
print('Loading pre-trained word vectors')

embedding_dic = {}

# Each line holds a word followed by its vector components, separated by spaces.
with open('./glove.6b.100d.txt', encoding="utf8") as vector_list:
    for line in vector_list:
        dimensions = line.split()
        word = dimensions[0]
        vector = np.asarray(dimensions[1:], dtype='float32')
        embedding_dic[word] = vector

print("Total embedding vectors:"+str(len(embedding_dic)))

Loading pre-trained word vectors
Total embedding vectors:400000
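
The loading loop above relies on each line having the form `word v1 v2 ...`. As a minimal sketch of that parsing step, the snippet below runs the same logic on two invented 3-dimensional entries. The `parse_glove` helper and the sample values are illustrative assumptions, not part of the notebook, and plain Python lists stand in for the NumPy arrays.

```python
import io

def parse_glove(handle):
    """Map each word to its vector, reading one 'word v1 v2 ...' entry per line."""
    table = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        table[parts[0]] = [float(v) for v in parts[1:]]
    return table

# Two invented 3-dimensional entries standing in for the 100-dimensional file.
sample = io.StringIO("free 0.1 -0.2 0.3\nwin 0.5 0.0 -1.5\n")
vectors = parse_glove(sample)
print(len(vectors), vectors["win"])
```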


In [5]:
email_data = pd.read_csv('./spam.csv', delimiter=',', encoding='latin-1')
email_data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [6]:
email_data.drop(['Unnamed: 2','Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
email_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
v1    5572 non-null object
v2    5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [7]:
emails = email_data["v2"].fillna("DUMMY_VALUES").values
labels = email_data["v1"]

print(emails[:10])

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 'U dun say so early hor... U c already then say...'
 "Nah I don't think he goes to usf, he lives around here though"
 "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv"
 'Even my brother is not like to speak with me. They treat me like aids patent.'
 "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
 'WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'
 'Had your mobil

In [8]:
le = LabelEncoder()
labels = le.fit_transform(labels)
labels = labels.reshape(-1,1)
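
`LabelEncoder` assigns integers to the class names in sorted order, so here `ham` becomes 0 and `spam` becomes 1, and the reshape turns the result into a single column. A rough pure-Python equivalent, where `encode_labels` is a hypothetical helper used only for illustration:

```python
def encode_labels(labels):
    """Assign each class its position in the sorted list of distinct classes."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    # One-element rows mirror the reshape(-1, 1) column layout.
    return classes, [[index[name]] for name in labels]

classes, encoded = encode_labels(["ham", "ham", "spam", "ham"])
print(classes)   # ['ham', 'spam']
print(encoded)   # [[0], [0], [1], [0]]
```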

In [9]:
tok = Tokenizer(num_words=VOCABULARY_SIZE)

tok.fit_on_texts(emails)
print("Emails as text -> "+str(emails[0]))

email_sequences = tok.texts_to_sequences(emails)
print("Emails as integer sequences -> "+str(email_sequences[0]))

emails_sequence_matrix = pad_sequences(email_sequences, maxlen=MAX_SENTENCE_LENGTH)
print("Emails padded to length "+str(MAX_SENTENCE_LENGTH)+" -> \n"+str(emails_sequence_matrix[0]))


Emails as text -> Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Emails as integer sequences -> [50, 469, 841, 751, 657, 64, 8, 1324, 89, 121, 349, 1325, 147, 1326, 67, 58, 144]
Emails padded to length 150 -> 
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0   50  469  841  751  657   64   
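
The zeros at the front of the padded row come from the `pad_sequences` defaults, `padding='pre'` and `truncating='pre'`: short sequences are left-padded with 0 and long ones keep only their last `maxlen` items. A minimal sketch of that behaviour, where `pad_pre` is a hypothetical stand-in rather than the Keras implementation:

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:] if maxlen > 0 else []
    return [value] * (maxlen - len(trimmed)) + trimmed

# The first e-mail as encoded by the tokenizer (17 tokens).
first_email = [50, 469, 841, 751, 657, 64, 8, 1324, 89, 121,
               349, 1325, 147, 1326, 67, 58, 144]
padded = pad_pre(first_email, 150)
print(len(padded), padded[:3], padded[-3:])   # 150 [0, 0, 0] [67, 58, 144]
print(pad_pre([1, 2, 3, 4], 2))               # [3, 4]
```

With 17 tokens and a length of 150, the row starts with 133 zeros, which matches the printed matrix.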

In [10]:
word2idx = tok.word_index
print("Number of unique tokens -> "+str(len(word2idx)))

Number of unique tokens -> 8920
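
The count of 8920 is larger than `VOCABULARY_SIZE` because `word_index` lists every word seen during fitting. The `num_words` limit only applies when texts are converted to sequences, where words ranked at or beyond the limit are dropped. The sketch below imitates this under simplifying assumptions (lowercasing and whitespace splitting only, while the Keras `Tokenizer` also filters punctuation). Both helper functions are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below `num_words`; drop the rest."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

texts = ["free prize call now", "call me now", "call later"]
word_index = build_word_index(texts)
print(len(word_index))                        # 6: every distinct word is indexed
print(to_sequence(texts[0], word_index, 3))   # [1, 2]: only indices 1 and 2 survive
```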


In [11]:
# Index 0 is reserved for padding, hence the +1.
total_words = min(VOCABULARY_SIZE, len(word2idx)+1)

In [12]:
embedding_matrix = np.zeros((total_words, VECTOR_DIM))

In [13]:
for word, index in word2idx.items():
    if index < VOCABULARY_SIZE:
        vector = embedding_dic.get(word)
        if vector is not None:
            embedding_matrix[index] = vector
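
The loop above leaves a row of zeros in three cases: index 0 (padding), words without a pre-trained vector, and words whose index falls outside the matrix. A toy version with invented words and 2-dimensional vectors makes this visible. All names and values below are assumptions chosen for illustration, and nested lists replace the NumPy array.

```python
VOCAB = 4   # toy stand-in for VOCABULARY_SIZE
DIM = 2     # toy stand-in for VECTOR_DIM

word2idx_toy = {"call": 1, "free": 2, "lol": 3, "rare": 4}
pretrained = {"call": [0.1, 0.2], "free": [0.3, 0.4], "rare": [0.9, 0.9]}

matrix = [[0.0] * DIM for _ in range(VOCAB)]
for word, index in word2idx_toy.items():
    if index < VOCAB:                  # "rare" (index 4) falls outside the matrix
        vector = pretrained.get(word)
        if vector is not None:         # "lol" has no pre-trained vector
            matrix[index] = list(vector)

for row in matrix:
    print(row)
```

Rows 0 and 3 stay at zero, while rows 1 and 2 receive the pre-trained vectors.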

In [14]:
embedding_layer = Embedding(total_words, VECTOR_DIM, weights=[embedding_matrix],
                           input_length=MAX_SENTENCE_LENGTH, trainable=False)

In [15]:
input_ = Input(shape=(MAX_SENTENCE_LENGTH,))

x = embedding_layer(input_)
x = LSTM(20)(x)

output = Dense(1, activation="sigmoid")(x)

model = Model(input_, output)
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.01), metrics=['accuracy'])

Instructions for updating:
Colocations handled automatically by placer.


In [16]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 150)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 150, 100)          200000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 20)                9680      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 21        
Total params: 209,701
Trainable params: 9,701
Non-trainable params: 200,000
_________________________________________________________________
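
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type. The helper names are hypothetical, and the numbers come from the constants defined earlier (vocabulary of 2000, 100-dimensional vectors, 20 LSTM units).

```python
def embedding_params(vocab_size, dim):
    """One vector of size `dim` per vocabulary entry."""
    return vocab_size * dim

def lstm_params(units, input_dim):
    """Four weight blocks (input, forget, and output gates plus the cell
    candidate), each with input weights, recurrent weights, and a bias."""
    return 4 * (units * input_dim + units * units + units)

def dense_params(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs

emb = embedding_params(2000, 100)
lstm = lstm_params(20, 100)
dense = dense_params(20, 1)
print(emb, lstm, dense, emb + lstm + dense)   # 200000 9680 21 209701
```

Only the LSTM and Dense weights (9,701 in total) are trainable, because the embedding layer was created with `trainable=False`.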


In [17]:
print("Training model")

classifier = model.fit(emails_sequence_matrix, labels, batch_size=BATCH_SIZE, epochs=EPOCHS,
                      validation_split=TRAIN_TEST_SPLIT)


Training model
Instructions for updating:
Use tf.cast instead.
Train on 4457 samples, validate on 1115 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
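
The 4457/1115 split in the log follows from `validation_split=0.2`. Keras documents that the validation data is taken from the last samples of the input, before shuffling. Assuming the cut index is computed as `int(n * (1 - validation_split))`, which is an assumption about the internals, the counts can be reproduced with the hypothetical `split_point` helper:

```python
def split_point(n_samples, validation_split):
    """Index at which the validation tail starts."""
    return int(n_samples * (1.0 - validation_split))

n = 5572                      # rows reported by email_data.info()
cut = split_point(n, 0.2)
print(cut, n - cut)           # 4457 1115
```

Because the held-out part is the tail of the file, the validation result depends on the row order of `spam.csv`.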


#### Conclusion

  Only a few layers and a few training epochs were used, yet the model reached a very good accuracy of 98%.
  The part that demanded the most effort was preprocessing the data.