<br>

# Using word embeddings for spam detection
  

In [0]:
import tensorflow as tf
import numpy as np
import argparse
import gensim.downloader as api
import os
import shutil
from sklearn.metrics import accuracy_score, confusion_matrix

<br>

## Getting the data 

Getting the data The data for our model is available publicly and comes from the SMS spam collection dataset from the UCI Machine Learning Repository. The following code will download the file and parse it to produce a list of SMS messages and their corresponding labels.

In [0]:
def download_and_read(url):
    local_file = url.split('/')[-1]
    p = tf.keras.utils.get_file(local_file, url, 
        extract=True, cache_dir=".")
    labels, texts = [], []
    local_file = os.path.join("datasets", "SMSSpamCollection")
    with open(local_file, "r") as fin:
        for line in fin:
            label, text = line.strip().split('\t')
            labels.append(1 if label == "spam" else 0)
            texts.append(text)
    return texts, labels

In [0]:
DATASET_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
texts, labels = download_and_read(DATASET_URL)

In [4]:
texts[1:3]

['Ok lar... Joking wif u oni...',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]

In [5]:
labels[1:3]

[0, 1]

<br>

## Tokenizing and padding text  


- We will use the Keras tokenizer to convert each SMS text into a sequence of words, and then create the vocabulary using the **fit_on_texts()** method on the tokenizer.
- We then convert the SMS messages to a sequence of integers using the **texts_to_sequences()**. Finally, since the network can only work with fixed length sequencesof integers, we call the **pad_sequences()** function to pad the shorter SMS messages with zeros.
- The longest SMS message in our dataset has 189 tokens (words). In many applications where there may be a few outlier sequences that are very long, we would restrict the length to a smaller number by setting the maxlen flag. In that case, sentences longer than maxlen tokens would be truncated, and sentences shorter than maxlen tokens would be padded.


In [6]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()

tokenizer.fit_on_texts(texts)
text_sequences = tokenizer.texts_to_sequences(texts) # convert SMS to integers
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences) # pad the shorter SMS with zeros.

num_records = len(text_sequences)
max_seqlen = len(text_sequences[0]) # The length of all the text sequences is 189

print(f"{num_records:d} sentences, max length: {max_seqlen:d}")

5574 sentences, max length: 189


In [7]:
text_sequences[2]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

<br>

<br>

### Note:
> We will also convert our labels to categorical or one-hot encoding format, because the loss function we would like to choose *(categorical cross-entropy)* expects to see the labels in that format.

In [0]:
NUM_CLASSES = 2 # There are only two categories, spam or ham

cat_labels = tf.keras.utils.to_categorical(labels, 
                                           num_classes=NUM_CLASSES)

In [9]:
cat_labels[:5]

array([[1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]], dtype=float32)

<br>

<br>

<br>

The tokenizer allows access to the vocabulary created through the word_index attribute, which is basically a dictionary of vocabulary words to their index positions in the vocabulary. We also build the reverse index that enables us to go from index position to the word itself. In addition, we create entries for the PAD character.

In [10]:
# vocabulary
word2idx = tokenizer.word_index
idx2word = {v:k for k, v in word2idx.items()}
word2idx["PAD"] = 0
idx2word[0] = "PAD"
vocab_size = len(word2idx)

print(f"vocab size: {vocab_size:d}")

vocab size: 9010


<br>

<br>

## Create dataset object
Finally, we create the dataset object that our network will work with. The dataset object allows us to set up some properties, such as the batch size, declaratively. <br>
Here, we build up a dataset from our padded sequence of integers and categorical labels, shuffle the data, and split it into training, validation, and test sets.<br> 
Finally, we set the batch size for each of the three datasets.


In [0]:
dataset = tf.data.Dataset.from_tensor_slices((text_sequences, cat_labels))

dataset = dataset.shuffle(10000)

test_size = num_records // 4
val_size = (num_records - test_size) // 10

test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)

BATCH_SIZE = 128

test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)


<br>

<br>

## Building the embedding matrix

In [0]:
def build_embedding_matrix(sequences, word2idx, embedding_dim, embedding_file):
    EMBEDDING_MODEL = "glove-wiki-gigaword-300"
    if os.path.exists(embedding_file):
        E = np.load(embedding_file)
    else:
        vocab_size = len(word2idx)
        E = np.zeros((vocab_size, embedding_dim))
        word_vectors = api.load(EMBEDDING_MODEL)
        for word, idx in word2idx.items():
            try:
                E[idx] = word_vectors.word_vec(word)
            except KeyError:   # word not in embedding
                pass
            # except IndexError: # UNKs are mapped to seq over VOCAB_SIZE as well as 1
            #     pass
        np.save(embedding_file, E)
    return E


#### Observation
- In order to keep our model size small, we want to only consider embeddings for words that exist in our vocabulary. This is done using the above code, which creates a smaller embedding matrix for each word in the vocabulary. Each row in the matrix corresponds to a word, and the row itself is the vector corresponding to the embedding for the word.


In [13]:
DATA_DIR = "data"

if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

EMBEDDING_DIM = 300
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, "E.npy")

E = build_embedding_matrix(text_sequences, 
                           word2idx, 
                           EMBEDDING_DIM, 
                           EMBEDDING_NUMPY_FILE)

print("Embedding matrix:", E.shape)

Embedding matrix: (9010, 300)


<br>

<br>

## Define spam filter

Depending on the run mode, that is, whether we will learn the embeddings from scratch, do transfer learning, or do fine-tuning, the Embedding layer in the network would be slightly different. 

- When the network starts with randomly initialized embedding weights (run_mode == "scratch"), and learns the weights during the training, we set the trainable parameter to True. 

- In the transfer learning case (run_mode == "vectorizer"), we set the weights from our embedding matrix E but set the trainable parameter to False, so it doesn't train.

- In the fine-tuning case (run_mode == "finetuning"), we set the embedding weights from our external matrix E (in the data folder), as well as set the layer to trainable.


In [0]:
class SpamClassifierModel(tf.keras.Model):
    def __init__(self, vocab_sz, embed_sz, input_length, num_filters, kernel_sz, output_sz, run_mode, embedding_weights, **kwargs):
      super(SpamClassifierModel, self).__init__(**kwargs)
      if run_mode == "scratch":
        self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                                                   embed_sz,
                                                   input_length=input_length,
                                                   trainable=True)
      elif run_mode == "vectorizer":
        self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                                                   embed_sz,
                                                   input_length=input_length,
                                                   weights=[embedding_weights],
                                                   trainable=False)
      else:
        self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                                                   embed_sz,
                                                   input_length=input_length,
                                                   weights=[embedding_weights],
                                                   trainable=True)
      self.dropout = tf.keras.layers.SpatialDropout1D(0.2)

      self.conv = tf.keras.layers.Conv1D(filters=num_filters,
                                         kernel_size=kernel_sz,
                                         activation="relu")
      
      self.pool = tf.keras.layers.GlobalMaxPooling1D()
      
      self.dense = tf.keras.layers.Dense(output_sz, 
                                         activation="softmax")

    def call(self, x):
        x = self.embedding(x)
        x = self.dropout(x)
        x = self.conv(x)
        x = self.pool(x)
        x = self.dense(x)
        
        return x

In [15]:
conv_num_filters = 256
conv_kernel_size = 3

# (run_mode == "scratch")    => Learns the weights during the training.
# (run_mode == "vectorizer") => Transfer learning
# (run_mode == "finetuning") => Fine tuning 
run_mode='vectorizer'



model = SpamClassifierModel(vocab_size, 
                            EMBEDDING_DIM, 
                            max_seqlen, # 189 words
                            conv_num_filters, 
                            conv_kernel_size, 
                            NUM_CLASSES,
                            run_mode, E)

model.build(input_shape=(None, max_seqlen))

model.summary()

Model: "spam_classifier_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        multiple                  2703000   
_________________________________________________________________
spatial_dropout1d (SpatialDr multiple                  0         
_________________________________________________________________
conv1d (Conv1D)              multiple                  230656    
_________________________________________________________________
global_max_pooling1d (Global multiple                  0         
_________________________________________________________________
dense (Dense)                multiple                  514       
Total params: 2,934,170
Trainable params: 231,170
Non-trainable params: 2,703,000
_________________________________________________________________


#### Compile the model

In [0]:
model.compile(optimizer="adam", loss="categorical_crossentropy",metrics=["accuracy"])

#### Train the model

In [17]:
CLASS_WEIGHTS = { 0: 1, 1: 8 }
NUM_EPOCHS = 3

model.fit(train_dataset, 
          epochs=NUM_EPOCHS, 
          validation_data=val_dataset, 
          class_weight=CLASS_WEIGHTS)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fe2241bb390>

<br>

<br>

## Evaluate the model

In [18]:
labels, predictions = [], []
for Xtest, Ytest in test_dataset:
    Ytest_ = model.predict_on_batch(Xtest)
    ytest = np.argmax(Ytest, axis=1)
    ytest_ = np.argmax(Ytest_, axis=1)
    labels.extend(ytest.tolist())
    predictions.extend(ytest.tolist())

print(f"test accuracy: {accuracy_score(labels, predictions):.3f}")
print("confusion matrix")
print(confusion_matrix(labels, predictions))


test accuracy: 1.000
confusion matrix
[[1116    0]
 [   0  164]]


---