# SPAM CLASSIFICATION 

In this notebook we follow a step-by-step procedure to build a spam classifier.
The classifier uses word embeddings trained on the spam classification dataset with the fastText library.
These embeddings are then fed to a bidirectional LSTM layer to train the model.

In [None]:
#importing relevant libraries
import fasttext
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, Dropout, Embedding, LSTM, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from tqdm import tqdm

In [None]:
#set the embedding size and the maximum number of words per sequence (sentence)
embedding_size = 50
max_words_len = 50

The dataframe consists of two columns, v1 and v2, plus three other empty columns. The v2 column contains the mail body, while the v1 column contains the mail class (ham or spam).

In [None]:
#read the data
df = pd.read_csv('C:/Users/User/Downloads/spam.csv', encoding='latin-1').loc[:,['v1','v2']]
df.head()

In [None]:
#next we clean the data to be ready for training embeddings
def preprocess(df_):
    df_cleaned = df_.copy()
    #remove nan values
    df_cleaned.dropna(inplace = True)
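    #replace every run of digits by the word number (regex = True is required, recent pandas versions default to literal matching)
    df_cleaned['v2'] = df_cleaned['v2'].str.replace(r'\d+', ' number ', regex = True)
    #replace every character that is not a letter (punctuation, symbols) with a space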
    df_cleaned['v2'] = df_cleaned['v2'].str.replace(r'[^a-zA-Z]', ' ', regex = True)
    #remove single characters
    df_cleaned['v2'] = df_cleaned['v2'].str.replace(r'\s+[a-zA-Z]\s+', ' ', regex = True)
    #collapse extra spaces and convert to lowercase
    df_cleaned['v2'] = df_cleaned['v2'].str.replace(r'\s+', ' ', regex = True).map(lambda x:x.lower())
    return df_cleaned

df = preprocess(df)
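
To see what the cleaning does to one message, the same regular expressions can be applied to a plain string with the standard `re` module. This is only a sketch: `clean_text` and the sample message are hypothetical, and it assumes that each run of digits should become a single `number` token.

```python
import re

def clean_text(text):
    """Apply the notebook's cleaning steps to a single string."""
    # Assumption: each run of digits becomes one "number" token
    text = re.sub(r'\d+', ' number ', text)
    # Replace everything that is not an ASCII letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Drop single letters that are surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Collapse runs of whitespace and convert to lowercase
    return re.sub(r'\s+', ' ', text).lower()

sample = "WINNER!! Call 0800 now, u won a prize"
print(clean_text(sample))  # winner call number now won prize
```

Note that the single letters "u" and "a" are dropped, so very short tokens do not reach the embedding model.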

With the text cleaned, we write the entire dataset to a txt file to form a corpus. This corpus is fed to a skipgram
model to train the embeddings. The main advantage of training embeddings with fastText is that it learns vectors for each whole word
and for its subwords (character n-grams) as well. This reduces the out-of-vocabulary problem: a word that was not seen during training
is split into subwords that are hopefully present in the fastText model, and its embedding is then the average of its
subword embeddings.
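
The subword idea can be sketched in plain Python. The helper names `char_ngrams` and `average_vectors` and the toy vectors below are hypothetical; the sketch only assumes that a word is wrapped in `<` and `>` boundary markers before its character n-grams are taken, and it ignores implementation details of the real library such as hashing n-grams into buckets.

```python
def char_ngrams(word, minn=2, maxn=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

def average_vectors(vectors):
    """Element-wise mean of equally sized vectors (lists of floats)."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

print(char_ngrams("spam", minn=3, maxn=3))  # ['<sp', 'spa', 'pam', 'am>']

# Hypothetical 2-dimensional subword vectors, for illustration only
toy_subword_vectors = {
    "<sp": [1.0, 0.0],
    "spa": [0.0, 1.0],
    "pam": [1.0, 1.0],
    "am>": [0.0, 0.0],
}
word_vector = average_vectors(
    [toy_subword_vectors[g] for g in char_ngrams("spam", minn=3, maxn=3)]
)
print(word_vector)  # [0.5, 0.5]
```

Because an unseen word still shares many n-grams with known words, it can receive a vector even though it never appeared in the corpus.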

In [None]:
#create corpus for training embeddings
with open(r'C:\Users\User\Downloads\ftw\spamcorpus.txt', 'w', encoding="utf-8") as txtfile:
    #iterate over the values directly so that gaps left in the index by dropna cannot raise a KeyError
    for line in df['v2']:
        txtfile.write(line)
        txtfile.write('\n')

In [None]:
#create and train a skipgram model using your own custom configurations
model = fasttext.train_unsupervised('C:/Users/User/Downloads/ftw/spamcorpus.txt',
                                    minCount = 5, 
                                    model='skipgram',
                                    minn = 2,
                                    maxn = 5,
                                    dim = embedding_size,
                                    lr = 0.1,
                                    epoch = 10)

Next, we need to create a txt file that contains every unique word in the dataset together with its embedding. This is done by collecting all unique words in the dataset and then querying the trained skipgram model for the embedding of each word.
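
The layout of such a file can be illustrated on a toy vocabulary: a header line with the word count and the vector dimension, followed by one `word v1 v2 ...` line per word. The sketch below round-trips hypothetical values through an in-memory buffer; `write_embeddings`, `read_embeddings` and the toy vectors are illustrative names, not part of any library.

```python
import io

def write_embeddings(handle, vectors):
    """Write a header line, then one 'word v1 v2 ...' line per word."""
    dim = len(next(iter(vectors.values())))
    handle.write(f"{len(vectors)} {dim}\n")
    for word, vector in vectors.items():
        handle.write(word + " " + " ".join(str(v) for v in vector) + "\n")

def read_embeddings(handle):
    """Parse the layout produced by write_embeddings back into a dict."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Sanity check against the header
    assert len(vectors) == count
    assert all(len(v) == dim for v in vectors.values())
    return vectors

# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {"free": [0.25, -0.5], "prize": [1.0, 0.75]}

buffer = io.StringIO()
write_embeddings(buffer, toy_vectors)
print(buffer.getvalue())

buffer.seek(0)
restored = read_embeddings(buffer)
```

Keeping the header makes it easy to verify that the file was written completely before it is used to build the embedding matrix.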

In [None]:
#create a list of all unique words in the dataset
with open(r'C:\Users\User\Downloads\ftw\spamcorpus.txt', 'r', encoding="utf-8") as txtfile:
    corpus_sentences = txtfile.readlines()
    corpus_words = []
    for sent in corpus_sentences:
        tokenized_sent = sent.split()
        for word_ in tokenized_sent:
            corpus_words.append(word_)
            
    corpus_unique_words = list(set(corpus_words))

In [None]:
#create embedding txt file
with open(r'C:\Users\User\Downloads\ftw\fasttext_embeddings.txt', 'w', encoding="utf-8") as txtfile:
    txtfile.write(str(len(corpus_unique_words)) + " " + str(model.get_dimension()))
    txtfile.write('\n')
    for word in corpus_unique_words:
        embedding = model.get_word_vector(word)
        vstr = ""
        for vi in embedding:
            vstr += " " + str(vi)
        txtfile.write(word + vstr)
        txtfile.write('\n')

All the previous steps were done to create the word embeddings txt file.
Now that this file is ready, the data preparation steps are as follows:

Step 1: Create an embedding dictionary (keys are unique words, values are embedding arrays).

Step 2: Create a Keras tokenizer and fit it on the cleaned text.

The fitted tokenizer now holds a dictionary that maps every unique word to an integer index. The same index identifies the word's row in the embedding matrix of the model's embedding layer, which would otherwise be randomly initialized.

Step 3: Create an embedding matrix from the embedding dictionary and use it instead of the randomly initialized one. Each word's embedding is placed in the row given by the word's index in the tokenizer.

The embedding matrix is now ready and will be fed directly to the model.

Step 4: The tokenizer converts each sequence of words into a sequence of indices, which are valid for both the tokenizer and the embedding matrix.

Step 5: Sequences with more words than the maximum length (50 in this notebook) are truncated, whereas sequences with fewer words are padded to the maximum length.
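
Steps 2 to 5 can be sketched on a toy corpus in plain Python before running them with Keras. This is only an approximation of what the Keras tokenizer and `pad_sequences` do: the helper names, the toy texts and the toy embeddings are hypothetical, and index 0 is assumed to be reserved for padding.

```python
from collections import Counter

OOV = "UNK"

def build_word_index(texts):
    """Index 1 is reserved for the OOV token; other words follow by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([OOV] + ordered, start=1)}

def to_sequence(text, word_index):
    """Map each word to its index, falling back to the OOV index."""
    return [word_index.get(word, word_index[OOV]) for word in text.split()]

def pad_or_truncate(seq, maxlen, pad_value=0):
    """Post-truncate long sequences and post-pad short ones."""
    return seq[:maxlen] + [pad_value] * max(0, maxlen - len(seq))

texts = ["free prize now", "call now"]
word_index = build_word_index(texts)
print(word_index)  # {'UNK': 1, 'now': 2, 'free': 3, 'prize': 4, 'call': 5}

# Step 3: one row per index; row 0 stays zero for padding
toy_embeddings = {"now": [0.1, 0.2], "free": [0.3, 0.4]}  # hypothetical values
unk_vector = [9.0, 9.0]  # hypothetical fallback vector
matrix = [[0.0, 0.0] for _ in range(len(word_index) + 1)]
for word, idx in word_index.items():
    matrix[idx] = toy_embeddings.get(word, unk_vector)

# Steps 4 and 5: words to indices, then fixed-length sequences
sequence = to_sequence("win now", word_index)
print(sequence)                            # [1, 2]
print(pad_or_truncate(sequence, 4))        # [1, 2, 0, 0]
print(pad_or_truncate([3, 4, 2], 2))       # [3, 4]
```

The word "win" never appeared in the toy corpus, so it maps to the `UNK` index and therefore to the fallback row of the matrix.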

In [None]:
#step 1: Create an embedding dictionary 
embedding_dictionary = dict()

with open(r'C:\Users\User\Downloads\ftw\fasttext_embeddings.txt', 'r', encoding="utf-8") as txtfile:
    embeddings = txtfile.readlines()[1:]
    for line in embeddings:
        x = line.split()
        word = x[0]
        embeds = np.asarray(x[1:]).astype(np.float32)
        embedding_dictionary[word] = embeds
    #the unknown token is represented by the mean of all word embeddings
    embedding_dictionary['UNK'] = np.mean(list(embedding_dictionary.values()), axis = 0)

In [None]:
#Step 2: Create a keras tokenizer and fit it on the cleaned text.
num_words = len(corpus_unique_words)
#word_index holds the unknown token at index 1 followed by every unique word (indices 2 to num_words+1),
#and the tokenizer only keeps indices strictly below its num_words argument, hence the +2
tokenizer = Tokenizer(num_words = num_words+2, oov_token = 'UNK')
tokenizer.fit_on_texts(df['v2'])

#Note1: word indices in the tokenizer are 1-based (index 0 is reserved and never assigned to a word)
#Note2: the unknown token takes index 1, so real words start at index 2
#Note3: we don't have to pass the total number of unique words to the tokenizer;
#with a smaller number, only the most frequent words are kept and the rest map to the unknown token,
#but I pass the total number of words as every single word now hopefully has a meaningful embedding thanks to fasttext

In [None]:
#Step 3: We will create embedding matrix from the created embedding dictionary
vocab_size = len(tokenizer.word_index)+1
embeddings_matrix = np.zeros(shape = (vocab_size , embedding_size))

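for word, index in tqdm(tokenizer.word_index.items()):
    #fall back to the UNK vector if a word is missing from the embedding dictionary
    embeddings_matrix[index] = embedding_dictionary.get(word, embedding_dictionary['UNK'])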

In [None]:
#Step 4: The tokenizer converts each sequence of words to a sequence of their indices in both tokenizer and embeddings
X = tokenizer.texts_to_sequences(df['v2'])
#step 5: padding short sequences and truncating long sequences
X = pad_sequences(X, padding = 'post', maxlen = max_words_len, truncating='post')
#encoding labels as integers (1 = spam, 0 = ham)
Y = pd.get_dummies(df['v1'])['spam'].values.astype('int32')

In [None]:
#split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.1, stratify = Y)

In [None]:
#create a bidirectional LSTM model
def create_model():
    model = Sequential()
    embedding_layer = Embedding(vocab_size, embedding_size, 
                                weights=[embeddings_matrix], 
                                input_length=max_words_len , 
                                trainable=True)
    
    model.add(embedding_layer)
    model.add(Bidirectional(LSTM(64)))
    model.add(Dense(1, activation='sigmoid'))
    return model

model = create_model()

#with metrics=['accuracy'], TensorFlow 2.x logs the validation metric as 'val_accuracy'
early_stopping = EarlyStopping(monitor= 'val_accuracy', 
                               mode = 'max',
                               patience=30, 
                               verbose=1)

model_checkpoint = ModelCheckpoint('SPAM_CLASSIFIER',
                                   monitor = 'val_accuracy', 
                                   mode = 'max', 
                                   save_best_only=True, 
                                   verbose=1)


opt = Adam(learning_rate = 0.01)

model.compile(opt, loss = 'binary_crossentropy', metrics=['accuracy'])

In [None]:
#train the model
history = model.fit(x_train, 
                    y_train, 
                    validation_data=(x_test, y_test),
                    batch_size=32,
                    epochs=200,
                    callbacks = [early_stopping, model_checkpoint])