# Training the ML Model

To train our ML model, we will use Google Colab because it offers free GPUs

- Google Colab Notebook: https://colab.research.google.com/drive/1S3LjzvbDs1FK1UTYXRdMtHOsVGxvz9wM?usp=sharing
- Reference Blog Post: https://www.codingforentrepreneurs.com/blog/build-a-spam-classifier-with-keras

The matching code from the Google Colab notebook is also copied below

## Load the Previously Saved Data

In [1]:
# Dependencies
import os
import pickle
import numpy as np
import pandas as pd
import json
from pathlib import Path

# For Tokenizing texts
from tensorflow.keras.preprocessing.text import Tokenizer
# For padding our token sequences to a uniform length
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keras Models
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.models import Model, Sequential

In [9]:
# Datasets directories
PROJ_DIR = Path().resolve().parent
DATASETS_DIR = os.path.join(PROJ_DIR, "datasets")
EXPORTS_DIR = os.path.join(DATASETS_DIR, "exports")
METADATA_PKL_PATH = os.path.join(EXPORTS_DIR, "spam-metadata.pkl")
TOKENIZER_JSON_PATH = os.path.join(EXPORTS_DIR, "spam-tokenizer.json")
MODEL_EXPORT_PATH = os.path.join(EXPORTS_DIR, "spam-ml-model.h5")

**Warning About `pickle`**

- Loading a `pickle` file can run arbitrary code, so a file from an unknown source may be malicious
- If someone gives you a pickle file, be wary of where it came from, or you might infect your system
- Only load pickle files from trusted sources
  - It is fine if you created the file yourself, since you know where the data came from
  - Do not load a pickle file from someone you do not trust
  - Another option is to ask for the data as `csv` files instead

In this case, we generated this pickle file earlier, so it is safe to use here
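
To make the warning about `pickle` concrete, here is a minimal sketch of how loading a pickle can run code. The `Payload` class is hypothetical and deliberately harmless (it only calls `print`), but the same mechanism would let a malicious file name any importable function.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs a function when loaded."""

    def __reduce__(self):
        # pickle stores "call print(...) with this argument" instead of the
        # object's state; an attacker could name any importable callable here.
        return (print, ("This line ran inside pickle.loads()",))


# Serialize the object, as an untrusted sender might.
untrusted_bytes = pickle.dumps(Payload())

# Loading the bytes calls print(...) immediately; no Payload is rebuilt.
result = pickle.loads(untrusted_bytes)
```

The value returned by `pickle.loads` is simply whatever the stored call returned, which is why inspecting the loaded object afterwards is too late to stay safe.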

In [3]:
# Load the pickle file exported from previous step
pickle_data = {}

with open(METADATA_PKL_PATH, 'rb') as f:
    pickle_data = pickle.load(f)
    
# Preview the data
display(pickle_data)

{'X_train': array([[  0,   0,   0, ..., 151,  15,  11],
        [  0,   0,   0, ...,  15,   5, 159],
        [  0,   0,   0, ...,  72, 104,  83],
        ...,
        [  0,   0,   0, ...,  62, 220, 160],
        [  0,   0,   0, ...,   0,   0,  47],
        [  0,   0,   0, ...,   7, 102,  19]]),
 'X_test': array([[  0,   0,   0, ...,   1, 152,  26],
        [  0,   0,   0, ...,  71,  41, 149],
        [  0,   0,   0, ...,  30,  34,   7],
        ...,
        [  0,   0,   0, ...,  11,   6,  13],
        [  0,   0,   0, ...,   0,  76,  10],
        [  0,   0,   0, ...,   8, 142, 185]]),
 'y_train': array([[1., 0.],
        [1., 0.],
        [0., 1.],
        ...,
        [1., 0.],
        [1., 0.],
        [0., 1.]], dtype=float32),
 'y_test': array([[0., 1.],
        [1., 0.],
        [0., 1.],
        ...,
        [1., 0.],
        [1., 0.],
        [1., 0.]], dtype=float32),
 'max_num_words': 280,
 'max_sequence_length': 300,
 'labels_to_int_mapping': {'ham': 0, 'spam': 1},
 'int_to_la

We can transform the extracted data back into what we need

In [6]:
# Unpack pickle_data back into the arrays and metadata we need
X_train = pickle_data["X_train"]
X_test = pickle_data["X_test"]
y_train = pickle_data["y_train"]
y_test = pickle_data["y_test"]

MAX_NUM_WORDS = pickle_data["max_num_words"]
MAX_SEQUENCE_LENGTH = pickle_data["max_sequence_length"]

labels_to_int_mapping = pickle_data["labels_to_int_mapping"]
int_to_labels_mapping = pickle_data["int_to_labels_mapping"]
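
The label arrays are one-hot encoded, so each row can be turned back into a readable label. The sketch below uses plain Python lists and rebuilds the inverse mapping by flipping `labels_to_int_mapping`; it assumes the column order of the one-hot rows follows that integer mapping, and `decode_one_hot` is a hypothetical helper, not part of the notebook.

```python
# Mapping as shown in the previewed pickle output.
labels_to_int = {"ham": 0, "spam": 1}
# Assumption: the inverse mapping is simply the flipped dictionary.
int_to_labels = {index: label for label, index in labels_to_int.items()}


def decode_one_hot(row):
    """Return the label whose position holds the largest value in the row."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return int_to_labels[best_index]


# The first three rows of y_train from the preview.
sample_rows = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
decoded = [decode_one_hot(row) for row in sample_rows]
print(decoded)  # ['ham', 'ham', 'spam']
```

The same helper would also work on a softmax output row, since it picks the position with the highest value.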

In [5]:
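# Also re-load the tokenizer (assumes it was exported with tokenizer.to_json())
from tensorflow.keras.preprocessing.text import tokenizer_from_json

with open(TOKENIZER_JSON_PATH) as json_file:
    tokenizer_data = json.load(json_file)

# tokenizer_from_json expects a JSON string, so re-serialize a parsed dict
if not isinstance(tokenizer_data, str):
    tokenizer_data = json.dumps(tokenizer_data)
tokenizer = tokenizer_from_json(tokenizer_data)

tokenizer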

<keras_preprocessing.text.Tokenizer at 0x1dc83e22910>

## Create our LSTM Model

This is a Keras model for classification problems

- Most of this is based on the Keras documentation
- The loss is `categorical_crossentropy`, a standard choice for classification with one-hot encoded labels
  - Works for 2 labels or more
- Key things:
  - `MAX_NUM_WORDS` - The vocabulary size passed to the `Embedding` layer
  - `input_length` - Every input sequence must have the same length (uniform matrix)
  - `LSTM` - A recurrent layer commonly used for sequence data such as text
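
The uniform-length requirement can be illustrated without Keras. The leading zeros in the previewed `X_train` suggest the sequences were padded on the left, which is the default behaviour of `pad_sequences`. The function below is a plain-Python stand-in written for illustration (`pad_pre` is a hypothetical name), not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                # keep the last maxlen tokens
        padding = [value] * (maxlen - len(seq))  # zeros go in front
        padded.append(padding + seq)
    return padded


rows = pad_pre([[7, 102, 19], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(rows)  # [[0, 7, 102, 19], [3, 4, 5, 6]]
```

Every row now has the same length, so the rows can be stacked into one matrix and fed to the `Embedding` layer.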

In [7]:
embed_dim = 128
lstm_out = 196

# Create the Model
model = Sequential()

# Add layers to the model
model.add(Embedding(
    MAX_NUM_WORDS, 
    embed_dim, 
    input_length = X_train.shape[1]
))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(
    lstm_out, 
    dropout = 0.3, 
    recurrent_dropout = 0.3
))
model.add(Dense(
    2, # We only have 2 labels: Spam or Ham
    activation = "softmax"
))

# Compile the model
model.compile(
    loss = "categorical_crossentropy", 
    optimizer = "adam", 
    metrics = ["accuracy"]
)

# Check the final result
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 128)          35840     
                                                                 
 spatial_dropout1d (SpatialD  (None, 300, 128)         0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 196)               254800    
                                                                 
 dense (Dense)               (None, 2)                 394       
                                                                 
Total params: 291,034
Trainable params: 291,034
Non-trainable params: 0
_________________________________________________________________
None
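
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: an embedding has one vector per vocabulary entry, an LSTM has four gates that each own input weights, recurrent weights and a bias, and a dense layer has one weight per input-output pair plus a bias per output. The variable names are chosen here for illustration.

```python
vocab_size = 280   # max_num_words from the metadata
embed_dim = 128
lstm_out = 196
num_labels = 2

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (embed_dim + lstm_out + 1) * lstm_out
# One weight per (input, output) pair plus one bias per output.
dense_params = lstm_out * num_labels + num_labels

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 35840 254800 394 291034
```

These values match the `Param #` column and the total reported by `model.summary()`.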


After creating the model, we can start fitting it to our data

**Note**: Keras runs validation during training

- Here the test set is passed as `validation_data`, so Keras evaluates on it after every epoch
- Often, we would also want a separate held-out set for confirming the final model
- For demo purposes, we do not have one here
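
If a separate held-out set were wanted, one option is to split the sample indices three ways before training. This is only a sketch under assumptions: the function name, the 10% fractions and the seed are illustrative choices, not something the notebook does.

```python
import random


def three_way_split(num_samples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle sample indices and split them into train/validation/test lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    num_test = int(num_samples * test_fraction)
    num_val = int(num_samples * val_fraction)
    test_idx = indices[:num_test]
    val_idx = indices[num_test:num_test + num_val]
    train_idx = indices[num_test + num_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

The validation indices would feed `validation_data` during fitting, while the test indices stay untouched until the final check.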

In [8]:
# Fit the model on our data
# WARNING: This can take a while and use a lot of computing power
batch_size = 32
epochs = 5

model.fit(
    X_train, 
    y_train, 
    validation_data = (X_test, y_test), # Keras evaluates on this data after each epoch
    batch_size = batch_size, 
    verbose = 1, 
    epochs = epochs
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1dcada33df0>

Finally, we export the trained model to a file

In [10]:
# Save the model to an HDF5 (.h5) file
model.save(str(MODEL_EXPORT_PATH))