# Predict New Data

- After training our model, we can now use it to predict on new data
- For that purpose, we will use the `predict` function we defined in the `prediction_funcs` module

For training our ML model, we can use Google Colab as it offers Free GPUs for training models

- Google Colab Notebook: https://colab.research.google.com/drive/1S3LjzvbDs1FK1UTYXRdMtHOsVGxvz9wM?usp=sharing
- Reference Blog Post: https://www.codingforentrepreneurs.com/blog/build-a-spam-classifier-with-keras

The matching code on the Google Colab Notebook is copied below as well

In [1]:
# Dependencies
import json
import os
import pickle
import sys
from pathlib import Path
# For importing a saved model
from tensorflow import keras
# For Tokenizing texts
from tensorflow.keras.preprocessing.text import Tokenizer

# Set Directory for this project to look for Custom Modules
sys.path.append(os.path.join("..", "custom_funcs"))

# Custom modules
from prediction_funcs import predict

In [2]:
# Datasets directories
PROJ_DIR = Path().resolve().parent
DATASETS_DIR = os.path.join(PROJ_DIR, "datasets")
EXPORTS_DIR = os.path.join(DATASETS_DIR, "exports")
METADATA_PKL_PATH = os.path.join(EXPORTS_DIR, "spam-metadata.pkl")
TOKENIZER_JSON_PATH = os.path.join(EXPORTS_DIR, "spam-tokenizer.json")
MODEL_EXPORT_PATH = os.path.join(EXPORTS_DIR,"spam-ml-model.h5")

In [3]:
# Load the pickle file exported from previous step
pickle_data = {}

with open(METADATA_PKL_PATH, 'rb') as f:
    pickle_data = pickle.load(f)
    
# Preview the data
display(pickle_data)

{'X_train': array([[  0,   0,   0, ..., 151,  15,  11],
        [  0,   0,   0, ...,  15,   5, 159],
        [  0,   0,   0, ...,  72, 104,  83],
        ...,
        [  0,   0,   0, ...,  62, 220, 160],
        [  0,   0,   0, ...,   0,   0,  47],
        [  0,   0,   0, ...,   7, 102,  19]]),
 'X_test': array([[  0,   0,   0, ...,   1, 152,  26],
        [  0,   0,   0, ...,  71,  41, 149],
        [  0,   0,   0, ...,  30,  34,   7],
        ...,
        [  0,   0,   0, ...,  11,   6,  13],
        [  0,   0,   0, ...,   0,  76,  10],
        [  0,   0,   0, ...,   8, 142, 185]]),
 'y_train': array([[1., 0.],
        [1., 0.],
        [0., 1.],
        ...,
        [1., 0.],
        [1., 0.],
        [0., 1.]], dtype=float32),
 'y_test': array([[0., 1.],
        [1., 0.],
        [0., 1.],
        ...,
        [1., 0.],
        [1., 0.],
        [1., 0.]], dtype=float32),
 'max_num_words': 280,
 'max_sequence_length': 300,
 'labels_to_int_mapping': {'ham': 0, 'spam': 1},
 'int_to_la

In [4]:
# Transform pickle_data back to useful data
MAX_NUM_WORDS = pickle_data["max_num_words"]
MAX_SEQUENCE_LENGTH = pickle_data["max_sequence_length"]

labels_to_int_mapping = pickle_data["labels_to_int_mapping"]
int_to_labels_mapping = pickle_data["int_to_labels_mapping"]

We need to import our pre-trained model to be used on the new prediction

In [5]:
# Import the pre-trained model
model = keras.models.load_model(MODEL_EXPORT_PATH)
display(model)

<keras.engine.sequential.Sequential at 0x2a190fd3820>

We also need to use our tokenizer

In [6]:
# Load our tokenizer
with open(TOKENIZER_JSON_PATH) as json_file:
    tokenizer_data = json.load(json_file)
    tokenizer = Tokenizer(tokenizer_data)
    
display(tokenizer)

<keras_preprocessing.text.Tokenizer at 0x2a192b3ad60>

Now, we can run a prediction on a new string to see if it is a Spam email or not

In [7]:
# New text to test
new_text = "Hello world! This is your daily dose of internet"

# Run prediction on it
predictions = predict(
    model,
    new_text, 
    int_to_labels_mapping = int_to_labels_mapping,
    tokenizer = tokenizer,
    max_sequence_length = MAX_SEQUENCE_LENGTH
)

# Show result
display(predictions)

[{'ham': 0.95276815}, {'spam': 0.047231834}]

For additional steps on putting the model into a cloud-based storage, check the details in Google Colab Notebook