<h1>SMS Spam Classifier</h1>
<br />
This notebook shows how to implement a basic spam classifier for SMS messages using Apache MXNet as the deep learning framework.
The idea is to use the SMS Spam Collection dataset available at <a href="https://archive.ics.uci.edu/ml/datasets/sms+spam+collection">https://archive.ics.uci.edu/ml/datasets/sms+spam+collection</a> to train and deploy a neural network model, leveraging the built-in open-source container for Apache MXNet available in Amazon SageMaker.

Let's get started by setting some configuration variables and retrieving the current execution role with the Amazon SageMaker high-level SDK for Python.

In [1]:
from sagemaker import get_execution_role

bucket_name = 'smlambda-workshop-gauravagrawal'

role = get_execution_role()
bucket_key_prefix = 'sms-spam-classifier'
vocabulary_length = 9013

print(role)

arn:aws:iam::888913162450:role/service-role/AmazonSageMaker-ExecutionRole-20200510T140807


We now download the spam collection dataset, unzip it, and print its first 10 lines.

In [2]:
!mkdir -p dataset
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip -o dataset/smsspamcollection.zip
!unzip -o dataset/smsspamcollection.zip -d dataset
!head -10 dataset/SMSSpamCollection

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  198k  100  198k    0     0   244k      0 --:--:-- --:--:-- --:--:--  243k
Archive:  dataset/smsspamcollection.zip
  inflating: dataset/SMSSpamCollection  
  inflating: dataset/readme          
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat
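
Each line of the file holds a label and a message separated by a tab. As a quick sanity check before preparing the data, the following sketch counts the labels in a few lines copied from the output above. The `count_labels` helper and the in-memory sample are illustrative only and are not part of the original notebook.

```python
from collections import Counter


def count_labels(lines):
    """Count the labels in tab-separated 'label<TAB>message' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Everything before the first tab is the label (ham or spam).
        label, _, _message = line.partition("\t")
        counts[label] += 1
    return counts


# A few lines copied from the output above, used as an in-memory sample.
sample = [
    "ham\tOk lar... Joking wif u oni...",
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
    "ham\tNah I don't think he goes to usf, he lives around here though",
]

print(count_labels(sample))
```

To run the same check on the full dataset, the open file object for `dataset/SMSSpamCollection` can be passed in place of `sample`.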

We now load the dataset into a Pandas dataframe and perform some data preparation.
More specifically, we have to:
<ul>
    <li>replace the target column values (ham/spam) with numeric values (0/1)</li>
    <li>tokenize the SMS messages and encode them based on word counts</li>
    <li>split the data into training and validation sets</li>
    <li>upload the resulting files to an S3 bucket for training</li>
</ul>
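
The two helpers used in the next cell come from `sms_spam_classifier_utilities`, which is not shown in this notebook. As a rough sketch of what such helpers might do, the code below assumes a hashing-based word-to-index encoding followed by a fixed-length vector with a 1 at every index that occurs in the message. The `_sketch` function names, the whitespace tokenizer, and the CRC32 hash are assumptions made for illustration, not the module's actual implementation.

```python
import zlib

import numpy as np


def one_hot_encode_sketch(messages, vocabulary_length):
    """Map each word of each message to an integer index (hashing trick).

    Assumption: lower-cased whitespace tokens hashed with CRC32 into the
    range 1..vocabulary_length-1. Different words may collide.
    """
    encoded = []
    for message in messages:
        tokens = message.lower().split()
        indices = [
            zlib.crc32(token.encode("utf-8")) % (vocabulary_length - 1) + 1
            for token in tokens
        ]
        encoded.append(indices)
    return encoded


def vectorize_sequences_sketch(sequences, vocabulary_length):
    """Turn lists of word indices into fixed-length 0/1 vectors."""
    results = np.zeros((len(sequences), vocabulary_length))
    for row, indices in enumerate(sequences):
        results[row, indices] = 1.0  # mark every index present in the message
    return results


encoded = one_hot_encode_sketch(["free entry free", "ok lar"], 9013)
vectors = vectorize_sequences_sketch(encoded, 9013)
print(vectors.shape)
```

Each message becomes one row whose width equals the vocabulary length, which is why the same `vocabulary_length` has to be used at training and at inference time.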

In [3]:
import pandas as pd
import numpy as np
import pickle
from sms_spam_classifier_utilities import one_hot_encode
from sms_spam_classifier_utilities import vectorize_sequences

df = pd.read_csv('dataset/SMSSpamCollection', sep='\t', header=None)
df[df.columns[0]] = df[df.columns[0]].map({'ham': 0, 'spam': 1})

targets = df[df.columns[0]].values
messages = df[df.columns[1]].values

# one hot encoding for each SMS message
one_hot_data = one_hot_encode(messages, vocabulary_length)
encoded_messages = vectorize_sequences(one_hot_data, vocabulary_length)

df2 = pd.DataFrame(encoded_messages)
df2.insert(0, 'spam', targets)

# Split into training and validation sets (80%/20% split)
split_index = int(np.ceil(df.shape[0] * 0.8))
train_set = df2[:split_index]
val_set = df2[split_index:]

train_set.to_csv('dataset/sms_train_set.gz', header=False, index=False, compression='gzip')
val_set.to_csv('dataset/sms_val_set.gz', header=False, index=False, compression='gzip')
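
The split above is positional: the first 80% of the rows, in file order, form the training set and the remaining rows form the validation set. The sketch below applies the same arithmetic to a small made-up frame so the resulting sizes are easy to check. The toy data and the `toy`, `train_toy` and `val_toy` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with 11 rows standing in for the encoded messages.
toy = pd.DataFrame({
    "spam": [0, 1] * 5 + [0],
    "feature": list(range(11)),
})

# Same rule as in the cell above: round the 80% boundary up to a whole row.
split_index = int(np.ceil(toy.shape[0] * 0.8))
train_toy = toy[:split_index]
val_toy = toy[split_index:]

print(split_index, len(train_toy), len(val_toy))
```

Because no shuffling takes place, the class balance of each part depends on the order of the rows in the file.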

We have to upload the two files to Amazon S3 so that the Amazon SageMaker training cluster can access them.

In [4]:
import boto3

s3 = boto3.resource('s3')
target_bucket = s3.Bucket(bucket_name)

with open('dataset/sms_train_set.gz', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/train/sms_train_set.gz'.format(bucket_key_prefix))
    
with open('dataset/sms_val_set.gz', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/val/sms_val_set.gz'.format(bucket_key_prefix))
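
The object keys used for the upload determine the channel locations that the training job reads from. A small pair of helpers like the following keeps the two consistent, assuming the same `<prefix>/<channel>/<file>` layout used above. The `channel_key` and `channel_uri` functions and the example bucket name are hypothetical and not part of the original notebook.

```python
def channel_key(prefix, channel, filename):
    """Object key for one dataset file inside a channel folder."""
    return "{0}/{1}/{2}".format(prefix, channel, filename)


def channel_uri(bucket, prefix, channel):
    """S3 URI of the folder that holds all files of one channel."""
    return "s3://{0}/{1}/{2}/".format(bucket, prefix, channel)


# 'my-example-bucket' is a placeholder; substitute your own bucket name.
example_prefix = "sms-spam-classifier"
print(channel_key(example_prefix, "train", "sms_train_set.gz"))
print(channel_uri("my-example-bucket", example_prefix, "val"))
```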

<h2>Training the model with MXNet</h2>

We are now ready to run the training using the Amazon SageMaker MXNet built-in container. First, let's have a look at the script defining our neural network.

In [5]:
!cat 'sms_spam_classifier_mxnet_script.py'

from __future__ import print_function

import logging
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn
import numpy as np
import json
import time

import pip

try:
    from pip import main as pipmain
except ImportError:
    from pip._internal import main as pipmain

pipmain(['install', 'pandas'])
import pandas

#logging.basicConfig(level=logging.DEBUG)

# ------------------------------------------------------------ #
# Training methods                                             #
# ------------------------------------------------------------ #


def train(hyperparameters, input_data_config, channel_input_dirs, output_data_dir,
          num_gpus, num_cpus, hosts, current_host, **kwargs):
    # SageMaker passes num_cpus, num_gpus and other args we can use to tailor training to
    # the current container environment, but here we just use simple cpu context.
    ctx = mx.cpu()

    # retrieve the hyperparameters and apply some defaults

We are now ready to run the training using the MXNet estimator object of the SageMaker Python SDK.

In [8]:
from sagemaker.mxnet import MXNet

output_path = 's3://{0}/{1}/output'.format(bucket_name, bucket_key_prefix)
code_location = 's3://{0}/{1}/code'.format(bucket_name, bucket_key_prefix)

m = MXNet('sms_spam_classifier_mxnet_script.py',
          role=role,
          train_instance_count=1,
          train_instance_type='ml.c5.2xlarge',
          output_path=output_path,
          base_job_name='sms-spam-classifier-mxnet',
          framework_version='1.2',
          code_location = code_location,
          hyperparameters={'batch_size': 100,
                         'epochs': 20,
                         'learning_rate': 0.01})

inputs = {'train': 's3://{0}/{1}/train/'.format(bucket_name, bucket_key_prefix),
 'val': 's3://{0}/{1}/val/'.format(bucket_name, bucket_key_prefix)}

m.fit(inputs)

1.6.0 is the latest version of mxnet that supports Python 2. Newer versions of mxnet will only be available for Python 3.Please set the argument "py_version='py3'" to use the Python 3 mxnet image.


2020-05-10 23:12:12 Starting - Starting the training job...
2020-05-10 23:12:13 Starting - Launching requested ML instances......
2020-05-10 23:13:16 Starting - Preparing the instances for training...
2020-05-10 23:14:11 Downloading - Downloading input data
2020-05-10 23:14:11 Training - Training image download completed. Training in progress..
2020-05-10 23:14:11,225 INFO - root - running container entrypoint
2020-05-10 23:14:11,225 INFO - root - starting train task
2020-05-10 23:14:11,230 INFO - container_support.training - Training starting
2020-05-10 23:14:18,152 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'available_gpus': 0, 'channels': {u'train': {u'TrainingInputMode': u'File', u'RecordWrapperType': u'None', u'S3DistributionType': u'FullyReplicated'}, u'val': {u'TrainingInputMode': u'File', u'RecordWrapperType': u'None', u'S3DistributionType': u'FullyReplicated'}}, '_ps_verbose': 0, 'resource_config

<h3><span style="color:red">THE FOLLOWING STEPS ARE NOT MANDATORY IF YOU PLAN TO DEPLOY TO AWS LAMBDA AND ARE INCLUDED IN THIS NOTEBOOK FOR EDUCATIONAL PURPOSES.</span></h3>

<h2>Deploying the model</h2>

Let's deploy the trained model to a real-time inference endpoint fully managed by Amazon SageMaker.

In [9]:
mxnet_pred = m.deploy(initial_instance_count=1,
                      instance_type='ml.m5.large')

1.6.0 is the latest version of mxnet that supports Python 2. Newer versions of mxnet will only be available for Python 3.Please set the argument "py_version='py3'" to use the Python 3 mxnet image.


-------------!

In [12]:
mxnet_pred

<sagemaker.mxnet.model.MXNetPredictor at 0x7f67aafb7f98>

<h2>Executing Inferences</h2>

We can now invoke the Amazon SageMaker real-time endpoint with SMS messages and get back the predicted label (SPAM = 1, HAM = 0) and the associated probability.

In [11]:
from sagemaker.mxnet.model import MXNetPredictor
from sms_spam_classifier_utilities import one_hot_encode
from sms_spam_classifier_utilities import vectorize_sequences

# Uncomment the following line to connect to an existing endpoint.
# mxnet_pred = MXNetPredictor('<endpoint_name>')

# test_messages = ["FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop"]
test_messages = ["Hello Gaurav this side"]

one_hot_test_messages = one_hot_encode(test_messages, vocabulary_length)
encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_length)

result = mxnet_pred.predict(encoded_test_messages)
print(result)

{'predicted_label': [[0.0]], 'predicted_probability': [[0.14486829936504364]]}


In [44]:
from sagemaker.mxnet.model import MXNetPredictor
from sms_spam_classifier_utilities import one_hot_encode
from sms_spam_classifier_utilities import vectorize_sequences

# Connect to an existing endpoint by name.
mxnet_pred1 = MXNetPredictor('sms-spam-classifier-mxnet-2020-05-10-23-12-11-768')

test_messages = ["FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop"]
#test_messages = ["Congratulations. You won the 1000 crores"]

print(type(test_messages))

one_hot_test_messages = one_hot_encode(test_messages, vocabulary_length)
encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_length)

result = mxnet_pred1.predict(encoded_test_messages)
print(result)

<class 'list'>
{'predicted_label': [[1.0]], 'predicted_probability': [[0.9999325275421143]]}


In [42]:
label = result['predicted_label'][0][0]
probability = result['predicted_probability'][0][0]
print(label)

CLASSIFICATION = "SPAM" if label == 1.0 else "HAM"

# Predicted probability as a percentage, rounded and unrounded
CLASSIFICATION_CONFIDENCE_SCORE = str(round(probability * 100, 2))
CLASSIFICATION_CONFIDENCE_SCORE1 = str(probability * 100)

print(CLASSIFICATION)
print(CLASSIFICATION_CONFIDENCE_SCORE)
print(CLASSIFICATION_CONFIDENCE_SCORE1)

0.0
HAM
4.66
4.661540314555168
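
The responses above suggest that `predicted_probability` is the model's estimate that the message is spam, so a ham prediction with a low value is in fact a confident one. Under that assumption, the hypothetical helper below reports the confidence in the predicted class rather than the raw spam probability. The `interpret_prediction` name is an illustration and not part of the original notebook.

```python
def interpret_prediction(result):
    """Return (label_name, confidence_percent) from an endpoint response.

    Assumption: 'predicted_probability' is the estimated probability of spam,
    so the confidence in a ham prediction is its complement.
    """
    label = result["predicted_label"][0][0]
    spam_probability = result["predicted_probability"][0][0]
    if label == 1.0:
        return "SPAM", round(spam_probability * 100, 2)
    return "HAM", round((1.0 - spam_probability) * 100, 2)


# Responses copied from the inference cells above.
spam_response = {"predicted_label": [[1.0]],
                 "predicted_probability": [[0.9999325275421143]]}
ham_response = {"predicted_label": [[0.0]],
                "predicted_probability": [[0.14486829936504364]]}

print(interpret_prediction(spam_response))
print(interpret_prediction(ham_response))
```

Reporting the complement for ham messages keeps the score comparable across both classes, which can matter if it is later shown to users or compared against a threshold.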


<h2>Cleaning up</h2>

When done, we can delete the Amazon SageMaker real-time inference endpoint.

In [45]:
mxnet_pred.delete_endpoint()