*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You should create a function called `predict_message` that takes a message string as an argument and returns a list. The first element in the list should be a number between zero and one that indicates the likeliness of "ham" (0) or "spam" (1). The second element in the list should be the word "ham" or "spam", depending on which is most likely.
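
As a rough illustration of that return contract, the sketch below turns a spam probability into the `[score, label]` list. The helper name `to_prediction` and the 0.5 cut-off are assumptions made for this example; the challenge text only says the label should be whichever class is most likely.

```python
def to_prediction(p_spam, threshold=0.5):
    """Build the [score, label] list from a spam probability.

    Hypothetical helper: the name and the 0.5 threshold are assumptions,
    not part of the challenge template.
    """
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be between 0 and 1")
    # Scores near 0 mean "ham", scores near 1 mean "spam".
    label = "spam" if p_spam >= threshold else "ham"
    return [p_spam, label]


print(to_prediction(0.02))
print(to_prediction(0.97))
```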

For this challenge, you will use the [SMS Spam Collection dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/). The dataset has already been grouped into train data and test data.

The first two cells import the libraries and data. The final cell tests your model and function. Add your code in between these cells.


In [90]:
# import libraries
#try:
#  # %tensorflow_version only exists in Colab.
#  !pip install tf-nightly
#except Exception:
#  pass
import tensorflow as tf
import pandas as pd
from tensorflow import keras
#import tensorflow_datasets as tfds
#import tensorflow_text as text
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(tf.__version__)

2.8.0


In [3]:
# Needed this as I was working in a container without wget...
!apt install wget

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  wget
0 upgraded, 1 newly installed, 0 to remove and 18 not upgraded.
Need to get 348 kB of archives.
After this operation, 1012 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 wget amd64 1.20.3-1ubuntu2 [348 kB]
Fetched 348 kB in 1s (373 kB/s)
debconf: delaying package configuration, since apt-utils is not installed

Selecting previously unselected package wget.
(Reading database ... 79954 files and directories currently installed.)
Preparing to unpack .../wget_1.20.3-1ubuntu2_amd64.deb ...
Unpacking wget (1.20.3-1ubuntu2) ...

In [4]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

--2022-02-17 21:41:36--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv’


2022-02-17 21:41:37 (1.49 MB/s) - ‘train-data.tsv’ saved [358233/358233]

--2022-02-17 21:41:38--  https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118774 (116K) [text/tab-separated-values]
Saving to: ‘valid-data.tsv’


2022-02-17 21:41:39 (1.15 MB/s) - ‘valid-data.tsv’ saved [118774/118774]



In [27]:
train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

## Tutorial used
During the lectures on Machine Learning with Python, there's an example where the sentiment of movie reviews is classified as positive or negative. This exercise looks like the same kind of problem, so I will use that example as a base. The section in the lecture is called *Natural Language Processing with RNN's*.

### Loading the data into dataframes

In [51]:
df_train = pd.read_csv(train_file_path, sep='\t', header=None, names=['class', 'message'], encoding='utf-8')
df_test = pd.read_csv(test_file_path, sep='\t', header=None, names=['class', 'message'], encoding='utf-8')
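
To make the file layout explicit, here is a small standard-library sketch of the same "label, tab, message" structure. It is not a replacement for the `pd.read_csv` calls above; the two sample rows are borrowed from the test messages at the end of this notebook, and the variable names are only illustrative.

```python
import csv
import io

# Two illustrative rows in the same "label<TAB>message" layout as the data files.
sample_tsv = (
    "ham\thow are you doing today\n"
    "spam\tsale today! to stop texts call 98912460324\n"
)

# QUOTE_NONE treats quote characters inside a message as ordinary text.
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [{"class": label, "message": message} for label, message in reader]

for row in rows:
    print(row["class"], "->", row["message"])
```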

In [30]:
df_train.tail()

|      | class | message |
|------|-------|---------|
| 4174 | ham | just woke up. yeesh its late. but i didn't fal... |
| 4175 | ham | what do u reckon as need 2 arrange transport i... |
| 4176 | spam | free entry into our £250 weekly competition ju... |
| 4177 | spam | -pls stop bootydelious (32/f) is inviting you ... |
| 4178 | ham | tell my bad character which u dnt lik in me. ... |


### Convert words to integers

In [52]:
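# Encode the labels as integers: spam -> 0, ham -> 1.
# Assigning the mapped column back avoids calling replace(..., inplace=True) on a column selection.
df_train['class'] = df_train['class'].map({'spam': 0, 'ham': 1})
df_test['class'] = df_test['class'].map({'spam': 0, 'ham': 1})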
df_train.tail()

|      | class | message |
|------|-------|---------|
| 4174 | 1 | just woke up. yeesh its late. but i didn't fal... |
| 4175 | 1 | what do u reckon as need 2 arrange transport i... |
| 4176 | 0 | free entry into our £250 weekly competition ju... |
| 4177 | 0 | -pls stop bootydelious (32/f) is inviting you ... |
| 4178 | 1 | tell my bad character which u dnt lik in me. ... |


In [83]:
train_data = df_train.iloc[:,1].values.flatten()
test_data = df_test.iloc[:,1].values.flatten()
train_labels = df_train.iloc[:,0].values.flatten()
test_labels = df_test.iloc[:,0].values.flatten()

#### Tokenization
I followed [this](https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html) guide to create a vocabulary and tokenize all the texts, i.e. to turn each message into a list of integers in which every integer corresponds to a word.
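
To show the idea without any Keras machinery, here is a plain-Python sketch of building a word index and encoding a sentence with it. It is a simplification and not the Keras `Tokenizer` itself: it splits on whitespace only and does no punctuation filtering. The function names `build_word_index` and `to_sequence` are made up for this example.

```python
from collections import Counter


def build_word_index(texts, oov_token="<UNK>"):
    """Map each word to an integer, most frequent words first.

    Index 0 is left free for padding and index 1 is given to the OOV token.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, oov_token="<UNK>"):
    """Replace every word by its index, falling back to the OOV index."""
    return [word_index.get(w, word_index[oov_token]) for w in text.lower().split()]


texts = ["call me now", "call now to win"]
index = build_word_index(texts)
print(index)
print(to_sequence("call me tomorrow", index))  # "tomorrow" is unknown -> OOV index
```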

In [74]:
num_words = 10000 # Max words stored in the vocabulary
oov_token = '<UNK>' # Placeholder for tokens that are not present in our created vocabulary
pad_type = 'pre' # Strings shorter than the max length will be padded with 0's. These 0's will be added before the sentence begins (or rather, before the integers begin)
trunc_type = 'pre' # Strings that are longer than the max length allowed will be cut off, starting from the beginning of the string.
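
What `'pre'` padding and `'pre'` truncation do to a single sequence can be sketched in a few lines of plain Python. The helper `pad_pre` is hypothetical and only mirrors the behaviour described in the comments above for one list; the real work in this notebook is done by `pad_sequences`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Pre-truncate, then pre-pad, a list of integers to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    trimmed = sequence[-maxlen:]                 # 'pre' truncation drops items from the front
    padding = [value] * (maxlen - len(trimmed))  # how many zeros are still missing
    return padding + trimmed                     # 'pre' padding puts the zeros in front


print(pad_pre([36, 245, 121, 20], 6))      # shorter than maxlen: zeros are added in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 4))   # longer than maxlen: the first items are cut
```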

In [75]:
# Tokenize the training data
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
tokenizer.fit_on_texts(train_data)

# Get our training data word index
word_index = tokenizer.word_index

# Encode training data sentences into sequences
train_sequences = tokenizer.texts_to_sequences(train_data)

# Get max training sequence length
maxlen = max([len(x) for x in train_sequences])

# Pad the training sequences
train_padded = pad_sequences(train_sequences, padding=pad_type, truncating=trunc_type, maxlen=maxlen)

# Output the result
print("Word index: \n", word_index)
print("\nTraining sequences:\n", train_sequences)
print("\nPadded training sequences:\n", train_padded)
print("\nPadded training shape:", train_padded.shape)
print("Training sequences data type:", type(train_sequences))
print("Padded training sequences data type:", type(train_padded))

Word index: 

Training sequences:
 [[3667, 37, 2483, 45, 143, 5, 402, 767, 79, 7, 726, 24, 2, 94, 56, 7, 163, 20, 2, 461, 55, 177, 79, 1616, 111, 25, 2, 315, 154, 45, 13, 15], [4, 30, 282, 28, 341], [20, 7, 832, 56, 3668, 3669, 358, 3670, 56, 2, 10, 3671, 19, 96, 416, 86, 494, 114, 7, 3672, 42, 3673, 435, 538], [768, 145, 41, 277, 3, 47, 57, 47, 57, 89, 30, 3674, 3675, 342, 96, 3676, 3677], [282, 316, 607, 2, 262, 972, 59, 146, 317, 63, 78, 69, 11, 96, 1617, 198, 141, 76], [9, 3678, 3679, 358, 516, 375, 100, 11, 6, 1205, 1945, 13, 435, 517, 2, 30, 376, 2484, 19, 12, 973, 158, 302, 16, 3680, 83, 1206, 213, 2485, 89, 160, 6, 3681, 3682, 3683], [2486, 539, 45, 50, 14, 540, 98, 69, 204, 148, 3, 12, 1618], [108, 727, 417, 3, 1382, 6, 3684], [218, 17, 2487, 53, 14, 403, 14, 891, 42, 1946, 272, 27, 436, 518, 161, 577, 462, 578, 164, 318, 377, 329, 1619, 2488, 2489, 974, 1947, 273], [20, 21, 4, 18, 519, 94, 88, 37, 892, 13, 144, 58, 2490, 8, 177, 2, 90, 21, 30, 258, 4, 330, 2, 385, 1207, 4, 5,

In [76]:
test_sequences = tokenizer.texts_to_sequences(test_data)
test_padded = pad_sequences(test_sequences, padding=pad_type, truncating=trunc_type, maxlen=maxlen)

print("Padded testing sequences:\n", test_padded)
print("\nPadded testing shape:", test_padded.shape)

Padded testing sequences:
 [[   0    0    0 ...   86    9  532]
 [   0    0    0 ...   50  812    4]
 [   0    0    0 ...    7  173   11]
 ...
 [   0    0    0 ...    2   80    4]
 [   0    0    0 ...  964  740 1187]
 [   0    0    0 ...    1 2599  359]]

Padded testing shape: (1392, 189)


In [77]:
for x, y in zip(test_data, test_padded):
    print('{} -> {}'.format(x, y))

i am in hospital da. . i will return home in evening -> [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    2   65    9 1777
   96    2   32 1472 

### Create the model

In [86]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])

In [87]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          320000    
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 328,353
Trainable params: 328,353
Non-trainable params: 0
_________________________________________________________________
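
The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the standard LSTM layout: four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias. The variable names are only for illustration.

```python
vocab_size = 10000  # num_words passed to the Embedding layer
embed_dim = 32      # size of each word vector
lstm_units = 32     # size of the LSTM state

# One 32-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four weight sets, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# One output neuron: a weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```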


### Train the model

In [88]:
model.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop', metrics = ['acc'])
history = model.fit(train_padded, train_labels, epochs = 10, validation_split = 0.2)

Epoch 1/10


2022-02-19 21:23:29.962840: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8100


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [89]:
results = model.evaluate(test_padded, test_labels)
print(results)

[0.0788046196103096, 0.9784482717514038]
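
`model.evaluate` returns the loss followed by the metrics passed to `compile`, so the two numbers above are the binary cross-entropy and the accuracy. As a sketch of what those quantities measure, the plain-Python version below computes both for three made-up predictions. The toy labels and probabilities are invented for the example, and the clipping constant is an assumption that may not match what Keras does internally.

```python
import math


def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    hits = sum(int((p > threshold) == bool(y)) for y, p in zip(labels, probs))
    return hits / len(labels)


labels = [1, 0, 1]       # toy ground truth
probs = [0.9, 0.2, 0.4]  # toy sigmoid outputs; the last one is on the wrong side
loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(round(loss, 4), round(acc, 4))
```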


### Function to encode text

In [91]:
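def encode_text(text):
    # Reuse the fitted tokenizer so new text is encoded exactly like the training data;
    # unknown words get the '<UNK>' index instead of 0, which is reserved for padding.
    sequence = tokenizer.texts_to_sequences([text])
    return pad_sequences(sequence, maxlen=maxlen, padding=pad_type, truncating=trunc_type)[0]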

In [92]:
# Test the encoder
text = "Get over here now!"
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0  36 245 121  20]


#### Function to decode integers into words/sentences

In [93]:
reverse_word_index = {v: k for (k, v) in word_index.items()}

In [94]:
def decode_integers(integers):
    PAD = 0
    text = ""
    for num in integers:
        if num != PAD:
            text += reverse_word_index[num] + " "
    return text[:-1]

In [95]:
# Test the decoder
print(decode_integers(encoded))

get over here now


In [98]:
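# function to predict messages based on model
# (should return list containing prediction and label, ex. [0.008318834938108921, 'ham'])
# Note: the labels were encoded above as spam=0 and ham=1, so the sigmoid output here is the
# likeliness of "ham", the reverse of the 0="ham" / 1="spam" scale in the challenge text.
def predict_message(pred_text):
  # Encode the message and add a batch dimension: shape (1, maxlen)
  encoded_text = encode_text(pred_text)
  pred = np.array([encoded_text])
  result = model.predict(pred).flatten()[0]
  # Below 0.5 the model leans towards class 0 (spam), otherwise class 1 (ham)
  label = "spam" if result < 0.5 else "ham"
  return [result, label]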

pred_text = "how are you doing today?"

prediction = predict_message(pred_text)
print(prediction)

[0.9999584, 'ham']


___
## Tests

In [99]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()


You passed the challenge. Great job!
