<a href="https://colab.research.google.com/github/mtitus6/Python-Projects/blob/main/fcc_sms_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You should create a function called `predict_message` that takes a message string as an argument and returns a list. The first element of the list should be a number between zero and one that indicates how likely the message is to be "ham" (0) or "spam" (1). The second element should be the word "ham" or "spam", whichever is more likely.
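
The return format described above can be sketched in isolation. This is only an illustration, not part of the challenge code: `format_prediction` is a hypothetical helper name, and the 0.5 cut-off between "ham" and "spam" is an assumption.

```python
def format_prediction(spam_probability):
    """Turn a model score in [0, 1] into a [score, label] list."""
    # Assumption: scores at or below 0.5 count as "ham", anything higher as "spam".
    label = "ham" if spam_probability <= 0.5 else "spam"
    return [spam_probability, label]


print(format_prediction(0.008318834938108921))
print(format_prediction(0.97))
```

A real `predict_message` would obtain the score from the trained model and then build the same two-element list.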

For this challenge, you will use the [SMS Spam Collection dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/). The dataset has already been grouped into train data and test data.

The first two cells import the libraries and data. The final cell tests your model and function. Add your code in between these cells.


In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import nltk
# nltk.download('stopwords')  # uncomment on the first run to fetch the stop-word list
from nltk.corpus import stopwords
import regex as re
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing import sequence

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

--2022-01-21 15:17:01--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.2.33, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv’


2022-01-21 15:17:01 (11.0 MB/s) - ‘train-data.tsv’ saved [358233/358233]

--2022-01-21 15:17:01--  https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.2.33, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118774 (116K) [text/tab-separated-values]
Saving to: ‘valid-data.tsv’


2022-01-21 15:17:01 (10.2 MB/s) - ‘valid-data.tsv’ saved [118774/118774]



In [None]:
train_data = pd.read_csv(train_file_path,sep = '\t',names=['label','message'])
test_data = pd.read_csv(test_file_path,sep = '\t',names=['label','message'])

In [None]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1392 entries, 0 to 1391
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    1392 non-null   object
 1   message  1392 non-null   object
dtypes: object(2)
memory usage: 21.9+ KB


In [None]:
train_data.head()

  label                                            message
0   ham  ahhhh...just woken up!had a bad dream about u ...
1   ham                           you can never do nothing
2   ham  now u sound like manky scouse boy steve,like! ...
3   ham  mum say we wan to go then go... then she can s...
4   ham  never y lei... i v lazy... got wat? dat day ü ...


In [None]:
y_train = train_data.pop('label')
y_test = test_data.pop('label')

In [None]:
# convert y to binary
y_train_bin = np.array([0 if element == 'ham' else 1 for element in y_train])
y_test_bin = np.array([0 if element == 'ham' else 1 for element in y_test])

In [None]:
# clean out the special characters and stop words
def clean_text(text, remove_stopwords=True):
    '''Clean the text, with the option to remove stopwords'''

    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]

    text = " ".join(text)

    # Clean the text
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)  # Replace every character outside a-z with a space
    text = re.sub(r"   ", " ", text)  # Replace each run of three spaces with one
    text = re.sub(r"  ", " ", text)  # Replace each pair of spaces with one (single pass, so some extra spaces can remain)

    # Return the cleaned text as a single string
    return text

In [None]:
# Clean every training message and collect the results in a list
train_clean = [clean_text(message) for message in train_data['message']]

train_clean[0:10]

['ahhhh just woken up had bad dream u tho so dont like u right didnt know anything comedy night guess im it ',
 'never nothing',
 'u sound like manky scouse boy steve like travelling da bus home wot u inmind recreation dis eve ',
 'mum say wan go go shun bian watch da glass exhibition ',
 'never lei v lazy got wat dat day send da url cant work one ',
 'xam hall boy asked girl tell starting term dis answer den manage lot hesitation n lookin around silently said the intha ponnungale ipaditan ',
 'genius what s up brother pls send number skype ',
 'finally came fix ceiling ',
 'urgent call   landline complimentary ibiza holiday  cash await collection sae t cs po box  sk wp  ppm ',
 'started dont stop pray good ideas anything see help guys i ll forward link ']

In [None]:
# Clean every test message and collect the results in a list
test_clean = [clean_text(message) for message in test_data['message']]

test_clean[0:10]

['hospital da return home evening',
 'much textin bout you ',
 'probably eat today think i m gonna pop weekend u miss me ',
 'don t give flying monkeys wot think certainly don t mind friend mine that ',
 'seeing ',
 'opinion me  jada kusruthi lovable silent spl character matured stylish simple pls reply ',
 'yesterday going home ',
 'yes innocent fun o ',
 'boy late home father power frndship',
 'ur changes da report big cos i ve already made changes da previous report ']

In [None]:
#tokenize all of the words
all_data = train_clean + test_clean
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_data)

train_seq = tokenizer.texts_to_sequences(train_clean)

test_seq = tokenizer.texts_to_sequences(test_clean)

train_seq[0]

[3899,
 847,
 2689,
 203,
 1768,
 316,
 848,
 1,
 560,
 248,
 47,
 16,
 1,
 89,
 347,
 13,
 110,
 1769,
 64,
 258,
 94,
 29]
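
To make the integer sequence above easier to read, here is a rough pure-Python sketch of the kind of mapping the Keras `Tokenizer` builds: each word gets an index starting at 1, and more frequent words receive smaller indices. The helper names `build_word_index` and `to_sequence` are hypothetical and the sample messages are made up. The real tokenizer also lowercases text and filters punctuation by default, which this sketch does not attempt.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() orders by descending count; index 0 is left unused for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [word_index[w] for w in text.split() if w in word_index]


sample_texts = ["u there", "call u now", "call now"]  # made-up messages
word_index = build_word_index(sample_texts)
print(word_index)
print(to_sequence("call u later", word_index))
```

Because indices start at 1, the value 0 stays free to act as the padding marker in the next step.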

In [None]:
#make all of the sequences the same length
max_length = 25

train_pad = pad_sequences(train_seq, maxlen = max_length)

test_pad = pad_sequences(test_seq, maxlen = max_length)

train_pad[0]

array([   0,    0,    0, 3899,  847, 2689,  203, 1768,  316,  848,    1,
        560,  248,   47,   16,    1,   89,  347,   13,  110, 1769,   64,
        258,   94,   29], dtype=int32)
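
The leading zeros in the array above come from padding at the front of the sequence. The sketch below imitates that behaviour in plain Python for a single sequence, assuming the `pad_sequences` defaults of pre-padding and pre-truncation; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    trimmed = seq[-maxlen:]  # drop items from the front if the sequence is too long
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([3899, 847, 2689], 5))
print(pad_pre([1, 2, 3, 4, 5, 6], 4))
```

With these defaults, a message longer than `max_length` loses its first words rather than its last ones.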

In [None]:
#build model
# Tokenizer indices start at 1 (0 is used for padding), so the embedding needs one extra row
VOCAB_SIZE = len(tokenizer.word_index) + 1
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

In [None]:
#train model
model.compile(loss="binary_crossentropy",optimizer="rmsprop",metrics=['acc'])

history = model.fit(train_pad, y_train_bin, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
#check accuracy on test data
results = model.evaluate(test_pad, y_test_bin)



In [None]:
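def encode_text(text):
  # split the raw message into lowercase words
  tokens = keras.preprocessing.text.text_to_word_sequence(text)
  # look up each word's index; words the tokenizer has never seen map to 0
  tokens = [tokenizer.word_index.get(word, 0) for word in tokens]
  # pad (or truncate) to the same length used for the training data
  return sequence.pad_sequences([tokens], maxlen=max_length)[0]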

In [None]:
# function to predict messages based on model
# (should return list containing prediction and label, ex. [0.008318834938108921, 'ham'])
def predict_message(pred_text):
  # note: pred_text is encoded as-is, without passing through clean_text
  encoded_text = encode_text(pred_text)
  prediction = np.zeros((1, max_length))
  prediction[0] = encoded_text
  # model.predict returns an array of shape (1, 1); extract the scalar probability
  result = float(model.predict(prediction)[0][0])
  if result <= 0.5:
    result_text = 'ham'
  else:
    result_text = 'spam'

  return [result, result_text]


In [None]:
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()


You passed the challenge. Great job!
