<a href="https://colab.research.google.com/github/l43lu/NLP/blob/master/Spam_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
%tensorflow_version 2.x
import pandas as pd
import re
import tensorflow as tf
import os
import io
tf.__version__

'2.6.0'

In [2]:
# Download the zip file
path_to_zip = tf.keras.utils.get_file("smsspamcollection.zip", origin="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip", extract=True)

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip


In [3]:
# Unzip the file into a folder
!unzip $path_to_zip -d data

Archive:  /root/.keras/datasets/smsspamcollection.zip
  inflating: data/SMSSpamCollection  
  inflating: data/readme             


In [4]:
# Let's see if we read the data correctly
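# Read with a context manager so the file handle is closed; UTF-8 is assumed (the Colab default)
with io.open('data/SMSSpamCollection', encoding='utf-8') as f:
  lines = f.read().strip().split('\n')
lines[0]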

'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [5]:
spam_dataset = []
for line in lines:
  # Each line has the form "<label>\t<message>"; encode spam as 1 and ham as 0
  label, text = line.split('\t')
  if label.strip() == 'spam':
    spam_dataset.append((1, text.strip()))
  else:
    spam_dataset.append((0, text.strip()))

print(spam_dataset[0])


(0, 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')


In [7]:
df = pd.DataFrame(spam_dataset, columns=['Spam', 'Message'])

In [11]:
# Executed after In [10], so the preview already includes the engineered feature columns
df.head()

,Spam,Message,Capitals,Punctuation,Length
0,0,"Go until jurong point, crazy.. Available only ...",3,28,111
1,0,Ok lar... Joking wif u oni...,2,11,29
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,10,33,155
3,0,U dun say so early hor... U c already then say...,2,16,49
4,0,"Nah I don't think he goes to usf, he lives aro...",2,14,61


In [9]:
def message_length(x):
  # returns total number of characters
  return len(x)

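def num_capitals(x):
  # counts ASCII uppercase letters only, so this only works for English text
  _, count = re.subn(r'[A-Z]', '', x)
  return count

def num_punctuation(x):
  # \W matches every non-word character, so whitespace is counted along with punctuation
  _, count = re.subn(r'\W', '', x)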
  return count
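
The three feature functions can be checked by hand on a single message. The sketch below repeats them so it runs on its own and applies them to the second message of the dataset. `num_punctuation_strict` is a hypothetical variant, not part of the original notebook, added only to show how the count changes when whitespace is excluded.

```python
import re

def message_length(x):
  # total number of characters
  return len(x)

def num_capitals(x):
  # number of ASCII uppercase letters
  _, count = re.subn(r'[A-Z]', '', x)
  return count

def num_punctuation(x):
  # number of non-word characters (includes whitespace)
  _, count = re.subn(r'\W', '', x)
  return count

def num_punctuation_strict(x):
  # hypothetical variant: non-word characters excluding whitespace
  _, count = re.subn(r'[^\w\s]', '', x)
  return count

sample = "Ok lar... Joking wif u oni..."
features = (
  message_length(sample),
  num_capitals(sample),
  num_punctuation(sample),
  num_punctuation_strict(sample),
)
print(features)  # (29, 2, 11, 6)
```

The first three values match row 1 of the `df.head()` output. The gap between 11 and 6 is the five spaces in the message, which `\W` counts as well.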

In [10]:
df['Capitals'] = df['Message'].apply(num_capitals)
df['Punctuation'] = df['Message'].apply(num_punctuation)
df['Length'] = df['Message'].apply(message_length)
df.describe()

,Spam,Capitals,Punctuation,Length
count,5574.0,5574.0,5574.0,5574.0
mean,0.134015,5.621636,18.942591,80.443488
std,0.340699,11.683233,14.825994,59.841746
min,0.0,0.0,0.0,2.0
25%,0.0,1.0,8.0,36.0
50%,0.0,2.0,15.0,61.0
75%,0.0,4.0,27.0,122.0
max,1.0,129.0,253.0,910.0


The following code splits the dataset into training and test sets, with 80% of the records in the training set and the remaining 20% in the test set. Furthermore, the labels are separated from the features in both sets:

In [13]:
train = df.sample(frac=0.8,random_state=42)
test  = df.drop(train.index)

x_train = train[['Length', 'Capitals', 'Punctuation']]
y_train = train[['Spam']]

x_test = test[['Length', 'Capitals', 'Punctuation']]
y_test = test[['Spam']]
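
The `sample`/`drop` pattern above can be sanity-checked on a toy frame. This is a minimal sketch with made-up data (the `toy` DataFrame is hypothetical); it relies on the index being unique, which holds for the default integer index used in this notebook.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the real dataset
toy = pd.DataFrame({'Spam': [0, 1] * 5, 'Length': range(10)})

# Same pattern as above: sample 80% for training, keep the rest for testing
toy_train = toy.sample(frac=0.8, random_state=42)
toy_test = toy.drop(toy_train.index)

# The two sets should be disjoint and together cover every row
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```

Note that `sample` does not stratify by label, so the spam fraction in the two sets can differ slightly from the overall fraction.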

In [15]:
# make_model builds models with a configurable number of input features and hidden units.

# The model uses binary cross-entropy as the loss and the Adam optimizer for training. Because this is a
# binary classification problem, the metric tracked is accuracy. The default arguments are sufficient here, as only three features are passed in.

# Basic neural network with one hidden layer, used as a baseline
def make_model(input_dims=3, num_units=12):
  model = tf.keras.Sequential()
  # Add a densely connected hidden layer with num_units units (12 by default):
  model.add(tf.keras.layers.Dense(num_units, input_dim=input_dims, activation='relu'))
  # Add a single sigmoid output unit for the binary prediction:
  model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model
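
To get a feel for how small this model is, the number of trainable parameters can be worked out by hand. The sketch below assumes the default arguments (`input_dims=3`, `num_units=12`) and the usual dense-layer count of one weight per input-output pair plus one bias per output; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_in, n_out):
  # weights (n_in * n_out) plus one bias per output unit
  return n_in * n_out + n_out

input_dims, num_units = 3, 12

hidden_params = dense_params(input_dims, num_units)  # hidden ReLU layer
output_params = dense_params(num_units, 1)           # sigmoid output unit
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 48 13 61
```

Calling `model.summary()` on the object returned by `make_model()` should report the same total.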


In [16]:
# Train the simple baseline model using only the three engineered features
model = make_model()
model.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f9cfe4bdb10>

In [17]:
# Evaluate the model on the held-out test set; returns [loss, accuracy]
model.evaluate(x_test, y_test)



[0.20455998182296753, 0.9363228678703308]
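
Accuracy is easier to interpret next to a trivial baseline. The sketch below compares the test accuracy reported above with the accuracy of always predicting "ham". It assumes the test set has roughly the same spam fraction as the full dataset (the mean of `Spam` in the `df.describe()` output), which the unstratified split does not guarantee.

```python
# Values copied from the outputs above
spam_fraction = 0.134015             # mean of the Spam column over all 5574 messages
test_accuracy = 0.9363228678703308   # second element of model.evaluate(...)

# Always predicting "ham" is right whenever the message is not spam
baseline_accuracy = 1.0 - spam_fraction
gain = test_accuracy - baseline_accuracy

print(f"baseline={baseline_accuracy:.4f} gain={gain:.4f}")
```

Under that assumption the three-feature model improves on the majority-class baseline by about seven percentage points.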

In [20]:
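# A quick check: compute the confusion matrix on the training set.
# model.predict returns probabilities, so threshold them at 0.5 to get class labels first.
# The matrix recorded below was produced from the raw probabilities, which are cast to
# integers and therefore collapse almost entirely to class 0.
y_train_pred = (model.predict(x_train) > 0.5).astype('int32').flatten()
tf.math.confusion_matrix(tf.constant(y_train.Spam), y_train_pred)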

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[3866,    1],
       [ 592,    0]], dtype=int32)>
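
The matrix above has almost every prediction in the class 0 column, which is what one would expect if the probabilities returned by `model.predict` are truncated to integers rather than thresholded. The following pure-Python sketch illustrates the difference on made-up values; `to_labels` and `confusion_matrix` are hypothetical helpers written for this illustration, and the 0.5 threshold is an assumption.

```python
def to_labels(probs, threshold=0.5):
  # convert predicted probabilities into 0/1 class labels
  return [1 if p > threshold else 0 for p in probs]

def confusion_matrix(y_true, y_pred):
  # rows are true labels, columns are predicted labels
  matrix = [[0, 0], [0, 0]]
  for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1
  return matrix

# Made-up labels and probabilities for illustration only
y_true = [0, 0, 1, 1, 1]
probs = [0.10, 0.70, 0.95, 0.40, 1.00]

# Truncating toward zero turns every probability below 1.0 into class 0
truncated = [int(p) for p in probs]
thresholded = to_labels(probs)

cm_truncated = confusion_matrix(y_true, truncated)
cm_thresholded = confusion_matrix(y_true, thresholded)

print(cm_truncated)    # [[2, 0], [2, 1]]
print(cm_thresholded)  # [[1, 1], [1, 2]]
```

With truncation only the prediction that is exactly 1.0 lands in the class 1 column, mirroring the single non-zero entry in that column of the recorded output.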