#### Spam Detection

In [None]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
import pandas as pd
import numpy as np

import re
import string

In [None]:
df = pd.read_table('/content/SMSSpamCollection', header=None, encoding='utf-8')

In [None]:
df.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
def preprocess_sms(sms):
  # case folding
  sms = sms.lower()

  # replace email addresses with 'email'
  # (the ^...$ anchors force the match to span the whole message)
  sms = re.sub(r'^.+@[^\.].*\.[a-z]{2,}$', 'email', sms)
  
  # replace URLs with 'web' (anchored as well, and limited to http:// URLs)
  sms = re.sub(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'web', sms)
  
  # replace money symbols with 'currency' (£ can be typed with ALT key + 156)
  sms = re.sub(r'£|\$', 'currency', sms)
  
  # replace 10-digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phone'
  # (anchored as well, so only a message that is nothing but a phone number matches)
  sms = re.sub(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'phone', sms)
  
  # replace numbers with 'number'
  sms = re.sub(r'\d+(\.\d+)?', 'number', sms)
  
  # replace punctuation and other non-word characters with a space
  sms = re.sub(r'[^\w\d\s]', ' ', sms)
  
  # replace whitespace between terms with a single space
  sms = re.sub(r'\s+', ' ', sms)
  
  # remove leading and trailing whitespace
  sms = re.sub(r'^\s+|\s+?$', '', sms)

  # tokenize sms
  sms_tokens = word_tokenize(sms)
  
  stopwords_english = stopwords.words('english')
  stemmer = PorterStemmer()

  clean_sms = []
  for word in sms_tokens:
    if word not in stopwords_english:
      stem_word = stemmer.stem(word)
      clean_sms.append(stem_word)

  return clean_sms
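
The effect of the `^...$` anchors in the patterns above can be checked in isolation. The following is a minimal sketch that compares the anchored phone pattern from `preprocess_sms` with an unanchored variant. The unanchored pattern and the two sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

# Anchored pattern, copied from preprocess_sms: the match must span the whole string.
anchored = r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$'

# Hypothetical unanchored variant: can match a phone number anywhere in the string.
unanchored = r'\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}'

whole = '555-123-4567'              # message that is only a phone number
embedded = 'call 555-123-4567 now'  # phone number inside a longer message

anchored_whole = re.sub(anchored, 'phone', whole)          # 'phone'
anchored_embedded = re.sub(anchored, 'phone', embedded)    # unchanged
unanchored_embedded = re.sub(unanchored, 'phone', embedded)  # 'call phone now'

print(anchored_whole, '|', anchored_embedded, '|', unanchored_embedded)
```

With the anchored pattern, the embedded number falls through to the later "replace numbers" step instead of becoming a `phone` token. Removing the anchors would change the extracted features, so the results reported further down apply to the anchored version only.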

In [None]:
df['text'] = df[1].apply(preprocess_sms)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
df['label'] = le.fit_transform(df[0])

In [None]:
dict(zip(le.classes_, range(len(le.classes_))))

{'ham': 0, 'spam': 1}

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df['text']
y = df['label'].to_numpy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=21, stratify=y)

#### Feature Engineering

In [None]:
def bag_of_words(texts, labels):
  freqs = {}
  for text, label in zip(texts, labels):
    for word in text:
      pair = (word, label)
      freqs[pair] = freqs.get(pair, 0) + 1

  return freqs

In [None]:
freqs = bag_of_words(X_train, y_train)

In [None]:
def extract_features(text, freqs):
  features = np.zeros((1, 3))

  features[0,0] = 1 # bias

  for word in text:
    features[0,1] += freqs.get((word, 1), 0)
    features[0,2] += freqs.get((word, 0), 0)

  return features
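
To see what these two functions compute, it can help to run them on a tiny hand-made corpus. The sketch below repeats the two definitions so that it runs on its own; the three toy messages and their labels are assumptions made up for illustration (1 = spam, 0 = ham).

```python
import numpy as np

def bag_of_words(texts, labels):
    # count how often each (word, label) pair occurs
    freqs = {}
    for text, label in zip(texts, labels):
        for word in text:
            freqs[(word, label)] = freqs.get((word, label), 0) + 1
    return freqs

def extract_features(text, freqs):
    # [bias, sum of spam counts, sum of ham counts] for the words in text
    features = np.zeros((1, 3))
    features[0, 0] = 1
    for word in text:
        features[0, 1] += freqs.get((word, 1), 0)
        features[0, 2] += freqs.get((word, 0), 0)
    return features

# toy corpus (illustrative only): one spam message, two ham messages
toy_texts = [['free', 'win'], ['ok', 'see'], ['free', 'ok']]
toy_labels = [1, 0, 0]

toy_freqs = bag_of_words(toy_texts, toy_labels)
toy_features = extract_features(['free', 'ok'], toy_freqs)
print(toy_features)  # [[1. 1. 3.]]
```

Here `free` occurs once in spam and once in ham, and `ok` occurs twice in ham, so the message `['free', 'ok']` gets a spam score of 1 and a ham score of 3.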

In [None]:
sample_text = X_train[0]

In [None]:
extract_features(sample_text, freqs)

array([[  1.,  83., 922.]])

#### Logistic Regression

Logistic regression is a supervised machine learning classifier that extracts real-valued features from the input, multiplies each by a weight, sums them, and passes the sum through a sigmoid function to generate a probability.

#### The Sigmoid Function

Consider a single input observation $x$, represented by a vector of features $[x_1, x_2, \ldots, x_n]$.

We want to know the probability $P(y = 1 \mid x)$ that this observation is a member of the positive class (here, spam).

Logistic regression solves this task by learning, from a training set, a vector of weights $w$ and a bias term (intercept) $b$.

To make a decision on a test instance after the weights have been learned, the classifier first multiplies each $x_i$ by its weight $w_i$, sums the weighted features, and adds the bias term $b$. The resulting single number $z$ expresses the weighted sum of the evidence for the class:

\begin{align}
z = \left(\sum_{i=1}^{n} w_i x_i\right) + b  = w \cdot x + b
\end{align}

The sigmoid function $\sigma(z)$ takes a real value and maps it to the range $(0, 1)$, so its output can be read as the probability $P(y = 1 \mid x)$:

\begin{align}
\sigma(z) = \frac{1}{1 + e^{-z}}
\end{align}

In [None]:
def sigmoid(z):
  return 1 / (1 + np.exp(-z))
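
As a worked example of the two formulas above, the following sketch computes $z = w \cdot x + b$ and $\sigma(z)$ for one observation. The weights, features, and bias are arbitrary values chosen for illustration, not values learned in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# illustrative values only
w = np.array([0.5, -1.0])  # weights
x = np.array([2.0, 1.0])   # features
b = 0.5                    # bias

z = np.dot(w, x) + b  # 0.5 * 2.0 + (-1.0) * 1.0 + 0.5 = 0.5
p = sigmoid(z)        # about 0.622

print(z, p)
```

A score of $z = 0$ maps to exactly 0.5, positive scores map above 0.5, and negative scores map below it.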

#### Classification with Logistic Regression

\begin{align}
    \text{decision}(x) =
\begin{cases}
    1, & \text{if } P(y = 1 \mid x) > 0.5 \\
    0, & \text{otherwise}
\end{cases}
\end{align}
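
The decision rule can be written as a one-line threshold on the predicted probabilities. This is a minimal sketch; the helper name `decide` and the sample probabilities are assumptions for illustration.

```python
import numpy as np

def decide(probs, threshold=0.5):
    # 1 where the probability is strictly above the threshold, 0 otherwise
    return (np.asarray(probs) > threshold).astype(int)

decisions = decide([0.2, 0.5, 0.8])
print(decisions)  # [0 0 1]
```

Note that a probability of exactly 0.5 falls into the "otherwise" branch and is assigned class 0.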

#### Learning in Logistic Regression

We want to learn parameters ($w$ and $b$) that make $\widehat{y}$ for each training observation as close as possible to the true $y$. This requires two components:

1. **Loss function:** a measure of the distance between the system output and the gold output.
2. **Optimization algorithm:** a procedure that iteratively updates the weights so as to minimize this loss function.

#### The Cross-Entropy Loss Function

The weights (vector $w$ and bias $b$) are learned from a labeled training set by minimizing a loss function.

**Conditional Maximum Likelihood Estimation:** choose the parameters $w, b$ that maximize the log probability of the true $y$ labels in the training data given the observations $x$.

The resulting loss function is the negative log likelihood loss, also called the cross-entropy loss.

We'd like to learn weights that maximize the probability of the correct label $P(y \mid x)$. Since there are only two discrete outcomes, this is a **Bernoulli distribution**:

\begin{align}
P(y \mid x) = \widehat{y}^{\:y}(1 - \widehat{y})^{1-y} 
\end{align}

Maximizing a probability also maximizes its logarithm, because the logarithm is monotonically increasing:

\begin{align}
\log P(y \mid x) &= \log\bigl[\widehat{y}^{\:y}(1 - \widehat{y})^{1-y}\bigr] \\
&= y \: \log \widehat{y} + (1 - y) \: \log(1 - \widehat{y})
\end{align}

To turn this into a loss function (something to minimize), we flip the sign of the log likelihood:

\begin{align}
L_{CE}(\widehat{y}, y) = -\bigl[y \: \log \widehat{y} + (1 - y) \: \log(1 - \widehat{y})\bigr]
\end{align}
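
A quick numeric check shows how this loss behaves for a single example. The sketch below evaluates $L_{CE}$ for a true label $y = 1$ with a confident correct prediction and a confident wrong one; the probabilities 0.9 and 0.1 are illustrative values.

```python
import math

def cross_entropy(y_hat, y):
    # L_CE for one example; y is 0 or 1, y_hat is strictly between 0 and 1
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

confident_right = cross_entropy(0.9, 1)  # -log(0.9), about 0.105
confident_wrong = cross_entropy(0.1, 1)  # -log(0.1), about 2.303

print(confident_right, confident_wrong)
```

The loss is small when the model assigns high probability to the correct label and grows without bound as that probability approaches zero.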

**Batch training**

The cost function for a batch of $m$ examples is the average of the per-example losses:

\begin{align}
Cost(\widehat{y}, y) = \frac{1}{m} \sum_{i=1}^{m} L_{CE}(\widehat{y}^{(i)}, y^{(i)})
\end{align}

In [None]:
def cross_entropy_loss(y_hat, y):
  # summed (not averaged) cross-entropy over the batch; the caller divides by m
  return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
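
One practical caveat: if a prediction reaches exactly 0 or 1, `np.log` returns `-inf` and the product `0 * -inf` becomes `nan`. The following sketch shows one common workaround, clipping the predictions before taking the log. The function name `mean_cross_entropy` and the `eps` parameter are assumptions, and unlike `cross_entropy_loss` above this variant returns the batch average directly.

```python
import numpy as np

def mean_cross_entropy(y_hat, y, eps=1e-12):
    # keep predictions strictly inside (0, 1) so that log stays finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return float(np.mean(losses))

y = np.array([[1.0], [0.0], [1.0]])

# predicting 0.5 everywhere gives a cost of log(2)
uninformed = mean_cross_entropy(np.full((3, 1), 0.5), y)

# perfectly saturated predictions still give a finite cost close to 0
saturated = mean_cross_entropy(np.array([[1.0], [0.0], [1.0]]), y)

print(uninformed, saturated)
```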

#### Gradient Descent

Minimizing this loss function is a convex optimization problem, and iterative algorithms like gradient descent are used to find the optimal weights.

\begin{align}
\theta^{\:t+1} = \theta^{\:t} - \eta \nabla L (f(x;\theta), y)
\end{align}

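For a single training example with $\widehat{y} = \sigma(w \cdot x + b)$, the partial derivative of the cross-entropy loss with respect to the weight $w_j$ is

\begin{align}
\frac{\partial L_{CE}(\widehat{y}, y)}{\partial w_j} = (\widehat{y} - y) \: x_j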
\end{align}
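
The closed-form gradient can be verified numerically with central finite differences. This is a sketch under assumed values: the weights, bias, features, and step size `h` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    # cross-entropy loss for one example
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# illustrative values only
w = np.array([0.2, -0.4])
b = 0.1
x = np.array([1.5, -2.0])
y = 1.0

# analytic gradient: (y_hat - y) * x_j for each weight
y_hat = sigmoid(np.dot(w, x) + b)
analytic = (y_hat - y) * x

# numerical gradient: perturb one weight at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(analytic, numeric, max_diff)
```

The two gradients should agree to several decimal places, which is a useful sanity check before trusting a hand-derived update rule.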

In [None]:
def gradient_loss(X, y_hat, y):
  # summed gradient X^T (y_hat - y), shape (f, 1); the caller divides by m
  return np.matmul(X.T, y_hat - y)

In [None]:
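def gradient_descent(X, y, eta=1e-8, epochs=500):
  m, f = X.shape
  theta = np.zeros((f, 1))  # weights (bias included) start at zero
  for epoch in range(epochs):
    z = np.matmul(X, theta)
    y_hat = sigmoid(z)
    # average gradient over the batch
    g = 1 / m * gradient_loss(X, y_hat, y)
    theta = theta - eta * g
    # average cost, computed from the predictions made before this update
    cost = 1 / m * cross_entropy_loss(y_hat, y)

  return cost, theta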
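
To watch the update rule at work, the same loop can be run on a tiny, well-scaled dataset. Everything in this sketch is an illustrative assumption: the four data points, the learning rate of 0.5, and the 200 epochs. The much smaller default `eta=1e-8` above is likely needed because the count-based features in this notebook are large and unscaled.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: first column is the bias feature, second is a single real feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, -1.5]])
y = np.array([[1.0], [1.0], [0.0], [0.0]])

m = X.shape[0]
theta = np.zeros((2, 1))
eta = 0.5
costs = []

for epoch in range(200):
    y_hat = sigmoid(X @ theta)
    # average cross-entropy cost before the update
    costs.append(float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))
    # gradient step: theta <- theta - eta * (1/m) * X^T (y_hat - y)
    theta = theta - eta * (X.T @ (y_hat - y)) / m

print(costs[0], costs[-1], theta.ravel())
```

The first recorded cost equals $\log 2$ because all predictions start at 0.5, and the cost falls as the weight on the feature grows in the positive direction.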

In [None]:
X_train_features = np.zeros((len(X_train), 3))

for i in range(len(X_train)):
    X_train_features[i, :] = extract_features(X_train.iloc[i], freqs)

In [None]:
y_train = y_train.reshape(-1, 1)

In [None]:
cost, theta = gradient_descent(X_train_features, y_train)

In [None]:
def test_gradient_descent(X_test, theta):
  y_preds = []
  for sms in X_test:
    features = extract_features(sms, freqs)
    y_pred = sigmoid(np.matmul(features, theta))
    
    y_preds.append(round(y_pred.item()))

  return y_preds

In [None]:
from sklearn.metrics import accuracy_score, f1_score

In [None]:
y_preds = test_gradient_descent(X_test, theta)

In [None]:
scores = {'acc': accuracy_score(y_test, y_preds), 'f1': f1_score(y_test, y_preds)}

In [None]:
scores

{'acc': 0.9325197415649676, 'f1': 0.7526315789473683}

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_preds, target_names=le.classes_))

              precision    recall  f1-score   support

         ham       0.96      0.96      0.96      1206
        spam       0.74      0.76      0.75       187

    accuracy                           0.93      1393
   macro avg       0.85      0.86      0.86      1393
weighted avg       0.93      0.93      0.93      1393
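
The report shows high accuracy alongside a noticeably lower F1 for spam, which is typical when one class dominates (1206 ham versus 187 spam in the test set). The following sketch computes the same metrics by hand on a made-up example to show where the gap comes from; the ten labels and predictions are illustrative and are not taken from this dataset.

```python
# toy labels: 8 ham (0) and 2 spam (1); illustrative only
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.8 0.5 0.5 0.5
```

Accuracy counts the many correctly classified ham messages, while precision, recall, and F1 for the spam class ignore true negatives, so they expose errors on the minority class more directly.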

