Binary Text Classification of Ham and Spam Text Messages Using Logistic Regression

This notebook follows the workflow below:

1. Data Preprocessing:
    -Load the dataset into sentences and labels
    -Split the dataset into training, validation and testing sets
    -Report the class distribution in the form of a table
    -Clean the data of any noise (URLs, punctuation, and numbers) & change to lower case
    -Tokenize input text into tokens, including word stemming and stopword removal
    -Build your own TF-IDF feature extractor using the training set
2. Build a logistic regression classifier using L2 regularization
    -Derive the gradient of the objective function of LR with respect to w and b
    -Implement logistic regression via initialization, objective function, and gradient descent
    -Implement accuracy, precision, recall and F1 score as test metrics
    -Write a function for SGD and mini-batch GD
    -Evaluate the model on the test set and report the metrics
3. Cross Validation
    -Implement cross validation to choose the best hyperparameter lambda using held-out validation data
4. Conclusion
    -Analyze the results and compare to baseline
5. Create a multiclass classifier from various authors

In [1]:
import pandas as pd
import string
import math
import numpy as np

Load the dataset. The SMSSpamCollection file is tab-separated and has no header row, so the two columns are named on read: label (ham or spam) and text (the message body).

In [2]:
spam_df = pd.read_csv('a1-data/SMSSpamCollection', sep='\t', header=None, names=['label', 'text'])

spam_df.head()

,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
spam_df.shape

(5572, 2)

Objective of the split_dataset function: 
    -Shuffle the dataset and split it into training, validation and test sets
    -Return each set split into features and labels (this enables easier tokenization later)
    -Shuffle with a fixed random_state so the split is identical on every run, which keeps the results reproducible

In [4]:
def split_dataset(df, train_size, val_size, test_size):
    df = df.sample(frac=1,random_state=42).reset_index(drop=True)
    n = (len(df))
    train_end = int(train_size *n)
    val_end = train_end + int(val_size *n)

    train_df = df.iloc[:train_end]
    val_df = df.iloc[train_end:val_end]
    test_df = df.iloc[val_end:n]

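    # Copy the feature frames so later column assignments do not write into a slice of df
    X_train, y_train = train_df[['text']].copy(), train_df['label']
    X_val, y_val = val_df[['text']].copy(), val_df['label']
    X_test, y_test = test_df[['text']].copy(), test_df['label']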

    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = split_dataset(spam_df, 0.6, 0.2, 0.2)

Output a table showing the number of samples in each class for the training, validation and test sets.

In [5]:
def data_distribution(y_train, y_val, y_test):
    df =pd.DataFrame({'Train': y_train.value_counts(), 'Validation': y_val.value_counts(), 'Test': y_test.value_counts()}).fillna(0).astype(int)
    return df

data_distribution(y_train, y_val, y_test)

label,Train,Validation,Test
ham,2898,966,961
spam,445,148,154


Objective of the clean_text function:
    -Remove URLs, punctuation and numbers
    -Change text to lowercase

In [6]:
def clean_text(X):
    X = X.str.lower()
    # Strip URLs first, while "://" and "." are still present; "http\S+" also matches https
    X = X.str.replace(r"http\S+", "", regex=True)
    # Remove punctuation, then digits
    X = X.str.translate(str.maketrans("", "", string.punctuation))
    X = X.str.replace(r"\d+", "", regex=True)
    return X

# Convert ham and spam labels to 0 and 1 for classification
# (named label_map to avoid shadowing the built-in map)
label_map = {"spam": 1, "ham": 0}
y_train = np.array([label_map[v] for v in y_train], dtype=float)
y_val = np.array([label_map[v] for v in y_val], dtype=float)
y_test = np.array([label_map[v] for v in y_test], dtype=float)

Tokenize the dataset:
    -Split each message into tokens on whitespace
    -Remove stop words (common words that add little semantic value)
    -Reduce each remaining token to a rough stem by stripping common suffixes

In [7]:
STOPWORDS = {
    "a", "an", "the", "and", "or", "but",
    "is", "are", "was", "were", "be",
    "to", "of", "in", "on", "for", "with",
    "that", "this", "it", "as", "at"
}

def tokenize_text(X):
    return X.apply(lambda x: x.split())

def remove_stopwords(tokens, stopwords = STOPWORDS):
    return tokens.apply(lambda x: [t for t in x if t not in stopwords])

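def stem_token(token):
    # Suffixes are tried in list order and the first match wins.
    # Because "s" comes before "es", the "es" entry never fires.
    suffixes = ["ing", "ly", "ed", "s", "es", "est"]
    for suf in suffixes:
        # Only strip when at least 3 characters of the token remain
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token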

def stem_tokens(tokens):
    return tokens.apply(lambda x: [stem_token(t) for t in x])
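
The suffix stripper tries suffixes in list order, so a short suffix listed early can shadow a longer one listed later. A minimal sketch of one alternative, which tries longer suffixes first, is shown below. The name stem_token_longest_first is hypothetical and not part of this notebook, and swapping it in would change the vocabulary and every number computed downstream.

```python
# Hypothetical variant of stem_token: same suffix set, longest suffix tried first.
SUFFIXES = sorted(["ing", "ly", "ed", "s", "es", "est"], key=len, reverse=True)

def stem_token_longest_first(token):
    for suf in SUFFIXES:
        # Same length guard as the notebook: keep at least 3 characters
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token

for word in ["boxes", "fastest", "running", "cats", "is"]:
    print(word, "->", stem_token_longest_first(word))
```

With this ordering "boxes" loses "es" rather than just "s", while very short tokens such as "is" are still left untouched by the length guard.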

In [8]:
def preprocess_text(X):
    X  = X.copy()
    X = clean_text(X)
    X = tokenize_text(X)
    X = remove_stopwords(X)
    X = stem_tokens(X)
    return X

X_train['text'] = preprocess_text(X_train['text'])
X_val['text'] = preprocess_text(X_val['text'])
X_test['text'] = preprocess_text(X_test['text'])
X_train.head()

,text
0,"[squeeeeeze, christma, hug, if, u, lik, my, fr..."
1,"[also, ive, sorta, blown, him, off, couple, ti..."
2,"[mmm, that, better, now, i, got, roast, down, ..."
3,"[mm, have, some, kanji, dont, eat, anyth, heav..."
4,"[so, there, ring, come, guy, costume, its, the..."


Build a TF-IDF vectorizer from scratch:
    -Goal is to create a function that takes the token lists and returns, for each document, a weight for each word that reflects its importance
    -TF = Term Frequency: The more the word appears in the document, the higher the TF
    -IDF = Inverse Document Frequency: The less the word appears in the corpus, the higher the IDF
What the functions do: 
    -fit_tfidf:
        -Treats each row as a new "document" and counts the total number of documents
        -Builds the document frequency required for IDF by converting the token list to a set so each word is counted at most once per document
        -Increments the document frequency of every word that appears in the document
        -Builds a vocabulary that maps each word to an index, with words sorted alphabetically
        -Computes the smoothed IDF for each word in the vocabulary as log((1 + N) / (1 + df)) + 1
    -transform_tfidf:
        -Counts term occurrences within each document
        -Computes TF per term as count / len(doc)
        -Computes TF-IDF per term as TF * IDF
        -Stores the result as a sparse vector (a dict mapping vocabulary index to weight)
        -Returns a list of vectors, one per document
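
The steps above can be traced by hand on a tiny corpus. The sketch below uses three made-up token lists (illustrative data, not from the SMS dataset) and the same smoothed IDF formula and TF definition as the functions in this notebook, but keyed by word rather than by vocabulary index to keep it short.

```python
import math

# Toy corpus of three already-tokenized documents (illustrative data only)
docs = [["free", "prize", "free"], ["call", "me"], ["free", "call"]]
N = len(docs)

# Document frequency: number of documents that contain each word
df = {}
for doc in docs:
    for w in set(doc):
        df[w] = df.get(w, 0) + 1

# Smoothed IDF: log((1 + N) / (1 + df)) + 1
idf = {w: math.log((1.0 + N) / (1.0 + c)) + 1.0 for w, c in df.items()}

# TF-IDF for the first document, with TF = count / len(doc)
doc = docs[0]
tfidf = {w: (doc.count(w) / len(doc)) * idf[w] for w in set(doc)}

print(df)
print({w: round(v, 4) for w, v in tfidf.items()})
```

Here "prize" appears in one document and "free" in two, so "prize" gets the larger IDF, yet "free" still ends up with the larger TF-IDF weight in the first document because it occurs twice there.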


In [9]:
def fit_tfidf(text):
    # Document frequency: df[word] = number of docs containing word
    df = {}
    N = 0

    for doc in text:
        N += 1
        seen = set(doc)
        for w in seen:
            df[w] = df.get(w, 0) + 1

    # Build vocab (deterministic order: alphabetical)
    vocab_words = sorted(df.keys())
    vocab = {w: i for i, w in enumerate(vocab_words)}

    idf = [0.0] * len(vocab_words)
    for w, i in vocab.items():
        idf[i] = math.log((1.0 + N) / (1.0 + df[w])) + 1.0

    return vocab, idf


def transform_tfidf(text, vocab, idf, normalize=False):
    vectors = []

    for doc in text:
        # term counts (tokens outside the fitted vocabulary are skipped)
        counts = {}
        for w in doc:
            if w in vocab:
                idx = vocab[w]
                counts[idx] = counts.get(idx, 0) + 1

        # compute TF-IDF (using TF = count / len(doc))
        doc_len = len(doc) if len(doc) > 0 else 1
        vec = {}
        for idx, c in counts.items():
            tf = c / doc_len
            vec[idx] = tf * idf[idx]

        # optional L2 normalization (off by default, so vectors stay unnormalized)
        if normalize and vec:
            norm = math.sqrt(sum(v * v for v in vec.values()))
            if norm > 0:
                for idx in vec:
                    vec[idx] = vec[idx] / norm

        vectors.append(vec)

    return vectors

In [10]:
train_docs = X_train['text'].tolist()
vocab, idf = fit_tfidf(train_docs)
X_train_tfidf = transform_tfidf(train_docs, vocab, idf)

X_train_tfidf[:3]

[{4589: 0.29040606184981543,
  873: 0.24260280801809503,
  2255: 0.5330088698679104,
  2305: 0.26567122657239356,
  5139: 0.6067940829285435,
  2731: 0.25880982971622385,
  3174: 0.10983909761429993,
  1809: 0.29040606184981543,
  1214: 0.22092664734835799,
  2952: 0.10849018669721469,
  373: 0.15991676412849606,
  1889: 0.12648915870821661,
  3907: 0.48333845355664923,
  1113: 0.25880982971622385,
  2847: 0.29040606184981543,
  4481: 0.12203358176063021,
  2836: 0.22330571188239082,
  3295: 0.2764245063978097,
  3596: 0.1970250204324978,
  2081: 0.25252287948194957},
 {159: 0.280564912646649,
  2444: 0.2842816267105413,
  4527: 0.4432513575602446,
  544: 0.4432513575602446,
  2158: 0.5233461313487431,
  3378: 0.2896844769242299,
  1042: 0.35854409901108136,
  4968: 0.22436277422552497,
  3978: 0.377316464481541,
  4481: 0.18626178268727767,
  2294: 0.3276079587530751,
  3938: 0.3854296581566598,
  3311: 0.19105926530073353,
  4877: 0.22775953849592345,
  3487: 0.20881202617848102,
  5

The validation and test sets are only transformed, using the vocabulary and IDF values fitted on the training set, so that no information from them leaks into the features.

In [11]:
val_doc = X_val['text'].tolist()
X_val_tfidf = transform_tfidf(val_doc, vocab, idf)

test_doc = X_test['text'].tolist()
X_test_tfidf = transform_tfidf(test_doc, vocab, idf)

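Next we need to create a logistic regression model with L2 regularization:
    -Objective minimized by the code below, for n samples, the sigmoid \sigma and regularization strength \lambda:
        $$J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda \lVert w \rVert_2^2, \qquad \hat{y}_i = \sigma(w^\top x_i + b)$$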

In [12]:
def dot_sparse(w, x):
    s = 0
    for j, v in x.items():
        s += w[j]*v
    return s
    
def objective(w, b, X, y, lam):
    n = len(X)
    eps = 1e-15
    loss=0
    for x_i, y_i in zip(X,y):
        # Linear score for this sample
        z = dot_sparse(w, x_i) + b
        # Pass the score through the sigmoid to get the predicted spam probability
        y_hat = 1 / (1 + np.exp(-z))
        # Clip the probability away from 0 and 1 so the log terms stay finite
        y_hat = np.clip(y_hat, eps, 1 - eps)
        # Binary cross-entropy loss
        loss += -(y_i * np.log(y_hat) + (1-y_i) * np.log(1-y_hat))
        
    loss = loss / n
    # L2 regularization
    reg = lam * np.sum(w ** 2)

    return loss + reg
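
The expression 1 / (1 + np.exp(-z)) still returns the right limit for a very negative z, but np.exp overflows to inf on the way and NumPy emits a RuntimeWarning. A minimal sketch of a sigmoid that avoids the overflow by branching on the sign of z is given below; stable_sigmoid is a hypothetical helper, not something this notebook defines or uses.

```python
import math

def stable_sigmoid(z):
    # For z >= 0, exp(-z) <= 1, so the usual form cannot overflow
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so the exponent stays negative
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(stable_sigmoid(0.0), stable_sigmoid(-1000.0), stable_sigmoid(1000.0))
```

Both branches compute the same function; only the form of the expression changes so that the argument passed to exp is never positive.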


Define the gradient function and the full-batch gradient descent loop
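
For reference, the gradient can be derived from the objective that the objective function computes (mean binary cross-entropy plus \lambda times the squared L2 norm of w). The sketch below writes \ell_i for the per-sample loss and uses the identity \sigma'(z) = \sigma(z)(1 - \sigma(z)); the symbols \ell_i and z_i are introduced here for the derivation only.

```latex
\begin{aligned}
J(w, b) &= \frac{1}{n}\sum_{i=1}^{n} \ell_i + \lambda \lVert w \rVert_2^2,
\qquad \ell_i = -\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big],
\qquad \hat{y}_i = \sigma(z_i),\quad z_i = w^\top x_i + b \\
\frac{\partial \ell_i}{\partial z_i}
  &= -\Big[\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\Big]\hat{y}_i(1 - \hat{y}_i)
   = \hat{y}_i - y_i \\
\nabla_w J &= \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,x_i + 2\lambda w \\
\frac{\partial J}{\partial b} &= \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)
\end{aligned}
```

The last two lines are the quantities returned as dw and db: the error term \hat{y}_i - y_i weighted by the features, averaged over the samples, with the regularization contributing 2\lambda w to the weights only.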

In [None]:
def gradient(w, b, X, y, lam):
    n = len(X)
    d = len(w)
    dw = np.zeros(d)
    db = 0.0

    for x_i, y_i in zip(X, y):
        # Get predictions
        z = dot_sparse(w, x_i) + b
        y_hat = 1 / (1 + np.exp(-z))
        # Error term
        error = y_hat - y_i

        # Accumulate the gradient over the non-zero features only
        for j, v in x_i.items():
            dw[j] += error * v
        db += error
    dw = (1.0 / n) * dw + 2 * lam * w
    db = (1.0 / n) * db
    return dw, db
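
A common sanity check for a hand-written gradient is to compare it with central finite differences of the objective. The sketch below does this on made-up toy data in pure Python; toy_objective and toy_gradient are hypothetical re-implementations of the same formulas on plain lists, named differently so they do not overwrite the notebook's NumPy versions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def toy_objective(w, b, X, y, lam):
    # Mean binary cross-entropy plus lam * ||w||^2, on sparse dict features
    loss = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w[j] * v for j, v in x_i.items()) + b)
        loss += -(y_i * math.log(p) + (1 - y_i) * math.log(1 - p))
    return loss / len(X) + lam * sum(wj * wj for wj in w)

def toy_gradient(w, b, X, y, lam):
    # Analytic gradient: mean of error * features, plus 2 * lam * w for the weights
    dw = [0.0] * len(w)
    db = 0.0
    for x_i, y_i in zip(X, y):
        err = sigmoid(sum(w[j] * v for j, v in x_i.items()) + b) - y_i
        for j, v in x_i.items():
            dw[j] += err * v
        db += err
    n = len(X)
    return [g / n + 2 * lam * wj for g, wj in zip(dw, w)], db / n

def numeric_gradient(w, b, X, y, lam, h=1e-6):
    # Central differences, one coordinate at a time
    num_dw = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        diff = toy_objective(w_plus, b, X, y, lam) - toy_objective(w_minus, b, X, y, lam)
        num_dw.append(diff / (2 * h))
    diff_b = toy_objective(w, b + h, X, y, lam) - toy_objective(w, b - h, X, y, lam)
    return num_dw, diff_b / (2 * h)

# Toy data (illustrative only): three sparse samples over three features
X = [{0: 0.5, 2: 1.0}, {1: 0.3}, {0: 0.2, 1: 0.7}]
y = [1.0, 0.0, 1.0]
w, b, lam = [0.1, -0.2, 0.05], 0.3, 0.01

dw, db = toy_gradient(w, b, X, y, lam)
num_dw, num_db = numeric_gradient(w, b, X, y, lam)
max_diff = max(max(abs(a - c) for a, c in zip(dw, num_dw)), abs(db - num_db))
print(f"max abs difference: {max_diff:.2e}")
```

If the analytic and numeric gradients agree to several decimal places, the derivation and the implementation are consistent with each other; a large gap usually points to a missing factor such as 1/n or the 2 in the regularization term.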

In [14]:
def gradient_descent(w, b, X, y, lam, learning_rate, max_epochs=100, print_every=50):
    objvals = []

    for epoch in range(max_epochs):
        # Compute gradients for both the weight and bias
        dw, db = gradient(w, b, X, y, lam)

        # Update parameters using the learning rate
        w = w - learning_rate * dw
        b = b - learning_rate * db
        # Update the objective value
        obj = objective(w, b, X, y, lam)
        objvals.append(obj)
        # Update progress during training
        if epoch % print_every == 0 or epoch == max_epochs - 1:
            print(f"Epoch {epoch:4d} | Loss = {obj:.6f}")
        # Stop early if the objective has diverged (inf or NaN)
        if not np.isfinite(obj):
            print(f"Stopped early at epoch {epoch}")
            break
    return w, b, objvals

A secondary objective is to implement stochastic gradient descent (SGD) and mini-batch gradient descent for this model.
Each method needs its own gradient function and training loop.

In [None]:
def gradient_sgd(w, b, x_i, y_i, lam):
    # Gradient of the regularized loss for a single sample
    dw = np.zeros_like(w)

    z = dot_sparse(w, x_i) + b
    y_hat = 1 / (1 + np.exp(-z))
    error = y_hat - y_i

    for j, v in x_i.items():
        dw[j] += error * v

    db = error

    # L2 regularization term
    dw += 2 * lam * w

    return dw, db

In [None]:
def sgd(w, b, X, y, lam, learning_rate, max_epochs=10, print_every=1, shuffle=True):
    objvals = []
    n = len(X)

    for epoch in range(max_epochs):
        indices = np.arange(n)
        if shuffle:
            np.random.shuffle(indices)

        for i in indices:
            # single-sample gradient
            dw, db = gradient_sgd(w, b, X[i], y[i], lam)
            w = w - learning_rate * dw
            b = b - learning_rate * db

        # track loss once per epoch
        obj = objective(w, b, X, y, lam)
        objvals.append(obj)

        if epoch % print_every == 0 or epoch == max_epochs - 1:
            print(f"Epoch {epoch:4d} | Loss = {obj:.6f}")

        if not np.isfinite(obj):
            print(f"Stopped early at epoch {epoch}")
            break

    return w, b, objvals

In [None]:
def gradient_minibatch(w, b, X_batch, y_batch, lam):
    batch_size = len(X_batch)
    d = len(w)

    grad_w = np.zeros(d)
    grad_b = 0.0

    for x_i, y_i in zip(X_batch, y_batch):
        z = dot_sparse(w, x_i) + b
        y_hat = 1.0 / (1.0 + np.exp(-z))
        error = y_hat - y_i

        for j, v in x_i.items():
            grad_w[j] += error * v

        grad_b += error

    # average over batch + L2
    dw = (1.0 / batch_size) * grad_w + 2 * lam * w
    db = (1.0 / batch_size) * grad_b
    return dw, db

In [None]:
def gradient_descent_minibatch(
    w, b, X, y, lam, learning_rate,
    max_epochs=100, batch_size=32,
    shuffle=True, print_every=50
):
    objvals = []
    n = len(X)

    for epoch in range(max_epochs):
        # shuffle indices each epoch
        indices = np.arange(n)
        if shuffle:
            np.random.shuffle(indices)

        # iterate over mini-batches
        for start in range(0, n, batch_size):
            batch_idx = indices[start:start + batch_size]

            X_batch = [X[i] for i in batch_idx]
            y_batch = y[batch_idx] 

            dw, db = gradient_minibatch(w, b, X_batch, y_batch, lam)

            w = w - learning_rate * dw
            b = b - learning_rate * db

        # track full training loss once per epoch
        obj = objective(w, b, X, y, lam)
        objvals.append(obj)

        if epoch % print_every == 0 or epoch == max_epochs - 1:
            print(f"Epoch {epoch:4d} | Loss = {obj:.6f}")

        if not np.isfinite(obj):
            print(f"Stopped early at epoch {epoch}")
            break

    return w, b, objvals
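
The three training loops differ mainly in how many parameter updates they make per pass over the data. A small sketch of that count is given below, using the training-set size from the distribution table (2898 ham + 445 spam); updates_per_epoch is a hypothetical helper for illustration.

```python
import math

def updates_per_epoch(n, batch_size):
    # One parameter update per batch; the last batch may be smaller
    return math.ceil(n / batch_size)

n_train = 2898 + 445  # training messages from the distribution table

print("full batch :", updates_per_epoch(n_train, n_train))
print("mini-batch :", updates_per_epoch(n_train, 32))
print("SGD        :", updates_per_epoch(n_train, 1))
```

Full-batch gradient descent makes a single, low-noise update per epoch, SGD makes one noisy update per message, and a batch size of 32 sits in between. This is one reason the default max_epochs values of the three functions are not directly comparable.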

Set up k-fold cross validation to choose the regularization strength lambda:
    -Key point here is to fit the TF-IDF vectorizer on the training portion only and to transform the held-out portion with that same vectorizer
    -The held-out data is used only to compare candidate values of lambda, with metrics averaged across folds

In [15]:
def k_folds(n, k=5, shuffle=True, seed=42):
    idx = np.arange(n)
    if shuffle:
        rng = np.random.RandomState(seed)
        rng.shuffle(idx)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
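
One way the fold generator could drive the choice of lambda is sketched below. Everything in it is an assumption made for illustration: simple_folds is a pure-Python stand-in for k_folds, select_lambda is a hypothetical driver, and fit_and_score is a hypothetical callback that, in this notebook, would fit TF-IDF and the classifier on the training indices only and return a validation metric. The dummy scorer exists only so that the sketch runs on its own.

```python
def simple_folds(n, k):
    # Hypothetical stand-in for k_folds: contiguous folds, no shuffling
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        val_idx = list(range(bounds[i], bounds[i + 1]))
        train_idx = [j for j in range(n) if j < bounds[i] or j >= bounds[i + 1]]
        yield train_idx, val_idx

def select_lambda(n, lambdas, fit_and_score, k=5):
    # fit_and_score(train_idx, val_idx, lam) -> validation score, higher is better
    mean_scores = {}
    for lam in lambdas:
        scores = [fit_and_score(tr, va, lam) for tr, va in simple_folds(n, k)]
        mean_scores[lam] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Dummy scorer for demonstration only: peaks at lam = 0.01 and ignores the data
def dummy_score(train_idx, val_idx, lam):
    return 1.0 - abs(lam - 0.01)

best_lam, scores = select_lambda(20, [0.001, 0.01, 0.1], dummy_score, k=5)
print(best_lam)
```

Averaging the score over all folds before comparing values of lambda keeps the choice from depending on a single lucky or unlucky split.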

In [16]:
def cross_validate_logreg(w, b, X_train_tfidf, y_train, X_val_tfidf, y_val, learning_rate=0.01, lam=0.01, max_epochs=500, print_every=None):
    train_loss_history = []
    val_loss_history = []
    
    for epoch in range(max_epochs):
        dw, db = gradient(w, b, X_train_tfidf, y_train, lam)

        w = w - learning_rate * dw
        b = b - learning_rate * db

        train_loss = objective(w, b, X_train_tfidf, y_train, lam)
        val_loss = objective(w, b, X_val_tfidf, y_val, lam)

        train_loss_history.append(train_loss)
        val_loss_history.append(val_loss)

        if print_every is not None and (epoch % print_every == 0 or epoch == max_epochs - 1):
            print(f"Epoch {epoch:4d} | Train Loss={train_loss:.6f} | Val Loss={val_loss:.6f}")

        if not np.isfinite(train_loss):
            print(f"Stopped at epoch {epoch}: train loss diverged.")
            break
    return w, b, train_loss_history, val_loss_history

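Now we can initialize the weights and bias to zero and start training. Note that the held-out set passed in this call is the test split, so the Val Loss column in the log below is computed on the test data.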

In [17]:
d =len(vocab)
w0 = np.zeros(d)
b0 = 0.0

w_final, b_final, train_history, val_history = cross_validate_logreg(
    w0,
    b0,
    X_train_tfidf,
    y_train,
    X_test_tfidf,
    y_test,
    lam=0.001,
    learning_rate=0.01,
    max_epochs=1000,
    print_every=25
)

Epoch    0 | Train Loss=0.691713 | Val Loss=0.691734
Epoch   25 | Train Loss=0.658175 | Val Loss=0.658712
Epoch   50 | Train Loss=0.628715 | Val Loss=0.629736
Epoch   75 | Train Loss=0.602813 | Val Loss=0.604288
Epoch  100 | Train Loss=0.580008 | Val Loss=0.581909
Epoch  125 | Train Loss=0.559895 | Val Loss=0.562199
Epoch  150 | Train Loss=0.542124 | Val Loss=0.544806
Epoch  175 | Train Loss=0.526388 | Val Loss=0.529427
Epoch  200 | Train Loss=0.512424 | Val Loss=0.515801
Epoch  225 | Train Loss=0.500004 | Val Loss=0.503700
Epoch  250 | Train Loss=0.488930 | Val Loss=0.492930
Epoch  275 | Train Loss=0.479035 | Val Loss=0.483321
Epoch  300 | Train Loss=0.470171 | Val Loss=0.474730
Epoch  325 | Train Loss=0.462213 | Val Loss=0.467031
Epoch  350 | Train Loss=0.455050 | Val Loss=0.460116
Epoch  375 | Train Loss=0.448590 | Val Loss=0.453892
Epoch  400 | Train Loss=0.442749 | Val Loss=0.448276
Epoch  425 | Train Loss=0.437457 | Val Loss=0.443199
Epoch  450 | Train Loss=0.432652 | Val Loss=0.

We need to make predictions for the test set to evaluate the model later

In [18]:
def make_predictions(X, w, b, threshold=0.5):
    preds = []
    for x in X:
        z = dot_sparse(w, x) + b
        p = 1.0 / (1.0 + np.exp(-z))
        preds.append(1 if p >= threshold else 0)
    return np.array(preds)

In [31]:
y_test_pred = make_predictions(X_test_tfidf, w_final, b_final, threshold=0.2)
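
The threshold passed to make_predictions trades precision against recall: lowering it labels more messages as spam. The sketch below shows that trade-off on made-up probabilities and labels (illustrative values, not outputs of this model); precision_recall is a hypothetical helper. In practice a threshold would normally be chosen on validation data rather than on the test set.

```python
def precision_recall(y_true, probs, threshold):
    # Predict spam (1) when the probability reaches the threshold
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up probabilities and labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.4, 0.15, 0.3, 0.1, 0.05]

for thr in (0.5, 0.2, 0.1):
    p, r = precision_recall(y_true, probs, thr)
    print(f"threshold={thr:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data recall rises from one third to 1.0 as the threshold drops from 0.5 to 0.1, while precision falls from 1.0 to 0.6.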

Evaluation metrics after training:
    -Implement accuracy, precision, recall and F1 score

In [32]:
def evaluate(y_true, y_pred, verbose=True):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    results = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1_score": f1,
            "tp": tp,
            "tn": tn,
            "fp": fp,
            "fn": fn
        }

    if verbose:
        print(f"Accuracy : {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall   : {recall:.4f}")
        print(f"F1-score : {f1:.4f}")

    return results

Evaluate the predictions against the actual labels

In [34]:
evaluate(y_test, y_test_pred)

Accuracy : 0.8942
Precision: 1.0000
Recall   : 0.2338
F1-score : 0.3789


{'accuracy': np.float64(0.8941704035874439),
 'precision': np.float64(1.0),
 'recall': np.float64(0.23376623376623376),
 'f1_score': np.float64(0.3789473684210526),
 'tp': np.int64(36),
 'tn': np.int64(961),
 'fp': np.int64(0),
 'fn': np.int64(118)}
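
The reported metrics can be recomputed by hand from the confusion-matrix counts above, which also gives a simple point of comparison: the accuracy of always predicting ham. The sketch below uses only the four counts printed by evaluate; the majority-class baseline is an added illustration, not something the notebook computes.

```python
# Confusion-matrix counts reported by evaluate() above
tp, tn, fp, fn = 36, 961, 0, 118

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of ham in the test set, i.e. the accuracy of always predicting ham
majority_baseline = (tn + fp) / (tp + tn + fp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"majority-class baseline accuracy={majority_baseline:.4f}")
```

Because ham makes up most of the test set, accuracy alone says little here: the model flags only 36 of the 154 spam messages, so recall and F1 are the more informative numbers.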