Binary Text Classification using Logistic Regression of ham and spam text messages

The workflow that this notebook will follow is as follows:

1. Data Preprocessing: 
    -Load the dataset into sentences and labels
    -Split the dataset into training, validation and testing sets 
    -Report the distribution in the form of a table
    -Clean the data of any noise (urls, punctuation, and numbers) & change to lower case
    -Tokenize input text into tokens, including work stemming and stopwords
    -Build your own TD-IDF feature extractor using the training set
2. Build a logistic regression classifier using using L2 regularization
    -Derive the gradient of the objective function of LR with respect to w and b. 
    -Implement logistic regression via initialization, objective function, and gradient descent
    -Implement accuracy, precision, recall and F1 score as test metrics
    -Write a function for SGD and Mini-batch GD
    -Evaluate the model of the test set and report the metrics 
3. Cross Validation
    -Implement cross validation to choose the best hyperparameter lambda for the validation set
4. Conclusion
    -Analyze the results and compare to baseline
5. Create a multiclass classifier from various authors dataset

In [1]:
import pandas as pd
import string
import math
import numpy as np

Load the dataset, include more information about what this importing section does

In [2]:
spam_df = pd.read_csv('a1-data/SMSSpamCollection', sep='\t', header=None, names=['label', 'text'])

spam_df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
spam_df.shape

(5572, 2)

Objective of the split_dataset function: 
    -Split the dataset ino training, validation and test sets
    -Return each set split into features and labels (this enables easier tokenization later)
    -This also allows for reproducubility as the split will not be exactly the same each time meaning the if the structure of the model is effecive it should learn at the same rate regardless of the data split

In [4]:
def split_dataset(df, train_size, val_size, test_size):
    df = df.sample(frac=1,random_state=42).reset_index(drop=True)
    n = (len(df))
    train_end = int(train_size * n)
    val_end = train_end + int(val_size * n)

    train_df = df.iloc[:train_end]
    val_df = df.iloc[train_end:val_end]
    test_df = df.iloc[val_end:n]

    X_train, y_train = train_df[['text']], train_df['label']
    X_val, y_val = val_df[['text']], val_df['label']
    X_test, y_test = test_df[['text']], test_df['label']

    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = split_dataset(spam_df, 0.6, 0.2, 0.2)

Output a table showing the number of samples in each class for the training, validation and test sets.

In [5]:
def data_distribution(y_train, y_val, y_test):
    df =pd.DataFrame({'Train': y_train.value_counts(), 'Val': y_val.value_counts(), 'Test': y_test.value_counts()}).fillna(0).astype(int)
    return df

data_distribution(y_train, y_val, y_test)

Unnamed: 0_level_0,Train,Val,Test
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ham,2898,966,961
spam,445,148,154


Objective of the clean_data function:
    -Remove punctuation, urls and numbers
    -Change text to lowercase

In [6]:
def clean_text(X):
    X = X.str.lower()
    X = X.str.translate(str.maketrans("", "", string.punctuation))
    X = X.str.replace("http\\S+", "", regex=True)
    X = X.str.replace("https\\S+", "", regex=True)
    X = X.str.replace("\\d+", "", regex=True)
    return X

# Convert spam and ham to 0 and 1s for classification
map = {"spam": 1, "ham": 0}
y_train = np.array([map[v] for v in y_train], dtype=float)
y_val = np.array([map[v] for v in y_val], dtype=float)
y_test = np.array([map[v] for v in y_test], dtype=float)

Tokenize the dataset:
    -Remove whitespace between words
    -Including word stems
    -Removing stop words (removing common words that do not add any semantic value)

In [7]:
STOPWORDS = {
    "a", "an", "the", "and", "or", "but",
    "is", "are", "was", "were", "be",
    "to", "of", "in", "on", "for", "with",
    "that", "this", "it", "as", "at"
}

def tokenize_text(X):
    return X.apply(lambda x: x.split())

def remove_stopwords(tokens, stopwords = STOPWORDS):
    return tokens.apply(lambda x: [t for t in x if t not in stopwords])

def stem_token(token):
    suffixes = ["ing", "ly", "ed", "s", "es", "est"]
    for suf in suffixes:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token

def stem_tokens(tokens):
    return tokens.apply(lambda x: [stem_token(t) for t in x])

In [8]:
def preprocess_text(X):
    X  = X.copy()
    X = clean_text(X)
    X = tokenize_text(X)
    X = remove_stopwords(X)
    X = stem_tokens(X)
    return X

X_train['text'] = preprocess_text(X_train['text'])
X_val['text'] = preprocess_text(X_val['text'])
X_test['text'] = preprocess_text(X_test['text'])
X_train.head()

Unnamed: 0,text
0,"[squeeeeeze, christma, hug, if, u, lik, my, fr..."
1,"[also, ive, sorta, blown, him, off, couple, ti..."
2,"[mmm, that, better, now, i, got, roast, down, ..."
3,"[mm, have, some, kanji, dont, eat, anyth, heav..."
4,"[so, there, ring, come, guy, costume, its, the..."


Build a TD-IDF vecotrizorizer from stratch:
    -Goal is to create a function that would take in the list of words we have and return a matrix of importance of each word
    -TF = Term Frequency: The more the word appears in the document, the higher the TF
    -IDF = Inverse Document Frequency: The less the word appears in the corpus, the higher the IDF
What the functions do: 
    -fit_tfidf:
        -Treats each row as a new "document and counts the total number of documents
        -Builds the document frequency required for IDF by converting the token list to a set so each word is counted at most once per document
        -Update the voacb to say that the words appears in a specific document
        -Building a vocabulary mapping and sorts the indices alphabetically
        -Computes IDF for each word in the documents
    -transform_tfidf:
        -Counts term occurrences within that document
        -Computes TF per term, typically count / len(doc)
        -Computes TF-IDF per term
        -Stores the result as a sparse vector 
        -Returns a list of vectors per document


In [9]:
def fit_tfidf(text):
    # Document frequency: df[word] = number of docs containing word
    df = {}
    N = 0

    for doc in text:
        N += 1
        seen = set(doc)
        for w in seen:
            df[w] = df.get(w, 0) + 1

    # Build vocab (deterministic order: alphabetical)
    vocab_words = sorted(df.keys())
    vocab = {w: i for i, w in enumerate(vocab_words)}

    idf = [0.0] * len(vocab_words)
    for w, i in vocab.items():
        idf[i] = math.log((1.0 + N) / (1.0 + df[w])) + 1.0

    return vocab, idf


def transform_tfidf(text, vocab, idf, normalize=True):
    vectors = []

    for doc in text:
        # term counts
        counts = {}
        for w in doc:
            if w in vocab:
                idx = vocab[w]
                counts[idx] = counts.get(idx, 0) + 1

        # compute TF-IDF (using TF = count / len(doc))
        doc_len = len(doc) if len(doc) > 0 else 1
        vec = {}
        for idx, c in counts.items():
            tf = c / doc_len
            vec[idx] = tf * idf[idx]

        '''
        # optional L2 normalization
        if normalize and vec:
            import math
            norm = math.sqrt(sum(v * v for v in vec.values()))
            if norm > 0:
                for idx in list(vec.keys()):
                    vec[idx] = vec[idx] / norm
        '''
        vectors.append(vec)

    return vectors

In [10]:
train_docs = X_train['text'].tolist()
vocab, idf = fit_tfidf(train_docs)
X_train_tfidf = transform_tfidf(train_docs, vocab, idf)

X_train_tfidf[:3]

[{4589: 0.29040606184981543,
  873: 0.24260280801809503,
  2255: 0.5330088698679104,
  2305: 0.26567122657239356,
  5139: 0.6067940829285435,
  2731: 0.25880982971622385,
  3174: 0.10983909761429993,
  1809: 0.29040606184981543,
  1214: 0.22092664734835799,
  2952: 0.10849018669721469,
  373: 0.15991676412849606,
  1889: 0.12648915870821661,
  3907: 0.48333845355664923,
  1113: 0.25880982971622385,
  2847: 0.29040606184981543,
  4481: 0.12203358176063021,
  2836: 0.22330571188239082,
  3295: 0.2764245063978097,
  3596: 0.1970250204324978,
  2081: 0.25252287948194957},
 {159: 0.280564912646649,
  2444: 0.2842816267105413,
  4527: 0.4432513575602446,
  544: 0.4432513575602446,
  2158: 0.5233461313487431,
  3378: 0.2896844769242299,
  1042: 0.35854409901108136,
  4968: 0.22436277422552497,
  3978: 0.377316464481541,
  4481: 0.18626178268727767,
  2294: 0.3276079587530751,
  3938: 0.3854296581566598,
  3311: 0.19105926530073353,
  4877: 0.22775953849592345,
  3487: 0.20881202617848102,
  5

We only transform the validation and test sets using the transform method to not introduce any data leakage

In [11]:
val_doc = X_val['text'].tolist()
X_val_tfidf = transform_tfidf(val_doc, vocab, idf)

test_doc = X_test['text'].tolist()
X_test_tfidf = transform_tfidf(test_doc, vocab, idf)

Next we need to create a logistic regression model with L2 normalization:
    -Import Equation here

In [12]:
def dot_sparse(w, x):
    s = 0
    for j, v in x.items():
        s += w[j]*v
    return s
    
def objective(w, b, X, y, lam):
    n = len(X)
    eps = 1e-15
    loss=0
    for x_i, y_i in zip(X,y):
        # Linear Transformation of the objective function
        z = dot_sparse(w, x_i) + b
        # Pass the linear transformation through the sigmoid activation function
        y_hat = 1 / (1 + np.exp(-z))
        # Binary Cross Entropy Loss
        # This ensures we do not have any log(0) errors
        y_hat = np.clip(y_hat, eps, 1 - eps)
        loss += -(y_i * np.log(y_hat) + (1-y_i) * np.log(1-y_hat))
        
    loss = loss / n
    # L2 regularization
    reg = lam * np.sum(w ** 2)

    return loss + reg


Generate a function for the gradient and gradient descent

In [13]:
def gradient(w, b, X, y, lam):
    n = len(X)
    d = len(w)
    dw = np.zeros(d)
    db = 0.0

    for x_i, y_i in zip(X, y):
    # Get predictions
        z = dot_sparse(w, x_i) + b
        y_hat = 1 / (1 + np.exp(-z))
        # Error term
        error = y_hat - y_i

        for j, v, in x_i.items():
            dw[j] += error * v
        db += error
    dw = (1.0 / n) * dw + 2 * lam * w
    db = (1.0 / n) * db
    return dw, db

In [14]:
def gradient_descent(w, b, X, y, lam, learning_rate, max_epochs=100, print_every=50):
    objvals = []

    for epoch in range(max_epochs):
        # Compute gradients for both the weight and bias
        dw, db = gradient(w, b, X, y, lam)

        # Update parameters using the learning rate
        w = w - learning_rate * dw
        b = b - learning_rate * db
        # Update the objective value
        obj = objective(w, b, X, y, lam)
        objvals.append(obj)
        # Update progress during training
        if epoch % print_every == 0 or epoch == max_epochs - 1:
            print(f"Epoch {epoch:4d} | Loss = {obj:.6f}")
        # If gradient starts to diverge we should stop early
        if not np.isfinite(obj):
            print('Stopped early at {epoch}')
            break
    return w, b, objvals

A secondary objective is to implement a stochastic gradient descent and mini-batch gradient descent function for this model
This requires a new gradient function and training function for each method

In [15]:
def gradient_sgd(w, b, x_i, y_i, lam):
    d = len(w)
    dw = np.zeros(d)

    z = dot_sparse(w, x_i) + b
    y_hat = 1 / (1 + np.exp(-z))
    error = y_hat - y_i
    
    for j, v in x_i.items():
        dw[j] += error * v
    
    db = error

    dw += 2 * lam * w
    
    return dw, db

In [16]:
def sgd(w, b, X, y, lam, learning_rate, max_epochs=10, print_every=100, shuffle=True):
    objvals = []
    n = len(X)

    for epoch in range(max_epochs):
        indices = np.arange(n)
        if shuffle:
            np.random.shuffle(indices)

        for i in indices:
            # single-sample gradient
            dw, db = gradient_sgd(w, b, X[i], y[i], lam)
            w = w - learning_rate * dw
            b = b - learning_rate * db

        # track loss once per epoch
        obj = objective(w, b, X, y, lam)
        objvals.append(obj)

        if epoch % print_every == 0 or epoch == max_epochs - 1:
            print(f"Epoch {epoch:4d} | Loss = {obj:.6f}")

        if not np.isfinite(obj):
            print(f"Stopped early at epoch {epoch}")
            break

    return w, b, objvals

In [17]:
def gradient_minibatch(w, b, X_batch, y_batch, lam):
    batch_size = len(X_batch)
    d = len(w)

    grad_w = np.zeros(d)
    grad_b = 0.0

    for x_i, y_i in zip(X_batch, y_batch):
        z = dot_sparse(w, x_i) + b
        y_hat = 1.0 / (1.0 + np.exp(-z))
        error = y_hat - y_i

        for j, v in x_i.items():
            grad_w[j] += error * v

        grad_b += error

    # average over batch + L2
    dw = (1.0 / batch_size) * grad_w + 2 * lam * w
    db = (1.0 / batch_size) * grad_b
    return dw, db

In [18]:
def gradient_descent_minibatch(
    w, b, X, y, lam, learning_rate,
    max_epochs=100, batch_size=32,
    shuffle=True, print_every=50
):
    objvals = []
    n = len(X)

    for epoch in range(max_epochs):
        # shuffle indices each epoch
        indices = np.arange(n)
        if shuffle:
            np.random.shuffle(indices)

        # iterate over mini-batches
        for start in range(0, n, batch_size):
            batch_idx = indices[start:start + batch_size]

            X_batch = [X[i] for i in batch_idx]
            y_batch = y[batch_idx] 

            dw, db = gradient_minibatch(w, b, X_batch, y_batch, lam)

            w = w - learning_rate * dw
            b = b - learning_rate * db

        # track full training loss once per epoch
        obj = objective(w, b, X, y, lam)
        objvals.append(obj)

        if epoch % print_every == 0 or epoch == max_epochs - 1:
            print(f"Epoch {epoch:4d} | Loss = {obj:.6f}")

        if not np.isfinite(obj):
            print(f"Stopped early at epoch {epoch}")
            break

    return w, b, objvals

We need to make predictions for the test set to evaluate the model later

In [19]:
def make_predictions(X, w, b, threshold=0.5):
    preds = []
    for x in X:
        z = dot_sparse(w, x) + b
        p = 1.0 / (1.0 + np.exp(-z))
        preds.append(1 if p >= threshold else 0)
    return np.array(preds)

Evaluation metrics after training:
    -Implement accuracy, precision, recall and F1 score

In [20]:
def evaluate(y_true, y_pred, verbose=True):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    results = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1_score": f1,
            "tp": tp,
            "tn": tn,
            "fp": fp,
            "fn": fn
        }

    if verbose:
        print(f"Accuracy : {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall   : {recall:.4f}")
        print(f"F1-score : {f1:.4f}")

    return results

Create a cross fold using the validation set created earlier to update lambda function:
    -Key point here is to fit and transform the TF-IDF on training data only and transform the validation data on using the same TF-IDF vectorizer
    -We only want the validation data to be used to update the lambda function and then average metrics across folds
    -This functiion will have 3 variations in the final code; one for each gradient descent method

In [21]:
def validation_gd(X_train, y_train, X_val, y_val, lambdas, lr, epochs):
    results = {}
    
    # 1. Determine weight vector size (d) from training data
    # (Assuming X_train is a list of TF-IDF dicts)
    all_indices = [idx for doc in X_train for idx in doc.keys()]
    d = max(all_indices) + 1 if all_indices else 0
    
    print(f"Starting Validation on {len(lambdas)} candidates...")

    for lam in lambdas:
        # 2. ALWAYS reset weights for each new lambda
        w = np.zeros(d)
        b = 0.0
        
        # 3. Train on the fixed Training Set
        # Note: 'history' contains the loss per epoch if you want to plot it
        w_final, b_final, history = gradient_descent(
            w, b, X_train, y_train, lam, lr, epochs, print_every=100
        )
        
        # 4. Evaluate on the 'Held-out' Validation Set
        # Use the objective function (Loss) to see how well it generalizes
        val_loss = objective(w_final, b_final, X_val, y_val, lam)
        
        results[lam] = val_loss
        print(f"Lambda: {lam:6.3f} | Validation Loss: {val_loss:.6f}")
        
    return results

In [22]:
lambdas_to_test = [0.001, 0.01, 0.1, 1, 10]
cv_results = validation_gd(X_train_tfidf, y_train, X_test_tfidf, y_test, lambdas=lambdas_to_test, lr=0.01, epochs = 500)
best_lambda =  min(cv_results, key=cv_results.get)
print(best_lambda)

Starting Validation on 5 candidates...
Epoch    0 | Loss = 0.691713
Epoch  100 | Loss = 0.580008
Epoch  200 | Loss = 0.512424
Epoch  300 | Loss = 0.470171
Epoch  400 | Loss = 0.442749
Epoch  499 | Loss = 0.424443
Lambda:  0.001 | Validation Loss: 0.430770
Epoch    0 | Loss = 0.691713
Epoch  100 | Loss = 0.580132
Epoch  200 | Loss = 0.512812
Epoch  300 | Loss = 0.470879
Epoch  400 | Loss = 0.443802
Epoch  499 | Loss = 0.425853
Lambda:  0.010 | Validation Loss: 0.432135
Epoch    0 | Loss = 0.691713
Epoch  100 | Loss = 0.581220
Epoch  200 | Loss = 0.515754
Epoch  300 | Loss = 0.475534
Epoch  400 | Loss = 0.449834
Epoch  499 | Loss = 0.432942
Lambda:  0.100 | Validation Loss: 0.438900
Epoch    0 | Loss = 0.691714
Epoch  100 | Loss = 0.584986
Epoch  200 | Loss = 0.520735
Epoch  300 | Loss = 0.480118
Epoch  400 | Loss = 0.453675
Epoch  499 | Loss = 0.436131
Lambda:  1.000 | Validation Loss: 0.441601
Epoch    0 | Loss = 0.691722
Epoch  100 | Loss = 0.586089
Epoch  200 | Loss = 0.521263
Epoch 

In [23]:
def validation_sgd(X_train, y_train, X_val, y_val, lambdas, lr, epochs):
    results = {}
    
    # 1. Determine weight vector size (d) from training data
    # (Assuming X_train is a list of TF-IDF dicts)
    all_indices = [idx for doc in X_train for idx in doc.keys()]
    d = max(all_indices) + 1 if all_indices else 0
    
    print(f"Starting Validation on {len(lambdas)} candidates...")

    for lam in lambdas:
        # 2. ALWAYS reset weights for each new lambda
        w = np.zeros(d)
        b = 0.0
        
        # 3. Train on the fixed Training Set
        # Note: 'history' contains the loss per epoch if you want to plot it
        w_final, b_final, history = sgd(
            w, b, X_train, y_train, lam, lr, epochs, print_every=100
        )
        
        # 4. Evaluate on the Validation Set
        # Use the objective function (Loss) to see how well it generalizes
        val_loss = objective(w_final, b_final, X_val, y_val, lam)
        
        results[lam] = val_loss
        print(f"Lambda: {lam:6.3f} | Validation Loss: {val_loss:.6f}")
        
    return results

In [24]:
cv_results = validation_sgd(X_train_tfidf, y_train, X_val_tfidf, y_val, lambdas=lambdas_to_test, lr=0.01, epochs = 500)
best_lambda =  min(cv_results, key=cv_results.get)
print(best_lambda)

Starting Validation on 5 candidates...
Epoch    0 | Loss = 0.352875
Epoch  100 | Loss = 0.243251
Epoch  200 | Loss = 0.243203
Epoch  300 | Loss = 0.243213
Epoch  400 | Loss = 0.243354
Epoch  499 | Loss = 0.243199
Lambda:  0.001 | Validation Loss: 0.271996
Epoch    0 | Loss = 0.369421
Epoch  100 | Loss = 0.363053
Epoch  200 | Loss = 0.364274
Epoch  300 | Loss = 0.363041
Epoch  400 | Loss = 0.363039
Epoch  499 | Loss = 0.363048
Lambda:  0.010 | Validation Loss: 0.369985
Epoch    0 | Loss = 0.389631
Epoch  100 | Loss = 0.389577
Epoch  200 | Loss = 0.389782
Epoch  300 | Loss = 0.390614
Epoch  400 | Loss = 0.389608
Epoch  499 | Loss = 0.389569
Lambda:  0.100 | Validation Loss: 0.389956
Epoch    0 | Loss = 0.392638
Epoch  100 | Loss = 0.392572
Epoch  200 | Loss = 0.392647
Epoch  300 | Loss = 0.392637
Epoch  400 | Loss = 0.394034
Epoch  499 | Loss = 0.392828
Lambda:  1.000 | Validation Loss: 0.392381
Epoch    0 | Loss = 0.392724
Epoch  100 | Loss = 0.392512
Epoch  200 | Loss = 0.392884
Epoch 

In [25]:
def validation_minibatch(X_train, y_train, X_val, y_val, lambdas, lr, epochs):
    results = {}
    
    # 1. Determine weight vector size (d) from training data
    # (Assuming X_train is a list of TF-IDF dicts)
    all_indices = [idx for doc in X_train for idx in doc.keys()]
    d = max(all_indices) + 1 if all_indices else 0
    
    print(f"Starting Validation on {len(lambdas)} candidates...")

    for lam in lambdas:
        # 2. ALWAYS reset weights for each new lambda
        w = np.zeros(d)
        b = 0.0
        
        # 3. Train on the fixed Training Set
        # Note: 'history' contains the loss per epoch if you want to plot it
        w_final, b_final, history = gradient_descent_minibatch(
            w, b, X_train, y_train, lam, lr, epochs, print_every=100
        )
        
        # 4. Evaluate on the 'Validation Set
        # Use the objective function (Loss) to see how well it generalizes
        val_loss = objective(w_final, b_final, X_val, y_val, lam)
        
        results[lam] = val_loss
        print(f"Lambda: {lam:6.3f} | Validation Loss: {val_loss:.6f}")
        
    return results

In [26]:
cv_results = validation_minibatch(X_train_tfidf, y_train, X_val_tfidf, y_val, lambdas=lambdas_to_test, lr=0.01, epochs =500)
best_lambda =  min(cv_results, key=cv_results.get)
print(best_lambda)

Starting Validation on 5 candidates...
Epoch    0 | Loss = 0.576502
Epoch  100 | Loss = 0.300758
Epoch  200 | Loss = 0.266621
Epoch  300 | Loss = 0.253238
Epoch  400 | Loss = 0.247724
Epoch  499 | Loss = 0.245343
Lambda:  0.001 | Validation Loss: 0.270833
Epoch    0 | Loss = 0.576231
Epoch  100 | Loss = 0.363267
Epoch  200 | Loss = 0.363008
Epoch  300 | Loss = 0.363005
Epoch  400 | Loss = 0.363005
Epoch  499 | Loss = 0.363005
Lambda:  0.010 | Validation Loss: 0.369845
Epoch    0 | Loss = 0.577953
Epoch  100 | Loss = 0.389058
Epoch  200 | Loss = 0.389058
Epoch  300 | Loss = 0.389057
Epoch  400 | Loss = 0.389058
Epoch  499 | Loss = 0.389058
Lambda:  0.100 | Validation Loss: 0.389405
Epoch    0 | Loss = 0.581901
Epoch  100 | Loss = 0.391947
Epoch  200 | Loss = 0.391947
Epoch  300 | Loss = 0.391947
Epoch  400 | Loss = 0.391946
Epoch  499 | Loss = 0.391949
Lambda:  1.000 | Validation Loss: 0.391550
Epoch    0 | Loss = 0.582723
Epoch  100 | Loss = 0.392276
Epoch  200 | Loss = 0.392279
Epoch 

Retrain each model only on the training set with the fine tuned lambda

In [29]:
all_indices = [idx for doc in X_train_tfidf for idx in doc.keys()]
d = max(all_indices) + 1 if all_indices else 0

w = np.zeros(d)
b = 0.0

In [32]:
w_gd, b_gd, obj_vals_gd = gradient_descent(w, b, X_train_tfidf, y_train, lam=0.001, learning_rate=0.01, max_epochs=1000, print_every=50)

Epoch    0 | Loss = 0.691713
Epoch   50 | Loss = 0.628715
Epoch  100 | Loss = 0.580008
Epoch  150 | Loss = 0.542124
Epoch  200 | Loss = 0.512424
Epoch  250 | Loss = 0.488930
Epoch  300 | Loss = 0.470171
Epoch  350 | Loss = 0.455050
Epoch  400 | Loss = 0.442749
Epoch  450 | Loss = 0.432652
Epoch  500 | Loss = 0.424291
Epoch  550 | Loss = 0.417309
Epoch  600 | Loss = 0.411433
Epoch  650 | Loss = 0.406448
Epoch  700 | Loss = 0.402186
Epoch  750 | Loss = 0.398516
Epoch  800 | Loss = 0.395332
Epoch  850 | Loss = 0.392550
Epoch  900 | Loss = 0.390102
Epoch  950 | Loss = 0.387932
Epoch  999 | Loss = 0.386033


In [33]:
w_sgd, b_sgd, obj_vals_sgd = sgd(w, b, X_train_tfidf, y_train, lam=0.001, learning_rate=0.01, max_epochs=1000, print_every=50, shuffle=True)

Epoch    0 | Loss = 0.352426
Epoch   50 | Loss = 0.243205
Epoch  100 | Loss = 0.243200
Epoch  150 | Loss = 0.243324
Epoch  200 | Loss = 0.243207
Epoch  250 | Loss = 0.243195
Epoch  300 | Loss = 0.243330
Epoch  350 | Loss = 0.243203
Epoch  400 | Loss = 0.243204
Epoch  450 | Loss = 0.243216
Epoch  500 | Loss = 0.243219
Epoch  550 | Loss = 0.243218
Epoch  600 | Loss = 0.243245
Epoch  650 | Loss = 0.243277
Epoch  700 | Loss = 0.243259
Epoch  750 | Loss = 0.243229
Epoch  800 | Loss = 0.243203
Epoch  850 | Loss = 0.243196
Epoch  900 | Loss = 0.243210
Epoch  950 | Loss = 0.243287
Epoch  999 | Loss = 0.243198


In [34]:
w_mgd, b_mgd, obj_vals_mgd = gradient_descent_minibatch(w, b, X_train_tfidf, y_train, lam=0.001, learning_rate=0.01, max_epochs=1000, batch_size=32, shuffle=True, print_every=50)

Epoch    0 | Loss = 0.576424
Epoch   50 | Loss = 0.334676
Epoch  100 | Loss = 0.300768
Epoch  150 | Loss = 0.279705
Epoch  200 | Loss = 0.266631
Epoch  250 | Loss = 0.258441
Epoch  300 | Loss = 0.253240
Epoch  350 | Loss = 0.249904
Epoch  400 | Loss = 0.247727
Epoch  450 | Loss = 0.246290
Epoch  500 | Loss = 0.245334
Epoch  550 | Loss = 0.244686
Epoch  600 | Loss = 0.244246
Epoch  650 | Loss = 0.243942
Epoch  700 | Loss = 0.243731
Epoch  750 | Loss = 0.243583
Epoch  800 | Loss = 0.243478
Epoch  850 | Loss = 0.243403
Epoch  900 | Loss = 0.243349
Epoch  950 | Loss = 0.243309
Epoch  999 | Loss = 0.243281


Make prediction for each fine tuned model

In [35]:
# Predictions from gradient descent training
preds_gd = make_predictions(X_test_tfidf, w_gd, b_gd)

In [36]:
# Predictions from stochastic gradient descent training
preds_sgd = make_predictions(X_test_tfidf, w_sgd, b_sgd)

In [37]:
# Predictions from mini-batch gradient descent training
preds_mgd = make_predictions(X_test_tfidf, w_mgd, b_mgd)

Evaluate the predictions vs the actual labels

In [38]:
evaluate(y_test, preds_gd)

Accuracy : 0.8619
Precision: 0.0000
Recall   : 0.0000
F1-score : 0.0000


{'accuracy': np.float64(0.8618834080717489),
 'precision': 0.0,
 'recall': np.float64(0.0),
 'f1_score': 0.0,
 'tp': np.int64(0),
 'tn': np.int64(961),
 'fp': np.int64(0),
 'fn': np.int64(154)}

In [39]:
evaluate(y_test, preds_sgd)

Accuracy : 0.9067
Precision: 0.9808
Recall   : 0.3312
F1-score : 0.4951


{'accuracy': np.float64(0.9067264573991032),
 'precision': np.float64(0.9807692307692307),
 'recall': np.float64(0.33116883116883117),
 'f1_score': np.float64(0.4951456310679612),
 'tp': np.int64(51),
 'tn': np.int64(960),
 'fp': np.int64(1),
 'fn': np.int64(103)}

In [40]:
evaluate(y_test, preds_mgd)

Accuracy : 0.9067
Precision: 0.9808
Recall   : 0.3312
F1-score : 0.4951


{'accuracy': np.float64(0.9067264573991032),
 'precision': np.float64(0.9807692307692307),
 'recall': np.float64(0.33116883116883117),
 'f1_score': np.float64(0.4951456310679612),
 'tp': np.int64(51),
 'tn': np.int64(960),
 'fp': np.int64(1),
 'fn': np.int64(103)}

Multiclass Classification Logistic Regreesion

In [41]:
books_df = pd.read_csv('a1-data/books.txt', sep='\t', header=None, names=['label', 'text'])

books_df.head()

Unnamed: 0,label,text
0,Jane Austen,﻿PERSUASION
1,Jane Austen,by Jane Austen
2,Jane Austen,(1818)
3,Jane Austen,Chapter 1
4,Jane Austen,"Sir Walter Elliot, of Kellynch Hall, in Somers..."


Steps:
    -Split data into train, validation and test 
    -Use the preprocess function to clean all text
    -Map the labels for multiclass classification
    -Use tf-idf function to train on the new information and create new vectors
    -Use transform_idf on all three dataset
    -Rebuild objective function for categorical cross entropy loss (new objective function)
    -Update gradient function to include softmax in place of sigmoid as the activation function
    -Make predictions 
    -Evaluate models

In [None]:
X_train, X_val, X_test, y_train, y_val, y_train = split_dataset(books_df, 0.6, 0.2, 0.2)


X_train['text'] = preprocess_text(X_train['text'])
X_val['text'] = preprocess_text(X_val['text'])
X_test['text'] = preprocess_text(X_test['text'])

In [60]:
map = {"Arthur Conan Doyle": 2, "Jane Austen": 1, "Fyodor Dostoyevsky": 0}
y_train = np.array([map[v] for v in y_train], dtype=float)
y_val = np.array([map[v] for v in y_val], dtype=float)
y_test = np.array([map[v] for v in y_test], dtype=float)

In [61]:
train_docs = X_train['text'].tolist()
vocab, idf = fit_tfidf(train_docs)
X_train_tfidf = transform_tfidf(train_docs, vocab, idf)

In [62]:
val_doc = X_val['text'].tolist()
X_val_tfidf = transform_tfidf(val_doc, vocab, idf)

test_doc = X_test['text'].tolist()
X_test_tfidf = transform_tfidf(test_doc, vocab, idf)

We need to create a softmax activation function

In [63]:
def softmax(z):
    # Subtracting max(z) is a common trick to prevent numerical overflow (Exp getting too big)
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

In [64]:
def objective_cce(w, b, X, y, lam):
    """
    y: One-hot encoded labels (n_samples, n_classes)
    W: Weight matrix (n_features, n_classes)
    """
    n = len(X)
    n_classes = w.shape[1]
    eps = 1e-15
    total_loss = 0
    
    for i in range(n):
        # 1. Linear pass for ALL classes
        # z will be a vector of length n_classes
        z = np.zeros(n_classes)
        for j, v in X[i].items():
            z += w[j, :] * v
        z += b
        
        # 2. Get probability distribution
        y_hat = softmax(z)
        y_hat = np.clip(y_hat, eps, 1 - eps)
        
        # 3. Add the loss for the true class
        # In CCE, we only sum the log of the predicted prob for the actual class
        total_loss += -np.sum(y[i] * np.log(y_hat))
        
    avg_loss = total_loss / n
    
    # 4. Frobenius Norm for L2 Regularization (sum of squares of all weights)
    reg = lam * np.sum(w**2)
    
    return avg_loss + reg

In [72]:
def gradient_minibatch_cce(w, b, X_batch, y_batch, lam):
    """
    W: Weight matrix (features x classes)
    b: Bias vector (classes)
    X_batch: List of sparse dictionaries
    y_batch: One-hot encoded labels for the batch
    """
    batch_size = len(X_batch)
    d, K = w.shape
    grad_W = np.zeros_like(w)
    grad_b = np.zeros_like(b)

    for i in range(batch_size):
        # 1. Linear Pass (Scores for all classes)
        z = np.zeros(K)
        for j, v in X_batch[i].items():
            z += w[j, :] * v
        z += b
        
        # 2. Softmax for probability distribution
        y_hat = softmax(z) 
        
        # 3. Error (Prediction - Actual)
        error = y_hat - y_batch[i]
        
        # 4. Aggregate gradients
        for j, v in X_batch[i].items():
            grad_w[j, :] += error * v
        grad_b += error

    # 5. Average and Regularize
    # Note: L2 is applied to the whole matrix
    dw = (1.0 / batch_size) * grad_w + 2 * lam * w
    db = (1.0 / batch_size) * grad_b
    
    return dw, db

In [73]:
def train_minibatch_cce(X, y, n_classes, lam, lr, epochs, batch_size=32):
    # 1. Initialize Matrix and Bias
    all_indices = [idx for doc in X for idx in doc.keys()]
    d = max(all_indices) + 1
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    
    n = len(X)
    history = []

    for epoch in range(epochs):
        # Shuffle indices at the start of each epoch
        indices = np.random.permutation(n)
        
        for start in range(0, n, batch_size):
            # Slice the batch
            batch_idx = indices[start : start + batch_size]
            X_batch = [X[i] for i in batch_idx]
            y_batch = y[batch_idx] # y must be one-hot encoded
            
            # Compute Gradient
            dw, db = gradient_minibatch_cce(w, b, X_batch, y_batch, lam)
            
            # Update Parameters
            w -= lr * dW
            b -= lr * db
            
        # Optional: Track loss (Warning: objective_cce is slow for large data)
        # loss = objective_cce(W, b, X, y, lam)
        # history.append(loss)
        
    return w, b, history

In [78]:
cv_results_cce = train_minibatch_cce(X_train_tfidf, y_train, n_classes = 3, lam=0.001, lr=0.01, epochs = 500)
best_lambda =  min(cv_results, key=cv_results.get)
print(best_lambda)

IndexError: index 4015 is out of bounds for axis 0 with size 3850