## Binary Text Classification of Ham and Spam Text Messages using Logistic Regression

Install all dependencies and make sure the data is stored in the same directory as this notebook

In [None]:
pip install pandas numpy

This notebook follows this workflow:

1. Data Preprocessing: 
    <br>Load the dataset into sentences and labels
    <br>Split the dataset into training, validation and testing sets 
    <br>Report the distribution in the form of a table
    <br>Clean the data of any noise (urls, punctuation, and numbers) & change to lower case
    <br>Tokenize input text into tokens, including word stemming and stopword removal
    <br>Build your own TF-IDF feature extractor using the training set
2. Build a logistic regression classifier using L2 regularization
    <br>Derive the gradient of the objective function of LR with respect to w and b. 
    <br>Implement logistic regression via initialization, objective function, and gradient descent
    <br>Implement accuracy, precision, recall and F1 score as test metrics
    <br>Write a function for SGD and Mini-batch GD
    <br>Evaluate the model on the test set and report the metrics
3. Cross Validation
    <br>Implement cross validation to choose the best hyperparameter lambda using the validation set
4. Conclusion
    <br>Analyze the results and compare to baseline
5. Create a multiclass classifier from various authors dataset

In [1]:
import pandas as pd
import numpy as np
import math
import string
import warnings
warnings.filterwarnings('ignore')

Load the dataset: make sure the a1-data folder containing SMSSpamCollection is stored in the same folder as the Jupyter notebook so that this cell runs properly

In [2]:
spam_df = pd.read_csv('a1-data/SMSSpamCollection', sep='\t', header=None, names=['label', 'text'])

spam_df.head()

,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Split the dataset into a training, validation and test set

In [3]:
def split_dataset(df, train_size, val_size, test_size):
    df = df.sample(frac=1,random_state=42).reset_index(drop=True)
    n = (len(df))
    train_end = int(train_size * n)
    val_end = train_end + int(val_size * n)

    train_df = df.iloc[:train_end]
    val_df = df.iloc[train_end:val_end]
    test_df = df.iloc[val_end:n]

    X_train, y_train = train_df[['text']], train_df['label']
    X_val, y_val = val_df[['text']], val_df['label']
    X_test, y_test = test_df[['text']], test_df['label']

    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = split_dataset(spam_df, 0.6, 0.2, 0.2)

Report the data distribution of each class

In [4]:
def data_distribution(y_train, y_val, y_test):
    df =pd.DataFrame({'Train': y_train.value_counts(), 'Val': y_val.value_counts(), 'Test': y_test.value_counts()}).fillna(0).astype(int)
    return df

data_distribution(y_train, y_val, y_test)

label,Train,Val,Test
ham,2898,966,961
spam,445,148,154
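The counts in this table can be turned into class proportions and a majority-class baseline, which is a useful reference point for the accuracy figures reported later. This is a minimal sketch: the counts are copied from the table, and the `counts` dictionary and variable names are only illustrative.

```python
# Class counts copied from the distribution table above
counts = {
    "train": {"ham": 2898, "spam": 445},
    "val": {"ham": 966, "spam": 148},
    "test": {"ham": 961, "spam": 154},
}

spam_share = {}
for split, c in counts.items():
    total = c["ham"] + c["spam"]
    spam_share[split] = c["spam"] / total
    print(f"{split:5s} n={total:5d} spam share={spam_share[split]:.3f}")

# Accuracy of a trivial classifier that always predicts "ham" on the test set
test_total = counts["test"]["ham"] + counts["test"]["spam"]
majority_baseline = counts["test"]["ham"] / test_total
print(f"always-ham test accuracy: {majority_baseline:.4f}")
```

Spam makes up roughly 13 to 14% of every split, so a model has to beat an accuracy of about 0.86 on the test set before its accuracy says anything useful.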


## 1. TextPreprocessor Class
This will be used to clean the text. More specifically, it removes URLs, punctuation and numbers, and changes the text to lowercase
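To make the cleaning steps concrete, here is a small sketch that applies the same four steps to a single string using only the standard library. The function name `clean_message` and the sample message are illustrative assumptions; the class in this notebook works on a pandas Series instead.

```python
import re
import string

def clean_message(text):
    # The four cleaning steps described above, applied to one string
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)                 # URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r'\d+', ' ', text)                                   # digits
    return text

sample = "FREE entry! Visit http://example.com or call 0800 123 now"
tokens = clean_message(sample).split()
print(tokens)
```

The order matters: URLs are removed before punctuation, otherwise stripping the punctuation first would fuse `http://example.com` into a single junk token that the URL pattern no longer matches.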

In [5]:
class TextPreprocessor:
    def __init__(self, stopwords=None):
        self.stopwords = stopwords or {
            "a", "an", "the", "and", "or", "but", "is", "are", 
            "was", "were", "be", "to", "of", "in", "on", "for"
        }

    def clean_text(self, text_series):
        # Converts to lowercase
        text = text_series.str.lower()
        # Removes URLs
        text = text.str.replace(r'https?://\S+|www\.\S+', ' ', regex=True)
        # Removes Punctuation
        text = text.str.translate(str.maketrans("", "", string.punctuation))
        # Removes Digits
        text = text.str.replace(r'\d+', ' ', regex=True)
        return text

    def stem_token(self, token):
        # Reduces words to their root form (e.g. "running" -> "runn")
        suffixes = ["ing", "ly", "ed", "s", "es", "est"]
        for suf in suffixes:
            # This avoids over stemming by checking if the length of the word could result in a real word
            if token.endswith(suf) and len(token) > len(suf) + 2:
                return token[:-len(suf)]
        return token

    def preprocess(self, X):
        # Wraps the other methods into one pipeline that cleans, tokenizes, and stems the text
        # Raw tokens are created by splitting each "document" in the corpus into words
        cleaned = self.clean_text(X)
        
        # Tokenize, filter, and stem in one pass per document
        def process_doc(doc):
            words = doc.split()
            return [self.stem_token(w) for w in words if w not in self.stopwords]
            
        return cleaned.apply(process_doc)

Run the Text preprocessor class to clean the data and output the cleaned version of the data. This will also tokenize the word inputs

In [6]:
# Preprocess all three datasets
preprocessor = TextPreprocessor()
train_processed = preprocessor.preprocess(X_train['text'])
val_processed = preprocessor.preprocess(X_val['text'])
test_processed = preprocessor.preprocess(X_test['text'])

Show the cleaned and tokenized text

In [8]:
train_processed.head()

0    [squeeeeeze, thi, christma, hug, if, u, lik, m...
1    [also, ive, sorta, blown, him, off, couple, ti...
2    [mmm, that, better, now, i, got, roast, down, ...
3    [mm, have, some, kanji, dont, eat, anyth, heav...
4    [so, there, ring, that, come, with, guy, costu...
Name: text, dtype: object
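
Tokens such as `thi` and `christma` in this output come from the simple suffix rule in `stem_token`. The sketch below copies that rule into a standalone function (the free-function form and the sample words are illustrative) to show how it behaves. Because `"s"` is tried before `"es"`, the `"es"` entry is never reached.

```python
def stem_token(token):
    # Same suffix list and length check as TextPreprocessor.stem_token
    suffixes = ["ing", "ly", "ed", "s", "es", "est"]
    for suf in suffixes:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token

for word in ["this", "christmas", "anything", "boxes", "was"]:
    print(f"{word:10s} -> {stem_token(word)}")
```

Short words such as `was` are left alone by the length check, while `boxes` becomes `boxe` rather than `box` because only the trailing `s` is removed.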

Next we need to map the y labels to 0 and 1 so later the values can be used within the logistic regression model

In [9]:
# Convert labels
label_map = {"spam": 1, "ham": 0}
y_train_bin = np.array([label_map[v] for v in y_train], dtype=float)
y_val_bin = np.array([label_map[v] for v in y_val], dtype=float)
y_test_bin = np.array([label_map[v] for v in y_test], dtype=float)

## 2. TfidfVectorizer Class
Build a TF-IDF vectorizer from scratch:
    <br>The goal is a class that takes the token lists and returns, for each document, a dictionary of importance scores per word
    <br>TF = Term Frequency: the more often a word appears in a document, the higher its TF
    <br>IDF = Inverse Document Frequency: the fewer documents in the corpus contain a word, the higher its IDF
What the methods do: 
    <br>fit:
        <br>Treats each row as a new "document" and counts the total number of documents
        <br>Builds the document frequency required for IDF by converting the token list to a set so each word is counted at most once per document
        <br>Updates the document frequency to record that the word appears in that document
        <br>Builds a vocabulary mapping with the indices in alphabetical order
        <br>Computes the IDF for each word in the vocabulary
    <br>transform:
        <br>Counts term occurrences within each document
        <br>Computes TF per term as count / len(doc)
        <br>Computes TF-IDF per term
        <br>Stores the result as a sparse vector 
        <br>Returns a list of vectors, one per document
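
The steps above can be traced by hand on a toy corpus. The sketch below is illustrative only: the three documents are made up, and the IDF uses the smoothed form log((1 + N) / (1 + df)) + 1 that the class in this notebook applies in its `fit` method.

```python
import math

# Made-up, already tokenized documents
docs = [
    ["free", "prize", "now"],
    ["call", "me", "now"],
    ["free", "free", "call"],
]

# Document frequency: number of documents that contain each word
N = len(docs)
df = {}
for doc in docs:
    for w in set(doc):
        df[w] = df.get(w, 0) + 1

# Smoothed inverse document frequency
idf = {w: math.log((1 + N) / (1 + c)) + 1 for w, c in df.items()}

def tfidf(doc):
    # TF is count / document length; words outside the vocabulary are skipped
    scores = {}
    for w in set(doc):
        if w in idf:
            scores[w] = (doc.count(w) / len(doc)) * idf[w]
    return scores

scores = tfidf(docs[2])
for w in sorted(scores):
    print(f"{w:6s} tf-idf={scores[w]:.4f}")
```

In the third document, `free` appears twice and `call` once. Both words occur in two of the three documents, so they share the same IDF and `free` scores exactly twice as high as `call`.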

In [10]:
class TfidfVectorizer:
    def __init__(self):
        # Maps word -> column index (e.g., "python" -> 5)
        self.vocab = {}
        # Stores the importance weight for every word in the vocab
        self.idf = []
        # List of words in the same order as the IDF values
        self.vocab_words = []

    def fit(self, text_list):
        # Build vocab and IDF from training data
        df = {} # Document Frequency: How many documents contain word 'w'
        N = 0   # Total number of documents

        for doc in text_list:
            N += 1
            # Using a set ensures we only count a word once per document
            # for the Document Frequency (DF) calculation
            seen = set(doc)
            for w in seen:
                df[w] = df.get(w, 0) + 1

        # Sort keys to ensure the feature matrix columns are always in the same order
        self.vocab_words = sorted(df.keys())
        # Create a lookup for fast indexing during the transform phase
        self.vocab = {w: i for i, w in enumerate(self.vocab_words)}

        self.idf = [0.0] * len(self.vocab_words)
        for w, i in self.vocab.items():
            # THE IDF FORMULA -> log((1 + N) / (1 + df)) + 1 provides a "smoothed" IDF.
            # This prevents division by zero and ensures words that appear in every document don't get a weight of exactly zero.
            self.idf[i] = math.log((1.0 + N) / (1.0 + df[w])) + 1.0
            
        return self

    def transform(self, text_list):
        #Transform documents into sparse vectors
        vectors = []
        for doc in text_list:
            counts = {}
            for w in doc:
                # IMPORTANT: If a word wasn't in the training data (fit), it is ignored here.
                if w in self.vocab:
                    idx = self.vocab[w]
                    counts[idx] = counts.get(idx, 0) + 1

            # Avoid division by zero for empty documents
            doc_len = len(doc) if len(doc) > 0 else 1
            vec = {} # Sparse representation: {index: tf-idf_score}
            for idx, c in counts.items():
                # TF (Term Frequency): How often word appears in THIS doc
                tf = c / doc_len
                # TF-IDF: Multiply local importance (TF) by global rarity (IDF)
                vec[idx] = tf * self.idf[idx]
            vectors.append(vec)
        return vectors
    
    def fit_transform(self, text_list):
        # Convenience method to learn vocab and return vectors in one go
        self.fit(text_list)
        return self.transform(text_list)

Run the TF-IDF class on the training data to fit the vocabulary and transform the training set. The vocabulary and IDF weights learned here are then reused to transform the validation and test data

In [11]:
# Vectorize
tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(train_processed.tolist())
X_val_vec = tfidf.transform(val_processed.tolist())
X_test_vec = tfidf.transform(test_processed.tolist())

We need to define the functions that will do the heavy lifting for logistic regression:
<br> This includes a sparse dot product that computes the linear score passed to the activation function
<br> Activation functions for binary and multiclass cases
<br> Gradient functions for binary and multiclass cases
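
For reference, the gradients for the binary case can be derived as follows. This is a sketch that assumes the objective is the mean binary cross-entropy plus an L2 penalty of lambda times the squared norm of w, which matches the loss and gradient code in this notebook. The symbols m (batch size), x_i, y_i and y-hat_i are notation introduced here.

```latex
\begin{aligned}
z_i &= w^\top x_i + b, \qquad \hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \\
J(w, b) &= -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] + \lambda \lVert w \rVert_2^2 \\
\frac{\partial J}{\partial z_i} &= \frac{1}{m} (\hat{y}_i - y_i) \quad \text{since } \sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \\
\nabla_w J &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i + 2 \lambda w \\
\frac{\partial J}{\partial b} &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
\end{aligned}
```

The bias is not regularized, so its gradient has no penalty term.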

In [12]:
def dot_sparse(w, x):
    # Computes w^T * x for a single document
    s = 0
    # Only iterates over non-zero features in the dictionary
    for j, v in x.items():
        # Ignores features that weren't in the training vocabulary
        if j < len(w):
            s += w[j] * v
    return s

In [13]:
def sigmoid(z):
    # Activation function mapping any real number to the range (0, 1)
    # z is the log-odds; sigmoid(z) is the probability of the positive class
    return 1.0 / (1.0 + np.exp(-z))

In [14]:
def softmax(z):
    # Generalization of sigmoid for multiple classes
    # Subtracting the max prevents np.exp() from blowing up (overflow)
    exp_z = np.exp(z - np.max(z))
    # Normalize so that all class probabilities sum to 1.0
    return exp_z / exp_z.sum()

In [15]:
def compute_gradient_binary(w, b, X_batch, y_batch, lam):
    # Calculates partial derivatives for Binary Cross-Entropy loss
    dw = np.zeros_like(w)
    db = 0.0
    m = len(X_batch)
    
    for x_i, y_i in zip(X_batch, y_batch):
        # 1. Forward Pass: calculate prediction
        z = dot_sparse(w, x_i) + b
        error = sigmoid(z) - y_i # Difference between predicted and actual
        
        # 2. Backward Pass: attribute error to individual feature weights
        for j, v in x_i.items():
            dw[j] += error * v
        db += error
    
    # Return average gradient plus L2 penalty (derivative of lam * w^2 is 2 * lam * w)
    return (dw / m) + 2 * lam * w, db / m
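
One way to gain confidence in a hand-derived gradient like this is to compare it against central finite differences on a tiny problem. The sketch below is self-contained and uses dense Python lists instead of the sparse dictionaries; the function names and the toy data are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, X, y, lam):
    # Mean binary cross-entropy plus lam * ||w||^2
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        total += -(y_i * math.log(p) + (1 - y_i) * math.log(1 - p))
    return total / len(X) + lam * sum(wj ** 2 for wj in w)

def analytic_grad(w, b, X, y, lam):
    # Same formula as compute_gradient_binary: mean(error * x) + 2 * lam * w
    dw = [0.0] * len(w)
    db = 0.0
    for x_i, y_i in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) - y_i
        for j, xj in enumerate(x_i):
            dw[j] += err * xj
        db += err
    m = len(X)
    return [d / m + 2 * lam * wj for d, wj in zip(dw, w)], db / m

def numeric_grad(w, b, X, y, lam, h=1e-6):
    # Central differences: (J(theta + h) - J(theta - h)) / (2h)
    dw = []
    for j in range(len(w)):
        w_plus, w_minus = w[:], w[:]
        w_plus[j] += h
        w_minus[j] -= h
        dw.append((loss(w_plus, b, X, y, lam) - loss(w_minus, b, X, y, lam)) / (2 * h))
    db = (loss(w, b + h, X, y, lam) - loss(w, b - h, X, y, lam)) / (2 * h)
    return dw, db

# Made-up toy problem: three examples, three features
X = [[0.5, 0.0, 1.0], [0.0, 2.0, 0.3], [1.5, 0.1, 0.0]]
y = [1.0, 0.0, 1.0]
w, b, lam = [0.2, -0.4, 0.1], 0.05, 0.01

dw_a, db_a = analytic_grad(w, b, X, y, lam)
dw_n, db_n = numeric_grad(w, b, X, y, lam)
max_diff = max(max(abs(a - n) for a, n in zip(dw_a, dw_n)), abs(db_a - db_n))
print(f"largest difference: {max_diff:.2e}")
```

If the two gradients agree to several decimal places, the analytic formula, including the `2 * lam * w` term, is consistent with the loss being minimized.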

In [16]:
def compute_gradient_multiclass(W, b, X_batch, y_batch, lam):
    # Calculates gradients for Categorical Cross-Entropy (Softmax Regression)
    dw = np.zeros_like(W)  # Shape: (features, classes)
    db = np.zeros_like(b)  # Shape: (classes,)
    m = len(X_batch)
    K = W.shape[1]
    
    for x_i, y_i in zip(X_batch, y_batch):
        # Linear pass for all classes at once
        z = np.zeros(K)
        for j, v in x_i.items():
            z += W[j, :] * v
        z += b
        
        # Get probability distribution across all K classes
        y_hat = softmax(z)
        
        # Convert class index (e.g., 2) to "one-hot" vector (e.g., [0,0,1,0])
        target = np.zeros(K)
        target[int(y_i)] = 1.0
        error = y_hat - target # Vector of differences
        
        # Update gradient matrix: each weight j contributes to the error of class k
        for j, v in x_i.items():
            dw[j, :] += error * v
        db += error
        
    return (dw / m) + 2 * lam * W, db / m

def dot_sparse_multiclass(W, x):
    # Efficiently computes the score for each class for a sparse document
    # Result z contains one logit score per class (K,)
    K = W.shape[1]
    z = np.zeros(K)
    for j, v in x.items():
        if j < W.shape[0]:
            # Add the weighted contribution of feature j to every class's score
            z += W[j, :] * v
    return z

## 3. LogisticRegression Class
<br> This class acts as a hub for all three training methods: batch gradient descent, stochastic gradient descent and mini-batch gradient descent.
<br> It also includes a predict function used to make predictions on the validation and test sets
<br> For future use, it supports both binary and multiclass problems

In [17]:
class LogisticRegressionModel:
    def __init__(self, lr=0.01, lam=0.01, epochs=100, batch_size=32, method='minibatch', multiclass=False):
        # Hyperparameters: lr (step size), lam (L2 strength), epochs (passes over data)
        self.lr, self.lam, self.epochs = lr, lam, epochs
        # Optimization settings: batch_size for 'minibatch', method choice, and problem type
        self.batch_size, self.method, self.multiclass = batch_size, method, multiclass
        # Weights (w), bias (b), and a container for loss history tracking
        self.w, self.b, self.history = None, None, []

    def _get_batch_indices(self, n):
        # Generator that yields indices for different gradient descent methods
        indices = np.arange(n)
        # Shuffle for SGD and Minibatch to ensure variety in gradients
        if self.method != 'batch':
            np.random.shuffle(indices)
        
        if self.method == 'batch':
            # Batch: Use all data at once for every step
            yield indices
        elif self.method == 'sgd':
            # SGD: Step after every single example
            for i in indices:
                yield [i]
        else: # minibatch
            # Minibatch: Use a "batch" from the data
            for i in range(0, n, self.batch_size):
                yield indices[i : i + self.batch_size]

    def _compute_loss(self, X, y):
        # Calculates the objective function to minimize (Cross-Entropy + L2)
        eps = 1e-15 # Prevents log(0) which leads to NaN errors
        y_prob = self.predict_proba(X)
        n = len(X)
        
        if self.multiclass:
             # Categorical Cross-Entropy (CCE): Measures divergence for multiple labels
             true_class_probs = y_prob[np.arange(n), y.astype(int)]
             true_class_probs = np.clip(true_class_probs, eps, 1 - eps)
             loss = -np.sum(np.log(true_class_probs))
        else:
             # Binary Cross-Entropy (BCE): Measures divergence for 0 vs 1
             y_prob = np.clip(y_prob, eps, 1 - eps)
             loss = -np.sum(y * np.log(y_prob) + (1 - y) * np.log(1 - y_prob))
             
        # L2 Regularization: penalizes large weights to prevent overfitting
        # Loss = (Error / N) + (lambda * sum of squared weights)
        reg = self.lam * np.sum(self.w ** 2)
        return (loss / n) + reg

    def fit(self, X, y, X_val=None, y_val=None):
        # Finds the optimal w and b that minimize the loss function
        # 1. Initialize weights based on the number of unique features (d)
        all_idx = [idx for doc in X for idx in doc.keys()]
        d = max(all_idx) + 1 if all_idx else 0
        n = len(X)
        
        # Determine strategy: Binary (vector w) or Multiclass (matrix W)
        if self.multiclass:
            K = len(np.unique(y))
            self.w, self.b = np.zeros((d, K)), np.zeros(K)
            grad_func = compute_gradient_multiclass
        else:
            self.w, self.b = np.zeros(d), 0.0
            grad_func = compute_gradient_binary

        # 2. Main Solver Loop (Optimization)
        for epoch in range(self.epochs):
            for batch_idx in self._get_batch_indices(n):
                # Pull the sparse feature dictionaries for the current batch
                X_batch = [X[i] for i in batch_idx]
                y_batch = y[batch_idx]

                # Call the gradient function selected above (binary or multiclass)
                dw, db = grad_func(self.w, self.b, X_batch, y_batch, self.lam)
                
                # Gradient Descent Update Rule: Move w in the direction that lowers loss
                self.w -= self.lr * dw
                self.b -= self.lr * db

            # Periodic reporting for monitoring convergence
            if epoch % 10 == 0 or epoch == self.epochs - 1:
                train_loss = self._compute_loss(X, y)
                log_msg = f"Epoch {epoch:4d} | Train Loss: {train_loss:.6f}"
                
                if X_val is not None and y_val is not None:
                    val_loss = self._compute_loss(X_val, y_val)
                    log_msg += f" | Val Loss: {val_loss:.6f}"
                
                print(log_msg)
                
        return self

    def predict_proba(self, X):
        # Passes the dot product through an activation function
        probs = []
        for x in X:
            if self.multiclass:
                # Softmax: Multi-class probabilities that sum to 1.0
                z = dot_sparse_multiclass(self.w, x) + self.b
                probs.append(softmax(z))
            else:
                # Sigmoid: Maps z to a value between 0 and 1
                z = dot_sparse(self.w, x) + self.b
                probs.append(sigmoid(z))
        return np.array(probs)

    def predict(self, X, threshold=0.5):
        # Converts soft probabilities into hard class predictions
        probs = self.predict_proba(X)
        if self.multiclass:
            # Argmax: Pick the class with the highest probability
            return np.argmax(probs, axis=1)
        else:
            # Binary: Simple threshold check (usually 0.5)
            return (probs >= threshold).astype(int)

## 4. Cross validation function
<br> Used to tune the L2 regularization strength lambda. Each candidate value is trained on the training set and scored on the single held-out validation set
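
The function in this section scores each lambda on one held-out validation split. If a k-fold scheme were wanted instead, the index bookkeeping could look like the sketch below. `k_fold_indices` is a hypothetical helper that is not defined anywhere else in this notebook, and it only produces index lists; training and scoring would still use the model class above.

```python
import random

def k_fold_indices(n, k, seed=42):
    # Shuffle 0..n-1 once, then deal the shuffled indices into k nearly equal folds
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Every fold except fold i is used for training
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n=10, k=5))
for train_idx, val_idx in splits:
    print(f"train={len(train_idx)} val={sorted(val_idx)}")
```

Each example lands in exactly one validation fold, so averaging the validation loss over the folds uses every training example for validation once.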

In [18]:
# 1. Helper for Loss Calculation
def compute_loss(model, X, y):
    eps = 1e-15
    y_prob = model.predict_proba(X) # Returns (n, K) or (n,)
    n = len(X)
    
    if model.multiclass:
        # y_prob is (n, K)
        # y is indices (n,)
        # CCE
        true_class_probs = y_prob[np.arange(n), y.astype(int)]
        true_class_probs = np.clip(true_class_probs, eps, 1 - eps)
        loss = -np.sum(np.log(true_class_probs))
        
        # L2
        reg = model.lam * np.sum(model.w ** 2)
        return (loss / n) + reg
    else:
        # y_prob is (n,)
        y_prob = np.clip(y_prob, eps, 1 - eps)
        # BCE
        loss = -np.sum(y * np.log(y_prob) + (1 - y) * np.log(1 - y_prob))
        
        # L2
        reg = model.lam * np.sum(model.w ** 2)
        return (loss / n) + reg

# 2. Cross Validation Function
def cross_validate(X_train, y_train, X_val, y_val, lambdas, config):
    results = {}
    best_loss = float('inf')
    best_lam = None
    
    print(f"Starting Cross-Validation on {len(lambdas)} lambda candidates...")
    
    for lam in lambdas:
        # Update config with current lambda
        current_config = config.copy()
        current_config['lam'] = lam
        
        # Initialize and Train
        model = LogisticRegressionModel(**current_config)
        # Pass validation data to fit to see per-epoch progress if desired, 
        # though here we are just collecting final loss.
        model.fit(X_train, y_train)
        
        # Evaluate on VAL set
        val_loss = compute_loss(model, X_val, y_val)
        
        results[lam] = val_loss
        print(f"Lambda: {lam:<6} | Validation Loss: {val_loss:.6f}")
        
        if val_loss < best_loss:
            best_loss = val_loss
            best_lam = lam
            
    print(f"Best Lambda: {best_lam}")
    return best_lam, results

We need to configure three models to cross validate: batch gradient descent, stochastic gradient descent and mini-batch gradient descent. The same setup is used each time with a different config

In [19]:
# Batch GD cross validation
lambdas_to_test = [0.001, 0.01, 0.1, 1, 10]
base_config = {
    'lr': 0.01,
    'epochs': 500,
    'batch_size': None,
    'method': 'batch',
    'multiclass': False
}

best_lambda_batch, cv_results_batch = cross_validate(X_train_vec, y_train_bin, X_val_vec, y_val_bin, lambdas_to_test, base_config)

# Capture best lambda for retraining and evaluation
final_config_batch = base_config.copy()
final_config_batch['lam'] = best_lambda_batch

Starting Cross-Validation on 5 lambda candidates...
Epoch    0 | Train Loss: 0.691714
Epoch   10 | Train Loss: 0.677781
Epoch   20 | Train Loss: 0.664555
Epoch   30 | Train Loss: 0.651998
Epoch   40 | Train Loss: 0.640074
Epoch   50 | Train Loss: 0.628751
Epoch   60 | Train Loss: 0.617996
Epoch   70 | Train Loss: 0.607779
Epoch   80 | Train Loss: 0.598070
Epoch   90 | Train Loss: 0.588841
Epoch  100 | Train Loss: 0.580068
Epoch  110 | Train Loss: 0.571725
Epoch  120 | Train Loss: 0.563788
Epoch  130 | Train Loss: 0.556236
Epoch  140 | Train Loss: 0.549048
Epoch  150 | Train Loss: 0.542203
Epoch  160 | Train Loss: 0.535683
Epoch  170 | Train Loss: 0.529471
Epoch  180 | Train Loss: 0.523549
Epoch  190 | Train Loss: 0.517903
Epoch  200 | Train Loss: 0.512517
Epoch  210 | Train Loss: 0.507378
Epoch  220 | Train Loss: 0.502472
Epoch  230 | Train Loss: 0.497788
Epoch  240 | Train Loss: 0.493312
Epoch  250 | Train Loss: 0.489036
Epoch  260 | Train Loss: 0.484947
Epoch  270 | Train Loss: 0.481

In [20]:
# SGD cross validation
lambdas_to_test = [0.001, 0.01, 0.1, 1, 10]
base_config_sgd = {
    'lr': 0.001,
    'epochs': 50,
    'batch_size': 1,
    'method': 'sgd',
    'multiclass': False
}

best_lambda_sgd, cv_results_sgd = cross_validate(X_train_vec, y_train_bin, X_val_vec, y_val_bin, lambdas_to_test, base_config_sgd)

# Capture best lambda for retraining and evaluation
final_config_sgd = base_config_sgd.copy()
final_config_sgd['lam'] = best_lambda_sgd

In [21]:
# Mini-batch cross validation
lambdas_to_test = [0.001, 0.01, 0.1, 1, 10]
base_config_mini = {
    'lr': 0.01,
    'epochs': 500,
    'batch_size': 32,
    'method': 'minibatch',
    'multiclass': False
}

best_lambda_mini, cv_results_mini = cross_validate(X_train_vec, y_train_bin, X_val_vec, y_val_bin, lambdas_to_test, base_config_mini)

# Capture best lambda for retraining and evaluation
final_config_mini = base_config_mini.copy()
final_config_mini['lam'] = best_lambda_mini

Now we "retrain" each model type with the best lambda and evaluate the output

In [22]:
batch_model = LogisticRegressionModel(**final_config_batch)
batch_model.fit(X_train_vec, y_train_bin)

Epoch    0 | Train Loss: 0.691714
Epoch   10 | Train Loss: 0.677781
Epoch   20 | Train Loss: 0.664555
Epoch   30 | Train Loss: 0.651998
Epoch   40 | Train Loss: 0.640074
Epoch   50 | Train Loss: 0.628751
Epoch   60 | Train Loss: 0.617996
Epoch   70 | Train Loss: 0.607779
Epoch   80 | Train Loss: 0.598070
Epoch   90 | Train Loss: 0.588841
Epoch  100 | Train Loss: 0.580068
Epoch  110 | Train Loss: 0.571725
Epoch  120 | Train Loss: 0.563788
Epoch  130 | Train Loss: 0.556236
Epoch  140 | Train Loss: 0.549048
Epoch  150 | Train Loss: 0.542203
Epoch  160 | Train Loss: 0.535683
Epoch  170 | Train Loss: 0.529471
Epoch  180 | Train Loss: 0.523549
Epoch  190 | Train Loss: 0.517903
Epoch  200 | Train Loss: 0.512517
Epoch  210 | Train Loss: 0.507378
Epoch  220 | Train Loss: 0.502472
Epoch  230 | Train Loss: 0.497788
Epoch  240 | Train Loss: 0.493312
Epoch  250 | Train Loss: 0.489036
Epoch  260 | Train Loss: 0.484947
Epoch  270 | Train Loss: 0.481037
Epoch  280 | Train Loss: 0.477296
Epoch  290 | T

<__main__.LogisticRegressionModel at 0x12bb37eb0>

In [23]:
sgd_model = LogisticRegressionModel(**final_config_sgd)
sgd_model.fit(X_train_vec, y_train_bin)

Epoch    0 | Train Loss: 0.460685
Epoch   10 | Train Loss: 0.349961
Epoch   20 | Train Loss: 0.323132
Epoch   30 | Train Loss: 0.303094
Epoch   40 | Train Loss: 0.288225
Epoch   49 | Train Loss: 0.278172


<__main__.LogisticRegressionModel at 0x12bb34760>

In [24]:
mini_model = LogisticRegressionModel(**final_config_mini)
mini_model.fit(X_train_vec, y_train_bin)

Epoch    0 | Train Loss: 0.576598
Epoch   10 | Train Loss: 0.381383
Epoch   20 | Train Loss: 0.364771
Epoch   30 | Train Loss: 0.353968
Epoch   40 | Train Loss: 0.344349
Epoch   50 | Train Loss: 0.335564
Epoch   60 | Train Loss: 0.327538
Epoch   70 | Train Loss: 0.320205
Epoch   80 | Train Loss: 0.313532
Epoch   90 | Train Loss: 0.307440
Epoch  100 | Train Loss: 0.301890
Epoch  110 | Train Loss: 0.296846
Epoch  120 | Train Loss: 0.292254
Epoch  130 | Train Loss: 0.288070
Epoch  140 | Train Loss: 0.284264
Epoch  150 | Train Loss: 0.280794
Epoch  160 | Train Loss: 0.277645
Epoch  170 | Train Loss: 0.274780
Epoch  180 | Train Loss: 0.272167
Epoch  190 | Train Loss: 0.269787
Epoch  200 | Train Loss: 0.267623
Epoch  210 | Train Loss: 0.265646
Epoch  220 | Train Loss: 0.263841
Epoch  230 | Train Loss: 0.262198
Epoch  240 | Train Loss: 0.260698
Epoch  250 | Train Loss: 0.259321
Epoch  260 | Train Loss: 0.258070
Epoch  270 | Train Loss: 0.256926
Epoch  280 | Train Loss: 0.255882
Epoch  290 | T

<__main__.LogisticRegressionModel at 0x12bb36c80>

Make an evaluation function for binary and multiclass cases for accuracy, precision, recall and F1 score

In [46]:
def evaluate_model(y_true, y_pred, is_multiclass=False):
    metrics = {}
    
    if not is_multiclass:
        # Binary Logic
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        accuracy = (tp + tn) / len(y_true)

        metrics = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1_score": f1
        }
    else:
        # Multiclass Logic (Macro-averaged)
        classes = np.unique(y_true)
        acc = np.mean(y_true == y_pred)
        
        precisions, recalls, f1s = [], [], []
        
        for c in classes:
            tp = np.sum((y_true == c) & (y_pred == c))
            fp = np.sum((y_true != c) & (y_pred == c))
            fn = np.sum((y_true == c) & (y_pred != c))
            
            p = tp / (tp + fp) if (tp + fp) > 0 else 0
            r = tp / (tp + fn) if (tp + fn) > 0 else 0
            f = 2 * (p * r) / (p + r) if (p + r) > 0 else 0
            
            precisions.append(p)
            recalls.append(r)
            f1s.append(f)
            
        metrics = {
            "accuracy": acc,
            "precision": np.mean(precisions),
            "recall": np.mean(recalls),
            "f1_score": np.mean(f1s)
        }

    # Print Results
    print(f"{'Multiclass' if is_multiclass else 'Binary'} Results")
    for k, v in metrics.items():
        print(f"{k.capitalize():10}: {v:.4f}")
        
    return metrics

Now we predict and evaluate each model's performance
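
The evaluation cells pass a `threshold` below the default of 0.5. The effect of that choice on precision and recall can be sketched on toy numbers. The labels and probabilities below are made up for illustration and are not model output; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, probs, threshold):
    # Predict the positive class when the probability reaches the threshold
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up labels and probabilities, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.90, 0.60, 0.40, 0.30, 0.35, 0.20, 0.10, 0.05, 0.15, 0.45]

for t in (0.5, 0.25):
    p, r = precision_recall(y_true, probs, t)
    print(f"threshold={t:.2f} precision={p:.3f} recall={r:.3f}")
```

On these toy numbers, lowering the threshold from 0.5 to 0.25 raises recall from 0.5 to 1.0 while precision drops from 1.0 to about 0.667, which is the trade-off to keep in mind when reading the results.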

In [None]:
# Evaluation of GD
preds_batch = batch_model.predict(X_test_vec, threshold=0.25)
results_batch = evaluate_model(y_test_bin, preds_batch, is_multiclass=False)

Binary Results
Accuracy  : 0.8054
Precision : 0.4118
Recall    : 0.9545
F1_score  : 0.5753


In [None]:
# Evaluation of SGD
preds_sgd = sgd_model.predict(X_test_vec, threshold=0.2)
results_sgd = evaluate_model(y_test_bin, preds_sgd, is_multiclass=False)

Binary Results
Accuracy  : 0.9677
Precision : 0.9470
Recall    : 0.8117
F1_score  : 0.8741


In [None]:
# Evaluation of Mini-Batch GD
preds_minibatch = mini_model.predict(X_test_vec, threshold=0.2)
results_mini = evaluate_model(y_test_bin, preds_minibatch, is_multiclass=False)

Binary Results
Accuracy  : 0.9704
Precision : 0.8903
Recall    : 0.8961
F1_score  : 0.8932


Analysis of the final results of the logistic regression model:
    -The first thing to note is the class distribution: roughly 87% of the messages are ham and 13% are spam. This imbalance can bias the model toward the majority class (ham, label 0), so it may classify ham well while missing spam (label 1, the positive class). Future versions of this model could resample the data to create a more balanced training set and reduce this bias toward the ham class. 

    -One thing the models have in common is the lambda parameter: all three used lambda = 0.001, the candidate that gave the lowest validation loss. Each training method became less able to minimize the objective function as lambda increased. This is expected: a stronger penalty makes the regularization term dominate the objective and shrinks the weights toward zero, which limits how well the model can fit the data. 

    -The final results also show a trend in the threshold used when making predictions. Due to the large class imbalance, all three models need a lower prediction threshold (0.25 or below) to yield the best results. With balanced classes, 0.5 is the natural default threshold. With a large class imbalance, the predicted probabilities for the minority class tend to be low, so at a threshold of 0.5 the model predicts the majority class most of the time. By setting the threshold to 0.25 or lower, we predict the negative class only when the predicted probability falls below that value, and the positive class otherwise.
    
    -The batch gradient descent model performs poorly relative to the other models. While it achieves high recall (0.9545) at a threshold of 0.25, indicating strong detection of positive cases, this comes at the cost of low precision (0.4118), meaning many false positives. Although the accuracy is 0.8054, this metric is misleading due to the class imbalance: given the roughly 85/15 split, a model that always predicted the majority class would already score about 0.85. The moderate F1 score of 0.5753 reflects the tradeoff between recall and precision and suggests that the model is not well balanced. As mentioned previously, reducing the class imbalance by resampling into a smaller, more balanced dataset could allow the model to generalize more effectively. This behavior could also be explained by the training method: each update averages the gradient over the entire training set, so the minority class carries less influence on the weights of the final model.
    
    -For the stochastic gradient descent model we see a large change in the model's performance. The accuracy is 0.9677, meaning the correct class is predicted 96.77% of the time. Precision is also high at 0.9470, indicating very few false positives. This comes at the cost of recall, which is 0.8117, indicating more false negatives than the batch gradient descent model. Overall this yields a much more balanced model, as shown by the F1 score of 0.8741. One thing to point out is that for this training method the epochs were lowered from 500 to 50, which allowed for very fast training with a reduced risk of overfitting. This is possible because the weights are updated once per sample, giving far more updates per epoch and letting the model learn patterns in the data more effectively than the batch gradient descent model.
    
    -The final model uses a mini-batch gradient descent approach and is the most balanced of the three models. It achieves a high accuracy of 0.9704, indicating that approximately 97% of all samples are classified correctly. Both precision and recall are close to 0.89, showing that the model maintains a low rate of false positives while also correctly identifying most true positive cases. This balance between precision and recall is reflected in the high F1 score of 0.8932, indicating strong and consistent performance across both classes. The mini-batch training approach allows for more stable weight updates compared to stochastic gradient descent, reducing gradient noise during training. As a result, the model was able to train for a larger number of epochs without overfitting, ultimately converging to a better solution than the SGD model.

    -For future improvements, the main takeaway is that more weight updates generally lead to a better model, but there must be a proper balance. The SGD model was updated once per sample, resulting in over 3000 updates per epoch, while the mini-batch model was updated once per 32 samples, resulting in about 100 updates per epoch. Even with 10x as many epochs (500 vs. 50), the mini-batch model made roughly 50,000 updates in total, about 1/3 of the roughly 150,000 made by the SGD model. This allowed the mini-batch model to train with less noise, resulting in a more balanced model. When it comes to balancing the dataset, the goal should be to have the decision boundary for a binary classification problem sit as close to the center of the data as possible. Resampling the data to create a roughly 60/40 or even 50/50 split between the positive and negative cases would most likely produce a model whose decision threshold does not need to be adjusted to yield good results.
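
The effect of the prediction threshold discussed above can be illustrated with a small sketch. The probabilities and labels below are made-up values, not outputs of the models in this notebook, and `precision_recall_at` is a hypothetical helper. It assumes spam is the positive class (label 1) and that a sample is predicted positive when its probability is greater than or equal to the threshold.

```python
def precision_recall_at(probs, labels, threshold):
    """Return (precision, recall) when predicting 1 for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy scores (illustrative only): 3 positives among 10 samples, scored low overall
toy_probs = [0.05, 0.10, 0.15, 0.22, 0.30, 0.35, 0.45, 0.08, 0.12, 0.60]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1]

for t in (0.5, 0.25):
    p, r = precision_recall_at(toy_probs, toy_labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.25 raises recall and lowers precision, which is the same tradeoff seen in the results above. In practice the threshold would usually be chosen on the validation split rather than on the test set.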

## 5. Multiclass Classification (Books)
<br> We must split the data, preprocess, vectorize and train the models

In [29]:
books_df = pd.read_csv('a1-data/books.txt', sep='\t', header=None, names=['label', 'text'])

In [30]:
# Split into training, validation and testing sets
X_train_b, X_val_b, X_test_b, y_train_b, y_val_b, y_test_b = split_dataset(books_df, 0.6, 0.2, 0.2)

In [33]:
def data_distribution(y_train, y_val, y_test):
    # Per-class sample counts for each split; a class absent from a split counts as 0
    counts = {'Train': y_train.value_counts(), 'Val': y_val.value_counts(), 'Test': y_test.value_counts()}
    return pd.DataFrame(counts).fillna(0).astype(int)

data_distribution(y_train_b, y_val_b, y_test_b)

label,Train,Val,Test
Jane Austen,6610,2200,2172
Fyodor Dostoyevsky,3555,1189,1200
Arthur Conan Doyle,1382,460,478
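
As a quick check on the class balance, the counts in the table above can be turned into proportions. This sketch hard-codes the counts from the table rather than recomputing them from `books_df`, and `class_shares` is a hypothetical helper.

```python
# Counts copied from the distribution table above
counts = {
    "Jane Austen": {"Train": 6610, "Val": 2200, "Test": 2172},
    "Fyodor Dostoyevsky": {"Train": 3555, "Val": 1189, "Test": 1200},
    "Arthur Conan Doyle": {"Train": 1382, "Val": 460, "Test": 478},
}

def class_shares(split):
    """Fraction of samples per author within one split."""
    total = sum(per_split[split] for per_split in counts.values())
    return {author: per_split[split] / total for author, per_split in counts.items()}

for split in ("Train", "Val", "Test"):
    shares = class_shares(split)
    print(split, ", ".join(f"{author}: {share:.1%}" for author, share in shares.items()))
```

Under these counts Jane Austen makes up roughly 56 to 57% of every split, so a classifier that always predicts her would already reach about that accuracy.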


In [34]:
# Process
X_train_b_proc = preprocessor.preprocess(X_train_b['text'])
X_val_b_proc = preprocessor.preprocess(X_val_b['text'])
X_test_b_proc = preprocessor.preprocess(X_test_b['text'])

In [35]:
# Vectorize
tfidf_books = TfidfVectorizer()
X_train_b_vec = tfidf_books.fit_transform(X_train_b_proc.tolist())
X_val_b_vec = tfidf_books.transform(X_val_b_proc.tolist())
X_test_b_vec = tfidf_books.transform(X_test_b_proc.tolist())

In [36]:
# Map Labels
book_map = {"Arthur Conan Doyle": 2, "Jane Austen": 1, "Fyodor Dostoyevsky": 0}
y_train_b_idx = np.array([book_map[v] for v in y_train_b], dtype=int)
y_val_b_idx = np.array([book_map[v] for v in y_val_b], dtype=int)
y_test_b_idx = np.array([book_map[v] for v in y_test_b], dtype=int)

In [37]:
# Batch GD for cross validation of the multiclass case
lambdas_to_test = [0.001, 0.01, 0.1, 1, 10]
base_config = {
    'lr': 0.01,
    'epochs': 500,
    'batch_size': None,
    'method': 'batch',
    'multiclass': True
}

best_lambda_b_batch, cv_results_b_batch = cross_validate(X_train_b_vec, y_train_b_idx, X_val_b_vec, y_val_b_idx, lambdas_to_test, base_config)

# Capture best lambda for retraining and evaluation
final_config_b_batch = base_config.copy()
final_config_b_batch['lam'] = best_lambda_b_batch

Starting Cross-Validation on 5 lambda candidates...
Epoch    0 | Train Loss: 1.097516
Epoch   10 | Train Loss: 1.086950
Epoch   20 | Train Loss: 1.077075
Epoch   30 | Train Loss: 1.067848
Epoch   40 | Train Loss: 1.059225
Epoch   50 | Train Loss: 1.051167
Epoch   60 | Train Loss: 1.043638
Epoch   70 | Train Loss: 1.036601
Epoch   80 | Train Loss: 1.030024
Epoch   90 | Train Loss: 1.023876
Epoch  100 | Train Loss: 1.018127
Epoch  110 | Train Loss: 1.012749
Epoch  120 | Train Loss: 1.007718
Epoch  130 | Train Loss: 1.003009
Epoch  140 | Train Loss: 0.998599
Epoch  150 | Train Loss: 0.994469
Epoch  160 | Train Loss: 0.990598
Epoch  170 | Train Loss: 0.986968
Epoch  180 | Train Loss: 0.983563
Epoch  190 | Train Loss: 0.980367
Epoch  200 | Train Loss: 0.977365
Epoch  210 | Train Loss: 0.974543
Epoch  220 | Train Loss: 0.971890
Epoch  230 | Train Loss: 0.969393
Epoch  240 | Train Loss: 0.967041
Epoch  250 | Train Loss: 0.964825
Epoch  260 | Train Loss: 0.962735
Epoch  270 | Train Loss: 0.960

In [38]:
# SGD for cross validation of the multiclass case
lambdas_to_test = [0.001, 0.01, 0.1, 1, 10]
base_config = {
    'lr': 0.01,
    'epochs': 50,
    'batch_size': 1,
    'method': 'sgd',
    'multiclass': True
}

best_lambda_b_sgd, cv_results_b_sgd = cross_validate(X_train_b_vec, y_train_b_idx, X_val_b_vec, y_val_b_idx, lambdas_to_test, base_config)

# Capture best lambda for retraining and evaluation
final_config_b_sgd = base_config.copy()
final_config_b_sgd['lam'] = best_lambda_b_sgd

Starting Cross-Validation on 5 lambda candidates...
Epoch    0 | Train Loss: 0.803837
Epoch   10 | Train Loss: 0.717253
Epoch   20 | Train Loss: 0.717200
Epoch   30 | Train Loss: 0.717371
Epoch   40 | Train Loss: 0.718919
Epoch   49 | Train Loss: 0.717801
Lambda: 0.001  | Validation Loss: 0.747007
Epoch    0 | Train Loss: 0.893700
Epoch   10 | Train Loss: 0.893571
Epoch   20 | Train Loss: 0.896761
Epoch   30 | Train Loss: 0.894995
Epoch   40 | Train Loss: 0.894306
Epoch   49 | Train Loss: 0.894833
Lambda: 0.01   | Validation Loss: 0.903648
Epoch    0 | Train Loss: 0.935174
Epoch   10 | Train Loss: 0.935548
Epoch   20 | Train Loss: 0.934038
Epoch   30 | Train Loss: 0.936946
Epoch   40 | Train Loss: 0.936615
Epoch   49 | Train Loss: 0.934992
Lambda: 0.1    | Validation Loss: 0.937013
Epoch    0 | Train Loss: 0.940527
Epoch   10 | Train Loss: 0.939395
Epoch   20 | Train Loss: 0.941074
Epoch   30 | Train Loss: 0.947528
Epoch   40 | Train Loss: 0.942460
Epoch   49 | Train Loss: 0.940255
Lam

In [39]:
# Mini-Batch GD for cross validation of the multiclass case
lambdas_to_test = [0.001, 0.01, 0.1, 1, 10]
base_config = {
    'lr': 0.01,
    'epochs': 500,
    'batch_size': 32,
    'method': 'minibatch',
    'multiclass': True
}

best_lambda_b_mini, cv_results_b_mini = cross_validate(X_train_b_vec, y_train_b_idx, X_val_b_vec, y_val_b_idx, lambdas_to_test, base_config)

# Capture best lambda for retraining and evaluation
final_config_b_mini = base_config.copy()
final_config_b_mini['lam'] = best_lambda_b_mini

Starting Cross-Validation on 5 lambda candidates...
Epoch    0 | Train Loss: 0.946886
Epoch   10 | Train Loss: 0.870920
Epoch   20 | Train Loss: 0.832263
Epoch   30 | Train Loss: 0.804354
Epoch   40 | Train Loss: 0.783849
Epoch   50 | Train Loss: 0.768579
Epoch   60 | Train Loss: 0.757063
Epoch   70 | Train Loss: 0.748318
Epoch   80 | Train Loss: 0.741598
Epoch   90 | Train Loss: 0.736419
Epoch  100 | Train Loss: 0.732403
Epoch  110 | Train Loss: 0.729257
Epoch  120 | Train Loss: 0.726794
Epoch  130 | Train Loss: 0.724855
Epoch  140 | Train Loss: 0.723317
Epoch  150 | Train Loss: 0.722097
Epoch  160 | Train Loss: 0.721133
Epoch  170 | Train Loss: 0.720350
Epoch  180 | Train Loss: 0.719723
Epoch  190 | Train Loss: 0.719229
Epoch  200 | Train Loss: 0.718819
Epoch  210 | Train Loss: 0.718493
Epoch  220 | Train Loss: 0.718225
Epoch  230 | Train Loss: 0.718009
Epoch  240 | Train Loss: 0.717834
Epoch  250 | Train Loss: 0.717691
Epoch  260 | Train Loss: 0.717575
Epoch  270 | Train Loss: 0.717
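
The selection step inside `cross_validate` is not shown in this section. A minimal sketch of the idea, assuming it simply keeps the lambda with the lowest validation loss, is below. The dictionary layout is an assumption, and the three values are the validation losses printed in the SGD run above (the entries for lambda = 1 and 10 are cut off in the output and are left out).

```python
# Validation losses copied from the SGD cross-validation log above
val_losses = {0.001: 0.747007, 0.01: 0.903648, 0.1: 0.937013}

# Keep the lambda whose model had the lowest validation loss
best_lambda = min(val_losses, key=val_losses.get)
print(f"Best lambda: {best_lambda} (validation loss {val_losses[best_lambda]:.6f})")
```

Among the values visible in the log, the smallest lambda gives the lowest validation loss.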

We need to retrain each model with the tuned lambda 

In [40]:
batch_b_model = LogisticRegressionModel(**final_config_b_batch)
batch_b_model.fit(X_train_b_vec, y_train_b_idx)

Epoch    0 | Train Loss: 1.097516
Epoch   10 | Train Loss: 1.086950
Epoch   20 | Train Loss: 1.077075
Epoch   30 | Train Loss: 1.067848
Epoch   40 | Train Loss: 1.059225
Epoch   50 | Train Loss: 1.051167
Epoch   60 | Train Loss: 1.043638
Epoch   70 | Train Loss: 1.036601
Epoch   80 | Train Loss: 1.030024
Epoch   90 | Train Loss: 1.023876
Epoch  100 | Train Loss: 1.018127
Epoch  110 | Train Loss: 1.012749
Epoch  120 | Train Loss: 1.007718
Epoch  130 | Train Loss: 1.003009
Epoch  140 | Train Loss: 0.998599
Epoch  150 | Train Loss: 0.994469
Epoch  160 | Train Loss: 0.990598
Epoch  170 | Train Loss: 0.986968
Epoch  180 | Train Loss: 0.983563
Epoch  190 | Train Loss: 0.980367
Epoch  200 | Train Loss: 0.977365
Epoch  210 | Train Loss: 0.974543
Epoch  220 | Train Loss: 0.971890
Epoch  230 | Train Loss: 0.969393
Epoch  240 | Train Loss: 0.967041
Epoch  250 | Train Loss: 0.964825
Epoch  260 | Train Loss: 0.962735
Epoch  270 | Train Loss: 0.960762
Epoch  280 | Train Loss: 0.958898
Epoch  290 | T

<__main__.LogisticRegressionModel at 0x12a824040>

In [41]:
sgd_b_model = LogisticRegressionModel(**final_config_b_sgd)
sgd_b_model.fit(X_train_b_vec, y_train_b_idx)

Epoch    0 | Train Loss: 0.802060
Epoch   10 | Train Loss: 0.717192
Epoch   20 | Train Loss: 0.719179
Epoch   30 | Train Loss: 0.719001
Epoch   40 | Train Loss: 0.717421
Epoch   49 | Train Loss: 0.717264


<__main__.LogisticRegressionModel at 0x12a824400>

In [42]:
mini_b_model = LogisticRegressionModel(**final_config_b_mini)
mini_b_model.fit(X_train_b_vec, y_train_b_idx)

Epoch    0 | Train Loss: 0.947193
Epoch   10 | Train Loss: 0.870916
Epoch   20 | Train Loss: 0.832270
Epoch   30 | Train Loss: 0.804359
Epoch   40 | Train Loss: 0.783851
Epoch   50 | Train Loss: 0.768573
Epoch   60 | Train Loss: 0.757063
Epoch   70 | Train Loss: 0.748319
Epoch   80 | Train Loss: 0.741598
Epoch   90 | Train Loss: 0.736421
Epoch  100 | Train Loss: 0.732397
Epoch  110 | Train Loss: 0.729258
Epoch  120 | Train Loss: 0.726795
Epoch  130 | Train Loss: 0.724854
Epoch  140 | Train Loss: 0.723320
Epoch  150 | Train Loss: 0.722098
Epoch  160 | Train Loss: 0.721125
Epoch  170 | Train Loss: 0.720347
Epoch  180 | Train Loss: 0.719723
Epoch  190 | Train Loss: 0.719223
Epoch  200 | Train Loss: 0.718819
Epoch  210 | Train Loss: 0.718490
Epoch  220 | Train Loss: 0.718232
Epoch  230 | Train Loss: 0.718009
Epoch  240 | Train Loss: 0.717838
Epoch  250 | Train Loss: 0.717691
Epoch  260 | Train Loss: 0.717574
Epoch  270 | Train Loss: 0.717479
Epoch  280 | Train Loss: 0.717401
Epoch  290 | T

<__main__.LogisticRegressionModel at 0x12a826530>

Now we create predictions for the test set for each model type and evaluate the output

In [None]:
# Evaluation of GD
preds_b_batch = batch_b_model.predict(X_test_b_vec, threshold=0.2)
results_b_batch = evaluate_model(y_test_b_idx, preds_b_batch, is_multiclass=True)

Multiclass Results
Accuracy  : 0.5642
Precision : 0.1881
Recall    : 0.3333
F1_score  : 0.2405


In [None]:
# Evaluation of SGD
preds_b_sgd = sgd_b_model.predict(X_test_b_vec, threshold=0.2)
results_b_sgd = evaluate_model(y_test_b_idx, preds_b_sgd, is_multiclass=True)

Multiclass Results
Accuracy  : 0.7610
Precision : 0.8452
Recall    : 0.5549
F1_score  : 0.5563


In [None]:
# Evaluation of Mini-Batch GD
preds_b_mini = mini_b_model.predict(X_test_b_vec, threshold=0.2)
results_b_mini = evaluate_model(y_test_b_idx, preds_b_mini, is_multiclass=True)

Multiclass Results
Accuracy  : 0.7675
Precision : 0.8468
Recall    : 0.5625
F1_score  : 0.5636


Overall we see a similar result in this multiclass case compared to the binary case. The best performing model is the mini-batch model, followed closely by stochastic gradient descent and then batch gradient descent. The difference with this dataset is that there are 3 classes and Jane Austen dominates at roughly 57% of the data. So, similarly to the binary case, a lowered threshold is used to improve the predictive power of the model regardless of the training method.

The F1 score gives a clear indication that, even at 500 epochs, the batch gradient descent model is poorly balanced: its macro F1 of 0.2405 is below the roughly 30% that uniform random guessing would score, which indicates the model is favoring one of the three authors. The other metrics point to Jane Austen: the accuracy of 0.5642 equals her share of the test set (2172 of 3850 samples) and the macro recall of 0.3333 is exactly one class out of three, which is what a model that predicts Jane Austen for every sample would produce.

The stochastic gradient descent and mini-batch models have similar F1 scores of roughly 56%. This is better than random guessing, but the gap between precision (about 0.85) and recall (about 0.56) suggests these models still favor one class over the others. More advanced tokenization or cleaning methods might also improve predictive power by giving the model more context to learn from.
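
One way to check whether the batch model favors a single author is to compute the macro-averaged metrics of a classifier that always predicts Jane Austen, using the test counts from the distribution table. This sketch does not call `evaluate_model`; `macro_metrics` is a hypothetical helper that assumes macro averaging and treats precision as 0 for a class that is never predicted.

```python
def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    for k in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1_score": sum(f1s) / n_classes,
    }

# Test-set counts from the table; labels follow book_map
# (Dostoyevsky=0, Austen=1, Doyle=2)
y_true = [0] * 1200 + [1] * 2172 + [2] * 478
y_pred = [1] * len(y_true)  # always predict Jane Austen

baseline = macro_metrics(y_true, y_pred, n_classes=3)
for name, value in baseline.items():
    print(f"{name.capitalize():10}: {value:.4f}")
```

The baseline comes out at accuracy 0.5642, precision 0.1881, recall 0.3333 and F1 0.2405, the same values reported for the batch gradient descent model above. This is consistent with that model predicting Jane Austen for every test sample.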