# 50.007 Machine Learning - Summer 2024


## Task 1: Implement Logistic Regression

First we define the functions we are going to use, namely:

sigmoid(z): A function that takes a real-valued input (a scalar or a NumPy array) and returns an output value between 0 and 1.

loss(y, y_hat): The binary cross-entropy loss that training minimises to find the optimal parameters. The function takes in the actual labels y and the predicted probabilities y_hat, and returns the mean training loss.

gradient_descent(X, y, y_hat): Computes the gradients that gradient descent uses to update our parameters. The function takes in the training features X, the actual labels y and the predicted probabilities y_hat, and returns the partial derivatives of the loss function with respect to the weights (dw) and the bias (db).

train(X, y, bs, epochs, lr, tol=1e-4): The training function for the model, where bs is the batch size and lr is the learning rate. We modified the normal training function to include mini-batch gradient descent and a tolerance level (tol) for early stopping: the loss is checked every 100 epochs, and training stops once it changes by less than tol between checks. Mini-batch gradient descent was used to generalise better in the case of extreme data and to improve training speed.

predict(X, w, b): The prediction function, which we apply to our validation and test sets using the trained weights w and bias b.

accuracy(y_true, y_pred): Calculates the accuracy of the model as the fraction of predictions that match the true labels.
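
For reference, the functions described above correspond to the standard logistic regression formulas below. The notation is chosen for this sketch and is not from the assignment: m is the number of samples in a batch and η stands for the learning rate lr.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(Xw + b)
$$

$$
L(y, \hat{y}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
$$

$$
\frac{\partial L}{\partial w} = \frac{1}{m} X^{\top} (\hat{y} - y), \qquad
\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
$$

$$
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
$$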

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sigmoid function for logistic regression
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

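# Binary cross-entropy loss for logistic regression
def loss(y, y_hat, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 so np.log never receives 0
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))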

# Compute the gradients of the loss with respect to the weights (dw) and bias (db)
def gradient_descent(X, y, y_hat):
    m = X.shape[0]
    dw = (1 / m) * np.dot(X.T, (y_hat - y))
    db = (1 / m) * np.sum(y_hat - y)
    return dw, db

# Training function for logistic regression using mini-batch gradient descent
def train(X, y, bs, epochs, lr, tol=1e-4):
    n_samples, n_features = X.shape
    w = np.zeros((n_features, 1))  # Initialize weights
    b = 0  # Initialize bias
    y = y.reshape(n_samples, 1)
    
    for epoch in range(epochs):
        # Shuffle the dataset
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Mini-batch gradient descent
        for i in range(0, n_samples, bs):
            X_batch = X_shuffled[i:i + bs]
            y_batch = y_shuffled[i:i + bs]
            
            # Compute predictions
            y_hat = sigmoid(np.dot(X_batch, w) + b)
            # Compute gradients
            dw, db = gradient_descent(X_batch, y_batch, y_hat)
            
            # Update weights and bias
            w -= lr * dw
            b -= lr * db

        # Early Stopping check
        if epoch % 100 == 0:
            y_hat_full = sigmoid(np.dot(X, w) + b)
            current_loss = loss(y, y_hat_full)
            print(f'Epoch {epoch}, Loss: {current_loss:.4f}')

            if epoch > 10 and abs(previous_loss - current_loss) < tol:
                print(f"Early stopping at epoch {epoch}")
                break
            previous_loss = current_loss
    
    return w, b

# Prediction function for logistic regression
def predict(X, w, b):
    y_hat = sigmoid(np.dot(X, w) + b)
    # Threshold the probabilities at 0.5 and flatten to a 1-D array of 0/1 labels
    return (y_hat > 0.5).astype(int).ravel()

# Accuracy calculation function
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)
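
One way to gain confidence in the analytic gradients is to compare them against finite differences of the loss on a small random problem. The block below is a minimal sketch of such a check, not part of the original notebook: it repeats the sigmoid, loss and gradient_descent definitions so it runs on its own, and numerical_gradients, the toy arrays and the step size eps are hypothetical names and values introduced for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(y, y_hat):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient_descent(X, y, y_hat):
    m = X.shape[0]
    dw = (1 / m) * np.dot(X.T, (y_hat - y))
    db = (1 / m) * np.sum(y_hat - y)
    return dw, db

def numerical_gradients(X, y, w, b, eps=1e-6):
    """Central finite differences of the loss w.r.t. w and b (hypothetical helper)."""
    dw_num = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j, 0] = eps
        loss_plus = loss(y, sigmoid(X @ (w + step) + b))
        loss_minus = loss(y, sigmoid(X @ (w - step) + b))
        dw_num[j, 0] = (loss_plus - loss_minus) / (2 * eps)
    loss_plus = loss(y, sigmoid(X @ w + b + eps))
    loss_minus = loss(y, sigmoid(X @ w + b - eps))
    db_num = (loss_plus - loss_minus) / (2 * eps)
    return dw_num, db_num

# Small random problem (toy data, assumed for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = rng.integers(0, 2, size=(20, 1)).astype(float)
w_toy = rng.normal(size=(3, 1)) * 0.1
b_toy = 0.05

dw, db = gradient_descent(X_toy, y_toy, sigmoid(X_toy @ w_toy + b_toy))
dw_num, db_num = numerical_gradients(X_toy, y_toy, w_toy, b_toy)
max_diff = max(np.max(np.abs(dw - dw_num)), abs(db - db_num))
print(f"Largest analytic vs numerical difference: {max_diff:.2e}")
```

If the two sets of gradients agree to within a small tolerance, the expressions for dw and db are consistent with the loss being minimised.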

Next, we prepare the data.
We have also opted to perform an 80-20 split on the training data to create a training set and a validation set respectively. The validation set allows us to measure the performance of the model on data it was not trained on.

In [3]:
# Load training data
data = pd.read_csv("Data/train_tfidf_features.csv")
X_features = data.drop(['label', 'id'], axis=1).values
Y_label = data['label'].values

# Load test data
test = pd.read_csv("Data/test_tfidf_features.csv")
X_test = test.drop(['id'], axis=1).values

# Split the data into training and validation sets
indices = np.arange(X_features.shape[0])
np.random.shuffle(indices)
split_idx = int(X_features.shape[0] * 0.8)
train_indices = indices[:split_idx]
val_indices = indices[split_idx:]

X_train_split = X_features[train_indices]
y_train_split = Y_label[train_indices]
X_val_split = X_features[val_indices]
y_val_split = Y_label[val_indices]
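
Because the shuffle above is unseeded, the split changes on every run, which makes accuracy figures hard to reproduce. The block below is a minimal sketch of a seeded variant, assuming NumPy's Generator API; split_indices, train_fraction and seed are hypothetical names introduced here, not part of the original notebook.

```python
import numpy as np

def split_indices(n_samples, train_fraction=0.8, seed=42):
    """Return reproducible (train, validation) index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)       # seeded generator: same shuffle on every run
    indices = rng.permutation(n_samples)    # shuffled copy of 0..n_samples-1
    split_idx = int(n_samples * train_fraction)
    return indices[:split_idx], indices[split_idx:]

# Example with 10 samples: 8 go to training, 2 to validation
train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))
```

The returned arrays can be used in the same way as train_indices and val_indices above, for example X_features[train_idx].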



Now we can train the model.

A large number of epochs was used to give the model enough iterations to converge towards the global minimum of the loss.

A small learning rate was used to ensure smooth convergence and to avoid overshooting.

While both of these combined can result in a long run time, the tolerance level allows us to stop the training once the loss changes by less than tol between checks, reducing the run time.
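
The interaction between mini-batch updates and the tolerance check can be sketched on synthetic data as follows. This is an illustrative variant rather than the notebook's train function: train_sketch, log_loss, check_every, seed and the toy data are all assumptions introduced here, and the loss is checked every check_every epochs instead of the fixed 100 used in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def train_sketch(X, y, bs, epochs, lr, tol=1e-4, check_every=100, seed=0):
    """Mini-batch gradient descent with a tolerance-based stop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros((n_features, 1))
    b = 0.0
    y = y.reshape(n_samples, 1)
    previous_loss = None  # no loss recorded before the first check
    epochs_run = 0
    for epoch in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for i in range(0, n_samples, bs):
            batch = order[i:i + bs]
            error = sigmoid(X[batch] @ w + b) - y[batch]
            w -= lr * (X[batch].T @ error) / len(batch)
            b -= lr * error.mean()
        epochs_run = epoch + 1
        if epoch % check_every == 0:
            current_loss = log_loss(y, sigmoid(X @ w + b))
            # Stop once the loss moved by less than tol since the previous check
            if previous_loss is not None and abs(previous_loss - current_loss) < tol:
                break
            previous_loss = current_loss
    return w, b, epochs_run

# Toy, noisy two-feature problem (assumed data, for illustration only)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
noise = 0.5 * rng.normal(size=200)
y_toy = (X_toy[:, 0] - X_toy[:, 1] + noise > 0).astype(float)

w_toy, b_toy, epochs_run = train_sketch(X_toy, y_toy, bs=32, epochs=500, lr=0.1, check_every=10)
train_acc = np.mean((sigmoid(X_toy @ w_toy + b_toy) > 0.5).ravel() == y_toy)
print(f"Ran {epochs_run} epochs, training accuracy {train_acc:.2f}")
```

A smaller tol or a less frequent check makes the stop more conservative, at the cost of more epochs.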

The predictions of the model are saved to LogRed_Prediction.csv.

Note that running the model will still take some time (roughly 10 minutes).

In [4]:
# Train the model
batch_size = 64
epochs = 1000
learning_rate = 0.01

w, b = train(X_train_split, y_train_split, batch_size, epochs, learning_rate)

# Predict on the validation set
y_pred = predict(X_val_split, w, b)

# Calculate validation accuracy
acc = accuracy(y_val_split, y_pred)
print(f'Validation Accuracy: {acc * 100:.2f}%')

# Predict on the test set
y_final = predict(X_test, w, b)

# Save the predictions to a CSV file
predictions_df = pd.DataFrame({'id': test['id'], 'label': y_final})
predictions_df.to_csv('LogRed_Prediction.csv', index=False)

Epoch 0, Loss: 0.6746
Epoch 100, Loss: 0.6409
Epoch 200, Loss: 0.6226
Epoch 300, Loss: 0.6080


KeyboardInterrupt: 

We then run sklearn's LogisticRegression and see how our model compares to it. We will be evaluating based on the validation accuracy of the two models.

Our model has an accuracy of 69.86%.

In [None]:
# Load training data
data = pd.read_csv("Data/train_tfidf_features.csv")
X_features = data.drop(['label', 'id'], axis=1).values
Y_label = data['label'].values

# Load test data
test = pd.read_csv("Data/test_tfidf_features.csv")
X_test = test.drop(['id'], axis=1).values

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_features, Y_label, test_size=0.2, random_state=42)

# Create an instance of the LogisticRegression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the validation data
y_pred = model.predict(X_val)

# Calculate the accuracy of the model
# (named sk_accuracy so it does not overwrite our accuracy() function defined earlier)
sk_accuracy = accuracy_score(y_val, y_pred)

# Print the accuracy
print("Validation Accuracy:", sk_accuracy)

sklearn's LogisticRegression has an accuracy of 71.57%, which is slightly better than our model's 69.86%, so our implementation remains relatively competitive.

Now, we use 100% of the training set to train the model and make predictions on the test set.

In [None]:
# Load training data
data = pd.read_csv("Data/train_tfidf_features.csv")
X_features = data.drop(['label', 'id'], axis=1).values
Y_label = data['label'].values

# Load test data
test = pd.read_csv("Data/test_tfidf_features.csv")
X_test = test.drop(['id'], axis=1).values

# Train the model
batch_size = 64
epochs = 1000
learning_rate = 0.01

w, b = train(X_features, Y_label, batch_size, epochs, learning_rate)

# Predict on the test set
y_final = predict(X_test, w, b)

# Save the predictions to a CSV file
predictions_df = pd.DataFrame({'id': test['id'], 'label': y_final})
predictions_df.to_csv('LogReg_Prediction_SKLearn.csv', index=False)