1. Consider the MNIST dataset in CSV format, available on Kaggle (https://www.kaggle.com/datasets/oddrationale/mnist-in-csv). The “mnist_train.csv” file contains the 60,000 training examples and labels. The “mnist_test.csv” file contains the 10,000 test examples and labels. Each row consists of 785 values: the first value is the label (a number from 0 to 9) and the remaining 784 values are the pixel values (numbers from 0 to 255).

In [1]:
import pandas as pd

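# Preprocess the data: split the labels from the pixels and normalize the pixel values.
# By default, pixels are scaled from [0, 255] to [0, 1].
def preprocess_data(data, normalization_function=lambda v: v / 255.0):
  labels = data.iloc[:, 0].values
  images = normalization_function(data.iloc[:, 1:].values)
  return images, labels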

# Load and preprocess the data
train_data = pd.read_csv('mnist_train.csv')
test_data = pd.read_csv('mnist_test.csv')
train_images, train_labels = preprocess_data(train_data)
test_images, test_labels = preprocess_data(test_data)

print(train_data.shape)
print(test_data.shape)

(60000, 785)
(10000, 785)
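
Since 784 = 28 × 28, each row can be turned back into a 28×28 grayscale image. The following is a minimal sketch that uses a synthetic row in place of a real one from “mnist_train.csv”; the names `row`, `label`, `image`, and `scaled` are illustrative and not part of the notebook.

```python
import numpy as np

# Synthetic stand-in for one CSV row: a label followed by 784 pixel values
rng = np.random.default_rng(0)
row = np.concatenate(([7], rng.integers(0, 256, size=784)))

label = int(row[0])              # first value: the digit class
image = row[1:].reshape(28, 28)  # remaining 784 values: 28x28 pixel grid
scaled = image / 255.0           # same scaling as the default normalization

print(label, image.shape, scaled.min() >= 0.0, scaled.max() <= 1.0)
```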


Develop and implement a “logistic_regression.py” script that contains a model able to distinguish between the classes “0” to “9” in this dataset.

In [12]:
import numpy as np

class RegularizedSoftmaxRegression:
  def __init__(self, learning_rate=0.1, max_iter=1000, C=1, tol=0.0001):
    self.learning_rate = learning_rate
    self.epochs = max_iter
    self.tol = tol
    # C is the inverse regularization strength: a smaller C gives a larger L2 weight
    self.lambda_ = 0.001 * (1.0 / C)
    self.epsilon = 1e-9

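  def compute_loss(self, y_true, y_pred):
    # Clip the probabilities so that log() never receives 0
    y_pred = y_pred.clip(min=self.epsilon, max=1 - self.epsilon)
    # np.mean averages over samples and classes, so this value is the
    # per-sample cross-entropy divided by the number of classes
    loss = -np.mean(y_true * np.log(y_pred))
    l2_regularization = (self.lambda_ / 2) * np.sum(np.square(self.weights))
    return loss + l2_regularization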

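  def one_hot_encode(self, labels):
    labels = np.array(labels, dtype=int)
    # Size the encoding by the largest label so that a class missing from
    # the training labels cannot cause an out-of-bounds column index
    n_classes = labels.max() + 1
    one_hot_encoded = np.zeros((len(labels), n_classes))
    one_hot_encoded[np.arange(len(labels)), labels] = 1
    return one_hot_encoded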

  def softmax(self, z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

  def predict_proba(self, X):
    z = np.dot(X, self.weights) + self.bias
    A = self.softmax(z)
    return A

  def fit(self, X, raw_y):
    y = self.one_hot_encode(raw_y)
    n_samples, n_features = X.shape
    n_classes = y.shape[1]
    self.weights = np.zeros((n_features, n_classes))
    self.bias = np.zeros(n_classes)
    self.losses = []
    for _ in range(self.epochs):
      A = self.predict_proba(X)
      self.losses.append(self.compute_loss(y, A))
      dz = A - y
      dw = (1 / n_samples) * np.dot(X.T, dz) + self.lambda_ * self.weights
      db = (1 / n_samples) * np.sum(dz, axis=0)
      self.weights -= self.learning_rate * dw
      self.bias -= self.learning_rate * db
      # Early stopping: stop when the relative loss improvement over the
      # last 10 epochs falls below tol
      if len(self.losses) > 10:
        relative_improvement = (self.losses[-11] - self.losses[-1]) / (self.losses[-11] + self.epsilon)
        if relative_improvement < self.tol:
          break
    return self.weights, self.bias, self.losses

  def predict(self, X):
    A = self.predict_proba(X)
    y_predicted_cls = np.argmax(A, axis=1)
    return y_predicted_cls
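
The update in `fit` uses `dz = A - y`, which is the gradient of the cross-entropy averaged over samples (and summed over classes) with respect to the logits, plus the L2 term `lambda_ * weights`. One way to gain confidence in such an update rule is a finite-difference gradient check. The sketch below runs one on small synthetic data; it is a standalone illustration, and `loss_fn`, `grad_fn`, and `numerical_grad` are hypothetical helpers rather than methods of the class above.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def loss_fn(W, b, X, Y, lam):
    # Cross-entropy averaged over samples, plus the L2 penalty on the weights
    P = softmax(X @ W + b)
    return -np.sum(Y * np.log(P)) / X.shape[0] + 0.5 * lam * np.sum(W ** 2)

def grad_fn(W, b, X, Y, lam):
    # Same expressions as dw and db in the fit loop
    n = X.shape[0]
    dz = softmax(X @ W + b) - Y
    return X.T @ dz / n + lam * W, dz.sum(axis=0) / n

def numerical_grad(f, param, h=1e-6):
    # Central differences, perturbing one entry of `param` at a time (in place)
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + h
        plus = f()
        param[idx] = original - h
        minus = f()
        param[idx] = original
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Y = np.eye(3)[rng.integers(0, 3, size=20)]  # one-hot labels for 3 classes
W = rng.normal(scale=0.1, size=(5, 3))
b = np.zeros(3)
lam = 0.001

dW, db = grad_fn(W, b, X, Y, lam)
err_W = np.max(np.abs(dW - numerical_grad(lambda: loss_fn(W, b, X, Y, lam), W)))
err_b = np.max(np.abs(db - numerical_grad(lambda: loss_fn(W, b, X, Y, lam), b)))
print("max abs error (weights):", err_W)
print("max abs error (bias):", err_b)
```

Both errors should be tiny if the analytic gradient matches the loss.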

In [3]:
from sklearn.metrics import accuracy_score

# Fit the model on the given training data and return its accuracy on the test data
def train_and_evaluate(model, train_images, train_labels, test_images=test_images, test_labels=test_labels):
  model.fit(train_images, train_labels)
  pred_labels = model.predict(test_images)
  return accuracy_score(test_labels, pred_labels)

%time print("Accuracy: ", train_and_evaluate(RegularizedSoftmaxRegression(), train_images, train_labels))

Accuracy:  0.8992
CPU times: user 5min 3s, sys: 25.6 s, total: 5min 29s
Wall time: 3min 32s


Repeating the training and evaluation process with scikit-learn's LogisticRegression

In [4]:
from sklearn.linear_model import LogisticRegression

%time print("Accuracy: ", train_and_evaluate(LogisticRegression(max_iter=1000), train_images, train_labels))

Accuracy:  0.9256
CPU times: user 8min 53s, sys: 33.4 s, total: 9min 27s
Wall time: 5min 42s


Avoiding memory and time consumption issues by training on subsets of the data

In [5]:
import time

model_classes = {
    "MyLogisticRegression": RegularizedSoftmaxRegression,
    "ScikitLogisticRegression": LogisticRegression
}

for factor in [0.1, 0.2]:  # Use subsets of 10 and 20 percent of the training data
  subset_size = int(len(train_images) * factor)
  for model_name, model_class in model_classes.items():
    print(f"{model_name} with {subset_size} samples:")
    start_time = time.time()
    accuracy = train_and_evaluate(model_class(max_iter=1000), train_images[:subset_size], train_labels[:subset_size])
    execution_time = time.time() - start_time
    print("Accuracy: ", accuracy)
    print(f"Execution Time: {int(execution_time)}s")
    print(f"Time-Normalized Accuracy: {accuracy / execution_time} \n")

MyLogisticRegression with 6000 samples:
Accuracy:  0.8924
Execution Time: 28s
Time-Normalized Accuracy: 0.03138009968836341 

ScikitLogisticRegression with 6000 samples:
Accuracy:  0.8968
Execution Time: 17s
Time-Normalized Accuracy: 0.05167572200487719 

MyLogisticRegression with 12000 samples:
Accuracy:  0.8929
Execution Time: 55s
Time-Normalized Accuracy: 0.016032313524142926 

ScikitLogisticRegression with 12000 samples:
Accuracy:  0.9083
Execution Time: 34s
Time-Normalized Accuracy: 0.026530489585009687 



Adopting a 10% subset from now on!

In [6]:
subset_size = int(len(train_images) * 0.1)
train_images_subset = train_images[:subset_size]
train_labels_subset = train_labels[:subset_size]

print(train_images_subset.shape)
print(train_labels_subset.shape)

(6000, 784)
(6000,)


a) different feature normalization strategies:
• Min-max
• Z-score

In [7]:
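def z_score_normalizer(values):
  mean = values.mean(axis=0)
  std = values.std(axis=0)
  return (values - mean) / (std + 1e-7)  # Add a small value to avoid division by zero

# The default normalization (v / 255.0) is min-max scaling for pixels in [0, 255].
# Note: each set is standardized here with its own per-pixel mean and standard deviation.
z_scored_train_images_subset, _ = preprocess_data(train_data[:subset_size], z_score_normalizer)
z_scored_test_images, _ = preprocess_data(test_data, z_score_normalizer)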

for model_name in model_classes.keys():
  accuracy = train_and_evaluate(model_classes[model_name](max_iter=1000), train_images_subset, train_labels_subset)
  print(f"{model_name} (Min-max normalization): {accuracy}")
  accuracy = train_and_evaluate(model_classes[model_name](max_iter=1000), z_scored_train_images_subset, train_labels_subset, z_scored_test_images)
  print(f"{model_name} (z-score normalization): {accuracy}")

MyLogisticRegression (Min-max normalization): 0.8924
MyLogisticRegression (z-score normalization): 0.9029
ScikitLogisticRegression (Min-max normalization): 0.8968
ScikitLogisticRegression (z-score normalization): 0.8897
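
In the cell above, the test set is standardized with its own statistics. A common alternative is to compute the mean and standard deviation on the training subset only and reuse them for the test set, so that both sets go through the same transformation. The following is a minimal sketch on synthetic data; `fit_z_score` and `apply_z_score` are hypothetical helpers, and the accuracies reported above were not produced with this variant.

```python
import numpy as np

def fit_z_score(train_values, eps=1e-7):
    """Return the per-feature mean and (std + eps) computed on the training data only."""
    return train_values.mean(axis=0), train_values.std(axis=0) + eps

def apply_z_score(values, mean, std):
    """Standardize `values` with statistics that were fitted elsewhere."""
    return (values - mean) / std

# Synthetic stand-ins for the pixel matrices
rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(100, 4)).astype(float)
test = rng.integers(0, 256, size=(20, 4)).astype(float)

mean, std = fit_z_score(train)
train_z = apply_z_score(train, mean, std)
test_z = apply_z_score(test, mean, std)  # reuses the training statistics

print(train_z.mean(axis=0).round(6), train_z.std(axis=0).round(6))
```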


b) different model regularization values.

In [8]:
# Values of C (inverse regularization strength): a smaller C means stronger regularization
regularization_transformers = {
    "Stronger": 0.01,
    "Strong": 0.1,
    "Default": 1,
    "Weak": 10.0,
    "Weaker": 100.0
}

print("MyLogisticRegression with z-score normalization on:")
for regularization_transformer_key in regularization_transformers:
  model = RegularizedSoftmaxRegression(C=regularization_transformers[regularization_transformer_key])
  accuracy = train_and_evaluate(model, z_scored_train_images_subset, train_labels_subset, z_scored_test_images)
  print(f"{regularization_transformer_key} regularization: {accuracy}")

print("\nScikitLogisticRegression with min-max normalization on:")
for regularization_transformer_key in regularization_transformers:
  model = LogisticRegression(max_iter=1000, C=regularization_transformers[regularization_transformer_key])
  accuracy = train_and_evaluate(model, train_images_subset, train_labels_subset)
  print(f"{regularization_transformer_key} regularization: {accuracy}")

MyLogisticRegression with z-score normalization on:
Stronger regularization: 0.8616
Strong regularization: 0.8906
Default regularization: 0.9029
Weak regularization: 0.9022
Weaker regularization: 0.9023

ScikitLogisticRegression with min-max normalization on:
Stronger regularization: 0.8945
Strong regularization: 0.9067
Default regularization: 0.8968
Weak regularization: 0.8858
Weaker regularization: 0.8799
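
To read the results for MyLogisticRegression, it can help to see which L2 weight each value of C produces. The sketch below mirrors the expression `0.001 * (1.0 / C)` from `RegularizedSoftmaxRegression.__init__`; the function name `l2_strength` is illustrative. scikit-learn applies C as the inverse regularization strength within its own objective, so equal values of C need not correspond to equal penalties in the two models.

```python
def l2_strength(C, scale=0.001):
    # Same expression as lambda_ in RegularizedSoftmaxRegression.__init__
    return scale * (1.0 / C)

c_values = {"Stronger": 0.01, "Strong": 0.1, "Default": 1, "Weak": 10.0, "Weaker": 100.0}
for name, C in c_values.items():
    print(f"{name}: C={C} -> lambda_={l2_strength(C):g}")
```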


c) different stopping criteria and learning rates for your model.

In [10]:
stopping_criteria = {
    "default": {
        "tol": 0.0001,
        "max_iter": 1000
    },
    "stopping sooner": {
        "tol": 0.001,
        "max_iter": 500
    },
    "training longer": {
        "tol": 0.00001,
        "max_iter": 2000
    }
}

for key, criterion in stopping_criteria.items():
  my_model = RegularizedSoftmaxRegression(max_iter=criterion["max_iter"], tol=criterion["tol"])
  my_accuracy = train_and_evaluate(my_model, z_scored_train_images_subset, train_labels_subset, z_scored_test_images)
  print(f"MyLogisticRegression with z-score normalization ({key}): {my_accuracy}")
  sk_model = LogisticRegression(C=0.1, max_iter=criterion["max_iter"], tol=criterion["tol"])
  sk_accuracy = train_and_evaluate(sk_model, train_images_subset, train_labels_subset)
  print(f"ScikitLogisticRegression with min-max normalization ({key}): {sk_accuracy}\n")

MyLogisticRegression with z-score normalization (default): 0.9029
ScikitLogisticRegression with min-max normalization (default): 0.9067

MyLogisticRegression with z-score normalization (stopping sooner): 0.9031
ScikitLogisticRegression with min-max normalization (stopping sooner): 0.9067

MyLogisticRegression with z-score normalization (training longer): 0.9029
ScikitLogisticRegression with min-max normalization (training longer): 0.9067
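
The stopping rule of MyLogisticRegression compares the current loss with the loss from 10 epochs earlier and stops once the relative improvement drops below `tol`. The sketch below isolates that rule and applies it to a synthetic, geometrically decaying loss curve to show how `tol` shifts the stopping epoch. The helpers `should_stop` and `stopping_epoch` and the loss curve are assumptions made for illustration, not output from the model.

```python
def should_stop(losses, tol, window=10, epsilon=1e-9):
    """Return True when the relative improvement over `window` epochs is below tol."""
    if len(losses) <= window:
        return False
    reference = losses[-(window + 1)]
    relative_improvement = (reference - losses[-1]) / (reference + epsilon)
    return relative_improvement < tol

# Synthetic loss curve that decays geometrically towards 0.3
losses = [0.3 + 2.0 * 0.9 ** epoch for epoch in range(200)]

def stopping_epoch(tol):
    """Return the first epoch count at which should_stop fires, or None."""
    for epoch in range(1, len(losses) + 1):
        if should_stop(losses[:epoch], tol):
            return epoch
    return None

for tol in (1e-3, 1e-4, 1e-5):
    print(f"tol={tol}: stops after {stopping_epoch(tol)} epochs")
```

A looser tolerance stops earlier on this curve. When the loss has already flattened, the extra epochs allowed by a tighter tolerance change the parameters very little.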



In [13]:
learning_rates = [0.01, 0.1, 1.0]
for learning_rate in learning_rates:
  model = RegularizedSoftmaxRegression(learning_rate=learning_rate)
  accuracy = train_and_evaluate(model, z_scored_train_images_subset, train_labels_subset, z_scored_test_images)
  print(f"MyLogisticRegression with learning rate of {learning_rate}: {accuracy}")

MyLogisticRegression with learning rate of 0.01: 0.8957
MyLogisticRegression with learning rate of 0.1: 0.9029
MyLogisticRegression with learning rate of 1.0: 0.903
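
One possible reading of the lower accuracy at 0.01 is that, with a fixed budget of iterations, a small step size can leave gradient descent short of the minimum. The toy sketch below illustrates the effect on the one-dimensional function f(w) = 0.5 * w**2. It is not the MNIST model, and `gradient_descent` is a hypothetical helper.

```python
def gradient_descent(learning_rate, steps=100, w0=1.0):
    """Minimize f(w) = 0.5 * w**2, whose gradient is w, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * w
    return w

final_w = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 1.0)}
for lr, w in final_w.items():
    print(f"learning rate {lr}: distance from the minimum after 100 steps = {abs(w):.3g}")
```

On this toy problem the largest rate reaches the minimum fastest. On other loss surfaces a rate that is too large can overshoot or diverge, so the ordering seen here is not general.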
