<a href="https://colab.research.google.com/github/julurisaichandu/nlp/blob/main/perceptron_vs_logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import nltk
from collections import Counter
import numpy as np
nltk.download("punkt")
from sklearn.metrics import precision_score, recall_score, f1_score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Notebook 3: Multilayer Perceptron
===============

CS 6120 Natural Language Processing, Amir



Saichandu Juluri

Saving notebooks as pdfs
----------

Feel free to add cells to this notebook as you wish. Make sure to leave **code that you've written** and any **answers to questions** that you've written in your notebook. Turn in your notebook as a pdf at the end of lecture's day.


To convert your notebook to a pdf for turn in, you'll do the following:
1. Kernel -> Restart & Run All (clear your kernel's memory and run all cells)
2. File -> Download As -> .html -> open in a browser -> print to pdf

(The download as pdf option doesn't preserve formatting and output as nicely as taking the step "through" html, but will do if the above doesn't work for you.)

Task 1: Implement a Multilayer Perceptron for text classification in Torch
-------

In this notebook you will get to implement neural text classifiers using [Torch](https://pytorch.org/), a very popular deep learning framework. You may need to consult the documentation but since you will need to use this framework for the upcoming homework assignments this is an opportunity to get familiarized with it.

The goal is to build neural binary classifiers to predict the toxicity (i.e., toxic vs non-toxic) of a post using data from the [Jigsaw Unintended Bias in Toxicity Classification competition](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) (here we will use a very small subset of the data).

Recall that an MLP with $l$ hidden layers makes predictions as

$z = g(V^{l}\ldots g(V^2g(V^1f(x))))\\$
$P(\hat{y}|x) = \text{softmax}(Wz)$

where $f(x)$ is a feature representation of the input and hidden layer $j$ produces a new feature vector via a linear transformation parameterized by a weight matrix $V^j$ followed by an activation function $g(\cdot)$ (i.e., an elementwise non-linear transformation). We recommend implementing your network using the [nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) module, but you don't have to.
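
As a minimal sketch of the formula above (not the assignment's reference solution), the hidden layers can be stacked in a loop so that an `architecture` list with any number of hidden layers works. The helper name `build_mlp` and the toy sizes are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

def build_mlp(architecture):
    """Stack Linear+ReLU for each hidden layer, then a final Linear+LogSoftmax.

    architecture = [input_dim, hidden_1, ..., hidden_l, num_classes]
    (hypothetical helper, not part of the provided code).
    """
    layers = []
    # Hidden layers: z_j = g(V^j z_{j-1}), here with g = ReLU
    for d_in, d_out in zip(architecture[:-2], architecture[1:-1]):
        layers.append(nn.Linear(d_in, d_out))
        layers.append(nn.ReLU())
    # Output layer: log softmax(W z)
    layers.append(nn.Linear(architecture[-2], architecture[-1]))
    layers.append(nn.LogSoftmax(dim=-1))
    return nn.Sequential(*layers)

toy_model = build_mlp([10, 8, 4, 2])   # two hidden layers
toy_x = torch.rand(10)                 # a single feature vector f(x)
toy_log_probs = toy_model(toy_x)
print(toy_log_probs.exp().sum())       # class probabilities sum to ~1
```

With a two-element list such as `[10, 2]` the loop body never runs, so the same helper falls back to a model with no hidden layer.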

The provided code is based on the feedforward neural network for the XOR problem that we saw in class. We also provide code to read the data and build BOW feature vectors. By default the code subsamples the training/test data to make development faster. Feel free to play with the full dataset if time permits.
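
To make the bag-of-words (BOW) representation concrete, here is a toy sketch. It uses whitespace splitting instead of `nltk.word_tokenize` and drops the frequency filter, so it illustrates the idea under simplified assumptions rather than reproducing the provided feature extractor. The names `toy_docs`, `toy_vocab` and `toy_bow` are made up for this example.

```python
from collections import Counter
import numpy as np

toy_docs = ["you are great", "you are you"]

# Vocabulary: map each word to a column index, most frequent word first
counts = Counter(w for doc in toy_docs for w in doc.lower().split())
toy_vocab = {w: i for i, (w, _) in enumerate(counts.most_common())}

def toy_bow(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for w in doc.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

toy_features = np.array([toy_bow(d, toy_vocab) for d in toy_docs])
print(toy_vocab)
print(toy_features)
```

Each row is one document and each column one vocabulary word, which is the same layout that `train_xs` and `test_xs` have later in the notebook.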

In [None]:
#NEURAL NETWORK DEFINITION

import torch
import torch.nn as nn
from torch import optim
import numpy as np
import random
# fix the randomness to ensure reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed(42)
np.random.seed(42)
random.seed(42)

class MLP(nn.Module):
    """
    Defines the core neural network for doing multiclass classification over a single datapoint at a time. The network can be instantiated with arbitrary architectures (by which we mean number and size of hidden layers)
    e.g., architecture = [1000, 50, 2] is an MLP with input layer of size 1000, hidden layer of size 50 and output layer of size 2

    Recall that the hidden layer is computed as a linear transformation followed by an activation function (i.e., non-linearity)
    Linear transformations are implemented with the nn.Linear()
    Note 1: the input layer should have the same size as the input feature vectors and the output layer should be the number of classes.
    Note 2: be sure to match the input and output dimensions of all the layers
    """
    def __init__(self, architecture):
        """
        Constructs the computation graph by instantiating the various layers and initializing weights.
        :param architecture: dimensions of all the layers (list)
        """
        super(MLP, self).__init__()
        self.architecture = architecture
        self.layers = nn.Sequential(nn.Linear(architecture[0], architecture[1]),
                                    nn.ReLU(),
                                    nn.Linear(architecture[1], architecture[2]),
                                    nn.LogSoftmax(dim=0)
                                    )

    def forward(self, x):
        """
        Runs the neural network on the given data and returns log probabilities of the various classes.

        :param x: a [inp]-sized tensor of input data
        :return: an [out]-sized tensor of log probabilities. (In general your network can be set up to return either log
        probabilities or a tuple of (loss, log probability) if you want to pass in y to this function as well.)
        """
        return self.layers(x)

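    def predict(self, x):
        """Returns the index of the most probable class as a plain Python int."""
        # No gradients are needed at prediction time
        with torch.no_grad():
            log_probs = self.forward(x)
        return torch.argmax(log_probs).item()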

def form_input(x) -> torch.Tensor:
    """
    Form the input to the neural network. In general this may be a complex function that synthesizes multiple pieces
    of data, does some computation, handles batching, etc.

    :param x: a [num_samples x inp] numpy array containing input data
    :return: a [num_samples x inp] Tensor
    """
    return torch.from_numpy(x).float()

def train_model(model, train_xs, train_ys, num_classes, num_epochs, learning_rate):

    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(0, num_epochs):
        ex_indices = [i for i in range(0, len(train_xs))]
        random.shuffle(ex_indices)
        total_loss = 0.0
        for idx in ex_indices:
            x = form_input(train_xs[idx])
            y = train_ys[idx]
            # Build one-hot representation of y. Instead of the label 0 or 1, y_onehot is either [0, 1] or [1, 0]. This
            # way we can take the dot product directly with a probability vector to get class probabilities.
            y_onehot = torch.zeros(num_classes)
            # scatter will write the value of 1 into the position of y_onehot given by y
            y_onehot.scatter_(0, torch.from_numpy(np.asarray(y,dtype=np.int64)), 1)
            # Zero out the gradients from the model object. *THIS IS VERY IMPORTANT TO DO BEFORE CALLING BACKWARD()*
            model.zero_grad()
            log_probs = model.forward(x)
            # Can also use built-in NLLLoss as a shortcut but we're being explicit here
            loss = torch.neg(log_probs).dot(y_onehot)
            # .item() extracts a Python float so the autograd graph is not kept alive across iterations
            total_loss += loss.item()
            # Computes the gradient and takes the optimizer step
            loss.backward()
            optimizer.step()
        print("Total loss on epoch %i: %f" % (epoch, total_loss))
    return model
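
The comment in `train_model` notes that the built-in `NLLLoss` can be used as a shortcut. The sketch below, on made-up logits, suggests that the explicit one-hot dot product and `nn.NLLLoss` give the same value for a single example; the toy numbers are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

log_probs = torch.log_softmax(torch.tensor([0.2, -1.3]), dim=0)
y = 1                                   # gold class index

# Explicit version, as in train_model: -log p(y) via a one-hot dot product
y_onehot = torch.zeros(2)
y_onehot[y] = 1.0
explicit_loss = torch.neg(log_probs).dot(y_onehot)

# Built-in version: NLLLoss expects a batch dimension and integer class targets
builtin_loss = nn.NLLLoss()(log_probs.unsqueeze(0), torch.tensor([y]))

print(explicit_loss.item(), builtin_loss.item())
```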

In [None]:
#FRAMEWORK CODE

def read_data(path, sample_frac=1):
    df = pd.read_csv(path)
    df = df.iloc[:int(len(df)*sample_frac)]
    y = df["label"]
    x = df["comment_text"]
    return x, np.array(y).astype(np.float32)

def build_vocab(X):
    MIN_FREQ = 3
    ct = Counter()
    for x_i in X:
        ct.update(nltk.word_tokenize(x_i.lower().strip()))
    # only keep words longer than 1 character that occur more than MIN_FREQ times
    vocab = {k:i for i,k in enumerate([k for (k,v) in ct.most_common() if v > MIN_FREQ and len(k)>1])}
    return vocab

def build_BOW(X, vocab):
    bows = []
    for x_i in X:
        bow = np.zeros(len(vocab)).astype(np.float32)
        tokens = nltk.word_tokenize(x_i.lower().strip())
        for t in tokens:
            if t in vocab:
                bow[vocab[t]]+=1
        bows.append(bow)
    return np.array(bows)

def report_metrics(classifier, test_data, golds):
    """
    Applies the trained classifier to test data and computes performance
    """
    # int() turns each prediction into a plain label that scikit-learn can consume
    classified = [int(classifier.predict(form_input(data))) for data in test_data]
    print("Precision:", precision_score(golds, classified))
    print("Recall:", recall_score(golds, classified))
    print("F1:", f1_score(golds, classified))
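
`report_metrics` delegates to scikit-learn. As a rough illustration of what those three numbers mean for the positive class (label 1), here is a plain-Python sketch. The function `binary_prf` and the toy labels are hypothetical and not part of the provided code.

```python
def binary_prf(golds, preds, positive=1):
    """Precision, recall and F1 for the positive class, from raw counts."""
    pairs = list(zip(golds, preds))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_golds = [1, 0, 1, 1, 0]
toy_preds = [1, 1, 0, 1, 0]
toy_p, toy_r, toy_f1 = binary_prf(toy_golds, toy_preds)
print(toy_p, toy_r, toy_f1)
```

In this toy case there are two true positives, one false positive and one false negative, so all three scores come out to 2/3.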

In [None]:
#read train/test data
train_docs, train_ys = read_data("data/toxicity_small_train.csv",sample_frac=0.5)
test_x, test_y = read_data("data/toxicity_small_test.csv",sample_frac=0.5)
#build vocabulary
vocab = build_vocab(train_docs)
#extract bag-of-word features
train_xs = build_BOW(train_docs, vocab)
test_xs = build_BOW(test_x, vocab)

#Network definition
input_layer_d = train_xs.shape[1]
hidden_layer_d = 200
num_classes = 2
# this is just an example of how to instantiate the model
model = MLP([input_layer_d, hidden_layer_d, num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model = train_model(model, train_xs, train_ys, num_classes, num_epochs,
                    initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model, test_xs, test_y)


Total loss on epoch 0: 2316.756592
Total loss on epoch 1: 1998.807129
Total loss on epoch 2: 1658.861694
Total loss on epoch 3: 1342.262695
Total loss on epoch 4: 1053.362549
Total loss on epoch 5: 783.003296
Total loss on epoch 6: 590.576477
Total loss on epoch 7: 448.241241
Total loss on epoch 8: 245.942459
Total loss on epoch 9: 248.593674
Total loss on epoch 10: 152.918808
Total loss on epoch 11: 92.979706
Total loss on epoch 12: 71.447441
Total loss on epoch 13: 60.748013
Total loss on epoch 14: 53.105839
Total loss on epoch 15: 49.576965
Total loss on epoch 16: 44.354233
Total loss on epoch 17: 41.094479
Total loss on epoch 18: 36.819019
Total loss on epoch 19: 35.837116
Total loss on epoch 20: 32.902729
Total loss on epoch 21: 31.941465
Total loss on epoch 22: 30.842512
Total loss on epoch 23: 29.314169
Total loss on epoch 24: 28.346806
Total loss on epoch 25: 26.615921
Total loss on epoch 26: 26.814150
Total loss on epoch 27: 25.592453
Total loss on epoch 28: 25.482359
Total lo

**Experimenting with double the initial number of hidden units (400 instead of 200)**

In [None]:
# Experimenting with double the initial number of hidden units (400 instead of 200)

hidden_layer_d = 400
num_classes = 2
# this is just an example of how to instantiate the model
model_doubled = MLP([input_layer_d, hidden_layer_d, num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model_doubled = train_model(model_doubled, train_xs, train_ys, num_classes,
                            num_epochs, initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model_doubled, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model_doubled, test_xs, test_y)


Total loss on epoch 0: 2332.586426
Total loss on epoch 1: 2000.013306
Total loss on epoch 2: 1652.435303
Total loss on epoch 3: 1320.031250
Total loss on epoch 4: 991.057922
Total loss on epoch 5: 753.167847
Total loss on epoch 6: 494.144012
Total loss on epoch 7: 357.883728
Total loss on epoch 8: 276.690125
Total loss on epoch 9: 137.142593
Total loss on epoch 10: 99.156898
Total loss on epoch 11: 78.633606
Total loss on epoch 12: 66.142441
Total loss on epoch 13: 56.996254
Total loss on epoch 14: 50.048901
Total loss on epoch 15: 45.829216
Total loss on epoch 16: 42.156410
Total loss on epoch 17: 39.337246
Total loss on epoch 18: 38.193947
Total loss on epoch 19: 34.715218
Total loss on epoch 20: 33.004803
Total loss on epoch 21: 31.144903
Total loss on epoch 22: 29.691940
Total loss on epoch 23: 28.265812
Total loss on epoch 24: 27.564047
Total loss on epoch 25: 26.822023
Total loss on epoch 26: 25.839792
Total loss on epoch 27: 25.584017
Total loss on epoch 28: 24.097546
Total loss

**Experimenting with four times the initial number of hidden units (800 instead of 200)**

In [None]:
# Experimenting with 800 hidden units, i.e. four times the initial 200
# (note: the variable below is named model_tripled, but 800 = 4 x 200)

hidden_layer_d = 800
num_classes = 2
# this is just an example of how to instantiate the model
model_tripled = MLP([input_layer_d, hidden_layer_d, num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model_tripled = train_model(model_tripled, train_xs, train_ys, num_classes,
                            num_epochs, initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model_tripled, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model_tripled, test_xs, test_y)


Total loss on epoch 0: 2310.184570
Total loss on epoch 1: 1966.795654
Total loss on epoch 2: 1592.843262
Total loss on epoch 3: 1248.381836
Total loss on epoch 4: 952.693970
Total loss on epoch 5: 707.831360
Total loss on epoch 6: 519.451477
Total loss on epoch 7: 253.913666
Total loss on epoch 8: 162.996674
Total loss on epoch 9: 202.486450
Total loss on epoch 10: 229.482941
Total loss on epoch 11: 237.042206
Total loss on epoch 12: 78.159943
Total loss on epoch 13: 61.066521
Total loss on epoch 14: 51.797520
Total loss on epoch 15: 46.039246
Total loss on epoch 16: 41.130043
Total loss on epoch 17: 38.697193
Total loss on epoch 18: 35.953083
Total loss on epoch 19: 32.619350
Total loss on epoch 20: 31.112679
Total loss on epoch 21: 28.341776
Total loss on epoch 22: 29.610992
Total loss on epoch 23: 27.882368
Total loss on epoch 24: 26.641840
Total loss on epoch 25: 25.665567
Total loss on epoch 26: 26.084274
Total loss on epoch 27: 24.053059
Total loss on epoch 28: 24.326094
Total lo

**Logistic regression**

In [None]:
#Network definition
input_layer_d = train_xs.shape[1]
# hidden_layer_d = 1
num_classes = 2

# Logistic Regression class inheriting from MLP
class LogisticRegression(MLP):
    def __init__(self, architecture):
        super(LogisticRegression, self).__init__(architecture)
        # Overriding the layers for logistic regression
        self.layers = nn.Sequential(
            nn.Linear(architecture[0], architecture[1]),  # No hidden layers
            nn.LogSoftmax(dim=0)
        )

# this is just an example of how to instantiate the model
model_logistic = LogisticRegression([input_layer_d, num_classes, num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model_logistic = train_model(model_logistic, train_xs, train_ys, num_classes,
                             num_epochs, initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model_logistic, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model_logistic, test_xs, test_y)


Total loss on epoch 0: 2372.364502
Total loss on epoch 1: 1830.221680
Total loss on epoch 2: 1575.410278
Total loss on epoch 3: 1413.851196
Total loss on epoch 4: 1286.332031
Total loss on epoch 5: 1198.283813
Total loss on epoch 6: 1121.369507
Total loss on epoch 7: 1056.020386
Total loss on epoch 8: 1000.851196
Total loss on epoch 9: 947.921265
Total loss on epoch 10: 914.937744
Total loss on epoch 11: 874.240723
Total loss on epoch 12: 843.739929
Total loss on epoch 13: 815.333374
Total loss on epoch 14: 785.691284
Total loss on epoch 15: 764.618103
Total loss on epoch 16: 737.889709
Total loss on epoch 17: 719.100037
Total loss on epoch 18: 704.876465
Total loss on epoch 19: 681.574036
Total loss on epoch 20: 668.119995
Total loss on epoch 21: 652.833374
Total loss on epoch 22: 635.187866
Total loss on epoch 23: 622.627258
Total loss on epoch 24: 614.867188
Total loss on epoch 25: 597.585693
Total loss on epoch 26: 587.629272
Total loss on epoch 27: 576.284546
Total loss on epoch 2

#### Q1: Experiment with a couple of different MLP architectures. What do you observe?

I tried doubling and quadrupling the number of units in the hidden layer (400 and 800, up from 200). Judging by the training logs, the wider models tend to reach a low training loss in somewhat fewer epochs: at epoch 9, for example, the total loss is about 249 with 200 units but about 137 with 400 units. The trend is not perfectly monotonic, since the 800-unit run shows a temporary bump in the loss around epochs 9 to 11. This behavior suggests that more hidden units give the network more capacity to fit complex patterns in the data.

I also observed that, after 50 epochs, there is no significant change in either the training or the test metrics across the three models.

#### Q2: Compare the performance of your MLPs with Logistic Regression (note that this is just an MLP *without* any hidden layers). Are the results what you expected to see?

When I used logistic regression instead of the MLP, the major difference I observed is that the model takes many more epochs to converge: at epoch 27 its total training loss is still about 576, whereas the MLPs are already at roughly 25.

This result was expected because logistic regression is a linear model, which means it can only learn linear decision boundaries. The MLP with hidden layers, on the other hand, can learn non-linear relationships in the data, which gives it more capacity to model complex patterns.
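
The XOR problem mentioned at the top of the notebook is a compact way to see this difference. The sketch below is an illustration, not part of the experiments: it uses a least-squares linear fit as a stand-in for a linear model, and a 2-2-1 ReLU network whose weights (`V`, `b`, `u`) are set by hand rather than learned.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)      # XOR labels

# Best linear fit (least squares, with a bias column)
X_bias = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
linear_out = X_bias @ w                      # the same score for every input

# Hand-set 2-2-1 ReLU network: h = relu(x V^T + b), out = h . u
V = np.array([[1.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, -1.0])
u = np.array([1.0, -2.0])
mlp_out = np.maximum(X @ V.T + b, 0.0) @ u   # reproduces XOR exactly

print(linear_out)
print(mlp_out)
```

The linear fit cannot tell the four inputs apart, while a single hidden layer with a non-linearity is enough to represent the XOR labels.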

The test performance also did not improve much, even when training for 100 epochs, and remains well below that of the MLP.

#### Q3: Investigate the impact of [initialization](https://pytorch.org/docs/stable/nn.init.html) of weight matrices using your best performing MLP. Compare Xavier (Glorot) initialization with zero initialization.

**Model with Xavier (Glorot) initialization**

In [None]:
class MLPXavierInitialization(MLP):
    def __init__(self, architecture):
        super(MLPXavierInitialization, self).__init__(architecture)
        self.V = nn.Linear(architecture[0],  architecture[1])
        self.g = nn.ReLU()
        self.W = nn.Linear( architecture[1],  architecture[2])
        self.log_softmax = nn.LogSoftmax(dim=0)
        # Initialize weights according to a formula due to Xavier Glorot.
        nn.init.xavier_uniform_(self.V.weight)
        nn.init.xavier_uniform_(self.W.weight)

        self.layers = nn.Sequential(
            self.V,
            self.g,
            self.W,
            self.log_softmax
        )

hidden_layer_d = 800
# this is just an example of how to instantiate the model
model_xavier = MLPXavierInitialization([input_layer_d, hidden_layer_d,
                                        num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model_xavier = train_model(model_xavier, train_xs, train_ys, num_classes,
                           num_epochs, initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model_xavier, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model_xavier, test_xs, test_y)


Total loss on epoch 0: 2309.822754
Total loss on epoch 1: 1831.073730
Total loss on epoch 2: 1358.197021
Total loss on epoch 3: 906.903870
Total loss on epoch 4: 606.730713
Total loss on epoch 5: 348.369324
Total loss on epoch 6: 197.053940
Total loss on epoch 7: 136.775101
Total loss on epoch 8: 96.476120
Total loss on epoch 9: 76.772484
Total loss on epoch 10: 63.179474
Total loss on epoch 11: 55.180912
Total loss on epoch 12: 48.477638
Total loss on epoch 13: 43.828606
Total loss on epoch 14: 40.266731
Total loss on epoch 15: 36.754372
Total loss on epoch 16: 35.036648
Total loss on epoch 17: 32.324081
Total loss on epoch 18: 30.840771
Total loss on epoch 19: 29.423429
Total loss on epoch 20: 27.430243
Total loss on epoch 21: 27.501778
Total loss on epoch 22: 26.976696
Total loss on epoch 23: 25.792900
Total loss on epoch 24: 25.144373
Total loss on epoch 25: 23.878721
Total loss on epoch 26: 22.979839
Total loss on epoch 27: 22.814857
Total loss on epoch 28: 22.677376
Total loss on

**Model with weights initialized to zero**

In [None]:
class MLPZeroInitialization(MLP):
    def __init__(self, architecture):
        super(MLPZeroInitialization, self).__init__(architecture)
        self.V = nn.Linear(architecture[0], architecture[1])
        self.W = nn.Linear(architecture[1], architecture[2])

        # Set the weights to zero
        nn.init.zeros_(self.V.weight)
        nn.init.zeros_(self.W.weight)

        # Activation and output layers
        self.g = nn.ReLU()
        self.log_softmax = nn.LogSoftmax(dim=0)

        # Sequential model
        self.layers = nn.Sequential(
            self.V,
            self.g,
            self.W,
            self.log_softmax
        )

hidden_layer_d = 800
# this is just an example of how to instantiate the model
model_zero = MLPZeroInitialization([input_layer_d, hidden_layer_d, num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model_zero = train_model(model_zero, train_xs, train_ys, num_classes,
                         num_epochs, initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model_zero, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model_zero, test_xs, test_y)


Total loss on epoch 0: 2420.698486
Total loss on epoch 1: 2247.050537
Total loss on epoch 2: 1980.751343
Total loss on epoch 3: 1781.930542
Total loss on epoch 4: 1615.297119
Total loss on epoch 5: 1383.258179
Total loss on epoch 6: 1349.789795
Total loss on epoch 7: 1232.670532
Total loss on epoch 8: 1131.286011
Total loss on epoch 9: 1097.545654
Total loss on epoch 10: 965.167419
Total loss on epoch 11: 848.700073
Total loss on epoch 12: 827.955444
Total loss on epoch 13: 874.957764
Total loss on epoch 14: 781.959839
Total loss on epoch 15: 780.155762
Total loss on epoch 16: 747.509399
Total loss on epoch 17: 607.706787
Total loss on epoch 18: 638.200317
Total loss on epoch 19: 692.903748
Total loss on epoch 20: 612.919861
Total loss on epoch 21: 519.164978
Total loss on epoch 22: 575.266174
Total loss on epoch 23: 662.061279
Total loss on epoch 24: 601.157898
Total loss on epoch 25: 469.870026
Total loss on epoch 26: 394.388977
Total loss on epoch 27: 423.869415
Total loss on epoch 

Comparing my best-performing architecture with Xavier-initialized weights against the same architecture with zero-initialized weights, the starting loss is slightly higher for the zero-initialized model (about 2421 versus 2310 at epoch 0). Over the 50 epochs, the Xavier-initialized model also converges much faster than the zero-initialized one.
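
For reference, `nn.init.xavier_uniform_` draws each weight uniformly from $[-a, a]$ with $a = \text{gain} \cdot \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The short check below uses toy layer sizes chosen only for illustration, not the sizes used in the experiment.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(300, 50)            # toy sizes, chosen only for illustration
nn.init.xavier_uniform_(layer.weight)

fan_in, fan_out = 300, 50
bound = math.sqrt(6.0 / (fan_in + fan_out))   # gain = 1 by default

# Every sampled weight should lie inside [-bound, bound]
print(bound, layer.weight.abs().max().item())
```

Because the bound shrinks as the layer gets wider, the scale of the initial weights adapts to the layer size instead of being fixed.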

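This happens because when the weights are initialized to zero, all the neurons in a layer tend to end up learning the same things, since nothing in their weights breaks the symmetry between them. This makes it harder for the model to learn effectively, leading to slow progress or even getting stuck, as the updates to the model during training don't work well. Note that only the weight matrices are zeroed here; the `nn.Linear` biases keep their default random initialization, which is probably why the model still makes progress at all.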
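
The "getting stuck" case can be seen in a tiny sketch where every parameter is zeroed, biases included. This is a stricter setting than the experiment above, which zeroes only the weights; the layer sizes and input are made up for illustration.

```python
import torch
import torch.nn as nn

V = nn.Linear(3, 4)
W = nn.Linear(4, 2)
# Zero *everything*, biases included (the experiment above zeroes only the weights)
for p in list(V.parameters()) + list(W.parameters()):
    nn.init.zeros_(p)

net = nn.Sequential(V, nn.ReLU(), W, nn.LogSoftmax(dim=0))
x = torch.tensor([1.0, 2.0, 0.0])
loss = -net(x)[1]            # negative log-likelihood of class 1
loss.backward()

print(V.weight.grad)         # all zeros: the hidden layer receives no signal
print(W.weight.grad)         # all zeros: the hidden activations are all zero
print(W.bias.grad)           # non-zero: only the output bias gets an update
```

In this fully zeroed setting only the output bias can change on the first step, so the network initially behaves like a constant classifier.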

Comparing the training and test metrics as well, the Xavier-initialized model scores better than the zero-initialized model.

#### Q4: Investigate the impact of [non-linear activations](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity). See Torch documentation on the available non-linear functions and compare the performance of 3 different functions (e.g., sigmoid, relu and tanh).

**Sigmoid activation**

In [None]:
class MLPSigmoidActivation(MLP):
    def __init__(self, architecture):
        super(MLPSigmoidActivation, self).__init__(architecture)
        self.V = nn.Linear(architecture[0], architecture[1])
        self.W = nn.Linear(architecture[1], architecture[2])

        # Activation and output layers
        self.g = nn.Sigmoid()
        self.log_softmax = nn.LogSoftmax(dim=0)

        # Sequential model
        self.layers = nn.Sequential(
            self.V,
            self.g,
            self.W,
            self.log_softmax
        )

hidden_layer_d = 400

# this is just an example of how to instantiate the model
model_sigmoid = MLPSigmoidActivation([input_layer_d, hidden_layer_d,
                                      num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model_sigmoid = train_model(model_sigmoid, train_xs, train_ys,
                            num_classes, num_epochs, initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model_sigmoid, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model_sigmoid, test_xs, test_y)

Total loss on epoch 0: 2920.830811
Total loss on epoch 1: 2601.025879
Total loss on epoch 2: 2324.942871
Total loss on epoch 3: 2058.722656
Total loss on epoch 4: 1796.940918
Total loss on epoch 5: 1600.729980
Total loss on epoch 6: 1401.272949
Total loss on epoch 7: 1252.659912
Total loss on epoch 8: 1083.487305
Total loss on epoch 9: 987.683350
Total loss on epoch 10: 871.386230
Total loss on epoch 11: 787.112366
Total loss on epoch 12: 714.123413
Total loss on epoch 13: 654.508728
Total loss on epoch 14: 566.223083
Total loss on epoch 15: 500.781921
Total loss on epoch 16: 451.632568
Total loss on epoch 17: 438.014465
Total loss on epoch 18: 404.248138
Total loss on epoch 19: 346.210236
Total loss on epoch 20: 325.901978
Total loss on epoch 21: 304.563660
Total loss on epoch 22: 283.918793
Total loss on epoch 23: 269.772186
Total loss on epoch 24: 267.113556
Total loss on epoch 25: 246.037796
Total loss on epoch 26: 224.568817
Total loss on epoch 27: 197.862778
Total loss on epoch 2

**Tanh activation**

In [None]:
class MLPTanhActivation(MLP):
    def __init__(self, architecture):
        super(MLPTanhActivation, self).__init__(architecture)
        self.V = nn.Linear(architecture[0], architecture[1])
        self.W = nn.Linear(architecture[1], architecture[2])

        # Activation and output layers
        self.g = nn.Tanh()
        self.log_softmax = nn.LogSoftmax(dim=0)

        # Sequential model
        self.layers = nn.Sequential(
            self.V,
            self.g,
            self.W,
            self.log_softmax
        )

hidden_layer_d = 400

# this is just an example of how to instantiate the model
model_tanh = MLPTanhActivation([input_layer_d, hidden_layer_d, num_classes])

# RUN TRAINING AND TEST
num_epochs = 50
initial_learning_rate = 0.01
model_tanh = train_model(model_tanh, train_xs, train_ys, num_classes,
                         num_epochs, initial_learning_rate)

print(" === Train Set Performance === ")
report_metrics(model_tanh, train_xs, train_ys)

print(" === Test Set Performance === ")
report_metrics(model_tanh, test_xs, test_y)

Total loss on epoch 0: 2301.167725
Total loss on epoch 1: 1955.876709
Total loss on epoch 2: 1700.426514
Total loss on epoch 3: 1472.700928
Total loss on epoch 4: 1231.246216
Total loss on epoch 5: 1059.766724
Total loss on epoch 6: 896.388672
Total loss on epoch 7: 752.691711
Total loss on epoch 8: 662.736755
Total loss on epoch 9: 585.227234
Total loss on epoch 10: 466.390411
Total loss on epoch 11: 386.509918
Total loss on epoch 12: 301.370300
Total loss on epoch 13: 227.525055
Total loss on epoch 14: 265.346619
Total loss on epoch 15: 203.323212
Total loss on epoch 16: 238.941818
Total loss on epoch 17: 160.961197
Total loss on epoch 18: 106.675034
Total loss on epoch 19: 83.824455
Total loss on epoch 20: 69.414474
Total loss on epoch 21: 62.745872
Total loss on epoch 22: 54.857712
Total loss on epoch 23: 51.321255
Total loss on epoch 24: 47.213696
Total loss on epoch 25: 42.860287
Total loss on epoch 26: 41.371277
Total loss on epoch 27: 38.804268
Total loss on epoch 28: 39.515381

Comparing the three models with sigmoid, tanh and ReLU activations (the ReLU model is the one defined for the first question), I observed the following. At epoch 27 with 400 hidden units, for instance, the total training loss is about 198 with sigmoid, about 39 with tanh and about 26 with ReLU.

**With Sigmoid**
Sigmoid squeezes its input into the range 0 to 1. It did not work as well because the model learned slowly: sigmoid can make the updates to the model's weights very small, which slows down learning.

**With Tanh**
Tanh outputs values between -1 and 1. It worked better than sigmoid because it produced stronger updates to the model, but it still shared some of the same issues, such as slow learning.

**With ReLU**
ReLU performed the best. It sets negative values to zero and leaves positive values unchanged, which helped the model learn faster and avoid the problems sigmoid has.
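
One way to see why the size of the updates differs is to compare the derivatives of the three activations, since backpropagated gradients are multiplied by them. The sketch below evaluates the standard derivative formulas on a grid of pre-activation values; the grid range is an arbitrary choice for illustration.

```python
import numpy as np

z = np.linspace(-6.0, 6.0, 1201)           # pre-activation values

sigmoid = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sigmoid * (1.0 - sigmoid)      # peaks at 0.25 when z = 0
d_tanh = 1.0 - np.tanh(z) ** 2             # peaks at 1.0 when z = 0
d_relu = (z > 0).astype(float)             # exactly 1 for every positive z

print(d_sigmoid.max(), d_tanh.max(), d_relu.max())
print(d_sigmoid[-1], d_tanh[-1], d_relu[-1])   # derivatives at z = 6
```

The sigmoid derivative never exceeds 0.25 and, like the tanh derivative, shrinks toward zero for large inputs, whereas the ReLU derivative stays at 1 for any positive input.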