# Setup

First, set up the required Python modules and perform some general configuration.

(This part of the code follows the [CNN notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb) that you should already be familiar with.)

Install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets
* [`evaluate`](https://huggingface.co/docs/evaluate/index) is a library for easily evaluating machine learning models and datasets
* [`accelerate`](https://pypi.org/project/accelerate/) is a wrapper we need to install in order to train torch models using a transformers trainer

Both `transformers` and `datasets` are used extensively on this course.

In [72]:
!pip3 install -q transformers datasets evaluate accelerate

(Above, the `!` at the start of the line tells the notebook to run the line as an operating system command rather than Python code, and the `-q` argument to `pip` runs the command in "quiet" mode, with less output.)

We'll also use the [`pprint`](https://docs.python.org/3/library/pprint.html) ("pretty-print") module to format output more readably below. The only difference to just using `print` is that some data structures will be easier to read and interpret.

In [73]:
from pprint import PrettyPrinter

pprint = PrettyPrinter(compact=True).pprint

Finally, we will reduce logging output. The `transformers` library by default produces fairly verbose logging. Commenting out the following code will enable low-priority output (`INFO` logging level and below).

In [74]:
import logging

logging.disable(logging.INFO)
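
To see what `logging.disable(logging.INFO)` does in isolation, here is a minimal standard-library sketch. The logger name `demo` and the in-memory stream are illustrative assumptions, not part of the notebook; the point is only that messages at `INFO` level and below are dropped while warnings still get through.

```python
import io
import logging

# Send log output to an in-memory buffer so we can inspect it
stream = io.StringIO()
logger = logging.getLogger("demo")          # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.DEBUG)
logger.propagate = False                    # keep output out of the root logger

# Suppress everything at INFO level and below, for all loggers
logging.disable(logging.INFO)

logger.info("this message is suppressed")
logger.warning("this message is kept")

captured = stream.getvalue()
print(captured)
```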

---

# Download and prepare data

We will again use the `datasets` library function [`load_dataset`](https://huggingface.co/docs/datasets/master/en/package_reference/loading_methods#datasets.load_dataset) to load a dataset for our experiments.

In [75]:
import datasets


dataset = datasets.load_dataset("imdb")

Let's see what the dataset contains:

In [76]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


and print out an example:

In [77]:
dataset = dataset.shuffle()
del dataset["unsupervised"]
pprint(dataset["train"][0])

{'label': 1,
 'text': 'There is no relation at all between Fortier and Profiler but the '
         'fact that both are police series about violent crimes. Profiler '
         'looks crispy, Fortier looks classic. Profiler plots are quite '
         "simple. Fortier's plot are far more complicated... Fortier looks "
         'more like Prime Suspect, if we have to spot similarities... The main '
         'character is weak and weirdo, but have "clairvoyance". People like '
         'to compare, to judge, to evaluate. How about just enjoying? Funny '
         'thing too, people writing Fortier looks American but, on the other '
         "hand, arguing they prefer American series (!!!). Maybe it's the "
         'language, or the spirit, but I think this series is more English '
         'than American. By the way, the actors are really good and funny. The '
         'acting is not superficial at all...'}


---

# Tokenize and vectorize data

(This part of the code follows the [CNN notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb) that you should already be familiar with.)

To tokenize and vectorize the texts of our dataset, we will again use previously created tokenizers through the simple [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) class.

The [`AutoTokenizer.from_pretrained`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer.from_pretrained) function can load the tokenizer associated with any of the large number of models found in the [Hugging Face models repository](https://huggingface.co/models). Here, our texts are in English, and we'll load the tokenizer for the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model.

(**Note**: we're not actually using the BERT model here, just its tokenizer.)

In [78]:
import transformers

model_name = "bert-base-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)



As in the [CNN notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb), we will define a simple tokenization function and tokenize and vectorize our whole dataset with the tokenizer by calling the [`Dataset.map`](https://huggingface.co/docs/datasets/v2.14.4/en/package_reference/main_classes#datasets.Dataset.map) function.

Note that here we're providing a `max_length` argument and `truncation=True` in the tokenizer call. This limits the maximum length of outputs to the given length (see the [tokenizers documentation](https://huggingface.co/docs/transformers/preprocessing#everything-you-always-wanted-to-know-about-padding-and-truncation) for details). This makes training faster, potentially at some cost in performance.

In [79]:
# Define a simple function that applies the tokenizer
def tokenize(example):
    return tokenizer(
        example["text"],
        max_length=128,
        truncation=True,
    )

# Apply the tokenizer to the whole dataset using .map()
dataset = dataset.map(tokenize)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]
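
As a rough illustration of what `max_length` and `truncation=True` do, the sketch below truncates lists of token IDs in plain Python. The helper `truncate_ids` and the toy IDs are hypothetical, and the special-token IDs are assumed to be 101 (`[CLS]`) and 102 (`[SEP]`) as in BERT's vocabulary; the real tokenizer handles all of this internally.

```python
CLS_ID, SEP_ID = 101, 102  # assumed IDs of BERT's [CLS] and [SEP] tokens

def truncate_ids(token_ids, max_length):
    """Toy stand-in for truncation: keep at most max_length IDs,
    counting the [CLS] and [SEP] tokens added around the text."""
    budget = max_length - 2          # room left for ordinary tokens
    return [CLS_ID] + token_ids[:budget] + [SEP_ID]

long_text_ids = list(range(1000, 1300))    # 300 made-up token IDs
short_text_ids = list(range(1000, 1010))   # 10 made-up token IDs

print(len(truncate_ids(long_text_ids, 128)))   # 128: cut down to the limit
print(len(truncate_ids(short_text_ids, 128)))  # 12: short inputs stay short
```

Truncation only shortens long inputs. Padding short inputs to a common length happens later, per batch, in the data collator.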

---

# Build model

As usual, we will create a PyTorch model class with an `__init__()` function that creates the layers and a `forward()` function which implements the actual computation. For more information on these, please see the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

We're here building a simple RNN with the following structure:

* As in the [CNN](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb), the token IDs are first mapped to embeddings of a user-specified size (`config.embedding_dim`) in a [torch.nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer. Note that here the embeddings are initialized randomly and learned along with other model weights. In real-world applications, the embeddings would typically be initialized with previously learned weights.
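* Second, the embedded inputs are passed through a recurrent layer. The original notebook used [torch.nn.RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html); the code below uses a bidirectional [torch.nn.LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) instead (see the changes listed at the end of this notebook). The layer produces a series of outputs ($(y_1, \ldots, y_n)$, where $n$ is the length of the input) and the final hidden state $h_n$ (plus, for the LSTM, the final cell state $c_n$). Here, we will only use the last output $y_n$.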
* Finally, there is a fully connected layer ([torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)) that maps the last RNN output to the two possible output values of the classifier.

We can interpret this model as processing the input step by step, attempting to identify tokens (embeddings) that in the context of its previous input express either positive or negative opinions, and to output a value at the end of the sequence that can be mapped to the positive or negative class.

In the `forward` function we mostly just pass the input through the layers, with the following additional steps:

* To invoke the RNN, we provide the value of the initial hidden state $h_0$ (and, for an LSTM, the initial cell state $c_0$). Here we simply use [torch.zeros](https://pytorch.org/docs/stable/generated/torch.zeros.html) to create tensors of the appropriate size filled with zeros.
* To get the value of the last item in the sequence of RNN outputs (`rnn_outputs`), we slice the three-dimensional tensor (batch, rnn step, output dim) with `rnn_outputs[:, -1, :]`. This returns all values in the first and last dimensions, and the last in the second. If you are not familiar with this syntax, consider the following example:

In [80]:
import numpy

a = numpy.array([
  [[11], [12], [13]],
  [[21], [22], [23]],
  [[31], [32], [33]],
])

print(a[:, -1, :])

[[13]
 [23]
 [33]]


(The two points above can be considered technical details and understanding them in detail is not required to understand the model.)

Here's the model:

In [87]:
import torch


# This gives a new name to the config class, just for convenience
BasicConfig = transformers.PretrainedConfig


# This is the model
class SimpleRNN(transformers.PreTrainedModel):

    config_class = BasicConfig

    # In the initialization method, one instantiates the layers
    # these will be the parameters of the model
    def __init__(self, config):
        super().__init__(config)
        # Embedding layer: vocab size x embedding dim
        self.embeddings = torch.nn.Embedding(
            num_embeddings=config.vocab_size,
            embedding_dim=config.embedding_dim
        )
        # Bidirectional LSTM with configurable hidden size and number of layers
        self.rnn = torch.nn.LSTM(
            input_size=config.embedding_dim,
            hidden_size=config.hidden_size,
            num_layers=config.num_layers,
            batch_first=True,
            bidirectional=True # added for bidirectionality
        )
        # Output layer: 2 * hidden size (both directions concatenated) to output size
        self.output_layer = torch.nn.Linear(
            in_features=config.hidden_size*2,
            out_features=config.num_labels
        )
        # Loss function: standard loss for classification
        self.loss = torch.nn.CrossEntropyLoss()



    # Shape notation used below (from the torch.nn.LSTM documentation):
    #   N      = batch size
    #   L      = sequence length
    #   D      = 2 if bidirectional=True, otherwise 1
    #   H_in   = input_size
    #   H_cell = hidden_size
    #   H_out  = proj_size if proj_size > 0, otherwise hidden_size
    def forward(self, input_ids, labels=None, attention_mask=None):
        # Embed input ids
        x = self.embeddings(input_ids)
        # Set initial hidden state to all-zero values
        batch_size = x.shape[0]
        # h_0: tensor of shape (D * num_layers, H_out) for unbatched input or
        # (D * num_layers, N, H_out) for batched input, containing the initial
        # hidden state for each element in the input sequence.
        # Defaults to zeros if (h_0, c_0) is not provided.
        h0 = torch.zeros(
            (self.config.num_layers*2, batch_size, self.config.hidden_size),
            device=input_ids.device    # place on same device as input
        )
        # c_0: tensor of shape (D * num_layers, H_cell) for unbatched input or
        # (D * num_layers, N, H_cell) for batched input, containing the initial
        # cell state for each element in the input sequence.
        # Defaults to zeros if (h_0, c_0) is not provided.
        c0 = torch.zeros(
            (self.config.num_layers*2, batch_size, self.config.hidden_size),
            device=input_ids.device
        )
        # Run RNN repeatedly to get sequence of outputs and last hidden state
        rnn_outputs, (h_n, c_n) = self.rnn(x, (h0,c0))
        # Get last RNN output
        y_n = rnn_outputs[:, -1, :]
        # Map to outputs with fully connected layer
        output = self.output_layer(y_n)

        # Return value computed as in MLP and CNN:
        if labels is not None:
            # We have labels, so we can calculate the loss
            return (self.loss(output,labels), output)
        else:
            # No labels, so just return the output
            return (output,)
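
The tensor shapes involved in a bidirectional, stacked LSTM can be checked with a small standalone sketch. The batch size and sequence length below are arbitrary assumptions; the layer sizes match the configuration used later in this notebook. The last two checks also show a property worth keeping in mind when reading `rnn_outputs[:, -1, :]`: at the last time step the forward half of the output is the forward direction's final state, whereas the backward direction's final state is found at the first time step.

```python
import torch

torch.manual_seed(0)

N, L = 4, 10                      # batch size and sequence length (arbitrary)
embedding_dim, hidden_size, num_layers = 64, 96, 2

lstm = torch.nn.LSTM(
    input_size=embedding_dim,
    hidden_size=hidden_size,
    num_layers=num_layers,
    batch_first=True,
    bidirectional=True,
)

x = torch.randn(N, L, embedding_dim)      # stand-in for embedded inputs
outputs, (h_n, c_n) = lstm(x)             # initial states default to zeros

# Outputs concatenate both directions: (N, L, 2 * hidden_size)
assert outputs.shape == (N, L, 2 * hidden_size)
# Final states have one entry per layer and direction
assert h_n.shape == (2 * num_layers, N, hidden_size)
assert c_n.shape == h_n.shape

# Passing explicit all-zero initial states gives the same result
zeros = torch.zeros(2 * num_layers, N, hidden_size)
outputs_explicit, _ = lstm(x, (zeros, zeros))
assert torch.allclose(outputs, outputs_explicit)

# Top layer, forward direction: final state sits at the last time step
assert torch.allclose(outputs[:, -1, :hidden_size], h_n[-2])
# Top layer, backward direction: final state sits at the first time step
assert torch.allclose(outputs[:, 0, hidden_size:], h_n[-1])
```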

---

# Configure and train model

We'll first configure and instantiate the model. Here `vocab_size` should always be the vocabulary size of the tokenizer and `num_labels` the number of unique labels in the data (as here), but the others are hyperparameters that you can choose:

* `embedding_dim`: the size of the word (i.e. token) embeddings
* `hidden_size`: the size of the RNN hidden state vector _h_
* `num_layers`: number of stacked RNN layers
* `nonlinearity`: the non-linear function to apply in a vanilla `torch.nn.RNN` (`'tanh'` or `'relu'`); `torch.nn.LSTM` does not accept this argument, so it is not set below

In [88]:
# num_layers: number of recurrent layers.
# E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM,
# with the second LSTM taking in outputs of the first LSTM and computing the final results.
# Default: 1

config = BasicConfig(
    vocab_size = tokenizer.vocab_size,
    num_labels = len(set(dataset["train"]["label"])),
    embedding_dim = 64,
    hidden_size = 96,

    num_layers = 2, # changed from 1 to 2
    # nonlinearity = "tanh",  # not needed: torch.nn.LSTM does not accept this parameter
)

model = SimpleRNN(config)

Training arguments are set as in the [CNN notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb). Many of these settings relate to the frequency of evaluation and output during training, but the following are hyperparameters that you may wish to adjust:

* `learning_rate`: the step size for weight updates
* `per_device_train_batch_size`: number of examples per batch
* `max_steps`: the maximum number of steps to train for

In [89]:
# Set training arguments
trainer_args = transformers.TrainingArguments(
    "checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    load_best_model_at_end=True,
    eval_steps=500,
    logging_steps=500,
    learning_rate=0.001,
    per_device_train_batch_size=8,
    max_steps=2500,
)
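
The relation between `max_steps`, batch size, and epochs is simple arithmetic, sketched below. It assumes a single device and no gradient accumulation (neither is configured above); the training set size is the 25,000 examples reported for the IMDB training split earlier.

```python
train_examples = 25000   # size of the IMDB training split
batch_size = 8           # per_device_train_batch_size
max_steps = 2500

examples_seen = batch_size * max_steps        # 20000
epochs = examples_seen / train_examples       # 0.8

print(f"{examples_seen} examples seen, i.e. {epochs} epochs")
```

With these settings, training stops before one full pass over the training data, which matches the `'epoch': 0.8` value reported by the trainer.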



We'll then define the standard accuracy metric (ratio of correct out of all predictions), create a [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) to pad inputs to the same length (as required for batching) and an [EarlyStoppingCallback](https://huggingface.co/docs/transformers/main_classes/callback#transformers.EarlyStoppingCallback) to stop training when performance fails to improve for the given number of evaluations.

(These should all be familiar to you from the [CNN notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb))

In [90]:
import evaluate
accuracy = evaluate.load("accuracy")


def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)

data_collator = transformers.DataCollatorWithPadding(tokenizer)

# Argument gives the number of evaluations without improvement tolerated before early stopping
early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=5
)
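
The patience mechanism can be sketched in a few lines of plain Python. This is a simplified illustration, not the `transformers` implementation: the function name `first_stop_index` and the loss values are made up, and it assumes that lower is better (as for evaluation loss).

```python
def first_stop_index(eval_losses, patience):
    """Return the index of the evaluation at which training would stop,
    or None if patience never runs out. Lower loss counts as better."""
    best = float("inf")
    bad_evals = 0                     # evaluations since the last improvement
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Made-up loss curve: improves twice, then stalls
losses = [0.69, 0.66, 0.67, 0.68, 0.66, 0.70]
print(first_stop_index(losses, patience=3))   # 4
print(first_stop_index(losses, patience=5))   # None
```

Note that with `eval_steps=500` and `max_steps=2500` there are only five evaluations, and the first one always counts as an improvement, so a patience of 5 cannot run out in the run configured here.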

Finally, as in the [CNN notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb), we'll create a simple custom [callback](https://huggingface.co/docs/transformers/main_classes/callback) to store values logged during training so that we can more easily examine them later. (This is only needed for visualization and is not necessary to understand in detail.)

In [91]:
from collections import defaultdict

class LogSavingCallback(transformers.TrainerCallback):
    def on_train_begin(self, *args, **kwargs):
        self.logs = defaultdict(list)
        self.training = True

    def on_train_end(self, *args, **kwargs):
        self.training = False

    def on_log(self, args, state, control, logs, model=None, **kwargs):
        if self.training:
            for k, v in logs.items():
                if k != "epoch" or v not in self.logs[k]:
                    self.logs[k].append(v)

training_logs = LogSavingCallback()
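
To make the bookkeeping concrete, the sketch below replays the same `defaultdict` logic on hand-written log dictionaries. The dictionaries are invented for illustration and only mimic the kind of entries the trainer logs.

```python
from collections import defaultdict

logs_by_key = defaultdict(list)

# Invented log entries: one training log and one evaluation log per step
fake_logs = [
    {"loss": 0.69, "epoch": 0.16},
    {"eval_loss": 0.69, "eval_accuracy": 0.52, "epoch": 0.16},
    {"loss": 0.68, "epoch": 0.32},
    {"eval_loss": 0.68, "eval_accuracy": 0.55, "epoch": 0.32},
]

for logs in fake_logs:
    for k, v in logs.items():
        # Same condition as in on_log: skip repeated epoch values
        if k != "epoch" or v not in logs_by_key[k]:
            logs_by_key[k].append(v)

print(dict(logs_by_key))
```

Each key ends up with one value per logging step, and the epoch value is stored once even though it appears in both the training and the evaluation log.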

We then pass the model, trainer arguments, training and evaluation data, metric, the collator, and the callbacks to a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) and call `.train()` to train the model.

In [92]:
trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    callbacks=[early_stopping, training_logs]
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6903,0.691788,0.52372
1000,0.6876,0.686373,0.54768
1500,0.6837,0.681575,0.56908
2000,0.6732,0.661619,0.6116
2500,0.65,0.651865,0.6292


TrainOutput(global_step=2500, training_loss=0.6769697509765625, metrics={'train_runtime': 86.0568, 'train_samples_per_second': 232.405, 'train_steps_per_second': 29.051, 'total_flos': 5337937920000.0, 'train_loss': 0.6769697509765625, 'epoch': 0.8})

---

# Results

Evaluate and print out results:

In [93]:
eval_results = trainer.evaluate(dataset["test"])

pprint(eval_results)

print('Accuracy:', eval_results['eval_accuracy'])

{'epoch': 0.8,
 'eval_accuracy': 0.6292,
 'eval_loss': 0.6518653631210327,
 'eval_runtime': 13.8098,
 'eval_samples_per_second': 1810.311,
 'eval_steps_per_second': 226.289}
Accuracy: 0.6292


Let's also have a look at training and evaluation loss and evaluation accuracy progression as we did in the [CNN notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_cnn.ipynb). (The code here is only for visualization and you do not need to understand it, but you should aim to be able to interpret the plots.)

## CHANGES MADE AND RESULTS

### a) Start from last week's RNN notebook, and update it to use an LSTM cell instead of the vanilla RNN cell. You can begin by simply replacing torch.nn.RNN with torch.nn.LSTM. However, expect a few errors. Try to debug the code by examining the error messages and referring to the RNN and LSTM cell documentation.

Changed `torch.nn.RNN` to `torch.nn.LSTM` (the layer is still stored in `self.rnn`):

```python
self.rnn = torch.nn.LSTM(
```

Removed the `nonlinearity` parameter from the layer constructor and from the config:

```python
self.rnn = torch.nn.LSTM(
    input_size=config.embedding_dim,
    hidden_size=config.hidden_size,
    num_layers=config.num_layers,
    # nonlinearity=config.nonlinearity,  # removed: torch.nn.LSTM does not accept it
    batch_first=True
)
```
                          
Modified the `def forward`

```python
c0 = torch.zeros(
    (self.config.num_layers, batch_size, self.config.hidden_size),
    device=input_ids.device
)
# Run RNN repeatedly to get sequence of outputs and last hidden state
rnn_outputs, (h_n, c_n) = self.rnn(x, (h0,c0))
```
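Added the initial cell state `c0` and the returned final cell state `c_n`; the `self.rnn` call now takes the tuple `(h0, c0)` and returns `(h_n, c_n)` alongside the outputs.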
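
The differences between the two layers that cause these errors can be reproduced in isolation. The sketch below uses arbitrary small sizes (an assumption for illustration) to show that `torch.nn.RNN` returns `(output, h_n)` while `torch.nn.LSTM` returns `(output, (h_n, c_n))`, and that `torch.nn.LSTM` rejects the `nonlinearity` argument.

```python
import torch

x = torch.randn(2, 5, 8)   # (batch, sequence length, input size), arbitrary

rnn = torch.nn.RNN(input_size=8, hidden_size=16, nonlinearity="tanh", batch_first=True)
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Vanilla RNN: the second return value is the final hidden state
rnn_out, h_n = rnn(x)

# LSTM: the second return value is a tuple of final hidden and cell states
lstm_out, (h_n_lstm, c_n_lstm) = lstm(x)

print(rnn_out.shape, h_n.shape)
print(lstm_out.shape, h_n_lstm.shape, c_n_lstm.shape)

# LSTM has no `nonlinearity` option, so passing it fails at construction
try:
    torch.nn.LSTM(input_size=8, hidden_size=16, nonlinearity="tanh")
    accepted = True
except TypeError:
    accepted = False
print("LSTM accepts nonlinearity:", accepted)
```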

#### RESULTS OF CHANGES

```python3
{'epoch': 0.8,
 'eval_accuracy': 0.57432,
 'eval_loss': 0.6802768707275391,
 'eval_runtime': 17.2538,
 'eval_samples_per_second': 1448.958,
 'eval_steps_per_second': 181.12}
Accuracy: 0.57432
```

### b) After successfully training with the LSTM cell, attempt to enhance performance by implementing bidirectional and stacked architectures. Once again, few errors are to be expected.

Changed `num_layers` from 1 to 2:

```python

# num_layers: number of recurrent layers.
# E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM,
# with the second LSTM taking in outputs of the first LSTM and computing the final results.
# Default: 1

config = BasicConfig(
    vocab_size = tokenizer.vocab_size,
    num_labels = len(set(dataset["train"]["label"])),
    embedding_dim = 64,
    hidden_size = 96,

    num_layers = 2, # changed from 1 to 2
    # nonlinearity = "tanh",  # not needed: torch.nn.LSTM does not accept this parameter
)

model = SimpleRNN(config)

```

#### RESULTS OF CHANGES

```python3
{'epoch': 0.8,
 'eval_accuracy': 0.51548,
 'eval_loss': 0.6923003792762756,
 'eval_runtime': 11.5596,
 'eval_samples_per_second': 2162.695,
 'eval_steps_per_second': 270.337}
Accuracy: 0.51548
```

Changed LSTM to bidirectional with changes in:

`def __init__`

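Set `bidirectional=True` in the `torch.nn.LSTM` constructor. The outputs of the two directions are concatenated, so the number of directions `D` goes from 1 to 2 and the output layer's `in_features` becomes `config.hidden_size*2`: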

```python

# Output layer: 2 * hidden size (both directions concatenated) to output size
self.output_layer = torch.nn.Linear(
  in_features=config.hidden_size*2,
  out_features=config.num_labels
)
```

`def forward`

The initial states `h0` and `c0` need one entry per layer and direction, so their first dimension becomes `num_layers*2` (again `D` = 2):

```python
# h_0: tensor of shape (D * num_layers, H_out) for unbatched input or
# (D * num_layers, N, H_out) for batched input, containing the initial
# hidden state for each element in the input sequence.
# Defaults to zeros if (h_0, c_0) is not provided.
h0 = torch.zeros(
    (self.config.num_layers*2, batch_size, self.config.hidden_size),
    device=input_ids.device    # place on same device as input
)
# c_0: tensor of shape (D * num_layers, H_cell) for unbatched input or
# (D * num_layers, N, H_cell) for batched input, containing the initial
# cell state for each element in the input sequence.
# Defaults to zeros if (h_0, c_0) is not provided.
c0 = torch.zeros(
    (self.config.num_layers*2, batch_size, self.config.hidden_size),
    device=input_ids.device
)
```

#### RESULTS OF CHANGES



```python3
{'epoch': 0.8,
 'eval_accuracy': 0.6292,
 'eval_loss': 0.6518653631210327,
 'eval_runtime': 13.8098,
 'eval_samples_per_second': 1810.311,
 'eval_steps_per_second': 226.289}
Accuracy: 0.6292
```


## OPTUNA STUDY FOR LEARNING RATE AND NUMBER OF LAYERS FOR BIDIRECTIONAL LSTM

In [96]:
import optuna

In [101]:
def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 5e-4, 5e-2, log=True)
    num_layers = trial.suggest_int("num_layers",1,5)

    config = BasicConfig(
      vocab_size = tokenizer.vocab_size,
      num_labels = len(set(dataset["train"]["label"])),
      embedding_dim = 64,
      hidden_size = 96,
      num_layers = num_layers
    )

    model = SimpleRNN(config)

    # Set training arguments
    trainer_args = transformers.TrainingArguments(
        "checkpoints",
        evaluation_strategy="steps",
        logging_strategy="steps",
        load_best_model_at_end=True,
        eval_steps=500,
        logging_steps=500,
        learning_rate=learning_rate, # <--- parameter goes here
        per_device_train_batch_size=8,
        max_steps=2500,
    )

    trainer = transformers.Trainer(
        model=model,
        args=trainer_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        compute_metrics=compute_accuracy,
        data_collator=data_collator,
        callbacks=[transformers.EarlyStoppingCallback(early_stopping_patience=5), LogSavingCallback()]
    )

    trainer.train()
    eval_results = trainer.evaluate(dataset["test"])
    print('Learning rate:', learning_rate, 'Number of layers:', num_layers, 'Accuracy:', eval_results['eval_accuracy'])
    return eval_results['eval_accuracy']



study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=7) # <--- How many trials we run, more would be needed in real case!

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6904,0.698311,0.50068
1000,0.6897,0.688449,0.5354
1500,0.6791,0.672322,0.60604
2000,0.6539,0.648847,0.63744
2500,0.6287,0.62736,0.66728


max_steps is given, it will override any value given in num_train_epochs


Learning rate: 0.001785342739404788 Number of layers: 1 Accuracy: 0.66728


Step,Training Loss,Validation Loss,Accuracy
500,0.6956,0.692671,0.51036
1000,0.6936,0.694798,0.51036
1500,0.6937,0.692901,0.51036
2000,0.6931,0.692592,0.51036
2500,0.6929,0.692889,0.5


max_steps is given, it will override any value given in num_train_epochs


Learning rate: 0.012573706789392404 Number of layers: 2 Accuracy: 0.51036


Step,Training Loss,Validation Loss,Accuracy
500,0.695,0.692698,0.51028
1000,0.6933,0.694344,0.50016
1500,0.6934,0.692916,0.51036
2000,0.6931,0.692652,0.51036
2500,0.6928,0.692886,0.5


max_steps is given, it will override any value given in num_train_epochs


Learning rate: 0.008192600104922971 Number of layers: 3 Accuracy: 0.51036


Step,Training Loss,Validation Loss,Accuracy
500,0.692,0.691874,0.5258
1000,0.6947,0.695432,0.51076
1500,0.6938,0.691803,0.5214
2000,0.6868,0.679185,0.57564
2500,0.6738,0.673603,0.59036


max_steps is given, it will override any value given in num_train_epochs


Learning rate: 0.0007987956655643505 Number of layers: 1 Accuracy: 0.59036


Step,Training Loss,Validation Loss,Accuracy
500,0.691,0.691095,0.53432
1000,0.6863,0.676524,0.58388
1500,0.6843,0.680849,0.57052
2000,0.6705,0.658928,0.62496
2500,0.6488,0.656658,0.6244


max_steps is given, it will override any value given in num_train_epochs


Learning rate: 0.0005474328674891419 Number of layers: 3 Accuracy: 0.6244


Step,Training Loss,Validation Loss,Accuracy
500,0.6926,0.693651,0.5
1000,0.6931,0.693926,0.49992
1500,0.6934,0.693293,0.49996
2000,0.6933,0.69316,0.5
2500,0.6932,0.693235,0.5


max_steps is given, it will override any value given in num_train_epochs


Learning rate: 0.0021293420185889444 Number of layers: 3 Accuracy: 0.5


Step,Training Loss,Validation Loss,Accuracy
500,0.6912,0.700542,0.50008
1000,0.6908,0.691456,0.51348
1500,0.6906,0.689649,0.53824
2000,0.6889,0.685262,0.55676
2500,0.683,0.686448,0.53692


Learning rate: 0.0010050086030851926 Number of layers: 3 Accuracy: 0.55676


Learning rate: 0.001785342739404788 Number of layers: 1 Accuracy: 0.66728
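
Because `suggest_float` is called with `log=True` in the objective function above, the learning-rate range is searched on a logarithmic scale rather than a linear one. The sketch below imitates that idea with the standard library only; it is an illustration of log-uniform sampling, not Optuna's sampler, and the helper name `sample_log_uniform` is hypothetical.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_log_uniform(5e-4, 5e-2, rng) for _ in range(10000)]

# The geometric midpoint of the range is 5e-3. On a log scale about half
# of the samples fall below it, although it sits near the bottom of the
# range on a linear scale.
below_mid = sum(s < 5e-3 for s in samples) / len(samples)
print(f"fraction below 5e-3: {below_mid:.2f}")
```

Sampling this way gives each order of magnitude (for example 5e-4 to 5e-3 and 5e-3 to 5e-2) a similar share of the trials, which is usually what one wants for a learning rate.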
