# Concise Implementation of Softmax Regression

Just as PyTorch made it much easier to implement linear regression, we'll find it similarly (or possibly more)
convenient for implementing classification models. Again, we begin with our import ritual.

In [19]:
import os
import sys
sys.path.insert(0, '..')
import d2l
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader, TensorDataset
from torchvision import transforms

Let's stick with the Fashion-MNIST dataset and keep the batch size at $256$ as in the last section.

In [20]:
batch_size = 256
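# `resize` and `root` are only used by the download fallback; the default
# `root` is an assumed location, so adjust it to your setup.
def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    train_data_path = '/Users/yingshuaihao/PycharmProjects/Dataset/FashionMNIST/processed/training.pt'
    test_data_path = '/Users/yingshuaihao/PycharmProjects/Dataset/FashionMNIST/processed/test.pt'

    if os.path.exists(train_data_path) and os.path.exists(test_data_path):
        # Load the cached (images, labels) tensors and cast the images to float
        X, y = torch.load(train_data_path)
        X = X.float()
        mnist_train = TensorDataset(X, y)

        X, y = torch.load(test_data_path)
        X = X.float()
        mnist_test = TensorDataset(X, y)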
    else:
        root = os.path.expanduser(root)
        transformer = []
        if resize:
            transformer += [transforms.Resize(resize)]
        transformer += [transforms.ToTensor()]
        transformer = transforms.Compose(transformer)

        mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, transform=transformer, download=True)
        mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, transform=transformer, download=True)

    num_workers = 0 if sys.platform.startswith('win32') else 4

    train_iter = DataLoader(mnist_train, batch_size, shuffle=True, num_workers=num_workers)
    test_iter = DataLoader(mnist_test, batch_size, shuffle=False, num_workers=num_workers)
    return train_iter, test_iter

train_iter, test_iter = load_data_fashion_mnist(batch_size)

## Initialize Model Parameters

As mentioned in `chapter_softmax`, the output layer of softmax regression is a fully connected (`Linear`) layer. Therefore, to implement our model, we just need to add one `Linear` layer with 10 outputs to our `Sequential`. The `Sequential` isn't strictly necessary here, but we might as well form the habit, since it will be ubiquitous when implementing deep models. Because each image arrives as a $28 \times 28$ grid, a small `Reshape` module first flattens it into a vector of 784 features. As before, we initialize the weights at random with zero mean and standard deviation 0.01.

In [21]:
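class Reshape(torch.nn.Module):
    def forward(self, x):
        # Flatten each 28x28 image into a vector of 784 features
        return x.view(-1, 784)

def init_weights(m):
    # Draw the weights of every Linear layer from N(0, 0.01^2)
    if isinstance(m, nn.Linear):
        torch.nn.init.normal_(m.weight, std=0.01)

net = nn.Sequential(Reshape(), nn.Linear(784, 10))
net.apply(init_weights)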

Sequential(
  (0): Reshape()
  (1): Linear(in_features=784, out_features=10, bias=True)
)
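
As a quick sanity check, the sketch below rebuilds the same model and pushes a random batch through it. The input `X` is made-up data standing in for two Fashion-MNIST images, and the tolerance on the standard deviation is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

class Reshape(nn.Module):
    def forward(self, x):
        # Flatten each 28x28 image into a vector of 784 features
        return x.view(-1, 784)

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

net = nn.Sequential(Reshape(), nn.Linear(784, 10))
net.apply(init_weights)

# Hypothetical batch: 2 single-channel images of size 28x28
X = torch.rand(2, 1, 28, 28)
out = net(X)

print(out.shape)                    # one score per class for each image
print(net[1].weight.std().item())   # empirical std of the initialized weights
```

The output has one row per image and one column per class, and the empirical standard deviation of the weights should land close to the requested 0.01.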

## The Softmax

In the previous example, we calculated our model's output and then ran this
output through the cross-entropy loss. Mathematically, that's a perfectly reasonable thing to do. However,
computationally, exponentiation can cause
numerical stability issues, a matter we've already discussed a few times
(e.g., in `chapter_naive_bayes` and in the problem set of the previous chapter).
Recall that the softmax function calculates $\hat y_j = \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}$, where $\hat y_j$
is the $j$-th element of ``yhat`` and $z_j$ is the $j$-th element of the input
``y_linear`` variable, as computed by the linear layer.

If some of the $z_i$ are very large (i.e., very positive),
$e^{z_i}$ might be larger than the largest number
we can represent in certain ``float`` types (i.e., overflow).
This would make the denominator (and/or numerator) ``inf``, and we get zero,
``inf``, or ``nan`` for $\hat y_j$.
In any case, we won't get a well-defined return value for ``cross_entropy``. This is the reason <font color=red>we subtract $\max_i(z_i)$
from all $z_i$ first in the ``softmax`` function.
You can verify that this shift in $z_i$
does not change the return value of ``softmax``, because the common factor $e^{-\max_i(z_i)}$ cancels between numerator and denominator.</font>
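
To make the overflow concrete, here is a minimal sketch comparing a direct translation of the softmax formula with the shifted version. The function names `naive_softmax` and `shifted_softmax` and the example inputs are made up for illustration; they are not part of `d2l`.

```python
import torch

def naive_softmax(z):
    # Direct translation of the formula: exponentiate, then normalize
    exp_z = torch.exp(z)
    return exp_z / exp_z.sum()

def shifted_softmax(z):
    # Subtract the largest entry first, so the largest exponent is e^0 = 1
    exp_z = torch.exp(z - z.max())
    return exp_z / exp_z.sum()

z_small = torch.tensor([1.0, 2.0, 3.0])
z_large = torch.tensor([100.0, 200.0, 300.0])  # these exponentials overflow float32

# For moderate inputs, the two versions agree
print(torch.allclose(naive_softmax(z_small), shifted_softmax(z_small)))
# For large inputs, the naive version yields inf / inf = nan ...
print(naive_softmax(z_large))
# ... while the shifted version stays finite
print(shifted_softmax(z_large))
```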

After the above subtraction and normalization step,
it is possible that $z_j$ is very negative.
Thus, $e^{z_j}$ will be very close to zero
and might be rounded to zero due to finite precision (i.e., underflow),
which makes $\hat y_j$ zero and gives ``-inf`` for $\log(\hat y_j)$.
A few steps down the road in backpropagation,
we start to get horrific not-a-number (``nan``) results printed to screen.

Our salvation is that even though we're computing these exponential functions, we ultimately plan to take their log in the cross-entropy function.
It turns out that by combining the two operators
``softmax`` and ``cross_entropy``,
we can escape the numerical stability issues
that might otherwise plague us during backpropagation.
As shown in the equation below, we avoid calculating $e^{z_j}$
and use $z_j$ directly, because $\log(\exp(\cdot))$ cancels.

$$
\begin{aligned}
\log{(\hat y_j)} & = \log\left( \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}\right) \\
& = \log{(e^{z_j})}-\log{\left( \sum_{i=1}^{n} e^{z_i} \right)} \\
& = z_j -\log{\left( \sum_{i=1}^{n} e^{z_i} \right)}
\end{aligned}
$$
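
The identity above can be checked numerically. The sketch below uses two made-up logits with a large gap: taking the log after the softmax underflows, while computing $z_j - \log\sum_i e^{z_i}$ with `torch.logsumexp` does not.

```python
import torch

z = torch.tensor([0.0, 200.0])  # hypothetical logits with a large gap

# Two-step route: (shifted) softmax first, then log.
# exp(-200) underflows to 0 in float32, so the first log-probability is -inf.
shifted = torch.exp(z - z.max())
naive_log_probs = torch.log(shifted / shifted.sum())

# Combined route: log y_j = z_j - log(sum_i exp(z_i))
stable_log_probs = z - torch.logsumexp(z, dim=0)

print(naive_log_probs)
print(stable_log_probs)
print(torch.log_softmax(z, dim=0))  # built-in equivalent of the combined route
```

The combined route returns a finite log-probability for the first entry, which is what keeps the loss and its gradients well defined.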

We'll want to keep the conventional softmax function handy
in case we ever want to evaluate the probabilities output by our model.
But instead of passing softmax probabilities into our new loss function,
we'll just pass the unnormalized scores $z$ and compute the softmax and its log
all at once inside the loss function (`nn.CrossEntropyLoss` in PyTorch),
which does smart things like the log-sum-exp trick ([see on Wikipedia](https://en.wikipedia.org/wiki/LogSumExp)).


In [22]:
loss = nn.CrossEntropyLoss()
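
As a small check that this loss expects raw scores rather than probabilities, the sketch below compares `nn.CrossEntropyLoss` with the same quantity assembled by hand from `log_softmax` and `nll_loss`. The `logits` and `labels` tensors are made-up values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # hypothetical scores: 4 examples, 10 classes
labels = torch.tensor([3, 0, 9, 1])   # hypothetical class indices

loss = nn.CrossEntropyLoss()
combined = loss(logits, labels)

# Same quantity by hand: log-softmax, then the mean negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(combined, manual))
```

Passing softmax probabilities instead of raw scores would apply the softmax twice, so the model's output layer should not include a softmax when this loss is used.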

## Optimization Algorithm

We use minibatch stochastic gradient descent as the optimization algorithm.
The training cell below sets the learning rate to `lr = 1.0e-5`.
Note that this is the same optimizer as for linear regression,
which illustrates the general applicability of the optimizers.
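
For reference, a single minibatch SGD update in plain PyTorch might look like the sketch below. It is not necessarily what `d2l.train_ch3` does internally; the random batch, the use of `nn.Flatten` in place of `Reshape`, and the learning rate value are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
loss = nn.CrossEntropyLoss()
# Illustrative learning rate, not a recommendation
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# One hypothetical minibatch: 256 images with values in [0, 1) and random labels
X = torch.rand(256, 1, 28, 28)
y = torch.randint(0, 10, (256,))

before = net[1].weight.detach().clone()

optimizer.zero_grad()    # clear gradients left over from a previous step
l = loss(net(X), y)      # forward pass and loss on the minibatch
l.backward()             # compute gradients by backpropagation
optimizer.step()         # update the parameters in place

print(l.item())
print(torch.equal(before, net[1].weight))  # weights changed after the step
```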

## Training

Next, we use the training functions defined in the last section to train a model.

In [29]:
num_epochs = 10
lr = 1.0e-5
net = nn.Sequential(Reshape(), nn.Linear(784, 10))
net.apply(init_weights)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, lr)

epoch 1, loss 0.0925, train acc 0.667, test acc 0.719
epoch 2, loss 0.0487, train acc 0.747, test acc 0.741
epoch 3, loss 0.0440, train acc 0.766, test acc 0.774
epoch 4, loss 0.0426, train acc 0.770, test acc 0.772
epoch 5, loss 0.0419, train acc 0.775, test acc 0.809
epoch 6, loss 0.0409, train acc 0.777, test acc 0.718
epoch 7, loss 0.0397, train acc 0.783, test acc 0.781
epoch 8, loss 0.0387, train acc 0.783, test acc 0.776
epoch 9, loss 0.0384, train acc 0.788, test acc 0.697
epoch 10, loss 0.0388, train acc 0.787, test acc 0.776


Just as before, this algorithm converges to a solution;
in the run above it ends with a training accuracy of 78.7% and a test accuracy of 77.6%,
albeit this time with far fewer lines of code than before.
Note that in many cases, PyTorch takes specific precautions
in addition to the most well-known tricks for ensuring numerical stability.
This saves us from many common pitfalls that might befall us
if we were to code all of our models from scratch.

## Exercises

1. Try adjusting the hyperparameters, such as the batch size, number of epochs, and learning rate, to see what the results are.
1. Why might the test accuracy decrease again after a while? How could we fix this?