We implement log-softmax. Note that this initial version is numerically unstable because it does not use the logsumexp trick: `x.exp()` overflows for large activations.

In [1]:
def log_softmax(x):
    return (x.exp() / (x.exp().sum(-1, keepdim=True))).log()


Since $\log(a/b) = \log(a) - \log(b)$, the division can and should be expressed as a subtraction of logprobs.

In [2]:
def log_softmax(x):
    return x - x.exp().sum(-1, keepdim=True).log()
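This version still exponentiates the raw activations, so it shares the overflow problem of the first one. A minimal sketch of the failure, assuming float32 inputs; the name `naive_log_softmax` and the input values are made up for illustration, and `torch.log_softmax` is used only as a stable reference:

```python
import torch

def naive_log_softmax(x):
    # Same formula as the cell above: x - log(sum(exp(x)))
    return x - x.exp().sum(-1, keepdim=True).log()

# Illustrative activations, large enough that exp() overflows in float32
x = torch.tensor([[1000.0, 1001.0, 1002.0]])

naive = naive_log_softmax(x)            # exp(1000.) is inf, so every entry becomes -inf
stable = torch.log_softmax(x, dim=-1)   # PyTorch's built-in, used as the reference

print(naive)
print(stable)
```

The naive result carries no usable information, while the reference returns finite logprobs for the same input.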

We implement the logsumexp trick: subtract the row maximum before exponentiating, then add it back after taking the log.

In [3]:
def logsumexp(x):
    # Row-wise maximum; after subtracting it every exponent is <= 0, so exp() cannot overflow.
    m = x.max(-1)[0]
    return m + (x - m[:, None]).exp().sum(-1).log()
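The function above relies on the following identity, with $m = \max_j x_j$ taken per row:

$$
\log \sum_j e^{x_j} = \log\Big(e^{m} \sum_j e^{x_j - m}\Big) = m + \log \sum_j e^{x_j - m}
$$

Every exponent $x_j - m$ is at most 0, so each term lies in $(0, 1]$ and the sum cannot overflow. At least one term equals 1, so the log never receives zero.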


In [4]:
def log_softmax(x):
    # Uses PyTorch's built-in Tensor.logsumexp, which is numerically stabilized.
    return x - x.logsumexp(-1, keepdim=True)
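A quick sanity check on this definition: exponentiating the output should give rows that sum to 1, even for activations of large magnitude. This is a sketch on random data; the scale factor of 50 and the shapes are arbitrary choices for illustration.

```python
import torch

def log_softmax(x):
    return x - x.logsumexp(-1, keepdim=True)

torch.manual_seed(0)
x = torch.randn(4, 10) * 50      # illustrative activations with large magnitude

out = log_softmax(x)
row_sums = out.exp().sum(-1)     # each row of probabilities should sum to 1

print(row_sums)
```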


We apply this function to our model's predictions. To get some, we first load MNIST and define a small model.

In [5]:
from fastai.vision.all import *

pickle_path = URLs.path('mnist_png')/'mnist_png.pkl'
path = untar_data(URLs.MNIST)/'training'

if not pickle_path.exists():
    pickle_path.parent.mkdir(parents=True, exist_ok=True)
    ds = DataBlock(
        blocks = (ImageBlock(PILImageBW), CategoryBlock),
        get_items = get_image_files,
        get_y = parent_label,
        splitter = RandomSplitter(1/6, seed=0)
    ).datasets(path)

    xs, ys = zip(*ds.train, *ds.valid)
    xs = np.stack(L(map(lambda x: np.array(x, dtype=np.float32).reshape(-1), xs))) / 255.
    ys = np.array(ys, dtype=np.int64)

    x_train, x_valid = xs[:len(ds.train)], xs[len(ds.train):]
    y_train, y_valid = ys[:len(ds.train)], ys[len(ds.train):]

    save_pickle(pickle_path, [x_train, y_train, x_valid, y_valid])

    del ds, xs, ys, x_train, y_train, x_valid, y_valid

x_train, y_train, x_valid, y_valid = map(tensor, load_pickle(pickle_path))


In [6]:
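class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        # Plain Python list: these layers are not registered as submodules,
        # so model.parameters() will not see them.
        self.layers = [nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)]

    def __call__(self, x):
        # Apply each layer in turn.
        for l in self.layers:
            x = l(x)
        return x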


In [7]:
n, m = x_train.shape
c = int(y_train.max()) + 1  # number of classes, as a plain int
nh = 50


In [8]:
model = Model(m, nh, c)
pred = model(x_train)
pred.shape


torch.Size([50000, 10])

In [21]:
test_close(logsumexp(pred), pred.logsumexp(-1))
sm_pred = log_softmax(pred)



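The cross-entropy loss is $-\sum_j y_j \log(p_j)$. For one-hot encoded targets, every $y_j$ is zero except for the correct label $i$, so this reduces to $-\log(p_i)$, where $p_i$ is the probability assigned to the correct label. Therefore we just need to retrieve the appropriate logprob through indexing. We demonstrate this for the first three records. Note that we can't use a slice such as `:3` for the row coordinate with this method of indexing: a slice would return every selected column for every row, whereas `range(3)` pairs each row index with its own target.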

In [22]:
sm_pred[range(3), y_train[:3]]


tensor([-2.3426, -2.3154, -2.2348], grad_fn=<IndexBackward0>)
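The pairing behaviour of this kind of indexing is easier to see on a tiny tensor. A minimal sketch, where `logp` and `targets` are made-up values rather than anything from the notebook:

```python
import torch

logp = torch.arange(12.).reshape(3, 4)   # hypothetical table: 3 rows, 4 classes
targets = torch.tensor([2, 0, 3])        # hypothetical correct class per row

paired = logp[range(3), targets]   # picks elements (0,2), (1,0), (2,3)
sliced = logp[:3, targets]         # slice on rows: every row gets all three columns

print(paired)        # tensor([ 2.,  4., 11.])
print(sliced.shape)  # torch.Size([3, 3])
```

Only the first form yields one value per record, which is what the loss needs.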

NLL stands for negative log-likelihood. We take the logprob of the correct class for each record, average over all records, and negate.

In [23]:
def nll(input, target):
    return -input[range(target.shape[0]), target].mean()


In [24]:
loss = nll(sm_pred, y_train)
loss


tensor(2.3101, grad_fn=<NegBackward0>)
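A useful reference point for this number: a classifier that assigns equal probability to all 10 classes has a loss of $\ln 10 \approx 2.3026$, and an untrained model would be expected to land near that. The sketch below checks the baseline with the same `nll` definition; the `targets` values are arbitrary, since a uniform prediction gives the same loss for any labels.

```python
import math
import torch

def nll(input, target):
    return -input[range(target.shape[0]), target].mean()

n_classes = 10
# Log-probabilities of a uniform prediction: log(1/10) in every position
uniform_logp = torch.full((5, n_classes), -math.log(n_classes))
targets = torch.tensor([3, 1, 4, 1, 5])   # arbitrary illustrative labels

baseline = nll(uniform_logp, targets)
print(baseline)
```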

We can check our result against PyTorch's implementation as follows:

In [26]:
test_close(F.nll_loss(F.log_softmax(pred, -1), y_train), loss, 1e-4)
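PyTorch also provides `F.cross_entropy`, which takes raw activations and combines the log-softmax and NLL steps in one call. The sketch below compares the two routes on random data; `logits` and `targets` are illustrative stand-ins, not the notebook's `pred` and `y_train`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)              # hypothetical batch: 8 records, 10 classes
targets = torch.randint(0, 10, (8,))     # hypothetical integer class labels

# Two-step route, as in the cell above
two_step = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
# One-step route on raw activations
one_step = F.cross_entropy(logits, targets)
# Manual route: index the correct logprob per record, average, negate
manual = -F.log_softmax(logits, dim=-1)[range(8), targets].mean()

print(two_step, one_step, manual)
```

All three values should agree up to floating-point error.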
