There are 46 classes, so the classification problem is more complex than the two-class IMDB example.

We used a 16-dimensional representational space for the intermediate layers of the IMDB network.

But 16 dimensions might be too small to separate 46 classes.

Information dropped by a layer can never be recovered by later layers - a potential **information bottleneck**

Therefore we need bigger layers (more units)

Try 64-unit layers

The network must output a 46-dimensional vector, one entry per class; therefore the final layer must have 46 units

The layer should output the probability of class membership

A softmax activation produces an output interpretable as a probability distribution

\begin{align*}
f(x_i) &= \frac{e^{x_i}}{\sum_j e^{x_j}}
\end{align*}

In [2]:
import numpy as np

def softmax(x):
    return np.exp(x) / sum(np.exp(x))

The softmax function as code.
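
The direct definition can overflow for large inputs, because `np.exp` of a large number exceeds the floating-point range. A common workaround, sketched here, subtracts the maximum element before exponentiating; this leaves the result unchanged because the common factor cancels in the ratio. The name `stable_softmax` is an assumption made for this illustration and is not part of the original notebook.

```python
import numpy as np

def stable_softmax(x):
    # Hypothetical helper, not from the original notebook.
    # Subtracting the maximum keeps every exponent <= 0, so np.exp cannot
    # overflow; the shift cancels between numerator and denominator.
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

small = stable_softmax([1, 2, 1, 1])
large = stable_softmax([1001, 1002, 1001, 1001])  # same differences, shifted by 1000
print(small)
print(large)
```

Both calls should print the same distribution, since softmax depends only on the differences between the elements.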

In [3]:
x = np.array([1, 1, 1, 1])
print('x =', x, '\t', 'softmax(x) =', softmax(x), '\t', 'sum =', sum(softmax(x)))

x = [1 1 1 1] 	 softmax(x) = [0.25 0.25 0.25 0.25] 	 sum = 1.0


Let's look at softmax of a vector with equal elements... the output vector also has equal elements, and since the elements must sum to 1, each element of softmax(x) is 0.25.

In [4]:
x = np.array([1, 2, 1, 1])
print('x =', x, '\t', 'softmax(x) =', softmax(x), '\t', 'sum =', sum(softmax(x)))

x = [1 2 1 1] 	 softmax(x) = [0.1748777  0.47536689 0.1748777  0.1748777 ] 	 sum = 1.0000000000000002


The second element of x is larger than the other elements... and the second element of softmax(x) is the largest. The elements still sum to 1, up to floating-point rounding.

In [5]:
x = np.array([1, 8, 1, 1])
print('x =', x, '\t', 'softmax(x) =', softmax(x), '\t', 'sum =', sum(softmax(x)))

x = [1 8 1 1] 	 softmax(x) = [9.09394185e-04 9.97271817e-01 9.09394185e-04 9.09394185e-04] 	 sum = 0.9999999999999999


The second element of x is again the largest - but by a bigger margin. Now softmax(x) has a clear peak in the second element - about 1,100 times bigger than the other elements (a factor of e^7, since the inputs differ by 7).

Softmax heightens the larger elements of a vector whilst ensuring that the vector elements sum to 1. A hardmax function in the last two examples would have produced [0, 1, 0, 0]. Softmax retains information about the smaller elements. Furthermore, it is differentiable.
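
The contrast between softmax and a hardmax can be sketched as follows. The `hardmax` helper is an assumption written for this illustration (it is not a NumPy or Keras function); it places a 1 at the position of the largest element and 0 elsewhere.

```python
import numpy as np

def hardmax(x):
    # Hypothetical helper: one-hot vector marking the largest element.
    # np.argmax returns the first index when there are ties.
    x = np.asarray(x)
    out = np.zeros(x.shape, dtype=float)
    out[np.argmax(x)] = 1.0
    return out

def softmax(x):
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

for values in ([1, 2, 1, 1], [1, 8, 1, 1]):
    x = np.array(values)
    print('x =', x, '\t', 'hardmax(x) =', hardmax(x), '\t', 'softmax(x) =', softmax(x))
```

Hardmax gives the same output for both vectors, whereas softmax still distinguishes a narrow win from a decisive one.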

In [1]:
from tensorflow.keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

import numpy as np
def vectorize_sequences(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

from tensorflow.keras.utils import to_categorical # one hot encoder for lists
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
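
To make the one-hot encoding concrete, here is a minimal NumPy sketch of the same idea, written in the style of `vectorize_sequences`. The function name `to_one_hot` and the example labels are assumptions for illustration; the notebook itself relies on `to_categorical`.

```python
import numpy as np

def to_one_hot(labels, num_classes=46):
    # Hypothetical helper: each integer label becomes a row of zeros
    # with a single 1 at the index given by the label.
    labels = np.asarray(labels, dtype=int)
    results = np.zeros((len(labels), num_classes))
    results[np.arange(len(labels)), labels] = 1.
    return results

demo = to_one_hot([3, 0, 45])  # made-up labels, not taken from the dataset
print(demo.shape)
print(demo[0])
```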

In [6]:
from tensorflow.keras import models
from tensorflow.keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation = 'relu', input_shape = (10000,)))
model.add(layers.Dense(64, activation = 'relu'))
model.add(layers.Dense(46, activation = 'softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 64)                640064    
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 46)                2990      
Total params: 647,214
Trainable params: 647,214
Non-trainable params: 0
_________________________________________________________________


Two 64-unit relu layers and a final 46-unit softmax layer. `model.summary()` shows the output shape of each layer and the number of trainable parameters - the number of parameters in the w and b layer tensors. The length of the zeroth axis, the samples axis, is `None` as usual, because the batch size is not fixed. There are 647,214 trainable parameters.

In [7]:
# The weight and bias tensors
def print_layer_tensor_shape(layer):
    weight_params = model.layers[layer].get_weights()[0]
    bias_params = model.layers[layer].get_weights()[1]
    print(layer, '\t', weight_params.shape, '\t', bias_params.shape)

print_layer_tensor_shape(layer=0)
print_layer_tensor_shape(layer=1)
print_layer_tensor_shape(layer=2)

0 	 (10000, 64) 	 (64,)
1 	 (64, 64) 	 (64,)
2 	 (64, 46) 	 (46,)


...the shape of the weight and bias tensors, layer by layer.

In [8]:
(10000 * 64 + 64) + (64 * 64 + 64) + (64 * 46 + 46)

647214

The weight tensor has shape (n, m) - a matrix with n rows and m columns. Therefore there are n times m elements. The bias tensor is just a vector, so its number of elements is equal to its length, m. The total number of trainable parameters - the total number of elements in all the w and b tensors - is therefore given by this sum, which evaluates to 647,214.
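
The same count can be written as a small function of the layer sizes. This is a sketch; the names `dense_param_count` and `layer_sizes` are introduced here for illustration only.

```python
def dense_param_count(n_inputs, n_units):
    # Weight matrix of shape (n_inputs, n_units) plus a bias vector of length n_units.
    return n_inputs * n_units + n_units

# Input dimension followed by the number of units in each Dense layer.
layer_sizes = [10000, 64, 64, 46]

counts = [dense_param_count(n, m) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The per-layer counts should agree with the `Param #` column printed by `model.summary()`.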

In [9]:
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics = ['accuracy'])

The loss function should measure the distance from the output distribution to the true distribution. `categorical_crossentropy` does this for one-hot encoded labels, so it is the natural choice here. The `rmsprop` optimiser is chosen again.
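
To show what the loss measures, here is a minimal NumPy sketch of categorical cross-entropy for a single sample. It is not the Keras implementation; the clipping constant `eps` and the example vectors are assumptions chosen for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 so that np.log stays finite (eps is an assumption).
    y_pred = np.clip(y_pred, eps, 1.0)
    # For a one-hot y_true this reduces to -log(probability of the true class).
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([0., 1., 0., 0.])                  # true class is index 1
confident = np.array([0.001, 0.997, 0.001, 0.001])   # peaked on the true class
uniform = np.array([0.25, 0.25, 0.25, 0.25])         # no preference

loss_confident = categorical_crossentropy(y_true, confident)
loss_uniform = categorical_crossentropy(y_true, uniform)
print(loss_confident, loss_uniform)
```

The loss is close to zero when the predicted distribution puts nearly all its mass on the true class, and equals log(4) for the uniform prediction over four classes.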