## Output layers in Neural Networks

### Introduction
When building a neural network, the design of the output layer is especially important since it determines how the model produces its final answer. This answer needs to match the type of problem you're solving. For example, you might be classifying whether an email is spam, identifying what kind of object appears in a photograph, or predicting the price of a house. Each of these tasks requires a different kind of output.

For *binary classification* problems where there are only two possible outcomes, such as *yes or no* or *spam or not spam*, the network typically ends with a single neuron in the output layer. This neuron uses the *sigmoid activation function*, which squeezes the output into a value between 0 and 1, making it easy to interpret as a probability. If the result is close to 1, the model is confident the answer is “yes”; if it's close to 0, it's likely “no.” To train the model, we use the *binary crossentropy* loss function, which compares how far off the predicted probability is from the true label.

For *multi-class classification* tasks where the input belongs to one of several categories, the output layer has one neuron for each class. For instance, if you're trying to recognise digits (0 through 9), you’d have 10 output neurons. These neurons use the softmax activation function, which ensures that all outputs sum to 1, so the network effectively makes a single choice among the available classes. The appropriate loss function here is usually *categorical crossentropy* (if your labels are one-hot encoded) or *sparse categorical crossentropy* (if your labels are provided as integers).

In some cases, each input might belong to multiple categories at once. This is known as *multi-label classification*. A photograph might be tagged as both *beach* and *sunset*, or a film might be categorised as *comedy* and *romance*. In this setup, the output layer again has one neuron per label, but instead of using softmax, each neuron has its own *sigmoid activation*. This allows the model to independently predict whether each label applies or not. We still use *binary crossentropy* as the loss function, but now it's applied separately to each label.

Lastly, when the goal is to predict a number rather than a category, such as estimating house prices, forecasting temperatures, or predicting how long a delivery will take, you're working on a *regression* task. In this case, the output layer simply returns one or more numbers, without using an activation function like sigmoid or softmax. The model is trained using loss functions such as *mean squared error (MSE)* or *mean absolute error (MAE)*, which measure how far the predictions are from the actual values.
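
As a small illustration of the two regression losses named above, the sketch below computes MSE and MAE by hand with NumPy. The prices are made-up values chosen only to keep the arithmetic easy to follow:

```python
import numpy as np

# Hypothetical true and predicted house prices (in thousands), for illustration only
y_true = np.array([200.0, 350.0, 500.0])
y_pred = np.array([210.0, 330.0, 530.0])

errors = y_pred - y_true            # [10, -20, 30]

mse = np.mean(errors ** 2)          # (100 + 400 + 900) / 3
mae = np.mean(np.abs(errors))       # (10 + 20 + 30) / 3

print(f"MSE: {mse:.2f}")  # MSE: 466.67
print(f"MAE: {mae:.2f}")  # MAE: 20.00
```

Because MSE squares each error, the single largest miss (30) contributes most of the total, whereas MAE weights every error in proportion to its size.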

Choosing the right output layer setup, including the number of neurons, the activation function, and the loss function, is important to ensure your neural network is properly aligned with the type of problem it's trying to solve.

### Binary classification
In *binary classification*, the goal is to categorise each input into one of two possible groups, for example "yes" or "no", "spam" or "not spam", or "positive" versus "negative". It’s called *binary* because there are only two classes. Each input gets assigned a label, either `0` or `1`.

To achieve this, a neural network uses a very simple setup in the output layer: just *one neuron*. This neuron doesn’t just spit out a `0` or `1` directly. Instead, it produces a *probability*, a number between 0 and 1, that reflects how confident the model is that the input belongs to class `1`. This is done using something called the *sigmoid activation function*, which curves the output so that very high or very low input values get pushed close to 1 or 0, respectively.

For example, if the model sees a film review and outputs `0.87`, it’s saying, “I’m 87% confident this review is positive.” By convention, we often treat any value above 0.5 as class `1` (positive), and anything below or equal to 0.5 as class `0` (negative). This threshold can be adjusted depending on the application.
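
To make the sigmoid and the decision threshold concrete, here is a minimal NumPy sketch. The raw scores (`logits`) are hypothetical values, and the stricter 0.9 threshold is just one possible choice for an application where false positives are costly:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores produced by the output neuron before activation
logits = np.array([-4.0, -1.0, 0.0, 1.9, 4.0])
probs = sigmoid(logits)

# Default decision rule: class 1 only if the probability is above 0.5
labels_default = (probs > 0.5).astype(int)

# A stricter threshold predicts class 1 less often
labels_strict = (probs > 0.9).astype(int)

print(np.round(probs, 2))
print(labels_default)  # [0 0 0 1 1]
print(labels_strict)   # [0 0 0 0 1]
```

A score of 1.9 maps to roughly 0.87, the probability used in the example above; it counts as positive under the default threshold but not under the stricter one.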

To train the model, we need a way to measure how good its predictions are. That’s where the *loss function* comes in: for binary classification, we use *binary cross-entropy*. This function gives low values when the predicted probabilities are close to the true labels, and high values when the model is wrong or uncertain. During training, the model adjusts its internal weights to reduce this loss, gradually improving its predictions.
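
The behaviour of the loss can be checked by hand. The sketch below is a plain NumPy version of binary cross-entropy (not the Keras implementation; the `eps` clipping is an assumption added to avoid taking the log of zero), evaluated on three hypothetical predictions for a review whose true label is `1`:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0])  # the true label is the positive class

confident_right = binary_cross_entropy(y_true, np.array([0.95]))
uncertain = binary_cross_entropy(y_true, np.array([0.50]))
confident_wrong = binary_cross_entropy(y_true, np.array([0.05]))

print(f"{confident_right:.3f}")  # 0.051
print(f"{uncertain:.3f}")        # 0.693
print(f"{confident_wrong:.3f}")  # 2.996
```

The loss grows slowly while the prediction is merely unsure and sharply once the model is confidently wrong.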


### IMDB reviews dataset
An example of *binary classification* is our favourite *IMDB reviews dataset*, where the task is to decide whether a movie review is *positive* or *negative*. Each review is turned into a sequence of numbers (based on the words it contains), and the model learns from many examples how to detect the tone of the review. 

#### Installing Python libraries

In [None]:
!pip install numpy matplotlib tensorflow tensorflow-datasets scikit-multilearn liac-arff

#### Loading the dataset

The code below loads the sentiment analysis dataset of movie reviews from `keras.datasets`, where positive reviews are assigned the label `1` and negative reviews the label `0`:

In [None]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam

# Load IMDB data
# num_words=10000 => we keep the top 10,000 most frequent words
(x_train_bin, y_train_bin), (x_test_bin, y_test_bin) = imdb.load_data(num_words=10000)


#### Preprocessing
We need to ensure all reviews are the same length. `pad_sequences` pads shorter reviews with the integer `0` (the index reserved for `<PAD>`), by default at the start of the sequence, and truncates longer ones, so that every review ends up with exactly 200 tokens:

In [None]:
# Pad sequences to a fixed length (e.g. 200)
x_train_bin = pad_sequences(x_train_bin, maxlen=200)
x_test_bin  = pad_sequences(x_test_bin, maxlen=200)
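
To see what padding does to the data, here is a NumPy-only sketch that mimics what `pad_sequences` does with its default settings, as far as I understand them (padding and truncation both at the start, fill value 0). The helper `pad_pre` and the toy reviews are hypothetical and only for illustration:

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]            # keep only the last maxlen tokens
        if trimmed:                        # leave empty sequences as all padding
            out[row, -len(trimmed):] = trimmed
    return out

# Two toy "reviews" of different lengths
toy_reviews = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
padded = pad_pre(toy_reviews, maxlen=5)

print(padded)
# [[0 0 5 8 2]
#  [1 9 4 3 6]]
```

The short review gains two leading zeros, while the long one loses its first token, so both rows end up with the same width.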


#### Model

In our model, the parameters are tuned via binary cross-entropy loss, which penalises confident but incorrect predictions more heavily than smaller errors, thereby encouraging the output neuron not only to be accurate but also to express appropriate confidence. In practice, inspecting this neuron’s probability distribution on unseen validation data provides immediate insight into both model performance and calibration, making it an indispensable focal point for evaluation and threshold adjustment, which we will do after training:


In [None]:
# Import necessary components from Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.optimizers import Adam

# Build a simple binary classification model
model_bin = Sequential([
    # Embedding layer to learn a 32-dimensional embedding for each of the 10,000 words
    # input_dim=10000 specifies the vocabulary size
    # output_dim=32 is the size of the embedding vectors
    # inputs are sequences padded to length 200; recent Keras versions deprecate
    # the input_length argument, so it is omitted here
    Embedding(input_dim=10000, output_dim=32),
    
    # GlobalAveragePooling1D to convert a variable-length sequence of embeddings
    # into a single 32-dimensional vector by averaging over the time dimension
    GlobalAveragePooling1D(),
    
    # Dense layer with 1 neuron and sigmoid activation: outputs the probability of class 1
    Dense(1, activation='sigmoid')
])

# Compile the model
model_bin.compile(
    optimizer=Adam(learning_rate=0.001),       # Adam optimiser with a small learning rate
    loss='binary_crossentropy',                # Binary crossentropy loss for binary classification
    metrics=['accuracy']                       # Track accuracy during training
)

# Train the model
history = model_bin.fit(
    x_train_bin,                               # Training input data
    y_train_bin,                               # Training labels (0 or 1)
    validation_data=(x_test_bin, y_test_bin),  # Validation data to monitor performance on unseen data
    epochs=10                                  # Number of complete passes through the training dataset
)


The culmination of this architecture is the single‐unit output layer, which applies a sigmoid activation to convert the preceding 32‐dimensional feature vector into a probability score between 0 and 1. During training, this lone neuron learns to map the rich, pooled embedding representation of an entire text sequence onto the likelihood of the positive class, essentially answering "yes" or "no" for the binary task at hand. 

Since sigmoid outputs behave like independent Bernoulli probabilities, we can interpret any value above a chosen threshold (commonly 0.5) as a positive prediction and anything below as negative, granting us a clear decision rule.

#### Evaluate
Let's review how our model performed during training:

In [None]:
from matplotlib import pyplot as plt
# Plot training & validation loss
plt.figure()

plt.plot(history.history['loss'], marker='o', label='Training loss')
plt.plot(history.history['val_loss'], marker='o', label='Validation loss')

plt.title('IMDB loss over epochs')

plt.xlabel('Epoch')
plt.ylabel('Loss')

plt.legend()
plt.show()


By the end of ten epochs, the model has clearly learned meaningful patterns in the text data: training accuracy rises steadily from about 66 per cent in the first epoch to nearly 95 per cent by epoch 10, while training loss falls from 0.65 down to 0.15. On the validation set, accuracy climbs from roughly 82 per cent in the first epoch to a peak of about 88 per cent around epoch 6, with the lowest validation loss (≈0.288) occurring at the same point.

After epoch 6 you can observe a mild drift: training accuracy continues to improve and training loss keeps dropping, but validation accuracy and loss start to plateau or slightly worsen (validation accuracy dips to 87.2 per cent by epoch 10, and loss creeps back up to 0.318). This hints at the model beginning to overfit the training data. In practice, you’d likely achieve your best generalisation by stopping training around epoch 6, for instance with early stopping (more on this later), and perhaps by introducing some regularisation.
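
The idea of stopping once validation loss stops improving can be sketched in plain Python. The function name `best_epoch_with_patience` and the list of losses are hypothetical (the losses are made up to bottom out at epoch 6, not copied from the training log); Keras offers comparable behaviour through its `EarlyStopping` callback:

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses, for illustration only
val_losses = [0.45, 0.36, 0.32, 0.30, 0.29, 0.288, 0.295, 0.301, 0.310, 0.318]

stop_epoch, best_epoch = best_epoch_with_patience(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 8 6
```

With a patience of 2, training would halt at epoch 8 and the weights from epoch 6 would be the ones worth keeping.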

#### Predict
Let's look at the predictions. We first pull the built-in IMDB word index, which maps each vocabulary word to a unique integer. We then invert that mapping, adding an offset of 3 so that indices 0–3 are reserved for the special tokens `<PAD>`, `<START>`, `<UNK>` and `<UNUSED>`, which lets us look up words from their numerical IDs.

The helper function `decode_review` walks through an encoded review (a list of integers), skips over any special tokens, and stitches together the first 50 actual words (adding an ellipsis if it’s been truncated), giving us a human-readable preview of the text.

Next, we take our trained binary classifier (`model_bin`) and ask it to predict the "positive" probability for the first ten test reviews. We apply a 0.5 cutoff to those probabilities to convert them into predicted labels of 0 or 1. Finally, for each of those ten samples, we print out the decoded review text alongside the model's predicted label (and its probability) and the true label, letting us see exactly where our sentiment detector gets things right or wrong:


In [None]:
from tensorflow.keras.datasets import imdb

# Load the IMDB word index mapping words to integer IDs
word_index = imdb.get_word_index()

# Build a reverse mapping (integer ID → word), offsetting by 3 to reserve special tokens
reverse_word_index = {value + 3: key for key, value in word_index.items()}
reverse_word_index.update({
    0: '<PAD>',    # Padding token for shorter sequences
    1: '<START>',  # Start-of-sequence token
    2: '<UNK>',    # Unknown word token
    3: '<UNUSED>'  # Unused/reserved token
})

def decode_review(encoded_review, max_words=50):
    """
    Convert an encoded review (list of integer IDs) back into a readable string.
    Skips any special tokens and limits output to max_words words.
    """
    words = []
    for i in encoded_review:
        # Look up the word; use '?' if the ID is not found
        word = reverse_word_index.get(i, '?')
        # Skip special tags (e.g. <PAD>, <START>, etc.)
        if word.startswith('<') and word.endswith('>'):
            continue
        words.append(word)
    # Keep the first max_words real words and append an ellipsis only if
    # words were actually cut off (padding tokens do not count)
    truncated = len(words) > max_words
    return ' '.join(words[:max_words]) + (' …' if truncated else '')

# Predict probabilities on the first 10 test samples
y_pred_probs = model_bin.predict(x_test_bin[:10])

# Convert probabilities to binary labels (0 or 1) using a 0.5 threshold
y_pred = (y_pred_probs > 0.5).astype(int).flatten()

# Loop through each of the first 10 samples and print decoded review plus predictions
for i in range(10):
    print(f"\n--- Sample #{i+1} ---")
    print("Review text:", decode_review(x_test_bin[i]))
    print(
        f"Predicted label: {y_pred[i]} "
        f"(probability {y_pred_probs[i][0]:.4f}), "
        f"True label: {y_test_bin[i]}"
    )


### Multi-class classification

In *multi-class classification*, the goal is to assign each input to exactly one category out of several possible options. Unlike binary classification, where the answer is either *yes* or *no*, multi-class classification problems involve three or more distinct classes, and each input belongs to just one of them. 

A classic example is digit recognition: given an image of a handwritten digit, the model must decide whether it’s a `0`, `1`, `2`, ..., up to `9`. That’s a total of 10 classes, but each image can only be labelled as *one* digit.

To make this work, the output layer of the neural network has one neuron for each class. So in the MNIST digit classification task, the network ends with 10 output neurons. Each neuron is responsible for one digit class. These neurons use the softmax activation function, which converts the output values into a probability distribution. This means that all the output values are between 0 and 1, and together they add up to 1. 

For example, the model might output something like:

```
[0.02, 0.01, 0.88, 0.05, 0.01, 0.01, 0.01, 0.00, 0.00, 0.01]
```

In this case, the model is 88% confident that the input belongs to class `2`: the value `0.88` is the third entry in the array, and because classes are indexed from 0, the third entry corresponds to class `2`. Because this class has the highest probability, the model will choose it as the predicted label.
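
The two steps described above, softmax followed by picking the largest probability, can be sketched with NumPy. The three raw scores are hypothetical; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores for a three-class problem
scores = np.array([1.0, 3.0, 0.5])
probs = softmax(scores)

predicted_class = int(np.argmax(probs))
print(predicted_class)  # 1

# The same argmax step applied to the ten-class example vector above
example = np.array([0.02, 0.01, 0.88, 0.05, 0.01, 0.01, 0.01, 0.00, 0.00, 0.01])
print(int(np.argmax(example)))  # 2
```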

The choice of loss function depends on how the true labels are formatted. If your labels are *one-hot encoded* (i.e., `[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]` to represent class `2`), you should use *categorical cross-entropy*. If your labels are just integers (e.g., `2` for class 2), which is more compact and common in practice, you should use *sparse categorical cross-entropy*; it does the same job but works directly with integer labels.
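
That the two losses compute the same quantity can be verified on a single example. This is a hand-rolled NumPy sketch rather than the Keras implementation, and the clipping constant `eps` is an assumption added to avoid `log(0)`:

```python
import numpy as np

# Predicted distribution over the ten digit classes (the example vector above)
probs = np.array([0.02, 0.01, 0.88, 0.05, 0.01, 0.01, 0.01, 0.00, 0.00, 0.01])
eps = 1e-7
safe = np.clip(probs, eps, 1.0)   # avoid log(0) for the zero entries

# The same true label in both formats
sparse_label = 2
one_hot_label = np.zeros(10)
one_hot_label[sparse_label] = 1.0

# Categorical cross-entropy: sum over classes, only the true class survives
categorical_loss = -np.sum(one_hot_label * np.log(safe))

# Sparse categorical cross-entropy: index the true class directly
sparse_loss = -np.log(safe[sparse_label])

print(f"{categorical_loss:.4f} {sparse_loss:.4f}")  # 0.1278 0.1278
```

Both reduce to the negative log of the probability assigned to the correct class; only the label format differs.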

This setup is widely used in many practical applications, such as:

- Classifying types of clothing (e.g., shirts, trousers, shoes)
- Detecting objects in photos (e.g., cat, dog, car, tree)
- Categorising news articles by topic (e.g., politics, sports, technology)

### MNIST dataset
We’ll demonstrate multi-class classification using the well-known *MNIST dataset*, which, as we know, contains images of handwritten digits (`0` to `9`). Each image is 28×28 pixels, and the task is to correctly identify the digit in the image. This is one of the most popular datasets for showing how softmax-based output layers and multi-class classification work.

#### Loading the dataset

In [None]:
import numpy as np
from tensorflow.keras.datasets import mnist

# Load full MNIST
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()


#### Resampling
We create a smaller random subset of our full training and test sets so that subsequent experiments run faster. We draw 5,000 unique indices from the rows of `X_train` and 1,000 from `X_test` (without replacement, so no sample is picked twice). Finally, we use those index arrays to create the corresponding feature matrices (`X_train_sample`, `X_test_sample`) and their matching label vectors (`Y_train`, `Y_test`, which are overwritten in place). The result is a smaller, randomly selected train-test split that mirrors the distribution of the full dataset but trains and evaluates more quickly:


In [None]:
# Take a sample and create train and test
train_sample_size = 5000
test_sample_size  = 1000

np.random.seed(7)

# Randomly choose indices without replacement
train_idxs = np.random.choice(X_train.shape[0], size=train_sample_size, replace=False)
test_idxs  = np.random.choice(X_test.shape[0],  size=test_sample_size,  replace=False)

# Subset the arrays
X_train_sample = X_train[train_idxs]
X_test_sample  = X_test[test_idxs]

Y_test  = Y_test[test_idxs]
Y_train = Y_train[train_idxs]

#### Normalisation
As we are working with image data, we'll normalise the pixel values:

In [None]:
# Normalise pixels
X_train = X_train_sample.astype('float32') / 255.0
X_test  = X_test_sample.astype('float32')  / 255.0


#### Model
We create a model with a ten-unit softmax output layer, which transforms our high-dimensional feature vector into a proper probability distribution over the digits 0–9. Each neuron in this final layer computes a score for its associated digit, and the softmax activation then exponentiates and normalises those scores so that they all sum to one, meaning the network's prediction for a given image is simply the index of the neuron with the highest probability.

If we frame the problem in this way, we gain two key advantages: first, the output directly tells us "there’s a 78.3 per cent chance this image is a 3 and a 9.4 per cent chance it’s an 8", and so on, giving a clear measure of confidence; and second, during training the model minimises the sparse categorical cross-entropy loss, which penalises it more heavily the lower the probability it assigns to the correct class.

In effect, that single softmax layer concentrates all of the network's decision-making power into a succinct, interpretable probability vector, perfectly aligned with the goal of assigning each handwritten digit to exactly one of ten categories:


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import SGD 

# Build a simple feed-forward neural network for multiclass classification
model = Sequential([
    # Flatten the 28×28 images into a 784-dimensional vector per sample
    Flatten(input_shape=(28, 28)),
    
    # First hidden layer with 128 neurons and ReLU activation for non-linearity
    Dense(128, activation='relu'),
    
    # Output layer with 10 neurons (one per digit) and softmax activation
    # so the network predicts a probability distribution over the 10 classes
    Dense(10, activation='softmax')
])

# Compile the model:
# We use SGD (Stochastic Gradient Descent) optimiser to update weights
# sparse_categorical_crossentropy is appropriate when labels are integer-encoded (0–9)
# accuracy will track the proportion of correct predictions
model.compile(
    optimizer=SGD(learning_rate=0.01),          # SGD optimiser with a modest learning rate
    loss='sparse_categorical_crossentropy',     # loss for integer labels in multiclass problems
    metrics=['accuracy']                        # track accuracy during training
)

# Train the model for 10 epochs, using the test set for validation each epoch
history = model.fit(
    X_train,                              # training images (normalised)
    Y_train,                              # training labels (0–9)
    validation_data=(X_test, Y_test), # validation on unseen test images
    epochs=10                                   # number of complete passes through the training set
)


#### Evaluation
We will now plot the loss over epochs to see how well the model trained on our data:

In [None]:
# Plot loss
plt.figure()

plt.plot(history.history['loss'], marker='o', label='Training loss')
plt.plot(history.history['val_loss'], marker='o', label='Validation loss')

plt.title('MNIST loss over epochs')

plt.xlabel('Epoch')
plt.ylabel('Loss')

plt.legend()
plt.show()

Over the ten training epochs, the network learns remarkably quickly. In the very first epoch, training accuracy averages only 34 % (the weights start out random, and Keras reports the running average over the whole epoch), yet the model already achieves over 70 % accuracy on unseen test images, with loss dropping from around 2.04 to 1.28. By epoch 3 the model is comfortably into the mid-80 % range on validation accuracy, and by epoch 5 it reaches almost 87 %, while its validation loss has fallen below 0.50.

By the end of epoch 10, training accuracy settles at about 88.7 % with a loss of 0.41, and validation accuracy has climbed to 89.0 % with a loss of 0.37. The steadily narrowing gap between training and validation metrics, together with the absence of any upward drift in validation loss, indicates that the model is fitting the data well without significant overfitting. Overall, this simple two-layer network achieves strong performance on MNIST in just ten epochs.


#### Predict
We take the first ten MNIST test images and ask the trained model to output a probability distribution over the ten digit classes for each image. We then convert each distribution into a predicted digit by picking the most likely class via `argmax`. Finally, we create a plot and, for each image, display it along with the model's predicted digit and the actual label:

In [None]:
import numpy as np 

sample_size = 10

# Predict class probabilities for the first 10 test images
pred_probs   = model.predict(X_test[:sample_size])

# Convert the probability vectors to concrete class predictions (0–9)
pred_classes = np.argmax(pred_probs, axis=1)

fig, axes = plt.subplots(2, 5, figsize=(12, 6))
# Flatten the 2D array of axes into a 1D list for easier iteration
axes = axes.flatten()

# Loop through each subplot and corresponding test sample
for i, ax in enumerate(axes):

    # Display the i-th test image in greyscale
    ax.imshow(X_test[i], cmap='gray')

    # Set the subplot title to show predicted vs. true label
    ax.set_title(f"Pred: {pred_classes[i]} / True: {Y_test[i]}")

    # Hide the x- and y-axis ticks for a cleaner look
    ax.axis('off')

plt.tight_layout()

plt.show()


### Multi-label classification
Sometimes each input can have multiple labels at once. For instance, an image could be tagged as both "nature" and "sunset".

- *Output layer*: N neurons (one per label) with *sigmoid* activation. Each neuron outputs a probability of that label being present.
- *Loss function*: *Binary cross-entropy* across all labels.

Each label is treated like a *binary classification*, so the network applies a *sigmoid* to each output neuron. This allows any combination of labels to be `1` or `0`.

In our case, imagine you’re a biologist studying yeast cells, and you want to predict where each protein in the cell makes its home. Instead of asking a single "yes or no" question such as "Is this protein in the nucleus?", we ask that question for every compartment at once, because many proteins visit multiple rooms in the cell. One protein might hang out in both the mitochondria (the cell's power station) and the cytoplasm (the main interior), while another could shuttle between the endoplasmic reticulum and the vacuole.

To teach a computer to handle these overlapping locations, we use a *multi-label* approach. We first measure lots of properties for each protein: its composition of amino acids, its size, how it behaves compared with similar proteins in other organisms, and so on. Altogether, that gives us 103 different "clues" per protein. Then we build a neural network that, instead of choosing exactly one location, gives a *probability* for each of the 14 possible compartments.

At the end of our network are 14 outputs, one for each subcellular compartment, such as nucleus, mitochondrion, cytoplasm, and the rest. Each output uses a simple squashing function (called a sigmoid) that turns whatever number the network comes up with into a value between 0 and 1. If the network's output for "mitochondrion" is 0.85, it thinks there's an 85% chance that this protein lives there; if it's 0.10 for "nucleus," it’s only 10% confident the protein is in the nucleus.

When we actually assign labels, we pick a cutoff, often 0.5. Any compartment with a score above 0.5 we mark as "yes, the protein is there," and anything below we mark "no." This way, a single protein can get stamped with any combination of compartments (for example, three "yes" and eleven "no"), which reflects that a protein might truly reside in multiple places.
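
The thresholding step can be sketched in a few lines of NumPy. Only four compartments are shown, and the probabilities are hypothetical outputs for a single protein, not values from the trained model:

```python
import numpy as np

# Hypothetical subset of compartments and sigmoid outputs for one protein
compartments = ["nucleus", "mitochondrion", "cytoplasm", "vacuole"]
probs = np.array([0.10, 0.85, 0.62, 0.30])

threshold = 0.5
flags = (probs > threshold).astype(int)   # one independent yes/no per label
predicted = [name for name, flag in zip(compartments, flags) if flag]

print(flags)      # [0 1 1 0]
print(predicted)  # ['mitochondrion', 'cytoplasm']

# Unlike softmax outputs, independent sigmoid outputs need not sum to 1
print(round(float(probs.sum()), 2))  # 1.87
```

Because each label is decided on its own, lowering or raising the threshold changes how many compartments are assigned without forcing the labels to compete with one another.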

Teaching the network happens through *binary cross-entropy*, which is just a way of scoring how well each of those 14 probabilities matches the real, experimentally known locations. The network looks at each protein's true set of compartments, compares its own guesses, and tweaks its internal parameters to improve. Over hundreds or thousands of proteins, it gradually learns the hidden patterns that tell, for example, "proteins with these amino-acid patterns often go to the vacuole" or "proteins with that molecular weight habitually end up in the endoplasmic reticulum."

In the end, you have a model that can be given a new protein’s 103 measurements and return the 14 probabilities. You then decide which locations to accept based on your threshold. This flexible, multi-label setup fits perfectly with the yeast dataset: proteins are rarely confined to a single compartment, and our network can naturally reflect the rich, overlapping reality of cellular life.


### Yeast dataset
The Yeast dataset is a collection of information about thousands of yeast proteins, gathered to help us teach computers how to guess where in a cell each protein lives. Scientists have measured many different properties of each protein, over a hundred per protein, to capture clues about its behaviour and characteristics.

When we load the data, we get two main pieces. First, a matrix called *X* with shape (2417, 103). You can think of this as a big spreadsheet with 2,417 rows (one for each protein) and 103 columns (one for each measured property). These properties might include the protein’s molecular weight, how many of each amino acid it contains, or patterns in how similar it is to proteins from other organisms. All of these columns give the computer enough "clues" to start making educated guesses.

The second piece, *Y*, is also a 2,417-row table, but instead of 103 columns, it has 14 columns. Each of those 14 columns corresponds to a possible location inside the cell (for example, nucleus, cytoplasm, mitochondrion, cell membrane, and so on). Because a single protein can sometimes be found in more than one location, this isn’t a simple yes-or-no choice; it’s a set of yes-or-no flags. In computer speak, we call this a *multi-label* problem, meaning each row can have multiple "yes" labels turned on at once.

Putting it all together, training a machine-learning model on this dataset means showing the computer those 103-column "clue sheets" alongside the 14-column "location sheets" for each protein, then asking it to learn the hidden patterns. Once the model has learned, we can give it a brand-new protein’s clues (the 103 numbers) and ask it to predict which of the 14 cell compartments that protein is most likely to inhabit.

This kind of dataset is great for exploring different ways to handle multiple simultaneous labels. You might break the problem into 14 independent binary questions (“Is it in the nucleus? Is it in the mitochondrion?” and so on), or you might use methods that consider correlations between compartments (for instance, some proteins commonly move between cytoplasm and nucleus).

#### Loading the data
Here we fetch the Yeast protein data from the scikit-multilearn library by calling `load_dataset('yeast', 'undivided')`, which returns a table of measurements for each protein (like its amino-acid make-up) and a matching table of yes/no flags showing which of the 14 cellular compartments each protein can be found in:

In [None]:
import numpy as np
from skmultilearn.dataset import load_dataset

# Load the Yeast dataset from scikit-multilearn
X, Y, _, _ = load_dataset('yeast', 'undivided')

print(f"Feature shape: {X.shape}")
print(f"Label shape: {Y.shape}")  # Multi-label (sparse) matrix


#### Preprocessing
Before we train a model, we will preprocess the data into the right shape and scale. First, we turn the sparse feature and label matrices into ordinary arrays since our machine-learning tools prefer them. Using a `StandardScaler`, we centre each feature on a mean of zero and stretch or shrink it so it has a standard deviation of one. This ensures that features measured on wildly different scales do not unduly influence the learning process:

In [None]:
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler     

# Create a StandardScaler instance to normalise our features
scaler = StandardScaler()

# The feature matrix X is currently a sparse matrix; convert it to a dense NumPy array
X = X.toarray()

# Fit the scaler on X and transform X so that each feature has zero mean and unit variance
X = scaler.fit_transform(X)

# The label matrix Y is also sparse; convert it to a dense NumPy array for compatibility with most estimators
Y = Y.toarray()
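
For intuition, the scaling step amounts to the following arithmetic, shown here as a NumPy-only sketch on a made-up feature matrix (the values are not taken from the Yeast data):

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales
features = np.array([[1.0, 1000.0],
                     [2.0, 3000.0],
                     [3.0, 5000.0],
                     [4.0, 7000.0]])

mean = features.mean(axis=0)   # per-column mean
std = features.std(axis=0)     # per-column population std (ddof=0), as StandardScaler uses

scaled = (features - mean) / std

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns contain the same values even though the raw features differed by three orders of magnitude, so neither dominates the weight updates purely because of its units.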


#### Resampling
We now extract our train and test sets from our preprocessed data:

In [None]:
# Train/test split
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=seed)

#### Model
In the context of the Yeast dataset, our model ingests each protein’s quantitative descriptors (such as amino‐acid composition and physicochemical indices) through an explicitly defined input layer matching the dimensionality of the feature space. These features are subsequently transformed by two fully connected (“dense”) hidden layers, comprising 256 and 128 neurons respectively, each employing the rectified linear unit (ReLU) activation function with L2 weight regularisation and each followed by batch normalisation and dropout (50% and 30% respectively). Through successive layers of linear transformation followed by non‐linear activation, the network hierarchically extracts and abstracts feature interactions that characterise distinct subcellular localisation patterns.

The terminal layer comprises 14 independent sigmoid‐activated units, corresponding to the fourteen possible subcellular compartments annotated in the Yeast dataset. If we apply a binary cross‐entropy loss function, the model treats each localisation label as a separate Bernoulli trial, thereby accommodating the inherently multi‐label nature of protein localisation (since a single protein may reside in multiple compartments). 

Model parameters are optimised via the `Adam` algorithm over 10 training epochs, with performance evaluated against a held‐out test set after every epoch to monitor generalisation and spot overfitting:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout
from tensorflow.keras.regularizers import l2

# Determine the number of input features and output labels
num_features = X_train.shape[1]   # e.g. number of features per protein descriptor
num_labels   = Y_train.shape[1]   # e.g. 14 yeast localisation classes

# Construct a sequential neural network for multi-label classification
model = Sequential([
    # Input layer expects vectors of length num_features
    Input(shape=(num_features,)),
    
    # Dense layer with 256 units, L2 weight decay to penalise large weights,
    # followed by batch norm to stabilise activations, then dropout to reduce overfitting
    Dense(256, activation='relu', kernel_regularizer=l2(1e-4)),  # 256 neurons, ReLU activation
    BatchNormalization(),                                        # normalise activations batch-wise
    Dropout(0.5),                                                # randomly drop 50% of neurons
    
    # Smaller dense layer with 128 units, same regularisation scheme
    Dense(128, activation='relu', kernel_regularizer=l2(1e-4)),  # 128 neurons, ReLU activation
    BatchNormalization(),                                        # batch norm for stability
    Dropout(0.3),                                                # randomly drop 30% of neurons
    
    # Output layer: one sigmoid neuron per label, outputs independent probabilities
    Dense(num_labels, activation='sigmoid')                      # multi-label setup
])

# Compile the model with appropriate settings:
# binary_crossentropy loss treats each label as a separate binary classification
# Adam optimiser adapts the step size for each parameter
# 'accuracy': with a multi-unit output, Keras typically resolves this to categorical
# (argmax-based) accuracy rather than per-label accuracy; 'binary_accuracy' gives the latter
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Train the model:
# X_train, Y_train: preprocessed feature and label matrices
# validation_data: held-out test set to monitor generalisation each epoch
# epochs: total passes through the training data
# batch_size: number of samples processed before each weight update
# verbose: 1 for progress bar output per epoch
history = model.fit(
    X_train,
    Y_train,
    validation_data=(X_test, Y_test),
    epochs=10,
    batch_size=32,
    verbose=1
)


#### Evaluate
Let's see how well the model is learning over time. We plot the training loss and the validation loss to judge whether our model is converging (loss falling) and whether it’s over- or under-fitting (training and validation curves diverging or staying close):

In [None]:
# Plot training & validation loss
import matplotlib.pyplot as plt  # harmless if already imported earlier in the notebook

plt.figure()

plt.plot(history.history['loss'], marker="o", label='Training loss')
plt.plot(history.history['val_loss'], marker="o", label='Validation loss')

plt.title('Yeast loss over epochs')

plt.xlabel('Epoch')
plt.ylabel('Binary Crossentropy Loss')

plt.legend()

plt.show()

Over these ten epochs the network learns something useful almost immediately: validation accuracy rises from about 22 % in the first epoch to a peak of roughly 26 % in epoch 2, and validation loss drops from 0.645 to 0.575 over the same period. Beyond that point the two metrics disagree. Validation accuracy declines, falling below 16 % by epoch 5, while validation loss keeps decreasing and flattens out around 0.47. Training loss continues to fall (to about 0.447), with training accuracy around 18 % by epoch 10. The gap between the training and validation curves suggests the model may be starting to overfit, although the evidence is mixed: validation loss is still close to training loss, and accuracy is a blunt measure for multi-label outputs. In practice, we might introduce stronger regularisation to improve the generalisation performance on this Yeast localisation task.
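Accuracy figures for multi-label models depend heavily on how accuracy is defined. The sketch below uses small made-up label matrices (not the Yeast results) to contrast per-label accuracy, which counts every individual yes/no decision, with exact-match accuracy, which only counts a sample when all of its labels are right. Both function names are illustrative.

```python
def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    correct = sum(
        t == p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    total = sum(len(row) for row in y_true)
    return correct / total

def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted correctly."""
    matches = sum(row_t == row_p for row_t, row_p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical ground truth and thresholded predictions (3 samples, 4 labels)
y_true = [[1, 0, 0, 1],
          [0, 1, 0, 0],
          [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0],   # one label missed
          [0, 1, 0, 0],   # all correct
          [1, 0, 1, 0]]   # one missed, one spurious

print(per_label_accuracy(y_true, y_pred))    # 9 of 12 decisions correct
print(exact_match_accuracy(y_true, y_pred))  # 1 of 3 samples fully correct
```

The same predictions score 75 % on one definition and about 33 % on the other, so it is worth checking which definition a reported accuracy uses before interpreting it.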

#### Predict

We use the trained multi-label network to score every one of the 14 Yeast subcellular locations for each protein in the test set. After calling `model.predict`, we get the matrix of probabilities (`pred_probs`) with one row per sample and one column per location. 

Focusing on a single example (here, the first test protein), we identify which columns exceed a 0.5 cutoff. These are the model's "yes" predictions. We do the same on the ground-truth label vector to see which compartments it really belongs to. Finally, we print both the raw index positions and, by mapping through a human-readable list (`yeast_labels`), the actual compartment names, so we can compare predicted compartments against the true ones to see exactly where the model got it right or wrong:


In [None]:
import numpy as np

# Use the trained model to predict probabilities for each label on the test set
pred_probs = model.predict(X_test)

# Optional list of human-readable names for each of the 14 Yeast subcellular locations
yeast_labels = [
    'MIT', 'NIT', 'NUN', 'EXC', 'CYT', 'ME2', 'ME3',
    'POX', 'VAC', 'NUC', 'ERL', 'DIA', 'MCM', 'HMG'
]

# Choose one test example to inspect
sample_index = 0

# Find which labels exceed the 0.5 probability threshold for this sample
pred_label_indices = np.where(pred_probs[sample_index] > 0.5)[0]

# Find which labels are actually present in the ground truth for this sample
true_label_indices = np.where(Y_test[sample_index] == 1)[0]

# Print out the numerical indices of predicted vs. true labels
print("\nPredicted label indices:", pred_label_indices)
print("True label indices     :", true_label_indices)

# Translate indices into human-friendly label names and print
print("\nPredicted label names:", [yeast_labels[i] for i in pred_label_indices])
print("True label names     :", [yeast_labels[i] for i in true_label_indices])
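
The same thresholding step can be wrapped in a small helper and applied to every sample. The sketch below is plain Python with made-up probabilities and a shortened, hypothetical label list; `probs_to_names` is an illustrative helper, and the 0.5 cutoff is the same assumption used above rather than a tuned value.

```python
def probs_to_names(probs, names, threshold=0.5):
    """Return the names of all labels whose probability exceeds the threshold."""
    return [name for name, p in zip(names, probs) if p > threshold]

# Hypothetical label names and predicted probabilities for two samples
names = ['A', 'B', 'C', 'D']
pred_probs_demo = [
    [0.91, 0.12, 0.55, 0.03],
    [0.20, 0.49, 0.50, 0.08],   # nothing strictly above 0.5
]

for i, row in enumerate(pred_probs_demo):
    print(f"Sample {i}: {probs_to_names(row, names)}")
```

Note that a sample can end up with no predicted labels at all; lowering the threshold makes the model say "yes" more often, at the cost of more false positives.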


### Regression outputs
In regression tasks, the goal is to predict a continuous numeric value rather than a category. This is different from classification, where the model selects from a set of discrete classes. Common examples of regression include predicting house prices, estimating temperatures, or forecasting sales figures. In each case, the output is a real number, not a label.

To support this, a neural network for regression typically ends with an output layer that has no activation function. This means the output is a raw number, not constrained to a particular range like with sigmoid or softmax. If the model needs to predict a single value, it will have one output neuron. If multiple values are needed (e.g. predicting both x and y coordinates), the output layer will have one neuron per value.

The most common loss functions used in regression are *mean squared error (MSE)* and *mean absolute error (MAE)*. MSE squares each error, so it penalises large errors much more heavily and is sensitive to outliers, while MAE penalises errors in proportion to their size and is therefore more robust to outliers. These losses help the model adjust its internal weights to get as close as possible to the true numeric targets during training.
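To see the difference in outlier sensitivity, the sketch below computes both losses by hand on made-up numbers: one set of predictions is slightly off everywhere, the other is perfect except for a single large miss. The helper names `mse` and `mae` are ours.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.0, 3.0, 4.0, 5.0]
close = [2.5, 2.5, 4.5, 4.5]      # every prediction off by 0.5
outlier = [2.0, 3.0, 4.0, 9.0]    # three perfect, one off by 4.0

print("close  :", mse(y_true, close), mae(y_true, close))      # 0.25 0.5
print("outlier:", mse(y_true, outlier), mae(y_true, outlier))  # 4.0 1.0
```

Going from the first set of predictions to the second, MAE doubles while MSE grows sixteen-fold, which is the sense in which MSE is dominated by the largest errors.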

Overall, regression outputs are widely used in both practical applications and scientific modelling. The setup is simple but powerful: numeric features go in, and the network learns to output one or more continuous values based on patterns in the data.


### California Housing dataset
The California Housing dataset is a classic benchmark drawn from the 1990 U.S. census and widely used for regression tasks in machine learning. It comprises information on block groups in California (small geographic areas that typically have a population of 600 to 3,000 people). For every block group, the dataset provides eight numerical features, including median income, average number of rooms per household, latitude and longitude, and other demographic or geographic measurements that might influence housing values.

The target variable is the median house value for each block group, expressed in hundreds of thousands of dollars. Because this dataset captures both socioeconomic and locational factors, it serves as a good example for linear regression, for more complex tree-based or neural models, and for evaluating continuous-value predictions.

#### Loading the data
The California Housing dataset can be loaded from scikit-learn:

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf

# Load the dataset
data = fetch_california_housing()

X = data.data  # Features: e.g., median income, house age, etc.
Y = data.target  # Target: median house value (in $100,000s)


#### Resampling
As always, let's extract the train and test sets: 

In [None]:
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=seed
)

#### Preprocessing
We will perform some simple scaling of the values by way of preprocessing:

In [None]:
# Feature scaling
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
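
As a rough illustration of what the scaler does, the sketch below standardises a single made-up feature by hand: the mean and standard deviation are estimated from the training values only and then reused for the test values. It uses the population standard deviation, which is also what `StandardScaler` uses; the numbers themselves are hypothetical.

```python
import statistics

train = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical training values of one feature
test = [3.0, 6.0]                   # hypothetical test values

# "fit": estimate the statistics from the training data only
mean = statistics.mean(train)
std = statistics.pstdev(train)      # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have zero mean and unit variance, while the test values generally do not; that is expected, because the test set must not influence the statistics used for preprocessing.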


#### Model
We define and train a straightforward neural network to predict a continuous target from the set of input features. We begin by declaring a feed-forward model: an input layer that matches the dimensionality of the feature matrix, followed by three hidden layers with 128, 64 and 32 neurons respectively, each using a ReLU activation to capture non-linear relationships and followed by batch normalisation (with dropout after the first two). The final layer is a single linear neuron, which directly outputs the predicted value (no activation function) because this is a regression problem rather than classification.

The model is compiled with the Adam optimiser, which adapts the step size for each parameter during training, and with mean squared error (MSE) as its loss function, which penalises large deviations between prediction and truth more heavily. Mean absolute error (MAE) is also tracked, giving us an interpretable metric in the same units as our target variable. Finally, the network is fit for 10 epochs with batches of 32 samples:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.optimizers import Adam

# Determine the number of input features from the training data
input_dim = X_train.shape[1]  # e.g. number of predictors in the California Housing dataset

# Build a deeper, regularised feed-forward network for regression
model = Sequential([
    # Input layer: define expected feature vector length
    tf.keras.Input(shape=(input_dim,)),
    
    # Dense layer to capture complex feature interactions,
    # followed by batch norm to stabilise activations, then dropout to reduce overfitting
    Dense(128, activation='relu'),  # 128 neurons with ReLU activation
    BatchNormalization(),           # re-centre and re-scale layer outputs
    Dropout(0.5),                   # drop 50% of units randomly during training
    
    # Smaller dense layer for further non-linearity,
    # again with batch norm and some dropout
    Dense(64, activation='relu'),   # 64 neurons with ReLU activation
    BatchNormalization(),           # stabilise activations
    Dropout(0.3),                   # drop 30% of neurons
    
    # Final hidden layer to refine high-level representations
    Dense(32, activation='relu'),   # 32 neurons with ReLU activation
    BatchNormalization(),           # normalise to speed learning
    
    # Output layer - single linear neuron for continuous target prediction
    Dense(1)                        # no activation (identity), appropriate for regression
])

# Compile the model with appropriate loss and metrics for regression
model.compile(
    optimizer=Adam(),               # Adam optimiser (default lr=0.001) adapts learning over time
    loss='mean_squared_error',      # MSE loss penalises large errors quadratically
    metrics=['mae']                 # MAE to monitor average error in original units
)

# Train the model:
# X_train, Y_train: the training data
# validation_data: hold out X_test, Y_test to assess generalisation each epoch
# epochs: number of full passes through the training set
# batch_size: samples per gradient update
# verbose: display training progress
history = model.fit(
    X_train, Y_train,               # inputs and continuous targets
    validation_data=(X_test, Y_test),
    epochs=10,                      # train for 10 epochs
    batch_size=32,                  # groups of 32 samples per update
    verbose=1                       # show progress bar and metrics
)


Over the ten epochs, the network quickly learned a useful mapping from input features to home values. In the first epoch, training MSE was about 3.16 (MAE ≈ 1.38), while validation MSE was already down to 0.53 (MAE ≈ 0.53). By epoch 2 validation MSE had dropped below 0.5 (0.45, with MAE ≈ 0.48), and it improved further to its best value of 0.39 (MAE ≈ 0.45) around epoch 4. Thereafter, validation error oscillated slightly but remained low, ending at an MSE of 0.40 and MAE of 0.44, while training error steadily declined to MSE ≈ 0.42 and MAE ≈ 0.47.

This pattern shows solid learning and generally stable generalisation: we gain most of the benefit in the first few epochs, and there is little sign of overfitting, since validation error ends slightly below training error (which can happen when dropout is active only during training). A mean absolute error of roughly 0.44 on the validation set (the target is in units of $100,000, so about $44,000) indicates the model is producing reasonably accurate price estimates for unseen data.


#### Evaluate
Let's visualise the loss to see how well it did during training:

In [None]:
# Plot training & validation loss
plt.figure()

plt.plot(history.history['loss'], marker='o', label='Training loss')
plt.plot(history.history['val_loss'], marker='o', label='Validation loss')

plt.title('Housing loss over epochs')

plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error Loss')

plt.legend()

plt.show()

#### Predict
We take the first ten examples from the test set and ask the trained regression model to predict their median house values. Because the dataset expresses the target in units of $100,000, each raw prediction is multiplied by 100,000 to convert it into dollars.

We then extract the single output from each prediction array and likewise pull out the corresponding true value, convert both to standard floats, and then print a comparison to gauge how close each estimate is to the actual median house value:


In [None]:
# Predict on a few test samples
sample_size = 10

predictions = model.predict(X_test[:sample_size])

# Show predictions vs actual values, converting to plain Python floats
for i in range(sample_size):
    # predictions has shape (sample_size, 1), so take the single value in each row
    predicted_price = float(predictions[i][0] * 100000)
    # Y_test is a 1-D array, so Y_test[i] is already a scalar
    actual_price = float(Y_test[i] * 100000)
    
    print(f"Sample {i+1}: Predicted: ${predicted_price:.2f}, Actual: ${actual_price:.2f}")


### Understanding output layers

As we have seen, the *output layer* of a neural network is responsible for generating the final predictions after all hidden layers have completed their transformations. The architecture and configuration of this layer must align closely with the specific task at hand, whether it is classification, regression, or something more complex. This involves choosing the right number of neurons, the appropriate activation function, and a suitable loss function to guide learning during training. Below is a breakdown of common output layer configurations across different types of machine learning tasks.

- *Binary classification*
  Used when the task involves distinguishing between two classes (e.g. positive vs. negative sentiment). A single neuron with a sigmoid activation function outputs a probability between 0 and 1. The *binary cross-entropy* loss function penalises a prediction according to the negative log of the probability it assigns to the correct class, so confident mistakes are penalised most heavily.

- *Multi-class classification*
  For problems with more than two classes (e.g. digit recognition 0–9), the output layer typically contains one neuron per class. A *softmax* activation converts the raw scores into probabilities that sum to 1. The model is trained using *categorical* or *sparse categorical cross-entropy*, depending on whether the labels are one-hot encoded or not.

- *Multi-label classification*
  In multi-label tasks (e.g. proteins with multiple locations), each label is predicted independently, so a sigmoid activation is used for each output neuron. This allows multiple labels to be active simultaneously. Since each output is treated as a separate binary task, *binary cross-entropy* is used as the loss function.

- *Regression*
  For predicting continuous values (e.g. house prices), the output layer has no activation function (linear activation), allowing outputs to take on any real number. Depending on the problem, the loss function might be *mean squared error (MSE)*, *mean absolute error (MAE)*, or another suitable regression loss.
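
The practical difference between softmax and per-neuron sigmoid can be checked numerically. The sketch below applies hand-written versions of both (not the Keras implementations) to the same made-up raw scores: softmax yields values that sum to 1, whereas the sigmoid outputs are independent and can sum to anything between 0 and the number of neurons.

```python
import math

def sigmoid(z):
    """Squash a single raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(zs)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # hypothetical raw scores from the last layer

soft = softmax(logits)
sig = [sigmoid(z) for z in logits]

print("softmax:", [round(p, 3) for p in soft], "sum =", round(sum(soft), 3))
print("sigmoid:", [round(p, 3) for p in sig], "sum =", round(sum(sig), 3))
```

With these scores, two of the three sigmoid outputs exceed 0.5, so a multi-label model would switch both labels on, while softmax forces the classes to share a single unit of probability.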
  
Let's summarise this in a table for easier comparison:

| *Task type*   | *Output layer*            | *Activation*  | *Loss function*                                |
| ------------- | ------------------------- | ------------- | ---------------------------------------------- |
| *Binary*      | 1 neuron                  | Sigmoid       | Binary Cross-Entropy                           |
| *Multi-class* | N neurons (one per class) | Softmax       | Categorical / Sparse Categorical Cross-Entropy |
| *Multi-label* | N neurons (one per label) | Sigmoid       | Binary Cross-Entropy                           |
| *Regression*  | 1 (or more) neuron(s)     | None (Linear) | Mean Squared Error / Mean Absolute Error etc.  |
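
The table can also be written as a small lookup function. The sketch below is a hypothetical helper (`output_layer_config` is not part of Keras) that returns the number of units, the activation name and the loss name for a task type; the strings are the usual Keras identifiers, and MSE is chosen as the regression default purely as an assumption.

```python
def output_layer_config(task, n_outputs=1, one_hot=True):
    """Return (units, activation, loss) for a task type, using Keras string names."""
    if task == 'binary':
        return 1, 'sigmoid', 'binary_crossentropy'
    if task == 'multi-class':
        loss = ('categorical_crossentropy' if one_hot
                else 'sparse_categorical_crossentropy')
        return n_outputs, 'softmax', loss
    if task == 'multi-label':
        return n_outputs, 'sigmoid', 'binary_crossentropy'
    if task == 'regression':
        return n_outputs, None, 'mean_squared_error'   # None means a linear output
    raise ValueError(f"Unknown task type: {task!r}")

units, activation, loss = output_layer_config('multi-label', n_outputs=14)
print(units, activation, loss)
```

The returned values could then be passed to `Dense(units, activation=activation)` and `model.compile(loss=loss, ...)`.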


### What have we learnt?
Different tasks require different output layer setups. Activations such as sigmoid and softmax are chosen to match the nature of the labels, and the loss function must match too (binary, categorical, or regression losses). In short, the final layer of the network must align with the requirements of the classification or regression task.

Choosing the correct configuration for the output layer is crucial to the performance and correctness of a neural network model. The number of neurons, activation functions, and loss functions must all be tailored to the nature of the problem being solved. A mismatch, for instance, using a softmax activation for a regression task, can severely impair the model's ability to learn. 
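
One concrete way such a mismatch shows up: softmax applied to a single output neuron always returns 1.0, whatever the raw score, so a one-neuron regression head with softmax cannot produce any other value. The sketch below checks this with a hand-written softmax (not the Keras implementation) on a few arbitrary scores.

```python
import math

def softmax(zs):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# With one neuron there is nothing to normalise against, so the output is constant
for raw_score in [-3.0, 0.0, 2.5, 100.0]:
    print(raw_score, softmax([raw_score]))
```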

In conclusion, each task type has its own setup to consider. For either/or tasks (spam/not spam, positive/negative sentiment) we have *binary* classification, with a single sigmoid neuron. For exactly one of *N* classes (digits 0–9, one category) we have a *multi-class* problem and should use softmax. For any combination of labels (*multi-label*) we use a sigmoid on each output. If we are predicting continuous values (house price, temperature), then we need *regression* with a linear output.