## Deep Learning models in TensorFlow
### Introduction
We will explore different types of neural network models and how they can be applied to image classification using the *MNIST* dataset again, a well-known collection of handwritten digits (0–9). This will give us a chance to understand how various deep learning architectures work and when each might be useful.

We will introduce four types of neural network models, starting from the simplest (Perceptron) to more advanced models like LSTMs. We will explain how we evaluate model performance and understand whether a model is good enough.  We will then visualise model performance using metrics like *Accuracy*.  

Experimenting with each of these models will enable us to develop a practical understanding of how deep learning can be applied to different types of data and problems, from simple classification to more complex, sequential prediction tasks.

We start by setting a common number of epochs for each of our models:

In [None]:
epochs = 10  # shortened to 10 for demonstration purposes

### Perceptron (Single-Layer Neural Network)
A *Perceptron* is the simplest kind of neural network and is often seen as the starting point for understanding more complex models. It was one of the earliest ideas in artificial intelligence and is mainly used to tell the difference between two groups of data, for example, whether a message is spam or not. The Perceptron does this by trying to draw a straight line (or boundary) that separates one type of data from another.

To understand how a perceptron makes decisions, we can break it down into a few key parts:

- *Inputs* – These are the features or values the model looks at, like pixel brightness in an image.  
- *Weights* – Each input is multiplied by a number that tells the model how important it is.  
- *Bias* – A small adjustment that shifts the decision boundary to improve accuracy.  
- *Weighted sum* – All the weighted inputs and the bias are added together to get a score.  
- *Activation function* – This function checks whether the score is high enough to trigger a certain output (usually 0 or 1).  
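
As a minimal sketch of how these parts fit together, the snippet below computes the weighted sum for a single input and applies a step activation. The weight and bias values are hand-picked for illustration and are not learned parameters.

```python
import numpy as np

# Illustrative (hand-picked) values, not learned parameters
inputs = np.array([1.0, 0.0])      # two input features
weights = np.array([0.6, 0.4])     # one weight per input
bias = -0.5                        # shifts the decision boundary

# Weighted sum: each input times its weight, plus the bias
score = np.dot(inputs, weights) + bias

# Step activation: output 1 if the score is non-negative, otherwise 0
output = 1 if score >= 0 else 0

print(f"score = {score:.2f}, output = {output}")
```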

So, in essence, the perceptron adds up the inputs (each scaled by its weight), checks the total score, and decides which class the data belongs to. During training, the perceptron learns by comparing its prediction to the correct answer and adjusting the weights if it gets it wrong. This happens repeatedly with many examples until the model improves. The key steps are:

- Make a prediction using the current weights.  
- Compare it to the correct label.  
- Adjust the weights slightly if the prediction is wrong (using a learning rate to control the step size).  
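
The three steps above can be traced by hand for a single training example. The following is a minimal sketch, assuming zero initial weights, a learning rate of 0.1 and a made-up example whose correct label is 0.

```python
import numpy as np

learning_rate = 0.1                 # controls the step size
weights = np.zeros(2)               # start from zero weights
bias = 0.0

x = np.array([0, 1])                # one made-up training example
target = 0                          # its correct label

# 1. Make a prediction using the current weights
prediction = 1 if np.dot(x, weights) + bias >= 0 else 0

# 2. Compare it to the correct label
error = target - prediction         # -1, 0 or +1

# 3. Nudge the weights and bias in the direction that reduces the error
weights = weights + learning_rate * error * x
bias = bias + learning_rate * error

print(prediction, error, weights, bias)
```

Here the initial prediction is wrong (1 instead of 0), so the weight attached to the active input and the bias are both reduced by the learning rate. The weight attached to the input that was 0 is left unchanged.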

This learning process helps reduce errors and move the decision boundary in the right direction. While perceptrons are useful for learning the basics of neural networks, they have some clear limitations:

- Only works for linearly separable data – it can’t handle cases where a straight line won’t do the job (like XOR problems).  
- Too simple for complex patterns – it struggles with anything beyond basic classification.  
- Rigid activation function – the step function doesn’t allow for shades of grey in predictions.  
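
The first limitation can be checked with a small experiment. The sketch below trains a perceptron with the same update rule on the XOR truth table; the helper name `train_perceptron` and the choice of 100 epochs are illustrative assumptions. Because no straight line separates the XOR classes, the accuracy can never reach 100 %, however long we train.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    # Hypothetical helper: the perceptron learning rule with a step activation
    weights, bias = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            prediction = 1 if np.dot(xi, weights) + bias >= 0 else 0
            error = target - prediction
            weights += lr * error * xi
            bias += lr * error
    return weights, bias

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])      # XOR: 1 only when the two inputs differ

w, b = train_perceptron(X_xor, y_xor)
predictions = np.array([1 if np.dot(xi, w) + b >= 0 else 0 for xi in X_xor])
accuracy = np.mean(predictions == y_xor)

print(f"XOR accuracy after training: {accuracy:.2f}")  # always below 1.00
```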


### Install Python libraries

In [None]:
!pip install tensorflow torch numpy pandas scikit-learn matplotlib requests

### Load the data (create)
We create our own synthetic dataset representing the AND logic gate:

In [None]:
import numpy as np

# Create a simple dataset for the AND logic gate
# Input combinations: each row is a pair of binary inputs [x1, x2]
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

# Target outputs: results of the AND operation on the inputs
Y = np.array([0, 0, 0, 1])  # Only [1, 1] gives 1 in an AND gate

### Model
We will create our own Python class from scratch to represent the Perceptron:

In [None]:
# Define a Perceptron class from scratch
class Perceptron:
    def __init__(self, input_size, learning_rate=0.1, epochs=10):
        # Initialise weights and bias to zero
        self.weights = np.zeros(input_size)  # One weight per input feature
        self.bias = 0
        self.lr = learning_rate              # Learning rate controls step size
        self.epochs = epochs                 # Number of times to loop through the data

    def activation(self, z):
        # Step activation function: returns 1 if z ≥ 0, otherwise 0
        return 1 if z >= 0 else 0

    def predict(self, x):
        # Compute the linear combination of inputs and weights + bias
        z = np.dot(x, self.weights) + self.bias
        # Apply the step function to produce binary output
        return self.activation(z)

    def train(self, X, y):
        # Loop through the training data multiple times (epochs)
        for epoch in range(self.epochs):
            for xi, target in zip(X, y):         # For each input-output pair
                prediction = self.predict(xi)    # Predict current output
                error = target - prediction      # Calculate prediction error

                # Update rule for weights and bias (only if there's an error)
                self.weights += self.lr * error * xi
                self.bias += self.lr * error

            # Print weights and bias at the end of each epoch
            print(f"Epoch {epoch+1} | Weights: {self.weights} | Bias: {self.bias}")

# Create and train the Perceptron on the AND gate
model = Perceptron(input_size=2)
model.train(X, Y)

#### Predict
We can now make predictions on some test values from our training data. This is just for demonstration:

In [None]:

# Test the model's predictions on all inputs
print("\nPredictions:")
for xi in X:
    print(f"{xi} => {model.predict(xi)}")  # Show input and predicted output


### Multilayer Perceptron (MLP)

The basic *perceptron* is like a simple decision-maker. It looks at inputs (like exam scores or pixels in an image) and tries to make a yes/no decision. But it's quite limited: it can only handle very simple patterns, like drawing a straight line to split things into categories.

To go beyond that, we use something called a *Multilayer Perceptron* (MLP). Think of it like building a team of decision-makers, where each one passes information to the next. This stacked approach lets the model understand much more complicated patterns, even ones that can't be separated with a simple line. We construct a Multilayer Perceptron as follows:

- *Hidden layers*: 
These are extra layers of "mini-decision-makers" placed between the input and the final output. Each layer helps the model spot different features or patterns in the data. More layers mean more brainpower to figure things out.

- *Activation functions*: 
These are like switches that tell each layer how to respond. Two common ones are:
  - *ReLU*: Think of this as a filter that lets through positive numbers and blocks the negatives. It helps the model learn quickly and avoid problems during training (more on this later).
  - *Softmax*: Used at the end when we’re picking from several categories (like identifying handwritten digits). It turns the output into a list of probabilities, so we can pick the most likely answer.

- *Optimiser*: 
This is the algorithm that helps the model get better during training. One popular choice is *Adam*, which adapts its step size for each weight automatically as training progresses, so the model often improves faster and more reliably.

When we stack layers and use activation functions, an MLP can go far beyond what a single perceptron can do.
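
To make the two activation functions mentioned above concrete, here is a minimal NumPy sketch. The `scores` array is made up, and the function names `relu` and `softmax` are our own helpers, not the Keras implementations.

```python
import numpy as np

def relu(z):
    # Pass positive values through unchanged, replace negatives with 0
    return np.maximum(0, z)

def softmax(z):
    # Subtract the max for numerical stability, then normalise to sum to 1
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

scores = np.array([2.0, -1.0, 0.5])    # made-up layer outputs

print("ReLU:   ", relu(scores))        # the negative value becomes 0
probs = softmax(scores)
print("Softmax:", probs.round(3))      # non-negative values that sum to 1
print("Most likely class:", int(np.argmax(probs)))
```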

### Breast Cancer dataset
The *Breast Cancer dataset* we are using comes from the `scikit-learn` library and is a well-known benchmark dataset in machine learning, particularly for binary classification tasks. It contains data collected from digitised images of breast tissue masses, where each sample is described by a set of features computed from the image, such as the radius, texture, perimeter, area, and smoothness of the cell nuclei. These features aim to capture the key physical characteristics of the nuclei, which help distinguish between benign (non-cancerous) and malignant (cancerous) tumours.

The dataset includes 569 samples and 30 numerical features, along with a binary target label indicating whether the tumour is malignant or benign. It is small, well-structured, and presents a meaningful real-world medical classification problem.


### Loading the data

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load Breast Cancer dataset
data = load_breast_cancer()

# Convert data to a pandas DataFrame 
# to print an overview of the data (we won't use this for training etc.)
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target (label) as a new column
df['target'] = data.target

# Show the first few rows
print(df.head())

### Resampling

We split our data into train and test:

In [None]:
X = data.data
Y = data.target  # 0 = malignant, 1 = benign

# Ensure input data are NumPy arrays
X = np.array(X)
Y = np.array(Y)

seed = 7

# Split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=seed)


### Preprocessing
We apply a standard scaler to show how you might preprocess the features, but you may want to go further depending on the data you are working with:

In [None]:
# Standardise features
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We have some data with our features and labels extracted and scaled, and so we can now pass it to our model and train.
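
For intuition, the scaling step can be sketched directly in NumPy. This is an illustration on made-up numbers, not a replacement for `StandardScaler`: the mean and standard deviation are learned from the training rows only and then reused for the test rows.

```python
import numpy as np

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])  # made-up values
test = np.array([[2.0, 400.0]])

# "Fit": learn the per-feature mean and standard deviation from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "Transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))   # approximately 0 for each feature
print(train_scaled.std(axis=0))    # approximately 1 for each feature
print(test_scaled)                 # test values may fall outside the training range
```

Reusing the training statistics for the test set avoids leaking information about the test data into preprocessing.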

In the code below, we create an MLP using TensorFlow's `Sequential` model, which lets us stack layers one after another in a straightforward way. The model starts with an input layer that takes in the training features. This is followed by two hidden layers: the first has 32 neurons, and the second has 16. Both use the `ReLU` (Rectified Linear Unit) activation function, which helps the model learn quickly and effectively by allowing it to focus on patterns that matter. The final layer has just one neuron, using a sigmoid activation function to produce an output between `0` and `1`, essentially a probability that we can interpret as a "yes" or "no" answer for binary classification.
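
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the 30 input features described for the Breast Cancer dataset; the result can be compared with what `model.summary()` reports once the model has been built.

```python
# Layer sizes described above: 30 input features -> 32 -> 16 -> 1
layer_sizes = [30, 32, 16, 1]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out      # weights plus one bias per neuron
    total += params
    print(f"{n_in:>2} -> {n_out:>2}: {params} parameters")

print("Total trainable parameters:", total)
```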

After building the model, we compile it with an `adam` optimiser, which is a clever algorithm that automatically adjusts the weights in the network during training to reduce errors. We use a loss function called binary cross-entropy, which is standard for yes/no tasks, and we ask the model to keep track of its accuracy as it learns. The model is then trained over a number of cycles, known as `epochs`, during which it sees the training data again and again, gradually improving its predictions.

After training, we ask the model to make predictions on unseen test data and evaluate its performance using several metrics: accuracy (how often it was right), mean absolute error (how far off its predictions were on average), and mean squared error (a similar measure that penalises larger mistakes more heavily).
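
The loss and the three metrics mentioned above can be computed by hand on a tiny example. The labels and probabilities below are made up, and the formulas are a plain NumPy sketch rather than the Keras or scikit-learn implementations.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])              # made-up labels
y_prob = np.array([0.9, 0.2, 0.4, 0.8])      # made-up predicted probabilities

# Binary cross-entropy: the loss minimised during training
eps = 1e-7                                   # avoids log(0)
p = np.clip(y_prob, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Accuracy: threshold the probabilities at 0.5, then count matches
y_hat = (y_prob > 0.5).astype(int)
accuracy = np.mean(y_hat == y_true)

# MAE and MSE: computed on the probabilities themselves
mae = np.mean(np.abs(y_true - y_prob))
mse = np.mean((y_true - y_prob) ** 2)

print(f"BCE={bce:.4f}  accuracy={accuracy:.2f}  MAE={mae:.4f}  MSE={mse:.4f}")
```

The third sample (probability 0.4 for a true label of 1) is the only misclassification, and it also contributes most of the loss and error.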

A single perceptron is simple and limited to straightforward decisions, whereas a multilayer perceptron like the one in this code is much more powerful. It can uncover hidden patterns in data and make better decisions, especially when the problem isn’t easily solved with a single dividing line:

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error
import numpy as np


# Build a Multilayer Perceptron (MLP) model using the Sequential API
model = Sequential([
    Input(shape=(X_train.shape[1],)),   # Define the input shape (number of features)
    Dense(32, activation='relu'),       # First hidden layer with 32 neurons and ReLU activation
    Dense(16, activation='relu'),       # Second hidden layer with 16 neurons and ReLU activation
    Dense(1, activation='sigmoid')      # Output layer with 1 neuron for binary classification
])

# Compile the model with optimiser, loss function, and metrics to track
model.compile(
    optimizer=Adam(),                   # Adam optimiser for efficient training
    loss='binary_crossentropy',         # Loss function for binary classification tasks
    metrics=['accuracy']                # Track accuracy during training and evaluation
)

# Train the model using training data, and validate on test data after each epoch
history = model.fit(
    X_train, Y_train,                   # Input features and labels for training
    epochs=epochs,                      # Number of training epochs
    validation_data=(X_test, Y_test)    # Validation data for monitoring performance
)

# Use the model to predict probabilities for the test data
y_pred_probs = model.predict(X_test).flatten()   # Outputs a probability for each sample; flatten to 1D array
y_pred = (y_pred_probs > 0.5).astype(int)        # Convert probabilities to 0 or 1 based on a 0.5 threshold

# Print evaluation metrics to assess model performance
print(f"Accuracy Score: {accuracy_score(Y_test, y_pred):.4f}")               # Classification accuracy
print(f"Mean Absolute Error: {mean_absolute_error(Y_test, y_pred_probs):.4f}")  # Average error in predicted probabilities
print(f"Mean Squared Error: {mean_squared_error(Y_test, y_pred_probs):.4f}")    # Squared error in predicted probabilities

Looking at the above output, training progressed very smoothly, with the network quickly learning to distinguish between the two classes (malignant and benign). In the first epoch it jumped from essentially chance-level (≈42 % training accuracy) to already 75 % on the validation set, while loss dropped from about 0.82 down to 0.58. By epoch 3 it surpassed 90 % validation accuracy with loss around 0.30, and by epoch 5 it was consistently in the mid-90s.

By the final epoch (10), training accuracy reached about 97 % with a loss of 0.11, and validation accuracy climbed to roughly 96.5 % with a loss near 0.10, indicating very strong generalisation and minimal overfitting. The metrics printed after training agree, with an accuracy of 96.5 %, a mean absolute error of approximately 0.077, and a mean squared error of about 0.027. Note that they are computed on the same test set that served as validation data, so they are not an independent check. Overall, the model performs very well on this binary classification task.


#### Plot Train loss and Validation loss
We plot the train and validation loss to see how the model trained each epoch.

The first graph shows the loss over time: on the vertical axis we plot how large the errors were, and on the horizontal axis we plot the number of epochs. We include two lines: one for the training loss, showing how well the model is doing on the data it’s learning from, and one for the validation loss, showing how well it’s generalising to new data. Ideally, both lines should go down over time, but if the training loss keeps improving while the validation loss gets worse, it’s a sign the model is overfitting, that is, it’s memorising the training data rather than learning general patterns.
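
The overfitting pattern described above can also be checked numerically. The sketch below uses made-up loss curves stored in the same dictionary format as `history.history`; the name `history_dict` and the simple rule used to flag overfitting are illustrative assumptions.

```python
import numpy as np

# Made-up loss curves in the same format as history.history
history_dict = {
    "loss":     [0.80, 0.50, 0.35, 0.25, 0.18, 0.12],
    "val_loss": [0.75, 0.52, 0.40, 0.38, 0.41, 0.47],
}

val_loss = np.array(history_dict["val_loss"])
best_epoch = int(np.argmin(val_loss)) + 1       # epochs are counted from 1

# Validation loss rising after its minimum while training loss keeps falling
# is the overfitting pattern described above
overfitting = bool(best_epoch < len(val_loss) and val_loss[-1] > val_loss.min())

print(f"Lowest validation loss at epoch {best_epoch}")
print("Possible overfitting:", overfitting)
```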

The second graph shows the accuracy over time, again with one line for training accuracy and one for validation accuracy. On the vertical axis, we have the proportion of correct predictions, ranging from 0 (completely wrong) to 1 (perfectly correct), and again, epochs are on the horizontal axis. This plot gives us a clear picture of whether the model is genuinely improving in terms of making correct predictions or if it’s simply becoming overconfident on the training set:

In [None]:
import matplotlib.pyplot as plt

# Plot training and validation loss and accuracy
plt.figure(figsize=(10, 4))

# Plotting Loss
plt.subplot(1, 2, 1)

# Plot training loss values stored in history.history['loss']
plt.plot(history.history['loss'], label='Train Loss')

# Plot validation loss values stored in history.history['val_loss']
plt.plot(history.history['val_loss'], label='Val Loss')

# Set the title of the plot
plt.title("Loss over Epochs")

plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.legend()

# Plotting Accuracy
plt.subplot(1, 2, 2)

# Plot training accuracy values stored in history.history['accuracy']
plt.plot(history.history['accuracy'], label='Train Acc')

# Plot validation accuracy values stored in history.history['val_accuracy']
plt.plot(history.history['val_accuracy'], label='Val Acc')

plt.title("Accuracy over Epochs")

plt.xlabel("Epoch")
plt.ylabel("Accuracy")

plt.legend()

# Adjust the spacing between subplots so labels/titles don't overlap
plt.tight_layout()

plt.show()


### Predict
Let's also see what the actual predictions look like. We use the trained model to predict probabilities on the test set, which gives a value between 0 and 1 for each test sample. With the label encoding used by this dataset, values close to 1 indicate a benign tumour and values close to 0 indicate a malignant one:

In [None]:
# Get the values between 0 and 1 for each test sample
y_pred_probs = model.predict(X_test).flatten()

# Convert probabilities to binary class predictions (0 or 1)
# Any value > 0.5 is classified as class 1 (benign), else class 0 (malignant)
y_pred = (y_pred_probs > 0.5).astype(int)

# Print the first 10 predicted class labels
for i in range(10):
    print(f"Sample {i}: predicted class label = {y_pred[i]}")

### Feedforward Neural Network (FNN)

A *Feedforward Neural Network* (often simply called an FNN) is one of the most common types of neural networks. It is called "feedforward" because the information flows in only one direction, from the input layer, through one or more hidden layers, and finally to the output layer. There are no loops or feedback connections, meaning each input is processed and passed straight through the network.

FNNs are an extension of the basic *perceptron* we looked at earlier. Instead of just one layer of neurons, a feedforward network includes one or more *hidden layers*, each made up of many neurons that transform the input data in increasingly complex ways. This allows the network to learn patterns that are much more advanced than a single-layer perceptron could ever handle.

Each neuron in a layer is connected to every neuron in the next layer, and the network learns by adjusting the *weights* on these connections based on how well it predicts the correct answer. This learning happens over multiple rounds, or *epochs*, gradually improving the network's performance.
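
The forward flow of information can be sketched in a few lines of NumPy. The weights below are random and untrained, and the layer sizes (30 inputs, 64 hidden neurons, 1 output) are chosen for illustration; the point is only to show how each fully connected layer transforms the output of the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4, 30))             # 4 made-up samples with 30 features each

# Randomly initialised parameters (untrained), for illustration only
W1, b1 = rng.normal(size=(30, 64)) * 0.1, np.zeros(64)   # input -> hidden
W2, b2 = rng.normal(size=(64, 1)) * 0.1, np.zeros(1)     # hidden -> output

# Information flows strictly forward: input -> hidden -> output
hidden = np.maximum(0, x @ W1 + b1)              # fully connected layer + ReLU
output = 1 / (1 + np.exp(-(hidden @ W2 + b2)))   # fully connected layer + sigmoid

print(hidden.shape, output.shape)                # (4, 64) (4, 1)
```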

FNNs are used for a wide variety of tasks, such as recognising handwritten digits, predicting house prices, or classifying emails as spam or not spam. They are especially useful when the data doesn’t have any particular order or sequence, for example, static images or tabular data, because they treat every input as independent.

>*MLP versus FNN*:
>
> An FNN sounds a lot like an MLP. The key difference is that FNN is a broader term that refers to any neural network where information flows from input to output, without any loops or feedback. MLPs are a type of FNN, but not all FNNs are MLPs. For example, a *Convolutional Neural Network* (CNN) used for image data is also a type of feedforward network, but it uses convolutional layers instead of just fully connected layers. An MLP, in contrast, uses just *Dense* layers throughout.
>

Although they don’t remember past information, feedforward networks are a powerful and versatile starting point for most deep learning problems. We will use the same *Breast Cancer dataset* we used in the MLP example, for comparison:

In [None]:
# Import necessary modules from the TensorFlow Keras library
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam

# Build a Feedforward Neural Network (FNN) for tabular data
model_fnn = Sequential([
    # Input layer: define the input shape (number of features)
    Input(shape=(X_train.shape[1],)),

    # Hidden layer: 64 neurons and ReLU activation
    Dense(64, activation='relu'),

    # Output layer: 1 neuron (since it's a binary classification problem)
    # Sigmoid activation function to get a probability between 0 and 1
    Dense(1, activation='sigmoid')
])

# Specify parameters and compile the model:
#  Adam optimiser, a popular choice for gradient-based optimisation
#  Binary crossentropy is the loss function for binary classification
#  Accuracy is the performance metric to track during training
model_fnn.compile(
    optimizer=Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train the model on the training data (X_train, Y_train)
# For n epochs and validate on the test data (X_test, Y_test).
history_FNN = model_fnn.fit(
    X_train,
    Y_train,
    epochs=epochs,
    validation_data=(X_test, Y_test)
)

In this run the model learns very quickly and generalises exceptionally well. It starts already strong in epoch 1, with training accuracy at about 84 % (loss 0.47) and validation accuracy near 89 % (loss 0.28). By epoch 3 it's above 92 % on validation with loss down around 0.16, and by epoch 5 validation accuracy climbs to nearly 95 % with loss roughly 0.12. Training accuracy keeps improving and reaches about 98 % by epoch 10, while validation accuracy plateaus at around 96.5 % and validation loss steadily falls to about 0.085. Because validation loss keeps falling alongside training loss, and validation accuracy at least does not degrade, there is little sign of overfitting. Overall, this configuration provides reliable binary classification on the tabular data.


#### Plot Train loss and Validation loss

In [None]:
import matplotlib.pyplot as plt

# Create a figure to plot training and validation metrics side by side
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)  # Create the first subplot (1 row, 2 columns, position 1)

# Plot training and validation loss values over epochs
plt.plot(history_FNN.history['loss'], label='Train Loss')        # Training loss
plt.plot(history_FNN.history['val_loss'], label='Val Loss')      # Validation loss

# Add title and axis labels
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Binary Crossentropy Loss')

# Add legend for better readability
plt.legend()

plt.subplot(1, 2, 2)  # Create the second subplot (position 2)

# Plot training and validation accuracy values over epochs
plt.plot(history_FNN.history['accuracy'], label='Train Accuracy')        # Training accuracy
plt.plot(history_FNN.history['val_accuracy'], label='Val Accuracy')      # Validation accuracy

# Add title and axis labels
plt.title('Accuracy over Epochs')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')

plt.legend()

# Adjust spacing between subplots and display the plots
plt.tight_layout()  # Prevent subplots from overlapping
plt.show()      


### Recurrent Neural Network (RNN)

A *Recurrent Neural Network (RNN)* is a special type of neural network designed to work with *sequential data*, that is, data where the order of observations matters. Unlike standard feedforward networks, RNNs include a form of memory that allows them to use information from earlier time steps when making predictions. This makes them well-suited for tasks such as language modelling, time series prediction, and sensor-based activity recognition.

In this case, we’re using an accelerometer and gyroscope dataset recorded on a mobile phone (described in its own section below), which contains motion sensor data (readings in x, y, and z axes) collected as a user performs different physical activities, such as walking, climbing stairs, or standing still. These activities produce distinct motion patterns that can be recognised by a neural network, but only if we consider the *sequence* of sensor readings over time, not just isolated measurements.

To model this data with an RNN, we first split the continuous sensor data into *fixed-length sequences* (for example, 50 time steps per sample). Each sequence is treated as a mini time series, with shape:

`samples × time_steps × features`  

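In our case, that's something like the following, with 3 channels if only the accelerometer axes are used, or 6 when the gyroscope axes are included as in the preprocessing code later in this section:

`(number of segments, 50 readings, number of sensor channels)`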
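
As a minimal sketch of how such a shape arises, the snippet below slides a 50-step window over a made-up stream of readings. The stream length (200) and the channel count (`n_channels = 3`) are arbitrary values chosen for illustration.

```python
import numpy as np

n_channels = 3                      # illustrative number of sensor channels
time_steps = 50                     # readings per window

# A made-up stream of 200 readings, one row per time step
stream = np.arange(200 * n_channels, dtype=float).reshape(200, n_channels)

# Slide a window of 50 readings along the stream, one step at a time
windows = np.stack([stream[i:i + time_steps]
                    for i in range(len(stream) - time_steps)])

print(windows.shape)   # (150, 50, 3)
```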

This allows the RNN to process the signal one time step at a time, learning to detect patterns that evolve across the full window, such as acceleration spikes during running, or periodic motion during walking. By maintaining an internal state that gets updated over time, the RNN can "remember" key moments earlier in the sequence to help it classify the entire activity.
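
The idea of an internal state that is updated at every time step can be sketched in plain NumPy. The weights here are random and untrained, the sizes are arbitrary, and `tanh` is used because it is the default activation of Keras's `SimpleRNN`; a different activation can be chosen in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

time_steps, n_features, n_units = 50, 3, 8
sequence = rng.normal(size=(time_steps, n_features))   # one made-up window

# Randomly initialised (untrained) parameters, for illustration only
W_x = rng.normal(size=(n_features, n_units)) * 0.1     # input -> hidden
W_h = rng.normal(size=(n_units, n_units)) * 0.1        # previous hidden -> hidden
b = np.zeros(n_units)

h = np.zeros(n_units)                                  # internal state starts at zero
for x_t in sequence:                                   # one time step at a time
    # The new state mixes the current reading with the previous state
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)   # (8,), a summary of the whole window
```

The final value of `h` is what a classification layer would receive as its input.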

While RNNs can sometimes struggle to remember long-term dependencies, they’re an excellent starting point for sequential modelling and help us build toward more powerful variants like LSTMs and GRUs, which are designed to retain information across longer spans of time.

### Accelerometer and Gyro mobile phone dataset
This dataset contains motion sensor readings collected from the *accelerometer* and *gyroscope* of a mobile phone, designed to support experiments in *activity recognition*, *sensor data analysis*, and *human motion modelling*.

The data was gathered from a smartphone placed in a user's front pocket while performing a variety of physical activities. These include common everyday movements such as walking, climbing stairs, and standing still. The sensors recorded 3-axis acceleration and angular velocity at a fixed sampling rate, creating time series data that reflects changes in movement and orientation over time.

Each sample is labelled according to the type of activity being performed, making this a labelled, supervised learning dataset suitable for classification tasks. Researchers can use it to explore techniques in time-series preprocessing, feature extraction, and machine learning, particularly in the context of mobile or wearable sensor data.

This dataset is especially relevant for developing and testing models for real-time activity recognition, fall detection for elderly people, fitness tracking, or general-purpose mobile sensing applications.

### Loading the data

In [None]:
import pandas as pd
import zipfile
import requests
import io

# Download the ZIP file
url = "http://www.archive.ics.uci.edu/static/public/755/accelerometer+gyro+mobile+phone+dataset.zip"
r = requests.get(url)
r.raise_for_status()  # Stop early if the download failed

# Unzip the downloaded bytes into the "activity_data" folder
with zipfile.ZipFile(io.BytesIO(r.content)) as zip_ref:
    zip_ref.extractall("activity_data")

# Load the CSV file
df = pd.read_csv("activity_data/accelerometer_gyro_mobile_phone_dataset.csv")

# Preview
df.head()


### Preprocessing

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Remove empty or NaN rows, then reset the index so rows stay aligned below
df = df.dropna().reset_index(drop=True)

# Features and label
features = ['accX', 'accY', 'accZ', 'gyroX', 'gyroY', 'gyroZ']
X = df[features]
Y = df['Activity']

# Scale the feature values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Combine back into dataframe to keep order
# (use .values so labels are assigned by position, not by index)
df_scaled = pd.DataFrame(X_scaled, columns=features)
df_scaled['Activity'] = Y.values


### Resampling

Before feeding in the data, we need to adapt our approach. We need to transform a continuous stream of time-stamped sensor data into fixed-size "chunks" (or windows) of readings. Each sequence becomes an input, and the label of the reading immediately following the sequence becomes the output. This prepares the data for training an RNN model to recognise or predict activities over time:

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

seed = 7  # Set a random seed for reproducibility

# Function to convert raw data into sequences for RNN training
def create_sequences(data, seq_len=50):
    X, y = [], []

    # Loop through the dataset to extract sequences
    for i in range(len(data) - seq_len):
        # Extract a sequence of `seq_len` rows from the selected features
        sequence = data.iloc[i: i + seq_len][features].values

        # The label is the activity immediately following the end of the sequence
        label = data.iloc[i + seq_len]['Activity']

        # Store the sequence and its label
        X.append(sequence)
        y.append(label)

    # Return as NumPy arrays
    return np.array(X), np.array(y)

# Create input sequences and labels from the scaled dataframe
X_seq, Y_seq = create_sequences(df_scaled, seq_len=50)

# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X_seq, Y_seq,
    test_size=0.2,             # 20% of the data used for testing
    random_state=seed,         # Ensure the split is reproducible
    stratify=Y_seq             # Preserve class distribution in the split
)


Let's visualise the sequences we have created, and plot the sensor data for a single sequence (e.g. the first one) to see how the signal evolves over time. This is especially useful in time-series data, where each feature (like `acc_x`, `acc_y`, `acc_z`, etc.) can reveal distinct motion patterns:

In [None]:
import matplotlib.pyplot as plt

# Display names for the first three columns of each sequence (accX, accY, accZ).
# A separate variable is used so the `features` list needed by create_sequences is not overwritten.
plot_features = ['acc_x', 'acc_y', 'acc_z']

# Choose a sample sequence (e.g. the first one)
sample_index = 0

sequence = X_seq[sample_index]

label = Y_seq[sample_index]

# Map numeric label to a name
label_map = {0: "standing", 1: "walking"}

# Plot each feature in the sequence over time
plt.figure(figsize=(10, 4))

for i, feature_name in enumerate(plot_features):
    plt.plot(sequence[:, i], label=feature_name)

# Add vertical gridlines to emphasise time step boundaries between sequences (our chunks)
for t in range(sequence.shape[0]):
    plt.axvline(x=t, color='red', linestyle='--', linewidth=0.2)

# Plot annotations and formatting
plt.title(f"Sensor readings for sequence {sample_index} — Label: {label_map[label]}")

plt.xlabel("Time step")
plt.ylabel("Scaled sensor value")

plt.legend()

plt.tight_layout()

plt.show()


From the plot, you can observe that `acc_y` shows pronounced fluctuations, as it captures vertical movement, the natural up-and-down motion of the body during walking or running. This makes it especially useful for detecting steps or gait patterns.

The `acc_x` axis corresponds to lateral movement, such as swaying or shifting from side to side. This can be informative for identifying balance, stability, or side-stepping movements, which might appear in activities like turning or dancing.

The `acc_z` axis captures forward and backward acceleration, the direction aligned with walking or running speed. It's particularly useful for identifying the intensity or speed of motion, as well as detecting starts and stops.

Together, the three axes form a rich time series representation of full-body motion, and analysing their relative patterns can help distinguish between different types of activities, such as walking, standing, or climbing stairs. Features like frequency, amplitude, and variation across these axes often serve as key inputs to activity recognition models.
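
To illustrate what features like frequency, amplitude, and variation look like in practice, the sketch below computes them for a synthetic signal. The sampling rate of 50 readings per second and the 2 Hz sine wave are assumptions made purely for the example and are not properties of the dataset.

```python
import numpy as np

sampling_rate = 50                           # assumed readings per second (hypothetical)
t = np.arange(100) / sampling_rate           # 2 seconds of made-up data
signal = 1.5 * np.sin(2 * np.pi * 2.0 * t)   # 2 Hz oscillation, loosely resembling steps

amplitude = (signal.max() - signal.min()) / 2   # half the peak-to-peak range
variation = signal.std()                        # spread around the mean

# Dominant frequency: the strongest component of the spectrum after removing the mean
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(len(signal), d=1 / sampling_rate)
dominant_freq = freqs[np.argmax(spectrum)]

print(f"amplitude={amplitude:.2f}, std={variation:.2f}, dominant frequency={dominant_freq:.1f} Hz")
```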

### Model
Our model performs binary sequence classification (for example, distinguishing “standing” from “walking”) by learning temporal dependencies directly from raw time-series inputs. It begins with a single SimpleRNN layer of 64 units, each using a ReLU activation to help the network ignore small fluctuations and concentrate on the most informative patterns across the `timesteps x features` input. As it processes each time step in the sequence, the RNN maintains an internal state that captures information from all preceding steps.

Once the RNN has consumed the entire input sequence, its final hidden state, now a 64-dimensional summary of the observed time series, is fed into a Dense output layer with a sigmoid activation. This produces a probability between 0 and 1 for the positive class (e.g. “walking”).
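
The size of this model can be worked out by hand. The sketch below assumes the six sensor channels selected during preprocessing and the 64 recurrent units described above; the totals can be compared with `model.summary()` after the model is built.

```python
n_features = 6      # accX, accY, accZ, gyroX, gyroY, gyroZ
n_units = 64        # size of the SimpleRNN hidden state

# Each unit has one weight per input feature, one per previous hidden unit, and a bias
rnn_params = n_units * (n_features + n_units + 1)

# The output neuron has one weight per hidden unit, plus a bias
dense_params = n_units * 1 + 1

print("SimpleRNN parameters:", rnn_params)
print("Dense parameters:    ", dense_params)
print("Total:               ", rnn_params + dense_params)
```

Note that the parameter count does not depend on the sequence length, because the same weights are reused at every time step.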

The model is compiled with the Adam optimiser to adjust its weights efficiently, and binary cross-entropy loss to penalise incorrect or over-confident predictions. During training, accuracy on a held-out validation set is monitored each epoch. 

Finally, after training for twenty epochs, the network is evaluated on the test data, yielding an overall test accuracy that reflects how well it generalises to new sequences.


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# Increase epochs for this example.
epochs = 20

# Define a sequential model for binary classification (e.g. standing vs walking)
model = Sequential([
    # Recurrent layer with 64 units and ReLU activation
    # It expects input shape: (timesteps, features)
    SimpleRNN(64,
              activation='relu',
              input_shape=(X_train.shape[1], X_train.shape[2])
              ),

    # Output layer with sigmoid activation for binary classification
    Dense(1, activation='sigmoid')  # Output is between 0 and 1 (binary class)
])

# Compile the model using binary crossentropy loss and accuracy as a metric
model.compile(
    optimizer='adam',                    # Adam optimiser adapts learning rate
    loss='binary_crossentropy',         # Suitable for binary classification
    metrics=['accuracy']                # Monitor classification accuracy
)

# Train the model on the training data
history_RNN = model.fit(
    X_train,                            # Input features
    Y_train,                            # Binary labels (e.g. 0 for standing, 1 for walking)
    epochs=epochs,                      # Number of training epochs
    validation_data=(X_test, Y_test)    # Validate on test set after each epoch
)

# Evaluate the trained model on the test data
loss, acc = model.evaluate(X_test, Y_test)
print(f"Test Accuracy: {acc:.2f}")      # Print test accuracy


Over the twenty epochs, the model achieved very high and stable performance on both the training and validation sets. In the first epoch it already reached around 97.6 % training accuracy with a loss of about 0.09, and validation accuracy of 98.2 % with a loss near 0.06. Although there were a few odd spikes in the training loss, most notably in epochs 3, 4 and 6, the validation metrics remained largely unaffected, suggesting the model was robust to these fluctuations.

By the end of training, the network settled around 98.3 % validation accuracy with a validation loss of approximately 0.046, while final evaluation on the held-out test set gave 98.6 % accuracy and a loss of 0.041. The close alignment of training, validation and test accuracies, along with consistently low losses, indicates that the model both learned the underlying patterns effectively and generalised well to unseen data, which is what we want:


### Plot Train loss and Validation loss

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))

# Plot Training and Validation Loss
plt.subplot(1, 2, 1)

# Plot the training loss (how much error the model made on the training data)
plt.plot(history_RNN.history['loss'], label='Train Loss')

# Plot the validation loss (how much error the model made on unseen validation data)
plt.plot(history_RNN.history['val_loss'], label='Val Loss')

plt.title('Loss over Epochs')

plt.xlabel('Epoch')
plt.ylabel('Loss')

plt.legend()

# Plot Training and Validation Accuracy
plt.subplot(1, 2, 2)

# Plot the training accuracy (how often the model got predictions right on training data)
plt.plot(history_RNN.history['accuracy'], label='Train Accuracy')

# Plot the validation accuracy (how often the model got predictions right on validation data)
plt.plot(history_RNN.history['val_accuracy'], label='Val Accuracy')

plt.title('Accuracy over Epochs')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')

plt.legend()

# Adjust the layout so that the two plots don’t overlap or get squashed
plt.tight_layout()

plt.show()


### Predict
This next part of our code takes a single time-series example from our test set and checks exactly what our binary classifier thinks it represents. We pick out the 6th sample (`sample_index = 5`) and look up its true label (0 or 1) in a small dictionary that maps 0 to "standing" and 1 to "walking". Because the model expects a batch of inputs, we use `np.expand_dims` to add a leading batch dimension, turning our single `(timesteps, features)` sequence into a batch of size one. Passing that through `model.predict` yields a single probability for the "walking" class.
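
The batching step is easy to check in isolation. The sketch below uses an array of zeros as a stand-in for `X_test[sample_index]` and a made-up model output, so the sizes (50 time steps, 3 features) and the value 0.83 are placeholders rather than results from this notebook.

```python
import numpy as np

timesteps, n_features = 50, 3                          # placeholder sizes
sample_sequence = np.zeros((timesteps, n_features))    # stand-in for X_test[sample_index]

# Add a batch axis at the front: (timesteps, features) -> (1, timesteps, features)
input_sequence = np.expand_dims(sample_sequence, axis=0)
print(sample_sequence.shape)   # (50, 3)
print(input_sequence.shape)    # (1, 50, 3)

# Indexing with np.newaxis gives the same result
same = sample_sequence[np.newaxis, ...]
print(same.shape == input_sequence.shape)

# A model with one sigmoid output returns shape (batch, 1);
# this array is a made-up stand-in for model.predict(input_sequence)
fake_output = np.array([[0.83]])
prob = fake_output[0][0]            # first (only) sample, first (only) output
label = int(prob > 0.5)             # 1 if the probability exceeds 0.5, else 0
print(prob, label)
```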

We then compare that probability against a threshold of 0.5 to decide on a final predicted label. Finally, we print out the true activity, the model’s guess in words, and the exact probability it assigned so we can judge not just whether it was right or wrong, but also how confident it was:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Choose a sample from the test set
sample_index = 5

# Map numeric label to a name
label_map = {0: "standing", 1: "walking"}

# Extract the sequence and true label
sample_sequence = X_test[sample_index]

true_label = int(Y_test[sample_index])

# Reshape to match model input shape (batch size = 1)
input_sequence = np.expand_dims(sample_sequence, axis=0)

# Predict probability of class 1 (e.g., walking)
prediction_prob = model.predict(input_sequence)[0][0]

# Convert to binary prediction
predicted_label = int(prediction_prob > 0.5)

# Map numeric labels to names
true_label_name = label_map[true_label]
predicted_label_name = label_map[predicted_label]

# Display result
print(f"True Label: {true_label_name}")
print(f"Predicted Label: {predicted_label_name}")
print(f"Predicted Probability: {prediction_prob:.4f}")

This next part of our code visualises the raw sensor signals for a single example alongside the model’s verdict. It loops over the three accelerometer axes: `acc_x`, `acc_y` and `acc_z`, and plots each as a time-series on the same figure, so we can see how the motion evolves step by step. 

The chart title reports the true activity (standing or walking), the model’s predicted label, and the confidence (the probability it assigned to "walking"). The axes are labelled "Time step" (x-axis) and "Scaled sensor value" (y-axis), so we can view the input data alongside how our classifier interpreted it:

In [None]:
# Plot the sequence of sensor readings (e.g. from accelerometer) and show the predicted label
features = ['acc_x', 'acc_y', 'acc_z']  # Names of the 3 input features

plt.figure(figsize=(10, 4))

# Loop through each feature and plot it over time
for i, feature_name in enumerate(features):
    plt.plot(sample_sequence[:, i], label=feature_name)  # Plot i-th feature (e.g. acc_x)

plt.title(f"Sequence {sample_index} — True: {true_label_name} | Predicted: {predicted_label_name} ({prediction_prob:.2f})")

plt.xlabel("Time step")                   # Each point represents one time step
plt.ylabel("Scaled sensor value")         # Sensor values are usually normalised

plt.legend()

# Ensure layout elements don't overlap
plt.tight_layout()

plt.show()


### Long Short-Term Memory (LSTM)

An *LSTM*, or Long Short-Term Memory network, is a special kind of *Recurrent Neural Network (RNN)* designed to remember information over longer sequences. A basic RNN processes inputs one step at a time (for example, reading a sentence word by word, or a signal one value at a time), but it tends to "forget" earlier inputs as the sequence gets longer. This happens because the gradients used to train the network shrink as they are propagated back through many time steps, an effect known as the *vanishing gradient problem*, and it's a key limitation of simple RNNs.

LSTMs solve this problem by introducing a more sophisticated internal structure made up of *gates*:
- The *input gate* controls how much new information should be stored.
- The *forget gate* decides what information to discard from memory.
- The *output gate* controls what gets passed to the next layer or time step.

These gates help the LSTM keep useful information for a much longer time, making it especially powerful for tasks where understanding *context over time* really matters, such as language modelling, machine translation, speech recognition, or time series forecasting.
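
The three gates can be sketched as a single LSTM time step in NumPy. This follows the standard textbook formulation with random, untrained weights; the function `lstm_step`, the weight layout (four blocks stacked side by side) and the tiny sizes are assumptions made for readability, not a description of how Keras stores its parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (features, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b                 # all four pre-activations at once
    i = sigmoid(z[0 * units:1 * units])          # input gate: how much new info to store
    f = sigmoid(z[1 * units:2 * units])          # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])          # candidate values to write to memory
    o = sigmoid(z[3 * units:4 * units])          # output gate: how much memory to expose
    c_t = f * c_prev + i * g                     # updated cell state (long-term memory)
    h_t = o * np.tanh(c_t)                       # updated hidden state (output)
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, units, timesteps = 1, 4, 10
W = rng.normal(scale=0.5, size=(n_features, 4 * units))
U = rng.normal(scale=0.5, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(timesteps, n_features))   # random stand-in for one input window
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)   # (4,) (4,)
```

The cell state `c` is updated additively (`f * c_prev + i * g`), which is what lets information and gradients survive across many steps more easily than in a simple RNN.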

In short, moving from an RNN to an LSTM gives us a more powerful, more memory-aware version of the same idea, without needing to change the input format at all.

### Air passenger dataset
The *Airline Passengers* dataset is a widely used and well-known time series dataset that records the monthly number of international airline passengers from January 1949 to December 1960. It is often used in teaching, research, and practical demonstrations of time series forecasting models, due to its clear structure and interesting temporal patterns.

The dataset consists of two columns: the first is the *month*, formatted as `YYYY-MM`, and the second is the *number of passengers* (in thousands) recorded during that month. With 144 data points in total, it provides just enough historical context to explore long-term trends while being small enough to process efficiently in most environments.

One of the most striking characteristics of the data is the presence of both a *trend* and *seasonality*. Over the years, the number of passengers steadily increases, reflecting the growth of air travel in the post-war period. At the same time, there are consistent seasonal peaks and troughs, typically with higher passenger numbers during mid-year months, making the dataset ideal for illustrating techniques that detect or model seasonal behaviour.

Because of its structure and historical nature, the Airline Passengers dataset is frequently used to introduce models like *ARIMA*, *exponential smoothing*, and *LSTM neural networks* such as the one we build here. It provides a clean and interpretable context for learning about data preparation (e.g. lag features, windowing), visualisation, and forecasting performance evaluation.

### Loading the data

In [None]:
import pandas as pd

# Load monthly airline passenger totals (1949–1960) from a public dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv'

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(url)

# Display the first few rows of the dataset
print(df.head())


### Preprocessing
We use `MinMaxScaler` to scale the data to a range between 0 and 1. This is particularly important for neural networks such as LSTMs, which are sensitive to the scale of input data. Without scaling, large input values could cause unstable learning, or very slow convergence due to imbalanced gradient updates (more on this later).
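
With its default settings, min-max scaling subtracts the smallest value and divides by the range. The sketch below reproduces that arithmetic by hand on four made-up numbers (they are not taken from the passenger data), and shows that the transformation can be undone exactly.

```python
import numpy as np

values = np.array([100.0, 150.0, 200.0, 300.0])   # made-up example values

lo, hi = values.min(), values.max()
scaled = (values - lo) / (hi - lo)                # maps lo -> 0 and hi -> 1
restored = scaled * (hi - lo) + lo                # the inverse transformation

print(scaled)      # values 0.0, 0.25, 0.5, 1.0
print(restored)    # back to the original numbers
```

The inverse step matters later: predictions made on the scaled data have to be mapped back before they can be read as passenger counts.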

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Convert column to float and reshape
data = df['Passengers'].values.astype(float).reshape(-1, 1)

# Normalize to [0, 1] range
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)


The `create_sequences` function is designed to transform a continuous time series into a structured format suitable for training machine learning models, particularly those that handle sequential data, like Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs). Time series data on its own is just a long sequence of values, but to use it for prediction, we need to frame it as a supervised learning problem: using past values to predict future ones. This function accomplishes exactly that.

The function implements a sliding window over the data. For each position in the sequence, it takes a fixed number of consecutive time steps (for example, the past 10 months) and stores them as one input sample (`X`). The value that comes immediately after that window becomes the target output (`y`) for that sample. This way, the model learns from each short history of past values and is trained to predict what comes next. This approach is known as *sequence-to-one prediction*, where a window of time steps is mapped to a single future value.

Using this format, we can generate many training examples from a relatively short time series: with 144 monthly values and a window of 10, we obtain 134 overlapping sequences. These overlapping sequences capture local patterns in the data, such as seasonal trends or repeating shapes in the curve. This is especially important for neural networks, which learn by recognising patterns across samples. Without this kind of preprocessing, a model would have no context about the structure of the time series and would be unable to make informed predictions.

Ultimately, this preparation step makes time series forecasting possible with deep learning models. It reshapes the raw time series into a set of training inputs (`X`) and targets (`y`), where each input is a 2D array of shape `(sequence_length, 1)` and each target is the next time step. This structure is exactly what RNNs and LSTMs are designed to work with, allowing them to learn from the temporal dynamics in the data.

We define `create_sequences` below. It takes the data and a window length, and returns the input sequences together with their target values:

In [None]:
import numpy as np

# Function to transform a time series (shape (n,) or (n, 1)) into overlapping
# sequences and corresponding next-step targets for sequence prediction tasks
def create_sequences(data, seq_length=10):
    X, y = [], []  # Initialise lists to hold input sequences and labels
    
    # Loop over the data, stopping seq_length elements before the end
    for i in range(len(data) - seq_length):

        # Extract a window of length seq_length as the input sequence
        X.append(data[i:i + seq_length])

        # The label is the item immediately following the input window
        y.append(data[i + seq_length])
    
    # Convert lists to NumPy arrays for compatibility with ML frameworks
    return np.array(X), np.array(y)



In [None]:
X, Y = create_sequences(data_scaled, seq_length=10)
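
A toy series makes the windowing easier to follow than the real data. The sketch below repeats the function only so that the block runs on its own, and applies it to the numbers 0 to 5 with a window of 3; the names `toy`, `X_toy` and `y_toy` are illustrative.

```python
import numpy as np

def create_sequences(data, seq_length=10):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i + seq_length])      # window of seq_length values
        y.append(data[i + seq_length])        # the value right after the window
    return np.array(X), np.array(y)

toy = np.arange(6, dtype=float).reshape(-1, 1)   # column vector: 0, 1, 2, 3, 4, 5
X_toy, y_toy = create_sequences(toy, seq_length=3)

print(X_toy.shape, y_toy.shape)   # (3, 3, 1) (3, 1)
print(X_toy[:, :, 0])             # windows: [0, 1, 2], [1, 2, 3], [2, 3, 4]
print(y_toy[:, 0])                # targets: 3, 4, 5
```

Six values with a window of three give three samples, matching the general rule of `len(data) - seq_length` samples. The trailing dimension of size 1 is the single feature that the LSTM input layer expects.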

### Resampling
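As always, we start by splitting our data into train and test sets. Because this is a time series, we keep the chronological order: the first 80% of the sequences are used for training and the final 20% for testing (107 and 27 of the 134 sequences). A simple slice does this, so for demonstration we do not import a helper such as `train_test_split` from scikit-learn, which shuffles the rows by default: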

In [None]:
train_size = int(len(X) * 0.8)

X_train, X_test = X[:train_size], X[train_size:]
Y_train, Y_test = Y[:train_size], Y[train_size:]

### Model
This model is built to forecast the next value in a univariate time-series (monthly passenger counts) by learning temporal patterns from fixed-length input windows. It begins with an input layer that expects sequences of shape `(timesteps, features)`, in this case ten months of data with one feature per month.

The core layer is a single LSTM with 64 units and a ReLU activation (used here in place of the Keras default, `tanh`). The intention is for the layer to focus on meaningful trends and discount small fluctuations that look like noise, while its gated memory retains information from earlier time-steps.

After processing the entire ten-step sequence the LSTM produces a 64-dimensional summary of recent behaviour, which is then passed to a dense output layer with one neuron to generate the forecast for month eleven. 

The model is trained using the Adam optimiser to minimise mean squared error, running for fifty epochs with batches of sixteen sequences and validating on a held-out test set to monitor generalisation:


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

# Increase epochs for this example.
epochs = 50

# Create a Sequential model, which lets us stack layers one after another
model = Sequential([
    # Input layer specifying the shape of each training example
    # Each example is a sequence of time steps (10 months), with 1 feature (passenger count)
    Input(shape=(X_train.shape[1], X_train.shape[2])),

    # LSTM layer with 64 units (or "neurons"), designed to learn patterns across time
    # LSTMs are good at remembering information from earlier in the sequence
    # activation='relu' helps the network focus on useful signals and ignore noise
    LSTM(64, activation='relu'),

    # Output layer with a single neuron for predicting the next value in the sequence
    # No activation function here since we're predicting a number (regression task)
    Dense(1)
])

# Compile the model: specify how it should learn
# 'adam' is an optimiser that helps the model improve efficiently
# 'mse' (mean squared error) is the loss function, measuring how far off the predictions are
model.compile(
    optimizer='adam', 
    loss='mse', 
    metrics=['mae']                # Monitor MAE
)

# Train the model using the training data
# epochs = how many times the model sees the full dataset
# batch_size = how many sequences it looks at before updating its learning
# validation_data = a separate set used to monitor how well it's generalising
history_LSTM = model.fit(
    X_train, Y_train,
    epochs=epochs,
    batch_size=16,
    validation_data=(X_test, Y_test)
)


Over the fifty epochs, the model’s mean absolute error on the training set fell dramatically from about 0.27 in epoch 1 to roughly 0.06 by epoch 50 (measured on the scaled 0–1 data), showing that it was learning to predict the next value closely.

On the validation set the MAE started around 0.52, plummeted to about 0.12 by epoch 4, and then continued a gentle downward trend, reaching approximately 0.11 by the final epoch. Because the training and validation MAE curves both drop in step and remain close throughout, there’s no sign of serious over-fitting, and the model achieves a low validation error of around 0.11. 

In practice you might stop once the validation MAE bottoms out, around epoch 40 or so, to save time, but overall the network converges cleanly to accurate, generalisable predictions:
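
The stopping rule mentioned above can be written down in a few lines. The sketch below is a plain-Python illustration of the logic, with a hypothetical helper `early_stopping_epoch` and made-up validation errors; in a real training run, Keras offers an `EarlyStopping` callback (with parameters such as `monitor`, `patience` and `restore_best_weights`) that applies the same idea during `model.fit`.

```python
def early_stopping_epoch(val_errors, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation error has failed to improve
    by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best - min_delta:
            best, best_epoch, wait = err, epoch, 0   # new best: reset the counter
        else:
            wait += 1                                # no improvement this epoch
            if wait >= patience:
                return epoch, best_epoch
    return len(val_errors), best_epoch               # never triggered: ran to the end

# Made-up validation MAE values, for illustration only
val_mae = [0.52, 0.30, 0.18, 0.12, 0.13, 0.12, 0.125, 0.13, 0.14]
stop_epoch, best_epoch = early_stopping_epoch(val_mae, patience=3)
print(stop_epoch, best_epoch)   # 7 4
```

A larger `patience` tolerates noisier validation curves at the cost of a few extra epochs.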


### Plot Train loss and Validation loss

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))

# Plot Training and Validation Loss
plt.subplot(1, 2, 1)

# Plot the training loss (how much error the model made on the training data)
plt.plot(history_LSTM.history['loss'], label='Train Loss')

# Plot the validation loss (how much error the model made on unseen validation data)
plt.plot(history_LSTM.history['val_loss'], label='Val Loss')

plt.title('Loss over Epochs')

plt.xlabel('Epoch')
plt.ylabel('Loss')

plt.legend()

# Plot Training and Validation MAE
plt.subplot(1, 2, 2)

# Plot the training MAE (average absolute error on the training data)
plt.plot(history_LSTM.history['mae'], label='Train MAE')

# Plot the validation MAE (average absolute error on the validation data)
plt.plot(history_LSTM.history['val_mae'], label='Val MAE')

plt.title('MAE over Epochs')

plt.xlabel('Epoch')
plt.ylabel('MAE')

plt.legend()

# Adjust the layout so that the two plots don’t overlap or get squashed
plt.tight_layout()

plt.show()


### Predict
Here we take the trained LSTM model and use it to forecast passenger counts on our test sequences. First, `model.predict(X_test)` generates a sequence of scaled predictions. Because the model was trained on normalised data, we then apply `scaler.inverse_transform` to both the predictions and the true test targets (`Y_test`) to convert them back into the original passenger‐count units. 

Finally, we plot both the actual passenger numbers and the model’s forecasts on the same axes labelled "True Values" and "Predictions" with time steps along the x‐axis and passenger counts on the y‐axis. 

The resulting line chart gives you a clear visual comparison of how closely the model’s predictions track the real data over time:

In [None]:
# Predict
y_pred = model.predict(X_test)

# Inverse scale predictions
y_pred_inv = scaler.inverse_transform(y_pred)
y_test_inv = scaler.inverse_transform(Y_test)

# Plot predictions vs actual
plt.plot(y_test_inv, label='True Values')
plt.plot(y_pred_inv, label='Predictions')

plt.title("Airline Passenger Forecasting")

plt.xlabel("Time Step")
plt.ylabel("Number of Passengers")

plt.legend()

plt.show()
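
A plot shows the overall fit, but it can help to put a number on the error in the original units as well. The sketch below computes MAE and RMSE for arrays shaped like `y_test_inv` and `y_pred_inv`; the helper `forecast_errors` and the three demo values are invented for illustration and are not results from the model above.

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Mean absolute error and root mean squared error, in the units of the inputs."""
    y_true = np.asarray(y_true, dtype=float).ravel()   # flatten (n, 1) -> (n,)
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    err = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }

# Made-up values with the same (n, 1) shape as the inverse-transformed arrays
y_true_demo = np.array([[400.0], [450.0], [500.0]])
y_pred_demo = np.array([[410.0], [430.0], [500.0]])

errors = forecast_errors(y_true_demo, y_pred_demo)
print(f"MAE:  {errors['mae']:.2f}")    # MAE:  10.00
print(f"RMSE: {errors['rmse']:.2f}")   # RMSE: 12.91
```

Calling `forecast_errors(y_test_inv, y_pred_inv)` after the cell above would report the error in passengers rather than on the scaled 0–1 range.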


### Summarising models
Once we’ve trained multiple models, such as the Perceptron, Feedforward Neural Network (FNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), it’s important to go back and compare how well they perform. One of the most common ways to do this is by plotting *validation accuracy* over time, as we have seen (for a regression model such as our LSTM, a validation error like MAE plays the same role).

Validation accuracy tells us how well the model performs on data it hasn’t seen during training. Plotting this over each *epoch* (a complete pass through the training data) allows us to see whether the model is improving, plateauing, or even starting to overfit.

Visual comparison helps us answer questions like:
- Which model learns the fastest?
- Which one achieves the highest accuracy?
- Do any models overfit (perform well on training but poorly on validation)?
- Is the extra complexity (e.g. using an LSTM) actually leading to better results?

In short, we get a clearer picture of how each model is learning, which helps us choose the best one for the task.
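
Alongside the plots, a compact numeric summary can help answer the questions above. The sketch below assumes each model's validation curve is available as a plain list (for a Keras model, something like `history_RNN.history['val_accuracy']`); the helper `summarise_history` and the two example curves are placeholders, not results from this notebook.

```python
def summarise_history(name, val_metric, higher_is_better=True):
    """Report the best value, the epoch it occurred at (1-indexed), and the final value."""
    pick = max if higher_is_better else min
    best_value = pick(val_metric)
    best_epoch = val_metric.index(best_value) + 1    # first epoch reaching the best value
    return {
        "model": name,
        "best": best_value,
        "best_epoch": best_epoch,
        "final": val_metric[-1],
    }

# Placeholder validation-accuracy curves, for illustration only
histories = {
    "Model A": [0.90, 0.93, 0.95, 0.94],
    "Model B": [0.85, 0.92, 0.96, 0.97],
}

summaries = [summarise_history(name, curve) for name, curve in histories.items()]
for row in summaries:
    print(row)
```

For an error metric such as MAE, pass `higher_is_better=False` so that the lowest value is treated as the best. Comparing such summaries is only meaningful for models trained on the same task and evaluated with the same metric.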

### What have we learnt?
We explored the foundations of deep learning by training and comparing four different types of neural networks: the *Perceptron*, *Feedforward Neural Network (FNN)*, *Recurrent Neural Network (RNN)*, and *Long Short-Term Memory (LSTM)* model. We used a variety of different datasets and chose the most appropriate model to suit the data and the task.

We also learned how to evaluate the performance of these models. The main metric we focused on was *accuracy*, which tells us what percentage of the predictions were correct; for the LSTM forecasting task we used mean absolute error (MAE) instead.

By now, you should have a clear understanding of how different neural network architectures behave, how to evaluate them properly, and how to swap in different types of data to experiment further. This sets the foundation for building more advanced and task-specific deep learning models.

Try changing the architecture, playing with different types of input data, or tuning model parameters and observe how these changes affect your results. This kind of hands-on experimentation is the best way to deepen your understanding of machine learning.