## Resampling Techniques

### Introduction
Imagine you're baking a cake and want to be sure the recipe works every time—not just by luck. You'd probably test it repeatedly, perhaps slightly changing ingredients or oven temperatures, to check its reliability.

Resampling techniques in machine learning work a bit like that. They help us understand how well our models might perform by repeatedly testing them on different subsets of the same data. By shuffling or reusing the data we already have, we can gain confidence in our results even when no extra data is available.

Our goal is to estimate how well the model will perform on data it has not seen. Unseen data may be new data that arrives once the model goes into production, or additional historic data that we receive later.

Essentially, resampling helps us to know whether our machine learning approach is truly reliable or just a series of lucky guesses.

We will explore how resampling techniques can be applied effectively across three distinct types of data.
First, we examine *numeric data* using the Pima Indians Diabetes dataset, illustrating how resampling helps improve predictions about diabetes risk.

Next, we look at *language data*, specifically using sentiment analysis—a method of identifying emotions or opinions from written language—to show how resampling enhances understanding of text-based information.

Finally, we apply these techniques to *image data* with the well-known Cats vs. Dogs dataset, highlighting how resampling assists machine learning models in accurately recognising visual patterns.

This varied approach showcases the versatility and practicality of resampling across different types of real-world data.


### Types of resampling

For each type of data we apply four widely used resampling methods:

- *Train/Test Split*:
We divide the dataset into two parts: one set for training the model, and the other set to test how well it predicts new, unseen data.

- *k-Fold Cross-Validation*:
Here, we split the data into $k$ smaller subsets (or folds). Each fold takes turns being the test set, while the remaining folds train the model. This way, every data point is tested exactly once, giving a balanced evaluation.

- *Leave-One-Out Cross-Validation (LOOCV)*:
This method is a special case of cross-validation, where each data point gets its own turn to be the test set. The model trains on all other points, ensuring thorough evaluation, especially useful with smaller datasets.

- *Repeated Random Test-Train Splits (Shuffle Split)*:
Instead of splitting data once, we randomly split it many times. Each split results in a slightly different train and test set. This repeated shuffling helps us understand how stable our model's performance is.
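
As a rough orientation, the four methods differ mainly in how many train/test rounds they produce. The sketch below uses a toy array of 10 samples (an assumption for illustration, not one of the datasets used later) and asks scikit-learn's splitters how many rounds each would run:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, train_test_split

# Toy data: 10 samples with one feature each (illustrative only)
X_toy = np.arange(10).reshape(-1, 1)

# Train/test split: a single partition of the data
train_part, test_part = train_test_split(X_toy, test_size=0.3, random_state=0)

# Each splitter reports how many train/test rounds it will produce
n_kfold = KFold(n_splits=5).get_n_splits(X_toy)
n_loo = LeaveOneOut().get_n_splits(X_toy)
n_shuffle = ShuffleSplit(n_splits=4, test_size=0.3, random_state=0).get_n_splits(X_toy)

print(f"Train/test split: 1 round ({len(train_part)} train, {len(test_part)} test)")
print(f"k-Fold (k=5): {n_kfold} rounds")
print(f"Leave-One-Out: {n_loo} rounds")
print(f"Shuffle split: {n_shuffle} rounds")
```

Leave-One-Out always runs as many rounds as there are samples, whereas the number of rounds for k-Fold and shuffle split is chosen by us.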

We begin by applying these resampling techniques to a structured numeric dataset: the Pima Indians Diabetes dataset.

### Install Python libraries

In [None]:
!pip install pandas scikit-learn opencv-python matplotlib nltk seaborn

### Download the data

In [None]:
import urllib.request

url = 'https://raw.githubusercontent.com/martyn-harris-bbk/AppliedMachineLearning/main/data/pima-indians-diabetes.data.csv'
filename = 'pima-indians-diabetes.data.csv'

urllib.request.urlretrieve(url, filename)
print("Download complete.")

### Load the data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

filename = 'pima-indians-diabetes.data.csv'

header = [
    'Pregnancy_Count',
    'Glucose_conc',
    'Blood_pressure',
    'Skin_thickness',
    'Insulin',
    'BMI',
    'DPF',
    'Age',
    'Class'
]

data = pd.read_csv(filename, names=header)

print(data.shape)

data.head()

## Machine Learning model
To evaluate the effectiveness of various resampling techniques, we use a Logistic Regression model.

Logistic regression is particularly well-suited for classification problems where the target variable has two categories, such as determining if someone has diabetes or not, classifying sentiment as positive or negative, or distinguishing between images of cats and dogs.

We specifically choose the `'liblinear'` solver because it's efficient, stable, and performs well with smaller to medium-sized datasets, ensuring quick and reliable results during our comparisons.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

We start by clearly identifying two components in our dataset: the **input features (X)** and the **target variable (Y)**. The features (X) are the information we use to make predictions, while the target variable (Y) is the outcome or label we want our model to predict. Separating X and Y is essential for training and evaluating models effectively:

In [None]:
X = data.iloc[:, :-1].values
Y = data.iloc[:, -1].values
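
To make the slicing concrete, here is a minimal sketch on a hypothetical three-row table (the column names `feature_a` and `feature_b` are invented for illustration). It shows that `iloc[:, :-1]` keeps every column except the last, and `iloc[:, -1]` keeps only the last:

```python
import pandas as pd

# Hypothetical miniature table: two feature columns followed by a label column
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "Class": [0, 1, 0],
})

X_small = toy.iloc[:, :-1].values  # every column except the last
Y_small = toy.iloc[:, -1].values   # only the last column

print(X_small.shape)  # (3, 2)
print(Y_small.shape)  # (3,)
```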

### Train/Test Split  
Train/test split is one of the simplest and most commonly used techniques in machine learning. The idea is to divide the full dataset into two parts:
- The *training set* is used to teach the model by showing it patterns and relationships in the data.
- The *test set* is used to evaluate how well the model performs on new, unseen data — simulating how it might work in the real world.

#### Advantages:
- *Simplicity and speed*: It is very easy to use and quick to run, making it ideal for early testing.  
- *Clear evaluation*: It gives a straightforward way to check how the model performs on independent data.  
- *Low computational cost*: It only requires one round of training and testing, which keeps things efficient.

#### Disadvantages:
- *Risk of bias*: The model’s performance can depend heavily on how the data was split — one lucky or unlucky test set can affect the result.  
- *Less reliable with small datasets*: If you have limited data, this method might not give an accurate picture of performance.  
- *Limited use of data*: Only part of the data is used for training, so potentially useful information might be left out.

Overall, the train/test split approach is useful for quick experiments and early insights when building a model. However, it should not be the only evaluation method you rely on, especially if your dataset is small or if high accuracy is critical.
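
The dependence on a single split can be seen by repeating the split with different random seeds. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in (an assumption, not the diabetes data) and the names `X_demo`, `Y_demo`, and `demo_model` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
X_demo, Y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

accuracies = []
for split_seed in range(5):
    # Same data and model each time; only the random split changes
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X_demo, Y_demo, test_size=0.33, random_state=split_seed
    )
    demo_model = LogisticRegression(solver='liblinear')
    demo_model.fit(X_tr, Y_tr)
    accuracies.append(demo_model.score(X_te, Y_te))

for split_seed, acc in enumerate(accuracies):
    print(f'seed={split_seed}: accuracy={acc * 100:.1f}%')

spread = (max(accuracies) - min(accuracies)) * 100
print(f'Spread (max - min): {spread:.1f} percentage points')
```

Any spread between the runs comes purely from which rows happened to land in the test set, which is the motivation for the repeated methods that follow.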

In the example below, we divide the dataset `(X, Y)` into two parts:
- *67%* for training the model (`X_train`, `Y_train`)  
- *33%* for testing the model's accuracy (`X_test`, `Y_test`)

In [None]:
from sklearn.model_selection import train_test_split  # For splitting data into training and test sets
from sklearn.linear_model import LogisticRegression  # Logistic regression classifier

# Create a logistic regression model
# 'liblinear' solver is efficient for small datasets and supports L1 regularisation
model = LogisticRegression(solver='liblinear')

# Set a random seed for reproducibility
seed = 7

# Split the dataset into training (67%) and test (33%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)

# Train (fit) the logistic regression model using the training data
model.fit(X_train, Y_train)

# Evaluate the model's accuracy on the test set and print it as a percentage with 3 decimal places
print(f'Accuracy: {model.score(X_test, Y_test) * 100:.3f}%')


### K-Fold Cross-Validation  
This method splits our dataset into smaller groups called *“folds”*, randomly shuffling the data first to ensure fairness and variety in each group.

#### Advantages:
- *Reduced bias*: Every data point is used for both training and testing, which gives a more balanced and fair assessment of model performance.  
- *Efficient use of data*: The entire dataset is fully used, so even with smaller datasets, we get the most out of the available data.  
- *Stability and consistency*: Since the model is tested multiple times, we can average the results to get a more reliable and consistent estimate of how well it performs.

#### Disadvantages:
- *Increased computational cost*: Because the model is trained and tested multiple times (once for each fold), the process takes longer than a single train/test split.  
- *Risk of variance with small datasets*: If the dataset is very small or unevenly balanced, some test folds might give misleading results.  
- *Sensitive to choice of "k"*: Picking the right number of folds is important—too few can lead to inaccurate results, while too many can make the process unnecessarily slow.

Overall, k-Fold Cross-Validation usually provides a more reliable and thorough assessment of a model’s performance than a single train/test split. It is especially useful when working with limited data.

We apply k-Fold Cross-Validation using `KFold(n_splits=10, shuffle=True)`. This means the model is trained and tested ten times, each time using a different fold as the test set and the remaining nine as the training set.

Finally, we calculate the average accuracy and look at how much the results vary across the 10 runs. This gives us a clearer picture of how well the model is likely to perform on new data:

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

# k-Fold Cross-Validation
kfold = KFold(n_splits=10, random_state=7, shuffle=True)

results = cross_val_score(model, X, Y, cv=kfold)

print(f'k-Fold Accuracy: {results.mean()*100:.3f}% ({results.std()*100:.3f}%)')
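
For readers curious about what `cross_val_score` does with the folds, the loop below is a rough hand-written equivalent. It is a sketch on synthetic stand-in data (an assumption), with hypothetical names such as `demo_kfold` and `manual_scores`, and it fits a fresh copy of the model on each training portion before scoring it on the held-out fold:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration)
X_demo, Y_demo = make_classification(n_samples=100, n_features=8, random_state=0)
demo_model = LogisticRegression(solver='liblinear')
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=7)

manual_scores = []
for train_idx, test_idx in demo_kfold.split(X_demo):
    fold_model = clone(demo_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], Y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], Y_demo[test_idx]))

# The helper function performs the same steps in one call
helper_scores = cross_val_score(demo_model, X_demo, Y_demo, cv=demo_kfold)

print(np.round(manual_scores, 3))
print(np.round(helper_scores, 3))
```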

<div style="border: 2px solid silver; border-radius: 5px; background-color: transparent;padding:10px;width:95%;margin: 10px;">
  <strong> Standard Deviation in Cross-Validation</strong>  
<br><br>  
The standard deviation in cross-validation helps us understand how much the model's accuracy varies from one fold to another. It is a measure of <em>consistency</em> in performance.

What it tells us:
- <em>Low standard deviation</em>: <br>
The model performs similarly across all folds. This suggests stable, reliable predictions regardless of which part of the data is used — a sign of a trustworthy model.<br><br>  
- <em>High standard deviation</em>:<br> 
The model's performance changes a lot across folds. It may do well on some subsets but poorly on others. This suggests the model might be sensitive to specific data or possibly overfitting.

In short, standard deviation gives us an idea of how dependable the model’s reported accuracy really is. A small value is a good sign; a large value means we should investigate further.
</div>
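
A small numerical sketch may help here. The two sets of fold accuracies below are invented purely for illustration (they are not results from any model in this notebook); they share the same mean but differ in how much they vary:

```python
import numpy as np

# Hypothetical fold accuracies, invented purely to illustrate the calculation
stable_scores = np.array([0.76, 0.77, 0.75, 0.76, 0.78])
unstable_scores = np.array([0.60, 0.85, 0.70, 0.90, 0.77])

for name, fold_scores in [('stable', stable_scores), ('unstable', unstable_scores)]:
    print(f'{name}: mean={fold_scores.mean() * 100:.1f}%, std={fold_scores.std() * 100:.1f}%')
```

Both sets average the same accuracy, yet the second has a much larger standard deviation, so its headline accuracy deserves less trust.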

### Leave-One-Out Cross-Validation (LOO-CV)
Leave-One-Out Cross-Validation (LOO-CV) is a technique used to evaluate the performance of a machine learning model by training and testing it multiple times on slightly different subsets of data. In this method, for a dataset with *n* data points, the model is trained on *n-1* points while leaving out one data point for testing.

This process repeats *n* times, with each data point getting a turn as the test set exactly once. The final model performance is typically averaged over all iterations to get an estimate of how well the model generalises to new data.

#### Advantages:
- *Maximises training data utilisation*: Since LOO-CV uses *n-1* samples for training and only *1* for testing, it ensures that the model is trained on nearly all available data, which can be beneficial for small datasets.

- *Unbiased estimate of generalisation error*: As each sample is tested independently, LOO-CV provides a nearly unbiased estimate of how the model will perform on unseen data.

- *No need for Train-Test Split decisions*: Unlike standard train-test splits, LOO-CV systematically evaluates the model on every data point, reducing the variance introduced by arbitrary split choices such as those made in a regular train-test split.

- *Good for small datasets*: When your dataset is very limited, removing one data point at a time ensures that the model still gets trained on almost all available data. In short, LOO-CV is valuable in scenarios where every data point is crucial, such as medical predictions or rare-event modeling.

#### Disadvantages:
- *Extremely computationally expensive*: LOO-CV requires *n* separate model training runs, where *n* is the total number of samples. If the dataset is large (e.g., thousands of images), this can be highly inefficient and time-consuming, especially for complex models like deep learning networks.

- *High variance in evaluation scores*: As each test set consists of just one sample, the error estimate has high variance. If a sample is difficult to classify, it may significantly impact the overall performance estimate.

- *Not ideal for model selection*: Unlike K-Fold Cross-Validation, which smooths out fluctuations, LOO-CV is too sensitive to individual samples. This makes it less reliable for selecting hyperparameters, as small variations can lead to misleading conclusions.

- *May not reflect real-world generalisation*: In practice, models are tested on larger test sets, whereas LOO-CV only tests on a single instance at a time.
This does not always mimic real-world performance, where the test set is typically larger and provides a more stable estimate.

- *Poor performance with noisy data*: If the dataset contains noise (e.g., mislabeled images), LOO-CV may overemphasise these outliers, leading to unreliable performance metrics. A single misclassified noisy sample can significantly skew the results.

- *Risk of overfitting*: Since the model is trained on almost the entire dataset, there is a higher risk that it will become too specialised to the training data.
This can lead to overfitting, where the model performs well during validation but fails to generalise to completely unseen data.

In [None]:
from sklearn.model_selection import LeaveOneOut  # Leave-One-Out Cross-Validation tool
from sklearn.metrics import accuracy_score  # To measure model accuracy
from sklearn.linear_model import LogisticRegression  # Logistic regression classifier

# Create a logistic regression model
model = LogisticRegression(solver='liblinear')

# Set up Leave-One-Out Cross-Validation
# This technique uses one data point as the test set and the rest for training,
# repeating this process once for every point in the dataset
loo = LeaveOneOut()

# Lists to store actual labels and predicted labels from each fold
y_true = []
y_predicted = []

# Perform Leave-One-Out Cross-Validation
for train_index, test_index in loo.split(X):
    # Split the data into training and test sets for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]

    # Train the logistic regression model on the training data
    model.fit(X_train, y_train)

    # Predict the label for the one test instance
    y_pred = model.predict(X_test)

    # Store the true label and the predicted label
    y_true.append(y_test[0])
    y_predicted.append(y_pred[0])

# Calculate and display overall accuracy
accuracy = accuracy_score(y_true, y_predicted)
print(f'LOO-CV Accuracy: {accuracy:.4f}')
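
The explicit loop above can, as far as the accuracy figure is concerned, be condensed by passing `LeaveOneOut()` to `cross_val_score`. The sketch below assumes synthetic stand-in data and hypothetical names (`X_demo`, `Y_demo`, `loo_scores`); each per-round score is 0 or 1 because every test set holds a single sample, so the mean of the scores is the overall accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (an assumption for illustration)
X_demo, Y_demo = make_classification(n_samples=40, n_features=5, random_state=0)

# One round per sample; each score is 1.0 (correct) or 0.0 (incorrect)
loo_scores = cross_val_score(
    LogisticRegression(solver='liblinear'), X_demo, Y_demo, cv=LeaveOneOut()
)

print(f'Number of rounds: {len(loo_scores)}')
print(f'LOO-CV Accuracy: {loo_scores.mean():.4f}')
```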


### Repeated random Test-Train Splits
Repeated Random Test-Train Splits, also known as Monte Carlo Cross-Validation, is a method used to assess the performance of a machine learning model. Unlike k-Fold Cross-Validation, where the dataset is split into a fixed number of folds, this method involves randomly dividing the dataset into training and test sets multiple times. Each split is independent, meaning a data point can appear in both training and test sets across different iterations. The model is trained and tested on different subsets each time, and the overall performance is averaged across all runs.

#### Advantages:
- Reduces variance compared to a single Train-Test Split: As we are repeating the process multiple times with different random splits, this method provides a more stable estimate of model performance. A single train-test split can be highly dependent on the specific split, whereas repetition reduces this issue.

- More computationally efficient than K-Fold or LOO-CV: Unlike LOO-CV (which requires *n* model runs) and K-Fold Cross Validation (which runs *K* times), repeated random splitting allows you to control the number of repetitions, making it computationally more efficient.

- More flexible than K-Fold Cross-Validation: You can control both the train-test ratio and the number of repetitions, giving you flexibility to adjust based on dataset size and computational power.

- Works well for large datasets: Since each split is independent, it can be used on large datasets where K-Fold CV or LOO-CV may be too slow.

- Helps prevent overfitting to a specific split: The model avoids overfitting to a specific train-test division by evaluating multiple different splits, leading to a more generalisable model.

#### Disadvantages:
- Some data points may not be included in the test set: Since splits are random, some samples may never appear in the test set, while others may be selected multiple times. This can introduce bias, as the model might not be tested on all examples equally.

- Less efficient use of data compared to K-Fold Cross Validation: In K-Fold Cross-Validation, every sample is tested exactly once and used for training in the remaining folds. In Repeated Random Splits, some data points may never be tested, making it less efficient in using all available data.

- Not always suitable for small datasets: If the dataset is small, random splits can lead to significant variations in results. A bad split where an entire class is underrepresented can distort model performance estimates.

- Higher risk of data leakage: If preprocessing steps (such as normalisation or feature scaling) are applied before splitting, information from the test set might unintentionally influence training. Ensuring that each split is properly handled can be more error-prone compared to K-Fold CV.

- Computational cost can still be high: The computational cost is lower than that of LOO-CV, but running multiple repeated splits still increases the training time compared to a single train-test split.
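
The point about uneven test coverage can be checked directly by counting how often each sample lands in a test set. The sketch below uses 20 placeholder samples and a hypothetical helper, `count_test_appearances`, to compare k-Fold with repeated random splits:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

n_samples = 20
indices = np.arange(n_samples).reshape(-1, 1)  # placeholder data, one row per sample

def count_test_appearances(splitter):
    """Count how many times each sample is placed in a test set."""
    counts = np.zeros(n_samples, dtype=int)
    for _, test_idx in splitter.split(indices):
        counts[test_idx] += 1
    return counts

kfold_counts = count_test_appearances(KFold(n_splits=4, shuffle=True, random_state=7))
shuffle_counts = count_test_appearances(
    ShuffleSplit(n_splits=4, test_size=0.25, random_state=7)
)

print('k-Fold:       ', kfold_counts)
print('Shuffle split:', shuffle_counts)
print('Never tested under shuffle split:', int((shuffle_counts == 0).sum()))
```

With k-Fold every count is exactly 1, while the shuffle-split counts typically range from 0 upwards because each split is drawn independently.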

In [None]:
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

# Define Repeated Random Test-Train Splits
num_splits = 10  # Number of repetitions
test_size = 0.3  # 30% of data is used for testing

shuffle_split = ShuffleSplit(n_splits=num_splits, test_size=test_size, random_state=7)

# Evaluate model using repeated random test-train splits
scores = cross_val_score(model, X, Y, cv=shuffle_split)

# Print the results
print(f'Mean Accuracy: {np.mean(scores):.3f}, Standard Deviation: {np.std(scores):.3f}')

## Resampling for Language data

When applying resampling techniques to text data, we first transform the text into numerical form—typically using *TF-IDF vectorisation*. TF-IDF converts textual data into numerical feature vectors based on word frequency and importance. This step makes text data compatible with machine learning algorithms like logistic regression.

Although this preprocessing step differs from numerical datasets (such as the Pima Diabetes dataset, which is already numeric), the core resampling approach remains similar:

- We still perform methods such as *train/test splits*, *k-fold cross-validation*, *Leave-One-Out Cross-Validation*, and *Shuffle splits*.
- The key difference is the additional step of converting text into numeric features before applying these resampling methods.

This ensures that text classification models are robustly evaluated in the same systematic way as models trained on purely numeric data.

The code below demonstrates how text data can be downloaded, loaded, and prepared for sentiment classification. It imports the necessary libraries and defines the data's location, with categories labelled as positive (`pos`) and negative (`neg`).

Text files from each category are read one by one, with the content appended to a list (`X`) and the corresponding labels (indicating positive or negative sentiment) appended to another list (`Y`):

### Downloading the data

In [None]:
import urllib.request
import tarfile
import os

# Larger IMDb dataset (80.2MB), kept here for reference:
# url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
# Smaller Cornell movie review dataset (2.2MB), used in this notebook:
url = "http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_0211.tar.gz"

# Download the dataset to the current directory
# (the local archive keeps the IMDb filename, whichever URL is used)
urllib.request.urlretrieve(url, "aclImdb_v1.tar.gz")

# Unpack (extract) the dataset
with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
    tar.extractall()


### Loading the data

In [None]:
import os  # For accessing file system
import numpy as np  # Numerical operations
import random  # Random sampling of files

# Directory containing text data organised into 'pos' and 'neg' folders
root_dir = 'tokens/'
categories = ['pos', 'neg']  # Sentiment categories: positive and negative

X, Y = [], []  # Lists to store data (reviews) and labels (sentiments)

sample_size = 250  # Number of samples per category (limited for quicker demonstration)

# Loop through each sentiment category
for cat in categories:
    file_list = os.listdir(os.path.join(root_dir, cat))  # List files in category folder
    sample_files = random.sample(file_list, sample_size)  # Randomly pick files for sampling

    # Read each sampled file
    for filename in sample_files:
        with open(os.path.join(root_dir, cat, filename), 'r') as f:
            label = 1 if cat == "pos" else 0  # Assign numeric label (positive=1, negative=0)
            
            X.append(f.read())  # Append review text
            Y.append(label)     # Append sentiment label

Y = np.array(Y)  # Convert labels list to NumPy array for easier computation later

# Display size of loaded dataset and a sample review with its sentiment label
print("Training size:", len(X), "Labels:", len(Y))
print("Sample Review:", X[0])
print("Sentiment (0-neg, 1-pos):", Y[0])

### Preprocessing
The data is converted into numerical form using TF-IDF vectorisation. The `TfidfVectorizer` is configured to remove common English words (`stop_words='english'`) and limit the features to the 3,000 most frequent terms (`max_features=3000`).

The vectoriser then processes the raw text data (`X`) and transforms it into a numerical format (`X_tfidf`) suitable for training our machine learning model. This process helps the model focus on meaningful words and reduces the computational complexity.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(stop_words='english', max_features=3000)

X_tfidf = vectoriser.fit_transform(X)

print(X_tfidf)

### Train/Test split
We use the train/test split approach again, applied to our TF-IDF vectorised text data. The data (`X_tfidf`) is divided into a training set (80%) and a test set (20%), ensuring reproducibility with `random_state=7`.

Our logistic regression model is then trained using the training data (`X_train`, `y_train`).

Finally, we evaluate and report the accuracy of the model on unseen text data (`X_test`, `y_test`), demonstrating how effectively the model generalises to new, previously unseen text examples.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, Y, test_size=0.20, random_state=7)

model.fit(X_train, y_train)

print(f'Text Data Accuracy: {model.score(X_test, y_test) * 100:.3f}%')

### k-Fold Cross-Validation
Let's try k-Fold Cross-Validation to get a better estimate of how well our model performs on our data:

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

kfold = KFold(n_splits=10, random_state=7, shuffle=True)

results = cross_val_score(model, X_tfidf, Y, cv=kfold)

print(f'k-Fold Accuracy: {results.mean()*100:.3f}% ({results.std()*100:.3f}%)')
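
One caveat worth noting: in the cells above the vectoriser was fitted on all reviews before any splitting, so vocabulary and word-frequency statistics from the test folds are visible during training. A possible way to avoid this, sketched here on a tiny invented corpus (the texts, labels, and names such as `text_pipeline` are assumptions for illustration), is to wrap the vectoriser and the classifier in a scikit-learn pipeline so that TF-IDF is refitted on the training portion of every fold:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny invented corpus, for illustration only
toy_texts = ["great film", "loved it", "superb acting", "really enjoyable",
             "terrible film", "hated it", "awful acting", "really boring"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline fits TF-IDF on each training fold only, then fits the classifier
text_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(solver='liblinear'))

toy_kfold = KFold(n_splits=4, shuffle=True, random_state=7)
pipeline_scores = cross_val_score(text_pipeline, toy_texts, toy_labels, cv=toy_kfold)

print(f'Scores per fold: {pipeline_scores}')
```

The accuracies on such a tiny corpus are not meaningful; the sketch only shows the structure. The same pattern should carry over to the review data by passing the raw list of reviews instead of `X_tfidf`.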

### Leave-One-Out Cross-Validation (LOO-CV)
Now let's see how we implement Leave-One-Out Cross-Validation. The code performs Leave-One-Out Cross-Validation on the vectorised reviews `X_tfidf` with labels `Y`.

The loop iterates over each data point, using it as the test set while the remaining *n-1* samples serve as the training set. Inside the loop, the model is trained on `X_train` and `y_train` (excluding the test sample) and then makes a prediction for `X_test` (the left-out sample). The true and predicted labels are stored in `y_true` and `y_predicted`, respectively.

After all iterations, the overall classification accuracy is calculated using `accuracy_score(y_true, y_predicted)` and printed:

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

# Define Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Store predictions and actual values
y_true, y_predicted = [], []

# Perform LOO-CV
for train_index, test_index in loo.split(X_tfidf):
    X_train, X_test = X_tfidf[train_index], X_tfidf[test_index]
    y_train, y_test = Y[train_index], Y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Make prediction
    y_pred = model.predict(X_test)

    # Store results
    y_true.append(y_test[0])  # Append actual class
    y_predicted.append(y_pred[0])  # Append predicted class

# Compute accuracy
accuracy = accuracy_score(y_true, y_predicted)
print(f'LOO-CV Accuracy: {accuracy:.4f}')


### Repeated random Test-Train Splits
The code below performs Repeated Random Test-Train Splits using `ShuffleSplit` from `sklearn.model_selection` to evaluate a model's performance. It defines `num_splits = 10`, meaning the dataset is randomly split into training and test sets 10 times, with `test_size = 0.33`, ensuring that 33% of the data is used for testing in each iteration.

The `ShuffleSplit` object generates different random splits while maintaining a fixed `random_state = 7` for reproducibility. The `cross_val_score` function trains and tests the model on these splits, recording the accuracy for each iteration.

Finally, the mean accuracy and standard deviation of the scores are computed using `np.mean(scores)` and `np.std(scores)`, respectively, and printed to provide a performance estimate.

In [None]:
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

# Define Repeated Random Test-Train Splits
num_splits = 10  # Number of repetitions
test_size = 0.33  # 33% of data is used for testing

shuffle_split = ShuffleSplit(n_splits=num_splits, test_size=test_size, random_state=7)

# Evaluate model using repeated random test-train splits
scores = cross_val_score(model, X_tfidf, Y, cv=shuffle_split)

# Print the results
print(f'Mean Accuracy: {np.mean(scores):.3f}, Standard Deviation: {np.std(scores):.3f}')

## Resampling for Image data

Exploring resampling techniques on an **image dataset**, such as the Cats vs. Dogs dataset, is slightly different from numerical or text datasets because images typically require more complex preprocessing (such as resizing, feature extraction, or encoding) before training a model.

However, the core approach remains similar: we still apply resampling methods like train/test splits, k-fold cross-validation, LOOCV, and shuffle splits to assess how reliably our image classifier performs.

Even though handling images may involve additional steps like converting pixel data into features the model can interpret, the fundamental goal of resampling remains unchanged—evaluating the model’s ability to generalise to new, unseen examples.

When we use resampling with images, we ensure our model accurately distinguishes between categories (in this case, cats and dogs) rather than memorising specific images, resulting in a robust and trustworthy evaluation.


### Downloading the data

In [None]:
import os
import urllib.request
import zipfile

# Define URL and target filenames
url = "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip"
zip_path = "cats_and_dogs_filtered.zip"
extract_dir = "cats_dogs"

# Download the zip file
print("Downloading dataset...")
urllib.request.urlretrieve(url, zip_path)
print("Download complete.")

# Extract the zip file
print("Extracting files...")
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
print("Extraction complete.")

# Delete the zip file
print("Cleaning up...")
os.remove(zip_path)
print("Cleanup complete.")


### Loading the data
The code below loads and preprocesses an image dataset (Cats vs. Dogs) for machine learning. It loops through the images stored in folders named 'cats' and 'dogs', reading each image using OpenCV (`cv2`). Each image is resized to a consistent shape of 128x128 pixels (`IMG_SIZE = 128`) to ensure uniformity.

These resized images are collected into an array `X`, while their labels (the folder names `'cats'` and `'dogs'`) are stored in the array `Y`.

Afterwards, the image data is converted into NumPy arrays and normalised by dividing pixel values by `255.0`, scaling them between `0` and `1` to help the model learn more effectively.

Finally, because models like Logistic Regression expect one feature vector per sample, each image is flattened from (128, 128, 3) to (128*128*3,) before training.

In [None]:
import os                    # File and directory handling
import cv2                   # Image loading and processing with OpenCV
import numpy as np           # Numerical operations on arrays
import matplotlib.pyplot as plt  # Visualisation and plotting (useful later)

# Root directory containing training images
dataset_path = "cats_dogs/cats_and_dogs_filtered/train/"

# Categories corresponding to subfolders in the dataset ('cats' and 'dogs')
category_path = ["cats", "dogs"]
IMG_SIZE = 128  # Target size for resizing images

# Limit number of images per category (for faster processing)
limit = 250

# Lists to store image data (X) and labels (Y)
X = []
Y = []

# Loop through categories and load images
for category in category_path:
    path = os.path.join(dataset_path, category)  # Path to current category
    count = 0
    for file in os.listdir(path):  # Iterate over files in category folder
        if file.endswith(('.jpg', '.jpeg', '.png')):  # Only process image files
            img_path = os.path.join(path, file)  # Full path to image file
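            img = cv2.imread(img_path)  # Load image (returns None if the file cannot be read)
            if img is None:
                continue  # Skip files OpenCV cannot decode
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; convert to 3-channel RGB
            img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))  # Resize image to fixed size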
            X.append(img)  # Store image data
            Y.append(category)  # Store corresponding label ('cats' or 'dogs')
            count += 1
            if count >= limit:  # Stop after reaching sample limit
                break

# Convert lists into NumPy arrays and scale pixel values to the range 0-1
X = np.array(X, dtype=np.float32) / 255.0
Y = np.array(Y)

# Flatten images from (128,128,3) into vectors (128*128*3), suitable for ML models
X_flattened = X.reshape(len(X), -1)
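
To illustrate the scaling and flattening steps in isolation, the sketch below applies them to a small batch of randomly generated "images" (an assumption standing in for real photos, using 64x64 pixels to keep it light):

```python
import numpy as np

# Four fake RGB images of 64x64 pixels with integer values 0-255 (illustrative only)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 64, 64, 3), dtype=np.uint8)

# Scale pixel values to the range 0-1
scaled = fake_images.astype(np.float32) / 255.0

# Flatten each image into a single row: (samples, height * width * channels)
flat = scaled.reshape(len(scaled), -1)

print(fake_images.shape)  # (4, 64, 64, 3)
print(flat.shape)         # (4, 12288)
```

Each row of `flat` is one image laid out as a single feature vector, which is the 2D `(samples, features)` layout that scikit-learn estimators expect.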


*Note*: We have selected a small sample of the images as training data to demonstrate the principles, so the accuracy of our model will be quite poor. For better results you can use the full dataset if you have the resources. In addition, classical machine learning algorithms are not the best choice for images; neural network models such as Convolutional Neural Networks (CNNs) would achieve better performance.

In any case, let's quickly preview a few images from the dataset to check everything loaded:

In [None]:
import matplotlib.pyplot as plt
import random

# Combine image arrays and labels into a list of tuples
combined = list(zip(X, Y))

# Randomly select 20 image-label pairs
sampled = random.sample(combined, 20)

# Display the randomly selected images
fig, axes = plt.subplots(4, 5, figsize=(10, 8))  # 4 rows, 5 columns

for (img, label), ax in zip(sampled, axes.flatten()):
    # Show image
    ax.imshow(img)
    ax.set_title(f"Label: {label}")
    ax.axis('off')

plt.tight_layout()
plt.show()


#### Train/Test Split
This code applies a train/test split, as before, to the image dataset you've loaded and preprocessed. We use 80% of the dataset as training data `(X_train, y_train)` to train the model. The remaining 20% becomes the test data `(X_test, y_test)`, which we reserve to evaluate the model's performance on unseen images.

The split uses the flattened images (`X_flattened`) created when we loaded the data, in which each image (originally 3D: height × width × channels) has become a 1D vector. This is necessary because most classical ML models, such as logistic regression, expect 2D input: `(samples, features)`.
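
As a quick illustration of what flattening does to the array shapes, here is a minimal sketch using a small placeholder array rather than the real images. The names `demo_images`, `demo_flat` and `restored` are made up for this example:

```python
import numpy as np

IMG_SIZE = 128

# Hypothetical stand-in for the loaded images: 4 blank RGB images of 128x128 pixels
demo_images = np.zeros((4, IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)

# One row per image; -1 lets NumPy infer 128 * 128 * 3 = 49152 features
demo_flat = demo_images.reshape(len(demo_images), -1)

print(demo_images.shape)  # (4, 128, 128, 3)
print(demo_flat.shape)    # (4, 49152)

# Flattening only rearranges the values, so the original shape can be restored
restored = demo_flat.reshape(-1, IMG_SIZE, IMG_SIZE, 3)
```

Each row of the flattened array holds every pixel value of one image, so the model treats each pixel channel as a separate feature.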

The parameter `random_state=seed` ensures that the data split is reproducible. The code then trains a logistic regression model on the training set and prints its accuracy on the test set:

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

seed = 7

# Train/Test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_flattened, Y, test_size=0.2, random_state=seed)

# Create and train the logistic regression model
model = LogisticRegression(solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)

# Evaluate and print accuracy
print(f'Image Data Accuracy: {model.score(X_test, y_test) * 100:.3f}%')
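
If you want to confirm how many images land in each set, a check along the following lines may help. This is only a sketch on randomly generated placeholder data (`demo_X`, `demo_Y`), and the `stratify` argument is an addition that the split above does not use; it asks scikit-learn to keep the cat/dog ratio the same in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 7

# Hypothetical stand-in data: 500 flattened "images", 250 per class
demo_X = np.random.default_rng(seed).random((500, 48))
demo_Y = np.array(["cats"] * 250 + ["dogs"] * 250)

# stratify=demo_Y (an assumption, not part of the split above) preserves class balance
Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_Y, test_size=0.2, random_state=seed, stratify=demo_Y
)

print(f"Train: {len(Xtr)} images, Test: {len(Xte)} images")

# Count how many test images belong to each class
labels, counts = np.unique(yte, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

With 500 samples and `test_size=0.2`, the split yields 400 training and 100 test samples, and stratification puts 50 of each class in the test set.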


#### Train/Test with folders

When working with real-world image datasets, it's common to find the data already neatly separated into *training* and *test* (or *validation*) folders. This helps make things simpler because the images have already been organised for us into separate groups. For instance, a typical folder structure might look like this:

```
cats_dogs/cats_and_dogs_filtered/
├── train/
│   ├── cats/
│   └── dogs/
├── validation/
│   ├── cats/
│   └── dogs/
```

Here, images of cats and dogs intended for *training* your machine learning model are stored separately from images intended for *validating* how well your model performs.

In this scenario:

- The *training set* contains images your model learns from. Looking at these images, our model will gradually learn to recognise distinguishing features—such as the shape of a cat’s ears or the length of a dog's nose.
- The *validation set* (also often called the test set) contains images your model hasn't seen before. After your model has finished learning, you show it these new images to test how accurately it can classify unseen data, effectively checking if it has genuinely learned general patterns or simply memorised the training images.

Splitting a dataset into training and validation sets can be done manually or with a random split. Either way, each set should represent the variety present in your data. For example, you wouldn't want all long-haired cats in your training set and all short-haired cats in your validation set, because the model would then be tested on a kind of image it never saw during training, and the validation score would be misleading. Having a good balance and sufficient data in both sets helps your model learn better and provides a clearer measure of how well it performs.

Since our dataset is already organised clearly into these folders, the next step is straightforward: we simply load the images from the *training* and *validation* folders into separate arrays or lists, ready for training and evaluating our model:

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models

# Set image size and batch size
IMG_SIZE = 128
BATCH_SIZE = 32

# Define the root path to the dataset (the folder containing 'train' and 'validation')
dataset_path = "cats_dogs/cats_and_dogs_filtered/"  # Adjust this to your dataset location

# Define dataset directories
train_dir = dataset_path + 'train'
val_dir = dataset_path + 'validation'

# Load and preprocess images directly from directories - training
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    train_dir,
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    label_mode='binary'  # because we have two classes: cat/dog
)

# Load and preprocess images directly from directories - validation
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    val_dir,
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    label_mode='binary'
)

Let's demonstrate how to use the training and validation sets with a more appropriate model, a Convolutional Neural Network (CNN):

In [None]:
# Optional: Improve performance with prefetching
AUTOTUNE = tf.data.AUTOTUNE  # Let TensorFlow choose the optimal buffer size

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)  # Cache and prefetch training data

val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)  # Cache and prefetch validation data

# Define a simple CNN (Convolutional Neural Network) model
model = models.Sequential([
    layers.Rescaling(1./255, input_shape=(IMG_SIZE, IMG_SIZE, 3)),  # Normalise pixel values to [0, 1]
    layers.Conv2D(32, (3, 3), activation='relu'),  # First convolutional layer
    layers.MaxPooling2D(),                        # Downsample with max pooling
    layers.Conv2D(64, (3, 3), activation='relu'),  # Second convolutional layer
    layers.MaxPooling2D(),                        # Downsample again
    layers.Flatten(),                             # Flatten feature maps into a single vector
    layers.Dense(64, activation='relu'),          # Fully connected layer
    layers.Dense(1, activation='sigmoid')         # Output layer for binary classification
])

# Compile the model with optimiser, loss function, and evaluation metric
model.compile(optimizer='adam',
              loss='binary_crossentropy',  # Suitable for binary classification
              metrics=['accuracy'])

# Train the model on the training set, validating on the validation set
history = model.fit(train_ds, validation_data=val_ds, epochs=5)

# Evaluate the model on the validation data and print accuracy
val_loss, val_acc = model.evaluate(val_ds)
print(f'\nValidation Accuracy: {val_acc * 100:.2f}%')
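
To get a feel for how the image shrinks as it passes through the layers, the shapes can be traced by hand. The sketch below assumes the Keras defaults used above (convolutions with no padding and stride 1, and 2×2 max pooling); the helper names `conv_out` and `pool_out` are made up for illustration:

```python
def conv_out(size, kernel=3):
    """Width/height after a stride-1 convolution with no padding ('valid')."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Width/height after non-overlapping max pooling (rounds down)."""
    return size // pool

size = 128
size = conv_out(size)   # Conv2D(32, (3, 3)) -> 126
size = pool_out(size)   # MaxPooling2D()     -> 63
size = conv_out(size)   # Conv2D(64, (3, 3)) -> 61
size = pool_out(size)   # MaxPooling2D()     -> 30

flat_features = size * size * 64         # Flatten: 30 * 30 * 64 = 57600
dense_params = flat_features * 64 + 64   # Dense(64): weights plus biases

print(size, flat_features, dense_params)  # 30 57600 3686464
```

Most of the model's trainable parameters sit in the first dense layer, which is one reason deeper CNNs add more convolution and pooling stages before flattening.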


### What have we learnt?
In exploring resampling techniques across three diverse datasets (numerical: Pima Indians Diabetes; textual: Sentiment Analysis; image-based: Cats vs. Dogs), we observed how methods like Train/Test Split, k-Fold Cross-Validation, Leave-One-Out Cross-Validation (LOOCV), and Shuffle Split provide reliable ways to evaluate machine learning models. You will have noticed that, regardless of the dataset used, the process and the code are pretty much the same.

For each dataset, applying these resampling methods allowed us to clearly measure both the accuracy and consistency (standard deviation) of our logistic regression model. We found that while a simple Train/Test split offers quick, straightforward insights, k-Fold Cross-Validation gives a more balanced, stable, and trustworthy estimate of model performance, particularly useful when the dataset is limited or prone to variability.
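
As a compact reminder of that shared pattern, here is a minimal sketch of k-Fold Cross-Validation that reports the mean accuracy and its standard deviation. It uses synthetic data from `make_classification` rather than any of the three datasets, and the choice of 10 folds is an assumption made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

seed = 7

# Synthetic stand-in data (not one of the datasets used in this chapter)
demo_X, demo_y = make_classification(n_samples=200, n_features=10, random_state=seed)

# 10 folds, shuffled reproducibly; each fold serves once as the test set
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
model = LogisticRegression(solver='liblinear', max_iter=1000)
scores = cross_val_score(model, demo_X, demo_y, cv=kfold)

# Mean shows typical performance; standard deviation shows how much it varies by fold
print(f"Accuracy: {scores.mean() * 100:.3f}% (+/- {scores.std() * 100:.3f}%)")
```

Swapping `KFold` for another splitter such as `ShuffleSplit` or `LeaveOneOut` changes only the `cv` argument; the rest of the code stays the same.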

Whilst these approaches are suitable for numeric and language data, for image data you would more likely use a neural network such as a Convolutional Neural Network (CNN). In that case, sampling usually means creating separate train and test collections, with a good number of image samples distributed into the relevant folders and loaded separately.

