# MammoScan AI: 03 - Local Training of the Augmented Baseline Model

## 🎯 Goal
This notebook is a clean, documented, and reproducible record of how we train our baseline model. The model applies on-the-fly data augmentation to combat the overfitting we discovered previously.

This notebook uses the exact same logic as our `ml/scripts/train.py` script but presents it in an interactive format.

**Note:** This notebook runs locally on a CPU, but training will be very slow. For full training runs, we recommend running the same code in a Google Colab environment with a GPU enabled.

## ⚙️ Setup
First, we import our necessary libraries and set up the system path. This allows the notebook to find our custom, reusable functions in the `ml/src` directory, keeping our code clean and modular.

In [1]:
# --- Core Libraries ---
import os
import sys
import numpy as np
import tensorflow as tf

# --- Path Setup ---
# Add the project's root directory to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# --- Custom Modules ---
# Import our advanced model-building function from the "workshop"
from ml.src.model import build_full_model
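
The two-levels-up path above resolves correctly only when the notebook is launched from its own directory. As a minimal sketch, assuming the project root can be recognized by the presence of an `ml` directory, the root could instead be located by walking upward from the current directory. The `find_project_root` helper and its `marker` parameter are hypothetical and not part of our codebase.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "ml") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: assumes the project root is the directory that
    holds the `ml` package. Returns None if no ancestor matches.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None
```

In the notebook this would be called as `find_project_root(Path.cwd())`, with the result converted to `str` before it is appended to `sys.path`.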


## 📥 Step 1: Ensure Data is Present & Define Constants

Before we begin, we must ensure that the correct version of our processed data is available locally. We run `dvc pull` to synchronize our local data with our Google Cloud Storage remote.

We also define all our key constants and hyperparameters in one place. This makes it easy to see and change our settings for future experiments.

In [2]:
# Uncomment the `dvc pull` command below to bring our `data/processed` directory up to date.
# We should be using the version WITHOUT the pre-augmented data.

# !dvc pull data/processed.dvc

In [3]:
# --- Constants ---
PROCESSED_DATA_DIR = os.path.join(project_root, 'data', 'processed')
MODEL_SAVE_DIR = os.path.join(project_root, 'models', 'checkpoints')
MODEL_SAVE_PATH = os.path.join(MODEL_SAVE_DIR, 'augmented_model_local.keras') # Give it a unique name

# Training Hyperparameters
IMG_HEIGHT = 224
IMG_WIDTH = 224
BATCH_SIZE = 32
EPOCHS = 20 # Can be lowered for quick local tests

## 🚚 Step 2: Load Datasets

We use TensorFlow's `image_dataset_from_directory` utility. It builds a `tf.data.Dataset` directly from our organized image folders, inferring each image's label from the name of its subdirectory and resizing every image to `image_size`. Images are loaded in batches, which is much more memory-efficient than loading the entire dataset at once.

In [4]:
print("Loading datasets...")
train_dataset = tf.keras.utils.image_dataset_from_directory(
    os.path.join(PROCESSED_DATA_DIR, 'train'),
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    label_mode='binary' # For Cancer/Non-Cancer
)

val_dataset = tf.keras.utils.image_dataset_from_directory(
    os.path.join(PROCESSED_DATA_DIR, 'val'),
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    label_mode='binary'
)

class_names = train_dataset.class_names
print(f"Classes found: {class_names}")

Loading datasets...
Found 521 files belonging to 2 classes.
Found 112 files belonging to 2 classes.
Classes found: ['Cancer', 'Non-Cancer']
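
`image_dataset_from_directory` expects one subdirectory per class and assigns integer labels in alphanumeric order of the subdirectory names, which is why `Cancer` maps to label 0 and `Non-Cancer` to label 1 here. The sketch below uses only the standard library to list what the loader will see before any image is decoded. The `count_images_per_class` helper and the extension set are illustrative assumptions, not part of our codebase.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the processed data.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def count_images_per_class(split_dir):
    """Map each class subdirectory name to its image count, sorted by name.

    The position of a class in the sorted result mirrors the integer
    label that alphanumeric ordering assigns to it.
    """
    counts = {}
    class_dirs = sorted(p for p in Path(split_dir).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images_per_class(os.path.join(PROCESSED_DATA_DIR, 'train'))` gives a quick view of the class balance that Step 3 corrects for.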



## ⚖️ Step 3: Handle Class Imbalance

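As discovered in our EDA, our dataset is imbalanced. To compensate, we calculate class weights, which scale each sample's contribution to the loss in inverse proportion to how common its class is. The minority class (Cancer, which is label 0 because class folders are indexed alphabetically) therefore counts more heavily during training, which discourages the model from becoming biased towards the majority class.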

In [5]:
print("Calculating class weights for imbalanced data...")
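# Collect every label from the batched dataset (one full pass over the training data).
labels = np.concatenate([y for _, y in train_dataset], axis=0)
# Label 0 is 'Cancer' and label 1 is 'Non-Cancer' (alphabetical folder order).
count_0, count_1 = np.bincount(labels.astype(int).flatten())
total = count_0 + count_1

# Weight each class inversely to its frequency so both contribute equally to the loss.
weight_for_0 = (1 / count_0) * (total / 2.0)
weight_for_1 = (1 / count_1) * (total / 2.0)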

class_weights = {0: weight_for_0, 1: weight_for_1}
print(f"Calculated class weights: {class_weights}")

Calculating class weights for imbalanced data...
Calculated class weights: {0: 2.9942528735632186, 1: 0.6002304147465438}
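
The printed weights follow from `weight = total / (2 * count)` for each class. With 521 training files, they correspond to 87 images with label 0 and 434 with label 1; these counts are inferred from the output above rather than printed by the notebook. A minimal sketch of the same arithmetic in plain Python, using a hypothetical `balanced_class_weights` helper:

```python
def balanced_class_weights(counts):
    """Return {label: total / (n_classes * count)} for per-class sample counts."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts inferred from the printed weights and the 521 training files.
weights = balanced_class_weights({0: 87, 1: 434})
print(weights)
```

With these weights each class carries the same total weight of 260.5, half of the 521 training files, so neither class dominates the loss.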



## 🏗️ Step 4: Create and Compile the Model

We now call our `build_full_model` function to create our model, which includes the on-the-fly data augmentation layers. We then compile it with the Adam optimizer, binary cross-entropy as the loss function, and the metrics we want to track (accuracy and recall).

In [6]:
print("Creating and compiling the model with data augmentation layers...")
model = build_full_model(input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=['accuracy', tf.keras.metrics.Recall(name='recall')]
)

model.summary()

Creating and compiling the model with data augmentation layers...
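
Which class a recall value describes depends on the label encoding: `tf.keras.metrics.Recall` counts label 1 as the positive class, and with our folder order label 1 is `Non-Cancer`. As a hedged illustration in plain Python, recall can be computed for either class. The `recall_for_label` helper and the toy labels are made up for this example and are not drawn from our data.

```python
def recall_for_label(y_true, y_pred, positive_label):
    """Fraction of samples whose true label is `positive_label` that were predicted as such."""
    predictions_for_positives = [
        p for t, p in zip(y_true, y_pred) if t == positive_label
    ]
    if not predictions_for_positives:
        return 0.0
    hits = sum(1 for p in predictions_for_positives if p == positive_label)
    return hits / len(predictions_for_positives)


# Toy labels for illustration only: 0 = Cancer, 1 = Non-Cancer.
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [1, 0, 1, 1, 1, 1]

print(recall_for_label(y_true, y_pred, positive_label=1))  # 1.0 (Non-Cancer recall)
print(recall_for_label(y_true, y_pred, positive_label=0))  # 0.5 (Cancer recall)
```

In this toy case the label-1 recall looks perfect even though half of the Cancer samples are missed, so it is worth confirming which class the tracked `recall` refers to when reading the training logs.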



## 🏃‍♂️ Step 5: Train the Model

This is the main event. We call `model.fit()` to begin the training process. The model will iterate through the training dataset for the specified number of epochs, and after each epoch, it will evaluate its performance on the unseen validation dataset.

In [None]:
print("\n--- Starting model training ---")
# Create the save directory if it doesn't exist
os.makedirs(MODEL_SAVE_DIR, exist_ok=True)

history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=EPOCHS,
    class_weight=class_weights
)
print("--- Model training finished ---\n")
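
`model.fit()` returns a `History` object whose `history` attribute maps each metric name to a list with one value per epoch. As a minimal sketch, the epoch with the lowest validation loss can be located as shown here. The `best_epoch` helper is hypothetical, and the numbers are placeholders rather than results from this run.

```python
def best_epoch(history_dict, monitor="val_loss", mode="min"):
    """Return (1-based epoch number, value) of the best entry for `monitor`."""
    values = history_dict[monitor]
    pick = min if mode == "min" else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Placeholder values for illustration only, not output from this notebook.
example_history = {
    "loss": [0.9, 0.7, 0.6, 0.5],
    "val_loss": [0.8, 0.65, 0.7, 0.75],
}

epoch, value = best_epoch(example_history)
print(f"Best epoch: {epoch} (val_loss={value})")  # Best epoch: 2 (val_loss=0.65)
```

After a real run, pass `history.history` instead of the placeholder dictionary. Because this notebook saves the model as it stands after the final epoch, a best epoch well before `EPOCHS` would suggest lowering `EPOCHS` for the next experiment.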

## 💾 Step 6: Save the Trained Model

After training is complete, we save the final model artifact. This `.keras` file contains the model's architecture, its learned weights, and the optimizer state, allowing us to easily load it later for evaluation or deployment.

In [None]:
print(f"Saving augmented model to {MODEL_SAVE_PATH}...")
model.save(MODEL_SAVE_PATH)
print("✅ Model saved successfully!")