MCDAA, UdelaR
# Computer Vision 2025 - Final Project - Labels and Data Augmentation
Pablo Molina
Joana Auriello

There are two ways to apply data augmentation. In this Colab notebook we perform OFFLINE data augmentation; our original notebook also applied on-the-fly (online) augmentation. The main differences are:

**1 — Offline Augmentation (Saving Augmented Images to Disk)**

In offline augmentation, we physically generate new images from the original dataset and save each augmentation type—such as flip-only, rotation-only, zoom-only, contrast-only—into separate files. This creates a larger and fixed augmented dataset that can be reused for any model (ResNet, EfficientNet, YOLO) and inspected visually. The downside is that it is extremely time- and storage-intensive: for tens of thousands of images, saving multiple augmented versions can require many hours and hundreds of thousands of file writes to Google Drive. Offline augmentation also produces static variations—once images are saved, the model always sees the exact same augmented samples every epoch, reducing variability during training.

**2 — Online Augmentation (ImageDataGenerator During Training)**

With online augmentation using ImageDataGenerator (or Keras Random* layers), the model receives a new, randomly transformed version of each image every epoch, with no files saved to disk. This approach is far more efficient: it keeps the dataset size small and performs augmentation in memory, delivering endless variation over the course of training. For example, in 10 epochs, the model may effectively see 10 different augmented versions of each training image—creating far more diversity than offline augmentation while requiring almost no preprocessing time. This is the most common modern approach in deep learning pipelines because it is fast, flexible, and maximizes generalization without consuming storage.
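
To make the "new version every epoch" idea concrete, here is a minimal NumPy-only sketch. It does not use ImageDataGenerator; `random_flip` and `epoch_views` are hypothetical helpers that mimic only the random-flip part of online augmentation.

```python
import numpy as np

def random_flip(image, rng):
    """Return the image mirrored left-right with probability 0.5."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image

def epoch_views(image, n_epochs, seed=0):
    """Simulate what the model receives for one image over several epochs."""
    rng = np.random.default_rng(seed)
    return [random_flip(image, rng) for _ in range(n_epochs)]

# Tiny 2x3 RGB "image" so that a flip is easy to detect.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
views = epoch_views(image, n_epochs=10)
n_flipped = sum(not np.array_equal(v, image) for v in views)
print(f"{n_flipped} of {len(views)} epochs saw a flipped version")
```

Nothing is written to disk here. With offline augmentation, by contrast, the flipped copy is saved once and the model sees that same file in every epoch.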

## 1. Mount Google Drive, add labels to each image, save labeled images dataset to csv file

### 1. Mount Google Drive

Run the following cell to mount your Google Drive. This will prompt you to authorize Colab to access your Drive files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
import pandas as pd
import numpy as np
import tensorflow as tf

# My working directory (small stuff: CSVs, logs, checkpoints)
WORK_DIR = '/content/drive/MyDrive/Colab Notebooks/ComputerVision'

# Pablo's directory (unlimited storage for big augmented dataset)
PABLO_DIR = '/content/drive/MyDrive/[01] - Pablo/[01].[02] - Facultad/[01].[02].[05] - Master Fing/Computer Vision/Project/Entrega Reciclaje'

# Original dataset location (folders like "Bio organico", "Envase Plasticos", etc.)
RAW_DATA_DIR = os.path.join(WORK_DIR, 'Dataset Consolidado')

print("WORK_DIR:", WORK_DIR)
print("PABLO_DIR:", PABLO_DIR)
print("RAW_DATA_DIR:", RAW_DATA_DIR)

# Quick check: list category folders
!ls "$RAW_DATA_DIR"


Mounted at /content/drive
WORK_DIR: /content/drive/MyDrive/Colab Notebooks/ComputerVision
PABLO_DIR: /content/drive/MyDrive/[01] - Pablo/[01].[02] - Facultad/[01].[02].[05] - Master Fing/Computer Vision/Project/Entrega Reciclaje
RAW_DATA_DIR: /content/drive/MyDrive/Colab Notebooks/ComputerVision/Dataset Consolidado
'Bio organico'			   'Reciclables Varios Metal'
 class_names.json		   'Reciclables Varios Otros'
'Envase Plasticos'		   'Reciclables Varios Plastico'
 image_labels.csv		   'Reciclables Varios Textiles'
'Papel y Carton'		   'Reciclables Varios Vidrio'
'Reciclables Varios Cigarro'	   'Todo lo demás'
'Reciclables Varios Electronicos'   train_files.csv
'Reciclables Varios Madera'	    val_files.csv


### 2. Tag Original Dataset and Save image_labels.csv


Here we scan the category folders and create a table with two columns:

- image_path: full path to each image
- label: folder name (category)

We save this as image_labels.csv so all models (ResNet, EfficientNet, YOLO) can reuse the same labels.

In [None]:
import glob

# Find all image files in subfolders of RAW_DATA_DIR
image_paths = glob.glob(os.path.join(RAW_DATA_DIR, '*', '*.*'))  # every file one level under each label folder

data = []
for path in image_paths:
    label = os.path.basename(os.path.dirname(path))  # folder name = class label
    data.append((path, label))

df_all = pd.DataFrame(data, columns=['image_path', 'label'])

print("Total images found:", len(df_all))
df_all.head()

# Save tagged original dataset
labels_csv_path = os.path.join(WORK_DIR, 'image_labels.csv')
df_all.to_csv(labels_csv_path, index=False)
print("Saved:", labels_csv_path)


Total images found: 59001
Saved: /content/drive/MyDrive/Colab Notebooks/ComputerVision/image_labels.csv
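
The `*.*` pattern above matches any file with an extension, not only images. As a hedged alternative, the sketch below keeps only files whose suffix is in an assumed list of image extensions. `IMAGE_EXTS` and `list_labeled_images` are hypothetical names, and the demo runs on a throwaway folder rather than on the real dataset.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to what the dataset actually contains.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def list_labeled_images(root):
    """Return sorted (path, label) pairs for image files one level under each label folder."""
    records = []
    for path in sorted(Path(root).glob("*/*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            records.append((str(path), path.parent.name))
    return records

# Demo on a temporary folder tree with one non-image file.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["Papel y Carton/a.jpg", "Papel y Carton/notes.txt", "Bio organico/b.PNG"]:
        p = Path(tmp) / rel
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(b"")
    demo_records = list_labeled_images(tmp)

print([(Path(p).name, label) for p, label in demo_records])
```

Sorting the paths also makes the row order of the resulting CSV reproducible across runs, which `glob.glob` alone does not guarantee.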


### 3. Encode Labels and Save class_names.json


We create a stable mapping from label strings to integer IDs (label_id) and save the class list.
This ensures all models use the same class ordering.

In [None]:
import json

df = pd.read_csv(labels_csv_path)

# Sorted list of class names
class_names = sorted(df['label'].unique().tolist())
label_to_int = {c: i for i, c in enumerate(class_names)}
int_to_label = {i: c for c, i in label_to_int.items()}

df['label_id'] = df['label'].map(label_to_int)

# Save class list for reuse
class_json_path = os.path.join(WORK_DIR, 'class_names.json')
with open(class_json_path, 'w', encoding='utf-8') as f:
    json.dump(class_names, f, ensure_ascii=False, indent=2)

print("Classes:", class_names)
print("Saved class_names.json at:", class_json_path)
df.head()


Classes: ['Bio organico', 'Envase Plasticos', 'Papel y Carton', 'Reciclables Varios Cigarro', 'Reciclables Varios Electronicos', 'Reciclables Varios Madera', 'Reciclables Varios Metal', 'Reciclables Varios Otros', 'Reciclables Varios Plastico', 'Reciclables Varios Textiles', 'Reciclables Varios Vidrio', 'Todo lo demás']
Saved class_names.json at: /content/drive/MyDrive/Colab Notebooks/ComputerVision/class_names.json


| | image_path | label | label_id |
|---|---|---|---|
| 0 | /content/drive/MyDrive/Colab Notebooks/Compute... | Papel y Carton | 2 |
| 1 | /content/drive/MyDrive/Colab Notebooks/Compute... | Papel y Carton | 2 |
| 2 | /content/drive/MyDrive/Colab Notebooks/Compute... | Papel y Carton | 2 |
| 3 | /content/drive/MyDrive/Colab Notebooks/Compute... | Papel y Carton | 2 |
| 4 | /content/drive/MyDrive/Colab Notebooks/Compute... | Papel y Carton | 2 |
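
A later notebook can rebuild the same mapping from the saved class list instead of recomputing it from the data. The sketch below is a self-contained round trip on a temporary file with three of the class names; the real file lives under WORK_DIR.

```python
import json
import os
import tempfile

class_names = sorted(["Todo lo demás", "Bio organico", "Papel y Carton"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "class_names.json")
    # ensure_ascii=False keeps accented characters readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(class_names, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# List position defines the integer id, so every model gets the same ordering.
label_to_int = {name: i for i, name in enumerate(loaded)}
print(label_to_int)
```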


### 4. Train/Validation Split on Original Dataset

We split the original images into training and validation sets, stratified by class (keeping class proportions).
Only training images will be augmented; validation images stay clean.

In [None]:
SEED = 42
VAL_SPLIT = 0.2

rng = np.random.default_rng(SEED)

train_idx = []
val_idx = []

# Stratified split by label_id
for cls, g in df.groupby('label_id', sort=False):
    # permutation returns a shuffled copy, so the DataFrame index is never modified in place
    idx = rng.permutation(g.index.to_numpy())
    n_val = int(len(idx) * VAL_SPLIT)
    val_idx.extend(idx[:n_val])
    train_idx.extend(idx[n_val:])

df_train_orig = df.loc[train_idx].reset_index(drop=True)
df_val_orig   = df.loc[val_idx].reset_index(drop=True)

print("Train originals:", len(df_train_orig))
print("Val originals:", len(df_val_orig))

# Optionally save original splits (no augmentation yet)
df_train_orig.to_csv(os.path.join(WORK_DIR, 'original_train.csv'), index=False)
df_val_orig.to_csv(os.path.join(WORK_DIR, 'original_val.csv'), index=False)


Train originals: 47205
Val originals: 11796
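
The same per-class split can be checked on a toy label list. This is a NumPy-only sketch with a hypothetical helper `stratified_split`. It also shows why the validation set can be slightly smaller than 20% of the total: `int()` rounds down within every class.

```python
import numpy as np

def stratified_split(labels, val_split=0.2, seed=42):
    """Split indices class by class so each class keeps roughly the same proportion."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(len(idx) * val_split)  # rounds down per class
        val_idx.extend(idx[:n_val].tolist())
        train_idx.extend(idx[n_val:].tolist())
    return train_idx, val_idx

# Three classes with 10, 25 and 7 samples.
labels = [0] * 10 + [1] * 25 + [2] * 7
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```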


## 2.  OFFLINE Data Augmentation and training dataset pipeline generation to reuse later in model training


### 1. Define Separate Augmentations (flip-only, rotate-only, etc.)
We define individual augmentation operations:

- flip-only (horizontal)
- rotation-only
- zoom-only
- contrast-only (custom)
- salt-and-pepper-only (custom)

Each original training image will generate one image per augmentation type, plus a copy of the original.

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import load_img, img_to_array, array_to_img
import cv2

IMG_SIZE = (224, 224)

# ImageDataGenerator-based augmenters (one random transform each)
flip_gen = ImageDataGenerator(horizontal_flip=True)  # random: flips with probability 0.5
rot_gen  = ImageDataGenerator(rotation_range=15)     # random angle in [-15°, +15°]
zoom_gen = ImageDataGenerator(zoom_range=0.20)       # random zoom factor in [0.8, 1.2]

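def adjust_contrast(img, factor=1.5):
    """Contrast-only augmentation (RGB uint8 image).

    Scales the L (lightness) channel around its mean, so factor > 1 widens
    the tonal range instead of simply brightening the whole image.
    """
    lab = cv2.cvtColor(img, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    l = l.astype(np.float32)
    mean = l.mean()
    l = np.clip((l - mean) * factor + mean, 0, 255).astype(np.uint8)
    enhanced = cv2.merge([l, a, b])
    return cv2.cvtColor(enhanced, cv2.COLOR_LAB2RGB)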

def salt_and_pepper(image, amount=0.005):
    """Salt-and-pepper noise only; `amount` is the fraction of pixels corrupted."""
    noisy = image.copy()
    h, w = image.shape[:2]
    # Count pixels (H*W), not array elements (H*W*C), so `amount` is not inflated 3x
    num_salt = int(np.ceil(amount * h * w * 0.5))
    num_pepper = int(np.ceil(amount * h * w * 0.5))

    # Salt (white); randint's upper bound is exclusive, so pass the full dimension
    coords = [np.random.randint(0, i, num_salt) for i in (h, w)]
    noisy[coords[0], coords[1], :] = 255

    # Pepper (black)
    coords = [np.random.randint(0, i, num_pepper) for i in (h, w)]
    noisy[coords[0], coords[1], :] = 0

    return noisy
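
Because offline augmentation writes files once, it can help to make the noise reproducible. The sketch below is a seeded variant under that assumption; `salt_and_pepper_seeded` is a hypothetical name, not part of the notebook. It draws positions from a NumPy `Generator` and then measures the fraction of pixels that actually changed.

```python
import numpy as np

def salt_and_pepper_seeded(image, amount=0.005, rng=None):
    """Set about amount/2 of the pixels to white and amount/2 to black."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    n_each = int(np.ceil(amount * h * w * 0.5))
    noisy = image.copy()
    rows = rng.integers(0, h, size=2 * n_each)  # upper bound is exclusive
    cols = rng.integers(0, w, size=2 * n_each)
    noisy[rows[:n_each], cols[:n_each]] = 255   # salt
    noisy[rows[n_each:], cols[n_each:]] = 0     # pepper
    return noisy

gray = np.full((224, 224, 3), 128, dtype=np.uint8)
noisy = salt_and_pepper_seeded(gray, amount=0.005, rng=np.random.default_rng(0))
changed = np.any(noisy != gray, axis=-1).mean()
print(f"fraction of pixels changed: {changed:.4f}")
```

The measured fraction can be slightly below `amount` when two random positions coincide.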


### 2. Offline Augmentation (Training Set Only) into Pablo's Folder

Now we create an offline augmented training dataset. For each original training image we save:

- 1 original copy
- 1 flip-only image
- 1 rotation-only image
- 1 zoom-only image
- 1 contrast-only image
- 1 salt-and-pepper-only image

All images are saved in the Drive folder under Offline_Dataset_Augmented/<label>/...


In [None]:
AUG_DATASET_DIR = os.path.join(PABLO_DIR, 'Offline_Dataset_Augmented')
os.makedirs(AUG_DATASET_DIR, exist_ok=True)
print("Augmented dataset will be stored at:", AUG_DATASET_DIR)

aug_train_records = []

for idx, row in df_train_orig.iterrows():
    orig_path = row['image_path']
    label = row['label']

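    # Load original image and resize; skip files PIL cannot fully decode
    # (e.g. "image file is truncated") instead of aborting the whole run
    try:
        img = load_img(orig_path, target_size=IMG_SIZE)
        img_array = img_to_array(img).astype("uint8")
    except OSError as err:
        print(f"Skipping unreadable image {orig_path}: {err}")
        continue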

    # Create label folder in Pablo's Drive
    out_dir = os.path.join(AUG_DATASET_DIR, label)
    os.makedirs(out_dir, exist_ok=True)

    # 1) original copy
    orig_name = f"orig_{idx}.jpg"
    orig_save = os.path.join(out_dir, orig_name)
    array_to_img(img_array).save(orig_save)
    aug_train_records.append((orig_save, label))

    # 2) flip-only (np.fliplr always mirrors; a random horizontal_flip generator only flips about half the time)
    flipped = np.fliplr(img_array)
    flip_name = f"flip_{idx}.jpg"
    flip_save = os.path.join(out_dir, flip_name)
    array_to_img(flipped).save(flip_save)
    aug_train_records.append((flip_save, label))

    # 3) rotation-only
    rotated = next(rot_gen.flow(np.expand_dims(img_array, 0), batch_size=1))[0].astype("uint8")
    rot_name = f"rot_{idx}.jpg"
    rot_save = os.path.join(out_dir, rot_name)
    array_to_img(rotated).save(rot_save)
    aug_train_records.append((rot_save, label))

    # 4) zoom-only
    zoomed = next(zoom_gen.flow(np.expand_dims(img_array, 0), batch_size=1))[0].astype("uint8")
    zoom_name = f"zoom_{idx}.jpg"
    zoom_save = os.path.join(out_dir, zoom_name)
    array_to_img(zoomed).save(zoom_save)
    aug_train_records.append((zoom_save, label))

    # 5) contrast-only
    contrast_img = adjust_contrast(img_array, factor=1.5)
    contrast_name = f"contrast_{idx}.jpg"
    contrast_save = os.path.join(out_dir, contrast_name)
    array_to_img(contrast_img).save(contrast_save)
    aug_train_records.append((contrast_save, label))

    # 6) salt-and-pepper-only
    sp_img = salt_and_pepper(img_array, amount=0.005)
    sp_name = f"sp_{idx}.jpg"
    sp_save = os.path.join(out_dir, sp_name)
    array_to_img(sp_img).save(sp_save)
    aug_train_records.append((sp_save, label))

print("Number of augmented+original train images:", len(aug_train_records))



We initially considered many offline augmentations, but due to time and storage constraints we focused on the most realistic and impactful ones for waste classification: horizontal flip, small rotations, and small zoom. These reflect how a user might photograph waste in different orientations and at different distances. We removed heavier transforms such as strong noise and standalone contrast changes, which are less realistic and significantly increased preprocessing time.

### FAST offline augmentation code (3 images per original)

Because the dataset contained over 47,000 training images, generating all augmentation types separately would require over 72 hours of compute time. To respect time constraints, we selected the two augmentations that are considered most important and realistic for waste classification: horizontal flip and small rotation. These cover common real-world variations while keeping processing feasible.

Expected runtime with this plan, using only flip + rotation:

- Files written per original image: 3 instead of 6, i.e. half of the previous plan
- ~47k original copies
- ~47k flips
- ~47k rotations

In total this is approximately 141k images and ~10–12 hours of processing on Google Drive.
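
The file counts behind these estimates follow directly from the size of the training split (47,205 originals, as printed by the split cell). A short worked calculation:

```python
n_train = 47205          # training originals reported by the split cell

files_full = n_train * 6  # original + flip + rotation + zoom + contrast + salt-and-pepper
files_fast = n_train * 3  # original + flip + rotation

print("full plan:", files_full)  # 283230
print("fast plan:", files_fast)  # 141615
```

The runtime figures themselves depend on Drive write speed and are not derived here.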

In [None]:
AUG_DATASET_DIR = os.path.join(PABLO_DIR, 'Offline_Dataset_Augmented')
os.makedirs(AUG_DATASET_DIR, exist_ok=True)
print("Augmented dataset will be stored at:", AUG_DATASET_DIR)

aug_train_records = []

for idx, row in df_train_orig.iterrows():
    orig_path = row['image_path']
    label = row['label']

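    # Load original image and resize; skip files PIL cannot fully decode
    # (e.g. "image file is truncated") instead of aborting the whole run
    try:
        img = load_img(orig_path, target_size=IMG_SIZE)
        img_array = img_to_array(img).astype("uint8")
    except OSError as err:
        print(f"Skipping unreadable image {orig_path}: {err}")
        continue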

    # Create label folder
    out_dir = os.path.join(AUG_DATASET_DIR, label)
    os.makedirs(out_dir, exist_ok=True)

    # 1) original copy
    orig_name = f"orig_{idx}.jpg"
    orig_save = os.path.join(out_dir, orig_name)
    array_to_img(img_array).save(orig_save)
    aug_train_records.append((orig_save, label))

    # 2) flip-only (np.fliplr always mirrors; a random horizontal_flip generator only flips about half the time)
    flipped = np.fliplr(img_array)
    flip_name = f"flip_{idx}.jpg"
    flip_save = os.path.join(out_dir, flip_name)
    array_to_img(flipped).save(flip_save)
    aug_train_records.append((flip_save, label))

    # 3) rotation-only
    rotated = next(rot_gen.flow(np.expand_dims(img_array, 0), batch_size=1))[0].astype("uint8")
    rot_name = f"rot_{idx}.jpg"
    rot_save = os.path.join(out_dir, rot_name)
    array_to_img(rotated).save(rot_save)
    aug_train_records.append((rot_save, label))

print("Total saved images:", len(aug_train_records))


Augmented dataset will be stored at: /content/drive/MyDrive/[01] - Pablo/[01].[02] - Facultad/[01].[02].[05] - Master Fing/Computer Vision/Project/Entrega Reciclaje/Offline_Dataset_Augmented


OSError: image file is truncated (26 bytes not processed)
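
This error means at least one source file is incomplete on disk. One way to find such files before a long run is a cheap byte-level check. The sketch below is only a heuristic, and `looks_complete_jpeg` is a hypothetical helper: it tests for the JPEG start marker (FF D8) and end marker (FF D9), so a valid file with extra trailing bytes would be flagged as well. Pillow also offers the `ImageFile.LOAD_TRUNCATED_IMAGES = True` switch, which loads whatever data is present instead of raising.

```python
import os
import tempfile

def looks_complete_jpeg(path):
    """Heuristic: a complete JPEG starts with FF D8 and ends with FF D9."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data) >= 4 and data[:2] == b"\xff\xd8" and data[-2:] == b"\xff\xd9"

# Demo with two fake files: one with both markers, one cut off before the end marker.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jpg")
    bad = os.path.join(tmp, "bad.jpg")
    with open(good, "wb") as f:
        f.write(b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9")
    with open(bad, "wb") as f:
        f.write(b"\xff\xd8" + b"\x00" * 16)  # end marker missing
    good_ok = looks_complete_jpeg(good)
    bad_ok = looks_complete_jpeg(bad)

print(good_ok, bad_ok)
```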

### 3. Build Final Train/Val CSVs
We build the final training and validation tables for model training:

- offline_aug_train.csv: original + augmented training images.
- offline_aug_val.csv: clean original validation images.

In [None]:
df_train_final = pd.DataFrame(aug_train_records, columns=['image_path', 'label'])
df_val_final   = df_val_orig[['image_path', 'label']].copy()

train_csv_path = os.path.join(WORK_DIR, 'offline_aug_train.csv')
val_csv_path   = os.path.join(WORK_DIR, 'offline_aug_val.csv')

df_train_final.to_csv(train_csv_path, index=False)
df_val_final.to_csv(val_csv_path, index=False)

print("Saved offline_aug_train.csv:", len(df_train_final))
print("Saved offline_aug_val.csv:", len(df_val_final))


NameError: name 'aug_train_records' is not defined
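
The NameError appears when this cell runs in a session where the augmentation loop did not finish, so the in-memory list does not exist. Since the images already written are still on disk, one option is to rebuild the list from the folder structure. The sketch below assumes the file-name prefixes used above (orig_, flip_, rot_); `rebuild_records` is a hypothetical helper, and the demo uses a temporary folder.

```python
import os
import tempfile

def rebuild_records(aug_root, prefixes=("orig_", "flip_", "rot_")):
    """Return (path, label) pairs for augmented files found under aug_root/<label>/."""
    records = []
    for label in sorted(os.listdir(aug_root)):
        folder = os.path.join(aug_root, label)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            if name.startswith(prefixes) and name.lower().endswith(".jpg"):
                records.append((os.path.join(folder, name), label))
    return records

# Demo: zoom_0.jpg is ignored because its prefix is not in the list.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["Papel y Carton/orig_0.jpg", "Papel y Carton/flip_0.jpg",
                "Papel y Carton/zoom_0.jpg", "Bio organico/rot_7.jpg"]:
        path = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "wb").close()
    demo_records = rebuild_records(tmp)

print([(os.path.basename(p), label) for p, label in demo_records])
```

In the notebook this would be called as `aug_train_records = rebuild_records(AUG_DATASET_DIR)` before building the CSVs.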

### 4. Visualize Some Augmented Images
Quick sanity check: show a few randomly chosen augmented images from a given class to verify that augmentation worked as expected.

In [None]:
import random
import matplotlib.pyplot as plt

def show_augmented_samples(label, n=6):
    folder = os.path.join(AUG_DATASET_DIR, label)
    image_files = os.listdir(folder)
    chosen = random.sample(image_files, min(n, len(image_files)))

    plt.figure(figsize=(12, 5))
    for i, name in enumerate(chosen):
        path = os.path.join(folder, name)
        img = cv2.imread(path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        plt.subplot(2, (n+1)//2, i+1)
        plt.imshow(img)
        plt.title(name[:12] + "...")
        plt.axis('off')

    plt.suptitle(f"Augmented samples for: {label}")
    plt.tight_layout()
    plt.show()

# Example: view some plastic images (change label as needed)
show_augmented_samples('Reciclables Varios Plastico')


## 3. ResNet50 with OFFLINE Data Augmentation

### 1. Build tf.data Datasets (Train = Augmented, Val = Clean)
We load the offline-augmented training set and the clean validation set with tf.data.
No further augmentation is done here, because the augmentation is already baked into the training images.

In [None]:
IMG_SIZE = (224, 224)
BATCH_SIZE = 32
AUTOTUNE = tf.data.AUTOTUNE

df_train = pd.read_csv(train_csv_path)
df_val   = pd.read_csv(val_csv_path)

# Encode labels
class_names = sorted(df_train['label'].unique())
label_to_int = {c: i for i, c in enumerate(class_names)}
df_train['label_id'] = df_train['label'].map(label_to_int)
df_val['label_id']   = df_val['label'].map(label_to_int)

def load_image(path, label_id):
    img = tf.io.decode_image(tf.io.read_file(path), channels=3, expand_animations=False)
    img = tf.image.resize(img, IMG_SIZE)
    img = tf.cast(img, tf.float32) / 255.0
    return img, label_id

train_ds = (
    tf.data.Dataset.from_tensor_slices((df_train['image_path'], df_train['label_id']))
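    # The CSV is ordered class by class, so the buffer must cover the whole
    # dataset; shuffling happens on path strings, so this stays cheap
    .shuffle(len(df_train), reshuffle_each_iteration=True)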
    .map(load_image, num_parallel_calls=AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)

val_ds = (
    tf.data.Dataset.from_tensor_slices((df_val['image_path'], df_val['label_id']))
    .map(load_image, num_parallel_calls=AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)

print("Datasets ready.")


### 2. Compute Class Weights

In [None]:
from collections import Counter

counts = Counter(df_train['label_id'])
total = sum(counts.values())
class_weight = {cls: total / (len(counts) * cnt) for cls, cnt in counts.items()}

print("Class counts:", counts)
print("Class weights:", class_weight)
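
The weights above follow the usual "balanced" formula w_c = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. A small worked example with made-up counts (`balanced_class_weights` is a hypothetical wrapper around the same expression):

```python
from collections import Counter

def balanced_class_weights(label_ids):
    """w_c = N / (K * n_c): rare classes get proportionally larger weights."""
    counts = Counter(label_ids)
    total = sum(counts.values())
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Made-up imbalance: 80, 15 and 5 samples.
toy_labels = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(toy_labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

With these weights each class contributes the same total weight (N / K) to the loss, regardless of how many samples it has.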


### 3. Define Callbacks

In [None]:
ckpt_dir = os.path.join(WORK_DIR, 'checkpoints_resnet50')
log_dir  = os.path.join(WORK_DIR, 'logs_resnet50')
os.makedirs(ckpt_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath=os.path.join(ckpt_dir, 'best.keras'),
        monitor='val_accuracy',
        mode='max',
        save_best_only=True
    ),
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=2
    ),
    tf.keras.callbacks.TensorBoard(log_dir=log_dir),
]


### 4. ResNet-50 Phase 1: Train Only the Classifier Head

In [None]:
from tensorflow.keras import layers, Model

num_classes = len(class_names)

# Pretrained ResNet50 backbone
base = tf.keras.applications.ResNet50(
    include_top=False,
    weights='imagenet',
    input_shape=(*IMG_SIZE, 3)
)
base.trainable = False  # freeze backbone

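inputs = layers.Input(shape=(*IMG_SIZE, 3))
# load_image() yields RGB in [0, 1], but the ImageNet ResNet50 weights expect
# resnet50.preprocess_input applied to [0, 255] pixels, so convert inside the model
x = layers.Rescaling(255.0)(inputs)
x = tf.keras.applications.resnet50.preprocess_input(x)
x = base(x, training=False)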
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(num_classes, activation='softmax')(x)

model = Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

history1 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=callbacks,
    class_weight=class_weight
)


### 5. ResNet-50 Phase 2: Fine-Tune the Top Layers

In [None]:
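# Unfreeze last ~30 layers of the backbone.
# base.trainable was set to False in Phase 1; while it stays False the backbone
# exposes no trainable weights, so re-enable it before freezing the early layers.
base.trainable = True
fine_tune_at = len(base.layers) - 30

for i, layer in enumerate(base.layers):
    layer.trainable = (i >= fine_tune_at)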

print("Trainable backbone layers in Phase 2:",
      sum(layer.trainable for layer in base.layers))

# Re-compile with a smaller learning rate
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history2 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=callbacks,
    class_weight=class_weight
)


### 6. Final Evaluation (Accuracy, Confusion Matrix, Report)

In [None]:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Ground-truth labels
y_true = df_val['label_id'].values

# Predictions
y_pred_probs = model.predict(val_ds)
y_pred = np.argmax(y_pred_probs, axis=1)

print("Validation accuracy:", np.mean(y_true == y_pred))

print("\nClassification report:")
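# Passing labels explicitly keeps the report aligned with class_names even if a class is absent
print(classification_report(y_true, y_pred, labels=list(range(len(class_names))), target_names=class_names))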

cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix - Validation Set")
plt.tight_layout()
plt.show()


### 7. Prediction Function for New Images


In [None]:
def preprocess_single_image(image_path):
    img = tf.io.decode_image(tf.io.read_file(image_path), channels=3, expand_animations=False)
    img = tf.image.resize(img, IMG_SIZE)
    img = tf.cast(img, tf.float32) / 255.0
    return img

def predict_image(image_path, model, class_names, top_k=3):
    img = preprocess_single_image(image_path)
    img_batch = tf.expand_dims(img, 0)

    preds = model.predict(img_batch)[0]
    top_indices = np.argsort(preds)[::-1][:top_k]
    top_probs = preds[top_indices]
    top_labels = [class_names[i] for i in top_indices]

    print(f"Predictions for: {image_path}")
    for lbl, prob in zip(top_labels, top_probs):
        print(f"  {lbl}: {prob:.4f}")

    plt.figure(figsize=(4, 4))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Top-1: {top_labels[0]} ({top_probs[0]:.2f})")
    plt.show()

# Example:
# predict_image('/content/drive/MyDrive/some_image.jpg', model, class_names)
