# Second Notebook

This notebook generates the reduced .json annotation and index files needed before training. It builds on the files created in Notebook 01: the paths and directories defined there are loaded and reused here.

## Objectives

1. Read COCO2017 from `data/archive/coco2017`.
2. Filter annotations to 3 classes: person, car, airplane.
3. Save the reduced JSON files in `data/processed/`, since resource and training-time constraints make the full dataset impractical:
   - coco_person_car_airplane_train.json
   - coco_person_car_airplane_val.json
4. Save fast index files, which will be consumed in Notebook 03:
   - index_train_person_car_airplane.json
   - index_val_person_car_airplane.json

In [None]:
"""
- Imports the standard libraries required for the notebook.
"""

import json
from pathlib import Path
from collections import Counter, defaultdict


## 1. Loading .json files and validating the project structure

In [None]:
"""
In this section, the notebook loads project_config.json and labelmap.json created in Notebook 01.
It locates the project root by searching upward from the current working directory, so the notebook can be run from any folder inside the project.
"""

# Detect the project root from the marker file data/processed/project_config.json
def find_project_root(start: Path, max_up: int = 8) -> Path:
    cur = start.resolve()
    for _ in range(max_up):
        if (cur / "data" / "processed" / "project_config.json").exists():
            return cur
        cur = cur.parent
    raise FileNotFoundError("data/processed/project_config.json was not found. Run Notebook 01 first.")

PROJECT_ROOT = find_project_root(Path.cwd())
PROCESSED_DIR = (PROJECT_ROOT / "data" / "processed").resolve()

PROJECT_CONFIG_PATH = (PROCESSED_DIR / "project_config.json").resolve()
LABELMAP_PATH = (PROCESSED_DIR / "labelmap.json").resolve()

with open(PROJECT_CONFIG_PATH, "r", encoding="utf-8") as f:
    project_config = json.load(f)

with open(LABELMAP_PATH, "r", encoding="utf-8") as f:
    labelmap = json.load(f)

COCO_ROOT = Path(project_config["coco_root"])
TRAIN_DIR = Path(project_config["train_dir"])
VAL_DIR = Path(project_config["val_dir"])
TRAIN_ANN = Path(project_config["train_ann"])
VAL_ANN = Path(project_config["val_ann"])

TARGET_CLASSES = project_config["target_classes"]
target_cat_ids = labelmap["target_category_ids"]

print("Same content as shown earlier, just to confirm the folders")
print("PROJECT_ROOT:", PROJECT_ROOT)
print("COCO_ROOT:", COCO_ROOT)
print("TRAIN_ANN:", TRAIN_ANN)
print("VAL_ANN  :", VAL_ANN)
print("TARGET_CLASSES:", TARGET_CLASSES)
print("target_cat_ids:", target_cat_ids)


Same content as shown earlier, just to confirm the folders
PROJECT_ROOT: C:\Users\Johnny\Desktop\IA-final
COCO_ROOT: ..\data\archive\coco2017
TRAIN_ANN: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_train2017.json
VAL_ANN  : C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_val2017.json
TARGET_CLASSES: ['person', 'car', 'airplane']
target_cat_ids: [1, 3, 5]
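
The output above shows `COCO_ROOT` as a relative path (`..\data\archive\coco2017`), so that one value still depends on the working directory. One possible way to anchor such values is sketched below. It assumes that relative entries in `project_config.json` are meant to be read relative to some known base directory; the helper name `resolve_from` and the base directory used here are hypothetical and do not come from Notebook 01.

```python
from pathlib import Path


def resolve_from(base_dir: Path, value: str) -> Path:
    """Return an absolute path; relative values are anchored at base_dir.

    Hypothetical helper, not part of the original notebooks.
    """
    p = Path(value)
    if p.is_absolute():
        return p
    return (base_dir / p).resolve()


# Assumed base: a "notebooks" folder under the current directory.
base = Path.cwd() / "notebooks"
coco_root = resolve_from(base, "../data/archive/coco2017")

print(coco_root.is_absolute())
```

Absolute values pass through unchanged, so the helper could be applied uniformly to every path read from the config.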


In [None]:
"""
This cell checks that the previously defined directories and files exist, so path problems surface now rather than during training.
It is practically the same check as in the first notebook and is repeated purely as a precaution.
"""

def assert_exists(path: Path, label: str) -> None:
    if not path.exists():
        raise FileNotFoundError(f"Required resource not found: {label}: {path}")
    print(f"OK: {label}: {path}")

assert_exists(COCO_ROOT, "coco_root")
assert_exists(TRAIN_DIR, "train_dir")
assert_exists(VAL_DIR, "val_dir")
assert_exists(TRAIN_ANN, "train_ann")
assert_exists(VAL_ANN, "val_ann")
assert_exists(PROCESSED_DIR, "processed_dir")


OK: coco_root: ..\data\archive\coco2017
OK: train_dir: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\train2017
OK: val_dir: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\val2017
OK: train_ann: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_train2017.json
OK: val_ann: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_val2017.json
OK: processed_dir: C:\Users\Johnny\Desktop\IA-final\data\processed


## 2. Dataset reduction for training

In [None]:
"""
This cell:
- Explicitly defines how many images will be used for base training.
- Sets the values that control the total training time on CPU.
- Fixes the limits that later cells enforce to prevent excessively long training runs.
"""

# Recommended for CPU training and to keep the project usable
MAX_TRAIN_IMAGES = 3300
MAX_VAL_IMAGES = 500

print("MAX_TRAIN_IMAGES:", MAX_TRAIN_IMAGES)
print("MAX_VAL_IMAGES  :", MAX_VAL_IMAGES)


MAX_TRAIN_IMAGES: 3300
MAX_VAL_IMAGES  : 500


In [None]:
"""
Here we load the original COCO annotation files for the train and validation splits (`TRAIN_ANN` and `VAL_ANN`).
The original image and annotation counts are displayed as a reference.
"""

def load_coco_json(path: Path) -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

train_coco = load_coco_json(TRAIN_ANN)
val_coco = load_coco_json(VAL_ANN)

print("Train original:")
print(" images:", len(train_coco["images"]))
print(" annotations:", len(train_coco["annotations"]))
print("Val original:")
print(" images:", len(val_coco["images"]))
print(" annotations:", len(val_coco["annotations"]))
"""
This step filters COCO to keep only annotations of the target classes, and keeps only images that have at least one target annotation.
"""

def filter_coco_to_targets(coco: dict, target_cat_ids: list[int]) -> tuple[dict, dict]:
    images = coco["images"]
    annotations = coco["annotations"]
    categories = coco["categories"]

    cat_id_set = set(int(x) for x in target_cat_ids)

    reduced_categories = [c for c in categories if c["id"] in cat_id_set]
    filtered_anns = [a for a in annotations if a["category_id"] in cat_id_set]

    img_id_to_anns = defaultdict(list)
    for a in filtered_anns:
        img_id_to_anns[a["image_id"]].append(a)

    reduced_images = [img for img in images if img["id"] in img_id_to_anns]

    valid_img_ids = {img["id"] for img in reduced_images}
    filtered_anns = [a for a in filtered_anns if a["image_id"] in valid_img_ids]

    reduced = {
        "info": coco.get("info", {}),
        "licenses": coco.get("licenses", []),
        "images": reduced_images,
        "annotations": filtered_anns,
        "categories": reduced_categories,
    }

    stats = {
        "num_images_original": len(images),
        "num_annotations_original": len(annotations),
        "num_images_reduced": len(reduced_images),
        "num_annotations_reduced": len(filtered_anns),
    }

    return reduced, stats

reduced_train, stats_train = filter_coco_to_targets(train_coco, target_cat_ids)
reduced_val, stats_val = filter_coco_to_targets(val_coco, target_cat_ids)

print("Filtered train stats:", stats_train)
print("Filtered val stats  :", stats_val)

if stats_train["num_images_reduced"] == 0:
    raise RuntimeError("The filtered train split is empty. Check TARGET_CLASSES/labelmap.")
if stats_val["num_images_reduced"] == 0:
    raise RuntimeError("The filtered val split is empty. Check TARGET_CLASSES/labelmap.")


Train original:
 images: 118287
 annotations: 860001
Val original:
 images: 5000
 annotations: 36781
Filtered train stats: {'num_images_original': 118287, 'num_annotations_original': 860001, 'num_images_reduced': 69622, 'num_annotations_reduced': 311467}
Filtered val stats  : {'num_images_original': 5000, 'num_annotations_original': 36781, 'num_images_reduced': 2929, 'num_annotations_reduced': 13079}
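
To make the filtering rule concrete, here is a small self-contained sketch on a made-up three-image COCO-style dictionary. The toy data and the name `keep_targets` are illustrative only. The category IDs follow the COCO convention used above (1 = person, 3 = car, 5 = airplane); 18 stands in for a non-target class.

```python
from collections import Counter

# Made-up data for illustration; not taken from the real annotation files.
toy = {
    "images": [{"id": 10}, {"id": 11}, {"id": 12}],
    "annotations": [
        {"id": 1, "image_id": 10, "category_id": 1},   # person: kept
        {"id": 2, "image_id": 10, "category_id": 18},  # non-target: dropped
        {"id": 3, "image_id": 11, "category_id": 18},  # only annotation of image 11
        {"id": 4, "image_id": 12, "category_id": 3},   # car: kept
        {"id": 5, "image_id": 12, "category_id": 3},   # car: kept
    ],
}


def keep_targets(coco: dict, target_ids) -> tuple[list, list]:
    """Keep target annotations, then only the images that still have one."""
    wanted = set(target_ids)
    anns = [a for a in coco["annotations"] if a["category_id"] in wanted]
    used = {a["image_id"] for a in anns}
    images = [img for img in coco["images"] if img["id"] in used]
    return images, anns


images, anns = keep_targets(toy, [1, 3, 5])
per_class = Counter(a["category_id"] for a in anns)

print([img["id"] for img in images])  # [10, 12]
print(dict(per_class))                # {1: 1, 3: 2}
```

Image 11 disappears because its only annotation belongs to a non-target class, which is the same reason the real train split shrinks from 118287 to 69622 images.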


In [None]:
"""
This cell:
- Trims the filtered dataset to CPU-safe sizes by keeping the first `max_images` images.
- Updates the annotations so that only those belonging to the selected images remain.
- Ensures that the JSON later saved to disk is actually the reduced subset.
"""

def trim_coco(coco_reduced: dict, max_images: int) -> dict:
    images = coco_reduced["images"]
    annotations = coco_reduced["annotations"]

    images_trim = images[:max_images]
    valid_ids = {img["id"] for img in images_trim}

    anns_trim = [a for a in annotations if a["image_id"] in valid_ids]

    out = coco_reduced.copy()
    out["images"] = images_trim
    out["annotations"] = anns_trim
    return out

reduced_train = trim_coco(reduced_train, MAX_TRAIN_IMAGES)
reduced_val = trim_coco(reduced_val, MAX_VAL_IMAGES)

print("Train final:")
print(" images:", len(reduced_train["images"]))
print(" annotations:", len(reduced_train["annotations"]))

print("Val final:")
print(" images:", len(reduced_val["images"]))
print(" annotations:", len(reduced_val["annotations"]))


Train final:
 images: 3300
 annotations: 14546
Val final:
 images: 500
 annotations: 2218
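
`trim_coco` keeps the first `max_images` entries, so the subset follows the ordering of the annotation file. If a shuffled subset is preferred, one option is a seeded random sample. The sketch below assumes that reproducibility through a fixed seed is wanted. The function `sample_coco`, the seed value, and the toy data are hypothetical and not part of this notebook.

```python
import random


def sample_coco(coco: dict, max_images: int, seed: int = 42) -> dict:
    """Return a shallow copy of coco with a reproducible random image subset."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    images = coco["images"]
    k = min(max_images, len(images))  # never ask for more than is available
    chosen = rng.sample(images, k)
    chosen.sort(key=lambda img: img["id"])  # stable ordering on disk
    keep = {img["id"] for img in chosen}

    out = dict(coco)
    out["images"] = chosen
    out["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep]
    return out


# Made-up data: ten images with one annotation each.
toy = {
    "images": [{"id": i} for i in range(10)],
    "annotations": [
        {"id": 100 + i, "image_id": i, "category_id": 1} for i in range(10)
    ],
}

subset = sample_coco(toy, 4)
print(len(subset["images"]), len(subset["annotations"]))  # 4 4
```

Calling the function again with the same seed returns the same subset, so the saved JSON stays stable across reruns.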


## 3. Saving .json files for training with the reduced dataset

In [None]:
"""
Once the dataset size to be used has been defined, the reduced .json files are saved in `data/processed`.
The cell then checks that both files actually exist on disk at the end of the process.
"""


OUT_TRAIN_JSON = (PROCESSED_DIR / "coco_person_car_airplane_train.json").resolve()
OUT_VAL_JSON = (PROCESSED_DIR / "coco_person_car_airplane_val.json").resolve()

with open(OUT_TRAIN_JSON, "w", encoding="utf-8") as f:
    json.dump(reduced_train, f, indent=2, ensure_ascii=False)

with open(OUT_VAL_JSON, "w", encoding="utf-8") as f:
    json.dump(reduced_val, f, indent=2, ensure_ascii=False)

if not OUT_TRAIN_JSON.exists():
    raise FileNotFoundError(f"File was not created: {OUT_TRAIN_JSON}")
if not OUT_VAL_JSON.exists():
    raise FileNotFoundError(f"File was not created: {OUT_VAL_JSON}")

print("Saved:", OUT_TRAIN_JSON)
print("Saved:", OUT_VAL_JSON)


Saved: C:\Users\Johnny\Desktop\IA-final\data\processed\coco_person_car_airplane_train.json
Saved: C:\Users\Johnny\Desktop\IA-final\data\processed\coco_person_car_airplane_val.json


In [None]:
"""
This section:
- Builds fast indices for each split:
  - List of images with metadata
  - Annotations grouped by image_id
  - Per-class counts
- Saves the indices in `data/processed`.
"""

def build_index(coco_reduced: dict, target_cat_ids: list[int]) -> dict:
    images = coco_reduced["images"]
    annotations = coco_reduced["annotations"]

    img_id_to_anns = defaultdict(list)
    class_counter = Counter()

    for a in annotations:
        img_id = a["image_id"]
        img_id_to_anns[img_id].append({
            "id": a["id"],
            "bbox": a["bbox"],
            "category_id": a["category_id"],
            "iscrowd": a.get("iscrowd", 0),
            "area": a.get("area", None),
        })
        class_counter[a["category_id"]] += 1

    index_images = [{
        "id": img["id"],
        "file_name": img["file_name"],
        "width": img.get("width"),
        "height": img.get("height"),
        "num_anns": len(img_id_to_anns.get(img["id"], [])),
    } for img in images]

    return {
        "target_classes": TARGET_CLASSES,
        "target_category_ids": target_cat_ids,
        "num_images": len(index_images),
        "num_annotations": len(annotations),
        "class_counts_by_category_id": {str(k): int(v) for k, v in class_counter.items()},
        "images": index_images,
        "annotations_by_image_id": {str(k): v for k, v in img_id_to_anns.items()},
    }

train_index = build_index(reduced_train, target_cat_ids)
val_index = build_index(reduced_val, target_cat_ids)

OUT_TRAIN_INDEX = (PROCESSED_DIR / "index_train_person_car_airplane.json").resolve()
OUT_VAL_INDEX = (PROCESSED_DIR / "index_val_person_car_airplane.json").resolve()

with open(OUT_TRAIN_INDEX, "w", encoding="utf-8") as f:
    json.dump(train_index, f, indent=2, ensure_ascii=False)

with open(OUT_VAL_INDEX, "w", encoding="utf-8") as f:
    json.dump(val_index, f, indent=2, ensure_ascii=False)

print("Saved:", OUT_TRAIN_INDEX)
print("Saved:", OUT_VAL_INDEX)


Saved: C:\Users\Johnny\Desktop\IA-final\data\processed\index_train_person_car_airplane.json
Saved: C:\Users\Johnny\Desktop\IA-final\data\processed\index_val_person_car_airplane.json
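
Notebook 03 is not shown here, so the following is only a sketch of how an index with this layout might be read. Two details matter. The keys of `annotations_by_image_id` are strings, because `build_index` converts them with `str(k)`, so lookups must convert the image ID as well. COCO stores `bbox` as `[x, y, width, height]`, while many detection pipelines expect corner coordinates. The helper names and the sample values are made up for illustration.

```python
import json

# Made-up index entry with the same layout that build_index produces.
index = {
    "images": [
        {"id": 139, "file_name": "example.jpg", "width": 640, "height": 426, "num_anns": 1}
    ],
    "annotations_by_image_id": {
        "139": [
            {"id": 1, "bbox": [10.0, 20.0, 30.0, 40.0], "category_id": 1,
             "iscrowd": 0, "area": 1200.0}
        ]
    },
}
index = json.loads(json.dumps(index))  # simulate the save/load round trip


def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, width, height] box to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def boxes_for(index: dict, image_id: int) -> list:
    """Return (category_id, corner box) pairs for one image; empty if none."""
    anns = index["annotations_by_image_id"].get(str(image_id), [])
    return [(a["category_id"], xywh_to_xyxy(a["bbox"])) for a in anns]


print(boxes_for(index, 139))  # [(1, [10.0, 20.0, 40.0, 60.0])]
```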


In [None]:
"""
As a final step, this cell reloads the JSON files saved to disk and verifies their image counts.
"""

with open(OUT_TRAIN_JSON, "r", encoding="utf-8") as f:
    train_check = json.load(f)

with open(OUT_VAL_JSON, "r", encoding="utf-8") as f:
    val_check = json.load(f)

print("Verification from disk:")
print("TRAIN images:", len(train_check["images"]))
print("VAL images  :", len(val_check["images"]))

if len(train_check["images"]) != MAX_TRAIN_IMAGES:
    raise RuntimeError(
        "The saved TRAIN_JSON does not match MAX_TRAIN_IMAGES. "
        "Do not continue with Notebook 03 until this is fixed."
    )

if len(val_check["images"]) != MAX_VAL_IMAGES:
    raise RuntimeError(
        "The saved VAL_JSON does not match MAX_VAL_IMAGES. "
        "Do not continue with Notebook 03 until this is fixed."
    )

print("Preprocessing finished successfully. Notebook 03 can now train on CPU.")


Verification from disk:
TRAIN images: 3300
VAL images  : 500
Preprocessing finished successfully. Notebook 03 can now train on CPU.
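
The count check above compares against the caps exactly, so it assumes each filtered split holds at least that many images. A slightly broader sanity check could also confirm that every annotation points to a saved image and to a target class. The sketch below uses a hypothetical helper, `check_coco`, and made-up data; it is not part of the original pipeline.

```python
def check_coco(coco: dict, max_images: int, target_ids) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    wanted = set(target_ids)
    image_ids = [img["id"] for img in coco["images"]]
    unique_ids = set(image_ids)

    if len(image_ids) > max_images:
        problems.append(f"{len(image_ids)} images exceed the cap of {max_images}")
    if len(unique_ids) != len(image_ids):
        problems.append("duplicate image ids")
    if any(a["image_id"] not in unique_ids for a in coco["annotations"]):
        problems.append("annotation points to a missing image")
    if any(a["category_id"] not in wanted for a in coco["annotations"]):
        problems.append("annotation has a non-target category")
    return problems


# Made-up examples: one consistent split and one with three defects.
good = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"id": 7, "image_id": 1, "category_id": 3}],
}
bad = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"id": 8, "image_id": 9, "category_id": 18}],
}

print(check_coco(good, 2, [1, 3, 5]))  # []
print(check_coco(bad, 1, [1, 3, 5]))
```

Returning the problems as a list, rather than raising on the first one, lets a single run report every inconsistency before Notebook 03 is started.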
