# First Notebook

In this first notebook, key concepts for the rest of the project are defined, including the selected dataset, the main directory paths used for communication between notebooks, and the preprocessing structure.

The dataset used in this project requires generating reduced `.json` annotation files for efficient processing and CPU-based training.

The base dataset selected is **COCO 2017**, which can be publicly downloaded from the following link:

https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset

For this project, the dataset is assumed to be already downloaded locally; therefore, it will not be automatically downloaded within this repository.

To reproduce the results, you must manually download the dataset from the link above and place it in the following directory:

IA-final/data/archive/coco2017


## Objectives

1. Verify the structure of the COCO2017 dataset located at:

   data/archive/coco2017

2. Confirm that all required files and folders exist and are properly organized.

3. Load the annotation files:
   - instances_train2017.json
   - instances_val2017.json

4. Define three target classes to enable efficient CPU-based training.

5. Generate and save a `labelmap.json` file containing the selected target classes for use in subsequent notebooks.


In [None]:
"""
Import the standard libraries used throughout the notebook.
"""

import json
import os
from pathlib import Path
from collections import Counter


## 1. Creation and Validation of Essential Project Directories

This section defines and validates the essential project directories required for the workflow. It ensures that all necessary folders exist and are properly structured before proceeding with dataset processing and model training.


In [None]:
"""

This cell:
- Defines the COCO2017 dataset paths once, in a centralized place.
- Sets fixed target classes: person, car, airplane. These classes were selected because they have a large number of instances in the dataset.
- Creates the data/processed directory if it does not exist.
- Resolves the dataset and output paths to absolute paths to avoid conflicts between notebooks (PROJECT_ROOT and COCO_ROOT themselves stay relative to the notebook directory).
"""
PROJECT_ROOT = Path("..")

COCO_ROOT = PROJECT_ROOT / "data" / "archive" / "coco2017"

TRAIN_DIR = (COCO_ROOT / "train2017").resolve()
VAL_DIR = (COCO_ROOT / "val2017").resolve()
ANN_DIR = (COCO_ROOT / "annotations").resolve()

TRAIN_ANN = (ANN_DIR / "instances_train2017.json").resolve()
VAL_ANN = (ANN_DIR / "instances_val2017.json").resolve()

PROCESSED_DIR = (PROJECT_ROOT / "data" / "processed").resolve()
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

PROJECT_CONFIG_PATH = (PROCESSED_DIR / "project_config.json").resolve()
LABELMAP_PATH = (PROCESSED_DIR / "labelmap.json").resolve()

TARGET_CLASSES = ["person", "car", "airplane"]

print("After downloading the dataset, confirm that all of the following files and directories exist:")
print("TRAIN_DIR:", TRAIN_DIR)
print("VAL_DIR  :", VAL_DIR)
print("ANN_DIR  :", ANN_DIR)
print("TRAIN_ANN:", TRAIN_ANN)
print("VAL_ANN  :", VAL_ANN)
print("PROCESSED_DIR:", PROCESSED_DIR)
print("Selected classes:", TARGET_CLASSES)


After downloading the dataset, confirm that all of the following files and directories exist:
TRAIN_DIR: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\train2017
VAL_DIR  : C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\val2017
ANN_DIR  : C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations
TRAIN_ANN: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_train2017.json
VAL_ANN  : C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_val2017.json
PROCESSED_DIR: C:\Users\Johnny\Desktop\IA-final\data\processed
Selected classes: ['person', 'car', 'airplane']


In [None]:
"""
This cell performs a simple validation of the previously defined directories and files to prevent path conflicts later on.
To avoid such issues, all paths are declared in this first notebook.
"""

def assert_exists(path: Path, label: str) -> None:
    if not path.exists():
        raise FileNotFoundError(f"Required resource not found: {label}: {path}")
    print(f"OK: {label}: {path}")

assert_exists(COCO_ROOT, "COCO_ROOT")
assert_exists(TRAIN_DIR, "train2017")
assert_exists(VAL_DIR, "val2017")
assert_exists(ANN_DIR, "annotations")
assert_exists(TRAIN_ANN, "instances_train2017.json")
assert_exists(VAL_ANN, "instances_val2017.json")


OK: COCO_ROOT: ..\data\archive\coco2017
OK: train2017: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\train2017
OK: val2017: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\val2017
OK: annotations: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations
OK: instances_train2017.json: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_train2017.json
OK: instances_val2017.json: C:\Users\Johnny\Desktop\IA-final\data\archive\coco2017\annotations\instances_val2017.json
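
As a variation on the check above, the sketch below gathers every missing path before raising, so a partially extracted download is reported in a single error instead of one failure at a time. The helper name `find_missing` is hypothetical and is not part of this notebook.

```python
from pathlib import Path


def find_missing(required: dict[str, Path]) -> list[str]:
    """Return the labels of all required paths that do not exist."""
    return [label for label, path in required.items() if not path.exists()]


# Hypothetical usage with the paths defined earlier in this notebook:
# missing = find_missing({"train2017": TRAIN_DIR, "val2017": VAL_DIR})
# if missing:
#     raise FileNotFoundError("Missing required resources: " + ", ".join(missing))
```

Failing on the first missing path, as `assert_exists` does, is simpler; collecting all of them is mainly useful when several pieces of the dataset may be absent at once.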


## 2. Dataset Description and JSON Creation


In [None]:
"""
Dataset description and JSON creation:
- Loads the training annotations JSON file, which is required for dataset optimization.
- Extracts categories and creates id <-> name mappings.
- Verifies that the target classes exist in the COCO dataset.
"""

with open(TRAIN_ANN, "r", encoding="utf-8") as f:
    train_data = json.load(f)

categories = train_data["categories"]
images = train_data["images"]
annotations = train_data["annotations"]

cat_id_to_name = {c["id"]: c["name"] for c in categories}
cat_name_to_id = {c["name"]: c["id"] for c in categories}

missing = [c for c in TARGET_CLASSES if c not in cat_name_to_id]
if missing:
    raise ValueError(
        "These target classes do not exist in the COCO categories: " + ", ".join(missing)
    )

target_cat_ids = [cat_name_to_id[c] for c in TARGET_CLASSES]

print("Train JSON loaded")
print("Categories :", len(categories))
print("Images     :", len(images))
print("Annotations:", len(annotations))
print("TARGET_CLASSES:", TARGET_CLASSES)
print("target_cat_ids:", target_cat_ids)


Train JSON loaded
Categories : 80
Images     : 118287
Annotations: 860001
TARGET_CLASSES: ['person', 'car', 'airplane']
target_cat_ids: [1, 3, 5]
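
The reduced annotation files mentioned in the introduction are not built in this notebook. As a rough illustration of how the category ids above could drive that reduction, here is a minimal sketch that assumes the standard COCO layout with `images`, `annotations`, and `categories` keys. The function name `filter_coco` and the toy data are hypothetical, and other top-level keys (such as `info` or `licenses`) are simply dropped in this version.

```python
def filter_coco(data: dict, keep_ids: list[int]) -> dict:
    """Return a reduced COCO-style dict containing only the given category ids."""
    keep = set(keep_ids)
    anns = [a for a in data["annotations"] if a["category_id"] in keep]
    # Keep only images that still have at least one annotation.
    image_ids = {a["image_id"] for a in anns}
    return {
        "images": [im for im in data["images"] if im["id"] in image_ids],
        "annotations": anns,
        "categories": [c for c in data["categories"] if c["id"] in keep],
    }


# Tiny synthetic example (not real COCO data).
toy = {
    "images": [{"id": 10}, {"id": 11}],
    "annotations": [
        {"id": 1, "image_id": 10, "category_id": 1},
        {"id": 2, "image_id": 11, "category_id": 2},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "bicycle"}],
}
reduced = filter_coco(toy, [1])
print(len(reduced["images"]), len(reduced["annotations"]))  # prints: 1 1
```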


In [None]:
"""
- Counts how many annotations exist in the training set for each target class.
- Serves as a sanity check to ensure there is sufficient data. It is important to note that, due to resource constraints, only a fraction of the dataset was used for training.
"""

# Build the lookup set once instead of on every loop iteration.
target_id_set = set(target_cat_ids)

counter = Counter()
for ann in annotations:
    cid = ann["category_id"]
    if cid in target_id_set:
        counter[cid] += 1

print("Annotations per target class (train):")
for cls_name, cid in zip(TARGET_CLASSES, target_cat_ids):
    print(f"{cls_name}: {counter[cid]}")


Annotations per target class (train):
person: 262465
car: 43867
airplane: 5135
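
The counts above are strongly imbalanced. A quick way to quantify this is to compute each class's share of the target annotations, as in the sketch below. It hard-codes the three counts printed above, and the variable names are illustrative only.

```python
# Annotation counts copied from the output above.
counts = {"person": 262465, "car": 43867, "airplane": 5135}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these numbers, `person` accounts for roughly 84% of the target annotations, which is worth keeping in mind when sampling the fraction of the dataset used for training.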


In [None]:
"""
Here we start creating reusable code for the rest of the notebooks.
- Saves a single project configuration file. All paths are absolute except project_root and coco_root, which stay relative to the notebook directory.
- The idea is that all notebooks read this file and do not redefine paths manually.
"""

# as_posix() already returns a str with forward slashes, so no extra str() is needed.
project_config = {
    "project_root": PROJECT_ROOT.as_posix(),
    "coco_root": COCO_ROOT.as_posix(),
    "train_dir": TRAIN_DIR.as_posix(),
    "val_dir": VAL_DIR.as_posix(),
    "ann_dir": ANN_DIR.as_posix(),
    "train_ann": TRAIN_ANN.as_posix(),
    "val_ann": VAL_ANN.as_posix(),
    "processed_dir": PROCESSED_DIR.as_posix(),
    "target_classes": TARGET_CLASSES,
}

with open(PROJECT_CONFIG_PATH, "w", encoding="utf-8") as f:
    json.dump(project_config, f, indent=2, ensure_ascii=False)

print("Saved:", PROJECT_CONFIG_PATH)


Saved: C:\Users\Johnny\Desktop\IA-final\data\processed\project_config.json
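
To illustrate how a later notebook might consume this file instead of redefining paths, here is a minimal sketch. The helper `load_project_config` is hypothetical, and the sketch assumes the later notebook sits in the same folder as this one, so the relative location of `project_config.json` is unchanged.

```python
import json
from pathlib import Path


def load_project_config(config_path: Path) -> dict:
    """Read project_config.json and convert string entries back to Path objects."""
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = json.load(f)
    # In this notebook's config, every string value is a path;
    # target_classes is a list and is returned unchanged.
    return {k: (Path(v) if isinstance(v, str) else v) for k, v in cfg.items()}


# Hypothetical usage from a sibling notebook:
# cfg = load_project_config(Path("..") / "data" / "processed" / "project_config.json")
# train_ann = cfg["train_ann"]
```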


In [None]:
"""
This cell saves a minimal and stable label map for the project. It includes:
- dataset
- target_classes
- name_to_id and id_to_name, only for the target classes (id_to_name uses string keys, since JSON object keys must be strings)
- target_category_ids
"""

labelmap = {
    "dataset": "COCO2017",
    "target_classes": TARGET_CLASSES,
    "target_category_ids": target_cat_ids,
    "name_to_id": {name: cat_name_to_id[name] for name in TARGET_CLASSES},
    "id_to_name": {str(cat_name_to_id[name]): name for name in TARGET_CLASSES},
}

with open(LABELMAP_PATH, "w", encoding="utf-8") as f:
    json.dump(labelmap, f, indent=2, ensure_ascii=False)

print("Saved:", LABELMAP_PATH)
print(labelmap)


Saved: C:\Users\Johnny\Desktop\IA-final\data\processed\labelmap.json
{'dataset': 'COCO2017', 'target_classes': ['person', 'car', 'airplane'], 'target_category_ids': [1, 3, 5], 'name_to_id': {'person': 1, 'car': 3, 'airplane': 5}, 'id_to_name': {'1': 'person', '3': 'car', '5': 'airplane'}}
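
Note that the keys of `id_to_name` are strings in the output above. JSON object keys are always strings, so a notebook that reads `labelmap.json` and looks names up by integer `category_id` needs to convert them back. The short sketch below shows the round trip on a standalone copy of the mapping.

```python
import json

id_to_name = {1: "person", 3: "car", 5: "airplane"}

# json.dumps converts integer keys to strings, so they come back as str.
round_tripped = json.loads(json.dumps(id_to_name))
print(list(round_tripped))  # prints: ['1', '3', '5']

# Restore integer keys before looking up COCO category ids.
restored = {int(k): v for k, v in round_tripped.items()}
print(restored[3])  # prints: car
```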


In [None]:
"""
This cell:
- Reloads project_config.json and labelmap.json to ensure they were saved correctly.
- This prevents situations where the notebook appears to work, but nothing is actually saved.
"""

with open(PROJECT_CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg_check = json.load(f)

with open(LABELMAP_PATH, "r", encoding="utf-8") as f:
    lm_check = json.load(f)

print("project_config.json OK:", cfg_check["coco_root"])
print("labelmap.json OK:", lm_check["target_classes"])


project_config.json OK: ../data/archive/coco2017
labelmap.json OK: ['person', 'car', 'airplane']
