# 01_data_ingestion_and_preprocessing.ipynb

**Project**: Lumbar Spine Degenerative Classification  
**Description**: This notebook runs the data ingestion step (downloading the competition data from Kaggle) and the preprocessing step (generating tensor data).  

---

## Table of Contents
1. [Environment and Imports](#section1)  
2. [Configuration Loading](#section2)  
3. [Data Download](#section3)  
4. [Data Preprocessing](#section4)  
5. [Extended Preprocessing](#section5)

---


<a id="section1"></a>
## 1. Environment and Imports

In [1]:
import os
import sys
import glob
import torch

# Change working directory to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

# Make the custom modules importable. The imports below use the `src.` prefix,
# so the project root must be on sys.path; `src` itself is kept as well.
for module_path in (os.getcwd(), os.path.abspath("src")):
    if module_path not in sys.path:
        sys.path.append(module_path)

from src.data.ingest_data import load_config, authenticate_kaggle, download_and_extract
from src.data.preprocess import DataPreprocessor

print("Environment diagnostics:")
print(f"  Working directory : {os.getcwd()}")
print(f"  Python executable : {sys.executable}")

Environment diagnostics:
  Working directory : /home/jkskw/git/ml_lumbar_mri
  Python executable : /home/jkskw/git/ml_lumbar_mri/venv/bin/python


<a id="section2"></a>
## 2. Configuration Loading

In [2]:
CONFIG_PATH = "config.yml"

config = load_config(CONFIG_PATH)
print("Configuration loaded successfully.")
print("Project:", config["project"]["name"])
print("Description:", config["project"]["description"])
print("Random Seed:", config["project"]["seed"])

Configuration loaded successfully.
Project: Lumbar Spine Degenerative Classification
Description: Automated evaluation of degenerative lumbar spine changes from MRI images using deep learning.
Random Seed: 42
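
Because later cells index directly into the configuration, a missing key only surfaces as a `KeyError` at the point of use. The sketch below checks up front for the keys this notebook reads. It assumes `load_config` returns a nested `dict`, which the indexing above suggests but the notebook does not show. `REQUIRED_KEYS`, `missing_keys`, and `example_config` are hypothetical names introduced for illustration.

```python
# Dotted key paths this notebook reads from the config (taken from its cells).
REQUIRED_KEYS = [
    "project.name",
    "project.description",
    "project.seed",
    "kaggle.competition",
    "kaggle.download_path",
    "data.raw_path",
]


def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return the dotted key paths from `required` that are absent from `cfg`."""
    missing = []
    for dotted in required:
        node = cfg
        for part in dotted.split("."):
            # Stop at the first level that is not a dict or lacks the key.
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Illustrative stand-in for the parsed config.yml; all values are placeholders.
example_config = {
    "project": {"name": "demo", "description": "demo", "seed": 42},
    "kaggle": {"competition": "demo-competition"},
    "data": {"raw_path": "./data/raw"},
}
print(missing_keys(example_config))  # ['kaggle.download_path']
```

In the notebook, calling `missing_keys(config)` right after `load_config` and raising if the result is non-empty would be one way to fail early.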


<a id="section3"></a>
## 3. Data Download

This step authenticates with Kaggle, downloads the competition files, and extracts them into the `raw` data folder set by `data.raw_path` in the configuration. If that folder already exists, the download is skipped.

In [3]:
COMPETITION   = config["kaggle"]["competition"]
DOWNLOAD_PATH = config["kaggle"]["download_path"]
RAW_PATH      = config["data"]["raw_path"]

if os.path.exists(RAW_PATH):
    print(f"Data already exists at '{RAW_PATH}', skipping download.")
else:
    authenticate_kaggle()
    download_and_extract(
        competition=COMPETITION,
        download_dir=DOWNLOAD_PATH,
        extract_dir=RAW_PATH
    )
    print("Kaggle data downloaded and extracted.")

Data already exists at './data/raw', skipping download.
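
The download cell skips the download whenever `RAW_PATH` exists, so an empty folder left behind by an interrupted run would also be treated as complete. A minimal sketch of a stricter check follows. `raw_data_present` is a hypothetical helper, not part of `src.data.ingest_data`, and it only verifies that the directory has at least one entry, not that the extraction finished.

```python
import os


def raw_data_present(path):
    """Return True only if `path` is a directory containing at least one entry."""
    if not os.path.isdir(path):
        return False
    # scandir is lazy, so this stops at the first entry instead of listing all files.
    with os.scandir(path) as entries:
        return any(True for _ in entries)
```

Under these assumptions, `if raw_data_present(RAW_PATH):` could stand in for `if os.path.exists(RAW_PATH):` in the cell above.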


<a id="section4"></a>
## 4. Data Preprocessing

This step uses the `DataPreprocessor` class from `src/data/preprocess.py` to build merged CSV files and generate tensor volumes for modeling. The class is constructed from the same `config.yml`, and its outputs are saved to the location defined there.

In [None]:
preprocessor = DataPreprocessor(CONFIG_PATH)
preprocessor.process()
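
The notebook does not show where the outputs land or what file formats they use, so a quick sanity check after `process()` has to stay generic. The sketch below counts files under a directory by extension. `summarize_outputs` is a hypothetical helper, and the directory passed to it must be read from the configuration by hand, since the relevant key is not visible here.

```python
import os
from collections import Counter


def summarize_outputs(root):
    """Count files under `root` (recursively), grouped by lowercase extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "<no extension>"
            counts[ext] += 1
    return dict(counts)
```

An empty result means the directory is missing or contains no files, which would suggest that preprocessing wrote its outputs somewhere else or did not run to completion.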

<a id="section5"></a>
## 5. Extended Preprocessing
