# 01 Data Preparation

Load and clean the OSM-derived grid data, then package it for remote runs.


In [None]:
from pathlib import Path
import sys

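# `__file__` is not defined inside a notebook kernel. This assumes the notebook
# is run from the `notebooks/` directory, so the repo root is the parent of the cwd.
repo_root = Path.cwd().resolve().parent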
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))
print(f"Using src path: {src_path}")


## Data sources and parsing
- **OSM prebuilt electricity network** (`data/raw/OSM Prebuilt Electricity Network/`): buses, lines, links, converters, transformers.
- **Custom CSV parsing**: `prepare_osm_source` uses a geometry-safe loader (handles commas inside WKT) to keep column counts correct.
- **Endpoint extraction**: First/last coordinates are pulled from WKT to map line/link endpoints to buses (tolerance 1e-5 degrees).
- **Country filter**: Defaults to DE/FR/PL/AT/IT; adjust via `countries` if needed.
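
The CSV parsing and endpoint extraction described above can be sketched in a few lines. This is not the code inside `prepare_osm_source`; it is a minimal illustration that assumes the WKT geometry is the last column of each row and is unquoted, so its commas would otherwise be read as field separators. The helper names `split_row` and `wkt_endpoints` and the sample row are hypothetical.

```python
import re

# Matches one "lon lat" pair in plain decimal notation (no exponents).
_COORD = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def split_row(line: str, n_columns: int) -> list[str]:
    """Split a CSV row into n_columns fields.

    Assumption: the geometry is the final column, so limiting the number of
    splits keeps the commas inside the WKT string in a single field.
    """
    return line.rstrip("\n").split(",", n_columns - 1)


def wkt_endpoints(wkt: str) -> tuple[tuple[float, float], tuple[float, float]]:
    """Return the first and last (lon, lat) pairs of a WKT LINESTRING."""
    coords = [(float(x), float(y)) for x, y in _COORD.findall(wkt)]
    if len(coords) < 2:
        raise ValueError(f"Expected at least two coordinates in: {wkt!r}")
    return coords[0], coords[-1]


# Hypothetical row: id, voltage, geometry
row = "way/1-220,220,LINESTRING (13.40 52.52, 13.45 52.50, 13.50 52.48)"
fields = split_row(row, n_columns=3)
start, end = wkt_endpoints(fields[2])
print(fields[:2], start, end)
```

A plain `line.split(",")` on the same row would yield five fields instead of three, which is the column-count problem the geometry-safe loader avoids.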


In [None]:
from pypsa_simplified import prepare_osm_source

osm_dir = repo_root / "data" / "raw" / "OSM Prebuilt Electricity Network"
sources = prepare_osm_source(osm_dir)
print({k: v.shape if hasattr(v, 'shape') else v for k, v in sources.items()})


## Bus matching and tolerance
Bus IDs are matched to line/link endpoints by coordinate proximity (`tol=1e-5` degrees, roughly 1 m of latitude). Known problematic buses (e.g., `way/61038773-220`) are dropped by default; edit `drop_buses` if you need to keep them.
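
The sketch below illustrates tolerance-based matching. The exact distance rule used by `prepare_osm_source` is not documented here, so the per-axis comparison, the helper `match_bus`, and the sample bus coordinates are assumptions made for illustration only.

```python
from typing import Optional


def match_bus(
    point: tuple[float, float],
    buses: dict[str, tuple[float, float]],
    tol: float = 1e-5,
    drop_buses: frozenset[str] = frozenset(),
) -> Optional[str]:
    """Return the ID of the closest bus within `tol` degrees of `point`.

    Assumption: a bus matches when both its longitude and latitude differ
    from the endpoint by at most `tol`. Buses in `drop_buses` are skipped.
    """
    best_id, best_dist = None, None
    for bus_id, (lon, lat) in buses.items():
        if bus_id in drop_buses:
            continue
        dx, dy = abs(lon - point[0]), abs(lat - point[1])
        if dx <= tol and dy <= tol:
            dist = dx * dx + dy * dy
            if best_dist is None or dist < best_dist:
                best_id, best_dist = bus_id, dist
    return best_id


# Hypothetical buses keyed by ID, with (lon, lat) in degrees.
buses = {"bus_a": (13.40, 52.52), "bus_b": (13.50, 52.48)}

near = match_bus((13.400004, 52.52), buses)            # within tolerance of bus_a
far = match_bus((13.41, 52.52), buses)                 # no bus within 1e-5 degrees
dropped = match_bus((13.40, 52.52), buses, drop_buses=frozenset({"bus_a"}))
print(near, far, dropped)
```

Endpoints that return `None` are the ones to inspect when a line or link ends up without a bus; widening `tol` trades fewer unmatched endpoints against a higher risk of attaching to the wrong bus.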


## Serialize for remote processing
The serialized artifact is compact (gzip + pickle) and ready to `scp` to the server.


In [None]:
from pypsa_simplified import serialize_network_source

out_path = repo_root / "data" / "processed" / "network_source.pkl.gz"
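# Make sure the target directory exists before writing the artifact.
out_path.parent.mkdir(parents=True, exist_ok=True)
serialize_network_source(out_path, sources)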
print(f"Saved serialized source to {out_path}")
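
For reference, a gzip + pickle round trip looks roughly like the sketch below. The internals of `serialize_network_source` are not shown in this notebook, so this assumes a plain pickle stream wrapped in gzip; `save_source`, `load_source`, and the demo payload are hypothetical stand-ins.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def save_source(path: Path, obj) -> None:
    """Write `obj` as a gzip-compressed pickle, creating parent folders."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_source(path: Path):
    """Read back an object written by `save_source`."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip in a temporary directory with a small stand-in payload.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "processed" / "network_source.pkl.gz"
    demo = {"countries": ["DE", "FR", "PL", "AT", "IT"], "tol": 1e-5}
    save_source(demo_path, demo)
    restored = load_source(demo_path)

print(restored == demo)
```

Unpickling executes code embedded in the file, so only load artifacts that you produced yourself or that come from a trusted source.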


### Next
- Use `notebooks/main.ipynb` to transfer the artifact and trigger the remote optimization.
- For custom country lists or tolerance, pass `countries`/`tol` to `prepare_osm_source`.
