# 📘 02_cleaning.ipynb - Cleaning and Structuring Bible Verse Data

This notebook demonstrates how to run the cleaning script `cleaning.py` to:
- Normalize and validate raw Bible verse data.
- Add unique identifiers for each verse.
- Prepare the data for emotion and theme annotation.

## 🛠️ 1. Setup: Project File Structure
We assume your project has the following structure:

```
data/
├── raw/
│   └── bible_kjv/
│       ├── 1_genesis.csv
│       └── ...
└── processed/
    └── bible_kjv/
logs/
└── cleaning_logs/
```
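
If the output folders do not exist yet, a small helper like the one below can create them before the script runs. This is a minimal sketch: `prepare_dirs` is a hypothetical name, the paths are copied from the layout above, and `cleaning.py` may already create these folders itself.

```python
from pathlib import Path

# Paths copied from the layout above; the helper name is hypothetical.
RAW_DIR = "data/raw/bible_kjv"
OUTPUT_DIRS = ["data/processed/bible_kjv", "logs/cleaning_logs"]


def prepare_dirs(root="."):
    """Check that the raw data folder exists and create the output folders."""
    root = Path(root)
    raw = root / RAW_DIR
    if not raw.is_dir():
        raise FileNotFoundError(f"Raw data folder not found: {raw}")
    created = []
    for rel in OUTPUT_DIRS:
        path = root / rel
        # exist_ok=True makes the call safe to repeat
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `prepare_dirs()` from the project root returns the output folder paths, or raises `FileNotFoundError` if the raw data is missing.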

## 📥 2. Inspect Raw Data (Optional)

In [None]:
import pandas as pd

# Load an example raw file
df_raw = pd.read_csv("data/raw/bible_kjv/1_genesis.csv")
df_raw.head()
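
Beyond `head()`, a quick summary of missing values and duplicate rows can show what the cleaning step will have to deal with. The helper below is a sketch only; `summarize_raw` is a hypothetical name and it makes no assumptions about the column names in the raw files.

```python
import pandas as pd


def summarize_raw(df):
    """Return basic quality indicators for a raw DataFrame."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        # Count of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }
```

For the example file, `summarize_raw(df_raw)` would report these figures for Genesis.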

## 🚀 3. Run the Cleaning Script
Make sure `src/cleaning.py` is in place before running the cells below. The script will:
- Clean whitespace and normalize punctuation.
- Drop invalid rows.
- Add `verse_id`, `theme`, and `emotion` columns.
- Save each output file as `*_cleaned.csv` under `data/processed/bible_kjv/`.
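
To make the steps above concrete, here is a rough sketch of what such a cleaning pass could look like in pandas. It is not the code in `cleaning.py`: the function name `clean_verses`, the input columns `chapter`, `verse`, and `text`, the `book_code` argument, and the `GEN_1_1` identifier format are all assumptions made for illustration.

```python
import pandas as pd


def clean_verses(df, book_code):
    """Illustrative cleaning pass (assumed columns: chapter, verse, text)."""
    # Drop rows that lack a chapter number, verse number, or text
    out = df.dropna(subset=["chapter", "verse", "text"]).copy()

    # Normalize curly quotes, collapse whitespace runs, trim the ends
    text = out["text"].astype(str)
    text = text.str.replace("[\u2018\u2019]", "'", regex=True)
    text = text.str.replace("[\u201c\u201d]", '"', regex=True)
    text = text.str.replace(r"\s+", " ", regex=True).str.strip()
    out["text"] = text

    # Drop rows whose text is empty after trimming
    out = out[out["text"] != ""].copy()

    # Build an identifier such as GEN_1_1 (format is an assumption)
    out["chapter"] = out["chapter"].astype(int)
    out["verse"] = out["verse"].astype(int)
    out["verse_id"] = (
        book_code + "_" + out["chapter"].astype(str) + "_" + out["verse"].astype(str)
    )

    # Empty placeholder columns for the later annotation step
    out["theme"] = ""
    out["emotion"] = ""
    return out.reset_index(drop=True)
```

In this sketch the `theme` and `emotion` columns are left as empty strings; the real script may fill them differently.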

In [None]:
# Option 1: Run the script from the notebook
!python src/cleaning.py

In [None]:
# Option 2: Call the main function directly (assumes the notebook runs from the project root so `src` is importable)
from src.cleaning import clean_and_prepare_csvs

clean_and_prepare_csvs()

## ✅ 4. View Cleaned Output

In [None]:
# Load the cleaned result (pandas is re-imported in case the optional cell above was skipped)
import pandas as pd

df_cleaned = pd.read_csv("data/processed/bible_kjv/1_genesis_cleaned.csv")
df_cleaned.head()
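
A few automated checks can confirm that the cleaned file has the expected shape. The sketch below only relies on what this notebook states: the script adds `verse_id`, `theme`, and `emotion` columns, and `verse_id` should be unique per verse. The helper name `check_cleaned` is hypothetical.

```python
import pandas as pd

# Columns the cleaning script is described as adding
REQUIRED_COLUMNS = ["verse_id", "theme", "emotion"]


def check_cleaned(df):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "verse_id" in df.columns:
        if df["verse_id"].isna().any():
            problems.append("verse_id has empty values")
        if df["verse_id"].duplicated().any():
            problems.append("verse_id has duplicates")
    return problems
```

Running `check_cleaned(df_cleaned)` should return an empty list if the output matches the description above.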

## 📝 5. Review the Cleaning Log

In [None]:
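from pathlib import Path

# Sorting by name finds the newest log only if the file names sort chronologically (assumed here)
log_files = sorted(Path("logs/cleaning_logs").glob("cleaning_log_*.txt"), reverse=True)
if log_files:
    print(f"Most recent log: {log_files[0]}")
    print(log_files[0].read_text(encoding="utf-8"))
else:
    print("No cleaning logs found in logs/cleaning_logs. Run the cleaning script first.")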

## ✅ Conclusion
Your raw verse data is now cleaned and ready for:
- Emotion labeling (`emotion_theme_labeling.py`)
- Thematic labeling (`theme_labeling.py`)

👉 Continue to the next notebook: `03_label_emotions_and_themes.ipynb`.