# Alpaca Data Validation and Loading (Colab Demo)

This notebook demonstrates how to validate, load, and inspect Alpaca-format data for LLM training and evaluation.

- Validates all datasets in `data/processed/` using your project's schema
- Loads data with batching using `TemplateDataLoader`
- Runs a dry-run evaluation with a dummy evaluator

**Instructions:**
- Upload your code and data, or clone your repo in Colab.
- Adjust paths as needed if your directory structure differs.


In [1]:
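# 1. Install dependencies (if needed)
# Outside Colab, `!pip` can resolve to a pip script tied to a missing interpreter;
# `%pip install ...` installs into the running kernel's environment instead.
!pip install torch transformers datasets wandb matplotlib seaborn pandas ipywidgets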


zsh:1: /opt/homebrew/bin/pip: bad interpreter: /opt/homebrew/opt/python@3.11/bin/python3.11: no such file or directory


## 2. (Optional) Clone your repo or upload files
If your repo is private, use a personal access token or upload manually.
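
For the clone route, the cell below is a minimal sketch, assuming the repo is hosted on GitHub and the token is available in an environment variable. `build_clone_url`, `clone_repo`, the placeholder owner and repo names, and `GITHUB_TOKEN` are illustrative names, not part of this project.

```python
import os
import subprocess


def build_clone_url(owner, repo, token=None):
    """Build an HTTPS clone URL; a token, if given, is embedded as the user part."""
    host = "github.com"  # assumption: the repo lives on GitHub
    if token:
        return f"https://{token}@{host}/{owner}/{repo}.git"
    return f"https://{host}/{owner}/{repo}.git"


def clone_repo(owner, repo, dest, token=None):
    """Clone into dest unless it already exists, so re-running the cell is harmless."""
    if os.path.isdir(dest):
        return False
    url = build_clone_url(owner, repo, token)
    subprocess.run(["git", "clone", url, dest], check=True)
    return True


# Hypothetical usage; replace the placeholders with your own values:
# clone_repo("your-user", "your-repo", "/content/your-repo",
#            token=os.environ.get("GITHUB_TOKEN"))
```

Embedding the token in the URL leaves it in the clone's `.git/config`, so a short-lived token or a manual upload may be preferable if that is a concern.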


In [2]:
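# 3. Set up Python path for src/ imports
import os
import sys
if os.path.basename(os.getcwd()) == "notebooks":  # guard so re-runs do not climb further
    os.chdir("..")  # Move up from /notebooks/ to project root
sys.path.insert(0, os.getcwd())  # make `src` importable; avoids ModuleNotFoundError
print("Current working directory:", os.getcwd())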


## 4. Validate Alpaca-format data
Checks all JSON files in `data/processed/` for schema compliance.
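
As a rough illustration of what schema compliance means here, the sketch below checks records against the commonly used Alpaca layout (`instruction` and `output` as non-empty strings, `input` as an optional string). It is not the project's `AlpacaSchemaValidator`: `check_alpaca_record` and `check_alpaca_file` are hypothetical helpers, and the report keys simply mirror the ones printed in the validation cell.

```python
import json
from pathlib import Path

# Assumed field layout of the common Alpaca format; the project's schema may differ.
REQUIRED_FIELDS = ("instruction", "output")
OPTIONAL_FIELDS = ("input",)


def check_alpaca_record(record):
    """Return a list of problems found in one record (empty list means it passed)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    for field in OPTIONAL_FIELDS:
        if field in record and not isinstance(record[field], str):
            problems.append(f"'{field}' must be a string")
    return problems


def check_alpaca_file(path):
    """Summarise one JSON file that is expected to hold a list of records."""
    try:
        records = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return {"file": str(path), "error": str(exc)}
    if not isinstance(records, list):
        return {"file": str(path), "error": "top-level JSON value is not a list"}
    errors = []
    for index, record in enumerate(records):
        problems = check_alpaca_record(record)
        if problems:
            errors.append({"index": index, "errors": problems})
    return {
        "file": str(path),
        "total": len(records),
        "valid": len(records) - len(errors),
        "invalid": len(errors),
        "errors": errors,
    }
```

A file that cannot be parsed yields a report with only `file` and `error`, which matches the fallback branch of the printing loop in the validation cell.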


In [3]:
from src.data.validate_alpaca_schema import AlpacaSchemaValidator
from pathlib import Path

data_dir = Path('data/processed')
english_tokens_path = data_dir / 'english_tokens.json'
validator = AlpacaSchemaValidator(english_tokens_path if english_tokens_path.exists() else None)
reports = validator.validate_dir(data_dir)
for report in reports:
    print(f'\nFile: {report["file"]}')
    if 'total' in report:
        print(f'  Total examples: {report["total"]}')
        print(f'  Valid: {report["valid"]}')
        print(f'  Invalid: {report["invalid"]}')
        if report['invalid'] > 0:
            for err in report['errors']:
                print(f'    Example #{err["index"]}: {err["errors"]}')
    else:
        print(f'  Error: {report.get("error", "Unknown error")}')


ModuleNotFoundError: No module named 'src'

## 5. Load and inspect data using TemplateDataLoader
Loads Alpaca-format data and prints stats and a sample batch.

In [None]:
from src.data.data_loader import TemplateDataLoader, BatchConfig

loader = TemplateDataLoader(
    data_dir=data_dir,
    batch_config=BatchConfig(batch_size=4, max_length=128, shuffle=False)
)
stats = loader.get_stats()
print('Stats:', stats)
for batch in loader.train_batches():
    print('Batch:', batch)
    break  # Show only the first batch
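
The batching step can be pictured with a small standalone generator. This is a sketch of the general technique, not the `TemplateDataLoader` implementation: `iter_batches` and its `seed` parameter are assumptions, and templating and `max_length` truncation are left out.

```python
import random


def iter_batches(records, batch_size, shuffle=False, seed=0):
    """Yield lists of at most batch_size records, optionally in shuffled order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    order = list(range(len(records)))
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible across runs.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [records[i] for i in order[start:start + batch_size]]


# Toy records in the Alpaca layout, made up for this example
records = [{"instruction": f"task {i}", "input": "", "output": str(i)} for i in range(10)]
batches = list(iter_batches(records, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

With `shuffle=False` the records come out in file order, which makes the first batch easy to compare against the source data during inspection.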


## 7. (Optional) Continue with your actual training/evaluation code here
You can expand this notebook to load a real model, run fine-tuning, or perform real evaluation as needed.