# 04 - Evaluation & Verification

## Learning Goals

* Compute and interpret simple metrics (accuracy, confusion matrix).
* Produce reproducible **artifacts** and a machine-readable **receipt** for CI.
* Understand why verifiable outputs matter in a classroom or production pipeline.

## You Should Be Able To...

- Run model evaluation and interpret results
- Understand confusion matrices and accuracy metrics
- Generate verification receipts for ML pipelines
- Identify when models meet deployment criteria
- Reflect on the complete ML development process

---

## Concepts

**Confusion matrix**: where the classifier makes mistakes by class.

**Reproducible artifacts**: model file, benchmark reports, quantization summary, evaluation report.

**Receipt**: a small JSON proving all required files were created and basic checks passed.

## Common Pitfalls

* Not running evaluation on held-out test data
* Misinterpreting confusion matrix results
* Forgetting to generate verification artifacts
* Not checking that all pipeline components work together

## Success Criteria

* ✅ `progress/receipt.json` says **PASS**
* ✅ You can explain what each artifact is and where it lives
* ✅ You can describe one change you'd make next (e.g., more data, different architecture)

---

## Setup & Environment Check


# ruff: noqa: E401
import os
import sys
from pathlib import Path

def cd_repo_root():
    p = Path.cwd()
    for _ in range(5):  # climb up at most 5 levels
        if (p/"verify.py").exists() and (p/"scripts"/"evaluate_onnx.py").exists():
            if str(p) not in sys.path: sys.path.insert(0, str(p))
            if p != Path.cwd():
                os.chdir(p)
                print("-> Changed working dir to repo root:", os.getcwd())
            return
        p = p.parent
    raise RuntimeError("Could not locate repo root")

cd_repo_root()

# Hints & Solutions helper (pure Jupyter, no extra deps)
from IPython.display import Markdown, display

def hints(*lines, solution: str | None = None, title="Need a nudge?"):
    """Render progressive hints + optional collapsible solution."""
    md = [f"### {title}"]
    for i, txt in enumerate(lines, start=1):
        md.append(f"<details><summary>Hint {i}</summary>\n\n{txt}\n\n</details>")
    if solution:
        # keep code fenced as python for readability
        md.append(
            "<details><summary><b>Show solution</b></summary>\n\n"
            f"```python\n{solution.strip()}\n```\n"
            "</details>"
        )
    display(Markdown("\n\n".join(md)))


## 🤔 What is evaluation and why do we need it?

**Evaluation** = testing the model on data it has not seen during training.

**What we measure**:
- **Accuracy** - the fraction of predictions that are correct
- **Confusion matrix** - a detailed breakdown of correct and incorrect predictions
- **Per-class performance** - how well the model does on each class

**Why it matters**:
- **Validation** - confirms that the model actually works
- **Debugging** - shows which classes are hard
- **Comparison** - lets you compare different models and settings

<details>
<summary>🔍 Click to see what a confusion matrix shows</summary>

**Confusion matrix**:
- **Diagonal** = correct predictions
- **Off-diagonal** = incorrect predictions
- **Per class** = precision and recall for each class

</details>
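To make these quantities concrete, here is a minimal sketch in pure Python that builds a confusion matrix from two label lists and derives accuracy, precision, and recall from it. The labels are made up, and this is not the implementation used by `scripts/evaluate_onnx.py`; it assumes rows are true classes and columns are predicted classes.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_precision_recall(matrix):
    """Return one (precision, recall) tuple per class."""
    n = len(matrix)
    results = []
    for c in range(n):
        tp = matrix[c][c]
        predicted_c = sum(matrix[r][c] for r in range(n))  # column sum
        actual_c = sum(matrix[c])                          # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        results.append((precision, recall))
    return results


# Made-up labels for a 3-class problem (illustration only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                              # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy(cm))                    # 4 correct out of 6
print(per_class_precision_recall(cm))
```

Class 1 in this toy example has perfect recall (both true class-1 samples were found) but lower precision, because one class-0 sample was also predicted as class 1. Accuracy alone would hide that difference.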


In [None]:
# Run evaluation on our model
print("🔍 Running evaluation...")

# Use the model from the previous notebooks (or train a quick one)
!python -m piedge_edukit.train --fakedata --no-pretrained --epochs 1 --batch-size 256 --output-dir ./models_eval


In [None]:
# Run evaluation on a limited number of samples (faster)
!python scripts/evaluate_onnx.py --model ./models_eval/model.onnx --fakedata --limit 32


In [None]:
# Show the evaluation results
import os

if os.path.exists("./reports/eval_summary.txt"):
    with open("./reports/eval_summary.txt", "r") as f:
        print("📊 Evaluation results:")
        print(f.read())
else:
    print("❌ Evaluation report missing")


In [None]:
# Show the training curves if they exist
from PIL import Image
from IPython.display import display

if os.path.exists("./reports/training_curves.png"):
    print("📈 Training curves:")
    display(Image.open("./reports/training_curves.png"))
else:
    print("⚠️ Training curves missing. Run training first.")


In [None]:
# Show the confusion matrix if it exists
import matplotlib.pyplot as plt
from PIL import Image

if os.path.exists("./reports/confusion_matrix.png"):
    print("📈 Confusion Matrix:")
    img = Image.open("./reports/confusion_matrix.png")
    plt.figure(figsize=(8, 6))
    plt.imshow(img)
    plt.axis('off')
    plt.title('Confusion Matrix')
    plt.show()
else:
    print("❌ Confusion matrix missing")


## 🔍 Automatic verification

**Verification** = automated checks that confirm the lesson works correctly.

**What is checked**:
- **Artifacts exist** - all required files have been created
- **Benchmark works** - the latency data is valid
- **Quantization works** - a quantized model has been created
- **Evaluation works** - the confusion matrix and accuracy are available

**Result**: `progress/receipt.json` with a PASS/FAIL status
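The idea of a receipt can be sketched in a few lines. The example below is an illustration only and not the code in `verify.py`: it checks that a list of files exists and produces a dictionary with the `pass`, `timestamp`, `checks`, and `artifacts` fields that the receipt-analysis cell in this notebook reads. The `exists:` naming of checks and the `build_receipt` helper are hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_receipt(required_files, root="."):
    """Check that each required file exists and summarize the result."""
    root = Path(root)
    checks = []
    for rel_path in required_files:
        exists = (root / rel_path).is_file()
        checks.append({
            "name": f"exists:{rel_path}",  # hypothetical naming scheme
            "ok": exists,
            "reason": "found" if exists else "file not found",
        })
    return {
        "pass": all(c["ok"] for c in checks),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "checks": checks,
        "artifacts": [p for p in required_files if (root / p).is_file()],
    }


# Demo in a temporary directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "reports").mkdir()
    (tmp / "reports" / "eval_summary.txt").write_text("accuracy: 0.5\n")

    receipt = build_receipt(
        ["reports/eval_summary.txt", "reports/confusion_matrix.png"], root=tmp
    )
    print(json.dumps(receipt, indent=2))
```

Because one of the two files is absent in the demo, the receipt reports a failed check and `pass` is `false`. A CI job can read that single field instead of parsing logs.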


In [None]:
# Run automatic verification
print("🔍 Running automatic verification...")
!python verify.py


In [None]:
# Analyze the receipt in detail
import json

if os.path.exists("./progress/receipt.json"):
    with open("./progress/receipt.json", "r") as f:
        receipt = json.load(f)

    print("📋 Detailed receipt analysis:")
    print(f"Status: {'✅ PASS' if receipt['pass'] else '❌ FAIL'}")
    print(f"Timestamp: {receipt['timestamp']}")

    print("\n🔍 Checks:")
    for check in receipt['checks']:
        status = "✅" if check['ok'] else "❌"
        print(f"  {status} {check['name']}: {check['reason']}")

    print("\n📊 Metrics:")
    if 'metrics' in receipt:
        for metric, value in receipt['metrics'].items():
            print(f"  {metric}: {value}")

    print("\n📁 Generated files:")
    if 'artifacts' in receipt:
        for artifact in receipt['artifacts']:
            print(f"  - {artifact}")
else:
    print("❌ Receipt missing")


## 🤔 Reflection questions

### TODO R1 - Reflect on results (2-4 bullets)
- Where did quantization help / hurt?
- Do your p50 and p95 match expectations after warm-up?
- One change you would make before deploying.

<details><summary>Hint</summary>
Tie back to goals: correctness, latency, and determinism. Falling back to FP32 is fine if INT8 regresses.
</details>
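If you want to check p50 and p95 by hand, the sketch below shows one way to do it. It uses the nearest-rank definition of a percentile, which is one common convention and may differ from what the benchmark script in this kit uses. The latency values are made up, and `summarize_latency` is a hypothetical helper.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_latency(latencies_ms, warmup=0):
    """Drop the first `warmup` runs, then report p50 and p95."""
    steady = latencies_ms[warmup:]
    return {"p50": percentile(steady, 50), "p95": percentile(steady, 95)}


# Made-up latencies in milliseconds; the first two runs are slow warm-up runs.
latencies = [40.0, 25.0, 10.0, 11.0, 9.0, 10.5, 12.0, 10.0, 30.0, 11.5]

print(summarize_latency(latencies, warmup=0))  # {'p50': 11.0, 'p95': 40.0}
print(summarize_latency(latencies, warmup=2))  # {'p50': 10.5, 'p95': 30.0}
```

Dropping the warm-up runs moves p95 from 40.0 ms to 30.0 ms in this toy data, while p50 barely changes. Tail percentiles are the ones most distorted by cold-start effects.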

<details>
<summary>💭 Which goals does our automated check verify?</summary>

**Answer**: Our verification checks:
- **Technical functionality** - every step runs without errors
- **Artifact generation** - the required files are created
- **Data integrity** - reports are valid and parseable
- **Pipeline integration** - all components work together

**What is NOT verified**:
- Accuracy quality (only that evaluation runs)
- Latency targets (only that the benchmark runs)
- Production readiness (only that the pipeline works)

</details>

<details>
<summary>💭 What is missing for "production"?</summary>

**Answer**: For production we need:
- **Real data** - not FakeData
- **Accuracy targets** - specific precision/recall requirements
- **Latency targets** - SLA requirements on inference time
- **Robustness** - handling of edge cases and errors
- **Monitoring** - continuous tracking of performance
- **A/B testing** - comparing different models
- **Rollback** - the ability to return to an earlier version

</details>
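Accuracy and latency targets can be expressed as an explicit gate. The following is a sketch under stated assumptions: the metric keys `accuracy` and `p95_ms`, the thresholds, and the `meets_deployment_criteria` helper are all hypothetical and are not part of `verify.py`.

```python
def meets_deployment_criteria(metrics, min_accuracy, max_p95_ms):
    """Return (ok, reasons) for a simple accuracy and latency gate."""
    reasons = []

    accuracy = metrics.get("accuracy")  # hypothetical key
    if accuracy is None:
        reasons.append("accuracy is missing")
    elif accuracy < min_accuracy:
        reasons.append(f"accuracy {accuracy} is below the target {min_accuracy}")

    p95 = metrics.get("p95_ms")  # hypothetical key
    if p95 is None:
        reasons.append("p95 latency is missing")
    elif p95 > max_p95_ms:
        reasons.append(f"p95 latency {p95} ms exceeds the limit {max_p95_ms} ms")

    return (not reasons, reasons)


# Made-up metrics and thresholds for illustration.
fast_model = {"accuracy": 0.93, "p95_ms": 18.0}
slow_model = {"accuracy": 0.93, "p95_ms": 45.0}

print(meets_deployment_criteria(fast_model, min_accuracy=0.90, max_p95_ms=25.0))
print(meets_deployment_criteria(slow_model, min_accuracy=0.90, max_p95_ms=25.0))
```

Returning the reasons alongside the boolean keeps the gate explainable: a failed deployment check says which target was missed rather than only reporting FAIL.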


## 🎯 Your own experiment

**Task**: Run verification on different models and compare the receipts.

**Suggestions**:
- Train models with different settings
- Run verification on each model
- Compare the receipts and see which pass or fail
- Analyze which checks are most critical

**Code to modify**:
```python
# Train different models and run verification
MODELS = [
    {"epochs": 1, "batch_size": 128, "name": "quick"},
    {"epochs": 3, "batch_size": 64, "name": "balanced"},
    {"epochs": 5, "batch_size": 32, "name": "thorough"}
]

for model_config in MODELS:
    # Train the model
    # Run verification
    # Analyze the receipt
    pass  # placeholder so the loop is valid Python until the steps are filled in
```
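Once you have one receipt per model, comparing them is mostly a matter of lining up the checks. Here is a minimal sketch, assuming each receipt has the `checks` list with `name` and `ok` fields that the receipt-analysis cell reads. The check names and receipts below are made up, and `compare_receipts` and `failing_checks` are hypothetical helpers.

```python
def compare_receipts(receipts):
    """Map each check name to a {model_name: ok} dict across all receipts."""
    table = {}
    for model_name, receipt in receipts.items():
        for check in receipt.get("checks", []):
            table.setdefault(check["name"], {})[model_name] = check["ok"]
    return table


def failing_checks(table):
    """Names of checks that failed for at least one model, sorted alphabetically."""
    return sorted(
        name for name, results in table.items() if not all(results.values())
    )


# Made-up receipts for two of the model configurations.
receipts = {
    "quick": {
        "pass": False,
        "checks": [
            {"name": "artifacts", "ok": True, "reason": "found"},
            {"name": "evaluation", "ok": False, "reason": "report missing"},
        ],
    },
    "balanced": {
        "pass": True,
        "checks": [
            {"name": "artifacts", "ok": True, "reason": "found"},
            {"name": "evaluation", "ok": True, "reason": "ok"},
        ],
    },
}

table = compare_receipts(receipts)
print(table)
print(failing_checks(table))  # ['evaluation']
```

A check that fails for every configuration usually points at the pipeline itself, while a check that fails for only some models points at the training settings.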


In [None]:
# TODO: Implement your experiment here
# Train different models and compare the receipts

MODELS = [
    {"epochs": 1, "batch_size": 128, "name": "quick"},
    {"epochs": 3, "batch_size": 64, "name": "balanced"},
    {"epochs": 5, "batch_size": 32, "name": "thorough"}
]

print("🧪 My experiment: compare different models")
for model_config in MODELS:
    print(f"  - {model_config['name']}: epochs={model_config['epochs']}, batch_size={model_config['batch_size']}")

# TODO: Implement a loop that trains and verifies each model


## Final Reflection

Congratulations! You've completed the entire PiEdge EduKit lesson. Please reflect on your learning experience:

**1. What was the most challenging part of implementing the CNN architecture? What helped you understand it better?**

*Your answer here (2-3 sentences):*

---

**2. How did your understanding of model performance change after running the latency benchmarks?**

*Your answer here (2-3 sentences):*

---

**3. What surprised you most about the quantization process? What would you do differently in a real deployment?**

*Your answer here (2-3 sentences):*

---

**4. How important do you think automated verification is for ML pipelines? Why?**

*Your answer here (2-3 sentences):*

---

## Next Steps

**Congratulations!** You've successfully completed the PiEdge EduKit lesson. You now understand:

- ✅ CNN implementation and training
- ✅ Model export to ONNX format
- ✅ Performance benchmarking and analysis
- ✅ Quantization and compression techniques
- ✅ Evaluation and verification workflows

**Real-world applications**: Experiment with real data, different models, or deploy on Raspberry Pi!

**Key concepts mastered**:
- **Training**: Implementing and training neural networks
- **Export**: Converting models to deployment-ready formats
- **Benchmarking**: Measuring and analyzing performance
- **Quantization**: Optimizing models for edge deployment
- **Verification**: Automated quality assurance for ML pipelines
