# Module 3 – Automated Grading Notebook (Spark LLM)

## Purpose
This notebook is used by markers to grade **Module 3** submissions.

It aligns explicitly with the Module 3 rubric and uses the Spark-hosted LLM **as a simulation engine**, not as a source of truth.

### Key Principles
- Prompt discipline is graded independently of model behaviour
- Malformed JSON handling **is assessed explicitly**
- Live LLM calls validate robustness, not semantic correctness
- All scoring paths are deterministic and appeal-safe


## Rubric Alignment

| Rubric Area | Weight | Assessed How |
|------------|--------|--------------|
| Prompt discipline | 35% | Static prompt inspection |
| API usage & parameters | 15% | Code inspection + runtime |
| JSON robustness | 30% | Injected malformed responses |
| Graceful failure handling | 20% | Live Spark call tolerance |
| **Total** | **100%** | |


## Marker Instructions

1. Run all cells top-to-bottom
2. Do **not** modify scoring logic
3. Review commentary cells for borderline cases
4. Record final score and notes externally


In [None]:
# --- Spark API Client (Provided) ---
import requests
import json
from typing import Any, Dict

SPARK_BASE_URL = "http://spark:11434"
MODEL = "phi3:mini"

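def call_spark_llm(prompt: str, temperature: float = 0.0) -> str:
    """Send a single-turn chat request to the Spark-hosted model and return the reply text."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # Assumes the Spark host exposes the Ollama API (port 11434, /api/chat):
        # sampling parameters go under "options"; a top-level "temperature" is ignored.
        "options": {"temperature": temperature},
        "stream": False
    }
    r = requests.post(f"{SPARK_BASE_URL}/api/chat", json=payload, timeout=60)
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json()["message"]["content"]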
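
The client above indexes the response body directly, so an unexpected body shape surfaces as a bare `KeyError`. The sketch below shows one way to make that failure self-describing, so a server-side problem is not mistaken for a student error. The helper name `extract_content` is hypothetical and not part of the provided client; it assumes only the `message` / `content` shape that `call_spark_llm` already relies on.

```python
from typing import Any

def extract_content(body: Any) -> str:
    """Return body["message"]["content"], or raise ValueError saying what is missing."""
    if not isinstance(body, dict):
        raise ValueError(f"expected a JSON object, got {type(body).__name__}")
    message = body.get("message")
    if not isinstance(message, dict):
        raise ValueError("response has no 'message' object")
    content = message.get("content")
    if not isinstance(content, str):
        raise ValueError("response 'message' has no string 'content'")
    return content

# Usage with a body shaped like the one call_spark_llm expects:
print(extract_content({"message": {"role": "assistant", "content": "hello"}}))
```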


## Student Submission

**Markers:** paste the student’s prompt and parsing function below exactly as submitted.


In [None]:
# --- Student Prompt ---
student_prompt = """
PASTE STUDENT PROMPT HERE
"""

# --- Student Parsing Function ---
def parse_llm_response(text: str) -> Dict[str, Any]:
    # PASTE STUDENT FUNCTION HERE
    pass


## 1. Prompt Discipline (35%)

**Marker commentary:**
- Does the prompt explicitly require JSON?
- Does it define keys / schema?
- Does it prohibit extra text?
- Are examples or constraints provided?
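
As an illustration of the four questions above, a prompt that addresses all of them might look like the following sketch. The task wording is hypothetical; the key names `answer` and `confidence` are borrowed from the injected test responses later in this notebook, not from any official schema.

```python
# Hypothetical example prompt; the task and wording are illustrative only.
EXAMPLE_PROMPT = """
You are a yes/no classifier.
Respond with JSON only, with no extra text before or after it.
Use exactly these keys:
  "answer": "yes" or "no"
  "confidence": a number between 0 and 1
Example: {"answer": "yes", "confidence": 0.8}
Question: Is water wet?
"""

# The four criteria, checked the same way a marker would read the prompt.
lowered = EXAMPLE_PROMPT.lower()
print("requires JSON:", "json" in lowered)
print("defines keys:", "keys" in lowered)
print("prohibits extra text:", "no extra text" in lowered)
print("gives an example:", "example" in lowered)
```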


In [None]:
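def grade_prompt(prompt: str) -> int:
    """Award points for keywords that signal prompt discipline (max 35).

    Checks are case-insensitive substring matches, so "only" also matches
    "commonly"; confirm borderline scores against the commentary above.
    """
    score = 0
    p = prompt.lower()

    if "json" in p:                      # explicitly requires JSON
        score += 10
    if "key" in p or "schema" in p:      # defines keys / schema
        score += 10
    if "no extra" in p or "only" in p:   # prohibits extra text
        score += 5
    if "example" in p:                   # provides an example
        score += 10

    return min(score, 35)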

prompt_score = grade_prompt(student_prompt)
prompt_score
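
Because `grade_prompt` uses substring matching, a keyword can match inside a longer word. The sketch below contrasts that with whole-word matching, which may help when reviewing a borderline prompt score. The helper `has_word` is hypothetical and is not used by the scoring logic.

```python
import re

def has_word(text: str, word: str) -> bool:
    """True if `word` occurs as a whole word in `text`, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

sample = "This format is commonly used."
print("only" in sample.lower())   # True: substring match inside "commonly"
print(has_word(sample, "only"))   # False: no standalone word "only"
```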


## 2. JSON Robustness – Injected Tests (30%)

**Marker commentary:**
- Students are expected to handle malformed JSON
- Failures here reflect insufficient defensive parsing
- This does NOT depend on live model output
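
For reference when judging submissions, defensive parsing of this kind could be sketched as follows. This is one possible approach under stated assumptions, not a model answer: the name `tolerant_parse` and the `{"error": ...}` fallback shape are hypothetical, and a submission may handle failures differently and still be acceptable.

```python
import json
import re
from typing import Any, Dict

# Matches a Markdown code-fence marker (three backticks), optionally tagged "json".
FENCE_RE = re.compile(r"`{3}(?:json)?")

def tolerant_parse(text: str) -> Dict[str, Any]:
    """Always return a dict; never raise on malformed model output."""
    cleaned = FENCE_RE.sub("", text).strip()
    start = cleaned.find("{")
    if start == -1:
        return {"error": "no JSON object found"}
    try:
        # raw_decode parses one JSON value and ignores anything after it.
        obj, _end = json.JSONDecoder().raw_decode(cleaned[start:])
    except json.JSONDecodeError:
        return {"error": "malformed JSON"}
    if not isinstance(obj, dict):
        return {"error": "top-level value is not an object"}
    return obj

print(tolerant_parse('{"answer": "yes"} trailing text'))
print(tolerant_parse('{"answer": "yes", "confidence": 0.8'))
```

The second call returns the fallback dict rather than raising, which is the behaviour the injected tests reward.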


In [None]:
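MALFORMED_RESPONSES = [
    '{"answer": "yes", "confidence": 0.8',   # truncated: missing closing brace
    '```json {"answer": "yes"} ```',          # wrapped in a Markdown code fence
    '{"answer": "yes"} trailing text',        # valid object followed by extra text
    '{"confidence": 0.9}'                     # valid JSON without an "answer" key
]

def grade_parsing(fn) -> int:
    """Award 7.5 points per injected response that yields a dict without raising (max 30)."""
    score = 0
    for text in MALFORMED_RESPONSES:
        try:
            result = fn(text)
            if isinstance(result, dict):
                score += 7.5
        except Exception:
            # An uncaught exception means the parser did not fail gracefully.
            pass
    # int() truncates, so 1 or 3 passes give 7 or 22 rather than 7.5 or 22.5.
    return int(min(score, 30))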

parsing_score = grade_parsing(parse_llm_response)
parsing_score
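
The score alone does not show which injected cases failed, which matters for borderline reviews. A per-case report could be sketched as below. The helper `parsing_report` is hypothetical and does not affect scoring; the cases are copied from the list above, with the fence marker built from single backticks. `json.loads` stands in for a naive student parser.

```python
import json
from typing import Any, Callable, List, Tuple

FENCE = "`" * 3  # Markdown code-fence marker
CASES = [
    '{"answer": "yes", "confidence": 0.8',
    FENCE + 'json {"answer": "yes"} ' + FENCE,
    '{"answer": "yes"} trailing text',
    '{"confidence": 0.9}',
]

def parsing_report(fn: Callable[[str], Any], cases: List[str]) -> List[Tuple[str, str]]:
    """Return (input, outcome) pairs describing how `fn` handled each case."""
    report = []
    for text in cases:
        try:
            result = fn(text)
        except Exception as exc:
            outcome = f"raised {type(exc).__name__}"
        else:
            outcome = "dict" if isinstance(result, dict) else f"returned {type(result).__name__}"
        report.append((text, outcome))
    return report

# A naive parser: only the last case (already valid JSON) yields a dict.
for text, outcome in parsing_report(json.loads, CASES):
    print(f"{outcome:<25} {text}")
```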


## 3. Live Spark Robustness Check (20%)

**Marker commentary:**
- Live LLM output is used ONLY to validate robustness
- Semantic correctness is NOT graded
- Failure must be due to student code, not wording variance


In [None]:
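def grade_live_behavior(prompt: str, parse_fn) -> int:
    """Award 20 points if the student's parser returns a dict for live model output."""
    # Call the model outside the try block: a network or server error is an
    # infrastructure problem and should stop the cell, not cost the student marks.
    raw = call_spark_llm(prompt)
    try:
        parsed = parse_fn(raw)
    except Exception:
        # The student's parser raised on live output.
        return 0
    return 20 if isinstance(parsed, dict) else 0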

live_score = grade_live_behavior(student_prompt, parse_llm_response)
live_score
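
To keep failures attributable to student code, it can help to tell a failed Spark call apart from a failed parse. The sketch below does this with injected callables, so it can be dry-run without a live server. The function `classify_live_outcome`, its outcome labels, and the stub functions are all hypothetical.

```python
import json
from typing import Any, Callable

def classify_live_outcome(call_fn: Callable[[str], str],
                          parse_fn: Callable[[str], Any],
                          prompt: str) -> str:
    """Label the result of one live check without assigning a score."""
    try:
        raw = call_fn(prompt)
    except Exception:
        return "infrastructure_error"  # rerun later; not the student's fault
    try:
        parsed = parse_fn(raw)
    except Exception:
        return "parser_raised"
    return "ok" if isinstance(parsed, dict) else "not_a_dict"

# Stubs standing in for the Spark client.
def fake_llm(prompt: str) -> str:
    return '{"answer": "yes"}'

def unreachable_llm(prompt: str) -> str:
    raise ConnectionError("Spark unreachable")

print(classify_live_outcome(fake_llm, json.loads, "any prompt"))
print(classify_live_outcome(unreachable_llm, json.loads, "any prompt"))
```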


## Final Score

**Marker commentary:**
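- The cell below sums the three automated areas only (prompt discipline, JSON robustness, live robustness), for a maximum of 85
- The **API usage & parameters** area (15%) has no automated check in this notebook; assess it by code inspection and runtime behaviour, then add it to obtain the official Module 3 mark
- Record qualitative notes separately if needed


In [None]:
# Automated subtotal: 35 + 30 + 20 = 85 maximum.
# The 15% API usage & parameters mark is assigned by the marker and is not included here.
total_score = prompt_score + parsing_score + live_score
total_score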
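
The automated cells in this notebook cover three of the four rubric areas, so a full mark out of 100 also needs the marker's API usage score. One way to combine the four with range checks is sketched below. The function `final_mark` and its parameter `api_score` are hypothetical; the limits are taken from the rubric table.

```python
def final_mark(prompt_score: float, parsing_score: float,
               live_score: float, api_score: float) -> float:
    """Sum the four area scores after checking each against its rubric weight."""
    limits = {"prompt": 35, "parsing": 30, "live": 20, "api": 15}
    scores = {"prompt": prompt_score, "parsing": parsing_score,
              "live": live_score, "api": api_score}
    for area, value in scores.items():
        if not 0 <= value <= limits[area]:
            raise ValueError(f"{area} score {value} is outside 0..{limits[area]}")
    return sum(scores.values())

print(final_mark(25, 22, 20, 10))  # 77
```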
