# Module 3 — LLM Fundamentals (CodeVision Academy)

## Overview
This module introduces **Large Language Models (LLMs)** from an engineering and enterprise perspective.
It is **code-first**, grounded in Python, and builds directly on:

- **Module 1:** Python fundamentals (functions, JSON, notebooks)
- **Module 2:** Data work with Pandas and visualisation

You will learn how LLMs work, how to call them from Python, how they fail, and how to use them safely in regulated environments such as banking and financial services.

### Supported LLM access methods (choose one)
- **Local laptop LLM** — run a lightweight model using **Ollama** on your PC.
- **Remote CodeVision LLM API** — an Ollama-compatible `/api/generate` endpoint provided by the course admin.

Both options use the same request shape. Your code should work by changing **one** base URL.

## Learning objectives
By the end of this module, you will be able to:

1. Explain what an LLM is (and what it is not)
2. Explain tokens, context windows, and training vs inference
3. Call an LLM from Python via HTTP API (local or remote)
4. Control determinism using temperature
5. Force structured output (JSON) and validate it
6. Recognise hallucinations and common failure modes
7. Apply LLMs safely in a small data pipeline
8. Explain why LLMs alone are insufficient for enterprise use, and why grounding (RAG) helps (Module 4)

## Setup — choose your endpoint

### Option A — Local laptop LLM (Ollama)
1. Install Ollama: https://ollama.com/download  
2. Pull models:
   - `ollama pull phi3:mini` (mandatory for this module)
   - `ollama pull llama3.2:1b` (optional comparison)
3. Ensure Ollama is running (it often starts automatically)

Local API:
- Base URL: `http://localhost:11434`
- Endpoint: `/api/generate`

### Option B — Remote CodeVision LLM API
- Base URL: provided by course admin
- Endpoint is compatible with Ollama’s `/api/generate`

Set `LLM_BASE_URL` below to match your choice.

In [None]:
# ===== CONFIG (edit only LLM_BASE_URL if needed) =====
LLM_BASE_URL = "http://localhost:11434"   # Local laptop Ollama
# LLM_BASE_URL = "https://YOUR-REMOTE-ENDPOINT"  # Remote CodeVision LLM API

DEFAULT_MODEL = "phi3:mini"      # mandatory
# DEFAULT_MODEL = "llama3.2:1b"  # optional comparison

## Common helper: a robust LLM caller

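We will reuse this helper in multiple sections. It:
- calls `/api/generate`
- passes `model`, `prompt`, and `stream`, with `temperature` nested under `options` (where Ollama expects sampling parameters)
- returns parsed JSON

If the endpoint is unreachable, `requests` raises a connection error or timeout. If it responds with a 4xx/5xx status, `raise_for_status()` raises an `HTTPError`. Both are normal and should be handled in production systems.

In [None]:
import json
import requests

def call_llm(prompt: str, model: str = DEFAULT_MODEL, temperature: float = 0.0, stream: bool = False, timeout_s: int = 60) -> dict:
    """Call an Ollama-compatible /api/generate endpoint and return parsed JSON."""
    url = f"{LLM_BASE_URL.rstrip('/')}/api/generate"
    # Ollama reads sampling parameters from "options"; a top-level "temperature" is ignored.
    payload = {"model": model, "prompt": prompt, "stream": stream, "options": {"temperature": temperature}}
    r = requests.post(url, json=payload, timeout=timeout_s)
    r.raise_for_status()
    if not stream:
        return r.json()
    # Streamed replies are newline-delimited JSON chunks; merge them into one dict.
    chunks = [json.loads(line) for line in r.iter_lines() if line]
    final = chunks[-1] if chunks else {}
    final["response"] = "".join(c.get("response", "") for c in chunks)
    return final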

# Smoke test
out = call_llm("In one sentence, define inflation for a banking audience.", temperature=0.0)
print(out.get("response", "")[:400])

# Section 3.1 — What is a Large Language Model?

An LLM is best understood as a **next-token prediction engine**. It generates text that is statistically likely, not text that is guaranteed true.

**Enterprise mindset:** treat LLM output as **untrusted** unless validated.

In [None]:
prompt = "Complete: 'Interest rates are rising because'"
print(call_llm(prompt, temperature=0.7).get("response","")[:300])
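
To make "statistically likely, not guaranteed true" concrete, here is a toy sketch. It is not how a real LLM is built: it is a bigram counter that continues with the most frequent next word in a tiny made-up corpus. The corpus and the helper name `most_likely_next` are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (illustrative only, not real training data)
corpus = (
    "interest rates are rising because inflation is high . "
    "interest rates are rising because demand is strong . "
    "interest rates are falling because inflation is low ."
).split()

# Count which word follows each word
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def most_likely_next(word: str) -> str:
    """Return the most frequent follower of `word` in the toy corpus."""
    return follows[word].most_common(1)[0][0]

print(most_likely_next("are"))      # "rising" follows twice, "falling" once
print(most_likely_next("because"))  # "inflation" follows twice, "demand" once
```

The counter answers "rising" because that is what it saw most often, not because rates are actually rising. A real model is far more capable, but the same caution applies to its output.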

# Section 3.2 — Tokens: How LLMs see text

LLMs operate on **tokens** (subword pieces), not words. Context limits are counted in tokens, so tokenisation determines how much text fits before truncation.

Practical implication: keep prompts concise and plan for chunking on long documents.
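
A quick way to plan for this is a rough size check before sending a prompt. The sketch below assumes about four characters per token, a commonly quoted rule of thumb for English text. The real count depends on the model's tokeniser, and the 4096-token window and the helper names are assumptions for illustration.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The characters-per-token ratio is an assumption, not a tokeniser."""
    return math.ceil(len(text) / chars_per_token)

def fits_context(text: str, context_tokens: int, reserve_for_output: int = 256) -> bool:
    """True if the estimated prompt size leaves room for the reserved output tokens."""
    return estimate_tokens(text) + reserve_for_output <= context_tokens

short = "Summarise the liquidity policy in 3 bullets."
long_doc = "This is a paragraph from a long banking policy document. " * 1500

# 4096 is an assumed window size; check the real limit for your model and endpoint
print(estimate_tokens(short), fits_context(short, context_tokens=4096))
print(estimate_tokens(long_doc), fits_context(long_doc, context_tokens=4096))
```

Treat the estimate as a guard rail only. When a prompt is close to the limit, chunk the input rather than trusting the approximation.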

# Section 3.3 — Training vs inference

- **Training**: offline learning of model parameters from huge datasets.
- **Inference**: runtime generation when you call the model endpoint.

This module focuses on inference.

In [None]:
resp = call_llm("Explain training vs inference in 2 bullet points.", temperature=0.0)
print(resp["response"])

# Section 3.4 — LLMs as services (APIs)

Treat the LLM like any other service: send a JSON request, receive a JSON response. This builds on your JSON skills from Module 1 and on the `requests` call inside `call_llm`.

In [None]:
resp = call_llm("Say hello.")
print(resp.keys())
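
Beyond `response`, a non-streaming Ollama reply normally carries metadata such as `model`, `done`, `eval_count` (generated tokens), and `eval_duration` (nanoseconds). The sketch below turns those fields into a throughput figure. It uses a hand-written sample dict rather than a live call, and it assumes a remote Ollama-compatible endpoint may omit some fields, so it returns `None` when they are missing.

```python
from typing import Optional

def tokens_per_second(resp: dict) -> Optional[float]:
    """Generation speed from Ollama-style metadata, or None if the fields are missing."""
    count = resp.get("eval_count")
    duration_ns = resp.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)

# Hand-written sample shaped like an Ollama response (values are made up)
sample = {
    "model": "phi3:mini",
    "response": "Hello!",
    "done": True,
    "eval_count": 50,
    "eval_duration": 2_000_000_000,  # 2 seconds, in nanoseconds
}

print(tokens_per_second(sample))             # 25.0
print(tokens_per_second({"response": "x"}))  # None
```

With a live endpoint you could pass the dict returned by `call_llm` straight to `tokens_per_second`.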

# Section 3.5 — Prompt structure: role, task, constraints

With single-prompt endpoints, simulate roles by placing behaviour rules first, task second, constraints last.

This reduces ambiguity and improves reliability.

In [None]:
system = "You are a cautious banking analyst. Do not speculate. If unsure, say 'Insufficient information'."
task = "Summarise for an executive: FX volatility increased due to rate differentials."
constraints = "Return exactly 2 bullet points. Max 20 words each."
prompt = f"SYSTEM:\n{system}\n\nTASK:\n{task}\n\nCONSTRAINTS:\n{constraints}"
print(call_llm(prompt, temperature=0.0)["response"])

# Section 3.6 — Temperature and determinism

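Temperature scales the randomness of token sampling: near 0.0 the model almost always picks the most likely next token, while higher values give less likely tokens a real chance. Low temperature (0.0–0.2) is preferred in regulated workflows for consistency, although it does not guarantee identical output across models, versions, or hardware.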

In [None]:
prompt = "Explain what a context window is in 2 sentences."
low = call_llm(prompt, temperature=0.0)["response"]
high = call_llm(prompt, temperature=0.8)["response"]
print("Temp 0.0:\n", low)
print("\nTemp 0.8:\n", high)
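
The effect of temperature can be illustrated without a model. This sketch applies a temperature-scaled softmax to three made-up scores (logits) for candidate next tokens. It is a simplified picture of sampling, and treating temperature 0 as a pure greedy pick is an assumption of this example.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy: all probability on the top score
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]
for t in (0.0, 0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top candidate to the others, which is why repeated runs start to differ.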

# Section 3.7 — Hallucinations (confident but wrong)

Hallucinations occur because the model optimises for plausible text rather than verified truth. Never treat confident language as evidence.

In [None]:
prompt = "What was the DJIA close on 32 December 2024? Answer with a number."
print(call_llm(prompt, temperature=0.0)["response"])

# Section 3.8 — Context windows: why long inputs fail

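LLMs have a maximum context window, measured in tokens, that must hold both the prompt and the generated output. Input beyond that limit is typically truncated, often without any warning, producing generic or incomplete summaries.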

In [None]:
long_text = ("This is a paragraph from a long banking policy document. " * 1500)
prompt = f"Summarise in 3 bullets:\n{long_text}"
print(call_llm(prompt, temperature=0.0)["response"][:600])
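
One common workaround is to summarise in two passes: split the text into chunks, summarise each, then summarise the partial summaries. The sketch below is a minimal version under stated assumptions: chunks are sized by word count rather than real tokens, the names `chunk_words` and `summarise_long` are hypothetical, and the model is passed in as a function so that a stand-in (`fake_llm`) lets the example run offline.

```python
from typing import Callable

def chunk_words(text: str, max_words: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarise_long(text: str, llm: Callable[[str], str], max_words: int = 300) -> str:
    """Summarise each chunk, then summarise the combined partial summaries."""
    partials = [llm(f"Summarise in 1 sentence:\n{c}") for c in chunk_words(text, max_words)]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    return llm("Summarise in 3 bullets:\n" + "\n".join(partials))

# Stand-in for the model: returns the first 5 words of the text after the instruction line
def fake_llm(prompt: str) -> str:
    body = prompt.split("\n", 1)[1]
    return " ".join(body.split()[:5])

doc = "This is a paragraph from a long banking policy document. " * 100
print(len(chunk_words(doc)))          # 1000 words split into chunks of up to 300
print(summarise_long(doc, fake_llm))
```

With the course helper, the call would look like `summarise_long(long_text, lambda p: call_llm(p)["response"])`. Each chunk costs one model call, and for very long documents the joined partial summaries can themselves exceed the window, in which case the reduction step needs repeating.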

# Section 3.9 — Prompt hygiene: common mistakes and fixes

Avoid vague asks, missing constraints, and multi-task prompts. Prefer clear audience, format, and uncertainty policy.

In [None]:
bad = "Tell me about interest rates."
good = "Explain interest rates to a new bank analyst in 3 bullets, <= 18 words each. No speculation."
print("BAD:\n", call_llm(bad, temperature=0.0)["response"])
print("\nGOOD:\n", call_llm(good, temperature=0.0)["response"])

# Section 3.10 — Structured output: why JSON matters

JSON output enables deterministic parsing, validation, and automation. This builds directly on Module 1.

In [None]:
import json
prompt = (
"Return ONLY valid JSON with keys: summary (string), risks (array of exactly 3 strings). "
"No extra text. Use double quotes. "
"Text: Banks face credit risk, market risk, and operational risk."
)
raw = call_llm(prompt, temperature=0.0)["response"]
print(raw)
data = json.loads(raw)  # raises json.JSONDecodeError if the model adds any extra text (see Section 3.11)
print(data)

# Section 3.11 — Defensive parsing and validation

Models sometimes return invalid JSON. Handle this safely: parse, validate, retry or fail clearly.

In [None]:
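import json

def safe_json_loads(s: str):
    """Return (True, parsed) on success or (False, error message) on failure."""
    try:
        return True, json.loads(s)
    except (ValueError, TypeError) as e:
        # ValueError covers json.JSONDecodeError; TypeError means the input was not str/bytes
        return False, f"{type(e).__name__}: {e}"

raw = call_llm('Return JSON only: {"a": 1}', temperature=0.0)["response"]
ok, parsed = safe_json_loads(raw)
print("OK?", ok)
print(parsed)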
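
A frequent cause of invalid JSON is the model wrapping a correct object in extra prose or a Markdown code fence. The sketch below shows one possible recovery step before retrying: try a strict parse, then try the span from the first `{` to the last `}`. The function name `extract_json_object` is hypothetical, and this heuristic is an assumption that will not rescue every malformed reply.

```python
import json
from typing import Optional

def extract_json_object(raw: str) -> Optional[dict]:
    """Return the JSON object found in `raw`, or None if nothing parses to a dict."""
    candidates = [raw]
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start:end + 1])  # outermost {...} span
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None

fence = "`" * 3  # builds a Markdown fence without writing one literally
samples = [
    '{"a": 1}',
    f'Here is the JSON:\n{fence}json\n{{"a": 1}}\n{fence}',
    "Sorry, I cannot help with that.",
]
for s in samples:
    print(extract_json_object(s))
```

If the function returns `None`, retry the call or fail with a clear error rather than guessing at the content.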

# Section 3.12 — Text validators: length, bullets, vocabulary

Not all tasks need JSON. You can validate text using deterministic rules like bullet count and max length.

In [None]:
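import re

text = call_llm("Return exactly 3 bullet points about liquidity risk.", temperature=0.0)["response"]
# Bullet = "-", "*", "•", or a numbered marker ("1." / "1)") followed by a space; "**bold**" lines are not counted.
bullet_re = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")
bullets = [ln for ln in text.splitlines() if bullet_re.match(ln)]
print("Bullet count:", len(bullets))
print(text)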
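
Length and vocabulary rules can be checked the same way. The sketch below returns a list of violations so the caller can log them or decide to retry. The banned-word list and the 400-character limit are hypothetical values chosen for illustration, not course policy.

```python
import re

# Hypothetical list of speculative words to flag; adjust to your own policy
SPECULATIVE_WORDS = frozenset({"probably", "maybe", "might", "guess", "perhaps"})

def validate_text(text: str, max_chars: int = 400, banned: frozenset = SPECULATIVE_WORDS) -> list:
    """Return a list of rule violations; an empty list means the text passed."""
    problems = []
    if len(text) > max_chars:
        problems.append(f"too long: {len(text)} > {max_chars} chars")
    words = set(re.findall(r"[a-z']+", text.lower()))
    hits = sorted(words & banned)
    if hits:
        problems.append(f"banned words: {', '.join(hits)}")
    return problems

print(validate_text("Liquidity risk is the risk of being unable to meet obligations."))
print(validate_text("Rates will probably rise, maybe next quarter."))
```

Returning every violation, rather than stopping at the first, makes the validator more useful for audit logs and for building a corrective retry prompt.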

# Section 3.13 — LLMs inside a Pandas pipeline

LLMs can augment data pipelines by generating summaries or tags. Start small and validate outputs.

In [None]:
import pandas as pd
df = pd.DataFrame({
    "id": [1,2,3],
    "text": [
        "Credit risk is the possibility of loss from borrower default.",
        "Market risk comes from adverse movements in interest rates and FX.",
        "Operational risk arises from process, people, or system failures."
    ]
})
def summarise_row(t: str) -> str:
    prompt = f"Summarise in 10 words or fewer: {t}"
    return call_llm(prompt, temperature=0.0)["response"].strip()
df["summary"] = df["text"].apply(summarise_row)
df

# Section 3.14 — Cost/latency mindset: caching

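LLM calls are slow compared to normal functions: each one is a network round trip plus model generation time. Cache responses for repeated prompts, keyed on everything that affects the output (prompt, temperature, model, endpoint).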

In [None]:
_cache = {}
def cached_llm(prompt: str, temperature: float = 0.0) -> str:
    key = (prompt, temperature, DEFAULT_MODEL, LLM_BASE_URL)
    if key in _cache:
        return _cache[key]
    out = call_llm(prompt, temperature=temperature)["response"].strip()
    _cache[key] = out
    return out
p = "Summarise: Banks face credit and market risk."
print(cached_llm(p, 0.0))
print(cached_llm(p, 0.0))

# Section 3.15 — Local vs remote endpoint trade-offs

- **Local:** simple, private, predictable.
- **Remote:** centrally managed, potentially faster, requires network and access control.

Your code should work for both by switching `LLM_BASE_URL`.

# Section 3.16 — Enterprise constraints: auditability and compliance

Log prompts (or hashes), parameters, model, and output metadata for auditability. Avoid sending sensitive data to unapproved endpoints.

In [None]:
import hashlib, time
def audit_meta(prompt: str, response_text: str, model: str, temperature: float) -> dict:
    return {
        "ts": time.time(),
        "model": model,
        "temperature": temperature,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response_text.encode()).hexdigest(),
        "response_len": len(response_text),
    }
p = "Summarise operational risk in 12 words."
resp = call_llm(p, temperature=0.0)
meta = audit_meta(p, resp.get("response",""), resp.get("model", DEFAULT_MODEL), 0.0)
meta
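
Audit metadata is only useful if it is persisted. One simple option, sketched here as an assumption rather than a prescribed course pattern, is an append-only JSON Lines file with one record per call. The helper names and the file name `llm_audit.jsonl` are hypothetical, and the demo writes hand-made records to a temporary directory instead of calling the model.

```python
import json
import os
import tempfile

def append_audit_record(path: str, record: dict) -> None:
    """Append one audit record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

def read_audit_log(path: str) -> list:
    """Load every record back, one dict per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a temporary file and hand-written records (no model call)
log_path = os.path.join(tempfile.mkdtemp(), "llm_audit.jsonl")
append_audit_record(log_path, {"model": "phi3:mini", "temperature": 0.0, "response_len": 42})
append_audit_record(log_path, {"model": "phi3:mini", "temperature": 0.0, "response_len": 57})

records = read_audit_log(log_path)
print(len(records), records[-1]["response_len"])
```

In the notebook, the dict returned by `audit_meta` could be passed directly as `record`. Because only hashes and lengths are stored, the log can be kept without retaining sensitive prompt text.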

# Section 3.17 — Evaluation without using another LLM

Prefer deterministic checks: schema validation, key checks, length constraints, bullet counts. Avoid 'LLM judging LLM' as your only control.
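
As a sketch of such a deterministic check, the function below validates the shape requested in Section 3.10 (`summary` as a string, `risks` as exactly 3 strings) using plain Python. The function name `check_risk_summary` and the wording of the error messages are illustrative assumptions.

```python
def check_risk_summary(data: object) -> list:
    """Return a list of schema violations; an empty list means the data is valid."""
    if not isinstance(data, dict):
        return ["top level is not an object"]
    errors = []
    if set(data) != {"summary", "risks"}:
        errors.append(f"unexpected keys: {sorted(map(str, data))}")
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        errors.append("summary must be a non-empty string")
    risks = data.get("risks")
    if not isinstance(risks, list) or len(risks) != 3:
        errors.append("risks must be a list of exactly 3 items")
    elif not all(isinstance(r, str) for r in risks):
        errors.append("every risk must be a string")
    return errors

good = {"summary": "Banks face three main risks.", "risks": ["credit", "market", "operational"]}
bad = {"summary": "", "risks": ["credit", "market"]}

print(check_risk_summary(good))
print(check_risk_summary(bad))
```

The same input always produces the same verdict, which makes checks like this easy to test and to explain to an auditor.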

# Section 3.18 — Safety patterns: uncertainty and fallbacks

Include an uncertainty policy: if unsure, say 'Insufficient information'. Build fallbacks when validation fails.

In [None]:
system = "If you are unsure, respond exactly: Insufficient information. Do not guess."
task = "What is the exact USD/GBP rate at 09:31 UTC yesterday?"
prompt = f"{system}\n\n{task}"
print(call_llm(prompt, temperature=0.0)["response"])
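
A prompt-level policy is not enough on its own, because the model may ignore it. The sketch below shows one way to add a code-level fallback: accept an answer only if a validator approves it, and otherwise return the fixed phrase. The names `answer_or_fallback` and `looks_like_rate`, the validator rule, and the three stand-in models are all assumptions made so the example runs offline.

```python
import re
from typing import Callable

FALLBACK = "Insufficient information"

def answer_or_fallback(prompt: str, llm: Callable[[str], str],
                       is_valid: Callable[[str], bool], max_attempts: int = 2) -> str:
    """Return the first valid answer; otherwise return the fixed fallback string."""
    for _ in range(max_attempts):
        try:
            answer = llm(prompt).strip()
        except Exception:
            continue  # broad on purpose: any call failure counts as a failed attempt
        if is_valid(answer):
            return answer
    return FALLBACK

# Hypothetical validator: accept the fallback phrase or a bare decimal number
def looks_like_rate(s: str) -> bool:
    return s.rstrip(".") == FALLBACK or re.fullmatch(r"\d+\.\d+", s) is not None

# Stand-ins for the model
def chatty_llm(prompt: str) -> str:
    return "It was probably around 1.27."

def failing_llm(prompt: str) -> str:
    raise ConnectionError("endpoint unreachable")

def precise_llm(prompt: str) -> str:
    return "1.2731"

for fake in (chatty_llm, failing_llm, precise_llm):
    print(fake.__name__, "->", answer_or_fallback("What is the rate?", fake, looks_like_rate))
```

Note that a well-formed number can still be wrong: the validator checks shape, not truth. At temperature 0.0 a retry with the same prompt will usually repeat the same answer, so retries mainly help with transient errors.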

# Section 3.19 — Why LLMs alone are not enough

LLMs hallucinate, have finite context windows, and have no access to your internal data by default. This motivates grounding and retrieval techniques.

# Section 3.20 — Preparing for Module 4 (Grounding / RAG)

Mental model: LLM = language engine; RAG = evidence + memory. RAG retrieves relevant passages from trusted sources and places them in the prompt, which reduces hallucinations but does not eliminate them.
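
As a preview of the prompt side only, the sketch below builds a prompt that restricts the model to supplied passages. Retrieval itself is left to Module 4. The function name `build_grounded_prompt`, the instruction wording, and the policy snippets are made up for illustration.

```python
def build_grounded_prompt(question: str, passages: list) -> str:
    """Assemble a prompt that restricts the model to the supplied passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using ONLY the context below. Cite passage numbers like [1]. "
        "If the context does not contain the answer, respond exactly: Insufficient information.\n\n"
        f"CONTEXT:\n{numbered}\n\n"
        f"QUESTION:\n{question}"
    )

# Made-up policy snippets standing in for retrieved documents
passages = [
    "Policy 4.2: Wire transfers above 10,000 GBP require dual approval.",
    "Policy 4.3: Dual approval must be recorded in the payments system.",
]
print(build_grounded_prompt("When is dual approval required?", passages))
```

The resulting string can be sent through `call_llm` like any other prompt. Numbering the passages gives a validator something concrete to check, for example that every cited number exists.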

## Practice exercises (ungraded)
1. Force JSON output for a classification task and parse it.
2. Demonstrate one hallucination and explain why it happened.
3. Enrich a small DataFrame with LLM-generated summaries and add a cache.
4. Add a validator enforcing: exactly 3 bullets and <= 20 words each.

## Module summary
- LLMs generate **probabilistic text**, not guaranteed truth.
- Treat outputs as **untrusted** unless validated.
- Use **low temperature** for consistency and auditability.
- Prefer **structured outputs (JSON)** for automation.
- Design for **failures, retries, and fallbacks**.