# Exercise 2: Open Model + RAG vs. Large Model (GPT-4o Mini, No RAG)

**Goal:** Determine whether a larger closed model (GPT-4o Mini) without any retrieval can match or beat a small open model (Qwen 2.5 1.5B) augmented with RAG.

**Setup:**
- Model: `gpt-4o-mini`
- No tools, no retrieval — pure parametric knowledge only
- Each question is sent as a fresh, independent API call (not a conversation)
- Same query sets used in Exercise 1 (Model T manual + Congressional Record)

**Questions to answer after running:**
1. Does GPT-4o Mini do a better job than Qwen 2.5 1.5B at **avoiding hallucinations**?
2. Which questions does GPT-4o Mini answer **correctly**?
3. How does GPT-4o Mini's **pre-training cut-off date** relate to the age of each corpus?

## Setup: Install & Import

In [None]:
# Install the OpenAI Python SDK if not already present
try:
    ip = get_ipython()
    ip.run_line_magic('pip', 'install -q openai pandas')
except NameError:
    import subprocess, sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'openai', 'pandas'])

In [None]:
import os
import time
import pandas as pd
from openai import OpenAI

# =============================================================================
# SET YOUR OPENAI API KEY
# =============================================================================
# Option A – Colab secret (recommended: Secrets panel on the left sidebar)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✔ API key loaded from Colab secrets")
except Exception:
    # Option B – environment variable or hard-coded (not recommended for shared notebooks)
    OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY', 'YOUR_KEY_HERE')
    if OPENAI_API_KEY == 'YOUR_KEY_HERE':
        print("⚠ Please set your OpenAI API key!")
    else:
        print("✔ API key loaded from environment variable")

client = OpenAI(api_key=OPENAI_API_KEY)

# Model for this exercise
GPT_MODEL = "gpt-4o-mini"
print(f"Model: {GPT_MODEL}")

## Query Function: GPT-4o Mini, No RAG

Each call is **independent** — no shared conversation history, no tools, no retrieved context.

In [None]:
def gpt_direct_query(question: str, max_tokens: int = 512, temperature: float = 0.3) -> str:
    """
    Send a single question to GPT-4o Mini with NO tools and NO retrieval context.
    Each call is stateless — a fresh conversation every time.

    Parameters
    ----------
    question    : The user question (string).
    max_tokens  : Maximum tokens in the completion.
    temperature : Sampling temperature (0 = near-deterministic, higher = more varied).

    Returns
    -------
    The model's answer as a plain string.
    """
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. "
                    "Answer the user's question concisely using only your pre-trained knowledge. "
                    "Do not use any tools or search the web."
                )
            },
            {
                "role": "user",
                "content": question
            }
        ],
        max_tokens=max_tokens,
        temperature=temperature,
        # No 'tools' parameter → model has no tools available
    )
    # message.content may be None in some responses, so guard before stripping
    return (response.choices[0].message.content or "").strip()

## Query Sets (same as Exercise 1)

The **Model T** questions test knowledge from a ~1919 service manual.  
The **Congressional Record** questions test knowledge of very recent (Jan 2026) Congressional proceedings — well beyond GPT-4o Mini's training cut-off.

The `note` field is **only for your evaluation** — it is never sent to the model.

In [None]:
model_t_queries = [
    {
        "question": "How do I adjust the carburetor on a Model T?",
        "note": "Answer is in the Model T service manual (1919)"
    },
    {
        "question": "What is the correct spark plug gap for a Model T Ford?",
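        "note": "Manual compares the gap to the thickness of a smooth dime; the ~7/16 inch figure in the Exercise 1 output is far thicker than a dime, so verify it against the manual"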
    },
    {
        "question": "How do I fix a slipping transmission band?",
        "note": "Manual describes loosening lock-nut and adjusting screw, Cut No. 12"
    },
    {
        "question": "What oil should I use in a Model T engine?",
        "note": "Manual discusses oil level using pet cocks; does not specify modern viscosity grades"
    },
]

congressional_record_queries = [
    {
        "question": "What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?",
        "note": "From Congressional Record Jan 13, 2026 — after GPT-4o Mini training cut-off"
    },
    {
        "question": "What mistake did Elise Stefanovic make in Congress on January 23, 2026?",
        "note": "From Congressional Record Jan 23, 2026 — after GPT-4o Mini training cut-off"
    },
    {
        "question": "What is the purpose of the Main Street Parity Act?",
        "note": "Referenced in Congressional Record Jan 20, 2026 — may be partially after cut-off"
    },
    {
        "question": "Who in Congress has spoken for and against funding of pregnancy centers?",
        "note": "From Congressional Record Jan 21, 2026 — after GPT-4o Mini training cut-off"
    },
]

print(f"Model T queries      : {len(model_t_queries)}")
print(f"Congressional queries: {len(congressional_record_queries)}")

## Run GPT-4o Mini on All Queries

In [None]:
def run_experiment(queries: list, corpus_name: str, delay: float = 1.0) -> list:
    """
    Run GPT-4o Mini on a list of query dicts, collect results,
    and save a per-corpus CSV immediately after the last question.

    Parameters
    ----------
    queries     : List of dicts with 'question' and 'note' keys.
    corpus_name : Label for the corpus (e.g. 'MODEL_T').
    delay       : Seconds to wait between API calls (avoids rate-limit errors).

    Returns
    -------
    List of result dicts ready for a DataFrame.
    """
    results = []

    for i, item in enumerate(queries, start=1):
        question = item["question"]
        note     = item.get("note", "")

        print(f"\n[{corpus_name}] Q{i}: {question}")
        print("-" * 70)

        try:
            answer = gpt_direct_query(question)
        except Exception as e:
            answer = f"ERROR: {e}"

        print(answer)

        results.append({
            "corpus"               : corpus_name,
            "question"             : question,
            "gpt4o_mini_answer"    : answer,
            # --- Fill these in manually after reviewing the outputs ---
            "hallucinated"         : None,   # True / False
            "factually_correct"    : None,   # True / False / Partial
            "admits_uncertainty"   : None,   # True / False
            "note_for_evaluator"   : note,
        })

        # Small delay between calls to respect rate limits
        if i < len(queries):
            time.sleep(delay)

    # ── Save per-corpus CSV immediately after the last question ──
    corpus_csv = f"exercise2_{corpus_name.lower()}_results.csv"
    pd.DataFrame(results).to_csv(corpus_csv, index=False)
    print(f"\n✔ Saved {len(results)} rows → {corpus_csv}")

    return results

In [None]:
# ---- Run on Model T corpus ----
print("=" * 70)
print("CORPUS: MODEL T SERVICE MANUAL (1919)")
print("=" * 70)

model_t_results = run_experiment(model_t_queries, corpus_name="MODEL_T")

In [None]:
# ---- Run on Congressional Record corpus ----
print("=" * 70)
print("CORPUS: CONGRESSIONAL RECORD (Jan 2026)")
print("=" * 70)

cr_results = run_experiment(congressional_record_queries, corpus_name="CONGRESSIONAL_RECORD")

## Combine Results & Save to CSV

In [None]:
# ── Per-corpus CSVs were already saved by run_experiment() ──
# exercise2_model_t_results.csv
# exercise2_congressional_record_results.csv

# ── Merge into one combined CSV ──
all_results = model_t_results + cr_results
df = pd.DataFrame(all_results)

csv_path = "exercise2_gpt4o_mini_results.csv"
df.to_csv(csv_path, index=False)
print(f"✔ Combined CSV saved: {csv_path}")
print(f"  Rows: {len(df)}  |  Columns: {list(df.columns)}\n")

# Display answers
pd.set_option('display.max_colwidth', 120)
display(df[['corpus', 'question', 'gpt4o_mini_answer']])

## Manual Evaluation

After reviewing the outputs above, fill in the evaluation columns, either by hand or by running the auto-evaluation cells below and then spot-checking their labels.

| Column | Values | Meaning |
|---|---|---|
| `hallucinated` | True / False | Did the model invent specific facts not found in the source corpus? |
| `factually_correct` | True / False / Partial | Is the answer accurate compared to the ground truth (manual or CR)? |
| `admits_uncertainty` | True / False | Did the model say it wasn't sure or that the event may be outside its knowledge? |
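
If you prefer to label by hand, the columns can be filled in directly in pandas. The following is a minimal sketch, not part of the original exercise: `df_manual` is a two-row stand-in for the real results DataFrame, and the labels in `manual_labels` are placeholders rather than real judgments.

```python
import pandas as pd

# Hypothetical stand-in for the results DataFrame (the real one has 8 rows).
df_manual = pd.DataFrame({
    "question": ["placeholder question 1", "placeholder question 2"],
    "hallucinated": [None, None],
    "factually_correct": [None, None],
    "admits_uncertainty": [None, None],
})

# Placeholder labels keyed by row index; replace with your own review.
manual_labels = {
    0: {"hallucinated": False, "factually_correct": "Partial", "admits_uncertainty": True},
    1: {"hallucinated": True, "factually_correct": "False", "admits_uncertainty": False},
}

EVAL_COLS = ["hallucinated", "factually_correct", "admits_uncertainty"]
ALLOWED_CORRECT = {"True", "False", "Partial"}


def apply_manual_labels(frame: pd.DataFrame, labels: dict) -> pd.DataFrame:
    """Return a copy of `frame` with the evaluation columns filled from `labels`."""
    out = frame.copy()
    # Object dtype lets one column hold booleans or strings such as "Partial".
    out[EVAL_COLS] = out[EVAL_COLS].astype(object)
    for idx, row_labels in labels.items():
        if row_labels["factually_correct"] not in ALLOWED_CORRECT:
            raise ValueError(f"Row {idx}: unexpected factually_correct value")
        for col in EVAL_COLS:
            out.at[idx, col] = row_labels[col]
    return out


df_manual = apply_manual_labels(df_manual, manual_labels)
print(df_manual[EVAL_COLS])
```

Storing `factually_correct` as the strings "True", "False", and "Partial" keeps the three-way label unambiguous when the frame is written to CSV and read back.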

In [None]:
# =============================================================================
# AUTO-EVALUATION: Upload your results CSV and let GPT-4o Mini judge each answer
# =============================================================================
# Upload your previously saved results CSV (e.g. exercise2_model_t_results.csv
# or the combined exercise2_gpt4o_mini_results.csv). The cell will:
#   1. Load the CSV
#   2. Send each (question, answer) pair to GPT-4o Mini for evaluation
#   3. Fill in hallucinated / factually_correct / admits_uncertainty automatically
#   4. Save the evaluated CSV
# =============================================================================

import io, os

# ── Step 1: Load the results CSV ──────────────────────────────────────────────
# In Colab: triggers a file-upload dialog so you can pick your CSV.
# Locally : set RESULTS_CSV_PATH to the path of your file.
RESULTS_CSV_PATH = None  # e.g. "exercise2_model_t_results.csv"  ← set for local use

try:
    from google.colab import files as colab_files
    print("📂 Upload your results CSV (e.g. exercise2_model_t_results.csv):")
    uploaded = colab_files.upload()          # blocking dialog
    csv_filename = list(uploaded.keys())[0]
    df_eval = pd.read_csv(io.BytesIO(uploaded[csv_filename]))
    print(f"✔ Loaded '{csv_filename}'  ({len(df_eval)} rows)")
except Exception as e:  # includes ImportError when not running in Colab
    if RESULTS_CSV_PATH and os.path.exists(RESULTS_CSV_PATH):
        df_eval = pd.read_csv(RESULTS_CSV_PATH)
        csv_filename = RESULTS_CSV_PATH
        print(f"✔ Loaded '{csv_filename}'  ({len(df_eval)} rows)")
    else:
        raise FileNotFoundError(
            "No file uploaded and RESULTS_CSV_PATH not set. "
            "Set RESULTS_CSV_PATH to your CSV path and re-run."
        ) from e

print("Columns:", list(df_eval.columns))
display(df_eval[['corpus', 'question', 'gpt4o_mini_answer']].head())


In [None]:
# ── Step 2: Ground-truth hints for evaluation ─────────────────────────────────
# These are the RAG answers from Exercise 1, used here as reference facts
# to judge GPT-4o Mini's correctness. NOT sent during the main experiment.

GROUND_TRUTH = {
    "How do I adjust the carburetor on a Model T?": (
        "To adjust the carburetor on a Model T, refer to the instructions for your "
        "specific model year. Generally, adjustments are made using the dashboard "
        "adjustment mechanism, which controls the amount of fuel entering the carburetor "
        "based on speed and temperature. For a cold start, turn the adjustment mechanism "
        "slightly right to increase the air-fuel ratio, preventing knocking. In colder "
        "temperatures, turn slightly left to reduce gasoline in the mixture. "
        "Always consult the owner's manual for precise steps and safety guidelines."
    ),

    "What is the correct spark plug gap for a Model T Ford?": (
        "According to the context, the correct spark plug gap for a Model T Ford is "
        "approximately 7/16 inch — about the thickness of a smooth dime. "
        "The manual states: 'The distance between the sparking points should be 7/4', "
        "referring to this measurement. All wire connections to spark plugs, coil box, "
        "and commutator should at all times be kept in perfect contact."
    ),

    "How do I fix a slipping transmission band?": (
        "Loosen the lock nut at the tight side of the transmission cover, then adjust "
        "the bands using the adjusting screws (refer to Cut No. 12). Ensure the bands "
        "don't drag the drums when disengaging to prevent overheating the motor. "
        "After making adjustments, reassemble the transmission cover. The manual does "
        "not explicitly mention tightening the slow speed band directly — that step is "
        "implied rather than stated outright. Actual repair procedures may vary, so "
        "consulting an expert or official service manual is advisable."
    ),

    "What oil should I use in a Model T engine?": (
        "According to the context, the Model T manual states 'All other parts of the car "
        "are properly oiled when it leaves the factory,' but does not specify a particular "
        "oil type, brand, or viscosity grade. The manual only mentions maintaining the "
        "correct oil level. Any standard automotive oil appropriate for the year and make "
        "of the vehicle can be assumed, though specifics beyond what is stated are unknown."
    ),

    "What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?": (
        "In Congress on January 13, 2026, Mr. Flood recognized Mayor David Black from "
        "Nebraska as a paragon of public service and a remarkable steward of the city's "
        "affairs. He highlighted Black's contributions over nearly 17.5 years of service, "
        "emphasizing how he guided the city into a thriving economic position. Mr. Flood "
        "also noted Black's dedication to community engagement and outreach, and mentioned "
        "that Black would not seek re-election, concluding an impressive career."
    ),

    "What mistake did Elise Stefanovic make in Congress on January 23, 2026?": (
        "According to the context, Elise Stefanovic tried to overturn the democratic "
        "process by overruling repeated lawful orders from Capitol Police, forcibly entering "
        "the Speaker's Lobby, attempting to break a window, and forcing her way through a "
        "barricaded door while armed. This behavior was deemed inappropriate and illegal. "
        "Key points: she ignored lawful orders from Capitol Police, pushed to the front of "
        "the crowd disrupting orderly flow, broke a window, and attempted to force entry "
        "while armed."
    ),

    "What is the purpose of the Main Street Parity Act?": (
        "The purpose of the Main Street Parity Act is to modify the criteria for loans "
        "for plant acquisition, construction, conversion, or expansion by setting an equity "
        "requirement at 10 percent, aligning these loans with the 504 standardization "
        "programs. The act aims to provide small businesses with more equitable access to "
        "capital while reflecting modern economic and technological realities."
    ),

    "Who in Congress has spoken for and against funding of pregnancy centers?": (
        "Ms. Dexter of Oregon spoke strongly against the funding of pregnancy centers, "
        "while Mr. Schneider spoke in favor of them. Ms. Dexter stated that Democrats would "
        "make claims about pregnancy resource centers, and Mr. Schneider argued that his "
        "Republican colleagues brought the bill to the floor to funnel money to centers he "
        "called anti-abortion. The Biden administration also made efforts to block states "
        "from partnering with pregnancy centers."
    ),
}


def auto_evaluate(question: str, answer: str) -> dict:
    """
    Ask GPT-4o Mini to evaluate a (question, answer) pair against ground truth.
    Returns a dict with hallucinated, factually_correct, admits_uncertainty.
    """
    ground_truth = GROUND_TRUTH.get(question, "No ground truth available.")

    prompt = f"""You are an impartial evaluator. A model was asked a question and gave an answer.
Compare the answer to the ground truth and return a JSON object with exactly these three fields:

  hallucinated        : true if the answer states specific facts that are invented or not supported
                        by the ground truth; false otherwise.
  factually_correct   : "True" if fully correct, "Partial" if partly right, "False" if wrong.
  admits_uncertainty  : true if the answer explicitly says it is unsure, doesn't know, or that
                        the event may be outside its knowledge; false otherwise.

QUESTION: {question}

GROUND TRUTH: {ground_truth}

MODEL ANSWER: {answer}

Respond with ONLY valid JSON. Example:
{{"hallucinated": false, "factually_correct": "Partial", "admits_uncertainty": true}}"""

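    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0,
        # JSON mode: the reply is a single JSON object (the prompt must mention JSON, which it does)
        response_format={"type": "json_object"},
    )
    # message.content may be None in some responses, so guard before stripping
    raw = (response.choices[0].message.content or "").strip()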

    import json as _json
    try:
        return _json.loads(raw)
    except Exception:
        return {"hallucinated": None, "factually_correct": raw, "admits_uncertainty": None}


# ── Step 3: Run auto-evaluation on every row ───────────────────────────────────
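print(f"Evaluating {len(df_eval)} answers with GPT-4o Mini...\n")

# The evaluation columns load from CSV as all-NaN floats; cast them to object
# so booleans and strings such as "Partial" can be stored without dtype errors.
eval_cols = ['hallucinated', 'factually_correct', 'admits_uncertainty']
df_eval[eval_cols] = df_eval[eval_cols].astype(object)

for i, row in df_eval.iterrows():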
    result = auto_evaluate(row['question'], row['gpt4o_mini_answer'])
    df_eval.at[i, 'hallucinated']       = result.get('hallucinated')
    df_eval.at[i, 'factually_correct']  = result.get('factually_correct')
    df_eval.at[i, 'admits_uncertainty'] = result.get('admits_uncertainty')

    corpus  = row['corpus']
    q_short = row['question'][:60]
    print(f"[{corpus}] {q_short}...")
    print(f"  hallucinated={result.get('hallucinated')}  "
          f"correct={result.get('factually_correct')}  "
          f"admits_uncertainty={result.get('admits_uncertainty')}")
    time.sleep(0.5)

# ── Step 4: Save evaluated CSV ────────────────────────────────────────────────
evaluated_csv = "exercise2_evaluated_results.csv"
df_eval.to_csv(evaluated_csv, index=False)
print(f"\n✔ Evaluated CSV saved: {evaluated_csv}")

# Also update the main df so the summary stats cell works
df = df_eval.copy()
csv_path = evaluated_csv

display(df_eval[['corpus', 'question', 'hallucinated', 'factually_correct', 'admits_uncertainty']])



## Summary Statistics

In [None]:
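# Only run after the evaluation columns are filled in (by hand or by the auto-evaluation cell)
def count_true(series):
    # Labels may be booleans or strings ("True"), so compare their text form.
    # The original `== True` test never matched the string "True" returned by the evaluator.
    return int(series.map(lambda v: str(v).strip().lower() == 'true').sum())

if df['hallucinated'].notna().any():
    for corpus in df['corpus'].unique():
        sub = df[df['corpus'] == corpus]
        print(f"\n{'='*50}")
        print(f"Corpus: {corpus}  (n={len(sub)})")
        print(f"{'='*50}")
        print(f"  Hallucinated        : {count_true(sub['hallucinated'])} / {len(sub)}")
        print(f"  Factually correct   : {count_true(sub['factually_correct'])} / {len(sub)}")
        print(f"  Admits uncertainty  : {count_true(sub['admits_uncertainty'])} / {len(sub)}")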
else:
    print("Fill in the evaluation columns first (cell above).")

## Discussion Prompts

Answer these in your write-up or as Markdown cells below.

---

### 1. Hallucination comparison: GPT-4o Mini vs Qwen 2.5 1.5B (no RAG)

- For the **Model T** queries, which model hallucinated more confidently (gave specific but wrong facts)? Which was more likely to hedge?
- For the **Congressional Record** queries, did GPT-4o Mini admit it couldn't know recent 2026 proceedings, or did it confabulate plausible-sounding congressional speech?
- Overall: does a larger model without RAG outperform a smaller model without RAG on avoiding hallucinations?

---

### 2. Which questions does GPT-4o Mini answer correctly?

- List the question numbers it got **fully correct**, **partially correct**, and **wrong**.
- Were the correctly-answered questions ones where the answer is **common general knowledge** (e.g., well-documented automotive trivia) vs. **corpus-specific details** (e.g., exact cut No. references in the manual)?
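
To build the list of question numbers, one option is to bucket rows by their `factually_correct` label. This is a sketch under assumptions: questions are numbered 1-based in row order, and the small `example` frame holds placeholder labels, not real results. The helper name `bucket_by_correctness` is illustrative only.

```python
import pandas as pd


def bucket_by_correctness(frame: pd.DataFrame) -> dict:
    """Group 1-based question numbers by their factually_correct label."""
    buckets = {"true": [], "partial": [], "false": [], "unlabelled": []}
    for number, value in enumerate(frame["factually_correct"], start=1):
        # Labels may be booleans or strings, so normalise to lowercase text.
        label = str(value).strip().lower()
        key = label if label in ("true", "partial", "false") else "unlabelled"
        buckets[key].append(number)
    return buckets


# Placeholder labels covering each case (bool, string, missing).
example = pd.DataFrame({"factually_correct": [True, "Partial", "False", None]})
buckets = bucket_by_correctness(example)
print(buckets)
```

On the real data you would call `bucket_by_correctness(df)` once per corpus, for example on `df[df['corpus'] == 'MODEL_T']`.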

---

### 3. Training cut-off vs. corpus age

GPT-4o Mini's training data has a cut-off of roughly **October 2023** (OpenAI's stated cut-off for this model family).

| Corpus | Date of material | Within cut-off? |
|---|---|---|
| Model T Service Manual | 1919 | ✔ Yes (over 100 years old, well documented online) |
| Congressional Record (used) | Jan 2026 | ✗ No (about 27 months after cut-off) |

- For the **Model T** corpus: the manual is old and widely digitised. Does GPT-4o Mini show evidence of having absorbed this information into its weights? Where does it still err?
- For the **Congressional Record** corpus: all four queries reference events in January 2026 — well after the cut-off. What is the expected failure mode? Did the model hallucinate, refuse, or correctly say it doesn't know?
- **Conclusion:** In what scenario is GPT-4o Mini (no RAG) competitive with Qwen 2.5 1.5B + RAG? When does it clearly lose?
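
The gap between the cut-off and each corpus can be worked out directly. This small sketch assumes the October 2023 cut-off stated above and treats each corpus date as the first of the month; the day of month is an arbitrary choice that does not affect the month count.

```python
from datetime import date


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (negative if `later` comes first)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


CUTOFF = date(2023, 10, 1)  # October 2023, as stated above; day is arbitrary

corpora = {
    "Model T Service Manual (1919)": date(1919, 1, 1),      # month is arbitrary
    "Congressional Record (Jan 2026)": date(2026, 1, 1),
}

for name, corpus_date in corpora.items():
    gap = months_between(CUTOFF, corpus_date)
    status = "after cut-off" if gap > 0 else "within cut-off"
    print(f"{name}: {gap:+d} months relative to cut-off ({status})")
```

A positive gap means the material postdates the model's training data, so any specific answer about it cannot come from parametric knowledge of the events themselves.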


In [None]:
# Optional: download all CSVs in Colab
try:
    from google.colab import files
    import os
    for fname in [
        "exercise2_model_t_results.csv",
        "exercise2_congressional_record_results.csv",
        csv_path,   # combined
    ]:
        if os.path.exists(fname):
            files.download(fname)
            print(f"⬇ Downloading: {fname}")
        else:
            print(f"⚠ Not found (skipping): {fname}")
except ImportError:
    print("Results saved locally:")
    for fname in [
        "exercise2_model_t_results.csv",
        "exercise2_congressional_record_results.csv",
        csv_path,
    ]:
        print(f"  {fname}")
