# RAG Evaluation - BMW Press Releases

This notebook implements a Retrieval-Augmented Generation (RAG) system using:
- **Model**: Qwen2.5-0.5B (same as Full FT and LoRA)
- **Embeddings**: sentence-transformers/all-MiniLM-L6-v2
- **Vector Store**: FAISS
- **Sources**: All BMW press release texts from `data/raw/train/` and `data/raw/eval/`

We'll compare RAG performance against Full FT and LoRA on the same evaluation questions.


## Step 1: Imports and Setup


In [69]:
import json
import re
from pathlib import Path
from typing import List, Dict

import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import faiss
from tqdm import tqdm


## Step 2: Load All Articles and Extract Text

We load every article from both the train and eval folders as full JSON records. The text field is extracted in Step 3.


In [70]:
def load_articles_from_folder(folder_path: str) -> List[Dict]:
    """
    Load all JSON articles from a folder.
    Returns list of dicts with article metadata.
    """
    articles = []
    folder = Path(folder_path)
    
    for json_file in sorted(folder.glob("*.json")):
        with open(json_file, 'r', encoding='utf-8') as f:
            article = json.load(f)
            articles.append(article)
    
    return articles

# Load all articles
train_articles = load_articles_from_folder("data/raw/train")
eval_articles = load_articles_from_folder("data/raw/eval")

# Combine all articles for RAG knowledge base
all_articles = train_articles + eval_articles


## Step 3: Prepare Documents for RAG

Each article becomes a document in our vector store. We'll store:
- The text content (for retrieval)
- Metadata (id, title, date) for reference


In [71]:
# Prepare documents: extract text and metadata
documents = []
metadata = []

for article in all_articles:
    # Extract text (the actual content we want to search)
    text = article.get("text", "").strip()
    
    # Skip articles whose text is missing or shorter than 50 characters
    if len(text) < 50:
        continue
    
    documents.append(text)
    metadata.append({
        "id": article.get("id", "unknown"),
        "title": article.get("title", ""),
        "date": article.get("date", ""),
        "url": article.get("url", "")
    })
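
Each article is embedded as a single document. The all-MiniLM-L6-v2 model card notes that input longer than 256 word pieces is truncated by default, so the tail of a long press release may not contribute to its embedding. One possible workaround is to split each text into overlapping word windows and index those instead. The sketch below is an illustration only: `chunk_words`, the window size, and the overlap are assumptions and not part of this notebook.

```python
from typing import List


def chunk_words(text: str, size: int = 150, overlap: int = 30) -> List[str]:
    """Split text into overlapping windows of `size` words (hypothetical helper)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Tiny demonstration on ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
```

If chunks were indexed this way, each chunk would need its own metadata entry pointing back to the source article so that retrieved chunks can still be attributed.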


## Step 4: Create Embeddings and FAISS Index
We'll use `sentence-transformers/all-MiniLM-L6-v2` for embeddings:
- Fast and efficient (384-dimensional embeddings)
- Good for semantic search
- Works well on CPU



In [72]:
# Load embedding model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Create embeddings for all documents
embeddings = embedding_model.encode(
    documents,
    show_progress_bar=True,
    convert_to_numpy=True
)


Batches: 100%|██████████| 8/8 [00:01<00:00,  5.76it/s]


In [73]:
# Create FAISS index
dimension = embeddings.shape[1]
# Exact (brute-force) search by squared L2 distance. The ranking matches
# cosine similarity only when the embeddings are unit-normalized.
index = faiss.IndexFlatL2(dimension)

# Add embeddings to index
index.add(embeddings)
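
`IndexFlatL2` reports squared Euclidean distances. For unit-length vectors this produces the same ranking as cosine similarity, because ‖a − b‖² = 2 − 2·cos(a, b). The sketch below checks that identity in plain Python on made-up 2-D vectors. Whether the embeddings in this notebook are unit-length depends on the model configuration, which can be verified with `np.linalg.norm(embeddings, axis=1)`.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: List[float], b: List[float]) -> float:
    # For unit vectors the dot product equals the cosine similarity
    return sum(x * y for x, y in zip(a, b))


# Illustrative vectors, not real embeddings
query = normalize([1.0, 0.2])
docs = [normalize(v) for v in ([0.9, 0.1], [0.1, 1.0], [-1.0, 0.0])]

for d in docs:
    assert math.isclose(squared_l2(query, d), 2 - 2 * dot(query, d), abs_tol=1e-12)

# Ascending distance and descending cosine give the same order
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
print(by_l2, by_cos)
```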


## Step 5: Implement RAG Retrieval Function


In [None]:
def retrieve_relevant_docs(query: str, top_k: int = 3) -> List[Dict]:
    """
    Retrieve top-k most relevant documents for a query.
    
    Returns:
        List of dicts with 'rank', 'text', 'metadata', and 'distance'
        (squared L2 distance; lower means more similar)
    """
    # Encode query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)
    
    # Search in FAISS index
    distances, indices = index.search(query_embedding, top_k)
    
    # Collect results
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        # FAISS pads the result with -1 when fewer than top_k vectors are indexed
        if idx < 0:
            continue
        results.append({
            "rank": len(results) + 1,
            "text": documents[idx],
            "metadata": metadata[idx],
            "distance": float(dist)
        })
    
    return results

# Test retrieval
test_query = "Which team won the 24 Hours of Nürburgring?"
test_results = retrieve_relevant_docs(test_query, top_k=2)


## Step 6: Load Qwen2.5-0.5B for Generation
We use the **base model** (not fine-tuned) to isolate RAG performance.



In [75]:
BASE_MODEL = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(
    BASE_MODEL,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    trust_remote_code=True,
    torch_dtype=torch.float32
)

model.eval()
model.to("cpu")


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2

## Step 7: RAG Pipeline
Complete RAG pipeline:
1. Retrieve top-k relevant documents
2. Construct prompt with retrieved context
3. Generate answer using Qwen2.5-0.5B



In [76]:
def rag_answer(question: str, top_k: int = 3, max_new_tokens: int = 150) -> Dict:
    """RAG pipeline: retrieve relevant documents and generate answer."""
    retrieved_docs = retrieve_relevant_docs(question, top_k=top_k)
    
    context_parts = []
    for i, doc in enumerate(retrieved_docs, 1):
        title = doc['metadata']['title']
        text = doc['text']
        text_preview = text[:500] + "..." if len(text) > 500 else text
        context_parts.append(f"[{i}] {title}\n{text_preview}")
    
    context = "\n\n".join(context_parts)
    
    prompt = (
        f"Based on the following BMW press articles, answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1800)
    input_length = inputs['input_ids'].shape[1]
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # sampled decoding: answers vary between runs unless a seed is set
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens. Splitting the full decoded text on
    # "Answer:" would discard part of the output whenever the model emits that
    # marker itself (for example when it invents a follow-up question).
    generated_ids = outputs[0][input_length:]
    answer = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
    
    answer = answer.replace("<|assistant|>", "").replace("<|user|>", "").strip()
    
    if not answer or len(answer) < 3:
        answer = "[Unable to generate answer from retrieved context]"
    
    return {
        "answer": answer,
        "retrieved_docs": retrieved_docs,
        "prompt": prompt
    }

# Test RAG pipeline
test_question = "Which team won the 24 Hours of Nürburgring?"
test_result = rag_answer(test_question, top_k=2)
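
A base model prompted with a `Question: ... Answer:` pattern often keeps writing after the first answer, for instance by inventing another question. One possible post-processing step is to cut the generated continuation at the first such marker. The sketch below is illustrative: the helper name `trim_continuation` and the list of stop markers are assumptions, not part of the pipeline above.

```python
from typing import Sequence


def trim_continuation(
    generated: str,
    stop_markers: Sequence[str] = ("\nQuestion:", "\nAnswer:", "\n\n"),
) -> str:
    """Cut generated text at the earliest stop marker (hypothetical helper)."""
    text = generated.lstrip()  # ignore leading whitespace before searching
    cut = len(text)
    for marker in stop_markers:
        pos = text.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut].strip()


# Made-up continuation, not actual model output
raw = " ROWE Racing won the race.\nQuestion: Who drove the car?\nAnswer: ..."
print(trim_continuation(raw))
```

Applied to the text decoded from the generated tokens, this would keep only the first answer and leave the retrieval and prompt construction unchanged.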


## Step 8: Load Evaluation Questions

We'll use the same eval questions as Full FT and LoRA for fair comparison.


In [77]:
eval_questions = [
    {
        "question": "What type of document contains BMW 5 Series Sedan specifications valid from 03/2025?",
        "expected_answer": "It provides the specifications of the BMW 5 Series Sedan.",
        "topic": "BMW 5 Series Specs"
    },
    {
        "question": "The BMW 5 Series Sedan specifications are valid from which specific month in 2025?",
        "expected_answer": "They are valid from March 2025.",
        "topic": "BMW 5 Series Specs"
    },
    {
        "question": "Is the attached BMW 5 Series Sedan specifications valid from 03/2025 about an event, vehicle launch, or technical documentation?",
        "expected_answer": "It is about technical documentation.",
        "topic": "BMW 5 Series Specs"
    },
    {
        "question": "Which racing team won the 24 Hours of Nürburgring with car number 98 BMW M4 GT3 EVO driven by Kelvin van der Linde, Augusto Farfus, Jesse Krohn, and Raffaele Marciello?",
        "expected_answer": "ROWE Racing won the race.",
        "topic": "Motorsport - Nürburgring"
    },
    {
        "question": "Which BMW race car model number 98 secured BMW's 21st overall victory at the 24 Hours of Nürburgring?",
        "expected_answer": "The BMW M4 GT3 EVO.",
        "topic": "Motorsport - Nürburgring"
    },
    {
        "question": "What made the ROWE Racing victory at the Nürburgring with drivers Kelvin van der Linde and Augusto Farfus special, marking which number overall victory for BMW?",
        "expected_answer": "It was BMW's 21st overall victory at the 24 Hours of Nürburgring after a comeback.",
        "topic": "Motorsport - Nürburgring"
    },
    {
        "question": "Which anniversary of the BMW 3 Series is being celebrated by revising the '40 Years of BMW 3 Series' press kit from 2015?",
        "expected_answer": "The 50th anniversary of the BMW 3 Series.",
        "topic": "BMW 3 Series Heritage"
    },
    {
        "question": "What was the title of the 2015 BMW publication that was revised for the 3 Series 50th anniversary?",
        "expected_answer": "The press kit \"40 Years of BMW 3 Series\" from 2015.",
        "topic": "BMW 3 Series Heritage"
    },
    {
        "question": "Does the comprehensively revised and updated BMW 3 Series 50th anniversary press kit mainly look to the past or announce a new vehicle?",
        "expected_answer": "It mainly looks to the past.",
        "topic": "BMW 3 Series Heritage"
    },
    {
        "question": "Which British designer with the motto 'Every day is a new beginning' collaborated with MINI on a special edition with Nottingham Green accents?",
        "expected_answer": "Paul Smith.",
        "topic": "MINI Design"
    },
    {
        "question": "At which event in Tokyo on October 29 did the MINI Paul Smith Edition have its world premiere?",
        "expected_answer": "At the Japan Mobility Show in Tokyo.",
        "topic": "MINI Design"
    },
    {
        "question": "What design philosophy described as 'Classic with a twist' defines the MINI Paul Smith Edition with its Signature Stripe?",
        "expected_answer": "A classic British design with playful and unexpected twists.",
        "topic": "MINI Design"
    },
    {
        "question": "In which city is the BMW Museum located that received over 840,000 visitors in the past year?",
        "expected_answer": "Munich.",
        "topic": "BMW Museum"
    },
    {
        "question": "What is the main purpose of the BMW Museum that was expanded from 2004 to 2008 and features seven exhibition houses?",
        "expected_answer": "To showcase more than 100 years of BMW car and motorcycle history.",
        "topic": "BMW Museum"
    },
    {
        "question": "Which age group is specifically targeted by the BMW Junior Museum workshops held in German and English?",
        "expected_answer": "Children and teenagers.",
        "topic": "BMW Museum"
    }
]


## Step 9: Evaluate RAG on All Eval Questions


In [78]:
# Run RAG evaluation on all questions
rag_results = []

for i, q_data in enumerate(eval_questions):
    question = q_data["question"]
    expected = q_data["expected_answer"]
    topic = q_data["topic"]
    
    # Get RAG answer (retrieves docs + generates)
    rag_result = rag_answer(question, top_k=3, max_new_tokens=100)
    
    # Store detailed results
    rag_results.append({
        "id": i,
        "question": question,
        "expected_answer": expected,
        "rag_answer": rag_result["answer"],
        "retrieved_doc_ids": [doc["metadata"]["id"] for doc in rag_result["retrieved_docs"]],
        "retrieved_titles": [doc["metadata"]["title"][:60] + "..." for doc in rag_result["retrieved_docs"]],
        "topic": topic
    })
    
    # Print progress
    print(f"\n{'='*60}")
    print(f"Question {i+1}/{len(eval_questions)} [{topic}]")
    print(f"{'='*60}")
    print(f"Q: {question}")
    print(f"\nRetrieved articles:")
    for j, doc in enumerate(rag_result["retrieved_docs"], 1):
        print(f"  {j}. {doc['metadata']['id']}: {doc['metadata']['title'][:50]}...")
    print(f"\nRAG Answer:")
    print(f"  {rag_result['answer']}")
    print(f"\nExpected Answer:")
    print(f"  {expected}")



Question 1/15 [BMW 5 Series Specs]
Q: What type of document contains BMW 5 Series Sedan specifications valid from 03/2025?

Retrieved articles:
  1. T0443474EN: Specifications of the BMW 5 Series Sedan, valid fr...
  2. T0451220EN: 50 years of the BMW 3 series....
  3. T0451071EN: BMW Motorrad model revision measures for the model...

RAG Answer:
  According to the article "Specifications of theBMW 5SeriesSedan," which provides details about the 2-year-old (valid until March 28, 23) specification of the current BMW 2-Series sedan available from January 2nd through February 3rd, 4th generation Miatzas with a powertrain consisting primarily of an inline six-cylinder engine derived from its original F-series platform, this implies that such documentation could refer specifically to the latest version or

Expected Answer:
  It provides the specifications of the BMW 5 Series Sedan.

Question 2/15 [BMW 5 Series Specs]
Q: The BMW 5 Series Sedan specifications are valid from which specific mo

## Step 10: Save RAG Results for Comparison


In [79]:
# Save detailed results to JSONL
output_path = Path("rag_eval_results.jsonl")

with open(output_path, "w", encoding="utf-8") as f:
    for result in rag_results:
        f.write(json.dumps(result, ensure_ascii=False) + "\n")
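
Each saved record pairs `rag_answer` with `expected_answer`, but no score is computed in this notebook. One common way to compare two short answers is token-level F1 over lowercased word tokens. The sketch below is an assumed metric for illustration, not necessarily the one used for the Full FT and LoRA runs; `tokens` and `token_f1` are hypothetical helpers.

```python
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (hypothetical helper)."""
    pred, ref = tokens(prediction), tokens(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss
        return float(pred == ref)

    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up prediction scored against one expected answer from the eval set
print(round(token_f1("The race was won by ROWE Racing.", "ROWE Racing won the race."), 3))
```

Long, rambling answers are penalized through precision, so a score like this would likely reward the short expected answers more than exact-match accuracy would.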


## Step 11: Inspect Example Results


In [80]:
# Show three example results (questions 1, 3, and 9) with full details
for i in [0, 2, 8]:
    if i >= len(rag_results):
        continue
    
    result = rag_results[i]
    print(f"\n{'='*80}")
    print(f"Example {i+1}: {result['topic']}")
    print(f"{'='*80}")
    print(f"\nQuestion: {result['question']}")
    
    print(f"\nRetrieved Articles (Top-3):")
    for j, (doc_id, title) in enumerate(zip(result['retrieved_doc_ids'], result['retrieved_titles']), 1):
        print(f"  {j}. [{doc_id}] {title}")
    
    print(f"\nRAG Answer:")
    print(f"  {result['rag_answer']}")
    
    print(f"\nExpected Answer:")
    print(f"  {result['expected_answer']}")



Example 1: BMW 5 Series Specs

Question: What type of document contains BMW 5 Series Sedan specifications valid from 03/2025?

Retrieved Articles (Top-3):
  1. [T0443474EN] Specifications of the BMW 5 Series Sedan, valid from 03/2025...
  2. [T0451220EN] 50 years of the BMW 3 series....
  3. [T0451071EN] BMW Motorrad model revision measures for the model year  202...

RAG Answer:
  According to the article "Specifications of theBMW 5SeriesSedan," which provides details about the 2-year-old (valid until March 28, 23) specification of the current BMW 2-Series sedan available from January 2nd through February 3rd, 4th generation Miatzas with a powertrain consisting primarily of an inline six-cylinder engine derived from its original F-series platform, this implies that such documentation could refer specifically to the latest version or

Expected Answer:
  It provides the specifications of the BMW 5 Series Sedan.

Example 3: BMW 5 Series Specs

Question: Is the attached BMW 5 Series Seda