# Session 2 – Minimal RAG Pipeline

Build a lightweight Retrieval-Augmented Generation pipeline using Foundry Local and sentence-transformers embeddings.


### Explanation: Installing Dependencies
Installs the minimal packages for this pipeline:
- `foundry-local-sdk` for local model management (not needed if only the `BASE_URL` path is used).
- `openai` for the OpenAI-compatible client used to call the local endpoint.
- `sentence-transformers` for embeddings.
- `numpy` for vector math.
Safe to re-run; skip this step if the environment already has these packages.
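
To decide whether the install step can be skipped, a quick check along these lines may help. It is a minimal sketch that only tests whether each import name can be found; `missing_packages` is a hypothetical helper, and the import names are taken from the import cell of this notebook.

```python
import importlib.util

# Import names (not pip names) of the packages this notebook relies on.
REQUIRED = ('foundry_local', 'openai', 'sentence_transformers', 'numpy')

def missing_packages(names=REQUIRED):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages()
print('Missing packages:', missing or 'none')
```

An empty list suggests the `pip install` cell can be skipped; it does not verify package versions.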


# Scenario
This notebook builds a minimal Retrieval-Augmented Generation (RAG) pipeline that runs entirely locally:
- Connects to a Foundry Local model (auto-detected via the SDK, or via `BASE_URL`).
- Creates a small in-memory document corpus and embeds it with Sentence Transformers.
- Implements naive vector similarity search (no external index) for transparency.
- Performs grounded generation requests through the OpenAI-compatible client, trying chat completions (`/v1/chat/completions`) first and falling back to legacy completions (`/v1/completions`).
- Provides an `answer()` helper that retries with an alternate form of the model id when the first attempts fail.

Use this as a diagnostic template before scaling to larger corpora, persistent vector stores, or evaluation metrics (see the RAG evaluation notebook).


In [5]:
# Install dependencies
!pip install -q foundry-local-sdk openai sentence-transformers numpy

### Explanation: Core Imports
Loads the core libraries needed for embedding and local inference:
- `SentenceTransformer` for dense vector embeddings.
- `FoundryLocalManager` (optional) for managing the local service.
- `OpenAI` client for familiar object shapes (the connection cell also probes `/v1/models` over plain HTTP).


In [6]:
import os, numpy as np
from sentence_transformers import SentenceTransformer
from foundry_local import FoundryLocalManager
from openai import OpenAI

### Explanation: Toy Document Corpus
Defines a small in-memory list of domain statements. This keeps iteration fast and controlled, so the focus stays on pipeline mechanics (retrieval + grounding) rather than data processing.


In [7]:
DOCS = [
    'Foundry Local provides an OpenAI-compatible local inference endpoint.',
    'Retrieval Augmented Generation improves answer grounding by injecting relevant context.',
    'Edge AI reduces latency and preserves privacy via local execution.',
    'Small Language Models can offer competitive quality with lower resource usage.',
    'Vector similarity search retrieves semantically relevant documents.'
]

### Explanation: Connection, Model Selection & Embedding Initialization
Robust connection logic:
1. Uses an optional `BASE_URL` (pure HTTP path) or falls back to `FoundryLocalManager`.
2. Probes `/v1/models` and picks the best-matching concrete model id (exact alias > canonical family > first available).
3. Wraps the connection attempt in `try`/`except`. The cell as shown makes a single attempt; the retry loop with configurable `FOUNDRY_CONNECT_RETRIES` and delay does not appear in it.
4. Initializes SentenceTransformer embeddings (normalized vectors) for the small corpus.
5. Records the OpenAI SDK version for reproducibility.
If the service is unavailable, the cell prints guidance for starting it instead of crashing.
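
A retry loop with configurable `FOUNDRY_CONNECT_RETRIES` and a delay could look roughly like the sketch below. This is an illustration under assumptions, not the notebook's own code: `with_retries` is a hypothetical helper, and `FOUNDRY_CONNECT_DELAY` is an assumed variable name, since the text names only the retries variable.

```python
import os
import time

def with_retries(fn, retries=None, delay=None, sleep=time.sleep):
    """Call fn() until it succeeds or the attempt budget is used up."""
    if retries is None:
        retries = int(os.getenv('FOUNDRY_CONNECT_RETRIES', '3'))
    if delay is None:
        # Assumed variable name; the notebook text does not name the delay setting.
        delay = float(os.getenv('FOUNDRY_CONNECT_DELAY', '1.0'))
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except Exception as err:
            last_err = err
            print(f'[WARN] attempt {attempt}/{retries} failed: {err}')
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    raise ConnectionError(f'gave up after {retries} attempts') from last_err
```

In the connection cell, the `/v1/models` probe could then be wrapped as `with_retries(lambda: requests.get(endpoint + '/v1/models', timeout=5))`, keeping the existing `try`/`except` around it so the start-up guidance still prints on final failure.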


In [12]:
import os, time, json, requests, re
# Native Foundry Local SDK preferred; fall back to explicit BASE_URL if provided
os.environ.setdefault('FOUNDRY_LOCAL_ALIAS', 'phi-4-mini')
alias = os.getenv('FOUNDRY_LOCAL_ALIAS', os.getenv('TARGET_MODEL', 'phi-4-mini'))
base_url_env = os.getenv('BASE_URL', '').strip()
manager = None
client = None
endpoint = None

def _canonicalize(model_id: str) -> str:
    """Remove CUDA suffix and version tags from model name."""
    b = model_id.split(':')[0]
    return re.sub(r'-cuda.*', '', b)

try:
    if base_url_env:
        # Allow user override; normalize by removing trailing / and optional /v1
        root = base_url_env.rstrip('/')
        if root.endswith('/v1'):
            root = root[:-3]
        endpoint = root
        print(f'[INFO] Using explicit BASE_URL override: {endpoint}')
    else:
        from foundry_local import FoundryLocalManager
        manager = FoundryLocalManager(alias)
        # Manager endpoint already includes /v1 - remove it for our base
        raw_endpoint = manager.endpoint.rstrip('/')
        if raw_endpoint.endswith('/v1'):
            endpoint = raw_endpoint[:-3]
        else:
            endpoint = raw_endpoint
        print(f'[OK] Foundry Local manager endpoint: {manager.endpoint} | base={endpoint} | alias={alias}')
    
    # Probe models list (endpoint does NOT include /v1 here)
    models_resp = requests.get(endpoint + '/v1/models', timeout=5)
    models_resp.raise_for_status()
    payload = models_resp.json() if models_resp.headers.get('content-type','').startswith('application/json') else {}
    data = payload.get('data', []) if isinstance(payload, dict) else []
    ids = [m.get('id') for m in data if isinstance(m, dict)]
    
    # Select best matching model
    chosen = None
    if alias in ids:
        chosen = alias
    else:
        for mid in ids:
            if _canonicalize(mid) == _canonicalize(alias):
                chosen = mid
                break
    if not chosen and ids:
        chosen = ids[0]
    model_name = chosen or alias
    
    # Initialize OpenAI client
    from openai import OpenAI as _OpenAI
    client = _OpenAI(
        base_url=endpoint + '/v1',  # OpenAI client needs full base URL with /v1
        api_key=(getattr(manager, 'api_key', None) or os.getenv('API_KEY') or 'not-needed')
    )
    print(f'[OK] Model resolved: {model_name} (total_models={len(ids)})')
except Exception as e:
    print('[ERROR] Failed to initialize Foundry Local client:', e)
    client = None
    model_name = alias

# Expose BASE for downstream compatibility (without /v1)
BASE = endpoint

# Embeddings setup
embed_model_name = os.getenv('EMBED_MODEL', 'sentence-transformers/all-MiniLM-L6-v2')
try:
    from sentence_transformers import SentenceTransformer
    embedder = SentenceTransformer(embed_model_name)
    doc_emb = embedder.encode(DOCS, convert_to_numpy=True, normalize_embeddings=True)
    print(f'[OK] Embedded {len(DOCS)} docs using {embed_model_name} shape={doc_emb.shape}')
except Exception as e:
    print('[ERROR] Embedding init failed:', e)
    embedder = None
    doc_emb = None

try:
    import openai as _openai
    openai_version = getattr(_openai, '__version__', 'unknown')
    print('OpenAI SDK version:', openai_version)
except Exception:
    openai_version = 'unknown'

if client is None:
    print('\nNEXT: Start/verify service then re-run this cell:')
    print('  foundry service start')
    print('  foundry model run phi-4-mini')
    print('  (optional) set BASE_URL=http://127.0.0.1:57127')

[OK] Foundry Local manager endpoint: http://127.0.0.1:59778/v1 | base=http://127.0.0.1:59778 | alias=phi-4-mini
[OK] Model resolved: deepseek-r1-distill-qwen-7b-cuda-gpu:0 (total_models=11)
[OK] Embedded 5 docs using sentence-transformers/all-MiniLM-L6-v2 shape=(5, 384)
OpenAI SDK version: 1.109.1
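
The selection order (exact alias, then canonical family, then first available) can be isolated into a small function to see how a resolved model id may differ from the requested alias. The sketch below mirrors the logic of the connection cell; `pick_model` is a hypothetical name, and the `ids` list is made up for illustration rather than taken from a real `/v1/models` response.

```python
import re

def canonicalize(model_id):
    """Drop the ':version' tag and any '-cuda...' suffix from a model id."""
    base = model_id.split(':')[0]
    return re.sub(r'-cuda.*', '', base)

def pick_model(alias, ids):
    """Exact alias > canonical family match > first available > alias itself."""
    if alias in ids:
        return alias
    for mid in ids:
        if canonicalize(mid) == canonicalize(alias):
            return mid
    return ids[0] if ids else alias

# Hypothetical model list: no id matches 'phi-4-mini' exactly or canonically.
ids = ['deepseek-r1-distill-qwen-7b-cuda-gpu:0', 'some-other-model-cuda-gpu:0']
print(pick_model('phi-4-mini', ids))  # deepseek-r1-distill-qwen-7b-cuda-gpu:0
```

When neither of the first two rules matches, the first listed id wins, which is one way a run can resolve to a model other than the alias.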


### Explanation: Retrieve Function (Vector Similarity)
`retrieve(query, k=3)` encodes the query, computes cosine similarity (a plain dot product, because the vectors are normalized), and returns the indices of the top-k documents. It stays simple and in-memory for transparency.


In [9]:
def retrieve(query, k=3):
    q = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)[0]
    sims = doc_emb @ q
    return sims.argsort()[::-1][:k]
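
To make the arithmetic behind `doc_emb @ q` visible, here is a dependency-free sketch of the same idea on toy 2-D vectors. The vectors and the helper names (`normalize`, `dot`, `top_k`) are illustrative assumptions, not real embeddings or notebook functions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, docs, k):
    """Rank docs by cosine similarity, computed as a dot product of unit vectors."""
    q = normalize(query)
    sims = [dot(normalize(d), q) for d in docs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:k]

toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-ins for embeddings
print(top_k([1.0, 0.2], toy_docs, k=2))  # [0, 2]
```

Because every vector has length 1, the dot product equals the cosine of the angle between query and document, so no separate division by norms is needed at query time.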

### Explanation: SDK-Based Generation & Answer Helper
Refactored to use the Foundry Local SDK plus OpenAI-compatible client methods instead of manual raw HTTP posts:
- Primary path: `client.chat.completions.create` (structured messages).
- Fallback: `client.completions.create` (legacy prompt). A further `client.responses.create` fallback is not part of `_try_via_client` as shown.
- Tries alternate model ids (RAW vs. stripped ALT) to improve compatibility.
- `answer()` builds a grounded prompt from the top-k retrieved documents and records an ordered trace of attempts.
This keeps the logic readable while degrading gracefully across evolving OpenAI-compatible endpoints.


In [14]:
# SDK-based generation (Foundry Local manager + OpenAI client methods)
import re, time, json

def _strip_model_name(name: str) -> str:
    """Strip CUDA suffix and version tags from model name."""
    base = name.split(':')[0]
    base = re.sub(r'-cuda.*', '', base)
    return base

# Use the actual resolved model name from connection cell
RAW_MODEL = model_name
ALT_MODEL = _strip_model_name(RAW_MODEL)

def _try_via_client(messages, prompt, model_id: str, max_tokens=220, temperature=0.2):
    """Try generating response using OpenAI client with multiple fallback routes."""
    attempts = []
    
    # 1. Try chat.completions endpoint (preferred for chat models)
    try:
        resp = client.chat.completions.create(
            model=model_id, 
            messages=messages, 
            max_tokens=max_tokens, 
            temperature=temperature
        )
        content = resp.choices[0].message.content
        attempts.append(('chat.completions', 200, (content or '')[:160]))
        if content and content.strip():
            return content, attempts
    except Exception as e:
        attempts.append(('chat.completions', None, str(e)[:160]))
    
    # 2. Try legacy completions endpoint
    try:
        comp = client.completions.create(
            model=model_id, 
            prompt=prompt, 
            max_tokens=max_tokens, 
            temperature=temperature
        )
        txt = comp.choices[0].text if comp.choices else ''
        attempts.append(('completions', 200, (txt or '')[:160]))
        if txt and txt.strip():
            return txt, attempts
    except Exception as e:
        attempts.append(('completions', None, str(e)[:160]))
    
    return None, attempts

def retrieve(query, k=3):
    """Retrieve top-k most similar documents using cosine similarity."""
    if embedder is None or doc_emb is None:
        raise RuntimeError("Embeddings not initialized.")
    q_emb = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_emb @ q_emb
    idxs = np.argsort(scores)[::-1][:k]
    return idxs

def answer(query, k=3, max_tokens=220, temperature=0.2, try_alternate=True):
    """
    Answer a query using RAG pipeline:
    1. Retrieve relevant documents using vector similarity
    2. Generate grounded response using Foundry Local model via OpenAI SDK
    
    Args:
        query: User question
        k: Number of documents to retrieve
        max_tokens: Maximum tokens for generation
        temperature: Sampling temperature
        try_alternate: Whether to try alternate model name on failure
    
    Returns:
        Dictionary with query, answer, docs, context, route, and tried attempts
    """
    if client is None:
        raise RuntimeError('Model client not initialized. Re-run connection cell after starting Foundry Local.')
    if embedder is None or doc_emb is None:
        raise RuntimeError('Embeddings not initialized.')
    
    # Retrieve relevant documents
    idxs = retrieve(query, k=k)
    context = '\n'.join(f'Doc {i}: {DOCS[i]}' for i in idxs)
    
    # Construct grounded generation prompt
    system_content = 'Use ONLY provided context. If insufficient, say "I\'m not sure."'
    user_content = f'Context:\n{context}\n\nQuestion: {query}'
    messages = [
        {'role': 'system', 'content': system_content},
        {'role': 'user', 'content': user_content}
    ]
    prompt = f'System: {system_content}\n{user_content}\nAnswer:'
    
    # Try generation with primary model
    tried = []
    ans, attempts = _try_via_client(messages, prompt, RAW_MODEL, max_tokens=max_tokens, temperature=temperature)
    tried.append({'model': RAW_MODEL, 'attempts': attempts})
    
    if ans and ans.strip():
        return {
            'query': query, 
            'answer': ans.strip(), 
            'docs': idxs.tolist(), 
            'context': context, 
            'route': 'chat-first', 
            'tried': tried
        }
    
    # Try alternate model name if available
    if try_alternate and ALT_MODEL != RAW_MODEL:
        ans2, attempts2 = _try_via_client(messages, prompt, ALT_MODEL, max_tokens=max_tokens, temperature=temperature)
        tried.append({'model': ALT_MODEL, 'attempts': attempts2})
        if ans2 and ans2.strip():
            return {
                'query': query, 
                'answer': ans2.strip(), 
                'docs': idxs.tolist(), 
                'context': context, 
                'route': 'chat-alt', 
                'tried': tried
            }
    
    # All routes failed
    return {
        'query': query, 
        'answer': 'I\'m not sure. (All SDK routes failed)', 
        'docs': idxs.tolist(), 
        'context': context, 
        'route': 'failed', 
        'tried': tried
    }

print('[INFO] SDK generation mode active.')
print(f'       RAW_MODEL = {RAW_MODEL}')
print(f'       ALT_MODEL = {ALT_MODEL}')

[INFO] SDK generation mode active.
       RAW_MODEL = deepseek-r1-distill-qwen-7b-cuda-gpu:0
       ALT_MODEL = deepseek-r1-distill-qwen-7b
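
The ordered-fallback pattern used by `_try_via_client` can be shown without a running service. The sketch below is a simplified, hypothetical variant (`first_success` and the stand-in route functions are not part of the notebook): it tries named callables in order, keeps a trace of every attempt, and returns the first non-empty text.

```python
def first_success(routes):
    """Try (name, callable) pairs in order; return (text, attempts)."""
    attempts = []
    for name, call in routes:
        try:
            text = call()
            attempts.append((name, 'ok', (text or '')[:160]))
            if text and text.strip():
                return text, attempts
        except Exception as err:
            attempts.append((name, 'error', str(err)[:160]))
    return None, attempts

# Stand-ins for the real client calls, for illustration only.
def fake_chat():
    raise RuntimeError('route not supported')

def fake_completions():
    return 'Grounded answer text.'

text, attempts = first_success([
    ('chat.completions', fake_chat),
    ('completions', fake_completions),
])
print(text)
print(attempts)
```

Keeping the attempt trace separate from the returned text is what lets `answer()` report which routes were tried for each model id.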


In [15]:
# Self-test cell: validates connectivity, embeddings, and answer() basic functionality (SDK mode)
import math, pprint

def rag_self_test(sample_query: str = 'Why use RAG with local inference?', expect_docs: int = 3):
    report = {'base': BASE, 'raw_model': RAW_MODEL, 'alt_model': ALT_MODEL}
    if not BASE:
        report['error'] = 'BASE not resolved'
        return report
    if embedder is None or doc_emb is None:
        report['error'] = 'Embeddings not initialized'
        return report
    if getattr(doc_emb, 'shape', (0,))[0] != len(DOCS):
        report['warning_embeddings'] = f"doc_emb count {getattr(doc_emb,'shape',('?'))} mismatch DOCS {len(DOCS)}"
    try:
        idxs = retrieve(sample_query, k=expect_docs)
        report['retrieved_indices'] = idxs.tolist() if hasattr(idxs, 'tolist') else list(idxs)
    except Exception as e:
        report['error_retrieve'] = str(e)
        return report
    try:
        ans = answer(sample_query, k=expect_docs, max_tokens=80, temperature=0.2)
        report['route'] = ans.get('route')
        report['answer_preview'] = ans.get('answer','')[:160]
        if ans.get('route') == 'failed':
            report['warning_generation'] = 'All SDK routes failed for sample query'
    except Exception as e:
        report['error_generation'] = str(e)
    return report

pprint.pprint(rag_self_test())

{'alt_model': 'deepseek-r1-distill-qwen-7b',
 'answer_preview': 'Okay, so I need to figure out why someone would use '
                   'Retrieval Augmented Generation (RAG) with local inference. '
                   'Let me start by understanding each part of the qu',
 'base': 'http://127.0.0.1:59778',
 'raw_model': 'deepseek-r1-distill-qwen-7b-cuda-gpu:0',
 'retrieved_indices': [0, 3, 1],
 'route': 'chat-first'}


### Explanation: Batch Query Smoke Test
Runs several representative user queries through `answer()` to validate that:
- Retrieved indices correspond to plausible supporting documents.
- Fallback routing works (the `route` value is not `'failed'`).
- Answers respect the grounding instructions (no hallucinations).
Stores the last result object for ad hoc inspection.


In [16]:
# Quick test queries
queries = [
    "Why use RAG with local inference?",
    "What does vector similarity search do?",
    "Explain privacy benefits."
]

last_result = None
for q in queries:
    try:
        r = answer(q)
        last_result = r
        print(f"Q: {q}\nA: {r['answer']}\nDocs: {r['docs']}\n---")
    except Exception as e:
        print(f"Failed answering '{q}': {e}")

last_result

Q: Why use RAG with local inference?
A: Okay, so I need to figure out why someone would use Retrieval Augmented Generation (RAG) with local inference. Let me start by understanding each part of the question.

First, RAG. From the context given, Doc 1 says that RAG improves answer grounding by injecting relevant context. So RAG is a method that uses retrieval techniques to find the most relevant parts of a document or corpus to augment the generation process. This probably helps in making the generated answers more accurate because they're backed by real data.

Then, local inference. Doc 0 mentions that Foundry Local provides an OpenAI-compatible local inference endpoint. So local inference means running the model on the user's device rather than sending the request to a remote server. This is good for privacy and reducing latency, but it might have limitations in terms of model size or capabilities compared to cloud-based options.

Now, combining RAG with local inference. The context s

{'query': 'Explain privacy benefits.',
 'answer': 'Okay, so I need to explain the privacy benefits mentioned in the provided context. Let me look at the context again. The context includes three documents:\n\nDoc 2 says Edge AI reduces latency and preserves privacy via local execution.\nDoc 3 mentions Small Language Models can offer competitive quality with lower resource usage.\nDoc 1 states Retrieval Augmented Generation improves answer grounding by injecting relevant context.\n\nThe question is about explaining the privacy benefits. So, I should focus on the parts of the context that talk about privacy. \n\nLooking at Doc 2, it mentions Edge AI reduces latency and preserves privacy via local execution. That seems directly related to privacy. I think "local execution" means that the AI processes data on the device itself rather than sending it to a server. This could mean that data doesn\'t have to be transmitted, which might help protect user privacy because it avoids centralizing d
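
The first two smoke-test criteria (routing did not fail, a plausible supporting document was retrieved) can be checked mechanically on the dictionaries that `answer()` returns. The sketch below assumes only the `route`, `answer`, and `docs` keys seen in the outputs; `check_result` is a hypothetical helper, and checking for hallucinations still needs a human read or a separate evaluation step.

```python
def check_result(result, expected_doc=None):
    """Return a list of problems found in one answer() result (empty means OK)."""
    problems = []
    if result.get('route') == 'failed':
        problems.append('all routes failed')
    if not (result.get('answer') or '').strip():
        problems.append('empty answer')
    if expected_doc is not None and expected_doc not in result.get('docs', []):
        problems.append(f'expected doc {expected_doc} not retrieved')
    return problems

# Shape of a result for the privacy query; the answer text is a placeholder.
sample = {'route': 'chat-first', 'answer': 'placeholder', 'docs': [2, 3, 1]}
print(check_result(sample, expected_doc=2))  # []
```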

### Explanation: Single-Answer Convenience Call
A final quick single-question call for easy copy/paste or later reference. It shows that `answer()` can be called again after the earlier warm-up queries.


In [17]:
result = answer('Why use RAG with local inference?')
result

{'query': 'Why use RAG with local inference?',
 'answer': "Okay, so I need to figure out why someone would use Retrieval Augmented Generation (RAG) with local inference. Let me start by understanding each part of the question.\n\nFirst, RAG. From the context given, Doc 1 says that RAG improves answer grounding by injecting relevant context. So RAG is a method that uses retrieval techniques to find the most relevant parts of a document or corpus to augment the generation process. This probably helps in making the generated answers more accurate because they're backed by real data.\n\nThen, local inference. Doc 0 mentions that Foundry Local provides an OpenAI-compatible local inference endpoint. So local inference means running the model on the user's device rather than sending the request to a remote server. This is good for privacy and reducing latency, but it might have limitations in terms of model size or capabilities compared to cloud-based options.\n\nNow, combining RAG with local


---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). Although we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
