A reference-free RAG system that retrieves relevant CVE records from a FAISS vector index and generates actionable security guidelines using a locally hosted Ollama LLM. Evaluated with DeepEval using LLM-as-judge metrics (context relevance, context recall, answer relevance, faithfulness).
- Architecture Overview
- Component Decisions
- Project Structure
- Installation
- Usage
- Evaluation
- Design Decisions Deep-Dive
- Limitations and Future Work
┌─────────────────────────────────────────────────────────────────────┐
│ OFFLINE (ingest) │
│ │
│ HuggingFace Dataset → SentenceTransformer → FAISS Index │
│ (CVE records) (all-MiniLM-L6-v2) (IndexFlatIP) │
└─────────────────────────────────────────────────────────────────────┘
│
index/ (persisted)
│
┌─────────────────────────────────────────────────────────────────────┐
│ ONLINE (query / chat) │
│ │
│ Query → Embed Query → FAISS Search (top-k) → Retrieved Docs │
│ │ │
│ Ollama LLM │
│ (Mistral / Llama3) │
│ │ │
│ Generated Guidelines │
└─────────────────────────────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────────────────────────────┐
│ EVALUATION (evaluate) │
│ │
│ Context Relevance │ Context Recall │ Answer Relevance │ Faithfulness │
│ (DeepEval + LLM judge: Gemini or local Ollama) │
└──────────────────────────────────────────────────────────────────────┘
The dataset contains structured CVE records in a system/user/assistant format. Each record's Assistant field contains:
- CVE metadata (ID, dates, state)
- Vulnerability description
- Affected products and versions
- Reference URLs
This rich, semi-structured text is ideal for a retrieval corpus because practitioners can search by CVE ID, vulnerability type (XSS, SQLi, RCE), affected product, or general security concept.
Records used: 300,000
| Property | Detail |
|---|---|
| Model size | ~80 MB (6-layer MiniLM) |
| Embedding dim | 384 |
| Speed | ~14,000 sentences/sec (SBERT benchmark figure, measured on a V100 GPU; CPU throughput is substantially lower) |
| Quality | Strong on short-paragraph semantic similarity (SBERT benchmarks) |
Why this model?
- Speed vs quality trade-off: MiniLM-L6-v2 achieves ~97% of the quality of larger SBERT models at ~5× the throughput – ideal for ingesting thousands of CVE records on commodity hardware.
- Technical domain: The model is trained on a broad NLI + semantic similarity corpus. CVE descriptions are structured English (not code), so a general-purpose sentence encoder performs well.
- Cosine-friendly: The model produces vectors well-suited for cosine similarity search (the FAISS IndexFlatIP index after L2-normalisation).
Alternatives considered:
- msmarco-distilbert-base-v4 – better passage retrieval but slower.
- bge-small-en-v1.5 – comparable quality, slightly better IR benchmarks; either would work well.
FAISS (Facebook AI Similarity Search) provides millisecond nearest-neighbour lookup over millions of vectors.
IndexFlatIP (Flat Inner Product):
- Exact search – no approximation error.
- With L2-normalised embeddings, inner-product = cosine similarity.
- No training required; index builds in seconds for ≤ 100 K records.
Why exact over approximate (IVF, HNSW)? For a corpus of 5 K–50 K CVE records the difference in search latency is negligible (<1 ms either way). Exact search guarantees retrieval quality – important for a security-critical application where missing a highly relevant CVE could lead to incorrect guidance.
For corpora > 1 M records, switching to IndexIVFFlat or IndexHNSWFlat is recommended.
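To make the "inner product equals cosine similarity" point concrete, here is a minimal pure-Python sketch of the computation that an exact flat inner-product index performs. It is an illustration only, not the project's retriever: the toy 3-dimensional vectors and the helper names (`l2_normalise`, `flat_ip_search`) are assumptions, and the real pipeline would use `faiss.normalize_L2`, `faiss.IndexFlatIP(dim)`, `index.add` and `index.search` over 384-dimensional embeddings.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length (the role faiss.normalize_L2 plays)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def flat_ip_search(index_vectors, query, top_k=3):
    """Exact top-k search: score every stored vector, then sort by score."""
    q = l2_normalise(query)
    scored = [(inner_product(v, q), i) for i, v in enumerate(index_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_k]]

# Toy "embeddings"; real ones come from the sentence encoder.
corpus = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
index_vectors = [l2_normalise(v) for v in corpus]

# The query points in the same direction as corpus[1], so its cosine is ~1.0.
hits = flat_ip_search(index_vectors, [3.0, 4.0, 0.0], top_k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 4))
```

Because every stored vector is scored, no candidate can be missed; the cost is a linear scan, which is why an approximate index only becomes attractive for much larger corpora.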
Ollama runs open-source LLMs locally via a REST API, requiring no cloud API key and keeping sensitive vulnerability data on-premise.
Default model: llama3.2:1b (meta-llama/Llama-3.2-1B)
| Property | Detail |
|---|---|
| Parameters | 1 B |
| Context window | 128 K tokens |
| Licence | Llama 3.2 Community License |
| Hardware req. | ~4 GB RAM (on CPU) |
Why llama3.2:1b?
- Strong instruction-following at small scale – produces structured, numbered guidelines reliably.
- Openly available weights (Llama 3.2 Community License) that are easy to obtain via ollama pull.
- Fast inference on CPU (Ollama's backend).
Alternative models (swap via OLLAMA_MODEL env var):
- mistral – Mistral 7B, strong general reasoning.
- phi3 – Microsoft Phi-3-mini, very fast on CPU.
- gemma2 – Google Gemma 2 9B, competitive quality.
The prompt design explicitly instructs the model to:
- Ground answers in retrieved context only (reduces hallucination).
- Produce exactly 3 numbered guidelines.
- Cite the CVE ID in each guideline.
- Stay factual and consistent; to support this, generation runs at a low temperature (0.2), which is a sampling setting rather than part of the prompt text.
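The prompt rules above can be sketched as follows. This is a hedged illustration rather than the contents of generator.py: the `RULES` wording and the helper names `build_prompt` and `build_payload` are hypothetical. The payload shape (`model`, `prompt`, `stream`, `options.temperature`) follows Ollama's `/api/generate` REST endpoint.

```python
import json

# Hypothetical wording; the project's real prompt text is not shown in this README.
RULES = (
    "You are a security assistant. Answer using ONLY the context below.\n"
    "Produce exactly 3 numbered guidelines.\n"
    "Cite the relevant CVE ID in each guideline."
)

def build_prompt(query, records):
    """Assemble rules, retrieved records, and the question into one prompt."""
    context = "\n\n".join(
        f"[{rank}] {cve_id}\n{text}"
        for rank, (cve_id, text) in enumerate(records, start=1)
    )
    return f"{RULES}\n\nContext:\n{context}\n\nQuestion: {query}\n\nGuidelines:"

def build_payload(model, prompt, temperature=0.2):
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }

records = [("CVE-2021-44228", "CVE-2021-44228 is a critical RCE…")]
prompt = build_prompt("What are the risks?", records)
payload = build_payload("llama3.2:1b", prompt)
body = json.dumps(payload)  # ready to POST to <OLLAMA_BASE_URL>/api/generate
```

Keeping prompt assembly separate from the HTTP call makes it easy to inspect exactly what the model sees for a given query.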
Evaluation is powered by DeepEval, which implements all four metrics using an LLM judge for deep semantic assessment rather than heuristic cosine thresholds. This produces scores that correlate more closely with human judgements.
The judge model is configurable via environment variables in .env:
# Use Gemini as the judge (default)
USE_GEMINI=True
GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-2.0-flash # or gemini-1.5-pro, etc.
# Use a local Ollama model as the judge
USE_GEMINI=False
OLLAMA_RAGAS_MODEL=llama3.1:8b # any model pulled via `ollama pull`
When USE_GEMINI=True, the evaluator calls the Gemini API; otherwise it routes requests to the local Ollama server, keeping all data fully on-premise.
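One way the judge selection described above could be implemented is sketched below. The function names `parse_bool` and `select_judge` and the returned dictionary shape are assumptions for illustration; only the variable names and defaults (`USE_GEMINI`, `GEMINI_MODEL`, `OLLAMA_RAGAS_MODEL`, `OLLAMA_BASE_URL`) come from this README.

```python
def parse_bool(value, default=True):
    """Interpret an env-style string such as 'True' or 'false'."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}

def select_judge(env):
    """Pick the judge backend from a mapping of environment variables."""
    if parse_bool(env.get("USE_GEMINI"), default=True):
        if not env.get("GEMINI_API_KEY"):
            raise ValueError("USE_GEMINI is enabled but GEMINI_API_KEY is not set")
        return {
            "provider": "gemini",
            "model": env.get("GEMINI_MODEL", "gemini-2.0-flash"),
        }
    return {
        "provider": "ollama",
        "model": env.get("OLLAMA_RAGAS_MODEL", "llama3.1:8b"),
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    }

local_judge = select_judge({"USE_GEMINI": "False"})
cloud_judge = select_judge({"USE_GEMINI": "True", "GEMINI_API_KEY": "your_key_here"})
```

In a real run the mapping would be `os.environ` after loading `.env`; passing a plain dict keeps the logic easy to test.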
Most of these metrics are reference-free: no human-annotated ground-truth answers are needed. Context Recall is the exception, since it is scored against a reference.
1. Context Relevance
Measures how focused the retrieved passages are on the query. The LLM judge assesses whether each retrieved chunk contains information pertinent to answering the question. A score of 1.0 means every retrieved passage is semantically on-topic. Low scores indicate noisy retrieval.
2. Context Recall
Measures how completely the retrieval system captures the information needed to answer the query. The LLM judge identifies claims in the reference context and checks whether the retrieved passages cover them. A score of 1.0 means all necessary supporting information is present. Low scores indicate missing evidence that could lead to incomplete guidelines.
3. Answer Relevance
Measures whether the generated answer addresses the practitioner's question. The LLM judge evaluates whether the response is on-topic and directly useful. Low scores suggest the LLM drifted off-topic or was confused by irrelevant context.
4. Faithfulness
Measures whether the generated guidelines are grounded in the retrieved CVE records. The LLM judge checks whether factual claims in the answer can be attributed to the retrieved context. A score close to 1.0 means the LLM closely follows the retrieved evidence. Low scores indicate potential hallucination.
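As a rough mental model, judge-based metrics of this kind can be reduced to the fraction of items (claims or retrieved statements) that the judge accepts. The sketch below assumes that simplification; the verdict lists are hypothetical stand-ins for judge output, and DeepEval's actual prompts and aggregation may differ.

```python
def ratio_score(verdicts):
    """Fraction of positive verdicts; 0.0 when there is nothing to score."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v) / len(verdicts)

# Hypothetical judge verdicts for one query.
claim_supported = [True, True, False, True]   # answer claims backed by the context
statement_relevant = [True, False, False]     # retrieved statements relevant to the query

faithfulness = ratio_score(claim_supported)
context_relevance = ratio_score(statement_relevant)
print(f"faithfulness={faithfulness:.3f} context_relevance={context_relevance:.3f}")
```

Under this reading, one unsupported claim out of four gives a faithfulness of 0.75, which shows why short answers can swing sharply on a single verdict.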
cve_rag/
├── ingest.py # Download dataset, build FAISS index
├── retriever.py # CVERetriever class (FAISS + SentenceTransformer)
├── generator.py # CVEGenerator class (Ollama REST API)
├── rag.py # End-to-end RAGPipeline
├── evaluate.py # DeepEval evaluation over 10 queries
├── cli.py # CLI entry-point (all commands)
├── requirements.txt # Python dependencies
├── .env # Set all env vars (model, judge, API keys)
├── cloud_ollama_pull.py # File to pull models for Ollama on cloud
├── README.md # This file
└── index/ # Created by ingest
├── cve.faiss # FAISS binary index
└── cve_meta.json # Metadata per vector (CVE ID, snippet, full text)
conda create -n CveRag python
conda activate CveRag
pip install -r requirements.txt
Install Ollama from https://ollama.com, then pull a model:
# Install Ollama (Linux/macOS one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# Start the server
ollama serve &
# Pull the default generator model
ollama pull llama3.2:1b
# Pull the judge model (only needed for a local Ollama judge)
ollama pull llama3.1:8b
# Or use Qwen3
ollama pull qwen3:8b
The dataset is public, but you may need to be logged in for large downloads:
pip install huggingface-hub
huggingface-cli login
Copy .env and set your variables:
# Generator model
OLLAMA_BASE_URL='http://localhost:11434'
OLLAMA_MODEL=llama3.2:1b
# LLM judge (choose one provider)
USE_GEMINI=True # set to False to use the local Ollama judge
GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-2.0-flash
OLLAMA_RAGAS_MODEL=llama3.1:8b # used when USE_GEMINI=False
cli.py is the single entry-point for all interactions with the CVE-RAG system.
Usage: cli.py [OPTIONS] COMMAND [ARGS]...
CVE-RAG – Retrieval-Augmented Generation for CVE Security Guidelines
────────────────────────────────────────────────────────────────────────
Run `cve-rag COMMAND --help` for details on each command.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
chat Start an interactive real-time RAG chat session.
evaluate Run DeepEval evaluation (context relevance, context recall,
answer relevance, faithfulness) over a set of test queries.
generate Run generation only using a manually supplied context (no
retrieval step).
info Show current index statistics, model settings, and Ollama
server status.
ingest Build (or rebuild) the FAISS vector index from the CVE
dataset.
query Run the full RAG pipeline (retrieve → generate) for a single
query.
retrieve Run retrieval only and print the top-K matching CVE records.
python cli.py ingest
This downloads the dataset, encodes all records, and writes index/cve.faiss and index/cve_meta.json. Adjust MAX_RECORDS in ingest.py to control how many CVEs are indexed (default: 5 000).
Expected output:
INFO: Loaded 297441 records.
INFO: Encoding 297441 documents (batch_size=256)…
INFO: Building FAISS IndexFlatIP (dim=384)…
INFO: FAISS index contains 297441 vectors.
INFO: FAISS index saved → index/cve.faiss
INFO: Metadata saved → index/cve_meta.json
INFO: Ingestion complete.
# Single query
python cli.py query "How should we handle remote code execution CVEs in Apache?"
# Interactive chat session
python cli.py chat
Example output:
══════════════════════════════════════════════════════════════════════
QUERY: How should we handle XSS vulnerabilities in a mid-sized enterprise?
──────────────────────────────────────────────────────────────────────
RETRIEVED CONTEXT:
[1] CVE-2010-3763 (score=0.7812)
## CVE-2010-3763 ... Cross-site scripting (XSS) vulnerability in core/summary_api.php…
[2] CVE-2012-1889 (score=0.7341)
[3] CVE-2019-0230 (score=0.7198)
──────────────────────────────────────────────────────────────────────
GENERATED GUIDELINES:
**Guideline 1 – Patch Immediately**: Apply vendor patches for CVE-2010-3763 and upgrade
MantisBT to version 1.2.3 or later to close the XSS injection path in summary_api.php.
**Guideline 2 – Input Validation & WAF**: Enforce server-side input sanitisation on all
user-supplied fields and deploy a Web Application Firewall rule to block reflected and
stored XSS payloads.
**Guideline 3 – Monitoring & CSP**: Implement Content Security Policy headers and monitor
application logs for unexpected script injection patterns as an ongoing detective control.
══════════════════════════════════════════════════════════════════════
# Retrieve only (no generation)
python cli.py retrieve "SQL injection vulnerabilities in MySQL" --top-k 5
# Generate only from manually supplied context
python cli.py generate "What are the risks?" --context "CVE-2021-44228 is a critical RCE…"
# Show index stats and model configuration
python cli.py info
# Change the generator model at runtime
OLLAMA_MODEL=llama3 python cli.py query "Privilege escalation in Linux kernels"
python cli.py evaluate
Runs 10 practitioner-style queries through the full RAG pipeline and scores each with DeepEval's four LLM-judge metrics. Results are written to evaluation_results.json.
The judge model is determined by USE_GEMINI in .env (see Configuration). Using Gemini is recommended for highest-quality judgements; the local Ollama option keeps evaluation fully on-premise and suitable for air-gapped environments.
Sample summary table:
════════════════════════════════════════════════════════════════════
DEEPEVAL EVALUATION SUMMARY
────────────────────────────────────────────────────────────────────
# Ctx Rel Ctx Rec Ans Rel Faith Query
────────────────────────────────────────────────────────────────────
1 0.412 0.680 0.731 0.750 How should we handle XSS…
2 0.389 0.651 0.712 0.700 What are the mitigation steps…
3 0.356 0.620 0.698 0.667 How do we detect privilege…
…
────────────────────────────────────────────────────────────────────
AVG 0.391 0.650 0.714 0.712
════════════════════════════════════════════════════════════════════
| Metric | < 0.3 | 0.3 – 0.5 | > 0.5 |
|---|---|---|---|
| Context Relevance | Poor retrieval; chunks are off-topic | Acceptable; some noise | Good; retrieved chunks are focused |
| Context Recall | Critical gaps in retrieved evidence | Partially complete | Retrieved context covers the topic well |
| Answer Relevance | Answer doesn't address query | Partially relevant | Answer directly addresses query |
| Faithfulness | High hallucination risk | Mostly grounded | Strongly grounded in evidence |
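The interpretation bands in the table above can be applied mechanically to a results summary. The sketch below is illustrative: the `interpret` helper and the band labels are assumptions, and the scores are the AVG row from the sample summary table.

```python
def interpret(score):
    """Map a 0-1 metric score onto the three bands from the table."""
    if score < 0.3:
        return "low"
    if score <= 0.5:
        return "middle"
    return "high"

# Averages copied from the sample summary table.
averages = {
    "Context Relevance": 0.391,
    "Context Recall": 0.650,
    "Answer Relevance": 0.714,
    "Faithfulness": 0.712,
}

bands = {metric: interpret(score) for metric, score in averages.items()}
for metric, band in bands.items():
    print(f"{metric:<18} {averages[metric]:.3f}  {band}")
```

For the sample run, only Context Relevance falls in the middle band, which points at retrieval noise rather than generation as the first thing to improve.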
- Self-contained: No external service to run or pay for – ideal for academic projects and air-gapped security environments.
- Performance: FAISS IndexFlatIP delivers <1 ms search over 50 K vectors on a CPU.
- Portability: The index is a single binary file that can be copied, versioned, and restored trivially.
- No publicly available, openly licensed security-domain sentence encoder performs significantly better on short paragraph retrieval.
- The CVE Assistant field is structured English prose, not code or specialised jargon – a general SBERT model handles it well.
- If a larger budget is available, BAAI/bge-large-en-v1.5 or a fine-tuned security encoder could improve recall.
The original cosine-threshold approach is fast but brittle: a sentence can score above the threshold without being logically relevant, and below it while still containing useful information. An LLM judge evaluates semantic entailment, logical grounding, and topical relevance in a way that better mirrors how a security practitioner would assess the output. DeepEval's implementation follows the RAGAS framework (Es et al., 2023) while replacing the heuristic scorer with a prompted LLM, yielding higher correlation with human judgements at the cost of slightly longer evaluation runtime.
Configuring the judge to use a local Ollama model preserves the system's air-gapped, on-premise properties for security-sensitive environments.
The RAGAS paper (Es et al., 2023) demonstrates that reference-free metrics correlate well with human judgements for RAG systems. This project follows that approach because:
- No human-annotated gold answers exist for CVE guidelines.
- LLM-judge metrics capture semantic quality and logical grounding rather than surface-level word overlap (unlike BLEU/ROUGE).
Each CVE Assistant field is treated as a single chunk (up to ~1 500 characters in generation context). This is appropriate because:
- Each record is a cohesive, self-contained vulnerability description.
- Splitting mid-record would sever the link between CVE ID, description, and affected products.
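- The embedding model truncates long inputs rather than summarising them (sentence-transformers configures all-MiniLM-L6-v2 for 256 word pieces by default; the underlying architecture accepts at most 512). This is tolerable here because the most discriminative content (CVE ID, vulnerability type, affected product) typically appears in the first 200 tokens.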
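A minimal sketch of the one-record-per-chunk policy with the ~1 500-character cap on generation context is shown below. The helper name `truncate_record` and the choice to cut at a whitespace boundary are assumptions; the project may simply slice the string.

```python
def truncate_record(text, max_chars=1500):
    """Return the record unchanged if short, else cut at the last space before the cap."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars)
    if cut <= 0:  # no space found: fall back to a hard cut
        cut = max_chars
    return text[:cut].rstrip()

short_record = "CVE-2021-44228 is a critical RCE…"
long_record = "word " * 400  # 2 000 characters of filler text

kept_short = truncate_record(short_record)
kept_long = truncate_record(long_record)
print(len(kept_short), len(kept_long))
```

Cutting on whitespace avoids handing the LLM a half-word at the end of a record, while the record itself is never split into multiple chunks.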
| Limitation | Impact | Mitigation |
|---|---|---|
| LLM judge adds latency to evaluation | Evaluation runs slower than cosine-threshold approach | Use Gemini Flash or a small local model for speed; cache results |
| LLM judge requires API key or local model | Adds dependency vs. pure cosine evaluation | Local Ollama option keeps everything on-premise |
| all-MiniLM-L6-v2 truncates long inputs (256 word pieces by default, 512 at most) | Long CVE records partially encoded | Use a long-context embedding model (e.g., jina-embeddings-v2-base-en, up to 8192 tokens) |
| Ollama CPU inference is slow (~30 s/query) | Limits interactive use | Enable GPU with OLLAMA_NUM_GPU or use quantised models |
| Dataset may contain outdated CVEs | Guidelines could reference deprecated patches | Filter by Published Date ≥ a configurable year |
| No re-ranking step | Top-k retrieval may include tangentially relevant records | Add cross-encoder re-ranking (cross-encoder/ms-marco-MiniLM-L-6-v2) |
| English-only | Non-English security teams not served | Use multilingual SBERT (paraphrase-multilingual-MiniLM-L12-v2) |
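The "filter by published date" mitigation in the table above could look like the following sketch. The `published` field name, the ISO date format, and the placeholder records are all hypothetical; the real metadata in cve_meta.json may store dates differently.

```python
def filter_by_year(records, min_year):
    """Keep records whose 'published' date (YYYY-MM-DD, assumed) is recent enough."""
    kept = []
    for record in records:
        year = str(record.get("published", ""))[:4]
        if year.isdigit() and int(year) >= min_year:
            kept.append(record)
    return kept

# Placeholder records with made-up IDs and dates, for illustration only.
records = [
    {"id": "EXAMPLE-A", "published": "2010-09-30"},
    {"id": "EXAMPLE-B", "published": "2019-05-14"},
    {"id": "EXAMPLE-C", "published": ""},  # missing date: dropped
]

recent = filter_by_year(records, min_year=2015)
print([r["id"] for r in recent])
```

Records with a missing or malformed date are dropped here; a more cautious variant could keep them and flag them for review instead.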
- Bias toward large-enterprise CVEs: The dataset reflects publicly reported vulnerabilities which skew toward widely-used commercial software. Under-resourced or indigenous contexts may use different software stacks with fewer CVE records.
- Outdated advice propagation: CVE remediation advice evolves. Always validate generated guidelines against current vendor advisories.
- Transparency: The system cites CVE IDs and displays retrieved sources, supporting human auditability.
- Human-in-the-loop: Generated guidelines should be reviewed by a qualified security professional before being applied to production systems.
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
- Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
- Jiang, A. Q. et al. (2023). Mistral 7B. arXiv:2310.06825.
- NIST NVD: https://nvd.nist.gov/
- Ollama: https://ollama.com
- DeepEval: https://github.com/confident-ai/deepeval