CVE-RAG: Retrieval-Augmented Generation for CVE Security Guidelines

A RAG system that retrieves relevant CVE records from a FAISS vector index and generates actionable security guidelines with a locally hosted Ollama LLM. The pipeline is evaluated with DeepEval's LLM-as-judge metrics (context relevance, context recall, answer relevance, faithfulness), most of which need no human-annotated reference answers.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         OFFLINE (ingest)                            │
│                                                                     │
│  HuggingFace Dataset  →  SentenceTransformer  →  FAISS Index        │
│  (CVE records)           (all-MiniLM-L6-v2)      (IndexFlatIP)      │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                           index/ (persisted)
                                   │
┌─────────────────────────────────────────────────────────────────────┐
│                         ONLINE (query / chat)                       │
│                                                                     │
│  Query  →  Embed Query  →  FAISS Search (top-k)  →  Retrieved Docs  │
│                                                          │          │
│                                                     Ollama LLM      │
│                                                  (Mistral / Llama3) │
│                                                          │          │
│                                              Generated Guidelines   │
└─────────────────────────────────────────────────────────────────────┘
                                   │
┌──────────────────────────────────────────────────────────────────────┐
│                      EVALUATION (evaluate)                           │
│                                                                      │
│ Context Relevance │ Context Recall │ Answer Relevance │ Faithfulness │
│         (DeepEval + LLM judge: Gemini or local Ollama)               │
└──────────────────────────────────────────────────────────────────────┘
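
The online path in the diagram can be read as a composition of two steps, retrieve then generate. The following is a minimal sketch of that flow and not the project's actual `rag.py`: `RetrievedDoc`, `run_rag`, and the two stand-in functions are hypothetical names, and the CVE IDs are placeholders, so the example runs without FAISS or Ollama.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievedDoc:
    cve_id: str
    text: str
    score: float


def run_rag(
    query: str,
    retrieve: Callable[[str, int], List[RetrievedDoc]],
    generate: Callable[[str, List[RetrievedDoc]], str],
    top_k: int = 3,
) -> dict:
    """Online path: query -> top-k retrieval -> grounded generation."""
    docs = retrieve(query, top_k)
    answer = generate(query, docs)
    return {"query": query, "contexts": docs, "answer": answer}


# Stand-in components (assumptions) replacing the FAISS retriever and Ollama generator.
def fake_retrieve(query: str, k: int) -> List[RetrievedDoc]:
    corpus = [
        RetrievedDoc("CVE-0000-0001", "placeholder record about XSS", 0.9),
        RetrievedDoc("CVE-0000-0002", "placeholder record about SQLi", 0.5),
    ]
    return corpus[:k]


def fake_generate(query: str, docs: List[RetrievedDoc]) -> str:
    cited = ", ".join(d.cve_id for d in docs)
    return f"Guidelines for '{query}' grounded in: {cited}"


result = run_rag("How do we handle XSS?", fake_retrieve, fake_generate, top_k=2)
print(result["answer"])
```

Passing the retriever and generator in as callables keeps each stage swappable, which is also what lets the evaluation stage score the same `contexts` and `answer` that the user sees.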

Component Decisions

Dataset: `AlicanKiraz0/All-CVE-Records-Training-Dataset`

The dataset contains structured CVE records in a system/user/assistant format. Each record's Assistant field contains:

CVE metadata (ID, dates, state)
Vulnerability description
Affected products and versions
Reference URLs

This rich, semi-structured text is ideal for a retrieval corpus because practitioners can search by CVE ID, vulnerability type (XSS, SQLi, RCE), affected product, or general security concept.

Records used: 300,000

Embedding Model: `sentence-transformers/all-MiniLM-L6-v2`

Property	Detail
Model size	~80 MB (6-layer MiniLM)
Embedding dim	384
Speed	~14,000 sentences/sec on a V100 GPU (SBERT benchmarks); lower on CPU
Quality	Strong on short-paragraph semantic similarity (SBERT benchmarks)

Why this model?

Speed vs quality trade-off: MiniLM-L6-v2 achieves ~97% of the quality of larger SBERT models at ~5× the throughput – ideal for ingesting thousands of CVE records on commodity hardware.
Technical domain: The model is trained on a broad NLI + semantic similarity corpus. CVE descriptions are structured English (not code), so a general-purpose sentence encoder performs well.
Cosine-friendly: The model produces vectors well suited to cosine similarity search, which FAISS IndexFlatIP performs exactly once the embeddings are L2-normalised.

Alternatives considered:

msmarco-distilbert-base-v4 – better passage retrieval but slower.
bge-small-en-v1.5 – comparable quality, slightly better IR benchmarks; either would work well.

Vector Store: FAISS `IndexFlatIP`

FAISS (Facebook AI Similarity Search) provides millisecond nearest-neighbour lookup over millions of vectors.

IndexFlatIP (Flat Inner Product):

Exact search – no approximation error.
With L2-normalised embeddings, inner-product = cosine similarity.
No training required; index builds in seconds for ≤ 100 K records.
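
To make the "inner product = cosine similarity" point concrete, here is a dependency-free sketch of what an exact flat inner-product search computes. It is a plain-Python stand-in for illustration, not the FAISS API; `l2_normalise`, `flat_ip_search`, and the toy 2-D vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalise(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def flat_ip_search(
    index: List[List[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Exhaustive top-k by inner product: every stored vector is scored."""
    scored = [(i, inner_product(vec, query)) for i, vec in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


raw_docs = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
raw_query = [6.0, 8.0]

index = [l2_normalise(d) for d in raw_docs]  # normalise once at ingest time
query = l2_normalise(raw_query)              # normalise each query the same way

hits = flat_ip_search(index, query, k=2)
for doc_id, score in hits:
    # After normalisation the inner product matches the cosine of the raw vectors.
    print(doc_id, round(score, 4), round(cosine(raw_docs[doc_id], raw_query), 4))
```

Because every vector is scored, there is no approximation error; the cost is a linear scan, which is why the approximate indexes only become attractive for much larger corpora.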

Why exact over approximate (IVF, HNSW)? For a corpus of 5 K–50 K CVE records the difference in search latency is negligible (<1 ms either way). Exact search guarantees retrieval quality – important for a security-critical application where missing a highly relevant CVE could lead to incorrect guidance.

For corpora > 1 M records, switching to IndexIVFFlat or IndexHNSWFlat is recommended.

Generator: Ollama (local LLM)

Ollama runs open-source LLMs locally via a REST API, requiring no cloud API key and keeping sensitive vulnerability data on-premise.

Default model: llama3.2:1b (meta-llama/Llama-3.2-1B)

Property	Detail
Parameters	1 B
Context window	128 K tokens
Licence	Llama 3.2 Community License
Hardware req.	~4 GB RAM (on CPU)

Why llama3.2:1b?

Strong instruction-following at small scale – produces structured, numbered guidelines reliably.
Open weights and easily available through Ollama.
Fast inference on CPU (Ollama's backend).

Alternative models (swap via OLLAMA_MODEL env var):

mistral – Mistral 7B (Mistral AI), strong general reasoning.
phi3 – Microsoft Phi-3-mini, very fast on CPU.
gemma2 – Google Gemma 2 9B, competitive quality.

The prompt explicitly instructs the model to:

Ground answers in retrieved context only (reduces hallucination).
Produce exactly 3 numbered guidelines.
Cite the CVE ID in each guideline.

Generation also uses a low temperature (0.2) for factual, near-deterministic output.
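
The constraints above might translate into a prompt and request body along these lines. This is a sketch under assumptions, not the project's actual `generator.py`: the prompt wording, `build_prompt`, `build_ollama_payload`, and the record fields `cve_id` and `text` are hypothetical. The payload shape (`model`, `prompt`, `stream`, `options.temperature`) follows Ollama's `/api/generate` endpoint.

```python
import json
from typing import Dict, List


def build_prompt(query: str, records: List[Dict[str, str]]) -> str:
    """Assemble a grounded prompt; the wording here is illustrative only."""
    context = "\n\n".join(f"[{r['cve_id']}]\n{r['text']}" for r in records)
    return (
        "You are a security advisor. Answer using ONLY the CVE records below.\n"
        "Write exactly 3 numbered guidelines and cite a CVE ID in each one.\n\n"
        f"CVE records:\n{context}\n\n"
        f"Question: {query}\n"
        "Guidelines:"
    )


def build_ollama_payload(prompt: str, model: str = "llama3.2:1b") -> dict:
    """Request body for a non-streaming POST to <OLLAMA_BASE_URL>/api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.2},  # low temperature for stable output
    }


# Placeholder record; a real call would pass the retrieved CVE texts.
records = [{"cve_id": "CVE-0000-0001", "text": "placeholder record text"}]
payload = build_ollama_payload(build_prompt("How do we mitigate this?", records))
print(json.dumps(payload["options"]))
```

Keeping prompt construction separate from the HTTP call makes the grounding rules easy to inspect and unit-test without a running Ollama server.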

Evaluation: DeepEval with LLM-as-Judge

Evaluation is powered by DeepEval, which implements all four metrics using an LLM judge for deep semantic assessment rather than heuristic cosine thresholds. This produces scores that correlate more closely with human judgements.

LLM Judge Configuration

The judge model is configurable via environment variables in .env:

# Use Gemini as the judge (default)
USE_GEMINI=True
GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-2.0-flash   # or gemini-1.5-pro, etc.

# Use a local Ollama model as the judge
USE_GEMINI=False
OLLAMA_RAGAS_MODEL=llama3.1:8b  # any model pulled via `ollama pull`

When USE_GEMINI=True, the evaluator calls the Gemini API. Otherwise it routes requests to the local Ollama server, keeping all data fully on-premise.
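
The switch can be pictured as a small resolver over the environment. This is a sketch assuming the variable names shown above; `resolve_judge`, the returned dictionary layout, and the fallback defaults are hypothetical and may differ from what `evaluate.py` does.

```python
import os
from typing import Mapping


def resolve_judge(env: Mapping[str, str] = os.environ) -> dict:
    """Pick the judge backend from USE_GEMINI and related variables."""
    use_gemini = env.get("USE_GEMINI", "True").strip().lower() in {"1", "true", "yes"}
    if use_gemini:
        api_key = env.get("GEMINI_API_KEY", "")
        if not api_key:
            raise ValueError("USE_GEMINI is enabled but GEMINI_API_KEY is not set")
        return {
            "provider": "gemini",
            "model": env.get("GEMINI_MODEL", "gemini-2.0-flash"),
            "api_key": api_key,
        }
    # Local judge: nothing leaves the machine.
    return {
        "provider": "ollama",
        "model": env.get("OLLAMA_RAGAS_MODEL", "llama3.1:8b"),
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    }


local = resolve_judge({"USE_GEMINI": "False"})
cloud = resolve_judge({"USE_GEMINI": "True", "GEMINI_API_KEY": "your_key_here"})
print(local["provider"], local["model"])
print(cloud["provider"], cloud["model"])
```

Failing fast on a missing API key avoids discovering the misconfiguration halfway through an evaluation run.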

Metrics

Most metrics are reference-free: no human-annotated ground-truth answers are needed. Context recall is the exception, because the judge compares the retrieved passages against a reference context.

1. Context Relevance

Measures how focused the retrieved passages are on the query. The LLM judge assesses whether each retrieved chunk contains information pertinent to answering the question. A score of 1.0 means every retrieved passage is semantically on-topic. Low scores indicate noisy retrieval.

2. Context Recall

Measures how completely the retrieval system captures the information needed to answer the query. The LLM judge identifies claims in the reference context and checks whether the retrieved passages cover them. A score of 1.0 means all necessary supporting information is present. Low scores indicate missing evidence that could lead to incomplete guidelines.

3. Answer Relevance

Measures whether the generated answer addresses the practitioner's question. The LLM judge evaluates whether the response is on-topic and directly useful. Low scores suggest the LLM drifted off-topic or was confused by irrelevant context.

4. Faithfulness

Measures whether the generated guidelines are grounded in the retrieved CVE records. The LLM judge checks whether factual claims in the answer can be attributed to the retrieved context. A score close to 1.0 means the LLM closely follows the retrieved evidence. Low scores indicate potential hallucination.
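
One way to picture how a judge-based score lands between 0 and 1: the judge returns a yes/no verdict per claim or per retrieved chunk, and the score is the share of positive verdicts. The sketch below assumes that ratio form and uses hard-coded, hypothetical verdicts; in a real run they come from the LLM judge, and DeepEval's exact prompts and aggregation may differ.

```python
from typing import List


def ratio_score(verdicts: List[bool]) -> float:
    """Fraction of positive judge verdicts; 0.0 when there is nothing to judge."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for a single test case.
claim_supported_by_context = [True, True, False, True]  # faithfulness
chunk_relevant_to_query = [True, False, False]          # context relevance

faithfulness = ratio_score(claim_supported_by_context)
context_relevance = ratio_score(chunk_relevant_to_query)
print(f"faithfulness={faithfulness:.3f} context_relevance={context_relevance:.3f}")
```

With only a handful of claims or chunks per query, scores move in coarse steps, so per-query values are best read alongside the average over the whole query set.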

Project Structure

cve_rag/
├── ingest.py            # Download dataset, build FAISS index
├── retriever.py         # CVERetriever class (FAISS + SentenceTransformer)
├── generator.py         # CVEGenerator class (Ollama REST API)
├── rag.py               # End-to-end RAGPipeline
├── evaluate.py          # DeepEval evaluation over 10 queries
├── cli.py               # CLI entry-point (all commands)
├── requirements.txt     # Python dependencies
├── .env                 # Set all env vars (model, judge, API keys)
├── cloud_ollama_pull.py # File to pull models for Ollama on cloud
├── README.md            # This file
└── index/               # Created by ingest
    ├── cve.faiss        # FAISS binary index
    └── cve_meta.json    # Metadata per vector (CVE ID, snippet, full text)

Installation

1. Python environment

conda create -n CveRag python   # include python so pip installs into this env
conda activate CveRag
pip install -r requirements.txt

2. Ollama

Install Ollama from https://ollama.com, then pull a model:

# Install Ollama (Linux/macOS one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Start the server
ollama serve &

# Pull the default generator model
ollama pull llama3.2:1b

# Pull the local judge model (only needed when USE_GEMINI=False)
ollama pull llama3.1:8b

# Or use Qwen3
ollama pull qwen3:8b

3. HuggingFace authentication (optional)

The dataset is public, but you may need to be logged in for large downloads:

pip install huggingface-hub
huggingface-cli login

4. Configure environment

Copy .env and set your variables:

# Generator model
OLLAMA_BASE_URL='http://localhost:11434'
OLLAMA_MODEL=llama3.2:1b

# LLM judge (choose one provider)
USE_GEMINI=True          # set to False to use the local Ollama judge
GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-2.0-flash
OLLAMA_RAGAS_MODEL=llama3.1:8b # used when USE_GEMINI=False

Usage

cli.py is the single entry-point for all interactions with the CVE-RAG system.

Usage: cli.py [OPTIONS] COMMAND [ARGS]...

  CVE-RAG  –  Retrieval-Augmented Generation for CVE Security Guidelines
  ────────────────────────────────────────────────────────────────────────
  Run `cve-rag COMMAND --help` for details on each command.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  chat      Start an interactive real-time RAG chat session.
  evaluate  Run DeepEval evaluation (context relevance, context recall,
            answer relevance, faithfulness) over a set of test queries.
  generate  Run generation only using a manually supplied context (no
            retrieval step).
  info      Show current index statistics, model settings, and Ollama
            server status.
  ingest    Build (or rebuild) the FAISS vector index from the CVE
            dataset.
  query     Run the full RAG pipeline (retrieve → generate) for a single
            query.
  retrieve  Run retrieval only and print the top-K matching CVE records.

Step 1 – Build the index

python cli.py ingest

This downloads the dataset, encodes all records, and writes index/cve.faiss and index/cve_meta.json. Adjust MAX_RECORDS in ingest.py to control how many CVEs are indexed (default: 5 000).

Expected output (record counts depend on MAX_RECORDS):

INFO: Loaded 297441 records.
INFO: Encoding 297441 documents (batch_size=256)…
INFO: Building FAISS IndexFlatIP (dim=384)…
INFO: FAISS index contains 297441 vectors.
INFO: FAISS index saved → index/cve.faiss
INFO: Metadata saved    → index/cve_meta.json
INFO: Ingestion complete.
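
Alongside the vectors, ingestion writes one metadata row per record so that search hit i can be mapped back to a CVE. The sketch below shows one plausible shape for that step. The field names `cve_id`, `snippet`, and `text`, the helper names, and the placeholder records are assumptions; only the batch size of 256 comes from the log above.

```python
import json
import re
from typing import Dict, Iterator, List, Optional

# CVE identifiers look like CVE-<4-digit year>-<4 or more digits>.
CVE_ID_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def extract_cve_id(text: str) -> Optional[str]:
    """Return the first CVE identifier found in the record text, if any."""
    match = CVE_ID_PATTERN.search(text)
    return match.group(0) if match else None


def build_metadata(records: List[str], snippet_chars: int = 200) -> List[Dict[str, Optional[str]]]:
    """One row per record, in the same order the vectors are added to the index."""
    return [
        {"cve_id": extract_cve_id(text), "snippet": text[:snippet_chars], "text": text}
        for text in records
    ]


def batched(items: List[str], batch_size: int = 256) -> Iterator[List[str]]:
    """Yield fixed-size slices so the encoder never sees the whole corpus at once."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


records = [
    "## CVE-0000-0001 placeholder description of an XSS issue",
    "## CVE-0000-0002 placeholder description of an SQL injection issue",
]
meta = build_metadata(records)
print(json.dumps(meta[0], indent=2))
print([len(batch) for batch in batched(["doc"] * 600)])
```

Keeping the metadata list in index order means no separate ID mapping is needed: row i of `cve_meta.json` describes vector i.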

Step 2 – Query the RAG system

# Single query
python cli.py query "How should we handle remote code execution CVEs in Apache?"

# Interactive chat session
python cli.py chat

Example output (shown here for an XSS query):

══════════════════════════════════════════════════════════════════════
QUERY: How should we handle XSS vulnerabilities in a mid-sized enterprise?
──────────────────────────────────────────────────────────────────────
RETRIEVED CONTEXT:
  [1] CVE-2010-3763  (score=0.7812)
      ## CVE-2010-3763 ... Cross-site scripting (XSS) vulnerability in core/summary_api.php…
  [2] CVE-2012-1889  (score=0.7341)
  [3] CVE-2019-0230  (score=0.7198)
──────────────────────────────────────────────────────────────────────
GENERATED GUIDELINES:
**Guideline 1 – Patch Immediately**: Apply vendor patches for CVE-2010-3763 and upgrade
MantisBT to version 1.2.3 or later to close the XSS injection path in summary_api.php.

**Guideline 2 – Input Validation & WAF**: Enforce server-side input sanitisation on all
user-supplied fields and deploy a Web Application Firewall rule to block reflected and
stored XSS payloads.

**Guideline 3 – Monitoring & CSP**: Implement Content Security Policy headers and monitor
application logs for unexpected script injection patterns as an ongoing detective control.
══════════════════════════════════════════════════════════════════════

Other commands

# Retrieve only (no generation)
python cli.py retrieve "SQL injection vulnerabilities in MySQL" --top-k 5

# Generate only from manually supplied context
python cli.py generate "What are the risks?" --context "CVE-2021-44228 is a critical RCE…"

# Show index stats and model configuration
python cli.py info

# Change the generator model at runtime
OLLAMA_MODEL=llama3 python cli.py query "Privilege escalation in Linux kernels"

Evaluation

python cli.py evaluate

Runs 10 practitioner-style queries through the full RAG pipeline and scores each with DeepEval's four LLM-judge metrics. Results are written to evaluation_results.json.

The judge model is determined by USE_GEMINI in .env (see LLM Judge Configuration). Using Gemini is recommended for highest-quality judgements; the local Ollama option keeps evaluation fully air-gapped.

Sample summary table:

════════════════════════════════════════════════════════════════════
DEEPEVAL EVALUATION SUMMARY
────────────────────────────────────────────────────────────────────
#   Ctx Rel  Ctx Rec  Ans Rel   Faith  Query
────────────────────────────────────────────────────────────────────
1     0.412    0.680    0.731   0.750  How should we handle XSS…
2     0.389    0.651    0.712   0.700  What are the mitigation steps…
3     0.356    0.620    0.698   0.667  How do we detect privilege…
…
────────────────────────────────────────────────────────────────────
AVG   0.391    0.650    0.714   0.712
════════════════════════════════════════════════════════════════════

Interpreting scores

Metric	< 0.3	0.3 – 0.5	> 0.5
Context Relevance	Poor retrieval; chunks are off-topic	Acceptable; some noise	Good; retrieved chunks are focused
Context Recall	Critical gaps in retrieved evidence	Partially complete	Retrieved context covers the topic well
Answer Relevance	Answer doesn't address query	Partially relevant	Answer directly addresses query
Faithfulness	High hallucination risk	Mostly grounded	Strongly grounded in evidence
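
The bands in the table can be applied mechanically to a results summary. The sketch below uses the AVG row from the sample table; the `band` helper, its labels, and the dictionary keys are hypothetical.

```python
from typing import Dict


def band(score: float) -> str:
    """Map a 0-1 metric score onto the three bands from the table above."""
    if score < 0.3:
        return "poor"
    if score <= 0.5:
        return "acceptable"
    return "good"


# AVG row of the sample summary table.
averages: Dict[str, float] = {
    "context_relevance": 0.391,
    "context_recall": 0.650,
    "answer_relevance": 0.714,
    "faithfulness": 0.712,
}

report = {metric: band(score) for metric, score in averages.items()}
for metric, label in report.items():
    print(f"{metric:<18} {averages[metric]:.3f}  {label}")
```

In the sample run only context relevance falls in the middle band, which points at retrieval noise rather than generation as the first thing to improve.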

Design Decisions Deep-Dive

Why FAISS over a vector database (Pinecone, Qdrant, Weaviate)?

Self-contained: No external service to run or pay for – ideal for academic projects and air-gapped security environments.
Performance: FAISS IndexFlatIP delivers <1 ms search over 50 K vectors on a CPU.
Portability: The index is a single binary file that can be copied, versioned, and restored trivially.

Why `all-MiniLM-L6-v2` over domain-specific security models?

No publicly available, openly licensed security-domain sentence encoder performs significantly better on short paragraph retrieval.
The CVE Assistant field is structured English prose, not code or specialised jargon – a general SBERT model handles it well.
If a larger budget is available, BAAI/bge-large-en-v1.5 or a fine-tuned security encoder could improve recall.

Why DeepEval with an LLM judge over cosine-threshold metrics?

The original cosine-threshold approach is fast but brittle: a sentence can score above the threshold without being logically relevant, or fall below it while still containing useful information. An LLM judge evaluates semantic entailment, logical grounding, and topical relevance in a way that better mirrors how a security practitioner would assess the output. The metric definitions follow the RAGAS framework (Es et al., 2023); scoring them with DeepEval's prompted LLM judge instead of the heuristic scorer yields higher correlation with human judgements at the cost of slightly longer evaluation runtime.

Configuring the judge to use a local Ollama model preserves the system's air-gapped, on-premise properties for security-sensitive environments.

Why reference-free evaluation?

The RAGAS paper (Es et al., 2023) demonstrates that reference-free metrics correlate well with human judgements for RAG systems. This project follows that approach because:

No human-annotated gold answers exist for CVE guidelines.
LLM-judge metrics capture semantic quality and logical grounding rather than surface-level word overlap (unlike BLEU/ROUGE).

Chunking strategy

Each CVE Assistant field is treated as a single chunk (up to ~1 500 characters in generation context). This is appropriate because:

Each record is a cohesive, self-contained vulnerability description.
Splitting mid-record would sever the link between CVE ID, description, and affected products.
The embedding model truncates inputs beyond its maximum sequence length (256 word pieces by default for all-MiniLM-L6-v2, 512 at most) rather than compressing them, but the most discriminative content (CVE ID, vulnerability type, affected product) typically appears in the first 200 tokens.
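
The one-record-per-chunk policy, together with the ~1 500-character cap on the generation side, could be sketched as follows. `truncate`, `build_context`, the `[truncated]` marker, and the record fields are hypothetical; the real pipeline may format the context differently.

```python
from typing import Dict, List

MAX_CHARS_PER_RECORD = 1500  # mirrors the ~1 500-character cap described above


def truncate(text: str, limit: int = MAX_CHARS_PER_RECORD) -> str:
    """Keep the head of the record, where the ID and description usually sit."""
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + " [truncated]"


def build_context(records: List[Dict[str, str]]) -> str:
    """One numbered block per CVE record; records are never split across blocks."""
    blocks = [
        f"[{rank}] {record['cve_id']}\n{truncate(record['text'])}"
        for rank, record in enumerate(records, start=1)
    ]
    return "\n\n".join(blocks)


records = [
    {"cve_id": "CVE-0000-0001", "text": "short placeholder record"},
    {"cve_id": "CVE-0000-0002", "text": "x" * 2000},  # longer than the cap
]
context = build_context(records)
print(context[:60])
```

Truncating from the end keeps the CVE ID, description, and affected products together, which is the link the chunking strategy is meant to preserve.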

Limitations and Future Work

Limitation	Impact	Mitigation
LLM judge adds latency to evaluation	Evaluation runs slower than cosine-threshold approach	Use Gemini Flash or a small local model for speed; cache results
LLM judge requires API key or local model	Adds dependency vs. pure cosine evaluation	Local Ollama option keeps everything on-premise
MiniLM-L6-v2 truncates long inputs (256 word pieces by default)	Long CVE records partially encoded	Use a long-context embedding model (e.g., `jinaai/jina-embeddings-v2-base-en`, 8192 tokens)
Ollama CPU inference is slow (~30 s/query)	Limits interactive use	Enable GPU with `OLLAMA_NUM_GPU` or use quantised models
Dataset may contain outdated CVEs	Guidelines could reference deprecated patches	Filter by `Published Date` ≥ a configurable year
No re-ranking step	Top-k retrieval may include tangentially relevant records	Add cross-encoder re-ranking (`cross-encoder/ms-marco-MiniLM-L-6-v2`)
English-only	Non-English security teams not served	Use multilingual SBERT (`paraphrase-multilingual-MiniLM-L12-v2`)

Ethical considerations

Bias toward large-enterprise CVEs: The dataset reflects publicly reported vulnerabilities which skew toward widely-used commercial software. Under-resourced or indigenous contexts may use different software stacks with fewer CVE records.
Outdated advice propagation: CVE remediation advice evolves. Always validate generated guidelines against current vendor advisories.
Transparency: The system cites CVE IDs and displays retrieved sources, supporting human auditability.
Human-in-the-loop: Generated guidelines should be reviewed by a qualified security professional before being applied to production systems.

References

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
Jiang, A. Q. et al. (2023). Mistral 7B. arXiv:2310.06825.
NIST NVD: https://nvd.nist.gov/
Ollama: https://ollama.com
DeepEval: https://github.com/confident-ai/deepeval
