# 2: Pre-trained Self-RAG with GGUF Conversion

This notebook converts the official pre-trained Self-RAG model (7B) to GGUF format and runs it on a Mac M4 (16GB) for efficient inference.

## Why This Approach
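- The pre-trained Self-RAG model has already learned to emit reflection tokens from about 150k training examples
- Q4_K_M GGUF quantization shrinks the 7B model from about 13.5 GB (FP16) to about 4 GB, which fits in memory on a 16GB Mac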

## Sections
1. Setup & Dependencies
2. Download Pre-trained Self-RAG Model
3. Convert to GGUF Format
4. Validate Special Tokens
5. Inference Integration
6. Integrate with Existing Retriever
7. Evaluation on LegalBench

---
## Section 1: Setup & Dependencies

In [8]:
# Cell 1.1: Install dependencies
# For Mac with Metal acceleration (recent llama-cpp-python releases use the
# GGML_METAL CMake flag; older releases used LLAMA_METAL):
# CMAKE_ARGS="-DGGML_METAL=on" uv pip install llama-cpp-python --upgrade

# Uncomment to install:
# !CMAKE_ARGS="-DGGML_METAL=on" uv pip install llama-cpp-python --upgrade

In [9]:
# Cell 1.2: Imports
import os
import sys
import json
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd().parent))

# Existing project imports
from src.retrieval.retriever import LegalRetriever
from src.retrieval.embedding import EmbeddingModel
from src.self_rag.reflection_tokens import ReflectionTokenizer, ReflectionAnnotation
from src.self_rag.gguf_inference import SelfRAGGGUFInference, SelfRAGOutput

print("Imports successful!")

Imports successful!


In [10]:
# Cell 1.3: Verify llama-cpp-python setup
try:
    import llama_cpp
    print(f"llama-cpp-python version: {llama_cpp.__version__}")
    print("Metal acceleration: Available on macOS")
except ImportError:
    print("ERROR: llama-cpp-python not installed!")
    print("Run: CMAKE_ARGS='-DGGML_METAL=on' uv pip install llama-cpp-python")

llama-cpp-python version: 0.3.16
Metal acceleration: Available on macOS


---
## Section 2: Download Pre-trained Self-RAG Model

### Prerequisites - Llama 2 License

**IMPORTANT:** Before downloading, you must:
1. Go to https://huggingface.co/meta-llama/Llama-2-7b
2. Accept Meta's Llama 2 license agreement
3. Request access (usually instant approval)
4. Log in with the HuggingFace CLI: `huggingface-cli login`

In [11]:
# Cell 2.1: Load HuggingFace token from .env
from dotenv import load_dotenv

# Load .env file
load_dotenv(Path("../.env"))

# Get HF token (try both common env var names)
hf_token = os.environ.get("HF_API_TOKEN") or os.environ.get("HF_TOKEN")

if hf_token:
    print("HuggingFace token loaded from .env")
else:
    print("No HF token in .env. Please run: huggingface-cli login")

HuggingFace token loaded from .env


In [12]:
# Cell 2.2: Download model
from huggingface_hub import snapshot_download

MODEL_ID = "selfrag/selfrag_llama2_7b"
LOCAL_DIR = Path("../models/selfrag_llama2_7b")

# Download (requires ~14GB disk space)
if not LOCAL_DIR.exists():
    print(f"Downloading {MODEL_ID}...")
    print("This may take 10-30 minutes depending on connection speed.")
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=str(LOCAL_DIR),
        local_dir_use_symlinks=False,
        token=hf_token,
    )
    print(f"Downloaded to {LOCAL_DIR}")
else:
    print(f"Model already exists at {LOCAL_DIR}")

Model already exists at ../models/selfrag_llama2_7b


In [13]:
# Cell 2.3: Verify download
if LOCAL_DIR.exists():
    model_files = list(LOCAL_DIR.iterdir())
    print(f"Model files ({len(model_files)} files):")
    for f in sorted(model_files):
        size_mb = f.stat().st_size / 1e6 if f.is_file() else 0
        print(f"  {f.name} ({size_mb:.1f} MB)")
else:
    print("Model directory does not exist. Please download first.")

Model files (12 files):
  .cache (0.0 MB)
  .gitattributes (0.0 MB)
  README.md (0.0 MB)
  added_tokens.json (0.0 MB)
  config.json (0.0 MB)
  generation_config.json (0.0 MB)
  pytorch_model-00001-of-00002.bin (9976.8 MB)
  pytorch_model-00002-of-00002.bin (3500.4 MB)
  pytorch_model.bin.index.json (0.0 MB)
  special_tokens_map.json (0.0 MB)
  tokenizer.model (0.5 MB)
  tokenizer_config.json (0.0 MB)


---
## Section 3: Convert to GGUF Format

### 3.1 Clone llama.cpp and Build

In [14]:
# Cell 3.1: Clone llama.cpp
# LLAMA_CPP_DIR stays defined because Cells 3.3-3.5 reference it,
# even when the clone step itself is skipped.
LLAMA_CPP_DIR = Path("../tools/llama.cpp")

# Uncomment to clone:
# if not LLAMA_CPP_DIR.exists():
#     print("Cloning llama.cpp...")
#     !git clone https://github.com/ggerganov/llama.cpp {LLAMA_CPP_DIR}
#     print(f"Cloned to {LLAMA_CPP_DIR}")
# else:
#     print(f"llama.cpp already exists at {LLAMA_CPP_DIR}")

In [15]:
# # Cell 3.2: Build llama.cpp with Metal support (Mac)
# import subprocess

# BUILD_DIR = LLAMA_CPP_DIR / "build"

# print("Building llama.cpp with CMake and Metal support...")

# # Create build directory
# BUILD_DIR.mkdir(exist_ok=True)

# # Configure with CMake (enable Metal)
# result = subprocess.run(
#     ["cmake", "..", "-DGGML_METAL=ON"],  # older llama.cpp checkouts used -DLLAMA_METAL=ON
#     cwd=str(BUILD_DIR),
#     capture_output=True,
#     text=True
# )

# if result.returncode != 0:
#     print(f"CMake configure failed: {result.stderr}")
# else:
#     # Build
#     result = subprocess.run(
#         ["cmake", "--build", ".", "--config", "Release", "-j"],
#         cwd=str(BUILD_DIR),
#         capture_output=True,
#         text=True
#     )

#     if result.returncode == 0:
#         print("Build successful!")
#     else:
#         print(f"Build failed: {result.stderr}")

In [16]:
# Cell 3.3: Install Python conversion dependencies
# !uv pip install -q -r {LLAMA_CPP_DIR}/requirements.txt

### 3.2 Convert HuggingFace to GGUF

In [17]:
# Cell 3.4: Convert to GGUF FP16
FP16_GGUF = Path("../models/selfrag_llama2_7b.fp16.gguf")

if not FP16_GGUF.exists():
    print("Converting to FP16 GGUF (this takes ~5-10 minutes)...")
    !python {LLAMA_CPP_DIR}/convert_hf_to_gguf.py {LOCAL_DIR} \
        --outfile {FP16_GGUF} \
        --outtype f16
    print(f"Converted to FP16 GGUF: {FP16_GGUF}")
else:
    print(f"FP16 GGUF already exists: {FP16_GGUF}")

# Check file size
if FP16_GGUF.exists():
    fp16_size = FP16_GGUF.stat().st_size / 1e9
    print(f"FP16 GGUF size: {fp16_size:.2f} GB")  # Expected: ~13-14 GB

FP16 GGUF already exists: ../models/selfrag_llama2_7b.fp16.gguf
FP16 GGUF size: 13.48 GB


### 3.3 Quantize to Q4_K_M

In [18]:
# Cell 3.5: Quantize to Q4_K_M (best quality/size balance)
Q4_GGUF = Path("../models/selfrag_llama2_7b.Q4_K_M.gguf")
QUANTIZE_BIN = LLAMA_CPP_DIR / "build" / "bin" / "llama-quantize"

if not Q4_GGUF.exists():
    print("Quantizing to Q4_K_M (this takes ~2-5 minutes)...")
    !{QUANTIZE_BIN} {FP16_GGUF} {Q4_GGUF} Q4_K_M
    print(f"Quantized to Q4_K_M: {Q4_GGUF}")
else:
    print(f"Q4_K_M GGUF already exists: {Q4_GGUF}")

# Check file size
if Q4_GGUF.exists():
    q4_size = Q4_GGUF.stat().st_size / 1e9
    print(f"Q4_K_M GGUF size: {q4_size:.2f} GB")  # Expected: ~3.8-4.2 GB

Q4_K_M GGUF already exists: ../models/selfrag_llama2_7b.Q4_K_M.gguf
Q4_K_M GGUF size: 4.08 GB
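
The two file sizes above are enough to estimate how many bits each weight occupies after quantization. The sketch below is a rough back-of-the-envelope calculation: it assumes the FP16 file is almost entirely 16-bit weights and ignores metadata overhead. The helper name `effective_bits_per_weight` is made up for this illustration.

```python
# Sizes reported by Cells 3.4 and 3.5 of this notebook, in bytes.
FP16_BYTES = 13.48e9
Q4_K_M_BYTES = 4.08e9


def effective_bits_per_weight(fp16_bytes: float, quant_bytes: float) -> float:
    """Estimate the average bits per weight of a quantized file.

    Assumption: the FP16 file is ~2 bytes per parameter, so the
    parameter count is roughly fp16_bytes / 2.
    """
    n_params = fp16_bytes / 2
    return quant_bytes * 8 / n_params


bpw = effective_bits_per_weight(FP16_BYTES, Q4_K_M_BYTES)
compression = FP16_BYTES / Q4_K_M_BYTES

print(f"Estimated parameters: {FP16_BYTES / 2 / 1e9:.2f}B")
print(f"Estimated bits per weight: {bpw:.2f}")
print(f"Size reduction vs FP16: {compression:.1f}x")
```

The result is above 4 bits because Q4_K_M stores some tensors and per-block scales at higher precision than 4 bits.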


---
## Section 4: Validate Special Tokens (CRITICAL)

In [19]:
# Cell 4.1: Load GGUF model
from llama_cpp import Llama

# Use Q4 if available, otherwise fall back to FP16
MODEL_PATH = Q4_GGUF if Q4_GGUF.exists() else FP16_GGUF

print(f"Loading model: {MODEL_PATH}")
llm = Llama(
    model_path=str(MODEL_PATH),
    n_ctx=2048,          # Context window (2048 safer for 16GB Mac)
    n_batch=512,         # Batch size for prompt processing
    n_gpu_layers=-1,     # Use all GPU layers (Metal)
    verbose=False,
)
print("Model loaded successfully!")

Loading model: ../models/selfrag_llama2_7b.Q4_K_M.gguf


llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64  

Model loaded successfully!
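
The `n_ctx=2048` comment in Cell 4.1 says the smaller context is safer on a 16GB Mac. A rough estimate of the memory involved is sketched below. It assumes the KV cache is stored in FP16 (believed to be the llama.cpp default, but not checked here) and uses the Llama 2 7B shape of 32 layers with a hidden size of 4096 and standard multi-head attention. The function name `kv_cache_bytes` is hypothetical.

```python
# Llama 2 7B architecture constants (multi-head attention, no grouped-query attention).
N_LAYERS = 32
HIDDEN_SIZE = 4096
BYTES_PER_VALUE = 2  # assumption: FP16 KV cache


def kv_cache_bytes(n_ctx: int) -> int:
    """Bytes needed to cache keys and values for n_ctx tokens."""
    # Factor of 2 covers one key vector and one value vector per layer.
    per_token = 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE
    return per_token * n_ctx


for n_ctx in (2048, 4096):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"n_ctx={n_ctx}: {gib:.1f} GiB KV cache")
```

Under these assumptions, halving the context from the 4096 training length to 2048 saves about 1 GiB on top of the 4.08 GB of model weights.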


In [20]:
# Cell 4.2: Define expected tokens (matching actual model tokens from added_tokens.json)
EXPECTED_TOKENS = {
    'retrieve': ['[Retrieval]', '[No Retrieval]', '[Continue to Use Evidence]'],
    'isrel': ['[Relevant]', '[Irrelevant]'],
    'issup': ['[Fully supported]', '[Partially supported]', '[No support / Contradictory]'],
    'isuse': ['[Utility:5]', '[Utility:4]', '[Utility:3]', '[Utility:2]', '[Utility:1]'],
}

print("Expected reflection tokens (from model's added_tokens.json):")
for token_type, tokens in EXPECTED_TOKENS.items():
    print(f"  {token_type}: {tokens}")

Expected reflection tokens (from model's added_tokens.json):
  retrieve: ['[Retrieval]', '[No Retrieval]', '[Continue to Use Evidence]']
  isrel: ['[Relevant]', '[Irrelevant]']
  issup: ['[Fully supported]', '[Partially supported]', '[No support / Contradictory]']
  isuse: ['[Utility:5]', '[Utility:4]', '[Utility:3]', '[Utility:2]', '[Utility:1]']


In [21]:
# Cell 4.3: Test token generation (without passage)
test_prompt = """### Instruction:
What is the capital of France?

### Response:
"""

print("Testing generation WITHOUT passage...")
output = llm(test_prompt, max_tokens=100, stop=["###"], echo=False)
generated = output['choices'][0]['text']
print(f"Generated: {generated}")

Testing generation WITHOUT passage...
Generated: Paris is the capital of France.[Utility:5]
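
Cell 4.3 writes the prompt by hand, and Cell 4.5 does the same with a passage appended. A small helper can keep the two formats consistent. This is a minimal sketch that mirrors the strings used in those cells; `build_selfrag_prompt` is a hypothetical name and is not necessarily how `SelfRAGGGUFInference` builds prompts internally.

```python
from typing import Optional


def build_selfrag_prompt(instruction: str, passage: Optional[str] = None) -> str:
    """Format a prompt the way Cells 4.3 and 4.5 do (hypothetical helper)."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    if passage is not None:
        # Pre-fill the retrieval decision and wrap the evidence in <paragraph> tags.
        prompt += f"[Retrieval]<paragraph>{passage}</paragraph>"
    return prompt


print(build_selfrag_prompt("What is the capital of France?"))
print(build_selfrag_prompt("What is the capital of France?", "Paris is the capital of France."))
```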


In [22]:
# Cell 4.4: Check if model produces reflection tokens
def check_token_presence(text, token_list):
    """Check which tokens from list appear in text."""
    found = [t for t in token_list if t in text]
    return found

print("\nToken validation:")
for token_type, tokens in EXPECTED_TOKENS.items():
    found = check_token_presence(generated, tokens)
    status = "FOUND" if found else "NOT FOUND"
    print(f"  {token_type}: {status} - {found if found else '[]'}")


Token validation:
  retrieve: NOT FOUND - []
  isrel: NOT FOUND - []
  issup: NOT FOUND - []
  isuse: FOUND - ['[Utility:5]']


In [23]:
# Cell 4.5: Test with passage (Self-RAG format)
test_with_passage = """### Instruction:
What is the capital of France?

### Response:
[Retrieval]<paragraph>Paris is the capital and largest city of France. It is located on the Seine River in the north of France.</paragraph>"""

print("Testing generation WITH passage...")
output2 = llm(test_with_passage, max_tokens=100, stop=["###"], echo=False)
generated2 = output2['choices'][0]['text']
print(f"Generated: {generated2}")

print("\nToken validation (with passage):")
for token_type, tokens in EXPECTED_TOKENS.items():
    found = check_token_presence(generated2, tokens)
    status = "FOUND" if found else "NOT FOUND"
    print(f"  {token_type}: {status} - {found if found else '[]'}")

Testing generation WITH passage...
Generated: [Relevant]Paris is the capital of France.[Fully supported][Utility:5]

Token validation (with passage):
  retrieve: NOT FOUND - []
  isrel: FOUND - ['[Relevant]']
  issup: FOUND - ['[Fully supported]']
  isuse: FOUND - ['[Utility:5]']
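
The validation above only checks whether a token appears. Downstream code usually also needs the token values and the answer text with the tokens removed. The sketch below shows one way to do that with plain string matching. It is an illustration under the token vocabulary from Cell 4.2, not the parsing logic of `SelfRAGGGUFInference` or `ReflectionTokenizer`, and `parse_reflection_tokens` is a hypothetical name.

```python
import re
from typing import Dict, Optional

# Token vocabulary copied from EXPECTED_TOKENS in Cell 4.2.
TOKENS = {
    "retrieve": ["[Retrieval]", "[No Retrieval]", "[Continue to Use Evidence]"],
    "isrel": ["[Relevant]", "[Irrelevant]"],
    "issup": ["[Fully supported]", "[Partially supported]", "[No support / Contradictory]"],
    "isuse": ["[Utility:5]", "[Utility:4]", "[Utility:3]", "[Utility:2]", "[Utility:1]"],
}


def parse_reflection_tokens(text: str) -> Dict[str, Optional[str]]:
    """Return the first token of each type found in text, plus the cleaned answer."""
    parsed: Dict[str, Optional[str]] = {}
    answer = text
    for token_type, candidates in TOKENS.items():
        # Keep the candidate that appears earliest in the text, if any.
        hits = [(text.find(t), t) for t in candidates if t in text]
        parsed[token_type] = min(hits)[1] if hits else None
        for t in candidates:
            answer = answer.replace(t, "")
    # Drop any <paragraph>...</paragraph> evidence echoed by the model.
    answer = re.sub(r"<paragraph>.*?</paragraph>", "", answer, flags=re.DOTALL)
    parsed["answer"] = answer.strip()
    return parsed


sample = "[Relevant]Paris is the capital of France.[Fully supported][Utility:5]"
print(parse_reflection_tokens(sample))
```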


---
## Section 5: Inference Integration

In [24]:
# Cell 5.1: Use the SelfRAGGGUFInference class
# Close the previous model to free memory
del llm

# Use our inference class
inference = SelfRAGGGUFInference(
    model_path=str(MODEL_PATH),
    n_ctx=2048,  # 2048 safer for 16GB Mac
    n_gpu_layers=-1,  # Metal acceleration
)

Loading Self-RAG model: ../models/selfrag_llama2_7b.Q4_K_M.gguf



✓ Model loaded (persistent, single instance for all operations)


In [25]:
# Cell 5.2: Test without passage
result1 = inference.generate("What are the elements of negligence?")

print("=" * 60)
print("WITHOUT PASSAGE:")
print("=" * 60)
print(f"Answer: {result1.answer[:300]}..." if len(result1.answer) > 300 else f"Answer: {result1.answer}")
print(f"Retrieve: {result1.retrieve}")
print(f"IsRel: {result1.isrel}")
print(f"IsSup: {result1.issup}")
print(f"IsUse: {result1.isuse}")

WITHOUT PASSAGE:
Answer: The elements of negligence are:

1.Duty of care:<paragraph><paragraph><paragraph><paragraph>
2.Breach of duty: A breach of duty occurs when a person fails to uphold their duty of care.3. Causation: The third element of negligence is causation.4. Damages: The fourth element of negligence is damages.T...
Retrieve: [No Retrieval]
IsRel: None
IsSup: None
IsUse: [Utility:5]


In [26]:
# Cell 5.3: Test with passage
test_passage = """To establish negligence, the plaintiff must prove four elements: 
(1) duty of care - the defendant owed a legal duty to the plaintiff, 
(2) breach - the defendant breached that duty, 
(3) causation - the breach caused the plaintiff's injury, and 
(4) damages - the plaintiff suffered actual harm."""

result2 = inference.generate(
    "What are the elements of negligence?",
    passage=test_passage
)

print("=" * 60)
print("WITH PASSAGE:")
print("=" * 60)
print(f"Answer: {result2.answer[:300]}..." if len(result2.answer) > 300 else f"Answer: {result2.answer}")
print(f"Retrieve: {result2.retrieve}")
print(f"IsRel: {result2.isrel}")
print(f"IsSup: {result2.issup}")
print(f"IsUse: {result2.isuse}")

WITH PASSAGE:
Answer: The elements of negligence are: duty, breach, causation, and damages.
Retrieve: [Retrieval]
IsRel: [Relevant]
IsSup: [Fully supported]
IsUse: [Utility:5]


---
## Section 6: Integrate with Existing Retriever

In [27]:
# Cell 6.1: Load existing FAISS retriever
INDEX_DIR = Path("../data/legalbench_embeddings")

# Check if index exists
if INDEX_DIR.exists():
    # Initialize embedding model
    embedding_model = EmbeddingModel(
        model_name="sentence-transformers/all-mpnet-base-v2",
        device="cpu"  # Use CPU for embedding on Mac
    )
    
    # Create retriever and load index
    retriever = LegalRetriever(
        embedding_model=embedding_model,
        top_k=3
    )
    retriever.load_index(str(INDEX_DIR))
    
    print(f"Loaded retriever with {retriever.get_num_documents()} documents")
else:
    print(f"Index not found at {INDEX_DIR}")
    print("Please run the retrieval indexing notebook first.")
    retriever = None

Loading embedding model: sentence-transformers/all-mpnet-base-v2
Model loaded on cpu
Embedding dimension: 768
Using CPU index
Created IndexFlatIP index with dimension 768
Index loaded from ../data/legalbench_embeddings/faiss_index.faiss
Total documents in index: 326783
Documents loaded from ../data/legalbench_embeddings/documents.pkl
Loaded retriever with 326783 documents
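
The log shows a flat inner-product index (`IndexFlatIP`) of dimension 768, which means retrieval is an exhaustive inner-product search. If the embeddings are L2-normalized, the inner product equals cosine similarity. Whether `LegalRetriever` normalizes its vectors is an assumption here, not something this notebook confirms. The pure-Python sketch below illustrates the idea on toy 2-dimensional vectors; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def top_k_inner_product(query: List[float], docs: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Exhaustively score every document and return the k best (index, score) pairs."""
    q = l2_normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, l2_normalize(d))), i)
        for i, d in enumerate(docs)
    ]
    scored.sort(reverse=True)  # highest inner product first
    return [(i, score) for score, i in scored[:k]]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_inner_product([1.0, 0.0], toy_docs, k=2))
```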


In [28]:
# Cell 6.2: Test full pipeline with retrieval
if retriever is not None:
    question = "Does the NDA require the receiving party to return confidential information?"
    
    print(f"Question: {question}")
    print("=" * 60)
    
    result = inference.generate_with_retrieval(question, retriever)
    
    print(f"\nRetrieved {len(result['passages'])} passages")
    passage_score = result.get('passage_score')
    if passage_score is not None:  # a score of 0.0 is still a valid score
        print(f"Top passage score: {passage_score:.4f}")
    
    print(f"\n--- Generated Output ---")
    output = result['output']
    print(f"Answer: {output.answer[:400]}..." if len(output.answer) > 400 else f"Answer: {output.answer}")
    print(f"\nReflection Tokens:")
    print(f"  Retrieve: {output.retrieve}")
    print(f"  IsRel: {output.isrel}")
    print(f"  IsSup: {output.issup}")
    print(f"  IsUse: {output.isuse}")
else:
    print("Retriever not available. Skipping retrieval test.")

Question: Does the NDA require the receiving party to return confidential information?

Retrieved 3 passages
Top passage score: 0.6992

--- Generated Output ---
Answer: No, the NDA does not require the receiving party to return confidential information.

Reflection Tokens:
  Retrieve: [Retrieval]
  IsRel: [Relevant]
  IsSup: [No support / Contradictory]
  IsUse: [Utility:5]
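
In the output above the model rates its own answer `[Utility:5]` while also marking it `[No support / Contradictory]`, so the utility token alone is not a reliable signal of grounding. One option is to gate answers on the relevance and support tokens. The policy below is a hypothetical example, not part of `SelfRAGGGUFInference`, and the choice to accept `[Partially supported]` is an assumption that a stricter application may want to change.

```python
from typing import Optional

# Hypothetical policy: which IsSup tokens count as acceptable grounding.
ACCEPTED_SUPPORT = {"[Fully supported]", "[Partially supported]"}


def is_grounded(isrel: Optional[str], issup: Optional[str]) -> bool:
    """True when the passage was judged relevant and the answer supported."""
    return isrel == "[Relevant]" and issup in ACCEPTED_SUPPORT


# Tokens from the Cell 6.2 output: relevant passage, unsupported answer.
print(is_grounded("[Relevant]", "[No support / Contradictory]"))  # False
# Tokens from the Cell 5.3 output: relevant passage, fully supported answer.
print(is_grounded("[Relevant]", "[Fully supported]"))  # True
```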


In [29]:
# Cell 6.3: Test multiple questions
if retriever is not None:
    test_questions = [
        "What is the definition of confidential information in the NDA?",
        "Can the receiving party disclose information to its employees?",
        "What happens if there is a breach of the NDA?",
    ]
    
    print("Testing multiple questions...\n")
    
    for i, q in enumerate(test_questions, 1):
        print(f"\n{'='*60}")
        print(f"Question {i}: {q}")
        print(f"{'='*60}")
        
        result = inference.generate_with_retrieval(q, retriever)
        output = result['output']
        
        print(f"Answer: {output.answer[:200]}...")
        print(f"Tokens: IsRel={output.isrel}, IsSup={output.issup}, IsUse={output.isuse}")

Testing multiple questions...


Question 1: What is the definition of confidential information in the NDA?
Answer: The definition of confidential information in the NDA is typically outlined in Section 2 of the agreement.This section defines what information is considered confidential and is protected under the te...
Tokens: IsRel=[Relevant], IsSup=[No support / Contradictory], IsUse=[Utility:5]

Question 2: Can the receiving party disclose information to its employees?
Answer: Yes, the receiving party can disclose information to its employees, but only to the extent that it is necessary for the receiving party's internal use and only after informing the employees of the res...
Tokens: IsRel=[Relevant], IsSup=[Partially supported], IsUse=[Utility:5]

Question 3: What happens if there is a breach of the NDA?
Answer: If there is a breach of an NDA, the person who breached the agreement may be subject to legal action, including fines and other penalties....
Tokens: IsRel=[Relevant], IsSup

---
## Section 7: Evaluation on LegalBench

In [30]:
# Cell 7.1: Load LegalBench queries
QUERIES_FILE = Path("../data/legalbench-rag/queries.json")

if QUERIES_FILE.exists():
    with open(QUERIES_FILE, 'r') as f:
        queries_data = json.load(f)
    
    queries = queries_data.get('tests', queries_data.get('queries', []))
    print(f"Loaded {len(queries)} queries")
    
    # Show sample
    if queries:
        print(f"\nSample query: {queries[0]}")
else:
    print(f"Queries file not found: {QUERIES_FILE}")
    queries = []

Loaded 6889 queries

Sample query: {'query': 'Consider the Non-Disclosure Agreement between CopAcc and ToP Mentors; Does the document indicate that the Agreement does not grant the Receiving Party any rights to the Confidential Information?', 'snippets': [{'file_path': 'contractnli/CopAcc_NDA-and-ToP-Mentors_2.0_2017.txt', 'span': [11461, 11963], 'answer': 'Any and all proprietary rights, including but not limited to rights to and in inventions, patent rights, utility models, copyrights, trademarks and trade secrets, in and to any Confidential Information shall be and remain with the Participants respectively, and Mentor shall not have any right, license, title or interest in or to any Confidential Information, except the limited right to review, assess and help develop such Confidential Information in connection with the Copernicus Accelerator 2017.'}], 'dataset_source': 'ContractNLI'}


In [31]:
# Cell 7.2: Load ground truth labels (if available)
LABELS_FILE = Path("../data/training/legalbench_training_labels.json")

if LABELS_FILE.exists():
    with open(LABELS_FILE, 'r') as f:
        training_labels = json.load(f)
    
    # Create lookup by question
    gt_labels = {ex.get('question', ex.get('query', '')): ex for ex in training_labels}
    print(f"Loaded {len(gt_labels)} ground truth labels")
else:
    print(f"Labels file not found: {LABELS_FILE}")
    gt_labels = {}

Loaded 776 ground truth labels
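
Only 776 labels are loaded against 6889 queries, and the lookup is keyed on the exact question string, so it is worth measuring how many queries actually find a label before reading the accuracy numbers. The sketch below is a hypothetical helper (`label_coverage`) that assumes queries are dicts with a `query` or `question` key, as in Cell 7.1; the toy data is made up for illustration.

```python
from typing import Dict, List


def label_coverage(queries: List[dict], labels: Dict[str, dict]) -> float:
    """Fraction of queries whose text appears as a key in labels."""
    if not queries:
        return 0.0
    matched = sum(
        1 for q in queries
        if q.get("query", q.get("question", "")) in labels
    )
    return matched / len(queries)


# Toy data for illustration only.
toy_queries = [{"query": "Q1"}, {"query": "Q2"}, {"question": "Q3"}, {"query": "Q4"}]
toy_labels = {"Q1": {}, "Q3": {}}
print(f"Coverage: {label_coverage(toy_queries, toy_labels):.0%}")
```

In the notebook the call would be `label_coverage(queries, gt_labels)`. Note that the dict comprehension in Cell 7.2 keeps only one entry per distinct question text, so duplicate questions in the labels file collapse to a single key.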


In [32]:
# Cell 7.3: Evaluate on queries
from collections import defaultdict
from tqdm.notebook import tqdm

# Limit for testing (set to len(queries) for full evaluation)
EVAL_LIMIT = 10

if queries and retriever is not None:
    results = []
    token_accuracy = defaultdict(lambda: {'correct': 0, 'total': 0})
    
    eval_queries = queries[:EVAL_LIMIT]
    print(f"Evaluating on {len(eval_queries)} queries...\n")
    
    for query_data in tqdm(eval_queries, desc="Evaluating"):
        # Handle different query formats
        if isinstance(query_data, dict):
            question = query_data.get('query', query_data.get('question', ''))
        else:
            question = str(query_data)
        
        if not question:
            continue
        
        # Get ground truth
        gt = gt_labels.get(question, {})
        gt_tokens = gt.get('reflection_tokens', {})
        
        # Generate with retrieval
        try:
            output = inference.generate_with_retrieval(question, retriever)
            pred = output['output']
            
            # Store result
            results.append({
                'question': question,
                'predicted': {
                    'answer': pred.answer,
                    'retrieve': pred.retrieve,
                    'isrel': pred.isrel,
                    'issup': pred.issup,
                    'isuse': pred.isuse,
                },
                'ground_truth': gt_tokens,
                'passage_score': output.get('passage_score'),
            })
            
            # Calculate token accuracy
            for token_type in ['retrieve', 'isrel', 'issup', 'isuse']:
                gt_val = gt_tokens.get(token_type)
                pred_val = getattr(pred, token_type)
                
                if gt_val and pred_val:
                    token_accuracy[token_type]['total'] += 1
                    if gt_val == pred_val:
                        token_accuracy[token_type]['correct'] += 1
        except Exception as e:
            print(f"Error on question: {question[:50]}... - {e}")
    
    print(f"\nEvaluation complete. Processed {len(results)} queries.")
else:
    print("Cannot run evaluation: queries or retriever not available.")
    results = []
    token_accuracy = {}

Evaluating on 10 queries...



Evaluating:   0%|          | 0/10 [00:00<?, ?it/s]


Evaluation complete. Processed 10 queries.


In [33]:
# Cell 7.4: Print evaluation results
print("\n" + "=" * 80)
print("EVALUATION RESULTS (Pre-trained Self-RAG 7B GGUF)")
print("=" * 80)

print(f"\nModel: selfrag/selfrag_llama2_7b (Q4_K_M GGUF)")
print(f"Queries evaluated: {len(results)}")

if token_accuracy:
    print("\nReflection Token Accuracy:")
    for token_type, counts in token_accuracy.items():
        if counts['total'] > 0:
            acc = counts['correct'] / counts['total'] * 100
            print(f"  {token_type.upper()}: {acc:.1f}% ({counts['correct']}/{counts['total']})")
        else:
            print(f"  {token_type.upper()}: N/A (no ground truth)")
else:
    print("\nNo token accuracy data available (no ground truth labels).")


EVALUATION RESULTS (Pre-trained Self-RAG 7B GGUF)

Model: selfrag/selfrag_llama2_7b (Q4_K_M GGUF)
Queries evaluated: 10

Reflection Token Accuracy:
  RETRIEVE: 0.0% (0/10)
  ISREL: 40.0% (4/10)
  ISSUP: 0.0% (0/10)
  ISUSE: 50.0% (5/10)
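
RETRIEVE and ISSUP both score 0 of 10, which can mean the model is wrong every time or that the ground truth uses different label strings than the model emits. The notebook does not show which. Tallying the disagreeing (ground truth, predicted) pairs would distinguish the two cases. The sketch below assumes the `results` structure built in Cell 7.3; `mismatch_tally` is a hypothetical helper and the toy records are invented for illustration.

```python
from collections import Counter
from typing import List


def mismatch_tally(results: List[dict], token_type: str) -> Counter:
    """Count (ground_truth, predicted) pairs that disagree for one token type."""
    tally: Counter = Counter()
    for r in results:
        gt = r["ground_truth"].get(token_type)
        pred = r["predicted"].get(token_type)
        # Same filter as Cell 7.3: skip entries missing either value.
        if gt and pred and gt != pred:
            tally[(gt, pred)] += 1
    return tally


# Toy records for illustration only (not real evaluation data).
toy_results = [
    {"ground_truth": {"isrel": "[Irrelevant]"}, "predicted": {"isrel": "[Relevant]"}},
    {"ground_truth": {"isrel": "[Irrelevant]"}, "predicted": {"isrel": "[Relevant]"}},
    {"ground_truth": {"isrel": "[Relevant]"}, "predicted": {"isrel": "[Relevant]"}},
]
for (gt, pred), count in mismatch_tally(toy_results, "isrel").most_common():
    print(f"expected {gt}, got {pred}: {count}")
```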


In [34]:
# Cell 7.5: Analyze token distribution
if results:
    print("\n" + "-" * 40)
    print("Token Distribution Analysis:")
    print("-" * 40)
    
    from collections import Counter
    
    for token_type in ['retrieve', 'isrel', 'issup', 'isuse']:
        values = [r['predicted'][token_type] for r in results if r['predicted'][token_type]]
        counter = Counter(values)
        print(f"\n{token_type.upper()}:")
        for val, count in counter.most_common():
            pct = count / len(values) * 100 if values else 0
            print(f"  {val}: {count} ({pct:.1f}%)")


----------------------------------------
Token Distribution Analysis:
----------------------------------------

RETRIEVE:
  [Retrieval]: 10 (100.0%)

ISREL:
  [Relevant]: 10 (100.0%)

ISSUP:
  [Fully supported]: 9 (90.0%)
  [Partially supported]: 1 (10.0%)

ISUSE:
  [Utility:5]: 10 (100.0%)


In [35]:
# Cell 7.6: Save evaluation results
RESULTS_DIR = Path("../results")
RESULTS_DIR.mkdir(exist_ok=True)

RESULTS_FILE = RESULTS_DIR / "selfrag_7b_gguf_evaluation.json"

evaluation_summary = {
    'model': 'selfrag/selfrag_llama2_7b (Q4_K_M GGUF)',
    'num_queries': len(results),
    'token_accuracy': dict(token_accuracy),
    'results': results,
}

with open(RESULTS_FILE, 'w') as f:
    json.dump(evaluation_summary, f, indent=2)

print(f"Saved results to {RESULTS_FILE}")

Saved results to ../results/selfrag_7b_gguf_evaluation.json


---
## Summary

This notebook demonstrated:
1. **Model Download**: Downloaded pre-trained Self-RAG 7B from HuggingFace
2. **GGUF Conversion**: Converted to FP16 GGUF (13.48 GB), then quantized to Q4_K_M (4.08 GB)
3. **Token Validation**: Verified that the quantized model still emits IsRel, IsSup, and IsUse reflection tokens
4. **Inference Integration**: Used `SelfRAGGGUFInference` class with llama.cpp
5. **Retriever Integration**: Combined with existing `LegalRetriever` for RAG
6. **Evaluation**: Measured reflection token accuracy on the first 10 LegalBench queries (ISREL 40%, ISUSE 50%, RETRIEVE and ISSUP 0%)