# Jerusalem RAG Explorer - Complete Demo

## A Retrieval-Augmented Generation System for Crusader History Research

**By Yotam Nachtomy-Katz**  
**ID: 211718366**  
**Submitted: 01.02.26**  
**Course: Information Retrieval**  
**Haifa University**

---

This notebook provides a complete demonstration of the Jerusalem RAG Explorer system, from data ingestion to question answering.

## Table of Contents

1. [Project Overview](#1-project-overview)
2. [System Architecture](#2-system-architecture)
3. [Setup](#3-setup)
4. [Data Pipeline Demo](#4-data-pipeline-demo)
5. [Retrieval Demo](#5-retrieval-demo)
6. [Question Answering Demo](#6-question-answering-demo)
7. [Response Modes](#7-response-modes)
8. [Conclusion](#8-conclusion)

## 1. Project Overview

### Problem Statement

Researching the Crusades presents unique challenges:
- **Language Barriers**: Primary sources exist in Latin, Arabic, Greek, Armenian, and Old French
- **Scattered Archives**: Documents are distributed across multiple digital repositories
- **Volume**: Thousands of pages must be manually searched to find relevant passages
- **Perspective Bias**: Western sources dominate; Eastern perspectives are underrepresented

### Solution

Jerusalem RAG Explorer addresses these challenges through:
- **Multilingual Corpus**: Aggregates Latin, Arabic, Greek, Armenian, and French sources
- **AI Translation**: Pre-translates non-English texts during ingestion
- **Semantic Search**: FAISS index enables natural language queries
- **Grounded Answers**: LLM generates responses with mandatory source citations
- **Comparative Analysis**: Compare Western, Eastern, and Byzantine perspectives

## 2. System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        DATA INGESTION                           │
├─────────────────────────────────────────────────────────────────┤
│  Archive.org  ──┐                                               │
│  Gallica (BnF) ─┼──▶ Fetch ──▶ Chunk ──▶ Translate ──▶ Embed    │
│  Wikipedia ─────┘                                               │
│                                            │                    │
│                                            ▼                    │
│                                    ┌──────────────┐             │
│                                    │ FAISS Index  │             │
│                                    │ + Metadata   │             │
│                                    └──────────────┘             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                        QUERY PIPELINE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  User Question ──▶ Embed ──▶ FAISS Search ──▶ Top-K Chunks      │
│                                                      │          │
│                                                      ▼          │
│                              Context + Prompt ──▶ Gemini LLM    │
│                                                      │          │
│                                                      ▼          │
│                                          Answer with Citations  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Technology Stack

| Component | Technology | Purpose |
|-----------|------------|---------|  
| Frontend | Streamlit | Interactive web UI |
| Embeddings | sentence-transformers | 384-dim text vectors |
| Vector Search | FAISS | Fast similarity search |
| LLM | Google Gemini | Answer generation |
| Translation | Gemini API | Medieval text translation |

## 3. Setup

In [1]:
# Import required libraries
import os
import json
import sys
from pathlib import Path

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv

# Load environment variables
load_dotenv("../.env")

# Add parent directory to path for imports
sys.path.insert(0, str(Path("..").absolute()))

print("Libraries loaded successfully!")
print(f"GEMINI_API_KEY: {'✓ Found' if os.getenv('GEMINI_API_KEY') else '✗ Missing'}")

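This cell fails with `ModuleNotFoundError: No module named 'faiss'` when FAISS is not installed. Install the CPU build with `pip install faiss-cpu` and rerun the cell.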

In [None]:
# Configuration
INDEX_DIR = Path("../data/index_v2")
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# Language utilities
LANGUAGE_NAMES = {"en": "English", "la": "Latin", "ar": "Arabic", "el": "Greek", "fr": "French", "hy": "Armenian"}
LANGUAGE_FLAGS = {"en": "🇬🇧", "la": "🇻🇦", "ar": "🇸🇦", "el": "🇬🇷", "fr": "🇫🇷", "hy": "🇦🇲"}

def get_lang_name(code): return LANGUAGE_NAMES.get(code, code.upper())
def get_lang_flag(code): return LANGUAGE_FLAGS.get(code, "🌐")

## 4. Data Pipeline Demo

### 4.1 Document Sources

The system fetches documents from:
- **Archive.org**: Recueil des historiens des croisades (Latin, Arabic, Greek, Armenian)
- **Gallica (BnF)**: French National Library manuscripts
- **Wikipedia**: Modern encyclopedic content

In [None]:
# Show sample source documents
data_dir = Path("../data/raw")

if data_dir.exists():
    txt_files = list(data_dir.rglob("*.txt"))
    print(f"Total documents in corpus: {len(txt_files)}")
    print("\nSample documents:")
    for f in txt_files[:5]:
        print(f"  - {f.name}")
else:
    print("Data directory not found. Run 01_data_fetching.ipynb first.")

### 4.2 Chunking Strategy

Documents are split into overlapping segments:
- **Chunk size**: 2000 characters
- **Overlap**: 300 characters (preserves context across boundaries)

In [None]:
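def chunk_text(text, chunk_size=2000, overlap=300):
    """Split text into overlapping chunks of at most chunk_size characters."""
    if not 0 <= overlap < chunk_size:
        # A non-positive step would make the loop below run forever.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size].strip())
        i += chunk_size - overlap  # advance by the stride (1700 by default)
    return [c for c in chunks if c]  # drop chunks that were only whitespace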

# Demonstrate
demo_text = "A" * 5000
demo_chunks = chunk_text(demo_text)
print(f"Demo: {len(demo_text)} chars → {len(demo_chunks)} chunks")
print(f"Chunk sizes: {[len(c) for c in demo_chunks]}")
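
The effect of these settings can be checked with plain arithmetic: each chunk starts `chunk_size - overlap = 1700` characters after the previous one. The sketch below is illustrative only; `chunk_spans` is a hypothetical helper (not part of the project) that lists the character offsets `chunk_text` would slice for a text of a given length, before whitespace stripping.

```python
import math


def chunk_spans(n_chars, chunk_size=2000, overlap=300):
    """Return the (start, end) character offsets of each chunk.

    Hypothetical helper for illustration; it mirrors the loop in chunk_text
    but works on a length instead of a string.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    return [(start, min(start + chunk_size, n_chars))
            for start in range(0, n_chars, stride)]


spans = chunk_spans(5000)
print(spans)  # [(0, 2000), (1700, 3700), (3400, 5000)]

# The number of chunks is ceil(length / stride).
assert len(spans) == math.ceil(5000 / 1700)

# Each chunk shares its first 300 characters with the end of the previous one.
for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
    assert prev_end - next_start == 300
```

The last chunk is shorter (1600 characters here) because the text ends before a full window is filled.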

### 4.3 Load Processed Index

In [None]:
# Load embedding model
print(f"Loading embedding model: {MODEL_NAME}")
model = SentenceTransformer(MODEL_NAME)

# Load FAISS index
print("Loading FAISS index...")
index = faiss.read_index(str(INDEX_DIR / "faiss.index"))
print(f"  Index contains {index.ntotal} vectors")

# Load chunks
print("Loading chunks metadata...")
with open(INDEX_DIR / "chunks.json", "r", encoding="utf-8") as f:
    chunks = json.load(f)
print(f"  Loaded {len(chunks)} chunks")

# Count by language
lang_counts = {}
for c in chunks:
    lang = c.get("language", "en")
    lang_counts[lang] = lang_counts.get(lang, 0) + 1

print("\nChunks by language:")
for lang, count in sorted(lang_counts.items(), key=lambda x: -x[1]):
    print(f"  {get_lang_flag(lang)} {get_lang_name(lang)}: {count}")
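
Retrieval looks chunks up by the row ids that FAISS returns, so the index and `chunks.json` have to stay aligned. The following is a minimal sanity check, assuming that chunk `i` in the JSON list corresponds to vector `i` in the index (the notebook relies on this but does not verify it). `check_index_alignment` is a hypothetical helper, and the stand-in object only mimics the `ntotal` and `d` attributes of a FAISS index so that the sketch runs without the real files.

```python
from types import SimpleNamespace


def check_index_alignment(index, chunks, expected_dim=384):
    """Raise ValueError if the index and the chunk metadata disagree.

    Hypothetical helper: `index` only needs `ntotal` (vector count) and
    `d` (vector dimension), which FAISS index objects expose.
    """
    if index.ntotal != len(chunks):
        raise ValueError(
            f"index has {index.ntotal} vectors but metadata has {len(chunks)} chunks"
        )
    if index.d != expected_dim:
        raise ValueError(
            f"index dimension is {index.d}, expected {expected_dim}"
        )
    return True


# Stand-ins for the real index and metadata (toy values).
fake_index = SimpleNamespace(ntotal=3, d=384)
fake_chunks = [{"chunk_id": f"toy_{i}"} for i in range(3)]
print(check_index_alignment(fake_index, fake_chunks))  # True
```

With the objects loaded above, the call would be `check_index_alignment(index, chunks)`.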

## 5. Retrieval Demo

The retrieval system:
1. Embeds the query with the same model used to embed the corpus (`all-MiniLM-L6-v2`)
2. Searches the FAISS index for similar vectors
3. Returns top-k chunks with relevance scores
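
What step 2 computes can be sketched without FAISS. The example assumes the index stores L2-normalized vectors and ranks by inner product, which then equals cosine similarity. The notebook normalizes the query embedding, but the index type is not shown, so treat this as an assumption. The vectors are toy 2-D values and `top_k_cosine` is a hypothetical name.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_cosine(query, vectors, k):
    """Brute-force search: return the k best (score, row_id) pairs.

    On unit vectors the inner product is the cosine similarity.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), row_id)
        for row_id, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k_cosine([1.0, 0.1], toy_vectors, k=2)
print([row_id for _, row_id in best])  # [0, 2]
```

FAISS performs the same ranking over the whole index far more efficiently; the returned row ids are what `retrieve` uses to look up entries in `chunks`.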

In [None]:
def retrieve(question, top_k=6, languages=None):
    """Retrieve top-k relevant chunks."""
    # Embed query
    q_emb = model.encode([question], normalize_embeddings=True)
    q_emb = np.array(q_emb, dtype="float32")
    
    # Search
    search_k = top_k * 3 if languages else top_k
    scores, ids = index.search(q_emb, search_k)
    
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx < 0 or idx >= len(chunks):
            continue
        chunk = chunks[idx]
        
        # Language filter
        if languages:
            if chunk.get("language", "en") not in languages:
                continue
        
        results.append((float(score), chunk))
        if len(results) >= top_k:
            break
    
    return results
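
The language filter runs after the FAISS search, which is why `retrieve` requests `top_k * 3` candidates when `languages` is set. The standalone sketch below replays that post-filtering on a hand-written candidate list (toy scores and chunk ids, not real results); `filter_by_language` is a hypothetical name. It also shows a consequence of this design: when few candidates match, fewer than `top_k` results come back.

```python
def filter_by_language(ranked, languages, top_k):
    """Keep the first top_k candidates whose language is in `languages`.

    `ranked` is a list of (score, chunk) pairs, best first. Chunks without
    a "language" key are treated as English, as in retrieve().
    """
    kept = []
    for score, chunk in ranked:
        if chunk.get("language", "en") not in languages:
            continue
        kept.append((score, chunk))
        if len(kept) >= top_k:
            break
    return kept


# Toy candidates standing in for the top_k * 3 rows returned by the index.
candidates = [
    (0.71, {"chunk_id": "toy_a", "language": "en"}),
    (0.65, {"chunk_id": "toy_b", "language": "ar"}),
    (0.60, {"chunk_id": "toy_c"}),  # no language key, counted as "en"
    (0.58, {"chunk_id": "toy_d", "language": "la"}),
    (0.40, {"chunk_id": "toy_e", "language": "ar"}),
    (0.39, {"chunk_id": "toy_f", "language": "en"}),
]

arabic = filter_by_language(candidates, {"ar"}, top_k=2)
print([chunk["chunk_id"] for _, chunk in arabic])  # ['toy_b', 'toy_e']

latin = filter_by_language(candidates, {"la"}, top_k=2)
print(len(latin))  # 1
```

If a filtered query returns too few chunks, widening the candidate pool (a larger multiplier than 3) is one option to consider.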

In [None]:
# Demo retrieval
question = "What happened at the Battle of Hattin?"
results = retrieve(question, top_k=5)

print(f"Query: '{question}'\n")
print(f"Retrieved {len(results)} chunks:\n")

for i, (score, chunk) in enumerate(results):
    lang = chunk.get("language", "en")
    flag = get_lang_flag(lang)
    is_trans = "(translated)" if chunk.get("is_translation") else ""
    
    print(f"{i+1}. [{chunk['chunk_id']}] Score: {score:.3f} {flag} {is_trans}")
    print(f"   Preview: {chunk['text'][:150]}...\n")

## 6. Question Answering Demo

The complete RAG pipeline:
1. **Retrieve** relevant chunks
2. **Format** context with metadata
3. **Generate** answer with Gemini LLM

In [None]:
from google import genai
from google.genai import types

SYSTEM_PROMPT = """You are a scholarly historian specializing in the Crusades (1095-1291 CE).

RULES:
1. Answer ONLY using the provided CONTEXT
2. EVERY claim must cite [ChunkID]
3. Note original language of translated sources
4. If insufficient information, say so
"""

def format_context(results):
    """Format chunks for LLM."""
    parts = []
    for score, chunk in results:
        header = f"[{chunk['chunk_id']}] (score: {score:.3f})"
        if chunk.get("original_language"):
            header += f" [Translated from {get_lang_name(chunk['original_language'])}]"
        parts.append(f"{header}\n{chunk['text']}")
    return "\n\n---\n\n".join(parts)

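def ask_question(question, mode="default", top_k=6):
    """Complete RAG pipeline: retrieve, format the context, generate an answer.

    `mode` names one of the response modes listed in Section 7, but this demo
    ignores it and always builds the same prompt.
    """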
    # Retrieve
    results = retrieve(question, top_k=top_k)
    if not results:
        return "No relevant sources found.", []
    
    # Format context
    context = format_context(results)
    
    # Build prompt
    prompt = f"{SYSTEM_PROMPT}\n\nQUESTION: {question}\n\nCONTEXT:\n{context}\n\nANSWER:"
    
    # Generate
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    config = types.GenerateContentConfig(temperature=0.3, max_output_tokens=4096)
    resp = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=prompt,
        config=config
    )
    
    return resp.text or "", results

In [None]:
# Demo question answering
question = "What happened at the Battle of Hattin?"

print(f"Question: {question}")
print("=" * 60)

answer, sources = ask_question(question, top_k=6)

print("\nANSWER:")
print(answer)

print("\n" + "=" * 60)
print("SOURCES:")
for score, chunk in sources:
    flag = get_lang_flag(chunk.get("language", "en"))
    print(f"  {flag} {chunk['chunk_id']} (score: {score:.3f})")

## 7. Response Modes

The system supports multiple response formats:

| Mode | Description |
|------|-------------|
| **default** | Scholarly prose with citations |
| **chronology** | Timeline format |
| **dossier** | Structured report |
| **comparative** | Cross-cultural analysis |
| **claim_check** | Fact verification |
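
In this notebook, `ask_question` accepts a `mode` argument but builds the same prompt for every mode. One way the modes could be wired in is to append a mode-specific formatting instruction to the prompt. The sketch below is an assumption, not the project's implementation: the instruction wording and the `build_prompt` helper are hypothetical.

```python
# Hypothetical instruction text for each mode in the table above.
MODE_INSTRUCTIONS = {
    "default": "Write scholarly prose with citations.",
    "chronology": "Present the answer as a dated timeline.",
    "dossier": "Present the answer as a structured report with headings.",
    "comparative": "Compare how sources from different traditions describe the topic.",
    "claim_check": "State whether the claim is supported by the context and why.",
}


def build_prompt(system_prompt, question, context, mode="default"):
    """Assemble the LLM prompt, adding the formatting instruction for `mode`."""
    if mode not in MODE_INSTRUCTIONS:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_INSTRUCTIONS)}"
        )
    return (
        f"{system_prompt}\n\nFORMAT: {MODE_INSTRUCTIONS[mode]}\n\n"
        f"QUESTION: {question}\n\nCONTEXT:\n{context}\n\nANSWER:"
    )


prompt = build_prompt(
    "You are a scholarly historian.",
    "Who was Baldwin IV of Jerusalem?",
    "[toy_1] placeholder context",
    mode="chronology",
)
print(prompt)
```

Rejecting unknown modes early keeps a typo in the mode name from silently falling back to the default format.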

In [None]:
# Try different questions
questions = [
    "Who was Baldwin IV of Jerusalem?",
    "What were the laws of the Kingdom of Jerusalem?",
    "How did Arabic sources describe the Crusaders?",
]

for q in questions:
    print(f"\n{'='*60}")
    print(f"Q: {q}")
    print("="*60)
    
    answer, sources = ask_question(q, top_k=4)
    print(f"\nA: {answer[:500]}..." if len(answer) > 500 else f"\nA: {answer}")
    print(f"\n[{len(sources)} sources used]")

## 8. Conclusion

### Summary

Jerusalem RAG Explorer demonstrates how RAG systems can:
- **Bridge language barriers** through AI translation
- **Enable semantic search** across historical documents
- **Maintain scholarly rigor** through mandatory citations
- **Reveal multiple perspectives** on historical events

### Key Achievements

1. **Multilingual Corpus**: Latin, Arabic, Greek, Armenian, French sources
2. **Pre-translation Pipeline**: Non-English texts translated during ingestion
3. **Semantic Retrieval**: FAISS index with 384-dim embeddings
4. **Grounded Generation**: Every answer cites specific sources
5. **Comparative Analysis**: Compare Eastern vs Western perspectives

### Future Work

- Fine-tune embeddings on medieval historical text
- Add more sources (Vatican Library, British Library)
- Implement hybrid search (BM25 + semantic)
- Add citation verification
