# PDF Question Answering with LLM and InterSystems IRIS

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:
- **InterSystems IRIS** as the multimodel database (relational and vector in this example)
- **Sentence Transformers** for text embeddings
- **Mistral AI** as the Large Language Model
- **LangChain** for orchestrating the RAG pipeline

## Workshop Overview
We'll process PDF documents, store them as vectors in IRIS, and answer natural language questions with responses grounded in the retrieved text.

---

## 1. Setup and Dependencies

First, let's import all required libraries for our RAG pipeline:

In [8]:
# Import required libraries
import irisnative          # InterSystems IRIS native database connection
import os                  # Operating system interface
import getpass             # Secure password input
from pypdf import PdfReader                        # PDF reading capability
import sentence_transformers                       # Text embedding models
import numpy as np                                 # Numerical computations

# LangChain components for RAG pipeline
from langchain import hub                          # Pre-built prompts from LangChain Hub
from langchain_text_splitters import RecursiveCharacterTextSplitter  # Text chunking
from langchain_community.document_loaders import PyPDFDirectoryLoader # PDF loading
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.runnables import RunnablePassthrough              # Pipeline utilities
from langchain_core.output_parsers import StrOutputParser             # Output parsing

# Set up Mistral AI API key securely
# This prompts the user to enter their API key without displaying it on screen
os.environ["MISTRAL_API_KEY"] = getpass.getpass("Enter your Mistral API key: ")

# Import Mistral AI LLM after setting the API key
from langchain_mistralai import ChatMistralAI

Enter your Mistral API key:  ········


## 2. Database and LLM Initialization

Now we'll establish connections to our InterSystems IRIS database and initialize our Large Language Model.

In [9]:
# Database connection parameters
# These should match your InterSystems IRIS instance configuration
connection_string = "iris:1972/LLMRAG"  # host:port/namespace
username = "superuser"
password = "SYS"

# Establish connection to InterSystems IRIS database
# This creates both a connection and a cursor for executing SQL commands
connectionIRIS = irisnative.createConnection(connection_string, username, password)
cursorIRIS = connectionIRIS.cursor()
print("✅ Connected to InterSystems IRIS database")

# Initialize the Mistral AI Large Language Model
# mistral-large-latest is an alias that points to the current Mistral Large model
llm = ChatMistralAI(model="mistral-large-latest")
print("✅ Mistral AI LLM initialized")

✅ Connected to InterSystems IRIS database
✅ Mistral AI LLM initialized


## 3. Embedding Model Setup

We need a sentence transformer model to convert text into numerical vectors (embeddings). This allows us to perform semantic similarity searches.

In [10]:
# Check if the embedding model is already downloaded and saved locally
# This saves time and bandwidth by avoiding re-downloads
if not os.path.isdir('/app/data/model/'):
    print("📥 Downloading and saving embedding model...")
    # paraphrase-multilingual-MiniLM-L12-v2 is excellent for multilingual semantic similarity
    # It's lightweight but effective for most RAG applications
    model = sentence_transformers.SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
    model.save('/app/data/model/')
    print("✅ Model saved to local directory")
else:
    print("✅ Embedding model already available locally")

✅ Embedding model already available locally


## 4. Document Processing and Vector Storage

This is the core of our RAG system: we'll load PDF documents, split them into chunks, create embeddings, and store everything in InterSystems IRIS.

In [11]:
# Configure text splitting strategy
# Small chunks (700 chars) keep each retrieved passage focused on a single topic
# Overlap (50 chars) reduces the chance that a sentence is cut off at a chunk boundary
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,      # Maximum characters per chunk
    chunk_overlap=50,    # Characters to overlap between adjacent chunks
)

# Load all PDF documents from the data directory
path = "/app/data"
loader = PyPDFDirectoryLoader(path)
print("📖 Loading PDF documents...")
docs_before_split = loader.load()
print(f"✅ Loaded {len(docs_before_split)} document pages")

# Split documents into smaller, manageable chunks
print("✂️ Splitting documents into chunks...")
docs_after_split = text_splitter.split_documents(docs_before_split)
print(f"✅ Created {len(docs_after_split)} text chunks")

# Load the embedding model from local storage
print("🔄 Loading embedding model...")
model = sentence_transformers.SentenceTransformer("/app/data/model/")
print("✅ Embedding model loaded")

# Process each document chunk: create embeddings and store in IRIS database
print("💾 Processing chunks and storing in database...")
for i, doc in enumerate(docs_after_split):
    # Generate embeddings for the text content
    # normalize_embeddings=True returns unit-length vectors, so the dot product equals cosine similarity
    embeddings = model.encode(doc.page_content, normalize_embeddings=True)
    
    # Convert to numpy array and format for database storage
    array = np.array(embeddings)
    formatted_array = np.vectorize('{:.12f}'.format)(array)  # 12 decimal precision
    
    # Prepare parameters for database insertion
    parameters = [
        doc.metadata['source'],                    # Source PDF file path
        str(doc.page_content),                     # Actual text content
        str(','.join(formatted_array))             # Comma-separated vector values
    ]
    
    # Insert into IRIS database with vector storage
    # TO_VECTOR() converts comma-separated string to IRIS vector format
    cursorIRIS.execute(
        "INSERT INTO LLMRAG.DOCUMENTCHUNK (Document, Phrase, VectorizedPhrase) VALUES (?, ?, TO_VECTOR(?,DECIMAL))", 
        parameters
    )
    
    # Show progress every 10 chunks
    if (i + 1) % 10 == 0:
        print(f"  Processed {i + 1}/{len(docs_after_split)} chunks")

# Commit all changes to the database
connectionIRIS.commit()
print(f"✅ Successfully stored {len(docs_after_split)} chunks in InterSystems IRIS database")

📖 Loading PDF documents...
✅ Loaded 51 document pages
✂️ Splitting documents into chunks...
✅ Created 239 text chunks
🔄 Loading embedding model...
✅ Embedding model loaded
💾 Processing chunks and storing in database...
  Processed 10/239 chunks
  Processed 20/239 chunks
  Processed 30/239 chunks
  Processed 40/239 chunks
  Processed 50/239 chunks
  Processed 60/239 chunks
  Processed 70/239 chunks
  Processed 80/239 chunks
  Processed 90/239 chunks
  Processed 100/239 chunks
  Processed 110/239 chunks
  Processed 120/239 chunks
  Processed 130/239 chunks
  Processed 140/239 chunks
  Processed 150/239 chunks
  Processed 160/239 chunks
  Processed 170/239 chunks
  Processed 180/239 chunks
  Processed 190/239 chunks
  Processed 200/239 chunks
  Processed 210/239 chunks
  Processed 220/239 chunks
  Processed 230/239 chunks
✅ Successfully stored 239 chunks in InterSystems IRIS database
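
The loop above serializes each embedding as a comma-separated string before handing it to `TO_VECTOR()`. As a minimal sketch of that step in isolation, the helper names below (`to_vector_literal`, `from_vector_literal`) are hypothetical and not part of the notebook; they only mirror the 12-decimal formatting used in the cell:

```python
def to_vector_literal(values, decimals=12):
    """Serialize a sequence of floats as a comma-separated string.

    Hypothetical helper: mirrors the '{:.12f}' formatting used in the cell above.
    """
    return ",".join(f"{float(v):.{decimals}f}" for v in values)


def from_vector_literal(text):
    """Parse a comma-separated string back into a list of floats."""
    return [float(part) for part in text.split(",")]


literal = to_vector_literal([0.5, -0.25, 0.125])
print(literal)
print(from_vector_literal(literal))
```

Fixed-point formatting avoids scientific notation such as `1e-05` in the string, which is presumably why the notebook formats each value explicitly instead of calling `str()` on the array.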


## 5. Query Processing and Similarity Search

Now we'll demonstrate how to query our vector database. We'll convert a question into an embedding and find the most relevant document chunks.

In [17]:
# Example question in Spanish (the model supports multiple languages)
user_question = "¿Qué medicamento puede tomar mi hijo de 2 años para bajar la fiebre?"

print(f"🔍 Processing question: '{user_question}'")

# Convert the question into an embedding using the same model
# This ensures semantic similarity with our stored document embeddings
question_embedding = model.encode(user_question, normalize_embeddings=True)

# Format the question embedding for database query
array = np.array(question_embedding)
formatted_array = np.vectorize('{:.12f}'.format)(array)
parameter_query = [str(','.join(formatted_array))]

# Perform similarity search in InterSystems IRIS
# VECTOR_DOT_PRODUCT scores each stored chunk against the question vector
# Both vectors are normalized, so scores fall in [-1, 1]; a document is kept
# when at least one of its chunks scores above the 0.6 threshold
print("🔍 Searching for relevant documents...")
cursorIRIS.execute("""
    SELECT Document, MAX(similarity) AS max_similarity
    FROM (
        SELECT VECTOR_DOT_PRODUCT(VectorizedPhrase, TO_VECTOR(?, DECIMAL)) AS similarity, 
               Document 
        FROM LLMRAG.DOCUMENTCHUNK
    ) 
    WHERE similarity > 0.6 
    GROUP BY Document
    ORDER BY max_similarity DESC
""", parameter_query)

similarity_rows = cursorIRIS.fetchall()
print(f"✅ Found {len(similarity_rows)} relevant documents")

# Display the relevant documents and their similarity scores
for doc_path, similarity in similarity_rows:
    print(f"📄 {doc_path} (similarity: {similarity})")

🔍 Processing question: '¿Qué medicamento puede tomar mi hijo de 2 años para bajar la fiebre?'
🔍 Searching for relevant documents...
✅ Found 1 relevant documents
📄 /APP/DATA/PROSPECTO_69726.HTML.PDF (similarity: .6020914157820870586)
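
The query above uses a plain dot product as the similarity score. That works as a cosine similarity only because both the stored chunks and the question were encoded with `normalize_embeddings=True`. The following sketch checks that equivalence on two toy 2-dimensional vectors (the vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


question = [3.0, 4.0]   # toy "question" vector
chunk = [4.0, 3.0]      # toy "document chunk" vector

# Both values agree up to floating-point rounding (about 0.96 for these vectors)
print(cosine(question, chunk))
print(dot(normalize(question), normalize(chunk)))
```

If the stored vectors were not normalized, the dot product would also reward long vectors, and a fixed threshold such as 0.6 would no longer have a consistent meaning.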


## 6. RAG Chain: Context Building and Answer Generation

Now we'll build the context from relevant documents and use the LLM to generate an accurate, contextual answer.

In [18]:
# Build context from relevant documents
# Concatenate every page of each document that matched the similarity search
context = ''
print("📚 Building context from relevant documents...")

for similarity_row in similarity_rows:
    document_path = similarity_row[0]
    print(f"  Adding content from: {document_path}")
    
    # Collect every page loaded from this PDF. IRIS returned the path in
    # uppercase, so compare against the uppercased metadata path
    for doc in docs_before_split:
        if document_path == doc.metadata['source'].upper():
            context += doc.page_content + "\n\n"  # Add spacing between pages

print(f"✅ Context built with {len(context)} characters")

# Load a pre-built RAG prompt from LangChain Hub
# This prompt template is optimized for question-answering with context
print("🔧 Loading RAG prompt template...")
prompt = hub.pull("rlm/rag-prompt")

# Create the RAG chain using LangChain's pipeline syntax
# This chain: 1) Passes context and question to prompt, 2) Sends to LLM, 3) Parses output
print("⚙️ Building RAG chain...")
rag_chain = (
    {
        "context": lambda x: context,           # Provide the context we built
        "question": RunnablePassthrough()       # Pass the question through unchanged
    }
    | prompt                                    # Apply the RAG prompt template
    | llm                                      # Send to Mistral AI LLM
    | StrOutputParser()                        # Parse the response as a string
)

# Generate the final answer
print("🤖 Generating answer...")
print("=" * 50)
answer = rag_chain.invoke(user_question)
print(f"Question: {user_question}")
print(f"Answer: {answer}")
print("=" * 50)

📚 Building context from relevant documents...
  Adding content from: /APP/DATA/PROSPECTO_69726.HTML.PDF
✅ Context built with 33004 characters
🔧 Loading RAG prompt template...
⚙️ Building RAG chain...
🤖 Generating answer...
Question: ¿Qué medicamento puede tomar mi hijo de 2 años para bajar la fiebre?
Answer: Para un niño de **2 años** (aproximadamente **10-12 kg**), puedes administrarle **Dalsy (ibuprofeno)** en una dosis de **1.8 a 2.4 mL por toma** (cada 6-8 horas), sin superar **7.2-9 mL al día** (288-360 mg/día). **Consulta siempre a un pediatra** antes de medicarlo, especialmente si la fiebre persiste más de 24-48 horas.
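
The context-building loop in this section matches the uppercased paths returned by IRIS against the page metadata. As a rough sketch, the same logic can be written as a standalone function. `Page` and `build_context` are hypothetical names standing in for the LangChain `Document` objects and the inline loop; unlike the cell above, this version joins pages with a blank line instead of appending one after each page:

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded PDF page (source path plus text)."""
    source: str
    text: str


def build_context(matched_paths, pages):
    """Concatenate the text of every page whose source matches a returned path.

    Paths are compared case-insensitively because the database returned
    them in uppercase in the run shown above.
    """
    parts = []
    for path in matched_paths:          # keep the order of the similarity ranking
        for page in pages:
            if page.source.upper() == path.upper():
                parts.append(page.text)
    return "\n\n".join(parts)


pages = [
    Page("/app/data/a.pdf", "A page 1"),
    Page("/app/data/b.pdf", "B page 1"),
    Page("/app/data/a.pdf", "A page 2"),
]
print(build_context(["/APP/DATA/A.PDF"], pages))
```

Passing whole documents gives the model complete context at the cost of a long prompt (33,004 characters in the run above).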


## 7. Cleanup

Finally, let's properly close our database connection to free up resources.

In [19]:
# Close the database connection to free up resources
connectionIRIS.close()
print("✅ Database connection closed successfully")

print("\n🎉 Workshop completed successfully!")
print("\nWhat we accomplished:")
print("✅ Connected to InterSystems IRIS database")
print("✅ Loaded and processed PDF documents") 
print("✅ Generated embeddings for semantic search")
print("✅ Stored document chunks as vectors in IRIS")
print("✅ Performed similarity search for relevant content")
print("✅ Generated contextual answers using RAG pipeline")
print("\n💡 Try modifying the question above to test different queries!")

✅ Database connection closed successfully

🎉 Workshop completed successfully!

What we accomplished:
✅ Connected to InterSystems IRIS database
✅ Loaded and processed PDF documents
✅ Generated embeddings for semantic search
✅ Stored document chunks as vectors in IRIS
✅ Performed similarity search for relevant content
✅ Generated contextual answers using RAG pipeline

💡 Try modifying the question above to test different queries!


---

## 🚀 Next Steps and Experiments

Now that you have a working RAG system, try these enhancements:

### 1. **Different Questions**
Try asking questions in different languages or about different topics from your PDFs:
```python
# Example questions to try:
questions = [
    "What are the side effects mentioned in the documents?",
    "¿Cuál es la dosis recomendada para adultos?",
    "How should this medication be stored?",
]
```

### 2. **Adjust Similarity Threshold**
Experiment with different similarity thresholds in the database query:
- Higher threshold (0.8): More precise but fewer results
- Lower threshold (0.4): More results but potentially less relevant
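
To get a feel for this trade-off without touching the database, here is a small sketch that applies several thresholds to made-up similarity scores. The chunk names and scores are invented for illustration, and `filter_by_threshold` is a hypothetical helper that mimics the `WHERE similarity > ?` filter:

```python
# Invented scores for illustration only; real values come from VECTOR_DOT_PRODUCT
scores = {"chunk_a": 0.82, "chunk_b": 0.61, "chunk_c": 0.45, "chunk_d": 0.12}


def filter_by_threshold(scores, threshold):
    """Keep entries scoring strictly above the threshold, best first."""
    kept = [(name, score) for name, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


for threshold in (0.8, 0.6, 0.4):
    names = [name for name, _ in filter_by_threshold(scores, threshold)]
    print(f"threshold {threshold}: {names}")
```

In the run shown earlier, the only matching document scored about 0.602, so a threshold of 0.8 would have returned nothing for that question.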

### 3. **Different Embedding Models**
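Try other sentence transformer models. After switching, re-embed and re-insert every chunk, because vectors produced by different models are not comparable:
- `all-MiniLM-L6-v2`: Faster, English-focused
- `all-mpnet-base-v2`: Higher quality embeddings
- `intfloat/multilingual-e5-large`: Better multilingual support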

### 4. **Chunk Size Optimization**
Experiment with different `chunk_size` values:
- Smaller chunks (300-500): More precise retrieval
- Larger chunks (1000-1500): Better context preservation
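
The effect of `chunk_size` on the number of chunks can be estimated with a simplified sliding-window splitter. Note that this is an assumption-laden stand-in, not `RecursiveCharacterTextSplitter`: the real splitter prefers to break on paragraph and sentence separators, so its chunk counts will differ. `chunk_text` is a hypothetical helper:

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: it cuts at exact character offsets and
    ignores separators such as newlines or spaces.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2000))  # 2000 characters of dummy text

for size in (300, 700, 1500):
    print(size, len(chunk_text(sample, chunk_size=size, chunk_overlap=50)))
```

More chunks mean more rows and more embeddings to compute, so smaller sizes trade storage and ingestion time for retrieval precision.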

### 5. **Advanced Query Features**
Add query expansion, keyword filtering, or hybrid search combining semantic and keyword matching.
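
As one possible starting point for hybrid search, the sketch below blends a semantic score with a simple keyword-overlap score. Everything here is an assumption for illustration: the function names, the `alpha` weight of 0.7, the example texts, and the semantic scores (which in the notebook would come from `VECTOR_DOT_PRODUCT`):

```python
import re


def keyword_score(question, text):
    """Fraction of distinct question words that also appear in the text."""
    question_words = set(re.findall(r"\w+", question.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    if not question_words:
        return 0.0
    return len(question_words & text_words) / len(question_words)


def hybrid_score(semantic, keyword, alpha=0.7):
    """Weighted blend: alpha for the semantic score, the rest for keywords."""
    return alpha * semantic + (1 - alpha) * keyword


def rank(question, candidates, alpha=0.7):
    """Sort (text, semantic_score) pairs by their hybrid score, best first."""
    scored = [
        (hybrid_score(semantic, keyword_score(question, text), alpha), text)
        for text, semantic in candidates
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Invented candidates: (chunk text, semantic similarity)
candidates = [
    ("Store below 25 degrees", 0.62),
    ("Ibuprofen dose for children by weight", 0.58),
]

for score, text in rank("ibuprofen dose children", candidates):
    print(f"{score:.3f}  {text}")
```

Here the second chunk overtakes the first despite a lower semantic score, because it contains all three question words. The right value for `alpha` depends on the documents and would need to be tuned empirically.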