[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ovaccarelli/LLM-RAG/blob/main/notebooks/llm_rag_Open_Source_AI_Workshop_final.ipynb)

# 🔧 0. Setup

Before we start building our Retrieval-Augmented Generation (RAG) system, we need to install the necessary libraries.
This will set up the core components for document loading, vector storage, LLM orchestration, and environment handling.

### 📦 Installation

> ✅ **Note:** Installation may take a few minutes. Once it completes, you're ready to start building the Retrieval-Augmented Generation (RAG) pipeline!

In [None]:
# Install all required Python packages for this workshop

!pip install langchain langchain-community faiss-cpu pymupdf pypdf sentence_transformers rich wget python-dotenv cryptography langchain_ollama langchain-docling pymupdf4llm



### 🐳 Installing and Running Ollama in Colab

To run Ollama (a local LLM server) inside Google Colab, we’ll open an interactive terminal session, download and install Ollama, then start the Ollama service and pull a model.

If you prefer running Ollama on your PC, just follow the same procedure on your local shell (admin rights required), and consult the official docs at ollama.ai.



#### Enable a Terminal in Colab  
We’ll use the `colab-xterm` extension, which opens a browser-based bash terminal directly in your notebook.

In [None]:
# Install the colab-xterm extension

!pip install colab-xterm #https://pypi.org/project/colab-xterm/
%load_ext colabxterm

#### Install Ollama and download a pre-trained LLM model
Open the terminal with `%xterm`

- Download and run Ollama’s installer (only once per session): `curl https://ollama.ai/install.sh | sh`.
- Start the Ollama API server in the background: `ollama serve &`

- Download a pre-trained LLM (e.g., Mistral) for local inference: `ollama pull mistral`







In [None]:
%xterm
# Run these commands inside the terminal that opens (not in this cell):
#   curl https://ollama.ai/install.sh | sh
#   ollama serve &
#   ollama pull mistral

#### 🧠 Check Installed Models

You can use the command below to see which LLMs are currently available in your local Ollama environment.  
This includes all models you've already downloaded (e.g., `mistral`), along with their sizes and versions.

You can also explore and experiment with many other open-source models available on Ollama by browsing their collection here:  
👉 [https://ollama.com/search](https://ollama.com/search)

In [None]:
!ollama list

#### ✅ Verify Your LLM Setup with LangChain

Now that your Ollama server is running and the model is pulled, let’s check if everything is connected properly.

We’ll use `LangChain`'s `OllamaLLM` class to send a test prompt to the **Mistral** model running locally via Ollama.
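
Before sending a prompt, it can help to confirm that the server is reachable at all. The sketch below is one way to do that with the standard library only. It assumes Ollama is listening on its default address (`http://localhost:11434`) and queries the `/api/tags` endpoint, which lists local models; `ollama_is_up` is a hypothetical helper name, not part of any library.

```python
import urllib.error
import urllib.request

# Assumption: Ollama listens on its default address and port.
OLLAMA_URL = "http://localhost:11434"


def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 at <base_url>/api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error, etc.
        return False


print("Ollama reachable:", ollama_is_up())
```

If this prints `False`, the most likely cause is that `ollama serve &` is not running in the terminal.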


In [None]:
# Verify your Ollama-backed LLM is working with LangChain

from langchain_ollama.llms import OllamaLLM

# Initialize the OllamaLLM wrapper with the 'mistral' model you pulled
llm = OllamaLLM(model="mistral")

# Generate a simple test completion
response = llm.generate(["Hello, how are you today?"])

# Print out the first generated text
print(response.generations[0][0].text)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# 📄 1. Start Your LLM-RAG Pipeline: Load PDFs and Ask Your First Question

Now that we have our local LLM (e.g., `mistral`) running via Ollama, let's build the first part of our Retrieval-Augmented Generation (RAG) pipeline.

In this section, you'll:

- Download several PDFs
- Use LangChain to ingest and process these documents
- Set up your local LLM with a simple custom prompt
- Run your first query using LangChain's `PromptTemplate` + `OllamaLLM` integration

We'll start with a basic prompt-based LLM chain. Later, we'll add document embeddings, a retriever, and a full RAG chain.

### 🛠️ Let's Get Started:


In [None]:
import os, time
from pathlib import Path

import langchain
import wget
from dotenv import load_dotenv
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
# Third-party integrations live in langchain_community; the old langchain.* paths are deprecated
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_ollama.llms import OllamaLLM
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
import pymupdf4llm
from langchain_core.documents.base import Document
from rich.console import Console
from rich.markdown import Markdown

console = Console()

### 🌐 Download PDFs
We download a small collection of PDFs. These will be our example knowledge base for the RAG pipeline.


In [None]:
# Create the "data/PDFs" folder if it doesn't exist
PDF_FOLDER = Path("data/PDFs")
os.makedirs(PDF_FOLDER, exist_ok=True)

urls = [
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/PDFs/Open_Source_AI_workshop.pdf",
]

# Download each PDF unless it is already on disk
for url in urls:
    name = url.split("/")[-1]
    target = PDF_FOLDER / name
    if not target.is_file():
        wget.download(url, str(target))
console.print("PDF file downloaded successfully.", style="bold green")

### 🤖 Initialize the Local LLM via Ollama
We load the `mistral` model using the `OllamaLLM` wrapper from LangChain.  
We also define some decoding parameters, like temperature and stopping conditions.

In [None]:
# Load your local LLM

llm = OllamaLLM(
    model="mistral",
    temperature=0.1,  # Will be explained later
    stop=["<end_of_turn>"],
)

### 🧾 Define a Custom Prompt Template
We craft a basic prompt to structure how questions are sent to the LLM.
This format is compatible with LangChain's `PromptTemplate` system.
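
To see what the model actually receives, the sketch below fills a template shaped like the one in this section using plain `str.format`. It is only an illustration of placeholder substitution: `fill` is a hypothetical helper, and LangChain's `PromptTemplate` adds validation and other features on top of this idea.

```python
# A template with one placeholder, shaped like the one used in this section.
template = """
You are a helpful assistant that answers the question in detail.

Human input: {question}
Assistant:"""


def fill(template_text: str, **variables: str) -> str:
    """Substitute every {placeholder} in the template with the given value."""
    return template_text.format(**variables)


filled = fill(template, question="What is RAG?")
print(filled)
```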


In [None]:
# Create a simple prompt to ask a question

template = """
You are a helpful assistant that answers the question in detail.

Human input: {question}
Assistant:"""

prompt = PromptTemplate(input_variables=["question"], template=template)
prompt

### 🔗 Create a Simple LangChain LLM Chain
We create a basic chain that connects the prompt to the LLM.
This is the foundation for more complex RAG workflows (which we'll build later).
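
The `|` operator used for chaining can look like magic at first. The toy sketch below shows the general idea with a hypothetical `Step` class that overloads `__or__`. It is not LangChain's implementation, only a minimal model of "run the left side, then feed its output to the right side".

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for a chain component (not a LangChain class)."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then passes its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Fake "prompt" and "LLM" steps so the example runs without a model
fake_prompt = Step(lambda q: f"Human input: {q}\nAssistant:")
fake_llm = Step(lambda text: f"[model received {len(text)} characters]")

chain = fake_prompt | fake_llm
print(chain.invoke("hello"))
```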


In [None]:
llm_chain = prompt | llm

### Ask a Question!
We test the chain by sending it a first question.


In [None]:
# Pass the prompt variables explicitly as a dict keyed by variable name
result = llm_chain.invoke({"question": "Who wrote this article?"})

console.print(Markdown(result))

> ✅ **Note:** The LLM returns a response based on its pretrained knowledge (not yet using the PDFs).

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 2. Extract Text from a Single PDF

In this step, we’ll load one PDF file and convert its pages into plain text (or Markdown) using three different methods:

- **PyPDFLoader** (LangChain): A straightforward loader that splits the PDF into page-level `Document` objects.  
- **PyMuPDF4LLM**: A fast, native extractor that generates Markdown-formatted text with optional page-wise chunking.  
- **Docling**: A robust parser that preserves layout and exports content either as chunks (`ExportType.DOC_CHUNKS`, the default) or as one whole-document Markdown string (`ExportType.MARKDOWN`).

You will see how to:

1. Read the PDF from disk.  
2. Extract every page’s text into a structured format.  
3. Time each method to compare performance.  
4. Preview a specific page for verification.
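
The timing in step 3 follows the same pattern for every loader, so it can be wrapped in a small helper. The sketch below uses a hypothetical `timed` function and a stand-in "extractor" (`str.split` on form-feed characters) so that it runs without any PDF library installed.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for a PDF loader: split a string into "pages" on form feeds
pages, seconds = timed(str.split, "page one\fpage two\fpage three", "\f")
print(f"Extracted {len(pages)} pages in {seconds:.4f} seconds")
```

`time.perf_counter()` is used here instead of `time.time()` because it is designed for measuring short durations.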

### 📁 Set Up Paths & Choose One PDF for Testing

In [None]:
# Create the "data/sample_pdf" folder if it doesn't exist
SAMPLE_PDF_DIR = Path("data/sample_pdf")
os.makedirs(SAMPLE_PDF_DIR, exist_ok=True)

# URLs of the PDFs to test
urls = [
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/sample_pdf/2312.10997.pdf",
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/sample_pdf/2312.10997_page13.pdf",
]

# Download each PDF unless it is already on disk
for url in urls:
    name = url.split("/")[-1]
    target = SAMPLE_PDF_DIR / name
    if not target.is_file():
        wget.download(url, str(target))
console.print("PDF files downloaded successfully.", style="bold green")

### PyPDFLoader

In [None]:
pdf_path = SAMPLE_PDF_DIR / "2312.10997.pdf"  # The full paper (all pages)

# Load the PDF with PyPDFLoader
start = time.time()
loader = PyPDFLoader(str(pdf_path))
docs_pypdf = loader.load()                 # returns a list of Document objects, one per page
end = time.time()

print(f"Using file: {pdf_path.name}")
print(f"🕒 PyPDFLoader loaded {len(docs_pypdf)} pages in {end - start:.2f} seconds")

In [None]:
# --- Preview the PDF contents ---
# Pages are indexed starting from 0

page_to_print = 12  # Change this to the page index you want
max_num_characters = 1000 # Change the max num of characters you want to print

# Now preview the chosen page:

if 0 <= page_to_print < len(docs_pypdf):
    content = docs_pypdf[page_to_print].page_content
    print(f"--- 📄 Page {page_to_print + 1} / {len(docs_pypdf)} ---\n")
    print(content[:max_num_characters])
else:
    print(f"Page index {page_to_print} is out of range (valid: 0 to {len(docs_pypdf) - 1})")

### PyMuPDF4LLM

In [None]:
# Load the PDF with PyMuPDF4LLM
start = time.time()
docs_pymupdf = pymupdf4llm.to_markdown(str(pdf_path), page_chunks=True)       # return a list of page dicts
end = time.time()

print(f"Using file: {pdf_path.name}")
print(f"🕒 PyMuPDF4LLM extracted {len(docs_pymupdf)} pages in {end - start:.2f} seconds\n")

In [None]:
# --- Preview the PDF contents ---
# Pages are indexed starting from 0

page_to_print = 12  # Change this to the page index you want
max_num_characters = 1000 # Change the max num of characters you want to print

# Now preview the chosen page:

if 0 <= page_to_print < len(docs_pymupdf):
    md = docs_pymupdf[page_to_print]["text"]
    print(f"--- 📄 Page {page_to_print + 1} / {len(docs_pymupdf)} ---\n")
    print(md[:max_num_characters])
else:
    print(f"Page index {page_to_print} is out of range (valid: 0 to {len(docs_pymupdf) - 1})")

### Docling

In [None]:
pdf_path_docling = SAMPLE_PDF_DIR/"2312.10997_page13.pdf"  # Just pick one page for testing

# Load the PDF with Docling
start = time.time()
loader_docling = DoclingLoader(str(pdf_path_docling), export_type=ExportType.MARKDOWN)
docs_docling = loader_docling.load()
end = time.time()

print(f"Using file: {pdf_path_docling.name}")
print(f"🕒 Docling loaded {len(docs_docling)} document(s) in {end - start:.2f} seconds")

In [None]:
# --- Preview the PDF contents ---

# Print the full extracted text
for idx, doc in enumerate(docs_docling):
    print(f"\n--- 📄 PDF Document: {pdf_path_docling.name} ---\n")
    print(doc.page_content)

## 3. Construct the vectorstore

In this step, we take the PDF documents and transform them into a searchable vector database.


In [None]:
# 1. Create a folder to store the vector index
VECTORSTORES_DIR = Path("data/vectorstores")
os.makedirs(VECTORSTORES_DIR, exist_ok=True)

# 2. Point to the directory containing our PDFs
PDF_FOLDER = Path("data/PDFs")

# 3. Use PyPDFDirectoryLoader to load every PDF page as a Document
loader = PyPDFDirectoryLoader(PDF_FOLDER)
documents = loader.load()

# 4. Verify how many pages are loaded
print(f"Loaded {len(documents)} PDF pages")

### ✂️ Split Documents into Chunks

We break documents into smaller overlapping chunks using `RecursiveCharacterTextSplitter`.

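- `chunk_size`: The maximum number of characters per chunk.

- `chunk_overlap`: The number of characters shared by consecutive chunks, so that context straddling a chunk boundary is not lost.

This is crucial for preserving semantic meaning across sentences and paragraphs that would otherwise be cut in two.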
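
To make the two parameters concrete, here is a deliberately simplified sliding-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries to break on separators such as paragraphs and sentences); `sliding_chunks` is a hypothetical helper that only shows how size and overlap interact.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is what `chunk_overlap=2` asks for.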

In [None]:
# Set chunk size (how many characters per chunk) and overlap
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

# Split the loaded PDFs into smaller, overlapping chunks
all_splits = text_splitter.split_documents(documents)

print(f"✅ Split into {len(all_splits)} chunks")

### 🔍 Convert Text Chunks to Embeddings

We now convert each text chunk into a high-dimensional vector using the BGE model (`BAAI/bge-large-en-v1.5`). These vectors capture the semantic meaning of the text.

- We use `HuggingFaceBgeEmbeddings` from LangChain.

- We normalize the embeddings to unit length, so that distance-based search ranks chunks the same way cosine similarity would.

- We set the device to "cpu" for compatibility with Colab. (If you're running this on a local machine with GPU, you can switch "cpu" to "cuda" for better performance.)
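
The effect of normalization can be checked by hand on toy vectors. The sketch below uses made-up 2-D vectors and hypothetical helper names; it shows that, once vectors are scaled to unit length, their plain dot product equals their cosine similarity.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v so that its Euclidean length is 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

print(f"cosine similarity:            {cosine(a, b):.4f}")
print(f"dot product after normalizing: {dot(normalize(a), normalize(b)):.4f}")
```

Both lines print the same value, which is why normalized embeddings let a simple dot-product or distance search behave like a cosine-similarity search.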

In [None]:
# Define the embedding model — BGE is a strong open-source embedding model for English
EMBEDDING_MODEL_NAME = "BAAI/bge-large-en-v1.5"

embedding_model = HuggingFaceBgeEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": "cpu"},  # "cuda" if you run locally with a GPU
    encode_kwargs={"normalize_embeddings": True},
)

### 🏗️ Create and Save the Vectorstore

Using the text chunks and embeddings, we build our vectorstore:

- FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search.

- This index will let us retrieve the most relevant chunks given a user question.

We also save the vectorstore locally so that it can be reused later without recomputing everything.
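
Conceptually, retrieval means "score every stored vector against the query vector and keep the best `k`". The brute-force sketch below illustrates that idea with made-up 2-D vectors and a hypothetical `top_k` helper. It scores by dot product for simplicity; a real FAISS index may use a different metric and far more efficient data structures.

```python
from typing import List, Sequence, Tuple


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors with the highest dot product."""
    scores = [
        (i, sum(q * x for q, x in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy "chunks" with invented embeddings (first axis ~ cats, second ~ dogs)
chunks = ["cats purr", "dogs bark", "kittens meow"]
vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
query = [1.0, 0.0]  # a query about cats

hits = top_k(query, vectors, k=2)
for index, score in hits:
    print(f"{score:.2f}  {chunks[index]}")
```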

In [None]:
# Create a FAISS index from the text chunks and their embeddings
vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding_model)

# Save the vectorstore locally for reuse
vectorstore.save_local(VECTORSTORES_DIR)

print("✅ Vectorstore created and saved successfully.")

### 💾 Reload the Vectorstore (Optional)

In [None]:
# You can reload the saved vectorstore anytime without recomputing everything
vectorstore = FAISS.load_local(
    VECTORSTORES_DIR,
    embedding_model,
    allow_dangerous_deserialization=True  # The index metadata is pickled; only load files you created yourself
)

print("✅ Vectorstore reloaded successfully.")

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 🧩 Final Step: Connect Retriever + LLM to Answer Questions

Now that we have a vectorstore built from our documents and a local LLM (like Mistral) running, we're ready to complete the RAG pipeline. This means combining document retrieval with LLM-based answering.


### 📝 Define a Custom Prompt Template
We create a custom prompt that instructs the model to:

- Use only the retrieved document chunks as context.

- Avoid hallucinating or inventing answers.

- Respond concisely (max 3 sentences).

- Stick to English for consistency.

In [None]:
# Define the prompt template to guide the LLM behavior

rag_prompt = """
Use the following pieces of context to answer the question at the end.
If the context does not contain the answer, say that you don't know instead of making one up.
Use three sentences maximum and keep the answer as concise as possible.
You must answer in English.

Context:
{context}

Question:
{input}

Answer:
"""

# Wrap the prompt string in a LangChain PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["context", "input"],
    template=rag_prompt,
)

### 🔍 Build the Retrieval-Augmented Generation (RAG) Chain

Now that we have our vectorstore populated with document chunks and embeddings, we can wire everything together into a Retrieval-Augmented Generation (RAG) pipeline.

This RAG system uses a "stuffing" strategy, where all retrieved documents are concatenated into a single prompt before being passed to the language model.

#### 🔗 Key Components:

- **`vectorstore.as_retriever(search_kwargs={"k": 8})`**  
  Converts the FAISS vectorstore into a retriever object that finds the most semantically relevant document chunks for a user query.  
  We set `k = 8` to retrieve the 8 most relevant chunks.

- **`create_stuff_documents_chain(...)`**  
  Creates a LangChain chain that stuffs multiple documents into a prompt template and sends it to the LLM.  
  This strategy is effective when the total input size remains within the LLM’s context window.

- **`create_retrieval_chain(...)`**  
  Wraps the retriever and the document chain into a full end-to-end pipeline:
  1. A user query is passed to the retriever.
  2. The retriever returns a list of relevant text chunks.
  3. These chunks are inserted into a prompt template along with the original question.
  4. The LLM generates an answer, which the prompt instructs it to base only on the provided context.

This architecture gives you a fully functional, local, open-source LLM-based assistant that can answer domain-specific questions using real documents.
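
The "stuffing" step itself is simple string assembly. The sketch below imitates it without LangChain: retrieved chunks are joined into one context string and substituted into a template with `{context}` and `{input}` placeholders. `stuff_prompt`, the shortened template, and the blank-line separator are assumptions made for illustration, not the library's internals.

```python
from typing import List

# Shortened template with the same two placeholders as the RAG prompt
RAG_TEMPLATE = "Context:\n{context}\n\nQuestion:\n{input}\n\nAnswer:\n"


def stuff_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Concatenate all chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return RAG_TEMPLATE.format(context=context, input=question)


retrieved = [
    "The workshop covers retrieval-augmented generation.",
    "Participants run a local model with Ollama.",
]
final_prompt = stuff_prompt(retrieved, "What does the workshop cover?")
print(final_prompt)
```

Because every chunk ends up in a single prompt, the total length grows with `k` and with `chunk_size`, which is why both need to fit the model's context window.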




In [None]:
# Define the LLM

llm = OllamaLLM(
    model="mistral",
    temperature=0.1,
    stop=["<end_of_turn>"],
)

# Build the LLM question-answering chain

question_answer_chain = create_stuff_documents_chain(
    llm=llm,
    prompt=prompt_template,
)

# Configure the retriever

NB_RETRIEVED_CHUNKS = 8
retriever = vectorstore.as_retriever(
    search_kwargs={"k": NB_RETRIEVED_CHUNKS}
)

# Combine the retriever + LLM chain into one Retrieval-Augmented Generation (RAG) pipeline

rag_chain = create_retrieval_chain(
    retriever=retriever,
    combine_docs_chain=question_answer_chain
)

The temperature parameter in a language model (LLM) controls the randomness of the model's output.

- A lower temperature value (closer to 0) makes the model more deterministic, favoring higher probability words and resulting in more predictable and repetitive text.

- A higher temperature value (closer to 1) increases randomness, allowing for more creative and diverse responses by giving less probable words a better chance of being chosen.

Adjusting the temperature helps balance between coherence and creativity in the generated text.
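
A common way to implement temperature is to divide the model's raw scores (logits) by the temperature before applying softmax. The sketch below demonstrates this on three made-up logits; the function name and the numbers are illustrative, and the exact sampling details inside Ollama may differ.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.1)
warm = softmax_with_temperature(logits, temperature=1.0)

print("T=0.1:", [round(p, 3) for p in cold])
print("T=1.0:", [round(p, 3) for p in warm])
```

At the low temperature almost all probability mass lands on the top-scoring token, while at the higher temperature the alternatives keep a noticeable share.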

### Chatting with our RAG chain

In [None]:
query = "Where is this workshop?"
result = rag_chain.invoke({"input": query})

# ✅ Print the generated answer
console.print(Markdown(result["answer"]))

In [None]:
# Retrieve the top relevant chunks
retrieved_docs = retriever.invoke(query)  # get_relevant_documents() is deprecated

# Print the retrieved chunks
print(f"\n🔍 Top {NB_RETRIEVED_CHUNKS} Retrieved Chunks:\n")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"--- Chunk #{i} ---")
    print(doc.page_content[:500].strip(), "\n")  # print first 500 characters of each chunk


Display the full result object, which also includes the retrieved context documents:

In [None]:
console.print(result)