[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ovaccarelli/LLM-RAG/blob/main/notebooks/llm_rag_Open_Source_AI_Workshop_4.ipynb)

# 🔧 Setup

In [None]:
# Install all required Python packages for this workshop

!pip install wget langchain_ollama langchain-community faiss-cpu pypdf sentence-transformers rich

In [None]:
import os, wget
from pathlib import Path
from rich.console import Console
from rich.markdown import Markdown

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_ollama.llms import OllamaLLM
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

console = Console()

#### Install Ollama and download a pre-trained LLM
Open a terminal with `%xterm` (the cells below install and load the `colab-xterm` extension first), then run these commands in order:

- Download and run Ollama's installer (once per session): `curl -fsSL https://ollama.ai/install.sh | sh`
- Start the Ollama API server in the background: `ollama serve &`
- Download a pre-trained LLM (e.g., Mistral) for local inference: `ollama pull mistral`
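
Once the steps above have run, a quick way to confirm from Python that the server is up is to query Ollama's `/api/tags` endpoint, which lists the locally available models. The sketch below assumes the default address `http://localhost:11434`; the helper name `list_ollama_models` is hypothetical and not part of the workshop code.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat the server as unavailable
        return None
    return [model["name"] for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama server not reachable; run `ollama serve &` in the terminal first.")
else:
    print("Available models:", models)
```

If the list comes back empty, the server is running but no model has been pulled yet.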







In [None]:
# Install the colab-xterm extension

!pip install colab-xterm #https://pypi.org/project/colab-xterm/
%load_ext colabxterm

In [None]:
%xterm
# Commands to run inside the terminal that opens (not in this cell):
#   curl -fsSL https://ollama.ai/install.sh | sh
#   ollama serve &
#   ollama pull mistral

In [None]:
!ollama list

### 🌐 Download PDFs

In [None]:
# Create the "data/PDFs" folder if it doesn't exist
PDF_FOLDER = Path("data/PDFs")
os.makedirs(PDF_FOLDER, exist_ok=True)

urls = [
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/PDFs/Open_Source_AI_workshop.pdf",
]

# Download each PDF unless it is already present
for url in urls:
    name = url.split("/")[-1]
    target = PDF_FOLDER / name
    if target.is_file():
        console.print(f"{name} is already present, skipping download.", style="yellow")
    else:
        wget.download(url, out=str(target))
        console.print(f"\n{name} downloaded to {target}.", style="bold green")

### Construct the vectorstore


In [None]:
# 1. Create a folder to store the vector index
VECTORSTORES_DIR = Path("data/vectorstores")
os.makedirs(VECTORSTORES_DIR, exist_ok=True)

# 2. Point to the directory containing our PDFs
PDF_FOLDER = Path("data/PDFs")

# 3. Use PyPDFDirectoryLoader to load every PDF page as a Document
loader = PyPDFDirectoryLoader(PDF_FOLDER)
documents = loader.load()

# 4. Verify how many pages are loaded
print(f"Loaded {len(documents)} PDF pages")

# 5. Set the chunk size (maximum characters per chunk) and the overlap between consecutive chunks
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100

# 6. Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

# 7. Split the loaded PDF pages into smaller, overlapping chunks
all_splits = text_splitter.split_documents(documents)
print(f"Split into {len(all_splits)} chunks")

# 8. Define the embedding model (BGE is a strong open-source embedding model for English)
EMBEDDING_MODEL_NAME = "BAAI/bge-large-en-v1.5"

embedding_model = HuggingFaceBgeEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": "cpu"},  # "cuda" if you run locally with a GPU
    # Unit-length vectors: ranking by distance then matches ranking by cosine similarity
    encode_kwargs={"normalize_embeddings": True},
)

# 9. Create a FAISS index from the text chunks and their embeddings
vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding_model)

# 10. Save the vectorstore locally for reuse
vectorstore.save_local(str(VECTORSTORES_DIR))
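
To build intuition for what the chunk size and overlap settings do, here is a simplified fixed-width splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to cut at natural separators such as paragraphs, lines, and spaces, so its real chunks will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one. The overlap serves the same purpose in the real splitter: a sentence that straddles a chunk boundary is less likely to be lost.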

---

## 🧩 Final Step: Connect Retriever + LLM to Answer Questions

Now that we have a vectorstore built from our documents and a local LLM (like Mistral) running, we're ready to complete the RAG pipeline. This means combining document retrieval with LLM-based answering.


### 📝 Define a Custom Prompt Template
We create a custom prompt that instructs the model to:

- Use only the retrieved document chunks as context.
- Avoid hallucinating or inventing answers.
- Respond concisely (at most 3 sentences).
- Answer in English for consistency.
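
To make the placeholder mechanics concrete, the sketch below fills a template with plain `str.format`; LangChain's `PromptTemplate` uses the same curly-brace syntax by default. The instruction wording is a hypothetical illustration that mirrors the four rules listed above, not the workshop's own prompt, and the sample context and question are made up.

```python
# Hypothetical instruction wording that mirrors the four rules listed above
example_prompt = """Answer the question using only the context below.
If the context does not contain the answer, say that you don't know.
Answer in at most three sentences, in English.

Context:
{context}

Question:
{input}

Answer:
"""

# Fill both placeholders, as the RAG chain does at query time
filled = example_prompt.format(
    context="Ollama serves local models over an HTTP API.",
    input="How does Ollama expose local models?",
)
print(filled)
```

Ending the template with `Answer:` nudges the model to start its reply right after the filled-in question.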

In [None]:
# Define the prompt template to guide the LLM behavior

rag_prompt = """
...

Context:
{context}

Question:
{input}

Answer:
"""

# Wrap the prompt string in a LangChain PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["context", "input"],
    template=rag_prompt,
)

### 🔍 Build the Retrieval-Augmented Generation (RAG) Chain

With the vectorstore populated, we can wire everything together into a Retrieval-Augmented Generation (RAG) pipeline.

The pipeline uses a "stuffing" strategy: all retrieved chunks are concatenated into a single prompt before being passed to the language model.

#### 🔗 Key Components:

- **`vectorstore.as_retriever(...)`**  
  Converts the vectorstore into a retriever object that finds the most semantically relevant document chunks for a user query.  

- **`create_stuff_documents_chain(...)`**  
  Creates a LangChain chain that stuffs multiple documents into a prompt template and sends it to the LLM.  
  This strategy is effective when the total input size remains within the LLM’s context window.

- **`create_retrieval_chain(...)`**  
  Wraps the retriever and the document chain into a full end-to-end pipeline:
  1. A user query is passed to the retriever.
  2. The retriever returns a list of relevant text chunks.
  3. These chunks are inserted into a prompt template along with the original question.
  4. The LLM generates an answer from the provided context, following the instructions in the prompt.

This architecture gives you a fully functional, local, open-source LLM-based assistant that can answer domain-specific questions using real documents.
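
The four steps above can be sketched in plain Python, without LangChain. This is a toy model of the data flow, not how the library is implemented: the retriever ranks chunks by shared words instead of embedding similarity, `fake_llm` is a stand-in for a real model call, and every function name is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, k):
    """Toy retriever: rank chunks by the number of words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda chunk: len(query_words & tokenize(chunk)), reverse=True)
    return ranked[:k]


def stuff(chunks, question, template):
    """Concatenate all retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, input=question)


def fake_llm(prompt):
    """Stand-in for a real model call; it only reports the prompt size."""
    return f"(model would answer here, given a {len(prompt)}-character prompt)"


def rag_answer(question, chunks, template, k=2):
    """Retrieve, stuff, and generate, returning the same keys as the real chain."""
    context_chunks = retrieve(question, chunks, k)
    prompt = stuff(context_chunks, question, template)
    return {"input": question, "context": context_chunks, "answer": fake_llm(prompt)}


chunks = [
    "FAISS stores embedding vectors and supports nearest-neighbour search.",
    "Ollama runs large language models locally.",
    "Text splitters cut documents into overlapping chunks.",
]
template = "Context:\n{context}\n\nQuestion:\n{input}\n\nAnswer:\n"
result = rag_answer("Which tool runs language models locally?", chunks, template, k=1)
print(result["context"])
print(result["answer"])
```

In the real chain the `context` entry holds `Document` objects rather than strings. Because stuffing puts every retrieved chunk into one prompt, the number of chunks times the chunk size has to stay within the model's context window.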




In [None]:
# Define the LLM

llm = OllamaLLM(
    model="...",
    temperature=...,
    stop=["<end_of_turn>"],  # Gemma-style end-of-turn marker; adjust or drop it for other models
)

# Build the LLM question-answering chain

question_answer_chain = create_stuff_documents_chain(
    llm=llm,
    prompt=prompt_template,
)

# Configure the retriever

NB_RETRIEVED_CHUNKS = ...
retriever = vectorstore.as_retriever(
    search_kwargs={"k": NB_RETRIEVED_CHUNKS}
)

# Combine the retriever + LLM chain into one Retrieval-Augmented Generation (RAG) pipeline

rag_chain = create_retrieval_chain(
    retriever=retriever,
    combine_docs_chain=question_answer_chain
)

The `temperature` parameter of a large language model (LLM) controls how random its output is.

- A lower temperature (closer to 0) makes the model more deterministic: it favors the highest-probability tokens, which gives more predictable and sometimes repetitive text.

- A higher temperature (around 1 or above) increases randomness: less probable tokens get a better chance of being chosen, which gives more diverse and creative responses.

Adjusting the temperature trades off coherence against creativity in the generated text.
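
A small numerical sketch shows the effect. It uses the standard formulation, in which the model's raw scores (logits) are divided by the temperature before the softmax turns them into probabilities. The logits below are hypothetical values for three candidate tokens, and the function name is illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually treated as greedy decoding")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lowest temperature nearly all the probability mass sits on the top-scoring token. At the highest, the three probabilities move closer together, so sampling picks the less likely tokens more often.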

### Chatting with our RAG chain

In [None]:
query = "..."
result = rag_chain.invoke({"input": query})

# ✅ Print the generated answer
console.print(Markdown(result["answer"]))

In [None]:
# Retrieve the top relevant chunks
retrieved_docs = retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()

# Print the retrieved chunks
print(f"\n🔍 Top {NB_RETRIEVED_CHUNKS} Retrieved Chunks:\n")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"--- Chunk #{i} ---")
    print(doc.page_content[:500].strip(), "\n")  # print first 500 characters of each chunk


Display the full result object: a dictionary containing the `input`, the retrieved `context` documents, and the generated `answer`.

In [None]:
console.print(result)