# Retrieval augmented generation (RAG)

## Loading Documents
The first step in RAG is to load documents. You need a loader that supports the document type you are interested in. In this example we use LangChain, because it includes a collection of 60+ document loaders for many document types and formats.

The first example uses the `PyPDFLoader` class. PDF support is straightforward: a single call is enough.

In [1]:
# For this loading Documents part, you may need these packages installed

%pip install langchain
%pip install -U langchain-community

Note: you may need to restart the kernel to use updated packages.


In [2]:
import warnings # optional, disabling warnings about versions and others
warnings.filterwarnings('ignore') # optional, disabling warnings about versions and others

%pip install pypdf 

from langchain_community.document_loaders import PyPDFLoader  # loaders live in langchain-community (installed above)
loader = PyPDFLoader("docs/War-of-the-Worlds.pdf")
book = loader.load()

Note: you may need to restart the kernel to use updated packages.


In [3]:
# How long is the document we loaded?
len(book)

128

In [4]:
#Looking at a small extract, one page, and a few hundred characters in that page
page = book[5]
print(page.page_content[0:500])

darkness were Ottershaw and Chertsey and a ll their hundreds of people, sleeping in 
peace.  
   He was full of speculation that night a bout the condition of Mars, and scoffed at the 
vulgar idea of its having in- habitants w ho were signalling us. His idea was that 
meteorites might be falling in a heavy shower upon the planet, or that a huge volcanic 
explosion was in progress. He pointed out to me how unlikely it was that organic 
evolution had taken the same direction in the two adjacent pl


In [5]:
#Which page is it, from which document?
page.metadata

{'producer': 'PDFill: Free PDF Writer and Tools',
 'creator': 'PyPDF',
 'creationdate': '2011-08-24T10:49:19-04:00',
 'moddate': '2011-08-24T10:49:19-04:00',
 'source': 'docs/War-of-the-Worlds.pdf',
 'total_pages': 128,
 'page': 5,
 'page_label': '6'}
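
To make the shape of the loader's result concrete: `book` is a list with one entry per page, and each entry carries a `page_content` string and a `metadata` dictionary. The sketch below mimics that shape for plain text. `SimpleDoc`, `load_text_pages`, and the form-feed page delimiter are assumptions made for illustration; this is not how `PyPDFLoader` is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for the document objects a loader returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_pages(text, source):
    """Split raw text on form-feed characters and wrap each page with metadata."""
    pages = text.split("\f")
    return [
        SimpleDoc(
            page_content=page,
            metadata={"source": source, "page": i, "total_pages": len(pages)},
        )
        for i, page in enumerate(pages)
    ]


sample = "First page text.\fSecond page text.\fThird page text."
docs = load_text_pages(sample, source="docs/sample.txt")

print(len(docs))             # one entry per page
print(docs[1].page_content)  # text of the second page
print(docs[1].metadata)      # where that text came from
```

Keeping the source and page number next to the text is what later lets you tell which document a retrieved chunk came from.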

The second example uses a YouTube video. There is a little more work here. The yt_dlp library needs options that specify which audio format to download (we do not need the video part). Here we use m4a at 192 kbps. yt_dlp then relies on the ffmpeg and ffprobe programs to extract the audio track. Finally, we use the OpenAI Whisper library to convert the audio into text (speech-to-text).

In [6]:
%pip install --upgrade --no-deps --force-reinstall yt_dlp
%pip install pydub
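# Note: yt_dlp needs the ffmpeg and ffprobe executables installed on the system (see 'ffmpeg_location' below).
# The two pip packages below are Python packages and, as far as we know, do not provide those executables.
%pip install ffmpeg
%pip install ffprobe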
%pip install torch
%pip install tiktoken
%pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

import os
import whisper
from yt_dlp import YoutubeDL

# Step 1: Set up the download options
url = "https://www.youtube.com/watch?v=2vkJ7v0x-Fs"
save_dir = "docs/youtube/"
output_template = os.path.join(save_dir, '%(title)s.%(ext)s')

ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': output_template,  # Save the file to the specified directory with a title-based name
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'm4a',  # You can change this to mp3 if you prefer
        'preferredquality': '192',
    }],
    'ffmpeg_location': '/usr/bin/ffmpeg',  # Specify the location of ffmpeg
}


# Step 2: Download the audio from the YouTube video
with YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])

# Step 3: Find the downloaded file
downloaded_file = [f for f in os.listdir(save_dir) if f.endswith('.m4a')][0]  # Assuming m4a, adjust if using mp3
downloaded_file_path = os.path.join(save_dir, downloaded_file)

# Step 4: Load the Whisper model
model = whisper.load_model("base")  # You can choose 'tiny', 'base', 'small', 'medium', or 'large'

# Step 5: Transcribe the audio file
result = model.transcribe(downloaded_file_path)


Collecting yt_dlp
  Using cached yt_dlp-2025.9.5-py3-none-any.whl.metadata (177 kB)
Using cached yt_dlp-2025.9.5-py3-none-any.whl (3.3 MB)
Installing collected packages: yt_dlp
  Attempting uninstall: yt_dlp
    Found existing installation: yt-dlp 2025.9.5
    Uninstalling yt-dlp-2025.9.5:
      Successfully uninstalled yt-dlp-2025.9.5
Successfully installed yt_dlp-2025.9.5
Note: you may need to restart the kernel to use updated packages.
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-akisouys
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /t
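
Step 3 above takes the first `.m4a` file that `os.listdir` returns, which may not be the file just downloaded if `docs/youtube/` already holds other audio files. A minimal sketch of a more careful lookup, assuming a hypothetical helper named `newest_file` that picks the most recently modified file with a given suffix (the demonstration runs in a temporary directory with empty placeholder files):

```python
import os
import tempfile
from pathlib import Path


def newest_file(directory, suffix):
    """Return the most recently modified file with the given suffix, or None."""
    candidates = [
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix == suffix
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demonstration in a throwaway directory with two placeholder audio files
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old.m4a"
    new = Path(tmp) / "new.m4a"
    old.write_bytes(b"")
    new.write_bytes(b"")
    os.utime(old, (1_000_000, 1_000_000))  # force an older modification time
    os.utime(new, (2_000_000, 2_000_000))  # force a newer modification time
    picked_name = newest_file(tmp, ".m4a").name
    missing = newest_file(tmp, ".mp3")

print(picked_name)
```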

In [7]:
# Adding metadata to the transcript, and saving the transcript to a file so we can use it outside of this program.
class Document:
    def __init__(self, source, text, metadata=None):
        self.source = source
        self.page_content = text
        self.metadata = metadata or {}

# Wrap the transcription result in the Document class with metadata
document = Document(
    source=downloaded_file_path,
    text=result['text'], 
    metadata={"source": "youtube", "file_path": downloaded_file_path}
)
#Save the transcript to a text file
transcript_file_path = os.path.join(save_dir, 'transcript.txt')
with open(transcript_file_path, 'w') as f:
    f.write(result['text'])

print(f"Transcript saved to {transcript_file_path}")


Transcript saved to docs/youtube/transcript.txt


In [8]:
# how many characters in this transcript file?
len(document.page_content)

32850

In [9]:
# Print the first 500 characters of the transcript
print(document.page_content[:500])


 In lesson four, we will go deeper into architectures for big data, and we will take a closer look at some of the most popular big data management systems. First, we're going to look at how the big data management system framework looks, and explore the commonalities that pretty much all the big data systems have, as well as some of the key differences between no SQL, MPP, and Hadoop. Next, we're going to take a deep dive into the Hadoop data management system. You will see how we both store dat


## Splitting our documents into chunks
The second step is to split our documents (a 128-page book and a 32K-character transcript file) into smaller chunks. We use LangChain libraries here again.

In [10]:
# We will use the most important library, recursive character splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [11]:
# Chunks have a length in characters and an overlap value. The values below are tiny for demonstration (in real life, you are probably closer to 500 to 1000 and 50 to 100 respectively):
rsplit = RecursiveCharacterTextSplitter(
    chunk_size=20,
    chunk_overlap=5,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning for \.
)

In [12]:
# Let's take an example string
text1 = 'abcdefghijklmnopqrstuvwxyz1234567890'

In [13]:
rsplit.split_text(text1)

['abcdefghijklmnopqrst', 'pqrstuvwxyz123456789', '567890']
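
The string above contains no newline or space, so only plain character-level cutting is left. The result can be reproduced with a simple sliding window: each chunk starts `chunk_size - chunk_overlap` characters after the previous one. The sketch below is an illustration of that idea under this assumption, not LangChain's implementation; `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


text1 = 'abcdefghijklmnopqrstuvwxyz1234567890'
window_chunks = sliding_window_chunks(text1, chunk_size=20, chunk_overlap=5)
print(window_chunks)
# ['abcdefghijklmnopqrst', 'pqrstuvwxyz123456789', '567890']
```

The last five characters of each chunk (`pqrst`, `56789`) reappear at the start of the next one: that repetition is the overlap.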

In [14]:
Hamlet = """Truly to speak, and with no addition, \
We go to gain a little patch of ground \
That hath in it no profit but the name. \
To pay five ducats, five, I would not farm it; \
Nor will it yield to Norway or the Pole \
A ranker rate, should it be sold in fee."""

In [15]:
rsplit.split_text(Hamlet)

['Truly to speak, and',
 'and with no',
 'no addition, We go',
 'go to gain a little',
 'patch of ground',
 'That hath in it no',
 'no profit but the',
 'the name. To pay',
 'pay five ducats,',
 'five, I would not',
 'not farm it; Nor',
 'Nor will it yield',
 'to Norway or the',
 'the Pole A ranker',
 'rate, should it be',
 'be sold in fee.']
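
The word "recursive" refers to the list of separators: the splitter tries the coarsest separator first and falls back to finer ones only for pieces that are still too long. The sketch below shows that idea in a simplified form, assuming a hypothetical `recursive_split` function with no overlap handling; it will not reproduce the exact chunks printed above.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator present, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Find the first (coarsest) separator that occurs in the text
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = text.split(sep)
            remaining = separators[i + 1:]
            break
    else:
        # No separator applies: fall back to a hard cut every chunk_size characters
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Truly to speak, and with no addition, We go to gain a little patch of ground"
parts = recursive_split(sample, chunk_size=20)
for part in parts:
    print(repr(part), len(part))
```

Every chunk ends on a word boundary and stays within 20 characters, which is why most chunks are shorter than the limit.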

In [16]:
# Let's go for a more realistic chunk size
rsplit = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning for \.
)

In [17]:
# Looking at the files, first the pdf
rdoc1 = rsplit.split_documents(book)

In [18]:
len(rdoc1)

952

In [19]:
# the split version has more documents than the original PDF source has pages
len(book)

128

In [20]:
#Printing a few splits
for i, doc in enumerate(rdoc1[30:33]):  # Adjust the slice to print more or fewer splits
    print(f"--- Split {i + 1} ---")
    print(doc.page_content)
    print()  # Print an empty line for better readability


--- Split 1 ---
but that was simply that my eye was tired. Forty millions of miles it was from us--more 
than forty millions of miles of void. Few people realise the im- mensity of vacancy in 
which the dust of the material universe swims.  
   Near it in the field, I re member, were three faint points of  light, three telescopic stars 
infinitely remote, and all around it was th e unfathomable darkness of empty space. You

--- Split 2 ---
infinitely remote, and all around it was th e unfathomable darkness of empty space. You 
know how that blackness looks on a frosty st arlight night. In a tele- scope it seems far 
profounder. And invisible to me because it wa s so remote and small, flying swiftly and 
steadily towards me across that incredible di stance, drawing nearer every min- ute by so 
many thousands of miles, came the Thing they  were sending us, the Thing that was to

--- Split 3 ---
many thousands of miles, came the Thing they  were sending us, the Thing that was to 
bring so

In [21]:
# Splitting the transcript of the audio file
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

# Step 1: Load the transcript text
transcript_file_path = "docs/youtube/transcript.txt"
with open(transcript_file_path, 'r') as f:
    transcript_text = f.read()

# Step 2: Create a Document object
document = Document(page_content=transcript_text)

# Step 3: Split the transcript into chunks
rdoc2 = rsplit.split_documents([document])

# Step 4: Manually assign the metadata to each split
save_dir = "docs/youtube/"
downloaded_file = [f for f in os.listdir(save_dir) if f.endswith('.m4a')][0]  # Assuming m4a, adjust if using mp3
downloaded_file_path = os.path.join(save_dir, downloaded_file)
for doc in rdoc2:
    doc.metadata = {"source": "youtube", "file_path": downloaded_file_path}


# Step 5: Print the first few splits
for i, doc in enumerate(rdoc2[30:33]):  # Adjust the slice to print more or fewer splits
    print(f"--- Split {i + 1} ---")
    print(doc.page_content)
    print()  # Print an empty line for better readability



--- Split 1 ---
how we actually execute analytics jobs on that data that's sitting in HDFS. So on the master node we have a new function, a new demon called the job tracker, and on the slave nodes we have a new one called the task tracker. Now let's say we have an application job that needs to communicate and analyze some data set that's sitting on the slave nodes down below. So the application job executes a Java command on the API, communicating with the name node, and then it tries to communicate down to

--- Split 2 ---
Java command on the API, communicating with the name node, and then it tries to communicate down to the task trackers below. Now one of the big differences between big data architectures and traditional data processing is that we don't try to bring all the data to one place and analyze it. What we do is we send the processing job down to the data and distribute it. You can think of it like having a lot of minions doing the work for you. One analogy might be if you h

In [22]:
# Checking the metadata

# Viewing metadata of the first few splits from rdoc1 (the pdf text)
print("Metadata for rdoc1:")
for i, doc in enumerate(rdoc1[:3]):  # Adjust the number to view more or fewer splits
    print(f"--- Metadata for Split {i + 1} ---")
    print(doc.metadata)  # Print the metadata
    print()  # Print an empty line for better readability

# Viewing metadata of the first few splits from rdoc2 (the video transcript)
print("Metadata for rdoc2:")
for i, doc in enumerate(rdoc2[:3]):  # Adjust the number to view more or fewer splits
    print(f"--- Metadata for Split {i + 1} ---")
    print(doc.metadata)  # Print the metadata
    print()  # Print an empty line for better readability


Metadata for rdoc1:
--- Metadata for Split 1 ---
{'producer': 'PDFill: Free PDF Writer and Tools', 'creator': 'PyPDF', 'creationdate': '2011-08-24T10:49:19-04:00', 'moddate': '2011-08-24T10:49:19-04:00', 'source': 'docs/War-of-the-Worlds.pdf', 'total_pages': 128, 'page': 1, 'page_label': '2'}

--- Metadata for Split 2 ---
{'producer': 'PDFill: Free PDF Writer and Tools', 'creator': 'PyPDF', 'creationdate': '2011-08-24T10:49:19-04:00', 'moddate': '2011-08-24T10:49:19-04:00', 'source': 'docs/War-of-the-Worlds.pdf', 'total_pages': 128, 'page': 1, 'page_label': '2'}

--- Metadata for Split 3 ---
{'producer': 'PDFill: Free PDF Writer and Tools', 'creator': 'PyPDF', 'creationdate': '2011-08-24T10:49:19-04:00', 'moddate': '2011-08-24T10:49:19-04:00', 'source': 'docs/War-of-the-Worlds.pdf', 'total_pages': 128, 'page': 1, 'page_label': '2'}

Metadata for rdoc2:
--- Metadata for Split 1 ---
{'source': 'youtube', 'file_path': 'docs/youtube/Big Data Architectures.m4a'}

--- Metadata for Split 2 --

Recursive character splitting is a very common technique. But if you use an LLM that severely limits the number of input tokens (or charges you by the token), you may want to split based on tokens instead of character sequences. This is how to do it.

In [23]:
from langchain.text_splitter import TokenTextSplitter

In [24]:
# Let's define a very small chunk and no overlap, so you can see what a chunk looks like with this method
token_split = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [26]:
print(token_split.split_text(Hamlet))
print(len(token_split.split_text(Hamlet)))

['T', 'ruly', ' to', ' speak', ',', ' and', ' with', ' no', ' addition', ',', ' We', ' go', ' to', ' gain', ' a', ' little', ' patch', ' of', ' ground', ' That', ' hath', ' in', ' it', ' no', ' profit', ' but', ' the', ' name', '.', ' To', ' pay', ' five', ' d', 'uc', 'ats', ',', ' five', ',', ' I', ' would', ' not', ' farm', ' it', ';', ' Nor', ' will', ' it', ' yield', ' to', ' Norway', ' or', ' the', ' Pole', ' A', ' rank', 'er', ' rate', ',', ' should', ' it', ' be', ' sold', ' in', ' fee', '.']
65
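
With a realistic chunk size, token-based splitting groups a fixed number of tokens per chunk instead of a fixed number of characters. The sketch below shows the grouping step only. It assumes a toy regex tokenizer (`toy_tokenize`, a hypothetical helper) that treats each word or punctuation mark as one token; a real tokenizer splits differently, as the `'T', 'ruly'` and `' d', 'uc', 'ats'` pieces above show.

```python
import re


def toy_tokenize(text):
    """Toy tokenizer: each word or punctuation mark, keeping its leading spaces."""
    return re.findall(r"\s*(?:\w+|[^\w\s])", text)


def token_chunks(text, chunk_size, chunk_overlap=0):
    """Group tokens into chunks of chunk_size tokens, sharing chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        # Stop once the current chunk reaches the last token
        if start + chunk_size >= len(tokens):
            break
    return chunks


line = "To pay five ducats, five, I would not farm it;"
tokens = toy_tokenize(line)
chunks = token_chunks(line, chunk_size=4)

print(len(tokens), "tokens")
print(chunks)
```

Because every chunk holds at most `chunk_size` tokens, the size of each chunk is predictable in the unit the LLM actually counts.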


## Storing in Vector Store
The third step is to store your splits in a vector database. There are dozens of solutions. Very popular solutions for local storage include MongoDB, Chroma, Weaviate, and Milvus. All large cloud vendors (Azure, AWS, etc.) also offer a hosted vector database. Here we use Chroma, a popular, flexible choice that stores its data locally.

Before storing our data in the vector database, we need to convert the text strings into vectors (embeddings). The embedding model first tokenizes the text with a tokenizer compatible with the BERT model, then embeds it (converts it to vectors). The code below uses the `nomic-embed-text` model served by Ollama.

In [None]:
# Create Ollama embeddings and vector store
%pip install chromadb

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
all_splits = rdoc1 + rdoc2
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=all_splits, embedding=embeddings)

What do these vectors look like? Let's play with a few examples.

In [None]:
text1 = "i like hotdogs"
text2 = "i like sandwiches"
text3 = "this is a large building"

In [None]:
embedding1 = embeddings.embed_query(text1)
embedding2 = embeddings.embed_query(text2)
embedding3 = embeddings.embed_query(text3)

In [None]:
# looking at the first values of the first embedding
print("embedding1 includes", len(embedding1), "values")
print("First few values:", embedding1[:10])

How close are these vectors to one another? There are many ways to compare them; here we use cosine similarity.

In [None]:
import numpy as np
from numpy import dot
from numpy.linalg import norm
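# Step 1: creating the normalized (unit-length) vectors. The dot product of two normalized vectors is their
# cosine similarity, a value between -1 and 1. (Shown for illustration: cosine_similarity below normalizes internally.)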

norm_a = np.linalg.norm(embedding1)
norm_b = np.linalg.norm(embedding2)
norm_c = np.linalg.norm(embedding3)
normalized_a = embedding1 / norm_a
normalized_b = embedding2 / norm_b
normalized_c = embedding3 / norm_c

#Step 2: comparing text1 and text 2 embeddings, then text1 and text 3 embeddings:

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

similarity_1_2 = cosine_similarity(embedding1, embedding2)
similarity_1_3 = cosine_similarity(embedding1, embedding3)

print("Similarity (with cos similarity) between sentence 1 and 2:", similarity_1_2)
print("Similarity (with cos similarity) between sentence 1 and 3:", similarity_1_3)
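
To see what the formula does on numbers small enough to check by hand, here is a minimal sketch in plain Python. The two-dimensional vectors and the function name `cosine_similarity_py` are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity_py(a, b):
    """Cosine of the angle between a and b: dot product over product of lengths."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


east = [1.0, 0.0]
north_east = [1.0, 1.0]
north = [0.0, 1.0]
west = [-1.0, 0.0]

print(cosine_similarity_py(east, north_east))  # 45 degrees apart: about 0.707
print(cosine_similarity_py(east, north))       # perpendicular: 0.0
print(cosine_similarity_py(east, west))        # opposite directions: -1.0
print(cosine_similarity_py(east, [5.0, 0.0]))  # same direction, other length: 1.0
```

Only the direction of the vectors matters, not their length, which is why scaling a vector leaves the similarity unchanged.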

Now that we have embeddings, let's store them into a Chroma database.

In [None]:
%pip install --upgrade langchain chromadb
from langchain_community.vectorstores import Chroma  # same import path as in the earlier cell

# Set the environment variable to disable tokenizers parallelism and avoid warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Let's define a directory where we'll store the database beyond this notebook execution (and let's make sure it is empty, as I run this notebook often :))
persist_directory = 'docs/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any

In [None]:
vectordb = Chroma.from_documents(
    documents=all_splits,
    embedding=embeddings,
    persist_directory=persist_directory
)

Now let's see if we can perform a similarity search with this database. Keep in mind that we are just comparing vectors here; there is no LLM yet to reason about the results.

In [None]:
question = "Did the spaceship come from the planet Mars?"

In [None]:
docs = vectordb.similarity_search(question,k=5)

In [None]:
len(docs)

In [None]:
docs[0].page_content
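
Conceptually, a similarity search embeds the question, scores it against every stored vector, and returns the `k` best matches. The brute-force sketch below illustrates that idea; it is an assumption about the principle, not a description of how Chroma indexes or searches internally. The three-dimensional vectors, the chunk labels, and the `top_k` helper are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot_product = sum(x * y for x, y in zip(a, b))
    return dot_product / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    )


def top_k(query_vec, indexed, k):
    """Score every (text, vector) pair against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up index: each chunk label is paired with a hand-written toy vector
toy_index = [
    ("chunk about Mars", [0.9, 0.1, 0.0]),
    ("chunk about Hadoop", [0.0, 0.2, 0.9]),
    ("chunk about cylinders", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stands in for the embedded question

results = top_k(query, toy_index, k=2)
for score, text in results:
    print(round(score, 3), text)
```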

In [None]:
# Let's save the vectordb so we can use it outside of this notebook. Note: this is only FYI, as Chroma does it automatically, but not all other vector databases do!
vectordb.persist()

## Retrieving with the LLM in action
The full process consists of asking a question, retrieving the relevant information, then passing the information and the question to the LLM.

In [None]:
# We still need the building blocks defined earlier (such as `embeddings`), so do not run this part of the notebook in isolation
persist_directory = 'docs/chroma/'
embedding = embeddings
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [None]:
print(vectordb._collection.count())

In [None]:
question = "Did the spaceship come from the planet Mars?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

In [None]:
%pip install ollama
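# Shell commands need the `!` prefix (there is no `%ollama` magic). The Ollama server itself must already be installed.
!ollama serve & ollama pull llama3 & ollama pull nomic-embed-text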

In [None]:
# Using Llama3 as the LLM, and Ollama as the wrapper to interact with Llama3. Then using a test question to validate the install.
from langchain_community.llms import Ollama
llm = Ollama(model = "llama3")
llm.invoke("Are there aliens on Mars?")

In [None]:
%pip install ollama langchain beautifulsoup4 chromadb gradio -q

In [None]:
# This is "almost" the final code. You will see the final code in the last lesson of the course
import gradio as gr
import ollama
from bs4 import BeautifulSoup as bs
from langchain_community.embeddings import OllamaEmbeddings

# Create Ollama embeddings and vector store
#embeddings = OllamaEmbeddings(model="nomic-embed-text")
#vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

# Define the function to call the Ollama Llama3 model
def ollama_llm(question, context):
    formatted_prompt = f"Question: {question}\n\nContext: {context}"
    response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': formatted_prompt}])
    return response['message']['content']

# Define the RAG setup
retriever = vectordb.as_retriever()

def rag_chain(question):
    retrieved_docs = retriever.invoke(question)
    formatted_context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    return ollama_llm(question, formatted_context)

# Define the Gradio interface
def get_important_facts(question):
    return rag_chain(question)

# Create a Gradio app interface
iface = gr.Interface(
  fn=get_important_facts,
  inputs=gr.Textbox(lines=2, placeholder="Enter your question here..."),
  outputs="text",
  title="RAG with Llama3",
  description="Ask questions about the provided context",
)

# Launch the Gradio app
iface.launch()
# example q: did the aliens eventually go on to land on Venus?
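
The `formatted_prompt` in `ollama_llm` simply concatenates the question and the retrieved text. A slightly more explicit variant, sketched below, numbers the chunks, caps the amount of context, and tells the model to rely on the context only. The helper name `build_rag_prompt`, the instruction wording, and the `max_chars` budget are assumptions for illustration, not part of the notebook's pipeline.

```python
def build_rag_prompt(question, chunks, max_chars=2000):
    """Number the chunks, keep them within a character budget, add instructions."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before the context grows past the budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


example_chunks = [
    "The cylinder had fallen on the common during the night.",
    "He scoffed at the idea of Mars having inhabitants.",
]
prompt = build_rag_prompt("Did the spaceship come from the planet Mars?", example_chunks)
short_prompt = build_rag_prompt("Did it come from Mars?", example_chunks, max_chars=60)
print(prompt)
```

The returned string could be passed as the `content` of the user message in `ollama_llm`; numbering the chunks makes it easier to ask the model which passage supports its answer.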