Before building the RAG system, we check which device is available: GPU or CPU. If PyTorch detects a GPU, we use it for faster processing; otherwise, we fall back to the CPU. The `device` string is passed to the embedding model later, and `pipeline_device` is the matching index for the Hugging Face pipeline (`0` for the first GPU, `-1` for the CPU).

In [1]:
import torch

if torch.cuda.is_available():
    device = "cuda"
    pipeline_device = 0
    print(f"CUDA is available. Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = "cpu"
    pipeline_device = -1
    print("CUDA not available. Using CPU.")

CUDA is available. Using GPU: AMD Radeon RX 6750 XT
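
The same check can be wrapped in a small helper so that later cells get both values from one call. This is only a sketch, not part of the original notebook: the name `select_device` is hypothetical, and the CPU fallback when PyTorch is missing is an assumption added for illustration.

```python
def select_device():
    """Return (device_name, pipeline_index) for the available hardware.

    Hypothetical helper: ("cuda", 0) when PyTorch sees a GPU,
    otherwise ("cpu", -1), the index Hugging Face pipelines use for CPU.
    """
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed we can only use the CPU.
        return "cpu", -1

    if torch.cuda.is_available():
        return "cuda", 0
    return "cpu", -1


device, pipeline_device = select_device()
print(device, pipeline_device)
```

ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` API, which is why the output above can report an AMD card as a CUDA device.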


# **Load data**

**First, we need to load the dataset. I chose a small Wikipedia-derived dataset: "rahular/simple-wikipedia" from the Hugging Face Hub.**

This dataset has English articles written in a simpler way, which makes it easier to work with. It’s a good choice for testing RAG because it’s small and doesn’t take much time or computing power to process.

In [2]:
from datasets import load_dataset

dataset = load_dataset("rahular/simple-wikipedia")

### **Key Information About Our Dataset**

In [3]:
print(f"Dataset type: {type(dataset)}")
print(f"Available splits: {list(dataset.keys())}")

Dataset type: <class 'datasets.dataset_dict.DatasetDict'>
Available splits: ['train']


In [4]:
print(f"Dataset length: {len(dataset['train'])}")

Dataset length: 769764


In [5]:
print(f"Dataset features: {dataset['train'].features}")

Dataset features: {'text': Value('string')}


In [6]:
print(f"Sample record (index 558): {dataset['train'][558]}")

Sample record (index 558): {'text': 'Plants are also multicellular eukaryotic organisms, but live by using light, water and basic elements to make their tissues.'}


In [7]:
empty = [i for i in range(len(dataset["train"])) if not dataset["train"][i]["text"].strip()]
print(f"Empty records: {len(empty)}")

Empty records: 0


In [8]:
import pandas as pd
df = pd.DataFrame(dataset["train"][:5])
print("Sample data as DataFrame:")
df

Sample data as DataFrame:


|   | text |
|---|------|
| 0 | April |
| 1 | April is the fourth month of the year, and com... |
| 2 | April always begins on the same day of week as... |
| 3 | April's flowers are the Sweet Pea and Daisy. I... |
| 4 | April comes between March and May, making it t... |


# **Document Creation**

In this step, I turn each record from the dataset into a Document object. Each one stores the text plus some metadata: the source name, the record's original index, and the text length in characters. This format makes it easier to add the documents to a vector store later and use them in the RAG pipeline.

In [9]:
# Document lives in langchain_core; older releases re-exported it from langchain.schema
from langchain_core.documents import Document as LangChainDocument

langchain_documents = []
for i, record in enumerate(dataset["train"]):
    text_content = record["text"]

    # Skip empty or whitespace-only records
    if not text_content.strip():
        continue

    # Keep provenance so a retrieved passage can be traced back to the dataset
    metadata = {
        "source": "simple_wikipedia",
        "original_index": i,
        "text_length": len(text_content),
    }

    doc = LangChainDocument(page_content=text_content, metadata=metadata)
    langchain_documents.append(doc)

print(f"Created {len(langchain_documents)} LangChain documents")

Created 769764 LangChain documents
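
To make the shape of these objects concrete without installing LangChain, here is a minimal sketch using a plain dataclass. `SimpleDocument` and `records_to_documents` are hypothetical stand-ins, not LangChain APIs, and the sample records are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def records_to_documents(records):
    """Convert {'text': ...} records to documents, skipping blank ones."""
    documents = []
    for index, record in enumerate(records):
        text = record["text"]
        if not text.strip():
            continue  # skip empty or whitespace-only records
        documents.append(
            SimpleDocument(
                page_content=text,
                metadata={
                    "source": "simple_wikipedia",
                    "original_index": index,
                    "text_length": len(text),
                },
            )
        )
    return documents


# Made-up sample records; the second one is blank and gets skipped
sample_records = [
    {"text": "April"},
    {"text": "   "},
    {"text": "April is the fourth month."},
]
docs = records_to_documents(sample_records)
print(len(docs), docs[1].metadata)
```

Note that `original_index` keeps the position in the dataset, so it stays valid even when blank records are skipped.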


# **Text Splitting**

Text splitting is used to break up longer documents into smaller chunks. This helps the retriever find more relevant parts of the text later on. Without splitting, long texts might be skipped or give worse results because the important info gets buried inside.
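
As an illustration of the idea, here is a minimal character-based splitter with overlap. It is a sketch under simple assumptions (fixed-size windows, no awareness of sentence boundaries); the function name and the default sizes are hypothetical, and LangChain's own splitters such as `RecursiveCharacterTextSplitter` are more careful about where they cut.

```python
def split_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text cut at a
    chunk boundary also appears at the start of the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


long_text = "word " * 100  # 500 characters of filler text
chunks = split_text(long_text, chunk_size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

A text shorter than `chunk_size` comes back as a single chunk, which is the situation most records in this dataset are in.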

In [10]:
import numpy as np

# Character length of every non-empty record
text_lengths = [
    len(record["text"]) for record in dataset["train"] if record["text"].strip()
]

print(f"Total non-empty documents: {len(text_lengths)}")

print("\nDOCUMENT LENGTH STATISTICS (in characters):")
print("-" * 50)
print(f"Minimum length: {min(text_lengths):,} characters")
print(f"Maximum length: {max(text_lengths):,} characters")
print(f"Average length: {np.mean(text_lengths):,.0f} characters")
print(f"Median length: {np.median(text_lengths):,.0f} characters")
print(f"Standard deviation: {np.std(text_lengths):,.0f} characters")

percentiles = [10, 25, 50, 75, 90, 95, 99]
print("\nPercentiles:")
for p in percentiles:
    value = np.percentile(text_lengths, p)
    print(f"  {p}th percentile: {value:,.0f} characters")

Total non-empty documents: 769764

DOCUMENT LENGTH STATISTICS (in characters):
--------------------------------------------------
Minimum length: 1 characters
Maximum length: 10,570 characters
Average length: 183 characters
Median length: 127 characters
Standard deviation: 198 characters

Percentiles:
  10th percentile: 12 characters
  25th percentile: 24 characters
  50th percentile: 127 characters
  75th percentile: 271 characters
  90th percentile: 432 characters
  95th percentile: 558 characters
  99th percentile: 872 characters


### **Why Text Splitting Is Not Necessary for Our Dataset**

After analyzing the Simple Wikipedia dataset, we found that text splitting (chunking) is not necessary here. The dataset contains 769,764 non-empty records with an average length of only 183 characters (about 46 tokens, assuming roughly 4 characters per token) and a median of 127 characters (about 32 tokens). Even the 95th percentile is just 558 characters (about 140 tokens), well below the input limits of common embedding models, which typically range from 512 to 8192 tokens. The model used in the next section, all-MiniLM-L6-v2, truncates input at 256 word pieces by default, which still covers the 99th percentile (872 characters, about 218 tokens); only rare outliers, up to 10,570 characters, would be cut off.

Since embedding models can easily handle records of this size, splitting these already short passages would be counterproductive. Text splitting is typically needed when documents are very long (more than about 1,000 characters) or exceed the model's token limit, and almost none of our records do. Keeping each record intact preserves its full context and simplifies the RAG pipeline, so we can proceed directly to embedding generation without a chunking step.
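
The token figures above follow from simple arithmetic, sketched below. Both constants are assumptions: 4 characters per token is a rough rule of thumb for English text rather than a tokenizer measurement, and 512 is the lower end of the limit range quoted above. The character counts are taken from the statistics printed earlier.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text
TOKEN_LIMIT = 512    # assumption: lower end of the quoted 512-8192 range


def estimate_tokens(num_chars, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count from a character count, rounding up."""
    return math.ceil(num_chars / chars_per_token)


# Character lengths from the statistics printed earlier
lengths = {"median": 127, "mean": 183, "p95": 558, "p99": 872, "max": 10_570}

for name, chars in lengths.items():
    tokens = estimate_tokens(chars)
    status = "fits" if tokens <= TOKEN_LIMIT else "exceeds the limit"
    print(f"{name:>6}: {chars:>6,} chars ~ {tokens:>5,} tokens ({status})")
```

Only the longest record exceeds the assumed limit, which supports skipping the chunking step for this dataset.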

# **Embedding Generation**

In this step, I convert each document into a vector of numbers, called an embedding. These vectors capture the meaning of the text and make it possible to compare documents later when searching for relevant content. It’s a key part of how the system knows which documents are similar to a user’s question.

In [11]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_function = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={"device": device},
)

In [12]:
texts = [doc.page_content for doc in langchain_documents]
embeddings = embedding_function.embed_documents(texts)

In [13]:
print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"Shape equivalent: ({len(embeddings)}, {len(embeddings[0])})")

Number of embeddings: 769764
Embedding dimension: 384
Shape equivalent: (769764, 384)


### **Semantic Similarity**

To better understand how embeddings capture meaning, I ran a simple test with three example sentences. Two of them were about climate change, and one was about chocolate cake. Using cosine similarity, the two climate-related texts showed a high similarity score (~0.69), while comparisons with the dessert sentence returned values close to zero. This confirms that text embeddings can effectively group semantically related content and separate unrelated topics — a key feature that makes them useful for document retrieval.

In [14]:
sentences = [
    "Climate change leads to rising global temperatures and extreme weather.",
    "Greenhouse gases are the main cause of global warming.",
    "Chocolate cake is a popular dessert made with cocoa powder and sugar."
]

In [15]:
test_embeddings = embedding_function.embed_documents(sentences)

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(test_embeddings)

In [17]:
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"Similarity between:\n - \"{sentences[i]}\"\n - \"{sentences[j]}\"\n = {similarity[i][j]:.4f}\n")

Similarity between:
 - "Climate change leads to rising global temperatures and extreme weather."
 - "Greenhouse gases are the main cause of global warming."
 = 0.6898

Similarity between:
 - "Climate change leads to rising global temperatures and extreme weather."
 - "Chocolate cake is a popular dessert made with cocoa powder and sugar."
 = 0.0066

Similarity between:
 - "Greenhouse gases are the main cause of global warming."
 - "Chocolate cake is a popular dessert made with cocoa powder and sugar."
 = -0.0113
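
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up 2-dimensional vectors; the function name is hypothetical, and real embeddings from this model have 384 dimensions.

```python
import math


def cosine_similarity_pair(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (made up): same direction, orthogonal, and opposite
same_direction = cosine_similarity_pair([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_pair([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_pair([1.0, 0.0], [-1.0, 0.0])
print(round(same_direction, 4), round(orthogonal, 4), round(opposite, 4))
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposite directions, which matches the pattern in the scores above.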



# **Vector Database Setup**

Here, I set up a vector database using Chroma. It stores all the document embeddings so they can be searched later. I load the documents in batches and save everything to disk, which makes the data persistent and ready for retrieval.

In [18]:
from langchain_chroma import Chroma
import shutil
import os

persist_dir = "./chroma_wikipedia_db"

if os.path.exists(persist_dir):
    shutil.rmtree(persist_dir)

vector_store = Chroma(
    collection_name="simple_wikipedia_collection",
    embedding_function=embedding_function,
    persist_directory=persist_dir, 
)

In [19]:
BATCH_SIZE = 5000  

for i in range(0, len(langchain_documents), BATCH_SIZE):
    batch = langchain_documents[i:i + BATCH_SIZE]
    vector_store.add_documents(batch)
    #print(f"Added batch {i//BATCH_SIZE + 1}: documents {i} - {min(i + BATCH_SIZE, len(langchain_documents))}")

In [20]:
print(f"Total number of documents stored in the vector database: {vector_store._collection.count()}")

Total number of documents stored in the vector database: 769764
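
The batching pattern used above can be factored into a small generator. This is a sketch only; `iter_batches` is a hypothetical helper, not a LangChain or Chroma function.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Small demonstration with 12 items and batches of 5
batches = list(iter_batches(list(range(12)), batch_size=5))
print([len(b) for b in batches])

# Number of batches needed for the full dataset with BATCH_SIZE = 5000
num_batches = math.ceil(769_764 / 5000)
print(num_batches)
```

With such a helper, the loading loop would read `for batch in iter_batches(langchain_documents, BATCH_SIZE): vector_store.add_documents(batch)`.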


# **Retrieval System**

In this part, I test the retrieval by asking a sample question. The system searches the vector database and returns the documents closest to the query. Each result comes with a score; for Chroma's `similarity_search_with_score` this is a distance, so lower values mean a closer match, and the results are listed from best to worst.

In [21]:
query = "Between each month comes April?"

In [22]:
results = vector_store.similarity_search_with_score(query, k=5)
for res, score in results:
    # The score is a distance: lower means more similar
    print(f"* [SIM={score:.6f}] {res.page_content}")

* [SIM=0.570056] April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.
* [SIM=0.588876] April is the fourth month of the year, and comes between March and May. It is one of four months to have 30 days.
* [SIM=0.675394] April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.
* [SIM=0.679126] April is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.
* [SIM=0.698874] April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.
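
Conceptually, the search ranks stored vectors by their distance to the query vector and keeps the closest `k`. The sketch below does this by brute force on made-up 2-dimensional vectors. The function names are hypothetical, squared Euclidean distance is an assumption (the actual metric depends on how the collection is configured), and a real vector store uses an approximate index rather than scanning every vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, distance) pairs closest to the query."""
    scored = [(squared_l2(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort()  # ascending: smallest distance first
    return [(doc_id, dist) for dist, doc_id in scored[:k]]


# Toy 2-d "embeddings" (made up for illustration)
toy_vectors = {
    "april": [1.0, 0.0],
    "may": [0.8, 0.6],
    "cake": [-1.0, 0.0],
}

results = top_k([0.9, 0.1], toy_vectors, k=2)
for doc_id, dist in results:
    print(f"{doc_id}: distance={dist:.3f}")
```

As in the real output above, the best match comes first and has the smallest score.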


# **Generation (LLM)**

In this stage, I load a language model (flan-t5-large) and connect it to the retrieval system using LangChain. The model is responsible for generating final answers based on the documents retrieved earlier. I also define a custom prompt that tells the model how to respond using the given context. Then I build a RAG chain, which combines the retriever and the generator into one workflow. When I ask a question, the system finds relevant documents, passes them to the model, and returns a generated answer along with the sources it used.

In [23]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_huggingface.llms import HuggingFacePipeline

model_id = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=pipeline_device,  
    do_sample=True,
    max_new_tokens=200,
    temperature=0.7,
    repetition_penalty=1.0,
)

llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cuda:0


In [24]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

retriever = vector_store.as_retriever(search_kwargs={"k": 7})

my_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template = """
You are a helpful assistant. Using only the information below, write a detailed and informative answer to the question.

{context}

Question: {question}
Answer:
"""
)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": my_prompt},
    return_source_documents=True,
    verbose=True
)
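
To show what the chain sends to the model, here is a rough sketch of the "stuff" approach: the retrieved passages are joined into one context string and substituted into the prompt template. `build_prompt` is a hypothetical helper, the `"\n\n"` separator is an assumption, and the two passages are copied from the dataset for illustration.

```python
PROMPT_TEMPLATE = """
You are a helpful assistant. Using only the information below, write a detailed and informative answer to the question.

{context}

Question: {question}
Answer:
"""


def build_prompt(question, retrieved_texts, separator="\n\n"):
    """Join the retrieved passages and fill them into the prompt template."""
    context = separator.join(retrieved_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


passages = [
    "The water cycle (or hydrological cycle) is the cycle that water goes through on Earth.",
    "Human activities that change the water cycle include:",
]
prompt = build_prompt("How does the water cycle work in nature?", passages)
print(prompt)
```

Because every retrieved passage goes into a single prompt, the value of `k` in the retriever directly affects how long the model input becomes.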

In [25]:
query = "How does the water cycle work in nature?"
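# Calling the chain directly is deprecated; invoke() takes the input under the "query" key
result = rag_chain.invoke({"query": query})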

print("------------------------------------------------------------------------------------------------------------------------")
print("🔍 Result:\n", result["result"])
print("------------------------------------------------------------------------------------------------------------------------")
print("\n📄 Source:")
print("------------------------------------------------------------------------------------------------------------------------")
for doc in result["source_documents"]:
    print("-", doc.page_content, "...")


> Entering new RetrievalQA chain...

> Finished chain.
------------------------------------------------------------------------------------------------------------------------
🔍 Result:
 The central theme of hydrology is how the water circulates. This is called the water cycle. The most vivid illustration of it is the water evaporation from the ocean with the formation of clouds. These clouds drift over the land and produce rain.
------------------------------------------------------------------------------------------------------------------------

📄 Source:
------------------------------------------------------------------------------------------------------------------------
- Water cycle ...
- The water cycle (or hydrological cycle) is the cycle that water goes through on Earth. ...
- This is the process that water starts and ends in the water cycle. ...
- Human activities that change the water cycle include: ...
- The central theme of hydrology is how the water c