# Retrieval-Augmented Generation

## Design and Implementation of a Mini Retrieval-Augmented Search System
My primary goal was to develop a lightweight system that could read through a folder of text files, break them down into manageable pieces, and then utilize OpenAI’s embedding models to make those pieces searchable. I called this system DocumentSearch because it acts like a personal assistant for quickly finding relevant information inside documents. The core idea was to combine simple local file processing with cloud-based language models, keeping everything modular and easy to understand.

In the constructor, I set the OpenAI API key so the system can connect to the model. I also take a document path and a chunk size, which sets the approximate maximum length of each text segment in characters. Rather than crowding the initialization with logic, I delegated the heavy lifting to a private method called _load_and_embed_docs, which keeps the constructor short and easy to follow. Inside that method, I load every .txt file from the specified directory, read its content, and split the text into smaller chunks. The splitting is word-based: words are accumulated until adding the next one would push the chunk past the size limit, at which point a new chunk begins. Because words are never split mid-word, a single word longer than the limit still ends up as its own oversized chunk. Bounding chunk length helps prevent overly long inputs that could cause inefficiencies or errors when generating embeddings.
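The word-based chunking described here can be sketched as a standalone function. This is a minimal illustration rather than the exact code in the class: the name chunk_words is hypothetical, and the sketch assumes the size limit counts characters, including the spaces that join words.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text on whitespace into chunks of at most chunk_size characters.

    A single word longer than chunk_size still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0  # always equals len(" ".join(current))
    for word in text.split():
        # Adding a word to a non-empty chunk also adds one joining space.
        added = len(word) + (1 if current else 0)
        if current and current_size + added > chunk_size:
            # The word does not fit: close the current chunk, start a new one.
            chunks.append(" ".join(current))
            current, current_size = [word], len(word)
        else:
            current.append(word)
            current_size += added
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "the quick brown fox jumps over the lazy dog"
for chunk in chunk_words(sample, chunk_size=15):
    print(len(chunk), repr(chunk))
# 15 'the quick brown'
# 14 'fox jumps over'
# 12 'the lazy dog'
```

With a 15-character limit, the sample sentence splits into three chunks, none longer than the limit.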

Once I have all the text chunks, I generate embeddings for each one using OpenAI’s text-embedding-ada-002 model. These embeddings represent the semantic meaning of the text in a numerical form, enabling efficient comparison later. I store both the text chunks and their corresponding embeddings so that when a user asks a question, I can quickly measure how similar their query is to each stored piece of text.

The ask method handles the similarity search. It converts the user’s query into an embedding and then calculates the cosine similarity between that query embedding and each stored chunk embedding. This step is the mathematical heart of the system: it identifies which chunks are closest in meaning to the user’s question. I then sort the results by score and return the top_k highest-scoring chunks (three by default) as a ranked list of (text, similarity) pairs.
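To make the ranking step concrete, here is a small worked example of cosine similarity on toy two-dimensional vectors. It is only a sketch: the helper names cosine_similarity and rank_chunks and the toy vectors are made up for illustration, and it uses the standard library so the arithmetic is visible, whereas the DocumentSearch class does the same calculation with NumPy on real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunks: Sequence[str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the top_k best."""
    scored = [
        (chunk, cosine_similarity(query_vec, vec))
        for chunk, vec in zip(chunks, chunk_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors standing in for real embeddings (illustrative values only)
chunks = ["about cricket", "about politics", "about both"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for chunk, score in rank_chunks([1.0, 0.0], vectors, chunks, top_k=2):
    print(f"{score:.3f}  {chunk}")
# 1.000  about cricket
# 0.707  about both
```

The query vector points in the same direction as the first chunk (similarity 1), is at 45 degrees to the third (about 0.707), and is orthogonal to the second (0), so the second chunk is dropped when top_k is 2.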

The answer_question method builds on this retrieval process by adding a generation step. After retrieving the most relevant chunks, I combine them into a single context and construct a prompt for the chat model. I instruct the model to answer the question based on the given context and to say so if the context does not contain the answer. This approach keeps the system grounded in the retrieved evidence, reducing the chances of generating hallucinated responses. I chose gpt-5-mini for this step because it offers strong reasoning capabilities at a reasonable computational cost.
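The prompt assembly can be sketched in isolation. The helper name build_prompt is hypothetical; the wording and the "---" separator mirror the prompt used in answer_question. Building the text from adjacent string literals, as assumed here, keeps source-code indentation out of the prompt.

```python
from typing import List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Join retrieved chunks into one context block and wrap it in instructions."""
    context = "\n---\n".join(chunks)
    return (
        "Based on the context provided below, please answer the user's question.\n"
        "If the context does not contain the answer, state that you cannot answer.\n\n"
        "--- CONTEXT ---\n"
        f"{context}\n"
        "--- END CONTEXT ---\n\n"
        f"QUESTION: {question}"
    )


# Placeholder chunks, for illustration only
print(build_prompt("When was he born?", ["Chunk one text.", "Chunk two text."]))
```

Keeping the retrieved text between explicit markers makes it easier for the model to tell the evidence apart from the instructions and the question.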

To handle API problems gracefully, I wrapped the language model call in a try-except block. Instead of crashing if something goes wrong, the system returns a descriptive error message. All things considered, this project is my attempt to combine retrieval-based reasoning and natural language generation in a concise, readable way. It forms a basis for document-based question answering and can be extended for research support, study aids, or knowledge management systems.

In [1]:
import os
from typing import List, Tuple

import numpy as np
import openai


class DocumentSearch:
    def __init__(self, api_key: str | None, document_path: str, chunk_size: int = 1000):
        """
        Initializes the DocumentSearch system.
        """
        openai.api_key = api_key
        self.chunk_size = chunk_size
        # The setup process is a separate method to keep init clean
        self.chunks, self.embeddings = self._load_and_embed_docs(document_path)

    def _load_and_embed_docs(self, path: str) -> Tuple[List[str], List[np.ndarray]]:
        """A private helper method to handle the setup process."""
        print("Loading and processing documents...")
        # Logic for loading, chunking, and embedding
        docs = []
        for filename in os.listdir(path):
            if filename.endswith(".txt"):
                with open(os.path.join(path, filename), "r", encoding="utf-8") as file:
                    docs.append(file.read())

        all_chunks = []
        for text in docs:
            words = text.split()
            current_chunk_words = []
            current_size = 0
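            for word in words:
                # A word added to a non-empty chunk also adds one joining space
                added = len(word) + (1 if current_chunk_words else 0)
                # Only close a chunk that has content, so an oversized first
                # word never produces an empty chunk
                if current_chunk_words and current_size + added > self.chunk_size:
                    all_chunks.append(" ".join(current_chunk_words))
                    current_chunk_words = [word]
                    current_size = len(word)
                else:
                    current_chunk_words.append(word)
                    current_size += added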
            if current_chunk_words:
                all_chunks.append(" ".join(current_chunk_words))

        embeddings = []
        for chunk in all_chunks:
            response = openai.embeddings.create(
                model="text-embedding-ada-002", input=chunk
            )
            embeddings.append(np.array(response.data[0].embedding))

        print("Setup complete!")
        return all_chunks, embeddings

    def ask(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        """
        Finds the most relevant document chunks for a given query.
        """
        response = openai.embeddings.create(model="text-embedding-ada-002", input=query)
        query_embedding = np.array(response.data[0].embedding)

        similarities = []
        for i, chunk_embedding in enumerate(self.embeddings):
            similarity = np.dot(query_embedding, chunk_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(chunk_embedding)
            )
            similarities.append((self.chunks[i], similarity))

        return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]

    def answer_question(self, question: str, top_k: int = 3) -> str | None:
        """
        Retrieves relevant chunks and uses an LLM to generate a final answer.
        """
        # Retrieve the most relevant chunks of text
        relevant_chunks_with_scores = self.ask(question, top_k=top_k)

        # If no relevant chunks are found, we can't answer.
        if not relevant_chunks_with_scores:
            return "I couldn't find any relevant information to answer that question."

        # Combine the chunks into a single context string
        context = "\n---\n".join(
            [chunk for chunk, score in relevant_chunks_with_scores]
        )

        # Create a prompt for the language model
        prompt = f"""
        Based on the context provided below, please answer the user's question.
        If the context does not contain the answer, state that you cannot answer.

        --- CONTEXT ---
        {context}
        --- END CONTEXT ---

        QUESTION: {question}
        """

        # Call the language model to generate the answer
        try:
            response = openai.chat.completions.create(
                model="gpt-5-mini",
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "You are a helpful assistant that "
                            "answers questions based on provided context."
                        ),
                    },
                    {"role": "user", "content": prompt},
                ],
            )
            return response.choices[0].message.content
        except openai.OpenAIError as openai_error:
            return f"An OpenAI API error occurred while generating the answer: {openai_error}"

## Initializing the Document Search System
In this cell, I initialize the DocumentSearch system by creating an instance called search_system. I pass in my OpenAI API key, which I load securely from the environment using os.getenv("OPENAI_API_KEY"), and specify "my_documents" as the folder containing the text files to process. The text file in this folder contains the contents of Imran Khan’s Wikipedia page, which serves as the knowledge base for this small demonstration. When this line runs, the system automatically loads that document, splits it into chunks, and generates embeddings for each piece. Once setup is complete, search_system is ready to retrieve relevant information and answer questions about Imran Khan using the RAG workflow built into the code.
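Since os.getenv returns None when a variable is unset, it can help to fail early with a clear message before any documents are processed. The helper below is a hypothetical sketch, not part of the DocumentSearch class; the name require_env and the error wording are assumptions.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Demonstration with a variable name that is assumed to be unset
try:
    require_env("SOME_UNSET_VARIABLE_FOR_DEMO")
except RuntimeError as err:
    print(err)
```

In the notebook, the key could then be passed as api_key=require_env("OPENAI_API_KEY"), so a missing key is reported at initialization rather than at the first API call.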

In [2]:
# Initialize the system
search_system = DocumentSearch(
    api_key=os.getenv("OPENAI_API_KEY"), document_path="my_documents"
)

Loading and processing documents...
Setup complete!


## Asking a Question and Generating an Answer
In this cell, I test the system by asking a sample question, “When was Imran Khan born?”, and store the result in the variable answer1. When I call search_system.answer_question(question1), the program first embeds the question with the text-embedding-ada-002 model and retrieves the most similar text chunks from the stored embeddings. It then passes those chunks to gpt-5-mini, which reads the context and generates an answer.

Finally, I print the response to verify that the system is working as intended. This step demonstrates the complete retrieval-to-generation flow in action: the question is interpreted, related information is found, and a grounded natural-language answer is produced.

In [3]:
# Ask question using the new method
question1 = "When was Imran Khan born?"
answer1 = search_system.answer_question(question1)

# Print the final, generated answer
print(answer1)

Imran Khan was born on 5 October 1952.


## Asking a Follow-Up Question

In this final cell, I continue testing the system by asking a second question, “When did he win the world cup?”, and save the result as answer2. Just like before, the answer_question() method retrieves the most relevant chunks from the document embeddings using text-embedding-ada-002, then relies on gpt-5-mini to interpret that context and generate a clear, factual response.

Printing answer2 lets me check how the system handles a question phrased as a follow-up. Note that answer_question() keeps no conversation history: each call embeds its question on its own, so the pronoun “he” is never explicitly resolved against the earlier question. The query works here because the knowledge base contains a single document about Imran Khan, so every retrieved chunk is already about him.

In [4]:
# Ask question using the new method
question2 = "When did he win the world cup?"
answer2 = search_system.answer_question(question2)

# Print the final, generated answer
print(answer2)

He won the Cricket World Cup in 1992.


## Analysis
The system gave correct answers to both questions based on Imran Khan's Wikipedia page.

For "When was Imran Khan born?", it responded with 5 October 1952, which is the correct date, stated plainly.

For "When did he win the world cup?", it answered 1992, which is also correct. Since each question is processed independently, the system did not resolve "he" from the earlier question; retrieval still surfaced the right detail because the only document in the knowledge base is about Imran Khan.

These two results suggest that the RAG pipeline works end to end on this small example. The embeddings guided the model to the right text, and gpt-5-mini produced accurate, grounded answers in plain language.

## References
Wikipedia. (n.d.). Imran Khan. In Wikipedia, The Free Encyclopedia. Retrieved October 10, 2025, from https://en.wikipedia.org/wiki/Imran_Khan