# Retrieval-Augmented Generation (RAG) Demo

This notebook demonstrates a simple retrieval-augmented generation workflow using a custom search engine.

I practiced using GitHub Copilot to write the code cell by cell, then refactored it into functions to make it modular.

In [29]:
# Download the minsearch search engine and the documents JSON file from GitHub.
#!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/refs/heads/main/minsearch.py
#!wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/refs/heads/main/01-intro/documents.json

In [1]:
# Import the necessary libraries
import json
import minsearch
from openai import OpenAI

In [2]:
# Load the documents from the JSON file
with open('documents.json', 'r') as f:
    docs_raw = json.load(f)

In [3]:
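# Flatten the nested course -> documents structure into one list,
# tagging each document with the course it belongs to
documents = []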

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
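
To make the flattening step concrete, here is a minimal sketch on a hand-written sample. The records and the `flatten` helper are invented for illustration; they only mimic the shape the loop above expects (a list of courses, each holding a `documents` list) and are not taken from `documents.json`.

```python
# Hypothetical sample mimicking the nested layout assumed by the loop above.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"question": "Q1?", "text": "A1.", "section": "General"},
            {"question": "Q2?", "text": "A2.", "section": "Setup"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"question": "Q3?", "text": "A3.", "section": "General"},
        ],
    },
]


def flatten(raw):
    """Return one flat list of documents, each tagged with its course."""
    flat = []
    for course_dict in raw:
        for doc in course_dict["documents"]:
            # Copy each document so the input is left untouched
            # (the notebook's loop adds the key in place instead).
            flat.append({**doc, "course": course_dict["course"]})
    return flat


sample_documents = flatten(sample_raw)
print(len(sample_documents))          # 3
print(sample_documents[0]["course"])  # course-a
```

Copying each record rather than mutating it is a design choice of this sketch only; for a one-off notebook run, the in-place version above works just as well.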

In [4]:
# Create the search index
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [5]:
# Fit the index with the documents
index.fit(documents)

<minsearch.Index at 0x173d79a60>

In [None]:
# Search the index for relevant documents based on the query.
def search(query):
    # Boost values weight each text field: matches in "question" count 3x,
    # matches in "section" count 0.5x
    boost = {"question": 3, "section": 0.5}
    # Search the index with the query, filtering by course and applying boost
    results = index.search(
        query=query,
        num_results=5,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict=boost
    )
    return results
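
The effect of the boost values can be illustrated with a toy scorer. This is only a sketch under stated assumptions: `toy_score` is a hypothetical helper that counts overlapping words per field, and it is not how minsearch scores documents internally. Treating fields missing from the boost dictionary as weight 1.0 is also an assumption of this sketch.

```python
def toy_score(query, doc, boost, default_boost=1.0):
    """Weighted count of query words found in each text field (toy scoring only)."""
    query_words = set(query.lower().split())
    score = 0.0
    for field in ("question", "text", "section"):
        field_words = set(doc.get(field, "").lower().split())
        overlap = len(query_words & field_words)
        # Fields absent from `boost` fall back to the default weight (assumption).
        score += boost.get(field, default_boost) * overlap
    return score


toy_boost = {"question": 3, "section": 0.5}
toy_query = "can i still enroll"

# Same matching words, but in different fields.
doc_a = {"question": "can i still enroll", "text": "yes", "section": "general"}
doc_b = {"question": "what is docker", "text": "you can still enroll late", "section": "general"}

print(toy_score(toy_query, doc_a, toy_boost))  # 12.0
print(toy_score(toy_query, doc_b, toy_boost))  # 3.0
```

The document whose question field matches the query ends up with the higher score, which is the behavior the boost on `question` is meant to encourage.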
    

In [None]:
# Build a prompt for the LLM using the query and search results.
def prompt_builder(query, search_results):
    # Create prompt template for the question and context
    prompt_template = """
    You are a course assistant for the Data Engineering Zoomcamp. 
    You will be given a question. Your task is to answer the question based on the CONTEXT from the FAQ Documents. Use only facts from the CONTEXT to answer the question. If you don't know the answer, say "I don't know".
    Question: {question}
    Context: {context}
    Answer: 
    """.strip()

    # Create context of documents from our search results
    context = ""

    for doc in search_results:
        context += f"Question: {doc['question']}\n"
        context += f"Answer: {doc['text']}\n"
        context += f"Section: {doc['section']}\n\n"

    # Build the prompt once, after the loop, so it is defined even when there are no results
    prompt = prompt_template.format(
        question=query,
        context=context
    ).strip()
    return prompt
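
The context-building step can also be written with `str.join`, which avoids repeated string concatenation and handles an empty result list naturally. The sketch below is one possible variant, not part of the original notebook; `build_context` and `fake_results` are hypothetical names, and the sample records are made up.

```python
def build_context(search_results):
    """Format each result as a Question/Answer/Section block, separated by blank lines."""
    blocks = [
        f"Question: {doc['question']}\nAnswer: {doc['text']}\nSection: {doc['section']}"
        for doc in search_results
    ]
    return "\n\n".join(blocks)


# Made-up results with the same keys the search results are assumed to have.
fake_results = [
    {"question": "Can I enroll late?", "text": "Yes.", "section": "General"},
    {"question": "Is there a certificate?", "text": "Yes, after the project.", "section": "General"},
]

print(build_context(fake_results))
print(repr(build_context([])))  # ''
```

With no search results the context is an empty string, so the prompt still renders and the model can fall back to "I don't know".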
    

In [None]:
# Send the prompt to the OpenAI LLM and return the response.
def llm(prompt):
    # Create an OpenAI client instance (reads the API key from the OPENAI_API_KEY environment variable)
    client = OpenAI()
    # Send the prompt (question plus context) to the OpenAI model and get the response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

In [None]:
# Implement the full RAG pipeline: search, prompt building, and LLM call.
def rag(query): 
    # Get search results for the query
    search_results = search(query)
    # Build the prompt with the query and search results
    prompt = prompt_builder(query, search_results)
    # Get the answer from the LLM
    answer = llm(prompt)
    return answer
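
Because the pipeline is just three function calls chained together, its wiring can be checked without an index or an API key by passing in stand-ins. The sketch below assumes nothing beyond plain Python; `rag_with` and the `fake_*` functions are hypothetical helpers written for illustration, not part of the notebook.

```python
def rag_with(query, search_fn, prompt_fn, llm_fn):
    """Run search -> prompt -> LLM using whichever callables are passed in."""
    search_results = search_fn(query)
    prompt = prompt_fn(query, search_results)
    return llm_fn(prompt)


def fake_search(query):
    # Stand-in for the real search: always returns one made-up document.
    return [{"question": "Can I enroll late?", "text": "Yes.", "section": "General"}]


def fake_prompt(query, results):
    # Stand-in for the real prompt builder: a much shorter template.
    context = " ".join(doc["text"] for doc in results)
    return f"Question: {query}\nContext: {context}"


def fake_llm(prompt):
    # Echo the prompt so the wiring is visible without a network call.
    return f"[stub answer for]\n{prompt}"


stub_answer = rag_with("Can I still enroll?", fake_search, fake_prompt, fake_llm)
print(stub_answer)
```

Swapping in the notebook's own `search`, `prompt_builder`, and `llm` as the three arguments would run the same flow as `rag(query)`.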

In [19]:
# Set the query 
query = 'the course has already started, can I still enroll?'


In [28]:
# Example usage: run the pipeline once and print the answer
answer = rag(query)
print(answer)

Yes, even if the course has already started, you can still enroll and submit the homework. However, be aware that there will be deadlines for submitting the final projects, so it's advisable not to leave everything for the last minute.
