This notebook follows the steps from [LLM ZoomCamp → RAG Intro](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/rag-intro.ipynb).

# 1. Download a simple search library

In [21]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

import minsearch

--2024-06-30 18:20:30--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py.2’


2024-06-30 18:20:30 (20.1 MB/s) - ‘minsearch.py.2’ saved [3832/3832]
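Note that the log above reports `minsearch.py.2`: by default `wget` does not overwrite an existing file and adds a numeric suffix instead, so `import minsearch` still loads the copy downloaded first. A minimal sketch of a download step that always targets a fixed file name and skips the request when the file is already present; `fetch_once` and `MINSEARCH_URL` are hypothetical names introduced for illustration, not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

MINSEARCH_URL = "https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py"


def fetch_once(url: str, target: str) -> bool:
    """Download url to target unless target already exists.

    Returns True if a download happened, False if the existing file was kept.
    """
    path = Path(target)
    if path.exists():
        return False
    urlretrieve(url, str(path))
    return True


# In the notebook this call could stand in for the wget line:
# fetch_once(MINSEARCH_URL, "minsearch.py")
```

To force a fresh copy, delete `minsearch.py` before calling the helper.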



# 2. Load and prepare Q&A data

In [22]:
import json

with open('data/documents.json', 'rt') as input_file:
    docs_raw = json.load(input_file)

In [23]:
docs_raw[0]['course']

'data-engineering-zoomcamp'

In [24]:
docs_raw[0]['documents'][:3]

[{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?'},
 {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute."

### Add a `course` field to each Q&A item.

In [25]:
documents = []

for answers in docs_raw:
    for doc in answers['documents']:
        doc['course'] = answers['course']
        documents.append(doc)

In [26]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
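The loop above adds the `course` key to the dictionaries inside `docs_raw` itself. If the raw data should stay unchanged, the same flattening can be sketched with copies; `flatten_courses` and the toy input below are made up for illustration and only mirror the shape of `docs_raw`.

```python
def flatten_courses(courses):
    """Return one flat list of Q&A items, each tagged with its course name."""
    flat = []
    for course in courses:
        for doc in course["documents"]:
            # Copy the item so the nested input structure is left untouched.
            flat.append({**doc, "course": course["course"]})
    return flat


# Toy input with the same shape as docs_raw (contents are made up).
sample = [
    {"course": "course-a",
     "documents": [{"text": "t1", "section": "s", "question": "q1"}]},
    {"course": "course-b",
     "documents": [{"text": "t2", "section": "s", "question": "q2"},
                   {"text": "t3", "section": "s", "question": "q3"}]},
]

flat = flatten_courses(sample)
print(len(flat))          # 3
print(flat[1]["course"])  # course-b
```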

# 3. Index Q&A items for future search

In [27]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],  # full-text search
    keyword_fields=["course"]  # filtering
)

In [28]:
index.fit(documents)

<minsearch.Index at 0x1355b2b40>

# 4. Search for relevant documents

In [29]:
query = "The course has already started. Can I still enroll?"

In [30]:
# Give matches in some fields more weight than matches in others
boost_dictionary = {'question': 3.0, 'section': 0.5}

search_results = index.search(
    query=query,
    boost_dict=boost_dictionary,
    num_results=3
)

search_results

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the cour

No course filter was applied here, so the results mix questions and answers from different courses.

### Filter: Show the documents only for a specific course

In [31]:
index.search(
    query=query,
    filter_dict={'course': 'data-engineering-zoomcamp'},
    boost_dict=boost_dictionary,
    num_results=3
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202
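To make the roles of `filter_dict` and `boost_dict` concrete, here is a toy ranking function in plain Python. It is not how `minsearch` is implemented; the scoring (counting query-token matches per field) and the names `tokenize` and `toy_search` are assumptions chosen only to illustrate exact-match filtering on keyword fields and per-field weighting on text fields.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=3):
    """Rank docs by boosted token overlap with the query (illustration only)."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_tokens = set(tokenize(query))

    scored = []
    for doc in docs:
        # Keyword fields act as exact-match filters.
        if any(doc.get(field) != value for field, value in filter_dict.items()):
            continue
        score = 0.0
        for field in text_fields:
            counts = Counter(tokenize(doc.get(field, "")))
            overlap = sum(counts[token] for token in query_tokens)
            # Fields without an explicit boost get weight 1.0.
            score += boost_dict.get(field, 1.0) * overlap
        scored.append((score, doc))

    # Highest score first; documents with no match at all are dropped.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:num_results] if score > 0]


# Made-up documents with the same fields as the FAQ items.
toy_docs = [
    {"question": "Can I still join the course?", "text": "Yes, you can.",
     "section": "General", "course": "course-a"},
    {"question": "How do I install Docker?",
     "text": "You can still join the course Slack for help.",
     "section": "Setup", "course": "course-a"},
    {"question": "Can I still join after the start?", "text": "Yes.",
     "section": "General", "course": "course-b"},
]

toy_results = toy_search(
    toy_docs,
    "can I still join",
    text_fields=["question", "text", "section"],
    filter_dict={"course": "course-a"},
    boost_dict={"question": 3.0, "section": 0.5},
)
for doc in toy_results:
    print(doc["course"], "|", doc["question"])
```

The `course-b` item never appears because the filter removes it before scoring, and the item whose question matches the query ranks above the one that only matches in its answer text.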

# 5. Generate an answer with the OpenAI API

In [32]:
from openai import OpenAI

In [None]:
client = OpenAI()
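`OpenAI()` called without arguments reads the API key from the `OPENAI_API_KEY` environment variable. A small sketch of an explicit check that fails early with a readable message; `require_api_key` is a hypothetical helper, not part of the `openai` package.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return key


# Possible use in the notebook (assumes the key is exported in the shell):
# client = OpenAI(api_key=require_api_key())
```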

## Without RAG

In [None]:
query

'The course has already started. Can I still enroll?'

In [None]:
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": query}]
)

response.choices[0].message.content

"The possibility of enrolling in a course after it has already started depends on several factors, including the policies of the educational institution or platform offering the course. Here are a few steps you can take:\n\n1. **Check the Course Policy:** Look up the course details on the institution's website or learning platform to see if they mention late enrollment policies.\n\n2. **Contact the Instructor:** Reach out to the course instructor directly. They may have the discretion to allow late enrollments or offer advice on how to catch up.\n\n3. **Contact the Registrar or Administration:** If you're dealing with a college or university, contact the registrar's office or the relevant administrative department to inquire about their policy on late enrollment.\n\n4. **Assess the Course Structure:** Determine if the course content is sequential and if starting late would put you at a significant disadvantage. Some courses are designed in a way that makes catching up difficult, while 

## With RAG

In [None]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

### a. Build the context from local documents

In [36]:
search_results[0]

{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
 'section': 'General course-related questions',
 'question': 'The course has already started. Can I still join it?',
 'course': 'machine-learning-zoomcamp'}

In [38]:
context_builder = []

for doc in search_results:
    context_builder.append(f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n")

context = "".join(context_builder)

In [40]:
prompt = prompt_template.format(question=query, context=context).strip()

In [41]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: The course has already started. Can I still enroll?

CONTEXT: 
section: General course-related questions
question: The course has already started. Can I still join it?
answer: Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.
In order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final pr
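The context and prompt steps can be folded into two small functions so they are easy to reuse for other queries. This is a sketch under the assumption that every search result has `section`, `question`, and `text` keys, as in the outputs above; `build_context`, `build_prompt`, and the sample results are illustrative.

```python
PROMPT_TEMPLATE = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE.

QUESTION: {question}

CONTEXT:
{context}
""".strip()


def build_context(search_results):
    """Format each search result as a section/question/answer entry."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    # Separate entries with one blank line; no trailing whitespace to strip.
    return "\n\n".join(entries)


def build_prompt(question, search_results):
    """Fill the template with the question and the formatted context."""
    return PROMPT_TEMPLATE.format(question=question, context=build_context(search_results))


# Made-up search results with the same keys as the FAQ items.
sample_results = [
    {"section": "General", "question": "Can I join late?", "text": "Yes, you can."},
    {"section": "General", "question": "Is there a deadline?", "text": "Projects have deadlines."},
]
sample_prompt = build_prompt("Can I still enroll?", sample_results)
print(sample_prompt)
```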

### b. Get the results from GPT-4o

In [42]:
rag_response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": prompt}]
)

In [46]:
rag_response.choices[0].message.content

"Yes, you can still enroll in the course even though it has already started. However, you won't be able to submit some of the homeworks. To get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ projects by the deadline. This means that even if you join the course at the end of November and complete two projects, you will still be eligible for a certificate. Be mindful of the deadlines for turning in the final projects, as leaving everything for the last minute is not advisable."