# Simple RAG with GPT Project

In [2]:
!wget https://raw.githubusercontent.com/harriliu/LLM/refs/heads/main/minisearch.py

--2025-02-20 15:07:24--  https://raw.githubusercontent.com/harriliu/LLM/refs/heads/main/minisearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3833 (3.7K) [text/plain]
Saving to: ‘minisearch.py’


2025-02-20 15:07:24 (30.2 MB/s) - ‘minisearch.py’ saved [3833/3833]



In [5]:
import minisearch
import json

### Load the toy text data from documents.json, which contains the Data Zoomcamp course FAQ

In [7]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [8]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
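
The loop above implies that documents.json is a list of courses, each holding a `course` name and a `documents` list of FAQ entries. The sketch below illustrates that flattening step on a tiny in-memory sample. The sample data and the `flatten` helper are hypothetical and stand in for the real file; unlike the loop above, this version copies each entry instead of modifying it in place.

```python
# Hypothetical sample mirroring the structure the loop above expects:
# a list of courses, each with a name and a list of FAQ entries.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "Setup", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "General", "question": "Question 3"},
        ],
    },
]


def flatten(raw):
    """Copy the course name onto every FAQ entry and return one flat list."""
    flat = []
    for course_dict in raw:
        for doc in course_dict["documents"]:
            # Build a new dict so the input data is left unmodified.
            flat.append({**doc, "course": course_dict["course"]})
    return flat


sample_documents = flatten(sample_raw)
print(len(sample_documents))
print(sample_documents[0])
```

Each flattened entry carries its own `course` value, which is what makes filtering by course possible later on.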

In [13]:
index = minisearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

Fit the minisearch index on the documents

In [16]:
index.fit(documents)

<minisearch.Index at 0x7dc6f4bc8ec0>

In [51]:
# Search the indexed documents for a user query, restricted to one course
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'machine-learning-zoomcamp'},
        boost_dict=boost,
        num_results=10
    )

    return results
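
To make the roles of `filter_dict` and `boost_dict` more concrete, here is a minimal sketch of a filtered, boosted search over a list of dicts. It uses plain term overlap as the score, which is an assumption made for illustration and is not how minisearch is implemented internally. The names `tokenize`, `toy_search`, and `toy_docs` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=10):
    """Rank docs by boosted term overlap after applying exact-match filters."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_terms = set(tokenize(query))

    scored = []
    for doc in docs:
        # Keyword fields act as hard filters: every value must match exactly.
        if any(doc.get(field) != value for field, value in filter_dict.items()):
            continue
        score = 0.0
        for field in text_fields:
            overlap = query_terms & set(tokenize(doc.get(field, "")))
            # Fields without an explicit boost get weight 1.0.
            score += boost_dict.get(field, 1.0) * len(overlap)
        if score > 0:
            scored.append((score, doc))

    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical documents, not taken from documents.json.
toy_docs = [
    {"question": "Can I still join?", "text": "Yes, you can.",
     "section": "General", "course": "course-a"},
    {"question": "How do I install it?", "text": "You can still join the chat for help.",
     "section": "Setup", "course": "course-a"},
    {"question": "Can I still join?", "text": "No.",
     "section": "General", "course": "course-b"},
]

hits = toy_search(
    toy_docs,
    query="can I still join",
    text_fields=["question", "text", "section"],
    filter_dict={"course": "course-a"},
    boost_dict={"question": 3.0, "section": 0.5},
)
print([hit["question"] for hit in hits])
```

With a boost of 3.0 on `question`, a match in the question field outweighs the same match in the answer text, so the entry whose question mirrors the query ranks first. The `course-b` entry never appears because the filter removes it before scoring.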

In [52]:
query = 'the course has already started, can I still enroll?'

In [53]:
results = search(query)
results

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes! We'll cover some linear algebra in the course, but in general, there will be very few formulas, mostly code.\nHere are some interesting videos covering linear algebra that you can already watch: ML Zoomcamp 1.8 - Linear Algebra Refresher from Alexey Grigorev or the excellent playlist from 3Blue1Brown Vectors | Chapter 1, Essence of linear algebra. Never hesitate to ask the community for help if you have any question.\n(Mélanie Foues

In [31]:
from openai import OpenAI

In [34]:
client = OpenAI()

In [35]:
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": query}]
)

response.choices[0].message.content

'It depends on the course and the institution offering it. Some courses have rolling admissions and allow late enrollment, especially if they are online or self-paced. Others might have strict deadlines and do not accept late enrollments once the course has started.\n\nTo find out if you can still enroll, you should:\n\n1. **Check the course website:** Look for details on late enrollment policies.\n2. **Contact the instructor or administrator:** Reach out for more specific guidance or exceptions.\n3. **Contact the admissions office:** For official guidance and possible options for late registration.\n\nIt’s always best to inquire directly to get the most accurate and up-to-date information.'

### As you can see, GPT-4o can only give a very general answer without context about this specific course

Let's do some prompt engineering and create a prompt template for GPT-4o

In [45]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()


In [46]:
prompt_template

"You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\nUse only the facts from the CONTEXT when answering the QUESTION.\n\nQUESTION: {question}\n\nCONTEXT: \n{context}"
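
The `{question}` and `{context}` markers in the template are ordinary `str.format` placeholders. The short sketch below, using a shortened hypothetical template and made-up values, shows how they get filled. It also shows that braces inside the substituted values are inserted literally, so FAQ answers containing code should not break the formatting.

```python
# Shortened, hypothetical template with the same two placeholders.
demo_template = "QUESTION: {question}\n\nCONTEXT:\n{context}"

# Made-up values; the context deliberately contains braces.
demo_prompt = demo_template.format(
    question="How do I print a dict?",
    context='answer: use print({"a": 1})',
)
print(demo_prompt)
```

Only braces in the template itself are interpreted by `format`; a template that needs a literal brace would have to double it (`{{` or `}}`).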

In [61]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [62]:
prompt = prompt_template.format(question=query, context=context).strip()

In [63]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: the course has already started, can I still enroll?

CONTEXT: 
section: General course-related questions
question: The course has already started. Can I still join it?
answer: Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.
In order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.

section: General course-related questions
question: I don't know math. Can I take the course?
answer: Yes! We'll cover some linear algebra in the course, but in general, there will be very few formulas, mostly code.
Here are some interesting videos covering linear algebra t

### Now let's send the filled-in prompt to GPT-4o and see the new response

In [64]:
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": prompt}]
)

response.choices[0].message.content

"Yes, you can still enroll in the course even if it has already started. However, you may not be able to submit some of the homeworks. To receive a certificate, you need to submit 2 out of 3 course projects and review 3 peers' projects by the deadline. If you join the course at the end of November and complete two projects, you will still be eligible for a certificate."

### Let's clean up the code a bit and combine the search and prompt steps into a single RAG function

In [68]:
import textwrap

def build_prompt(query, search_results):
    # Dedent the template so its lines do not carry the function's
    # indentation into the prompt (strip() only trims the two ends).
    prompt_template = textwrap.dedent("""
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.

    QUESTION: {question}

    CONTEXT:
    {context}
    """).strip()

    # Concatenate the retrieved FAQ entries into one context block
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [70]:
def call_llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

In [72]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    response = call_llm(prompt)
    return response
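
Because `rag` is just search, prompt building, and an LLM call chained together, the flow can be checked without an index or an API key by passing the two outer steps in as functions. This is a minimal sketch under that assumption: `rag_with`, `fake_search`, and `fake_llm` are hypothetical names, and the prompt format here is simplified compared with the template above.

```python
def rag_with(query, search_fn, llm_fn):
    """Run the search -> prompt -> LLM pipeline with pluggable steps."""
    search_results = search_fn(query)

    # Same per-document layout as the context built earlier in the notebook.
    context = "".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
        for doc in search_results
    )
    prompt = f"QUESTION: {query}\n\nCONTEXT:\n{context}".strip()
    return llm_fn(prompt)


def fake_search(query):
    """Stub search step: always returns one made-up FAQ entry."""
    return [{"section": "General", "question": "Stub question?", "text": "Stub answer."}]


def fake_llm(prompt):
    """Stub LLM step: echoes the prompt so it can be inspected."""
    return f"ECHO:\n{prompt}"


answer = rag_with("test question", fake_search, fake_llm)
print(answer)
```

Swapping `fake_search` and `fake_llm` for the real `search` and `call_llm` would give the same behavior as `rag`, apart from the simplified prompt text.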

In [74]:
rag("'403 Forbidden' error message when you try to push to a GitHub repository")

'To resolve a \'403 Forbidden\' error when trying to push to a GitHub repository, follow these steps:\n\n1. Check the current URL configuration with the command:\n   ```bash\n   git config -l | grep url\n   ```\n   The output will look something like this:\n   ```\n   remote.origin.url=https://github.com/github-username/github-repository-name.git\n   ```\n\n2. Change the URL to the correct format using the command:\n   ```bash\n   git remote set-url origin "https://github-username@github.com/github-username/github-repository-name.git"\n   ```\n\n3. Verify that the change is reflected using the earlier command to check the URL configuration.'