# Simple RAG with GPT Project

In [2]:
!wget https://raw.githubusercontent.com/harriliu/LLM/refs/heads/main/minisearch.py

--2025-02-20 18:03:58--  https://raw.githubusercontent.com/harriliu/LLM/refs/heads/main/minisearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3833 (3.7K) [text/plain]
Saving to: ‘minisearch.py.1’


2025-02-20 18:03:58 (15.2 MB/s) - ‘minisearch.py.1’ saved [3833/3833]



In [3]:
import minisearch
import json

### Load the toy text data from documents.json, which contains the Data Zoomcamp course FAQ

In [4]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [5]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
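
To make the flattening step concrete, here is a small sketch on made-up data. The sample below is hypothetical and only mirrors the nested layout that the loop above assumes for `documents.json` (a list of courses, each with a `documents` list). The `flatten` helper is not part of the notebook; unlike the loop above, it copies each entry instead of mutating the loaded data.

```python
# Hypothetical sample mirroring the nested layout assumed for documents.json
sample_raw = [
    {"course": "course-a", "documents": [
        {"text": "Answer 1", "section": "General", "question": "Q1?"},
        {"text": "Answer 2", "section": "General", "question": "Q2?"},
    ]},
    {"course": "course-b", "documents": [
        {"text": "Answer 3", "section": "Setup", "question": "Q3?"},
    ]},
]


def flatten(raw):
    """Return one flat list of FAQ entries, each tagged with its course."""
    flat = []
    for course_dict in raw:
        for doc in course_dict["documents"]:
            # Copy the entry and attach the course name, leaving the input untouched
            flat.append({**doc, "course": course_dict["course"]})
    return flat


sample_documents = flatten(sample_raw)
print(len(sample_documents))
print(sample_documents[-1])
```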

In [6]:
index = minisearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

Fit the index on the documents so that minisearch can search them

In [7]:
index.fit(documents)

<minisearch.Index at 0x785d340f0bc0>

In [8]:
# Search the indexed documents for the user's query
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'machine-learning-zoomcamp'},
        boost_dict=boost,
        num_results=10
    )

    return results
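
The `boost` dictionary above weights matches in `question` three times as heavily as the default and matches in `section` half as heavily. The sketch below illustrates the idea with a deliberately simple word-overlap score. It is an assumption made for illustration only and does not reproduce how minisearch actually scores documents; `toy_score`, `toy_boost`, and the two sample documents are hypothetical.

```python
def toy_score(query, doc, boost):
    """Weighted count of query words found in each text field (illustration only)."""
    words = set(query.lower().split())
    score = 0.0
    for field in ("question", "text", "section"):
        field_words = set(doc.get(field, "").lower().split())
        # Fields missing from the boost dict get a neutral weight of 1.0
        score += boost.get(field, 1.0) * len(words & field_words)
    return score


toy_boost = {"question": 3.0, "section": 0.5}
toy_query = "can i still enroll"

# Same words, but matched in different fields
doc_a = {"question": "can i still enroll", "text": "yes", "section": "general"}
doc_b = {"question": "other topic", "text": "you can still enroll", "section": "general"}

print(toy_score(toy_query, doc_a, toy_boost))  # match in the boosted question field
print(toy_score(toy_query, doc_b, toy_boost))  # match only in the unboosted text field
```

With these weights, a document whose question matches the query outranks one that only mentions the same words in its answer text.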

In [9]:
query = 'the course has already started, can I still enroll?'

In [10]:
results = search(query)
results

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes! We'll cover some linear algebra in the course, but in general, there will be very few formulas, mostly code.\nHere are some interesting videos covering linear algebra that you can already watch: ML Zoomcamp 1.8 - Linear Algebra Refresher from Alexey Grigorev or the excellent playlist from 3Blue1Brown Vectors | Chapter 1, Essence of linear algebra. Never hesitate to ask the community for help if you have any question.\n(Mélanie Foues

In [11]:
from openai import OpenAI

In [12]:
client = OpenAI()

In [13]:
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": query}]
)

response.choices[0].message.content

"Whether you can still enroll in a course that has already started depends on the institution or platform offering the course. Here are some steps you can take to find out:\n\n1. **Check the Course Website:** Look for any information about late enrollment, drop/add periods, or deadlines. Some institutions allow late enrollment within a certain timeframe.\n\n2. **Contact the Instructor:** Reach out directly to the course instructor or professor. They may be able to provide an exception or offer guidance on catching up.\n\n3. **Speak with Student Services:** If you are taking the course at a college or university, contact the registrar's office or student services. They can provide details on the institution's policies.\n\n4. **Online Platforms:** If the course is offered on an online learning platform (such as Coursera, edX, or Udemy), check their specific guidelines. Some platforms allow enrollment at any time, especially if the course is self-paced.\n\n5. **Consider the Course Require

### As you can see, GPT-4o can only give a very general answer because it has no context about this specific course

Let's do some prompt engineering and create a prompt template for GPT-4o

In [14]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()


In [15]:
prompt_template

"You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\nUse only the facts from the CONTEXT when answering the QUESTION.\n\nQUESTION: {question}\n\nCONTEXT: \n{context}"

In [16]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
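
As an alternative sketch, the same context string can be assembled with `str.join`, which avoids repeated string concatenation and leaves no trailing separator to strip. The helper name `format_context` and the toy documents are hypothetical; the field names match the ones used in the loop above.

```python
def format_context(docs):
    """Render each FAQ entry as a section/question/answer block."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in docs
    ]
    # One blank line between entries, none after the last one
    return "\n\n".join(entries)


# Hypothetical entries with the same keys as the FAQ documents
toy_docs = [
    {"section": "S1", "question": "Q1", "text": "A1"},
    {"section": "S2", "question": "Q2", "text": "A2"},
]

print(format_context(toy_docs))
```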

In [17]:
prompt = prompt_template.format(question=query, context=context).strip()

In [18]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: the course has already started, can I still enroll?

CONTEXT: 
section: General course-related questions
question: The course has already started. Can I still join it?
answer: Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.
In order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.

section: General course-related questions
question: I don't know math. Can I take the course?
answer: Yes! We'll cover some linear algebra in the course, but in general, there will be very few formulas, mostly code.
Here are some interesting videos covering linear algebra t

### Now let's send the filled-in prompt to GPT-4o and see the new response

In [19]:
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": prompt}]
)

response.choices[0].message.content

"Yes, you can still enroll in the course even if it has already started. While you may not be able to submit some of the homeworks, you can still participate in the course. To be eligible for a certificate, you must submit 2 out of 3 course projects and review 3 peers' projects by the deadline. Therefore, if you join by the end of November and complete two projects, you will be eligible for a certificate."

### Let's clean up the code a bit and combine the search and prompt steps into a single RAG function

In [20]:
def build_prompt(query, search_results):
    # Keep the template lines flush left: any indentation inside the
    # triple-quoted string would otherwise end up in the prompt itself.
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    # Render every retrieved FAQ entry as a section/question/answer block
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [21]:
def call_llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

In [22]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    response = call_llm(prompt)
    return response

In [23]:
print(rag("'403 Forbidden' error message when you try to push to a GitHub repository"))

To resolve the '403 Forbidden' error message when trying to push to a GitHub repository, you should update the remote URL configuration. Use the following commands:

1. Check the current URL configuration with:
   ```bash
   git config -l | grep url
   ```

2. Ensure the output format is:
   ```plaintext
   remote.origin.url=https://github.com/github-username/github-repository-name.git
   ```

3. Change it to the following format to include the username:
   ```bash
   git remote set-url origin "https://github-username@github.com/github-username/github-repository-name.git"
   ```

Make sure to verify the change using the command in step 1. This adjustment often resolves the '403 Forbidden' error.


# Elasticsearch

In [24]:
from elasticsearch import Elasticsearch

In [25]:
es_client = Elasticsearch('http://localhost:9200')

In [31]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

In [42]:
index_name = "course-questions"

# Run this once; creating an index that already exists raises an error.
# es_client.indices.create(index=index_name, body=index_settings)

In [33]:
from tqdm.auto import tqdm

In [43]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
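
Calling `es_client.index` without an `id` lets Elasticsearch generate a new ID each time, so re-running the loop above adds every document again. That is one plausible reason the same FAQ entry shows up twice in the search output below. A minimal sketch of one way to avoid this, assuming a content hash is an acceptable document ID; `doc_id` is a hypothetical helper, not part of the notebook or of the Elasticsearch client.

```python
import hashlib
import json


def doc_id(doc):
    """Stable ID derived from the document's content (hypothetical helper)."""
    # sort_keys makes the ID independent of key order in the dict
    payload = json.dumps(doc, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


sample = {
    "text": "Yes, you can.",
    "section": "General course-related questions",
    "question": "The course has already started. Can I still join it?",
    "course": "machine-learning-zoomcamp",
}
print(doc_id(sample))

# In the notebook this could be used as follows, so that re-indexing
# overwrites existing documents instead of duplicating them:
# for doc in tqdm(documents):
#     es_client.index(index=index_name, id=doc_id(doc), document=doc)
```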

In [35]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs
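
The course filter is hard-coded inside `elastic_search`, and it targets a different course than the minisearch example earlier. A small sketch of how the query body could be built with the course and result count as parameters; `build_search_query` is a hypothetical helper, and the body it returns has the same structure as `search_query` above.

```python
def build_search_query(query, course, size=5):
    """Build the Elasticsearch request body for a boosted, course-filtered search."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match; question is weighted three times as heavily
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact match on the keyword field; does not affect scoring
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_query("can I still join?", "machine-learning-zoomcamp")
print(body["query"]["bool"]["filter"])
```

The result can be passed to the client the same way as before, for example `es_client.search(index=index_name, body=body)`.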

In [36]:
elastic_search("i JUST DISCOVER THE course, can I still join it?")

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the 

In [39]:
def rag_elastic_search(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = call_llm(prompt)
    return answer
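
`rag` and `rag_elastic_search` differ only in which search function they call. One possible refactoring, sketched here under the assumption that each step is passed in as a function, also makes the flow testable without Elasticsearch or an OpenAI key. `rag_pipeline` and the three `fake_*` stand-ins are hypothetical names.

```python
def rag_pipeline(query, search_fn, prompt_fn, llm_fn):
    """Retrieve, build the prompt, then generate; each step is injected."""
    search_results = search_fn(query)
    prompt = prompt_fn(query, search_results)
    return llm_fn(prompt)


# Hypothetical stand-ins so the flow can be exercised offline
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes."}]


def fake_prompt(query, results):
    context = "\n\n".join(r["text"] for r in results)
    return f"QUESTION: {query}\n\nCONTEXT:\n{context}"


def fake_llm(prompt):
    # Reports the prompt length instead of calling a model
    return f"(stub answer for a {len(prompt)}-character prompt)"


answer = rag_pipeline("can I still enroll?", fake_search, fake_prompt, fake_llm)
print(answer)
```

In the notebook, `rag_pipeline(query, search, build_prompt, call_llm)` would follow the same steps as `rag(query)`, and passing `elastic_search` instead of `search` would switch the retrieval backend.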



In [41]:
print(rag_elastic_search("i JUST DISCOVER THE course, can I still join it?"))

Yes, you can still join the course even if it has already started. You are eligible to submit the homework assignments. However, please be aware that there are deadlines for submitting the final projects, so it is important not to leave everything until the last minute.
