In [4]:
#!pip install minsearch

Collecting minsearch
  Downloading minsearch-0.0.2-py3-none-any.whl.metadata (3.5 kB)
Downloading minsearch-0.0.2-py3-none-any.whl (4.1 kB)
Installing collected packages: minsearch
Successfully installed minsearch-0.0.2


In [6]:
import minsearch

In [7]:
import json

In [8]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [10]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
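The flattening step can be checked on a small in-memory sample. The sketch below is illustrative only: `docs_raw_sample`, `flatten`, and the course names are hypothetical, chosen to mirror the shape of `documents.json`. Unlike the cell above, it copies each entry instead of mutating the input.

```python
# Hypothetical two-course sample in the same shape as documents.json.
docs_raw_sample = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Q1"},
            {"text": "Answer 2", "section": "General", "question": "Q2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Q3"},
        ],
    },
]


def flatten(docs_raw):
    """Return one flat list of FAQ entries, each tagged with its course."""
    flat = []
    for course_dict in docs_raw:
        for doc in course_dict["documents"]:
            # Copy the entry and attach the course name, leaving the input untouched.
            flat.append({**doc, "course": course_dict["course"]})
    return flat


flat_docs = flatten(docs_raw_sample)
print(len(flat_docs))
print(flat_docs[2]["course"])
```

Every entry now carries its own `course` key, which is what allows the index to filter by course later.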

In [11]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [12]:
index = minsearch.Index(
    text_fields = ["question", "text", "section"],
    keyword_fields = ["course"]
)

In [13]:
# Idea of the index: keyword_fields behave like an exact-match SQL filter, e.g.
# SELECT * FROM documents WHERE course = 'data-engineering-zoomcamp';
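The SQL analogy can be written out in plain Python. This is a minimal sketch of exact-match keyword filtering, not the code `minsearch` runs internally; `filter_by_keywords` and `sample_docs` are hypothetical names introduced for illustration.

```python
def filter_by_keywords(docs, filter_dict):
    """Keep only documents whose fields equal every value in filter_dict."""
    return [
        doc
        for doc in docs
        if all(doc.get(field) == value for field, value in filter_dict.items())
    ]


# Hypothetical sample documents with a keyword field named "course".
sample_docs = [
    {"question": "Q1", "course": "data-engineering-zoomcamp"},
    {"question": "Q2", "course": "machine-learning-zoomcamp"},
    {"question": "Q3", "course": "data-engineering-zoomcamp"},
]

matches = filter_by_keywords(sample_docs, {"course": "data-engineering-zoomcamp"})
print([doc["question"] for doc in matches])
```

Keyword fields are compared for equality only. Relevance ranking is done on the text fields.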

In [14]:
q = 'the course has already started, can I still enroll?'

In [15]:
index.fit(documents)

<minsearch.minsearch.Index at 0x71ea6169faa0>

In [16]:
boost = {'question': 3.0, 'section' : 0.5}

index.search(
    query = q,
    boost_dict = boost,
    num_results = 6
)

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the cour

In [17]:
# filtered_results
results = index.search(
    query = q,
    filter_dict = {'course': 'data-engineering-zoomcamp'},
    boost_dict = boost,
    num_results = 6
)

In [18]:
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202

In [19]:
from openai import OpenAI

In [23]:
client = OpenAI()

In [24]:
response = client.chat.completions.create(
                model = 'gpt-4o',
                messages = [{"role": "user", "content":q}]
                 )

In [25]:
response

ChatCompletion(id='chatcmpl-Bj5okshPcUAsuVuUT459FuGgWhaND', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Whether you can still enroll in a course that has already started depends on the policies of the institution offering the course. Here are a few steps you can take to find out:\n\n1. **Check the Course Website**: Often, course websites or portals provide detailed information about enrollment deadlines and late registration policies.\n\n2. **Contact the Instructor**: Reach out directly to the course instructor. They might be able to give permission for you to join late if the course structure allows it.\n\n3. **Talk to the Admissions Office**: The admissions or registrar's office of the institution can provide information regarding late enrollment policies and any potential fees or penalties.\n\n4. **Consider Auditing**: If enrolling for credit is not possible, some institutions allow students to audit a course, attending classe

In [26]:
response.choices[0].message.content

"Whether you can still enroll in a course that has already started depends on the policies of the institution offering the course. Here are a few steps you can take to find out:\n\n1. **Check the Course Website**: Often, course websites or portals provide detailed information about enrollment deadlines and late registration policies.\n\n2. **Contact the Instructor**: Reach out directly to the course instructor. They might be able to give permission for you to join late if the course structure allows it.\n\n3. **Talk to the Admissions Office**: The admissions or registrar's office of the institution can provide information regarding late enrollment policies and any potential fees or penalties.\n\n4. **Consider Auditing**: If enrolling for credit is not possible, some institutions allow students to audit a course, attending classes without receiving credit or a grade.\n\n5. **Online Programs**: For online courses, especially those provided by platforms like Coursera, edX, or Udemy, late 

In [27]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE

QUESTION: {question}

CONTEXT:
{context}
""".strip()

In [28]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [29]:
print(context)

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related questions
question: Course - When will the course start?
answer: The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start wit

In [30]:
prompt = prompt_template.format(question=q, context = context).strip()

In [31]:
response = client.chat.completions.create(
                model = 'gpt-4o',
                messages = [{"role": "user", "content":prompt}]
                 )

response.choices[0].message.content

'Yes, even if the course has already started, you can still enroll and submit the homeworks. However, be mindful of the deadlines for submitting the final projects.'

### 1.5 - The RAG Flow: Cleaning and Modularizing Code

In [32]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query = query,
        filter_dict = {'course': 'data-engineering-zoomcamp'},
        boost_dict = boost,
        num_results = 5
    )

    return results

- The `search` function queries the `minsearch` index built earlier and returns the documents most relevant to the query.
- The `query` argument is the search string entered by the user.
- The `boost` dictionary sets the relative importance of the text fields: matches in `question` are weighted 3.0 and matches in `section` 0.5.
- `filter_dict` restricts the results to documents whose `course` field equals `'data-engineering-zoomcamp'`.
- The function returns a list of up to five dictionaries (`num_results = 5`), one per matching document.
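To make the effect of boosting concrete, here is a rough sketch that ranks documents by boosted word overlap. It is an assumption-laden stand-in, not how `minsearch` scores documents: plain word overlap replaces the library's own relevance measure, and `tokenize`, `boosted_search`, and `toy_docs` are hypothetical names.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boosted_search(docs, query, text_fields, boost_dict=None,
                   filter_dict=None, num_results=5):
    """Rank docs by boosted word overlap with the query (illustrative only)."""
    boost_dict = boost_dict or {}
    filter_dict = filter_dict or {}
    query_terms = tokenize(query)
    scored = []
    for doc in docs:
        # Keyword filter: skip documents that fail any exact-match condition.
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        score = 0.0
        for field in text_fields:
            overlap = len(query_terms & tokenize(doc.get(field, "")))
            # Fields missing from boost_dict get a weight of 1.0 in this sketch.
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


toy_docs = [
    {"question": "How do I install Docker?",
     "text": "You can still join the course later.",
     "section": "General", "course": "de"},
    {"question": "Can I still join the course?",
     "text": "Yes.", "section": "General", "course": "de"},
    {"question": "Can I still join the course?",
     "text": "Yes.", "section": "General", "course": "ml"},
]

toy_results = boosted_search(
    toy_docs,
    query="can I still join",
    text_fields=["question", "text", "section"],
    boost_dict={"question": 3.0, "section": 0.5},
    filter_dict={"course": "de"},
)
print([doc["question"] for doc in toy_results])
```

With the `question` boost, the entry whose question matches the query ranks above the entry that matches only in its answer text, and the `ml` entry is excluded by the filter.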

In [33]:
def build_prompt(query, search_results):
    prompt_template = """
                        You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
                        Use only the facts from the CONTEXT when answering the QUESTION.
                        
                        QUESTION: {question}
                        
                        CONTEXT: 
                        {context}
                        """.strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

- The `build_prompt` function constructs a prompt that asks a language model (such as GPT-4o) to answer the user’s question using information from a set of search results.
- The template embeds both the user’s question and the retrieved context, and instructs the LLM to use only the facts in that context.
- `search_results` is the list of retrieved documents relevant to the query (for example, the return value of `search(query)`).
- To build the context, we initialize an empty string and iterate through `search_results`; for each document, we append a formatted entry containing its `section`, `question`, and `answer` (the document's `text` field).
- Finally, we fill in the template using `prompt_template.format()`.
- Because the template lines are indented inside the function and `.strip()` only trims the two ends of the string, that indentation is carried into the prompt.
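One way to keep the template readable in the source while sending unindented text to the model is `textwrap.dedent`. The variant below is a sketch, not the notebook's own function: `PROMPT_TEMPLATE` and `build_prompt_dedented` are hypothetical names, and joining the entries with `"\n\n".join(...)` is a stylistic substitution for the string concatenation used above.

```python
import textwrap

# dedent removes the common leading whitespace before the template is filled in.
PROMPT_TEMPLATE = textwrap.dedent("""
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.

    QUESTION: {question}

    CONTEXT:
    {context}
""").strip()


def build_prompt_dedented(query, search_results):
    """Format the search results as context and fill in the template."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    # Blank line between entries, no trailing separator to strip afterwards.
    context = "\n\n".join(entries)
    return PROMPT_TEMPLATE.format(question=query, context=context)


example_prompt = build_prompt_dedented(
    "how do I run kafka?",
    [{"section": "S", "question": "Q", "text": "T"}],
)
print(example_prompt)
```

Dedenting before calling `format` matters: if the context were inserted first, multi-line answers would change the common indentation and `dedent` would no longer remove it.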

In [34]:

def llm(prompt):
    response = client.chat.completions.create(
        model = 'gpt-4o',
        messages = [{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

- The `llm` function sends a prompt to a Large Language Model (here, OpenAI’s GPT-4o) and returns the generated text.
- The LLM reads the prompt, including the context, and generates a response. Because the prompt tells it to use only the provided context, the response is expected to be grounded in that context rather than invented from scratch.
- `client` is the `OpenAI()` instance created earlier from the `openai` Python library.
- `.chat.completions.create` is the method used to call OpenAI’s chat-based models (like GPT-4o).
- `response` is the object returned by the API call. It contains a list of completions (`choices`); the function returns the text of the first one, `response.choices[0].message.content`.
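The attribute path `response.choices[0].message.content` can be exercised without an API key by passing the client in as a parameter and substituting a stand-in. In this sketch, `FakeChatClient` and `llm_with_client` are hypothetical; the fake imitates only the attributes the notebook reads, not the full OpenAI response object.

```python
from types import SimpleNamespace


class FakeChatClient:
    """Hypothetical stand-in for the OpenAI client; returns a canned reply."""

    def __init__(self, reply):
        self._reply = reply
        # Mirror the attribute path client.chat.completions.create(...).
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages):
        # Build only the fields read by llm_with_client below.
        message = SimpleNamespace(role="assistant", content=self._reply)
        choice = SimpleNamespace(index=0, finish_reason="stop", message=message)
        return SimpleNamespace(choices=[choice])


def llm_with_client(client, prompt, model="gpt-4o"):
    """Same call as llm(), with the client passed in instead of a global."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


fake_client = FakeChatClient("stub answer")
print(llm_with_client(fake_client, "any prompt"))
```

Passing the real `OpenAI()` instance as `client` should behave like the `llm` function above, since the call and the attribute access are the same.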

### RAG workflow

In [35]:
query = 'how do I run kafka?'

In summary, we answer a user’s question in three steps:
- Retrieve relevant information from a database or knowledge base.
- Build a context-rich prompt.
- Pass that prompt to a language model (LLM) for answer generation.

In [42]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
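The same three steps can also be wired together with the step functions passed in as arguments, which makes the flow testable without an index or an API call. This is a sketch under that assumption: `rag_pipeline` and the `stub_*` functions are hypothetical and only imitate the shapes of `search`, `build_prompt`, and `llm`.

```python
def rag_pipeline(query, search_fn, prompt_fn, llm_fn):
    """Run retrieve -> build prompt -> generate with injected step functions."""
    search_results = search_fn(query)
    prompt = prompt_fn(query, search_results)
    return llm_fn(prompt)


def stub_search(query):
    # Pretend retrieval: always return one fixed FAQ entry.
    return [{"section": "S", "question": "Q", "text": "Kafka runs in Docker."}]


def stub_prompt(query, search_results):
    # Minimal prompt: the question followed by the retrieved answers.
    context = "\n".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT:\n{context}"


def stub_llm(prompt):
    # Pretend generation: echo the prompt so the data flow is visible.
    return f"ANSWER BASED ON:\n{prompt}"


stub_answer = rag_pipeline("how do I run kafka?", stub_search, stub_prompt, stub_llm)
print(stub_answer)
```

Swapping in the notebook's own functions, `rag_pipeline(query, search, build_prompt, llm)`, should give the same behavior as `rag(query)`.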

In [43]:
rag(query)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
                        Use only the facts from the CONTEXT when answering the QUESTION.

                        QUESTION: how do I run kafka?

                        CONTEXT: 
                        section: Module 6: streaming with kafka
question: Java Kafka: How to run producer/consumer/kstreams/etc in terminal
answer: In the project directory, run:
java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java

section: Module 6: streaming with kafka
question: Module “kafka” not found when trying to run producer.py
answer: Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.
To create a virtual env and install packages (run only once)
python -m venv env
source env/bin/activate
pip install -r ../requirements.txt
To activate it (you'll need to run it every time you need the virtual env)

"To run Kafka, if you're using Java, navigate to the project directory and execute the following command in the terminal:\n\n```\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n```\n\nMake sure to replace `<jar_name>` with the actual name of your jar file. If you're using Python, consider setting up a virtual environment and ensuring your Docker images are running, but specific terminal commands weren't provided in the context for this scenario."

In [37]:
rag('the course has already started, can I still enroll?')

'Yes, you can still enroll in the course after it has started. You are eligible to submit the homework, but be mindful of the deadlines for turning in the final projects, so try not to leave everything to the last minute.'

## 1.6 Search with Elasticsearch

We want to transition from a toy search engine to Elasticsearch for better search results.