## configuring environment

In [1]:
# import libraries

from groq import Groq
import os 

In [3]:
# create the client by instantiating the Groq class

client = Groq(api_key=os.getenv("GROQ_API_KEY"))
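`os.getenv` returns `None` when the variable is unset, so a missing key is easier to diagnose if it is checked explicitly before the client is created. A minimal sketch; `require_env` is a hypothetical helper, not part of the Groq SDK:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (commented out so the sketch runs without a key):
# client = Groq(api_key=require_env("GROQ_API_KEY"))
```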

In [4]:
# send a query to the model

response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Is it too late to join the course?",
        }
    ],
    model="llama3-8b-8192",
)
  

In [5]:
# print the response

print(response.choices[0].message.content)

I'm happy to help you with your question! However, I need a bit more information to provide a helpful response.

Could you please tell me more about the course you're interested in joining? What is the course about, and when was it originally scheduled to start? Additionally, what resources have you used to learn about the course, and what makes you think it might be too late to join?

With this information, I'll do my best to provide guidance on whether it's still possible to join the course, and what steps you can take to do so.


## retrieval and search

In [6]:
!curl -O https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py



In [7]:
import minsearch

In [8]:
# load the raw FAQ data

import json

with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [9]:
# preparing the documents

documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

In [10]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [11]:
# index the documents with the minsearch library

# the keyword field works like a SQL filter:
# SELECT * WHERE course = 'data-engineering-zoomcamp';

index = minsearch.Index(
    text_fields=["question", "text", "section"],  # text fields to search in
    keyword_fields=["course"]   # field used for filtering
)

In [13]:
q = 'the course has already started, can I still enroll?'

In [14]:
# fit the index on the documents before querying it (analogous to training)

index.fit(documents)

<minsearch.Index at 0x1e0bfefc3a0>

In [15]:
# specify how much weight each text field gets
boost = {"question": 3.0, "section": 0.5}  # question: 3.0, section: 0.5, text keeps the default 1.0

result = index.search(
    filter_dict={'course': 'data-engineering-zoomcamp'},   # restrict results to the data-engineering course
    query=q,
    boost_dict=boost,
    num_results=5
)

In [16]:
result

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202
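To make the effect of `boost_dict` concrete, here is a toy scorer that counts query-word overlap per field and multiplies it by that field's weight. It is an illustration only: `toy_score` is a hypothetical function, and minsearch's real scoring is different; the point is just how per-field weights shift a document's total.

```python
def toy_score(query, doc, text_fields, boost):
    """Sum, over text fields, of (field weight) x (number of shared words)."""
    query_words = set(query.lower().split())
    score = 0.0
    for field in text_fields:
        field_words = set(doc[field].lower().split())
        overlap = len(query_words & field_words)
        # Fields missing from `boost` keep the default weight of 1.0.
        score += boost.get(field, 1.0) * overlap
    return score


doc = {
    "question": "can I still join the course",
    "text": "yes you can join",
    "section": "general questions",
}
text_fields = ["question", "text", "section"]

unboosted = toy_score("can I join", doc, text_fields, boost={})
boosted = toy_score("can I join", doc, text_fields, boost={"question": 3.0, "section": 0.5})
print(unboosted, boosted)  # prints: 5.0 11.0
```

With the boost, the three matches in `question` count three times as much, so documents whose question resembles the query rise in the ranking.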

## generating answers with LLMs

In [18]:
from groq import Groq

client = Groq(api_key=os.getenv("GROQ_API_KEY"))

response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": q,
        }
    ],
    model="llama3-8b-8192",
)


In [19]:
print(response.choices[0].message.content)  # the model has no context here, so it can only give a generic answer

Whether you can still enroll in a course that has already started depends on several factors, which I'll outline below:

1. **Course policy**: Check the course website, syllabus, or contact the instructor to see if the course has an official policy on late enrollment. Some courses may allow late enrollment, while others may not.
2. **Deadline for enrollment**: If you're trying to enroll after the original deadline, it's unlikely that you'll be allowed to join the course. Most courses have a cutoff date for enrollment, and once that deadline passes, the instructor or program may not be able to accommodate new students.
3. **Prerequisites and availability**: If the course is already full or has specific prerequisites, it may be more challenging to enroll. The instructor or program may have limited seats available or require you to meet specific requirements before joining the course.
4. **Instructor approval**: In some cases, the instructor may consider granting permission for late enrol

In [20]:
# building a prompt template

prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

In [21]:
context = ""

for doc in result:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

print(context)

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related questions
question: Course - When will the course start?
answer: The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start wit

In [22]:
prompt = prompt_template.format(question=q, context=context).strip()

In [23]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: the course has already started, can I still enroll?

CONTEXT:
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related q

In [24]:
# getting the answer

response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama3-8b-8192",
)

print(response.choices[0].message.content)

Based on the context, the QUESTION is: the course has already started, can I still enroll?

The answer is: Yes, even if you don't register, you're still eligible to submit the homeworks.


## The RAG Flow: Cleaning and Modularizing Code

In [25]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results

In [26]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [27]:
def llm(prompt):
    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="llama3-8b-8192",
    )

    return response.choices[0].message.content

In [28]:
query = 'how do I run kafka?'

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [29]:
rag(query)

'QUESTION: How do I run Kafka?\n\nCONTEXT:\n\nFrom the provided FAQ database, we can find the answer to this question in two different sections: Module 6: Streaming with Kafka.\n\nFor running a Java Kafka producer/consumer, it says:\n\n"In the project directory, run:\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java"\n\nFor running a Python Kafka, it says:\n\n-To fix the error "ModuleNotFoundError: No module named \'kafka.vendor.six.moves\'", use kafka-python-ng instead:\nUse pip install kafka-python-ng instead"\n\nAlso, for running a Python Kafka file, it says:\n\n"create a virtual environment and run requirements.txt and the python files in that environment." (solution from Alexey)\n\n-To create a virtual env and install packages:\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\n\n-To deactivate the virtual environment:\ndeactivate"\n\n-To ensure the necessary dependencies to run the code, ensure that the

In [30]:
rag('the course has already started, can I still enroll?')

'Based on the CONTEXT, since the course has already started, I can refer to the answer from the FAQ database.\n\nQUESTION: Can I still enroll?\n\nAnswer: According to the FAQ, "Yes, even if you don\'t register, you\'re still eligible to submit the homeworks."'
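Because `rag` only chains three steps, the steps can also be passed in as arguments, which makes it possible to swap the search backend or test the flow without calling an API. A minimal sketch under that assumption; `make_rag` and the `fake_*` stand-ins are hypothetical names and do not call minsearch or Groq:

```python
def make_rag(search_fn, prompt_fn, llm_fn):
    """Compose retrieval, prompt building and generation into one callable."""
    def rag(query):
        search_results = search_fn(query)
        prompt = prompt_fn(query, search_results)
        return llm_fn(prompt)
    return rag


# Stand-ins for illustration only.
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes."}]


def fake_prompt(query, search_results):
    context = "\n".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT: {context}"


def fake_llm(prompt):
    # A real implementation would send the prompt to a model.
    return prompt.upper()


demo_rag = make_rag(fake_search, fake_prompt, fake_llm)
print(demo_rag("can I enroll?"))
```

With the notebook's own functions, the equivalent call would be `make_rag(search, build_prompt, llm)`.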

## Search with Elasticsearch

##### docker run -it --rm --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.4.3

In [32]:
documents[0]  # next, we index these documents with Elasticsearch

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [33]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")

In [34]:
es_client.info()

ObjectApiResponse({'name': '76a83ed3d80a', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'tDUv-5yWSGS6MPHc0Vg7Hw', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [35]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
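Re-running the index-creation cell against a cluster that already contains `course-questions` typically fails because the index exists. One way to make the cell repeatable is to delete the index first. This is a sketch, assuming it is acceptable to lose everything already indexed; `recreate_index` is a hypothetical helper:

```python
def recreate_index(es_client, index_name, index_settings):
    """Drop the index if it exists, then create it again from scratch."""
    # ignore_unavailable=True makes the delete a no-op when the index is missing.
    es_client.indices.delete(index=index_name, ignore_unavailable=True)
    # Same create call as in the notebook cell above.
    es_client.indices.create(index=index_name, body=index_settings)


# Usage (needs a running Elasticsearch, so it is left commented out):
# recreate_index(es_client, index_name, index_settings)
```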

In [36]:
from tqdm.auto import tqdm


In [37]:
# indexing data

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [07:30<00:00,  2.10it/s]
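Indexing one document per request took about seven and a half minutes for 948 documents in the run above. The Elasticsearch Python client also ships a `bulk` helper that sends many documents per request, which is usually faster. A minimal sketch; `generate_actions` and `bulk_index` are hypothetical names:

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, documents, index_name):
    """Index all documents with the bulk helper; returns (success_count, errors)."""
    # Imported here so generate_actions can be used without the package installed.
    from elasticsearch.helpers import bulk
    return bulk(es_client, generate_actions(documents, index_name))


# Usage (needs a running Elasticsearch, so it is left commented out):
# bulk_index(es_client, documents, index_name)
```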


In [38]:
query = 'I just discovered the course. Can I still join it?'

In [39]:
# querying data

def elastic_search(query):
    search_query = {
        "size": 5,         # return at most 5 hits
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],  # question is boosted 3x
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [40]:
elastic_search(query)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (insta
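The course name and result count are hard-coded in `elastic_search`. Building the query body in a separate pure function makes both configurable and lets the body be inspected without a running cluster. A sketch; `build_search_query` and its parameters are hypothetical names:

```python
def build_search_query(query, course, size=5):
    """Return the Elasticsearch request body for a filtered, boosted search."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match; the question field counts three times as much.
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact-match filter on the keyword field.
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_query("Can I still join?", "data-engineering-zoomcamp")
print(body["query"]["bool"]["filter"])
```

The returned dictionary can be passed to `es_client.search(index=index_name, body=body)` in the same way as `search_query` above.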

In [41]:
def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [42]:
rag(query)

'Based on the context, the answer to the question "I just discovered the course. Can I still join it?" is:\n\nYes, you can still join the course after the start date. Even if you don\'t register, you\'re still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don\'t leave everything for the last minute.'