### RAG

In [1]:
import requests 
import json

from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI()

In [2]:


docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [3]:
from minsearch import AppendableIndex

index = AppendableIndex(
    text_fields = ["question", "text", "section"],
    keyword_fields = ["course"]
)

index.fit(documents)

<minsearch.append.AppendableIndex at 0x712b10744a40>

In [3]:
index.search("how to use kafka with spark")

[{'text': 'While following tutorial 13.2 , when running ./spark-submit.sh streaming.py, encountered the following error:\n…\n24/03/11 09:48:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...\n24/03/11 09:48:36 INFO TransportClientFactory: Successfully created connection to localhost/127.0.0.1:7077 after 10 ms (0 ms spent in bootstraps)\n24/03/11 09:48:54 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors\n24/03/11 09:48:56 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077…\n24/03/11 09:49:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...\n24/03/11 09:49:36 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.\n24/03/11 09:49:36 ERROR StandaloneSchedulerBacke

In [29]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5,
        output_ids=True
    )

    return results

In [5]:
question = "how to use kafka with spark"

In [30]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()

def build_prompt(query, search_results):
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [31]:
search_results = search(question)

In [32]:
prompt = build_prompt(question, search_results)

In [33]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

<QUESTION>
how do I do well in module 1?
</QUESTION>

<CONTEXT>
section: Module 5: pyspark
question: Module Not Found Error in Jupyter Notebook .
answer: Even after installing pyspark correctly on linux machine (VM ) as per course instructions, faced a module not found error in jupyter notebook .
The solution which worked for me(use following in jupyter notebook) :
!pip install findspark
import findspark
findspark.init()
Thereafter , import pyspark and create spark contex<<t as usual
None of the solutions above worked for me till I ran !pip3 install pyspark instead !pip install pyspark.
Filter based on conditions based on multiple columns
from pyspark.sql.functions import col
new_final.filter((new_final.a_zone=="Murray Hill") & (new_final.b_zone=="Midwood")).show()
Krishna Anand

section: Module 5: pyspark
question: Py4JJa

In [34]:
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI()

def llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [35]:
answer = llm(prompt)

In [36]:
print(answer)

To do well in Module 1, focus on the following points:

1. **Ensure All Modules Are Installed**: Make sure you have all the required modules installed. Specifically, if you encounter a `ModuleNotFoundError` related to `psycopg2`, you will need to install it using either Conda or pip.

2. **Use the Correct Connection String**: When working with SQLAlchemy, be aware of the connection string format. Instead of using `create_engine('postgresql://root:root@localhost:5432/ny_taxi')`, use `conn_string = "postgresql+psycopg://root:root@localhost:5432/ny_taxi"` and then create the engine.

By following these steps, you can effectively navigate the challenges presented in Module 1.


In [13]:
rag(question)

'To use Kafka with Spark, you can follow these general steps:\n\n1. **Set Up Kafka and Spark**: Ensure that you have Kafka and Spark running, typically in Docker containers. You can confirm their status using `docker ps` to see if the necessary containers are operational.\n\n2. **Check Kafka Broker**: If you encounter the error `kafka.errors.NoBrokersAvailable`, it indicates that your Kafka broker might not be working. You can start the broker by navigating to the folder containing your `docker-compose.yaml` file and running `docker compose up -d`.\n\n3. **Submit Spark Job**: Use the command `./spark-submit.sh streaming.py` to run your Spark job. Ensure that the versions of PySpark on your local machine and any Docker images being used are compatible to avoid connection issues (e.g., if there are errors about the Spark master being unresponsive, it may be caused by version mismatches).\n\n4. **Log Checking**: If you encounter issues with Spark master connection, open a new terminal and

#### 'Agentic' RAG

In [14]:
prompt_template = """
You are a course teaching assistant.

You are given a Question from a course student and that you need to answer with your own knowledge and provided CONTEXT.
At the beginning the context is EMPTY.

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>

If CONTEXT is EMPTY, you can use your FAQ database.
In this case, use the following output template:

{{
"action": "SEARCH",
"reasoning": "<add your reasoning here>"
}}

If you can answer the QUESTION using CONTEXT, use this template:

{{
"action": "ANSWER",
"answer": "<your answer>",
"source" : "CONTEXT"
}}

If the context doesn't contain the answer, use your own knowledge to answer the question

{{
"action" : "ANSWER",
"answer" : "<your answer>",
"source" : "OWN_KNOWLEDGE"
}}
""".strip()

In [15]:
question = "Can i still join the course?"
context = "EMPTY"

In [16]:
prompt = prompt_template.format(question=question, context=context)
print(prompt)

You are a course teaching assistant.

You are given a Question from a course student and that you need to answer with your own knowledge and provided CONTEXT.
At the beginning the context is EMPTY.

<QUESTION>
Can i still join the course?
</QUESTION>

<CONTEXT>
EMPTY
</CONTEXT>

If CONTEXT is EMPTY, you can use your FAQ database.
In this case, use the following output template:

{
"action": "SEARCH",
"reasoning": "<add your reasoning here>"
}

If you can answer the QUESTION using CONTEXT, use this template:

{
"action": "ANSWER",
"answer": "<your answer>",
"source" : "CONTEXT"
}

If the context doesn't contain the answer, use your own knowledge to answer the question

{
"action" : "ANSWER",
"answer" : "<your answer>",
"source" : "OWN_KNOWLEDGE"
}


In [17]:
answer_json = llm(prompt)

In [18]:
import json

In [19]:
answer = json.loads(answer_json)

In [20]:
answer["action"]

'SEARCH'
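
`json.loads` works here because the model returned bare JSON. Models sometimes wrap the object in a Markdown code fence or add text around it, and then `json.loads` raises `JSONDecodeError`. Below is a minimal sketch of a more tolerant parser. `parse_llm_json` is a hypothetical helper, not part of the workshop code, and its fallback (take the text between the first `{` and the last `}`) is an assumption that only suits replies containing a single JSON object.

```python
import json


def parse_llm_json(raw: str) -> dict:
    """Parse a JSON object from an LLM reply (hypothetical helper).

    Tries the raw text first, then falls back to the substring between
    the first '{' and the last '}' to skip code fences or stray prose.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start = raw.find("{")
        end = raw.rfind("}")
        if start == -1 or end <= start:
            raise  # no JSON object found; re-raise the original error
        return json.loads(raw[start : end + 1])


# Made-up reply that wraps the JSON in a Markdown code fence.
fence = "`" * 3
fenced_reply = fence + 'json\n{"action": "SEARCH", "reasoning": "context is empty"}\n' + fence

print(parse_llm_json(fenced_reply)["action"])
```

In the cells that follow, a helper like this could take the place of the direct `json.loads(answer_json)` calls.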

In [38]:
def build_context(search_results):
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    return context.strip()

In [39]:
search_results = search(question)
context = build_context(search_results)
prompt = prompt_template.format(question=question, context=context)

In [23]:
answer_json = llm(prompt)

In [24]:
print(answer_json)

{
"action": "ANSWER",
"answer": "Yes, you can still join the course after the start date, even if you haven't officially registered. You are eligible to submit homework assignments, but keep in mind that there will be deadlines for the final projects that you should not overlook.",
"source" : "CONTEXT"
}


In [40]:
def agentic_rag_v1(question):
    context = "EMPTY"
    prompt = prompt_template.format(question=question, context=context)
    answer_json = llm(prompt)
    answer = json.loads(answer_json)
    print(answer)

    if answer["action"]=="SEARCH":
        print("need to perform search...")
        search_results = search(question)
        context = build_context(search_results)

        prompt = prompt_template.format(question=question, context=context)
        answer_json = llm(prompt)
        answer = json.loads(answer_json)
        print(answer)

    return answer

In [26]:
%%time
agentic_rag_v1("how do i join the course?")

{'action': 'ANSWER', 'answer': "To join the course, you typically need to enroll through the course website or platform where it is hosted. Look for a button or link that says 'Enroll Now' or 'Join the Course'. You may also need to create an account if you haven't already. If there are prerequisites or specific requirements, make sure to complete those as well.", 'source': 'OWN_KNOWLEDGE'}
CPU times: user 3.46 ms, sys: 953 μs, total: 4.42 ms
Wall time: 2.27 s


{'action': 'ANSWER',
 'answer': "To join the course, you typically need to enroll through the course website or platform where it is hosted. Look for a button or link that says 'Enroll Now' or 'Join the Course'. You may also need to create an account if you haven't already. If there are prerequisites or specific requirements, make sure to complete those as well.",
 'source': 'OWN_KNOWLEDGE'}

#### Agentic search

In [41]:
def dedub(seq):
    """
    Deduplicate search results by '_id', keeping the first occurrence of each.
    """
    seen = set()
    result = []
    for el in seq:
        _id = el['_id']
        if _id in seen:
            continue
        seen.add(_id)
        result.append(el)
    return result
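
To see what the function does, the sketch below runs the same first-occurrence-wins logic on hand-made records. The records and the name `dedup_by_id` are made up for illustration. In the notebook the `_id` field comes from calling `index.search` with `output_ids=True`.

```python
def dedup_by_id(seq):
    """Keep the first record for each '_id', preserving input order."""
    seen = set()
    result = []
    for el in seq:
        if el["_id"] in seen:
            continue
        seen.add(el["_id"])
        result.append(el)
    return result


# Made-up records: two searches returned overlapping documents.
results = [
    {"_id": 3, "question": "How do I install Docker?"},
    {"_id": 7, "question": "Terraform init fails"},
    {"_id": 3, "question": "How do I install Docker?"},
]
unique = dedup_by_id(results)
print([doc["_id"] for doc in unique])  # [3, 7]
```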

In [42]:
prompt_template = """
You are a course teaching assistant.

You are given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic.

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than {max_iterations} iterations for a given student question.
The current iteration number: {iteration_number}. If we exceed the allowed number
of iterations, give the best possible answer with the provided information.

Output templates:

If you want to perform search, use this template:

{{
"action" : "SEARCH",
"reasoning" : "<add your reasoning here>",
"keywords" : ["search query 1", "search query 2", ...]
}}

If you can answer the QUESTION using CONTEXT, use this template:

{{
"action" : "ANSWER_CONTEXT",
"answer" : "<your answer>",
"source" :"CONTEXT"
}}

If the context doesn't contain the answer, use your own knowledge to answer the question

{{
"action" : "ANSWER",
"answer" : "<your answer>",
"source" : "OWN_KNOWLEDGE"
}}

<QUESTION>
{question}
</QUESTION>

<SEARCH_QUERIES>
{search_queries}
</SEARCH_QUERIES>

<CONTEXT>
{context}
</CONTEXT>

<PREVIOUS_ACTIONS>
{previous_actions}
</PREVIOUS_ACTIONS>
""".strip()

In [43]:
question = "how do I do well on module 1"

max_iterations = 3
iteration_number = 0
search_queries = []
search_results = []
previous_actions = []

In [44]:
context = build_context(search_results)

prompt = prompt_template.format(
    question = question,
    context = context,
    search_queries = "\n".join(search_queries),
    previous_actions = "\n".join([json.dumps(a) for a in previous_actions]),
    max_iterations=max_iterations,
    iteration_number=iteration_number,
)

In [31]:
answer_json = llm(prompt)

In [54]:
answer = json.loads(answer_json)

In [55]:
answer

{'action': 'SEARCH',
 'reasoning': 'I need to gather more specific insights and advice on succeeding in Module 1, especially since the previous search did not yield any relevant results related to strategies for success in the module.',
 'keywords': ['succeeding in Module 1',
  'Module 1 study tips',
  'best practices for Module 1']}

In [56]:
previous_actions.append(answer)

In [57]:
previous_actions

[{'action': 'SEARCH',
  'reasoning': 'I want to find specific tips and resources related to success strategies for module 1 to better inform the student.',
  'keywords': ['how to succeed in module 1',
   'tips for module 1',
   'module 1 best practices']},
 {'action': 'SEARCH',
  'reasoning': 'I need to gather more specific insights and advice on succeeding in Module 1, especially since the previous search did not yield any relevant results related to strategies for success in the module.',
  'keywords': ['succeeding in Module 1',
   'Module 1 study tips',
   'best practices for Module 1']}]

In [58]:
keywords = answer['keywords']

In [59]:
print(keywords)

['succeeding in Module 1', 'Module 1 study tips', 'best practices for Module 1']


In [60]:
for kw in keywords:
    search_queries.append(kw)
    sr = search(kw)
    search_results.extend(sr)

In [61]:
search_results = dedub(search_results)

In [62]:
len(search_results)

7

##### Running iterations manually

In [67]:
iteration_number = 3

context = build_context(search_results)

prompt = prompt_template.format(
    question = question,
    context = context,
    search_queries = "\n".join(search_queries),
    previous_actions = "\n".join([json.dumps(a) for a in previous_actions]),
    max_iterations=max_iterations,
    iteration_number=iteration_number,
)

In [68]:
print(prompt)

You are a course teaching assistant.

You are given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic.

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than 3 iterations for a given student question.
The current iteration numbe

In [69]:
answer_json = llm(prompt)

In [70]:
print(answer_json)

{
"action": "ANSWER",
"answer": "To do well in Module 1, which focuses on Docker and Terraform, consider the following strategies:\n\n1. **Understand the Basics:** Make sure you have a solid understanding of Docker concepts such as containers, images, and orchestration. Familiarize yourself with Terraform's infrastructure as code approach and its basic commands.\n\n2. **Hands-On Practice:** Engage in practical exercises by setting up your own Docker containers and using Terraform to manage infrastructure. The more you practice, the better you'll grasp these tools.\n\n3. **Utilize Resources:** Leverage the course materials, documentation, and online tutorials to reinforce your learning. Community forums can also provide assistance if you encounter challenges.\n\n4. **Manage Dependencies:** Be aware of common issues such as 'ModuleNotFoundError' for libraries like Psycopg2 or other dependencies. Make sure to install and update them as needed.\n\n5. **Collaborate:** If possible, work with

In [71]:
answer = json.loads(answer_json)

In [72]:
print(answer['answer'])

To do well in Module 1, which focuses on Docker and Terraform, consider the following strategies:

1. **Understand the Basics:** Make sure you have a solid understanding of Docker concepts such as containers, images, and orchestration. Familiarize yourself with Terraform's infrastructure as code approach and its basic commands.

2. **Hands-On Practice:** Engage in practical exercises by setting up your own Docker containers and using Terraform to manage infrastructure. The more you practice, the better you'll grasp these tools.

3. **Utilize Resources:** Leverage the course materials, documentation, and online tutorials to reinforce your learning. Community forums can also provide assistance if you encounter challenges.

4. **Manage Dependencies:** Be aware of common issues such as 'ModuleNotFoundError' for libraries like Psycopg2 or other dependencies. Make sure to install and update them as needed.

5. **Collaborate:** If possible, work with peers to share insights and tackle challen

#### Automating in a loop

In [45]:
question = "what do I need to do to be successful at module 1?"

search_queries = []
search_results = []
previous_actions = []

iteration = 0

while True:
    print(f'ITERATION #{iteration}...')

    context = build_context(search_results)
    prompt = prompt_template.format(
        question = question,
        context = context,
        search_queries = "\n".join(search_queries),
        previous_actions = "\n".join([json.dumps(a) for a in previous_actions]),
        max_iterations = 3,
        iteration_number = iteration
    )

    print(prompt)

    answer_json = llm(prompt)
    answer = json.loads(answer_json)
    print(json.dumps(answer, indent=2))

    previous_actions.append(answer)

    action = answer['action']
    if action != 'SEARCH':
        break

    keywords = answer['keywords']
    search_queries = list(set(search_queries) | set(keywords))

    for k in keywords:
        res = search(k)
        search_results.extend(res)

    search_results = dedub(search_results)

    iteration = iteration + 1
    if iteration >= 4:
        break
    print()


ITERATION #0...
You are a course teaching assistant.

You are given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic.

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than 3 iterations for a given student question.
The current

In [74]:
answer

{'action': 'ANSWER',
 'answer': "To be successful in Module 1, which focuses on Docker and Terraform, here are some general tips:\n\n1. **Understand the Basics of Docker:** Familiarize yourself with fundamental concepts such as images, containers, and orchestration. Make sure you are comfortable with Docker commands and workflows.\n\n2. **Learn Terraform Fundamentals:** Ensure you grasp Terraform's configuration language and how it manages infrastructure as code. Practice writing Terraform configurations and understand how to apply them.\n\n3. **Hands-On Practice:** Engage in as much hands-on practice as possible. Create your own Docker containers and configurations with Terraform for different scenarios to solidify your understanding.\n\n4. **Debugging Skills:** Develop debugging skills for both Docker and Terraform. Learn how to troubleshoot common issues that arise when building containers or managing infrastructure.\n\n5. **Follow Best Practices:** Refer to the Docker and Terraform

In [75]:
iteration

2

In [46]:
def agentic_search(question):
    search_queries = []
    search_results = []
    previous_actions = []
    
    iteration = 0
    
    while True:
        print(f'ITERATION #{iteration}...')
    
        context = build_context(search_results)
        prompt = prompt_template.format(
            question = question,
            context = context,
            search_queries = "\n".join(search_queries),
            previous_actions = "\n".join([json.dumps(a) for a in previous_actions]),
            max_iterations = 3,
            iteration_number = iteration
        )
    
        print(prompt)
    
        answer_json = llm(prompt)
        answer = json.loads(answer_json)
        print(json.dumps(answer, indent=2))
    
        previous_actions.append(answer)
    
        action = answer['action']
        if action != 'SEARCH':
            break
    
        keywords = answer['keywords']
        search_queries = list(set(search_queries) | set(keywords))
    
        for k in keywords:
            res = search(k)
            search_results.extend(res)
    
        search_results = dedub(search_results)
    
        iteration = iteration + 1
        if iteration >= 4:
            break
        print()
        
    return answer
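
The function above hard-codes two limits: the prompt is told `max_iterations = 3`, while the loop stops at `iteration >= 4`. One way to keep them in sync is to drive both from a single parameter. The sketch below shows only the control flow. `run_agent_loop`, `fake_decide`, and `fake_search` are hypothetical names, and `decide` stands in for the prompt-building, `llm`, and `json.loads` steps so that the example runs without an API key. Deduplication of results is left out for brevity.

```python
def run_agent_loop(question, decide, search, max_iterations=3):
    """Run SEARCH/ANSWER steps until the model answers or the budget runs out.

    decide(question, docs, iteration, max_iterations) returns a dict with an
    'action' key, mirroring the JSON templates in the prompt.
    """
    search_queries = []
    search_results = []
    answer = None

    # Iterations 0..max_iterations: with max_iterations=3 this makes at most
    # four model calls, the same as the `iteration >= 4` guard.
    for iteration in range(max_iterations + 1):
        answer = decide(question, search_results, iteration, max_iterations)
        if answer["action"] != "SEARCH":
            break
        for kw in answer["keywords"]:
            if kw not in search_queries:  # skip queries already used
                search_queries.append(kw)
                search_results.extend(search(kw))

    return answer


# Stand-ins so the sketch runs offline (not the real LLM or index).
def fake_decide(question, docs, iteration, max_iterations):
    if not docs and iteration < max_iterations:
        return {"action": "SEARCH", "keywords": ["module 1 tips"]}
    return {"action": "ANSWER", "answer": f"based on {len(docs)} docs", "source": "CONTEXT"}


def fake_search(query):
    return [{"_id": 1, "text": f"result for {query}"}]


final = run_agent_loop("how do I do well in module 1?", fake_decide, fake_search)
print(final)
```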

In [77]:
agentic_search("how do I prepare for the course?")

ITERATION #0...
You are a course teaching assistant.

You are given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic.

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than 3 iterations for a given student question.
The current

{'action': 'ANSWER',
 'answer': "To prepare for the course, ensure you register before the start date (January 15, 2024), and join the course's public Google Calendar and Telegram channel for announcements. Familiarize yourself with the course materials available via DataTalks.Club’s Slack and make sure to have your tools set up as outlined in the course guidelines. It's recommended to have a good path setup for any necessary tools like GitHub. Overall, being organized and proactive in setting up your environment will greatly enhance your experience in the course.",
 'source': 'OWN_KNOWLEDGE'}

In [78]:
print(_['answer'])

To prepare for the course, ensure you register before the start date (January 15, 2024), and join the course's public Google Calendar and Telegram channel for announcements. Familiarize yourself with the course materials available via DataTalks.Club’s Slack and make sure to have your tools set up as outlined in the course guidelines. It's recommended to have a good path setup for any necessary tools like GitHub. Overall, being organized and proactive in setting up your environment will greatly enhance your experience in the course.


#### Function calling ("tool use")

In [4]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5,
        output_ids=True
    )

    return results

In [5]:
search_tool = {
    "type" : "function",
    "name" : "search",
    "description" : "Search the FAQ database",
    "parameters" : {
        "type" : "object",
        "properties" : {
            "query" : {
                "type" : "string",
                "description" : "Search query text to look up in the course FAQ."
            }
        },
        "required" : ["query"],
        "additionalProperties": False
    }
}

In [6]:
question = "how do I do well in module 1?"

developer_prompt = """
You're a course teaching assistant.
You're given a question from a course student and your task is to answer it.
""".strip()

tools = [search_tool]

chat_messages = [
    {"role" : "developer", "content" : developer_prompt},
    {"role" : "user", "content" : question}
]

response = client.responses.create(
    model = 'gpt-4o-mini',
    input = chat_messages,
    tools = tools
)
response.output

[ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_LUkEw4fkKuiIX6QqHidYL6fm', name='search', type='function_call', id='fc_68a500dd625c8193a779ab9ef1212e7f06004efefa7d53d2', status='completed')]

In [7]:
calls = response.output

In [8]:
call = calls[0]

In [9]:
f_name = call.name

In [10]:
arguments = json.loads(call.arguments)

#### globals - Note
globals() returns a dictionary that represents the current global symbol table.
This dictionary contains all global names (variables, functions, classes, imports, etc.) that are defined in the current module.
- globals(): fast and flexible, but risky, because any global name can be looked up and called, not only the intended tools.
- getattr(tools, "hello") is similar to globals()["hello"], but scoped to the tools module instead of the whole global namespace. It fits when all the tool functions live in one module (say tools.py).
- Production systems usually use a dispatcher dictionary, that is, an explicit registry that maps names to functions. FastAPI, for example, does not look handlers up through globals() or getattr; it keeps a registry of route functions attached to an APIRouter.
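
The dispatcher-dictionary idea from the notes above can be sketched as follows. `TOOL_REGISTRY` and `dispatch` are hypothetical names, and the `search` function here is a stand-in for the notebook's real one so that the example runs on its own.

```python
import json


def search(query):
    """Stand-in for the notebook's search function."""
    return [{"text": f"result for {query}"}]


# Explicit registry: only functions listed here can be called by the model.
TOOL_REGISTRY = {
    "search": search,
}


def dispatch(name, arguments_json):
    """Look up a tool by name and call it with JSON-encoded keyword arguments."""
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name}")
    arguments = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**arguments)


print(dispatch("search", '{"query": "how to do well in module 1"}'))
```

Unlike `globals()[f_name]`, an unknown or unintended name fails with a clear error instead of resolving to whatever global happens to exist.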

In [11]:
globals()[f_name]

<function __main__.search(query)>

In [12]:
# returns a reference to the same search function as above
globals()['search']

<function __main__.search(query)>

In [13]:
f = globals()[f_name]

In [14]:
# unpack the parsed arguments as keyword arguments to f (here, search)
search_results = f(**arguments)

In [15]:
chat_messages.append(call)

chat_messages.append({
    "type": "function_call_output",
    "call_id": call.call_id,
    "output": json.dumps(search_results),
})

In [16]:
chat_messages

[{'role': 'developer',
  'content': "You're a course teaching assistant.\nYou're given a question from a course student and your task is to answer it."},
 {'role': 'user', 'content': 'how do I do well in module 1?'},
 ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_LUkEw4fkKuiIX6QqHidYL6fm', name='search', type='function_call', id='fc_68a500dd625c8193a779ab9ef1212e7f06004efefa7d53d2', status='completed'),
 {'type': 'function_call_output',
  'call_id': 'call_LUkEw4fkKuiIX6QqHidYL6fm',
  'output': '[{"text": "Even after installing pyspark correctly on linux machine (VM ) as per course instructions, faced a module not found error in jupyter notebook .\\nThe solution which worked for me(use following in jupyter notebook) :\\n!pip install findspark\\nimport findspark\\nfindspark.init()\\nThereafter , import pyspark and create spark contex<<t as usual\\nNone of the solutions above worked for me till I ran !pip3 install pyspark instead !pip install p

In [17]:
# call the model again, now with the tool call and its output in the conversation

response = client.responses.create(
    model = 'gpt-4o-mini',
    input = chat_messages,
    tools = tools
)
response.output

[ResponseOutputMessage(id='msg_68a500e413008193add3807914e838d906004efefa7d53d2', content=[ResponseOutputText(annotations=[], text="To excel in Module 1 of your course, here are some tips:\n\n1. **Understand Key Concepts**: Familiarize yourself with Docker and Terraform, as they are the primary tools in this module. Review the foundational concepts and their applications in cloud infrastructure.\n\n2. **Practice Hands-On**: Execute the commands and examples provided in the module. Set up a local environment to reinforce your learning through practical application.\n\n3. **Troubleshoot Common Errors**: Be aware of common errors, such as issues with SQLAlchemy or Docker setup. For example:\n   - If you encounter `ModuleNotFoundError: No module named 'psycopg2'`, ensure that you have installed the `psycopg2` module via pip or conda.\n   - For SQLAlchemy-related issues, verify your connection string format.\n\n4. **Engagement with Course Material**: Actively participate in discussions and 

In [18]:
print(response.output[0].content[0].text)

To excel in Module 1 of your course, here are some tips:

1. **Understand Key Concepts**: Familiarize yourself with Docker and Terraform, as they are the primary tools in this module. Review the foundational concepts and their applications in cloud infrastructure.

2. **Practice Hands-On**: Execute the commands and examples provided in the module. Set up a local environment to reinforce your learning through practical application.

3. **Troubleshoot Common Errors**: Be aware of common errors, such as issues with SQLAlchemy or Docker setup. For example:
   - If you encounter `ModuleNotFoundError: No module named 'psycopg2'`, ensure that you have installed the `psycopg2` module via pip or conda.
   - For SQLAlchemy-related issues, verify your connection string format.

4. **Engagement with Course Material**: Actively participate in discussions and seek help when needed. Utilize forums or Q&A sections to clarify doubts.

5. **Utilize Resources**: Make use of additional resources like docu

#### Multiple calls

In [19]:
question = "how do I do well in module 1?"

developer_prompt = """
You're a course teaching assistant.
You're given a question from a course student and your task is to answer it.
If you look up something in FAQ, convert the student question into multiple queries.
""".strip()

tools = [search_tool]

chat_messages = [
    {"role" : "developer", "content" : developer_prompt},
    {"role" : "user", "content" : question}
]

response = client.responses.create(
    model = 'gpt-4o-mini',
    input = chat_messages,
    tools = tools
)
response.output

[ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_ql6UQ8GpACFYp0hC7MneeuHK', name='search', type='function_call', id='fc_68a500ec45a481908e53132ca8d294a30e31cba901a82261', status='completed'),
 ResponseFunctionToolCall(arguments='{"query":"tips for succeeding in module 1"}', call_id='call_PY7wa4qgBQ04lyLmbbmHgg3z', name='search', type='function_call', id='fc_68a500eca2088190874218dd3a4f979a0e31cba901a82261', status='completed'),
 ResponseFunctionToolCall(arguments='{"query":"module 1 study strategies"}', call_id='call_1kdJucH0VbecP1B8LjTewWbd', name='search', type='function_call', id='fc_68a500ed09e08190983c5090dd667c270e31cba901a82261', status='completed')]

In [20]:
calls = response.output

In [21]:
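for call in calls:
    # Look up the Python function the model asked for and run it with the parsed arguments
    f_name = call.name
    arguments = json.loads(call.arguments)
    f = globals()[f_name]
    results = f(**arguments)

    # Append the tool call itself, then its output, so the API can match them by call_id
    chat_messages.append(call)
    chat_messages.append({
        "type" : "function_call_output",
        "call_id" : call.call_id,
        "output" : json.dumps(results),
    })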

In [22]:
response = client.responses.create(
    model = 'gpt-4o-mini',
    input = chat_messages,
    tools = tools
)
response.output

[ResponseOutputMessage(id='msg_68a500ef39dc8190b1bce0bdec5498600e31cba901a82261', content=[ResponseOutputText(annotations=[], text="To do well in Module 1 of the course, here's a set of strategies and tips you might find helpful:\n\n1. **Understand the Basics**:\n   - Make sure you have a strong foundational understanding of Docker and Terraform, as they are critical tools for this module. Review the introductory materials thoroughly.\n\n2. **Hands-On Practice**:\n   - Set up your local development environment as instructed. Create a sample project using Docker and Terraform to solidify your understanding.\n\n3. **Utilize Resources**:\n   - Your course likely provides resources like video lectures, reading materials, and forums. Don't hesitate to use these to clarify concepts.\n\n4. **Seek Help When Needed**:\n   - If you encounter errors like `ModuleNotFoundError` for PostgreSQL dependencies, seek solutions promptly. For instance, if you get errors related to `psycopg2`, ensure you us

In [23]:
print(response.output[0].content[0].text)

To do well in Module 1 of the course, here's a set of strategies and tips you might find helpful:

1. **Understand the Basics**:
   - Make sure you have a strong foundational understanding of Docker and Terraform, as they are critical tools for this module. Review the introductory materials thoroughly.

2. **Hands-On Practice**:
   - Set up your local development environment as instructed. Create a sample project using Docker and Terraform to solidify your understanding.

3. **Utilize Resources**:
   - Your course likely provides resources like video lectures, reading materials, and forums. Don't hesitate to use these to clarify concepts.

4. **Seek Help When Needed**:
   - If you encounter errors like `ModuleNotFoundError` for PostgreSQL dependencies, seek solutions promptly. For instance, if you get errors related to `psycopg2`, ensure you use:
     ```bash
     pip install psycopg2-binary
     ```
   - If issues persist, consult the FAQs or post in course forums for guidance.

5. **

#### Refactor into a chat interface

In [4]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5,
        output_ids=True
    )

    return results

In [5]:
search_tool = {
    "type" : "function",
    "name" : "search",
    "description" : "Search the FAQ database",
    "parameters" : {
        "type" : "object",
        "properties" : {
            "query" : {
                "type" : "string",
                "description" : "Search query text to look up in the course FAQ."
            }
        },
        "required" : ["query"],
        "additionalProperties": False
    }
}

In [6]:
def do_call(tool_call_response):
    function_name = tool_call_response.name
    arguments = json.loads(tool_call_response.arguments)

    f = globals()[function_name]
    result = f(**arguments)

    return {
        "type" : "function_call_output",
        "call_id" : tool_call_response.call_id,
        "output" : json.dumps(result, indent=2),
    }
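
`do_call` resolves the function through `globals()`, so any name the model returns is looked up in the notebook namespace. A minimal sketch of an alternative, assuming an explicit registry: `TOOL_REGISTRY`, `fake_search` and `do_call_safe` are hypothetical names introduced here for illustration, and the tool call is a `SimpleNamespace` stand-in rather than a real `ResponseFunctionToolCall`.

```python
import json
from types import SimpleNamespace

def fake_search(query):
    # Stand-in for the notebook's search(); returns a fixed result for illustration.
    return [{"question": query, "text": "stub answer"}]

# Hypothetical registry: only the names listed here can be called by the model.
TOOL_REGISTRY = {"search": fake_search}

def do_call_safe(tool_call):
    """Run a tool call and wrap the result as a function_call_output item."""
    func = TOOL_REGISTRY.get(tool_call.name)
    if func is None:
        output = {"error": f"unknown tool: {tool_call.name}"}
    else:
        try:
            output = func(**json.loads(tool_call.arguments))
        except (TypeError, json.JSONDecodeError) as exc:
            # Malformed JSON or wrong argument names are reported back as an error payload.
            output = {"error": str(exc)}
    return {
        "type": "function_call_output",
        "call_id": tool_call.call_id,
        "output": json.dumps(output),
    }

fake_call = SimpleNamespace(name="search", arguments='{"query": "docker"}', call_id="call_1")
result = do_call_safe(fake_call)
print(result["output"])
```

Returning the error as the tool output, instead of raising, lets the model see what went wrong and possibly retry with corrected arguments.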

In [26]:
question = "how do I do well in module 1?"

developer_prompt = """
You're a course teaching assistant.
You're given a question from a course student and your task is to answer it.
If you look up something in FAQ, convert the student question into multiple queries.
""".strip()

tools = [search_tool]

chat_messages = [
    {"role" : "developer", "content" : developer_prompt},
    {"role" : "user", "content" : question}
]

response = client.responses.create(
    model = 'gpt-4o-mini',
    input = chat_messages,
    tools = tools
)
response.output

[ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_o44gqRJ16NwuOqgEHZft8cgB', name='search', type='function_call', id='fc_68a429a9a5c08194ab642276ca29ecc50a874687391c6768', status='completed'),
 ResponseFunctionToolCall(arguments='{"query":"tips for success in module 1"}', call_id='call_bWvunwnrR8itpoKkG1FTuZzb', name='search', type='function_call', id='fc_68a429a9f6388194a17be362d6171cf30a874687391c6768', status='completed')]

In [29]:
for call in response.output:  # use the tool calls from the response above, not a stale `calls` variable
    result = do_call(call)
    chat_messages.append(call)
    chat_messages.append(result)

In [None]:
response = client.responses.create(
    model = 'gpt-4o-mini',
    input = chat_messages,
    tools = tools
)
response.output

In [None]:
# We don't know in advance what the model will return after a function call:
# it may request another function call or give the final answer,
# so check the type of each output entry first.

for entry in response.output:
    chat_messages.append(entry)
    print(entry.type)

    if entry.type == 'function_call':
        result = do_call(entry)
        chat_messages.append(result)
    elif entry.type == 'message':
        print(entry.content[0].text)
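
The type check above can be tried without calling the API. The sketch below uses `SimpleNamespace` objects as stand-ins for the SDK's output items; it assumes, as the outputs earlier in this notebook show, that a `function_call` entry carries `name` and `arguments` and that a `message` entry carries a `content` list whose parts have a `text` attribute. `message_text` and `describe` are hypothetical helpers, not part of the OpenAI SDK.

```python
from types import SimpleNamespace

def message_text(entry):
    """Join the text of every content part in a message entry."""
    return "".join(part.text for part in entry.content if hasattr(part, "text"))

def describe(entry):
    """Return a one-line summary of an output entry based on its type."""
    if entry.type == "function_call":
        return f"tool call: {entry.name}({entry.arguments})"
    if entry.type == "message":
        return f"answer: {message_text(entry)}"
    return f"other: {entry.type}"

# Stand-in entries shaped like the outputs shown in this notebook.
fake_output = [
    SimpleNamespace(type="function_call", name="search", arguments='{"query":"docker"}'),
    SimpleNamespace(type="message", content=[SimpleNamespace(text="Use named volumes.")]),
]

for entry in fake_output:
    print(describe(entry))
```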

In [27]:
developer_prompt = """
You're a course teaching assistant.
You're given a question from a course student and your task is to answer it.

Use FAQ if your own knowledge is not sufficient to answer the question.
When using FAQ, perform deep topic exploration: make one request to FAQ,
and then based on the results, make more requests.

At the end of each response, ask the user a follow up question based on your answer.
""".strip()

chat_messages = [
    {"role": "developer", "content": developer_prompt},
]

### Chatbot-like interface

In [30]:
while True:  # main Q&A loop
    question = input()
    if question == 'stop':
        break

    message = {"role" : "user", "content": question}
    chat_messages.append(message)

    while True:  # request-response loop: query the API until there are no more tool calls
        response = client.responses.create(
            model = 'gpt-4o-mini',
            input = chat_messages,
            tools = tools
        )

        has_tool_calls = False

        for entry in response.output:
            chat_messages.append(entry)

            if entry.type == 'function_call':
                print('function_call:', entry)
                print()
                result = do_call(entry)
                chat_messages.append(result)
                has_tool_calls = True
                
            elif entry.type == 'message':
                print(entry.content[0].text)
                print()
                
        if not has_tool_calls:
            break
    

 how do I do well in module 1?


function_call: ResponseFunctionToolCall(arguments='{"query":"do well in module 1"}', call_id='call_LNQJxidfCkHOd4oPrxCe8ndZ', name='search', type='function_call', id='fc_68a504b8998081908475119ac00933910c6579135a790449', status='completed')

function_call: ResponseFunctionToolCall(arguments='{"query":"tips for success in module 1"}', call_id='call_8t4UDtkbZwOYw1HIrmlVgVWD', name='search', type='function_call', id='fc_68a504b9bc8881908fd4a4588ae110a20c6579135a790449', status='completed')

To do well in Module 1, which covers Docker and Terraform as part of the Data Engineering Zoomcamp, here are some key tips:

1. **Understand Key Concepts**: Familiarize yourself with the essentials of Docker and Terraform. Focus on what containers are, how orchestration works, and the syntax and usage of Terraform scripts.

2. **Hands-On Practice**: Engage deeply with the practical exercises provided in the module. Setting up Docker containers and writing Terraform scripts will solidify your understand

 Docker


function_call: ResponseFunctionToolCall(arguments='{"query":"Docker tips for success in module 1"}', call_id='call_NCZAmV3AvqmhPUKsB54g4EEx', name='search', type='function_call', id='fc_68a504f030948190bc9b06647bb6c5a60c6579135a790449', status='completed')

To excel in Docker as part of Module 1, consider the following tips:

1. **Installation and Setup**:
   - Ensure Docker is correctly installed on your machine. Verify that Docker is running smoothly by executing `docker --version` and `docker-compose --version`. 

2. **Understand Docker Basics**:
   - Familiarize yourself with fundamental Docker concepts like images, containers, volumes, and networks. This foundational knowledge will aid in understanding more complex operations later in your learning.

3. **Use Docker-Compose**:
   - Make use of `docker-compose` to manage multi-container applications. Write clear `docker-compose.yml` files defining your services, volumes, and networks.

4. **Debugging**:
   - If you encounter errors

 file permissions 


function_call: ResponseFunctionToolCall(arguments='{"query":"Docker file permissions issues"}', call_id='call_hzTwUf7JiYZZf3GCrz7EntsD', name='search', type='function_call', id='fc_68a50550e25c819090c1567e942acfc50c6579135a790449', status='completed')

When working with Docker, especially in environments like WSL (Windows Subsystem for Linux), file permissions can be a common issue. Here are some tips to handle file permissions effectively in Docker:

### Common Issues and Solutions

1. **Permission Errors with Mounted Volumes**:
   - If you encounter errors like `could not change permissions of directory /var/lib/postgresql/data`, it may be due to the way Windows handles permissions compared to Linux.
   - **Solution**: Utilize Docker volumes instead of binding mounts. Create a named volume in Docker:
     ```bash
     docker volume create my_volume_name
     ```
     Then, reference this volume in your `docker-compose.yml` or `docker run` command.

2. **Check User Permissions**:
   -

 stop
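
The inner request-response loop keeps calling the API for as long as the model returns tool calls. A minimal sketch of the same loop with an upper bound on the number of rounds is shown below. `run_turn`, `MAX_TOOL_ROUNDS`, `fake_create` and `fake_tool` are hypothetical names, and the scripted responses are stand-ins for `client.responses.create`, so the example runs without network access.

```python
from types import SimpleNamespace

MAX_TOOL_ROUNDS = 5  # hypothetical cap, not used in the notebook

def run_turn(create_response, messages, handle_tool_call, max_rounds=MAX_TOOL_ROUNDS):
    """Call the model until it answers without tool calls, or the cap is hit.

    Returns the list of text answers collected during the turn.
    """
    answers = []
    for _ in range(max_rounds):
        response = create_response(messages)
        has_tool_calls = False
        for entry in response.output:
            messages.append(entry)
            if entry.type == "function_call":
                messages.append(handle_tool_call(entry))
                has_tool_calls = True
            elif entry.type == "message":
                answers.append(entry.content[0].text)
        if not has_tool_calls:
            break
    return answers

# Scripted stand-in for the API: first a tool call, then a final message.
scripted = iter([
    SimpleNamespace(output=[SimpleNamespace(
        type="function_call", name="search",
        arguments='{"query": "docker"}', call_id="c1")]),
    SimpleNamespace(output=[SimpleNamespace(
        type="message", content=[SimpleNamespace(text="Use named volumes.")])]),
])

def fake_create(messages):
    return next(scripted)

def fake_tool(entry):
    return {"type": "function_call_output", "call_id": entry.call_id, "output": "[]"}

history = [{"role": "user", "content": "file permissions"}]
answers = run_turn(fake_create, history, fake_tool)
print(answers)
```

Passing the API call and the tool handler in as arguments is a design choice made for this sketch: it keeps the loop testable with fakes, and in the notebook the real `client.responses.create` call and `do_call` would take their place.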


In [7]:
from pathlib import Path
import sys

PATH = Path().resolve().parent

if str(PATH) not in sys.path:
    sys.path.append(str(PATH))

In [8]:
from script import chat_assistant

tools = chat_assistant.Tools()
tools.add_tool(search, search_tool)

tools.get_tools()

[{'type': 'function',
  'name': 'search',
  'description': 'Search the FAQ database',
  'parameters': {'type': 'object',
   'properties': {'query': {'type': 'string',
     'description': 'Search query text to look up in the course FAQ.'}},
   'required': ['query'],
   'additionalProperties': False}}]
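
The `script.chat_assistant` module is not shown in this notebook. Judging only from how `Tools` is used here (`add_tool(function, description)` and `get_tools()`), a class with that interface might look roughly like the sketch below. This is an assumption for illustration, not the actual implementation, and the `function_call` method in particular is a guess at how tool calls could be dispatched.

```python
import json
from types import SimpleNamespace

class Tools:
    """Hypothetical stand-in for chat_assistant.Tools (the real class is not shown)."""

    def __init__(self):
        self.functions = {}     # tool name -> Python callable
        self.descriptions = {}  # tool name -> JSON schema sent to the model

    def add_tool(self, function, description):
        name = description["name"]
        self.functions[name] = function
        self.descriptions[name] = description

    def get_tools(self):
        # The list passed as `tools=` to the API.
        return list(self.descriptions.values())

    def function_call(self, tool_call):
        # Assumed dispatch method: run the registered function and wrap its output.
        function = self.functions[tool_call.name]
        result = function(**json.loads(tool_call.arguments))
        return {
            "type": "function_call_output",
            "call_id": tool_call.call_id,
            "output": json.dumps(result),
        }

def echo(text):
    # Toy tool used only for this example.
    return {"echo": text}

echo_description = {
    "type": "function",
    "name": "echo",
    "description": "Return the input text",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
        "additionalProperties": False,
    },
}

demo_tools = Tools()
demo_tools.add_tool(echo, echo_description)
call = SimpleNamespace(name="echo", arguments='{"text": "hi"}', call_id="c1")
output = demo_tools.function_call(call)
print(output)
```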

In [9]:
developer_prompt = """
You're a course teaching assistant. 
You're given a question from a course student and your task is to answer it.

Use FAQ if your own knowledge is not sufficient to answer the question.

At the end of each response, ask the user a follow up question based on your answer.
""".strip()

chat_interface = chat_assistant.chatInterface()
chat = chat_assistant.ChatAssistant(
    tools = tools, 
    developer_prompt = developer_prompt,
    chat_interface = chat_interface,
    client = client,
)

In [12]:
chat.run()

You: how to excel in data engineering?


You: stop


chat ended


#### Multiple tools

In [11]:
def add_entry(question, answer):
    doc = {
        'question': question,
        'text' : answer,
        'section' : 'user_added',
        'course' : 'data-engineering-zoomcamp'
    }
    index.append(doc)

In [12]:
add_entry_description = {
    "type" : "function",
    "name" : "add_entry",
    "description" : "Add an entry to the FAQ database",
    "parameters" : {
        "type" : "object",
        "properties" : {
            "question" : {
                "type" : "string",
                "description" : "The question to be added to the FAQ database",
            },
            "answer" : {
                "type" : "string",
                "description" : "The answer to the question",
            }
        },
        "required" : ["question", "answer"],
        "additionalProperties" : False
    }
}

In [13]:
tools.add_tool(add_entry, add_entry_description)

In [15]:
tools.get_tools()

[{'type': 'function',
  'name': 'search',
  'description': 'Search the FAQ database',
  'parameters': {'type': 'object',
   'properties': {'query': {'type': 'string',
     'description': 'Search query text to look up in the course FAQ.'}},
   'required': ['query'],
   'additionalProperties': False}},
 {'type': 'function',
  'name': 'add_entry',
  'description': 'Add an entry to the FAQ database',
  'parameters': {'type': 'object',
   'properties': {'question': {'type': 'string',
     'description': 'The question to be added to the FAQ database'},
    'answer': {'type': 'string', 'description': 'The answer to the question'}},
   'required': ['question', 'answer'],
   'additionalProperties': False}}]

In [16]:
chat.run()

You: how to master Docker permisions?


You: file permisions with the host and mounting


You: add this to the FAQ database


You: stop


chat ended
