In [1]:
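import requests

# Download the FAQ documents, grouped by course
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
docs_response.raise_for_status()  # fail early on a bad HTTP response
documents_raw = docs_response.json()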

# Flatten the per-course groups into one list, tagging each document with its course
documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
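
The flattening step above can be tried without the network call. The following is a minimal sketch, assuming the downloaded JSON is a list of course objects, each with a `course` name and a `documents` list, as the loop implies. The sample records and the `flatten` helper are made up for illustration and are not part of the notebook.

```python
# Hypothetical sample mirroring the shape the loop above expects
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"section": "General", "question": "Q1?", "text": "A1."},
            {"section": "Setup", "question": "Q2?", "text": "A2."},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"section": "General", "question": "Q3?", "text": "A3."},
        ],
    },
]


def flatten(raw):
    """Return one flat list of documents, each tagged with its course name."""
    flat = []
    for course in raw:
        for doc in course["documents"]:
            # Copy each record so the input is not mutated
            flat.append({**doc, "course": course["course"]})
    return flat


sample_docs = flatten(sample_raw)
print(len(sample_docs))
print(sample_docs[0]["course"])
```

Unlike the in-place loop, this variant copies each record, so the raw data can be reused afterwards; whether that matters depends on how the raw JSON is used later.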

In [2]:
from minsearch import AppendableIndex

index = AppendableIndex(
    text_fields = ["question", "text", "section"],
    keyword_fields = ["course"]
)

index.fit(documents)

<minsearch.append.AppendableIndex at 0x7cc78859b0e0>

In [None]:
index.search("how to use kafka with spark")

In [4]:
def search(query):
    # Weight matches in the question field up and matches in the section field down
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},  # restrict to one course
        boost_dict=boost,
        num_results=5,
        output_ids=True
    )

    return results

In [5]:
question = "how to use kafka with spark"

In [6]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()

def build_prompt(query, search_results):
    """Fill the template with the question and the retrieved FAQ entries."""
    context = ""

    # One section/question/answer entry per retrieved document
    for doc in search_results:
        context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
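
The string concatenation above leaves two trailing newlines after the last entry, just before `</CONTEXT>`. As a hedged alternative, the sketch below renders each entry separately and joins them with a blank line. The helper names `format_entry` and `build_context` and the sample records are hypothetical and do not appear in the notebook.

```python
def format_entry(doc):
    """Render one FAQ record in the section/question/answer layout."""
    return (
        f"section: {doc['section']}\n"
        f"question: {doc['question']}\n"
        f"answer: {doc['text']}"
    )


def build_context(search_results):
    """Join the rendered entries with a blank line between them."""
    return "\n\n".join(format_entry(doc) for doc in search_results)


# Made-up records, for illustration only
sample_results = [
    {"section": "S1", "question": "Q1?", "text": "A1."},
    {"section": "S2", "question": "Q2?", "text": "A2."},
]

context = build_context(sample_results)
print(context)
```

The joined string can be passed to `prompt_template.format(question=..., context=...)` in the same way as the concatenated one.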

In [7]:
search_results = search(question)

In [8]:
prompt = build_prompt(question, search_results)

In [9]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

<QUESTION>
how to use kafka with spark
</QUESTION>

<CONTEXT>
section: Module 6: streaming with kafka
question: Python Kafka: ./spark-submit.sh streaming.py - ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
answer: While following tutorial 13.2 , when running ./spark-submit.sh streaming.py, encountered the following error:
…
24/03/11 09:48:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...
24/03/11 09:48:36 INFO TransportClientFactory: Successfully created connection to localhost/127.0.0.1:7077 after 10 ms (0 ms spent in bootstraps)
24/03/11 09:48:54 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGeneratio

In [10]:
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables (such as OPENAI_API_KEY) from a local .env file
load_dotenv()

client = OpenAI()

def llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
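
The `rag` function chains three steps: retrieve, build the prompt, generate. One way to exercise that flow without an API key or network access is to pass the steps in as arguments. The sketch below assumes nothing about the real `search`, `build_prompt`, or `llm` beyond their call shapes; `make_rag` and the `fake_*` stand-ins are hypothetical names introduced only for this example.

```python
def make_rag(search_fn, build_prompt_fn, llm_fn):
    """Compose retrieval, prompt building, and generation into one function."""
    def rag(query):
        results = search_fn(query)
        prompt = build_prompt_fn(query, results)
        return llm_fn(prompt)
    return rag


# Stand-ins that need no network access or API key
def fake_search(query):
    return [{"section": "S", "question": "Q?", "text": "A."}]


def fake_build_prompt(query, results):
    return f"QUESTION: {query}\nCONTEXT: {results[0]['text']}"


def fake_llm(prompt):
    # Echo the prompt back so the wiring is visible in the output
    return f"echo: {prompt}"


offline_rag = make_rag(fake_search, fake_build_prompt, fake_llm)
result = offline_rag("how to use kafka with spark")
print(result)
```

With the notebook's own functions, the equivalent call would be `make_rag(search, build_prompt, llm)`.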

In [12]:
answer = llm(prompt)

In [13]:
print(answer)

To use Kafka with Spark, first ensure that your Kafka broker is running properly; you can check its status with `docker ps`. If your Kafka container is not working, start it by navigating to the folder containing your docker compose YAML file and running `docker compose up -d`.

Next, for streaming with Spark, you might need to check that your Spark environment is correctly set up. If you encounter any connection issues with the Spark master, you can start a new terminal session, get the container ID of the Spark master with `docker ps`, and then view the logs using:

```
docker exec -it <spark_master_container_id> bash
cat logs/spark-master.out
```

By checking the logs, you can diagnose issues related to the Spark master connection.

Additionally, ensure that your PySpark version matches your intended setup; a mismatch can lead to connection errors. To check the PySpark version on your local machine, use:

```
pyspark --version
spark-submit --version
```

If necessary, downgrade your

In [14]:
rag(question)

"To use Kafka with Spark, you can refer to the materials in Module 6: Streaming with Kafka. If you're running a Python application with Spark and Kafka, you may need to troubleshoot errors such as connection issues with the Spark master or Kafka broker. \n\nFor instance, if you get an error indicating that no brokers are available, ensure that your Kafka broker Docker container is running by using the command `docker ps`. If it is not running, you can start it by navigating to the docker compose YAML file folder and running `docker compose up -d` to start all instances.\n\nIf you encounter errors with the Spark master connection, you can start a new terminal, find the Spark master container ID with `docker ps`, and check the logs for errors using:\n\n```bash\ndocker exec -it <spark_master_container_id> bash\ncat logs/spark-master.out\n```\n\nThis will provide you with insights into any connection issues. Additionally, ensure that your PySpark version matches the expected version to avo