In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-06-24 09:59:19--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-06-24 09:59:19 (12.3 MB/s) - ‘minsearch.py’ saved [3832/3832]



In [3]:
!wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json

--2024-06-24 10:02:36--  https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json’


2024-06-24 10:02:36 (47.5 MB/s) - ‘documents.json’ saved [658332/658332]



In [25]:
import json
import os

from groq import Groq
from minsearch import Index

In [51]:
with open('documents.json', 'rt') as f_in:
    courses = json.load(f_in)

len(courses)

3

In [4]:
documents = []

for course in courses:
    for doc in course['documents']:
        doc['course'] = course['course']
        documents.append(doc)

len(documents)

948

In [5]:
engine = Index(
    text_fields=['text', 'section', 'question'],
    keyword_fields=['course']
)

In [6]:
engine.fit(documents)

<minsearch.Index at 0x7c6e2d1eada0>
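To get a feel for what filtering and per-field boosting do to a ranking, here is a minimal sketch. It is not minsearch's scoring; it is a toy term-overlap scorer, and `toy_score`, `toy_search`, and the sample documents are hypothetical names and data introduced only for illustration.

```python
def toy_score(query, doc, boosts):
    """Count query terms found in each field, weighted by that field's boost."""
    terms = set(query.lower().split())
    score = 0.0
    for field, weight in boosts.items():
        field_terms = set(str(doc.get(field, "")).lower().split())
        score += weight * len(terms & field_terms)
    return score


def toy_search(query, docs, course, boosts, num_results=5):
    """Filter by course, score what remains, and return the best matches first."""
    candidates = [d for d in docs if d.get("course") == course]
    scored = [(toy_score(query, d, boosts), d) for d in candidates]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:num_results]]


# Hypothetical sample documents, not taken from documents.json.
sample_docs = [
    {"course": "data-engineering-zoomcamp", "section": "Module 6",
     "question": "How do I run kafka", "text": "Use the producer script"},
    {"course": "data-engineering-zoomcamp", "section": "General",
     "question": "When does the course start", "text": "You can run kafka later"},
    {"course": "mlops-zoomcamp", "section": "General",
     "question": "How do I run kafka", "text": "Not covered"},
]

boosts = {"question": 3.0, "text": 1.0, "section": 0.5}
results = toy_search("run kafka", sample_docs, "data-engineering-zoomcamp", boosts)
for doc in results:
    print(toy_score("run kafka", doc, boosts), doc["question"])
```

The document whose question matches the query ranks above the one that only matches in its text, and the entry from the other course is dropped by the filter before scoring.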

In [44]:
def search(query):
    # Restrict results to one course and weight question matches above section matches.
    return engine.search(
        query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict={"question": 3.0, "section": 0.5},
    )

In [45]:
def build_prompt(query, search_results):
    # Template with the instructions, the user's question, and the retrieved FAQ entries.
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    # Concatenate the retrieved documents into a single context string.
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    # The final strip() removes the trailing blank lines left by the last entry.
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
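The context part of the prompt can be inspected on its own before anything is sent to a model. A minimal sketch, assuming each search result carries the same `section`, `question`, and `text` keys used above; `format_context` and the sample entries are hypothetical and only illustrate the layout.

```python
def format_context(search_results):
    """Render each FAQ entry as a section/question/answer block, separated by a blank line."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    return "\n\n".join(entries)


# Hypothetical entries, not taken from documents.json.
sample_results = [
    {"section": "Module 6", "question": "How do I run kafka?", "text": "Run the producer script."},
    {"section": "General", "question": "Can I join late?", "text": "Yes, you can."},
]

context = format_context(sample_results)
print(context)
```

Joining a list leaves no trailing blank lines, so the result needs no final `strip()`.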

In [73]:
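def ask_groq(prompt, model="mixtral-8x7b-32768"):
    # The API key is read from the GROQ_API_KEY environment variable.
    client = Groq(
        api_key=os.environ.get("GROQ_API_KEY"),
    )

    # Send the whole prompt as a single user message.
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model,
    )

    # Print the reply text; the function itself returns None.
    print(chat_completion.choices[0].message.content)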
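`os.environ.get` returns `None` when the variable is missing, so a forgotten key only surfaces later, when the client is created or the request is made. A small guard can fail earlier with a clearer message. This is a sketch; `require_env` is a hypothetical helper, not part of the Groq SDK.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible use inside ask_groq (left commented out so the sketch runs without a key):
# client = Groq(api_key=require_env("GROQ_API_KEY"))
```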

In [67]:
def ask_llm(q):
    docs = search(q)
    prompt = build_prompt(q, docs)
    return ask_groq(prompt)

In [68]:
ask_llm("How do I run kafka?")

To run Kafka, you'll need to follow specific steps depending on the language you're using. 

If you're using Java, in the project directory, run:

`java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java`

If you're using Python, create a virtual environment and install the required packages. To create a virtual environment, run:

`python -m venv env`

Then, activate it with:

`source env/bin/activate`

Install the required packages with:

`pip install -r ../requirements.txt`

Make sure to activate the virtual environment every time you need it.


In [52]:
# Elasticsearch

In [56]:
from tqdm import tqdm
from elasticsearch import Elasticsearch

In [57]:
es_client = Elasticsearch('http://localhost:9200') 

In [None]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)
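Running the cell above a second time fails because the index already exists. A minimal sketch of a guard, assuming the same client object; `ensure_index` is a hypothetical helper name, and it relies only on the client's `indices.exists` and `indices.create` calls.

```python
def ensure_index(client, name, settings):
    """Create the index only when it is missing; return True if it was created."""
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=settings)
    return True


# Possible use with the objects defined above:
# ensure_index(es_client, index_name, index_settings)
```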

In [59]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 948/948 [00:31<00:00, 30.38it/s]
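The loop above sends one request per document. The client library also ships a `bulk` helper in `elasticsearch.helpers` that batches documents into fewer requests. The sketch below shows the shape of such a call; `make_actions` and `bulk_index` are hypothetical names, and whether batching is noticeably faster in this setup has not been measured here.

```python
def make_actions(docs, index_name):
    """Yield one bulk action per document, targeting the given index."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(client, docs, index_name):
    """Index all documents through the bulk helper and return its result."""
    # Imported here so the sketch can be read without the package installed.
    from elasticsearch.helpers import bulk
    return bulk(client, make_actions(docs, index_name))


# Possible use with the objects defined above:
# bulk_index(es_client, documents, index_name)
```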


In [60]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    # Keep only the stored documents and drop Elasticsearch metadata such as _id and _score.
    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs
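The course name and result count are hard-coded in the query above. A minimal sketch that builds the same query body from parameters, so it can be reused for other courses; `build_search_query` is a hypothetical helper, and the body mirrors the structure used in `elastic_search`.

```python
def build_search_query(query, course, size=5):
    """Build a bool query: full-text match on three fields, filtered by exact course name."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        # question^3 weights matches in the question field three times higher.
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": course}},
            }
        },
    }


search_query = build_search_query("How do I run kafka?", "data-engineering-zoomcamp")
print(search_query["query"]["bool"]["filter"])
```

The resulting dictionary can be passed to `es_client.search` in the same way as the hard-coded one.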

In [69]:
def elastic_ask_llm(q):
    docs = elastic_search(q)
    prompt = build_prompt(q, docs)
    return ask_groq(prompt)

In [74]:
elastic_ask_llm("How do I run kafka?")

To run Kafka, you need to follow the instructions provided in the "Module 6: streaming with kafka" section of the FAQ database. If you're encountering an issue with the "module 'kafka' not found" error, you should create a virtual environment and run the requirements.txt and python files in that environment.

Here are the steps to create a virtual environment and install the necessary packages:

1. Open a terminal and navigate to the project directory.
2. Create a virtual environment by running: `python -m venv env`
3. Activate the virtual environment with the command: `source env/bin/activate` (on Windows, use: `env\Scripts\activate`)
4. Install the required packages with: `pip install -r ../requirements.txt`

After setting up the virtual environment, you can run the Kafka producer/consumer/kstreams by using the Java commands provided in the FAQ database. For example, to run the producer, navigate to the project directory and execute:

```
java -cp build/libs/<jar_name>-1.0-SNAPSHOT.j