## Homework: Introduction

In this homework, we'll learn more about search and practice with Elasticsearch.

> Your answers may not match the options exactly. If that's the case, select the closest one.

## Q1. Running Elastic 

Run Elasticsearch 8.4.3 and get the cluster information. If you run it on localhost, this is how you start it:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

In [7]:
!curl localhost:9200 

{
  "name" : "20bec6a63766",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "NxrUbXmpRLGayupMtIhmpQ",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
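
The same response can also be inspected programmatically. The sketch below parses a trimmed copy of the output above with the standard `json` module and reads the version number and build hash. The variable name `cluster_info_json` is an illustrative assumption; in practice the string would come from the HTTP response rather than being pasted in.

```python
import json

# Trimmed copy of the cluster information shown above.
cluster_info_json = """
{
  "name": "20bec6a63766",
  "cluster_name": "docker-cluster",
  "version": {
    "number": "8.4.3",
    "build_hash": "42f05b9372a9a4a470db3b52817899b99a76ee73"
  },
  "tagline": "You Know, for Search"
}
"""

# Parse the JSON text into nested Python dictionaries.
cluster_info = json.loads(cluster_info_json)
version = cluster_info["version"]

print(version["number"])
print(version["build_hash"])
```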


## Q2. Indexing the data

Index the data the same way as shown in the course videos. Make the `course` field a keyword; the remaining fields should be text.
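
To see why the two field types are treated differently, here is a toy illustration in plain Python. It is not Elasticsearch's actual analyzer: it only assumes that a `keyword` field is compared as one exact value, while a `text` field is split into lowercase terms that can match individually. The helper names `keyword_match` and `text_match` are hypothetical.

```python
def keyword_match(field_value: str, query: str) -> bool:
    """Exact, whole-value comparison, roughly how a keyword field behaves."""
    return field_value == query


def text_match(field_value: str, query: str) -> bool:
    """Match if any lowercase query term appears among the field's terms.

    Simplified stand-in for full-text analysis (lowercasing + whitespace split).
    """
    field_terms = set(field_value.lower().split())
    query_terms = set(query.lower().split())
    return bool(field_terms & query_terms)


# A keyword field only matches the complete value.
print(keyword_match("machine-learning-zoomcamp", "machine-learning-zoomcamp"))
print(keyword_match("machine-learning-zoomcamp", "zoomcamp"))

# A text field can match on a single shared term ("docker").
print(text_match("How do I run Docker on Windows?", "docker container"))
```

This is why `course` works well as a keyword (it is used for exact filtering), while `question`, `text`, and `section` benefit from full-text matching.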

### Getting the data

Now let's get the FAQ data. You can run this snippet:

In [8]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [9]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.info()

ObjectApiResponse({'name': '20bec6a63766', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'NxrUbXmpRLGayupMtIhmpQ', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [10]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)

In [11]:
print(response)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'}


## Q3. Searching

Now let's index the documents and search our index.

We will run the query "How do I execute a command in a running docker container?".

Use only `question` and `text` fields and give `question` a boost of 4, and use `"type": "best_fields"`.
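
As a rough mental model of how these settings combine, assuming the default `tie_breaker` of 0: `best_fields` scores a document by its single best-matching field, and a boost such as `question^4` multiplies that field's score. The per-field scores below are made-up numbers for illustration, not real Elasticsearch output, and `best_fields_score` is a hypothetical helper.

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Combine per-field scores the way best_fields roughly does.

    field_scores: mapping of field name -> raw relevance score
    boosts:       mapping of field name -> multiplier (default 1.0)
    tie_breaker:  weight given to the non-best fields (0.0 ignores them)
    """
    boosted = [
        score * boosts.get(field, 1.0)
        for field, score in field_scores.items()
    ]
    best = max(boosted)
    return best + tie_breaker * (sum(boosted) - best)


# Made-up raw scores: the text field matches better before boosting.
field_scores = {"question": 5.0, "text": 12.0}
boosts = {"question": 4.0}

# question: 5.0 * 4 = 20.0, text: 12.0 * 1 = 12.0, so the best field wins.
print(best_fields_score(field_scores, boosts))  # 20.0
```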

In [12]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]

In [13]:
user_question = "How do I execute a command in a running docker container?"

In [14]:
search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^4", "text"],
                    "type": "best_fields"
                }
            },
        }
    }
}

In [15]:
response = es.search(index=index_name, body=search_query)

result_docs = []

for hit in response['hits']['hits']:
    result_docs.append(hit['_source'])

In [16]:
print(response['hits']['max_score'])

83.243706


## Q4. Filtering

Now let's limit the search to questions from `machine-learning-zoomcamp`.

Return 3 results. What's the 3rd question returned by the search engine?

In [17]:
search_query = {
    "size": 3,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^4", "text"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "machine-learning-zoomcamp"
                }
            }
        }
    }
}

response = es.search(index=index_name, body=search_query)

result_docs = []

for hit in response['hits']['hits']:
    result_docs.append(hit['_source'])

In [18]:
print("len(result_docs):", len(result_docs))

len(result_docs): 3


In [19]:
import json
print(json.dumps(result_docs[2], indent=4))

{
    "text": "You can copy files from your local machine into a Docker container using the docker cp command. Here's how to do it:\nIn the Dockerfile, you can provide the folder containing the files that you want to copy over. The basic syntax is as follows:\nCOPY [\"src/predict.py\", \"models/xgb_model.bin\", \"./\"]\t\t\t\t\t\t\t\t\t\t\tGopakumar Gopinathan",
    "section": "5. Deploying Machine Learning Models",
    "question": "How do I copy files from a different folder into docker container\u2019s working directory?",
    "course": "machine-learning-zoomcamp"
}


In [20]:
print(result_docs[2]["question"])

How do I copy files from a different folder into docker container’s working directory?
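
The Q3 and Q4 queries differ only in `size` and the optional `filter` clause, so one way to avoid repeating the dictionary is a small builder function. This is a sketch under that assumption; `build_search_query` is a hypothetical helper name, and the result would be passed to `es.search` exactly as in the cells above.

```python
def build_search_query(question, size=5, course=None):
    """Build the multi_match query body, optionally filtered by course."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": question,
                "fields": ["question^4", "text"],
                "type": "best_fields",
            }
        }
    }

    # The filter restricts the candidate documents without affecting scores.
    if course is not None:
        bool_query["filter"] = {"term": {"course": course}}

    return {"size": size, "query": {"bool": bool_query}}


question = "How do I execute a command in a running docker container?"

unfiltered_query = build_search_query(question)
filtered_query = build_search_query(
    question, size=3, course="machine-learning-zoomcamp"
)

print(filtered_query["query"]["bool"]["filter"])
```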


## Q5. Building a prompt

Now we're ready to build a prompt to send to an LLM. 

In [21]:
context_template = """
Q: {question}
A: {text}
""".strip()

context = ""
for doc in result_docs:
    temp_context_item = context_template.format(question=doc['question'], text=doc['text'])
    context = context + f'{temp_context_item}\n\n'


prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

query = "How do I execute a command in a running docker container?"

prompt = prompt_template.format(question=query, context=context).strip()


In [22]:
print(len(prompt))

1462
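
The prompt-building steps above can also be wrapped in reusable functions. The sketch below uses the same two templates but joins the context entries with `"\n\n".join(...)` instead of string concatenation, which avoids the trailing blank lines that otherwise need stripping. The function names `build_context` and `build_prompt` and the two sample documents are hypothetical, used only for illustration.

```python
CONTEXT_TEMPLATE = "Q: {question}\nA: {text}"

PROMPT_TEMPLATE = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()


def build_context(docs):
    """Format each document as a Q/A pair, separated by blank lines."""
    entries = [
        CONTEXT_TEMPLATE.format(question=doc["question"], text=doc["text"])
        for doc in docs
    ]
    return "\n\n".join(entries)


def build_prompt(question, docs):
    """Fill the prompt template with the user question and the context."""
    return PROMPT_TEMPLATE.format(question=question, context=build_context(docs))


# Made-up sample documents standing in for search results.
sample_docs = [
    {"question": "Q one?", "text": "A one."},
    {"question": "Q two?", "text": "A two."},
]

sample_prompt = build_prompt("How?", sample_docs)
print(sample_prompt)
```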


## Q6. Tokens

When we use the OpenAI Platform, we're charged by the number of 
tokens we send in our prompt and receive in the response.
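
The billing model can be sketched as a small function. This is a minimal illustration that assumes prices are quoted per 1,000 tokens, with separate rates for input and output. The function name `estimate_cost` and the prices in the example call are hypothetical placeholders, not actual OpenAI rates.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one request from its token counts.

    Prices are assumed to be quoted per 1,000 tokens, so token counts
    are divided by 1,000 before multiplying by the price.
    """
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices: 0.01 per 1K input tokens, 0.03 per 1K output tokens.
cost = estimate_cost(
    input_tokens=1000,
    output_tokens=500,
    input_price_per_1k=0.01,
    output_price_per_1k=0.03,
)
print(round(cost, 4))
```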

In [23]:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4o")

In [24]:
encoded_prompt = encoding.encode(prompt)
print(len(encoded_prompt))

322


In [25]:
decoded_word = encoding.decode_single_token_bytes(63842)
print(decoded_word)


b"You're"


## Bonus: generating the answer (ungraded)

In [26]:
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables (including OPENAI_API_KEY) from a .env file
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=OPENAI_API_KEY)


response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": prompt}]
)

output_llm_top_message = response.choices[0].message.content


In [27]:
print(output_llm_top_message)

To execute a command in a running docker container, you need to identify the container's ID and then use the `docker exec` command. Here are the steps:

1. Find the container ID:
   ```sh
   docker ps
   ```
   This command lists all running containers along with their IDs.

2. Execute a command in the specific container:
   ```sh
   docker exec -it <container-id> <command>
   ```
   For example, to start a bash session within the container, use:
   ```sh
   docker exec -it <container-id> bash
   ```

Replace `<container-id>` with the actual ID of your running container and `<command>` with the command you want to execute.


## Bonus: calculating the costs (ungraded)

In [30]:
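# Assumption: these are prices in USD per 1,000 tokens
# (not per single token, despite the constant names)
INPUT_COST_PER_TOKEN = 0.005
OUTPUT_COST_PER_TOKEN = 0.015

N_REQUESTS = 1000
INPUT_TOKEN_PER_REQUEST = 150
OUTPUT_TOKEN_PER_REQUEST = 250

# Divide by 1,000 because the prices are quoted per 1,000 tokens
total_amount = N_REQUESTS * (INPUT_COST_PER_TOKEN * INPUT_TOKEN_PER_REQUEST + OUTPUT_TOKEN_PER_REQUEST * OUTPUT_COST_PER_TOKEN) / 1000
print(total_amount)

4.5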


In [31]:
encoded_prompt = encoding.encode(prompt)
length_input_token = len(encoded_prompt)

encoded_output_llm_top_message = encoding.encode(output_llm_top_message)
length_output_token = len(encoded_output_llm_top_message)

# Assumption: the prices are per 1,000 tokens, so divide by 1,000 to get USD
total_amount_q6_q7 = (length_input_token * INPUT_COST_PER_TOKEN + length_output_token * OUTPUT_COST_PER_TOKEN) / 1000
print(round(total_amount_q6_q7, 5))

0.00383
