## Setup for Elasticsearch
This demo is intended for use with the `start_local` Docker image, which provides a ready-made local instance of Elastic's offerings, including Elasticsearch and Kibana, to get started.

Run this command from a terminal to get started:

```bash
curl -fsSL https://elastic.co/start-local | sh
```

Note that you must have Docker Desktop or similar installed and running first.

After the script finishes installing and starting the containers, you'll see a few key pieces of information:
- Elastic Password - this is used to log into Kibana.
- Elastic API Key - this is unique to your Docker instance.

You'll need these for the code below to work properly and to use the Kibana visualizations and navigation.
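
Rather than pasting the API key into the notebook, one option is to read it from the environment. A minimal sketch, assuming the key has been exported under a variable named `ES_LOCAL_API_KEY`; both that variable name and the `load_api_key` helper are assumptions made for illustration, not something the setup steps define:

```python
import os


def load_api_key(env=os.environ, var_name="ES_LOCAL_API_KEY"):
    """Return the API key stored in `env`, or raise a clear error if it is missing.

    `var_name` is a hypothetical variable name; use whatever name you exported.
    """
    api_key = env.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{var_name} is not set; export the API key printed by the setup script first."
        )
    return api_key
```

The returned value could then be passed as `api_key=load_api_key()` when constructing the client, which keeps the key out of the saved notebook.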

In [None]:
import os
import base64
import json
import PyPDF2
from elasticsearch import Elasticsearch, helpers

client = Elasticsearch(
    hosts=["http://localhost:9200"],
    # This is the Elastic API Key you see from running the start_local docker image
    api_key="SnppODJKVUJEUXJXcENRZDFndTA6LTJQclVnbllUNE9LLTJiUHJqSnBRZw==",
)

# options() returns a new client rather than modifying this one, so keep the result
client = client.options(request_timeout=60*3)
resp = client.ping()
print(resp)

def output(data):
    json_data = json.dumps(data.body, indent=4)
    print(json_data)

## Simple (non-optimized) pdf to text converter

In [None]:
def pdf_to_text(pdf_path):
    # Open the PDF file in read-binary mode
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        # Extract each page's text; fall back to '' for pages with no extractable text
        page_texts = [page.extract_text() or '' for page in pdf_reader.pages]

    # Join once at the end instead of repeatedly concatenating strings
    return ''.join(page_texts)

## Create an index to store our pdfs
Note that `content` will store the pdf text and `bbq_vector` will store an e5 dense vector. BBQ (Better Binary Quantization) is activated by setting the `index_options` `type` to `bbq_hnsw`.

In [None]:
bbq_index = "bbq-pdf-embeddings-text"
if not client.indices.exists(index=bbq_index):
    resp = client.indices.create(
        index=bbq_index,
        mappings={
            "properties": {
                "content": {
                    "type": "text",
                },
                "bbq_vector": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index_options": {
                        "type": "bbq_hnsw"
                    }
                }
            }
        },
    )
    output(resp)

else:
    print(f'The index {bbq_index} already exists.')
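
To get a feel for why quantization matters, the arithmetic below compares raw `float32` storage for 384-dimensional vectors with a 1-bit-per-dimension encoding, which is the core idea behind BBQ. This is a back-of-the-envelope sketch only: it ignores the small per-vector correction values and the HNSW graph that Elasticsearch also stores, and the corpus size is a made-up figure.

```python
DIMS = 384  # matches the `dims` value in the mapping above


def raw_vector_bytes(dims: int) -> int:
    """Bytes needed for one float32 vector (4 bytes per dimension)."""
    return dims * 4


def one_bit_vector_bytes(dims: int) -> int:
    """Bytes needed for one vector at 1 bit per dimension, rounded up."""
    return (dims + 7) // 8


num_vectors = 1_000_000  # hypothetical corpus size, for illustration only
raw_total = raw_vector_bytes(DIMS) * num_vectors
bit_total = one_bit_vector_bytes(DIMS) * num_vectors

print(f"float32: {raw_total / 1e6:.0f} MB")
print(f"1-bit:   {bit_total / 1e6:.0f} MB")
print(f"ratio:   {raw_total / bit_total:.0f}x")
```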

## Create an Inference Endpoint
This will create an inference endpoint named `bbq-e5-model` that will allow us to vectorize our pdf file contents. If you are running this for the first time, download the `.multilingual-e5-small` model in Kibana under [Trained Models](http://localhost:5601/app/ml/trained_models) first.

In [None]:

from elasticsearch import NotFoundError

inference_id = "bbq-e5-model"

# Remove any existing endpoint so it can be recreated with the settings below
try:
    client.inference.delete(
        task_type="text_embedding",
        inference_id=inference_id,
        force=True
    )
    print(f'Deleted the existing {inference_id} endpoint, recreating it now.')
except NotFoundError:
    print(f'{inference_id} doesn\'t exist, creating it now.')

resp = client.inference.put(
    task_type="text_embedding",
    inference_id=inference_id,
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": ".multilingual-e5-small"
        },
        "chunking_settings": {
            "strategy": "sentence",
            "max_chunk_size": 25,
            "sentence_overlap": 1
        }
    },
)
output(resp)
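
The `chunking_settings` above ask for sentence-based chunks of roughly 25 words with one sentence of overlap. The sketch below is a rough, pure-Python approximation of that idea, meant only to build intuition. It is not Elasticsearch's implementation: the regex sentence splitter, the whitespace word count, and the `chunk_sentences` helper itself are all assumptions made for illustration.

```python
import re


def chunk_sentences(text, max_chunk_size=25, sentence_overlap=1):
    """Greedily pack sentences into chunks, aiming for `max_chunk_size` words.

    Each new chunk starts with the last `sentence_overlap` sentences of the
    previous chunk, so a chunk can exceed the target when the overlap is long.
    A single sentence longer than the target is kept whole.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        word_count = sum(len(s.split()) for s in candidate)
        if current and word_count > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one
            current = current[-sentence_overlap:] if sentence_overlap else []
            current.append(sentence)
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Calling `chunk_sentences(text, max_chunk_size=25, sentence_overlap=1)` on a short paragraph shows how neighbouring chunks share a sentence, which helps a query match text that straddles a chunk boundary.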

## Create an ingest pipeline
This ensures that the text in each document's `content` field is sent to the inference endpoint and that the resulting embedding is stored in the `bbq_vector` field.

In [None]:

resp = client.ingest.put_pipeline(
    id="my_bbq_inference_pipeline",
    processors=[
        {
            "inference": {
                "model_id": "bbq-e5-model",
                "input_output": [
                    {
                        "input_field": "content",
                        "output_field": "bbq_vector"
                    }
                ]
            }
        }
    ],
)

output(resp)

## Simulate the inference pipeline in action
Running this performs a dry run of the ingest pipeline. No data is saved, but the response shows what the pipeline would produce for the document.

In [None]:
text = pdf_to_text("WAC-SMALL/WAC 245.pdf")

resp = client.ingest.simulate(
    id="my_bbq_inference_pipeline",
    docs=[
        {
            "_source": {
                "content": text
            }
        }
    ],
)

output(resp)
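
Printing the full simulate response includes every float of the embedding, which is hard to read. The helper below is a small sketch that condenses it to a per-document summary. It assumes the response body has the shape `{"docs": [{"doc": {"_source": {...}}}]}`, with failures reported under an `error` key, and that `bbq_vector` comes back as a flat list of numbers; `summarize_simulation` is a hypothetical name introduced here.

```python
def summarize_simulation(body):
    """Return one short summary dict per simulated document."""
    summaries = []
    for item in body.get("docs", []):
        if "error" in item:
            # The pipeline failed for this document; keep only the reason
            summaries.append({"ok": False, "reason": item["error"].get("reason")})
            continue
        source = item["doc"]["_source"]
        vector = source.get("bbq_vector") or []
        summaries.append({
            "ok": True,
            "content_chars": len(source.get("content", "")),
            "vector_dims": len(vector),
        })
    return summaries
```

With the response from the cell above, `summarize_simulation(resp.body)` should report a `vector_dims` value equal to the `dims` set in the index mapping if the pipeline worked as intended.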

## Bulk helper function
This generator function yields one bulk action per pdf file, so documents can be sent to Elasticsearch in batches for faster processing.

In [None]:
# Function to generate actions for the bulk API
def generate_actions(pdf_dir, index):
    for filename in os.listdir(pdf_dir):
        if not filename.lower().endswith(".pdf"):
            continue
        file_path = os.path.join(pdf_dir, filename)
        print(f'------------------------------\nProcessing {file_path}...')
        try:
            text = pdf_to_text(file_path)
        except Exception as e:
            # Report the failing file and continue with the remaining pdfs
            print(f'There was an error processing {file_path}: {e}')
            continue
        print(len(text))
        yield {
            "_index": index,
            "_source": {
                "content": text
            }
        }
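
Because the actions yielded above carry no `_id`, Elasticsearch assigns a random one, so running the indexing step twice stores every pdf twice. One possible variation, sketched here with a hypothetical `make_action` helper, uses the file name as the `_id` so that a re-run overwrites the earlier copy instead:

```python
import os


def make_action(index, file_path, text):
    """Build one bulk action whose `_id` is the pdf's file name."""
    return {
        "_index": index,
        # Same file name -> same document id, so re-indexing replaces the old copy
        "_id": os.path.basename(file_path),
        "_source": {"content": text},
    }
```

This assumes file names are unique within the folder being indexed; if pdfs from several folders share names, a longer key such as the relative path would be needed.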
       

## Index the docs!
This will index every pdf file in the `WAC-SMALL` folder.

In [None]:
actions = generate_actions("WAC-SMALL/", "bbq-pdf-embeddings-text")

helpers.bulk(client, actions, pipeline="my_bbq_inference_pipeline", request_timeout=60*3)

print("Finished indexing PDF documents.")

## Delete All Documents in the Index

In [None]:
resp = client.delete_by_query(
    index="bbq-pdf-embeddings-text",
    # Pass the query directly; recent client versions prefer this over `body`
    query={"match_all": {}},
    refresh=True  # optional: forces immediate visibility
)

print(f"✅ Deleted {resp['deleted']} documents from bbq-pdf-embeddings-text")