# Elastic API prototype
## Steps 
1. Generate docs
2. Push docs to elastic
3. Inference agent
4. Search



# 1. Generate docs
Loads the API specs and guides and converts them into NDJSON files that Elasticsearch can ingest.

Sources
- all guides from https://github.com/elastic/elasticsearch/tree/main/docs/reference
- api spec from https://github.com/elastic/elasticsearch/tree/main/rest-api-spec

In [None]:
# TODO: Add generation using openai to improve search

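# Placeholders: each raises until the OpenAI-based generation is added

def generate_tags(text):
    raise NotImplementedError("generate_tags is not implemented yet")

def generate_target_outcomes(text):
    raise NotImplementedError("generate_target_outcomes is not implemented yet")

def generate_summary(text):
    raise NotImplementedError("generate_summary is not implemented yet")

def generate_questions_this_answers(text):
    raise NotImplementedError("generate_questions_this_answers is not implemented yet")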

### Transformers

#### guide to doc

In [None]:
from datetime import datetime
import json
import re


elastic_host = "https://www.elastic.co/guide/en/elasticsearch/reference/current/"


def extract_text_between_markers(text):
    match = re.search(r'\[\[(.*?)\]\]', text)
    return match.group(1) if match else "_"

def extract_role(text):
    return re.findall(r'\[role="([^"]+)"\]', text)

def includes_code(text):
    return "--------------------------------------------------" in text

def transform_documentation_page_to_doc(source):
    # Page id from the first [[...]] marker; "_" when the page has none
    doc_title = extract_text_between_markers(source)

    doc = {
        "meta": {
            "timestamp": datetime.utcnow().isoformat(),
            "size": len(source),

            "url": elastic_host + doc_title + ".html",
            "type": "documentation",
            "role": extract_role(source),
            "has_code": includes_code(source),
            "title": doc_title,
            "version": "8.15",

            # "tag": generate_tags(source),
            # "outcomes": generate_target_outcomes(source),
            # "summary": generate_summary(source),
            # "questions": generate_questions_this_answers(source)
        },
        "doc": str(source),
    }

    return json.dumps(doc, indent=4)
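
As a quick check of the marker extraction above, the sketch below runs the same regular expressions on a small AsciiDoc-style snippet. The sample page and the `CODE_DELIMITER` name are made up for illustration; the delimiter length of 50 dashes is an assumption about the listing delimiter used in the guides.

```python
import re

# Assumed length of the AsciiDoc listing delimiter (hypothetical constant)
CODE_DELIMITER = "-" * 50

# Hypothetical page content, not taken from the Elasticsearch docs
sample_page = "\n".join([
    '[role="xpack"]',
    '[[example-page]]',
    '=== Example page',
    '',
    '[source,console]',
    CODE_DELIMITER,
    'GET /_cluster/health',
    CODE_DELIMITER,
])

# Same patterns as extract_text_between_markers and extract_role
anchor = re.search(r'\[\[(.*?)\]\]', sample_page)
page_id = anchor.group(1) if anchor else "_"
roles = re.findall(r'\[role="([^"]+)"\]', sample_page)
has_code = CODE_DELIMITER in sample_page

print(page_id, roles, has_code)
```

Pages without a `[[...]]` marker all fall back to the id `_`, so they would share one URL and one output filename.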

    

#### API spec to doc

In [None]:
import json
from datetime import datetime
import os


api_specification_details = """

The specification contains:

* The _name_ of the API (`indices.create`), which usually corresponds to the client calls
* Link to the documentation at the <http://elastic.co> website.

  **IMPORTANT:** This should be a _live_ link. Several downstream ES clients use
  this link to generate their documentation. Using a broken link or linking to
  yet-to-be-created doc pages can break the [Elastic docs
  build](https://github.com/elastic/docs#building-documentation).
* `stability` indicating the state of the API, has to be declared explicitly or YAML tests will fail
    * `experimental` highly likely to break in the near future (minor/patch), no bwc guarantees.
    Possibly removed in the future.
    * `beta` less likely to break or be removed but still reserve the right to do so
    * `stable` No backwards breaking changes in a minor
* Request URL: HTTP method, path and parts
* Request parameters
* Request body specification

**NOTE**
If an API is stable but its response should be treated as an arbitrary map of key values, please notate this as follows:

```json
{
  "api.name": {
    "stability" : "stable",
    "response": {
      "treat_json_as_key_value" : true
    }
  }
}
```

## Type definition
In the documentation, you will find the `type` field, which documents which type every parameter will accept.

#### Querystring parameters
| Type  | Description  |
|---|---|
| `list`  | An array of strings *(represented as a comma separated list in the querystring)* |
| `date` | A string representing a date formatted in ISO8601 or a number representing milliseconds since the epoch *(used only in ML)*   |
| `time` | A numeric or string value representing duration |
| `string` | A string value  |
| `enum` | A set of named constants *(a single value should be sent in the querystring)*  |
| `int` | A signed 32-bit integer with a minimum value of -2<sup>31</sup> and a maximum value of 2<sup>31</sup>-1.  |
| `double` | A [double-precision 64-bit IEEE 754](https://en.wikipedia.org/wiki/Floating-point_arithmetic) floating point number, restricted to finite values.  |
| `long` | A signed 64-bit integer with a minimum value of -2<sup>63</sup> and a maximum value of 2<sup>63</sup>-1. *(Note: the max safe integer for JSON is 2<sup>53</sup>-1)* |
| `number` | Alias for `double`. *(deprecated, a more specific type should be used)*  |
| `boolean` | Boolean fields accept JSON true and false values  |

{
  "documentation" : {
    "description": "Parameters that are accepted by all API endpoints.",
    "url": "https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html"
  },
  "params": {
    "pretty": {
      "type": "boolean",
      "description": "Pretty format the returned JSON response.",
      "default": false
    },
    "human": {
      "type": "boolean",
      "description": "Return human readable values for statistics.",
      "default": true
    },
    "error_trace": {
      "type": "boolean",
      "description": "Include the stack trace of returned errors.",
      "default": false
    },
    "source": {
      "type": "string",
      "description": "The URL-encoded request definition. Useful for libraries that do not accept a request body for non-POST requests."
    },
    "filter_path": {
      "type": "list",
      "description": "A comma-separated list of filters used to reduce the response."
    }
  }
}
"""



def transform_api_spec_to_doc(api_spec, elastic_host="https://www.elastic.co"):
    """Extracts relevant info from an API specification JSON file."""

    spec_as_json = json.loads(api_spec)
    
    api_name = list(spec_as_json.keys())[0]
    source = spec_as_json[api_name]
    
    # Extracting main details
    doc_title = source.get("documentation", {}).get("description", "")
    doc_url = source.get("documentation", {}).get("url", "")
    stability = source.get("stability", "")
    response_key_value = source.get("response", {}).get("treat_json_as_key_value", False)
    visibility = source.get("visibility", "public")
    url_paths = source.get("url", {}).get("paths", [])
    params = source.get("params", {})

    
    # Organizing parameters with their type descriptions
    param_types = {param: params[param].get("type", "unknown") for param in params}
    
    # Structuring the output document
    doc = {
        "meta": {
            "timestamp": datetime.utcnow().isoformat(),
            "api_name": api_name,
            "stability": stability,
            "visibility": visibility,
            # First dotted segment, or the whole name when there is no dot
            "main_component": api_name.split(".")[0],
            "url": doc_url,
            "elastic_url": f"{elastic_host}/{api_name.replace('.', '/')}.html",
            "treat_json_as_key_value": response_key_value,
            "title": doc_title,
            "paths": str(url_paths),
            "parameter_types": str(param_types),
            "type": "api_spec",
        },
        "doc": str(source),
    }
    
    return json.dumps(doc, indent=4)
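
To make the extraction above concrete, here is a minimal sketch that pulls the same fields out of a trimmed-down spec. The sample spec, its URL, and its parameters are hypothetical and only mirror the shape described in the specification notes; real spec files carry more keys.

```python
import json

# Hypothetical, trimmed-down spec used only for illustration
sample_spec = json.dumps({
    "indices.create": {
        "documentation": {
            "url": "https://example.invalid/indices-create.html",
            "description": "Creates an index.",
        },
        "stability": "stable",
        "url": {"paths": [{"path": "/{index}", "methods": ["PUT"]}]},
        "params": {
            "timeout": {"type": "time"},
            "wait_for_active_shards": {"type": "string"},
        },
    }
})

spec = json.loads(sample_spec)
api_name = next(iter(spec))  # the single top-level key is the API name
source = spec[api_name]

main_component = api_name.split(".")[0]
param_types = {
    name: details.get("type", "unknown")
    for name, details in source.get("params", {}).items()
}
visibility = source.get("visibility", "public")  # default when the key is absent

print(api_name, main_component, param_types, visibility)
```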




### File management utilities 

In [None]:
import os
import json

def generate_ndjson_from_docs_folder(input_folder, output_file='combined.ndjson'):
    """
    Loop through all JSON files in a specified folder and write their contents to a single NDJSON file.

    :param input_folder: Path to the folder containing JSON files.
    :param output_file: Path to the output NDJSON file.
    """
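    # open() does not create missing folders, so create the output folder first
    output_dir = os.path.dirname(output_file)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    with open(output_file, 'w', encoding='utf-8') as ndjson_file:
        # Sort so the line order is the same on every run
        for filename in sorted(os.listdir(input_folder)):
            if filename.endswith('.json'):
                file_path = os.path.join(input_folder, filename)

                with open(file_path, 'r', encoding='utf-8') as json_file:
                    data = json.load(json_file)
                    # One compact JSON object per line
                    ndjson_file.write(json.dumps(data) + '\n')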
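
The NDJSON layout written above is one compact JSON object per line. The sketch below, using made-up documents and a temporary folder, shows a round trip and why embedded newlines in a field do not break the format: `json.dumps` escapes them.

```python
import json
import os
import tempfile

# Made-up documents; the second one contains a newline inside a field
docs = [
    {"meta": {"api_name": "indices.create"}, "doc": "first"},
    {"meta": {"api_name": "search"}, "doc": "second\nwith a newline"},
]

with tempfile.TemporaryDirectory() as tmp:
    ndjson_path = os.path.join(tmp, "sample.ndjson")

    # Write: one JSON object per line
    with open(ndjson_path, "w", encoding="utf-8") as out:
        for doc in docs:
            out.write(json.dumps(doc) + "\n")

    # Read the raw lines back before the temporary folder is removed
    with open(ndjson_path, "r", encoding="utf-8") as src:
        lines = src.read().splitlines()

round_tripped = [json.loads(line) for line in lines]
print(len(lines), round_tripped == docs)
```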



In [None]:
import os


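def ensure_path_exists(path):
    # A path with a file extension is treated as a file: create its parent folder
    if os.path.splitext(path)[1]:
        dir_path = os.path.dirname(path)
    else:
        dir_path = path

    # dir_path is empty for a bare filename in the current directory
    if dir_path:
        os.makedirs(dir_path, exist_ok=True)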

def read_file_with_fallback(file_path):
    try:
        # Attempt reading with UTF-8 encoding
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        # Fallback to reading as binary and decoding manually
        with open(file_path, 'rb') as f:
            return f.read().decode('utf-8', errors='replace')  # Replace un-decodable chars

def create_doc_from_file(filepath, transform, create_doc_filename):
    content = read_file_with_fallback(filepath)
    processed_content = transform(content)
    doc_filename, component = create_doc_filename(content)

    output_path = f"_docs/{component}/{doc_filename}.json"
    ensure_path_exists(output_path)
    with open(output_path, 'w', encoding='utf-8') as output_file:
        # transform() already returns a JSON string
        output_file.write(processed_content)

def process_files(folder_path, transform, create_doc_filename):
    # Walk the folder recursively and convert every file found
    for root, _, files in os.walk(folder_path):
        for each in files:
            file_path = os.path.join(root, each)
            create_doc_from_file(file_path, transform, create_doc_filename)
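
The decoding fallback in `read_file_with_fallback` can be illustrated without touching the disk. This is a small sketch with made-up bytes: a Latin-1 encoded character is not valid UTF-8, so the strict decode fails and the lenient decode substitutes the Unicode replacement character.

```python
# Made-up input: "café index" encoded as Latin-1, which is invalid UTF-8
raw = b"caf\xe9 index"

try:
    text = raw.decode("utf-8")
    used_fallback = False
except UnicodeDecodeError:
    # Same strategy as the fallback branch: keep going, mark the bad byte
    text = raw.decode("utf-8", errors="replace")
    used_fallback = True

print(used_fallback, text)
```

The replaced character is lost for good, so files that hit the fallback may need a closer look if exact text matters for search.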




### Generate the documents from raw content and save to ndjson 

In [None]:
def define_api_spec_doc_title(content):
    spec_as_json = json.loads(content)
    return list(spec_as_json.keys())[0], "api-spec"

def define_documentation_doc_title(content):
    return extract_text_between_markers(content), "guides"



documentation_folder_path = "data/documentation"
process_files(
    documentation_folder_path, 
    transform_documentation_page_to_doc,
    define_documentation_doc_title,
)

rest_api_doc_folder_path = "data/documentation/rest-api"
process_files(
    rest_api_doc_folder_path,
    transform_api_spec_to_doc,
    define_api_spec_doc_title,
)

generate_ndjson_from_docs_folder('_docs/api-spec', 'docs/api-spec.ndjson')
generate_ndjson_from_docs_folder('_docs/guides', 'docs/guides.ndjson')


# 2. Push docs to elastic


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
from elasticsearch import Elasticsearch
import os
from dotenv import load_dotenv

load_dotenv()

ES_LOCAL_HOST = os.getenv("ES_LOCAL_HOST")
ES_LOCAL_API_KEY = os.getenv("ES_LOCAL_API_KEY")


client = Elasticsearch(
    ES_LOCAL_HOST,
    api_key=ES_LOCAL_API_KEY,
)
client.info()

In [None]:
# client.indices.delete(index=index_name, ignore_unavailable=True)


In [None]:
index_name = "api-spec"


# One 384-dimensional vector field per embedded text field
# (384 is the output size of all-MiniLM-L6-v2)
mappings = {
    "properties": {
        "api_name_vector": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        },
        "paths_vector": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        },
        "title_vector": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        },
        "doc_vector": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        },
    }
}

client.indices.create(index=index_name, mappings=mappings)
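
The mapping above selects `cosine` as the similarity for every vector field. As a rough illustration of what that metric compares, here is a plain-Python sketch on toy 3-dimensional vectors (the real embeddings have 384 dimensions). The function and the vectors are illustrative only and are not how Elasticsearch computes scores internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # longer, but points the same way
orthogonal = [0.0, 1.0, 0.0]       # shares no component with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Cosine similarity ignores vector length and only compares direction, which is why it is a common choice for sentence embeddings.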

In [None]:
import json

def process_ndjson(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield json.loads(line)


operations = []
for spec in process_ndjson("docs/api-spec.ndjson"):
    print(spec["meta"]["api_name"])

    # Embed each searchable field into its matching dense_vector field
    spec["api_name_vector"] = model.encode(spec["meta"]["api_name"]).tolist()
    spec["paths_vector"] = model.encode(spec["meta"]["paths"]).tolist()
    spec["title_vector"] = model.encode(spec["meta"]["title"]).tolist()
    spec["doc_vector"] = model.encode(spec["doc"]).tolist()

    # The bulk API expects an action entry followed by the document itself
    operations.append({"index": {"_index": index_name}})
    operations.append(spec)

# client.bulk(index=index_name, operations=operations, refresh=True)
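
The `operations` list passed to the bulk call alternates an action entry with the document it applies to. The sketch below shows that layout without needing a cluster or a model: `fake_encode` and `build_bulk_operations` are hypothetical helpers, and `fake_encode` is only a stand-in for `model.encode(...).tolist()` that returns a short, meaningless vector.

```python
def fake_encode(text, dims=4):
    """Hypothetical stand-in for model.encode(); returns `dims` floats."""
    return [float(len(text) % (i + 2)) for i in range(dims)]

def build_bulk_operations(specs, index_name):
    """Return the alternating action/document list used for bulk indexing."""
    operations = []
    for spec in specs:
        doc = dict(spec)  # shallow copy so the input dict is not mutated
        doc["api_name_vector"] = fake_encode(spec["meta"]["api_name"])
        operations.append({"index": {"_index": index_name}})
        operations.append(doc)
    return operations

# Made-up specs for illustration
specs = [
    {"meta": {"api_name": "indices.create"}, "doc": "spec body"},
    {"meta": {"api_name": "search"}, "doc": "spec body"},
]
operations = build_bulk_operations(specs, "api-spec")

for entry in operations:
    print(list(entry.keys()))
```

Every document contributes exactly two entries, so a list with an odd length points to a missing action or a missing document.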

# Training


- https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb
- https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/03-ELSER.ipynb
- https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/cohere/inference-cohere.ipynb
- https://www.elastic.co/training/elasticsearch-engineer
- https://www.elastic.co/training/elastic-certified-engineer-exam
- https://github.com/mr1716/Elastic-Certified-Engineer-Exam-8.1
- https://github.com/LisaHJung/Beginners-guide-to-creating-a-full-stack-JavaScript-app-with-Elasticsearch/tree/main
- https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api.html




### Data Management

- Define an index that satisfies a given set of requirements
- Define and use an index template for a given pattern that satisfies a given set of requirements
- Define and use a dynamic template that satisfies a given set of requirements
- Define an Index Lifecycle Management policy for a time-series index
- Define an index template that creates a new data stream

### Searching Data

- Write and execute a search query for terms and/or phrases in one or more fields of an index
- Write and execute a search query that is a Boolean combination of multiple queries and filters
- Write an asynchronous search
- Write and execute metric and bucket aggregations
- Write and execute aggregations that contain sub-aggregations
- Write and execute a query that searches across multiple clusters
- Write and execute a search that utilizes a runtime field

### Developing Search Applications

- Highlight the search terms in the response of a query
- Sort the results of a query by a given set of requirements
- Implement pagination of the results of a search query
- Define and use index aliases
- Define and use a search template

### Data Processing

- Define a mapping that satisfies a given set of requirements
- Define and use a custom analyzer that satisfies a given set of requirements
- Define and use multi-fields with different data types and/or analyzers
- Use the Reindex API and Update By Query API to reindex and/or update documents
- Define and use an ingest pipeline that satisfies a given set of requirements, including the use of Painless to modify documents
- Define runtime fields to retrieve custom values using Painless scripting

### Cluster Management

- Diagnose shard issues and repair a cluster's health
- Backup and restore a cluster and/or specific indices
- Configure a snapshot to be searchable
- Configure a cluster for cross-cluster search
- Implement cross-cluster replication