# OpenSearch

A brief tutorial on the basic usage of OpenSearch, worked through a few examples.

Index:
- OpenSearch API
- Mapping
- Operations
- Clients



## OpenSearch API

OpenSearch exposes a set of REST APIs to check the state of the cluster and to perform operations on it.

In this section we use some of these APIs to perform CRUD operations.

### Check cluster health

The ***_cluster/health*** endpoint provides a summary of the OpenSearch cluster and its status.
It is possible to retrieve this information through:
```sh
curl -X GET "https://localhost:9200/_cluster/health" -ku admin:<custom-admin-password>
```

In [52]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.get("https://localhost:9200/_cluster/health", verify=False, auth=HTTPBasicAuth("admin", "admin"))
print(json.dumps(r.json(), indent=4))

{
    "cluster_name": "opensearch-cluster",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 2,
    "number_of_data_nodes": 2,
    "discovered_master": true,
    "discovered_cluster_manager": true,
    "active_primary_shards": 12,
    "active_shards": 24,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}


The ***_cluster/health*** endpoint provides information about:
- cluster configuration: cluster name, number of nodes, etc.
- status: green/yellow/red

More information [here](https://opensearch.org/docs/1.1/opensearch/rest-api/cluster-health/)
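
The three status values have precise meanings: green means every primary and replica shard is allocated, yellow means all primary shards are allocated but at least one replica is not, and red means at least one primary shard is not allocated. The following is a minimal sketch of how the response shown above could be summarized; the helper name `describe_health` and the `sample` dictionary are illustrative and not part of any OpenSearch API.

```python
# Hypothetical helper: interprets the JSON body returned by GET /_cluster/health.
STATUS_MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primary shards are allocated, but some replicas are not",
    "red": "at least one primary shard is not allocated",
}


def describe_health(health: dict) -> str:
    """Return a one-line, human-readable summary of a cluster health response."""
    status = health["status"]
    meaning = STATUS_MEANING.get(status, "unknown status")
    return (
        f"{health['cluster_name']}: {status} ({meaning}); "
        f"{health['number_of_nodes']} nodes, "
        f"{health['unassigned_shards']} unassigned shards"
    )


# Subset of the response printed above.
sample = {
    "cluster_name": "opensearch-cluster",
    "status": "green",
    "number_of_nodes": 2,
    "unassigned_shards": 0,
}
print(describe_health(sample))
```

In a notebook, `describe_health(r.json())` could be called directly on the response obtained in the previous cell.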

### Indexing documents

To add (that is, to index) a new document, use: *PUT /\<index-name\>/_doc/\<document-id\>*

In [54]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.put("https://localhost:9200/students/_doc/2", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                 json={
                     "name": "John Doe",
                     "gpa": 3.89,
                     "grad_year": 2024
                 },
                 headers={"Content-type": "application/json"}
                )

print(json.dumps(r.json(), indent=4))

{
    "error": {
        "root_cause": [
            {
                "type": "mapper_parsing_exception",
                "reason": "failed to parse field [grad_year] of type [long] in document with id '2'. Preview of field's value: 'ciao'"
            }
        ],
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [grad_year] of type [long] in document with id '2'. Preview of field's value: 'ciao'",
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "For input string: \"ciao\""
        }
    },
    "status": 400
}


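When the request succeeds, this snippet adds a new document to an index called ***students***, without any prior index declaration! (The output captured above comes from a run in which `grad_year` was sent as the string `ciao`, which the existing `long` mapping of that field rejected.)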

**NO** index declaration

**NO** mapping definition

**WHY DOES IT WORK?**

When the target index does not exist yet, OpenSearch creates an index called ***students*** on receiving the preceding request, infers a mapping from the document's fields, and stores the ingested document in that index.

More information [here](https://opensearch.org/docs/latest/im-plugin/#update-data)
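
Besides *PUT* with an explicit document ID, OpenSearch also accepts *POST /\<index-name\>/_doc*, in which case it generates the ID itself (the Operations section later uses this form). The sketch below only builds the method and URL for each case; the helper name `doc_endpoint` and the `BASE_URL` constant are assumptions made for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import quote

BASE_URL = "https://localhost:9200"  # same local cluster as in the examples above


def doc_endpoint(index: str, doc_id: Optional[str] = None) -> Tuple[str, str]:
    """Return the (HTTP method, URL) pair used to index one document.

    With an ID:    PUT  /<index>/_doc/<id>  (creates or replaces that document)
    Without an ID: POST /<index>/_doc       (OpenSearch generates the ID)
    """
    index_part = quote(index, safe="")
    if doc_id is None:
        return "POST", f"{BASE_URL}/{index_part}/_doc"
    # Percent-encode the ID so characters such as "/" cannot change the path.
    return "PUT", f"{BASE_URL}/{index_part}/_doc/{quote(str(doc_id), safe='')}"


print(doc_endpoint("students", "2"))
print(doc_endpoint("students"))
```

The returned pair could be passed to `requests.request(method, url, ...)` together with the same `json`, `auth` and `verify` arguments used in the cell above.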

### Searching for documents

In [57]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.get("https://localhost:9200/students/_search", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                 data=json.dumps({
                     "query": {
                         "match_all": {}
                     }
                 }),
                 headers={"Content-type": "application/json"}
                )

print(json.dumps(r.json(), indent=4))

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "students",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "name": "John Doe2",
                    "gpa": 3.89,
                    "grad_year": 2022
                }
            }
        ]
    }
}


### Updating documents

In [56]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.post("https://localhost:9200/students/_update/1", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                  json={
                      "doc": {
                          "name": "John Doe2",
                          "gpa": 3.89,
                          "grad_year": 2022
                      }
                  },
                  headers={"Content-type": "application/json"}
                 )

print(json.dumps(r.json(), indent=4))

{
    "_index": "students",
    "_id": "1",
    "_version": 2,
    "result": "updated",
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "_seq_no": 1,
    "_primary_term": 1
}


This operation updates the document if it exists. If the request does not change anything with respect to the latest version, OpenSearch does not create a new version of the document. It is also possible to use the ***upsert*** operation, which creates the document when it does not exist yet.

More information [here](https://opensearch.org/docs/latest/api-reference/document-apis/update-document/)
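
As a rough sketch of the upsert variant: the update API accepts, next to `doc`, either an `upsert` document (indexed as-is when the target document is missing) or the flag `doc_as_upsert`, which reuses `doc` itself for that purpose. The builder function `build_update_body` below is a hypothetical helper, and the field values are made up for the example.

```python
import json
from typing import Optional


def build_update_body(doc: dict, upsert: Optional[dict] = None,
                      doc_as_upsert: bool = False) -> dict:
    """Build the JSON body for POST /<index>/_update/<id>."""
    # This sketch keeps the two upsert styles separate for clarity.
    if upsert is not None and doc_as_upsert:
        raise ValueError("use either 'upsert' or 'doc_as_upsert', not both")
    body = {"doc": doc}
    if upsert is not None:
        body["upsert"] = upsert          # document to create if the ID is missing
    if doc_as_upsert:
        body["doc_as_upsert"] = True     # create the document from 'doc' itself
    return body


body = build_update_body({"gpa": 3.91}, doc_as_upsert=True)
print(json.dumps(body, indent=4))
```

The resulting dictionary could replace the `json=` argument of the `requests.post(".../students/_update/1", ...)` call shown above.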

### Deleting a document

In [58]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.delete("https://localhost:9200/students/_doc/1", verify=False, auth=HTTPBasicAuth("admin", "admin"))

print(json.dumps(r.json(), indent=4))

{
    "_index": "students",
    "_id": "1",
    "_version": 3,
    "result": "deleted",
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "_seq_no": 2,
    "_primary_term": 1
}


## Mapping

Mappings tell OpenSearch how to store and index documents and their fields. Specifying the data type of each field makes storage and querying more efficient, so using an explicit mapping is recommended.

This section explains dynamic mapping, which the previous examples relied on, and explicit mapping, which is the recommended approach.

### Dynamic mapping

OpenSearch automatically infers the field types from the JSON types submitted in the document. This operation is called ***dynamic mapping***. It is possible to retrieve the mapping information of an index through:

In [59]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.get("https://localhost:9200/students/_mapping", verify=False, auth=HTTPBasicAuth("admin", "admin"))
print(json.dumps(r.json(), indent=4))

{
    "students": {
        "mappings": {
            "properties": {
                "gpa": {
                    "type": "float"
                },
                "grad_year": {
                    "type": "long"
                },
                "name": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}


More information [here](https://opensearch.org/docs/latest/field-types/#dynamic-mapping)
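
To make the inference concrete, here is a simplified imitation of it in plain Python. It is only an approximation written for this tutorial: it covers the JSON types that appear in the mapping above and ignores features such as date detection, arrays and null values. The function name `infer_field_type` is hypothetical.

```python
def infer_field_type(value) -> dict:
    """Approximate the mapping that dynamic mapping creates for one JSON value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become full-text fields with an exact-match keyword subfield.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        # Nested JSON objects are mapped field by field.
        return {"properties": {k: infer_field_type(v) for k, v in value.items()}}
    raise TypeError(f"unsupported JSON value: {value!r}")


student = {"name": "John Doe", "gpa": 3.89, "grad_year": 2024}
inferred = {field: infer_field_type(value) for field, value in student.items()}
print(inferred)
```

For the `students` document the result has the same field types as the mapping returned by OpenSearch above: `float` for `gpa`, `long` for `grad_year`, and `text` with a `keyword` subfield for `name`.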

### Explicit mapping

The simplest moment to specify an explicit mapping is at index creation, in the `mappings` section of the request body.

In [60]:
import json
import requests
from requests.auth import HTTPBasicAuth

# delete the index that was automatically created and dynamically mapped in the previous example
r = requests.delete("https://localhost:9200/students", verify=False, auth=HTTPBasicAuth("admin", "admin"))
print(json.dumps(r.json(), indent=4))

# index creation with explicit mapping
r = requests.put("https://localhost:9200/students", verify=False, auth=HTTPBasicAuth("admin", "admin"), 
                json={
                    "settings": {
                        "index.number_of_shards": 1
                    },
                    "mappings": {
                        "properties": {
                            "name": {
                                "type": "text"
                            },
                            "surname": {
                                "type": "text"
                            },
                            "payload": {
                                "type": "object"
                            }
                        }
                    }
                })
print(json.dumps(r.json(), indent=4))

{
    "acknowledged": true
}
{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "students"
}


More information [here](https://opensearch.org/docs/latest/field-types/#explicit-mapping)
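
For the student documents used in the first section, an explicit mapping might look like the sketch below. The chosen field types (`integer` for `grad_year`, `float` for `gpa`) and the `"dynamic": "strict"` setting are illustrative choices, not taken from the example above; with `strict`, OpenSearch rejects documents that contain fields missing from the mapping. The helper `build_index_body` is hypothetical.

```python
import json


def build_index_body(properties: dict, shards: int = 1, strict: bool = True) -> dict:
    """Build the body for PUT /<index-name> with an explicit mapping."""
    mappings = {"properties": properties}
    if strict:
        # Reject documents that contain unmapped fields instead of
        # silently adding those fields through dynamic mapping.
        mappings["dynamic"] = "strict"
    return {
        "settings": {"index.number_of_shards": shards},
        "mappings": mappings,
    }


students_body = build_index_body({
    "name": {"type": "text"},
    "gpa": {"type": "float"},
    "grad_year": {"type": "integer"},
})
print(json.dumps(students_body, indent=4))
```

The dictionary could be passed as the `json=` argument of the `requests.put(".../students", ...)` call shown above.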

## Operations

This section shows some interesting operations.

The following snippet defines an initial dataset on which we perform some of these operations.

In [61]:
import json
import random
import requests
from requests.auth import HTTPBasicAuth

team = [{
    "name": f"worker{i}",
    "age": random.randint(18, 50),
    "project": random.choice(["intranet", "intrateam", "devops", "development", "machine learning"])
} for i in range(20)]

# load all skillbillers
for member in team:
    r = requests.post("https://localhost:9200/skillbill/_doc/", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                     json=member,
                     headers={"Content-type": "application/json"}
                    )
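
The loop above sends one HTTP request per document. As an alternative, the `_bulk` endpoint accepts many documents in a single request, encoded as newline-delimited JSON: one action line followed by one source line per document, with a final newline at the end of the body. The sketch below only builds such a body; the helper name `to_bulk_body` and the two sample documents are assumptions for illustration.

```python
import json


def to_bulk_body(index: str, docs: list) -> str:
    """Serialize documents as newline-delimited JSON for POST /_bulk."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # document source
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


sample_team = [
    {"name": "worker0", "age": 41, "project": "intranet"},
    {"name": "worker1", "age": 42, "project": "intranet"},
]
payload = to_bulk_body("skillbill", sample_team)
print(payload)
```

The payload could then be sent with `requests.post("https://localhost:9200/_bulk", data=payload, headers={"Content-Type": "application/x-ndjson"}, ...)`, using the same `auth` and `verify` arguments as in the loop above.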

In [62]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.get("https://localhost:9200/skillbill/_search", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                 data=json.dumps({
                     "size": 100,
                     "query": {
                         "match_all": {}
                     }
                 }),
                 headers={"Content-type": "application/json"}
                )

print(json.dumps(r.json(), indent=4))

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 20,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "skillbill",
                "_id": "y7cSAJIB9tihzKslYgzp",
                "_score": 1.0,
                "_source": {
                    "name": "worker0",
                    "age": 41,
                    "project": "intranet"
                }
            },
            {
                "_index": "skillbill",
                "_id": "zLcSAJIB9tihzKslYwxw",
                "_score": 1.0,
                "_source": {
                    "name": "worker1",
                    "age": 42,
                    "project": "intranet"
                }
            },
            {
                "_index": "skillbill",
                "_id": "zbcSAJIB9tihzKslYwy

### Query DSL

The query domain-specific language (***DSL***) is a flexible language with a JSON interface, used to query OpenSearch.

A query consists of many query clauses that can be combined to produce complex queries.

In [66]:
import json
import requests
from requests.auth import HTTPBasicAuth

query = {
    "size": 100,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                      "project": "devops"
                    }
                },
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "name": "worker4"
                                }
                            },
                            {
                              "range": {
                                "age": {
                                  "gte": 18,
                                  "lte": 30
                                }
                              }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

r = requests.get("https://localhost:9200/skillbill/_search", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                 data=json.dumps(query),
                 headers={"Content-type": "application/json"}
                )

print(json.dumps(r.json(), indent=4))

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 2.5999472,
        "hits": [
            {
                "_index": "skillbill",
                "_id": "2LcSAJIB9tihzKslZAxJ",
                "_score": 2.5999472,
                "_source": {
                    "name": "worker13",
                    "age": 18,
                    "project": "devops"
                }
            },
            {
                "_index": "skillbill",
                "_id": "3rcSAJIB9tihzKslZAy8",
                "_score": 2.5999472,
                "_source": {
                    "name": "worker19",
                    "age": 27,
                    "project": "devops"
                }
            }
        ]
    }
}
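
The query above reads as: `project` matches "devops" AND (`name` matches "worker4" OR `age` is between 18 and 30). Every clause under `must` has to match; in the inner `bool`, which contains only `should` clauses, at least one of them has to match. Deeply nested dictionaries are easy to get wrong, so one option is to assemble them from small helpers. The functions `match`, `range_query` and `bool_query` below are hypothetical conveniences written for this tutorial, not part of any client library.

```python
def match(field: str, value) -> dict:
    """Full-text match clause on a single field."""
    return {"match": {field: value}}


def range_query(field: str, gte=None, lte=None) -> dict:
    """Range clause; bounds left as None are omitted."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"range": {field: bounds}}


def bool_query(must=None, should=None) -> dict:
    """Compound clause combining 'must' (AND) and 'should' (OR) clauses."""
    clauses = {}
    if must:
        clauses["must"] = list(must)
    if should:
        clauses["should"] = list(should)
    return {"bool": clauses}


query = {
    "size": 100,
    "query": bool_query(must=[
        match("project", "devops"),
        bool_query(should=[
            match("name", "worker4"),
            range_query("age", gte=18, lte=30),
        ]),
    ]),
}
print(query)
```

The resulting `query` dictionary has the same structure as the one written by hand in the cell above.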


### Aggregations

In [67]:
import json
import requests
from requests.auth import HTTPBasicAuth

query = {
  "size": 0,
  "aggs": {
    "avg_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}

r = requests.get("https://localhost:9200/skillbill/_search", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                 data=json.dumps(query),
                 headers={"Content-type": "application/json"}
                )

print(json.dumps(r.json(), indent=4))

{
    "took": 38,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 20,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "avg_age": {
            "value": 34.5
        }
    }
}
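
Metric aggregations such as `avg` can also be nested inside bucket aggregations. As a sketch, the body below would compute the average age per project with a `terms` aggregation. It assumes that the `skillbill` index was dynamically mapped, so that the string field `project` has a `keyword` subfield like the one seen for `name` in the `students` mapping; the aggregation names `by_bucket` and `avg_value` and the helper `avg_per_bucket` are arbitrary.

```python
import json


def avg_per_bucket(bucket_field: str, metric_field: str) -> dict:
    """Build a search body: group by bucket_field, average metric_field per group."""
    return {
        "size": 0,  # return only aggregation results, no hits
        "aggs": {
            "by_bucket": {
                "terms": {"field": bucket_field},
                "aggs": {
                    "avg_value": {"avg": {"field": metric_field}},
                },
            }
        },
    }


query = avg_per_bucket("project.keyword", "age")
print(json.dumps(query, indent=2))
```

The dictionary could be sent to `skillbill/_search` exactly like the `avg_age` query above.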


### Text analysis

In OpenSearch, the abstraction that encompasses text analysis is referred to as an analyzer. Each analyzer contains the following sequentially applied components:

- **Character filters:** First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string.

- **Tokenizer:** Next, a tokenizer receives the stream of characters that has been processed by the character filter and splits the text into individual tokens (usually, words). The output of a tokenizer is a stream of tokens. For example, the snippet below uses the **standard** analyzer, which is the default.

- **Token filters:** Last, a token filter receives the stream of tokens from the tokenizer and adds, removes, or modifies tokens. For example, a token filter may lowercase the tokens so that *Actions* becomes *actions*, remove stopwords like *than*, or add synonyms like *talk* for the word *speak*.
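
The three stages can be imitated in a few lines of plain Python. This is a toy pipeline meant only to show the order in which the components run: the regular expressions are crude, ASCII-only stand-ins and do not reproduce the real OpenSearch character filters or tokenizers. All function names are made up for this sketch.

```python
import re


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text: str) -> list:
    """Tokenizer: split the text into runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens: list) -> list:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text: str) -> list:
    """Apply the stages in order: character filter, tokenizer, token filter."""
    return lowercase(tokenize(strip_html(text)))


print(analyze("<b>Welcome</b> to my demo in Skillbill!!!! :)"))
# prints ['welcome', 'to', 'my', 'demo', 'in', 'skillbill']
```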


In [32]:
import json
import requests
from requests.auth import HTTPBasicAuth

r = requests.get("https://localhost:9200/_analyze", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                 data=json.dumps({
                     "analyzer": "standard",
                     "text": "Welcome to my demo in Skillbill!!!! :)"
                 }),
                 headers={"Content-type": "application/json"}
                )

print(json.dumps(r.json(), indent=4))

{
    "tokens": [
        {
            "token": "welcome",
            "start_offset": 0,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "to",
            "start_offset": 8,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "my",
            "start_offset": 11,
            "end_offset": 13,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "demo",
            "start_offset": 14,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "in",
            "start_offset": 19,
            "end_offset": 21,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "skillbill",
            "start_offset": 22,
            "end_offset": 31,
            "type": "<

In [68]:
import json
import requests
from requests.auth import HTTPBasicAuth

query = {
  "query": {
    "multi_match": {
      "query": "worker1",
      "fields": ["*"]
    }
  }
}
r = requests.get("https://localhost:9200/skillbill/_search", verify=False, auth=HTTPBasicAuth("admin", "admin"),
                 data=json.dumps(query),
                 headers={"Content-type": "application/json"}
                )

print(json.dumps(r.json(), indent=4))

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 2.6390574,
        "hits": [
            {
                "_index": "skillbill",
                "_id": "zLcSAJIB9tihzKslYwxw",
                "_score": 2.6390574,
                "_source": {
                    "name": "worker1",
                    "age": 42,
                    "project": "intranet"
                }
            }
        ]
    }
}


More information [here](https://opensearch.org/docs/latest/analyzers/)

### Other

- [Search](https://opensearch.org/docs/latest/search-plugins/)
  - k-NN search
  - Vector search
  - Neural search
  - etc
- [Machine Learning](https://opensearch.org/docs/latest/ml-commons-plugin/)
  - ML integrations
 

## Clients

OpenSearch provides clients for several programming languages:
- Python: [opensearch-py](https://opensearch.org/docs/latest/clients/python-low-level/)
- Java: [opensearch-java](https://opensearch.org/docs/latest/clients/java/)
- JavaScript: [opensearch-js](https://opensearch.org/docs/latest/clients/javascript/index)
- Go: [opensearch-go](https://opensearch.org/docs/latest/clients/go/)
- Ruby: [opensearch-ruby](https://opensearch.org/docs/latest/clients/ruby/)
- PHP: [opensearch-php](https://opensearch.org/docs/latest/clients/php/)
- Rust: [opensearch-rs](https://opensearch.org/docs/latest/clients/rust/)

### Python client

The following snippet uses opensearch-py to establish a connection with OpenSearch and to create and delete an index.

In [69]:
import json

from opensearchpy import OpenSearch

host = 'localhost'
port = 9200
auth = ('admin', 'admin') # For testing only. Don't store credentials in code.
ca_certs_path = '/full/path/to/root-ca.pem' # Provide a CA bundle if you use intermediate CAs with your root CA.

# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = auth,
    use_ssl = True,
    verify_certs = False, # not used in this example
    ssl_assert_hostname = False,
    ssl_show_warn = False,
    ca_certs = ca_certs_path # not used in this example
)

index_name = 'python-test-index'
print(f"Index creation: {index_name}")
response = client.indices.create(index_name, body={
    'settings': {
        'index': {
            'number_of_shards': 4
        }
    }
})
print(json.dumps(response, indent=4))

print(f"Deleting index: {index_name}")
response = client.indices.delete(
    index = index_name
)
print(json.dumps(response, indent=4))

Index creation: python-test-index
{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "python-test-index"
}
Deleting index: python-test-index
{
    "acknowledged": true
}
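
The same client object also covers the document operations from the first section. The function below is a small sketch that indexes one document and then searches the index; it takes the client as a parameter, so it expects an `opensearchpy.OpenSearch` instance like the one created above. The function name `index_and_search` and its arguments are hypothetical.

```python
def index_and_search(client, index_name: str, doc: dict, doc_id: str) -> list:
    """Index one document, then return the _source of every hit in the index.

    `client` is expected to be an opensearchpy.OpenSearch instance.
    """
    # refresh=True makes the new document visible to search right away.
    client.index(index=index_name, id=doc_id, body=doc, refresh=True)
    response = client.search(
        index=index_name,
        body={"query": {"match_all": {}}},
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

With the `client` created above, a call might look like `index_and_search(client, "python-test-index", {"name": "worker0"}, "1")`.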


More information about ***opensearch-py***:
- [OpenSearch Python repo](https://github.com/opensearch-project/opensearch-py)
- [API reference](https://opensearch-project.github.io/opensearch-py/api-ref.html)