In [1]:
# %pip install rich -Uqq
# %pip install fastcore -Uqq
# %pip install nbdev -Uqq
# %pip install opensearch-py -Uqq

In [2]:
from rich import inspect as rinspect
from rich import print as rprint
from fastcore.test import *
from fastcore.net import *
from fastcore.basics import *

In [3]:
from opensearchpy import OpenSearch
import json
host, port = 'localhost', 9200
auth=('admin', 'admin')

In [4]:
cli=OpenSearch(hosts=[{'host': host, 'port': port}], http_compress=True, http_auth=auth, use_ssl=True, verify_certs=False, ssl_assert_hostname=False, ssl_show_warn=False)

In [5]:
rprint(cli.cat.health.__doc__)

In [6]:
# Checks before indexing
#cli.cat.health(format='json')
rprint(cli.cat.health(v=True, h=['status', 'cluster']))
rprint(cli.cat.indices(v=True, h=['health', 'status', 'index', 'docs.count']))

## Indexing

- Create an index searchml_revisited with non-default settings 

In [7]:
rinspect(AttrDict)

In [8]:
index_name='searchml_revisited'
q_default_fld={'query': {'default_field': 'body'}}
d={'settings': {'index': q_default_fld}}
index_body=AttrDict(d)
index_body

- settings: 
  - index: 
    - query: 
      - default_field: body

In [9]:
from opensearchpy import NotFoundError

def _exists_index(index):
    try:
        return cli.cat.indices(index, h='index').strip() == index
    except NotFoundError:
        return False

if not _exists_index(index_name):
    resp = cli.indices.create(index=index_name, body=index_body); resp

In [10]:
test_eq(cli.cat.indices('searchml_revisited', h='index').strip(), 'searchml_revisited')

In [11]:
# Add our sample documents to the index.
docs = [
    {
        "id": "doc_a",
        "title": "Fox and Hounds",
        "body": "The quick red fox jumped over the lazy brown dogs.",
        "price": "5.99",
        "in_stock": True,
        "category": "childrens"},
    {
        "id": "doc_b",
        "title": "Fox wins championship",
        "body": "Wearing all red, the Fox jumped out to a lead in the race over the Dog.",
        "price": "15.13",
        "in_stock": True,
        "category": "sports"},
    {
        "id": "doc_c",
        "title": "Lead Paint Removal",
        "body": "All lead must be removed from the brown and red paint.",
        "price": "150.21",
        "in_stock": False,
        "category": "instructional"},
    {
        "id": "doc_d",
        "title": "The Three Little Pigs Revisted",
        "price": "3.51",
        "in_stock": True,
        "body": "The big, bad wolf huffed and puffed and blew the house down. The end.",
        "category": "childrens"}
]

In [12]:
index_name='searchml_revisited'
resps = [cli.index(index_name, body=doc, id=doc['id'], refresh=True) 
            for doc in docs]

In [13]:
# Commenting out the checks below because they are not repeatable.
# On the first run against an empty index, each result is 'created':
# [test_eq(resp['result'], 'created') for resp in resps]
# On later runs the documents already exist, so each result is 'updated':
# [test_eq(resp['result'], 'updated') for resp in resps]

In [14]:
test_eq(cli.cat.count(index_name, h='count').strip(), str(4))

**Be intentional with your data mapping!** You can always override your configuration or perform other computations at runtime if you need to.

In OpenSearch it is possible to explicitly define the data types and text analysis via what are called Field Mappings or simply Mappings.

In [15]:
rinspect(cli.indices.get_mapping, help=True)

In [16]:
AttrDict(cli.indices.get_mapping(index_name))

- searchml_revisited: 
  - mappings: 
    - properties: 
      - body: 
        - type: text
        - fields: 
          - keyword: 
            - type: keyword
            - ignore_above: 256
      - category: 
        - type: text
        - fields: 
          - keyword: 
            - type: keyword
            - ignore_above: 256
      - id: 
        - type: text
        - fields: 
          - keyword: 
            - type: keyword
            - ignore_above: 256
      - in_stock: 
        - type: boolean
      - price: 
        - type: text
        - fields: 
          - keyword: 
            - type: keyword
            - ignore_above: 256
      - title: 
        - type: text
        - fields: 
          - keyword: 
            - type: keyword
            - ignore_above: 256

- **Multi fields**
  - dynamic mapping stores every string field as 'text' and adds a 'keyword' sub-field (e.g. `title.keyword`) for exact matching, sorting and aggregations
  - aggregating directly on an analyzed 'text' field is inefficient (and disabled by default), so aggregate on the keyword sub-field instead

- in_stock was mapped as boolean because we passed JSON true/false; it aggregates well since there are only two values.

- ignore_above is set to 256 on the keyword sub-fields. Values longer than 256 characters are not indexed in the sub-field, so they will not show up in aggregations or exact-match searches against it.

- The “price” field was mapped as “text” because we passed the prices as strings, even though the values are numeric. This means we may get unexpected results from sorting or range filtering, since the values will be treated like strings rather than numbers.
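To make the sorting and range pitfall concrete, here is a small pure-Python sketch (no cluster needed) using the four prices from the sample documents. It is only an analogy for how string-typed values compare, not a reproduction of how OpenSearch sorts internally.

```python
# Prices exactly as they appear in the sample documents (JSON strings).
prices = ["5.99", "15.13", "150.21", "3.51"]

# Compared as strings, values are ordered character by character.
as_text = sorted(prices)

# Compared as numbers, they are ordered by magnitude.
as_float = sorted(prices, key=float)

print(as_text)   # ['15.13', '150.21', '3.51', '5.99']
print(as_float)  # ['3.51', '5.99', '15.13', '150.21']

# A "price <= 20" filter gives different answers for the same reason.
text_range = [p for p in prices if p <= "20"]            # string comparison
float_range = [p for p in prices if float(p) <= 20.0]    # numeric comparison
print(text_range)   # ['15.13', '150.21']
print(float_range)  # ['5.99', '15.13', '3.51']
```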


**Well designed index structure**

- be explicit about mappings
- use multiple fields to represent the same piece of document content, e.g. one field indexed with several analyzers (with/without stemming), [autocompletion](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-suggesters.html#completion-suggester), [search-as-you-type](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-as-you-type.html), [joins](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/parent-join.html)
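As a rough sketch of the "multiple fields" idea, the mapping below indexes `title` three ways. The sub-field names `exact` and `raw` and the index name in the comment are made up for illustration; they are not part of the original notebook.

```python
# One source field, three indexed representations.
title_multi_field = {
    "type": "text",
    "analyzer": "english",  # stemmed, stopwords removed
    "fields": {
        # hypothetical sub-field: same text without stemming
        "exact": {"type": "text", "analyzer": "standard"},
        # hypothetical sub-field: untouched value for sorting / aggregations
        "raw": {"type": "keyword", "ignore_above": 256},
    },
}

multi_field_body = {"mappings": {"properties": {"title": title_multi_field}}}
print(multi_field_body)

# With a live cluster this could be created like the other indexes here:
# cli.indices.create(index="searchml_multi_fields", body=multi_field_body)
# and each representation queried by name, e.g. {'match': {'title.exact': 'dogs'}}
```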

Reindex if a field mapping changes after documents have been indexed, since the mapping of an existing field generally cannot be changed in place.
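A minimal sketch of what a reindex request could look like, assuming the destination index has already been created with the new mappings. The helper name `reindex_body` is hypothetical; the commented call uses the client's `reindex` method and needs a running cluster.

```python
def reindex_body(source_index, dest_index):
    """Build the request body for the _reindex API (hypothetical helper)."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}

reindex_request = reindex_body("searchml_revisited",
                               "searchml_revisited_custom_mappings")
print(reindex_request)

# With a live cluster:
# cli.reindex(body=reindex_request, refresh=True)
```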

### Analyzer

- A 'text' field uses the Standard Analyzer by default
- An analyzer consists of 3 components: 1. zero or more character filters 2. exactly one tokenizer 3. zero or more token filters
- Char filters : clean up the raw text, e.g. strip HTML tags
- Tokenizer : splits the text into tokens
- Token filters : add/update/delete tokens before handing them off to Lucene.
![alt text](https://corise.com/_next/image?url=https%3A%2F%2Fcorise.com%2Fstatic%2Fcourse%2Fsearch-with-machine-learning%2Fassets%2Fckyclv9qd000n14727vnn8zax%2Fimage-6.jpg&w=384&q=75 "Standard Analyzer")
- Standard Tokenizer : Lucene tokenizer that splits text on word boundaries as defined by Unicode text segmentation. Not suitable for languages that do not use whitespace to delineate words (e.g. ja, zh)
- Common token filters: stopword removal, stemming
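The three-stage pipeline above can be imitated in a few lines of plain Python. This is a toy: the stopword list, the regex tokenizer and the plural-stripping "stemmer" are stand-ins chosen for illustration and are much cruder than what Lucene does.

```python
import re

# Tiny stopword list for illustration; real analyzers use longer ones.
STOPWORDS = {"a", "an", "and", "the", "to", "in", "of"}

def strip_html(text):
    """Character filter: replace HTML tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split into runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(tokens):
    """Token filter: crude plural stripping, far simpler than a real stemmer."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def analyze(text, char_filters=(), token_filters=()):
    """Run char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenize(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

text = "The lazy brown <b>Dogs</b>."
no_stemming = analyze(text, [strip_html], [lowercase])
with_stemming = analyze(text, [strip_html],
                        [lowercase, remove_stopwords, naive_stem])
print(no_stemming)    # ['the', 'lazy', 'brown', 'dogs']
print(with_stemming)  # ['lazy', 'brown', 'dog']
```

To see the real tokens, the client's analyze API can be called against a live cluster, e.g. `cli.indices.analyze(body={'analyzer': 'english', 'text': text})`.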

In [17]:
index_name, AttrDict(index_body)

('searchml_revisited',
 - settings: 
   - index: 
     - query: 
       - default_field: body)

In [19]:
# set up an explicit mapping that uses the built-in english analyzer
mapping=AttrDict()
mapping.properties=AttrDict({'title': {}, 'body': {}, 'in_stock': {}, 'category': {}, 'price': {}})
mapping.properties.title={'type': 'text', 'analyzer': 'english'}
mapping.properties.body={'type': 'text', 'analyzer': 'english'}
mapping.properties.in_stock={'type': 'boolean'}
mapping.properties.category={'type': 'keyword', 'ignore_above': 256}  # integer, not a string
mapping.properties.price={'type': "float"}
mapping

- properties: 
  - title: 
    - type: text
    - analyzer: english
  - body: 
    - type: text
    - analyzer: english
  - in_stock: 
    - type: boolean
  - category: 
    - type: keyword
    - ignore_above: 256
  - price: 
    - type: float

In [20]:
index_body

- settings: 
  - index: 
    - query: 
      - default_field: body

In [21]:
index_body.update({"mappings": mapping})

In [22]:
index_body

- settings: 
  - index: 
    - query: 
      - default_field: body
- mappings: 
  - properties: 
    - title: 
      - type: text
      - analyzer: english
    - body: 
      - type: text
      - analyzer: english
    - in_stock: 
      - type: boolean
    - category: 
      - type: keyword
      - ignore_above: 256
    - price: 
      - type: float

In [23]:
index_name='searchml_revisited_custom_mappings'

if not _exists_index(index_name):
    resp = cli.indices.create(index=index_name, body=index_body); resp

In [24]:
index_name='searchml_revisited_custom_mappings'
resps = [cli.index(index_name, body=doc, id=doc['id'], refresh=True) 
            for doc in docs]

In [25]:
rinspect(cli.search)
#rinspect(cli.search, help=True)

In [26]:
# Note: we should already have this index from earlier (default mappings)
# GET searchml_revisited/_search?q=body:dogs
resp = cli.search(index='searchml_revisited', params={'q': 'body:dogs'});rprint(resp)

In [27]:
# standard analyzer, no stemming: only doc_a contains the exact token 'dogs'
test_eq(resp['hits']['total']['value'], 1)

In [28]:

# GET searchml_revisited_custom_mappings/_search?q=body:dogs
resp=cli.search(index='searchml_revisited_custom_mappings', params={'q': 'body:dogs'});rprint(resp)

In [29]:
# english analyzer stems 'dogs' to 'dog', so doc_a ('dogs') and doc_b ('Dog') both match
test_eq(resp['hits']['total']['value'], 2)

## Querying

- A [multi_match](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl-multi-match-query.html) query searches across multiple fields (e.g. title, body)
- In the query below, `title^2` makes a match in 'title' count twice as much as a match in 'body'
- Types of multi_match (`type` param): 
  1. best_fields (default: match any field, but take the score from the best-matching field) 
  2. most_fields (match any field and combine the scores from each) 
  3. cross_fields (treat fields with the same analyzer as one big field and look for each word in **any** of them)
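To compare the three types side by side, one option is a small builder that only varies the `type` parameter. The helper name `multi_match_query` is hypothetical; the commented loop needs a running cluster.

```python
def multi_match_query(text, fields, match_type="best_fields", size=5):
    """Build a search body for a multi_match query of the given type."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
                "type": match_type,
            }
        },
    }

fields = ["title^2", "body"]
bodies = {t: multi_match_query("fox dog", fields, t)
          for t in ("best_fields", "most_fields", "cross_fields")}
print(bodies["cross_fields"])

# With a live cluster, compare how the scores change per type:
# for t, body in bodies.items():
#     resp = cli.search(body=body, index="searchml_revisited_custom_mappings")
#     print(t, [(h["_id"], h["_score"]) for h in resp["hits"]["hits"]])
```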

In [30]:
q='dogs'
index_name='searchml_revisited_custom_mappings'
query={
    'size': 5,
    'query': {
        'multi_match': {'query': q, 'fields': ['title^2', 'body']} 
    }
}
resp = cli.search(body=query, index=index_name); rprint(resp)

In [34]:
q='fox dog'
index_name='searchml_revisited_custom_mappings'
query={
    'size': 5,
    'query': {
        'match_phrase': {
            'body': {'query': q}
        }
    }
}
resp = cli.search(body=query, index=index_name); rprint(resp)

In [35]:
# phrase queries require the tokens fox and dog to occur next to each other, hence 0 results
test_eq(resp['hits']['total']['value'], 0)

In [36]:
# find all documents where the terms “fox” and “dog” occur 
# within 10 positions of each other.
q='fox dog'
index_name='searchml_revisited_custom_mappings'
query={
    'size': 5,
    'query': {
        'match_phrase': {
            'body': {'query': q, 'slop': 10} # phrase query with slop
        }
    }
}
resp = cli.search(body=query, index=index_name); rprint(resp)

In [37]:
# with slop=10 the tokens no longer have to be adjacent, so we now get results
test_eq(resp['hits']['total']['value'] > 0, True)

- Filter queries are non-scoring queries that reduce the result set by simply determining what documents match the filter query, and function queries use the values within a field as a scoring feature.
- Function queries allow us to do things like boost documents based on some external value like price, inventory or popularity.
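The arithmetic behind a `field_value_factor` function can be sketched in plain Python. This follows the documented formula (apply the modifier to `factor * value`, then combine with the query score using the default `multiply` boost mode) and covers only a few of the available modifiers; it is an illustration, not the engine's code.

```python
import math

# A subset of the documented modifiers.
MODIFIERS = {
    "none": lambda v: v,
    "log1p": lambda v: math.log10(v + 1),
    "sqrt": math.sqrt,
    "reciprocal": lambda v: 1.0 / v,
}

def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    """Function value for one document: modifier(factor * value)."""
    if value is None:
        value = missing  # used when the document has no value for the field
    return MODIFIERS[modifier](factor * value)

def function_score(query_score, function_value):
    """Default boost_mode 'multiply': query score times function value."""
    return query_score * function_value

# match_all gives every document a query score of 1.0, so with the
# defaults the final score is simply the price.
score_a = function_score(1.0, field_value_factor(5.99))
score_missing = function_score(1.0, field_value_factor(None, missing=1))
print(score_a, score_missing)
```

A modifier such as `log1p` or `sqrt` is often worth considering so that one very expensive item does not dominate the ranking.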

In [42]:
# The filter clause is non-scoring: it only narrows the result set to one category.
# field_value_factor then uses the value of the price field as the score.
query = {
    'size': 5,
    'query': {
        "function_score": {
            "query": {
                # match_all scores every doc 1.0; the filter clause below is non-scoring
                "bool": {
                    "must": [
                        {"match_all": {}}
                    ],
                    "filter": [
                        {"term": {"category": "childrens"}}
                    ]
                }
            },
            "field_value_factor": {
                "field": "price",
                "missing": 1
            }
        }
    }
}
resp =cli.search(body=query, index=index_name); resp

{'took': 3,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 2, 'relation': 'eq'},
  'max_score': 5.99,
  'hits': [{'_index': 'searchml_revisited_custom_mappings',
    '_type': '_doc',
    '_id': 'doc_a',
    '_score': 5.99,
    '_source': {'id': 'doc_a',
     'title': 'Fox and Hounds',
     'body': 'The quick red fox jumped over the lazy brown dogs.',
     'price': '5.99',
     'in_stock': True,
     'category': 'childrens'}},
   {'_index': 'searchml_revisited_custom_mappings',
    '_type': '_doc',
    '_id': 'doc_d',
    '_score': 3.51,
    '_source': {'id': 'doc_d',
     'title': 'The Three Little Pigs Revisted',
     'price': '3.51',
     'in_stock': True,
     'body': 'The big, bad wolf huffed and puffed and blew the house down. The end.',
     'category': 'childrens'}}]}}

In [47]:
doc=resp['hits']['hits'][0]
test_eq(str(doc['_score']) == doc['_source']['price'], True)

👀 Aside: If you only need a field for ranking, you might consider using the [Rank Feature](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl-rank-feature-query.html) field and query for improved performance.

The rank_feature query boosts the relevance score of documents based on the numeric value of a rank_feature or rank_features field.

The rank_feature query is typically used in the should clause of a bool query so its relevance scores are added to other scores from the bool query.

[Elastic query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl.html): queries are [composable](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/compound-queries.html) in many places! That is, you can often build a more sophisticated query by adding and grouping different types of queries via things like the [“bool”](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl-bool-query.html) query. You can also mix and match many of the other query types like geo, shapes, spans, and terms!
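As one example of composing queries, the sketch below wraps a scoring `multi_match` and two non-scoring filters in a single `bool` query. The helper name `composed_query` and the chosen filter values are assumptions for illustration; it targets the custom-mappings index, where `category` is a keyword and `price` is a float.

```python
def composed_query(text, category, max_price, size=5):
    """Combine a scoring multi_match with term and range filters."""
    return {
        "size": size,
        "query": {
            "bool": {
                # scoring part: decides the ranking
                "must": [
                    {"multi_match": {"query": text,
                                     "fields": ["title^2", "body"]}}
                ],
                # non-scoring part: only narrows the result set
                "filter": [
                    {"term": {"category": category}},
                    {"range": {"price": {"lte": max_price}}},
                ],
            }
        },
    }

composed = composed_query("fox", "childrens", 10)
print(composed)

# With a live cluster:
# resp = cli.search(body=composed, index="searchml_revisited_custom_mappings")
```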

## Aggregations

- https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-aggregations.html

- Metric aggregations that calculate metrics, such as a sum or average, from field values.
- Bucket aggregations that group documents into buckets, also called bins, based on field values, ranges, or other criteria.
- Pipeline aggregations that take input from other aggregations instead of documents or fields.
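The three kinds can appear together in one request. The body below is a sketch, assuming the custom-mappings index (keyword `category`, float `price`); the aggregation names `by_category`, `avg_price` and `priciest_category` are made up for illustration.

```python
agg_body = {
    "size": 0,  # aggregations only, no hits
    "aggs": {
        # bucket aggregation: one bucket per category value
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {
                # metric aggregation computed inside each bucket
                "avg_price": {"avg": {"field": "price"}}
            },
        },
        # pipeline aggregation: reads the output of the aggregations above
        "priciest_category": {
            "max_bucket": {"buckets_path": "by_category>avg_price"}
        },
    },
}
print(agg_body)

# With a live cluster:
# resp = cli.search(body=agg_body, index="searchml_revisited_custom_mappings")
```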

[bucketing/counting](https://opensearch.org/docs/latest/opensearch/bucket-agg/) of fields

In [50]:
query={
    'size': 0, # do not return any hits
    'query': {
        'match_all': {} # aggregations will be calculated over all documents
    },
    'aggs': {
        'category': {
            'terms': { # create an aggregation for each unique term
                'field': 'category',
                'size': 10,
                'missing': 'N/A', # determine # of docs that don't have value set for this field
                'min_doc_count': 0
            }
        }
    }
}
resp=cli.search(body=query, index=index_name);rprint(resp)

In [52]:
# aggregate the price field
query = {
    'size': 0,
    'query': {
        'match_all': {}
    },
    'aggs': {
        'price': {
            'terms': {
                'field': 'price',
                'size': 10,
                'min_doc_count': 0
            }
        }
    }
}
resp=cli.search(body=query, index=index_name);rprint(resp)

A terms aggregation on a field like price is not helpful, since almost every distinct value ends up in its own bucket. Let's use a [range aggregation](https://opensearch.org/docs/latest/opensearch/bucket-agg/#range-date_range-ip_range) instead.

In [53]:
# aggregate the price field into ranges ('from' is inclusive, 'to' is exclusive)
query = {
    'size': 0,
    'query': {
        'match_all': {}
    },
    'aggs': {
        'price': {
            'range': {
                'field': 'price',
                'ranges': [
                    {'to': 5},
                    {'from': 5, 'to': 20},
                    {'from': 20}
                ]
            }
        }
    }
}
resp=cli.search(body=query, index=index_name);rprint(resp)
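To check what the range aggregation should return for the sample data, the bucketing can be imitated in plain Python. This is an emulation under the documented rule that `from` is inclusive and `to` is exclusive; the helper name `range_buckets` is hypothetical.

```python
# Prices of the four sample documents, as numbers.
prices = [5.99, 15.13, 150.21, 3.51]
ranges = [{"to": 5}, {"from": 5, "to": 20}, {"from": 20}]

def range_buckets(values, ranges):
    """Count values per range; 'from' is inclusive and 'to' is exclusive."""
    buckets = []
    for r in ranges:
        lo = r.get("from", float("-inf"))
        hi = r.get("to", float("inf"))
        count = sum(1 for v in values if lo <= v < hi)
        buckets.append({"from": r.get("from"), "to": r.get("to"),
                        "doc_count": count})
    return buckets

counts = [b["doc_count"] for b in range_buckets(prices, ranges)]
print(counts)  # [1, 2, 1]
```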

**Homework**
- Add in some of your own “multi-fields” to index the content in different ways using the Field Mapping settings
- Index some different data types that we didn’t try out, such as latitude and longitude. How would you model searching what stores have what books in our data type?
- Try out some more sophisticated queries that combine several different query types, filters and aggregations.