## Documentation

To read more about analyzers, check out the docs [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html).



## Connect to Elasticsearch

In [None]:
from pprint import pprint
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
client_info = es.info()
print('Connected to Elasticsearch!')
pprint(client_info.body)

## 1. Character filters

Read more about them [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html).

### 1.1. HTML Strip Character Filter

In [None]:
from pprint import pprint

response = es.indices.analyze(
    tokenizer="keyword",  # keep the filtered text as a single token
    char_filter=[
        "html_strip"
    ],
    text="<p>I&apos;m so <b>happy</b>!</p>",
)
pprint(response.body)

### 1.2. Mapping character filter

In [None]:
response = es.indices.analyze(
    tokenizer="keyword",
    char_filter=[
        {
            "type": "mapping",
            "mappings": [
                "٠ => 0",
                "١ => 1",
                "٢ => 2",
                "٣ => 3",
                "٤ => 4",
                "٥ => 5",
                "٦ => 6",
                "٧ => 7",
                "٨ => 8",
                "٩ => 9"
            ]
        }
    ],
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
pprint(response.body)

## 2. Tokenizer

Read more about tokenizers [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html).

### 2.1. Standard

In [None]:
response = es.indices.analyze(
    tokenizer="standard",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}', Type: {token['type']}")

### 2.2. Lowercase

In [None]:
response = es.indices.analyze(
    tokenizer="lowercase",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}', Type: {token['type']}")

### 2.3. Whitespace

In [None]:
response = es.indices.analyze(
    tokenizer="whitespace",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}', Type: {token['type']}")

## 3. Token filter

Read more about token filters [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html).

### 3.1. Apostrophe

In [None]:
response = es.indices.analyze(
    tokenizer="standard",
    filter=[
        "apostrophe"
    ],
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

### 3.2. Decimal digit

In [None]:
response = es.indices.analyze(
    tokenizer="standard",
    filter=[
        "decimal_digit"
    ],
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

### 3.3. Reverse

In [None]:
response = es.indices.analyze(
    tokenizer="standard",
    filter=[
        "reverse"
    ],
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

## 4. Built-in analyzers

Read more about built-in analyzers [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html).

### 4.1. Standard

In [None]:
response = es.indices.analyze(
    analyzer="standard",
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

### 4.2. Stop

In [None]:
response = es.indices.analyze(
    analyzer="stop",
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

### 4.3. Keyword

In [None]:
response = es.indices.analyze(
    analyzer="keyword",
    text="I saw comet Tsuchinshan Atlas in ٢٠٢٤",
)
tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")

## 5. Index time vs. search time analysis

### 5.1. Index time

Index-time analysis transforms text before it's stored in the index. In this example, let's create an index with an analyzer that lowercases text, removes HTML tags, and replaces ampersands (&) with the word "and."

In [None]:
index_name = "index_time_example"
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "ampersand_replacement": {
                    "type": "mapping",
                    "mappings": ["& => and"]
                }
            },
            "analyzer": {
                "custom_index_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip", "ampersand_replacement"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "custom_index_analyzer"
            }
        }
    }
}

es.indices.delete(index=index_name, ignore_unavailable=True)
es.indices.create(index=index_name, body=settings)

document = {
    "content": "Visit my website https://myuniversehub.com/ & like some images!"
}
response = es.index(index=index_name, id=1, body=document)
pprint(response.body)

When searching for the document, you'll notice that the content appears unchanged. This is expected because Elasticsearch stores the transformed tokens in an inverted index for searching purposes, while keeping the original document intact in the `_source` field.

In [None]:
response = es.search(index=index_name, body={"query": {"match_all": {}}})
hits = response.body["hits"]["hits"]

for hit in hits:
    print(hit["_source"])
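The split between the stored document and the searchable tokens can be illustrated without a cluster. The following is a minimal pure-Python sketch, not how Elasticsearch is implemented: `toy_analyzer`, `source_store`, `inverted_index`, and `index_document` are hypothetical names, and the analyzer is a crude stand-in that only maps `&` to `and`, lowercases, and splits on whitespace.

```python
from collections import defaultdict


def toy_analyzer(text):
    # Hypothetical stand-in for an analyzer: replace "&", lowercase, split on whitespace.
    return text.replace("&", "and").lower().split()


source_store = {}                   # doc id -> original document (plays the role of _source)
inverted_index = defaultdict(set)   # token -> ids of the documents containing it


def index_document(doc_id, document):
    source_store[doc_id] = document  # the original is kept untouched
    for token in toy_analyzer(document["content"]):
        inverted_index[token].add(doc_id)


index_document(1, {"content": "Visit my website https://myuniversehub.com/ & like some images!"})

print(source_store[1]["content"])   # original text, still containing "&" and capitals
print(sorted(inverted_index))       # transformed tokens used for lookups
```

Only the tokens are transformed; anything read back from the store comes out exactly as it went in, which is the behaviour seen in the search results above.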

We can verify that the custom analyzer is working by passing the same text to the analyze API together with the `content` field, which applies the analyzer configured for that field.

In [None]:
response = es.indices.analyze(
    index=index_name,
    body={
        "field": "content",
        "text": "Visit my website https://myuniversehub.com/ & like some images!"
    }
)

tokens = response.body["tokens"]
for token in tokens:
    print(f"Token: '{token['token']}'")
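To make the three stages of the custom analyzer easier to see, here is a rough pure-Python approximation of the same pipeline: character filters, then a tokenizer, then a token filter. It is a sketch under simplifying assumptions, not the Elasticsearch implementation: the function names are hypothetical, the HTML stripping is a plain regex, and the tokenizer regex only loosely imitates the `standard` tokenizer.

```python
import html
import re


def html_strip(text):
    # Rough stand-in for the html_strip character filter: drop tags, decode entities.
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def ampersand_replacement(text):
    # Same idea as the "& => and" mapping character filter.
    return text.replace("&", "and")


def tokenize(text):
    # Crude approximation of the standard tokenizer:
    # runs of word characters, keeping dots that sit between them.
    return re.findall(r"\w+(?:\.\w+)*", text)


def lowercase(tokens):
    return [token.lower() for token in tokens]


def sketch_analyze(text):
    # 1. character filters, 2. tokenizer, 3. token filters
    for char_filter in (html_strip, ampersand_replacement):
        text = char_filter(text)
    return lowercase(tokenize(text))


tokens = sketch_analyze("Visit my website https://myuniversehub.com/ & like some images!")
print(tokens)
```

The order matters: character filters see the raw string, the tokenizer sees the filtered string, and token filters see individual tokens. Comparing this list with the output of the analyze call above is a quick way to spot where the approximation and the real analyzer differ.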

### 5.2. Search time

Search-time analysis transforms the query text when a search is performed, rather than the document text at index time. In this example, we'll run a `match` query with the `standard` analyzer, which differs from the analyzer used at index time: it splits the query into tokens and lowercases them, but applies no character filters and, by default, removes no stop words.

In [None]:
response = es.search(index=index_name, body={
    "query": {
        "match": { # match is used for full-text search
            "content": {
                "query": "myuniversehub.com",
                "analyzer": "standard"  # Using a different analyzer than the one used at index time
            }
        }
    }
})

hits = response["hits"]["hits"]
for hit in hits:
    print(hit["_source"])

You can also use a `term` query, which skips analysis and compares the value directly against the tokens in the inverted index. The index-time analyzer lowercased the text and kept `myuniversehub.com` as a single token, so this query returns the document.

In [None]:
response = es.search(index=index_name, body={
    "query": {
        "term": {  # term is used for exact matches
            "content": {
                "value": "myuniversehub.com",
            }
        }
    }
})

hits = response["hits"]["hits"]
for hit in hits:
    print(hit["_source"])

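In this case, the query value is `MYUNIVERSEHUB.com`. Because `term` queries are not analyzed, the uppercase value is compared as-is against the lowercased tokens in the index, so no results are returned.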

In [None]:
response = es.search(index=index_name, body={
    "query": {
        "term": {  # term is used for exact matches
            "content": {
                "value": "MYUNIVERSEHUB.com",
            }
        }
    }
})

hits = response["hits"]["hits"]
for hit in hits:
    print(hit["_source"])
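The difference between the two query types can be sketched in a few lines of plain Python. This is an illustration only: `indexed_tokens` is an assumed token set (what the custom index analyzer is expected to emit for the example document), and `term_lookup` and `match_lookup` are hypothetical helpers, not Elasticsearch APIs.

```python
# Assumed tokens for the example document after index-time analysis.
indexed_tokens = {
    "visit", "my", "website", "https", "myuniversehub.com",
    "and", "like", "some", "images",
}


def term_lookup(value):
    # Like a term query: the value is compared to the indexed tokens as-is.
    return value in indexed_tokens


def match_lookup(query):
    # Like a match query: the query is analyzed first
    # (here simply lowercased and split on whitespace), then each token is looked up.
    return any(token in indexed_tokens for token in query.lower().split())


print(term_lookup("myuniversehub.com"))   # True
print(term_lookup("MYUNIVERSEHUB.com"))   # False
print(match_lookup("MYUNIVERSEHUB.com"))  # True
```

The uppercase value only fails in the lookup that skips analysis, so a `match` query is usually the safer choice for `text` fields.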