<a href="https://colab.research.google.com/github/martin-mirantes/MCP/blob/main/notebooks/search/00-quick-start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic search quick start

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb)

This interactive notebook will introduce you to some basic operations with Elasticsearch, using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html).
You'll perform semantic search using [Sentence Transformers](https://www.sbert.net) for text embedding, and learn how to combine traditional text-based search with semantic search to build a hybrid search system.

## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

Once logged in to your Elastic Cloud account, go to the [Create deployment](https://cloud.elastic.co/deployments/create) page and select **Create deployment**. Leave all settings with their default values.

## Install packages and import modules

To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to install the `elasticsearch` Python client and the `sentence-transformers` library.

In [1]:
!pip install -qU "elasticsearch<9" sentence-transformers==2.7.0


## Set up the embedding model

For this example, we're using `all-MiniLM-L6-v2`, loaded through the `sentence_transformers` library. It maps text to 384-dimensional vectors. You can read more about this model on [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


## Initialize the Elasticsearch client

Now we can instantiate the [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the Cloud ID and API key for your deployment.

In [3]:
from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    # For local development
    # hosts=["http://localhost:9200"]
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

Elastic Cloud ID: ··········
Elastic Api Key: ··········


If you're running Elasticsearch locally or self-managed, you can pass in the Elasticsearch host instead. [Read more](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#_verifying_https_with_certificate_fingerprints_python_3_10_or_later) on how to connect to Elasticsearch locally.

### Enable Telemetry

Knowing that you are using this notebook helps us decide where to invest our efforts to improve our products. We would appreciate it if you ran the following code so that we can gather anonymous usage statistics. See [telemetry.py](https://github.com/elastic/elasticsearch-labs/blob/main/telemetry/telemetry.py) for details. Thank you!

In [4]:
!curl -O -s https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/telemetry/telemetry.py
from telemetry import enable_telemetry

client = enable_telemetry(client, "00-quick-start")

Telemetry enabled for "00-quick-start". Thank you!


### Test the Client
Before you continue, confirm that the client has connected with this test.

In [5]:
print(client.info())

{'name': 'instance-0000000000', 'cluster_name': '51cadc9319ff4c758693155bb5e8117d', 'cluster_uuid': 'Bz-_rqsjSaqLKBItQN7exQ', 'version': {'number': '9.0.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '0a58bc1dc7a4ae5412db66624aab968370bd44ce', 'build_date': '2025-05-28T10:06:37.834829258Z', 'build_snapshot': False, 'lucene_version': '10.1.0', 'minimum_wire_compatibility_version': '8.18.0', 'minimum_index_compatibility_version': '8.0.0'}, 'tagline': 'You Know, for Search'}


## Index some test data

Our client is set up and connected to our Elastic deployment.
Now we need some data to test out the basics of Elasticsearch queries.
We'll use a small index of books with the following fields:

- `title`
- `authors`
- `summary`
- `publish_date`
- `num_reviews`
- `publisher`

### Create an index

First ensure that you do not have a previously created index with the name `book_index`.

In [6]:
client.indices.delete(index="book_index", ignore_unavailable=True)

ObjectApiResponse({'acknowledged': True})

🔐 NOTE: at any time you can come back to this section and run the `delete` function above to remove your index and start from scratch.

Let's create an Elasticsearch index with the correct mappings for our test data.

In [7]:
# Define the mapping
mappings = {
    "properties": {
        "title_vector": {
            "type": "dense_vector",
            "dims": 384,  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
            "index": True,
            "similarity": "cosine",
        }
    }
}

# Create the index
client.indices.create(index="book_index", mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'book_index'})

### Index test data

Run the following command to upload some test data, containing information about 10 popular programming books from this [dataset](https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json).
`model.encode` will encode the text into a vector on the fly, using the model we initialized earlier.

In [8]:
import json
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json"
response = urlopen(url)
books = json.loads(response.read())

operations = []
for book in books:
    operations.append({"index": {"_index": "book_index"}})
    # Transforming the title into an embedding using the model
    book["title_vector"] = model.encode(book["title"]).tolist()
    operations.append(book)
client.bulk(index="book_index", operations=operations, refresh=True)

ObjectApiResponse({'errors': False, 'took': 200, 'items': [{'index': {'_index': 'book_index', '_id': '-9VoQJcBYS182ubtUI05', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 0, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': '_NVoQJcBYS182ubtUI05', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 1, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': '_dVoQJcBYS182ubtUI05', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 2, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': '_tVoQJcBYS182ubtUI05', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 3, '_primary_term': 1, 'status': 201}}, {'index': {'_
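
The bulk request body alternates an action entry with the document it applies to. The sketch below builds that list with a small hypothetical helper, `build_bulk_operations`, and uses a stand-in encoder (`fake_encode`, also hypothetical) so that it runs without the model. In the notebook the encoder would be something like `lambda text: model.encode(text).tolist()`.

```python
def build_bulk_operations(docs, index_name, encode):
    """Return the alternating action/document list expected by client.bulk."""
    operations = []
    for doc in docs:
        operations.append({"index": {"_index": index_name}})
        # Copy the document so the caller's data is not mutated
        enriched = dict(doc)
        enriched["title_vector"] = encode(doc["title"])
        operations.append(enriched)
    return operations


def fake_encode(text):
    """Stand-in for the embedding model: NOT a real embedding."""
    return [float(len(text)), 0.0, 0.0]


sample_books = [{"title": "Python Crash Course"}, {"title": "Eloquent JavaScript"}]
operations = build_bulk_operations(sample_books, "book_index", fake_encode)
print(len(operations))  # two entries per book
```

The `errors` flag in the bulk response (`False` in the output above) is worth checking, because a bulk call can return successfully while individual items fail.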

## Aside: Pretty printing Elasticsearch responses

Your API calls will return hard-to-read nested JSON.
We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples.

In [9]:
def pretty_response(response):
    hits = response["hits"]["hits"]
    if not hits:
        print("Your search returned no results.")
        return
    for hit in hits:
        source = hit["_source"]
        # doc_id avoids shadowing the built-in id()
        doc_id = hit["_id"]
        pretty_output = (
            f"\nID: {doc_id}"
            f"\nPublication date: {source['publish_date']}"
            f"\nTitle: {source['title']}"
            f"\nSummary: {source['summary']}"
            f"\nPublisher: {source['publisher']}"
            f"\nReviews: {source['num_reviews']}"
            f"\nAuthors: {source['authors']}"
            f"\nScore: {hit['_score']}"
        )
        print(pretty_output)

## Making queries

Now that we have indexed the books, we can run a semantic search for books whose titles are similar to a given query.
We embed the query text with the same model and run a kNN search against the `title_vector` field.

In [17]:
response = client.search(
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("clear"),
        "k": 3,
        "num_candidates": 3,
    },
)

pretty_response(response)


ID: AtVoQJcBYS182ubtUI45
Publication date: 2011-05-13
Title: The Clean Coder: A Code of Conduct for Professional Programmers
Summary: A guide to professional conduct in the field of software engineering
Publisher: prentice hall
Reviews: 20
Authors: ['robert c. martin']
Score: 0.65018773

ID: _tVoQJcBYS182ubtUI05
Publication date: 2008-08-11
Title: Clean Code: A Handbook of Agile Software Craftsmanship
Summary: A guide to writing code that is easy to read, understand and maintain
Publisher: prentice hall
Reviews: 55
Authors: ['robert c. martin']
Score: 0.6459305

ID: _9VoQJcBYS182ubtUI05
Publication date: 2015-03-27
Title: You Don't Know JS: Up & Going
Summary: Introduction to JavaScript and programming as a whole
Publisher: oreilly
Reviews: 36
Authors: ['kyle simpson']
Score: 0.6196325
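
The scores above can be related back to raw cosine similarity. For a `dense_vector` field with `"similarity": "cosine"`, Elasticsearch documents the kNN `_score` as `(1 + cosine) / 2`, which keeps scores between 0 and 1. A minimal sketch in plain Python follows; the helper names `cosine_similarity` and `knn_score` are made up for illustration, and the toy vectors are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score(query_vector, doc_vector):
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    return (1 + cosine_similarity(query_vector, doc_vector)) / 2


# Toy 3-dimensional vectors; embeddings from the model have 384 dimensions.
print(knn_score([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction -> 1.0
print(knn_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.5
```

Read this way, the top score of 0.65018773 corresponds to a cosine similarity of roughly 0.30.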


## Filtering

Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:

- _Does this timestamp fall into the range 2015 to 2016?_
- _Is the status field set to "published"?_

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the `filter` or `must_not` parameters in a `bool` query.

[Learn more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context) about filter context in the Elasticsearch docs.
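
As a sketch of the two questions above, the following builds a `bool` query whose clauses all run in filter context. The field names `timestamp` and `status` and the helper `build_filter_query` are illustrative only; they are not part of the book index.

```python
def build_filter_query(start, end, status):
    """Build a bool query made only of filter clauses (no effect on scoring)."""
    return {
        "bool": {
            "filter": [
                {"range": {"timestamp": {"gte": start, "lte": end}}},
                {"term": {"status": status}},
            ]
        }
    }


query = build_filter_query("2015-01-01", "2016-12-31", "published")
print(query["bool"]["filter"][1])
```

A dictionary like this could be passed as the `query` argument of `client.search`.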

### Example: Keyword Filtering

This is an example of adding a keyword filter to the query.

The example retrieves the books whose title vectors are most similar to "javascript books", restricted to those published by Addison-Wesley.

In [18]:
response = client.search(
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("javascript books"),
        "k": 10,
        "num_candidates": 100,
        "filter": {"term": {"publisher.keyword": "addison-wesley"}},
    },
)

pretty_response(response)


ID: -9VoQJcBYS182ubtUI05
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.62071896

ID: AdVoQJcBYS182ubtUI45
Publication date: 1994-10-31
Title: Design Patterns: Elements of Reusable Object-Oriented Software
Summary: Guide to design patterns that can be used in any object-oriented language
Publisher: addison-wesley
Reviews: 45
Authors: ['erich gamma', 'richard helm', 'ralph johnson', 'john vlissides']
Score: 0.5665283


### Example: Date range filtering

This example adds a `range` filter on `publish_date`, so the kNN search only considers books published in roughly the last eight years.

In [24]:
from datetime import datetime, timedelta

# Compute the date eight years before today (approximated as 8 * 365 days)
cutoff_date = (datetime.now() - timedelta(days=8 * 365)).strftime("%Y-%m-%d")

response = client.search(
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("javascript books"),  # or any query you like
        "k": 10,
        "num_candidates": 100,
        "filter": {
            "range": {
                "publish_date": {
                    "gte": cutoff_date,
                    "lte": "now",  # "now" resolves to the current date
                }
            }
        },
    },
)

# pretty_response is the helper defined earlier in this notebook
pretty_response(response)


ID: ANVoQJcBYS182ubtUI45
Publication date: 2018-12-04
Title: Eloquent JavaScript
Summary: A modern introduction to programming
Publisher: no starch press
Reviews: 38
Authors: ['marijn haverbeke']
Score: 0.6798649

ID: -9VoQJcBYS182ubtUI05
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.62093973

ID: _dVoQJcBYS182ubtUI05
Publication date: 2020-04-06
Title: Artificial Intelligence: A Modern Approach
Summary: Comprehensive introduction to the theory and practice of artificial intelligence
Publisher: pearson
Reviews: 39
Authors: ['stuart russell', 'peter norvig']
Score: 0.56021476

ID: _NVoQJcBYS182ubtUI05
Publication date: 2019-05-03
Title: Python Crash Course
Summary: A fast-paced, no-nonsense guide to programming in Python
Publisher: no starch press
Reviews: 42
Authors: ['eric matthe
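
Elasticsearch range queries also accept date math, so the lower bound can be written as a string such as `now-8y` instead of being computed in Python. A small sketch, with `date_range_filter` as a hypothetical helper name:

```python
def date_range_filter(field, years_back):
    """Build a range filter from `years_back` years ago until now using date math."""
    if years_back < 0:
        raise ValueError("years_back must be non-negative")
    return {"range": {field: {"gte": f"now-{years_back}y", "lte": "now"}}}


knn_filter = date_range_filter("publish_date", 8)
print(knn_filter)
```

The resulting dictionary could be used as the `filter` value inside the `knn` clause. Note that `now-8y` subtracts calendar years, which differs slightly from the `8 * 365` days computed in Python.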

## 1. Search books by author ✍️

You can search using the `authors` field (analyzed, for flexible matching) or `authors.keyword` (not analyzed, for exact matching).

### Flexible search (using the `text`-typed `authors` field)

This finds books where the query terms, for example "Gabriel García", appear in the authors field, even if the book has other authors or the name has variations that the analyzer can handle.

In [68]:
import json


def pretty_print(response):
    """Helper that prints a JSON response in a readable form."""
    print(json.dumps(response, indent=2, ensure_ascii=False))

# Index name
INDEX_NAME = "book_index"

print("## 1.1 Flexible author search:")
response = client.search(
    index=INDEX_NAME,
    query={
        "match": {
            # Neither misspelled term matches exactly, and no fuzziness is set
            "authors": "ericc mathes"
        }
    },
    size=5  # Limit to 5 results
)
pretty_response(response.body)

## 1.1 Flexible author search:
Your search returned no results.


In [70]:
print("\n## Flexible author search (including only 'title', 'authors', 'publish_date'):")
response_specific_fields = client.search(
    index=INDEX_NAME,
    query={
        "match": {
            "authors": "eric"
        }
    },
    size=1,
    _source=["title", "authors", "publish_date"]  # Fields to include in the response
)
pretty_print(response_specific_fields.body)


## Flexible author search (including only 'title', 'authors', 'publish_date'):
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.3534746,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 2.3534746,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "publish_date": "2019-05-03"
        }
      }
    ]
  }
}
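
Note that `pretty_response` reads every field directly from `_source`, so it would raise a `KeyError` once `_source` filtering removes a field such as `summary`. A more tolerant formatter could look like the sketch below; `format_hit` is a hypothetical name, and the sample hit is copied from the output above.

```python
def format_hit(hit, fields=("title", "authors", "publish_date", "summary")):
    """Format one search hit, skipping fields that are absent from _source."""
    source = hit.get("_source", {})
    lines = [f"ID: {hit['_id']}", f"Score: {hit['_score']}"]
    for field in fields:
        if field in source:
            lines.append(f"{field}: {source[field]}")
    return "\n".join(lines)


sample_hit = {
    "_id": "_NVoQJcBYS182ubtUI05",
    "_score": 2.3534746,
    "_source": {
        "title": "Python Crash Course",
        "authors": ["eric matthes"],
        "publish_date": "2019-05-03",
    },
}
print(format_hit(sample_hit))
```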


In [71]:
print("## Flexible author search (excluding 'title_vector'):")
response = client.search(
    index=INDEX_NAME,
    query={
        "match": {
            "authors": "eric"
        }
    },
    size=1,  # Limit to 1 result for this example
    _source_excludes=["title_vector"]  # Field to leave out of the response
)
pretty_print(response.body)

## Flexible author search (excluding 'title_vector'):
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.3534746,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 2.3534746,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "summary": "A fast-paced, no-nonsense guide to programming in Python",
          "publish_date": "2019-05-03",
          "num_reviews": 42,
          "publisher": "no starch press"
        }
      }
    ]
  }
}


In [75]:
print("## Prefix search (e.g. 'eric'):")
query_term = "eric"
response = client.search(
    index=INDEX_NAME,
    query={
        "match_phrase_prefix": {
            "authors": {
                "query": query_term
            }
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response.body)
# Should find 'eric matthes', and also 'erich gamma' because 'eric' is a prefix of 'erich'

## Prefix search (e.g. 'eric'):
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2.3534746,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 2.3534746,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "summary": "A fast-paced, no-nonsense guide to programming in Python",
          "publish_date": "2019-05-03",
          "num_reviews": 42,
          "publisher": "no starch press"
        }
      },
      {
        "_index": "book_index",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 1.2347455,
        "_source": {
          "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
          "authors": [
            "erich gamma",
          

In [79]:
print("\n## Fuzzy search (e.g. 'rric'):")
query_term = "rric"
response = client.search(
    index=INDEX_NAME,
    query={
        "match": {
            "authors": {
                "query": query_term,
                "fuzziness": "AUTO"  # Or a value such as 1 or 2 (maximum edit distance)
                                     # "AUTO" picks the edit distance based on the term length.
            }
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response.body)
# Should find 'eric matthes' if the fuzziness is adequate.


## Fuzzy search (e.g. 'rric'):
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.765106,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 1.765106,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "summary": "A fast-paced, no-nonsense guide to programming in Python",
          "publish_date": "2019-05-03",
          "num_reviews": 42,
          "publisher": "no starch press"
        }
      }
    ]
  }
}
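
To see why `rric` still finds `eric`: with `"fuzziness": "AUTO"`, Elasticsearch allows no edits for terms of 1 to 2 characters, one edit for 3 to 5 characters, and two edits for longer terms. The sketch below reproduces that rule with a plain Levenshtein distance. Elasticsearch additionally counts a transposition of two adjacent characters as a single edit by default, which this simplified version does not. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def auto_max_edits(term):
    """Maximum edits allowed by fuzziness AUTO (default 3,6 thresholds)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_matches(query_term, indexed_term):
    """True if indexed_term is within the AUTO edit budget of query_term."""
    return levenshtein(query_term, indexed_term) <= auto_max_edits(query_term)


print(fuzzy_matches("rric", "eric"))     # one substitution, one edit allowed
print(fuzzy_matches("rric", "matthes"))  # far more than one edit apart
```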


In [81]:
print("\n## Wildcard search using query_string:")

# Example for "eri" (as a prefix)
query_wildcard_prefix = "eri*"
print(f"--- Searching with: {query_wildcard_prefix} ---")
response_prefix = client.search(
    index=INDEX_NAME,
    query={
        "query_string": {
            "query": query_wildcard_prefix,
            "default_field": "authors"
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response_prefix.body)


## Wildcard search using query_string:
--- Searching with: eri* ---
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 1.0,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "summary": "A fast-paced, no-nonsense guide to programming in Python",
          "publish_date": "2019-05-03",
          "num_reviews": 42,
          "publisher": "no starch press"
        }
      },
      {
        "_index": "book_index",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 1.0,
        "_source": {
          "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
          "authors": [
            "eric

In [80]:
# Example for "ric" (as part of the term)
query_wildcard_contains = "*ric*"  # Could be "*ric" if you only want matches at the end
print(f"\n--- Searching with: {query_wildcard_contains} ---")
response_contains = client.search(
    index=INDEX_NAME,
    query={
        "query_string": {
            "query": query_wildcard_contains,
            "default_field": "authors"
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response_contains.body)
# This query ('*ric*') should find 'eric matthes'.


--- Searching with: *ric* ---
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 1.0,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "summary": "A fast-paced, no-nonsense guide to programming in Python",
          "publish_date": "2019-05-03",
          "num_reviews": 42,
          "publisher": "no starch press"
        }
      },
      {
        "_index": "book_index",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 1.0,
        "_source": {
          "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
          "authors": [
            "erich gamma",
            "richard helm",
        

In [82]:
print("\n## Search with simple_query_string:")

# Prefix "eri"
query_sqs_prefix = "eri*"
print(f"--- Searching with simple_query_string (prefix): {query_sqs_prefix} ---")
response_sqs_prefix = client.search(
    index=INDEX_NAME,
    query={
        "simple_query_string": {
            "query": query_sqs_prefix,
            "fields": ["authors"],  # Fields to search in
            "default_operator": "AND"
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response_sqs_prefix.body)


# Typo "erric" (using fuzziness)
query_sqs_fuzzy = "erric~1"  # ~1 means an edit distance of 1
print(f"\n--- Searching with simple_query_string (fuzziness): {query_sqs_fuzzy} ---")
response_sqs_fuzzy = client.search(
    index=INDEX_NAME,
    query={
        "simple_query_string": {
            "query": query_sqs_fuzzy,
            "fields": ["authors"],
            "default_operator": "AND"
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response_sqs_fuzzy.body)


# "ric" as part of "eric"
query_sqs_contains = "*ric*"
print(f"\n--- Searching with simple_query_string (contains): {query_sqs_contains} ---")
response_sqs_contains = client.search(
    index=INDEX_NAME,
    query={
        "simple_query_string": {
            "query": query_sqs_contains,
            "fields": ["authors"],
            "default_operator": "AND"
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response_sqs_contains.body)


## Search with simple_query_string:
--- Searching with simple_query_string (prefix): eri* ---
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 1.0,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "summary": "A fast-paced, no-nonsense guide to programming in Python",
          "publish_date": "2019-05-03",
          "num_reviews": 42,
          "publisher": "no starch press"
        }
      },
      {
        "_index": "book_index",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 1.0,
        "_source": {
          "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
          "authors":

In [84]:
print("\n## Trying query_string for '*ric*':")
query_term = "*ric*"
response_qs = client.search(
    index=INDEX_NAME,
    query={
        "query_string": {
            "query": query_term,
            "fields": ["authors"],  # You can use 'fields' or 'default_field'
            # "analyze_wildcard": False  # This is the default, but you can be explicit
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response_qs.body)


## Trying query_string for '*ric*':
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 1.0,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ],
          "summary": "A fast-paced, no-nonsense guide to programming in Python",
          "publish_date": "2019-05-03",
          "num_reviews": 42,
          "publisher": "no starch press"
        }
      },
      {
        "_index": "book_index",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 1.0,
        "_source": {
          "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
          "authors": [
            "erich gamma",
            "richard 

In [85]:
print("\n## Trying a simplified simple_query_string for '*ric*':")
query_term = "*ric*"
response_sqs_simple = client.search(
    index=INDEX_NAME,
    query={
        "simple_query_string": {
            "query": query_term,
            "fields": ["authors"]
        }
    },
    size=5,
    _source_excludes=["title_vector"]
)
pretty_print(response_sqs_simple.body)


## Trying a simplified simple_query_string for '*ric*':
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}
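
The empty result is consistent with how the two parsers treat wildcards: `query_string` accepts `*` anywhere in a term, whereas `simple_query_string` documents `*` only as a prefix operator at the end of a term. The sketch below emulates term-level wildcard matching with Python's `fnmatch` over a rough approximation of the analyzed tokens. It is an illustration under that assumption, not how Lucene evaluates the query, and the helper names are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def rough_tokens(text):
    """Very rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def wildcard_hits(pattern, authors):
    """Return the authors having at least one token that matches the pattern."""
    return [
        author for author in authors
        if any(fnmatchcase(token, pattern) for token in rough_tokens(author))
    ]


authors = ["eric matthes", "erich gamma", "richard helm", "kyle simpson"]
print(wildcard_hits("eri*", authors))   # prefix pattern
print(wildcard_hits("*ric*", authors))  # "contains" pattern
```

Because wildcards are matched against individual tokens, `*ric*` matches `eric`, `erich`, and `richard` alike.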


In [86]:
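# Python-client equivalent of the Console request
#   GET book_index/_termvectors/_NVoQJcBYS182ubtUI05 {"fields": ["authors"]}
# The document ID is specific to this run; replace it with one from your index.
response = client.termvectors(
    index=INDEX_NAME,
    id="_NVoQJcBYS182ubtUI05",
    fields=["authors"],
)
pretty_print(response.body)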

In [57]:
print("\n## 1.2 Exact author search:")
response = client.search(
    index=INDEX_NAME,
    query={
        "term": {
            # A term query on a keyword field must match the whole value,
            # so "david" alone does not match "david thomas"
            "authors.keyword": "david"
        }
    },
    size=5
)
pretty_response(response.body)


## 1.2 Exact author search:
Your search returned no results.
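
The empty result is expected: a `term` query on a `keyword` field compares the whole stored value, so `david` does not match `david thomas`. A sketch of query builders for both styles follows; `exact_author_query` and `flexible_author_query` are hypothetical helper names.

```python
def exact_author_query(full_name):
    """term query: the keyword value must equal full_name exactly."""
    return {"term": {"authors.keyword": full_name}}


def flexible_author_query(name_part):
    """match query: the analyzed text must contain the analyzed query tokens."""
    return {"match": {"authors": name_part}}


# Either dictionary can be passed as query=... to client.search
print(exact_author_query("david thomas"))
print(flexible_author_query("david"))
```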


In [63]:
print("\n## 2. Unique authors aggregation:")
response = client.search(
    index=INDEX_NAME,
    size=0,  # We only need the aggregation, not the documents
    aggs={
        "unique_authors": {
            "terms": {
                "field": "authors.keyword",
                "size": 20  # Number of unique authors to show
            }
        }
    }
)
pretty_print(response.body['aggregations'])


## 2. Unique authors aggregation:
{
  "unique_authors": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "robert c. martin",
        "doc_count": 2
      },
      {
        "key": "andrew hunt",
        "doc_count": 1
      },
      {
        "key": "david thomas",
        "doc_count": 1
      },
      {
        "key": "douglas crockford",
        "doc_count": 1
      },
      {
        "key": "eric matthes",
        "doc_count": 1
      },
      {
        "key": "erich gamma",
        "doc_count": 1
      },
      {
        "key": "john vlissides",
        "doc_count": 1
      },
      {
        "key": "kyle simpson",
        "doc_count": 1
      },
      {
        "key": "marijn haverbeke",
        "doc_count": 1
      },
      {
        "key": "michael sipser",
        "doc_count": 1
      },
      {
        "key": "peter norvig",
        "doc_count": 1
      },
      {
        "key": "ralph johnson",
        "doc_count

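The `buckets` list of a `terms` aggregation is easy to reshape for further processing. A minimal sketch, where the helper name `bucket_counts` is hypothetical and the sample data is a subset of the output above:

```python
def bucket_counts(aggregations, name):
    """Turn a terms aggregation into a {key: doc_count} dictionary."""
    return {b["key"]: b["doc_count"] for b in aggregations[name]["buckets"]}


sample_aggregations = {
    "unique_authors": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {"key": "robert c. martin", "doc_count": 2},
            {"key": "andrew hunt", "doc_count": 1},
            {"key": "david thomas", "doc_count": 1},
        ],
    }
}
counts = bucket_counts(sample_aggregations, "unique_authors")
print(counts["robert c. martin"])  # 2
```

In the notebook, the first argument would be `response.body["aggregations"]`.
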
In [69]:
print("\n## 3. Highlighting terms in authors:")
response = client.search(
    index=INDEX_NAME,
    query={
        "match": {
            "authors": "Martin"  # Find books with "Martin" among the authors
        }
    },
    highlight={
        "fields": {
            "authors": {}  # Default highlighting settings for the 'authors' field
        }
    },
    size=3,
    _source_excludes=["title_vector"]
)
pretty_print(response.body)


## 3. Highlighting terms in authors:
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.5204805,
    "hits": [
      {
        "_index": "book_index",
        "_id": "_tVoQJcBYS182ubtUI05",
        "_score": 1.5204805,
        "_source": {
          "title": "Clean Code: A Handbook of Agile Software Craftsmanship",
          "authors": [
            "robert c. martin"
          ],
          "summary": "A guide to writing code that is easy to read, understand and maintain",
          "publish_date": "2008-08-11",
          "num_reviews": 55,
          "publisher": "prentice hall"
        },
        "highlight": {
          "authors": [
            "robert c. <em>martin</em>"
          ]
        }
      },
      {
        "_index": "book_index",
        "_id": "AtVoQJcBYS182ubtUI45",
        "_score": 1.5204805,
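
Each hit in the response above carries its highlighted fragments under a `highlight` key, keyed by field name. The following is a minimal sketch of pulling those fragments out of a response body. `extract_highlights` is a hypothetical helper that is not part of the notebook, and `sample_body` is a hand-trimmed copy of the first hit shown above.

```python
def extract_highlights(body, field):
    """Return (title, fragments) pairs for hits that have highlights on `field`."""
    results = []
    for hit in body.get("hits", {}).get("hits", []):
        fragments = hit.get("highlight", {}).get(field, [])
        if fragments:
            results.append((hit["_source"].get("title"), fragments))
    return results


# Trimmed copy of the first hit from the output above (illustrative only).
sample_body = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "title": "Clean Code: A Handbook of Agile Software Craftsmanship"
                },
                "highlight": {"authors": ["robert c. <em>martin</em>"]},
            }
        ]
    }
}

for title, fragments in extract_highlights(sample_body, "authors"):
    print(title, "->", fragments)
```

In the notebook itself, the same helper could be called as `extract_highlights(response.body, "authors")`.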
      

In [87]:
print("\n## 4. Books with more than one author:")
response = client.search(
    index=INDEX_NAME,
    query={
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        "source": "doc['authors.keyword'].size() > params.count",
                        "params": {
                            "count": 1
                        }
                    }
                }
            }
        }
    },
    size=5,
    _source=["title", "authors"]  # Return only the title and authors
)
pretty_print(response.body)


## 4. Books with more than one author:
{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.0,
    "hits": [
      {
        "_index": "book_index",
        "_id": "-9VoQJcBYS182ubtUI05",
        "_score": 0.0,
        "_source": {
          "title": "The Pragmatic Programmer: Your Journey to Mastery",
          "authors": [
            "andrew hunt",
            "david thomas"
          ]
        }
      },
      {
        "_index": "book_index",
        "_id": "_dVoQJcBYS182ubtUI05",
        "_score": 0.0,
        "_source": {
          "title": "Artificial Intelligence: A Modern Approach",
          "authors": [
            "stuart russell",
            "peter norvig"
          ]
        }
      },
      {
        "_index": "book_index",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 0.0,
        "_source": {


In [88]:
print("\n## 5. Books with more than two authors:")
response = client.search(
    index=INDEX_NAME,
    query={
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        "source": "doc['authors.keyword'].size() > params.count",
                        "params": {
                            "count": 2
                        }
                    }
                }
            }
        }
    },
    size=5,
    _source=["title", "authors"]
)
pretty_print(response.body)


## 5. Books with more than two authors:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.0,
    "hits": [
      {
        "_index": "book_index",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 0.0,
        "_source": {
          "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
          "authors": [
            "erich gamma",
            "richard helm",
            "ralph johnson",
            "john vlissides"
          ]
        }
      }
    ]
  }
}
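
The two script queries above differ only in the value of `params.count`. As a sketch, the query body could be produced by a small builder so the threshold is passed in once. `more_authors_than` is a hypothetical name introduced here for illustration; the script source is copied unchanged from the cells above.

```python
def more_authors_than(count):
    """Build a bool/filter script query matching documents with more than `count` authors."""
    return {
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        # Same Painless expression as in the cells above.
                        "source": "doc['authors.keyword'].size() > params.count",
                        "params": {"count": count},
                    }
                }
            }
        }
    }


query = more_authors_than(2)
print(query["bool"]["filter"]["script"]["script"]["params"])
```

With a connected client, this would be used as `client.search(index=INDEX_NAME, query=more_authors_than(2), size=5, _source=["title", "authors"])`. Keeping the threshold in `params` rather than in the script text lets the same script source be reused for every threshold.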


In [89]:
print("\n## 6. Facet by number of authors:")
response = client.search(
    index=INDEX_NAME,
    size=0,  # We only need the aggregation, not the documents
    aggs={
        "authors_count_facet": {
            "terms": {
                "script": {
                    "source": "doc['authors.keyword'].size()",  # Returns the number of authors
                    "lang": "painless"
                },
                "size": 10  # Show the 10 most common author counts
            }
        }
    }
)
pretty_print(response.body['aggregations'])


## 6. Facet by number of authors:
{
  "authors_count_facet": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "1",
        "doc_count": 7
      },
      {
        "key": "2",
        "doc_count": 2
      },
      {
        "key": "4",
        "doc_count": 1
      }
    ]
  }
}
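
The bucket keys in the output above come back as strings (`"1"`, `"2"`, `"4"`), so they need converting before they can be compared or sorted as numbers. A minimal sketch of that conversion follows. `buckets_to_counts` is a hypothetical helper, and `facet` is a hand-copied subset of the aggregation output shown above.

```python
def buckets_to_counts(aggregation):
    """Map each bucket's author count (as an int) to the number of books in that bucket."""
    return {int(bucket["key"]): bucket["doc_count"] for bucket in aggregation["buckets"]}


# Copied from the aggregation output above.
facet = {
    "buckets": [
        {"key": "1", "doc_count": 7},
        {"key": "2", "doc_count": 2},
        {"key": "4", "doc_count": 1},
    ]
}

counts = buckets_to_counts(facet)
print(counts)                # {1: 7, 2: 2, 4: 1}
print(sum(counts.values()))  # 10
```

In the notebook, the equivalent call would be `buckets_to_counts(response.body["aggregations"]["authors_count_facet"])`.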


In [91]:
from elasticsearch import Elasticsearch

# Assumes 'client' is already initialized
# client = Elasticsearch(...)
# INDEX_NAME = "book_index"  # Or a new name such as "book_index_ngram"

# 1. Define the settings and mappings
index_settings = {
    "analysis": {
        "analyzer": {
            "custom_trigram_analyzer": {
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "trigram_token_filter"
                ]
            }
        },
        "filter": {
            "trigram_token_filter": {
                "type": "ngram",
                "min_gram": 3,
                "max_gram": 3
            }
        }
    }
}

index_mappings = {
    "properties": {
        "authors": {
            "type": "text",
            "analyzer": "custom_trigram_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "num_reviews": {"type": "long"},
        "publish_date": {"type": "date"},
        "publisher": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}
        },
        "summary": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}
        },
        "title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}
        },
        "title_vector": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
            "index_options": {"type": "int8_hnsw", "m": 16, "ef_construction": 100}
        }
    }
}

# (Optional) Name of the new index; you can reuse the original name if you delete that index first
NEW_INDEX_NAME = "book_index_trigram"  # Or use your original INDEX_NAME

# 2. (Optional) Delete the existing index if you are reusing the name during development
if client.indices.exists(index=NEW_INDEX_NAME):
    print(f"Deleting existing index: {NEW_INDEX_NAME}")
    client.indices.delete(index=NEW_INDEX_NAME)

# 3. Create the new index with the settings and mappings
print(f"Creating index: {NEW_INDEX_NAME}")
client.indices.create(
    index=NEW_INDEX_NAME,
    settings=index_settings,
    mappings=index_mappings
)
print("Index created successfully.")

# 4. Re-populate your data
#    How you do this depends on how the data was loaded originally.
#    For example, if you have a list of documents:
#    docs = [
#        {"authors": ["eric matthes"], "title": "Python Crash Course", ...},
#        ...
#    ]
#    for i, doc in enumerate(docs):
#        client.index(index=NEW_INDEX_NAME, id=i, document=doc)
#    print(f"{len(docs)} documents indexed.")

# If you already have an index (e.g. "book_index") and want to copy its data across:
print("Starting reindex...")
# Transport options such as request_timeout are set through client.options();
# passing them directly to the API method raises a DeprecationWarning in the 8.x client.
client.options(request_timeout=300).reindex(
    source={"index": "book_index"},  # Your original index
    dest={"index": NEW_INDEX_NAME}
)
print("Reindex completed.")

Creating index: book_index_trigram
Index created successfully.
Starting reindex...
Reindex completed.
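
Step 4 in the cell above indexes documents one request at a time. When the data comes from a Python list rather than an existing index, the client's `helpers.bulk` function can send them in batches instead. The sketch below assumes a list of dictionaries shaped like the commented example. `build_actions` and `bulk_index` are hypothetical names, and the two sample documents are placeholders, not the notebook's full dataset.

```python
def build_actions(docs, index_name):
    """Yield one bulk 'index' action per document, using its list position as the _id."""
    for i, doc in enumerate(docs):
        yield {"_index": index_name, "_id": i, "_source": doc}


def bulk_index(client, docs, index_name):
    """Send all documents through the bulk helper and return (success_count, errors)."""
    # Imported here so build_actions can be tried without the package installed.
    from elasticsearch import helpers

    return helpers.bulk(client, build_actions(docs, index_name))


# Placeholder documents for illustration only.
docs = [
    {"authors": ["eric matthes"], "title": "Python Crash Course"},
    {"authors": ["andrew hunt", "david thomas"],
     "title": "The Pragmatic Programmer: Your Journey to Mastery"},
]

actions = list(build_actions(docs, "book_index_trigram"))
print(len(actions), "actions prepared")
```

With a connected client, `bulk_index(client, docs, NEW_INDEX_NAME)` would perform the actual indexing. Using the list position as `_id` mirrors the commented loop above; omit `_id` from the action to let Elasticsearch generate one.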




In [95]:
# Assumes 'client' is configured and NEW_INDEX_NAME is the index with the trigram analyzer
# (and that the data has been re-populated into this new index)
import json

# Example: search for the misspelled name "erric"
search_term = "erric"

print(f"\nSearching for '{search_term}' in the 'authors' field (N-gram):")
response = client.search(
    index=NEW_INDEX_NAME,  # Make sure to query the correct index
    query={
        "match": {
            "authors": search_term
        }
    },
    size=5,
    _source=["title", "authors"]
)
# pretty_print(response.body) also works if the helper from the earlier cells is defined
print(json.dumps(response.body, indent=2, ensure_ascii=False))

# The query "erric" is analyzed into the trigrams "err", "rri" and "ric".
# The trigram "ric" also occurs in "eric", "erich" and "richard", so both
# "eric matthes" and the book by "erich gamma" / "richard helm" match despite the typo.


Searching for 'erric' in the 'authors' field (N-gram):
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2.251802,
    "hits": [
      {
        "_index": "book_index_trigram",
        "_id": "AdVoQJcBYS182ubtUI45",
        "_score": 2.251802,
        "_source": {
          "title": "Design Patterns: Elements of Reusable Object-Oriented Software",
          "authors": [
            "erich gamma",
            "richard helm",
            "ralph johnson",
            "john vlissides"
          ]
        }
      },
      {
        "_index": "book_index_trigram",
        "_id": "_NVoQJcBYS182ubtUI05",
        "_score": 2.2498753,
        "_source": {
          "title": "Python Crash Course",
          "authors": [
            "eric matthes"
          ]
        }
      }
    ]
  }
}
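
To see why both books match, it helps to reproduce what `custom_trigram_analyzer` does to the query and to each author name. The sketch below is a rough pure-Python approximation: it lowercases the text, splits on `\w+` (an assumption standing in for the `standard` tokenizer, which follows Unicode word-boundary rules), and emits every 3-character window of each word. `trigram_tokens` is a hypothetical helper, not an Elasticsearch API.

```python
import re


def trigram_tokens(text, n=3):
    """Approximate the analyzer: lowercase, split into words, emit n-character windows."""
    tokens = []
    for word in re.findall(r"\w+", text.lower()):
        # Words shorter than n produce no tokens, as with min_gram = max_gram = 3.
        tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


query_grams = set(trigram_tokens("erric"))
print(sorted(query_grams))  # ['err', 'ric', 'rri']

for author in ["eric matthes", "erich gamma", "richard helm", "robert c. martin"]:
    shared = query_grams & set(trigram_tokens(author))
    print(f"{author}: {sorted(shared)}")
```

Under this approximation, the only trigram shared with the query is `ric`, which appears in "eric matthes", "erich gamma" and "richard helm" but not in "robert c. martin". Because a `match` query by default requires only one analyzed term to match, a single shared trigram is enough for a hit. To inspect the real token stream, the analyze API (`client.indices.analyze`) can be run against the index with the `custom_trigram_analyzer` analyzer.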
