# Semantic search quick start

This interactive notebook will introduce you to some basic operations with Elasticsearch, using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html).
You'll perform semantic search using [Sentence Transformers](https://www.sbert.net) for text embedding. Learn how to integrate traditional text-based search with semantic search, for a hybrid search system.

## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

Once logged in to your Elastic Cloud account, go to the [Create deployment](https://cloud.elastic.co/deployments/create) page and select **Create deployment**. Leave all settings with their default values.

## Install packages and import modules

To get started, we'll need to connect to our Elastic deployment using the Python client.
Elastic Cloud deployments are usually identified by a **Cloud ID**; the connection code in this notebook identifies the deployment by its endpoint URL and a CA certificate instead.

First, install the `elasticsearch` Python client and the `sentence-transformers` library.

In [1]:
!pip install -qU elasticsearch sentence-transformers==2.7.0
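
After the install cell, it can help to confirm which versions actually ended up in the environment, since the cell pins `sentence-transformers` but not `elasticsearch`. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption made for illustration and is not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names exactly as they were passed to pip in the cell above
for name in ("elasticsearch", "sentence-transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports `not installed`, rerun the install cell before continuing.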

## Set up the embedding model

For this example, we're using `all-MiniLM-L6-v2`, a model available through the `sentence_transformers` library. You can read more about this model on [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [9]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

## Initialize the Elasticsearch client

Now we can instantiate the [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the deployment's endpoint URL, your username and password, and the path to the deployment's CA certificate.

In [20]:
from getpass import getpass
from ssl import create_default_context

from elasticsearch import Elasticsearch

# Your credentials: prompt for them instead of hardcoding secrets in the notebook
USERNAME = getpass("Elasticsearch username: ")
PASSWORD = getpass("Elasticsearch password: ")
# Path to the CA certificate file for your deployment
CERT_PATH = "6b3f1059-174c-11ea-8ac1-9e4ef9cb8b62"

# Create SSL context with the certificate
context = create_default_context(cafile=CERT_PATH)

# Create the client instance. Credentials are supplied through basic_auth,
# so the host URL does not include them.
client = Elasticsearch(
    hosts=["https://7cb70516-3a4a-447d-9017-90ff4622dd14.bngflf7f0ktkmkdl3jhg.databases.appdomain.cloud:32745"],
    basic_auth=(USERNAME, PASSWORD),
    ssl_context=context,
)

In [11]:
try:
    info = client.info()
    print("Connected successfully!")
    print("Elasticsearch version:", info["version"]["number"])
except Exception as e:
    print("Connection failed:", str(e))

Connected successfully!
Elasticsearch version: 8.15.0


In [12]:
# List all indices
indices = client.indices.get_alias(index="*")
for index in indices:
    print(index)

# Get more detailed information, including the document count
indices_stats = client.indices.stats()
for index, stats in indices_stats["indices"].items():
    print(f"\nIndex: {index}")
    print(f"Docs count: {stats['total']['docs']['count']}")

.ent-search-actastic-workplace_search_accounts_v16
.ent-search-actastic-workplace_search_search_groups_v4-name-unique-constraint
.ent-search-actastic-crawler2_robots_txts
.ent-search-actastic-workplace_search_pre_content_sources_v3
.ent-search-actastic-crawler_crawl_requests_v7
.ent-search-esqueues-me_queue_v1_process_crawl2
.ent-search-actastic-reindex_jobs_v3
.ent-search-actastic-workplace_search_role_mappings_v8
.kibana_8.15.0_001
.ent-search-actastic-search_relevance_suggestion_update_process_v1
.apm-custom-link
.ent-search-actastic-connectors_jobs_v5
.ml-annotations-000001
.ent-search-actastic-workplace_search_content_sources_v23
.internal.alerts-observability.uptime.alerts-default-000001
.ent-search-actastic-users_v7-auth_source-elasticsearch_username-unique-constraint
.ent-search-actastic-crawler_process_crawls
.apm-source-map
.ent-search-actastic-users_v7-email-unique-constraint
.ent-search-actastic-crawler2_configurations_v2-index_name-unique-constraint
.slo-observability.summ
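
Most of the names listed above start with a dot, which by convention marks hidden or system indices rather than indices you created yourself. As a rough sketch, the list can be split on that prefix so your own indices stand out. The helper name `split_indices` is hypothetical, and the sample list uses two names copied from the output above plus the `book_index` name that this notebook creates later.

```python
def split_indices(index_names):
    """Separate user-created indices from dot-prefixed (hidden/system) ones."""
    user, system = [], []
    for name in sorted(index_names):
        (system if name.startswith(".") else user).append(name)
    return user, system


# Two names from the output above, plus the index this notebook creates later
sample = [".kibana_8.15.0_001", ".apm-custom-link", "book_index"]
user, system = split_indices(sample)
print("User indices:", user)
print("System index count:", len(system))
```

With a live connection, the same helper could be applied to the keys returned by `client.indices.get_alias(index="*")`.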

### Test the client

Before you continue, confirm that the client has connected by printing the cluster information.

In [13]:
print(client.info())

{'name': 'm-0.7cb70516-3a4a-447d-9017-90ff4622dd14.884686a9d2fb4f288039d08f94d09479.bngflf7f0ktkmkdl3jhg.databases.appdomain.cloud', 'cluster_name': '7cb70516-3a4a-447d-9017-90ff4622dd14', 'cluster_uuid': 'ork0gSebRBuVRZ7Ibkdmsg', 'version': {'number': '8.15.0', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': '1a77947f34deddb41af25e6f0ddb8e830159c179', 'build_date': '2024-08-05T10:05:34.233336849Z', 'build_snapshot': False, 'lucene_version': '9.11.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


## Index some test data

Our client is set up and connected to our Elastic deployment.
Now we need some data to test out the basics of Elasticsearch queries.
We'll use a small index of books with the following fields:

- `title`
- `authors`
- `publish_date`
- `num_reviews`
- `publisher`
- `summary`

### Create an index

First ensure that you do not have a previously created index with the name `book_index`.

In [14]:
client.indices.delete(index="book_index", ignore_unavailable=True)

ObjectApiResponse({'acknowledged': True})

🔐 NOTE: at any time you can come back to this section and run the `delete` function above to remove your index and start from scratch.

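Let's create an Elasticsearch index with the correct mappings for our test data. Only `title_vector` is mapped explicitly: its `dims` value of 384 matches the size of the embeddings produced by `all-MiniLM-L6-v2`. The remaining fields are mapped dynamically when the data is indexed.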

In [15]:
# Define the mapping
mappings = {
    "properties": {
        "title_vector": {
            "type": "dense_vector",
            "dims": 384,  # embedding size of all-MiniLM-L6-v2
            "index": True,  # a boolean rather than the string "true"
            "similarity": "cosine",
        }
    }
}

# Create the index
client.indices.create(index="book_index", mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'book_index'})
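
The mapping above selects `cosine` as the similarity for `title_vector`. To make that concrete, here is a small pure-Python sketch of cosine similarity, together with the conversion that the Elasticsearch kNN documentation describes for this similarity, `_score = (1 + cosine) / 2`. The function names and the 3-dimensional toy vectors are assumptions for illustration; the real embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_cosine_score(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1] as (1 + cosine) / 2."""
    return (1 + cosine_similarity(a, b)) / 2


# Toy vectors: same direction, orthogonal, and opposite direction
print(knn_cosine_score([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))
print(knn_cosine_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
print(knn_cosine_score([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]))
```

Because only the direction of the vectors matters, scaling a vector does not change its cosine-based score.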

### Index test data

Run the following cell to upload some test data: information about 10 popular programming books from this [dataset](https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json).
`model.encode` encodes each title into a vector on the fly, using the model we initialized earlier, and `.tolist()` converts the result into a plain list.

In [16]:
import json
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json"
response = urlopen(url)
books = json.loads(response.read())

operations = []
for book in books:
    operations.append({"index": {"_index": "book_index"}})
    # Transforming the title into an embedding using the model
    book["title_vector"] = model.encode(book["title"]).tolist()
    operations.append(book)
client.bulk(index="book_index", operations=operations, refresh=True)

ObjectApiResponse({'errors': False, 'took': 294670663, 'items': [{'index': {'_index': 'book_index', '_id': 'GzuINZQBdIoiyWMegu_R', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 0, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': 'HDuINZQBdIoiyWMegu_R', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 1, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': 'HTuINZQBdIoiyWMegu_R', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 2, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': 'HjuINZQBdIoiyWMegu_R', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 3, '_primary_term': 1, 'status': 201}}, {'index
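
The `operations` list passed to `client.bulk` alternates between an action line and the document it applies to, and the response reports one result per document. The sketch below reproduces that structure without a cluster. `build_bulk_operations`, `failed_items`, and `fake_embed` are hypothetical helpers, and `fake_embed` merely stands in for `model.encode(...).tolist()`.

```python
def build_bulk_operations(docs, index_name, embed):
    """Return the alternating action/document list expected by the bulk API."""
    operations = []
    for doc in docs:
        operations.append({"index": {"_index": index_name}})
        # Copy the document so the caller's dict is not modified
        operations.append({**doc, "title_vector": embed(doc["title"])})
    return operations


def failed_items(bulk_response):
    """Return the per-document results that carry an error (empty if none)."""
    failures = []
    for item in bulk_response["items"]:
        # Each item is keyed by its action name, e.g. {"index": {...}}
        for result in item.values():
            if "error" in result:
                failures.append(result)
    return failures


def fake_embed(text):
    """Hypothetical stand-in for model.encode(text).tolist()."""
    return [float(len(text)), 0.0, 1.0]


docs = [{"title": "Eloquent JavaScript"}, {"title": "JavaScript: The Good Parts"}]
ops = build_bulk_operations(docs, "book_index", fake_embed)
print(len(ops), "entries for", len(docs), "documents")

# Mock response in the shape shown above, with one illustrative failure added
mock_response = {
    "errors": True,
    "items": [
        {"index": {"_index": "book_index", "status": 201, "result": "created"}},
        {"index": {"_index": "book_index", "status": 400, "error": {"reason": "illustrative failure"}}},
    ],
}
print(len(failed_items(mock_response)), "failed item(s)")
```

Checking the top-level `errors` flag first is usually enough; walking `items` is only needed to find out which documents failed.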

## Aside: Pretty printing Elasticsearch responses

Your API calls will return hard-to-read nested JSON.
We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples.

In [17]:
def pretty_response(response):
    """Print the main fields of each hit in a search response."""
    hits = response["hits"]["hits"]
    if not hits:
        print("Your search returned no results.")
        return
    for hit in hits:
        source = hit["_source"]
        # doc_id avoids shadowing the built-in id()
        doc_id = hit["_id"]
        score = hit["_score"]
        pretty_output = (
            f"\nID: {doc_id}"
            f"\nPublication date: {source['publish_date']}"
            f"\nTitle: {source['title']}"
            f"\nSummary: {source['summary']}"
            f"\nPublisher: {source['publisher']}"
            f"\nReviews: {source['num_reviews']}"
            f"\nAuthors: {source['authors']}"
            f"\nScore: {score}"
        )
        print(pretty_output)
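
The function above assumes that every hit contains every field. As a possible variant, the sketch below returns the text instead of printing it and substitutes `n/a` for missing fields, which makes it easy to try out on a hand-written response. `format_hit` is a hypothetical helper, and the mock hit reuses values from this notebook's data with the `summary` field deliberately left out.

```python
FIELDS = [
    ("Publication date", "publish_date"),
    ("Title", "title"),
    ("Summary", "summary"),
    ("Publisher", "publisher"),
    ("Reviews", "num_reviews"),
    ("Authors", "authors"),
]


def format_hit(hit):
    """Format one search hit as text, using 'n/a' for absent source fields."""
    source = hit.get("_source", {})
    lines = [f"ID: {hit.get('_id')}"]
    for label, field in FIELDS:
        lines.append(f"{label}: {source.get(field, 'n/a')}")
    lines.append(f"Score: {hit.get('_score')}")
    return "\n".join(lines)


# Mock hit in the shape of an Elasticsearch response entry; 'summary' is omitted on purpose
mock_hit = {
    "_id": "IzuINZQBdIoiyWMegu_R",
    "_score": 0.80517054,
    "_source": {
        "title": "JavaScript: The Good Parts",
        "publish_date": "2008-05-15",
        "publisher": "oreilly",
        "num_reviews": 51,
        "authors": ["douglas crockford"],
    },
}
text = format_hit(mock_hit)
print(text)
```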

## Making queries

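Now that we have indexed the books, we want to perform a semantic search for books that are similar to a given query.
We embed the query with the same model and run a kNN search: `k` is the number of nearest neighbors to return, and `num_candidates` is the number of candidates considered on each shard.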

In [18]:
response = client.search(
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("javascript books"),
        "k": 10,
        "num_candidates": 100,
    },
)

pretty_response(response)


ID: IzuINZQBdIoiyWMegu_R
Publication date: 2008-05-15
Title: JavaScript: The Good Parts
Summary: A deep dive into the parts of JavaScript that are essential to writing maintainable code
Publisher: oreilly
Reviews: 51
Authors: ['douglas crockford']
Score: 0.80517054

ID: HzuINZQBdIoiyWMegu_R
Publication date: 2015-03-27
Title: You Don't Know JS: Up & Going
Summary: Introduction to JavaScript and programming as a whole
Publisher: oreilly
Reviews: 36
Authors: ['kyle simpson']
Score: 0.6986463

ID: IDuINZQBdIoiyWMegu_R
Publication date: 2018-12-04
Title: Eloquent JavaScript
Summary: A modern introduction to programming
Publisher: no starch press
Reviews: 38
Authors: ['marijn haverbeke']
Score: 0.6795542

ID: GzuINZQBdIoiyWMegu_R
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.6211879
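
To build intuition for what the kNN search returns, the following sketch ranks a few documents by exact (brute-force) cosine similarity and keeps the top `k`. This is only an illustration of the ranking: for indexed `dense_vector` fields Elasticsearch performs an approximate search over an HNSW graph rather than comparing the query against every document. The function names, document titles, and 2-dimensional vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def exact_knn(query_vector, docs, k):
    """Score every document against the query and return the k best (score, title) pairs."""
    scored = [
        ((1 + cosine(query_vector, doc["title_vector"])) / 2, doc["title"])
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy documents with hypothetical 2-dimensional vectors
docs = [
    {"title": "doc-a", "title_vector": [1.0, 0.0]},
    {"title": "doc-b", "title_vector": [0.0, 1.0]},
    {"title": "doc-c", "title_vector": [0.7, 0.7]},
]
top = exact_knn([1.0, 0.1], docs, k=2)
for score, title in top:
    print(f"{title}: {score:.3f}")
```

In the approximate search, a larger `num_candidates` generally trades speed for results closer to this exact ranking.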


## Filtering

Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:

- _Does this timestamp fall into the range 2015 to 2016?_
- _Is the status field set to "published"?_

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the `filter` or `must_not` parameters in a `bool` query.

[Learn more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context) about filter context in the Elasticsearch docs.
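
As a sketch of how the two example questions above translate into filter context, the function below builds a `bool` query whose `filter` clauses restrict results without contributing to the relevance score. The field names `timestamp`, `status`, and `title`, and the helper name `build_filtered_query`, are assumptions taken from the examples rather than fields in the books index.

```python
def build_filtered_query(text, year_from, year_to, status):
    """Build a bool query: 'must' is scored, 'filter' only includes or excludes."""
    return {
        "bool": {
            # Query context: contributes to _score
            "must": [{"match": {"title": text}}],
            # Filter context: yes/no checks that do not affect _score
            "filter": [
                {
                    "range": {
                        "timestamp": {
                            "gte": f"{year_from}-01-01",
                            "lte": f"{year_to}-12-31",
                        }
                    }
                },
                {"term": {"status": status}},
            ],
        }
    }


query = build_filtered_query("javascript", 2015, 2016, "published")
print(query)
```

A query built this way could be passed to the client as `client.search(index=..., query=query)`.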

### Example: Keyword Filtering

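This example adds a keyword filter to the kNN query.

It retrieves the books whose title vectors are most similar to "javascript books", restricted to those published by Addison-Wesley. The `publisher.keyword` sub-field used in the filter is created by Elasticsearch's default dynamic mapping for string fields, and a `term` filter on it matches the exact stored value `addison-wesley`.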

In [21]:
response = client.search(
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("javascript books"),
        "k": 10,
        "num_candidates": 100,
        "filter": {"term": {"publisher.keyword": "addison-wesley"}},
    },
)

pretty_response(response)


ID: GzuINZQBdIoiyWMegu_R
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.6211879

ID: ITuINZQBdIoiyWMegu_R
Publication date: 1994-10-31
Title: Design Patterns: Elements of Reusable Object-Oriented Software
Summary: Guide to design patterns that can be used in any object-oriented language
Publisher: addison-wesley
Reviews: 45
Authors: ['erich gamma', 'richard helm', 'ralph johnson', 'john vlissides']
Score: 0.56723905
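
According to the Elasticsearch kNN documentation, a `filter` inside `knn` is applied during the vector search, so the search can still return up to `k` matching documents. Filtering only after the nearest neighbors have been chosen may leave fewer results, or none. The brute-force sketch below contrasts the two orders on toy data. It is a conceptual model only and does not reflect how Elasticsearch traverses its index; the titles and vectors are hypothetical, while the publisher values mirror those in the output above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, docs, k):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vector, d["vector"]), reverse=True)
    return ranked[:k]


def prefilter_knn(query_vector, docs, k, publisher):
    """Restrict to matching documents first, then take the k nearest."""
    matching = [d for d in docs if d["publisher"] == publisher]
    return top_k(query_vector, matching, k)


def postfilter_knn(query_vector, docs, k, publisher):
    """Take the k nearest first, then drop those that do not match."""
    return [d for d in top_k(query_vector, docs, k) if d["publisher"] == publisher]


docs = [
    {"title": "js-1", "publisher": "oreilly", "vector": [1.0, 0.0]},
    {"title": "js-2", "publisher": "oreilly", "vector": [0.9, 0.1]},
    {"title": "aw-1", "publisher": "addison-wesley", "vector": [0.5, 0.5]},
    {"title": "aw-2", "publisher": "addison-wesley", "vector": [0.0, 1.0]},
]
query_vector = [1.0, 0.0]
pre = [d["title"] for d in prefilter_knn(query_vector, docs, 2, "addison-wesley")]
post = [d["title"] for d in postfilter_knn(query_vector, docs, 2, "addison-wesley")]
print("Filter first:", pre)
print("Filter last:", post)
```

This matches the behavior seen above: the filtered search returns Addison-Wesley titles even though they did not rank at the top of the unfiltered results.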
