<a href="https://colab.research.google.com/github/jeffvestal/elastic_jupyter_notebooks/blob/main/load_embedding_model_from_hf_to_elastic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading a Sentence Transformer model from Hugging Face into Elastic

This code will show you how to set up an ingest pipeline to generate vectors for documents on ingest.

Overview of steps
1. Set up our Python environment
2. Set up the index mapping
3. Configure the ingest pipeline
4. Index a couple of test documents
5. Run an approximate kNN search

### Requirements
This notebook assumes you have already loaded an embedding model into Elasticsearch. If you haven't, please start with [this notebook example](https://github.com/jeffvestal/elastic_jupyter_notebooks/blob/main/load_embedding_model_from_hf_to_elastic.ipynb).


### Elastic version support
Requires Elastic version 8.0+ with a platinum or enterprise license (or trial license)

You can set up a [free trial Elasticsearch deployment in Elastic Cloud](https://cloud.elastic.co/registration).

---
---
# Setup
---
---
This section will set up the python environment with the required libraries

## Install and import required python libraries

Elastic uses the [eland Python library](https://github.com/elastic/eland) to download models from the Hugging Face hub and load them into Elasticsearch.

In [None]:
pip install eland

In [None]:
pip install elasticsearch

In [None]:
pip install transformers

In [None]:
pip install sentence_transformers

In [None]:
pip install torch==1.11

In [None]:
from pathlib import Path
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel
from elasticsearch import Elasticsearch, helpers
from elasticsearch.client import MlClient
from pprint import pprint

## Configure Elasticsearch authentication
The recommended authentication approach is to use the [Elastic Cloud ID](https://www.elastic.co/guide/en/cloud/current/ec-cloud-id.html) and a [cluster level API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html).

You can use any method you wish to set the required credentials. We are using `getpass` in this example to prompt for credentials to avoid storing them in GitHub.

In [None]:
import getpass

In [None]:
es_cloud_id = getpass.getpass('Enter Elastic Cloud ID:  ')
es_api_id = getpass.getpass('Enter cluster API key ID:  ') 
es_api_key = getpass.getpass('Enter cluster API key:  ')
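
If you run this notebook non-interactively, one option is to read each credential from an environment variable and fall back to the `getpass` prompt only when the variable is missing. This is a sketch, not part of the original notebook: `read_secret` and the variable names `ES_CLOUD_ID`, `ES_API_ID`, and `ES_API_KEY` are hypothetical.

```python
import getpass
import os


def read_secret(env_var, prompt):
    """Return the value of env_var, or prompt for it if it is unset or empty."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Fall back to an interactive prompt that does not echo the input.
    return getpass.getpass(prompt)


# Example usage (the environment variable names are assumptions):
# es_cloud_id = read_secret("ES_CLOUD_ID", "Enter Elastic Cloud ID:  ")
# es_api_id = read_secret("ES_API_ID", "Enter cluster API key ID:  ")
# es_api_key = read_secret("ES_API_KEY", "Enter cluster API key:  ")
```

Either way, the secrets stay out of the notebook file itself.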

## Connect to Elastic Cloud

In [None]:
es = Elasticsearch(cloud_id=es_cloud_id, 
                   api_key=(es_api_id, es_api_key)
                   )
es.info() # should return cluster info
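
The cluster info includes the version under `version.number`, which can be compared against the 8.0 minimum listed in the requirements. The helper below is a minimal sketch: `meets_minimum_version` is a hypothetical name, `sample_info` is a hand-written stand-in for the real response, and the code assumes a version string of the form `major.minor.patch` with an optional suffix such as `-SNAPSHOT`.

```python
def meets_minimum_version(version_number, minimum=(8, 0)):
    """Return True if a version string such as '8.6.2' is at least `minimum`."""
    # Drop any suffix such as '-SNAPSHOT' before splitting on dots.
    core = version_number.split("-")[0]
    # Compare only the major and minor components.
    major_minor = tuple(int(part) for part in core.split(".")[:2])
    return major_minor >= tuple(minimum)


# Stand-in for the `version` part of an es.info() response (made-up values).
sample_info = {"version": {"number": "8.6.2"}}
print(meets_minimum_version(sample_info["version"]["number"]))
```

With a live connection, the same check could be applied to `es.info()["version"]["number"]`.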

---
---
# Model Information and Status
---
---

## View information about the model
This is not required, but it lets us get the `model_id` as it is stored in Elasticsearch and verify that the model is deployed and ready to use in our ingest pipeline.

In [None]:
m = MlClient.get_trained_models(es)
m.body

## Set the model_id for ease of reference later
To make it easy to reference later, we will set `es_model_id` to the `model_id` listed in the output above.

In [None]:
es_model_id = "sentence-transformers__msmarco-minilm-l-12-v3"

### *If* the model is not started we will need to deploy the model

You will only need to run this if the model hasn't been deployed. 

This will load the model on the ML nodes and start the process(es) making it available for the NLP task

Uncomment the code below to start the deployment.

In [None]:
#s = MlClient.start_trained_model_deployment(es, model_id=es_model_id)
#s.body

#### Verify the model started without issue
If you aren't sure whether the model has started, you can check its routing state here.

In [None]:
stats = MlClient.get_trained_models_stats(es, model_id=es_model_id)
stats.body['trained_model_stats'][0]['deployment_stats']['nodes'][0]['routing_state']
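
The cell above inspects only the first node of the first model. When a deployment spans several ML nodes, it can help to collect the state reported by every node. The helper below is a sketch that works on the response body as a plain dict. `routing_states` is a hypothetical name, and `sample_stats` is a trimmed, hand-written illustration of the structure accessed above, not real cluster output.

```python
def routing_states(stats_body):
    """Map each model_id to the list of routing states reported by its nodes."""
    states = {}
    for model in stats_body.get("trained_model_stats", []):
        nodes = model.get("deployment_stats", {}).get("nodes", [])
        per_node = []
        for node in nodes:
            state = node.get("routing_state")
            # Assumption: the state may be nested as {"routing_state": "started"}.
            if isinstance(state, dict):
                state = state.get("routing_state")
            per_node.append(state)
        states[model.get("model_id")] = per_node
    return states


# Hand-written sample; the values are made up for illustration.
sample_stats = {
    "trained_model_stats": [
        {
            "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
            "deployment_stats": {
                "nodes": [
                    {"routing_state": {"routing_state": "started"}},
                    {"routing_state": {"routing_state": "starting"}},
                ]
            },
        }
    ]
}
print(routing_states(sample_stats))
```

An empty list for a model means no deployment stats were reported for it, which suggests the model has not been deployed.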

---
---
# Elasticsearch index setup
---
---
Here we will configure an index template with settings and mappings to store our vectors and text data.

The **important** part here is setting our vector field to the `dense_vector` type with `index` enabled. This tells Elasticsearch to build the HNSW graph for the vectors so we can use kNN search later.
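
HNSW is an approximate index. For intuition about what it approximates, the sketch below runs an exact (brute-force) nearest-neighbor search using cosine similarity in plain Python. It is not how Elasticsearch executes the search: `cosine_similarity`, `exact_knn`, and the tiny three-dimensional `toy_vectors` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: three made-up document vectors.
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
print(exact_knn([1.0, 0.1, 0.0], toy_vectors, k=2))
```

Comparing the query against every vector like this scales linearly with the number of documents, which is the cost an approximate index is meant to avoid.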

## Define the index template
We will have the following fields:

- `vectors` of type `dense_vector`
  - it is important to set `dims` to the number of dimensions that your model outputs
- `title` of type `text`
- `summary` of type `text`

We will have:
- 1 primary shard
- 0 replicas -> *note* in production you will want at least 1 replica

The template applies to new indices whose names match the pattern `jupyter-vector-demo*`.

In [None]:
index_patterns = "jupyter-vector-demo*"
settings= {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
mappings= {
        "properties": {
            "vectors": {
                "type": "dense_vector",
                "dims": 384,
                "index" : True,
                "similarity" : "cosine"
            },
            "title": {
                "type": "text"
            },
            "summary": {
                "type": "text"
            }
        }
    }
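
Because `dims` must match the model's output size, it can be convenient to build the mapping from a single parameter when you switch models. The function below is a sketch under that assumption. `build_vector_mappings` is a hypothetical helper, not part of the Elasticsearch client.

```python
def build_vector_mappings(dims, similarity="cosine"):
    """Return a mappings dict like the one above for a model with `dims` outputs."""
    if not isinstance(dims, int) or dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "properties": {
            "vectors": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": similarity,
            },
            "title": {"type": "text"},
            "summary": {"type": "text"},
        }
    }


# 384 matches the model used in this notebook.
print(build_vector_mappings(384)["properties"]["vectors"])
```

The result can be passed as `mappings` in place of the literal dict defined above.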

## Apply the template
Here we apply the template and give it the name `jupyter-vector-demo-template`. This is just the name of the template, used if we need to modify it later on.

In [None]:
es.indices.put_template(name="jupyter-vector-demo-template", 
                        index_patterns=index_patterns,
                        settings=settings,
                        mappings=mappings
                        )

---
---
# The Ingest Pipeline
---
---

An ingest pipeline has one or more processors and processes documents before they are written into an elasticsearch index. 

Each processor is designed to perform a specific task, such as parsing fields or enriching data.

The main processor for this pipeline is the `inference` processor. The inference processor sends a specified field to a supervised model and writes the output from the model to a new field along with the original fields in the document. 

To make it simpler to access the vector, we will copy it to a field named `vectors` and then remove the `ml` field tree, which is the default output location.
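
To make these steps concrete, the sketch below mimics them on a plain Python dict. It only illustrates the data flow and is not Elasticsearch code: `simulate_pipeline` is a hypothetical name, and `fake_embed` is a made-up stand-in for the deployed model that returns a short dummy vector.

```python
import copy


def fake_embed(text):
    """Hypothetical stand-in for the model: returns a tiny dummy vector."""
    return [float(len(text)), float(text.count(" "))]


def simulate_pipeline(doc):
    """Apply inference, set, and remove steps to a copy of the document."""
    doc = copy.deepcopy(doc)
    # 1. inference: the model output lands under ml.inference.predicted_value
    doc["ml"] = {"inference": {"predicted_value": fake_embed(doc["summary"])}}
    # 2. set: copy the vector to a top-level `vectors` field
    doc["vectors"] = doc["ml"]["inference"]["predicted_value"]
    # 3. remove: drop the `ml` field tree
    del doc["ml"]
    return doc


print(simulate_pipeline({"title": "Example title", "summary": "Example summary text"}))
```

The returned document keeps `title` and `summary`, gains `vectors`, and no longer has an `ml` field.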

## Configure the pipeline

In [None]:
pipeline_definition = {
    "description": "A pipeline for generating and storing vectors on ingest",
    "processors": [
        {
            # Run the deployed model on the `summary` field
            "inference": {
                "model_id": es_model_id,
                "field_map": {
                    "summary": "text_field"
                }
            }
        },
        {
            # Copy the model output to a top-level `vectors` field
            "set": {
                "field": "vectors",
                "copy_from": "ml.inference.predicted_value"
            }
        },
        {
            # Drop the default `ml` output tree
            "remove": {
                "field": "ml"
            }
        }
    ]
}



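## Create the pipeline

In [None]:
# put_pipeline creates the pipeline, or replaces it if it already exists.
# Request errors raise an exception, so we only check the acknowledgement.
resp = es.ingest.put_pipeline(id="jupyter-vector-demo-pipeline", body=pipeline_definition)
if resp["acknowledged"]:
    print("Pipeline created successfully")
else:
    print("Failed to create pipeline")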


## Verify the pipeline
Not required but nice to verify everything looks correct

In [None]:
pipeline = es.ingest.get_pipeline(id="jupyter-vector-demo-pipeline")
pipeline.body

---
---
# Ingest Docs and Generate Vectors
---
---

## Create sample documents
These aren't real blogs, just samples ChatGPT created for me :)

In [None]:
samples = [["The Power of Word Embeddings in NLP", "Word embeddings have revolutionized the field of NLP."  ],  
    ["An Introduction to Transformer Models", "Transformer models have taken NLP by storm."  ],  
    ["Fine-Tuning BERT for Text Classification", "Fine-tuning BERT can lead to state-of-the-art results in text classification."  ],  
    ["Why GPT-3 is a Game Changer for NLP", "GPT-3 has set a new standard for language models in NLP."  ],  
    ["Using ELMO for Sentiment Analysis", "ELMO can effectively capture contextual information for sentiment analysis."  ],  
    ["The Rise of Pre-Trained Models in NLP", "Pre-trained models have become increasingly popular in NLP."  ]
]

## Create the list of docs to ingest

In [None]:
docs = [
    {   "_index": "jupyter-vector-demo",
        "_source": {
           "title": sample[0], 
           "summary": sample[1]
        }
    }
    for sample in samples
]
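
For larger data sets, `helpers.bulk` also accepts any iterable of actions, so a generator can be used to avoid holding every action in memory at once. The sketch below is an alternative to the list comprehension above. `generate_actions` and `demo_samples` are hypothetical names used only for illustration.

```python
def generate_actions(samples, index_name):
    """Yield one bulk action per (title, summary) pair."""
    for title, summary in samples:
        yield {
            "_index": index_name,
            "_source": {"title": title, "summary": summary},
        }


# Made-up sample rows in the same [title, summary] shape used above.
demo_samples = [
    ["A title", "A summary."],
    ["Another title", "Another summary."],
]
for action in generate_actions(demo_samples, "jupyter-vector-demo"):
    print(action)
```

The generator can be passed to `helpers.bulk` in place of the `docs` list.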

## Index the docs 
This will send a bulk index request to Elasticsearch, sending all the docs through the ingest pipeline, generating vectors, and storing them in the index.

In [None]:
helpers.bulk(es, docs, pipeline="jupyter-vector-demo-pipeline" )

## Verify one of the docs 
Let's take a look at one doc and see how it was indexed

In [None]:
result = es.search(index='jupyter-vector-demo', size=1)
result.body['hits']['hits'][0]['_source']
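
A quick automated check is to confirm that the stored vector has the number of dimensions declared in the mapping and that the `ml` field tree was removed. The helper below is a sketch that operates on a `_source` dict. `check_indexed_doc` is a hypothetical name, and the sample document uses a shortened dummy vector rather than real model output.

```python
def check_indexed_doc(source, expected_dims):
    """Return a list of problems found in an indexed document's _source."""
    problems = []
    vector = source.get("vectors")
    if vector is None:
        problems.append("missing 'vectors' field")
    elif len(vector) != expected_dims:
        problems.append(f"expected {expected_dims} dims, found {len(vector)}")
    if "ml" in source:
        problems.append("'ml' field was not removed")
    return problems


# Dummy document with a 3-dimensional vector, for illustration only.
sample_source = {
    "title": "A title",
    "summary": "A summary.",
    "vectors": [0.1, 0.2, 0.3],
}
print(check_indexed_doc(sample_source, expected_dims=3))
```

An empty list means the document looks as expected. For the index in this notebook, `expected_dims` would be 384.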

---
---
# Approximate k-Nearest Neighbor Search (kNN)
---
---



## Generate Vector for Query

Before we can run an approximate k-nearest neighbor (kNN) query, we need to convert our query string to a vector.

First, we set a sample query doc. Depending on your specific model, you may need to change the field name from `text_field`.

In [None]:
docs =  [
    {
      "text_field": "State of the art nlp models"
    }
  ]

We call the `_infer` endpoint, supplying the `model_id` and the docs we want to vectorize.

In [None]:
vec = MlClient.infer_trained_model(es, model_id=es_model_id, docs=docs)

The vector for the first doc can be accessed in the response dict as shown below

In [None]:
doc_0_vector = vec['inference_results'][0]['predicted_value']
doc_0_vector

## Run the Search

We will call the `_search` API and specify the `knn` section.

This is a simple example of a search query to get started. Elastic supports combining kNN search with "traditional" BM25 search. You can also filter documents to reduce the number of docs that need to be searched. See the [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search-api.html) for more information.

### Create the search parameters
Here we specify only the `knn` section, but you can also pass any of the other search parameters.

In [None]:
knn = {
    "field": "vectors",
    "query_vector": doc_0_vector,
    "k": 2,
    "num_candidates": 10
  }

Send in the search request

In [None]:
results = es.search(index='jupyter-vector-demo', knn=knn, size=1)


### View the results

In [None]:
results.body
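
As mentioned earlier in this section, a kNN clause can be combined with a BM25 query and with a filter. The sketch below only builds the keyword arguments and does not send a request. `build_hybrid_search` is a hypothetical helper, and the `match` query on `summary` and the example filter on `title` are illustrative choices. It assumes a cluster and client version in which the `knn` option of the search API accepts a `filter` and can be sent together with `query`.

```python
def build_hybrid_search(query_text, query_vector, k=2, num_candidates=10, knn_filter=None):
    """Build search keyword arguments that combine a BM25 match with kNN."""
    knn = {
        "field": "vectors",
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }
    if knn_filter is not None:
        # Restrict which documents are considered by the kNN search.
        knn["filter"] = knn_filter
    return {
        "query": {"match": {"summary": query_text}},
        "knn": knn,
    }


# Dummy 3-dimensional vector; a real query vector comes from the model.
params = build_hybrid_search(
    "nlp models",
    [0.1, 0.2, 0.3],
    knn_filter={"match": {"title": "NLP"}},
)
print(params)

# With a live connection, the parameters could be passed like this:
# results = es.search(index="jupyter-vector-demo", size=1, **params)
```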

### Search Query without returning vectors

Often when running kNN search, you don't need to return the vectors themselves; you only want the fields to display to the end user.

We are adding a new parameter, `fields`, which lets us list the specific fields to return rather than all of them.

Setting `source` to `False` (`_source: false` in the request body) saves returning the entire source payload in the response.

We also move the `size` value here simply to keep the parameters together.

In [None]:
knn = {
    "field": "vectors",
    "query_vector": doc_0_vector,
    "k": 2,
    "num_candidates": 10
  }
fields = ["summary", "title"]
size = 1
source = False

In [None]:
results = es.search(index='jupyter-vector-demo', 
                    knn=knn, 
                    source=source, 
                    fields=fields, 
                    size=size
                  )


### View the results

In [None]:
results.body
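
With `_source` disabled, each hit carries a `fields` object whose values are lists. The helper below is a sketch for flattening such a response into rows for display. `summarize_hits` is a hypothetical name, and `sample_response` is a hand-written stand-in with a made-up score and ID, not real output from the cluster.

```python
def summarize_hits(response_body):
    """Flatten search hits into dicts with score, title, and summary."""
    rows = []
    for hit in response_body.get("hits", {}).get("hits", []):
        fields = hit.get("fields", {})
        rows.append(
            {
                "score": hit.get("_score"),
                # Values under `fields` are lists, so take the first entry.
                "title": fields.get("title", [None])[0],
                "summary": fields.get("summary", [None])[0],
            }
        )
    return rows


# Hand-written sample response; the score and ID are made up.
sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "jupyter-vector-demo",
                "_id": "example-id",
                "_score": 0.9,
                "fields": {
                    "title": ["Why GPT-3 is a Game Changer for NLP"],
                    "summary": ["GPT-3 has set a new standard for language models in NLP."],
                },
            }
        ]
    }
}
for row in summarize_hits(sample_response):
    print(row)
```

With a live connection, the same helper could be applied to `results.body`.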