### Embedding our documents

As in the previous notebook, we can run another external ML model, this time one that produces text embeddings, to enrich our indexed data even further.

In [None]:
import eland as ed
import pandas as pd
from elasticsearch import Elasticsearch
import configparser

config = configparser.ConfigParser()
config.read('foobar.ini')
cloud_id = config["cloud-connection"]["cloud_id"]
user = config["cloud-connection"]["user"] # by default user = "elastic"
password = config["cloud-connection"]["password"]

client = Elasticsearch(
    cloud_id=cloud_id,  # cloud id can be found under deployment management
    basic_auth=(user, password)  # your username and password for connecting to Elastic, found under Deployments - Security
)

client.info()

In [31]:
index = "hp_scripts_enriched"
response = client.search(index = index, size=5000)
docs = []
for line in response["hits"]["hits"]:
    docs.append({"text_field" : line["_source"]["Sentence"]})

### In your terminal, run the following commands

Pull the official Docker image, add the necessary variables to your environment, and import the model that will produce the embeddings.

In [None]:
docker pull docker.elastic.co/eland/eland:8.9.0

In [None]:
export CLOUD_ID=change_me
export USER=change_me
export PASSWORD=change_me

In [None]:
docker run -it --rm docker.elastic.co/eland/eland:8.9.0 \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u $USER -p $PASSWORD \
      --hub-model-id sentence-transformers/msmarco-MiniLM-L-12-v3 \
      --task-type text_embedding \
      --start 

Once the model has been deployed, you can make inference calls to it, either through the Dev Console or with our Python client.

In [None]:
POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3/_infer
{
  "docs": [
    {
      "text_field": "I shouldve known that you would be here Professor McGonagall"
    }
  ]
}

In [29]:
from elasticsearch.client import MlClient
models = client.ml.get_trained_models()
for model in models["trained_model_configs"]:
    print(model["model_id"])

.elser_model_1
distilbert-base-uncased-finetuned-sst-2-english
lang_ident_model_1
sentence-transformers__msmarco-minilm-l-12-v3


In [None]:
model_id = "sentence-transformers__msmarco-minilm-l-12-v3"

doc_test = {"text_field": "I shouldve known that you would be here Professor McGonagall"}
result = client.ml.infer_trained_model(model_id=model_id, docs=[doc_test])

result["inference_results"]

Now that we've seen the inference is working, we can create a new pipeline that will apply the model to our entire index, so we can then use the embeddings in our search.

The cell below uses the Python client; the same request bodies can also be sent from the Dev Console.

In [None]:
# create an ingest pipeline that runs the embedding model on the "Sentence" field
client.ingest.put_pipeline(
    id="embeddings",
    processors=[
        {
            "inference": {
                "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
                "target_field": "text_embedding",
                "field_map": {
                    "Sentence": "text_field"
                }
            }
        }
    ]
)

# create the new target index; this step can be skipped if you do not want to pre-define custom mappings
index = "hp_scripts_final"
mappings = {
    "properties": {
        "text_embedding.predicted_value": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine"
        },
        "Character": {
            "type": "text"
        },
        "Line_number": {
            "type": "long"
        },
        "Sentence": {
            "type": "text"
        },
        "sentiment": {
            "properties": {
                "model_id": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "predicted_value": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "prediction_probability": {
                    "type": "float"
                }
            }
        }
    }
}


client.indices.create(index=index, mappings=mappings)

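# copy every document through the "embeddings" pipeline into the new index;
# wait_for_completion=False returns a task id immediately instead of blocking
client.reindex(
    source={"index": "hp_scripts_enriched"},
    dest={"index": "hp_scripts_final", "pipeline": "embeddings"},
    wait_for_completion=False
)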
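
Because the reindex is started with `wait_for_completion = False`, it keeps running in the background after the call returns. The helper below is a minimal sketch of how one might wait for it to finish. It assumes an elasticsearch-py 8.x client and that you kept the response of the reindex call, whose `task` field holds the task id. The name `wait_for_task` and its default intervals are made up for this example.

```python
import time


def wait_for_task(client, task_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll the tasks API until the given task reports completion.

    Hypothetical helper: only `client.tasks.get(task_id=...)` is used,
    and the returned status is expected to carry a boolean "completed".
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.tasks.get(task_id=task_id)
        if status["completed"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout_seconds}s")
        time.sleep(poll_seconds)
```

With the reindex response stored in a variable such as `reindex_response`, calling `wait_for_task(client, reindex_response["task"])` before the next search avoids querying a half-filled index.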

In [None]:
# the reindex above runs asynchronously, so give it time to finish before searching
index = 'hp_scripts_final'
response = client.search(index=index)

line = response["hits"]["hits"][0]
print(line['_source']["Sentence"])
print(line['_source']["text_embedding"]["predicted_value"][0:5])

### Semantic Search

We can now search through our text by embedding our queries with the same model and using the newly built vectors as part of our search criteria.
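
To make the scores in the results easier to read, here is a small sketch of how a kNN score relates to cosine similarity. The index mapping uses `"similarity": "cosine"`, and for that setting Elasticsearch documents the score as `(1 + cosine) / 2`, so it ranges from 0 to 1. The function names and the three-dimensional toy vectors are invented for illustration; the real model emits 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_cosine_score(query_vector, doc_vector):
    """Score for `cosine` similarity as documented by Elasticsearch: (1 + cosine) / 2."""
    return (1.0 + cosine_similarity(query_vector, doc_vector)) / 2.0


# toy vectors, made up for this example
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]
opposite = [-1.0, 0.0, 0.0]

print(knn_cosine_score(query, same_direction))  # 1.0
print(knn_cosine_score(query, orthogonal))      # 0.5
print(knn_cosine_score(query, opposite))        # 0.0
```

A score of 1.0 therefore means the query and the document point in exactly the same direction, which is what happens when a line consists of nothing but the query text.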

### Let's start with some basic associations

In [33]:
question1 = {"text_field" : "magic"}
question2 = {"text_field" : "wand"}
question3 = {"text_field" : "harry potter"}
question4 = {"text_field" : "dumbledore"}
question_list = [question1, question2, question3, question4]

In [34]:
def semantic_search(question_list):
    """Embed each question with the deployed model and return its 5 nearest lines."""
    answer_list = []
    index = "hp_scripts_final"

    for question in question_list:
        # embed the question with the same model that embedded the documents
        result = client.ml.infer_trained_model(model_id=model_id, docs=[question])
        query_vector = result["inference_results"][0]["predicted_value"]

        # approximate kNN search against the stored sentence embeddings
        query = {
            "field": "text_embedding.predicted_value",
            "query_vector": query_vector,
            "k": 5,
            "num_candidates": 100
        }

        result = client.search(index=index, knn=query, source=["Sentence", "Character"])

        answer = ["prompt: " + question["text_field"]]
        for element in result["hits"]["hits"]:
            answer.append("{}: {}, score {}".format(element["_source"]["Character"], element["_source"]["Sentence"], element["_score"]))

        answer_list.append(answer)
    return answer_list

In [35]:
semantic_search(question_list)

[['prompt: magic',
  'Vernon: Theres no such thing as magic, score 0.8299141',
  'Hermione: Oh are you doing magic, score 0.8199536',
  'GILDEROY LOCKHART: This is just like magic, score 0.8000889',
  'Hagrid: Strickly speaking Im not allowed to do magic, score 0.77473605',
  'HARRY: Im not allowed to use magic outside of school, score 0.7666023'],
 ['prompt: wand',
  'GILDEROY LOCKHART: Wands at the ready, score 0.8621737',
  'Harry: I still need a wand, score 0.8611474',
  'Harry: And who owned that wand, score 0.8589636',
  'RON: My wand Look at my wand, score 0.843747',
  'LUPIN: Wand at the ready Ron, score 0.832528'],
 ['prompt: harry potter',
  'Whispers: Harry Potter, score 1.0',
  'McGonagall: Harry Potter, score 1.0',
  'McGonagall: Harry Potter, score 1.0',
  'VERNON: Harry Potter, score 1.0',
  'PHOTOGRAPHER: Harry Potter, score 1.0'],
 ['prompt: dumbledore',
  'Hermione: Dumbledore, score 1.0',
  'HARRY: Dumbledore, score 1.0',
  'LUCIUS MALFOY: Dumbledore, score 1.0',
  '

So far this behaves much like a "normal" keyword search. The name queries are the weakest: they simply return lines that consist of the name itself.
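
One way to make such result lists more informative is to collapse lines with identical text before displaying them. The helper below is only a sketch: `unique_hits` is a hypothetical name, and it assumes hits shaped like `response["hits"]["hits"]` from the search above, already sorted by score.

```python
def unique_hits(hits, limit=5):
    """Keep the first (best-scoring) hit for each distinct sentence.

    Sentences are compared case-insensitively with surrounding
    whitespace stripped; `hits` is assumed to be sorted best first.
    """
    seen = set()
    unique = []
    for hit in hits:
        key = hit["_source"]["Sentence"].strip().lower()
        if key in seen:
            continue  # same text already kept from a better-ranked hit
        seen.add(key)
        unique.append(hit)
        if len(unique) == limit:
            break
    return unique
```

To still end up with five distinct lines, one would request a larger `k` in the kNN query (for example 20, an arbitrary choice) and pass the hits through `unique_hits` afterwards.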

Let's try some more complex searches.

In [36]:
question1 = {"text_field" : "what do they fear"}
question2 = {"text_field" : "bad decisions"}
question3 = {"text_field" : "breaking rules"}
question4 = {"text_field" : "when are you brave"}
question_list = [question1, question2, question3, question4]

In [37]:
semantic_search(question_list)

[['prompt: what do they fear',
  'LUPIN: That suggests what you fear the most is fear itself, score 0.84556496',
  'Hagrid: Anything you couldnt explain when you were angry or scared, score 0.7971493',
  'LUPIN: Concentrate Face your fear, score 0.7910902',
  'MR WEASLEY: You are in danger, score 0.7876258',
  'MCGONAGALL: Our worst fear has been realized, score 0.786567'],
 ['prompt: bad decisions',
  'Ron: Its bad, score 0.772177',
  'Harry: But Hagrid there must be a mistake, score 0.7552644',
  'Harry: No youve made a mistake, score 0.7539615',
  'RON: Hermione Bad idea Bad idea, score 0.7443874',
  'Dumbledore: The good and the bad, score 0.7368764'],
 ['prompt: breaking rules',
  'HERMIONE: Not to mention wed be breaking about 50 school rules, score 0.75131655',
  'DUMBLEDORE: You both realize of course that in the past few hours you have broken perhaps a dozen school rules, score 0.7429861',
  'Vernon: You are breaking and entering, score 0.7344173',
  'McGonagall: Any rulebreak

Finally, some nicer results!
* Fear is associated with related emotions, especially being scared or something being scary.
* Bad decisions imply a mistake.
* There is some grammatical flexibility, such as broken / breaking / rulebreaking, along with the associated exceptions.
* Bravery is linked to courage and daring.

Now let's use our knowledge of the books to steer the search, for example by getting the model to find a particular line we know is in there.

Can you guess these?

In [None]:
question1 = {"text_field" : "who is the smartest"}
question2 = {"text_field" : "jailbreak"}
question3 = {"text_field" : "room that is hidden"}
question4 = {"text_field" : "evil person"}
question_list = [question1, question2, question3, question4]

In [None]:
semantic_search(question_list)