## Welcome to Eland
### You know, for data science.

We've explored the main Python client for Elastic in the [previous notebook](/5.%20Python%20Wrapper.ipynb). This allows us to more easily make calls to the Elastic engine via the API directly from our notebooks or projects rather than the Elastic console or another HTTP request tool.

However, when we start exploring more advanced concepts and features, we can also leverage Eland, the Elasticsearch Python client for data science. It has been designed to make data exploration and analysis easier, and it lets us bring in compatible ML models and use them for inference in Elasticsearch.


See the [Eland docs here](https://eland.readthedocs.io/en/latest/index.html).
Check out this awesome blog series [about using NLP in Elastic here](https://www.elastic.co/blog/how-to-deploy-nlp-named-entity-recognition-ner-example).

You can install 'eland' via pip:
```
pip install eland 
```

First we will connect to Elasticsearch again with the Python client, as we did before. Then we can load data from Elasticsearch and view it in the native Eland DataFrame. See a comparison between this and a pandas DataFrame [here](https://eland.readthedocs.io/en/v8.9.0/examples/demo_notebook.html).

### Connecting to both Elasticsearch Python Clients - Getting Data

In [None]:
import eland as ed
import pandas as pd
from elasticsearch import Elasticsearch
from getpass import getpass  # For securely getting user input

# Prompt the user to enter their Elastic Cloud ID and API Key securely
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # cloud id can be found under deployment management
    api_key=ELASTIC_API_KEY # API keys can be generated under management / security
)

index = "hp_scripts"

ed_hp_script = ed.DataFrame(client, es_index_pattern=index)

We create a list of dictionaries that fits the required input for the NLP models we will want to explore.

In [None]:
response = client.search(index = index, size=5000)
docs = []
for line in response["hits"]["hits"]:
    docs.append({"text_field" : line["_source"]["Sentence"]})

len(docs)
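
The loop above assumes every hit has a `Sentence` field. As a minimal sketch, the same conversion can be wrapped in a helper that skips hits where the field is missing or empty. The function name `hits_to_docs` and the sample hits are hypothetical and only there to make the example runnable offline.

```python
from typing import Any, Dict, Iterable, List


def hits_to_docs(
    hits: Iterable[Dict[str, Any]],
    source_field: str = "Sentence",
    model_field: str = "text_field",
) -> List[Dict[str, str]]:
    """Convert search hits into the [{"text_field": ...}] shape used above."""
    docs = []
    for hit in hits:
        text = hit.get("_source", {}).get(source_field)
        # Skip hits where the field is missing or blank instead of raising KeyError.
        if isinstance(text, str) and text.strip():
            docs.append({model_field: text})
    return docs


# Illustrative hits only; real ones come from response["hits"]["hits"].
sample_hits = [
    {"_source": {"Sentence": "Example line one."}},
    {"_source": {"Character": "Narrator"}},  # no Sentence field, so it is skipped
    {"_source": {"Sentence": "Example line two."}},
]
sample_docs = hits_to_docs(sample_hits)
print(sample_docs)
```

With the real response this would be called as `docs = hits_to_docs(response["hits"]["hits"])`. A single search is capped by the index's result window (10,000 hits by default), so for larger indices `elasticsearch.helpers.scan(client, index=index)` is one way to iterate over all hits.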

We can now see the data we've indexed in the previous notebook in our Eland Dataframe. 

Next up, let's import some models to play with on this data.

Eland can serialize trained models from the scikit-learn, XGBoost, and LightGBM libraries and import them into Elasticsearch as inference models.



### Building the Docker Image for Eland


For NLP use cases we can import [models from the Hugging Face model hub](https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english)

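The import is done with Eland's `eland_import_hub_model` command-line tool. Running it from a Docker image avoids installing Eland and its machine learning dependencies locally.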

The minimum dedicated ML node size for deploying and using the natural language processing models is 16 GB in Elasticsearch Service.

In [None]:
git clone git@github.com:elastic/eland.git
cd eland
# The trailing "." is the build context (the cloned repository)
docker build -t elastic/eland .

Alternatively, you can now pull the official Docker image:

In [None]:
docker pull docker.elastic.co/eland/eland:8.9.0

In [None]:
docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u $USER -p $PASSWORD \
      --hub-model-id distilbert-base-uncased-finetuned-sst-2-english \
      --task-type text_classification \
      --start 

We can now make calls that use the deployed model using this structure:

https://www.elastic.co/guide/en/elasticsearch/reference/current/get-trained-models.html?#ml-get-trained-models-request 

As a quick example, this is what the request looks like when sent directly through the API.
From now on, we will continue to use the Python client instead.

In [None]:
POST _ml/trained_models/distilbert-base-uncased-finetuned-sst-2-english/_infer
{
  "docs": [
    {
      "text_field": "The movie was awesome!"
    }
  ]
}

### Working with ML Models in Python

In [None]:
from elasticsearch.client import MlClient

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

models = MlClient.get_trained_models(client, model_id=model_id)
models.body

In [None]:
stats = MlClient.get_trained_models_stats(client, model_id=model_id)
stats.body['trained_model_stats'][0]['deployment_stats']['nodes'][0]['routing_state']

In [None]:
response = MlClient.infer_trained_model(client, model_id=model_id, docs=docs[0:10], timeout=None)
#queue capacity defaults to 1024 so bulk requests should be split

response["inference_results"]
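
The comment above notes that bulk requests should be split. A minimal sketch of that splitting, using a hypothetical `chunked` helper and placeholder documents (the batch size of 10 is arbitrary):

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Placeholder documents in the same shape as `docs` above.
example_docs = [{"text_field": f"sentence {i}"} for i in range(25)]
batch_sizes = [len(batch) for batch in chunked(example_docs, 10)]
print(batch_sizes)  # [10, 10, 5]
```

Each batch could then be passed as `docs=batch` to the inference call above, extending one list with `response["inference_results"]` after every request.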

In [None]:
doc_test = {"text_field": "I shouldve known that you would be here Professor McGonagall"}
# docs must be a list of documents, even when inferring on a single one
result = MlClient.infer_trained_model(client, model_id=model_id, docs=[doc_test])

result["inference_results"]
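
To read the results next to the input text, the two lists can be zipped together. This is a sketch under two assumptions: that each entry in `inference_results` carries `predicted_value` and `prediction_probability` (the same field names the enriched index uses later on), and that results come back in the same order as the submitted documents. The helper name and the sample values are hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

Row = Tuple[str, Optional[str], Optional[float]]


def label_results(
    docs: Sequence[Dict[str, str]],
    results: Sequence[Dict[str, Any]],
) -> List[Row]:
    """Pair each input sentence with its predicted label and probability."""
    rows = []
    for doc, result in zip(docs, results):
        rows.append((
            doc["text_field"],
            result.get("predicted_value"),         # assumed field name
            result.get("prediction_probability"),  # assumed field name
        ))
    return rows


# Made-up values for illustration; these are not real model output.
sample_docs = [{"text_field": "Example line one."}, {"text_field": "Example line two."}]
sample_results = [
    {"predicted_value": "POSITIVE", "prediction_probability": 0.9},
    {"predicted_value": "NEGATIVE", "prediction_probability": 0.8},
]
rows = label_results(sample_docs, sample_results)
for text, label, probability in rows:
    print(text, label, probability)
```

With the real response this would be `label_results(docs[0:10], response["inference_results"])`.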

Everything looks to be running smoothly! 

We can now apply the model to our entire index by using an Elasticsearch ingest pipeline.

In the pipeline you specify which transformations or enrichments to apply to every document that passes through it.

### Creating an enrichment pipeline

In [None]:
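# Create an ingest pipeline that runs the sentiment model on each document
client.ingest.put_pipeline(
    id="sentiment",
    processors=[
        {
            "inference": {
                "model_id": "distilbert-base-uncased-finetuned-sst-2-english",
                "target_field": "sentiment",
                # Map the document field (left) to the model's input field (right)
                "field_map": {
                    "Sentence": "text_field"
                }
            }
        }
    ]
)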

#creating the new target index, this step can be skipped if you do not want to pre-define custom mappings
index = "hp_scripts_enriched"
mappings = {
    "properties": {
      "Character": {
          "type": "text"
      },
      "Line_number": {
        "type": "long"
      },
      "Sentence": {
        "type": "text"
      },
      "sentiment.predicted_value": {
          "type": "text",
          "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
            }
          }
      },
      "sentiment.prediction_probability": {
          "type": "float"
      }    
    }
  }

client.indices.create(index=index, mappings=mappings)
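
The direction of `field_map` in the pipeline above is easy to get backwards: the key is the field in our documents and the value is the input field the model expects. A small sketch that builds the same processor list from parameters makes this explicit. The helper name `inference_processors` is hypothetical.

```python
from typing import Any, Dict, List


def inference_processors(
    model_id: str,
    source_field: str,
    target_field: str,
    model_input_field: str = "text_field",
) -> List[Dict[str, Any]]:
    """Build a one-processor list that feeds `source_field` to the model."""
    return [
        {
            "inference": {
                "model_id": model_id,
                "target_field": target_field,
                # document field -> model input field
                "field_map": {source_field: model_input_field},
            }
        }
    ]


processors = inference_processors(
    "distilbert-base-uncased-finetuned-sst-2-english",
    source_field="Sentence",
    target_field="sentiment",
)
print(processors[0]["inference"]["field_map"])  # {'Sentence': 'text_field'}
```

The result could be passed as `processors=processors` to `client.ingest.put_pipeline`. Before reindexing, the pipeline can also be tried on a single document with `client.ingest.simulate(id="sentiment", docs=[{"_source": {"Sentence": "..."}}])`.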


In [None]:
client.reindex(
    source={"index": "hp_scripts"},
    dest={"index": "hp_scripts_enriched", "pipeline": "sentiment"},
    # Return immediately with a task ID instead of blocking until the reindex finishes
    wait_for_completion=False,
)

In [None]:
response = client.search(index = "hp_scripts_enriched")

print("We get back {total} results, here are the top ones:".format(total=response["hits"]['total']['value']))
for hit in response["hits"]["hits"]:
    print(hit['_source']["Sentence"], hit['_source']["sentiment"]["predicted_value"])

We can again add more arguments to our searches to make them more interesting.

For instance, the most negative lines in the first three movies would be:

In [None]:
query={
    "match" : {
      "sentiment.predicted_value": "NEGATIVE"
    }
  }

response = client.search(index = "hp_scripts_enriched",query=query, sort="sentiment.prediction_probability:desc")

print("The most negative sentences in the series:")
for hit in response["hits"]["hits"]:
    print(hit['_source']["Sentence"],  hit['_source']["sentiment"]["prediction_probability"] )

In [None]:
query={
    "match" : {
      "sentiment.predicted_value": "POSITIVE"
    }
  }

response = client.search(index = "hp_scripts_enriched",query=query, sort="sentiment.prediction_probability:desc")

print("The most positive sentences in the series:")
for hit in response["hits"]["hits"]:
    print(hit['_source']["Sentence"],  hit['_source']["sentiment"]["prediction_probability"] )
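
The two cells above differ only in the label they match. As a sketch, the shared search arguments can be built by one hypothetical helper, `sentiment_search`, and unpacked into `client.search`:

```python
from typing import Any, Dict


def sentiment_search(label: str, size: int = 10) -> Dict[str, Any]:
    """Build keyword arguments for client.search for one sentiment label."""
    return {
        "index": "hp_scripts_enriched",
        "query": {"match": {"sentiment.predicted_value": label}},
        # Highest-confidence predictions first
        "sort": "sentiment.prediction_probability:desc",
        "size": size,
    }


for label in ("NEGATIVE", "POSITIVE"):
    kwargs = sentiment_search(label, size=5)
    print(label, kwargs["query"])
```

With the client from earlier, `response = client.search(**sentiment_search("NEGATIVE"))` would be equivalent to the first query above, and the same printing loop applies.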