## Starting from Elasticsearch

The Elasticsearch client lets us dig deeper into the index we generated from the Hugging Face dataset and take advantage of more Elastic-native features.

In [1]:
from getpass import getpass  
from elasticsearch import Elasticsearch

# Prompt the user to enter their Elastic Cloud ID and API Key securely
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # cloud id can be found under deployment management
    api_key=ELASTIC_API_KEY,  # API key for connecting to Elastic, found under Deployments - Security
)

The index we built earlier is now on our Elastic cluster, so we can interact with it through the normal API calls.

One benefit of indexing your data is that you no longer need to load the dataset locally. Search queries are sent to the Elasticsearch cluster and run there, instead of the computation happening on your machine.

In [2]:
index_name="bluesky"
query={
        "match": {
            "text": "travelling destination"
        }
    }

response = client.search(index=index_name, query=query)

for hit in response["hits"]["hits"][0:5]:
    print(hit['_source']['text'])

#Armenia is an amazing destination, with incredible history and beautiful landscapes. As the only current nation on the world's oldest map, oldest winery site and the first #Christian nation, it's worth travelling to this inexpensive destination to discover places like this (Dadal's Bridge, 14th c)!
Bleurgh!!!! Why is it that travelling anywhere involves so much, well, travelling?
Destination reached 🫡
Destination:  Afternoon Nap.
Journey before destination.
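
The hit-unpacking loop above recurs throughout this notebook, so it can be wrapped in a small helper. This is a minimal sketch, assuming the usual `hits.hits[]._source` response shape shown above; `extract_texts` and `sample_response` are hypothetical names used for illustration and are not part of the Elasticsearch client.

```python
def extract_texts(response, field="text", limit=5):
    """Return up to `limit` values of `field` from a search response."""
    hits = response["hits"]["hits"][:limit]
    # Skip hits that lack the field instead of raising KeyError
    return [hit["_source"][field] for hit in hits if field in hit.get("_source", {})]


# Stand-in response with the same shape as the client.search(...) output
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"text": "Destination reached"}},
            {"_source": {"text": "Journey before destination."}},
            {"_source": {"author": "no text here"}},
        ]
    }
}

for text in extract_texts(sample_response):
    print(text)
```

Against a live cluster, the same helper could be called as `extract_texts(client.search(index=index_name, query=query, size=5))`. Passing `size=5` asks the cluster to return only five hits rather than slicing the default page locally.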


### Leveraging models

As for the model side, this can also be done on the Elastic side rather than through the Hugging Face Hub.

As with the index, we can use the storage and compute of our Elastic instance for the chosen models, then call them through either the Elastic client or compatible Hugging Face functions.

For example, we can take models registered on Hugging Face and deploy them in our Elastic cluster rather than running them locally as in the previous phase.
We will once again set up [the sentiment text classifier](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment).

In [5]:
!docker pull docker.elastic.co/eland/eland

!docker run -it --rm docker.elastic.co/eland/eland \
    eland_import_hub_model \
      --cloud-id $ELASTIC_CLOUD_ID \
      --es-api-key $ELASTIC_API_KEY \
      --hub-model-id cardiffnlp/twitter-roberta-base-sentiment \
      --task-type text_classification \
      --clear-previous \
      --start

Using default tag: latest
latest: Pulling from eland/eland
Digest: sha256:64adfcc89f42a63641dab2d9b459fd8a6196e73b51ad69d51d9eef8f3fe612da
Status: Image is up to date for docker.elastic.co/eland/eland:latest
docker.elastic.co/eland/eland:latest
2024-12-04 12:30:02,457 INFO : Establishing connection to Elasticsearch
2024-12-04 12:30:02,516 INFO : Connected to cluster named 'fdcc4e10e5a34385884a3eda9350099a' (version: 8.15.2)
2024-12-04 12:30:02,517 INFO : Loading HuggingFace transformer tokenizer and model 'cardiffnlp/twitter-roberta-base-sentiment'
Downloading config.json: 100%|█████████████████| 747/747 [00:00<00:00, 57.5kB/s]
Downloading vocab.json: 100%|████████████████| 899k/899k [00:00<00:00, 3.34MB/s]
Downloading merges.txt: 100%|████████████████| 456k/456k [00:00<00:00, 2.40MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████| 150/

This model will now show up under `Trained Models` in your Elastic Cloud interface.

We can now call this model and run inference tasks on it. Here's an example to get us started:

In [3]:
model_id = "cardiffnlp__twitter-roberta-base-sentiment"
models = client.ml.get_trained_models(model_id=model_id)
models.body

{'count': 1,
 'trained_model_configs': [{'model_id': 'cardiffnlp__twitter-roberta-base-sentiment',
   'model_type': 'pytorch',
   'created_by': 'api_user',
   'version': '12.0.0',
   'create_time': 1733315468654,
   'model_size_bytes': 0,
   'estimated_operations': 0,
   'license_level': 'platinum',
   'description': "Model cardiffnlp/twitter-roberta-base-sentiment for task type 'text_classification'",
   'tags': [],
   'input': {'field_names': ['text_field']},
   'inference_config': {'text_classification': {'vocabulary': {'index': '.ml-inference-native-000002'},
     'tokenization': {'roberta': {'do_lower_case': False,
       'with_special_tokens': True,
       'max_sequence_length': 512,
       'truncate': 'first',
       'span': -1,
       'add_prefix_space': False}},
     'classification_labels': ['LABEL_0', 'LABEL_1', 'LABEL_2'],
     'num_top_classes': 0}},
   'location': {'index': {'name': '.ml-inference-native-000002'}}}]}

In [10]:
def process_label(result):
    label = result[0]["predicted_value"]
    if label == "LABEL_1":
        label = "neutral"
    elif label == "LABEL_2":
        label = "positive"
    elif label == "LABEL_0":
        label = "negative"
    return label

In [11]:
doc_test = {'text_field': 'I love you'}

result = client.ml.infer_trained_model(model_id=model_id, docs=[doc_test])
print(process_label(result["inference_results"]))

positive
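
The label lookup can also be written as a dictionary, which makes it easy to return the model's confidence alongside the readable label. This is a sketch only: `LABEL_NAMES` and `describe_prediction` are hypothetical names, the label meanings follow `process_label` above, and the probability in the sample is illustrative rather than real model output.

```python
# Label names follow the mapping used in process_label above
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def describe_prediction(inference_results):
    """Return (readable_label, probability) for the first inference result."""
    first = inference_results[0]
    raw_label = first["predicted_value"]
    # Fall back to the raw label if the model returns something unexpected
    label = LABEL_NAMES.get(raw_label, raw_label)
    # prediction_probability may be absent, so default to None
    return label, first.get("prediction_probability")


# Stand-in for result["inference_results"]; the probability is made up
sample_results = [{"predicted_value": "LABEL_2", "prediction_probability": 0.98}]
print(describe_prediction(sample_results))
```

With a live cluster, `describe_prediction(result["inference_results"])` would take the place of the sample call.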


We can apply the same principle with an Elastic ingest pipeline that runs inference on each of the saved Bluesky posts in the original index and adds the predicted sentiment as a new field.

In [18]:
client.ingest.put_pipeline(
    id="pipeline_sentiment",
    processors=[
        {
            "inference": {
                "model_id": model_id,
                "field_map": {"text": "text_field"}, 
                "target_field": "sentiment",  
            }
        }
    ],
)

ObjectApiResponse({'acknowledged': True})
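
Before reindexing everything, it may be worth trying the pipeline on a few sample texts with the ingest simulate API, which runs the processors without writing to any index. The sketch below assumes the pipeline id `pipeline_sentiment` and the `sentiment` target field defined above; `parse_simulated_docs` and `preview_sentiment` are hypothetical helper names.

```python
def parse_simulated_docs(simulate_response, target_field="sentiment"):
    """Extract (text, predicted_value) pairs from a simulate response."""
    pairs = []
    for item in simulate_response["docs"]:
        # Assumption: documents that failed in the pipeline carry an
        # "error" entry instead of "doc", so they are skipped here
        if "doc" not in item:
            continue
        source = item["doc"]["_source"]
        pairs.append((source["text"], source[target_field]["predicted_value"]))
    return pairs


def preview_sentiment(client, texts, pipeline_id="pipeline_sentiment"):
    """Run sample texts through the pipeline without indexing them."""
    docs = [{"_source": {"text": text}} for text in texts]
    response = client.ingest.simulate(id=pipeline_id, docs=docs)
    return parse_simulated_docs(response)
```

For example, `preview_sentiment(client, ["I love you"])` should return the text paired with the raw label predicted by the model.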

In [12]:
mappings = {
    "properties" : {
        "text" : {
            "type" : "text"
        },
        # sentiment results field
        "sentiment": {
          "properties": {
            "model_id": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                  }
              }
            },
            "predicted_value": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "prediction_probability": {
              "type": "float"
            }
          }
        }
    }
}

In [19]:
# Create the index (deleting any existing index)
client.indices.delete(index="bluesky_sentiment", ignore_unavailable=True)
client.indices.create(index="bluesky_sentiment", mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'bluesky_sentiment'})

In [20]:
# Create the new index with enriched data
client.reindex(
    source={"index": "bluesky"},
    dest={"index": "bluesky_sentiment", "pipeline": "pipeline_sentiment"},
    wait_for_completion=False,
)

ObjectApiResponse({'task': '3ym8dc4vSe68mqqnBZfPgg:132218057'})

This task runs in the background as it goes through all the Bluesky posts to analyze their sentiment. You can check on its progress from the Elastic dev console using:

```
GET _tasks/3ym8dc4vSe68mqqnBZfPgg:132218057

```
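
The same check can be made from Python through the tasks API. The sketch below estimates progress by comparing the sum of the `created`, `updated`, and `deleted` counters with `total` in the task status; `reindex_progress` is a hypothetical helper, and the sample response is a hand-written stand-in with made-up numbers.

```python
def reindex_progress(task_response):
    """Return (completed, fraction_done) from a tasks.get response body."""
    status = task_response["task"]["status"]
    # Operations handled so far: created + updated + deleted documents
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    total = status.get("total", 0)
    # Guard against division by zero before the task reports a total
    fraction = done / total if total else 0.0
    return task_response.get("completed", False), fraction


# Stand-in for client.tasks.get(task_id=...).body; the numbers are made up
sample_task = {
    "completed": False,
    "task": {"status": {"total": 200, "created": 50, "updated": 0, "deleted": 0}},
}

completed, fraction = reindex_progress(sample_task)
print(f"completed={completed}, progress={fraction:.0%}")
```

With a live cluster, the call would look like `reindex_progress(client.tasks.get(task_id="3ym8dc4vSe68mqqnBZfPgg:132218057").body)`, using the task id returned by the reindex request.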

### Running some searches w/ sentiment

Once this has finished, every document in the new `bluesky_sentiment` index will have the sentiment field.

We can start building Elasticsearch queries to sort through our data based on filters, matches, and scores.

In [None]:
{
    "match" : {
        "sentiment.predicted_value" : "LABEL_2"
    }
}

In [29]:
text = "travelling destination"

query={
    "bool":
    {
      "must": [
      {
        "match" : {
            "text" : text
        }
      },
      {
            "match" : {
                "sentiment.predicted_value" : "LABEL_0"
            }
        }
      ]
    }
}

response = client.search(index = "bluesky_sentiment",query=query, size=5)


print("We get back {total} results".format(total=response["hits"]['total']['value']))


print("Some of the *negative* results on " + text +" so far:")
print()
for hit in response["hits"]["hits"]:
    print(hit['_source']["text"] )
    print()


We get back 76 results
Some of the *negative* results on travelling destination so far:

Bleurgh!!!! Why is it that travelling anywhere involves so much, well, travelling?

i am TRAVELLING not DRIVING

''Always the journey, never the destination.''
#EliteDangerous

So you believe everyone travelling through an airport should be strip searched?

the pain of living in the middle of nowhere event-wise and hating travelling



In [31]:
query={
    "bool":
    {
      "must": [
      {
        "match" : {
            "text" : text
        }
      },
      {
            "match" : {
                "sentiment.predicted_value" : "LABEL_2"
            }
        }
      ]
    }
}
response = client.search(index = "bluesky_sentiment",query=query, size=5,  sort="sentiment.prediction_probability:desc")
print("The *most positive* results on " + text +" so far:")
print()
for hit in response["hits"]["hits"]:
    print(hit['_source']["text"])
    print()

The *most positive* results on travelling destination so far:

Had the most amazing adventure travelling in Thailand. So much to learn about and appreciate. And, wow, November is a great time for us Brits to visit 🤗🙏

10. I love going on holiday and visiting new places. Started cruising last year and I love it so much. You get to visit so many different places whilst on the one holiday and best of all if you find a place you love on a day off the ship you know where to book if you do fancy a single destination!

Happy Thanksgiving(tomorrow l will be travelling) to my friends and relatives south of the border. May you know you are blessed. ❤️❤️

Woke up in what may be the 11th hotel in 6 weeks of travelling for work....and the in-room coffee maker makes a 4 CUP CARAFE!!

Sorry...so excited to just reach over and pour a second cup while joy-scrolling on #bluesky

Today is ALREADY a good day.

Glad you made it to your destination safely! Have fun, eat good food and pass out on the couch!
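
The two queries above differ only in the sentiment label, so they can be generated by one helper. As a variation, this sketch puts the label condition in the `filter` clause with a `term` query on the `keyword` sub-field defined in the mapping, so the label acts as an exact yes/no condition and only the text match contributes to the relevance score. `sentiment_query` is a hypothetical helper name, and the sub-field path assumes the `bluesky_sentiment` mapping shown earlier.

```python
def sentiment_query(text, label):
    """Build a bool query: full-text match on `text`, exact filter on label."""
    return {
        "bool": {
            # Scored clause: how well the post matches the search text
            "must": [{"match": {"text": text}}],
            # Unscored clause: exact match on the raw model label
            "filter": [{"term": {"sentiment.predicted_value.keyword": label}}],
        }
    }


negative_query = sentiment_query("travelling destination", "LABEL_0")
positive_query = sentiment_query("travelling destination", "LABEL_2")
print(positive_query)
```

Either query can then be passed to the search call used above, for example `client.search(index="bluesky_sentiment", query=positive_query, size=5)`.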


### TBD - using semantic_text and the inference endpoint instead