## Using Elasticsearch to explore Hugging Face Datasets

### Connecting to the ES client

In [1]:
from getpass import getpass  
from elasticsearch import Elasticsearch

# Prompt the user to enter their Elastic Cloud ID and API Key securely
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # Cloud ID, found under deployment management
    api_key=ELASTIC_API_KEY,  # API key (not a username and password), found under Deployments - Security
)

Hugging Face lets us get started with datasets quickly. This collection of 2 million posts from Bluesky gives us social media text to explore and search for interesting patterns.

https://huggingface.co/datasets/alpindale/two-million-bluesky-posts

In [50]:
from datasets import load_dataset

ds = load_dataset("alpindale/two-million-bluesky-posts", split="train")

Here's an example of a post:

In [59]:
ds[0:1]

{'text': ["This is really interesting polling data about national public attitudes re: California.  It's from the LA Times, in January.  I wonder if this will change substantially in the next two years?  5233025.fs1.hubspotusercontent-na1.net/hubfs/523302..."],
 'created_at': ['2024-11-27T07:53:47.202Z'],
 'author': ['did:plc:5ug6fzthlj6yyvftj3alekpj'],
 'uri': ['at://did:plc:5ug6fzthlj6yyvftj3alekpj/app.bsky.feed.post/3lbw33zxvik24'],
 'has_images': [False],
 'reply_to': [None]}
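
Note that slicing a `Dataset` returns a dict of columns, where each value is a list, as the output above shows. Below is a minimal sketch of turning that shape into one dict per post. It uses a trimmed copy of the record above, and the helper name `columns_to_rows` is only illustrative:

```python
from datetime import datetime


def columns_to_rows(batch):
    """Convert a dict of equal-length columns into a list of row dicts."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*batch.values())]


# Trimmed copy of the slice shown above ("text" and "uri" omitted for brevity)
batch = {
    "created_at": ["2024-11-27T07:53:47.202Z"],
    "author": ["did:plc:5ug6fzthlj6yyvftj3alekpj"],
    "has_images": [False],
    "reply_to": [None],
}

rows = columns_to_rows(batch)
post = rows[0]

# created_at is an ISO 8601 string; swap the trailing "Z" for an explicit UTC offset
created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))
print(post["author"], created.date())
```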

The most interesting thing we can do with such a dataset is search through the posts. Hugging Face Datasets integrates with Elasticsearch, which lets us add search capabilities to the data.

[These docs](https://huggingface.co/docs/datasets/en/faiss_es#elasticsearch) show how to add a search index to your Dataset.

In [46]:
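# Each field takes a single "type". A repeated key in a dict literal is
# overwritten by its last value, so "text" is what was effectively sent.
mappings = {
    "properties": {
        "text": {
            "type": "text"
        },
        "created_at": {
            "type": "date"
        },
        "author": {
            "type": "text"
        }
    }
}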
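
One thing to watch when writing a mapping as a Python dict: a dict literal keeps only the last value for a repeated key, so repeating `"type"` does not give a field two types. The sketch below shows that behaviour, followed by the Elasticsearch multi-field form, which indexes one field both as analyzed text and as an exact keyword. The sub-field name `keyword` is a common convention and an assumption here, not something this dataset requires:

```python
# Repeating a key in a dict literal is legal Python, but only the last value survives
field = {
    "type": "keyword",
    "type": "text",
}
print(field)  # {'type': 'text'}

# Multi-field mapping: analyzed full text plus an exact-match sub-field,
# which would be addressed as "author.keyword" in queries and aggregations
author_mapping = {
    "type": "text",
    "fields": {
        "keyword": {"type": "keyword"}
    },
}
print(sorted(author_mapping["fields"]))  # ['keyword']
```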

In [52]:
index_name = "bluesky"
ds.add_elasticsearch_index(
    column="text",
    es_client=client,
    es_index_config={"mappings": mappings},
    es_index_name=index_name,
)

100%|██████████| 2107530/2107530 [08:13<00:00, 4270.93docs/s]


Dataset({
    features: ['text', 'created_at', 'author', 'uri', 'has_images', 'reply_to'],
    num_rows: 2107530
})

This creates the "bluesky" index in Elasticsearch and indexes the "text" column of our Hugging Face dataset into it. It also attaches an index named "text" (after the column) to the dataset object itself, which we can use later through the `datasets` API.
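
Since indexing two million documents takes several minutes, it can help to reattach to the existing Elasticsearch index in a later session instead of indexing again. Here is a minimal sketch, assuming the "bluesky" index already exists and that `ds` and `client` are created as above. It relies on `Dataset.load_elasticsearch_index` from the `datasets` library; the wrapper name `attach_existing_index` is hypothetical:

```python
def attach_existing_index(dataset, es_client, es_index_name="bluesky", column="text"):
    """Bind an existing Elasticsearch index to the dataset without re-indexing.

    `column` is reused as the dataset-side index name, matching the default
    that add_elasticsearch_index applies when no index_name is given.
    """
    dataset.load_elasticsearch_index(
        index_name=column,
        es_client=es_client,
        es_index_name=es_index_name,
    )
    return dataset


# Usage in a fresh session (requires a reachable cluster):
# ds = attach_existing_index(ds, client)
```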

This means we can work with the data using the regular Elasticsearch client (or any other method, such as direct API calls or the Dev Console):

In [99]:
query = {
    "match": {
        "text": "travelling"
    }
}

response = client.search(index=index_name, query=query)

print("We get back {total} results, here are the top ones:".format(total=response["hits"]['total']['value']))
for hit in response["hits"]["hits"][0:5]:
    print(hit['_source']['text'])


We get back 206 results, here are the top ones:
Bleurgh!!!! Why is it that travelling anywhere involves so much, well, travelling?
i am TRAVELLING not DRIVING
Travelling squad gonna be hilarious
Very nice! Are you stull travelling?
Amsterdam, Netherlands 🇳🇱

#Amsterdam #Holland #travelling
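
A plain `match` on "travelling" only finds that exact spelling after analysis. As a sketch of how the query body could be varied, the helper below builds a `match` query with optional `fuzziness` and `operator` parameters, both of which are standard `match` options. The helper name `build_match_query` is an assumption for illustration, and the result would be passed as `query=` to `client.search`:

```python
def build_match_query(text, field="text", fuzziness=None, operator=None):
    """Build a match query body, adding optional parameters only when set."""
    params = {"query": text}
    if fuzziness is not None:
        params["fuzziness"] = fuzziness
    if operator is not None:
        params["operator"] = operator
    return {"match": {field: params}}


# "AUTO" lets Elasticsearch choose the allowed edit distance from the term length,
# which should also pick up the spelling "traveling"
fuzzy_query = build_match_query("travelling", fuzziness="AUTO")
print(fuzzy_query)

# Against a live cluster:
# response = client.search(index=index_name, query=fuzzy_query, size=5)
```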


Alternatively, we can keep using the Hugging Face `datasets` functions and query the Elasticsearch index attached to our dataset:

In [93]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="travelling", k=5)
retrieved_examples["text"]

['Bleurgh!!!! Why is it that travelling anywhere involves so much, well, travelling?',
 'i am TRAVELLING not DRIVING',
 'Travelling squad gonna be hilarious',
 'Very nice! Are you stull travelling?',
 'Amsterdam, Netherlands 🇳🇱\n\n#Amsterdam #Holland #travelling']
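
`get_nearest_examples` also returns `scores`, one relevance score per retrieved example, in the same order as the columns of `retrieved_examples`. Below is a small sketch of pairing the two for display. The helper name `rank_results` is hypothetical, and the values are placeholders rather than real output, since actual scores come from Elasticsearch:

```python
def rank_results(scores, retrieved_examples, column="text"):
    """Pair each score with its text, highest score first."""
    pairs = zip(scores, retrieved_examples[column])
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Placeholder values, not real output
example_scores = [2.0, 3.5]
example_batch = {"text": ["second best", "best"]}

for score, text in rank_results(example_scores, example_batch):
    print(f"{score:.2f}  {text}")
```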