## Using Elasticsearch to explore Hugging Face Datasets

## Starting from a Hugging Face dataset

Hugging Face allows us to quickly get started with datasets. This collection of 2 million posts from Bluesky will let us explore social media text data and find some cool insights.

https://huggingface.co/datasets/alpindale/two-million-bluesky-posts

In [3]:
from datasets import load_dataset

ds = load_dataset("alpindale/two-million-bluesky-posts", split="train")


Here's an example of a post:

In [4]:
ds[0:1]

{'text': ["This is really interesting polling data about national public attitudes re: California.  It's from the LA Times, in January.  I wonder if this will change substantially in the next two years?  5233025.fs1.hubspotusercontent-na1.net/hubfs/523302..."],
 'created_at': ['2024-11-27T07:53:47.202Z'],
 'author': ['did:plc:5ug6fzthlj6yyvftj3alekpj'],
 'uri': ['at://did:plc:5ug6fzthlj6yyvftj3alekpj/app.bsky.feed.post/3lbw33zxvik24'],
 'has_images': [False],
 'reply_to': [None]}
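
The `uri` and `created_at` fields pack several pieces of information into one string. As a minimal sketch, assuming every row follows the same format as the sample above (which we have not verified for the whole dataset), we can split them apart. The helper name `parse_post` is made up for illustration.

```python
from datetime import datetime


def parse_post(post):
    """Split a post's AT URI into its parts and parse its timestamp."""
    # Assumed layout, based on the sample row: at://<author DID>/<collection>/<record key>
    did, collection, rkey = post["uri"][len("at://"):].split("/")
    # The sample timestamp ends in "Z" (UTC); replace it with an explicit offset
    # so that datetime.fromisoformat also accepts it on older Python versions
    created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))
    return {"did": did, "collection": collection, "rkey": rkey, "created": created}


# Values copied from the example post above
sample = {
    "created_at": "2024-11-27T07:53:47.202Z",
    "uri": "at://did:plc:5ug6fzthlj6yyvftj3alekpj/app.bsky.feed.post/3lbw33zxvik24",
}
parsed = parse_post(sample)
print(parsed["did"], parsed["rkey"], parsed["created"].date())
```

Note that `ds[0:1]` returns a list per column, while `ds[0]` returns a single row as a plain dictionary, which is the shape this helper expects.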

The most interesting thing we can do with such a dataset is to search through the posts. Hugging Face Datasets integrates with Elasticsearch, which lets us add search capabilities to the data.

[These docs](https://huggingface.co/docs/datasets/en/faiss_es#elasticsearch) show how to add a search index to your Dataset.

We can first connect to our cloud-hosted Elasticsearch deployment:

In [1]:
from getpass import getpass  
from elasticsearch import Elasticsearch

# Prompt the user to enter their Elastic Cloud ID and API Key securely
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # Cloud ID can be found under deployment management
    api_key=ELASTIC_API_KEY,  # API key for connecting to Elastic, found under Deployments - Security
)
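
Before indexing two million documents, it can be worth confirming that the credentials actually work. The sketch below uses the client's `ping()` and `info()` calls; the helper name `describe_cluster` is our own, and we assume the `info()` response exposes `cluster_name` and `version.number` the way recent Elasticsearch versions do.

```python
def describe_cluster(client):
    """Return a short, human-readable status for an Elasticsearch client."""
    # ping() returns False instead of raising when the cluster cannot be reached
    if not client.ping():
        return "unreachable"
    info = client.info()
    return f"connected to {info['cluster_name']} (version {info['version']['number']})"


# Usage with the client created above:
# print(describe_cluster(client))
```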

We now build an index out of our dataset to use for search:

In [52]:
index_name = "bluesky"
ds.add_elasticsearch_index(column="text", es_client=client, es_index_name=index_name)

100%|██████████| 2107530/2107530 [08:13<00:00, 4270.93docs/s]


Dataset({
    features: ['text', 'created_at', 'author', 'uri', 'has_images', 'reply_to'],
    num_rows: 2107530
})

This created the "bluesky" index in Elasticsearch and added our Hugging Face dataset to it. It also attaches an index to the "text" feature of our Hugging Face dataset. On the dataset side this index is named after the column by default, which is why later calls refer to it as "text".

Once the index has been created, you can load it again from Elastic for future use.


In [6]:
index_name = "bluesky"
ds.load_elasticsearch_index("text", es_client=client, es_index_name=index_name)
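
When reloading an existing index, a quick sanity check is to compare the number of documents in Elasticsearch with the number of rows in the dataset. This is only a sketch: the helper name `index_is_complete` is hypothetical, and it relies on the client's `count` API returning a `count` field.

```python
def index_is_complete(client, index_name, expected_rows):
    """Check that the Elasticsearch index holds as many documents as the dataset has rows."""
    indexed = client.count(index=index_name)["count"]
    return indexed == expected_rows


# Usage with the objects defined above:
# index_is_complete(client, "bluesky", len(ds))
```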

We can now quickly run some searches!
Let's start with a simple query about travel destinations.

In [59]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="travelling destination", k=5)
for response in retrieved_examples["text"]:
    print(response + "\n")

#Armenia is an amazing destination, with incredible history and beautiful landscapes. As the only current nation on the world's oldest map, oldest winery site and the first #Christian nation, it's worth travelling to this inexpensive destination to discover places like this (Dadal's Bridge, 14th c)!

Bleurgh!!!! Why is it that travelling anywhere involves so much, well, travelling?

Destination reached 🫡

Destination:  Afternoon Nap.

Journey before destination.
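
The `scores` returned alongside the examples are easy to overlook. As far as we know, the Elasticsearch index in `datasets` ranks by keyword relevance (BM25) rather than by embedding similarity, which would explain why short posts that repeat the word "destination" rank so well. Here is a minimal sketch of pairing each hit with its score; the helper name `pair_hits` and the sample scores are made up for illustration.

```python
def pair_hits(scores, examples, column="text"):
    """Pair each retrieved text with its relevance score, best match first."""
    hits = list(zip(scores, examples[column]))
    # Sort explicitly instead of relying on the order returned by the index
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Made-up scores for illustration; real values come from ds.get_nearest_examples(...)
sample_scores = [7.1, 12.4]
sample_examples = {"text": ["Destination reached", "Journey before destination."]}

for score, text in pair_hits(sample_scores, sample_examples):
    print(f"{score:6.2f}  {text}")
```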



We can also check if people on Bluesky are talking about Hugging Face, or maybe our ElasticON conference that happened last week?

In [22]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="elasticon", k=2)
for response in retrieved_examples["text"]:
    print(response + "\n")

following the success of the community track at #ElasticON (here amsterdam from yesterday): we still have paris, london, singapore, and sydney coming up and are looking for community talks. send us your best tech topics around anything #elastic
https://sessionize.com/elasticon

Attended ElasticOn yesterday in Amsterdam. And learned about Better Binary Quantization (BBQ), which recently got implemented in Elastic and Lucene. Based of RaBitQ paper. Very interesting for everybody busy with vector search and RAG systems. www.elastic.co/search-labs/...



Just from these examples we can already think of what next steps would be interesting to explore. 

Perhaps checking the sentiment along with each query to see how people feel about a topic? 

Or using a multilingual model to help us work through the international posts?

### Adding LLMs

We can go beyond simple search by also leveraging models from the Hugging Face model hub to generate more insights for our queries.

Let's start out with sentiment! For this task we can use [this classification model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) trained on tweets. We can expect it to work quite well, since our data is quite similar.

These are the labels it should generate:

Labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive


In [23]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Quick test on some obvious data:

In [28]:
pipe(["I love you", "I hate you"])

[{'label': 'LABEL_2', 'score': 0.955704927444458},
 {'label': 'LABEL_0', 'score': 0.9654269218444824}]

Seems pretty legitimate. Let's see how people feel about some topics!

In [25]:
def process_label(result):
    label = result[0]["label"]
    if label == "LABEL_1":
        label = "neutral"
    elif label == "LABEL_2":
        label = "positive"
    elif label == "LABEL_0":
        label = "negative"
    return label
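
The same mapping can also be written as a dictionary lookup, which keeps the label table in one place. This is just an alternative sketch, not a change to the cell above; the names `LABEL_NAMES` and `label_name` are ours.

```python
# Mapping taken from the label list above: 0 -> Negative; 1 -> Neutral; 2 -> Positive
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def label_name(result):
    """Translate the top prediction of a pipeline result into a readable label."""
    label = result[0]["label"]
    # Fall back to the raw label if the model returns something unexpected
    return LABEL_NAMES.get(label, label)


print(label_name([{"label": "LABEL_2", "score": 0.955704927444458}]))
```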

In [61]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="travelling destination", k=3)
for result in retrieved_examples["text"]:
    print(result)
    print("Sentiment: " + process_label(pipe(result)))
    print()

#Armenia is an amazing destination, with incredible history and beautiful landscapes. As the only current nation on the world's oldest map, oldest winery site and the first #Christian nation, it's worth travelling to this inexpensive destination to discover places like this (Dadal's Bridge, 14th c)!
Sentiment: positive

Bleurgh!!!! Why is it that travelling anywhere involves so much, well, travelling?
Sentiment: negative

Destination reached 🫡
Sentiment: neutral



That's a great start: we can already see the model does okay with the sentiment for our searches. But to get more meaningful insights, we'd need to scale this example out.
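
A small first step towards that is to tally the sentiment over a larger batch of search hits instead of printing posts one by one. The sketch below assumes the classifier accepts a list of texts and returns one `{'label': ..., 'score': ...}` dictionary per text, as `pipe` did in the quick test above. The function `sentiment_counts` and the stand-in `fake_classifier` are made up so that the example runs without downloading a model.

```python
from collections import Counter

LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def sentiment_counts(texts, classifier):
    """Classify a batch of texts and count how often each sentiment occurs."""
    results = classifier(list(texts))
    return Counter(LABEL_NAMES.get(r["label"], r["label"]) for r in results)


def fake_classifier(texts):
    # Stand-in for the real pipeline, for illustration only
    return [{"label": "LABEL_2" if "love" in t else "LABEL_0", "score": 1.0} for t in texts]


counts = sentiment_counts(["I love you", "I hate you", "I love tea"], fake_classifier)
print(counts.most_common())
```

With the real model, `classifier` would be `pipe` and `texts` would be `retrieved_examples["text"]` from a search with a larger `k`.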

At the moment our data and model are both stored and running locally, and we'd need to re-run this notebook whenever we want to run a different query.

In the next part of the series we'll take a look at storing our data and model to be able to scale out the inference process and run more complex queries.


Then we can start building some interesting insights like:
* How popular is a particular topic? 
* Can we count posts or interest over time?
* How about the general sentiment over the topic?
* Could we also measure this in real time?


In its current state, our dataset will be a bit difficult to use to answer these questions.

This is where the Elastic index and inference service will come in handy to help us build a more complex search use case.