## Using Elasticsearch to explore Hugging Face Datasets

## Starting from a Hugging Face dataset

Hugging Face allows us to quickly get started with datasets. This collection of 2 million posts from Bluesky will let us explore social media text and find some cool insights.

https://huggingface.co/datasets/alpindale/two-million-bluesky-posts

In [5]:
from datasets import load_dataset

ds = load_dataset("alpindale/two-million-bluesky-posts", split="train")



Here's an example of a post:

In [6]:
ds[0:1]

{'text': ["This is really interesting polling data about national public attitudes re: California.  It's from the LA Times, in January.  I wonder if this will change substantially in the next two years?  5233025.fs1.hubspotusercontent-na1.net/hubfs/523302..."],
 'created_at': ['2024-11-27T07:53:47.202Z'],
 'author': ['did:plc:5ug6fzthlj6yyvftj3alekpj'],
 'uri': ['at://did:plc:5ug6fzthlj6yyvftj3alekpj/app.bsky.feed.post/3lbw33zxvik24'],
 'has_images': [False],
 'reply_to': [None]}
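
Slicing a `Dataset` returns a dictionary of columns, each holding a list, which is why every value above is wrapped in brackets (indexing with a single integer such as `ds[0]` returns one row as a plain dictionary instead). A minimal sketch of turning such a batch into per-post records; `batch_to_rows` is a hypothetical helper, not part of `datasets`, and the post text is shortened here:

```python
from datetime import datetime


def batch_to_rows(batch):
    """Turn a column-oriented batch (dict of lists) into a list of row dicts."""
    columns = list(batch)
    return [dict(zip(columns, values)) for values in zip(*batch.values())]


# Same shape as the output of ds[0:1]; the text is shortened for readability.
batch = {
    "text": ["This is really interesting polling data ..."],
    "created_at": ["2024-11-27T07:53:47.202Z"],
    "author": ["did:plc:5ug6fzthlj6yyvftj3alekpj"],
    "uri": ["at://did:plc:5ug6fzthlj6yyvftj3alekpj/app.bsky.feed.post/3lbw33zxvik24"],
    "has_images": [False],
    "reply_to": [None],
}

rows = batch_to_rows(batch)
post = rows[0]

# Parse the ISO 8601 timestamp; "Z" is replaced so older Python versions accept it.
created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))
print(created.date(), post["has_images"], post["reply_to"])
```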

The most interesting thing we can do with such a dataset is to search through the posts. Hugging Face `datasets` integrates with Elasticsearch, which lets us add search capabilities to the data.

[These docs](https://huggingface.co/docs/datasets/en/faiss_es#elasticsearch) show how to add a search index to your Dataset.

First, we connect to our cloud-hosted Elasticsearch deployment by creating a client:

In [None]:
from getpass import getpass  
from elasticsearch import Elasticsearch

# Prompt the user to enter their Elastic Cloud ID and API Key securely
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # cloud id can be found under deployment management
    api_key=ELASTIC_API_KEY,  # API key for connecting to Elastic, found under Deployments - Security
)

Now we build an index from our dataset that we can use for search.

In [52]:
index_name = "bluesky"
ds.add_elasticsearch_index(column="text", es_client=client, es_index_name=index_name)

100%|██████████| 2107530/2107530 [08:13<00:00, 4270.93docs/s]


Dataset({
    features: ['text', 'created_at', 'author', 'uri', 'has_images', 'reply_to'],
    num_rows: 2107530
})

This created the "bluesky" index in Elasticsearch and loaded the "text" column of our Hugging Face dataset into it. It also attached that index to the dataset object under the name "text" (the column name), which is the `index_name` we pass when querying.
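
The call above relies on the default index settings. As a hedged sketch, `add_elasticsearch_index` also accepts an `es_index_config` dictionary, and the structure below mirrors the example in the Hugging Face docs linked earlier. The helper `build_index_config` is hypothetical, and the choice of the built-in `english` analyzer is only an assumption about what might suit this data.

```python
def build_index_config(analyzer="standard", shards=1):
    """Build an index config for a single searchable "text" field.

    Hypothetical helper: `datasets` stores the indexed column under a field
    named "text", so the mapping targets that field name.
    """
    return {
        "settings": {"number_of_shards": shards},
        "mappings": {
            "properties": {
                "text": {
                    "type": "text",
                    "analyzer": analyzer,
                    "similarity": "BM25",
                }
            }
        },
    }


config = build_index_config(analyzer="english")
print(config["mappings"]["properties"]["text"])

# With the objects from this notebook, the config would be passed like this:
# ds.add_elasticsearch_index(
#     column="text",
#     es_client=client,
#     es_index_name=index_name,
#     es_index_config=config,
# )
```

Changing the analyzer only affects a newly created index, so an existing "bluesky" index would need a different name or to be rebuilt.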

Once the index has been created, you can reattach it to the dataset in later sessions by loading it from Elasticsearch instead of rebuilding it.


In [12]:
ds.load_elasticsearch_index("text", es_client=client, es_index_name=index_name)

We can now quickly run some searches!
Judging by my feed lately, one of the most talked-about topics on Bluesky is Twitter. Let's test that.

In [65]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="twitter", k=5)
retrieved_examples["text"]

['Who’s here from Twitter? #x #twitter',
 "you can take the twitter user out of twitter but you can't take twitter out of the twitter user",
 'Deixem os tópicos do Twitter no Twitter',
 '>opens twitter\n>holocaust_denial.mp3\n>closes twitter\n>deletes app',
 'Twitter failed to scare legacy verified accounts into paying for Twitter Blue https://mashable.com/article/twitter-legacy-verified-account-twitter-blue-subscribers']

We can also check whether people on Bluesky are talking about Hugging Face, or maybe about our ElasticON conference that happened last week.

In [84]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="huggingface", k=3)
for response in retrieved_examples["text"]:
    print(response + "\n")

What's huggingface?

Huggingface, nomen est omen

huggingface is drunk. damn



In [83]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="elasticon", k=3)
for response in retrieved_examples["text"]:
    print(response + "\n")

following the success of the community track at #ElasticON (here amsterdam from yesterday): we still have paris, london, singapore, and sydney coming up and are looking for community talks. send us your best tech topics around anything #elastic
https://sessionize.com/elasticon

Attended ElasticOn yesterday in Amsterdam. And learned about Better Binary Quantization (BBQ), which recently got implemented in Elastic and Lucene. Based of RaBitQ paper. Very interesting for everybody busy with vector search and RAG systems. www.elastic.co/search-labs/...



Just from these examples we can already think of what next steps would be interesting to explore. 

Perhaps checking the sentiment along with each query to see how people feel about a topic? 

Or using a multilingual model to help us work through the international posts?

### Adding LLMs

We can go beyond simple search by also leveraging models from the HuggingFace model hub to generate more insights for our queries.

Let's start with sentiment! For this task we can use [this classification model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment), which was trained on tweets. Since our posts are a very similar kind of text, we can expect it to work reasonably well.

These are the labels it should generate:

Labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive
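
The pipeline reports these classes as strings such as `LABEL_0` rather than as names. A minimal sketch of a lookup that makes the output easier to read, assuming the mapping listed above; `LABEL_NAMES` and `readable` are hypothetical helpers, not part of `transformers`:

```python
# Mapping taken from the label list above: 0 -> Negative, 1 -> Neutral, 2 -> Positive.
LABEL_NAMES = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}


def readable(prediction):
    """Return a copy of one prediction with a human-readable label.

    Unknown labels are passed through unchanged.
    """
    label = prediction["label"]
    return {"label": LABEL_NAMES.get(label, label), "score": prediction["score"]}


# Predictions in the shape the text-classification pipeline returns.
raw = [
    {"label": "LABEL_2", "score": 0.955704927444458},
    {"label": "LABEL_0", "score": 0.9654269218444824},
]

for prediction in raw:
    print(readable(prediction))
```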


In [92]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [95]:
pipe(["I love you", "I hate you"])

[{'label': 'LABEL_2', 'score': 0.955704927444458},
 {'label': 'LABEL_0', 'score': 0.9654269218444824}]

Seems pretty legitimate. Let's try it on some of our previous search results!

In [96]:
scores, retrieved_examples = ds.get_nearest_examples(index_name="text", query="twitter", k=5)
for result in retrieved_examples["text"]:
    print(result)
    print(pipe(result))
    print()

Who’s here from Twitter? #x #twitter
[{'label': 'LABEL_1', 'score': 0.8716976642608643}]

you can take the twitter user out of twitter but you can't take twitter out of the twitter user
[{'label': 'LABEL_1', 'score': 0.4940926432609558}]

Deixem os tópicos do Twitter no Twitter
[{'label': 'LABEL_1', 'score': 0.7995824813842773}]

>opens twitter
>holocaust_denial.mp3
>closes twitter
>deletes app
[{'label': 'LABEL_1', 'score': 0.5307735800743103}]

Twitter failed to scare legacy verified accounts into paying for Twitter Blue https://mashable.com/article/twitter-legacy-verified-account-twitter-blue-subscribers
[{'label': 'LABEL_0', 'score': 0.6615949273109436}]



Alright, we are getting mostly "neutral" results, several of them with low confidence. Perhaps we should dig deeper into our data to fish out posts with stronger sentiment. Here's where the Elasticsearch index will come in handy to help us build a more complex search use case.
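
To make that observation concrete, here is a small sketch that groups the five predictions printed above by label and averages their scores. The numbers are copied from that output; nothing here calls the model.

```python
from collections import defaultdict
from statistics import mean

# Predictions copied from the output above, in the same order.
predictions = [
    {"label": "LABEL_1", "score": 0.8716976642608643},
    {"label": "LABEL_1", "score": 0.4940926432609558},
    {"label": "LABEL_1", "score": 0.7995824813842773},
    {"label": "LABEL_1", "score": 0.5307735800743103},
    {"label": "LABEL_0", "score": 0.6615949273109436},
]

# Collect the scores for each label.
scores_by_label = defaultdict(list)
for prediction in predictions:
    scores_by_label[prediction["label"]].append(prediction["score"])

# Map each label to (number of posts, mean confidence).
summary = {
    label: (len(scores), mean(scores))
    for label, scores in scores_by_label.items()
}

for label, (count, average) in sorted(summary.items()):
    print(f"{label}: n={count}, mean score={average:.3f}")
```

Four of the five posts come out as `LABEL_1` (neutral), and two of those have scores close to 0.5.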


First off, we should apply this new sentiment insight to all our posts and add the results as a new feature in our index.
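
For a quick local experiment, the same idea can be sketched with `Dataset.map` before involving the cluster at all. This is only an assumption about how one might prototype it, not the approach used in the second notebook. `add_sentiment` and `fake_classify` are hypothetical names, and the stand-in classifier exists only so the sketch runs without downloading a model.

```python
def add_sentiment(batch, classify):
    """Compute sentiment columns for a column-oriented batch of posts.

    `classify` is any callable that takes a list of strings and returns one
    {"label": ..., "score": ...} dict per string, as the pipeline does.
    Only the new columns are returned; `Dataset.map` merges them into the batch.
    """
    predictions = classify(batch["text"])
    return {
        "sentiment": [p["label"] for p in predictions],
        "sentiment_score": [p["score"] for p in predictions],
    }


def fake_classify(texts):
    """Stand-in for the model: labels every post as neutral with score 0.5."""
    return [{"label": "LABEL_1", "score": 0.5} for _ in texts]


demo = add_sentiment({"text": ["first post", "second post"]}, fake_classify)
print(demo)

# With the real objects from this notebook, the call would look roughly like:
# ds = ds.map(lambda batch: add_sentiment(batch, pipe), batched=True, batch_size=64)
```

Running the model over two million posts on a CPU this way would be slow, which is one reason to move the inference next to the data.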

In the second notebook we will use the Elasticsearch Python client to deploy the model in our Elastic cluster and set up an inference pipeline that adds the sentiment label to all our posts.