## Connect to the Elasticsearch client with your credentials

This notebook adds the Hugging Face (HF) + ELSER example to the Spotify data example by bringing in a lyrics dataset as well:
https://huggingface.co/datasets/brunokreiner/genius-lyrics
* Connect to data from HF and show how to add an Elastic engine to it (first quick-and-dirty example)

* Use the Spotify ID to add it to the Spotify index 
* Adding the ELSER model and a sentiment analysis model
* Make a hybrid search example (search for artist + mood from sentiment for example) 
    * Maybe sort Taylor songs by mood?
    * Follow up with GenAI example to sort songs into the stages of grief thing? 

In [83]:
from elasticsearch import Elasticsearch, helpers
from getpass import getpass

# Connect to the Elastic Cloud deployment
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # Cloud ID can be found under deployment management
    api_key=ELASTIC_API_KEY,  # API key for connecting to Elastic, found under Deployments - Security
)

# ADD HUGGING FACE DATASET CONNECTION 

https://huggingface.co/datasets/brunokreiner/genius-lyrics

## Put the data in an index

We are creating a nested field for the lyrics so we can search for the inner hits to get the exact lines we want.
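
The nested mapping expects each song's `lyrics` field to be a list of objects with a `line` key, and the loading code further down iterates over `songs` as a collection of dicts. A minimal sketch of that reshaping, assuming each raw record stores its lyrics as a single newline-separated string. The helper name `to_nested_lyrics` and the sample record are hypothetical and not taken from the dataset:

```python
def to_nested_lyrics(raw_lyrics):
    """Split a newline-separated lyrics string into nested 'line' objects."""
    return [
        {"line": line.strip()}
        for line in raw_lyrics.splitlines()
        if line.strip()  # skip blank lines between verses
    ]


# Hypothetical raw record; the real field names depend on the dataset.
raw_song = {"id": "demo-1", "name": "Demo Song", "lyrics": "First line\n\nSecond line\n"}

# Replace the raw string with the nested structure the mapping expects.
song = {**raw_song, "lyrics": to_nested_lyrics(raw_song["lyrics"])}
print(song["lyrics"])
```

A list of records shaped like `song` is what the `songs` variable is assumed to hold.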

In [133]:
index_name = 'ts_songs'

mappings = {
  "properties": {
    "lyrics": {
        "type": "nested",
        "properties": {
          "line": {
            "type": "text"
          }
        }
    },
  }
}

# Create the Elasticsearch index with the specified name (delete if already existing)
if client.indices.exists(index=index_name):
    client.indices.delete(index=index_name)
client.indices.create(index=index_name, mappings=mappings)

def generate_docs(data, index_name):
    for document in data:
        yield dict(_index=index_name, _id=f"{document['id']}", _source=document)


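# Use the Elasticsearch helpers.bulk() method to index the song documents into Elasticsearch.
# The target index is already set on each action, so it is not passed again here
# (a third positional argument would be interpreted as `stats_only`).
load = helpers.bulk(client, generate_docs(songs, index_name))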

## Look for a specific line in a song

We can now use a nested query to look up words in our songs and, through inner hits, get back the specific lines where they appear.

In [134]:
def simple_search(query):
  query = {
      "nested": {
        "path": "lyrics",
        "query": {
          "match": {
            "lyrics.line": query
          }
        },
        "inner_hits" : {
          "docvalue_fields" : [
            "lyrics.line.keyword"
          ]
        }
      }
  }

  # Run the nested match query against the lyrics index
  response = client.search(index=index_name, query=query)

  print(f'We get back {response["hits"]["total"]["value"]} songs that fit, here are the top results:')
  for hit in response["hits"]["hits"][0:5]:
      print(f'From {hit["_source"]["artist"]} : {hit["_source"]["name"]}: ')
      for inner_hit in hit["inner_hits"]["lyrics"]["hits"]["hits"][0:1]:
          print(inner_hit["_source"]["line"])
      print()

simple_search("I am in love with you")


We get back 19 songs that fit, here are the top results:
From Taylor Swift : You Are in Love: 
You are in love, true love

From Taylor Swift : Dancing With Our Hands Tied: 
Oh, keeping you with me, I-

From Taylor Swift : Come Back... Be Here: 
4 AM, the second day

From Taylor Swift : New Romantics: 
Come on, come along with me

From Taylor Swift : So It Goes...: 
Do bad things with you



However, this only returns lines that share literal words with the query, missing out on songs about "lovers", "loving", or similar phrases that I might still want to find.

So we can take this a step further and add a semantic search model into the mix, to help us really look for meaning in the lyrics.



## Adding ELSER inference for semantic search

We will use a [foreach](https://www.elastic.co/guide/en/elasticsearch/reference/current/foreach-processor.html) processor to loop through all lines of the lyrics.

See [the ELSER Notebook](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/03-ELSER.ipynb) for a simple getting-started guide to semantic search, and [this document chunking example](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/document-chunking/with-index-pipelines.ipynb) for another instance of embedding inner hits.

In [91]:
client.ingest.put_pipeline(
    id="adding_ELSER_to_lyrics", 
    processors=
    [
        {
            "foreach": {
                "field": "lyrics",
                "processor": {
                    "inference": {
                        "model_id": ".elser_model_2",
                        "input_output": [
                            {"input_field": "_ingest._value.line", "output_field": "_ingest._value.tokens"}
                        ],
                        "on_failure" : [
                        {
                            "set" : {
                                "field": "_ingest._value.errors",
                                "value": "failed in foreach processor"
                            }
                        }]
                    }
                }
            }
        }
    ]
)

mappings = {
    "dynamic" : True,
    "properties" : 
    {
        "lyrics": {
            "type": "nested",
            "properties": {
                "line" : {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
                "tokens": { 
                    "type": "sparse_vector" 
                }
            }
        }
    }
}

# Create the new index for the enriched data
index_name_new = "ts_songs_semantic"
if client.indices.exists(index=index_name_new):
    client.indices.delete(index=index_name_new)
client.indices.create(index=index_name_new, mappings=mappings)

# Reindex through the ELSER pipeline; this runs as a background task
client.reindex(
    source={"index": index_name},
    dest={"index": index_name_new, "pipeline": "adding_ELSER_to_lyrics"},
    wait_for_completion=False,
)

ObjectApiResponse({'task': 'eclQBhHoS0CN09g-_bZM5w:63679788'})
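
Because the reindex is started with `wait_for_completion=False`, it runs in the background and the new index may still be filling up when the next cell runs. One possible way to wait for it is to poll the Tasks API with the returned task ID. This is a minimal sketch: the helper name `wait_for_task` and its default intervals are assumptions, not part of the original notebook.

```python
import time


def wait_for_task(client, task_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll the Tasks API until the task completes or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.tasks.get(task_id=task_id)
        if status["completed"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} did not finish in {timeout_seconds}s")
        time.sleep(poll_seconds)
```

To use it, store the result of `client.reindex(...)` in a variable (for example a hypothetical `reindex_response`) and call `wait_for_task(client, reindex_response["task"])` before searching the new index.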

We can now run the same query again, but using `text_expansion` on the generated tokens rather than `match` directly on the text field.

In [135]:
def semantic_search(query):
    query = {
        "nested": {
            "path": "lyrics",
            "query": {
                "text_expansion": {
                    "lyrics.tokens": {
                        "model_id": ".elser_model_2",
                        "model_text": query,
                    }
                }
            },
            "inner_hits" : {
                "docvalue_fields" : [
                    "lyrics.line.keyword"
                ]
            }
        }
    }

    # Run the nested text_expansion query against the ELSER-enriched index
    response = client.search(index=index_name_new, query=query)

    print(f'We get back {response["hits"]["total"]["value"]} songs that fit, here are the top results:')
    for hit in response["hits"]["hits"][0:5]:
        print(f'From {hit["_source"]["artist"]} : {hit["_source"]["name"]}: ')
        for inner_hit in hit["inner_hits"]["lyrics"]["hits"]["hits"][0:1]:
            print(inner_hit["_source"]["line"])
        print()

semantic_search("I am in love with you")


We get back 19 songs that fit, here are the top results:
From Taylor Swift : You Are in Love: 
You are in love

From Taylor Swift : Gorgeous: 
(I hate you so much)

From Taylor Swift : End Game (Ft. Ed Sheeran & Future): 
You love it, I love it, too 'cause you my type (you my type)

From Taylor Swift : Come Back... Be Here: 
This is falling in love in the cruelest way

From Taylor Swift : Don't Blame Me: 
You're lovin' me



With the new semantic search layer, we actually get back pretty much every song in our sample dataset. 

This could mean either that the model has too low a threshold for a match, or simply that all songs are in some form about love.

Let's try some more specific searches.

Semantic search does seem to capture the meaning better, but the spirit of the songs still isn't fully represented. This may be due to the chunking strategy: in these examples each line of the lyrics is its own nested document, so sentences and paragraphs end up broken apart and some of the context is lost.

This brings up a very important point about semantic search: performance is determined not only by the model chosen, but also by the way the data is processed and presented to the model.
Let's try a few strategies to offer the model more context.
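
One such strategy, sketched here as an assumption rather than as the approach this notebook settles on, is to group consecutive lines into overlapping passages so that each nested document carries some of its neighbours' context. The helper name `window_lines` and the default window size are hypothetical:

```python
def window_lines(lines, size=3, step=1):
    """Group consecutive lyric lines into overlapping passages."""
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    passages = []
    for start in range(0, len(lines), step):
        # Keep the "line" key so the existing mapping and pipeline still apply.
        passages.append({"line": " ".join(lines[start:start + size])})
        if start + size >= len(lines):
            break  # the last window already reaches the end of the song
    return passages


lines = ["line one", "line two", "line three", "line four"]
for passage in window_lines(lines, size=3, step=1):
    print(passage["line"])
```

Larger windows give the model more context per passage but make the returned inner hit less precise, so the window size is a trade-off to tune against the kinds of queries you expect.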


# ADD SECTION TO MATCH THE SPOTIFY IDs TO THE EXISTING INDEX 

## RUN hybrid search examples