<a href="https://colab.research.google.com/github/mandychenze/Applied_Machine_Learning_Homework/blob/master/Copy_of_optimized_rag_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimanyu-anand/GenAIPlayground/blob/main/tmls-2025-rag-workshop/optimized_rag_workshop.ipynb)

In this interactive notebook we will implement some of the concepts we learned about in the talk section of the Optimized RAG workshop (TMLS 2025) using Elasticsearch, via the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html).

Specifically, we will go over:
1. ES basics
2. Embedding quantization
3. Filtered Hybrid search
4. Semantic Context Highlighting
5. Tracing using OTel

## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

Once logged in to your Elastic Cloud account, go to the [Create deployment](https://cloud.elastic.co/deployments/create) page.
1. Select **Create deployment**.
2. Then choose the **Elasticsearch** option and click next.
3. Enter **tmls-2025-workshop** as the name for your Elasticsearch deployment.
4. Select **AWS** as Cloud Provider.
5. Select **🇺🇸   N. Virginia (us-east-1)** as Region.
6. Select **Vector Search Optimized (ARM)** as Hardware Profile.
7. Leave all other settings, including **Version**, at their default values.
8. Then click **Create hosted deployment**. This will take about 5 minutes.
9. You'll be provided with a username and password. You can ignore them, as we will not need them for this workshop.
10. Once the deployment is ready, hit **Continue**.


## Install packages and import modules

To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to install the `elasticsearch` Python client.

In [1]:
!pip install -qU transformers accelerate opentelemetry-sdk opentelemetry-exporter-otlp elasticsearch

## Initialize the Elasticsearch client

Now we can instantiate the [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the Cloud ID and API key of your deployment.

In [3]:
from elasticsearch import Elasticsearch
client = Elasticsearch(
    "https://48cf73e9993f4e25bdc561aa4bf275e0.us-east-1.aws.found.io:443",
    api_key="<YOUR_API_KEY>",  # placeholder: read real keys from a prompt or the environment, never hardcode them
)
index_name = "search-sc2l"
mappings = {
    "properties": {
        "text": {
            "type": "text"
        }
    }
}
mapping_response = client.indices.put_mapping(index=index_name, body=mappings)
print(mapping_response)

{'acknowledged': True}


In [4]:
from elasticsearch import Elasticsearch, helpers, exceptions
from urllib.request import urlopen
from getpass import getpass
import json
import time

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

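# Create the client instance. The Cloud ID must be passed via the `cloud_id`
# keyword, because the first positional argument is interpreted as `hosts`.
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)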

Elastic Cloud ID: ··········
Elastic Api Key: ··········


In [5]:
# Check whether the connection is successfully established
if client.ping():
    print("Connected to Elasticsearch! 🥳")
    print(client.info())
else:
    print("Connection failed.")

Connected to Elasticsearch! 🥳
{'name': 'instance-0000000000', 'cluster_name': '48cf73e9993f4e25bdc561aa4bf275e0', 'cluster_uuid': 'Z_WeY9pRTkqx6uF8NUCyFA', 'version': {'number': '9.0.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '0a58bc1dc7a4ae5412db66624aab968370bd44ce', 'build_date': '2025-05-28T10:06:37.834829258Z', 'build_snapshot': False, 'lucene_version': '10.1.0', 'minimum_wire_compatibility_version': '8.18.0', 'minimum_index_compatibility_version': '8.0.0'}, 'tagline': 'You Know, for Search'}


## Platform Walkthrough

In [7]:
model_id = ".multilingual-e5-small-elasticsearch"
client.ingest.put_pipeline(
    id="title_embeddings_pipeline",
    description="Ingest pipeline for converting title text to embeddings using elastic multilingual e5-small model",
    processors=[
        {
            "inference": {
                "model_id": model_id,
                "input_output": [
                    {"input_field": "title", "output_field": "title_embedding"}
                ],
            }
        }
    ],
)

ObjectApiResponse({'acknowledged': True})

Let's note a few important parameters from that API call:

- `inference`: A processor that performs inference using a machine learning model.
- `model_id`: Specifies the ID of the machine learning model to be used. In this example, the model ID is set to `.multilingual-e5-small-elasticsearch`.
- `input_output`: Specifies the input and output fields.
- `input_field`: Name of the field whose text is converted into a `dense_vector` representation.
- `output_field`: Name of the field that stores the inference results.
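
To make the shape of that processor easier to reuse, here is a minimal sketch of a helper that builds the same `inference` processor definition from a list of field pairs. The helper name `build_inference_processor` is hypothetical and not part of the Elasticsearch client; it only assembles the dictionary that the notebook passes to `client.ingest.put_pipeline`.

```python
def build_inference_processor(model_id, field_pairs):
    """Build an `inference` ingest processor definition.

    field_pairs is an iterable of (input_field, output_field) tuples.
    This is a hypothetical convenience helper, not a client API.
    """
    return {
        "inference": {
            "model_id": model_id,
            "input_output": [
                {"input_field": src, "output_field": dst}
                for src, dst in field_pairs
            ],
        }
    }


# Same configuration as the pipeline defined in this notebook
processor = build_inference_processor(
    ".multilingual-e5-small-elasticsearch",
    [("title", "title_embedding")],
)
print(processor)
```

The resulting dictionary could be passed as the single element of the `processors` list.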

## Define the mapping
Now we need to create an index named **tmls-optimized-rag-workshop**.

Notice how we configure the mappings:

1. We set **index_options.type** of the **title_embedding** field to **bbq_hnsw**, so the vectors are indexed in an HNSW graph using Better Binary Quantization (BBQ). The full-precision embeddings are not lost: we can use the original, non-quantized vectors to rescore the candidates and get the top results. This combines:

  - The performance and memory gains of approximate retrieval using quantized vectors for retrieving the top candidates.
  - The accuracy of using the original vectors for rescoring the top candidates.

2. We define `plot_semantic` as a `semantic_text` field and configure the `plot` field to copy its value into it. While `copy_to` is not required to use `semantic_text`, it enables use cases like hybrid search where semantic and lexical techniques are used together.
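
The idea of retrieving with quantized vectors and rescoring with the originals can be illustrated with a toy, pure-Python sketch. It assumes the simplest possible binary quantization (keep only the sign of each dimension) and Hamming distance for candidate retrieval; Elasticsearch's BBQ is considerably more sophisticated, so treat this as an illustration of the two-stage pattern only. All vectors and function names below are made up for the example.

```python
import math


def binarize(vec):
    """Keep only the sign of each dimension: one bit per dimension."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quantized_search(query, vectors, k, oversample):
    """Stage 1: pick k * oversample candidates using the binary codes.
    Stage 2: rescore those candidates with the original float vectors."""
    q_bits = binarize(query)
    window = math.ceil(k * oversample)
    candidates = sorted(
        vectors, key=lambda doc_id: hamming(q_bits, binarize(vectors[doc_id]))
    )[:window]
    rescored = sorted(
        ((doc_id, cosine(query, vectors[doc_id])) for doc_id in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return rescored[:k]


# Hypothetical 4-dimensional embeddings
vectors = {
    "a": [0.9, 0.1, -0.2, 0.4],
    "b": [0.1, 0.9, -0.8, 0.1],
    "c": [-0.7, -0.3, 0.5, -0.6],
    "d": [0.5, 0.05, -0.1, 0.9],
}
query = [1.0, 0.2, -0.1, 0.3]
top = quantized_search(query, vectors, k=2, oversample=1.5)
print(top)
```

In this toy data the documents `a`, `b`, and `d` share the same binary code as the query, so the quantized stage cannot tell them apart; the rescoring stage with the original vectors is what separates them.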


In [8]:
index_name = "tmls-optimized-rag-workshop"
client.indices.delete(index=index_name, ignore_unavailable=True)
client.indices.create(
    index=index_name,
    settings={"index": {"default_pipeline": "title_embeddings_pipeline"}},
    mappings={
        "properties": {
            "title": {
                "type": "text"
            },
            "title_embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": "true",
                "similarity": "cosine",
                "index_options": {
                    "type": "bbq_hnsw"
                }
            },
            "genre": {
                "type": "text"
            },
            "plot": {
                "type": "text",
                "copy_to": "plot_semantic"
            },
            "plot_semantic": {
                "type": "semantic_text"
            },
            "released": {
                "type": "integer"
            },
        }
    },
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'tmls-optimized-rag-workshop'})

### Index test data

Run the following cells to upload the test data, which contains information about 12 popular movies from this [dataset](https://raw.githubusercontent.com/abhimanyu-anand/GenAIPlayground/main/tmls-2025-rag-workshop/assets/movies_with_wikipedia_plots.json).

In [9]:
from urllib.request import urlopen
import json

url = "https://raw.githubusercontent.com/abhimanyu-anand/GenAIPlayground/main/tmls-2025-rag-workshop/assets/movies_with_wikipedia_plots.json"

try:
    response = urlopen(url)
    data_json = json.loads(response.read())
    print("JSON file loaded successfully!")

except Exception as e:
    print(f"Error loading the JSON file: {e}")
    raise  # stop here: the code below needs data_json

# Prepare the documents to be indexed
documents = []
for doc in data_json:
    documents.append(
        {
            "_index": index_name,
            "_source": doc,
        }
    )

JSON file loaded successfully!


In [16]:
#documents

In [11]:
%%time
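# Use helpers.bulk to index. The generous timeout leaves room for the
# embedding inference that the ingest pipeline runs on every document.
helpers.bulk(client.options(request_timeout=600), documents)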

print(f"Done indexing documents into {index_name} index!")
time.sleep(3)



Done indexing documents into tmls-optimized-rag-workshop index!
CPU times: user 569 ms, sys: 83.4 ms, total: 652 ms
Wall time: 2min 52s


## Semantic Search
Now that our index is populated, we can query it using semantic search.

### Aside: Pretty printing Elasticsearch search results
Your search API calls will return hard-to-read nested JSON. We'll create a little function called `pretty_search_response` to return nice, human-readable outputs from our examples.

In [12]:
def pretty_search_response(response, highlight=False):
    """Print every hit in a readable form and return the printed text."""
    hits = response["hits"]["hits"]
    if len(hits) == 0:
        print("Your search returned no results.")
        return ""

    outputs = []
    for hit in hits:
        doc_id = hit["_id"]  # named doc_id to avoid shadowing the built-in id()
        score = hit["_score"]
        title = hit["_source"]["title"]
        runtime = hit["_source"]["runtime"]
        plot = hit["_source"]["plot"]
        keyScene = hit["_source"]["keyScene"]
        genre = hit["_source"]["genre"]
        released = hit["_source"]["released"]

        pretty_output = f"\nID: {doc_id}\nScore: {score}\nTitle: {title}\nRuntime: {runtime}\nPlot: {plot}\nKey Scene: {keyScene}\nGenre: {genre}\nReleased: {released}"
        if highlight:
            # Highlights are only present when the search request asked for them
            highlight_plot_semantic = hit.get("highlight", {}).get("plot_semantic", [])
            for section_number, section in enumerate(highlight_plot_semantic):
                pretty_output += f"\nHighlighted Section {section_number}: {section}"
        print(pretty_output)
        outputs.append(pretty_output)
    # Return the text of all hits so callers can post-process it
    return "\n".join(outputs)

### Semantic Search with the `semantic` Query

We can use the [semantic query](https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-semantic-query.html) to quickly & easily query the `semantic_text` field in our index.
Under the hood, an embedding is automatically generated for our query text using the `semantic_text` field's inference endpoint.

In [13]:
query_text = "organized crime movies"

In [14]:
response = client.search(
    index=index_name,
    query={
        "semantic": {
            "field": "plot_semantic",
            "query": query_text
        }
    },
    size=3,
)

output = pretty_search_response(response)


ID: XE5QhJcBV1b7xrnPi1jt
Score: 12.085434
Title: Goodfellas
Runtime: 146
Plot: In 1955, teenager Henry Hill becomes enamored by the criminal life and Mafia presence in East New York, a working-class Italian-American neighborhood in Brooklyn, New York City. He begins working for local caporegime Paulie Cicero and his associates Jimmy Conway, an Irish-American truck hijacker and gangster, and Tommy DeVito, a fellow juvenile delinquent. Henry begins as a fence for Jimmy, gradually working his way up to more serious crimes.Throughout the 1960s, the three men excel at carjacking, stealing cargo trucks from JFK Airport, and eventually commit the Air France Robbery. They spend most of their nights at the Copacabana nightclub, carousing with women. Henry starts dating Karen Friedman, a Jewish woman who is initially confused by Henry's criminal activities. She is soon seduced by Henry's glamorous lifestyle, and marries him, despite her parents' disapproval.In 1970, Billy Batts, a made man in t

## Semantic Highlighting

In [15]:
response = client.search(
    index=index_name,
    query={
        "semantic": {
            "field": "plot_semantic",
            "query": query_text
        }
    },
    highlight={
      "fields": {
        "plot_semantic": {"number_of_fragments" : 2, "order" : "score"}
      }
    },
    size=3
)

output = pretty_search_response(response, highlight=True)


ID: XE5QhJcBV1b7xrnPi1jt
Score: 12.085434
Title: Goodfellas
Runtime: 146
Plot: In 1955, teenager Henry Hill becomes enamored by the criminal life and Mafia presence in East New York, a working-class Italian-American neighborhood in Brooklyn, New York City. He begins working for local caporegime Paulie Cicero and his associates Jimmy Conway, an Irish-American truck hijacker and gangster, and Tommy DeVito, a fellow juvenile delinquent. Henry begins as a fence for Jimmy, gradually working his way up to more serious crimes.Throughout the 1960s, the three men excel at carjacking, stealing cargo trucks from JFK Airport, and eventually commit the Air France Robbery. They spend most of their nights at the Copacabana nightclub, carousing with women. Henry starts dating Karen Friedman, a Jewish woman who is initially confused by Henry's criminal activities. She is soon seduced by Henry's glamorous lifestyle, and marries him, despite her parents' disapproval.In 1970, Billy Batts, a made man in t

In [None]:
def get_highlighted_context(query):
  response = client.search(
      index=index_name,
      query={
          "semantic": {
              "field": "plot_semantic",
              "query": query
          }
      },
      highlight={
        "fields": {
          "plot_semantic": {"number_of_fragments" : 1, "order" : "score"}
        }
      },
      size=1
  )

  output = pretty_search_response(response, highlight=True)
  print(output)
  marker = "Highlighted Section 0:"
  before_marker, separator, after_marker = output.partition(marker)
  print(after_marker)
  highlight_fragment = after_marker.strip()
  return highlight_fragment

In [None]:
%%capture
queries = [
    "A story about implanting an idea into someone's subconscious through their dreams",
    "An FBI trainee seeks help from an imprisoned, manipulative cannibal to catch another serial killer",
    "A computer hacker who learns that his reality is a simulation and is offered a choice between a red pill to see the truth or a blue pill to forget everything",
    "A film about a man with a fractured psyche who creates an underground society to rebel against his dissatisfying lifestyle",
]
fragments = [get_highlighted_context(query) for query in queries]

In [None]:
fragments

["Dom Cobb and Arthur are extractors who perform corporate espionage using experimental dream-sharing technology to infiltrate their targets' subconscious and extract information. Their latest target, Saito, is impressed with Cobb's ability to layer multiple dreams within each other. He offers to hire Cobb for the ostensibly impossible job of implanting an idea into a person's subconscious; performing inception on Robert Fischer, the son of Saito's competitor Maurice Fischer, with the idea to dissolve his father's company. In return, Saito promises to clear Cobb's criminal status, allowing him to return home to his children. Cobb accepts the offer and assembles his team: a forger named Eames, a chemist named Yusuf, and a college student named Ariadne. Ariadne is tasked with designing the dream's architecture, something Cobb himself cannot do for fear of being sabotaged by his mind's projection of his late wife, Mal. Maurice Fischer dies, and the team sedates Robert Fischer into a three

### Dense vector Search

This example will:

1. Search using approximate kNN for the top **12** candidates (`num_candidates`).
2. Rescore the top **9** candidates (`oversample * k`) per shard using the original, non-quantized vectors.
3. Return the top **3** (`k`) rescored candidates.
4. Merge the rescored candidates from all shards, and return the top **3** (`k`) results.
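
The numbers in these steps follow directly from the request parameters. The sketch below works through that arithmetic and the final cross-shard merge in plain Python. The function names and the per-shard scores are hypothetical, and rounding the rescoring window up is an assumption made for the example.

```python
import math


def rescore_window(k, oversample):
    """Number of candidates rescored per shard: oversample * k."""
    return math.ceil(k * oversample)


def merge_shard_results(per_shard_hits, k):
    """Merge each shard's top-k (doc_id, score) pairs and keep the global top k."""
    merged = [hit for hits in per_shard_hits for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:k]


# Parameters used in this notebook's kNN request
k, oversample, num_candidates = 3, 3, 12
print(rescore_window(k, oversample))

# Hypothetical rescored results from two shards, each already cut to k hits
shard_hits = [
    [("a", 0.91), ("b", 0.80), ("c", 0.75)],
    [("d", 0.88), ("e", 0.79), ("f", 0.60)],
]
print(merge_shard_results(shard_hits, k))
```

A larger `oversample` widens the rescoring window, which tends to improve recall at the cost of more full-precision comparisons.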

In [None]:
response = client.search(
    index=index_name,
    knn={
        "field": "title_embedding",
        "query_vector_builder": {
          "text_embedding": {
            "model_id": model_id,
            "model_text": query_text
          }
        },
        "rescore_vector": {
            "oversample": 3
        },
        "k": 3,
        "num_candidates": 12
    }
)

output = pretty_search_response(response)  # the function already prints the hits


ID: r89zg5cBAgK7a8pdppKt
Score: 0.90041786
Title: Fight Club
Runtime: 139
Plot: The unnamed Narrator, who struggles with insomnia and dissatisfaction with his job and lifestyle, finds temporary solace in support groups. As his insomnia worsens, he discovers that expressions of emotional vulnerability help him sleep, leading him to join multiple groups for people facing emotionally distressing problems, despite his expressions being fraudulent. His efforts are thwarted when Marla Singer, another impostor, joins the same groups. The Narrator cannot present his fabricated struggles as genuine, or divert his attention from her presence as an impostor, causing his sleeplessness to return. He arranges for them to attend different sessions to regain his ability to sleep and, under certain circumstances, to exchange contact information, to which she reluctantly agrees.On a return flight from work, the Narrator meets a soap salesman, Tyler Durden. After an explosion destroys the Narrator's apa

## Full-text Search
#### Match query
Returns documents that `match` a provided text, number, date or boolean value. The provided text is analyzed before matching.

The `match` query is the standard query for performing a full-text search, including options for fuzzy matching.

[Read more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#match-query-ex-request).
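
To build intuition for "analyzed before matching", here is a toy sketch in plain Python. It assumes a rough stand-in for the standard analyzer (lowercase, split on non-word characters) and the default `OR` behavior of `match`, where a document qualifies if it contains at least one query term. Real relevance scoring uses BM25; the term-overlap count below is only a placeholder, and the documents are invented for the example.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def toy_match(query, docs, field):
    """Return docs sharing at least one analyzed term with the query (OR semantics),
    ordered by how many distinct query terms they contain."""
    query_terms = set(analyze(query))
    scored = []
    for doc in docs:
        overlap = query_terms & set(analyze(doc[field]))
        if overlap:
            scored.append((len(overlap), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Hypothetical documents
docs = [
    {"title": "Doc A", "plot": "An alliance forms to dismantle Organized Crime in the city."},
    {"title": "Doc B", "plot": "A hacker learns that reality is a simulation."},
    {"title": "Doc C", "plot": "Four tales of crime and violence."},
]
results = toy_match("organized crime movies", docs, "plot")
print([doc["title"] for doc in results])
```

Because this analyzer does not stem, the query term `movies` would not match a document containing only `movie`.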

In [None]:
response = client.search(
    index=index_name,
    query = {
        "match": {
            "plot": {
                "query": query_text
            }
        }
    },
    size=3
)

output = pretty_search_response(response)  # the function already prints the hits


ID: rs9zg5cBAgK7a8pdppKt
Score: 5.1212735
Title: The Dark Knight
Runtime: 152
Plot: The Dark Knight is a 2008 superhero film directed by Christopher Nolan, from a screenplay co-written with his brother Jonathan. Based on the DC Comics superhero Batman, it is the sequel to Batman Begins (2005), and the second installment in The Dark Knight trilogy. The plot follows the vigilante Batman, police lieutenant James Gordon, and district attorney Harvey Dent, who form an alliance to dismantle organized crime in Gotham City. Their efforts are derailed by the Joker, an anarchistic mastermind who seeks to test how far Batman will go to save the city from chaos. The ensemble cast includes Christian Bale, Michael Caine, Heath Ledger, Gary Oldman, Aaron Eckhart, Maggie Gyllenhaal, and Morgan Freeman.Warner Bros. Pictures prioritized a sequel following the successful reinvention of the Batman film series with Batman Begins. Christopher and Batman Begins co-writer David S. Goyer developed the story e

#### Multi-match query

The `multi_match` query builds on the match query to allow multi-field queries.

[Read more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html).

In [None]:
response = client.search(
   index=index_name,
    query={
        "multi_match": {
            "query": query_text,
            "fields": ["plot", "title"]
        }
    },
    size=3
)
output = pretty_search_response(response)  # the function already prints the hits


ID: rs9zg5cBAgK7a8pdppKt
Score: 5.1212735
Title: The Dark Knight
Runtime: 152
Plot: The Dark Knight is a 2008 superhero film directed by Christopher Nolan, from a screenplay co-written with his brother Jonathan. Based on the DC Comics superhero Batman, it is the sequel to Batman Begins (2005), and the second installment in The Dark Knight trilogy. The plot follows the vigilante Batman, police lieutenant James Gordon, and district attorney Harvey Dent, who form an alliance to dismantle organized crime in Gotham City. Their efforts are derailed by the Joker, an anarchistic mastermind who seeks to test how far Batman will go to save the city from chaos. The ensemble cast includes Christian Bale, Michael Caine, Heath Ledger, Gary Oldman, Aaron Eckhart, Maggie Gyllenhaal, and Morgan Freeman.Warner Bros. Pictures prioritized a sequel following the successful reinvention of the Batman film series with Batman Begins. Christopher and Batman Begins co-writer David S. Goyer developed the story e

Individual fields can be boosted with the caret (^) notation. Note in the following query how the score contribution of matches in the `plot` field is multiplied by 3.

In [None]:
response = client.search(
   index=index_name,
    query={
        "multi_match": {
            "query": query_text,
            "fields": ["plot^3", "title"]
        }
    },
    size=3
)
output = pretty_search_response(response)  # the function already prints the hits


ID: rs9zg5cBAgK7a8pdppKt
Score: 15.363821
Title: The Dark Knight
Runtime: 152
Plot: The Dark Knight is a 2008 superhero film directed by Christopher Nolan, from a screenplay co-written with his brother Jonathan. Based on the DC Comics superhero Batman, it is the sequel to Batman Begins (2005), and the second installment in The Dark Knight trilogy. The plot follows the vigilante Batman, police lieutenant James Gordon, and district attorney Harvey Dent, who form an alliance to dismantle organized crime in Gotham City. Their efforts are derailed by the Joker, an anarchistic mastermind who seeks to test how far Batman will go to save the city from chaos. The ensemble cast includes Christian Bale, Michael Caine, Heath Ledger, Gary Oldman, Aaron Eckhart, Maggie Gyllenhaal, and Morgan Freeman.Warner Bros. Pictures prioritized a sequel following the successful reinvention of the Batman film series with Batman Begins. Christopher and Batman Begins co-writer David S. Goyer developed the story e
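
The boosted score reported above (15.363821) is three times the unboosted score of 5.1212735 from the earlier `multi_match` query. A small sketch of that arithmetic, assuming the default `best_fields` behavior (the highest per-field score wins, with no tie breaker) and assuming the `title` field contributes nothing for this query:

```python
def best_fields_score(field_scores, boosts):
    """Take the highest per-field score after applying each field's boost."""
    return max(
        score * boosts.get(field, 1.0) for field, score in field_scores.items()
    )


plot_score = 5.1212735  # unboosted score reported for The Dark Knight
field_scores = {"plot": plot_score, "title": 0.0}  # assumption: no match on title

unboosted = best_fields_score(field_scores, {})
boosted = best_fields_score(field_scores, {"plot": 3.0})
print(unboosted, boosted)
```

Boosting changes the ranking only when different documents match in different fields; here every hit is scored through `plot`, so the order stays the same.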

### Using Filters with Queries
Filters are often added to search queries with the intention of limiting the search to a subset of the documents. A filter can cleanly eliminate documents from a search, without altering the relevance scores of the results.

The next example excludes movies released in or before 1991, so only movies released after 1991 are returned.

In [None]:
response = client.search(
   index=index_name,
   query={
        "bool": {
            "must": [{"match": {"plot": {"query": query_text}}}],
            "must_not": [{"range": {"released": {"lte": 1991}}}],
        }
    },
    size=3
)
output = pretty_search_response(response)  # the function already prints the hits


ID: rs9zg5cBAgK7a8pdppKt
Score: 5.1212735
Title: The Dark Knight
Runtime: 152
Plot: The Dark Knight is a 2008 superhero film directed by Christopher Nolan, from a screenplay co-written with his brother Jonathan. Based on the DC Comics superhero Batman, it is the sequel to Batman Begins (2005), and the second installment in The Dark Knight trilogy. The plot follows the vigilante Batman, police lieutenant James Gordon, and district attorney Harvey Dent, who form an alliance to dismantle organized crime in Gotham City. Their efforts are derailed by the Joker, an anarchistic mastermind who seeks to test how far Batman will go to save the city from chaos. The ensemble cast includes Christian Bale, Michael Caine, Heath Ledger, Gary Oldman, Aaron Eckhart, Maggie Gyllenhaal, and Morgan Freeman.Warner Bros. Pictures prioritized a sequel following the successful reinvention of the Batman film series with Batman Begins. Christopher and Batman Begins co-writer David S. Goyer developed the story e
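
The same restriction can also be expressed positively with a `filter` clause, which, like `must_not`, does not contribute to the relevance score. The sketch below builds such a query body; the helper name `filtered_match` is hypothetical. One difference worth noting: `must_not` with `lte` keeps documents that have no `released` value, whereas `filter` with `gt` drops them.

```python
def filtered_match(field, text, released_after):
    """Build a bool query: score on the match clause, restrict with a range filter."""
    return {
        "bool": {
            "must": [{"match": {field: {"query": text}}}],
            "filter": [{"range": {"released": {"gt": released_after}}}],
        }
    }


query_body = filtered_match("plot", "organized crime movies", 1991)
print(query_body)
```

The resulting dictionary could be passed as the `query` argument of `client.search`.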

# Querying Documents with Hybrid Search

Now we need to perform a query using two different search strategies:
- Semantic search using the **plot_semantic** field
- Keyword search using the **plot** field

We then use [Reciprocal Rank Fusion (RRF)](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html) to balance the scores to provide a final list of documents, ranked in order of relevance. RRF is a ranking algorithm for combining results from different information retrieval strategies.

Note: With the retriever API, _score contains the document’s relevance score, and the rank is simply the position in the results (first result is rank 1, etc.).

In [None]:
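# Transport options such as the timeout are set through client.options()
response = client.options(request_timeout=90).search(
    index=index_name,
    size=5,
    retriever={
        "rrf": {
            "retrievers": [
                {
                    # Lexical (keyword) retriever on the plot text
                    "standard": {
                        "query": {
                            "match": {
                                "plot": {"query": query_text},
                            }
                        }
                    }
                },
                {
                    # Semantic retriever on the semantic_text field
                    "standard": {
                        "query": {
                            "match": {
                                "plot_semantic": {"query": query_text},
                            }
                        }
                    }
                },
            ]
        }
    },
)
output = pretty_search_response(response)  # the function already prints the hits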


ID: ts9zg5cBAgK7a8pdppKt
Score: 0.032002047
Title: The Godfather
Runtime: 175
Plot: In 1945, the don of New York City's Corleone family, Vito Corleone, listens to requests during his daughter Connie's wedding to Carlo Rizzi. Vito's youngest son Michael, a Marine and World War II hero who has so far stayed out of the family business, introduces his girlfriend Kay Adams to his family at the reception. Johnny Fontane, a popular singer and Vito's godson, seeks Vito's help in securing a movie role. Vito sends his consigliere, Tom Hagen, to persuade studio president Jack Woltz to offer Johnny the part. Woltz refuses Hagen's request at first, but soon complies after finding the severed head of his prized stud horse in his bed.As Christmas approaches, drug baron Virgil The Turk Sollozzo asks Vito to invest in his narcotics business to provide police protection. Vito declines, citing that involvement in narcotics would alienate his political connections. Suspicious of Sollozzo's partnership wi
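
To see where a fused score such as 0.032002047 comes from, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. Each document receives `1 / (rank_constant + rank)` from every retriever that returned it, and the contributions are summed. The sketch assumes the default `rank_constant` of 60; the document IDs and rankings are hypothetical.

```python
def rrf(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical rankings from the lexical and the semantic retriever
lexical = ["a", "b", "c"]
semantic = ["c", "d", "b"]

fused = rrf([lexical, semantic])
for doc_id, score in fused:
    print(doc_id, round(score, 9))
```

In this toy example document `b` is ranked 2nd by one retriever and 3rd by the other, giving 1/62 + 1/63, which is consistent with the score of the top hit reported above.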


In [None]:
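# Transport options such as the timeout are set through client.options()
response = client.options(request_timeout=90).search(
    index=index_name,
    size=5,
    retriever={
        "rrf": {
            "retrievers": [
                {
                    # Lexical retriever, restricted to movies released after 1991
                    "standard": {
                        "query": {
                            "bool": {
                                "must": [{"match": {"plot": {"query": query_text}}}],
                                "must_not": [{"range": {"released": {"lte": 1991}}}],
                            }
                        }
                    }
                },
                {
                    # Semantic retriever on the semantic_text field (unfiltered)
                    "standard": {
                        "query": {
                            "match": {
                                "plot_semantic": {"query": query_text},
                            }
                        }
                    }
                },
            ]
        }
    },
)
output = pretty_search_response(response)  # the function already prints the hits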


ID: rc9zg5cBAgK7a8pdppKt
Score: 0.031513646
Title: Pulp Fiction
Runtime: 154
Plot: Pulp Fiction is a 1994 American independent crime film written and directed by Quentin Tarantino from a story he conceived with Roger Avary. It tells four intertwining tales of crime and violence in Los Angeles. The film stars John Travolta, Samuel L. Jackson, Bruce Willis, Tim Roth, Ving Rhames, and Uma Thurman. The title refers to the pulp magazines and hardboiled crime novels popular during the mid-20th century, known for their graphic violence and punchy dialogue.Tarantino wrote Pulp Fiction in 1992 and 1993, incorporating scenes that Avary originally wrote for True Romance (1993). Its plot occurs out of chronological order. The film is also self-referential from its opening moments, beginning with a title card that gives two dictionary definitions of pulp. Considerable screen time is devoted to monologues and casual conversations with eclectic dialogue revealing each character's perspectives on sev


# Phase III: Generation and Observability

In [None]:
# @title Observability - Setup
# Import Libraries
import os
import torch
import time
import base64
from threading import Thread
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer

# OpenTelemetry Imports for tracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

In [None]:
# @title Step 3: Core Functions (Setup, Model Loading, Generation)

def setup_opentelemetry_from_env():
    """
    Configures and initializes OpenTelemetry by reading configuration from
    standard environment variables.

    Returns:
        opentelemetry.trace.Tracer: An OpenTelemetry Tracer instance.
    """
    print("Setting up OpenTelemetry exporter from environment variables...")
    # The OTLPSpanExporter will automatically read credentials and endpoint from the environment
    otlp_exporter = OTLPSpanExporter()

    # The Resource is also created automatically from the OTEL_RESOURCE_ATTRIBUTES env var
    # We pass an empty Resource.create() to initialize it from the environment
    trace.set_tracer_provider(TracerProvider(resource=Resource.create()))
    tracer_provider = trace.get_tracer_provider()

    # Use a BatchSpanProcessor to send spans in batches
    span_processor = BatchSpanProcessor(otlp_exporter)
    tracer_provider.add_span_processor(span_processor)

    print("OpenTelemetry setup complete.")
    # Return a tracer instance for use in the application
    return trace.get_tracer(__name__)

def load_huggingface_model(model_id):
    """
    Loads a Hugging Face model and its tokenizer without quantization.

    Args:
        model_id (str): The identifier of the model on the Hugging Face Hub.

    Returns:
        tuple: A tuple containing the loaded model and tokenizer.
    """
    print(f"Loading tokenizer for model: {model_id}")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    print(f"Loading model: {model_id} (This may take a few minutes)...")
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    print("Model loaded successfully.")
    return model, tokenizer

def generate_and_trace(tracer, model, tokenizer, prompt_text):
    """
    Generates text from a prompt, traces the operation, and captures detailed
    performance metrics (Latency, TTFT, TPOT).

    Args:
        tracer (opentelemetry.trace.Tracer): The OTel tracer instance.
        model: The loaded Hugging Face model.
        tokenizer: The loaded tokenizer.
        prompt_text (str): The input text to provide to the model.

    Returns:
        str: The generated text response from the model.
    """
    with tracer.start_as_current_span("llm-generation") as span:
        print(f"\n--- Starting generation for prompt: '{prompt_text}' ---")
        span.set_attribute("llm.prompt", prompt_text)
        span.set_attribute("llm.model_id", model.config.name_or_path)

        # 1. Prepare inputs and streamer
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt_text}
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
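        # skip_special_tokens keeps markers such as <|im_end|> out of the streamed text
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)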

        # 2. Run generation in a separate thread to enable streaming.
        # The clock starts before the thread is launched so that TTFT includes
        # the time spent processing the prompt.
        generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=256)
        thread = Thread(target=model.generate, kwargs=generation_kwargs)
        start_time = time.time()
        thread.start()

        # 3. Measure metrics while iterating over the streamed output
        first_token_time = None
        output_tokens = 0
        response_text = ""

        print("Response: ", end="", flush=True)
        for new_text in streamer:
            if first_token_time is None:
                first_token_time = time.time()
                ttft = first_token_time - start_time
                span.add_event("First Token Received", attributes={"time_to_first_token_ms": ttft * 1000})
                span.set_attribute("llm.metrics.ttft_ms", round(ttft * 1000, 2))

            print(new_text, end="", flush=True)
            response_text += new_text
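            # Note: the streamer yields decoded text chunks (roughly one per word), not
            # individual tokens, so this counts chunks and typically undercounts tokens.
            # Re-tokenizing response_text after the loop would give an exact count.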
            output_tokens += 1

        thread.join()
        end_time = time.time()
        print("\n--- Generation Complete ---")

        # 4. Calculate final metrics and set them as span attributes
        total_latency = end_time - start_time
        span.set_attribute("llm.metrics.latency_ms", round(total_latency * 1000, 2))
        span.set_attribute("llm.response", response_text)
        span.set_attribute("llm.metrics.output_tokens", output_tokens)

        if output_tokens > 1 and first_token_time is not None:
            tpot = (end_time - first_token_time) / (output_tokens - 1)
            span.set_attribute("llm.metrics.tpot_ms", round(tpot * 1000, 2))
            print(f"\n[Metrics] TPOT: {tpot*1000:.2f} ms per token")

        print(f"[Metrics] TTFT: {ttft*1000:.2f} ms" if first_token_time else "[Metrics] TTFT: N/A")
        print(f"[Metrics] Total Latency: {total_latency*1000:.2f} ms")
        print(f"[Metrics] Output Tokens: {output_tokens}")

        return response_text
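
The metric arithmetic in `generate_and_trace` can be checked in isolation, without loading a model. The following is a minimal sketch: the `StreamingMetrics` dataclass and the `compute_streaming_metrics` helper are hypothetical names introduced for illustration and are not part of the workshop code. It assumes timestamps in seconds and uses the same definitions as the function above: TTFT is the delay until the first streamed chunk, and TPOT averages the remaining time over the remaining chunks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamingMetrics:
    latency_ms: float
    ttft_ms: Optional[float]
    tpot_ms: Optional[float]


def compute_streaming_metrics(start, first_token, end, output_tokens):
    """Derive latency, TTFT and TPOT (all in ms) from raw timestamps in seconds."""
    latency_ms = (end - start) * 1000
    ttft_ms = None
    tpot_ms = None
    if first_token is not None:
        ttft_ms = (first_token - start) * 1000
        # TPOT is only defined when at least one chunk follows the first one.
        if output_tokens > 1:
            tpot_ms = (end - first_token) / (output_tokens - 1) * 1000
    return StreamingMetrics(latency_ms, ttft_ms, tpot_ms)


# Illustrative timestamps, not measurements from the workshop run.
metrics = compute_streaming_metrics(start=0.0, first_token=0.25, end=2.25, output_tokens=5)
print(metrics)
```

Returning `None` for a metric that cannot be computed (no chunk received, or only one chunk) mirrors the guards in the original function and avoids a division by zero.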

In [None]:
def main(prompt):
    """
    Main function to orchestrate the setup, model loading, and generation.
    """
    # --- Configuration via Environment Variables ---
    # IMPORTANT: The empty strings below are placeholders. Replace them with the
    # OTLP endpoint and the authorization header for your Elastic deployment.
    os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = ""
    os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = ""
    os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "service.name=tmls-2025-workshop-app,service.version=1.0,deployment.environment=test"

    # --- Static Configuration ---
    MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

    # --- Execution ---
    tracer = setup_opentelemetry_from_env()
    model, tokenizer = load_huggingface_model(MODEL_ID)
    response = generate_and_trace(tracer, model, tokenizer, prompt)

    print("\n---")
    print("Trace data has been sent to your Elasticsearch cluster.")
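
`OTEL_RESOURCE_ATTRIBUTES` and `OTEL_EXPORTER_OTLP_HEADERS` both take a comma-separated list of `key=value` pairs. The sketch below only illustrates that format. `parse_kv_list` is a hypothetical helper written for this example; it is not the parser the OpenTelemetry SDK uses, and it ignores details such as percent-encoding.

```python
def parse_kv_list(raw):
    """Split a 'k1=v1,k2=v2' string into a dict, skipping empty or malformed items."""
    pairs = {}
    for item in raw.split(","):
        key, sep, value = item.partition("=")
        # Keep only items that contain '=' and have a non-empty key.
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


resource_attributes = parse_kv_list(
    "service.name=tmls-2025-workshop-app,service.version=1.0,deployment.environment=test"
)
print(resource_attributes)
```

With this helper, an empty string such as the placeholder values in `main` yields no pairs at all, which would correspond to sending no authorization header.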

In [None]:
llm_prompts = [
    "Write a short story that begins with the sentence: \"The antique map was blank, except for a single, ominous-looking symbol in the center.\"",
    "Generate five catchy slogans for a new brand of sparkling water infused with botanicals.",
    "Write a short, witty dialogue for a scene where a time traveler tries to explain a smartphone to a medieval blacksmith.",
    "Compose a poem about the feeling of the first crisp autumn day in a bustling city.",
    "Create a brief plot summary for the film Inception, which involves a thief who steals information by entering people's dreams. ",
    "Summarize the key differences between nuclear fission and nuclear fusion in a simple, easy-to-understand paragraph.",
    "Create a table that compares the pros and cons of working from home versus working in a traditional office environment.",
    "Explain the concept of Artificial Intelligence to a 10-year-old.",
    "Describe the historical significance of the Silk Road and its impact on trade and culture.",
    "List the main duties of a character in the film The Godfather who acts as a `consigliere`, or advisor, for a crime family. ",
    "Draft a professional but friendly email to a new team member, welcoming them to the company and outlining their first-day schedule.",
    "I have chicken breast, broccoli, rice, and soy sauce. Create a simple and healthy recipe I can make for dinner tonight.",
    "Outline a 3-day travel itinerary for a first-time visitor to Toronto, including a mix of popular attractions and local experiences.",
    "I need to ask my boss for a raise. Write a short script for me to use as a starting point for the conversation, focusing on my accomplishments from the past year.",
    "Create a workout plan for someone who wants to start exercising, focusing on 30-minute bodyweight routines they can do at home three times a week.",
    "Write a Python function that takes a string as input and returns `True` if it's a palindrome and `False` if it's not.",
    "Explain what this line of SQL code does: `SELECT ProductName, Price FROM Products WHERE CategoryID = 1 ORDER BY Price DESC;`",
    "A farmer has to cross a river with a fox, a chicken, and a sack of grain. The boat can only hold the farmer and one other item. If left alone, the fox will eat the chicken, and the chicken will eat the grain. How does the farmer get everything across the river safely?",
    "Debug this simple Javascript code snippet that is supposed to change the text of a button when clicked: `let button = document.getElementById(\"myButton\"); button.onClick = function() { document.getElementById(\"myButton\").text = \"Clicked!\"; }`",
    "You are a senior software developer. A junior developer is struggling to choose between using a SQL or NoSQL database for a new social media application. Explain the key considerations they should take into account to make the right decision."
]

[main(prompt) for prompt in llm_prompts[:5]]



Setting up OpenTelemetry exporter from environment variables...
OpenTelemetry setup complete.
Loading tokenizer for model: Qwen/Qwen2.5-1.5B-Instruct
Loading model: Qwen/Qwen2.5-1.5B-Instruct (This may take a few minutes)...
Model loaded successfully.

--- Starting generation for prompt: 'Write a short story that begins with the sentence: "The antique map was blank, except for a single, ominous-looking symbol in the center."' ---
Response: The antique map was blank, except for a single, ominous-looking symbol in the center. It was an old atlas, made of thick leather and covered in faded wood grain. The symbols on it were meant to guide travelers through unknown territories, but this one seemed different.

As I picked up the map, my fingers traced over its worn surface, trying to make sense of what lay hidden within those lines and dots. But no matter how hard I looked, there wasn't anything else there - just that one strange symbol.

I couldn't shake the feeling that something terrible



OpenTelemetry setup complete.
Loading tokenizer for model: Qwen/Qwen2.5-1.5B-Instruct
Loading model: Qwen/Qwen2.5-1.5B-Instruct (This may take a few minutes)...
Model loaded successfully.

--- Starting generation for prompt: 'Generate five catchy slogans for a new brand of sparkling water infused with botanicals.' ---
Response: 1. "Sparkling Life, Naturally Botanical"
2. "Drink the Earth's Refreshing Spritzes"
3. "Botanical Sparkle: Purely Packed with Nature"
4. "Fresh from the Source: A Burst of Botanical Bliss"
5. "Nature's Flavors in Every Sip"<|im_end|>
--- Generation Complete ---

[Metrics] TPOT: 41.56 ms per token
[Metrics] TTFT: 232.83 ms
[Metrics] Total Latency: 3017.31 ms
[Metrics] Output Tokens: 68

---
Trace data has been sent to your Elasticsearch cluster.
Setting up OpenTelemetry exporter from environment variables...




OpenTelemetry setup complete.
Loading tokenizer for model: Qwen/Qwen2.5-1.5B-Instruct
Loading model: Qwen/Qwen2.5-1.5B-Instruct (This may take a few minutes)...
Model loaded successfully.

--- Starting generation for prompt: 'Write a short, witty dialogue for a scene where a time traveler tries to explain a smartphone to a medieval blacksmith.' ---
Response: Time traveler: "So, you've got this thing that can take pictures and videos? And it has all these apps that let you do anything from playing games to watching movies?"

Medieval blacksmith: "Yes, I have such an amazing tool! It's called a hammer."

Time traveler: "A what?!"

Medieval blacksmith: "A hammer! You know, the one we use to shape metal into tools and weapons."

Time traveler: "Oh, right! So how does it work?"

Medieval blacksmith: "Well, when you hit the head of the hammer on the anvil, it creates heat which melts the metal until it becomes soft enough to be shaped by hand."

Time traveler: "Wow, that sounds pretty impres



OpenTelemetry setup complete.
Loading tokenizer for model: Qwen/Qwen2.5-1.5B-Instruct
Loading model: Qwen/Qwen2.5-1.5B-Instruct (This may take a few minutes)...
Model loaded successfully.

--- Starting generation for prompt: 'Compose a poem about the feeling of the first crisp autumn day in a bustling city.' ---
Response: In the heart of the city, where the crowds hum,
A crisp autumn day breaks through the gloom.
The trees stand tall and proud, their leaves now fallen,
But still they whisper secrets to the sun.

The air is cool as it brushes past my face,
As if nature herself has turned her hair.
And everywhere I look, there's a sense of peace,
As if all the worries have taken flight.

I step outside into the brisk, clear sky,
Where the world seems to pause for just a while.
The sound of footsteps echoes down the street,
As people hurry by with smiles on their faces.

The city lights flicker like distant stars,
Reflecting off the wet pavement stones.
But even though the world seems so 



Setting up OpenTelemetry exporter from environment variables...
OpenTelemetry setup complete.
Loading tokenizer for model: Qwen/Qwen2.5-1.5B-Instruct
Loading model: Qwen/Qwen2.5-1.5B-Instruct (This may take a few minutes)...
Model loaded successfully.

--- Starting generation for prompt: 'Create a brief plot summary for the film Inception, which involves a thief who steals information by entering people's dreams. ' ---
Response: Inception is an action-thriller that follows a skilled thief named Cobb (Leonardo DiCaprio) who specializes in stealing information from other people's minds through a technique called "inception." He uses this ability to steal secrets and gain access to valuable information.

However, things take a dark turn when Cobb becomes entangled in a criminal conspiracy led by his former partner and mentor, Arthur (Joseph Gordon-Levitt). The group plans to use Cobb's skills to infiltrate the mind of a powerful businessman, Leonardo (Tom Hardy), in order to steal one of 

[None, None, None, None, None]
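
The log above shows the tokenizer and model being loaded again for every prompt, because `main` performs setup and generation in a single call. One possible restructuring, sketched below with stand-in functions, performs the setup once and reuses the result for every prompt. `run_prompts`, `fake_setup`, and `fake_generate` are hypothetical names used only for this illustration; in the notebook the setup step would correspond to `setup_opentelemetry_from_env` plus `load_huggingface_model`, and the generate step to `generate_and_trace`.

```python
def run_prompts(prompts, setup, generate):
    """Run setup() once, then call generate(resources, prompt) for each prompt."""
    resources = setup()
    return [generate(resources, prompt) for prompt in prompts]


# Stand-ins so the sketch runs without a model or an OTLP endpoint.
setup_calls = []


def fake_setup():
    setup_calls.append(1)
    return {"model": "stub-model"}


def fake_generate(resources, prompt):
    return f"{resources['model']} answered: {prompt}"


responses = run_prompts(["first prompt", "second prompt"], fake_setup, fake_generate)
print(len(setup_calls), responses)
```

Collecting the return values also gives the caller the generated responses, rather than a list of `None` values.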