# Building A RAG System with Gemma, Elasticsearch and Huggingface Models

<a target="_blank" href="https://colab.research.google.com/github/lloydmeta/huggingface_elasticsearch_rag/blob/main/rag_with_hugging_face_gemma_elasticsearch.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


Authored By: [lloydmeta](https://huggingface.co/lloydmeta)

This notebook walks you through building a Retrieval-Augmented Generation (RAG) system powered by Elasticsearch (ES) and Huggingface models, letting you toggle between ES-vectorising and self-vectorising.

**Note**: this notebook has been tested with ES 8.12.2.

## Step 0: Installing Libraries


In [None]:
!pip install datasets elasticsearch sentence_transformers transformers eland accelerate

# Step 1: Set up

## Credentials

### Huggingface
This allows you to authenticate with Huggingface to download models and datasets.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


### Elasticsearch deployment

Let's make sure that you can access your Elasticsearch deployment. If you don't have one, create one at [Elastic Cloud](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-a-cloud-deployment).

In [16]:
from google.colab import userdata

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
CLOUD_ID = userdata.get("CLOUD_ID") # or "<YOUR CLOUD_ID>"

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = userdata.get("ELASTIC_DEPL_API_KEY")  # or "<YOUR API KEY>"

Set up the client and make sure the credentials work.

In [17]:
from elasticsearch import Elasticsearch, helpers

# Create the client instance
client = Elasticsearch(cloud_id=CLOUD_ID, api_key=ELASTIC_API_KEY)

# Successful response!
client.info()

ObjectApiResponse({'name': 'instance-0000000000', 'cluster_name': '055e53ef68e14c76a4aa086c01ad96d8', 'cluster_uuid': '_aNdwoebQBiikRK6DS7a4g', 'version': {'number': '8.12.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '48a287ab9497e852de30327444b0809e55d46466', 'build_date': '2024-02-19T10:04:32.774273190Z', 'build_snapshot': False, 'lucene_version': '9.9.2', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

### Choose data and query vectorisation options

Here, you need to make a decision: do you want Elasticsearch to vectorise your data and queries, or do you want to do it yourself?

Setting `USE_ELASTICSEARCH_VECTORISATION` to `True` will make the rest of this notebook set up and use ES-hosted vectorisation for your data and your querying, but **BE AWARE** that this requires your ES deployment to have at least 1 ML node (I would recommend enabling autoscaling on your Cloud deployment in case the model you chose is too big).

If `USE_ELASTICSEARCH_VECTORISATION` is `False`, this notebook will set up and use the provided model "locally" for data and query vectorisation.

What should you use for your use case? *It depends* 🤷‍♂️. Running vectorisation on ES means your clients don't have to implement it, so that's the default here; however, if you don't have any ML nodes, or your own embedding setup is better/faster, feel free to toggle it to `False`!

**Note**: if you change these values, you'll likely need to re-run the notebook from this step.

In [18]:
USE_ELASTICSEARCH_VECTORISATION = True

EMBEDDING_MODEL_ID = "thenlper/gte-large"
# The model page at https://huggingface.co/thenlper/gte-large shows the dimensions of the model.
# If you use the `gte-base` or `gte-small` embedding models, EMBEDDING_DIMENSIONS
# (used as `dims` in the index mapping) must be set to 768 or 384, respectively.
EMBEDDING_DIMENSIONS = 1024
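
The model ID and the dimension count have to agree, otherwise the index mapping created later will not match the vectors. Here is a minimal sketch of a lookup that keeps the two in sync, using only the three sizes mentioned in the comment above. The `KNOWN_DIMENSIONS` table and the `dimensions_for` helper are hypothetical names introduced for illustration; they are not part of any library.

```python
# Hypothetical lookup table: embedding sizes for the gte family,
# taken from the sizes quoted in the notebook's own comment.
KNOWN_DIMENSIONS = {
    "thenlper/gte-large": 1024,
    "thenlper/gte-base": 768,
    "thenlper/gte-small": 384,
}


def dimensions_for(model_id: str) -> int:
    """Return the embedding size for a known model, or fail loudly."""
    try:
        return KNOWN_DIMENSIONS[model_id]
    except KeyError:
        raise ValueError(
            f"Unknown embedding size for {model_id!r}; check its model page"
        ) from None


print(dimensions_for("thenlper/gte-large"))  # 1024
```

Failing loudly on an unknown model is a deliberate choice here: a silently wrong `dims` value would only surface later, when documents are ingested.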

## Step 2: Data sourcing and preparation


The data utilised in this tutorial is sourced from Hugging Face datasets, specifically the
[MongoDB/embedded_movies dataset](https://huggingface.co/datasets/MongoDB/embedded_movies).

In [6]:
# Load Dataset
from datasets import load_dataset

# https://huggingface.co/datasets/MongoDB/embedded_movies
dataset = load_dataset("MongoDB/embedded_movies")

dataset

Downloading readme:   0%|          | 0.00/6.18k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/42.3M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['fullplot', 'genres', 'title', 'countries', 'languages', 'plot_embedding', 'cast', 'imdb', 'type', 'num_mflix_comments', 'awards', 'runtime', 'poster', 'plot', 'writers', 'rated', 'directors', 'metacritic'],
        num_rows: 1500
    })
})

The operations in the following code snippet focus on enforcing data integrity and quality.
1. The first step ensures that each data point's `fullplot` attribute is not empty, as this is the primary data we use in the embedding process.
2. The second step removes the `plot_embedding` attribute from all data points, as it will be replaced by new embeddings created with a different embedding model, `gte-large`.

In [7]:
# Data Preparation

# Remove data points where the fullplot column is missing
dataset = dataset.filter(lambda x: x["fullplot"] is not None)

if "plot_embedding" in sum(dataset.column_names.values(), []):
    # Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face
    dataset = dataset.remove_columns("plot_embedding")

dataset["train"]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

Dataset({
    features: ['fullplot', 'genres', 'title', 'countries', 'languages', 'cast', 'imdb', 'type', 'num_mflix_comments', 'awards', 'runtime', 'poster', 'plot', 'writers', 'rated', 'directors', 'metacritic'],
    num_rows: 1452
})
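
The filter above only drops rows whose `fullplot` is `None`. As a minimal sketch, a slightly stricter predicate could also reject empty or whitespace-only strings, which the local embedding function later treats as empty text. The `has_fullplot` name is hypothetical, and plain dictionaries stand in for dataset rows here.

```python
def has_fullplot(row: dict) -> bool:
    """True when the row carries a non-empty, non-whitespace `fullplot` string."""
    text = row.get("fullplot")
    return isinstance(text, str) and bool(text.strip())


# Plain dicts standing in for dataset rows (illustrative data only).
rows = [
    {"fullplot": "A heist goes wrong."},
    {"fullplot": None},
    {"fullplot": "   "},
    {"title": "no plot at all"},
]
kept = [row for row in rows if has_fullplot(row)]
print(len(kept))  # 1
```

Assuming `dataset.filter` hands each example to the callable as a dict-like mapping, the same predicate could be passed in place of the lambda.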

## Step 3: Embeddings for your data




### Load Huggingface model into Elasticsearch if needed

This step loads and deploys the Huggingface model into Elasticsearch using [Eland](https://eland.readthedocs.io/en/v8.12.1/) if `USE_ELASTICSEARCH_VECTORISATION` is `True`. This allows Elasticsearch to vectorise your queries and data in later steps.

In [19]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!(if [ "True" == $USE_ELASTICSEARCH_VECTORISATION ]; then \
  eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id $EMBEDDING_MODEL_ID --task-type text_embedding --es-api-key $ELASTIC_API_KEY --start --clear-previous; \
fi)

2024-03-22 12:58:55,699 INFO : Establishing connection to Elasticsearch
2024-03-22 12:58:55,808 INFO : Connected to cluster named '055e53ef68e14c76a4aa086c01ad96d8' (version: 8.12.2)
2024-03-22 12:58:55,809 INFO : Loading HuggingFace transformer tokenizer and model 'thenlper/gte-large'
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
STAGE:2024-03-22 12:59:03 9545:9545 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-03-22 12:59:10 9545:9545 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-22 12:59:10 9545:9545 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2024-03-22 12:59:22,593 INFO : Creating model with id 'thenlper__gte-large'
2024-03-22 12:59:22,859 INFO : Uploading model definition
100% 1275/1275 [07:19<00:00,  2.90 parts/s]
2024-03-22 13:06:41,933 INFO : Uploading model vocabulary
2024-03-22 13:06:42,373 INFO : Starting mo

This step adds functions for creating embeddings for text locally and enriches the dataset with embeddings, so that the data can be ingested into Elasticsearch as vectors. It does not run if `USE_ELASTICSEARCH_VECTORISATION` is `True`.

In [30]:
from sentence_transformers import SentenceTransformer

if not USE_ELASTICSEARCH_VECTORISATION:
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_ID)


def get_embedding(text: str) -> list[float]:
    if USE_ELASTICSEARCH_VECTORISATION:
        raise Exception(
            f"Disabled when USE_ELASTICSEARCH_VECTORISATION is [{USE_ELASTICSEARCH_VECTORISATION}]"
        )
    else:
        if not text.strip():
            print("Attempted to get embedding for empty text.")
            return []

        embedding = embedding_model.encode(text)
        return embedding.tolist()


def add_fullplot_embedding(x):
    if USE_ELASTICSEARCH_VECTORISATION:
        raise Exception(
            f"Disabled when USE_ELASTICSEARCH_VECTORISATION is [{USE_ELASTICSEARCH_VECTORISATION}]"
        )
    else:
        full_plots = x["fullplot"]
        return {"embedding": [get_embedding(full_plot) for full_plot in full_plots]}


if not USE_ELASTICSEARCH_VECTORISATION:
    dataset = dataset.map(add_fullplot_embedding, batched=True)
    dataset["train"]



## Step 4: Create a Search Index with vector search mappings

At this point, we create an index in Elasticsearch with the right index mappings to handle vector searches.

Go here to read more about [Elasticsearch vector capabilities](https://www.elastic.co/what-is/vector-search).

In [21]:
# Needs to match the id returned from Eland
# in general for Huggingface models, you just replace the forward slash with
# double underscore
model_id = EMBEDDING_MODEL_ID.replace("/", "__")

INDEX_NAME = "movies"

index_mapping = {
    "properties": {
        "fullplot": {"type": "text"},
        "plot": {"type": "text"},
        "title": {"type": "text"},
    }
}
# define index mapping
if USE_ELASTICSEARCH_VECTORISATION:
    index_mapping["properties"]["embedding"] = {
        "properties": {
            "is_truncated": {"type": "boolean"},
            "model_id": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "predicted_value": {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMENSIONS,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
else:
    index_mapping["properties"]["embedding"] = {
        "type": "dense_vector",
        "dims": EMBEDDING_DIMENSIONS,
        "index": True,
        "similarity": "cosine",
    }

# flag to check if index has to be deleted before creating
should_delete_index = True

# check if we want to delete index before creating the index
if should_delete_index:
    if client.indices.exists(index=INDEX_NAME):
        print("Deleting existing %s" % INDEX_NAME)
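        # Passing `ignore=` per request is deprecated in the 8.x client; use options() instead
        client.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)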

print("Creating index %s" % INDEX_NAME)


# ingest pipeline definition
if USE_ELASTICSEARCH_VECTORISATION:
    PIPELINE_ID = "vectorize_fullplots"

    client.ingest.put_pipeline(
        id=PIPELINE_ID,
        processors=[
            {
                "inference": {
                    "model_id": model_id,
                    "target_field": "embedding",
                    "field_map": {"fullplot": "text_field"},
                }
            }
        ],
    )

    INDEX_SETTINGS = {
        "index": {
            "default_pipeline": PIPELINE_ID,
        }
    }
else:
    INDEX_SETTINGS = {}

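# Passing `ignore=` per request is deprecated in the 8.x client; use options() instead
client.options(ignore_status=[400, 404]).indices.create(
    index=INDEX_NAME, mappings=index_mapping, settings=INDEX_SETTINGS
)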

Creating index movies




ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'movies'})

Ingesting data into Elasticsearch is best done in batches. Luckily, `helpers` offers an easy way to do this.

In [25]:
from elasticsearch.helpers import BulkIndexError


def batch_to_bulk_actions(batch):
    for record in batch:
        action = {
            "_index": "movies",
            "_source": {
                "title": record["title"],
                "fullplot": record["fullplot"],
                "plot": record["plot"],
            },
        }
        if not USE_ELASTICSEARCH_VECTORISATION:
            action["_source"]["embedding"] = record["embedding"]
        yield action


def bulk_index(ds):
    start = 0
    end = len(ds)
    batch_size = 100
    if USE_ELASTICSEARCH_VECTORISATION:
        # If using auto-embedding, bulk requests can take a lot longer,
        # so pass a longer request_timeout here (defaults to 10s), otherwise
        # we could get Connection timeouts
        batch_client = client.options(request_timeout=600)
    else:
        batch_client = client
    for batch_start in range(start, end, batch_size):
        batch_end = min(batch_start + batch_size, end)
        print(f"batch: start [{batch_start}], end [{batch_end}]")
        batch = ds.select(range(batch_start, batch_end))
        actions = batch_to_bulk_actions(batch)
        helpers.bulk(batch_client, actions)


try:
    bulk_index(dataset["train"])
except BulkIndexError as e:
    print(f"{e.errors}")
else:
    # Only report success when no bulk indexing error was raised
    print("Data ingestion into Elasticsearch complete!")

batch: start [0], end [100]
batch: start [100], end [200]
batch: start [200], end [300]
batch: start [300], end [400]
batch: start [400], end [500]
batch: start [500], end [600]
batch: start [600], end [700]
batch: start [700], end [800]
batch: start [800], end [900]
batch: start [900], end [1000]
batch: start [1000], end [1100]
batch: start [1100], end [1200]
batch: start [1200], end [1300]
batch: start [1300], end [1400]
batch: start [1400], end [1452]
Data ingestion into Elasticsearch complete!
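
The batch boundaries printed above follow directly from stepping through the dataset in fixed-size slices and clamping the last one. A minimal sketch of just that arithmetic, with no Elasticsearch calls involved (`batch_bounds` is a hypothetical helper name):

```python
def batch_bounds(total: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) pairs covering range(total) in slices of batch_size."""
    return [
        (start, min(start + batch_size, total))
        for start in range(0, total, batch_size)
    ]


bounds = batch_bounds(1452, 100)
print(len(bounds))  # 15
print(bounds[-1])   # (1400, 1452)
```

With 1452 rows and a batch size of 100, this gives 15 batches, the last of which holds the remaining 52 rows.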


## Step 5: Perform Vector Search on User Queries

The following step implements a function that returns a vector search result by generating a query that contains an embedded form of your text query.

In [26]:
def vector_search(plot_query):
    if USE_ELASTICSEARCH_VECTORISATION:
        knn = {
            "field": "embedding.predicted_value",
            "k": 10,
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": model_id,
                    "model_text": plot_query,
                }
            },
            "num_candidates": 150,
        }
    else:
        question_embedding = get_embedding(plot_query)
        knn = {
            "field": "embedding",
            "query_vector": question_embedding,
            "k": 10,
            "num_candidates": 150,
        }

    response = client.search(index=INDEX_NAME, knn=knn, size=5)
    results = []
    for hit in response["hits"]["hits"]:
        id = hit["_id"]
        score = hit["_score"]
        title = hit["_source"]["title"]
        plot = hit["_source"]["plot"]
        fullplot = hit["_source"]["fullplot"]
        result = {
            "id": id,
            "_score": score,
            "title": title,
            "plot": plot,
            "fullplot": fullplot,
        }
        results.append(result)
    return results

def pretty_search(query):

    get_knowledge = vector_search(query)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\n"

    return search_result
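
To build intuition for what the `knn` clause asks for, here is a minimal sketch of an exact, brute-force nearest-neighbour search using cosine similarity in plain Python. This is an illustration only: Elasticsearch performs an approximate search over `num_candidates` candidates rather than scanning every vector, and the function names and toy two-dimensional vectors below are made up for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_knn(query: list[float], docs: dict[str, list[float]], k: int):
    """Score every document against the query and return the k most similar."""
    scored = [(cosine(query, vector), doc_id) for doc_id, vector in docs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy vectors standing in for plot embeddings (illustrative data only).
docs = {
    "romance": [1.0, 0.0],
    "war": [0.0, 1.0],
    "romantic-war": [1.0, 1.0],
}
top = brute_force_knn([1.0, 0.2], docs, k=2)
print([doc_id for doc_id, _ in top])  # ['romance', 'romantic-war']
```

The query vector points mostly along the first axis, so the document aligned with that axis ranks first and the mixed one second.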

## Step 6: Handling user queries and loading Gemma


In [27]:
# Conduct query with retrieval of sources, combining results into something that
# we can feed to Gemma
def combined_query(query):
    source_information = pretty_search(query)
    return f"Query: {query}\nContinue to answer the query by using these Search Results:\n{source_information}."


query = "What is the best romantic movie to watch and why?"
combined_results = combined_query(query)

print(combined_results)

Query: What is the best romantic movie to watch and why?
Continue to answer the query by using these Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you kn

Load our LLM

In [28]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Define a method that fetches formatted results from a vectorised search in ES, then feeds them to the LLM to get our results.

In [29]:
def rag_query(query):
    combined_information = combined_query(query)

    # Move tensors to whichever device the model was loaded on (GPU or CPU)
    input_ids = tokenizer(combined_information, return_tensors="pt").to(model.device)
    response = model.generate(**input_ids, max_new_tokens=700)

    return tokenizer.decode(response[0], skip_special_tokens=True)


print(rag_query("What's a romantic movie that I can watch with my wife?"))

Query: What's a romantic movie that I can watch with my wife?
Continue to answer the query by using these Search Results:
Title: King Solomon's Mines, Plot: Guide Allan Quatermain helps a young lady (Beth) find her lost husband somewhere in Africa. It's a spectacular adventure story with romance, because while they fight with wild animals and cannibals, they fall in love. Will they find the lost husband and finish the nice connection?
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy 
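
The printed output starts with the prompt itself because a causal language model's `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of keeping only the continuation, with plain lists standing in for token tensors (`continuation_ids` is a hypothetical helper, and the token IDs are made up):

```python
def continuation_ids(output_ids, prompt_length: int) -> list:
    """Drop the first `prompt_length` ids, which simply echo the prompt."""
    return list(output_ids[prompt_length:])


# Made-up token ids: the model output repeats the prompt, then continues.
prompt = [2, 10, 11, 12]
generated = prompt + [40, 41, 1]
print(continuation_ids(generated, len(prompt)))  # [40, 41, 1]
```

In `rag_query`, the equivalent would presumably be slicing `response[0]` by the prompt length, `input_ids["input_ids"].shape[1]`, before calling `tokenizer.decode`, so that only the model's answer is returned.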

## Credits

This notebook was adapted from:
* [MongoDB's RAG cookbook](https://huggingface.co/learn/cookbook/rag_with_hugging_face_gemma_mongodb)
* OpenAI's [ES RAG cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/elasticsearch/elasticsearch-retrieval-augmented-generation.ipynb)
* Elasticsearch-labs' [loading-model-from-hugging-face cookbook](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/integrations/hugging-face/loading-model-from-hugging-face.ipynb)