# RAG: Fundamentals and Advanced Techniques

 > Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

In this post we are going to see what the `RAG` (`Retrieval Augmented Generation`) technique consists of and how it can be implemented in a language model.

To keep this free, instead of using an OpenAI account (as you'll see in most tutorials) we're going to use Hugging Face's `Inference API`, which has a free tier of 1000 requests per day, more than enough for this post.

## Setting up Hugging Face's `Inference API`

To use Hugging Face's `Inference API`, you first need a Hugging Face account. Once you have one, go to [Access tokens](https://huggingface.co/settings/keys) in your profile settings and generate a new token.

We need to give it a name (in my case, `rag-fundamentals`) and enable the `Make calls to serverless Inference API` permission. This creates a token, which we need to copy.

To manage the token, we create a file called `.env` in our working directory and put the copied token in it like this:

``` bash
RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN="hf_...."
```

Now, to read the token we need `dotenv` installed, which we install via

```bash
pip install python-dotenv
```

And we run the following

In [1]:
import os
import dotenv

dotenv.load_dotenv()

RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN = os.getenv("RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN")
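
`os.getenv` returns `None` when the variable is missing, and the mistake would only surface later as an authentication error. As a minimal sketch (the helper name `require_env` is hypothetical, not part of `dotenv`), we could fail early with a clearer message:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Hypothetical usage, after dotenv.load_dotenv() has run:
# token = require_env("RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN")
```

This is optional; the rest of the post works the same with the plain `os.getenv` call.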

Now that we have a token, we create a client. For this we need the `huggingface_hub` library, which we install with conda or pip

``` bash
conda install -c conda-forge huggingface_hub
```

or

``` bash
pip install --upgrade huggingface_hub
```

Now we have to choose which model to use. You can see the available models on the [Supported models](https://huggingface.co/docs/api-inference/supported-models) page of the Hugging Face `Inference API` documentation.

At the time of writing, the best one available is `Qwen2.5-72B-Instruct`, so we will use that model.

In [2]:
MODEL = "Qwen/Qwen2.5-72B-Instruct"

Now we can create the client

In [3]:
from huggingface_hub import InferenceClient

client = InferenceClient(api_key=RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN, model=MODEL)
client

<InferenceClient(model='Qwen/Qwen2.5-72B-Instruct', timeout=None)>

We run a quick test to see if it works

In [4]:
message = [
	{ "role": "user", "content": "Hola, qué tal?" }
]

stream = client.chat.completions.create(
	messages=message, 
	temperature=0.5,
	max_tokens=1024,
	top_p=0.7,
	stream=False
)

response = stream.choices[0].message.content
print(response)

¡Hola! Estoy bien, gracias por preguntar. ¿Cómo estás tú? ¿En qué puedo ayudarte hoy?


## What is `RAG`?

`RAG` stands for `Retrieval Augmented Generation`. It is a technique for answering questions using information retrieved from documents. Although LLMs can be very powerful and have a lot of knowledge, they cannot answer questions about private documents they were never trained on, such as your company's reports or internal documentation. `RAG` was created so that these LLMs can be used on that private documentation.

![What is RAG?](https://pub-fb664c455eca46a2ba762a065ac900f7.r2.dev/RAG.webp)

The idea is that a user asks a question about that private documentation, the system retrieves the part of the documentation that contains the answer, the question and that part of the documentation are passed to an LLM, and the LLM generates the answer for the user.

### How is information stored?

LLMs have a limit on the amount of information that can be passed to them, called the context window. This is due to the internal architecture of LLMs, which is not relevant here. The important thing is that you can't simply pass a whole document plus a question, because the LLM may not be able to process all that information.

When more information is passed than the context window allows, what usually happens is that the LLM does not pay attention to the end of the input. Imagine you ask the LLM something about your document, the relevant information is at the end of the document, and the LLM never reads it.

Therefore, the documentation is divided into blocks called `chunk`s, so it is stored as a collection of `chunk`s, each one a piece of that documentation. When the user asks a question, the `chunk` that contains the answer is passed to the LLM.

In addition to dividing the documentation into `chunk`s, these are converted to embeddings, which are numerical representations of the `chunk`s. This is done because models work with numbers rather than text, and a numerical representation lets us compare a `chunk` with a question mathematically. If you want to learn more about embeddings, you can read my post about [transformers](https://www.maximofn.com/transformers), in which I explain how transformers, the architecture behind LLMs, work. You can also read my post about [ChromaDB](https://www.maximofn.com/chromadb), where I explain how embeddings are stored in a vector database. My post about the [HuggingFace Tokenizers](https://www.maximofn.com/hugging-face-tokenizers) library may also interest you: it explains how text is tokenized, which is the step prior to generating embeddings.

![RAG - embeddings](https://pub-fb664c455eca46a2ba762a065ac900f7.r2.dev/RAG-embeddings.webp)

### How to get the correct `chunk`?

We've said that the documentation is divided into `chunk`s and that the `chunk` containing the answer to the user's question is passed to the LLM. But how do we know which `chunk` contains the answer? The user's question is also converted into an embedding, and the similarity between the question's embedding and the embedding of each `chunk` is calculated. The `chunk` with the highest similarity is the one passed to the LLM.

![RAG - embeddings similarity](https://pub-fb664c455eca46a2ba762a065ac900f7.r2.dev/rag-chunk_retreival.webp)
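
To make the similarity comparison concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors and the chunk names are made up for illustration; real embeddings have hundreds of dimensions, and the metric a vector database uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for a question and three chunks
question = [1.0, 0.0, 1.0]
chunks = {
    "chunk_a": [1.0, 0.0, 0.9],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.5, 0.5, 0.5],
}

# Score every chunk against the question and keep the most similar one
scores = {name: cosine_similarity(question, vec) for name, vec in chunks.items()}
best_chunk = max(scores, key=scores.get)
print(best_chunk)  # chunk_a
```

Here `chunk_a` points in almost the same direction as the question, so it gets the highest score, while `chunk_b` is orthogonal to it and scores 0.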

### Let's review what `RAG` is

On the one hand, we have `retrieval`, which is obtaining the correct `chunk` of documentation. On the other hand, we have `augmented`, which is passing the user's question together with the `chunk` to the LLM. Finally, we have `generation`, which is obtaining the response generated by the LLM.
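
The three steps can be sketched as a toy pipeline. Everything here is a stand-in chosen for illustration: retrieval ranks by shared words instead of embeddings, and `fake_llm` is a placeholder for a real model call. The real versions are built in the rest of this post.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def retrieve(question, chunks, k=1):
    """Retrieval: rank chunks by word overlap with the question (toy stand-in for embeddings)."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]


def augment(question, retrieved_chunks):
    """Augmentation: put the question and the retrieved chunks into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    return f"Question:\n{question}\n\nContext:\n{context}"


def generate(prompt, llm):
    """Generation: ask the LLM (any callable taking a prompt) for the answer."""
    return llm(prompt)


def fake_llm(prompt):
    # Placeholder for a real LLM call; it only reports the prompt size
    return f"(answer based on a prompt of {len(prompt)} characters)"


chunks = [
    "The cafeteria opens at 8 am and closes at 5 pm",
    "Vacation requests must be approved by your manager",
    "The VPN requires two-factor authentication",
]
question = "Who must approve vacation requests"

retrieved = retrieve(question, chunks, k=1)
prompt = augment(question, retrieved)
answer = generate(prompt, fake_llm)
print(retrieved[0])
```

The retrieved chunk is the one about vacation requests, since it shares the most words with the question.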

## Vector Database

We have seen that the documentation is divided into `chunk`s and stored in a vector database, so we need to use one. For this post, I'm going to use [ChromaDB](https://www.trychroma.com/), which is a widely used vector database, and I also have a [post](https://www.maximofn.com/chromadb) where I explain how it works.

To get started, we first need to install the ChromaDB library, which we can do with conda or pip

``` bash
conda install conda-forge::chromadb
```

or

``` bash
pip install chromadb
```

### Embedding Function

As we've said, everything is based on embeddings, so the first thing we do is create a function that gets embeddings from a text. We're going to use the `sentence-transformers/all-MiniLM-L6-v2` model

In [5]:
import chromadb.utils.embedding_functions as embedding_functions

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
      
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=RAG_FUNDAMENTALS_ADVANCE_TECHNIQUES_TOKEN,
    model_name=EMBEDDING_MODEL
)

We test the embedding function

In [6]:
embedding = huggingface_ef(["Hello, how are you?",])
embedding[0].shape

(384,)

We obtain a 384-dimensional embedding. Explaining embeddings is beyond the scope of this post, but in short, our embedding function has mapped the phrase `Hello, how are you?` to a point in a 384-dimensional space.

### ChromaDB Client

Now that we have our embedding function we can create a ChromaDB client

First, we create a folder where the vector database will be stored

In [7]:
from pathlib import Path
      
chroma_path = Path("chromadb_persisten_storage")
chroma_path.mkdir(exist_ok=True)

Now we create the client

In [8]:
from chromadb import PersistentClient

chroma_client = PersistentClient(path = str(chroma_path))

### Collection

Once we have the ChromaDB client, the next step is to create a collection. A collection is a set of vectors, in our case the embeddings of the documentation's `chunk`s.

We create it by specifying the embedding function we are going to use

In [9]:
collection_name = "document_qa_collection"
collection = chroma_client.get_or_create_collection(name=collection_name, embedding_function=huggingface_ef)

## Document Upload

Now that we have created the vector database, we need to split the documentation into `chunk`s and store them in the vector database.

### Document Loading Function

First, we create a function to load all `.txt` documents from a directory

In [10]:
def load_one_document_from_directory(directory, file):
    with open(os.path.join(directory, file), "r") as f:
        return {"id": file, "text": f.read()}

def load_documents_from_directory(directory):
    documents = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            documents.append(load_one_document_from_directory(directory, file))
    return documents
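
As a variation, here is a sketch of the same loader written with `pathlib`. The function name `load_txt_documents` is hypothetical, and reading with an explicit `utf-8` encoding is an assumption about the files; `errors="replace"` keeps the load from failing on a stray byte at the cost of substituting a placeholder character.

```python
from pathlib import Path


def load_txt_documents(directory, encoding="utf-8"):
    """Load every .txt file in a directory as {"id": file name, "text": contents}."""
    documents = []
    # sorted() gives a deterministic order, unlike os.listdir()
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding=encoding, errors="replace")
        documents.append({"id": path.name, "text": text})
    return documents
```

It returns the same structure as `load_documents_from_directory`, so either one can feed the following steps.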


### Function to divide documentation into `chunk`s

Once we have the documents, we divide them into `chunk`s

In [11]:
def split_text(text, chunk_size=1000, chunk_overlap=20):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - chunk_overlap
    return chunks
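
Two details of this loop are worth knowing: if `chunk_overlap` were equal to or larger than `chunk_size`, `start` would never advance, and the last iteration can emit a short trailing chunk that is entirely contained in the previous one. The following is a sketch of a stricter variant (the name `split_text_safe` is hypothetical) that validates the parameters and stops once a chunk reaches the end of the text. It is not the function used for the results in this post, so chunk counts obtained with it may differ.

```python
def split_text_safe(text, chunk_size=1000, chunk_overlap=20):
    """Split text into overlapping chunks, validating the parameters first."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_text_safe(text, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one, which is what the overlap is for: a sentence cut at a boundary still appears whole in one of the two neighbouring chunks, as long as it is shorter than the overlap.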


### Function to generate embeddings from a `chunk`

Now that we have the `chunk`s, we generate the `embedding`s for each of them

We will see why later, but we are going to generate the embeddings locally rather than through the Hugging Face API. To do this, we need [PyTorch](https://pytorch.org) and `sentence-transformers` installed, so we run

``` bash
pip install -U sentence-transformers
```

In [12]:
from sentence_transformers import SentenceTransformer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_model = SentenceTransformer(EMBEDDING_MODEL).to(device)

def get_embeddings(text):
    try:
        embedding = embedding_model.encode(text, device=device)
        return embedding
    except Exception as e:
        print(f"Error: {e}")
        exit(1)

Let's try this embedding function locally

In [13]:
text = "Hello, how are you?"
embedding = get_embeddings(text)
embedding.shape

(384,)

We see that we get an embedding of the same dimension as when we did it with the Hugging Face API

The `sentence-transformers/all-MiniLM-L6-v2` model has only 22M parameters, so you will be able to run it on any GPU. Even if you don't have a GPU, you will be able to run it on a CPU.

The LLM we are going to use to generate responses, `Qwen2.5-72B-Instruct`, is, as its name suggests, a 72B-parameter model. It cannot be run on just any GPU, and on a CPU it would be impractically slow. That's why we use this LLM via the API, while the `embedding`s can be generated locally without any issues.

### Documents we are going to test with

To run all these tests, I downloaded the [aws-case-studies-and-blogs](https://www.kaggle.com/datasets/harshsinghal/aws-case-studies-and-blogs) dataset and placed it in the `rag_txt_dataset` folder. The following commands show how to download and unzip it

We create the folder where we are going to download the documents

In [30]:
!mkdir rag_txt_dataset

We download the `.zip` with the documents

In [1]:
!curl -L -o ./rag_txt_dataset/archive.zip https://www.kaggle.com/api/v1/datasets/download/harshsinghal/aws-case-studies-and-blogs

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1430k  100 1430k    0     0  1082k      0  0:00:01  0:00:01 --:--:-- 2440k


We unzip the `.zip`

In [2]:
!unzip rag_txt_dataset/archive.zip -d rag_txt_dataset

Archive:  rag_txt_dataset/archive.zip
  inflating: rag_txt_dataset/23andMe Case Study _ Life Sciences _ AWS.txt  
  inflating: rag_txt_dataset/36 new or updated datasets on the Registry of Open Data_ AI analysis-ready datasets and more _ AWS Public Sector Blog.txt  
  inflating: rag_txt_dataset/54gene _ Case Study _ AWS.txt  
  inflating: rag_txt_dataset/6sense Case Study.txt  
  inflating: rag_txt_dataset/ADP Developed an Innovative and Secure Digital Wallet in a Few Months Using AWS Services _ Case Study _ AWS.txt  
  inflating: rag_txt_dataset/AEON Case Study.txt  
  inflating: rag_txt_dataset/ALTBalaji _ Amazon Web Services.txt  
  inflating: rag_txt_dataset/AWS Case Study - Ineos Team UK.txt  
  inflating: rag_txt_dataset/AWS Case Study - StreamAMG.txt  
  inflating: rag_txt_dataset/AWS Case Study_ Creditsafe.txt  
  inflating: rag_txt_dataset/AWS Case Study_ Immowelt.txt  
  inflating: rag_txt_dataset/AWS Customer Case Study _ Kepler Provides Effective Monitoring of Elderly Care 

Let's delete the `.zip`

In [3]:
!rm rag_txt_dataset/archive.zip

We see what's left

In [4]:
!ls rag_txt_dataset

'23andMe Case Study _ Life Sciences _ AWS.txt'
'36 new or updated datasets on the Registry of Open Data_ AI analysis-ready datasets and more _ AWS Public Sector Blog.txt'
'54gene _ Case Study _ AWS.txt'
'6sense Case Study.txt'
'Accelerate Time to Business Value Using Amazon SageMaker at Scale with NatWest Group _ Case Study _ AWS.txt'
'Accelerate Your Analytics Journey on AWS with DXC Analytics and AI Platform _ AWS Partner Network (APN) Blog.txt'
'Accelerating customer onboarding using Amazon Connect _ NCS Case Study _ AWS.txt'
'Accelerating Migration at Scale Using AWS Application Migration Service with 3M Company _ Case Study _ AWS.txt'
'Accelerating Time to Market Using AWS and AWS Partner AccelByte _ Omeda Studios Case Study _ AWS.txt'
'Achieving Burstable Scalability and Consistent Uptime Using AWS Lambda with TiVo _ Case Study _ AWS.txt'
'Acrobits Uses Amazon Chime SDK to Easily Create Video Conferencing Application Boosting Collaboration for Global Users _ Acrobits Case Study _

### Creating the `chunk`s!

We load the documents with the function we created

In [14]:
dataset_path = "rag_txt_dataset"
documents = load_documents_from_directory(dataset_path)

We check that they were loaded correctly

In [15]:
for document in documents[0:10]:
    print(document["id"])

Run Jobs at Scale While Optimizing for Cost Using Amazon EC2 Spot Instances with ActionIQ _ ActionIQ Case Study _ AWS.txt
Recommend and dynamically filter items based on user context in Amazon Personalize _ AWS Machine Learning Blog.txt
Windsor.txt
Bank of Montreal Case Study _ AWS.txt
The Mill Adventure Case Study.txt
Optimize software development with Amazon CodeWhisperer _ AWS DevOps Blog.txt
Announcing enhanced table extractions with Amazon Textract _ AWS Machine Learning Blog.txt
THREAD _ Life Sciences _ AWS.txt
Deep Pool Optimizes Software Quality Control Using Amazon QuickSight _ Deep Pool Case Study _ AWS.txt
Upstox Saves 1 Million Annually Using Amazon S3 Storage Lens _ Upstox Case Study _ AWS.txt


Now we create the `chunk`s.

In [16]:
chunked_documents = []
for document in documents:
    chunks = split_text(document["text"])
    for i, chunk in enumerate(chunks):
        chunked_documents.append({"id": f"{document['id']}_{i}", "text": chunk})

In [17]:
len(chunked_documents)

3611

As we can see, there are 3611 `chunk`s. Since the free Hugging Face API account is limited to 1000 calls per day, creating the embeddings of all the `chunk`s through the API would exhaust the available calls before we had embedded them all.
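
The arithmetic behind that limit, as a quick sketch. It assumes one API request per `chunk`, which is how the embedding function has been called so far in this post:

```python
import math

num_chunks = 3611           # chunks produced above
daily_request_limit = 1000  # free-tier limit mentioned in this post

# Assumption: one API request per chunk
days_needed = math.ceil(num_chunks / daily_request_limit)
print(days_needed)  # 4
```

So embedding everything through the free tier would take four days, without leaving any requests for the LLM.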

Remember that this embedding model is very small, only 22M parameters, so it can run on almost any computer, faster or slower.

Since we only create the embeddings of the `chunk`s once, it doesn't matter much if our computer is not very powerful and the process takes a long time. Later, when we ask questions about the documentation, we will generate the embedding of each prompt with the Hugging Face API and call the LLM through the API as well.

There is one last library to install. Since generating the embeddings of the `chunk`s is going to be slow, we install `tqdm` so that it shows us a progress bar. We install it with conda or with pip, as you prefer

``` bash
conda install conda-forge::tqdm
```

or

``` bash
pip install tqdm
```

We generate the embeddings of the `chunk`s

In [19]:
import tqdm

progress_bar = tqdm.tqdm(chunked_documents)

for chunk in progress_bar:
    embedding = get_embeddings(chunk["text"])
    if embedding is not None:
        chunk["embedding"] = embedding
    else:
        print(f"Error with document {chunk['id']}")

100%|██████████| 3611/3611 [00:16<00:00, 220.75it/s]


Let's look at an example

In [60]:
from random import randint

idx = randint(0, len(chunked_documents) - 1)  # randint includes both endpoints
print(f"Chunk id: {chunked_documents[idx]['id']},\n\ntext: {chunked_documents[idx]['text']},\n\nembedding shape: {chunked_documents[idx]['embedding'].shape}")

Chunk id: BNS Group Case Study _ Amazon Web Services.txt_0,

text: Reducing Virtual Machines from 40 to 12
The founders of BNS had been contemplating a migration from the company’s on-premises data center to the public cloud and observed a growing demand for cloud-based operations among current and potential BNS customers.
Français
Configures security according to cloud best practices
Clive Pereira, R&D director at BNS Group, explains, “The database that records Praisal’s SMS traffic resides in Praisal’s AWS environment. Praisal can now run complete analytics across its data and gain insights into what’s happening with its SMS traffic, which is a real game-changer for the organization.”  
Español
 AWS ISV Accelerate Program
 Receiving Strategic, Foundational Support from ISV Specialists
 Learn More
The value that AWS places on the ISV stream sealed the deal in our choice of cloud provider.” 
日本語
  Contact Sales 
BNS is an Australian software provider focused on secure enterprise SMS an

### Loading the `chunk`s into the vector database

Once all the `chunk`s have their embeddings, we load them into the vector database. We use `tqdm` again to show a progress bar, because this step is also slow

In [22]:
import tqdm

progress_bar = tqdm.tqdm(chunked_documents)

for chunk in progress_bar:
    collection.upsert(
        ids=[chunk["id"]],
        documents=chunk["text"],
        embeddings=chunk["embedding"],
    )

100%|██████████| 3611/3611 [00:59<00:00, 60.77it/s]
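
Calling `upsert` once per `chunk` is simple but slow. Since `upsert` also accepts lists, a possible alternative is to send several `chunk`s per call. This is a sketch rather than a measured improvement: the helper names are hypothetical, and the `batch_size` of 256 is an arbitrary assumption that may need to be lowered for your setup.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def to_list(embedding):
    """Convert an embedding to a plain list (NumPy arrays provide .tolist())."""
    return embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)


def upsert_in_batches(collection, chunks, batch_size=256):
    """Upsert chunks in groups instead of one call per chunk."""
    for batch in batched(chunks, batch_size):
        collection.upsert(
            ids=[chunk["id"] for chunk in batch],
            documents=[chunk["text"] for chunk in batch],
            embeddings=[to_list(chunk["embedding"]) for chunk in batch],
        )


# Hypothetical usage with the objects created in this post:
# upsert_in_batches(collection, chunked_documents)
```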


## Questions

Now that we have the vector database, we can ask questions about the documentation. To do this, we first have to retrieve the `chunk`s most relevant to each question

### Getting the correct `chunk`

Let's create a function that returns the `k` most relevant `chunk`s for a query

In [50]:
def get_top_k_documents(query, k=5):
    results = collection.query(query_texts=query, n_results=k)
    return results

Finally, we create a `query`.

To generate the query, I randomly selected the document `Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt`, passed it to an LLM, and asked it to generate a question about the document. The question it generated is

```
How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?
```

So we get the most relevant `chunk`s for that question

In [51]:
query = "How did Neeva use Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization?"
top_chunks = get_top_k_documents(query=query, k=5)

Let's see what `chunk`s it has returned

In [52]:
for i in range(len(top_chunks["ids"][0])):
    print(f"Rank {i+1}: {top_chunks['ids'][0][i]}, distance: {top_chunks['distances'][0][i]}")

Rank 1: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_0, distance: 0.29233667254447937
Rank 2: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_5, distance: 0.4007825255393982
Rank 3: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_1, distance: 0.4317566752433777
Rank 4: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_6, distance: 0.43832334876060486
Rank 5: Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt_4, distance: 0.44625571370124817


As I said, the document I had chosen at random was `Using Amazon EC2 Spot Instances and Karpenter to Simplify and Optimize Kubernetes Infrastructure _ Neeva Case Study _ AWS.txt`, and as we can see, all the `chunk`s returned come from that document. That is, out of more than 3000 `chunk`s in the database, it returned the ones most relevant to the question. It seems that this works!
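
In the ranking above, a smaller distance means a closer match. A query with no good answer in the database still returns `k` results, so one option is to discard hits that are too far away. The sketch below works on a hand-written dictionary shaped like the query result; the helper name, the example values, and the threshold of 0.5 are all made up for illustration, and a sensible threshold depends on the embedding model and distance metric.

```python
def filter_by_distance(results, max_distance):
    """Keep only the hits of the first query whose distance is at most max_distance."""
    ids = results["ids"][0]
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [
        {"id": id_, "text": doc, "distance": dist}
        for id_, doc, dist in zip(ids, documents, distances)
        if dist <= max_distance
    ]


# Hand-written example with the same layout as the query result above
example_results = {
    "ids": [["doc.txt_0", "doc.txt_5", "other.txt_2"]],
    "documents": [["first chunk", "second chunk", "unrelated chunk"]],
    "distances": [[0.29, 0.40, 1.35]],
}

hits = filter_by_distance(example_results, max_distance=0.5)
print([hit["id"] for hit in hits])  # ['doc.txt_0', 'doc.txt_5']
```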

### Generate the response

Now that we have the most relevant `chunk`s, we pass them to the LLM, along with the question, so that it generates a response

In [58]:
def generate_response(query, relevant_chunks, temperature=0.5, max_tokens=1024, top_p=0.7, stream=False):
    # Join the retrieved chunks into a single context block
    context = "\n\n".join(relevant_chunks)
    prompt = f"You are an assistant for question-answering. You have to answer the following question:\n\n{query}\n\nAnswer the question with the following information:\n\n{context}"
    message = [
        { "role": "user", "content": prompt }
    ]
    completion = client.chat.completions.create(
        messages=message, 
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        stream=stream,
    )
    # This parsing assumes stream=False, i.e. a single complete response object
    response = completion.choices[0].message.content
    return response
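
Since the context window is limited, it can help to cap how much retrieved text goes into the prompt. The following sketch builds the same prompt but stops adding `chunk`s once a character budget is reached. The function name `build_prompt` and the default budget of 4000 characters are assumptions for illustration; a real limit would be expressed in tokens and depends on the model.

```python
def build_prompt(query, relevant_chunks, max_context_chars=4000):
    """Assemble the RAG prompt, adding chunks in order until a character budget is reached."""
    selected = []
    used = 0
    for chunk in relevant_chunks:
        if used + len(chunk) > max_context_chars:
            break  # adding this chunk would exceed the budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return (
        "You are an assistant for question-answering. "
        "You have to answer the following question:\n\n"
        f"{query}\n\n"
        "Answer the question with the following information:\n\n"
        f"{context}"
    )


example_prompt = build_prompt("Q?", ["aaaa", "bbbb", "cccc"], max_context_chars=8)
print(example_prompt)
```

With a budget of 8 characters only the first two `chunk`s fit. Because the `chunk`s arrive sorted by relevance, the ones dropped are the least relevant.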

We test the function

In [59]:
response = generate_response(query, top_chunks["documents"][0])
print(response)

Neeva, a cloud-native, ad-free search engine founded in 2019, has leveraged Karpenter and Amazon EC2 Spot Instances to significantly improve its infrastructure management and cost optimization. Here’s how:

### Early Collaboration with Karpenter
In late 2021, Neeva began working closely with the Karpenter team, experimenting with and contributing fixes to an early version of Karpenter. This collaboration allowed Neeva to integrate Karpenter with its Kubernetes dashboard, enabling the company to gather valuable metrics on usage and performance.

### Combining Spot Instances and On-Demand Instances
Neeva runs its jobs on a large scale, which can lead to significant costs. To manage these costs effectively, the company adopted a combination of Amazon EC2 Spot Instances and On-Demand Instances. Spot Instances allow Neeva to bid on unused EC2 capacity, often at a fraction of the On-Demand price, while On-Demand Instances provide the necessary reliability for critical pipelines.

### Flexibi

When I asked the LLM to generate a question about the document, I also asked it to generate the correct answer. This is the answer the LLM gave me

``` text
Neeva used Karpenter and Amazon EC2 Spot Instances to improve its infrastructure management and cost optimization in several ways:

Simplified Instance Management:

Karpenter: By adopting Karpenter, Neeva simplified the process of provisioning and managing compute resources for its Amazon EKS clusters. Karpenter automatically provisions and deprovisions instances based on workload, eliminating the need for manual configuration and reducing the complexity of understanding different compute instances.
Spot Instances: Neeva took advantage of Amazon EC2 Spot Instances, which are unused EC2 capacity available at a significant discount (up to 90% cost savings). This allowed the company to control costs while meeting its performance requirements.
Improved Scalability:

Karpenter: Karpenter's ability to dynamically scale resources allowed Neeva to launch new instances quickly, enabling the company to iterate faster and run more experiments in less time.
Spot Instances: Using Spot Instances provided instance flexibility and diversification, making it easier for Neeva to scale its compute resources efficiently.
Improved Productivity:

Karpenter: By democratizing infrastructure changes, Karpenter allowed any engineer to modify Kubernetes configurations, reducing the dependence on specialized expertise. This saved the Neeva team up to 100 hours per week of wait time on systems administration.
Spot Instances: The ability to quickly provision and deprovision Spot Instances reduced delays in the development pipeline, ensuring that jobs were not blocked by a lack of available resources.
Cost Efficiency:

Karpenter: Karpenter's best practices for Spot Instances, including instance flexibility and diversification, helped Neeva use these instances more effectively while staying within budget.
Spot Instances: The cost savings from using Spot Instances allowed Neeva to run large-scale jobs, such as indexing, for nearly the same cost but in a fraction of the time. For example, Neeva reduced its indexing jobs from 18 hours to just 3 hours.
Improved Resource Utilization:

Karpenter: Karpenter provided better visibility into compute resource usage, allowing Neeva to track and optimize its resource consumption more closely.
Spot Instances: The combination of Karpenter and Spot Instances allowed Neeva to run large language models more efficiently, improving the search experience for its users.
In summary, Neeva's adoption of Karpenter and Amazon EC2 Spot Instances significantly improved its infrastructure management, cost optimization, and overall development efficiency, allowing the company to deliver better ad-free search experiences to its users.
```

And this has been the response generated by our `RAG`

``` text
Neeva, a cloud-native, ad-free search engine founded in 2019, has leveraged Karpenter and Amazon EC2 Spot Instances to significantly improve its infrastructure management and cost optimization. Here’s how:

### Early Collaboration with Karpenter
In late 2021, Neeva began working closely with the Karpenter team, experimenting with and contributing fixes to an early version of Karpenter. This collaboration allowed Neeva to integrate Karpenter with its Kubernetes dashboard, enabling the company to gather valuable metrics on usage and performance.

### Combining Spot Instances and On-Demand Instances
Neeva runs its jobs on a large scale, which can lead to significant costs. To manage these costs effectively, the company adopted a combination of Amazon EC2 Spot Instances and On-Demand Instances. Spot Instances allow Neeva to bid on unused EC2 capacity, often at a fraction of the On-Demand price, while On-Demand Instances provide the necessary reliability for critical pipelines.

### Flexibility and Instance Diversification
According to Mohit Agarwal, infrastructure engineering lead at Neeva, Karpenter's adoption of best practices for Spot Instances, including flexibility and instance diversification, has been crucial. This approach ensures that Neeva can dynamically adjust its compute resources to meet varying workloads while minimizing costs.

### Improved Scalability and Agility
By using Karpenter to provision infrastructure resources for its Amazon EKS clusters, Neeva has achieved several key benefits:
- **Scalability**: Neeva can scale its compute resources up or down as needed, ensuring that it always has the capacity required to handle its workloads.
- **Agility**: The company can iterate quickly and democratize infrastructure changes, reducing the time spent on systems administration by up to 100 hours per week.

### Improved Development Cycles
The integration of Karpenter and Spot Instances has also accelerated Neeva's development cycles. The company can now launch new features and improvements more quickly, which is essential for maintaining a competitive edge in the search engine market.

### Cost Savings and Budget Control
By using Spot Instances, Neeva has been able to stay within its budget while meeting its performance requirements. This cost optimization is critical for a company that prioritizes user-centered experiences and has no competing advertising incentives.

### Future Plans
Neeva is committed to continuing its innovation and expansion. The company plans to launch in new regions and further improve its search engine, all while maintaining cost efficiency. As Mohit Agarwal notes, "Most of our compute is or will be managed using Karpenter in the future."

### Conclusion
By leveraging Karpenter and Amazon EC2 Spot Instances, Neeva has not only optimized its infrastructure costs but also improved its scalability, agility, and development speed. This strategic approach has positioned Neeva to deliver high-quality, ad-free search experiences to its users while maintaining a strong focus on cost control and innovation.
```

So we can conclude that `RAG` has worked correctly!!!