# Deploy Jina Models on AWS SageMaker

This notebook was created with the **Data Science 3.0 image** on an **ml.t3.medium** instance in SageMaker Studio.

[Jina Embeddings](https://jina.ai/embeddings/) and [Jina Reranker](https://jina.ai/reranker/) are now available to use with [SageMaker](https://aws.amazon.com/pm/sagemaker/) from the [AWS Marketplace](https://aws.amazon.com/marketplace/seller-profile?id=seller-stch2ludm6vgy). 

This notebook walks you through creating a [Retrieval-augmented generation (RAG)](https://jina.ai/news/full-stack-rag-with-jina-embeddings-v2-and-llamaindex/) application in AWS SageMaker for a collection of YouTube video transcripts. The models we will use are Jina Embeddings v2 - English, Jina Reranker v1, and the [Mistral-7B-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) large language model, all of which are available to SageMaker users.

You will need to have an AWS account. If you are not already an AWS user, you can [sign up for an account](https://portal.aws.amazon.com/billing/signup) on the AWS website.

## Set Up

Install the `jina-sagemaker` package and additional dependencies:

In [None]:
!pip install --upgrade sagemaker jina-sagemaker setuptools

## Configure a Role

You will need an [AWS role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) with sufficient permissions to use the resources required for this tutorial. 


In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
# The role name is the last path component of the role ARN
role_name = role.split("/")[-1]
region = sagemaker_session.boto_region_name


print(f"The Amazon Resource Name (ARN) of the role used for this demo is: {role}")
print(f"The name of the role used for this demo is: {role_name}")
print(f"The default region is: {region}")

To verify that the role above has the permissions required for this tutorial:

1. Go to the IAM console: https://console.aws.amazon.com/iam/home.
2. Select **Roles**.
3. Enter the role name in the search box to search for that role. 
4. Select the role.
5. Use the **Permissions** tab to verify that the following permissions are attached to this role:
    - `aws-marketplace:ViewSubscriptions`
    - `aws-marketplace:Unsubscribe`
    - `aws-marketplace:Subscribe`
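
As a rough aid for the manual check above, the sketch below scans a policy document in the standard IAM JSON policy format for the three Marketplace actions. The function name `missing_marketplace_actions` and the sample policy are hypothetical and used only for illustration. The check looks only at `Allow` statements and ignores conditions, resources, and explicit denies, so treat its result as a hint rather than a full policy evaluation.

```python
from fnmatch import fnmatchcase

REQUIRED_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]

def missing_marketplace_actions(policy_document):
    """Return the required actions that no Allow statement in the policy covers."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]

    allowed_patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        allowed_patterns.extend(actions)

    # Action names are compared case-insensitively; "*" acts as a wildcard
    return [
        action
        for action in REQUIRED_ACTIONS
        if not any(fnmatchcase(action.lower(), p.lower()) for p in allowed_patterns)
    ]

# Hypothetical policy used only for illustration
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["aws-marketplace:ViewSubscriptions", "aws-marketplace:Subscribe"],
            "Resource": "*",
        }
    ],
}
print(missing_marketplace_actions(sample_policy))
```

An empty list means every required action is covered by at least one `Allow` statement in the document you passed in.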

# Subscribe to Jina AI Models on AWS Marketplace

Subscribe to the [Jina Embeddings v2 base English](https://aws.amazon.com/marketplace/pp/prodview-5iljbegvoi66w) and [Jina Reranker v1](https://aws.amazon.com/marketplace/pp/prodview-avmxk2wxbygd6) models.

Once you have subscribed to them, the code below looks up the models’ ARNs for your AWS region and stores them in the variables `embedding_package_arn` and `reranker_package_arn` respectively. The rest of this tutorial references the models through those variables.

In [None]:

def get_arn_for_model(region, model_name):
    model_package_map = {
        "us-east-1": f"arn:aws:sagemaker:us-east-1:253352124568:model-package/{model_name}",
        "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{model_name}",
        "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{model_name}",
        "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{model_name}",
        "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{model_name}",
        "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{model_name}",
        "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{model_name}",
        "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{model_name}",
        "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{model_name}",
        "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{model_name}",
        "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{model_name}",
        "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{model_name}",
        "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{model_name}",
        "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{model_name}",
        "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{model_name}",
        "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{model_name}",
    }

    return model_package_map[region]

embedding_package_arn = get_arn_for_model(region, "jina-embeddings-v2-base-en")
reranker_package_arn = get_arn_for_model(region, "jina-reranker-v1-base-en")
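
To make the structure of these identifiers easier to see, here is a minimal sketch that splits an ARN of the general form `arn:partition:service:region:account-id:resource` into its parts. The helper name `parse_model_package_arn` is hypothetical; the example ARN is the `us-east-1` entry from the map above. One possible use is checking that the region embedded in an ARN matches the region of your session.

```python
def parse_model_package_arn(arn):
    """Split an ARN of the form arn:partition:service:region:account-id:resource."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a well-formed ARN: {arn!r}")

    _, partition, service, region, account_id, resource = parts
    # The resource part looks like "model-package/<name>"
    resource_type, _, resource_name = resource.partition("/")
    return {
        "partition": partition,
        "service": service,
        "region": region,
        "account_id": account_id,
        "resource_type": resource_type,
        "resource_name": resource_name,
    }

example_arn = "arn:aws:sagemaker:us-east-1:253352124568:model-package/jina-embeddings-v2-base-en"
parsed = parse_model_package_arn(example_arn)
print(parsed["region"], parsed["resource_name"])
```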

# Load the Dataset

In this tutorial, we are going to use a collection of videos provided by the YouTube channel [TU Delft Online Learning](https://www.youtube.com/@tudelftonlinelearning1226). This channel produces a variety of educational materials in STEM subjects. Its programming is [CC-BY licensed](https://creativecommons.org/licenses/by/3.0/legalcode).

We downloaded 193 videos from the channel and processed them with OpenAI’s open-source [Whisper speech recognition model](https://openai.com/research/whisper). We used the smallest model, [`openai/whisper-tiny` on Hugging Face](https://huggingface.co/openai/whisper-tiny), to process the videos into transcripts.

The transcripts have been organized into a CSV file, which you can [download from here](https://github.com/jina-ai/workshops/raw/main/notebooks/embeddings/sagemaker/tu_delft.csv).

## Install Requirements

The data is in CSV format and will be handled using `pandas` dataframes.

In [None]:
!pip install requests pandas

## Download the Data into a Dataframe

In [None]:
import pandas

# Load the CSV file
data_url = "https://github.com/jina-ai/workshops/raw/main/notebooks/embeddings/sagemaker/tu_delft.csv"
tu_delft_dataframe = pandas.read_csv(data_url)

Run the line below to inspect the first few rows of the dataframe.

In [None]:
tu_delft_dataframe.head()

# Start the Jina Embeddings v2 Endpoint

The code below will launch an instance of `ml.g4dn.xlarge` on AWS to run the embedding model.

It may take several minutes for this to finish.

In [None]:
import boto3
from jina_sagemaker import Client

# Choose a name for your embedding endpoint. SageMaker endpoint names may
# contain only alphanumeric characters and hyphens.
embeddings_endpoint_name = "jina-embedding"

embedding_client = Client(region_name=region)
embedding_client.create_endpoint(
    arn=embedding_package_arn,
    role=role,
    endpoint_name=embeddings_endpoint_name,
    instance_type="ml.g4dn.xlarge",
    n_instances=1,
)

embedding_client.connect_to_endpoint(endpoint_name=embeddings_endpoint_name)

# Build and Index the Dataset

Now that we have loaded the data and are running a Jina Embeddings v2 model, we can prepare and index the data. We will store the data in a [FAISS vector store](https://faiss.ai/index.html). FAISS is an open-source library for efficient similarity search over dense vectors.

First, install the remaining prerequisites for our RAG application.

In [None]:
!pip install tqdm numpy faiss-cpu

## Chunking

We will need to take the individual transcripts and split them up into smaller parts, i.e., “chunks”, so that we can fit multiple texts into a prompt for the LLM. The code below breaks the individual transcripts up on sentence boundaries, keeping each chunk to no more than 128 words by default. The one exception is a single sentence longer than the limit, which becomes a chunk of its own.

In [None]:
def chunk_text(text, max_words=128):
    """
    Divide text into chunks where each chunk contains the maximum number of full
    sentences under `max_words`.
    """
    sentences = text.split('.')
    chunk = []
    word_count = 0

    for sentence in sentences:
        # Remove surrounding whitespace and skip empty fragments
        sentence = sentence.strip()
        if not sentence:
            continue

        words_in_sentence = len(sentence.split())
        if word_count + words_in_sentence <= max_words:
            chunk.append(sentence)
            word_count += words_in_sentence
        else:
            # Yield the current chunk and start a new one
            if chunk:
                yield '. '.join(chunk).strip() + '.'
            chunk = [sentence]
            word_count = words_in_sentence

    # Yield the last chunk if it's not empty
    if chunk:
        yield '. '.join(chunk).strip() + '.'
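
The splitter above breaks only on periods, so sentences ending in `?` or `!` are merged into their neighbors. As a possible alternative, and not the method this tutorial uses, the sketch below splits after any of `.`, `!`, or `?` followed by whitespace and keeps each sentence's own punctuation. The function name `chunk_text_by_punctuation` is hypothetical.

```python
import re

def chunk_text_by_punctuation(text, max_words=128):
    """Hypothetical variant of chunk_text that splits after '.', '!' or '?'."""
    # Split at whitespace that follows sentence-ending punctuation
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunk = []
    word_count = 0

    for sentence in sentences:
        words_in_sentence = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the limit
        if chunk and word_count + words_in_sentence > max_words:
            yield " ".join(chunk)
            chunk = []
            word_count = 0
        chunk.append(sentence)
        word_count += words_in_sentence

    # Yield the last chunk if it's not empty
    if chunk:
        yield " ".join(chunk)

demo_chunks = list(chunk_text_by_punctuation("One two. Three four? Five six!", max_words=4))
print(demo_chunks)
```

Like the original, this variant lets a single sentence longer than `max_words` form an oversized chunk of its own.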

## Get Embeddings for Each Chunk

We need an embedding for each chunk to store it in the FAISS database. To get them, we pass the text chunks to the Jina AI embedding model endpoint, using the method `embedding_client.embed()`. Then, we add the text chunks and embedding vectors to the pandas dataframe `tu_delft_dataframe` as the new columns `chunks` and `embeddings`:

In [None]:
import numpy as np
from tqdm import tqdm

tqdm.pandas()

def generate_embeddings(text_df):
    # Split the transcript into chunks and embed each chunk separately
    chunks = list(chunk_text(text_df['Text']))
    embeddings = []

    for chunk in chunks:
        response = embedding_client.embed(texts=[chunk])
        chunk_embedding = response[0]['embedding']
        embeddings.append(np.array(chunk_embedding))

    # Store the chunks and their embeddings as new columns of this row
    text_df['chunks'] = chunks
    text_df['embeddings'] = embeddings
    return text_df

print("Embedding text chunks ...")

tu_delft_dataframe = tu_delft_dataframe.progress_apply(generate_embeddings, axis=1)
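
The loop above sends one request per chunk. Because `embed()` takes a list of texts, grouping several chunks into one request may reduce the number of round trips. The sketch below shows only the grouping step. The helper `batched` is hypothetical, and whether the endpoint accepts multi-text requests of a given size and returns results in input order is an assumption to verify before relying on it.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Made-up chunks used only for illustration
demo_chunks = ["chunk a", "chunk b", "chunk c", "chunk d", "chunk e"]

for batch in batched(demo_chunks, batch_size=2):
    # In the notebook, a call such as embedding_client.embed(texts=batch)
    # would go here (assumption: results come back in input order).
    print(batch)
```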

## Set Up Semantic Search Using FAISS

The code below creates a FAISS index and inserts the chunks and embedding vectors by iterating over `tu_delft_dataframe`:

In [None]:
import faiss

dim = 768  # dimension of Jina v2 embeddings
index_with_ids = faiss.IndexIDMap(faiss.IndexFlatIP(dim))
k = 0

doc_ref = dict()

for idx, row in tu_delft_dataframe.iterrows():
    embeddings = row['embeddings']
    for i, embedding in enumerate(embeddings):
        normalized_embedding = np.ascontiguousarray(np.array(embedding, dtype='float32').reshape(1, -1))
        faiss.normalize_L2(normalized_embedding)
        # FAISS expects the IDs as an int64 array with one entry per vector
        index_with_ids.add_with_ids(normalized_embedding, np.array([k], dtype='int64'))
        doc_ref[k] = (row['chunks'][i], idx)
        k += 1
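
The index above uses inner product (`IndexFlatIP`) on L2-normalized vectors. The reason this works as a similarity measure is that the inner product of two unit-length vectors equals their cosine similarity. The following sketch checks this on two made-up 2-dimensional vectors using only the standard library; the helper names are illustrative and are not part of FAISS.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Made-up vectors used only for illustration
a = [3.0, 4.0]
b = [4.0, 3.0]

ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cosine = cosine_similarity(a, b)
print(ip_of_normalized, cosine)
```

The two printed values agree up to floating-point rounding, which is why the search scores can be read as cosine similarities.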

# Start the Jina Reranker v1 Endpoint

As with the Jina Embeddings v2 model above, this code will launch an instance of `ml.g4dn.xlarge` on AWS to run the reranker model. Similarly, it may take several minutes to run.

In [None]:
import boto3
from jina_sagemaker import Client

# Choose a name for your reranker endpoint. SageMaker endpoint names may
# contain only alphanumeric characters and hyphens.
reranker_endpoint_name = "jina-reranker"

reranker_client = Client(region_name=region)
reranker_client.create_endpoint(
    arn=reranker_package_arn,
    role=role,
    endpoint_name=reranker_endpoint_name,
    instance_type="ml.g4dn.xlarge",
    n_instances=1,
)

reranker_client.connect_to_endpoint(endpoint_name=reranker_endpoint_name)

# Define Query Functions

Next, we will define a function that identifies the most similar transcript chunks to any text query.

In [None]:
def find_most_similar_transcript_segment(query, n=20):
    # Assumes the query is short enough to not need chunking
    query_embedding = embedding_client.embed(texts=[query])[0]['embedding']
    query_embedding = np.ascontiguousarray(np.array(query_embedding, dtype='float32').reshape(1, -1))
    faiss.normalize_L2(query_embedding)

    D, I = index_with_ids.search(query_embedding, n)  # Get the top n matches

    results = []
    for score, index_id in zip(D[0], I[0]):
        # FAISS pads the result with -1 when the index holds fewer than n vectors
        if index_id == -1:
            continue
        transcript_segment, doc_idx = doc_ref[int(index_id)]
        results.append((transcript_segment, doc_idx, score))

    # Sort the results by score, highest to lowest
    results.sort(key=lambda x: x[2], reverse=True)

    # Return (title, transcript segment, score) tuples
    return [(tu_delft_dataframe.loc[r[1]]["Title"].strip(), r[0], r[2]) for r in results]

Also, define a function that accesses the reranker endpoint `reranker_client` and is set up to accept the output of `find_most_similar_transcript_segment`.

In [None]:
def rerank_results(query_found, query, n=3):
    ret = reranker_client.rerank(
        documents=[f[1] for f in query_found], 
        query=query, 
        top_n=n,
    )
    return [query_found[r['index']] for r in ret[0]['results']]
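
To see how the `index` values map back onto the search results, here is a small illustration with a mocked reranker response. Only the `results` and `index` keys are taken from the code above; the `relevance_score` field, its values, and the candidate texts are made up for the example.

```python
# Candidates in the shape returned by find_most_similar_transcript_segment:
# (title, transcript segment, similarity score). Values are made up.
query_found = [
    ("Video A", "segment about turbines", 0.81),
    ("Video B", "segment about coffee", 0.79),
    ("Video C", "segment about wind farms", 0.75),
]

# Mocked response: each entry's 'index' points at a position in the
# `documents` list that was sent to the reranker.
mock_response = [
    {
        "results": [
            {"index": 2, "relevance_score": 0.93},  # hypothetical field and value
            {"index": 0, "relevance_score": 0.41},  # hypothetical field and value
        ]
    }
]

reranked = [query_found[r["index"]] for r in mock_response[0]["results"]]
for title, segment, _ in reranked:
    print(title, "|", segment)
```

Because the documents are sent in the same order as `query_found`, indexing `query_found` with each `index` recovers the full tuple, including the video title.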

# Mistral-Instruct with JumpStart

For this tutorial, we will use the `mistral-7b-instruct` model, which is [available via SageMaker JumpStart](https://aws.amazon.com/blogs/machine-learning/mistral-7b-foundation-models-from-mistral-ai-are-now-available-in-amazon-sagemaker-jumpstart/), as the LLM portion of the RAG system.

## Loading Mistral-Instruct with JumpStart

To load the model with JumpStart, run the following:

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

jumpstart_model = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct", role=role)
model_predictor = jumpstart_model.deploy()

## Making a Prompt Template for Mistral-Instruct

Below is the code to create a prompt template for Mistral-Instruct for this application using [Python’s built-in string template class](https://docs.python.org/3/library/string.html#template-strings). It assumes that for each query there are three matching transcript chunks that will be presented to the model.

You can experiment with this template yourself to modify this application or see if you can get better results.

In [None]:
from string import Template

prompt_template = Template("""
  <s>[INST] Answer the question below only using the given context.
  The question from the user is based on transcripts of videos from a YouTube
    channel.
  The context is presented as a ranked list of information in the form of
    (video-title, transcript-segment), that is relevant for answering the
    user's question.
  The answer should only use the presented context. If the question cannot be
    answered based on the context, say so.

  Context:
  1. Video-title: $title_1, transcript-segment: $segment_1
  2. Video-title: $title_2, transcript-segment: $segment_2
  3. Video-title: $title_3, transcript-segment: $segment_3

  Question: $question

  Answer: [/INST]
""")
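
When experimenting with the template, it helps to know how `string.Template` behaves when a placeholder is left unfilled. The short sketch below uses a made-up two-placeholder template, not the full prompt above: `substitute()` raises `KeyError` for a missing value, while `safe_substitute()` leaves the placeholder in place.

```python
from string import Template

# Small stand-in template used only for illustration
demo_template = Template("Title: $title_1, question: $question")

# All placeholders supplied
filled = demo_template.substitute(title_1="Wind Energy", question="What is a turbine?")
print(filled)

# A missing placeholder makes substitute() raise KeyError
try:
    demo_template.substitute(question="What is a turbine?")
except KeyError as missing:
    print(f"Missing placeholder: {missing}")

# safe_substitute() leaves unknown placeholders untouched instead
partial = demo_template.safe_substitute(question="What is a turbine?")
print(partial)
```

For a RAG prompt, the strict `substitute()` is usually the safer choice, since a silently unfilled `$segment_2` would end up in the text sent to the model.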

# Querying the Model

We now have all the parts of a complete RAG application and can start querying it. Querying the model is a three-step process.

1. Search for relevant chunks given a query.
2. Assemble the prompt.
3. Send the prompt to the Mistral-Instruct model and return its answer.

To search for relevant chunks, we use the `find_most_similar_transcript_segment` function we defined above.

First, we query for related segments from the video transcripts and rerank the results:

In [None]:
question = "When was the first offshore wind farm commissioned?"
search_results = find_most_similar_transcript_segment(question)
reranked_results = rerank_results(search_results, question)

You can inspect the result:

In [None]:
for title, text, _ in reranked_results:
    print(title + "\n" + text + "\n")

Next, we instantiate the template and fill in the values, using the top three reranked results:

In [None]:
# Fill the template from the reranked results, not the raw search results
prompt_for_llm = prompt_template.substitute(
    question = question,
    title_1 = reranked_results[0][0],
    segment_1 = reranked_results[0][1],
    title_2 = reranked_results[1][0],
    segment_2 = reranked_results[1][1],
    title_3 = reranked_results[2][0],
    segment_3 = reranked_results[2][1],
)

Inspect the completed prompt text:

In [None]:
print(prompt_for_llm)

Now, we can pass the complete prompt to the language model endpoint `model_predictor`:

In [None]:
answer = model_predictor.predict({"inputs": prompt_for_llm})

Print the resulting answer:

In [None]:
answer = answer[0]['generated_text']
print(answer)
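
The indexing above assumes the endpoint returns a list of dictionaries. As a defensive sketch, the hypothetical helper below also accepts a single dictionary and fails with a clear message otherwise. That some text-generation containers return a bare dictionary is an assumption here, so check it against the actual output of your own endpoint.

```python
def extract_generated_text(response):
    """Return 'generated_text' from a list-of-dicts or single-dict response."""
    if isinstance(response, list):
        if not response:
            raise ValueError("Empty response from the model endpoint")
        response = response[0]
    if not isinstance(response, dict) or "generated_text" not in response:
        raise ValueError(f"Unexpected response shape: {response!r}")
    return response["generated_text"]

# Mocked responses used only for illustration
print(extract_generated_text([{"generated_text": "Answer from a list response."}]))
print(extract_generated_text({"generated_text": "Answer from a dict response."}))
```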

Let’s simplify querying by writing a function to do all the steps:

In [None]:
def ask_rag(question):
    search_results = find_most_similar_transcript_segment(question)
    reranked_results = rerank_results(search_results, question)
    # Fill the template from the reranked results, not the raw search results
    prompt_for_llm = prompt_template.substitute(
        question = question,
        title_1 = reranked_results[0][0],
        segment_1 = reranked_results[0][1],
        title_2 = reranked_results[1][0],
        segment_2 = reranked_results[1][1],
        title_3 = reranked_results[2][0],
        segment_3 = reranked_results[2][1],
    )
    answer = model_predictor.predict({"inputs": prompt_for_llm})
    return answer[0]['generated_text']
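
The template fields follow a regular naming pattern (`title_1`, `segment_1`, and so on), so they can also be generated in a loop. The sketch below is one way to do that. The helper name `build_prompt_fields` is hypothetical, and it assumes results are `(title, segment, score)` tuples as returned by the functions above.

```python
def build_prompt_fields(question, results, n_context=3):
    """Map (title, segment, score) tuples onto the template's placeholder names."""
    if len(results) < n_context:
        raise ValueError(f"Need at least {n_context} results, got {len(results)}")

    fields = {"question": question}
    for rank, (title, segment, _score) in enumerate(results[:n_context], start=1):
        fields[f"title_{rank}"] = title
        fields[f"segment_{rank}"] = segment
    return fields

# Made-up results used only for illustration
demo_results = [
    ("Video A", "first segment", 0.9),
    ("Video B", "second segment", 0.8),
    ("Video C", "third segment", 0.7),
]
fields = build_prompt_fields("What is a wind farm?", demo_results)
print(sorted(fields))
```

The resulting dictionary can be passed to the template as keyword arguments, for example `prompt_template.substitute(**fields)`. Raising an error when fewer than three results are available avoids sending a half-filled prompt to the model.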


Now let's ask some questions:

In [None]:
ask_rag("Who is Reneville Solingen?")

In [None]:
ask_rag("What is a Kaplan Meyer estimator?")

In [None]:
ask_rag("What countries export the most coffee?")

In [None]:
ask_rag("How much wood could a woodchuck chuck if a woodchuck could chuck wood?")

In [None]:
ask_rag("What is the European Green Deal?")

# Shutting Down

Because you are billed by the hour for the models you use and for the AWS infrastructure to run them, it is very important, when you finish, to stop all three AI models used in this tutorial:

- The embedding model endpoint `embedding_client`
- The reranker model endpoint `reranker_client`
- The large language model endpoint `model_predictor`

To shut all three model endpoints down, run the following code:

In [None]:
# End all clients

embedding_client.delete_endpoint()
embedding_client.close()
reranker_client.delete_endpoint()
reranker_client.close()
# Delete the LLM's model before its endpoint
model_predictor.delete_model()
model_predictor.delete_endpoint()
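
If one of the calls above raises an exception, the calls after it never run and the remaining endpoints keep accruing charges. A minimal sketch of a more forgiving pattern is shown below: it runs every cleanup step and collects failures instead of stopping at the first one. The helper `run_all_cleanups` is hypothetical, and the stand-in functions only simulate success and failure.

```python
def run_all_cleanups(cleanup_steps):
    """Run every (label, callable) pair and collect failures instead of stopping."""
    failures = []
    for label, step in cleanup_steps:
        try:
            step()
        except Exception as error:  # keep going so later steps still run
            failures.append((label, error))
    return failures

# Stand-in functions used only for illustration. In the notebook, the steps
# would be bound methods such as embedding_client.delete_endpoint.
completed = []

def delete_ok():
    completed.append("ok")

def delete_fails():
    raise RuntimeError("simulated failure")

failures = run_all_cleanups([
    ("embedding endpoint", delete_ok),
    ("reranker endpoint", delete_fails),
    ("LLM endpoint", delete_ok),
])
for label, error in failures:
    print(f"Could not clean up {label}: {error}")
```

Whatever approach you use, it is worth confirming in the SageMaker console that no endpoints are left running.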
