## Contextual Video RAG over Webinars with Pinecone, Anthropic and AWS

Welcome to the workshop! In this notebook, we'll set up a simple video RAG workflow using Pinecone, Claude, and AWS. We'll take an input set of videos and ingest them (using Claude in pre- and post-processing) to enable a contextual RAG experience over a traditionally vexing dataset.

In doing so, we'll convert a multimodal search problem into a purely text-based one, and leave the complex multimodal ingestion to Claude on Amazon Bedrock. This saves us time and a bit of complexity on the multimodal embedding front!


Before running this notebook in SageMaker, you'll need the following:

- A SageMaker instance with this repo open
- Access to Claude Haiku and Sonnet via Bedrock
- A folder called "data" with a subfolder called "videos", containing at least one video in .mp4 format
- A Pinecone API key, so we can create our index
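
Before going further, it can help to confirm that the videos are where the notebook expects them. Here is a minimal sketch, assuming the folder layout from the list above (`data/videos` relative to the notebook). `find_videos` is a hypothetical helper and is not part of the repo.

```python
from pathlib import Path


def find_videos(data_root="data"):
    """Return the sorted .mp4 files under <data_root>/videos, or raise if none exist."""
    videos_dir = Path(data_root) / "videos"
    if not videos_dir.is_dir():
        raise FileNotFoundError(f"Expected a folder at {videos_dir}")
    # Compare the suffix case-insensitively so that .MP4 files are found too
    videos = sorted(p for p in videos_dir.iterdir() if p.suffix.lower() == ".mp4")
    if not videos:
        raise FileNotFoundError(f"No .mp4 files found in {videos_dir}")
    return videos
```

Calling `find_videos()` from the notebook's directory either returns the list of videos or fails early with a message naming the missing folder.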

## First, some dependency cleanup

In [None]:
# install torch down, install ffmpeg-python

In [2]:
# Install PyTorch, torchvision, and torchaudio built against CUDA 11.8
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.20.1%2Bcu118-cp310-cp310-linux_x86_64.whl (6.5 MB)
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.5.1%2Bcu118-cp310-cp310-linux_x86_64.whl (3.3 MB)
Collecting torch
  Downloading https://download.pytorch.org/whl/cu118/torch-2.5.1%2Bcu118-cp310-cp310-linux_x86_64.whl (838.3 MB)
Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch)
  Downloading https://download.pytorch.org/whl/cu118/nvidia_cuda_nvrtc_cu11-11.8.89-py3-non

In [None]:
%conda install ffmpeg-python

## Video Data as Input

The trickiest part about working with video data is the multimodal nature of the content presented on screen.

For content such as webinars (like this one!), you may have multiple speakers, diagrams on screen, mismatched transcripts and audio, etc.

Without some sort of end-to-end encoder, it can be quite difficult to capture all of these attributes.

We'll take a simplified approach where we process our video set into frame-transcript pairs, which reduces the data to images and their accompanying text.

Let's begin by transcribing and processing our video data.

**Don't forget to upload your videos manually into the data folder!**

![Test Title](./diagrams/Video_Preprocessing.png)


### Video preprocessing: Transcription and Frames



First, we'll do some housework to grab our videos and set up some helper functions.

In [None]:
import os 
from preprocessing.config import data_dir, videos_dir
from preprocessing.preprocess_videos import *


video_files = os.listdir(videos_dir)
print(video_files)
# add root dir to video files
video_files = [os.path.join(videos_dir, f) for f in video_files]

all_videos_data = {}
transcriptions_dir = os.path.join(data_dir, "transcriptions")
frames_and_words_dir = os.path.join(data_dir, "frames_and_words")
frames_dir = os.path.join(data_dir, "frames")

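# folder setup: create each folder independently, so that one pre-existing
# folder does not stop the others from being created
output_dirs = [transcriptions_dir, frames_and_words_dir, frames_dir]
if any(os.path.exists(d) for d in output_dirs):
    print("Folders already exist. Please delete them to start fresh or ensure they are empty.")
for d in output_dirs:
    os.makedirs(d, exist_ok=True)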


Now we can iterate over the video files with the following workflow:

1. Transcribe the video and obtain the word-level timestamps (we do this in 45-second intervals)
2. Walk over the video in windows of `INTERVAL` seconds, and take the current frame on screen
3. Grab all words covering that frame, and save them out along with the transcript and the frames themselves

**If you'd like to modify any of this code, take a look at the preprocess_videos.py script under preprocessing!**
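
To make steps 2 and 3 concrete, here is a simplified sketch of the grouping logic. It is not the code in `preprocess_videos.py`. The input shapes (word chunks with `(start, end)` timestamps in seconds, frames with a `frame_path` and a `timestamp`) and both helper names are assumptions, chosen to match the `frame_path`, `words`, and `timestamp` fields used later in this notebook.

```python
def frame_windows(duration, interval):
    """Split [0, duration) into consecutive (start, end) windows of `interval` seconds."""
    windows = []
    start = 0.0
    while start < duration:
        # The last window is clipped to the end of the video
        windows.append((start, min(start + interval, duration)))
        start += interval
    return windows


def group_words_by_window(word_chunks, frames):
    """Attach to each frame the words whose start time falls inside its window."""
    grouped = []
    for frame in frames:
        start, end = frame["timestamp"]
        words = [
            w["text"].strip()
            for w in word_chunks
            if start <= w["timestamp"][0] < end
        ]
        grouped.append({
            "frame_path": frame["frame_path"],
            "words": " ".join(words),
            "timestamp": [start, end],
        })
    return grouped
```

Assigning each word by its start time means a word that straddles a window boundary lands in exactly one frame, which avoids duplicated text across neighbouring frames.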


In [None]:
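import json  # used below to write out the transcription and frame data

INTERVAL = 45  # window length, in seconds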

for video_path in video_files:
    transcription = transcribe_video(video_path)

    video_filename = os.path.splitext(os.path.basename(video_path))[0]
        
    # Write transcription out as json, for use later in the pipeline
    transcription_filename = os.path.join(transcriptions_dir, video_filename + "_transcription.json")
    with open(transcription_filename, "w") as f:
        json.dump(transcription["text"], f)
        
    frames = extract_frames(frames_dir, video_path, INTERVAL)

    # We group the words into the frames they belong to, here
    frames_and_words = assign_words_to_frames(transcription, frames)
        
    frames_and_words_filename = os.path.join(frames_and_words_dir, video_filename + "_frames_and_words.json")
    with open(frames_and_words_filename, "w") as f:
        json.dump(frames_and_words, f)
        
    all_videos_data[video_filename] = {
        "transcription": transcription_filename,
        "frames_and_words": frames_and_words_filename,
    }

# This file helps manage all of the videos we make, if there is more than one
all_videos_data_path = data_dir / "all_videos_data.json"
with open(all_videos_data_path, "w") as f:
    json.dump(all_videos_data, f)



['mlsearch_webinar.mp4']
Folders already exist. Please delete them to start fresh or ensure they are empty.




Folder already exists for videofile /home/ec2-user/SageMaker/pc-yt-rag/data/videos/mlsearch_webinar.mp4
Please delete it to start fresh or ensure it is empty.



## Using Claude on Ingest

### Contextual Retrieval



As discussed earlier, the videos we want to run RAG over have some properties that differentiate them from normal documents.

Notably, these videos can be really long! So, how can we get high quality representations of each frame, if we just have the context of the transcripts?

Lucky for us, we can use Claude's visual understanding capabilities to annotate each frame, conditioned on the **transcript, frame image, and overall transcript summary**. 

This is a basic form of **contextual retrieval**, where we enrich the initial text data with the context surrounding it. Anthropic [announced this technique](https://www.anthropic.com/news/contextual-retrieval) as a way to improve retrieval on texts where the context for chunked data is particularly important, and we extend it here to apply to video data!

First, we set up a helper function to encode the frame images we need to pass to Claude Haiku.

In [None]:
import base64

def convert_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        binary_data = image_file.read()
        base_64_encoded_data = base64.b64encode(binary_data)
        base64_string = base_64_encoded_data.decode('utf-8')
    return base64_string


Next, we create a function to obtain Claude's response given an image and some text. It is used on ingest to describe each frame, and it also works for one-off cases where we just want a response on a single image or a text-only prompt.

In [None]:
MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
MAX_TOKENS = 256

from anthropic import AnthropicBedrock



def ask_claude(img, text):
    # Best for one-off queries. img is a path to a PNG file, or None for text-only prompts
    client = AnthropicBedrock(aws_region="us-east-1")
    if img:
        # Image block first, followed by the text prompt
        content = [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": convert_image_to_base64(img)
                }
            },
            {"type": "text", "text": text}
        ]
    else:
        content = text
    message = client.messages.create(
        model=MODEL,
        max_tokens=MAX_TOKENS,
        messages=[{"role": "user", "content": content}]
    )
    return message.content[0].text
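
Note that `ask_claude` hardcodes `image/png` as the media type. If your frames are saved in another format, the declared media type should match the file. Here is a small sketch using the standard library's `mimetypes` module. `guess_image_media_type` is a hypothetical helper, and the set of accepted types reflects the image formats Claude's image blocks support (PNG, JPEG, GIF, WebP).

```python
import mimetypes

# Image media types accepted by Claude's image content blocks
SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def guess_image_media_type(image_path):
    """Guess the media type from the file extension; raise for unsupported types."""
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(
            f"Unsupported or unknown image type for {image_path}: {media_type}"
        )
    return media_type
```

The returned string could then replace the literal `"image/png"` in the image block.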


    




We also make a helper function specifically for creating the transcript summaries.

In [None]:

def make_claude_transcript_summary(transcript):
    client = AnthropicBedrock(aws_region="us-east-1")

    prompt = "Summarize the following transcript, being as concise as possible:"
    message = client.messages.create(
        model=MODEL,
        # The prompt already ends with a colon, so a blank line separates it from the transcript
        messages=[{"role": "user", "content": prompt + "\n\n" + transcript}],
        max_tokens=MAX_TOKENS
    )
    return message.content[0].text

#### Creating the Contextual Descriptions using VQA, Transcripts, and Claude



Finally, the contextual RAG step! We use the transcript summary, the transcript for the current frame, and the frame itself so that Claude can create a rich contextual description. This description is what gets embedded for search in Pinecone.

To better understand how Claude deals with visual data, especially slides, take a look at Anthropic's cookbook [here](https://github.com/anthropics/anthropic-cookbook/blob/main/multimodal/reading_charts_graphs_powerpoints.ipynb).

![Our Contextual Embedding Workflow](./diagrams/Contextual_Retrieval_With_Video_RAG.png)

In [None]:
def create_contextual_frame_description(frame_caption_index, frame_caption_pairs, transcript_summary, window=60, frame_width=15):
    # Each frame-caption pair holds a frame image path and the transcript words for that frame.
    # window (in seconds) and frame_width are accepted but not used in this version.
    # No Bedrock client is created here: ask_claude (called below) creates its own.
    current_frame = frame_caption_pairs[frame_caption_index]

    meta_prompt = f'''

    You are watching a video and trying to explain what has
    happened in the video using a global summary, some recent context, 
    and the transcript of the current frame.

    The video has been summarized as follows:
    {transcript_summary}

    The current frame's transcript is as follows:
    {current_frame["words"]}

    You also want to provide a description of the current frame based on the context provided.

    Please describe this video snippet using the information above in addition to the frame visual. Explain any diagrams or code or important text that appears on screen,
    especially if the snippet is of a slide or a code snippet. 
    If there are only people in the frame, focus on the transcript and the context provided to describe what has
    been talked about. 
    If a question was asked, and answered, 
    include the question and answer in the description as well.

    Description:
    '''

    rich_summary = ask_claude(img=current_frame["frame_path"], text=meta_prompt)
    return rich_summary
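
The prompt above mentions "some recent context", and the function accepts a `window` argument, but only the current frame's words are passed to Claude. One way the recent context could be gathered is sketched below. `recent_context` is a hypothetical helper, and it assumes each pair's `timestamp` is a `[start, end]` pair in seconds.

```python
def recent_context(frame_caption_pairs, index, window=60):
    """Join the words of earlier frames that ended within `window` seconds
    of the current frame's start."""
    current_start = frame_caption_pairs[index]["timestamp"][0]
    earlier = [
        pair["words"].strip()
        for pair in frame_caption_pairs[:index]
        if current_start - pair["timestamp"][1] <= window
    ]
    return " ".join(earlier)
```

The returned string could be interpolated into the prompt as an extra section between the global summary and the current frame's transcript.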

#### Putting it all Together

In [None]:
from tqdm import tqdm
import json
with open("./data/all_videos_data.json", "r") as f:
    all_videos_data = json.load(f)


finalized_data = []

for video, data in all_videos_data.items():
    with open(data["transcription"], "r") as f:
        transcript = json.load(f)
    with open(data["frames_and_words"], "r") as f:
        frame_caption_pairs = json.load(f)

    transcript_summary = make_claude_transcript_summary(transcript=transcript)

    print(transcript_summary)

    for i, pair in tqdm(enumerate(frame_caption_pairs)):
        contextual_frame_description = create_contextual_frame_description(
            frame_caption_index = i, 
            frame_caption_pairs=frame_caption_pairs, 
            transcript_summary=transcript_summary)
        # write out the updated frame caption pairs
        # this data will compose the metadata for the vector database
        # Note that only the contextual frame description will be searched over
        new_pair = {
            "frame_path": pair["frame_path"],
            "words": pair["words"],
            "timestamp": pair["timestamp"],
            "transcript_summary": transcript_summary,
            "contextual_frame_description": contextual_frame_description
        }
        finalized_data.append(new_pair)

# write out the finalized data
with open("./data/finalized_data.json", "w") as f:
    json.dump(finalized_data, f)


Here is a concise summary of the key points from the transcript:

- The webinar covers the magic of multilingual search, specifically multilingual semantic search. 

- It provides a crash course on vectors, vector embeddings, and how large language models can represent concepts across languages.

- The focus is on using the multilingual E5 large model and Pinecone's vector database to enable efficient multilingual semantic search.

- A demo is shown applying this approach to a language learning problem, allowing cross-lingual and model-lingual search over a dataset of English and Spanish sentence translations.

- Key takeaways include embedding queries and passages differently, handling chunking and rate limiting, and evaluating performance with a domain-specific gold standard dataset.

- The session covers theoretical aspects of multilingual embeddings as well as practical steps for implementing a multilingual semantic search application using Pinecone.


72it [07:36,  6.34s/it]


## Using Pinecone

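Now that we've generated a contextual description for every frame, we can embed those descriptions and upsert them into Pinecone.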


### What is Pinecone?

### Using AWS Bedrock: Titan Text Embedding Models

### Creating Index

### A Note about Metadata


### And we're done!

### Embedding the Data with Titan

We'll be doing roughly the following:

![Pinecone Embedding and Upsertion](./diagrams/Pinecone_Upsertion.png)

In [None]:
from pinecone import Pinecone, ServerlessSpec
import boto3

boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
bedrock_client = boto3.client(
    "bedrock-runtime",
    region_name,
)


# Embedding code
def titan_text_embedding(
    text: str,  # Titan Text Embeddings V2 accepts up to 8,192 input tokens
    dimension: int = 1024,  # 1,024 (default), 512, 256
    model_id: str = "amazon.titan-embed-text-v2:0"
):
    payload_body = {
        "inputText": text,
        # Pass the requested size through; "dimensions" is a V2 request field
        "dimensions": dimension,
    }

    response = bedrock_client.invoke_model(
        body=json.dumps(payload_body),
        modelId=model_id,
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get("body").read())

    finish_reason = response_body.get("message")

    if finish_reason is not None:
        raise Exception(f"Embeddings generation error: {finish_reason}")

    return response_body


# read in as json
file_path = './data/finalized_data.json'
with open(file_path, "r") as f:
    data = json.load(f)


values_to_embed = [item["contextual_frame_description"] for item in data]

embeddings = []

# For large number of embeddings, take care to respect rate limits!
for v in tqdm(values_to_embed):
    embedding = titan_text_embedding(text=v)
    embeddings.append(embedding["embedding"])


final_vectors = []
# Easy way to assign ids. Be careful: upserting another video with the same ids will overwrite these vectors
ids = range(len(values_to_embed))

for v, e, vector_id in tqdm(zip(data, embeddings, ids)):
    final_vectors.append({
        "id": str(vector_id),
        "values": e,
        "metadata": {
            "transcript": v["words"],
            "filepath": v["frame_path"],
            "timestamp_start": v["timestamp"][0],
            "timestamp_end": v["timestamp"][1],
            "contextual_frame_description": v["contextual_frame_description"]
        }
    })
    

100%|██████████| 72/72 [00:06<00:00, 10.84it/s]
72it [00:00, 77612.41it/s]
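
The embedding loop above calls Bedrock once per description, so larger datasets may run into throttling. Here is a minimal retry sketch with exponential backoff. `call_with_backoff` is a hypothetical helper. It retries on any exception for simplicity, whereas in practice you would likely catch only throttling errors.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Delays grow as 1s, 2s, 4s, ... with a little random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

In the loop above, this could be used as `embedding = call_with_backoff(lambda: titan_text_embedding(text=v))`.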


### Creating an Index with Pinecone and upserting!


**Be sure to enter your Pinecone API Key here!**

Don't have one? No problem, sign up [here](https://docs.pinecone.io/guides/get-started/quickstart):
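
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. This is a sketch only. `get_pinecone_api_key` is a hypothetical helper, and the variable name `PINECONE_API_KEY` is an assumption about how you have configured your environment.

```python
import os


def get_pinecone_api_key(env=os.environ):
    """Read the key from PINECONE_API_KEY instead of hardcoding it in the notebook."""
    api_key = env.get("PINECONE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("Set the PINECONE_API_KEY environment variable first.")
    return api_key
```

The result can then be passed as `Pinecone(api_key=get_pinecone_api_key())`.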

In [14]:
import pandas as pd

df = pd.DataFrame(final_vectors)
print(df.head())
index_name="enriched-claude-vqa-aws"
# replace this as necessary
pc = Pinecone(api_key="")


# create index
if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        ) 
    ) 

index = pc.Index(index_name)
# Handy helper function for upsertion!
index.upsert_from_dataframe(df)

  id                                             values  \
0  0  [-0.01581481657922268, 0.05500806123018265, -0...   
1  1  [-0.007456343621015549, 0.04353542625904083, -...   
2  2  [-0.030072269961237907, 0.045493949204683304, ...   
3  3  [-0.030104508623480797, 0.06101180613040924, 0...   
4  4  [-0.001742413965985179, 0.004221058916300535, ...   

                                            metadata  
0  {'transcript': ' All right, welcome everybody....  
1  {'transcript': ' take universal translation as...  
2  {'transcript': ' a bit about vector embeddings...  
3  {'transcript': ' about your weekend trip that ...  
4  {'transcript': ' parking, which is completely ...  


sending upsert requests:   0%|          | 0/72 [00:00<?, ?it/s]

{'upserted_count': 72}
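
The index above was created with `metric="cosine"`. As an illustration of what that metric measures, here is a brute-force sketch in plain Python over vectors shaped like `final_vectors`. Pinecone does not search this way internally (it serves queries from its own index structures), and `cosine_similarity` and `top_k` are hypothetical helpers for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(v["id"], cosine_similarity(query, v["values"])) for v in vectors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Because cosine similarity ignores vector length, two descriptions of very different lengths can still score highly if they point in the same direction in embedding space.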

## Setting up the RAG workflow with Claude and Pinecone

Now, we're ready to do RAG! But we have one more trick up our sleeves.

We don't actually embed the images and text in the same vector space, so when we query Pinecone, we're just doing a semantic search over the contextual frame descriptions.

But because we wrote the frames out to disk and stored the frame paths in the metadata, we can read the frames back in and pass them to Claude again for full-fidelity question answering!

This saves us the trouble of embedding the text and image data in a shared vector space, while still preserving the information from the images.

We write two helper functions to accomplish this: one to format the image-text pairs, and another to prompt Claude directly.


In [None]:
def format_messages_for_claude(user_query, vdb_response):
    """
    Formats the user's query and the vector database response into a structured message for Claude.
    
    Args:
        user_query (str): The user's query.
        vdb_response (list): The response from the vector database, containing images and text.
    
    Returns:
        list: A list of messages formatted for Claude.
    """
    messages = [{"role": "user", "content": []}]
    # Start with the user's query
    new_content = [{"type": "text", "text": "The user query is: " + user_query}]
    # For each retrieved frame we append four content blocks, in order:
    # the frame's file path, the image itself, its contextual description, and its transcript.

    for item in vdb_response:
        img_b64 = convert_image_to_base64(item["metadata"]["filepath"])
        new_content.extend([
            {
                "type": "text",
                "text": "Image: " + item["metadata"]["filepath"],
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": img_b64
                }
            },
            {
                "type": "text",
                "text": "Contextual description: " + item["metadata"]["contextual_frame_description"]
            },
            {
                "type": "text",
                "text": "Transcript: " + item["metadata"]["transcript"]
            }
        ])
    # Attach the assembled content to the single user message
    messages[0]["content"] = new_content
    return messages



def ask_claude_vqa_response(user_query, vdb_response):
    """
    Sends the user's query and the vector database response to Claude and gets a response.
    
    Args:
        user_query (str): The user's query.
        vdb_response (list): The response from the vector database, containing images and text.
    
    Returns:
        str: The response from Claude.
    """
    client = AnthropicBedrock()
    messages = format_messages_for_claude(user_query, vdb_response)
    system_prompt = '''

You are a friendly assistant helping people interpret their videos at their company.

You will receive frames of these videos, with descriptions of what has happened in the frames, as well as a user query.

Your job is to ingest the images and text, and respond to the user's query or question based on the context provided.

Refer back to the images and text provided to guide the user to the appropriate slide, section, webinar, or talk
where the information they are looking for is located.
    '''
    response = client.messages.create(
            model=MODEL,
            max_tokens=MAX_TOKENS * 10,
            system=system_prompt,
            messages=messages
        )
    return response.content[0].text

![Claude RAG](./diagrams/RAG_Workflow_for_Video_Contextual_RAG.png)

In [None]:
query_text = "Find me code samples for learning how to use Pinecone with multilingualism"

query_embedding =  titan_text_embedding(text=query_text)

response = index.query(vector=query_embedding["embedding"], top_k=5, include_metadata=True)

claude_explanation = ask_claude_vqa_response(query_text, response["matches"])

In [17]:
print(claude_explanation)

Based on the context provided in the video frames and transcript, here are the key steps to find code samples for learning how to use Pinecone with multilingualism:

1. The video covers how to use the Pinecone Inference API and the multilingual E5-large language model to enable efficient multilingual semantic search. 

2. The key points include:
   - Distinguishing between "queries" (sentences/phrases for searching) and "passages" (longer-form content like articles) when working with the multilingual model.
   - Using a function to embed a list of sentences and return the embeddings, to handle chunking and rate limiting when indexing data.
   - Embedding translation pairs separately to get the embedding list object.
   - Setting up the Pinecone index with the appropriate dimension size (1024) to match the model's output vector size.
   - Embedding the data into the Pinecone index, treating queries and passages differently.

3. The code samples that demonstrate these multilingual search

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def visualize_matches(matches):
    for match in matches:
        metadata = match['metadata']
        img_path = metadata['filepath']
        transcript = metadata['transcript']
        contextual_description = metadata['contextual_frame_description']
        
        # Load and display the image
        img = mpimg.imread(img_path)
        plt.figure(figsize=(10, 6))
        plt.imshow(img)
        plt.axis('off')
        
        # Display the transcript and contextual description
        plt.title(f"Transcript: {transcript}\n\nContextual Description: {contextual_description}", fontsize=10)
        plt.show()

# Example usage
visualize_matches(response["matches"])
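
For a quick text-only view of the same matches, the sketch below prints each match's score and time range so you can jump to the right spot in the video. It assumes `timestamp_start` and `timestamp_end` are stored in seconds and that matches support the same dictionary-style access used in `visualize_matches`. Both helper names are hypothetical.

```python
def format_seconds(seconds):
    """Render a number of seconds as MM:SS (or H:MM:SS past one hour)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def summarize_matches(matches):
    """Build one line per match: similarity score, time range, and frame path."""
    lines = []
    for match in matches:
        meta = match["metadata"]
        start = format_seconds(meta["timestamp_start"])
        end = format_seconds(meta["timestamp_end"])
        lines.append(f"{match['score']:.3f}  {start}-{end}  {meta['filepath']}")
    return lines
```

For example, `print("\n".join(summarize_matches(response["matches"])))` lists the retrieved frames in ranked order.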

## Test Drive with specialized queries!

In our webinar, we pre-loaded our index with a few videos from the Pinecone YouTube channel.

Let's see how well we can query them!

#### Queries about Code


#### Queries in transcript but not onscreen

#### Queries onscreen but not in transcript


#### Finding and interpreting diagrams