## Contextual Video RAG over Webinars with Pinecone, Anthropic and AWS

Welcome to the workshop! In this notebook, we'll set up a simple video RAG workflow using Pinecone, Claude, and AWS. We'll take a set of input videos and ingest them, using Claude for both pre- and post-processing, to enable a contextual RAG experience over a traditionally vexing dataset.


We'll need to create the following preprocessing pipeline to make it all work:

In doing so, we'll convert a multimodal search problem into a purely text-based one, and leave the complex multimodal ingestion to Claude via the Bedrock API. This saves us time and some complexity on the multimodal embedding front!

## Installing dependencies and scripting...

In [None]:
# Important environment variables
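
The cell above is empty in this export. As a minimal sketch of the kind of check that could go here, assuming the only secret the later cells read directly is `PINECONE_API_KEY` (AWS credentials are left to boto3's standard credential chain), the hypothetical helper `missing_env_vars` reports which required variables are unset:

```python
import os


def missing_env_vars(names):
    """Return the names from `names` that are unset or empty in the environment."""
    return [name for name in names if not os.environ.get(name)]


# PINECONE_API_KEY is read later via os.getenv; extend this list as needed.
REQUIRED_ENV_VARS = ["PINECONE_API_KEY"]

missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Failing early here is cheaper than discovering a missing key after the transcription and annotation steps have already run.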

## Video Data as Input

Let's begin by transcribing and processing our video data.

**Don't forget to upload your videos manually into the data folder!**

### Video preprocessing: Transcription and Frames





In [None]:
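import json
import os

from preprocessing.config import data_dir, videos_dir
# The star import is expected to provide transcribe_video, extract_frames,
# assign_words_to_frames, and INTERVAL.
from preprocessing.preprocess_videos import *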


video_files = os.listdir(videos_dir)
print(video_files)
# add root dir to video files
video_files = [os.path.join(videos_dir, f) for f in video_files]

all_videos_data = {}
transcriptions_dir = os.path.join(data_dir, "transcriptions")
frames_and_words_dir = os.path.join(data_dir, "frames_and_words")
frames_dir = os.path.join(data_dir, "frames")
# Folder setup: create each output folder if it does not already exist.
# Existing contents are left in place; clear the folders to start fresh.
for directory in (transcriptions_dir, frames_and_words_dir, frames_dir):
    os.makedirs(directory, exist_ok=True)

for video_path in video_files:
    transcription = transcribe_video(video_path)

    video_filename = os.path.splitext(os.path.basename(video_path))[0]
        
    # Write transcription out as json
    transcription_filename = os.path.join(transcriptions_dir, video_filename + "_transcription.json")
    with open(transcription_filename, "w") as f:
        json.dump(transcription["text"], f)
        
    frames = extract_frames(frames_dir, video_path, INTERVAL)
    frames_and_words = assign_words_to_frames(transcription, frames)
        
    frames_and_words_filename = os.path.join(frames_and_words_dir, video_filename + "_frames_and_words.json")
    with open(frames_and_words_filename, "w") as f:
        json.dump(frames_and_words, f)
        
    all_videos_data[video_filename] = {
        "transcription": transcription_filename,
        "frames_and_words": frames_and_words_filename,
    }

# Write all_videos_data to a summary file for the later cells to read
all_videos_data_path = os.path.join(data_dir, "all_videos_data.json")
with open(all_videos_data_path, "w") as f:
    json.dump(all_videos_data, f)
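
The cell above relies on `assign_words_to_frames` from the `preprocessing` module, which is not shown in this notebook. The sketch below illustrates one way such a step might work, not the module's actual implementation. It assumes word-level timestamps shaped like `{"word", "start"}` and frames that carry a `[start, end]` time window; the output keys (`frame_path`, `words`, `timestamp`) mirror the ones the later cells read. The function name and sample data are hypothetical.

```python
def assign_words_to_frames_sketch(words, frames):
    """Attach each timed word to the frame whose time window contains its start.

    words:  list of {"word": str, "start": float}                (assumed shape)
    frames: list of {"frame_path": str, "timestamp": [start, end]} (assumed shape)
    """
    result = []
    for frame in frames:
        start, end = frame["timestamp"]
        # Half-open interval so a word on a boundary lands in exactly one frame
        spoken = [w["word"] for w in words if start <= w["start"] < end]
        result.append({
            "frame_path": frame["frame_path"],
            "words": " ".join(spoken),
            "timestamp": [start, end],
        })
    return result


example_words = [
    {"word": "Welcome", "start": 0.5},
    {"word": "everyone", "start": 1.0},
    {"word": "today", "start": 16.2},
]
example_frames = [
    {"frame_path": "frames/demo_0.png", "timestamp": [0, 15]},
    {"frame_path": "frames/demo_1.png", "timestamp": [15, 30]},
]
pairs = assign_words_to_frames_sketch(example_words, example_frames)
```

Each resulting pair couples one frame image with the words spoken while it was on screen, which is the unit Claude annotates in the next section.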

## Using Claude on Ingest

### Contextual Retrieval Primer



### Using Claude to Annotate data



#### Step 1: Transcript Summary



In [None]:
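import base64

from anthropic import AnthropicBedrock

# MODEL and MAX_TOKENS are assumed to be defined in the setup cell above.


def convert_image_to_base64(image_path):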
    with open(image_path, "rb") as image_file:
        binary_data = image_file.read()
        base_64_encoded_data = base64.b64encode(binary_data)
        base64_string = base_64_encoded_data.decode('utf-8')
    return base64_string


def format_messages_for_claude(user_query, vdb_response):
    """
    Formats the user's query and the vector database response into a structured message for Claude.
    
    Args:
        user_query (str): The user's query.
        vdb_response (list): The response from the vector database, containing images and text.
    
    Returns:
        list: A list of messages formatted for Claude.
    """
    messages = [{"role": "user", "content": []}]
    # Start with the user's query
    new_content = [{"type": "text", "text": "The user query is: " + user_query}]
    # For each retrieved frame, append four content blocks in order:
    # the image path, the image itself, its contextual description, and its transcript.

    for item in vdb_response:
        img_b64 = convert_image_to_base64(item["metadata"]["filepath"])
        new_content.extend([
            {
            "type": "text",
            "text": "Image: " + item["metadata"]["filepath"],
            },
            {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": img_b64
            }
            },
            {
            "type": "text",
            "text": "Contextual description: " + item["metadata"]["contextual_frame_description"]
            },
            {
                "type": "text",
                "text": "Transcript: " + item["metadata"]["transcript"]
            }
        ])
    # Attach the assembled content blocks to the single user message
    messages[0]["content"] = new_content
    return messages

def ask_claude_vqa_response(user_query, vdb_response):
    """
    Sends the user's query and the vector database response to Claude and gets a response.
    
    Args:
        user_query (str): The user's query.
        vdb_response (list): The response from the vector database, containing images and text.
    
    Returns:
        str: The response from Claude.
    """
    client = AnthropicBedrock()
    messages = format_messages_for_claude(user_query, vdb_response)
    system_prompt = '''

You are a friendly assistant helping people interpret their videos at their company.

You will receive frames of these videos, with descriptions of what has happened in the frames, as well as a user query

Your job is to ingest the images and text, and respond to the user's query or question based on the context provided.

Refer back to the images and text provided to guide the user to the appropriate slide, section, webinar, or talk
where the information they are looking for is located.
    '''
    response = client.messages.create(
            model=MODEL,
            max_tokens=MAX_TOKENS * 10,
            system=system_prompt,
            messages=messages
        )
    return response.content[0].text
    


def ask_claude(img, text):
    # Best for one-off queries, with or without an image
    client = AnthropicBedrock(aws_region="us-east-1")
    if img:
        img_b64 = convert_image_to_base64(img)
        message = client.messages.create(
            model=MODEL,
            max_tokens=MAX_TOKENS,
            messages=[
            {
                "role": "user", 
                "content": [
                    {"type": "image", "source": 
                        {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": img_b64
                        }
                    },
                    {"type": "text", "text": text}
                ]
            }
        ]
        )
    else:
        message = client.messages.create(
            model=MODEL,
            max_tokens=MAX_TOKENS,
            messages=[{"role": "user", "content": text}]
        )
    return message.content[0].text



def make_claude_transcript_summary(transcript):
    client = AnthropicBedrock(aws_region="us-east-1")

    # The prompt already ends with a colon, so join with a single space
    prompt = "Summarize the following transcript, being as concise as possible:"
    message = client.messages.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt + " " + transcript}],
        max_tokens=MAX_TOKENS
    )
    return message.content[0].text
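
The helpers above hardcode `"media_type": "image/png"` for every frame. If the extracted frames are ever saved in another format such as JPEG, the declared type should match the file. The sketch below uses the standard-library `mimetypes` module to guess the type from the file extension; `guess_image_media_type` is a hypothetical helper, not part of the original notebook.

```python
import mimetypes


def guess_image_media_type(image_path, default="image/png"):
    """Guess the MIME type from the file extension, falling back to `default`."""
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type and media_type.startswith("image/"):
        return media_type
    return default
```

The result could then replace the literal, for example `"media_type": guess_image_media_type(img)`. The guess is based on the extension only, so a mislabeled file would still be declared incorrectly.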




#### Step 2: Meta Prompt 



In [None]:
def create_contextual_frame_description(frame_caption_index, frame_caption_pairs, transcript_summary, window=60, frame_width=15):
    # Each frame-caption pair holds a frame image path and the words spoken during it.
    # window is in seconds and is currently unused.

    # Gather recent context: up to 4 * frame_width entries before the current frame.
    # max(0, ...) keeps the slice in bounds near the start of the video.
    # This context is not used in the prompt while the summary step below is disabled.
    surrounding_frames = frame_caption_pairs[max(0, frame_caption_index - 4 * frame_width):frame_caption_index + 1]

    current_frame = frame_caption_pairs[frame_caption_index]

    # Summarizing past frames is disabled for now
    # past_frames_summary = make_claude_transcript_summary(" ".join([f["words"] for f in surrounding_frames]))
    meta_prompt = f'''

    You are watching a video and trying to explain what has happened in the video using a global summary, some recent context, and the transcript of the current frame.

    The video has been summarized as follows:
    {transcript_summary}

    The current frame's transcript is as follows:
    {current_frame["words"]}

    You also want to provide a description of the current frame based on the context provided.

    Please describe this video snippet using the information above in addition to the frame visual. Explain any diagrams or code or important text that appears on screen,
    especially if the snippet is of a slide or a code snippet. If there are only people in the frame, focus on the transcript and the context provided to describe what has
    been talked about. If a question was asked, and answered, include the question and answer in the description as well.

    Description:
    '''

    rich_summary = ask_claude(img=current_frame["frame_path"], text=meta_prompt)
    return rich_summary
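
The function above slices out the preceding frames but does not yet feed them into the prompt. As a minimal sketch of how a clamped context window could be gathered in both directions and turned into a "recent context" string, assuming the same `words` key used above (the helper name `context_window` and the sample data are hypothetical):

```python
def context_window(items, index, before=4, after=0):
    """Return items[index - before : index + after + 1], clamped to the list bounds."""
    start = max(0, index - before)
    stop = min(len(items), index + after + 1)
    return items[start:stop]


example_pairs = [{"words": f"sentence {n}"} for n in range(6)]

# Near the start of the video, the window shrinks instead of wrapping around
recent_context = " ".join(p["words"] for p in context_window(example_pairs, 1, before=4))
```

A string like `recent_context` could be added to the meta prompt as its own section, at the cost of more input tokens per frame.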

#### Step 3: Putting it all Together

In [None]:
from tqdm import tqdm

import json
with open("./data/all_videos_data.json", "r") as f:
    all_videos_data = json.load(f)


finalized_data = []

for video, data in all_videos_data.items():
    with open(data["transcription"], "r") as f:
        transcript = json.load(f)
    with open(data["frames_and_words"], "r") as f:
        frame_caption_pairs = json.load(f)

    transcript_summary = make_claude_transcript_summary(transcript=transcript)

    print(transcript_summary)

    # Wrap the list itself so tqdm knows the total and can show progress
    for i, pair in enumerate(tqdm(frame_caption_pairs)):
        contextual_frame_description = create_contextual_frame_description(
            frame_caption_index=i,
            frame_caption_pairs=frame_caption_pairs,
            transcript_summary=transcript_summary)
        # Collect the enriched frame-caption pair
        new_pair = {
            "frame_path": pair["frame_path"],
            "words": pair["words"],
            "timestamp": pair["timestamp"],
            "transcript_summary": transcript_summary,
            "contextual_frame_description": contextual_frame_description
        }
        finalized_data.append(new_pair)

# write out the finalized data
with open("./data/finalized_data.json", "w") as f:
    json.dump(finalized_data, f)


## Using Pinecone

### What is Pinecone?

### Using AWS Bedrock: Titan Text Embedding Models

### Creating Index

### A Note about Metadata

### Embedding + Upsertion

### And we're done!

In [None]:
from pinecone import Pinecone, ServerlessSpec
import boto3

boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
bedrock_client = boto3.client(
    "bedrock-runtime",
    region_name=region_name,
)


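# Embedding code
def titan_text_embedding(
    text: str,  # Titan Text Embeddings V2 accepts up to 8,192 input tokens
    dimension: int = 1024,  # 1,024 (default), 512, or 256
    model_id: str = "amazon.titan-embed-text-v2:0"
):
    payload_body = {
        "inputText": text,
        # Pass the requested size through; it must match the Pinecone index dimension
        "dimensions": dimension,
    }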

    response = bedrock_client.invoke_model(
        body=json.dumps(payload_body),
        modelId=model_id,
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get("body").read())

    finish_reason = response_body.get("message")

    if finish_reason is not None:
        raise Exception(f"Embeddings generation error: {finish_reason}")

    return response_body


# read in as json
file_path = './data/finalized_data.json'
with open(file_path, "r") as f:
    data = json.load(f)


values_to_embed = [item["contextual_frame_description"] for item in data]

embeddings = []

# For large number of embeddings, take care to respect rate limits
for v in tqdm(values_to_embed):
    embedding = titan_text_embedding(text=v)
    embeddings.append(embedding["embedding"])


final_vectors = []
# Simple sequential ids. Re-running with the same ids overwrites existing vectors in the index.
ids = list(range(len(values_to_embed)))

for v, e, vector_id in tqdm(zip(data, embeddings, ids), total=len(ids)):
    final_vectors.append({
        "id": str(vector_id),
        "values": e,
        "metadata": {
            "transcript": v["words"],
            "filepath": v["frame_path"],
            "timestamp_start": v["timestamp"][0],
            "timestamp_end": v["timestamp"][1],
            "contextual_frame_description": v["contextual_frame_description"]
        }
    })
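
The embedding loop above notes that rate limits matter for large batches. One common way to handle throttling is to retry with exponential backoff and jitter. The sketch below is a generic, hypothetical helper (`call_with_backoff` is not part of boto3 or this notebook), and it retries on any exception for simplicity. In practice it would be safer to retry only on throttling errors.

```python
import random
import time


def call_with_backoff(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # The base delay doubles on every attempt, plus up to 100% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

With this in place, the loop body could call `call_with_backoff(titan_text_embedding, text=v)` instead of calling `titan_text_embedding` directly. The `sleep` parameter exists only so the delay can be stubbed out in tests.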
    




In [None]:
import pandas as pd

df = pd.DataFrame(final_vectors)
print(df.head())
index_name = "enriched-claude-vqa"
# Reads the API key from the environment; replace as necessary
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))


# create index
if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        ) 
    ) 

index = pc.Index(index_name)
# Handy helper function for upsertion!
index.upsert_from_dataframe(df)
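
`upsert_from_dataframe` handles batching internally. If you would rather upsert the list of vector dictionaries directly, a small batching helper keeps each request to a manageable size. This is a sketch under assumptions: `batched` is a hypothetical helper, and the batch size of 100 in the usage note is an illustrative choice rather than a documented requirement.

```python
def batched(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

A loop such as `for batch in batched(final_vectors, 100): index.upsert(vectors=batch)` would then send the same `id`, `values`, and `metadata` dictionaries built earlier, one batch per request.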

## Setting up the RAG workflow

### Embed Query with Titan

### Get results back with metadata

### Read in images, pass to Claude, and generate response


In [None]:
query_text = "Find me code samples for learning how to use Pinecone with multilingualism"

query_embedding = titan_text_embedding(text=query_text)

response = index.query(vector=query_embedding["embedding"], top_k=5, include_metadata=True)


claude_explanation = ask_claude_vqa_response(query_text, response["matches"])
print(claude_explanation)
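
Before reading Claude's answer, it can help to inspect what was actually retrieved. The sketch below formats each match as one line with its score, time range, and frame path. It assumes matches support dictionary-style access and carry the metadata keys written during upsertion; `summarize_matches` and the sample match are hypothetical.

```python
def summarize_matches(matches):
    """Build one readable line per match: score, time range, and frame path."""
    lines = []
    for match in matches:
        meta = match["metadata"]
        lines.append(
            f'{match["score"]:.3f}  '
            f'{meta["timestamp_start"]:.0f}s-{meta["timestamp_end"]:.0f}s  '
            f'{meta["filepath"]}'
        )
    return lines


example_matches = [
    {
        "score": 0.8123,
        "metadata": {
            "timestamp_start": 15.0,
            "timestamp_end": 30.0,
            "filepath": "frames/demo_1.png",
        },
    }
]
for line in summarize_matches(example_matches):
    print(line)
```

Running the same helper on `response["matches"]` gives a quick check on whether a weak answer comes from retrieval or from generation.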

#### Queries about Code


#### Queries in transcript but not onscreen

#### Queries onscreen but not in transcript


#### Finding and interpreting diagrams