# Iva's Art RAG Pipeline

Still a work in progress, but it's coming along!

This first cell imports our dependencies and loads the wikiart-subjects dataset.
It also creates a small_dataset of the first 5 dataset rows for testing purposes.

Everything here is in order that I worked on it. You'll see that I got a little too absorbed with trying to set up a FAISS vector database, that I didn't have time to dig deeper into actually using Claude more deeply.

But I DO at least get connected to everything -- I can generate embeddings and get responses from Claude via the AWS client I was provided with.

The next step is to finish turning my slapdash tinkering with FAISS into a working vector DB. After that, the ACTUAL fun begins: Taking user input (text queries and/or images), embedding the query, doing a relevance search with FAISS, getting the most relevant wikidata-art entries, and giving those to Claude to use as context while generating its response to the query. There is some loose pseudocode at the end of the notebook describing the process.

In [1]:
# This requires two dependencies: langchain and langchain_aws
# You also need datasets to use the wikiart-subject dataset, and datasets requires Pillow to decode images
# I've also chosen to use FAISS as a vector store & cache, which requires faiss-cpu and langchain-community
# pip install langchain langchain_aws
# pip install datasets Pillow
# pip install faiss-cpu langchain-community

# Available models:
# amazon.titan-embed-text-v1
# amazon.titan-embed-image-v1
# anthropic.claude-3-5-sonnet-20240620-v1:0
# cohere.embed-multilingual-v3
# meta.llama3-70b-instruct-v1:0

import boto3
from langchain_aws.chat_models.bedrock import ChatBedrock
from langchain_aws.embeddings.bedrock import BedrockEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.docstore.in_memory import InMemoryDocstore

from datasets import load_dataset
import faiss
import os
import pandas as pd
import hashlib
import json

# Load the dataset:
# https://huggingface.co/datasets/jlbaker361/wikiart-subjects
full_dataset = load_dataset("jlbaker361/wikiart-subjects")

# For development, let's use a smaller subset of the full dataset, since it's quite large (815MB)
# Let's take a 5% random sample from the "train" split.
#small_dataset = full_dataset["train"].train_test_split(test_size=0.05)["test"]

# Optionally, if we set also a seed, we'll get the same subset each time; the consistency can be handy for testing & debugging.
small_dataset = full_dataset["train"].train_test_split(test_size=0.05, seed=42)["test"]

# Just to see that the data is there, convert it to DataFrame and display the first few rows
small_dataset_df = small_dataset.to_pandas()
print(small_dataset_df.head())  # Display the first 5 rows

                                               image  \
0  {'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...   
1  {'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...   
2  {'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...   
3  {'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...   
4  {'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...   

                                             text               style  
0   the cover of the book the magician's daughter  art-nouveau-modern  
1               the adoration of the holy trinity             baroque  
2  a painting of a man on horseback with two dogs  art-nouveau-modern  
3          a painting of a woman in a white dress             baroque  
4           a painting of a woman laying on a bed       expressionism  


## Connect to AWS and start a session

In [None]:
# Start a session with AWS via the Boto3 Python SDK.
session = boto3.Session(
  aws_access_key_id='[AWS_ACCESS_KEY_ID]',
  aws_secret_access_key='[AWS_SECRET_ACCESS_KEY]',
  region_name='us-east-1'
)

# Connect to Bedrock services so we can access models for embeddings & chat.
client = session.client('bedrock-runtime')

## WIP Creating embeddings and storing them in a vector database

Now we need to create embeddings for our dataset and store them in a vector database.
I have chosen FAISS as a vector db because it runs locally instead of in the cloud, so that I don't have to sign up for an account, a subscription, etc.

For added efficiency, I'm also attempting to store the FAISS db in a local index file, to act as a cache so that I don't recreate embeds over and over if I restart Jupyter, preventing wasteful and costly API calls to AWS.

Just for my own reference, I've saved links where I can read more about the titan models:
- https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html
- https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html

In [3]:
text_embed_model = 'amazon.titan-embed-text-v1'
image_embed_model = 'amazon.titan-embed-image-v1'
INDEX_PATH = "faiss_index_file"  # File path for saving/loading FAISS index
dimension = 1024  # Default vector dimensions for Amazon's titan embed models; 384 and 256 also supported
docstore = InMemoryDocstore({})
index_to_docstore_id = {}  # Empty mapping to start

# Helper function: generates a text embedding
def generate_text_embedding(text):
    response = client.invoke_model(
        modelId=text_embed_model,
        body=json.dumps({"inputText": text})
    )
    #print("Response keys:", response.keys()) #dict_keys(['ResponseMetadata', 'contentType', 'body'])
    response_body = json.loads(response['body'].read())    
    embedding = response_body['embedding']
    return embedding

# Helper function: generates an image embedding
def generate_image_embedding(image_data):
    response = client.invoke_model(
        modelId=image_embed_model,
        body={"image": image_data}
    )
    response_body = json.loads(response['body'].read())
    embedding = response_body['embedding']
    return embedding

# Helper function: generates a unique ID based on text it's given
# You could also try hashlib.sha256 or hashlib.sha3_256 for improved collision (duplicate) resistance
def get_unique_id(text):
    return hashlib.md5(text.encode()).hexdigest()

# Step 1: Load or Initialize FAISS Index
if os.path.exists(INDEX_PATH):
    print("Loading existing FAISS index...")
    index = faiss.read_index(INDEX_PATH)
else:
    print("No FAISS index found. Initializing a new one...")
    index = faiss.IndexFlatL2(dimension)  # L2 distance index

# Step 2: Add rows to FAISS if they are not already there
def add_to_faiss_if_missing(row_data):
    text, style, image = row_data["text"], row_data["style"], row_data["image"]
    
    # Check if already in FAISS
    # docstore.search() may not be performant for large-scale applications; Redis may be better...
    unique_id = get_unique_id(f"{text} - Style: {style}")
    if not docstore.search(unique_id):
        # Embed each component
        text_embedding = generate_text_embedding(text)
        style_embedding = generate_text_embedding(style)
        #image_embedding = generate_image_embedding(image)
    
        # Add to FAISS
        faiss_index = index.ntotal  # Get the next available index in FAISS
        #vector_db.add_texts([unique_id], embeddings=[text_embedding + style_embedding + image_embedding])
        vector_db.add_texts([unique_id], embeddings=[text_embedding + style_embedding])
        docstore[doc_id] = unique_id

# Step 3: Process dataset and add rows to FAISS as necessary
vector_db = FAISS(
    index=index,
    embedding_function=add_to_faiss_if_missing,
    docstore=docstore,
    index_to_docstore_id=index_to_docstore_id
)

for row in small_dataset:  # Assuming `dataset` is iterable with each row as a dictionary of `text`, `style`, `image`
    add_to_faiss_if_missing(row)

# Step 4: Save the FAISS index
faiss.write_index(vector_db.index, INDEX_PATH)
# TODO: can the docstore also be saved to disk like this?
print("FAISS index saved to disk.")


`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


Loading existing FAISS index...
FAISS index saved to disk.


In [4]:
# Test to see how many embeddings are there:
print(f"Number of embeddings in FAISS index: {index.ntotal}")


Number of embeddings in FAISS index: 0


## Sending a single message to Claude for demo purposes

Since I'm still working on the above embedding workflow with a FAISS vector DB, here is at least a working call that sends a query to Claude and receives a response.

I chose "translate the text to French" after seeing it in an example in documentation. An interesting 'Hello World' of sorts.

Naturally, Claude will kindly explain that it cannot translate vectors into French, and it needs the text.
I was too focused on trying to send SOMETHING to Claude that I forgot I don't need to send it the _vectors themselves_ at all... ;-)

In [5]:
# I've heard Claude is nice in conversation, so let's try it!

# Test: Make sure embedding functions work:
first_row = small_dataset[0]
print("First row data:", first_row)

sample_text = first_row["text"]
sample_style = first_row["style"]
#sample_image = first_row["image"]  # Ensure this is in the correct format for `generate_image_embedding`

# Run embedding functions
sample_text_embedding = generate_text_embedding(sample_text)
sample_style_embedding = generate_text_embedding(sample_style)
#print("\n\nText embedding:", sample_text_embedding)
#print("\n\nStyle embedding:", sample_style_embedding)
#print("Image embedding:", generate_image_embedding(sample_image))

# Convert embeddings to text for demonstration purposes
embedding_text = f"Text embedding: {sample_text_embedding[:5]}... Style embedding: {sample_style_embedding[:5]}..."  # Truncate for readability

# Formulate a prompt using the embeddings
# Define your conversation in the required Messages API format
messages = [
    {
        "role": "user",
        "content": f"The following are embeddings based on a text and style: {embedding_text}. Translate the original text and style description to French."
    }
]

# The payload includes the "messages" key and other required parameters
payload = {
    "messages": messages,
    "max_tokens": 200,  # Specifies the maximum tokens for Claude's response
    "anthropic_version": "bedrock-2023-05-31"  # Required version field
}


# Make the request to invoke Claude's model
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps(payload),  # Adjust field as required by Claude's setup
    contentType="application/json",
    accept="application/json"
)

# Parse the response
response_body = json.loads(response["body"].read())
#print("Response body keys:", response_body.keys())
#Response body keys: dict_keys(['id', 'type', 'role', 'model', 'content', 'stop_reason', 'stop_sequence', 'usage'])

# Extract and print the content under the "content" key
claude_response = response_body.get("content", [{}])[0].get("text", "No content available.")
print("Claude's response:", claude_response)

First row data: {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=187x500 at 0x158D002C0>, 'text': "the cover of the book the magician's daughter", 'style': 'art-nouveau-modern'}
Claude's response: I apologize, but I'm not able to directly translate the original text and style description to French based solely on the embeddings provided. Embeddings are numerical representations of text that capture semantic meaning, but they don't contain the actual words or content of the original text.

To translate the original text and style description to French, I would need access to the actual text and style description in their original language (presumably English). Embeddings alone do not contain enough information to reconstruct the original text or perform a translation.

If you have the original text and style description available, I'd be happy to assist with translating those to French. Otherwise, I can only provide general information about embeddings and their use in n

## How the RAG process should work eventually

PSEUDOCODE for how my RAG process should work in a nutshell, when it all comes together.

After I connected to Claude I realized: At no point should I actually need to send the embeddings to Claude.
Instead, I should be using my own vector DB to decide which of my embeddings are most relevant.
I should embed each query, do a similarity search on the DB, gather the most relevant data, and then send the text to Claude for it to use as context in its response.

I had set the image embeddings aside to just focus on the text first, trying to take baby steps, since all of these tools are new to me.
But the image embeddings will be interesting down the line as well, because it can compare the images in the DB to one another, or to a human-provided image (kind of like Google's reverse image search).

```python
# # Step 1: Initialize embedding and chat models
embedder = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
chat_model = ChatBedrock(model_id="claude-v1")

# Step 2: Generate and store embeddings for your dataset (only done once)
document_texts = ["text of document 1", "text of document 2", ...]  # Replace with actual documents
document_embeddings = [embedder.embed_query(doc) for doc in document_texts]

# Step 3: When a query is received, embed it and find relevant documents
query = "What is the summary of X topic?"
query_embedding = embedder.embed_query(query)

# Find top matches in FAISS or another similarity tool (pseudo-code)
relevant_doc_ids = faiss_index.search(query_embedding, top_k=5)  # `top_k` is number of relevant docs to retrieve

# Step 4: Retrieve the relevant texts and construct the prompt
relevant_texts = [document_texts[i] for i in relevant_doc_ids]
context = "\n\n".join([f"Passage {i+1}: {text}" for i, text in enumerate(relevant_texts)])
prompt = f"{context}\n\nQuestion: {query}"

# Step 5: Send prompt to Claude for a generated response
response = chat_model(prompt)
print("Claude's RAG-based response:", response.content)
```