# Multi-Modal RAG

**Note**: The [GPT-4V model by OpenAI](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo#:~:text=to%20Apr%202023-,gpt%2D4%2Dvision%2Dpreview,-GPT%2D4%20Turbo) is still in preview.

This example demonstrates how to perform [RAG](https://arxiv.org/abs/2005.11401) on images using OpenAI's new GPT-4V model.

At a high level, we're:

- Passing all images to GPT-4V and summarizing their contents.
- Embedding the summaries and adding links to the images in metadata.
- Using semantic search on a query to retrieve the most relevant image.
- Passing the full image and user query to GPT-4V for a final answer.

## Setup

In [None]:
Deno.env.set("OPENAI_API_KEY", "");

import { ChatOpenAI } from "npm:langchain@0.0.185/chat_models/openai";
import { Document } from "npm:langchain@0.0.185/document";
import { OpenAIEmbeddings } from "npm:langchain@0.0.185/embeddings/openai";
import { ChatPromptTemplate } from "npm:langchain@0.0.185/prompts";
import { HumanMessage } from "npm:langchain@0.0.185/schema";
import { StringOutputParser } from "npm:langchain@0.0.185/schema/output_parser";
import { RunnableSequence } from "npm:langchain@0.0.185/schema/runnable";
import { HNSWLib } from "npm:langchain@0.0.185/vectorstores/hnswlib";

Instantiate `ChatOpenAI` using the vision model and load in the images.

In [None]:
const model = new ChatOpenAI({
  modelName: "gpt-4-vision-preview",
  maxTokens: 1024,
}).pipe(new StringOutputParser());
// Load in images
const usNationalDebt = await Deno.readFile(
  "../examples/multi_modal_content/us_national_debt_chart.jpg"
);
const canadianNationalDebt = await Deno.readFile(
  "../examples/multi_modal_content/canadian_debt_by_gdp.jpg"
);
const mexicanNationalDebt = await Deno.readFile(
  "../examples/multi_modal_content/mexico_national_debt_monthly.jpg"
);

Create a dict containing all the images. This will be helpful later on when we want to retrieve a given image.

In [None]:
const imageDict = {
  us: usNationalDebt,
  canada: canadianNationalDebt,
  mexico: mexicanNationalDebt,
};
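
Because the key stored alongside each summary comes back later as a plain string, it can help to validate it before indexing into the dict. The following is a minimal sketch of such a lookup; `sampleImages`, `isImageKey`, and `getImage` are hypothetical names, and the one-byte arrays merely stand in for real image data.

```typescript
// Placeholder byte arrays standing in for the image files loaded above.
const sampleImages = {
  us: new Uint8Array([1]),
  canada: new Uint8Array([2]),
  mexico: new Uint8Array([3]),
};
type ImageKey = keyof typeof sampleImages; // "us" | "canada" | "mexico"

// Type guard: checks at runtime that an arbitrary string is one of the known keys.
function isImageKey(key: string): key is ImageKey {
  return Object.prototype.hasOwnProperty.call(sampleImages, key);
}

// Look up an image by key, failing loudly on unknown keys.
function getImage(key: string): Uint8Array {
  if (!isImageKey(key)) {
    throw new Error(`Unknown image key: ${key}`);
  }
  return sampleImages[key];
}

console.log(getImage("mexico").length); // 1
```

The guard replaces an unchecked `as keyof typeof ...` cast with a runtime check, at the cost of a few extra lines.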

## Summarization

Map over each image in the dict and create a prompt message, encoding the image in base64.

In [None]:
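const promptMessages = Object.keys(imageDict).map((key) => {
  const bytes = imageDict[key as keyof typeof imageDict];
  // Deno.readFile returns a Uint8Array, whose toString() ignores an encoding
  // argument, so build the base64 string with btoa instead.
  const base64Image = btoa(
    Array.from(bytes, (byte) => String.fromCharCode(byte)).join("")
  );
  return new HumanMessage({
    content: [
      {
        type: "text",
        text: "Describe the contents of this image in detail.",
      },
      {
        type: "image_url",
        image_url: {
          url: `data:image/jpeg;base64,${base64Image}`,
        },
      },
    ],
  });
});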
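
`Deno.readFile` resolves to a plain `Uint8Array`, and `Uint8Array.prototype.toString` ignores any encoding argument, so it is worth checking how the bytes end up encoded. Below is a minimal sketch of building a base64 data URL with only the web-standard `btoa`; the helper name `toDataUrl` and the sample bytes are illustrative and not part of the original notebook.

```typescript
// Hypothetical helper: encode raw bytes as a base64 data URL.
function toDataUrl(bytes: Uint8Array, mimeType = "image/jpeg"): string {
  // Map each byte to a one-character "binary string", which is what btoa expects.
  const binary = Array.from(bytes, (byte) => String.fromCharCode(byte)).join("");
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Sample bytes standing in for image data (the ASCII codes for "Hello").
const sampleBytes = new Uint8Array([72, 101, 108, 108, 111]);
const sampleUrl = toDataUrl(sampleBytes);

console.log(sampleUrl); // data:image/jpeg;base64,SGVsbG8=
console.log(sampleBytes.toString()); // 72,101,108,108,111 (comma-separated bytes, not base64)
```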

Invoke the model to generate a summary of each image.

In [None]:
const summaries = await Promise.all(
  promptMessages.map((message) => model.invoke([message]))
);
console.log(summaries);

## Embedding the summaries

Create a document for each summary, also including the image dict key as metadata so we can retrieve the actual image later on.

In [None]:
const documents = summaries.map(
  (summary, i) =>
    new Document({
      pageContent: summary,
      metadata: {
        imageKey: Object.keys(imageDict)[i],
      },
    })
);
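
Pairing `summaries[i]` with `Object.keys(imageDict)[i]` relies on `Promise.all` resolving to results in the same order as its inputs, regardless of which call finishes first. Here is a small sketch with a stubbed summarizer; `fakeSummarize` and its delays are made up for illustration and stand in for the model call.

```typescript
// Stub standing in for the vision model call; the delay simulates varying latency.
function fakeSummarize(key: string, delayMs: number): Promise<string> {
  return new Promise((resolve) =>
    setTimeout(() => resolve(`summary of ${key}`), delayMs)
  );
}

const keys = ["us", "canada", "mexico"];
const delays = [30, 10, 20]; // "canada" finishes first, "us" last

// Promise.all keeps input order, so index i still lines up with keys[i].
const orderedSummaries: Promise<string[]> = Promise.all(
  keys.map((key, i) => fakeSummarize(key, delays[i]))
);

orderedSummaries.then((results) => {
  results.forEach((summary, i) => console.log(keys[i], "->", summary));
});
```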

Initialize the vector store with `OpenAIEmbeddings` and the documents we created above. Then, instantiate the store as a retriever.

In [None]:
const vectorStore = await HNSWLib.fromDocuments(
  documents,
  new OpenAIEmbeddings()
);
const retriever = vectorStore.asRetriever();
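
Conceptually, the retriever embeds the query and returns the stored documents whose embeddings are closest to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The vectors, the `ToyDoc` type, and `retrieveTop` are invented for illustration, and HNSWLib itself uses an approximate nearest-neighbor index rather than this brute-force scan.

```typescript
// Toy stand-in for a stored document: an embedding plus the image key kept in metadata.
interface ToyDoc {
  imageKey: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: score every document and keep the best match.
function retrieveTop(query: number[], docs: ToyDoc[]): ToyDoc {
  return docs.reduce((best, doc) =>
    cosineSimilarity(query, doc.embedding) >
    cosineSimilarity(query, best.embedding)
      ? doc
      : best
  );
}

// Made-up embeddings; real ones come from the embedding model and have many more dimensions.
const toyDocs: ToyDoc[] = [
  { imageKey: "us", embedding: [0.9, 0.1, 0.0] },
  { imageKey: "canada", embedding: [0.1, 0.9, 0.1] },
  { imageKey: "mexico", embedding: [0.0, 0.2, 0.9] },
];
const toyQuery = [0.1, 0.1, 0.8]; // pretend embedding of a question about Mexico

console.log(retrieveTop(toyQuery, toyDocs).imageKey); // mexico
```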

## Prompts

Create a `HumanMessage` prompt that will contain the image most relevant to the user's question. Since we do not know which image that is yet, we use an input variable, `{imageString}`, which is replaced with the base64-encoded image at runtime.

Then, create a `ChatPromptTemplate` with the `imageMessage` and an input variable for the user's question.

In [None]:
const imageMessage = new HumanMessage({
  content: [
    {
      type: "image_url",
      image_url: {
        url: "data:image/jpeg;base64,{imageString}",
      },
    },
  ],
});
const prompt = ChatPromptTemplate.fromMessages([
  ["ai", "Answer the user's question using the provided image."],
  ["human", "{question}"],
  imageMessage,
]);
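
At format time, the `{imageString}` and `{question}` placeholders are filled with the values supplied to the chain. The following is a rough sketch of that substitution on plain strings; `fillTemplate` is a hypothetical helper written for illustration and is not LangChain's implementation.

```typescript
// Hypothetical helper: replace each {name} placeholder with the matching value.
function fillTemplate(
  template: string,
  values: Record<string, string>
): string {
  return template.replace(/\{(\w+)\}/g, (match: string, name: string) =>
    // Leave placeholders without a supplied value untouched.
    name in values ? values[name] : match
  );
}

const urlTemplate = "data:image/jpeg;base64,{imageString}";
const filledUrl = fillTemplate(urlTemplate, { imageString: "SGVsbG8=" });

console.log(filledUrl); // data:image/jpeg;base64,SGVsbG8=
```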

## Construct the chain

Here we take a single input, the user's question, and perform a similarity search to find the most relevant document. We use the first returned document, since in our case we know only one document will match the question.

Then, using the image key in the metadata, we retrieve the relevant image and encode it so it can be passed into the prompt.

In [None]:
const chain = RunnableSequence.from([
  async (input: string) => {
    const relevantDoc = (await retriever.getRelevantDocuments(input))[0];
    const imageKey = relevantDoc.metadata.imageKey as keyof typeof imageDict;
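    // Deno.readFile returns a Uint8Array, whose toString() ignores an encoding
    // argument, so build the base64 string with btoa instead.
    const imageString = btoa(
      Array.from(imageDict[imageKey], (byte) => String.fromCharCode(byte)).join("")
    );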
    return {
      imageString,
      question: input,
    };
  },
  prompt,
  model,
]);
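
A `RunnableSequence` runs its steps in order, feeding each step's output into the next. The sketch below mimics that flow with plain functions. `runSteps` and the three stub steps are hypothetical stand-ins for the retrieval function, the prompt, and the model; they are not LangChain APIs.

```typescript
// A step takes the previous step's output and returns a value or a promise of one.
type Step = (input: any) => Promise<any> | any;

// Run each step in order, awaiting its result before calling the next one.
async function runSteps(steps: Step[], input: unknown): Promise<unknown> {
  let value = input;
  for (const step of steps) {
    value = await step(value);
  }
  return value;
}

// Stub steps standing in for retrieval + encoding, prompt formatting, and the model.
const lookupImage = async (question: string) => ({
  question,
  imageString: "SGVsbG8=", // placeholder base64 payload
});
const formatPrompt = (vars: { question: string; imageString: string }) =>
  `${vars.question} [image: ${vars.imageString.length} base64 chars]`;
const fakeModel = async (promptText: string) => `answer to: ${promptText}`;

const sketchResult = runSteps(
  [lookupImage, formatPrompt, fakeModel],
  "What does the chart show?"
);
sketchResult.then((answer) => console.log(answer));
// answer to: What does the chart show? [image: 8 base64 chars]
```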

Finally, invoke the chain and sit back while the magic happens.

In [None]:
const response = await chain.invoke(
  "How much was Mexico's national debt increasing by on a monthly basis?"
);
console.log("response\n", response);