## Semi-Structured RAG For Private Data

In this homework assignment, you will build a **Retrieval Augmented Generation (RAG)** system.

Your objective is to construct a system that retrieves information from a **private database** of PDFs. These PDFs contain a variety of content, including text, images, and tables.

The challenge lies in preserving all of these components while efficiently extracting the data relevant to a user's input question.

- First, you will extract text from the PDFs and compute text embeddings for later comparison with the user's input.
- Next, you will implement a process that identifies and retrieves the information most relevant to the user's query.
- Because some texts are too long, you will summarize them and pass the summary of the most similar text to the LLM as input.
- Then, you will integrate the retrieved information with a Large Language Model (LLM) to generate contextually relevant responses to user queries.
- Finally, you will extend this mechanism to a multimodal setting: convert the PDF images to CLIP embeddings, compare the CLIP text embedding of the input with the ground truth image embeddings, and find the image most similar to the input text.
- Because the LLM is unimodal, the images cannot be given to it directly. Instead, image captions are used in the LLM's input.

This approach is meant to ensure that no valuable information is lost, so that the system combines the knowledge embedded in the PDFs with the capabilities of the LLM.

Instruction:

<font color='77CC99'>Follow the green text and fill out the notebook.</font>


<img src='https://drive.google.com/uc?id=1kODk16WWrn9DqvaWoEAekHRXup1djGjl' width="75%">

## Packages

In [1]:
%%capture
# run an `apt-get update`, just to be sure
!apt-get update

In [2]:
%%capture
# Restart the kernel after the first installation
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr
!pip install pytesseract
# For image extraction from PDF
!pip install PyMuPDF
!pip install Pillow
# Text embedding
!pip install -U sentence-transformers
!pip install -q transformers accelerate "bitsandbytes>=0.39.0"

In [3]:
%%capture
# Perform a full installation of unstructured and some dependencies
!apt-get install -y libmagic-dev
!pip install "unstructured[all-docs]"

# 0 - Loading Data

## 0.1 - Downloading the PDF

In [4]:
from pathlib import Path
import urllib.request

# Define the name of the PDF file and then download it
file_name = "Dall_E_paper"

url = "https://arxiv.org/pdf/2204.06125.pdf"
file_path = f"{file_name}.pdf"
urllib.request.urlretrieve(url, file_path)

('Dall_E_paper.pdf', <http.client.HTTPMessage at 0x7dfd58cd2ec0>)

## 0.2 - Extract Images and Texts

Implement mechanisms to extract images and texts from the downloaded PDFs.

In [5]:
!which pdftotext

/usr/bin/pdftotext


In [6]:
import pytesseract
print(pytesseract.get_tesseract_version())

4.1.1


In [7]:
# Import required dependencies
import fitz
import os
from PIL import Image

### Step 0.2.1: Extract and Store Images

In [8]:
# Open PDF file
pdf_file = fitz.open(file_path)

# Calculate number of pages in PDF file
page_nums = len(pdf_file)

# Create an empty list to store images information
images_list = []

# Extract all images information from each page
for page_num in range(page_nums):
    page_content = pdf_file[page_num]
    images_list.extend(page_content.get_images(full=True))  # Ensure full image info is retrieved

In [9]:
import io

images_path = "./images/"
Path(images_path).mkdir(parents=True, exist_ok=True)

# Save all the extracted images and convert to PNG if necessary
for i, image_info in enumerate(images_list, start=1):
    xref = image_info[0]
    base_image = pdf_file.extract_image(xref)
    image_bytes = base_image['image']
    image_ext = base_image['ext']

    # Convert .ppm images to .png
    if image_ext.lower() == 'ppm':
        image_ext = 'png'
        image = Image.open(io.BytesIO(image_bytes))
        image_bytes = io.BytesIO()
        image.save(image_bytes, format='PNG')

    image_name = f"{file_name}_{i}.{image_ext}"
    with open(os.path.join(images_path, image_name), 'wb') as image_file:
        image_file.write(image_bytes.getbuffer() if isinstance(image_bytes, io.BytesIO) else image_bytes)


### Step 0.2.2: Extract and Store Texts From PDF Content

In [10]:
from lxml import html
from pydantic import BaseModel
from typing import Any, Optional
from unstructured.partition.pdf import partition_pdf

# Define the paths
path = './'
image_output_dir_path = './images/'  # Path where extracted images are stored

# Specify the paths to the Poppler and Tesseract installations
poppler_path = '/usr/bin/'  # Replace with the correct path
tesseract_path = '/usr/bin/tesseract'  # Replace with the correct path

# Get elements from the PDF
raw_pdf_elements = partition_pdf(
    filename="./Dall_E_paper.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=2000,
    new_after_n_chars=1900,
    combine_text_under_n_chars=1000,
    image_output_dir_path=image_output_dir_path,
    tesseract_path=tesseract_path,
)



Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
from unstructured.documents.elements import CompositeElement

# Keep only the chunked text elements; other element types are skipped
text_elements = []
for element in raw_pdf_elements:
    if isinstance(element, CompositeElement):
        text_elements.append(str(element))

print(len(text_elements))

39


In [12]:
long_text_ind = 10

In [13]:
text_elements[long_text_ind]

'3.2 Interpolations\n\nIt is also possible to blend two images x1 and x2 for variations (Figure 4), traversing all of the concepts in CLIP’s embedding space that occur between them. To do this, we rotate between their CLIP embeddings zi1 and zi2 using spherical interpolation, yielding intermediate CLIP representations ziθ = slerp(zi1 , zi2, θ) as θ is varied from 0 to 1. There are two options for producing the intermediate DDIM latents along the trajectory. The ﬁrst option involves interpolating between their DDIM inverted latents xT1 and xT2 (by setting xTθ = slerp(xT1, xT2, θ)), which yields a single trajectory whose endpoints reconstruct x1 and x2. The second option involves ﬁxing the DDIM latent to a randomly-sampled value for all interpolates in the trajectory. This results in an inﬁnite number of trajectories between x1 and x2, though the endpoints of these trajectories will generally no longer coincide with the original images. We use this approach in Figure 4.\n\n3.3 Text Diffs

Because some texts are too long to pass to the LLM directly, we summarize them first.

In [14]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

summarized_text_elements = summarizer(text_elements, max_length=100, do_sample=False)


Your max_length is set to 100, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
Your max_length is set to 100, but your input_length is only 50. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)
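The warnings above appear because a fixed `max_length=100` exceeds the length of some short chunks. One possible way to avoid them is to compute a per-chunk cap, sketched below under the assumption that half the input length is an acceptable target (the same heuristic the warning message suggests). `choose_max_length` and its `cap` and `floor` parameters are hypothetical helpers, not part of `transformers`.

```python
def choose_max_length(input_length, cap=100, floor=8):
    """Pick a summary length cap for one chunk.

    Hypothetical helper: targets half the input length (in tokens),
    never above `cap` and never below `floor`.
    """
    return max(floor, min(cap, input_length // 2))


# The two input lengths reported in the warnings above, plus a long chunk.
for n_tokens in (37, 50, 600):
    print(n_tokens, "->", choose_max_length(n_tokens))
```

The returned value could then be passed as `max_length` when each chunk is summarized on its own, with the token count taken from the summarizer's tokenizer.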


In [15]:
summarized_text_elements[long_text_ind]

{'summary_text': 'It is also possible to blend two images x1 and x2 for variations. To do this, we rotate between their CLIP embeddings zi1 and zi2 using spherical interpolation. A key advantage of using CLIP compared to other models for image representations is that it embeds images and text to the same latent space.'}

# 1 - Unimodal RAG

## 1.1 - Loading True Text Data as Embedded Vectors

In this section, we convert the text data into embedding vectors and store them. In the following step, given an input, we can then find the most similar fact by comparing the input's embedding with the stored ones.

We use this model for [Text-Embedding](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2). Note that the cell below loads the smaller `all-MiniLM-L6-v2` checkpoint from the same family.

In [16]:
from sentence_transformers import SentenceTransformer, util

text_emb_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


In [17]:
text_embeddings = text_emb_model.encode(text_elements, convert_to_tensor=True)

Now we have all the embeddings we need. Given a new input, we compare its embedding with each entry of `text_embeddings` and pick the closest one.

## 1.2 - Unimodal Semi-Structured RAG

### Step 1.2.1: Most Similar Ground Truth Text Extraction

As a first step, for any given input, we need a scoring function that finds the stored embedding vectors closest to the input's embedding. We use cosine similarity for this.
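To make the scoring concrete, here is a minimal pure-Python sketch of cosine similarity on toy 2-D vectors. It is only an illustration: the vectors and the `cosine_similarity` helper are made up for this example, and the notebook itself works on tensors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "query" embedding compared with three toy "fact" embeddings.
query = [1.0, 0.0]
facts = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
for name, vec in facts.items():
    print(name, cosine_similarity(query, vec))
```

Because both vectors are divided by their norms, only the direction matters: scaling a vector does not change its score.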

<font color='77CC99'>Write a function "text_embedding_similarity" that converts the input text to an embedding vector and returns the similarity between the input text and each of the ground truth texts.</font>


In [18]:
import torch

def get_similarity(embeddings_1, embeddings_2, device):
    # Convert lists of embeddings to tensors if necessary
    if isinstance(embeddings_1, list):
        embeddings_1 = torch.stack(embeddings_1).to(device)
    if isinstance(embeddings_2, list):
        embeddings_2 = torch.stack(embeddings_2).to(device)

    # Normalize the embeddings
    embeddings_1 = embeddings_1 / embeddings_1.norm(dim=-1, keepdim=True)
    embeddings_2 = embeddings_2 / embeddings_2.norm(dim=-1, keepdim=True)

    similarity_matrix = (embeddings_1 @ embeddings_2.T)

    # Move the result to the CPU (a no-op if it is already there) and return it as a NumPy array
    return similarity_matrix.detach().cpu().numpy()


In [19]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.set_default_device(device)

In [20]:
def text_embedding_similarity(input_text, text_embeddings, text_emb_model):
    # Generate embedding for the input text
    input_text_emb = text_emb_model.encode([input_text], convert_to_tensor=True)[0]

    # Calculate similarity and return results
    return get_similarity(text_embeddings, input_text_emb, device)


In [21]:
input_text = "is DALL-E2 uses a clip model inside?"
text_embedding_similarity(input_text, text_embeddings, text_emb_model)



array([0.18537425, 0.2028819 , 0.21293399, 0.20692343, 0.2686563 ,
       0.31146652, 0.01873749, 0.21311942, 0.20081617, 0.25198764,
       0.26972085, 0.2460359 , 0.25822914, 0.25395477, 0.23482761,
       0.23657551, 0.17059213, 0.15917057, 0.15459308, 0.14453974,
       0.11232637, 0.18208511, 0.12710598, 0.37942073, 0.05238128,
       0.3243861 , 0.30019337, 0.06964829, 0.10325826, 0.13214464,
       0.02927516, 0.15812725, 0.10174508, 0.26262778, 0.15325922,
       0.00922206, 0.31082875, 0.04998646, 0.12929472], dtype=float32)

<font color='77CC99'> Now, write a function "text_retrival" that finds the summaries of the k ground truth texts most similar to the user's input.</font>

In [22]:
import heapq

def text_retrival(k, input_text, text_embeddings, text_elements, summarized_text_elements, text_emb_model):
    # Calculate similarity scores for each text element
    similarities = text_embedding_similarity(input_text, text_embeddings, text_emb_model)

    # Retrieve the indices of the top k most similar elements
    top_k_indices = heapq.nlargest(k, range(len(similarities)), similarities.take)

    # Select the corresponding text elements, preferring summarized text if available
    selected_text_elements = [summarized_text_elements[i] if summarized_text_elements else text_elements[i] for i in top_k_indices]

    return {"selected_text_elements": selected_text_elements}
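The top-k selection can be checked in isolation on a plain Python list. The sketch below uses made-up scores and a hypothetical helper name, `top_k_indices`; `scores.__getitem__` plays the same role as `similarities.take` does for the NumPy array in the cell above.

```python
import heapq


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest score first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Made-up similarity scores for five text chunks.
scores = [0.18, 0.31, 0.02, 0.38, 0.27]
print(top_k_indices(scores, 2))
```

If `k` is larger than the number of scores, `heapq.nlargest` simply returns every index in descending score order.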


### Step 1.2.2: Load the core LLM and Combine them all

We use an LLM as the question-answering core of our system. Given the input text and the ground truth fact closest to it, we pass both to the LLM to answer the question.

Here we load the core LLM for our Unimodal Semi-Structured RAG. [Model in HF](https://huggingface.co/samwit/koala-7b)

In [23]:
import gc

# clean-up the memory and gpu cache
gc.collect()
if torch.cuda.is_available():
  torch.cuda.empty_cache()

In [24]:
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch
import textwrap

model = LlamaForCausalLM.from_pretrained(
    "samwit/koala-7b",
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = LlamaTokenizer.from_pretrained("samwit/koala-7b")


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [25]:
# Use a distinct variable name so the imported `pipeline` factory is not shadowed
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15)

tokenizer.pad_token_id = tokenizer.eos_token_id

What follows is our prompt. Based on it, do the task below.


<font color='77CC99'> Based on the prompt and what we have done before, write a function "Unimodal_Question_Answering" that answers the user's question by finding the most related ground truth text (fact) and giving it to the LLM through the prompt.</font>

In [26]:
# Prompt
prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
FACTS: \n {text_facts}. \n
QUESTION: {user_question} \n
ANSWER:  """

In [27]:
def Unimodal_Question_Answering(input_text, k=1):
    # Retrieve the summaries of the k ground truth texts most similar to the question
    retrieved_elements = text_retrival(k, input_text, text_embeddings, text_elements, summarized_text_elements, text_emb_model)

    # Return the top-ranked summary as the response.
    # Note: this returns the retrieved fact itself; prompt_text and the LLM are not invoked here.
    response = retrieved_elements["selected_text_elements"][0]

    return response
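For illustration, the sketch below shows one way the retrieved summaries could be placed into a prompt and handed to a text generator. Everything in it is an assumption made for the example: `PROMPT_TEMPLATE` is a simplified stand-in for `prompt_text`, `build_prompt` and `answer_with_llm` are hypothetical helpers, and `fake_generate` replaces the real model so the code runs without loading any weights.

```python
PROMPT_TEMPLATE = (
    "ANSWER the QUESTION based on the FACTS.\n"
    "FACTS:\n{text_facts}\n"
    "QUESTION: {user_question}\n"
    "ANSWER: "
)


def build_prompt(retrieved, user_question):
    """Join the retrieved summaries into the FACTS slot of the template."""
    text_facts = " ".join(el["summary_text"] for el in retrieved)
    return PROMPT_TEMPLATE.format(text_facts=text_facts, user_question=user_question)


def answer_with_llm(retrieved, user_question, generate):
    """`generate` is any callable that maps a prompt string to generated text."""
    return generate(build_prompt(retrieved, user_question))


def fake_generate(prompt):
    """Stand-in generator: echoes the prompt and appends a fixed answer."""
    return prompt + "Yes."


retrieved = [{"summary_text": "CLIP embeds images and text in one space."}]
print(answer_with_llm(retrieved, "Does DALL-E 2 use CLIP?", fake_generate))
```

Passing the generator as an argument keeps the prompt-building logic testable without a GPU; in the notebook, the callable would wrap the loaded model.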


In [28]:
input_text = "is DALL-E2 uses a clip model inside?"

response = Unimodal_Question_Answering(input_text,k=1)

In [29]:
response

{'summary_text': 'Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration.'}

# 2 - Multimodal RAG

In this section, we want to add another modality to our unimodal RAG. What happens if we can consider images as ground truth facts?

We have stored all the ground truth images. Thus, in this step, we extract image embeddings for comparison with the embeddings of the textual input.

## 2.1 - Loading CLIP Model for Extracting Embeddings

<font color='77CC99'> Load the CLIP model for extracting textual and visual embeddings, then convert all input images to their corresponding vectors.

[Huggingface Link](https://huggingface.co/docs/transformers/model_doc/clip) </font>


In [30]:
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

# Load CLIP model and processor
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Textual and visual models are part of the CLIPModel
textual_clip_model = clip_model.text_model
visual_clip_model = clip_model.vision_model

textual_clip_tokenizer = clip_processor.tokenizer


In [40]:
import torch
from PIL import Image
import gc
import glob

def process_images(image_paths, clip_processor, visual_clip_model):

    # List to store all embeddings
    all_embeddings = []

    for image_path in image_paths:

        # Open image
        image = Image.open(image_path)

        # Create inputs
        inputs = clip_processor(images=image, return_tensors="pt", padding=True)

        # Get image embedding
        with torch.no_grad():
            image_embedding = visual_clip_model(**inputs).pooler_output

        # Append to list
        all_embeddings.append(image_embedding)

        # Clear cache
        del image, inputs
        gc.collect()
        torch.cuda.empty_cache()

    # Stack list of tensors
    all_embeddings = torch.stack(all_embeddings)

    return all_embeddings


# Usage
images_path = glob.glob('./images/*')
images_embeddings = process_images(images_path, clip_processor, visual_clip_model)

Because we are using a unimodal LLM, we need to make the images' information understandable to it. Hence, we extract a textual description of each image as a caption and store the captions in the `captions_list` list.

<font color='77CC99'>Write the corresponding code.</font>


[Image Captioning HF Model](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)

In [68]:
def generate_caption_embeddings(captions_list, textual_clip_model, textual_clip_tokenizer, device):
    caption_embeddings = []

    for caption in captions_list:
        inputs = textual_clip_tokenizer(caption['generated_text'], return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            caption_embedding = textual_clip_model(**inputs).last_hidden_state.mean(dim=1)
            caption_embeddings.append(caption_embedding.squeeze(0))

    return torch.stack(caption_embeddings)

In [42]:
from transformers import pipeline

### To Do ###

# Initialize the pipeline for image-to-text conversion
image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning", device=device)

# Caption every extracted image (each call returns a list of {'generated_text': ...} dicts)
images_path = glob.glob('./images/*')
captions_list = [image_to_text(Image.open(img_path)) for img_path in images_path]

### End ###

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.


In [67]:
import numpy as np

captions_list = np.array(captions_list).ravel()

captions_list[:12]

array([{'generated_text': 'a person holding a green plant in their hand '},
       {'generated_text': 'people on skis on a snowy slope '},
       {'generated_text': 'a living room with a table and chairs '},
       {'generated_text': 'a toy elephant with a red handle and a blue and white background '},
       {'generated_text': 'a dog wearing a hat and a flag '},
       {'generated_text': 'a collage of photos of people in a street '},
       {'generated_text': 'a man with a mask on '},
       {'generated_text': 'a pair of black shoes with a white heel '},
       {'generated_text': 'a person holding a green plant in their hand '},
       {'generated_text': 'a kitchen with a refrigerator, stove, microwave and a table '},
       {'generated_text': 'a woman is riding a skateboard in a crowded street '},
       {'generated_text': 'a large building with a large clock on it '}],
      dtype=object)
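As an alternative to `np.array(...).ravel()`, the nested pipeline output can be flattened into plain strings with a list comprehension. This is a sketch under the assumption, taken from the output shown above, that each image yields a list of `{'generated_text': ...}` dicts; `flatten_captions` is a hypothetical helper name.

```python
def flatten_captions(pipeline_outputs):
    """Turn [[{'generated_text': ...}], ...] into a flat list of stripped strings."""
    return [
        item["generated_text"].strip()
        for per_image in pipeline_outputs
        for item in per_image
    ]


# Two entries copied from the captions above, in the assumed nested shape.
raw = [
    [{"generated_text": "a dog wearing a hat and a flag "}],
    [{"generated_text": "a man with a mask on "}],
]
print(flatten_captions(raw))
```

The later cells index each caption with `['generated_text']`, so they would need to use the strings directly if this variant were adopted.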

In [69]:
caption_embeddings = generate_caption_embeddings(captions_list, textual_clip_model, textual_clip_tokenizer, device)

In [70]:
caption_embeddings[:12]

tensor([[ 0.6607, -0.7599, -0.1563,  ..., -0.2615,  0.0344, -0.1142],
        [ 0.1877, -0.0785,  0.7978,  ...,  0.4225,  0.9018,  0.0136],
        [ 0.7590,  0.2619, -0.1598,  ...,  0.5904,  0.6454,  0.0085],
        ...,
        [ 0.8570, -0.3866,  0.3679,  ...,  0.6092,  1.1859, -0.0879],
        [ 0.6250, -0.2662,  0.9190,  ...,  0.8142,  0.8735, -0.1756],
        [ 0.5792,  0.1005, -0.5525,  ...,  0.5149,  0.0015, -0.4678]],
       device='cuda:0')

Now that we have everything ready, we can code the Multimodal Semi-Structured RAG.

## 2.2 - Multimodal Semi-Structured RAG

### Step 2.2.1: Most Similar Ground Truth Image Extraction

As a first step, for any given input, we need a scoring function that finds the visual embedding vectors closest to the input's textual embedding. We use cosine similarity for this.

<font color='77CC99'>Write a function "visual_embedding_similarity" that converts the input text to a CLIP embedding vector and returns the similarity between the input text and each of the ground truth images.</font>

Note: You can use the "get_similarity" function that you defined before.

In [107]:
def visual_embedding_similarity(input_text, caption_embeddings, textual_clip_tokenizer, textual_clip_model):
    # Convert the input text to a CLIP text embedding (mean-pooled hidden states, shape [1, dim])
    inputs = textual_clip_tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        text_embeds = textual_clip_model(**inputs).last_hidden_state.mean(dim=1)

    # caption_embeddings is already a [num_captions, dim] tensor; only move it to the right device
    caption_embeddings = caption_embeddings.to(device)

    # Cosine similarity with every image caption, returned as a 1-D array of length num_captions
    return get_similarity(caption_embeddings, text_embeds[0], device)


<font color='77CC99'>Now, write a function that finds the k texts and images most similar to the user's input.</font>

In [72]:
import heapq

def multimodal_retrieval(k, input_text, text_embeddings, text_elements, summarized_text_elements,
                         text_emb_model, images_embeddings, captions_list, textual_clip_tokenizer, textual_clip_model):

    # Get text and visual similarities
    # (caption_embeddings is read from the notebook's global scope; images_embeddings is not used here)
    text_similarity = text_embedding_similarity(input_text, text_embeddings, text_emb_model)
    visual_similarity = visual_embedding_similarity(input_text, caption_embeddings, textual_clip_tokenizer, textual_clip_model)

    # Retrieve top k indices for text and images
    top_k_text_indices = heapq.nlargest(k, range(len(text_similarity)), text_similarity.take)
    top_k_image_indices = heapq.nlargest(k, range(len(visual_similarity)), visual_similarity.take)

    # Select corresponding elements
    selected_text_elements = [summarized_text_elements[i] if summarized_text_elements else text_elements[i] for i in top_k_text_indices]
    selected_image_elements = [captions_list[i] for i in top_k_image_indices]

    return {"selected_image_elements": selected_image_elements,
            "selected_text_elements": selected_text_elements}


### Step 2.2.2: Use the core LLM and Combine them all

In this section, building on the LLM loaded earlier and the retrieval functions above, we write the Multimodal RAG. Do it as follows.

<font color='77CC99'> Based on the new prompt, which contains both the textual ground truth facts and the captions of the visual ground truth images, write the "Multimodal_Question_Answering" function. This function should take the user's textual question as input, find the most correlated textual and visual ground truth, and then give them all to the LLM via the prompt.</font>

In [108]:
# Prompt
prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
FACTS: \n {text_facts} \n {image_facts}. \n
QUESTION: {user_question} \n
ANSWER:  """

In [132]:
def LLM(prompt, max_length=180, num_return_sequences=1):
    # Tokenize the input prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)

    # Generate response using the model
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
        early_stopping=True
    )

    # Decode the output to text
    responses = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

    # Return the response(s)
    return responses[0] if num_return_sequences == 1 else responses


In [133]:
def Multimodal_Question_Answering(input_text, k=1):
    # Retrieve the most correlated textual and visual ground truth
    retrieval_results = multimodal_retrieval(k, input_text, text_embeddings, text_elements, summarized_text_elements,
                                             text_emb_model, images_embeddings, captions_list, textual_clip_tokenizer, textual_clip_model)

    # Combine textual and image facts
    text_facts = ' '.join([el['summary_text'] for el in retrieval_results["selected_text_elements"]])
    image_facts = ' '.join([el['generated_text'] for el in retrieval_results["selected_image_elements"]])

    # Format the prompt
    prompt = prompt_text.format(text_facts=text_facts, image_facts=image_facts, user_question=input_text)

    # Query the LLM with the new prompt (assuming LLM function is defined)
    response = LLM(prompt)

    return response


In [134]:
input_text = "is DALL-E2 uses a clip model inside?"

response = Multimodal_Question_Answering(input_text,k=1)

In [135]:
response

'ANSWER the QUESTION in conformity to on FACTS. \n\nFACTS: \n Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration. \n a person holding a green plant in their hand . \n\nQUESTION: is DALL-E2 uses a clip model inside? \n\nANSWER:  \nDALL E2-is used to train a model, but it is not a part of the model itself.\nThe model'
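The generated string above repeats the whole prompt before the model's answer. A small post-processing step can isolate the answer, sketched here under the assumption that the answer follows the last `ANSWER:` marker; `extract_answer` is a hypothetical helper and the example string is made up.

```python
def extract_answer(generated_text, marker="ANSWER:"):
    """Return the text after the last occurrence of `marker`, stripped.

    If the marker is missing, the full text is returned (stripped).
    """
    return generated_text.rsplit(marker, 1)[-1].strip()


# Made-up generation that echoes its prompt before answering.
echoed = "FACTS: ...\nQUESTION: is it blue?\nANSWER:  \nYes, it is blue."
print(extract_answer(echoed))
```

The marker includes the colon, so the opening words "ANSWER the QUESTION" in the prompt do not match. If the model itself writes `ANSWER:` again in its output, only the text after the final occurrence is kept.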

<font color='77CC99'>The Answer to the input question is "Yes" or "No". What are your Semi-structured models' answers? (Both Unimodal and Multimodal). Are they right or not?</font>

<font color='CC7799'>Your Answer:</font>

Unimodal: NO (It **doesn't give** a direct answer in its output)

Multimodal: NO (It gives **A WRONG** answer)

**The Unimodal RAG Doesn't Output a Relevant Answer:**

"Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration."

**The Multimodal RAG Outputs an Incorrect Answer:**

"ANSWER the QUESTION in conformity to on FACTS. \n\nFACTS: \n Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration. \n a person holding a green plant in their hand . \n\nQUESTION: is DALL-E2 uses a clip model inside? \n\nANSWER:  \nDALL E2-is used to train a model, but it is not a part of the model itself.\nThe model"

**The Actual Answer** (according to GPT-3.5, assuming it is well informed about OpenAI's models):

"Yes, DALL-E 2 uses a CLIP model as part of its architecture. DALL-E 2, developed by OpenAI, is an advanced AI model designed for generating images from textual descriptions. One of its key components is the CLIP (Contrastive Language–Image Pretraining) model, also developed by OpenAI."

