## Enhancing Retrieval Augmented Generation for Documents with Integrated Images, Graphics and Tables

Implementing Retrieval Augmented Generation (RAG) presents unique challenges when working with documents rich in images and tables. Traditional RAG models excel with textual data but often falter when visual elements play a crucial role in conveying information. In this cookbook, we bridge that gap by leveraging vision models to extract and interpret visual content, ensuring that the generated responses are as informative and accurate as possible.

Our approach involves parsing documents into images and utilizing metadata tagging to identify pages containing images or tables. When a semantic search retrieves such a page, we pass the page image to a vision model instead of relying solely on text. This method enhances the model's ability to understand and answer user queries that pertain to visual data.

In this cookbook, we will explore and demonstrate the following key concepts:

##### 1. Setting Up a Vector Store with Pinecone:
- Learn how to initialize and configure Pinecone to store vector embeddings efficiently.

##### 2. Parsing PDFs and Extracting Visual Information:
- Discover techniques for converting PDF pages into images.
- Use GPT-4o's vision modality to detect pages with images or tables and extract pertinent information.

##### 3. Generating Embeddings: 
- Utilize embedding models to create vector representations of textual data. 
- Flag the pages that contain visual content so the flag can be used as a metadata filter on the vector store if needed.

##### 4. Uploading embeddings to Pinecone: 
- Upload these embeddings to Pinecone for storage and retrieval. 

##### 5. Performing Semantic Search for Relevant Pages:
- Implement semantic search on page text to find pages that best match the user's query. 

##### 6. Handling Pages with Visual Content:
- Learn how to pass the page image to GPT-4o's vision modality, along with additional context, for question answering.
- Understand how this process improves the accuracy of responses involving visual data.
 
By the end of this cookbook, you will have a robust understanding of how to implement RAG systems capable of processing and interpreting documents with complex visual elements. This knowledge will empower you to build AI solutions that deliver richer, more accurate information, enhancing user satisfaction and engagement.

We will use the World Bank report - [A better bank for a better world: Annual Report 2024](https://documents1.worldbank.org/curated/en/099101824180532047/pdf/BOSIB13bdde89d07f1b3711dd8e86adb477.pdf) to illustrate the concepts as this document has a mix of images, tables and graphics data. 

### 1. Setting up a Vector Store with Pinecone 
In this section, we'll set up a vector store using Pinecone to store and manage our embeddings efficiently. Pinecone is a powerful vector database optimized for handling high-dimensional vector data, which is essential for tasks like semantic search and similarity matching.

**Prerequisites**

- Sign up for Pinecone and obtain an API key by following the [Pinecone Database Quickstart](https://docs.pinecone.io/guides/get-started/quickstart).
- Install the Pinecone SDK using `pip install "pinecone[grpc]"`. gRPC (gRPC Remote Procedure Call) is a high-performance, open-source RPC framework that uses HTTP/2 for transport and Protocol Buffers (protobuf) as its interface definition language. It is designed to make communication between services efficient, which suits distributed systems and microservices architectures.

**Storing the API Key Securely**
Store the API key in a `.env` file in your project directory, rather than hard-coding it, as follows:  
 `PINECONE_API_KEY=your-api-key-here`. Install `python-dotenv` with `pip install python-dotenv` to read the API key from the `.env` file.
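
If the key is missing, the first sign is often an authentication error further down the notebook. A minimal sketch of a fail-fast check, using only the standard library, is shown here; `require_env` is a hypothetical helper name and not part of `python-dotenv` or the Pinecone SDK.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value
```

After calling `load_dotenv()`, you could write `api_key = require_env("PINECONE_API_KEY")` in place of `os.getenv`, so that a missing key is reported immediately.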

**Creating the Pinecone Index** 
We'll use the `create_index` function to initialize our embeddings database on Pinecone. There are two crucial parameters to consider:

1. Dimension: This must match the dimensionality of the embeddings produced by your chosen model. For example, OpenAI's text-embedding-ada-002 model produces embeddings with 1536 dimensions, while text-embedding-3-large produces embeddings with 3072 dimensions. We'll use the text-embedding-3-large model, so we'll set the dimension to 3072.

2. Metric: The distance metric determines how similarity is calculated between vectors. Pinecone supports several metrics, including cosine, dotproduct, and euclidean. For this cookbook, we'll use the cosine similarity metric. You can learn more about distance metrics in the [Pinecone Distance Metrics documentation](https://docs.pinecone.io/guides/indexes/understanding-indexes#distance-metrics).
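
To make the difference between the three metrics concrete, here is a minimal pure-Python sketch that computes them for two toy vectors. The helper names are illustrative and not part of the Pinecone SDK; Pinecone evaluates the chosen metric on the server side.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: compares direction only, ignoring vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Euclidean distance: straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print("cosine:", cosine_similarity(a, b))
print("dot product:", dot(a, b))
print("euclidean:", euclidean_distance(a, b))
```

Because `b` points in the same direction as `a`, the cosine similarity is at its maximum even though the two vectors are far apart in Euclidean terms. This is one reason cosine is a common choice for text embeddings, where direction carries the semantic signal.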


In [3]:
import os
import time
# Import the Pinecone library
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec

from dotenv import load_dotenv

load_dotenv()

api_key = os.getenv("PINECONE_API_KEY")

# Initialize a Pinecone client with your API key
pc = Pinecone(api_key=api_key)

# Create a serverless index
index_name = "my-test-index"

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=3072,
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

# Wait for the index to be ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)


Navigate to the Indexes list on [Pinecone](https://app.pinecone.io/) and you should be able to view the `my-test-index` index.

### 2. Parsing PDFs and Extracting Visual Information:

In this section, we will parse a PDF document and extract textual and visual information, such as describing images, graphics, and tables. The process involves three main steps:

1. Parse the PDF into individual pages: We split the PDF into separate pages for easier processing.
2. Convert PDF pages to images: This enables GPT-4o's vision capability to analyze each page as an image.
3. Process images and tables: Provide instructions to GPT-4o to extract text and to describe the images or tables in the document.

**Prerequisites**

Before proceeding, make sure you have the following packages installed. Also ensure your OpenAI API key is set up as an environment variable. 

`pip install PyPDF2 pdf2image openai requests pandas tqdm`

`pdf2image` relies on Poppler for PDF rendering, so you may also need to install Poppler on your system.
 
**Steps Breakdown:**

**1. Downloading and Chunking the PDF:**  
- The `chunk_document` function downloads the PDF from the provided URL and splits it into individual pages using PyPDF2.
- Each page is stored as a separate PDF byte stream in a list.

**2. Converting PDF Pages to Images:** 
- The `convert_page_to_image` function takes the PDF bytes of a single page and converts it into an image using pdf2image.
- The image is saved locally in an 'images' directory for further processing.

**3. Extracting Text Using OCR:**
- The `get_vision_response` function sends the image of the page to GPT-4o, whose vision capability transcribes the text on the page.
- This method can extract textual information even from scanned documents.

**4. Processing the Entire Document:** 
- The `process_document` function orchestrates the processing of each page.
- It uses a progress bar (tqdm) to show the processing status.
- The extracted information from each page is collected into a list and then converted into a Pandas DataFrame.
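
Because the page images are saved as PNG files, the data URL sent to the vision model should carry a matching MIME type. The following is a minimal sketch that derives the MIME type from the file extension using only the standard library. `to_data_url` is a hypothetical helper, not part of the OpenAI SDK, and the optional `image_bytes` argument exists only so the example can run without a file on disk.

```python
import base64
import mimetypes


def to_data_url(image_path, image_bytes=None):
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Fall back to a generic type when the extension is not recognized
        mime_type = "application/octet-stream"
    if image_bytes is None:
        with open(image_path, "rb") as image_file:
            image_bytes = image_file.read()
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes stand in for a real page image here
url = to_data_url("images/page_1.png", image_bytes=b"\x89PNG")
print(url.split(",")[0])
```

The returned string can be used as the `url` value of an `image_url` content part in a chat completion request.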


In [4]:
import base64
import requests
import os
import pandas as pd
from PyPDF2 import PdfReader, PdfWriter
from pdf2image import convert_from_bytes
from io import BytesIO
from openai import OpenAI
from tqdm import tqdm  # Import tqdm

document_to_parse = "https://documents1.worldbank.org/curated/en/099101824180532047/pdf/BOSIB13bdde89d07f1b3711dd8e86adb477.pdf"


def chunk_document(document_url):
    # Download the PDF document
    response = requests.get(document_url)
    pdf_data = response.content

    # Read the PDF data using PyPDF2
    pdf_reader = PdfReader(BytesIO(pdf_data))
    page_chunks = []

    for page_number, page in enumerate(pdf_reader.pages, start=1):
        pdf_writer = PdfWriter()
        pdf_writer.add_page(page)
        pdf_bytes_io = BytesIO()
        pdf_writer.write(pdf_bytes_io)
        pdf_bytes_io.seek(0)
        pdf_bytes = pdf_bytes_io.read()
        page_chunk = {
            'pageNumber': page_number,
            'pdfBytes': pdf_bytes
        }
        page_chunks.append(page_chunk)

    return page_chunks

# Function to encode the image
def encode_image(local_image_path):
    with open(local_image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
    
def convert_page_to_image(pdf_bytes, page_number):
    # Convert the PDF page to an image
    images = convert_from_bytes(pdf_bytes)
    image = images[0]  # There should be only one page

    # Define the directory to save images (relative to your script)
    images_dir = 'images'  # Use relative path here

    # Ensure the directory exists
    os.makedirs(images_dir, exist_ok=True)

    # Save the image to the images directory
    image_file_name = f"page_{page_number}.png"
    image_file_path = os.path.join(images_dir, image_file_name)
    image.save(image_file_path, 'PNG')

    # Return the relative image path
    return image_file_path


def get_vision_response(prompt, image_path):
    client = OpenAI()

    # Getting the base64 string
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            # Page images are saved as PNG, so use the matching MIME type
                            "url": f"data:image/png;base64,{base64_image}"
                        },
                    },
                ],
            }
        ],
    )
    return response


def process_document(document_url):
    try:
        # Log the start of processing
        print("Document processing started")

        # Get per-page chunks
        page_chunks = chunk_document(document_url)
        total_pages = len(page_chunks)

        # Prepare a list to collect page data
        page_data_list = []

        # Add progress bar here
        for page_chunk in tqdm(page_chunks, total=total_pages, desc='Processing Pages'):
            page_number = page_chunk['pageNumber']
            pdf_bytes = page_chunk['pdfBytes']

            # Convert page to image
            image_path = convert_page_to_image(pdf_bytes, page_number)

            # Prepare the instructions for the vision model.
            # Each string ends with a space so the concatenated steps do not run together.
            prompt = (
                "The user will provide you an image of a document file. Perform the following actions: "
                "1. Transcribe the text on the page. **TRANSCRIPTION OF THE TEXT:** "
                "2. If there is a chart, describe the image and include the text **DESCRIPTION OF THE IMAGE OR CHART:** "
                "3. If there is a table, transcribe the table and include the text **TRANSCRIPTION OF THE TABLE:** "
            )

            # Get vision API response
            vision_response = get_vision_response(prompt, image_path)

            # Extract text from vision response
            text = vision_response.choices[0].message.content

            # Collect page data
            page_data = {
                'PageNumber': page_number,
                'ImagePath': image_path,
                'PageText': text
            }
            page_data_list.append(page_data)

        # Create DataFrame from page data
        pdf_df = pd.DataFrame(page_data_list)
        print("Document processing completed.")
        print("DataFrame created with page data.")

        # Return the DataFrame
        return pdf_df

    except Exception as err:
        print(f"Error processing document: {err}")
        # Re-raise so the caller does not silently receive None instead of a DataFrame
        raise


df = process_document(document_to_parse)

Document processing started


Processing Pages: 100%|██████████| 49/49 [15:11<00:00, 18.61s/it]

Document processing completed.
DataFrame created with page data.





Let's examine the DataFrame to make sure the pages have been processed. You should also see the generated page images in the `images` directory.

In [5]:
from IPython.display import display, HTML

# Convert the DataFrame to an HTML table and display it
display(HTML(df.head().to_html()))

Unnamed: 0,PageNumber,ImagePath,PageText
0,1,images/page_1.png,"**TRANSCRIPTION OF THE TEXT:**\n\nA BETTER BANK FOR A BETTER WORLD \nANNUAL REPORT 2024 \nWORLD BANK GROUP \nIBRD • IDA \nPublic Disclosure Authorized \n\n**DESCRIPTION OF THE IMAGE OR CHART:**\n\nThe image shows a night scene with a makeshift structure illuminated from within. The structure appears to be a tent with colorful patterns on the fabric. Inside, there are several people visible, notably children, and various personal belongings scattered on a raised platform. The exterior ground seems to be sandy, and the scene conveys a sense of activity and habitation within the tent."
1,2,images/page_2.png,"**TRANSCRIPTION OF THE TEXT:**\n\nCONTENTS\n\nMessage from the President 6 \nMessage from the Executive Directors 8 \nBecoming a Better Bank 10 \nFiscal 2024 Financial Summary 12 \nResults by Region 14 \nResults by Theme 44 \nHow We Work 68 \n\nKEY TABLES\n\nIBRD Key Financial Indicators, Fiscal 2020–24 84 \nIDA Key Financial Indicators, Fiscal 2020–24 88 \n\nThis annual report, which covers the period from July 1, 2023, to June 30, 2024, has been prepared by the Executive Directors of both the International Bank for Reconstruction and Development (IBRD) and the International Development Association (IDA)—collectively known as the World Bank—in accordance with the respective by-laws of the two institutions. Ajay Banga, President of the World Bank Group and Chairman of the Board of Executive Directors, has submitted this report, together with the accompanying administrative budgets and audited financial statements, to the Board of Governors. \n\nAnnual reports for the other World Bank Group institutions—the International Finance Corporation (IFC), the Multilateral Investment Guarantee Agency (MIGA), and the International Centre for Settlement of Investment Disputes (ICSID)—are published separately. Key highlights from each institution’s annual report are available in the World Bank Group Annual Report Summary.\n\nThroughout the report, the term World Bank and the abbreviated Bank refer only to IBRD and IDA; the term World Bank Group and the abbreviated Bank Group refer to the five institutions. All dollar amounts used in this report are current U.S. dollars unless otherwise specified. Funds allocated to multiregional projects are accounted for by recipient country where possible in tables and text when referring to regional breakdowns. For sector and theme breakdowns, funds are accounted for by operation. 
Fiscal year commitments and disbursements data are in accordance with the audited figures reported in the IBRD and IDA Financial Statements and Management’s Discussion and Analysis documents for Fiscal 2024. As a result of rounding, numbers in tables may not add to totals, and percentages in figures may not add to 100.\n\nTHE WORLD BANK ANNUAL REPORT 2024\n\n**DESCRIPTION OF THE IMAGE OR CHART:**\n\nThe image on the left half of the document is a close-up photograph of a hand holding a bunch of rice stalks, with golden grains. The background is a blurred green, indicating a field. At the bottom of this image, there is a caption that reads ""THE WORLD BANK ANNUAL REPORT 2024."""
2,3,images/page_3.png,"**TRANSCRIPTION OF THE TEXT:**\n\n**ABOUT US**\n\nThe World Bank Group is one of the world’s largest sources of funding and knowledge for developing countries. Our five institutions share a commitment to reducing poverty, increasing shared prosperity, and promoting sustainable development.\n\n**OUR VISION**\n\nOur vision is to create a world free of poverty on a livable planet.\n\n**OUR MISSION**\n\nOur mission is to end extreme poverty and boost shared prosperity on a livable planet. This is threatened by multiple, intertwined crises. Time is of the essence. We are building a better Bank to drive impactful development that is:\n\n• Inclusive of everyone, including women and young people;\n\n• Resilient to shocks, including against climate and biodiversity crises, pandemics and fragility;\n\n• Sustainable, through growth and job creation, human development, fiscal and debt management, food security and access to clean air, water, and affordable energy.\n\nTo achieve this, we will work with all clients as one World Bank Group, in close partnership with other multilateral institutions, the private sector, and civil society.\n\n**OUR CORE VALUES**\n\nOur work is guided by our core values: impact, integrity, respect, teamwork, and innovation. These inform everything we do, everywhere we work.\n\n**DESCRIPTION OF THE IMAGE OR CHART**\n\nThe image shows two girls in school uniforms smiling and hugging each other in what appears to be a joyful and supportive environment. The background suggests a school setting with other children around."
3,4,images/page_4.png,"**TRANSCRIPTION OF THE TEXT:**\n\nDRIVING ACTION, MEASURING RESULTS\n\nThe World Bank Group contributes to impactful, meaningful development results around the world. In the first half of fiscal 2024*, we:\n\n- Helped feed 156 million people\n\n- Improved schooling for 280 million students\n\n- Reached 287 million people living in poverty with effective social protection support†\n\n- Provided healthy water, sanitation, and/or hygiene to 59 million people\n\n- Enabled access to sustainable transportation for 77 million people\n\n- Provided 17 gigawatts of renewable energy capacity\n\n- Committed to devote 45 percent of annual financing to climate action by 2025, deployed equally between mitigation and adaptation\n\n*The development of the new Scorecard is ongoing at the time of printing; therefore, this report can only account for results up to December 31, 2023.\n\n† IBRD and IDA only indicator.\n\nIn fiscal 2024, the Bank Group announced the development of a new Scorecard that will track results across 22 indicators—a fraction of the previous 150—to provide a streamlined, clear picture of progress on all aspects of the Bank Group’s mission, from improving access to healthcare to making food systems sustainable to boosting private investment.\n\nFor the first time, the work of all Bank Group financing institutions will be tracked through the same set of indicators. The new Scorecard will track the Bank Group's overarching vision of ending poverty on a livable planet.\n\nTHE WORLD BANK ANNUAL REPORT 2024\n\n**DESCRIPTION OF THE IMAGE OR CHART:**\n\nThe image contains several circular photos related to each impact area: food, schooling, poverty, water, transportation, renewable energy, and climate action. These images visually represent the achievements laid out in the text, such as food distribution, education, and energy infrastructure."
4,5,images/page_5.png,"**TRANSCRIPTION OF THE TEXT:**\n\nMESSAGE FROM THE PRESIDENT\n\nDELIVERING ON OUR COMMITMENTS REQUIRES US TO DEVELOP NEW AND BETTER WAYS OF WORKING. IN FISCAL 2024, WE DID JUST THAT.\n\nAJAY BANGA\n\nIn fiscal 2024, the World Bank Group adopted a bold new vision of a world free of poverty on a livable planet. To achieve this, the Bank Group is enacting reforms to become a better partner to governments, the private sector, and, ultimately, the people we serve. Rarely in our 80-year history has our work been more urgent: We face declining progress in our fight against poverty, an existential climate crisis, mounting public debt, food insecurity, an unequal pandemic recovery, and the effects of geopolitical conflict.\n\nResponding to these intertwined challenges requires a faster, simpler, and more efficient World Bank Group. We are refocusing to confront these challenges not just through funding, but with knowledge. Our Knowledge Compact for Action, published in fiscal 2024, details how we will empower all Bank Group clients, public and private, by making our wealth of development knowledge more accessible. And we have reorganized the World Bank’s global practices into five Vice Presiding units—People, Prosperity, Planet, Infrastructure, and Digital—for more flexible and faster engagements with clients. Each of these units reached important milestones in fiscal 2024.\n\nWe are supporting countries in delivering quality, affordable health services to 1.5 billion people by 2030 so our children and grandchildren will lead healthier, better lives. This is part of our larger global effort to address a basic standard of care through every stage of a person’s life—infancy, childhood, adolescence, and adulthood. 
To help people withstand food-affected shocks and crises, we are strengthening social protection services to support half a billion people by the end of 2030—aiming for half of these beneficiaries to be women.\n\nWe are helping developing countries create jobs and employment, the surest enablers of prosperity. In the next 10 years, 1.2 billion young people across the Global South will become working-age adults. Yet, in the same period and the same countries, only 424 million jobs are expected to be created. The cost of hundreds of millions of young people with no hope for a decent job or future is unimaginable, and we are working urgently to create opportunity for all.\n\nIn response to climate change—arguably the greatest challenge of our generation—we’re channeling 45 percent of annual financing to climate action by 2025, deployed equally between mitigation and adaptation. Among other efforts, we intend to launch at least 15 country-led methane-reduction programs by fiscal 2026, and our Forest Carbon Partnership Facility has helped strengthen high-integrity carbon markets.\n\nAccess to electricity is a fundamental human right and foundational to any successful development effort. It will accelerate the great struggle to uplift developing countries, strengthen public infrastructure, and prepare people for the jobs of tomorrow. But the population of Africa—600 million people—lacks access to electricity. In response, we have committed to provide electricity to 300 million people in Sub-Saharan Africa by 2030 in partnership with the African Development Bank.\n\nRecognizing that digitalization is the transformational opportunity of our time, we are collaborating with governments in more than 100 developing countries to enable digital economies. Our digital lending portfolio totaled $5.6 billion in commitments as of June 2024; and our new Digital Vice Presidency unit will afford us to establish the foundations of a digital economy. 
Key initiatives include building and enhancing digital and data infrastructure, ensuring cybersecurity and data privacy for institutions, businesses, and citizens, and advancing digital government services.\n\nDelivering on our commitments requires us to develop new and better ways of working. In fiscal 2024, we did just that. We are squeezing our balance sheet and finding new opportunities to take more risk and boost our lending. Our new crisis preparedness and response tools, Global Challenge Programs, and Livable Planet Fund demonstrate how we are amending our approach to better thrive and meet outcomes. Our new Scorecard radically changes how we track results.\n\nBut we cannot achieve development on our own. We need partners from both the public and private sectors to join our efforts. That’s why we are working closely with other multilateral development banks to improve the lives of people in developing countries in tangible, measurable ways. Our deepening relationship with the private sector is evidenced by our Private Sector Investment Lab, which is working to address the barriers preventing private sector investment in emerging markets. The Lab’s core group of 15 Chief Executive Officers and Chairs meets regularly, and already has informed our work—most notably with the development of the World Bank Group Guarantee Platform.\n\nThe impact and innovations we delivered this year will allow us to move forward with a raised ambition and a greater sense of urgency to improve people’s lives. I would like to recognize the remarkable efforts of our staff and Executive Directors, as well as the unwavering support of all our clients and partners. 
Together, we head into fiscal 2025 with a great sense of optimism—and determination to create a better Bank for a better world.\n\nAJAY BANGA\nPresident of the World Bank Group\nand Chairman of the Board of Executive Directors\n\n**DESCRIPTION OF THE IMAGE OR CHART:**\n\nThe image shows a group of people examining produce, likely in an agricultural context. The individuals appear engaged and focused on the crops being discussed or displayed. The setting is likely outdoors with natural lighting."


Let's examine a sample page, such as page 21, that combines embedded images and text. The vision modality extracted the visual information effectively; for example, the pie chart on the page is described accurately:

`"FIGURE 6: MIDDLE EAST AND NORTH AFRICA IBRD AND IDA LENDING BY SECTOR - FISCAL 2024 SHARE OF TOTAL OF $4.6 BILLION" is a circular chart, resembling a pie chart, illustrating the percentage distribution of funds among different sectors. The sectors include:`

In [10]:
# Filter and print rows where PageNumber is 21
filtered_rows = df[df['PageNumber'] == 21]
for text in filtered_rows.PageText:
    print(text)

**TRANSCRIPTION OF THE TEXT:**

We also committed $35 million in grants to support emergency relief in Gaza. Working with the World Food Programme, the World Health Organization, and the UN Children's Fund, the grants support the delivery of emergency food, water, and medical supplies. In the West Bank, we approved a $40 million grant for the continuation of education for children, $22 million to support municipal services, and $45 million to strengthen healthcare and hospital services.

Enabling green and resilient recovery
To help policymakers in the region advance their climate change and development goals, we published Country Climate and Development Reports for the West Bank and Gaza, Lebanon, and Tunisia. In Libya, the catastrophic flooding in September 2023 devastated eastern localities, particularly the city of Derna. The World Bank, together with the UN and the European Union, produced a Rapid Damage and Needs Assessment to inform recovery and reconstruction efforts.

We signe

### 3. Generating Embeddings: 

In this section, we focus on transforming the textual content extracted from each page of the document into vector embeddings. These embeddings capture the semantic meaning of the text, enabling efficient similarity searches and various NLP tasks. We also identify pages containing visual elements, such as images or tables, and flag them for special handling.

**Steps Breakdown:**

**1. Adding a flag for visual content**   
  
To handle pages with visual information, we used the vision modality to extract content from charts, tables, and images. The prompt instructs the model to tag each kind of extracted content with a marker, which lets us flag pages containing visual content. Although the vision model captures much of the visual data, some nuances may be lost when a visual is reduced to text. This is especially true for complex visuals such as engineering drawings.

When a semantic search retrieves a page flagged for visual content, we use this flag to pass the original page image to the model for retrieval-augmented generation. A simple lambda function checks whether `PageText` contains markers like `DESCRIPTION OF THE IMAGE OR CHART` or `TRANSCRIPTION OF THE TABLE`. If a marker is detected, the flag is set to 'Y'; otherwise, it's 'N'.

You can use this flag to identify pages that contain images, charts, or tables, so that they can be passed to the model as vision modality input.
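
Model output does not always reproduce the markers exactly; in the sample rows shown earlier, one page description ends with a colon inside the bold marker and another does not. The following is a minimal sketch of a more tolerant check that matches only the marker wording. `has_visual_content` and `VISUAL_MARKERS` are illustrative names, and the case-insensitive match is an assumption about how much variation you want to accept.

```python
import re

# Match the marker wording only, ignoring surrounding asterisks, colons, and letter case
VISUAL_MARKERS = re.compile(
    r"DESCRIPTION OF THE IMAGE OR CHART|TRANSCRIPTION OF THE TABLE",
    re.IGNORECASE,
)


def has_visual_content(page_text):
    """Return 'Y' if the page text contains a visual-content marker, otherwise 'N'."""
    return "Y" if VISUAL_MARKERS.search(page_text) else "N"


samples = [
    "**TRANSCRIPTION OF THE TEXT:**\n\nCONTENTS",
    "**DESCRIPTION OF THE IMAGE OR CHART:**\n\nA pie chart.",
    "**DESCRIPTION OF THE IMAGE OR CHART**\n\nTwo girls in school uniforms.",
    "**TRANSCRIPTION OF THE TABLE:**\n\n| Indicator | Value |",
]
flags = [has_visual_content(text) for text in samples]
print(flags)
```

With a DataFrame, the same function could be applied as `df['PageText'].apply(has_visual_content)`.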

**2. Generating Embeddings with OpenAI's Embedding Model**  

We use OpenAI's embedding model, `text-embedding-3-large`, to generate high-dimensional embeddings that represent the semantic content of each page. 

Note: It is crucial to ensure that the dimensions of the embedding model you use are consistent with the configuration of your Pinecone vector store. In our case, we set up the Pinecone database with 3072 dimensions to match the default dimensions of `text-embedding-3-large`.  
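
A dimension mismatch otherwise surfaces only when the vectors are sent to the index. The sketch below shows a small guard that could run before uploading. `check_dimension` and `EXPECTED_DIMENSION` are hypothetical names, and the value 3072 simply mirrors the index configuration used in this cookbook.

```python
EXPECTED_DIMENSION = 3072  # must equal the `dimension` passed when the index was created


def check_dimension(embedding, expected=EXPECTED_DIMENSION):
    """Raise early if an embedding's length does not match the index dimension."""
    if len(embedding) != expected:
        raise ValueError(
            f"Embedding has {len(embedding)} dimensions, but the index expects {expected}"
        )
    return embedding


# A placeholder vector of the right length passes through unchanged
placeholder = [0.0] * EXPECTED_DIMENSION
checked = check_dimension(placeholder)
print(len(checked))
```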


In [14]:
# Add a column to flag pages with visual content.
# Match on the marker wording only, because the model sometimes omits the trailing colon.
df['Visual_Input_Needed'] = df['PageText'].apply(
    lambda x: 'Y' if 'DESCRIPTION OF THE IMAGE OR CHART' in x or 'TRANSCRIPTION OF THE TABLE' in x else 'N'
)


# Function to get embeddings
def get_embedding(text_input):
    client = OpenAI()

    response = client.embeddings.create(
        input=text_input,
        model="text-embedding-3-large"
    )
    return response.data[0].embedding


# Generate embeddings with a progress bar to display progress
embeddings = []
for text in tqdm(df['PageText'], desc='Generating Embeddings'):
    embedding = get_embedding(text)
    embeddings.append(embedding)

# Add the embeddings to the DataFrame
df['Embeddings'] = embeddings

Generating Embeddings: 100%|██████████| 49/49 [00:34<00:00,  1.42it/s]


Let's make sure both columns have been added successfully. Page 21, which we examined earlier, should have its `Visual_Input_Needed` flag set to "Y".

In [13]:
filtered_rows = df[df['PageNumber'] == 21]
print(filtered_rows.Visual_Input_Needed)

20    Y
Name: Visual_Input_Needed, dtype: object


### 4. Uploading embeddings to Pinecone:

In this section, we will upload the embeddings we've generated for each page of our document to Pinecone. Along with the embeddings, we'll include relevant metadata tags that describe each page, such as the page number, text content, image paths, and whether the page includes graphics. This metadata will enhance our ability to perform more granular searches and filtering within the vector database.


Metadata Fields:

* pageId: Combines the document_id and pageNumber to create a unique identifier for each page.
* pageNumber: The numerical page number within the document.
* text: The extracted text content from the page.
* ImagePath: The file path to the image associated with the page.
* GraphicIncluded: A 'Y'/'N' flag indicating whether the page includes graphical elements that may require visual processing.

The function `upsert_vector` "upserts" the values - a unique identifier, embeddings, and metadata to Pinecone.

Note: "Upsert" is a combination of the words "update" and "insert." In database operations, an upsert is an atomic operation that updates an existing record if it exists or inserts a new record if it doesn't. This is particularly useful when you want to ensure that your database has the most recent data without having to perform separate checks for insertion or updating.
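
Since `upsert` takes a list of records, several pages can be sent in one request instead of one request per page. The following is a minimal sketch of the batching step only. `batched` is a hypothetical helper, the batch size of 20 is an arbitrary assumption, and the records are placeholders rather than real embeddings.

```python
def batched(records, batch_size=100):
    """Yield consecutive slices of `records`, each holding at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Placeholder records in the same id/values/metadata shape used for upserts in this section
records = [
    {"id": f"WB_Report-{n}", "values": [0.0, 0.0, 0.0], "metadata": {"pageNumber": n}}
    for n in range(1, 50)
]

batches = list(batched(records, batch_size=20))
print([len(batch) for batch in batches])
# Each batch could then be sent with a single call such as index.upsert(batch)
```

For a 49-page document the difference is small, but batching reduces the number of round trips noticeably for larger collections.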

In [15]:
import os
from tqdm import tqdm

# Assuming 'pc', 'index_name', and 'df' are defined in the earlier cells
index = pc.Index(index_name)
document_id = 'WB_Report'

# Upsert a single vector together with its identifier and metadata
def upsert_vector(identifier, embedding, metadata):
    try:
        index.upsert([
            {
                'id': identifier,
                'values': embedding,
                'metadata': metadata
            }
        ])
    except Exception as e:
        print(f"Error upserting vector with ID {identifier}: {e}")
        raise


for idx, row in tqdm(df.iterrows(), total=df.shape[0], desc='Uploading to Pinecone'):
    pageNumber = row['PageNumber']
    
    # Create meta-data tags to be added to Pinecone 
    metadata = {
        'pageId': f"{document_id}-{pageNumber}",
        'pageNumber': pageNumber,
        'text': row['PageText'],
        'ImagePath': row['ImagePath'],
        'GraphicIncluded': row['Visual_Input_Needed']
    }

    upsert_vector(metadata['pageId'], row['Embeddings'], metadata)
    

Uploading to Pinecone: 100%|██████████| 49/49 [00:08<00:00,  5.66it/s]


Navigate to the Indexes list on [Pinecone](https://app.pinecone.io/), where you should be able to view the upserted vectors along with their metadata.
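
The loop above sends one request per page, which is fine for 49 pages. For larger documents, grouping vectors into batches reduces the number of requests. The sketch below shows one way to do that. The `chunked` and `upsert_in_batches` helpers, the batch size, and the `RecordingIndex` class (a stand-in so the example runs offline) are all assumptions for illustration; with a real index you would pass the `index` object defined earlier.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert vectors in batches and return the number of requests sent."""
    requests = 0
    for batch in chunked(vectors, batch_size):
        index.upsert(vectors=batch)
        requests += 1
    return requests


class RecordingIndex:
    """Hypothetical stand-in for a Pinecone index, used only to run the sketch offline."""
    def __init__(self):
        self.batches = []

    def upsert(self, vectors):
        self.batches.append(vectors)


# 49 toy vectors, mirroring the 49 pages uploaded above
vectors = [
    {"id": f"WB_Report-{n}", "values": [0.0, 0.1], "metadata": {"pageNumber": n}}
    for n in range(1, 50)
]
fake_index = RecordingIndex()
print(upsert_in_batches(fake_index, vectors, batch_size=20))  # 3
```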

### 5. Performing Semantic Search for Relevant Pages:
In this section, we implement a semantic search to find the pages in our document that are most relevant to a user's question. This approach uses the embeddings stored in the Pinecone vector database to retrieve pages based on the semantic similarity of their content to the user's query. The text of the retrieved pages is then provided as context to the LLM for answering the user's question.

**Steps Breakdown:**  
1. Generate an embedding for the user's question.
2. Query the Pinecone index to find the most relevant pages based on the embeddings.
3. Compile the metadata of the matched pages to provide context.
4. Use the GPT-4o model to generate an informative answer using the retrieved context.
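
To make step 2 concrete, the sketch below ranks a few toy page vectors against a query vector by cosine similarity, which is roughly what a vector index does at scale. It assumes the index uses the cosine metric (the metric is chosen when the index is created), and the vectors and page IDs are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_pages(query, pages, k=2):
    """Return the k (page_id, score) pairs most similar to the query."""
    scored = [(page_id, cosine_similarity(query, vec)) for page_id, vec in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions
pages = {
    "WB_Report-12": [1.0, 0.0, 0.0],
    "WB_Report-13": [0.6, 0.8, 0.0],
    "WB_Report-14": [0.0, 0.0, 1.0],
}
query = [0.0, 1.0, 0.2]

for page_id, score in top_k_pages(query, pages):
    print(page_id, round(score, 2))
```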

In [22]:
import json

def get_response_to_question(user_question, pc_index):
    # Get embedding of the question to find the relevant page with the information 
    question_embedding = get_embedding(user_question) 

    # Query the index for the pages most similar to the question
    response = pc_index.query(
        vector=question_embedding,
        top_k=2,
        include_values=False,  # Only the metadata is used below
        include_metadata=True
    )
    
    # Collect the metadata from the matches
    context_metadata = [match['metadata'] for match in response['matches']]
    
    # Convert the list of metadata dictionaries to a JSON string
    context_json = json.dumps(context_metadata, indent=2)
    
    prompt = f"""You are a helpful assistant. Use the following context and images to answer the question. In the answer, include the reference to the document, and page number you found the information on between <source></source> tags. If you don't find the information, you can say "I couldn't find the information"

    question: {user_question}
    
    <SOURCES>
    {context_json}
    </SOURCES>
    """
    
    client = OpenAI()
    
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt}
        ]
    )
    
    return completion.choices[0].message.content

Now, let's ask a question where the information was included in a graphic representation. 

In [23]:
question = "What percentage was allocated to social protections in Western and Central Africa?"
answer = get_response_to_question(question, index)

print(answer)

The percentage allocated to social protections in Western and Central Africa was 8% according to the donut chart on page 13 of the document. <source>WB_Report-13, page 13</source>
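
Because the prompt asks the model to wrap citations in `<source></source>` tags, the references can be pulled out of the answer programmatically. A minimal sketch, assuming the model follows the tag format as it did in the output above (the helper name `extract_sources` is hypothetical):

```python
import re

def extract_sources(answer):
    """Return the text inside each <source>...</source> pair, in order of appearance."""
    matches = re.findall(r"<source>(.*?)</source>", answer, flags=re.DOTALL)
    return [match.strip() for match in matches]

answer = (
    "The percentage allocated to social protections in Western and Central Africa "
    "was 8% according to the donut chart on page 13 of the document. "
    "<source>WB_Report-13, page 13</source>"
)
print(extract_sources(answer))  # ['WB_Report-13, page 13']
```

If the model omits the tags, the function returns an empty list, which a caller could treat as a signal that the answer is unsourced.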


To make it a bit harder for RAG, let's ask a question that requires visual understanding of a table.

In [24]:
question = "What was the increase in access to electricity between 2000 and 2012 in Western and Central Africa?"
answer = get_response_to_question(question, index)

print(answer)

The increase in access to electricity between 2000 and 2012 in Western and Central Africa was 10%. Access to electricity increased from 34.1% of the population in 2000 to 44.1% in 2012. 

<source>WB_Report-13, page 13</source>


This worked well. However, there may be instances where the information is contained within images or graphics that lose fidelity when translated to text, for example in documents with complex engineering drawings. With the vision modality, we can pass the image of the page as context to the model. In the next section, we will examine how to improve the accuracy of the model's response with image inputs.

### 6. Handling Pages with Visual Content:
When the metadata indicates the presence of an image or table, we can use the vision modality and pass the page image to the model. In this approach, we still identify the right page using the vector embeddings, but instead of passing the extracted text, we pass the image of the page as context for the model to answer the question.

In [25]:
import base64
import json

def get_response_to_question_with_images(user_question, pc_index):
    # Get embedding of the question to find the relevant page with the information 
    question_embedding = get_embedding(user_question) 

    # Query the index for the pages most similar to the question
    response = pc_index.query(
        vector=question_embedding,
        top_k=2,
        include_values=False,  # Only the metadata is used below
        include_metadata=True
    )
    
    # Collect the metadata from the matches
    context_metadata = [match['metadata'] for match in response['matches']]

    # Function to encode the image
    def encode_image(local_image_path):
        with open(local_image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    
    # Build the message content
    message_content = []

    # Add the initial prompt
    initial_prompt = f"""You are a helpful assistant. Use the following images to answer the question. In the answer, include the reference to the page number or title of the section you found on the image. If you don't find the information, you can say "I couldn't find the information"

    question: {user_question}
    """
    
    message_content.append({"type": "text", "text": initial_prompt})

    # Attach the page image for each matched page
    for metadata in context_metadata:
        image_path = metadata.get('ImagePath')
        if not image_path:
            continue  # No image stored for this page, so there is nothing to attach
        
        try:
            base64_image = encode_image(image_path)
            image_type = 'jpeg'  # Adjust if your images are in a different format
            message_content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/{image_type};base64,{base64_image}"
                },
            })
        except Exception as e:
            print(f"Error encoding image at {image_path}: {e}")

    # Prepare the messages for the API call
    messages = [
        {
            "role": "user",
            "content": message_content
        }
    ]
    
    client = OpenAI()
    
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    
    return completion.choices[0].message.content
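
The function above hard-codes `jpeg` in the data URL. If page images may be saved in other formats, the MIME type can be inferred from the file extension instead. A small sketch using only the standard library; the helper name `image_to_data_url` is hypothetical, and falling back to JPEG is an assumption:

```python
import base64
import mimetypes
import os
import tempfile

def image_to_data_url(image_path):
    """Encode an image file as a base64 data URL, guessing the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/jpeg"  # Assumed fallback, matching the format used above
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Offline demonstration with placeholder bytes rather than a real page image
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page_1.png")
    with open(path, "wb") as f:
        f.write(b"abc")
    print(image_to_data_url(path))  # data:image/png;base64,YWJj
```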


In [27]:
question = "What percentage was allocated to social protections in Western and Central Africa?"
answer = get_response_to_question_with_images(question, index)

print(answer)

Social protections in Western and Central Africa were allocated 8% according to Figure 2 on page 22.


In [29]:
question = "What was the increase in access to electricity between 2000 and 2012 in Western and Central Africa?"
answer = get_response_to_question_with_images(question, index)

print(answer)

I found the information on page 23 under "TABLE 5: WESTERN AND CENTRAL AFRICA REGIONAL SNAPSHOT."

The access to electricity in Western and Central Africa increased from 34.1% in 2000 to 55.4% by the most recent data (up to 2023).
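
The function above sends the page image for every match. To follow the approach described at the start of this section more closely, the message could include the image only for pages whose metadata flags visual content, and the extracted text otherwise. The sketch below is one way to build such a message. The helper names and the set of values treated as "flagged" (`True`, `'Y'`, and similar) are assumptions, since the exact value depends on how `Visual_Input_Needed` was populated.

```python
import base64

def needs_visual_input(metadata):
    """Interpret the 'GraphicIncluded' flag; the accepted values are an assumption."""
    flag = metadata.get("GraphicIncluded")
    return flag is True or str(flag).strip().upper() in {"Y", "YES", "TRUE"}

def build_message_content(user_question, context_metadata, read_bytes=None):
    """Build chat message parts: images for flagged pages, extracted text for the rest."""
    if read_bytes is None:
        def read_bytes(path):
            with open(path, "rb") as image_file:
                return image_file.read()

    content = [{"type": "text", "text": f"question: {user_question}"}]
    for metadata in context_metadata:
        image_path = metadata.get("ImagePath")
        if needs_visual_input(metadata) and image_path:
            encoded = base64.b64encode(read_bytes(image_path)).decode("utf-8")
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
            })
        else:
            page = metadata.get("pageNumber")
            content.append({"type": "text", "text": f"Page {page}: {metadata.get('text', '')}"})
    return content


# Offline demonstration: a fake reader replaces real image files
pages = [
    {"pageNumber": 13, "text": "Regional snapshot", "ImagePath": "page_13.jpeg", "GraphicIncluded": "Y"},
    {"pageNumber": 14, "text": "Plain narrative text", "ImagePath": "page_14.jpeg", "GraphicIncluded": "N"},
]
parts = build_message_content(
    "What percentage was allocated to social protections?",
    pages,
    read_bytes=lambda path: b"fake",
)
print([part["type"] for part in parts])  # ['text', 'image_url', 'text']
```

The resulting list could be used as the `content` of the user message in the chat completion call, in place of `message_content`.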


### Conclusion

In this cookbook, we enhanced a Retrieval-Augmented Generation (RAG) system for documents rich in images, graphics and tables. Traditional RAG pipelines handle textual data well but often overlook the information conveyed through visual elements. By integrating vision models and metadata tagging, we bridged this gap so that the model can interpret and use visual content.

We began by setting up a vector store using Pinecone, establishing a foundation for efficient storage and retrieval of vector embeddings. Parsing PDFs and extracting visual information allowed us to convert document pages into images and identify those containing visual data. By generating embeddings and flagging pages with visual content, we made it possible to filter on metadata within our vector store.

Uploading these embeddings to Pinecone made them available to our RAG system. Through semantic search, we retrieved the pages that best matched user queries, so that both textual and visual information could be considered. Passing pages with visual content to a vision model can improve the accuracy and depth of the responses, particularly for queries that depend on images or tables.

Using the World Bank's **A Better Bank for a Better World: Annual Report 2024** as our guiding example, we demonstrated how these techniques come together to process and interpret complex documents. This approach lets the model draw on charts and tables as well as text, so responses can be more comprehensive and accurate.

By following the concepts outlined in this cookbook, you are now equipped to build RAG systems capable of processing and interpreting documents with intricate visual elements. This advancement opens up new possibilities for AI applications across various domains where visual data plays a pivotal role.