## Challenge 4: Advanced RAG with Azure AI Document Intelligence

In real scenarios, many documents are not just text; they combine text, images, tables, and more. In this step, you will create a more advanced RAG application that can handle this kind of document.
To do so, you will use Azure AI Document Intelligence to extract the text, images, and tables from the documents and use them as input for the RAG model.

To achieve this, we will build on top of the LangChain framework, enhancing the `Document Loader` and `Text Splitters` to deal with images and tables.
The code repository already contains the enhanced versions of the `Document Loader` and `Text Splitters`. They are included in two Python modules: `doc_intelligence.py` and `ingestion.py`.

You can now use these libraries to create your advanced RAG.

The libraries and the required environment variables are already provided (you just need to populate the variables).

In [2]:
import sys, os, dotenv
dotenv.load_dotenv(override=True)
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../../lib')))

# Setup environment

# OpenAI
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION")
AZURE_OPENAI_MODEL = os.getenv("AZURE_OPENAI_MODEL")
AZURE_OPENAI_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
AZURE_OPENAI_EMBEDDING = os.getenv("AZURE_OPENAI_EMBEDDING")
# Azure Search
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_API_KEY = os.getenv("AZURE_SEARCH_API_KEY")
# Azure AI Document Intelligence
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
AZURE_DOCUMENT_INTELLIGENCE_API_KEY = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_KEY")
AZURE_DOCUMENT_INTELLIGENCE_API_VERSION= os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_VERSION")
# Azure Blob Storage
AZURE_STORAGE_CONNECTION_STRING = os.getenv("AZURE_STORAGE_CONNECTION_STRING")
AZURE_STORAGE_CONTAINER = os.getenv("AZURE_STORAGE_CONTAINER")
AZURE_STORAGE_FOLDER = os.getenv("AZURE_STORAGE_FOLDER")

# Import Libraries
from langchain_openai import AzureChatOpenAI
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from azure.ai.documentintelligence.models import DocumentAnalysisFeature

# Custom Libraries
from its_a_rag.doc_intelligence import AzureAIDocumentIntelligenceLoader
from its_a_rag import ingestion

# Define the questions list (if you are using your own dataset you need to change this list)
QUESTIONS = [
  "What are the revenues of GOOGLE in the year 2009?",
  "What are the revenues and the operative margins of ALPHABET Inc. in 2022 and how it compares with the previous year?",
  "Can you create a table with the total revenue for ALPHABET, NVIDIA, MICROSOFT and APPLE in year 2023?",
  "Can you give me the Fiscal Year 2023 Highlights for APPLE, MICROSOFT and NVIDIA?",
  "Did APPLE repurchase common stock in 2023? create a table of APPLE repurchased stock with date, numbers of stocks and values in dollars.",
  "What is the value of the cumulative 5-years total return of ALPHABET Class A at December 2022?",
  "What was the price of APPLE, NVIDIA and MICROSOFT stock in 23/07/2024?",
  "Can you buy 10 shares of APPLE for me?"
  ]

# Define the system prompt (you need to update this if you are using your own dataset)
system_prompt = """ You are a financial assistant tasked with answering questions related to the financial results of major technology companies listed on NASDAQ, \n
specifically Microsoft (MSFT), Alphabet Inc. (GOOGL), Nvidia (NVDA), Apple Inc. (AAPL), and Amazon (AMZN). \n
if you don't find the answer in the context, just say `I don't know.`"""
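
Because every later cell depends on these settings, it can help to check that they are populated before going further. The snippet below is a minimal sketch: the `find_missing_vars` helper is hypothetical (it is not part of the repository), and the choice of which variables count as required is an assumption based on the setup cell above.

```python
import os

# Variable names copied from the setup cell; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_API_KEY",
    "AZURE_DOCUMENT_INTELLIGENCE_API_VERSION",
]


def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this right after `dotenv.load_dotenv` surfaces an empty `.env` entry early, instead of as an authentication error in a later cell.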


## Create the Vector store, the embeddings client and the OpenAI Chat client

Let's start by creating the vector store and the embeddings client. Because we need a custom index that stores the information in a way our retriever can use, we have a custom function for that (`create_multimodal_vectore_store`).
For the OpenAI Chat client, we will simply use the one offered by the LangChain framework, as in Step 3 of this notebook.

In [None]:
# Create the index for Azure Search store and Embedding (using the custom function create_multimodal_vectore_store)
# NOTE: Remember to create the new index in Azure Search called "itsarag-ch4-001"

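# Create the Azure OpenAI Chat Client
llm = AzureChatOpenAI(azure_endpoint=AZURE_OPENAI_ENDPOINT, api_key=AZURE_OPENAI_API_KEY,
                      api_version=AZURE_OPENAI_API_VERSION, azure_deployment=AZURE_OPENAI_DEPLOYMENT_NAME)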


## Index Phase

As always, the first step is to index the documents.
The high-level steps are:

- Set Folder Path: Assign the local folder path to the variable folder.
- List Files: Create a list of files in the specified folder.
- Get Full Paths: Convert the list of file names to their full paths.
- Iterate Over Files: Loop through each file in the list.
    - Extract File Name: Extract the file name from the full path (this is required for the document loader).
    - Load Document: Use `AzureAIDocumentIntelligenceLoader` to load the document with the specified API credentials and settings (remember to use the prebuilt layout model, `prebuilt-layout`, and the latest API version).
    - Split Document: Split the loaded document using a custom advanced text splitter.
    - Store Document: Add the processed documents to a multimodal vector store (using the `add_documents` method).
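
The internals of the custom splitter are not shown in this notebook, and it presumably does more than plain character splitting in order to keep tables and images intact. As a generic illustration of what a chunk size and a chunk overlap mean in the Split Document step, here is a minimal sketch; the `split_with_overlap` function is hypothetical and is not the repository's implementation.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Synthetic 2500-character sample, used only to show the chunk boundaries
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.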

In [3]:
# Index Phase

# Set Folder Path
folder = "./data/fsi/pdf"

# List Files (filtering files starting with "2023")
files = [os.path.join(folder, f) for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f)) and f.startswith("2023")]

# Iterate Over Files
for file in files:
    # Extract File Name
    pdf_file_name = os.path.basename(file)
    print(f"Processing file: {pdf_file_name}")

    # Load Document using Azure AI Document Intelligence
    loader = AzureAIDocumentIntelligenceLoader(
        api_endpoint=AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT,
        api_key=AZURE_DOCUMENT_INTELLIGENCE_API_KEY,
        api_version=AZURE_DOCUMENT_INTELLIGENCE_API_VERSION,
        file_path=file,
        model="prebuilt-layout",
        # prebuilt-layout extracts tables by default; DocumentAnalysisFeature has no TABLES member
        features=[DocumentAnalysisFeature.OCR_HIGH_RESOLUTION]
    )
    document = loader.load()

    # Split Document using custom advanced text splitter
    splitter = ingestion.AdvancedTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(document)

    # Store Document chunks in the multimodal vector store (add_documents method)
    vector_store.add_documents(chunks)

FileNotFoundError: [Errno 2] No such file or directory: './data/fsi/pdf'
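
The `FileNotFoundError` above means the relative folder path does not resolve from the notebook's current working directory. One way to make this failure easier to diagnose is to resolve the path and check it before listing. The `list_pdf_files` helper below is a hypothetical sketch, and the `.pdf` extension filter is an assumption added for the example; the demo uses a temporary folder so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def list_pdf_files(folder, prefix=""):
    """Return sorted full paths of the PDF files in `folder` whose names start with `prefix`."""
    folder_path = Path(folder).resolve()
    if not folder_path.is_dir():
        raise FileNotFoundError(
            f"Folder not found: {folder_path} (current working directory: {Path.cwd()})"
        )
    return sorted(
        str(path)
        for path in folder_path.iterdir()
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and path.name.startswith(prefix)
    )


# Demo on a temporary folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2023_report.pdf", "2022_report.pdf", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    found = list_pdf_files(tmp, prefix="2023")
    print([Path(f).name for f in found])
```

In the notebook, the equivalent call would be `list_pdf_files(folder, prefix="2023")`; the error message then shows the resolved path, which makes a wrong relative path obvious.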

## Retrieve Phase

The next step is to create a retriever for the documents based on the user query.
You should use the following parameters:
- Search Type: Hybrid
- Number of results: 20

In [None]:
# Retrieve (as_retriever), assuming vector_store is a LangChain AzureSearch vector store
retriever = vector_store.as_retriever(search_type="hybrid", k=20)
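
Hybrid search runs a keyword query and a vector query side by side and merges the two ranked lists; Azure AI Search documents Reciprocal Rank Fusion (RRF) as its merging method. The sketch below illustrates the idea in plain Python only. It is not the service's implementation, the document IDs are made up, and the constant `60` is an assumed, commonly used value.

```python
def reciprocal_rank_fusion(rankings, rrf_constant=60, top_n=20):
    """Merge several ranked lists of document IDs into one list using RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (constant + rank) to the document's score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_constant + rank)
    # Highest score first; ties are broken alphabetically for determinism
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Hypothetical results from the keyword and vector sides of a hybrid query
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits], top_n=3)
print(fused)
```

A document that appears near the top of both lists (here `doc_a`) outranks one that is first in only one list, which is why hybrid retrieval tends to be more robust than either search type alone.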


## Generate Phase

The final step is to generate the answer using the RAG model.
We will create a LangChain chain with the following steps:
 - Retrieve the docs and get the image description if the doc metadata indicates an image (with the `get_image_description` function wrapped in a `RunnableLambda`), then pass the context and the question (using `RunnablePassthrough`) to the next phase.
 - Use the advanced multimodal prompt function to assemble the system message, the context (the text and the image, if present) and the question; a `RunnableLambda` is useful here as well.
 - Use the OpenAI model to generate the answer.
 - Parse the output and return the answer.
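
To make the data flow of these steps concrete, here is a plain-Python sketch with stand-ins for every component. Nothing in it comes from the repository or from LangChain: `Doc`, `fake_retriever`, `describe_images`, `build_prompt`, `fake_llm` and the `type` and `description` metadata keys are all hypothetical, and the sample texts are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (hypothetical, not the LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_retriever(question):
    # Stand-in for the retriever: returns one text chunk and one image entry
    return [
        Doc("Sample text chunk about revenue."),
        Doc("chart.png", {"type": "image", "description": "Sample chart description."}),
    ]


def describe_images(docs):
    # Step 1: replace image docs with their description, keep text docs as they are
    resolved = []
    for doc in docs:
        if doc.metadata.get("type") == "image":
            resolved.append(f"[image] {doc.metadata.get('description', '')}")
        else:
            resolved.append(doc.page_content)
    return resolved


def build_prompt(inputs, system_prompt="Answer only from the context."):
    # Step 2: assemble system message, context and question into one prompt
    context = "\n".join(inputs["context"])
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt):
    # Step 3: stand-in for the chat model call
    return {"content": f"(model reply to a prompt of {len(prompt)} characters)"}


def parse_output(message):
    # Step 4: extract the answer text from the model message
    return message["content"]


def run_chain(question):
    inputs = {"context": describe_images(fake_retriever(question)), "question": question}
    return parse_output(fake_llm(build_prompt(inputs)))


print(run_chain("What was the total revenue in 2023?"))
```

In the LangChain version, the dictionary built in `run_chain` corresponds to a mapping of `context` and `question`, and each function becomes one runnable joined with the `|` operator.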

In [None]:
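# Generate
# RAG pipeline (text-only version; assumes retriever, llm and system_prompt are defined above)

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

def build_messages(inputs):
    # Assemble the system prompt, the retrieved context and the user question
    return [
        ("system", system_prompt),
        ("human", f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"),
    ]

# To handle images, swap format_docs and build_messages for the repository's
# image description and multimodal prompt functions described in the steps above.
chain_multimodal_rag = (
    {"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()}
    | RunnableLambda(build_messages)
    | llm
    | StrOutputParser()
)

# Run the pipeline
for question in QUESTIONS:
    print(f"Question: {question}")
    answer = chain_multimodal_rag.invoke(question)
    print(f"Answer: {answer}")
    print("-" * 80)

# Save the index
ingestion.save_index(vector_store, "itsarag-ch4-001")
# Load the index
# vector_store = ingestion.load_index("itsarag-ch4-001")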

## Test the Solution

You can test the solution by providing a question and checking the answer generated by the RAG model (invoke the LangChain chain).

Try to get answers for the questions in the `QUESTIONS` list defined earlier:


In [4]:
# Test the solution
for QUESTION in QUESTIONS:
    print(f"QUESTION: {QUESTION}")
    print(chain_multimodal_rag.invoke(QUESTION))
    print("--------------------------------------------------")

QUESTION: What are the revenues of GOOGLE in the year 2009?


NameError: name 'chain_multimodal_rag' is not defined