This notebook builds a RAG module using a traditional model (GPT-4) and a multimodal model (GPT-4o) to answer the following questions:
1. Is the RAG architecture capable of addressing all kinds of contextual responses from context documents?
2. Can the RAG architecture handle different types of inputs, such as text, images, and tabular data, and return responses with equal precision based on their respective contexts?


In [0]:
from datetime import datetime
datetime.now()

datetime.datetime(2024, 8, 5, 18, 2, 47, 359493)

### Packages

In [0]:
%pip install transformers
%pip install protobuf==3.20.*
%pip install poppler-utils
%pip install langchain==0.2.6
%pip install langchain-core==0.2.6
%pip install langchain-openai==0.1.13
%pip install langchain-community==0.2.6
%pip install langchain unstructured[all-docs] pydantic lxml
%pip install chromadb==0.5.3
dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting protobuf==3.20.*
  Downloading protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 12.7 MB/s eta 0:00:00
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.19.4
    Not uninstalling protobuf at /databricks/python3/lib/python3.10/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-611d4d2b-c519-4166-bf96-a8d17ff0fea9
    Can't uninstall 'protobuf'. No files were found to uninstall.
ERROR: pip's dependency resolver does not currently take into account all the package

### Path of dataset

In [0]:
path = "/Volumes/main/iqvia/my_volume/mnkannualreport_v1.pdf"


In [0]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Get elements
try:
    raw_pdf_elements = partition_pdf(
        filename=path,
        # Using pdf format to find embedded image blocks
        extract_images_in_pdf=True,
        # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
        # Titles are any sub-section of the document
        infer_table_structure=True,
        # Post processing to aggregate text once we have the title
        chunking_strategy="by_title",
        # Chunking params to aggregate text blocks
        # Hard max on chunk size
        max_characters=4000,
        # Attempt to start a new chunk after 3800 chars
        new_after_n_chars=3800,
        # Attempt to keep chunks > 2000 chars
        combine_text_under_n_chars=2000,
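        # Write extracted images to the PDF's parent directory (a directory, not the PDF file itself)
        image_output_dir_path=path.rsplit("/", 1)[0],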
    )
except ValueError as e:
    if str(e) == "max() arg is an empty sequence":
        print("No detectable table structures or cells in the PDF.")
        # Fall back to an empty list so later cells can still run
        raw_pdf_elements = []
    else:
        raise  # Re-raise the exception if it's not the specific one we're catching

Downloading yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/115M [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/46.8M [00:00<?, ?B/s]

### Create a dictionary to store counts of each type

In [0]:

category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element types
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 29,
 "<class 'unstructured.documents.elements.Table'>": 13}

In [0]:
class Element(BaseModel):
    type: str
    text: Any

# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Extract the table elements
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# extract the Text elements
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

13
29
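The counting and categorizing cells above both match on the string form of each element's type. As a rough alternative sketch, the same split can be done with `isinstance` and `collections.Counter`. The classes `FakeElement`, `CompositeElement`, and `Table` below are hypothetical stand-ins defined locally (not the `unstructured` classes), and `LABELS` and `categorize` are illustrative names, so the snippet runs without the library or the PDF.

```python
from collections import Counter


# Hypothetical stand-ins for the unstructured element classes, defined
# locally so the sketch runs without the library or the PDF.
class FakeElement:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class CompositeElement(FakeElement):
    pass


class Table(FakeElement):
    pass


# Map each class to the label used in this notebook.
LABELS = {Table: "table", CompositeElement: "text"}


def categorize(elements):
    """Return (label, text) pairs, skipping element types without a label."""
    pairs = []
    for element in elements:
        for cls, label in LABELS.items():
            if isinstance(element, cls):
                pairs.append((label, str(element)))
                break
    return pairs


sample = [CompositeElement("intro"), Table("FY22 17.7"), CompositeElement("outro")]
pairs = categorize(sample)
counts = Counter(label for label, _ in pairs)
print(counts)
```

With the real library, the two stand-in classes would be replaced by imports from `unstructured.documents.elements`.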


### Text and Table summaries

In [0]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [0]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

In [0]:
import os
from openai import AzureOpenAI
from langchain_core.messages import HumanMessage
from langchain_openai import AzureChatOpenAI

In [0]:
# Summary chain
model = AzureChatOpenAI(azure_deployment="Test", api_version="2024-02-01")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [0]:
summarize_chain

{
  element: RunnableLambda(lambda x: x)
}
| ChatPromptTemplate(input_variables=['element'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['element'], template='You are an assistant tasked with summarizing tables and text. Give a concise summary of the table or text. Table or text chunk: {element} '))])
| AzureChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7fca6fb5ed10>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7fca6fbbc670>, openai_api_key=SecretStr('**********'), openai_proxy='', azure_endpoint='https://mnkazureopenaitest.openai.azure.com/', deployment_name='Test', openai_api_version='2024-02-01', openai_api_type='azure')
| StrOutputParser()

In [0]:
# Apply to text
texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

In [0]:
text_summaries

["The text provides an overview of a corporation's market leadership, ranking second in the Covered Market (CVM) in FY23. Additionally, it has held the top position in prescriptions over the last six years.",
 "The company ranks third in terms of volume in FY23 and has four Consumer Healthcare brands that are #1 in their categories. It has seen a 2.2X average volume growth from FY18 to FY23 compared to the IPM. The company's operations have a significant scale, with a revenue of INR 8,749 crore, 97% of which is domestic, and an EBITDA of INR 1,913 crore. The company has a large field force of over 15,000 and more than 13,000 stockists. It has 20 brand families worth over INR 100 crore, a cash EPS of INR 40.4, and a cash flow from operations amounting to INR 1,813 crore. The company acknowledges potential risks and uncertainties in its forward-looking statements, which include factors such as competition, government policies, economic conditions, and technological advances.",
 'Mankind 

In [0]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [0]:
len(table_summaries)

13
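The `batch` calls above pass `{"max_concurrency": 5}`, which caps how many summarization requests run at once while results come back in input order. The sketch below illustrates that behaviour with a thread pool. It is not how LangChain implements `batch`, and `fake_summarize` is a hypothetical stand-in for one LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_summarize(chunk: str) -> str:
    """Hypothetical stand-in for one LLM call; truncates instead of summarizing."""
    return chunk[:20]


def batch_summarize(chunks, max_concurrency=5):
    """Run at most `max_concurrency` calls at once, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in the order of the inputs
        return list(pool.map(fake_summarize, chunks))


chunks = [f"chunk number {i} " * 3 for i in range(12)]
summaries = batch_summarize(chunks)
print(len(summaries))
```

A lower cap is gentler on rate limits at the cost of longer wall-clock time.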

In [0]:
tables[1]

'Key Therapeutic Segments FY14 FY15 FY16 FY17 FY18 FY19 Value Growth YoY in IPM (%) 7.3 19.3 24.2 9.8 11.3 12.6 Market Share by Value in IPM (%) 3.3 3.5 3.8 3.8 3.9 4.0 Market Ranking by Value in IPM (x) 7 5 4 4 4 4 Covered Market Share in total IPM (%) 64.6 64.1 64.8 63.5 60.2 61.6 Market Share in Covered Market (%) 5.1 5.5 5.8 5.9 6.6 6.5 Covered Market Rank (x) 3 2 2 2 2 2 Volume Share in IPM (%) 3.9 4.3 4.7 4.4 4.8 5.1 Market Ranking by Volume in IPM (x) 6 6 5 5 5 3 Chronic Share in total portfolio (%) 19.6 20.4 25.3 26.7 27.9 31.9 Chronic Growth YoY (%) 14.6 23.8 53.9 16.0 16.4 28.6 Metro & Class 1 Share (%) NA 51.6 50.3 50.7 49.9 49.2 FY20 12.5 4.1 4 62.4 6.5 2 5.2 3 32.2 13.5 48.1 FY21 11.1 4.3 4 62.2 6.9 2 5.7 3 34.1 17.6 51.8 FY22 17.7 4.3 4 65.4 6.6 2 5.5 3 32.9 13.6 52.9 FY23 10.6 4.4 4 68.1 6.5 2 5.7 3 33.9 14.1 53.2'

In [0]:
table_summaries

["This is a table of contents for a comprehensive report or document about Mankind Pharma. It outlines the company's key details, including a brief introduction, its business highlights over the past decade, key milestones, and pillars of growth. It also includes messages from the Chairman, Vice Chairman, CEO, and COO. Moreover, it provides information about the company's business model, marketing and branding strategies, financial highlights, operational excellence, and quality management. The company's supply chain, research and development initiatives, and technology excellence are also discussed. The latter part of the document covers topics on environmental, social, and corporate governance, inclusive development, good governance, and profiles of board directors. It concludes with the company's awards, recognitions, and other corporate information.",
 'The table presents the growth and market share of different therapeutic segments from FY14 to FY23. Value growth YoY in IPM (%) fl

### Add to vector space

In [0]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import AzureOpenAIEmbeddings

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=AzureOpenAIEmbeddings(model="text-embedding-3-large"))

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

In [0]:
retriever

MultiVectorRetriever(vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7fca70153f40>, docstore=<langchain_core.stores.InMemoryStore object at 0x7fca716afa60>)
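The pattern used above is that summaries are what gets searched, while the raw text or table they point to (through `doc_id`) is what gets returned. A minimal pure-Python sketch of that idea follows. `ToyMultiVectorRetriever` and `tokens` are hypothetical names, word overlap stands in for embedding similarity, and the strings are toy data abbreviated from this report.

```python
import re
import uuid


def tokens(text):
    """Lowercased word set; a toy replacement for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyMultiVectorRetriever:
    """Search over summaries, return the raw parent chunks they point to."""

    def __init__(self):
        self.index = []     # (summary, doc_id) pairs, playing the vector store
        self.docstore = {}  # doc_id -> raw chunk, playing the InMemoryStore

    def add(self, summaries, raw_chunks):
        for summary, raw in zip(summaries, raw_chunks):
            doc_id = str(uuid.uuid4())
            self.index.append((summary, doc_id))
            self.docstore[doc_id] = raw

    def invoke(self, query, k=1):
        q = tokens(query)
        # Rank summaries by word overlap with the query, best first
        ranked = sorted(
            self.index, key=lambda item: len(q & tokens(item[0])), reverse=True
        )
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


toy = ToyMultiVectorRetriever()
toy.add(
    summaries=[
        "Summary of revenue and EBITDA figures",
        "Summary of therapeutic segment shares",
    ],
    raw_chunks=["RAW: INR 1,913 crore EBITDA", "RAW: Cardiac 7.1 7.6 8.5"],
)
result = toy.invoke("What is the EBITDA?")
print(result)
```

The point of the design is that the answering model sees the full raw chunk, not the lossy summary that was used to find it.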

### Sanity check

In [0]:
tables[2]

'Key Therapeutic Segments FY14 FY15 FY16 FY17 FY18 FY19 Anti-Infectives 24.3 22.0 19.6 19.0 17.7 15.8 Cardiac 7.1 7.6 8.5 9.3 10.1 10.8 Gastro Intestinal 12.0 12.1 13.1 12.7 12.3 11.5 Respiratory 6.4 6.8 6.6 6.5 7.6 8.0 Pain / Analgesics 5.9 6.2 6.3 6.3 6.0 6.4 Anti Diabetic 4.3 4.5 4.9 5.7 6.3 7.7 Vitamins/Minerals/Nutrients 7.7 9.2 10.8 10.7 10.6 9.8 Dermatology 5.3 6.2 7.1 8.2 9.1 9.0 Gynaecology 8.0 7.3 6.1 5.4 4.8 5.0 Neuro / CNS 4.4 4.3 3.1 2.8 2.7 2.9 FY20 15.9 11.5 11.4 8.7 6.1 7.5 9.5 8.4 5.1 2.9 FY21 13.2 12.6 11.3 7.2 5.4 8.7 10.3 8.6 6.5 3.2 FY22 14.7 12.1 10.9 9.7 5.4 8.3 9.5 7.4 6.7 2.9 FY23 15.0 12.8 10.8 9.5 5.0 8.2 8.5 6.1 7.7 2.6'

In [0]:
table_summaries[1]

'The table presents the growth and market share of different therapeutic segments from FY14 to FY23. Value growth YoY in IPM (%) fluctuated, reaching a peak of 24.2% in FY16 and lowest of 7.3% in FY14. Market Share by Value in IPM (%) increased gradually from 3.3% in FY14 to 4.4% in FY23. The company maintained its Market Ranking by Value in IPM (x) at 4 from FY16 to FY23. The Covered Market Share in total IPM (%) also saw a fluctuating trend, highest at 68.1% in FY23. The Market Share in Covered Market (%) and Covered Market Rank (x) remained fairly consistent. Volume Share in IPM (%) saw a steady increase from 3.9% in FY14 to 5.7% in FY23. The Chronic Share in total portfolio (%) increased from 19.6% in FY14 to 33.9% in FY23. Chronic Growth YoY (%) peaked at 53.9% in FY16. The Metro & Class 1 Share (%) decreased slightly from 51.6% in FY14 to 53.2% in FY23.'
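Comparing a summary against its table by eye is error-prone, so one possible automation is to flag any number quoted in the summary that never occurs in the raw table text. The sketch below assumes that simple rule. `numbers_in` and `unsupported_numbers` are hypothetical helpers, the strings are toy data abbreviated from the outputs above, and the check cannot detect a number that is present but attached to the wrong row or year.

```python
import re


def numbers_in(text):
    """Return the set of numeric tokens (e.g. '24.2', '7.3') found in text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def unsupported_numbers(summary, table_text):
    """Numbers quoted in the summary that never occur in the raw table."""
    return sorted(numbers_in(summary) - numbers_in(table_text))


# Toy strings abbreviated from the outputs above
table_text = "Value Growth YoY in IPM (%) 7.3 19.3 24.2 9.8 11.3 12.6"
good_summary = "Value growth peaked at 24.2% and was lowest at 7.3%."
bad_summary = "Value growth peaked at 42.2%."

good_result = unsupported_numbers(good_summary, table_text)
bad_result = unsupported_numbers(bad_summary, table_text)
print(good_result, bad_result)
```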

## Retrieval from a natural-language query

In [0]:
retriever.invoke("What is the EBITDA for FY23?")[3]


'#3 Rank by volumes in FY23\n\n4 Consumer Healthcare brands ranked #1 in their categories\n\nYoungest in the Top 5 of the IPM\n\n2.2X Average volume growth from FY18 to FY23 vs IPM\n\nRank by Value\n\n#8  #4  FY12  FY23 \n\nSTATUTORY REPORTS\n\n65 - Management Discussion and Analysis 87 - Board’s Report 135 - Business Responsibility and Sustainability Report\n\nScale of operations\n\nFINANCIAL STATEMENTS\n\n177 - Standalone Financial Statements 321 - Consolidated Financial Statements\n\nINR 8,749 crore Revenue\n\n97% Domestic revenue\n\nINR 1,913 crore EBITDA\n\n15,000+ Field force (including field managers)\n\nFor more information, scan the QR code or visit our website\n\n13,000+ Stockists\n\nhttps://www.mankindpharma.com\n\nINR 1,310 crore PAT\n\nForward-Looking Statements\n\nThe statements may contain forward-looking statements like the words ‘believe’, ‘expect’, ‘anticipate’, ‘intend’, ‘plan’, ‘estimate’, ‘project’, ‘will’, ‘may’, ‘targeting’ and similar expressions regarding the f

## Traditional RAG architecture

In [0]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Option 1: LLM
model = AzureChatOpenAI(temperature=0,azure_deployment="gpt-4", api_version="2024-02-01")
# Option 2: Multi-modal LLM
# model = GPT4-V or LLaVA
# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
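The first two steps of the chain above fetch context for the question and then fill the prompt template with both. The sketch below mirrors only those two steps in plain Python, under stated assumptions: `toy_retriever` is a hypothetical retriever that always returns the same two chunks, and `build_prompt` is an illustrative helper, not a LangChain API.

```python
TEMPLATE = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""


def toy_retriever(question):
    """Hypothetical retriever: always returns the same two chunks."""
    return ["INR 8,749 crore Revenue", "INR 1,913 crore EBITDA"]


def build_prompt(question, retrieve=toy_retriever):
    """Mirror the first two chain steps: fetch context, then fill the template."""
    inputs = {"context": retrieve(question), "question": question}
    return TEMPLATE.format(**inputs)


filled_prompt = build_prompt("What is the EBITDA for FY23?")
print(filled_prompt)
```

Printing the filled prompt this way is a quick check of what the model would actually be asked to answer from.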

### Sample Queries

In [0]:
chain.invoke(
    "What is the Value Growth YoY in IPM (%) for FY22?"
)

'The Value Growth YoY in IPM (%) for FY22 is 17.7%.'

In [0]:
chain.invoke(
    "Mankind share (%) of key therapeutic segments (on covered market basis) for Cardiac segment for FY18?"
)

'The Mankind share (%) of key therapeutic segments (on covered market basis) for the Cardiac segment for FY18 is 10.1%.'

In [0]:
chain.invoke(
    "Who is the vice president and managing director of mankind?"
)

'The Vice Chairman and Managing Director of Mankind is Rajeev Juneja.'

In [0]:
chain.invoke(
    "What is the EBITDA for FY23?"     ## page 14
)

'The EBITDA for FY23 is INR 1,913 crore.'

In [0]:
chain.invoke(
    "who is the first company to manufacture Dydrogesterone?"
)

'The first company to manufacture Dydrogesterone is Mankind.'

In [0]:
chain.invoke(
    "Who is Arjun Juneja?"
)

'The text does not provide information on who Arjun Juneja is.'

In [0]:
chain.invoke(
    "Who is Chief Operating Officer of mankind?"
)

'The Chief Operating Officer of Mankind is Arjun Juneja.'

In [0]:
chain.invoke(
    " Total Equity?"
)

'The total equity is INR 7,623 crore.'

In [0]:
chain.invoke(
    "What is total Market Share in FY14?"
) 

'The total market share in FY14 was 3.3%.'

In [0]:
chain.invoke(
    "What is total Market share of UNWANTED-72?"
)

'The total market share of UNWANTED-72 is 62%.'

In [0]:
chain.invoke("When was Unwanted-72 launched?")

'Unwanted-72 was launched in 2007.'

In [0]:
chain.invoke("what is HealthOK?")

'HealthOK is a multivitamin brand that was switched to the Consumer Healthcare segment in 2021. It is a multivitamin tablet that improves energy levels, overall health, and immunity. The brand also offers a range of multivitamin gummies for children.'
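To compare the two pipelines on the same questions rather than reading answers one by one, a small scoring harness could be used. The sketch below is one possible shape. `evaluate` and `fake_chain` are hypothetical, with `fake_chain` standing in for `chain.invoke`, and the expected substrings are copied from the answers and tables shown above.

```python
def evaluate(answer_fn, cases):
    """Return (question, passed) per case; a case passes when the
    expected substring occurs in the answer."""
    return [(q, expected in answer_fn(q)) for q, expected in cases]


# Expected values copied from the answers and tables shown above.
cases = [
    ("What is the Value Growth YoY in IPM (%) for FY22?", "17.7"),
    ("What is the EBITDA for FY23?", "1,913"),
]


def fake_chain(question):
    """Hypothetical stand-in for chain.invoke; answers only one question."""
    if "EBITDA" in question:
        return "The EBITDA for FY23 is INR 1,913 crore."
    return "The context does not say."


results = evaluate(fake_chain, cases)
accuracy = sum(passed for _, passed in results) / len(results)
print(results, accuracy)
```

Passing `chain.invoke` as `answer_fn` for each model would give one accuracy figure per pipeline, though substring matching is a crude measure and would miss correct answers phrased differently.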

## Multimodal RAG

In [0]:
%pip install --upgrade langchain

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting langchain
  Downloading langchain-0.2.10-py3-none-any.whl (990 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 990.0/990.0 kB 11.4 MB/s eta 0:00:00
Installing collected packages: langchain
  Attempting uninstall: langchain
    Found existing installation: langchain 0.2.6
    Uninstalling langchain-0.2.6:
      Successfully uninstalled langchain-0.2.6
Successfully installed langchain-0.2.10


In [0]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Multi-modal LLM
model = AzureChatOpenAI(temperature=0,azure_deployment="gpt-4o", api_version="2024-02-01")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [0]:
chain.invoke(
    "What is total Market share of UNWANTED-72?"
)

'The total market share of UNWANTED-72 is 62%.'

In [0]:
chain.invoke(
    "What is total Market share of PregaNews?"
)

'The total market share of Prega News is 82%.'

In [0]:
chain.invoke(
    "How many products launched in FY23?"
)

'The provided context does not specify the exact number of products launched in FY23.'

In [0]:
chain.invoke(
    "What is the strength of Doctor associated with Mankind?"
)

'The strength of doctors associated with Mankind is more than 4 lakh (400,000) doctors.'

In [0]:
chain.invoke("strength of Field Force?")

'The strength of the Field Force is over 15,000, including field managers.'

In [0]:
chain.invoke("mankind's pillars of growth?")

"Mankind's pillars of growth include:\n\n1. **Affordability**: Providing cost-effective healthcare solutions.\n2. **Quality**: Ensuring high-quality healthcare products and services.\n3. **Accessibility**: Making products accessible to customers across different parts of the country.\n4. **Innovation**: Investing in research and development to create innovative healthcare products.\n5. **Sustainability**: Incorporating sustainability principles in business operations to minimize environmental footprint and empower local communities.\n6. **Customer-Centricity**: Maintaining a strong focus on customer needs and satisfaction.\n7. **Strong Manufacturing Capabilities**: Operating 25 state-of-the-art manufacturing facilities with a wide range of dosage forms.\n8. **Extensive Distribution Network**: Having one of the largest distribution networks in India to ensure product availability.\n9. **Corporate Governance**: Implementing robust corporate governance practices.\n10. **Market Leadership*