
In [1]:
%load_ext autoreload
%autoreload 2

In [10]:
import sys
import os
import openai
from dotenv import load_dotenv, find_dotenv

In [None]:
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    BaseExtractor,
)
from llama_index.core.node_parser.text.sentence import SentenceSplitter

# base_nodes+objects
qa_extractor = QuestionsAnsweredExtractor(questions=3)

# assume documents are defined -> extract nodes
from llama_index.core.ingestion import IngestionPipeline

extractors = [
    # TitleExtractor(nodes=5),
    # QuestionsAnsweredExtractor(questions=3),
    SummaryExtractor(summaries=["prev", "self"]),
    # KeywordExtractor(keywords=10),
    # CustomExtractor()
]

NODE_PARSER_CHUNK_SIZE = 512
NODE_PARSER_CHUNK_OVERLAP = 10


sentence_splitter = SentenceSplitter.from_defaults(
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP
)

transformations = [sentence_splitter] + extractors

pipeline = IngestionPipeline(
    transformations=transformations
)


In [11]:
# Get the current working directory
llamaindex_dir = os.getcwd()
# Get the parent directory
llamaindex_dir = os.path.dirname(llamaindex_dir)

sys.path.append(llamaindex_dir + "/utils")
# sys.path

_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.getenv('OPENAI_API_KEY')

In [12]:
from llamaindex_utils import *

In [13]:
import logging
import sys

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Extracting Metadata for Better Document Indexing and Understanding

In many cases, especially with long documents, a chunk of text may lack the context needed to disambiguate it from other, similar chunks. One way to address this is to manually label each chunk in our dataset or knowledge base. However, this is labour-intensive and time-consuming for a large or continually updated set of documents.

To combat this, we use LLMs to extract certain contextual information relevant to the document to better help the retrieval and language models disambiguate similar-looking passages.

We do this through our brand-new `Metadata Extractor` modules.
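To make the motivation concrete, here is a toy sketch that does not use LlamaIndex at all. A bag-of-words cosine similarity stands in for a real embedding model, and the chunk texts and titles are made up for illustration. It shows two near-identical chunks that a query cannot tell apart until a title is prepended to each.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words vector; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Two hypothetical chunks that look alike once cut out of their documents.
chunks = {
    "uber": "Research and development expenses increased compared to the prior year.",
    "lyft": "Research and development expenses decreased compared to the prior year.",
}
# Hypothetical titles of the kind a title extractor might produce.
titles = {
    "uber": "Uber Technologies, Inc. 2019 Annual Report",
    "lyft": "Lyft, Inc. Form 10-K 2020",
}

query = bow("Uber research and development expenses")

plain = {k: cosine(query, bow(v)) for k, v in chunks.items()}
enriched = {
    k: cosine(query, bow(f"document_title: {titles[k]}\n{v}"))
    for k, v in chunks.items()
}
print("without metadata:", plain)
print("with metadata:   ", enriched)
```

Without metadata the two chunks tie; with the title prepended, the Uber chunk scores higher for the Uber query. A real embedding model behaves less mechanically, but the intuition is the same.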

In [15]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings.openai import (
    OpenAIEmbedding,
    OpenAIEmbeddingMode,
    OpenAIEmbeddingModelType,
)
from llama_index.llms.openai import OpenAI
from llama_index.schema import MetadataMode

llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
embedding_model = OpenAIEmbedding(
    mode=OpenAIEmbeddingMode.SIMILARITY_MODE,
    model_type=OpenAIEmbeddingModelType.TEXT_EMBED_ADA_002,
    api_key=openai.api_key,
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embedding_model, chunk_size=512
)
set_global_service_context(service_context)

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

In [16]:
# %pip install llama-index-llms-openai
# %pip install llama-index-extractors-entity

In [17]:
# !pip install llama-index

In [18]:
import nest_asyncio

nest_asyncio.apply()

# import os
# import openai

# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"

We create a text splitter and a set of extractors that add the document title and the hypothetical questions each chunk can answer to the chunk's metadata.

We also show how to instantiate the `SummaryExtractor` and `KeywordExtractor`, as well as how to create your own custom extractor
based on the `BaseExtractor` base class.

In [19]:
from llama_index.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    BaseExtractor,
)
# from llama_index.extractors.entity import EntityExtractor
from llama_index.node_parser import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)


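class CustomExtractor(BaseExtractor):
    # BaseExtractor declares the async `aextract` as its abstract method;
    # the synchronous `extract` wrapper is inherited from the base class.
    async def aextract(self, nodes):
        # Relies on `document_title` and `excerpt_keywords`, so TitleExtractor
        # and KeywordExtractor must run earlier in the pipeline.
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list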


extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    # EntityExtractor(prediction_threshold=0.5),
    # SummaryExtractor(summaries=["prev", "self"], llm=llm),
    # KeywordExtractor(keywords=10, llm=llm),
    # CustomExtractor()
]

transformations = [text_splitter] + extractors
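The contract that the extractors above follow can be sketched in plain Python: each extractor returns one dict per node, and those dicts are merged into the node's metadata in order. Everything below (`ToyNode`, the two stand-in extractors, `run_extractors`) is hypothetical and only imitates the behaviour; no LLM is called.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def title_extractor(nodes):
    # Stand-in for an LLM call: use the first line of each chunk as its title.
    return [{"document_title": n.text.splitlines()[0].strip()} for n in nodes]


def keyword_extractor(nodes):
    # Stand-in for an LLM call: pick the three longest distinct words.
    out = []
    for n in nodes:
        words = set(re.findall(r"[a-z]+", n.text.lower()))
        top = sorted(words, key=lambda w: (-len(w), w))[:3]
        out.append({"excerpt_keywords": ", ".join(top)})
    return out


def custom_extractor(nodes):
    # Reads keys written by the extractors above, so it must run after them.
    return [
        {"custom": n.metadata["document_title"] + "\n" + n.metadata["excerpt_keywords"]}
        for n in nodes
    ]


def run_extractors(nodes, extractors):
    """Apply extractors in order, merging each result into the node metadata."""
    for extract in extractors:
        for node, new_fields in zip(nodes, extract(nodes)):
            node.metadata.update(new_fields)
    return nodes


nodes = [ToyNode("Uber 2019 Annual Report\nGross bookings grew across segments.")]
run_extractors(nodes, [title_extractor, keyword_extractor, custom_extractor])
print(nodes[0].metadata)
```

The ordering matters: placing `custom_extractor` first would raise a `KeyError`, which is why a custom extractor of this kind belongs at the end of the extractor list.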

In [20]:
from llama_index import SimpleDirectoryReader

We first load the 10-K annual SEC reports for Uber and Lyft, for the years 2019 and 2020 respectively.

In [21]:
# !mkdir -p ../data
# !wget -O "../data/10k-132.pdf" "https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1"
# !wget -O "../data/10k-vFinal.pdf" "https://www.dropbox.com/scl/fi/qn7g3vrk5mqb18ko4e5in/lyft.pdf?rlkey=j6jxtjwo8zbstdo4wz3ns8zoj&dl=1"

In [22]:
# Note the uninformative document file name, which may be a common scenario in a production setting
uber_docs = SimpleDirectoryReader(input_files=["../data/10k-132.pdf"]).load_data()
uber_front_pages = uber_docs[0:3]
uber_content = uber_docs[63:69]
uber_docs = uber_front_pages + uber_content

In [23]:
# print("uber_front_pages: ", uber_front_pages)
print([x.doc_id for x in uber_docs])

['ce4a15a6-7529-4760-b966-1228c5bb9192', '1ee4abce-7012-46c1-8a13-b716e822028a', 'e8659e3b-226d-4b36-9522-07dce4cbca9c', 'b2d37bf5-bb86-4c16-9a43-693047a0495f', '976852a6-62a3-4487-b8eb-dca416229c8e', 'f2dbaa89-c9b8-4369-b12a-ca91a7d01660', 'e4e9da35-a27a-4896-8723-c9e4f999b5d7', '64867df6-7aa4-4a4e-adbf-b3edb1c3692b', 'ffa17641-73e8-4d0a-bec3-7be80327349c']


In [24]:
from llama_index.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=transformations)

uber_nodes = pipeline.run(documents=uber_docs)

Extracting titles:   0%|          | 0/5 [00:00<?, ?it/s]

Extracting questions:   0%|          | 0/20 [00:00<?, ?it/s]

In [25]:
for i in range(5):
    print(uber_nodes[i].text)
    print()
    print(uber_nodes[i].metadata["questions_this_excerpt_can_answer"])
    print("------------------------\n")

2019
Annual  
Report

1. What is the title of Uber Technologies, Inc.'s 2019 Annual Report?
2. When was the document "10k-132.pdf" last modified?
3. What is the file size of the document "10k-132.pdf"?
------------------------

69  
Countries
10K+  
Cities
$65B  
Gross Bookings
111M  
MAPCs
7B  
TripsA global tech 
platform at 
massive scale
Serving multiple multi-trillion 
dollar markets with products 
leveraging our core technology 
and infrastructure
We believe deeply in our bold mission. Every minute 
of every day, consumers and Drivers on our platform 
can tap a button and get a ride or tap a button and 
get work. We revolutionized personal mobility with 
ridesharing, and we are leveraging our platform to 
redefine the massive meal delivery and logistics 
industries. The foundation of our platform is our 
massive network, leading technology, operational 
excellence, and product expertise. Together, these 
elements power movement from point A to point B.

1. How many countries does

In [26]:
# Note the uninformative document file name, which may be a common scenario in a production setting
lyft_docs = SimpleDirectoryReader(
    input_files=["../data/10k-vFinal.pdf"]
).load_data()
lyft_front_pages = lyft_docs[0:3]
lyft_content = lyft_docs[68:73]
lyft_docs = lyft_front_pages + lyft_content

In [27]:
from llama_index.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=transformations)

lyft_nodes = pipeline.run(documents=lyft_docs)

Extracting titles:   0%|          | 0/5 [00:00<?, ?it/s]

Extracting questions:   0%|          | 0/20 [00:00<?, ?it/s]

In [28]:
lyft_nodes[2].metadata

{'page_label': '2',
 'file_name': '10k-vFinal.pdf',
 'file_path': '../data/10k-vFinal.pdf',
 'file_type': 'application/pdf',
 'file_size': 3416577,
 'creation_date': '2024-05-01',
 'last_modified_date': '2024-05-01',
 'last_accessed_date': '2024-05-01',
 'document_title': 'Lyft, Inc. Form 10-K Annual Report for Fiscal Year Ended December 31, 2020: Securities Exchange Act Filing Status, Internal Control Reporting, and Market Value Analysis, Incorporation of Proxy Statement for 2021 Annual Meeting of Stockholders',
 'questions_this_excerpt_can_answer': '1. Has Lyft, Inc. submitted all required reports to the Securities Exchange Act of 1934 in the past 12 months?\n2. What is the filing status of Lyft, Inc. as a registrant under the Securities Exchange Act of 1934?\n3. Has Lyft, Inc. filed a report on the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act?'}

Since we are asking fairly sophisticated questions, we utilize a subquestion query engine for all QnA pipelines below, and prompt it to pay more attention to the relevance of the retrieved sources. 

In [29]:
from llama_index.question_gen.llm_generators import LLMQuestionGenerator
from llama_index.question_gen.prompts import (
    DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)


# To generate the subquestions
question_gen = LLMQuestionGenerator.from_defaults(
    service_context=service_context,
    prompt_template_str="""
        Follow the example, but instead of giving a question, always prefix the question 
        with: 'By first identifying and quoting the most relevant sources, '. 
        """
    + DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)
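To illustrate what the question generator is asked to produce, here is a templated sketch of the decomposition. In the real engine an LLM writes the sub-questions; the `decompose` helper below is hypothetical and simply expands a comparison into one prefixed sub-question per company and metric.

```python
from itertools import product

PREFIX = "By first identifying and quoting the most relevant sources, "


def decompose(companies, metrics, year):
    """Expand a comparison question into one sub-question per (company, metric)."""
    return [
        f"{PREFIX}What was the cost due to {metric} for {company} "
        f"in {year} in millions of USD?"
        for company, metric in product(companies, metrics)
    ]


sub_questions = decompose(
    ["Uber", "Lyft"], ["research and development", "sales and marketing"], 2019
)
for q in sub_questions:
    print(q)
```

Each sub-question is then answered independently against the index, and the partial answers are combined into the final response.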

In [31]:
from llama_index import VectorStoreIndex
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

## Querying an Index With No Extra Metadata

In [44]:
from copy import deepcopy

nodes_no_metadata = deepcopy(uber_nodes) + deepcopy(lyft_nodes)
for node in nodes_no_metadata:
    node.metadata = {
        k: node.metadata[k]
        for k in node.metadata
        if k in ["page_label", "file_name"]
    }
print(
    "LLM sees:\n",
    (nodes_no_metadata)[9].get_content(metadata_mode=MetadataMode.LLM),
)

LLM sees:
 [Excerpt from document]
page_label: 66
file_name: 10k-132.pdf
Excerpt:
-----
62 2019 Compared to 2018 
Adjusted EBITDA loss increased $878 million, or 48%, primar ily attributable to continued investments within our non-
Rides offerings and an increase in corpor ate overhead as we grow the business. Th ese investments drove an increase in our 
Adjusted EBITDA loss margin as a percentage of  Adjusted Net Revenue of (3)% to (21)%. 
Components of Results of Operations 
The following discussion on trends in our components of results of operations excludes IPO related impacts as well 
as the Driver appreciation award of $299 million, both of which occurred during the second quarter of 2019. The Driver 
appreciation award was accounted for as a Driver incentive.  For additional information about our IPO, see Note 1 - 
Description of Business and Summary of Significant Accoun ting Policies to our consolidated financial statements 
included in Part II, Item 8, “Financial  Statements

In [45]:
index_no_metadata = VectorStoreIndex(
    nodes=nodes_no_metadata,
)
engine_no_metadata = index_no_metadata.as_query_engine(
    similarity_top_k=10, llm=OpenAI(model="gpt-3.5-turbo-0613")
)

In [None]:
# final_engine_no_metadata = SubQuestionQueryEngine.from_defaults(
#     query_engine_tools=[
#         QueryEngineTool(
#             query_engine=engine_no_metadata,
#             metadata=ToolMetadata(
#                 name="sec_filing_documents",
#                 description="financial information on companies",
#             ),
#         )
#     ],
#     # A module for generating sub questions given a complex question and tools.
#     question_gen=question_gen,
#     use_async=True,
# )

In [None]:
# response_no_metadata = final_engine_no_metadata.query(
#     """
#     What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
#     Give your answer as a JSON.
#     """
# )
# print(response_no_metadata.response)
# # Correct answer:
# # {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
# #  "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}

**RESULT**: As we can see, the QnA agent does not seem to know where to look for the right documents. As a result, it gets the Lyft and Uber data completely mixed up.

## Querying an Index With Extracted Metadata

In [71]:
print(
    "LLM sees:\n",
    (uber_nodes + lyft_nodes)[9].get_content(metadata_mode=MetadataMode.LLM),
)

LLM sees:
 [Excerpt from document]
page_label: 66
file_name: 10k-132.pdf
file_path: ../data/10k-132.pdf
file_type: application/pdf
file_size: 2829436
creation_date: 2024-05-01
last_modified_date: 2024-05-01
last_accessed_date: 2024-05-01
document_title: Revolutionizing Mobility and Logistics: Uber Technologies, Inc. 2019 Annual Report
Excerpt:
-----
62 2019 Compared to 2018 
Adjusted EBITDA loss increased $878 million, or 48%, primar ily attributable to continued investments within our non-
Rides offerings and an increase in corpor ate overhead as we grow the business. Th ese investments drove an increase in our 
Adjusted EBITDA loss margin as a percentage of  Adjusted Net Revenue of (3)% to (21)%. 
Components of Results of Operations 
The following discussion on trends in our components of results of operations excludes IPO related impacts as well 
as the Driver appreciation award of $299 million, both of which occurred during the second quarter of 2019. The Driver 
appreciation award

In [32]:
index = VectorStoreIndex(
    nodes=uber_nodes + lyft_nodes,
)

In [52]:
print(uber_nodes[0].get_content(metadata_mode=MetadataMode.EMBED))

[Excerpt from document]
page_label: 1
file_path: ../data/10k-132.pdf
document_title: Revolutionizing Mobility and Logistics: Uber Technologies, Inc. 2019 Annual Report
questions_this_excerpt_can_answer: 1. What is the title of Uber Technologies, Inc.'s 2019 Annual Report?
2. When was the document "10k-132.pdf" last modified?
3. What is the file size of the document "10k-132.pdf"?
Excerpt:
-----
2019
Annual  
Report
-----


In [53]:
# IF NO METADATA, THEN IT ONLY EMBEDS THIS:
print(nodes_no_metadata[0].get_content(metadata_mode=MetadataMode.EMBED))

[Excerpt from document]
page_label: 1
Excerpt:
-----
2019
Annual  
Report
-----


# Improve the retriever

With metadata in the embedded text, the embedding retriever may perform better. However, if we include too much metadata, we may add noise instead.
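One way to limit that noise is to choose which metadata keys are included in the text that gets embedded. LlamaIndex nodes carry an `excluded_embed_metadata_keys` list for this purpose; the function below is only a plain-Python imitation of the effect, using the header layout seen in the outputs above. The `render_for_embedding` name and its parameters are assumptions for illustration.

```python
def render_for_embedding(text, metadata, excluded_keys=()):
    """Build the string to embed: the visible metadata, then the excerpt."""
    header = "\n".join(
        f"{key}: {value}"
        for key, value in metadata.items()
        if key not in excluded_keys
    )
    return f"[Excerpt from document]\n{header}\nExcerpt:\n-----\n{text}\n-----"


metadata = {
    "page_label": "1",
    "file_name": "10k-132.pdf",
    "file_size": 2829436,
    "document_title": (
        "Revolutionizing Mobility and Logistics: "
        "Uber Technologies, Inc. 2019 Annual Report"
    ),
}

# Drop keys that say nothing about the content of the chunk.
embed_text = render_for_embedding(
    "2019\nAnnual\nReport", metadata, excluded_keys={"file_name", "file_size"}
)
print(embed_text)
```

Keys such as the file size carry no semantic signal, so excluding them keeps the embedded text focused on the title and the excerpt itself.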

TODO: test [Retriever (Metadata References)](https://docs.llamaindex.ai/en/stable/examples/retrievers/recursive_retriever_nodes/)

In [37]:
from llama_index.response.notebook_utils import display_source_node

question = "What was the cost due to research and development for uber in 2019 in millions of USD?"

In [46]:
base_retriever = index.as_retriever(similarity_top_k=2)
base_retriever_no_metadata = index_no_metadata.as_retriever(similarity_top_k=2)

In [47]:
retrievals = base_retriever.retrieve(
    question
)

retrievals_no_metadata = base_retriever_no_metadata.retrieve(
    question
)

In [49]:
for n in retrievals_no_metadata:
    display_source_node(n, source_length=1500)
    # print(n.metadata["questions_this_excerpt_can_answer"])
    print("\n-----------------------------------------\n")
# Node ID: afc3e421-72a6-461d-8c56-158d83a5756a
# Node ID: 9f0cdd17-fa62-4046-95e3-d3c482bbd150

**Node ID:** 27e09a78-e745-476c-ad44-d3e77bdc9670<br>**Similarity:** 0.8559352843492449<br>**Text:** thousands, except for percentages)
Research and development $ 909,126 $ 1,505,640 $ 300,836  (40) %  400 %Year Ended December 31,2019 to 2020 
% Change2018 to 2019 
% Change
Research and development expenses decreased  $596.5 million , or 40%, in the year ended December 31, 2020  as compared to 
the prior year. The decrease was primarily due to a $609.6 million  reduction in stock-based compensation expense primarily 
attributable to (i) the use of the accelerated attribution method to recognize expenses for RSUs granted prior to the effectiveness of our 
IPO Registration Statement which resulted in higher stock-based compensation expense for the year ended December 31, 2019 , and 
(ii) the stock-based compensation benefit related to the restructuring in the second quarter of 2020. The decrease was partially offset 
by an increase of $47.0 million  in autonomous vehicles research and development costs primarily due to the absence of 
70<br>


-----------------------------------------



**Node ID:** 9f0cdd17-fa62-4046-95e3-d3c482bbd150<br>**Similarity:** 0.8539766101325302<br>**Text:** the periods presented (in 
millions): 
  Year Ended December 31,   
  2017   2018   2019   
Revenue  .................................................................................................. $ 7, 932 $ 11,270 $ 14,147 
Costs and expenses:        
Cost of revenue, exclusive of depreciation and amortization  
shown separatel y below.......................................................................  4,1 60  5,623  7,208 
Operations and suppor t ...........................................................................  1,354  1,516  2,302 
Sales and marketin g ................................................................................  2,524  3,151  4,626 
Research and developmen t .....................................................................  1,201  1,505  4,836 
Gene ral and administra tive .....................................................................  2,263  2,082  3,299 
Depreciation and amortizati on ................................................................  510  426  472 
Total costs and expenses  ...................................................................  12 ,012  14,303  22,743 
Loss from operations  .........................................................................  (4, 080)  (3,033)  (8,596) 
Interest expense ......................................................................................  (479)  (648)  (559) 
Other income (expense), ne t ...................................................................<br>


-----------------------------------------



In [50]:
for n in retrievals:
    display_source_node(n, source_length=1500)
    print(n.metadata["questions_this_excerpt_can_answer"])
    print("\n-----------------------------------------\n")
# Node ID: afc3e421-72a6-461d-8c56-158d83a5756a
# Node ID: 9f0cdd17-fa62-4046-95e3-d3c482bbd150

**Node ID:** afc3e421-72a6-461d-8c56-158d83a5756a<br>**Similarity:** 0.8631747362537648<br>**Text:** 64 • Gain on business divestitures, which consists  of gain on sale of divested operations. 
• Gain (loss) on debt and equity secu rities, net, which consis ts primarily of gains from fair value adjustments 
relating to our investments such as our investment in Didi. 
• Foreign currency exchange gains (losses), net, which consist primarily of remeasur ement of transactions and 
monetary assets and liabilities denominated in currencies other than the functional currency at the end of the period. 
• Change in fair value of embedded derivatives, which consists primarily of gains and losses on embedded 
derivatives related to our Convertible Notes until their extinguishment in connection with our IPO. 
• Gain on extinguishment of convertible notes and settlement of derivatives. • Other, which consists primarily of ch anges in the fair value of warrants an d income from forfeitures of warrants. 
Provision for (Benefit from) Income Taxes 
We are subject to income taxes in the United States and fo reign jurisdictions in which we do business. These foreign 
jurisdictions have different statutory tax rates than those in the United States. Additionally, certain of our foreign earnings  
may also be taxable in the United States. Accordingly, our  effective tax rate will vary depending on the relative 
proportion of foreign to domestic income, use of foreign tax credits, changes in the valuatio n of our deferred tax assets, 
and liabilities and changes in tax laws. 
Equity Method Inve...<br>

1. What were the revenue figures for Uber Technologies, Inc. for the years 2017, 2018, and 2019?
2. How did the costs and expenses, including cost of revenue and operations and support, change for Uber Technologies, Inc. from 2017 to 2019?
3. Can you provide details on the gain on business divestitures, gain (loss) on debt and equity securities, foreign currency exchange gains (losses), and other financial aspects mentioned in the excerpt from Uber Technologies, Inc.'s 2019 Annual Report?

-----------------------------------------



**Node ID:** 9f0cdd17-fa62-4046-95e3-d3c482bbd150<br>**Similarity:** 0.8625019056243356<br>**Text:** the periods presented (in 
millions): 
  Year Ended December 31,   
  2017   2018   2019   
Revenue  .................................................................................................. $ 7, 932 $ 11,270 $ 14,147 
Costs and expenses:        
Cost of revenue, exclusive of depreciation and amortization  
shown separatel y below.......................................................................  4,1 60  5,623  7,208 
Operations and suppor t ...........................................................................  1,354  1,516  2,302 
Sales and marketin g ................................................................................  2,524  3,151  4,626 
Research and developmen t .....................................................................  1,201  1,505  4,836 
Gene ral and administra tive .....................................................................  2,263  2,082  3,299 
Depreciation and amortizati on ................................................................  510  426  472 
Total costs and expenses  ...................................................................  12 ,012  14,303  22,743 
Loss from operations  .........................................................................  (4, 080)  (3,033)  (8,596) 
Interest expense ......................................................................................  (479)  (648)  (559) 
Other income (expense), ne t ...................................................................<br>

1. What were Uber Technologies, Inc.'s total revenue and total costs and expenses for the years 2017, 2018, and 2019?
2. What was Uber Technologies, Inc.'s net income (loss) attributable to non-controlling interests for the years 2017, 2018, and 2019?
3. How did Uber Technologies, Inc.'s income (loss) before income taxes and loss from equity method investment change from 2017 to 2019?

-----------------------------------------



# SubQuestionQueryEngine

In [None]:
engine = index.as_query_engine(similarity_top_k=10, llm=OpenAI(model="gpt-3.5-turbo-0613"))

In [73]:
final_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(
                name="sec_filing_documents",
                description="financial information on companies.",
            ),
        )
    ],
    question_gen=question_gen,
    use_async=True,
)

In [74]:
response = final_engine.query(
    """
    What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
    Give your answer as a JSON.
    """
)
print(response.response)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
#  "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}

Generated 4 sub questions.
[sec_filing_documents] Q: By first identifying and quoting the most relevant sources, What was the cost due to research and development for Uber in 2019 in millions of USD?
[sec_filing_documents] Q: By first identifying and quoting the most relevant sources, What was the cost due to sales and marketing for Uber in 2019 in millions of USD?
[sec_filing_documents] Q: By first identifying and quoting the most relevant sources, What was the cost due to research and development for Lyft in 2019 in millions of USD?
[sec_filing_documents] Q: By first identifying and quoting the most relevant sources, What was the cost due to sales and marketing for Lyft in 2019 in millions of USD?
[sec_filing_documents] A: Research and development $ 909,126 $ 1,505,640 $ 300,836
[sec_filing_documents] A: Sales and marketing $ 814.1

**RESULT**: As we can see, the LLM answers the questions correctly.
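Rather than judging the answer by eye, we could score the returned JSON against the reference figures quoted in the cell above. The sketch below is a hypothetical helper, not part of LlamaIndex; the 1% relative tolerance is an assumption chosen so that rounding differences such as 814 versus 814.1 still count as correct.

```python
import json

EXPECTED = {
    "Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
    "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814},
}


def score_answer(answer_json, expected=EXPECTED, rel_tol=0.01):
    """Return the fraction of expected figures matched within a relative tolerance."""
    try:
        answer = json.loads(answer_json)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(answer, dict):
        return 0.0
    hits, total = 0, 0
    for company, metrics in expected.items():
        company_values = answer.get(company)
        for metric, target in metrics.items():
            total += 1
            value = (
                company_values.get(metric)
                if isinstance(company_values, dict)
                else None
            )
            if isinstance(value, (int, float)) and abs(value - target) <= rel_tol * abs(target):
                hits += 1
    return hits / total


# A hypothetical answer with the two companies swapped scores zero.
swapped = json.dumps({"Uber": EXPECTED["Lyft"], "Lyft": EXPECTED["Uber"]})
print(score_answer(json.dumps(EXPECTED)), score_answer(swapped))
```

With such a helper, `score_answer(response.response)` would give a number that can be compared between the pipeline with extracted metadata and the one without.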

### Challenges Identified in the Problem Domain

In this example, we observed that the search quality provided by vector embeddings was rather poor. This was likely because dense financial documents of this kind are not well represented in the embedding model's training data.

In order to improve the search quality, other methods of neural search that employ more keyword-based approaches may help, such as ColBERTv2/PLAID. In particular, this would help in matching on particular keywords to identify high-relevance chunks.
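As a rough illustration of keyword-sensitive matching, here is a minimal Okapi BM25 scorer in plain Python. Note that BM25 is a classic lexical method swapped in for simplicity; it is not ColBERTv2/PLAID, which match at the token-embedding level. The three documents are shortened paraphrases of passages retrieved earlier, and the parameter values `k1=1.5` and `b=0.75` are common defaults assumed here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = list(dict.fromkeys(tokenize(query)))  # unique, order kept
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Research and development 1,201 1,505 4,836",
    "Gain on business divestitures, which consists of gain on sale of divested operations",
    "Provision for income taxes in the United States and foreign jurisdictions",
]
scores = bm25_scores("research and development cost", docs)
print(scores)
```

The table row that literally contains "research" and "development" ranks first, which is the kind of exact-term signal that a purely dense retriever can miss on this data.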

Other valid steps may include utilizing models that are fine-tuned on financial datasets, such as BloombergGPT.

Finally, we can further enrich the metadata with more information about the context surrounding each chunk.

### Improvements to this Example
Generally, this example can be improved with a more rigorous evaluation of both the metadata extraction accuracy and the accuracy and recall of the QnA pipeline. Further, incorporating a larger set of documents, as well as the full-length documents, would introduce more confounding passages that are difficult to disambiguate; this could further stress-test the system we have built and suggest further improvements.
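A first step toward such an evaluation could be a recall@k measure over the retriever outputs. The sketch below is a hypothetical helper in plain Python. The node IDs are shortened forms of those printed in the retrieval cells above, and treating `9f0cdd17` as the only relevant node (it holds the table with the 4,836 figure) is our own labelling assumption.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant node ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per question."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


relevant = {"9f0cdd17"}
with_metadata = ["afc3e421", "9f0cdd17"]  # ranking from `retrievals`
no_metadata = ["27e09a78", "9f0cdd17"]    # ranking from `retrievals_no_metadata`

for name, ranking in [("with metadata", with_metadata), ("no metadata", no_metadata)]:
    print(name, recall_at_k(ranking, relevant, 1), recall_at_k(ranking, relevant, 2))
```

On this single question both retrievers place the relevant table second, so a meaningful comparison would need many labelled questions averaged with `mean_recall_at_k`.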