# Financial Document Analysis with LlamaIndex

In this example notebook, we showcase how to perform financial analysis over [**10-K**](https://en.wikipedia.org/wiki/Form_10-K) documents using the [**LlamaIndex**](https://gpt-index.readthedocs.io/en/latest/) framework in just a few lines of code.

## Notebook Outline
* [Introduction](#Introduction)
* [Setup](#Setup)
* [Data Loading & Indexing](#Data-Loading-and-Indexing)
* [Simple QA](#Simple-QA)
* [Advanced QA - Compare and Contrast](#Advanced-QA---Compare-and-Contrast)


## Introduction

### LlamaIndex
[LlamaIndex](https://gpt-index.readthedocs.io/en/latest/) is a data framework for LLM applications.
You can get started with just a few lines of code and build a retrieval-augmented generation (RAG) system in minutes.
For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing your data, modules for retrieval and re-ranking, and composable components for building custom query engines.

See [full documentation](https://gpt-index.readthedocs.io/en/latest/) for more details.

### Financial Analysis over 10-K documents
A key part of a financial analyst's job is to extract information and synthesize insight from long financial documents.
A great example is the 10-K form, an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company's financial performance.
These documents typically run hundreds of pages and contain domain-specific terminology that makes them challenging for a layperson to digest quickly.


We showcase how LlamaIndex can support a financial analyst in quickly extracting information and synthesizing insights **across multiple documents** with very little coding.

## Setup

To begin, we install the `llama-index` library, along with `pypdf` for reading the PDF files.

In [1]:
!pip install llama-index pypdf


Now, we import all modules used in this tutorial.

In [2]:
from langchain import OpenAI

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

Before we start, we can configure the LLM provider and model that will power our RAG system.
Here, we pick *text-davinci-003* from OpenAI and allow unlimited output tokens by setting `max_tokens=-1`.

In [3]:
llm = OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=-1)

We construct a `ServiceContext` and set it as the global default, so all subsequent operations that depend on LLM calls will use the model we configured here.

In [4]:
service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)


## Data Loading and Indexing

Now, we load and parse two PDFs (one for the Uber 10-K in 2021 and another for the Lyft 10-K in 2021).
Under the hood, the PDFs are converted to plain text `Document` objects, separated by page.

> Note: this operation might take a while to run, since each document is more than 100 pages.

In [7]:
lyft_docs = SimpleDirectoryReader(input_files=["f:/openai-cookbook/examples/data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["f:/openai-cookbook/examples/data/10k/uber_2021.pdf"]).load_data()

In [31]:
print(f'Loaded Lyft 10-K with {len(lyft_docs)} pages')
print(f'Loaded Uber 10-K with {len(uber_docs)} pages')

Loaded Lyft 10-K with 238 pages
Loaded Uber 10-K with 307 pages


Now, we can build an (in-memory) `VectorStoreIndex` over the documents that we've loaded.

> Note: this operation might take a while to run, since it calls the OpenAI API to compute vector embeddings over document chunks.

In [8]:
lyft_index = VectorStoreIndex.from_documents(lyft_docs)
uber_index = VectorStoreIndex.from_documents(uber_docs)
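
The "document chunks" mentioned above are shorter pieces of each page that get embedded individually. As a rough illustration only, the sketch below splits a list of words into overlapping windows. The function name `split_into_chunks` and its parameters are hypothetical, and LlamaIndex's own text splitter works differently (it is token- and sentence-aware), so treat this as a picture of the idea rather than the library's implementation.

```python
def split_into_chunks(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Toy sentence standing in for the text of one page.
toy_words = "revenue increased due to higher ride volume and improved pricing".split()
toy_text_chunks = split_into_chunks(toy_words, chunk_size=4, overlap=1)

for chunk in toy_text_chunks:
    print(" ".join(chunk))
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks.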

## Simple QA

Now we are ready to run some queries against our indices!
To do so, we first configure a `QueryEngine`, which just captures a set of configurations for how we want to query the underlying index.

For a `VectorStoreIndex`, the most common configuration to adjust is `similarity_top_k`, which controls how many document chunks (which we call `Node` objects) are retrieved to use as context for answering our question.

In [9]:
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)

In [10]:
uber_engine = uber_index.as_query_engine(similarity_top_k=3)
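
To make the role of `similarity_top_k` more concrete, here is a minimal, library-free sketch of top-k retrieval by cosine similarity. The helper names (`cosine_similarity`, `retrieve_top_k`), the chunk ids, and the 3-dimensional toy vectors are all assumptions made for illustration; the real index compares much higher-dimensional embeddings and may use a different similarity implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


# Toy "embeddings" keyed by a made-up chunk id.
toy_chunks = {
    "page_12": [0.9, 0.1, 0.0],
    "page_45": [0.1, 0.9, 0.1],
    "page_79": [0.8, 0.2, 0.1],
    "page_130": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.1, 0.0]

top_chunks = retrieve_top_k(toy_query, toy_chunks, k=3)
print(top_chunks)  # ['page_12', 'page_79', 'page_45']
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt and a higher chance of including loosely related chunks.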

Let's see some queries in action!

In [15]:
response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in cents with page reference')

In [16]:
print(response)

 3,208,323,000 cents (page 79)
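
Unit conversions in a model's answer are worth checking by hand. The short calculation below assumes the figure reported in the sub-question output later in this notebook (Lyft 2021 revenue of $3,208,323 thousand) and converts it to dollars and to cents.

```python
# Lyft 2021 revenue in thousands of dollars, as reported in the
# sub-question output later in this notebook.
revenue_thousands_usd = 3_208_323

revenue_usd = revenue_thousands_usd * 1_000  # thousands of dollars -> dollars
revenue_cents = revenue_usd * 100            # dollars -> cents

print(f"{revenue_usd:,} dollars")   # 3,208,323,000 dollars
print(f"{revenue_cents:,} cents")   # 320,832,300,000 cents
```

Under that assumption, the number printed above matches the dollar amount rather than the cent amount, which suggests the model skipped the final conversion step.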


In [17]:
response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')

In [18]:
print(response)

 Revenue in 2021 was $17,455 million (page 98).


## Advanced QA - Compare and Contrast

For more complex financial analysis, one often needs to reference multiple documents.

As an example, let's take a look at how to do compare-and-contrast queries over both Lyft and Uber financials.
For this, we build a `SubQuestionQueryEngine`, which breaks a complex compare-and-contrast query down into simpler sub-questions and executes each one on the sub query engine backed by the relevant index.

In [19]:
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
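
The decompose-then-route pattern described above can be pictured with a small, library-free sketch. Everything here is hypothetical: `SubQuestion`, `decompose`, and `run_sub_questions` are illustrative names, the decomposition is a fixed template rather than an LLM call, and the "query engines" are stubs that return canned strings taken from figures shown elsewhere in this notebook. The final step, in which the LLM synthesizes one answer from the collected sub-answers, is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SubQuestion:
    tool_name: str  # which query engine should answer this
    question: str   # the simpler, single-document question


def decompose(companies: Dict[str, str], topic: str) -> List[SubQuestion]:
    """Hypothetical stand-in for the LLM step that writes sub-questions."""
    return [
        SubQuestion(tool_name, f"What is {company}'s {topic}?")
        for tool_name, company in companies.items()
    ]


def run_sub_questions(
    sub_questions: List[SubQuestion],
    tools: Dict[str, Callable[[str], str]],
) -> List[Tuple[SubQuestion, str]]:
    """Route each sub-question to its tool and collect the answers."""
    return [(sq, tools[sq.tool_name](sq.question)) for sq in sub_questions]


# Stub "query engines" returning canned answers, for illustration only.
toy_tools = {
    "lyft_10k": lambda q: "$3,208,323 thousand",
    "uber_10k": lambda q: "$17,455 million",
}

toy_sub_questions = decompose(
    {"lyft_10k": "Lyft", "uber_10k": "Uber"}, topic="revenue in 2021"
)
toy_answers = run_sub_questions(toy_sub_questions, toy_tools)

for sq, answer in toy_answers:
    print(f"[{sq.tool_name}] Q: {sq.question}")
    print(f"[{sq.tool_name}] A: {answer}")
```

The tool names play the same role as the `name` field in `ToolMetadata`: they tell the router which index a sub-question belongs to.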

Let's see these queries in action!

In [20]:
response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')

Generated 4 sub questions.
[lyft_10k] Q: What were the customer segments that grew the fastest for Lyft in 2021?
[uber_10k] Q: What were the customer segments that grew the fastest for Uber in 2021?
[lyft_10k] Q: Which geographies experienced the fastest growth for Lyft in 2021?
[uber_10k] Q: Which geographies experienced the fastest growth for Uber in 2021?
[lyft_10k] A:  In 2021, Lyft saw the fastest growth in riders who used the platform to commute to and from work, explore their cities, spend more time at local businesses, and stay out longer knowing they could get a reliable ride home. Lyft also saw growth in drivers who had access to 24/7 support and earnings tools, education resources, and other support to meet their personal goals.
[uber_10k] A:  In 2021, Uber's Mobility, Delivery, and Freight segments grew the fastest. Mobility refers to products that connect consumers with Mo

In [21]:
print(response)


Lyft saw the fastest growth in riders who used the platform to commute to and from work, explore their cities, spend more time at local businesses, and stay out longer knowing they could get a reliable ride home. Additionally, Lyft saw growth in drivers who had access to 24/7 support and earnings tools, education resources, and other support to meet their personal goals. The geographies that experienced the fastest growth for Lyft in 2021 were those that were able to fully reopen and distribute vaccines more quickly.

Uber's Mobility, Delivery, and Freight segments grew the fastest in 2021. Mobility refers to products that connect consumers with Mobility Drivers who provide rides in a variety of vehicles, such as cars, auto rickshaws, motorbikes, minibuses, or taxis. Delivery allows consumers to search for and discover local restaurants, order a meal, and either pick-up at the restaurant or have the meal delivered and, in certain markets, Delivery also includes offerings for grocery, 

In [22]:
response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

Generated 4 sub questions.
[uber_10k] Q: What is the revenue of Uber in 2020?
[uber_10k] Q: What is the revenue of Uber in 2021?
[lyft_10k] Q: What is the revenue of Lyft in 2020?
[lyft_10k] Q: What is the revenue of Lyft in 2021?
[uber_10k] A:  The revenue of Uber in 2021 was $17,455 million.
[lyft_10k] A:  The revenue of Lyft in 2021 was $3,208,323 thousand.
[uber_10k] A:  The revenue of Uber in 2020 was $11,139 million.
[lyft_10k] A:  The revenue of Lyft in 2020 was $2,364,681 thousand.

In [23]:
print(response)


 Uber's revenue grew by 56.2% from 2020 to 2021, from $11,139 million to $17,455 million. Lyft's revenue grew by 35.6% from 2020 to 2021, from $2,364,681 thousand to $3,208,323 thousand.
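
Because the percentages above are computed by the language model, it is worth recomputing them from the underlying figures. The sketch below uses only the revenue numbers reported in the sub-question answers; `growth_pct` is a helper name introduced here for illustration.

```python
def growth_pct(previous: float, current: float) -> float:
    """Year-over-year growth as a percentage of the previous value."""
    return (current - previous) / previous * 100


uber_growth = growth_pct(11_139, 17_455)        # millions of dollars
lyft_growth = growth_pct(2_364_681, 3_208_323)  # thousands of dollars

print(f"Uber: {uber_growth:.1f}%")  # Uber: 56.7%
print(f"Lyft: {lyft_growth:.1f}%")  # Lyft: 35.7%
```

Recomputed this way, the growth rates come out at roughly 56.7% and 35.7%. These are close to the model's 56.2% and 35.6%, but not identical.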
