# 10-K Analysis Using LlamaIndex

source: https://medium.com/@jerryjliu98/how-unstructured-and-llamaindex-can-help-bring-the-power-of-llms-to-your-own-data-3657d063e30d

In [4]:
# set text wrapping
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [5]:
import os
from pathlib import Path

import openai
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import AzureOpenAI
from llama_index import (
    GPTSimpleVectorIndex,
    LangchainEmbedding,
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
    download_loader,
)


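# Placeholder credentials: replace APIKEY and APIENDPOINT with your Azure OpenAI values
os.environ["OPENAI_API_KEY"]  = "APIKEY"
os.environ["OPENAI_API_TYPE"] = openai.api_type = "azure"
os.environ["OPENAI_API_VERSION"] = openai.api_version = "2022-12-01"
os.environ["OPENAI_API_BASE"] = openai.api_base = "APIENDPOINT"
# Azure deployment and model names passed to the AzureOpenAI LLM below
deployment_name = "davinci3"
model_name = "davinci3"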

### Ingest Unstructured Data Through the Unstructured.io Reader

This step uses Unstructured.io's HTML parsing through the `UnstructuredReader` loader, which is downloaded from LlamaHub.

In [6]:
UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

In [None]:
loader = UnstructuredReader()
doc_set = {}
all_docs = []
years = [2022, 2021, 2020, 2019]
for year in years:
    year_docs = loader.load_data(file=Path(f'./UBER/UBER_{year}.html'), split_documents=False)
    # attach the filing year as metadata to each document
    for d in year_docs:
        d.extra_info = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jacwang\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jacwang\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Set Up the Service Context

In [7]:

embedding_llm = LangchainEmbedding(OpenAIEmbeddings(
    document_model_name="text-embedding-ada-002",
    query_model_name="text-embedding-ada-002",
))

In [8]:

llm_predictor = LLMPredictor(
    llm=AzureOpenAI(deployment_name=deployment_name, model_name=model_name)
)


In [9]:

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embedding_llm,
    chunk_size_limit=512,
)

In [10]:
service_context

ServiceContext(llm_predictor=<llama_index.llm_predictor.base.LLMPredictor object at 0x000001FFEC8A8B20>, prompt_helper=<llama_index.indices.prompt_helper.PromptHelper object at 0x000001FFEC8A8D30>, embed_model=<llama_index.embeddings.langchain.LangchainEmbedding object at 0x000001FFEC5A7B80>, node_parser=<llama_index.node_parser.simple.SimpleNodeParser object at 0x000001FFEC8A88E0>, llama_logger=<llama_index.logger.base.LlamaLogger object at 0x000001FFEC8A8C70>, chunk_size_limit=512)

### Set Up a Vector Index for Each SEC Filing

We set up a separate vector index for each SEC filing from 2019 to 2022.

Optionally, we also build a "global" index that puts all four filings into a single vector store.

In [None]:
# initialize simple vector indices + global vector index
# NOTE: don't run this cell if the indices are already loaded! 
index_set = {}
for year in years:
    cur_index = GPTSimpleVectorIndex.from_documents(doc_set[year], service_context=service_context)
    index_set[year] = cur_index
    cur_index.save_to_disk(f'index_{year}.json')
    

In [None]:
# Load indices from disk
index_set = {}
for year in years:
    cur_index = GPTSimpleVectorIndex.load_from_disk(f'index_{year}.json', service_context=service_context)
    index_set[year] = cur_index
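
The build cell and the load cell above are alternatives, and the note in the build cell asks the reader to remember which one to run. One way to make that choice automatic is to check whether the saved file exists. The following is a minimal sketch, not part of the original notebook: `load_or_build` is a hypothetical helper, and the toy `build_toy`, `save_toy`, and `load_toy` functions stand in for `GPTSimpleVectorIndex.from_documents`, `save_to_disk`, and `load_from_disk`.

```python
import tempfile
from pathlib import Path


def load_or_build(path, build, load, save):
    """Return load(path) if the file exists; otherwise build, save, and return."""
    if path.exists():
        return load(path)
    obj = build()
    save(obj, path)
    return obj


# Toy demonstration: plain text stands in for a serialized index.
calls = []


def build_toy():
    calls.append("build")
    return "toy index"


def save_toy(obj, path):
    path.write_text(obj)


def load_toy(path):
    calls.append("load")
    return path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "index_toy.json"
    first = load_or_build(target, build_toy, load_toy, save_toy)   # file missing: builds and saves
    second = load_or_build(target, build_toy, load_toy, save_toy)  # file present: loads from disk

# Hypothetical usage with the notebook's objects (not executed here):
# index_set[year] = load_or_build(
#     Path(f"index_{year}.json"),
#     build=lambda: GPTSimpleVectorIndex.from_documents(doc_set[year], service_context=service_context),
#     load=lambda p: GPTSimpleVectorIndex.load_from_disk(str(p), service_context=service_context),
#     save=lambda idx, p: idx.save_to_disk(str(p)),
# )
```

With a helper like this, the expensive embedding calls only happen the first time a given year's index is requested.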

In [None]:
# NOTE: this global index is a single vector store containing all documents
# Only relevant for the section below: "Can a single vector index answer questions across years?"
global_index = GPTSimpleVectorIndex.from_documents(all_docs, service_context=service_context)
global_index.save_to_disk('index_global.json')

In [86]:
global_index = GPTSimpleVectorIndex.load_from_disk('index_global.json', service_context=service_context)

### Ask Initial Questions over a Given Year (2020)

Let's first ask some questions over the Uber 10-K for 2020.

In [None]:
response = index_set[2020].query("What were some of the biggest risk factors in 2020?", similarity_top_k=3)

In [88]:
print(response)



The biggest risk factors in 2020 included the unpredictable duration of the spread of the COVID-19 outbreak, the impact of the pandemic on capital and financial markets, and the potential for permanent changes in end-user behaviors. Additionally, there was a risk of weak demand for the Mobility offering for a significant length of time and the potential for adverse impacts from business partners and third-party vendors. Furthermore, there was the risk of extreme volatility in financial markets impacting the company's stock price and ability to access capital markets, as well as the potential for cascading effects of the pandemic that are not currently foreseeable. There was also the risk of additional regulatory challenges or fines that could have a significant impact on the company's financial results, such as the legal or regulatory challenges to the latest guidance from regulatory authorities in connection with the COVID-19 pandemic, or the impact of changes to pricing models or h
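
The `similarity_top_k=3` argument in the query above asks the vector index for the three chunks most similar to the question before the LLM writes an answer. The sketch below illustrates that idea with cosine similarity over invented three-dimensional vectors. It is a conceptual illustration only: the vectors, the chunk labels, and the `cosine` and `top_k` helpers are assumptions made for this example, not LlamaIndex code or real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k):
    """chunks is a list of (text, vector); return the k texts most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Invented toy vectors; real embeddings have many more dimensions.
toy_chunks = [
    ("pandemic risk", [0.9, 0.1, 0.0]),
    ("office leases", [0.0, 0.2, 0.9]),
    ("regulatory risk", [0.8, 0.3, 0.1]),
    ("market volatility", [0.7, 0.0, 0.3]),
]
query = [1.0, 0.1, 0.0]

selected = top_k(query, toy_chunks, k=3)
print(selected)
```

Only the selected chunks are passed to the LLM as context, so a chunk that ranks fourth never influences the answer.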

In [None]:
response = index_set[2020].query("What were some of the significant acquisitions?", similarity_top_k=3)

In [90]:
print(response)



Some of the significant acquisitions mentioned in the context are the divestiture of our ATG business to Aurora, the Uber Elevate business to Joby, a joint venture with SK Telecom Co., LTD., the acquisition of Careem, the purchase of a controlling interest in Cornershop for a total consideration of $362 million, paid in Uber common stock (67 million) with the remainder of the consideration transferred (380 million) net of the CS-Mexico Put/Call (18 million). The purchase included net assets acquired of $362 million, goodwill of $384 million, intangible assets of $122 million, other long-term assets of $11 million, current liabilities of $34 million, deferred tax liability of $33 million, other long-term liabilities of $2 million, and redeemable non-controlling interests of $290 million. The identifiable intangible assets acquired and their estimated useful lives include vendor relationship (20 years), shopper relationship (15 years), customer relationship (14 years), developed techno

### Can a single vector index answer questions across years?

If we put all documents into a single vector store, can it answer questions that span multiple years? Let's test it.

In [91]:
# Option 2
risk_query_str = (
    "Describe the current risk factors. If the year is provided in the information, "
    "provide that as well. If the context contains risk factors for multiple years, "
    "explicitly provide the following:\n"
    "- A description of the risk factors for each year\n"
    "- A summary of how these risk factors are changing across years"
)

In [92]:
# Option 1
#risk_query_str = "What are some of the biggest risk factors in each year?"

In [93]:
response = global_index.query(risk_query_str, similarity_top_k=3)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 2937 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 60 tokens


In [94]:
print(str(response))



The current risk factors for 2019 include modified or new laws and regulations applying to our business, the size of our addressable markets, market share, category positions, and market trends, including our ability to grow our business in the six countries we have identified as near-term priorities, the safety, affordability, and convenience of our platform and our offerings, our ability to identify, recruit, and retain skilled personnel, including key members of senior management, our expected growth in the number of platform users, and our ability to promote our brand and attract and retain platform users, our ability to maintain, protect, and enhance our intellectual property rights, our ability to introduce new products and offerings and enhance existing products and offerings, our ability to successfully enter into new geographies, expand our presence in countries in which we are limited by regulatory restrictions, and manage our international expansion, our ability to success
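
The response above mentions only 2019. One plausible explanation, offered here as an assumption rather than something the notebook verifies, is that the top three chunks retrieved from the merged store all came from the same filing. The toy sketch below shows how that can happen, and contrasts it with taking the best chunk from each year separately. The similarity scores and chunk labels are invented for illustration.

```python
import heapq
from collections import Counter

# (year, chunk label, similarity score): invented numbers for illustration only
scored = [
    (2019, "risk factors list", 0.91),
    (2019, "forward-looking statements", 0.90),
    (2019, "regulatory risks", 0.89),
    (2020, "table of contents", 0.80),
    (2021, "brand and platform users", 0.85),
    (2022, "driver classification", 0.84),
]

# Global retrieval: take the 3 highest scores regardless of year.
global_top3 = heapq.nlargest(3, scored, key=lambda c: c[2])
years_covered = Counter(year for year, _, _ in global_top3)

# Per-year retrieval: keep the single best chunk for each year.
best_per_year = {}
for year, label, score in scored:
    if year not in best_per_year or score > best_per_year[year][1]:
        best_per_year[year] = (label, score)

print(years_covered)          # every retrieved chunk is from one year
print(sorted(best_per_year))  # every year is represented
```

Retrieving per year guarantees coverage of all four filings, which is the motivation for keeping a separate index for each one.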

### Composing a Graph to Synthesize Answers across 10-K Filings (2019-2022)

We want our queries to aggregate and synthesize information across *all* 10-K filings. To do this, we define a list index on top of the four vector indices.

In [95]:
from llama_index import GPTListIndex, LLMPredictor
from langchain import OpenAI
from llama_index.composability import ComposableGraph

In [96]:
# set summary text for each doc
summaries = {}
for year in years:
    summaries[year] = f"UBER 10-k Filing for {year} fiscal year"

In [97]:
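# set number of output tokens
# NOTE: this assigns max_tokens on the LLMPredictor wrapper; the dump in the next
# cell shows the wrapped AzureOpenAI LLM still reporting max_tokens=256
llm_predictor.max_tokens = 1000
#service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)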

In [98]:
llm_predictor.__dict__

{'_llm': AzureOpenAI(cache=None, verbose=False, callback_manager=<langchain.callbacks.shared.SharedCallbackManager object at 0x000001FB046ED5D0>, client=<class 'openai.api_resources.completion.Completion'>, model_name='davinci3', temperature=0.7, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0, n=1, best_of=1, model_kwargs={}, openai_api_key=None, batch_size=20, request_timeout=None, logit_bias={}, max_retries=6, streaming=False, deployment_name='davinci3'),
 'retry_on_throttling': True,
 '_total_tokens_used': 7735,
 'flag': True,
 '_last_token_usage': 2937,
 'max_tokens': 1000}
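
The dump above contains two different `max_tokens` values: 1000 on the `LLMPredictor` itself and 256 on the wrapped `AzureOpenAI` object in `_llm`. Assigning an attribute on a wrapper object does not change the object it wraps, which may be why the responses in this notebook are cut off. The sketch below reproduces that behaviour with two hypothetical stand-in classes, `ToyLLM` and `ToyPredictor`. They are not the real LangChain or LlamaIndex classes.

```python
class ToyLLM:
    """Hypothetical stand-in for the wrapped LLM object."""

    def __init__(self, max_tokens=256):
        self.max_tokens = max_tokens


class ToyPredictor:
    """Hypothetical stand-in for a wrapper that stores an LLM in `_llm`."""

    def __init__(self, llm):
        self._llm = llm


predictor = ToyPredictor(ToyLLM())

predictor.max_tokens = 1000               # creates a new attribute on the wrapper only
wrapper_value = predictor.max_tokens
inner_value = predictor._llm.max_tokens   # the wrapped object is unchanged

predictor._llm.max_tokens = 1000          # changes the wrapped object itself
inner_after = predictor._llm.max_tokens

print(wrapper_value, inner_value, inner_after)
```

Since the dump lists `max_tokens` as a field of the `AzureOpenAI` object, setting it there (for example when the LLM is constructed) is the likely way to raise the output limit, although that is not tested here.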

In [99]:
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [index_set[y] for y in years],
    [summaries[y] for y in years],
    service_context=service_context
)

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens


In [43]:
graph.save_to_disk('10k_graph.json')

In [71]:
graph = ComposableGraph.load_from_disk('10k_graph.json', service_context=service_context)

### Setting Up the Query

We query the graph about risk factors, asking it to synthesize information across all four years.

In [100]:
risk_query_str = (
    "Describe the current risk factors. If the year is provided in the information, "
    "provide that as well. If the context contains risk factors for multiple years, "
    "explicitly provide the following:\n"
    "- A description of the risk factors for each year\n"
    "- A summary of how these risk factors are changing across years"
)

In [101]:
query_configs = [
    {
        "index_struct_type": "dict",
        "query_mode": "default",
        "query_kwargs": {
            "similarity_top_k": 1,
            # "include_summary": True
        }
    },
    {
        "index_struct_type": "list",
        "query_mode": "default",
        "query_kwargs": {
            "response_mode": "tree_summarize",
        }
    },
]
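
The two entries in `query_configs` appear to correspond to the two levels of the graph: the entry with `similarity_top_k` configures the per-year vector indices, and the entry with `response_mode: "tree_summarize"` configures the list index that combines their answers. The sketch below shows that two-level flow with plain Python stand-ins. The function names are hypothetical, and the string joining replaces what would be LLM calls in practice, so this illustrates the control flow rather than LlamaIndex internals.

```python
def answer_from_year(year, question, chunks_by_year):
    """Stand-in for querying one per-year vector index with similarity_top_k=1."""
    best_chunk = chunks_by_year[year][0]  # pretend the first chunk is the top match
    return f"{year}: {best_chunk}"


def combine(question, partial_answers):
    """Stand-in for the list index's summarization step (an LLM call in practice)."""
    return " | ".join(partial_answers)


def query_graph(question, chunks_by_year):
    """Ask every per-year index first, then combine the per-year answers."""
    partials = [
        answer_from_year(year, question, chunks_by_year)
        for year in sorted(chunks_by_year)
    ]
    return combine(question, partials)


# Invented one-word chunks standing in for retrieved filing text.
toy_chunks = {
    2019: ["regulation"],
    2020: ["pandemic"],
    2021: ["brand"],
    2022: ["driver classification"],
}

result = query_graph("Describe the risk factors.", toy_chunks)
print(result)
```

Because each year contributes its own partial answer before the combining step, no single filing can crowd the others out of the context.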

In [102]:
response_summary = graph.query(risk_query_str, query_configs=query_configs)

In [103]:
print(response_summary)



The current risk factors for 2020, as listed in Item 1A of the Annual Report on Form 10-K, include competition from other companies, changes in laws and regulations, changes in economic or business conditions, fluctuations in currency exchange rates, the potential for impairment of goodwill and long-lived assets, and the potential for cyber-attacks. 

For the year 2021, the risk factors include modifications or new laws and regulations applying to the business, and the ability to implement, maintain, and improve internal control over financial reporting. Additionally, the risk factors for 2021 include our ability to promote our brand and attract and retain platform users; our ability to maintain, protect, and enhance our intellectual property rights; our ability to introduce new products and offerings and enhance existing products and offerings; our ability to successfully enter into new geographies, expand our presence in countries in which we are limited by regulatory restrictions,

In [50]:
print(response_summary.get_formatted_sources())

> Source (Doc id: None): 
The year provided in the context information is 2022. The current risk factors for 2022 include ...

> Source (Doc id: None): 
The year provided in the context is 2021. The current risk factors include the ability to promot...

> Source (Doc id: None): 
The risk factors for 2020 are disclosed in Item 1A of this Annual Report on Form 10-K, which sta...

> Source (Doc id: None): 
The risk factors for 2019 include modified or new laws and regulations applying to our business,...

> Source (Doc id: 699cb7b9-fae2-4ffc-9230-d0c63b17e477): year: 2022

and certain events we participate in or host with members of the investment community...

> Source (Doc id: 231c12bd-e38a-42ac-87df-3e3c704df94b): year: 2021

ability to promote our brand and attract and retain platform users;

our ability to m...

> Source (Doc id: b6d600a3-c38c-4aad-8f17-634076d131ba): year: 2020

1A.

Risk Factors

11

Item 1B.

Unresolved Staff Comments

46

Item 2.

Properties

...

> Source (Doc i



In [51]:
# query a specific year
response_tmp = index_set[2022].query(risk_query_str)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 974 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 60 tokens


In [None]:
str(response_tmp)

'\nIn 2022, the risk factors for our business include Drivers being classified as employees, workers or quasi-employees instead of independent contractors, the mobility, delivery, and logistics industries being highly competitive, and the need to lower fares or service fees and offer Driver incentives and consumer discounts and promotions in order to remain competitive in certain markets. Since our inception, we have incurred significant losses.'

In [52]:
# query the global index
response = global_index.query(risk_query_str, similarity_top_k=4)

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3886 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 60 tokens


In [53]:
str(response)

'\n\nThe current risk factors for 2019 include modified or new laws and regulations applying to our business, our ability to address those trends and developments with our products and offerings, the size of our addressable markets, market share, category positions, and market trends, including our ability to grow our business in the six countries we have identified as near-term priorities, the safety, affordability, and convenience of our platform and our offerings, our ability to identify, recruit, and retain skilled personnel, including key members of senior management, our expected growth in the number of platform users, and our ability to promote our brand and attract and retain platform users, our ability to maintain, protect, and enhance our intellectual property rights, our ability to introduce new products and offerings and enhance existing products and offerings, our ability to successfully enter into new geographies, expand our presence in countries in which we are limited b