In [1]:
!pip install -U unstructured
!pip install llama-index



In [2]:
import os

# Paste your OpenAI API key here before running; avoid committing it with the notebook.
os.environ["OPENAI_API_KEY"] = ""

import nest_asyncio
import nltk
nltk.download('averaged_perceptron_tagger')
nest_asyncio.apply()

from llama_index.readers.file import UnstructuredReader
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from pathlib import Path

years = [2022, 2021, 2020, 2019]
doc_set = {}
all_docs = []


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
loader = UnstructuredReader()
for year in years:
    year_docs = loader.load_data(file=Path(f"./data/UBER_{year}.html"), split_documents=False)
    for d in year_docs:
        d.metadata = {"year": year}
    print(f"{year} has {len(year_docs)} documents")
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

2022 has 1 documents
2021 has 1 documents
2020 has 1 documents
2019 has 1 documents
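
The loading loop above assumes that every `./data/UBER_{year}.html` file is already on disk. A minimal sketch of a pre-flight check, using only the standard library, might look like the following. The helper name `missing_filings` is hypothetical and is not part of LlamaIndex.

```python
from pathlib import Path


def missing_filings(data_dir, years):
    """Return the expected filing paths that are not present on disk.

    Hypothetical helper: it only mirrors the UBER_{year}.html naming
    used by the loading loop in this notebook.
    """
    expected = [Path(data_dir) / f"UBER_{year}.html" for year in years]
    return [path for path in expected if not path.is_file()]


# Report any filing that would make loader.load_data fail.
for path in missing_filings("./data", [2022, 2021, 2020, 2019]):
    print(f"missing: {path}")
```

Running the check before `loader.load_data` surfaces a missing download as a short list instead of a failure partway through the loop.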


In [4]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

In [5]:
Settings.chunk_size = 512
Settings.chunk_overlap = 64
Settings.llm = OpenAI(model="gpt-4")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context
    )

    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")
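
`Settings.chunk_size = 512` and `Settings.chunk_overlap = 64` control how each filing is split before it is embedded. As a rough illustration only, the sketch below applies a plain sliding window to a list of tokens. LlamaIndex's own splitter also takes sentence boundaries into account, so real chunks will differ. The function `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens.

    Simplified stand-in for a text splitter, not the LlamaIndex implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Stand-in "document" of 1000 integer tokens.
chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])
```

With these settings each window advances by 448 tokens, so consecutive chunks share 64 tokens of context across the boundary.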

In [6]:
from llama_index.core import load_index_from_storage

In [7]:
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(storage_context=storage_context)

    index_set[year] = cur_index

individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=("useful for when you want to answer queries about the"
                         f" {year} SEC 10-K for Uber"
                         ),
        ),
    )
    for year in years
]
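
Cell 5 embeds every filing each time it runs, while cell 7 reloads the indexes from `./storage/{year}`. One way to avoid repeating the embedding calls is to check first which years already have a persisted directory. The sketch below uses only the standard library and assumes that a non-empty directory means a usable index. The helper `years_needing_build` is hypothetical.

```python
from pathlib import Path


def years_needing_build(storage_root, years):
    """Return the years that have no non-empty directory under storage_root.

    Assumption: a non-empty ./storage/{year} directory holds a complete,
    loadable index. This does not validate the files inside it.
    """
    needed = []
    for year in years:
        persist_dir = Path(storage_root) / str(year)
        if not persist_dir.is_dir() or not any(persist_dir.iterdir()):
            needed.append(year)
    return needed


print(years_needing_build("./storage", [2022, 2021, 2020, 2019]))
```

Years returned by the helper would go through `VectorStoreIndex.from_documents` and `persist`; the rest can go straight to `load_index_from_storage`.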

In [8]:
from llama_index.core.query_engine import SubQuestionQueryEngine

In [9]:
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools
)
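
The `SubQuestionQueryEngine` breaks a cross-year question into one question per tool and then combines the answers. In the real engine an LLM writes the sub-questions and picks tools from their descriptions. The sketch below is a deterministic stand-in that only illustrates the fan-out pattern; `plan_sub_questions` is a hypothetical name.

```python
def plan_sub_questions(years, topic):
    """Build one (tool_name, question) pair per year.

    Toy illustration only: the actual engine generates these with an LLM,
    so wording and tool selection are not fixed like this.
    """
    return [
        (f"vector_index_{year}", f"What were the {topic} for Uber in {year}?")
        for year in years
    ]


for tool_name, question in plan_sub_questions([2019, 2020, 2021], "risk factors"):
    print(f"[{tool_name}] Q: {question}")
```

Each pair names one of the per-year tools defined in cell 7, which is why the tool names and descriptions given to `ToolMetadata` matter for routing.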

In [10]:
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description=(
            "useful for when you want to answer queries that require analyzing"
            " multiple SEC 10-K documents for Uber"
        )
    )
)

tools = individual_query_engine_tools + [query_engine_tool]

from llama_index.agent.openai import OpenAIAgent
agent = OpenAIAgent.from_tools(tools, verbose=True)

response = agent.chat("Hey! I'm Suhas")
print(response)


Added user message to memory: Hey! I'm Suhas
Hello Suhas! How can I assist you today?


In [11]:
response = agent.chat("What were some of the biggest risk factors in 2021?")
print(response)

Added user message to memory: What were some of the biggest risk factors in 2021?
=== Calling Function ===
Calling function: vector_index_2021 with args: {
  "input": "biggest risk factors in 2021"
}
Got output: The biggest risk factors in 2021 include the increase in remote working leading to privacy, cybersecurity, and fraud risks, as well as potential legal or regulatory challenges. The response to the COVID-19 pandemic, such as launching new services or expanding existing ones, could also heighten risks, including the classification of drivers. These challenges could result in fines or enforcement measures that could negatively impact financial results or operations. The pandemic itself has adversely affected financial results and may continue to do so, requiring significant actions such as workforce reductions and changes to pricing models. Other risks include being blocked from or limited in providing products in certain jurisdictions, legal and regulatory risks, risks related to

In [12]:
response = agent.chat("What about 2020?")
print(response)

Added user message to memory: What about 2020?
=== Calling Function ===
Calling function: vector_index_2020 with args: {
  "input": "biggest risk factors in 2020"
}
Got output: The biggest risk factors in 2020 were primarily related to the COVID-19 pandemic. This included the adverse impact on near-term and potentially long-term financial results, requiring significant actions such as workforce reductions and changes to pricing models. The unpredictable nature of the pandemic, including its duration, future waves, and impact on global markets, also posed a risk. The pandemic's effect on business partners and third-party vendors, as well as the potential for extreme volatility in financial markets, were additional risk factors. The shift to widespread remote work arrangements could negatively impact operations, business plans, and employee productivity. The company also faced potential privacy, cybersecurity, and fraud risks due to increased remote working. Lastly, the launch of new ser

In [13]:
response = agent.chat("Can you compare risks across all years?")
print(response)

Added user message to memory: Can you compare risks across all years?
=== Calling Function ===
Calling function: sub_question_query_engine with args: {
  "input": "compare risk factors across 2019, 2020, and 2021"
}
Generated 3 sub questions.
[vector_index_2019] Q: What were the risk factors for Uber in 2019?
[vector_index_2020] Q: What were the risk factors for Uber in 2020?
[vector_index_2021] Q: What were the risk factors for Uber in 2021?
[vector_index_2021] A: The risk factors for Uber in 2021 included the potential for actual results to differ from forward-looking statements, the impact of the COVID-19 pandemic on market and economic conditions globally, and the resulting effects on drivers, merchants, consumers, and business partners. The pandemic also led to driver supply constraints. Other risks included liabilities for breaches experienced by companies Uber acquires, such as th