# Document Search
Set up a vector-based document search that can handle multiple file types. The example data is Uber's 10-K filings for 2019-2022, which the code below loads as HTML files. The goal is to search these documents with a query and return the most relevant passages.
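
To make the idea of "vector-based search" concrete before bringing in any library, here is a minimal sketch. It assumes a toy bag-of-words "embedding" in place of a learned embedding model, and the names `embed`, `cosine`, and `search` are illustrative only; they are not part of llama-index.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, chunks, top_k=2):
    """Score every chunk against the query and return the top_k (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical text chunks, made up for illustration
chunks = [
    "drivers may be classified as employees",
    "revenue grew in the delivery segment",
    "competition in mobility and delivery is intense",
]
results = search("risk that drivers are classified as employees", chunks, top_k=1)
```

The rest of the notebook follows the same pattern of embedding chunks, embedding the query, and ranking by similarity, but delegates embedding and storage to llama-index.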

In [3]:
!mkdir -p doc_data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O doc_data/UBER.zip
!unzip doc_data/UBER.zip -d doc_data

--2023-07-09 13:28:31--  https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6021:18::a27d:4112, 162.125.65.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6021:18::a27d:4112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/948jr9cfs7fgj99/UBER.zip [following]
--2023-07-09 13:28:31--  https://www.dropbox.com/s/dl/948jr9cfs7fgj99/UBER.zip
Reusing existing connection to [www.dropbox.com]:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc797cd666d44857ff92b9c7d7bd.dl.dropboxusercontent.com/cd/0/get/B_ghPjekQq-O3SjX6e0b5mHWAJvLA-A6u0YkV4IaJ7jcX5DrpzVagPgeYqv6c-PRSy47fuaI3K07vmbjK53PPhBYWuJUhDb1ABacxY3QjaVDwayvsXLkoLu9sh6h-NmHUlu_G2vqYcKESFv6KpEjDtI5jRjB7CV2QeW47jokZXDHMw/file?dl=1# [following]
--2023-07-09 13:28:32--  https://uc797cd666d44857ff92b

First load the documents using the Unstructured library.

In [1]:
from llama_index import download_loader, VectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

In [2]:
# Load API key from environment variable
import os
import openai
from dotenv import load_dotenv
load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]


In [4]:
years = [2022, 2021, 2020, 2019]
UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

loader = UnstructuredReader()


[nltk_data] Downloading package punkt to /home/bluegnome/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/bluegnome/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [4]:
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(file=Path(f'./doc_data/UBER/UBER_{year}.html'), split_documents=False)
    # attach the filing year as metadata to each document
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

Now that each document is loaded (into the Document pydantic schema defined by llama-index), we can create indices for search. We use one vector index per year here.

In [34]:
service_context = ServiceContext.from_defaults(chunk_size=512)  # Split documents into chunks of size 512 before embedding
index_set = {}  # Store the indexes for each document
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        service_context=service_context,
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f'./storage/{year}') # Save the storage index for future use
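
The `chunk_size=512` setting above controls how each filing is cut into pieces before embedding. As a rough illustration of what chunking means, the sketch below splits text into overlapping windows of whitespace-separated words. This is an assumption-laden simplification: the library's own splitter is expected to count tokens and respect text boundaries, and the `overlap` value here is an arbitrary illustrative choice, not the library default. `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size=512, overlap=20):
    """Split text into windows of at most chunk_size words, sharing `overlap` words between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Synthetic 1200-word text, for illustration only
sample = " ".join(f"w{i}" for i in range(1200))
pieces = chunk_words(sample, chunk_size=512, overlap=20)
```

Smaller chunks give more precise retrieval hits but less context per hit; the overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary.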

Load indices from disk.

In [5]:
# Load indices from disk
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(persist_dir=f'./storage/{year}')
    cur_index = load_index_from_storage(storage_context=storage_context)
    index_set[year] = cur_index
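
Building the indices calls the embedding API, so it is worth doing only once. The build cell and the load cell above can be combined into a "build or load" pattern, sketched below with placeholder functions. `build_or_load`, `demo_build`, and `demo_load` are hypothetical names; in this notebook `build_fn` would wrap the `VectorStoreIndex.from_documents(...)` plus `persist(...)` step and `load_fn` would wrap `load_index_from_storage(...)`.

```python
import os
import tempfile

def build_or_load(persist_dir, build_fn, load_fn):
    """Load from persist_dir if it already holds files; otherwise build into it."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_fn(persist_dir)
    os.makedirs(persist_dir, exist_ok=True)
    return build_fn(persist_dir)

# Placeholder build/load functions that stand in for the real index calls
def demo_build(persist_dir):
    with open(os.path.join(persist_dir, "index.txt"), "w") as f:
        f.write("built")
    return "built"

def demo_load(persist_dir):
    with open(os.path.join(persist_dir, "index.txt")) as f:
        return "loaded:" + f.read()

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "2022")
    first = build_or_load(target, demo_build, demo_load)   # directory missing, so it builds
    second = build_or_load(target, demo_build, demo_load)  # directory populated, so it loads
```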

# Graph Composition For Question Answering
We may want to ask questions across the entire set of documents as well as about individual filings. To support this, we compose a "graph": a list index whose child nodes are the per-year vector indices.

In [11]:
from langchain import OpenAI
from llama_index import LLMPredictor

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, max_tokens=512))

In [39]:
from llama_index import ListIndex, ServiceContext, load_graph_from_storage
from llama_index.indices.composability import ComposableGraph

# describe each index to help traversal of composed graph
index_summaries = [f"UBER 10-k Filing for {year} fiscal year" for year in years]

# reuse the LLMPredictor defined above, which limits output to 512 tokens (max_tokens=512)

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
storage_context = StorageContext.from_defaults()

# define a list index over the vector indices
# allows us to synthesize information across each index
graph = ComposableGraph.from_indices(
    ListIndex,
    [index_set[y] for y in years], 
    index_summaries=index_summaries,
    service_context=service_context,
    storage_context = storage_context,
)
root_id = graph.root_id


# [optional] save to disk
storage_context.persist(persist_dir='./storage/root')

In [40]:
# [optional] load from disk, so you don't need to build graph from scratch
graph = load_graph_from_storage(
    root_id=root_id, 
    service_context=service_context,
    storage_context=storage_context,
)

# Langchain Tool Setup
Now that the graph exists and is queryable, we can set up LangChain to decide which index to query and how to query it.

In [7]:
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.agents import initialize_agent

from llama_index.langchain_helpers.agents import LlamaToolkit, create_llama_chat_agent, IndexToolConfig

In [43]:
# define a decompose transform
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
decompose_transform = DecomposeQueryTransform(
    llm_predictor, verbose=True
)

In [45]:
# define custom query engines: one per vector index, plus one for the graph root
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

custom_query_engines = {}
for index in index_set.values():
    query_engine = index.as_query_engine()
    query_engine = TransformQueryEngine(
        query_engine,
        query_transform=decompose_transform,
        transform_metadata={'index_summary': index.index_struct.summary},
    )
    custom_query_engines[index.index_id] = query_engine
custom_query_engines[graph.root_id] = graph.root_index.as_query_engine(
    response_mode='tree_summarize',
    verbose=True,
)

# construct query engine
graph_query_engine = graph.as_query_engine(custom_query_engines=custom_query_engines)

# tool config
graph_config = IndexToolConfig(
    query_engine=graph_query_engine,
    name="Graph Index",
    description="useful for when you want to answer queries that require analyzing multiple SEC 10-K documents for Uber.",
    tool_kwargs={"return_direct": True}
)
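
The control flow that the graph query engine sets up can be pictured as: rewrite the question for each child index, query each child, then summarize the partial answers at the root. The sketch below mimics that flow with plain functions. It is only an illustration under strong assumptions: `decompose` is a string template standing in for the LLM-based transform, the child "indices" are stub callables, and `query_graph` is a hypothetical name, not a llama-index function.

```python
def decompose(query, summary):
    """Stand-in for the LLM-based decompose step: scope the question to one index."""
    return f"{query} (restricted to: {summary})"

def query_graph(query, children, summarize):
    """children maps an index summary to a callable that answers a sub-query."""
    partial_answers = []
    for summary, answer_fn in children.items():
        sub_query = decompose(query, summary)
        partial_answers.append(answer_fn(sub_query))
    # The root combines the per-child answers into one response
    return summarize(partial_answers)

# Stub children; each just echoes the sub-query it received
children = {
    f"UBER 10-k Filing for {year} fiscal year": (lambda q, y=year: f"[{y}] answer to: {q}")
    for year in [2022, 2021]
}
combined = query_graph(
    "What are the risk factors?",
    children,
    summarize=lambda parts: "\n".join(parts),
)
```

In the real setup the summarize step is itself an LLM call (`response_mode='tree_summarize'` in the cell above), so the root answer is a synthesis rather than a concatenation.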

In [8]:
# Define one tool per yearly vector index
index_configs = []
for y in range(2019, 2023):
    query_engine = index_set[y].as_query_engine(
        similarity_top_k=3,
    )
    tool_config = IndexToolConfig(
        query_engine=query_engine, 
        name=f"Vector Index {y}",
        description=f"useful for when you want to answer queries about the {y} SEC 10-K for Uber",
        tool_kwargs={"return_direct": True}
    )
    index_configs.append(tool_config)
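
The `name` and `description` fields matter because the agent chooses a tool by reading them. As a rough intuition for why year-specific descriptions help, the sketch below routes a question to the tool whose description shares the most words with it. This keyword overlap is an assumed stand-in for illustration only; the actual agent asks the LLM to pick a tool, and `tokens` and `pick_tool` are hypothetical helpers.

```python
import re

def tokens(text):
    """Lowercase word set, keeping digits and hyphens so '2020' and '10-k' survive."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def pick_tool(query, tools):
    """tools maps tool name -> description; return the name with the largest word overlap."""
    q = tokens(query)
    return max(tools, key=lambda name: len(q & tokens(tools[name])))

# Same names and descriptions as the tool configs above
tools = {
    f"Vector Index {y}": f"useful for when you want to answer queries about the {y} SEC 10-K for Uber"
    for y in range(2019, 2023)
}
choice = pick_tool("What were some of the biggest risk factors in 2020 for Uber?", tools)
```

Here the only word that separates the four descriptions is the year, which is why a question that names a year can be routed to the matching index.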

In [9]:
# Create the toolkit from the per-year vector index tools (swap in the commented line to add the graph tool as well)
toolkit = LlamaToolkit(
    # index_configs=index_configs + [graph_config],
    index_configs=index_configs
)

In [12]:
# Create a langchain agent that can use the tools
memory = ConversationBufferMemory(memory_key="chat_history")
llm = OpenAI(temperature=0)
agent_chain = create_llama_chat_agent(
    toolkit,
    llm,
    memory=memory,
    verbose=True
)

# Testing The Agent

In [13]:
agent_chain.run(input="Hi GPT, I'm Arthur, king of the britons.")



> Entering new AgentExecutor chain...

Thought: Do I need to use a tool? No
AI: Hi Arthur, it's nice to meet you! How can I help you today?

> Finished chain.


"Hi Arthur, it's nice to meet you! How can I help you today?"

In [14]:
agent_chain.run(input="What were some of the biggest risk factors in 2020 for Uber?")



> Entering new AgentExecutor chain...

Thought: Do I need to use a tool? Yes
Action: Vector Index 2020
Action Input: Risk Factors
Observation:
1. The COVID-19 pandemic and the impact of actions to mitigate the pandemic
2. Drivers being classified as employees, workers or quasi-employees instead of independent contractors
3. High competition in the mobility, delivery, and logistics industries
4. Lowering of fares or service fees and offering significant Driver incentives and consumer discounts and promotions
5. Inability to achieve profitability
6. Inability to attract or maintain a critical mass of Drivers, consumers, merchants, shippers, and carriers
7. Operational, compliance, and cultural challenges
8. Maintaining and enhancing brand and reputation
9. Inability to optimize organizational structure or effectively manage growth
10. Platform users engaging in criminal, violent, inappropriate, or dangerous activity
11. Investing in new offerings 

'\n1. The COVID-19 pandemic and the impact of actions to mitigate the pandemic\n2. Drivers being classified as employees, workers or quasi-employees instead of independent contractors\n3. High competition in the mobility, delivery, and logistics industries\n4. Lowering of fares or service fees and offering significant Driver incentives and consumer discounts and promotions\n5. Inability to achieve profitability\n6. Inability to attract or maintain a critical mass of Drivers, consumers, merchants, shippers, and carriers\n7. Operational, compliance, and cultural challenges\n8. Maintaining and enhancing brand and reputation\n9. Inability to optimize organizational structure or effectively manage growth\n10. Platform users engaging in criminal, violent, inappropriate, or dangerous activity\n11. Investing in new offerings and technologies\n12. Economic, social, weather, and regulatory conditions or other circumstances, including COVID-19\n13. Retaining and attracting high-quality personnel\

In [51]:
cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points."
)
agent_chain.run(input=cross_query_str)



> Entering new  chain...

Thought: Do I need to use a tool? Yes
Action: Graph Index
Action Input: Compare/contrast the risk factors described in the Uber 10-K across years
> Current query: Compare/contrast the risk factors described in the Uber 10-K across years
> New query:  What are the risk factors described in the Uber 10-K for the 2022 fiscal year?
> Current query: Compare/contrast the risk factors described in the Uber 10-K across years
> New query:  What are the risk factors described in the Uber 10-K for the 2022 fiscal year?
> Current query: Compare/contrast the risk factors described in the Uber 10-K across years
> New query:  What are the risk factors described in the Uber 10-K for the 2021 fiscal year?
> Current query: Compare/contrast the risk factors described in the Uber 10-K across years
> New query

"\nThe risk factors described in the Uber 10-K for the 2020, 2021, and 2022 fiscal years all include potential for Drivers to be classified as employees, increased competition in the ride-sharing, meal delivery, grocery delivery, and freight industries, potential for regulatory changes, potential for litigation or other proceedings, potential for cybersecurity incidents or other disruptions to the company's technology platform, and potential for the company to be unable to attract and retain key personnel.\n\nThe risk factors described in the Uber 10-K for the 2020 fiscal year also include the severity of the COVID-19 pandemic, the duration of the outbreak, any future waves or resurgences of the virus, variants of the virus, the timing of widespread adoption of vaccines against the virus, additional actions that may be taken by governmental authorities, the further impact on the business of Drivers, Merchants, consumers, and business partners, and the inability to attract or maintain a