# Document Search
Set up a vector-based document search that can work across multiple file types. As input we use Uber's 10-K filings from 2019 to 2022, which the code below loads as HTML files. The goal is to search these documents with a query and return the most relevant passages.

In [3]:
!mkdir -p doc_data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O doc_data/UBER.zip
!unzip doc_data/UBER.zip -d doc_data

--2023-07-09 13:28:31--  https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6021:18::a27d:4112, 162.125.65.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6021:18::a27d:4112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/948jr9cfs7fgj99/UBER.zip [following]
--2023-07-09 13:28:31--  https://www.dropbox.com/s/dl/948jr9cfs7fgj99/UBER.zip
Reusing existing connection to [www.dropbox.com]:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc797cd666d44857ff92b9c7d7bd.dl.dropboxusercontent.com/cd/0/get/B_ghPjekQq-O3SjX6e0b5mHWAJvLA-A6u0YkV4IaJ7jcX5DrpzVagPgeYqv6c-PRSy47fuaI3K07vmbjK53PPhBYWuJUhDb1ABacxY3QjaVDwayvsXLkoLu9sh6h-NmHUlu_G2vqYcKESFv6KpEjDtI5jRjB7CV2QeW47jokZXDHMw/file?dl=1# [following]
--2023-07-09 13:28:32--  https://uc797cd666d44857ff92b

First, load the documents with the Unstructured library.

In [25]:
from llama_index import download_loader, VectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

In [32]:
# Load API key from environment variable
import os
import openai
from dotenv import load_dotenv
load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]


In [2]:
years = [2022, 2021, 2020, 2019]
UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

loader = UnstructuredReader()


[nltk_data] Downloading package punkt to /home/bluegnome/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/bluegnome/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [4]:
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(file=Path(f'./doc_data/UBER/UBER_{year}.html'), split_documents=False)
    # attach the filing year as metadata to each document
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

Now that each document is loaded (as a Document object, a pydantic schema defined by llama-index), we can create the indices for search. We build one vector index per year.

In [34]:
service_context = ServiceContext.from_defaults(chunk_size=512)  # Split documents into chunks of 512 tokens before embedding
index_set = {}  # Store the indexes for each document
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        service_context=service_context,
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f'./storage/{year}') # Save the storage index for future use
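
To make the indexing step less of a black box, here is a minimal, dependency-free sketch of what a vector index does conceptually: split text into chunks, turn each chunk into a vector, and rank the chunks by cosine similarity to the query. The word-count "embedding", the sample text, and the helper names (`chunk_words`, `embed`, `cosine`, `search`) are illustrative assumptions, not llama-index internals. The notebook itself relies on llama-index's own text splitter and embedding model.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(chunks, query, top_k=2):
    """Return the top_k (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(embed(c), q), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Placeholder text, not taken from the filings.
text = (
    "apples and pears are fruit "
    "carrots and leeks are vegetables "
    "salmon and trout are fish"
)
chunks = chunk_words(text, chunk_size=5)
for score, chunk in search(chunks, "which fish is a trout"):
    print(f"{score:.2f}  {chunk}")
```

A real embedding model captures meaning rather than exact word overlap, but the retrieval step (nearest vectors to the query vector) follows the same pattern.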

Load the indices back from disk so they do not have to be rebuilt.

In [35]:
# Load indices from disk
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(persist_dir=f'./storage/{year}')
    cur_index = load_index_from_storage(storage_context=storage_context)
    index_set[year] = cur_index
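
With the indices loaded, a search amounts to a retrieval call against one year's index. The sketch below assumes the installed llama-index version exposes `as_retriever(similarity_top_k=...)` on the index, `retrieve(query)` on the retriever, and `score` plus `node.get_content()` on each result. Check these names against your version. `search_year` is a hypothetical helper, not part of llama-index.

```python
def search_year(index_set, year, query, top_k=3):
    """Return (score, text) pairs for the top_k chunks of one year's filing."""
    # Assumption: the index offers as_retriever(similarity_top_k=...).
    retriever = index_set[year].as_retriever(similarity_top_k=top_k)
    hits = retriever.retrieve(query)
    # Assumption: each hit carries a similarity score and a node with get_content().
    return [(hit.score, hit.node.get_content()) for hit in hits]


# Example usage in the notebook, after index_set has been loaded
# (embedding the query requires an OpenAI API key):
# for score, text in search_year(index_set, 2022, "What were the main risk factors?"):
#     print(f"{score:.3f}  {text[:200]}")
```

Returning plain tuples keeps the result easy to print or filter without depending on the library's result classes.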

# Graph Composition For Question Answering
We may want to ask questions over the entire set of documents as well as over individual ones. To do so, we compose a "graph": a list index built on top of the per-year vector indices.

In [39]:
from llama_index import ListIndex, LLMPredictor, ServiceContext, load_graph_from_storage
from langchain import OpenAI
from llama_index.indices.composability import ComposableGraph

# describe each index to help traversal of composed graph
index_summaries = [f"UBER 10-K Filing for {year} fiscal year" for year in years]

# define an LLMPredictor and set the maximum number of output tokens
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, max_tokens=512))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
storage_context = StorageContext.from_defaults()

# define a list index over the vector indices
# allows us to synthesize information across each index
graph = ComposableGraph.from_indices(
    ListIndex,
    [index_set[y] for y in years],
    index_summaries=index_summaries,
    service_context=service_context,
    storage_context=storage_context,
)
root_id = graph.root_id


# [optional] save to disk
storage_context.persist(persist_dir='./storage/root')

In [40]:
# [optional] load from disk, so you don't need to build the graph from scratch
graph = load_graph_from_storage(
    root_id=root_id,
    service_context=service_context,
    storage_context=storage_context,
)
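
Once the graph exists, a question can be sent to all years at once. The sketch below assumes the graph object exposes `as_query_engine()` and that the resulting engine exposes `query()`, with a response that converts to text via `str()`. Verify these against your llama-index version. `ask_graph` is a hypothetical helper name.

```python
def ask_graph(graph, question):
    """Send one question to the composed graph and return the answer as text."""
    # Assumption: the graph offers as_query_engine() and the engine offers query().
    query_engine = graph.as_query_engine()
    response = query_engine.query(question)
    return str(response)


# Example usage in the notebook, after the graph has been built or loaded
# (this calls the OpenAI API):
# print(ask_graph(graph, "How did the risk factors change between 2019 and 2022?"))
```

Because the root is a list index, a query of this kind is expected to consult every yearly index, so it will typically be slower and use more tokens than a search against a single year.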