# Cohere Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining this key, store it in plain text in your home directory, in the `~/.cohere.key` file.
- (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.

## Set up the RAG workflow environment

In [2]:
from getpass import getpass
import os
from pathlib import Path

from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank

Set up some helper functions:

In [3]:
def pretty_print_docs(docs):
    # LlamaIndex documents and nodes expose their text through get_content()
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [4]:
try:
    # Read the key once and expose it under both environment variable names
    cohere_api_key = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["COHERE_API_KEY"] = cohere_api_key
    os.environ["CO_API_KEY"] = cohere_api_key
except Exception:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the source_documents folder and make sure it contains at least one PDF file
directory_path = "./source_documents"
if not os.path.exists(directory_path):
    # Report the missing folder instead of letting os.listdir() raise
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
elif not any(filename.lower().endswith(".pdf") for filename in os.listdir(directory_path)):
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")
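
The same checks can also be written as small functions that raise instead of printing, which makes them easier to reuse and test. The sketch below is only an illustration: the helper names `read_api_key` and `list_pdfs` are hypothetical, and the demo runs against a temporary folder rather than your real key file or `source_documents`.

```python
from pathlib import Path
import tempfile


def read_api_key(key_path: Path) -> str:
    """Return the key stored in key_path, or raise if the file is missing or empty."""
    if not key_path.is_file():
        raise FileNotFoundError(f"No API key file found at {key_path}")
    key = key_path.read_text().strip()
    if not key:
        raise ValueError(f"The API key file at {key_path} is empty")
    return key


def list_pdfs(folder: Path) -> list[Path]:
    """Return the PDF files in folder, or raise if the folder or the PDFs are missing."""
    if not folder.is_dir():
        raise FileNotFoundError(f"The folder {folder} does not exist")
    pdfs = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")
    if not pdfs:
        raise FileNotFoundError(f"The folder {folder} contains no .pdf files")
    return pdfs


# Demonstrate on a throwaway directory so nothing real on disk is touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "demo.key").write_text("  not-a-real-key\n")
    (tmp_path / "report.pdf").write_bytes(b"%PDF-1.4")
    (tmp_path / "notes.txt").write_text("ignored")

    demo_key = read_api_key(tmp_path / "demo.key")
    demo_pdfs = [p.name for p in list_pdfs(tmp_path)]

print(demo_key)
print(demo_pdfs)
```

In the notebook itself you would call these with `Path.home() / ".cohere.key"` and `Path("./source_documents")`.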

## Cohere LLM

In [5]:
llm = Cohere(api_key=os.environ["COHERE_API_KEY"])

Without additional information, Cohere cannot answer questions about our documents correctly, because the model has never seen the contents of the `source_documents` folder. Let's see how we can use RAG to augment our questions with a document search and get the correct answers.

## Ingestion: Load the documents from `source_documents`

Start by reading in a PDF file from `source_documents`. This example loads a single file, the Vanguard ETF prospectus.

In [6]:
# Load the pdfs
pdf_folder_path = "./source_documents"
documents = SimpleDirectoryReader(
    input_files=[f"{pdf_folder_path}/Vanguard_ETF_Statutory_Prospectus_Single_VOE.pdf"]
).load_data()
print(f"Number of source materials: {len(documents)}\n")

Number of source materials: 116



## Define an embeddings model

This embeddings model will convert the textual data from our PDF files into vector embeddings. These vector embeddings will later enable us to quickly find the chunk of text that most closely corresponds to our original query.

In [7]:
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=200
)

[nltk_data] Downloading package punkt to
[nltk_data]     /tmp/9NqDLbzEJfRnMaTF/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.
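
To make the idea of "finding the chunk that most closely corresponds to the query" concrete, here is a minimal sketch of a nearest-neighbour lookup using cosine similarity. The three-dimensional vectors and chunk labels are made up for illustration; real embedding vectors have far more dimensions, and LlamaIndex performs this comparison for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, one per chunk of text.
chunk_vectors = {
    "fees and expenses": [0.9, 0.1, 0.0],
    "investment strategy": [0.1, 0.9, 0.2],
    "tax information": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of the user's query.
query_vector = [0.2, 0.8, 0.1]

# The best match is the chunk whose vector points in the most similar direction.
best_chunk = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
)
print(best_chunk)  # investment strategy
```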


## Storage: Store the documents in a vector database

In [8]:
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/116 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1744 [00:00<?, ?it/s]
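
The progress output shows 116 loaded documents becoming 1744 embeddings: each document is split into smaller chunks before it is embedded, which is what `chunk_size=200` controls. The sketch below illustrates the general idea with a simplified word-based splitter. The function `chunk_words` and its `overlap` parameter are hypothetical, and LlamaIndex's own splitter is more sophisticated, so treat this as an illustration rather than a description of its behaviour.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size words, sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample_text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample_text, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Smaller chunks give more precise matches at retrieval time, at the cost of more embeddings to compute and store.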

## Retrieval: Search for the chunks of document text that most closely match our query

## Set up the retriever and reranker

In [9]:

search_query_retriever = index.as_retriever(service_context=service_context)
reranker = CohereRerank()
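
A retriever followed by a reranker is a two-stage search: a cheap first pass selects a handful of candidates, then a more careful scorer reorders them. The sketch below imitates that flow with toy scoring functions. Word overlap and Jaccard similarity are stand-ins chosen for illustration only; the real retriever compares embeddings and `CohereRerank` calls Cohere's rerank model.

```python
def overlap_count(query, text):
    """First-stage score: how many distinct query words appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def jaccard(query, text):
    """Second-stage score: shared words divided by all words, which penalizes padding."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_then_rerank(query, passages, top_k=2, top_n=1):
    """Return (first-stage candidates, final reranked results)."""
    candidates = sorted(passages, key=lambda p: overlap_count(query, p), reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda p: jaccard(query, p), reverse=True)
    return candidates, reranked[:top_n]


# Hypothetical passages standing in for document chunks.
passages = [
    "the fund uses index tracking among many other things described at length "
    "in this long passage about the fund",
    "the fund employs index tracking",
    "dividends are taxed annually",
]
candidates, final = retrieve_then_rerank("fund index tracking", passages)
print(final[0])  # the fund employs index tracking
```

Both of the first two passages contain every query word, so the first stage cannot separate them. The second stage prefers the shorter, more focused passage.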

## Query Response pipeline

In [10]:
def get_response_to_query(query):
    # search_query_retrieved_nodes = search_query_retriever.retrieve(query)
    # print(f"Search query retriever found {len(search_query_retrieved_nodes)} results")
    # print(f"First result example:\n{search_query_retrieved_nodes[0]}\n")
    # Retrieve candidate chunks, rerank them with Cohere, then synthesize an answer
    query_engine = index.as_query_engine(node_postprocessors=[reranker])
    return query_engine.query(query)

## Get the Fund Name

In [13]:
fund_name = get_response_to_query("What is the name of the fund? Give only the name without additional comments. The name of the fund is: ")
print(f"Fund Name: {fund_name}")

Fund Name: Vanguard US Large Cap Growth Index Fund


## Get responses to key queries

In [14]:
queries = [
    "What is the investment strategy of the fund?",
    "What are the investment objectives of the fund?",
    "Who are the key people in the management team?",
    "What is the investment philosophy of the fund regarding ESG (Environmental, Social, and Governance)?",
    "What industries, markets, or types of securities does the fund want exposure to?",
    "What investment tools (derivatives, leverage, etc.) does the fund use to achieve its investment goals?"
]

In [15]:
responses = []

for query in queries:
    result = get_response_to_query(query)
    responses.append(result)

# Materialize the pairs so they can be iterated more than once
response_answer_pairs = list(zip(queries, responses))
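
One detail about `zip` is worth keeping in mind in a notebook, where cells are often re-run: in Python 3 it returns a one-shot iterator, so the pairs can be looped over only once unless they are stored in a list. A small sketch with placeholder strings:

```python
questions = ["question 1", "question 2"]
answers = ["answer 1", "answer 2"]

# A bare zip object is consumed by the first loop over it.
lazy_pairs = zip(questions, answers)
first_pass = list(lazy_pairs)
second_pass = list(lazy_pairs)  # already exhausted, so this is empty

# Wrapping it in list() keeps the pairs available for repeated use.
stored_pairs = list(zip(questions, answers))
reusable_first = [q for q, _ in stored_pairs]
reusable_second = [q for q, _ in stored_pairs]

print(len(first_pass), len(second_pass))  # 2 0
print(reusable_first == reusable_second)  # True
```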


In [16]:
response_answer_text = ""
for (query, response) in response_answer_pairs:
    response_answer_text = f"{response_answer_text}{query}\n{response}\n\n"

print(response_answer_text)

What is the investment strategy of the fund?
The fund's investment strategy is to track the performance of two different indices, the Standard & Poor’s Completion Index and the CRSP US Large Cap Growth Index. The fund does this by investing all, or substantially all, of its assets in the stocks that make up the indices, holding each stock in approximately the same proportion as its weighting in the index. The indices are broadly diversified and include U.S. common stocks of small and mid-size companies, and growth stocks of large U.S. companies, respectively. The fund's investment approach is indexing, which involves attempting to replicate the performance of a target index by investing in a broadly diversified collection of securities that approximate the key characteristics of the full index.

What are the investment objectives of the fund?
The fund's investment objectives are to employ an indexing approach to track the performance of the CRSP US Large Cap Growth Index, a broadly div

## Chat Engine

In [17]:
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        f"You are an expert Mutual Fund analyst for a bank, and you provide answers to your boss about whether the bank should purchase the fund named {fund_name}."
        f"  You have answered these key questions about the fund:\n {response_answer_text}"
    ),
)
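
The `token_limit` on the memory buffer caps how much chat history is sent back to the model on each turn. The sketch below shows the general idea of such a buffer: keep the most recent messages that fit within the budget and drop older ones. It is a simplified illustration, not LlamaIndex's implementation. The function `trim_history` is hypothetical, and it counts whitespace-separated words rather than real tokens.

```python
def trim_history(messages, token_limit):
    """Keep the most recent messages whose combined word count fits within token_limit."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are considered first.
    for message in reversed(messages):
        cost = len(message.split())
        if used + cost > token_limit:
            break
        kept.append(message)
        used += cost
    # Restore chronological order.
    return list(reversed(kept))


history = [
    "What is the level of risk for the fund?",  # 9 words
    "The fund carries stock market risk.",      # 6 words
    "Is it higher or lower than most funds?",   # 8 words
]
trimmed = trim_history(history, token_limit=15)
print(trimmed)
# ['The fund carries stock market risk.', 'Is it higher or lower than most funds?']
```

A larger limit preserves more context for follow-up questions, while a smaller one leaves more room in the prompt for retrieved document text.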

In [18]:
chat_response = chat_engine.chat("What is the level of risk for the fund?")
print(chat_response)

According to the prospectus, the Vanguard US Large Cap Growth Index Fund is subject to both stock market risk and index sampling risk at a low level. However, the reality is that all investments come with a level of risk. The fund is also exposed to investment style risk, as small and mid-cap stocks tend to be more sensitive to changing economic conditions than large-cap stocks. Finally, the fund also invests in derivatives and equity futures for asset exposure, which exposes the fund to a different level of risk. 

Ultimately, the fund's asset allocation and risk exposure will be influenced by the CRSP US Large Cap Growth Index, as the fund's objective is to perform indexing and track the index's performance. As a result, the fund will be subject to the risks associated with the Index's specific market sector allocation, and investors should carefully consider this before purchasing the fund.


In [19]:
chat_response = chat_engine.chat("is it higher or lower than most funds?")
print(chat_response)

It is difficult to compare the risk of the Vanguard US Large Cap Growth Index Fund to other funds because assessing risk is complex and dependent on many factors. 

This fund focuses on tracking two indices with a diverse range of small and mid-sized company stocks, which introduces elements of market diversity that typically lowers risk. However, the fund also utilizes derivatives and equity futures which frequently increases risk exposure. While the fund's objectives and investment style introduce specific risks related to its asset allocation and indexing approach, these can be managed to match an investor's goals and tolerance. 

Each fund has a unique risk profile that is influenced by factors such as asset class, market capitalization, investment style, and portfolio management. These factors can influence the volatility and potential rewards of the fund. Investors should carefully review each fund's prospectus to understand its risk profile and consult with a financial advisor t