# Cohere Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at [dashboard.cohere.com](https://dashboard.cohere.com/welcome/login). A free trial account will suffice, but will be limited to a small number of requests.
- After obtaining this key, store it in plain text in your home directory in the `~/.cohere.key` file.
- (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.

## Set up the RAG workflow environment

In [2]:
from getpass import getpass
import os
from pathlib import Path

from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank

Set up some helper functions:

In [3]:
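def pretty_print_docs(docs):
    # LlamaIndex documents and nodes expose their text via get_content();
    # page_content is the LangChain attribute and is not available here.
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )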

Make sure other necessary items are in place:

In [4]:
try:
    # Read the key file once and set both environment variables from it
    cohere_key = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["COHERE_API_KEY"] = cohere_key
    os.environ["CO_API_KEY"] = cohere_key
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the source_documents folder and make sure it holds at least one PDF file
directory_path = "./source_documents"
if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
elif not any(filename.lower().endswith(".pdf") for filename in os.listdir(directory_path)):
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")
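The same two checks can also be written as small reusable functions, which makes them easier to test outside the notebook. This is only a sketch: `read_api_key` and `has_pdf` are hypothetical helper names and are not part of LlamaIndex or the Cohere SDK.

```python
from pathlib import Path
from typing import Optional


def read_api_key(key_file: Path) -> Optional[str]:
    """Return the stripped contents of key_file, or None if it is missing or empty."""
    try:
        key = key_file.read_text(encoding="utf-8").strip()
    except OSError:
        # Covers a missing file, a directory, or a permissions problem
        return None
    return key or None


def has_pdf(folder: Path) -> bool:
    """Return True if folder exists and holds at least one file ending in .pdf."""
    return folder.is_dir() and any(
        p.suffix.lower() == ".pdf" for p in folder.iterdir() if p.is_file()
    )
```

In the notebook these could be called as `read_api_key(Path.home() / ".cohere.key")` and `has_pdf(Path("./source_documents"))`.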

## Cohere LLM

In [5]:
llm = Cohere(api_key=os.environ["COHERE_API_KEY"])

Without additional information, Cohere cannot reliably answer questions about documents it has never seen. Fortunately, the information we need is in the PDF files in the `source_documents` folder. Let's see how we can use RAG to augment our questions with a document search and get correct answers.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in the sample PDF prospectus from `source_documents`.

In [6]:
# Load the PDF (load_data() returns a list of Document objects)
pdf_folder_path = "./source_documents"
documents = SimpleDirectoryReader(
    input_files=[f"{pdf_folder_path}/Blackrock_MF_Summary_Prospectus_Single_BROAX-BROCX-BROIX-BGORX.pdf"]
).load_data()
print(f"Number of source materials: {len(documents)}\n")

Number of source materials: 12



## Define an embeddings model

This embeddings model will convert the textual data from our PDF files into vector embeddings. These vector embeddings will later enable us to quickly find the chunk of text that most closely corresponds to our original query.
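To make the idea of "closest" concrete, here is a minimal sketch of comparing a query vector against chunk vectors using cosine similarity, which is one common similarity measure. The three-dimensional vectors and the chunk labels are made up for illustration; real embeddings come from the embeddings model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, chunk_vecs):
    """Return the key of the chunk whose vector is most similar to query_vec."""
    return max(chunk_vecs, key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]))


# Made-up vectors standing in for embedded chunks of text
toy_chunks = {
    "fees": [0.9, 0.1, 0.0],
    "risks": [0.1, 0.9, 0.2],
    "managers": [0.0, 0.2, 0.9],
}
toy_query = [0.2, 0.8, 0.1]  # made-up vector standing in for an embedded query

best = most_similar(toy_query, toy_chunks)
print(best)
```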

In [7]:
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=200
)



## Storage: Store the documents in a vector database

In [8]:
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/12 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/197 [00:00<?, ?it/s]
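The counters in the progress output above (12 parsed items, 197 embeddings) reflect the loaded documents being split into smaller chunks before embedding. The following is a rough sketch of fixed-size chunking. It assumes whitespace-separated words as a stand-in for tokens, and `chunk_words` is a hypothetical helper, not the splitter LlamaIndex actually uses.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e", chunk_size=2))
print(chunk_words("a b c d e", chunk_size=3, overlap=1))
```

Smaller chunks give more precise retrieval targets but more embeddings to compute and store, which is the trade-off behind a setting like `chunk_size=200`.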

## Retrieval: Now do a search to retrieve the chunk of document text that most closely matches our original query

## Set up the retriever and reranker

In [9]:

search_query_retriever = index.as_retriever(service_context=service_context)
reranker = CohereRerank()
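The retriever and reranker form a two-stage pipeline: a fast first pass selects candidate chunks, and a second pass reorders those candidates with a more careful score. The sketch below illustrates only that structure. Both scoring functions are simple word-overlap stand-ins chosen for this example, whereas the notebook uses embedding similarity for retrieval and Cohere's hosted rerank model for the second stage.

```python
def tokenize(text):
    """Lower-case a string and return its set of whitespace-separated words."""
    return set(text.lower().split())


def first_stage(query, chunks, top_k):
    """Cheap candidate retrieval: rank chunks by the number of words shared with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]


def rerank(query, candidates, top_n):
    """Stand-in reranker: reorder candidates by Jaccard overlap, which penalises off-topic words."""
    q = tokenize(query)

    def score(chunk):
        t = tokenize(chunk)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:top_n]


# Made-up chunks for illustration only
chunks = [
    "the fund invests in non-us equity securities",
    "who pays the managers and what are the fees of the fund for each share class",
    "portfolio managers of the fund",
]
query = "who are the portfolio managers of the fund"

candidates = first_stage(query, chunks, top_k=2)
final = rerank(query, candidates, top_n=1)
print(candidates)
print(final)
```

In this toy data the long fee-related chunk wins the first stage on raw word overlap, and the reranking stage moves the shorter, more relevant chunk to the top.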

## Query Response pipeline

In [10]:
def get_response_to_query(query):
    """Answer a query against the index, passing retrieved nodes through the Cohere reranker."""
    # Uncomment to inspect the raw retrieval results:
    # search_query_retrieved_nodes = search_query_retriever.retrieve(query)
    # print(f"Search query retriever found {len(search_query_retrieved_nodes)} results")
    # print(f"First result example:\n{search_query_retrieved_nodes[0]}\n")
    query_engine = index.as_query_engine(
        node_postprocessors=[reranker]
    )
    result = query_engine.query(query)
    return result

## Get the Fund Name

In [11]:
fund_name = get_response_to_query("What is the name of the fund?  The name of the fund is: ")
print(f"Fund Name: {fund_name}")

Fund Name: BlackRock Advantage International Fund


## Get responses to key queries

In [12]:
queries = [
    "What is the investment strategy of the fund?",
    "What are the investment objectives of the fund?",
    "Who are the key people in the management team?",
    "What is the investment philosophy of the fund regarding ESG (Environmental, Social, and Governance)?",
    "What industries, markets, or types of securities does the fund want exposure to?",
    "What investment tools (derivatives, leverage, etc.) does the fund use to achieve its investment goals?"
]

In [13]:
responses = []

for query in queries:
    result = get_response_to_query(query)
    responses.append(result)

# Materialize the pairs as a list so they can be iterated more than once
response_answer_pairs = list(zip(queries, responses))


In [14]:
response_answer_text = ""
for (query, response) in response_answer_pairs:
    response_answer_text = f"{response_answer_text}{query}\n{response}\n\n"

print(response_answer_text)

What is the investment strategy of the fund?
The fund aims to invest at least 80% of its assets in non-US equity securities and equity-like instruments of companies that are part of the MSCI EAFE Index and derivatives tied to the index. These assets include common stock, preferred stock, and convertible securities, among others.

What are the investment objectives of the fund?
The Fund’s investment objective is to provide long-term capital appreciation. The Fund is a series of BlackRock FundsSM, and it is managed by BlackRock Advisors, LLC.

Who are the key people in the management team?
Two portfolio managers and the investment manager are the key people in the management team, as per the provided context information.  The names of the portfolio managers are Raffaele Savi, Kevin Franklin, and Richard Mathieson. They have been working as senior managing directors and managing directors of BlackRock, Inc., since 2017. The investment manager, BlackRock Advisors, LLC, previously defined a

## Chat Engine

In [16]:
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        f"You are an expert Mutual Fund analyst for a bank, and you provide answers to your boss about whether the bank should purchase the fund named {fund_name}."
        f"  You have answered these key questions about the fund:\n {response_answer_text}"
    ),
)
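The `token_limit` passed to the memory buffer caps how much conversation history is sent along with each new message. The sketch below shows the general idea of a token-limited history under simplifying assumptions: `ToyChatMemory` is a hypothetical class, it counts whitespace-separated words instead of real tokens, and it is not how `ChatMemoryBuffer` is implemented.

```python
class ToyChatMemory:
    """Keeps every message but only returns the most recent ones that fit a token budget."""

    def __init__(self, token_limit):
        self.token_limit = token_limit
        self.messages = []  # list of (role, content) tuples, oldest first

    @staticmethod
    def count_tokens(text):
        # Assumption: one whitespace-separated word counts as one token
        return len(text.split())

    def put(self, role, content):
        self.messages.append((role, content))

    def get(self):
        """Walk backwards from the newest message, stopping before the budget is exceeded."""
        kept = []
        total = 0
        for role, content in reversed(self.messages):
            cost = self.count_tokens(content)
            if total + cost > self.token_limit:
                break
            kept.append((role, content))
            total += cost
        return list(reversed(kept))


memory_sketch = ToyChatMemory(token_limit=5)
memory_sketch.put("user", "one two three")
memory_sketch.put("assistant", "four five")
memory_sketch.put("user", "six seven eight")
print(memory_sketch.get())
```

With a budget of five words, the oldest message no longer fits and is left out of the returned history, while the two most recent messages are kept in their original order.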

In [17]:
chat_response = chat_engine.chat("What is the level of risk for the fund?")
print(chat_response)

The provided fund summary Prospectus outlines multiple types of risk associated with the BlackRock Advantage International Fund. Understanding these risks is critical in assessing the level of risk associated with an investment in the Fund. Here are the main types of risks identified:

1. Market Risk: Market risk is the risk associated with the possibility that markets, and thus the value of the assets the fund invests in, will decline due to changes in economic trends, market conditions, or events not related to specific issuers. These declines could be sharp and unpredictable.

2. Selection Risk: Selection risk refers to the risk that the securities selected by the fund's management will underperform relative to the market, relevant indices, or other funds with similar objectives and strategies. This risk is inherent in all actively managed investment portfolios. 

3. Liquidity Risk: Liquidity risk relates to the potential lack of a liquid secondary market for derivatives, making it 

In [18]:
chat_response = chat_engine.chat("is it higher or lower than most funds?")
print(chat_response)

The fund's 128% portfolio turnover rate indicates a higher turnover compared to many mutual funds. Turnover rates above 100% suggest that the fund's holdings change more frequently. More frequent changes in the fund's holdings may lead to higher transaction costs and potential capital gains tax liabilities for investors. 

This higher turnover rate signals more active trading and adjustments to the portfolio's composition. It's important to consider this factor in light of the fund's investment strategy and objectives, as higher turnover can lead to distribution of capital gains, affecting an investor's tax obligations. 

However, it is important to note that portfolio turnover rates vary across funds and depend on their investment strategies. Some strategies, such as index funds that aim to replicate specific market indices, typically have lower turnover rates because they make fewer trades. Compared to actively managed funds that frequently adjust their portfolios based on market con