# Cohere Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it will be limited to a small number of requests.
- After obtaining this key, store it in plain text in your home directory in the `~/.cohere.key` file.
- (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.

## Set up the RAG workflow environment

In [7]:
from getpass import getpass
import os
from pathlib import Path

from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank

Set up some helper functions:

In [8]:
def pretty_print_docs(docs):
    # LlamaIndex documents and nodes expose their text through get_content()
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [9]:
try:
    # Read the key file once and set both environment variables from it
    cohere_api_key = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["COHERE_API_KEY"] = cohere_api_key
    os.environ["CO_API_KEY"] = cohere_api_key
except Exception:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the source_documents folder and make sure there is at least 1 PDF file in it
contains_pdf = False
directory_path = "./source_documents"
if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
else:
    # Only list the folder once we know it exists; os.listdir raises otherwise
    contains_pdf = any(filename.lower().endswith(".pdf") for filename in os.listdir(directory_path))
    if not contains_pdf:
        print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")
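The folder check above can also be written with `pathlib`. The following is a minimal sketch rather than the notebook's own code; the helper name `find_pdfs` is hypothetical. It returns an empty list when the folder is missing instead of raising, and it matches on the file suffix rather than on any occurrence of `.pdf` in the name.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path.

    Hypothetical helper for illustration; returns [] if the folder is missing.
    """
    folder = Path(directory)
    if not folder.is_dir():
        return []
    # Compare the lowercased suffix so "report.PDF" also counts,
    # while "notes.pdf.txt" does not.
    return sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


pdfs = find_pdfs("./source_documents")
if not pdfs:
    print("ERROR: ./source_documents must exist and contain at least one .pdf file")
else:
    print(f"Found {len(pdfs)} PDF file(s)")
```

Returning the list of paths, rather than a boolean, also makes it easy to pass every PDF to the loader later on.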

## Cohere LLM

In [10]:
llm = Cohere(api_key=os.environ["COHERE_API_KEY"])

Without additional information, Cohere is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading a PDF file from `source_documents`. This example loads a single file, a BlackRock mutual fund summary prospectus.

In [11]:
# Load the pdfs
pdf_folder_path = "./source_documents"
documents = SimpleDirectoryReader(
    input_files=[f"{pdf_folder_path}/Blackrock_MF_Summary_Prospectus_Single_BROAX-BROCX-BROIX-BGORX.pdf"]
).load_data()
print(f"Number of source materials: {len(documents)}\n")

Number of source materials: 12



## Define an embeddings model

This embeddings model will convert the textual data from our PDF files into vector embeddings. These vector embeddings will later enable us to quickly find the chunk of text that most closely corresponds to our original query.
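To make "most closely corresponds" concrete, here is a toy sketch of nearest-chunk lookup by cosine similarity. The three-dimensional vectors and chunk labels are made up for illustration; real embeddings come from the embeddings model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: each chunk of text is represented by one vector.
chunk_embeddings = {
    "fees and expenses": [0.9, 0.1, 0.0],
    "investment objective": [0.1, 0.9, 0.2],
    "portfolio managers": [0.0, 0.2, 0.9],
}
# Made-up embedding of the user's query.
query_embedding = [0.2, 0.8, 0.1]

# The best match is the chunk whose vector points in the most similar direction.
best_chunk = max(
    chunk_embeddings,
    key=lambda name: cosine_similarity(query_embedding, chunk_embeddings[name]),
)
print(best_chunk)
```

A vector store performs the same comparison, only over many more vectors and usually with an index that avoids scoring every chunk.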

In [12]:
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=200
)



## Storage: Store the documents in a vector database

In [13]:
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes: 100%|██████████| 12/12 [00:00<00:00, 433.76it/s]
Generating embeddings: 100%|██████████| 197/197 [00:05<00:00, 35.72it/s]
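The progress output shows the 12 loaded documents being split into 197 chunks before embedding. As a rough illustration of what chunking does, the sketch below splits text into overlapping windows of whitespace-separated words. This is a simplification under stated assumptions: LlamaIndex's own splitter measures `chunk_size` in tokens and tries to respect sentence boundaries, and the `overlap` default here is an arbitrary illustrative value.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# A synthetic 500-word "document": w0 w1 ... w499
sample = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(sample)
print(len(chunks))
```

Smaller chunks give more precise retrieval targets but less context per chunk, which is the trade-off behind the `chunk_size` setting.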


## Retrieval: Now do a search to retrieve the chunk of document text that most closely matches our original query

## Set up the retriever and reranker

In [20]:

search_query_retriever = index.as_retriever(service_context=service_context)
reranker = CohereRerank()
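The retriever and reranker form a two-stage search: a fast first pass selects candidate chunks, and a second pass re-scores only those candidates. The sketch below illustrates that pattern with two made-up scoring functions. Both are stand-ins for illustration only; the real pipeline uses embedding similarity for the first stage and Cohere's rerank model for the second.

```python
def keyword_overlap(query, text):
    """First-stage score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def overlap_density(query, text):
    """Second-stage score: overlap normalized by text length (stand-in reranker)."""
    words = text.lower().split()
    return keyword_overlap(query, text) / len(words) if words else 0.0


def retrieve_then_rerank(query, chunks, top_k=3, top_n=1):
    # Stage 1: keep the top_k candidates under the cheap score.
    candidates = sorted(
        chunks, key=lambda c: keyword_overlap(query, c), reverse=True
    )[:top_k]
    # Stage 2: re-order only those candidates and keep the best top_n.
    return sorted(
        candidates, key=lambda c: overlap_density(query, c), reverse=True
    )[:top_n]


chunks = [
    "this section lists fees and expenses of the fund and is unrelated to the "
    "investment objective described elsewhere in the prospectus",
    "the investment objective is long-term capital appreciation",
    "portfolio managers of the fund",
    "shareholder mailing address",
]
results = retrieve_then_rerank("fund investment objective", chunks)
print(results)
```

In this toy data the first stage ranks the long fees chunk highest because it happens to contain the most query words, and the second stage promotes the shorter, more focused chunk.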

## Query Response pipeline

In [21]:
def get_response_to_query(query):
    # search_query_retrieved_nodes = search_query_retriever.retrieve(query)
    # print(f"Search query retriever found {len(search_query_retrieved_nodes)} results")
    # print(f"First result example:\n{search_query_retrieved_nodes[0]}\n")
    query_engine = index.as_query_engine(
        node_postprocessors = [reranker]
    )
    result = query_engine.query(query)
    return result
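Conceptually, a query engine takes the retrieved (and reranked) chunks, places them in a prompt alongside the question, and sends that prompt to the LLM. The sketch below shows only the prompt-assembly step. The function name and the template wording are hypothetical; LlamaIndex uses its own prompt templates, which differ from this one.

```python
def build_augmented_prompt(query, context_chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    # Number the chunks so they are easy to tell apart in the prompt.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Context information is below.\n"
        f"{context}\n\n"
        "Using only the context above, answer the question.\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_augmented_prompt(
    "What is the name of the fund?",
    ["Placeholder text of the first retrieved chunk.",
     "Placeholder text of the second retrieved chunk."],
)
print(prompt)
```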

## Get the Fund Name

In [24]:
fund_name = get_response_to_query("What is the name of the fund?  The name of the fund is: ")
print(f"Fund Name: {fund_name}")

Fund Name: BlackRock Advantage International Fund


## Get responses to key queries

In [32]:
queries = [
    "What is the investment strategy of the fund?",
    "What are the investment objectives of the fund?",
    "Who are the key people in the management team?",
    "What is the investment philosophy of the fund regarding ESG (Environmental, Social, and Governance)?",
    "What industries, markets, or types of securities does the fund want exposure to?",
    "What investment tools (derivatives, leverage, etc.) does the fund use to achieve its investment goals?"
]

In [33]:
responses = []

for query in queries:
    result = get_response_to_query(query)
    responses.append(result)

# Materialize the pairs in a list so they can be iterated more than once
response_answer_pairs = list(zip(queries, responses))
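One Python detail matters when pairing queries with responses in a notebook: `zip` returns a one-shot iterator, so a cell that loops over the pairs sees nothing on a second run unless the pairs were first stored in a list. A small demonstration with placeholder strings:

```python
queries = ["q1", "q2"]
responses = ["r1", "r2"]

# A bare zip object can be consumed only once.
pairs_iter = zip(queries, responses)
first_pass = list(pairs_iter)
second_pass = list(pairs_iter)  # the iterator is already exhausted

# Wrapping zip in list() gives a sequence that can be reused freely.
pairs_list = list(zip(queries, responses))
reuse_one = [q for q, _ in pairs_list]
reuse_two = [r for _, r in pairs_list]

print(first_pass, second_pass, reuse_one, reuse_two)
```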


In [34]:
response_answer_text = ""
for (query, response) in response_answer_pairs:
    response_answer_text = f"{response_answer_text}{query}\n{response}\n\n"

print(response_answer_text)

What is the investment strategy of the fund?
The fund aims to invest at least 80% of its assets in non-US equity securities and equity-like instruments of companies that are part of the MSCI EAFE Index and derivatives tied to the index. These assets include a variety of common stocks, preferred stocks, and convertible securities. The MSCI EAFE index is a capitalization-weighted index inclusive of a broad range of industries chosen for market size, liquidity, and industry group representation.

What are the investment objectives of the fund?
The Fund's investment objective is to provide long-term capital appreciation. This objective is to serve the purpose of providing a reasonable return on your investment while also considering safety and liquidity.

Who are the key people in the management team?
Raffaele Savi, Kevin Franklin, and Richard Mathieson have served as portfolio managers for the fund for 2 years, and they have the titles Senior Managing Director and Managing Director at Bla

## Chat Engine

In [38]:
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        f"You are an expert Mutual Fund analyst for a bank, and you provide answers to your boss about whether the bank should purchase the fund named {fund_name}."
        f"  You have answered these key questions about the fund:\n {response_answer_text}"
    ),
)
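The `token_limit` on the memory buffer caps how much chat history is sent with each new message. The idea can be sketched with a toy class that returns only the most recent messages fitting within a budget. This is an illustration, not LlamaIndex's implementation: the class name is hypothetical, and it counts whitespace-separated words where a real buffer would use a tokenizer.

```python
class TokenLimitedMemory:
    """Toy chat memory that returns the newest messages within a token budget."""

    def __init__(self, token_limit):
        self.token_limit = token_limit
        self.messages = []

    @staticmethod
    def count_tokens(text):
        # Assumption: one whitespace-separated word counts as one token.
        return len(text.split())

    def add(self, role, content):
        self.messages.append((role, content))

    def get(self):
        """Return the longest run of most-recent messages that fits the budget."""
        kept, used = [], 0
        for role, content in reversed(self.messages):
            cost = self.count_tokens(content)
            if used + cost > self.token_limit:
                break
            kept.append((role, content))
            used += cost
        return list(reversed(kept))


toy_memory = TokenLimitedMemory(token_limit=6)
toy_memory.add("user", "What is the level of risk")
toy_memory.add("assistant", "Mostly market risk")
toy_memory.add("user", "higher or lower")
history = toy_memory.get()
print(history)
```

With a budget of 6 words, the oldest message no longer fits and is left out, which is why a follow-up question can lose context from early in a long conversation.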

In [39]:
chat_response = chat_engine.chat("What is the level of risk for the fund?")
print(chat_response)

The level of risk for the fund is determined by the fund's investment objectives and strategies, along with the expertise and skill of the fund's portfolio managers and analysts. These professionals make investment decisions on behalf of the fund's shareholders with the goal of achieving the fund's investment objectives. 

The fund invests primarily in equity securities and equity-like instruments of companies that are part of the MSCI EAFE Index and derivatives tied to the index. It also invests in a variety of common stocks, preferred stocks, and convertible securities. These investment practices are subject to market risk, which is the possibility that the value of the underlying securities of the fund may decline due to changes in general market conditions, economic trends, or events not specifically related to the issuers of the securities. These are factors that can impact a particular issuer, exchange, country, region, market, industry, group of industries, sector, or asset clas

In [40]:
chat_response = chat_engine.chat("is it higher or lower than most funds?")
print(chat_response)

Risk is a subjective assessment, and it is challenging to make precise comparisons between different mutual funds. Each fund has a unique investment strategy and different types of risk exposures. The risk profile of the BlackRock Advantage International Fund encompasses the risks associated with its investment objectives and strategies, as well as the expertise and skill of its portfolio managers and analysts.

The fund documents disclose that it invests in equity securities and equity-like instruments of companies that are components of the MSCI EAFE Index and derivatives tied to the index. The fund also invests in a variety of common stocks, preferred stocks, and convertible securities. These practices are subject to market risk, which is the possibility that the value of the underlying securities of the fund may decline due to changes in general market conditions, economic trends, or events not specifically related to the issuers of the securities. 

The fund utilizes derivatives, 