# Cohere Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can get by signing up at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining this key, store it in plain text in the `~/.cohere.key` file in your home directory.
- (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.

## Set up the RAG workflow environment

In [2]:
from getpass import getpass
import os
from pathlib import Path

from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank

Set up some helper functions:

In [3]:
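def pretty_print_docs(docs):
    """Print each retrieved document or node, separated by a horizontal rule."""
    # LlamaIndex documents and nodes expose their text through get_content()
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )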

Make sure other necessary items are in place:

In [4]:
try:
    # Read the key once and expose it under both environment variable names
    with open(Path.home() / ".cohere.key", "r") as key_file:
        cohere_api_key = key_file.read().strip()
    os.environ["COHERE_API_KEY"] = cohere_api_key
    os.environ["CO_API_KEY"] = cohere_api_key
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the source_documents folder and make sure it contains at least one PDF file
contains_pdf = False
directory_path = "./source_documents"
if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
else:
    # Only list the folder once we know it exists, otherwise os.listdir raises
    contains_pdf = any(
        filename.lower().endswith(".pdf") for filename in os.listdir(directory_path)
    )
    if not contains_pdf:
        print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

## Cohere LLM

In [5]:
llm = Cohere(api_key=os.environ["COHERE_API_KEY"])

Without additional information, Cohere is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in the PDF file we want to query from `source_documents`.

In [7]:
# Load the PDF
pdf_folder_path = "./source_documents"
documents = SimpleDirectoryReader(
    input_files=[f"{pdf_folder_path}/Blackrock_MF_Summary_Prospectus_Single_BROAX-BROCX-BROIX-BGORX.pdf"]
).load_data()
print(f"Number of source materials: {len(documents)}\n")

Number of source materials: 12



## Define an embeddings model

This embeddings model will convert the textual data from our PDF files into vector embeddings. These vector embeddings will later enable us to quickly find the chunk of text that most closely corresponds to our original query.

In [8]:
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=200
)

[nltk_data] Downloading package punkt to
[nltk_data]     /tmp/fmuVvbHzYtyIqPnK/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.
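
To make the idea of "finding the chunk that most closely corresponds to the query" concrete, here is a minimal sketch of similarity search over embeddings. It is an illustration only: the three-dimensional vectors are made-up stand-ins for real embeddings, and the `cosine_similarity` and `top_k` helpers are hypothetical names, not part of LlamaIndex or Cohere. The vector store used in this notebook may rank results differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(chunk_vecs)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" (assumed values, for illustration only)
toy_chunks = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.9, 0.1, 0.0]

best = top_k(toy_query, toy_chunks, k=2)
print(best)
```

The query vector points almost in the same direction as the first chunk, so that chunk is ranked first, followed by the second chunk, which shares part of its direction.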


## Storage: Store the documents in a vector database

In [8]:
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/12 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/197 [00:00<?, ?it/s]

## Retrieval: Now do a search to retrieve the chunk of document text that most closely matches our original query

In [20]:
queries = [
    "What is the investment strategy of the fund?",
    "What are the investment objectives of the fund?",
    "Who are the key people in the management team?",
    "What is the investment philosophy of the fund regarding ESG (Environmental, Social, and Governance)?",
    "What industries, markets, or types of securities does the fund want exposure to?",
    "What investment tools (derivatives, leverage, etc.) does the fund use to achieve its investment goals?"
]

## Set up the retriever and reranker

In [21]:

search_query_retriever = index.as_retriever(service_context=service_context)
reranker = CohereRerank()

## Get responses to key queries

In [22]:
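responses = []

# Build the query engine once; the reranker reorders retrieved nodes before the answer is generated
query_engine = index.as_query_engine(
    node_postprocessors=[reranker]
)

for query in queries:
    # Raw retrieval without reranking, kept for inspection
    search_query_retrieved_nodes = search_query_retriever.retrieve(query)
    # print(f"Search query retriever found {len(search_query_retrieved_nodes)} results")
    # print(f"First result example:\n{search_query_retrieved_nodes[0]}\n")
    result = query_engine.query(query)
    responses.append(result)

# Materialize the pairs as a list so they can be iterated more than once
response_answer_pairs = list(zip(queries, responses))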

In [23]:
for (query, response) in response_answer_pairs:
    print(f"{query}\n{response}\n")

What is the investment strategy of the fund?
The fund aims to invest at least 80% of its assets in non-US equity securities and equity-like instruments of companies that are part of the MSCI EAFE Index and derivatives tied to the index. These assets include a diverse range of industries that are chosen for market size, liquidity, and industry group representation. The fund primarily buys common stock, may also invest in preferred stock and convertible securities, and participates in new issues and IPOs.

What are the investment objectives of the fund?
The Fund’s investment objective is to provide long-term capital appreciation. The Fund aims to achieve this objective by investing primarily in equity securities, such as common stocks, of non-U.S. companies that are considered to be undervalued. These securities may include investments in emerging markets. Prior to October 16, 2017, the Fund was known as the BlackRock International Fund. 

Would you like to know more about this fund?

Wh

That first result doesn't look quite right, but it's close. Could it be that the retrieval returned the passage we wanted, but the results came back out of order? Let's try using a reranker to check which of our results is the closest match.
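
As a rough illustration of what reranking does, the sketch below rescores a handful of candidate passages against the query and reorders them. It is not how `CohereRerank` works internally: a simple word-overlap score stands in for the learned relevance model, and the `overlap_score` and `rerank` helpers and the candidate passages are hypothetical, made up for this example.

```python
def overlap_score(query, text):
    """Fraction of distinct query words that also appear in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def rerank(query, candidates, top_n=2):
    """Reorder candidates by relevance score and keep the best top_n."""
    ranked = sorted(
        candidates,
        key=lambda text: overlap_score(query, text),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up passages, in the order a first-pass retriever might return them
candidates = [
    "the fund was renamed in 2017",
    "the investment strategy of the fund focuses on equity securities",
    "fees and expenses of the fund",
]

reranked = rerank("investment strategy of the fund", candidates, top_n=2)
for passage in reranked:
    print(passage)
```

The passage that the first-pass order put second moves to the top because it matches every query word, which is the kind of reordering a reranker is meant to provide.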