# Cohere Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining this key, store it in plain text in your home directory, in the `~/.cohere.key` file.
- (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.
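
The key file described above can be read with a few lines of standard-library Python. This is a minimal sketch, assuming the file holds nothing but the key; the helper name `read_api_key` is hypothetical and is not part of Cohere's or LlamaIndex's API. It returns `None` instead of raising when the file is missing or empty, so the caller can decide how to recover.

```python
from pathlib import Path
from typing import Optional


def read_api_key(key_path: Path = Path.home() / ".cohere.key") -> Optional[str]:
    """Return the API key stored at key_path, or None if it is missing or empty."""
    try:
        key = key_path.read_text(encoding="utf-8").strip()
    except OSError:
        # Covers a missing file, a directory at that path, and permission errors.
        return None
    # An empty or whitespace-only file is treated the same as a missing one.
    return key or None
```

A caller could fall back to an interactive prompt (for example with `getpass`) when the function returns `None`.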

## Set up the RAG workflow environment

In [32]:
from getpass import getpass
import os
from pathlib import Path

from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

import chromadb
from chromadb import Collection

Set up some helper functions:

In [12]:
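def pretty_print_docs(docs):
    # LlamaIndex documents and nodes expose their text via get_content()
    # (page_content is the LangChain attribute and does not exist here).
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )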

Make sure other necessary items are in place:

In [13]:
#try:
#    os.environ["COHERE_API_KEY"] = open(Path.home() / ".cohere.key", "r").read().strip()
#    os.environ["CO_API_KEY"] = open(Path.home() / ".cohere.key", "r").read().strip()
#except Exception:
#    print(f"ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the source-materials folder and make sure there is at least one PDF file in it
directory_path = "./source_documents"
if not os.path.isdir(directory_path):
    # Check this first: os.listdir() raises if the folder does not exist
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
elif not any(filename.lower().endswith(".pdf") for filename in os.listdir(directory_path)):
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

## LLM

In [14]:
#llm = Cohere(api_key=os.environ["COHERE_API_KEY"])
llm = Ollama(model="llama3", request_timeout=30.0)

Without additional information, Cohere is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from source-materials

Start by reading in all the PDF files from `source_documents`.

In [15]:
# Load the pdfs
pdf_folder_path = "./source_documents"
documents = SimpleDirectoryReader(input_files=[f"{pdf_folder_path}/Vanguard_ETF_Statutory_Prospectus_Single_VOE.pdf"]).load_data()
print(f"Number of source materials: {len(documents)}\n")

Number of source materials: 116



## Define an embeddings model

This embeddings model will convert the textual data from our PDF files into vector embeddings. These vector embeddings will later enable us to quickly find the chunk of text that most closely corresponds to our original query.
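
To make "most closely corresponds" concrete, here is a small sketch of nearest-vector lookup using cosine similarity. The three-dimensional vectors and chunk names are invented for illustration; real embedding models produce vectors with hundreds of dimensions. Cosine similarity is one common choice, and the metric the vector database actually uses depends on how the collection is configured.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def closest_chunk(query_vec: Sequence[float], chunk_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the name of the chunk whose vector is most similar to query_vec."""
    return max(chunk_vecs, key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]))


# Toy vectors, made up for this example only.
toy_chunks = {
    "fees": [0.9, 0.1, 0.0],
    "risks": [0.1, 0.9, 0.2],
    "management": [0.0, 0.2, 0.9],
}
toy_query = [0.2, 0.8, 0.1]
best = closest_chunk(toy_query, toy_chunks)
print(best)  # prints "risks"
```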

In [16]:
#embed_model = CohereEmbedding(
#    model_name="embed-english-v3.0",
#    input_type="search_query"
#)

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=200
)
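
The `chunk_size=200` setting controls how much text goes into each embedded chunk. The sketch below illustrates the general idea of fixed-size chunking with overlap. It is a simplification under stated assumptions: it counts whitespace-separated words, whereas LlamaIndex counts tokens and tries to respect sentence boundaries, and the `overlap` default of 20 is an arbitrary value chosen for this example. The function name `chunk_words` is hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Smaller chunks give more precise matches but less surrounding context per match; larger chunks do the opposite.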


## Storage: Store the documents in a vector database

In [18]:
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)
index = VectorStoreIndex(documents, service_context=service_context, storage_context=storage_context, show_progress=True)


Generating embeddings: 100%|██████████| 116/116 [00:04<00:00, 27.46it/s]


In [51]:
import json

data_json = json.dumps(chroma_collection.get())
data = json.loads(data_json)
chroma_client.delete_collection("newcollection")
collection = chroma_client.create_collection("newcollection")
collection.add(ids=data["ids"], embeddings=data["embeddings"], documents=data["documents"], metadatas=data["metadatas"])

## Retrieval: Now do a search to retrieve the chunk of document text that most closely matches our original query

## Set up the retriever and reranker

In [55]:

search_query_retriever = index.as_retriever(service_context=service_context)
#reranker = CohereRerank()

In [78]:
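# Persist the index metadata to disk, then reload it
index.storage_context.persist("./test_index")
# Reuse the Chroma vector store: the embeddings live there, not in ./test_index
storage_context = StorageContext.from_defaults(persist_dir="./test_index", vector_store=vector_store)
# If the persist directory holds more than one index, loading without index_id fails with:
#   ValueError: Expected to load a single index, but got 3 instead. Please specify index_id.
updated_index = load_index_from_storage(storage_context, index_id=index.index_id, service_context=service_context)
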

## Query Response pipeline

In [66]:
def get_response_to_query(query):
    # search_query_retrieved_nodes = search_query_retriever.retrieve(query)
    # print(f"Search query retriever found {len(search_query_retrieved_nodes)} results")
    # print(f"First result example:\n{search_query_retrieved_nodes[0]}\n")
    query_engine = index.as_query_engine(
        # streaming=True
        #node_postprocessors = [reranker]
    )
    result = query_engine.query(query)
    return result
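
Conceptually, the query engine retrieves the most relevant chunks and places them in the prompt sent to the LLM. The sketch below shows that augmentation step in isolation. The function name `build_augmented_prompt` and the prompt wording are assumptions made for illustration; they are not LlamaIndex's actual template.

```python
from typing import Sequence


def build_augmented_prompt(query: str, retrieved_chunks: Sequence[str]) -> str:
    """Combine retrieved text chunks and the user's query into one prompt string."""
    # Skip blank chunks and separate the rest with an empty line.
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks if chunk.strip())
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {query}\n"
        "Answer: "
    )
```

If retrieval returns no chunks, the context section is empty and the model has nothing to ground its answer on, which is why the retrieval step matters as much as the LLM itself.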

## Get the Fund Name

In [67]:
fund_name = get_response_to_query("What is the name of the fund? Give only the name without additional comments. The name of the fund is: ")
print(f"Fund Name: {fund_name}")

Fund Name: Vanguard Extended Market Index Fund


## Get responses to key queries

In [45]:
queries = [
    "What is the investment strategy of the fund?",
    "What are the investment objectives of the fund?",
    "Who are the key people in the management team?",
    "What is the investment philosophy of the fund regarding ESG (Environmental, Social, and Governance)?",
    "What industries, markets, or types of securities does the fund want exposure to?",
    "What investment tools (derivatives, leverage, etc.) does the fund use to achieve its investment goals?"
]

In [46]:
responses = []

for query in queries:
    result = get_response_to_query(query)
    responses.append(result.response)

# Materialize the pairs as a list so they can be iterated more than once
response_answer_pairs = list(zip(queries, responses))


In [47]:
response_answer_text = ""
for (query, response) in response_answer_pairs:
    response_answer_text = f"{response_answer_text}{query}\n{response}\n\n"

print(response_answer_text)

What is the investment strategy of the fund?
Empty Response

What are the investment objectives of the fund?
Empty Response

Who are the key people in the management team?
Empty Response

What is the investment philosophy of the fund regarding ESG (Environmental, Social, and Governance)?
Empty Response

What industries, markets, or types of securities does the fund want exposure to?
Empty Response

What investment tools (derivatives, leverage, etc.) does the fund use to achieve its investment goals?
Empty Response
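
Every query in the output above came back as `Empty Response`, which most likely means the query engine received no retrieved chunks to work with. Before passing these answers on to later steps, it can help to flag them. The sketch below assumes the marker string is exactly the one printed above; the helper name `find_unanswered` is hypothetical.

```python
from typing import Iterable, List, Tuple

EMPTY_MARKER = "Empty Response"  # the string seen in the output above


def find_unanswered(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the queries whose response is missing, blank, or the empty marker."""
    return [
        query
        for query, response in pairs
        if not response or response.strip() in ("", EMPTY_MARKER)
    ]
```

For example, `find_unanswered(zip(queries, responses))` would return all six queries for the run shown above, a sign that the index or vector store should be checked before querying again.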




## Chat Engine

In [15]:
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        f"You are an expert mutual fund analyst for a bank, and you provide answers to your boss about whether the bank should purchase the fund named {fund_name}."
        f"  You have answered these key questions about the fund:\n {response_answer_text}"
    ),
)
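
The `token_limit=1500` argument caps how much conversation history is replayed to the model on each turn. The sketch below illustrates the general idea of keeping only the most recent messages that fit a budget. It is an approximation under stated assumptions: it counts whitespace-separated words instead of real tokens, and the names `approx_tokens` and `trim_history` are hypothetical. `ChatMemoryBuffer` uses a real tokenizer and its own rules.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


def approx_tokens(text: str) -> int:
    """Crude token estimate: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: List[Message], token_limit: int = 1500) -> List[Message]:
    """Keep the newest messages whose combined estimated size fits token_limit."""
    kept: List[Message] = []
    used = 0
    # Walk backwards so the most recent messages are kept first.
    for role, content in reversed(messages):
        cost = approx_tokens(content)
        if used + cost > token_limit:
            break
        kept.append((role, content))
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Because the system prompt here also embeds `response_answer_text`, a long set of answers leaves less room for history within the model's context window.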

In [16]:
chat_response = chat_engine.chat("What is the level of risk for the fund?")
print(chat_response)

Based on the information provided in the prospectus, the level of risk for the Vanguard Total Stock Market Index Fund appears to be moderate to high. The fund invests at least 80% of its assets in stocks that make up its target index, which means it is exposed to the risks associated with stock market investments. The prospectus also highlights several key risks, including:

1. Stock market risk: The chance that stock prices overall will decline. Stock markets tend to move in cycles, with periods of rising prices and periods of falling prices.
2. Index sampling risk: The chance that the securities selected for the fund, in the aggregate, will not provide investment performance matching that of the fund's target index. This risk is expected to be low.
3. Derivatives risk: The use of derivatives, such as equity futures and total return swaps, can amplify gains or losses depending on market conditions. While these tools can help reduce transaction costs and stay fully invested, they also 

In [17]:
chat_response = chat_engine.chat("is it higher or lower than most funds?")
print(chat_response)

Based on the information provided in the prospectus, the level of risk for the Vanguard Total Stock Market Index Fund is likely to be higher than most other mutual funds. This is because the fund invests at least 80% of its assets in stocks that make up its target index, which means it is exposed to the risks associated with stock market investments. Additionally, the prospectus highlights several key risks, including stock market risk, index sampling risk, derivatives risk, and leverage risk, which can all contribute to a higher level of risk compared to other funds that may have more limited exposure to these risks.

It's important to note that the risk level of a mutual fund can vary depending on its investment strategy and the specific securities it holds. Some funds may be more conservative and focused on income generation, while others may take on more risk in pursuit of higher returns. It's always important to carefully evaluate the investment objectives and strategies of any mu