# Cohere Document Search with Langchain

This example shows how to use the Python [langchain](https://python.langchain.com/docs/get_started/introduction) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining this key, store it in plain text in the `~/.cohere.key` file in your home directory.
- (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.
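
Reading the key file can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `load_cohere_key` and its `prompt_if_missing` parameter are hypothetical names, and falling back to an interactive prompt is an assumption about how you might want to handle a missing file.

```python
from getpass import getpass
from pathlib import Path


def load_cohere_key(key_path=None, prompt_if_missing=True):
    """Return the API key stored in key_path (default: ~/.cohere.key).

    Hypothetical helper: if the file is missing or empty, it asks for the
    key interactively when prompt_if_missing is True, and raises otherwise.
    """
    key_path = Path(key_path) if key_path else Path.home() / ".cohere.key"
    if key_path.is_file():
        key = key_path.read_text().strip()
        if key:
            return key
    if prompt_if_missing:
        return getpass("Cohere API key: ").strip()
    raise FileNotFoundError(f"No API key found at {key_path}")
```

The result could then be assigned to `os.environ["COHERE_API_KEY"]` before any Cohere client is created.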

## Set up the RAG workflow environment

In [1]:
from getpass import getpass
import os
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatCohere
from langchain.document_loaders import TextLoader
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms import Cohere
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.schema import HumanMessage, SystemMessage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

Set up some helper functions:

In [2]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [3]:
##  Set up all of the parameters here
##

#directory_path = "./rfp_documents"
#directory_path = "./sf_documents"
#directory_path = "./ats_documents"
directory_path = "./ats_fr_montreal_documents"

multilingual_mode = True
chunk_size = 1000
rerank_top_n = 15

if multilingual_mode:
    embeddings_model = "embed-multilingual-light-v3.0"
else:
    embeddings_model = "embed-english-light-v3.0"

#query = "What are the different modes for junction management or conflict convergence?"
#query = "List requirements for route diversions and rerouting."
#query = "List requirements for online or runtime timetable editing."
#query = "Does the Montreal Blue Line spec contain any requirements for editing a timetable at runtime vs offline?"
query = "List any requirements for editing a timetable at runtime vs offline?"


In [4]:
try:
    os.environ["COHERE_API_KEY"] = (Path.home() / ".cohere.key").read_text().strip()
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the documents folder and make sure it contains at least one PDF file
contains_pdf = False

if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
else:
    # Only list the folder once we know it exists; os.listdir would raise otherwise
    for filename in os.listdir(directory_path):
        if filename.lower().endswith(".pdf"):
            contains_pdf = True
            print(f"found {filename}")
    if not contains_pdf:
        print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

found Montreal-Blue-Line-ATS-Specs-French.pdf
found File_6_-_SFMTA-2022-40-FTA-SAApndxA-ContractSpecs-RevAdd5_CHAP17.pdf
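
The folder check can also be written with `pathlib`. This is a sketch under stated assumptions: `find_pdfs` is a hypothetical helper name, and matching on the file suffix (case-insensitively) is a design choice that differs from a plain substring test, which would also accept a name such as `notes.pdf.txt`.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the sorted names of the PDF files found directly inside directory."""
    folder = Path(directory)
    if not folder.is_dir():
        raise FileNotFoundError(f"The {directory} subfolder must exist under this notebook")
    # Compare the suffix in lower case so that "report.PDF" is also accepted
    return sorted(
        p.name for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

An empty list means the folder exists but holds no PDF files, so the caller can decide whether that is an error.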


## Start with a basic generation request without RAG augmentation

Let's start by asking the Cohere LLM a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's basic knowledge that we expect the LLM to know.

Instead, we want to ask a very domain-specific question that it won't know the answer to. A good example would be an obscure detail buried deep within a company's annual report. For example:

"*How many Vector scholarships in AI were awarded in 2022?*"

## Now send the query to Cohere

In [5]:
from langchain_core.messages import HumanMessage
#llm = Cohere()
llm = ChatCohere(model="command-r")
# Uncomment to send the query without any retrieved context:
# result = llm.invoke([HumanMessage(content=query)]).content
# print(f"Result: \n\n{result}")

## Ingestion: Load and store the documents from source_documents

Start by reading in all the PDF files from `source_documents`, then break them up into smaller, digestible chunks and encode those chunks as vector embeddings.
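
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical function that cuts fixed-size character windows.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is what keeps a sentence that straddles a boundary visible in at least one chunk.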

In [6]:
# Load the pdfs
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source materials: {len(docs)}")

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

# Define the embeddings model
#model_name = "BAAI/bge-small-en-v1.5"
#encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print(f"Setting up the embeddings model...")
#embeddings = HuggingFaceBgeEmbeddings(
#    model_name=model_name,
#    model_kwargs={'device': 'cuda'},
#    encode_kwargs=encode_kwargs
#)

from langchain_community.embeddings.cohere import CohereEmbeddings

embeddings = CohereEmbeddings(model=embeddings_model)

print(f"Done")

Number of source materials: 94
Number of text chunks: 352
Setting up the embeddings model...
Done


## Retrieval: Make the document chunks available via a retriever

The retriever identifies the document chunks that most closely match our original query. (This step takes about 1-2 minutes.)
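
Conceptually, "most closely match" means comparing the query's embedding with each chunk's embedding and keeping the `k` nearest. The sketch below illustrates that idea with cosine similarity in plain Python. It is an assumption-laden illustration: `cosine_similarity` and `top_k` are hypothetical names, the vectors are toy values, and FAISS uses its own index structures and may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=20):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], chunk_vectors, k=2))
```

A brute-force scan like this is linear in the number of chunks, which is fine for a few hundred chunks but is the reason dedicated vector indexes exist.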

In [7]:
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})



Let's see what results it found. Note that these results are listed in the order the retriever ranked them, best match first.

In [8]:
# Retrieve the most relevant context from the vector store based on the query (no reranking applied)
docs = retriever.get_relevant_documents(query)
    

In [9]:
pretty_print_docs(docs)

Document 1:

shall be loaded from a Configurable star t time set by the Transportation Controller. 
13. The ATS subsystem shall have schedule editing tool s to modify the timetable database prior and 
during run time to account for potential changes or  during operations (i.e., changing timepoints, re-
baselining the schedule, service imp acts/cancellations, special events). 
a. The ATS subsystem shall have functional interf aces to synchronize schedule changes made in 
ATS to OrbCAD as described in Section 28 (Interface Requirements Specifications). 
b. The ATS subsystem shall modify the timetabl e database during run time to process same-day 
schedule changes received from OrbCAD in or der to remain synchronized between OrbCAD 
and ATS. 
c. Contractor shall confer with SFMTA during the Design Phase and confirm the specific 
schedule editing tools needed to meet Contr act requirements and SFMTA’s operational needs 
as described in the Concept of Operations and Maintenance.
-----------

These results partly match our original query, but they still don't contain the information we're looking for. Let's try sending our LLM query again, this time including these results, and see what it comes up with.

In [10]:
#print(f"Sending the RAG generation with query: {query}")
#qa = RetrievalQA.from_chain_type(llm=llm,
#        chain_type="stuff",
#        retriever=retriever)
#print(f"Result:\n\n{qa.run(query=query)}") 

## Reranking: Improve the ordering of the document chunks
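
Reranking is a second pass: the retriever returns a generous candidate list, and a separate scoring step re-orders those candidates against the query and keeps only the best `top_n`. The sketch below shows that two-stage shape with a toy scorer. It is not Cohere's rerank model: `keyword_overlap_score` and `rerank` are hypothetical names, and word overlap is a stand-in for a learned relevance score.

```python
def keyword_overlap_score(query, text):
    """Toy relevance score: fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, top_n=15, score_fn=keyword_overlap_score):
    """Re-score first-stage candidates and keep the top_n, highest score first."""
    ordered = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ordered[:top_n]


candidates = [
    "the system shall archive operating data",
    "tools to edit the timetable offline",
    "edit the timetable at runtime and offline",
]
print(rerank("edit timetable at runtime", candidates, top_n=2))
```

Keeping `top_n` smaller than the retriever's `k` is what makes the second pass useful: the first stage favours recall, the second favours precision.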

In [11]:
#compressor = CohereRerank(top_n=15)
if multilingual_mode:
    compressor = CohereRerank(top_n=rerank_top_n, model="rerank-multilingual-2")
else:
    compressor = CohereRerank(top_n=rerank_top_n)
    
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.get_relevant_documents(query)
    

In [12]:
compressed_docs 

[Document(page_content='tous les trains dont la course est complétée et changer le mode de départ des terminus à «\u2009hors \nservice\u2009». \n5.7.6. MODIFIER LES HORAIRES HORS-LIGNE \n [R] L’ATS CBTC doit permettre de visualiser une table horaire sans la charger pour l’exploitation \n(hors-ligne) via l’IPM. \n [R] L’ATS CBTC doit fournir les outils nécessaires à l’IPM pour modifier un horaire hors-ligne et en \nvérifier la cohérence. Les mêmes commandes de modifications que pour un horaire chargé doivent \nêtre disponibles. \n [R] L’ATS CBTC doit afficher une alarme à l’IPM lorsqu’une modification de l’horaire induit une \nincohérence. \n [R] Lorsque les modifications faites à une table horaire hors-ligne sont valides, l’ATS CBTC doit \npermettre de sauvegarder les modifications pour utilisation ultérieure. \n5.7.7. ARCHIVER LES DONNÉES D’EXPLOITATION \n [R] L’ATS CBTC doit archiver toutes les données des horaires réalisés incluant, au minimum,', metadata={'source': 'ats_fr_montreal

Now let's see what the reranked results look like:

In [13]:
pretty_print_docs(compressed_docs)

Document 1:

tous les trains dont la course est complétée et changer le mode de départ des terminus à « hors 
service ». 
5.7.6. MODIFIER LES HORAIRES HORS-LIGNE 
 [R] L’ATS CBTC doit permettre de visualiser une table horaire sans la charger pour l’exploitation 
(hors-ligne) via l’IPM. 
 [R] L’ATS CBTC doit fournir les outils nécessaires à l’IPM pour modifier un horaire hors-ligne et en 
vérifier la cohérence. Les mêmes commandes de modifications que pour un horaire chargé doivent 
être disponibles. 
 [R] L’ATS CBTC doit afficher une alarme à l’IPM lorsqu’une modification de l’horaire induit une 
incohérence. 
 [R] Lorsque les modifications faites à une table horaire hors-ligne sont valides, l’ATS CBTC doit 
permettre de sauvegarder les modifications pour utilisation ultérieure. 
5.7.7. ARCHIVER LES DONNÉES D’EXPLOITATION 
 [R] L’ATS CBTC doit archiver toutes les données des horaires réalisés incluant, au minimum,
------------------------------------------------------------------------

Lastly, let's run our LLM query a final time with the reranked results:
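
One way to assemble such a grounded prompt is to label every excerpt with its `source` and `page` explicitly instead of passing the raw list of documents. This is a sketch, not the notebook's method: `build_grounded_prompt` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `page_content` and `metadata` attributes of the retrieved documents.

```python
from types import SimpleNamespace


def build_grounded_prompt(query, docs):
    """Combine the query with numbered excerpts, each labelled with source and page."""
    blocks = []
    for i, d in enumerate(docs, start=1):
        source = d.metadata.get("source", "unknown")
        page = d.metadata.get("page", "unknown")
        blocks.append(f"[{i}] source: {source}; page: {page}\n{d.page_content}")
    instructions = "Be sure to state the source (`source` and `page`) that you used."
    return f"{query}\n\n{instructions}\n\nDocuments:\n\n" + "\n\n".join(blocks)


# Stand-in documents with made-up content, for illustration only.
sample_docs = [
    SimpleNamespace(page_content="Excerpt A.", metadata={"source": "a.pdf", "page": 3}),
    SimpleNamespace(page_content="Excerpt B.", metadata={}),
]
print(build_grounded_prompt("List any requirements.", sample_docs))
```

Explicit labels make it easier for the model to cite a source, and they keep the prompt free of the extra characters that a Python `repr` of the document list would add.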

In [14]:
query_grounded = "{}; Be sure to state the source (`source` and `page`) that you used; Translate the sources to English if needed; Compare the two sources in detail;  Documents: {}".format(query, compressed_docs)

# Use invoke() rather than calling the model object directly, which is deprecated
result = llm.invoke([HumanMessage(content=query_grounded)]).content
print(f"Result:\n\n {result}")

#qa = RetrievalQA.from_chain_type(llm=llm,
#        chain_type="stuff",
#        retriever=compression_retriever)

#print(f"Result:\n\n {qa.run(query=query)}")


Result:

 I found some information on editing train timetables in two sources, which appear to be related documents. Both sources are in French, apparently originating from Montreal, Canada; I will provide an English translation of the requirements they outline. 

## Source 1: File_6_-_SFMTA-2022-40-FTA-SAApndxA-ContractSpecs-RevAdd5_CHAP17.pdf
This source's requirements for editing timetables are as follows:

- The ATS (Automatic Train Supervision) subsystem should have tools to edit the timetable database both prior to and during runtime to account for potential changes, including same-day schedule changes.
- The ATS subsystem should have functional interfaces to synchronize schedule changes made in ATS with a system called OrbCAD.
- The ATS subsystem should modify the timetable database during operation to process same-day schedule changes received from OrbCAD, remaining synchronized.
- The specific schedule editing tools needed to meet contract requirements and operational needs sh