# Cohere Document Search with Langchain

This example shows how to use the Python [langchain](https://python.langchain.com/docs/get_started/introduction) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need an access key to Cohere's API key, which you can sign up for at (https://dashboard.cohere.com/welcome/login). A free trial account will suffice, but will be limited to a small number of requests.
- After obtaining this key, store it in plain text in your home in directory in the `~/.cohere.key` file.
- (Optional) Upload some pdf files into the `source_documents` subfolder under this notebook. We have already provided some sample pdfs, but feel free to replace these with your own.

## Set up the RAG workflow environment

pip install langchain

pip install --upgrade pip setuptools wheel

pip install "blosc2~=2.0.0" "cython>=0.29.21"

pip install "black>=22.3.0"

In [1]:
from getpass import getpass
import os
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatCohere
from langchain.document_loaders import TextLoader
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms import Cohere
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.schema import HumanMessage, SystemMessage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

Set up some helper functions:

In [2]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [3]:
try:
    os.environ["COHERE_API_KEY"] = open(Path.home() / ".cohere.key", "r").read().strip()
except Exception:
    print(f"ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the source_documents folder and make sure there is at least 1 pdf file here
contains_pdf = False
directory_path = "./source_documents"
if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
for filename in os.listdir(directory_path):
    contains_pdf = True if ".pdf" in filename else contains_pdf
if not contains_pdf:
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key
ERROR: The ./source_documents subfolder must contain at least one .pdf file


## Start with a basic generation request without RAG augmentation

Let's start by asking the Cohere LLM a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's basic knowledge that we expect the LLM to know.

Instead, we want to ask it a question that is very domain-specific that it won't know the answer to. A good example would an obscure detail buried deep within a company's annual report. For example:

"*How many Vector scholarships in AI were awarded in 2022?*"

In [4]:
query = "What is IT Salary in Vietnam?"

## Now send the query to Cohere

pip install cohere

In [5]:
import cohere

# Replace 'your_cohere_api_key_here' with your actual Cohere API key
cohere_api_key = 'ht9CB7BKp1gXtE9RSGbwCplCOfTnwysPj2misv2y'

# Initialize the Cohere client
co = cohere.Client(api_key=cohere_api_key)

# Assuming you want to use the generate function, replace "your query here" with your actual prompt
query = "What is a salary in Vietnam?"  # Replace this with your actual query or prompt

# Use the 'co' client to call the generate function
response = co.generate(prompt=query, max_tokens=50)

# Access the first generation's text and print it
result = response.generations[0].text
print(f"Result: \n\n{result}")



Result: 

 A salary in Vietnam is the regular payment paid or provided to employees by an employer in return for work done. In Vietnam, salaries are typically paid on a monthly basis and can vary widely depending on the type of work, industry, and level of experience


llm = Cohere()
result = llm(query)
print(f"Result: \n\n{result}")

## Ingestion: Load and store the documents from source_documents

Start by reading in all the PDF files from `source_documents`, break them up into smaller digestible chunks, then encode them as vector embeddings.

import os

current_working_directory = os.getcwd()
print(f"Current Working Directory: {current_working_directory}")


pip install pypdf


import torch

# Check if CUDA is available
print(torch.cuda.is_available())

# If CUDA is available, print the version and available GPUs
if torch.cuda.is_available():
    print(torch.version.cuda)
    print(torch.cuda.device_count())
    print(torch.cuda.get_device_name(0))



conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch


pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu111


In [6]:

# Define the path to your directory containing the PDF files
directory_path = r'C:\Users\yulia\Desktop\RAG BOOTCAMP\Document_search\RAG_source_documents'

# Now, you can use `directory_path` with the loader
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source materials: {len(docs)}")



Number of source materials: 181


#Check CUDA Availability in PyTorch:
import torch
print(torch.cuda.is_available())


#If False, do the following steps:
#1.Reinstall PyTorch:
!pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu112


#2. Verify the installation
import torch

print(torch.__version__)
print(torch.cuda.is_available())


# Load the pdfs
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source materials: {len(docs)}")

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

# Define the embeddings model
model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print(f"Setting up the embeddings model...")
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)

print(f"Done")

In [10]:
#if your PC doesn't have graphic card, change configuration from 'cuda' to 'cpu'
# Load the pdfs
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source materials: {len(docs)}")

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

# Define the embeddings model
model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print(f"Setting up the embeddings model...")
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cpu'},
    encode_kwargs=encode_kwargs
)

print(f"Done")

Number of source materials: 181
Number of text chunks: 583
Setting up the embeddings model...
Done


# Retrieval: Make the document chunks available via a retriever

# Install FAISS for CPU
pip install faiss-cpu

#Verify FAISS Installation
import faiss
print("FAISS version:", faiss.__version__)


The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes)

In [11]:
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Retrieve the most relevant context from the vector store based on the query(No Reranking Applied)
docs = retriever.get_relevant_documents(query)

Let's see what results it found. Important to note, these results are in the order the retriever thought were the best matches.

In [12]:
pretty_print_docs(docs)

Document 1:

In Danang , the Game industry stands out by o昀昀ering the highest average salary range for IT 
professionals, reaching 44.5 million dong per month . In Hanoi , the Professional Services sector, 
including Accounting, Legal, and Business Consultin g, takes the lead with the highest monthly salary 
at 39.6 million dong . Meanwhile, Ho Chi Minh City  aligns with the country's trend, with Pharmaceuticals
----------------------------------------------------------------------------------------------------
Document 2:

in Danang with more than 5 years of experience has increased from 33% in the previous year's results t o over 50% . The 
increase in professionals with higher years of experience is likely to have a positive impact on the average salary. 
In Hanoi and Ho Chi Minh City, the average salaries this year slightly decreased compared to the previous year. In Hanoi, the 
reduction ratio was 5.5% , while in Ho Chi Minh City, it was 8%. A potential contributing factor to this

These results seem to somewhat match our original query, but we still can't seem to find the information we're looking for. Let's try sending our LLM query again including these results, and see what it comes up with.

In [16]:
print(f"Sending the RAG generation with query: {query}")
qa = RetrievalQA.from_chain_type(llm=llm,
        chain_type="stuff",
        retriever=retriever)
print(f"Result:\n\n{qa.run(query=query)}") 

Sending the RAG generation with query: What is a salary in Vietnam?


NameError: name 'llm' is not defined

# Reranking: Improve the ordering of the document chunks

In [None]:
compressor = CohereRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query)

Now let's see what the reranked results look like:

In [None]:
pretty_print_docs(compressed_docs)

Lastly, let's run our LLM query a final time with the reranked results:

qa = RetrievalQA.from_chain_type(llm=llm,
        chain_type="stuff",
        retriever=compression_retriever)

print(f"Result:\n\n {qa.run(query=query)}")