# Chatbot based on internal data (PDFs)

Steps:
1. Read the PDFs and web pages
2. Chunk the PDFs and web pages
3. Create vector embeddings from the chunks
4. Add the embeddings to a vector database (this notebook uses FAISS)
5. Create a chatbot that queries the vector database to implement a RAG (retrieval-augmented generation) architecture

### Import Libraries

Load all the necessary modules and libraries. 

If any are missing, add them to requirements.txt and run `pip install -r requirements.txt` in the terminal.

In [627]:
import os
# Community integrations live in langchain_community in current LangChain releases
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader, AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_cohere import CohereEmbeddings

Load the environment variables, which contain the API key, from a .env file.

In [628]:
from dotenv import load_dotenv
load_dotenv()

False
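
A return value of `False` from `load_dotenv()` means that no variables were loaded, for example because no `.env` file was found. The following is a minimal sketch of a guard that fails early with a readable message when a key is missing. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and are not part of the notebook or of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to a .env file or export it in your shell."
        )
    return value


# Demonstration with a throwaway variable (hypothetical name)
os.environ["DEMO_API_KEY"] = "abc123"
print(require_env("DEMO_API_KEY"))
```

Calling such a helper for `COHERE_API_KEY` before building the embeddings client would surface a missing key immediately instead of at the first API request.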

### Reading the PDF and Webpages

Create a function that reads every PDF in a given folder with a document loader. Each page of each PDF becomes one Document.

https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

In [629]:
def read_pdfs(folder):
    file_loader = PyPDFDirectoryLoader(folder)
    pdfs = file_loader.load()
    return pdfs
    

In [630]:
pdfs = read_pdfs('data/WFM/GC')
print("Number of pages:",len(pdfs))
pdfs

Number of pages: 21


[Document(metadata={'producer': 'madbuild', 'creator': 'PyPDF', 'creationdate': '2024-10-21T10:38:45-05:00', 'moddate': '2024-10-21T10:38:45-05:00', 'title': 'WFM Adapter for Genesys Cloud - Product Overview Guide', 'author': '', 'subject': '', 'keywords': '', 'source': 'data\\WFM\\GC\\WFMAdapter_GCProductOverview.pdf', 'total_pages': 21, 'page': 0, 'page_label': '1'}, page_content='WFMAdapter\nProductOverviewGuide\nContactcenter:GenesysCloud\nPublishdate:October21,2024'),
 Document(metadata={'producer': 'madbuild', 'creator': 'PyPDF', 'creationdate': '2024-10-21T10:38:45-05:00', 'moddate': '2024-10-21T10:38:45-05:00', 'title': 'WFM Adapter for Genesys Cloud - Product Overview Guide', 'author': '', 'subject': '', 'keywords': '', 'source': 'data\\WFM\\GC\\WFMAdapter_GCProductOverview.pdf', 'total_pages': 21, 'page': 1, 'page_label': '2'}, page_content='Contents\n1AboutWFMAdapter 3\n1.1SupportedWFMsystems 3\n1.2Supportedmediachannels 3\n1.3WFMAdapterAWSregions 4\n1.4Supportedsinglesign-o

### Chunk the PDFs

An LLM can only handle a limited number of tokens at a time, so we split the PDFs into smaller chunks.
LangChain's RecursiveCharacterTextSplitter does this by splitting on a hierarchy of separators (paragraphs, then lines, then words) until each chunk fits within chunk_size.
Note that chunk_size and chunk_overlap are measured in characters by default, not tokens.

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter
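
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraph and line boundaries); the function name `chunk_text` is hypothetical and sizes are in characters.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary.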

In [631]:
def chunk_documents(documents, chunk_size=500, chunk_overlap=0):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = splitter.split_documents(documents)
    return chunks

In [632]:
chunked_pdfs = chunk_documents(documents=pdfs)
# chunked_pdfs = chunk_documents(documents=web_pages)  # Use the web pages instead of PDFs
chunked_pdfs[:3]  # preview the first three chunks

### Create vector embeddings

Text cannot be searched by meaning directly, so each chunk is converted into a vector embedding by an embedding model (here Cohere's embed-english-v3.0). The embeddings are then stored in a vector store; this notebook uses FAISS.
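
As an illustration of why embeddings are useful, the sketch below compares toy vectors with cosine similarity, a common measure of how closely two embeddings point in the same direction. The three-dimensional vectors are made up for the example; real embeddings from an embedding model have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction, 0.0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]      # made-up embedding of a question
similar = [2.0, 0.0, 2.0]    # same direction as the query
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, similar))
print(cosine_similarity(query, unrelated))
```

Chunks whose embeddings score highest against the question's embedding are the ones most likely to contain the answer.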

In [None]:
os.environ["COHERE_API_KEY"] = "COHERE_API_KEY"  # placeholder: substitute your real Cohere API key

In [634]:
embeds = CohereEmbeddings(
    cohere_api_key=os.getenv("COHERE_API_KEY"),
    model="embed-english-v3.0"
)

In [635]:
from langchain_community.vectorstores import FAISS
# Embed every chunk and build an in-memory FAISS index from the results
docsearch = FAISS.from_documents(chunked_pdfs, embeds)


In [636]:
retriever = docsearch.as_retriever()
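
Conceptually, a retriever embeds the question and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force top-k search by Euclidean distance over a toy in-memory index. The function `top_k`, the sample texts, and the two-dimensional vectors are all hypothetical; the default of `k=4` is assumed here to mirror the LangChain retriever default, and FAISS uses optimized index structures rather than a plain sort.

```python
import math


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k entries whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: math.dist(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy index of (chunk text, made-up embedding) pairs
toy_index = [
    ("shift schedules", [0.9, 0.1]),
    ("billing", [0.1, 0.9]),
    ("agent adherence", [0.8, 0.3]),
]

print(top_k([1.0, 0.0], toy_index, k=2))
# ['shift schedules', 'agent adherence']
```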

### Creating Prompt Template

Create a prompt template that combines the retrieved context with the user's question. The filled-in template is the text that is sent to the LLM.

In [637]:
from langchain_core.prompts import PromptTemplate
prompt_template = """Text: {context}

Question: {question}

Answer the question based on the PDF Document provided. If the text doesn't contain the answer, reply that the answer is not available.
Do Not Hallucinate"""

prompt = PromptTemplate.from_template(prompt_template)
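
To show what the template produces once its placeholders are filled, here is a plain-Python sketch that uses `str.format`, which behaves much like the default f-string style templates in LangChain for simple `{name}` placeholders. The helper `build_prompt` and the sample chunks are hypothetical, and the template is a shortened copy of the one above.

```python
TEMPLATE = (
    "Text: {context}\n\n"
    "Question: {question}\n\n"
    "Answer the question based on the PDF Document provided. "
    "If the text doesn't contain the answer, reply that the answer is not available."
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


sample_chunks = ["chunk one", "chunk two"]  # made-up retrieved text
print(build_prompt(sample_chunks, "What is WFM Adapter?"))
```

Joining the chunk texts explicitly keeps metadata and object representations out of the prompt, so the model only sees the page content.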

### LLM Model

The LLM generates the response to each query. The retrieved chunks and the question are inserted into the prompt template, and the filled-in prompt is sent to the model. We use Cohere's command model.

In [638]:
from langchain_community.llms import Cohere

# A lower temperature usually gives more deterministic, grounded answers
llm = Cohere(model="command", temperature=0.9)

In [639]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retrievable = RunnableParallel(
    {
        "context":retriever,
        "question":RunnablePassthrough()
    }
)
chain = retrievable | prompt | llm | StrOutputParser()
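
The `|` operator composes the steps left to right, and `RunnableParallel` feeds the same input to each of its branches and collects the results in a dict. The sketch below mimics that data flow in plain Python with stub components; `pipe`, `parallel`, `fake_retriever`, `fill_prompt`, and `fake_llm` are hypothetical stand-ins, not LangChain APIs.

```python
from functools import reduce


def pipe(*steps):
    """Compose steps left to right, similar in spirit to `a | b | c`."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def parallel(**branches):
    """Feed the same input to every branch and collect the results in a dict."""
    return lambda value: {name: branch(value) for name, branch in branches.items()}


def fake_retriever(question):
    # Stub: a real retriever would run a similarity search
    return ["WFM Adapter sends RTA feeds."]


def fill_prompt(inputs):
    return "Text: {context}\n\nQuestion: {question}".format(**inputs)


def fake_llm(prompt_text):
    # Stub: a real LLM would generate an answer from the prompt
    return f"[stub LLM saw {len(prompt_text)} characters]"


toy_chain = pipe(
    parallel(context=fake_retriever, question=lambda q: q),  # question passes through unchanged
    fill_prompt,
    fake_llm,
    str,  # stands in for the output parser
)

print(toy_chain("What is WFM Adapter?"))
```

In this sketch the context is a list, so the prompt receives its string representation. The same happens in the chain above, where the retriever returns a list of Document objects; adding a step that joins each document's `page_content` would give the model cleaner context.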

In [640]:
question = "What is WFM adapter"
output = chain.invoke(question)

In [641]:
output

' The WFMAdapter is a product that features a web user interface, API accessibility, and a Genesys Cloud app. Its purpose is to collect information from the Genesys Cloud and send RTA and historical feeds to the Reporting Client. '