Problem statement:

This notebook builds a question-answering (QA) chatbot over a document. The data used here is loaded from a CSV file, but the same approach applies to a PDF or Word document.

The first way to build a QA chatbot is to use load_qa_chain with one of two chain types: stuff (passes the entire document to the model in a single prompt) or map_reduce (splits a long document into batches, summarizes each batch, and combines the results).

The model's context length is limited to 4097 tokens, so the stuff chain type fails on long documents. Changing the chain type to map_reduce addresses this problem.


https://github.com/hwchase17/langchain/issues/1349
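
The choice between the two chain types can be sketched as a simple token budget check. This is a minimal illustration, not LangChain's own logic: `choose_chain_type`, `rough_token_count`, and the `reserved_for_answer` value of 256 are all hypothetical names and numbers introduced here, and the word-based counter is only a crude stand-in for a real tokenizer such as tiktoken.

```python
from typing import Callable, List

CONTEXT_LIMIT = 4097  # token limit quoted in the problem statement


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def choose_chain_type(
    docs: List[str],
    question: str,
    count_tokens: Callable[[str], int] = rough_token_count,
    limit: int = CONTEXT_LIMIT,
    reserved_for_answer: int = 256,  # assumed budget for the model's reply
) -> str:
    """Return "stuff" if question plus all documents fit in one prompt, else "map_reduce"."""
    prompt_tokens = count_tokens(question) + sum(count_tokens(d) for d in docs)
    if prompt_tokens + reserved_for_answer <= limit:
        return "stuff"
    return "map_reduce"


short_docs = ["High blood pressure is a condition."] * 3
long_docs = ["word " * 1000] * 5  # roughly 5000 tokens in total

print(choose_chain_type(short_docs, "What is high blood pressure?"))
print(choose_chain_type(long_docs, "What is high blood pressure?"))
```

With the sample inputs, the short document set fits in one prompt and the long one does not, so the two calls pick stuff and map_reduce respectively.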


In [1]:
#!pip uninstall langchain

!pip install --force-reinstall typing-extensions==4.5.0

!pip install langchain openai chromadb tiktoken pypdf panel sentence_transformers
!pip install faiss-gpu

Collecting typing-extensions==4.5.0
  Downloading typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.7.1
    Uninstalling typing_extensions-4.7.1:
      Successfully uninstalled typing_extensions-4.7.1
Successfully installed typing-extensions-4.5.0
Collecting langchain
  Downloading langchain-0.0.256-py3-none-any.whl (1.4 MB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting chromadb
  Downloading chromadb-0.4.5-py3-none-any.whl (402 kB)
Collec

In [2]:
import openai
import os
import langchain
import chromadb
import tiktoken
import pypdf
#from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS
import panel as pn

In [3]:
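import os
from getpass import getpass
# Do not hardcode API keys in a notebook; prompt for the key if it is not already set.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
openai.api_key = os.environ["OPENAI_API_KEY"]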

# Document loader

In [4]:
loader = CSVLoader("prepared_1.csv")
documents = loader.load()

In [5]:
documents

[Document(page_content='\ufeffQuestions: What is high blood pressure?\nResponse: High blood pressure is a condition where the force at which your heart pumps blood around your body is high. It is recorded with 2 numbers, the systolic pressure and the diastolic pressure, both measured in millimetres of mercury (mmHg).\nReferences:\n- https://www.nhs.uk/conditions/Blood-pressure-(high)/Pages/Introduction.aspx <|eos|> <|eod|>', metadata={'source': 'prepared_1.csv', 'row': 0}),
 Document(page_content='\ufeffQuestions: What are the risks of high blood pressure?\nResponse: Persistent high blood pressure can put extra strain on your blood vessels, heart, and other organs, such as the brain, kidneys, and eyes. It can increase your risk of serious and potentially life-threatening health conditions, such as heart disease, heart attacks, strokes, heart failure, peripheral arterial disease, aortic aneurysms, kidney disease, and vascular dementia.\nReferences:\n- https://www.nhs.uk/conditions/Blood
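
Each page_content above starts with '\ufeff', which suggests the CSV file begins with a UTF-8 byte order mark (BOM) that ended up in the first column name. The sketch below reproduces the effect with the standard csv module on a small in-memory sample (the sample bytes and the helper `first_column_name` are made up for illustration) and shows that decoding with "utf-8-sig" strips the mark. CSVLoader appears to accept an `encoding` argument that could be used the same way, but that is an assumption worth checking against the installed LangChain version.

```python
import csv
import io

# In-memory sample that mimics a CSV file saved with a UTF-8 BOM.
raw = "\ufeffQuestions,Response\nWhat is high blood pressure?,A condition.\n".encode("utf-8")


def first_column_name(data: bytes, encoding: str) -> str:
    """Decode the bytes with the given encoding and return the first header name."""
    reader = csv.DictReader(io.StringIO(data.decode(encoding)))
    return reader.fieldnames[0]


print(repr(first_column_name(raw, "utf-8")))      # BOM stays attached to the name
print(repr(first_column_name(raw, "utf-8-sig")))  # BOM is stripped
```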

# QA function

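The function below takes a file, a query, a chain type, and k. It splits the document into chunks, embeds them, and answers the query from the k chunks returned by similarity search.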
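
The splitting step in the function below uses chunk_size = 500 and chunk_overlap = 0. As a rough picture of what those two parameters mean, here is a simplified fixed-window splitter. It is not the algorithm used by RecursiveCharacterTextSplitter, which tries to break on separators such as newlines first; `split_fixed` is a hypothetical helper written only for this illustration.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "a" * 1200
print([len(c) for c in split_fixed(sample)])            # no overlap
print([len(c) for c in split_fixed(sample, 500, 100)])  # 100 characters shared
```

With no overlap, the 1200-character sample yields chunks of 500, 500, and 200 characters. An overlap of 100 makes each window start 400 characters after the previous one, so neighbouring chunks repeat some text.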


In [14]:
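from langchain.embeddings.openai import OpenAIEmbeddings

def qa(file, query, chain_type, k):
    """Answer `query` from the CSV at `file` using the `k` most similar chunks."""
    # Load the CSV; each row becomes one Document.
    loader = CSVLoader(file)
    documents = loader.load()
    # Split the rows into chunks of at most 500 characters, with no overlap.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    # Embed the chunks with OpenAI embeddings; a local alternative is:
    # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
    embeddings = OpenAIEmbeddings()
    # Index the chunks in FAISS; Chroma is an alternative vector store:
    # db = Chroma.from_documents(docs, embeddings)
    db = FAISS.from_documents(docs, embeddings)
    # Retrieve the k chunks most similar to the query.
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    # Named `chain` so the local variable does not shadow the function `qa`.
    chain = RetrievalQA.from_chain_type(
        llm=OpenAI(), chain_type=chain_type, retriever=retriever, return_source_documents=True
    )
    result = chain({"query": query})
    print(result["result"])
    return result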
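
The retriever in the function above returns the k chunks whose embeddings are closest to the query embedding. The idea can be sketched in plain Python as follows. This is an illustration only: the toy 2-dimensional vectors are made up, `cosine` and `top_k` are hypothetical helpers, and cosine similarity is used here for clarity, whereas the FAISS index may rank by a different distance measure.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for real chunk embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```

A larger k gives the model more context to answer from, at the cost of a longer prompt, which matters under the token limit discussed in the problem statement.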


In [15]:
result = qa("prepared_1.csv", "Does the Alexander technique teach improved posture and movement, to help reduce and prevent problems caused by unhelpful habits? please do not make up an answer", "map_reduce",2)

 Yes, the Alexander technique teaches improved posture and movement, to help reduce and prevent problems caused by unhelpful habits.


In [16]:
result = qa("prepared_1.csv", "As mentioned in this document, do Non-pharmaceutical treatments such as photodynamic therapy, comedone extractor, and chemical peels for acne always work? please be honest and please do not make up an answer", "map_reduce", 2)

 These treatments may not always work and should be used only in conjunction with other treatments recommended by a dermatologist.


In [21]:
result = qa("prepared_1.csv", "According to this document, Are SSRIs suitable for everyone? Please do not make up an answer", "map_reduce", 2)

 No, SSRIs aren't suitable for everyone.


In [24]:
result = qa("prepared_1.csv", "What is Every Mind Matters website according to this document? Please do not make up an answer", "map_reduce", 2)

 Every Mind Matters is a website that provides tips and resources for improving mental wellbeing and offers practical self-care tips and guidance on where to find further support.
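
Because the chain is created with return_source_documents = True, each result should also carry the retrieved chunks under the "source_documents" key, which makes it possible to check that an answer is grounded in the file. The helper below is a small sketch of that check; `list_sources` is a hypothetical name, and `fake_result` is a hand-built stand-in shaped like the output shown earlier so the example runs without calling the API.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def list_sources(result: Dict[str, Any], preview_chars: int = 80) -> List[str]:
    """Format one line per retrieved document: file, row, and a short preview."""
    lines = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "?")
        row = doc.metadata.get("row", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{source} (row {row}): {preview}")
    return lines


# Stand-in for a real result; only the shape matters here.
fake_result = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(
            page_content="Questions: What is high blood pressure?\nResponse: ...",
            metadata={"source": "prepared_1.csv", "row": 0},
        )
    ],
}

for line in list_sources(fake_result):
    print(line)
```

In the notebook, the same helper could be called on the value returned by qa(...), for example list_sources(result).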
