<a href="https://colab.research.google.com/github/mariaafara/document-based_qa/blob/main/notebooks/Document_Based_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Document-Based_QA
Harnessing the Power of Hugging Face Embeddings and LangChain for Document-Based Question Answering
- InstructorEmbedding (https://huggingface.co/hkunlp/instructor-xl)

To build a question-answering application over documents with the LangChain framework, the standard steps are:
load documents -> split documents -> create embedding vectors from the chunks -> store the embedded vectors in a vector database -> retrieve relevant documents from the store -> pass the relevant documents to the LLM -> the LLM generates the final answer.
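
To make the flow of these steps concrete, here is a minimal, dependency-free sketch. It is not how LangChain implements them: the word-count `embed` function, the sentence-level split, and the in-memory `store` list are all stand-ins assumed for illustration, and the final step only builds the prompt instead of calling an LLM.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-2: "load" a document and split it into chunks (here: one chunk per sentence).
document = (
    "Transformers use attention. "
    "The encoder reads the input sentence. "
    "The decoder generates the output sentence."
)
chunks = [s.strip() for s in document.split(". ") if s.strip()]

# Steps 3-4: embed each chunk and keep (vector, chunk) pairs as an in-memory "vector store".
store = [(embed(chunk), chunk) for chunk in chunks]

# Step 5: retrieve the chunk most similar to the question.
question = "What does the decoder generate?"
q_vec = embed(question)
best_chunk = max(store, key=lambda item: cosine(q_vec, item[0]))[1]

# Step 6: hand the retrieved chunk to the LLM as context (only the prompt is built here).
prompt = f"Context: {best_chunk}\nQuestion: {question}"
print(prompt)
```

In the rest of the notebook each stand-in is replaced by a real component: a text splitter, an embedding model, a vector store, and a Hugging Face LLM.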

LangChain offers three distinct methods that implement these steps with slight variations: VectorstoreIndexCreator, RetrievalQA, and load_qa_chain.

<image>

In this notebook, I will try out each of these three methods using an open-source LLM from Hugging Face together with open-source embeddings. OpenAI's models and embeddings would also work, but they are not free.

For the vector store, I will try out both LangChain's FAISS integration and Chroma (chromadb).

I am going to install all packages at the beginning to avoid restarting the runtime at later stages.

In [1]:
# Core libraries
!pip install langchain
!pip install sentence_transformers
!pip install InstructorEmbedding
# !pip install huggingface_hub

# Vector stores
!pip install faiss-cpu
!pip install chromadb

Collecting langchain
  Downloading langchain-0.0.330-py3-none-any.whl (2.0 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.52 (from langchain)
  Downloading langsmith-0.0.57-py3-none-any.whl (44 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langch



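*   langchain: framework for composing LLM applications (document loaders, text splitters, chains).
*   sentence_transformers: library for computing sentence and text embeddings.
*   InstructorEmbedding: package that provides the INSTRUCTOR instruction-based embedding models.

*   chromadb: open-source embedding database (the Chroma vector store).
*   faiss-cpu: CPU build of FAISS, a library for efficient vector similarity search.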



Load Web document

In [2]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://towardsdatascience.com/transformers-141e32e69591?gi=9950ebcdff62")
documents = loader.load()

Split document

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

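text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=100,     # characters shared between consecutive chunks
    length_function=len,   # measure chunk length in characters
    add_start_index=True,  # record each chunk's start offset in its metadata
)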

texts = text_splitter.split_documents(documents)

In [4]:
texts[0]

Document(page_content="How Transformers Work. Transformers are a type of neural… | by Giuliano Giacaglia | Towards Data ScienceOpen in appSign upSign InWriteSign upSign InHow Transformers WorkThe Neural Network used by Open AI and DeepMindGiuliano Giacaglia·FollowPublished inTowards Data Science·14 min read·Mar 11, 2019--34ListenShareIf you liked this post and want to learn how machine learning algorithms work, how did they arise, and where are they going, I recommend the following:Making Things Think: How AI and Deep Learning Power the Products We Use - HollowayIt is the obvious which is so difficult to see most of the time. People say 'It's as plain as the nose on your face.'…www.holloway.comTransformers are a type of neural network architecture that have been gaining popularity. Transformers were recently used by OpenAI in their language models, and also used recently by DeepMind for AlphaStar — their program to defeat a top professional Starcraft player.Transformers were d

In [5]:
len(texts)

25
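
To show what `chunk_size`, `chunk_overlap`, and `add_start_index` mean, here is a simplified sketch of overlapping fixed-size windows. It is an assumption-laden stand-in: the real `RecursiveCharacterTextSplitter` splits on separators such as paragraphs and spaces before merging pieces, so its chunks do not fall on exact fixed windows. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Keep the start offset next to the text, similar in spirit to add_start_index.
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary available in at least one chunk.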

### Get Embeddings for OUR Documents

### HF Instructor Embeddings

The HuggingFaceInstructEmbeddings class in LangChain wraps instruction-based embedding models such as INSTRUCTOR, which are loaded through sentence_transformers. These models generate an embedding from two inputs: the text itself and an instruction that describes the task the embedding is for.
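
The instruction-plus-text input can be sketched without loading a model. The two instruction strings below are what I understand the LangChain wrapper's defaults to be; treat their exact wording as an assumption and check it against your installed version. The helper `to_instruction_pairs` is hypothetical.

```python
# Assumed default instructions (verify against your LangChain version).
EMBED_INSTRUCTION = "Represent the document for retrieval: "
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents: "

def to_instruction_pairs(instruction, texts):
    """Pair every text with an instruction, the input shape INSTRUCTOR models encode."""
    return [[instruction, text] for text in texts]

# Documents and queries get different instructions, so the same model
# can embed them differently for the retrieval task.
doc_pairs = to_instruction_pairs(EMBED_INSTRUCTION, ["Transformers use attention."])
query_pairs = to_instruction_pairs(QUERY_INSTRUCTION, ["What is a Transformer?"])
print(doc_pairs)
print(query_pairs)
```

With the InstructorEmbedding package, pairs like these would be passed to the model's `encode` method; the LangChain wrapper builds them internally when documents or queries are embedded.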

In [8]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-base",
    model_kwargs={"device": "cpu"},  # use "cuda" if a GPU is available
)

load INSTRUCTOR_Transformer
max_seq_length  512


In [None]:
# !git lfs install
# # !git clone https://huggingface.co/hkunlp/instructor-xl
# !git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Git LFS initialized.
Cloning into 'all-mpnet-base-v2'...
remote: Enumerating objects: 49, done.
remote: Total 49 (delta 0), reused 0 (delta 0), pack-reused 49
Unpacking objects: 100% (49/49), 314.92 KiB | 1.03 MiB/s, done.


In [None]:
# from langchain.embeddings import HuggingFaceEmbeddings


# embeddings = HuggingFaceEmbeddings(
#     model_name="/content/all-mpnet-base-v2",
#     model_kwargs={'device': 'cpu'},
#     encode_kwargs={'normalize_embeddings': False}
# )

Load FAISS vector store

In [9]:
from langchain.vectorstores import FAISS

faiss_vectorstore = FAISS.from_documents(texts, embeddings)

faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 3})

faiss_retriever.search_type

'similarity'
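
With `search_type` set to `'similarity'` and `k=3`, the retriever returns the three stored chunks whose vectors are closest to the query vector. The sketch below shows that idea as a brute-force scan using Euclidean (L2) distance, which I believe is the default metric of LangChain's FAISS wrapper; the two-dimensional vectors and chunk ids are made up for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed_vectors, k=3):
    """Return the k (distance, doc_id) pairs closest to query_vec."""
    scored = [(l2_distance(query_vec, vec), doc_id) for doc_id, vec in indexed_vectors.items()]
    return sorted(scored)[:k]

# Made-up 2-D "embeddings"; real ones have hundreds of dimensions.
index = {
    "chunk-0": [1.0, 0.0],
    "chunk-1": [0.9, 0.1],
    "chunk-2": [0.0, 1.0],
    "chunk-3": [0.5, 0.5],
}
hits = top_k([1.0, 0.0], index, k=3)
print([doc_id for _, doc_id in hits])
```

A flat FAISS index performs the same exhaustive comparison, only with optimized vector arithmetic.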

In [10]:
query = "What is Transformer?"

docs = faiss_retriever.get_relevant_documents(query)
docs

[Document(page_content="How Transformers Work. Transformers are a type of neural… | by Giuliano Giacaglia | Towards Data ScienceOpen in appSign upSign InWriteSign upSign InHow Transformers WorkThe Neural Network used by Open AI and DeepMindGiuliano Giacaglia·FollowPublished inTowards Data Science·14 min read·Mar 11, 2019--34ListenShareIf you liked this post and want to learn how machine learning algorithms work, how did they arise, and where are they going, I recommend the following:Making Things Think: How AI and Deep Learning Power the Products We Use - HollowayIt is the obvious which is so difficult to see most of the time. People say 'It's as plain as the nose on your face.'…www.holloway.comTransformers are a type of neural network architecture that have been gaining popularity. Transformers were recently used by OpenAI in their language models, and also used recently by DeepMind for AlphaStar — their program to defeat a top professional Starcraft player.Transformers were 

Load Chroma vector store

In [None]:
# from langchain.vectorstores import Chroma

# chroma_vectorstore = Chroma.from_documents(texts, embeddings)
# chroma_retriever = chroma_vectorstore.as_retriever(search_kwargs={"k": 3})


In [None]:
# query = "What is Transformer?"

# # docs = chroma_vectorstore.similarity_search(query)
# docs = chroma_retriever.get_relevant_documents(query)
# docs

In [11]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_xxx"

# !git clone https://huggingface.co/pszemraj/flan-t5-large-instruct-dolly_hhrlhf

# !git clone https://huggingface.co/google/mt5-small

In [12]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="pszemraj/flan-t5-large-instruct-dolly_hhrlhf",
    task="text2text-generation",
    pipeline_kwargs={"temperature": 0, "max_new_tokens": 64},
)




Alternative: load the LLM through the Hugging Face Hub

In [None]:
# from langchain import HuggingFaceHub

# # set the  Access Token for Hugging Face
# import os
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_xxx"

# # We can only use text2text-generation or text-generation models and when looking for such models make sure you choose active/enabled
# llm = HuggingFaceHub(repo_id="pszemraj/flan-t5-large-instruct-dolly_hhrlhf", model_kwargs={"temperature":0, "max_length":64})

load_qa_chain

The load_qa_chain method follows these steps:
load documents -> process them according to the chain type (e.g., stuff) -> prompt the LLM -> answer
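
The "stuff" chain type places every retrieved document into a single prompt. Here is a minimal sketch of that idea; the template wording and the name `build_stuff_prompt` are hypothetical, and LangChain ships its own prompt text.

```python
# Hypothetical prompt template for illustration only.
STUFF_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_stuff_prompt(doc_texts, question):
    """Concatenate every retrieved document into one context block and fill the template."""
    context = "\n\n".join(doc_texts)
    return STUFF_TEMPLATE.format(context=context, question=question)

prompt = build_stuff_prompt(
    ["The encoder reads the input.", "The decoder produces the output."],
    "What is a decoder?",
)
print(prompt)
```

Because all documents land in one prompt, the LLM is called only once, but the prompt grows with every retrieved chunk and can exceed the model's input limit.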

In [13]:
from langchain.chains.question_answering import load_qa_chain

# Use a separate name so the imported load_qa_chain function is not shadowed.
qa_chain = load_qa_chain(llm, chain_type="stuff")

FAISS + load_qa_chain

In [14]:
query = "What is a decoder?"
docs = faiss_retriever.get_relevant_documents(query)
qa_chain.run(input_documents=docs, question=query)

Token indices sequence length is longer than the specified maximum sequence length for this model (833 > 512). Running this sequence through the model will result in indexing errors


"A decoder is a machine learning model that uses attention to focus on the relevant parts of a sentence. It's a type of machine learning model."

Chroma + load_qa_chain

In [16]:
# query = "What is a decoder?"
# docs = chroma_retriever.get_relevant_documents(query)
# qa_chain.run(input_documents=docs, question=query)
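
The FAISS + load_qa_chain run above logged a tokenizer warning (833 > 512): the stuffed prompt was longer than the 512-token maximum the tokenizer reports for this model. One possible remedy, sketched below, is to keep only as many retrieved chunks as fit a budget. The helper `fit_to_budget` is hypothetical, and counting whitespace-separated words is a crude stand-in for the model's real tokenizer.

```python
def fit_to_budget(doc_texts, question, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep retrieved documents, in rank order, while the running total stays within max_tokens."""
    used = count_tokens(question)
    kept = []
    for text in doc_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first document that would overflow the budget
        kept.append(text)
        used += cost
    return kept

docs_by_rank = ["one two three four", "five six seven", "eight nine ten eleven twelve"]
kept = fit_to_budget(docs_by_rank, "what is it", max_tokens=10)
print(kept)
```

Lowering `chunk_size` in the splitter or `k` in the retriever would shrink the prompt as well, at the cost of less context per question.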

RetrievalQA

FAISS + RetrievalQA

In [17]:
from langchain.chains import RetrievalQA

# create the chain to answer questions
faiss_retrievalQA_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=faiss_retriever,
    return_source_documents=True,
)

In [18]:
faiss_retrievalQA_chain(query)['result']

"A decoder is a machine learning model that uses attention to focus on the relevant parts of a sentence. It's a type of machine learning model."

In [19]:
query = "What is a decoder?"
llm_response = faiss_retrievalQA_chain(query)
llm_response

{'query': 'What is a decoder?',
 'result': "A decoder is a machine learning model that uses attention to focus on the relevant parts of a sentence. It's a type of machine learning model.",
 'source_documents': [Document(page_content='hidden state that is passed all the way to the decoding stage. Then, the hidden states are used at each step of the RNN to decode. The following gif shows how that happens.The green step is called the encoding stage and the purple step is the decoding stage. GIF from 3The idea behind it is that there might be relevant information in every word in a sentence. So in order for the decoding to be precise, it needs to take into account every word of the input, using attention.For attention to be brought to RNNs in sequence transduction, we divide the encoding and decoding into 2 main steps. One step is represented in green and the other in purple. The green step is called the encoding stage and the purple step is the decoding stage.GIF from 3The step in green i
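
Compared with load_qa_chain, RetrievalQA performs the retrieval step itself, so a single call takes a question and returns the dictionary shown above. The sketch below mimics that composition and the output keys; `make_retrieval_qa`, `fake_retrieve`, and `fake_llm` are hypothetical stand-ins, not LangChain APIs.

```python
def make_retrieval_qa(retrieve, llm, return_source_documents=False):
    """Compose a retriever and an LLM callable into a single question -> answer function."""
    def run(query):
        docs = retrieve(query)                      # retrieval happens inside the chain
        context = "\n\n".join(docs)                 # "stuff" all documents into one prompt
        answer = llm(f"{context}\n\nQuestion: {query}\nAnswer:")
        output = {"query": query, "result": answer}
        if return_source_documents:
            output["source_documents"] = docs
        return output
    return run

# Stand-ins for a real retriever and a real LLM.
def fake_retrieve(query):
    return ["The decoder produces the output sequence."]

def fake_llm(prompt):
    return f"(answer based on a {len(prompt)}-character prompt)"

qa = make_retrieval_qa(fake_retrieve, fake_llm, return_source_documents=True)
response = qa("What is a decoder?")
print(sorted(response))
```

Returning the source documents alongside the answer makes it possible to check which chunks the answer was based on.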

Chroma + RetrievalQA

In [20]:
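from langchain.chains import RetrievalQA

# create the chain to answer questions
# NOTE: the Chroma cells above are commented out, so chroma_retriever does not
# exist and this chain reuses faiss_retriever; pass chroma_retriever here once
# the Chroma vector store has been created.
chroma_retrievalQA_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=faiss_retriever,
    return_source_documents=True,
)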

In [21]:
query = "What is a decoder?"
llm_response = chroma_retrievalQA_chain(query)
llm_response

{'query': 'What is a decoder?',
 'result': "A decoder is a machine learning model that uses attention to focus on the relevant parts of a sentence. It's a type of machine learning model.",
 'source_documents': [Document(page_content='hidden state that is passed all the way to the decoding stage. Then, the hidden states are used at each step of the RNN to decode. The following gif shows how that happens.The green step is called the encoding stage and the purple step is the decoding stage. GIF from 3The idea behind it is that there might be relevant information in every word in a sentence. So in order for the decoding to be precise, it needs to take into account every word of the input, using attention.For attention to be brought to RNNs in sequence transduction, we divide the encoding and decoding into 2 main steps. One step is represented in green and the other in purple. The green step is called the encoding stage and the purple step is the decoding stage.GIF from 3The step in green i