## Question Answering with LangChain ##

Question Answering (QA) in Natural Language Processing (NLP) and Information Retrieval (IR) is the task of answering questions by fetching the relevant documents and reasoning over them, while abstaining when a question cannot be answered from the provided context. __[LangChain](https://python.langchain.com/en/latest/index.html)__ is a library that makes it easy to combine language models with other components to build applications. This tutorial builds a QA application that uses embedding-based retrieval and answer generation with the LangChain library.


## Getting started ##

Install LangChain

In [26]:
# !pip install --upgrade langchain

Get a Hugging Face API token. Refer to this __[page](https://huggingface.co/docs/hub/security-tokens)__ for more information.

In [1]:
import os

In [2]:
# keys.txt holds one key per line; the Hugging Face token is read from the second line
with open('keys.txt') as f:
    key = [row.rstrip('\n') for row in f]

os.environ["HUGGINGFACEHUB_API_TOKEN"] = key[1]

## Creating a Vector Database ##

Let's start by creating an index over the context documents. In theory, an LLM could read all the documents to find the answer to a given query, but we are limited by its context window. Passing every document through the LLM in smaller chunks is computationally expensive and time-consuming. A retriever module instead shortlists the documents that are passed to the LLM to produce an answer.


__What is a vector database?__
Vector databases index vector embeddings to enable efficient similarity search. LangChain supports several types of vector databases, which it calls vectorstores. More information about LangChain's support can be found __[here](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html#vectorstores)__.

In this tutorial, we use a FAISS vectorstore to index Hugging Face embeddings of our documents.
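To make the idea of similarity search concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the measure and uses made-up 3-dimensional vectors; `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names for illustration. A real vectorstore such as FAISS uses optimized index structures, and its default distance measure may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
toy_index = [
    ("farm animals revolt", [0.9, 0.1, 0.0]),
    ("space travel", [0.0, 0.2, 0.9]),
    ("pigs take command", [0.8, 0.3, 0.1]),
]

results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query vector points in nearly the same direction as the two farm-related entries, so those are returned ahead of the unrelated one.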

__Loading documents__

We use the CMU Book Summary dataset, which contains plot summaries for 16,559 books extracted from Wikipedia, along with aligned metadata from Freebase, including book author, title, and genre.
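Each record in the file is a single tab-separated line. The sketch below shows one way to split such a line into named fields. The field names in `FIELDS` and the helper `parse_record` are assumptions chosen for illustration, and the sample line is abbreviated from the first record in the file.

```python
import json

# Assumed field order, inferred from the layout of the first record.
FIELDS = ["wikipedia_id", "freebase_id", "title", "author",
          "publication_date", "genres", "summary"]

def parse_record(line):
    """Split one tab-separated line into a dict and decode the genre JSON."""
    values = line.rstrip("\n").split("\t", len(FIELDS) - 1)
    record = dict(zip(FIELDS, values))
    # Genres are a JSON object mapping Freebase IDs to names; the field may be empty.
    genres = record.get("genres")
    record["genres"] = json.loads(genres) if genres else {}
    record["summary"] = record.get("summary", "").strip()
    return record

# Abbreviated version of the first record (Animal Farm).
sample = ('620\t/m/0hhy\tAnimal Farm\tGeorge Orwell\t1945-08-17\t'
          '{"/m/06nbt": "Satire", "/m/02xlf": "Fiction"}\t'
          ' Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting.')

record = parse_record(sample)
print(record["title"], "by", record["author"])
print(sorted(record["genres"].values()))
```

The tutorial itself treats the file as one long string and does not parse individual records; parsing would mainly be useful for attaching the title or author as chunk metadata.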

In [3]:
with open('booksummaries.txt', 'r') as f:
    doc = f.read()

In [4]:
len(doc)

43403998

In [5]:
doc = doc[0:50000]

This string is still too large to pass to an LLM at once, so we split it into chunks that fit a context window. Let's chunk it by tokens, since context windows are measured in tokens. Smaller LLMs typically have a window of around 512 tokens. First, we need to define the tokenizer.

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')



Next, we use this tokenizer to create chunks of the document containing ~500 tokens each.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

In [8]:
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=500, chunk_overlap=0
)

In [9]:
splits = text_splitter.split_text(doc)
sources = list(range(len(splits)))
metadatas = [{"source": i} for i in sources]

In [10]:
len(splits)

59
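The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sketch. It assumes whitespace-separated words as stand-ins for tokenizer tokens, and `chunk_by_tokens` is a hypothetical helper, not the LangChain implementation. The LangChain splitter works differently in detail: it splits on separators such as newlines and spaces and uses the tokenizer only to measure chunk length.

```python
def chunk_by_tokens(tokens, chunk_size, chunk_overlap=0):
    """Group a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

# Words stand in for tokens in this toy example.
words = "the animals revolt and drive Mr Jones from the farm".split()

chunks = chunk_by_tokens(words, chunk_size=4, chunk_overlap=0)
overlapping = chunk_by_tokens(words, chunk_size=4, chunk_overlap=1)

for chunk in chunks:
    print(" ".join(chunk))
```

With no overlap, a sentence that straddles a boundary is cut in two. A small overlap repeats the boundary tokens in both chunks, at the cost of indexing slightly more text.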

In [12]:
splits[0]

'620\t/m/0hhy\tAnimal Farm\tGeorge Orwell\t1945-08-17\t{"/m/016lj8": "Roman \\u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children\'s literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"}\t Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, \'Beasts of England\'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of'

The texts are now in a more suitable form to be passed to an LLM. Next, let's create a FAISS vectorstore from these chunks. We use Hugging Face's __gtr-t5-large__ embedding model for this purpose.

In [13]:
embeddings = HuggingFaceEmbeddings(model_name='gtr-t5-large')


In [14]:
vectorstore = FAISS.from_texts(splits, embeddings, metadatas=metadatas)

## Querying the LLM using the context documents ##

We now have a vectorstore that holds an index and an embedding-based representation of the summary chunks. We can run a similarity search for a query to fetch the most relevant chunks. Subsequently, we feed these chunks to the LLM to produce an answer to the query.

In [27]:
query = "What is the philosophy of Animal Farm"
docs = vectorstore.similarity_search(query)

In [28]:
docs

[Document(page_content='the country. Napoleon announces an alliance with the humans, against the labouring classes of both "worlds". He abolishes practices and traditions related to the Revolution, and changes the name of the farm to "The Manor Farm". The animals, overhearing the conversation, notice that the faces of the pigs have begun changing. During a poker match, an argument breaks out between Napoleon and Mr Pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. The pigs Snowball, Napoleon, and Squealer adapt Old Major\'s ideas into an actual philosophy, which they formally name Animalism. Soon after, Napoleon and Squealer indulge in the vices of humans (drinking alcohol, sleeping in beds, trading). Squealer is employed to alter the Seven Commandments to account for this humanisation, an allusion to the Soviet government\'s revising of history', lookup_str='', metadata={'source': 4}, lookup_i

__Defining a template__

In [16]:
from langchain import PromptTemplate, HuggingFaceHub, LLMChain

A prompt template in LangChain helps an LLM 'think' by structuring the input in a way that elicits the most appropriate answer. The template is a text string that contains instructions for the LLM on what kind of output to produce. More information on prompts can be found __[here](https://python.langchain.com/en/latest/modules/prompts/prompt_templates/getting_started.html)__. In this tutorial, we use a simple template that takes the question as its only input variable.

__LLMChain__ is perhaps the most useful functionality in the library. It allows a language model application to seamlessly connect multiple outputs from different LLMs into coherent text outputs. This is achieved by using Prompts and Chains together. More information on chains can be found __[here](https://python.langchain.com/en/latest/modules/chains/getting_started.html)__.

Setting `return_only_outputs=False` makes the chain return its inputs alongside the generated output, rather than the output alone.
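A template can carry retrieved text as well as the question. The sketch below uses plain `str.format` to show one way of placing retrieved chunks into a prompt. The `{context}` placeholder, the template wording, and the helper `build_prompt` are assumptions for illustration and are not part of this tutorial's code. The two example chunks are sentences taken from the Animal Farm summary.

```python
# Hypothetical template with a slot for retrieved context and one for the question.
CONTEXT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

def build_prompt(chunks, question, max_chunks=2):
    """Join the first max_chunks retrieved chunks and fill the template."""
    context = "\n---\n".join(chunks[:max_chunks])
    return CONTEXT_TEMPLATE.format(context=context, question=question)

retrieved = [
    "The pigs Snowball, Napoleon, and Squealer adapt Old Major's ideas into "
    "an actual philosophy, which they formally name Animalism.",
    'They adopt Seven Commandments of Animal-ism, the most important of '
    'which is, "All animals are equal".',
]

prompt_text = build_prompt(retrieved, "What is the philosophy of Animal Farm")
print(prompt_text)
```

Limiting the number of chunks with `max_chunks` keeps the filled prompt within the model's context window.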

In [30]:
template = """Answer the Question coherently. Question: {question}"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(
    prompt=prompt,
    llm=HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 1e-10}),
)

# The template only references {question}, so "input_documents" is returned
# with the result but is not inserted into the prompt.
llm_chain({"input_documents": docs, "question": query}, return_only_outputs=False)

{'input_documents': [Document(page_content='the country. Napoleon announces an alliance with the humans, against the labouring classes of both "worlds". He abolishes practices and traditions related to the Revolution, and changes the name of the farm to "The Manor Farm". The animals, overhearing the conversation, notice that the faces of the pigs have begun changing. During a poker match, an argument breaks out between Napoleon and Mr Pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. The pigs Snowball, Napoleon, and Squealer adapt Old Major\'s ideas into an actual philosophy, which they formally name Animalism. Soon after, Napoleon and Squealer indulge in the vices of humans (drinking alcohol, sleeping in beds, trading). Squealer is employed to alter the Seven Commandments to account for this humanisation, an allusion to the Soviet government\'s revising of history', lookup_str='', metadata={'s