# Question Answering with Falcon-7B-Instruct

In this tutorial we will use Qdrant and Falcon-7B-Instruct to build a question-answering system. Falcon-7B-Instruct is a ready-to-use chat/instruct model based on Falcon-7B, which is reported to outperform comparable open-source models. It works like a chatbot: you provide a question and it provides an answer. To ground those answers, we will convert documents containing relevant information to vectors and store them in a Qdrant collection. We can then search the collection with a query vector and generate an answer from the information we retrieve. We will need three things:
- **Qdrant**: stores our documents and searches for the document most relevant to our query.
- **Retriever model**: embeds the context passages and the queries.
- **Generator model**: generates the answers.

Don't worry if you do not understand everything yet. We will go through each step in the notebook.
Let's start:

## Install Dependencies
Let's get started by installing the packages needed for the notebook to run:

In [1]:
!pip install -qU datasets==2.13.1 qdrant-client==1.3.1 sentence_transformers==2.2.2 tqdm==4.65.0

## Import libraries

In [39]:
from datasets import load_dataset
from tqdm.auto import tqdm
import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer
from pathlib import Path
import torch
import os
import requests
from typing import Dict, Any, List

## Load Dataset

We will use the SQuAD dataset, a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. Its training split contains about 87.6k records.

In [3]:
# load the dataset from huggingface
dataset = load_dataset("squad", split="train")
dataset

Found cached dataset squad (C:/Users/karti/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

## Prepare dataset
We only want the context column from the dataset so we will make a pandas dataframe and drop duplicate contexts.

In [4]:
# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(dataset["context"])
df = df.drop_duplicates()
print(len(df))
df.head()

18891


|    | 0 |
|---:|:--|
| 0  | Architecturally, the school has a Catholic cha... |
| 5  | As at most other universities, Notre Dame's st... |
| 10 | The university is the major seat of the Congre... |
| 15 | The College of Engineering was established in ... |
| 20 | All of Notre Dame's undergraduate students are... |


After removing duplicates we are left with 18,891 unique contexts.

## Initialize Qdrant client

In [5]:
# Initialize Qdrant client

current_folder = Path.cwd()  # Get the current folder
qdrant_folder = current_folder / "qdrant"
qdrant_folder.mkdir(exist_ok=True)  # Create qdrant folder to store collection (no error if it already exists)

client = QdrantClient(path=str(qdrant_folder.resolve()))  # path to the qdrant folder, passed as a string

## Create new collection
Now that the data is ready, we can set up our Qdrant collection to store it.

In [6]:
context_collection = "question-answering-falcon"

collections = client.get_collections()
print(collections)

# only create the collection if it doesn't already exist;
# get_collections() returns a response object, so compare against the collection names
existing_names = [c.name for c in collections.collections]
if context_collection not in existing_names:
    client.create_collection(
        collection_name=context_collection,
        vectors_config=models.VectorParams(
            size=384,  # must match the dimensionality of the retriever model's output vectors
            distance=models.Distance.COSINE,  # metric used to measure similarity
        ),
    )
collections = client.get_collections()
print(collections)

collections=[]
collections=[CollectionDescription(name='question-answering-falcon')]


## Initialize Retriever
We will use the `all-MiniLM-L6-v2` model to create vector representations of both our records and our search queries. These vector embeddings capture the semantic meaning of the text. During the retrieval phase, a similarity measure (here, cosine similarity) is applied in vector space to find the records most similar to a given query.
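
To make the similarity measure concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values, not real `all-MiniLM-L6-v2` embeddings (those have 384 dimensions), and the helper name `cosine_similarity` is our own. Qdrant computes this for us, so the code is only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # same direction as the query, different length
far_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, close_vec))  # close to 1: same direction
print(cosine_similarity(query_vec, far_vec))  # 0: nothing in common
```

Because the measure depends only on direction, a longer vector pointing the same way scores as highly as the query itself.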

In [7]:
# set device to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
if device != "cuda":
    print(
        f"You are using {device}. This is much slower than using "
        "a CUDA-enabled GPU. If on Colab you can change this by "
        "clicking Runtime > Change runtime type > GPU."
    )
# Instantiate the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
model



SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

## Generate Embeddings -> Store in Qdrant

Next, we need to generate embeddings for the context passages.

For each document we pass to Qdrant, we need:

1. an id (a unique integer value),
2. the context embedding, and
3. a payload, which is a dictionary containing data relevant to the embedding. We will store the original context passage as the payload.
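
The way ids and payloads line up across batches can be sketched without Qdrant or the embedding model. The helper `iter_batches` and the payload key `"context"` below are hypothetical names chosen for this illustration. The notebook itself stores the passage under the DataFrame's column label `0`.

```python
from typing import Dict, Iterator, List, Tuple


def iter_batches(
    texts: List[str], batch_size: int
) -> Iterator[Tuple[List[int], List[Dict[str, str]]]]:
    """Yield (ids, payloads) per batch; ids stay unique across all batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        chunk = texts[start : start + batch_size]
        # ids continue from where the previous batch stopped
        ids = list(range(start, start + len(chunk)))
        # one payload dictionary per passage ("context" is an assumed key name)
        payloads = [{"context": text} for text in chunk]
        yield ids, payloads


batches = list(iter_batches(["passage a", "passage b", "passage c"], batch_size=2))
for ids, payloads in batches:
    print(ids, payloads)
```

In the real loop, each batch would also carry one embedding per passage, in the same order as the ids and payloads.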

In [8]:
%%time

batch_size = 1024  # specify batch size according to your RAM and compute, higher batch size = more RAM usage

for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i + batch_size, len(df))  # find end of batch
    batch = df.iloc[i:i_end]  # extract batch
    encoded_contexts = model.encode(
        batch[0].tolist()
    ).tolist()  # encoding the whole batch of context passages into vectors
    meta = batch.to_dict(orient="records")  # payloads: one dict per passage, keyed by column label 0
    ids = list(range(i, i_end))  # create unique IDs

    # upsert to qdrant
    client.upsert(
        collection_name=context_collection,
        points=models.Batch(ids=ids, vectors=encoded_contexts, payloads=meta),
    )
collection_vector_count = client.get_collection(
    collection_name=context_collection
).vectors_count
print(f"Vector count in collection: {collection_vector_count}")
assert collection_vector_count == len(df)

  0%|          | 0/19 [00:00<?, ?it/s]

Vector count in collection: 18891
CPU times: total: 1min 37s
Wall time: 4min 3s


## Initialize Generator model
We will use **tiiuae/falcon-7b-instruct** from Hugging Face to generate answers to our questions. First we search Qdrant for the context relevant to our question. Then we pass both the question and the context to the generator model, which produces the answer.

We will query Falcon-7B-Instruct through the Hugging Face Inference API. This requires an API token, which we can get from Hugging Face and which this notebook reads from the `API_TOKEN` environment variable.
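
The retrieve-then-generate flow can be sketched with stand-ins for both models. Everything here is hypothetical scaffolding for illustration: `build_prompt`, `answer_question`, `fake_retrieve` and `fake_generate` are not part of Qdrant or Hugging Face, and the fake functions return canned values instead of calling any service.

```python
from typing import Callable, List


def build_prompt(question: str, contexts: List[str]) -> str:
    """Combine the question and the retrieved passages into a single prompt."""
    context = "\n".join(contexts)
    return (
        "Please answer the question below using the provided context.\n"
        f"question: {question}\n"
        f"context: {context}"
    )


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 1,
) -> str:
    """Step 1: retrieve passages. Step 2: generate an answer from the prompt."""
    contexts = retrieve(question, top_k)
    return generate(build_prompt(question, contexts))


def fake_retrieve(question: str, top_k: int) -> List[str]:
    # Stand-in for a vector search; always returns the same passage.
    return ["Kathmandu is dissected by eight rivers."][:top_k]


def fake_generate(prompt: str) -> str:
    # Stand-in for the language model; only "knows" what the prompt contains.
    return "eight" if "eight rivers" in prompt else "unknown"


question = "How many rivers travel through Kathmandu?"
print(answer_question(question, fake_retrieve, fake_generate))
```

Passing the retriever and generator in as functions keeps the two steps separate, so either one can be swapped out without touching the other.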

In [83]:
# get API_TOKEN from huggingface website
API_TOKEN = os.getenv("API_TOKEN")
if not API_TOKEN:
    raise ValueError(
        "API_TOKEN is not set. Please obtain a valid API_TOKEN from the Hugging Face website."
    )

Now we will write two helper functions: **query_qdrant**, which searches the collection for context relevant to our query, and **generate_answer**, which generates an answer from the query and the context returned by `query_qdrant`.

In [84]:
def query_qdrant(query: str, top_k: int) -> List:
    """
    Searches Qdrant collection for context passages which are similar to the given query.

    Args:
        query (str): The query to search for.
        top_k (int): The number of results to return.

    Returns:
        List: A list of search results.
    """
    encoded_query = model.encode(query).tolist()  # Generate embeddings for the query
    result = client.search(
        collection_name=context_collection,
        query_vector=encoded_query,
        limit=top_k,  # how many results we want
    )  # search qdrant collection for context passage
    return result

In [113]:
def generate_answer(query: str, results: List) -> str:
    """
    Generates an answer to a given query by combining the query with context from query_qdrant function and querying
    it with the model Falcon-7b-Instruct from HuggingFace.

    Args:
        query (str): The query to generate an answer for.
        results (List): A list of search results containing context passages from query_qdrant.

    Returns:
        str: The generated answer.
    """
    # concatenate the retrieved passages (stored under payload key 0) into one context string
    context = " ".join(result.payload[0] for result in results)
    input_text = f"Please answer the question below using the provided context. question: {query} context: {context}"

    API_URL = "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct"
    headers = {"Authorization": f"Bearer {API_TOKEN}"}

    def query_falcon(payload: Dict[str, Any]) -> List:
        """
        Sends a query to the Hugging Face API and returns the response json.

        Args:
            payload (Dict[str, Any]): The payload to send with the query.

        Returns:
            List: The json response from the API.
        """
        response = requests.post(API_URL, headers=headers, json=payload)
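        if response.status_code != 200:
            # a non-200 status is not always a token problem, so report what the API returned
            raise RuntimeError(
                f"Request failed with status {response.status_code}: {response.text}"
            )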
        return response.json()

    output = query_falcon(
        {
            "inputs": input_text,
            "parameters": {
                "top_p": 0.9,  # controls the diversity of the generated text
                "temperature": 0.8,  # determines the randomness of the generated text
                "max_new_tokens": 2000,  # maximum number of tokens the model will generate
                "repetition_penalty": 1.03,  # discourages the model from repeating the same token multiple times
            },
        }
    )
    return output[0]["generated_text"][len(input_text) + 1 :]
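
The slice at the end of `generate_answer` assumes the API returns the prompt followed by the answer. A slightly more defensive variant, shown here only as a sketch, checks for the echoed prompt before removing it. The helper name `strip_prompt` and the sample strings are our own and do not come from the API.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove an echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        # drop the prompt and any whitespace separating it from the answer
        return generated_text[len(prompt):].lstrip()
    # no echo detected: return the text as-is, minus surrounding whitespace
    return generated_text.strip()


prompt = "question: How many rivers? context: eight rivers"
print(strip_prompt(prompt + "\nEight.", prompt))
print(strip_prompt("Eight.", prompt))
```

Both calls print the bare answer, whether or not the prompt was echoed back.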

Let's test the `query_qdrant` function and see which context it retrieves.

In [114]:
query = "How many rivers travel through Kathmandu?"
results = query_qdrant(query, top_k=1)
results

[ScoredPoint(id=18845, version=0, score=0.6813206721766036, payload={0: 'Kathmandu is dissected by eight rivers, the main river of the valley, the Bagmati and its tributaries, of which the Bishnumati, Dhobi Khola, Manohara Khola, Hanumant Khola, and Tukucha Khola are predominant. The mountains from where these rivers originate are in the elevation range of 1,500–3,000 metres (4,900–9,800 ft), and have passes which provide access to and from Kathmandu and its valley. An ancient canal once flowed from Nagarjuna hill through Balaju to Kathmandu; this canal is now extinct.'}, vector=None)]

Now we pass the query and the results to the `generate_answer` function.

In [115]:
answer = generate_answer(query, results)
answer

'As an AI language model, I do not have access to real-time information. However, according to the provided context, Kathmandu is intersected by eight rivers, with the Bagmati and its tributaries, including Bishnumati, Dhobi Khola, Manohara Khola, Hanumant Khola, and Tukucha Khola being the most prominent. The mountainous regions with elevations between 1,500-3,000 metres (4,900-9,800 ft) are the source of these rivers. The ancient canal, once flowing through Balaju to Kathmandu, is now extinct.'

If we ask the same question without providing the context, the model is not able to answer correctly.

In [116]:
query = "How many rivers travel through Kathmandu?"
# pass an empty result list so the prompt contains no context
answer = generate_answer(query, [])
answer

'The answer is 25. There are 25 rivers that flow through Kathmandu, including the Bagmati River, Bishankar River, and Tamakoshi River.'

Let's run some more queries.

In [123]:
query = "How long ago did Antarctica and Australia split apart?"
results = query_qdrant(query, top_k=1)
answer = generate_answer(query, results)
answer

'How long ago did Antarctica and Australia split apart? Answer: Approximately 45 million years ago during the Eocene epoch.'

In [120]:
query = "Which person played Knute Rockne in the 1940 movie Knute Rockne?"
results = query_qdrant(query, top_k=1)
answer = generate_answer(query, results)
answer

"Pat O'Brien"

The model gives decent answers to our questions. You can try more queries if you want.

In [None]:
client.delete_collection(collection_name=context_collection)