## CHATBOT

In this notebook, we build a retrieval-based chatbot using the SquadV2 dataset.

Retrieval-based chatbot
*  Dataset: SquadV2
*  Model: DistilBERT, as used for the assignment

This notebook is a revised version of the original notebook Practical_08__Semantic_Search__THIRD-TUTORIAL.ipynb.
However, it is designed to run on a different dataset, namely the "SquadV2" dataset.
Consequently, the preprocessing steps used to create the embeddings have been modified.
The different structure of the dataset (question, context, answer) led to a different way of retrieving the answers.
We also added features to recognize and handle questions already present in the dataset, as well as impossible questions.

## Retrieval-based chatbots

We can use these semantic search pipelines to build a *retrieval-based chatbot*, which is a data-driven open-domain conversational agent that finds responses from a corpus.

In [1]:
# Install all the required packages

!pip install hnswlib
!pip install sentencepiece
!pip -q install transformers sentencepiece accelerate
!pip install -U sentence-transformers
!pip install datasets transformers accelerate -U



In [4]:
# Import the required packages

import datasets
import os
import torch
import pandas

We use multi-qa-MiniLM-L6-cos-v1 as the sentence embedding model and ms-marco-MiniLM-L-6-v2 as the cross-encoder.
The sentence embedding model encodes sentences into vectors,
and the cross-encoder scores pairs of sentences.

MS MARCO Passage Ranking is a large dataset for training information retrieval models. It consists of about 500k real search queries from the Bing search engine, each with the relevant text passage that answers the query.

* multi-qa-MiniLM-L6-cos-v1 maps sentences & paragraphs to a 384 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.

* cross-encoder/ms-marco-MiniLM-L-6-v2 can be used for information retrieval: given a query, score the query paired with each candidate passage (e.g. retrieved with ElasticSearch), then sort the passages in decreasing order of score.


In [2]:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')



*  (1) Document-query similarity with cosine similarity (a bi-encoder) is more efficient: documents and queries are encoded independently, so we can precompute the embeddings of all the documents once.
When a new query arrives, we run only the query through the model.


*  (2) The cross-encoder is more powerful, but every time it has to process the query together with each document, so nothing can be precomputed.



-->   In practice, we use (1) to store the vectors representing all our objects and to find the most relevant candidates, then we re-rank those candidates using (2).
Once we have found the most similar questions using cosine similarity, we re-rank them using the cross-encoder.
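
The two-stage idea can be sketched without any model at all. The snippet below is only an illustration under loud assumptions: the 2-D vectors are made-up stand-ins for sentence embeddings, and toy_overlap_scorer is a hypothetical word-overlap function standing in for the cross-encoder. It shows the control flow (cheap cosine shortlist, then pairwise re-ranking), not the quality of the real models.

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k):
    """Stage 1: indices of the k documents closest to the query by cosine similarity."""
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    similarities = doc_norm @ query_norm
    return np.argsort(-similarities)[:k]

def rerank(query, docs, candidate_ids, pair_scorer):
    """Stage 2: re-order the shortlisted candidates with a slower pairwise scorer."""
    scores = [pair_scorer(query, docs[i]) for i in candidate_ids]
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [int(candidate_ids[i]) for i in order]

def toy_overlap_scorer(query, doc):
    """Hypothetical stand-in for a cross-encoder: counts shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["where is normandy located", "who led the norse", "what is a newton"]
# Made-up 2-D "embeddings", computed once for the whole corpus
doc_vectors = np.array([[1.0, 0.1], [0.8, 0.6], [0.0, 1.0]])
query, query_vector = "who led the norse people", np.array([1.0, 0.2])

candidates = cosine_top_k(query_vector, doc_vectors, k=2)
ranked = rerank(query, docs, candidates, toy_overlap_scorer)
print(candidates, ranked)
```

In this toy setup the cosine stage ranks document 0 first, while the pairwise scorer promotes document 1, which is the kind of correction the re-ranking stage is meant to provide.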

Let's load the dataset

In [10]:
# Set the directory to work in
import urllib.request

WORKING_DIR = "./squad"
DATA_DIR = "./data"
DATASET_URL = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"
DATASET_PATH = "./dev-v2.0.json"

# Download the dataset with the standard library (wget is not available on every platform)
if not os.path.exists(DATASET_PATH):
    print("Downloading the DEV dataset of SQuAD2.0")
    urllib.request.urlretrieve(DATASET_URL, DATASET_PATH)

# Each row of the DataFrame is one article (title + paragraphs)
dev_dataframe = pandas.read_json(DATASET_PATH)
print(f"Number of articles in the dataset: {len(dev_dataframe)}")
dev_dataframe.head()

Now we need to build all the message-response pairs.
In the original notebook, we started by grouping together the utterances (turns) that belong to the same dialogue.
However, our dataset has only questions and answers; it has no dialogues.
First, we convert the dataset to a Pandas DataFrame.

In [None]:
import pandas as pd

df = pd.DataFrame(dev_dataframe)
df

Unnamed: 0,version,data
0,v2.0,"{'title': 'Normans', 'paragraphs': [{'qas': [{..."
1,v2.0,"{'title': 'Computational_complexity_theory', '..."
2,v2.0,"{'title': 'Southern_California', 'paragraphs':..."
3,v2.0,"{'title': 'Sky_(United_Kingdom)', 'paragraphs'..."
4,v2.0,"{'title': 'Victoria_(Australia)', 'paragraphs'..."
5,v2.0,"{'title': 'Huguenot', 'paragraphs': [{'qas': [..."
6,v2.0,"{'title': 'Steam_engine', 'paragraphs': [{'qas..."
7,v2.0,"{'title': 'Oxygen', 'paragraphs': [{'qas': [{'..."
8,v2.0,"{'title': '1973_oil_crisis', 'paragraphs': [{'..."
9,v2.0,"{'title': 'European_Union_law', 'paragraphs': ..."


Extract questions and paragraphs and index them

In [None]:
squad_data = dev_dataframe

# Extract questions and paragraphs and index them
questions = []
paragraphs = []

# Iterate over documents and paragraphs
for _, row in squad_data.iterrows():
    for paragraph in row['data']['paragraphs']:
        for qa in paragraph['qas']:
            question_dict = {
                'id': qa['id'],
                'title': row['data']['title'],
                'context': paragraph['context'],
                'question': qa['question'],
                'is_impossible': qa['is_impossible'],
                'answers': qa['answers']
            }
            questions.append(question_dict)


        # Append paragraphs to the separate list
        paragraph_dict = {
            'title': row['data']['title'],
            'context': paragraph['context']
        }
        paragraphs.append(paragraph_dict)

# Convert the lists to DataFrames
questions_df = pd.DataFrame(questions)
paragraphs_df = pd.DataFrame(paragraphs)

# Display the DataFrames
print("Questions:")

print(questions_df.head())
print(questions_df.tail())

print("\nParagraphs:")
print(paragraphs_df.head())


Questions:
                         id    title  \
0  56ddde6b9a695914005b9628  Normans   
1  56ddde6b9a695914005b9629  Normans   
2  56ddde6b9a695914005b962a  Normans   
3  56ddde6b9a695914005b962b  Normans   
4  56ddde6b9a695914005b962c  Normans   

                                             context  \
0  The Normans (Norman: Nourmands; French: Norman...   
1  The Normans (Norman: Nourmands; French: Norman...   
2  The Normans (Norman: Nourmands; French: Norman...   
3  The Normans (Norman: Nourmands; French: Norman...   
4  The Normans (Norman: Nourmands; French: Norman...   

                                            question  is_impossible  \
0               In what country is Normandy located?          False   
1                 When were the Normans in Normandy?          False   
2      From which countries did the Norse originate?          False   
3                          Who was the Norse leader?          False   
4  What century did the Normans first gain their ...    

Now we can go through the samples to build the (question, answers) pairs.
Note that the answers in each list are sometimes all identical and sometimes different. We deal with this later: when all the answers are identical, we take the first one; when they differ, we take the most suitable one.
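
As a small illustration of the de-duplication step, the sketch below collapses repeated answers while keeping the order in which they first appear. It is an alternative to the np.unique call used later in the notebook (which also sorts the answers); the helper name unique_answers and the mixed example list are assumptions made for this example.

```python
def unique_answers(answers):
    """Drop repeated answers while keeping first-occurrence order."""
    return list(dict.fromkeys(answers))

# Answer list taken from the dataset sample printed in this notebook
same = ['sthène', 'sthène', 'sthène', 'sthène', 'sthène']
# Hypothetical list with differing answers, for illustration only
mixed = ['France', 'France', 'in France']

print(unique_answers(same))
print(unique_answers(mixed))
print(unique_answers([]))
```

When a single answer survives, it can be returned directly; when several survive, they still need to be scored against the question to pick the most suitable one.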


In [None]:
def extract_question_pairs(row, questions_pairs):
    # Collect one {'message', 'response'} pair per question in the article
    for p in row["data"]["paragraphs"]:
        for q in p["qas"]:
            answers = [ans["text"] for ans in q["answers"]]
            questions_pairs.append(
                {'message': q["question"], 'response': answers}
            )


dev_dataframe_pairs = []
dev_dataframe.apply(extract_question_pairs, axis=1, questions_pairs=dev_dataframe_pairs)

print(len(dev_dataframe_pairs))
print(dev_dataframe_pairs[-5:])


11873
[{'message': 'What is the seldom used force unit equal to one thousand newtons?', 'response': ['sthène', 'sthène', 'sthène', 'sthène', 'sthène']}, {'message': 'What does not have a metric counterpart?', 'response': []}, {'message': 'What is the force exerted by standard gravity on one ton of mass?', 'response': []}, {'message': 'What force leads to a commonly used unit of mass?', 'response': []}, {'message': 'What force is part of the modern SI system?', 'response': []}]


#### Index

Now we can embed the messages in our data set.
We use the sentence embedding model to encode the questions. Because we pass convert_to_tensor=True, corpus_embeddings is a single 2-D tensor with one row per question; each row is the embedding of a question.

In [None]:
corpus_embeddings = semb_model.encode([sample['message'] for sample in dev_dataframe_pairs], convert_to_tensor=True, show_progress_bar=True)


And we can build an ANN (approximate nearest neighbor) index to do a quick cosine similarity search.

* Hnswlib is a library for approximate nearest neighbor search, used for efficient similarity search in high-dimensional spaces.
* We initialize the index with the cosine space, in which hnswlib reports distances as 1 minus the cosine similarity. hnswlib_index.add_items then adds the corpus embeddings and their corresponding indices to the index.
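
To make the distance values easier to interpret, here is a brute-force NumPy sketch of the same quantity: 1 minus the cosine similarity, followed by an exact k-nearest-neighbor selection. The function names and the toy corpus are assumptions for illustration; an exact search like this is only practical for small corpora, which is why the notebook uses an approximate index instead.

```python
import numpy as np

def cosine_distances(query_vec, corpus_matrix):
    """Cosine distance (1 - cosine similarity) from the query to every corpus row."""
    corpus_norm = corpus_matrix / np.linalg.norm(corpus_matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    return 1.0 - corpus_norm @ query_norm

def exact_knn(query_vec, corpus_matrix, k):
    """Exact k nearest neighbors by cosine distance (no approximation)."""
    distances = cosine_distances(query_vec, corpus_matrix)
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

# Toy corpus of three 2-D vectors
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ids, dists = exact_knn(np.array([2.0, 0.0]), corpus, k=2)
print(ids, dists)
```

A vector pointing in the same direction as the query gets distance 0 regardless of its length, and an orthogonal one gets distance 1.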

In [None]:
import os
import hnswlib

# Create empty index
hnswlib_index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = "./emp_dialogue_hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    hnswlib_index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    hnswlib_index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    hnswlib_index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    hnswlib_index.save_index(index_path)

Start creating HNSWLIB index
Saving index to: ./emp_dialogue_hnswlib.index


### Search for response

We are going to search for a response in the following way:
1. Search for a similar message to the latest user input.
2. Retrieve the response associated with the message.
3. Re-rank possible responses.

#### Retrieval function

Let's define a response function


* We use the sentence encoder to encode the question received from the user, as we did before for the corpus questions.
* We use the index to find the corpus question closest to the question received from the user.
* Now we can encounter two distinct cases:


* If the distance between the two questions is very small (below 1e-6 in the code), it means that the question is already in the corpus.
  We check whether the question is impossible to answer; if it is, we return a message saying that we don't know how to answer.
  Otherwise, we return the answer to the question. We do this by finding the best answer in the list of answers to the question using the cross-encoder.
* If the distance between the two questions is larger, it means that the question is not in the corpus.
  In this case, we use the index to find the top-k corpus questions closest to the question received from the user, and we re-rank them with the cross-encoder.
  Then, since each question has a list of answers, we remove the repeated answers and use the cross-encoder to find the best answer among the remaining ones.
  If all the answers are identical, we take the first one.
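
The branching described above can be isolated in a few lines. The sketch below is a simplified illustration, not the notebook's response function: route_query and its string labels are hypothetical names, and only the 1e-6 threshold and the example distances come from this notebook.

```python
def route_query(nearest_distance, nearest_answers, threshold=1e-6):
    """Decide which branch handles a query, given its nearest corpus question."""
    if nearest_distance < threshold:
        # The question is (almost) identical to one in the corpus
        if not nearest_answers:
            return "known_impossible"   # empty answer list: impossible question
        return "known_answerable"       # pick the best answer from its own list
    # Unseen question: retrieve top-k candidates and re-rank them
    return "rerank_top_k"

print(route_query(4.7683716e-07, ["France", "France"]))
print(route_query(0.0, []))
print(route_query(0.705709, []))
```

Keeping this decision separate from the model calls makes the threshold easy to adjust and to test on its own.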

In [None]:
import numpy as np

def get_response(message, mes_resp_pairs, index, re_ranking_model=None, top_k=32):
    # Fall back to the global cross-encoder when no re-ranking model is passed
    ranker = re_ranking_model if re_ranking_model is not None else xenc_model
    message_embedding = semb_model.encode(message, convert_to_tensor=True).cpu()

    def best_answer(answers):
        # Score each distinct answer against the message and keep the best one
        unique_answers = np.unique(answers)
        if len(unique_answers) == 0:
            return None
        scores = ranker.predict([(str(message), str(answer)) for answer in unique_answers])
        return str(unique_answers[np.argmax(scores)])

    # Distance between the message and its nearest corpus question
    hnsw_indices, hnsw_distances = index.knn_query(message_embedding, k=1)

    if hnsw_distances[0][0] < 1e-6:
        # (Almost) zero distance: the question is already in the corpus
        answers = mes_resp_pairs[hnsw_indices[0][0]]['response']
        if not answers:
            # Known question with an empty answer list: it is an impossible question
            return "I don't know. This is an impossible question ! (ノ ゜Д゜)ノ ︵ ┻━┻"
        # Known question with answers: return the best one
        return best_answer(answers) + ", I am sure about this. ٩(^‿^)۶ "

    # Otherwise retrieve the top-k closest questions and re-rank their answer lists
    corpus_ids, _ = index.knn_query(message_embedding, k=top_k)
    if len(corpus_ids) == 0 or len(corpus_ids[0]) == 0:
        print("Warning: Corpus IDs are empty.")
        return "I don't know how to answer with the data available. This is an impossible question."

    model_inputs = [(str(message), str(mes_resp_pairs[idx]['response'])) for idx in corpus_ids[0]]
    cross_scores = ranker.predict(model_inputs)
    if len(cross_scores) == 0:
        print("Warning: Cross scores are empty.")
        return "I don't know how to answer with the data available. I don't get any good matches."

    # Take the answer list of the best-scoring candidate and pick only one answer from it
    best_idx = np.argsort(-cross_scores)[0]
    answer = best_answer(mes_resp_pairs[corpus_ids[0][best_idx]]['response'])
    return answer if answer is not None else "I don't know."


In [None]:
# example to test the chatbot
chatbot_response = get_response(
    "I like going out with my puppies.", dev_dataframe_pairs, hnswlib_index, re_ranking_model=xenc_model
)
chatbot_response

0.705709


"I don't know."

#### Conversation loop

Let's try chatting with our retrieval system

In [None]:
# Initialise dialogue history
dialogue_history = ["Hello, how are you?"]

# Start chatting
print("Press [Ctrl-C] to stop\n\n\n\n")
print(f"Chatbot: {dialogue_history[0]}")
# Keep talking until stop
running = True
while running:
    try:
        # Read user message
        user_message = input("User: ")
        # Append message to dialogue history
        dialogue_history.append(user_message)
        # Search for a chatbot response
        chatbot_response = get_response(
            user_message, dev_dataframe_pairs, hnswlib_index, re_ranking_model=xenc_model
        )
        if chatbot_response:
            # Append chatbot response to dialogue history
            dialogue_history.append(chatbot_response)
            # Print chatbot response
            print(f"Chatbot: {chatbot_response}")
        else:
            print("Chatbot: I'm sorry, I couldn't find a suitable response.")
    except KeyboardInterrupt:
        running = False


Press [Ctrl-C] to stop




Chatbot: Hello, how are you?
User: What force leads to a commonly used unit of mass?
0.0
Chatbot: I don't know. This is an impossible question ! (ノ ゜Д゜)ノ ︵ ┻━┻


## Results

*   CASE 1:
We ask a question identical to one in the dataset.
The chatbot recognizes the question thanks to a very low hnsw distance (4.7683716e-07) and answers with the best-matching phrase in the list.
It says "I am sure about this.".
```
      Chatbot: Hello, how are you?
      User: In what country is Normandy located?
      Chatbot: France I am sure about this.  ٩(^‿^)۶
```
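
The distance in CASE 1 is tiny but not exactly zero, even though the question is identical. One plausible reading, offered here as an assumption rather than something this notebook verifies, is float32 rounding when the similarity of a vector with itself is computed. The sketch below only checks the arithmetic: the reported value is exactly four times the float32 machine epsilon, and it sits below the 1e-6 threshold used in the response function.

```python
import numpy as np

# Spacing between 1.0 and the next representable float32 value
eps32 = np.finfo(np.float32).eps
# Distance reported in CASE 1, stored as float32
reported = np.float32(4.7683716e-07)

# How many float32 epsilons the reported distance corresponds to
ulps = float(reported) / float(eps32)
# Whether the reported distance passes the "already in the corpus" test
is_below_threshold = float(reported) < 1e-6
print(ulps, is_below_threshold)
```

This suggests why a small tolerance is used instead of testing the distance for equality with zero.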
      


*   CASE 2:
We ask a question that is a variation of one in the dataset.
The chatbot computes the best match and retrieves the answer list of the most similar question.
It gives an answer.
```
User: Where is normandy?
Chatbot: France, Italy, Belgium, the Netherlands, Luxembourg and Germany
```


*   CASE 3:
We ask an impossible question.
The chatbot recognizes the question thanks to a very low hnsw distance and finds an empty list of answers.
It says "I don't know."
```
User: What force leads to a commonly used unit of mass?
Chatbot: I don't know. This is an impossible question ! (ノ ゜Д゜)ノ ︵ ┻━┻
```



*   CASE 4:
We ask a variation of an impossible question.
The chatbot computes the best match and finds that the most similar question has an empty list of answers.
It says "I don't know.".
```
User: What is a unit of mass?
Chatbot: I don't know.
```

