## Retrieval-based chatbots

We can use these semantic search pipelines to build a *retrieval-based chatbot*: a data-driven, open-domain conversational agent that selects its responses from a corpus.

In [1]:
# Install the required Huggingface libs
# ! pip install datasets transformers accelerate -U

import datasets
import os
import torch
import pandas

In [9]:

!pip install sentencepiece
!pip -q install transformers sentencepiece accelerate




In [7]:
!pip install -U sentence-transformers


In [2]:

import torch
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch optimiser
from torch.utils.data import DataLoader

from transformers.data.processors.squad import SquadV2Processor

# Load the SquadV2 processor
processor = SquadV2Processor()

# Load the pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


Let's load the dataset

In [4]:
# Set the directory to work in
WORKING_DIR = "./squad"
DATA_DIR = "./data"
# Download the dataset
print("Downloading the DEV dataset of SQuAD2.0")
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json


# wget saves the file in the current working directory (/content on Colab)
dev_dataframe = pandas.read_json("dev-v2.0.json")
print(f"Size of the dataset: {dev_dataframe.size} (e.g. Categories of questions)")
dev_dataframe.head()


Downloading the DEV dataset of SQuAD2.0
Size of the dataset: 70 (e.g. Categories of questions)


| | version | data |
|---|---|---|
| 0 | v2.0 | {'title': 'Normans', 'paragraphs': [{'qas': [{... |
| 1 | v2.0 | {'title': 'Computational_complexity_theory', '... |
| 2 | v2.0 | {'title': 'Southern_California', 'paragraphs':... |
| 3 | v2.0 | {'title': 'Sky_(United_Kingdom)', 'paragraphs'... |
| 4 | v2.0 | {'title': 'Victoria_(Australia)', 'paragraphs'... |
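
A note on the number printed above: `DataFrame.size` counts cells (rows times columns), not rows. With the two columns shown here, 70 cells would correspond to 35 articles. The small sketch below uses a made-up toy frame (not the SQuAD data) to show the difference between `len`, `size` and `shape`.

```python
import pandas as pd

# Toy frame (hypothetical data): 3 rows, 2 columns
toy = pd.DataFrame({
    "version": ["v2.0"] * 3,
    "data": [{"title": t} for t in ("a", "b", "c")],
})

n_rows = len(toy)    # number of rows (articles in the notebook's case)
n_cells = toy.size   # rows * columns
n_shape = toy.shape  # (rows, columns)
print(n_rows, n_cells, n_shape)
```

In the notebook, `len(dev_dataframe)` would therefore be the more natural way to report how many articles were loaded.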


Now we need to build all the message-response pairs.
Normally, we would start by grouping together the utterances (turns) that belong to the same dialogue.
However, our data set contains only questions and answers, not dialogues.
First, we make sure the data set is held in a pandas DataFrame.

In [5]:
import pandas as pd

df = pd.DataFrame(dev_dataframe)
df

| | version | data |
|---|---|---|
| 0 | v2.0 | {'title': 'Normans', 'paragraphs': [{'qas': [{... |
| 1 | v2.0 | {'title': 'Computational_complexity_theory', '... |
| 2 | v2.0 | {'title': 'Southern_California', 'paragraphs':... |
| 3 | v2.0 | {'title': 'Sky_(United_Kingdom)', 'paragraphs'... |
| 4 | v2.0 | {'title': 'Victoria_(Australia)', 'paragraphs'... |
| 5 | v2.0 | {'title': 'Huguenot', 'paragraphs': [{'qas': [... |
| 6 | v2.0 | {'title': 'Steam_engine', 'paragraphs': [{'qas... |
| 7 | v2.0 | {'title': 'Oxygen', 'paragraphs': [{'qas': [{'... |
| 8 | v2.0 | {'title': '1973_oil_crisis', 'paragraphs': [{'... |
| 9 | v2.0 | {'title': 'European_Union_law', 'paragraphs': ... |


Extract questions and paragraphs and index them

In [31]:
# Reuse the DataFrame loaded above from dev-v2.0.json
squad_data = dev_dataframe

# Extract questions and paragraphs and index them
questions = []
paragraphs = []

# Iterate over documents and paragraphs
for _, row in squad_data.iterrows():
    for paragraph in row['data']['paragraphs']:
        for qa in paragraph['qas']:
            question_dict = {
                'id': qa['id'],
                'title': row['data']['title'],
                'context': paragraph['context'],
                'question': qa['question'],
                'is_impossible': qa['is_impossible'],
                'answers': qa['answers']
            }
            questions.append(question_dict)


        # Append paragraphs to the separate list
        paragraph_dict = {
            'title': row['data']['title'],
            'context': paragraph['context']
        }
        paragraphs.append(paragraph_dict)

# Convert the lists to DataFrames
questions_df = pd.DataFrame(questions)
paragraphs_df = pd.DataFrame(paragraphs)

# Display the DataFrames
print("Questions:")
print(questions_df.head())

print("\nParagraphs:")
print(paragraphs_df.head())

impossible_questions_df = questions_df[questions_df['is_impossible']]
print(impossible_questions_df)

Questions:
                         id    title  \
0  56ddde6b9a695914005b9628  Normans   
1  56ddde6b9a695914005b9629  Normans   
2  56ddde6b9a695914005b962a  Normans   
3  56ddde6b9a695914005b962b  Normans   
4  56ddde6b9a695914005b962c  Normans   

                                             context  \
0  The Normans (Norman: Nourmands; French: Norman...   
1  The Normans (Norman: Nourmands; French: Norman...   
2  The Normans (Norman: Nourmands; French: Norman...   
3  The Normans (Norman: Nourmands; French: Norman...   
4  The Normans (Norman: Nourmands; French: Norman...   

                                            question  is_impossible  \
0               In what country is Normandy located?          False   
1                 When were the Normans in Normandy?          False   
2      From which countries did the Norse originate?          False   
3                          Who was the Norse leader?          False   
4  What century did the Normans first gain their ...    

Now we can go through the articles and their paragraphs to build the message-response pairs, skipping the impossible questions.

In [29]:
def extract_question_pairs(row, questions_pairs):
    for p in row["data"]["paragraphs"]:
        for q in p["qas"]:
            if not q["is_impossible"]:
                answers = [ans["text"] for ans in q["answers"]]
                questions_pairs.append(
                    {'message': q["question"], 'response': answers}
                )


dev_dataframe_pairs = []
dev_dataframe.apply(extract_question_pairs, axis=1, questions_pairs=dev_dataframe_pairs)

print(len(dev_dataframe_pairs))
print(dev_dataframe_pairs[:20])


5928
[{'message': 'In what country is Normandy located?', 'response': ['France', 'France', 'France', 'France']}, {'message': 'When were the Normans in Normandy?', 'response': ['10th and 11th centuries', 'in the 10th and 11th centuries', '10th and 11th centuries', '10th and 11th centuries']}, {'message': 'From which countries did the Norse originate?', 'response': ['Denmark, Iceland and Norway', 'Denmark, Iceland and Norway', 'Denmark, Iceland and Norway', 'Denmark, Iceland and Norway']}, {'message': 'Who was the Norse leader?', 'response': ['Rollo', 'Rollo', 'Rollo', 'Rollo']}, {'message': 'What century did the Normans first gain their separate identity?', 'response': ['10th century', 'the first half of the 10th century', '10th', '10th']}, {'message': 'Who was the duke in the battle of Hastings?', 'response': ['William the Conqueror', 'William the Conqueror', 'William the Conqueror']}, {'message': 'Who ruled the duchy of Normandy', 'response': ['Richard I', 'Richard I', 'Richard I']}, 

#### Index

Now we can embed the messages in our data set

In [17]:
corpus_embeddings = semb_model.encode([sample['message'] for sample in dev_dataframe_pairs], convert_to_tensor=True, show_progress_bar=True)


Next, we build an approximate nearest-neighbour (ANN) index for fast cosine-similarity search.

In [18]:
!pip install hnswlib



In [19]:
import os
import hnswlib


# Create empty index
hnswlib_index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = "./emp_dialogue_hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    hnswlib_index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    hnswlib_index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    hnswlib_index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    hnswlib_index.save_index(index_path)

Start creating HNSWLIB index
Saving index to: ./emp_dialogue_hnswlib.index
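
To make clear what the index computes, here is a minimal sketch of the exact search that HNSW approximates. It uses plain NumPy and a made-up two-dimensional corpus; the function name `cosine_knn` is hypothetical and not part of hnswlib. With `space='cosine'`, hnswlib reports distances as 1 minus the cosine similarity, so smaller values mean closer matches and an identical vector has a distance of (approximately) zero.

```python
import numpy as np


def cosine_knn(query, corpus, k=3):
    """Exact k-nearest-neighbour search under cosine distance (1 - cosine similarity)."""
    corpus = np.asarray(corpus, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Normalise so that the dot product equals the cosine similarity
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    distances = 1.0 - corpus_n @ query_n
    # Sort by distance, closest first, and keep the k best
    order = np.argsort(distances)[:k]
    return order, distances[order]


# Toy corpus (hypothetical vectors, not real sentence embeddings)
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ids, dists = cosine_knn([1.0, 0.1], toy_corpus, k=2)
print(ids, dists)
```

The brute-force version scans every vector, which is fine for a few thousand questions; the HNSW index trades a small loss in recall for much faster queries on larger corpora.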


### Search for response

We are going to search for a response in the following way:
1. Search for the stored messages most similar to the latest user input.
2. Retrieve the responses associated with those messages.
3. Re-rank the candidate responses.

#### Retrieval function

Let's define a response function

In [75]:
import numpy as np

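def get_response(message, mes_resp_pairs, index, re_ranking_model=None, top_k=32):
    # Embed the user message with the bi-encoder
    message_embedding = semb_model.encode(message, convert_to_tensor=True).cpu()

    # Retrieve the top_k most similar stored messages (closest first)
    corpus_ids, _ = index.knn_query(message_embedding, k=top_k)

    # Without a re-ranker, return the response of the nearest stored message
    if re_ranking_model is None:
        return mes_resp_pairs[corpus_ids[0][0]]['response']

    # Re-score every (message, response) candidate with the cross-encoder
    model_inputs = [(str(message), str(mes_resp_pairs[idx]['response'])) for idx in corpus_ids[0]]
    cross_scores = re_ranking_model.predict(model_inputs)

    # Pick the candidate with the highest cross-encoder score
    idx = np.argsort(-cross_scores)[0]

    return mes_resp_pairs[corpus_ids[0][idx]]['response']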


import numpy as np

# Extended version of get_response: it returns a single answer and replies
# "I don't know." when the best-matching question has no stored answer.


def pick_best_answer(message, answers, re_ranking_model=None):
    # Identical answers only need to be scored once
    unique_answers = np.unique(answers)
    # Without a re-ranker (or with a single candidate) there is nothing to compare
    if re_ranking_model is None or len(unique_answers) == 1:
        return str(answers[0])
    # Score every (message, answer) pair with the cross-encoder and keep the best one
    scores = re_ranking_model.predict([(str(message), str(answer)) for answer in unique_answers])
    return str(unique_answers[np.argmax(scores)])


def get_response(message, mes_resp_pairs, index, re_ranking_model=None, top_k=32, exact_match_distance=1e-6):
    message_embedding = semb_model.encode(message, convert_to_tensor=True).cpu()

    # Retrieve the top_k nearest stored questions (closest first)
    corpus_ids, distances = index.knn_query(message_embedding, k=top_k)
    if len(corpus_ids) == 0 or len(corpus_ids[0]) == 0:
        print("Warning: Corpus IDs are empty.")
        return "I don't know how to answer with the data available. This is an impossible question."

    # A near-zero cosine distance means the question is already in the corpus
    # (adjust exact_match_distance as needed)
    if distances[0][0] < exact_match_distance:
        answers = mes_resp_pairs[corpus_ids[0][0]]['response']
        # An exact match without stored answers is an impossible question
        if not answers:
            return "I don't know."
        return pick_best_answer(message, answers, re_ranking_model)

    # Otherwise, select the candidate pair whose answers best match the message
    best_id = corpus_ids[0][0]
    if re_ranking_model is not None:
        model_inputs = [(str(message), str(mes_resp_pairs[idx]['response'])) for idx in corpus_ids[0]]
        cross_scores = re_ranking_model.predict(model_inputs)
        best_id = corpus_ids[0][np.argmax(cross_scores)]

    # Return a single answer from the selected pair
    answers = mes_resp_pairs[best_id]['response']
    if not answers:
        return "I don't know."
    return pick_best_answer(message, answers, re_ranking_model)


Note that re-ranking is optional (you can pass `None` as `re_ranking_model`) and that the number of top results to re-score (`top_k`) is configurable.
You can play a bit with these hyperparameters to see how the responses change.

In [24]:

chatbot_response = get_response(
    "I like going out with my puppies.", dev_dataframe_pairs, hnswlib_index, re_ranking_model=xenc_model
)
chatbot_response

['July', 'July', 'July']
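
The response above is the raw list of annotator answers stored for the retrieved question, so the same string can appear several times. As a minimal sketch, assuming a simple majority vote is acceptable, the list can be collapsed to a single string as shown below. The helper name `most_common_answer` is hypothetical, and this is only one possible strategy, not necessarily the one the notebook settles on.

```python
from collections import Counter


def most_common_answer(answers):
    """Return the most frequent answer string, or None for an empty list."""
    if not answers:
        return None
    # most_common(1) yields [(answer, count)]; ties keep first-seen order
    return Counter(answers).most_common(1)[0][0]


print(most_common_answer(['July', 'July', 'July']))
```

A majority vote needs no extra model call, but it ignores how well each variant fits the user's wording, which is what a cross-encoder score would capture.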

#### Conversation loop

Let's try chatting with our retrieval system.

In [76]:
# Initialise dialogue history
dialogue_history = ["Hello, how are you?"]

# Start chatting
print("Press [Ctrl-C] to stop\n\n\n\n")
print(f"Chatbot: {dialogue_history[0]}")
# Keep talking until stop
running = True
while running:
    try:
        # Read user message
        user_message = input("User: ")
        # Append message to dialogue history
        dialogue_history.append(user_message)
        # Search for a chatbot response
        chatbot_response = get_response(
            user_message, dev_dataframe_pairs, hnswlib_index, re_ranking_model=xenc_model
        )
        if chatbot_response:
            # Append chatbot response to dialogue history
            dialogue_history.append(chatbot_response)
            # Print chatbot response
            print(f"Chatbot: {chatbot_response}")
        else:
            print("Chatbot: I'm sorry, I couldn't find a suitable response.")
    except KeyboardInterrupt:
        running = False


Press [Ctrl-C] to stop




Chatbot: Hello, how are you?
User: Who was the Norse leader?
5.9604645e-08
Chatbot: Rollo


Now we want the chatbot to return only the best of the available answers and to reply "I don't know" to impossible questions.

Let's try with: What is France a region of? (It should answer "I don't know".)

TODO: impossible questions are not in the pairs, because `extract_question_pairs` skips every question with `is_impossible` set.
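
One possible way to address this TODO, sketched under the assumption that an empty response list is an acceptable marker for "no answer", is to keep the impossible questions in the pairs instead of dropping them. The function name `extract_all_pairs` and the sample record below are hypothetical; the sample only mimics the SQuAD 2.0 layout and is not taken from the data set.

```python
def extract_all_pairs(articles):
    """Build message-response pairs, keeping impossible questions with an empty response list."""
    pairs = []
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Impossible questions get an empty list instead of being skipped
                answers = [] if qa["is_impossible"] else [a["text"] for a in qa["answers"]]
                pairs.append({"message": qa["question"], "response": answers})
    return pairs


# Tiny hand-written sample in the SQuAD 2.0 layout (made up for illustration)
sample = [{
    "title": "Normans",
    "paragraphs": [{
        "context": "The Normans ...",
        "qas": [
            {"question": "In what country is Normandy located?",
             "is_impossible": False,
             "answers": [{"text": "France"}]},
            {"question": "What is France a region of?",
             "is_impossible": True,
             "answers": []},
        ],
    }],
}]

pairs = extract_all_pairs(sample)
print(pairs)
```

In the notebook this could be called as `extract_all_pairs(dev_dataframe["data"])`. The embeddings and the HNSW index would then need to be rebuilt from the new pairs, and any index file saved earlier should be deleted first so that a stale index is not loaded.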