# A LITTLE CHAT GOES A LONG WAY: HANDS-ON TUTORIAL ON CONVERSATIONAL INFORMATION ACCESS - (HANDS-ON)

### Tutorial on Joint Conversational Search and Recommendation at ECIR 2025, 04th April, Lucca, Italy

Guglielmo Faggioli, Nicola Ferro, Simone Merlo

**Step 0.a**: in this tutorial, we use the GPUs available in Colab. The Colab environment must be configured with access to a GPU for the rest of the code to work properly.

If this is not the case, you can set it up with the following steps.

Edit > Notebook settings > select a GPU as hardware accelerator

Run the following snippet of code to check that the GPU is visible:

In [1]:
!nvidia-smi # this fails if the environment is not set up correctly (i.e., no GPU)
!nvcc --version

Tue Apr  1 12:29:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
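
If you prefer to run this check from Python, for instance to fail early in a script, the sketch below wraps the same `nvidia-smi` tool. It uses only the standard library; the helper name `gpu_visible` is ours, and it assumes that `nvidia-smi -L` (which lists the detected GPUs) is on the path whenever a GPU runtime is attached.

```python
import shutil
import subprocess


def gpu_visible(timeout=30):
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # the tool is not installed: most likely a CPU-only runtime
        return False
    try:
        result = subprocess.run([exe, "-L"], capture_output=True,
                                text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    # each detected device is listed on a line starting with "GPU <n>:"
    listed = any(line.startswith("GPU ") for line in result.stdout.splitlines())
    return result.returncode == 0 and listed


if not gpu_visible():
    print("No GPU detected: check the runtime settings before continuing.")
```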
                                                

**Step 0.b**: install the support libraries, namely `ir_datasets` [1], `faiss` [2], and `ir_measures` [3]. We use `ir_datasets` to get the CaST 2019 data, `faiss` to carry out the search phase, and `ir_measures` to evaluate the quality of the retrieval.

This installation step requires approx 2 minutes.


[1] Sean MacAvaney, Andrew Yates, Sergey Feldman, Doug Downey, Arman Cohan, Nazli Goharian: Simplified Data Wrangling with ir_datasets. SIGIR 2021: 2429-2436.

[2] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, Hervé Jégou: The Faiss library. CoRR abs/2401.08281 (2024)

[3] Sean MacAvaney, Craig Macdonald, Iadh Ounis: Streamlining Evaluation with ir-measures. ECIR (2) 2022: 305-310

In [2]:
%%time
%%capture
!pip install torch
!pip install pytorch-transformers
!pip install faiss-cpu
!pip install ir_datasets
!pip install ir_measures

import ir_datasets
import ir_measures
import numpy as np
import pandas as pd
from tqdm import tqdm
import math
import faiss

CPU times: user 989 ms, sys: 230 ms, total: 1.22 s
Wall time: 2min 4s


In [3]:
%%time
%%capture
#download the tutorial's files
!gdown 1tkVTKT3ZC75f8wsVnyHu3SoIvD24Sq9B
!unzip tutorial_cs.zip
!rm tutorial_cs.zip

CPU times: user 35.7 ms, sys: 1.98 ms, total: 37.7 ms
Wall time: 6.34 s


# CONVERSATIONAL QUERY REWRITING

---

In this case, we explicitly rewrite each utterance to make it a stand-alone information need that does not require the context to be understood.

**Step 1**: the first step consists of downloading the models. This download requires approximately 2 minutes.


In [4]:
%%time
%%capture
!mkdir -p data/VoskaridesEtAl2020
!mkdir -p data/YuEtAl2020/models

#download and unzip the weights for QuReTeC
!gdown 1BKvRoKnbjWWne8Cp-dfLkQ_6s3SIte1f -O data/VoskaridesEtAl2020/models.zip
!unzip data/VoskaridesEtAl2020/models.zip -d data/VoskaridesEtAl2020
!rm data/VoskaridesEtAl2020/models.zip

#download  and unzip the weights for CQR
#uncomment the next line for the self-learned model
#!wget -P data/YuEtAl2020/models https://thunlp.s3-us-west-1.amazonaws.com/Self-Learn%2BCV-0.zip --no-check-certificate
#uncomment the next line for the rule-based model
!wget -P data/YuEtAl2020/models https://thunlp.s3-us-west-1.amazonaws.com/Rule-based%2BCV-1.zip --no-check-certificate
!unzip data/YuEtAl2020/models/Rule-based+CV-1.zip -d data/YuEtAl2020/models/
!rm data/YuEtAl2020/models/Rule-based+CV-1.zip

CPU times: user 340 ms, sys: 71.8 ms, total: 412 ms
Wall time: 1min 22s


**Step 2**: load the data. We load the qrels and queries directly using `ir_datasets`.

We can inspect the queries/conversations to see what they look like.

In [5]:
# get the dataset from ir_datasets. We use the "judged" version, to consider
# only queries for which we have judged documents
dataset = ir_datasets.load("trec-cast/v1/2019/judged")

# load qrels and queries
qrels = pd.DataFrame(dataset.qrels_iter())
queries = pd.DataFrame(dataset.queries_iter())

queries

[INFO] [starting] https://trec.nist.gov/data/cast/2019qrels.txt
[INFO] [finished] https://trec.nist.gov/data/cast/2019qrels.txt: [00:00] [1.14MB] [4.67MB/s]
[INFO] [starting] https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_v1.0.json
[INFO] [finished] https://raw.githubusercontent.com/daltonj/treccastweb/master/2019/data/evaluation/evaluation_topics_v1.0.json: [00:00] [57.2kB] [82.4MB/s]


| | query_id | raw_utterance | topic_number | turn_number | topic_title | topic_description |
|---|---|---|---|---|---|---|
| 0 | 31_1 | What is throat cancer? | 31 | 1 | head and neck cancer | A person is trying to compare and contrast typ... |
| 1 | 31_2 | Is it treatable? | 31 | 2 | head and neck cancer | A person is trying to compare and contrast typ... |
| 2 | 31_3 | Tell me about lung cancer. | 31 | 3 | head and neck cancer | A person is trying to compare and contrast typ... |
| 3 | 31_4 | What are its symptoms? | 31 | 4 | head and neck cancer | A person is trying to compare and contrast typ... |
| 4 | 31_5 | Can it spread to the throat? | 31 | 5 | head and neck cancer | A person is trying to compare and contrast typ... |
| ... | ... | ... | ... | ... | ... | ... |
| 168 | 79_5 | How is his work related to Comte? | 79 | 5 | sociology | Information about the field of sociology inclu... |
| 169 | 79_6 | What is the functionalist theory? | 79 | 6 | sociology | Information about the field of sociology inclu... |
| 170 | 79_7 | What is its main criticism? | 79 | 7 | sociology | Information about the field of sociology inclu... |
| 171 | 79_8 | How does it compare to conflict theory? | 79 | 8 | sociology | Information about the field of sociology inclu... |
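
As the table shows, each `query_id` combines the topic (conversation) number and the turn number, separated by an underscore. Below is a minimal sketch of how the turns can be grouped back into conversations, using a few utterances copied from the table above. The helper name `group_by_conversation` is ours and is not part of `ir_datasets`.

```python
from collections import defaultdict

# a few (query_id, raw_utterance) pairs copied from the table above
sample = [
    ("31_1", "What is throat cancer?"),
    ("31_2", "Is it treatable?"),
    ("79_7", "What is its main criticism?"),
    ("79_8", "How does it compare to conflict theory?"),
]


def group_by_conversation(utterances):
    """Map each topic number to its list of (turn_number, utterance), in turn order."""
    conversations = defaultdict(list)
    for query_id, text in utterances:
        topic, turn = query_id.split("_")
        conversations[int(topic)].append((int(turn), text))
    for turns in conversations.values():
        turns.sort()
    return dict(conversations)


conversations = group_by_conversation(sample)
for topic, turns in conversations.items():
    print(topic, [text for _, text in turns])
```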


In [6]:
import rewriters


def conversation_rewrite(utterances, rewriter):
    '''
    utterances: pandas DataFrame containing the sequence of utterances of a
    conversation. Utterances are expected to be in order.
    rewriter: object that implements Abstract Rewriter
    '''
    # start each conversation from an empty history
    rewriter.reset_history()

    rewrites = []
    for _, utterance in utterances.iterrows():
        rewrites.append([utterance.query_id, rewriter.rewrite(utterance.raw_utterance)])
    rewrites = pd.DataFrame(rewrites, columns = ["query_id", "rewritten_utterance"])
    return rewrites
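
To see what `conversation_rewrite` expects from a rewriter, here is a toy stand-in that exposes the same two methods used above, `reset_history` and `rewrite`. The class `AppendHistoryRewriter` is purely illustrative and is not part of the tutorial's `rewriters` module: it simply appends all previous utterances to the current one, which is far cruder than the rewriters used in the rest of this section.

```python
class AppendHistoryRewriter:
    """Hypothetical rewriter: appends the previous utterances of the conversation."""

    def __init__(self):
        self.history = []

    def reset_history(self):
        # forget everything: called at the beginning of each conversation
        self.history = []

    def rewrite(self, utterance):
        # current utterance first, followed by the past utterances in order
        rewritten = " ".join([utterance] + self.history)
        self.history.append(utterance)
        return rewritten


toy = AppendHistoryRewriter()
first = toy.rewrite("What is throat cancer?")
second = toy.rewrite("Is it treatable?")
toy.reset_history()  # a new conversation starts
third = toy.rewrite("What is the functionalist theory?")
print(first, second, third, sep="\n")
```

Without the call to `reset_history`, the third utterance would be polluted with the terms of an unrelated conversation, which is why `conversation_rewrite` resets the history before processing each group of utterances.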

## Conversational Query Rewriter (CQR)
We rewrite TREC CaST queries using Conversational Query Rewriter (CQR), as proposed in [4].

The implementation is based on https://github.com/thunlp/ConversationQueryRewriter.

[4] Shi Yu, Jiahua Liu, Jingqin Yang, Chenyan Xiong, Paul N. Bennett, Jianfeng Gao, Zhiyuan Liu: Few-Shot Generative Conversational Query Rewriting. SIGIR 2020: 1933-1936

In [7]:
rewriter = rewriters.Cqr(model_dir = "data/YuEtAl2020/models/Rule-based+CV-1", device = "cuda:0")

rewrites_Cqr = queries.groupby("topic_number")\
             .apply(lambda x:conversation_rewrite(x, rewriter))\
             .reset_index(drop=True)\
             .merge(queries[["query_id", "raw_utterance"]])

rewrites_Cqr


| | query_id | rewritten_utterance | raw_utterance |
|---|---|---|---|
| 0 | 31_1 | What is throat cancer? | What is throat cancer? |
| 1 | 31_2 | Is throat cancer treatable? | Is it treatable? |
| 2 | 31_3 | Tell me about lung cancer. | Tell me about lung cancer. |
| 3 | 31_4 | What are lung cancer's symptoms? | What are its symptoms? |
| 4 | 31_5 | Can lung cancer spread to the throat? | Can it spread to the throat? |
| ... | ... | ... | ... |
| 168 | 79_5 | How is Herbert Spencer's work related to Comte? | How is his work related to Comte? |
| 169 | 79_6 | What is the functionalist theory in sociology? | What is the functionalist theory? |
| 170 | 79_7 | What is the main criticism of functionalist th... | What is its main criticism? |
| 171 | 79_8 | How does functionalist theory compare to confl... | How does it compare to conflict theory? |


## Query Resolution by Term Classification (QuReTeC)

Here we apply the Query Resolution by Term Classification (QuReTeC) rewriter, as proposed in [5]. The implementation is based on https://github.com/nickvosk/sigir2020-query-resolution


[5] Nikos Voskarides, Dan Li, Pengjie Ren, Evangelos Kanoulas, Maarten de Rijke: Query Resolution for Conversational Search with Limited Supervision. SIGIR 2020: 921-930

In [8]:
rewriter = rewriters.QuReTeC(model_dir = "data/VoskaridesEtAl2020/models/191790_50", device = "cuda:0")

rewrites_QuReTeC = queries.groupby("topic_number")\
             .apply(lambda x:conversation_rewrite(x, rewriter))\
             .reset_index(drop=True)\
             .merge(queries[["query_id", "raw_utterance"]])

rewrites_QuReTeC


| | query_id | rewritten_utterance | raw_utterance |
|---|---|---|---|
| 0 | 31_1 | What is throat cancer? | What is throat cancer? |
| 1 | 31_2 | Is it treatable? throat cancer | Is it treatable? |
| 2 | 31_3 | Tell me about lung cancer. cancer | Tell me about lung cancer. |
| 3 | 31_4 | What are its symptoms? cancer lung | What are its symptoms? |
| 4 | 31_5 | Can it spread to the throat? cancer lung | Can it spread to the throat? |
| ... | ... | ... | ... |
| 168 | 79_5 | How is his work related to Comte? auguste herb... | How is his work related to Comte? |
| 169 | 79_6 | What is the functionalist theory? auguste | What is the functionalist theory? |
| 170 | 79_7 | What is its main criticism? functionalist | What is its main criticism? |
| 171 | 79_8 | How does it compare to conflict theory? functi... | How does it compare to conflict theory? |
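
The rewrites above show the shape of QuReTeC's output: the original utterance followed by terms taken from the conversation history. QuReTeC selects those terms with a trained term classifier; the sketch below replaces the classifier with a naive stopword filter, so it only illustrates the expansion format, not the model. The stopword list and the function name `expand_with_history_terms` are assumptions made for this example.

```python
import re

# tiny, hand-picked stopword list (an assumption made for this example)
STOPWORDS = {"a", "about", "are", "can", "how", "is", "it", "its",
             "me", "tell", "the", "to", "what"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_with_history_terms(utterance, history):
    """Append to the utterance the history terms that it does not already contain."""
    current = set(tokenize(utterance))
    added = []
    for past in history:
        for term in tokenize(past):
            if term not in STOPWORDS and term not in current and term not in added:
                added.append(term)
    return " ".join([utterance] + added)


history = ["What is throat cancer?", "Is it treatable?"]
expanded = expand_with_history_terms("Tell me about lung cancer.", history)
print(expanded)
```

Unlike this heuristic, a learned classifier can decide that a history term is no longer relevant to the current turn and leave it out.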


## Neural Transfer Reformulation (NTR)

Here we apply Neural Transfer Reformulation (NTR) as described in [6], following the implementation available on GitHub: https://github.com/castorini/chatty-goose/tree/c7d0cd8c45354b09b5fb930ab0b5af8be2e5772b


[6] Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin: Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting. ACM Trans. Inf. Syst. 39(4): 48:1-48:29 (2021)

In [9]:
rewriter = rewriters.Ntr(device = "cuda:0")

rewrites_Ntr = queries.groupby("topic_number")\
             .apply(lambda x:conversation_rewrite(x, rewriter))\
             .reset_index(drop=True)\
             .merge(queries[["query_id", "raw_utterance"]])

rewrites_Ntr

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

| | query_id | rewritten_utterance | raw_utterance |
|---|---|---|---|
| 0 | 31_1 | What is throat cancer? | What is throat cancer? |
| 1 | 31_2 | Is throat cancer treatable? | Is it treatable? |
| 2 | 31_3 | Tell me about lung cancer. | Tell me about lung cancer. |
| 3 | 31_4 | What are lung cancer symptoms? | What are its symptoms? |
| 4 | 31_5 | Can lung cancer spread to the throat? | Can it spread to the throat? |
| ... | ... | ... | ... |
| 168 | 79_5 | How is Herbert Spencer's work related to Augus... | How is his work related to Comte? |
| 169 | 79_6 | What is the functionalist theory? | What is the functionalist theory? |
| 170 | 79_7 | What is the functionalist theory's main critic... | What is its main criticism? |
| 171 | 79_8 | How does the functionalist theory compare to c... | How does it compare to conflict theory? |


# CONVERSATIONAL DENSE RETRIEVAL

---



Unlike rewriting techniques, ConvDR [7] does not explicitly rewrite the utterances. Instead, it directly produces a conversation-informed representation of the utterance-level information need.

In detail, ConvDR is a bi-encoder built upon ANCE, which employs two separate encoders to project the query and the documents into the same 768-dimensional latent space.
The similarity between the query and a document then corresponds to the dot product between their representations.
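
As a toy illustration of this scoring scheme, the sketch below ranks three made-up 4-dimensional document vectors against a query vector by dot product. The vectors and ids are invented for the example (the real representations have 768 dimensions), and in the rest of the tutorial this search is delegated to `faiss`.

```python
def dot(u, v):
    """Dot product between two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest dot product."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# made-up document and query representations
docs = {
    "d1": [1.0, 0.0, 0.0, 0.0],
    "d2": [0.5, 0.5, 0.0, 0.0],
    "d3": [0.0, 0.0, 1.0, 0.0],
}
query = [2.0, 1.0, 0.0, 0.0]

top = rank_by_dot_product(query, docs)
print(top)
```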




[7] Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, Zhiyuan Liu: Few-Shot Conversational Dense Retrieval. SIGIR 2021: 829-838

**Step 1**: download the ConvDR model weights. In this tutorial we only run inference, hence we assume access to a pretrained model. Yu et al. provide pretrained models for three collections: CaST 2019, CaST 2020, and OR-QuAC (https://github.com/thunlp/ConvDR). In our case, we focus on CaST 2019.

This download should take less than 1 minute.

In [10]:
%%time
%%capture
!mkdir -p data/YuEtAl2021/checkpoints
!mkdir -p data/YuEtAl2021/indexes/convdr
#download and unzip the weights for convdr
!wget -P data/YuEtAl2021/checkpoints https://data.thunlp.org/convdr/convdr-kd-cast19-1.zip --no-check-certificate
!unzip data/YuEtAl2021/checkpoints/convdr-kd-cast19-1.zip -d data/YuEtAl2021/checkpoints/convdr-kd-cast19-1

# download and unzip the documents' embeddings of a subset of documents of CaST 2019 for convdr
!gdown 14dqFeBl-C6Bc_SuNjLq5MnB3XpM7oGZI -O data/YuEtAl2021/indexes/convdr_faiss.zip
!unzip data/YuEtAl2021/indexes/convdr_faiss.zip -d data/YuEtAl2021/indexes/convdr


#remove zip files
!rm data/YuEtAl2021/checkpoints/convdr-kd-cast19-1.zip
!rm data/YuEtAl2021/indexes/convdr_faiss.zip

CPU times: user 206 ms, sys: 56.7 ms, total: 263 ms
Wall time: 43 s


**Step 2**: initialize the encoder. Since we only run inference, we assume the weights are already available; in this case, we downloaded them in Step 1. If you wish to use ConvDR with different weights or on a different collection, you might need to retrain the model following the approach described on the ConvDR GitHub page.

In [11]:
import encoders # encoders is one of the modules we created for the tutorial

encoder = encoders.ConvDR(
    model_path = "data/YuEtAl2021/checkpoints/convdr-kd-cast19-1",
    device="cuda:0")



Using mean: False


Some weights of the model checkpoint at data/YuEtAl2021/checkpoints/convdr-kd-cast19-1 were not used when initializing RobertaDot_NLL_LN: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaDot_NLL_LN from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaDot_NLL_LN from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Step 3.a (alternative to 3.b)**: if you do not have a faiss index storing the documents' representations, the first step is to create it. We get the corpus using the `ir_datasets` library, split it into batches, encode each batch with the `encode_documents` function of the encoder, and save the vectors in a faiss index.

Importantly, the code used here is not optimized. In a real-world scenario, you would likely want a more flexible data structure to load the data, such as a torch DataLoader, and you might want to parallelize over multiple GPUs. This code is intended as a guideline to the steps needed to set up a dense conversational search pipeline, hence we favoured simplicity and readability over optimisation.


In [None]:
'''
# read the corpus
corpus = pd.DataFrame(dataset.docs_iter())
corpus_size = len(corpus)

# initialise a float32 numpy array (the dtype expected by faiss) to store the
# documents' representation
representation = np.zeros((corpus_size, 768), dtype=np.float32)

# we iterate over the documents in batches. First, we need to determine a
# suitable batch size, and consequently the number of batches
batch_size = 250
n_batches =  math.ceil(corpus_size/batch_size)


for i in tqdm(range(n_batches)):
    # extract from the corpus dataframe the text of the documents
    batch_texts = corpus.iloc[i*batch_size:(i+1)*batch_size]["text"].to_list()

    # use the encode_documents function to encode such documents into vectors
    out = encoder.encode_documents(batch_texts)

    # store the vectors
    representation[i*batch_size:min(corpus_size, (i+1)*batch_size)] = out

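# finally, the representation can be stored in a faiss index: create an
# inner-product index and add the vectors to it (faiss expects float32).
index = faiss.IndexFlatIP(representation.shape[1])
index.add(np.ascontiguousarray(representation, dtype=np.float32))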

# we need a support structure to map the internal indexes of faiss to the CaST
# 2019 alphanumeric document ids.
map = {e: idx for e, idx in enumerate(corpus.doc_id)}
'''
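
The batching arithmetic used in the loop can be checked in isolation. The sketch below, which uses a small made-up corpus size rather than the real one, lists the slice boundaries produced by `math.ceil(corpus_size / batch_size)` batches. The helper name `batch_bounds` is ours.

```python
import math


def batch_bounds(corpus_size, batch_size):
    """Return the (start, end) slice boundaries of each batch, end excluded."""
    n_batches = math.ceil(corpus_size / batch_size)
    return [(i * batch_size, min(corpus_size, (i + 1) * batch_size))
            for i in range(n_batches)]


# made-up corpus size, chosen so that it is not a multiple of the batch size
bounds = batch_bounds(corpus_size=1020, batch_size=250)
print(bounds)
```

The last batch is shorter whenever the corpus size is not a multiple of the batch size, which is why the loop clips the upper bound with `min(corpus_size, ...)`.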

**Step 3.b (alternative to 3.a)**: if you already have a faiss index containing the document vectors, you can load it and use it for the retrieval. For this tutorial, we provide a small corpus that contains the annotated documents, both relevant and non-relevant (i.e., those contained in the qrels, approximately 21.7k documents), plus a random sample of 50k documents.

In [12]:
import json

index = faiss.read_index("data/YuEtAl2021/indexes/convdr/small_cast_convdr_index.faiss")

# JSON keys are strings: convert them back to the integer positions used by faiss
with open("data/YuEtAl2021/indexes/convdr/convdr_map.json", "r") as f:
    map = {int(k): v for k, v in json.load(f).items()}
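
The `int(k)` conversion is needed because JSON object keys are always strings, while `faiss` returns integer positions. A small round trip with made-up document ids (the ids below are placeholders, not real CaST identifiers) shows the effect:

```python
import json

# mapping from faiss positions to placeholder document ids
id_map = {0: "doc_a", 1: "doc_b", 2: "doc_c"}

# after a JSON round trip, the integer keys have become strings
restored = json.loads(json.dumps(id_map))
print(list(restored))

# converting the keys back to int makes lookups by faiss position work again
fixed = {int(k): v for k, v in restored.items()}
faiss_positions = [2, 0]
looked_up = [fixed[p] for p in faiss_positions]
print(looked_up)
```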

We can now define a couple of support functions that allow us to carry out the conversational retrieval and evaluate the performance of the system.

First, we define `evaluate_run` which employs `ir_measures` [3] to evaluate the performance of the retrieval system.

Secondly, we define `conversation_retrieve`, which allows us to run the retrieval at the conversation level. This is necessary because the history must persist across the turns of a conversation, but must be reset at the beginning of each new conversation.


In [14]:

def evaluate_run(run, measures = ("nDCG@3", "nDCG@5", "nDCG@10", "MRR")):
    # measures can be passed either as strings (e.g., "nDCG@10") or as
    # ir_measures objects: strings are parsed into measure objects.
    measures = [ir_measures.parse_measure(m) if isinstance(m, str) else m for m in measures]

    # compute the per-query performance via iter_calc
    performance = pd.DataFrame(ir_measures.iter_calc(measures, qrels, run))

    # cast the measure objects in the performance dataframe into strings
    performance['measure'] = performance['measure'].astype(str)

    # average each measure over the queries
    return performance.groupby("measure")["value"].mean()


def conversation_retrieve(utterances, encoder, index, map):
    encoder.reset_history()

    run = []
    for utterance in utterances.iterrows():
        utterance = utterance[1]
        utt_emb = encoder.encode_query(utterance.raw_utterance)
        ip, idx = index.search(utt_emb, 10)

        qrun = pd.DataFrame({"query_id": [utterance.query_id] * len(ip),
                             "doc_id": list(idx),
                             "score": list(ip)}).explode(["doc_id", "score"])

        qrun.doc_id = qrun.doc_id.map(lambda x: map[x])
        qrun.score = qrun.score.astype(float)
        run.append(qrun)
        # what happens if we reset the history after every utterance?
        # encoder.reset_history()
    run = pd.concat(run)
    return run



**Step 4**: finally, we can run the retrieval and evaluate the performance.

In [15]:
run = queries.groupby("topic_number")\
             .apply(lambda x: conversation_retrieve(x, encoder, index, map))\
             .reset_index(drop=True) #this is just to reshape the dataframe

evaluate_run(run)


| measure | value |
|---|---|
| RR | 0.819483 |
| nDCG@10 | 0.503259 |
| nDCG@3 | 0.530672 |
| nDCG@5 | 0.521453 |
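
To make these measures concrete, the sketch below computes the reciprocal rank and nDCG@k for a single made-up ranking. It assumes a common formulation (linear gain, log2 rank discount, and a minimum relevance of 1 for RR); `ir_measures` may differ in details such as the default relevance threshold, so treat this as an illustration rather than a re-implementation. The values reported above are the means of such per-query scores.

```python
import math


def reciprocal_rank(ranking, judgments, min_rel=1):
    """1 / rank of the first document judged at least min_rel, 0 if none."""
    for rank, doc_id in enumerate(ranking, start=1):
        if judgments.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, judgments, k):
    """DCG of the top-k documents, normalised by the DCG of the ideal ranking."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# made-up relevance judgments and ranking for a single query
judgments = {"d1": 3, "d2": 0, "d3": 1}
ranking = ["d2", "d1", "d3"]

rr = reciprocal_rank(ranking, judgments)
ndcg = ndcg_at_k(ranking, judgments, k=3)
print(rr, round(ndcg, 3))
```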


What happens if we instead use a function that resets the history after each utterance, so that the encoder behaves like an ad-hoc first-stage retriever?


In [16]:
def reset_history_conversation_retrieve(utterances, encoder, index, map):
    encoder.reset_history()

    run = []
    for utterance in utterances.iterrows():
        utterance = utterance[1]
        utt_emb = encoder.encode_query(utterance.raw_utterance)
        ip, idx = index.search(utt_emb, 10)

        qrun = pd.DataFrame({"query_id": [utterance.query_id] * len(ip),
                             "doc_id": list(idx),
                             "score": list(ip)}).explode(["doc_id", "score"])

        qrun.doc_id = qrun.doc_id.map(lambda x: map[x])
        qrun.score = qrun.score.astype(float)
        run.append(qrun)
        # what happens if we reset the history after every utterance?
        encoder.reset_history()
    run = pd.concat(run)
    return run

bad_run = queries.groupby("topic_number")\
             .apply(lambda x: reset_history_conversation_retrieve(x, encoder, index, map))\
             .reset_index(drop=True) #this is just to reshape the dataframe

evaluate_run(bad_run)


| measure | value |
|---|---|
| RR | 0.509911 |
| nDCG@10 | 0.298339 |
| nDCG@3 | 0.3025 |
| nDCG@5 | 0.300266 |
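
To quantify the gap, the snippet below computes the relative drop of each measure when the history is reset, using the values printed by the two evaluations in this section.

```python
# values copied from the two evaluation outputs in this section
with_history = {"RR": 0.819483, "nDCG@10": 0.503259,
                "nDCG@3": 0.530672, "nDCG@5": 0.521453}
without_history = {"RR": 0.509911, "nDCG@10": 0.298339,
                   "nDCG@3": 0.3025, "nDCG@5": 0.300266}

# relative drop: fraction of the original score lost by resetting the history
relative_drop = {m: (with_history[m] - without_history[m]) / with_history[m]
                 for m in with_history}

for measure, drop in relative_drop.items():
    print(f"{measure}: {drop:.1%}")
```

Every measure loses well over a third of its value, which shows how much of ConvDR's effectiveness on this collection depends on the conversational history.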
