# Clarifying questions

This demo is based on the ClariQ code available [here](https://github.com/aliannejadi/ClariQ).

The ClariQ dataset is designed to study the following situation in dialogue settings:
  * a user asks an ambiguous question (that is, a question that has more than one possible answer);
  * the system must recognize that the question is ambiguous and, instead of trying to answer it directly, ask a good clarifying question.

ClariQ was collected as part of the [ConvAI3](http://convai.io) challenge, which was co-organized with the [SCAI workshop](https://scai-workshop.github.io/2020/). The collected dataset consists of:


1.   **User Request**: an initial user request in conversational form, e.g., "What is Fickle Creek Farm?", with a label from 1 to 4 indicating how much clarification is needed;
2.   **Clarification questions**: a set of possible clarifying questions, e.g., "Do you want to know the location of fickle creek farm?";
3.   **User Answers**: each question is supplied with a user answer, e.g., "No, I want to find out where can i purchase fickle creek farm products."
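
As a rough illustration of how these three parts fit together, one row of the data can be pictured as the record below. The field names follow the columns of `dev.tsv` displayed later in this notebook; the `ClariQRecord` class itself is a hypothetical helper and is not part of the ClariQ code.

```python
from dataclasses import dataclass


@dataclass
class ClariQRecord:
    """One (request, clarifying question, answer) row; hypothetical helper."""
    topic_id: int
    initial_request: str     # the user request
    clarification_need: int  # label from 1 to 4
    question_id: str         # ID of the clarifying question in the question bank
    question: str            # the clarifying question
    answer: str              # the user's answer to that question


# Values copied from the first row of the dev set shown further down.
record = ClariQRecord(
    topic_id=101,
    initial_request="Find me information about the Ritz Carlton Lake Las Vegas.",
    clarification_need=2,
    question_id="Q00697",
    question="are you looking for a specific web site",
    answer="yes for the ritz carlton resort at lake las vegas",
)
print(record.initial_request, "->", record.question)
```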


For training, the collected dataset is split into training (187 topics) and validation (50 topics) sets. For testing, participants are supplied with (1) a set of user requests in conversational form and (2) a question bank containing all the clarifying questions collected for the dataset. This gives rise to the following two tasks:

1.   Given a user request, return a score [1−4] indicating the necessity of asking clarifying questions.
2.   Given a user request that needs clarification, return the most suitable clarifying question. Participants can either:
      * select a clarifying question from the provided question bank (all the clarifying questions we collected), aiming to maximize precision,
      * or choose not to ask any question (by selecting Q00001, the empty question, from the question bank).

In this notebook we use a BM25 ranker as a simple baseline model. It ranks the questions in the question bank by their BM25 relevance score with respect to the user's `initial_request`.
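
To make the scoring concrete, here is a minimal pure-Python sketch of a textbook BM25 score. For each query term it multiplies an inverse document frequency weight by a saturated, length-normalized term frequency. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the toy corpus are assumptions made for illustration. The `rank_bm25` package used below may differ in details such as how it computes the IDF, so the numbers here are not expected to match its output.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Returns one BM25 score per tokenized document in `corpus`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy corpus of already stemmed and tokenized questions (made up for illustration).
corpus = [
    ["want", "know", "locat", "farm"],
    ["want", "buy", "farm", "product"],
    ["look", "specif", "web", "site"],
]
scores = bm25_scores(["farm", "product"], corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

The second document matches both query terms and therefore ranks first, while the third shares no term with the query and scores zero.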

In [None]:
# Installs required packages & clones ClariQ repo.

! pip install rank_bm25
! git clone https://github.com/aliannejadi/ClariQ.git ClariQ-repo

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2
Cloning into 'ClariQ-repo'...
remote: Enumerating objects: 210, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 210 (delta 14), reused 34 (delta 14), pack-reused 176
Receiving objects: 100% (210/210), 253.02 MiB | 33.21 MiB/s, done.
Resolving deltas: 100% (105/105), done.
Checking out files: 100% (40/40), done.


In [None]:
# Imports required packages, defines the stem & tokenize function

import pandas as pd
from rank_bm25 import BM25Okapi
import nltk
from nltk.stem.porter import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

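# Build the stop-word set once instead of reloading the list for every token.
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def stem_tokenize(text, remove_stopwords=True):
  """Splits text into word tokens, optionally drops stop words, stems the rest."""
  stemmer = PorterStemmer()
  tokens = [word for sent in nltk.sent_tokenize(text)
            for word in nltk.word_tokenize(sent)]
  if remove_stopwords:
    tokens = [word for word in tokens if word not in STOPWORDS]
  return [stemmer.stem(word) for word in tokens]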

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# File paths

request_file_path = './ClariQ-repo/data/dev.tsv'
question_bank_path = './ClariQ-repo/data/question_bank.tsv'
run_file_path = './ClariQ-repo/sample_runs/dev_bm25'

In [None]:
# Reads files and builds the BM25 corpus (index)

dev = pd.read_csv(request_file_path, sep='\t')
question_bank = pd.read_csv(question_bank_path, sep='\t').fillna('')

question_bank['tokenized_question_list'] = question_bank['question'].map(stem_tokenize)
question_bank['tokenized_question_str'] = question_bank['tokenized_question_list'].map(lambda x: ' '.join(x))

bm25_corpus = question_bank['tokenized_question_list'].tolist()
bm25 = BM25Okapi(bm25_corpus)


In [None]:
dev

Unnamed: 0,topic_id,initial_request,topic_desc,clarification_need,facet_id,facet_desc,question_id,question,answer
0,101,Find me information about the Ritz Carlton Lake Las Vegas.,Find information about the Ritz Carlton resort at Lake Las Vegas.,2,F0010,Find information about the Ritz Carlton resort at Lake Las Vegas.,Q00697,are you looking for a specific web site,yes for the ritz carlton resort at lake las vegas
1,101,Find me information about the Ritz Carlton Lake Las Vegas.,Find information about the Ritz Carlton resort at Lake Las Vegas.,2,F0010,Find information about the Ritz Carlton resort at Lake Las Vegas.,Q03272,would you like the history of ritz carlton lake las vegas,where can i find the history of the ritz carton lake in las vegas
2,101,Find me information about the Ritz Carlton Lake Las Vegas.,Find information about the Ritz Carlton resort at Lake Las Vegas.,2,F0010,Find information about the Ritz Carlton resort at Lake Las Vegas.,Q03282,would you like the location of the ritz carlton lake las vegas,yes along with other information
3,101,Find me information about the Ritz Carlton Lake Las Vegas.,Find information about the Ritz Carlton resort at Lake Las Vegas.,2,F0010,Find information about the Ritz Carlton resort at Lake Las Vegas.,Q03582,would you like to know the capacity of ritz carlton lake las vegas,yes and other information
4,101,Find me information about the Ritz Carlton Lake Las Vegas.,Find information about the Ritz Carlton resort at Lake Las Vegas.,2,F0010,Find information about the Ritz Carlton resort at Lake Las Vegas.,Q03695,would you like to know where ritz carlton lake las vegas is on a map,no i want to know more about the ritz carlton lake las vegas resort
...,...,...,...,...,...,...,...,...,...
2308,292,i'm interested in history of the electronic medical record,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,3,F0745,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,Q02038,do you want to know how electronic medical record keeping began,yes and the evolution there after
2309,292,i'm interested in history of the electronic medical record,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,3,F0745,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,Q02314,do you want to know when the first medical record was record,that would help
2310,292,i'm interested in history of the electronic medical record,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,3,F0745,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,Q03305,would you like to buy a book about this topic,no show me the history of it
2311,292,i'm interested in history of the electronic medical record,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,3,F0745,Find information on how the electronic medical record (or electronic health record) has evolved through the years.,Q03679,would you like to know when the electronic medical record became more widely used,i would like to know how its evolved over the years


In [None]:
question_bank

Unnamed: 0,question_id,question,tokenized_question_list,tokenized_question_str
0,Q00001,,[],
1,Q00002,a total cholesterol of 180 to 200 mgdl 10 to 1...,"[total, cholesterol, 180, 200, mgdl, 10, 111, ...",total cholesterol 180 200 mgdl 10 111 mmoll le...
2,Q00003,about how many years experience do you want th...,"[mani, year, experi, want, instructor]",mani year experi want instructor
3,Q00004,according to anima the bible or what other source,"[accord, anima, bibl, sourc]",accord anima bibl sourc
4,Q00005,ae you looking for examples of septic system d...,"[ae, look, exampl, septic, system, design]",ae look exampl septic system design
...,...,...,...,...
3936,Q03937,would you want to buy flame design stickers,"[would, want, buy, flame, design, sticker]",would want buy flame design sticker
3937,Q03938,would you want to know about ron howards actin...,"[would, want, know, ron, howard, act, career]",would want know ron howard act career
3938,Q03939,would you want to know credit report scores,"[would, want, know, credit, report, score]",would want know credit report score
3939,Q03940,would you want to know what is in a credit report,"[would, want, know, credit, report]",would want know credit report


In [None]:
# Runs bm25 for every query and stores output in file.

queries = []
clarifying_questions = []

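with open(run_file_path, 'w') as fo:
  for tid in dev['topic_id'].unique():
    # All rows of a topic share the same initial request; take the first one.
    query = dev.loc[dev['topic_id']==tid, 'initial_request'].tolist()[0]
    # Top 30 tokenized questions from the question bank, best match first.
    bm25_ranked_list = bm25.get_top_n(stem_tokenize(query, True),
                                    bm25_corpus,
                                    n=30)
    # Map the tokenized questions back to question IDs via their joined token string.
    bm25_q_list = [' '.join(sent) for sent in bm25_ranked_list]
    preds = question_bank.set_index('tokenized_question_str').loc[bm25_q_list, 'question_id'].tolist()
    queries.append(query)
    clarifying_questions.append(question_bank.loc[question_bank['question_id'] == preds[0], 'question'].tolist()[0])
    # One line per prediction: topic ID, 0, question ID, rank, score (higher is better), run tag.
    for i, qid in enumerate(preds):
      fo.write('{} 0 {} {} {} bm25\n'.format(tid, qid, i, len(preds)-i))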
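
The run file written by the loop above has one line per predicted question, with six whitespace-separated fields. The sketch below builds and parses such lines to show the layout. The helper names `format_run_line` and `parse_run_line` are hypothetical, and the question IDs are used only to illustrate the format, not as actual BM25 output.

```python
def format_run_line(topic_id, question_id, rank, score, run_tag="bm25"):
    """Builds one run-file line in the same layout as the loop above."""
    return "{} 0 {} {} {} {}".format(topic_id, question_id, rank, score, run_tag)


def parse_run_line(line):
    """Splits a run-file line back into its six fields."""
    topic_id, _placeholder, question_id, rank, score, run_tag = line.split()
    return {
        "topic_id": int(topic_id),
        "question_id": question_id,
        "rank": int(rank),
        "score": float(score),
        "run_tag": run_tag,
    }


# Illustrative ranking for topic 101; the order is made up.
preds = ["Q03272", "Q03282", "Q00697"]
lines = [format_run_line(101, qid, i, len(preds) - i) for i, qid in enumerate(preds)]
print("\n".join(lines))
parsed = parse_run_line(lines[0])
```

Because the score is `len(preds) - i`, it simply decreases with the rank, so the file preserves the ordering but not the raw BM25 scores.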


In [None]:
pd.set_option('display.max_colwidth', 300)
pd.DataFrame({"query": queries, "clarifying question": clarifying_questions})

Unnamed: 0,query,clarifying question
0,Find me information about the Ritz Carlton Lake Las Vegas.,do you want historical information on the ritz carlton lake las vegas
1,I'm looking for universal animal cuts reviews,would you like to review universal animal cuts
2,tell me about cass county missouri,do you want to know about the schools in cass county missouri
3,Tell about an adobe indian house?,are you looking for a tour of adobe indian houses
4,What is von Willebrand Disease?,are you interested in the types of von willebrand disease
5,Tell me about atypical squamous cells,are you interested in atypical squamous cells in urine
6,all men are created equal,when was raspberry pi created
7,Tell me more about Rocky Mountain News,are you looking for information regarding the rocky mountain range
8,Find me information about the sales tax in Illinois.,are you interested in how to pay your illinois state tax
9,I'm looking for information on hobby stores,are you looking for a specific hobby store


In [None]:
# Report question relevance performance
! python ./ClariQ-repo/src/clariq_eval_tool.py  --eval_task question_relevance\
                                                --data_dir ./ClariQ-repo/data/ \
                                                --experiment_type dev \
                                                --run_file {run_file_path} \
                                                --out_file {run_file_path}_question_relevance.eval


Recall5: 0.3245570421150917
Recall10: 0.5638042646208281
Recall20: 0.6674997108155003
Recall30: 0.6912818698329535
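
For intuition about these numbers, recall at k can be sketched as the fraction of a topic's relevant questions that appear among the top k ranked questions, averaged over topics. The code below is a minimal illustration under that assumption. The function names and the toy data are hypothetical, and `clariq_eval_tool.py` may differ in details such as which topics enter the average.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant IDs that occur in the top-k of the ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(run, qrels, k):
    """Averages recall@k over all topics that have at least one relevant ID."""
    topics = [topic for topic, relevant in qrels.items() if relevant]
    return sum(recall_at_k(run.get(topic, []), qrels[topic], k) for topic in topics) / len(topics)


# Toy data: ranked question IDs per topic and the relevant IDs per topic.
run = {101: ["Q1", "Q2", "Q3", "Q4"], 102: ["Q5", "Q6"]}
qrels = {101: {"Q2", "Q9"}, 102: {"Q5"}}

# Topic 101 retrieves 1 of 2 relevant questions in its top 2, topic 102 retrieves 1 of 1.
print(mean_recall_at_k(run, qrels, k=2))  # 0.75
```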
