PLACAT is a voice-based conversational agent built on the Google Home platform, with the goal of combining the advantages of chatbots (user-friendly but not goal-oriented) with the capacities of question answering (QA) systems (which lack interactivity). Thanks to a controller that directs user input either to the chatbot or to the QA system by recognizing dialogue acts, we obtain a spoken QA chatbot over Wikipedia, implemented as a Google Home Action.
The development of PLACAT is supported by a grant from the HES-SO (AGP n. 82681). The main developer is Gabriel Luthier and the principal investigator is Andrei Popescu-Belis, both at HEIG-VD, Yverdon-les-Bains, Switzerland. The outcomes of the PLACAT project are summarized in the following article: Luthier G. and Popescu-Belis A., Chat or Learn: a Data-Driven Robust Question-Answering System, Proceedings of LREC 2020 (12th Language Resources and Evaluation Conference), Marseille, 11-16 May 2020, p. 5474-5480.
To run this application, you first need an Elasticsearch index containing Wikipedia pages. Then you will have to download the required models. The application has been tested with Python 3.7.3 (it appears to break with Python 3.8).
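Since the code is sensitive to the Python version, you may want to isolate it in a dedicated environment first; a minimal sketch with conda (the environment name `placat` is an arbitrary choice):

```
# Create and activate an environment pinned to the tested Python version
conda create -n placat python=3.7.3
conda activate placat
```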
- Download and install Elasticsearch (installation steps are listed at the bottom of the page). We used version `6.3.1`.
- Download a CirrusSearch dump of Wikipedia (a dump of Wikipedia pages in a format enabling indexing on Elasticsearch). The first file, named `enwiki-20190114-cirrussearch-content.json.gz` (or similar), is a dump of the English Wikipedia. For a smaller file, e.g. for testing, you can first try the Simple English dump named `simplewiki-20190114-cirrussearch-content.json.gz`, as sketched below.
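CirrusSearch dumps are published at https://dumps.wikimedia.org/other/cirrussearch/. As a sketch (the date `20190114` matches the example file names above; pick a dump that is actually listed):

```
# Download the Simple English CirrusSearch dump (adjust the date as needed)
wget https://dumps.wikimedia.org/other/cirrussearch/20190114/simplewiki-20190114-cirrussearch-content.json.gz
```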
- Run Elasticsearch: `systemctl start elasticsearch`.
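Before creating the index, you can check that Elasticsearch is up by querying the root endpoint, which returns cluster and version information:

```
curl -s localhost:9200
```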
- Create a new index: `curl -X PUT "localhost:9200/enwiki"`.
- Cut the dump into multiple files (the Bulk API accepts mass uploads in `ndjson` format but does not handle big files):
```
export dump=enwiki-20190114-cirrussearch-content.json.gz
export index=enwiki
mkdir chunks
cd chunks
# Split the decompressed dump into chunks of 500 lines each
zcat ../$dump | split -a 10 -l 500 - $index
# Upload each chunk through the Bulk API, deleting it on success
for file in *; do
  echo -n "${file}: "
  took=$(curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/$index/_bulk --data-binary @$file |
    grep took | cut -d':' -f 2 | cut -d',' -f 1)
  printf '%7s\n' $took
  [ "x$took" = "x" ] || rm $file
done
```
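Since the loop above deletes each chunk only after a successful upload, any files left in `chunks/` correspond to failed requests and can simply be re-submitted by running the loop again:

```
# A non-empty listing means some bulk requests failed
ls | wc -l
```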
- You can now test the index by executing a simple search query:
curl -X GET "localhost:9200/$index/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"simple_query_string" : {
"query": "Switzerland",
"fields": ["title"]
}
}
}
'
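You can also verify how many pages were indexed; the `_count` endpoint is part of the standard Elasticsearch API:

```
curl -X GET "localhost:9200/$index/_count"
```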
- [Optional] Download and install Kibana to visualize the data.
- [Optional] If you want to keep only some attributes in the index:
```
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "enwiki",
    "_source": ["title", "opening_text"]
  },
  "dest": {
    "index": "enwiki_clean"
  }
}
'
```
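To check the result, you can fetch one document from the new index and verify that only the kept attributes remain:

```
curl -X GET "localhost:9200/enwiki_clean/_search?size=1"
```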
- [Optional] If you want to delete individual pages (which may just add noise to the QA system):
```
# Find the page's id
curl -X GET "localhost:9200/enwiki/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "title": "where is where" }
  }
}
'
# Test if the id is the right one
curl -X GET "localhost:9200/enwiki/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "terms": { "_id": [ "36897462" ] }
  }
}
'
# Delete the page
curl -X DELETE "localhost:9200/enwiki/page/36897462"
```
- Check that your Google account has the following permissions enabled at the Activity Controls: `Web & App Activity`, `Device Information` and `Voice & Audio Activity`.
- Create a new Google Action project at the Actions Console. Then create a new Action with a `Custom intent`.
- Once redirected to Dialogflow, create a new agent.
- In this agent, create a new intent called `question`.
- Under `Action and parameters`, add a new parameter named `question` with `required` checked, `@sys.any` as entity, `$question` as value, and `is list` unchecked. Write the prompt text that you wish to use.
- Under `Training phrases`, add any noun (for instance "banana") and double click on it to bind it to the `@sys.any:question` entity.
- Delete all text responses under `Responses` and check `Enable webhook call for this intent` under `Fulfillment`.
- In the agent's `Fulfillment` tab, enable the webhook and specify its URL. For testing purposes, you can use ngrok, as sketched below.
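A minimal ngrok sketch, assuming the backend listens on port 5000 as in the web interface described at the end of this page (Dialogflow requires an HTTPS URL, which the tunnel provides):

```
# Expose the local backend through a public HTTPS tunnel
ngrok http 5000
# Paste the https://...ngrok.io forwarding URL into the Dialogflow webhook field
```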
- Download and unzip the chatbot model 8000_checkpoint.tar (488MB) into the `data/save/bnc_cornell/2-2_500/` folder. The model has been trained using data from the Cornell Movie-Dialogs Corpus and the British National Corpus zipped up together.
- Download and unzip the question-answering model pytorch_model.bin (387MB) into the `bert-model/` folder.
- Download the controller model controller.pt (132KB) into the `data/` folder.
- Install the required packages, for instance into a virtual environment, with `conda install --file requirements.txt`. You might need to add the `conda-forge` channel: `conda config --add channels conda-forge` and then `conda config --set channel_priority strict`. You might also need to install some packages manually.
- Run `python -m spacy download en_core_web_lg` to download the model used by the `neuralcoref` module to enable pronoun resolution; a quick check is sketched below.
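To verify that coreference resolution works, you can run a short snippet; this is a minimal sketch based on the documented `neuralcoref` API (`add_to_pipe` and the `doc._.coref_resolved` extension), with an arbitrary example sentence:

```
python - <<'EOF'
import spacy
import neuralcoref

# Load the spaCy model downloaded above and add neuralcoref to its pipeline
nlp = spacy.load('en_core_web_lg')
neuralcoref.add_to_pipe(nlp)

doc = nlp("Where is Lausanne? How many people live in it?")
print(doc._.has_coref)       # True if a coreference cluster was found
print(doc._.coref_resolved)  # text with pronouns replaced by their antecedents
EOF
```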
- Execute `./run_backend.sh` to run PLACAT.
Use one of the following methods:
- Web interface at `http://127.0.0.1:5000/chat` once the server is up (adjust the address and/or port depending on your server).
- The `qa.py` script to test one question: `python qa.py -q What is penicillin ?`
- Simulator on Dialogflow, if you have set it up in the optional step.