This repository contains the work of the Nanibot Team on the Vietnamese Wikipedia Question Answering task of the Zalo AI Challenge 2019.
The work in this repository builds on a previous project on a similar task (Question Answering for UIT regulations).
- QASystem and Ultilities contain the source code, the base model, the fine-tuned models, and the dataset used in this project. A guide on how to set up and reproduce the results is also provided.
- Dataset contains the dataset used in this project.
Details on how to train and predict with the model are described here.
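The task is framed as sentence-pair classification: a question and a candidate Wikipedia paragraph are packed into one BERT input sequence, and the model predicts whether the paragraph answers the question. A minimal sketch of that input packing is below; whitespace tokenization stands in for BERT's WordPiece tokenizer, and `max_len=128` is an illustrative choice, not the project's actual setting.

```python
# Sketch: packing a (question, paragraph) pair into a BERT-style input.
# Whitespace tokenization is a stand-in for WordPiece (assumption).

def build_pair_input(question, paragraph, max_len=128):
    """Build [CLS] question [SEP] paragraph [SEP] with segment ids."""
    q_tokens = question.split()
    p_tokens = paragraph.split()
    # Truncate the paragraph so the full sequence fits in max_len
    # (3 slots are reserved for [CLS] and the two [SEP] tokens).
    budget = max_len - len(q_tokens) - 3
    p_tokens = p_tokens[:max(budget, 0)]
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # Segment 0 covers the question span, segment 1 the paragraph span.
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    return tokens, token_type_ids

tokens, segments = build_pair_input(
    "Ai là tác giả của Truyện Kiều ?",
    "Truyện Kiều là truyện thơ của đại thi hào Nguyễn Du .",
)
```

The classifier head then reads the final `[CLS]` representation to emit the relevant/irrelevant decision.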
- Applied BERT as a baseline for the QA problem defined by Zalo
- Augmented the data with the SQuAD dataset via translation and de-noising, which yielded a 1% F1 boost over the baseline model
- Tried to improve BERT with different architectures (BERT + TextCNN, BERT with an additional fully-connected layer (1), (2)), but none yielded an improvement
- Tried different loss functions for the classification problem ((squared) hinge loss, KLD loss and focal loss) along with label smoothing, but none yielded an improvement
- Augmented the data using back-translation
- Applied multilingual RoBERTa to the problem
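For reference, the focal loss and label smoothing mentioned in the list above can be sketched as follows for the binary relevant/irrelevant decision. This is a generic formulation, not the team's exact code; the `gamma` and `alpha` defaults are the common choices from the focal loss literature, and `eps=0.1` is a typical smoothing value.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss on a predicted probability p for true label y in {0, 1}.
    The (1 - p_t) ** gamma factor down-weights easy, well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smoothed_target(y, eps=0.1):
    """Label smoothing: replace the hard {0, 1} target with a soft value."""
    return y * (1.0 - eps) + 0.5 * eps
```

With `gamma=0.0` and `alpha=0.5` the focal loss reduces to half the standard binary cross-entropy, so the modulating factor is the only real change.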
Our solution yields an F1 score of 79.15% on the public test set, ranking 11th on the public leaderboard of the Zalo AI Challenge 2019 for the Vietnamese Wiki Question Answering problem.
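The back-translation augmentation listed above can be sketched as the pipeline below. The `translate` callable is a placeholder: a real setup would plug in an MT model or service (e.g. a Vietnamese-English NMT system), which is assumed here, not shown.

```python
# Sketch of back-translation augmentation: round-trip Vietnamese text
# through a pivot language to get a paraphrase that keeps the same label.
# `translate(text, src, tgt)` is a hypothetical callable (assumption).

def back_translate(text, translate, pivot="en"):
    """Translate text to the pivot language and back to Vietnamese."""
    pivoted = translate(text, src="vi", tgt=pivot)
    return translate(pivoted, src=pivot, tgt="vi")

def augment(pairs, translate):
    """Add back-translated question paraphrases to (question, paragraph, label)
    triples, skipping paraphrases identical to the original question."""
    augmented = list(pairs)
    for question, paragraph, label in pairs:
        new_q = back_translate(question, translate)
        if new_q != question:
            augmented.append((new_q, paragraph, label))
    return augmented
```

Filtering out round-trips that return the original text avoids inserting exact duplicates into the training set.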