KELM: Knowledge Enhanced Pre-Trained Language Representations with Message Passing on Hierarchical Relational Graphs
Incorporating factual knowledge into pre-trained language models (PLMs) such as BERT is an emerging trend in recent NLP studies. However, most existing methods combine an external knowledge integration module with a modified pre-training loss and re-run the pre-training process on a large-scale corpus. Re-pretraining these models is usually resource-consuming, and it is difficult to adapt them to another domain with a different knowledge graph (KG). Besides, those works either cannot embed knowledge context dynamically according to the textual context or struggle with the knowledge ambiguity issue. In this paper, we propose a novel knowledge-aware language model framework based on the fine-tuning process, which equips a PLM with a unified knowledge-enhanced text graph that contains both the text and multi-relational sub-graphs extracted from a KG. We design a hierarchical relational-graph-based message passing mechanism, which allows the representations of the injected KG and the text to mutually update each other and can dynamically select the mentioned entities that are ambiguous, i.e., that share the same surface text. Our empirical results show that our model can efficiently incorporate world knowledge from KGs into existing language models such as BERT, and achieves significant improvement on the machine reading comprehension (MRC) task compared with other knowledge-enhanced models.
Framework of KELM (left) and an illustration of how knowledge-enriched token embeddings are generated (right)
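The mutual update between text tokens and injected KG entities described above can be viewed as message passing on a heterogeneous graph. Below is a minimal, illustrative sketch with DGL; the toy graph, node/edge type names, and hidden size are assumptions for illustration, not the repository's actual graph construction.

```python
# Illustrative only: relation-specific message passing over a toy text+KG graph.
# Node/edge type names ("token", "entity", "mentions", ...) are assumed names.
import torch
import dgl
import dgl.nn as dglnn

# Toy graph: 4 text tokens and 3 candidate KG entities; token 1 mentions two
# candidate entities (the ambiguous-mention case).
g = dgl.heterograph({
    ("token", "next", "token"): (torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3])),
    ("token", "mentions", "entity"): (torch.tensor([1, 1]), torch.tensor([0, 1])),
    ("entity", "mentioned_by", "token"): (torch.tensor([0, 1]), torch.tensor([1, 1])),
    ("entity", "related_to", "entity"): (torch.tensor([0, 2]), torch.tensor([2, 1])),
})

feats = {"token": torch.randn(4, 128), "entity": torch.randn(3, 128)}

# One round of message passing: token and entity representations update each
# other through relation-specific graph convolutions.
conv = dglnn.HeteroGraphConv(
    {rel: dglnn.GraphConv(128, 128, allow_zero_in_degree=True) for rel in g.etypes},
    aggregate="sum",
)
feats = conv(g, feats)
```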
To install requirements:
pip install -r requirements.txt
The experiments in the paper were run on 4 V100 GPUs. For distributed training on multiple GPUs, use PyTorch DistributedDataParallel, e.g. as sketched below.
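A minimal, self-contained sketch of a DistributedDataParallel setup (the model and dataset here are stand-ins, not names from this repository):

```python
# Launch with, e.g., `torchrun --nproc_per_node=4 <your_training_script>.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 2).cuda(local_rank)   # stand-in for the actual model
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(64, 16))      # stand-in for the MRC dataset
loader = DataLoader(dataset, batch_size=8, sampler=DistributedSampler(dataset))
```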
For the train and dev sets, download from ReCoRD
For the test set, download from SuperGLUE
Download from SuperGLUE
We use two knowledge graphs: WordNet and NELL
For NELL, named entity recognition is performed with Stanford CoreNLP. For WordNet, we use the text matching from this repository
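As a rough illustration of the WordNet side of this matching (the repository ships precomputed matchings, so this is not its actual code), candidate synsets for an ambiguous mention can be looked up with NLTK:

```python
# Illustration only: an ambiguous mention maps to several candidate synsets.
# Requires the WordNet data: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

for syn in wn.synsets("bank"):
    print(syn.name(), "-", syn.definition())
```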
For convenience, you can download all related files from the following Google Drive links:
link: unzip and replace ./data/
Download the bert-large-cased model from Hugging Face to ./cache/bert-large-cased
For the train set,
sh data_preprocess_$dataset$_train.sh
For the dev set,
sh data_preprocess_$dataset$_dev.sh
We follow the same "two-stage" training strategy as KT-NET
First, freeze the language model and run
sh run_first_$dataset$.sh
Then unfreeze the language model and run from the saved first-stage model (replace INIT_DIR with the local path of the saved model):
sh run_second_$dataset$.sh
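Conceptually, the two stages differ only in whether the PLM backbone is trainable. A minimal sketch, assuming the backbone is exposed as a `bert` attribute (the wrapper class and attribute names below are illustrative, not taken from this repository):

```python
import torch.nn as nn
from transformers import BertModel

class KELMSketch(nn.Module):                        # hypothetical wrapper, for illustration
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-large-cased")
        self.knowledge_head = nn.Linear(1024, 1024)  # stand-in for the KG-fusion layers

model = KELMSketch()

# Stage 1: freeze the language model; only the knowledge-integration layers train.
for param in model.bert.parameters():
    param.requires_grad = False

# Stage 2: unfreeze the language model and fine-tune everything, starting from
# the checkpoint saved at the end of stage 1.
for param in model.bert.parameters():
    param.requires_grad = True
```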
Our model (backbone: BERT) achieves the following performance on:
ReCoRD:
Model name | EM/F1 (Dev) | EM/F1 (Test) |
---|---|---|
BERT-large | 70.2/72.2 | 71.3/72.0 |
SKG + BERT-large | 70.9/71.6 | 72.2/72.8 |
KT-NET (WordNet) | 70.6/72.8 | - |
KT-NET (NELL) | 70.5/72.5 | - |
KT-NET (Both) | 71.6/73.6 | 73.0/74.8 |
KELM (WordNet) | 75.4/75.9 | 75.9/76.5 |
KELM (NELL) | 74.8/75.3 | 75.9/76.3 |
KELM (Both) | 75.1/75.6 | 76.2/76.7 |
MultiRC:
Model name | EM/F1 (Dev) | EM/F1 (Test) |
---|---|---|
BERT-large | - | 24.1/70.0 |
KT-NET (Both)* | 26.7/71.7 | 25.4/71.1 |
KELM (WordNet) | 29.2/70.6 | 25.9/69.2 |
KELM (NELL) | 27.3/70.4 | 26.5/70.6 |
KELM (Both) | 30.3/71.0 | 27.2/70.8 |
*from our implementation
COPA:
Model name | Accuracy (Dev) | Accuracy (Test) |
---|---|---|
BERT-large | - | 70.6 |
KELM (WordNet) | 76.1 | 78.0 |
We also implement KELM based on RoBERTa-large and evaluate it on ReCoRD:
Model name | EM/F1 (Dev) | EM/F1 (Test) |
---|---|---|
RoBERTa-large | 87.9/88.4 | 88.4/88.9 |
KELM (Both) | 88.2/88.7 | 89.1/89.6 |
Part of the code is implemented based on the open-source code of Hugging Face, DGL, KT-NET, and jiant.
We use Apache License 2.0.