izuna385/Entity-Linking-Tutorial

Entity-Linking-Tutorial

  • In this tutorial, we implement a Bi-encoder-based entity disambiguation system using the BC5CDR dataset and the MeSH knowledge base.

  • We then compare surface-form-based candidate generation with Bi-encoder-based candidate generation, to show what the Bi-encoder model contributes to entity linking.

Docs in English

Docs in Japanese

Tutorial with Colab-Pro.

See here.

Environment Setup

  • First, create a base environment with conda.
# If you are not using Colab Pro, create the environment with conda.
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt

Preprocessing

  • First, download the preprocessed files from here, then unzip them.

  • Second, download the BC5CDR dataset to ./dataset/ and unzip it.

  • Place CDR_DevelopmentSet.PubTator.txt, CDR_TestSet.PubTator.txt and CDR_TrainingSet.PubTator.txt under ./dataset/.

  • Then run python3 BC5CDRpreprocess.py and python3 preprocess_mesh.py.
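
To get a feel for the input that the preprocessing step consumes, here is a minimal sketch of reading the PubTator layout. It assumes the usual layout of `id|t|title` and `id|a|abstract` text lines followed by tab-separated mention lines (id, start, end, surface form, type, concept ID). The function name `parse_pubtator` and the output dictionary are hypothetical and are not taken from BC5CDRpreprocess.py.

```python
from collections import defaultdict

def parse_pubtator(text):
    """Group title, abstract, and entity mentions by document ID."""
    docs = defaultdict(lambda: {"title": "", "abstract": "", "mentions": []})
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            # Text line: <doc id>|t|<title> or <doc id>|a|<abstract>
            key = "title" if parts[1] == "t" else "abstract"
            docs[parts[0]][key] = parts[2]
            continue
        fields = line.split("\t")
        # Mention line: doc id, start, end, surface form, type, concept ID.
        # Lines with fewer fields or non-numeric offsets are skipped.
        if len(fields) >= 6 and fields[1].isdigit() and fields[2].isdigit():
            docs[fields[0]]["mentions"].append({
                "start": int(fields[1]),
                "end": int(fields[2]),
                "surface": fields[3],
                "type": fields[4],
                "concept_id": fields[5],
            })
    return dict(docs)

# A small sample in the assumed layout (illustrative only).
SAMPLE = (
    "227508|t|Naloxone reverses the antihypertensive effect of clonidine.\n"
    "227508\t0\t8\tNaloxone\tChemical\tD009270\n"
    "227508\t49\t58\tclonidine\tChemical\tD003000\n"
)

if __name__ == "__main__":
    for doc_id, doc in parse_pubtator(SAMPLE).items():
        for m in doc["mentions"]:
            print(doc_id, m["surface"], m["type"], m["concept_id"])
```

Each mention keeps its character offsets, so the surface form can be checked against the document text before it is passed to the encoder.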

Models and Scoring

Models

  • Surface-Candidate based

    [Figure: biencoder]

  • ANN-search based

    [Figure: entire_biencoder]
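
The surface-candidate variant first narrows the knowledge base to a handful of entities whose names resemble the mention string, and only those are scored by the Bi-encoder. The sketch below illustrates that first step under a loud assumption: it ranks names with the standard library's `difflib` string similarity, which may differ from the matching scheme this repository actually uses. The function `surface_candidates` and the tiny name table are hypothetical.

```python
import difflib

def surface_candidates(mention, name_to_id, max_candidates_num=5):
    """Return up to max_candidates_num entity IDs whose names resemble the mention."""
    lowered = {name.lower(): cid for name, cid in name_to_id.items()}
    # cutoff=0.0 keeps every name eligible; n limits how many are returned.
    matches = difflib.get_close_matches(
        mention.lower(), list(lowered), n=max_candidates_num, cutoff=0.0
    )
    candidates = []
    for name in matches:
        cid = lowered[name]
        if cid not in candidates:  # synonyms may map to the same entity
            candidates.append(cid)
    return candidates

# Illustrative name table (entity name -> concept ID).
NAMES = {
    "Myocardial Infarction": "D009203",
    "Heart Failure": "D006333",
    "Hepatitis": "D006505",
}

if __name__ == "__main__":
    print(surface_candidates("heart failure", NAMES, max_candidates_num=2))
```

A generator like this cannot propose an entity whose name shares little with the mention string, which is the gap the ANN-search variant addresses by retrieving over embeddings of the whole entity set.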

Scoring

  • Default: dot product between the mention and entity embeddings.

    [Figure: scoring]

  • L2-distance and cosine similarity are also supported.
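
The three scoring options can be sketched in plain Python as follows. This is an illustration of the measures themselves, not the repository's implementation: the names `SCORERS` and `rank_entities` are hypothetical, and embeddings are given as plain lists of floats. L2 distance is negated so that a higher score always means a better match.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def neg_l2(u, v):
    # Negated so that, as with the other scorers, larger is better.
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

SCORERS = {"dot": dot, "cosine": cosine, "l2": neg_l2}

def rank_entities(mention_vec, entity_vecs, method="dot", top_k=50):
    """Return the IDs of the top_k entities, best first."""
    scorer = SCORERS[method]
    ranked = sorted(
        entity_vecs,
        key=lambda eid: scorer(mention_vec, entity_vecs[eid]),
        reverse=True,
    )
    return ranked[:top_k]

if __name__ == "__main__":
    mention = [1.0, 0.0]
    entities = {"A": [0.9, 0.1], "B": [3.0, 3.0], "C": [0.0, 1.0]}
    for method in SCORERS:
        print(method, rank_entities(mention, entities, method))
```

In this toy example the dot product ranks B first because of its large norm, while cosine similarity and L2 distance both prefer A. When all vectors are normalized to unit length, the dot product and cosine similarity give the same ranking.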

Experiment and Evaluation

$ rm -r serialization_dir # Remove the results of a previous run, e.g. after running `python3 main.py -debug` for debugging.
$ python3 main.py

Parameters

We note only the parameters that are critical for training and evaluation here. For further details, see parameters.py.

Parameter Name | Description | Default
batch_size_for_train | Batch size during training. A larger batch gives the encoder more negative examples from which it must learn to pick the correct entity. | 16
lr | Learning rate. | 1e-5
max_candidates_num | Number of candidates generated for each mention from its surface form. | 5
search_method_for_faiss | Similarity used for approximate nearest neighbor search: cosine similarity (cossim), inner product (indexflatip), or L2 distance (indexflatl2). | indexflatip
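
The description of batch_size_for_train suggests that the other entities in a batch serve as negative examples. Under that assumption (in-batch negatives with a softmax cross-entropy loss over dot-product scores, which parameters.py or the model code may implement differently), the loss can be sketched as below. The function name `in_batch_loss` is hypothetical.

```python
import math

def in_batch_loss(mention_vecs, entity_vecs):
    """Mean cross-entropy where mention i's gold entity is entity i.

    Every other entity in the batch acts as a negative example.
    """
    total = 0.0
    for i, mention in enumerate(mention_vecs):
        scores = [sum(a * b for a, b in zip(mention, e)) for e in entity_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_score = max(scores)
        log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_z - scores[i]
    return total / len(mention_vecs)

if __name__ == "__main__":
    # An untrained encoder that maps everything to the same vector scores
    # at chance level, and chance gets harder as the batch grows.
    for batch_size in (16, 48):
        vecs = [[0.0, 0.0]] * batch_size
        print(batch_size, in_batch_loss(vecs, vecs))
```

With B items per batch, each mention is contrasted against B - 1 negatives, so the chance-level loss is log(B). This is one way to read why the larger batch size is expected to give a stronger training signal.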

Result

  • Surface-Candidate based recall

    Generated Candidates Num | 5 | 10 | 20
    dev_recall | 76.80 | 79.91 | 80.92
    test_recall | 74.35 | 77.14 | 78.25

batch_size_for_train: 16

  • Surface-Candidate based acc.

    Generated Candidates Num | 5 | 10 | 20
    dev_acc | 59.85 | 52.56 | 47.23
    test_acc | 58.51 | 51.38 | 45.69

  • ANN-search Based

    (Generated Candidates Num: 50 (Fixed))

    Recall@X | 1 (Acc.) | 5 | 10 | 50
    dev_recall | 21.58 | 42.28 | 50.48 | 67.11
    test_recall | 21.50 | 40.29 | 47.95 | 64.52

batch_size_for_train: 48

  • Surface-Candidate based acc.

    Generated Candidates Num | 5 | 10 | 20
    dev_acc | 72.39 | 68.21 | 65.40
    test_acc | 70.95 | 66.87 | 63.72

  • ANN-search Based

    (Generated Candidates Num: 50 (Fixed))

    Recall@X | 1 (Acc.) | 5 | 10 | 50
    dev_recall | 58.86 | 74.33 | 78.14 | 83.10
    test_recall | 57.66 | 73.14 | 76.73 | 81.39
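
For reference, Recall@X as used in the tables above can be computed as in the following sketch. It assumes that each mention has exactly one gold entity and that the reported values are percentages. The function name `recall_at_k` is hypothetical and is not taken from the repository's evaluation code.

```python
def recall_at_k(ranked_candidates, gold_ids, k):
    """Percentage of mentions whose gold entity is among the top k candidates."""
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_ids)
        if gold in candidates[:k]
    )
    return 100.0 * hits / len(gold_ids)

if __name__ == "__main__":
    # Four mentions, each with a ranked candidate list; the gold entity is "A".
    ranked = [
        ["A", "B", "C"],
        ["B", "A", "C"],
        ["C", "B", "A"],
        ["D", "E", "F"],
    ]
    gold = ["A", "A", "A", "A"]
    for k in (1, 2, 3):
        print(k, recall_at_k(ranked, gold, k))
```

Recall@1 is the same as accuracy, which is why the first column is labeled "1 (Acc.)". Recall at the full candidate count bounds the accuracy that any reranking of those candidates can reach.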

LICENSE

MIT