Skip to content

A Vietnamese Biomedical Named Entity Recognition Corpus

Notifications You must be signed in to change notification settings

ptpuyen1511/VietBioNER

Repository files navigation

A Vietnamese Biomedical Named Entity Recognition Corpus

Context

VietBioNER is constituted by biomedical grey literature specified for tuberculosis. The corpus was annotated with five named entity categories of Organisation, Location, Date and Time, Symptom and Disease, and Diagnostic Procedure.

The construction of VietBioNER and some baseline performance are detailed in our LREC 2022 paper:

@inproceedings{vietbioner,
    title = "{A Named Entity Recognition Corpus for Vietnamese Biomedical Texts to Support Tuberculosis Treatment}",
    author = "Phan, Uyen and Nguyen, Phuong and Nguyen, Nhung",
    booktitle = "Proceedings of the 13th Language Resources and Evaluation Conference",
    year = "2022",
    publisher = "European Language Resources Association"
}

Entity Category Examples

Entity Category Examples
Symptom_and_Disease lao đa kháng thuốc (multidrug-resistant tuberculosis), ho có đờm (coughing up phlegm)
DiagnosticProcedure GeneXpert
Location phía Bắc Việt Nam (North Vietnam), Khu vực Đông Nam Á (Southeast Asia)
DateTime mùa hè (summer), từ 1990-1992 (from 1990-1992)
Organisation Bộ Y tế (Ministry of Health), Khoa Phổi Thận (Lung and Kidney Department)

Content

The corpus contains 1706 sentences with an average length of 31 tokens. About 74% of sentences in the corpus contained annotated entities, whose distribution among the five categories is detailed in Fig.1.

distribution_ne

Figure 1: Distribution of each entity type in VietBioNER.

Supervised learning benchmark setting

The corpus is randomly split into three sets: training, validation, and test sets. These sets are constructed with an approximate ratio of 7:3:7. Specifically, the training set consists of 706 sentences, the validation set consists of 300 sentences, and the test set has 700 sentences.

The distribution of all entity categories in supervised learning setting is reported in Table 1.

Entity Category Train Valid Test
Symptom_and_Disease 838 378 810
DiagnosticProcedure 191 89 202
Location 162 78 106
DateTime 141 58 97
Organisation 77 36 71

Table 1: Number of entities in each sets.

Training, validation, and test files are available in the data_supervised_learning folder.

Few-shot learning benchmark setting

We build 1-shot, 5-shot, and 10-shot learning sets from the training set mentioned in supervised setting. For each n-shot learning set, we generated 5 random support sets using the Greedy Sampling algorithm (Yang and Katiyar, 2020). n denotes the number of entities in each category that are selected for inclusion in each support set. Consequently, each n-shot support set will have a maximum of (n * num_entity_categories) sentences.

The distribution of all entity categories in few-shot learning setting is reported in Table 2.

Entity Category 1-shot 5-shot 10-shot
Symptom_and_Disease 5 20 35
DiagnosticProcedure 3 8 13
Location 4 8 17
DateTime 2 7 13
Organisation 2 6 11

Table 2: Number of entities in each sets (average over 5 sets).

n-shot sets are available in the data_fewshot_learning folder. We can use the same test set as in the supervised setting.

Original files with brat format

We also provide the original .ann files with brat format in the data_brat folder.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0

About

A Vietnamese Biomedical Named Entity Recognition Corpus

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published