Skip to content

Implementation of NAACL 2024 main conference paper: Named Entity Recognition Under Domain Shift via Metric Learning for Life Science

Notifications You must be signed in to change notification settings

Lhtie/Bio-Domain-Transfer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences

Accepted by NAACL 2024 Main Conference (Oral Presentation), Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences

Overview

Requirements

Environments

  • Python 3.8 (Ubuntu 20.04)

  • PyTorch 1.11.0 & CUDA 11.3

Setups

Here is some basic steps to setup the environment.

Step1: Create an unique Conda environment and install Python and PyTorch with CUDA support of specified version.

conda create -n [ENV_NAME] python=3.8
conda install pytorch torchvision torchaudio pytorch-cuda=11.3 -c pytorch -c nvidia

Step2: Install all the required Python packages for the repository by the following command:

pip install -r requirements.txt

Step3: Install NLTK data. Run the Python interpreter and type the following commands:

>>> import nltk
>>> nltk.download("punkt")

Datasets

All the datasets involved have been uploaded to Huggingface Lhtie/Bio-Domain-Transfer. Download the datasets by typing the commands:

git lfs install
git clone https://huggingface.co/datasets/Lhtie/Bio-Domain-Transfer

The folder contains biomedical datasets PathwayCuration, Cancer Genetics ,Infectious Diseases and chemical datasets CHEMDNER, BC5CDR, DrugProt.

Models

All the models used (BERT, SapBERT, S-PubMedBert-MS-MARCO-SCIFACT) can be download from from Huggingface Repositories:

git lfs install
git clone https://huggingface.co/bert-base-uncased
git clone https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext
git clone https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO-SCIFACT

Quickstart

Configurations

  • dataConfig contains data process scripts

    DataConfig: Modify dataset_dir from dataConfig/config.py: directory path to datasets (eg. ./Bio-Domain-Transfer)

    ModelConfig: Modify sapbert_path, sentbert_path, bert_path from dataConfig/confg.py: directory path to models respectively

  • configs/para contains configuration files for different experiment senarios

    few-shot_bert.yaml: Target Only

    oracle_bert.yaml: Target Only with full training data

    transfer_learning.yaml: Direct Transfer

    transfer_learning_eg.yaml: EG (Fill in DATA.BIOMEDICAL.SIM_METHOD to switch between concat and sentEnc)

    transfer_learning_disc.yaml: ED

    transfer_learning_eg_disc.yaml: EG+ED

Run

Train

Run the train.py script (Multi-Processing) by the following command:

torchrun --nnodes=1 --nproc_per_node=<# gpus> train.py \
	--cfg_file <configuration file> \

Test

Run the eval.py script to test finetuned models:

python eval.py --cfg_file <configuration file>

Citation

@inproceedings{liu-etal-2024-named,
    title = "Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences",
    author = "Liu, Hongyi  and
      Wang, Qingyun  and
      Karisani, Payam  and
      Ji, Heng",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.1",
    pages = "1--21",
}