In this paper, we present the BioWiC benchmark, a new dataset designed to assess how well language models represent biomedical concepts according to their corresponding context. BioWiC is formulated as a binary classification task where each instance involves a pair of biomedical terms along with their corresponding sentences. The task is to classify each instance as True if the target terms carry the same meaning across both sentences or False if they do not.
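To make the task format concrete, here is a minimal sketch of how one instance and an accuracy score could be represented in Python. The class name `WiCInstance`, its field names, the `same_surface_form` baseline, and the three toy sentences are all assumptions made up for illustration; they are not taken from the BioWiC release or its actual schema.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class WiCInstance:
    # Hypothetical field names; the released dataset may use different ones.
    term1: str
    sentence1: str
    term2: str
    sentence2: str
    label: bool  # True if both target terms carry the same meaning in context


def accuracy(instances: Iterable[WiCInstance],
             predict: Callable[[WiCInstance], bool]) -> float:
    """Fraction of instances for which the prediction equals the gold label."""
    items = list(instances)
    if not items:
        raise ValueError("no instances to score")
    correct = sum(predict(item) == item.label for item in items)
    return correct / len(items)


def same_surface_form(instance: WiCInstance) -> bool:
    """Naive baseline: predict True when the two terms match, ignoring case."""
    return instance.term1.lower() == instance.term2.lower()


# Invented toy examples, not drawn from BioWiC.
toy_instances = [
    WiCInstance("cold", "The patient presented with a cold.",
                "cold", "Cold exposure triggered the symptoms.", False),
    WiCInstance("heart attack", "He was admitted after a heart attack.",
                "myocardial infarction", "ECG confirmed a myocardial infarction.", True),
    WiCInstance("aspirin", "Aspirin was given on arrival.",
                "aspirin", "She takes aspirin daily.", True),
]

baseline_accuracy = accuracy(toy_instances, same_surface_form)
```

The string-matching baseline gets only the third toy instance right, which illustrates why the task needs context-aware representations rather than surface comparison.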
For further details, refer to the BioWiC paper.
Follow these instructions to install the necessary dependencies for the project:
git clone https://github.com/hrouhizadeh/BioWiC
cd BioWiC
pip install -r requirements.txt
To reproduce the construction of BioWiC, you need to perform the following steps:
- Extracting UMLS information: In the UMLS directory, detailed steps are provided to extract UMLS information needed for the BioWiC dataset development.
- Building BioWiC dataset: Follow the instructions in the BioWiC_construction directory to reconstruct the BioWiC dataset.
- Train and evaluate models: The models folder contains scripts that enable you to train and test various discriminative and generative large language models using the BioWiC dataset.
Additionally, the official release of the BioWiC dataset is available for direct download in the data folder.
- Install the Hugging Face datasets library: If not already installed, you can add the datasets library from Hugging Face.
pip install datasets
- Load the BioWiC dataset: To load the BioWiC dataset, execute the following Python code.
from datasets import load_dataset
dataset = load_dataset("hrouhizadeh/BioWiC")
This command will automatically download and cache the dataset.
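After loading, it can help to check which splits were downloaded and how large each one is. The sketch below assumes that `load_dataset` is called without a `split` argument, so that it returns a dictionary-like object keyed by split name. The helper names `summarize_splits` and `load_and_summarize` are hypothetical, and no split or column names are assumed.

```python
from typing import Dict, Mapping, Sized


def summarize_splits(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split of a dict-like dataset."""
    return {name: len(split) for name, split in dataset_dict.items()}


def load_and_summarize(repo_id: str = "hrouhizadeh/BioWiC") -> Dict[str, int]:
    """Load the dataset from the Hugging Face Hub and report split sizes."""
    # Imported lazily so summarize_splits works without the library installed.
    from datasets import load_dataset

    # Downloads on the first call, then reads from the local cache.
    dataset = load_dataset(repo_id)
    return summarize_splits(dataset)
```

Running `print(load_and_summarize())` requires network access the first time and shows whatever splits the published dataset defines.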
If you use the BioWiC dataset in your research, please cite the following paper:
@article{rouhizadeh2024dataset,
title={A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models},
author={Rouhizadeh, Hossein and Nikishina, Irina and Yazdani, Anthony and Bornet, Alban and Zhang, Boya and Ehrsam, Julien and Gaudet-Blavignac, Christophe and Naderi, Nona and Teodoro, Douglas},
journal={Scientific Data},
volume={11},
number={1},
pages={1--13},
year={2024},
publisher={Nature Publishing Group}
}
Should you have any inquiries, feel free to contact us at hossein.rouhizadeh@unige.ch.