nadavlab/MedConceptsQA

MedConceptsQA is an open-source benchmark of multiple-choice questions about medical concepts, covering diagnosis, procedure, and drug codes. The benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA.
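
To explore the data directly, here is a minimal sketch using the HuggingFace datasets library. The configuration name ("icd9cm_easy") is an assumption based on the dataset card; check the link above for the exact configuration and split names.

# Minimal sketch: browse one MedConceptsQA subset.
# NOTE: the config name "icd9cm_easy" is an assumption; see the dataset card
# at https://huggingface.co/datasets/ofir408/MedConceptsQA for actual names.
from datasets import load_dataset

ds = load_dataset("ofir408/MedConceptsQA", "icd9cm_easy")
for split_name, split in ds.items():
    print(split_name, split[0])  # one multiple-choice question per split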

If you use MedConceptsQA or find this repository useful for your research or work, please cite it as follows:

@article{shoham2024medconceptsqa,
  title={MedConceptsQA--Open Source Medical Concepts QA Benchmark},
  author={Shoham, Ofir Ben and Rappoport, Nadav},
  journal={arXiv preprint arXiv:2405.07348},
  year={2024}
}

How To Run?

Install the required dependencies:

pip install -r requirements.txt

Run the benchmark evaluation:

python evaluation_runner.py --model_id MODEL_ID --output_results_dir_path OUTPUT_RESULTS_DIR_PATH --shots_num SHOTS_NUM --total_num_experiments NUM_EXPERIMENTS

Replace MODEL_ID with the HuggingFace model name or a local path to the pretrained model you want to evaluate.

Replace OUTPUT_RESULTS_DIR_PATH with the directory path in which to store the results CSV file.

Replace SHOTS_NUM with the number of few-shot examples. The default is 4. For zero-shot learning, use 0.

The example commands below also pass --total_eval_examples_num, which limits the total number of evaluated examples.

Replace NUM_EXPERIMENTS with the number of repeated experiments used to calculate the 95% confidence intervals. The default is 1.
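
The exact aggregation in evaluation_runner.py is not reproduced here, but as a rough sketch under a normal approximation, a 95% confidence interval over repeated experiments can be computed from the per-run accuracies:

# Sketch: 95% confidence interval for mean accuracy over repeated runs,
# using a normal approximation (1.96 * standard error of the mean).
# The accuracy values below are hypothetical; at least two runs are needed,
# so with --total_num_experiments 1 no interval can be estimated this way.
import statistics

accuracies = [25.3, 24.9, 26.1, 25.6]  # one value per experiment, in percent

mean = statistics.mean(accuracies)
half_width = 1.96 * statistics.stdev(accuracies) / len(accuracies) ** 0.5
print(f"mean accuracy: {mean:.3f}, 95% CI: +/- {half_width:.3f}")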

Few-shot evaluation:

python evaluation_runner.py --model_id BioMistral/BioMistral-7B-DARE --total_eval_examples_num 250 --output_results_dir_path results/few_shot/250_examples/ --shots_num 4

Zero-shot evaluation:

python evaluation_runner.py --model_id BioMistral/BioMistral-7B-DARE --total_eval_examples_num 250 --output_results_dir_path results/zero_shot/250_examples/ --shots_num 0

Run the GPT benchmark evaluation:

Set OPENAI_API_KEY as an environment variable containing your OpenAI API key. For example, in a POSIX shell (the key value is a placeholder):
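
export OPENAI_API_KEY="sk-..."

Then run: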

python gpt4_runner.py --model_id gpt-4-0125-preview --shots_num 4 --total_eval_examples_num 250 --output_results_dir_path results/few_shot/250_examples/

GPT-4 zero-shot evaluation:

python gpt4_runner.py --model_id gpt-4-0125-preview --shots_num 0 --total_eval_examples_num 250 --output_results_dir_path results/zero_shot/250_examples/
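
gpt4_runner.py handles prompt construction and querying internally. Purely as an illustration of the kind of request involved (not the repository's actual code or prompt template), a single multiple-choice query with the OpenAI Python client (v1.x) could look like:

# Illustrative sketch only: one multiple-choice query via the OpenAI v1.x client.
# The question text and prompt format are hypothetical, not the repository's
# actual template.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = (
    "What is the description of the ICD-9-CM code 250.00?\n"
    "A. Essential hypertension\n"
    "B. Diabetes mellitus without mention of complication\n"
    "C. Asthma, unspecified\n"
    "D. Migraine, unspecified\n"
    "Answer with a single letter."
)
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # expected: "B"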

Leaderboard:

Zero-shot:

Model                                  Mean accuracy (%)   95% CI
gpt-4-0125-preview                     52.489              3.135
gpt-3.5-turbo                          37.058              2.399
dmis-lab/biobert-v1.1                  26.151              3.571
meta-llama/Meta-Llama-3-8B-Instruct    25.840              6.199
epfl-llm/meditron-7b                   25.751              3.340
dmis-lab/meerkat-7b-v1.0               25.680              3.983
HuggingFaceH4/zephyr-7b-beta           25.538              3.075
epfl-llm/meditron-70b                  25.360              2.630
yikuan8/Clinical-Longformer            25.040              2.406
UFNLP/gatortron-medium                 24.862              3.170
PharMolix/BioMedGPT-LM-7B              24.747              4.219
BioMistral/BioMistral-7B-DARE          24.569              3.867
johnsnowlabs/JSL-MedMNX-7B             24.427              3.185

Few-shot:

Model                                  Mean accuracy (%)   95% CI
gpt-4-0125-preview                     61.911              2.320
gpt-3.5-turbo                          41.476              2.481
meta-llama/Meta-Llama-3-8B-Instruct    25.653              2.707
johnsnowlabs/JSL-MedMNX-7B             25.627              2.497
yikuan8/Clinical-Longformer            25.547              3.495
dmis-lab/biobert-v1.1                  25.458              2.649
epfl-llm/meditron-70b                  25.262              3.499
BioMistral/BioMistral-7B-DARE          25.058              2.676
HuggingFaceH4/zephyr-7b-beta           25.058              2.121
dmis-lab/meerkat-7b-v1.0               24.942              2.879
PharMolix/BioMedGPT-LM-7B              24.924              3.363
epfl-llm/meditron-7b                   23.787              3.496

If you wish to submit your model for evaluation, please open a GitHub issue with your model's HuggingFace name.
