Health literacy is crucial yet often hampered by complex medical terminology. Existing simplification approaches are limited by small, sentence-level, monolingual datasets. To address this, we introduce MedSiML, a large-scale dataset designed to simplify and translate medical texts into the ten most spoken languages, improving global health literacy. This repository contains a sample of our 64k-paragraph medical text simplification dataset, created from multiple sources using an automated annotation process involving the Gemini model followed by manual scrutiny. We also share our code for fine-tuning and evaluating the models. The full dataset and model checkpoints are too large to commit to this repository; they will be made available on Dropbox after publication, and the link will be shared here.
MedSiML includes over 64,000 paragraphs from PubMed, Wikipedia, and Cochrane reviews, simplified and translated into English, Mandarin, Spanish, Arabic, Hindi, Bengali, Portuguese, Russian, Japanese, and Punjabi. An additional super-simplified English version is available for readers with learning disabilities.
- Sources: PubMed, Wikipedia, Cochrane Reviews
- Languages: English, Mandarin, Spanish, Arabic, Hindi, Bengali, Portuguese, Russian, Japanese, Punjabi
- Simplification Model: Gemini 1.5 Flash
- Fine-tuning Model: Text-To-Text Transfer Transformer (T5) base model
We compiled data from three main sources:
- PubMed: 50,000 abstracts from biomedical articles.
- Wikipedia: Biomedical articles sourced from the Hugging Face 2022 Wikipedia corpus.
- Cochrane Reviews: Derived from existing Cochrane datasets.
The collected data underwent rigorous cleaning to remove noise, ensure quality, and eliminate duplicates.
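The deduplication step can be illustrated with a minimal sketch: hashing a whitespace- and case-normalized form of each paragraph so that near-identical copies collapse to one entry. This is an assumption about the approach, not the repository's exact cleaning pipeline.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(paragraphs):
    """Keep only the first occurrence of each normalized paragraph."""
    seen, unique = set(), []
    for p in paragraphs:
        h = hashlib.sha256(normalize(p).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(p)
    return unique
```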
Using Gemini 1.5 Flash, we simplified and translated the texts into ten languages and created a super-simplified English version.
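A simplification-plus-translation request to the model can be sketched as a prompt template per target language. The instruction wording below is a hypothetical illustration; the exact prompts used for MedSiML are not reproduced here.

```python
# The ten target languages covered by MedSiML.
TARGET_LANGUAGES = ["English", "Mandarin", "Spanish", "Arabic", "Hindi",
                    "Bengali", "Portuguese", "Russian", "Japanese", "Punjabi"]

def build_prompt(paragraph: str, language: str) -> str:
    """Build a simplification/translation prompt (hypothetical template)."""
    return (
        f"Simplify the following medical paragraph so a lay reader can "
        f"understand it, then write the simplified version in {language}. "
        f"Preserve all clinical facts.\n\nParagraph:\n{paragraph}"
    )
```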
We fine-tuned the T5 base model on this paragraph-level, multilingual data, achieving significant improvements in readability and semantic similarity.
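T5 is trained on text-to-text pairs, so one common way to steer a single model toward multiple target languages is a task prefix on the input. A minimal sketch of forming training pairs follows; the prefix wording is an assumption, not necessarily the scheme used in the paper.

```python
def to_t5_example(source: str, target: str, language: str) -> dict:
    """Form a (input, target) training pair with a task prefix.

    The "simplify to <language>:" prefix is a hypothetical convention.
    """
    return {
        "input_text": f"simplify to {language}: {source}",
        "target_text": target,
    }
```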
Our fine-tuned model improved over existing models across several metrics:
- ROUGE1: +10.61%
- SARI: +11.01%
- Semantic Similarity: +49.1%
- Readability Scores: FK +0.38, ARI +1.06
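The FK (Flesch-Kincaid grade level) and ARI (Automated Readability Index) scores above can be computed from simple text statistics. A self-contained sketch using the published formulas follows; the vowel-group syllable counter is a crude stand-in for the proper syllabifier an evaluation pipeline would use.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; real pipelines use a proper syllabifier."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    """Return (Flesch-Kincaid grade, ARI) using the standard formulas."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    n_chars = sum(len(w) for w in words)
    n_syll = sum(count_syllables(w) for w in words)
    fk = 0.39 * (n_words / sentences) + 11.8 * (n_syll / n_words) - 15.59
    ari = 4.71 * (n_chars / n_words) + 0.5 * (n_words / sentences) - 21.43
    return fk, ari
```

Lower grade-level scores indicate text that is easier to read, which is the direction simplification aims for.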
To use the MedSiML dataset and models, follow these steps:

1. Clone the repository:

   ```shell
   git clone https://github.com/nepython/MedSiML.git
   cd MedSiML
   ```

2. Install the dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Access the dataset:
   - Download the `data.zip` file from Dropbox.
   - Move `data.zip` into the `data` directory.
   - Unzip `data.zip`.
   - Load it using the `datasets` library:

     ```python
     from datasets import load_dataset
     raw_datasets = load_dataset("csv", data_files='data/data.tsv', delimiter='\t')
     ```

   - (or) Using the `pandas` library:

     ```python
     import pandas as pd
     df = pd.read_csv('data/data.tsv', sep='\t')
     ```

4. Run model inference:
   - Download the `checkpoints` from Dropbox.
   - Move the `checkpoints` into the `notebooks` directory.
   - Unzip the `checkpoints`.
   - Open the `T5_base.ipynb` notebook and run the cells in all sections, skipping `Preprocessing` and `Finetuning`.
- Hardik A. Jain
- Chirayu Patel
- Riyasatali Umatiya
- Sajib Mistry
- Aneesh Krishna
- Amin Beheshti
If you use this dataset or model in your research, please cite our paper:
@inproceedings{Jain2025MedSiML,
title={MedSiML: A Multilingual Approach for Simplifying Medical Texts},
author={Jain, Hardik A. and Patel, Chirayu and Umatiya, Riyasatali and Mistry, Sajib and Krishna, Aneesh and Beheshti, Amin},
booktitle={Neural Information Processing},
year={2025},
publisher={Springer Nature Singapore},
address={Singapore},
pages={165-179},
isbn={978-981-96-6606-5},
doi={10.1007/978-981-96-6606-5_12}
}
This project is licensed under the MIT License. See the LICENSE file for details.