Health literacy is crucial yet often hampered by complex medical terminology. Existing simplification approaches are limited by small, sentence-level, monolingual datasets. To address this, we introduce MedSiML, a large-scale dataset designed to simplify and translate medical texts into the ten most spoken languages, improving global health literacy. This repository contains a sample of our 64k-paragraph medical text simplification dataset, created from multiple sources using an automated annotation process involving the Gemini model followed by manual scrutiny. We also share our code for fine-tuning and evaluating the models. The full dataset and model checkpoints are too large to commit to this repository; they will be made available on Dropbox after publication, and the link will be shared here.
MedSiML includes over 64,000 paragraphs from PubMed, Wikipedia, and Cochrane reviews, simplified and translated into English, Mandarin, Spanish, Arabic, Hindi, Bengali, Portuguese, Russian, Japanese, and Punjabi. An additional super-simplified English version is available for readers with learning disabilities.
- Sources: PubMed, Wikipedia, Cochrane Reviews
- Languages: English, Mandarin, Spanish, Arabic, Hindi, Bengali, Portuguese, Russian, Japanese, Punjabi
- Simplification Model: Gemini 1.5 Flash
- Fine-tuning Model: Text-To-Text Transfer Transformer (T5) base model
We compiled data from three main sources:
- PubMed: 50,000 abstracts from biomedical articles.
- Wikipedia: Biomedical articles sourced from the Hugging Face 2022 Wikipedia corpus.
- Cochrane Reviews: Derived from existing Cochrane datasets.
The collected data underwent rigorous cleaning to remove noise, ensure quality, and eliminate duplicates.
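The deduplication step can be illustrated with a minimal sketch: hashing a whitespace- and case-normalized form of each paragraph so that near-identical copies collapse to one entry. This is an assumption about the approach, not the repository's exact cleaning pipeline.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(paragraphs):
    """Keep only the first occurrence of each normalized paragraph."""
    seen, unique = set(), []
    for p in paragraphs:
        h = hashlib.sha256(normalize(p).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(p)
    return unique
```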
Using Gemini 1.5 Flash, we simplified and translated the texts into ten languages and created a super-simplified English version.
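A simplification-plus-translation request to the model can be sketched as a prompt template per target language. The instruction wording below is a hypothetical illustration; the exact prompts used for MedSiML are not reproduced here.

```python
# The ten target languages covered by MedSiML.
TARGET_LANGUAGES = ["English", "Mandarin", "Spanish", "Arabic", "Hindi",
                    "Bengali", "Portuguese", "Russian", "Japanese", "Punjabi"]

def build_prompt(paragraph: str, language: str) -> str:
    """Build a simplification/translation prompt (hypothetical template)."""
    return (
        f"Simplify the following medical paragraph so a lay reader can "
        f"understand it, then write the simplified version in {language}. "
        f"Preserve all clinical facts.\n\nParagraph:\n{paragraph}"
    )
```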
We fine-tuned the T5 base model on this paragraph-level, multilingual data, achieving significant improvements in readability and semantic similarity.
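T5 is trained on text-to-text pairs, so one common way to steer a single model toward multiple target languages is a task prefix on the input. A minimal sketch of forming training pairs follows; the prefix wording is an assumption, not necessarily the scheme used in the paper.

```python
def to_t5_example(source: str, target: str, language: str) -> dict:
    """Form a (input, target) training pair with a task prefix.

    The "simplify to <language>:" prefix is a hypothetical convention.
    """
    return {
        "input_text": f"simplify to {language}: {source}",
        "target_text": target,
    }
```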
Our fine-tuned model improved over existing models across several metrics:
- ROUGE1: +10.61%
- SARI: +11.01%
- Semantic Similarity: +49.1%
- Readability Scores: FK +0.38, ARI +1.06
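The FK (Flesch-Kincaid grade level) and ARI (Automated Readability Index) scores above can be computed from simple text statistics. A self-contained sketch using the published formulas follows; the vowel-group syllable counter is a crude stand-in for the proper syllabifier an evaluation pipeline would use.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; real pipelines use a proper syllabifier."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    """Return (Flesch-Kincaid grade, ARI) using the standard formulas."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    n_chars = sum(len(w) for w in words)
    n_syll = sum(count_syllables(w) for w in words)
    fk = 0.39 * (n_words / sentences) + 11.8 * (n_syll / n_words) - 15.59
    ari = 4.71 * (n_chars / n_words) + 0.5 * (n_words / sentences) - 21.43
    return fk, ari
```

Lower grade-level scores indicate text that is easier to read, which is the direction simplification aims for.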
To use the MedSiML dataset and models, follow these steps:

1. Clone the repository:

   ```shell
   git clone https://github.com/nepython/MedSiML.git
   cd MedSiML
   ```

2. Install the dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Access the dataset:
   - Download the `data.zip` file from Dropbox.
   - Move `data.zip` into the `data` directory.
   - Unzip `data.zip`.
   - Load it using the `datasets` library:

     ```python
     from datasets import load_dataset
     raw_datasets = load_dataset("csv", data_files='data/data.tsv', delimiter='\t')
     ```

   - (or) Using the `pandas` library:

     ```python
     import pandas as pd
     df = pd.read_csv('data/data.tsv', sep='\t')
     ```

4. Run model inference:
   - Download the `checkpoints` from Dropbox.
   - Move the `checkpoints` into the `notebooks` directory.
   - Unzip the `checkpoints`.
   - Open the `T5_base.ipynb` notebook and run the cells in all sections, skipping `Preprocessing` and `Finetuning`.
- Hardik A. Jain
- Chirayu Patel
- Riyasatali Umatiya
- Sajib Mistry
- Aneesh Krishna
- Amin Beheshti
If you use this dataset or model in your research, please cite our paper:
@inproceedings{Jain2025MedSiML,
title={MedSiML: A Multilingual Approach for Simplifying Medical Texts},
author={Jain, Hardik A. and Patel, Chirayu and Umatiya, Riyasatali and Mistry, Sajib and Krishna, Aneesh and Beheshti, Amin},
booktitle={Neural Information Processing},
year={2025},
publisher={Springer Nature Singapore},
address={Singapore},
pages={165-179},
isbn={978-981-96-6606-5},
doi={10.1007/978-981-96-6606-5_12}
}
This project is licensed under the MIT License. See the LICENSE file for details.