This is the repository for 'MuLan-Methyl - Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction'. It contains the source code for the MuLan-Methyl fine-tuning procedure and for prediction on custom datasets.
Please cite our paper (listed at the end of this document) if you use the model.
A web service for MuLan-Methyl is available at: http://ab.cs.uni-tuebingen.de/software/mulan-methyl
Download MuLan-Methyl from the GitHub repository.
git clone https://github.com/husonlab/mulan-methyl.git
cd mulan-methyl
The required data is stored in data.zip, which has the following structure:
├── benchmark
│   ├── example_data_processing
│   │   ├── test_set.tsv
│   │   └── train_set.tsv
│   ├── initial_dataset
│   │   ├── 4mC_C.equisetifolia
│   │   ├── 4mC_F.vesca
│   │   ├── 4mC_S.cerevisiae
│   │   ├── 4mC_Tolypocladium
│   │   ├── 5hmC_H.sapiens
│   │   ├── 5hmC_M.musculus
│   │   ├── 6mA_A.thaliana
│   │   ├── 6mA_C.elegans
│   │   ├── 6mA_C.equisetifolia
│   │   ├── 6mA_D.melanogaster
│   │   ├── 6mA_F.vesca
│   │   ├── 6mA_H.sapiens
│   │   ├── 6mA_R.chinensis
│   │   ├── 6mA_S.cerevisiae
│   │   ├── 6mA_T.thermophile
│   │   ├── 6mA_Tolypocladium
│   │   └── 6mA_Xoc BLS256
│   └── processed_dataset
│       ├── test
│       │   ├── processed_4mC.tsv
│       │   ├── processed_5hmC.tsv
│       │   └── processed_6mA.tsv
│       └── train
│           ├── processed_4mC.tsv
│           ├── processed_5hmC.tsv
│           └── processed_6mA.tsv
├── taxonomy
│   ├── ncbi_gtdb_processed.csv
│   └── species_name_mapped.csv
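Before running any command, it can help to confirm that the archive was extracted where the commands expect it. The following is a minimal sketch, not part of the repository: it assumes data.zip has been extracted to ./data (the path prefix used by the commands in this README), and the helper name missing_files is hypothetical.

```python
from pathlib import Path

# Files referenced by this README, relative to the extracted data directory.
# Assumption: data.zip is extracted to ./data.
EXPECTED_FILES = [
    "benchmark/example_data_processing/train_set.tsv",
    "benchmark/example_data_processing/test_set.tsv",
    "benchmark/processed_dataset/train/processed_6mA.tsv",
    "benchmark/processed_dataset/test/processed_6mA.tsv",
    "taxonomy/ncbi_gtdb_processed.csv",
    "taxonomy/species_name_mapped.csv",
]


def missing_files(data_root):
    """Return the expected files that are not present under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("./data")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files found.")
```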
We recommend running MuLan-Methyl in a Python virtual environment managed by Anaconda. Create a new conda environment with the required packages:
conda env create -n mulan-methyl --file MuLan.yaml
conda activate mulan-methyl
The input to MuLan-Methyl is a sentence that contains a DNA sequence and a description of the sample's taxonomic lineage. The following command gives an example of processing DNA sequences into the required format.
python code/main.py \
--data_proc \
--input_file ./data/benchmark/example_data_processing/train_set.tsv \
--data_type tsv \
--labelled
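To illustrate what "a sentence that contains a DNA sequence and a taxonomic lineage description" could look like, here is a rough sketch. It is not the repository's implementation: the actual template is produced by code/main.py with --data_proc. The function names, the split into overlapping k-mers (a common choice for DNA language models), the value k=6, and the way the two parts are joined are all assumptions made for illustration.

```python
def kmers(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers (k is an assumed value)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_sentence(sequence, lineage, k=6):
    """Join the k-mer 'words' and the lineage names into one text input.

    Hypothetical format: the real sentence template may differ.
    """
    dna_part = " ".join(kmers(sequence, k))
    taxonomy_part = " ".join(lineage)
    return dna_part + " " + taxonomy_part


if __name__ == "__main__":
    # Toy sequence and an abbreviated lineage, for illustration only.
    print(build_sentence("ACGTACGTAC", ["Rosaceae", "Rosa", "Rosa chinensis"]))
```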
The pretrained MuLan-Methyl consists of five pretrained language models, which are available on Hugging Face.
MuLan-Methyl contains three prediction models, one per methylation-site type. The 6mA prediction model is an ensemble of five transformer-based language models, each fine-tuned from the corresponding pretrained language model. The sub-models of the 4mC prediction model are fine-tuned from the 6mA fine-tuned models, and similarly, the sub-models of the 5hmC prediction model are fine-tuned from the 4mC fine-tuned models.
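The fine-tuning chain described above (pretrained, then 6mA, then 4mC, then 5hmC) can be sketched as a small lookup. This is an illustrative sketch only: the names FINETUNE_ORDER, init_source, and ensemble_probability are hypothetical, and combining the five sub-model outputs by a plain average is an assumption, since this README does not state how the ensemble is formed.

```python
# Order in which the methylation-type models are fine-tuned, as described above.
FINETUNE_ORDER = ["6mA", "4mC", "5hmC"]


def init_source(methyl_type):
    """Return which weights a sub-model for methyl_type starts from."""
    index = FINETUNE_ORDER.index(methyl_type)
    if index == 0:
        return "pretrained"
    return "finetuned:" + FINETUNE_ORDER[index - 1]


def ensemble_probability(probabilities):
    """Combine sub-model probabilities by simple averaging (an assumption)."""
    if not probabilities:
        raise ValueError("at least one sub-model probability is required")
    return sum(probabilities) / len(probabilities)


if __name__ == "__main__":
    for methyl_type in FINETUNE_ORDER:
        print(methyl_type, "starts from", init_source(methyl_type))
```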
Fine-tune MuLan-Methyl for each methylation site type by passing 6mA, 4mC, or 5hmC to the argument methyl_type.
This command gives an example of fine-tuning MuLan-Methyl to identify 6mA methylation sites on the processed dataset.
python code/main.py \
--finetune \
--input_file ./data/benchmark/processed_dataset/train/processed_6mA.tsv \
--methyl_type 6mA \
--model_list BERT DistilBERT ALBERT XLNet ELECTRA \
--learning_rate 1e-5 1e-5 5e-5 2e-5 1e-5 \
--finetuned_output_dir ./pretrained_model
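The fine-tuning command passes five model names and five learning rates. Assuming the two lists are matched by position (first rate for the first model, and so on), which this README does not state explicitly, the pairing can be checked with a short sketch like this one. The helper name pair_models is hypothetical.

```python
def pair_models(models, learning_rates):
    """Pair each model name with the learning rate at the same position."""
    if len(models) != len(learning_rates):
        raise ValueError("model_list and learning_rate must have the same length")
    return dict(zip(models, learning_rates))


if __name__ == "__main__":
    # Values copied from the fine-tuning command above.
    models = ["BERT", "DistilBERT", "ALBERT", "XLNet", "ELECTRA"]
    learning_rates = [1e-5, 1e-5, 5e-5, 2e-5, 1e-5]
    for name, rate in pair_models(models, learning_rates).items():
        print(name, rate)
```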
After fine-tuning, use MuLan-Methyl to predict the methylation status of DNA sequences. This command performs 6mA methylation site prediction on R.chinensis.
python code/main.py \
--prediction \
--input_file ./data/benchmark/processed_dataset/test/processed_6mA.tsv \
--data_processed \
--methyl_type 6mA \
--labelled \
--data_type tsv \
--finetuned_output_dir ./pretrained_model \
--multi_species \
--species R.chinensis \
--prediction_output_dir ./prediction
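To run the same prediction for several species, the command above can be generated from a small helper. This is a sketch under assumptions: it reuses only the flags shown in the example, the helper name prediction_command is hypothetical, and it assumes that other species identifiers follow the same form as R.chinensis (for example F.vesca, taken from the directory names in initial_dataset).

```python
import shlex


def prediction_command(species, methyl_type="6mA"):
    """Return the argument list for one prediction run (flags copied from the example above)."""
    test_file = "./data/benchmark/processed_dataset/test/processed_" + methyl_type + ".tsv"
    return [
        "python", "code/main.py",
        "--prediction",
        "--input_file", test_file,
        "--data_processed",
        "--methyl_type", methyl_type,
        "--labelled",
        "--data_type", "tsv",
        "--finetuned_output_dir", "./pretrained_model",
        "--multi_species",
        "--species", species,
        "--prediction_output_dir", "./prediction",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; pass each list to
    # subprocess.run(command, check=True) to execute it.
    for species in ["R.chinensis", "F.vesca"]:
        print(shlex.join(prediction_command(species)))
```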
- Wenhuan Zeng, Anupam Gautam, Daniel H Huson. MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction, GigaScience, Volume 12, 2023, giad054, https://doi.org/10.1093/gigascience/giad054