This project focuses on improving the prediction of Escherichia coli promoters by applying transfer learning to Nucleotide Transformer models.
I fine-tuned pre-trained Nucleotide Transformer models, varying the model configuration and pre-training dataset, on promoter data from the Prokaryotic Promoter Database (PPD). Further experiments adjusted the number of training steps and the Low-Rank Adaptation (LoRA) dropout rate to improve accuracy.
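Fine-tuning followed the usual Hugging Face transformers/peft pattern. The sketch below is a minimal illustration of attaching LoRA adapters to a Nucleotide Transformer sequence-classification head; the checkpoint name, rank, alpha, and target-module names are assumptions for illustration, not the project's verified settings.

```python
# Minimal LoRA fine-tuning sketch. The checkpoint and all hyperparameters
# below are assumptions, not the exact settings used in this project.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

CHECKPOINT = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2, trust_remote_code=True  # binary: promoter vs. non-promoter
)

lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16,                    # low-rank dimension (assumed)
    lora_alpha=32,           # scaling factor (assumed)
    lora_dropout=0.1,        # one of the rates varied in the experiments
    target_modules=["query", "value"],  # attention projections; names are model-specific
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are updated
```

Because only the low-rank adapter matrices are trained, this keeps fine-tuning tractable on a single Colab GPU.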
The models were evaluated on a test set of 865 natural, experimentally validated E. coli K-12 promoter sequences and 1,000 randomly generated negative sequences.
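Since the test set is a fixed mix of labeled positives and negatives, evaluation reduces to standard binary-classification metrics. The following sketch with scikit-learn is illustrative only: the placeholder scores and the particular metrics shown are assumptions, not the report's exact protocol.

```python
# Hypothetical evaluation sketch: y_true holds 1 for the 865 validated
# promoters and 0 for the 1,000 random negatives; y_score holds the
# model's predicted promoter probabilities (random placeholder here).
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(865), np.zeros(1000)])
y_score = rng.random(1865)           # replace with real model outputs
y_pred = (y_score >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC :", roc_auc_score(y_true, y_score))
print("MCC     :", matthews_corrcoef(y_true, y_pred))
```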
The experiments were run in Python notebooks on Google Colab. Figures were generated with the following script:
Bioinformatics_Project/figures/figures.py
The main data files required to run the experiments include:
Training Data:
Bioinformatics_Project/PPD/train_data/train.csv
Bioinformatics_Project/PPD/train_data/train_coli.csv
Testing Data:
Bioinformatics_Project/PPD/test_data/test.csv
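These files can be inspected with pandas before fine-tuning. The sketch below is illustrative; the column layout it describes (one sequence per row with a binary label) is an assumption, not a documented schema.

```python
# Load the PPD training/test splits (the column layout is assumed).
import pandas as pd

train = pd.read_csv("Bioinformatics_Project/PPD/train_data/train.csv")
train_coli = pd.read_csv("Bioinformatics_Project/PPD/train_data/train_coli.csv")
test = pd.read_csv("Bioinformatics_Project/PPD/test_data/test.csv")

# Assumed layout: one promoter/non-promoter sequence per row with a
# binary label, e.g. columns like "sequence" and "label".
print(train.shape, train_coli.shape, test.shape)
```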
The results and findings of this project are documented in:
02-604_Bioinformatics_Project_FinalReport.pdf
References:
- J. Wright, Gene Control. Scientific e-Resources, 2019.
- M. E. Consens et al., “To Transformers and Beyond: Large Language Models for the Genome,” arXiv preprint arXiv:2311.07621, 2023.
- M. H. A. Cassiano and R. Silva-Rocha, “Benchmarking bacterial promoter prediction tools: Potentialities and limitations,” mSystems, vol. 5, no. 4, 2020.
- W. Su et al., “PPD: a manually curated database for experimentally verified prokaryotic promoters,” Journal of Molecular Biology, vol. 433, no. 11, p. 166860, 2021.
- V. H. Tierrafria et al., “RegulonDB 11.0: Comprehensive high-throughput datasets on transcriptional regulation in Escherichia coli K-12,” Microbial Genomics, vol. 8, no. 5, 2022.
- H. Dalla-Torre et al., “The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics,” bioRxiv, 2023.
- Y. Lin et al., “LoRA Dropout as a Sparsity Regularizer for Overfitting Control,” arXiv preprint arXiv:2404.09610, 2024.
- E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2106.09685, 2021.
- A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 625–632.
- D. S. Wilks, “On the combination of forecast probabilities for consecutive precipitation periods,” Weather and Forecasting, vol. 5, no. 4, pp. 640–650, 1990.
- J. Clauwaert, G. Menschaert, and W. Waegeman, “Explainability in transformer models for functional genomics,” Briefings in Bioinformatics, vol. 22, no. 5, p. bbab060, 2021.
- H. Huang, Y. Wang, C. Rudin, and E. P. Browne, “Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization,” Communications Biology, vol. 5, no. 1, p. 719, 2022.
- A. Tareen and J. B. Kinney, “Logomaker: Beautiful sequence logos in Python,” Bioinformatics, vol. 36, no. 7, pp. 2272–2274, 2020.