Disease Classification Model

Practical Bioinformatics Course Project

Authors:

Aditya Jagadale
Hemanth Dindigallu
Het Shah
Muthuraj Vairamuthu
Noel Tiju
Rishi Pendyala

Course Details:

Course: Practical Bioinformatics
Semester: Winter 2024
Instructor: Dr. Tavpritesh Sethi
Institute: Indraprastha Institute of Information Technology, Delhi

Problem Statement

Disease Classification: Collect patients’ gene information obtained through differential gene expression analysis and classify genes associated with a particular disease using a machine learning model.

Dataset

GEO Datasets from NCBI: Downloaded .CEL files related to various diseases on the platform GPL570. The datasets can be accessed via the following link: GEO Datasets.

Normalization and Differential Gene Expression

We start by initializing essential R packages including BiocManager, oligo, GEOquery, and limma, among others, required to handle and analyze microarray data. Subsequently, disease-specific microarray datasets are imported and preprocessed using the oligo::rma() function, a method known for its efficiency in background correction, normalization, and summarization of probe intensities into expression values.

Following data preparation, the code undertakes differential gene expression analyses through linear modeling and empirical Bayes statistics provided by the limma package. Each disease dataset is arranged into cancerous and non-cancerous groups, and a linear model is used to note genes exhibiting significant expression variations between these groups. The outcomes of these analyses are stored in CSV files, documenting key gene expression details such as identifiers, log fold changes, average expression levels, statistical test values, and adjusted p-values.

Machine Learning Model

An ML model is used to classify the genes associated with a particular disease. This begins by loading the dataset and preprocessing it. Subsequently, the data is standardized and split into training and testing sets.

The training set is then utilized to train a Convolutional Neural Network (CNN) model implemented using TensorFlow's Keras API. The model architecture comprises convolutional layers followed by max-pooling layers, dense layers, and a softmax output layer. Training progress is monitored through epochs, with validation set accuracy and loss also evaluated.

Post-training, the model's performance is assessed using the testing set, and accuracy metrics are printed. Additionally, a plot displaying the training and validation loss curves over epochs is generated for visualization.

Following CNN modeling, feature selection is performed using an XGBoost classifier to identify the most influential features for disease classification. The selected features are then utilized to train a new XGBoost classifier, and its accuracy on the testing set is evaluated.

Lastly, the code generates a pair plot visualization using seaborn, displaying pairwise scatter plots colored by target disease labels, providing insights into the distribution and relationships among features across different disease categories.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
Datasets		Datasets
Differential_Analysis		Differential_Analysis
.DS_Store		.DS_Store
LICENSE		LICENSE
Normalization_DEG.r		Normalization_DEG.r
PB ML Model.ipynb		PB ML Model.ipynb
README.md		README.md
feature_selection.py		feature_selection.py
final_dataset.csv		final_dataset.csv
making_dataset.py		making_dataset.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Disease Classification Model

Practical Bioinformatics Course Project

Authors:

Course Details:

Problem Statement

Dataset

Normalization and Differential Gene Expression

Machine Learning Model

About

Releases

Packages

Languages

License

noeltiju/PB-Project

Folders and files

Latest commit

History

Repository files navigation

Disease Classification Model

Practical Bioinformatics Course Project

Authors:

Course Details:

Problem Statement

Dataset

Normalization and Differential Gene Expression

Machine Learning Model

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages