nebojsa55/Computational-Genomics_MidTerm-Project

Preterm Birth Prediction based on Gene Expression

This project was completed as part of the Computational Genomics course at the University of Belgrade, School of Electrical Engineering.

The goal was to predict gestational age in pregnant women from gene expression data using regression models.

Data

The Datasets used for the analyses described in this project were contributed by Wayne State University School of Medicine Perinatal Initiative and by the Perinatology Research Branch, Division of Obstetrics and Maternal-Fetal Medicine, Division of Intramural Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, U.S. Department of Health and Human Services (NICHD/NIH/DHHS); and, in part, with Federal funds from NICHD/NIH/DHHS under Contract No. HHSN275201300006C. They were obtained as part of the DREAM Preterm Birth Prediction Challenge through Synapse (syn18380862), managed by Sage Bionetworks.

To learn more please click on the picture above.

Data stats:

| Total number of samples | Training samples | Testing samples | Number of features |
|---|---|---|---|
| 735 | 367 | 368 | 32 830 |

DISCLAIMER: The challenge ended in 2019, and this implementation was not an active part of the challenge.

Requirements

To install the necessary libraries, type in the terminal:

pip install -r requirements.txt 

Table of contents and results

The project is divided into Jupyter notebooks, which follow the line of reasoning that went into building the model. The main regression metric was RMSE (root mean square error).
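For reference, RMSE is the square root of the mean squared difference between the true and predicted values. A minimal sketch with NumPy follows; the helper name `rmse` is chosen here for illustration and is not taken from the notebooks.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5), about 3.54
example = rmse([30.0, 25.0], [33.0, 21.0])
print(f"RMSE = {example:.4f}")
```

Lower values are better, and the score is in the same units as the predicted target.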

  1. Basic-regression-models.ipynb

    • Samples were standard-scaled within the batch they belong to
    • PCA was performed to find the minimum number of components that account for 95% of the variance
    • Random Forest Regressor and Support Vector Regressor were tested, as they are among the most common ML models used in the literature
    • Hyperparameters were tuned by cross-validation, and 10-fold cross-validation was performed to get the best model possible
    • RMSE(RFR) = 7.5441 ; RMSE(SVR) = 8.4081
  2. Better-model.ipynb

    • Samples were standard-scaled within the batch they belong to
    • Instead of PCA, features were selected by their f_regression score, using the SelectKBest class from the sklearn.feature_selection module
    • Samples from the set 'GSE113966' were dropped, as they seem to be outliers (32 in total)
    • Random Forest Regressor and Support Vector Regressor were tested with the hyperparameters found in notebook 1, across different values of the parameter K, to find the optimal number of features to use. After that, 10-fold cross-validation was performed
    • RMSE(RFR) = 5.9324 ; RMSE(SVR) = 8.0756
  3. Linear-regression.ipynb

    • Samples were standard-scaled within the batch they belong to
    • Only the linear regressor ElasticNet was considered, as results from the previous notebook suggest that linear regressors are the most suitable for this dataset (the f_regression score is a univariate linear regression test)
    • 10-fold cross-validation was performed to find the optimal parameters for the ElasticNet regressor, and cross-validation was also used to choose the parameter K in SelectKBest
    • RMSE(EN) = 4.9283 ✔️
  4. Gene-importance.ipynb

    • The top 10 features (genes) were plotted, and the top 5 are presented in the table below, with their respective gene symbol, description (acquired from https://www.genecards.org/) and K-score:

| Feature label | Gene symbol | Description | K-score |
|---|---|---|---|
| 199675_at | MCEMP1 | This gene encodes a single-pass transmembrane protein. Based on its expression pattern, it is speculated to be involved in regulating mast cell differentiation or immune responses | 72.28 |
| 2359_at | FPR3 | FPR3 (Formyl Peptide Receptor 3) is a Protein Coding gene. Diseases associated with FPR3 include Rubeosis Iridis. Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and N-formyl peptide receptor activity | 62.60 |
| 3507_at | IGHM | IGHM (Immunoglobulin Heavy Constant Mu) is a Protein Coding gene. Diseases associated with IGHM include Agammaglobulinemia 1, Autosomal Recessive and Agammaglobulinemia, Non-Bruton Type. Gene Ontology (GO) annotations related to this gene include single-stranded DNA binding and phosphatidylcholine binding | 57.93 |
| 9619_at | ABCG1 | The protein encoded by this gene is a member of the superfamily of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the White subfamily. It is involved in macrophage cholesterol and phospholipids transport, and may regulate cellular lipid homeostasis in other cell types. Six alternative splice variants have been identified. | 57.83 |
| 6689_at | SPIB | The protein encoded by this gene is a transcriptional activator that binds to the PU-box (5'-GAGGAA-3') and acts as a lymphoid-specific enhancer. Four transcript variants encoding different isoforms have been found for this gene | 52.16 |
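The steps listed for the notebooks above (per-batch standard scaling, SelectKBest with f_regression, ElasticNet tuned by 10-fold cross-validation, and ranking features by score) can be sketched roughly as follows. This is an illustration on synthetic data, not the notebooks' actual code: the helper `scale_per_batch`, the data shapes, and the grid values are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline


def scale_per_batch(X, batch):
    """Standardize every feature to zero mean and unit variance within each batch."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    scaled = np.empty_like(X)
    for b in np.unique(batch):
        rows = batch == b
        mean = X[rows].mean(axis=0)
        std = X[rows].std(axis=0)
        std[std == 0] = 1.0  # constant features become 0 instead of dividing by zero
        scaled[rows] = (X[rows] - mean) / std
    return scaled


# Synthetic stand-in for the expression matrix (hypothetical sizes):
# 120 samples, 200 features, 3 batches with an additive batch effect.
rng = np.random.default_rng(0)
batch = np.repeat([0, 1, 2], 40)
signal = rng.normal(size=(120, 200))
X_raw = signal + 2.0 * batch[:, None]
y = 25.0 + 3.0 * signal[:, 0] - 2.0 * signal[:, 1] + rng.normal(size=120)

X = scale_per_batch(X_raw, batch)

# Feature selection sits inside the pipeline so it is refit on each training fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", ElasticNet(max_iter=10_000)),
])
param_grid = {  # illustrative grid, not the values used in the notebooks
    "select__k": [5, 20, 50],
    "model__alpha": [0.01, 0.1, 1.0],
    "model__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

cv_rmse = -search.best_score_
print("best parameters:", search.best_params_)
print(f"10-fold CV RMSE: {cv_rmse:.4f}")

# Rank features by their f_regression score, as in the gene-importance step.
scores = search.best_estimator_.named_steps["select"].scores_
top5 = np.argsort(scores)[::-1][:5]
print("top 5 feature indices:", top5)
```

Keeping `SelectKBest` inside the pipeline means the features are chosen from the training folds only, which avoids an optimistic bias in the cross-validated RMSE.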

EDIT 5. Final-model.ipynb

  • This notebook is the same as notebook #3, except that the model is now tested on the whole testing set (368 samples) to check whether the model is viable.
  • RMSE(EN) = 5.0400 ✔️
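A hold-out check of this kind can be sketched as below: the model is cross-validated and fitted on the training half only, then scored once on the untouched test half. The data are synthetic, and the values of `k`, `alpha`, and `l1_ratio` are placeholders rather than the parameters found in the notebooks.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic regression data standing in for the expression matrix (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
y = 25.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

# Split in half, mirroring the roughly equal train/test sizes of the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Placeholder hyperparameters, not the tuned values from the notebooks.
model = make_pipeline(
    SelectKBest(score_func=f_regression, k=10),
    ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10_000),
)

# Cross-validated RMSE on the training half only.
cv_scores = cross_val_score(
    model, X_train, y_train, cv=10, scoring="neg_root_mean_squared_error"
)
cv_rmse = float(-cv_scores.mean())

# Fit on the full training half, then score once on the held-out half.
model.fit(X_train, y_train)
residuals = y_test - model.predict(X_test)
test_rmse = float(np.sqrt(np.mean(residuals ** 2)))

print(f"CV RMSE (train): {cv_rmse:.4f}")
print(f"Hold-out RMSE (test): {test_rmse:.4f}")
```

A hold-out RMSE close to the cross-validated one, as with the 4.9283 and 5.0400 reported above, suggests the model is not badly overfitted to the training samples.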
