Predicting the genetic component of gene expression using gene regulatory networks

Introduction

This README presents a detailed overview of our pipeline for building prediction method and model for genetic component of gene expression. This document provides an overview of our data transformation stages of preprocessing, structure learning, and modeling.

Pipeline Stages

Preprocessing Stage

The preprocessing stage prepares the raw data by aligning expressions with genotypes and splitting the dataset into training and test sets.

Structure Learning Stage

In the structure learning stage, the pipeline infers gene regulatory network and make input-output features for each gene in the network

Modeling Stage

The final stage involves training the model on the featurized data and evaluating its performance.

Path Configuration

Our pipeline utilizes a path configuration file /src/path_config.py where all necessary paths for data are defined.

Each Python file within our pipeline refers to these path configurations to read input data and store outputs. Here is an example demonstrating how scripts use the path configuration:

#Example content of path_config.py file
from pathlib import Path
MY_PATH = Path("/cluster/projects/nn1015k/GRN-TI")  # TODO: Needs to be updated
DATA_PATH = MY_PATH / "data"
RAW_CSV_GZ_PATH = DATA_PATH / "raw_csv_gz" # Path to the raw CSV GZ files.
ALIGNED_PATH = DATA_PATH / "aligned" # Path where aligned data is stored.
SPLIT_PATH = DATA_PATH / "split" # Path for the split data.
PAIRWISE_PROBABILITY_PATH =  DATA_PATH / "pairwise_probability" # Path for the output of pairwise inference.

#How the align_expression_genotype.py file called
if __name__ == '__main__':
    in_path = RAW_CSV_GZ_PATH
    out_path = ALIGNED_PATH
    if not out_path.exists():
        out_path.mkdir(parents=True)
    align_data(in_path=in_path,out_path=out_path)

Parameters Configuration

Additionally, the pipeline uses a param_config.yaml file to define all variables and parameters needed at each stage of the pipeline. This configuration file is structured to include settings for preprocessing, structure learning, and modeling stages, allowing for easy adjustments to the pipeline's behavior without modifying the code directly.

preprocessing:
  align:
  split:
    test_size: 0.2
structure_learning:
  pairwise_inference:
  network_inference:
    fdr_prior: 0.6
  featurize:

modeling:
  train: 
    learning_rate: 0.01

#How it is used in split_train_test.py
import yaml

params = yaml.safe_load(open("src/param_config.yaml"))["preprocessing"]
params_split = params['split']
test_size = float(params_split['test_size'])

Pipeline Structure and Workflow

The flow of the pipeline is shown below. Each stage depends on the next stage solely through the data file, meaning if the data is provided in the correct format, the code for each stage will be independent of the others.


PROJECT ROOT
├── src
│   ├── preprocessing
│   │   ├── geuvadis
│   │   │   ├── script.py
│   │   ├── dream
│   │   │   ├── script.py
│   │   └── yeast
│   │       ├── script.py
│   ├── structure_learning
│   │   ├── geuvadis
│   │   │   ├── script.py
│   │   ├── dream
│   │   │   ├── script.py
│   │   └── yeast
│   │       ├── script.py
│   └── modeling
│       ├── geuvadis
│       │   └── script.py
│       ├── dream
│       │   └── script.py
│       └── yeast
│           └── script.py
│   ├── param_config.yaml
│   ├── path_config.py
│   └── utils.py
├── data
│   ├── raw
│   │   ├── geuvadis
│   │   ├── dream
│   │   └── yeast
│   └── processed
│       ├── aligned
|            └── [output directory  for preprocessing]
│       ├── split
|            └── [output directory  for preprocessing]
│       ├── pairwise_inference
|            └── [output directory  for structure learning]
│       ├── network_inference
|            └── [output directory  for structure learning]
├── metrics
|    └── [output directory  for modelling]
└── models
|     └── [output directory  for modelling]

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github/workflows		.github/workflows
data		data
metrices		metrices
models		models
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Predicting the genetic component of gene expression using gene regulatory networks

Introduction

Pipeline Stages

Preprocessing Stage

Structure Learning Stage

Modeling Stage

Path Configuration

Parameters Configuration

Pipeline Structure and Workflow

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Predicting the genetic component of gene expression using gene regulatory networks

Introduction

Pipeline Stages

Preprocessing Stage

Structure Learning Stage

Modeling Stage

Path Configuration

Parameters Configuration

Pipeline Structure and Workflow

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages