eQTL_Trees

This repo contains the code and data used for the work: Predicting target genes of noncoding regulatory variants with ICE

Note that all pickle files (datasets and saved models) must be downloaded through git-lfs.
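
If the repo was cloned without git-lfs, each pickle file is a small text stub rather than the real data, and unpickling it fails. The following is a minimal sketch for spotting such stubs. It relies on the Git LFS pointer format, whose first line is `version https://git-lfs.github.com/spec/v1`; the helper names `is_lfs_pointer` and `find_unfetched_pickles` are illustrative and not part of this repo.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` is still a Git LFS pointer stub, not the real file."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_unfetched_pickles(root="."):
    """List .pkl files under `root` that have not been fetched through git-lfs."""
    return sorted(str(p) for p in Path(root).rglob("*.pkl") if is_lfs_pointer(p))
```

Calling `find_unfetched_pickles(".")` from the repo root should return an empty list once the files have been fetched.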

All raw data were downloaded from the GTEx database (GTEx V7, tissue-specific data) and the ORegAnno database. The curation process is described in the Methods section of the paper; the related code is stored under scripts/support_scripts. Generated data files include:

Data/assembled_balanced_dataset_123.pkl - main dataset for the cross-validation study. Each entry represents a variant-gene pair, in the same form as a GTEx entry: [gene_id, variant_id, tss_distance, ma_samples, ma_count, maf, pval_nominal, slope, slope_se]
Data/assembled_balanced_dataset_123_Xy.pkl - features and labels for the main dataset. Each entry corresponds to a row in the 2D feature array; names and descriptions of the features can be found in the supplementary spreadsheet and scripts/generate_X.py
Data/test_pairs.pkl - test dataset collected from ORegAnno, same format as the main dataset
Data/test_pairs_Xy.pkl - features and labels for the test dataset
Data/ranking_analysis.pkl.pkl - selected variants (from the main dataset) with extra negative pairs collected from GTEx, used for Figure S6
Data/ranking_analysis.pkl_Xy.pkl - features and labels for the ranking analysis dataset
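
The entry layout described above can be made easier to work with by giving each field a name. This is a minimal sketch, assuming each pairs pickle holds an iterable of 9-element sequences in the field order listed for the main dataset; `PairEntry` and `load_pairs` are hypothetical names, not functions from this repo, and the real files may need pandas or numpy installed to unpickle.

```python
import pickle
from collections import namedtuple

# Field order copied from the description of a GTEx-style entry above.
PairEntry = namedtuple(
    "PairEntry",
    ["gene_id", "variant_id", "tss_distance", "ma_samples", "ma_count",
     "maf", "pval_nominal", "slope", "slope_se"],
)


def load_pairs(path):
    """Load a pairs pickle and wrap each 9-field entry in a PairEntry.

    Assumption: the pickle holds an iterable of 9-element sequences.
    """
    with open(path, "rb") as handle:
        raw = pickle.load(handle)
    return [PairEntry(*entry) for entry in raw]
```

With the files fetched, `load_pairs("Data/test_pairs.pkl")[0].tss_distance` would then read the TSS distance of the first pair, provided the assumption about the container type holds.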

Trained (xgboost) models are stored under scripts:

scripts/random_assembled_balanced_dataset_123_Xy_models.pkl - models trained under random cross-validation. The split can be reproduced with the functions in split.py; see run.py for usage. The first model under the 'FULL' key (models['FULL'][0]) is used for the feature importance analysis in this work.
scripts/position_assembled_balanced_dataset_123_Xy_models.pkl - models trained under position-based cross-validation
scripts/maf_assembled_balanced_dataset_123_Xy_models.pkl - models trained under the MAF split (threshold 0.01)
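
To illustrate what a threshold-based MAF split looks like, here is a minimal sketch that partitions entries by their `maf` value at 0.01. It is not the implementation in split.py: the function name `split_by_maf`, the handling of values exactly at the threshold, and which side serves as training or held-out data are all assumptions made for illustration.

```python
def split_by_maf(mafs, threshold=0.01):
    """Partition indices by minor allele frequency.

    Returns (low_idx, high_idx): indices with maf < threshold and
    indices with maf >= threshold. The boundary handling is an assumption.
    """
    low_idx, high_idx = [], []
    for i, maf in enumerate(mafs):
        if maf < threshold:
            low_idx.append(i)
        else:
            high_idx.append(i)
    return low_idx, high_idx
```

For the exact folds used to train the stored models, refer to split.py and run.py.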

Scripts used to train and evaluate models can be found in scripts/run.py. More detailed analyses (to reproduce the figures in the manuscript) can be found in:

scripts/feature_importance.py - Figure 2, S1, S2
scripts/pred_distribution.py - Figure S3, S4
scripts/rank_analysis.py - Figure S6
scripts/test_pairs.py - Figure 1, S5

Requirements

numpy
pandas
scikit-learn (imported as sklearn)
xgboost
xgbfir
matplotlib
seaborn
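
Before running the scripts, it can help to confirm that the packages above are importable. The sketch below uses only the standard library and checks the import names of the listed packages; `missing_modules` is an illustrative helper, not part of this repo, and no specific versions are assumed because none are stated here.

```python
import importlib.util

# Import names for the packages listed under Requirements.
REQUIRED_MODULES = ["numpy", "pandas", "sklearn", "xgboost", "xgbfir",
                    "matplotlib", "seaborn"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from `missing_modules()` means every listed package can be located by the current interpreter.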

miaecle/eQTL_Trees
