Welcome to the PrediXcan wiki!
PredictDB Update 8/18/2016
With the GTEx consortium’s release of a patch to its version 6 data, we are pleased to release updates to 40 of our current tissue models and 4 brand-new tissue models for use with PrediXcan and MetaXcan. The version 6 patch (V6p) data from GTEx contains new gene-level expression quantification based on an improved gene annotation derived from GENCODE version 19. In addition, these models were trained using the 1000 Genomes SNP set rather than the HapMap SNP set, which frequently results in a larger number of SNPs being used to predict expression.
We have also filtered the results to include only linear models that are significant at an FDR of less than 5%, whereas previously we included all gene models regardless of how well the model performed in training. Consequently, there are fewer genes for which the models will predict expression, and the proportion of gene models filtered out is mainly determined by the number of samples in the data. For example, in Brain Anterior Cingulate Cortex, where n_samples = 72, we have approximately 70% fewer gene models than in the previous release. Tissues with a larger sample size retain a larger proportion of models. For example, in Lung, where n_samples = 278, we have about 37% fewer gene models.
To see summary statistics of the data used to train these models, visit http://www.gtexportal.org/home/tissueSummaryPage.
New Tissue Models
In addition to the 40 tissue models we are updating, we are also releasing 4 new tissue models for Prostate, Uterus, Vagina, and Adipose – Visceral (Omentum).
V6p Impact Upon Results
We ran PrediXcan with the updated models on GEUVADIS data and found the predicted transcriptomes to be mostly positively correlated with those predicted by the old models. There are some outlier genes with negative correlations between the old and new predictions. Box plots of the correlations between the new and old prediction models can be seen below:
New Database Schema
We have modified the schema of the SQLite databases to include more information about the training of the models, as well as some additional statistics. The updated tables are as follows:
extra – holds info about each linear model for predicting the transcriptome in the tissue. The column names with descriptions are listed here:
- gene – The Ensembl ID of the gene
- genename – The gene’s HUGO symbol
- pred.perf.R2 – The cross-validated R2 value found when training the model.
- n.snps.in.model – The number of cis-SNPs used to predict the expression level of the gene
- pred.perf.pval – The p-value of the correlation between cross-validated prediction and observed expression
- pred.perf.qval – The q-value obtained when analyzing the initial distribution of p-values. The models in these databases have been filtered to only include results that are significant at an FDR of less than 5%.
weights – the weights for the SNPs in the linear models. The column names with descriptions are listed here:
- rsid – The rsid of the SNP from dbSNP build 142
- gene – The Ensembl ID of the gene whose expression the SNP weight helps predict
- weight – The weight value for the SNP in the model
- ref_allele – The other (non-effect, non-dosage) allele of the SNP
- eff_allele – The effect (dosage) allele of the SNP
sample_info – Has only one column (n.samples) and one value, which is the number of samples used to train the model.
construction – Contains information from the training of the models. Primarily included for reproducibility purposes.
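As a rough illustration of how these tables fit together, here is a minimal Python sketch using the standard-library sqlite3 module. It builds a toy in-memory database rather than opening a real downloaded model file. The column types, gene IDs, rsids, weights, and helper function names are all hypothetical; only the table and column names come from the descriptions above. The prediction helper assumes that predicted expression is the sum of each SNP's weight times the dosage of its effect allele, which is what the linear-model description suggests, and it treats SNPs missing from the genotype data as dosage 0 (an assumption of this sketch). Column names containing dots must be double-quoted in SQL.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory database mirroring the tables described above.

    All values are made up for illustration; column types are assumed.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE extra (
            gene TEXT, genename TEXT, "pred.perf.R2" REAL,
            "n.snps.in.model" INTEGER, "pred.perf.pval" REAL,
            "pred.perf.qval" REAL
        );
        CREATE TABLE weights (
            rsid TEXT, gene TEXT, weight REAL,
            ref_allele TEXT, eff_allele TEXT
        );
        CREATE TABLE sample_info ("n.samples" INTEGER);
    """)
    conn.execute(
        "INSERT INTO extra VALUES ('ENSG_TOY_1', 'TOY1', 0.25, 2, 1e-6, 1e-4)"
    )
    conn.executemany(
        "INSERT INTO weights VALUES (?, ?, ?, ?, ?)",
        [("rs_toy_1", "ENSG_TOY_1", 0.5, "A", "G"),
         ("rs_toy_2", "ENSG_TOY_1", -0.25, "C", "T")],
    )
    conn.execute("INSERT INTO sample_info VALUES (100)")
    return conn


def gene_summary(conn, gene):
    """Return (genename, cross-validated R2, SNP count) for one gene."""
    return conn.execute(
        'SELECT genename, "pred.perf.R2", "n.snps.in.model" '
        "FROM extra WHERE gene = ?",
        (gene,),
    ).fetchone()


def predict_expression(conn, gene, dosages):
    """Weighted sum of effect-allele dosages for one gene.

    `dosages` maps rsid -> effect-allele dosage (0 to 2); SNPs absent
    from `dosages` contribute 0 (an assumption of this sketch).
    """
    rows = conn.execute(
        "SELECT rsid, weight FROM weights WHERE gene = ?", (gene,)
    ).fetchall()
    return sum(weight * dosages.get(rsid, 0.0) for rsid, weight in rows)


if __name__ == "__main__":
    conn = make_toy_db()
    print(gene_summary(conn, "ENSG_TOY_1"))
    print(predict_expression(conn, "ENSG_TOY_1",
                             {"rs_toy_1": 2, "rs_toy_2": 1}))
```

To try the same queries on a real model, replace make_toy_db() with sqlite3.connect() on the path of a downloaded database file.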
Getting the New Models
Visit predictdb.hakyimlab.org to download the new models. There you can search by tissue to find the database you would like to use in your analysis. We hope these updates will prove beneficial to your research!