Idefix supplementary
The method described here aims to identify sample mix-ups in biobanks using polygenic scores (PGSs). Sample mix-ups frequently occur in genomic datasets generated in a research setting (Westra et al., 2011). This tool takes advantage of the relationship between PGSs and actual phenotypes to predict which samples are likely to be mixed up. Details of the method are described in a yet-to-be-published article. The GitHub repository can be cloned from here.
The repository includes a couple of main scripts, as well as a number of scripts that can be used for diagnostic purposes.
- Main scripts (./src):
  - install-packages.R: Installs required packages.
  - sample-swap-prediction.R: Main sample mix-up prediction script.
- Helper scripts (./src):
  - generate-sample-couplings.R: Introduces fake sample mix-ups for ROC calculations.
  - sum-plink-profiles.R: Sums polygenic scores from PLINK over chromosomes. (More in input - polygenic scores).
- Diagnostic scripts (./src/diagnostic-scripts):
  - plot-roc-figures.R: Plots ROC curves for the sex concordance check, the PGS-based sample swap prediction, and the two combined. The corresponding confusion matrices are plotted as well.
  - polygenic-score-power-calculation.R, compare-runs.R, plot-pgs-predictive-power.R and plot-intermediate-figures.R all contain functionality for plotting additional intermediate results.
- Lifelines specific files (./src/lifelines, ./data/lifelines): Files specific to Lifelines phenotype processing. The files in ./data/lifelines can be used as a reference for generating files specific to other studies.
- Scripts for simulations (./src/lifelines/simulations):
  - simulate-data.R: A script that is used to simulate data from the Lifelines dataset.
  - compare-simulations.R: A script that is used to compare results for simulated datasets.
To estimate the performance of Idéfix,
we require a dataset in which it is known which samples are mix-ups and which samples are correct.
This can be achieved by introducing fake mix-ups into a dataset.
To get a reliable performance estimate, a considerable number of mix-ups has to be introduced.
However, since Idéfix expects that the majority of the sample mappings are correct,
introducing a large proportion of sample mix-ups might lead to performance being underestimated.
Therefore, we suggest creating separate training and testing datasets. This can be done by using the
generate-sample-couplings.R and sample-swap-prediction.R scripts. For ease of use,
the --split-prediction option in sample-swap-prediction.R can also be used. Here, the
steps for the original method are shown.
1. Sample half of the available samples from the entire study using the ./generate-sample-couplings.R script in conjunction with the --sample-count option. Do not introduce fake mix-ups in this step.
2. Perform the sample mix-up prediction (sample-swap-prediction.R) using the sample coupling file obtained in step 1. Write the fitted models to a directory of choice using the --base-fit-model-path option.
3. Get the remaining samples from the study (./generate-sample-couplings.R). Use the --sample-coupling-file-exclude option to exclude the first half of the study from step 1, and introduce 50% mix-ups.
4. Perform the sample mix-up prediction (sample-swap-prediction.R) using the sample coupling file obtained in step 3 and the fitted models from step 2. The output file overallOutputStatistics.tsv contains an accurate AUC for the sample mix-up predictions.

The ROC curve can be compared with the sex concordance check and the combined predictive power using the ./diagnostic-scripts/plot-roc-figures.R script.
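As an illustration of what the AUC in step 4 measures, the following Python sketch computes a rank-based AUC from per-sample scores and the known fake mix-up labels. It is not the code Idéfix uses: the function name auc, the score orientation (a higher score means a more likely mix-up) and the toy values are assumptions made for this example, and no particular layout of overallOutputStatistics.tsv is assumed.

```python
def auc(scores, is_mix_up):
    """Rank-based AUC: the probability that a randomly chosen mix-up gets a
    higher score than a randomly chosen correct sample (ties count as half).

    Assumption for this sketch: higher score = more likely to be a mix-up.
    """
    mixed = [s for s, m in zip(scores, is_mix_up) if m]
    correct = [s for s, m in zip(scores, is_mix_up) if not m]
    if not mixed or not correct:
        raise ValueError("need at least one mix-up and one correct sample")

    wins = 0.0
    for m in mixed:
        for c in correct:
            if m > c:
                wins += 1.0
            elif m == c:
                wins += 0.5
    return wins / (len(mixed) * len(correct))


# Toy example with made-up scores: two known mix-ups, two correct samples.
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [True, True, False, False]
toy_auc = auc(toy_scores, toy_labels)
print(toy_auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why the labels of the introduced mix-ups in the testing half have to be known.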
./generate-sample-couplings.R is a script that introduces a number of fake mix-ups, downsamples the samples, or both.
This can be helpful when you are trying to obtain a reliable ROC curve. The script can be used as follows:
    usage: ./generate-sample-couplings.R [-h]
                                         [--mix-up-percentage MIX_UP_PERCENTAGE]
                                         [--sample-count SAMPLE_COUNT] --out OUT
                                         [--sample-coupling-file-exclude SAMPLE_COUPLING_FILE_EXCLUDE]
                                         (--sample-coupling-file-include SAMPLE_COUPLING_FILE_INCLUDE
                                         | --phenotypes-file PHENOTYPES_FILE)

    optional arguments:
      -h, --help            show this help message and exit
      --mix-up-percentage MIX_UP_PERCENTAGE
                            introduce mix-ups in link file and phenotype sample ids
                            in the second column
      --sample-count SAMPLE_COUNT
                            number of samples to include in the coupling file
      --out OUT             path to output prefix
      --sample-coupling-file-exclude SAMPLE_COUPLING_FILE_EXCLUDE
                            file containing genotype sample ids in the first
                            column, and phenotype sample ids in the second
                            column. the samples in the genotype column will be
                            excluded.
      --sample-coupling-file-include SAMPLE_COUPLING_FILE_INCLUDE
                            file containing genotype sample ids in the first
                            column and phenotype sample ids in the second
                            column. these samples will be used as a starting point.
      --phenotypes-file PHENOTYPES_FILE
                            path to a tab-delimited file holding all processed
                            phenotype data.
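To make the effect of these options concrete, here is a minimal Python sketch of downsampling, excluding samples, and introducing fake mix-ups in a genotype-to-phenotype coupling. It is an illustration under stated assumptions, not the logic of generate-sample-couplings.R: the function make_couplings, its parameter names, the interpretation of the mix-up value as a percentage between 0 and 100, and the rotation used to reassign phenotype ids are all hypothetical.

```python
import random


def make_couplings(sample_ids, mix_up_percentage=0.0, sample_count=None,
                   exclude=(), seed=0):
    """Return (couplings, swapped).

    couplings: list of (genotype_id, phenotype_id) pairs.
    swapped:   set of genotype ids whose phenotype id was deliberately changed.
    """
    rng = random.Random(seed)
    excluded = set(exclude)
    pool = [s for s in sample_ids if s not in excluded]

    # Optional downsampling (hypothetical analogue of --sample-count).
    if sample_count is not None:
        pool = rng.sample(pool, sample_count)

    # Number of samples to mix up (assumed to be a percentage of the pool).
    n_swap = int(round(len(pool) * mix_up_percentage / 100))
    if n_swap == 1:
        n_swap = 0  # a single sample cannot be exchanged with anything

    # Pick samples in random order and rotate their phenotype ids by one
    # position, so every chosen sample ends up with someone else's phenotype.
    chosen = rng.sample(pool, n_swap)
    rotated = chosen[1:] + chosen[:1]
    wrong = dict(zip(chosen, rotated))

    couplings = [(g, wrong.get(g, g)) for g in pool]
    return couplings, set(chosen)


# Made-up sample ids, split as in the training/testing procedure:
ids = [f"S{i:03d}" for i in range(100)]
train, _ = make_couplings(ids, sample_count=50, seed=1)
test, swapped = make_couplings(ids, mix_up_percentage=50,
                               exclude=[g for g, _ in train], seed=2)
print(len(train), len(test), len(swapped))
```

Keeping the set of deliberately swapped samples is what later allows a ROC curve to be computed, since it provides the ground truth for each prediction.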
More ROC curves can be plotted using the ./src/diagnostic-scripts/plot-roc-figures.R script.
This includes ROC curves for the sex check as well as a combined ROC curve. Confusion matrices are also visualized.
The script can be executed as described below:
    usage: ./diagnostic-scripts/plot-roc-figures.R [-h] --dir DIR
                                                   --phenotypes-file
                                                   PHENOTYPES_FILE

    optional arguments:
      -h, --help            show this help message and exit
      --dir DIR             path from where to read sample swap prediction
                            results.
      --phenotypes-file PHENOTYPES_FILE
                            path to a tab-delimited file holding all processed
                            phenotype data.
We simulated data using the ./src/lifelines/simulations/simulate-data.R script.
This script is tailored to the Lifelines data. It generates datasets with an explained variance
of 50%, 100%, 150%, and 200% of the original explained variance, and with 10, 25, 50, 75 and 100 traits.
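The scenarios listed above can be enumerated with a short Python sketch. It assumes that every explained-variance setting is combined with every trait count, which the text does not state explicitly; the names VARIANCE_MULTIPLIERS, TRAIT_COUNTS and simulation_grid are hypothetical and do not come from simulate-data.R.

```python
from itertools import product

# Explained variance relative to the original (50%, 100%, 150%, 200%).
VARIANCE_MULTIPLIERS = [0.5, 1.0, 1.5, 2.0]
# Number of traits included in each simulated dataset.
TRAIT_COUNTS = [10, 25, 50, 75, 100]


def simulation_grid():
    """List every (variance multiplier, trait count) scenario.

    Assumption: the two settings are fully crossed.
    """
    return [
        {"variance_multiplier": v, "n_traits": n}
        for v, n in product(VARIANCE_MULTIPLIERS, TRAIT_COUNTS)
    ]


grid = simulation_grid()
print(len(grid))
```

Under the full-cross assumption this gives 20 scenarios, each of which would correspond to one simulated dataset.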