Skip to content
Robert Warmerdam edited this page Aug 2, 2021 · 10 revisions

Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores

The method described here aims to identify sample mix-ups in biobanks using polygenic scores (PGS). Sample mix-ups frequently occur in genetic genomic datasets generated in a research setting (Westra et al, 2011). This novel tool takes advantage of the relationship between PGSs and actual phenotypes to predict which samples are erroneous. Details of this new method are described in our pre-print: Warmerdam & Lanting et al, 2021. The GitHub repository can be cloned from here.


  1. Method description
  2. Repository overview
  3. Input
  1. Usage
  1. Output
  2. References

PGS-based sample mix-up identification method

Our method for identifying mix-ups is based on the expected correspondence of the genetic propensity for a particular phenotype and the actual phenotype. Herein the genetic propensity for a particular phenotype is represented by polygenic scores. When this sample mix-up method is performed for enough traits, the predictive power (measured by the area under the ROC, for 25 traits) exceeds that of the regular used check for the correspondence of reported sex and inferred sex (~ 0.8 vs ~ 0.75 respectively). Our method works by:

  • Modelling the relationships between phenotypes and polygenic scores,
  • calculating the residuals of the provided samples and their permutations,
  • employing a likelihood model fitted on the residuals for provided and permuted samples,
  • combining log likelihood ratios over traits.
  • estimating the mix-up rate in the dataset to apply a threshold, making the remaining set of samples suitable for the intended application.

Supplementary scripts

Here, we will focus on how to use the main scripts to identify mix-ups in a dataset. Other scripts are described here


This tool is developed in the R programming language (tested with version 3.6.1). A number of R packages are used in the tool. These can be installed using the following steps:

  • Either clone this repository using the following git command:
git clone

or download the source code directly from Github.

  • Move directory to the package directory:
cd pgs_based_sample_mix-up_correction
  • Install the required packages by executing the ./src/install-packages.R R-script:
Rscript ./src/install-packages.R


Input requires polygenic scores and phenotypes for a range of traits that have well powered GWASs available like height, BMI and platelet concentration. Traits can be of continuous, ordinal or binary type.


The traits that are most effective for predicting mix-ups are those for which the most accurate polygenic scores can be calculated. Accurate polygenic scores can only be calculated for traits with high SNP-heritability, such as body height or platelet concentration. The heritability of traits in the UK biobank is maintained in the UKB SNP-Heritability browser. This resource also contains links to correlation data. Traits ideally are independently inherited. This can be assessed using these correlations.

There should also be a well-powered GWAS with summary statistics available for the trait in question. While the previous resource also lists summary statistics estimated in the UK biobank, A list of available summary statistics is also maintained in, for instance, the GWAS Catalog.

The PGS Catalog lists published polygenic scores with performance estimates and how they are constructed.

Examples of informative traits include the following:

  • Standing height
  • Body mass index
  • Hand grip strength
  • Basal metabolic rate
  • Mean platelet (thrombocyte) volume
  • Corneal resistance factor
  • Heel bone mineral density (BMD)
  • IGF-1 (nmol/L)
  • Monocyte count
  • Reticulocyte count
  • Medication for cholesterol, blood pressure or diabetes: Blood pressure medication
  • Eosinophill percentage
  • Creatinine (umol/L)
  • Glycated haemoglobin (mmol/mol)
  • Lymphocyte count
  • Total protein (g/L)
  • Neutrophill count
  • Coronary atherosclerosis
  • Smoking status
  • Red hair colour

Polygenic scores

The core of the tool expects polygenic scores in a tab separated files with sample identifiers in the first column, and polygenic scores in the second column. The first line is assumed to be a header:

S01     0.042
S02     1.345
...     ...

Over recent years a wide range of tools have been published for calculating polygenic scores like LDpred (Vilhjálmsson et al., 2015) and PRScs (Ge et al., 2019). The latter relies on the scoring functionality of PLINK (--score) to do the final calculation.

The ./src/sum-plink-profiles.R script has been made to sum the output over all chromosomes for all samples when having performed PLINK's allelic scoring method on a per chromosome basis.

usage: ./src/sum-plink-profiles.R [-h] --plink-profiles PLINK_PROFILES [PLINK_PROFILES ...] --out
             OUT --scoresum-column SCORESUM_COLUMN

optional arguments:
-h, --help            show this help message and exit
--plink-profiles PLINK_PROFILES [PLINK_PROFILES ...]
                      path to plink --score output files
--out OUT             output path of summed profiles
--scoresum-column SCORESUM_COLUMN
                      columnname containing summable polygenic scores

The PLINK2 --score command requires additional option cols=+scoresums to obtain a column that can be summed between chromosomes.


For optimal performance polygenic scores should be able to explain a large part of the variance in phenotypes. Therefore, phenotypes should ideally be processed in similar fashion to the phenotypes used in their corresponding GWAS. In this light our method also incorporates sex and age of samples in modelling the relationship between polygenic scores and actual phenotypes.

The input with regards to phenotypes should be a tab-separated file with 5 required columns (each having a fixed column name):

ID      AGE     SEX     VALUE       TRAIT
S01     34      Male    176         Height
S02     45      Female  165         Height
S89     19      Male    19          EduYears
S99     52      Female  22          EduYears

Herein every row corresponds to the phenotype of a sample for a particular trait, combined with covariates corresponding to the individual and the time of the measurement. Each of the columns represent the following:

  • ID: Sample identifier column
  • AGE: Age at the time of measurement
  • SEX: Biological sex of the sample. Either Female or Male
  • VALUE: The phenotype of the sample for the trait in question
  • TRAIT: The trait corresponding to this row

Trait-GWAS mapping and expected file structure

A tab-separated file is required that links the polygenic score input to the phenotypes. This file should additionally include the data type of the phenotypes, which affects the type of model that used for modelling the relationship between the polygenic scores and phenotypes. The supported data types are continuous, ordinal and binary. We apply linear regression, ordinal logistic regression and regular logistic regression respectively.

The file should look like the following:

trait               traitDataType   summaryStatistics
Height              continuous      Meta-analysis_Wood_et_al+UKBiobank_2018.processed
BMI                 continuous      Meta-analysis_Locke_et_al+UKBiobank_2018.processed
Blondeness of hair  ordinal         ukb-d-1747_1.processed
Red hair colour     binary          ukb-d-1747_2.processed

Herein every row corresponds to a single trait, and the column names must match. The columns represent the following:

  • trait: Name of a trait; Should correspond to the TRAIT column in the phenotype input file
  • traitDataType: Data type of the trait in question; continuous, ordinal and binary
  • summaryStatistics: Name of the folder containing polygenic scores for the trait in question.

The path to the actual polygenic scores is compiled of a base path, the folder name corresponding to a particular trait, and the end of the path directing to a particular file present in all of the trait-specific folders. The base path (--base-pgs-path) and end of the path (--pgs-file-name) are provided as an option to the method. (detailed use in sample swap prediction).

Sample coupling file

In the case sample identifiers are of different nature for the phenotypes compared to the polygenic scores, a sample coupling file can be supplied with the genotype / PGS sample ids in the first column and the phenotype sample ids in the second column.

Sample swap prediction usage

Options for the sample swap prediction script are listed below:

usage: src/sample-swap-prediction.R [-h] [--debug]
                                    [--base-fit-model-path BASE_FIT_MODEL_PATH]
                                    --trait-gwas-mapping TRAIT_GWAS_MAPPING
                                    (--base-pgs-path BASE_PGS_PATH | --pgs-file PGS_FILE)
                                    [--pgs-file-name PGS_FILE_NAME]
                                    [--no-adults-only] --phenotypes-file
                                    [--sample-coupling-file SAMPLE_COUPLING_FILE]
                                    [--split-prediction [SPLIT_PREDICTION]]
                                    [--llr-bayes-method LLR_BAYES_METHOD [LLR_BAYES_METHOD ...]]
                                    (--sex-check-table SEX_CHECK_TABLE | --calc-p-expected-mixUps CALC_P_EXPECTED_MIXUPS CALC_P_EXPECTED_MIXUPS CALC_P_EXPECTED_MIXUPS CALC_P_EXPECTED_MIXUPS | --p-expected-mixUps P_EXPECTED_MIXUPS)
                                    [--p-max-mixUps P_MAX_MIXUPS]
                                    [--bayes-method-sweep-mode] [--out OUT]
                                    [--likelihood-ratio-alpha LIKELIHOOD_RATIO_ALPHA]

optional arguments:
  -h, --help            show this help message and exit
  --debug               write intermediate llr and residuals files.
  --base-fit-model-path BASE_FIT_MODEL_PATH
                        path to a directory to write fitted model parameters
                        to, or to load fitted models from.
  --trait-gwas-mapping TRAIT_GWAS_MAPPING
                        path to a tab-delimited file that maps traits to the
                        polygenic scores.
  --base-pgs-path BASE_PGS_PATH
                        path to a directory containing polygenic scores.
                        Folder structure should be '<base-pgs-path>/<name-of-
  --pgs-file PGS_FILE   path to a tab-delimited file holding all polygenic
                        score data. Columns should represent sample IDs and
                        individual PGSs respectively.
  --pgs-file-name PGS_FILE_NAME
                        name of files that hold polygenic scores. ignored when
                        a combined table of polygenic scores is supplied
  --no-adults-only      Turns filtering step (AGE > 18) off
  --phenotypes-file PHENOTYPES_FILE
                        path to a tab-delimited file holding all processed
                        phenotype data
  --sample-coupling-file SAMPLE_COUPLING_FILE
                        file containing genotype sample ids in the first
                        column and phenotype sample ids in the second column
  --split-prediction [SPLIT_PREDICTION]
                        receiver operating characteristics using 50/50 split
  --llr-bayes-method LLR_BAYES_METHOD [LLR_BAYES_METHOD ...]
                        the naive bayes method to use for calculating log
                        likelihood ratios. If NA, the method will be based on
                        the data type for each trait <ewi-discretization |
                        efi-discretization | gaussian | NA> [(average) number
                        of samples per bin]
  --sex-check-table SEX_CHECK_TABLE
                        Tab-separated file containing genotype ids in the
                        first column genotype inferred sex in the second
                        column, and reported sex in the third column This will
                        be used to estimate the number of mix-ups present in
                        the data, and for a sex correspondence check. Use
                        --p-expected-mixUps or --calc-p-expected-mixUps if sex
                        correspondence check is already performed
                        Estimates the number of mix-ups in the cohort.
                        Number of samples that are identified in a sex-
                        correspondence check, The total number of mix-ups that
                        are removed, The total number of remaining samples and
                        the frequency of females or males.
  --p-expected-mixUps P_EXPECTED_MIXUPS
                        Proportion of expected mix-ups. range between 0 and 1
  --p-max-mixUps P_MAX_MIXUPS
                        Maximum proportion of mix-ups. range between 0 and 1.
                        Default is 1e-04
                        special mode to perform a sweep over a series of naive
                        bayes methods
  --out OUT             path to output directory
  --likelihood-ratio-alpha LIKELIHOOD_RATIO_ALPHA
                        the t-test alpha to use for selecting traits based on
                        the difference in log likelihood ratios between the
                        provided and permuted samples
                        Setting this to false will prevent intermediate AUC
                        calculations, requiring less memory.


  • --base-fit-model-path can be used to fit parameters on a set of samples which are expected to be correct. The fitted parameters can subsequently be used on a set of samples with a large proportion of introduced fake mix-ups to get a better ROC estimate.
  • --llr-bayes-method: If this is NA, gaussian functions will be fitted to residuals for continuous traits, and an equal width discretization technique (ewi-discretization) will be used for ordinal and binary traits.
  • --likelihood-ratio-alpha: This is 0.05 by default.
  • --bayes-method-sweep-mode: Aborts the method after traits are independently processed; no combined output.

Estimating mix-up rate

To allow Idéfix to calculate a mix-up rate and perform filtering based op a set maximum, the sex-check table should be present if a sex-check was not already performed. Otherwise the mix-up rate can be provided using --p-expected-mixUps. Idéfix can calculate the mix-up rate when arguments are supplied to --calc-p-expected-mixUps SWAPS_SEX_CHECK SWAPS_TOTAL N_REMAINING SEX_FREQUENCY. Herein,

  • SWAPS_SEX_CHECK denotes the number of samples for which the genotype inferred sex did not match the phenotype inferred sex (sex-check failed samples).
  • SWAPS_TOTAL denotes the total number of samples that are removed after sample mix-ups identificant methods (for instance both a sex-check and a pedigree check)
  • N_REMAINING denotes the total number of samples that remain after removal of sample mix-ups.
  • SEX_FREQUENCY denotes the frequency of males or females. The maximum mix-up rate can be given using --p-max-mixUps. This is set to 0.01% by default.


A number of output files are written to the directory supplied to --out. The main output is described below:

  • ./idefixPredictions.txt: tab-separated file with every row representing an input sample. The following columns are included:

    • Var1: the sample identifier from the phenotype input
    • Var2: the sample identifier from the PGS input.
    • logLikelihoodRatios: the raw log likelihood ratio of the sample being mixed-up.
    • scaledLlr: main prediction value; the scaled log likelihood ratio of the sample being mixed-up.
    • numberOfTraits: the number of traits that were available and used for aggregating log likelihood ratios over traits.
    • sensitivity: the sensitivity calculated from the provided and permuted sample mappings.
    • specificity: the specificity calculated from the provided and permuted sample mappings.

    When the sex-check is performed in idéfix, or when the mix-up rate is calculated or provided by other means. The following columns are additionally present:

    • nPass: for each sample the number of samples with a sensitivity larger or equal.
    • rPass: for each sample the updated mix-up rate in the set of samples with sensitivity larger or equal.
    • pass: TRUE when rPass meets the requirement of a maximum mix-up rate, FALSE when rPass is higher than the maximum set mix-up rate.
  • ./overallOutputStatistics.tsv: tab-separated file with a series of statistics gathered. Contains the following data:

    • matrixWideAuc: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples and all their permutations.
    • confinedAuc: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples only. Only available with fake mix-ups introduced.
  • ./outputStatisticsPerTrait.tsv: tab-separated file with a series of statistics on single traits. The following statistics are always included:

    • numberOfSamples: The number of samples for which the trait was available.
    • pValue: The pValue for a t-test testing the if the log likelihood ratios for permuted samples are greater that those of provided samples.

    Optional values when --output-intermediate-statistics has not been disabled:

    • matrixWideAucOnScaledLlr: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples and all their permutations.
    • confinedAucOnScaledLlr: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples only. Only available with fake mix-ups introduced.
  • ./aggregatedLogLikelihoodRatiosDataFrame.tsv: tab-separated file similar in nature to ./providedSampleDataFrame.tsv, but permuted samples are included.


  • Ge, Tian, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W. Smoller. “Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors.” Nature Communications 10, no. 1 (April 16, 2019): 1–10.

  • Vilhjálmsson, Bjarni J., Jian Yang, Hilary K. Finucane, Alexander Gusev, Sara Lindström, Stephan Ripke, Giulio Genovese, et al. “Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores.” American Journal of Human Genetics 97, no. 4 (October 1, 2015): 576–92.

  • Westra, Harm-Jan, Ritsert C. Jansen, Rudolf S. N. Fehrmann, Gerard J. te Meerman, David van Heel, Cisca Wijmenga, and Lude Franke. “MixupMapper: Correcting Sample Mix-Ups in Genome-Wide Datasets Increases Power to Detect Small Genetic Effects.” Bioinformatics 27, no. 15 (August 1, 2011): 2104–11.

Clone this wiki locally