
Methylation mixupmapping

Marc Jan Bonder edited this page Feb 13, 2015 · 3 revisions

This is a focused cookbook for performing methylation mix-up mapping. It illustrates only part of the capabilities of our software. See the other parts of the wiki for other options.

##Manual contents

  1. Downloading the software and reference data
  2. Before you start
  3. Step by step meQTL analysis

##Downloading the software and reference data

You will need four programs to perform the QTL analysis.

##Before you start

In this manual we assume that you have basic knowledge of how to work with the Java Virtual Machine (VM) and the command line. If you are not sure about working with the command line, please have a look at our introduction page and/or check the information about the Java VM (including the Java memory options). If you run into heap space or out-of-memory errors, please supply the -Xmx and -Xms options; see the Java VM information. For each java command we supply the recommended settings.

##Definitions

Throughout the manual, references to different files and folders will be made. Here is an overview of these items:

  • The methylation file will be referred to as traitfile
  • The methylation annotation file will be referred to as annotationfile
  • The full path of your genotype data will be referred to as genotypedir
  • The file linking methylation id to genotype id will be referred to as genotypemethylationcoupling
  • The file containing covariates will be defined as covariatefile

Descriptions of these files and their usage are given below. Their formats are described in the data formats section of the general QTL mapping documentation.

##Step 1 - Preparation of methylation data

As our software uses a nonparametric test by default, you can use virtually any continuous data as trait values to map a variety of QTL effects. However, currently the normalization tools provided with this package are focused on array based (Illumina) expression data, preprocessed RNA-seq data (e.g. transcript level quantified data), (GC)-RMA processed Affymetrix data and include several steps for the 450K methylation array. Therefore we first need to perform a part of the normalization in a separate program.

Initial processing of the idat files into a methylation matrix can be done using many different methods. The choice of method is yours; we recommend DASEN. Information on our default processing can be found here.

Before mix-up mapping we need to normalize the data. For the methylation data we will perform the following steps:

  1. Quantile normalization
  2. Probe centering and scaling (Z-transformation): MethylationProbe,Sample = (MethylationProbe,Sample - MeanProbe) / Std.Dev.Probe

Run these steps using the following command:

java -Xmx15g -Xms15g -jar eqtl-mapping-pipeline.jar --mode normalize --in traitfile --out outdir --qqnorm --centerscale

After these steps a tab-separated gzipped plaintext file is created with an identical number of rows and columns as the input file. This means these files can subsequently be used during meQTL mapping.
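To make the two normalization steps concrete, here is a minimal Python sketch of quantile normalization followed by per-probe Z-transformation. It is an illustration under stated assumptions, not the pipeline's actual implementation: the function names and the toy matrix are hypothetical, ties are broken by row order, and the sample standard deviation is used for scaling.

```python
from statistics import mean, stdev


def quantile_normalize(matrix):
    """Quantile-normalize a probes x samples matrix given as a list of rows.

    The value with rank r in a sample (column) is replaced by the mean of
    the rank-r values over all samples. Ties are broken by row order,
    which is a simplification.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[c] for row in matrix] for c in range(n_cols)]
    # For each column: row indices ordered from smallest to largest value.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Reference distribution: mean across samples at each rank.
    rank_means = [
        mean(columns[c][orders[c][r]] for c in range(n_cols))
        for r in range(n_rows)
    ]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        for rank, row_idx in enumerate(orders[c]):
            result[row_idx][c] = rank_means[rank]
    return result


def z_transform_probes(matrix):
    """Center and scale each probe (row): (value - probe mean) / probe sd."""
    out = []
    for row in matrix:
        m, sd = mean(row), stdev(row)  # sample sd: an assumption of this sketch
        if sd == 0:
            out.append([0.0] * len(row))  # constant probe: nothing to scale
        else:
            out.append([(x - m) / sd for x in row])
    return out


# Hypothetical toy data: 3 probes (rows) x 4 samples (columns).
beta = [
    [0.1, 0.8, 0.5, 0.2],
    [0.5, 0.2, 0.7, 0.6],
    [0.9, 0.4, 0.3, 0.8],
]
normalized = z_transform_probes(quantile_normalize(beta))
```

After quantile normalization every sample holds the same set of values (in a different order), and after the Z-transformation every probe has mean 0 and standard deviation 1.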

###Check your data

Running the general normalization procedure yields a number of files in the directory of your traitfile, or in the outdir if you specified one. The default procedure will generate file suffixes listed below. Suffixes will be appended in the default order as described above. Selecting multiple normalization methods will add multiple suffixes. Please do not remove or replace ANY intermediate files as they may be used in subsequent steps.

| Suffix | Description |
| --- | --- |
| QuantileNormalized | Quantile Normalized trait data. |
| ProbesCentered | Probes were centered. |
| SamplesZTransformed | Samples were Z-transformed. |

##Step 2 - Preparation of genotype data

Our software is able to use both unimputed called genotypes, as well as imputed genotypes and their dosage values. The software, however, requires these files to be in TriTyper format. We provide Genotype Harmonizer to harmonize and convert your genotype files.

###Convert data to the TriTyper file format

For the QTL mapping the data needs to be in TriTyper format. Using Genotype Harmonizer you can harmonize and convert genotype formats. You can run this step either for each chromosome separately or for all chromosomes at once.

Important: when you are using imputed genotypes, make sure that the input format you use contains the probabilities or dosages and not just genotype calls. In case of doubt please contact bonder.m.j @

📌 Note: please make sure to use version 1.4.9 of the Genotype Harmonizer.

java -Xmx10g -Xms10g -jar ./GenotypeHarmonizer.jar -i {locationOfInputData} -I {InputType} -o {Outputlocation} -O TRITYPER -cf 0.95 -hf 0.0001 -mf 0.01 -mrf 0.5 -ip 0

Input arguments: -i location of the input data, -I input type, -o output location, -O output type TRITYPER. See the Genotype Harmonization manual for further details on the input flags.

✅ Please upload your log files to the FTP server.

If you ran the Genotype Harmonizer per chromosome, you now need to merge the per-chromosome TriTyper folders into one TriTyper folder containing data from all chromosomes. The following command merges the individual TriTyper folders.

java -Xmx2g -Xms2g -jar ./eqtl-mapping-pipeline.jar --imputationtool --mode concat --in {folder1;folder2;folder3;ETC} --out {OutputFolder}

Input arguments: --in is a semicolon-separated list of input TriTyper folders with information per chromosome, --out is the output location of the merged TriTyper data. Note: in case of problems, try putting quotes (") around your list of input folders, since an unquoted semicolon is treated by most shells as a command separator.

Great! You now have your genotype and methylation data ready to go!

#File Checklist

Before continuing, now is a good time to check whether your files are in the correct format and whether you have all the required files ready. Please check the following:

  • All required TriTyper files are in your genotypedir.
  • The methylation files are formatted properly.
  • You have an annotationfile.
  • The identifiers of samples are identical in your methylation and genotype files. If not, you should make a file to link the proper samples. This file is called a genotypemethylationcoupling file; the format is described here: Genotype - phenotype coupling. This file is not necessary if the names are already matched.
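The last check can be scripted. The sketch below compares two lists of sample identifiers and formats coupling lines for pairs that need to be linked by hand. The function names and sample IDs are hypothetical, and the two-column tab-separated layout (genotype ID, then methylation ID) is an assumption of this sketch; please verify it against the Genotype - phenotype coupling format description before use.

```python
def compare_sample_ids(genotype_ids, methylation_ids):
    """Report which sample identifiers are shared and which occur on one side only."""
    geno, meth = set(genotype_ids), set(methylation_ids)
    return {
        "shared": sorted(geno & meth),
        "genotype_only": sorted(geno - meth),
        "methylation_only": sorted(meth - geno),
    }


def coupling_lines(pairs):
    """Format (genotype_id, methylation_id) pairs as tab-separated lines.

    The column order is an assumption of this sketch, not a confirmed format.
    """
    return [f"{genotype_id}\t{methylation_id}" for genotype_id, methylation_id in pairs]


# Hypothetical identifiers: the same individuals under different naming schemes.
genotype_ids = ["IND001", "IND002", "IND003"]
methylation_ids = ["IND001", "meth_IND002", "meth_IND003"]

report = compare_sample_ids(genotype_ids, methylation_ids)
manual_links = [
    ("IND001", "IND001"),
    ("IND002", "meth_IND002"),
    ("IND003", "meth_IND003"),
]
lines = coupling_lines(manual_links)
```

If `genotype_only` and `methylation_only` are both empty, the names already match and no coupling file is needed.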

##Step 3 - MixupMapper

We have shown in a paper published in Bioinformatics (Westra et al.: MixupMapper: correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects) that sample mix-ups often occur in genetical genomics datasets (i.e. datasets with both genotype and methylation data). Therefore, we developed a method called MixupMapper, which is implemented in the QTL Mapping Pipeline. This program performs the following steps:

  1. First, a cis-meQTL analysis is conducted on the dataset:
    • using a 250 kb window between the SNP and the mid-probe position
    • performing 10 permutations to control the false discovery rate (FDR) at 0.05
  2. Next, the program calculates how well the methylation data matches the genotype data. For details on exactly how this works, please have a look at the paper.


Before running MixupMapper, collect the following:

  1. Note down the full path to your TriTyper genotype data directory. We will refer to this directory as genotypedir.
  2. Determine the full path of the methylation data produced in Step 1, the file ending with "SamplesZTransformed.txt.gz". We will refer to this path as traitfile.
  3. Locate your annotationfile. Also note down the platformidentifier, which will be GPL13534 for the methylation 450K array.
  4. (Optional) Locate your genotypemethylationcoupling if you have such a file. You can also use this file to test specific combinations of genotype and phenotype individuals.
  5. Find a location on your hard drive to store the output. We will refer to this directory as outdir.

###Commands to be issued

The MixupMapper analysis can be run using the following command:

java -Xmx15g -Xms15g -jar eqtl-mapping-pipeline.jar --mode mixupmapper --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform GPL13534 --inexpannot {annotationfile} --testall (--gte {genotypephenotypecoupling}) 2>&1 | tee {outdir}/mixupmapping.log

If your genotype and/or methylation data include more samples than those that are matched based on sample names, it is beneficial to test all possible combinations. This can be done with the --testall command line switch.

If you are running the software in a cluster environment, you can specify the number of threads to use (nrthreads) by appending the command line switch --threads nrthreads to the command above (nrthreads should be an integer).

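Note that the --gte, --threads and --snps flags are optional. The --gte flag only applies if you are using a genotypemethylationcoupling file (called genotypephenotypecoupling in the command above). The parentheses around --gte in the command only mark it as optional and should not be typed.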
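When the command is launched from a script, the optional switches can be added conditionally. The following Python sketch assembles the argument list using only the flags shown in this section; the function name, its parameters and the example paths are hypothetical, and the memory settings simply copy the recommended values from the command above.

```python
import shlex


def build_mixupmapper_command(genotypedir, traitfile, annotationfile, outdir,
                              coupling=None, threads=None, test_all=True,
                              jar="eqtl-mapping-pipeline.jar"):
    """Assemble the MixupMapper call as an argument list (no shell quoting needed)."""
    cmd = [
        "java", "-Xmx15g", "-Xms15g", "-jar", jar,
        "--mode", "mixupmapper",
        "--in", genotypedir,
        "--out", outdir,
        "--inexp", traitfile,
        "--inexpplatform", "GPL13534",
        "--inexpannot", annotationfile,
    ]
    if test_all:
        cmd.append("--testall")
    if coupling is not None:
        cmd += ["--gte", coupling]  # only when a coupling file is used
    if threads is not None:
        cmd += ["--threads", str(int(threads))]  # must be an integer
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_mixupmapper_command(
    genotypedir="/data/trityper",
    traitfile="/data/meth.SamplesZTransformed.txt.gz",
    annotationfile="/data/annotation.txt",
    outdir="/data/mixup_out",
    threads=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)`. In that case the log redirection from the command above (`2>&1 | tee ...`) has to be handled separately, since it is shell syntax and not part of the argument list.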

###Check your data

MixupMapper is a two-stage approach. As such, the default procedure creates two directories in the outdir you specified: cis-meQTLs and MixupMapping. Each folder contains a different set of output files, described below.

####cis-meQTLs directory

This directory contains output from a default cis-meQTL mapping approach. We are not interested in this folder itself; its files are only needed to create the actual mix-up mapping output, which can be found in the MixupMapping directory.

####MixupMapping directory

| File | Description |
| --- | --- |
| Heatmap.pdf | Visualization of overall Z-scores per assessed pair of samples. The genotyped samples are plotted on the X-axis, and the methylation samples are plotted on the Y-axis. The brightness of each box corresponds to the height of the overall Z-score, with lower values having brighter colours. Samples are sorted alphabetically on both axes. |
| BestMatchPerGenotype.txt | This file shows the best matching trait samples per genotype: the result matrix (MixupMapperScores.txt) is not symmetrical. As such, scanning for the best sample per genotype may yield other results than scanning for the best sample per trait. |
| BestMatchPerTrait.txt | This file shows the best matching genotype sample per trait sample. |
| MixupMapperScores.txt | A matrix showing the scores per pair of samples (combinations of traits and genotypes). |
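The asymmetry between the two BestMatch files can be illustrated with a small Python sketch. The score matrix below is hypothetical, the function names are illustrative, and the sketch assumes that a lower (more negative) score indicates a better match, as the example output further down suggests. It does not read the actual MixupMapperScores.txt layout.

```python
# Hypothetical score matrix: scores[trait_sample][genotype_sample].
scores = {
    "T1": {"G1": -11.4, "G2": -1.0, "G3": -0.5},
    "T2": {"G1": -0.8, "G2": -5.0, "G3": -4.0},
    "T3": {"G1": -0.3, "G2": -9.0, "G3": -3.5},
}


def best_genotype_per_trait(scores):
    """For each trait sample, pick the genotype sample with the lowest score."""
    return {trait: min(row, key=row.get) for trait, row in scores.items()}


def best_trait_per_genotype(scores):
    """For each genotype sample, pick the trait sample with the lowest score."""
    genotypes = list(next(iter(scores.values())))
    return {g: min(scores, key=lambda trait: scores[trait][g]) for g in genotypes}


per_trait = best_genotype_per_trait(scores)
per_genotype = best_trait_per_genotype(scores)
```

In this toy matrix the best genotype for T2 is G2, yet the best trait for G2 is T3. Scanning by row and scanning by column therefore give different answers, which is why both files are written.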

In the BestMatchPerGenotype.txt and BestMatchPerTrait.txt files, you can find the best matching trait sample for each genotyped sample and vice versa:

  • 1st column = genotyped sample ID, or trait sample ID dependent on file chosen (see above)
  • 2nd column = trait sample originally linked to genotype sample ID in column 1, or genotype sample originally linked with trait sample in column 1
  • 3rd column = the MixupMapper Z-score for the link between the samples in column 1 and 2
  • 4th column = best matching trait (for BestMatchPerGenotype.txt) or best matching genotype (for BestMatchPerTrait.txt)
  • 5th column = the MixupMapper Z-score for the link between the samples in column 1 and 4
  • 6th column = this column determines whether the best matching trait or genotype is identical to the sample found in column 2

Example of BestMatchPerGenotype.txt

Trait   OriginalLinkedGenotype  OriginalLinkedGenotypeScore     BestMatchingGenotype    BestMatchingGenotypeScore       Mixup
Sample1		Sample1		-11.357			Sample1		-11.357	false
Sample2		Sample2		-15.232			Sample2		-15.232	false
Sample3		Sample3		-3.774			Sample4		-6.341	true
Sample4		Sample4		-3.892			Sample3		-12.263	true

In the example, two mix-ups are identified. If you don't have any mix-ups you are done; otherwise you need to resolve or remove these mix-ups. Please check this separate document for instructions on fixing mix-ups.
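For larger datasets it can help to list the flagged rows programmatically. The Python sketch below reads a tab-separated best-match file with the six columns described above and returns the rows whose last column is true. The function name is hypothetical, and the sketch tolerates repeated tabs between columns as a precaution.

```python
import csv
import io


def find_mixups(handle):
    """Yield (sample, originally linked sample, best match) for flagged rows."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    for row in reader:
        fields = [field for field in row if field != ""]  # tolerate repeated tabs
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        if fields[5].strip().lower() == "true":
            yield fields[0], fields[1], fields[3]


# The example rows from above, reproduced as an in-memory file.
example = "\n".join([
    "Trait\tOriginalLinkedGenotype\tOriginalLinkedGenotypeScore\t"
    "BestMatchingGenotype\tBestMatchingGenotypeScore\tMixup",
    "Sample1\tSample1\t-11.357\tSample1\t-11.357\tfalse",
    "Sample2\tSample2\t-15.232\tSample2\t-15.232\tfalse",
    "Sample3\tSample3\t-3.774\tSample4\t-6.341\ttrue",
    "Sample4\tSample4\t-3.892\tSample3\t-12.263\ttrue",
])
mixups = list(find_mixups(io.StringIO(example)))
```

For a file on disk, replace the in-memory example with `open(path, newline="")`, where `path` points to one of the two BestMatch files in the MixupMapping directory.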

Always rerun MixupMapper after removing samples or changing sample IDs, and check that you actually fixed or removed the mix-ups instead of creating more!
