Resolving mixups

Bonder-MJ edited this page Aug 6, 2014 · 1 revision

Resolving Sample mix-ups

As described above, the BestMatchPerGenotype.txt file lists possible sample mix-ups in your data. Resolving sample mix-ups is, however, a bit of a puzzle: many things could have happened during hybridization to the genotype and gene expression chips. For example, samples could have been duplicated (hybridized to the array twice), contaminated, swapped, or not hybridized at all. The MixupMapper tries to resolve these issues automatically, but you should always check in your logbook whether the proposed sample mix-ups actually make sense (for example, are the mixed-up samples located on the same row or column of the chip, was a complete row inverted, or was the DNA quality poor). For each match, a z-score is presented which describes how alike the samples are. Interpret the z-score as a distance measure: the lower the z-score, the better the match.

Read the output of the program as follows: for every genotype sample, a single methylation sample is matched, namely the methylation sample with the lowest z-score for that genotype sample. The assessed genotype sample ID is in column 1 and the best-matching methylation sample is in column 4. Column 2 describes the assignment you gave (i.e. which methylation sample you coupled to the genotype sample in column 1). If the originally assigned sample in column 2 is identical to the best-matching sample in column 4, there is no indication of a sample mix-up, and column 6 will be FALSE. However, if the sample names in columns 2 and 4 differ, column 6 is TRUE and something might be going on with the samples.
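The column layout can be made explicit with a small parser. This is only a sketch: it assumes whitespace-separated columns, sample IDs without spaces, and that column 3 holds the z-score of the assigned pair (an inference from the example rows on this page; only column 5 is described in the text). The `BestMatch` type and `parse_best_matches` function are hypothetical helpers, not part of MixupMapper, and the `GT-3` row is made up for illustration.

```python
from typing import NamedTuple


class BestMatch(NamedTuple):
    genotype: str      # column 1: assessed genotype sample ID
    assigned: str      # column 2: sample you originally coupled to this genotype
    assigned_z: float  # column 3: z-score of the assigned pair (assumed)
    best: str          # column 4: best-matching sample
    best_z: float      # column 5: z-score of the best match
    mixup: bool        # column 6: TRUE/FALSE flag as written by the program


def parse_best_matches(text: str) -> list[BestMatch]:
    """Parse BestMatchPerGenotype-style rows; skip blank or non-numeric lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        try:
            rows.append(BestMatch(fields[0], fields[1], float(fields[2]),
                                  fields[3], float(fields[4]),
                                  fields[5].upper() == "TRUE"))
        except ValueError:
            continue  # e.g. a header line, if the file has one
    return rows


example = "GT-1\tEx-1\t-3.4\tEx-2\t-10.6\tTRUE\nGT-3\tEx-3\t-8.1\tEx-3\t-8.1\tFALSE\n"
for row in parse_best_matches(example):
    if row.mixup:
        print(f"{row.genotype}: assigned {row.assigned}, best match {row.best}")
```

Listing only the flagged rows gives you a short list to check against your logbook.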

Consider the following example:

GT-1	Ex-1	-3.4	Ex-2	-10.6	TRUE
GT-2	Ex-2	-2.60	Ex-1	-9.5	TRUE

The example described here is a classic sample swap: Ex-2 matches GT-1 best, and Ex-1 matches GT-2 best. Column 6 is TRUE for both samples. For GT-1, this means that not only is Ex-2 the best match for GT-1, but GT-1 is also the best match for Ex-2 (i.e. the relationship is bidirectional). If this relationship is also bidirectional for GT-2, and the z-scores in column 5 are very low (e.g. below -4, although this depends on the dataset), this is a strong indication that these samples were swapped.
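The swap pattern can be searched for mechanically. The sketch below works on plain tuples of the first five columns and flags pairs of genotype samples whose best matches are each other's assigned samples, with both best-match z-scores below a cutoff. The function name `find_swaps` and the default cutoff of -4 (taken from the rule of thumb in the text) are assumptions. It only inspects the per-genotype rows, so confirming true bidirectionality would still require BestMatchPerTrait.txt.

```python
def find_swaps(rows, z_cutoff=-4.0):
    """rows: iterable of (genotype, assigned, assigned_z, best, best_z) tuples.

    Returns sorted pairs of genotype IDs whose best matches are crossed.
    """
    rows = list(rows)
    by_assigned = {r[1]: r for r in rows}  # assigned sample -> its row
    swaps = set()
    for genotype, assigned, _, best, best_z in rows:
        if best == assigned:
            continue  # no mix-up indicated for this genotype
        other = by_assigned.get(best)
        if other is None:
            continue  # best match is not assigned to any genotype
        # Crossed pattern: the other genotype's best match is our assigned sample.
        if other[3] == assigned and best_z < z_cutoff and other[4] < z_cutoff:
            swaps.add(tuple(sorted((genotype, other[0]))))
    return sorted(swaps)


rows = [
    ("GT-1", "Ex-1", -3.4, "Ex-2", -10.6),
    ("GT-2", "Ex-2", -2.6, "Ex-1", -9.5),
]
print(find_swaps(rows))
```

Treat the result as a list of candidates to verify in the lab records, not as a decision.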

Now consider the following example:

GT-1	Ex-1	-3.4	Ex-2	-10.6	TRUE
GT-2	Ex-2	-9.5	Ex-2	-9.5	TRUE

In this case, Ex-2 is matched to two genotype samples. This means that either GT-1 and GT-2 are identical, or Ex-2 is contaminated (e.g. a mix of RNA from both GT-1 and GT-2). The problem is to decide which sample to include and which to exclude. In such a case, the best choice is to stick with the original assignment you provided and exclude GT-1 (see column 7), even though the z-score is lower for the GT-1/Ex-2 match. The choice to exclude a sample should, however, not be made by a program: as described above, you should check whether things make sense from the lab. The z-score can, however, give you an indication of how much a gene expression sample resembles the genotype.
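To spot this situation in a larger file, you can group genotype samples by their best match and report every sample that is claimed more than once. This is a minimal sketch on tuples of the first five columns; `shared_best_matches` is a hypothetical helper name, and the code only reports candidates without deciding which sample to exclude.

```python
from collections import defaultdict


def shared_best_matches(rows):
    """rows: iterable of (genotype, assigned, assigned_z, best, best_z) tuples.

    Returns {best-matching sample: [(genotype, best_z), ...]} for every
    sample that is the best match of more than one genotype sample.
    """
    claimed = defaultdict(list)
    for genotype, _assigned, _assigned_z, best, best_z in rows:
        claimed[best].append((genotype, best_z))
    return {sample: gts for sample, gts in claimed.items() if len(gts) > 1}


rows = [
    ("GT-1", "Ex-1", -3.4, "Ex-2", -10.6),
    ("GT-2", "Ex-2", -9.5, "Ex-2", -9.5),
]
for sample, genotypes in shared_best_matches(rows).items():
    print(sample, genotypes)
```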

Finally, consider the following example:

GT-1	Ex-1	-4.3	Ex-6	-10.4	TRUE
GT-2	Ex-2	-4.2	Ex-5	-9.60	TRUE
GT-3	Ex-3	-5.3	Ex-4	-10.4	TRUE
GT-4	Ex-4	-4.3	Ex-3	-9.60	TRUE
GT-5	Ex-5	-7.5	Ex-2	-10.4	TRUE
GT-6	Ex-6	-2.3	Ex-1	-9.60	TRUE

This example shows what a row inversion on a chip would look like: GT-1 matches Ex-6, GT-2 matches Ex-5, and so on.
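A row inversion can be checked with a simple comparison: the best matches, read in chip order, are the assigned samples in reverse. The sketch below assumes you pass the samples of one physical chip row in their layout order, which has to come from your plate or array layout rather than from the output file. `looks_like_row_inversion` is a hypothetical helper.

```python
def looks_like_row_inversion(assigned, best):
    """Return True when `best` is exactly `assigned` in reverse order.

    assigned: assigned samples (column 2) in physical chip-row order
    best:     best-matching samples (column 4) in the same order
    """
    assigned = list(assigned)
    best = list(best)
    return (len(assigned) > 1
            and len(assigned) == len(best)
            and best == assigned[::-1])


assigned = ["Ex-1", "Ex-2", "Ex-3", "Ex-4", "Ex-5", "Ex-6"]
best = ["Ex-6", "Ex-5", "Ex-4", "Ex-3", "Ex-2", "Ex-1"]
print(looks_like_row_inversion(assigned, best))
```

For a row of two samples this reduces to an ordinary swap, and for a row with an odd number of samples the middle sample matches itself.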

After checking the BestMatchPerGenotype.txt or BestMatchPerTrait.txt files, some samples can be identified as sample mix-ups. You can simply replace the mixed-up IDs in the genotypemethylationcoupling file, without changing the MethylationData files themselves. Please check in your plate and array layouts whether these sample mix-ups could plausibly have happened. If your layouts give no indication of why a sample could have been mixed up, our best practice is to remove the sample pair from further analysis.

If you want to remove a sample, for example after the mix-up step, do so by deleting it from your genotypemethylationcoupling file, by setting the sample to ‘exclude’ in ‘PhenotypeInformation.txt’, or by removing the sample from the methylation data. Never remove lines from the Individuals.txt in your TriTyper folder, as this will result in erroneous genotypes for the remaining samples.
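Corrections and removals in the coupling file can be scripted. The sketch below assumes the genotypemethylationcoupling file is tab-separated with the genotype ID in the first column and the methylation sample ID in the second; check this against your own file. `apply_corrections` is a hypothetical helper that works on lines of the coupling file only and never touches Individuals.txt.

```python
def apply_corrections(coupling_lines, corrections=None, drop=()):
    """Return corrected coupling lines.

    coupling_lines: lines of 'genotypeID<TAB>methylationID' (assumed layout)
    corrections:    {genotypeID: new methylationID} for confirmed mix-ups
    drop:           genotype IDs whose pair should be removed entirely
    """
    corrections = corrections or {}
    drop = set(drop)
    out = []
    for line in coupling_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        genotype, sample = line.split("\t")[:2]
        if genotype in drop:
            continue  # unexplained mix-up: leave the pair out
        out.append(f"{genotype}\t{corrections.get(genotype, sample)}")
    return out


lines = ["GT-1\tEx-1", "GT-2\tEx-2", "GT-3\tEx-3"]
fixed = apply_corrections(lines,
                          corrections={"GT-1": "Ex-2", "GT-2": "Ex-1"},
                          drop={"GT-3"})
print("\n".join(fixed))
```

Writing the result to a new file and keeping the original coupling file makes it easy to trace which changes were applied.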