Skip to content

Genotype IO

Robert Warmerdam edited this page Jan 17, 2021 · 25 revisions

The molgenis IO is a java library that allows accessing genotype data form difference sources in a uniform and fast manner.

Content

Download latest jar
Features
Usage notes
Examples

Download latest jar

Most user can simply download the latest jar and use it in there programs.

https://molgenis50.gcc.rug.nl/jenkins/job/systemsgenetics/lastStableBuild/nl.systemsgenetics$Genotype-IO/

Make sure to download the stand alone jar: molgenis-genotype-reader-******-jar-with-dependencies

Features

Readers

  • Plink PED/MAP
  • Plink binary format BED/BIM/FAM
  • VCF
  • TriTyper
  • Oxford GEN
  • SHAPEIT2 format
  • Oxford Binary GEN

Writers

  • Plink PED/MAP
  • Plink binary format BED/BIM/FAM
  • IMPUTE2 phased haplotypes
  • Oxford GEN
  • SHAPEIT2 format
  • Oxford Binary GEN

Memory efficient

With the execption of Plink PED/MAP files only indices are stored in memory. Sample genotypes are only loaded when accessed. The loading of genotypes is cached preventing unnecessary disk IO.

Sample and Variant filters

Easy methods to select a subset of samples or variants based on identifiers or features.

Variants

  • SNP's and indel's
  • Basic statistics; MAF, HWE, callrate
  • LD calculation between variants
  • Both dosage and called genotype support

Editing data

In memory support to modify data. Writers can be used to save modifications.

High unit testing coverage

Usage notes

VCF

Only VCF files that are compressed using bgzip and indexed with a tabix are supported. This prevents having to read all data into memory.

Installing tabix

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.6.tar.bz2
tar -jxvf tabix-0.2.6.tar.bz2
cd tabix-0.2.6/
make

Preparing a VCF file

bgzip -c example.vcf > example.vcf.gz
tabix -p vcf example.vcf.gz

Getting the content from the VCF filter column

Genotype IO has a generic flexible variant annotation system. The VCF filter column content has been linked to this system and can be accessed using the following snip-it. Note, variant is of the calls GeneticVariant see below on how to load variants.

String filterValue = (String) variant.getAnnotationValues().get("VCF_Filter");

Running commandline

If you want summary statistics on a genotype data set of one the support formats you can run the jar. Help will be provided on running.

java -jar molgenis-genotype-reader-******-jar-with-dependencies.jar

Examples

Basic genotype data

The normal genotype data interface allows the iteration over variants

String datasetPath = "/some/path/file"; //Note omition of extentions

//Create a binary plink genotype data object
GenotypeData genotypeData = null;
try {
	genotypeData = new BedBimFamGenotypeData(datasetPath, 1000);
} catch (IOException ex) {
	LOGGER.fatal("IO error: " + ex.getMessage());
	System.exit(1);
}

for(GeneticVariant variant : genotypeData){
	//Iterate over all variants
	System.out.println(variant.getPrimaryVariantId());
}

for(Sample sample : genotypeData.getSamples()){
	//Note sex is an enum, 
	//to string outputs a human readable form of the gender
	System.out.println("Sample ID: " + sample.getId() 
                           + " sex: " + sample.getSex());
}

Random access genotype data

In most cases the random access genotype reader is more usefull. 3 examples are shown on how they can be loaded.

RandomAccessGenotypeData randomAccessGenotypeData = null;

try {

	//Here we read the dataset of the specified type using the auto loader
	randomAccessGenotypeData = new BedBimFamGenotypeData(datasetPath);

	//Or
	randomAccessGenotypeData = RandomAccessGenotypeDataReaderFormats
        .PLINK_BED.createGenotypeData(datasetPath, 1000);

	//Or
	randomAccessGenotypeData = RandomAccessGenotypeDataReaderFormats
        .valueOf("PLINK_BED").createGenotypeData(datasetPath, 1000);

} catch (IOException ex) {
	LOGGER.fatal("IO error: " + ex.getMessage());
	System.exit(1);
} catch (GenotypeDataException ex) {
	LOGGER.fatal("Genotype data error: " + ex.getMessage());
	System.exit(1);
}

for(String sequenceNames : randomAccessGenotypeData.getSeqNames()){
	//No a sequence can be chr or contig
	System.out.println("Seq/Chr: " + sequenceNames);
}

for(GeneticVariant variant : randomAccessGenotypeData
        .getVariantsByRange("1", 1, 10)){
	//get variants from chr 1 pos 1 to 10 (exclusive)
	System.out.println("Variant ID: " + variant.getPrimaryVariantId());
}

//Create hashmap based on primary ID. 
//Note this takes lots of memory. 
//Variants without an ID will always be ignored.
//If multiple variants have the same ID an arbritary variant is selected.
HashMap<String, GeneticVariant> variantHashMap = randomAccessGenotypeData.getVariantIdMap();

Oxford Binary GEN

Loading Binary GEN files

Genotype IO contains a complete implementation of a Oxford Binary Gen (BGEN) file Reader according to the specifications. A BGEN file can be loaded as is, or with an additional bgenix index file (.bgen.bgi). If the bgenix index file is not provided, this will created when the BGEN file is loaded. A regular Oxford GEN file is supposed to be accompanied by a .sample file that contains sample annotations, this is also required for Binary GEN files whenever this does not contain sample identifiers. When a sample file is provided, this is used in favor of the sample identifiers within the BGEN file. The following examples show how to load these files.

File bgenFile = new File("/some/path/file.bgen");

//The default bgenix file location will be equivalent to the following:
File defaultBgenixFile = new File(bgenFile.getAbsolutePath() + ".bgi");

//A sample file with annotations is required when 
//the BGEN file does not have sample identifiers:
File sampleFile = new File("/some/path/file.sample");

BgenGenotypeData bgenGenotypeData = null;

try {
    
    //Here we read a BGEN file using the auto loader.
    //The bgenix file will be loaded from, or created at the default location, 
    //depending on whether it already exists.
    bgenGenotypeData = new BgenGenotypeData(bgenFile);
    // If the file does not contain sample identifiers, an exception is thrown.
    
    //Supply a .sample file to read sample identifiers and annotations from.
    bgenGenotypeData = new BgenGenotypeData(bgenFile, sampleFile);
    
    //One can also specify a bgenix file at another location.
    File otherBgenixFile = new File("/some/other/path/file.bgi");
    bgenGenotypeData = new BgenGenotypeData(bgenFile, 
                                            sampleFile, 
                                            otherBgenixFile);

    //An exception is thrown whenever a loaded bgenix file does not
    //correspond to the bgen file.

} catch (IOException ex) {
	LOGGER.fatal("IO error: " + ex.getMessage());
	System.exit(1);
} catch (GenotypeDataException ex) {
	LOGGER.fatal("Genotype data error: " + ex.getMessage());
	System.exit(1);
}

BGEN files vary in a couple of different manners. In the BGEN file the characteristics of that file are stored in the set of flags within the header of the BGEN file. These flags represent the compression type of the genotype data block (can be uncompressed, compressed using zlib's compress() function or compressed using zstandard's ZSTD_compress() function), the layout type of the BGEN file (1 or 2) which influences what can be stored in the genotype data block, and finally, the presence of sample identifiers within the file.

The following methods can be used on a bgenGenotypeData object to verify characteristics of the loaded BGEN file.

//Determine the compression type of the loaded BGEN file.
BlockRepresentation blockRepresentation = bgenGenotypeData
        .getGenotypeDataBlockRepresentation();

if (blockRepresentation == BlockRepresentation.compression_0) {
    System.out.println("The genotype data block is not compressed.");
} else if (blockRepresentation == BlockRepresentation.compression_1) {
    System.out.println("zlib's compress() function was used.'");
} else if (blockRepresentation == BlockRepresentation.compression_2) {
    System.out.println("ZSTD_compress() was used.");
}

//Determine the layout type of the loaded BGEN file.
Layout layout = bgenGenotypeData.getFileLayout();

if (layout == Layout.layOut_1) {
    System.out.println("Layout 1 is used.");
} else if (layout == Layout.layOut_2) {
    System.out.println("Layout 2 is used.");
}

//Determine if the sample identifiers are present in the BGEN file.
boolean sampleIdentifierPresent = bgenGenotypeFile
        .areSampleIdentifiersPresent();

if (sampleIdentifierPresent) {
    System.out.println("Sample identifiers are present.");
} else {
    System.out.println("Sample identifiers are not present.");
}

Whenever sample identifiers are not present, the sample identifiers in the collection of samples is generated by leveraging the number of samples within the file.

Writing Binary GEN files

BGEN files (version 1.3) can be written using the writer that is implemented within Genotype IO. The writer attempts to write both a .sample file accompanying the .bgen file, and a .bgen.bgi index file. Currently the files that will be written correspond to layout two, and the probabilities are always compressed using ZSTD compression.

Thew following example shows how to write BGEN files.

File newBgenFile16bits = new File("/some/path/new16bits.bgen");
File newBgenFile32bits = new File("/some/path/new32bits.bgen");

try {
    
    //First we will have to initialize the BgenGenotypeWriter
    BgenGenotypeWriter writer = new BgenGenotypeWriter(randomAccessGenotypeData);
    // Now we can use the write method.
    writer.write(newBgenFile16bits);

    // The default precision that is used for the probabilities corresponds to 16 bits.
    // This can be changed to a custom value between 1 and 32 using the method below.
    int numberOfBits = 32; // The number of bits between 1 (inclusive) and 32 (inclusive)
    writer.setProbabilityPrecisionInBits(numberOfBits);
    // Now we can use the writer again to write the same data with higher precision
    writer.write(newBgenFile32bits);

} catch (IOException ex) {
	LOGGER.fatal("IO error: " + ex.getMessage());
	System.exit(1);
} catch (GenotypeDataException ex) {
	LOGGER.fatal("Genotype data error: " + ex.getMessage());
	System.exit(1);
}

Genetic variant

Some examples on what can be done with genetic variants.

GeneticVariant snp = randomAccessGenotypeData.getSnpVariantByPos("1", 1);
//Get snp variant at this position. null if not present

snp.isSnp(); //must be true since we did getSnpVariantByPos

for(Alleles sampleAlleles : snp.getSampleVariants()){
	//Iterate over the alleles form all samples.
	System.out.println(sampleAlleles.getAllelesAsString());
}

for(byte dosage : snp.getSampleCalledDosages()){
	//Iterate over the dosage values of snps. (0,1,2)
	System.out.println(dosage);
}

for(float dosage : snp.getSampleDosages()){
	//Iterate over the dosage values of snps. range 0 - 2
	System.out.println(dosage);
}

snp.getMinorAllele();
snp.getMinorAlleleFrequency();

snp.getCallRate();

snp.getHwePvalue();

snp.isBiallelic();

//chr != 0 && pos != 0
snp.isMapped();

snp.isAtOrGcSnp();

Genotype probabilities

Probabilities for genetic variants can be obtained using the getSampleGenotypeProbilities() method. This method returns an array of probabilities for the three possible genotypes of a diploid sample for a biallelic variant, per sample:

float[][] probabilities = snp.getSampleGenotypeProbilities(); 
//probabilities[sample][AA,AB,BB]

The genotypes that these probabilities represent are, in order, homozygosity for the reference allele, heterozygosity for both, and homozygosity for the alternative allele.

Probabilities for multiallelic variants (variants with three or more possible alleles) or for haploid or polyploid (more than two copies) samples are not available for every file type. However, the Oxford Binary Gen format supports this, and this is implemented within Genotype IO. The getSampleGenotypeProbabilitiesComplex() method makes these probabilities available. use the method as follows:

GeneticVariant variant = bgenGenotypeData.getSnpVariantByPos("1", 1);
double[][] probabilities = variant.getSampleGenotypeProbabilitiesComplex();
//probabilities[sample][AAA,AAB,ABB,...,ACC,BCC,CCC]

The returned order of probabilities corresponds to the order of unphased data for a layout 2 BGEN file.

Probabilities for individual haplotypes are available when data is phased. Check whether this is available prior to requesting the phased data:

if (variant.isPhasedDataPresent()) {
    double[][][] probabilities = bgenVariant
            .getSampleGenotypeProbabilitiesPhased();
    //probabilities[sample][haplotype][A,B,C,...]
}
//If phased data is not available when 
//.getSampleGenotypeProbabilitiesPhased is called, 
//an exception will be thrown.

Alleles and Allele

The Alleles object stores a collection of Allele objects

Alleles snpAlleles =  snp.getVariantAlleles();
		
for(Allele a : snpAlleles){
	//Print the alles found for this variant
	System.out.println("Allele: " + a.getAlleleAsString());
}

//In case of bi-allelic variant
System.out.println("Allele 1: " + snpAlleles.get(0).getAlleleAsString());
System.out.println("Allele 2: " + snpAlleles.get(1).getAlleleAsString());

Alleles alleles1 = Alleles.createAlleles(Allele.A, Allele.C);
Alleles alleles2 = Alleles.createAlleles(Allele.C, Allele.A);
Alleles alleles3 = Alleles.createAlleles(Allele.A, Allele.C);

if(alleles1 == alleles2){}; // false because order is different
if(alleles1 == alleles3){}; // true because it is guaranteed that if alleles and order is endentical it is the same object
if(alleles1.sameAlleles(alleles2)){}; // true

Linkage Disequilibrium

We allow easy LD calculation

Ld ld = null;
try {
	ld = snp.calculateLd(snp2);
} catch (LdCalculatorException ex) {
	LOGGER.fatal("Error in LD calculation: " + ex.getMessage(), ex);
	System.exit(1);
}
ld.getR2();
ld.getDPrime();

Filtering random access genotype data

Filtering a dataset on variants or samples will create a new RandomAccessGenotypeData that when accessed will only reviel the included variants and samples as if the others do not even exist. Data is not copied, instead the filtered datasets are backed by the orignal

Filtering variants

HashSet<String> includedSnps = new HashSet<String>();
includedSnps.add("rs1");
includedSnps.add("rs2");

//Create filter
VariantFilter variantFilter = new VariantIdIncludeFilter(includedSnps);

//Filter to only include selected variants
RandomAccessGenotypeData genotypeData1 = new VariantFilterableGenotypeDataDecorator(randomAccessGenotypeData, variantFilter);

Filtering samples

HashSet<String> includedSsamples = new HashSet<String>();
includedSsamples.add("pietje");
includedSsamples.add("jansje");

SampleFilter sampleFilter = new SampleIdIncludeFilter(includedSsamples);

//Filter the data with the selected variants to only include the selected samples
RandomAccessGenotypeData randomAccessGenotypeDataSampleFilter = new SampleFilterableGenotypeDataDecorator(randomAccessGenotypeDataVariantFilter1, sampleFilter);

//Here we define a combined variant filter consting of 2 filters. This can contain as many filters as requered
VariantFilter combinedFilter = new VariantCombinedFilter(new VariantQcChecker(0.05f, 0.95f, 0.001d), new VariantFilterBiAllelic());

//Now we also do some QC of the variants after the samples are filtered. maf 0.05, call rate 0.95 and hwe-p 0.001
RandomAccessGenotypeData randomAccessGenotypeDataVariantFilter2 = new VariantFilterableGenotypeDataDecorator(randomAccessGenotypeDataSampleFilter, combinedFilter);

Autoloader and filters

It is also possible to use the auto loader from the basicUsage examples in combination with filters. This is faster and more memory efficient for some filetypes, like TriTyper and in the future (or now if this is not updated) also for ped_map and binary plink. I is possible to use this method also on the other files types although it does not give increased performance. It is possible to set either the variant filter or the sample filter to null include all variants or samples.

Note that samples are filtered first and then the variant filter is applied, so a when filtering on the minor allele frequency then it will be calculated only using the included samples.

RandomAccessGenotypeDataReaderFormats.VCF.createFilteredGenotypeData(datasetPath, 1000, combinedFilter, sampleFilter);

More examples

The org.molgenis.genotype.examples package contains these and other basic examples.

Working with the code

  • Install Maven (http://maven.apache.org/) if you haven't done so already
  • Build molgenis-genotype-reader (mvn clean install)

How to open in Eclipse

  • Install the m2e Maven plugin; This is standard installed in the more recent versions of Eclipse, if you got an older version you probably need to install it.

  • Import the project in Eclipse with File/Import -> Existing Maven Projects and select the molgenis-genotype-reader root folder

Clone this wiki locally