Skip to content

Allele Specific Expression

Patrick Deelen edited this page Aug 29, 2014 · 13 revisions
Clone this wiki locally

Allele Specific Expression (ASE) mapping using the read count field in a VCF file. Can either use the genotypes in the same VCF file to select the heterozygous samples for a variant or use a secondary genotype file with high quality genotype calls.

Getting started

Downloading the software

You can download the latest version of the software here: Latest version.

Make sure to download eqtl-mapping-pipeline-*.*.*-SNAPSHOT-dist.zip or eqtl-mapping-pipeline-*.*.*-SNAPSHOT-dist.tar.gz

Running an ASE mapping

You can run an ASE analysis using this command:

java -jar eqtl-mapping-pipeline.jar --mode ase

and the options described below.

In the case of an out of memory exception you need to allocate more memory to java like this:

java -Xmx10g -Xms10g -jar eqtl-mapping-pipeline.jar --mode ase

If you run into problems please have look at this page: Basic introduction to commandline

Using VCF files

Before VCF files can be used they need to be compressed using bgzip and indexed with a tabix. This prevents having to read all data into memory yet still allows quick access.

Installing tabix and bgzip

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.6.tar.bz2
tar -jxvf tabix-0.2.6.tar.bz2
cd tabix-0.2.6/
make

Preparing a VCF file

bgzip -c example.vcf > example.vcf.gz
tabix -p vcf example.vcf.gz

Create VCF files from pileup file

Pileup files can be created using mpileup, http://samtools.sourceforge.net/mpileup.shtml. Make sure to make note of the recommendation in the mpileup manual to get precise read depth:

Use -BQ0 -d10000000 -f ref.fa if the purpose is to get the precise depth of coverage rather than call SNPs. Under this setting, mpileup will count low-quality bases, process all reads (by default the depth is capped at 8000), and skip the time-demanding BAQ calculation.

To use these pileup files they need to be converted to VCF. This can be done using the command below, help will be provided when running the command.

java -jar eqtl-mapping-pipeline.jar --mode pileupToVcf

ASE mapping options

###Basic commands

Short Long Description
-i --input Paths to one or more vcf.gz files or folders with each per chr 1 vcf.gz file. Tabix file must be present
-l --inputList One or more text files with on each line a path to a vcf.gz file or a folder with per chr 1 vcf.gz file. Tabix file must be present. Can be combined with --input
-g --genotypes The path to the reference genotypes. These genotypes will be used to determine if a sample is hetrozygous
-G --genotypesType The input data type. If not defined will attempt to automatically select the first matching dataset on the specified path (see inputeType options below)
-sc --sampleCoupling Tab separated file with column 1 sample ID and column 2 sample ID in reference genotype data, no header. If only contains a mapping for a subset of samples the original identifier in the reference is used
-o --output Path to output folder

###Fine tunning ASE

Short Long Description
-m --maxReads Maximum number of total reads
-p --minAllelePercentage Min percentage of read per allele. Default 0. Can be used in combination --minAlleleReads
-r --minReads Min number total reads for genotype
-s --minNumSamples Min number of samples per ASE effect

###Other options

Short Long Description
-f --gtf Optional .gtf file for annotations of ASE effects. Must be grouped by chromosome
-t --threads Maximum number of threads to start for parallel reading of multiple VCF files. Defaults to number of cores.
-d --debug Activate debug mode. This will result in a more verbose log file

####--genotypesType options

Base path refers to --genotypes

  • PED_MAP
  • Expects Plink PED file at: ${base path}.ped
  • Expects Plink MAP file at: ${base path}.map
  • Note: it is strongly recommend to use PLINK_BED due to the large memory usage of the current implementation. When dealing with large datasets it is recommend to first use plink to convert the data binary plink using this command plink --file pedmapData --make-bed --out binaryData
  • VCF
  • Expects VCF file at: ${base path}.vcf.gz
  • Must be compressed using bgzip. (see chapter: Preparing a VCF file)
  • Expects tabix file at: ${base path}.vcf.gz.tbi (see chapter: Preparing a VCF file)
  • PLINK_BED
  • Expects Plink BED file at: ${base path}.bed
  • Expects Plink BIM file at: ${base path}.bim
  • Expects Plink FAM file at: ${base path}.fam
  • Must be in SNP major mode. This is the default of Plink 1.07.
  • VCFFOLDER
  • Matches all vcf.gz files in the folder specified with the bash path.
  • SHAPEIT2
  • Expects haps file at: ${base path}.haps
  • Expects sample file at: ${base path}.sample
  • See also chapter on 'Using SHAPEIT2 Output and Oxford Gen format'
  • GEN
  • Expects gen file at: ${base path}.gen or without the extentions ${base path}
  • Expects sample file at: ${base path}.sample
  • See also chapter on 'Using SHAPEIT2 Output and Oxford Gen format'
  • TRITYPER
    • Expects folder with TriTyper data. Can handle dosage matrix