QTL mapping analysis cookbook for Affymetrix expression arrays
NB! This is a preliminary pipeline for testing in the available Affymetrix cohorts.
This is a custom cookbook for performing eQTL analysis using Affymetrix arrays. The software allows eQTL mapping using linear models and direct meta-analysis of such data through weighted Z-score analysis. This cookbook illustrates only a part of the capabilities of our software; see the other parts of the wiki for other options.
Questions or suggestions?
You can contact the authors of this manual and software at urmo.vosa @ gmail.com, anniqueclaringbould @ gmail.com, patrickdeelen @ gmail.com, natalia.tsernikova @ gmail.com, bonder.m.j @ gmail.com, westra.harmjan @ gmail.com, dashazhernakova @ gmail.com or lude @ ludesign.nl for questions or suggestions regarding the software or the manual. This manual was written in cooperation with Marjolein Peters, Joyce van Meurs, Annique Claringbould and Pieter Neerincx.
Manual contents
- Downloading the software and reference data
- Getting the account to the sharing server
- Before you start
- Step-by-step eQTL analysis
- Step 1 - Preparation of expression data
- Step 2 - Preparation of genotype data
- Step 3 - Formatting expression data
- Step 4 - MixupMapper
- Step 5 - Removing principal components
- Step 6 - Performing the cis-eQTL analysis
- Step 7 - Conducting conditional cis-eQTL analysis
- Step 8 - Performing the trans-eQTL analysis
- Step 9 - Performing the Polygenic Risk Score eQTL analysis
- Step 10 - Uploading your result files
Downloading the software and reference data
You will need three programs to perform the QTL analysis.
- The PLINK toolset (v1.07), to calculate MDS components.
- The software to process and convert the genotype data: please download Genotype Harmonizer version 1.4.9.
- The actual QTL mapping software. The latest version to use for both the cis- and trans/PRS-eQTL phases of eQTLGen is 1.2.4F. It adds additional functionality for PRS-eQTL analysis and is available under the eQTLGen shared account (see Getting the account to the sharing server): Affymetrix/pipeline/eqtl-mapping-pipeline-1.2.4F-SNAPSHOT-dist.tar.gz.
Additionally you will need the following reference files:
- Giant genotype reference file, which can be downloaded here.
- The annotation files for the expression probes, which can be downloaded from the eQTLGen shared account (see Getting the account to the sharing server): Affymetrix/AnnotationFiles. Please contact us if the annotation file for your platform is missing.
Getting the account to the sharing server
We use the UMCG Genomics Coordination Center SFTP server, cher-ami.hpc.rug.nl, to receive your result files and to provide you with the files and tools needed for the next steps.
To get access to the upload server, please follow these steps:
1. Generate the private-public key pair as documented here: http://wiki.gcc.rug.nl/wiki/RequestAccount
2. Send an e-mail to the GCC Helpdesk (Helpdesk dot GCC dot Groningen at gmail dot com) and your FrankeSwertzLab collaborator (urmo.vosa @ gmail.com) asking for a guest account, specifying the name of your cohort and including your public key. The format is as follows:
Dear GCC Helpdesk, I am working in the collaboration with FrankeSwertzLab, project is eQTLGen, contact is Urmo Võsa (cc’d). The name of the cohort is [ALSPAC/LIFE/Rotterdam, etc]. Please provide me: 1. guest account to the cher-ami SFTP server for three months. 2. access to the umcg-eqtlgen shared account. My public key is attached. Please cc: urmo.vosa@gmail.com and anniqueclaringbould@gmail.com to all the correspondence and provide them access to my guest account. Best regards, [Name] [Institution]
After getting the account, you will gain access to the shared eQTLGen account (umcg-eqtlgen), where we supply the files needed for running the trans-eQTL and PRS-eQTL analyses.
The folder structure of the umcg-eqtlgen account is shown below. For running the analyses from this cookbook, you need the folders Affymetrix and PRS_QTL.
|--Affymetrix
|  |--AnnotationFiles
|  |  |--ProbeAnnotation8_Affy_U219_Genes.txt
|  |  |--ProbeAnnotation8_Affy_U219_Exons.txt
|  |  |--README
|  |--HelpFiles
|  |  |--ListOfGeneticRiskFactors_20161121.txt
|  |  |--README
|  |--scripts
|  |  |--normalization
|  |  |  |--normalizeRMA.R
|  |  |  |--README
|  |  |--conditional_analysis
|  |  |  |--script_run_cond_analysis.sh
|  |  |  |--generate_regress_out_scripts.sh
|  |  |  |--template.xml
|  |  |  |--README
|  |--pipeline
|  |  |--eqtl-mapping-pipeline-1.2.4F-SNAPSHOT-dist.tar.gz # this pipeline version is used for the next steps
|--RNA_seq
|  |--pipeline
|  |--cis_eQTL_results
|  |--trans_eQTL
|--PRS_QTL
|  |--SummaryStatisticsDatabase_freeze1b # folder with all the harmonised summary statistic files
|  |--SummaryStatisticsDatabase_freeze1b.md5 # please use this to check the integrity of files after download
|  |--GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT # folder with the PRS calculation tool
|  |--README
Download instructions are here: http://wiki.gcc.rug.nl/wiki/DataSharing
If the instructions below do not work for your cluster setup, please consult the extended manual: https://github.com/molgenis/systemsgenetics/wiki/Using-sharing-server-for-eQTLGen-analyses
## one possibility is to use lftp, if this is installed in your environment
## use these commands to download data from cher-ami server
# go to the local folder where you want to run analysis:
cd local/analysis/folder/
# start lftp
lftp
# connect with the shared eQTLGen account:
lftp :~> open -u umcg-eqtlgen,none -p 22 sftp://cher-ami.hpc.rug.nl
# download needed folder:
lftp :~> mirror folder_needed_for_analysis
# exit from the remote server
lftp :~> exit
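If you downloaded the PRS_QTL folder, you can verify that the summary statistics arrived intact using the supplied checksum list. A minimal sketch, assuming the .md5 file lists paths relative to the folder you run the command from:
## verify the downloaded summary statistics against the supplied checksum list
cd PRS_QTL
md5sum -c SummaryStatisticsDatabase_freeze1b.md5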
You will also get the separate guest account (named umcg-guest[0-9]) that will be used to upload your results files in Step 10.
Before you start
In this manual we assume that you have basic knowledge of how to work with the Java Virtual Machine (VM) and the command line. If you are not sure about working with the command line, please have a look at our introduction page and/or check the information about the Java VM (including the Java memory options). If you run into heap space or out-of-memory errors, please supply the -Xmx and -Xms options (see the Java VM information). For each java command we supply the recommended settings.
Uploading the results
We use an SFTP server for providing the required analysis files and for delivering the results; please consult Getting the account to the sharing server. You can upload the results to the individual guest account you received (named in the format umcg-guest[0-9]), as explained in Step 10.
Step-by-step eQTL analysis
This is a step-by-step guide through the eQTL mapping process using our software. Our general method consists of the ten steps described below:
- Step 1 - Preparation of expression data
- Step 2 - Preparation of genotype data
- Step 3 - Formatting expression data
- Step 4 - MixupMapper
- Step 5 - Removing principal components
- Step 6 - Performing the cis-eQTL analysis
- Step 7 - Conducting conditional cis-eQTL analysis
- Step 8 - Performing the trans-eQTL analysis
- Step 9 - Performing the Polygenic Risk Score eQTL analysis
- Step 10 - Uploading your result files
Definitions
Throughout the manual, references to different full paths will be made. Here is an overview of these paths:
- The expression file will be referred to as traitfile.
- The probe/trait/gene annotation file will be referred to as annotationfile.
- The full path of your genotype data will be referred to as genotypedir.
- The file linking expression sample identifiers to genotype sample identifiers will be referred to as genotypeexpressioncoupling.
- The file containing covariates will be referred to as covariatefile.
Descriptions of each of these files and their usage are detailed below, and their formats are described in the data formats section of the general QTL mapping documentation.
Step 1 - Preparation of expression data
This custom cookbook shows how to pre-process Affymetrix expression array data that is available as .cel files. Currently we provide instructions for HuEx v1.0 ST and U219 arrays. For these steps you need R (v3.2.0 or newer) and the R package oligo (v1.38 or newer), which can be added to your R installation from here: https://www.bioconductor.org/packages/release/bioc/html/oligo.html. This will also install the oligo package dependencies. The standard preprocessing involves RMA normalization and log2 transformation.
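As a convenience, a minimal installation sketch is shown below; it assumes a reasonably recent R/Bioconductor setup (on older R 3.2-era installations, use the biocLite() installer described on the Bioconductor page linked above instead).
# install the oligo package (and its dependencies) from Bioconductor
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos = "https://cloud.r-project.org"); BiocManager::install("oligo")'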
❗ If your Affymetrix array expression data has already been preprocessed with RMA and you know that expression outliers were removed beforehand, please send an e-mail with a description of the preprocessing steps to urmo.vosa @ gmail.com and proceed with the already preprocessed matrix in Step 3B.
To process your Affymetrix array data, keep your .cel files in a separate folder and use the following command with the R script available on cher-ami (Affymetrix/scripts/normalization/normalizeRMA.R):
Rscript normalizeRMA.R {input_directory_with_cel_files}
Your expression data is now in a simple tab-separated text file. The format of this file is described here: [Phenotype file, covariate file](https://github.com/molgenis/systemsgenetics/wiki/File%20descriptions#phenotype-file-covariate-file).
eQTL mapping will require an annotation of the probes/traits/genes in your traitfile. We store this annotation in a separate file. The format of this file is described here: Probe annotation file. Currently we provide annotations for HuEx v1.0 ST and U219 arrays; these are available under the cher-ami shared umcg-eqtlgen account: Affymetrix/AnnotationFiles.
Step 2 - Preparation of genotype data
Our software is able to use both unimputed called genotypes and imputed genotypes with their dosage values. The software, however, requires these files to be in TriTyper format. We provide Genotype Harmonizer to harmonize and convert your genotype files. However, we first need to calculate MDS components to correct the expression data for population stratification.
Step 2A - Correction for population stratification.
You will need to correct the expression data for 4 MDS components during the gene expression normalization step. First we need to get our data into the PLINK format. It is recommended to use the binary PLINK format, so you should have these files: mydata.bed, mydata.fam and mydata.bim.
📌 Note: this step should be performed using the nonimputed autosomal SNPs. If you run into problems feel free to contact patrickdeelen @ gmail.com for advice. It is also recommended to use pruned results. Please have a look at the PLINK manual (http://zzz.bwh.harvard.edu/plink/summary.shtml#prune) on how to obtain this.
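For reference, a minimal pruning sketch is shown below; the window, step and r² settings are commonly used example values, not project requirements.
# generate a list of SNPs in approximate linkage equilibrium (window of 50 SNPs, step 5, r^2 threshold 0.2)
# add --noweb if your PLINK 1.07 build tries to check for updates
plink --bfile mydata --indep-pairwise 50 5 0.2 --out pruning
# keep only the pruned-in SNPs in a new binary fileset
plink --bfile mydata --extract pruning.prune.in --make-bed --out mydata_pruned
If you go this route, substitute mydata_pruned for mydata in the --genome and --cluster commands below.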
Because PLINK uses complete linkage agglomerative clustering based on pairwise identity-by-state (IBS) distance, we first need to execute this command:
plink --bfile mydata --genome
which creates the file plink.genome.
This file will have as many rows as there are unique pairs of individuals in the sample. For large samples with thousands of individuals, this file will take a considerable time to generate.
To obtain these MDS components, you can use a simple command using the PLINK tool to create a multidimensional scaling matrix. The command line will look like this:
plink --bfile mydata --read-genome plink.genome --cluster --mds-plot 4
This command creates the file plink.mds which contains one row per individual with the following fields:
- FID - Family ID;
- IID - Individual ID;
- SOL - Assigned solution code;
- C1 - Position on first dimension;
- C2 - Position on second dimension;
- C3 - Position on third dimension;
- C4 - Position on fourth dimension;
You can find more information on the PLINK website.
✅ Please upload your log file to the central server.
Step 2B - Convert data to the TriTyper file format.
For the eQTL mapping the data needs to be in TriTyper format. Using Genotype Harmonizer you can harmonize and convert genotype formats. In this step we will directly harmonize, filter and convert the genotype data. The data is harmonized to match the GIANT release of 1000G. You can run this step either per chromosome or for all chromosomes at once.
❗ Important: If you are using imputed genotypes, make sure to use as an input format a file that contains the probabilities or dosages and not just genotype calls. In case of doubt please contact patrickdeelen @ gmail.com
📌 Note: please make sure to use version 1.4.6, 1.4.7 or 1.4.9 of the Genotype Harmonizer.
java -Xmx10g -Xms10g -jar ./GenotypeHarmonizer.jar -i {locationOfInputData} -I {InputType} -o {Outputlocation} -O TRITYPER -r {locationOf1000G-GiantVcfs} --refType VCF_FOLDER --update-id -cf 0.95 -hf 0.0001 -mf 0.01 -mrf 0.5 -ip 0
Input arguments: -i location of the input data (without extension), -I input type, -o output location, -O output type TRITYPER, -r location of the reference data, --refType type of reference data. See the Genotype Harmonization manual for further details on the input flags.
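If you prefer to run the harmonization per chromosome, a minimal sketch of such a loop is shown below; the per-chromosome input and output names are placeholders for illustration, and the filter flags are copied from the command above.
# example loop over the autosomes; adjust -i/-I and the reference location to your own data
for CHR in {1..22}
do
  java -Xmx10g -Xms10g -jar ./GenotypeHarmonizer.jar \
    -i mydata_chr${CHR} -I {InputType} \
    -o TriTyper_chr${CHR} -O TRITYPER \
    -r {locationOf1000G-GiantVcfs} --refType VCF_FOLDER \
    --update-id -cf 0.95 -hf 0.0001 -mf 0.01 -mrf 0.5 -ip 0
done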
Note: it is not recommended to exclude samples at this step. This can easily be done at a later stage using the genotype to expression coupling file. Having all samples will help resolve possible sample mixups in step 4 more accurately.
✅ Please upload your log files to the FTP server.
If you ran the Genotype Harmonizer per chromosome, you now need to merge the per-chromosome TriTyper folders into one TriTyper folder containing data from all chromosomes. Use the following command to merge the individual TriTyper folders.
java -Xmx2g -Xms2g -jar ./eqtl-mapping-pipeline.jar --imputationtool --mode concat --in {folder1;folder2;folder3;ETC} --out {OutputFolder}
Input arguments: --in is a semicolon-separated list of input TriTyper folders with information per chromosome, --out is the output location of the merged TriTyper data. Note: in case of problems, try putting quotes (") around your list of input folders.
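A small sketch for building the semicolon-separated folder list automatically, assuming the per-chromosome folders are named TriTyper_chr1 ... TriTyper_chr22 as in the loop above:
# collect the per-chromosome TriTyper folders into one semicolon-separated string
FOLDERS=$(ls -d TriTyper_chr* | paste -sd';' -)
java -Xmx2g -Xms2g -jar ./eqtl-mapping-pipeline.jar --imputationtool --mode concat --in "${FOLDERS}" --out TriTyper_allChr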
Great! You now have your genotype and expression data ready to go!
File Checklist
Before continuing, now is a good time to check whether your files are in the correct format and whether you have all the required files ready. Please check the following:
- All required TriTyper files are in your genotypedir.
- The expression files are formatted properly.
- You have an annotationfile (you should have downloaded this from the FTP at the start).
- The identifiers of samples are identical in your expression and genotype files. If not, you should make a file to link the proper samples. This file is called a genotypeexpressioncoupling file; the format is described here: Genotype - phenotype coupling. This file is not necessary if the names are already matched.
Step 3 - Formatting expression data
Step 3A - Removing outliers from expression data
❗ If you are working with already QC'd and RMA-normalized expression data (you know that expression outliers were removed prior to RMA), you can skip this step and proceed to Step 3B, using your original expression file as input.
Before performing the actual normalization it is essential that we remove outlier samples. These can easily be detected by performing a PCA on the data without prior centring of the expression values.
java -Xmx10g -Xms10g -jar eqtl-mapping-pipeline.jar --mode normalize --in traitfile --out {tmpoutdir} --qqnorm --logtransform --adjustPCA --maxnrpcaremoved 0 --stepsizepcaremoval 0 2>&1 | tee {tmpoutdir}/outlierDetection.log
Use the command below to extract the first 2 principal components. Please plot these components. You might observe a few clear outliers: please remove these samples from the analysis (see instructions below). In case of doubt feel free to contact patrickdeelen @ gmail.com for support, and please provide a plot of the first components.
zcat ***.QuantileNormalized.Log2Transformed.PCAOverSamplesEigenvectors.txt.gz | awk 'BEGIN {OFS="\t"} NR == 1 {print "Sample","Comp1","Comp2"} NR > 1 {print $1,$2,$3}' > pc1_2.txt
You can use the R code below to create a plot of these components and create a file with samples to include.
pc1_2 <- read.delim("pc1_2.txt")
plot(pc1_2$Comp1, pc1_2$Comp2, xlab = "Component 1", ylab = "Component 2")
#
## Please fill in the outlier cut-off below
#
outlierCutoff <- #
pdf("OutlierDetection.pdf")
palette(c("dodgerblue2", "red2"))
plot(pc1_2$Comp1, pc1_2$Comp2, xlab = "Component 1", ylab = "Component 2", col = as.numeric(pc1_2$Comp1 < outlierCutoff)+1, main = paste("Removed", sum(pc1_2$Comp1 < outlierCutoff), "expression outlier samples"))
abline(v=outlierCutoff, col = "red2")
dev.off()
## NB!: please check on which side of the outlierCutoff the outlier samples are:
# If outlier samples are on the left of the cutoff line run:
write.table(pc1_2$Sample[pc1_2$Comp1 >= outlierCutoff], file = "includedExpressionSamples.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
# If outlier samples are on the right of the cutoff line run:
# write.table(pc1_2$Sample[pc1_2$Comp1 <= outlierCutoff], file = "includedExpressionSamples.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
✅ Please upload your log file and the plot of the first two components to the FTP server.
After you have uploaded the plot and the log to the FTP server, it is wise to remove most of the data in the tmpoutdir. Apart from the log and pc1_2.txt, the data is no longer needed.
Step 3B - Expression data normalization
After removal of low-quality samples we proceed with the actual normalization. Generally, continuous trait data need to be normalized prior to applying statistical testing. Our software package provides different ways of normalizing the data contained in the traitfile.
Our general normalization strategy for Affymetrix array expression data consists of the following steps:
- RMA normalisation (done by R script)
- Log2 transformation (done by R script)
- Probe centering and scaling (Z-transform): Expression(probe,sample) = (Expression(probe,sample) - Mean(probe)) / StdDev(probe)
- Removal of covariates: correct the gene expression data for the first 4 MDS components of the GWAS data, to remove possible population stratification
- We run a Generalized Linear Model with the probes as dependent variables, and the GWAS MDSs as orthogonal covariates. For the remainder of the analysis, we use the residuals of this model.
- Principal component adjustment
- In the Principal Component Analysis, the software tries to convert the normalized and standardized expression results into a set of values of uncorrelated variables called Principal Components (PCs). The number of PCs is equal to the number of samples, and the first PCs explain a higher variance than the last PCs. By adjusting the normalized and standardized expression data for a set of PCs (like we did with the removal of the covariates - use of a Generalized Linear Model), we try to remove batch effects in the data. Note: the PCs will be used in Step 5.
This step should be performed only on samples that are not marked as outliers in the previous step. To do this we need a file with the identifiers of the samples to include. Each sample identifier should be on a separate line and no header should be used. If you used the RMA normalization R script you can use includedExpressionSamples.txt.
Step 3B.1 Re-normalization
❗ If you are working with already QC'd and RMA-normalized expression data (you know that expression outliers were removed prior to RMA), you can skip this step and proceed to Step 3B.2, using your original expression file as an input.
Otherwise, to remove potential bias arising from outlier samples, you have to re-normalize the data based on only the non-outlier samples. For that, we re-run the RMA normalization script with an additional argument pointing to the includedExpressionSamples.txt file. This script performs RMA normalization and log2 transformation.
Rscript normalizeRMA.R {input_directory_with_cel_files} {includeFile}
NB! Please use exactly this order of arguments in the command.
Step 3B.2 Scaling, PC calculation and covariate removal
After that, use the resulting file named ExpressionData.SampleSelection.RMA.Log2Transformed.txt to run the remaining preprocessing steps with the following command:
java -Xmx10g -Xms10g -jar eqtl-mapping-pipeline.jar --mode normalize --in traitfile --out {outdir} --sampleInclude {includeFile} --centerscale --adjustcovariates --cov {covariatefile} --adjustPCA --maxnrpcaremoved 0 --stepsizepcaremoval 0 2>&1 | tee {outdir}/normalization.log
For details on the covariate file see this link. We want to use the PLINK MDS results as confounders for our analysis. Before you can use the PLINK MDS results you need to run this command: awk 'BEGIN { OFS = "\t" } {print $2,$4,$5,$6,$7}' < plink.mds > plinkMds.txt
Additionally, this file can be used for correcting the expression data for other confounders, for example pedigree structure or batch effects. For that, just add the corresponding covariates to the columns of the file.
❗ Please make sure that the sample identifiers in this file correspond to the sample identifiers in the expression data.
✅ Please upload your log file to the FTP server.
Check your output
Running the general normalization procedure yields a number of files in the directory of your traitfile, or in the outdir if you specified one. The default procedure will generate the file suffixes listed below. Suffixes are appended in the default order described above, and selecting multiple normalization methods will add multiple suffixes. For Affymetrix array data, the RMA normalization and log2 transformation have already been performed by the R script. Please do not remove or replace ANY intermediate files from this normalization step, as they may be used in subsequent steps.
Suffix | Description |
---|---|
SampleSelection | Data after selection of the samples. |
ProbesCentered | Centered probes. |
SamplesZTransformed | Z-transformed samples. |
CovariatesRemoved | Expression data adjusted for covariates. |
PCAOverSamplesEigenvalues | Eigenvalues created during eigenvalue decomposition of the expression sample correlation matrix (created from the RMA Normalized, Log2 Transformed, Z-transformed data) |
PCAOverSamplesEigenvectors | Eigenvectors created during eigenvalue decomposition of the expression sample correlation matrix (created from the RMA Normalized, Log2 Transformed, Z-transformed data) |
PCAOverSamplesEigenvectorsTransposed | Eigenvectors transposed |
PCAOverSamplesPrincipalComponents | Principal Components describing the sample correlation matrix (created from the RMA Normalized, Log2 Transformed, Z-transformed data) |
Step 4 - MixupMapper
In our paper published in Bioinformatics (Westra et al.: MixupMapper: correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects) we have shown that sample mix-ups often occur in genetical genomics datasets (i.e. datasets with both genotype and gene expression data). Therefore, we developed a method called MixupMapper, which is implemented in the QTL Mapping Pipeline. This program performs the following steps:
- A cis-eQTL analysis on the dataset:
- using a 250 kb window between the SNP and the mid-probe position
- performing 10 permutations to control the false discovery rate (FDR) at 0.05
- Calculation of how well the expression data matches the genotype data. For details on this process please consult the paper.
Data checklist
- Note down the full path to your TriTyper genotype data directory. We will refer to this directory as genotypedir.
- Determine the full path of the expression data produced in step 3, the file ending with CovariatesRemoved.txt.gz. We will refer to this path as traitfile.
- Locate your annotationfile. Also note down the platformidentifier, which is written in the first column of the annotationfile. Affymetrix U219 gene level: "Affymetrix_U219_GeneLevel", Affymetrix U219 exon level: "Affymetrix_U219_ExonLevel". Please contact us if the annotation file for your platform is missing.
- (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
- Find a location on your hard drive to store the output. We will refer to this directory as outputdir.
Commands to be issued
The MixupMapper analysis can be run using the following command:
java -Xmx10g -Xms10g -XX:StringTableSize=319973 -jar eqtl-mapping-pipeline.jar --mode mixupmapper --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --snps {snpfile} (--testall) (--gte {genotypeexpressioncoupling}) 2>&1 | tee {outdir}/mixupmapping.log
It is more efficient to test a subset of SNPs, as testing a full 1000G-imputed dataset will take a lot of time. To do this you can use this snpfile and add the flag --snps {snpfile} (remember to use the full path).
If your genotype and/or expression data includes more samples than those that are matched based on sample names, it is beneficial to test all possible combinations. This can be done using the command line switch --testall.
If you are running the software in a cluster environment, you can specify the number of threads to use (nrthreads) by appending the command above with the command line switch --threads nrthreads (nrthreads should be an integer).
Note that the --gte, --threads and --snps flags are optional; --gte only applies if you are using a genotypeexpressioncoupling file.
Check your output
MixupMapper is a two-stage approach. Therefore the default procedure creates two directories in the outdir you specified: cis-eQTLs and MixupMapping. Both directories contain a different set of output files, described below.
cis-eQTLs directory
This directory contains the output of a default cis-eQTL mapping approach. However, we are not interested in this folder; these files are only necessary to create the actual MixupMapper output, which can be found in the MixupMapping directory.
MixupMapping directory
File | Description |
---|---|
Heatmap.pdf | This file contains a visualization of overall Z-scores per assessed pair of samples. The genotyped samples are plotted on the X-axis, and the expression samples are plotted on the Y-axis. The brightness of each box corresponds to the height of the overall Z-score, with lower values having brighter colours. Samples are sorted alphabetically on both axes. |
BestMatchPerGenotype.txt | This file shows the best matching trait (i.e. expression) samples per genotype: the result matrix (MixupMapperScores.txt) is not symmetrical. Therefore scanning for the best sample per genotype may yield other results than scanning for the best sample per trait. |
BestMatchPerTrait.txt | This file shows the best matching genotype sample per trait sample. |
MixupMapperScores.txt | This file contains a matrix showing the scores per pair of samples (combinations of traits and genotypes). |
In the BestMatchPerGenotype.txt and BestMatchPerTrait.txt files, you can find the best matching trait sample for each genotyped sample and vice versa:
- 1st column = genotyped sample ID, or trait sample ID dependent on file chosen (see above)
- 2nd column = trait sample originally linked to genotype sample ID in column 1, or genotype sample originally linked with trait sample in column 1
- 3rd column = the MixupMapper Z-score for the link between the samples in column 1 and 2
- 4th column = best matching trait (for BestMatchPerGenotype.txt) or best matching genotype (for BestMatchPerTrait.txt)
- 5th column = the MixupMapper Z-score for the link between the samples in column 1 and 4
- 6th column = this column determines whether the best matching trait or genotype is identical to the sample found in column 2
Example of BestMatchPerGenotype.txt
Trait | OriginalLinkedGenotype | OriginalLinkedGenotypeScore | BestMatchingGenotype | BestMatchingGenotypeScore | Mixup |
---|---|---|---|---|---|
Sample1 | Sample1 | -11.357 | Sample1 | -11.357 | false |
Sample2 | Sample2 | -15.232 | Sample2 | -15.232 | false |
Sample3 | Sample3 | -3.774 | Sample4 | -6.341 | true |
Sample4 | Sample4 | -3.892 | Sample3 | -12.263 | true |
In this example, two mix-ups are identified. If you don't have any mix-ups you are done; otherwise you need to resolve or remove these mix-ups. Please check this separate document for instructions on fixing mix-ups.
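For large cohorts it can be convenient to count the flagged samples from the command line. A minimal sketch, assuming BestMatchPerGenotype.txt is tab-delimited with a header row and the true/false mix-up flag in the last column, as in the example above:
# count how many genotyped samples are flagged as potential mix-ups
awk -F'\t' 'NR > 1 && $NF == "true"' BestMatchPerGenotype.txt | wc -l
# list the flagged samples
awk -F'\t' 'NR > 1 && $NF == "true" {print $1}' BestMatchPerGenotype.txt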
Always rerun MixupMapper after removing or changing sample IDs and check that you actually fixed/removed the mix-ups instead of creating more!
✅ Please upload your log files to the FTP server.
Step 5 - Removing principal components
Prior to QTL mapping, you will need to remove the first 20 expression components to increase the power to detect cis- and trans-eQTLs. Because we have seen that some PCs explain genetic variation, we perform our analysis on data adjusted only for PCs which explain no genetic variation. To identify the PCs that have no genetic association, we perform a QTL mapping on the principal component eigenvector matrix (PCAOverSamplesEigenvectorsTransposed.txt.gz). We call PCs genetically associated when they have an FDR of 0 (thus selecting significantly affected components only).
Data checklist
- Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
- Determine the full path of the expression data you want to use, the file created in step 3 before PC correction. The file name ends with "*.CovariatesRemoved.txt.gz".
- (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
- Find a location on your hard drive to store the output. We will refer to this directory as outputdir.
As with MixupMapper, SNPs will only be tested with a minor allele frequency of > 0.05, a Hardy-Weinberg P-value > 0.001 and a call-rate > 0.95. Examples of these files can be found here: resources
❗ Important: We fixed a bug which caused an error during this step. Please make sure that you use at least version 1.2.4B of the QTL mapping pipeline, if necessary please download the new version from the locations mentioned in chapter 1.
Commands to be issued
Use the following command to run the mapping on the eigenvectors and to subsequently remove the PCs which are not under genetic control:
java -Xmx20g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --metamode nongeneticPcaCorrection --in {TriTyper location} --inexp {traitfile} --gte {gtefile} --out {outdir} --adjustPCA --stepsizepcaremoval 20 --maxnrpcaremoved 20 --nreqtls 500000 2>&1 | tee {outdir}/removeExpressionComponents.log
Note that the --gte switch is required for this step, so you do need a genotypeexpressioncoupling file. Please make sure any sample mix-ups are resolved in this file. If the sample identifiers in your expression and genotype data are identical, you can simply create one using the command below.
awk 'BEGIN {OFS="\t"}{print $1,$1}' < Individuals.txt > gte.txt
✅ Please upload your log file to the FTP server.
Check your output
After running the nongeneticPcaCorrection command, the outdir will contain the output of a full QTL mapping on the eigenvectors of your dataset. At the location of your input expression data the program has written a new PC-corrected expression file, which is corrected for all components except the ones under genetic control (FDR 0). So if the process has finished successfully there should be a new file ending with 20PCAsOverSamplesRemoved-GeneticVectorsNotRemoved.txt.gz. This is the traitfile which will be used in the subsequent cis-QTL mapping.
Step 6 - Performing the cis-eQTL analysis
❗ Please note the following
- For the steps below make sure to use version 1.2.4D, 1.2.4E or 1.2.4F of the eQTL mapping software.
The cis-QTL mapping can be performed using a single command. Standard settings for the cis-eQTL mapping are: HWEP > 0.0001, MAF > 0.01, and Call Rate > 0.95. For cis-eQTL mapping, the maximum distance between the SNP and the middle of the probe is 1,000,000 bp. To control for multiple testing, we perform 10 permutations, thereby shuffling sample labels to calculate the false discovery rate (FDR) controlled at 0.05.
Data checklist
- Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
- Determine the full path of the trait data you want to use, created in step 5. We will refer to this path as traitfile.
- Locate your expression probe annotation file: annotationfile. Also note down the platformidentifier. Affymetrix U219: "U219", Affymetrix HuEx v1.0 ST: "HuEx_v1". Please contact us if the annotation file for your platform is missing.
- (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
- Find a location on your hard drive to store the output. We will refer to this directory as outputdir.
Commands to be issued
The command to be issued for this cis-QTL analysis:
java -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --mode metaqtl --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --cis --binary --perm 10 --maf 0.01 2>&1 | tee cisEQtlMapping.log
If you are running the software in a cluster or multithreaded environment, you can specify the number of threads to use (nrthreads) by appending the command above with the command line switch --threads nrthreads (nrthreads should be an integer).
Note that you should apply the --gte switch if you are using a genotypeexpressioncoupling file.
QTL Mapping output - Binary mode
This section describes the binary output files generated using the --binary command line switch.
File | Description |
---|---|
Dataset.dat | Zscore matrix for the data analysis. |
Dataset-ColNames.txt.gz | Column names for the Zscore matrix. |
Dataset-RowNames.txt.gz | Row names for the Zscore matrix. |
Dataset-PermutationRound-n.dat | Zscore matrix for the permuted data analysis |
Dataset-PermutationRound-n-ColNames.txt | Column names for the permuted Zscore matrix. |
Dataset-PermutationRound-n-RowNames.txt | Row names for the permuted Zscore matrix. |
excludedSNPsBySNPFilter.txt.gz | SNPs excluded by the user. |
excludedSNPsBySNPProbeCombinationFilter.txt.gz | SNPs excluded because no gene expression probe is located within 250,000 bp of the SNP position. |
ProbeQCLog.txt.gz | List of probes excluded from analysis. |
SNPQCLog.txt.gz | QC information for all tested SNPs, including MAF, HWE-pvalue and call-rate (which is always 1.0 for imputed SNPs). |
UsedSettings.txt | The settings used for the eQTL mapping. |
cisEQtlMapping.log | Log file of the mapping. |
✅ Please upload all of the above-mentioned result files.
Checking SNPs with proxies within a probe
To make sure that you can later validate that an eQTL is not an artefact of proxy SNPs within the probe binding region, we run the command below. The validCombinations file contains all SNP-probe combinations that are safe to analyse. Make sure you use the same mapping file for the SNP-in-probe detection and the cis-eQTL mapping!
java -Xmx10g -Xms10g -jar eqtl-mapping-pipeline.jar --metamode determineSnpProbList -dp 0.2 -r 0.2 -ri {genotypedir} -rt TRITYPER -pm 1 -ws 1000000 -i {annotationfile} -o {outdir}/validCombinations.txt 2>&1 | tee validCombinations.log
✅ Please upload the validCombinations file and the log file.
🎉 Cis-eQTL mapping completed
Step 7 - Conducting conditional cis-eQTL analysis
To maximize our power to identify distal (trans-) effects, we want to regress out all the independent cis-eQTL effects. To do so, we perform a conditional cis-eQTL analysis, which enables us to identify secondary/tertiary/quaternary/... cis-eQTLs. This is done by conducting a series of cis-eQTL analyses and, in each analysis round, regressing out all the significant cis-eQTL effects from the previous rounds. All independent cis-eQTL effects will be regressed out in the trans phase (Steps 8-9).
Please make a separate directory for the conditional analysis (conditional_output_dir). The three files necessary for the conditional analysis are available on cher-ami under Affymetrix/scripts/conditional_analysis/; please copy them to the conditional analysis directory.
File: Affymetrix/scripts/conditional_analysis/template.xml
This file is the settings template for the cis-eQTL analysis.
File: Affymetrix/scripts/conditional_analysis/generate_regress_out_scripts.sh
This script will automatically generate new scripts and config files if another round of cis-mapping is necessary.
File: Affymetrix/scripts/conditional_analysis/script_run_cond_analysis.sh
This script will check how many conditional cis-eQTL rounds are necessary and will start generate_regress_out_scripts.sh in each new round. For each round, a separate output folder is made (output_round*_RegressedOut). The result file (eQTLProbesFDR0.05-ProbeLevel.txt) of the last round should no longer contain any significant results.
Use this command to run the analysis in the conditional analysis directory:
sh script_run_cond_analysis.sh {folder with eqtl-mapping-pipeline.jar} {traitfile} {genotypedir} {conditional_output_dir} {annotationfile} {genotypeexpressioncoupling} 2>&1 | tee {conditional_output_dir}/conditional_cis_mapping.log
Please use the same input files as in previous cis-eQTL mapping.
❗ Important: When using a job scheduler, please request at least 120 GB of memory for this job! Please contact urmo.vosa @ gmail.com and anniqueclaringbould @ gmail.com in case of errors or overly long run times. The first round of cis-eQTL mapping should not take much more than ~40 hours, and you can monitor the progress via the output_round*.log written in each output folder.
❗ Important: Please make sure to use this exact order of arguments. The conditional_output_dir should be the directory in which you are running this command. The genotype to expression coupling file is optional: if you do not have one, you can submit the job without that argument. Please also ensure that none of the folder paths in the command above has a trailing slash.
These analyses will result in one file with all independent cis-eQTLs that can be found in your dataset. You will later use this file to regress out cis-eQTLs before mapping trans-eQTLs.
When the job is finished, please go to the folder of the last round and check that there are no results in eQTLsFDR0.05-ProbeLevel.txt.
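A small sketch for this check from the command line, assuming the round output folders follow the output_round*_RegressedOut naming described above:
# pick the last conditional round and count the remaining significant results (should be 0)
LAST_ROUND=$(ls -d output_round*_RegressedOut | sort -V | tail -n 1)
awk 'NR > 1' "${LAST_ROUND}/eQTLsFDR0.05-ProbeLevel.txt" | wc -l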
✅ Please upload the log files created in the conditional_output_dir: conditional_cis_mapping.log, cond_eQTLs_overview.log and toRegressOut.log.
A regress-out file named Regress_cis_eQTLs.txt will be created in the conditional_output_dir. This file will be used in Step 8 for regressing out the cis-eQTLs before trans-eQTL mapping.
Step 8 - Performing the trans-eQTL analysis
In the trans-eQTL mapping phase we will use the dataset processed by the previous steps of the pipeline.
- If your cohort was part of the cis-eQTL mapping phase please use the same genotype data and preprocessed expression data as in Step 6.
Standard settings for the trans-eQTL mapping are: HWEP > 0.0001, MAF > 0.01, and Call Rate > 0.95. For trans-eQTL mapping, the minimum distance between the SNP and the middle of the probe equals 5,000,000bp. To control for multiple testing, we perform 10 permutations, thereby shuffling sample labels, to calculate the false discovery rate (FDR) at 0.05.
We have found that removing cis effects greatly enhances the power to detect trans effects. Consequently, our software provides a way to correct your traitfile data for cis effects. Note down the full path of the file containing the cis-QTL effects to be removed. We will refer to this file as cisqtlfile. The format of this file is identical to the QTL file.
Due to limitations on computational power and file sizes, we test only a preselected list of SNPs in this phase, consisting of the EBI GWAS Catalogue and ImmunoBase (all GWAS P <= 5e-8). We will refer to this file as snpsFile.
❗ Please note the following
- For the steps below make sure to use version 1.2.4F of the eQTL mapping software.
- As for probe annotation files, please use the same Affymetrix annotation files as in the cis-eQTL analysis (the ProbeAnnotation8 files available under the eQTLGen account on the cher-ami server, Affymetrix/AnnotationFiles/).
Data checklist
- Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
- Determine the full path of the trait data you want to use, created in step 5. We will refer to this path as traitfile.
- Locate your expression probe annotation file: annotationfile. Also note down the platformidentifier. Affymetrix U219 gene level: "Affymetrix_U219_GeneLevel", Affymetrix U219 exon level: "Affymetrix_U219_ExonLevel". Please contact us if the annotation file for your platform is missing.
- (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
- Use the results of your conditional cis-eQTL analysis to regress out the cis-eQTL effects. This file is located in {conditional_output_dir}/Regress_cis_eQTLs.txt. We will refer to this file as cisqtlfile.
- Use the list of preselected SNPs for the trans-eQTL analysis. This file is referred to as snpsFile and is available under the umcg-eqtlgen account on the cher-ami server (trans_eQTL/HelpFiles/ListOfGeneticRiskFactors_20161121.txt).
- Find a location on your hard drive to store the output. We will refer to this directory as outdir.
Commands to be issued
Use the following command to run the trans-eQTL analysis:
java -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --mode metaqtl --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --perm 10 --maf 0.01 --snps {snpsFile} --regressouteqtls {cisqtlfile} 2>&1 | tee {outdir}/transEQtlMapping.log
❗ While you are waiting for the trans-eQTL analysis to complete, you can already proceed with calculating the genetic risk scores in Step 9A.
QTL Mapping output - Binary mode
This section describes the binary output files generated using the --binary command line switch.
File | Description |
---|---|
Dataset.dat | Zscore matrix for the data analysis. |
Dataset-ColNames.txt.gz | Column names for the Zscore matrix. |
Dataset-RowNames.txt.gz | Row names for the Zscore matrix. |
Dataset-PermutationRound-n.dat | Zscore matrix for the permuted data analysis. |
Dataset-PermutationRound-n-ColNames.txt | Column names for the permuted Zscore matrix. |
Dataset-PermutationRound-n-RowNames.txt | Row names for the permuted Zscore matrix. |
excludedSNPsBySNPFilter.txt.gz | SNPs excluded by the user. |
excludedSNPsBySNPProbeCombinationFilter.txt.gz | SNPs excluded because no gene expression probe is located within 250,000 bp of the SNP position. |
ProbeQCLog.txt.gz | List of probes excluded from analysis. |
SNPQCLog.txt.gz | QC information for all tested SNPs, including MAF, HWE-pvalue and call-rate (which is always 1.0 for imputed SNPs). |
UsedSettings.txt | The settings used for the eQTL mapping. |
transEQtlMapping.log | Log file of the trans mapping. |
check.md5 | md5 file Zscore matrix. |
check-PermutationRound-n.md5 | md5 file for permuted Zscore matrix. |
✅ Please upload all of the above-mentioned result files.
This step writes the file ExpressionData.SampleSelection.QuantileNormalized.Log2Transformed.ProbesCentered.SamplesZTransformed.CovariatesRemoved.20PCAsOverSamplesRemoved-GeneticVectorsNotRemoved.txt.gz-EQTLEffectsRemoved.txt.gz to the same directory as your input expression data. The cis-eQTL effects have been removed from this expression file, and it is the expression input for the next step.
❗ If this file was not written, please check whether you used the correct version 1.2.4F of QTL mapping pipeline.
Step 9 - Performing the Polygenic Risk Score eQTL analysis
In this part of the pipeline we will calculate polygenic risk scores for all the individuals in your eQTL dataset, using publicly available GWAS summary statistics. We will then correlate those risk scores with the expression levels of all genes to identify novel trait-associated "hub" genes and pathways.
Step 9A
The software for calculating genetic risk scores is available under the umcg-eqtlgen account on the cher-ami server (PRS_QTL/GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT). Please make sure to use v0.1.2c of the software. Usage instructions can be viewed with the command:
java -jar GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT/GeneticRiskScoreCalculator.jar
NB! This tool has been tested and works with Java versions 7 and 8; please make sure that you are using one of those!
The command for PRS calculation:
java -Xmx40g -Xms40g -jar GeneticRiskScoreCalculator-0.1.2c-SNAPSHOT/GeneticRiskScoreCalculator.jar -gi {InputTriTyperFolder} -gt TRITYPER -i {SummaryStatisticsFolder} -o {OutputFolder} -p 5e-8:1e-5:1e-4:1e-3:1e-2 -r 0.1 -w 250000:5000000 -er 6:25000000-35000000 2>&1 | tee {OutputFolder}/PRSCalculation.log
- NB! If you rerun the job, please first remove the output folder created by the previous run!
✅ Please upload PRSCalculation.log
Arguments:
-gi - this is your genotype folder constructed in Step 2 (InputTriTyperFolder), meaning it has to be harmonized to match GIANT 1000G p1v3.
-i - this is the folder containing the gzipped GWAS summary statistic files (SummaryStatisticsFolder); it is available under the cher-ami eQTLGen account (PRS_QTL/SummaryStatisticsDatabase_freeze1b/). These files are preprocessed to match GIANT 1000G p1v3.
❗ Please note the following! Most of the datasets included were publicly available. The Terms and Conditions, when available on the corresponding web page of the respective GWAS consortium, were checked and adhered to. However, these files are meant to be used in this project only.
Therefore, if you use these summary statistics in any project other than eQTLGen, it is your responsibility to make sure that you:
1. Adhere to the Terms and Conditions indicated on the respective consortium web page.
2. Correctly cite the respective study in all resulting publications.
-o - the output folder of the genetic risk score calculation. It includes a TT folder which contains the scaled risk scores in TriTyper format. This folder will be used as input for Step 9B, analogously to trans-eQTL mapping.
-p 5e-8:1e-5:1e-4:1e-3:1e-2 - this flag specifies the GWAS significance thresholds used for constructing the risk scores. For this project we construct five risk scores per trait (GWAS P: 5e-8, 1e-5, 1e-4, 1e-3, 1e-2).
-r 0.1 - this flag specifies the R2 threshold for clumping.
-w 250000:5000000 - this flag specifies the clumping strategy. The top SNP is defined by the lowest P-value, or by the highest absolute effect size if several SNPs have equal P-values. In the first round of clumping, all SNPs in LD (R2 > 0.1) within 250 kb of the top SNP are removed. The second round of clumping removes all remaining SNPs in LD (R2 > 0.1) within an extended distance of 5 Mb.
-er 6:25000000-35000000 - this flag removes SNPs in the HLA region (chr6:25Mb-35Mb) from the calculation of the genetic risk scores.
Step 9B
In this step we perform the PRS-eQTL analysis on the PC-corrected expression matrix.
Data checklist
- Note down the full path to your TriTyper PRS data directory, created in step 9A. We will refer to this directory as PRSdir.
- Determine the full path of the expression data you want to use, created in step 8. We will refer to this path as traitfile_cis_removed.
- Locate your annotationfile. Also note down the platformidentifier, which is written in the first column of the annotationfile. Affymetrix U219 gene level: "Affymetrix_U219_GeneLevel", Affymetrix U219 exon level: "Affymetrix_U219_ExonLevel". Please contact us if the annotation file for your platform is missing.
- (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
- Find a location on your hard drive to store the output. We will refer to this directory as outputdir.
Commands to be issued
Use the following command to run the PRS-eQTL mapping:
java -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline-1.2.4F-SNAPSHOT/eqtl-mapping-pipeline.jar --mode metaqtl --in {PRSdir} --out {outdir} --inexp {traitfile_cis_removed} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --hwe 0 --maf 0 --perm 10 2>&1 | tee {outdir}/PRSeQtlMappingAdjPC.log
The output files are named in the same way as described in the Step 8 binary output file description.
✅ Please upload all the files in the output directory.
Step 9C
In this step we perform the trans-eQTL and PRS-eQTL analyses on the expression matrix that has not been corrected for the first 20 PCs. We are interested in these results to evaluate the effect of the principal component correction on the observed eQTLs.
trans-eQTL mapping on PC-uncorrected expression data
For that, we first have to rerun the trans-eQTL analysis, this time using the output expression matrix from Step 3 ending with *.CovariatesRemoved.txt.gz. By default this file is in the same folder where the PC-corrected expression matrix was written.
Data checklist
- Note down the full path to your TriTyper genotype data directory, created in step 2. We will refer to this directory as genotypedir.
- Determine the full path of the trait data you want to use, created in step 3. We will refer to this path as traitfile.
- Locate your annotationfile. Also note down the platformidentifier, which is written in the first column of the annotationfile. Affymetrix U219 gene level: "Affymetrix_U219_GeneLevel", Affymetrix U219 exon level: "Affymetrix_U219_ExonLevel". Please contact us if the annotation file for your platform is missing.
- (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
- Use the results of your conditional cis-eQTL analysis to regress out the cis-eQTL effects. This file is located in {conditional_output_dir}/Regress_cis_eQTLs.txt. We will refer to this file as cisqtlfile.
- Use the list of preselected SNPs for the second trans-eQTL analysis. This file is referred to as snpsFile and is available under the umcg-eqtlgen account on the cher-ami server (trans_eQTL/HelpFiles/ListOfGeneticRiskFactors_20161121.txt).
- Find a location on your hard drive to store the output. We will refer to this directory as outdir.
Use the following command to run the trans-eQTL analysis:
java -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline.jar --mode metaqtl --in {genotypedir} --out {outdir} --inexp {traitfile} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --perm 10 --maf 0.01 --snps {snpsFile} --regressouteqtls {cisqtlfile} 2>&1 | tee {outdir}/NonPcTransEQtlMapping.log
✅ Please upload all the files in outdir, including NonPcTransEQtlMapping.log.
This step writes the file ExpressionData.SampleSelection.QuantileNormalized.Log2Transformed.ProbesCentered.SamplesZTransformed.CovariatesRemoved.txt.gz-EQTLEffectsRemoved.txt.gz to the same directory as your input expression data. The cis-eQTL effects have been removed from this expression file, and it is the expression input for the following command.
PRS-eQTL mapping on PC-uncorrected expression data
We then run the PRS-eQTL analysis again, using the expression matrix not corrected for PCs as input.
Data checklist
- Note down the full path to your TriTyper PRS data directory, created in step 9A. We will refer to this directory as PRSdir.
- Determine the full path of the expression data you want to use, created in step 9C. We will refer to this path as traitfile_PCs_not_removed_cis_removed.
- Locate your annotationfile. Also note down the platformidentifier, which is written in the first column of the annotationfile. Affymetrix U219 gene level: "Affymetrix_U219_GeneLevel", Affymetrix U219 exon level: "Affymetrix_U219_ExonLevel". Please contact us if the annotation file for your platform is missing.
- (Optional) Locate your genotypeexpressioncoupling if you have such a file. You can also use this file to test specific combinations of genotype and expression individuals.
- Find a location on your hard drive to store the output. We will refer to this directory as outputdir.
Commands to be issued
Use the following command to run the second PRS-eQTL mapping:
java -Xmx40g -Xms20g -XX:StringTableSize=10000019 -XX:MaxPermSize=512m -jar eqtl-mapping-pipeline-1.2.4F-SNAPSHOT/eqtl-mapping-pipeline.jar --mode metaqtl --in {PRSdir} --out {outdir} --inexp {traitfile_PCs_not_removed_cis_removed} --inexpplatform {platformidentifier} --inexpannot {annotationfile} --trans --binary --hwe 0 --maf 0 --perm 10 2>&1 | tee {outdir}/PRSeQtlMappingNotAdjPC.log
✅ Please upload all the files in the output directory.
Step 10 - Uploading your result files
When you have finished the analyses, please upload the data to your guest account on the server (named umcg-guest[0-9]; see Getting the account to the sharing server). Please upload all the output files, .log files and .md5 files from Step 8, Step 9A, Step 9B and Step 9C.
If you analysed several datasets, please organise your analysis results into separate folders (e.g. LIFEa1, LIFEb3, etc.). Within each dataset folder, please organise the results into subfolders (e.g. PRS_eQTL_results_PC_corrected and PRS_eQTL_results_PC_uncorrected).
An example of the folder structure is as follows:
|--LIFEa1
|  |--trans_eQTL_results
|  |  |--trans_eQTL_results_PC_corrected
|  |  |--trans_eQTL_results_PC_uncorrected
|  |--PRS_eQTL_results
|  |  |--PRS_eQTL_results_PC_corrected
|  |  |--PRS_eQTL_results_PC_uncorrected
|--LIFEb3
|  |--trans_eQTL_results
|  |  |--trans_eQTL_results_PC_corrected
|  |  |--trans_eQTL_results_PC_uncorrected
|  |--PRS_eQTL_results
|  |  |--PRS_eQTL_results_PC_corrected
|  |  |--PRS_eQTL_results_PC_uncorrected
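To prepare this layout locally before uploading, a minimal sketch is shown below; the dataset name is an example, so use your own cohort identifiers:
# create the local result folder structure for one dataset
mkdir -p LIFEa1/trans_eQTL_results/trans_eQTL_results_PC_corrected
mkdir -p LIFEa1/trans_eQTL_results/trans_eQTL_results_PC_uncorrected
mkdir -p LIFEa1/PRS_eQTL_results/PRS_eQTL_results_PC_corrected
mkdir -p LIFEa1/PRS_eQTL_results/PRS_eQTL_results_PC_uncorrected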
The upload instructions are here: http://wiki.gcc.rug.nl/wiki/DataSharing
If the instructions below do not work for your cluster setup, please consult the extended manual: https://github.com/molgenis/systemsgenetics/wiki/Using-sharing-server-for-eQTLGen-analyses
## one possibility is to use lftp, if this is installed in your environment
## use these commands to upload/download data from cher-ami server
# go to the local folder where data is stored:
cd local/folder/with/your/data/
# start lftp
lftp
# connect with your guest account:
lftp :~> open -u [your_guest_accountname],none -p 22 sftp://cher-ami.hpc.rug.nl
# make remote upload directories and subdirectories for trans-eQTL and PRS-QTL results. e.g:
lftp :~> mkdir LIFEa1
lftp :~> mkdir LIFEa1/trans_eQTL_results
lftp :~> mkdir LIFEa1/trans_eQTL_results/trans_eQTL_results_PC_corrected
lftp :~> mkdir LIFEa1/trans_eQTL_results/trans_eQTL_results_PC_uncorrected
lftp :~> mkdir LIFEa1/PRS_eQTL_results
lftp :~> mkdir LIFEa1/PRS_eQTL_results/PRS_eQTL_results_PC_corrected
lftp :~> mkdir LIFEa1/PRS_eQTL_results/PRS_eQTL_results_PC_uncorrected
# go to the folder where you upload the results. e.g:
lftp :~> cd LIFEa1/trans_eQTL_results
# use put command for uploading the individual files and/or mirror command for uploading whole directories
lftp :~> put individual_result_file_in_local_server
lftp :~> mirror -R result_folder_in_local_server
# exit from the remote server when uploaded all necessary files:
lftp :~> exit
- Please send an additional e-mail to urmo.vosa @ gmail.com when the upload has finished.
- We will do a sanity check and you will get feedback on whether the upload of the data succeeded. By default, your access to cher-ami will be active for three months, but it can be extended if needed.
NB! In case of issues with uploading, please contact the GCC Helpdesk with cc: urmo.vosa @ gmail.com.
🎉 You are done with the trans-eQTL and PRS-eQTL analysis.