This script is designed to predict allele-specific expressian and total gene expression using allelic fold change (aFC). See the manuscript for method description.
More information on estimating aFCs, could be found here.
In the data folder there is a file named aFC_Whole_Blood.txt. This is a tab delimited file providing allelic fold change for each eQTL. In this file for each variant in columnn "variant_id" we have the following information:
-
gene_id : gene associated to that variant
-
rank : rank of that variant in eQTL calling process
-
CHROM : the chromosome of the variant
-
POS : position of that variant
-
REF : The Reference allele
-
ALT : The Alternative allele
-
log2_aFC : The effect size of the variant
A phased vcf file is required to extract the genotypes of individuals. A sample VCF file named "vcf_sample.txt" could be found in the data folder.
Genotype, encoded as allele values separated by either of / or |. " /" means genotype unphased and "|" means genotype phased. The allele values are 0 for the reference allele (what is in the REF field), 1 for the allele listed in ALT. For diploid calls examples could be 0/1, 1|0.
The REF and ALT information should match the REF and ALT information in aFC_Whole_Blood.txt
If a tabix index is generated for the vcf file, in order to extract the subset of the vcf file you can use the following command in which "IDs.txt" repesents the required variants. A sample of this file is available in the data folder.
for line in $(cat ~/data/IDs.txt)
do
id=`echo ${line} | sed 's/\"/\t/g'`
chr=`echo ${id} | sed 's/_/\t/g' | cut -f 1`
pos=`echo ${id} | sed 's/_/\t/g' | cut -f 2`
echo ${id}
echo ${chr}
echo ${pos}
tabix vcf_file ${chr}:${pos}-${pos} | grep -w $id >> vcf_sample.txt
done
Here is an example to predict the expression of a specific gene for an individual using these inputs
# aFC file
aFC_file = "data/aFC_Whole_Blood.txt"
#VCF sample file
VCF_file = "data/sample_vcf.txt"
# Name of gene that we want to predict its expression
gene_id = "gene_id"
# Name of individual that we want to predict its expression
individual_id = "individual_id"
#########################
expression_prediction_gene_individual<-function(aFC_file,vcf_file,gene_id,individual_id){
# read aFC file
aFC_df=read.table(aFC_file, header=TRUE, sep="\t")
#read VCF file
genotype_info= read.table(vcf_file, header=TRUE, sep = "\t")
# get the afc vector for a specific gene
# the function definition is available in R folder
aFC_vector<-aFC_gene_vector(gene_id,aFC_df)
# get the afc vector for a specific gene
# the function definition is available in R folder
variant_vector<-variant_gene_vector(gene_id,aFC_df)
#null genotypes
genotype_h1 = c()
genotype_h2 = c()
# assuming variant and individual_id exist in vcf file
for (variant in variant_vector){
genotype<-genotype_info[genotype_info$ID == variant,individual_id]
genotype_split_full <-unlist(strsplit(as.character(genotype),":"))[1]
genotype_split<-unlist(strsplit(as.character(genotype),"|"))
genotype_h1<-c(genotype_h1,genotype_split[1])
genotype_h2<-c(genotype_h2,genotype_split[3])
}
#get the expression output vector for two haplotypes
# the function definition is available in R folder
#output[1] : log expression for haplotype 1 in log2 scale
#output[2] : gene expression for haplotype 2 in log2 scale
#output[3] : total expression in log2 scale
result<-gene_expression_estimation(as.numeric(genotype_h1),as.numeric(genotype_h2),aFC_vector)
return(result)
}
This notebook provides functions in R used in the above example that will read the aFC file and predict the gene expression.
This R script gets a sorted aFC file (sorted based on gene_id), counts the number of variants for each gene and produces lookup tables representing expression values for all genotypes. The lookup tables are generated based on the number of the eQTLs for each gene, thus "lookupTable_variantNo_2" includes genes with two eQTLS. To run the script use the following command:
Rscript gene_expression_lookupTable.R data\aFC_Whole_Blood.txt
This script uses the lookup tables to predict expression for each haplotype, reading the individual genotypes from vcf file.