Prediction of gene expression

This script is designed to predict allele-specific expressian and total gene expression using allelic fold change (aFC). See the manuscript for method description.

More information on estimating aFCs, could be found here.

Inputs

aFC file

In the data folder there is a file named aFC_Whole_Blood.txt. This is a tab delimited file providing allelic fold change for each eQTL. In this file for each variant in columnn "variant_id" we have the following information:

gene_id : gene associated to that variant
rank : rank of that variant in eQTL calling process
CHROM : the chromosome of the variant
POS : position of that variant
REF : The Reference allele
ALT : The Alternative allele
log2_aFC : The effect size of the variant

VCF file

A phased vcf file is required to extract the genotypes of individuals. A sample VCF file named "vcf_sample.txt" could be found in the data folder.

Genotype, encoded as allele values separated by either of / or |. " /" means genotype unphased and "|" means genotype phased. The allele values are 0 for the reference allele (what is in the REF field), 1 for the allele listed in ALT. For diploid calls examples could be 0/1, 1|0.

The REF and ALT information should match the REF and ALT information in aFC_Whole_Blood.txt

If a tabix index is generated for the vcf file, in order to extract the subset of the vcf file you can use the following command in which "IDs.txt" repesents the required variants. A sample of this file is available in the data folder.

for line in $(cat ~/data/IDs.txt)
    do
    id=`echo ${line} | sed 's/\"/\t/g'`
    chr=`echo ${id} | sed 's/_/\t/g' | cut -f 1`
    pos=`echo ${id} | sed 's/_/\t/g' | cut -f 2`
    echo ${id}
    echo ${chr}
    echo ${pos}
    tabix vcf_file ${chr}:${pos}-${pos} | grep -w $id >> vcf_sample.txt
    done

Here is an example to predict the expression of a specific gene for an individual using these inputs

# aFC file
aFC_file = "data/aFC_Whole_Blood.txt"

#VCF sample file
VCF_file = "data/sample_vcf.txt"


# Name of gene that we want to predict its expression
gene_id = "gene_id"

# Name of individual that we want to predict its expression
individual_id = "individual_id"

#########################

expression_prediction_gene_individual<-function(aFC_file,vcf_file,gene_id,individual_id){
    # read aFC file
    aFC_df=read.table(aFC_file, header=TRUE, sep="\t")

    #read VCF file 
    genotype_info= read.table(vcf_file, header=TRUE, sep = "\t")
   

    # get the afc vector for a specific gene 
    # the function definition is available in R folder
    aFC_vector<-aFC_gene_vector(gene_id,aFC_df)
   

    # get the afc vector for a specific gene 
    # the function definition is available in R folder
    variant_vector<-variant_gene_vector(gene_id,aFC_df)


    #null genotypes
    genotype_h1 = c()
    genotype_h2 = c()

# assuming variant and individual_id exist in vcf file
    for (variant in variant_vector){
        
        genotype<-genotype_info[genotype_info$ID == variant,individual_id]      
        genotype_split_full <-unlist(strsplit(as.character(genotype),":"))[1]
        genotype_split<-unlist(strsplit(as.character(genotype),"|"))
        genotype_h1<-c(genotype_h1,genotype_split[1])
        genotype_h2<-c(genotype_h2,genotype_split[3])
    }

    
    #get the expression output vector for two haplotypes
    # the function definition is available in R folder
    #output[1] : log expression for haplotype 1 in log2 scale
    #output[2] : gene expression for haplotype 2 in log2 scale
    #output[3] : total expression in log2 scale
    result<-gene_expression_estimation(as.numeric(genotype_h1),as.numeric(genotype_h2),aFC_vector)
    
    return(result)
    }

Resources

R/gene_expression_estimation_functions.ipynb

This notebook provides functions in R used in the above example that will read the aFC file and predict the gene expression.

R/gene_expression_lookupTable.R

This R script gets a sorted aFC file (sorted based on gene_id), counts the number of variants for each gene and produces lookup tables representing expression values for all genotypes. The lookup tables are generated based on the number of the eQTLs for each gene, thus "lookupTable_variantNo_2" includes genes with two eQTLS. To run the script use the following command:

    Rscript gene_expression_lookupTable.R data\aFC_Whole_Blood.txt

python/ASE_prediction.ipynb

This script uses the lookup tables to predict expression for each haplotype, reading the individual genotypes from vcf file.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prediction of gene expression

Inputs

aFC file

VCF file

Resources

R/gene_expression_estimation_functions.ipynb

R/gene_expression_lookupTable.R

python/ASE_prediction.ipynb

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 108 Commits
R		R
data		data
python		python
README.md		README.md

navaehsan/gene_expression_prediction_1

Folders and files

Latest commit

History

Repository files navigation

Prediction of gene expression

Inputs

aFC file

VCF file

Resources

R/gene_expression_estimation_functions.ipynb

R/gene_expression_lookupTable.R

python/ASE_prediction.ipynb

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages