# fwGWAS - Naive Method (baseline model)
**Author:** Jesse Marks <br>
**Date:** September 10, 2018 <br>
**Programming Language:** Python3

This is the first method we are testing in the functional weighting GWAS (fwGWAS) method comparison. We refer to it as the Naive Method or the baseline model. This approach, based on the 2016 Nature Genetics paper by Sveinbjornsson et al., uses a different P-value threshold for each functional category of sequence variant. More specifically, sequence variants are grouped into four categories: <br>
i) **loss-of-function variants** <br>
ii) **moderate-impact variants** <br>
iii) **low-impact variants** <br>
iv) **other** <br>

A different P-value threshold is applied to each of these four categories, based on the variants' functional annotations and their putative functional effects. We use the software SnpEff to annotate the variants.

This notebook details the methods (functions) developed to carry out the Naive Method. A proof of concept has been performed using the data stored at:

`/share/storage/Johnson/fwGWAS/data/SCZ/SCZ2/daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz`

## SnpEff Software
[SnpEff](http://snpeff.sourceforge.net/index.html) is a variant annotation and effect prediction tool that we will be using to perform the sequence variant annotations. It is located at: <br>
`/share/storage/Johnson/software/SnpEff/`.



### Input file
Because SnpEff expects its input in Variant Call Format (VCF), the first step is to convert the GWAS results to VCF. The function we developed to perform the conversion **expects the GWAS results to have very specific header names**. If the input file does not work properly, you may have to either rename the column headers or, less desirably, modify the code in this function to match your headers. Specifically, the function expects the GWAS results to have (at least) the following 6 columns:

* CHR
* SNP
* BP
* A1
* A2 
* P

The VCF file will have the header:

```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
```

The mapping will be:
```
CHR --> #CHROM
BP --> POS
SNP --> ID
A1 --> REF
A2 --> ALT
P --> INFO
```
The `QUAL` and `FILTER` columns in the VCF file carry no information here, so each entry is set to a period (`.`), which is the approach shown in the examples in the SnpEff manual. The final VCF file should look like the following:

```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
22	17675324	rs5748937	T	C	.	.	0.7533
22	17798848	rs77501298	C	G	.	.	0.646
22	17699299	rs5748957	T	G	.	.	0.6269
22	17450765	rs61738794	A	G	.	.	0.7285
```
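
The column mapping above can be sketched for a single row as follows. This is a minimal illustration that assumes whitespace-delimited input; `gwas_row_to_vcf_line` is a hypothetical helper name and is not part of the pipeline functions in this notebook.

```python
def gwas_row_to_vcf_line(header, row):
    """Map one GWAS results row to a tab-delimited VCF data line.

    header and row are lists of strings (hypothetical helper, illustration only).
    """
    col = {name: i for i, name in enumerate(header)}
    fields = [
        row[col['CHR']],  # #CHROM
        row[col['BP']],   # POS
        row[col['SNP']],  # ID
        row[col['A1']],   # REF
        row[col['A2']],   # ALT
        '.',              # QUAL is not used
        '.',              # FILTER is not used
        row[col['P']],    # INFO carries the GWAS P-value
    ]
    return '\t'.join(fields)


header = 'CHR SNP BP A1 A2 P'.split()
row = '22 rs5748937 17675324 T C 0.7533'.split()
print(gwas_row_to_vcf_line(header, row))
```

Looking columns up by name rather than by fixed position means extra columns in the GWAS file, or a different column order, do not affect the result.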

##  Main
Execute all functions in sequence.

In [None]:
### Python3 ###
def main():
    """
    Carry out all the functions necessary to
    perform the fwGWAS naive method.
    """
    ### Variables to alter
    data_dir = '/share/storage/Johnson/fwGWAS/data/SCZ/SCZ2/'
    rtvf_in = data_dir + 'daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz'
    processing_dir = '/share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test5/'
    rtvf_out = processing_dir + 'scz2-gw.vcf'
    
    ### DO NOT alter below this line
    ##########################################################
    
    # Convert GWAS results to VCF format
    se_inF = results_to_vcf_format(in_file=rtvf_in,
                                   out_file=rtvf_out)
    se_outF = se_inF[:-4] + '-ann.vcf'

    # Run SnpEff to obtain annotations
    ex_inF = snp_eff(base_dir=processing_dir, 
                     in_file=se_inF, 
                     out_file=se_outF)
    ex_outF = ex_inF[:-4] + '-cleaned'

    # Extract annotations, IDs, and P-values
    ga_inF = extract_ann(in_file=ex_inF,
                         out_file=ex_outF)
    ga_outF = ga_inF + '-grouped'
    
    # Group variants into the four function annotation categories
    ff_inF = group_annotations(in_file=ga_inF,
                               out_file=ga_outF)
    ff_outF = ff_inF + '-filtered-fw'
    
    # Filter using the four fw-thresholds
    rs_inF1 = filter_fw(in_file=ff_inF,
                        out_file=ff_outF)
    
    fs_outF = ff_inF + '-filtered-std'
    
    # Filter using the standard GWAS threshold (5e-8)
    rs_inF2 = filter_standard(in_file=ga_outF, 
                             out_file=fs_outF)
    rs_out = rtvf_out[:-4] + '-results-summary'
    
    # Get results summary
    results_summary(in_file=rs_inF1,
                    in_file2=rs_inF2, 
                    out_file=rs_out)
    return

**Note:** using the SCZ2 test case (15,358,497 variants), this pipeline ran in ~9min.
```
00:09:00	Logging
00:09:05	Checking for updates...
00:09:08	Done.
```

## Function1&mdash;Convert results to vcf format
Note, a future feature could be to add a catch to report an error if the header names of the input file are incorrect.
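
One way such a header check might look is sketched below. It is an assumption about how the feature could be written, not code from the pipeline; `REQUIRED_COLUMNS` and `check_header` are hypothetical names.

```python
# Column names the conversion relies on (taken from the list in this notebook).
REQUIRED_COLUMNS = ('CHR', 'SNP', 'BP', 'A1', 'A2', 'P')


def check_header(header_line):
    """Return a column-name -> index map; raise ValueError if a column is missing."""
    names = header_line.split()
    missing = [c for c in REQUIRED_COLUMNS if c not in names]
    if missing:
        raise ValueError('Missing required column(s): ' + ', '.join(missing))
    return {c: names.index(c) for c in REQUIRED_COLUMNS}


# Example header with extra columns (made up for illustration).
print(check_header('CHR SNP BP A1 A2 INFO OR SE P'))
```

Raising an error that names the missing columns would tell the user which headers to rename before the conversion starts.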

In [None]:
# in_file = /share/storage/Johnson/fwGWAS/data/SCZ/SCZ2/daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz
# out_file = /share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test4/scz2-gw.vcf
def results_to_vcf_format(in_file, out_file):
    """
    The SnpEff software expects, as input, a file in VCF format. 
    This function performs the conversion of the GWAS results 
    to VCF format so that SnpEff can obtain the annotations.
    
    INPUT:
    in_file - GWAS results file. 
    out_file - Name (and path) of the VCF file to be created.
    
    OUTPUT:
    This function returns a character string of the name and path of out_file.
    """
    import gzip
    try:
        with gzip.open(in_file, 'rt') as inF:
            with open(out_file, 'wt') as outF:
                line = inF.readline()
    
                head_line = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']
                outF.write('\t'.join(head_line) + '\n')
                split_line = line.split()
     
                pval_index = split_line.index('P')
                chr_index = split_line.index('CHR')
                rsid_index = split_line.index('SNP')
                position_index = split_line.index('BP')
                ref_allele_index = split_line.index('A1')
                alt_allele_index = split_line.index('A2')

                # skip the header now
                line = inF.readline()
                while(line):
                    split_line = line.split()
     
                    f1 = split_line[chr_index]
                    f2 = split_line[position_index]
                    f3 = split_line[rsid_index]
                    f4 = split_line[ref_allele_index]
                    f5 = split_line[alt_allele_index]
                    f6 = '.'
                    f7 = '.'
                    f8 = split_line[pval_index]
     
                    vcf_list = [f1, f2, f3, f4, f5, f6, f7, f8]
                    outF.write('\t'.join(vcf_list) + '\n')
                    line = inF.readline()
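    except OSError as err:
        # Raised when the file is missing or unreadable, or is not gzip-compressed.
        print("Could not read {}: {}. The input file must exist and be gzipped.".format(in_file, err))
    return out_file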

## Function2&mdash;Obtain variant annotations with SnpEff 

**Note**: adjust the java memory specification as needed. Default allocation is 2GB; here I specified 8GB.
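
To make the memory setting easier to adjust, the command could be assembled with the heap size as a parameter. The sketch below only builds the argument list and runs nothing; `build_snpeff_command` and `max_heap_gb` are hypothetical names, while the flags are the ones used in the function that follows.

```python
def build_snpeff_command(jar_path, config_path, in_file,
                         max_heap_gb=8, genome='GRCh37.75'):
    """Assemble the SnpEff command as a list of arguments (nothing is executed)."""
    return ['java', '-Xmx{}g'.format(max_heap_gb), '-jar', jar_path,
            '-c', config_path, '-v', '-t', genome, in_file]


snpeff_dir = '/share/storage/Johnson/software/SnpEff/snpEff/'
cmd = build_snpeff_command(jar_path=snpeff_dir + 'snpEff.jar',
                           config_path=snpeff_dir + 'snpEff.config',
                           in_file='scz2-gw.vcf',
                           max_heap_gb=4)
print(' '.join(cmd))
```

A list like this could be passed to `subprocess.run` with `stdout` set to an open file handle, which would replace the `>` redirection in the generated bash script.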

In [None]:
# base_dir = '/share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test4/'
# in_file = base_dir + 'scz2-gw.vcf'
# out_file = base_dir + 'scz2-gw-ann.vcf'
def snp_eff(base_dir, in_file, out_file):
    """
    This function executes the SnpEff software that annotates the
    sequence variants using Genome Build 37 as the reference.
    
    INPUT:
    base_dir - path where results should be saved.
    in_file - name (and path) of VCF file for input to SnpEff
    out_file - name of the output annotated VCF.
    
    OUTPUT:
    This function returns a character string of the name and path of out_file.
    """
    
    import subprocess

    ### DO NOT modify these variables
    ###########################################################################
    config_path = '/share/storage/Johnson/software/SnpEff/snpEff/snpEff.config' 
    snpEff_path = '/share/storage/Johnson/software/SnpEff/snpEff/snpEff.jar'
    snp_eff = base_dir + 'snp-eff.sh'
    # -t for multithreading implies -noStats (speeds process way up)
    command_list = ['java', '-Xmx8g', '-jar', snpEff_path, '-c', config_path, '-v',
                    '-t', 'GRCh37.75', in_file, '>', out_file]
    command_string = ' '.join(command_list)
    ###########################################################################

    # save command as a bash script
    with open(snp_eff, 'w') as outF:
        message = '#!/usr/bin/bash\n'
        message += command_string
        outF.write(message)

    # execute bash script
    run_command = ['bash', snp_eff]
    subprocess.run(run_command)
    return out_file

## Function3&mdash;extract annotation information
We might want to add the rsID field in the future. This would change downstream behavior that we would then need to address.
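
The parsing step can be illustrated on a single INFO value. The sketch assumes the layout this pipeline produces (the P-value first, then SnpEff's `ANN=` entry after a semicolon); `parse_info` is a hypothetical helper and the example string is made up.

```python
def parse_info(info_field):
    """Split 'PVALUE;ANN=...' into the P-value and a list of annotation terms."""
    parts = info_field.split(';')
    pval = parts[0]
    # One comma-separated entry per predicted effect.
    ann_entries = parts[1].split(',')
    # Within an entry the fields are '|'-separated; index 1 is the annotation.
    terms = [entry.split('|')[1] for entry in ann_entries]
    return pval, terms


# Made-up INFO value for illustration only (gene and transcript names are fake).
info = ('0.7533;ANN=C|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1,'
        'C|upstream_gene_variant|MODIFIER|GENE2|GENE2|transcript|TX2')
print(parse_info(info))
```

Because a variant can carry several annotations, the extraction function writes one output line per annotation, so the same unique ID can appear on multiple lines.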

In [None]:
# base_dir = '/share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test4/'
# in_file = base_dir + 'scz2-gw-ann.vcf'
# out_file = base_dir +'scz2-gw-ann-cleaned'
def extract_ann(in_file, out_file):
    """
    Extract the annotation information from the SnpEff
    results VCF file. The unique ID and GWAS P-value
    for each variant are extracted as well.
    
    INPUT: 
    in_file - The name of the file that was output from SnpEff.
              The file should be in vcf format and have in the 
              INFO field the pval+annotation for each variant.
    out_file - Name of the file in which to save the results
               of this function. This file will have the
               following three fields:
          1. unique ID (CHR:POSITION:A1:A2)
          2. sequence variant annotation (e.g. stop-gain)
          3. P-value
    OUTPUT: 
    This function returns a character string of the name and path of out_file.
    
    """
    with open(in_file, 'r') as inF:
        with open(out_file, 'w') as outF:
            line = inF.readline()

            while line.startswith('#'):
                line = inF.readline()
     
            while(line):
                split_line = line.split()
                unique_id = split_line[0] + ':' + split_line[1] + \
                    ':' + split_line[3] + ':' + split_line[4]
    #             rsID = splitLine[2]
                info_field = split_line[7]
                all_annotations = info_field.split(';')
                pval = all_annotations[0]
                functional_annotations = all_annotations[1].split(',')
                for item in functional_annotations:
                    output = unique_id + '\t' + item.split('|')[1] + '\t' + pval
                    outF.write(output + '\n')
                line = inF.readline()  
    return out_file

## Function4&mdash;Group Variants into Four Annotational Categories

Sequence variant annotations in each functional group, as described in the 2016 Nature Genetics paper by Sveinbjornsson et al.:
1. loss-of-function (stop-gain & stop-loss, frameshift indel, donor and acceptor splice-site, and initiator codon variants)
2. moderate-impact (missense, in-frame indel and splice region variants)
3. low-impact (synonymous, 3' and 5' UTR, and upstream and downstream variants)
4. other (all other variants)

**Note**: some variant annotations are classified differently in the Nature Genetics paper than in the SnpEff report on (putative) variant impact. For example, the paper places initiator codon variants in the loss-of-function group, whereas SnpEff annotates their impact as LOW. This could also cause problems for the *other* category: one of these *other* variants might actually be a loss-of-function or moderate-impact variant. Because the threshold for *other* variants is the most stringent, we might filter out such a variant when a more lenient threshold should have been applied. This is just a side note; I have not yet tried to determine whether there are any instances of this, but we should take a closer look.

### SnpEff annotation (Sequence Ontology terms)

<br>

**loss-of-function**
* stop_gained 
* stop_lost
* frameshift_variant
* splice_donor_variant
* splice_acceptor_variant
* initiator_codon_variant (note that SnpEff indicates that the impact of these variants is LOW)

<br>

**moderate-impact variants**
* missense_variant
* inframe_insertion
* splice_region_variant (impact LOW)

<br>

**low-impact variants**
* synonymous_variant
* 3_prime_UTR_variant
* 5_prime_UTR_variant
* upstream_gene_variant
* downstream_gene_variant

**other**
* All other variant annotations detailed in SnpEff
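
The grouping listed above amounts to a lookup from Sequence Ontology term to category, with *other* as the fallback. A minimal sketch, assuming each annotation is a single term from the lists above; `GROUPS`, `TERM_TO_GROUP`, and `classify` are hypothetical names.

```python
# Terms copied from the lists above; the group labels match those used
# by the grouping function in this notebook.
GROUPS = {
    'loss-of-function variants': ['stop_gained', 'stop_lost',
                                  'frameshift_variant', 'splice_donor_variant',
                                  'splice_acceptor_variant',
                                  'initiator_codon_variant'],
    'moderate-impact variants': ['missense_variant', 'inframe_insertion',
                                 'splice_region_variant'],
    'low-impact variants': ['synonymous_variant', '3_prime_UTR_variant',
                            '5_prime_UTR_variant', 'upstream_gene_variant',
                            'downstream_gene_variant'],
}

# Invert the mapping once so each lookup is a single dictionary access.
TERM_TO_GROUP = {term: group
                 for group, terms in GROUPS.items()
                 for term in terms}


def classify(term):
    """Return the category for an annotation term, defaulting to 'other'."""
    return TERM_TO_GROUP.get(term, 'other')


print(classify('stop_gained'))
print(classify('intron_variant'))
```

Any term not listed, including one that belongs in a listed category under a different spelling, falls through to *other*, which is the situation the note above warns about.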

In [None]:
# base_dir = '/share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test4/'
# in_file = base_dir + 'scz2-gw-ann-cleaned'
# out_file = base_dir + 'scz2-gw-ann-cleaned-grouped'

def group_annotations(in_file, out_file):
    """
    Categorize the variants into groups based on
    their annotations.
    
    INPUT:
    in_file - the file name (and path) of the file
              that contains the variant annotations.
    out_file - the file name (and path) of the output file,
               which is the input file with the variant group
               appended to each line.
    
    OUTPUT:
    This function returns a character string of the name and path of out_file.
    """
    with open(in_file, 'r') as inF:
        with open(out_file, 'w') as outF:
            line = inF.readline()
            group_dict = {}
            # add loss-of-function variants
            group_dict['loss-of-function variants'] = ['stop_gained', 
                                                      'stop_lost', 
                                                      'frameshift_variant', 
                                                      'splice_donor_variant', 
                                                      'splice_acceptor_variant', 
                                                      'initiator_codon_variant']
            # add moderate-impact variants
            group_dict['moderate-impact variants'] = ['missense_variant', 
                                                     'inframe_insertion', 
                                                     'splice_region_variant']
            # add low-impact variants
            group_dict['low-impact variants'] = ['synonymous_variant', 
                                                '3_prime_UTR_variant', 
                                                '5_prime_UTR_variant', 
                                                'upstream_gene_variant', 
                                                'downstream_gene_variant']
            ## Group variants based on their categorization.
            while(line):
                split_line = line.split()
                # Search for ann in dict.
                for item in group_dict:
                    if split_line[1] in group_dict[item]:
                        split_line.append(item)
                        break
                # If the variant was not in any of the three groups;
                # then it is categorized as an 'other' variant.
                if len(split_line) == 3:
                    split_line.append('other')
                outF.write('\t'.join(split_line) + '\n')
                line = inF.readline()
    return out_file

## Function5&mdash;Filter variants based off of the fwThresholds
1. Loss of function; $5.5 \times 10^{-7}$ 
2. Moderate impact; $1.1 \times 10^{-7}$ 
3. Low impact; $1.0 \times 10^{-8}$ 
4. Other; $1.7 \times 10^{-9}$ 

These thresholds are based on the estimated enrichment of the categories among association signals in 1000G and the resulting significance thresholds, as detailed in the `Weighting Sequence Variants` 2016 Nature Genetics paper by Sveinbjornsson et al.
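
The effect of category-specific thresholds can be seen by testing one P-value against both schemes. The sketch below uses the four thresholds listed above and the standard threshold of 5e-8; the helper names and the example P-values are made up for illustration.

```python
FW_THRESHOLDS = {
    'loss-of-function variants': 5.5e-7,
    'moderate-impact variants': 1.1e-7,
    'low-impact variants': 1.0e-8,
    'other': 1.7e-9,
}
STANDARD_THRESHOLD = 5e-8


def passes_fw(pval, group):
    """True if the P-value meets the threshold for its functional group."""
    return pval <= FW_THRESHOLDS[group]


def passes_standard(pval):
    """True if the P-value meets the standard GWAS threshold."""
    return pval <= STANDARD_THRESHOLD


# A loss-of-function variant at P = 2e-7 passes only the fw-threshold.
print(passes_fw(2e-7, 'loss-of-function variants'), passes_standard(2e-7))
# An 'other' variant at P = 1e-8 passes only the standard threshold.
print(passes_fw(1e-8, 'other'), passes_standard(1e-8))
```

The two example variants fall on opposite sides of the two schemes, which is the kind of difference the results summary at the end of this notebook counts.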

In [None]:
# base_dir = '/share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test4/'
# in_file = base_dir + 'scz2-gw-ann-cleaned-grouped'
# out_file = base_dir + 'scz2-gw-ann-cleaned-grouped-filtered-fw'

def filter_fw(in_file, out_file):
    """
    This function filters the variants by comparing the sequence
    variant P-values obtained from the GWAS results with the 
    functional-weighted threshold.
    
    INPUT:
    in_file - name (and path) of the input file that contains the 
              variant groupings.
    out_file - name (and path) of the output file that will be the
               fw-threshold filtered file.
    
    OUTPUT:
    This function returns a character string of the name and path of out_file.
    """
    with open(in_file) as inF:
        with open(out_file, 'w') as outF:
            line = inF.readline()
            loss_func = 5.5e-7
            mod_impact = 1.1e-7
            low_impact = 1.0e-8
            other = 1.7e-9
            thresh_dict = {'loss-of-function variants': loss_func,
                           'moderate-impact variants': mod_impact,
                           'low-impact variants': low_impact,
                           'other': other}
            while(line):
                split_line = line.split('\t')
                group = split_line[3].strip()
                pval = float(split_line[2])
                if pval <= thresh_dict[group]:
                    outF.write(line)
                line = inF.readline()
    return out_file

##  Function6&mdash;Filter variants based off of the standard P-value
Filter the sequence variants by the standard genome-wide significance P-value threshold of $5\times 10^{-8}$.

In [None]:
# base_dir = '/share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test4/'
# in_file = base_dir + 'scz2-gw-ann-cleaned-grouped'
# out_file = base_dir + 'scz2-gw-ann-cleaned-grouped-filtered-standard'

def filter_standard(in_file, out_file):
    """
    This function filters the variants by comparing the sequence
    variant pvalues obtained from the GWAS results with the 
    standard GWAS threshold of 5e-8.
         
    INPUT:
    in_file - name (and path) of the input file that contains the 
              variant groupings.
    out_file - name (and path) of the output file that will be the
               standard-threshold filtered file.
    
    OUTPUT:
    This function returns a character string of the name and path of out_file.
    """
    with open(in_file) as inF:
        with open(out_file, 'w') as outF:
            line = inF.readline()
            bf_threshold = 5e-8
            # Keep only variants at or below the standard threshold
            while(line):
                split_line = line.split('\t')
                group = split_line[3].strip()
                pval = float(split_line[2])
                if pval <= bf_threshold:
                    outF.write(line)
                line = inF.readline()
    return out_file

## Function7&mdash;Report Results
Results comparison: compare the sequence variants deemed significant under the fw-thresholds with those deemed significant under the standard genome-wide significance P-value threshold.
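
The comparison splits the significant variants into three sets: fw-only, standard-only, and both. The function below does this with `comm` on sorted files; the same partition can be sketched in plain Python with sets. This is an illustrative alternative, not the notebook's implementation; `compare_hits` is a hypothetical name and the records are made up.

```python
def compare_hits(fw_lines, std_lines):
    """Partition significant variant lines by which threshold scheme kept them."""
    fw, std = set(fw_lines), set(std_lines)
    return {
        'fw_only': sorted(fw - std),   # analogous to comm -23
        'std_only': sorted(std - fw),  # analogous to comm -13
        'both': sorted(fw & std),      # analogous to comm -12
    }


# Toy records (made up): unique ID, annotation, P-value, group.
fw_hits = ['1:100:A:G\tstop_gained\t2e-7\tloss-of-function variants',
           '2:200:C:T\tmissense_variant\t1e-9\tmoderate-impact variants']
std_hits = ['2:200:C:T\tmissense_variant\t1e-9\tmoderate-impact variants',
            '3:300:G:A\tintron_variant\t1e-8\tother']
result = compare_hits(fw_hits, std_hits)
for name, lines in result.items():
    print(name, len(lines))
```

Unlike `comm`, sets collapse duplicate lines, so the counts would differ if the filtered files contained repeated records.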

In [None]:
# base_dir = '/share/storage/Johnson/fwGWAS/method_comp/fwGWAS_SCZ2_Test4/'
# fw_file = base_dir + 'scz2-gw-ann-cleaned-grouped-filtered-fw'
# std_file = base_dir + 'scz2-gw-ann-cleaned-grouped-filtered-standard'
# out_file = base_dir + 'scz2-gw-results-summary'

def results_summary(in_file, in_file2, out_file):
    """
    The function takes as input the two results files from the threshold filtering
    performed above and outputs the summary statistics. Specifically, the output file
    will contain three counts dictionaries. One dict counts the variants in each
    functional group that were significant only under the fw-thresholds. The second
    counts the variants significant only under the standard genome-wide significance
    P-value threshold. The third counts the variants significant under both.
    
    INPUT:
    in_file -  name of the fw-thresholded file
    in_file2 - name of the standard thresholded file
    out_file     - Name of the file to which the summary statistics will be saved.
    
    OUTPUT:
    Nothing is returned from this function, however three files are output. One is
    the out_file provided as input, and then two additional files are created.
    
    additional01 - File assumes the name <out_file>-fw-variants and includes a list of 
                   the variants that were deemed significant only when the fw-thresholds
                   were applied as thresholds of significance.
    additional02 - File assumes the name <out_file>-std-variants and includes a list of
                   the variants that were deemed significant only when the standard
                   threshold (5e-8) was applied.
    """
    import subprocess
    ## compare the two filtered files and print the variants 
    #  Exclusive to the fw-thresholded variants
    bash_command = 'bash -c "comm -23 <(sort {0}) <(sort {1})"'.format(in_file, in_file2)
    fw_exclusive = subprocess.run(bash_command, shell=True, stdout=subprocess.PIPE, encoding='utf-8').stdout
    # Exclusive to the standard thresholded variants (5e-8)
    bash_command2 = 'bash -c "comm -13 <(sort {0}) <(sort {1})"'.format(in_file, in_file2)
    standard_exclusive = subprocess.run(bash_command2, shell=True, stdout=subprocess.PIPE, encoding='utf-8').stdout
    # Variants deemed significant using both thresholding methods
    bash_command3 = 'bash -c "comm -12 <(sort {0}) <(sort {1})"'.format(in_file, in_file2)
    both_methods = subprocess.run(bash_command3, shell=True, stdout=subprocess.PIPE, encoding='utf-8').stdout
    with open(out_file, 'w') as outF:
        for count,string in enumerate([fw_exclusive, standard_exclusive, both_methods]):
            counts_dict = {'loss-of-function variants':0,
                           'moderate-impact variants':0,
                           'low-impact variants':0,
                           'other':0}
            if count == 0:
#                 message = '####\n####The following lines detail the number of sequence variants that '\
#                       'were statistically significant when compared against the functional ' \
#                       'weighted thresholds based off of their functional annotation. ' \
#                       'Note, these variants were not deemed significant when compared ' \
#                       'against the standard GWAS threshold of 5e-8.\n\n'
                message = '####\n####Novel variants.\n\n'

            elif count == 1:
#                 message = '\n\n####\n####The following lines detail the number of sequence variants that '\
#                       'were statistically significant when compared against the standard GWAS ' \
#                       'threshold of 5e-8, but were not deemed significant based off of the function '\
#                       'weighted thresholds.\n\n'
                message = '\n\n####\n####Variants not replicated.\n\n'

            else:
#                 message = '\n\n####\n####The following lines detail the number of sequence variants that '\
#                         'were statistically significant when compared against BOTH the standard GWAS '\
#                         'threshold and the fw-thresholds. In otherwords, these are the variants that were '\
#                         'replicated.\n\n'
                message = '\n\n####\n####Variants that were replicated.\n\n'

            split_lines = string.splitlines()
            for line in split_lines:
                ann = line.split('\t')[3]
                counts_dict[ann] += 1
            outF.write(message)
            dict_sum = str(sum(counts_dict.values()))
            dict_sum = '\nTotal number of elements: {}'.format(dict_sum)
            outF.write(str(counts_dict) + dict_sum + '\n')
    with open(out_file+'-fw-variants', 'w') as fwF:
        fwF.write(fw_exclusive)
    with open(out_file+'-std-variants', 'w') as stdF:
        stdF.write(standard_exclusive)
    return