# **Mutated-gene Genome-Wide Association Study (MugGWAS)**

MugGWAS identifies potential causal mutations for phenotypic change through testing associations between mutated genes and phenotypic values. The tool aims to 

### **Prerequisites**

This package uses the fixed-effect model in [`pyseer`](https://pyseer.readthedocs.io/en/master/index.html) for association analysis, so users must install `pyseer` and satisfy its [requirements](https://pyseer.readthedocs.io/en/master/installation.html#prerequisites).

### **Installation**
Clone this repository from GitHub in the terminal:

`git clone https://github.com/lyuchengmarvin/MugGWAS.git`


### **Build Environment**
Build a conda environment from the `envs/environment.yaml` file to satisfy the software prerequisites.

`conda env create --name muggwas --file envs/environment.yaml`

Activate the environment each time before use:

`conda activate muggwas`

You will also need to install the following dependencies manually. This step is necessary until I build a Docker image for users:

`pip install glmnet_py`

`conda install -c bioconda ucsc-gff3togenepred`

Deactivate after use:

`conda deactivate`

### **Tutorial**
Steps:
1. Annotate variants
2. Build gene mutation table
3. Estimate population structure
4. GWAS with pyseer
5. Plot GWAS result

#### **Annotate variants with ANNOVAR**

We use the external tool ANNOVAR to annotate the variants in VCF files. Although a single VCF file can contain variant information for multiple samples, ANNOVAR annotates the samples one at a time and creates one annotation file per sample.

ANNOVAR can be downloaded from this [website](http://annovar.openbioinformatics.org/en/latest/), but registration is required. To ensure accessibility and easy execution for the users, I will build a docker image in the future so that the users can directly annotate their variants through docker.

This tool also assumes that users will build a customized database for their non-model organisms, so a genome annotation file `<ref_prefix>.gff3` and a genome FASTA file `<ref_prefix>.fna` are required and must be placed in the directory `<ref_prefix>/`. The database is built with the tool `gff3ToGenePred`, which is installed in the Build Environment step.
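
Before running the annotation, it can help to confirm that the reference directory follows the layout described above. The following is a minimal sketch, not part of MugGWAS: the helper name `missing_reference_files` is hypothetical, and it only checks that the two expected files exist.

```python
from pathlib import Path


def missing_reference_files(db_dir, ref_prefix):
    """Return the expected reference files that are absent from db_dir.

    Hypothetical helper (not part of MugGWAS): it assumes the layout
    described above, i.e. <ref_prefix>.gff3 and <ref_prefix>.fna in db_dir.
    """
    db_path = Path(db_dir)
    expected = [db_path / f"{ref_prefix}.gff3", db_path / f"{ref_prefix}.fna"]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Example directory and prefix; adjust both to your own data.
    missing = missing_reference_files("data/LE18_22", "LE18_22")
    if missing:
        print("Missing reference files:", ", ".join(missing))
    else:
        print("Reference directory looks complete.")
```

An empty list means both files were found and the database build can proceed.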


**Annotation workflow**

1. **Make input files**:
    - Input: a VCF file `vcf_prefix.vcf` (with multiple samples)
    - Convert VCF to ANNOVAR input format `vcf_prefix.<sample_name>.avinput`
    - Store in `/path_to_vcf/annovar_files/`
2. **Build database**:
    - Input: `ref_prefix.gff3` and `ref_prefix.fna` in `db_dir/`
    - Build customized database: `ref_prefix_refGene.txt` and `ref_prefix_refGeneMrna.fa`
3. **Annotate variants with ANNOVAR**:
    - Output 1: `vcf_prefix.<sample_name>.avinput.variant_function` reports where each mutation falls relative to a gene.
    - Output 2: `vcf_prefix.<sample_name>.avinput.exonic_variant_function` reports the effect of each exonic mutation on translation: synonymous, nonsynonymous, stop codon gain, or stop codon loss.
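
The three workflow steps above roughly correspond to the command lines sketched below. This is an illustration only: the internals of `run_annovar` are not shown here, so the helper `build_annovar_commands` is hypothetical, and the flags reflect my understanding of the documented usage of ANNOVAR's `convert2annovar.pl`, `retrieve_seq_from_fasta.pl`, and `annotate_variation.pl` scripts and of `gff3ToGenePred`. Check them against your installed versions. The function only builds the argument lists and does not execute anything.

```python
import os


def build_annovar_commands(annovar_dir, vcf_file, db_dir, vcf_prefix,
                           ref_prefix, sample_names):
    """Build (but do not run) the commands for the three workflow steps.

    Hypothetical sketch: the real run_annovar may differ. Output locations
    follow the layout described in the workflow above.
    """
    out_dir = os.path.join(os.path.dirname(vcf_file), "annovar_files")
    ref_gene = os.path.join(db_dir, f"{ref_prefix}_refGene.txt")
    commands = [
        # Step 1: split the multi-sample VCF into one .avinput file per sample.
        ["perl", os.path.join(annovar_dir, "convert2annovar.pl"),
         "-format", "vcf4", vcf_file,
         "-outfile", os.path.join(out_dir, vcf_prefix), "-allsample"],
        # Step 2a: convert the GFF3 annotation into a gene prediction table.
        ["gff3ToGenePred", os.path.join(db_dir, f"{ref_prefix}.gff3"), ref_gene],
        # Step 2b: extract transcript sequences for the custom database.
        ["perl", os.path.join(annovar_dir, "retrieve_seq_from_fasta.pl"),
         "--format", "refGene",
         "--seqfile", os.path.join(db_dir, f"{ref_prefix}.fna"),
         ref_gene,
         "--out", os.path.join(db_dir, f"{ref_prefix}_refGeneMrna.fa")],
    ]
    # Step 3: gene-based annotation, one command per sample.
    for sample in sample_names:
        avinput = os.path.join(out_dir, f"{vcf_prefix}.{sample}.avinput")
        commands.append(
            ["perl", os.path.join(annovar_dir, "annotate_variation.pl"),
             "--geneanno", "--dbtype", "refGene",
             "--buildver", ref_prefix, avinput, db_dir])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths have been verified. Keeping command construction separate from execution makes it easy to print and inspect the commands first.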

In [None]:
from annotate_variants import run_annovar

## User inputs
# path to the annovar directory
annovar_dir = '/Users/linyusheng/MugGWAS/scripts/annovar/'
# the directory where the reference files are located
db_dir = '/Users/linyusheng/MugGWAS/data/LE18_22/'
# path to the vcf file
input_dir = '/Users/linyusheng/MugGWAS/data/graz_LE.snp.vcf'
# output prefix for the converted annovar input files
vcf_prefix = 'graz_LE'
# output prefix for the annovar database
ref_prefix = 'LE18_22'

## Run ANNOVAR
run_annovar(annovar_dir, input_dir, db_dir, vcf_prefix, ref_prefix)


Converting VCF file to annovar input format...



NOTICE: output files will be written to /Users/linyusheng/MugGWAS/data/annovar_files/graz_LE.<samplename>.avinput
NOTICE: Finished reading 21826 lines from VCF file
NOTICE: A total of 21154 locus in VCF file passed QC threshold, representing 21154 SNPs (13997 transitions and 7157 transversions) and 0 indels/substitutions
NOTICE: Finished writing 90239 SNP genotypes (61175 transitions and 29064 transversions) and 0 indels/substitutions for 22 samples



The VCF file has been converted and saved to /Users/linyusheng/MugGWAS/data/annovar_files/.



NOTICE: Reading region file /Users/linyusheng/MugGWAS/data/LE18_22/LE18_22_refGene.txt ... Done with 5095 regions from 613 chromosomes
NOTICE: Finished reading 642 sequences from /Users/linyusheng/MugGWAS/data/LE18_22/LE18_22.fna
NOTICE: Fi

Database has been built and saved to /Users/linyusheng/MugGWAS/data/LE18_22/.

Annotating avinput files...



NOTICE: Output files are written to /Users/linyusheng/MugGWAS/data/annovar_files/graz_LE.ND_101.avinput.variant_function, /Users/linyusheng/MugGWAS/data/annovar_files/graz_LE.ND_101.avinput.exonic_variant_function
NOTICE: Reading gene annotation from /Users/linyusheng/MugGWAS/data/LE18_22/LE18_22_refGene.txt ... Done with 5095 transcripts (including 54 without coding sequence annotation) for 5095 unique genes
NOTICE: Processing next batch with 3939 unique variants in 3939 input lines
NOTICE: Reading FASTA sequences from /Users/linyusheng/MugGWAS/data/LE18_22/LE18_22_refGeneMrna.fa ... Done with 317 sequences
NOTICE: Output files are written to /Users/linyusheng/MugGWAS/data/annovar_files/graz_LE.ND_84.avinput.variant_function, /Users/linyusheng/MugGWAS/data/annovar_files/graz_LE.ND_84.avinput.exonic_variant_function
NOTICE: Reading gene annotation from /Users/linyusheng/MugGWAS/data/LE18_22/LE18_22_refGene.txt ... Done with 5095 transcripts (including 54 without coding sequence annotat

22 samples have been annotated and saved to /Users/linyusheng/MugGWAS/data/annovar_files/.


#### **Build gene mutation table**

For each sample:
1. Build gene maps from the gff3 file.
2. Compile mutation info gene-by-gene.
3. Determine mutation types.
    - binary: mutation or wildtype
    - multiple: nonsense, nonstop, missense, silent or wildtype
4. Output mutation types.
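
Step 3 can be illustrated with a small sketch. It is not the implementation in `compile_variants_by_gene`: the function `classify_gene`, the label mapping, and the severity order are all assumptions made for illustration. In particular, it assumes that a gene carrying several variants is assigned its most severe type, and that in the binary model any annotated exonic variant (including a silent one) counts as a mutation. The ANNOVAR labels used as keys are the ones I expect in `exonic_variant_function` files; other labels (for example indel categories) are ignored here.

```python
# Assumed mapping from ANNOVAR exonic labels to the mutation types listed above.
ANNOVAR_TO_TYPE = {
    "stopgain": "nonsense",
    "stoploss": "nonstop",
    "nonsynonymous SNV": "missense",
    "synonymous SNV": "silent",
}

# Assumed severity order, from most to least severe.
SEVERITY = ["nonsense", "nonstop", "missense", "silent", "wildtype"]


def classify_gene(annovar_labels, model="multiple"):
    """Collapse all variant labels observed in one gene into a single type.

    Hypothetical sketch: labels missing from ANNOVAR_TO_TYPE are skipped.
    """
    types = [ANNOVAR_TO_TYPE[label] for label in annovar_labels
             if label in ANNOVAR_TO_TYPE]
    if not types:
        return "wildtype"
    if model == "binary":
        return "mutation"
    # Pick the type that appears earliest in the severity ranking.
    return min(types, key=SEVERITY.index)


if __name__ == "__main__":
    print(classify_gene(["synonymous SNV", "stopgain"]))
    print(classify_gene(["synonymous SNV"], model="binary"))
    print(classify_gene([]))
```

A different priority rule (for example, treating silent variants as wildtype in the binary model) only requires changing the mapping or the ranking list.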

**Import functions and annotation data**

In [1]:
from compile_variants_by_gene import compile_gene_mutations, write_gene_mutation_summary

# Directory containing the ANNOVAR output files
annovar_output_dir = '/Users/linyusheng/MugGWAS/data/annovar_files'
# Read gff3 to build gene maps
gff_file = '/Users/linyusheng/MugGWAS/data/LE18_22/LE18_22.gff3'
# Write the mutation types to a tab delimited file
output_file = '/Users/linyusheng/MugGWAS/data/gene_mutation_summary.txt'

**Compile variants and output gene mutation table**

In [12]:
## Compile the gene mutations for each sample
# To output binary mutation types (mutated or not), set model = 'binary'
gene_mutation_summary = compile_gene_mutations(annovar_output_dir, gff_file, model = 'multiple')
# Inspect the result for sample 'ND_100' (print it, since a bare expression mid-cell is not displayed)
print(gene_mutation_summary['ND_100'])
## Write the gene mutation summary to a tab-delimited file
write_gene_mutation_summary(gene_mutation_summary, output_file)

In [10]:
# Read the gene mutation summary with pandas
import pandas as pd
gene_mutation_summary = pd.read_csv(output_file, sep = '\t')
# Preview the first rows
print(gene_mutation_summary.head())
# List the column names (Gene plus one column per sample)
print(gene_mutation_summary.columns)
# Show the table dimensions (rows, columns)
print(gene_mutation_summary.shape)
# Summarize dtypes and non-null counts; info() prints directly and returns None
gene_mutation_summary.info()

           Gene     ND_99     ND_94     ND_86     ND_91     ND_83    ND_102  \
0  HENFNN_00005  wildtype  wildtype  wildtype  wildtype  wildtype  wildtype   
1  HENFNN_00010  wildtype  missense  wildtype  wildtype  missense  wildtype   
2  HENFNN_00015  missense  wildtype  nonsense  wildtype  wildtype  nonsense   
3  HENFNN_00020  wildtype  missense  missense  wildtype  wildtype  wildtype   
4  HENFNN_00025  wildtype  missense  missense  wildtype  missense  missense   

      ND_78     ND_97     ND_85  ...     ND_81     ND_93     ND_84     ND_89  \
0  wildtype  wildtype  wildtype  ...    silent  wildtype  wildtype  wildtype   
1  wildtype  wildtype  wildtype  ...  wildtype  wildtype  wildtype  wildtype   
2  wildtype  nonsense  nonsense  ...  nonsense  wildtype  missense  missense   
3  wildtype  missense    silent  ...    silent  wildtype  wildtype  missense   
4  wildtype  missense  missense  ...  missense  wildtype  missense  wildtype   

      ND_98     ND_79     ND_82     ND_90   