# **Mutated-gene Genome-Wide Association Study (MugGWAS)**

MugGWAS identifies potential causal mutations for phenotypic change through testing associations between mutated genes and phenotypic values. The tool aims to 

### **Prerequisites**

This package uses the fixed-effect model in [`pyseer`](https://pyseer.readthedocs.io/en/master/index.html) for association analysis, so users must install `pyseer` and meet its [requirements](https://pyseer.readthedocs.io/en/master/installation.html#prerequisites).

### **Installation**
Clone this repository from GitHub in the terminal:

`git clone https://github.com/lyuchengmarvin/MugGWAS.git`


### **Build Environment**
Create a conda environment from `envs/environment.yaml` to install the software prerequisites.

`conda env create --name muggwas --file envs/environment.yaml`

Activate the environment before each use:

`conda activate muggwas`

You will also need to install the following dependencies manually. This step is necessary until I build a Docker image for users:

`pip install glmnet_py`

`conda install -c bioconda ucsc-gff3togenepred`
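
With the environment active and the extra dependencies installed, a quick check that the command-line tools are reachable can save debugging later. The sketch below is not part of MugGWAS: `find_missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and `perl` is listed on the assumption that ANNOVAR's Perl scripts will be run from the same shell.

```python
import shutil

# Executables the tutorial relies on (assumed list; adjust to your setup).
# `perl` is included because ANNOVAR is distributed as Perl scripts.
REQUIRED_TOOLS = ["pyseer", "gff3ToGenePred", "perl"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty result means every listed executable was found; anything reported as missing should be installed before continuing.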

Deactivate after use:

`conda deactivate`

### **Tutorial**
Steps:
1. Annotate variants
2. Build gene mutation table
3. Estimate population structure
4. GWAS with pyseer
5. Plot GWAS result

#### **Annotate variants with ANNOVAR**

We use the external tool ANNOVAR to annotate the variants in VCF files. Although a single VCF file contains the variant information for multiple samples, ANNOVAR annotates the samples one at a time and creates one annotation file per sample.

ANNOVAR can be downloaded from this [website](http://annovar.openbioinformatics.org/en/latest/), but registration is required. To ensure accessibility and easy execution for the users, I will build a docker image in the future so that the users can directly annotate their variants through docker.

This tool also assumes that users will build a customized database for their non-model organisms, so a genome annotation `<ref_prefix>.gff3` and a genome FASTA file `<ref_prefix>.fna` are required and must be placed in the directory `<ref_prefix>/`. The database is built with the tool `gff3ToGenePred`, which was installed when the conda environment was set up.


**Annotation workflow**

1. **Make input files**:
    - Input: a VCF file `vcf_prefix.vcf` (with multiple samples)
    - Convert VCF to ANNOVAR input format `vcf_prefix.<sample_name>.avinput`
    - Store in `/path_to_vcf/annovar_files/`
2. **Build database**:
    - Input: `ref_prefix.gff3` and `ref_prefix.fna` in `db_dir/`
    - Build customized database: `ref_prefix_refGene.txt` and `ref_prefix_refGeneMrna.fa`
3. **Annotate variants with ANNOVAR**:
    - Output 1: `vcf_prefix.<sample_name>.avinput.variant_function` gives the position of each mutation relative to a gene.
    - Output 2: `vcf_prefix.<sample_name>.avinput.exonic_variant_function` gives the effect of each mutation on translation: synonymous, nonsynonymous, stop codon gain, or stop codon loss.
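
Because the workflow produces one `.avinput` file per sample, it can help to know in advance which files to expect. The sketch below reads sample names from the `#CHROM` header line of a VCF and derives file names following the pattern in step 1. The helper names and the in-memory example VCF are illustrative assumptions, not part of `annotate_variants`.

```python
import io


def vcf_sample_names(handle):
    """Return the sample names from the #CHROM header line of a VCF stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # The first nine columns are fixed (CHROM ... FORMAT); samples follow.
            return columns[9:]
    raise ValueError("no #CHROM header line found")


def expected_avinput_names(vcf_prefix, samples):
    """Build file names of the form <vcf_prefix>.<sample_name>.avinput."""
    return [f"{vcf_prefix}.{sample}.avinput" for sample in samples]


# Tiny made-up VCF with two samples, used only to exercise the helpers.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "contig_1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0\t1\n"
)
samples = vcf_sample_names(example_vcf)
print(expected_avinput_names("graz_LE", samples))
# → ['graz_LE.S1.avinput', 'graz_LE.S2.avinput']
```

For a real file, pass an open file handle instead of the `io.StringIO` object.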

```python
from annotate_variants import run_annovar

## User inputs
# Path to the ANNOVAR directory
annovar_dir = '/Users/linyusheng/MugGWAS/scripts/annovar/'
# Directory where the reference files are located
db_dir = '/Users/linyusheng/MugGWAS/data/LE18_22/'
# Path to the VCF file
input_dir = '/Users/linyusheng/MugGWAS/data/graz_LE.snp.vcf'
# Output prefix for the converted ANNOVAR input files
vcf_prefix = 'graz_LE'
# Output prefix for the ANNOVAR database
ref_prefix = 'LE18_22'

## Run ANNOVAR
run_annovar(annovar_dir, input_dir, db_dir, vcf_prefix, ref_prefix)
```


#### **Build Gene Mutation Table:**

For each sample:
1. Build gene maps from the gff3 file.
2. Compile mutation info gene-by-gene.
3. Determine mutation types.
    - binary: mutation or wildtype
    - multiple: nonsense, nonstop, missense, silent or wildtype
4. Output mutation types.
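
Step 3 can be illustrated with a small sketch that turns the ANNOVAR effect labels found for one gene into a single mutation type. This is not the code in `compile_variants_by_gene`. The severity ordering, the decision to count silent mutations as mutated in the binary case, and the model name `'multiple'` are assumptions made for illustration.

```python
# Mapping from ANNOVAR exonic labels to the categories listed above.
# Labels outside this mapping (e.g. frameshift indels) are ignored in this sketch.
ANNOVAR_TO_TYPE = {
    "stopgain": "nonsense",
    "stoploss": "nonstop",
    "nonsynonymous SNV": "missense",
    "synonymous SNV": "silent",
}

# Assumed priority (least to most severe) when a gene carries several variants.
SEVERITY = ["wildtype", "silent", "missense", "nonstop", "nonsense"]


def gene_mutation_type(annovar_labels, model="multiple"):
    """Collapse the variant labels of one gene into a single mutation type."""
    types = [ANNOVAR_TO_TYPE[label] for label in annovar_labels
             if label in ANNOVAR_TO_TYPE]
    # Keep the most severe type; a gene without variants is wildtype.
    worst = max(types, key=SEVERITY.index, default="wildtype")
    if model == "binary":
        return "wildtype" if worst == "wildtype" else "mutation"
    return worst


print(gene_mutation_type(["synonymous SNV", "stopgain"]))        # → nonsense
print(gene_mutation_type([]))                                    # → wildtype
print(gene_mutation_type(["synonymous SNV"], model="binary"))    # → mutation
```

Whether a silent mutation should count as mutated in the binary model is a design choice; here it does, and excluding it would only require dropping `"silent"` from the mapping.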

**Import functions and annotation data**

```python
from compile_variants_by_gene import compile_gene_mutations, write_gene_mutation_summary

# Directory containing the ANNOVAR output files
annovar_output_dir = '/Users/linyusheng/MugGWAS/data/annovar_files'
# Read gff3 to build gene maps
gff_file = '/Users/linyusheng/MugGWAS/data/LE18_22/LE18_22.gff3'
# Write the mutation types to a tab delimited file
output_file = '/Users/linyusheng/MugGWAS/data/gene_mutation_summary.txt'
```

**Compile variants and output gene mutation table**

```python
## Compile the gene mutations for each sample
# model='binary' outputs binary mutation types (mutated or wildtype)
gene_mutation_summary = compile_gene_mutations(annovar_output_dir, gff_file, model='binary')
## Write the gene mutation summary
write_gene_mutation_summary(gene_mutation_summary, output_file)
```

#### **Estimate Population Structure Effect:**

In gene-trait association analyses, false positives can arise from shared ancestry, especially in clonal organisms. There are two methods to estimate the effect of population structure:
- Phylogeny-based: Infer population structure from phylogenetic distances. Since MugGWAS will use a linear mixed model, the distance between two samples is estimated as the branch length shared between the root and their MRCA.
- Kinship: Variants in core gene sequences reflect vertical evolution. To infer identity by descent, the script will calculate a kinship matrix from the genotype matrix of variant presence and absence (not yet supported as of 2025.04.03).

Inputs: 
- Phylogeny-based: IQ-TREE output `core_gene_tree.nwk`, a high-quality phylogeny built from the single-copy core genes of a pangenome.
- Kinship: A VCF file documenting variants on core genes `core_gene_snp.vcf.gz`.

Output: 
- `phylogeny_distances.tsv`: a file to account for population structure effect in pyseer.
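
To make the phylogeny-based measure concrete, the sketch below computes, for every pair of leaves in a toy rooted tree, the branch length shared between the root and their MRCA. It is an illustration only, not the implementation of `phylogeny2distmatrix`. The tree is hand-written as nested tuples rather than parsed from a Newick file, and all names are hypothetical.

```python
from itertools import combinations_with_replacement

# Toy rooted tree: (name, branch_length_to_parent, children).
# Node names must be unique for the path comparison below to work.
TREE = ("root", 0.0, [
    ("n1", 0.25, [
        ("A", 0.5, []),
        ("B", 0.5, []),
    ]),
    ("C", 1.0, []),
])


def root_paths(node, prefix=()):
    """Map each leaf name to the (node, branch_length) steps from the root."""
    name, length, children = node
    path = prefix + ((name, length),)
    if not children:
        return {name: path}
    paths = {}
    for child in children:
        paths.update(root_paths(child, path))
    return paths


def shared_branch_length(path_a, path_b):
    """Sum branch lengths along the common prefix, i.e. from root to MRCA."""
    total = 0.0
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            break
        total += step_a[1]
    return total


paths = root_paths(TREE)
similarity = {
    (a, b): shared_branch_length(paths[a], paths[b])
    for a, b in combinations_with_replacement(sorted(paths), 2)
}
for pair, value in similarity.items():
    print(pair, value)
```

In this toy tree, A and B share 0.25 (the branch leading to `n1`), A and C share nothing because their MRCA is the root, and each leaf paired with itself gets its full root-to-tip length.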

**Phylogeny-based method**

```python
from estimate_pop_structure import phylogeny2distmatrix

# Read the phylogenetic tree
phylogeny = '/Users/linyusheng/MugGWAS/data/LE_tree.nwk'
output_file = '/Users/linyusheng/MugGWAS/data/phylogeny_distances.tsv'

# Convert the phylogenetic tree to a distance matrix
phylogeny2distmatrix(phylogeny, output_file)
```