# **Mutated-gene Genome-Wide Association Study (MugGWAS)**

MugGWAS offers a non-model-organism-friendly pipeline to infer gene-trait associations. It tackles challenges common in non-model organism research, such as the lack of readily available annotation databases and small sample sizes, by integrating a simple annotation pipeline with mutation-centric GWAS analysis. Users can identify putative gene mutations that drive phenotypic change without the large sample sizes needed to power genome-wide tests on a per-nucleotide basis.

## **Motivation**

Understanding the genetic basis of a trait is central to much of biological research and helps yield mechanistic insight into complex biological phenomena. For instance, discovering genetic loci that contribute to antibiotic resistance could improve our knowledge of how important pathogens evolve to escape medical treatment.

Most association tools classify raw genetic variants (like single nucleotide polymorphisms or structural variants) into reference or alternative genotypes, then perform statistical tests to infer associations between particular variants and phenotypes. However, genome-wide analyses typically require large sample sizes for statistical power—a major challenge for organisms that are difficult to culture or traits that are challenging to measure. Alternative approaches test associations between phenotypes and gene presence/absence, but these methods don't capture finer details like important mutations that potentially disrupt gene function.

Mutated-gene Genome-Wide Association Study (MugGWAS) addresses these limitations by offering a non-model-organism-friendly pipeline to infer gene-trait associations. By annotating disruptive mutation types (missense, nonsense/stop-gain, or nonstop/stop-loss), MugGWAS identifies putative gene dysfunctions associated with phenotypic changes. This approach conserves statistical power in two ways: it skips variants that do not affect function at the gene level, and it collapses multiple variants that disrupt the same gene into a single test. As a result, users can identify putative gene mutations driving phenotypic changes without the large sample sizes that per-nucleotide genome-wide tests require.

## **Prerequisites**
MugGWAS uses the external tool **ANNOVAR** to annotate variants. Please follow the [instructions](http://annovar.openbioinformatics.org/en/latest/) to download **ANNOVAR** (_registration is required_), then pass the path to the ANNOVAR scripts to MugGWAS. More instructions are provided below.

This package uses the fixed effect model in [**pyseer**](https://pyseer.readthedocs.io/en/master/index.html) for association analysis. Users do not need to install **pyseer** or its [prerequisites](https://pyseer.readthedocs.io/en/master/installation.html#prerequisites) themselves; both are included in the conda environment built with the instructions below.

## **Installation Guide**

The easiest way to install MugGWAS and fulfill the prerequisites is to follow these steps:

#### **Build Conda Environment**

Download the [environment.yaml](https://github.com/lyuchengmarvin/MugGWAS/blob/main/envs/environment.yaml) file for setting up your conda environment. You can download it using this command in your terminal:

```{command line}
wget -P envs https://raw.githubusercontent.com/lyuchengmarvin/MugGWAS/main/envs/environment.yaml
```

Build the environment to install the software prerequisites.

```{command line}
conda env create --name muggwas --file envs/environment.yaml
```

Activate the environment each time before use:

```{command line}
conda activate muggwas
```

Deactivate after use:

```{command line}
conda deactivate
```

#### **Install MugGWAS**

Install MugGWAS in your environment and make sure that all prerequisites are fulfilled.

```{command line}
pip install -i https://test.pypi.org/simple/ MugGWAS
```
You might need to restart the kernel to run this tutorial in Jupyter Notebook.
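
Before starting the tutorial, it can help to confirm that the prerequisites are visible from Python. The sketch below is only an illustration: the helper `check_prerequisites` is hypothetical, and the list of ANNOVAR script names is an assumption about which standard ANNOVAR scripts are needed, not a documented MugGWAS requirement.

```python
import importlib.util
import shutil
from pathlib import Path

# Standard ANNOVAR Perl scripts; which of them MugGWAS actually calls is an
# assumption made for this sketch.
ANNOVAR_SCRIPTS = (
    "convert2annovar.pl",
    "annotate_variation.pl",
    "retrieve_seq_from_fasta.pl",
)


def check_prerequisites(annovar_dir, tools=("gff3ToGenePred",), modules=("pyseer", "muggwas")):
    """Return a dict mapping each prerequisite name to True (found) or False (missing)."""
    status = {}
    for script in ANNOVAR_SCRIPTS:
        # ANNOVAR scripts are looked up inside the user-supplied directory
        status[script] = (Path(annovar_dir) / script).is_file()
    for tool in tools:
        # Command-line tools are looked up on PATH
        status[tool] = shutil.which(tool) is not None
    for module in modules:
        # Python packages are looked up without importing them
        status[module] = importlib.util.find_spec(module) is not None
    return status


for name, found in check_prerequisites("../annovar/").items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any entry reported as `MISSING` points to either an inactive conda environment or an incorrect ANNOVAR path.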

## **Tutorial**

### **Download the Tutorial Dataset**
The tutorial [dataset](https://github.com/lyuchengmarvin/MugGWAS/tree/main/tutorials/data/pyseer_dataset) lives in the MugGWAS repository. Running `wget` on a GitHub directory URL only saves the web page, so clone the repository and copy the dataset instead. Remember to place it in the same directory as this tutorial file.
```{command line}
git clone https://github.com/lyuchengmarvin/MugGWAS.git
cp -r MugGWAS/tutorials/data/pyseer_dataset .
```

### **Overview:**
1. Annotate variants
2. Build gene mutation table
3. Estimate population structure
4. GWAS with pyseer

![MugGWAS Workflow](MugGWAS_workflow.png)

### **Step 1: Annotate variants with ANNOVAR**

We use the external tool ANNOVAR to annotate the variants in a VCF file. Although a single VCF file usually contains variant information for multiple samples, ANNOVAR annotates samples one at a time and creates one annotation file per sample.

MugGWAS also assumes that users will build a customized database for their non-model organism, so a genome annotation `<ref_prefix>.gff3` and a genome FASTA file `<ref_prefix>.fna` are required and must be placed in the directory `<ref_prefix>/`. The database is built with the tool `gff3ToGenePred`, which is installed when the conda environment is created.


**Annotation workflow**

1. **Make input files**:
    - Input: a VCF file `vcf_prefix.vcf` (with multiple samples)
    - Convert VCF to ANNOVAR input format `vcf_prefix.<sample_name>.avinput`
    - Store in `/path_to_vcf/annovar_files/`
2. **Build database**:
    - Input: `ref_prefix.gff3` and `ref_prefix.fna` in `db_dir/`
    - Build customized database: `ref_prefix_refGene.txt` and `ref_prefix_refGeneMrna.fa`
3. **Annotate variants with ANNOVAR**:
    - Output 1: `vcf_prefix.<sample_name>.avinput.variant_function` reports the position of each mutation relative to a gene.
    - Output 2: `vcf_prefix.<sample_name>.avinput.exonic_variant_function` reports the effect of each mutation on translation: synonymous, nonsynonymous, stop codon gain, or stop codon loss.

```python
from muggwas import run_annovar

## User inputs
# path to the ANNOVAR installation
annovar_dir = '../annovar/'
# directory where the reference files are located
db_dir = 'pyseer_dataset/Spn23F/'  # ref is susceptible to penicillin
# path to the VCF file
input_dir = 'pyseer_dataset/snps.snp.subset.filtered.vcf.gz'
# output prefix for the converted ANNOVAR input files
vcf_prefix = 'snps'
# output prefix for the ANNOVAR database
ref_prefix = 'Spn23F'  # ref is susceptible to penicillin

## Run ANNOVAR
run_annovar(
    annovar_dir=annovar_dir,
    db_dir=db_dir,
    input_dir=input_dir,
    vcf_prefix=vcf_prefix,
    ref_prefix=ref_prefix
)
```

Output (truncated):

```
Converting VCF file to annovar input format...

NOTICE: output files will be written to data/pyseer_dataset/annovar_files/snps.<samplename>.avinput
NOTICE: Finished reading 91729 lines from VCF file
NOTICE: A total of 91717 locus in VCF file passed QC threshold, representing 91717 SNPs (69023 transitions and 22694 transversions) and 0 indels/substitutions
NOTICE: Finished writing 3154524 SNP genotypes (2386705 transitions and 767819 transversions) and 0 indels/substitutions for 300 samples

The VCF file has been converted and saved to data/pyseer_dataset/annovar_files/.

NOTICE: Reading region file data/pyseer_dataset/Spn23F/Spn23F_refGene.txt ... Done with 2271 regions from 1 chromosomes
NOTICE: Finished reading 1 sequences from data/pyseer_dataset/Spn23F/Spn23F.fna
NOTICE: Finished writting FASTA for 2271 genomic regions to data/pyseer_dataset/Spn23F/Spn23F_refGeneMrna.fa
2025-04-15 21:02:56,444 - INFO - Starting parallel annotation of avinput files.

Database has been built and saved to data/pyseer_dataset/Spn23F/.

Annotating avinput files...

NOTICE: Output files are written to data/pyseer_dataset/annovar_files/snps.6925_2#84.avinput.variant_function, data/pyseer_dataset/annovar_files/snps.6925_2#84.avinput.exonic_variant_function
NOTICE: Reading gene annotation from data/pyseer_dataset/Spn23F/Spn23F_refGene.txt ... NOTICE: Output files are written to data/pyseer_dataset/annovar_files/snps.7622_4#14.avinput.variant_function, data/pyseer_dataset/annovar_files/snps.7622_4#14.avinput.exonic_variant_function
NOTICE: Reading gene annotation from data/pyseer_dataset/Spn23F/Spn23F_refGene.txt ... NOTICE: Output files are written to data/pyseer_dataset/annovar_files/snps.7622_3#89.avinput.variant_function, data/pyseer_dataset/annovar_files/snps.7622_3#89.avinput.exonic_variant_function
NOTICE: Reading gene annotation from data/pyseer_dataset/Spn23F/Spn23F_refGene.txt ... NOTICE: Output files are written to data/pyseer_dataset/annovar_files/snps.7622_5#84.avinput.variant_function, data/pyseer_dataset/annovar_files/snps.7622_5#84.avi

Annotated 300 avinput files in 93.93 seconds.
Annotated avinput files are saved in data/pyseer_dataset/annovar_files/.
```
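
To get a feel for what the `.exonic_variant_function` files contain, the following sketch counts mutation effects per gene. It is a hedged illustration, not the parser MugGWAS uses: the example rows are invented, the helper `count_effects_per_gene` is hypothetical, and the column layout (line ID, effect, transcript annotation, then the avinput columns) is assumed from typical ANNOVAR output.

```python
import csv
import io
from collections import Counter

# Hypothetical rows laid out like ANNOVAR's *.exonic_variant_function files
# (tab-delimited, no header). Real files carry further avinput columns.
EXAMPLE = (
    "line1\tnonsynonymous SNV\tgeneA:txA:exon1:c.A10G:p.T4A,\tchr1\t10\t10\tA\tG\n"
    "line2\tsynonymous SNV\tgeneA:txA:exon1:c.C30T:p.G10G,\tchr1\t30\t30\tC\tT\n"
    "line3\tstopgain\tgeneB:txB:exon1:c.C7T:p.Q3X,\tchr1\t507\t507\tC\tT\n"
)


def count_effects_per_gene(handle):
    """Count how often each (gene, effect) pair occurs in the stream."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        effect = row[1]
        # Column 3 holds comma-separated transcript annotations; the gene
        # name is the first colon-separated field of each annotation.
        genes = {ann.split(":")[0] for ann in row[2].split(",") if ann}
        for gene in genes:
            counts[(gene, effect)] += 1
    return counts


counts = count_effects_per_gene(io.StringIO(EXAMPLE))
for (gene, effect), n in sorted(counts.items()):
    print(gene, effect, n)
```

For a real file, replace the `io.StringIO` object with an open file handle for one sample's annotation file.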


### **Step 2: Build Gene Mutation Table:**

For each sample:
1. **Build gene maps** from the gff3 file.
2. Compile mutation info **gene-by-gene**.
3. **Determine mutation types**.
    - binary: mutation or wildtype
    - multiple: nonsense, nonstop, missense, silent or wildtype
4. **Output mutation types per gene** across all samples:

    |Gene|Sample1|Sample2|Sample3|Sample4|
    |:-:|:-:|:-:|:-:|:-:|
    |g1|m|w|w|m|
    |g2|m|m|m|w|
    |g3|w|w|w|m|
    |...|...|...|...|...|
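
The binary version of the steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the input dictionary, the gene list, and the helper `build_binary_table` are hypothetical, and treating synonymous changes as wildtype is an assumption of this sketch rather than confirmed MugGWAS behavior.

```python
# Hypothetical per-sample annotations: sample -> list of (gene, effect).
ANNOTATIONS = {
    "Sample1": [("g1", "nonsynonymous SNV"), ("g2", "stopgain")],
    "Sample2": [("g2", "nonsynonymous SNV"), ("g3", "synonymous SNV")],
    "Sample3": [],
}
GENES = ["g1", "g2", "g3"]

# Assumption for this sketch: only these effects count as a mutation.
DISRUPTIVE = {"nonsynonymous SNV", "stopgain", "stoploss"}


def build_binary_table(annotations, genes):
    """Return gene -> {sample: 'm' or 'w'} for the binary model."""
    table = {gene: {} for gene in genes}
    for sample, calls in annotations.items():
        # Several variants in one gene collapse into a single 'm' call
        mutated = {gene for gene, effect in calls if effect in DISRUPTIVE}
        for gene in genes:
            table[gene][sample] = "m" if gene in mutated else "w"
    return table


table = build_binary_table(ANNOTATIONS, GENES)
samples = list(ANNOTATIONS)
print("\t".join(["Gene"] + samples))
for gene in GENES:
    print("\t".join([gene] + [table[gene][s] for s in samples]))
```

Collapsing variants per gene is what keeps the number of tests close to the number of genes rather than the number of variants.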

**Import functions and annotation data**

```python
from muggwas import compile_gene_mutations, write_gene_mutation_summary, write_gene_annotation_summary

# Directory containing the ANNOVAR output files
annovar_output_dir = 'pyseer_dataset/annovar_files'
# Read gff3 to build gene maps
gff_file = 'pyseer_dataset/Spn23F/Spn23F.gff3'
# Write the mutation types to a tab delimited file
output_mutation_table = 'pyseer_dataset/gene_mutation_summary.txt'
# Write the annotations to a tab delimited file
output_annotation_table = 'pyseer_dataset/gene_annotation_summary.txt'
```

**Compile mutation and output gene summary table**

```python
## Compile the gene mutations for each sample
# if you want to output binary mutation types (mutated or not), set model = 'binary'
gene_mutation_summary = compile_gene_mutations(annovar_output_dir, gff_file, model = 'binary')
## Write the gene mutation summary
write_gene_mutation_summary(gene_mutation_summary, output_mutation_table)
## Write the gene annotations
write_gene_annotation_summary(gff_file, output_annotation_table)
```

Output:

```
Time elapsed for compiling gene mutations: 19.00 seconds
Total number of samples processed: 300
Gene mutation summary has been written to data/pyseer_dataset/gene_mutation_summary.txt.

Gene annotation summary has been written to data/pyseer_dataset/gene_annotation_summary.txt.
```


### **Step 3: Estimate Population Structure Effect:**

In gene-trait association analyses, false positives can arise from shared genetic lineage, especially in clonal organisms. There are two methods to estimate the effect of population structure:
- Phylogeny-based: Infer population structure from phylogenetic distances. Since MugGWAS uses a linear mixed model, the distance between each pair of samples is estimated from the shared branch length between the root and the pair's most recent common ancestor (MRCA).
- Kinship: Variants in core gene sequences reflect vertical evolution. To infer identity by descent, the script will calculate the kinship matrix from the genotype matrix of variant presence and absence. (Not supported yet as of 2025.04.15.)

Inputs: 
- Phylogeny-based: the IQ-TREE output `core_gene_tree.nwk`, a high-quality phylogeny based on single-copy core genes from a pangenome.
- Kinship: a VCF file documenting variants on core genes, `core_gene_snp.vcf.gz`. (Not supported yet as of 2025.04.15.)

Output: 
- `phylogeny_distances.tsv`: a distance matrix to account for population structure effect in pyseer.

    | |Sample1|Sample2|Sample3|Sample4|
    |:-:|:-:|:-:|:-:|:-:|
    |Sample1|0|4|0|0|
    |Sample2|4|0|0|0|
    |Sample3|0|0|0|3|
    |Sample4|0|0|3|0|
    |...|...|...|...|...|

**Phylogeny-based method**

```python
from muggwas import phylogeny2distmatrix

# Read the phylogenetic tree
phylogeny = 'pyseer_dataset/core_genome_aln.tree'
output_file = 'pyseer_dataset/phylogeny_distances.tsv'

# Convert the phylogenetic tree to a distance matrix
phylogeny2distmatrix(phylogeny, output_file)
```
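
To make the shared-branch-length idea concrete, here is a minimal sketch on a toy tree. It is not the implementation behind `phylogeny2distmatrix`, whose internals are not shown here: the tree, its branch lengths, and both helper functions are hypothetical, and the tree is hard-coded as a parent map instead of being parsed from a Newick file.

```python
# Toy rooted tree stored as child -> (parent, branch length).
# All names and lengths are hypothetical.
PARENT = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 3.0),
    "C": ("root", 4.0),
}


def edges_to_root(node):
    """Map every node on the path from `node` to the root to the length of the branch above it."""
    edges = {}
    while node in PARENT:
        parent, length = PARENT[node]
        edges[node] = length
        node = parent
    return edges


def shared_branch_length(a, b):
    """Sum the branches that lie on both root-to-tip paths, i.e. from the root to the MRCA of a and b."""
    edges_a = edges_to_root(a)
    edges_b = edges_to_root(b)
    return sum((length for node, length in edges_a.items() if node in edges_b), 0.0)


tips = ["A", "B", "C"]
for a in tips:
    print(a, [shared_branch_length(a, b) for b in tips])
```

In this toy tree, tips that split at the root share no branch length, while tips in the same clade share the branch leading to their common ancestor.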

### **Step 4: Run GWAS on pyseer**

This function filters out genes that have no mutations in the population and runs the [linear mixed effect model (FaST-LMM)](https://pyseer.readthedocs.io/en/master/usage.html#mixed-model-fast-lmm) from `pyseer` to infer gene-trait associations. Read the pyseer GWAS [tutorial](https://pyseer.readthedocs.io/en/master/tutorial.html) for more details.



- Input:
    1. The **phenotype file**: a tab-delimited file with sample names in the first column and phenotypic values in the second column (binary or numeric values are supported).
    2. The **gene mutation summary table**: the output from step 2.
    3. The **distance matrix**: the output from step 3.

- Output: A summary table of **GWAS result** that looks like this:

    |Genes|Allele Freq.|filter-value|lrt-pvalue|beta|beta-std-err|variant_h2|notes|
    |:---|:--|:------|:----------|:------|:-----------|:------|:-----------|
    |g1|3.00E-01|6.96E-02|1.59E-01|2.09E-01|1.35E-01|4.81E-01| |
    |g2|6.00E-01|2.89E-01|2.50E-01|-8.73E-02|7.03E-02|4.02E-01| |
    |g3|6.00E-01|7.02E-01|8.27E-01|-2.92E-02|1.30E-01|7.94E-02| |
    |...|...|...|...|...|...|...|...|
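
A result table like the one above can be ranked by p-value with the standard library alone. The sketch below is an illustration under assumptions: it presumes the result file is tab-delimited with a header row matching the column names in the table, which may differ from the file MugGWAS actually writes, and the helper `rank_by_pvalue` is hypothetical.

```python
import csv
import io

# Example rows copied from the table above; the header names are assumed
# to match the columns of the real result file.
RESULT_TEXT = (
    "Genes\tAllele Freq.\tfilter-value\tlrt-pvalue\tbeta\tbeta-std-err\tvariant_h2\tnotes\n"
    "g1\t3.00E-01\t6.96E-02\t1.59E-01\t2.09E-01\t1.35E-01\t4.81E-01\t\n"
    "g2\t6.00E-01\t2.89E-01\t2.50E-01\t-8.73E-02\t7.03E-02\t4.02E-01\t\n"
    "g3\t6.00E-01\t7.02E-01\t8.27E-01\t-2.92E-02\t1.30E-01\t7.94E-02\t\n"
)


def rank_by_pvalue(handle, pvalue_column="lrt-pvalue"):
    """Read a tab-delimited result table and return its rows sorted by ascending p-value."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        # Convert scientific-notation strings to floats for sorting
        row[pvalue_column] = float(row[pvalue_column])
    return sorted(rows, key=lambda row: row[pvalue_column])


ranked = rank_by_pvalue(io.StringIO(RESULT_TEXT))
for row in ranked:
    print(row["Genes"], row["lrt-pvalue"], row["beta"])
```

The sign of `beta` indicates the direction of the estimated effect, so it is worth printing alongside the p-value.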

```python
from muggwas import run_pyseer

phenotype_file = 'pyseer_dataset/resistances.subset.txt'
mutation_file = 'pyseer_dataset/gene_mutation_summary.txt'
similarity_matrix_file = 'pyseer_dataset/phylogeny_distances.tsv'
output_dir = 'pyseer_dataset/'

# Run pyseer
run_pyseer(phenotype_file, similarity_matrix_file, mutation_file, output_dir)
```


Output (truncated):

```
2025-04-15 23:09:46,264 - INFO - Start running GWAS with pyseer
2025-04-15 23:09:46,282 - INFO - Gene gene-SPN23F_RS00005 has a mutation rate of 0.27.
2025-04-15 23:09:46,282 - INFO - Gene cds-WP_000660615.1 has no mutations in the population. Skipping this gene.
2025-04-15 23:09:46,283 - INFO - Gene gene-SPN23F_RS00010 has a mutation rate of 0.27.
2025-04-15 23:09:46,284 - INFO - Gene cds-WP_000581157.1 has no mutations in the population. Skipping this gene.
2025-04-15 23:09:46,285 - INFO - Gene gene-SPN23F_RS00015 has a mutation rate of 0.11333333333333333.
2025-04-15 23:09:46,285 - INFO - Gene cds-WP_000285194.1 has no mutations in the population. Skipping this gene.
2025-04-15 23:09:46,286 - INFO - Gene gene-SPN23F_RS00020 has a mutation rate of 0.2966666666666667.
2025-04-15 23:09:46,286 - INFO - Gene cds-WP_001218707.1 has no mutations in the population. Skipping this gene.
2025-04-15 23:09:46,287 - INFO - Gene gene-SPN23F_RS00025 has a mutation rate of 0.7133333333333334.
2025-0
```
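
The log reports a mutation rate per gene and skips genes with no mutations. That filter can be sketched as follows, assuming a binary gene-by-sample table coded as `m`/`w`. The table contents and both helper functions are hypothetical and only illustrate the idea; they are not taken from the MugGWAS source.

```python
# Hypothetical binary table: gene -> {sample: 'm' (mutated) or 'w' (wildtype)}.
TABLE = {
    "g1": {"S1": "m", "S2": "w", "S3": "w", "S4": "m"},
    "g2": {"S1": "w", "S2": "w", "S3": "w", "S4": "w"},
    "g3": {"S1": "m", "S2": "m", "S3": "m", "S4": "w"},
}


def mutation_rates(table):
    """Fraction of samples carrying a mutation for each gene (assumes at least one sample per gene)."""
    return {
        gene: sum(call == "m" for call in calls.values()) / len(calls)
        for gene, calls in table.items()
    }


def drop_unmutated_genes(table):
    """Keep only the genes that are mutated in at least one sample."""
    rates = mutation_rates(table)
    return {gene: calls for gene, calls in table.items() if rates[gene] > 0}


rates = mutation_rates(TABLE)
kept = drop_unmutated_genes(TABLE)
for gene, rate in rates.items():
    status = "tested" if gene in kept else "skipped"
    print(f"{gene}: mutation rate {rate:.2f} ({status})")
```

Genes without any mutated sample carry no information about the phenotype, so removing them also reduces the number of tests to correct for.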

## **Future Directions**

1. Add a statistical significance threshold to the GWAS result:
    - The Bonferroni threshold depends on how many genes are tested.

2. Add a plotting module:
    - Input: `gwas_result.txt`
    - Output:
        1. Q-Q plot: `gwas_result.qq.png`
        2. Manhattan plot: `gwas_result.man.png`
    - Functionality: Plot Q-Q plots and Manhattan plots.
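
Until these features exist, both ideas can be approximated by hand. The sketch below is a hedged illustration, not the planned module: the function names are hypothetical, the p-values are the three example values from the Step 4 table, and the expected quantiles use the `(i - 0.5) / n` convention, which is one common choice among several.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold for a family-wise error rate of alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def qq_points(pvalues):
    """Return (expected, observed) -log10 p-value pairs for a Q-Q plot.

    Assumes every p-value is strictly greater than 0.
    """
    observed = sorted(pvalues)
    n = len(observed)
    # Expected quantiles under the null hypothesis of uniform p-values
    return [
        (-math.log10((i - 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed, start=1)
    ]


# Example lrt-pvalues taken from the Step 4 result table
pvalues = [1.59e-01, 2.50e-01, 8.27e-01]
threshold = bonferroni_threshold(len(pvalues))
significant = [p for p in pvalues if p < threshold]
points = qq_points(pvalues)

print(f"Bonferroni threshold for {len(pvalues)} tests: {threshold:.4g}")
print(f"Significant tests: {len(significant)}")
for expected, observed in points:
    print(f"expected {expected:.3f}  observed {observed:.3f}")
```

Here `n_tests` should be the number of genes actually tested, meaning after genes without mutations are removed. The pairs from `qq_points` can be passed to any plotting library.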