# **Mutated-gene Genome-Wide Association Study (MugGWAS)**

MugGWAS identifies potential causal mutations for phenotypic change through testing associations between mutated genes and phenotypic values. The tool aims to 

### **Prerequisites**
MugGWAS uses the external tool **ANNOVAR** to annotate variants. Follow the [instructions](http://annovar.openbioinformatics.org/en/latest/) to download ANNOVAR (registration is required), then provide the path to the ANNOVAR scripts when running MugGWAS. More instructions are provided below.

MugGWAS uses the fixed effect model in [`pyseer`](https://pyseer.readthedocs.io/en/master/index.html) for association analysis, so users must also install `pyseer` and meet its [requirements](https://pyseer.readthedocs.io/en/master/installation.html#prerequisites).

### **Installation**
Clone this repository from GitHub in the terminal:

`git clone https://github.com/lyuchengmarvin/MugGWAS.git`


### **Build Environment**
Build a conda environment from the `envs/environment.yaml` file to satisfy the software prerequisites.

`conda env create --name muggwas --file envs/environment.yaml`

Activate the environment before each use:

`conda activate muggwas`

Deactivate after use:

`conda deactivate`

### **Install MugGWAS**

`pip install -i https://test.pypi.org/simple/ MugGWAS`

### **Tutorial**
Steps:
1. Annotate variants
2. Build gene mutation table
3. Estimate population structure
4. GWAS with pyseer
5. Plot GWAS result

#### **Annotate variants with ANNOVAR**

We use the external tool ANNOVAR to annotate the variants in VCF files. Although a single VCF file usually contains variant information for multiple samples, ANNOVAR annotates the samples one at a time and creates one annotation file per sample.

MugGWAS assumes that users will build a customized database for their non-model organisms, so a genome annotation `<ref_prefix>.gff3` and a genome FASTA file `<ref_prefix>.fna` are required and must be placed in the directory `<ref_prefix>/`. The database is built with the tool `gff3ToGenePred`, which is installed when the conda environment is created.


**Annotation workflow**

1. **Make input files**:
    - Input: a VCF file `vcf_prefix.vcf` (with multiple samples)
    - Convert VCF to ANNOVAR input format `vcf_prefix.<sample_name>.avinput`
    - Store in `/path_to_vcf/annovar_files/`
2. **Build database**:
    - Input: `ref_prefix.gff3` and `ref_prefix.fna` in `db_dir/`
    - Build customized database: `ref_prefix_refGene.txt` and `ref_prefix_refGeneMrna.fa`
3. **Annotate variants with ANNOVAR**:
    - Output 1: `vcf_prefix.<sample_name>.avinput.variant_function`, which gives the position of each mutation on a gene.
    - Output 2: `vcf_prefix.<sample_name>.avinput.exonic_variant_function`, which gives the effect of each mutation on translation: synonymous, nonsynonymous, stop codon gain, or stop codon loss.
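
The three steps above roughly correspond to a handful of command-line calls. The sketch below only assembles those commands as argument lists, without running anything. The script names and flags are written from memory of the ANNOVAR documentation, and the helper `annovar_commands` is hypothetical. `run_annovar` may build its commands differently, so treat this as an illustration of the workflow rather than the package's implementation.

```python
from pathlib import Path


def annovar_commands(annovar_dir, db_dir, vcf_path, vcf_prefix, ref_prefix, sample):
    """Return the workflow commands for one sample as argument lists (hypothetical helper)."""
    annovar = Path(annovar_dir)
    db = Path(db_dir)
    # Assumption: converted files go to an annovar_files/ folder next to the VCF.
    out_dir = Path(vcf_path).parent / "annovar_files"
    ref_gene = db / f"{ref_prefix}_refGene.txt"
    avinput = out_dir / f"{vcf_prefix}.{sample}.avinput"
    return [
        # 1. Split the multi-sample VCF into one .avinput file per sample.
        ["perl", str(annovar / "convert2annovar.pl"), "-format", "vcf4",
         str(vcf_path), "-outfile", str(out_dir / vcf_prefix), "-allsample"],
        # 2a. Convert the GFF3 annotation to genePred format.
        ["gff3ToGenePred", str(db / f"{ref_prefix}.gff3"), str(ref_gene)],
        # 2b. Extract transcript sequences for the customized database.
        ["perl", str(annovar / "retrieve_seq_from_fasta.pl"), "--format", "refGene",
         "--seqfile", str(db / f"{ref_prefix}.fna"), str(ref_gene),
         "--out", str(db / f"{ref_prefix}_refGeneMrna.fa")],
        # 3. Gene-based annotation of this sample against the customized database.
        ["perl", str(annovar / "annotate_variation.pl"), "-geneanno",
         "-dbtype", "refGene", "-buildver", ref_prefix,
         "-out", str(avinput), str(avinput), str(db)],
    ]


commands = annovar_commands(
    "../annovar/", "data/LE18_22/", "data/graz_LE.snp.vcf",
    "graz_LE", "LE18_22", "ND_85",
)
for command in commands:
    print(" ".join(command))
```

Printing the commands before running them is a cheap way to confirm that the paths and prefixes resolve to the files listed in the workflow.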

```python
from muggwas import run_annovar

## User inputs
# Path to the ANNOVAR installation
annovar_dir = '../annovar/'
# Directory where the reference files are located
db_dir = 'data/LE18_22/'
# Path to the VCF file
input_dir = 'data/graz_LE.snp.vcf'
# Output prefix for the converted ANNOVAR input files
vcf_prefix = 'graz_LE'
# Output prefix for the ANNOVAR database
ref_prefix = 'LE18_22'

## Run ANNOVAR
run_annovar(
    annovar_dir=annovar_dir,
    db_dir=db_dir,
    input_dir=input_dir,
    vcf_prefix=vcf_prefix,
    ref_prefix=ref_prefix
)
```


```text
Converting VCF file to annovar input format...

NOTICE: output files will be written to data/annovar_files/graz_LE.<samplename>.avinput
NOTICE: Finished reading 21826 lines from VCF file
NOTICE: A total of 21154 locus in VCF file passed QC threshold, representing 21154 SNPs (13997 transitions and 7157 transversions) and 0 indels/substitutions
NOTICE: Finished writing 90239 SNP genotypes (61175 transitions and 29064 transversions) and 0 indels/substitutions for 22 samples

The VCF file has been converted and saved to data/annovar_files/.

NOTICE: Reading region file data/LE18_22/LE18_22_refGene.txt ... Done with 5095 regions from 613 chromosomes
NOTICE: Finished reading 642 sequences from data/LE18_22/LE18_22.fna
NOTICE: Finished reading 642 sequences from data/LE18_22/LE18_2

Database has been built and saved to data/LE18_22/.

Annotating avinput files...

NOTICE: Output files are written to data/annovar_files/graz_LE.ND_85.avinput.variant_function, data/annovar_files/graz_LE.ND_85.avinput.exonic_variant_function
NOTICE: Reading gene annotation from data/LE18_22/LE18_22_refGene.txt ... NOTICE: Output files are written to data/annovar_files/graz_LE.ND_79.avinput.variant_function, data/annovar_files/graz_LE.ND_79.avinput.exonic_variant_function
NOTICE: Reading gene annotation from data/LE18_22/LE18_22_refGene.txt ... NOTICE: Output files are written to data/annovar_files/graz_LE.ND_78.avinput.variant_function, data/annovar_files/graz_LE.ND_78.avinput.exonic_variant_function
NOTICE: Reading gene annotation from data/LE18_22/LE18_22_refGene.txt ... NOTICE: Output files are written to data/annovar_files/graz_LE.ND_101.avinput.variant_function, data/annovar_files/graz_LE.ND_101.avinput.exonic_variant_function
NOTICE: Reading gene annotation from data/LE18_22/LE18_22_refGene.txt ... NOTICE: Output files are written to data/annovar_files/graz_LE

Annotated 22 avinput files in 2.42 seconds.
Annotated avinput files are saved in data/annovar_files/.
```
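
To get a feel for what the annotation files contain, the following sketch tallies consequences per gene from an `.exonic_variant_function` file. It assumes the usual ANNOVAR layout (tab-delimited; column 2 holds the consequence label and column 3 starts with the gene name), and the example rows are made up for illustration. The function `count_consequences` is a hypothetical helper, not part of MugGWAS.

```python
import csv
import io
from collections import Counter


def count_consequences(handle):
    """Count (gene, consequence) pairs in an exonic_variant_function file."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        consequence = row[1]
        # Assumed format of column 3: "gene:transcript:exon:cDNA change:protein change,"
        gene = row[2].split(":", 1)[0]
        counts[(gene, consequence)] += 1
    return counts


# Made-up rows for illustration only; real files are written by ANNOVAR.
example = io.StringIO(
    "line3\tnonsynonymous SNV\tgeneA:geneA:exon1:c.A10G:p.T4A,\tcontig1\t110\t110\tA\tG\n"
    "line7\tsynonymous SNV\tgeneA:geneA:exon1:c.C30T:p.G10G,\tcontig1\t130\t130\tC\tT\n"
    "line9\tstopgain\tgeneB:geneB:exon1:c.C4T:p.Q2X,\tcontig2\t54\t54\tC\tT\n"
)
consequence_counts = count_consequences(example)
for (gene, consequence), n in sorted(consequence_counts.items()):
    print(gene, consequence, n)
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.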


#### **Build Gene Mutation Table:**

For each sample:
1. Build gene maps from the gff3 file.
2. Compile mutation info gene-by-gene.
3. Determine mutation types.
    - binary: mutation or wildtype
    - multiple: nonsense, nonstop, missense, silent or wildtype
4. Output mutation types.
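
Step 3 can be pictured as collapsing all variants found in a gene into a single label. The sketch below is one possible way to do that, not the logic inside `compile_gene_mutations`. The label mapping, the severity order, and the choice to count silent variants as a mutation in the binary model are all assumptions made for illustration.

```python
# Assumed mapping from ANNOVAR consequence labels to the categories listed above.
CONSEQUENCE_TO_TYPE = {
    "stopgain": "nonsense",
    "stoploss": "nonstop",
    "nonsynonymous SNV": "missense",
    "synonymous SNV": "silent",
}
# Assumed severity order, most disruptive first.
SEVERITY = ["nonsense", "nonstop", "missense", "silent"]


def gene_mutation_type(consequences, model="multiple"):
    """Collapse the consequence labels observed in one gene into one mutation type."""
    types = {CONSEQUENCE_TO_TYPE[c] for c in consequences if c in CONSEQUENCE_TO_TYPE}
    if not types:
        return "wildtype"
    if model == "binary":
        # Assumption: any recognized variant, including a silent one, counts.
        return "mutation"
    # Report the most severe type found in the gene.
    return min(types, key=SEVERITY.index)


print(gene_mutation_type([]))
print(gene_mutation_type(["synonymous SNV", "stopgain"]))
print(gene_mutation_type(["synonymous SNV"], model="binary"))
```

Reporting only the most severe type keeps one value per gene and sample, which is the shape a gene-by-sample table needs.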

**Import functions and annotation data**

```python
from muggwas import compile_gene_mutations, write_gene_mutation_summary

# Directory containing the ANNOVAR output files
annovar_output_dir = 'data/annovar_files'
# Read gff3 to build gene maps
gff_file = 'data/LE18_22/LE18_22.gff3'
# Write the mutation types to a tab delimited file
output_file = 'data/gene_mutation_summary.txt'
```

**Compile variants and output gene mutation table**

```python
## Compile the gene mutations for each sample
# To output binary mutation types (mutated or not), set model='binary'
gene_mutation_summary = compile_gene_mutations(annovar_output_dir, gff_file, model='multiple')
## Write the gene mutation summary
write_gene_mutation_summary(gene_mutation_summary, output_file)
```

```text
Time elapsed for compiling gene mutations: 2.13 seconds
Total number of samples processed: 22
Gene mutation summary has been written to data/gene_mutation_summary.txt.
```



#### **Estimate Population Structure Effect:**

In gene-trait association analyses, false positives can arise from shared genetic lineage, especially in clonal organisms. There are two methods to estimate the effect of population structure:
- Phylogeny-based: Infer population structure from phylogenetic distances. Since MugGWAS will use a linear mixed model, the distance for each pair of samples is estimated from the branch length between their MRCA (most recent common ancestor) and the root.
- Kinship: Variants in core gene sequences reflect vertical evolution. To infer identity by descent, the script will calculate a kinship matrix from the genotype matrix of variant presence and absence (not supported as of 2025.04.03).
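
Since the kinship method is not available in MugGWAS yet, the sketch below only illustrates the idea with one very simple definition: the fraction of variant sites at which two samples both carry the variant. This is an assumption chosen for clarity; other kinship estimators center or standardize the genotypes first. The function `kinship_matrix` and the toy genotypes are hypothetical.

```python
def kinship_matrix(genotypes):
    """Pairwise sharing of variants.

    genotypes maps each sample name to a list of 0/1 values, one per variant
    site (1 = variant present). Returns a dict keyed by (sample_a, sample_b).
    """
    samples = sorted(genotypes)
    n_variants = len(genotypes[samples[0]])
    kinship = {}
    for a in samples:
        for b in samples:
            # Count sites where both samples carry the variant.
            shared = sum(x * y for x, y in zip(genotypes[a], genotypes[b]))
            kinship[(a, b)] = shared / n_variants
    return kinship


# Toy presence/absence matrix: three samples, four variant sites.
toy_genotypes = {
    "s1": [1, 1, 0, 0],
    "s2": [1, 0, 0, 0],
    "s3": [0, 0, 1, 1],
}
toy_kinship = kinship_matrix(toy_genotypes)
print(toy_kinship[("s1", "s2")], toy_kinship[("s1", "s3")])
```

In this toy example `s1` and `s2` share one variant while `s1` and `s3` share none, so the first pair scores higher.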

Inputs: 
- Phylogeny-based: `core_gene_tree.nwk`, the output from IQ-TREE. This should be a high-quality phylogeny based on single-copy core genes from a pangenome.
- Kinship: A VCF file documenting variants on core genes `core_gene_snp.vcf.gz`.

Output: 
- `phylogeny_distances.tsv`: a file used to account for the effect of population structure in pyseer.

**Phylogeny-based method**

```python
from muggwas import phylogeny2distmatrix

# Path to the phylogenetic tree
phylogeny = 'data/LE_tree.nwk'
output_file = 'data/phylogeny_distances.tsv'

# Convert the phylogenetic tree to a distance matrix
phylogeny2distmatrix(phylogeny, output_file)
```
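
To make the phylogeny-based estimate more concrete, here is a minimal sketch of the "shared branch length between the MRCA and the root" for a pair of samples. It is not the implementation of `phylogeny2distmatrix`. The toy tree is hand-coded as a child-to-parent map instead of being parsed from a Newick file, and the function names are hypothetical.

```python
def path_to_root(node, parents):
    """Return [(node, branch_length), ...] walking from node up to the root."""
    path = []
    while node in parents:
        parent, length = parents[node]
        path.append((node, length))  # a branch is identified by its child node
        node = parent
    return path


def shared_branch_length(a, b, parents):
    """Total length of the branches shared by the root-to-a and root-to-b paths.

    These shared branches are exactly those between the root and the MRCA of a and b.
    """
    branches_b = {node for node, _ in path_to_root(b, parents)}
    return sum(length for node, length in path_to_root(a, parents) if node in branches_b)


# Toy tree, equivalent to the Newick string ((A:1,B:2)X:3,C:4)root;
toy_parents = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("root", 3.0),
    "C": ("root", 4.0),
}
print(shared_branch_length("A", "B", toy_parents))
print(shared_branch_length("A", "C", toy_parents))
```

Samples `A` and `B` share the branch leading to their ancestor `X`, whereas `A` and `C` only meet at the root and share nothing. A sample compared with itself gets its full root-to-tip length.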