# Notebook for computing the fitness effects of amino-acid mutations

## Snakemake input

In [None]:
orf1ab_to_nsps = snakemake.params.orf_to_nsps
gene_overlaps = snakemake.params.gene_ov
genes = snakemake.params.genes
fitness_pseudocount = snakemake.params.fit_pseudo
ntmut_fit = snakemake.input.ntfit_csv
output = snakemake.output.aafit_csv

## Import packages

In [None]:
import numpy as np
import pandas as pd
import sys
import os

In [None]:
# Add the parent directory to sys.path so that the `modules` package can be imported
module_path = os.path.abspath("..")
if module_path not in sys.path:
    sys.path.append(module_path)

In [None]:
from modules import aamutfit

Columns to explode so that each gene gets its own row:

In [None]:
explode_cols = [
    "gene",
    "clade_founder_aa",
    "mutant_aa",
    "codon_site",
    "aa_mutation",
]



Read data, then:

* Exclude mutations in the overlapping reading frames specified for exclusion.
* Explode the dataframe so that each gene gets its own row.
* Drop ORF1a: once sites in overlapping reading frames are excluded, every remaining ORF1a site is also in ORF1ab.
* For each clade, aggregate the expected and actual counts of all nucleotide mutations that give the same amino-acid mutation.
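
The explode, drop, and aggregate steps listed above can be sketched on toy data. This is only an illustration, not the code in `aamutfit.get_coding` or `aamutfit.aggregate_counts`, which is not shown in this notebook. The `;`-delimited encoding and the column names `clade`, `expected_count`, and `actual_count` are assumptions, and exploding several columns at once requires pandas 1.3 or later.

```python
import pandas as pd

# Toy nucleotide-level table. The ";"-delimited encoding and the count
# column names are assumptions made for illustration only.
nt = pd.DataFrame(
    {
        "clade": ["20A", "20A", "20A"],
        "gene": ["ORF1a;ORF1ab", "S", "S"],
        "aa_mutation": ["T10I;T10I", "D614G", "D614G"],
        "expected_count": [2.0, 1.5, 0.5],
        "actual_count": [3, 4, 1],
    }
)

cols = ["gene", "aa_mutation"]

# Split each delimited string into a list, then explode the lists in
# parallel so that every gene gets its own row.
for c in cols:
    nt[c] = nt[c].str.split(";")
exploded = nt.explode(cols)

# Drop ORF1a, then sum the counts of identical amino-acid changes per clade.
exploded = exploded[exploded["gene"] != "ORF1a"]
aggregated = exploded.groupby(["clade", *cols], as_index=False)[
    ["expected_count", "actual_count"]
].sum()
```

Here the two nucleotide mutations that both encode `D614G` collapse into a single row whose counts are the sums of the originals.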

In [None]:
# Read in the fitness of nucleotide mutations (rebinds `ntmut_fit` from file path to dataframe)
ntmut_fit = pd.read_csv(ntmut_fit)

In [None]:
ntmut_fit.head()

Get only coding mutations

In [None]:
ntmut_fit_coding = aamutfit.get_coding(ntmut_fit, gene_overlaps, explode_cols)

In [None]:
ntmut_fit_coding.head()

Aggregate counts for amino acid mutations

In [None]:
aa_counts = aamutfit.aggregate_counts(ntmut_fit_coding, explode_cols)

Add naive fitness estimates

In [None]:
aamutfit.naive_fitness(aa_counts, fitness_pseudocount=fitness_pseudocount)
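
The body of `aamutfit.naive_fitness` is not shown in this notebook. As a rough sketch, one plausible definition is the log ratio of actual to expected counts, with the pseudocount added to both so that zero counts stay finite. The formula, the function name `naive_fitness_sketch`, and the column names below are all assumptions, not the module's actual implementation.

```python
import numpy as np
import pandas as pd


def naive_fitness_sketch(df, fitness_pseudocount=0.5):
    """Hypothetical stand-in for aamutfit.naive_fitness (formula assumed)."""
    out = df.copy()
    # The pseudocount is added to both counts so that a mutation with
    # zero observed counts still gets a finite (negative) estimate.
    out["naive_delta_fitness"] = np.log(
        (out["actual_count"] + fitness_pseudocount)
        / (out["expected_count"] + fitness_pseudocount)
    )
    return out


toy = pd.DataFrame({"expected_count": [10.0, 10.0], "actual_count": [10, 0]})
result = naive_fitness_sketch(toy)
```

Under this definition a mutation observed exactly as often as expected scores zero, and one observed less often scores below zero. The notebook cell discards the return value, which suggests the real function modifies `aa_counts` in place; the sketch returns a copy instead.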

In [None]:
aa_counts.head()

Dataframe with refined fitness estimates

In [None]:
aa_fit = aamutfit.aa_fitness(ntmut_fit_coding, explode_cols)

In addition to the entries for full ORF1ab, we also want mutations numbered according to the nsp (non-structural protein) naming.

First, make a data frame that converts the numbering:

In [None]:
orf1ab_to_nsps_df = aamutfit.map_orf1ab_to_nsps(orf1ab_to_nsps)
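
The numbering conversion can be sketched as follows. This is not the implementation of `aamutfit.map_orf1ab_to_nsps`. It assumes the parameter maps each nsp name to an inclusive `(start, end)` range in ORF1ab numbering; the coordinates and the output column names are illustrative only.

```python
import pandas as pd

# Assumed shape: nsp name -> (first, last) ORF1ab site, inclusive.
# The coordinates are illustrative, not taken from the workflow config.
orf1ab_to_nsps_toy = {"nsp1": (1, 180), "nsp2": (181, 818)}

# One row per ORF1ab site, giving the nsp it falls in and its 1-based
# position within that nsp.
rows = [
    (site, nsp, site - start + 1)
    for nsp, (start, end) in orf1ab_to_nsps_toy.items()
    for site in range(start, end + 1)
]
mapping = pd.DataFrame(rows, columns=["ORF1ab_site", "nsp", "nsp_site"])
```

A long lookup table like this makes the later renumbering a plain merge on the ORF1ab site.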

Now we add the estimates for the nsp proteins to the dataframes that contain ORF1ab. Note that this means mutations in both ORF1ab and an nsp show up **twice** in the data frame under different names, so we add a column to indicate which genes are a subset of ORF1ab:

In [None]:
aa_counts = aamutfit.add_nsps(aa_counts, orf1ab_to_nsps_df)
aa_fit = aamutfit.add_nsps(aa_fit, orf1ab_to_nsps_df)
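
A minimal sketch of what adding the nsp entries could look like, assuming a mapping table with the hypothetical columns `ORF1ab_site`, `nsp`, and `nsp_site`. The flag column name `subset_of_ORF1ab`, the `delta_fitness` column, and the toy values are assumptions; the real `aamutfit.add_nsps` may differ.

```python
import pandas as pd

# Toy inputs: values and column names are for illustration only.
fit = pd.DataFrame(
    {"gene": ["ORF1ab", "S"], "aa_site": [181, 614], "delta_fitness": [-1.2, 0.8]}
)
mapping = pd.DataFrame({"ORF1ab_site": [181], "nsp": ["nsp2"], "nsp_site": [1]})

# Copy the ORF1ab rows, renaming gene and site to the nsp numbering.
nsp_rows = (
    fit[fit["gene"] == "ORF1ab"]
    .merge(mapping, left_on="aa_site", right_on="ORF1ab_site", how="inner")
    .assign(gene=lambda d: d["nsp"], aa_site=lambda d: d["nsp_site"])
    .drop(columns=["ORF1ab_site", "nsp", "nsp_site"])
    .assign(subset_of_ORF1ab=True)
)

# Original rows are kept and flagged False; the nsp duplicates are flagged True.
combined = pd.concat(
    [fit.assign(subset_of_ORF1ab=False), nsp_rows], ignore_index=True
)
```

Filtering on the flag column then recovers a table without double-counted mutations.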

Merge counts and fitness dataframes and write to file

In [None]:
aamut_fitness = aamutfit.merge_aa_df(aa_fit, aa_counts, explode_cols)

Order the dataframe by gene (in the order given by `genes`), then by site within each gene:

In [None]:
aamut_fitness['gene'] = pd.Categorical(aamut_fitness['gene'], categories=genes, ordered=True)
aamut_fitness = aamut_fitness.sort_values(['gene', 'aa_site']).reset_index(drop=True)
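
The effect of sorting on an ordered categorical can be seen on toy data. This sketch uses `pd.Categorical`, and the gene list and sites are made up for illustration.

```python
import pandas as pd

genes_toy = ["ORF1ab", "S", "N"]  # custom order, not alphabetical

df = pd.DataFrame(
    {"gene": ["N", "S", "ORF1ab", "S"], "aa_site": [5, 614, 10, 20]}
)

# An ordered categorical makes sort_values follow the listed category
# order instead of sorting the gene names as strings.
df["gene"] = pd.Categorical(df["gene"], categories=genes_toy, ordered=True)
df = df.sort_values(["gene", "aa_site"]).reset_index(drop=True)
```

Any gene value missing from the category list becomes `NaN` in the conversion, so the list passed as `categories` needs to cover every gene in the dataframe, including the nsp names.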

In [None]:
aamut_fitness.head()

In [None]:
aamut_fitness.to_csv(output, index=False)