# Toy example with `CodonVariantTable`
This notebook illustrates the use of a [CodonVariantTable](https://jbloomlab.github.io/dms_variants/dms_variants.codonvarianttable.html#dms_variants.codonvarianttable.CodonVariantTable) on a small toy dataset.

It is written primarily as a test for that class.

Import required modules:

In [1]:
import os
import tempfile

import pandas as pd

from dms_variants.codonvarianttable import CodonVariantTable

Initialize a toy `CodonVariantTable` for a short gene:

In [2]:
geneseq = "ATGGGATGA"

with tempfile.NamedTemporaryFile(mode="w") as f:
    _ = f.write(
        "library,barcode,substitutions,variant_call_support\n"
        "lib_1,AAC,,2\n"
        "lib_1,GAT,G4C A6C,1\n"
        "lib_2,AAC,T2A G4T,2\n"
        "lib_2,CAT,A6C,3"
    )
    f.flush()
    variants = CodonVariantTable(barcode_variant_file=f.name, geneseq=geneseq)

Check attributes of the `CodonVariantTable`:

In [3]:
variants.sites

[1, 2, 3]

In [4]:
variants.codons

OrderedDict([(1, 'ATG'), (2, 'GGA'), (3, 'TGA')])

In [5]:
variants.aas

OrderedDict([(1, 'M'), (2, 'G'), (3, '*')])

In [6]:
variants.libraries

['lib_1', 'lib_2']

In [7]:
variants.valid_barcodes("lib_1")

{'AAC', 'GAT'}

In [8]:
variants.valid_barcodes("lib_2")

{'AAC', 'CAT'}

In [9]:
variants.barcode_variant_df

Unnamed: 0,library,barcode,variant_call_support,codon_substitutions,aa_substitutions,n_codon_substitutions,n_aa_substitutions
0,lib_1,AAC,2,,,0,0
1,lib_1,GAT,1,GGA2CGC,G2R,1,1
2,lib_2,AAC,2,ATG1AAG GGA2TGA,M1K G2*,2,2
3,lib_2,CAT,3,GGA2GGC,,1,0
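
The input file lists nucleotide substitutions in 1-based numbering of `geneseq`, whereas the table above reports codon and amino-acid substitutions. The sketch below shows one way such a conversion could be done for this toy gene. The helper `nt_to_codon_and_aa_subs` is hypothetical (written for illustration, not part of `dms_variants`), and it assumes the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG-ordered string (assumption:
# the toy gene uses the standard code).
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}


def nt_to_codon_and_aa_subs(geneseq, nt_subs):
    """Hypothetical helper: convert e.g. 'G4C A6C' to ('GGA2CGC', 'G2R')."""
    mutated = list(geneseq)
    for sub in nt_subs.split():
        wt, pos, mut = sub[0], int(sub[1:-1]), sub[-1]
        if geneseq[pos - 1] != wt:
            raise ValueError(f"wildtype mismatch in {sub}")
        mutated[pos - 1] = mut
    mutated = "".join(mutated)

    codon_subs, aa_subs = [], []
    for site in range(1, len(geneseq) // 3 + 1):
        wt_codon = geneseq[3 * (site - 1) : 3 * site]
        mut_codon = mutated[3 * (site - 1) : 3 * site]
        if wt_codon != mut_codon:
            codon_subs.append(f"{wt_codon}{site}{mut_codon}")
            wt_aa, mut_aa = CODON_TO_AA[wt_codon], CODON_TO_AA[mut_codon]
            if wt_aa != mut_aa:  # synonymous changes give no aa substitution
                aa_subs.append(f"{wt_aa}{site}{mut_aa}")
    return " ".join(codon_subs), " ".join(aa_subs)


print(nt_to_codon_and_aa_subs("ATGGGATGA", "G4C A6C"))  # ('GGA2CGC', 'G2R')
```

Note how the two nucleotide changes `G4C` and `A6C` fall in the same codon and so collapse into a single codon substitution.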


We can also look at the number of variants by calling `CodonVariantTable.n_variants_df` with `samples=None`, since we have not added any samples yet and just want statistics across barcoded variants:

In [10]:
variants.n_variants_df(samples=None)

Unnamed: 0,library,sample,count
0,lib_1,barcoded variants,2
1,lib_2,barcoded variants,2
2,all libraries,barcoded variants,4


Here is the number of variants if we look at just single amino-acid mutants (and wildtype):

In [11]:
variants.n_variants_df(samples=None, variant_type="single", mut_type="aa")

Unnamed: 0,library,sample,count
0,lib_1,barcoded variants,2
1,lib_2,barcoded variants,1
2,all libraries,barcoded variants,3


We can also see how these numbers change if we require a variant call support of at least 2:

In [12]:
variants.n_variants_df(samples=None, min_support=2)

Unnamed: 0,library,sample,count
0,lib_1,barcoded variants,1
1,lib_2,barcoded variants,2
2,all libraries,barcoded variants,3
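
To make the filtering explicit, here is a minimal sketch that reproduces the three sets of counts above with plain Python. The function `count_variants` and its arguments are hypothetical names chosen for illustration; they only mimic the filters applied to this toy data and are not the library's implementation.

```python
# Toy variants copied from `barcode_variant_df`:
# (library, barcode, variant_call_support, n_aa_substitutions)
VARIANTS = [
    ("lib_1", "AAC", 2, 0),
    ("lib_1", "GAT", 1, 1),
    ("lib_2", "AAC", 2, 2),
    ("lib_2", "CAT", 3, 0),
]


def count_variants(variants, single_aa_only=False, min_support=1):
    """Hypothetical helper: count variants per library after filtering."""
    counts = {}
    for library, _barcode, support, n_aa in variants:
        if support < min_support:
            continue  # drop variants with too little variant call support
        if single_aa_only and n_aa > 1:
            continue  # keep only wildtype and single amino-acid mutants
        counts[library] = counts.get(library, 0) + 1
    counts["all libraries"] = sum(counts.values())
    return counts


print(count_variants(VARIANTS))
print(count_variants(VARIANTS, single_aa_only=True))
print(count_variants(VARIANTS, min_support=2))
```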


If we want to combine the data for the two libraries, we can use `CodonVariantTable.addMergedLibraries`, which creates a new combined library called "all libraries":

In [13]:
variants.addMergedLibraries(variants.barcode_variant_df)

Unnamed: 0,library,barcode,variant_call_support,codon_substitutions,aa_substitutions,n_codon_substitutions,n_aa_substitutions
0,lib_1,AAC,2,,,0,0
1,lib_1,GAT,1,GGA2CGC,G2R,1,1
2,lib_2,AAC,2,ATG1AAG GGA2TGA,M1K G2*,2,2
3,lib_2,CAT,3,GGA2GGC,,1,0
4,all libraries,lib_1-AAC,2,,,0,0
5,all libraries,lib_1-GAT,1,GGA2CGC,G2R,1,1
6,all libraries,lib_2-AAC,2,ATG1AAG GGA2TGA,M1K G2*,2,2
7,all libraries,lib_2-CAT,3,GGA2GGC,,1,0


Note, however, that `CodonVariantTable.addMergedLibraries` does nothing if there is only one library:

In [14]:
variants.addMergedLibraries(variants.barcode_variant_df.query('library == "lib_1"'))

Unnamed: 0,library,barcode,variant_call_support,codon_substitutions,aa_substitutions,n_codon_substitutions,n_aa_substitutions
0,lib_1,AAC,2,,,0,0
1,lib_1,GAT,1,GGA2CGC,G2R,1,1
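
The merging behavior seen in the two outputs above can be sketched in a few lines: each row is duplicated into a combined library, and the barcode is prefixed with the original library name so that barcodes shared between libraries (such as `AAC`) stay distinct. The function `add_merged_libraries` below is a hypothetical stand-in operating on a list of dicts, not the library's own code.

```python
def add_merged_libraries(rows, all_lib="all libraries"):
    """Hypothetical sketch: append a merged copy of the rows if >1 library."""
    libraries = {row["library"] for row in rows}
    if len(libraries) < 2:
        return list(rows)  # nothing to merge with a single library
    merged = [
        {**row, "library": all_lib, "barcode": f"{row['library']}-{row['barcode']}"}
        for row in rows
    ]
    return list(rows) + merged


rows = [
    {"library": "lib_1", "barcode": "AAC"},
    {"library": "lib_1", "barcode": "GAT"},
    {"library": "lib_2", "barcode": "AAC"},
    {"library": "lib_2", "barcode": "CAT"},
]
for row in add_merged_libraries(rows):
    print(row["library"], row["barcode"])
```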


Count number of barcoded variants with each mutation (here we only show mutations with non-zero counts):

In [15]:
variants.mutCounts("all", "aa", samples=None).query("count > 0")

Unnamed: 0,library,sample,mutation,count,mutation_type,site
0,lib_1,barcoded variants,G2R,1,nonsynonymous,2
60,lib_2,barcoded variants,G2*,1,stop,2
61,lib_2,barcoded variants,M1K,1,nonsynonymous,1
120,all libraries,barcoded variants,G2*,1,stop,2
121,all libraries,barcoded variants,G2R,1,nonsynonymous,2
122,all libraries,barcoded variants,M1K,1,nonsynonymous,1


We can do the same for codon mutations (here for only a single library), first for all variants:

In [16]:
variants.mutCounts("all", "codon", samples=None, libraries=["lib_2"]).query("count > 0")

Unnamed: 0,library,sample,mutation,count,mutation_type,site
0,lib_2,barcoded variants,ATG1AAG,1,nonsynonymous,1
1,lib_2,barcoded variants,GGA2GGC,1,synonymous,2
2,lib_2,barcoded variants,GGA2TGA,1,stop,2


Like above but only single-mutant variants:

In [17]:
variants.mutCounts("single", "codon", samples=None, libraries=["lib_2"]).query(
    "count > 0"
)

Unnamed: 0,library,sample,mutation,count,mutation_type,site
0,lib_2,barcoded variants,GGA2GGC,1,synonymous,2
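
The difference between `"all"` and `"single"` can be illustrated with a small counting sketch over the `lib_2` codon substitutions. The helper `mut_counts` is a hypothetical name used only here; it assumes each variant contributes one count per mutation it carries, which matches the outputs above.

```python
from collections import Counter

# Codon substitutions of the two lib_2 variants (barcodes AAC and CAT).
CODON_SUBS_LIB2 = ["ATG1AAG GGA2TGA", "GGA2GGC"]


def mut_counts(substitution_strings, single_only=False):
    """Hypothetical helper: count variants carrying each mutation."""
    counts = Counter()
    for subs in substitution_strings:
        muts = subs.split()
        if single_only and len(muts) > 1:
            continue  # skip multi-mutant variants in "single" mode
        counts.update(muts)
    return dict(counts)


print(mut_counts(CODON_SUBS_LIB2))
print(mut_counts(CODON_SUBS_LIB2, single_only=True))
```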


So far we haven't added any barcode count information for any samples:

In [18]:
all(variants.samples(lib) == [] for lib in variants.libraries)

True

In [19]:
variants.variant_count_df is None

True

Now we add barcode count information for sample "input" from library 1 using `CodonVariantTable.addSampleCounts`:

In [20]:
counts_lib1_input = pd.DataFrame({"barcode": ["AAC", "GAT"], "count": [253, 1101]})
variants.addSampleCounts("lib_1", "input", counts_lib1_input)
variants.variant_count_df

Unnamed: 0,barcode,count,library,sample,variant_call_support,codon_substitutions,aa_substitutions,n_codon_substitutions,n_aa_substitutions
0,GAT,1101,lib_1,input,1,GGA2CGC,G2R,1,1
1,AAC,253,lib_1,input,2,,,0,0


We get an error if we try to add these same data again, as they are already added for that sample to that library:

In [21]:
try:
    variants.addSampleCounts("lib_1", "input", counts_lib1_input)
except ValueError as exception:
    print(f"Raises following ValueError: {exception}")

Raises following ValueError: `library` lib_1 already has `sample` input
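
A guard of this kind is simple to sketch. The class `SampleRegistry` below is purely hypothetical and only mirrors the error message shown above; it is not how `CodonVariantTable` stores its counts.

```python
class SampleRegistry:
    """Hypothetical container that refuses duplicate (library, sample) pairs."""

    def __init__(self):
        self._counts = {}

    def add_sample_counts(self, library, sample, counts):
        if (library, sample) in self._counts:
            raise ValueError(f"`library` {library} already has `sample` {sample}")
        self._counts[(library, sample)] = dict(counts)


registry = SampleRegistry()
registry.add_sample_counts("lib_1", "input", {"AAC": 253, "GAT": 1101})
try:
    registry.add_sample_counts("lib_1", "input", {"AAC": 253, "GAT": 1101})
except ValueError as exception:
    print(f"Raises following ValueError: {exception}")
```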


But, we can add barcode counts for another sample (named "selected" in this case) to library 1:

In [22]:
counts_lib1_selected = pd.DataFrame({"barcode": ["AAC", "GAT"], "count": [513, 401]})
variants.addSampleCounts("lib_1", "selected", counts_lib1_selected)

We can also add barcode counts for the same two samples ("input" and "selected") to our other library (library 2):

In [23]:
counts_lib2_input = pd.DataFrame({"barcode": ["AAC", "CAT"], "count": [1253, 923]})
variants.addSampleCounts("lib_2", "input", counts_lib2_input)
counts_lib2_selected = pd.DataFrame({"barcode": ["AAC", "CAT"], "count": [113, 1200]})
variants.addSampleCounts("lib_2", "selected", counts_lib2_selected)
variants.variant_count_df

Unnamed: 0,barcode,count,library,sample,variant_call_support,codon_substitutions,aa_substitutions,n_codon_substitutions,n_aa_substitutions
0,GAT,1101,lib_1,input,1,GGA2CGC,G2R,1,1
1,AAC,253,lib_1,input,2,,,0,0
2,AAC,513,lib_1,selected,2,,,0,0
3,GAT,401,lib_1,selected,1,GGA2CGC,G2R,1,1
4,AAC,1253,lib_2,input,2,ATG1AAG GGA2TGA,M1K G2*,2,2
5,CAT,923,lib_2,input,3,GGA2GGC,,1,0
6,CAT,1200,lib_2,selected,3,GGA2GGC,,1,0
7,AAC,113,lib_2,selected,2,ATG1AAG GGA2TGA,M1K G2*,2,2


What are the average counts per variant?

In [24]:
variants.avgCountsPerVariant()

Unnamed: 0,library,sample,avg_counts_per_variant
0,lib_1,input,677.0
1,lib_1,selected,457.0
2,lib_2,input,1088.0
3,lib_2,selected,656.5
4,all libraries,input,882.5
5,all libraries,selected,556.75
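
These averages are simple means of the barcode counts added above, and the "all libraries" rows pool the variants from both libraries. The following sketch checks the arithmetic by hand; `avg_counts_per_variant` is a hypothetical helper, not the library method.

```python
# Barcode counts added above, keyed by (library, sample).
COUNTS = {
    ("lib_1", "input"): [253, 1101],
    ("lib_1", "selected"): [513, 401],
    ("lib_2", "input"): [1253, 923],
    ("lib_2", "selected"): [113, 1200],
}


def avg_counts_per_variant(counts):
    """Hypothetical helper: mean count per variant, plus pooled libraries."""
    pooled = {key: list(vals) for key, vals in counts.items()}
    for (_library, sample), vals in counts.items():
        pooled.setdefault(("all libraries", sample), []).extend(vals)
    return {key: sum(vals) / len(vals) for key, vals in pooled.items()}


for key, avg in avg_counts_per_variant(COUNTS).items():
    print(key, avg)
```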


We can also use `CodonVariantTable.mutCounts` to look at total counts of each mutation:

In [25]:
variants.mutCounts("all", "aa").query("count > 0")

Unnamed: 0,library,sample,mutation,count,mutation_type,site
0,lib_1,input,G2R,1101,nonsynonymous,2
60,lib_1,selected,G2R,401,nonsynonymous,2
120,lib_2,input,G2*,1253,stop,2
121,lib_2,input,M1K,1253,nonsynonymous,1
180,lib_2,selected,G2*,113,stop,2
181,lib_2,selected,M1K,113,nonsynonymous,1
240,all libraries,input,G2*,1253,stop,2
241,all libraries,input,M1K,1253,nonsynonymous,1
242,all libraries,input,G2R,1101,nonsynonymous,2
300,all libraries,selected,G2R,401,nonsynonymous,2
301,all libraries,selected,G2*,113,stop,2
302,all libraries,selected,M1K,113,nonsynonymous,1


In [26]:
variants.mutCounts("all", "aa", libraries=["lib_2"]).query("count > 0")

Unnamed: 0,library,sample,mutation,count,mutation_type,site
0,lib_2,input,G2*,1253,stop,2
1,lib_2,input,M1K,1253,nonsynonymous,1
60,lib_2,selected,G2*,113,stop,2
61,lib_2,selected,M1K,113,nonsynonymous,1


We can use `CodonVariantTable.writeCodonCounts` to write codon count files. First for only **single** mutants:

In [27]:
with tempfile.TemporaryDirectory() as tmpdir:
    countfiles = variants.writeCodonCounts(
        "single", outdir=tmpdir, include_all_libs=True
    )
    lib1_input = pd.read_csv(f"{tmpdir}/lib_1_input_codoncounts.csv")
    all_sel = pd.read_csv(f"{tmpdir}/all-libraries_selected_codoncounts.csv")

Names of created count files:

In [28]:
countfiles.assign(countfile=lambda x: x.countfile.apply(os.path.basename))

Unnamed: 0,library,sample,countfile
0,lib_1,input,lib_1_input_codoncounts.csv
1,lib_1,selected,lib_1_selected_codoncounts.csv
2,lib_2,input,lib_2_input_codoncounts.csv
3,lib_2,selected,lib_2_selected_codoncounts.csv
4,all-libraries,input,all-libraries_input_codoncounts.csv
5,all-libraries,selected,all-libraries_selected_codoncounts.csv


Check for expected values in a few of these counts files, only showing columns with non-zero entries:

In [29]:
lib1_input.iloc[:, (lib1_input != 0).any(axis="rows").values]

Unnamed: 0,site,wildtype,ATG,CGC,GGA,TGA
0,1,ATG,253,0,0,0
1,2,GGA,0,1101,253,0
2,3,TGA,0,0,0,253


In [30]:
all_sel.iloc[:, (all_sel != 0).any(axis="rows").values]

Unnamed: 0,site,wildtype,ATG,CGC,GGA,GGC,TGA
0,1,ATG,513,0,0,0,0
1,2,GGA,0,401,513,1200,0
2,3,TGA,0,0,0,0,513


Now write codon counts files for **all** mutants:

In [31]:
with tempfile.TemporaryDirectory() as tmpdir:
    _ = variants.writeCodonCounts("all", outdir=tmpdir, include_all_libs=True)
    lib1_input_all = pd.read_csv(f"{tmpdir}/lib_1_input_codoncounts.csv")
    all_sel_all = pd.read_csv(f"{tmpdir}/all-libraries_selected_codoncounts.csv")

In [32]:
lib1_input_all.iloc[:, (lib1_input_all != 0).any(axis="rows").values]

Unnamed: 0,site,wildtype,ATG,CGC,GGA,TGA
0,1,ATG,1354,0,0,0
1,2,GGA,0,1101,253,0
2,3,TGA,0,0,0,1354


In [33]:
all_sel_all.iloc[:, (all_sel_all != 0).any(axis="rows").values]

Unnamed: 0,site,wildtype,AAG,ATG,CGC,GGA,GGC,TGA
0,1,ATG,113,2114,0,0,0,0
1,2,GGA,0,0,401,513,1200,113
2,3,TGA,0,0,0,0,0,2227
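
The difference between the `"single"` and `"all"` count files can be reproduced with a short sketch. The rule used here is inferred from the outputs above rather than taken from the library source: in `"single"` mode, wildtype codons are counted only from wildtype variants and mutant codons only from single-codon mutants, whereas in `"all"` mode every variant contributes its codon at every site. The function `codon_counts` is a hypothetical name.

```python
from collections import defaultdict

GENESEQ = "ATGGGATGA"
# (codon_substitutions, count) for the "selected" sample pooled over libraries.
ALL_SELECTED = [
    ("", 513),
    ("GGA2CGC", 401),
    ("ATG1AAG GGA2TGA", 113),
    ("GGA2GGC", 1200),
]


def codon_counts(variants, geneseq, single_only):
    """Hypothetical sketch of per-site codon counts (rule inferred from outputs)."""
    wt = {i + 1: geneseq[3 * i : 3 * i + 3] for i in range(len(geneseq) // 3)}
    counts = {site: defaultdict(int) for site in wt}
    for subs, count in variants:
        muts = {int(m[3:-3]): m[-3:] for m in subs.split()}
        if single_only and len(muts) > 1:
            continue  # multi-codon mutants are ignored in "single" mode
        for site, wt_codon in wt.items():
            if site in muts:
                counts[site][muts[site]] += count
            elif not (single_only and muts):
                # wildtype codon: all variants in "all" mode,
                # only fully wildtype variants in "single" mode
                counts[site][wt_codon] += count
    return {site: dict(c) for site, c in counts.items()}


print(codon_counts(ALL_SELECTED, GENESEQ, single_only=True))
print(codon_counts(ALL_SELECTED, GENESEQ, single_only=False))
```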


We can also initialize `CodonVariantTable` from the `variant_count_df` if we have written that to a CSV file. 
We do this using `CodonVariantTable.from_variant_count_df`.
The example below shows how this newly initialized variant table is equal to the original one used to write the CSV file:

In [34]:
with tempfile.NamedTemporaryFile(mode="w") as f:
    variants.variant_count_df.to_csv(f, index=False)
    f.flush()
    variants_eq = CodonVariantTable.from_variant_count_df(
        variant_count_df_file=f.name, geneseq=geneseq
    )
variants == variants_eq

True

Of course, the initialized variant table is **not** equal to the original one if we don't write the full `variant_count_df` to the CSV file:

In [35]:
with tempfile.NamedTemporaryFile(mode="w") as f:
    (variants.variant_count_df.query('sample == "input"').to_csv(f, index=False))
    f.flush()
    variants_ne = CodonVariantTable.from_variant_count_df(
        variant_count_df_file=f.name, geneseq=geneseq
    )
variants == variants_ne

False

We can use `CodonVariantTable.func_scores` to compute functional scores for the variants. We cannot use this method with default options because there are no wildtype counts (needed for normalization) for `lib_2`, so we get an error:

In [36]:
try:
    variants.func_scores("input")
except ValueError as exception:
    print(f"ValueError: {exception}")

ValueError: no wildtype counts:
         library    sample  count
0          lib_1     input    253
1          lib_1  selected    513
2          lib_2     input      0
3          lib_2  selected      0
4  all libraries     input    253
5  all libraries  selected    513


However, we can use the method with the `permit_zero_wt` option:

In [37]:
scores = variants.func_scores("input", permit_zero_wt=True)
scores

Unnamed: 0,library,pre_sample,post_sample,barcode,func_score,func_score_var,pre_count,post_count,pre_count_wt,post_count_wt,pseudocount,codon_substitutions,n_codon_substitutions,aa_substitutions,n_aa_substitutions
0,lib_1,input,selected,GAT,-2.474376,0.019337,1101,401,253,513,0.5,GGA2CGC,1,G2R,1
1,lib_1,input,selected,AAC,0.0,0.024528,253,513,253,513,0.5,,0,,0
2,lib_2,input,selected,AAC,-3.465198,8.345474,1253,113,0,0,0.5,ATG1AAG GGA2TGA,2,M1K G2*,2
3,lib_2,input,selected,CAT,0.378452,8.329463,923,1200,0,0,0.5,GGA2GGC,1,,0
4,all libraries,input,selected,lib_1-GAT,-2.474376,0.019337,1101,401,253,513,0.5,GGA2CGC,1,G2R,1
5,all libraries,input,selected,lib_1-AAC,0.0,0.024528,253,513,253,513,0.5,,0,,0
6,all libraries,input,selected,lib_2-AAC,-4.483576,0.032262,1253,113,253,513,0.5,ATG1AAG GGA2TGA,2,M1K G2*,2
7,all libraries,input,selected,lib_2-CAT,-0.639927,0.016251,923,1200,253,513,0.5,GGA2GGC,1,,0
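
The scores in this table can be checked by hand. The sketch below assumes the functional score is the log2 enrichment of a variant relative to wildtype, with the pseudocount added to every count, and that the variance is the sum of the reciprocals of those four pseudocount-adjusted counts divided by (ln 2)^2. These formulas are an assumption stated here for illustration; they reproduce the values above, including the `lib_2` rows where the wildtype counts are zero. The function `func_score` is a hypothetical helper.

```python
import math


def func_score(pre, post, pre_wt, post_wt, pseudocount=0.5):
    """Hypothetical helper: log2 enrichment relative to wildtype and its variance."""
    pc = pseudocount
    score = math.log2(((post + pc) / (post_wt + pc)) / ((pre + pc) / (pre_wt + pc)))
    var = (
        1 / (post + pc) + 1 / (post_wt + pc) + 1 / (pre + pc) + 1 / (pre_wt + pc)
    ) / math.log(2) ** 2
    return score, var


# lib_1, barcode GAT: pre=1101, post=401, wildtype pre=253, post=513
print(func_score(1101, 401, 253, 513))
# lib_2, barcode AAC: no wildtype counts, so only the pseudocount remains
print(func_score(1253, 113, 0, 0))
```

With zero wildtype counts the variance is dominated by the two `1 / pseudocount` terms, which is why the `lib_2` variances are so much larger than the others.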


Add the amino-acid and codon sequences to the `scores` data frame with `CodonVariantTable.add_full_seqs`:

In [38]:
(
    variants.add_full_seqs(scores)[
        ["library", "func_score", "func_score_var", "aa_sequence", "codon_sequence"]
    ]
)

Unnamed: 0,library,func_score,func_score_var,aa_sequence,codon_sequence
0,lib_1,-2.474376,0.019337,MR*,ATGCGCTGA
1,lib_1,0.0,0.024528,MG*,ATGGGATGA
2,lib_2,-3.465198,8.345474,K**,AAGTGATGA
3,lib_2,0.378452,8.329463,MG*,ATGGGCTGA
4,all libraries,-2.474376,0.019337,MR*,ATGCGCTGA
5,all libraries,0.0,0.024528,MG*,ATGGGATGA
6,all libraries,-4.483576,0.032262,K**,AAGTGATGA
7,all libraries,-0.639927,0.016251,MG*,ATGGGCTGA
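
The full sequences follow directly from applying each variant's codon substitutions to `geneseq` and translating. A minimal sketch, assuming the standard genetic code and using a hypothetical helper `full_seqs` (not the library method):

```python
from itertools import product

# Standard genetic code from the usual TCAG-ordered string (assumption).
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}


def full_seqs(geneseq, codon_subs):
    """Hypothetical helper: return (aa_sequence, codon_sequence) for a variant."""
    codons = [geneseq[i : i + 3] for i in range(0, len(geneseq), 3)]
    for sub in codon_subs.split():
        wt, site, mut = sub[:3], int(sub[3:-3]), sub[-3:]
        if codons[site - 1] != wt:
            raise ValueError(f"wildtype mismatch in {sub}")
        codons[site - 1] = mut
    return "".join(CODON_TO_AA[c] for c in codons), "".join(codons)


print(full_seqs("ATGGGATGA", "GGA2CGC"))  # ('MR*', 'ATGCGCTGA')
print(full_seqs("ATGGGATGA", "ATG1AAG GGA2TGA"))  # ('K**', 'AAGTGATGA')
```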
