# Reference site numbering
The preceding examples all analyzed proteins with sequential integer site numbers (e.g., 1, 2, 3, ...).
Sequential integer numbering can also start at a site other than one (e.g., 331, 332, 333, ...).

However, proteins are often numbered by aligning them to the sequential integer numbering of some reference homolog.
For instance, SARS-CoV-2 spike sequences are usually numbered in relation to the Wuhan-Hu-1 reference.
But if the protein has accumulated insertions or deletions relative to the reference, there may be missing sites (gaps) or insertions (typically numbered as `214`, `214a`, `214b`, etc.).
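
As a rough illustration of how such labels arise, the sketch below assigns reference-based site labels to the residues of a query protein given a pairwise alignment. The function name `reference_site_labels` and the toy alignments are hypothetical and are not part of `polyclonal`; the sketch also assumes at most 26 consecutive inserted residues.

```python
import string


def reference_site_labels(ref_aligned, query_aligned, ref_start=1):
    """Label each query residue by reference site (hypothetical helper)."""
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have the same length")
    labels = []
    ref_site = ref_start - 1  # last reference site consumed so far
    n_inserted = 0  # residues inserted after the current reference site
    for ref_char, query_char in zip(ref_aligned, query_aligned):
        if ref_char != "-":
            ref_site += 1
            n_inserted = 0
            if query_char != "-":
                labels.append(str(ref_site))
            # otherwise the query has a deletion, so this site is skipped
        elif query_char != "-":
            # insertion relative to the reference: 214a, 214b, ...
            labels.append(f"{ref_site}{string.ascii_lowercase[n_inserted]}")
            n_inserted += 1
    return labels


# toy alignments: one insertion after site 2, and one deletion of site 2
insertion_labels = reference_site_labels("MK-LV", "MKALV")
deletion_labels = reference_site_labels("MKLV", "M-LV")
```

Here `insertion_labels` contains a lettered site for the inserted residue, and `deletion_labels` skips the deleted reference site.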

This notebook shows how to fit a model and analyze the results with non-sequential reference site numbering.
The advantage of doing this is that the results can be visualized and analyzed directly in the conventional site numbering scheme.

It does this by analyzing deep mutational scanning of the SARS-CoV-2 Omicron BA.1 spike, which has several indels relative to the Wuhan-Hu-1 numbering reference.

## Fit and visualize a model with non-integer site numbering

First, import Python modules:

In [1]:
import numpy

import polyclonal

import pandas as pd

Now read the data to fit:

In [2]:
# read data with `na_filter=False` so empty aa_substitutions are read as empty strings rather than NA
data_to_fit = pd.read_csv(
    "Lib-2_2022-06-22_thaw-1_LyCoV-1404_1_prob_escape.csv",
    na_filter=False,
)

data_to_fit.head()

Unnamed: 0,antibody_concentration,barcode,aa_substitutions_sequential,aa_substitutions_reference,n_aa_substitutions,prob_escape,prob_escape_uncensored,no-antibody_count
0,0.654,AAGAGCTTACTCTCGA,S443P S1167P,S446P S1170P,2,0.9004,0.9004,2890
1,0.654,ATAAGATAGATTTAGG,K441T A519S,K444T A522S,2,0.8006,0.8006,2217
2,0.654,GATCGAGTGTGTAGCA,F152T K441E,F157T K444E,2,0.5797,0.5797,2966
3,0.654,CCAAACGGTATGATGA,Q14H V67P D212H N445T I1224M,Q14H V67P D215H N448T I1227M,5,0.4125,0.4125,4131
4,0.654,TTACTGTGCAACCCAA,F163L N445S,F168L N448S,2,0.36,0.36,4275
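
To see why NA filtering is turned off when reading the data, the following minimal sketch reads a small made-up CSV (not the real data file) both ways. With the pandas defaults, the empty `aa_substitutions` entry of an unmutated variant becomes `NaN`; with NA filtering off it stays an empty string.

```python
import io

import pandas as pd

# toy CSV: the first variant has no substitutions (empty field)
csv_text = "barcode,aa_substitutions\nAAAA,\nCCCC,S446P\n"

default_read = pd.read_csv(io.StringIO(csv_text))
unfiltered_read = pd.read_csv(io.StringIO(csv_text), na_filter=False)

# default: empty field parsed as NaN
print(default_read["aa_substitutions"].isna().tolist())
# NA filtering off: empty field kept as an empty string
print(unfiltered_read["aa_substitutions"].tolist())
```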


Notice how these data have sites numbered in two different schemes:

 1. `aa_substitutions_sequential`: sequential integer numbering (1, 2, 3, ...) of the protein used in the experiment. You could analyze the mutations this way, but then the results would not be in the standard reference numbering scheme.

 2. `aa_substitutions_reference`: the reference-based site numbering, which skips sites at deletions and has some non-numeric sites where there are insertions (e.g., `214a`), as in the variant shown below:

In [3]:
data_to_fit.query("aa_substitutions_reference.str.contains('a')").head(n=1)

Unnamed: 0,antibody_concentration,barcode,aa_substitutions_sequential,aa_substitutions_reference,n_aa_substitutions,prob_escape,prob_escape_uncensored,no-antibody_count
32,0.654,CTAGATAAATCCCTGC,E209A K441N,E214aA K444N,2,0.3965,0.3965,2214
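
Substitution strings with lettered sites such as `E214aA` can no longer be split by assuming the site is a plain integer. The sketch below shows one way to parse them with a regular expression. The pattern and the `parse_substitution` helper are illustrative assumptions, not the parser that `polyclonal` uses internally.

```python
import re

# wildtype character, site (digits plus optional lowercase suffix), mutant character
SUBSTITUTION_REGEX = re.compile(
    r"(?P<wildtype>[A-Z*-])(?P<site>-?\d+[a-z]*)(?P<mutant>[A-Z*-])"
)


def parse_substitution(sub):
    """Split a substitution like 'E214aA' into (wildtype, site, mutant)."""
    match = SUBSTITUTION_REGEX.fullmatch(sub)
    if match is None:
        raise ValueError(f"cannot parse substitution {sub!r}")
    return match["wildtype"], match["site"], match["mutant"]


# the variant shown above, in reference numbering
parsed = [parse_substitution(s) for s in "E214aA K444N".split()]
```

Note that the site is kept as a string, since `214a` has no integer representation.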


Here we will use the reference-based amino-acid substitutions, so create a column with the name used by `Polyclonal` (the "aa_substitutions" column) that uses this numbering scheme:

In [4]:
data_to_fit["aa_substitutions"] = data_to_fit["aa_substitutions_reference"]

Importantly, to use reference-based numbering that is not sequential integer, you also have to provide a list of the sites in order.
The reason is that otherwise `Polyclonal` cannot determine, for instance, whether sites are just missing from the data or are actually deletions.
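
The ambiguity can be illustrated with a toy example. All values below are made up for illustration: the ordered site list omits 69 and 70 (deleted relative to the reference), while the data happen to contain mutations at only two sites.

```python
# ordered reference sites of a toy protein; 69 and 70 are deleted
sites = ["67", "68", "71", "72"]

# sites at which the (toy) data contain mutations
observed = {"67", "72"}

# with the explicit site list, the two cases can be told apart
deleted = [str(i) for i in range(67, 73) if str(i) not in sites]
unobserved = [s for s in sites if s not in observed]

print(f"deleted relative to reference: {deleted}")
print(f"present in protein but unmutated in data: {unobserved}")
```

From the observed sites alone (67 and 72), sites 68 through 71 would all look equally "missing"; only the site list separates the deleted sites from those that simply lack mutations in the data.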

Here we read a data frame that maps sequential to reference site numbering, and then use that to extract the list of reference sites:

In [5]:
site_numbering_map = pd.read_csv("BA.1_site_numbering_map.csv")
display(site_numbering_map.head())

print("Note how some reference sites differ from sequential ones due to indels:")
display(
    site_numbering_map[
        site_numbering_map["sequential_site"].astype(str)
        != site_numbering_map["reference_site"]
    ].head()
)

sites = site_numbering_map["reference_site"].tolist()

Unnamed: 0,sequential_site,reference_site
0,1,1
1,2,2
2,3,3
3,4,4
4,5,5


Note how some reference sites differ from sequential ones due to indels:


Unnamed: 0,sequential_site,reference_site
68,69,71
69,70,72
70,71,73
71,72,74
72,73,75


Now initialize the `Polyclonal` model, passing the `sites` argument so we can use these non-sequential-integer reference sites.
(If you don't pass the `sites` argument, sites are assumed to be sequential integers):

In [6]:
model = polyclonal.Polyclonal(
    # `polyclonal` expects the concentration column to be named "concentration"
    data_to_fit=data_to_fit.rename(columns={"antibody_concentration": "concentration"}),
    n_epitopes=1,
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=sites,
)

Note how the model has its `sequential_integer_sites` attribute set to `False`:

In [7]:
assert set(model.sites) == set(sites)

assert model.sequential_integer_sites is False

Now fit the model:

In [8]:
_ = model.fit(logfreq=100, reg_escape_weight=0.1)

# First fitting site-level model.
# Starting optimization of 1248 parameters at Sun Jul 31 15:09:19 2022.
         step     time_sec         loss     fit_loss   reg_escape   reg_spread reg_activity
            0     0.052314        37315        37314            0            0      0.90499
          100       6.5281       4867.4       4825.8       37.948            0       3.6703
          188       11.691       4865.6       4822.7       39.237            0       3.6724
# Successfully finished at Sun Jul 31 15:09:31 2022.
# Starting optimization of 8450 parameters at Sun Jul 31 15:09:31 2022.
         step     time_sec         loss     fit_loss   reg_escape   reg_spread reg_activity
            0     0.091255       7299.1       7031.4       264.09   8.7534e-31       3.6724
          100       8.3385       6826.9         6719       83.863       19.734       4.3189
          200       16.279       6821.7       6715.9       80.416       21.021       4.3081
          299       23.978       

Now look at the output.
First, examine `mut_escape_df`.
The site entries are of type `str`:

In [9]:
assert all(model.mut_escape_df["site"].astype(str) == model.mut_escape_df["site"])
assert set(model.mut_escape_df["site"]).issubset(sites)

model.mut_escape_df.head()

Unnamed: 0,epitope,site,wildtype,mutant,mutation,escape,times_seen
0,1,1,M,I,M1I,-0.003252,1
1,1,1,M,T,M1T,-0.005627,2
2,1,2,F,I,F2I,0.000555,2
3,1,2,F,L,F2L,0.010234,14
4,1,2,F,S,F2S,-0.006794,14
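
Because the sites are strings, sorting them as plain strings gives lexicographic rather than sequence order. One way to sort a data frame by site in reference order is an ordered categorical built from the `sites` list, as in this sketch with made-up values:

```python
import pandas as pd

# toy ordered site list and an unsorted toy data frame
sites = ["9", "10", "10a", "11"]
df = pd.DataFrame(
    {"site": ["10a", "9", "11", "10"], "escape": [0.1, 0.2, 0.3, 0.4]}
)

# plain string sorting puts "10" before "9"
lexicographic = sorted(df["site"])

# an ordered categorical sorts in the order given by `sites`
df_sorted = (
    df.assign(
        site=lambda x: pd.Categorical(x["site"], categories=sites, ordered=True)
    )
    .sort_values("site")
    .reset_index(drop=True)
)
```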


The same is true for `mut_escape_site_summary_df`:

In [10]:
assert all(
    model.mut_escape_site_summary_df()["site"].astype(str)
    == model.mut_escape_site_summary_df()["site"]
)
assert set(model.mut_escape_site_summary_df()["site"]).issubset(sites)

model.mut_escape_site_summary_df().head()

Unnamed: 0,epitope,site,wildtype,mean,total positive,max,min,total negative,n mutations
0,1,1,M,-0.00444,0.0,-0.003252,-0.005627,-0.00888,2
1,1,2,F,0.006602,0.033203,0.022414,-0.006794,-0.006794,4
2,1,3,V,0.062742,0.6431,0.562297,-0.118661,-0.141163,8
3,1,4,F,0.222067,1.11796,1.024416,-0.005115,-0.007626,5
4,1,5,L,0.015171,2.084054,0.942382,-0.890501,-1.780636,20
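
The site-level columns can be understood as simple aggregations of the mutation-level escape values. The sketch below is one way to compute such summaries with a pandas `groupby`, using the two site 1 mutations shown earlier. It is consistent with the site 1 row above but is an assumption about the definitions, not the library's actual implementation, and it uses underscores in the column names.

```python
import pandas as pd

# the two mutations at site 1 from `mut_escape_df` above
toy = pd.DataFrame({"site": ["1", "1"], "escape": [-0.003252, -0.005627]})

summary = (
    toy.groupby("site", sort=False)["escape"]
    .agg(
        mean="mean",
        total_positive=lambda e: e.clip(lower=0).sum(),  # sum of positive values
        max="max",
        min="min",
        total_negative=lambda e: e.clip(upper=0).sum(),  # sum of negative values
        n_mutations="count",
    )
    .reset_index()
)
```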


## Confirm same results for reference and sequential integer numbering
To demonstrate how the results are the same regardless of which numbering scheme is used for the fitting, we also fit a model with the sequentially numbered variants.
This section of the notebook can almost be considered a test rather than an example.

In [11]:
model_sequential = polyclonal.Polyclonal(
    data_to_fit=(
        data_to_fit
        .drop(columns="aa_substitutions")
        .rename(
            columns={
                "antibody_concentration": "concentration",
                "aa_substitutions_sequential": "aa_substitutions",
            }
        )
    ),
    n_epitopes=1,
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
)

assert model_sequential.sequential_integer_sites is True

_ = model_sequential.fit(logfreq=100, reg_escape_weight=0.1)

# First fitting site-level model.
# Starting optimization of 1248 parameters at Sun Jul 31 15:10:02 2022.
         step     time_sec         loss     fit_loss   reg_escape   reg_spread reg_activity
            0     0.052583        37315        37314            0            0      0.90499
          100       6.6108       4867.4       4825.7       37.948            0       3.6702
          188        12.02       4865.6       4822.6       39.396            0       3.6721
# Successfully finished at Sun Jul 31 15:10:14 2022.
# Starting optimization of 8450 parameters at Sun Jul 31 15:10:14 2022.
         step     time_sec         loss     fit_loss   reg_escape   reg_spread reg_activity
            0      0.07094         7300       7031.3       265.04   8.9894e-31       3.6721
          100       8.4684       6826.5       6718.9       83.282       19.982       4.3163
          200       16.449       6821.6       6715.8       80.443       21.032       4.3093
          245       20.259       

Now make sure the fitting gives nearly the same result regardless of which site numbering is used.
First check the activity values:

In [12]:
pd.testing.assert_frame_equal(
    model_sequential.activity_wt_df, model.activity_wt_df, atol=0.01,
)
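
For readers unfamiliar with the `atol` argument, here is a minimal sketch of its effect using made-up activity values. It assumes a pandas version that supports `atol` in `assert_frame_equal` (1.1 or later): numeric differences within the absolute tolerance pass, while larger ones raise an `AssertionError`.

```python
import pandas as pd

a = pd.DataFrame({"epitope": ["1"], "activity": [3.210]})
b = pd.DataFrame({"epitope": ["1"], "activity": [3.215]})  # differs by 0.005
c = pd.DataFrame({"epitope": ["1"], "activity": [3.300]})  # differs by 0.09

# passes: the difference is within atol
pd.testing.assert_frame_equal(a, b, atol=0.01)

# fails: the difference exceeds atol
try:
    pd.testing.assert_frame_equal(a, c, atol=0.01)
    exceeds_tolerance = False
except AssertionError:
    exceeds_tolerance = True
```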

Now compare the mutation-escape values.
In order to do this, we have to re-number the sequential values to reference numbering:

In [13]:
min_times_seen = 50

mut_escape = model.mut_escape_df.drop(columns="mutation")

mut_escape_sequential = (
    model_sequential.mut_escape_df.assign(
        site=lambda x: x["site"].map(
            site_numbering_map.set_index("sequential_site")["reference_site"].to_dict()
        )
    )
    .drop(columns="mutation")
)
    
# individual values differ enough that a fairly large `atol` is needed, so also check the correlation
pd.testing.assert_frame_equal(
    mut_escape, mut_escape_sequential, atol=1.5,
)
assert mut_escape["escape"].corr(mut_escape_sequential["escape"]) > 0.99
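
The comparison above relies on the two data frames having their rows in the same order. A more explicit alternative is to merge on the mutation identifiers and, if desired, restrict the correlation to well-observed mutations. The sketch below uses made-up escape values and assumes that a `times_seen` threshold like `min_times_seen` is meant as such a filter.

```python
import pandas as pd

min_times_seen = 50

# toy mutation-escape values from two fits of the same data
ref_fit = pd.DataFrame(
    {
        "site": ["214a", "214a", "215", "215"],
        "mutant": ["A", "K", "H", "N"],
        "escape": [0.10, 1.20, 2.50, -0.30],
        "times_seen": [60, 75, 120, 3],
    }
)
seq_fit = ref_fit.drop(columns="times_seen").assign(
    escape=[0.12, 1.18, 2.47, 0.90]
)

# pair up mutations by identifier rather than by row position
merged = ref_fit.merge(
    seq_fit,
    on=["site", "mutant"],
    suffixes=("_ref", "_seq"),
    validate="one_to_one",
)

# keep only mutations seen in enough variants, then correlate
well_seen = merged[merged["times_seen"] >= min_times_seen]
corr = well_seen["escape_ref"].corr(well_seen["escape_seq"])
```

In this toy example the rarely seen mutation, whose estimate differs most between the fits, is excluded before the correlation is computed.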