# Reference site numbering
So far, the preceding examples have analyzed proteins with sequential integer site numbers (e.g., 1, 2, 3, ...).
This sequential integer numbering can also start at sites other than one (e.g., 331, 332, 333, ...).

However, proteins are often numbered by aligning them to a reference homolog and using that homolog's sequential integer numbering.
For instance, SARS-CoV-2 spike sequences are usually numbered in relation to the Wuhan-Hu-1 reference.
But if the protein has accumulated insertions or deletions relative to the reference, there may be missing sites (gaps in the numbering) or inserted sites (typically numbered as `214`, `214a`, `214b`, etc.).

This notebook shows how to use a "site-numbering map" that converts the sequential integer numbering to reference site numbering for display.
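To make the numbering convention concrete, here is a minimal sketch of how such a map could be derived from a pairwise alignment. The function `build_site_numbering_map` and the toy sequences are hypothetical and not part of `polyclonal`. The sketch assumes gaps are written as `-` and that no more than 26 residues are inserted after any one reference site:

```python
import string

import pandas as pd


def build_site_numbering_map(aligned_ref, aligned_query, ref_start=1):
    """Hypothetical helper: map sequential query sites to reference labels."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    records = []
    ref_site = ref_start - 1  # last reference site seen so far
    n_inserted = 0  # residues inserted since that reference site
    sequential_site = 0
    for ref_aa, query_aa in zip(aligned_ref, aligned_query):
        if ref_aa != "-":
            ref_site += 1
            n_inserted = 0
        if query_aa == "-":
            continue  # deletion in the query: this reference site is skipped
        sequential_site += 1
        if ref_aa == "-":
            # insertion: previous reference site number plus a letter suffix
            label = f"{ref_site}{string.ascii_lowercase[n_inserted]}"
            n_inserted += 1
        else:
            label = str(ref_site)
        records.append((sequential_site, label))
    return pd.DataFrame(records, columns=["sequential_site", "reference_site"])


# toy alignment: reference site 2 is deleted, one residue is inserted after site 3
toy_map = build_site_numbering_map("MKT-LV", "M-TGLV")
print(toy_map)
```

In this toy case the five sequential sites get the reference labels `1`, `3`, `3a`, `4`, and `5`.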

First, import Python modules:

In [None]:
import polyclonal

import pandas as pd

Next, read the file that maps sequential to reference site numbering.
It must have columns named `sequential_site` and `reference_site`.
Other columns are allowed, but are ignored.
The `sequential_site` column **must** be integer (since those sites are sequential integers), but the `reference_site` column can be integer or string (it will be string if there are insertions numbered like `214a`, etc.):

In [None]:
site_numbering_map = pd.read_csv("BA.1_site_numbering_map.csv")

site_numbering_map.head()
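The column requirements described above can be checked before using a map. The following is only a sketch: `check_site_numbering_map` is a hypothetical helper, not a `polyclonal` function, and casting `reference_site` to string is a convenience chosen here so that integer and `214a`-style labels can be handled uniformly:

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def check_site_numbering_map(df):
    """Hypothetical check; returns a copy with `reference_site` as strings."""
    required = {"sequential_site", "reference_site"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not is_integer_dtype(df["sequential_site"]):
        raise ValueError("`sequential_site` must be integer")
    # labels such as "214a" force strings, so treat every label as a string
    return df.assign(reference_site=df["reference_site"].astype(str))


# toy map with a mix of integer and string reference labels
toy = pd.DataFrame({"sequential_site": [1, 2, 3], "reference_site": [1, "1a", 2]})
checked = check_site_numbering_map(toy)
print(checked)
```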

You can see that indels lead to differences between the sequential and reference numbers, such as a deletion of reference sites 157 and 158:

In [None]:
site_numbering_map.query("sequential_site >= 155").head()
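Rather than scanning the table by eye, deleted and inserted sites can also be listed programmatically. This sketch uses a small made-up map (not the BA.1 file) and assumes that every reference label begins with its integer site number, optionally followed by a letter suffix:

```python
import pandas as pd

# toy map: reference sites 157 and 158 are deleted, one residue inserted after 159
toy_map = pd.DataFrame(
    {
        "sequential_site": [1, 2, 3, 4, 5],
        "reference_site": ["155", "156", "159", "159a", "160"],
    }
)

labels = toy_map["reference_site"].astype(str)

# integer part of each label, so that "159a" counts as site 159
ref_number = labels.str.extract(r"^(\d+)", expand=False).astype(int)

# reference sites inside the covered range that have no sequential site
present = set(ref_number)
deleted_sites = [
    site
    for site in range(int(ref_number.min()), int(ref_number.max()) + 1)
    if site not in present
]

# labels that are not plain integers correspond to insertions
inserted_labels = labels[~labels.str.fullmatch(r"\d+")].tolist()

print(deleted_sites, inserted_labels)
```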

Now read the data to fit:

In [None]:
# read data with `na_filter=False` so empty `aa_substitutions` are read as
# empty strings rather than NA
data_to_fit = pd.read_csv(
    "Lib-2_2022-06-22_thaw-1_LyCoV-1404_1_prob_escape.csv",
    na_filter=False,
)
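The effect of disabling NA filtering can be seen on a tiny in-memory CSV. The column names `barcode` and `prob_escape` and the values are made up for illustration; only `aa_substitutions` is taken from the comment above. An empty `aa_substitutions` entry, presumably a variant with no substitutions, is read as NaN by default but as an empty string when NA filtering is off:

```python
import io

import pandas as pd

# first data row has an empty `aa_substitutions` field
csv_text = "barcode,aa_substitutions,prob_escape\nAAAC,,0.01\nAAAG,A2C,0.5\n"

default_read = pd.read_csv(io.StringIO(csv_text))
no_filter_read = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default_read["aa_substitutions"].isna().tolist())
print(no_filter_read["aa_substitutions"].tolist())
```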

Fit the polyclonal model:

In [None]:
# NBVAL_IGNORE_OUTPUT

model = polyclonal.Polyclonal(
    # `polyclonal` expects the concentration column to be named "concentration"
    data_to_fit=data_to_fit.rename(columns={"antibody_concentration": "concentration"}),
    n_epitopes=1,
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
)

_ = model.fit(logfreq=100, reg_escape_weight=0.1)