# Canonical sequence analysis

Canonical sequence analysis (CSA) is a useful method for identifying potential trouble spots in
an antibody being developed for clinical use and thereby mitigating some of the potential
upstream development risk for that antibody. The analysis measures how frequently the amino
acid at each position in the scanned sequence occurs at the corresponding position in a library of
homologous antibody sequences. The CSA makes no assumptions about which organism the
antibody sequences are derived from, nor which antibody frameworks were included in these
sequences, particularly since it is common for antibodies to be derived from different organisms
or from different antibody frameworks within the same organism’s germline. Furthermore, many
antibody sequences destined for clinical use have been edited during lead optimization and/or
have been fully or partially humanized. The CSA simply compares the antibody sequences with
the closest homologs identified using a global BLAST alignment, and compiles the frequencies
of occurrence of the amino acids at each position as an indicator of how representative or
“typical” each amino acid is at a given position in the canonical antibody sequence represented
by the alignment.
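
The frequency compilation described above can be illustrated with a minimal sketch. It assumes the homologs have already been aligned to the query so that every sequence has the same length, and that gaps are written as `-`. The function name `residue_frequencies` and the toy sequences are hypothetical and are not taken from the pipeline used for this report.

```python
from collections import Counter


def residue_frequencies(query, homologs, gap="-"):
    """Return, for each query position, the percentage of aligned homologs
    that carry the same residue as the query at that position.

    Assumption: gapped homologs still count towards the column total, and a
    gap in the query itself is reported as 0%.
    """
    frequencies = []
    for i, residue in enumerate(query):
        # Collect the aligned column at position i across the library.
        column = [seq[i] for seq in homologs if i < len(seq)]
        counts = Counter(column)
        if not column or residue == gap:
            frequencies.append(0.0)
        else:
            frequencies.append(100.0 * counts[residue] / len(column))
    return frequencies


# Toy example: a five-residue query against four aligned homologs.
query = "EVQLV"
homologs = ["EVQLV", "QVQLV", "EVQLL", "EVKLV"]
print(residue_frequencies(query, homologs))  # [75.0, 100.0, 75.0, 100.0, 75.0]
```

How gapped columns are counted is a design choice; excluding them from the total would raise the reported frequencies at positions where many homologs have deletions.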

Plotting these frequencies against the sequence facilitates the rapid identification of sequence
positions that contain amino acids that are atypical at those positions. The sequence
positions in and around the complementarity-determining regions (CDRs), where recombination
and hypermutation events produce more novel sequences, are expected to be far less canonical
and more distinctive, in keeping with the specificity of the antibody for its
antigen.

The low-frequency positions in the sequence do not necessarily indicate trouble spots in
the antibody, but they do identify the positions with the highest potential risk, which
may merit further investigation. This is because antibodies generated by the immune
system undergo a substantial degree of natural selection, which acts as a strong filter for
eliminating non-functional sequences. Residues that occur frequently among the homologs have
therefore already passed that filter.

Closer examination of these atypical positions may reveal, for example, functionally conservative
substitutions that are unlikely to be problematic or, conversely, very unusual substitutions that
may turn out, upon further investigation, to be essential, neutral or even detrimental to the
antibody’s folding and/or function. A canonical sequence analysis is also a useful tool for any
subsequent engineering of the antibody, for example to reduce its potential immunogenicity.
CSA can provide invaluable guidance to the protein engineer in the selection of substitutions at
any given position that will have a higher probability of preserving the structure and function of
the antibody (which is always the most challenging aspect of protein engineering).

The CSA is also very useful for assessing the potential sites of chemical modification that are
identified in a subsequent section. A site identified as potentially susceptible to a chemical
modification that impairs the antibody’s stability and/or function is less likely to be problematic if
the residue is often observed at that position in homologous antibodies. A highly unusual
residue at that position, however, might merit further consideration for substitution. If this
report includes a structural analysis, data from the molecular surface analysis described
later on can also be used in conjunction with the CSA to determine whether the potential
modification site is both non-canonical and exposed on the antibody surface. A site with both of
these characteristics carries a higher risk of chemical modification.
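
As an illustration of how the two criteria might be combined, the sketch below counts how many of the two risk factors (non-canonical residue, surface exposure) apply to each candidate site. Everything in it is hypothetical: the function name, the dictionary keys, the example values and, in particular, the 25% exposure cutoff are assumptions made for this example. The 10% frequency cutoff is the approximate threshold for atypical residues used in this report.

```python
def count_risk_factors(sites, freq_threshold=10.0, exposure_threshold=25.0):
    """Rank candidate modification sites by the number of risk factors present.

    Each site is a dict with (hypothetical) keys:
      position      - residue number in the sequence
      frequency_pct - CSA frequency of the residue at that position
      exposure_pct  - relative surface exposure from a structural analysis
    """
    scored = []
    for site in sites:
        atypical = site["frequency_pct"] < freq_threshold
        exposed = site["exposure_pct"] >= exposure_threshold  # assumed cutoff
        scored.append((site["position"], int(atypical) + int(exposed)))
    # Sites with both factors come first; ties keep their input order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up values for three candidate sites.
sites = [
    {"position": 31, "frequency_pct": 85.0, "exposure_pct": 40.0},
    {"position": 55, "frequency_pct": 3.0, "exposure_pct": 60.0},
    {"position": 98, "frequency_pct": 92.0, "exposure_pct": 5.0},
]
print(count_risk_factors(sites))  # [(55, 2), (31, 1), (98, 0)]
```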

Presented below are the CSA histograms for the heavy and light chain variable regions of your
antibody. In each graph, the sequence itself is shown on the x-axis, and the percentage
frequency in the library of the amino acid at each position is shown on the y-axis. The library was
generated by a BLAST search of each antibody sequence against the database of homologous
protein sequences, with the aligned set of the most closely homologous sequences being used
for compiling the statistics.

At each position in the sequence, the degree to which the amino acid is ‘typical’ at that position
is represented by a green bar showing its frequency of occurrence at the correspondingly
aligned positions of the homologous antibody sequences in the library. Regions of the sequence
that are more specific to the current antibody appear as gaps or shorter bars. For the
purposes of this analysis, we consider an amino acid at a given position in the sequence to be
atypical if its frequency in the library falls below about 10%.
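
The 10% rule can be expressed as a simple filter over the per-position frequencies. The sketch below is illustrative only: the function name and the example frequencies are hypothetical, and because the text gives the cutoff as "about 10%", treating it as a strict threshold of exactly 10.0 is an assumption.

```python
ATYPICAL_THRESHOLD_PCT = 10.0  # "about 10%" in the text; exact value assumed


def atypical_positions(sequence, frequencies, threshold=ATYPICAL_THRESHOLD_PCT):
    """Return (1-based position, residue, frequency) for every residue whose
    library frequency falls below the threshold."""
    return [
        (i + 1, residue, freq)
        for i, (residue, freq) in enumerate(zip(sequence, frequencies))
        if freq < threshold
    ]


# Made-up frequencies for a five-residue stretch.
flagged = atypical_positions("EVQLV", [75.0, 4.2, 98.0, 9.9, 10.0])
print(flagged)  # [(2, 'V', 4.2), (4, 'L', 9.9)]
```

A residue at exactly 10.0% is not flagged here; whether borderline positions should be reviewed manually is a judgment call.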

Along with the canonical sequence analysis plots, pie charts are presented for the light and
heavy chains. These show the fractions (as percentages) of the aligned sequences derived from
the different organisms represented in the alignments. For clarity, only the five most commonly
observed organisms in the alignments are shown in each pie chart.
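
The pie-chart values can be derived by tallying the source organism of each aligned sequence and keeping the five most common. The following is a minimal sketch with a hypothetical function name and made-up organism labels. It assumes percentages are computed relative to all aligned sequences, not just those in the top five.

```python
from collections import Counter


def top_organism_fractions(organisms, top_n=5):
    """Return (organism, percentage of all aligned sequences) for the top_n
    most frequently observed source organisms."""
    counts = Counter(organisms)
    total = sum(counts.values())
    return [(name, 100.0 * n / total) for name, n in counts.most_common(top_n)]


# Made-up source organisms for ten aligned sequences.
organisms = ["Homo sapiens"] * 5 + ["Mus musculus"] * 3 + ["Rattus norvegicus"] * 2
print(top_organism_fractions(organisms))
# [('Homo sapiens', 50.0), ('Mus musculus', 30.0), ('Rattus norvegicus', 20.0)]
```

With more than five organisms in the alignment, the reported percentages will sum to less than 100% under this assumption.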