# License

Licensed under the Apache License, Version 2.0 (the "License")
```
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Setup

In [None]:
# Uncomment to install the covid_vhh_design package

# !pip install git+https://github.com/google-research/google-research.git#subdirectory=covid_vhh_design

# Imports

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
from covid_vhh_design import covid

In [None]:
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 200)

# Loading data files
Annotated AlphaSeq measurements of the three described libraries are stored in:
* `data/round1.csv`: The 1st library
* `data/round2.csv`: The 2nd library
* `data/round3.csv`: The 3rd library


In [None]:
df = covid.load_df('round1.csv.gz')

# Description of columns
All three CSV files have the following columns:
* `source_key`: A unique identifier of an assayed VHH sequence. Note that sources can have different keys but the same sequence if multiple sequence replicates were assayed.
* `target_key`: A unique identifier of a CoV RBD binding target.
* `replica`: The AlphaSeq experiment replica. round0 has three replicas (1-3); round1 and round2 have six each (1-6).
* `value`: An AlphaSeq log KD measurement. There exists exactly one value per `source_key x target_key x replica` triplet. A value can be infinity (`inf`) if the binding strength is below the detectable threshold.

The remaining `target_*` and `source_*` columns store annotations of targets and sources, respectively, and are described in the following sections.

In [None]:
df.head()
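
The column description above implies two simple checks: each `source_key x target_key x replica` triplet should occur once, and `inf` values need separate handling when replicas are aggregated. The following is a minimal sketch on a hypothetical toy table (the keys and values are made up for illustration and are not taken from the data files). It averages only the finite measurements and reports the fraction of non-binding replicas separately.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the documented columns.
toy = pd.DataFrame({
    'source_key': [1, 1, 1, 2, 2, 2],
    'target_key': [10, 10, 10, 10, 10, 10],
    'replica': [1, 2, 3, 1, 2, 3],
    'value': [2.0, 2.5, 3.0, 4.0, np.inf, np.inf],
})

# Each (source_key, target_key, replica) triplet should appear exactly once.
triplet = ['source_key', 'target_key', 'replica']
is_unique = not toy.duplicated(triplet).any()

# Separate non-binding measurements (inf) from finite log KD values.
toy['is_inf'] = np.isinf(toy['value'])
toy['finite_value'] = toy['value'].where(~toy['is_inf'])

# Aggregate replicas per source/target pair; NaN entries are skipped by mean.
summary = (
    toy.groupby(['source_key', 'target_key'])
    .agg(
        mean_finite_log_kd=('finite_value', 'mean'),
        frac_inf=('is_inf', 'mean'),
        num_replicas=('replica', 'nunique'))
    .reset_index())
print(summary)
```

Keeping the fraction of `inf` values as its own column avoids silently mixing non-binding events into an average. Whether a mean, median, or another statistic is appropriate depends on the analysis.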

## Target columns
* `target_name`: A unique name per `target_key`
  - `MATalpha_Control_[1-6]`: Positive control targets
  - `MATalpha_Negative_[1-3]`: Negative control targets
  - `SARS-CoV1_RBD`: Wildtype SARS-CoV1
  - `SARS-CoV2_RBD`: Wildtype SARS-CoV2
  - `SARS-CoV2_RBD_*`: SARS-CoV2 mutants
  - All other: CoV-related targets
* `target_is_control`: Whether a target is a control
* `target_control_type`: Whether a control target is a 'positive' or a 'negative' control
* `target_group`: The name of a target group. E.g. 'CoV2' for wildtype CoV2 or CoV2 mutants

In [None]:
df.loc[:, df.columns.str.startswith('target_')].drop_duplicates()
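
The target annotations above can be used to select subsets of targets. The sketch below works on a hypothetical toy annotation table: the mutant name `SARS-CoV2_RBD_mutantA`, the `'CoV1'` group label, and the use of `None` for non-control targets are placeholders chosen for illustration, not values confirmed by the data files.

```python
import pandas as pd

# Hypothetical toy annotation table; the mutant name, the 'CoV1' label, and
# the None entries are placeholders, not values taken from the data files.
targets = pd.DataFrame({
    'target_name': [
        'MATalpha_Control_1', 'MATalpha_Negative_1',
        'SARS-CoV1_RBD', 'SARS-CoV2_RBD', 'SARS-CoV2_RBD_mutantA'],
    'target_is_control': [True, True, False, False, False],
    'target_control_type': ['positive', 'negative', None, None, None],
    'target_group': [None, None, 'CoV1', 'CoV2', 'CoV2'],
})

# Drop control targets before analyzing binding to RBDs.
rbd_targets = targets[~targets['target_is_control']]

# Wildtype CoV2 and its mutants share the 'CoV2' target group.
cov2_targets = rbd_targets[rbd_targets['target_group'] == 'CoV2']

# Mutants carry a suffix after the wildtype name 'SARS-CoV2_RBD'.
cov2_mutants = cov2_targets[
    cov2_targets['target_name'].str.startswith('SARS-CoV2_RBD_')]
print(cov2_mutants['target_name'].tolist())
```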

## Source columns
* `source_name`: A unique name per `source_key`.
* `source_is_control`: Whether a source is a control. Control sources were included for sanity-checking and should bind only weakly or not at all.
* `source_control_type`: Whether a control source is a 'positive' or a 'negative' control.
* `source_num_mutations`: The number of mutations (Hamming distance) from the wildtype sequence.
* `source_category`: One of the following categories
    - `mutant`: A mutant sequence of VHH-72 (`source_num_mutations > 0`).
    - `parent`: The parent sequence VHH-72 (`source_num_mutations == 0`).
    - `negative_control`: A negative control.
    - `positive_control`: A positive control.
    - `special`: A 'special' sequence that was not derived from VHH-72, e.g. 'SARS_VHH44' or 'MERS_VHH55'.
* `source_hash`: The MD5 hash of `source_seq`.
* `source_replica`: The sequence replica. Distinguishes sources with the exact same sequence. For example, round0 contains 46 copies of the parent sequence VHH-72 (1-46).
* `source_distance`: Same as `source_num_mutations`, except that values are either strictly positive (`>0`) or `NaN`, rather than `>=0`.
* `source_score`: The negative log-likelihood of the VAE that was used for scoring mutants.
* `source_group`: Describes how sequences were designed per round.
* `source_seq`: The amino-acid sequence.
* `source_std_group`: Standardized annotation for sequence design across rounds.



In [None]:
source_df = df.loc[:, df.columns.str.startswith('source_')].drop_duplicates()
source_df.head()
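
Two of the source annotations above are derived directly from the sequence: `source_num_mutations` is a Hamming distance and `source_hash` is an MD5 hash. The sketch below shows how such values could be computed. The helper names `hamming_distance` and `md5_hex` are hypothetical, the short sequences are toy strings rather than real VHH sequences, and hashing the UTF-8 encoded sequence into a hex digest is an assumption about how `source_hash` was produced.

```python
import hashlib


def hamming_distance(seq, parent):
  """Counts positions at which two equal-length sequences differ."""
  if len(seq) != len(parent):
    raise ValueError('Sequences must have the same length.')
  return sum(a != b for a, b in zip(seq, parent))


def md5_hex(seq):
  """Returns the MD5 hex digest of a sequence (assumes UTF-8 encoding)."""
  return hashlib.md5(seq.encode('utf-8')).hexdigest()


# Toy sequences for illustration only; these are not real VHH sequences.
toy_parent = 'QVQLQ'
toy_mutant = 'QVKLE'

print(hamming_distance(toy_parent, toy_parent))
print(hamming_distance(toy_mutant, toy_parent))
print(md5_hex(toy_parent) == md5_hex(toy_parent))
```

Identical sequences always map to the same hash, so a hash column of this kind is a convenient key for finding sequence replicas.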

Round0 sequences were grouped as follows:
* `mbo`: Model-designed sequences:
  - `cdrh12_multies1`: Sequences with mutations in CDR1 or CDR2 designed by optimizing the 1st VAE.
  - `cdrh12_multies2`: Sequences with mutations in CDR1 or CDR2 designed by optimizing the 2nd VAE.
  - `cdrh12_random`: Sequences with mutations in CDR1 or CDR2 scored randomly.
  - `cdr3_multies1`: Sequences with mutations in CDR3 designed by optimizing the 1st VAE.
  - `cdr3_multies2`: Sequences with mutations in CDR3 designed by optimizing the 2nd VAE.
  - `cdr3_random`: Sequences with mutations in CDR3 obtained by reservoir sampling.
  - `mutant`: Uncategorized sequences with randomly sampled mutations.
* `negative_control`: Negative control sources.
* `parent`: The parent sequence VHH-72.
* `positive_control`: Positive control sources.
* `singles`: Single-mutant sequences of VHH-72.
* `special`: A 'special' sequence that was not derived from VHH-72, e.g. 'SARS_VHH44' or 'MERS_VHH55'.

In [None]:
source_df.groupby('source_std_group')['source_group'].unique()

In [None]:
# Round0 contains 46 copies of the parent sequence
source_df.query('source_num_mutations == 0').head()

In [None]:
# The number of sources that are N mutations away from VHH-72
source_df.value_counts('source_num_mutations')

In [None]:
# Sequence replicas have different `source_replica` IDs.
source_df.query('source_num_mutations == 0')['source_replica'].nunique()
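
The same idea extends beyond the parent sequence: grouping by `source_seq` shows which sequences were assayed more than once. The following sketch uses a hypothetical toy table with made-up keys and short placeholder sequences. It also checks that `source_seq` together with `source_replica` identifies a source, which is how the column description reads but is not verified against the data files here.

```python
import pandas as pd

# Hypothetical toy annotations; keys and sequences are placeholders.
sources = pd.DataFrame({
    'source_key': [1, 2, 3, 4],
    'source_seq': ['QVQLQ', 'QVQLQ', 'QVQLQ', 'QVKLQ'],
    'source_replica': [1, 2, 3, 1],
    'source_num_mutations': [0, 0, 0, 1],
})

# Number of distinct sources that share each sequence.
copies_per_seq = sources.groupby('source_seq')['source_key'].nunique()

# Keep only sequences that were assayed more than once.
replicated = copies_per_seq[copies_per_seq > 1]

# A (sequence, replica) pair is expected to identify a single source.
pair_is_unique = not sources.duplicated(
    ['source_seq', 'source_replica']).any()
print(replicated)
```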

In [None]:
# Special sequences
source_df.query('source_category == "special"')

# Visualization of AlphaSeq measurements

AlphaSeq log KD measurements are stored in the `value` column. A value can be infinity (`inf`) if the binding strength is below the detectable threshold. The following functions illustrate how to plot 1) the distribution of finite values per target, and 2) the percentage of infinite values (non-binding events) per target.

In [None]:
def plot_non_inf_values_per_target(df):
  data = df.loc[~np.isinf(df['value'])]
  _, ax = plt.subplots(figsize=(15, 6))
  order = data.groupby('target_name')['value'].mean().sort_values().index
  sns.boxplot(
      data=data,
      x='target_name',
      order=order,
      y='value',
      ax=ax)
  ax.set_ylabel('AlphaSeq log KD')
  ax.figure.canvas.draw()
  ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha='right')


plot_non_inf_values_per_target(df)

In [None]:
def plot_percentage_of_inf_values_per_target(df):
  data = df.copy()
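  # Flag non-binding measurements, then convert the per-target fraction to a
  # percentage so that it matches the y-axis label.
  data['value'] = np.isinf(data['value'])
  data = (
      data.groupby('target_name')['value'].mean().mul(100).reset_index()
      .sort_values('value'))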

  _, ax = plt.subplots(figsize=(15, 6))
  sns.barplot(
      data=data,
      x='target_name',
      y='value',
      ax=ax)
  ax.set_ylabel('% Inf values')
  ax.figure.canvas.draw()
  ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha='right')


plot_percentage_of_inf_values_per_target(df)
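
The two plots can also be summarized as a single table. The sketch below computes, per target, the percentage of `inf` values and the median of the finite log KD values. It runs on a hypothetical toy table with made-up measurements; only the column names and the two wildtype target names follow the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements; the values are made up for illustration.
toy = pd.DataFrame({
    'target_name': ['SARS-CoV2_RBD'] * 3 + ['SARS-CoV1_RBD'] * 3,
    'value': [1.0, 3.0, np.inf, 2.0, np.inf, np.inf],
})

# Boolean mask of non-binding measurements.
is_inf = np.isinf(toy['value'])

# Combine the percentage of inf values with the median finite log KD.
summary = pd.DataFrame({
    'percent_inf': is_inf.groupby(toy['target_name']).mean() * 100,
    'median_finite_log_kd': (
        toy.loc[~is_inf].groupby('target_name')['value'].median()),
}).sort_values('percent_inf')
print(summary)
```

A target whose measurements are all `inf` would get `NaN` in the median column, since it has no finite values to summarize.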