## Filter summary data

The `ab_ag.tsv` file can contain multiple lines for the same PDB ID, and some lines are missing `Hchain` or `Lchain`, among other inconsistencies.

The file needs to be cleaned before the data can be processed further.

Here we do a very simple cleaning step:

- remove lines where `Hchain`, `Lchain` or `antigen_chain` is missing
- remove lines for which we do not have a PDB file
- remove lines for which `antigen_chain` is not a single character
- for PDB IDs that appear more than once, keep only the first line

You will probably need to filter the entries further.
This is only a starting point.

In [27]:
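import os.path

import pandas as pd

# Directory with the structure files, expected as <pdb>_chothia.pdb
PDB_DIR = "../data/pdbs"

summary = pd.read_csv('../data/ab_ag.tsv', sep='\t')

filtered_summary = summary[
    # heavy and light chain must both be annotated
    summary['Hchain'].notna() &
    summary['Lchain'].notna() &
    # antigen_chain must be present and exactly one character long
    (summary['antigen_chain'].fillna('').str.strip().str.len() == 1) &
    # the structure file must exist locally
    summary['pdb'].apply(lambda pdb: os.path.exists(os.path.join(PDB_DIR, f"{pdb}_chothia.pdb")))
].groupby('pdb').head(1)  # keep only the first line per PDB ID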
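
To see how many rows each criterion removes, the same conditions can be applied one at a time. The following is a minimal sketch on a made-up toy table, not the real `ab_ag.tsv`: the values in `toy` are invented, and the set `available` is a hypothetical stand-in for the `os.path.exists` check so that the example runs without any PDB files.

```python
import pandas as pd

# Toy stand-in for ab_ag.tsv; column names follow the cell above, values are invented.
toy = pd.DataFrame({
    "pdb": ["1abc", "1abc", "2xyz", "3def", "4ghi"],
    "Hchain": ["H", "H", None, "A", "H"],
    "Lchain": ["L", "L", "L", "B", "L"],
    "antigen_chain": ["A", "B", "C", "C | D", "E"],
})

# Hypothetical set of PDB IDs that have a structure file on disk
# (replaces the os.path.exists check for the purpose of this sketch).
available = {"1abc", "2xyz", "3def"}

# One filter per criterion, applied in order.
steps = {
    "H and L chain present": lambda df: df[df["Hchain"].notna() & df["Lchain"].notna()],
    "single-character antigen chain": lambda df: df[
        df["antigen_chain"].fillna("").str.strip().str.len() == 1
    ],
    "PDB file available": lambda df: df[df["pdb"].isin(available)],
    "first line per PDB ID": lambda df: df.groupby("pdb").head(1),
}

counts = {"start": len(toy)}
current = toy
for name, step in steps.items():
    current = step(current)
    counts[name] = len(current)
    print(f"{name}: {counts[name]} rows left")
```

Running the steps separately makes it easier to notice when one criterion discards far more entries than expected.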

In [28]:
filtered_summary['antigen_chain'].unique()

array(['M', 'E', 'A', 'D', 'B', 'C', 'T', 'Q', 'G', 'Z', 'a', 'R', 'F',
       'H', 'I', 'P', 'X', 'K', 'J', 'c', 'N', 'V', 'Y', 'S', 'U', 'd',
       'O', 'm', 'b', 'W'], dtype=object)

Create a directory `generated/preprocess` and save the DataFrame there as `summary_pdb.tsv`.

In [30]:
filtered_summary.to_csv('../generated/preprocess/summary_pdb.tsv', sep='\t', index=False)
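
Note that `to_csv` does not create missing directories, so the write fails if `generated/preprocess` does not exist yet. As a minimal sketch, the directory can be created from Python with `os.makedirs` before writing, and the file can be read back to confirm it round-trips. The example uses a temporary directory and an invented toy table (`toy_summary`) in place of `filtered_summary`, so it does not touch the real output path.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for filtered_summary; values are invented.
toy_summary = pd.DataFrame({
    "pdb": ["1abc", "2xyz"],
    "Hchain": ["H", "A"],
    "Lchain": ["L", "B"],
    "antigen_chain": ["A", "C"],
})

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the generated/preprocess layout inside a throwaway directory.
    out_dir = os.path.join(tmp, "generated", "preprocess")
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists

    out_path = os.path.join(out_dir, "summary_pdb.tsv")
    toy_summary.to_csv(out_path, sep="\t", index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path, sep="\t")

print(reloaded)
```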