## Filter summary data

The `ab_ag.tsv` file contains multiple lines for each pdb_id, and some lines are missing Hchain or Lchain, among other problems.

This needs to be cleaned before one can process the data further.

Here we do a very simple cleaning step:

- remove lines where Hchain or Lchain or antigen_chain are missing
- remove lines for which we do not have a PDB file
- remove lines for which antigen_chain is not a single character
- for duplicated pdb_ids, select the first line only

You will probably need to do more selection of entries.
This is just to get started.

In [19]:
import os.path

import pandas as pd

PDB_DIR = "../data/pdbs"

summary = pd.read_csv('../data/ab_ag.tsv', sep='\t')

# Keep rows that have both antibody chains, a single-character antigen chain,
# and a matching PDB file on disk; then keep only the first row per pdb ID.
filtered_summary = summary[
    summary['Hchain'].notna() &
    summary['Lchain'].notna() &
    (summary['antigen_chain'].fillna('').str.strip().str.len() == 1) &
    summary['pdb'].apply(lambda pdb: os.path.exists(f"{PDB_DIR}/{pdb}_chothia.pdb"))
].groupby('pdb').head(1)
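
To see what each condition in the filter removes, here is a minimal sketch on a made-up table. The rows, the names `toy` and `available`, and the set lookup that stands in for the `os.path.exists` check are all assumptions for illustration; they are not taken from the real data.

```python
import pandas as pd

# Made-up stand-in for ab_ag.tsv (illustrative values only).
toy = pd.DataFrame({
    "pdb": ["1abc", "1abc", "2xyz", "3def", "4ghi"],
    "Hchain": ["H", "A", None, "H", "H"],
    "Lchain": ["L", "B", "L", "L", "L"],
    "antigen_chain": ["C", "D", "E", "A | B", None],
})

# Hypothetical replacement for the "PDB file exists on disk" check.
available = {"1abc", "2xyz", "3def", "4ghi"}

mask = (
    toy["Hchain"].notna()
    & toy["Lchain"].notna()
    & (toy["antigen_chain"].fillna("").str.strip().str.len() == 1)
    & toy["pdb"].isin(available)
)

# head(1) per group keeps the first surviving row for each pdb ID.
toy_filtered = toy[mask].groupby("pdb").head(1)
print(toy_filtered)
```

Only the first `1abc` row survives: the second `1abc` row is a duplicate, `2xyz` has no Hchain, `3def` has a multi-chain antigen, and `4ghi` has no antigen chain.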

In [20]:
# Keep entries that are not flagged as scFv and whose resolution value is below 3.25.
# .copy() avoids a SettingWithCopyWarning when the new column is added below.
filtered_summary = filtered_summary[filtered_summary['scfv'] == False]
filtered_summary = filtered_summary[filtered_summary['resolution'] < 3.25].copy()

# Assign a coarse species label based on a case-insensitive substring match.
filtered_summary["species"] = ""
filtered_summary.loc[filtered_summary["antigen_species"].str.contains("coronavirus2", case=False, na=False), "species"] = "SARS-CoV-2"
filtered_summary.loc[filtered_summary["antigen_species"].str.contains("homo sapiens", case=False, na=False), "species"] = "Homo Sapiens"
filtered_summary.loc[filtered_summary["antigen_species"].str.contains("influenza a", case=False, na=False), "species"] = "Influenza A"

In [21]:
filtered_summary = filtered_summary[filtered_summary["species"] != ""]

In [23]:
filtered_summary.species.value_counts()

species
SARS-CoV-2      460
Homo Sapiens    393
Influenza A      97
Name: count, dtype: int64
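
The repeated `.loc` assignments in cell 20 can also be driven by a mapping from search pattern to label, which makes it easier to add species later. The sketch below is one possible way to write this, shown with two of the patterns on made-up input; `SPECIES_PATTERNS`, `label_species`, and `toy_species` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical mapping: substring to search for -> label to assign.
SPECIES_PATTERNS = {
    "coronavirus2": "SARS-CoV-2",
    "homo sapiens": "Homo Sapiens",
}


def label_species(antigen_species: pd.Series) -> pd.Series:
    """Return a label per row; rows that match no pattern get ''."""
    labels = pd.Series("", index=antigen_species.index, dtype=object)
    for pattern, label in SPECIES_PATTERNS.items():
        hit = antigen_species.str.contains(pattern, case=False, na=False)
        # As with sequential .loc assignments, a later pattern
        # overwrites an earlier one if both match the same row.
        labels[hit] = label
    return labels


# Made-up antigen_species values for illustration.
toy_species = pd.Series([
    "Severe acute respiratory syndrome coronavirus2",
    "Homo sapiens",
    "Mus musculus",
    None,
])
print(label_species(toy_species).tolist())
# → ['SARS-CoV-2', 'Homo Sapiens', '', '']
```

Rows left with an empty label can then be dropped exactly as in cell 21.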

In [24]:
filtered_summary['antigen_chain'].unique()

array(['E', 'A', 'D', 'B', 'C', 'R', 'H', 'I', 'P', 'F', 'X', 'K', 'Z',
       'J', 'Q', 'T', 'G', 'M', 'V', 'S', 'c', 'Y', 'U', 'm', 'b', 'N',
       'O', 'W'], dtype=object)

Create a directory `generated/preprocess` and save the DataFrame there as `summary_pdb.tsv`.

In [25]:
os.makedirs('../generated/preprocess', exist_ok=True)  # to_csv does not create missing directories
filtered_summary.to_csv('../generated/preprocess/summary_pdb.tsv', sep='\t', index=False)
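
When the saved table is loaded again in a later step, it can help to read the `pdb` column explicitly as text. The sketch below shows a round trip in a temporary directory with made-up rows; the `toy` table and the temporary path are assumptions for illustration, and only the relative layout `generated/preprocess/summary_pdb.tsv` comes from the text above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({"pdb": ["1abc", "5e94"], "antigen_chain": ["C", "m"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "generated" / "preprocess"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the directory first
    out_path = out_dir / "summary_pdb.tsv"

    toy.to_csv(out_path, sep="\t", index=False)

    # dtype=str keeps IDs such as "5e94" as text even if every ID
    # in the column happens to look like a number.
    reloaded = pd.read_csv(out_path, sep="\t", dtype={"pdb": str})

print(reloaded["pdb"].tolist())
```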