## Parse CD-HIT cluster file

In notebook 05 we ran CD-HIT on the VH.fa and VL.fa files, which assigned every sequence to a cluster.

Now we want to remove redundant antibodies. We consider two antibodies redundant if both their heavy chains and their light chains show high sequence similarity, i.e. if their heavy chains are in the same cluster and their light chains are in the same cluster.

To do so, we create a DataFrame with columns
- pdb_id
- Hchain
- Lchain
- Hcluster
- Lcluster

where the first 3 columns are parsed from the summary file, and the last 2 columns are populated by parsing the cluster files. 


### Parsing the cluster file    


The cluster file has the following structure:

- Lines starting with '>' start a new cluster.
- All other lines each describe one sequence belonging to that cluster. The sequence id is enclosed by '>' on the left and three dots '...' on the right.


How can we extract the pdb_id from those lines?

Given a line, we can use the string `find` method to obtain the indices of those enclosing characters and take the substring between them as the pdb_id. An example is given below:

In [26]:
line = "0	141aa, >6urh... *"

# we want to extract 6urh
# the first character starts after the '>'
# the last character ends before the '...'

# find the first > on the line
startidx = line.find('>') 
# find the three dots
endidx = line.find('...')

# be aware of python indexing. Slice a:b includes a but not b
print(line[(startidx+1):endidx])

6urh


Write a function `parse_pdb_id(line)` that returns the pdb_id.

We also need to make sure that there is a pdb_id found on the line.
The `find` method returns -1 if the substring is not found.

Assume a pdb_id is found when `startidx != -1`, `endidx != -1`, and `endidx - startidx > 3`.
If no pdb_id is found, raise a `ValueError`, e.g. `raise ValueError(f"Could not parse line {line}")`.

In [27]:
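def parse_pdb_id(line):
    """Return the pdb_id enclosed by '>' and '...' on a cluster-file line."""
    startidx = line.find('>')
    endidx = line.find('...')
    # find returns -1 when the substring is missing;
    # also require at least 3 characters between '>' and '...'
    if startidx != -1 and endidx != -1 and endidx - startidx > 3:
        return line[(startidx + 1):endidx]
    raise ValueError(f"Could not parse line {line}")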


Test the function on a few examples to see it works as expected. You want to test both successful execution and error.
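One possible way to run such tests is sketched below. The function is repeated so the snippet runs on its own, and the malformed lines are made-up inputs chosen to trigger each failure condition (missing '>', missing '...', and an id that is too short).

```python
def parse_pdb_id(line):
    # Same logic as parse_pdb_id above, repeated so this snippet is self-contained.
    startidx = line.find('>')
    endidx = line.find('...')
    if startidx != -1 and endidx != -1 and endidx - startidx > 3:
        return line[(startidx + 1):endidx]
    raise ValueError(f"Could not parse line {line}")


# Successful case: the example line used earlier in this notebook.
good_result = parse_pdb_id("0\t141aa, >6urh... *")
print(good_result)

# Failure cases (hypothetical malformed lines): each should raise ValueError.
bad_lines = [
    "",                      # empty line
    "0\t141aa, 6urh... *",   # no '>'
    "0\t141aa, >6urh",       # no '...'
    "0\t141aa, >ab...",      # fewer than 3 characters between '>' and '...'
]
errors = []
for bad in bad_lines:
    try:
        parse_pdb_id(bad)
    except ValueError as err:
        errors.append(str(err))

print(f"{len(errors)} of {len(bad_lines)} malformed lines raised ValueError")
```

Catching the exception explicitly, rather than letting the cell fail, makes it easy to check all error cases in a single run.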

Now we can parse the cluster file.

Write a function `parse_cluster_file(cluster_file)` that

- declares empty dictionary pdb2cluster
- sets current_cluster to ''
- opens cluster_file (use a `with` block)
- loops over all lines
  - if line starts with a '>'
    - reassign current_cluster
  - else
    - parse pdb_id
    - set pdb2cluster[pdb_id] = current_cluster
- return pdb2cluster

In [31]:
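def parse_cluster_file(cluster_file):
    """Map every pdb_id in a CD-HIT .clstr file to its cluster number (as a string)."""
    pdb2cluster = {}
    current_cluster = ''
    with open(cluster_file) as f:
        for line in f:
            if line.startswith(">"):
                # header lines look like ">Cluster 12"; skip the 9-character ">Cluster " prefix
                current_cluster = line[9:].strip()
            else:
                pdb_id = parse_pdb_id(line)
                pdb2cluster[pdb_id] = current_cluster
    return pdb2cluster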

In [35]:
vhc = parse_cluster_file("../generated/preprocess/VHcluster.clstr")
vlc = parse_cluster_file("../generated/preprocess/VLcluster.clstr")
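
To check the parser independently of the real cluster files, it can be run on a small temporary file. The sketch below uses made-up content in the `.clstr` layout: only the `6urh` line comes from this notebook, while the ids `aaaa` and `bbbb`, the sequence lengths, and the identity percentage are placeholders. Both functions are repeated so the snippet is self-contained.

```python
import os
import tempfile


def parse_pdb_id(line):
    # Same logic as parse_pdb_id above.
    startidx = line.find('>')
    endidx = line.find('...')
    if startidx != -1 and endidx != -1 and endidx - startidx > 3:
        return line[(startidx + 1):endidx]
    raise ValueError(f"Could not parse line {line}")


def parse_cluster_file(cluster_file):
    # Same logic as parse_cluster_file above.
    pdb2cluster = {}
    current_cluster = ''
    with open(cluster_file) as f:
        for line in f:
            if line.startswith(">"):
                current_cluster = line[9:].strip()
            else:
                pdb2cluster[parse_pdb_id(line)] = current_cluster
    return pdb2cluster


# Made-up cluster file content; 'aaaa' and 'bbbb' are placeholder ids.
example = (
    ">Cluster 0\n"
    "0\t141aa, >6urh... *\n"
    "1\t139aa, >aaaa... at 95.00%\n"
    ">Cluster 1\n"
    "0\t120aa, >bbbb... *\n"
)

# Write the example to a temporary file, parse it, then clean up.
with tempfile.NamedTemporaryFile(mode="w", suffix=".clstr", delete=False) as tmp:
    tmp.write(example)
    tmp_path = tmp.name

toy_clusters = parse_cluster_file(tmp_path)
os.remove(tmp_path)

print(toy_clusters)
```

Note that the cluster numbers are stored as strings, which is sufficient for grouping and duplicate detection.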

### Annotating summary file with cluster numbers

- parse VH cluster file into pdb2vhcluster
- parse VL cluster file into pdb2vlcluster

- load the summary DataFrame
- create two additional columns Hcluster and Lcluster and populate them with the cluster numbers
- sort the DataFrame (see below)
- create a column duplicated that indicates whether the combination of Hcluster and Lcluster in a row has already appeared in an earlier row (pandas has a `.duplicated()` method; use it on the subset of Hcluster and Lcluster columns)

- save the DataFrame

- drop duplicated lines (select lines where duplicated is False)

- save the DataFrame without duplicates


In [40]:
import pandas as pd
import os.path

PDB_DIR = "../data/pdbs"

summary = pd.read_csv("../data/ab_ag.tsv", sep='\t')

filtered_summary = summary[
    summary['Hchain'].notna() &
    summary['Lchain'].notna() &
    summary['pdb'].apply(lambda pdb: os.path.exists(f"{PDB_DIR}/{pdb}_chothia.pdb"))
].groupby('pdb').head(1).reset_index()

filtered_summary.head()

Unnamed: 0,index,pdb,Hchain,Lchain,model,antigen_chain,antigen_type,antigen_het_name,antigen_name,short_header,...,scfv,engineered,heavy_subclass,light_subclass,light_ctype,affinity,delta_g,affinity_method,temperature,pmid
0,0,9ds2,G,I,0,D | C,protein | protein,NA | NA,hemagglutinin ha2 chain | hemagglutinin ha1 chai,VIRAL PROTEIN/IMMUNE SYSTEM,...,False,True,IGHV3,IGKV3,Kappa,,,,,
1,3,8uzp,H,L,0,M,protein,,stem_mimetic_01,IMMUNE SYSTEM,...,False,True,IGHV1,IGLV1,Lambda,,,,,
2,5,8veb,G,I,0,E,protein,,hemagglutinin,IMMUNE SYSTEM/VIRAL PROTEIN,...,False,True,IGHV4,IGKV1,Kappa,,,,,
3,8,8ved,H,L,0,A,protein,,hemagglutinin,IMMUNE SYSTEM/VIRAL PROTEIN,...,False,True,IGHV4,IGKV2,Kappa,,,,,
4,11,8vee,H,L,0,A,protein,,hemagglutinin,IMMUNE SYSTEM/VIRAL PROTEIN,...,False,True,IGHV4,IGKV1,Kappa,,,,,


In [46]:
filtered_summary["Hcluster"] = [vhc.get(pdb, '') for pdb in filtered_summary['pdb']] 
filtered_summary["Lcluster"] = [vlc.get(pdb, '') for pdb in filtered_summary['pdb']] 

In [47]:
filtered_summary.head()

Unnamed: 0,index,pdb,Hchain,Lchain,model,antigen_chain,antigen_type,antigen_het_name,antigen_name,short_header,...,heavy_subclass,light_subclass,light_ctype,affinity,delta_g,affinity_method,temperature,pmid,Hcluster,Lcluster
0,0,9ds2,G,I,0,D | C,protein | protein,NA | NA,hemagglutinin ha2 chain | hemagglutinin ha1 chai,VIRAL PROTEIN/IMMUNE SYSTEM,...,IGHV3,IGKV3,Kappa,,,,,,405,259
1,3,8uzp,H,L,0,M,protein,,stem_mimetic_01,IMMUNE SYSTEM,...,IGHV1,IGLV1,Lambda,,,,,,695,187
2,5,8veb,G,I,0,E,protein,,hemagglutinin,IMMUNE SYSTEM/VIRAL PROTEIN,...,IGHV4,IGKV1,Kappa,,,,,,126,723
3,8,8ved,H,L,0,A,protein,,hemagglutinin,IMMUNE SYSTEM/VIRAL PROTEIN,...,IGHV4,IGKV2,Kappa,,,,,,183,74
4,11,8vee,H,L,0,A,protein,,hemagglutinin,IMMUNE SYSTEM/VIRAL PROTEIN,...,IGHV4,IGKV1,Kappa,,,,,,126,723


In [52]:
filtered_summary["duplicated"] = filtered_summary.duplicated(subset = ['Hcluster','Lcluster'])

In [56]:
filtered_summary[~filtered_summary["duplicated"]].to_csv("../generated/preprocess/summary_pdb_clusters_deduplicated.tsv", sep='\t', index=False)

There is one caveat here. If we have duplicated antibodies, we want to keep the best instance of each. By best we mean entries that have affinity data and good resolution. We can achieve this once we understand how the `duplicated` method works: with `df.duplicated(subset=[..], keep='first')`, pandas marks all duplicates as `True` except for the first occurrence.

So all we need to do is sort the DataFrame such that rows with an affinity value and rows with good resolution appear first, and then apply the `duplicated` method to the sorted DataFrame. Check the documentation of `sort_values`, and pay attention to the `na_position` and `ascending` options.
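
A minimal sketch of this sort-then-deduplicate idea on a toy DataFrame follows. The data is made up, the column name `resolution` is an assumption (it is not visible in the truncated output above), and resolution is assumed to be numeric with lower values being better. The helper column `no_affinity` is one possible design choice, not the only one: sorting directly on `affinity` with `na_position='last'` would also put rows with affinity data first, but would then order them by affinity value rather than by resolution.

```python
import pandas as pd

# Toy data: 'aaaa', 'bbbb' and 'cccc' share the same heavy and light clusters.
toy = pd.DataFrame({
    "pdb": ["aaaa", "bbbb", "cccc", "dddd"],
    "Hcluster": ["1", "1", "1", "2"],
    "Lcluster": ["7", "7", "7", "3"],
    "affinity": [float("nan"), 1e-9, 2e-9, float("nan")],
    "resolution": [1.8, 3.2, 2.1, 2.5],  # assumed column name, lower is better
})

# Helper column: False for rows with affinity data, True for rows without.
# Sorting ascending puts False (has affinity) before True (no affinity).
toy["no_affinity"] = toy["affinity"].isna()

toy_sorted = toy.sort_values(
    by=["no_affinity", "resolution"],
    ascending=[True, True],
    na_position="last",  # rows with missing resolution go to the end
)

# keep='first' marks every repeat of an (Hcluster, Lcluster) pair except the first.
toy_sorted["duplicated"] = toy_sorted.duplicated(
    subset=["Hcluster", "Lcluster"], keep="first"
)

kept = toy_sorted[~toy_sorted["duplicated"]]
print(kept[["pdb", "Hcluster", "Lcluster", "affinity", "resolution"]])
```

In this toy example, `cccc` is kept for the first cluster pair because it has affinity data and a better resolution than `bbbb`, even though `aaaa` has the best resolution overall.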