## Parse CD-HIT cluster file

In notebook 05 we ran CD-HIT on the VH.fa and VL.fa files, which assigned every sequence to a cluster.

Now we want to remove redundant antibodies. We consider two antibodies redundant if both their heavy chains and their light chains show high sequence similarity, i.e. if their heavy chains are in the same cluster and their light chains are in the same cluster.

To do so, we create a DataFrame with columns
- pbs_id
- Hchain
- Lchain
- Hcluster
- Lcluster

where the first three columns are parsed from the summary file and the last two columns are populated by parsing the cluster files.


### Parsing the cluster file    


The cluster file has the following structure:

- Lines starting with '>' start a new cluster.
- All other lines list the sequences belonging to that cluster. On each of these lines the sequence id is enclosed by '>' on the left and three dots '...' on the right.


How can we extract the pdb_id from those lines?

Given a line, we can use the `find` method to obtain the indices of those enclosing characters and take the string between those indices as the pdb_id. An example is given below:

In [None]:
line = "0	136aa, >5o0w... *"

# we want to extract 5o0w
# the id starts right after the first '>'
# and ends right before the first '...'

# find the first '>' on the line
startidx = line.find('>')
# find the first occurrence of three dots
endidx = line.find('...')

# be aware of Python slicing: line[a:b] includes index a but not index b
print(line[(startidx+1):endidx])

5o0w


Write a function `parse_pdb_id(line)` that returns the pdb_id.

We also need to make sure that there is a pdb_id found on the line.
The `find` method returns -1 if the substring is not found.

Assume a pdb_id is found when startidx != -1 and endidx != -1 and endidx - startidx > 3.
If no pdb_id is found, `raise ValueError(f"No pbs_id found in {line}")`

In [None]:
def parse_pdb_id(line):
    ...

Test the function on a few examples to check that it works as expected. Test both a line that parses successfully and a line that raises the error.
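
As a point of comparison for your own solution, here is a minimal sketch of one way `parse_pdb_id` could look, followed by the two kinds of test. It follows the condition and error message stated above; the failing test line `">Cluster 0"` is an illustrative assumption about what a cluster header looks like.

```python
def parse_pdb_id(line):
    """Return the id between the first '>' and the first '...' on a line."""
    startidx = line.find('>')
    endidx = line.find('...')
    # find returns -1 when the substring is missing;
    # the id must also be non-trivial (endidx - startidx > 3)
    if startidx == -1 or endidx == -1 or endidx - startidx <= 3:
        raise ValueError(f"No pbs_id found in {line}")
    return line[startidx + 1:endidx]


# successful case: the example line from above
print(parse_pdb_id("0\t136aa, >5o0w... *"))  # 5o0w

# error case: a line without '...' (hypothetical cluster header line)
try:
    parse_pdb_id(">Cluster 0")
except ValueError as err:
    print("ValueError:", err)
```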

Now we can parse the cluster file.

Write a function `parse_cluster_file(cluster_file)` that

- declares empty dictionary pdb2cluster
- sets current_cluster to ''
- opens cluster_file (use a `with` block)
- loops over all lines
  - if line starts with a '>'
    - reassign current_cluster
  - else
    - parse pdb_id
    - set pdb2cluster[pdb_id] = current_cluster
- return pdb2cluster

In [None]:
def parse_cluster_file(cluster_file):
    ...
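
The steps listed above can be sketched as follows. This is only one possible solution, written under a few assumptions: the cluster label is taken to be the text after the leading '>', blank lines are skipped, and the toy file content (apart from the `5o0w` line shown earlier) is made up purely to exercise the function.

```python
import os
import tempfile


def parse_pdb_id(line):
    """Return the id between the first '>' and the first '...' on a line."""
    startidx = line.find('>')
    endidx = line.find('...')
    if startidx == -1 or endidx == -1 or endidx - startidx <= 3:
        raise ValueError(f"No pbs_id found in {line}")
    return line[startidx + 1:endidx]


def parse_cluster_file(cluster_file):
    """Map every sequence id in a cluster file to the cluster it belongs to."""
    pdb2cluster = {}
    current_cluster = ''
    with open(cluster_file) as handle:
        for line in handle:
            line = line.rstrip('\n')
            if not line.strip():
                continue  # assumption: ignore blank lines
            if line.startswith('>'):
                # assumption: use the text after '>' as the cluster label
                current_cluster = line[1:].strip()
            else:
                pdb2cluster[parse_pdb_id(line)] = current_cluster
    return pdb2cluster


# Toy input (hypothetical ids and lengths, except for the 5o0w line)
toy_content = (
    ">Cluster 0\n"
    "0\t136aa, >5o0w... *\n"
    "1\t136aa, >1abc... at 98.00%\n"
    ">Cluster 1\n"
    "0\t120aa, >2xyz... *\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.clstr', delete=False) as tmp:
    tmp.write(toy_content)
    toy_path = tmp.name

toy_mapping = parse_cluster_file(toy_path)
os.remove(toy_path)
print(toy_mapping)
```

Because the header check comes first, `parse_pdb_id` is only ever called on member lines, so a `ValueError` here points to a genuinely malformed line.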
    

### Annotating summary file with cluster numbers

- parse VH cluster file into pdb2vhcluster
- parse VL cluster file into pdb2vlcluster

- load the summary DataFrame (summary_pdb.tsv)
- create two additional columns Hcluster and Lcluster and populate them with the clusters
 
  Here you loop over pdb_ids (you can use a list comprehension) and find the cluster from the corresponding dictionary.
  As some pdb_ids could not be processed, not all pdb_ids will be in the dictionary.
  Use `pdb2vhcluster.get(pdb_id, '')` to look up the key, and return '' if it does not exist.
  

- sort the DataFrame (see below)
- create a column `duplicated` that indicates whether the Hcluster and Lcluster entries of a row are duplicated (pandas has a `.duplicated(subset=[...])` method; use it on the subset of the Hcluster and Lcluster columns)

- save the DataFrame as `../generated/preprocess/summary_pdb_clusters.tsv`

- drop duplicated lines (select lines where duplicated is False)

- save the DataFrame without duplicates as `../generated/preprocess/summary_pdb_clusters_dedup.tsv`
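
The dictionary lookup with a default value, used when populating the cluster columns, can be illustrated without pandas. The dictionary content and the id list below are hypothetical toy values; in the notebook they would come from the parsed cluster file and the summary DataFrame.

```python
# hypothetical result of parsing the VH cluster file
pdb2vhcluster = {'5o0w': 'Cluster 0', '1abc': 'Cluster 0'}

# hypothetical ids from the summary file; '9zzz' could not be processed
pdb_ids = ['5o0w', '1abc', '9zzz']

# .get returns '' instead of raising KeyError for missing ids
hclusters = [pdb2vhcluster.get(pdb_id, '') for pdb_id in pdb_ids]
print(hclusters)  # ['Cluster 0', 'Cluster 0', '']
```

The resulting list has the same length and order as `pdb_ids`, so it can be assigned directly as a new DataFrame column.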


There is one caveat here. If we have duplicated antibodies, we want to keep the best instance of each. By best we mean that it has affinity data and good resolution. We can achieve this once we understand how the `duplicated` method works. If we use `df.duplicated(subset=[...], keep='first')`, the function marks all duplicates as `True` except for the first one.

So all we need to do is sort the DataFrame such that rows with an affinity value and rows with good resolution appear first, and then apply the `duplicated` method to the sorted DataFrame. Check the documentation of `sort_values`, and pay attention to the `na_position` and `ascending` options.
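
A small toy example may make the interaction between sorting and `keep='first'` concrete. This is a sketch under stated assumptions, not the notebook's reference solution: the column names `affinity` and `resolution`, the helper column `has_affinity`, and all values are hypothetical, and a lower resolution value is assumed to be better.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'pbs_id': ['a', 'b', 'c', 'd'],
    'Hcluster': ['Cluster 0', 'Cluster 0', 'Cluster 0', 'Cluster 1'],
    'Lcluster': ['Cluster 3', 'Cluster 3', 'Cluster 3', 'Cluster 3'],
    'affinity': [None, 1e-9, 5e-9, None],   # hypothetical column, None = no data
    'resolution': [1.5, 3.0, 2.0, 2.5],     # hypothetical column, lower = better
})

# helper column (assumption): True when affinity data is available
toy_df['has_affinity'] = toy_df['affinity'].notna()

# rows with affinity first, then best (lowest) resolution first;
# na_position='last' pushes rows with missing resolution to the end
toy_sorted = toy_df.sort_values(
    by=['has_affinity', 'resolution'],
    ascending=[False, True],
    na_position='last',
)

# keep='first' leaves the first row of every (Hcluster, Lcluster) pair unmarked
toy_sorted['duplicated'] = toy_sorted.duplicated(
    subset=['Hcluster', 'Lcluster'], keep='first'
)
toy_dedup = toy_sorted[~toy_sorted['duplicated']]
print(toy_dedup['pbs_id'].tolist())  # ['c', 'd']
```

Entries a, b and c share both clusters. Entry a has the best resolution but no affinity data, so after sorting the kept representative is c, the entry with affinity data and the better resolution of the remaining two.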