# Isolating target clusters and selecting best representatives

This notebook is an example of how to select representatives given a set of protein clusters of interest.

**Required:** Run the [**run pipeline**](./run_pipeline.ipynb) notebook first to make sure the required outputs are present. Also run **Step 1** from the [**hierarchical cluster plot**](./hierarchical_cluster_plot.ipynb) notebook, because this notebook reuses some of its outputs.

## Perform some exploration to select clusters

In [1]:
import pandas as pd
import pickle
import proteinclustertools.visuals.circle_plot as cp
import proteinclustertools.visuals.annotate as an
from matplotlib.colors import ListedColormap

Read the files produced by the previous visualization.

We will filter the data using multiple cluster levels as an example, but if you already have a particular level of interest, you can pick representatives directly from it.

In [2]:
clusters=pd.read_csv("layouts/mmseqs_137-196-303.csv", dtype=str)
clusters.head()

Unnamed: 0,id,137.0,196.0,303.0
0,0,0,0,7
1,1,0,0,187
2,10,0,0,3
3,100,0,0,3
4,1000,0,0,260


Load some annotations to use as selection criteria.

In [3]:
annot_table=pd.read_csv("../data/IPR001761_with_taxonomy.tsv", sep="\t")
annot_table.head()

Unnamed: 0,Entry,Reviewed,Length,Organism (ID),PDB,Superkingdom,Phylum
0,G3XD97,reviewed,340,208964,,Bacteria,Pseudomonadota
1,P02924,reviewed,329,83333,1ABE;1ABF;1APB;1BAP;2WRZ;5ABP;6ABP;7ABP;8ABP;9...,Bacteria,Pseudomonadota
2,P0ACP1,reviewed,334,83333,1UXC;1UXD;2IKS;,Bacteria,Pseudomonadota
3,Q88HH7,reviewed,339,160488,,Bacteria,Pseudomonadota
4,Q9KM69,reviewed,326,243277,,Bacteria,Pseudomonadota


As before, the numeric sequence ids need to be mapped back to their original identifiers.

In [4]:
header_map=pd.read_csv("../output/IPR001761_header_map.txt", dtype=str)
header_map['Entry']=header_map['header'].str.split("|").str[1]
header_map.head()

Unnamed: 0,id,header,Entry
0,0,sp|Q88HH7|PTXS_PSEPK,Q88HH7
1,1,tr|A0A0M1P0M0|A0A0M1P0M0_9BACL,A0A0M1P0M0
2,2,tr|A0A100YV14|A0A100YV14_TRASO,A0A100YV14
3,3,tr|A0A1C7IBZ4|A0A1C7IBZ4_9FIRM,A0A1C7IBZ4
4,4,tr|A0A1R4KFK1|A0A1R4KFK1_9LACT,A0A1R4KFK1
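
The `Entry` column above is taken from the second `|`-delimited field of each header (the `db|accession|name` layout visible in the output). As a small illustration, the sketch below runs the same split on a toy table. The third row is a hypothetical malformed header, included only to show that a header without pipes yields a missing value, which is worth checking for before merging.

```python
import pandas as pd

# Toy header map; the last row is a hypothetical malformed header.
toy_map = pd.DataFrame({
    "id": ["0", "1", "2"],
    "header": [
        "sp|Q88HH7|PTXS_PSEPK",
        "tr|A0A0M1P0M0|A0A0M1P0M0_9BACL",
        "no_pipes_here",
    ],
})

# The second '|'-delimited field is the accession.
# Rows with fewer than two fields end up as NaN.
toy_map["Entry"] = toy_map["header"].str.split("|").str[1]

# Count headers that could not be parsed.
n_unparsed = int(toy_map["Entry"].isna().sum())
print(toy_map)
print("unparsed headers:", n_unparsed)
```

If the count is non-zero on real data, those rows would drop out of any later merge on `Entry`.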


In [5]:
clusters=clusters.merge(header_map[['id','Entry']], on="id")
clusters.head()

Unnamed: 0,id,137.0,196.0,303.0,Entry
0,0,0,0,7,Q88HH7
1,1,0,0,187,A0A0M1P0M0
2,10,0,0,3,A0A485EFL7
3,100,0,0,3,A0A1G9ETG5
4,1000,0,0,260,A0A943EYG6


Now the sequence-level annotations can be matched directly to the clusters through this table.
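
As a rough sketch of what that matching can look like, the example below joins a toy annotation table to a toy cluster table on `Entry` and counts phyla per cluster. All values and the names `toy_clusters` and `toy_annot` are made up for illustration; only the column names follow the tables above.

```python
import pandas as pd

# Toy stand-ins for `clusters` and `annot_table` (values are made up).
toy_clusters = pd.DataFrame({
    "id": ["0", "1", "2", "3"],
    "303.0": ["7", "7", "3", "3"],
    "Entry": ["A1", "A2", "A3", "A4"],
})
toy_annot = pd.DataFrame({
    "Entry": ["A1", "A2", "A3"],
    "Phylum": ["Pseudomonadota", "Bacillota", "Bacillota"],
})

# A left join keeps every clustered sequence, annotated or not.
# validate raises if an Entry appears more than once in the annotations.
annotated = toy_clusters.merge(
    toy_annot, on="Entry", how="left", validate="many_to_one"
)

# Count phyla per cluster, ignoring sequences without an annotation.
phylum_counts = (
    annotated.dropna(subset=["Phylum"])
    .groupby(["303.0", "Phylum"])
    .size()
)
print(phylum_counts)
```

A left join is used here so that unannotated sequences stay visible instead of silently disappearing from the cluster table.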

Let's remake the plot to see the sequence space.

In [6]:
# load layout
with open("layouts/mmseqs_137-196-303.pkl", "rb") as handle:
    layout=pickle.load(handle)

levels=clusters.columns[1:4]
# create annotations of taxonomy and review status
phylum=an.AnnotateClusters(clusters, levels, annot_table, "Phylum", "Entry", "Entry")
phylum_colors=an.ColorAnnot(phylum, cmap="Set3", top_n=10, saturation=.7, shuffle_colors_seed=1)

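# flag reviewed entries and keep only the cluster rows that belong to them
is_reviewed=annot_table['Reviewed']=='reviewed'
reviewed_only=clusters[clusters['Entry'].isin(annot_table[is_reviewed]['Entry'])]
# annotate reviewed sequences at the highest cut-off only; these are passed as outlines in the plot
reviewed=an.AnnotateClusters(reviewed_only, [levels[-1]], annot_table, "Reviewed", "Entry", "Entry")
reviewed_colors=an.ColorAnnot(reviewed, cmap=ListedColormap(['red']), saturation=.7)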

In [8]:
plot, plot_data=cp.CirclePlot(layout, size=800,
                            annot_colors=phylum_colors,
                            outlines=reviewed_colors,
                            highlight_line_width=2
                            )

Suppose we are only interested in the main group of related sequences (the large base cluster) and not in the outliers.

We can keep just the sequences in that cluster (cluster 0 at the cut-off of 137).

In [32]:
non_outlier=clusters[clusters['137.0']=='0']
print(non_outlier.shape)
non_outlier.head()

(27730, 5)


Unnamed: 0,id,137.0,196.0,303.0,Entry
0,0,0,0,7,Q88HH7
1,1,0,0,187,A0A0M1P0M0
2,10,0,0,3,A0A485EFL7
3,100,0,0,3,A0A1G9ETG5
4,1000,0,0,260,A0A943EYG6


Say we want to sample only clusters that contain no reviewed sequences, taking up to 48 of the largest ones. The big cluster (~28k sequences) is too large to pick representatives from directly, so we subdivide it by moving to a higher cut-off. Here we use the highest cut-off in the plot (303), which appears to contain at least 50 clusters.

In [33]:
reviewed_clusters=non_outlier[non_outlier['Entry'].isin(annot_table[is_reviewed]['Entry'])]['303.0'].unique()
non_reviewed=non_outlier[~non_outlier['303.0'].isin(reviewed_clusters)]
print(non_reviewed.shape)
non_reviewed.head()

(21945, 5)


Unnamed: 0,id,137.0,196.0,303.0,Entry
1,1,0,0,187,A0A0M1P0M0
4,1000,0,0,260,A0A943EYG6
5,10000,0,0,25,A0A246DUJ7
6,10001,0,0,42,A0A246DXA1
7,10002,0,0,159,A0A246E7G0


In [36]:
# rank by size
cluster_sizes=non_reviewed.groupby('303.0').size().sort_values(ascending=False)
print(cluster_sizes.head())
# select top 48
top_clusters=cluster_sizes.index[:48]

303.0
1    1986
2    1818
5     994
6     984
9     624
dtype: int64
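
The filter-then-rank logic in the two cells above can be checked on a toy table. This is a minimal sketch with made-up cluster labels and entries; `n_pick` stands in for the 48 used above.

```python
import pandas as pd

# Toy data: cluster "a" contains one reviewed sequence, "b" and "c" contain none.
toy = pd.DataFrame({
    "cluster": ["a", "a", "b", "b", "b", "c"],
    "Entry": ["E1", "E2", "E3", "E4", "E5", "E6"],
})
reviewed_entries = {"E1"}

# A cluster with at least one reviewed member is excluded as a whole.
has_reviewed = toy.loc[toy["Entry"].isin(reviewed_entries), "cluster"].unique()
unreviewed = toy[~toy["cluster"].isin(has_reviewed)]

# Rank the remaining clusters by size and keep at most n_pick of them.
n_pick = 2
sizes = unreviewed.groupby("cluster").size().sort_values(ascending=False)
picked = sizes.index[:n_pick].tolist()
print(picked)
```

Slicing with `[:n_pick]` returns fewer clusters without raising an error when fewer are available, so comparing `len(picked)` with `n_pick` is a cheap sanity check.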


## Pick cluster representatives

Given a list of cluster IDs, we can pick representatives.

Because the sequences are tracked by numeric id, first prepare a dataframe that the representative selection code will use to map the ids back. It only needs two columns: `id` (the number) and `header` (the desired output header).

In [47]:
remap_df=header_map.copy()
remap_df.drop(columns=['header'], inplace=True)
remap_df.rename(columns={'Entry':'header'}, inplace=True)
remap_df.head()

Unnamed: 0,id,header
0,0,Q88HH7
1,1,A0A0M1P0M0
2,2,A0A100YV14
3,3,A0A1C7IBZ4
4,4,A0A1R4KFK1


#### HMM-based method

In [48]:
import proteinclustertools.tools.hmm_cluster_rep as hmm_rep

# this creates a zip file with all the results
fasta_file='../output/IPR001761_cleaned.fasta'
outfile='representatives/303.0_top48_hmm.zip'
# pass the remap table prepared above so the outputs use the Entry accessions as headers
hmm_rep.ParseClusterLevel(clusters, '303.0', fasta_file, top_clusters, outfile, id_map=remap_df)



This creates a zip file containing three types of information:

1. `cluster_top_hits.csv` tracks which cluster each selected representative comes from.
2. `cluster_top_hits.fasta` contains all cluster representatives in a single fasta file.
3. For every cluster analyzed, a folder is created that contains the intermediate steps of the analysis:  
    a. The MSA of all sequences in the cluster  
    b. The HMM created from the MSA  
    c. The results of searching all sequences with the HMM. You can work down this list to find other hits (e.g., the second or third best).
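
The archive can be inspected without unpacking it to disk. The sketch below builds a tiny in-memory zip that mimics the layout described above and reads it back with the standard library. The two top-level file names come from the list above; the CSV column names, the per-cluster folder and file name, and all values are hypothetical placeholders.

```python
import csv
import io
import zipfile

# Build a tiny in-memory zip that mimics the described layout.
# Column names, the "1/msa.fasta" path, and all values are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cluster_top_hits.csv", "cluster,top_hit\n1,repA\n2,repB\n")
    zf.writestr("cluster_top_hits.fasta", ">repA\nMKT\n>repB\nMSA\n")
    zf.writestr("1/msa.fasta", ">repA\nMKT\n")

# Read the archive back without extracting it to disk.
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    with zf.open("cluster_top_hits.csv") as handle:
        text = io.TextIOWrapper(handle, encoding="utf-8")
        rows = list(csv.DictReader(text))

print(names)
print(rows)
```

For the real output, `zipfile.ZipFile(outfile)` would replace the in-memory buffer, and `namelist()` shows the actual folder and file names.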

#### Vector-based method

In [69]:
import proteinclustertools.tools.vector_cluster_rep as vec_rep
import pickle

# need to load vector embeddings
with open("../output/IPR001761_embeddings.pkl", "rb") as handle:
    embeddings=pickle.load(handle)
vreps=vec_rep.ClusterVectorRepresentative(clusters, '303.0', embeddings, top_clusters)

In [70]:
vreps.head()

Unnamed: 0,cluster,top_hit
0,1,27021
1,11,2922
2,12,12237
3,13,13450
4,14,16768
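
How `ClusterVectorRepresentative` chooses its top hit is not described in this notebook. For intuition only, one common way to pick a vector-based representative is to take the member whose embedding lies closest to the cluster centroid. The sketch below illustrates that idea on made-up 2D vectors; the function `closest_to_centroid` and the toy data structures are assumptions for illustration and may not match what the library does or how the embeddings pickle is organized.

```python
import numpy as np

def closest_to_centroid(vectors, ids):
    """Return the id whose vector is nearest (Euclidean) to the mean vector."""
    matrix = np.asarray(vectors, dtype=float)
    centroid = matrix.mean(axis=0)
    distances = np.linalg.norm(matrix - centroid, axis=1)
    return ids[int(np.argmin(distances))]

# Toy embeddings keyed by sequence id (made-up values).
toy_embeddings = {
    "10": [0.0, 0.0],
    "11": [1.0, 1.0],
    "12": [2.0, 2.0],
    "13": [9.0, 9.0],
}
# Toy cluster membership: cluster label -> member ids.
toy_members = {"A": ["10", "11", "12"], "B": ["13"]}

toy_reps = {
    cluster: closest_to_centroid([toy_embeddings[i] for i in members], members)
    for cluster, members in toy_members.items()
}
print(toy_reps)
```

A centroid-nearest pick always returns an actual member of the cluster, which is what makes it usable as a representative sequence.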


We can save the results and convert the numeric sequence ids back to their original identifiers.

In [73]:
id_map=remap_df.set_index('id')['header'].to_dict()
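
Two things are worth checking when turning a remap table into a dictionary: duplicate ids silently overwrite each other, and `Series.map` returns NaN for any key missing from the dictionary. A minimal sketch on a toy table (the names `toy_remap` and `toy_hits` and all values are made up):

```python
import pandas as pd

# Toy remap table with the same two columns as `remap_df` (values are made up).
toy_remap = pd.DataFrame({"id": ["0", "1", "2"], "header": ["P1", "P2", "P3"]})

# Duplicate ids would silently overwrite each other in a dict, so check first.
assert toy_remap["id"].is_unique

toy_id_map = toy_remap.set_index("id")["header"].to_dict()

# Series.map leaves NaN wherever a key is absent from the dictionary.
toy_hits = pd.Series(["2", "0", "99"])
mapped = toy_hits.map(toy_id_map)
missing = toy_hits[mapped.isna()].tolist()
print(mapped.tolist(), missing)
```

An empty `missing` list on real data confirms that every picked id has a header to map to.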

Collect the sequences from the cleaned fasta file.

In [77]:
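import os
from Bio import SeqIO

outpath='representatives/vector_picks/'
os.makedirs(outpath, exist_ok=True)  # make sure the output folder exists

fasta='../output/IPR001761_cleaned.fasta'
# keep only the records picked as representatives (ids are still numeric here)
picked_ids=set(vreps['top_hit'])
seqs=[rec for rec in SeqIO.parse(fasta, "fasta") if rec.id in picked_ids]
# convert headers
for seq in seqs:
    seq.id=id_map[seq.id]
    seq.description=''

# write to file
SeqIO.write(seqs, outpath+'top_vector_picks.fasta', "fasta")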

48

In [67]:
vreps['top_hit']=vreps['top_hit'].map(id_map)
print(vreps.shape)
vreps.head()

(48, 2)


Unnamed: 0,cluster,top_hit
0,1,A0A7L4TJ50
1,11,A0A1B3XQA8
2,12,A0A1H0QTR6
3,13,A0A7I0K8F0
4,14,A0A6P2WNZ8


In [68]:
# save cluster-representative mapping
vreps.to_csv('representatives/vector_picks/vector_pick_clusters.csv', index=False)