# Analyzing Foldseek clusters (and comparing them to DALI)

Foldseek offers all-against-all search and clustering, similar to what we've experimented with using DALI (with various Python packages on top). The clusters are written to an easy-to-parse `.tsv` file, so let's analyze it and compare these clusters to the DALI ones.

First, let's load the cluster `.tsv` file and check how many clusters we have:

In [3]:
from collections import defaultdict

cluster_dict = defaultdict(list)

with open("../foldseek/clusters.tsv", "r") as cluster_file:
    for line in cluster_file:
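        # Each row holds a cluster representative followed by one member
        representative, member = line.split()[:2]

        # Strip the trailing suffix (e.g. _ptm.pdb) to keep only the MGYP accession
        cluster_name = representative.split("_")[0].split(".")[0]
        cluster_member = member.split("_")[0].split(".")[0]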

        cluster_dict[cluster_name].append(cluster_member)
    
for cluster, members in cluster_dict.items():
    print(f"{cluster}: {len(members)}")

print(len(cluster_dict))

MGYP000131233602: 2
MGYP000248784877: 4
MGYP000275899355: 1
MGYP000345353568: 62
MGYP000352468595: 2
MGYP000355333620: 1
MGYP000550574731: 1
MGYP000572237864: 1
MGYP000614909350: 1
MGYP000625709158: 1
MGYP000688619306: 34
MGYP000703306994: 1
MGYP000708048110: 3
MGYP000744648862: 1
MGYP001001071371: 1
MGYP001190945602: 1
MGYP001191932810: 15
MGYP001230091200: 1
MGYP001245027200: 7
MGYP001301876493: 23
MGYP001353263834: 48
MGYP001376145772: 2
MGYP001409318126: 5
MGYP001460590038: 1
MGYP001481878659: 426
MGYP001508311700: 1
MGYP001517785574: 52
MGYP001563887314: 1
MGYP001595317607: 49
MGYP001660711218: 3
MGYP001770665372: 1
MGYP001770687550: 1
MGYP001772626497: 32
MGYP001793225276: 1
MGYP001809693053: 1
MGYP001975743896: 1
MGYP003082420564: 3
MGYP003088767509: 1
MGYP003108735519: 1
MGYP003109044713: 2
MGYP003110009043: 1
MGYP003131550817: 1
MGYP003134644496: 31
MGYP003148598028: 5
MGYP003203574958: 1
MGYP003226455751: 3
MGYP003230362339: 1
MGYP003289700650: 1
MGYP003290089060: 3
MGYP00336

Looks like we have a lot of clusters, many of which are singletons (clusters with only a single member).
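
To put a number on "many", one option is to tally the cluster sizes directly. The sketch below is a minimal illustration, not part of the original analysis: `summarize_cluster_sizes` and `toy_clusters` are hypothetical names, and the toy dictionary stands in for `cluster_dict` so the snippet runs on its own.

```python
from collections import Counter


def summarize_cluster_sizes(clusters):
    """Return (number of clusters, number of singletons, total structures)."""
    sizes = [len(members) for members in clusters.values()]
    size_counts = Counter(sizes)  # maps cluster size -> how many clusters have it
    return len(sizes), size_counts[1], sum(sizes)


# Toy stand-in for cluster_dict (hypothetical IDs, not real data)
toy_clusters = {"A": ["A"], "B": ["B", "C", "D"], "E": ["E"]}

n_clusters, n_singletons, n_structures = summarize_cluster_sizes(toy_clusters)
print(f"{n_singletons} of {n_clusters} clusters are singletons "
      f"({n_structures} structures in total)")
```

Calling `summarize_cluster_sizes(cluster_dict)` in the notebook should give the same three figures for the real Foldseek clusters.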

Let's plot this data to explore this visually:

In [13]:
import plotly.express as px

lengths = sorted([len(cluster) for cluster in cluster_dict.values()], reverse=True)

fig = px.bar(y=lengths, labels={"x": "Cluster", "y": "Number of Structures"})
fig.update_layout(template="plotly_white")
fig.show()
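
A sorted bar chart like this shows how skewed the size distribution is, and a single number can complement it: how many of the largest clusters are needed to cover a given share of all structures. The following is a rough sketch under stated assumptions; `clusters_needed_for_coverage` is a hypothetical helper and the sizes in the example are made up for illustration.

```python
def clusters_needed_for_coverage(sizes, fraction=0.9):
    """Count how many of the largest clusters cover `fraction` of all structures."""
    total = sum(sizes)
    if total == 0:
        return 0

    covered = 0
    # Walk through clusters from largest to smallest, accumulating members
    for n, size in enumerate(sorted(sizes, reverse=True), start=1):
        covered += size
        if covered >= fraction * total:
            return n
    return len(sizes)


# Made-up sizes for illustration; in the notebook, pass `lengths` instead
example_sizes = [6, 3, 1]
print(clusters_needed_for_coverage(example_sizes, fraction=0.5))
```

A small result relative to the total number of clusters would suggest that a handful of large clusters dominate the dataset, with a long tail of small ones.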

Now, let's make a plot of protein length versus isoelectric point, coloured by cluster - this is what we've done with previous clustering experiments to show that clusters correspond to meaningful physical properties of proteins.

In [18]:
import pandas as pd

records = [{"MGYP": mgyp, "Cluster": cluster} for cluster, structures in cluster_dict.items() for mgyp in structures]

947
