# Analyzing Foldseek clusters (and comparing them to DALI)

Foldseek has all-against-all search and clustering functionality similar to what we've experimented with using DALI (with various Python packages on top). The clusters are written to an easy-to-parse `.tsv` file, so let's analyze it and compare these clusters to the DALI ones.

First, let's load the cluster `.tsv` file and check how many clusters we have:

In [35]:
from collections import defaultdict

cluster_dict = defaultdict(list)

with open("../foldseek/clusters.tsv", "r") as cluster_file:
    for line in cluster_file:
        line = line.rstrip().split()
        
        # Strip the trailing _ptm.pdb so only the MGYP ID remains
        cluster_name = line[0].split("_")[0].split(".")[0]
        cluster_member = line[1].split("_")[0].split(".")[0]

        cluster_dict[cluster_name].append(cluster_member)

# Name the clusters by a running number rather than by MGYP ID

cluster_dict = {str(i): members for i, members in enumerate(cluster_dict.values())}

for cluster, members in cluster_dict.items():
    print(f"{cluster}: {len(members)}")

print(len(cluster_dict))

0: 2
1: 4
2: 1
3: 62
4: 2
5: 1
6: 1
7: 1
8: 1
9: 1
10: 34
11: 1
12: 3
13: 1
14: 1
15: 1
16: 15
17: 1
18: 7
19: 23
20: 48
21: 2
22: 5
23: 1
24: 426
25: 1
26: 52
27: 1
28: 49
29: 3
30: 1
31: 1
32: 32
33: 1
34: 1
35: 1
36: 3
37: 1
38: 1
39: 2
40: 1
41: 1
42: 31
43: 5
44: 1
45: 3
46: 1
47: 1
48: 3
49: 2
50: 1
51: 4
52: 1
53: 1
54: 9
55: 75
56: 1
57: 1
58: 1
59: 1
60: 1
61: 7
62


We have 62 clusters, but more than half of them (34) are singletons, i.e. they contain only a single member. At the other extreme, the largest cluster (24) holds 426 structures.
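
To avoid counting singletons by eye, the tally can be computed directly from the cluster dictionary. The following is a minimal sketch: the helper name `summarise_cluster_sizes` and the toy dictionary are made up for illustration, and in the notebook you would pass the real `cluster_dict` instead.

```python
def summarise_cluster_sizes(cluster_dict):
    """Return (number of clusters, number of singletons, size of largest cluster)."""
    sizes = [len(members) for members in cluster_dict.values()]
    n_singletons = sum(1 for size in sizes if size == 1)
    return len(sizes), n_singletons, max(sizes, default=0)


# Toy stand-in for cluster_dict; the member IDs are placeholders, not real MGYPs.
toy_clusters = {
    "0": ["A", "B"],
    "1": ["C"],
    "2": ["D", "E", "F"],
    "3": ["G"],
}

n_clusters, n_singletons, largest = summarise_cluster_sizes(toy_clusters)
print(f"{n_singletons} of {n_clusters} clusters are singletons; largest has {largest} members")
```

Calling `summarise_cluster_sizes(cluster_dict)` on the real data should reproduce the numbers read off the printed list above.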

Let's plot this data to explore this visually:

In [36]:
import plotly.express as px

lengths = sorted([len(cluster) for cluster in cluster_dict.values()], reverse=True)

fig = px.bar(y=lengths, labels={"x": "Cluster", "y": "Number of Structures"})
fig.update_layout(template="plotly_white")
fig.show()

Now, let's make a plot of protein length versus isoelectric point, coloured by cluster - this is what we've done with previous clustering experiments to show that clusters correspond to meaningful physical properties of proteins.

First, let's make a DataFrame of our Foldseek clusters:

In [37]:
import pandas as pd

records = [{"MGYP": mgyp, "Cluster": cluster} for cluster, structures in cluster_dict.items() for mgyp in structures]
foldseek_df = pd.DataFrame(records)
foldseek_df

| | MGYP | Cluster |
|---|---|---|
| 0 | MGYP000131233602 | 0 |
| 1 | MGYP003624382026 | 0 |
| 2 | MGYP000248784877 | 1 |
| 3 | MGYP000189593907 | 1 |
| 4 | MGYP003361717447 | 1 |
| ... | ... | ... |
| 942 | MGYP000098784662 | 61 |
| 943 | MGYP003145882433 | 61 |
| 944 | MGYP003204103103 | 61 |
| 945 | MGYP003237769608 | 61 |


Now, let's load our DALI cluster DataFrame and merge it into the Foldseek DataFrame:

In [38]:
dali_df = pd.read_csv("../metadata/DALI_cluster_table_2.csv").rename(columns={"Cluster": "DALI Cluster"}).loc[:, ["DALI Cluster", "MGYP", "Length", "mW", "pI"]]

foldseek_df = foldseek_df.merge(dali_df, how="left", on="MGYP").rename(columns={"Cluster": "Foldseek Cluster"})
foldseek_df

| | MGYP | Foldseek Cluster | DALI Cluster | Length | mW | pI |
|---|---|---|---|---|---|---|
| 0 | MGYP000131233602 | 0 | Dissimilar | 367 | 39.706791 | 4.718000 |
| 1 | MGYP003624382026 | 0 | Dissimilar | 409 | 45.119116 | 9.324723 |
| 2 | MGYP000248784877 | 1 | Dissimilar | 283 | 32.198028 | 5.408196 |
| 3 | MGYP000189593907 | 1 | Dissimilar | 272 | 31.417796 | 5.413937 |
| 4 | MGYP003361717447 | 1 | Dissimilar | 328 | 36.976115 | 5.252855 |
| ... | ... | ... | ... | ... | ... | ... |
| 942 | MGYP000098784662 | 61 | Dissimilar | 369 | 41.264702 | 5.272806 |
| 943 | MGYP003145882433 | 61 | 7 | 335 | 34.875598 | 5.409731 |
| 944 | MGYP003204103103 | 61 | 13 | 368 | 41.434656 | 5.777365 |
| 945 | MGYP003237769608 | 61 | 13 | 368 | 41.378593 | 6.097255 |


Let's plot the DALI Clusters as before:

In [44]:
colours = px.colors.qualitative.Alphabet_r
fig = px.scatter(dali_df[dali_df["DALI Cluster"] != "Dissimilar"], x="Length", y="pI", color="DALI Cluster", color_discrete_sequence=colours, labels=dict(Length="Length (Amino Acids)", pI="Isoelectric Point"))

fig.update_layout(template="plotly_white", width=1400, height=700, font=dict(size=20))
fig.update_xaxes(range=[200, 600])
fig.write_image("DALI_clusters.png")
fig.show()

And now the Foldseek clusters:

In [42]:
colours = px.colors.qualitative.Alphabet_r
fig = px.scatter(foldseek_df, x="Length", y="pI", color="Foldseek Cluster", color_discrete_sequence=colours, labels=dict(Length="Length (Amino Acids)", pI="Isoelectric Point"))

fig.update_layout(template="plotly_white", width=1400, height=700, font=dict(size=20))
fig.update_xaxes(range=[200, 600])
fig.write_image("Foldseek_clusters.png")
fig.show()

The clusters look mostly the same in both analyses. Most importantly, DALI clusters 6, 8, and 9, which are the ones we've focused on in the paper, are preserved, so there's no need to worry here!
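
The visual comparison could be backed up with a quick tally of where each DALI cluster's members end up. The sketch below is one possible way to do this, not the analysis used for the paper: the function name `dominant_foldseek_cluster` and the toy pairs are invented for illustration, and it assumes the `"Dissimilar"` label marks structures without a DALI cluster, as in the merged table.

```python
from collections import Counter, defaultdict


def dominant_foldseek_cluster(pairs, skip_label="Dissimilar"):
    """For each DALI cluster, find the Foldseek cluster holding most of its members.

    pairs: iterable of (foldseek_cluster, dali_cluster) tuples, one per structure.
    Returns {dali_cluster: (foldseek_cluster, fraction_of_members)}.
    """
    counts = defaultdict(Counter)
    for foldseek, dali in pairs:
        if dali == skip_label:
            continue  # structures without a DALI cluster are ignored
        counts[dali][foldseek] += 1

    result = {}
    for dali, counter in counts.items():
        top_cluster, top_count = counter.most_common(1)[0]
        result[dali] = (top_cluster, top_count / sum(counter.values()))
    return result


# Toy (Foldseek, DALI) assignments; these are not taken from the real data.
toy_pairs = [
    ("3", "6"), ("3", "6"), ("3", "6"), ("10", "6"),
    ("24", "8"), ("24", "8"),
    ("0", "Dissimilar"),
]

for dali, (foldseek, fraction) in dominant_foldseek_cluster(toy_pairs).items():
    print(f"DALI {dali}: mostly Foldseek {foldseek} ({fraction:.0%})")
```

With the merged DataFrame, the pairs could be built with `zip(foldseek_df["Foldseek Cluster"], foldseek_df["DALI Cluster"])`, after dropping any rows where the left merge found no DALI entry. A fraction close to 1 means a DALI cluster sits almost entirely inside one Foldseek cluster. Note that this only checks one direction: a single Foldseek cluster may still absorb several DALI clusters.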