# Analyzing Predicted Structure Clusters

The previous notebooks `DALI_data_analysis.ipynb` and `esm_if_representations_clustering_experiments.ipynb` explored and explained the data processing and clustering procedures we used for the predicted structures.

Let's take a look at the DALI structure clusters and see whether they encode any meaningful patterns. First, let's plot the clusters again:

In [6]:
import pandas as pd
import plotly.express as px

cluster_df = pd.read_csv("../metadata/DALI_cluster_table_2.csv")

colours = px.colors.qualitative.Dark24
fig = px.scatter(cluster_df, x="Length", y="pI", color="Cluster", color_discrete_sequence=colours)

fig.update_layout(width=1400, height=700, template="plotly_white")
fig.show()

Now, let's add information about the encapsulins that have annotated cargo types:

In [7]:
family_df = pd.read_csv("../encapsulin_families.csv").rename(columns={"Encapsulin MGYP": "MGYP"})
family_df.head()

Unnamed: 0,MGYP,Cargo Description,Cargo Search Method
0,MGYP000005098346,DyP Peroxidase,family1_clp_consensus
1,MGYP001553646702,DyP Peroxidase,all_cargo_clp_consensus
2,MGYP001551909951,DyP Peroxidase,all_cargo_clp_consensus
3,MGYP000907569931,DyP Peroxidase,family1_clp_consensus
4,MGYP003143336405,DyP Peroxidase,family1_clp_consensus


In [8]:
cluster_family_df = family_df.merge(cluster_df, on="MGYP")
cluster_family_df.head()

Unnamed: 0,MGYP,Cargo Description,Cargo Search Method,Cluster,Length,mW,pI,T_maritima_T1,M_xanthus_T3,S_elongatus_T1,Q_thermotolerans_T4,Closest Match,Longest Disordered Region (pLDDT)
0,MGYP000005098346,DyP Peroxidase,family1_clp_consensus,Dissimilar,272,29.237761,5.330952,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
1,MGYP001553646702,DyP Peroxidase,all_cargo_clp_consensus,8,274,28.91321,4.94655,0.721228,0.601504,0.351351,0.579345,T_maritima_T1,1.0
2,MGYP001551909951,DyP Peroxidase,all_cargo_clp_consensus,8,268,28.821247,5.313048,0.744246,0.598997,0.353808,0.574307,T_maritima_T1,3.0
3,MGYP000907569931,DyP Peroxidase,family1_clp_consensus,8,266,28.50943,4.662241,0.741688,0.593985,0.348894,0.556675,T_maritima_T1,9.0
4,MGYP003143336405,DyP Peroxidase,family1_clp_consensus,8,271,29.495891,4.823948,0.716113,0.598997,0.361179,0.571788,T_maritima_T1,1.0
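`DataFrame.merge` performs an inner join by default, so any annotated encapsulin without a row in the cluster table is silently dropped from `cluster_family_df`. The sketch below shows one way to audit that, using pandas' `indicator=True` option on an outer join. The two toy frames and their IDs are hypothetical stand-ins for `family_df` and `cluster_df`, not real data.

```python
import pandas as pd

# Hypothetical stand-ins for family_df and cluster_df (made-up IDs).
toy_family = pd.DataFrame({
    "MGYP": ["A", "B", "C"],
    "Cargo Description": ["Ferritin", "DyP Peroxidase", "Ferritin"],
})
toy_cluster = pd.DataFrame({
    "MGYP": ["A", "B", "D"],
    "Cluster": ["8", "Dissimilar", "3"],
})

# An outer join keeps every row from both tables; indicator=True adds a
# "_merge" column recording where each row came from.
audit = toy_family.merge(toy_cluster, on="MGYP", how="outer", indicator=True)

# "left_only" rows are annotated encapsulins with no cluster assignment.
merge_counts = audit["_merge"].value_counts()
print(merge_counts)
```

Running the same call on the real `family_df` and `cluster_df` would show how many annotated encapsulins, if any, are missing from the cluster table.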


In [9]:
import plotly.express as px

# Plot each annotated encapsulin's cargo type against its DALI cluster.
fig = px.scatter(cluster_family_df, x="Cargo Description", y="Cluster")

fig.update_layout(width=1400, height=700, template="plotly_white")
fig.show()
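In a scatter plot of two categorical columns, points that share the same cargo type and cluster are drawn on top of each other, so the plot shows which combinations occur but not how often. A contingency table is one possible complement. The sketch below uses `pd.crosstab` on a small hypothetical frame that has the same two column names as `cluster_family_df`; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows using the same column names as cluster_family_df.
toy = pd.DataFrame({
    "Cargo Description": ["DyP Peroxidase", "DyP Peroxidase",
                          "Ferritin", "DyP Peroxidase"],
    "Cluster": ["8", "8", "Dissimilar", "Dissimilar"],
})

# Count how many proteins fall into each (cargo type, cluster) pair.
counts = pd.crosstab(toy["Cargo Description"], toy["Cluster"])
print(counts)
```

On the real data, the equivalent call would be `pd.crosstab(cluster_family_df["Cargo Description"], cluster_family_df["Cluster"])`.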

In [10]:
cluster_family_df[cluster_family_df["Cluster"] == "Dissimilar"]

Unnamed: 0,MGYP,Cargo Description,Cargo Search Method,Cluster,Length,mW,pI,T_maritima_T1,M_xanthus_T3,S_elongatus_T1,Q_thermotolerans_T4,Closest Match,Longest Disordered Region (pLDDT)
0,MGYP000005098346,DyP Peroxidase,family1_clp_consensus,Dissimilar,272,29.237761,5.330952,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
9,MGYP000520525852,DyP Peroxidase,Manually curated (Pfam),Dissimilar,264,28.672676,4.750057,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
25,MGYP001508311700,Ferritin,Manually curated (Pfam),Dissimilar,259,27.621772,4.671108,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
26,MGYP000469965527,Ubiquinone biosynthesis protein COQ7,all_cargo_clp_consensus,Dissimilar,260,28.493308,4.898123,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
48,MGYP001235099818,Cysteine Desulfurase,HMM Search,Dissimilar,469,50.760759,6.415383,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
52,MGYP001245027200,Cysteine Desulfurase,HMM Search,Dissimilar,476,53.074208,5.464637,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
53,MGYP003670114767,Cysteine Desulfurase,HMM Search,Dissimilar,286,29.654232,5.077905,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,7.0
54,MGYP001414737684,Cysteine Desulfurase,HMM Search,Dissimilar,354,40.382773,8.625306,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,2.0
71,MGYP001365287079,Cysteine Desulfurase,Manually curated (Pfam),Dissimilar,367,41.04783,9.196754,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,6.0
78,MGYP000630911641,Polyprenyl Transferase,Manually curated (Pfam),Dissimilar,443,47.716359,5.288493,0.002558,0.070175,0.002457,0.002519,M_xanthus_T3,0.0


Inspecting the predicted structures shows that the two MGYPs below have an unusual E-loop containing a very large insertion domain. A BLAST search of the E-loop region returns hits at 80% and 40% identity, respectively, against family 2B encapsulins with cNMP-binding domains, which explains the insertions!

In [11]:
mgyps = ["MGYP001245027200", "MGYP001235099818"]
cluster_family_df.query("MGYP in @mgyps")

Unnamed: 0,MGYP,Cargo Description,Cargo Search Method,Cluster,Length,mW,pI,T_maritima_T1,M_xanthus_T3,S_elongatus_T1,Q_thermotolerans_T4,Closest Match,Longest Disordered Region (pLDDT)
48,MGYP001235099818,Cysteine Desulfurase,HMM Search,Dissimilar,469,50.760759,6.415383,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0
52,MGYP001245027200,Cysteine Desulfurase,HMM Search,Dissimilar,476,53.074208,5.464637,0.002558,0.002506,0.002457,0.002519,T_maritima_T1,0.0


## Cluster 6 Proteins

All Cluster 6 proteins contain an E-loop insertion of unknown function. Searching the sequences with BLAST gives hits against "Ig-like domain containing proteins" and "DUF5309 family proteins", neither of which is informative.

What are the biological functions of these encapsulins?

In [12]:
cluster_family_df[cluster_family_df["Cluster"] == "6"]

Unnamed: 0,MGYP,Cargo Description,Cargo Search Method,Cluster,Length,mW,pI,T_maritima_T1,M_xanthus_T3,S_elongatus_T1,Q_thermotolerans_T4,Closest Match,Longest Disordered Region (pLDDT)
91,MGYP001522116553,Polyprenyl Transferase,Manually curated (Pfam),6,360,40.009296,5.452815,0.342711,0.353383,0.388206,0.405542,Q_thermotolerans_T4,6.0
92,MGYP000631152961,Polyprenyl Transferase,Manually curated (Pfam),6,406,45.127086,5.402001,0.365729,0.373434,0.447174,0.43073,S_elongatus_T1,41.0
96,MGYP002732142711,Polyprenyl Transferase,HMM Search,6,419,46.076938,5.009073,0.365729,0.383459,0.44226,0.43073,S_elongatus_T1,42.0
98,MGYP000702098687,Xylulose Kinase,Manually curated (Pfam),6,423,46.135411,5.81846,0.322251,0.348371,0.385749,0.392947,Q_thermotolerans_T4,44.0


In [13]:
print(cluster_family_df["Cluster"].sort_values(ascending=True).unique())

['1' '10' '11' '16' '3' '4' '6' '7' '8' 'Dissimilar']
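The `Cluster` column holds strings (it mixes numeric labels with `Dissimilar`), so `sort_values` orders the labels lexicographically, which is why `'10'` comes before `'3'` above. If a numeric ordering is preferred, a custom sort key is one option. The sketch below is a minimal example; `cluster_sort_key` is a hypothetical helper name, and the labels are copied from the output above.

```python
# Cluster labels as printed above (all strings).
labels = ['1', '10', '11', '16', '3', '4', '6', '7', '8', 'Dissimilar']

def cluster_sort_key(label):
    """Sort numeric labels by integer value, then non-numeric labels by name."""
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)

ordered = sorted(labels, key=cluster_sort_key)
print(ordered)
# → ['1', '3', '4', '6', '7', '8', '10', '11', '16', 'Dissimilar']
```

The same key can be applied to the dataframe column with `sorted(cluster_family_df["Cluster"].unique(), key=cluster_sort_key)`.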
