## Visualizing Cargo Families and Types

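To visualize the cargo composition of each encapsulin family for the paper, I want to build an icicle plot from the annotated cargo table. First, count how many encapsulins carry each cargo type: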

In [1]:
import pandas as pd

family_df = pd.read_csv("../encapsulin_families.csv")
family_df["Cargo Description"].value_counts()

Cysteine Desulfurase                            40
Ferritin                                        39
Polyprenyl Transferase                          28
Rubrerythrin                                    27
DyP Peroxidase                                  25
Saccharide BGC                                  17
Saccharide BGC/Cysteine Desulfurase (Hybrid)     5
Manganese containing catalase                    4
Ubiquinone biosynthesis protein COQ7             3
DeoC                                             2
T3PKS-like BGC                                   1
arylpolyene BGC                                  1
Xylulose Kinase                                  1
NRPS-like BGC                                    1
NAGGN BGC                                        1
Alternative oxidase                              1
Amine oxidoreductase                             1
OsmC                                             1
Name: Cargo Description, dtype: int64

In [2]:
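# Count hybrid Saccharide BGC/Cysteine Desulfurase hits as Cysteine Desulfurase.
family_df = family_df.replace("Saccharide BGC/Cysteine Desulfurase (Hybrid)", "Cysteine Desulfurase")
# Drop the remaining BGC annotations; .copy() avoids a SettingWithCopyWarning
# when the "Family" column is added in a later cell.
family_df = family_df[~family_df["Cargo Description"].str.contains("BGC")].copy()
family_df["Cargo Description"].value_counts()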

Cysteine Desulfurase                    45
Ferritin                                39
Polyprenyl Transferase                  28
Rubrerythrin                            27
DyP Peroxidase                          25
Manganese containing catalase            4
Ubiquinone biosynthesis protein COQ7     3
DeoC                                     2
Alternative oxidase                      1
Amine oxidoreductase                     1
Xylulose Kinase                          1
OsmC                                     1
Name: Cargo Description, dtype: int64
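
The order of the two steps in the cell above matters: the hybrid label itself contains "BGC", so it has to be relabelled before the BGC filter runs. A minimal sketch on toy data (the rows below are hypothetical and not taken from `encapsulin_families.csv`; `drop_bgc` is a helper name introduced only for this illustration):

```python
import pandas as pd

# Toy data (hypothetical rows, not from the real CSV).
toy = pd.DataFrame({
    "Cargo Description": [
        "Ferritin",
        "Saccharide BGC",
        "Saccharide BGC/Cysteine Desulfurase (Hybrid)",
        "Cysteine Desulfurase",
    ]
})

HYBRID = "Saccharide BGC/Cysteine Desulfurase (Hybrid)"


def drop_bgc(df):
    """Keep only rows whose cargo label does not mention a BGC."""
    return df[~df["Cargo Description"].str.contains("BGC")]


# Relabel first, then filter: the hybrid row survives as Cysteine Desulfurase.
relabel_then_filter = drop_bgc(toy.replace(HYBRID, "Cysteine Desulfurase"))

# Filter first, then relabel: the hybrid row is removed before it can be relabelled.
filter_then_relabel = drop_bgc(toy).replace(HYBRID, "Cysteine Desulfurase")

print(len(relabel_then_filter), len(filter_then_relabel))
```

With the first ordering three rows remain; with the second only two do, because the hybrid row is silently lost.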

In [76]:
# Map each remaining cargo type to its encapsulin family.
family_dict = {
    "Cysteine Desulfurase": "Family 2",
    "Ferritin": "Family 1",
    "Polyprenyl Transferase": "Family 2",
    "Rubrerythrin": "Family 1",
    "DyP Peroxidase": "Family 1",
    "Manganese containing catalase": "Family 1",
    "Ubiquinone biosynthesis protein COQ7": "Family 1",
    "DeoC": "Family 4",
    "Alternative oxidase": "Family 1",
    "Amine oxidoreductase": "Family 1",
    "Xylulose Kinase": "Family 2",
    "OsmC": "Family 4",
}

def get_family(cargo):
    return family_dict[cargo]

family_df["Family"] = family_df["Cargo Description"].apply(get_family)
family_df = family_df.rename(columns={"Cargo Description": "Cargo"})

# Group the ferritin-like cargo types under a single label for plotting.
ferritin_like = [
    "Ferritin",
    "Rubrerythrin",
    "Manganese containing catalase",
    "Ubiquinone biosynthesis protein COQ7",
]
family_df = family_df.replace(ferritin_like, "Ferritin-like")
family_df

| | Encapsulin MGYP | Cargo | Cargo Search Method | Family |
|---|---|---|---|---|
| 0 | MGYP000005098346 | DyP Peroxidase | family1_clp_consensus | Family 1 |
| 1 | MGYP001553646702 | DyP Peroxidase | all_cargo_clp_consensus | Family 1 |
| 2 | MGYP001551909951 | DyP Peroxidase | all_cargo_clp_consensus | Family 1 |
| 3 | MGYP000907569931 | DyP Peroxidase | family1_clp_consensus | Family 1 |
| 4 | MGYP003143336405 | DyP Peroxidase | family1_clp_consensus | Family 1 |
| ... | ... | ... | ... | ... |
| 193 | MGYP003131404975 | Cysteine Desulfurase | DeepBGC | Family 2 |
| 194 | MGYP003636931262 | Cysteine Desulfurase | DeepBGC | Family 2 |
| 195 | MGYP003625121643 | DeoC | Manually curated (Pfam) | Family 4 |
| 196 | MGYP000377540521 | DeoC | Manually curated (Pfam) | Family 4 |
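
A plain dictionary lookup raises a `KeyError` as soon as the table contains a cargo label that has no family assigned. As a hedged alternative, `Series.map` leaves unmapped labels as `NaN`, which makes them easy to list before plotting. The sketch below uses a subset of the mapping and a made-up label (`"Unknown cargo"` is hypothetical and only serves to trigger the check):

```python
import pandas as pd

# Subset of the mapping used in this notebook.
family_dict = {
    "Ferritin": "Family 1",
    "Cysteine Desulfurase": "Family 2",
    "DeoC": "Family 4",
}

# Toy labels; "Unknown cargo" is a hypothetical label with no mapping.
cargo = pd.Series(["Ferritin", "DeoC", "Unknown cargo", "Cysteine Desulfurase"])

# Series.map returns NaN for keys missing from the dict instead of raising.
family = cargo.map(family_dict)

# Collect every label that did not receive a family.
unmapped = sorted(cargo[family.isna()].unique())
print(unmapped)  # ['Unknown cargo']
```

An empty list means every cargo label is covered by the mapping.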


In [78]:
# Count encapsulins per (Family, Cargo) pair for the icicle plot.
plot_df = (
    family_df.groupby(["Family", "Cargo"])
    .count()
    .reset_index()
    .rename(columns={"Encapsulin MGYP": "Count"})
    .loc[:, ["Family", "Cargo", "Count"]]
)
plot_df

| | Family | Cargo | Count |
|---|---|---|---|
| 0 | Family 1 | Alternative oxidase | 1 |
| 1 | Family 1 | Amine oxidoreductase | 1 |
| 2 | Family 1 | DyP Peroxidase | 25 |
| 3 | Family 1 | Ferritin-like | 73 |
| 4 | Family 2 | Cysteine Desulfurase | 45 |
| 5 | Family 2 | Polyprenyl Transferase | 28 |
| 6 | Family 2 | Xylulose Kinase | 1 |
| 7 | Family 4 | DeoC | 2 |
| 8 | Family 4 | OsmC | 1 |
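
The family-level boxes in the icicle plot are the sums of their cargo counts, so it can help to compute those totals (and each family's share) explicitly, for example to quote them in a figure caption. A small sketch that re-enters the counts from the table above by hand; the `summary` frame and its `Percent` column are names chosen for this illustration:

```python
import pandas as pd

# Counts copied from the grouped table above.
plot_df = pd.DataFrame({
    "Family": ["Family 1"] * 4 + ["Family 2"] * 3 + ["Family 4"] * 2,
    "Cargo": [
        "Alternative oxidase", "Amine oxidoreductase", "DyP Peroxidase", "Ferritin-like",
        "Cysteine Desulfurase", "Polyprenyl Transferase", "Xylulose Kinase",
        "DeoC", "OsmC",
    ],
    "Count": [1, 1, 25, 73, 45, 28, 1, 2, 1],
})

# Sum the cargo counts within each family.
family_totals = plot_df.groupby("Family")["Count"].sum()

# Express each family as a percentage of all annotated cargo proteins.
share = (100 * family_totals / family_totals.sum()).round(1)

summary = pd.DataFrame({"Count": family_totals, "Percent": share})
print(summary)
```

For these counts the totals are 100 (Family 1), 74 (Family 2) and 3 (Family 4), out of 177 annotated cargo proteins.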


In [79]:
import plotly.express as px

colours = px.colors.qualitative.Prism

fig = px.icicle(
    plot_df,
    path=[px.Constant("Annotated Cargo Proteins"), "Family", "Cargo"],
    values="Count",
    color_discrete_sequence=colours,
    branchvalues="total",
)
fig.update_traces(root_color="lightgrey", textinfo="label+value")
fig.update_layout(
    template="plotly_white",
    # Hide any label that does not fit in its box at 18 pt instead of shrinking it.
    uniformtext=dict(minsize=18, mode="hide"),
    margin=dict(t=50, l=25, r=25, b=25),
    width=1600,
    height=500,
)
fig.show()

In [80]:
# Static image export requires the kaleido package.
fig.write_image("../plots/family_icicle_plot.svg")
fig.write_image("../plots/family_icicle_plot.png")
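
Writing the files fails if the target directory does not exist yet. A minimal sketch of one way to guard against that, using only the standard library; `prepare_output_paths` is a hypothetical helper introduced here, and the demonstration writes into a temporary directory rather than `../plots`:

```python
import tempfile
from pathlib import Path


def prepare_output_paths(plot_dir, stem, formats=("svg", "png")):
    """Create plot_dir if needed and return one output path per format."""
    out_dir = Path(plot_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    return [out_dir / f"{stem}.{ext}" for ext in formats]


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = prepare_output_paths(Path(tmp) / "plots", "family_icicle_plot")
    names = [p.name for p in paths]
    dir_created = paths[0].parent.is_dir()

print(names)  # ['family_icicle_plot.svg', 'family_icicle_plot.png']
```

In the notebook, the returned paths could then be passed one by one to `fig.write_image(str(path))`.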