# Encapsulin Hits against UniRef90

We've searched our ≈1500 encapsulin hits against UniRef90. Let's see how many of them are truly novel, i.e. have no close match in UniRef90!

In [6]:
import pandas as pd

hits_df = pd.read_csv("../encapsulin_UniRef90_hits.tsv", sep="\t", names=["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
                                                              "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"])

hits_df.head()

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
0,MGYP001573638360,UniRef90_Q8YPL4 All4180 protein n=2 Tax=Nostoc...,0.981,272,5,0,4,275,148,419,2.633e-171,542
1,MGYP003681970701,UniRef90_A0A382GAI9 Uncharacterized protein (F...,0.759,361,87,0,1,361,1,361,9.715e-174,554
2,MGYP001775299270,UniRef90_UPI0020A1C726 HAMP domain-containing ...,0.615,445,166,4,26,465,19,463,4.263e-178,571
3,MGYP003640887170,UniRef90_E6QZK2 Uncharacterized protein n=4 Ta...,0.266,312,206,7,13,311,8,309,1.599e-25,120
4,MGYP003111022402,UniRef90_A0A0Q4UCX7 DksA C4-type domain-contai...,0.301,229,133,9,37,259,394,601,2.453e-09,70


How many encapsulins have any hits in UniRef90?

In [7]:
len(hits_df["Query"].unique())

1473

That's nearly all of them! But how many have a hit with sequence identity above 80%?

In [8]:
len(hits_df[hits_df["Identity"] > 0.8]["Query"].unique())

411
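
Counting queries with any hit above the threshold is the same as counting queries whose best hit is above it. Here is a minimal sketch of that equivalence on a toy table (the `toy_hits` frame and its values are made up for illustration, not taken from our data):

```python
import pandas as pd

# Toy stand-in for hits_df: several hits per query, values invented for illustration
toy_hits = pd.DataFrame({
    "Query": ["A", "A", "B", "B", "C"],
    "Identity": [0.95, 0.40, 0.70, 0.82, 0.30],
})

# Best (highest) identity observed for each query
best_identity = toy_hits.groupby("Query")["Identity"].max()

# Two equivalent ways of counting queries with a hit above 80% identity
n_above = int((best_identity > 0.8).sum())
n_filtered = toy_hits.loc[toy_hits["Identity"] > 0.8, "Query"].nunique()

# Fraction of queries (with at least one hit) that clear the threshold
fraction_above = n_above / best_identity.size
print(n_above, n_filtered, round(fraction_above, 2))
```

The per-query maximum is also what the binning step in the next section works with, so `best_identity` is a handy intermediate to keep around.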

## Visualizing Sequence Identity of UniRef90 Hits

A single threshold only tells part of the story, so let's visualize the sequence identities in a graph. First, let's make a new DataFrame that keeps the best (highest-identity) UniRef90 hit for each query and bins it by sequence identity:

In [17]:
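# Keep only the best (highest-identity) hit for each query
binned_df = hits_df.sort_values(by="Identity", ascending=False).drop_duplicates(subset="Query").loc[:, ["Query", "Identity"]]
# Bin identities into right-inclusive intervals, e.g. (0.15, 0.25] -> "0.2" and (0.95, 1] -> "1.0"
binned_df["Identity"] = pd.cut(binned_df["Identity"], [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1], labels=[str(i / 10) for i in range(2, 11)])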
binned_df["Identity"].value_counts()

0.3    380
1.0    225
0.6    173
0.4    152
0.5    149
0.9    131
0.7    120
0.8    110
0.2     33
Name: Identity, dtype: int64
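
To make the bin labels easier to read, here is a small sketch of how `pd.cut` treats individual values with the same edges and labels as above. The example identities are made up for illustration, apart from 0.981, which is the first row of `hits_df`:

```python
import pandas as pd

# Same bin edges and labels as in the cell above
edges = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1]
labels = [str(i / 10) for i in range(2, 11)]

# Example identities; 0.1 lies below the lowest edge on purpose
example = pd.Series([0.2, 0.25, 0.26, 0.981, 1.0, 0.1])
binned = pd.cut(example, edges, labels=labels)

# Intervals are right-inclusive, so 0.25 still lands in the "0.2" bin,
# everything in (0.95, 1] is labelled "1.0", and values outside the edges become NaN
print(pd.DataFrame({"Identity": example, "Bin": binned}))
```

Note that the "1.0" bar therefore covers best hits above 95% identity, not only exact matches, and any best hit at or below 15% identity would silently drop out of the counts.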

Now, let's add the encapsulins that have no UniRef90 hit at all and give them a "No Hit" label:

In [29]:
from Bio import SeqIO

uniref90_encapsulins = set(binned_df["Query"].unique())
all_encapsulins = {str(record.id).split()[0] for record in SeqIO.parse("../seqs/encapsulin_hits_filtered.fasta", "fasta")}
missing_encapsulins = all_encapsulins.difference(uniref90_encapsulins)

missing_encapsulins_df = pd.DataFrame([{"Query": mgyp, "Identity": "No Hit"} for mgyp in missing_encapsulins])
binned_df = pd.concat([binned_df, missing_encapsulins_df])
binned_df

Unnamed: 0,Query,Identity
1472,MGYP000451522494,1.0
1292,MGYP000324372200,1.0
1182,MGYP003108981179,1.0
39,MGYP000362831547,1.0
1149,MGYP001305861203,1.0
...,...,...
72,MGYP003108437609,No Hit
73,MGYP003290296536,No Hit
74,MGYP003109209238,No Hit
75,MGYP003118733393,No Hit


And finally, let's plot this data:

In [48]:
import plotly.express as px

fig = px.bar(binned_df.groupby("Identity").count().reset_index(), 
             x="Identity", y="Query", 
             color_discrete_sequence=["rgb(95, 70, 144)"],
             category_orders={"Identity": ["No Hit", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9", "1.0"]},
             labels={"Query": "Count"})

fig.update_layout(
    template="plotly_white",
    width=1400,
    height=700,
    font=dict(size=18),
    title="Sequence Identity of Best Hit in UniRef90",
)

fig.update_traces(marker_line_width=1, marker_line_color="white")

fig.write_image("../plots/fig1_identities.svg")
fig.show()
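
If a bin happens to be empty, a plain `groupby(...).count()` on a string column leaves it out of the table altogether. As an optional alternative, here is a sketch that builds the counts in an explicit order and keeps empty bins as zeros. The `toy_binned` frame is invented for illustration; the `order` list is the one passed to `category_orders` above:

```python
import pandas as pd

# Same category order as passed to category_orders in the plot above
order = ["No Hit", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9", "1.0"]

# Toy stand-in for binned_df, values invented for illustration
toy_binned = pd.DataFrame({
    "Query": ["A", "B", "C", "D"],
    "Identity": ["1.0", "0.3", "No Hit", "0.3"],
})

# Count each label, then reindex so every bin appears in order, with 0 for empty bins
counts = (
    toy_binned["Identity"]
    .astype(str)
    .value_counts()
    .reindex(order, fill_value=0)
    .rename_axis("Identity")
    .reset_index(name="Count")
)
print(counts)
```

A table like `counts` could be passed to `px.bar` with `x="Identity", y="Count"` in place of the grouped DataFrame.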

## Annotating Fusion Cargos

Let's check the hit descriptions for signs that our encapsulins are fused to other proteins. A fusion partner is a strong clue to the cargo function.

The code below follows the same methods as `notebooks/cargo_BLAST_nr_hits.ipynb`.

In [7]:
# Load annotated encapsulins
annotated_encapsulins = pd.read_csv("../encapsulin_families.csv")["Encapsulin MGYP"].unique()

# Remove any annotated encapsulins (note: this filters on Target, the UniRef90 hit name)
cargo_df = hits_df.query("Target not in @annotated_encapsulins")

In [11]:
# Search the hit descriptions for cargo keywords (str.contains is case-sensitive, hence both spellings)
for query in ["Rubrerythrin", "ferritin", "Ferritin", "peroxidase", "Peroxidase"]:

    cargos = cargo_df[cargo_df["Target"].str.contains(query)]
    cargos = cargos[cargos["Identity"] > 0.5]  # Only keep hits above 50% identity

    for cargo_mgyp in cargos["Query"].unique():
        print(f"{cargo_mgyp},{query}")

MGYP001772626497,Rubrerythrin
MGYP001772615312,Rubrerythrin
MGYP001772614567,Rubrerythrin
MGYP001772554294,Rubrerythrin
MGYP001772539121,Rubrerythrin
MGYP001626330305,Rubrerythrin
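
The keyword loop above lists each spelling separately. As a possible alternative, here is a sketch of a single case-insensitive search that also records which keyword matched. The `toy_cargo_df` frame and its target names are invented for illustration and are not real UniRef90 entries:

```python
import re
import pandas as pd

# Toy stand-in for cargo_df; the target names are invented for illustration
toy_cargo_df = pd.DataFrame({
    "Query": ["Q1", "Q2", "Q3", "Q4"],
    "Target": [
        "UniRef90_X1 Rubrerythrin n=1",
        "UniRef90_X2 bacterioferritin n=3",
        "UniRef90_X3 Dyp-type peroxidase n=2",
        "UniRef90_X4 Uncharacterized protein n=1",
    ],
    "Identity": [0.9, 0.7, 0.4, 0.8],
})

# One alternation pattern with a capture group, matched case-insensitively
keywords = ["rubrerythrin", "ferritin", "peroxidase"]
pattern = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# Record which keyword matched (lower-cased); non-matching rows get NaN
matched = toy_cargo_df["Target"].str.extract(pattern, flags=re.IGNORECASE)[0].str.lower()
matches = toy_cargo_df.assign(Keyword=matched)

# Keep rows that matched a keyword and are above 50% identity
matches = matches[matches["Keyword"].notna() & (matches["Identity"] > 0.5)]
result = matches[["Query", "Keyword"]].drop_duplicates()
print(result.to_csv(index=False, header=False))
```

As with the original loop, this is a substring match, so "ferritin" also catches names such as "bacterioferritin".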


In [9]:
cargos = cargo_df[cargo_df["Target"].str.contains("cysteine desulfurase")]
cargos = cargos[cargos["Identity"] > 0.5]  # Only keep hits above 50% identity
cargos

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
88,MGYP001806634088,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.762,291,69,0,8,298,12,302,1.029e-142,460
312,MGYP003108509317,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.875,289,36,0,11,299,12,300,2.22e-166,528
418,MGYP001627025497,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.861,289,40,0,11,299,12,300,7.104e-165,524
516,MGYP003384526964,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.795,288,59,0,10,297,13,300,1.867e-149,479
596,MGYP000488071151,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.868,289,38,0,11,299,12,300,5.184e-165,524
648,MGYP003365015307,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.927,290,21,0,10,299,12,301,9.535e-175,552
874,MGYP001627112497,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.91,290,26,0,11,300,12,301,5.448e-172,544
1073,MGYP003703364957,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.821,123,22,0,1,123,178,300,8.832e-63,219
1080,MGYP003392391534,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.736,304,77,1,1,304,1,301,8.458e-141,454
1437,MGYP000284114529,UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T...,0.865,289,39,0,11,299,12,300,4.703e-164,522
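
The `Target` column combines the UniRef90 cluster accession with its description, so it can help to split off the accession before summarizing. Here is a minimal sketch using the Query, Target and Identity values copied from the table above (the `cd_hits` name and the `Cluster` column are introduced here only for illustration):

```python
import pandas as pd

# Values copied from the cysteine desulfurase table above; the Target strings
# are truncated in that table, so only the leading accession is used here
cd_hits = pd.DataFrame({
    "Query": [
        "MGYP001806634088", "MGYP003108509317", "MGYP001627025497", "MGYP003384526964",
        "MGYP000488071151", "MGYP003365015307", "MGYP001627112497", "MGYP003703364957",
        "MGYP003392391534", "MGYP000284114529",
    ],
    "Target": ["UniRef90_A0A6A5KXP5 cysteine desulfurase n=1 T..."] * 10,
    "Identity": [0.762, 0.875, 0.861, 0.795, 0.868, 0.927, 0.91, 0.821, 0.736, 0.865],
})

# The first whitespace-separated token of Target is the UniRef90 cluster accession
cd_hits["Cluster"] = cd_hits["Target"].str.split().str[0]

# How many distinct clusters are hit, and over what identity range?
n_clusters = cd_hits["Cluster"].nunique()
identity_range = (cd_hits["Identity"].min(), cd_hits["Identity"].max())
print(n_clusters, identity_range)
```

In this table all ten queries point to the same cluster accession, with identities between 73.6% and 92.7%.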
