# Hunting Down Low-Identity Cargo Proteins

We have a list of manually curated cargo protein types in `encapsulin_families.csv`. We also have search hits for these cargo proteins against BLAST nr in `cargo_blast_nr_hits.tsv`. Let's comb through the hits and find annotated cargo proteins that have low identity to any known protein, or whose closest homologs are unannotated or misannotated as something else.

First, we load the BLAST hits:

In [2]:
import pandas as pd

hits_df = pd.read_csv("../cargo_blast_nr_hits.tsv", sep="\t", names=["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
                                                              "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"])

hits_df.head()

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
0,MGYP001385583216,MAE56341.1 hypothetical protein [Porticoccacea...,0.869,84,11,0,4,87,2,85,1.157e-39,152
1,MGYP001385583216,CAB4150874.1 hypothetical protein UFOVP574_44 ...,0.753,77,19,0,11,87,4,80,2.051e-31,128
2,MGYP001385583216,CAB4171523.1 hypothetical protein UFOVP927_6 [...,0.708,79,23,0,9,87,2,80,1.155e-28,120
3,MGYP001385583216,MCA8835931.1 hypothetical protein [Pseudomonad...,0.776,67,15,0,8,74,2,68,5.5e-24,107
4,MGYP001385583216,BAQ84647.1 hypothetical protein [uncultured Me...,0.765,64,15,0,24,87,24,87,6.940000000000001e-23,104


Next, let's load a list of annotated encapsulin systems that we've manually curated, and get the cargo proteins for each one:

In [3]:
import ast

# Load operon DataFrame and ensure Pfam lists for each cargo MGYP are read correctly
operon_df = pd.read_csv("../operon_df_filtered.csv")
indices = [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i in indices:
    # ast.literal_eval parses the stringified lists without executing arbitrary code
    operon_df[f"Pfam {i}"] = operon_df[f"Pfam {i}"].apply(ast.literal_eval)

#Load annotated encapsulins
family_df = pd.read_csv("../encapsulin_families.csv")
annotated_encapsulins = family_df["Encapsulin MGYP"].unique()

# Keep only the operons whose encapsulin is in the manually annotated set
operon_df = operon_df.query("`Encapsulin MGYP` in @annotated_encapsulins")
operon_df.head()

Unnamed: 0,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,...,Pfam 1,Pfam 2,Pfam 3,Pfam 4,Pfam 5,Pfam 6,Pfam 7,Pfam 8,Pfam 9,Pfam 10
14,MGYP000087665821,MGYP000703964854,MGYP000736392069,MGYP000338316496,MGYP002077299657,MGYP000067028803,MGYP000512287377,MGYP000403178503,MGYP000081770201,MGYP000173178704,...,"[PF07687, PF01546]",[PF10994],[PF13419],[PF00082],[PF01740],[PF05494],[PF04333],[PF02470],[PF02405],[PF00005]
15,MGYP000423818695,MGYP000532926311,MGYP000143698383,MGYP000709860896,MGYP000606656294,MGYP000406125095,MGYP000376646145,MGYP000155487403,MGYP000482799648,MGYP000241003398,...,[PF04261],"[PF00775, CL0154]","[PF00384, PF04879, PF01568]",[PF10023],[PF09650],[],"[PF02518, PF00989, PF16927, PF00512]","[PF00072, PF00196]",[PF00072],"[CL0004, PF00884]"
16,MGYP000423818695,MGYP000532926311,MGYP000143698383,MGYP000709860896,MGYP000606656294,MGYP000406125095,MGYP000376646145,MGYP000155487403,MGYP000482799648,MGYP000241003398,...,[PF04261],"[PF00775, CL0154]","[PF00384, PF04879, PF01568]",[PF10023],[PF09650],[],"[PF02518, PF00989, PF16927, PF00512]","[PF00072, PF00196]",[PF00072],"[CL0004, PF00884]"
21,MGYP003081621609,MGYP003081621620,MGYP000657744308,MGYP000144496472,MGYP003081621650,MGYP000038336433,MGYP000678389341,MGYP002751754157,MGYP000909261574,MGYP003081648573,...,"[PF14801, PF08704]","[PF17758, PF16450, PF00004]",[],[],[],[],[],[],[],[]
22,,,,,MGYP003085065548,MGYP000038336433,MGYP000424619035,MGYP001644060519,MGYP000941113010,MGYP000460280988,...,"[PF08704, PF14801]","[PF00004, PF17758, PF16450]",[PF03136],[],[PF03136],[CL0487],[PF00977],[PF13828],[PF13828],[PF10825]


And finally, let's load our Pfam mapping dictionaries to get descriptions easily:

In [4]:
# Load the Pfam labels dictionary downloaded from the GoogleResearch/Proteinfer GitHub repo,
# then fill in some important missing labels that I curated manually
import gzip
import json

with gzip.open("../DBs/label_descriptions.json.gz", "rt", encoding="utf-8") as gzip_file:
    labels_dict = json.load(gzip_file)

labels_dict["PF19821"] = "Phage capsid protein"
labels_dict["PF19307"] = "Phage capsid-like protein"
labels_dict['PF19289'] = "PmbA/TldA metallopeptidase C-terminal domain"
labels_dict['PF19290'] = "PmbA/TldA metallopeptidase central domain"
labels_dict['PF20211'] = "Family of unknown function (DUF6571)"
labels_dict['PF19782'] = "Family of unknown function (DUF6267)"
labels_dict['PF19343'] = "Family of unknown function (DUF5923)"
labels_dict['PF20036'] = "Major capsid protein 13-like"
labels_dict['PF18960'] = "Family of unknown function (DUF5702)"
labels_dict['PF18906'] = "Phage tail tube protein"
labels_dict['PF19753'] = "Family of unknown function (DUF6240)"

def get_label(pfam):
    """Return the description for a Pfam or clan accession, or None if it has no label."""
    return labels_dict.get(pfam)

pfam_df = pd.read_csv("../pfams/cargo_pfams.tsv", sep="\t", names=["Cargo MGYP", "Pfam", "Start", "End"])
pfam_df["Description"] = pfam_df["Pfam"].apply(get_label)

display(pfam_df.head())

Unnamed: 0,Cargo MGYP,Pfam,Start,End,Description
0,MGYP004138304253,PF09956,36,83,Uncharacterized conserved protein (DUF2190)
1,MGYP003963085681,CL0110,469,743,GT-A
2,MGYP004383807105,PF01391,3,56,Collagen triple helix repeat (20 copies)
3,MGYP003964778019,PF00295,48,153,Glycosyl hydrolases family 28
4,MGYP003960646311,PF01068,155,384,ATP dependent DNA ligase domain


Here's some code that will get the encapsulin MGYP for any given cargo MGYP:

In [5]:
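cargo_dict = {}

# Map each neighboring MGYP (first 20 columns) to its operon's encapsulin MGYP (column 20)
for index, row in operon_df.iterrows():
    for item in row.iloc[:20]:
        if isinstance(item, str):  # skip empty (NaN) slots
            cargo_dict[item] = row.iloc[20]

def get_encapsulin(cargo_mgyp):
    """Return the encapsulin MGYP for a cargo MGYP, or None if the cargo is unknown."""
    return cargo_dict.get(cargo_mgyp)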
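
The same cargo-to-encapsulin lookup can also be built without a row-wise loop by reshaping the table to one (encapsulin, neighbor) pair per row. This is only a sketch on a made-up `toy_operons` table: the MGYP values are invented, and it assumes the encapsulin column is named `Encapsulin MGYP` as in the query above.

```python
import pandas as pd

# Toy operon table: two neighbor columns plus the encapsulin column.
# All identifiers here are invented for illustration.
toy_operons = pd.DataFrame({
    "-1": ["MGYP_A", None],
    "1": ["MGYP_B", "MGYP_C"],
    "Encapsulin MGYP": ["MGYP_ENC1", "MGYP_ENC2"],
})

# Reshape to one (encapsulin, neighbor) pair per row, then drop empty slots.
pairs = (
    toy_operons
    .melt(id_vars="Encapsulin MGYP", value_name="Cargo MGYP")
    .dropna(subset=["Cargo MGYP"])
)

# Later pairs overwrite earlier ones. The melted order is column by column,
# so a cargo found in several operons may resolve differently than in a
# row-wise loop.
toy_cargo_dict = dict(zip(pairs["Cargo MGYP"], pairs["Encapsulin MGYP"]))
print(toy_cargo_dict)
```

Whether this is preferable depends mostly on table size; for a few thousand operons the explicit loop is just as workable.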

Next, let's get a list of each annotated cargo protein:

In [6]:
with open("../pfams/family_1_cargo_pfams.tsv", "r") as family1file:
    family1file.readline()
    family_1_pfams = [line.rstrip().split()[0] for line in family1file]

with open("../pfams/family_2_cargo_pfams.txt", "r") as family2file:
    family2file.readline()
    family_2_pfams = [line.rstrip().split()[0] for line in family2file]


all_pfams = family_1_pfams + family_2_pfams

annotated_cargos = set(pfam_df.query("Pfam in @all_pfams")["Cargo MGYP"].unique())

# Cargos with BLAST hits whose encapsulin is in the annotated set
encapsulin_annotated_cargos = {
    cargo_mgyp
    for cargo_mgyp in hits_df["Query"].unique()
    if get_encapsulin(cargo_mgyp) in annotated_encapsulins
}

annotated_cargos = annotated_cargos.intersection(encapsulin_annotated_cargos)
len(annotated_cargos)

133

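And now, let's get each one's best hit in the BLAST nr database, taking "best" to mean the hit with the highest identity, and sort the cargos from lowest to highest best-hit identity: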

In [7]:
hits_df = (
    hits_df.query("Query in @annotated_cargos")
    .sort_values(by="Identity", ascending=False)
    .drop_duplicates(subset="Query")
    .sort_values(by="Identity", ascending=True)
)
hits_df.iloc[:20]

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
174825,MGYP003302485827,PWL40683.1 hypothetical protein DBY43_06740 [C...,0.327,159,101,3,9,161,12,170,6.246e-12,76
370651,MGYP003626338685,"CAB4162602.1 Glycosyl transferase, family 25 [...",0.504,228,102,4,7,228,2,224,3.018e-68,242
217548,MGYP001178528009,OUW91461.1 hypothetical protein CBD94_01720 [G...,0.553,177,78,1,1,176,1,177,8.637e-53,195
359273,MGYP003108877804,REK53366.1 hypothetical protein DWQ49_11805 [B...,0.559,177,77,1,1,176,1,177,1.0470000000000001e-54,200
31578,MGYP003110663811,OUW91461.1 hypothetical protein CBD94_01720 [G...,0.564,177,76,1,1,176,1,177,1.3030000000000001e-53,197
396388,MGYP001626144187,MCC6055797.1 hypothetical protein [Desulfuroco...,0.617,290,111,0,1,290,1,290,1.246e-104,351
175413,MGYP001626144193,MCC6041668.1 ferritin-like domain-containing p...,0.619,155,59,0,1,155,1,155,2.187e-56,204
73529,MGYP001996287958,GIR05032.1 MAG: hypothetical protein CM15mP16_...,0.634,230,84,0,54,283,1,230,3.108e-80,280
31436,MGYP001346362583,MBD1141706.1 4-hydroxybenzoate octaprenyltrans...,0.636,154,56,0,1,154,130,283,7.383e-56,202
37649,MGYP001438425319,MBD1141706.1 4-hydroxybenzoate octaprenyltrans...,0.636,154,56,0,1,154,130,283,1.388e-55,201
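
The cell above keeps the highest-identity alignment for each cargo. Identity alone can favor short alignments, so it may be worth checking whether ranking by bitscore picks the same target. The sketch below compares the two criteria on a made-up `toy_hits` table; `best_hits` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy hits table; queries, targets and scores are invented for illustration.
toy_hits = pd.DataFrame({
    "Query": ["q1", "q1", "q2", "q2"],
    "Target": ["t1", "t2", "t3", "t4"],
    "Identity": [0.90, 0.60, 0.50, 0.45],
    "Bitscore": [40, 300, 120, 110],
})

def best_hits(df, by):
    """Return one row per query: the hit that maximises column `by`."""
    return df.loc[df.groupby("Query")[by].idxmax()].set_index("Query")

by_identity = best_hits(toy_hits, "Identity")
by_bitscore = best_hits(toy_hits, "Bitscore")

# Queries whose top hit changes depending on the ranking criterion.
disagree = by_identity.index[by_identity["Target"] != by_bitscore["Target"]].tolist()
print(disagree)
```

Queries listed in `disagree` are the ones where a short, high-identity alignment outranks a longer, higher-scoring one, and they probably deserve a manual look.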


The first few hits have very low identity and look interesting. Let's take a look at their Pfam annotations:

In [8]:
low_identity_hits = hits_df.iloc[:20]["Query"].unique()
low_identity_pfams = pfam_df.query("`Cargo MGYP` in @low_identity_hits").sort_values(by="Cargo MGYP")
low_identity_pfams

Unnamed: 0,Cargo MGYP,Pfam,Start,End,Description
606615,MGYP000270495616,PF00619,15,58,Caspase recruitment domain
699138,MGYP000270495616,CL0044,21,40,Ferritin
35754,MGYP001178528009,PF00210,36,164,Ferritin-like domain
386260,MGYP001187095434,PF01040,3,148,UbiA prenyltransferase family
538789,MGYP001346362583,PF01040,3,148,UbiA prenyltransferase family
661325,MGYP001438425319,PF01040,3,148,UbiA prenyltransferase family
503401,MGYP001442578972,PF01040,22,282,UbiA prenyltransferase family
512447,MGYP001626144187,PF02915,4,125,Rubrerythrin
361212,MGYP001626144187,PF01988,118,281,VIT family
386483,MGYP001626144193,PF02915,7,144,Rubrerythrin
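
To read identity and domain annotations side by side, the best-hit table can be joined to the Pfam table. This is a minimal sketch on toy data that assumes the column names used above; `toy_top_hits`, `toy_pfams` and `summary` are illustrative names, and the values are invented.

```python
import pandas as pd

# Toy stand-ins for the best-hit table and the Pfam annotation table.
toy_top_hits = pd.DataFrame({
    "Query": ["c1", "c2"],
    "Identity": [0.33, 0.62],
})
toy_pfams = pd.DataFrame({
    "Cargo MGYP": ["c1", "c2", "c2"],
    "Description": ["Ferritin-like domain", "Rubrerythrin", "VIT family"],
})

# Collapse each cargo's domain descriptions into one string,
# skipping accessions that have no label (None).
domains = (
    toy_pfams.groupby("Cargo MGYP")["Description"]
    .agg(lambda s: "; ".join(s.dropna()))
    .rename("Domains")
    .reset_index()
)

# A left join keeps cargos that have no Pfam annotation at all.
summary = (
    toy_top_hits
    .merge(domains, left_on="Query", right_on="Cargo MGYP", how="left")
    .drop(columns="Cargo MGYP")
)
print(summary)
```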


In [9]:
get_encapsulin("MGYP003302485827")

'MGYP003302485682'

## Interesting Hits

`MGYP003302485827` is the interesting example here. It's annotated as a ferritin, which is a known encapsulin cargo type; however, its best hit in the BLAST nr database is `PWL40683.1 hypothetical protein DBY43_06740 [Clostridiaceae bacterium]`, with only 32.7% identity.
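
This pattern (a cargo whose best nr hit is a "hypothetical protein" at low identity) could be screened for across all best hits rather than spotted by eye. The sketch below runs on a toy table with invented accessions; `flag_candidates` and its 0.5 identity cutoff are assumptions chosen for illustration, not values from this analysis.

```python
import pandas as pd

# Toy best-hit table; accessions and values are invented for illustration.
toy_best = pd.DataFrame({
    "Query": ["cargo_1", "cargo_2", "cargo_3"],
    "Target": [
        "ACC0001.1 hypothetical protein [Some bacterium]",
        "ACC0002.1 ferritin-like domain-containing protein [Some archaeon]",
        "ACC0003.1 Hypothetical protein XYZ_01 [Some phage]",
    ],
    "Identity": [0.33, 0.62, 0.71],
})

def flag_candidates(df, max_identity=0.5):
    """Add boolean columns marking hypothetical targets and low-identity hits."""
    out = df.copy()
    # Case-insensitive literal match on the target description.
    out["Hypothetical"] = out["Target"].str.lower().str.contains(
        "hypothetical protein", regex=False
    )
    # The cutoff is arbitrary and should be tuned to the dataset.
    out["Low identity"] = out["Identity"] < max_identity
    return out

flagged = flag_candidates(toy_best)
print(flagged[["Query", "Hypothetical", "Low identity"]])
```

Rows where both flags are true are the closest analogues of the example above.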

If we search its ESMFold predicted structure using Foldseek we get significant hits against crystal structures of ferritins from *E. coli* K12 (PDB codes **4ZTT** and **4XGS**).

Its encapsulin protein is `MGYP003302485682`. Let's find out more about it:

In [10]:
uniref_hits_df = pd.read_csv("../encapsulin_UniRef90_hits.tsv", sep="\t", names=["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
                                                              "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"])

uniref_hits_df = uniref_hits_df[uniref_hits_df["Query"] == "MGYP003302485682"]
uniref_hits_df

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
837,MGYP003302485682,UniRef90_A0A330LPZ7 Lipoprotein n=1 Tax=Morite...,0.462,359,191,1,5,363,4,360,1.249e-94,324


We can see that its best hit in UniRef90 is `UniRef90_A0A330LPZ7 Lipoprotein n=1 Tax=Moritella yayanosii TaxID=69539 RepID=A0A330LPZ7_9GAMM`, with an identity of only 46.2%.
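
When many encapsulins need this kind of lookup, it can help to split the UniRef90 target description into its fields (cluster ID, protein name, taxon and so on). The sketch below assumes every header follows the `n=… Tax=… TaxID=… RepID=…` layout seen in the hit above; `parse_uniref_header` is a hypothetical helper, and headers that deviate from that layout return `None`.

```python
import re

# Pattern for headers shaped like the hit above; other layouts will not match.
UNIREF_HEADER = re.compile(
    r"^(?P<cluster>\S+)\s+(?P<name>.+?)"
    r"\s+n=(?P<members>\d+)"
    r"\s+Tax=(?P<taxon>.+?)"
    r"\s+TaxID=(?P<taxid>\d+)"
    r"\s+RepID=(?P<rep>\S+)$"
)

def parse_uniref_header(header):
    """Split a UniRef-style description into named fields, or return None."""
    match = UNIREF_HEADER.match(header)
    return match.groupdict() if match else None

parsed = parse_uniref_header(
    "UniRef90_A0A330LPZ7 Lipoprotein n=1 Tax=Moritella yayanosii "
    "TaxID=69539 RepID=A0A330LPZ7_9GAMM"
)
print(parsed["name"], "|", parsed["taxon"])
```

Note that the `Target` column displayed above is truncated in the output, so the parser should be applied to the full strings in the DataFrame rather than to the printed table.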