# Cargo Hits against BLAST non-redundant DB

We've searched the ≈22,000 putative cargo proteins associated with our encapsulin hits against BLAST nr. Let's see if these reveal some functions we hadn't seen before!

In [2]:
import pandas as pd

hits_df = pd.read_csv("../cargo_blast_nr_hits.tsv", sep="\t", names=["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
                                                              "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"])

hits_df.head()

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
0,MGYP001385583216,MAE56341.1 hypothetical protein [Porticoccacea...,0.869,84,11,0,4,87,2,85,1.157e-39,152
1,MGYP001385583216,CAB4150874.1 hypothetical protein UFOVP574_44 ...,0.753,77,19,0,11,87,4,80,2.051e-31,128
2,MGYP001385583216,CAB4171523.1 hypothetical protein UFOVP927_6 [...,0.708,79,23,0,9,87,2,80,1.155e-28,120
3,MGYP001385583216,MCA8835931.1 hypothetical protein [Pseudomonad...,0.776,67,15,0,8,74,2,68,5.5e-24,107
4,MGYP001385583216,BAQ84647.1 hypothetical protein [uncultured Me...,0.765,64,15,0,24,87,24,87,6.940000000000001e-23,104


First of all, let's filter out any "hypothetical proteins", since these tell us nothing about function. We'll also make a nicely formatted description column:

In [2]:
#Take an explicit copy so that adding columns below doesn't trigger SettingWithCopyWarning
hits_df_filtered = hits_df[~hits_df["Target"].str.contains("hypothetical protein")].copy()
#This regex will get out the gene function name from the BLAST header
hits_df_filtered["Description"] = hits_df_filtered["Target"].str.extract(r"\..\s(.+)\s\[")
#A header can list several entries, so keep only the text before the first "["
hits_df_filtered["Description"] = hits_df_filtered["Description"].str.split("[").str[0]
hits_df_filtered = hits_df_filtered.dropna(how="any")
hits_df_filtered


Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Description
39,MGYP001301883557,MBK93439.1 heat-shock protein [Rickettsiales b...,0.770,148,33,1,1,148,1,147,1.557000e-66,233,heat-shock protein
40,MGYP001301883557,MAV94423.1 heat-shock protein [Euryarchaeota a...,0.762,143,33,1,6,148,8,149,2.020000e-62,221,heat-shock protein
41,MGYP001301883557,ASF00636.1 putative heat shock protein [uncult...,0.707,147,42,1,3,148,4,150,8.923000e-61,216,putative heat shock protein
42,MGYP001301883557,OUU18516.1 heat-shock protein [Crocinitomicace...,0.682,148,46,1,1,148,1,147,1.394000e-58,210,heat-shock protein
43,MGYP001301883557,OUW32545.1 heat-shock protein [Flavobacteriace...,0.686,150,42,2,1,148,1,147,1.741000e-57,207,heat-shock protein
...,...,...,...,...,...,...,...,...,...,...,...,...,...
399386,MGYP001185537067,MBK9480173.1 2OG-Fe(II) oxygenase [Bacteroidot...,0.310,187,82,6,54,224,47,202,5.395000e-14,84,2OG-Fe(II) oxygenase
399387,MGYP001185537067,ARF08263.1 2OG-FeII oxygenase superfamily prot...,0.321,227,103,12,2,221,15,197,1.353000e-13,83,2OG-FeII oxygenase superfamily protein
399388,MGYP001185537067,MCG8694262.1 2OG-Fe(II) oxygenase [Minwuiales ...,0.418,110,43,3,121,225,1,94,9.790000e-12,78,2OG-Fe(II) oxygenase
399391,MGYP001185537067,NBV35790.1 2OG-Fe(II) oxygenase [Bacteroidota ...,0.304,230,103,14,8,224,5,190,1.273000e-09,71,2OG-Fe(II) oxygenase
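
To make the two-step header parsing concrete, here is a minimal sketch that applies the same regex and split to two headers. The accessions and organism names are made up for illustration; the second header concatenates two entries, as nr headers sometimes do. Note that the split leaves a trailing space on the concatenated case.

```python
import pandas as pd

# Made-up headers for illustration; the second one concatenates two entries
headers = pd.Series([
    "AAA00001.1 heat-shock protein [Organism one]",
    "AAA00002.1 ABC transporter permease [Organism one]"
    "BBB00003.1 ABC transporter permease [Organism two]",
])

# Step 1: capture everything between the accession version and the LAST " ["
step1 = headers.str.extract(r"\..\s(.+)\s\[", expand=False)

# Step 2: keep only the text before the first "["
step2 = step1.str.split("[").str[0]

for raw, final in zip(step1, step2):
    print(repr(raw), "->", repr(final))
```

Because `(.+)` is greedy, step 1 alone would keep the first organism and the second accession in the description, which is why the split in step 2 is needed.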


In [3]:
hits_df_filtered["Description"].value_counts()[:10]

co-chaperonin GroES                           3413
MAG: putative structural protein               955
glycosyltransferase                            718
ABC transporter ATP-binding protein            695
ABC transporter permease                       685
AAA family ATPase                              639
ABC transporter ATP-binding protein            590
DEAD/DEAH box helicase                         588
recombinase family protein                     586
helix-turn-helix transcriptional regulator     585
Name: Description, dtype: int64
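
"ABC transporter ATP-binding protein" appears twice in the counts above (695 and 590), which suggests the two labels are not byte-identical. One possible cause, which is an assumption rather than something checked against the real data, is leftover whitespace from the header parsing. A minimal sketch on toy data of how stripping whitespace before counting would merge such labels:

```python
import pandas as pd

# Toy descriptions: the second entry differs only by a trailing space
descriptions = pd.Series([
    "ABC transporter ATP-binding protein",
    "ABC transporter ATP-binding protein ",
    "co-chaperonin GroES",
])

raw_counts = descriptions.value_counts()
clean_counts = descriptions.str.strip().value_counts()

print(len(raw_counts), "distinct labels before stripping")
print(len(clean_counts), "distinct labels after stripping")
```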

## Annotating Encapsulin Hits

Recall that we already have a few encapsulins annotated with cargo proteins (mainly from Pfam annotations).

We also have a DataFrame containing the operon info for each encapsulin - that is to say, a list of cargo MGYPs and cargo Pfams for each encapsulin MGYP.

Let's write some code so we can quickly search through any unannotated encapsulins and check whether they're associated with cargo functions from the BLAST data. First, let's load our operon info:

In [4]:
import ast

#Load operon DataFrame and ensure Pfam lists for each cargo MGYP are read correctly
operon_df = pd.read_csv("../operon_df_filtered.csv")
indices = [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i in indices:
    #literal_eval parses the stored list strings without executing arbitrary code
    operon_df[f"Pfam {i}"] = operon_df[f"Pfam {i}"].apply(ast.literal_eval)

#Load annotated encapsulins
annotated_encapsulins = pd.read_csv("../encapsulin_families.csv")["Encapsulin MGYP"].unique()

#Filter out any annotated encapsulins from our operon DataFrame
operon_df = operon_df.query("`Encapsulin MGYP` not in @annotated_encapsulins")
operon_df.head()

Unnamed: 0,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,...,Pfam 1,Pfam 2,Pfam 3,Pfam 4,Pfam 5,Pfam 6,Pfam 7,Pfam 8,Pfam 9,Pfam 10
0,,,,,MGYP003181590055,MGYP003181590098,MGYP003181590127,MGYP000392823663,MGYP000171667185,MGYP000165772378,...,[],[],[],[],[],"[CL0219, CL0219, CL0023, CL0219]",[],[PF00156],"[PF20582, PF04002]",[]
1,,,,,MGYP003181590055,MGYP003181590098,MGYP003181590127,MGYP000392823663,MGYP000171667185,MGYP000165772378,...,[],[],[],[],[],"[CL0219, CL0219, CL0023, CL0219]",[],[PF00156],"[PF20582, PF04002]",[]
2,,,,,MGYP003181590055,MGYP003181590098,MGYP003181590127,MGYP000392823663,MGYP000171667185,MGYP000165772378,...,[],[],[],[],[],"[CL0219, CL0219, CL0023, CL0219]",[],[PF00156],"[PF20582, PF04002]",[]
3,MGYP003181590835,MGYP000451799168,MGYP003181590758,MGYP003215934654,MGYP003181590729,MGYP000611038907,MGYP000050146259,MGYP000353896556,MGYP000705400687,MGYP000177563455,...,[PF00574],[],[],[],"[CL0023, PF08275, CL0413, PF01807, PF08706, PF...",[],[],[],[],[]
4,MGYP000368987579,MGYP000702359586,MGYP002431000206,MGYP000241842941,MGYP000474794676,MGYP000436459235,MGYP000536743081,MGYP000439406311,MGYP000112096761,MGYP000740182570,...,[],[],[],[],[],[],[],[],[],[]


Next, let's make a dictionary mapping each cargo MGYP to its encapsulin. We'll also write a function to return the encapsulin MGYP of a cargo protein:

In [5]:
cargo_dict = {}

for index, row in operon_df.iterrows():
    #The first 20 columns hold the neighbouring cargo MGYPs; column 20 holds the encapsulin MGYP
    for item in row.iloc[:20]:
        if isinstance(item, str):
            cargo_dict[item] = row.iloc[20]

def get_encapsulin(cargo_mgyp):
    #Returns None when the cargo isn't in the dictionary
    return cargo_dict.get(cargo_mgyp)
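
One thing to keep in mind with a plain dictionary: if the same cargo MGYP sits next to more than one encapsulin, only the last encapsulin seen is kept. Below is a minimal sketch of a variant that keeps every encapsulin per cargo. The toy table, its IDs and the name `cargo_to_encapsulins` are all made up for illustration.

```python
from collections import defaultdict

import pandas as pd

# Toy operon table: two cargo slots plus the encapsulin column (made-up IDs)
toy_operons = pd.DataFrame({
    "-1": ["CARGO_A", "CARGO_A"],
    "1": ["CARGO_B", None],
    "Encapsulin MGYP": ["ENC_1", "ENC_2"],
})

# Map each cargo to the set of all encapsulins it neighbours
cargo_to_encapsulins = defaultdict(set)
for _, row in toy_operons.iterrows():
    for column in ["-1", "1"]:
        cargo = row[column]
        if isinstance(cargo, str):  # skip empty operon slots
            cargo_to_encapsulins[cargo].add(row["Encapsulin MGYP"])

print(sorted(cargo_to_encapsulins["CARGO_A"]))
```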

## Annotating Cargo Proteins

First off, let's search for a few known family 1 cargo proteins and see if we get any hits:

In [6]:
from collections import defaultdict

annotations_dict = defaultdict(list)

for query in ["Rubrerythrin", "ferritin", "Ferritin", "Dyp-type peroxidase", "hemerythrin"]:

    cargos = hits_df_filtered[hits_df_filtered["Target"].str.upper().str.contains(query.upper())]
    cargos = cargos[cargos["Identity"] > 0.3] #Only keep hits above 30% identity

    for cargo_mgyp in cargos["Query"].unique():
        if get_encapsulin(cargo_mgyp):
            annotations_dict[get_encapsulin(cargo_mgyp)].append(query)

for key, value in annotations_dict.items():
    print(key, value)

annotated_encapsulins = list(annotated_encapsulins) + list(annotations_dict.keys())

MGYP000507254597 ['ferritin', 'Ferritin']
MGYP000016735148 ['ferritin', 'Ferritin']
MGYP000674248133 ['ferritin', 'Ferritin']


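We pick up a few ferritin hits for these known cargos (each encapsulin is listed twice because "ferritin" and "Ferritin" both match under the case-insensitive search). Let's try family 2 cargos now: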

In [7]:
from collections import defaultdict

annotations_dict = defaultdict(list)

for query in ["Polyprenyl transferase", "Polyprenyl synthetase", "Xylulokinase", "Xylulose kinase", "Terpene synthetase", "Terpene cyclase", "Cysteine desulfurase"]:

    cargos = hits_df_filtered[hits_df_filtered["Target"].str.upper().str.contains(query.upper())]
    cargos = cargos[cargos["Identity"] > 0.3] #Only keep hits above 30% identity

    for cargo_mgyp in cargos["Query"].unique():
        if get_encapsulin(cargo_mgyp):
            annotations_dict[get_encapsulin(cargo_mgyp)].append(query)

for key, value in annotations_dict.items():
    print(key, value)

annotated_encapsulins.extend(list(annotations_dict.keys()))
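
The two cells above repeat the same search loop, so it could be wrapped in a helper. This is only a sketch on toy data: the function name `annotate_by_keywords`, its parameters and the toy IDs are hypothetical, and it stores lower-cased keywords in a set so that "ferritin" and "Ferritin" are not both recorded.

```python
from collections import defaultdict

import pandas as pd

def annotate_by_keywords(hits, cargo_to_encapsulin, keywords, min_identity=0.3):
    """Map encapsulin IDs to the keywords found in their cargos' BLAST targets."""
    annotations = defaultdict(set)
    for keyword in keywords:
        # Case-insensitive literal match (no regex interpretation of the keyword)
        mask = hits["Target"].str.contains(keyword, case=False, regex=False)
        matched = hits[mask & (hits["Identity"] > min_identity)]
        for cargo in matched["Query"].unique():
            encapsulin = cargo_to_encapsulin.get(cargo)
            if encapsulin is not None:
                annotations[encapsulin].add(keyword.lower())
    return annotations

# Toy data with made-up IDs
toy_hits = pd.DataFrame({
    "Query": ["CARGO_A", "CARGO_B", "CARGO_C"],
    "Target": ["X1.1 ferritin [Org]", "X2.1 Ferritin-like protein [Org]", "X3.1 hemerythrin [Org]"],
    "Identity": [0.8, 0.2, 0.9],
})
toy_map = {"CARGO_A": "ENC_1", "CARGO_B": "ENC_2"}

result = annotate_by_keywords(toy_hits, toy_map, ["ferritin", "Ferritin", "hemerythrin"])
print(dict(result))
```

Here CARGO_B is dropped by the identity cutoff and CARGO_C has no encapsulin in the toy mapping, so only ENC_1 is annotated.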

## Enriching Pfam Annotations with BLAST nr Hits

Recall that we have a bunch of sequence search hits against the cargo loading peptide (CLP) consensus sequences.

From notebook `notebooks/CLP_hits_exploration.ipynb` you can see that whilst some of these hits had useful Pfam annotations, a lot of them were clearly junk or even missing entirely.

Let's combine our BLAST nr hits data with this CLP data so that we can get some idea of the function of these CLP hits in the absence of useful Pfam annotations. First, we load the data:

In [8]:
from Bio import SeqIO
import gzip
import json

clp_df = pd.read_csv("../clp_cargo_hits.m8", sep="\t", names=["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
                                                            "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"])

#First, let's make a dictionary where the keys are cargo MGYPs and values are the protein lengths
length_dict = {str(record.id).split()[0]: len(str(record.seq)) for record in SeqIO.parse("../seqs/all_putative_cargo_proteins.fasta", "fasta")}

#Let's make a function to get the length of a cargo MGYP and apply it to the DataFrame
def get_length(cargo_mgyp):
    return(length_dict[cargo_mgyp])

clp_df["Length"] = clp_df["Target"].apply(get_length)
clp_df["C-terminus Distance"] = clp_df["Length"] - clp_df["Target End"]

# The code below loads the Pfam labels dictionary downloaded from the GoogleResearch/Proteinfer GitHub repo
# I've also filled in some important missing labels that I manually curated

with gzip.open("../DBs/label_descriptions.json.gz", "rt") as f:
    labels_dict = json.load(f)

labels_dict["PF19821"] = "Phage capsid protein"
labels_dict["PF19307"] = "Phage capsid-like protein"
labels_dict['PF19289'] = "PmbA/TldA metallopeptidase C-terminal domain"
labels_dict['PF19290'] = "PmbA/TldA metallopeptidase central domain"
labels_dict['PF20211'] = "Family of unknown function (DUF6571)"
labels_dict['PF19782'] = "Family of unknown function (DUF6267)"
labels_dict['PF19343'] = "Family of unknown function (DUF5923)"
labels_dict['PF20036'] = "Major capsid protein 13-like"
labels_dict['PF18960'] = "Family of unknown function (DUF5702)"
labels_dict['PF18906'] = "Phage tail tube protein"
labels_dict['PF19753'] = "Family of unknown function (DUF6240)"

def get_label(pfam):
    try:
        return(labels_dict[pfam])
    except KeyError:
        return(None)

pfam_df = pd.read_csv("../pfams/cargo_pfams.tsv", sep="\t", names=["Cargo MGYP", "Pfam", "Start", "End"])
pfam_df["Description"] = pfam_df["Pfam"].apply(get_label)

display(clp_df.head())
display(pfam_df.head())

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Length,C-terminus Distance
0,all_cargo_clp_consensus,MGYP003667212098,0.923,13,0,1,1,13,114,125,0.001832,33,128,3
1,all_cargo_clp_consensus,MGYP001555959371,0.846,13,1,1,1,13,337,348,0.08722,28,352,4
2,all_cargo_clp_consensus,MGYP003362106646,0.909,11,0,1,1,11,357,366,0.1885,27,367,1
3,all_cargo_clp_consensus,MGYP001599236885,0.833,12,1,1,1,12,336,346,0.4072,26,351,5
4,all_cargo_clp_consensus,MGYP001627912779,0.875,8,1,0,6,13,120,127,1.29,24,127,0


Unnamed: 0,Cargo MGYP,Pfam,Start,End,Description
0,MGYP004138304253,PF09956,36,83,Uncharacterized conserved protein (DUF2190)
1,MGYP003963085681,CL0110,469,743,GT-A
2,MGYP004383807105,PF01391,3,56,Collagen triple helix repeat (20 copies)
3,MGYP003964778019,PF00295,48,153,Glycosyl hydrolases family 28
4,MGYP003960646311,PF01068,155,384,ATP dependent DNA ligase domain


Let's filter the CLP hits DataFrame to remove any cargos that correspond to already annotated encapsulins:

In [9]:
annotated_cargos = []

for cargo in clp_df["Target"].unique():
    if get_encapsulin(cargo) in annotated_encapsulins:
        annotated_cargos.append(cargo)

clp_df = clp_df.query("Target not in @annotated_cargos")
clp_df

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Length,C-terminus Distance
0,all_cargo_clp_consensus,MGYP003667212098,0.923,13,0,1,1,13,114,125,0.001832,33,128,3
1,all_cargo_clp_consensus,MGYP001555959371,0.846,13,1,1,1,13,337,348,0.087220,28,352,4
2,all_cargo_clp_consensus,MGYP003362106646,0.909,11,0,1,1,11,357,366,0.188500,27,367,1
3,all_cargo_clp_consensus,MGYP001599236885,0.833,12,1,1,1,12,336,346,0.407200,26,351,5
4,all_cargo_clp_consensus,MGYP001627912779,0.875,8,1,0,6,13,120,127,1.290000,24,127,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
722,family2_ntd_consensus_motif,MGYP001510367494,0.714,7,2,0,2,8,258,264,2156.000000,15,1099,835
723,family2_ntd_consensus_motif,MGYP000085676378,0.714,7,2,0,2,8,258,264,2156.000000,15,1099,835
724,family2_ntd_consensus_motif,MGYP003631992961,0.800,5,1,0,2,6,258,262,3169.000000,14,660,398
725,family2_ntd_consensus_motif,MGYP003118733449,0.666,6,2,0,1,6,257,262,4649.000000,14,1341,1079


Let's only examine CLP hits that are within 30 residues of either terminus of the cargo protein:

In [10]:
clp_df = pd.concat([clp_df[clp_df["Target Start"] < 30],clp_df[clp_df["C-terminus Distance"] < 30]])
clp_df

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Length,C-terminus Distance
29,all_cargo_clp_consensus,MGYP001188511991,0.857,7,1,0,7,13,7,13,40.08,20,690,677
60,all_cargo_clp_consensus,MGYP001448109398,1.000,5,0,0,3,7,14,18,181.30,18,139,121
105,all_cargo_clp_consensus,MGYP003320528748,1.000,5,0,0,4,8,20,24,383.40,17,216,192
118,all_cargo_clp_consensus,MGYP000324511467,1.000,5,0,0,4,8,25,29,383.40,17,494,465
132,all_cargo_clp_consensus,MGYP003325887715,1.000,5,0,0,5,9,20,24,806.90,16,107,83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
690,family2_ntd_consensus_motif,MGYP003148467311,1.000,4,0,0,3,6,284,287,673.00,16,314,27
691,family2_ntd_consensus_motif,MGYP003108842163,1.000,4,0,0,3,6,288,291,673.00,16,318,27
694,family2_ntd_consensus_motif,MGYP003663642500,1.000,4,0,0,3,6,293,296,673.00,16,324,28
698,family2_ntd_consensus_motif,MGYP003130907414,1.000,4,0,0,3,6,311,314,673.00,16,342,28
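
Note that concatenating the two filtered frames keeps a row twice when a hit is near both termini, which can happen for short proteins. A minimal sketch of an alternative using a single boolean mask, on made-up toy data (the `max_distance` name is hypothetical):

```python
import pandas as pd

# Toy CLP hits: "SHORT" is within 30 residues of both termini
toy_clp = pd.DataFrame({
    "Target": ["SHORT", "NTERM", "CTERM", "MIDDLE"],
    "Target Start": [5, 10, 300, 150],
    "C-terminus Distance": [20, 400, 3, 200],
})
max_distance = 30

# Approach used above: concatenate the N-terminal and C-terminal subsets
concat_version = pd.concat([
    toy_clp[toy_clp["Target Start"] < max_distance],
    toy_clp[toy_clp["C-terminus Distance"] < max_distance],
])

# Alternative: one mask, so each row appears at most once and keeps its order
near_terminus = (
    (toy_clp["Target Start"] < max_distance)
    | (toy_clp["C-terminus Distance"] < max_distance)
)
mask_version = toy_clp[near_terminus]

print(len(concat_version), len(mask_version))
```

Since later steps only use the unique `Target` values, the duplicates are harmless there, but the mask version gives cleaner row counts.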


Let's add in our Pfam annotations:

In [11]:
clp_df = clp_df.merge(pfam_df, how="left", left_on="Target", right_on="Cargo MGYP")
clp_df

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Length,C-terminus Distance,Cargo MGYP,Pfam,Start,End,Description
0,all_cargo_clp_consensus,MGYP001188511991,0.857,7,1,0,7,13,7,13,40.08,20,690,677,,,,,
1,all_cargo_clp_consensus,MGYP001448109398,1.000,5,0,0,3,7,14,18,181.30,18,139,121,,,,,
2,all_cargo_clp_consensus,MGYP003320528748,1.000,5,0,0,4,8,20,24,383.40,17,216,192,MGYP003320528748,PF17212,7.0,201.0,Tail tubular protein
3,all_cargo_clp_consensus,MGYP000324511467,1.000,5,0,0,4,8,25,29,383.40,17,494,465,,,,,
4,all_cargo_clp_consensus,MGYP003325887715,1.000,5,0,0,5,9,20,24,806.90,16,107,83,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
231,family2_ntd_consensus_motif,MGYP003148467311,1.000,4,0,0,3,6,284,287,673.00,16,314,27,,,,,
232,family2_ntd_consensus_motif,MGYP003108842163,1.000,4,0,0,3,6,288,291,673.00,16,318,27,,,,,
233,family2_ntd_consensus_motif,MGYP003663642500,1.000,4,0,0,3,6,293,296,673.00,16,324,28,,,,,
234,family2_ntd_consensus_motif,MGYP003130907414,1.000,4,0,0,3,6,311,314,673.00,16,342,28,,,,,


Finally, I'm going to write some code to iterate through these ≈200 CLP hits and print out the first BLAST nr description for each, so we can manually comb through them and see if anything interesting appears:

In [12]:
clp_cargos = clp_df["Target"].unique()
clp_blast_df = hits_df_filtered.query("Query in @clp_cargos")

blast_cargo_dict = defaultdict(list)

for cargo in clp_cargos:
    blast_records = clp_blast_df[clp_blast_df["Query"] == cargo].to_dict(orient="records")
    if blast_records:
        blast_cargo_dict[cargo].extend([record["Description"] for record in blast_records])

for cargo_mgyp, descriptions in blast_cargo_dict.items():
    if get_encapsulin(cargo_mgyp):
        print(f"------------------------Encapsulin {get_encapsulin(cargo_mgyp)} Cargo {cargo_mgyp}------------------------")
        print(descriptions[0])

------------------------Encapsulin MGYP001434424420 Cargo MGYP003320528748------------------------
tail protein 
------------------------Encapsulin MGYP003181827401 Cargo MGYP003181827418------------------------
MAG TPA: Putative ATP dependent Clp protease
------------------------Encapsulin MGYP000236489926 Cargo MGYP003030405215------------------------
MAG TPA: Putative ATP dependent Clp protease
------------------------Encapsulin MGYP003205503727 Cargo MGYP003205503713------------------------
MAG TPA: Putative ATP dependent Clp protease
------------------------Encapsulin MGYP000675899088 Cargo MGYP000348586414------------------------
MAG TPA: Putative ATP dependent Clp protease
------------------------Encapsulin MGYP003131585963 Cargo MGYP003131585812------------------------
putative tail fiber
------------------------Encapsulin MGYP000248784877 Cargo MGYP000635077895------------------------
elongation factor Tu
------------------------Encapsulin MGYP000622200760 Cargo MGYP0032516222

## Addendum - Investigating Potential Virus False Positives

Addendum, 3rd January 2023: I've had a new idea for reanalysing this data. A lot of these cargo hits may share high identity with viral proteins, which could give us a way of filtering out false-positive encapsulin hits that are really virus capsids.

Let's do a very crude filtering and check whether any of our cargos have hits against viral sequences:

In [50]:
virus_hits = hits_df[hits_df["Target"].str.contains("phage|virus")]
#Only keep hits above 90% identity; copy so we can safely add columns later
virus_hits = virus_hits[virus_hits["Identity"] > 0.9].copy()
virus_hits

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
1115,MGYP003367082276,DAW65885.1 MAG TPA: hypothetical protein [Bact...,0.966,207,7,0,1,207,1,207,6.786000e-136,436
1116,MGYP003367082276,WP_118595469.1 hypothetical protein [Blautia s...,0.937,207,13,0,1,207,1,207,1.316000e-132,426
1415,MGYP003396736581,YP_010114620.1 PhoH-like phosphate starvation-...,0.947,228,12,0,1,228,1,228,2.440000e-138,444
1812,MGYP000014746922,WP_118937028.1 MULTISPECIES: hypothetical prot...,0.987,79,1,0,1,79,1,79,2.079000e-43,162
2014,MGYP003109397898,ASE99822.1 putative peptidase M15 [uncultured ...,0.911,158,14,0,1,158,1,158,1.794000e-93,311
...,...,...,...,...,...,...,...,...,...,...,...,...
398689,MGYP001219582601,BCV03744.1 MAG: hypothetical protein CM15mV72_...,0.931,29,2,0,1,29,1,29,1.003000e-06,56
399076,MGYP003109038852,QPZ53650.1 hypothetical protein HTVC203P_gp26 ...,0.935,93,6,0,1,93,1,93,2.630000e-45,168
399081,MGYP003109038852,QGZ17672.1 hypothetical protein HTVC023P_gp43 ...,0.921,89,7,0,5,93,17,105,7.155000e-42,158
399092,MGYP003109038852,BCV02102.1 MAG: hypothetical protein CM15mV49_...,0.950,80,4,0,5,84,3,82,1.302000e-37,146


In [51]:
virus_hits["Description"] = virus_hits["Target"].str.extract(r"\w+\.\w\s(.+)")
virus_hits

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Description
1115,MGYP003367082276,DAW65885.1 MAG TPA: hypothetical protein [Bact...,0.966,207,7,0,1,207,1,207,6.786000e-136,436,MAG TPA: hypothetical protein [Bacteriophage sp.]
1116,MGYP003367082276,WP_118595469.1 hypothetical protein [Blautia s...,0.937,207,13,0,1,207,1,207,1.316000e-132,426,hypothetical protein [Blautia sp. AF17-9LB]RH...
1415,MGYP003396736581,YP_010114620.1 PhoH-like phosphate starvation-...,0.947,228,12,0,1,228,1,228,2.440000e-138,444,PhoH-like phosphate starvation-inducible [Flav...
1812,MGYP000014746922,WP_118937028.1 MULTISPECIES: hypothetical prot...,0.987,79,1,0,1,79,1,79,2.079000e-43,162,MULTISPECIES: hypothetical protein [Bacteroide...
2014,MGYP003109397898,ASE99822.1 putative peptidase M15 [uncultured ...,0.911,158,14,0,1,158,1,158,1.794000e-93,311,putative peptidase M15 [uncultured virus]
...,...,...,...,...,...,...,...,...,...,...,...,...,...
398689,MGYP001219582601,BCV03744.1 MAG: hypothetical protein CM15mV72_...,0.931,29,2,0,1,29,1,29,1.003000e-06,56,MAG: hypothetical protein CM15mV72_360 [uncult...
399076,MGYP003109038852,QPZ53650.1 hypothetical protein HTVC203P_gp26 ...,0.935,93,6,0,1,93,1,93,2.630000e-45,168,hypothetical protein HTVC203P_gp26 [Pelagibact...
399081,MGYP003109038852,QGZ17672.1 hypothetical protein HTVC023P_gp43 ...,0.921,89,7,0,5,93,17,105,7.155000e-42,158,hypothetical protein HTVC023P_gp43 [Pelagibact...
399092,MGYP003109038852,BCV02102.1 MAG: hypothetical protein CM15mV49_...,0.950,80,4,0,5,84,3,82,1.302000e-37,146,MAG: hypothetical protein CM15mV49_700 [uncult...


Now, let's see how many of these cargo hits are associated with encapsulins in our final, filtered dataset:

In [52]:
import ast

#Load operon DataFrame and ensure Pfam lists for each cargo MGYP are read correctly
operon_df = pd.read_csv("../operon_df_filtered.csv")
indices = [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i in indices:
    #literal_eval parses the stored list strings without executing arbitrary code
    operon_df[f"Pfam {i}"] = operon_df[f"Pfam {i}"].apply(ast.literal_eval)

filtered_cargos = set([value for value in operon_df.iloc[:, :20].to_numpy().flatten() if value])
len(filtered_cargos)

24579
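
A caveat on the count above: `NaN` is truthy in Python, so `if value` does not drop empty operon positions, and the set may contain `NaN` alongside the MGYP strings. Whether that changes the number in practice depends on the data. A minimal sketch on a made-up toy table of filtering on string type instead:

```python
import numpy as np
import pandas as pd

# Toy operon table with one empty (NaN) position and made-up IDs
toy_operons = pd.DataFrame({
    "-1": ["CARGO_A", np.nan],
    "1": ["CARGO_B", "CARGO_A"],
})
values = toy_operons.to_numpy().flatten()

# NaN passes a plain truthiness check...
truthy_only = set(value for value in values if value)
# ...but not an explicit string check
strings_only = set(value for value in values if isinstance(value, str))

print(len(truthy_only), len(strings_only))
```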

In [53]:
#Assign the result, otherwise the filter is silently discarded
virus_hits = virus_hits.query("Query in @filtered_cargos").copy()
virus_hits

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Description
1115,MGYP003367082276,DAW65885.1 MAG TPA: hypothetical protein [Bact...,0.966,207,7,0,1,207,1,207,6.786000e-136,436,MAG TPA: hypothetical protein [Bacteriophage sp.]
1116,MGYP003367082276,WP_118595469.1 hypothetical protein [Blautia s...,0.937,207,13,0,1,207,1,207,1.316000e-132,426,hypothetical protein [Blautia sp. AF17-9LB]RH...
1415,MGYP003396736581,YP_010114620.1 PhoH-like phosphate starvation-...,0.947,228,12,0,1,228,1,228,2.440000e-138,444,PhoH-like phosphate starvation-inducible [Flav...
1812,MGYP000014746922,WP_118937028.1 MULTISPECIES: hypothetical prot...,0.987,79,1,0,1,79,1,79,2.079000e-43,162,MULTISPECIES: hypothetical protein [Bacteroide...
2014,MGYP003109397898,ASE99822.1 putative peptidase M15 [uncultured ...,0.911,158,14,0,1,158,1,158,1.794000e-93,311,putative peptidase M15 [uncultured virus]
...,...,...,...,...,...,...,...,...,...,...,...,...,...
398689,MGYP001219582601,BCV03744.1 MAG: hypothetical protein CM15mV72_...,0.931,29,2,0,1,29,1,29,1.003000e-06,56,MAG: hypothetical protein CM15mV72_360 [uncult...
399076,MGYP003109038852,QPZ53650.1 hypothetical protein HTVC203P_gp26 ...,0.935,93,6,0,1,93,1,93,2.630000e-45,168,hypothetical protein HTVC203P_gp26 [Pelagibact...
399081,MGYP003109038852,QGZ17672.1 hypothetical protein HTVC023P_gp43 ...,0.921,89,7,0,5,93,17,105,7.155000e-42,158,hypothetical protein HTVC023P_gp43 [Pelagibact...
399092,MGYP003109038852,BCV02102.1 MAG: hypothetical protein CM15mV49_...,0.950,80,4,0,5,84,3,82,1.302000e-37,146,MAG: hypothetical protein CM15mV49_700 [uncult...


Looks like all of these putative viral proteins are associated with encapsulins in our filtered dataset!

Let's see how many of these encapsulins we have:

In [54]:
cargo_mapping_dict = {}

for i, row in operon_df.iterrows():
    for index in indices:
        cargo = row[str(index)]
        #Skip empty operon positions so we don't create a "nan" key
        if isinstance(cargo, str):
            cargo_mapping_dict[cargo] = row["Encapsulin MGYP"]

def get_encapsulin(cargo):
    return cargo_mapping_dict[cargo]

virus_hits["Encapsulin MGYP"] = virus_hits["Query"].apply(get_encapsulin)
putative_mcps = virus_hits["Encapsulin MGYP"].unique()
len(putative_mcps)

293

293 out of our 1548 encapsulins may be associated with viral proteins! Let's see how many of these are annotated with a family type:

In [55]:
family_df = pd.read_csv("../encapsulin_families.csv")
family_df = family_df.query("`Encapsulin MGYP` in @putative_mcps")
family_df

Unnamed: 0,Encapsulin MGYP,Cargo Description,Cargo Search Method
41,MGYP001508311700,Ferritin,Manually curated (Pfam)
114,MGYP000488071151,Cysteine Desulfurase,family2_ntd_consensus_motif
115,MGYP000284114529,Cysteine Desulfurase,family2_ntd_consensus_motif
172,MGYP003110288594,arylpolyene BGC,antiSMASH
173,MGYP001216717877,Saccharide BGC,DeepBGC
177,MGYP003113059926,Saccharide BGC,DeepBGC
178,MGYP003131024615,Saccharide BGC,DeepBGC
185,MGYP003626701920,Saccharide BGC,DeepBGC
187,MGYP003662477660,Saccharide BGC,DeepBGC


Let's investigate these Saccharide BGC hits further:

In [56]:
saccharide_bgc_encs = family_df[family_df["Cargo Description"] == "Saccharide BGC"]["Encapsulin MGYP"].unique()

saccharide_bgc_cargo_hits = virus_hits.query("`Encapsulin MGYP` in @saccharide_bgc_encs")
saccharide_bgc_cargo_hits

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Description,Encapsulin MGYP
13875,MGYP000909089093,BAQ88196.1 hypothetical protein [uncultured Me...,0.925,40,3,0,1,40,1,40,9.984e-14,78,hypothetical protein [uncultured Mediterranean...,MGYP003662477660
53109,MGYP003627352942,BAR35889.1 hypothetical protein [uncultured Me...,0.911,90,8,0,49,138,6,95,2.507e-42,162,hypothetical protein [uncultured Mediterranean...,MGYP003662477660
53111,MGYP003627352942,ASE99985.1 hypothetical protein [uncultured vi...,0.919,87,7,0,52,138,10,96,1.667e-41,160,hypothetical protein [uncultured virus],MGYP003662477660
53113,MGYP003627352942,BCV01713.1 MAG: hypothetical protein CM15mV45_...,0.919,87,7,0,52,138,11,97,1.108e-40,158,MAG: hypothetical protein CM15mV45_060 [uncult...,MGYP003662477660
53119,MGYP003627352942,BAQ89509.1 hypothetical protein [uncultured Me...,0.909,88,8,0,51,138,7,94,5.373e-40,156,hypothetical protein [uncultured Mediterranean...,MGYP003662477660
53121,MGYP003627352942,URG13092.1 hypothetical protein [phage 023Pt_p...,0.908,87,8,0,52,138,7,93,2.6050000000000003e-39,154,hypothetical protein [phage 023Pt_psg01],MGYP003662477660
67502,MGYP000191352606,ASN63298.1 co-chaperonin GroES [uncultured virus],0.987,155,2,0,1,155,1,155,2.817e-96,319,co-chaperonin GroES [uncultured virus],MGYP003662477660
137075,MGYP003113059623,QDP55842.1 MAG: hypothetical protein Tp1100SUR...,0.933,45,3,0,1,45,46,90,7.974e-16,81,MAG: hypothetical protein Tp1100SUR639781_55 [...,MGYP003113059926
137078,MGYP003113059623,BCV05827.1 MAG: hypothetical protein CM15mV118...,0.933,45,3,0,1,45,43,87,2.844e-15,79,MAG: hypothetical protein CM15mV118_230 [uncul...,MGYP003113059926
143795,MGYP001144230547,BCV06039.1 MAG: hypothetical protein CM15mV124...,0.913,92,8,0,87,178,3,94,3.98e-45,173,MAG: hypothetical protein CM15mV124_300 [uncul...,MGYP003662477660


In [57]:
# The code below loads the Pfam labels dictionary downloaded from the GoogleResearch/Proteinfer GitHub repo
# I've also filled in some important missing labels that I manually curated
import gzip
import json

with gzip.open("../DBs/label_descriptions.json.gz", "rt") as f:
    labels_dict = json.load(f)

labels_dict["PF19821"] = "Phage capsid protein"
labels_dict["PF19307"] = "Phage capsid-like protein"
labels_dict['PF19289'] = "PmbA/TldA metallopeptidase C-terminal domain"
labels_dict['PF19290'] = "PmbA/TldA metallopeptidase central domain"
labels_dict['PF20211'] = "Family of unknown function (DUF6571)"
labels_dict['PF19782'] = "Family of unknown function (DUF6267)"
labels_dict['PF19343'] = "Family of unknown function (DUF5923)"
labels_dict['PF20036'] = "Major capsid protein 13-like"
labels_dict['PF18960'] = "Family of unknown function (DUF5702)"
labels_dict['PF18906'] = "Phage tail tube protein"
labels_dict['PF19753'] = "Family of unknown function (DUF6240)"

def get_label(pfam):
    try:
        return(labels_dict[pfam])
    except KeyError:
        return(None)

pfam_df = pd.read_csv("../pfams/cargo_pfams.tsv", sep="\t", names=["Cargo MGYP", "Pfam", "Start", "End"])
saccharide_bgc_cargos = saccharide_bgc_cargo_hits["Query"].unique()
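#Copy the filtered frame so that adding a column below doesn't trigger SettingWithCopyWarning
pfam_df = pfam_df.query("`Cargo MGYP` in @saccharide_bgc_cargos").copy()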
pfam_df["Description"] = pfam_df["Pfam"].apply(get_label)

saccharide_bgc_cargo_hits = saccharide_bgc_cargo_hits.merge(pfam_df, left_on="Query", right_on="Cargo MGYP")
saccharide_bgc_cargo_hits

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore,Description_x,Encapsulin MGYP,Cargo MGYP,Pfam,Start,End,Description_y
0,MGYP003627352942,BAR35889.1 hypothetical protein [uncultured Me...,0.911,90,8,0,49,138,6,95,2.507e-42,162,hypothetical protein [uncultured Mediterranean...,MGYP003662477660,MGYP003627352942,PF19846,63,137,
1,MGYP003627352942,BAR35889.1 hypothetical protein [uncultured Me...,0.911,90,8,0,49,138,6,95,2.507e-42,162,hypothetical protein [uncultured Mediterranean...,MGYP003662477660,MGYP003627352942,PF01632,89,119,Ribosomal protein L35
2,MGYP003627352942,ASE99985.1 hypothetical protein [uncultured vi...,0.919,87,7,0,52,138,10,96,1.667e-41,160,hypothetical protein [uncultured virus],MGYP003662477660,MGYP003627352942,PF19846,63,137,
3,MGYP003627352942,ASE99985.1 hypothetical protein [uncultured vi...,0.919,87,7,0,52,138,10,96,1.667e-41,160,hypothetical protein [uncultured virus],MGYP003662477660,MGYP003627352942,PF01632,89,119,Ribosomal protein L35
4,MGYP003627352942,BCV01713.1 MAG: hypothetical protein CM15mV45_...,0.919,87,7,0,52,138,11,97,1.108e-40,158,MAG: hypothetical protein CM15mV45_060 [uncult...,MGYP003662477660,MGYP003627352942,PF19846,63,137,
5,MGYP003627352942,BCV01713.1 MAG: hypothetical protein CM15mV45_...,0.919,87,7,0,52,138,11,97,1.108e-40,158,MAG: hypothetical protein CM15mV45_060 [uncult...,MGYP003662477660,MGYP003627352942,PF01632,89,119,Ribosomal protein L35
6,MGYP003627352942,BAQ89509.1 hypothetical protein [uncultured Me...,0.909,88,8,0,51,138,7,94,5.373e-40,156,hypothetical protein [uncultured Mediterranean...,MGYP003662477660,MGYP003627352942,PF19846,63,137,
7,MGYP003627352942,BAQ89509.1 hypothetical protein [uncultured Me...,0.909,88,8,0,51,138,7,94,5.373e-40,156,hypothetical protein [uncultured Mediterranean...,MGYP003662477660,MGYP003627352942,PF01632,89,119,Ribosomal protein L35
8,MGYP003627352942,URG13092.1 hypothetical protein [phage 023Pt_p...,0.908,87,8,0,52,138,7,93,2.6050000000000003e-39,154,hypothetical protein [phage 023Pt_psg01],MGYP003662477660,MGYP003627352942,PF19846,63,137,
9,MGYP003627352942,URG13092.1 hypothetical protein [phage 023Pt_p...,0.908,87,8,0,52,138,7,93,2.6050000000000003e-39,154,hypothetical protein [phage 023Pt_psg01],MGYP003662477660,MGYP003627352942,PF01632,89,119,Ribosomal protein L35
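
In the merged table above, every BLAST hit is repeated once per Pfam domain on the cargo. A minimal sketch of one way to avoid that, on made-up toy data: collapse the Pfam table to one row per cargo first, then merge. The `pfams_per_cargo` name is hypothetical.

```python
import pandas as pd

# Toy BLAST hits and Pfam annotations for a single cargo (made-up IDs)
toy_hits = pd.DataFrame({
    "Query": ["CARGO_A", "CARGO_A"],
    "Target": ["X1.1 hypothetical protein [phage 1]", "X2.1 hypothetical protein [phage 2]"],
})
toy_pfams = pd.DataFrame({
    "Cargo MGYP": ["CARGO_A", "CARGO_A"],
    "Pfam": ["PF19846", "PF01632"],
})

# Collapse to one row per cargo, with its Pfam accessions gathered into a list
pfams_per_cargo = toy_pfams.groupby("Cargo MGYP")["Pfam"].agg(list).reset_index()

# Each BLAST hit now appears once, carrying the full Pfam list
merged = toy_hits.merge(pfams_per_cargo, how="left", left_on="Query", right_on="Cargo MGYP")
print(merged[["Query", "Target", "Pfam"]])
```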
