# Checking Predicted Encapsulin BGCs for Cargo Loading Peptide Hits

We have a set of curated BGC predictions for our encapsulin hits, collated from both DeepBGC and antiSMASH. We also have MMseqs2 search results from querying a few different cargo loading peptides (CLPs) against all of our cargo proteins.

Let's check if any of our predicted BGCs contain proteins with a putative CLP sequence. First, let's load a list of encapsulin MGYPs that are in predicted BGCs, and get their accompanying cargo proteins:

In [37]:
import pandas as pd
from collections import defaultdict

#Get all encapsulin MGYPs that are part of a predicted BGC
family_df = pd.read_csv("../encapsulin_families.csv")
bgc_encapsulins = family_df[family_df["Cargo Description"].str.contains("BGC")]["Encapsulin MGYP"].to_list()
len(bgc_encapsulins)

#Get all cargo proteins +/- 10 CDSes away from each of these encapsulins
operon_df = pd.read_csv("../operon_df_filtered.csv").query("`Encapsulin MGYP` in @bgc_encapsulins").fillna("None")

#Iterate through each encapsulin and make a dictionary where keys are the encapsulin MGYPs and values are lists of cargo MGYPs
cargo_mapping_dict = defaultdict(list)

for row in operon_df.iloc[:, :21].to_dict(orient="records"):
    cargo_mapping_dict[row["Encapsulin MGYP"]].extend([mgyp for mgyp in list(row.values())[:-1] if mgyp != "None"])

#Each encapsulin MGYP appears in multiple contigs so we might have duplicate cargo MGYPs in each list
for encapsulin, cargos in cargo_mapping_dict.items():
    cargo_mapping_dict[encapsulin] = list(set(cargos))

Now we can iterate through each encapsulin MGYP and check for any matches among the cargo loading peptide search hits:

In [48]:
clp_df = pd.read_csv("../clp_cargo_hits.m8", sep="\t", names=["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
                                                              "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"])

for encapsulin, cargos in cargo_mapping_dict.items():
    out_df = clp_df.query("Target in @cargos")

    if len(out_df) > 0:
        print(out_df.loc[:, ["Query", "Target", "Identity", "E-Value"]])

                       Query            Target  Identity  E-Value
205  all_cargo_clp_consensus  MGYP001259485910     0.833  2438.00
342    family1_clp_consensus  MGYP000270495616     0.888    18.75
                     Query            Target  Identity  E-Value
488  family1_clp_consensus  MGYP003109322410       1.0   7279.0
                     Query            Target  Identity  E-Value
383  family1_clp_consensus  MGYP003561695898     0.833    383.4
                       Query            Target  Identity  E-Value
108  all_cargo_clp_consensus  MGYP003631920257       1.0    383.4
                       Query            Target  Identity  E-Value
58   all_cargo_clp_consensus  MGYP003667787324     1.000    181.3
68   all_cargo_clp_consensus  MGYP003667787534     0.857    181.3
356    family1_clp_consensus  MGYP003667787324     0.875    124.5
410    family1_clp_consensus  MGYP003628557755     1.000    806.9
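
The loop above prints one table per encapsulin, but the printed tables do not say which encapsulin each one belongs to. One way to keep that link is to invert the mapping so each cargo points back to its encapsulin(s). The block below is a minimal sketch of that idea; `invert_cargo_mapping` and the `ENC_*`/`CARGO_*` identifiers are hypothetical and are not part of the original analysis.

```python
from collections import defaultdict


def invert_cargo_mapping(cargo_mapping):
    """Map each cargo ID to the sorted list of encapsulin IDs it neighbours."""
    cargo_to_encapsulins = defaultdict(set)
    for encapsulin, cargos in cargo_mapping.items():
        for cargo in cargos:
            cargo_to_encapsulins[cargo].add(encapsulin)
    # Convert the sets to sorted lists so the result is deterministic
    return {cargo: sorted(encs) for cargo, encs in cargo_to_encapsulins.items()}


# Toy identifiers standing in for real MGYPs
toy_mapping = {
    "ENC_A": ["CARGO_1", "CARGO_2"],
    "ENC_B": ["CARGO_2", "CARGO_3"],
}

cargo_to_encapsulins = invert_cargo_mapping(toy_mapping)
print(cargo_to_encapsulins["CARGO_2"])  # ['ENC_A', 'ENC_B']
```

With the real data, something like `clp_df["Target"].map(invert_cargo_mapping(cargo_mapping_dict))` should attach the encapsulin(s) to every hit in a single pass, with targets that are not in any predicted BGC left as missing values.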


Looks like we have a few potentially interesting hits! However, we don't yet know whether these CLP hits sit at the N- or C-terminus of each protein, or somewhere in the middle. Let's check this:

In [55]:
from Bio import SeqIO

#First, let's make a dictionary where the keys are cargo MGYPs and values are the protein lengths
length_dict = {record.id: len(record.seq) for record in SeqIO.parse("../seqs/all_putative_cargo_proteins.fasta", "fasta")}

#Let's make a function to get the length of a cargo MGYP and apply it to the DataFrame
def get_length(cargo_mgyp):
    return length_dict[cargo_mgyp]

#Now let's iterate through our encapsulins again and add the length info
for encapsulin, cargos in cargo_mapping_dict.items():
    #Take an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
    out_df = clp_df.query("Target in @cargos").copy()

    if len(out_df) > 0:
        out_df["Cargo Length"] = out_df["Target"].apply(get_length)
        #Number of residues between the end of the CLP hit and the end of the protein
        out_df["Distance C-Terminal"] = out_df["Cargo Length"] - out_df["Target End"]
        print(out_df.loc[:, ["Target", "Target End", "Identity", "Cargo Length", "Distance C-Terminal"]])

               Target  Target End  Identity  Cargo Length  Distance C-Terminal
205  MGYP001259485910         149     0.833           336                  187
342  MGYP000270495616         132     0.888           138                    6
               Target  Target End  Identity  Cargo Length  Distance C-Terminal
488  MGYP003109322410         129       1.0           835                  706
               Target  Target End  Identity  Cargo Length  Distance C-Terminal
383  MGYP003561695898         185     0.833           187                    2
               Target  Target End  Identity  Cargo Length  Distance C-Terminal
108  MGYP003631920257         108       1.0           240                  132
               Target  Target End  Identity  Cargo Length  Distance C-Terminal
58   MGYP003667787324         123     1.000           127                    4
68   MGYP003667787534         253     0.857           314                   61
356  MGYP003667787324         123     0.875         

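
The `Distance C-Terminal` column only covers one end of the protein. To answer the N-terminal / C-terminal / internal question directly, each hit could be labelled from its start, end, and the protein length. The following is a minimal sketch under stated assumptions: coordinates are 1-based and inclusive as in the m8 table, and the 10-residue `window` is an arbitrary illustrative cutoff rather than a value used in this analysis. The function name and the toy coordinates are hypothetical.

```python
def classify_hit_position(target_start, target_end, cargo_length, window=10):
    """Label a hit as 'N-terminal', 'C-terminal' or 'internal'.

    Coordinates are assumed to be 1-based and inclusive.
    `window` is the maximum distance from a terminus (an assumed cutoff).
    """
    dist_n = target_start - 1            # residues before the hit
    dist_c = cargo_length - target_end   # residues after the hit

    # Prefer the C-terminus only when it is at least as close as the N-terminus
    if dist_c <= window and dist_c <= dist_n:
        return "C-terminal"
    if dist_n <= window:
        return "N-terminal"
    return "internal"


# Toy coordinates, not taken from the real hits
print(classify_hit_position(120, 136, 140))  # C-terminal
print(classify_hit_position(3, 20, 300))     # N-terminal
print(classify_hit_position(100, 118, 300))  # internal
```

Applied row-wise to `Target Start`, `Target End`, and `Cargo Length`, this would give one label per hit instead of requiring the distances to be read off by eye.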

Looks like we have a few cargo MGYPs that are found in BGCs and that have predicted CLP sequences ending within 2-6 residues of the C-terminus! Let's load the Pfam data to see what their predicted functions are:

In [59]:
# The below code will load the Pfam labels dictionary downloaded from the GoogleResearch/Proteinfer GitHub repo
# I've also manually added some code below to fill in some important missing labels that I manually curated

import gzip
import json

with gzip.open("../DBs/label_descriptions.json.gz", "rt") as f:
    labels_dict = json.load(f)

labels_dict["PF19821"] = "Phage capsid protein"
labels_dict["PF19307"] = "Phage capsid-like protein"
labels_dict['PF19289'] = "PmbA/TldA metallopeptidase C-terminal domain"
labels_dict['PF19290'] = "PmbA/TldA metallopeptidase central domain"
labels_dict['PF20211'] = "Family of unknown function (DUF6571)"
labels_dict['PF19782'] = "Family of unknown function (DUF6267)"
labels_dict['PF19343'] = "Family of unknown function (DUF5923)"
labels_dict['PF20036'] = "Major capsid protein 13-like"
labels_dict['PF18960'] = "Family of unknown function (DUF5702)"
labels_dict['PF18906'] = "Phage tail tube protein"
labels_dict['PF19753'] = "Family of unknown function (DUF6240)"

def get_label(pfam):
    # Return None for any Pfam accession without a known description
    return labels_dict.get(pfam)

In [62]:
#Now we can load the Pfam DataFrame and check the pfams for each of our putative cargo sequences with CLP

bgc_cargo_mgyps = ["MGYP003667787324", "MGYP003561695898", "MGYP000270495616"]

pfam_df = pd.read_csv("../pfams/cargo_pfams.tsv", sep="\t", names=["Cargo MGYP", "Pfam", "Start", "End"])
pfam_df["Description"] = pfam_df["Pfam"].apply(get_label)

pfam_df.query("`Cargo MGYP` in @bgc_cargo_mgyps").sort_values(by="Cargo MGYP")

              Cargo MGYP     Pfam  Start  End                            Description
606615  MGYP000270495616  PF00619     15   58             Caspase recruitment domain
699138  MGYP000270495616   CL0044     21   40                               Ferritin
262486  MGYP003561695898   CL0044      2  139                               Ferritin
833208  MGYP003561695898  PF01814      1  131  Hemerythrin HHE cation binding domain
288115  MGYP003667787324   CL0041      8   90                                  Death


This is super odd! `MGYP000270495616` and `MGYP003561695898` are annotated as ferritins, which means their encapsulins should already be assigned as family 1 encapsulins rather than as BGC encapsulins.

Let's get their encapsulin MGYPs and check our manually curated `encapsulin_families.csv` file to see if I've already annotated these:

In [66]:
family_df = pd.read_csv("../encapsulin_families.csv")

for encapsulin_mgyp, cargos in cargo_mapping_dict.items():
    for cargo in bgc_cargo_mgyps:
        if cargo in cargos:
            print(cargo)
            print(family_df[family_df["Encapsulin MGYP"] == encapsulin_mgyp])

MGYP000270495616
      Encapsulin MGYP Cargo Description Cargo Search Method
120  MGYP000432667684     NRPS-like BGC           antiSMASH
MGYP003561695898
      Encapsulin MGYP Cargo Description Cargo Search Method
121  MGYP003561695899    T3PKS-like BGC           antiSMASH
MGYP003667787324
      Encapsulin MGYP Cargo Description Cargo Search Method
119  MGYP003667787347         NAGGN BGC           antiSMASH


`MGYP003561695899` exactly matches a known encapsulin sequence, `V4JJW8`, a family 1 encapsulin annotated with a hemerythrin cargo. We do see this hemerythrin cargo in our antiSMASH-predicted BGC, just upstream of the encapsulin, followed by the T3PKS-like BGC. We may need to remove this encapsulin from our dataset, since it's already annotated with a cargo type and has a putative CLP. Something to discuss and think about here.

`MGYP003667787324` is a putative cargo protein with a CLP hit, found immediately upstream of encapsulin `MGYP003667787347`. antiSMASH detects an NAGGN BGC just upstream of it; however, BLASTing the sequence of the gene gives 60% identity to a ferritin. This antiSMASH prediction may therefore be a false positive, since the encapsulin is right next to a potential ferritin with a CLP. Annoying!

Similar story with `MGYP000270495616`: it sits downstream of a putative NRPS-like operon, but it is a predicted ferritin with ≈68% identity to ferritin and a putative CLP sequence.
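
The manual check above (a BGC-assigned encapsulin whose neighbouring cargo has both a C-terminal CLP hit and a ferritin-like annotation) could be automated as a first-pass filter. The block below is a rough sketch of that idea on toy data. The function name, the toy identifiers, and the keyword list are all assumptions made for illustration, and anything it flags would still need manual review.

```python
def flag_for_review(cargo_mapping, c_terminal_clp_cargos, cargo_annotations,
                    keywords=("ferritin", "hemerythrin")):
    """Return {encapsulin: [cargos]} for encapsulins with a neighbouring cargo
    that has a C-terminal CLP hit AND an annotation matching a keyword.

    `keywords` is an assumed list of family 1-like cargo terms.
    """
    flagged = {}
    for encapsulin, cargos in cargo_mapping.items():
        suspicious = []
        for cargo in cargos:
            if cargo not in c_terminal_clp_cargos:
                continue  # no CLP hit near the C-terminus
            descriptions = cargo_annotations.get(cargo, [])
            if any(k in d.lower() for d in descriptions for k in keywords):
                suspicious.append(cargo)
        if suspicious:
            flagged[encapsulin] = sorted(suspicious)
    return flagged


# Toy data standing in for cargo_mapping_dict, the CLP hits and the Pfam table
toy_mapping = {"ENC_A": ["CARGO_1", "CARGO_2"], "ENC_B": ["CARGO_3"], "ENC_C": ["CARGO_4"]}
toy_clp_cargos = {"CARGO_1", "CARGO_3", "CARGO_4"}
toy_annotations = {
    "CARGO_1": ["Ferritin"],
    "CARGO_2": ["Ferritin"],   # ferritin-like but no CLP hit
    "CARGO_3": ["Death"],      # CLP hit but no matching annotation
    "CARGO_4": ["Hemerythrin HHE cation binding domain"],
}

flagged = flag_for_review(toy_mapping, toy_clp_cargos, toy_annotations)
print(flagged)  # {'ENC_A': ['CARGO_1'], 'ENC_C': ['CARGO_4']}
```

A keyword filter like this would miss cases such as `MGYP003667787324`, where the ferritin similarity came from BLAST rather than from the Pfam label, so it would complement the manual inspection rather than replace it.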