# Putative Cargo Loading Peptides

We searched several putative cargo loading peptide (CLP) motifs and family 2 N-terminal domain (NTD) motifs against all of our putative cargo proteins. Let's look at the hits and see whether any of them point to potential encapsulin cargo proteins.

First, we can load our CLP hits data:

In [9]:
import pandas as pd

# Column names for the 12-field tabular (.m8) search output, which has no header row
m8_columns = ["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
              "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"]

clp_df = pd.read_csv("../clp_cargo_hits.m8", sep="\t", names=m8_columns)

clp_df.head()

| | Query | Target | Identity | Alignment Length | Mismatches | Gap Openings | Query Start | Query End | Target Start | Target End | E-Value | Bitscore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all_cargo_clp_consensus | MGYP003667212098 | 0.923 | 13 | 0 | 1 | 1 | 13 | 114 | 125 | 0.001832 | 33 |
| 1 | all_cargo_clp_consensus | MGYP001555959371 | 0.846 | 13 | 1 | 1 | 1 | 13 | 337 | 348 | 0.08722 | 28 |
| 2 | all_cargo_clp_consensus | MGYP003362106646 | 0.909 | 11 | 0 | 1 | 1 | 11 | 357 | 366 | 0.1885 | 27 |
| 3 | all_cargo_clp_consensus | MGYP001599236885 | 0.833 | 12 | 1 | 1 | 1 | 12 | 336 | 346 | 0.4072 | 26 |
| 4 | all_cargo_clp_consensus | MGYP001627912779 | 0.875 | 8 | 1 | 0 | 6 | 13 | 120 | 127 | 1.29 | 24 |


Let's add in the length of each target protein so we can find out how close our hits are to the C-terminus of the protein.

(We don't need a separate calculation for the N-terminus: `Target Start` is a 1-based position, so it already tells us how far the hit is from the N-terminus.)

In [17]:
from Bio import SeqIO

# Build a dictionary mapping each cargo MGYP accession to its protein length
length_dict = {record.id: len(record.seq) for record in SeqIO.parse("../seqs/all_putative_cargo_proteins.fasta", "fasta")}

# Look up the length of a cargo MGYP; this is applied to the DataFrame below
def get_length(cargo_mgyp):
    return length_dict[cargo_mgyp]

clp_df["Length"] = clp_df["Target"].apply(get_length)
clp_df["C-terminus Distance"] = clp_df["Length"] - clp_df["Target End"]
clp_df.head()

| | Query | Target | Identity | Alignment Length | Mismatches | Gap Openings | Query Start | Query End | Target Start | Target End | E-Value | Bitscore | Length | C-terminus Distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all_cargo_clp_consensus | MGYP003667212098 | 0.923 | 13 | 0 | 1 | 1 | 13 | 114 | 125 | 0.001832 | 33 | 128 | 3 |
| 1 | all_cargo_clp_consensus | MGYP001555959371 | 0.846 | 13 | 1 | 1 | 1 | 13 | 337 | 348 | 0.08722 | 28 | 352 | 4 |
| 2 | all_cargo_clp_consensus | MGYP003362106646 | 0.909 | 11 | 0 | 1 | 1 | 11 | 357 | 366 | 0.1885 | 27 | 367 | 1 |
| 3 | all_cargo_clp_consensus | MGYP001599236885 | 0.833 | 12 | 1 | 1 | 1 | 12 | 336 | 346 | 0.4072 | 26 | 351 | 5 |
| 4 | all_cargo_clp_consensus | MGYP001627912779 | 0.875 | 8 | 1 | 0 | 6 | 13 | 120 | 127 | 1.29 | 24 | 127 | 0 |
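
The distance bookkeeping can be sketched for a single hit as follows. This is a minimal illustration, assuming 1-based, inclusive alignment coordinates; `terminus_distances` is a hypothetical helper introduced here and is not part of the analysis code. The two calls reuse the coordinates of the first and last rows shown above.

```python
def terminus_distances(target_start, target_end, protein_length):
    """Return (n_term_distance, c_term_distance) for one hit.

    Assumes 1-based, inclusive alignment coordinates.
    """
    if not 1 <= target_start <= target_end <= protein_length:
        raise ValueError("hit coordinates fall outside the protein")
    n_term_distance = target_start - 1             # residues before the hit
    c_term_distance = protein_length - target_end  # residues after the hit
    return n_term_distance, c_term_distance

# First row above: hit at positions 114-125 in a 128-residue protein
print(terminus_distances(114, 125, 128))

# Last row above: hit at positions 120-127 in a 127-residue protein
print(terminus_distances(120, 127, 127))
```

A C-terminus distance of 0 means the hit runs to the very last residue of the protein.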


Let's grab our Pfam data too:

In [18]:
# Load the Pfam labels dictionary downloaded from the GoogleResearch/Proteinfer GitHub repo.
# Some important labels are missing from that file; manually curated descriptions for them are added below.

import gzip
import json

with gzip.open("../DBs/label_descriptions.json.gz", "rt", encoding="utf-8") as f:
    labels_dict = json.load(f)

labels_dict["PF19821"] = "Phage capsid protein"
labels_dict["PF19307"] = "Phage capsid-like protein"
labels_dict['PF19289'] = "PmbA/TldA metallopeptidase C-terminal domain"
labels_dict['PF19290'] = "PmbA/TldA metallopeptidase central domain"
labels_dict['PF20211'] = "Family of unknown function (DUF6571)"
labels_dict['PF19782'] = "Family of unknown function (DUF6267)"
labels_dict['PF19343'] = "Family of unknown function (DUF5923)"
labels_dict['PF20036'] = "Major capsid protein 13-like"
labels_dict['PF18960'] = "Family of unknown function (DUF5702)"
labels_dict['PF18906'] = "Phage tail tube protein"
labels_dict['PF19753'] = "Family of unknown function (DUF6240)"

def get_label(pfam):
    # Return None for Pfam accessions that have no description
    return labels_dict.get(pfam)

pfam_df = pd.read_csv("../pfams/cargo_pfams.tsv", sep="\t", names=["Cargo MGYP", "Pfam", "Start", "End"])
pfam_df["Description"] = pfam_df["Pfam"].apply(get_label)
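
The lookup-with-fallback pattern can be sketched on a stand-in dictionary. This is only an illustration: `toy_labels` and `describe` are hypothetical names, and `PF00000` is a made-up accession used to show what happens when a label is missing. Listing the unlabeled accessions is one way to find entries that still need manual curation.

```python
# Stand-in for the real labels dictionary (one entry copied from the curated labels)
toy_labels = {"PF19821": "Phage capsid protein"}

def describe(pfam, labels):
    """Return the description for a Pfam accession, or None if it is unlabeled."""
    return labels.get(pfam)

# "PF00000" is a made-up accession used to show the fallback
accessions = ["PF19821", "PF00000"]
for accession in accessions:
    print(accession, describe(accession, toy_labels))

# Accessions that still lack a label are candidates for manual curation
missing = sorted({a for a in accessions if describe(a, toy_labels) is None})
print(missing)
```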

Now let's examine the Pfam annotations of the CLP hits that lie within 15 residues of the C-terminus or the N-terminus:

In [20]:
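# Keep hits that end fewer than 15 residues from the C-terminus, one hit per target protein
C_term_hits = clp_df[clp_df["C-terminus Distance"] < 15].drop_duplicates(subset="Target")

# Print the Pfam descriptions of each target protein that has at least one annotation
for row in C_term_hits.to_dict(orient="records"):
    descriptions = pfam_df[pfam_df["Cargo MGYP"] == row["Target"]]["Description"]
    if len(descriptions) > 0:
        print(descriptions.values)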

['Ferritin' 'Caspase recruitment domain']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Ferritin']
['Ubiquinone biosynthesis protein COQ7']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Ferritin' 'Ferritin-like domain']
['Rubrerythrin']
['PH']
['Death']
['Death']
['Chaperonin 10 Kd subunit' 'Chaperonin 10 Kd subunit']
['Ferritin']
['Death']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Ferritin' 'Death']
['Death']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
['Dyp-type peroxidase family']
[

In [21]:
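# Keep hits that start within the first 14 residues of the protein, one hit per target protein
N_term_hits = clp_df[clp_df["Target Start"] < 15].drop_duplicates(subset="Target")

# Print the Pfam descriptions of each target protein that has at least one annotation
for row in N_term_hits.to_dict(orient="records"):
    descriptions = pfam_df[pfam_df["Cargo MGYP"] == row["Target"]]["Description"]
    if len(descriptions) > 0:
        print(descriptions.values)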

['Clp protease']
['Clp protease']
['Clp protease']
['Clp protease']
['Glycine rich protein']
['Beta_propeller' 'Beta_propeller']
[None]
['Gluconate 2-dehydrogenase subunit 3']
['Regulator of chromosome condensation (RCC1) repeat' 'Peptidase_CA']
['Peptidase family M20/M25/M40' 'Peptidase dimerisation domain']
['Peptidase dimerisation domain' 'Peptidase family M20/M25/M40']
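
Long printouts like the ones above are easier to scan as a tally. The sketch below counts how many proteins carry each Pfam description, assuming the per-protein descriptions have been collected into a list of lists. `toy_domain_lists` is a small hypothetical stand-in, not the real output, so its counts say nothing about the actual data.

```python
from collections import Counter

# Toy stand-in: one list of Pfam descriptions per terminal hit (not the real results)
toy_domain_lists = [
    ["Ferritin", "Caspase recruitment domain"],
    ["Dyp-type peroxidase family"],
    ["Dyp-type peroxidase family"],
    ["Ferritin"],
    [None],
]

# Count each description once per protein and skip unlabeled (None) entries
tally = Counter(
    description
    for domains in toy_domain_lists
    for description in set(domains)
    if description is not None
)

for description, count in tally.most_common():
    print(f"{count:>3}  {description}")
```

Using `set(domains)` keeps a protein with a repeated domain (for example, two copies of the same description) from being counted twice.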
