# Filtering Encapsulin Hits by Identity

We have a dataset of the best UniRef90 hits for each of our metagenomic encapsulin sequences (`notebooks/encapsulin_uniref90_hits.ipynb`).

We've also seen from prior analysis (`notebooks/encapsulin_phage_hits.ipynb`) that encapsulins generally share at most about 40% identity with phage capsid proteins. Therefore, if one of our encapsulin hits shares 90% or more identity with a phage protein, we can be reasonably confident that it is a phage capsid protein rather than an encapsulin.

Let's investigate these high identity hits and see if any match phage proteins (which we can remove). First, we load our DataFrame:

In [1]:
import pandas as pd

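# Column names for the 12-column tabular hits file (the file itself has no header row)
column_names = [
    "Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
    "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore",
]

hits_df = pd.read_csv("../encapsulin_UniRef90_hits.tsv", sep="\t", names=column_names)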

hits_df.head()

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
0,MGYP001573638360,UniRef90_Q8YPL4 All4180 protein n=2 Tax=Nostoc...,0.981,272,5,0,4,275,148,419,2.633e-171,542
1,MGYP003681970701,UniRef90_A0A382GAI9 Uncharacterized protein (F...,0.759,361,87,0,1,361,1,361,9.715e-174,554
2,MGYP001775299270,UniRef90_UPI0020A1C726 HAMP domain-containing ...,0.615,445,166,4,26,465,19,463,4.263e-178,571
3,MGYP003640887170,UniRef90_E6QZK2 Uncharacterized protein n=4 Ta...,0.266,312,206,7,13,311,8,309,1.599e-25,120
4,MGYP003111022402,UniRef90_A0A0Q4UCX7 DksA C4-type domain-contai...,0.301,229,133,9,37,259,394,601,2.453e-09,70


We're only interested in hits above 90% identity:

In [2]:
hits_df = hits_df[hits_df["Identity"] >= 0.9].sort_values(by="Identity", ascending=False).drop_duplicates(subset="Query")
hits_df

Unnamed: 0,Query,Target,Identity,Alignment Length,Mismatches,Gap Openings,Query Start,Query End,Target Start,Target End,E-Value,Bitscore
1472,MGYP000451522494,UniRef90_A0A0A0RUU4 Cytochrome b n=3 Tax=Alydi...,1.000,389,0,0,1,389,1,389,1.496000e-254,787
1292,MGYP000324372200,UniRef90_UPI0016771803 class E sortase n=4 Tax...,1.000,477,0,0,1,477,1,477,6.158000e-317,972
165,MGYP003132644972,UniRef90_UPI0021E5136F hypothetical protein n=...,1.000,412,0,0,1,412,1,412,4.187000e-273,842
168,MGYP000518926698,UniRef90_A0A2G5L1A1 ABC transporter permease n...,1.000,588,0,0,1,588,1,588,0.000000e+00,1156
1213,MGYP001375024420,UniRef90_UPI0007775DFE helix-turn-helix domain...,1.000,156,0,0,1,156,303,458,4.533000e-97,319
...,...,...,...,...,...,...,...,...,...,...,...,...
204,MGYP003110727254,UniRef90_UPI0020CF805B family 20 glycosylhydro...,0.904,439,42,0,1,439,1,439,8.731000e-258,799
1467,MGYP003369226081,UniRef90_UPI000E2139DE ATP-grasp domain-contai...,0.903,485,47,0,1,485,1,485,8.723000e-291,897
400,MGYP003088767509,UniRef90_A0A1L7XIC9 Related to integral membra...,0.903,852,70,3,1,849,1,842,0.000000e+00,1554
45,MGYP003131803286,UniRef90_A0A6S6WMM4 Cysteine synthase n=16 Tax...,0.902,371,34,2,1,369,1,371,1.174000e-206,648
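
The filter above chains three steps: threshold, sort, and de-duplicate. As a minimal sketch on a toy DataFrame (the query names and identity values here are made up for illustration), this shows why sorting before `drop_duplicates` leaves the highest-identity hit for each query:

```python
import pandas as pd

# Toy data: hypothetical queries and identities, for illustration only
toy_df = pd.DataFrame({
    "Query": ["q1", "q1", "q2", "q3"],
    "Identity": [0.92, 0.99, 0.95, 0.50],
})

best_hits = (
    toy_df[toy_df["Identity"] >= 0.9]             # keep hits at or above 90% identity
    .sort_values(by="Identity", ascending=False)  # best hits first
    .drop_duplicates(subset="Query")              # keeps the first (best) row per query
)
print(best_hits)
```

Here `q1` appears twice above the threshold, and only its 0.99 row survives because `drop_duplicates` keeps the first occurrence by default.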


Let's check out what information we have in that `Target` field:

In [3]:
hits_df.iloc[0, 1]

'UniRef90_A0A0A0RUU4 Cytochrome b n=3 Tax=Alydidae TaxID=41702 RepID=A0A0A0RUU4_9HEMI'

Looks like we have not only the `Tax` field but, more importantly, the `TaxID` field. This is great - we can use it to query the taxonomy endpoint of the UniProt REST API and get back a lineage! Let's experiment with this below using a simple test case:

In [7]:
import requests

taxid = "37554"  # This is Escherichia phage HK97
url = "https://rest.uniprot.org/taxonomy/"

request = requests.get(f"{url}{taxid}")
print(f"Status code: {request.status_code}")

response_data = request.json()

for key, value in response_data.items():
    print(f"Key: {key}")
    print(f"Value: {value}")

Status code: 200
Key: scientificName
Value: Byrnievirus HK97
Key: taxonId
Value: 37554
Key: mnemonic
Value: 9CAUD
Key: parent
Value: {'scientificName': 'Byrnievirus', 'taxonId': 2842574}
Key: rank
Value: species
Key: hidden
Value: True
Key: active
Value: True
Key: otherNames
Value: ['Escherichia virus HK97']
Key: lineage
Value: [{'scientificName': 'Byrnievirus', 'taxonId': 2842574, 'rank': 'genus', 'hidden': False}, {'scientificName': 'Hendrixvirinae', 'taxonId': 2842527, 'rank': 'subfamily', 'hidden': False}, {'scientificName': 'Caudoviricetes', 'taxonId': 2731619, 'rank': 'class', 'hidden': False}, {'scientificName': 'Uroviricota', 'taxonId': 2731618, 'rank': 'phylum', 'hidden': False}, {'scientificName': 'Heunggongvirae', 'taxonId': 2731360, 'rank': 'kingdom', 'hidden': False}, {'scientificName': 'Duplodnaviria', 'taxonId': 2731341, 'rank': 'no rank', 'hidden': False}, {'scientificName': 'Viruses', 'taxonId': 10239, 'rank': 'superkingdom', 'hidden': False}]
Key: statistics
Value: {'

That `lineage` field, a list of dictionaries ordered from the closest ancestor to the most distant one, looks like the important part for us - we can send a request using a taxID and read the top-level group from the last entry in the list. We'll loosely call this the superkingdom; for our hits it will most likely be `cellular organisms` or (hopefully not) `Viruses`.
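
To make the indexing concrete without another API call, here is a minimal offline sketch. The dictionary is an abbreviated copy of the HK97 response printed above (some lineage entries and keys are dropped for brevity), and `top_level_name` is a hypothetical helper name, not part of any library:

```python
# Abbreviated copy of the HK97 response printed above (only the keys used here)
response_data = {
    "scientificName": "Byrnievirus HK97",
    "taxonId": 37554,
    "lineage": [
        {"scientificName": "Byrnievirus", "taxonId": 2842574, "rank": "genus"},
        {"scientificName": "Caudoviricetes", "taxonId": 2731619, "rank": "class"},
        {"scientificName": "Viruses", "taxonId": 10239, "rank": "superkingdom"},
    ],
}

def top_level_name(data):
    """Return the name of the last lineage entry, or None if there is no lineage."""
    lineage = data.get("lineage") or []
    return lineage[-1]["scientificName"] if lineage else None

print(top_level_name(response_data))
print(top_level_name({}))  # a response without a lineage gives None
```

Using `.get` means a response without a `lineage` key yields `None` instead of raising a `KeyError`.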

Let's write a function to get the superkingdom of a given taxID:

In [8]:
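def get_superkingdom(taxid):
    """Return the top-level lineage name for a taxID, or "None" on failure."""
    url = "https://rest.uniprot.org/taxonomy/"
    request = requests.get(f"{url}{taxid}")
    try:
        response_data = request.json()
    except ValueError:  # raised when the response body is not valid JSON
        return "None"

    # The last lineage entry is the top-level group (e.g. "Viruses")
    lineage = response_data.get("lineage")
    if not lineage:  # e.g. an error response with no lineage
        return "None"
    return lineage[-1]["scientificName"]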

print(f"E. coli: {get_superkingdom('562')}")
print(f"Phage HK97: {get_superkingdom('37554')}")

E. coli: cellular organisms
Phage HK97: Viruses


It works! Now, let's add a column to the DataFrame to indicate the taxID of each hit:

In [10]:
hits_df["taxID"] = hits_df["Target"].str.extract(r"TaxID=(\d+)")
hits_df = hits_df.loc[:, ["Query", "Target", "Identity", "taxID"]]
hits_df

Unnamed: 0,Query,Target,Identity,taxID
1472,MGYP000451522494,UniRef90_A0A0A0RUU4 Cytochrome b n=3 Tax=Alydi...,1.000,41702
1292,MGYP000324372200,UniRef90_UPI0016771803 class E sortase n=4 Tax...,1.000,1883
165,MGYP003132644972,UniRef90_UPI0021E5136F hypothetical protein n=...,1.000,2893553
168,MGYP000518926698,UniRef90_A0A2G5L1A1 ABC transporter permease n...,1.000,1889772
1213,MGYP001375024420,UniRef90_UPI0007775DFE helix-turn-helix domain...,1.000,315405
...,...,...,...,...
204,MGYP003110727254,UniRef90_UPI0020CF805B family 20 glycosylhydro...,0.904,1864822
1467,MGYP003369226081,UniRef90_UPI000E2139DE ATP-grasp domain-contai...,0.903,1761016
400,MGYP003088767509,UniRef90_A0A1L7XIC9 Related to integral membra...,0.903,576137
45,MGYP003131803286,UniRef90_A0A6S6WMM4 Cysteine synthase n=16 Tax...,0.902,2800384
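
The extraction above pulls out only the `TaxID` value. If the other header fields are ever needed, a regular expression with named groups can split the whole string. This is a sketch that assumes every header follows the layout of the examples shown in this notebook (cluster ID, free-text name, then `n=`, `Tax=`, `TaxID=` and `RepID=`); `parse_uniref_header` is a hypothetical helper name:

```python
import re

header = "UniRef90_A0A0A0RUU4 Cytochrome b n=3 Tax=Alydidae TaxID=41702 RepID=A0A0A0RUU4_9HEMI"

# Cluster ID, free-text name, then the n=, Tax=, TaxID= and RepID= fields
pattern = re.compile(
    r"^(?P<cluster>\S+) (?P<name>.+?) n=(?P<n>\d+) Tax=(?P<tax>.+?) "
    r"TaxID=(?P<taxid>\d+) RepID=(?P<repid>\S+)$"
)

def parse_uniref_header(text):
    """Return the header fields as a dict of strings, or None if the layout differs."""
    match = pattern.match(text)
    return match.groupdict() if match else None

print(parse_uniref_header(header))
```

Headers that do not match the assumed layout return `None` rather than a partial result, so they are easy to spot and inspect by hand.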


To minimize the number of API calls we make (let's be nice to UniProt!), we can collect all the unique taxIDs, build a dictionary that maps each one to its superkingdom, and then just add this to the DataFrame:

In [13]:
taxid_dict = {}

for taxid in hits_df["taxID"].unique():
    taxid_dict[taxid] = get_superkingdom(taxid)

for key, value in list(taxid_dict.items())[:5]:
    print(f"{key}: {value}")

41702: cellular organisms
1883: cellular organisms
2893553: cellular organisms
1889772: cellular organisms
315405: cellular organisms
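
The same idea can be wrapped in a small helper that also skips missing taxIDs (which would appear if a header had no `TaxID=` field) and optionally pauses between requests. This is only a sketch: `build_taxid_map`, its `delay` parameter and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `get_superkingdom` so the example runs without network access:

```python
import time

def build_taxid_map(taxids, fetch, delay=0.0):
    """Map each distinct, non-missing taxID to fetch(taxid), calling fetch once per ID."""
    mapping = {}
    for taxid in taxids:
        # Skip missing values (None, or NaN, which is the only value not equal to itself)
        if taxid is None or taxid != taxid:
            continue
        if taxid in mapping:
            continue
        mapping[taxid] = fetch(taxid)
        time.sleep(delay)  # optional pause between requests
    return mapping

# Stand-in for get_superkingdom so the sketch runs offline
calls = []

def fake_fetch(taxid):
    calls.append(taxid)
    return "Viruses" if taxid == "37554" else "cellular organisms"

demo_map = build_taxid_map(["562", "37554", "562", None, float("nan")], fake_fetch)
print(demo_map)
print(f"Lookups made: {len(calls)}")
```

In the notebook this would be called as `build_taxid_map(hits_df["taxID"], get_superkingdom)`, possibly with a small `delay` to space out the requests.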


Finally, let's add a column to the DataFrame that maps each taxID to its superkingdom:

In [14]:
def match_superkingdom(taxid):
    try:
        return(taxid_dict[taxid])
    except KeyError:
        return("None")

hits_df["Superkindgom"] = hits_df["taxID"].apply(match_superkingdom)
hits_df

Unnamed: 0,Query,Target,Identity,taxID,Superkindgom
1472,MGYP000451522494,UniRef90_A0A0A0RUU4 Cytochrome b n=3 Tax=Alydi...,1.000,41702,cellular organisms
1292,MGYP000324372200,UniRef90_UPI0016771803 class E sortase n=4 Tax...,1.000,1883,cellular organisms
165,MGYP003132644972,UniRef90_UPI0021E5136F hypothetical protein n=...,1.000,2893553,cellular organisms
168,MGYP000518926698,UniRef90_A0A2G5L1A1 ABC transporter permease n...,1.000,1889772,cellular organisms
1213,MGYP001375024420,UniRef90_UPI0007775DFE helix-turn-helix domain...,1.000,315405,cellular organisms
...,...,...,...,...,...
204,MGYP003110727254,UniRef90_UPI0020CF805B family 20 glycosylhydro...,0.904,1864822,cellular organisms
1467,MGYP003369226081,UniRef90_UPI000E2139DE ATP-grasp domain-contai...,0.903,1761016,cellular organisms
400,MGYP003088767509,UniRef90_A0A1L7XIC9 Related to integral membra...,0.903,576137,cellular organisms
45,MGYP003131803286,UniRef90_A0A6S6WMM4 Cysteine synthase n=16 Tax...,0.902,2800384,cellular organisms
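
As an aside, pandas can do this dictionary lookup directly with `Series.map`. The sketch below uses a toy Series and lookup table (the unknown ID `"0000"` is made up for illustration) to show that IDs missing from the dictionary come back as NaN, which `fillna` then turns into the same `"None"` placeholder:

```python
import pandas as pd

# Toy taxIDs: two taken from the table above plus one made-up ID with no entry
taxids = pd.Series(["41702", "1883", "0000"])
lookup = {"41702": "cellular organisms", "1883": "cellular organisms"}

# IDs absent from the dictionary map to NaN; fillna supplies the placeholder
mapped = taxids.map(lookup).fillna("None")
print(mapped.tolist())
```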


Now the moment of truth - how many viral hits above 90% identity do we have?

In [15]:
hits_df[hits_df["Superkindgom"] != "cellular organisms"]

Unnamed: 0,Query,Target,Identity,taxID,Superkindgom
654,MGYP000530814956,UniRef90_A0A383E5R1 Uncharacterized protein (F...,0.995,408172,unclassified entries
950,MGYP003626512906,UniRef90_A0A382NF39 Uncharacterized protein (F...,0.969,408172,unclassified entries
1079,MGYP003664292513,UniRef90_A0A382NF39 Uncharacterized protein (F...,0.967,408172,unclassified entries
123,MGYP003108894514,UniRef90_A0A382NF39 Uncharacterized protein (F...,0.963,408172,unclassified entries
1133,MGYP003321325438,UniRef90_A0A6J5PYJ1 Major capsid protein n=1 T...,0.957,2100421,Viruses
313,MGYP003333955921,UniRef90_A0A382CCN2 Uncharacterized protein (F...,0.913,408172,unclassified entries
970,MGYP003134589469,UniRef90_A0A7U3NKL8 Uncharacterized protein n=...,0.91,2783539,Viruses


Phew! Only two viral hits out of 284, plus five hits to "unclassified entries" (spread across three distinct UniRef90 clusters). Let's see what these are:

In [16]:
hits_df[hits_df["Superkindgom"] != "cellular organisms"]["Target"].values

array(['UniRef90_A0A383E5R1 Uncharacterized protein (Fragment) n=1 Tax=marine metagenome TaxID=408172 RepID=A0A383E5R1_9ZZZZ',
       'UniRef90_A0A382NF39 Uncharacterized protein (Fragment) n=1 Tax=marine metagenome TaxID=408172 RepID=A0A382NF39_9ZZZZ',
       'UniRef90_A0A382NF39 Uncharacterized protein (Fragment) n=1 Tax=marine metagenome TaxID=408172 RepID=A0A382NF39_9ZZZZ',
       'UniRef90_A0A382NF39 Uncharacterized protein (Fragment) n=1 Tax=marine metagenome TaxID=408172 RepID=A0A382NF39_9ZZZZ',
       'UniRef90_A0A6J5PYJ1 Major capsid protein n=1 Tax=uncultured Caudovirales phage TaxID=2100421 RepID=A0A6J5PYJ1_9CAUD',
       'UniRef90_A0A382CCN2 Uncharacterized protein (Fragment) n=1 Tax=marine metagenome TaxID=408172 RepID=A0A382CCN2_9ZZZZ',
       'UniRef90_A0A7U3NKL8 Uncharacterized protein n=1 Tax=Bacillus phage Kirov TaxID=2783539 RepID=A0A7U3NKL8_9CAUD'],
      dtype=object)

Looks like we can remove those two viral sequences, but the other five are just unclassified `marine metagenome` hits, so we have no reason to throw them out.
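
Dropping the viral hits is then a one-line boolean filter. Here is a minimal sketch on a toy stand-in for `hits_df` (four query IDs and labels copied from the tables above; the column name keeps the notebook's spelling), which removes only the rows labelled `Viruses` and keeps the unclassified ones:

```python
import pandas as pd

# Toy stand-in for hits_df with only the columns needed here
toy_hits = pd.DataFrame({
    "Query": ["MGYP000530814956", "MGYP003321325438", "MGYP003134589469", "MGYP000451522494"],
    "Superkindgom": ["unclassified entries", "Viruses", "Viruses", "cellular organisms"],
})

is_viral = toy_hits["Superkindgom"] == "Viruses"
filtered_hits = toy_hits[~is_viral]  # keep everything that is not viral
print(f"Removed {is_viral.sum()} viral hits, kept {len(filtered_hits)}")
```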