# Comparing Hits from Different Search Methods and Versions of the ESM Atlas

Previously, we had the following hits:

 * `all_structure_hits.txt` - Foldseek search results from the ESM Atlas website, using PDB entries `7MU1`, `6NJ8`, `7S2T`, and `6X8M` as queries

 * `encapsulin_pfam_hits.txt` - Text search against Pfam annotations in the MGnify protein database, matching Pfam families `PF04454` (encapsulating protein for peroxidase) and `PF05065` (phage capsid family)

 * `hk97_pfam_hits.txt` - Same as above, but using all Pfam families from the clan `CL0373` (phage coat) as input

 * `all_non_redundant_hits.txt` - All hits from above, deduplicated

All of these searches were run against version `2022/05` of the MGnify protein database. Since then, two things have changed:

1. ESM Atlas has been updated to use the `2023/02` version of the MGnify protein database

2. We now have access to Oracle VMs with 1 TB of RAM, 64 cores (128 threads) and, more importantly, 25 TB of SSD storage. This lets us download the entire ESM Atlas and search it locally with Foldseek, rather than through the ESM Atlas website, which limits us to searching a subset of ≈37 million structures.


Let's compare these sets of hits and see whether the new structure search method has given us any more hits than before. First, let's load our new expanded hits dataset:

In [5]:
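import pandas as pd

# Column names for the 12-column tabular (.m8) output written by Foldseek
m8_columns = ["Query", "Target", "Identity", "Alignment Length", "Mismatches", "Gap Openings",
              "Query Start", "Query End", "Target Start", "Target End", "E-Value", "Bitscore"]
foldseek_full_df = pd.read_csv("../hits/esm_atlas_foldseek_hits_full.m8", sep="\t", names=m8_columns)

# Drop the 7-character suffix from each target name, leaving only the MGYP accession
foldseek_full_df["Target"] = foldseek_full_df["Target"].str[:-7]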
foldseek_full_df

| | Query | Target | Identity | Alignment Length | Mismatches | Gap Openings | Query Start | Query End | Target Start | Target End | E-Value | Bitscore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7S2T.pdb_C | MGYP005830974233 | 0.531 | 269 | 126 | 0 | 2 | 270 | 11 | 279 | 9.421000e-35 | 1199 |
| 1 | 7S2T.pdb_C | MGYP005830974233 | 0.531 | 269 | 126 | 0 | 2 | 270 | 11 | 279 | 9.421000e-35 | 1199 |
| 2 | 7S2T.pdb_C | MGYP005846286663 | 0.509 | 269 | 132 | 0 | 2 | 270 | 10 | 278 | 3.730000e-34 | 1183 |
| 3 | 7S2T.pdb_C | MGYP005846286663 | 0.509 | 269 | 132 | 0 | 2 | 270 | 10 | 278 | 3.730000e-34 | 1183 |
| 4 | 7S2T.pdb_C | MGYP005841185857 | 0.531 | 269 | 125 | 0 | 2 | 270 | 11 | 278 | 2.728000e-34 | 1181 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 421589 | 6NJ8.pdb_B | MGYP006189216969 | 0.092 | 260 | 162 | 0 | 94 | 272 | 159 | 418 | 8.358000e+00 | 10 |
| 421590 | 6NJ8.pdb_B | MGYP006077962181 | 0.118 | 187 | 139 | 0 | 71 | 257 | 25 | 183 | 9.477000e+00 | 9 |
| 421591 | 6NJ8.pdb_B | MGYP006077962181 | 0.118 | 187 | 139 | 0 | 71 | 257 | 25 | 183 | 9.477000e+00 | 9 |
| 421592 | 6NJ8.pdb_B | MGYP006281665651 | 0.113 | 255 | 211 | 0 | 1 | 255 | 1 | 239 | 9.477000e+00 | 9 |


How many hits in total do we have in the new dataset?

In [6]:
len(foldseek_full_df["Target"].unique())

33521
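
The row index in the preview runs up to 421592, while only 33,521 targets are unique, so most targets appear in more than one row (some rows in the preview are exact duplicates, and a target can also be hit by several query chains). A minimal sketch of how that could be broken down per query is shown below. The helper name `summarize_hits` and the toy data are hypothetical and used only for illustration; the same function could be applied to `foldseek_full_df`.

```python
import pandas as pd


def summarize_hits(hits_df: pd.DataFrame) -> pd.Series:
    """Drop exact duplicate rows, then count unique targets per query."""
    deduplicated = hits_df.drop_duplicates()
    return deduplicated.groupby("Query")["Target"].nunique()


# Toy data, made up for illustration (not real search results)
toy_df = pd.DataFrame({
    "Query": ["q1", "q1", "q1", "q2"],
    "Target": ["t1", "t1", "t2", "t1"],
    "E-Value": [1e-30, 1e-30, 1e-5, 2.0],
})

per_query = summarize_hits(toy_df)
print(per_query)
```

Note that the per-query counts can add up to more than the overall number of unique targets, since the same target may be hit by several queries.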

Now let's load all of our previous hits and check how many new ones we have:

In [10]:
all_new_structure_hits = set(foldseek_full_df["Target"].unique())

with open("../hits/all_non_redundant_hits.txt", "r") as nr_hits_file:
    all_previous_hits = {line.rstrip() for line in nr_hits_file}

print(f"Total previous hits: {len(all_previous_hits)}")
print(f"Total new hits: {len(all_new_structure_hits)}")
print(f"Unique new hits: {len(all_new_structure_hits.difference(all_previous_hits))}")

Total previous hits: 768090
Total new hits: 33521
Unique new hits: 33483
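
These counts also imply the size of the overlap: 33,521 − 33,483 = 38 of the new structure hits were already in the previous set, which leaves 768,090 − 38 = 768,052 previous hits that the new search did not return. The sketch below computes all three regions of the comparison in one go. The helper name `compare_hit_sets` and the toy IDs are hypothetical; with the real data it would be called on `all_new_structure_hits` and `all_previous_hits`.

```python
def compare_hit_sets(new_hits: set, previous_hits: set) -> dict:
    """Return the sizes of the three regions of a two-set Venn diagram."""
    return {
        "only_new": len(new_hits - previous_hits),
        "shared": len(new_hits & previous_hits),
        "only_previous": len(previous_hits - new_hits),
    }


# Toy IDs, made up for illustration (not real MGnify accessions)
toy_new = {"A", "B", "C"}
toy_previous = {"C", "D", "E", "F"}

summary = compare_hit_sets(toy_new, toy_previous)
print(summary)
```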


Looks like we have ≈33,000 hits (33,483) that weren't in any of our previous sets! That's super cool, so let's write them to a new file so we can analyze them separately:

In [11]:
new_unique_hits = all_new_structure_hits.difference(all_previous_hits)

with open("../hits/unique_large_scale_atlas_hits.txt", "w") as outfile:
    outfile.write("\n".join(new_unique_hits))
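
Python sets have no guaranteed iteration order, so a file written straight from a set can list the IDs in a different order on each run. If a reproducible file matters, one option is to sort the IDs before writing and then read the file back to confirm nothing was lost. This is only a sketch: `write_ids`, `read_ids`, and the toy IDs are hypothetical names for illustration, and the file is written to a temporary directory rather than `../hits/`.

```python
import os
import tempfile


def write_ids(ids, path):
    """Write one ID per line, sorted so the file is identical between runs."""
    with open(path, "w") as outfile:
        outfile.write("\n".join(sorted(ids)) + "\n")


def read_ids(path):
    """Read one ID per line into a set, skipping blank lines."""
    with open(path) as infile:
        return {line.strip() for line in infile if line.strip()}


# Toy IDs, made up for illustration (not real MGnify accessions)
toy_ids = {"ID_3", "ID_1", "ID_2"}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_hits.txt")
    write_ids(toy_ids, toy_path)
    round_trip_ok = read_ids(toy_path) == toy_ids

print(round_trip_ok)
```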