# Run PLMSearch locally 🧪

**Notice:**

**The experiments were run on a server with a `56-core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz and 256 GB RAM`.**

**The GPU environment of the server is `1 × GeForce GTX 1080 Ti and 11 GB GPU Memory`.**

## Quick links
* [Start from Fasta (preprocessing)](#1)
  * [Generate ESM-1b embedding](#1-1)
  * [Generate Pfam result](#1-2)
* [SS-predictor pipeline](#2)
* [PLMSearch pipeline](#3)
* [Alignment (Sequence align & TM-align)](#4)

## Start from Fasta (preprocessing)
<span id="1"></span>

### 1. Generate ESM-1b embedding
<span id="1-1"></span>

In [1]:
#esm generate
!python ./plmsearch/embedding_generate.py \
-f './example/protein.fasta' \
-e './example/embedding.pkl' #--nogpu #for CPU-ONLY

Transferred model to GPU
Read ./example/protein.fasta with 5 sequences
Processing 1 of 1 batches (5 sequences)
Embedding generation time cost: 27.4346022605896 s
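
When scripting this step outside a notebook, the choice between GPU and CPU can be made programmatically. The sketch below is only an illustration: the helper names `embedding_command` and `gpu_available` are hypothetical and not part of PLMSearch, and it assumes that `--nogpu` (the flag mentioned in the cell comment above) is simply appended to the same command line.

```python
import sys


def embedding_command(fasta_path, embedding_path, use_gpu=True):
    """Build the argument list for embedding_generate.py (hypothetical helper)."""
    cmd = [
        sys.executable, "./plmsearch/embedding_generate.py",
        "-f", fasta_path,
        "-e", embedding_path,
    ]
    if not use_gpu:
        # Assumption: --nogpu is appended for CPU-only runs, as the cell comment suggests.
        cmd.append("--nogpu")
    return cmd


def gpu_available():
    """Return True if PyTorch is installed and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    cmd = embedding_command(
        "./example/protein.fasta",
        "./example/embedding.pkl",
        use_gpu=gpu_available(),
    )
    # Print the command; pass it to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```

The command is returned as a list rather than a single string so it can be handed to `subprocess.run` without shell quoting issues.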


### 2. Generate Pfam result
PLMSearch requires this input; SS-predictor does not.

**Important:** Because of the third-party tool PfamScan, the original protein ID must not contain spaces (`' '`).
<span id="1-2"></span>
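
A quick way to catch offending IDs before running PfamScan is to scan the FASTA headers for spaces. The following is a minimal sketch, not part of PLMSearch: the function name `headers_with_spaces` is hypothetical, and it assumes the whole header line after `>` is treated as the protein ID.

```python
def headers_with_spaces(fasta_text):
    """Return FASTA headers (without the leading '>') that contain a space.

    Assumption: the full header line is used as the protein ID.
    """
    bad = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if " " in header:
                bad.append(header)
    return bad


# Hypothetical two-record FASTA: the second header contains a space.
example = ">P0AD96\nMNIKGKALLAG\n>protein two\nMKVLA\n"
print(headers_with_spaces(example))  # ['protein two']
```

To check a real file, read it first, for example `headers_with_spaces(open('./example/protein.fasta').read())`, and rename any reported IDs before generating the Pfam result.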

In [3]:
#pfam generate
!python ./plmsearch/pfam_generate.py \
-f './example/protein.fasta' \
-o './example/pfam_result.json'

1715173337.4911733
perl ./plmsearch_data/PfamScan/pfam_scan.pl -fasta ./example/protein.fasta -dir ./plmsearch_data/Pfam_db -outfile ./example/tmp.txt
Pfam local generate time cost 2.730412721633911 s


## SS-predictor pipeline
<span id="2"></span>
<div align=center><img src="scientist_figures/workflow_img/ss_predictor3.png" width="90%" height="90%" /></div>

Use Swiss-Prot as the target dataset.

In [3]:
!python ./plmsearch/main_similarity.py \
-iqe './example/embedding.pkl' \
-ite './plmsearch_data/swissprot/embedding.pkl' \
-smp './plmsearch_data/model/plmsearch.sav' #-d #for CPU-ONLY

Embedding load time cost: 42.21443510055542 s
We have 4 GPUs in total!, we will use as you selected
Search query proteins batch by batch: 100%|███████| 1/1 [00:08<00:00,  8.24s/it]
Search time cost: 9.784892082214355 s


## PLMSearch pipeline
<span id="3"></span>
<div align=center><img src="scientist_figures/workflow_img/framework1.png" width="90%" height="90%" /></div>

Use Swiss-Prot as the target dataset.

In [4]:
#Step 1. generate pfamclan prefilter result
!python ./plmsearch/main_pfam.py \
-qpr './example/pfam_result.json' \
-tpr './plmsearch_data/swissprot/pfam_result.json' \
-c

[I 231212 23:36:38 main_pfam:8] query protein num = 5
[I 231212 23:36:38 main_pfam:9] target protein num = 430140
query protein list: 100%|█████████████████████████| 5/5 [00:00<00:00,  6.23it/s]


In [5]:
#Step 2. PLMSearch search
!python ./plmsearch/main_similarity.py \
-iqe './example/embedding.pkl' \
-ite './plmsearch_data/swissprot/embedding.pkl' \
-smp './plmsearch_data/model/plmsearch.sav' \
-isr './example/search_result/pfamclan' #-d #for CPU-ONLY

Embedding load time cost: 40.31576943397522 s
We have 4 GPUs in total!, we will use as you selected
Get search list: 17365it [00:00, 194832.11it/s]
[I 231212 23:37:23 main_similarity:156] presearch num = 17365
Search query proteins batch by batch: 100%|███████| 1/1 [00:03<00:00,  3.86s/it]
Search time cost: 5.484592914581299 s


## Alignment (Sequence align & TM-align)
<span id="4"></span>

In [1]:
!python ./plmsearch/sequence_align.py \
-qf './example/protein.fasta' \
-tf './example/protein.fasta' \
-ipr './example/alignment/test'

pairwise sequence align: 100%|█████████████████| 6/6 [00:00<00:00, 45507.82it/s]
sequence align output:   0%|                              | 0/6 [00:00<?, ?it/s]
P0AD96	P0AD96	1.0
>P0AD96	P0AD96
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQIVKYDDACDPKQAVAVANKVVNDGIKYVIGHLCSSSTQPASDIYEDEGILMITPAATAPELTARGYQLILRTTGLDSDQGPTAAKYILEKVKPQRIAIVHDKQQYGEGLARAVQDGLKKGNANVVFFDGITAGEKDFSTLVARLKKENIDFVYYGGYHPEMGQILRQARAAGLKTQFMGPEGVANVSLSNIAGESAEGLLVTKPKNYDQVPANKPIVDAIKAKKQDPSGAFVWTTYAALQSLQAGLNQSDDPAEIAKYLKANSVDTVMGPLTWDEKGDLKGFEFGVFDWHANGTATDAK
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQI

In [2]:
!python ./plmsearch/tmalign.py \
-qsd './example/structure/' \
-tsd './example/structure/' \
-ipr './example/alignment/test'

pairwise tmalign: 100%|████████████████████████| 6/6 [00:00<00:00, 15748.33it/s]
tmalign output:   0%|                                     | 0/6 [00:00<?, ?it/s]
P0AD96	P0AD96	1.0
>P0AD96	P0AD96
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQIVKYDDACDPKQAVAVANKVVNDGIKYVIGHLCSSSTQPASDIYEDEGILMITPAATAPELTARGYQLILRTTGLDSDQGPTAAKYILEKVKPQRIAIVHDKQQYGEGLARAVQDGLKKGNANVVFFDGITAGEKDFSTLVARLKKENIDFVYYGGYHPEMGQILRQARAAGLKTQFMGPEGVANVSLSNIAGESAEGLLVTKPKNYDQVPANKPIVDAIKAKKQDPSGAFVWTTYAALQSLQAGLNQSDDPAEIAKYLKANSVDTVMGPLTWDEKGDLKGFEFGVFDWHANGTATDAK
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQI
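
Both alignment scripts print, for each pair, a tab-separated score line (`query`, `target`, `score`) followed by a `>` header and the aligned sequences. The sketch below pulls only the score lines out of such text. It is a rough illustration based on the output shown above; the function name `parse_scores` is hypothetical, and the layout is assumed to match the example exactly.

```python
def parse_scores(output_text):
    """Extract (query, target, score) triples from alignment output text.

    Assumption: score lines have exactly three tab-separated fields,
    the last of which is a number, as in the output shown above.
    """
    scores = []
    for line in output_text.splitlines():
        if line.startswith(">"):
            continue  # header line that precedes the aligned sequences
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # sequence or match line
        try:
            score = float(fields[2])
        except ValueError:
            continue
        scores.append((fields[0], fields[1], score))
    return scores


# Shortened sample in the same layout as the output above.
sample = "P0AD96\tP0AD96\t1.0\n>P0AD96\tP0AD96\nMNIK\n||||\nMNIK\n"
print(parse_scores(sample))  # [('P0AD96', 'P0AD96', 1.0)]
```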