# Example notebook tutorial
## Local HMMER walkthrough

In [11]:
# system dependencies


# library dependencies
import duckdb as ddb
import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm

# local dependencies/utils
from pairpro.hmmer import local_hmmer_wrapper_example

In this walkthrough, we assume you have already downloaded the Pfam-A HMMs. If not, run the following command after cloning the repository:

```bash
python scripts/download_pfam.py
```

You will only need to run this command once, and you will find the HMMs in the `data` directory.

We will also use an adapted version of the fastest local HMMER wrapper, which assumes the input is a database of protein pairs. Other versions of the wrapper are available for other purposes; for more information, please refer to the Read the Docs page.

The main aim of this walkthrough is to show how the local HMMER wrapper works and how to use it.

In [20]:
# Pfam path
PFAM_PATH = "../data/pfam/"

# any user-defined protein db
SAMPLE_DB_PATH = "../data/pairpro_50k.db"

# output path
HMMER_OUTPUT_DIR = '../data/protein_pairs/'
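
Before running anything, it can help to confirm that these paths are usable. The following is a minimal sketch: `check_paths` is a hypothetical helper and not part of `pairpro`, and it assumes the wrapper does not create the output directory itself, which this notebook does not confirm.

```python
from pathlib import Path


def check_paths(pfam_path, db_path, output_dir):
    """Hypothetical helper: verify the inputs exist and create the output directory."""
    pfam_path, db_path, output_dir = Path(pfam_path), Path(db_path), Path(output_dir)
    if not pfam_path.is_dir():
        raise FileNotFoundError(f"Pfam directory not found: {pfam_path}")
    if not db_path.is_file():
        raise FileNotFoundError(f"Database file not found: {db_path}")
    # Creating the directory is harmless if it already exists.
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir
```

In this notebook it would be called as `check_paths(PFAM_PATH, SAMPLE_DB_PATH, HMMER_OUTPUT_DIR)`.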

### Import data and preprocess it a bit

Let's see what the sample database looks like. First, we make a connection to the database:

In [13]:
conn = ddb.connect(SAMPLE_DB_PATH, read_only=False)

Load the first ten rows into a dataframe to see what the data looks like:

In [14]:
sample_query = conn.execute("SELECT * FROM pairpro.proteins LIMIT 10").df()

In [15]:
sample_query

Unnamed: 0,pid,taxid,pdb_id,alphafold_id,proteome,protein_seq
0,A0A6L5BYG5,104087,,A0A6L5BYG5,UP000475265,MLQRYLWKLLPKQQRAFLLGRLSVVDRQVVNKSMSANLQFPSSFAQ...
1,A0A6L5BTN2,104087,,A0A6L5BTN2,UP000475265,MEQQEAWQVLIVEDDQRLAELTRDYLEANGLRVAIEGNGALAAARI...
2,A0A6L5BN44,104087,,A0A6L5BN44,UP000475265,MQPFVIAPSILSADFARLGEEVDNVLAAGADFVHFDVMDNHYVPNL...
3,A0A6L5BWT7,104087,,A0A6L5BWT7,UP000475265,MQVESRPDKKSGRFFMRIGHGYDVHRFAEGDFITLGGVRIAHGFGL...
4,A0A6L5BWU3,104087,,A0A6L5BWU3,UP000475265,MRPSEWFEGLRKIDINDLDTNNIGSWPPAIKALAGILLMVLVLGLG...
5,A0A6L5BX00,104087,,A0A6L5BX00,UP000475265,MTPSLLMAVLASGFIYGITPGPGVLAVFGIGAARGRRAGAGFLCGH...
6,A0A6L5C1M3,104087,,A0A6L5C1M3,UP000475265,MSRLKNKYALITGGTSGIGLETARQFLAQGATVAITGRSESALAAA...
7,A0A6L5BNJ3,104087,,A0A6L5BNJ3,UP000475265,MEQTKRVLVVEDDLHIADLICLHLRDEQFEVVHCADGDEGMRLLQQ...
8,A0A6L5BR68,104087,,A0A6L5BR68,UP000475265,MFTKQRLIIVATAVALLSGCASPNPYDNQGQADGGSQGMSKTAKYG...
9,A0A6L5BVP2,104087,,A0A6L5BVP2,UP000475265,MQNPQNLIWIDLEMTGLNPDTDVIIEMATIVTDSDLNTLAEGPVIA...


This is a pretty feature-rich dataframe, but the HMMER wrapper only needs protein IDs and their sequences, so let's do a bit of preprocessing before we run HMMER against Pfam!
The wrapper expects a dataframe with the following columns:

`pid`, `protein_seq`

Here `pid` is the protein ID and `protein_seq` is the protein sequence. Both columns are restricted to proteins that appear in pairs, so that HMMER runs as efficiently as possible and we get the most out of the wrapper. The wrapper takes a chunked list of PIDs, queries the database for the corresponding sequences, and then runs HMMER on those sequences against Pfam.
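
The "take a chunk of PIDs and query the database for their sequences" step can be sketched as follows. This is not the `pairpro` implementation: it uses the standard library's `sqlite3` with an in-memory table as a stand-in for the DuckDB file so that it runs anywhere, `fetch_sequences` is a hypothetical name, and the IDs and sequences are made up.

```python
import sqlite3


def fetch_sequences(conn, pids):
    """Return (pid, protein_seq) rows for the given PIDs using a parameterized query."""
    if not pids:
        return []
    placeholders = ", ".join("?" for _ in pids)
    query = (
        "SELECT pid, protein_seq FROM proteins "
        f"WHERE pid IN ({placeholders}) ORDER BY pid"
    )
    return conn.execute(query, list(pids)).fetchall()


# Toy stand-in for the protein table (IDs and sequences are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proteins (pid TEXT PRIMARY KEY, protein_seq TEXT)")
conn.executemany(
    "INSERT INTO proteins VALUES (?, ?)",
    [("P1", "MLQRYLWK"), ("P2", "MEQQEAWQ"), ("P3", "MQPFVIAP")],
)

rows = fetch_sequences(conn, ["P3", "P1"])
print(rows)  # [('P1', 'MLQRYLWK'), ('P3', 'MQPFVIAP')]
```

In the real wrapper, the fetched sequences would then be searched against the Pfam HMMs; that step is omitted here.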

In [16]:
proteins_in_pair_pids = conn.execute("SELECT pid FROM pairpro.proteins LIMIT 4000").df()

In [17]:
chunk_size = 1000
# split the PIDs into chunks so that each worker call queries one chunk
protein_pair_pid_chunks = [proteins_in_pair_pids[i:i + chunk_size]
                        for i in range(0, len(proteins_in_pair_pids), chunk_size)]
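
To make the chunking pattern concrete, here is the same slicing idea applied to a plain list. It is only an illustration: `chunk` is a hypothetical helper and the IDs are placeholders. Slicing a pandas dataframe with `df[i:i + n]` selects rows in the same way.

```python
def chunk(items, chunk_size):
    """Split a sequence into consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


pids = [f"PID{i}" for i in range(10)]  # placeholder IDs
chunks = chunk(pids, 4)
print([len(c) for c in chunks])  # [4, 4, 2]
```

The last chunk is simply shorter when the total is not a multiple of the chunk size. With 4000 PIDs and a chunk size of 1000, as in the cell above, this gives four full chunks.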

Let's run the wrapper in parallel and see how it works!

In [18]:
njobs = 4 # number of jobs to run in parallel
conn.close()

In [21]:
# NOTE: the bar is advanced only once, before the jobs start, so it reads 1/4
# even after all chunks have finished.
with tqdm(total=len(protein_pair_pid_chunks)) as pbar:
    pbar.update(1)
    Parallel(n_jobs=njobs)(
        delayed(local_hmmer_wrapper_example)(
            chunk_index,
            SAMPLE_DB_PATH,
            pid_chunk,  # the PIDs for this chunk
            PFAM_PATH,
            HMMER_OUTPUT_DIR,
            None,
        )
        for chunk_index, pid_chunk in enumerate(protein_pair_pid_chunks)
    )

 25%|██▌       | 1/4 [01:08<03:25, 68.62s/it]
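
The bar above stops at 1/4 because it is advanced once before the jobs start, not once per finished chunk. One way to report progress per completed chunk is sketched below. Note the substitutions: it uses the standard library's `concurrent.futures` with threads instead of joblib, prints a counter instead of driving a tqdm bar, and `process_chunk` is a hypothetical stand-in for the real worker.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chunk(chunk_index, chunk):
    """Hypothetical worker: pretend to process a chunk and report its size."""
    return chunk_index, len(chunk)


def run_with_progress(chunks, max_workers=4):
    """Run process_chunk on every chunk, reporting progress as each one finishes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_chunk, i, c) for i, c in enumerate(chunks)]
        # as_completed yields each future when it finishes, in completion order.
        for done, future in enumerate(as_completed(futures), start=1):
            chunk_index, size = future.result()
            results[chunk_index] = size
            print(f"{done}/{len(chunks)} chunks finished")
    return results


chunks = [["a", "b"], ["c"], ["d", "e", "f"], []]
results = run_with_progress(chunks)
```

With tqdm, the `print` call could be replaced by `pbar.update(1)` so that the bar advances once per finished chunk.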


And just like that, we ran HMMER against Pfam in parallel! The results are written to `HMMER_OUTPUT_DIR` (here `../data/protein_pairs/`). Let's see what the results look like:

In [22]:
pd.read_csv(f"{HMMER_OUTPUT_DIR}0_output.csv")

Unnamed: 0,query_id,accession_id
0,A0A6L5BYG5,PF03567.17
1,A0A6L5BTN2,PF00072.27;PF00486.31
2,A0A6L5BN44,PF00834.22
3,A0A6L5BWT7,PF02542.19
4,A0A6L5BWU3,PF04350.16
...,...,...
995,A0A1H3C732,
996,A0A1H4LUD9,
997,A0A1H3AVF1,
998,A0A562J0D1,
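
In the output above, `accession_id` holds one or more Pfam accessions joined by semicolons, and it is empty for some proteins. A minimal sketch of splitting that field is shown below. The helper names are hypothetical and not part of `pairpro`, and the rows are copied from the table above.

```python
import csv
import io


def split_accessions(field):
    """Split a semicolon-separated accession string; an empty field gives []."""
    return [acc for acc in (field or "").split(";") if acc]


def strip_version(accession):
    """Drop the version suffix, e.g. 'PF00072.27' -> 'PF00072'."""
    return accession.split(".", 1)[0]


# A few rows copied from the output above.
sample = io.StringIO(
    "query_id,accession_id\n"
    "A0A6L5BYG5,PF03567.17\n"
    "A0A6L5BTN2,PF00072.27;PF00486.31\n"
    "A0A1H3C732,\n"
)

families = {
    row["query_id"]: [strip_version(a) for a in split_accessions(row["accession_id"])]
    for row in csv.DictReader(sample)
}
print(families)
# {'A0A6L5BYG5': ['PF03567'], 'A0A6L5BTN2': ['PF00072', 'PF00486'], 'A0A1H3C732': []}
```

When reading the file with pandas instead, empty cells arrive as `NaN` by default, so either pass `keep_default_na=False` to `pd.read_csv` or fill the missing values before splitting.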


---

We also have a function that parses HMMER results! We won't show it in this notebook, but you can find it in the `hmmer.py` file in the `pairpro` package. Please check out the Read the Docs page for more information!

---