# Example notebook tutorial
## Local HMMER walkthrough

In [1]:
# system dependencies
from pathlib import Path

# library dependencies
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# local dependencies/utils
from FAFSA.compute_local_hmmer import hmmer_wrapper

For the script to run, the user must have a Pfam database stored locally on their machine, so this walkthrough uses my own local path:

In [2]:
# Pfam path
PFAM_PATH = Path("/Users/humoodalanzi/pfam/Pfam-A.hmm") 

# any user-defined protein db
SAMPLE_DB_PATH = Path("../examples/learn2therm_sample_50k_exploration.csv")
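
If you would rather not hard-code a machine-specific path, one option is to read the Pfam location from an environment variable. The following is a minimal sketch, not part of FAFSA: the variable name `PFAM_HMM_PATH` and the helper `resolve_pfam_path` are hypothetical names chosen for illustration.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; pick whatever suits your setup.
PFAM_ENV_VAR = "PFAM_HMM_PATH"


def resolve_pfam_path(default=None):
    """Return the Pfam HMM path from the environment, falling back to `default`."""
    raw = os.environ.get(PFAM_ENV_VAR, default)
    if raw is None:
        raise FileNotFoundError(
            f"Set {PFAM_ENV_VAR} or pass a default path to Pfam-A.hmm"
        )
    path = Path(raw).expanduser()
    # Fail early with a clear message instead of failing inside HMMER.
    if not path.is_file():
        raise FileNotFoundError(f"No Pfam HMM file found at {path}")
    return path
```

With this in place, `PFAM_PATH = resolve_pfam_path()` would pick up whatever path each user exports in their shell.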

### Import data and preprocess it a bit

Let's see what the sample database looks like:

In [3]:
df_sample = pd.read_csv(SAMPLE_DB_PATH, index_col=0)

In [4]:
df_sample.head()

|  | local_gap_compressed_percent_id | scaled_local_query_percent_id | scaled_local_symmetric_percent_id | query_align_len | query_align_cov | subject_align_len | subject_align_cov | bit_score | thermo_index | meso_index | ... | bit_score_16s | m_ogt | t_ogt | ogt_difference | m_protein_seq | t_protein_seq | m_protein_desc | t_protein_desc | m_protein_len | t_protein_len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.287582 | 0.217822 | 0.215686 | 160 | 0.792079 | 152 | 0.737864 | 131 | 875 | 12897 | ... | 1153.0 | 27.5 | 50.0 | 22.5 | MAESGTSRRADHLVPVPGPDAEPPAVADELLRAVGRGDEQAFGRLY... | MPSQITESERIELAERFERDALPLLDQLYSAALRMTRNPADAEDLV... | ECF RNA polymerase sigma factor SigK | sigma-70 family RNA polymerase sigma factor | 206 | 202 |
| 1 | 0.319635 | 0.295359 | 0.297872 | 218 | 0.919831 | 226 | 0.969957 | 282 | 11324 | 13026 | ... | 1014.0 | 25.0 | 54.0 | 29.0 | MARIALVDDDRNILTSVSMTLEAEGFEVETYNDGQSALDAFNKRMP... | MRVLLVEDDPNTSRSIEMMLTHANLNVYATDMGEEGIDLAKLYDYD... | response regulator transcription factor | response regulator transcription factor | 233 | 237 |
| 2 | 0.279621 | 0.234127 | 0.218924 | 211 | 0.837302 | 210 | 0.731707 | 96 | 875 | 8203 | ... | 1138.0 | 28.0 | 50.0 | 22.0 | MKDTVVFVTGAARGIGAHTARLAVARGARVALVGLEPHLLADLAAE... | MTPEQIFSGQTAIVTGGASGIGAATVEHIARRGGRVFSVDLSYDSP... | SDR family oxidoreductase | SDR family oxidoreductase | 287 | 252 |
| 3 | 0.327273 | 0.200743 | 0.214712 | 166 | 0.6171 | 163 | 0.696581 | 175 | 875 | 3340 | ... | 1077.0 | 28.0 | 50.0 | 22.0 | MTSGLWERVLDGVWVTIQLLVLSALLATAVSFVVGIARTHRLWIVR... | MAMSRRKRGQLARGIQYAILVIVVVVLALLADWGKIGKAFFDWEAA... | ectoine/hydroxyectoine ABC transporter permeas... | amino acid ABC transporter permease | 234 | 269 |
| 4 | 0.33871 | 0.318182 | 0.287671 | 60 | 0.909091 | 71 | 0.8875 | 61 | 9827 | 14020 | ... | 991.0 | 30.0 | 50.0 | 20.0 | MIISLRRGLRFIRFIVFFAALVYLFYHVLDLFNGWISPVDQYQMPT... | MKRMVWRTLKVFIIFIACTLLFYFGLRFMHLEYEQFHRYEPPEGPA... | YqzK family protein | YqzK family protein | 80 | 66 |


This is a pretty feature-rich dataframe, but hmmer_wrapper only needs the protein sequences and their indexes, so let's do a bit of preprocessing before we run HMMER against Pfam!

In [9]:
# split the database into the corresponding meso and thermo sequence tables
meso_seq_db = df_sample[["meso_index", "m_protein_seq"]]
thermo_seq_db = df_sample[["thermo_index", "t_protein_seq"]]

# use the protein index as the dataframe index, keep only the first 500 sequences,
# and rename the sequence column to the name hmmer_wrapper expects
meso_seq_list = (
    meso_seq_db.set_index("meso_index")
    .iloc[:500]
    .rename(columns={"m_protein_seq": "protein_seq"})
    .rename_axis(None)  # drop the index name
)

thermo_seq_list = (
    thermo_seq_db.set_index("thermo_index")
    .iloc[:500]
    .rename(columns={"t_protein_seq": "protein_seq"})
    .rename_axis(None)  # drop the index name
)

How does the dataframe look now?

In [10]:
meso_seq_list.head()

|  | protein_seq |
| --- | --- |
| 12897 | MAESGTSRRADHLVPVPGPDAEPPAVADELLRAVGRGDEQAFGRLY... |
| 13026 | MARIALVDDDRNILTSVSMTLEAEGFEVETYNDGQSALDAFNKRMP... |
| 8203 | MKDTVVFVTGAARGIGAHTARLAVARGARVALVGLEPHLLADLAAE... |
| 3340 | MTSGLWERVLDGVWVTIQLLVLSALLATAVSFVVGIARTHRLWIVR... |
| 14020 | MIISLRRGLRFIRFIVFFAALVYLFYHVLDLFNGWISPVDQYQMPT... |


Thus, the input to hmmer_wrapper is just a dataframe with a single protein_seq column, indexed however you like.
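
To make the connection to the FASTA input file concrete, here is a minimal sketch of how indexed sequences can be written out in FASTA format. It is an illustration only, not the actual code inside hmmer_wrapper: the helper name `write_fasta`, the 60-character line width, and the toy sequences are all assumptions.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs to `handle` in FASTA format."""
    for identifier, sequence in records:
        # The header line carries the index, so hits can be mapped back later.
        handle.write(f">{identifier}\n")
        # Wrap the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Toy identifiers and sequences, for illustration only.
buffer = io.StringIO()
write_fasta([(1, "MKTAYIAKQR"), (2, "MSDNE")], buffer, width=5)
print(buffer.getvalue())
```

With the dataframe from this walkthrough, the records could come from `meso_seq_list["protein_seq"].items()`, which yields (index, sequence) pairs.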

---

The hmmer_wrapper function takes the following arguments, in order:
- the sequence list (dataframe) that you want to run HMMER on
- the input filename (given without an extension in the calls in this walkthrough)
- the local Pfam path, which HMMER will run against
- the input filename with extension (specify .fasta for best results; other file types are allowed, but not recommended)
- the output filename with extension (similarly, specify .domtblout for best results)
- the number of CPUs that HMMER will use

In [11]:
hmmer_wrapper(meso_seq_list, "meso_input", PFAM_PATH, "meso_input.fasta", "meso_output.domtblout", 4)
hmmer_wrapper(thermo_seq_list, "thermo_input", PFAM_PATH, "thermo_input.fasta", "thermo_output.domtblout", 4)

Running this should generate the input and output files specified above in your current directory, i.e. the directory from which you run the code.
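
Once the .domtblout files exist, you will probably want to read them back in. The sketch below parses HMMER's per-domain table format with the standard library only. It assumes the standard layout of 22 whitespace-separated fields followed by a free-text description, with comment lines starting with `#`. The column labels and the helper `parse_domtblout` are my own naming, not part of FAFSA or HMMER.

```python
# Labels for the 22 fixed fields plus the trailing free-text description.
DOMTBLOUT_COLUMNS = [
    "target_name", "target_accession", "tlen",
    "query_name", "query_accession", "qlen",
    "full_evalue", "full_score", "full_bias",
    "domain_number", "domain_count",
    "c_evalue", "i_evalue", "domain_score", "domain_bias",
    "hmm_from", "hmm_to", "ali_from", "ali_to", "env_from", "env_to",
    "acc", "description",
]


def parse_domtblout(lines):
    """Parse domtblout lines into a list of dicts (all values kept as strings)."""
    rows = []
    for line in lines:
        # Skip blank lines and the '#' comment/header lines HMMER writes.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on whitespace at most 22 times so the description stays intact.
        fields = line.rstrip("\n").split(maxsplit=len(DOMTBLOUT_COLUMNS) - 1)
        rows.append(dict(zip(DOMTBLOUT_COLUMNS, fields)))
    return rows
```

For example, `with open("meso_output.domtblout") as handle: rows = parse_domtblout(handle)` followed by `pd.DataFrame(rows)` would give a dataframe of domain hits. Whether the protein index appears under `target_name` or `query_name` depends on which HMMER program the wrapper runs, which this walkthrough does not state, so check a few rows before relying on either column.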

---