In [19]:
import numpy as np 
import pandas as pd 
from Bio import Align 
import os 
from selenobot.tools import CDHIT, MMseqs
from selenobot.datasets import *

%load_ext autoreload 
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Hidden Markov Model baseline

I think it is important to have an HMM baseline against which to compare selenobot performance. To use HMMs as a classifier, I will need at least one HMM per category (i.e. full-length, truncated selenoproteins, and truncated non-selenoproteins). I suspect I will need more granularity than this, because HMMs are built from multiple sequence alignments, which require some degree of similarity among the aligned sequences.

## Training

For building HMMs, I am thinking of doing the following:

1. Cluster sequences in each category from the training set at 50% sequence identity using CD-HIT.
2. Generate a multiple sequence alignment (MSA) for each cluster (possibly just using BioPython?).
3. Use PyHMMER (a Python wrapper for HMMER3) to build an HMM for each cluster.

The result of training will be a map from each overarching category (full-length, truncated selenoproteins, and truncated non-selenoproteins) to a set of HMMs, one per cluster.
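As a rough sketch of the bookkeeping between steps 1 and 2, the clustering output could be regrouped into a map from category to clusters, dropping clusters too small to align. The function name `group_clusters`, the `min_size` cutoff, and the toy records are all hypothetical; the real input would come from the CD-HIT results, and the alignment and HMM-building steps are not shown.

```python
from collections import defaultdict


def group_clusters(records, min_size=5):
    """Map each category label to {cluster ID: [sequence IDs]}.

    `records` is an iterable of (sequence ID, category label, cluster ID)
    tuples. Clusters with fewer than `min_size` members are dropped, on the
    assumption that they are too small to give a useful alignment.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for seq_id, label, cluster_id in records:
        grouped[label][cluster_id].append(seq_id)
    return {
        label: {c: ids for c, ids in clusters.items() if len(ids) >= min_size}
        for label, clusters in grouped.items()
    }


# Hypothetical toy records, not taken from the training set.
records = [
    ('a', 0, 10), ('b', 0, 10), ('c', 0, 11),
    ('d', 1, 20), ('e', 1, 20), ('f', 1, 20),
]
clusters_by_label = group_clusters(records, min_size=2)
```

Each surviving cluster would then get its own MSA and HMM, so the final category-to-HMMs map has the same shape as `clusters_by_label`.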

## Prediction

Score each sequence against every generated HMM.
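One possible way to turn the per-HMM scores into a prediction is to assign each sequence the category of its best-scoring HMM. This decision rule is an assumption rather than something settled above, and the names `classify`, `scores`, and `hmm_to_label`, along with the toy values, are hypothetical.

```python
def classify(scores, hmm_to_label):
    """Assign each sequence the category of its highest-scoring HMM.

    `scores` maps sequence ID -> {HMM name: score}; `hmm_to_label` maps
    HMM name -> category. Sequences with no scores get None.
    """
    predictions = {}
    for seq_id, hmm_scores in scores.items():
        if not hmm_scores:
            predictions[seq_id] = None
            continue
        best_hmm = max(hmm_scores, key=hmm_scores.get)
        predictions[seq_id] = hmm_to_label[best_hmm]
    return predictions


# Hypothetical HMM names and scores, for illustration only.
hmm_to_label = {
    'hmm_a': 'full-length',
    'hmm_b': 'truncated selenoprotein',
    'hmm_c': 'truncated non-selenoprotein',
}
scores = {
    'seq1': {'hmm_a': 12.0, 'hmm_b': 45.5, 'hmm_c': 3.1},
    'seq2': {'hmm_a': 80.2, 'hmm_c': 10.0},
    'seq3': {},
}
predictions = classify(scores, hmm_to_label)
```

A score threshold or a per-category aggregate (e.g. the mean over all HMMs in a category) would be alternative rules worth comparing.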

In [None]:
train_dataset = TernaryDataset.from_hdf('../data/train.h5')

In [42]:
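for label, df in train_dataset.metadata.groupby('label'):
    # Cluster the sequences in this category with CD-HIT (c_cluster=0.5), reusing saved results if present.
    cdhit = CDHIT(df, c_cluster=0.5, cwd=os.getcwd(), name=f'hmm_{label}')
    cdhit_df = cdhit.run(dereplicate=False, overwrite=False)
    # Cluster the same sequences with MMseqs2 so the two clusterings can be compared.
    mmseqs = MMseqs(df, name=f'hmm_{label}', cwd=os.getcwd())
    mmseqs_df = mmseqs.run(overwrite=True)
    break  # Only process the first label for now.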

CDHIT._run: Using pre-saved clustering results at /home/prichter/Documents/selenobot/notebooks/cluster_hmm_0
easy-cluster /home/prichter/Documents/selenobot/notebooks/hmm_0.fa hmm_0 tmp --min-seq-id 0.2 

MMseqs Version:                     	16.747c6
Substitution matrix                 	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix            	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                         	4
k-mer length                        	0
Target search mode                  	0
k-score                             	seq:2147483647,prof:2147483647
Alphabet size                       	aa:21,nucl:5
Max sequence length                 	65535
Max results per query               	20
Split database                      	0
Split mode                          	2
Split memory limit                  	0
Coverage threshold                  	0.8
Coverage mode                       	0
Compositional bias                  	1
Compositional bias                  	1
Diagonal sco

In [40]:
# Number of sequences in each CD-HIT cluster, largest first.
cluster_sizes = cdhit_df.groupby('cdhit_cluster').size().sort_values(ascending=False)
cluster_sizes[cluster_sizes < 5]

cdhit_cluster
22490    4
6301     4
25953    4
25937    4
20062    4
        ..
16       1
15       1
14       1
11       1
8        1
Length: 31112, dtype: int64

In [45]:
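# Number of sequences in each MMseqs2 cluster, largest first.
cluster_sizes = mmseqs_df.groupby('mmseqs_cluster').size().sort_values(ascending=False)
cluster_sizes[cluster_sizes < 5]  # Not displayed: a cell only echoes its last expression.
mmseqs_df.mmseqs_cluster.nunique()  # Not displayed either.
cdhit_df.cdhit_cluster.nunique()  # Displayed below: the number of CD-HIT clusters.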

35948

## Questions

1. What is a good rule-of-thumb sequence similarity threshold when generating multiple sequence alignments?
2. Why use alignments instead of raw sequences when training HMMs?
3. Should I de-replicate the sequences before building HMMs?