In [1]:
import numpy as np 
import pandas as pd 
from Bio import Align 
import os 
from selenobot.tools import CDHIT
from selenobot.datasets import TernaryDataset

%load_ext autoreload 
%autoreload 2

# Hidden Markov Model baseline

I think it is important to have an HMM baseline against which to compare selenobot performance. To use HMMs as a classifier, I will need at least one HMM per category (i.e. full-length, truncated selenoproteins, and truncated non-selenoproteins). I suspect I will need more granularity than this: HMMs are built from multiple sequence alignments, which require some degree of similarity among the aligned sequences.

## Training

To build the HMMs, I am thinking of doing the following:

1. Cluster the training-set sequences in each category at 50 percent sequence identity using CD-HIT.
2. Generate a multiple sequence alignment (MSA) for each cluster (possibly just using BioPython?).
3. Use PyHMMER (a Python wrapper for HMMER3) to build an HMM for each cluster. 

The result of training will be a map from each overarching category (full-length, truncated selenoproteins, and truncated non-selenoproteins) to a set of HMMs, one per cluster.
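
The steps above can be sketched as a small function that groups sequences by category and cluster, aligns each cluster, and builds one model per alignment. This is only a rough outline under assumptions: `train_hmms`, the record layout, and the `align` and `build_hmm` callables are hypothetical placeholders, not part of selenobot. In practice `align` would presumably call an external aligner and `build_hmm` would wrap something like PyHMMER's `plan7.Builder`; trivial stand-ins are used here so the structure of the resulting map is visible.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical record layout: (category, cluster_id, sequence).
Record = Tuple[str, int, str]


def train_hmms(
    records: Iterable[Record],
    align: Callable[[List[str]], Any],
    build_hmm: Callable[[Any], Any],
) -> Dict[str, List[Any]]:
    """Return {category: [model, ...]} with one model per cluster."""
    # Collect the sequences belonging to each (category, cluster) pair.
    clusters: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for category, cluster_id, sequence in records:
        clusters[(category, cluster_id)].append(sequence)

    # Align each cluster, then build one model from each alignment.
    models: Dict[str, List[Any]] = defaultdict(list)
    for (category, _), sequences in sorted(clusters.items()):
        msa = align(sequences)
        models[category].append(build_hmm(msa))
    return dict(models)


# Toy data and stand-in callables, for illustration only.
records = [
    ("full_length", 0, "MSTK"),
    ("full_length", 0, "MSSK"),
    ("full_length", 1, "MAAAG"),
    ("truncated_selenoprotein", 0, "MKL"),
]
models = train_hmms(
    records,
    align=lambda seqs: list(seqs),   # placeholder for a real aligner
    build_hmm=lambda msa: len(msa),  # placeholder for a real HMM builder
)
```

With the stand-ins, each "model" is just the number of sequences in its cluster, which makes it easy to check that every category maps to one entry per cluster.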

## Prediction

Score each sequence against every generated HMM.
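
One way the per-HMM scores could be turned into a prediction is to assign each sequence the category of its best-scoring HMM. This decision rule is an assumption on my part and is not settled above; `predict_category`, the HMM names, and the scores are all made up for illustration.

```python
from typing import Dict


def predict_category(scores: Dict[str, float], hmm_to_category: Dict[str, str]) -> str:
    """Return the category of the highest-scoring HMM (assumed decision rule)."""
    best_hmm = max(scores, key=scores.get)
    return hmm_to_category[best_hmm]


# Hypothetical HMM names mapped back to their overarching category.
hmm_to_category = {
    "full_0": "full_length",
    "full_1": "full_length",
    "trunc_sel_0": "truncated_selenoprotein",
    "trunc_non_0": "truncated_non_selenoprotein",
}

# Hypothetical scores for a single query sequence, one per HMM.
scores = {"full_0": 12.5, "full_1": 3.1, "trunc_sel_0": 40.2, "trunc_non_0": 18.0}

prediction = predict_category(scores, hmm_to_category)
```

A score threshold below which no category is assigned may also be worth considering, since a sequence can score poorly against every HMM.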

In [2]:
train_dataset = TernaryDataset.from_hdf('../data/train.h5')

In [7]:
cdhit = CDHIT(train_dataset.metadata, c_cluster=0.5, cwd=os.getcwd(), name='hmm')
df = cdhit.run(dereplicate=False, overwrite=True)

cd-hit -i /home/prichter/Documents/selenobot/notebooks/hmm.fa -o /home/prichter/Documents/selenobot/notebooks/cluster_hmm -n 3 -c 0.5 -l 5


KeyboardInterrupt: 
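
Once a CD-HIT run completes, it writes the cluster assignments to a `.clstr` file next to the output (here that would be `cluster_hmm.clstr`). The selenobot `CDHIT` wrapper presumably handles this already; the snippet below is only a minimal sketch of how that file format can be grouped into clusters, with `parse_clstr` and the sequence IDs being hypothetical.

```python
import re
from typing import Dict, List, Optional


def parse_clstr(text: str) -> Dict[int, List[str]]:
    """Group sequence IDs by cluster number from CD-HIT .clstr text."""
    clusters: Dict[int, List[str]] = {}
    current: Optional[int] = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 0".
            current = int(line.split()[1])
            clusters[current] = []
        elif line and current is not None:
            # Member line, e.g. "0\t120aa, >seqA... *".
            match = re.search(r">(.+?)\.\.\.", line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


# Toy .clstr content with made-up sequence IDs.
example = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 62.50%\n"
    ">Cluster 1\n"
    "0\t95aa, >seqC... *\n"
)
clusters = parse_clstr(example)
```

The line ending in `*` is the cluster representative; the others report their identity to that representative.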

## Questions

1. What is a good rule-of-thumb threshold for sequence similarity when generating multiple sequence alignments?
2. Why use alignments instead of raw sequences when training HMMs?
3. Should I de-replicate the sequences before building HMMs?