## Intein Finder

Source code: https://github.com/omsai/intein_finder

### Project goal

Identify putative inteins in a given genome
by training position weight matrices on known InBase inteins,
using PSI-BLAST or a similar tool.
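
To make the idea of a position weight matrix concrete, here is a minimal sketch that builds one from a few aligned motif fragments and scores candidate sequences against it. The sequences are made up for illustration and are not taken from InBase. PSI-BLAST builds its own position-specific scoring matrix iteratively, so this only illustrates the concept and is not the method this project will use. The function names, the uniform background, and the pseudocount are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_weight_matrix(seqs, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds.

    Assumes equal-length (already aligned) sequences and a uniform background.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    background = 1.0 / len(alphabet)
    total = len(seqs) + pseudocount * len(alphabet)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        pwm.append({
            res: math.log2((counts[res] + pseudocount) / total / background)
            for res in alphabet
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds of seq under the matrix."""
    return sum(column[res] for column, res in zip(pwm, seq))


# Hypothetical aligned fragments, for illustration only.
toy_motifs = ["CLAG", "CLSG", "CVAG", "CLAG"]
pwm = position_weight_matrix(toy_motifs)
```

A candidate that resembles the training fragments, such as `score(pwm, "CLAG")`, scores higher than an unrelated one, such as `score(pwm, "WWWW")`.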

### Suggestions

- Focus on identifying the splicing domain and the homing endonuclease,
  not necessarily all of the annotated domains, which vary widely.
- Then build a phylogeny of the homing endonucleases to see whether they are really inherited vertically
  or whether they jump horizontally back and forth and are the basis of some phylogenies.

## Database summary

Poke around the database:
- List the available columns.
- See how many inteins are experimentally validated.
- Describe the columns.

In [None]:
from inbase import INBASE
import pandas as pd
from pprint import pprint

pprint(INBASE.columns.tolist())

### Experimental data

Would it be more accurate to train our PSI-BLAST position weight matrix on experimentally validated inteins only?
Is there enough data?

In [None]:
INBASE['Intein Class'].unique()

In [None]:
exp = INBASE[INBASE['Intein Class'] == 'Experimental']
print('%d out of %d (%d%%) inteins are experimentally valid' % (len(exp), len(INBASE), len(exp) * 100.0 / len(INBASE)))

In [None]:
print('Experimentally valid data:')
exp['Domain of Life'].value_counts()

Overview of the other columns:

In [None]:
INBASE.describe()

## Cluster by the annotated protein domains

We first need to clean up the annotations:
- Remove invalid entries.
- Split the numeric location of the protein motif from the motif sequence.

In [None]:
# Select the annotated protein-domain ("Block") columns.
cols_domain = [col for col in INBASE.columns if 'Block' in col]

# Some of the cells contain invalid data: blank values or dashes.
# Stack to long format and keep only cells that look like
# "<motif letters><location number>"; rejected cells become NaN after unstack().
temp = INBASE.loc[:, cols_domain].stack()
valid = temp.str.match(r'[A-Z*?/ ]+[0-9]+', na=False)
# To inspect the rejected cells:
# pprint(temp[~valid].values.tolist())
inbase = temp[valid].unstack()
inbase.head()
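
As a sanity check of the stack / match / unstack filtering above, here is a small sketch on a toy frame. The column names and cell values are hypothetical and only mimic the "motif followed by a number" shape assumed by the regular expression; real InBase cells may look different.

```python
import pandas as pd

# Hypothetical annotations: two valid-looking cells, a dash, a missing value
# and an empty string.
toy = pd.DataFrame({
    "Block A": ["CLAGDT 1", "-", "CISGDS 1"],
    "Block B": ["GTLLVTPDHK 72", None, ""],
})

stacked = toy.stack()  # one row per (record, column) cell
is_valid = stacked.str.match(r"[A-Z*?/ ]+[0-9]+", na=False)
cleaned = stacked[is_valid].unstack()  # back to wide format
```

Record 1 has no valid cell in any column, so it disappears from `cleaned` entirely, while the rejected cell of record 2 comes back as a missing value. The same will happen to real records whose block annotations are all invalid.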

Split the location numbers from the domain strings.

In [None]:
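# Split each "<motif> <location>" cell into a motif column and a location column.
blocks = []
for col in cols_domain:
    col_new = col.replace(' ', '_')
    # The named groups become the output column names,
    # e.g. Block_C and Block_C_loc.
    pattern = r'(?P<{block}>^[A-Z*?/]+)[ NC]*(?P<{loc}>[0-9]+$)'.format(
        block=col_new, loc=col_new + '_loc')
    blocks.append(inbase[col].str.extract(pattern, expand=True))

df = pd.concat(blocks, axis=1)
df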
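
The effect of the named-group pattern is easier to see on a single toy column. The values below are hypothetical, including the `N`/`C` prefixes on the location, which are only a guess at what the `[ NC]*` part of the pattern is meant to absorb.

```python
import pandas as pd

# Hypothetical cells in the assumed "<motif> [N|C]<location>" shape.
toy_block = pd.Series(["CLAGDT 1", "GTLLVTPDHK N72", "EDVYDLT C305"])

pattern = r"(?P<Block_X>^[A-Z*?/]+)[ NC]*(?P<Block_X_loc>[0-9]+$)"
parts = toy_block.str.extract(pattern, expand=True)

# str.extract returns strings, so convert the location for numeric work.
parts["Block_X_loc"] = parts["Block_X_loc"].astype(int)
```

Because the motif group is greedy and also accepts `N` and `C`, a cell without a separating space (say `ABCN72`) would keep the `N` as part of the motif, which is worth checking against the real data.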

Now we can cluster based on the annotated domains.

In [None]:
df.groupby('Block_C').groups
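
Grouping on the motif string clusters only exact matches. A quick way to see how large each cluster is, sketched here on hypothetical `Block_C` values, is to count the group sizes:

```python
import pandas as pd

# Hypothetical motif/location pairs; None stands for a missing annotation.
toy = pd.DataFrame({
    "Block_C": ["LVDGH", "LVDGH", "IIDGH", None],
    "Block_C_loc": ["80", "82", "79", None],
})

# groupby drops missing keys by default, so unannotated rows are not counted.
cluster_sizes = toy.groupby("Block_C").size().sort_values(ascending=False)
```

Motifs that differ by a single residue land in separate clusters under this approach, so it is a first pass rather than a similarity-based clustering.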