# Intein Finder

## Project goal

Focus on the splicing domain,
not necessarily on all the annotated domains, which vary a lot.

## Broad goal

Paul Lewis is a verticalist.
His first thought is that inteins evolved within a lineage, not that they spread by horizontal gene transfer.
Peter immediately sees horizontal gene transfer.
It would be nice to have an algorithm that scans genomes and extracts the homing endonucleases.
We could then build a phylogeny of the homing endonucleases and see whether they are really vertically inherited
or whether they jump back and forth between lineages and are the basis of some phylogenies.

In [None]:
from inbase import INBASE
from pprint import pprint

pprint(INBASE.columns.tolist())

In [None]:
INBASE['Intein Class'].unique()

In [None]:
exp = INBASE[INBASE['Intein Class'] == 'Experimental']
print('%d out of %d (%d%%) inteins are experimentally valid' % (len(exp), len(INBASE), len(exp) * 100.0 / len(INBASE)))

In [None]:
print('Experimentally valid data:')
exp['Domain of Life'].value_counts()

In [None]:
INBASE.describe()

### Cluster by the annotated protein domains

We first need to clean up the annotations.

In [None]:
cols_domain = [col for col in INBASE.columns if 'Block' in col]

# Some of the cells contain invalid data: blank values or dashes.
# Stack to a long Series, keep only the cells that look like "<letters> <number>",
# then unstack; cells that were filtered out come back as NaN.
temp = INBASE.loc[:, cols_domain].stack()
# na=False treats non-string cells as invalid instead of leaving NaN in the mask.
valid = temp.str.match(r'[A-Z*?/ ]+[0-9]+', na=False)
# To inspect the rejected values:
# pprint(temp[~valid].values.tolist())
inbase = temp[valid].unstack()
inbase.head()
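
The stack / match / unstack cleanup above can be tried on a small table first. This is a minimal sketch: the `toy` frame, its column names and its cell values are made up for illustration and are not taken from INBASE.

```python
import pandas as pd

# Hypothetical toy data: two annotation columns with a dash and a missing value.
toy = pd.DataFrame({
    'Block A': ['CLAG 1', '-', 'CIT 1'],
    'Block B': ['TH 72', 'SH 80', None],
})

# Long format: one entry per (row, column) cell.
stacked = toy.stack()

# Keep only cells made of letters/symbols followed by digits;
# na=False makes missing cells count as invalid.
valid = stacked.str.match(r'[A-Z*?/ ]+[0-9]+', na=False)

# Back to wide format; the rejected cells reappear as NaN.
cleaned = stacked[valid].unstack()
print(cleaned)
```

A row only survives if at least one of its cells is valid, so rows with no usable annotation drop out of the result entirely.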

Split the location numbers from the domain strings.

In [None]:
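import pandas as pd

parts = []
for col in cols_domain:
    col_new = col.replace(' ', '_')
    # The named groups become the output column names:
    # the block letters and the trailing location number.
    pattern = r'^(?P<{block}>[A-Z*?/]+)[ NC]*(?P<{loc}>[0-9]+)$'.format(block=col_new, loc=col_new + '_loc')
    parts.append(inbase[col].str.extract(pattern, expand=True))

# Concatenate once instead of growing a DataFrame inside the loop.
df = pd.concat(parts, axis=1)
df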
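
To see what the extraction pattern does to a single column, here is a small sketch on invented strings. The values in `block` are hypothetical; the pattern is the one used in the cell above, written out for a column named `Block_A`.

```python
import pandas as pd

# Hypothetical annotation strings: letters, optional spaces / N / C, then digits.
block = pd.Series(['CLAG 1', 'TH N 72', 'S?/ 154'], name='Block_A')

# Two named groups: the leading letters and the trailing number.
pattern = r'^(?P<Block_A>[A-Z*?/]+)[ NC]*(?P<Block_A_loc>[0-9]+)$'

parts = block.str.extract(pattern, expand=True)
print(parts)
```

The location column comes back as strings; `pd.to_numeric(parts['Block_A_loc'])` would convert it if numeric comparisons are needed later.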

Now we can cluster based on the annotated domains.

In [None]:
df.groupby('Block_C').groups
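
The `.groups` attribute maps each distinct block value to the row labels that carry it. A minimal sketch on made-up data (the values and index labels in `toy` are hypothetical) shows the shape of the result, plus group sizes as a quicker summary:

```python
import pandas as pd

# Hypothetical block values; the row with a missing value is left out of every group.
toy = pd.DataFrame({'Block_C': ['HN', 'HN', 'GN', None]},
                   index=['a', 'b', 'c', 'd'])

grouped = toy.groupby('Block_C')

# Mapping from block value to the index labels of its rows.
groups = grouped.groups

# Number of rows per block value, largest first.
sizes = grouped.size().sort_values(ascending=False)
print(sizes)
```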