In [1]:
import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

import general_param as gparams
from custom import KmerTransformer

### Note on Pipelines

Pipelines are introduced only here, in the hyper-parameter tuning part.

Since we want to optimize the feature engineering itself by varying the k-mer size and the
`ngram_range`, we **load the cleaned data directly and include the feature engineering steps in
the pipeline**:

In [2]:
sequence_data = pd.read_pickle(gparams.cleanded_data)

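As before, we derive binary labels from the binding 'strength': a sequence with a truthy (e.g. non-zero) strength is labelled as binding (1), any other sequence as non-binding (0):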

In [3]:
labels = sequence_data.strength.apply(lambda x: 1 if x else 0)

# Hyper-parameter Tuning

Since we do not know the size of the binding area, we should vary both the k-mer size and the `ngram_range`.

### k-mer size tuning

We used a custom function to create the k-mers, so we need some extra steps to tune this parameter.

In fact, all we need to do is create a custom transformer by subclassing `sklearn.base.BaseEstimator` and mixing in `sklearn.base.TransformerMixin`.

See [./custom.py](./custom.py) for the definition of our custom `KmerTransformer`, which essentially promotes the `get_kmers` function from [./Feature_Engineering.ipynb](./Feature_Engineering.ipynb) into a transformer.
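
For readers without access to `custom.py`, the following is a minimal sketch of what such a transformer could look like. It is not the actual `KmerTransformer`: the names `get_kmers_sketch` and `KmerTransformerSketch` are hypothetical, and the sketch assumes that `get_kmers` produces overlapping k-mers joined by spaces, so that `CountVectorizer(analyzer='word')` can treat each k-mer as a word.

```python
from sklearn.base import BaseEstimator, TransformerMixin


def get_kmers_sketch(sequence, size):
    """Return the overlapping k-mers of `sequence`, joined by spaces (assumed behaviour)."""
    return " ".join(sequence[i:i + size] for i in range(len(sequence) - size + 1))


class KmerTransformerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for custom.KmerTransformer."""

    def __init__(self, size=3):
        # Store the argument unchanged under the same name, so that
        # get_params/set_params (and thus 'kmer__size' in a grid) work.
        self.size = size

    def fit(self, X, y=None):
        # Stateless transformer: nothing is learned from the data.
        return self

    def transform(self, X):
        # Turn every sequence into a space-separated "sentence" of k-mers.
        return [get_kmers_sketch(seq, self.size) for seq in X]


demo = KmerTransformerSketch(size=3).fit_transform(["ACGTAC"])
print(demo)
```

Because `size` is a plain constructor argument, `GridSearchCV` can set it through the pipeline as `kmer__size` without any further code.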

### ngram_range tuning

`CountVectorizer` is already a transformer, so we can simply add `ngram_range` to the parameter grid. Since _numpy 1.16.0_, `linspace` also accepts array-like objects as start/stop, which makes generating the candidate ranges easy.
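
To make the `linspace` trick concrete, here is a small illustration of how array-like start/stop values yield a list of `ngram_range` tuples. The upper bound of 4 is chosen only to keep the output short; the variable names are illustrative.

```python
import numpy as np

# Interpolate row-wise between (1, 1) and (1, 4) in 4 steps.
ranges = np.linspace((1, 1), (1, 4), 4, dtype=int)

# Convert every row into a tuple of plain Python ints.
ngram_ranges = [tuple(int(v) for v in row) for row in ranges]
print(ngram_ranges)  # [(1, 1), (1, 2), (1, 3), (1, 4)]

# The same candidates written as a list comprehension.
equivalent = [(1, n) for n in range(1, 5)]
```

The list comprehension is an equivalent alternative that does not depend on the numpy version.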

### SVC params tuning

`SVC` is an existing scikit-learn classifier, so the parameters we want to tune can simply be added to the grid under the `svc__` prefix.
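
As a sketch of what such grid entries could look like, the snippet below checks a few hypothetical SVC entries against a small stand-in pipeline. The parameter values are purely illustrative and were not evaluated in this notebook; `demo_pipeline` and `svc_grid_sketch` are assumed names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in pipeline with the same step name ('svc') as in this notebook.
demo_pipeline = Pipeline([('vect', CountVectorizer()), ('svc', SVC())])

# Hypothetical grid entries: '<step name>__<parameter name>'.
svc_grid_sketch = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}

# Every key must be a valid (nested) parameter of the pipeline.
valid = all(key in demo_pipeline.get_params() for key in svc_grid_sketch)
print(valid)
```

Each additional entry multiplies the number of candidates, which is why the SVC parameters are a natural place to cut when time is short.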

#### Note
Due to time constraints, we simply omit the tuning of the SVC params and focus on the k-mer sizes and ngram ranges.

In [4]:
kmer = KmerTransformer(size=3)
vect = CountVectorizer(analyzer='word')
svc = SVC(kernel='linear', random_state=gparams.random_state)

In [5]:
pipeline = Pipeline([
    ('kmer', kmer),
    ('vect', vect),
    ('svc', svc)
])

In [6]:
grid_params = {
    'kmer__size': np.linspace(2, 6, 5, dtype=int),
    'vect__ngram_range': list(map(tuple, np.linspace((1, 1), (1, 20), 20, dtype=int))),
    # svc__...
}

In [7]:
clf = GridSearchCV(pipeline, grid_params, error_score='raise')
clf_fitted = clf.fit(sequence_data.sequence, labels)
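
To get a feeling for the cost of this search, the size of the grid can be counted up front. The sketch below rebuilds the grid from above and assumes the default of 5 cross-validation folds (the `GridSearchCV` default since scikit-learn 0.22, as no `cv` argument is passed here); `n_candidates` and `n_folds` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = {
    'kmer__size': np.linspace(2, 6, 5, dtype=int),
    'vect__ngram_range': list(map(tuple, np.linspace((1, 1), (1, 20), 20, dtype=int))),
}

# Number of parameter combinations: 5 k-mer sizes x 20 n-gram ranges.
n_candidates = len(ParameterGrid(grid))

# Assumed default number of folds; each candidate is fitted once per fold.
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 100 500
```

Under this assumption the search performs 500 pipeline fits, plus one final refit on the full data with the best parameters.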

In [8]:
print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)

Best Score:  0.9705
Best Params:  {'kmer__size': 5, 'vect__ngram_range': (1, 6)}
