In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

import general_param as gparams

In [2]:
sequence_data = pd.read_pickle(gparams.cleanded_data)

# Feature Engineering

## Problems

1. We do not know the length of the binding sequence

2. We do not know where the binding sequence is situated in the sequence

3. The sequences vary in length, which is a problem if we want to use them in some form as input to an ML model

4. Most ML models (all of those in sklearn, as far as I know) need numerical input

## Solutions

### 4.
The sequences need to be converted. Options that come to mind:

1. One-hot encoding (and either go for a 2d input or concatenate)
2. Ordinal encoding
3. Split up into 'words' and leverage text processing tools
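
To make the first two options concrete, here is a minimal sketch of both encodings. The four-letter `ALPHABET` is an assumption made purely for illustration, and the helper names are hypothetical. Note that both outputs grow with the sequence, so neither encoding removes the variable-length problem on its own.

```python
ALPHABET = "ACGT"  # assumption: a four-letter alphabet, for illustration only


def ordinal_encode(sequence):
    """Map each symbol to its integer index in ALPHABET."""
    return [ALPHABET.index(symbol) for symbol in sequence]


def one_hot_encode(sequence):
    """Map each symbol to a 0/1 vector with a single 1 at its ALPHABET index."""
    return [[int(symbol == letter) for letter in ALPHABET] for symbol in sequence]


ordinal = ordinal_encode("GATC")   # one integer per position
one_hot = one_hot_encode("GATC")   # one row of len(ALPHABET) values per position
flat = [value for row in one_hot for value in row]  # the "concatenate" variant
```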

After some browsing:

> In genome analysis, k-mer-based comparison methods have become standard tools.

_see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5406869/_

#### Decision

**We will convert the sequence into a list of kmers and try a simple bag-of-words approach.**

### 3.
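With a bag-of-words approach we do not need to adapt the length of the sequences: the output of `sklearn.feature_extraction.text.CountVectorizer` has one column per vocabulary entry, regardless of how long each sequence is, and that output will be used as input for our model(s).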
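
A pure-Python stand-in illustrates why the row width no longer depends on the sequence length. This is a simplified sketch, not `CountVectorizer` itself; `bag_of_words` is a hypothetical helper that only splits on whitespace.

```python
from collections import Counter


def bag_of_words(documents):
    """Count tokens per document over a shared, sorted vocabulary."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Every row has exactly one entry per vocabulary item.
        rows.append([counts[token] for token in vocabulary])
    return vocabulary, rows


# Two "sequences" of different lengths, already split into words.
documents = ["abc bcd cde", "bcd cde def efg fgh"]
vocabulary, rows = bag_of_words(documents)
```

Both rows have `len(vocabulary)` entries even though the second document contains more words than the first.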

### 1. & 2.
If we **create k-mers in a rolling-word kind of manner** (i.e. abcdefg > abc, bcd, cde, def, efg) then we can **search for ngrams** and not just single words.
If we do so, we simply need to make sure that:

1. The k-mer size we choose is smaller than the binding site itself
2. The number of ngrams should not be smaller than the binding site plus twice the k-mer size
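
As a rough illustration of how n-grams of rolling k-mers relate to the original sequence: n consecutive k-mers of size k (step 1) cover n + k - 1 characters. The helpers below are hypothetical names used only for this sketch.

```python
def rolling_kmers(sequence, size):
    """Overlapping k-mers with step 1."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def covered_span(ngram):
    """Rebuild the stretch of the original sequence covered by an n-gram of rolling k-mers."""
    # Each k-mer after the first contributes exactly one new character.
    return ngram[0] + "".join(kmer[-1] for kmer in ngram[1:])


kmers = rolling_kmers("abcdefg", 3)
trigrams = ngrams(kmers, 3)
spans = [covered_span(gram) for gram in trigrams]
```

Here each trigram of 3-mers spans 3 + 3 - 1 = 5 characters of the original sequence.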

If we wanted to do this properly, we could get the data from https://pubmed.ncbi.nlm.nih.gov/22887818/ and choose the k-mer size and ngrams size such that [k-mer size, ngrams + 2 * k-mer] covers 95% of the data.
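
A sketch of what that check could look like, assuming the binding-site lengths were available as a plain list. The `toy_lengths` values are made up for illustration and are not taken from the referenced publication; `coverage` is a hypothetical helper.

```python
def coverage(site_lengths, kmer_size, ngram_size):
    """Fraction of site lengths inside [kmer_size, ngram_size + 2 * kmer_size]."""
    lower = kmer_size
    upper = ngram_size + 2 * kmer_size
    inside = sum(lower <= length <= upper for length in site_lengths)
    return inside / len(site_lengths)


# Made-up lengths, NOT real data: replace with the actual distribution.
toy_lengths = [6, 8, 8, 10, 12, 15, 20, 25, 30, 45]
fraction = coverage(toy_lengths, kmer_size=5, ngram_size=30)
meets_target = fraction >= 0.95
```

With real data, one would vary `kmer_size` and `ngram_size` until `meets_target` holds.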

#### Decision

For simplicity's sake, let's **set the kmer size to 5 and ngrams to 30**

**Note**: The two parameters were added to `general_params.py`

In [3]:
kmer_size = 5
ngram_nbr = 3

In [4]:
def get_kmers(sequence: list[str], size: int) -> tuple[str, ...]:
    '''
    Create a tuple of rolling k-mers of size `size` from a sequence
    '''
    return tuple(''.join(sequence[i: i+size]) for i in range(len(sequence) - size + 1))
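
A quick check of the function's behaviour, including the edge case of a sequence shorter than the k-mer size. The function is repeated here only so the snippet runs on its own.

```python
def get_kmers(sequence, size):
    """Same logic as the cell above, repeated to keep this snippet self-contained."""
    return tuple(''.join(sequence[i: i + size]) for i in range(len(sequence) - size + 1))


from_string = get_kmers("abcdefg", 3)      # a str works as well as a list
from_list = get_kmers(list("abcdefg"), 3)  # a list of single characters
too_short = get_kmers("ab", 3)             # shorter than `size`: no k-mers at all
```

A sequence of length L yields L - size + 1 k-mers. Sequences shorter than `size` yield an empty tuple, and therefore contribute no features later on.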

In [5]:
sequence_data['kmers'] = sequence_data.sequence.apply(lambda x: get_kmers(x, size=kmer_size))

In [6]:
input_data = pd.DataFrame({
    'kmers': sequence_data.kmers.apply(lambda x: ' '.join(x)),
    'labels': sequence_data.strength.apply(lambda x: 1 if x else 0)
})

In [None]:
input_data.to_pickle(gparams.kmers_data)

### Note
Applying `CountVectorizer` to the kmers to create the input feature vector of presence/absence of kmers and their ngrams would be part of feature engineering.

However, exporting the resulting ndarray to disk would require quite a bit of memory.
Therefore, this step will be done in the next part `Model_Selection_and_Tuning.ipynb` directly, so the data can remain in sparse form.

In [9]:
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, ngram_nbr))
X = vectorizer.fit_transform(input_data.kmers)

Now we get the final input data for our model(s):

In [13]:
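# Assigning the sparse matrix directly raises
# "TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]",
# so store one sparse row per record instead.
input_data['counters'] = list(X)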