# The first CNN (but first more preparation)
Unfortunately, CNNs cannot work with letter sequences directly, so we have to encode the sequences (and our labels) in numerical form. CNNs can in principle deal with sequences of varying length, but this would require us to split the training set by length, so we are going to pad the sequences to a uniform length instead.<br>
First we need our balanced dataset to continue.<br>
<b>Read the balanced dataset from csv into a DataFrame</b>

In [None]:
import pandas as pd

df_balanced = pd.read_csv("df_balanced.csv", index_col=0)
#df_balanced

### ASCII encoding and padding
A simple idea to encode letters is to use their numerical representation according to the ASCII table.<br>
<b>Create a new column in the DataFrame that contains the sequence as a list of decimal numbers according to the ASCII table</b>

In [None]:
df_balanced["seq_ord"] = 
#df_balanced
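
As a minimal sketch of the idea (on a hypothetical toy sequence, not the notebook's data), Python's built-in `ord()` returns the decimal code of a single character:

```python
# Hypothetical toy sequence; in the notebook the same idea would be
# applied to every sequence in the DataFrame.
toy_seq = "ACGT"

# ord() maps one character to its decimal ASCII code
toy_ord = [ord(ch) for ch in toy_seq]

print(toy_ord)  # [65, 67, 71, 84]
```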

Now we can use the pad_sequences() function from keras.preprocessing to pad our sequences so they all have the same length (https://keras.io/preprocessing/sequence/).<br>
<b>Use pad_sequences() to create a new column containing the padded ASCII-encoded sequences.</b>

In [None]:
from keras.preprocessing.sequence import pad_sequences

df_balanced["seq_ord_pad"] = list(pad_sequences(df_balanced.seq_ord.to_numpy()))
#df_balanced
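
With its default arguments, pad_sequences() pads with zeros at the start of each sequence until all sequences are as long as the longest one. The following pure-Python sketch mimics that default behaviour on toy data; `pad_pre` is a hypothetical helper written for illustration, not part of Keras:

```python
def pad_pre(seqs, value=0):
    """Left-pad every list in seqs with `value` up to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [[value] * (max_len - len(s)) + list(s) for s in seqs]

# Two hypothetical ASCII-encoded sequences of different length
toy = [[65, 67], [71, 84, 65, 67]]
padded = pad_pre(toy)

print(padded)  # [[0, 0, 65, 67], [71, 84, 65, 67]]
```

Note that the padding value 0 becomes an additional symbol that the later encodings have to account for.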

### Another numerical encoding
Since the ASCII codes are large and unevenly spaced, we want to reduce the encoding so that it uses only the smallest necessary integers. For this we can use the LabelEncoder from sklearn.preprocessing (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).<br>
<b>Use the LabelEncoder to create a column containing the sequences in a minimal numerical encoding</b>

In [None]:
from sklearn.preprocessing import LabelEncoder
import numpy as np # this might be handy

seq_enc = LabelEncoder()
seq_enc.fit()
df_balanced["seq_num_pad"] = df_balanced.seq_ord_pad.map()
#print(list(df_balanced.seq_num_pad[0]))
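
One possible way to think about this step, sketched on toy data: LabelEncoder sorts the distinct values it sees during fit() and maps them to 0, 1, 2, and so on. The array `padded` below is a hypothetical stand-in for two padded ASCII-encoded sequences; reshaping back to the original shape is one option among several.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical padded ASCII-encoded toy sequences (0 is the padding value)
padded = np.array([[0, 0, 65, 67],
                   [71, 84, 65, 67]])

enc = LabelEncoder()
enc.fit(padded.ravel())  # learns the sorted distinct values

# LabelEncoder works on 1D input, so flatten, transform, and reshape back
codes = enc.transform(padded.ravel()).reshape(padded.shape)

print(enc.classes_.tolist())  # [0, 65, 67, 71, 84]
print(codes.tolist())         # [[0, 0, 1, 2], [3, 4, 1, 2]]
```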

### One hot encoding
A popular method for encoding categorical features that avoids an implicit order is "one hot encoding". Again scikit-learn offers tools for this (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder).<br>
<b>Use OneHotEncoder to create a column that contains the one hot encoded sequence (use sparse=False; from scikit-learn 1.2 on this parameter is called sparse_output)</b>

In [None]:
from sklearn.preprocessing import OneHotEncoder

oh_enc = OneHotEncoder()
oh_enc.fit()

df_balanced["seq_oh_pad"] = df_balanced.seq_num_pad.map()

#print(df_balanced.seq_oh_pad[0])
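
To illustrate what the result should look like, here is a small NumPy stand-in (not the OneHotEncoder itself): indexing an identity matrix with integer codes yields one one hot vector per position. The array `codes` is a hypothetical example of two sequences in the minimal numerical encoding.

```python
import numpy as np

# Hypothetical minimal integer codes for two padded toy sequences
codes = np.array([[0, 0, 1, 2],
                  [3, 4, 1, 2]])
n_classes = int(codes.max()) + 1  # 5 distinct symbols, padding included

# Row k of the identity matrix is the one hot vector for code k
one_hot = np.eye(n_classes, dtype=int)[codes]

print(one_hot.shape)  # (2, 4, 5): sequences x positions x symbols
print(one_hot[0, 2])  # [0 1 0 0 0]
```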

### Flattened one hot encoding
This is a variation of the one hot encoding. Instead of having a one hot vector in each position, the vectors are transposed and concatenated (e.g. [[0,1],[1,0]] -> [0,1,1,0]).<br>
<b>Create a column that contains the flattened one hot encoded sequence</b>

In [None]:
df_balanced["seq_oh_flat"] = df_balanced.seq_oh_pad.map()
#df_balanced.seq_oh_flat[0].shape
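
The small example from the text can be reproduced with a plain NumPy reshape; this is only a sketch of one way to do it, and other approaches (for example `ravel()` or `flatten()`) should give the same result here:

```python
import numpy as np

# One hot matrix from the text: two positions, two symbols
oh = np.array([[0, 1],
               [1, 0]])

# Concatenate the per-position vectors into a single long vector
flat = oh.reshape(-1)

print(flat.tolist())  # [0, 1, 1, 0]
print(flat.shape)     # (4,)
```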

### Don't forget the labels
We have to encode the labels as well. Again scikit-learn offers several options but we will just use the LabelEncoder again, since we only have two classes (i.e. 0 or 1).<br>
<b>Create a column with encoded labels using LabelEncoder (save the encoder in a variable so we can use it for inverse transformation later)</b>

In [None]:
from sklearn.preprocessing import LabelEncoder

lbl_enc = LabelEncoder()
df_balanced["lbl_num"] = lbl_enc.fit_transform()
#df_balanced.lbl_num
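
A short sketch of the round trip, using hypothetical label names ("neg" and "pos") that are not taken from the dataset: fit_transform() produces the integer labels, and inverse_transform() recovers the readable ones later on.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels; the real column in df_balanced may differ
toy_labels = ["neg", "pos", "pos", "neg"]

toy_enc = LabelEncoder()
toy_num = toy_enc.fit_transform(toy_labels)

print(toy_num.tolist())                             # [0, 1, 1, 0]
print(toy_enc.inverse_transform(toy_num).tolist())  # ['neg', 'pos', 'pos', 'neg']
```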

<b>Save all of our hard work to a pickle so we can use it in other notebooks</b>

In [None]:
import pickle

# the context manager makes sure the file is closed after writing
with open("df_balanced_enc.pickle", "wb") as f:
    pickle.dump(df_balanced, f)

## Train/Test
<b>Create an 80/20 train/test split using the encoded labels!</b><br>
You can fix the random seed to get a reproducible split.

In [None]:
from sklearn.model_selection import train_test_split

rnd_seed=42
xTrain, xTest, yTrain, yTest = 
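
A minimal sketch of such a split on hypothetical toy data. The `stratify` argument is an optional extra that is not required by the task; it keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features, balanced binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,     # 80/20 split
    random_state=42,   # fixed seed for a reproducible split
    stratify=y_toy,    # optional: preserve the class balance
)

print(x_tr.shape, x_te.shape)  # (8, 2) (2, 2)
```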

# Create First convolutional model
Now we are finally ready to create our first model.<br>
The model is going to have the following simple architecture:
<li>A 1D convolutional layer using a ReLU activation with padding="same" (Why?). Let's start with 16 filters of size 7. Use the input_shape parameter.</li>
<li>A max pooling layer reducing by a factor of 2</li>
<li>A Flatten layer</li>
<li>A Dense output layer using a sigmoid activation</li>
<li>We will use the SGD optimizer with momentum set to 0.9. <i>What is a good choice for the loss function?</i></li><br>

Since we want to use GridSearchCV from sklearn later on, we will create a function that returns the compiled model (this is needed for the scikit-learn wrapper).<br>
This function should have a parameter for the input_shape, because we are going to test our different encoding schemes.
It should also have parameters for the number of filters and their size in the first convolutional layer, which we will need for the grid search. <i>Feel free to insert a print(model.summary()) to get an overview of the model.</i><br>We are going to train for 10 epochs using a batch size of 32.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import utils

epochs=
batch_size=

def create_first_model(input_shape, opt=, c1_filter=, c1_size=, verbose=):
    # create the model
    model = keras.Sequential()
    
    
    model.compile()
    
    if verbose:
        print(model.summary())
    
    return model
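
To get a feeling for how the input shape affects this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes a stride of 1, bias terms in both layers, and a single output unit; `first_model_params` is a hypothetical helper, and the sequence length of 100 and the alphabet size of 5 are made-up numbers, not values from the dataset.

```python
def first_model_params(seq_len, channels, filters=16, kernel=7):
    """Count trainable parameters of Conv1D -> MaxPool(2) -> Flatten -> Dense(1)."""
    # Each filter has kernel * channels weights plus one bias
    conv = filters * (kernel * channels + 1)
    # padding="same" with stride 1 keeps the length; pooling halves it
    pooled_len = seq_len // 2
    # Dense layer: one weight per flattened value plus one bias
    dense = pooled_len * filters + 1
    return conv + dense

print(first_model_params(100, 1))  # integer encoded input: 929
print(first_model_params(100, 5))  # one hot input, 5 symbols: 1377
print(first_model_params(500, 1))  # flattened one hot input: 4129
```

Under these assumptions the flattened one hot input leads to the largest model, because the Dense layer grows with the length of the input.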


## Test different encodings
Now we will test the different encoding techniques with the first model. For each test the KerasClassifier wrapper will be used to create the model (https://keras.io/scikit-learn-api/).
<br>The model should then be trained with the training data, and a classification report for the predictions on the test data should be generated. You can use the inverse_transform function of the LabelEncoder to get readable labels in the classification report.<br>Use the verbose option while fitting to monitor training progress.

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.metrics import classification_report

### First model with ASCII encoded sequences
<b> Create a KerasClassifier, train the model using the ASCII encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [None]:
model = KerasClassifier(build_fn=, input_shape=, verbose=)
model.fit(, , epochs=, batch_size=, verbose=)
yPred = model.predict()
print(classification_report( lbl_enc.inverse_transform(yTest.to_list()), lbl_enc.inverse_transform(yPred.ravel())))
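
A 1D convolutional layer expects input of shape (samples, steps, channels), while a DataFrame column holds one 1D array per row. The following sketch shows one possible way to get from one to the other; the list `col` is a hypothetical stand-in for such a column.

```python
import numpy as np

# Hypothetical stand-in for a column of padded, integer encoded sequences
col = [np.array([0, 0, 65, 67]),
       np.array([71, 84, 65, 67])]

# Stack the rows into one 2D array, then add a channel axis of size 1
X = np.stack(col).reshape(len(col), -1, 1)

print(X.shape)      # (2, 4, 1)
print(X.shape[1:])  # (4, 1), a candidate for input_shape
```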

### First model with numerically encoded sequences
<b> Create a KerasClassifier, train the model using the numerically encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [None]:
model = KerasClassifier(build_fn=, input_shape=, verbose=)
model.fit(, , epochs=, batch_size=, verbose=) # start training
yPred = model.predict()
print(classification_report( yTest.to_list(), yPred ))

### First model with one hot encoded sequences
<b> Create a KerasClassifier, train the model using the one hot encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [None]:
model = KerasClassifier(build_fn=create_first_model, input_shape=, verbose=)
model.fit(, , epochs=, batch_size=, verbose=) # start training
yPred = model.predict()
print(classification_report( yTest.to_list(), yPred ))

### First model with flat one hot encoded sequences
<b> Create a KerasClassifier, train the model using the flat one hot encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [None]:
model = KerasClassifier(build_fn=, input_shape=, verbose=)
model.fit(, , epochs=, batch_size=, verbose=) # start training
yPred = model.predict()
print(classification_report( yTest.to_list(), yPred ))

# Results
Which encoding performed best?<br>
Re-run the cells. Are the results stable?<br>
Are all the CNNs we used really the same?

# Embrace the randomness!
In order to better evaluate the impact of the encoding, we are going to perform a 5-fold cross validation. For this we use the StratifiedKFold class from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) and evaluate the Matthews Correlation Coefficient (MCC) and the F1 score for each model in each round. <br>
<b> Create a DataFrame with a column for each metric for each model.<br> 
    Create splits from the training dataset using StratifiedKFold.<br>
    For each split train the model for all 4 encoding schemes and calculate the MCC and the F1 score for the test data (of the split).<br>
    Create a new row for each round in the DataFrame and save the MCCs and F1 scores in the appropriate column<br>
    Finally add a row to the DataFrame containing the mean for each model over all rounds</b>

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef, f1_score

#to save scores
df_scores = pd.DataFrame(columns=[])
df_scores.index.name = "Round"

# define 5-fold cross validation test
kfold = StratifiedKFold(n_splits=5, shuffle=True)

for i, (train_index, test_index) in enumerate(kfold.split(xTrain, yTrain)):
    
    # get the splits
    X_train, X_test = 
    y_train, y_test = 
    
    # train ascii model

        
    # train num model

    
    # train one hot 

    
    # train flat one hot

    
    # save the scores of this round
    df_scores.loc[i] = []
    
    print("Finished round {}".format(i))
    
df_scores.loc['mean'] = df_scores.mean()
df_scores
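
The bookkeeping of such a cross validation loop can be sketched without any neural network. In the example below the data is random, the "model" is a simple threshold rule standing in for a trained CNN, and the column names are hypothetical; the rows are collected in a list first, which is one alternative to filling the DataFrame row by row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef, f1_score

# Hypothetical toy data: 40 samples, 3 features, balanced binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.array([0, 1] * 20)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rows = []
for i, (tr, te) in enumerate(kfold.split(X, y)):
    # Stand-in for training: threshold the first feature at its training mean
    thr = X[tr, 0].mean()
    y_pred = (X[te, 0] > thr).astype(int)
    # One row per round with both metrics
    rows.append([matthews_corrcoef(y[te], y_pred), f1_score(y[te], y_pred)])

scores = pd.DataFrame(rows, columns=["toy_mcc", "toy_f1"])
scores.index.name = "Round"
scores.loc["mean"] = scores.mean()  # mean over all rounds
print(scores)
```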

We can observe that the encoding can have an impact on the performance of our model.<br>
The one hot encoding seems to produce the best results so far, so we are going to concentrate on this encoding scheme when evaluating our next model.<br>
The flattened one hot encoding performed almost as well, but it also increased the number of trainable parameters of our model (Why should that be a concern?)<br>
<b>Let's move on to the next notebook "03_second_CNN" with a slightly more complex model</b>