# The first CNN (but first more preparation)
Unfortunately CNNs cannot work with letter sequences directly so we have to think about how to encode the sequences (and our labels) into numerical form, and although CNNs are able to deal with sequences of varying length this would require us to split the training set by length so we are going to pad the sequences to have a uniform length.<br>
First we need our balanced dataset to continue.<br>
<b>Read the balanced dataset from csv into a DataFrame</b>

In [13]:
import pandas as pd

df_balanced = pd.read_csv("df_balanced.csv", index_col=0)
df_balanced

Unnamed: 0,Seq,Label,Length,GC_content,ATGC_ratio
Arabidopsis_thaliana300001_SnoR1b,GGCGAGGATGAATAATGCTAAATTTCTGACACCTCTTGTATGAGGA...,CD-box,93,0.419355,1.384615
Arabidopsis_thaliana300003_SnoR10-1,AGAAATGATGAGAAATCAGATAAATCTTAGGACACCTTCTGACACA...,CD-box,81,0.345679,1.892857
Arabidopsis_thaliana300006_SnoR101,GGGATACACTTGATCTCTGAACTTCACAGGTAAGTTCGCTTGTTGA...,CD-box,68,0.441176,1.266667
Arabidopsis_thaliana300007_SnoR102,AGAAGTCAATAGACCAGACATTGTGGTAACACTCTCTTTCATGGCA...,CD-box,133,0.413534,1.418182
Arabidopsis_thaliana300010_SnoR105,AGGGGATATGATGAATGGTAAAAACTCGCTTATATTGCGAGAAGAG...,CD-box,107,0.448598,1.229167
...,...,...,...,...,...
Saccharomyces_cerevisiae300063_snR35,ATACAAAATTAATCGTGCGGATTAATAATCCAGGACTATAAAACCG...,HACA-box,204,0.392157,1.550000
Saccharomyces_cerevisiae300064_snR5,ATCATTCAATAAACTGATCTTCCGGATTACCATGCTTAAGACATCA...,HACA-box,204,0.328431,2.044776
Saccharomyces_cerevisiae300065_snR9,GGGAATATAATACTAAATACTCTGTTATATAGAACTTTCTACGCCT...,HACA-box,187,0.379679,1.633803
Saccharomyces_cerevisiae300070_snR44,CTCCGGGCTGATAACTAGATGGTGTGATCGGGCAGTATACTAATTT...,HACA-box,211,0.388626,1.573171


### ASCII encoding and padding
A simple idea to encode letters is to use their numerical representation according to the ASCII table.<br>
<b>Create a new column in the DataFrame that contains the sequence as a list of decimal numbers according to the ASCII table</b>

In [14]:
df_balanced["seq_ord"] = df_balanced.Seq.map(lambda x: [ord(c) for c in x])
#df_balanced

Now we can use the pad_sequences() function from keras.preprocessing to pad our sequences so they all have the same length (https://keras.io/preprocessing/sequence/).<br>
<b> Use pad_sequences() to create a new column containing the padded ascii encoded sequences.</b>

In [15]:
from keras.preprocessing.sequence import pad_sequences

df_balanced["seq_ord_pad"] = list(pad_sequences(df_balanced.seq_ord.to_numpy()))
df_balanced

Unnamed: 0,Seq,Label,Length,GC_content,ATGC_ratio,seq_ord,seq_ord_pad
Arabidopsis_thaliana300001_SnoR1b,GGCGAGGATGAATAATGCTAAATTTCTGACACCTCTTGTATGAGGA...,CD-box,93,0.419355,1.384615,"[71, 71, 67, 71, 65, 71, 71, 65, 84, 71, 65, 6...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
Arabidopsis_thaliana300003_SnoR10-1,AGAAATGATGAGAAATCAGATAAATCTTAGGACACCTTCTGACACA...,CD-box,81,0.345679,1.892857,"[65, 71, 65, 65, 65, 84, 71, 65, 84, 71, 65, 7...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
Arabidopsis_thaliana300006_SnoR101,GGGATACACTTGATCTCTGAACTTCACAGGTAAGTTCGCTTGTTGA...,CD-box,68,0.441176,1.266667,"[71, 71, 71, 65, 84, 65, 67, 65, 67, 84, 84, 7...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
Arabidopsis_thaliana300007_SnoR102,AGAAGTCAATAGACCAGACATTGTGGTAACACTCTCTTTCATGGCA...,CD-box,133,0.413534,1.418182,"[65, 71, 65, 65, 71, 84, 67, 65, 65, 84, 65, 7...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
Arabidopsis_thaliana300010_SnoR105,AGGGGATATGATGAATGGTAAAAACTCGCTTATATTGCGAGAAGAG...,CD-box,107,0.448598,1.229167,"[65, 71, 71, 71, 71, 65, 84, 65, 84, 71, 65, 8...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...,...,...,...,...
Saccharomyces_cerevisiae300063_snR35,ATACAAAATTAATCGTGCGGATTAATAATCCAGGACTATAAAACCG...,HACA-box,204,0.392157,1.550000,"[65, 84, 65, 67, 65, 65, 65, 65, 84, 84, 65, 6...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
Saccharomyces_cerevisiae300064_snR5,ATCATTCAATAAACTGATCTTCCGGATTACCATGCTTAAGACATCA...,HACA-box,204,0.328431,2.044776,"[65, 84, 67, 65, 84, 84, 67, 65, 65, 84, 65, 6...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
Saccharomyces_cerevisiae300065_snR9,GGGAATATAATACTAAATACTCTGTTATATAGAACTTTCTACGCCT...,HACA-box,187,0.379679,1.633803,"[71, 71, 71, 65, 65, 84, 65, 84, 65, 65, 84, 6...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
Saccharomyces_cerevisiae300070_snR44,CTCCGGGCTGATAACTAGATGGTGTGATCGGGCAGTATACTAATTT...,HACA-box,211,0.388626,1.573171,"[67, 84, 67, 67, 71, 71, 71, 67, 84, 71, 65, 8...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


### Another numerical encoding
Since the ASCII representations can be large and also have varying distance between them, we want to reduce the encoding so that it uses only the smallest necessary integers. For this we can use the LabelEncoder from sklearn.preprocessing (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html ).<br>
<b>Use the LabelEncoder to create a column containg the sequences in a minimal numerical encoding</b>

In [17]:
from sklearn.preprocessing import LabelEncoder
import numpy as np # this might be handy

seq_enc = LabelEncoder()
seq_enc.fit(np.hstack(df_balanced.seq_ord_pad.values))
df_balanced["seq_num_pad"] = df_balanced.seq_ord_pad.map(lambda x: seq_enc.transform(x))
print(list(df_balanced.seq_num_pad[0]))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

### One hot encoding
A popular method for encdoding categorical features that avoids an implicit order is "one hot encoding". Again scikit-learn offers tools for this (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder ).<br>
<b>Use OneHotEncoder to create column that contains the one hot encoded sequence (use sparse=False)</b>

In [18]:
from sklearn.preprocessing import OneHotEncoder

oh_enc = OneHotEncoder(sparse=False, categories="auto")
oh_enc.fit(np.hstack(df_balanced.seq_num_pad.to_numpy()).reshape(-1, 1))

df_balanced["seq_oh_pad"] = df_balanced.seq_num_pad.map(lambda x: oh_enc.transform(x.reshape(-1,1)))

print(df_balanced.seq_oh_pad[0])

[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]]


### Flattened one hot encoding
This is a variation of the one hot encoding. Instead of having a one hot vector in each position the vectors are transposed and concatenated (i.e [[0,1],[1,0]] -> [0,1,1,0]).<br>
<b>Create column that contains the flattened one hot encoded sequence</b>

In [20]:
df_balanced["seq_oh_flat"] = df_balanced.seq_oh_pad.map(lambda x: x.flatten())
df_balanced.seq_oh_flat[0].shape

(5020,)

### Don't forget the labels
We have to encode the labels as well. Again scikit-learn offers several options but we will just use the LabelEncoder again, since we only have two classes (i.e. 0 or 1).<br>
<b>Create a column with encoded labels using LabelEncoder (save the encoder in a variable so we can use it for inverse transformation later)</b>

In [21]:
from sklearn.preprocessing import LabelEncoder

lbl_enc = LabelEncoder()
df_balanced["lbl_num"] = lbl_enc.fit_transform(df_balanced.Label)
df_balanced.lbl_num

Arabidopsis_thaliana300001_SnoR1b        0
Arabidopsis_thaliana300003_SnoR10-1      0
Arabidopsis_thaliana300006_SnoR101       0
Arabidopsis_thaliana300007_SnoR102       0
Arabidopsis_thaliana300010_SnoR105       0
                                        ..
Saccharomyces_cerevisiae300063_snR35     1
Saccharomyces_cerevisiae300064_snR5      1
Saccharomyces_cerevisiae300065_snR9      1
Saccharomyces_cerevisiae300070_snR44     1
Saccharomyces_cerevisiae300073_snR191    1
Name: lbl_num, Length: 984, dtype: int64

<b>Save all of our hard work to a pickle so we can use it in other notebooks<b>

In [8]:
import pickle 

pickle.dump(df_balanced, open("df_balanced_enc.pickle", "wb"))

## Train/Test
<b>Create a 80/20 train test split, use the encoded labels!</b><br>
You can fix the random seed to get a reproducible split

In [9]:
from sklearn.model_selection import train_test_split

rnd_seed=42
xTrain, xTest, yTrain, yTest = train_test_split( df_balanced, df_balanced.lbl_num, test_size=0.2, random_state=rnd_seed )

# Create First convolutional model
Now we are finally ready to create our first model.<br>
The model is going to have the following simple architecture:
<li>A 1D convolutional layer using an RelU activation with padding="same" (Why?). Let's start with 16 filters of size 7. Use the input_shape parameter </li>
<li>A max pooling layer reducing by a factor of 2</li>
<li>A Flatten layer</li>
<li>A Dense output layer using a sigmoid activation</li>
<li>We will use the SGD optimizer with momentum set to 0.9. <i>What is a good choice for the loss function?</i></li><br>

Since we want to use GridSearchCV from sklearn later on we will create a function that returns the compiled model (This is needed for the scikit-learn wraper).<br>
This function should have parameters for the input_shape, because we are going to test our different encoding schemes.
It should also have parameters for the number of filters and their sizes in the first convolutional layer for the grid search. <i>Feel free to insert a print(model.summary()) to get an overview of the model</i><br>We are going to train for 10 epochs using a batch size of 32

In [22]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import utils

epochs=10
batch_size=32

def create_first_model(input_shape, opt=keras.optimizers.SGD(momentum=0.9), c1_filter=16, c1_size=7, verbose=0):
    # create the model
    model = keras.Sequential()
    model.add(keras.layers.Conv1D(c1_filter, c1_size, activation='relu', padding='same', input_shape=input_shape))
    model.add(keras.layers.MaxPooling1D(2))
    model.add(keras.layers.Flatten())    
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    
    model.compile(optimizer=keras.optimizers.SGD(momentum=0.9),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    if verbose:
        print(model.summary())
    
    return model


## Test different encodings
Now we will test the different encoding techniques with the first model. For each test the KerasClassifier wrapper will be used to create the model (https://keras.io/scikit-learn-api/ )
<br>The model should then be trained with the training data and a classification report for the predictions on the test data should be generated. You can use th inverse_transform function of the LabelEncoder to have readable labels in the classification report.<br>Use the verbose option while fitting to monitor training progress.

In [23]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.metrics import classification_report

### First model with ASCII encoded sequences
<b> Create a KerasClassifier, train the model using the ASCII encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [12]:
model = KerasClassifier(build_fn=create_first_model, input_shape=(1004,1), verbose=1)
model.fit(np.array(xTrain.seq_ord_pad.to_list()).reshape(-1,1004,1), yTrain.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0)
yPred = model.predict(np.array(xTest.seq_ord_pad.to_list()).reshape(-1,1004,1))
print(classification_report( lbl_enc.inverse_transform(yTest.to_list()), lbl_enc.inverse_transform(yPred.ravel())))

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d (Conv1D)              (None, 1004, 16)          128       
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 502, 16)           0         
_________________________________________________________________
flatten (Flatten)            (None, 8032)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 8033      
Total params: 8,161
Trainable params: 8,161
Non-trainable params: 0
_________________________________________________________________
None
Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your m

### First model with numerically encoded sequences
<b> Create a KerasClassifier, train the model using the numerically encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [14]:
model = KerasClassifier(build_fn=create_first_model, input_shape=(1004,1), verbose=1)
model.fit(np.array(xTrain.seq_num_pad.to_list()).reshape(787,1004,1), yTrain.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0) # start training
yPred = model.predict(np.array(xTest.seq_num_pad.to_list()).reshape(-1,1004,1))
print(classification_report( yTest.to_list(), yPred ))

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 1004, 16)          128       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 502, 16)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8032)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 8033      
Total params: 8,161
Trainable params: 8,161
Non-trainable params: 0
_________________________________________________________________
None
              precision    recall  f1-score   support

           0       0.88      0.96      0.92        99
           1       0.96      0.87      0.91        98

    accuracy                           0.91       197
   macro avg       0.92    

## First model with one hot encoded sequences
<b> Create a KerasClassifier, train the model using the one hot encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [13]:
model = KerasClassifier(build_fn=create_first_model, input_shape=(1004,5), verbose=1)
model.fit(np.array(xTrain.seq_oh_pad.to_list()).reshape(787,1004,5), yTrain.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0) # start training
yPred = model.predict(np.array(xTest.seq_oh_pad.to_list()).reshape(-1,1004,5))
print(classification_report( yTest.to_list(), yPred ))

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 1004, 16)          576       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 502, 16)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8032)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 8033      
Total params: 8,609
Trainable params: 8,609
Non-trainable params: 0
_________________________________________________________________
None
              precision    recall  f1-score   support

           0       0.94      0.94      0.94        99
           1       0.94      0.94      0.94        98

    accuracy                           0.94       197
   macro avg       0.94    

## First model with flat one hot encoded sequences
<b> Create a KerasClassifier, train the model using the flat one hot encoded training data and evaluate the model (numpy.reshape might come in handy)</b>

In [14]:
model = KerasClassifier(build_fn=create_first_model, input_shape=(5020,1), verbose=1)
model.fit(np.array(xTrain.seq_oh_flat.to_list()).reshape(787,5020,1), yTrain.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0) # start training
yPred = model.predict(np.array(xTest.seq_oh_flat.to_list()).reshape(-1,5020,1))
print(classification_report( yTest.to_list(), yPred ))

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_2 (Conv1D)            (None, 5020, 16)          128       
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 2510, 16)          0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 40160)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 40161     
Total params: 40,289
Trainable params: 40,289
Non-trainable params: 0
_________________________________________________________________
None
              precision    recall  f1-score   support

           0       0.87      0.97      0.92        99
           1       0.97      0.86      0.91        98

    accuracy                           0.91       197
   macro avg       0.92  

# Results
Which encoding performed best?<br>
Re-run the cells. Are the results stable?<br>
Are all CNNs used the same?

# Embrace the randomness!
In order to better evaluate the impact of the encoding we are going to perform a 5-fold cross validation. For this we are using the StratifiedKFold function from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) and evaluate the Matthews Correlation Coefficient and the F1 score for each model in each round. <br>
<b> Create a DataFrame with a column for each metric for each model.<br> 
    Create splits from the training dataset using StratifiedKFold.<br>
    For each split train the model for all 4 encoding schemes and calculate the MCC for the test data (of the split).<br>
    Create a new row for each round in the DataFrame and save the MCCs and F1 scores in the appropriate column<br>
    Finally add a row to the DataFrame containing the mean for each model over all rounds

In [16]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef, f1_score
import pandas as pd

#to save scores
df_scores = pd.DataFrame(columns=["ord_mcc", "ord_f1",
                                  "num_mcc", "num_f1",
                                  "oh_mcc", "oh_f1",
                                  "foh_mcc", "foh_f1"])
df_scores.index.name = "Round"

# define 5-fold cross validation test 
kfold = StratifiedKFold(n_splits=2, shuffle=True)

for i, (train_index, test_index) in enumerate(kfold.split(xTrain, yTrain)):
    #print (i,train_index, test_index)
    
    # get the splits
    X_train, X_test = xTrain.iloc[train_index], xTrain.iloc[test_index]
    y_train, y_test = yTrain.iloc[train_index], yTrain.iloc[test_index]
    
    # train ascii model
    model_ord = KerasClassifier(build_fn=create_first_model, input_shape=(1004,1), verbose=0)
    model_ord.fit(np.array(X_train.seq_ord_pad.to_list()).reshape(-1,1004,1), y_train.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0)
    yPred_ord = model_ord.predict(np.array(X_test.seq_ord_pad.to_list()).reshape(-1,1004,1))
        
    # train num model
    model_num = KerasClassifier(build_fn=create_first_model, input_shape=(1004,1), verbose=0)
    model_num.fit(np.array(X_train.seq_num_pad.to_list()).reshape(-1,1004,1), y_train.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0) # start training
    yPred_num = model_num.predict(np.array(X_test.seq_num_pad.to_list()).reshape(-1,1004,1))
    
    # train one hot 
    model_oh = KerasClassifier(build_fn=create_first_model, input_shape=(1004,5), verbose=0)
    model_oh.fit(np.array(X_train.seq_oh_pad.to_list()).reshape(-1,1004,5), y_train.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0) # start training
    yPred_oh = model_oh.predict(np.array(X_test.seq_oh_pad.to_list()).reshape(-1,1004,5))
    
    # train flat one hot
    model_foh = KerasClassifier(build_fn=create_first_model, input_shape=(5020,1), verbose=0)
    model_foh.fit(np.array(X_train.seq_oh_flat.to_list()).reshape(-1,5020,1), y_train.to_numpy(), epochs=epochs, batch_size=batch_size, verbose=0) # start training
    yPred_foh = model_foh.predict(np.array(X_test.seq_oh_flat.to_list()).reshape(-1,5020,1))
    
    # save predictions
    df_scores.loc[i] = [matthews_corrcoef(y_test.to_list(), yPred_ord), f1_score(y_test.to_list(), yPred_ord),
                        matthews_corrcoef(y_test.to_list(), yPred_num), f1_score(y_test.to_list(), yPred_num),
                        matthews_corrcoef(y_test.to_list(), yPred_oh), f1_score(y_test.to_list(), yPred_oh),
                        matthews_corrcoef(y_test.to_list(), yPred_foh), f1_score(y_test.to_list(), yPred_foh)]
    
    print("Finished round {}".format(i))
    
df_scores.loc['mean'] = df_scores.mean()
print(df_scores)

0 [  2   3   5   6  10  11  13  14  17  18  22  26  27  28  29  30  33  35
  39  40  41  43  44  45  46  47  49  50  52  53  54  55  60  61  62  66
  67  68  69  70  71  72  73  76  79  83  85  86  90  96  98  99 100 101
 102 105 108 113 114 121 122 124 125 127 129 131 132 133 134 135 138 139
 142 143 144 145 146 147 151 152 153 154 160 161 163 166 167 169 170 171
 172 173 174 175 177 178 179 186 187 189 190 191 193 195 197 203 204 208
 212 213 215 216 218 220 223 225 227 228 231 232 233 236 239 240 244 246
 247 249 252 255 256 257 261 263 265 266 267 268 269 272 277 278 279 283
 287 293 294 295 298 299 300 302 305 306 307 311 312 314 315 316 317 319
 322 323 326 328 330 332 337 339 340 343 345 347 348 349 353 354 357 359
 362 365 370 372 373 375 379 382 384 385 388 389 390 393 396 398 401 402
 403 404 405 406 407 409 412 413 415 416 417 420 421 422 425 427 431 434
 436 437 439 440 444 445 447 450 451 453 455 461 463 464 466 467 468 469
 470 474 475 476 478 479 480 483 484 486 488 489 

We can observe that the encoding can have an impact on the performance of our model.<br>
The one hot encoding seems to produce the best results so far, so we are going to concentrate on this encoding scheme when evaluating our next model.<br>
The flattened one hot encoding also performed almost as good but it also increased the trainable parameters of our model (Why should that be a concern?)<br>
<b>Let's move on to the next notebook "03_second_CNN" with a slightly more complex model<b>