# Introduction
In this kernel, I use a convolutional neural network (CNN) built with Keras to tackle the Digit Recognizer problem.

# Import Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Load the training data into a data frame and display some statistics about its content:

In [None]:
df_train = pd.read_csv('../input/train.csv')
df_train.describe()

# Check, Clean and Preprocess Data

Here we check whether the training data contains any NULL or NaN values, which would indicate missing data:

In [None]:
df_train.isnull().values.any()
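
The single boolean above only says whether anything is missing. If it ever returned True, a per-column count would help locate the problem. Here is a minimal sketch on a small made-up frame (`df_demo` and its columns are illustrative and not part of the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing pixel value.
df_demo = pd.DataFrame({'label': [1, 7, 3],
                        'pixel0': [0, np.nan, 0],
                        'pixel1': [12, 255, 0]})

print(df_demo.isnull().values.any())  # is anything missing at all?

# Count the missing values per column and keep only the affected columns.
missing_per_column = df_demo.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```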

# Extract Data

Now we extract the data into a NumPy array. Keras methods need arrays, and all the preprocessing is faster on an array than on a data frame (indexing a data frame is slower because of the extra features the data frame provides).

In [None]:
data = df_train.values[:,:] 
print(data[:5]) # display the first 5 elements
print(data.shape) # display the dimension of the array
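
The indexing overhead can be checked with a rough timing comparison. The sketch below uses a synthetic table (`df_demo`, with arbitrary sizes) rather than the real training data, and the measured times will vary from machine to machine:

```python
import time

import numpy as np
import pandas as pd

# Small synthetic stand-in for the training table (sizes are arbitrary).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.integers(0, 256, size=(2000, 785)))
arr_demo = df_demo.values

# Element-wise access through the data frame.
start = time.perf_counter()
total_df = sum(int(df_demo.iloc[i, 0]) for i in range(len(df_demo)))
time_df = time.perf_counter() - start

# The same access pattern on the NumPy array.
start = time.perf_counter()
total_arr = sum(int(arr_demo[i, 0]) for i in range(arr_demo.shape[0]))
time_arr = time.perf_counter() - start

print("Same result:", total_df == total_arr)
print("DataFrame (s):", time_df, " NumPy array (s):", time_arr)
```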

# Create Training and Test Sets
Here I chose to split the original training data into two sets, a training set and a test set, because I did not implement CNN tuning. With CNN tuning, the best practice is to split the original training data into at least three sets (training, validation and test), or even more subsets depending on the fine-tuning approach.

The steps are:
1. enlarge the original training data with new pictures obtained by geometric transformation of the original data
2. shuffle the data
3. split the data into 80% training and 20% test

This is performed in the following three sections.

## Data Augmentation
Create additional data from the existing data, that is, enlarge the original training data with new pictures obtained by geometric transformation of the original ones. The geometric transformations are:

1. Rotation, clockwise and counterclockwise
2. Shift, up, down, left and right  
(inspired by section 3.3 of the kernel "[Introduction to CNN Keras](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6)")

Note: For picture processing with neural networks, enlarging the data with geometric transformations is known to have a positive effect. In contrast, enlarging the data with noisy data is known not to improve the results.
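
Before applying the transformations to the digits, it helps to check what they do on a toy image. The sketch below uses a made-up 5x5 image (`img_demo`) and nearest-neighbour interpolation (`order=0`) so that the values stay exact. These are choices for the demonstration only, not the settings used for the augmentation:

```python
import numpy as np
from scipy import ndimage

# 5x5 toy image with a single bright pixel in the centre (row 2, column 2).
img_demo = np.zeros((5, 5))
img_demo[2, 2] = 255

# The shift is given per axis, in (rows, columns) order.
moved_down = ndimage.shift(img_demo, (1, 0), order=0)
moved_right = ndimage.shift(img_demo, (0, 1), order=0)
rotated = ndimage.rotate(img_demo, 10, reshape=False)

print(np.argwhere(moved_down == 255))   # bright pixel is now one row lower
print(np.argwhere(moved_right == 255))  # bright pixel is now one column further right
print(rotated.shape)                    # reshape=False keeps the 5x5 shape
```

Since row 0 is the top of the picture, a positive shift along the first axis moves the digit down and a negative one moves it up.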


The coding approach is the following:
* The original data set is NOT enlarged on the fly. Instead, a separate data set with the new data is created and merged with the original data at the very end, once all the new data have been generated.
* The new data are flushed to separate CSV files, one per geometric transformation. This limits the file size and allows each transformation to be processed in parallel if needed.
* The flush happens once a certain number of new images have been generated. This number should not be too large, to limit the overhead of array indexing and of array stacking with numpy.vstack, and not too small, to limit the file access overhead. A good value, found by experience, is 100.
* The data processing is implemented in a single function, doWork, so that it can easily be parallelized if needed.

This approach was guided by performance, with the indexing overhead especially in mind. Experience showed that enlarging the original data set on the fly takes many hours, whereas the following approach takes a few minutes on the same hardware.
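
The flushing scheme can be illustrated in isolation. The sketch below is a simplified stand-in, not the function used in this kernel: `write_in_chunks`, the toy data and the temporary file are all hypothetical. It appends rows to a CSV file in chunks and writes the header only once:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_in_chunks(rows, columns, file_name, chunk_size=100):
    """Append `rows` to a CSV file, `chunk_size` rows at a time."""
    for start in range(0, rows.shape[0], chunk_size):
        chunk = pd.DataFrame(rows[start:start + chunk_size], columns=columns)
        # Write the header only with the first chunk.
        chunk.to_csv(file_name, mode='a', index=False, header=(start == 0))


# 250 toy rows are written as three chunks of 100, 100 and 50 rows.
demo_rows = np.arange(250 * 3).reshape(250, 3)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_in_chunks(demo_rows, ['label', 'p0', 'p1'], demo_path)

demo_back = pd.read_csv(demo_path)
print(demo_back.shape)
```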


In [None]:
from scipy import ndimage
from enum import IntEnum

class ProcessAction(IntEnum):
    RotateClockwise = 0
    RotateCounterClockwise = 1
    ShiftUp=2
    ShiftDown=3
    ShiftLeft=4
    ShiftRight=5

file_name_ext=['RotateClockwise', 'RotateCounterClockwise', 'ShiftUp', 'ShiftDown', 'ShiftLeft', 'ShiftRight']

In [None]:
# Delete existing files if any
import os
for i in ProcessAction:
    file_name = 'train_'+ file_name_ext[int(i)] + '.csv'
    if os.path.isfile(file_name):
        os.remove(file_name)

In [None]:
# Data processing method
def doWork(data, action):
    flush_limit=100
    display_limit=1000
    max=data.shape[0]
    flush_count=0
    for i in range(0,max):
        # Create additional image for each signal image from the initial data set.
        img=data[i,1:] # features    
        lbl=data[i,0] # labels
    
        img=img.reshape((28, 28))
        
        if action == ProcessAction.RotateClockwise:
            new_img= ndimage.rotate(img,-10,reshape=False)            

        elif action == ProcessAction.RotateCounterClockwise:    
            new_img= ndimage.rotate(img,10,reshape=False)

        elif action == ProcessAction.ShiftUp:
            # shift is (rows, columns); a negative row shift moves the digit up
            new_img= ndimage.shift(img,(-28*0.1,0))

        elif action == ProcessAction.ShiftDown:
            new_img= ndimage.shift(img,(28*0.1,0))

        elif action == ProcessAction.ShiftLeft:
            new_img= ndimage.shift(img,(0,-28*0.1))

        elif action == ProcessAction.ShiftRight:
            new_img= ndimage.shift(img,(0,28*0.1))

        else:
            print("Unknown action ", action)
            return
    
        new_img_data=np.append([lbl],new_img.reshape((1,784))).reshape((1,785))

        if (i % flush_limit == 0 and i != 0) or i == (max-1):

            if i == (max-1):
                # add the very last conversion
                data_local = np.vstack((data_local,new_img_data))
           
            #-----------------------------------------------------
            # Save data, flush every flush_limit images
            #-----------------------------------------------------
                        
            df_new_train = pd.DataFrame(data_local, columns=df_train.columns)
        
            file_name='train_'+ file_name_ext[int(action)] + '.csv'
                
            if flush_count == 0:
                df_new_train.to_csv(file_name, sep=',', mode='a', index=False)
            else:
                df_new_train.to_csv(file_name, sep=',', mode='a', index=False, header=False)

            flush_count = flush_count+1


        if i % flush_limit == 0:
            data_local = new_img_data
        else:
            data_local = np.vstack((data_local,new_img_data))
       
        if i % display_limit == 0:
            print("Iteration (",action,"): ",i,"/",max)

In [None]:
# Perform the data processing for the different actions
import time
start = time.time()
doWork(data,ProcessAction.RotateClockwise)
doWork(data,ProcessAction.RotateCounterClockwise)
doWork(data,ProcessAction.ShiftUp)
doWork(data,ProcessAction.ShiftDown)
doWork(data,ProcessAction.ShiftLeft)
doWork(data,ProcessAction.ShiftRight)
end = time.time()
print("Process time (s): ", end - start)

In [None]:
# Merge the new data to the original data
import time
start = time.time()
for i in ProcessAction:
    file_name = 'train_'+ file_name_ext[int(i)] + '.csv'
    print("Add file to data:", file_name)
    df_tmp = pd.read_csv(file_name)
    df_tmp.describe()
    data_tmp = df_tmp.values[:,:] 
    print(data_tmp.shape)
    data = np.vstack((data,data_tmp))
print("Final data set size: ",data.shape)
end = time.time()
print("Process time (s): ", end - start)

## Shuffle the data

In [None]:
np.random.seed(6)
np.random.shuffle(data)
print(data[:5])
print(data.shape)
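
Note that `np.random.shuffle` reorders the rows of a 2D array in place, so each label stays with its pixels. This can be checked with a small sketch on toy data (`data_demo` is made up so that the pairing is easy to verify):

```python
import numpy as np

# Toy data: the first column is the label, the remaining two columns
# are "pixels" derived from it so the pairing can be verified afterwards.
labels = np.arange(10)
data_demo = np.column_stack((labels, labels * 10, labels * 10 + 1))

np.random.seed(6)             # same seed -> same order on every run
np.random.shuffle(data_demo)  # shuffles the rows in place

# Every row still holds the pixels that belong to its label.
rows_intact = bool(np.all(data_demo[:, 1] == data_demo[:, 0] * 10))
print(data_demo[:3])
print("Rows intact:", rows_intact)
```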

## Split Data
Split the data into two sets: training (80%) and test (20%):

In [None]:
X=data[:,1:] # features
y=data[:,0] # labels
print("X size: ", X.shape)
print("y size: ", y.shape)

In [None]:
X_train, X_test = np.split(X, [int(.8*X.shape[0])])
y_train, y_test = np.split(y, [int(.8*y.shape[0])])
print("X training set size: ", X_train.shape)
print("X test set size: ", X_test.shape)
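
If tuning were added later, the same `np.split` call could produce a validation set as well. Here is a minimal sketch on toy data. The 60/20/20 ratios and the names `X_tr`, `X_val` and `X_te` are assumptions for illustration:

```python
import numpy as np

# Toy feature matrix with 10 samples, standing in for X.
X_demo = np.arange(20).reshape(10, 2)
n = X_demo.shape[0]

# Hypothetical 60/20/20 split; integer arithmetic avoids rounding surprises.
train_end = n * 60 // 100
val_end = n * 80 // 100
X_tr, X_val, X_te = np.split(X_demo, [train_end, val_end])
print(X_tr.shape, X_val.shape, X_te.shape)
```

The labels would have to be split with the same indices so that features and labels stay aligned.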

# Display Data Samples

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
from scipy import ndimage

fig = plt.figure()

for idx in range(0,9):
    ax = fig.add_subplot(3,3,idx+1)
    img_data=X[idx,:].reshape((28, 28))
    plt.imshow(img_data, cmap="gray")

plt.show()

# Data Preprocessing

## Input Standardization
Inputs of the training set and the test set are standardized by removing the mean and scaling to unit variance:

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit only to the training data
scaler.fit(X_train)

# Now apply the transformations to the data:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
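
The effect of fitting on the training data only can be seen on a tiny example. In this sketch the arrays are made up. Column 0 is constant, like a border pixel that is always black, and as far as I know `StandardScaler` leaves such zero-variance columns unscaled instead of dividing by zero:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0 is constant (like an always-black border pixel), column 1 varies.
X_train_demo = np.array([[0., 10.], [0., 20.], [0., 30.]])
X_test_demo = np.array([[0., 40.]])

scaler_demo = StandardScaler().fit(X_train_demo)  # statistics from training data only
X_train_std = scaler_demo.transform(X_train_demo)
X_test_std = scaler_demo.transform(X_test_demo)

print(scaler_demo.mean_)         # per-column training mean
print(X_train_std.mean(axis=0))  # approximately 0 for every column
print(X_test_std)                # test data reuses the training statistics
```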

## Input Formatting for 2D Process
At the moment, the input features of a training example are a 1D array of 784 values, the pixels of one channel (grayscale).  
First we need to reshape the input features into a 2D image of 28x28 pixels with one channel. This is necessary for the 2D convolution layer:

In [None]:
X_train_reshaped = X_train.reshape(X_train.shape[0],28, 28,1).astype( 'float32' )
print('Size of the input training set: ',X_train_reshaped.shape)
X_test_reshaped = X_test.reshape(X_test.shape[0],28, 28,1).astype( 'float32' )
print('Size of the input test set: ',X_test_reshaped.shape)

## Output Label Transformation
The output labels of the training set and the test set have to be converted from a single class value (from 0 to 9) into 10 binary class values:

In [None]:
from sklearn.preprocessing import LabelBinarizer

class MyLabelBinarizer(LabelBinarizer):
    def transform(self, y):
        Y = super().transform(y)
        if self.y_type_ == 'binary':
            return np.hstack((Y, 1-Y))
        else:
            return Y

    def inverse_transform(self, Y, threshold=None):
        if self.y_type_ == 'binary':
            return super().inverse_transform(Y[:, 0], threshold)
        else:
            return super().inverse_transform(Y, threshold)
        

lb = MyLabelBinarizer()
print(y_train.shape)
print(y_train[:5])
y_train_bin = lb.fit_transform(y_train)
y_test_bin = lb.transform(y_test) # reuse the classes learned from the training labels
print(y_train_bin[0:5,:])
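
To make the transformation concrete, here is a NumPy-only sketch of the same idea on a few toy labels. It does not use `LabelBinarizer`, and `y_demo` is made up:

```python
import numpy as np

y_demo = np.array([3, 0, 9, 3])  # toy digit labels
classes = np.arange(10)          # the ten digit classes 0..9

# One row per label, with a single 1 in the column of its class.
y_onehot = (y_demo[:, None] == classes[None, :]).astype(int)
print(y_onehot)

# The inverse picks the column with the largest value, which also works
# on softmax probabilities instead of exact 0/1 values.
y_back = classes[np.argmax(y_onehot, axis=1)]
print(y_back)
```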

# Build and Train Model

Now we define the CNN model:

In [None]:
# Importing libraries
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Dropout

# Initialising the CNN
classifier = Sequential()

# Adding a first convolutional layer
classifier.add(Conv2D(32, (5, 5), input_shape = (28,28,1), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Adding a second convolutional layer
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))

# Flattening
classifier.add(Flatten())

# Full connection
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 10, activation = 'softmax')) #softmax for classification
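
The size of the feature maps can be followed layer by layer with simple arithmetic. The sketch below assumes the Keras defaults that the code above relies on (no padding, stride 1 for the convolutions, non-overlapping pooling). The helper names `conv_out` and `pool_out` are made up for this illustration:

```python
def conv_out(size, kernel):
    """Output width of an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output width of non-overlapping max pooling."""
    return size // pool


size = 28
size = conv_out(size, 5)   # first Conv2D, 5x5 kernel
size = pool_out(size, 2)   # first MaxPooling2D
size = conv_out(size, 3)   # second Conv2D, 3x3 kernel
size = pool_out(size, 2)   # second MaxPooling2D
flat_features = size * size * 32  # 32 feature maps after the second Conv2D
print(size, flat_features)
```

For the model above this gives 5x5 feature maps, so the flattening step feeds 800 values into the first dense layer.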

Compile the model:

In [None]:
# Compiling the CNN
from keras import optimizers
classifier.compile(optimizer = 'adam', 
                   loss = 'categorical_crossentropy', 
                   metrics = ['accuracy'])

Train the model using the training set:

In [None]:
classifier.fit(X_train_reshaped, y_train_bin,
          epochs=70,
          batch_size= 160)

# Predictions and Evaluation

Evaluate the model on the test set:

In [None]:
score = classifier.evaluate(X_test_reshaped, y_test_bin, batch_size=128)
print('Score: ',score)
print('Metrics: ',classifier.metrics_names)
classifier.summary()

Compute predictions for the test set:

In [None]:
pred_test = classifier.predict(X_test_reshaped)

Transform the 10 binary class predictions back to a single multiclass value in order to compute the confusion matrix:

In [None]:
print(lb.classes_)
print(lb.y_type_)
pred_test = lb.inverse_transform(pred_test)
print(pred_test[:5])

Compute additional model prediction results:

In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
print(confusion_matrix(y_test,pred_test))
print(classification_report(y_test,pred_test))
print('>>>>>> Accuracy score: ',accuracy_score(y_test,pred_test))
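
To read the confusion matrix correctly, it helps to see how it is built. The following NumPy sketch uses made-up labels and the same convention as scikit-learn, with rows for the true class and columns for the predicted class:

```python
import numpy as np

y_true_demo = np.array([0, 1, 2, 2, 1, 0])
y_pred_demo = np.array([0, 2, 2, 2, 1, 0])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
# Rows are true classes, columns are predicted classes.
np.add.at(cm, (y_true_demo, y_pred_demo), 1)
print(cm)

# Accuracy is the share of samples on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Off-diagonal entries show which digits are confused with which. Here one sample of class 1 was predicted as class 2.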

In [None]:
print(pred_test[:5])

# Evaluate

In [None]:
def build_classifier(optimizer='adam'):
    classifier = Sequential()
    classifier.add(Conv2D(32, (3, 3), input_shape = (28,28,1), activation = 'relu'))
    classifier.add(MaxPooling2D(pool_size = (2, 2)))
    classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
    classifier.add(MaxPooling2D(pool_size = (2, 2)))
    classifier.add(Flatten())
    classifier.add(Dense(units = 128, activation = 'relu'))
    classifier.add(Dense(units = 128, activation = 'relu'))
    classifier.add(Dense(units = 128, activation = 'relu'))
    classifier.add(Dense(units = 10, activation = 'softmax'))
    classifier.compile(optimizer = optimizer, 
                   loss = 'categorical_crossentropy', 
                   metrics = ['accuracy'])
    return classifier

In [None]:
# Not used at the moment
if 0:
    # Evaluate
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.model_selection import cross_val_score
    classifier = KerasClassifier(build_fn = build_classifier, batch_size = 160, epochs = 20)
    accuracies = cross_val_score(estimator = classifier, X = X_train_reshaped, y = y_train_bin, cv = 10)
    mean = accuracies.mean()
    std = accuracies.std() # standard deviation, not variance
    print('Mean=',mean)
    print('Std=',std)

# Fine Tuning

In [None]:
# Not used at the moment
from sklearn.model_selection import GridSearchCV
if 0:
    classifier = KerasClassifier(build_fn = build_classifier)
    parameters = {'batch_size': [150, 170],
                  'epochs': [20, 30],
                  'optimizer': ['adam', 'rmsprop']}
    grid_search = GridSearchCV(estimator = classifier,
                               param_grid = parameters,
                               scoring = 'accuracy',
                               cv = 10)
    grid_search = grid_search.fit(X_train_reshaped, y_train_bin)
    print('Best parameters: ',grid_search.best_params_)
    print('Best score: ',grid_search.best_score_)
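
The cost of this grid search can be estimated before running it. The sketch below only counts model fits and epochs for the grid defined above. It does not train anything, and the epoch total is just a rough proxy for the effort:

```python
from itertools import product

parameters = {'batch_size': [150, 170],
              'epochs': [20, 30],
              'optimizer': ['adam', 'rmsprop']}
cv = 10

# Every combination of the parameter values is trained once per fold.
combinations = list(product(*parameters.values()))
total_fits = len(combinations) * cv
# Epochs summed over all fits, as a rough proxy for the total training effort.
total_epochs = sum(epochs for _, epochs, _ in combinations) * cv
print(len(combinations), total_fits, total_epochs)
```

With 8 combinations and 10 folds this amounts to 80 fits, plus the final refit on the whole training set that `GridSearchCV` performs by default. That is a lot of CNN training for a kernel with time limits.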

# Create Submission on Challenge Test Set

## Import Data

In [None]:
df_test = pd.read_csv('../input/test.csv')
df_test.describe()

## Check, Clean and Preprocess Data

In [None]:
df_test.isnull().values.any()

## Extract Data

In [None]:
X_submission = df_test.values[:,:] 
print(X_submission[:5]) # display the first 5 elements
print(X_submission.shape)

## Display Data Samples

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

fig = plt.figure()

for idx in range(0,9):
    ax = fig.add_subplot(3,3,idx+1)
    img_data=X_submission[idx,:].reshape((28, 28))
    plt.imshow(img_data, cmap="gray")
    
plt.show()

## Input Standardization

In [None]:
X_submission = scaler.transform(X_submission)

## Input Formatting for 2D Process

In [None]:
X_submission_reshaped = X_submission.reshape(X_submission.shape[0],28, 28,1).astype( 'float32' )

## Compute Predictions

In [None]:
pred_submission = classifier.predict(X_submission_reshaped)

## Transform the 10 binary class predictions back to a single multiclass value

In [None]:
pred_submission = lb.inverse_transform(pred_submission)
print(pred_submission[:5])

## Create Submission Data Set

In [None]:
idx=np.arange(1,X_submission.shape[0]+1)
# Cast to int so that ImageId and Label are written as integers, not floats
submission = np.column_stack((idx , pred_submission)).astype(int)
print(submission[:5,:]) # display the first 5 rows

In [None]:
columns = ['ImageId', 'Label']
df_submission = pd.DataFrame(submission, columns=columns)
print(df_submission.head(5)) 

## Write CSV Submission File

In [None]:
df_submission.to_csv('Digit Recognizer Submission 3.csv', sep=',' ,index=False)
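
Before uploading, it can be worth checking that the file contains integer labels, because the predictions may come back as floats after the augmentation step. Here is a minimal sketch with made-up predictions (`pred_demo`), written to an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

pred_demo = np.array([2.0, 0.0, 9.0])  # toy predictions stored as floats

# ImageId starts at 1; the labels are cast to int before writing.
df_demo = pd.DataFrame({'ImageId': np.arange(1, len(pred_demo) + 1),
                        'Label': pred_demo.astype(int)})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```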

# Conclusion
The CNN performs well, but there is room for improvement. Possible directions:
* Extend the data augmentation step with additional transformations
* Implement a fine-tuning strategy for the hyperparameters or the CNN architecture (although, given the performance and time limits, is a kernel the right place to run it?)
* Integrate a Spatial Transformer Network (STN) module into the CNN
