In [20]:
import keras
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.datasets import mnist
from keras.utils import to_categorical
import numpy as np

(x_set, y_set), (x_test, y_test) = mnist.load_data()
x_set = x_set/255

# Introduction

Because we will not have class next week, and I cannot reasonably expect you to do work we have not lectured on, this week's sprint will be broken into two smaller pieces, as was loosely voted on in class. This is part 1.

For this sprint you will be doing a process called K-Fold Cross Validation.

### Instructions

In class you were briefly introduced to Keras, a high-level machine learning library. It can be used to build everything from an introductory model, such as the one you will be building, to the complex models used in industry every day for tasks ranging from chatbots to object detection.

### Section 1

In the last sprint you did some exploration to understand the dataset and what was in it. This time you are going to prepare it for training.

In his lecture, Professor Memon talked about holding back part of your data so that you can later use it to validate whether your model is working.

For this section you will be responsible for implementing, in Python, an algorithm called K-Fold.

This will be worth **40** points of the sprint
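
As a rough illustration of what a K-Fold split does, the sketch below shuffles sample indices and deals them into K folds. It is only a sketch under stated assumptions: the helper name `k_fold_indices` and its `seed` parameter are hypothetical, and it works on indices rather than on the MNIST arrays themselves.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds.

    Hypothetical helper for illustration; not part of the assignment code.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i, so fold
    # sizes differ by at most one and no sample is dropped or repeated.
    return [indices[i::k] for i in range(k)]


folds = k_fold_indices(n_samples=10, k=5)
for i, fold in enumerate(folds):
    print(f"fold {i}: {sorted(fold)}")
```

Working with indices keeps the images and labels aligned, because the same index list can be used to select from both arrays.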


### Section 2

Use K = 5 folds for the steps below.

Now that you have properly segmented your data, you will have to train K-1 models and validate them. The code for the model has already been implemented, so you do not need to worry about that.

The general procedure is:

1. Split your dataset into K even sets of data using the k-fold algorithm.
2. Train a model on fold k = 0.
3. Validate the model on fold k = 1.
4. Repeat, training on fold k+1 and validating on fold k+2.
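
The pairing of training and validation folds in the procedure above can be sketched as follows. The helper `fold_pairs` is hypothetical, and the wrap-around (the last fold validating against the first) is an assumption about how the sequence ends, not something the instructions state.

```python
def fold_pairs(k):
    """Return (train_fold, validation_fold) index pairs for k folds.

    Assumption: each model trains on one fold and validates on the next,
    wrapping around so the last fold is validated against the first.
    """
    return [(j, (j + 1) % k) for j in range(k)]


print(fold_pairs(5))  # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
```

Note that textbook K-fold cross-validation pairs the folds differently: each model is trained on the other K-1 folds combined and validated on the single held-out fold.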
    
**Note:** Training the models will take some time, depending on your computer. Each model will be saved, so once you are sure this part is working you should only have to do it once. If you mess something up, you can delete the model files and start again.
    
This will be worth **40** points of the sprint

### Section 3
Provide a few sentences about common pitfalls of k-fold cross-validation and of training models with it.

This will be worth **20** points of the sprint

### Extra credit

There are many other validation methods for machine learning models. Find one and implement it.
This is worth **20** extra credit points for the sprint.
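
One of the simplest alternatives is hold-out validation, where the data is shuffled once and a fixed fraction is set aside for validation. The sketch below is a minimal illustration on toy data; the function name `holdout_split` and its `val_fraction` and `seed` parameters are hypothetical, not part of the assignment.

```python
import random


def holdout_split(samples, labels, val_fraction=0.2, seed=0):
    """Shuffle once and hold back val_fraction of the data for validation."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    n_val = int(round(len(order) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    # Select samples and labels with the same indices so they stay aligned.
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    val = ([samples[i] for i in val_idx], [labels[i] for i in val_idx])
    return train, val


toy_x = list(range(100))
toy_y = [i % 10 for i in toy_x]
(train_x, train_y), (val_x, val_y) = holdout_split(toy_x, toy_y)
print(len(train_x), len(val_x))  # 80 20
```

Compared with K-fold, hold-out needs only one training run, at the cost of a validation score that depends on a single random split.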


#### Note:
Before you begin, you can use the same virtual environments you created last week, but you must pip install h5py into them. h5py is a Python library for the HDF5 file format, which will be used to save the trained models.



In [21]:
from random import randrange


def k_fold_split(x_set, y_set, folds=1):
    '''
    Inputs: The x_set data from mnist, the y_set labels from mnist
    Expected Output: The shuffled and K split datasets
    '''
    xdata = []
    ydata = []

    # Work on list copies so samples can be popped without touching the originals.
    new_x = list(x_set)
    new_y = list(y_set)
    # Integer fold size; any remainder samples are left out of every fold.
    fold_size = len(x_set) // folds
    for j in range(folds):
        fold_x = []
        fold_y = []
        while fold_size > len(fold_x):
            # Draw a random remaining sample; popping it means no sample is reused.
            index = randrange(len(new_x))
            # Flatten each 28x28 image into a 784-vector to match the model input.
            fold_x.append(np.array(new_x.pop(index)).flatten())
            fold_y.append(new_y.pop(index))
        xdata.append(fold_x)
        ydata.append(fold_y)
    return (xdata, ydata)


x_folds, y_folds = k_fold_split(x_set, y_set, 5)
x_folds = np.array(x_folds)

In [22]:
def construct_model():
    mod = Sequential()
    mod.add(Dense(512, activation='relu', input_shape=(784,)))
    mod.add(Dropout(0.2))
    mod.add(Dense(512, activation='relu'))
    mod.add(Dropout(0.2))
    mod.add(Dense(10, activation='softmax'))
    mod.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])
    return mod


def train_model(model, train_dataset, validation_dataset, epochs, name):
    
    x_set, y_set = train_dataset
    model.fit(x_set, y_set, epochs=epochs, batch_size=128, validation_data=validation_dataset)
    model.save(f'./{name}')
    

In [23]:
#Hint: Neural networks can't handle the labels as they are; they need --categorical-- data
#Note: You must submit the trained models along with the notebook for full credit
def train_validate_k(x_folds, y_folds, num_folds):
    '''
        Inputs: x_folds, the x folds returned from the k_fold algorithm above, 
        y_folds the y folds returned from the k_fold algorithm above
        num_folds, the number of folds used to make the x_folds and y_folds
        Expected Output: Nothing, this function has no explicit output, 
        but there must be num_fold models trained and saved to disk
    '''
    
    for j in range(num_folds):
        # Validate on the next fold, wrapping back to the first fold after the last one.
        val = (j + 1) % num_folds
        # The name uses 1-based fold numbers and becomes the saved model's file name.
        res = "Test set " + str(j+1) + " Validate with " + str(val+1)
        train_model(construct_model(),
                    (x_folds[j], to_categorical(y_folds[j], num_classes=10)),
                    (x_folds[val], to_categorical(y_folds[val], num_classes=10)),
                    20, res)
            

In [24]:
train_validate_k(x_folds, y_folds, 5)

Train on 12000 samples, validate on 12000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[The same output repeats for the remaining four models: each trains on 12000 samples, validates on 12000 samples, and runs for 20 epochs.]


#### Section 3, write a few sentences below.

### Data Leakage

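If information from outside the training fold reaches the model (for example, preprocessing statistics computed on the full dataset, or near-duplicate samples that land in both the training and validation folds), the validation score will be optimistic and cross-validation will not reflect performance on unseen data.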
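
One common source of leakage is preprocessing that is fitted on the whole dataset before it is split. The sketch below shows the safer pattern on toy numbers: scaling statistics are computed from the training fold only and then reused on the validation fold. The helper names `fit_scaler` and `apply_scaler` are hypothetical.

```python
import statistics


def fit_scaler(train_values):
    """Compute mean and standard deviation from the training fold only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def apply_scaler(values, mean, std):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


train_fold = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy data, not MNIST
val_fold = [10.0, 12.0]

# The validation fold is never seen while the statistics are computed.
mean, std = fit_scaler(train_fold)
train_scaled = apply_scaler(train_fold, mean, std)
val_scaled = apply_scaler(val_fold, mean, std)
```

Dividing pixel values by the constant 255, as this notebook does, uses no statistics from the data and so does not leak in this way.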

### High Imbalances

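If the data is highly imbalanced, randomly drawn folds may contain very few examples of a rare class, or none at all, so the per-fold validation scores become unreliable. Stratified K-fold, which keeps the class proportions roughly equal in every fold, is the usual remedy.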
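
A stratified split is one way to reduce this problem. The sketch below is a minimal illustration on toy labels, with a hypothetical helper named `stratified_folds`; it deals the indices of each class round-robin across the folds so that every fold sees every class.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Deal sample indices into k folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Round-robin within a class keeps per-fold counts of that class
        # within one of each other.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


toy_labels = [0] * 90 + [1] * 10  # toy imbalanced labels: 90% class 0, 10% class 1
for fold in stratified_folds(toy_labels, k=5):
    print(Counter(toy_labels[i] for i in fold))
```

With these toy labels, each of the five folds receives 18 samples of class 0 and 2 samples of class 1.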