# COMP551 - MiniProject 3

## Convolutional Neural Network with 5 blocks of Hidden Layers based on VGGNet architecture

#### @author: Luiz Resende Silva

Enabling the kaggle API and creating the ```.kaggle``` directory with the ```kaggle.json``` file to download datasets directly from the competition API into the ```/content``` folder. Also enabling the download of the ```.py``` file containing the functions and neural network classes designed by the author.

#### **WARNING**: the cell below MUST be RUN ONCE before running all the cells (Ctrl+F9), or RUN TWICE if the cells are going to be run individually (Ctrl+Enter), so that the project recognizes the ```/.kaggle``` directory and both the data and ```project03functions.py``` can be downloaded.

In [0]:
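# !pip install kaggle
import json #Needed to write the API token to disk
#NOTE: `token` must be defined beforehand as a dict holding the Kaggle API credentials (the content of kaggle.json)
!mkdir -p /content/.kaggle ~/.kaggle
with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle config set -n path -v /content
!chmod 600 /root/.kaggle/kaggle.json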

**Downloading the author's script containing functions and CNN class**; and saving in ```/content/project03functions.py```

In [0]:
!kaggle kernels pull luizresende/project03functions -p /content

**Downloading the README file** and **pre-trained model**; saving in ```/content/README.md```

In [0]:
!kaggle datasets download -d luizresende/readmeproject03ml -p /content
!unzip readmeproject03ml.zip

**Downloading the dataset** directly from kaggle competition and unzipping in the ```/content``` folder

In [0]:
!kaggle competitions download -c modified-mnist -p /content

In [0]:
!unzip train_max_x.zip
!unzip test_max_x.zip

#### Importing modules and libraries

In [0]:
############################################################################################################################
'''                                           IMPORTING GENERAL LIBRARIES                                                '''
############################################################################################################################
import pandas as pd
import numpy as np
import scipy
import seaborn as sb #Graphical plotting library
import matplotlib.pyplot as plt #Graphical plotting library
import pickle as pkl #Pickle format library
import time #Library to access time and assess running performance of the NN
import random #Generate random numbers
import pdb #Library to create breakpoints

from scipy.sparse import hstack #Scipy sparse matrix concatenation module
############################################################################################################################
'''                                      IMPORT SCIKIT-LEARN PREPROCESSING MODULES                                       '''
############################################################################################################################
from sklearn.model_selection import train_test_split

############################################################################################################################
'''                                         IMPORT PYTORCH MODULES/LIBRARY                                               '''
############################################################################################################################
import torch as th
import torchvision as tv
import torch.nn as nn
import torch.nn.functional as nf
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable
from torch.optim.lr_scheduler import MultiStepLR
!pip install torchsummary
from torchsummary import summary

############################################################################################################################
'''                                   PY FILE CONTAINING FUNCTIONS & CLASSES BUILT                                       '''
############################################################################################################################
import project03functions as pf

#### **BEGINNING OF THE SCRIPT**

In [0]:
##################################################################################################################################
'''                                                     BEGINNING OF THE SCRIPT                                                '''
##################################################################################################################################

###         DESCRIBING FILES' NAMES/PATHS          ###
FileTrainImages = "train_max_x"
FileTrainLabels = "train_max_y.csv"
FileTestImages = "test_max_x"

###                  READING FILES                 ###
train_images = pd.read_pickle(FileTrainImages)
train_labels = pf.Read_File_DF(FileTrainLabels, separation=",", head=0, replace=[], drop=False)
Test_Images = pd.read_pickle(FileTestImages)

###        PLOTTING DISTRIBUTION OF LABELS         ###
train_labels['Label'].hist(bins=10)

In [0]:
###       SAMPLE IMAGE FROM TRAINING DATASET       ###
pf.View_Image(Matrix=train_images[(random.randint(0,1000)),:,:], Is_NumPy=True, Is_DF=False, Multiple=False)

**Splitting** the entire training dataset into training and validation sets

In [0]:
###     SPLITTING DATASET IN TRAIN-VALIDATION      ###
X_train, X_valid, y_train, y_valid = train_test_split(train_images, train_labels, test_size=0.10, random_state=10657,
                                                    shuffle=True, stratify=train_labels['Label'])

sub_sample = False #Flag to take only a subset to speed training process during tests

if(sub_sample==True):
    tra = 20000 #Defining number of training samples
    val = 5000 #Defining number of validation samples
    X_train, X_valid, y_train, y_valid = X_train[0:tra,:,:], X_valid[0:val,:,:], y_train.iloc[0:tra,:], y_valid.iloc[0:val,:]

#### Entering some **general parameters** that will be **used in the CNN and in some preprocessing steps**

**PARAMETERS MUST BE SET**



In [0]:
###       PARAMETERS FOR THE TRANSFORMATIONS       ###
threshold = 200 #Setting the pixel grey/black intensity threshold. Any value below will be set to zero and clear the background image, with only the numbers remaining
input_size = 128 # Input dimension in number of pixels
output_size = 10 #Dimension of the output generated. This relates to the number of classes in this problem: numbers from 0 to 9
batchs = 25 #The batch size used during training phase

#### Performing **thresholding to clear the images' background** and retain only pixels for the numbers.
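
The helper ```pf.Image_Thresholding``` is not shown in this notebook. The operation it is described as performing could be sketched with NumPy roughly as follows, assuming that pixels strictly below the threshold are zeroed and all others are left unchanged (the name ```threshold_images``` is hypothetical):

```python
import numpy as np

def threshold_images(images, threshold_px):
    """Return a copy of `images` where every pixel below `threshold_px` is set to zero."""
    images = np.asarray(images)
    return np.where(images < threshold_px, 0, images)

# Toy 1x2x3 "image" stack standing in for the real data
sample = np.array([[[12, 199, 200],
                    [255, 0, 230]]], dtype=np.uint8)
cleaned = threshold_images(sample, threshold_px=200)
print(cleaned.tolist())
```

With ```threshold = 200```, only the brightest pixels survive, which is what clears the background while keeping the digits.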

In [0]:
###       THRESHOLDING IMAGES       ###

do_thresholding = False #Flag to perform or not the image thresholding

if(do_thresholding==True):
    X_train = pf.Image_Thresholding(Matrix=X_train, threshold_px=threshold)
    X_valid = pf.Image_Thresholding(Matrix=X_valid, threshold_px=threshold)
    
    Test_Images = pf.Image_Thresholding(Matrix=Test_Images, threshold_px=threshold)

    print("Image thresholding performed!")

#### Performing **normalization of pixel values** to have their intensity in a scale from 0 to 1.
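
As an illustration only, the scaling could look like the sketch below. It assumes that ```pf.Image_Normalization``` divides every pixel by the largest value found in the array, as the code comment suggests; the name ```normalize_images``` and the zero-maximum guard are hypothetical additions.

```python
import numpy as np

def normalize_images(images):
    """Scale pixel intensities to [0, 1] by dividing by the largest value in the array."""
    images = np.asarray(images, dtype=np.float32)
    max_value = images.max()
    if max_value == 0:  # An all-black input is returned unchanged to avoid dividing by zero
        return images
    return images / max_value

# Toy 1x2x2 "image" stack standing in for the real data
sample = np.array([[[0, 51], [102, 255]]], dtype=np.uint8)
scaled = normalize_images(sample)
print(scaled.min(), scaled.max())
```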

In [0]:
###     NORMALIZING PIXEL VALUES     ###

do_normalize = False #Flag to perform or not the pixel normalization

if(do_normalize==True):
    X_train = pf.Image_Normalization(Matrix=X_train) #Dividing all pixels by the largest value and scaling their value
    X_valid = pf.Image_Normalization(Matrix=X_valid)

    Test_Images = pf.Image_Normalization(Matrix=Test_Images) 
    
    print("Pixel normalization performed!")

#### The data (input features and labels) is **converted to PyTorch tensors**

*P.S.1: variables are overwritten at every step to keep the RAM footprint as small as possible*

*P.S.2: Reshaping of variables and One-Hot encoding of the labels can be done by changing the Boolean in the ```if``` statements*
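
For reference, one-hot encoding of integer labels can be sketched in NumPy as below. This is not the author's ```pf.OneHotEncoder```, whose implementation is not shown here; the function name ```one_hot``` and its defaults are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Map integer class labels of shape (N,) to a one-hot matrix of shape (N, num_classes)."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([3, 0, 9])
print(encoded.shape)
```

Note that ```nn.CrossEntropyLoss``` accepts integer class indices as targets, so this encoding is only needed when a loss expecting one-hot targets is used.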

In [0]:
###       CONVERTING DATA TO PYTORCH TENSORS       ###
X_train = th.from_numpy(X_train).float() #The functions in the CNN construction require that the input features are of the type float. Same for validation and test sets
X_valid = th.from_numpy(X_valid).float() 
Test_Images = th.from_numpy(Test_Images).float() 

y_train = th.from_numpy(y_train['Label'].to_numpy()).long() #They also require that the input labels are of the type long. Same for validation set
y_valid = th.from_numpy(y_valid['Label'].to_numpy()).long()

In [0]:
###      DOING ONE-HOT ENCODING OF THE LABELS      ###
if(False): #Set this to True to perform One-Hot encoding of the labels
  y_train = th.from_numpy(pf.OneHotEncoder(y_train.numpy())) #Labels are already tensors at this point, so they are converted back to NumPy first
  y_valid = th.from_numpy(pf.OneHotEncoder(y_valid.numpy()))

In [0]:
print(X_train.shape)
print(X_valid.shape)

#### Creating ```training``` and ```validation``` *datasets* and *loaders* using ```torch.utils.data.TensorDataset``` and ```torch.utils.data.DataLoader```. These will be fed to the training process.

In [0]:
train_dataset = th.utils.data.TensorDataset(X_train, y_train) #Creating training dataset
train_loader = th.utils.data.DataLoader(train_dataset, batch_size=batchs, shuffle= True) #Creating training dataloader with the train_dataset and the batch size specified

valid_dataset = th.utils.data.TensorDataset(X_valid, y_valid) #Creating validation dataset
valid_loader = th.utils.data.DataLoader(valid_dataset, batch_size=batchs, shuffle= True) #Creating validation dataloader with the valid_dataset and the batch size specified

####  **Instantiating the neural network classes**: the convolutional neural network ```ConvNN_G23_Std``` created (or one of its variants) or the feed-forward neural network ```FFNN_G23``` as ```net```.

The classes require two parameters:
1.   ```num_classes```, which is the number of classes in the classification problem (output dimension);

2.   ```input_ratio```, which is the value used to reshape the tensor fed to the first fully connected layer (FC1). It depends on the resolution of the input images (matrix size) and on the number of max pooling layers in the class, and must match the input size expected by FC1 (e.g. for the current set, the output of the last convolutional layer of the class ```ConvNN_G23_Std``` is a tensor of shape ```([512,8,8])``` and the output of FC1 is 512, such that the tensor must be reshaped to ```([-1, 512*8*8])```). This number is defined by the max pooling layers and can be found in the model summary, by running the summary command below the CNN instantiation. **In the FFNN_G23**, this parameter refers to the size of the image (e.g. in the current dataset, this value is 128).

***OBS.1: the flag*** ```Is_CNN``` ***must be set to*** ```True``` ***if one of the convolutional neural networks is being instantiated or to*** ```False``` ***if the class*** ```FFNN_G23``` ***is being instantiated***

***OBS.2: the complete architecture of the neural network classes is described in the report. Please refer to it for visual aid or to the*** ```README.md``` ***file (uploaded to this .ipynb file)***
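
The value of ```input_ratio``` can also be estimated by hand. The sketch below rests on assumptions not confirmed by this notebook: that the convolutions preserve the spatial size (padded), that each pooling layer is a 2x2 max pool with stride 2, and that four such layers are applied, which is what reproduces the 8x8 map quoted above. The function name is hypothetical.

```python
def spatial_size_after_pooling(input_size, num_poolings, kernel=2, stride=2):
    """Side length of a square feature map after repeated max pooling without padding."""
    size = input_size
    for _ in range(num_poolings):
        size = (size - kernel) // stride + 1
    return size

# 128 -> 64 -> 32 -> 16 -> 8 under the assumptions stated above
input_ratio = spatial_size_after_pooling(input_size=128, num_poolings=4)
flattened = 512 * input_ratio * input_ratio  # Number of features entering FC1
print(input_ratio, flattened)
```

The model summary remains the authoritative source; this arithmetic is only a quick cross-check.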

In [0]:
Is_CNN = True #Flag to choose which neural network to instantiate - this flag is passed on to other function in the training and accuracy steps,
                #since they must expect different inputs from different NN types

if(Is_CNN==True):
    expected_dim = 8
    net = pf.ConvNN_G23_Std(num_classes=output_size, input_ratio=expected_dim, soft_max=False, drop_out=False, drop_prob=0.25, FC4_relu=True)
else:
    net = pf.FFNN_G23(num_classes=output_size, input_ratio=128, soft_max=False, drop_out=False, drop_prob=0.25, final_relu=False)

go_cuda = True #Set this flag to False to avoid moving the network to GPU - This flag will be passed on to other functions
if(go_cuda==True):
    net = net.cuda() #Moving CNN to GPU

print(net)

In [0]:
# net = net.cuda() #Moving CNN to GPU
print(summary(net,(1,128,128)))

#### Defining **loss function** (Cross-Entropy Loss), **optimization function** (Stochastic Gradient Descent) and **schedule for the optimization** function update (Multi-Step Learning Rate) or **Adam** algorithm
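
To make the multi-step schedule concrete, the pure-Python sketch below reproduces the rule such a scheduler follows: the base learning rate is multiplied by ```gamma``` once for every milestone already reached. The helper ```multistep_lr``` is hypothetical and only mirrors that rule; it does not account for any manual learning-rate reset done inside the training loop.

```python
from bisect import bisect_right

def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate after `epoch` scheduler steps: base_lr times gamma once per milestone reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)

# Values matching the SGD configuration used in this notebook
for epoch in (0, 14, 15, 25, 45, 55):
    print(epoch, multistep_lr(0.01, [15, 25, 45, 55], 0.1, epoch))
```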

In [0]:
criterion = nn.CrossEntropyLoss() #Cross-Entropy loss function selected

Use_Adam = False #Flag to choose between Stochastic Gradient Descent and Adam optimizer

if(Use_Adam==False):
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.5, dampening=0, weight_decay=0, nesterov=False) #Stochastic Gradient Descent optimizer with initial learning rate of 0.01
    scheduler = MultiStepLR(optimizer, milestones=[15,25,45,55], gamma=0.1, last_epoch=-1) #The schedule divides the current learning rate by 10 at each epoch listed in milestones
    is_schedule = True #Flag must be set to True if a scheduler is being used
else:
    optimizer = optim.Adam(net.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False) #Using Adam algorithm
    is_schedule = False #Flag must be set to True if a scheduler is being used

#### **Starting training process**. Script allows for printing the training and validation losses

The second cell prints out the graphs for training and validation losses and accuracies.

*P.S.1: The loss for the validation dataset is also calculated; however, to prevent data leakage, it is computed by a separate function that, unlike the training one, does not receive the instantiated* ```optimizer```*, so the weights are never updated with validation data*
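
A minimal sketch of an optimizer-free validation pass is given below. It is not the author's ```pf.LossInValidation```, whose code is not shown here; the function ```validation_loss``` and the toy model and data are hypothetical and serve only to show that the weights stay untouched.

```python
import torch
import torch.nn as nn

def validation_loss(model, loader, criterion):
    """Average per-sample loss over `loader`, computed without updating any weights."""
    model.eval()  # Disable dropout and use running batch-norm statistics
    total, count = 0.0, 0
    with torch.no_grad():  # Gradients are not tracked, so nothing here can feed an optimizer step
        for inputs, targets in loader:
            loss = criterion(model(inputs), targets)
            total += loss.item() * inputs.size(0)
            count += inputs.size(0)
    model.train()  # Restore training mode for the next epoch
    return total / count

# Toy model and data standing in for the real network and images
torch.manual_seed(0)
toy_model = nn.Linear(4, 10)
toy_data = torch.utils.data.TensorDataset(torch.randn(8, 4), torch.randint(0, 10, (8,)))
toy_loader = torch.utils.data.DataLoader(toy_data, batch_size=4)
weights_before = toy_model.weight.detach().clone()
loss_value = validation_loss(toy_model, toy_loader, nn.CrossEntropyLoss())
print(loss_value, torch.equal(weights_before, toy_model.weight.detach()))
```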

In [0]:
##########################################################
"""                GENERAL LIST PROCESS                """
##########################################################

train_loss_list = [] #List to store average training loss for each epoch
validation_loss_list = [] #List to store the average validation loss for each epoch

accuracy_train = [] #List to store the training accuracy of each epoch
accuracy_valid = [] #List to store the validation accuracy of each epoch

log = True #Flag for saving the info in a log file
if(log==True):
    Log_File = []
else:
    Log_File = None

##########################################################
"""              STARTING TRAINING EPOCHS              """
##########################################################

num_epochs = 60 #Defining number of epochs to train the model

total_start_time = time.time() #Starting clock to measure total training time

for epoch in range(num_epochs):
    
    train_loss, temp1 = pf.LossInTraining(NN=net, TrainingLoader=train_loader, Criterion=criterion, Optimizer=optimizer, TrainLength=len(X_train),
                                          BatchSize=batchs, Epoch=epoch, is_CNN=Is_CNN, ImageSize=input_size, UseGPU=go_cuda, PrintPartialLoss=True,
                                          PartialBatch=15000, log_file=Log_File)
    
    valid_loss, temp2 = pf.LossInValidation(NN=net, ValidationLoader=valid_loader, Criterion=criterion, ValidLength=len(X_valid),
                                            BatchSize=batchs, is_CNN=Is_CNN, ImageSize=input_size, UseGPU=go_cuda)

    #Updating lists by adding calculated loss and accuracy values for current epoch
    train_loss_list.append(train_loss)
    validation_loss_list.append(valid_loss)
    accuracy_train.append(sum(temp1)/len(X_train))
    accuracy_valid.append(sum(temp2)/len(X_valid))

    print('{Epoch %d} - Train loss: %.6f' %(epoch+1, train_loss))
    print('{Epoch %d} - Validation loss: %.6f' %(epoch+1, valid_loss))
    print('{Epoch %d} - Train accuracy: %.6f' %(epoch+1, accuracy_train[-1]))
    print('{Epoch %d} - Validation accuracy: %.6f' %(epoch+1, accuracy_valid[-1]))
    
    if(log==True): #Adding info to log file
        Log_File.append('{Epoch %d} - Train loss: %.6f' %(epoch+1, train_loss))
        Log_File.append('{Epoch %d} - Validation loss: %.6f' %(epoch+1, valid_loss))
        Log_File.append('{Epoch %d} - Train accuracy: %.6f' %(epoch+1, accuracy_train[-1]))
        Log_File.append('{Epoch %d} - Validation accuracy: %.6f' %(epoch+1, accuracy_valid[-1]))
    
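    if(is_schedule==True):
        if(epoch==30):
            #Resetting the learning rate by re-creating the optimizer. NOTE: the scheduler stays bound to the previous optimizer,
            #so later scheduler.step() calls no longer change the learning rate used for training
            optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.5, dampening=0, weight_decay=0, nesterov=False)
        else:
            scheduler.step() #Advancing the learning-rate schedule by one epoch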

    th.cuda.empty_cache()

print('Finished training CNN in %0.3f minutes'%((time.time()-total_start_time)/60))

if(log==True):
    Log_File.append('Finished training CNN in %0.3f minutes'%((time.time()-total_start_time)/60))

##########################################################
"""              FINISHED TRAINING EPOCHS              """
##########################################################

In [0]:
losses = pd.DataFrame({'Epochs':list(range(num_epochs)),'Training Loss':train_loss_list,'Validation Loss':validation_loss_list})
accuracies = pd.DataFrame({'Epochs':list(range(num_epochs)),'Training Accuracy':accuracy_train,'Validation Accuracy':accuracy_valid})

pf.Plot_Multi_Curves(Data=losses, Xlabel="Epochs", Ylabel="Average Loss", Title="Loss", Xlim=True, Xlim1=0, Xlim2=(num_epochs+1), Ylim=False, Ylim1=0, Ylim2=100, save=True)

pf.Plot_Multi_Curves(Data=accuracies, Xlabel="Epochs", Ylabel="Accuracy", Title="Accuracies", Xlim=True, Xlim1=0, Xlim2=(num_epochs+1), Ylim=True, Ylim1=0.00, Ylim2=1.00, save=True)

#### **Overall training and validation dataset accuracies** of the trained model

In [0]:
##########################################################
"""  TRAINING AND VALIDATION ACCURACIES FOR THE MODEL  """
##########################################################

train_set_pred = pf.GetPredsAccur(NeuralNet=net, DataLoader=train_loader, DatasetType='Training', is_CNN=Is_CNN, ImageSize=input_size, UseGPU=go_cuda,
                                  PrintAccur=True, GetLebelsPreds=True, List=True, log_file=Log_File)

val_set_pred = pf.GetPredsAccur(NeuralNet=net, DataLoader=valid_loader, DatasetType='Validation', is_CNN=Is_CNN, ImageSize=input_size, UseGPU=go_cuda,
                                PrintAccur=True, GetLebelsPreds=True, List=True, log_file=Log_File)

if(log==True):
    Logging = pd.DataFrame({'Log_Info':Log_File})
    pf.Write_File_DF(Data_Set=Logging, File_Name="log", separation=",", head=True, ind=False) #Printing logging

#### **Saving the trained CNN model**


In [0]:
if(True): #Change boolean value to avoid saving trained model
    timestr = time.strftime("%y-%m-%d_%Hh%Mm%Ss_")
    path = 'CNN_G23_Std_Model_best.pkl'
    th.save(net.state_dict(), timestr+path)

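#### **Loading saved trained CNN model**

In [0]:
if(False): #Set this flag to True to load the trained model
    net = pf.ConvNN_G23_Std(num_classes=output_size, input_ratio=8, soft_max=False, drop_out=False, drop_prob=0.25, FC4_relu=True)
    net.load_state_dict(th.load("CNN_G23_Std_Model_best.pkl")) #The file name must match the saved one, including any timestamp prefix added when saving
    net.eval()
    if(go_cuda==True):
        net = net.cuda() #Moving the loaded CNN to GPU only when the GPU flag is set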

#### **Making predictions for the held-out (test) data**

*P.S.: to fit the structure of the functions created, the TensorDataset created for the test dataset uses random tensor numbers as labels. However, inside the prediction function, these "random labels" are discarded*

In [0]:
##########################################################
"""                 MAKING PREDICTIONS                 """
##########################################################

test_dataset = th.utils.data.TensorDataset(Test_Images, th.rand(len(Test_Images))) #Creating testing dataset by appending one random placeholder label per test image, so that it is iterable for the prediction function
test_loader = th.utils.data.DataLoader(test_dataset, batch_size=batchs, shuffle=False) #Creating test set dataloader with the test_dataset and the batch size specified 

Results = pf.KagglePreds(NeuralNet=net, DataLoader=test_loader, is_CNN=Is_CNN, ImageSize=input_size, UseGPU=go_cuda, GetLebelsPreds=True) #Predicting

pf.Write_File_DF(Data_Set=Results, File_Name="Predictions_Group_23", separation=",", head=True, ind=False) #Saving results as .csv file

##################################################################################################################################
'''                                                              END                                                           '''
##################################################################################################################################