# Cross-validation: Tutorial

## Introduction
This short tutorial explains how to use cross-validation to train and evaluate a model on a dataset. Cross-validation is particularly useful when only a small number of data points is available. The method splits the data in several different ways (these splits are known as "folds") and trains a model for each split.

There are a number of ways to perform cross-validation, but they all seek to remedy the same problem: the train-validation-test split that is vital for model evaluation reduces the number of data points available for training. Because the methods share this goal, the tutorial focuses on one of them. A complete list of the cross-validation methods in sklearn is available at https://scikit-learn.org/stable/modules/cross_validation.html. The tutorial below is adapted from the material on that page.

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. 

Let’s load the iris data set to fit a linear support vector machine on it:

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

In the traditional training strategy, we sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print("The shape of the training data:",X_train.shape, y_train.shape)
print("The shape of the test data:",X_test.shape, y_test.shape)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

print("Accuracy on the train dataset:",clf.score(X_train,y_train))
print("Accuracy on the test dataset:",clf.score(X_test, y_test))

## Cross-validation
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
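
The three-way split described above can be sketched as follows. This is only an illustration: the 60/20/20 split ratios, the candidate values for C, and the variable names are assumptions made for this example, not part of the original tutorial.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out 20% of the data as the final test set (illustrative ratio)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Split the remainder into training (60% of all data) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

# Tune C using the validation set only
val_scores = {}
for C in (0.1, 1, 10):
    model = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

best_C = max(val_scores, key=val_scores.get)

# Touch the test set once, with the chosen setting
final_model = svm.SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print("Validation accuracy per C:", val_scores)
print("Chosen C:", best_C, "- test accuracy:", test_accuracy)
```

Note that only 90 of the 150 samples are left for training here, which is the drawback that motivates cross-validation.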

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches exist, but they generally follow the same principles). The following procedure is followed for each of the k “folds”:

*   A model is trained using k-1 of the folds as training data.
*   The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
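
To make the procedure concrete, here is a minimal sketch of the k-fold loop written out by hand. It assumes `KFold` with shuffling (the iris samples are ordered by class) and k=5; the variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)

# Shuffle before splitting, because the iris samples are sorted by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X):
    # Train on k-1 folds ...
    model = svm.SVC(kernel='linear', C=1)
    model.fit(X[train_idx], y[train_idx])
    # ... and validate on the remaining fold
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The reported performance is the average over the k folds
print("Accuracy per fold:", fold_scores)
print("Mean accuracy:", np.mean(fold_scores))
```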

The simplest way to use cross-validation is to call the cross_val_score function on the estimator and the dataset.

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

In [None]:
from sklearn.model_selection import cross_val_score

#Create the model
clf = svm.SVC(kernel='linear', C=1, random_state=42)

#Run the cross-validation and collect the accuracy of each fold
scores = cross_val_score(clf, X, y, cv=5)

#Print the accuracy of each fold
print("The accuracy of each fold:",scores)

The mean score and the standard deviation are hence given by:

In [None]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

We can now experiment with the value C=10 and choose the better setup by comparing the cross-validated accuracy of the two models.

In [None]:
#Create the model
clf = svm.SVC(kernel='linear', C=10, random_state=42)

#Run the cross-validation and collect the accuracy of each fold
scores = cross_val_score(clf, X, y, cv=5)

#Print the accuracy of each fold
print("The accuracy of each fold:",scores)

print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

## Data transformation with held-out data

Just as it is important to test a predictor on data held out from training, preprocessing steps (such as standardization or feature selection) and similar data transformations should be learnt from the training set only and then applied to the held-out data for prediction:

In [None]:
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

#We use standardization, and we get its optimal setup by training on X_train
scaler = preprocessing.StandardScaler().fit(X_train)

#Apply the standardization to the train dataset
X_train_transformed = scaler.transform(X_train)

#Train the model on the standardized training dataset
clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
print("Accuracy of the model on the (standardized) train data:",clf.score(X_train_transformed, y_train))

#Apply the standardization to the test dataset
#Note: this is the scaler that we got based on the training data. 
#We must not use a different scaler on the test data or when deploying the model.
X_test_transformed = scaler.transform(X_test)
print("Accuracy of the model on the (standardized) test data:",clf.score(X_test_transformed, y_test))
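
The same principle applies inside cross-validation: the scaler should be refitted on the training part of every fold. One way to arrange this, sketched here under the assumption that the scikit-learn `make_pipeline` helper is used, is to bundle the scaler and the classifier into a single estimator and pass that to `cross_val_score`. The names `pipeline` and `pipeline_scores` are chosen for this illustration.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = datasets.load_iris(return_X_y=True)

# The pipeline fits the scaler on the training folds only,
# then applies it to the held-out fold before scoring
pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

pipeline_scores = cross_val_score(pipeline, X, y, cv=5)
print("The accuracy of each fold:", pipeline_scores)
print("Mean accuracy:", pipeline_scores.mean())
```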

# Cross-validation: Project

In this project you will use a dataset from Kaggle (https://www.kaggle.com/fanconic/skin-cancer-malignant-vs-benign) consisting of images of skin tumours. Each image is a 224 X 224 X 3 array of 224 X 224 pixels, where the last dimension holds the values of the three color channels (Red, Green, Blue).

You will use cross-validation to train models on different folds and find the one that best predicts malignant vs. benign (cancerous vs. non-cancerous) tumours. You can use any type of model you want (Supervised Clustering, Support Vector Machines, Deep Neural Networks etc.).

The first step is to access the data. The code below installs a library that makes this easy and downloads the data, but you will need a Kaggle username and API key to access it. (You may of course also download the data manually and upload it to Colab, but using the library is less tedious.)

1. Sign up on Kaggle to create a user there
2. Go to your profile and select "Account" (in the field where Home, Competitions, Datasets etc. are located, to the far right)
3. Scroll down to the API section and press the button "Create New API Token"

This will download a .json file which you can open with a code editor. In it you will see a username and a key. Insert these into the fields that appear when you run the cell below.

In [None]:
!pip install opendatasets
import opendatasets as od
od.download("https://www.kaggle.com/fanconic/skin-cancer-malignant-vs-benign")

The data should now be available in the "Files" section of your Google Colab menu, in a folder named *skin-cancer-malignant-vs-benign*.

In [None]:
# Libraries for loading data
import os, cv2
import numpy as np
import matplotlib.pyplot as plt

# This might be handy
from sklearn.utils import shuffle

In [None]:
#Use this SEED for all your random seed values.
SEED=2021

In [None]:
# Method for loading the data
def load_data(path = "/content/skin-cancer-malignant-vs-benign/train"):
  # Necessary lists
  labels = []
  label_names = []
  data = []

  # Each folder in the parent folder contains examples of one class.
  # os.listdir() returns entries in arbitrary order, so we sort them to get
  # the same label indices for the train and the test folders.
  for label, class_dir in enumerate(sorted(os.listdir(path))):

    # So we can simply store the name of the folder as the label name
    label_names.append(class_dir)

    # For every file (picture) in the current directory
    for img in os.listdir(os.path.join(path, class_dir)):

      # This reads the image using the cv2 library (in the BGR format)
      example = cv2.imread(os.path.join(path, class_dir, img))

      # cv2.imread returns None for files it cannot decode; skip those
      if example is None:
        continue

      # BGR -> RGB
      example = cv2.cvtColor(example, cv2.COLOR_BGR2RGB)

      # Now add the matrix and its label to the lists
      data.append(example)
      labels.append(label)

  # Convert them to numpy arrays for ease of use
  data = np.array(data)
  labels = np.array(labels)

  # Return
  return data, labels, label_names

In [None]:
# Load the training data
X_train, y_train, label_names = load_data("/content/skin-cancer-malignant-vs-benign/train")

# Shuffle the data
Your code here...

In [None]:
print("The size of the training dataset:", X_train.shape,y_train.shape)

In [None]:
#The dataset is too large for a mini project. We only work on the first 500 images (from the shuffled data).

Your code here...

print("The size of the training dataset:", X_train.shape,y_train.shape)


In [None]:
#Print 5 pictures, just to check our dataset 
for i in range(5):
  # We can display an example here by the imshow() method
  plt.title(label_names[y_train[i]])
  plt.imshow(X_train[i])
  plt.show()

The data requires preprocessing. The pixel values lie in the range [0, 255]; divide all values by 255 to normalize them to the range [0, 1]. It is also easier to work with the data if the two spatial dimensions of each image are flattened into one vector, i.e. (224 X 224) -> (50176).
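
The two steps can be illustrated on a small synthetic array. This is a sketch only: the random `toy_images` array stands in for real images, and all names are made up for this example.

```python
import numpy as np

# Four fake RGB "images" with integer pixel values in [0, 255]
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(4, 224, 224, 3), dtype=np.uint8)

# Normalize: convert to float first, then scale to [0, 1]
toy_scaled = toy_images.astype(np.float32) / 255.0

# Flatten the two spatial dimensions, keep the color channels
toy_flat = toy_scaled.reshape(toy_scaled.shape[0], -1, 3)

print(toy_flat.shape)  # (4, 50176, 3)
```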

In [None]:
#Normalize the data

Your code here...

In [None]:
# Reshape the data (N, 224, 224, 3) -> (N, 224^2, 3)

Your code here...

print(X_train.shape)

To further simplify the data we may want to reduce the size of each image by converting it to grayscale. This makes it easier to feed the images to models such as support vector machines and random forests. To do this, you can treat each pixel value as a 3-dimensional vector, calculate its norm, and use that norm instead of the RGB value.

In [None]:
# Remove colors to reduce memory requirements
print("Shape before color simplification:", X_train.shape)
X_train = np.linalg.norm(X_train, axis = 2)
print("Shape after color simplification:", X_train.shape)

In [None]:
# The images should now be gray-scaled
#Print 5 pictures, just to check our dataset 

Your code here...

## Task: load and pre-process the test dataset in the same way as we did for the training dataset

In [None]:
X_test, y_test, _ = load_data(path = "/content/skin-cancer-malignant-vs-benign/test")

# Reshape the data (N, 224, 224, 3) -> (N, 224^2, 3)
Your code here...

#Normalize the data
Your code here...

#Transform the pictures into gray-scaled 
Your code here...

print("Shape after color simplification:", X_test.shape,y_test.shape)

# Shuffle the data
Your code here...

#Keep only 200 pictures in the test dataset (for performance reasons)
Your code here...

print("Shape after data reduction:", X_test.shape,y_test.shape)


## Task

Find out which of the following settings gives you the best support vector classifier on the training dataset:

*   Kernel: rbf, linear, polynomial with degree 2, 3, or 4
*   C hyper-parameter: 0.1, 1, or 10
*   Gamma hyper-parameter (for rbf and poly): 'scale', 'auto', or 0.01

To compare these models, run a 5-fold cross-validation on the training dataset for each of them and compare the resulting accuracies.
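
One possible way to organize such a comparison is scikit-learn's `GridSearchCV`, which runs the cross-validation for every combination in a parameter grid. The sketch below uses the iris dataset as a stand-in for the image data, and the names `param_grid` and `search` are chosen for this illustration; it is not the only valid approach, and a hand-written loop over `cross_val_score` works just as well.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Stand-in data; in the project this would be the preprocessed training set
X, y = datasets.load_iris(return_X_y=True)

# One dictionary per kernel, so gamma and degree are only varied where they apply
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", "auto", 0.01]},
    {"kernel": ["poly"], "degree": [2, 3, 4], "C": [0.1, 1, 10],
     "gamma": ["scale", "auto", 0.01]},
]

# 5-fold cross-validation for each of the 39 combinations
search = GridSearchCV(svm.SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best settings:", search.best_params_)
print("Best mean cross-validated accuracy:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refitted on all of the data passed to `fit`, which can then be scored once on the held-out test set.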



*   Q1: What is the kernel in the best model?
*   Q2: What is the C hyper-parameter in the best model?
*   Q3: What is the gamma hyper-parameter in the best model?
*   Q4: What is the accuracy of the best model on the test dataset?

In [None]:
Your code here...