
# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate the prices to integers so that we can classify
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label using the 1-nearest-neighbour rule, followed by a wrapper that applies it to every test sample.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
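
As a quick sanity check of the nearest-neighbour rule, the sketch below applies the same squared-distance computation to three hand-made training points. The function name `nn1_predict` and the toy arrays are assumptions made for this illustration; they are not part of the lab code.

```python
import numpy as np

def nn1_predict(traindata, trainlabel, query):
    # squared Euclidean distance from the query to every training row
    dist = ((traindata - query) ** 2).sum(axis=1)
    # the prediction is the label of the closest training row
    return trainlabel[np.argmin(dist)]

# made-up training set: two points of class 0 near the origin, one point of class 1
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])

pred_near_origin = nn1_predict(toy_train, toy_label, np.array([0.2, 0.1]))
pred_near_five = nn1_predict(toy_train, toy_label, np.array([4.0, 4.5]))
print(pred_near_origin, pred_near_five)  # prints: 0 1
```

The square root is skipped on purpose: it does not change which training point is closest.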

We will also define a 'random classifier', which assigns a randomly chosen label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; it is kept so that the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
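
A small worked example may help: with five samples of which three are predicted correctly, the accuracy is 3/5 = 0.6. The arrays below are made up for illustration only.

```python
import numpy as np

gt_demo = np.array([0, 1, 2, 2, 1])    # made-up ground-truth labels
pred_demo = np.array([0, 2, 2, 2, 0])  # made-up predictions

# element-wise comparison: [True, False, True, True, False]
matches = gt_demo == pred_demo
acc_demo = matches.sum() / len(gt_demo)
print(acc_demo)  # prints 0.6
```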

Let us make a function that splits the dataset randomly: each sample goes into the first part with the desired probability and into the second part otherwise.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
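
Because each sample is assigned independently, the sizes of the two parts are only approximately equal to the requested fraction. The sketch below illustrates this on made-up data; `random_split`, `demo_rng`, and the arrays are hypothetical names used only for this illustration.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator for this demo

def random_split(data, label, fraction, rng):
    # one uniform random number per sample decides which part it goes to
    mask = rng.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]

demo_data = np.arange(2000, dtype=float).reshape(1000, 2)  # 1000 made-up samples
demo_label = np.arange(1000) % 3

d1, l1, d2, l2 = random_split(demo_data, demo_label, 0.2, demo_rng)
frac = len(l1) / len(demo_label)
# the two parts are disjoint and together cover every sample,
# but the first part holds only roughly 20% of them
print(len(l1), len(l2), frac)
```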

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


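For nearest neighbour, the train accuracy is always 1, because every training sample is its own nearest neighbour (at distance zero), assuming no two identical samples carry different labels. The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1667 in our case (6 classes).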
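
The 1/(number of classes) figure for uniform random guessing holds even when the classes are imbalanced, since each guess matches the true label with probability 1/K whatever that label is. The simulation below is a sketch under assumed, made-up class proportions; it does not use the housing data.

```python
import numpy as np

sim_rng = np.random.default_rng(seed=1)
n_classes = 6

# made-up, deliberately imbalanced class proportions (they sum to 1)
proportions = [0.05, 0.40, 0.30, 0.15, 0.07, 0.03]
true_labels = sim_rng.choice(n_classes, size=100_000, p=proportions)

# uniform random guesses, as in RandomClassifier
guesses = sim_rng.integers(low=0, high=n_classes, size=true_labels.size)

sim_acc = (true_labels == guesses).mean()
print(sim_acc, 1 / n_classes)  # the two values should be close
```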

Let us predict the labels for our validation set and compute the accuracy.

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier.

Now let us try another random split and check the validation accuracy.

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try different random splits.
The accuracy differs from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy on the validation set affected if we increase the percentage of data in the validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both the nearest neighbour and the random classifier. You can note down the values from your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
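
One possible way to organise such an experiment is sketched below: sweep the training fraction, repeat the split several times, and plot the mean validation accuracy with its spread. To keep it fast and self-contained it uses made-up two-class data rather than the housing dataset; `nn_predict`, `val_accuracy`, and `sweep_rng` are hypothetical names, not part of the lab code.

```python
import numpy as np
import matplotlib.pyplot as plt

sweep_rng = np.random.default_rng(seed=0)

# Synthetic stand-in data (an assumption): two noisy 2-D classes.
n = 600
toy_label = sweep_rng.integers(0, 2, size=n)
toy_data = sweep_rng.normal(size=(n, 2)) + 1.5 * toy_label[:, None]

def nn_predict(traindata, trainlabel, testdata):
    # pairwise squared distances, shape (n_test, n_train)
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def val_accuracy(train_fraction):
    # random train/validation split, then 1-NN accuracy on the validation part
    mask = sweep_rng.random(n) < train_fraction
    pred = nn_predict(toy_data[mask], toy_label[mask], toy_data[~mask])
    return (pred == toy_label[~mask]).mean()

fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
means, stds = [], []
for f in fractions:
    accs = [val_accuracy(f) for _ in range(20)]  # 20 random splits per fraction
    means.append(np.mean(accs))
    stds.append(np.std(accs))

plt.errorbar(fractions, means, yerr=stds, marker="o")
plt.xlabel("fraction of samples used for training")
plt.ylabel("validation accuracy (mean +/- std over 20 splits)")
```

In a notebook the figure appears when the cell finishes; in a plain script, add `plt.show()` at the end.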

## Multiple Splits

One way to get a more accurate estimate of the test accuracy is to use <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known schemes for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
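
For comparison with the repeated random splits used above, here is a minimal sketch of k-fold cross-validation written with NumPy only. It runs on made-up two-class data, and `nn_predict` and `kfold_accuracies` are hypothetical helper names introduced for this illustration. The key difference is that every sample is used for validation exactly once.

```python
import numpy as np

kf_rng = np.random.default_rng(seed=0)

def nn_predict(traindata, trainlabel, testdata):
    # pairwise squared distances, shape (n_test, n_train)
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def kfold_accuracies(data, label, k, rng):
    idx = rng.permutation(len(label))   # shuffle the sample indices once
    folds = np.array_split(idx, k)      # k nearly equal, disjoint folds
    scores = []
    for i, val_idx in enumerate(folds):
        # train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = nn_predict(data[train_idx], label[train_idx], data[val_idx])
        scores.append((pred == label[val_idx]).mean())
    return np.array(scores), folds

# Synthetic stand-in data (an assumption): two noisy 2-D classes.
n = 300
toy_label = kf_rng.integers(0, 2, size=n)
toy_data = kf_rng.normal(size=(n, 2)) + 1.5 * toy_label[:, None]

fold_scores, folds = kfold_accuracies(toy_data, toy_label, k=5, rng=kf_rng)
print(fold_scores, fold_scores.mean())
```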

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of the test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


QUESTION **1**

Yes, averaging validation accuracy across multiple splits can provide more consistent and robust results compared to relying on a single split. This practice is commonly known as cross-validation. Cross-validation involves splitting the dataset into multiple subsets, training the model on different subsets, and evaluating its performance on the remaining data. The results are then averaged to obtain a more reliable estimate of the model's performance.

Here are a few reasons why cross-validation can lead to more consistent results:

**Reduced Dependency on a Single Split:** In a single train-test split, the performance of the model may be highly dependent on the specific samples chosen for training and testing. Cross-validation helps reduce this dependency by using multiple splits, ensuring that the model is evaluated on different subsets of the data.

**Better Generalization:** Cross-validation helps in assessing how well the model generalizes to different subsets of the data. If the model consistently performs well across multiple splits, it is likely to be more robust and have better generalization performance.

**More Reliable Performance Estimate:** Averaging results from multiple splits provides a more stable estimate of the model's performance. This can be particularly important when dealing with limited data, as it helps mitigate the impact of data variability.

**Identifying Overfitting:** Cross-validation can help detect overfitting. If a model performs well on the training data but poorly on unseen data in multiple splits, it may be a sign of overfitting.

Common types of cross-validation include k-fold cross-validation and stratified k-fold cross-validation. In k-fold cross-validation, the dataset is divided into k subsets, and the model is trained and evaluated k times, with each subset serving as the test set exactly once. Stratified k-fold ensures that each fold maintains the same class distribution as the original dataset.

Keep in mind that the choice of the number of folds (k) and the specific type of cross-validation may depend on factors such as the size of your dataset and the nature of your problem.







QUESTION **2**

Cross-validation, by averaging a performance metric over multiple folds, can indeed give a more accurate estimate of how well a model is likely to perform on unseen data than a single train-test split. However, it is important to be clear that cross-validation estimates generalization performance using only the data available for training and validation. The term "test accuracy" usually refers to performance on a separate test set that the model has not seen during training. In the run above, the averaged validation accuracy (about 0.336) was close to, but slightly below, the test accuracy (about 0.349); one plausible reason is that each validation model is trained on only about 75% of the training data.

Here's how cross-validation contributes to a more accurate estimate:

**Reduced Variability:** Averaging the performance over multiple folds helps reduce the impact of randomness and variability that might be present in a single train-test split. This provides a more stable estimate of the model's performance.

**Better Representation of Data:** Each fold in cross-validation represents a different subset of the data. By training and testing the model on different subsets, cross-validation helps ensure that the model's performance is representative of the entire dataset, capturing different patterns and variations.

**Mitigating Overfitting:** Cross-validation allows you to assess how well the model generalizes across different subsets of the data. If the performance is consistent across folds, it suggests that the model is likely to generalize well to unseen data.

While cross-validation provides a more accurate estimate of the model's generalization performance from the training data alone, it doesn't replace the need for a separate test set to evaluate the model's performance on truly unseen data. In practice, it's common to use cross-validation during the model development phase to tune hyperparameters and assess performance, and then reserve a final test set for the ultimate evaluation of the model's readiness for deployment.

Remember that the accuracy estimate obtained through cross-validation is still an estimate based on the specific splits used, and the actual performance on new, unseen data may vary. It's always a good practice to use a final test set that the model has never encountered during training or cross-validation to provide a more realistic assessment of its performance in a real-world scenario.







QUESTION **3**

The number of iterations affects how stable the estimate is. In the `AverageAccuracy` function above, `iterations` is the number of repeated random splits, while the size of each split is set separately by `splitpercent`. In k-fold cross-validation the corresponding quantity is the number of folds "k", which also fixes the size of each validation fold. Common values for k include 5 and 10, but the best choice depends on factors such as the size of the dataset and the nature of the problem.

Here's how the number of iterations can affect the estimate:

**More Stable Estimate with Higher k:** As you increase the number of iterations (higher k), the cross-validation process becomes more robust, providing a more stable estimate of the model's performance. This is because the model is trained and evaluated on different subsets of the data, and averaging over more folds helps reduce the impact of variability.

**Computational Cost:** A higher number of iterations generally leads to a higher computational cost, since the model needs to be trained and evaluated multiple times. The trade-off between computational cost and the accuracy of the estimate should be considered when choosing the value of k.

**Smaller Test Set in Each Fold:** When you have a limited amount of data, a higher value of k means that each fold has a smaller test set, which makes the accuracy measured on each individual fold noisier. In such cases, it might be beneficial to use a smaller value of k.

**Balance with Dataset Size:** The choice of k should be balanced with the size of your dataset. If you have a large dataset, you might be able to use a larger value of k without sacrificing the size of each test set. For smaller datasets, it's common to use smaller values of k.

It's important to note that there are diminishing returns with increasing values of k. While higher values of k can provide a more stable estimate, the improvement in stability may become marginal beyond a certain point, and the computational cost may become prohibitive.

In practice, it's often recommended to start with a moderate value of k (e.g., 5 or 10) and then adjust based on considerations such as the size of the dataset and available computational resources. Cross-validation is a valuable tool for model evaluation and hyperparameter tuning, but it's crucial to complement it with a final evaluation on a separate test set to ensure a realistic assessment of the model's generalization performance.
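
The effect of the number of iterations on stability can be checked empirically. The sketch below, which uses made-up two-class data and hypothetical names (`single_split_accuracy`, `it_rng`), computes 200 single-split validation accuracies and compares the spread of estimates obtained by averaging 1, 5, or 20 of them. Averaging narrows the spread, but it cannot add information that the dataset does not contain.

```python
import numpy as np

it_rng = np.random.default_rng(seed=0)

# Synthetic stand-in data (an assumption): two noisy 2-D classes.
n = 400
toy_label = it_rng.integers(0, 2, size=n)
toy_data = it_rng.normal(size=(n, 2)) + 1.5 * toy_label[:, None]

def single_split_accuracy(train_fraction=0.75):
    # one random train/validation split followed by 1-NN on the validation part
    mask = it_rng.random(n) < train_fraction
    train_x, train_y = toy_data[mask], toy_label[mask]
    val_x, val_y = toy_data[~mask], toy_label[~mask]
    d = ((val_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return (train_y[d.argmin(axis=1)] == val_y).mean()

single = np.array([single_split_accuracy() for _ in range(200)])

spread = {}
for m in (1, 5, 20):
    # each row of the reshaped array is one estimate built from m splits
    averaged = single.reshape(-1, m).mean(axis=1)
    spread[m] = averaged.std()
    print(m, round(float(spread[m]), 4))
```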







QUESTION **4**


Increasing the number of iterations in cross-validation can help mitigate some challenges associated with very small training or validation datasets. However, it's essential to be aware of certain considerations:

**Increased Robustness:** With a small dataset, there might be a higher chance of variability in the performance metric due to the specific samples chosen for training and testing in each split. Increasing the number of iterations (higher k) allows the model to be trained and evaluated on different subsets, providing a more robust estimate of its performance.

**Better Representation:** Each iteration in cross-validation involves using a different subset for validation. With a higher number of iterations, the evaluation covers a larger variety of training and validation samples, so the averaged estimate better reflects the underlying patterns in the data.

**Optimal Use of Limited Data:** In situations where the dataset is very small, you might want to use a higher fraction of the data for training in each fold to ensure that the model has sufficient information to learn. This can be achieved by using a larger value of k; leave-one-out cross-validation is the extreme case, where k equals the number of samples and each model is trained on all samples but one.

Despite these benefits, there are important considerations:

**Computational Cost:** Increasing the number of iterations also increases the computational cost, as the model needs to be trained and evaluated more times. This is a trade-off that needs to be considered, especially if computational resources are limited.

**Limited Data for Training:** With a very small dataset, each training set in a cross-validation fold may be extremely limited. This can make it challenging for the model to learn complex patterns, and the estimate of performance may still be subject to high variability.

**Careful Interpretation:** While increasing iterations can improve robustness, it doesn't increase the amount of information in the dataset. Care should be taken in interpreting results, especially when dealing with extremely small datasets.

In summary, increasing the number of iterations in cross-validation can be a helpful strategy when dealing with very small datasets, but it's important to strike a balance between gaining robustness in the estimate and the computational cost. Additionally, other techniques such as data augmentation or considering more advanced model architectures may be explored to make the most of limited data.
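
For a 1-nearest-neighbour classifier, leave-one-out cross-validation on a very small dataset can be sketched cheaply: compute all pairwise distances once and forbid each sample from matching itself. The data and names below (`small_data`, `loo_rng`) are made up for illustration and are not taken from the lab.

```python
import numpy as np

loo_rng = np.random.default_rng(seed=0)

# Synthetic stand-in for a very small dataset (an assumption): 40 samples, 2 classes.
n_small = 40
small_label = loo_rng.integers(0, 2, size=n_small)
small_data = loo_rng.normal(size=(n_small, 2)) + 2.0 * small_label[:, None]

# pairwise squared distances between all samples, shape (n_small, n_small)
dist = ((small_data[:, None, :] - small_data[None, :, :]) ** 2).sum(axis=2)

# Train accuracy: every sample finds itself at distance 0, so this is 1.0.
train_acc = (small_label[dist.argmin(axis=1)] == small_label).mean()

# Leave-one-out: exclude each sample from its own neighbour search.
np.fill_diagonal(dist, np.inf)
loo_pred = small_label[dist.argmin(axis=1)]
loo_acc = (loo_pred == small_label).mean()
print(train_acc, loo_acc)
```

Each sample is predicted from the other 39, so every prediction uses as much training data as the small dataset allows.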





