
# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes, such as the median income and the median house age of the district. The task is to predict the house value for each district.

Let us download and examine the dataset.

In [None]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# np.int was deprecated in NumPy 1.20 and later removed; the builtin int performs the same cast
dataset.target = dataset.target.astype(int)  # truncate to integer classes so that we can classify
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)
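
To see what the integer cast does to a continuous target, here is a small sketch on made-up values (the array `values` is hypothetical and not taken from the dataset). Truncating toward zero places each value in an integer bin, and those bins then serve as class labels.

```python
import numpy as np

# Hypothetical house values (illustrative only, not read from the dataset).
values = np.array([0.45, 1.20, 1.99, 2.50, 3.75, 5.00001])

# astype(int) truncates toward zero, so 1.20 and 1.99 land in the same class.
labels = values.astype(int)

# Count how many samples fall in each class.
classes, counts = np.unique(labels, return_counts=True)
print(labels)   # class label per sample
print(classes)  # distinct classes present
print(counts)   # number of samples per class
```

Running `np.unique(dataset.target, return_counts=True)` on the real labels shows in the same way how unevenly the classes are populated.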


Here is a function that predicts a label with the 1-nearest-neighbour rule.

In [None]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
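
As a quick sanity check, the two functions can be tried on a toy problem. The sketch below repeats the definitions so that it runs on its own; the arrays `toy_train`, `toy_label` and `toy_query` are made up for illustration.

```python
import numpy as np

def NN1(traindata, trainlabel, query):
    diff = traindata - query         # broadcasting: (n, d) - (d,) -> (n, d)
    dist = (diff * diff).sum(1)      # squared Euclidean distance to every training point
    return trainlabel[np.argmin(dist)]

def NN(traindata, trainlabel, testdata):
    return np.array([NN1(traindata, trainlabel, q) for q in testdata])

# Toy data (hypothetical): two well-separated clusters in 2-D.
toy_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
toy_label = np.array([0, 0, 1, 1])
toy_query = np.array([[0.2, 0.4], [5.5, 4.0]])

toy_pred = NN(toy_train, toy_label, toy_query)
print(toy_pred)  # [0 1]
```

The squared distance is enough here because taking the square root would not change which training point is closest.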

We will also define a 'random classifier', which randomly allots labels to each sample

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
  # in reality, we don't need these arguments

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [None]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
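
A tiny worked example may help make the definition concrete. The label arrays below are made up; three of the five predictions match, so the accuracy is 3/5.

```python
import numpy as np

def Accuracy(gtlabel, predlabel):
    assert len(gtlabel) == len(predlabel)
    return (gtlabel == predlabel).sum() / len(gtlabel)

gt = np.array([0, 1, 2, 1, 0])     # hypothetical ground-truth labels
pred = np.array([0, 1, 1, 1, 2])   # hypothetical predictions, 3 of 5 match
acc = Accuracy(gt, pred)
print(acc)  # 0.6
```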

Let us make a function that splits the dataset at random, sending each sample to the first split with the given probability.

In [None]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
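
Because each sample is assigned independently, the split sizes are only approximately the requested fraction. The sketch below illustrates this with a helper named `split_sizes` and a separate generator `demo_rng`, both introduced here for illustration only.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator, used only in this sketch

def split_sizes(n, percent, rng):
    # same rule as split(): one uniform random number per sample
    rnd = rng.random(n)
    return int((rnd < percent).sum()), int((rnd >= percent).sum())

# Repeat the split a few times: the sizes hover around 200 / 800 but differ per run.
for trial in range(3):
    n1, n2 = split_sizes(1000, 0.2, demo_rng)
    print(trial, n1, n2)
```

If exact sizes matter, shuffling the indices and cutting at a fixed position is a common alternative.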

We will reserve about 20% of our dataset as the test set (the split is random, so the fraction is approximate). We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


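For nearest neighbour, the train accuracy is always 1, because every training sample is its own nearest neighbour (barring duplicate points with different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.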
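
The 1/(number of classes) figure holds even when the classes are imbalanced, since a uniform guess matches any given true label with probability 1/k. The simulation below is a minimal sketch of this; the class proportions and the names `sim_rng`, `true_labels` and `guesses` are assumptions made for illustration.

```python
import numpy as np

sim_rng = np.random.default_rng(seed=0)
num_classes = 6
n = 100_000

# Deliberately imbalanced ground truth (hypothetical class proportions).
true_labels = sim_rng.choice(num_classes, size=n, p=[0.05, 0.4, 0.3, 0.15, 0.07, 0.03])

# Uniform random guesses, as in RandomClassifier.
guesses = sim_rng.integers(low=0, high=num_classes, size=n)

sim_acc = (true_labels == guesses).mean()
print(sim_acc, 1 / num_classes)  # the two numbers should be close
```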

Let us predict the labels for our validation set and get the accuracy

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier stays about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
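
One possible way to organise such an experiment is sketched below. To keep it fast and self-contained it uses synthetic data (three Gaussian blobs) instead of the housing data, and a vectorised nearest-neighbour helper; `nn_predict`, `random_split`, `exp_rng` and `results` are names introduced for this sketch only.

```python
import numpy as np

exp_rng = np.random.default_rng(seed=0)

# Synthetic stand-in data (assumption): three overlapping 2-D Gaussian blobs.
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
labels = exp_rng.integers(0, 3, size=600)
data = centers[labels] + exp_rng.normal(scale=1.0, size=(600, 2))

def nn_predict(traindata, trainlabel, testdata):
    # squared distances between every test and train point, shape (n_test, n_train)
    d2 = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d2.argmin(axis=1)]

def random_split(data, label, percent, rng):
    mask = rng.random(len(label)) < percent
    return data[mask], label[mask], data[~mask], label[~mask]

results = {}
for train_frac in [0.01, 0.1, 0.5, 0.9, 0.99]:
    accs = []
    for _ in range(20):
        tr_x, tr_y, va_x, va_y = random_split(data, labels, train_frac, exp_rng)
        if len(tr_y) == 0 or len(va_y) == 0:
            continue  # an extreme split can leave one side empty
        accs.append((nn_predict(tr_x, tr_y, va_x) == va_y).mean())
    results[train_frac] = (float(np.mean(accs)), float(np.std(accs)))
    print(f"train fraction {train_frac:.2f}: "
          f"mean {results[train_frac][0]:.3f}, std {results[train_frac][1]:.3f}")
```

The means and standard deviations stored in `results` can then be drawn with `plt.plot` or `plt.errorbar`. The same loop should work with `split`, `NN` and `RandomClassifier` from this lab on the housing data, though it will take much longer to run.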

**1ST QUESTION ANSWER**

Increasing the Percentage of Validation Set:

=>More Reliable Validation Estimate: Allocating a larger percentage of the data to the validation set leaves a smaller training set. The validation accuracy is then measured on a larger and more representative subset of the data, so the estimate fluctuates less from split to split and better reflects how the model generalizes to unseen data. The accuracy itself does not go up; with less training data it usually goes down.

=>Risk of Underfitting: However, increasing the percentage of the validation set also reduces the amount of data available for training. If you make the validation set too large, your model might not have enough training data to learn meaningful patterns, leading to underfitting.

Reducing the Percentage of Validation Set:

=>More Training Data: Reducing the percentage of the validation set increases the size of the training set, so the model learns from more data. This usually raises the validation accuracy on average. For nearest neighbour the train accuracy is already 1, so it cannot improve further.

=>Risk of an Unreliable Estimate: With a smaller validation set, you have less data to evaluate how well your model generalizes. The measured accuracy then varies widely between splits and can be overly optimistic or overly pessimistic purely by chance.

The choice of the percentage allocated to the validation set depends on several factors:

=>Size of the Dataset: In small datasets, you might need to allocate a larger portion to the validation set to ensure a representative sample for evaluation.

=>Model Complexity: More complex models often require larger validation sets to assess their performance accurately.

=>Computational Resources: A larger validation set might require more time and computational resources for training and evaluation.

=>Overfitting Concerns: If you suspect overfitting, a larger validation set can help detect it, but it also leaves less data for training, which can make the overfitting worse.

=>In practice, a common choice is a 70-30 or 80-20 split for training and validation data. However, these percentages can vary depending on the specific problem and dataset characteristics. It's often necessary to experiment with different splits to find the balance that works best for your particular situation. Cross-validation can also be a valuable technique to assess the model's performance more robustly when dealing with different data splits.



**2ND QUESTION ANSWER**

The size of the training and validation sets affects how well the validation accuracy predicts the accuracy on the test set. Here is how different scenarios may play out:

Adequate Training Data and Representative Validation Set:

Scenario: You have a reasonably large training set, and the validation set is representative of the data distribution. The validation set is neither too small nor too large compared to the training set.

Effect: In this scenario, you are likely to get a good estimate of your model's performance on the test set. The validation set is representative enough to provide an accurate assessment of how well your model generalizes to unseen data.

Inadequate Training Data:

Scenario: If your training set is very small compared to the validation set, your model may not have enough data to learn meaningful patterns, and its performance on the validation set may not be a reliable indicator of test set performance.

Effect: In this case, the validation accuracy tends to underestimate the accuracy on the test set when the final model is trained on all the training data, as in this lab. The model evaluated on the validation set has seen far fewer samples than the one evaluated on the test set, so it generalizes less well.

Inadequate Validation Set:

Scenario: If the validation set is too small compared to the training set, your model may not be thoroughly evaluated for generalization, and its performance on the validation set may not accurately predict test set performance.

Effect: Here, the validation accuracy may land well above or well below the model's true performance on the test set. The small validation set does not capture the variability in the data, so the estimate is noisy rather than biased in one direction.

Representative Validation Set but Overfitting:

Scenario: Your training set is adequate in size, but the model is overfitting the training data, resulting in a high train accuracy (1.0 for nearest neighbour above).

Effect: In this case, the train accuracy does not predict test set performance, because the model has memorized the training data but hasn't learned to generalize well. The validation accuracy, measured on samples the model has not seen, remains the better guide (about 0.34 against a test accuracy of about 0.35 above).

In summary, the size of the training and validation sets is crucial in assessing how well the validation set predicts test set accuracy. It's essential to strike a balance between the sizes of these sets to ensure that the validation set is representative and that the model has been adequately trained. Cross-validation can be a useful technique to mitigate some of these issues by repeatedly splitting the data into training and validation subsets and assessing performance across multiple iterations, providing a more robust estimate of test set accuracy.
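
A rough way to quantify the effect of the validation-set size is the binomial standard error. If each validation sample is treated as an independent trial that is classified correctly with probability p, the measured accuracy has a standard error of sqrt(p(1-p)/n). The sketch below assumes p = 0.34, close to the nearest-neighbour accuracy observed above; the sizes in the loop are illustrative and `accuracy_std_error` is a name introduced here.

```python
import math

def accuracy_std_error(p, n):
    # standard error of a proportion estimated from n independent trials
    return math.sqrt(p * (1 - p) / n)

p = 0.34  # assumed true accuracy
for n_val in [16, 165, 1650, 4124]:
    se = accuracy_std_error(p, n_val)
    print(f"n_val = {n_val:5d}: standard error ~ {se:.3f}")
```

Under this simplified model, shrinking the validation set by a factor of 100 makes the estimate about 10 times noisier.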



**3RD QUESTION ANSWER**

=>The percentage of data to reserve for the validation set can vary depending on the size of your dataset and the specific problem you're tackling. However, a commonly used guideline is to reserve around 20% to 30% of your data for the validation set when you have a reasonably sized dataset (i.e., not extremely small). This range often strikes a good balance between having enough data for training and a representative validation set while avoiding overfitting and underfitting issues.

Here are some considerations when choosing the percentage for the validation set:

=>Dataset Size: If you have a very small dataset, you might need to allocate a larger portion to the validation set (e.g., 40% or even 50%) to ensure that the validation set remains reasonably sized. In such cases, you might also consider techniques like k-fold cross-validation to make the most of your limited data.

=>Model Complexity: More complex models often require larger validation sets to assess their performance accurately. If you're working with a simple model, a smaller validation set might suffice.

=>Computational Resources: Keep in mind the computational resources available to you. A larger validation set can increase the time and computational power required for training and evaluation.

=>Overfitting Concerns: If you suspect overfitting, a larger validation set can help detect it. However, if the training dataset is too small, overfitting may still be a concern, and you should focus on techniques to mitigate it, such as regularization.

=>Data Variability: Consider the variability in your dataset. If your data has significant variations or is prone to outliers, a larger validation set can help capture these variations and provide a more robust estimate of model performance.

=>It's important to note that there is no one-size-fits-all answer, and the choice of the validation set size may require experimentation and fine-tuning based on your specific problem and dataset characteristics. Additionally, cross-validation techniques, such as k-fold cross-validation, can be used to assess model performance more robustly when dealing with different data splits, which can help in situations where it's challenging to determine the ideal validation set size.



## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [None]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [None]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666
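
The average alone hides how much the individual splits disagree. A possible variant, sketched below, also reports the standard deviation, minimum and maximum across splits. It runs on synthetic two-class data to stay fast; `accuracy_stats`, `nn_predict`, `cv_rng`, `toy_data` and `toy_label` are names introduced for this sketch.

```python
import numpy as np

cv_rng = np.random.default_rng(seed=0)

def nn_predict(traindata, trainlabel, testdata):
    # vectorised 1-nearest neighbour on squared Euclidean distance
    d2 = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d2.argmin(axis=1)]

def accuracy_stats(alldata, alllabel, splitpercent, iterations, rng):
    accs = []
    for _ in range(iterations):
        mask = rng.random(len(alllabel)) < splitpercent   # True -> train, False -> validation
        pred = nn_predict(alldata[mask], alllabel[mask], alldata[~mask])
        accs.append((pred == alllabel[~mask]).mean())
    accs = np.array(accs)
    return accs.mean(), accs.std(), accs.min(), accs.max()

# Synthetic stand-in data (assumption): two Gaussian classes shifted by 1.5 per axis.
toy_label = cv_rng.integers(0, 2, size=400)
toy_data = cv_rng.normal(size=(400, 2)) + 1.5 * toy_label[:, None]

mean_acc, std_acc, min_acc, max_acc = accuracy_stats(toy_data, toy_label, 0.75, 10, cv_rng)
print(f"mean {mean_acc:.3f}, std {std_acc:.3f}, range [{min_acc:.3f}, {max_acc:.3f}]")
```

A small standard deviation relative to the gap between two classifiers suggests that the comparison does not depend on one lucky split.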


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


**1ST QUESTION ANSWER**

=>Yes. Averaging the validation accuracy across multiple splits (a simple form of cross-validation) gives more consistent and reliable results when evaluating the performance of a machine learning model. Cross-validation is a robust technique used to assess how well a model generalizes to unseen data and can help mitigate the impact of data randomness and variability.

Here's how cross-validation works and why it's beneficial:

=>Reduced Variance: When you split your dataset into multiple subsets (folds) and train/validate your model on each of them separately, you get multiple accuracy scores. Averaging these scores reduces the impact of outliers or specific quirks in a single data split, which can lead to a more stable and reliable estimate of your model's performance.

=>Better Generalization Assessment: Cross-validation provides a more comprehensive assessment of your model's generalization ability. By testing your model on different subsets of the data, you gain insights into how well it performs across various data samples, which can help you detect overfitting or underfitting issues.

=>Maximizing Data Utilization: In traditional train-test splitting, you might reserve a significant portion of your data for testing, which reduces the amount of data available for training. Cross-validation iteratively uses different subsets for validation while maximizing the use of your data for training, which can lead to more efficient model training.

=>Common cross-validation techniques include k-fold cross-validation (where you split your data into k subsets and use each as a validation set while the rest serve as the training set) and stratified cross-validation (ensuring that each fold has a similar class distribution as the original data).
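
The k-fold scheme described above can be sketched with NumPy alone. The generator `kfold_indices` and the generator object `fold_rng` are hypothetical names for this illustration; libraries such as scikit-learn ship their own implementations.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # shuffle once, cut into k nearly equal folds, then rotate the validation fold
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

fold_rng = np.random.default_rng(seed=0)
all_folds = list(kfold_indices(10, 5, fold_rng))
for train_idx, val_idx in all_folds:
    print("train:", sorted(train_idx.tolist()), "val:", sorted(val_idx.tolist()))
```

Unlike repeated random splits, every sample appears in a validation fold exactly once.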

=>In summary, averaging the validation accuracy across multiple splits through cross-validation is a best practice for model evaluation. It provides a more reliable estimate of your model's performance and helps you make better decisions about model selection and hyperparameter tuning.





**2ND QUESTION ANSWER**

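Averaging over several splits makes the estimate more stable, but it does not by itself bring it closer to the test accuracy. In the run above, the averaged validation accuracy (about 0.336) is slightly further from the test accuracy (about 0.349) than a single split was (about 0.341). A likely reason is that each validation model is trained on only 75% of the training data, while the test model uses all of it, so the average can stay a little pessimistic however many splits are used. Here's why cross-validation and a test set play different roles: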

*Validation vs. Test Data: In cross-validation, you split your dataset into multiple subsets and use them alternately for training and validation. These validation subsets are still part of your original dataset. They are used to estimate how well your model generalizes to unseen data but are not entirely independent of your training data.

*Test Data Independence: The true test accuracy, on the other hand, is determined by using a separate and entirely independent dataset that the model has never seen during training or validation. This is often referred to as the "holdout" test set, and it simulates how your model will perform when applied to completely new and unseen data.

*While cross-validation provides a more robust estimate of your model's performance on the data you have, it doesn't replace the need for a holdout test set to assess how your model will perform in real-world scenarios. The holdout test set is crucial for understanding how your model will generalize to new, previously unseen data.

In practice, the workflow typically involves:

=>Cross-validation: Use cross-validation to assess your model's performance on the available data, refine hyperparameters, and make model selection decisions.

=>Holdout Test Set: After you've finalized your model, evaluate it on a holdout test set to get an accurate estimate of how it will perform in real-world applications. This gives you a more accurate measure of the model's test accuracy.

*In summary, cross-validation provides a useful estimate of model performance on your available data but does not directly replace the need for a holdout test set to assess performance on truly independent, unseen data. Both techniques are important for a comprehensive evaluation of your machine learning model.



**3RD QUESTION ANSWER**

In the `AverageAccuracy` function above, the number of iterations is the number of random train/validation splits whose accuracies are averaged. In k-fold cross-validation the analogous quantity is the number of folds, denoted "k". Each iteration gives you one estimate of model performance.

*The effect of the number of iterations on the estimate is as follows:

=>Higher Iterations (Higher k): Using a higher number of iterations or folds in cross-validation (e.g., 10-fold or 5-fold) can provide a more stable and reliable estimate of your model's performance. It reduces the impact of randomness in the data splitting process. With more iterations, you are essentially averaging the results from multiple train-validation splits, which can lead to a more robust estimate.

=>Lower Iterations (Lower k): Using a lower number of iterations (e.g., 2-fold or 3-fold) can be computationally less expensive, but it might result in more variability in your performance estimates. The estimates can be sensitive to the specific data split in each iteration, and you may get less stable results.

=>However, it's essential to strike a balance between computational cost and accuracy when choosing the number of iterations. Very high values of k (e.g., leave-one-out cross-validation with k equal to the number of data points) can be computationally expensive and may not significantly improve the estimate's stability. On the other hand, very low values of k may not provide a reliable estimate due to increased variability.

=>In practice, common choices for k in k-fold cross-validation are 5 or 10, as they strike a reasonable balance between stability and computational cost. The choice of the number of iterations should depend on the size of your dataset, available computational resources, and the need for a stable estimate.

*In summary, increasing the number of iterations in cross-validation can lead to a more stable and reliable estimate of your model's performance, but there is a trade-off with computational cost. The optimal value of k depends on various factors and should be chosen carefully based on your specific situation.
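
The stabilising effect of more iterations can be illustrated with a small simulation. This is a simplified model rather than the lab's experiment: each split's accuracy is drawn as an independent binomial proportion with an assumed true accuracy `p` and validation size `n_val`, and the names `sim_rng`, `averaged_estimate` and `spread` are introduced here.

```python
import numpy as np

sim_rng = np.random.default_rng(seed=0)
p, n_val = 0.34, 400   # assumed true accuracy and validation-set size

def averaged_estimate(iterations):
    # simulate one accuracy per split, then average them as AverageAccuracy does
    per_split = sim_rng.binomial(n_val, p, size=iterations) / n_val
    return per_split.mean()

spread = {}
for iterations in [1, 10, 100]:
    # repeat the whole procedure many times to see how much the average itself varies
    estimates = [averaged_estimate(iterations) for _ in range(2000)]
    spread[iterations] = float(np.std(estimates))
    print(iterations, round(spread[iterations], 4))
```

In this idealised setting the spread shrinks roughly like 1/sqrt(iterations). With real data the splits overlap and are correlated, so the gain is smaller, and averaging does not remove any bias caused by training on less data.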



**4TH QUESTION ANSWER**

=>Increasing the number of iterations (folds) in cross-validation can help when dealing with a very small training dataset or validation dataset to some extent, but it has limitations. While more iterations can provide a more stable estimate of model performance, they do not fundamentally address the issues associated with having a very small dataset.

Here are some considerations:

*Advantages of Increasing Iterations:

=>Improved Stability: With more iterations, you reduce the impact of random variations in the data splitting process. This can lead to more consistent estimates of model performance.

=>Better Utilization: When you have a small dataset, using more iterations can help you make better use of the limited data you have for both training and validation.

*Limitations and Considerations:

=>Data Size: The fundamental limitation of having a small dataset is that your model may not be able to learn meaningful patterns, and the performance estimates may not generalize well to new data. Increasing iterations cannot create more data; it can only provide more robust estimates based on the available data.

=>Overfitting Risk: With very small datasets, you should be cautious about overfitting. If your dataset is extremely small, even a portion of it used for validation in each fold may be significant. This can lead to overly optimistic estimates of model performance. In such cases, it might be better to use techniques like leave-one-out cross-validation or stratified sampling to ensure each fold has a representative distribution.

=>Computational Cost: Increasing the number of iterations is computationally expensive, because every iteration reruns the classifier on a new split. You should consider the trade-off between computational resources and the benefit gained from additional iterations.

=>In summary, while increasing the number of iterations in cross-validation can help mitigate some of the challenges associated with very small datasets, it does not solve the fundamental problem of limited data. You should be cautious about overfitting and carefully consider the computational cost when deciding on the number of iterations. Additionally, other techniques such as data augmentation, transfer learning, or obtaining more data, if possible, may be more effective ways to address the challenges of a small dataset.
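
For very small datasets, the leave-one-out scheme mentioned above is cheap to write for a 1-nearest-neighbour classifier: each sample is classified by its closest other sample. The function `loo_accuracy_1nn` and the six-point dataset below are made up for illustration.

```python
import numpy as np

def loo_accuracy_1nn(data, label):
    # pairwise squared distances, shape (n, n)
    d2 = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)   # a held-out point may not be its own neighbour
    return (label[d2.argmin(axis=1)] == label).mean()

# Hypothetical tiny dataset: two tight clusters of three points each.
small_data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
small_label = np.array([0, 0, 0, 1, 1, 1])

loo_acc = loo_accuracy_1nn(small_data, small_label)
print(loo_acc)  # 1.0
```

Every sample is used for validation once and for training in all other rounds, which makes full use of the data but still cannot add information that the dataset does not contain.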