
# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will show a part of the ML pipeline by using the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [None]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [None]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Given below is the list of target values. These are continuous values that correspond to the median house value of each district, which we want to predict from the 8 input features. Regression models are the natural choice for such targets, but for the sake of simplicity we will start with a classification model. To do so, we convert the values to integers with `astype(int)`, which truncates the fractional part rather than rounding (for example, 4.526 becomes 4 and 0.923 becomes 0), and use a classification model to predict the resulting integer class.

In [None]:
print("Original target values:", dataset.target)

# astype(int) truncates the fractional part (it does not round to nearest)
dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
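
Note that `astype(int)` drops the fractional part instead of rounding to the nearest integer, which is why 0.923 becomes 0 in the output above. The following is a minimal sketch contrasting the two conversions on a few of the printed target values; `np.rint` is shown only for comparison and is not what the lab uses.

```python
import numpy as np

# A few of the target values printed above
values = np.array([4.526, 3.585, 3.521, 0.923, 0.847, 0.894])

truncated = values.astype(int)          # drops the fractional part
rounded = np.rint(values).astype(int)   # rounds to the nearest integer

print("astype(int):", truncated)
print("np.rint:    ", rounded)
```

The two conversions produce different class labels, so the choice changes the classification task itself.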


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [None]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # find the difference between features. Numpy automatically takes care of the size here
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
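
The loop in `NN` calls `NN1` once per test point. As a rough sketch of the same idea, the distances for all test points can also be computed in one broadcasting step. The function name `nn_vectorized` and the toy arrays are hypothetical, and this variant builds an array of shape (m, n, d), so it is only practical when that fits in memory.

```python
import numpy as np

def nn_vectorized(traindata, trainlabel, testdata):
    # Pairwise differences via broadcasting: shape (m, n, d)
    diff = testdata[:, None, :] - traindata[None, :, :]
    # Squared Euclidean distances: shape (m, n)
    dist = (diff * diff).sum(axis=2)
    # For each test point, take the label of the closest training point
    return trainlabel[np.argmin(dist, axis=1)]

toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_labels = np.array([0, 0, 1])
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nn_vectorized(toy_train, toy_labels, toy_test)
print(toy_pred)
```

The first query is closest to the point at the origin and the second to the point at (5, 5), so they receive the labels 0 and 1 respectively.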

We will also define a 'random classifier', which randomly allots labels to each sample

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [None]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
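
As a quick check of the definition, here is a small worked example on made-up labels (the arrays are illustrative only). The mean of the element-wise comparison is the same quantity as the number of correct predictions divided by the total.

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 1])      # made-up ground-truth labels
pred = np.array([0, 2, 2, 2, 0])    # made-up predictions

# Three of the five predictions match, so the accuracy is 3 / 5
acc = np.mean(gt == pred)
print(acc)
```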

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [None]:
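def split(data, label, percent):
    """
    Randomly splits the data and labels into two parts

    data: numpy array of shape (n,d)
    label: numpy array of shape (n,)
    percent: float between 0 and 1, the probability that a sample goes into the first part

    returns: split1data, split1label, split2data, split2label
    Each sample is assigned independently, so the first part holds only approximately percent * n samples
    """
    # generate a random number in [0, 1) for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label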

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %
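
The test share printed above is 20.08% rather than exactly 20% because `split` assigns each sample independently at random. If an exact size is wanted, one possible alternative is to shuffle the indices and cut them at a fixed position. The helper `split_exact` below is a hypothetical sketch, not part of the lab code.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    """Hypothetical variant: the first part gets exactly round(fraction * n) samples."""
    n = len(label)
    order = rng.permutation(n)           # shuffled sample indices
    cut = int(round(fraction * n))
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]

demo_rng = np.random.default_rng(seed=0)
demo_data = np.arange(40).reshape(20, 2)   # 20 toy samples with 2 features
demo_label = np.arange(20)

a_data, a_label, b_data, b_label = split_exact(demo_data, demo_label, 0.2, demo_rng)
print(len(a_label), len(b_label))
```

With 20 toy samples and a fraction of 0.2, the two parts always contain 4 and 16 samples.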


## Experiments with splits

Let us reserve some of our train data as a validation set

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is 100% because every training sample is its own nearest neighbour (at distance 0) and is therefore assigned its own label; this holds unless there are duplicate points with different labels. The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1667 in our case (6 classes). This is because the random classifier picks a label uniformly at random for each sample, so the probability of picking the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %
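
The random classifier stays near 1/(number of classes) on both the train and validation sets. A small simulation can illustrate why this holds even when the classes are imbalanced. The class proportions below are made up for illustration and are not taken from the housing data.

```python
import numpy as np

sim_rng = np.random.default_rng(seed=0)
num_classes = 6

# Deliberately imbalanced ground-truth labels (illustrative proportions)
true_labels = sim_rng.choice(
    num_classes, size=100_000, p=[0.05, 0.35, 0.30, 0.15, 0.10, 0.05]
)
# Uniform random guesses, as in RandomClassifier
guesses = sim_rng.integers(low=0, high=num_classes, size=true_labels.size)

sim_acc = np.mean(true_labels == guesses)
print(sim_acc, 1 / num_classes)
```

Whatever the true label is, a uniform guess matches it with probability 1/6, so the class proportions do not affect the expected accuracy.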


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split. This is because the validation set is a finite random sample, so the accuracy depends on which samples happen to land in it. We can get a better estimate of the accuracy by using cross-validation.

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
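
As a starting point for the exercise, here is a NumPy-only sketch of a k nearest neighbour classifier with majority voting, run on made-up one-dimensional data. The function `knn_predict` and the toy arrays are hypothetical, and the sketch assumes the labels are non-negative integers. The scikit-learn version suggested in the exercise is left to the reader.

```python
import numpy as np

def knn_predict(traindata, trainlabel, testdata, k=3):
    """Hypothetical k-NN: majority vote among the k closest training points."""
    preds = []
    for query in testdata:
        dist = ((traindata - query) ** 2).sum(axis=1)   # squared distances
        nearest = np.argsort(dist)[:k]                  # indices of the k closest
        votes = np.bincount(trainlabel[nearest])        # count labels among them
        preds.append(np.argmax(votes))                  # ties go to the smallest label
    return np.array(preds)

toy_x = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [0.15]])
toy_y = np.array([0, 0, 0, 1, 1, 1])   # the last point is a mislabeled outlier
queries = np.array([[0.14], [1.05]])

pred_k1 = knn_predict(toy_x, toy_y, queries, k=1)
pred_k3 = knn_predict(toy_x, toy_y, queries, k=3)
print(pred_k1, pred_k3)
```

For the first query, k=1 copies the label of the mislabeled outlier, while k=3 outvotes it. This is the kind of difference to look for when comparing the two classifiers on the test set.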

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of validation accuracies as the test accuracy estimation. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [None]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1, the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [None]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %
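
`AverageAccuracy` returns only the mean over the splits. Looking at the spread of the per-split accuracies as well shows how much a single split can vary. The sketch below does this on synthetic two-class data so that it runs quickly; the data, the helper `one_nn`, and the choice of 20 splits are assumptions made for illustration.

```python
import numpy as np

g_rng = np.random.default_rng(seed=0)

# Synthetic two-class data (illustrative only): two Gaussian blobs
X = np.vstack([g_rng.normal(0.0, 1.0, size=(100, 2)),
               g_rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

def one_nn(traindata, trainlabel, testdata):
    # 1-nearest-neighbour prediction using pairwise squared distances
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]

accs = []
for _ in range(20):
    mask = g_rng.random(len(y)) < 0.75      # same idea as split() with 75% train
    pred = one_nn(X[mask], y[mask], X[~mask])
    accs.append(np.mean(pred == y[~mask]))

accs = np.array(accs)
print("mean:", accs.mean(), "std:", accs.std())
```

The standard deviation gives a rough sense of how far a single split can land from the average.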


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
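Yes, averaging the validation accuracy across multiple splits can indeed provide more consistent results. Repeating random train/validation splits and averaging, as done above, is a simple form of cross-validation (often called repeated random sub-sampling or Monte Carlo cross-validation), a technique widely used to assess the generalizability of a model. Here's why it works: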

Why Averaging Validation Accuracy Helps with Consistency

1. Reduces Variability:

A single train-test split can be highly dependent on the particular data you happen to get in the test set. If the test set is not representative of the overall dataset (due to randomness), your model evaluation may be biased.

Multiple splits average out this randomness by using different subsets of the data for training and validation, leading to a more robust and reliable estimate of the model's performance.



2. Better Generalization Estimate:

When you perform k-fold cross-validation (typically with k=5 or k=10), you partition the data into k subsets. The model is trained on k-1 folds and tested on the remaining fold, and this process is repeated k times, each with a different test fold.

Averaging the results across these splits helps to obtain a better estimate of how the model will perform on unseen data, because it has been validated on multiple, distinct subsets of the data.



3. Handles Data Imbalances Better:

In the case of imbalanced data (where some classes are underrepresented), a single train-test split might not reflect the performance on all classes well. Multiple splits help to ensure that each class appears in both training and validation sets in different ways, making the model's performance more consistent across different subsets.



4. Exposes Overfitting:

If a model performs well on a single train-test split but poorly on others, it suggests the model might be overfitting to particular subsets of the data. Averaging across multiple validation splits helps highlight any overfitting tendencies and provides a more reliable estimate of how the model will perform on new, unseen data.




Cross-Validation Types

There are several types of cross-validation you can use to average the validation accuracy over multiple splits:

1. k-Fold Cross-Validation:

The dataset is divided into k subsets (or "folds"). The model is trained on k-1 folds and tested on the remaining fold, and this process is repeated k times.

The final result is the average accuracy across all k folds.



2. Stratified k-Fold Cross-Validation:

Similar to k-fold, but it ensures that the class distribution is approximately the same in each fold. This is especially useful when the data is imbalanced, ensuring that each fold has a representative distribution of classes.



3. Leave-One-Out Cross-Validation (LOO-CV):

In this extreme case, k is set to the total number of samples, so each test fold consists of a single sample, and the model is trained on the remaining data. This can be computationally expensive but is useful for small datasets.



4. Repeated Cross-Validation:

This involves performing k-fold cross-validation multiple times (with different random splits each time). Averaging the results across multiple runs gives even more consistent performance estimates.
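
To make the stratified variant from the list above concrete, here is a minimal NumPy sketch that assigns samples to folds class by class, so that every fold keeps roughly the same class proportions. The labels and variable names are made up for illustration.

```python
import numpy as np

s_rng = np.random.default_rng(seed=0)
labels = np.array([0] * 90 + [1] * 10)   # imbalanced toy labels
k = 5

fold_of = np.empty(len(labels), dtype=int)
for cls in np.unique(labels):
    # Shuffle the indices of this class only
    cls_idx = s_rng.permutation(np.flatnonzero(labels == cls))
    # Deal them round-robin across the k folds
    fold_of[cls_idx] = np.arange(len(cls_idx)) % k

for f in range(k):
    in_fold = labels[fold_of == f]
    print(f"fold {f}: size={len(in_fold)}, minority count={(in_fold == 1).sum()}")
```

Each of the 5 folds ends up with 18 majority and 2 minority samples, whereas a plain random assignment could leave some folds with no minority samples at all.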





---

Example: Cross-Validation Workflow

Imagine you are training a model on a dataset with 100 samples. Using k-fold cross-validation (k=5):

1. Split the data into 5 subsets.


2. Train the model on 4 subsets (80 samples) and test it on the remaining 1 subset (20 samples).


3. Repeat this process 5 times, each time using a different subset as the test set and the remaining ones as the training set.


4. Compute the accuracy for each fold and average the accuracies.



The final result is the average validation accuracy across the 5 splits, providing a more stable and reliable measure of the model’s performance compared to a single train-test split.
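
The index bookkeeping for the 100-sample, 5-fold example above can be sketched as follows. Only the fold construction is shown; training and scoring a model on each fold is omitted, and the variable names are illustrative.

```python
import numpy as np

k_rng = np.random.default_rng(seed=0)
n_samples, k = 100, 5

indices = k_rng.permutation(n_samples)   # shuffle the sample indices once
folds = np.array_split(indices, k)       # 5 disjoint folds of 20 indices each

for i, val_idx in enumerate(folds):
    # All folds except fold i form the training set for this round
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train={len(train_idx)}, val={len(val_idx)}")
```

Unlike repeated random splits, every sample is used for validation exactly once.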


---

When Does Averaging Help?

Small Datasets: In small datasets, a single split might give highly variable results, so averaging helps provide a more accurate estimate.

Imbalanced Datasets: If one class is underrepresented, a single split might lead to poor evaluation. Averaging across different splits (especially stratified cross-validation) ensures that all classes are represented adequately in both the training and testing phases.

Model Tuning: When you tune hyperparameters, averaging validation accuracy across multiple splits gives you a better sense of the model's true performance, preventing overfitting to one particular data subset.



---

Limitations of Cross-Validation

While cross-validation helps to reduce variance and provide more reliable results, it still has some limitations:

1. Computational Cost: Cross-validation can be computationally expensive, especially for large datasets or complex models, because it requires training the model multiple times (once for each fold).


2. Data Leakage: If data leakage occurs (for instance, when information from the test set is used during training), it can invalidate the cross-validation results.




---

Conclusion

Averaging validation accuracy across multiple splits in cross-validation does indeed provide more consistent and reliable results, especially in the presence of data variability, imbalanced classes, or small datasets. It helps mitigate the impact of random variations in the data and offers a better estimate of the model's generalization ability.



2. Does it give more accurate estimate of test accuracy?
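Averaging validation accuracy across multiple splits makes the estimate more reliable by reducing its variance, but it does not by itself bring the estimate closer to the test accuracy, because averaging cannot remove a systematic bias. In this lab, the average over 10 splits was about 33.6%, while the test accuracy was about 34.9%; the two single splits gave about 34.1% and 34.0%. One likely reason for the gap is that each validation model is trained on only about 75% of the training data, whereas the model evaluated on the test set uses all of it. Here's a closer look: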

Validation Accuracy vs. Test Accuracy

Validation Accuracy: During cross-validation, the accuracy you compute is on the validation set (the held-out subset used for evaluation during training). This is used to estimate how well the model is likely to perform on new, unseen data.

Test Accuracy: The test accuracy refers to the performance of the model on a completely separate test set that the model has never seen before (i.e., after training and validation are complete). This is the final metric you care about in terms of how well the model generalizes to real-world data.


Why Cross-Validation Improves the Estimate of Test Accuracy:

1. Reduces Overfitting to the Validation Set:
Cross-validation helps reduce the risk of overfitting to a specific validation set by training the model multiple times on different subsets of the data. This makes the evaluation less dependent on the choice of a single validation split.


2. Provides a More Generalizable Estimate:
By averaging the performance across multiple validation splits, you are getting a more robust estimate of how the model might perform on new data, as it has been trained on different parts of the dataset each time. This makes the results more reliable than a single validation score.


3. Mimics Real-World Testing:
Since each fold in cross-validation is a different subset of the data, the averaged validation accuracy gives a better idea of how the model will generalize across different data, similar to how it will perform on real-world, unseen test data. This is more indicative of the model's generalization ability than a single split.



Does Cross-Validation Give an Accurate Test Accuracy Estimate?

No Direct Substitution for Test Accuracy:
Cross-validation gives a good estimate of how well the model will perform, but it does not directly estimate test accuracy. The actual test accuracy, obtained from a separate test set that the model has never seen before, is the true measure of generalization.

Overfitting Concern:
While cross-validation helps reduce overfitting to the training set, there is still a potential for overfitting the model to the validation set used during cross-validation. The test accuracy will give the final verdict on the model’s generalization performance, which is the most reliable indicator.


How Accurate is the Validation Accuracy Estimate?

1. When Cross-Validation is Done Right:
If you're using k-fold cross-validation (say, k=5 or k=10), the result is usually a very good approximation of test accuracy because:

The model has been validated on different subsets of data.

The model is likely to encounter similar data distributions in real-world use, so the validation accuracy reflects this.



2. Bias and Variance in Cross-Validation:
The mean of cross-validation scores is typically a good indicator of how the model will perform, but it can still suffer from:

Bias: If your model is underfitting or overfitting, cross-validation might not fully capture that bias.

Variance: If the data has a lot of variability, or if the dataset is very small, cross-validation results might still show some variance.



3. Estimation of Test Accuracy:
Cross-validation does a better job of estimating test accuracy than using a single train/test split, but it is still an estimate. The true test accuracy depends on how well the final model generalizes to the specific test set that is not involved in any part of the cross-validation process.




---

Can Cross-Validation Be Misleading?

Yes, cross-validation could give an inaccurate estimate of test accuracy in certain situations:

If the model has already been tuned on the validation data during the cross-validation process, you might get overly optimistic validation results, and these might not match the test accuracy.

If the dataset is small and the splits do not adequately represent the variability in the data, cross-validation results could be more variable.

If there is data leakage: If information from the test set has influenced the training process (even indirectly), cross-validation results could give an inflated estimate of the model's generalization ability.



---

How to Get the Best Estimate of Test Accuracy?

Use Cross-Validation During Model Selection:
Cross-validation is particularly useful when you're selecting models or tuning hyperparameters. It helps you choose the model that generalizes the best, without overfitting to a particular train/test split.

Hold Out a Final Test Set:
After model selection (using cross-validation for training and validation), it's essential to test the final model on a separate test set that was not used in the cross-validation or any part of the training process. This final test set gives the true test accuracy.
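
The two steps above (select with validation splits, then report once on a held-out test set) can be sketched as follows. This is a minimal illustration on synthetic two-class data; the data, the helper `knn`, the candidate values of k, and the split sizes are all assumptions made for the example.

```python
import numpy as np

m_rng = np.random.default_rng(seed=0)

# Synthetic two-class data (illustrative only)
X = np.vstack([m_rng.normal(0.0, 1.0, size=(150, 2)),
               m_rng.normal(1.5, 1.0, size=(150, 2))])
y = np.array([0] * 150 + [1] * 150)

def knn(traindata, trainlabel, testdata, k):
    diff = testdata[:, None, :] - traindata[None, :, :]
    nearest = np.argsort((diff * diff).sum(axis=2), axis=1)[:, :k]
    # Majority vote for binary 0/1 labels (k is odd, so there are no ties)
    return (trainlabel[nearest].mean(axis=1) > 0.5).astype(int)

# 1. Set the test set aside once; it is not touched during selection
order = m_rng.permutation(len(y))
test_idx, dev_idx = order[:60], order[60:]

# 2. Choose k by average validation accuracy over several random splits
candidates = [1, 3, 5, 7]
val_scores = {}
for k in candidates:
    accs = []
    for _ in range(10):
        mask = m_rng.random(len(dev_idx)) < 0.75
        tr, va = dev_idx[mask], dev_idx[~mask]
        accs.append(np.mean(knn(X[tr], y[tr], X[va], k) == y[va]))
    val_scores[k] = float(np.mean(accs))
best_k = max(val_scores, key=val_scores.get)

# 3. Report test accuracy once, with the chosen k, training on all non-test data
test_acc = np.mean(knn(X[dev_idx], y[dev_idx], X[test_idx], best_k) == y[test_idx])
print(best_k, val_scores[best_k], test_acc)
```

Because `best_k` was chosen to maximise the validation score, that score tends to be slightly optimistic; the test accuracy is the number to report.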



---

Conclusion

While cross-validation provides a more reliable estimate of test accuracy compared to a single train/test split, it is still just an approximation. The real test accuracy can only be obtained after training the model and evaluating it on a completely separate, unseen test set.

Cross-validation helps by reducing the variance of the performance estimate, providing a better approximation of how the model will perform in practice. However, to obtain the true generalization performance, you still need to test the model on a separate test set after the cross-validation phase.



3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
The number of iterations (or folds in cross-validation, or repeats in repeated cross-validation) can influence the reliability and stability of the performance estimate, but the effect depends on the context and how it's applied. Here's a detailed breakdown of the effect of increasing iterations on the estimate of model performance:

1. Effect of Increasing Number of Folds in Cross-Validation (e.g., k-Fold Cross-Validation)

In k-fold cross-validation, the number of iterations refers to the number of folds (k); in the `AverageAccuracy` function above, it is instead the number of random splits. Typically, k = 5 or k = 10 is used, but increasing or decreasing k changes the nature of the estimation.

Benefits of Increasing k (More Folds):

More Training Data: When you increase the number of folds (e.g., from k=5 to k=10), each fold gets a slightly smaller validation set, but each training set is larger. This can be helpful because the model has access to more data during training, which can help in improving generalization.

Better Estimate of Generalization: With larger k values (such as k=10), each model is trained on a larger portion of the data, so it behaves more like the final model trained on the full dataset, and the averaged validation accuracy is a less pessimistic estimate of performance on unseen data.

Lower Bias, but Not Necessarily Lower Variance: Using more folds reduces the bias that comes from training on less data than the final model will see. The variance of the estimate does not necessarily shrink, though: each validation fold is smaller and the training sets overlap more, and in the extreme case of leave-one-out the estimate can have high variance.


Downsides of Increasing k (More Folds):

Computational Cost: Increasing the number of folds (k) also increases the computational cost because you have to train the model more times. For example, with k=5, the model is trained 5 times; with k=10, the model is trained 10 times.

Diminishing Returns: After a certain point (usually around k=10), increasing the number of folds yields diminishing returns in terms of improved estimate stability. Larger k values only marginally improve the estimate, and the computational cost might not justify the small improvement.


Optimal Number of Folds:

Generally, k=5 or k=10 is a good balance between training time and stability of the estimate. In practice, increasing the number of folds beyond 10 does not drastically improve the results, but it can be useful if you have a smaller dataset and need a more accurate estimate of performance.



---

2. Effect of Increasing the Number of Repeats (in Repeated Cross-Validation)

In repeated cross-validation, the model is trained and validated multiple times, with each repeat using different random splits of the data.

Benefits of Increasing the Number of Repeats:

More Accurate Estimate: Repeating the cross-validation process multiple times helps reduce the variance in the performance estimate, especially when the dataset is small or highly variable. Averaging over many repetitions provides a more reliable estimate of the model's generalization ability.

More Robust Evaluation: If a single run of cross-validation gives an unusually high or low score due to the random train-test split, repeating the cross-validation process helps smooth out these fluctuations, leading to more consistent and stable results.


Downsides of Increasing the Number of Repeats:

Increased Computation: The main downside of increasing the number of repeats is that it significantly increases the computational cost because you're running the cross-validation process multiple times, each with different splits of the data.

Diminishing Returns: After a certain number of repeats (e.g., 5-10 repetitions), you start to see diminishing returns in terms of accuracy improvement. If the variance in the model's performance is low across the first few repeats, adding more repeats will provide less additional insight into the model's performance.


Optimal Number of Repeats:

A typical choice for repeated cross-validation is to repeat the cross-validation process 3-10 times, depending on the dataset size and the computational resources available. Increasing beyond this is usually unnecessary unless you are working with a very small or highly noisy dataset.



---

3. Effect of Number of Iterations on Performance Estimates

Improved Stability:

Increasing the number of iterations (whether by increasing folds or repeats) typically results in a more stable and reliable estimate of the model's performance. The more iterations you run, the more representative your validation performance becomes because the model is tested across a wider variety of data splits.

Lower Variance:

If you use fewer iterations (such as with a single train-test split), the model evaluation could be highly variable based on the specific subset of data used for testing. With more iterations, the effect of any individual split is minimized, leading to more consistent results.


Better Reflection of Generalization:

More iterations mean that the model is trained on different data subsets, making the validation performance a better estimate of how the model will perform in real-world situations, on new unseen data.



---

When Does More Iterations Not Help?

Sufficient Data: If your dataset is very large, a smaller number of iterations (such as 5-fold cross-validation) may already provide a reliable estimate, and additional folds or repeats may yield diminishing returns. The model already has access to plenty of diverse training data, and the variance in performance will likely be low.

Computational Constraints: More iterations require more computational resources. If you’re working with large datasets or complex models, running more iterations might be computationally prohibitive, and the additional gains in estimate accuracy might not justify the cost.

Well-Defined Model and Data: If your model and data are relatively simple, and you have a good understanding of their behavior, you might not need a large number of iterations to get an accurate estimate.



---

Conclusion

Yes, more iterations (either through more folds or repeats) generally lead to a more consistent and reliable estimate of model performance, especially when there is variability in the data or when you have a small dataset.

However, after a certain point, the benefit of additional iterations becomes marginal and the computational cost increases.

k=5 or k=10 folds for cross-validation, or 3-10 repeats in repeated cross-validation, are often sufficient to get a good estimate of the model's generalization ability; beyond that, extra iterations mostly add computation.


If you want a more precise estimate and are willing to invest the computational time, increasing the number of iterations (folds or repeats) can help, but for most practical applications, a reasonable number of iterations should be enough to get reliable performance estimates.




4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?

Yes, increasing the number of iterations can help deal with a very small training or validation dataset, but it's not a perfect solution. Let's break down how it can help and what limitations exist when using small datasets:

How Increasing Iterations Can Help with Small Datasets:

1. Better Use of Available Data:

With small datasets, a single train-test split might not provide a representative sample of the data, which can lead to high variance in the performance estimate. By increasing the number of iterations (e.g., through more folds in cross-validation or more repeats in repeated cross-validation), you ensure that the model is tested on different subsets of the data, which helps to maximize the use of the available data and improve the robustness of the performance estimate.



2. Stabilizing Performance Estimates:

For a small dataset, any single test split could be unrepresentative of the overall data, leading to misleading evaluation metrics. Increasing the number of iterations helps to average out the variance caused by random splits, resulting in more stable and consistent estimates of the model's performance. It does not remove bias that comes from the dataset itself being small or unrepresentative.

Repeated cross-validation (multiple repetitions of k-fold cross-validation) can be particularly useful, as it ensures that the model has been tested on different random partitions of the data, providing a more reliable estimate of performance across a small sample.



3. Detecting Overfitting:

With very small training sets, models can easily overfit the data. Using multiple splits does not by itself make the model overfit less, but it makes overfitting easier to detect: the model is trained and evaluated on many different training and validation sets, so a consistent gap between training and validation performance shows up across splits instead of being hidden by one lucky split. That information can then guide choices (a simpler model, stronger regularization) that do improve generalization.



4. Stratified Cross-Validation:

In cases of small datasets, especially with imbalanced classes, stratified cross-validation is useful. It ensures that each fold in cross-validation has the same proportion of each class as the entire dataset. This prevents situations where a fold might contain an unbalanced sample of classes, which could bias the results.
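
The sketch below illustrates stratification with scikit-learn's `StratifiedKFold` on a toy label vector with a 90/10 class imbalance. The labels and the placeholder feature matrix are made up for illustration; only the split logic matters here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 samples of class 0, 10 of class 1 (an assumption).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; the split depends only on y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

minority_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    n_minority = int(y[val_idx].sum())
    minority_counts.append(n_minority)
    print(f"fold {fold}: {len(val_idx)} validation samples, {n_minority} from class 1")
```

Each validation fold keeps the 10% minority share of the full dataset. A plain `KFold` with shuffling gives no such guarantee and can produce folds with very few minority samples.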




Limitations of Using More Iterations on Small Datasets:

1. Increased Computational Cost:

While increasing the number of folds or repeats helps stabilize performance estimates, it also increases computation, since every extra fold or repeat means another model fit. With very small datasets each fit is cheap, so cost is rarely the binding constraint, but beyond a certain point the extra fits no longer make the estimate noticeably more reliable.



2. Diminishing Returns:

If the dataset is extremely small, even with more folds or repeats, the variance in the results may still be high because you're still working with limited data. Adding more folds (e.g., going from k=5 to k=10) may not significantly improve the stability of the estimate, and you might reach a point of diminishing returns.



3. Overfitting to Small Data:

Even with multiple folds or repetitions, a very small dataset might still result in overfitting if the model has too many parameters or is too complex relative to the size of the dataset. In these cases, even cross-validation won't be able to prevent the model from memorizing the few examples it has access to, leading to poor generalization.



4. Limited Representation of the Data:

With extremely small datasets, there's always the issue that even with many iterations, the folds or repeats may still not capture the full diversity of the data. The small number of examples might lead to results that aren't fully reflective of how the model will perform on truly unseen data.
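
To make these limitations concrete, here is a rough sketch on synthetic data. It draws ten different 30-sample datasets from the same large pool and evaluates each with 5-fold cross-validation repeated 10 times. The pool, the sample size of 30, and the 1-nearest-neighbour model are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# A large synthetic pool standing in for the underlying population (an assumption).
X_pool, y_pool = make_classification(n_samples=5000, n_features=8, random_state=1)
rng = np.random.default_rng(seed=1)

# 5 folds x 10 repeats = 50 fits per small dataset.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)

estimates = []
for _ in range(10):
    # Draw a different tiny dataset of 30 samples from the same pool.
    idx = rng.choice(len(y_pool), size=30, replace=False)
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=1), X_pool[idx], y_pool[idx], cv=cv
    )
    estimates.append(float(scores.mean()))

print(f"estimates range from {min(estimates):.2f} to {max(estimates):.2f}")
print(f"std across the ten small datasets: {np.std(estimates):.3f}")
```

Each individual estimate is already averaged over 50 fits, so whatever spread remains across the ten datasets comes from which 30 samples happened to be collected. More repeats on a single small dataset cannot remove that part of the uncertainty.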




Alternative Solutions for Small Datasets

While increasing the number of iterations helps with small datasets, here are some other strategies that can improve model performance and estimation reliability when data is limited:

1. Data Augmentation:

For certain types of data (especially images, text, and time-series), you can use data augmentation techniques to artificially increase the size of the dataset by generating new, slightly modified examples. This can help the model generalize better and reduce overfitting.



2. Transfer Learning:

If you are working with very small datasets, transfer learning can be highly effective. It involves using a model that has been pre-trained on a larger dataset and fine-tuning it on your small dataset. This approach leverages knowledge from a larger corpus and helps improve generalization without requiring a large dataset.



3. Regularization Techniques:

When working with small datasets, it's crucial to use regularization techniques (like L1/L2 regularization, dropout, etc.) to prevent overfitting. Regularization can help constrain the model's complexity, making it more robust when working with limited data.



4. Synthetic Data Generation:

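If augmentation techniques are not possible, consider methods like SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority-class examples for imbalanced classification problems. Generate the synthetic examples from the training folds only, so that nothing derived from validation points leaks into training.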



5. Use of Simpler Models:

Complex models with many parameters (e.g., deep neural networks) require a lot of data to generalize well. With small datasets, it might be better to use simpler models (like linear or logistic regression, shallow decision trees, or small random forests) that are less likely to overfit. Note that an unpruned decision tree can still memorize a small dataset, so limiting its depth matters.



6. Bootstrapping:

Bootstrapping is another resampling technique that can help in cases of small datasets. It involves creating multiple subsets of the data by sampling with replacement and training the model on those subsets. This is especially useful for estimating model uncertainty when data is limited.
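
A minimal sketch of bootstrapping for a small dataset follows. It trains on samples drawn with replacement and scores each model on the points that were left out (the "out-of-bag" points). The synthetic 60-sample dataset, the 3-nearest-neighbour model, and the choice of 200 resamples are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=42)

# Small synthetic dataset (an assumption; replace with your own X, y).
X, y = make_classification(n_samples=60, n_features=8, random_state=42)
n = len(y)

oob_scores = []
for _ in range(200):
    boot_idx = rng.integers(0, n, size=n)   # sample n indices with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False              # True only for points never drawn
    if not oob_mask.any():
        continue                            # no held-out points in this resample
    model = KNeighborsClassifier(n_neighbors=3).fit(X[boot_idx], y[boot_idx])
    oob_scores.append(model.score(X[oob_mask], y[oob_mask]))

low, high = np.percentile(oob_scores, [2.5, 97.5])
print(f"mean out-of-bag accuracy: {np.mean(oob_scores):.3f}")
print(f"central 95% of the bootstrap scores: [{low:.3f}, {high:.3f}]")
```

The width of the printed interval gives a rough sense of how uncertain the accuracy estimate is. With only 60 samples it is usually wide, which is itself useful information.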





---

Conclusion:

Yes, increasing the number of iterations (whether through more folds in cross-validation or more repeats) can help deal with a small training or validation dataset by providing more stable performance estimates, making fuller use of the limited data, and making overfitting easier to detect.


However, the benefits have diminishing returns as the dataset size becomes extremely small, and it’s still important to consider other strategies like data augmentation, transfer learning, regularization, or simpler models to improve performance and generalization. If you have very limited data, relying solely on increasing iterations might not fully solve the problem, but it will still provide more reliable estimates compared to using a single train-test split.

> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.