<a href="https://colab.research.google.com/github/lavanya950/FMML-LABS-MAIN/blob/main/FMML_M1L2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income and the median house age of the block group. The task is to predict the median house value of each block group. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [None]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [None]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Given below is the list of target values. These are the median house values for the samples described by the 8 input features, and they are continuous. We should use regression models to predict these values, but we will start with a simple classification model for the sake of simplicity. To do so, we convert the values to integers with `astype(int)`, which truncates the fractional part (for example, 4.526 becomes 4), and use a classification model to predict the resulting integer class.

In [None]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
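
The conversion above uses `astype(int)`, which drops the fractional part rather than rounding to the nearest integer. A small sketch contrasting the two, using three of the target values printed above (the variable names are illustrative only):

```python
import numpy as np

# Three of the original target values printed above
values = np.array([4.526, 3.585, 0.923])

# astype(int) drops the fractional part (truncation toward zero)
truncated = values.astype(int)

# Rounding first gives the nearest integer instead
rounded = np.round(values).astype(int)

print("Truncated:", truncated)  # [4 3 0]
print("Rounded:  ", rounded)    # [5 4 1]
```

The lab uses the truncated version, so class 4 means a house value between 4.0 and 5.0 (in units of $100,000).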


One of the simplest models for classification is the K-Nearest Neighbors model. We will use it with K = 1 to predict the integer house value class, and we will use the accuracy metric to evaluate it.

In [None]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # find the difference between features. Numpy automatically takes care of the size here
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
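
The loop in `NN` calls `NN1` once per test point, which can be slow on the full dataset. Below is a minimal sketch of a faster variant that applies the same nearest-neighbour rule to a chunk of test points at a time. The name `NN_chunked` and the `chunk_size` parameter are assumptions made for illustration and are not part of the lab. Because it expands the squared distance algebraically, floating-point rounding could, in rare near-tie cases, select a different neighbour than `NN1`.

```python
import numpy as np

def NN_chunked(traindata, trainlabel, testdata, chunk_size=512):
    """Hypothetical faster variant of NN: same 1-nearest-neighbour rule,
    computed for a chunk of test points at a time."""
    train_sq = (traindata ** 2).sum(axis=1)  # squared norm of each train point, shape (n,)
    preds = []
    for start in range(0, len(testdata), chunk_size):
        block = testdata[start:start + chunk_size]  # shape (b, d)
        # ||x - t||^2 = ||x||^2 - 2 x.t + ||t||^2 for all pairs at once, shape (b, n)
        dist = (
            (block ** 2).sum(axis=1)[:, None]
            - 2.0 * block @ traindata.T
            + train_sq[None, :]
        )
        preds.append(trainlabel[np.argmin(dist, axis=1)])
    if not preds:  # no test points at all
        return np.array([], dtype=trainlabel.dtype)
    return np.concatenate(preds)

# Toy check with three well-separated training points
train = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
labels = np.array([0, 1, 2])
queries = np.array([[1.0, 1.0], [9.0, 8.0], [1.0, 9.0]])
print(NN_chunked(train, labels, queries))  # [0 1 2]
```

A smaller `chunk_size` lowers peak memory use at the cost of more loop iterations.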

We will also define a 'random classifier', which assigns a randomly chosen label to each sample.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel
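
Because each guess is drawn uniformly from K classes, it matches the true label with probability 1/K no matter how imbalanced the labels are. The simulation below is a sketch of that claim under made-up class proportions; `rng_demo`, `class_probs`, and the sample count are assumptions for illustration, not values from the dataset.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator so the lab's rng is untouched

n_samples, n_classes = 200_000, 6
# Deliberately imbalanced "true" labels (made-up proportions)
class_probs = np.array([0.10, 0.40, 0.25, 0.12, 0.08, 0.05])
true_labels = rng_demo.choice(n_classes, size=n_samples, p=class_probs)

# Uniform random guesses, drawn the same way as in RandomClassifier
guesses = rng_demo.integers(low=0, high=n_classes, size=n_samples)

acc = (true_labels == guesses).mean()
print("Simulated accuracy:", acc)
print("1 / n_classes:     ", 1 / n_classes)
```

The two printed numbers should be close, differing only by sampling noise.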

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [None]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
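
As a quick sanity check, here is a small sketch with made-up labels, computing the same ratio by hand and with scikit-learn's `accuracy_score`. The arrays `gt` and `pred` are toy values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

gt = np.array([0, 1, 2, 2])    # toy ground-truth labels
pred = np.array([0, 1, 1, 2])  # toy predictions: 3 of 4 are correct

manual = (gt == pred).sum() / len(gt)  # same computation as Accuracy above
library = accuracy_score(gt, pred)     # scikit-learn's equivalent

print(manual, library)  # 0.75 0.75
```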

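Let us make a function to split the dataset randomly. Each sample is assigned to the first split with probability `percent` (a fraction between 0 and 1), so the split sizes match the desired ratio only approximately. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.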

In [None]:
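def split(data, label, percent):
    """
    This function randomly splits the data and labels into two parts

    data: numpy array of shape (n,d)
    label: numpy array of shape (n,)
    percent: float between 0 and 1, the probability that a sample goes into the first split

    returns: split1data, split1label, split2data, split2label
    """
    # generate a random number in [0, 1) for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent  # boolean mask: samples in the first split
    split2 = rnd >= percent  # boolean mask: the remaining samples

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label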

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %
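
The test fraction printed above is 20.08% rather than exactly 20% because `split` decides each sample independently at random. If exact sizes matter, one option is to shuffle the indices and cut at a fixed position. The sketch below assumes a helper named `exact_split`, which is hypothetical and not used elsewhere in this lab.

```python
import numpy as np

def exact_split(data, label, fraction, rng):
    """Hypothetical variant of split(): the first split gets exactly
    round(fraction * n) samples, chosen uniformly at random."""
    n = len(label)
    order = rng.permutation(n)  # random ordering of the sample indices
    n_first = int(round(fraction * n))
    first, second = order[:n_first], order[n_first:]
    return data[first], label[first], data[second], label[second]

rng_demo = np.random.default_rng(0)  # separate generator for this illustration
X = np.arange(40).reshape(20, 2)     # 20 placeholder samples
y = np.arange(20)
Xa, ya, Xb, yb = exact_split(X, y, 0.2, rng_demo)
print(len(ya), len(yb))  # 4 16
```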


## Experiments with splits

Let us reserve some of our train data (about 25%) as a validation set.

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is 100% because every training sample is its own nearest neighbour (at distance zero). The accuracy of the random classifier is close to 1/(number of classes), which is 0.1666 in our case since there are 6 classes. This is because the random classifier randomly assigns a label to each sample and the probability of assigning the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Still, the validation accuracy of nearest neighbour is about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split, because it depends on which samples happen to fall in the train and validation sets. We can get a better estimate of the accuracy by using cross-validation.

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy on the validation set affected if we increase the percentage of the validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.

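A possible starting point for the exercise above is sketched here on synthetic stand-in data, so that it runs on its own. `make_classification`, `train_test_split`, and the chosen sizes are assumptions for illustration; in the lab you would pass `alltraindata`, `alltrainlabel`, `testdata`, and `testlabel` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 8 features and 3 classes (not the housing data)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k)  # k nearest neighbours, majority vote
    model.fit(X_train, y_train)
    results[k] = accuracy_score(y_test, model.predict(X_test))
    print(f"{k}-NN test accuracy: {results[k] * 100:.2f} %")
```

Which value of k does better depends on the data, so the comparison on the housing dataset is left for you to run.
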
## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as an estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [None]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1 which is the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [None]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %
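
The average alone hides how much the individual splits disagree. Below is a minimal sketch of a variant that also reports the standard deviation across splits. The names `accuracy_stats`, `nn_predict`, and `rng_demo`, as well as the synthetic two-cluster data, are assumptions for illustration so the block runs on its own.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator for this illustration

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest-neighbour, same rule as NN above
    return np.array(
        [trainlabel[np.argmin(((traindata - q) ** 2).sum(1))] for q in testdata]
    )

def accuracy_stats(alldata, alllabel, splitpercent, iterations):
    """Hypothetical variant of AverageAccuracy that also reports the spread."""
    accs = []
    for _ in range(iterations):
        mask = rng_demo.random(len(alllabel)) < splitpercent  # same rule as split()
        pred = nn_predict(alldata[mask], alllabel[mask], alldata[~mask])
        accs.append((pred == alllabel[~mask]).mean())
    accs = np.array(accs)
    return accs.mean(), accs.std(ddof=1)  # mean and sample standard deviation

# Small synthetic stand-in: two noisy 2-D clusters of 150 points each
X = np.vstack([rng_demo.normal(0, 1, (150, 2)), rng_demo.normal(2, 1, (150, 2))])
y = np.array([0] * 150 + [1] * 150)
mean_acc, std_acc = accuracy_stats(X, y, 0.75, 10)
print(f"Validation accuracy: {mean_acc * 100:.2f} % +/- {std_acc * 100:.2f} %")
```

A small standard deviation suggests that a single split would already have given a similar number; a large one suggests that averaging matters.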


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?


Yes, **averaging the validation accuracy across multiple splits** does typically give **more consistent and reliable results** compared to using a single split. This practice is part of a technique called **cross-validation**, and it has several key benefits:

### Why Averaging Across Multiple Splits Helps:
1. **Reduces Variance**: Using a single train-test split can lead to results that are highly dependent on the specific division of the data. For example, if the split happens to favor certain classes or patterns, the results could be misleading. By averaging performance across multiple splits, you reduce the **variance** caused by a specific random partitioning of the data. This provides a more robust estimate of your model's true performance.

2. **Uses More of the Data for Evaluation**: Across multiple splits, the model is trained and validated on different subsets of the data, so more of the data points take a turn in the validation set. Note that this does not make the model itself generalize better; it makes the *estimate* of its performance less dependent on one particular subset. As a result, the averaged validation accuracy is typically a more accurate reflection of the model's real-world performance.

3. **Helps Detect Overfitting**: When you use a single validation split, there’s a risk that the model might overfit the validation set, particularly if the data split happens to be unrepresentative of the overall data distribution. Multiple splits can reveal if the model is consistently performing well or if it is overfitting to certain parts of the data. A significant drop in performance on some folds of cross-validation could indicate that the model is overfitting to specific data characteristics.

4. **Evaluates Stability**: Averaging over multiple splits gives you an indication of how stable your model's performance is. If your model has a high variance in performance (i.e., it performs very differently across different splits), this could be a sign of instability, and further model tuning or feature engineering might be needed. Consistent performance across splits is a good indicator that your model is robust.

### How Cross-Validation Works:
- **K-Fold Cross-Validation**: One of the most common methods for averaging validation performance is **K-fold cross-validation**. Here’s how it works:
  1. The dataset is split into **K** equal-sized (or nearly equal-sized) subsets, known as **folds**.
  2. The model is trained on **K-1 folds** (the training set) and validated on the remaining fold (the validation set).
  3. This process is repeated **K times**, with each fold serving as the validation set exactly once.
  4. After all K rounds, the performance (e.g., validation accuracy) is averaged across all K iterations, giving a more reliable estimate of the model's performance.

  For example, in **5-fold cross-validation**, the data is split into 5 parts. The model is trained on 4 parts and validated on the 1 remaining part. This is repeated 5 times, each time with a different part serving as the validation set. Finally, the accuracies from all 5 runs are averaged.

- **Stratified K-Fold**: This is a variant of K-fold cross-validation that ensures that each fold has a proportional representation of each class (especially important for imbalanced datasets).
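
To illustrate the stratified variant, the sketch below uses scikit-learn's `StratifiedKFold` on made-up imbalanced labels (90 samples of one class, 10 of another) and counts the classes in each validation fold. The data is a placeholder chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced placeholder labels: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for this illustration

skf = StratifiedKFold(n_splits=5)
val_counts = []
for train_idx, val_idx in skf.split(X, y):
    counts = np.bincount(y[val_idx], minlength=2)  # samples per class in this fold
    val_counts.append(tuple(int(c) for c in counts))
    print("Validation fold class counts:", counts)
```

Each validation fold keeps the 9:1 class ratio of the full label set.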

### Example:
Imagine you have a dataset of 1000 samples and you're using 5-fold cross-validation. The steps would look like this:

1. **Split** the data into 5 folds, so each fold contains 200 samples.
2. Train the model on 4 of the 5 folds (800 samples) and test on the remaining fold (200 samples).
3. Repeat this process for all 5 folds, each time using a different fold for validation.
4. Calculate the accuracy for each of the 5 runs and then take the **average** of these accuracies.

This process will give you a more stable estimate of model performance compared to a single train-test split.
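
The fold sizes in this example can be checked with scikit-learn's `KFold`. The sketch below uses 1000 placeholder samples and only inspects the index splits; no model is trained.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # 1000 placeholder samples with one feature each

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train on {len(train_idx)} samples, validate on {len(val_idx)}")
```

Every fold trains on 800 samples and validates on the remaining 200, and each sample appears in exactly one validation fold.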

### Benefits in Practice:
- **More Reliable Performance Estimate**: Averaging reduces the effect of an unlucky split, that is, **sampling variability** where a particular validation set is not representative of the overall data. It does not, however, fix **data leakage** or a dataset that is itself biased; those need to be addressed separately.
- **Better Comparison Between Models**: When comparing different models or hyperparameters, using cross-validation ensures that the evaluation metric (e.g., accuracy) is based on multiple tests, giving you a more fair and balanced comparison.
- **Model Selection and Hyperparameter Tuning**: If you're tuning hyperparameters or selecting between models, cross-validation allows you to evaluate multiple configurations on diverse splits, which helps avoid the risk of overfitting to a particular validation set.

### Consistency Across Splits:
While averaging over multiple splits provides a more consistent and generalizable measure of model performance, it's also important to note that:
- **Cross-validation doesn't eliminate all sources of variability**: While cross-validation reduces variance caused by a single split, it doesn't guarantee that your model will perform identically in production or on future unseen data. To account for this, you still need to monitor performance on a **test set** that is separate from the training and validation data.
- **Computational Overhead**: Cross-validation is computationally more expensive than a single split because the model is trained and evaluated multiple times. However, with modern computational power, this is often manageable.

### Conclusion:
Yes, **averaging the validation accuracy across multiple splits** (typically through cross-validation) does provide more consistent and stable results. It reduces the variance caused by a single data split and offers a better estimate of how the model is likely to perform on unseen data. Cross-validation is a widely adopted practice in machine learning for model evaluation and comparison.

2. Does it give more accurate estimate of test accuracy?


Averaging validation accuracy across multiple splits **does not directly give you the test accuracy** of your model, but it **provides a more accurate estimate** of how well your model will perform on **unseen data** (i.e., data that was not used during training). Here’s how it works in the context of **cross-validation** and its relationship to test accuracy:

### 1. **Validation vs. Test Accuracy**
   - **Validation Accuracy**: This refers to how well your model performs on the validation set, which is a subset of the data that was **not used for training** in a given fold of cross-validation. The goal of validation accuracy is to evaluate the model's generalization ability during the training process (or across multiple training runs in cross-validation).
   - **Test Accuracy**: This refers to the performance of your model on an entirely **separate test set** that was not used during either training or validation. The test set is typically only used after the model has been trained and validated to assess its final performance.

### 2. **Why Averaging Validation Accuracy Gives a Better Estimate of Model Performance**
Cross-validation gives you a more reliable and consistent measure of how well your model will generalize to new data. However, it **does not directly replace the test set accuracy**. Instead, cross-validation provides a more stable **estimate** of performance on unseen data, which can be thought of as an **approximation** of what your model's test accuracy might be. Here’s why:

- **Single Split vs. Multiple Splits**: When you split your data into just one training set and one validation set, the accuracy might be highly dependent on the specific random partition. A single split might not be representative of the model’s performance across the entire dataset. By averaging over multiple splits (folds), you effectively reduce the variance and account for different data distributions, leading to a more robust estimate of model performance.
  
- **Less Dependence on One Subset**: When you average across multiple folds, the model is trained on **different subsets of the data** and evaluated on different parts. Averaging does not change how well the model generalizes, but it makes the estimate less likely to be skewed by a specific set of examples or patterns. As a result, the cross-validation accuracy tends to be a more reliable measure of how the model will perform on new, unseen data compared to a single validation split.

### 3. **Limitations of Cross-Validation in Estimating Test Accuracy**
- **Test Set Still Required**: Cross-validation helps you **estimate** the model's performance, but it does not replace the need for a final test set. Even after cross-validation, you should evaluate your model on a **final holdout test set** to get a true measure of **test accuracy**, as cross-validation does not guarantee that the model will perform exactly the same on unseen data.
  
- **Overfitting to the Validation Set**: Although cross-validation is useful, it still uses part of the data for evaluation during the training process. If you perform **hyperparameter tuning** or **model selection** based on validation accuracy, you risk overfitting to the validation set, and your cross-validation results might not reflect the model's true generalization power. To avoid this, it’s crucial to have a completely separate **test set** that is never used during training or validation.

### 4. **Why Cross-Validation Might Give a Better Estimate of Test Accuracy than a Single Split**
- **Reduced Variance**: A single validation split might be unrepresentative of the entire data distribution. For example, the validation set might contain more outliers, noise, or harder-to-predict examples, leading to overly optimistic or pessimistic validation accuracy. Cross-validation mitigates this risk by averaging over multiple validation sets.
  
- **Better Performance Assessment**: Especially in cases where you have **limited data**, cross-validation helps to maximize the use of the available data by training and validating on multiple subsets. This is particularly useful when you can’t afford to set aside a large test set.

- **Improved Stability**: By evaluating performance on multiple splits, you get a more stable estimate of how your model will perform in general. If your model performs well consistently across folds, it gives you confidence that it will also perform well on unseen test data. If the performance varies significantly across folds, it suggests the model might be unstable, and you need to investigate further (e.g., by adjusting hyperparameters or improving features).

### 5. **Using Cross-Validation Results for Hyperparameter Tuning**
If you use cross-validation to **select the best model** or **tune hyperparameters**, you will be relying on the **validation accuracy** averaged across multiple splits. This process gives a better approximation of the test performance than if you were to tune hyperparameters based on a single validation set. However, after hyperparameter tuning or model selection, you should always evaluate the model on a final **test set** that was held out during all of the cross-validation and hyperparameter tuning processes.

---

### Summary: Does Averaging Validation Accuracy Give More Accurate Estimate of Test Accuracy?

- **Yes, averaging validation accuracy across multiple splits gives a more accurate and stable estimate of your model's generalization performance** compared to using a single validation split.
- **However, cross-validation is still just an estimate** of how your model might perform on the test set. It **cannot replace** an actual test set evaluation, which is the final step to obtain a reliable measure of **test accuracy**.
- **Cross-validation reduces variance** and provides a more **robust estimate** by mitigating the effects of a potentially unrepresentative single validation split.
- You should always evaluate your final model on a **separate, held-out test set** to get the true test accuracy.

In short, averaging validation accuracy across multiple splits improves **the consistency and reliability of the estimate**, but **only the test set** provides an unbiased measure of your model’s **final performance**.


3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?


The number of iterations has a significant impact on the **stability** and **reliability** of the estimate of your model’s performance. In this lab, an iteration is one random train/validation split with a fixed split fraction; in k-fold cross-validation, the closest analogue is the number of **folds**, although changing the number of folds also changes the size of each training set. Generally, increasing the number of iterations can lead to a **better estimate**, but it also comes with trade-offs in terms of **computational cost**. Here's a breakdown of the effects of the number of iterations (folds) on the estimate of model performance:

### 1. **Effect on the Accuracy of the Estimate**
   - **More Iterations = More Robust Estimate**: When you use **more iterations** (or **folds** in cross-validation), you increase the number of times the model is evaluated on different data subsets. This helps to reduce the variability of the performance estimate.
     - In **k-fold cross-validation**, each fold is used for validation once, and the model is trained on the remaining $k - 1$ folds. More folds mean more chances for the model to be tested on different subsets of the data, which leads to:
       - **Reduced variance** in the estimate of model performance, as the result will not depend on a single train-test split.
       - **Smoother performance curves**, especially when you have limited data.
   - **Fewer Iterations**: If you use fewer iterations (e.g., 3-fold cross-validation), the variance of the estimate is higher, and you may get a performance estimate that is more sensitive to the specific random split of the data.

   In summary, **higher iterations (more folds)** tend to provide a **better estimate** of the model's performance because the evaluation is less dependent on the randomness of the data split and the performance is averaged over more folds.

### 2. **Effect on Bias**
   - **Bias Is Unchanged for Repeated Random Splits**: With a fixed split fraction, as in `AverageAccuracy`, the bias of the estimate does not depend on the number of iterations. Bias refers to the **systematic error** between the estimated performance and the true performance on unseen data. Averaging reduces variance but doesn't remove this bias. The bias is determined by factors such as:
     - The size of each training set relative to the full dataset (a model trained on less data tends to score lower, so the estimate is slightly pessimistic).
     - The size and quality of the dataset.
   - **K-Fold Is Different**: In k-fold cross-validation, more folds mean each model is trained on a larger share of the data, which reduces this pessimistic bias.

   For repeated random splits, the primary effect of increasing the number of iterations is to reduce the **variance** of the estimate, not to change its bias.

### 3. **Effect on Generalization (Overfitting)**
   - **The Model Itself Does Not Change**: The number of splits or folds affects only how performance is *measured*; it does not make the final model more or less prone to overfitting. What more splits provide is a better chance of noticing a problem, because a model that only does well on some splits stands out.
   - **Fewer Folds = Smaller Training Sets**: With fewer folds, each model is trained on a smaller share of the data (for example, about 67% with 3 folds versus 90% with 10 folds), so the measured accuracy may understate what a model trained on all the data would achieve.

### 4. **Computational Cost**
   - **More Folds = More Computational Overhead**: While increasing the number of folds generally leads to a more reliable performance estimate, it also increases the amount of computation required. For instance, in **10-fold cross-validation**, the model is trained and validated 10 times. As the number of folds increases, the computation time increases linearly.
     - For example, in a **5-fold cross-validation**, the model is trained and evaluated 5 times. In **10-fold**, it's trained and evaluated 10 times, doubling the computational cost.
   - **Trade-off**: If your dataset is large or the model is computationally expensive to train (e.g., deep learning models), you may find that more folds are prohibitively slow. In such cases, **lower numbers of folds (e.g., 3 or 5)** might be a reasonable compromise between accuracy and computation time.

### 5. **Effect on Stability of Results**
   - **More Iterations = More Stable Results**: As the number of iterations increases, the performance estimate becomes more stable because it accounts for more variations in the data. For example, in a **5-fold cross-validation**, the model is trained on 80% of the data at each step and validated on 20%. In a **10-fold cross-validation**, it is trained on 90% of the data and validated on 10%. More iterations mean more chances for the model to be tested across a wider variety of data points, which reduces the impact of outliers or specific data distributions on the final estimate.

   - **Fewer Iterations = Less Stable Results**: With fewer iterations (e.g., 3 folds), you are essentially relying on fewer data partitions, and the results can be more sensitive to the specific characteristics of the folds. This increases the **variance** of your estimate.
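
The stabilising effect can be made precise under a simplifying assumption, introduced here for illustration only: suppose the accuracies $A_1, \dots, A_m$ from $m$ splits all have the same variance $\sigma^2$ and the same pairwise correlation $\rho$ (they are correlated because the splits reuse the same data). Then the variance of their average is

$$
\operatorname{Var}\left(\frac{1}{m}\sum_{i=1}^{m} A_i\right) = \frac{\sigma^2}{m} + \frac{m-1}{m}\,\rho\,\sigma^2 .
$$

The first term shrinks as $m$ grows, while the second approaches $\rho\sigma^2$. Under this idealisation, more iterations always help, but the gain levels off once the correlated part dominates.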

### 6. **Special Case: Leave-One-Out Cross-Validation (LOOCV)**
   - **Maximizing Iterations**: In **Leave-One-Out Cross-Validation (LOOCV)**, the number of folds equals the number of data points (i.e., each fold contains just one data point for validation, and the rest of the data is used for training). This maximizes the number of iterations, but at the expense of **computational cost**, since LOOCV requires as many models to be trained as there are data points.
   - **LOOCV Benefits**: It can provide an almost **unbiased estimate** of model performance because each model is trained on all but one data point, so the training set is nearly as large as the full dataset. It is particularly useful when you have a very small dataset.
   - **LOOCV Drawbacks**: It is **computationally expensive** and may be prone to **high variance**: the training sets overlap almost completely, so the per-fold errors are highly correlated and averaging them reduces noise less than averaging independent estimates would. Since each validation set is only a single data point, each individual score is also very noisy.
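The properties above can be checked on a tiny example. This is a minimal sketch, assuming a synthetic 30-sample dataset and a 3-nearest-neighbour classifier (both chosen only for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset (an assumption, not the lab's data).
X, y = make_classification(n_samples=30, n_features=8, random_state=0)

loo = LeaveOneOut()
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=loo)

# One model per data point, each validated on the single held-out point.
print("models trained:", loo.get_n_splits(X))
print("distinct per-fold scores:", np.unique(scores))
print("LOOCV accuracy:", scores.mean())
```

Each per-fold score is either 0 or 1 because the validation set is a single point, which is why individual LOOCV scores are noisy even though their mean is a usable estimate.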

### 7. **Empirical Effect on Accuracy**
   - **More Iterations Improve Accuracy Estimates**: Empirically, increasing the number of folds tends to make the accuracy estimates **more reliable** and **less sensitive** to the idiosyncrasies of the dataset.
     - In practice, **10-fold cross-validation** is commonly used because it strikes a good balance between computation time and the robustness of the estimate.
     - **5-fold cross-validation** is often a good compromise when you want a balance between computational cost and reliability of the estimate.

---

### Conclusion: Does Higher Number of Iterations Give a Better Estimate?

- **Yes**, increasing the number of iterations (folds) tends to give a **more reliable** and **consistent estimate** of model performance by reducing the variance of the results.
- **The performance estimate becomes more stable** with more iterations because the model is evaluated on more diverse subsets of the data, which reduces the impact of any single partition or outlier.
- However, **increasing the number of folds also increases computational cost**, which can be a limiting factor, especially with large datasets or complex models.
- While more iterations give more stable and robust estimates, the returns **diminish** after a certain point (usually around **10-fold cross-validation**): beyond it, additional folds add little in terms of accuracy or stability.

In general, **10-fold cross-validation** is widely used as a good balance, but depending on the problem size and the computational cost, you can experiment with other numbers of folds (e.g., 5, 20) to find the optimal trade-off.


4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


Increasing the number of iterations (or folds) in cross-validation **cannot fully compensate for a very small training or validation dataset**, but it can help to some extent in **reducing the variance** in the performance estimates. However, there are important limitations and trade-offs that need to be understood when dealing with small datasets. Let’s explore how increasing the number of folds affects performance with small datasets and other strategies to handle small data effectively.

### 1. **Effect of Increasing Folds with Small Datasets**

- **Limited Data for Training**: In cross-validation, each fold involves training the model on a subset of the data and validating it on a different subset. If the dataset is very small, even **K-fold cross-validation** may not provide a good estimate of performance because:
  - In **K-fold cross-validation**, each fold uses \( \frac{1}{K} \) of the data for validation and the remaining \( \frac{K-1}{K} \) for training. With a small dataset, even the largest portion used for training might not contain enough information for the model to learn well.
  - As you increase the number of folds, the training set for each fold becomes larger and the validation set becomes smaller. For example, with **10-fold cross-validation**, each training set contains 90% of the data, and with **20-fold cross-validation**, 95%. When the data is already limited, each validation fold then holds only a handful of points, so the per-fold scores are noisy, leading to **higher variance in performance** estimates.

- **More Folds = More Validation Sets, But Each One Is Smaller**: Increasing the number of folds gives you more distinct validation sets and slightly larger training sets, but each validation set shrinks. For very small datasets, this can result in:
  - **Noisy Per-Fold Scores**: A validation fold with only one or two points yields a score that can swing widely from fold to fold. Conversely, using very few folds (e.g., 2) leaves each model only half of the data, which risks **underfitting** and a pessimistic estimate.
  - **Higher Variance in Performance**: The training sets of different folds are almost identical, so averaging the fold scores removes less noise than the number of folds suggests.

- **Diminishing Returns**: Beyond a certain point, increasing the number of folds (e.g., moving from 5-fold to 10-fold) will result in only marginal improvements in the stability of your estimate. With very small datasets, **leave-one-out cross-validation (LOOCV)**, which uses each data point as its own validation set, may not provide much more value and can actually exacerbate **high variance** in the estimate.
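The training and validation sizes per fold follow directly from the number of folds. A minimal sketch, assuming a hypothetical dataset of only 20 samples and using `np.array_split` as a stand-in for a K-fold splitter:

```python
import numpy as np

n_samples = 20  # hypothetical tiny dataset
indices = np.arange(n_samples)

sizes = {}
for k in (2, 5, 10, 20):
    # Split the indices into k nearly equal folds; one fold is held out per iteration.
    folds = np.array_split(indices, k)
    val_size = len(folds[0])
    sizes[k] = (n_samples - val_size, val_size)
    print(f"K={k:2d}: train={sizes[k][0]}, validation={sizes[k][1]}")
```

With 20 samples, 10 folds leave 2 points per validation set and 20 folds (LOOCV) leave 1, so each per-fold score rests on very little evidence even though the training sets are nearly the full dataset.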

### 2. **Strategies for Dealing with Small Datasets**

Rather than simply increasing the number of iterations, you might want to consider other techniques that are specifically designed for small datasets. Here are a few approaches:

#### a. **Data Augmentation** (for specific data types)
- For certain types of data (such as **images**, **text**, or **time series**), you can use **data augmentation** techniques to artificially increase the size of your training dataset by creating modified versions of the existing data (e.g., rotations, flipping, noise addition, or synthetic data generation).
  - **Example**: In image classification, you can rotate, crop, or add noise to the images to create new examples.
  - **Pros**: This can be a powerful way to increase the effective size of the training set without collecting new data.
  - **Cons**: Augmentation does not apply to every type of data (e.g., it is hard to define label-preserving transformations for tabular data) and may not fully solve the problem if the dataset is very small.
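The image example can be sketched with plain NumPy. The random 28x28 array below is only a stand-in for a real grayscale image, and the noise level of 0.05 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for one 28x28 grayscale image with pixel values in [0, 1].
image = rng.random((28, 28))

augmented = [
    np.fliplr(image),                                           # horizontal flip
    np.rot90(image),                                            # 90-degree rotation
    np.clip(image + rng.normal(0.0, 0.05, image.shape), 0, 1),  # additive Gaussian noise
]
print(len(augmented), [a.shape for a in augmented])
```

Each augmented copy keeps the original label, so one labelled image yields several training examples. Whether a given transformation preserves the label depends on the task (a flipped digit may no longer be a valid digit, for instance).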

#### b. **Transfer Learning** (for deep learning models)
- **Transfer learning** is a technique where you start with a pre-trained model on a larger dataset and **fine-tune** it on your small dataset. This is particularly effective in domains like **computer vision** and **natural language processing**.
  - **Example**: In image classification, instead of training a model from scratch, you can fine-tune a model like **ResNet** or **VGG** that has already been trained on large datasets (such as ImageNet).
  - **Pros**: This allows you to leverage knowledge from large datasets, which helps avoid overfitting on your small dataset.
  - **Cons**: This requires that the pre-trained model is relevant to the problem you're solving, and it may still need enough data to fine-tune effectively.

#### c. **Regularization Techniques**
- **Regularization** (e.g., **L1**, **L2**, **dropout**, etc.) helps prevent overfitting, which is especially critical when working with small datasets. Regularization pushes the model toward simpler solutions by penalizing or restricting overly complex ones, which reduces its tendency to overfit a small dataset.
  - **Example**: If you're training a linear regression model, using **L2 regularization** (Ridge) can help prevent the model from fitting too closely to noise in the data.
  - **Pros**: Helps to improve generalization on small datasets by controlling complexity.
  - **Cons**: You need to carefully tune regularization parameters, as too much regularization can lead to **underfitting**.
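The shrinking effect of L2 regularization can be seen by comparing coefficient sizes. This is a minimal sketch, assuming a synthetic 15-sample regression problem and an arbitrary `alpha=5.0`; in practice `alpha` would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(seed=42)

# Small synthetic problem: 15 samples, 8 features, only the first feature matters.
X = rng.normal(size=(15, 8))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=15)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)  # alpha is an illustrative choice

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
```

Ridge shrinks the coefficient vector relative to ordinary least squares, which limits how strongly the model can react to noise in the seven irrelevant features.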

#### d. **Bootstrapping or Resampling**
- **Bootstrapping** involves creating multiple **random samples** (with replacement) from the original dataset to train the model. Each resample has the same size as the original but a different mix of points. It adds no new information, but training on many resamples shows how sensitive the model is to the data and, when the models are combined, can reduce overfitting.
  - **Example**: In **Bagging** (Bootstrap Aggregating), multiple models are trained on different bootstrap resamples of the data, and their predictions are averaged. This can improve stability and reduce overfitting when data is limited.
  - **Pros**: Provides many alternative training sets from the same data and can be effective when used in ensemble methods.
  - **Cons**: Computationally expensive and may not always be as effective as other techniques like augmentation or transfer learning.
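Sampling with replacement is straightforward with NumPy. A minimal sketch, assuming a hypothetical dataset of 1000 samples represented only by its row indices:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = 1000  # hypothetical dataset size

# Draw n_samples row indices with replacement: one bootstrap resample.
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Rows never drawn are "out-of-bag" and can serve as a validation set.
oob_idx = np.setdiff1d(np.arange(n_samples), boot_idx)
unique_fraction = np.unique(boot_idx).size / n_samples

print("unique rows in resample:", unique_fraction)
print("out-of-bag rows:", oob_idx.size)
```

On average about 63% of the original rows appear in a bootstrap resample (the rest are duplicates), so the resample does not contain more information than the original data.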

#### e. **Synthetic Data Generation** (SMOTE, GANs)
- For certain types of datasets (e.g., tabular data, imbalanced datasets), you can use techniques like **SMOTE** (Synthetic Minority Over-sampling Technique) or **Generative Adversarial Networks (GANs)** to generate synthetic examples.
  - **SMOTE**: SMOTE generates synthetic examples by interpolating in feature space between a minority-class sample and one of its nearest minority-class neighbours.
  - **GANs**: GANs can generate synthetic data that resembles the distribution of the original dataset.
  - **Pros**: Can increase dataset size without needing new data collection.
  - **Cons**: Generating high-quality synthetic data can be challenging and may introduce noise if not done carefully.
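The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_like_samples`, the choice of `k=3`, and the random minority data are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def smote_like_samples(X_minority, n_new, k=3, rng=rng):
    """Create n_new points by interpolating between minority samples and their neighbours."""
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_minority))
        # Distances from sample i to every minority sample; after sorting,
        # position 0 is sample i itself (assuming no duplicate points).
        distances = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic[row] = X_minority[i] + lam * (X_minority[j] - X_minority[i])
    return synthetic

# Hypothetical minority class: 10 points with 2 features.
X_minority = rng.normal(size=(10, 2))
X_new = smote_like_samples(X_minority, n_new=5)
print(X_new.shape)
```

Every synthetic point lies on a line segment between two existing minority points, so it stays inside the region the minority class already occupies. This also means it cannot add information about regions the data never covered.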

#### f. **Ensemble Learning**
- **Ensemble methods** like **Random Forests**, **Gradient Boosting**, or **AdaBoost** can help with small datasets because they combine many learners, which averages out the errors of individual models.
  - **Example**: Random Forests train many trees on bootstrap samples of the data and average them, while boosting methods such as XGBoost add trees sequentially to correct earlier errors. Both make the final model less dependent on any single tree fitted to small or noisy data.
  - **Pros**: Can improve accuracy and robustness on small datasets.
  - **Cons**: More expensive to train than a single model, and boosting in particular can still overfit when data is very scarce.

#### g. **Use Simpler Models**
- For very small datasets, **complex models** like deep neural networks may not be necessary. Simpler models like **linear regression**, **logistic regression**, **k-nearest neighbors (k-NN)**, or **support vector machines (SVMs)** might be more effective because they require fewer data points to perform well.
  - **Example**: For a small dataset with tabular data, a **logistic regression model** with **L2 regularization** may perform better than a complex deep neural network.
  - **Pros**: Simpler models are less prone to overfitting on small datasets.
  - **Cons**: May not capture complex patterns as well as more complex models.

### 3. **Summary of Key Points:**

- **Increasing folds** in cross-validation can help stabilize performance estimates by averaging over multiple data splits, but it cannot fully compensate for the lack of data. With very small datasets, more folds leave only a few points in each validation set, which can mean **noisier**, **higher-variance** performance estimates.
- **Other techniques** like **data augmentation**, **transfer learning**, **regularization**, **bootstrapping**, and **synthetic data generation** are more effective ways of dealing with small datasets.
- **Smaller datasets** often require simpler models and techniques that focus on preventing overfitting rather than just boosting training data size.

In conclusion, while increasing the number of iterations can help reduce variance in your performance estimate, it is generally not a comprehensive solution for handling very small datasets. Instead, you should explore other methods like data augmentation, transfer learning, or regularization that are designed to work with small datasets and improve model generalization.



> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.