<a href="https://colab.research.google.com/github/karishmashaik549/FMML2024/blob/main/FMML_M1L2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. It has 20640 samples (block groups, referred to here as districts), each with 8 attributes such as the median income and the median house age of the district. The task is to predict the median house value of each district. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [3]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [2]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Given below is the list of target values. These are the median house values that we want to predict from the 8 input features, and they are continuous. We should use regression models to predict such values, but for the sake of simplicity we will start with a simple classification model. To do so, we convert the values to integers with `astype(int)`, which drops the fractional part (for example, 4.526 becomes 4) rather than rounding to the nearest integer, and then use a classification model to predict the resulting integer class.

In [4]:
print("Original target values:", dataset.target)

# astype(int) truncates toward zero (4.526 -> 4); it does not round
dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
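
The conversion above relies on `astype(int)`, which truncates instead of rounding. The following minimal sketch, using the first three target values printed above plus 0.923 from the same output, shows how the two conversions differ. The variable names are illustrative only.

```python
import numpy as np

# A few target values taken from the printed output above
values = np.array([4.526, 3.585, 3.521, 0.923])

truncated = values.astype(int)           # drops the fractional part
rounded = np.round(values).astype(int)   # rounds to the nearest integer first

print("truncated:", truncated)  # [4 3 3 0]
print("rounded:  ", rounded)    # [5 4 4 1]
```

Either choice produces a valid set of integer classes; the lab's code uses truncation, so the class `k` contains all districts whose value lies in the interval from `k` up to (but not including) `k + 1`.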


One of the simplest models to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value class with a K value of 1, that is, each query point receives the label of its single closest training point. We will also use the accuracy metric to evaluate the model.

In [5]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # find the difference between features. Numpy automatically takes care of the size here
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
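
As an aside, the same 1-nearest-neighbour rule can be written without a Python loop by letting NumPy broadcasting compute all pairwise distances at once. This is only a sketch: `nn_predict_vectorized` and the toy arrays are hypothetical names introduced here, and the intermediate array has shape `(m, n, d)`, so this form is suitable for small inputs only.

```python
import numpy as np

def nn_predict_vectorized(traindata, trainlabel, testdata):
    # Pairwise differences, shape (m, n, d): every test point against every train point
    diff = testdata[:, None, :] - traindata[None, :, :]
    # Squared Euclidean distances, shape (m, n)
    dist = (diff ** 2).sum(axis=2)
    # For each test point, take the label of the closest training point
    return trainlabel[dist.argmin(axis=1)]

# Toy data: two training points with labels 0 and 1
toy_train = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_labels = np.array([0, 1])
toy_test = np.array([[1.0, 1.0], [9.0, 8.0]])

toy_pred = nn_predict_vectorized(toy_train, toy_labels, toy_test)
print(toy_pred)  # [0 1]
```

On the full housing data the `(m, n, d)` intermediate would be very large, which is one reason the loop-based `NN` above, or a library implementation, is the more practical choice there.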

We will also define a 'random classifier', which randomly assigns a label to each sample.

In [6]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    traindata is not used at all, and trainlabel is used only to find the set of classes; we pass them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data, each drawn uniformly at random from the classes present in trainlabel
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [7]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
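
To make the definition concrete, here is a small worked example on made-up labels (the arrays are purely illustrative). Three of the five predictions match the ground truth, so the accuracy is 3/5.

```python
import numpy as np

toy_gt = np.array([0, 1, 2, 2, 1])    # ground-truth labels (made up)
toy_pred = np.array([0, 2, 2, 2, 0])  # predicted labels (made up)

# Element-wise comparison gives booleans; their mean is the fraction of matches
toy_accuracy = np.mean(toy_gt == toy_pred)
print(toy_accuracy)  # 0.6
```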

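Let us make a function to split the dataset randomly. Each sample is placed in the first split with probability `percent` (a fraction between 0 and 1) and in the second split otherwise, so the split sizes are close to, but not exactly, the requested proportion. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.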

In [8]:
def split(data, label, percent):
    # generate a random number for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [9]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %
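
The test share printed above is 20.08% rather than exactly 20% because `split` draws an independent random number for each sample. If an exact split size is wanted, one possible alternative is to shuffle the indices and cut them at a fixed position. The sketch below assumes this approach; `split_exact` is a hypothetical helper, not part of the lab's code.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    # Shuffle all indices once, then cut at a fixed position
    n = len(label)
    perm = rng.permutation(n)
    n_first = int(round(fraction * n))
    first, second = perm[:n_first], perm[n_first:]
    return data[first], label[first], data[second], label[second]

rng_demo = np.random.default_rng(0)
toy_data = np.arange(20).reshape(10, 2)
toy_label = np.arange(10)

d1, l1, d2, l2 = split_exact(toy_data, toy_label, 0.2, rng_demo)
print(len(l1), len(l2))  # 2 8
```

Both approaches give random, non-overlapping splits; the probabilistic version used in this lab is simpler, while the permutation version fixes the sizes.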


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [10]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [12]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  17.205692108667527 %


For nearest neighbour, the train accuracy is 1 (100%), because every training sample is its own nearest neighbour at distance zero (the only exception would be duplicate points with different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case. This is because the random classifier picks a label uniformly at random for each sample, so the probability of picking the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.

In [13]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.86046511627907 %
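
The random classifier stays near 1/(number of classes) on both the train and validation sets. A short simulation can illustrate why this holds even when the classes are imbalanced. The class proportions below are invented for illustration and are not the proportions of the housing data; six classes are assumed to match the 1/6 figure quoted earlier.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_classes = 6  # assumption: six integer classes, matching the 1/6 figure above

# Invented, deliberately imbalanced class proportions (sum to 1)
proportions = [0.4, 0.3, 0.15, 0.1, 0.04, 0.01]
true_labels = rng_demo.choice(n_classes, size=100_000, p=proportions)

# Uniform random guesses, as in RandomClassifier
guesses = rng_demo.integers(low=0, high=n_classes, size=true_labels.size)

simulated_accuracy = np.mean(true_labels == guesses)
print("simulated accuracy:", simulated_accuracy)
print("1 / n_classes:     ", 1 / n_classes)
```

Whatever the true label is, a uniform guess matches it with probability 1/6, so the class proportions do not change the expected accuracy.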


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split. This is because the validation set is a limited random sample, so the accuracy depends on which samples happen to fall in it. We can get a better estimate of the accuracy by using cross-validation.

In [14]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 33.959990359122685 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [15]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
A. The accuracy on the validation set, which reflects how well the model generalizes to unseen data, is influenced by the size of the validation set, though not always in a straightforward manner.

Increasing the percentage of the validation set:

- More representative: A larger validation set gives a more accurate estimate of the model’s performance because it is more representative of the overall data distribution. It reduces the variance of the accuracy estimate, making it more stable.
- Reduced training data: As the share of data allocated to the validation set grows, the amount of data available for training shrinks. With less training data the model may not learn as effectively, which can lower the validation accuracy.
- Overfitting risk: A very large validation set leaves only a small training set, and a model trained on few samples is more likely to overfit them. The large validation set does not cause this; it simply reports the weaker generalization as a lower accuracy.

Reducing the percentage of the validation set:

- Less representative: A smaller validation set might not capture the data distribution as well as a larger one. This leads to a less reliable estimate of model performance due to higher variance in the accuracy measure.
- More training data: With a larger portion of the data used for training, the model has more examples to learn from, which can improve its ability to generalize, provided that the training data is representative of the data the model will encounter in practice.
- Less reliable validation metrics: If the validation set is too small, the accuracy estimate can change noticeably from one split to the next and may not give a stable reflection of model performance.

Summary:

- Increasing the validation set size gives a more stable and representative estimate of model performance, but reduces the amount of data available for training, which can hurt the model's ability to learn.
- Decreasing the validation set size provides more data for training, but gives less stable and potentially less representative accuracy estimates.

Balancing the size of the validation set and the training set is key to achieving reliable performance evaluation while ensuring the model is adequately trained.
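
The link between validation set size and the stability of the estimate can be made roughly quantitative. If each validation sample is treated as an independent trial that is correct with probability p (a simplifying assumption), the accuracy measured on n samples has a standard error of sqrt(p(1 - p)/n). The sketch below evaluates this for p = 0.34, roughly the nearest neighbour validation accuracy seen earlier; `accuracy_std_error` is a hypothetical helper name.

```python
import math

def accuracy_std_error(p, n):
    # Standard error of a proportion under the independent-trials assumption
    return math.sqrt(p * (1 - p) / n)

p_assumed = 0.34  # roughly the validation accuracy observed above
for n_val in (10, 100, 1000, 4000):
    se = accuracy_std_error(p_assumed, n_val)
    print(f"n = {n_val:5d}: standard error ~ {se:.4f}")
```

Under this approximation, a validation set of a few thousand samples pins the accuracy down to within about one percentage point, whereas a set of ten samples leaves an uncertainty of around fifteen points.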





2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
A. The size of the train and validation sets can significantly influence how well you can estimate the performance of your model on the test set. Here’s a detailed look at how this works:

Train set size:

- Larger train set:
    - Pros: A larger train set generally allows the model to learn more about the data distribution, leading to better generalization. This can improve the accuracy of your model both on the validation set and the test set.
    - Cons: With a very large train set, the validation set becomes relatively smaller, which might reduce the reliability of the performance estimates from the validation set.
- Smaller train set:
    - Pros: A smaller train set leaves a larger validation set, which gives a clearer picture of how the model performs on unseen data.
    - Cons: If the train set is too small, the model might not generalize well and may have higher variance. This can lead to less reliable predictions on both the validation and test sets.

Validation set size:

- Larger validation set:
    - Pros: A larger validation set can provide a more accurate estimate of the model’s performance because it better represents the distribution of the test set. This improves the reliability of the accuracy estimates.
    - Cons: If too much data is allocated to the validation set, the amount of data left for training might be insufficient, which can hurt the model’s ability to learn effectively.
- Smaller validation set:
    - Pros: A smaller validation set allows more data to be used for training, which can improve the model’s performance.
    - Cons: A smaller validation set may not be representative of the test set, leading to noisier estimates of the model’s performance. Decisions based on such a noisy estimate can then favour a model that happens to do well on those few samples but generalizes poorly.

Balancing act:

- Trade-offs: The key is to balance the sizes of the train and validation sets. You want enough data in the validation set to get a reliable estimate of performance while still having enough data in the train set to build a robust model.
- Typical splits: Common practice is to use about 70-80% of the data for training and 20-30% for validation. The exact split can depend on the total amount of data available. For smaller datasets, you might use techniques like cross-validation to make the most out of limited data.

Cross-validation:

- K-fold cross-validation: If you have limited data, cross-validation techniques, such as k-fold cross-validation, can help. In k-fold cross-validation, the data is split into k subsets. The model is trained on k - 1 of these subsets and validated on the remaining one. This process is repeated k times, with each subset being used as the validation set once. This approach helps in making better use of available data and gives a more robust estimate of performance.

In summary, the size of the train and validation sets influences how well you can estimate model performance. Larger validation sets provide more reliable estimates but may reduce the amount of data available for training. Conversely, smaller validation sets might give less reliable performance estimates but allow more data for training. Balancing these aspects is crucial for effective model evaluation and generalization.
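
The k-fold procedure described above can be sketched in a few lines of NumPy. This is an illustrative outline only, assuming a simple random partition of the sample indices; `kfold_indices` is a hypothetical helper and is not used elsewhere in this lab.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # Shuffle the indices and cut them into k nearly equal folds
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)
    for i in range(k):
        val_idx = folds[i]  # fold i is held out for validation
        # the remaining k - 1 folds form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng_demo = np.random.default_rng(0)
fold_list = list(kfold_indices(n_samples=10, k=5, rng=rng_demo))

for fold_number, (train_idx, val_idx) in enumerate(fold_list):
    print(f"fold {fold_number}: {len(train_idx)} train, {len(val_idx)} validation")
```

Each index appears in exactly one validation fold, which is what distinguishes k-fold cross-validation from repeated independent random splits.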





3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

A. A commonly recommended practice is to reserve around 20% to 30% of your data for the validation set. This range strikes a balance between having enough data to train the model effectively and having enough data in the validation set to get a reliable estimate of the model’s performance. Here’s a more detailed breakdown of how you might decide on the exact percentage based on your specific context:

Guidelines for validation set size:

- Standard split: 20% to 30% of the data for validation is generally considered a good starting point. This is widely used in practice and provides a reasonable balance between the train and validation set sizes.
- Dataset size considerations:
    - Large datasets: If you have a large dataset (e.g., hundreds of thousands of samples or more), you might lean towards the lower end of the range, such as 20%. The large volume of data allows a smaller percentage to still provide a sufficiently large validation set.
    - Small datasets: For smaller datasets (e.g., a few thousand samples), you might opt for a larger percentage, closer to 30%, to ensure that the validation set is adequately representative. Alternatively, you might consider using techniques like k-fold cross-validation to make better use of limited data.
- Cross-validation: If you are working with a small dataset or want a more robust estimate, k-fold cross-validation can be a good alternative. With 10 folds, for example, each fold is used as the validation set once while the other 9 folds are used for training. This method helps mitigate the variance that might arise from a single train-validation split and provides a more reliable estimate of model performance.
- Hold-out test set: If you also have a separate test set, the validation set is used for tuning and selecting the model, while the test set is reserved for the final evaluation of the model’s performance. The test set should be kept completely separate from the training and validation process to provide an unbiased evaluation.

Example scenarios:

- Large dataset (e.g., 100,000 samples): Use 20% for validation (20,000 samples) and the remaining 80% for training (80,000 samples). This typically provides a good balance.
- Small dataset (e.g., 5,000 samples): Use 30% for validation (1,500 samples) and the remaining 70% for training (3,500 samples). Alternatively, use k-fold cross-validation with 5 or 10 folds to better utilize the available data.

In summary, reserving 20% to 30% of the data for validation is a good general practice. Adjustments can be made based on the size of your dataset and specific needs of your model evaluation strategy.






Answer for both nearest neighbour and random classifier. You can note down the values from your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
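
One way to organise such an experiment is to loop over several train fractions and record the validation accuracy for each. The sketch below does this on a small synthetic two-class dataset so that it runs quickly; the data, the helper `one_nn`, and the names `train_fractions` and `val_accuracies` are all assumptions made for illustration, and the numbers it prints say nothing about the housing data.

```python
import numpy as np

rng_demo = np.random.default_rng(0)

# Synthetic stand-in data: two overlapping Gaussian blobs, 200 points each
X = np.vstack([
    rng_demo.normal(0.0, 1.0, size=(200, 2)),
    rng_demo.normal(1.5, 1.0, size=(200, 2)),
])
y = np.repeat([0, 1], 200)

def one_nn(traindata, trainlabel, testdata):
    # 1-nearest-neighbour prediction via pairwise squared distances
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[(diff ** 2).sum(axis=2).argmin(axis=1)]

train_fractions = [0.01, 0.1, 0.5, 0.75, 0.9, 0.99]
val_accuracies = []
for frac in train_fractions:
    perm = rng_demo.permutation(len(y))
    n_train = max(1, int(frac * len(y)))  # keep at least one training point
    train_idx, val_idx = perm[:n_train], perm[n_train:]
    pred = one_nn(X[train_idx], y[train_idx], X[val_idx])
    val_accuracies.append(float(np.mean(pred == y[val_idx])))

for frac, acc in zip(train_fractions, val_accuracies):
    print(f"train fraction {frac:.2f}: validation accuracy {acc:.3f}")
```

The two lists can then be passed to `plt.plot(train_fractions, val_accuracies)`. Repeating the loop with different seeds shows how much the values at the extreme fractions move from run to run.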

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.




In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


X_train = traindata
y_train = trainlabel
X_test = testdata
y_test = testlabel


knn_1 = KNeighborsClassifier(n_neighbors=1)
knn_1.fit(X_train, y_train)


y_pred_1 = knn_1.predict(X_test)
accuracy_1 = accuracy_score(y_test, y_pred_1)


knn_3 = KNeighborsClassifier(n_neighbors=3)
knn_3.fit(X_train, y_train)

y_pred_3 = knn_3.predict(X_test)
accuracy_3 = accuracy_score(y_test, y_pred_3)
print("Accuracy of 1-Nearest Neighbor Classifier: {:.2f}".format(accuracy_1))
print("Accuracy of 3-Nearest Neighbor Classifier: {:.2f}".format(accuracy_3))

Accuracy of 1-Nearest Neighbor Classifier: 0.34
Accuracy of 3-Nearest Neighbor Classifier: 0.35
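
To see what changes between 1 and 3 neighbours, it can help to look at a from-scratch version of the voting rule. The following sketch is not scikit-learn's implementation; `knn_predict` is a hypothetical helper that assumes non-negative integer labels and breaks ties in favour of the smallest label.

```python
import numpy as np

def knn_predict(traindata, trainlabel, testdata, k=3):
    # Pairwise squared distances, shape (m, n)
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Indices of the k closest training points for each test point
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = trainlabel[nearest]
    # Majority vote; np.bincount requires non-negative integer labels
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy 1-D data where the single closest point disagrees with the majority
toy_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1, 1])
toy_query = np.array([[1.6]])

pred_k1 = knn_predict(toy_train, toy_labels, toy_query, k=1)
pred_k3 = knn_predict(toy_train, toy_labels, toy_query, k=3)
print(pred_k1, pred_k3)  # [1] [0]
```

With k = 1 the query takes the label of the point at 2.0, while with k = 3 the two label-0 points at 0.0 and 1.0 outvote it. This smoothing effect is one possible reason the 3-nearest-neighbour accuracy above is slightly higher.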


## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of validation accuracies as the test accuracy estimation. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [16]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1 which is the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [17]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.96388622460814 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
A. Yes, averaging the validation accuracy across multiple splits generally gives more consistent and reliable results compared to a single validation split. Here’s why this approach is beneficial:

Benefits of averaging validation accuracy:

- Reduces variability:
    - Single split variability: A single validation split might not fully represent the variability in the data or the performance of the model. This is especially true if the validation set is not representative of the overall dataset, leading to higher variance in performance estimates.
    - Averaging across multiple splits: By averaging validation accuracy across multiple splits, you account for different possible variations in the data, which helps in obtaining a more stable estimate of the model’s performance.
- Improves reliability: Each split provides a different estimate of performance. Averaging these estimates gives a more reliable indication of the model’s generalization ability, as it smooths out any anomalies or biases present in any single split.
- Reduces overfitting to validation data:
    - Single validation set: With a single validation set, there is a risk that the model might tune its hyperparameters too specifically for that set, which can lead to overfitting.
    - Cross-validation: By using multiple splits or cross-validation, you minimize the risk of overfitting to any particular validation set because the model is evaluated on multiple different subsets of the data.

Cross-validation techniques:

- K-fold cross-validation:
    - Procedure: The data is divided into k folds. The model is trained k times, each time using k - 1 folds for training and the remaining fold for validation. The performance metrics from each fold are then averaged.
    - Benefits: This technique ensures that every sample is used for both training and validation, providing a more comprehensive evaluation of the model’s performance.
- Leave-one-out cross-validation (LOOCV):
    - Procedure: This is a special case of k-fold cross-validation where k equals the number of samples in the dataset. Each sample is used once as a validation set while the remaining samples are used for training.
    - Benefits: LOOCV provides an extremely thorough evaluation but can be computationally expensive, especially for large datasets.
- Repeated cross-validation:
    - Procedure: Cross-validation is repeated multiple times with different random splits of the data. The results from these repetitions are averaged.
    - Benefits: This approach further enhances the robustness of the performance estimate by accounting for different random variations in the data splits.




2. Does it give a more accurate estimate of test accuracy?

A. Yes, averaging validation accuracy across multiple splits generally provides a more accurate estimate of test accuracy compared to using a single validation split.

Reasoning:

- Reduced variance: Multiple splits help account for the variability in data and model performance, leading to a more stable and reliable estimate of how the model will perform on unseen data (like the test set).
- Minimized overfitting to validation: With a single split, there's a risk of overfitting to the validation set. Averaging across splits reduces this risk, as the model is evaluated on different subsets of the data.

This approach simulates how the model might perform on various unseen data samples, giving a more realistic expectation of its performance on the test set.

However, it's important to note that even with multiple splits, the estimate is still an approximation, and the true test accuracy might vary slightly. Averaging reduces the random spread of the estimate, but it does not remove systematic differences: in the run above, the models used for validation were trained on only 75% of the training data, and the average validation accuracy (33.96%) is slightly below the test accuracy (34.92%) obtained with the full training set.



3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
A. Generally, increasing the number of iterations in cross-validation (or multiple splits) leads to a more reliable and stable estimate of model performance.

Reasoning:

- Reduced randomness: Each iteration introduces some randomness due to the data splitting process. With more iterations, the impact of this randomness is reduced, as the results are averaged over a larger number of splits.
- Smoother estimate: Higher iterations lead to a smoother average, reducing the chances of getting a skewed estimate due to a "lucky" or "unlucky" split.

However, there are practical considerations:

- Computational cost: More iterations mean more training and evaluation cycles, increasing the computational time required.
- Diminishing returns: The improvement in the estimate might plateau after a certain number of iterations. The gain in accuracy from additional iterations might become negligible.

In summary: Increasing iterations generally improves the estimate's reliability, but it's essential to balance it with computational resources and the point of diminishing returns.
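
The diminishing returns mentioned above can be illustrated with an idealised simulation. Here each single-split accuracy is modelled as an independent binomial draw, which is a simplifying assumption: real splits of the same dataset overlap, so the actual gain from extra iterations is smaller than this sketch suggests. The values `true_acc` and `n_val` and the helper `spread_of_average` are hypothetical.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
true_acc = 0.34  # assumed underlying accuracy
n_val = 4000     # assumed validation set size

def spread_of_average(iterations, repeats=2000):
    # Model each split's accuracy as Binomial(n_val, true_acc) / n_val,
    # average over `iterations` splits, and measure how much that average
    # varies across `repeats` independent experiments.
    accs = rng_demo.binomial(n_val, true_acc, size=(repeats, iterations)) / n_val
    return accs.mean(axis=1).std()

for iterations in (1, 10, 100):
    print(f"{iterations:3d} iterations: spread of the average ~ {spread_of_average(iterations):.5f}")
```

Under the independence assumption the spread shrinks roughly as one over the square root of the number of iterations, so going from 10 to 100 iterations costs ten times the compute for about a threefold reduction in spread.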



4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?
A. Increasing iterations in cross-validation can help mitigate some of the issues associated with small datasets, but it cannot fully compensate for the limitations of having very small train or validation sets.

Here's why:

- Small training set:
    - Limited learning: A small training set restricts the model's ability to learn the underlying patterns in the data. Increasing iterations does not create more data for the model to learn from.
    - Overfitting risk: With limited training data, the model might overfit even with cross-validation, as it might memorize the small training set instead of generalizing.
- Small validation set:
    - Unreliable estimate: A small validation set might not be representative of the overall data distribution, so each individual estimate of model performance is noisy.

What increasing iterations can do:

- Better utilization of available data: Across many iterations, each sample ends up in the validation set of some splits and in the training set of others, so the evaluation makes fuller use of the limited data. Each model is still trained from scratch on one small training set, so no individual model learns more.
- More stable estimate (to a degree): Averaging reduces the variance in the performance estimate caused by the randomness of data splitting.

However, it's important to acknowledge that increasing iterations is not a substitute for having sufficient data.

For very small datasets, consider:

- Collecting more data: If possible, the most effective solution is to gather more data to improve the model's learning and generalization.
- Simpler models: Opt for simpler models with fewer parameters to reduce the risk of overfitting.
- Regularization techniques: Employ regularization techniques to prevent overfitting and improve generalization.

In conclusion, while increasing iterations can offer some benefits with small datasets, it's crucial to be aware of the inherent limitations and explore other strategies to address the challenges posed by limited data.


> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.



In [23]:
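def AverageAccuracy(alldata, alllabel, splitpercent, iterations, neighbours=1):
    # Note: this redefines the earlier AverageAccuracy. Instead of a classifier
    # function, it takes the number of neighbours for KNeighborsClassifier.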

    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = KNeighborsClassifier(n_neighbors=neighbours).fit(traindata, trainlabel).predict(valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations

In [24]:
splits = 10
splitpercent = 75/100

avg_acc_1 = AverageAccuracy(alltraindata, alltrainlabel, splitpercent, splits, neighbours=1)
print(f"Average validation accuracy for 1NN with {splits} splits and {splitpercent*100}% split: {avg_acc_1*100:.2f}%")

avg_acc_3 = AverageAccuracy(alltraindata, alltrainlabel, splitpercent, splits, neighbours=3)
print(f"Average validation accuracy for 3NN with {splits} splits and {splitpercent*100}% split: {avg_acc_3*100:.2f}%")



Average validation accuracy for 1NN with 10 splits and 75.0% split: 33.92%
Average validation accuracy for 3NN with 10 splits and 75.0% split: 34.50%


To observe how the accuracy of the 3 nearest neighbour classifier changes with the number of splits and the split size, and to compare it with the 1 nearest neighbour classifier, you need to run a series of experiments. Here is a suggested approach:

1. Experiment with different values:
    - Number of splits: Try different values for the number of splits (e.g., 5, 10, 20).
    - Split size: Vary the train/validation percentages (e.g., 60/40, 70/30, 80/20).
    - Neighbours: Compare the results for 1 nearest neighbour and 3 nearest neighbours.
2. Run experiments and collect results: For each combination of splits, split size, and neighbours, calculate the average accuracy using the AverageAccuracy function. Record the results in a table or plot them on a graph.
3. Analyze the results:
    - Effect of splits: How does increasing the number of splits affect the accuracy and consistency of the results for both classifiers?
    - Effect of split size: How does changing the split size impact the accuracy of both classifiers?
    - Comparison: How do the accuracies of the 1 nearest neighbour and 3 nearest neighbour classifiers compare across different splits and split sizes?