<a href="https://colab.research.google.com/github/hydracsnova13/IIITHAIML/blob/main/Aiml_module_1_lab_2_machine_learning_terms_and_metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## AIML Module 1 - Lab 2
# Machine learning terms and metrics


In this lab, we will walk through a part of the ML pipeline: extracting features, training a classifier, and testing it.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# fix the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income and the median house age of the block group. The task is to predict the median house value per district.

Let us download and examine the dataset.

In [None]:
dataset = datasets.fetch_california_housing()
print(dataset.DESCR)  # description of the dataset
print(dataset.keys())  # other fields available in this dataset
# truncate the target to an integer so that we can treat it as a class label;
# np.int was deprecated in NumPy 1.20 and later removed, so use the builtin int
dataset.target = dataset.target.astype(int)
print(dataset.data.shape)
print(dataset.target.shape)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived



Here is a function that implements the 1-nearest-neighbour classifier:

In [None]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
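
To make the distance computation concrete, here is a minimal sketch on three hand-made 2D points. The helper name `nearest_label` and the toy arrays are illustrative assumptions, not part of the lab code; the logic mirrors `NN1` above.

```python
import numpy as np

def nearest_label(traindata, trainlabel, query):
    # squared Euclidean distance from the query to every training row
    dist = ((traindata - query) ** 2).sum(axis=1)
    # the prediction is the label of the closest training row
    return trainlabel[np.argmin(dist)]

toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_query = np.array([4.0, 4.5])

# squared distances are 36.25, 21.25 and 1.25, so the third point wins
pred = nearest_label(toy_train, toy_label, toy_query)
print(pred)  # 1
```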

We will also define a 'random classifier', which randomly allots labels to each sample

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
  # in reality, we don't need these arguments

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [None]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
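
As a quick sanity check of the definition, the following sketch computes the accuracy by hand on five made-up labels (the arrays are hypothetical and chosen only for illustration).

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 1])     # made-up ground-truth labels
pred = np.array([0, 2, 2, 2, 0])   # made-up predictions

matches = (gt == pred)             # [True, False, True, True, False]
toy_accuracy = matches.sum() / len(gt)
print(toy_accuracy)  # 3 matches out of 5 -> 0.6
```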

Let us make a function to split the dataset with the desired probability.

In [None]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  print(split1)
  split2 = rnd>=percent
  print(split2)
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

[False False False ...  True  True False]
[ True  True  True ... False False  True]
Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %
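
The test set is 20.08% rather than exactly 20% because `split` assigns each sample independently with the given probability. The sketch below illustrates this and contrasts it with a permutation-based split that yields an exact count. It uses its own generator `demo_rng` (an assumption, so that the lab's `rng` state is not disturbed), and the permutation variant is an alternative, not the method used in this lab.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator, for illustration only
n = 20640

# mask-based split, as in split(): the number of selected samples varies per draw
mask_counts = [int((demo_rng.random(n) < 0.2).sum()) for _ in range(5)]
print(mask_counts)

# permutation-based split: exactly round(0.2 * n) samples every time
perm = demo_rng.permutation(n)
n_test = int(round(0.2 * n))
test_idx, rest_idx = perm[:n_test], perm[n_test:]
print(len(test_idx), len(rest_idx))
```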


## Experiments with splits

Let us reserve some of our train data as a validation set

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

[False False  True ...  True  True  True]
[ True  True False ... False False False]


What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


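For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour, at distance 0 (this holds as long as no two training samples have identical features but different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.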
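
The 1/(number of classes) figure does not depend on how the true labels are distributed: if the guess is uniform over k classes, the probability of a match is the sum over classes of P(label = c) · 1/k = 1/k. A small simulation on synthetic, deliberately imbalanced labels (the class probabilities below are made up) is consistent with this.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=1)  # separate generator, for illustration only
k = 6

# deliberately imbalanced synthetic ground truth
probs = np.array([0.05, 0.35, 0.30, 0.15, 0.10, 0.05])
gt = demo_rng.choice(k, size=100_000, p=probs)

# uniform random guesses, as in RandomClassifier
guess = demo_rng.integers(low=0, high=k, size=gt.size)

random_acc = (gt == guess).mean()
print(random_acc, 1 / k)  # the two values should be close
```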

Let us predict the labels for our validation set and get the accuracy

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Still, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

[ True  True False ...  True False  True]
[False False  True ... False  True False]
Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
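
One possible way to organise such an experiment is sketched below. To keep it fast and self-contained it uses small synthetic two-class data and a vectorised 1-NN helper (`nn_predict`); both are assumptions standing in for the housing data and the `NN` function of this lab.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=2)  # separate generator, for illustration only

# synthetic two-class data: two overlapping Gaussian blobs
n = 600
X = np.vstack([demo_rng.normal(0.0, 1.0, size=(n // 2, 2)),
               demo_rng.normal(2.0, 1.0, size=(n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def nn_predict(traindata, trainlabel, testdata):
    # pairwise squared distances, shape (n_test, n_train)
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

train_fractions = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
val_accuracies = []
for frac in train_fractions:
    mask = demo_rng.random(n) < frac          # True -> train, False -> validation
    if mask.sum() == 0 or (~mask).sum() == 0:
        val_accuracies.append(np.nan)         # degenerate split, nothing to evaluate
        continue
    pred = nn_predict(X[mask], y[mask], X[~mask])
    val_accuracies.append((pred == y[~mask]).mean())

for frac, acc in zip(train_fractions, val_accuracies):
    print(f"train fraction {frac:.2f}: validation accuracy {acc:.3f}")
```

The two lists can then be passed to `plt.plot(train_fractions, val_accuracies, marker='o')`. Repeating the sweep with different seeds gives a feel for how noisy the extreme splits are.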

In [None]:
# for question 1
# note: the last argument of split() is the fraction that goes to the TRAIN set

# validation percentage decreased: 90% train, 10% validation
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 90/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

# validation percentage increased: 45% train, 55% validation
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 45/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

[ True  True  True ...  True  True  True]
[False False False ... False False False]
Validation accuracy of nearest neighbour is  0.3504424778761062
[False False False ... False False False]
[ True  True  True ...  True  True  True]
Validation accuracy of nearest neighbour is  0.32691249312052834


#### Answer to question 2

Bias-Variance Trade-off: The sizes of the training and validation sets pull in opposite directions. With a small training set, the model has seen too little data to capture the underlying patterns, so its accuracy on unseen data drops and the validation accuracy underestimates what a model trained on all the data would achieve on the test set. Conversely, with a small validation set, the estimated accuracy has high variance: it can change noticeably depending on which samples happen to fall in the validation set.

Overfitting and Generalization: A larger training set can help reduce overfitting, which occurs when a model learns to perform well on the training data but doesn't generalize well to unseen data. A well-sized validation set can help you monitor overfitting and assess how well your model is generalizing.

Reliability of Validation Metrics: The size of the validation set can affect the reliability of the metrics you calculate on it. Smaller validation sets may result in more volatile validation performance metrics, making it harder to make informed decisions about model hyperparameters or architecture.
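
A rough way to quantify this volatility is the binomial standard error, sqrt(p(1 - p)/n), which treats every validation prediction as an independent trial that is correct with probability p. This independence is a simplifying assumption, and the helper name `accuracy_standard_error` is hypothetical.

```python
import numpy as np

def accuracy_standard_error(p, n_val):
    # standard error of an accuracy estimate from n_val independent predictions
    return np.sqrt(p * (1 - p) / n_val)

p = 0.34  # roughly the 1-NN validation accuracy observed above
for n_val in [40, 400, 4000]:
    print(n_val, accuracy_standard_error(p, n_val))
```

Under this approximation, a validation set 100 times larger gives an estimate that is 10 times less noisy.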

Hyperparameter Tuning: When you're tuning hyperparameters (e.g., learning rate, model architecture, regularization strength), the size of the validation set matters. If your validation set is too small, you may not obtain a reliable estimate of how well a particular set of hyperparameters will perform on unseen data.

Representativeness: A very small validation set may not be representative of the overall data distribution, so the accuracy measured on it can be far from the true value purely by chance. A larger validation set is more likely to provide a representative estimate of your model's performance.

Cross-Validation: To mitigate the impact of the validation set size, you can use techniques like k-fold cross-validation. This involves splitting the data into multiple subsets (folds) and training and validating the model on different combinations of these subsets. Cross-validation provides a more robust estimate of model performance, especially when the dataset is limited.
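
The k-fold procedure described above can be sketched as follows. This is a minimal NumPy version under stated assumptions: the function names (`kfold_indices`, `kfold_accuracy`, `nn_predict`) and the synthetic data are illustrative and not part of the lab code.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=4)  # separate generator, for illustration only

def nn_predict(traindata, trainlabel, testdata):
    # vectorised 1-NN: pairwise squared distances, shape (n_test, n_train)
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def kfold_indices(n_samples, k, rng):
    # shuffle once, then cut the shuffled indices into k nearly equal folds
    return np.array_split(rng.permutation(n_samples), k)

def kfold_accuracy(data, label, k, classifier, rng):
    folds = kfold_indices(len(label), k, rng)
    scores = []
    for i in range(k):
        val_idx = folds[i]  # fold i is the validation set exactly once
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = classifier(data[train_idx], label[train_idx], data[val_idx])
        scores.append((pred == label[val_idx]).mean())
    return float(np.mean(scores))

# well-separated synthetic two-class data
X = np.vstack([demo_rng.normal(0.0, 1.0, size=(100, 2)),
               demo_rng.normal(6.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

cv_acc = kfold_accuracy(X, y, k=5, classifier=nn_predict, rng=demo_rng)
print(cv_acc)
```

Unlike repeated random splits, every sample is used for validation exactly once, so no sample is over- or under-represented in the average.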


#### Answer to question 3

Typical Split Ratios:

A common split ratio is 70-80% for training data and 20-30% for validation data when you have a reasonably large dataset (thousands of data points).

For very large datasets, you might allocate an even smaller percentage to the validation set (e.g., 90-95% for training, 5-10% for validation).

Conversely, for small datasets, you might allocate a larger percentage to the validation set (e.g., 60-80% for training, 20-40% for validation).

Cross-Validation:

If your dataset is limited in size, consider using cross-validation techniques such as k-fold cross-validation or stratified sampling to assess model performance more robustly. In k-fold cross-validation, the data is divided into k subsets, and the model is trained and validated k times, with each subset serving as the validation set once.

Purpose of Validation:

If the validation set is primarily used for hyperparameter tuning and model selection, a smaller validation set may suffice, as you'll be running multiple experiments with different hyperparameters.

If the validation set is also used for final model evaluation before deploying it in production, it's advisable to allocate a larger percentage to ensure a more reliable estimate of performance.

Domain Expertise:

In some domains, the nature of the data may influence the choice of validation set size. Domain knowledge can help you determine what percentage is appropriate. For example, in medical research, where data is often limited and valuable, you might use a larger validation set.

Iterative Development:

During the model development process, it's often beneficial to start with a larger validation set to quickly assess model performance, and then, as you fine-tune your model, you can reduce the validation set size.

Monitoring Overfitting:

Keep an eye on overfitting. If you notice that your model is overfitting the training data, consider increasing the size of your validation set to better detect this problem.

## Multiple Splits

One way to get a more accurate estimate of the test accuracy is <b>cross-validation</b>. Here, we will try a simple version: we perform multiple train/validation splits and use the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [None]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [None]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

[ True  True  True ...  True  True  True]
[False False False ... False False False]
[ True  True  True ...  True  True  True]
[False False False ... False False False]
[ True  True False ...  True  True False]
[False False  True ... False False  True]
[ True False  True ...  True  True  True]
[False  True False ... False False False]
[ True  True  True ...  True  True  True]
[False False False ... False False False]
[False False  True ...  True False  True]
[ True  True False ... False  True False]
[ True False  True ... False  True False]
[False  True False ...  True False  True]
[ True  True False ...  True False  True]
[False False  True ... False  True False]
[ True  True  True ...  True  True False]
[False False False ... False False  True]
[ True  True False ... False False  True]
[False False  True ...  True  True False]
Average validation accuracy is  0.3359366875267045
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known cross-validation schemes, such as k-fold cross-validation and leave-one-out. These will be covered in detail in a later module. For more information about cross-validation, see <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of the test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1. Yes, averaging the validation accuracy across multiple splits typically gives more consistent results. With a single train/validation split, the accuracy depends on which samples happen to land in each set. Averaging over several splits smooths out this random variation and gives a more stable estimate of model performance.

2. Yes, in the sense that the averaged estimate has lower variance, so it is less likely to be far off because of one unlucky split. Averaging does not remove bias, however: each validation model is trained on only 75% of the training data, while the test model uses all of it, so the averaged estimate tends to be slightly pessimistic. In the run above, the average validation accuracy is about 0.336 against a test accuracy of about 0.349.

3. Increasing the number of iterations generally improves the estimate, because the model is evaluated on a larger number of different train/validation splits. The returns diminish, though: going from a few iterations to a moderate number often helps noticeably, while adding many more changes the average very little.

4. Only to some extent. When data is limited, a single split may not represent the underlying data distribution, and averaging over repeated splits reduces the influence of that variability. More iterations cannot make up for a training set that is too small to learn from, and very high iteration counts become computationally impractical.
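
The effect of the number of iterations on consistency can be checked with a small simulation. The sketch below is not the lab's experiment: it uses small synthetic data, a vectorised 1-NN helper (`nn_predict`) and a simplified `average_accuracy`, all of which are assumptions made to keep the run short. It repeats the averaged estimate 30 times for 1 and for 10 iterations and compares the spread.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=3)  # separate generator, for illustration only

# synthetic, overlapping two-class data
n = 200
X = np.vstack([demo_rng.normal(0.0, 1.0, size=(n // 2, 2)),
               demo_rng.normal(1.5, 1.0, size=(n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def nn_predict(traindata, trainlabel, testdata):
    # pairwise squared distances, shape (n_test, n_train)
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def average_accuracy(iterations, train_fraction=0.75):
    # mean validation accuracy over several random train/validation splits
    total = 0.0
    for _ in range(iterations):
        mask = demo_rng.random(n) < train_fraction
        pred = nn_predict(X[mask], y[mask], X[~mask])
        total += (pred == y[~mask]).mean()
    return total / iterations

spread = {}
for iterations in [1, 10]:
    estimates = [average_accuracy(iterations) for _ in range(30)]
    spread[iterations] = np.std(estimates)
    print(f"iterations={iterations}: std of the averaged estimate = {spread[iterations]:.4f}")
```

The spread for 10 iterations should be clearly smaller than for a single split, while the mean of the estimates stays roughly the same: averaging reduces variance, not bias.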