
# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set randomseed
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate to integers so that we can classify (np.int was removed in NumPy 1.24)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)




Here is a function for calculating the 1-nearest neighbours

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
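
To make the broadcasting step in `NN1` concrete, here is a minimal sketch on a toy training set. The three points, their labels, and the query are made up for illustration and are not part of the housing data.

```python
import numpy as np

# Toy training set: three 2-D points with hypothetical labels
traindata = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
trainlabel = np.array([0, 1, 2])
query = np.array([0.9, 1.2])

# Broadcasting subtracts the (2,) query from every row of the (3, 2) array
diff = traindata - query
dist = (diff * diff).sum(axis=1)  # squared Euclidean distance to each training point
nearest = int(np.argmin(dist))    # index of the closest training point
predicted = int(trainlabel[nearest])
print(dist, nearest, predicted)
```

The second training point is the closest one, so the query receives its label. Squared distances are enough here because taking the square root would not change which point is nearest.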

We will also define a 'random classifier', which randomly allots labels to each sample

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # in reality, we don't need these arguments

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
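
As a quick worked example of the metric, the sketch below applies the same formula to five made-up labels. The function is restated under the hypothetical name `accuracy_demo` only so that the block runs on its own.

```python
import numpy as np

def accuracy_demo(gtlabel, predlabel):
    # fraction of positions where the prediction equals the ground truth
    assert len(gtlabel) == len(predlabel)
    return (gtlabel == predlabel).sum() / len(gtlabel)

gt = np.array([1, 0, 2, 2, 1])    # hypothetical ground-truth labels
pred = np.array([1, 0, 1, 2, 0])  # hypothetical predictions, wrong at positions 2 and 4
acc = accuracy_demo(gt, pred)
print(acc)  # 3 correct out of 5
```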

Let us make a function to split the dataset with the desired probability.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
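
Because `split` assigns each sample independently, `percent` is the probability of landing in the first part, not an exact fraction. The sketch below illustrates this on hypothetical sizes and also shows one possible alternative (a shuffled index split) for when exact sizes are wanted. The alternative is not what this lab uses, and the names `rng_demo`, `first_idx`, and `second_idx` are made up for the example.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # separate generator so the lab's rng is untouched
n_samples = 10_000
percent = 0.2

# Probabilistic split, as in split(): the realised fraction varies around percent
fractions = []
for _ in range(5):
    mask = rng_demo.random(n_samples) < percent
    fractions.append(float(mask.mean()))
print(fractions)

# Alternative with exact sizes: shuffle the indices and cut at a fixed position
perm = rng_demo.permutation(n_samples)
n_first = int(round(percent * n_samples))
first_idx, second_idx = perm[:n_first], perm[n_first:]
print(len(first_idx), len(second_idx))
```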

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


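For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour at distance 0 (this can only fail if two samples have identical features but different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.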
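
The 1/(number of classes) figure holds even when the classes are imbalanced, as long as the guesses are uniform. The simulation below is a sketch with made-up class proportions, not the housing labels.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)
num_classes = 6
n = 60_000

# Deliberately imbalanced hypothetical labels
labels = rng_demo.choice(num_classes, size=n, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
# Uniform random guesses, as in RandomClassifier
guesses = rng_demo.integers(low=0, high=num_classes, size=n)

acc = float((labels == guesses).mean())
print(acc, 1 / num_classes)
```

Each guess matches the true class with probability 1/6 whatever that class is, so the overall accuracy stays near 1/6.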

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
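
One possible way to organise such an experiment is sketched below. To keep it fast and self-contained it uses hypothetical toy data (two Gaussian blobs) instead of the housing set, and the helper names `nn_predict` and `random_split` are stand-ins for the lab's `NN` and `split`.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

# Hypothetical toy data: two Gaussian blobs in 2-D
n_per_class = 200
data = np.vstack([rng_demo.normal(0.0, 1.0, size=(n_per_class, 2)),
                  rng_demo.normal(2.0, 1.0, size=(n_per_class, 2))])
label = np.repeat([0, 1], n_per_class)

def nn_predict(traindata, trainlabel, testdata):
    # squared distances between every test row and every train row, via broadcasting
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def random_split(data, label, percent):
    # same idea as split(): each sample goes to the first part with probability percent
    mask = rng_demo.random(len(label)) < percent
    return data[mask], label[mask], data[~mask], label[~mask]

percents = [0.1, 0.3, 0.5, 0.7, 0.9]
results = []
for p in percents:
    tr_x, tr_y, va_x, va_y = random_split(data, label, p)
    acc = float((nn_predict(tr_x, tr_y, va_x) == va_y).mean())
    results.append((p, len(tr_y), len(va_y), acc))
    print(f"train fraction {p:.1f}: train={len(tr_y)}, val={len(va_y)}, val accuracy={acc:.3f}")
```

The first and last columns of `results` can then be passed to `plt.plot`. Repeating each setting several times gives a sense of how much the accuracy moves between runs.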

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>
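
For comparison with the repeated random splits above, here is a minimal sketch of how k-fold index generation could look. The helper `kfold_indices` is hypothetical and written only for this illustration; it is not a library function and not the method used in this lab.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs; every sample lands in exactly one validation fold."""
    perm = rng.permutation(n_samples)  # shuffle once
    folds = np.array_split(perm, k)    # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng_demo = np.random.default_rng(seed=0)
splits = list(kfold_indices(10, 5, rng_demo))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Unlike independent random splits, the validation folds here never overlap, so each sample is used for validation exactly once.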

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


Answer 1: Yes. Averaging the validation accuracy across multiple splits, whether with the repeated random splits used above or with techniques like K-Fold Cross-Validation, leads to more consistent and reliable results. It mitigates the impact of random variability in the data splits and provides a more stable estimate of model performance. Here is why averaging across multiple splits is beneficial:

- Reduced sensitivity to the data split: Averaging over multiple splits makes the evaluation less sensitive to the particular way the data is partitioned into training and validation sets. Each split is a different random subset of the data, and averaging smooths out the peculiarities of any single split.
- Robustness to data variability: Machine learning models may perform differently on different subsets of the data. Averaging the performance metric over multiple splits captures the overall behavior of the model across these subsets.
- More reliable performance estimate: Averaging considers the model's behavior across a range of subsets. This is especially important with limited datasets, where a single split might not be representative of the overall model performance.
- Better generalization assessment: If the model consistently performs well across multiple splits, it is more likely to generalize well to unseen data.
- Statistical soundness: Averaging gives a clearer picture of the central tendency of the accuracy metric and supports more reliable conclusions about the model's behavior.

While averaging over multiple splits is a common practice, it is also essential to look at the standard deviation of the metric across splits. A small standard deviation indicates consistency, while a large one suggests that performance varies considerably between subsets of the data.

In summary, averaging validation accuracy across multiple splits is a good practice because it gives more stable and reliable performance estimates, especially when the dataset is limited and the random variability between splits is significant.
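
The effect on consistency can be illustrated with a small simulation. This is an idealised sketch: it assumes every validation prediction is an independent coin flip with a fixed success probability, and the values of `true_acc`, `val_size`, and `iterations` are hypothetical. Real splits drawn from the same pool are correlated, so the actual gain is smaller than shown here.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)
true_acc = 0.34    # hypothetical underlying accuracy
val_size = 400     # hypothetical validation set size
iterations = 10    # number of splits averaged together
trials = 2000      # how many times the whole experiment is repeated

# Accuracy measured on a single validation set, repeated over many trials
single = rng_demo.binomial(val_size, true_acc, size=trials) / val_size
# Accuracy averaged over `iterations` independent validation sets per trial
averaged = (rng_demo.binomial(val_size, true_acc, size=(trials, iterations)) / val_size).mean(axis=1)

print("std of single-split estimate:", single.std())
print("std of averaged estimate:    ", averaged.std())
```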


Answer 2: Averaging the validation accuracy across multiple splits, as in K-Fold Cross-Validation, provides a more reliable estimate of the model's likely performance on unseen data. It does not directly measure the test accuracy (the test set is kept separate for the final evaluation), but it serves as a good proxy for generalization performance. In the run above, the averaged validation accuracy (about 0.336) is close to the test accuracy (about 0.349). Here is how averaging contributes to a more accurate estimate:

- Reduced overfitting to a single split: Averaging over multiple splits mitigates the risk of drawing conclusions that only hold for one specific subset of the data. If the model consistently performs well across various training-validation splits, it suggests a more robust generalization capability.
- Better representation of data variability: By considering different subsets of the data, the averaged validation accuracy gives a more comprehensive view of how the model copes with the patterns and distributions present in the dataset.
- Smoothing out random variability: If the model's performance is consistently good or consistently poor across multiple splits, that is a clearer indication of its overall behavior than any single split.
- Identification of consistent trends: Consistent validation accuracy across splits suggests that the result does not depend on a specific data split, which makes the estimate more indicative of performance on new, unseen data.
- Better insight into generalization: A model that performs well on average across different validation sets is more likely to generalize effectively to new instances.

The ultimate test of a model's accuracy is its performance on a truly independent test set that has not been used for training or model selection. The validation accuracy, even when averaged, is an estimate based on subsets of the data, and each validation model is trained on less data than the final model, which can make the estimate slightly pessimistic. A more consistent averaged validation accuracy nevertheless gives a more reliable estimate of the model's generalization performance.

In practice, it is common to use techniques like K-Fold Cross-Validation to estimate model performance, and then to run a final evaluation on a separate test set to obtain an unbiased estimate of the model's accuracy on new, unseen data.


Answer 3: In this lab, the number of iterations is the number of random train/validation splits that AverageAccuracy averages over. It plays a role similar to the number of folds in cross-validation (it is unrelated to training epochs, since nearest neighbour has no iterative training). More iterations do improve the estimate, but the relationship is not straightforward and there are trade-offs to consider.

Benefits of more iterations:

- Smoothing out variability: Averaging over more splits smooths out the random variability between splits and gives a more stable estimate of model performance.
- Robustness to data distribution: With more iterations, the model is trained and evaluated on a larger variety of data subsets, so the estimate reflects a broader range of the patterns present in the dataset.
- Better generalization estimate: More splits simulate the model's behavior across a larger set of training and validation scenarios, which gives a more accurate estimate of generalization performance.

Considerations:

- Computational cost: Every iteration trains and evaluates the model again, which is time-consuming with large datasets or complex models.
- Dataset size: With the random splits used here, the number of iterations does not change the size of each split, so it cannot make up for a small dataset. In k-fold cross-validation on a small dataset, a large number of folds leaves each validation fold with very few samples, which makes the per-fold accuracies noisy.
- Diminishing returns: More iterations give a more stable estimate, but each additional iteration helps less than the previous one. Beyond some point the reliability barely improves and the computational cost dominates.

General guidelines:

- Balance: Too few iterations give an estimate that is sensitive to the specific data split, while too many become computationally expensive without significant additional benefit.
- Use-case specific: The right number of iterations depends on the characteristics of the dataset, the complexity of the model, and the available computational resources. It is advisable to experiment with different numbers of iterations.

In conclusion, increasing the number of iterations improves the stability of the performance estimate, but with diminishing returns and at increasing computational cost. It is common practice to try different iteration settings to find an appropriate balance for the problem at hand.
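
The diminishing return can be made concrete with a short derivation. As a sketch, assume each of the $n$ per-split accuracy estimates $a_i$ has the same variance $\sigma^2$ and that any two of them have correlation $\rho$ (they are correlated because every split is drawn from the same pool of samples). Both symbols are modelling assumptions, not quantities measured in this lab.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} a_i\right)
  = \frac{1}{n^2}\left(n\,\sigma^2 + n(n-1)\,\rho\,\sigma^2\right)
  = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{n}
```

The second term shrinks like $1/n$, so going from 1 to 10 iterations helps far more than going from 100 to 1000. The first term does not depend on $n$ at all: under these assumptions it is a floor that no number of iterations removes.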


Answer 4: Only partly. Increasing the number of iterations can help with some of the issues caused by small training or validation datasets, but there are limits to how much it can compensate for limited data.

Small training dataset:

- Overfitting to training data: If the training dataset is very small, the model risks fitting the specific examples in it. Increasing iterations alone does not fix this, especially if the model has enough capacity to memorize the limited training data.
- Limited representativeness: A small training dataset might not represent the diversity and complexity of the overall data distribution, so the model may not generalize well to new, unseen examples.

Small validation dataset:

- Limited generalization assessment: A small validation dataset does not give a robust assessment of generalization performance. Increasing iterations helps by averaging over different validation sets, but the small size still limits the reliability of each estimate.

Considerations:

- Data augmentation: If the dataset is small, one approach is to generate additional training examples by applying transformations (e.g., rotations, flips, shifts) to the existing data. This artificially increases the effective size of the training dataset.
- Regularization techniques: Techniques such as dropout or weight regularization can mitigate overfitting for models that support them, even with a small training dataset.
- Iterative experimentation: It is important to try different strategies, including varying the number of iterations, while monitoring both training and validation performance.
- Transfer learning: If the dataset is extremely small, a model pre-trained on a related task or a larger dataset can bring in knowledge from a different but related domain.

Caveats:

- No substitute for sufficient data: Increasing iterations and the techniques above can mitigate some challenges, but the quality of the model's generalization is ultimately limited by the richness and representativeness of the available data.
- Risk of overfitting: Increasing iterations alone does not address the risk of overfitting to a small training dataset. It is crucial to monitor the model's behavior and, if necessary, introduce regularization.

In summary, increasing iterations can be part of the answer to small datasets, but it works best within a broader approach that includes data augmentation, regularization, and careful monitoring of model behavior. Model performance remains constrained by the inherent limitations of the available data.
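
To get a feel for how much a small validation set limits a single estimate, the sketch below evaluates the binomial standard error sqrt(p(1-p)/n) for a few validation sizes. It assumes that each validation prediction is an independent trial with the same success probability; the value `p = 0.34` and the sizes are hypothetical choices for illustration.

```python
import math

p = 0.34  # hypothetical true accuracy
sizes = (10, 100, 1000, 4000)

# Standard error of an accuracy measured on n independent validation samples
std_errors = {n: math.sqrt(p * (1 - p) / n) for n in sizes}
for n, se in std_errors.items():
    print(f"validation size {n:>5}: standard error of accuracy ~ {se:.3f}")
```

Averaging over several splits reduces this noise, but it does not change the fact that a classifier trained on fewer samples tends to be a weaker classifier.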














