<a href="https://colab.research.google.com/github/nandhinichowdary/FMML-LAB-1/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through a part of the ML pipeline: extracting features, training, and testing.

In [26]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [27]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seeded random number generator, so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [28]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate to integers so that we can classify (np.int was removed in NumPy 1.24)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)



In [29]:
type(dataset), dataset.DESCR

(sklearn.utils._bunch.Bunch,
 '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block group\n        - HouseAge      median house age in block group\n        - AveRooms      average number of rooms per household\n        - AveBedrms     average number of bedrooms per household\n        - Population    block group population\n        - AveOccup      average number of household members\n        - Latitude      block group latitude\n        - Longitude     block group longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe target variable is the median house value for California districts,\nexpressed in hundreds of 

In [30]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
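
To see the distance computation at work, here is a minimal sketch on a made-up 2-D toy set. The points, the labels and the helper name `nearest_label` are assumptions for illustration; the helper repeats the squared-distance logic of `NN1` so that the snippet runs on its own.

```python
import numpy as np

def nearest_label(traindata, trainlabel, query):
    # same logic as NN1: squared Euclidean distance to every training row
    dist = ((traindata - query) ** 2).sum(axis=1)
    return trainlabel[np.argmin(dist)]

# hypothetical toy data: three labelled points in 2-D
toy_data = np.array([[5.0, 5.0], [6.0, 6.0], [15.0, 15.0]])
toy_label = np.array(["A", "B", "C"])

pred_near = nearest_label(toy_data, toy_label, np.array([5.2, 5.1]))   # closest to (5, 5)
pred_far = nearest_label(toy_data, toy_label, np.array([12.0, 12.0]))  # closest to (15, 15)
print(pred_near, pred_far)
```

Note that the distance is computed on the raw features, so attributes with a large numeric range (such as Population) are likely to dominate it unless the features are rescaled first.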

We will also define a 'random classifier', which randomly allots a label to each sample.

In [31]:
def RandomClassifier(traindata, trainlabel, testdata):
  # in reality, we don't need these arguments

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [32]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)

Let us make a function to split the dataset with the desired probability.

In [33]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
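
Because each sample is assigned independently at random, `split` returns only approximately the requested fraction (20.08% rather than exactly 20% for the test set in this lab). If an exact size is needed, one possible alternative is to shuffle the indices once and cut them. This is a sketch, not the method used in the lab; `split_exact` and the demo arrays are hypothetical.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    # shuffle the row indices once, then cut at the requested fraction
    idx = rng.permutation(len(label))
    n_first = int(round(fraction * len(label)))
    first, second = idx[:n_first], idx[n_first:]
    return data[first], label[first], data[second], label[second]

# hypothetical demo data: 10 samples with 2 features each
demo_rng = np.random.default_rng(seed=0)
demo_data = np.arange(20).reshape(10, 2)
demo_label = np.arange(10)
d1, l1, d2, l2 = split_exact(demo_data, demo_label, 0.2, demo_rng)
print(len(l1), len(l2))  # 2 8
```

The trade-off is small: the mask-based `split` is simpler, while the index-based version guarantees the sizes of both parts.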

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [34]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [35]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [36]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


In [37]:
'5,5 - A'
'6,6 - B'
'7,7 - C'


'8,8  - A | C'
'0,0 -B |A  - 0'

'10,10  -A'
'15,15  -C'

'15,15  -C'

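For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour (this holds as long as no two identical samples carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.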
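
The 1/(number of classes) figure can be checked with a small simulation. The sketch below uses made-up, deliberately imbalanced labels with six classes (an assumption for illustration, not the housing targets) and uniform random guesses, as in `RandomClassifier`.

```python
import numpy as np

sim_rng = np.random.default_rng(seed=0)
n_classes = 6
n_samples = 100_000

# hypothetical ground truth, deliberately imbalanced
true_label = sim_rng.choice(n_classes, size=n_samples, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
# uniform random guesses over the classes
guess = sim_rng.integers(low=0, high=n_classes, size=n_samples)

chance_level = 1 / n_classes
observed = (true_label == guess).mean()
print(chance_level, observed)
```

Uniform guessing is right about one time in six no matter how the true labels are distributed, which is why the random classifier scores the same on train and validation data.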

Let us predict the labels for our validation set and get the accuracy

In [38]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier stays about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [39]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [40]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
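
One way to organise such an experiment is sketched below. It records the validation accuracy of 1-nearest-neighbour for several training fractions and plots the values with `plt.plot`. The helper names and the small synthetic stand-in data are assumptions made so that the snippet runs quickly on its own.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest-neighbour, same idea as NN above
    return np.array([trainlabel[np.argmin(((traindata - q) ** 2).sum(axis=1))] for q in testdata])

def accuracy_vs_split(data, label, fractions, rng):
    # validation accuracy of 1-NN for each training fraction
    results = []
    for frac in fractions:
        mask = rng.random(len(label)) < frac        # True -> training sample
        if mask.sum() == 0 or (~mask).sum() == 0:   # skip degenerate splits
            results.append(np.nan)
            continue
        pred = nn_predict(data[mask], label[mask], data[~mask])
        results.append((pred == label[~mask]).mean())
    return np.array(results)

def plot_curve(fractions, accuracies):
    import matplotlib.pyplot as plt
    plt.plot(np.array(fractions) * 100, accuracies, marker="o")
    plt.xlabel("Percent of data used for training")
    plt.ylabel("Validation accuracy")
    plt.show()

# hypothetical stand-in data: two well-separated clusters
exp_rng = np.random.default_rng(seed=0)
stand_in_data = np.concatenate([exp_rng.normal(0, 1, (150, 2)), exp_rng.normal(6, 1, (150, 2))])
stand_in_label = np.repeat([0, 1], 150)
fractions = [0.01, 0.1, 0.5, 0.9, 0.99]
accuracies = accuracy_vs_split(stand_in_data, stand_in_label, fractions, exp_rng)
print(accuracies)
```

Calling `plot_curve(fractions, accuracies)` draws the graph. To run the experiment on the lab data, pass `alltraindata` and `alltrainlabel` instead of the stand-in arrays, keeping in mind that nearest neighbour on the full dataset is slow.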

1.Answer for first question

Increasing the Percentage of Validation Set:

Pros:

More Reliable Evaluation: A larger validation set is a more diverse and representative sample of the data, so the accuracy measured on it is a better indicator of how the model will perform on new, unseen data.
Less Sensitivity to Individual Samples: With more validation samples, a lucky or unlucky handful of examples has less influence on the measured accuracy, so it fluctuates less from split to split.

Cons:

Reduced Training Data: Increasing the share of data allocated to validation reduces the amount available for training. For nearest neighbour, fewer training samples mean that the closest neighbour is, on average, farther away, so the validation accuracy tends to drop. The random classifier is unaffected because it does not learn from the training data.
Evaluation Cost: Nearest neighbour compares every validation sample with every training sample, so the number of comparisons is the product of the two set sizes, which is largest for a 50/50 split.

Decreasing the Percentage of Validation Set:

Pros:

More Data for Training: A smaller validation set means you have more data available for training your model. This can be beneficial if you have limited data.
Fewer Predictions: A smaller validation set requires fewer predictions when evaluating the model.

Cons:

Overfitting Risk: With a smaller validation set, choices tuned on it (for example, model selection) may fit the specific validation samples and not generalize well to new data.
Less Reliable Evaluation: A small validation set may not provide a reliable estimate of your model's generalization performance, because the measured accuracy is more sensitive to the particular examples in the validation set.

Repeating the validation multiple times on different random splits gives a more robust estimate of model performance.







2.Answer for second question

Reliability of the Estimate (validation set size):

Smaller Validation Set: The accuracy measured on a small validation set is noisy, because it depends heavily on which particular samples happen to be in it. It can land well above or below the test accuracy, which makes it an unreliable basis for model selection.
Larger Validation Set: A larger validation set is a more representative sample of the data, so the measured accuracy fluctuates less and is a better predictor of performance on unseen data.

Quality of the Model (train set size):

Smaller Training Set: With a small training set and a large validation set, the model has not seen enough of the data. The validation accuracy is then measured precisely, but it tends to underestimate the test accuracy of a model trained on all the available training data.
Larger Training Set: A larger training set gives a model closer to the one that is finally tested, so the validation accuracy is less pessimistic. However, if the validation set becomes small as a result, the estimate is noisy again.
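
How much a validation accuracy can fluctuate purely because of the validation set size can be estimated with the binomial standard error, sqrt(p(1-p)/n). This assumes, as a simplification, that each of the n validation samples is an independent trial that is classified correctly with probability p. The sketch below uses p = 0.34, roughly the nearest-neighbour accuracy seen in this lab; the sizes and the function name are illustrative.

```python
import math

def accuracy_standard_error(p, n):
    # binomial standard error of an accuracy p measured on n independent samples
    return math.sqrt(p * (1 - p) / n)

p = 0.34  # roughly the nearest-neighbour validation accuracy seen earlier
for n in [16, 165, 4124, 16496]:  # illustrative validation-set sizes
    print(n, round(accuracy_standard_error(p, n), 4))
```

Since the error shrinks with the square root of n, quadrupling the validation set only halves the uncertainty.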

3.Answer for third question


60-20-20 or 70-15-15 Split: A commonly used split ratio is to allocate 60-70% of your data to the training set, 15-20% to the validation set, and the remaining 15-20% to the test set. This allows you to have a reasonable amount of data for training, validation, and a final evaluation on unseen data.

Cross-Validation: If you have a limited dataset, consider using k-fold cross-validation. In k-fold cross-validation, the data is divided into k subsets (or folds), and the training/validation process is repeated k times, with each subset serving as the validation set once. This approach helps make the most of your data and provides more robust performance estimates.

Stratified Split: If your dataset has class imbalance or other important characteristics, consider using stratified sampling to ensure that each subset (training, validation, test) maintains the same class distribution as the original dataset. This can help avoid biased splits.

Experiment and Iterate: The choice of the validation set size should also depend on your computational resources and how much data you have available. You may need to experiment with different split ratios and cross-validation strategies to find what works best for your specific problem.

Consider Data Size: For very large datasets, you may be able to allocate a smaller percentage to the validation set because there's still a substantial amount of data for training. Conversely, with very small datasets, you may need to allocate a larger percentage to validation to ensure a meaningful evaluation.

Ultimately, the goal is to strike a balance where the validation set is large enough to provide a reliable estimate of model performance but not so large that it significantly reduces the amount of data available for training. Additionally, it's essential to keep in mind that the quality and representativeness of your data matter just as much as the percentage split, so make sure your data preprocessing is thorough and thoughtful.
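
The stratified split mentioned above can be sketched in plain NumPy by splitting each class separately. The function name `stratified_split` and the demo labels are hypothetical; this is one possible implementation, not the one used in the lab.

```python
import numpy as np

def stratified_split(data, label, fraction, rng):
    # pick the same fraction of samples from every class
    first_idx = []
    for cls in np.unique(label):
        cls_idx = rng.permutation(np.flatnonzero(label == cls))
        n_first = int(round(fraction * len(cls_idx)))
        first_idx.extend(cls_idx[:n_first])
    first = np.zeros(len(label), dtype=bool)
    first[first_idx] = True
    return data[first], label[first], data[~first], label[~first]

# hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
strat_rng = np.random.default_rng(seed=0)
demo_label = np.array([0] * 80 + [1] * 20)
demo_data = np.arange(100).reshape(100, 1)
d1, l1, d2, l2 = stratified_split(demo_data, demo_label, 0.75, strat_rng)
# class counts per part: 60 and 15 in the first, 20 and 5 in the second
print(np.bincount(l1), np.bincount(l2))
```

Both parts keep the original 80:20 class ratio, which a purely random split only achieves on average.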






## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [41]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [25]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of the test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1.Answer for first question

Reduced Variance: By performing multiple splits of the data into training and validation subsets, you reduce the impact of data randomness on the model evaluation. It provides a more robust estimate of your model's performance by averaging over multiple validation sets.

Better Generalization Assessment: Averaging over multiple folds ensures that your model is evaluated on different subsets of the data. This can help you assess how well your model generalizes across various data partitions, reducing the risk of overfitting to a specific validation set.

More Representative Performance Estimate: Averaging allows you to obtain a performance estimate that is less dependent on the particular random split of your data. It provides a more representative assessment of your model's expected performance on unseen data.

Enhanced Confidence: Calculating statistics like the mean, standard deviation, or confidence intervals of the validation accuracy across folds can give you a better understanding of the range of possible model performances, helping you make more informed decisions.

Fairer Model Selection: When comparing different models or hyperparameter settings, using cross-validation and averaging the results ensures a fairer comparison. It minimizes the chances of selecting a model that happened to perform well on a single validation split due to luck.

However, it's important to note that k-fold cross-validation and averaging come with a computational cost, as you need to train and evaluate your model multiple times. The choice of the number of folds (k) depends on your dataset size, available computational resources, and the desired level of precision in your performance estimate. Common values for k include 5-fold, 10-fold, and even leave-one-out cross-validation (k equals the number of data points).

In summary, averaging validation accuracy across multiple splits, such as in k-fold cross-validation, is a valuable technique for obtaining more consistent and reliable model performance estimates, reducing the impact of random variations in the data split. It is particularly useful when assessing and comparing models in machine learning tasks.
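
To judge consistency, it helps to report the spread of the individual validation accuracies next to their mean. Below is a minimal sketch of that idea; the function names and the small synthetic stand-in data are assumptions chosen so that the snippet runs quickly on its own.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest-neighbour prediction for each test row
    return np.array([trainlabel[np.argmin(((traindata - q) ** 2).sum(axis=1))] for q in testdata])

def accuracy_stats(data, label, fraction, iterations, rng):
    # validation accuracy of each random split, summarised by mean and spread
    accs = []
    for _ in range(iterations):
        mask = rng.random(len(label)) < fraction  # True -> training sample
        pred = nn_predict(data[mask], label[mask], data[~mask])
        accs.append((pred == label[~mask]).mean())
    accs = np.array(accs)
    return accs.mean(), accs.std(ddof=1), accs

# hypothetical stand-in data: two overlapping clusters
stats_rng = np.random.default_rng(seed=0)
demo_data = np.concatenate([stats_rng.normal(0, 1, (100, 2)), stats_rng.normal(1.5, 1, (100, 2))])
demo_label = np.repeat([0, 1], 100)
mean_acc, std_acc, all_accs = accuracy_stats(demo_data, demo_label, 0.75, 10, stats_rng)
print(mean_acc, std_acc)
```

The standard deviation shows how far a single split can be expected to deviate from the average, which is the quantity that averaging is meant to smooth out.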






2.Answer for second question


Cross-Validation Accuracy: When you perform k-fold cross-validation, you divide your dataset into k subsets or folds. You train and validate your model k times, each time using a different fold as the validation set while the remaining k-1 folds are used for training. You then average the accuracy or other performance metrics across these k iterations. This cross-validation accuracy provides a more accurate estimate of how well your model is likely to perform on unseen data compared to a single validation split.

Test Accuracy: The "test accuracy" typically refers to the performance of your trained model on a completely separate and previously unseen dataset, which is often kept aside until the final evaluation stage. This test dataset should not have been used during model development or hyperparameter tuning. Test accuracy provides an estimate of how well your model generalizes to real-world, out-of-sample data.

While cross-validation accuracy is a valuable and more reliable estimate of your model's performance compared to a single validation split, it is not the same as test accuracy. Test accuracy remains the ultimate measure of your model's real-world generalization performance.

Cross-validation is useful for model selection, hyperparameter tuning, and gaining confidence in your model's performance estimate. However, it's essential to remember that cross-validation results are still based on the same dataset, and they may not fully capture all possible variations in real-world data.

In practice, it's a good idea to use cross-validation to assess and fine-tune your model, and then, once you have a final model configuration, evaluate it on a separate test dataset to obtain a more accurate estimate of its true generalization performance. This separation of a test set from the development process helps ensure that you are not overfitting to the validation data.
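
The k-fold procedure described above can be sketched as an index generator. This is a minimal illustration with a hypothetical function name, `kfold_indices`; the lab itself uses repeated random splits instead.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # shuffle once, then cut the indices into k nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

kf_rng = np.random.default_rng(seed=0)
fold_sizes = []
seen = []
for train_idx, val_idx in kfold_indices(10, 5, kf_rng):
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen.extend(val_idx.tolist())
print(fold_sizes)  # five folds, each with 8 training and 2 validation indices
```

Unlike repeated random splits, every sample lands in a validation fold exactly once. In practice, scikit-learn's `KFold` class in `sklearn.model_selection` provides the same kind of index splitting.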


3.Answer for third question

Bias-Variance Trade-off:

More Iterations (Higher k): In k-fold cross-validation, a larger k means each model is trained on a larger share of the data (k-1 of the k folds), which typically reduces the pessimistic bias of the performance estimate. Each validation fold is smaller, but every sample is still used for validation exactly once.
Fewer Iterations (Lower k): With a smaller k, each model is trained on a smaller share of the data, so the estimate tends to be more pessimistic. In exchange, fewer models have to be trained.

Computational Cost:

More Iterations: Using a higher number of iterations can be computationally expensive, especially when you have a large dataset or complex model. It may not always be feasible to use a very high k.

Dataset Size:

Small Dataset: In cases where you have a small dataset, increasing the number of iterations (higher k) can be beneficial to make the most of your limited data.
Large Dataset: With a large dataset, you may still obtain a reliable estimate with a smaller number of iterations (lower k), and this can save computational resources.

Stability of Estimates:

More Iterations: Increasing the number of iterations provides a more stable estimate of model performance. The average across more iterations is less susceptible to randomness in the data splits. For the repeated random splits used in this lab, this is the main effect: the train/validation proportions stay the same, so only the fluctuation of the average decreases.
Fewer Iterations: With fewer iterations, the estimate is more sensitive to the particular random splits, which leads to more variability in your results.

In summary, the choice of the number of iterations in cross-validation should be a balance between obtaining a reliable estimate of model performance and the computational cost. A common practice is to use 5-fold or 10-fold cross-validation, as they strike a reasonable balance for many datasets. However, if you have a small dataset, you might consider using leave-one-out cross-validation (k equals the number of data points) for a more thorough evaluation. Conversely, for large datasets, you can use smaller values of k to save computational resources.

It's essential to keep in mind that while more iterations give a more stable estimate, they do not guarantee a better model. The quality of your data, the choice of model, and other factors also play a crucial role in model performance.
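
The stabilising effect of more iterations can be illustrated with a small simulation. It rests on a strong simplifying assumption, stated here and in the code: each split's accuracy is treated as an independent noisy draw around a fixed value (0.34 with a standard deviation of 0.01, both made up for illustration).

```python
import numpy as np

sim_rng = np.random.default_rng(seed=0)
true_mean, split_std = 0.34, 0.01  # assumed values, for illustration only

def spread_of_average(iterations, repeats=2000):
    # repeat the "average over `iterations` splits" experiment many times
    # and measure how much the averaged estimate itself varies
    draws = sim_rng.normal(true_mean, split_std, size=(repeats, iterations))
    return draws.mean(axis=1).std()

for iterations in [1, 4, 16, 64]:
    print(iterations, spread_of_average(iterations))
```

Under this assumption the spread falls roughly as 1/sqrt(iterations), while the value the average settles on does not move. Real splits of the same dataset overlap and are not independent, so the actual improvement is likely to be smaller.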


4.Answer for fourth question


Reducing Variance:

Small Training Dataset: More iterations average over many different small training subsets, which makes the accuracy estimate more stable. It does not make each model better, though: every model is still trained on a small subset, so the averaged accuracy stays low.
Small Validation Dataset: When you have a small validation dataset, increasing the iterations allows you to assess the model's performance on many different validation subsets, reducing the impact of randomness in the data split and providing a more stable estimate.

Limitations:

Extremely Small Datasets: If your training or validation dataset is extremely small (e.g., just a few data points), there are inherent limitations to how much information can be extracted from such data. Increasing the number of iterations helps, but it cannot compensate for the lack of data.
Overfitting: With very small training datasets, there's an increased risk of overfitting. More iterations do not change this, because they affect only how the model is evaluated, not how it is trained.
Computational Cost: It's worth noting that increasing the number of iterations can become computationally expensive, especially if you have limited computational resources. You should carefully consider the trade-off between computational cost and the benefits of cross-validation.

In cases of extremely small datasets, you might want to explore alternative strategies:

Data Augmentation: If applicable, data augmentation techniques can artificially increase the effective size of your dataset by generating additional training examples with variations of the existing data.
Transfer Learning: Leveraging pretrained models (transfer learning) can be effective when you have limited data. You can fine-tune a pretrained model on your small dataset to achieve better results.
Simpler Models: Consider using simpler, less complex models that are less prone to overfitting when data is scarce.
Collect More Data: If possible, collecting more data should be a priority, as it addresses the root cause of the small dataset problem.

While increasing the number of iterations in cross-validation can help improve the stability of your model evaluation, it doesn't replace the need for an adequate amount of data to train a robust model. The best approach often involves a combination of techniques tailored to the specific problem and dataset constraints.





