<a href="https://colab.research.google.com/github/poojisairama/fmml-lab-2/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through a part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the income of the block and the age of the houses per district. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate to integers so that we can classify (np.int was deprecated in NumPy 1.20 and removed in 1.24)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label with the 1-nearest-neighbour rule.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
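
To see what the two functions do, here is a toy check on three hand-made training points. The points, the labels, and the lower-case `nn1`/`nn` names are illustrative only; they are compact copies of the functions above so that the snippet runs on its own.

```python
import numpy as np

def nn1(traindata, trainlabel, query):
    # squared Euclidean distance from the query to every training row (broadcasting)
    dist = ((traindata - query) ** 2).sum(axis=1)
    return trainlabel[np.argmin(dist)]

def nn(traindata, trainlabel, testdata):
    # apply the 1-nearest-neighbour rule to every row of testdata
    return np.array([nn1(traindata, trainlabel, q) for q in testdata])

toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_query = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nn(toy_train, toy_label, toy_query)
print(toy_pred)  # the first query is closest to (0, 0), the second to (5, 5)
```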

We will also define a 'random classifier', which assigns a random label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; trainlabel only supplies the set of classes
  # and testdata only the number of predictions to make

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)

Let us make a function to split the dataset with the desired probability.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
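
Because `split` draws an independent random number for each sample, the sizes of the two parts are only approximately `percent` and `1 - percent` of the data. If an exact size matters, one alternative is to shuffle the indices and cut at a fixed position. This is a sketch and not part of the lab; `split_exact` is a hypothetical helper and the demo arrays are made up.

```python
import numpy as np

def split_exact(data, label, percent, rng):
    # hypothetical variant of split(): the first part gets exactly round(percent * n) samples
    n = len(label)
    order = rng.permutation(n)           # random ordering of all sample indices
    cut = int(round(percent * n))
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]

demo_rng = np.random.default_rng(seed=0)
demo_data = np.arange(40).reshape(20, 2)   # row i is [2i, 2i + 1]
demo_label = np.arange(20)
d1, l1, d2, l2 = split_exact(demo_data, demo_label, 0.25, demo_rng)
print(len(l1), len(l2))
```

The probabilistic version used in this lab is simpler and is fine when the dataset is large, since the relative deviation from the requested percentage shrinks as the number of samples grows.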

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


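For nearest neighbour, the train accuracy is always 1, because every training sample is its own nearest neighbour at distance 0 (unless two identical samples carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1667 in our case since there are 6 classes.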
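
The 1/(number of classes) figure does not depend on how the labels are distributed: a uniform random guess matches any fixed label with probability 1/k. A quick simulation can confirm this. The class frequencies in `probs` are arbitrary assumptions chosen to be unbalanced; they are not the label distribution of the housing data.

```python
import numpy as np

sim_rng = np.random.default_rng(seed=0)
k = 6                                    # number of classes, as in this lab
n = 200_000                              # number of simulated samples
probs = [0.05, 0.35, 0.30, 0.15, 0.10, 0.05]   # arbitrary, deliberately unbalanced

true_label = sim_rng.choice(k, size=n, p=probs)
guess = sim_rng.integers(low=0, high=k, size=n)   # uniform random guesses
sim_accuracy = (true_label == guess).mean()
print(sim_accuracy, 1 / k)
```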

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.
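
To put a number on "close together", the split can be repeated a few times and the spread of the validation accuracies measured. The sketch below does this on small synthetic data so that it runs quickly. The data, the `nn_predict` helper, and the number of repeats are assumptions for illustration, not the housing experiment.

```python
import numpy as np

rep_rng = np.random.default_rng(seed=0)

# synthetic 2-D data: three overlapping Gaussian blobs, 200 points each
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
syn_label = np.repeat(np.arange(3), 200)
syn_data = centers[syn_label] + rep_rng.normal(size=(600, 2))

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour via a full pairwise squared-distance matrix
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

accs = []
for _ in range(10):
    mask = rep_rng.random(len(syn_label)) < 0.75       # same idea as split()
    pred = nn_predict(syn_data[mask], syn_label[mask], syn_data[~mask])
    accs.append((pred == syn_label[~mask]).mean())

accs = np.array(accs)
print("mean =", accs.mean(), "std =", accs.std())
```

The standard deviation printed at the end is one way to summarise how much a single validation accuracy can move from split to split.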

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values from your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.

1st question

The accuracy measured on the validation set depends on the percentage of data allocated to it. Increasing the validation percentage reduces the size of the training set, and reducing it enlarges the training set. These changes affect both how well the model is trained and how reliably it is evaluated.

1. Increasing the validation set percentage:

* Pros: More validation data gives a more reliable estimate of the model's performance, that is, of how well it generalizes to unseen data. If the validation set is representative of the test or real-world data distribution, a larger one also helps identify overfitting earlier, which makes it more likely that the final model generalizes well.

* Cons: A larger validation set means a smaller training set. With less data for training, the model may not learn complex patterns as effectively, potentially leading to underfitting. Each evaluation also costs more, because predictions are made on more validation samples.

2. Reducing the validation set percentage:

* Pros: A smaller validation set means a larger training set, which can help the model learn better from the available data and potentially capture more complex patterns. Each evaluation is also cheaper, because predictions are made on fewer validation samples.

* Cons: A smaller validation set gives a less reliable estimate of model performance, making it harder to detect overfitting. The model might appear to perform well on the validation set even if it is overfitting.

In practice, the choice of the validation set percentage depends on the specific dataset, problem, and computational resources available. A common split is 80% training and 20% validation, but you may need to experiment with different splits to find the balance that works best for your particular task. Cross-validation techniques can also be used to mitigate some of the drawbacks of small validation sets.

Remember that the goal is to strike a balance between having enough data for the model to learn and having enough data for reliable model evaluation.
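
The reliability point can be made concrete. If each validation sample is treated as an independent trial that the classifier gets right with probability p, the measured accuracy has a standard error of sqrt(p(1-p)/n) for n validation samples. The sketch below evaluates this for a few validation sizes, using p = 0.34 (roughly the nearest-neighbour validation accuracy seen earlier). The independence assumption, the `accuracy_std_error` name, and the chosen sizes are illustrative.

```python
import math

def accuracy_std_error(p, n):
    # binomial standard error of an accuracy estimate from n independent samples
    return math.sqrt(p * (1 - p) / n)

for n_val in [16, 165, 1650, 4124]:
    print(n_val, round(accuracy_std_error(0.34, n_val), 4))
```

Because the error scales with 1/sqrt(n), the validation set has to grow fourfold to halve the uncertainty of the estimate.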




In [16]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
data = load_iris()
X, y = data.data, data.target

# Vary the percentage of the validation set
validation_percentages = [0.1, 0.2, 0.3, 0.4, 0.5]

for val_percentage in validation_percentages:
    # Split the data into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=val_percentage, random_state=42)

    # Create and train a simple classifier
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Predict on the validation set
    y_pred = model.predict(X_val)

    # Calculate accuracy on the validation set
    accuracy = accuracy_score(y_val, y_pred)

    print(f"Validation Percentage: {val_percentage * 100}%")
    print(f"Validation Accuracy: {accuracy * 100:.2f}%")
    print("-" * 30)

Validation Percentage: 10.0%
Validation Accuracy: 100.00%
------------------------------
Validation Percentage: 20.0%
Validation Accuracy: 100.00%
------------------------------
Validation Percentage: 30.0%
Validation Accuracy: 100.00%
------------------------------
Validation Percentage: 40.0%
Validation Accuracy: 100.00%
------------------------------
Validation Percentage: 50.0%
Validation Accuracy: 100.00%
------------------------------


2nd question

The sizes of the training and validation sets affect how well the validation accuracy predicts the test accuracy in the following ways.

Larger training set:

* Pros: With a larger training set, the model has more data to learn from, which can help it capture complex patterns and generalize better to unseen data. It is also less likely to underfit.

* Cons: A larger training set means a smaller validation set. A very small validation set gives a noisier, less reliable performance estimate and makes it harder to detect overfitting.

Larger validation set:

* Pros: A larger validation set provides a more reliable estimate of the model's performance. It reduces the impact of random variability in the validation samples, leading to more consistent evaluation results, and it detects overfitting more effectively because it is less influenced by noise.

* Cons: A larger validation set means a smaller training set. If the training set is too small, the model may not learn complex patterns effectively, potentially leading to underfitting.

Balanced split (training and validation sets both reasonably large):

* Pros: A balanced split strikes a balance between model generalization and reliable performance estimation. The model learns from a substantial portion of the data, and the validation estimate is stable.

* Cons: There is still a trade-off between the sizes of the two sets. In some cases, a larger training set may be necessary to capture complex patterns.

In practice, finding the optimal split ratio depends on various factors, including the dataset's size, its complexity, and the computational resources available. Cross-validation techniques, such as k-fold cross-validation, can also be used to mitigate the impact of different split sizes and provide a more robust estimate of model performance.

Ultimately, the goal is to balance the trade-off between model learning and reliable evaluation to ensure that the model performs well on unseen data (test set) while avoiding overfitting or underfitting.
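
One way to probe this question directly is to measure, for several train fractions, how far the validation accuracy lands from the test accuracy on average. The sketch below does this with a 1-nearest-neighbour rule on synthetic data; the data generator `make_blobs`, the helper `nn_predict`, the fractions, and the 20 repeats are all assumptions made for illustration, so the numbers it prints say nothing about the housing data.

```python
import numpy as np

gap_rng = np.random.default_rng(seed=1)

def make_blobs(n_per_class, rng):
    # three overlapping 2-D Gaussian classes (synthetic stand-in for real data)
    centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
    label = np.repeat(np.arange(3), n_per_class)
    return centers[label] + rng.normal(size=(len(label), 2)), label

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour via a full pairwise squared-distance matrix
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

pool_x, pool_y = make_blobs(200, gap_rng)   # data available for train + validation
test_x, test_y = make_blobs(200, gap_rng)   # held-out test data

# reference: train on the whole pool, evaluate on the test set
test_acc = (nn_predict(pool_x, pool_y, test_x) == test_y).mean()

results = {}
for train_frac in [0.1, 0.5, 0.75, 0.95]:
    gaps = []
    for _ in range(20):
        mask = gap_rng.random(len(pool_y)) < train_frac
        val_pred = nn_predict(pool_x[mask], pool_y[mask], pool_x[~mask])
        val_acc = (val_pred == pool_y[~mask]).mean()
        gaps.append(abs(val_acc - test_acc))
    results[train_frac] = float(np.mean(gaps))   # mean |validation - test| gap
    print(train_frac, results[train_frac])
```

The same loop can be pointed at `alltraindata` and `testdata` from this lab, at the cost of a much longer run time.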



In [17]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
data = load_iris()
X = data.data
y = data.target

# Define different training and validation set sizes
train_sizes = [0.5, 0.7, 0.8]  # You can modify these values

for train_size in train_sizes:
    # Split the data into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1 - train_size, random_state=42)

    # Train a simple classifier (Logistic Regression in this case)
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Predict on the validation set
    y_val_pred = model.predict(X_val)

    # Calculate accuracy on the validation set
    val_accuracy = accuracy_score(y_val, y_val_pred)

    print(f"Training size: {train_size}, Validation accuracy: {val_accuracy:.2f}")


Training size: 0.5, Validation accuracy: 1.00
Training size: 0.7, Validation accuracy: 1.00
Training size: 0.8, Validation accuracy: 1.00


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

3rd question

A common and balanced percentage to reserve for the validation set is around 20-30% of the total dataset, especially when the dataset is reasonably large. However, the choice depends on various factors, including the specific dataset, the complexity of the model, and the computational resources available.

The code cell below uses scikit-learn to split the data into training and validation sets with an 80-20 split (80% training and 20% validation), which is a balanced starting point. In that code:

* X and y should be replaced with your actual feature and label data.
* The test_size parameter is set to 0.2, indicating that 20% of the data will be reserved for validation.

This 80-20 split is a reasonable choice for many datasets and models, as it provides a good balance between having enough data for training and having a reasonably large validation set for reliable performance estimation.

However, you may need to adjust this split ratio based on your specific circumstances:

* If you have a very large dataset, you might be able to reserve a smaller percentage (e.g., 10-15%) for validation.
* If your dataset is small, you may consider reserving a larger percentage (e.g., 30-40%) for validation to obtain more reliable performance estimates.

Ultimately, the goal is to strike the right balance between training and validation data so that the model performs well on unseen data (the test set) without overfitting or underfitting. You may need to experiment with different split ratios to determine what works best for your particular problem.

In [18]:
from sklearn.model_selection import train_test_split

# Splitting data into 80% training and 20% validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
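
Unlike the repeated random splits used in `AverageAccuracy`, k-fold cross-validation partitions the data once, so that every sample is used for validation exactly once. A minimal NumPy sketch of the index bookkeeping follows; `kfold_indices` is a hypothetical helper written for this illustration and is not scikit-learn's `KFold`.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, rng):
    # shuffle once, then cut the shuffled indices into n_folds nearly equal parts
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

kf_rng = np.random.default_rng(seed=0)
fold_list = list(kfold_indices(10, 5, kf_rng))
for train_idx, val_idx in fold_list:
    print(len(train_idx), len(val_idx))   # 8 training and 2 validation indices per fold
```

Each `(train_idx, val_idx)` pair can be used to index the data and labels in place of the masks produced by `split`.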

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1st answer

Yes, averaging the validation accuracy across multiple splits, as in cross-validation, gives more consistent and reliable results than a single split. Cross-validation is a widely used technique for obtaining a better estimate of a model's performance and for reducing the randomness associated with a single split of the data.

Here is why averaging validation accuracy across multiple splits is beneficial:

* Reduced variance: With a single split, you might get lucky or unlucky with the way the data is divided, leading to an overly optimistic or pessimistic estimate of model performance. Averaging over several different training and validation sets smooths out this split-to-split variability.

* More representative evaluation: Averaging results from multiple splits ensures that the model is evaluated on different subsets of the data. This gives a more representative view of how it will perform on unseen data. Note that averaging improves the estimate, not the model itself.

* Model selection: Cross-validation is often used for model selection and hyperparameter tuning. It allows you to compare different models or hyperparameter settings more fairly and choose the one that performs consistently well across various data splits.

In summary, averaging validation accuracy across multiple splits using techniques like k-fold cross-validation is a standard practice in machine learning. It provides a more consistent and reliable evaluation of the model's performance and helps ensure that the results are not overly influenced by the randomness of a single data split.

In the code cell below:

* Replace LogisticRegression() with your own classifier or model.
* num_folds defines the number of folds in the cross-validation (e.g., 5-fold cross-validation).
* shuffle=True shuffles the data before splitting, which can help reduce potential bias from the ordering of the data.
* random_state sets a random seed for reproducibility.
* The cross_val_score function performs k-fold cross-validation and returns an array with one accuracy score per fold. The mean of these scores is a more stable estimate of the model's accuracy, and the standard deviation gives an idea of the variability across folds.

In [20]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Create your classifier (replace with your own model)
classifier = LogisticRegression()

# Define the number of folds (e.g., 5-fold cross-validation)
num_folds = 5

# Create a KFold cross-validation object
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Perform k-fold cross-validation and get accuracy scores
accuracy_scores = cross_val_score(classifier, X, y, cv=kf, scoring='accuracy')

# Calculate the mean and standard deviation of accuracy scores
mean_accuracy = accuracy_scores.mean()
std_accuracy = accuracy_scores.std()

print(f"Mean Accuracy: {mean_accuracy:.2f}")
print(f"Standard Deviation of Accuracy: {std_accuracy:.2f}")


Mean Accuracy: 0.97
Standard Deviation of Accuracy: 0.02


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


2nd question

Averaging validation accuracy across multiple splits using techniques like k-fold cross-validation provides a more reliable estimate of a model's generalization performance than a single split, and in that sense a more accurate one. However, this estimate is still based on the validation data, and it is not the same as the test accuracy, which is evaluated on completely unseen data. Cross-validation serves as a robust proxy for test performance.

The code cell below uses k-fold cross-validation to obtain such an estimate. In that code:

* Replace LogisticRegression() with your own classifier or model.
* num_folds defines the number of folds in the cross-validation (e.g., 5-fold cross-validation).
* shuffle=True shuffles the data before splitting to reduce potential bias.
* random_state sets a random seed for reproducibility.
* The cross_val_score function performs k-fold cross-validation and returns an array with one accuracy score per fold. The mean of these scores is a more accurate estimate of the model's generalization performance than a single validation split.

While cross-validation provides a more accurate estimate of how the model is likely to perform on unseen data, the final evaluation should always be done on a separate, held-out test dataset that has not been used during model development or hyperparameter tuning. This test set provides the most accurate estimate of the model's real-world performance.

In [21]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Create your classifier (replace with your own model)
classifier = LogisticRegression()

# Define the number of folds (e.g., 5-fold cross-validation)
num_folds = 5

# Create a KFold cross-validation object
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Perform k-fold cross-validation and get accuracy scores
accuracy_scores = cross_val_score(classifier, X, y, cv=kf, scoring='accuracy')

# Calculate the mean accuracy across folds
mean_accuracy = accuracy_scores.mean()

print(f"Mean Cross-Validation Accuracy: {mean_accuracy:.2f}")

Mean Cross-Validation Accuracy: 0.97


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


3rd question

The number of iterations affects the estimate of model performance. In AverageAccuracy above, it is the number of random splits that are averaged; in k-fold cross-validation, the analogous quantity is the number of folds. In general, a higher number of iterations leads to a more robust and reliable estimate, at the cost of increased computational time.

Here is how the number of folds affects the estimate of model performance:

Higher iterations (more folds):

* Pros: With more folds, such as 10-fold or leave-one-out cross-validation, each model is trained on a larger share of the data ((k-1)/k of it for k folds), so the estimate is closer to the performance of a model trained on all the data. Evaluating on a larger number of validation sets also reveals how consistent the model's performance is across different data splits.

* Cons: Increased computational cost, because one model is trained per fold; this matters especially if the dataset is large or the model is complex. Each validation fold is also smaller (1/k of the data), so the individual per-fold accuracies are noisier.

Lower iterations (fewer folds):

* Pros: Fewer folds, such as 5-fold or 3-fold cross-validation, are computationally more efficient and quicker to execute. Each validation fold is larger, so each per-fold accuracy is less noisy.

* Cons: Each training set is smaller, which can limit the model's ability to capture complex patterns. The estimate is also based on a smaller number of validation sets, so it might be more influenced by randomness in the data split.

The choice of the number of iterations depends on a trade-off between computational resources and the desire for a robust estimate. In practice, 5-fold or 10-fold cross-validation is often a good compromise, providing a reasonably accurate and stable estimate of model performance.
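
For reference, the sizes of the training and validation parts in k-fold cross-validation follow directly from k. The small sketch below tabulates them for a dataset of 150 samples (the size of the Iris dataset used in these answers); `kfold_sizes` is a hypothetical helper, and it uses integer division, so the sizes are approximate when k does not divide the number of samples.

```python
def kfold_sizes(n_samples, n_folds):
    # approximate sizes of the training and validation parts of one fold
    val_size = n_samples // n_folds
    train_size = n_samples - val_size
    return train_size, val_size

for k in [3, 5, 10, 150]:           # 150 folds on 150 samples is leave-one-out
    print(k, kfold_sizes(150, k))
```

As k grows, the training part approaches the full dataset while the validation part shrinks towards a single sample.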

The code cell below performs k-fold cross-validation with different numbers of folds using scikit-learn. You can experiment with different values of num_folds to observe how the number of folds affects the cross-validation estimate of model accuracy.

In [22]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Create your classifier (replace with your own model)
classifier = LogisticRegression()

# Define different numbers of folds (iterations)
num_folds = [3, 5, 10]  # You can modify these values

for fold in num_folds:
    # Create a KFold cross-validation object with the specified number of folds
    kf = KFold(n_splits=fold, shuffle=True, random_state=42)

    # Perform k-fold cross-validation and get accuracy scores
    accuracy_scores = cross_val_score(classifier, X, y, cv=kf, scoring='accuracy')

    # Calculate the mean accuracy across folds
    mean_accuracy = accuracy_scores.mean()

    print(f"Number of Folds: {fold}, Mean Cross-Validation Accuracy: {mean_accuracy:.2f}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Number of Folds: 3, Mean Cross-Validation Accuracy: 0.97
Number of Folds: 5, Mean Cross-Validation Accuracy: 0.97
Number of Folds: 10, Mean Cross-Validation Accuracy: 0.97



4 th question

Increasing the number of iterations in cross-validation (i.e., using more folds) can help mitigate the impact of a very small training dataset or validation dataset to some extent. However, it cannot completely compensate for an extremely small dataset, and there are practical limits to how much increasing the iterations can help.

Here's why increasing iterations can help but may not fully address the issue:

Pros of Increasing Iterations (Folds):

More Data Splits: With more folds, the model is trained and validated on different subsets of the data, which can reduce the impact of a small training or validation set on the overall performance estimate.

Better Assessment: A higher number of iterations provides a better assessment of how well the model generalizes and can reveal how consistent its performance is across different data splits.

Cons of Increasing Iterations (Folds):

Smaller Validation Sets: As you increase the number of folds k, each training set grows to (k-1)/k of the data while each validation fold shrinks to 1/k of it. If your dataset is already very small, each fold's score is then computed on only a handful of samples, so individual fold scores become noisy.

Computational Cost: Using a large number of folds can be computationally expensive, especially if your dataset is large or your model is complex.

Diminishing Returns: After a certain point, increasing the number of folds may not significantly improve the estimate of model performance. There are diminishing returns as you go beyond a reasonable number of folds.
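
To see how the fold count changes the split sizes, the following minimal sketch counts the indices that `KFold` returns. The 30-sample array `X_toy` and the dictionary `fold_sizes` are hypothetical names introduced only for this illustration; they are not part of the lab's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 30 samples, 2 features (sizes chosen only for illustration)
X_toy = np.arange(60).reshape(30, 2)

fold_sizes = {}
for k in (3, 5, 10, 30):  # with 30 samples, k=30 is the same as leave-one-out
    kf = KFold(n_splits=k)
    # Look at the first split only; 30 divides evenly by every k used here
    train_idx, val_idx = next(iter(kf.split(X_toy)))
    fold_sizes[k] = (len(train_idx), len(val_idx))
    print(f"k={k:2d}: train size={len(train_idx)}, validation size={len(val_idx)}")
```

With 30 samples, the validation fold shrinks from 10 samples at k=3 to a single sample at k=30, while the training portion grows from 20 to 29.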

In cases where you have a very small dataset, you might consider using techniques like stratified sampling to ensure that each fold contains a representative distribution of classes. Additionally, you can experiment with different cross-validation strategies, such as leave-one-out (LOO) cross-validation, which uses one sample as a validation set and the rest as training data in each iteration.
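
As a rough illustration of stratified splitting, the sketch below uses scikit-learn's `StratifiedKFold` on a small imbalanced label vector. The arrays `X_strat` and `y_strat` are made-up stand-ins (assumptions for this example), not the lab's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 16 samples of class 0 and 4 samples of class 1
y_strat = np.array([0] * 16 + [1] * 4)
X_strat = np.zeros((20, 1))  # feature values do not affect how the folds are drawn

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Count how many minority-class samples land in each validation fold
minority_per_fold = [
    int(y_strat[val_idx].sum()) for _, val_idx in skf.split(X_strat, y_strat)
]
print(minority_per_fold)
```

Because the class counts divide evenly by the number of folds here, every validation fold receives one minority-class sample, so no fold ends up without the rare class.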

The code below performs LOO cross-validation with scikit-learn, the most extreme case of increasing the number of folds: each sample serves as the validation set exactly once. LOO gives a nearly unbiased estimate of accuracy, but the estimate can have high variance, and it requires fitting one model per sample, so it is often impractical for larger datasets.

In [23]:
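from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# X and y are assumed to be the feature matrix and labels defined in an earlier cell.
# Create your classifier (replace with your own model).
# Note: the default max_iter=100 may trigger a ConvergenceWarning on unscaled data.
classifier = LogisticRegression()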

# Create a LeaveOneOut cross-validation object
loo = LeaveOneOut()

# Perform LOO cross-validation and get accuracy scores
accuracy_scores = cross_val_score(classifier, X, y, cv=loo, scoring='accuracy')

# Calculate the mean accuracy across folds
mean_accuracy = accuracy_scores.mean()

print(f"Mean LOO Cross-Validation Accuracy: {mean_accuracy:.2f}")


Mean LOO Cross-Validation Accuracy: 0.97
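
scikit-learn's `LogisticRegression` can stop at its default iteration limit (`max_iter=100`) before converging on unscaled features; the usual remedies are raising `max_iter` or standardizing the inputs. The sketch below shows one possible way to do both inside a pipeline. It uses `load_iris` purely as a stand-in dataset, and the names `X_demo`, `y_demo`, `scaled_model` and `loo_scores` are assumptions for this example, so its accuracy is not comparable to the result printed above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the X and y used in the lab
X_demo, y_demo = load_iris(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fitted on each training split
# and the held-out sample never influences the scaler
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One fit per sample: each score is 1.0 (correct) or 0.0 (incorrect)
loo_scores = cross_val_score(
    scaled_model, X_demo, y_demo, cv=LeaveOneOut(), scoring="accuracy"
)
print(f"Number of LOO fits: {len(loo_scores)}")
print(f"Mean LOO accuracy with scaling: {loo_scores.mean():.2f}")
```

Keeping the scaler inside the pipeline is a deliberate choice: fitting it on the full dataset before cross-validation would leak information from the validation sample into training.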

