Cross-validation is a crucial technique in machine learning for evaluating the performance of a model. K-fold cross-validation is one of the most commonly used methods for this purpose. It helps in estimating how well a model will generalize to new, unseen data. Here's an explanation of k-fold cross-validation:

K-Fold Cross-Validation:

What is Cross-Validation?
Cross-validation is a technique used to assess how well a machine learning model will perform on unseen data. Instead of training and evaluating a model on a single dataset split into a training set and a testing set, cross-validation involves splitting the dataset into multiple subsets, training the model on different combinations of these subsets, and then averaging the evaluation metric (e.g., accuracy, mean squared error) over all the subsets. This provides a more robust estimate of the model's performance.

K-Fold Cross-Validation Process:
K-fold cross-validation divides the dataset into 'k' subsets (or "folds") of approximately equal size. The process typically follows these steps:

1. The dataset is randomly shuffled to ensure that the data's order does not affect the results.
2. The dataset is divided into 'k' subsets of approximately equal size.
3. The model is trained 'k' times, each time using a different subset as the testing data and the remaining 'k-1' subsets as the training data.
4. For each fold, the model's performance metric (e.g., accuracy, error) is computed.
5. Finally, the performance metrics from all 'k' folds are averaged to get a single performance estimate for the model.
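
The shuffling and splitting steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation; the function name k_fold_indices and its seed parameter are hypothetical, and the sketch only produces the index sets (fitting, scoring, and averaging would happen inside the loop).

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so row order does not matter

    # Folds are as equal as possible: the first n_samples % k folds get one extra row
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]                # held-out fold
        train_idx = indices[:start] + indices[start + size:]  # remaining k-1 folds
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

With 10 samples and k=3 the test folds have 4, 3, and 3 rows, and every row lands in exactly one test fold.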


Advantages of K-Fold Cross-Validation:
-Better Estimation: K-fold cross-validation provides a more robust estimate of a model's performance because it uses multiple subsets for training and testing. This reduces the chance of the evaluation metric being biased by a single random split.
-Utilizes All Data: Every data point is used for testing exactly once and for training in the other 'k-1' rounds, which is important when you have limited data.
-Identifying Overfitting: It can help reveal overfitting or instability. A large gap between training and validation scores suggests overfitting, while performance that varies widely across folds suggests the model is sensitive to the particular training split.
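
One way to look for overfitting with cross-validation is to compare training and validation scores fold by fold. The sketch below assumes synthetic data from make_classification and an unpruned decision tree, both chosen only because they make the gap easy to see; neither comes from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    DecisionTreeClassifier(random_state=0),  # an unpruned tree tends to memorize its training data
    X_demo, y_demo, cv=cv, return_train_score=True,
)

train_mean = results["train_score"].mean()
test_mean = results["test_score"].mean()
test_std = results["test_score"].std()
print(f"train accuracy: {train_mean:.3f}")
print(f"validation accuracy: {test_mean:.3f} (std across folds: {test_std:.3f})")
```

A training score well above the validation score points to overfitting, and a large standard deviation across folds points to an unstable estimate.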

Choosing the Value of 'k':
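The choice of 'k' depends on several factors, including the size of your dataset and your compute budget. Common values are 5 and 10; smaller values such as 3 are sometimes used for very large datasets. A smaller 'k' is cheaper to compute, but each model trains on a smaller share of the data, so the performance estimate tends to be pessimistically biased. A larger 'k' reduces that bias but increases computation time, and at the extreme (leave-one-out, where 'k' equals the number of samples) the estimate can have higher variance.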
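
The cost side of this trade-off is simple arithmetic: k-fold cross-validation fits k models, each on roughly (k-1)/k of the data. The sketch below assumes a hypothetical dataset of 1000 rows, used only for illustration.

```python
n_samples = 1000  # hypothetical dataset size, chosen only for illustration

summary = {}
for k in (2, 5, 10, n_samples):  # k == n_samples corresponds to leave-one-out
    test_size = n_samples // k            # rows held out in each fold
    train_size = n_samples - test_size    # rows each model is trained on
    summary[k] = (train_size, test_size)
    print(f"k={k:>4}: {k} model fits, {train_size} training rows, {test_size} test rows per fold")
```

Going from k=5 to k=10 doubles the number of fits while raising the training share from 80% to 90% of the data.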

Final Model Evaluation:
After performing k-fold cross-validation and obtaining the average performance metric, you can retrain the chosen model on all of the data that was used for cross-validation (if the results suggest it is a good model). To get a final estimate of real-world performance, evaluate that model once on a separate test set that was set aside before cross-validation and never used for training or model selection.
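
A minimal sketch of that workflow follows, assuming synthetic data and illustrative variable names (X_dev, X_holdout, and so on are not from this notebook). max_iter=1000 is set only to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 1. Set the hold-out test set aside before any cross-validation
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 2. Cross-validate on the development portion only
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_dev, y_dev, cv=cv)

# 3. Refit on all development data, then score once on the untouched hold-out set
final_model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
holdout_accuracy = final_model.score(X_holdout, y_holdout)

print(f"CV mean accuracy: {cv_scores.mean():.3f}")
print(f"Hold-out accuracy: {holdout_accuracy:.3f}")
```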

K-fold cross-validation is a valuable tool for assessing and selecting machine learning models, tuning hyperparameters, and ensuring that your model generalizes well to new data. It helps in building more reliable and robust models for various applications.

In [11]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [2]:
# Data import (raw string so the backslashes in the Windows path are not treated as escape sequences)
data = pd.read_csv(r'D:\GIT1\Machine-Learning\Decision Trees\diabetes_dataset.csv')
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [22]:
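# Assign features (X) and target variable (y).
# Note: 'Unnamed: 0' appears to be a saved row index rather than a measurement.
# It stays in X here; it could be excluded with data.drop(columns=['Unnamed: 0', 'Outcome']).
X = data.drop('Outcome', axis=1)
y = data['Outcome']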

In [21]:
# Initialize the k-fold splitter
k = 10
k_fold = KFold(n_splits=k, shuffle=True, random_state=42)
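
To see what a splitter like this produces, it can help to run KFold on a tiny array and print the indices. The toy array and the names toy_X and toy_cv below are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(20).reshape(10, 2)  # 10 toy rows, purely illustrative
toy_cv = KFold(n_splits=5, shuffle=True, random_state=42)

test_rows_seen = []
for fold, (train_idx, test_idx) in enumerate(toy_cv.split(toy_X), start=1):
    # split() yields integer position arrays, which is why the notebook indexes with .iloc
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
    test_rows_seen.extend(test_idx.tolist())
```

Each of the 10 rows appears in exactly one test fold, and each fold trains on the other 8 rows.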

In [20]:
# Initialize a list to store the per-fold accuracies.
# It is named accuracy_scores so it does not shadow sklearn's accuracy_score function.
accuracy_scores = []

# Iterate through each fold: fit on the training folds, score on the held-out fold
for train_index, test_index in k_fold.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    fold_model = LogisticRegression()
    fold_model.fit(X_train, y_train)
    accuracy_scores.append(accuracy_score(y_test, fold_model.predict(X_test)))

In [19]:
# Create a logistic regression model and fit it on the last fold's training split
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [18]:
# Predictions for the last fold's test split
y_pred = model.predict(X_test)
y_pred

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 1], dtype=int64)

In [10]:
# Calculate the mean accuracy across all folds
mean_accuracy = np.mean(accuracy_scores)
print(f'Mean Accuracy: {mean_accuracy:.2f}')

In [23]:
# Evaluate the model on the last fold's test split
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the evaluation metrics
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
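
The solver message above suggests scaling the data, and the per-fold loop can also be written more compactly with cross_val_score. The sketch below combines both ideas under stated assumptions: it uses synthetic data from make_classification rather than the diabetes file, and the names scaled_model and scores are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler sits inside the pipeline, so it is re-fit on each fold's training split
# and the held-out fold never influences the scaling statistics
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(scaled_model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Per-fold accuracy: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std: {scores.std():.2f})")
```

Keeping preprocessing inside the pipeline is a design choice that avoids leaking information from the test fold into training. Whether scaling removes the warning on the real dataset would need to be checked by running it.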