# Lesson 7: Algorithm Evaluation With Resampling Methods
* **Training dataset:** dataset used to train a machine learning algorithm.
* The dataset used to train an algorithm cannot be used to give you reliable estimates of the accuracy of the model on new data. 
* You can use statistical methods called **resampling methods** to split your training dataset into subsets: some are used to train the model, and others are held back and used to estimate the model's accuracy on unseen data.
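
As a minimal sketch of why held-back data matters, the example below compares accuracy on the training rows with accuracy on rows the model never saw. The synthetic dataset from `make_classification`, the decision tree model, and the 70/30 split are assumptions chosen for illustration; they are not part of this lesson's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.2, random_state=0)

# Hold back 30% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained decision tree can memorize its training rows
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("Accuracy on training data:", train_acc)
print("Accuracy on held-back data:", test_acc)
```

The training accuracy is typically much higher than the held-back accuracy here, which is why the former is not a reliable estimate of performance on new data.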

***

**Cross-validation** is a resampling method used to estimate how well a model will generalize to new data from the same population. It is especially useful when data is limited.

Common techniques:
* **k-fold Cross-Validation:** a popular choice for model evaluation because it offers a reasonable balance between bias and variance at a moderate computational cost, and it is easy to implement.
* **Leave-One-Out Cross-Validation (LOOCV):** gives a less biased estimate of performance, but it is computationally expensive, especially for large datasets, because the model is fit once per observation.
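
To make the cost difference concrete, here is a small sketch that runs both techniques on the same data and counts how many times the model is fit. The synthetic dataset of 100 rows is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption, for illustration only)
X, y = make_classification(n_samples=100, n_features=8, random_state=0)
model = LogisticRegression(solver='liblinear')

# k-fold: one fit per fold
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10))

# LOOCV: one fit per observation, each validated on a single row
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("k-fold:", len(kfold_scores), "fits, mean accuracy", kfold_scores.mean())
print("LOOCV: ", len(loo_scores), "fits, mean accuracy", loo_scores.mean())
```

With LOOCV each score is either 0 or 1 because every validation set holds exactly one row, so only the mean across all rows is informative.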

***
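Cross-validation is typically performed once a candidate model has been chosen, to assess how well that model performs. Note that cross-validation itself refits the model on each training subset rather than reusing a single trained model.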

Here's a typical workflow:
* **Data Preprocessing:** This may involve steps like handling missing values, feature scaling, feature engineering, and encoding categorical variables.
* **Model Training:** The model is trained on the training dataset using a chosen algorithm or combination of algorithms. During training, the model learns patterns and relationships in the data.
* **Model Evaluation with Cross-Validation:** The model's performance is evaluated using cross-validation techniques. The dataset is divided into training and validation subsets multiple times, and the model is refit and evaluated on each combination of data.
* **Performance Assessment:** Performance metrics are calculated for each fold or iteration of the cross-validation process. These metrics are typically averaged across all folds to obtain a single estimate of the model's performance.
* **Model Selection and Tuning:** Based on the performance metrics obtained from cross-validation, different models or variations of the same model may be compared. Hyperparameters of the model may be tuned to improve performance further.
* **Final Evaluation on Test Data:** Once the best model is selected and tuned, it is evaluated one final time on a separate test dataset that was not used during training or cross-validation.
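
The workflow above can be sketched roughly as follows. This is one possible arrangement, not the only one: the synthetic data, the `StandardScaler` preprocessing step, the candidate values for `C`, and the 5-fold setting are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption, for illustration only)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Set the final test set aside before any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing and model in one pipeline, so scaling is refit inside each fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Cross-validated tuning of the regularization strength C
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["clf__C"])
print("Mean cross-validated accuracy:", search.best_score_)

# Final, one-time evaluation on data never used for training or tuning
print("Accuracy on held-back test set:", search.score(X_test, y_test))
```

Keeping the preprocessing inside the pipeline means the scaler only ever learns from the training portion of each fold, which avoids leaking information from the validation rows.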

***
Validation is not always performed, but it's a recommended practice in machine learning model development. 

However, the specific approach to validation, whether through cross-validation or a separate validation set, depends on various factors such as the size of the dataset, computational resources available, and the specific problem being addressed.

In cases where the dataset is small, splitting it into separate training and validation sets may not be feasible. In such cases, cross-validation techniques like k-fold cross-validation or leave-one-out cross-validation are often preferred because they make more efficient use of the limited data.


 

In [1]:
# Evaluate using k-fold cross-validation
import pandas
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [2]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
colnames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Create a DataFrame
df = pandas.read_csv(url, names = colnames) 

In [3]:
# Extracts the data from the DataFrame and stores it in a NumPy array.
array = df.values

In [4]:
# Separate array into features and target variables
X = array[:,0:8] # Select all rows and columns 0 to 7
Y = array[:,8] # Select all rows and column 8

In [5]:
# Initialize logistic regression model
model = LogisticRegression(solver='liblinear')

In [6]:
# Define cross-validation split
kfold = KFold(n_splits=10) # Create 10 folds for cross-validation

In [7]:
# Perform cross-validation
results = cross_val_score(model, X, Y, cv=kfold)

In [14]:
# Calculate the mean and standard deviation of the accuracy scores obtained from cross-validation
print("Mean: ", results.mean()*100.0)
print("Standard deviation: ", results.std()*100.0)

Mean:  76.95146958304852
Standard deviation:  4.841051924567195


How k-fold cross-validation works in this context:
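* **Data Splitting:** The dataset is divided into k (in this case, 10) folds of approximately equal size. Because `KFold` does not shuffle by default, each fold is a block of consecutive rows. Each fold is used as a validation set once, while the remaining k-1 folds form the training set.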

* **Model Training and Evaluation:** The logistic regression model is trained k times, each time using a different combination of training and validation sets. For each fold, the model is trained on the training set and evaluated on the corresponding validation set.

* **Performance Estimation:** The performance metric (accuracy in this case) is computed for each fold. The final estimate of the model's performance is obtained by averaging the performance metrics across all k folds.
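
The three steps above can be written out as an explicit loop. This is a sketch of roughly what `cross_val_score` does for this setup, not its actual implementation. To keep it self-contained it uses synthetic data from `make_classification` (an assumption) instead of the diabetes dataset, so the printed numbers will differ from those shown earlier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the lesson's data (assumption, for illustration only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(solver='liblinear')
kfold = KFold(n_splits=10)

fold_scores = []
for train_idx, val_idx in kfold.split(X):
    # Data splitting: k-1 folds for training, 1 fold for validation
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    # Model training and evaluation: fit a fresh copy of the model on this fold
    fold_model = clone(model)
    fold_model.fit(X_train, y_train)
    fold_scores.append(fold_model.score(X_val, y_val))

# Performance estimation: average the per-fold accuracies
fold_scores = np.array(fold_scores)
print("Mean: ", fold_scores.mean() * 100.0)
print("Standard deviation: ", fold_scores.std() * 100.0)
```

Using `clone` gives each fold an unfitted copy of the model, so nothing learned on one fold carries over to the next.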