<center>
  <a href="MLSD-05-FeatureSelection-Ex-1.ipynb" target="_self">Feature Selection Exercise 1</a> | <a href="./">Content Page</a> | <a href="MLSD-07-ModelEvaluation-A.ipynb">Model Evaluation A</a> | <a href="MLSD-07-ModelEvaluation-Ex-1.ipynb">Model Evaluation Exercise 1</a>
</center>

# <center>MACHINE LEARNING ERRORS A</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Definition
- A Machine Learning Error is a prediction that is inaccurate or wrong.
- Error measures how accurately a model predicts, both on the data it learned from and on new, unseen data.
- Based on the error, we choose the machine learning model that performs best for a particular dataset.

# Types
- <b>Irreducible errors</b> are errors that will always be present in a machine learning model, because of unknown variables that directly influence the output. They cannot be reduced, however good the model is.
- <b>Reducible errors</b> are errors that can be reduced further to improve a model. They arise because the model’s output function does not match the desired output function, and they can be optimized. Reducible error has two components: <b>Bias</b> and <b>Variance</b>.

![image.png](attachment:image.png)

In Machine Learning, an <b>error</b> is the difference between the <b>Actual</b> and <b>Predicted</b> output.<br>
The total error is the sum of the reducible error (bias and variance) and the irreducible error.
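
For squared-error loss, this sum is commonly written as the bias-variance decomposition. The notation is an assumption introduced here for illustration and does not come from the original notes: $y = f(x) + \varepsilon$ is the actual output with noise variance $\sigma^2$, and $\hat{f}$ is the trained model.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible error}}
$$

The first two terms are the reducible error; the last term remains even for a perfect model.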

# What is Bias?
Bias is the distance between the predicted and actual values, i.e., how far the model's predictions are, on average, from the actual values.<br>
- <b>High Bias</b>: the average predicted values are far from the actual values.
- <b>Low Bias</b>: the distance between the predicted and actual values is minimal.

<b>High Bias</b> causes the algorithm to miss a dominant pattern or relationship between the input and output variables. <br>
If bias is too high, the model performs badly and accuracy is low on both training and testing data; this is <b>underfitting</b>.

# What is Variance?
Variance describes how much a model's predictions change when it is trained on different data, i.e., how scattered the <b>predicted values</b> are. A model that predicts well on the training dataset but fails on independent, unseen data (the testing dataset) has high variance.

- <b>High Variance</b>: predictions are widely scattered; the model has fitted noise and irrelevant detail in the training data, which causes <b>overfitting</b>.
- <b>Low Variance</b>: predictions are less scattered.
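
The two definitions above can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the ground-truth function `true_f`, the noise level, and the polynomial degrees are hypothetical choices, not part of the original material. It refits a simple model (degree 1) and a flexible model (degree 8) on many noisy training sets and measures squared bias and variance of the prediction at one point.

```python
import numpy as np

def true_f(x):
    """Hypothetical ground-truth function, used only for this illustration."""
    return np.sin(2 * np.pi * x)

def bias_variance_at_point(degree, x0=0.25, n_repeats=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit at the point x0."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, 30)                 # fixed training inputs
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # Draw a fresh noisy training set and refit the model
        y = true_f(x) + rng.normal(0, noise_sd, x.size)
        coefs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # average prediction vs. actual value
    variance = preds.var()                      # scatter of predictions across training sets
    return bias_sq, variance

simple_bias_sq, simple_var = bias_variance_at_point(degree=1)
flexible_bias_sq, flexible_var = bias_variance_at_point(degree=8)
print("Degree 1 - bias^2:", simple_bias_sq, "variance:", simple_var)
print("Degree 8 - bias^2:", flexible_bias_sq, "variance:", flexible_var)
```

In this setup the straight line shows the larger squared bias, while the degree-8 polynomial shows the larger variance.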

# Possibilities
LV = Low Variance<br>
HV = High Variance<br>
LB = Low bias <br>
HB = High Bias<br>

**LB & LV (ideal scenario, best model)**<br>
LB - All predicted data are similar to the actual data. The distance between predicted and actual data points is very small.<br>
LV - Data points are not scattered. They are close to each other.

**LB & HV (model is somewhat accurate but inconsistent)**<br>
LB - All predicted data are similar to the actual data. The distance between predicted and actual data points is very small.<br>
HV - Data points are scattered and are away from the actual data.

**HB & LV (model is consistent but inaccurate)**<br>
LV - Data points are not scattered. They are close to each other.<br>
HB - Predicted data points are far away from the actual data. The distance between predicted and actual data is high (error), i.e., the prediction is not close to the actual value.

**HB & HV (model is inconsistent and inaccurate)**<br>
HB - Predicted data points are far away from the actual data. The distance between predicted and actual data is high (error), i.e., the prediction is not close to the actual value.<br>
HV - Data points are scattered and are away from the actual data.

![image-2.png](attachment:image-2.png)

# Over-fitting and Under-fitting
<b>Under-fitting</b>: the model is unable to capture the relationship between the input and output variables accurately, i.e., it can neither model the training data nor generalize to new data (test data). Accuracy is therefore low on both the training and testing datasets.

Example: training accuracy is 65% and testing accuracy is 50%. The error rate is high on both because the model fails to capture the relationship between the input and output variables (under-fitted models usually perform badly on both training and testing data).

<b>Right-fitting</b>: the model performs well on both the training and testing datasets. It captures the relationship between the input and output variables accurately.

Example: training accuracy is 85% and testing accuracy is 80–90%, giving a low error rate on both.

<b>Over-fitting</b>: the model fits the training data too well but fails on testing data.

Example: training accuracy is 95% and testing accuracy is 70%. The gap between training and testing performance is large because the model does well on the training data but fails on the testing data.
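
The three examples above can be turned into a rough rule of thumb. The function below is a hypothetical helper, and its thresholds (`min_acc`, `max_gap`) are illustrative assumptions rather than standard values; suitable cut-offs depend on the problem. For the right-fitting case, 85% is used as the midpoint of the 80–90% testing range.

```python
def diagnose_fit(train_acc, test_acc, min_acc=0.75, max_gap=0.10):
    """Label a model from its train/test accuracy using assumed thresholds."""
    if train_acc < min_acc:
        # Poor even on the data it learned from
        return "under-fitting"
    if train_acc - test_acc > max_gap:
        # Good on training data, much worse on unseen data
        return "over-fitting"
    return "right-fitting"

examples = {
    "Example 1": (0.65, 0.50),
    "Example 2": (0.85, 0.85),
    "Example 3": (0.95, 0.70),
}
for name, (train_acc, test_acc) in examples.items():
    print(name, "->", diagnose_fit(train_acc, test_acc))
```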

![image-2.png](attachment:image-2.png)

# Reasons for Under-fitting
- Data used for training is not cleaned and contains noise (garbage values) in it.
- The model has a high bias.
- The size of the training dataset used is not enough.
- The model is too simple.

# Handling Under-fitting
- Increase the number of features in the dataset.
- Increase model complexity.
- Reduce noise in the data.
- Increase the training duration (for example, more training iterations or epochs).
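
As a minimal sketch of the first two remedies, the code below fits a straight line to data with a curved relationship and then adds a squared feature. The data and the helper `training_mse` are assumptions made up for this illustration.

```python
import numpy as np

x = np.linspace(-3, 3, 50)
y = x ** 2  # hypothetical target with a curved relationship

def training_mse(features, target):
    """Fit ordinary least squares and return the mean squared error on the training data."""
    design = np.column_stack([np.ones(len(target))] + features)  # add an intercept column
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.mean((design @ weights - target) ** 2))

mse_linear = training_mse([x], y)             # too simple: under-fits
mse_quadratic = training_mse([x, x ** 2], y)  # extra feature captures the curve
print("Linear model MSE   :", mse_linear)
print("Quadratic model MSE:", mse_quadratic)
```

The linear model cannot represent the curve, so its training error stays high; adding the squared feature brings the error close to zero.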

# Reasons for Over-fitting
- Data used for training is not cleaned and contains noise (garbage values) in it.
- The model has a high variance.
- The size of the training dataset used is not enough.
- The model is too complex.

# Handling Over-fitting
- Using K-fold cross-validation to detect over-fitting and to guide model selection.
- Using Regularization techniques such as Lasso and Ridge.
- Training model with sufficient data.
- Adopting ensembling techniques.
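
To show what regularization does, here is a minimal sketch of Ridge regression using its closed-form solution in NumPy. The synthetic data, the absence of an intercept, and the penalty strength `alpha=10.0` are assumptions for illustration; in practice an estimator such as `sklearn.linear_model.Ridge` or `Lasso` would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                 # few samples, many features
y = X[:, 0] + rng.normal(scale=0.5, size=20)  # only the first feature matters

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_unregularized = ridge_weights(X, y, alpha=0.0)
w_ridge = ridge_weights(X, y, alpha=10.0)
print("Weight norm without penalty:", np.linalg.norm(w_unregularized))
print("Weight norm with penalty   :", np.linalg.norm(w_ridge))
```

The penalty shrinks the weights, which limits how closely the model can chase noise in the training data.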

# Tasks
- To read in and explore a data set.
- To use <b>K-fold cross-validation</b> to obtain a more reliable estimate of model performance.
- To use <b>hyperparameter tuning (Grid Search)</b> to improve model performance.

# K-fold Cross-Validation

## Read in and Explore Data Set

In [None]:
# Import libraries
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import tree
import pandas as pd

In [None]:
# Read in data
df = pd.read_csv("./data/diabetes/diabetes.csv")
df.head()

In [None]:
# Shape of dataframe
df.shape

In [None]:
# Independent and dependent variables
X = df.iloc[:,:8].values # independent variables
y = df['class'].values # dependent variables

## Normalize Data

In [None]:
# Normalize data (fit_transform returns a new array, so assign the result back to X)
sc = StandardScaler()
X = sc.fit_transform(X)
X

In [None]:
# Split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2017)
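
A common alternative to scaling the full dataset before splitting is to fit the scaler on the training split only and reuse it on the test split, so that no test-set statistics reach the model. The sketch below uses randomly generated stand-in data (`X_demo`, `y_demo` are hypothetical), not the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2017)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # stand-in features
y_demo = rng.integers(0, 2, size=200)                     # stand-in binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=2017)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training split only
X_te_scaled = scaler.transform(X_te)      # apply the same transformation to the test split
print(X_tr_scaled.mean(axis=0), X_te_scaled.shape)
```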

In [None]:
# Build a decision tree classifier
clf = tree.DecisionTreeClassifier(random_state=2017)
clf.fit(X_train, y_train)

## Evaluate Model Performance Without Cross-Validation

In [None]:
# Import metrics
from sklearn import metrics

In [None]:
# Generate evaluation metrics for train set
print ("Train - Accuracy :", metrics.accuracy_score(y_train, clf.predict(X_train)))
print ("Train - AUC :", metrics.roc_auc_score(y_train, clf.predict_proba(X_train)[:,1]))

In [None]:
# Generate evaluation metrics for test set
print ("Test - Accuracy :", metrics.accuracy_score(y_test, clf.predict(X_test)))
print ("Test - AUC :", metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:,1]))

**Observations**
- <b>Over-fitting</b>: a model that models the training data too well (Training accuracy = 1.0) but fails with testing data (Test accuracy = 0.69).
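
An unrestricted decision tree can keep splitting until it memorizes the training data, which is one reason for a perfect training score. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, since the diabetes file is not reproduced here) by comparing an unrestricted tree with one limited to `max_depth=3`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with about 10% randomly assigned labels (flip_y)
X_syn, y_syn = make_classification(n_samples=1000, n_features=8, n_informative=4,
                                   flip_y=0.1, random_state=2017)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=2017)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=2017).fit(X_tr, y_tr)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print("max_depth =", depth, "train/test accuracy:", results[depth])
```

Comparing the train and test accuracy for each depth shows how limiting complexity changes the gap between them.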

## Evaluate Model Performance With 10-fold Cross-Validation

In [None]:
# Evaluate the model using 10-fold cross-validation (the scoring metric is accuracy)
train_scores = cross_val_score(clf, X_train, y_train, scoring='accuracy', cv=10)
test_scores = cross_val_score(clf, X_test, y_test, scoring='accuracy', cv=10)
print("Train Fold Accuracy Scores: ", train_scores)
print("Train Mean CV Accuracy Score: ", train_scores.mean())
print("\nTest Fold Accuracy Scores: ", test_scores)
print("Test Mean CV Accuracy Score: ", test_scores.mean())

**Observations**
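- Cross-validation gives a more realistic estimate of performance: mean CV accuracy is 0.72 on the training data and 0.73 on the test data, compared with 1.0 and 0.69 from the single train/test evaluation. The model itself is unchanged; each score now comes from data the model did not see during fitting.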
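
To make clear what K-fold cross-validation does, here is a minimal pure-Python sketch of how samples are divided into folds. `k_fold_indices` is a hypothetical helper written for this illustration. It performs plain, unshuffled K-fold; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which additionally preserve class proportions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, validation) in enumerate(folds, start=1):
    print("Fold", i, "validate on", validation, "train on", train)
```

Each sample is used for validation exactly once, and the model is scored only on samples it was not trained on in that fold.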

# Grid Search
- Specify a grid of hyperparameter values to try.
- Build a model for each combination of values in the grid, evaluate it with cross-validation, and report the best combination in the whole grid.
- The output is the model trained with the best combination from the grid.
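
The steps above can be sketched as a plain loop. This is an illustration only: `toy_cv_score` is a hypothetical stand-in for "train the model with these parameters and return its mean cross-validation score", and the grid values are made up.

```python
from itertools import product

def toy_cv_score(params):
    """Hypothetical stand-in for a mean cross-validation score; best at C=10, gamma=0.01."""
    return 1.0 - abs(params["C"] - 10) / 100 - abs(params["gamma"] - 0.01)

grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

names = list(grid)
best_params, best_score = None, float("-inf")
for values in product(*(grid[name] for name in names)):  # every combination in the grid
    params = dict(zip(names, values))
    score = toy_cv_score(params)
    if score > best_score:                               # keep the best combination so far
        best_params, best_score = params, score

print("Best parameters:", best_params)
print("Best score:", best_score)
```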

## Read in and Explore Data Set

In [None]:
# Import libraries
from sklearn.datasets import load_breast_cancer

In [None]:
# Read in data
data = load_breast_cancer()
X = data.data
y = data.target
print(X.shape, data.feature_names)

## Split Dataset

In [None]:
# Split dataset into train (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC # Support Vector Classification

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)

## Build SVM Model

In [None]:
# Build default SVM model
model = SVC(random_state=42)
model.fit(X_train, y_train)

## Evaluate Model Performance

In [None]:
# Import metrics
from sklearn import metrics

In [None]:
# Generate evaluation metrics for test set
print ("Test - Accuracy :", metrics.accuracy_score(y_test, model.predict(X_test)))
print ("Test - Confusion matrix :\n",metrics.confusion_matrix(y_test, model.predict(X_test)))
print ("Test - classification report :\n", metrics.classification_report(y_test, model.predict(X_test)))

**Observations**
- Test accuracy at 0.935.

## Hyperparameter Tuning using GridSearchCV

In [None]:
# Import libraries
from sklearn.model_selection import GridSearchCV

In [None]:
# Setting the parameter grid
grid_parameters = {'kernel': ['linear', 'rbf'],
                   'gamma': [1e-3, 1e-4],
                   'C': [1, 10, 50, 100]}
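
The size of the grid determines the cost of the search. The sketch below counts the combinations in the grid above and in an alternative written as a list of two dictionaries, a form `GridSearchCV` also accepts. The alternative drops `gamma` for the linear kernel, where `SVC` does not use it. `count_combinations` and `split_grid` are hypothetical names introduced for this illustration.

```python
from itertools import product

def count_combinations(param_grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = param_grid if isinstance(param_grid, list) else [param_grid]
    return sum(len(list(product(*g.values()))) for g in grids)

full_grid = {'kernel': ['linear', 'rbf'],
             'gamma': [1e-3, 1e-4],
             'C': [1, 10, 50, 100]}

# Alternative: only pair gamma with the kernel that actually uses it
split_grid = [
    {'kernel': ['linear'], 'C': [1, 10, 50, 100]},
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 50, 100]},
]

n_full = count_combinations(full_grid)    # 2 * 2 * 4 = 16
n_split = count_combinations(split_grid)  # 4 + 8 = 12
print("Full grid :", n_full, "combinations,", n_full * 5, "fits with cv=5")
print("Split grid:", n_split, "combinations,", n_split * 5, "fits with cv=5")
```

With 5-fold cross-validation, the full grid needs 80 model fits and the split grid 60, before the final refit on the best combination.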

In [None]:
# Perform hyperparameter tuning
print("# Tuning hyper-parameters for accuracy\n")
clf = GridSearchCV(SVC(random_state=42), grid_parameters, cv=5, scoring='accuracy')
clf.fit(X_train, y_train)

# View accuracy scores for all the models
print("Grid scores for all the models based on CV:\n")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))
    
# Check out best model performance
print("\nBest parameters set found on development set:", clf.best_params_)
print("Best model validation accuracy:", clf.best_score_)

In [None]:
# Evaluate
model = clf.best_estimator_

# Generate evaluation metrics for test set
print('\n\nTuned Model Stats:')
print ("Test - Accuracy :", metrics.accuracy_score(y_test, model.predict(X_test)))
print ("Test - Confusion matrix :\n", metrics.confusion_matrix(y_test, model.predict(X_test)))
print ("Test - classification report :\n", metrics.classification_report(y_test, model.predict(X_test)))

**Observations**
- Test accuracy improved from 0.935 to 0.97 after hyperparameter tuning.
