# Model Evaluation

Dummy Classifiers

Confusion Matrices and Evaluation Metrics

ROC curves, AUC (Area under curve)

Evaluation measures for multi-class classification

Model Selection/Optimization: Grid Search

## Dummy Classifiers
DummyClassifier is a classifier that makes predictions using simple rules, which can be useful as a baseline for comparison against actual classifiers, especially with imbalanced classes.

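- Provides null metrics (e.g. null accuracy): the scores a model gets without learning anything from the features.

**When the positive class is in the minority, compare against the dummy baseline with a metric such as F1 rather than accuracy alone.**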

`DummyClassifier` strategies:

        most_frequent
        stratified
        uniform
        constant

`DummyRegressor` strategies:

        mean
        median
        quantile
        constant

In [None]:
from sklearn.dummy import DummyClassifier

# The dummy 'most_frequent' classifier always predicts majority (most frequent) class.
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)

y_dummy_predictions = dummy_majority.predict(X_test)
y_dummy_predictions

dummy_majority.score(X_test, y_test)

**Stratified Strategy**

Produces random predictions with the same class proportions as the training set.

In [None]:
dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)

If a classifier's accuracy is close to the null accuracy baseline, possible causes include:
- Ineffective, erroneous, or missing features
- Poor choice of kernel or hyperparameters
- Large class imbalance
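
As a rough illustration of the null accuracy baseline, the sketch below fits a `most_frequent` dummy on a made-up 90/10 label vector (`X_demo` and `y_demo` are assumptions for illustration, not the notebook's data). The dummy reaches 90% accuracy without ever predicting a positive, so its F1 is zero.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
# A dummy classifier ignores the features, so a column of zeros is enough.
X_demo = np.zeros((100, 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
dummy_pred = dummy.predict(X_demo)

null_accuracy = dummy.score(X_demo, y_demo)
# zero_division=0 avoids a warning when no positives are predicted.
null_f1 = f1_score(y_demo, dummy_pred, zero_division=0)

print('Null accuracy: {:.2f}'.format(null_accuracy))
print('Null F1: {:.2f}'.format(null_f1))
```

A real classifier that scores around 0.90 accuracy on such data has not necessarily learned anything.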

## Confusion matrices

In [None]:
# Binary (two-class) confusion matrix

from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_test, y_predicted)
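
For reference, scikit-learn lays out the binary confusion matrix with true classes as rows and predicted classes as columns, so `ravel()` yields TN, FP, FN, TP in that order. A small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative names, not the notebook's variables):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight instances.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
# Row 0 holds the true negatives' row (TN, FP); row 1 holds (FN, TP).
tn, fp, fn, tp = cm.ravel()

print(cm)
print('TN={}, FP={}, FN={}, TP={}'.format(tn, fp, fn, tp))
```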

## Evaluation metrics for binary classification

- Accuracy = (TP + TN) / (TP + TN + FP + FN)

- Precision = TP / (TP + FP)

- Recall = TP / (TP + FN), also known as sensitivity or true positive rate

- F1 = 2 * Precision * Recall / (Precision + Recall)
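
A worked example may help here. The helper below (`binary_metrics` is a hypothetical name, and the counts are invented) computes the four metrics directly from confusion-matrix counts. It is only a sketch and does not guard against division by zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts chosen only for illustration.
accuracy, precision, recall, f1 = binary_metrics(tp=30, fp=10, fn=20, tn=40)

print('Accuracy: {:.2f}'.format(accuracy))
print('Precision: {:.2f}'.format(precision))
print('Recall: {:.2f}'.format(recall))
print('F1: {:.2f}'.format(f1))
```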

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))

**Combined report with all above metrics**

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_predicted, target_names=['not 1', '1']))

#### Tradeoff between precision and recall
Recall-oriented machine learning tasks:
- Search and information extraction in legal discovery
- Tumor detection
- Often paired with a human expert to filter out false positives

Precision-oriented machine learning tasks:
- Search engine ranking, query suggestion
- Document classification
- Many customer-facing tasks (users remember failures!)

### Decision Function and Predicted Probability

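Many classifiers expose a `decision_function` method, which returns a signed score per instance (the sign gives the predicted class, the magnitude the confidence), and a `predict_proba` method, which returns per-class probabilities. Moving the threshold applied to either output trades precision against recall.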

In [None]:
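from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X and y_binary_imbalanced are assumed to be defined earlier in the notebook.
lr = LogisticRegression()  # assumption: lr is a logistic regression model

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))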

# show the decision_function scores for first 20 instances
y_score_list

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))

# show the probability of positive class for first 20 instances
y_proba_list
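
To see why these scores matter, the sketch below applies three different thresholds to a made-up list of scores and recomputes precision and recall each time. The helper `precision_recall_at` and the demo data are hypothetical, not part of scikit-learn or the notebook.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, t in zip(predicted, y_true) if p and t == 1)
    fp = sum(1 for p, t in zip(predicted, y_true) if p and t == 0)
    fn = sum(1 for p, t in zip(predicted, y_true) if not p and t == 1)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels and classifier scores, sorted by score for readability.
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores_demo = [0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.3, 0.5, 0.65):
    p, r = precision_recall_at(threshold, y_true_demo, scores_demo)
    print('threshold={:.2f}  precision={:.2f}  recall={:.2f}'.format(threshold, p, r))
```

Lowering the threshold raises recall at the cost of precision, and raising it does the opposite.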

### Precision-Recall curves
The “steepness” of P-R curves is important:
- The top right corner is the “ideal” point, where precision = 1.0 and recall = 1.0.
- The goal is to maximize precision and recall together.

In [None]:
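import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
# Locate the threshold closest to zero, the default cutoff for decision_function scores
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]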

plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
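# gca() returns the current axes; in recent matplotlib versions plt.axes() adds a new, empty axes instead
plt.gca().set_aspect('equal')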
plt.show()
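
Besides plotting, the arrays returned by `precision_recall_curve` can be used to pick an operating threshold. The sketch below, on made-up scores, selects the highest threshold that still reaches a target recall of 0.8. The target value and the demo data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and classifier scores.
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores_demo = np.array([0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true_demo, scores_demo)

# precision and recall have one more entry than thresholds: the final
# (precision=1, recall=0) point has no corresponding threshold.
target_recall = 0.8
meets_target = recall[:-1] >= target_recall - 1e-9

# Thresholds are returned in increasing order, so the last index that
# meets the target is the highest qualifying threshold.
best_idx = np.where(meets_target)[0][-1]
chosen_threshold = thresholds[best_idx]

print('Chosen threshold: {:.2f}'.format(chosen_threshold))
print('Precision: {:.2f}, recall: {:.2f}'.format(precision[best_idx], recall[best_idx]))
```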

## ROC curves, Area-Under-Curve (AUC)
The “steepness” of ROC curves is important:
- The top left corner is the “ideal” point, with a false positive rate of zero and a true positive rate of one.
- The goal is to maximize the true positive rate while minimizing the false positive rate.

In [None]:
from sklearn.metrics import roc_curve, auc

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
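# gca() returns the current axes; in recent matplotlib versions plt.axes() adds a new, empty axes instead
plt.gca().set_aspect('equal')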
plt.show()

#### AUC = 0 (worst), AUC = 1 (best)
AUC can be interpreted as:
1. The total area under the ROC curve.
2. The probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example.
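
The second interpretation can be checked numerically. The sketch below counts, over all positive/negative pairs in a made-up score list, how often the positive is scored higher (ties count as half), and compares the result with `roc_auc_score`. The demo data is an assumption for illustration.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores.
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores_demo = [0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]

pos = [s for s, t in zip(scores_demo, y_true_demo) if t == 1]
neg = [s for s, t in zip(scores_demo, y_true_demo) if t == 0]

# Fraction of (positive, negative) pairs where the positive is ranked higher.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(y_true_demo, scores_demo)

print('Pairwise estimate: {:.2f}'.format(pairwise_auc))
print('roc_auc_score:     {:.2f}'.format(sklearn_auc))
```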

Advantages:
- Gives a single number for easy comparison.
- Does not require specifying a decision threshold.

Drawbacks:
- As with other single-number metrics, AUC loses information, e.g. about tradeoffs and the shape of the ROC curve.
- This may be a factor to consider when e.g. wanting to compare the performance of classifiers with overlapping ROC curves.

## Evaluation measures for multi-class classification

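Multi-class evaluation extends the binary case: it is a collection of true vs. predicted binary outcomes, one per class.

**Multi-label classification, where each instance can have multiple labels, is a separate setting.**

The support (number of instances) for each class is important to consider, e.g. in the case of imbalanced classes.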
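
The "one binary outcome per class" view can be read straight off a multi-class confusion matrix. The sketch below uses an invented 3-class matrix (rows are true classes, columns are predicted classes, following scikit-learn's layout) to derive per-class counts, support, precision, and recall.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 8]])

tp = np.diag(cm)               # correctly predicted instances of each class
fp = cm.sum(axis=0) - tp       # predicted as the class, but actually another class
fn = cm.sum(axis=1) - tp       # instances of the class predicted as something else
tn = cm.sum() - tp - fp - fn   # everything else
support = cm.sum(axis=1)       # number of true instances per class

precision_per_class = tp / (tp + fp)
recall_per_class = tp / (tp + fn)

print('Support:  ', support)
print('Precision:', np.round(precision_per_class, 2))
print('Recall:   ', np.round(recall_per_class, 2))
```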

In [None]:
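# Imports (assumed here; they may already be loaded earlier in the notebook)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Load data and Train Test Split
dataset = load_digits()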
X, y = dataset.data, dataset.target
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y, random_state=0)

# Model Creation and predictions
svm = SVC(kernel = 'linear').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)

# Confusion Matrix
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc,
                     index = range(10), columns = range(10))

# Plotting a heatmap of Confusion Matrix
plt.figure(figsize=(5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM Linear Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, 
                                                                       svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label')

**Multi-class Evaluation Metrics via the "Average" Parameter for a Scoring Function**
- Micro: Metric on aggregated instances
- Macro: Mean per-class metric, classes have equal weight
- Weighted: Mean per-class metric, weighted by support
- Samples: for multi-label problems only

#### Multi-class classification report

In [None]:
print(classification_report(y_test_mc, svm_predicted_mc))

### Micro- vs. macro-averaged metrics

If some classes are much larger (more instances) than others, and you want to:
- Weight your metric toward the largest ones, use micro-averaging.
- Weight your metric toward the smallest ones, use macro-averaging.


**Macro-average** : Each class has equal weight
1. Compute metric within each class
2. Average resulting metrics across classes

In [None]:
print('Macro-averaged precision = {:.2f} (treat classes equally)'
      .format(precision_score(y_test_mc, svm_predicted_mc, average = 'macro')))
print('Macro-averaged f1 = {:.2f} (treat classes equally)'
      .format(f1_score(y_test_mc, svm_predicted_mc, average = 'macro')))

**Micro-average**: Each instance has equal weight. Largest classes have most influence.
1. Aggregate outcomes across all classes
2. Compute metric with aggregate outcomes

In [None]:
print('Micro-averaged precision = {:.2f} (treat instances equally)'
      .format(precision_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Micro-averaged f1 = {:.2f} (treat instances equally)'
      .format(f1_score(y_test_mc, svm_predicted_mc, average = 'micro')))

- If the micro-average is much lower than the macro-average then examine the larger classes for poor metric performance.
- If the macro-average is much lower than the micro-average then examine the smaller classes for poor metric performance.
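
The difference between the two averages is easiest to see with numbers. The sketch below uses invented per-class counts (one large class predicted well, one small class predicted poorly) and computes precision both ways by hand.

```python
# Hypothetical per-class counts: a large class that is predicted well
# and a small class that is predicted poorly.
counts = {
    'large': {'tp': 90, 'fp': 10},
    'small': {'tp': 1, 'fp': 9},
}

# Macro: compute precision within each class, then average across classes.
per_class = [c['tp'] / (c['tp'] + c['fp']) for c in counts.values()]
macro_precision = sum(per_class) / len(per_class)

# Micro: aggregate the counts across classes, then compute precision once.
total_tp = sum(c['tp'] for c in counts.values())
total_fp = sum(c['fp'] for c in counts.values())
micro_precision = total_tp / (total_tp + total_fp)

print('Macro-averaged precision: {:.2f}'.format(macro_precision))
print('Micro-averaged precision: {:.2f}'.format(micro_precision))
```

Here the macro-average is much lower than the micro-average, which points at the small class as the one performing poorly.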

## Regression evaluation metrics

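- r2 Score: the proportion of variance in the target that the model explains, used as an indication of how well future instances will be predicted. Best score: 1.0. A model that always predicts the mean of the targets scores 0.0, and worse models can score below zero.

Alternative metrics include:
- mean_absolute_error (absolute difference of target & predicted values)
- mean_squared_error (squared difference of target & predicted values)
- median_absolute_error (robust to outliers)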
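
A small sketch of these metrics on four made-up target values follows. The "baseline" simply predicts the mean of the targets, which is what a mean-strategy dummy regressor would do on its own training data. All variable names here are illustrative assumptions.

```python
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)

# Hypothetical targets and model predictions.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

# Baseline that always predicts the mean of the targets.
mean_value = sum(y_true_demo) / len(y_true_demo)
y_baseline = [mean_value] * len(y_true_demo)

mae = mean_absolute_error(y_true_demo, y_pred_demo)
mse = mean_squared_error(y_true_demo, y_pred_demo)
medae = median_absolute_error(y_true_demo, y_pred_demo)
r2_model = r2_score(y_true_demo, y_pred_demo)
r2_baseline = r2_score(y_true_demo, y_baseline)

print('MAE: {:.3f}  MSE: {:.3f}  Median AE: {:.3f}'.format(mae, mse, medae))
print('r2 (model): {:.3f}  r2 (mean baseline): {:.3f}'.format(r2_model, r2_baseline))
```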

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print("Mean squared error (dummy): {:.2f}".format(mean_squared_error(y_test, 
                                                                     y_predict_dummy_mean)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_predict)))

print("r2_score (dummy): {:.2f}".format(r2_score(y_test, y_predict_dummy_mean)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_predict)))

## Model selection using evaluation metrics

Train/test on same data
- Single metric.
- Typically overfits and likely won't generalize well to new data.
- But can serve as a sanity check: low accuracy on the training set may indicate an implementation problem.

Single train/test split
- Single metric.
- Speed and simplicity.
- Lack of variance information.

K-fold cross-validation
- K train-test splits.
- Average metric over all splits.
- Can be combined with parameter grid search: GridSearchCV (default cv = 5 since scikit-learn 0.22; earlier versions used cv = 3).
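
As a minimal sketch of k-fold cross-validation, the block below runs `cross_val_score` on the iris dataset with a linear SVM. The dataset and classifier are assumptions picked for speed, not the notebook's own setup. Unlike a single split, this yields one score per fold, so the spread is visible too.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
clf = SVC(kernel='linear', C=1)

# One accuracy value per fold; cv=5 is set explicitly rather than
# relying on the version-dependent default.
fold_scores = cross_val_score(clf, X_demo, y_demo, cv=5)

print('Per-fold accuracy:', fold_scores)
print('Mean: {:.3f}, std: {:.3f}'.format(fold_scores.mean(), fold_scores.std()))
```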

## Grid Search CV

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

# Metrics that can be passed as the `scoring` argument
eval_metric = ('precision', 'recall', 'f1', 'roc_auc')

# After train test split
clf = SVC(kernel='rbf')
# Grid values: use one grid at a time; assigning both would silently discard the first
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
# grid_values = {'class_weight':['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}]}

# metric to optimize over grid parameters: accuracy (the default for classifiers)
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values, scoring = 'accuracy')

grid_clf_acc.fit(X_train, y_train)
print('Grid best parameter (max. accuracy): ', grid_clf_acc.best_params_)
print('Grid best score (accuracy): ', grid_clf_acc.best_score_)

# decision_function is delegated to the best estimator found by the search
y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test)
print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_acc))
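
Accuracy is only one option for `scoring`. The sketch below passes several metrics to a single `GridSearchCV` run and uses `refit` to name the one that selects the best parameters. The synthetic imbalanced dataset, the shortened gamma grid, and `cv=3` are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic imbalanced data (roughly 80% negatives), for illustration only.
X_demo, y_demo = make_classification(n_samples=300, weights=[0.8, 0.2],
                                     random_state=0)

eval_metrics = ['precision', 'recall', 'f1', 'roc_auc']
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'gamma': [0.001, 0.01, 0.1]},
                    scoring=eval_metrics,
                    refit='roc_auc',  # metric used to choose best_params_
                    cv=3)
grid.fit(X_demo, y_demo)

# cv_results_ holds one 'mean_test_<metric>' array per requested metric.
for metric in eval_metrics:
    best = grid.cv_results_['mean_test_' + metric].max()
    print('Best mean CV {}: {:.3f}'.format(metric, best))
print('Best gamma by roc_auc:', grid.best_params_['gamma'])
```

Different metrics can favor different parameter values, so it is worth checking whether the chosen `gamma` changes when `refit` is switched to another metric.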

#### Evaluation metrics supported for model selection

In [None]:
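# sklearn.metrics.scorer.SCORERS is no longer available in recent scikit-learn releases;
# get_scorer_names() (scikit-learn >= 1.0) lists the supported scorer names.
from sklearn.metrics import get_scorer_names

print(sorted(get_scorer_names()))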

### Training, Validation, and Test Framework for Model Selection and Evaluation

- Using only cross-validation or a test set for model selection may lead to subtle overfitting and optimistic generalization estimates.

- Instead, use three data splits:
    1. Training set (model building)
    2. Validation set (model selection)
    3. Test set (final evaluation)

In practice:
- Create an initial training/test split
- Do cross-validation on the training data for model/parameter selection
- Save the held-out test set for final model evaluation
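
The three steps above can be sketched as follows. The synthetic dataset, the gamma grid, and the variable names are assumptions for illustration. The point is only that the test set is used once, after the search has finished.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, random_state=0)

# 1. Initial training/test split; the test set is not touched during selection.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# 2. Cross-validation on the training data only, for parameter selection.
search = GridSearchCV(SVC(kernel='rbf'), {'gamma': [0.001, 0.01, 0.1]}, cv=5)
search.fit(X_tr, y_tr)

# 3. Final evaluation of the refit best model on the held-out test set.
test_accuracy = search.score(X_te, y_te)

print('Selected parameters:', search.best_params_)
print('Mean CV accuracy: {:.3f}'.format(search.best_score_))
print('Held-out test accuracy: {:.3f}'.format(test_accuracy))
```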

## Concluding Notes

- Accuracy is often not the right evaluation metric for real-world machine learning tasks.
- False positives and false negatives may need to be treated very differently.
- Make sure you understand the needs of your application and choose an evaluation metric that matches your application, user, or business goals.

Examples of additional evaluation methods include:
- Learning curve: How much does accuracy (or other metric) change as a function of the amount of training data?
- Sensitivity analysis: How much does accuracy (or other metric) change as a function of key learning parameter values?
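
scikit-learn provides `learning_curve` and `validation_curve` for these two analyses. The sketch below runs both on a small synthetic dataset. The data, the training-size fractions, and the gamma range are assumptions chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Learning curve: score as a function of the amount of training data.
sizes, lc_train, lc_test = learning_curve(
    SVC(kernel='linear'), X_demo, y_demo,
    train_sizes=[0.2, 0.6, 1.0], cv=3, shuffle=True, random_state=0)

# Sensitivity analysis: score as a function of one parameter (here gamma).
gammas = [0.001, 0.01, 0.1]
vc_train, vc_test = validation_curve(
    SVC(kernel='rbf'), X_demo, y_demo,
    param_name='gamma', param_range=gammas, cv=3)

# Each result has one row per setting and one column per CV fold.
for n, score in zip(sizes, lc_test.mean(axis=1)):
    print('train size {:>3}: mean CV accuracy {:.3f}'.format(n, score))
for g, score in zip(gammas, vc_test.mean(axis=1)):
    print('gamma {:<5}: mean CV accuracy {:.3f}'.format(g, score))
```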