### https://www.youtube.com/watch?v=85dtiMz9tSo

## Review of model evaluation

* Need a way to choose between models: different model types, tuning parameters, and features
* Use a model evaluation procedure to estimate how well a model will generalize to out-of-sample data
* Requires a model evaluation metric to quantify the model performance

## Model evaluation procedures

**1. Training and testing on the same data**

* Rewards overly complex models that "overfit" the training data and won't necessarily generalize

**2. Train/test split**

* Split the dataset into two pieces, so that the model can be trained and tested on different data
* Better estimate of out-of-sample performance, but still a "high variance" estimate
* Useful due to its speed, simplicity, and flexibility

**3. K-fold cross-validation**

* Systematically create "K" train/test splits and average the results together
* Even better estimate of out-of-sample performance
* Runs "K" times slower than train/test split
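The K-fold procedure above can be sketched in plain Python. The helper name `k_fold_indices` is hypothetical (it is not a scikit-learn function), and the folds are taken in order without shuffling, which is a simplifying assumption:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Hypothetical helper for illustration; folds are contiguous and unshuffled.
    """
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Each observation lands in the testing set exactly once across the K splits.
folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be fit and scored once per fold and the K scores averaged, which is why the procedure costs roughly K times as much as a single train/test split.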

## Model evaluation metrics

* Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
* Classification problems: Classification accuracy
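As a rough illustration of what these metrics compute, here is a minimal sketch in plain Python. The functions are hand-written stand-ins for illustration (not the scikit-learn functions with similar names), and the numbers are made up:

```python
import math


def mean_absolute_error(y_true, y_pred):
    # Average size of the errors, ignoring their sign.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    # Average of the squared errors; large errors are penalized more heavily.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    # Square root of MSE, expressed in the same units as the response.
    return math.sqrt(mean_squared_error(y_true, y_pred))


def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the true class.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]
print(mean_absolute_error(y_true, y_pred))      # 10.0
print(mean_squared_error(y_true, y_pred))       # 150.0
print(root_mean_squared_error(y_true, y_pred))  # square root of 150
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))     # 0.75
```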

#### Notes: K-fold cross-validation gives a better estimate than a train/test split, but how practical it is depends on how large the data set is. Sometimes a train/test split is more efficient simply because it is quicker to run.

In [2]:
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv('../../data/pima-indians-diabetes.csv', names=col_names)

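In [3]:
# the CSV file appears to contain its own header row, which read_csv loaded as data because names= was passed
# drop that row (note that every column, including the labels, is still stored as strings)
pima.drop(0, inplace=True)
pima.head()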

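| | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 2 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 3 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 5 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |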


### Can we predict the diabetes status of a patient given their health measurements?

In [4]:
# define X (a subset of the features) and y (the response)
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label

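In [5]:
# split X and y into training and testing sets (by default, 25% of the rows are held out for testing)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)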

In [6]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [7]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

### **Classification** accuracy: percentage of correct predictions

In [8]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.6927083333333334


### Null accuracy: accuracy that could be achieved by always predicting the most frequent class

In [9]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()


0    130
1     62
Name: label, dtype: int64

In [10]:
# calculate the percentage of ones (the labels are strings here, so convert to int first)
y_test.astype(int).mean()

0.3229166666666667
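The label column here holds strings such as '1' and '0' rather than integers (the printed arrays and the `'<U1'` dtype later in these notes show this). Averaging string labels directly does not give a proportion: depending on the pandas version it either raises an error or produces a meaningless number, because adding strings concatenates them. A small sketch of the effect in plain Python, using the first 25 true values printed later in these notes:

```python
# First 25 true labels from the testing set, stored as strings.
labels = list("1001001100110000100011000")

# Adding strings concatenates them instead of counting the ones.
concatenated = "".join(labels)
print(concatenated)                        # one long run of digits
print(float(concatenated) / len(labels))   # a huge number, not a proportion

# Converting to int first gives the proportion of ones.
proportion_of_ones = sum(int(v) for v in labels) / len(labels)
print(proportion_of_ones)  # 0.36
```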

In [11]:
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)

0    0.677083
Name: label, dtype: float64
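The same null accuracy can be reproduced without pandas. This is a minimal sketch that rebuilds the testing-set labels from the class counts shown above (130 zeros and 62 ones); the variable names are illustrative:

```python
from collections import Counter

# Rebuild the testing-set labels from the class counts shown above.
y_test_labels = ["0"] * 130 + ["1"] * 62

counts = Counter(y_test_labels)
most_common_class, most_common_count = counts.most_common(1)[0]

# Null accuracy: share of the testing set taken up by the most frequent class.
null_accuracy = most_common_count / len(y_test_labels)

print(most_common_class, round(null_accuracy, 6))  # 0 0.677083
```

The model's accuracy of about 0.693 is only slightly higher than this baseline of about 0.677.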

## Comparing the true and predicted response values

In [17]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])


True: ['1' '0' '0' '1' '0' '0' '1' '1' '0' '0' '1' '1' '0' '0' '0' '0' '1' '0'
 '0' '0' '1' '1' '0' '0' '0']
Pred: ['0' '0' '0' '0' '0' '0' '0' '1' '0' '1' '0' '1' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0']


### When the true value is '0', the predicted value is almost always correct (15 of the 16 shown)
### But when the true value is '1', the model rarely predicts it correctly (2 of the 9 shown)

#### The model is making certain types of errors, but not others. We would not know this if we JUST looked at classification accuracy
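This pattern can be checked directly on the 25 values printed above. A minimal sketch in plain Python; the helper `per_class_hit_rate` is a hypothetical name introduced here:

```python
# The 25 true and predicted labels printed above, one character per observation.
y_true = list("1001001100110000100011000")
y_pred = list("0000000101010000000000000")


def per_class_hit_rate(y_true, y_pred, label):
    """Fraction of observations with the given true label that were predicted correctly."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions) / len(predictions)


print(per_class_hit_rate(y_true, y_pred, "0"))  # 15 of 16 correct
print(per_class_hit_rate(y_true, y_pred, "1"))  # 2 of 9 correct

# Overall accuracy on these 25 values hides the difference between the classes.
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall)  # 17 of 25 correct
```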

## Conclusion:

* Classification accuracy is the easiest classification metric to understand
* But, it does not tell you the underlying distribution of response values
* And, it does not tell you what "types" of errors your classifier is making

## Confusion matrix
 
A table that describes the performance of a classification model

In [18]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[118  12]
 [ 47  15]]


![Confusion_Matrix](../../images/confusion-matrix.png)

* Every observation in the testing set is represented in exactly one box
* It's a 2x2 matrix because there are 2 response classes
* The format shown here is not universal
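A quick consistency check of the matrix printed above, sketched in plain Python. It assumes scikit-learn's layout, in which rows are actual classes and columns are predicted classes, with class 0 first:

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
confusion = [[118, 12],
             [47, 15]]

actual_counts = [sum(row) for row in confusion]           # observations per actual class
predicted_counts = [sum(col) for col in zip(*confusion)]  # observations per predicted class
total = sum(actual_counts)

print(actual_counts)     # [130, 62]
print(predicted_counts)  # [165, 27]
print(total)             # 192
```

The row sums match the class distribution of the testing set (130 zeros, 62 ones), and the four boxes add up to all 192 observations.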

## Basic terminology

* True Positives (TP): we correctly predicted that they do have diabetes
* True Negatives (TN): we correctly predicted that they don't have diabetes
* False Positives (FP): we incorrectly predicted that they do have diabetes (a "Type I error")
* False Negatives (FN): we incorrectly predicted that they don't have diabetes (a "Type II error")

## Term 1 (True/False): Was the prediction correct or not?
## Term 2 (Positive/Negative): Which class was predicted? For binary classification, positive is 1 and negative is 0

In [46]:
# print the first 25 true and predicted responses
print('True:',  y_test.values[0:25])
print('Pred:', y_pred_class[0:25])

True: ['1' '0' '0' '1' '0' '0' '1' '1' '0' '0' '1' '1' '0' '0' '0' '0' '1' '0'
 '0' '0' '1' '1' '0' '0' '0']
Pred: ['0' '0' '0' '0' '0' '0' '0' '1' '0' '1' '0' '1' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0']


In [47]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

![new_confusion_matrix](../../images/detailed-confusion-matrix.png)

## Metrics computed from a confusion matrix
* **Classification Accuracy:** Overall, how often is the classifier correct?

In [48]:
print((TP+TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))

0.6927083333333334
0.6927083333333334


### Classification Error: Overall, how often is the classifier incorrect?
* Also known as "Misclassification Rate"

In [49]:
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))

0.3072916666666667
0.30729166666666663


## Recall or Sensitivity: 

When the actual value is **positive**, how often is the prediction correct?

* How "sensitive" is the classifier to detecting positive instances?
* Also known as "True Positive Rate" or "Recall"

In [50]:
print(TP / float(TP + FN))
# the labels are strings, so pos_label must be the string '1' rather than the default integer 1
print(metrics.recall_score(y_test, y_pred_class, pos_label='1'))

0.24193548387096775
0.24193548387096775

**Specificity:** When the actual value is negative, how often is the prediction correct?

* How "specific" (or "selective") is the classifier in predicting positive instances?

In [51]:
print(TN / float(TN + FP))

0.9076923076923077
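The four metrics covered so far can be collected in one place. A minimal sketch, where `confusion_metrics` is a hypothetical helper and the counts are the ones sliced from the confusion matrix above (TP = 15, TN = 118, FP = 12, FN = 47):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the metrics discussed above, computed from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,     # how often the classifier is correct
        "error": (fp + fn) / total,        # how often the classifier is incorrect
        "sensitivity": tp / (tp + fn),     # correct rate among actual positives
        "specificity": tn / (tn + fp),     # correct rate among actual negatives
    }


results = confusion_metrics(tp=15, tn=118, fp=12, fn=47)
for name, value in results.items():
    print(name, round(value, 4))
```

Seen side by side, the numbers show a model that is highly specific (about 0.91) but not very sensitive (about 0.24).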
