## Validating Prediction

* https://scikit-learn.org/stable/modules/model_evaluation.html

In [1]:
import numpy as np

#### MSE (Mean Squared Error)
* squaring the errors penalizes large (outlier) errors much more heavily than small ones

In [2]:
n = 3
y = np.array([4,3,5])
yhat = np.array([4,5,6])
sub = y-yhat
sqr = sub**2
summation = np.sum(sqr)
mse = summation/n
print(sqr)
print(mse)

[0 4 1]
1.6666666666666667
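
To see the outlier penalty in action, here is a small sketch comparing two sets of predictions with the same total absolute error. The helper names `mse` and `mae` and the arrays `spread` and `outlier` are made up for illustration.

```python
import numpy as np

def mse(y, yhat):
    """Mean of the squared residuals."""
    return np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)

def mae(y, yhat):
    """Mean of the absolute residuals."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(yhat)))

y = np.array([4, 3, 5])
spread = np.array([5, 4, 6])   # hypothetical: every prediction is off by 1
outlier = np.array([4, 3, 8])  # hypothetical: one prediction is off by 3

print(mae(y, spread), mae(y, outlier))  # identical mean absolute error
print(mse(y, spread), mse(y, outlier))  # MSE is larger when the error is concentrated
```

Both prediction sets miss by 3 units in total, but MSE rates the single large miss as three times worse.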


#### RMSE (Root Mean Squared Error)

In [3]:
n = 3
y = np.array([4,3,5])
yhat = np.array([4,100,6])
sub = y-yhat
sqr = sub**2
div = sqr/n
summation = np.sum(div)
rmse = np.sqrt(summation)
rmse

56.005952064639224
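
As a cross-check, the same number can be obtained from scikit-learn by taking the square root of `mean_squared_error`. This is a sketch that assumes scikit-learn is installed; it avoids version-specific RMSE options.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array([4, 3, 5])
yhat = np.array([4, 100, 6])

# RMSE is the square root of MSE, so it is back in the same units as y
rmse = np.sqrt(mean_squared_error(y, yhat))
print(rmse)
```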

#### MAE (Mean Absolute Error)

In [4]:
n = 3
y = np.array([4,3,5])
yhat = np.array([4,100,6])
sub = y-yhat
absolute = np.abs(sub)
summation = np.sum(absolute)
mae = summation/n
mae

32.666666666666664
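
A quick sketch of the same calculation with scikit-learn's `mean_absolute_error`, next to RMSE on the same data, to show how differently the two react to the single large miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y = np.array([4, 3, 5])
yhat = np.array([4, 100, 6])

mae = mean_absolute_error(y, yhat)
rmse = np.sqrt(np.mean((y - yhat) ** 2))

# The one prediction that is off by 97 inflates RMSE far more than MAE
print(mae, rmse)
```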

#### MAPE (Mean Absolute Percentage Error)

In [5]:
n = 3
y = np.array([4,6,8])
yhat = np.array([2,3,4])
sub = y-yhat
perc = sub/y
absol = np.abs(perc)
summation = np.sum(absol)
mape = summation/n
mape

0.5
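
One thing to watch with MAPE is that it divides by the actual values, so it is undefined when any actual value is 0. A minimal sketch of a guarded helper follows; the function name `mape` and the choice to raise an error are assumptions, not a library API.

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error as a fraction (0.5 means 50%)."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    if np.any(y == 0):
        # Division by an actual value of 0 would give inf or nan
        raise ValueError("MAPE is undefined when an actual value is 0")
    return np.mean(np.abs((y - yhat) / y))

print(mape([4, 6, 8], [2, 3, 4]))
```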

#### Correlation Y and Yhat

In [6]:
from scipy.stats import pearsonr

In [7]:
y = np.array([4,6,8])
yhat = np.array([2,3,4])
np.corrcoef(y,yhat)

array([[1., 1.],
       [1., 1.]])

In [8]:
corr, pvalue = pearsonr(y,yhat)
print(corr)
print(pvalue)

0.9999999999999998
1.3415758552508151e-08
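
Note that a correlation of 1 does not mean the predictions are close to the actual values. In the example above every prediction is exactly half of the actual value, which this sketch makes explicit by printing the error alongside the correlation.

```python
import numpy as np

y = np.array([4, 6, 8])
yhat = np.array([2, 3, 4])  # exactly half of y

corr = np.corrcoef(y, yhat)[0, 1]  # off-diagonal entry is the y/yhat correlation
mae = np.mean(np.abs(y - yhat))

# Perfectly correlated, yet off by 3 units on average
print(corr, mae)
```

Correlation measures whether y and yhat move together, so it is best read alongside an error metric rather than instead of one.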


#### median vs mean of the errors: the median is more robust to outliers

In [9]:
y = np.array([4,6,8,5,2,3,4])
yhat = np.array([2,3,4,1,2,5,100])
absolute_errors = abs(y - yhat)
print("Mean: {}".format(np.mean(absolute_errors)))
print("Median: {}".format(np.median(absolute_errors)))

Mean: 15.857142857142858
Median: 3.0


## Validating Classification

<img src="https://miro.medium.com/max/1106/1*vMEqRXTl8PRfRtgWgwUVEg.jpeg"
     width="500" height="300" />

In [10]:
y = np.array([1,1,0,0,0,0])
yhat = np.array([1, 0, 0, 1, 0, 0])

print("Y  - Yhat")
for i in zip(y, yhat):
    print(i[0], " - ", i[1], " - ", i[0] == i[1])

Y  - Yhat
1  -  1  -  True
1  -  0  -  False
0  -  0  -  True
0  -  1  -  False
0  -  0  -  True
0  -  0  -  True


* true positives: actually positive and we predicted positive
    * y = 1 and yhat = 1
* true negatives: actually negative and we predicted negative
    * y = 0 and yhat = 0
* false positives: actually negative but we predicted positive
    * y = 0 and yhat = 1
* false negatives: actually positive but we predicted negative
    * y = 1 and yhat = 0

In [11]:
true_positives = 1
true_negatives = 3
false_positives = 1
false_negatives = 1
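
Rather than counting by hand, the four counts can be read off scikit-learn's `confusion_matrix`. This is a sketch for the binary 0/1 case using the same arrays as above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y = np.array([1, 1, 0, 0, 0, 0])
yhat = np.array([1, 0, 0, 1, 0, 0])

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y, yhat).ravel()
print(tp, tn, fp, fn)
```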

#### recall
* of all the actual positive cases, what fraction did we recall, or find

In [12]:
recall = true_positives/(true_positives + false_negatives)
recall

0.5

#### precision
* of all our positive predictions, how precise were we, i.e. what fraction were actually positive

In [13]:
precision = true_positives/(true_positives+false_positives)
precision

0.5

#### f1
* harmonic mean of precision and recall

In [14]:
f1 = 2 * (precision * recall) / (precision + recall)
f1

0.5
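
The same three metrics are available in `sklearn.metrics` and work directly on the label arrays. A short sketch that should reproduce the hand-computed values:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y = np.array([1, 1, 0, 0, 0, 0])
yhat = np.array([1, 0, 0, 1, 0, 0])

# By default these treat label 1 as the positive class
precision = precision_score(y, yhat)
recall = recall_score(y, yhat)
f1 = f1_score(y, yhat)
print(precision, recall, f1)
```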

#### accuracy

In [15]:
y = np.array([1,1,0,0,0,0])
yhat = np.array([1, 0, 0, 1, 0, 0])

correct = int(np.sum(y == yhat))  # 4 predictions match
n = len(y)  # 6 predictions in total
accuracy = correct/n
accuracy

0.6666666666666666

#### logic extends to multilabel classification
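
With more than two classes, precision is computed per class and then averaged. The sketch below uses made-up three-class labels to show the difference between the `macro` and `micro` averages, which are the same averaging modes behind the `precision_macro` and `precision_micro` scoring names.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical three-class labels, for illustration only
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])
yhat = np.array([0, 0, 0, 1, 1, 1, 2, 0])

per_class = precision_score(y, yhat, average=None)   # one precision per class
macro = precision_score(y, yhat, average="macro")    # unweighted mean of per-class values
micro = precision_score(y, yhat, average="micro")    # pooled true positives / pooled predictions
print(per_class, macro, micro)
```

Macro treats every class equally, while micro pools all predictions, so the two diverge when classes are imbalanced.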

## Cross Validation
* we split our data into training and test sets
* then we divide the training data into k folds (k is a number we choose, often 5 or 10)
* say we have 5 folds: we fit a model on 4 folds, then use the remaining fold as the test
* then we pick a different 4 folds to train on, so a new fold becomes the test
* we repeat this until every fold has been the test fold once, validating each time
* consistent scores across folds are a test of model stability
* in scikit-learn: cross_val_score and cross_validate
* The cross_validate function differs from cross_val_score in two ways:
    * It allows specifying multiple metrics for evaluation.
    * It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
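
The fold-by-fold procedure described above can be sketched by hand with `KFold`. This is only an illustration of what `cross_validate` does internally; the fixed `random_state` values and the use of accuracy via `model.score` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fit_idx, val_idx in kf.split(x_train):
    # Fit on 4 folds, score on the remaining fold
    model = DecisionTreeClassifier(random_state=42)
    model.fit(x_train[fit_idx], y_train[fit_idx])
    scores.append(model.score(x_train[val_idx], y_train[val_idx]))

scores = np.array(scores)
print(scores, scores.mean(), scores.std())
```

The test split is never touched inside the loop, so it stays available for a final check.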


<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png"
     width="600" height="400" />

In [27]:
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [28]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris()
data = iris["data"]
labels = iris["target_names"]
feature_columns = iris["feature_names"]

df = pd.DataFrame(data, columns = feature_columns)
df["label"] = np.array([labels[x] for x in iris["target"]])

x = df.drop(columns="label")  # positional axis argument is not accepted by newer pandas versions
y = df["label"]

In [25]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [26]:
clf = DecisionTreeClassifier()

* https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
* Evaluate metric(s) by cross-validation and also record fit/score times.

In [9]:
cv_results = cross_validate(clf, x_train, y_train, cv=10, scoring = ["precision_micro", "precision_macro"])

In [10]:
cv_results["test_precision_micro"]

array([0.92307692, 1.        , 0.91666667, 1.        , 0.66666667,
       0.83333333, 1.        , 0.91666667, 0.91666667, 0.90909091])

In [11]:
cv_results["test_precision_macro"]

array([0.94444444, 1.        , 0.93333333, 1.        , 0.66666667,
       0.88888889, 1.        , 0.93333333, 0.93333333, 0.93333333])

### Are these good scores?  Look at not only the average/median but the deviation.  Are they consistent?

In [18]:
df = pd.DataFrame({
    "micro": cv_results["test_precision_micro"],
    "macro": cv_results["test_precision_macro"]
})

In [19]:
df

|   | micro | macro |
|---|---|---|
| 0 | 0.923077 | 0.944444 |
| 1 | 1.0 | 1.0 |
| 2 | 0.916667 | 0.933333 |
| 3 | 1.0 | 1.0 |
| 4 | 0.666667 | 0.666667 |
| 5 | 0.833333 | 0.888889 |
| 6 | 1.0 | 1.0 |
| 7 | 0.916667 | 0.933333 |
| 8 | 1.0 | 1.0 |
| 9 | 0.909091 | 0.933333 |
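
One way to answer the consistency question is to summarize the per-fold scores. The sketch below copies the precision_micro values from the array printed earlier (not from a fresh run) and reports the center and spread.

```python
import numpy as np

# Per-fold precision_micro scores copied from the cross_validate output above
micro = np.array([0.92307692, 1.0, 0.91666667, 1.0, 0.66666667,
                  0.83333333, 1.0, 0.91666667, 0.91666667, 0.90909091])

print("mean:  ", micro.mean())
print("median:", np.median(micro))
print("std:   ", micro.std())
print("min:   ", micro.min())
```

A fold scoring well below the others, like the 0.67 here, is worth a closer look before trusting the average.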


In [9]:
cv_results["fit_time"]

array([0.00199819, 0.00162983, 0.00148797, 0.00119495, 0.00118327,
       0.00218487, 0.00161505, 0.00117612, 0.00119019, 0.00117898])

### Is an MSE of .07 good?

* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [30]:
# cross_val_score accepts a single metric only; the commented-out call below raises an error, so use cross_validate for multiple metrics
#cross_val_score(clf, x_train, y_train, cv=5, scoring = ["precision_micro", "precision_macro"])
cross_val_score(clf, x_train, y_train, cv=5, scoring = "precision_micro")

array([0.96      , 0.95833333, 0.83333333, 0.95833333, 0.95652174])

In [30]:
cv_results = cross_validate(clf, x_train, y_train, cv=5, scoring=('accuracy', 'recall_weighted'))
cv_results

{'fit_time': array([0.00190592, 0.00145197, 0.00179386, 0.001333  , 0.00133395]),
 'score_time': array([0.00274205, 0.00211406, 0.00248909, 0.00208187, 0.00174284]),
 'test_accuracy': array([0.96      , 1.        , 0.83333333, 0.95833333, 0.95652174]),
 'test_recall_weighted': array([0.96      , 1.        , 0.83333333, 0.95833333, 0.95652174])}

In [32]:
cv_results = cross_validate(clf, x_train, y_train, cv=5, scoring=('accuracy', 'f1_weighted'))
cv_results

{'fit_time': array([0.00228906, 0.00201583, 0.00184202, 0.00183177, 0.00175023]),
 'score_time': array([0.003052  , 0.00277829, 0.00258088, 0.00247407, 0.00235891]),
 'test_accuracy': array([0.96      , 0.95833333, 0.83333333, 0.95833333, 0.95652174]),
 'test_f1_weighted': array([0.9597193 , 0.95816993, 0.83068783, 0.95816993, 0.95612827])}

In [19]:
final_model = DecisionTreeClassifier()
final_model.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [34]:
yhat = final_model.predict(x_test)
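
Once we have predictions for the held-out test set, we can score them against `y_test`. Below is a self-contained sketch that rebuilds the split from the iris data; using the raw arrays instead of the DataFrame and fixing `random_state` on the tree are assumptions made to keep the example short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

final_model = DecisionTreeClassifier(random_state=42)
final_model.fit(x_train, y_train)
yhat = final_model.predict(x_test)

# Score on data the model never saw during training or cross validation
test_accuracy = accuracy_score(y_test, yhat)
print(test_accuracy)
print(classification_report(y_test, yhat))
```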

## Grid Search
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
* The simple rule is that data used for evaluating the performance of a model should not have been used to optimize the model in any way
* so we first make a training and test split
* we can use a grid search with cross validation on the training set only
* this shows how stable each parameter combination is across folds and gives us the best params
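
A minimal sketch of that workflow: search on the training split only, then score the refit best model once on the test split. The tiny `small_grid` is a made-up grid chosen to run quickly, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical small grid: 2 x 3 = 6 candidates
small_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5, None]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), small_grid, cv=5)
gs.fit(x_train, y_train)  # the search only ever sees the training split

print(gs.best_params_)
print(gs.best_score_)            # mean cross-validated accuracy on the training folds
test_score = gs.score(x_test, y_test)  # accuracy of the refit best model on held-out data
print(test_score)
```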

In [31]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
import numpy as np

In [32]:
param_map = {
    'criterion':('gini', 'entropy'), 
    "max_depth":[5,7,9,11,13,15,None],
    "min_samples_split":list(range(5,100,2)),
    "max_leaf_nodes":list(range(10,100,5))
}

In [20]:
clf = DecisionTreeClassifier()

In [33]:
gs = GridSearchCV(clf, param_map, cv=5, verbose = 1, n_jobs = 4)
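# note: this fits on the full dataset; to keep x_test untouched for evaluation, fit on x_train, y_train instead
gs.fit(x,y)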

Fitting 5 folds for each of 12096 candidates, totalling 60480 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  68 tasks      | elapsed:    1.5s
[Parallel(n_jobs=4)]: Done 15884 tasks      | elapsed:   10.2s
[Parallel(n_jobs=4)]: Done 47884 tasks      | elapsed:   29.4s
[Parallel(n_jobs=4)]: Done 60480 out of 60480 | elapsed:   37.5s finished


In [17]:
gs

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=4,
             param_grid={'criterion': ('gini', 'entropy'),
                         'max_depth': [5, 7, 9, 11, 13, 15, None],
             

In [34]:
# give our best params
gs.best_params_

{'criterion': 'gini',
 'max_depth': 5,
 'max_leaf_nodes': 10,
 'min_samples_split': 5}

In [50]:
# keys of the cv_results_ dict on the fitted grid search object
gs.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_criterion', 'param_max_depth', 'param_min_samples_split', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

In [20]:
gs.cv_results_["mean_test_score"]

array([0.96666667, 0.96666667, 0.95333333, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96      , 0.96666667, 0.95333333,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.95333333, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.95333333, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96      ,
       0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
       0.96666667, 0.96666667, 0.96666667, 0.95333333, 0.95333333,
       0.95333333, 0.96      , 0.96      , 0.96      , 0.96      ,
       0.96      , 0.96      , 0.95333333, 0.95333333, 0.95333

In [53]:
gs.cv_results_["std_test_score"]

array([0.03265986, 0.03651484, 0.03651484, 0.02108185, 0.02108185,
       0.02108185, 0.02108185, 0.02108185, 0.02108185, 0.03399346,
       0.03399346, 0.03651484, 0.02108185, 0.02108185, 0.02108185,
       0.02108185, 0.02108185, 0.02108185, 0.03651484, 0.03399346,
       0.03651484, 0.02108185, 0.02108185, 0.02108185, 0.02108185,
       0.02108185, 0.02108185, 0.03265986, 0.03399346, 0.03651484,
       0.02108185, 0.02108185, 0.02108185, 0.02108185, 0.02108185,
       0.02108185, 0.03265986, 0.03399346, 0.03651484, 0.02108185,
       0.02108185, 0.02108185, 0.02108185, 0.02108185, 0.02108185,
       0.03265986, 0.03399346, 0.03651484, 0.02108185, 0.02108185,
       0.02108185, 0.02108185, 0.02108185, 0.02108185, 0.03651484,
       0.03265986, 0.03399346, 0.02108185, 0.02108185, 0.02108185,
       0.02108185, 0.02108185, 0.02108185, 0.03399346, 0.03399346,
       0.03399346, 0.02494438, 0.02494438, 0.02494438, 0.02494438,
       0.02494438, 0.02494438, 0.03399346, 0.03399346, 0.03399

In [35]:
gs.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=10,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
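
The long `mean_test_score` and `std_test_score` arrays are easier to read as a table. A sketch, assuming pandas is available and using a small made-up grid so it runs quickly, that loads `cv_results_` into a DataFrame and sorts by rank:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical small grid: 3 x 2 = 6 candidates
small_grid = {"max_depth": [2, 3, 5], "min_samples_split": [2, 10]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), small_grid, cv=5)
gs.fit(x_train, y_train)

results = pd.DataFrame(gs.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranked = results[cols].sort_values("rank_test_score")
print(ranked.head())
```

Seeing the mean and standard deviation side by side helps when several candidates tie on the mean score, as they do in the output above.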

### AutoML Options
* http://docs.h2o.ai/h2o-tutorials/latest-stable/h2o-world-2017/automl/index.html
* https://automl.github.io/auto-sklearn/master/manual.html#inspecting-the-results