### Validation
**1. Train/test split**

If one subset of our data only has people of a certain age or income level, our estimate can be biased. This is typically referred to as **sampling bias**: a systematic error due to a non-random sample of a population, in which some members of the population are less likely to be included than others, resulting in a biased sample.

```
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
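
One common way to reduce this kind of bias in a classification split is to stratify on the label. The sketch below is an illustration rather than part of the original notes: the imbalanced toy labels are an assumption, and the code only checks that `stratify=y` keeps the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (assumed for illustration): 90 zeros, 10 ones
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify=y asks the splitter to preserve the class proportions of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_ratio = y_train.mean()  # share of class 1 in the training set
test_ratio = y_test.mean()    # share of class 1 in the test set
print(train_ratio, test_ratio)
```

Without `stratify`, a small test set can end up with very few (or no) samples of the rare class.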

**2. k-Fold Cross-Validation (k-Fold CV)**

To minimize sampling bias, we can take another approach. k-Fold CV splits the data into $k$ folds, then trains the model on $k-1$ folds and tests it on the one fold that was left out. It repeats this so that each fold serves as the test set once, and averages the results across the $k$ runs.

```
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
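
The loop above only produces the splits. A minimal sketch of the full procedure (fit on $k-1$ folds, score on the held-out fold, average) could look like this; the iris dataset and `LogisticRegression` are stand-ins chosen for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# shuffle=True matters here because the iris rows are sorted by class
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_index, test_index in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_index], y[train_index])                       # train on k-1 folds
    fold_scores.append(model.score(X[test_index], y[test_index]))  # accuracy on the held-out fold

mean_score = np.mean(fold_scores)  # the k-fold estimate is the average over folds
print(fold_scores, mean_score)
```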

It is also possible to let scikit-learn run the cross-validation, without sweeping over each fold with a `for` loop (https://towardsdatascience.com/validating-your-machine-learning-model-25b4c8643fb7):

```
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

p_grid = {"C": [1, 10, 100], "gamma": [.01, .1]}                  # parameters
svr = SVC(kernel="rbf")                                           # model
inner_cv = KFold(n_splits=2, shuffle=True, random_state=42)       # K-Fold for GridSearchCV (params selection)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)       # K-Fold for computing metrics

# GridSearchCV does inner K-Fold to select best params in p_grid
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)

# cross_val_score does outer K-Fold to compute the metrics of the model
# (accuracy here, since SVC is a classifier)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv, scoring="accuracy").mean()
print(nested_score)
```

`cross_val_score` uses the metric passed in `scoring`; if it is `None`, the estimator's default scorer is used. See https://scikit-learn.org/stable/modules/model_evaluation.html for the available metrics and their string names.

The `cross_validate` function can compute several scores over the K-fold via the parameter `scoring=('r2', 'neg_mean_squared_error')`. It uses the same strings as the link above.
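
As a small sketch of multi-metric scoring: the diabetes dataset and `LinearRegression` below are assumptions picked only because `r2` and `neg_mean_squared_error` are regression metrics. Each metric comes back under a `test_<name>` key.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

X, y = load_diabetes(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    LinearRegression(), X, y, cv=kfold,
    scoring=("r2", "neg_mean_squared_error"),
)

# One array per metric, with one entry per fold
print(results["test_r2"].mean())
print(results["test_neg_mean_squared_error"].mean())
```

The `neg_` prefix means the error is negated so that larger values are always better.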

**3. Leave-one-out Cross-Validation (LOOCV)**

Uses each sample in the data as a separate test set while all remaining samples form the training set. This variant is identical to k-fold CV when k = n (number of observations).

```
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

Computationally very costly as the model needs to be trained n times. Only do this if the data is small.
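
To make the "k = n" remark concrete, here is a small check on toy data (an assumption) that `LeaveOneOut` produces the same splits as an unshuffled `KFold` with one fold per sample.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)  # 6 toy samples

loo_splits = [(train.tolist(), test.tolist()) for train, test in LeaveOneOut().split(X)]
kf_splits = [(train.tolist(), test.tolist()) for train, test in KFold(n_splits=len(X)).split(X)]

print(len(loo_splits))          # one split per sample
print(loo_splits == kf_splits)  # same train/test indices in the same order
```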

**4. Leave-one-group-out Cross-Validation (LOGOCV)**

You might want each fold to only contain a single group. For example, let’s say you have a dataset of 20 companies and their clients and you want to predict the success of these companies.

To keep the folds “pure” and only contain a single company you would create a fold for each company. That way, you create a version of k-Fold CV and LOOCV where you leave one company/group out.

```
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
for train_index, test_index in logo.split(X, y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
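
The `groups` array is what ties each row to its company. A toy sketch, with three hypothetical companies "A", "B" and "C" invented for illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (assumed): 6 clients belonging to 3 hypothetical companies
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["A", "A", "B", "B", "C", "C"])

logo = LeaveOneGroupOut()
held_out = []
for train_index, test_index in logo.split(X, y, groups):
    # Each test fold holds exactly one company; the others form the training set
    test_groups = set(groups[test_index].tolist())
    train_groups = set(groups[train_index].tolist())
    held_out.append(test_groups)
    print("train groups:", sorted(train_groups), "test group:", sorted(test_groups))

n_folds = logo.get_n_splits(groups=groups)  # one fold per distinct group
```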

**5. Time Series CV**

With time series, a random split is a problem: your training data could contain information from the future, so overfitting (and an overly optimistic score) would be a major concern. It is important that all your training data happens before your test data.

One way of validating time series data is by using k-fold CV and making sure that in each fold the training data takes place before the test data.

```
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
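
A quick way to convince yourself that the ordering constraint holds is to print the indices. The toy array below is assumed to be sorted in time order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)  # 6 observations, assumed to be in time order
tscv = TimeSeriesSplit(n_splits=3)

splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train_index, test_index in splits:
    print("train:", train_index, "test:", test_index)

# Every training index comes before every test index in each split
always_past = all(max(train) < min(test) for train, test in splits)
print(always_past)
```

The training window grows from one split to the next, while the test block always sits right after it.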

**6. Model comparisons**

(https://towardsdatascience.com/validating-your-machine-learning-model-25b4c8643fb7)

* Wilcoxon signed-rank test
* McNemar’s test
* 5x2CV paired t-test
* 5x2CV paired F-test
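
As an illustration of one item on this list, below is a hand-rolled sketch of the 5x2CV paired t-test (Dietterich's formulation, as I understand it): five random 50/50 splits, each used in both directions, with the t statistic built from the accuracy differences and compared against a t distribution with 5 degrees of freedom. The helper name `ttest_5x2cv_sketch`, the two models and the iris data are all assumptions made for this example; `mlxtend` ships its own implementation, which is not used here.

```python
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def ttest_5x2cv_sketch(model_a, model_b, X, y, seed=0):
    """Hypothetical helper: 5x2CV paired t-test on accuracy differences."""
    rng = np.random.RandomState(seed)
    variances = []
    first_diff = None
    for _ in range(5):
        # One replication: a random 50/50 split, used in both directions
        X1, X2, y1, y2 = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=rng.randint(10**6)
        )
        diffs = []
        for X_tr, y_tr, X_te, y_te in ((X1, y1, X2, y2), (X2, y2, X1, y1)):
            acc_a = clone(model_a).fit(X_tr, y_tr).score(X_te, y_te)
            acc_b = clone(model_b).fit(X_tr, y_tr).score(X_te, y_te)
            diffs.append(acc_a - acc_b)
        if first_diff is None:
            first_diff = diffs[0]  # numerator uses the very first difference
        mean_diff = (diffs[0] + diffs[1]) / 2
        variances.append((diffs[0] - mean_diff) ** 2 + (diffs[1] - mean_diff) ** 2)
    t_stat = first_diff / np.sqrt(np.mean(variances))
    p_value = 2 * stats.t.sf(abs(t_stat), df=5)  # two-sided, 5 degrees of freedom
    return t_stat, p_value


X, y = load_iris(return_X_y=True)
t_stat, p_value = ttest_5x2cv_sketch(
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=1, random_state=0),
    X, y,
)
print(t_stat, p_value)
```

A small p-value suggests the two models differ in accuracy on this data; the sketch does not guard against the degenerate case where all variances are zero.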

### Pipeline:
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must implement the `fit` and `transform` methods, while the final estimator only needs to implement `fit` (https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976).

```
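from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

steps = [('scaler', StandardScaler()), ('SVM', SVC())]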
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=30, stratify=Y)
parameteres = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print(grid.best_params_)
```

The strings (‘scaler’, ‘SVM’) can be anything, as these are just names to identify clearly the transformer or estimator. We can use `make_pipeline` instead of Pipeline to avoid naming the estimator or transformer. The final step has to be an estimator in this list of tuples.
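
For comparison, a short sketch of the `make_pipeline` variant. The step names are generated from the lowercased class names, so the grid-search keys change accordingly; the parameter values below are arbitrary.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = make_pipeline(StandardScaler(), SVC())

# Step names are generated automatically from the class names
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Grid-search keys then use the generated names as prefixes
param_grid = {"svc__C": [0.1, 10, 100], "svc__gamma": [0.1, 0.01]}
valid = all(key in pipeline.get_params() for key in param_grid)
print(valid)
```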

Another example, using other components in the Pipeline (https://medium.com/data-hackers/como-usar-pipelines-no-scikit-learn-1398a4cc6ae9): a one-hot encoder (OHE), an imputer, and a model:

```
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# split into training and test sets (df is assumed to be the Titanic DataFrame)
X_train, X_test, y_train, y_test = train_test_split(df.drop(['Survived'], axis=1), 
                                                    df['Survived'], 
                                                    test_size=0.2, 
                                                    random_state=42)

# build the model using a pipeline
model = Pipeline(steps=[
    ('one-hot encoder', OneHotEncoder()),
    ('imputer', SimpleImputer(strategy='mean')),
    ('tree', DecisionTreeClassifier(max_depth=3, random_state=0))
])

# train the model
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)

# evaluate the model
test_score = model.score(X_test, y_test)

# validate the model using 5-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X=df.drop(['Survived'], axis=1), y=df['Survived'], cv=kfold)
print("Average accuracy: %f (%f)" %(results['test_score'].mean(), results['test_score'].std()))
```

The `cross_validate` function can compute several scores over the K-fold via the parameter `scoring=('r2', 'neg_mean_squared_error')`.

**Why Pipeline?**
One could proceed like this, without a Pipeline:
```
scale = StandardScaler().fit(X_train)
X_train_scaled = scale.transform(X_train)
grid = GridSearchCV(SVC(), param_grid=parameteres, cv=5)
grid.fit(X_train_scaled, y_train)
```
_Problem:_ The scaled features are split into train and test folds during cross-validation, but each test fold inside the grid search already contains information about the training folds, because the whole training set (`X_train`) was used for standardization. Put simply, when `SVC.fit()` runs inside cross-validation, the features already include information from the test fold, as **`StandardScaler.fit()` was done on the whole training set.**

https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976
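
The following sketch illustrates why the Pipeline avoids this leak: when the whole pipeline is fitted on a training fold, the scaler's statistics come from that fold only. The loop imitates what a cross-validation helper does with each fold; the iris data and the manual loop are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])

matches_fold_mean = []
for train_index, test_index in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pipeline.fit(X[train_index], y[train_index])
    fitted_mean = pipeline.named_steps["scaler"].mean_
    # The scaler statistics come from the training fold only, not from all of X
    matches_fold_mean.append(np.allclose(fitted_mean, X[train_index].mean(axis=0)))

print(all(matches_fold_mean))
```

Passing the pipeline itself to `GridSearchCV` gives the same behavior, since the search refits the whole estimator on each training fold.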

### Images
**K-Fold**
![./images/k-fold.png](./images/k-fold.png)
**Leave-one-out**
![./images/leave-one-out.png](./images/leave-one-out.png)
**Leave-one-group-out**
![./images/leave-one-group-out.png](./images/leave-one-group-out.png)
**Nested CV**
![./images/nested-cv.png](./images/nested-cv.png)
**Time Series CV**
![./images/time-series-cv.png](./images/time-series-cv.png)

## Validation techniques

https://towardsdatascience.com/validating-your-machine-learning-model-25b4c8643fb7

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
(X_train, X_test, 
 y_train, y_test) = train_test_split(X, y, test_size=0.3, 
                                     random_state=42)

In [2]:
import numpy as np
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)  # n_splits cannot exceed the number of samples (4 here)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

A nice example of K-Fold + GridSearchCV + metric computation:

In [3]:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

# Load the dataset
X_iris = load_iris().data
y_iris = load_iris().target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svr = SVC(kernel="rbf")

# Create inner and outer strategies
inner_cv = KFold(n_splits=2, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Pass the gridSearch estimator to cross_val_score
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
print(nested_score)

0.9800000000000001


In [4]:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

In [5]:
from scipy.stats import wilcoxon
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Load the dataset
X = load_iris().data
y = load_iris().target

# Prepare models and select your CV method
model1 = ExtraTreesClassifier()
model2 = RandomForestClassifier()
kf = KFold(n_splits=20, random_state=42, shuffle=True)

# Extract results for each model on the same folds
results_model1 = cross_val_score(model1, X, y, cv=kf)
results_model2 = cross_val_score(model2, X, y, cv=kf)

# Calculate p value
stat, p = wilcoxon(results_model1, results_model2, zero_method='zsplit'); p



0.6766573217164245

In [6]:
import numpy as np
from mlxtend.evaluate import mcnemar_table, mcnemar

# The correct target (class) labels
y_target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# Class labels predicted by model 1
y_model1 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
                     0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1])

# Class labels predicted by model 2
y_model2 = np.array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
                     1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])

# Calculate p value
tb = mcnemar_table(y_target=y_target, 
                   y_model1=y_model1, 
                   y_model2=y_model2)
chi2, p = mcnemar(ary=tb, exact=True)

print('chi-squared:', chi2)
print('p-value:', p)

ModuleNotFoundError: No module named 'mlxtend' (this cell needs the `mlxtend` package installed, e.g. `pip install mlxtend`)

https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976

A nice example:

In [7]:
import pandas as pd
winedf = pd.read_csv('red-wine/winequality-red.csv', sep=',')
print(winedf.isnull().sum())  # check for missing data
# print(winedf.head(3))

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


In [8]:
X = winedf.drop(['quality'],axis=1)
Y = winedf['quality']

In [9]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [10]:
steps = [('scaler', StandardScaler()), ('SVM', SVC())]
pipeline = Pipeline(steps) # define the pipeline object.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=30, stratify=Y)

In [12]:
print(winedf['quality'].value_counts())

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64


In [13]:
parameteres = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}

In [14]:
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)

In [15]:
#grid.fit(X_train, y_train)

In [16]:
#print("score = %3.2f" %(grid.score(X_test,y_test)))
#print(grid.best_params_)

### Using the Titanic dataset

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [18]:
train = pd.read_csv("./titanic/train.csv")
test = pd.read_csv("./titanic/test.csv")

In [19]:
train.shape, test.shape

((891, 12), (418, 11))

In [20]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [21]:
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

I will select only the numeric fields

In [22]:
train = train.select_dtypes(include=['int64', 'float64'])
test = test.select_dtypes(include=['int64', 'float64'])

In [23]:
train.head()

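| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 |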


In [24]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Age            177
SibSp            0
Parch            0
Fare             0
dtype: int64

In [25]:
test.isna().sum()

PassengerId     0
Pclass          0
Age            86
SibSp           0
Parch           0
Fare            1
dtype: int64

I will impute the mean for all fields

In [26]:
mean = train.mean()
train = train.fillna(mean)
test = test.fillna(mean)
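
The same idea (learn the means on train, reuse them on test) can be sketched with scikit-learn's `SimpleImputer`, which makes it easy to move the step into a Pipeline later. The toy frames below are made-up stand-ins for the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the Titanic train/test data (values are made up)
train_toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 20.0, 30.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 5.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_toy)  # learns the column means from train
test_filled = imputer.transform(test_toy)        # reuses the train means on test

print(imputer.statistics_)  # per-column means learned from train_toy
print(test_filled)
```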

In [27]:
train.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
dtype: int64

In [28]:
test.isna().sum()

PassengerId    0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
dtype: int64

I will run these steps:
* Train a Random Forest model, scaling the features and selecting hyperparameters with a GridSearchCV that uses a 10-fold K-Fold
* Measure the precision, recall, and ROC-AUC averaged over the folds of a 5-fold K-Fold

In [64]:
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [65]:
X_train = train.drop(columns=["Survived"])
y_train = train["Survived"]

In [75]:
model = RandomForestClassifier()
p_grid = {
    "RF__n_estimators": [100, 200],#[100, 250, 500, 1000],
    "RF__max_depth": [5, 10],#[5, 10, 15],
    #"RF__min_samples_split": [3, 4, 5],
    #"RF__min_samples_leaf": [3, 4, 5]
}
inner_cv = KFold(n_splits=10, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

steps = [('scaler', StandardScaler()), ('RF', model)]
pipeline = Pipeline(steps)

clf = GridSearchCV(estimator=pipeline, param_grid=p_grid, cv=inner_cv) # cv=10
results = cross_validate(
    clf,
    X_train,
    y_train,
    cv=outer_cv, # cv=5
    scoring=('precision', 'recall', 'roc_auc'),
    return_train_score=True,
    return_estimator=True
)

GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('RF', RandomForestClassifier())]),
             param_grid={'RF__max_depth': [5, 10],
                         'RF__n_estimators': [100, 200]})

In [None]:
# sorted(clf.get_params().keys())

In [77]:
print(results)

{'fit_time': array([8.44775391, 9.16711807, 8.18568206, 8.37114978, 9.7367487 ]), 'score_time': array([0.05172563, 0.02063632, 0.03149652, 0.03138566, 0.03125095]), 'estimator': [GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('RF', RandomForestClassifier())]),
             param_grid={'RF__max_depth': [5, 10],
                         'RF__n_estimators': [100, 200]}), GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('RF', RandomForestClassifier())]),
             param_grid={'RF__max_depth': [5, 10],
                         'RF__n_estimators': [100, 200]}), GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('RF

In [78]:
results["estimator"]

[GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
              estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                        ('RF', RandomForestClassifier())]),
              param_grid={'RF__max_depth': [5, 10],
                          'RF__n_estimators': [100, 200]}),
 GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
              estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                        ('RF', RandomForestClassifier())]),
              param_grid={'RF__max_depth': [5, 10],
                          'RF__n_estimators': [100, 200]}),
 GridSearchCV(cv=KFold(n_splits=10, random_state=42, shuffle=True),
              estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                        ('RF', RandomForestClassifier())]),
              param_grid={'RF__max_depth': [5, 10],
                          'RF__n_estimators': [100, 200]}),
 GridSearchCV(cv=KFold

In [79]:
mean_train_precision = results["train_precision"].mean()
mean_test_precision = results["test_precision"].mean()
mean_train_recall = results["train_recall"].mean()
mean_test_recall = results["test_recall"].mean()
mean_train_roc_auc = results["train_roc_auc"].mean()
mean_test_roc_auc = results["test_roc_auc"].mean()
print(mean_train_precision)
print(mean_test_precision)
print(mean_train_recall)
print(mean_test_recall)
print(mean_train_roc_auc)
print(mean_test_roc_auc)

0.8626684071305114
0.7077841114413942
0.6245885221498533
0.4810606060606061
0.8962846174915786
0.77538324350398


In [76]:
clf.fit(X_train, y_train)

In [None]:
# This is our result on the Kaggle test dataset
y_prob = clf.predict_proba(test)[:, 1]

Example from https://stackoverflow.com/questions/53252156/standardscaler-with-pipelines-and-gridsearchcv

In [None]:
pipe_MLPRegressor = Pipeline([('scaler',  StandardScaler()),
            ('MLPRegressor', MLPRegressor(random_state = 42))])


grid_params_MLPRegressor = [{
    'MLPRegressor__solver': ['lbfgs'],
    'MLPRegressor__max_iter': [100,200,300,500],
    'MLPRegressor__activation' : ['relu','logistic','tanh'],
    'MLPRegressor__hidden_layer_sizes':[(2,), (4,),(2,2),(4,4),(4,2),(10,10),(2,2,2)],
}]


CV_mlpregressor = GridSearchCV (estimator = pipe_MLPRegressor,
                               param_grid = grid_params_MLPRegressor,
                               cv = 5,return_train_score=True, verbose=0)

CV_mlpregressor.fit(x_train, y_train)

CV_mlpregressor.predict(x_test)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
x,y = load_boston(return_X_y=True)


xtrain, xtest, ytrain, ytest = train_test_split(x,y, random_state=6784)

pipe_MLPRegressor = Pipeline([('scaler',  StandardScaler()),
            ('MLPRegressor', MLPRegressor(random_state = 42))])
grid_params_MLPRegressor = [{
    'MLPRegressor__solver': ['lbfgs'],
    'MLPRegressor__max_iter': [100,200,300,500],
    'MLPRegressor__activation' : ['relu','logistic','tanh'],
    'MLPRegressor__hidden_layer_sizes':[(2,), (4,),(2,2),(4,4),(4,2),(10,10),(2,2,2)],}]


CV_mlpregressor = GridSearchCV (estimator = pipe_MLPRegressor,
                               param_grid = grid_params_MLPRegressor,
                               cv = 5,return_train_score=True, verbose=0)

CV_mlpregressor.fit(xtrain, ytrain)

ypred=CV_mlpregressor.predict(xtest)

print(np.c_[ytest, ypred])