# CH 9 Overview
## Why do DS folks need to evaluate performance?
1. You need to know how well your algo will perform on unseen data.
2. Flavor 1: prove your stuff works by testing on data where you already know the answers
3. Flavor 2: estimate accuracy on new data using statistical resampling via scikit-learn

## Why do DS folks fail at predictions?
1. Scoring an algo on the same data it was trained on is bad bc of OVERFITTING
2. OVERFITTING: the algo memorizes the training data, so it scores great on data already seen but predicts poorly on unseen data
3. 4 jutsu to split data: train & test split / k-fold cross-validation / leave-one-out cross-validation / repeated random train-test splits
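
The overfitting point can be seen with a toy model that does nothing but memorize its training rows. This is a plain-Python sketch, not scikit-learn: `MemorizingModel`, `accuracy`, and the synthetic labeling rule are all made up for illustration.

```python
from collections import Counter


class MemorizingModel:
    """Hypothetical toy model: stores every training row in a lookup table."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        # Fallback for rows never seen in training: the most common label.
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data (an assumption for illustration): 100 unique inputs,
# labeled by an arbitrary rule the model cannot learn by memorizing.
X = list(range(100))
y = [1 if (7 * x) % 10 < 5 else 0 for x in X]

# Simple unshuffled 67/33 split.
X_train, X_test = X[:67], X[67:]
y_train, y_test = y[:67], y[67:]

model = MemorizingModel().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
test_acc = accuracy(y_test, model.predict(X_test))

print("Train accuracy: %.3f" % train_acc)
print("Test accuracy:  %.3f" % test_acc)
```

The score on the training rows looks perfect because every answer was already seen. Only the score on the held-back rows says anything about new data.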


In [4]:
#9.2 Split Into Train & Test Sets
#Evaluate using a train and test set
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#load data
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(url, names=names)
array = dataframe.values
# separate array into input & output components
X = array[:, 0:8]
Y = array[:, 8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size,
                                                    random_state=seed)
model = LogisticRegression(solver='liblinear') 
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test) 
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 75.591%


## 9.3 K-fold Cross Validation
### Cross-validation is a jutsu that estimates the performance of an ML algo w/ less variance than a single train-test split
1. It works by splitting the data into k parts (folds).
2. The algo is trained on k-1 folds w/ one fold held back for testing.
3. Step 2 is repeated until every fold has been held back once.
4. After running cross-validation you end up w/ k different performance scores that you can summarize w/ a mean & std dev.
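
The steps above can be sketched in plain Python to show which rows land where. `k_fold_indices` is a hypothetical helper written for these notes; it is not how scikit-learn's `KFold` is implemented, and its shuffling will not match `KFold` for the same seed.

```python
import random


def k_fold_indices(n_samples, k, seed=7):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # the fold held back
        train_idx = indices[:start] + indices[start + size:]  # the other k-1 folds
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=20, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print("Fold %d: train on %d rows, hold back %s" % (i, len(train_idx), sorted(test_idx)))
```

In a real evaluation the algo is fit on the `train_idx` rows and scored on the `test_idx` rows in each pass, which gives the k scores to average. Setting k equal to the number of rows gives leave-one-out cross-validation.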

In [4]:
#9.3 k-fold cross-validation

#Evaluate using Cross Validation
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#load data
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(url, names=names)
array = dataframe.values
# separate array into input & output components
X = array[:, 0:8]
Y = array[:, 8]
#KFold jutsu
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
#display results
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 77.086% (5.091%)


## Why use k-fold cross-validation, and what is it telling me?
1. The result is a more reliable estimate of the algo's performance on new data.
2. It's more accurate bc the algo is trained & evaluated multiple times on different data.
3. Each fold must be a decent sample size so that every test fold is a reasonable sample of the problem, and k must be large enough to give enough train-test repetitions for a fair estimate on UNSEEN DATA.
4. For datasets btwn 1k and 10k rows, common k values are 3, 5, and 10.
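
The trade-off between k and fold size is simple arithmetic. A quick sketch, assuming 768 rows (the size of the Pima Indians diabetes dataset used in these notes) and a hypothetical helper `fold_sizes` that splits the rows as evenly as possible:

```python
def fold_sizes(n_samples, k):
    """Sizes of the k held-back folds when rows are split as evenly as possible."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]


n_rows = 768  # assumed row count for the Pima Indians diabetes dataset
for k in (3, 5, 10):
    sizes = fold_sizes(n_rows, k)
    print("k=%2d -> test folds of %d to %d rows, trained on about %d rows"
          % (k, min(sizes), max(sizes), n_rows - max(sizes)))
```

A bigger k means more repetitions and more training data per pass, but each held-back fold gets smaller, so each individual score gets noisier.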


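## 9.5 Repeated Random Test-Train Splits
### Another flavor of k-fold cross-validation
1. Similar to the train/test split jutsu, except the random split & evaluation is repeated multiple times.
2. The jutsu keeps the speed of a train/test split while getting the reduced variance of k-fold cross-validation.
3. You can repeat as often as needed.
4. **DOWNSIDE** repeats may reuse much of the same data in the train or test split from run to run, introducing redundancy into the evaluation.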
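
The redundancy downside can be counted directly. A rough plain-Python sketch, where `shuffle_split_indices` is a hypothetical stand-in for the idea behind `ShuffleSplit` (its rounding and shuffling are assumptions and will not match scikit-learn exactly):

```python
import random
from collections import Counter


def shuffle_split_indices(n_samples, n_splits, test_size, seed=7):
    """Yield (train_idx, test_idx) pairs, each from an independent random shuffle."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_split_indices(n_samples=100, n_splits=10, test_size=0.33))

# Count how many times each row ends up in a test set across all repeats.
test_counts = Counter(i for _, test_idx in splits for i in test_idx)

print("Rows tested more than once:", sum(1 for c in test_counts.values() if c > 1))
print("Rows never tested:", 100 - len(test_counts))
```

Unlike k-fold, nothing forces each row to be held back exactly once: with 10 repeats of 33 test rows drawn from 100, some rows must show up in several test sets.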

In [6]:
#9.5 Evaluate using shuffle split cross validation
from pandas import read_csv
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#load data
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(url, names=names)
array = dataframe.values
# separate array into input & output components
X = array[:, 0:8]
Y = array[:, 8]

#Repeated random train-test splits jutsu (ShuffleSplit, not repeated k-fold)
n_splits = 10
test_size = 0.33
seed = 7 
kfold = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.496% (1.698%)


## What Jutsu to Use When?
1. **K-fold cross-validation** is the gold standard for estimating ML algo performance on UNSEEN DATA.
2. **Train/test split** is good for slow algos + produces performance estimates w/ lower bias on larger datasets.
3. Jutsu like repeated random splits can help a data scientist balance variance in the estimated performance against training speed.
4. If in doubt, use 10-fold cross-validation, but really find the fastest jutsu that still gives a reasonable estimate.