# Evaluate the performance of machine learning algorithms with k-fold cross-validation
### Before the midterm, it is best to always use k-fold!!

* Cross-validation is an approach for estimating the performance of a machine learning algorithm with less variance than a single train-test split. It works by splitting the dataset into k parts (e.g. k = 5 or k = 10).
* Each part of the data is called a fold. The algorithm is trained on k − 1 folds and tested on the one fold that was held back.
* This is repeated so that each fold of the dataset is given a chance to be the held back test set. 
* After running cross-validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.
* For modest-sized datasets with thousands or tens of thousands of records, k values of 3, 5 and 10 are common.
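
The procedure in the list above can be sketched in plain Python. This is only a minimal illustration, not scikit-learn's implementation: `k_fold_indices` is a hypothetical helper, the 10-sample dataset is made up, and the folds are contiguous and unshuffled, which is assumed to mirror `KFold` with its default settings.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # When n_samples is not divisible by k, the first `extra` folds get one more sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))                              # held-back fold
        train = list(range(0, start)) + list(range(stop, n_samples)) # the other k - 1 folds
        yield train, test
        start = stop


# Hypothetical toy dataset: 10 samples, 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in the test set exactly once, and a model would be fit once per fold on the corresponding `train` indices.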

In [5]:
# Evaluate using Cross Validation
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'pima-indians-diabetes.data.csv' # well-known dataset on the likelihood of diabetes among the Pima Indians
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
# num_folds = 5
kfold = KFold(n_splits=num_folds)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold) # training (and scoring) for every fold happens inside this call
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.951% (4.841%)
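
The two numbers printed above are the mean and the standard deviation of the per-fold accuracies. A small sketch of that summary step, using made-up fold scores (the values in `fold_scores` are hypothetical, not the ones from the run above). NumPy's `std()` defaults to the population standard deviation (`ddof=0`), so `statistics.pstdev` is the matching standard-library call.

```python
import statistics

# Hypothetical per-fold accuracies from a 5-fold run (illustrative values only).
fold_scores = [0.70, 0.81, 0.75, 0.72, 0.77]

mean_score = statistics.mean(fold_scores)
# Population standard deviation, matching numpy's default ddof=0.
spread = statistics.pstdev(fold_scores)
# Sample standard deviation (ddof=1) is slightly larger for the same scores.
sample_spread = statistics.stdev(fold_scores)

print("Accuracy: %.3f%% (%.3f%%)" % (mean_score * 100.0, spread * 100.0))
```

With only a handful of folds, the difference between the two standard deviations is noticeable, so it helps to state which one is reported.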


* We can compare the results with the simple train-test splitting approach.

In [6]:
# Evaluate using a train and a test set
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]
test_portion = [0.5, 0.33, 0.25, 0.1]
seed = 7

for test_size in test_portion:
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    model = LogisticRegression(solver='liblinear')
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Test size:", test_size, "Accuracy: %.3f%%" % (result*100.0))

Test size: 0.5 Accuracy: 77.604%
Test size: 0.33 Accuracy: 75.591%
Test size: 0.25 Accuracy: 78.125%
Test size: 0.1 Accuracy: 83.117%
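
One reason the single-split scores above vary so much is that smaller test sets give coarser estimates. The sketch below works out how many samples each split holds back. It assumes the dataset has 768 rows and that the test count is rounded up, as `train_test_split` is understood to do for a fractional `test_size`. Both assumptions are consistent with the printed accuracies (for example, 83.117% is 64 correct out of 77).

```python
import math

n_samples = 768  # assumed number of rows in the dataset
test_portion = [0.5, 0.33, 0.25, 0.1]

test_counts = {}
for test_size in test_portion:
    # Assumed rounding rule: number of test samples is rounded up.
    n_test = math.ceil(test_size * n_samples)
    test_counts[test_size] = n_test
    # Each test sample moves the accuracy by this many percentage points.
    step = 100.0 / n_test
    print("Test size:", test_size, "-> test samples:", n_test,
          "one sample = %.2f%% accuracy" % step)
```

With a 10% test set, a single prediction shifts the score by more than one percentage point, so that estimate is the noisiest of the four.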


## Leave One Out Cross-Validation (LOOCV)
#### Very well known in statistics
* You can configure cross-validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross-validation is called **leave-one-out cross-validation**. 
* The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of your model on unseen data. 
* A downside is that it can be far more computationally expensive than k-fold cross-validation, because the model is trained once per observation. In the example below we use leave-one-out cross-validation.

- The resulting value should be the most representative.

In [7]:
# Evaluate using Leave One Out Cross Validation
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
loocv = LeaveOneOut()
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.823% (42.196%)
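
The large standard deviation is expected rather than a sign of an unstable model: with one test sample per fold, every score is either 0 or 1, so the spread is that of a yes/no outcome, sqrt(p(1 − p)). A sketch that reproduces the reported numbers, assuming 768 rows of which 590 were classified correctly (both counts are inferred from the printed mean, not taken from the run itself):

```python
import math

n_samples = 768  # assumed number of rows in the dataset
n_correct = 590  # assumed number of correct predictions (590 / 768 = 76.823%)

# Under LOOCV each fold's accuracy is 1.0 (correct) or 0.0 (wrong).
scores = [1.0] * n_correct + [0.0] * (n_samples - n_correct)

mean_score = sum(scores) / n_samples
# Population standard deviation of the 0/1 scores.
std_score = math.sqrt(sum((s - mean_score) ** 2 for s in scores) / n_samples)
# Closed form for a 0/1 outcome with success rate p = mean_score.
closed_form = math.sqrt(mean_score * (1.0 - mean_score))

print("Accuracy: %.3f%% (%.3f%%)" % (mean_score * 100.0, std_score * 100.0))
```

Both `std_score` and `closed_form` come out at about 42.2%, matching the output above, so the LOOCV standard deviation says nothing beyond what the mean already does.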


### Generally, k-fold cross-validation with k set to 3, 5, or 10 is the gold standard for evaluating the performance of a machine learning algorithm on unseen data.