### Classification metrics
(credit to Jason Brownlee's [blogpost](https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/))
* Accuracy score, using a **logistic regression** classifier (which fits a sigmoid function to model the probability of the positive class)

In [22]:
# Cross Validation Classification Accuracy
import pandas as pd
from sklearn import model_selection #for kfold,train_test_split and cross-validation score
from sklearn.linear_model import LogisticRegression
url ="https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
#the raw data only has values, because of that the variable names have to
#be provided separately 
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi',
'age', 'class'] 
dataframe = pd.read_csv(url, names=names)
print(dataframe.shape)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
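#folds are taken in order (no shuffling), so no random_state is needed;
#recent scikit-learn versions raise an error if random_state is set without shuffle=True
kfold = model_selection.KFold(n_splits=10)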
model = LogisticRegression()
scoring = 'accuracy' #classification accuracy
results = model_selection.cross_val_score(model, X, Y, cv=kfold,
scoring=scoring)
print("Accuracy: {:.3f} ({:.3f})".format(results.mean(), results.std()))

(768, 9)
Accuracy: 0.770 (0.048)


In [7]:
dataframe['class'].value_counts()
#the target variable is moderately imbalanced: about 65% class 0 and 35% class 1

0    500
1    268
Name: class, dtype: int64
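
The class counts above imply a simple baseline for accuracy: always predict the most frequent class. A minimal sketch, with the counts copied by hand from the `value_counts()` output (the variable names are made up for illustration):

```python
# Class counts copied by hand from the value_counts() output above.
class_counts = {0: 500, 1: 268}

total = sum(class_counts.values())

# The class that occurs most often in the data.
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print("Majority class: {}".format(majority_class))
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
```

The cross-validated accuracy of 0.770 is best read against this baseline (roughly 0.65) rather than against 0.5.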

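* Logarithmic loss (0 is a perfect score and there is no upper bound; always predicting a probability of 0.5 gives ln 2, about 0.693). LogLoss = -1 * the mean log-likelihood of the true labels under the predicted probabilities. scikit-learn's `neg_log_loss` scorer returns the negated loss so that higher is better, so the log loss is the absolute value of the printed score.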

In [8]:
# Cross Validation Classification LogLoss
scoring = 'neg_log_loss'
results = model_selection.cross_val_score(model, X, Y, cv=kfold,
scoring=scoring)
print("Logloss: {:.3f} ({:.3f})".format(results.mean(), results.std()))

Logloss: -0.493 (0.047)
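
To make the definition concrete, here is a hand-rolled sketch of binary log loss in plain Python. The function name `log_loss_manual`, the clipping constant `eps` and the toy labels and probabilities are all assumptions for illustration; this is not how scikit-learn implements it internally.

```python
import math

def log_loss_manual(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels.

    y_true: 0/1 labels; p_pred: predicted probabilities of class 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]  # made-up labels

loss_confident_right = log_loss_manual(y_true, [0.9, 0.1, 0.8, 0.2])
loss_uninformative = log_loss_manual(y_true, [0.5, 0.5, 0.5, 0.5])
loss_confident_wrong = log_loss_manual(y_true, [0.01, 0.99, 0.01, 0.99])

print("confident and right: {:.3f}".format(loss_confident_right))
print("always 0.5:          {:.3f}".format(loss_uninformative))
print("confident and wrong: {:.3f}".format(loss_confident_wrong))
```

The last case comes out well above 1, which shows that log loss has no upper bound: confident wrong predictions are punished heavily.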


* Area under the ROC curve (an area of 1 means perfect classification, an area of 0.5 means the model is no better than random guessing)

In [9]:
# Cross Validation Classification ROC AUC
scoring = 'roc_auc'
results = model_selection.cross_val_score(model, X, Y, cv=kfold,
scoring=scoring)
print("AUC: {:.3f} ({:.3f})".format(results.mean(), results.std()))

AUC: 0.824 (0.041)
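
One way to read the AUC is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair (ties count as half). The function name `roc_auc_pairwise` and the toy data are assumptions for illustration; it is a quadratic-time illustration, not the algorithm scikit-learn uses.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]  # made-up labels

auc_perfect = roc_auc_pairwise(y_true, [0.1, 0.2, 0.8, 0.9])
auc_partial = roc_auc_pairwise(y_true, [0.1, 0.4, 0.35, 0.8])
auc_constant = roc_auc_pairwise(y_true, [0.5, 0.5, 0.5, 0.5])

print(auc_perfect, auc_partial, auc_constant)
```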


* confusion matrix

In [20]:
# Classification Confusion Matrix (single train/test split)
from sklearn.metrics import confusion_matrix

#a confusion matrix is computed from the predictions of one fitted model,
#so use a single train/test split here instead of cross-validation:
test_size = 0.33
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X,
                                        Y, test_size=test_size,random_state=seed)
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
print(matrix)
#rows are the actual classes, columns are the predicted classes
print('actual 0, predicted 0\tactual 0, predicted 1')
print('actual 1, predicted 0\tactual 1, predicted 1')

[[141  21]
 [ 41  51]]
actual 0, predicted 0	actual 0, predicted 1
actual 1, predicted 0	actual 1, predicted 1
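
The layout of the matrix is easier to remember after building one by hand. A minimal sketch in plain Python, following the same convention as `confusion_matrix` (rows are actual classes, columns are predicted classes); the helper name `confusion_counts` and the toy labels are made up for illustration:

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Count (actual, predicted) pairs; rows = actual, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

y_true = [0, 0, 0, 1, 1, 1]  # made-up actual classes
y_pred = [0, 0, 1, 0, 1, 1]  # made-up predictions

toy_matrix = confusion_counts(y_true, y_pred)
for row in toy_matrix:
    print(row)
```

Correct predictions land on the diagonal and every off-diagonal cell is one kind of mistake.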


* classification report

In [21]:
# Classification Report (same train/test split as the confusion matrix)
from sklearn.metrics import classification_report

report = classification_report(Y_test, predicted)
print(report)

             precision    recall  f1-score   support

        0.0       0.77      0.87      0.82       162
        1.0       0.71      0.55      0.62        92

avg / total       0.75      0.76      0.75       254
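
The numbers in the report can be reproduced from the confusion matrix printed earlier. A sketch with the four counts copied by hand from that output (the helper name `precision_recall_f1` is made up for illustration):

```python
# Counts copied by hand from the confusion matrix output above
# (rows = actual class, columns = predicted class).
matrix = [[141, 21],
          [41, 51]]

def precision_recall_f1(correct, wrongly_predicted, missed):
    """Per-class metrics from counts.

    correct: examples of the class predicted as the class
    wrongly_predicted: other examples predicted as the class
    missed: examples of the class predicted as something else
    """
    precision = correct / (correct + wrongly_predicted)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1: correct = matrix[1][1], column neighbour = matrix[0][1], row neighbour = matrix[1][0]
p1, r1, f1_1 = precision_recall_f1(matrix[1][1], matrix[0][1], matrix[1][0])
# Class 0: correct = matrix[0][0], column neighbour = matrix[1][0], row neighbour = matrix[0][1]
p0, r0, f1_0 = precision_recall_f1(matrix[0][0], matrix[1][0], matrix[0][1])

accuracy = (matrix[0][0] + matrix[1][1]) / sum(sum(row) for row in matrix)

print("class 0: precision {:.2f}  recall {:.2f}  f1 {:.2f}".format(p0, r0, f1_0))
print("class 1: precision {:.2f}  recall {:.2f}  f1 {:.2f}".format(p1, r1, f1_1))
print("accuracy: {:.3f}".format(accuracy))
```

The low recall for class 1 means the model misses a large share of the positive cases, which a single accuracy figure hides.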



### Regression metrics
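* Mean absolute error (MAE). scikit-learn's `neg_mean_absolute_error` scorer returns the negated error so that higher is better; the MAE is the absolute value of the printed score.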

In [29]:
# Cross Validation Regression MAE
from sklearn.linear_model import LinearRegression
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
#columns in this file are separated by whitespace
dataframe = pd.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
model = LinearRegression()
scoring = 'neg_mean_absolute_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold,scoring=scoring)
print("MAE: {:.3f} ({:.3f})".format(results.mean(), results.std()))
dataframe.MEDV.describe()

MAE: -4.005 (2.084)


count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64
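
To put the MAE in context, it helps to flip the sign of the score and compare it with the typical size of the target. A small sketch using the two numbers copied by hand from the outputs above (the variable names are made up; MEDV is usually documented as the median home value in thousands of dollars, which is assumed here):

```python
# Values copied by hand from the outputs above.
neg_mae_score = -4.005   # mean of the 'neg_mean_absolute_error' scores
medv_mean = 22.532806    # mean of MEDV from describe()

# The scorer is negated, so flip the sign to get the error itself.
mae = -neg_mae_score

# Error relative to the average target value.
relative_error = mae / medv_mean

print("MAE: {:.3f}".format(mae))
print("MAE as a fraction of mean MEDV: {:.3f}".format(relative_error))
```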

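* Mean squared error (MSE). The `neg_mean_squared_error` scorer returns the negated error, so the MSE is the absolute value of the printed score; its square root (RMSE) is in the same units as the target.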

In [26]:
# Cross Validation Regression MSE
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold,scoring=scoring)
print("MSE: {:.3f} ({:.3f})".format(results.mean(), results.std()))

MSE: -34.705 (45.574)
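
MSE squares each error before averaging, so a few large misses dominate it in a way they do not dominate MAE. A toy sketch in plain Python (the function names `mae` and `mse` and the data are made up for illustration):

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]        # made-up targets
evenly_off = [4.0, 4.0, 8.0, 8.0]    # every prediction is off by 1
one_big_miss = [3.0, 5.0, 7.0, 13.0] # three exact, one off by 4

print("evenly off:   MAE {:.1f}  MSE {:.1f}".format(
    mae(y_true, evenly_off), mse(y_true, evenly_off)))
print("one big miss: MAE {:.1f}  MSE {:.1f}".format(
    mae(y_true, one_big_miss), mse(y_true, one_big_miss)))
```

Both sets of predictions have the same MAE, but the one with a single large miss has four times the MSE.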


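* R^2 metric (coefficient of determination): 1 is a perfect fit, 0 is no better than always predicting the mean of the target, and negative values are worse than that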

In [28]:
# Cross Validation Regression R^2
scoring = 'r2'
results = model_selection.cross_val_score(model, X, Y, cv=kfold,
scoring=scoring)
print("R^2: {:.3f} ({:.3f})".format(results.mean(), results.std()))

R^2: 0.203 (0.595)
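
R^2 compares the model's squared error with that of always predicting the mean of the target: R^2 = 1 - SS_res / SS_tot. A hand-rolled sketch in plain Python (the function name `r2_manual` and the toy data are made up for illustration):

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # mean predictor's squared error
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]  # made-up targets

r2_perfect = r2_manual(y_true, [1.0, 2.0, 3.0, 4.0])
r2_mean = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])
r2_reversed = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])

print(r2_perfect, r2_mean, r2_reversed)
```

The third case is negative, so R^2 is not bounded below by 0. This is why individual cross-validation folds can pull the mean down sharply.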
