In [106]:
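# import pandas
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
# NOTE: header=None tells pandas that the file has no header row, so the file's own
# column names are read in as data row 0 (visible in the head() output below)
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)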


We would like to predict whether a patient's diabetes test will be positive or negative, given various measurements for that patient.

In [107]:
pima.head()

      pregnant  glucose             bp           skin  insulin   bmi                  pedigree  age    label
0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
1            6      148             72             35        0  33.6                     0.627   50        1
2            1       85             66             29        0  26.6                     0.351   31        0
3            8      183             64              0        0  23.3                     0.672   32        1
4            1       89             66             23       94  28.1                     0.167   21        0
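
Row 0 of this output holds the file's own column names, which suggests the header line was read as data. The sketch below shows one way to load such a file so that the header line is replaced rather than kept. It uses an in-memory stand-in for diabetes.csv built from the rows shown above; `sample_csv` and `pima_clean` are illustrative names, not part of the original notebook, and the outputs recorded further down were not produced with this loading step.

```python
import io

import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Stand-in for diabetes.csv (assumption): the header line plus the four data rows shown above.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
)

# header=0 marks the first line as the header, and names= then replaces it,
# so the original column names do not end up as a data row.
pima_clean = pd.read_csv(sample_csv, header=0, names=col_names)

print(pima_clean.shape)   # number of rows and columns actually loaded
print(pima_clean.dtypes)  # columns should now be numeric rather than object
```

With numeric columns, `pd.get_dummies` leaves the features unchanged instead of one-hot encoding every distinct string value.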


In [108]:
# split dataset into features and target variable ('skin' is not used as a feature)
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pd.get_dummies(pima[feature_cols])  # Features (get_dummies one-hot encodes any non-numeric columns)
y = pima.label  # Target variable


In [127]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)

In [128]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
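# random_state seeds the solver's data shuffling (only used by the 'sag', 'saga' and
# 'liblinear' solvers); it does not affect the train/test split
logreg = LogisticRegression(random_state=16)

# fit the model with data
logreg.fit(X_train, y_train)

# predict class labels for the test set
y_pred = logreg.predict(X_test)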

In [129]:
# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix # not really good so far...

array([[101,  30],
       [ 30,  32]])

The confusion matrix is a table that summarizes the performance of a classification algorithm by comparing the predicted and actual class labels for a set of test data. For the second class (second row, the patients who actually have diabetes), the model gets around half of the samples wrong: there are almost as many false negatives (FN = 30) as true positives (TP = 32).
scikit-learn puts the actual classes in the rows and the predicted classes in the columns, so the order is "TN, FP" on row 1 and "FN, TP" on row 2, where TN means "true negative", FP "false positive", FN "false negative" and TP "true positive".
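
The cell layout can be checked on a small hand-made example. The following sketch uses made-up labels (`y_true_demo` and `y_hat_demo` are illustrative, not taken from the diabetes data) chosen so that the four counts are easy to verify by eye.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 actual negatives followed by 6 actual positives.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# Predictions: one negative is flagged wrongly, two positives are missed.
y_hat_demo = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

cm = confusion_matrix(y_true_demo, y_hat_demo)
print(cm)

# Rows are actual classes and columns are predicted classes,
# so flattening a binary matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```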

In [130]:
from sklearn.metrics import classification_report
target_names = ['without diabetes', 'with diabetes']
print(classification_report(y_test, y_pred, target_names=target_names)) # also works for multiclass classification

                  precision    recall  f1-score   support

without diabetes       0.77      0.77      0.77       131
   with diabetes       0.52      0.52      0.52        62

        accuracy                           0.69       193
       macro avg       0.64      0.64      0.64       193
    weighted avg       0.69      0.69      0.69       193



This report shows the performance metrics of a binary classification model. The model has predicted whether a person has diabetes or not, and the metrics are calculated by comparing the predicted values with the actual values. Precision, recall, and F1-score are three commonly used metrics for evaluating a binary classification model. Precision measures the proportion of true positives among all the positive predictions, while recall measures the proportion of true positives among all the actual positives.

F1-score is the harmonic mean of precision and recall. The support column shows the number of test samples in each class. In this case, there are 131 samples without diabetes and 62 samples with diabetes. The accuracy is the proportion of correct predictions among all the predictions. The macro avg and weighted avg are averages of the metrics across all the classes, with the former giving equal weight to each class and the latter weighting each class by its number of samples.
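
As a cross-check, the numbers in the report can be recomputed by hand from the four counts in the confusion matrix shown earlier. This is a minimal sketch in plain Python; the helper `prf` is an illustrative name, not a scikit-learn function.

```python
# Counts from the confusion matrix above (rows: actual class, columns: predicted class).
tn, fp = 101, 30  # actual "without diabetes"
fn, tp = 30, 32   # actual "with diabetes"


def prf(hits, false_alarms, misses):
    """Return (precision, recall, f1) for one class."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class "with diabetes": hits are TP, false alarms are FP, misses are FN.
pos = prf(tp, fp, fn)
# Class "without diabetes": the roles swap, so hits are TN, false alarms are FN, misses are FP.
neg = prf(tn, fn, fp)

support_neg, support_pos = tn + fp, fn + tp
total = support_neg + support_pos

accuracy = (tn + tp) / total
macro_f1 = (neg[2] + pos[2]) / 2                                   # equal weight per class
weighted_f1 = (neg[2] * support_neg + pos[2] * support_pos) / total  # weighted by support

print("without diabetes:", [round(v, 2) for v in neg], support_neg)
print("with diabetes:   ", [round(v, 2) for v in pos], support_pos)
print("accuracy:", round(accuracy, 2))
print("macro avg f1:", round(macro_f1, 2), "weighted avg f1:", round(weighted_f1, 2))
```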

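You got a classification rate of 0.69 (69%), which is not a very good accuracy: always predicting "without diabetes" would already be correct for 131 of the 193 test samples (68%). How can you improve it?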

In [121]:
# set the regularization explicitly: L2 penalty with C=1
# (C=1 is also the default; change it to e.g. C=10 for weaker regularization)
logreg1 = LogisticRegression(random_state=16, penalty="l2", C=1)
logreg1.fit(X_train, y_train) # it does not get better with a different C
# C is the inverse regularization strength: C = 1/lambda as in the slides

y_pred1 = logreg1.predict(X_test)

# for a sparse model, use an L1 penalty;
# otherwise L2 is usually a good choice

In [122]:
target_names = ['without diabetes', 'with diabetes']
print(classification_report(y_test, y_pred1, target_names=target_names))

                  precision    recall  f1-score   support

without diabetes       0.77      0.77      0.77       131
   with diabetes       0.52      0.52      0.52        62

        accuracy                           0.69       193
       macro avg       0.64      0.64      0.64       193
    weighted avg       0.69      0.69      0.69       193



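The regularization parameter C is defined such that a large value corresponds to a weak regularization effect. The model above was refit with C=1, which is the default, so an identical report is expected; according to the notes in that cell, other values of C did not improve the results either.
Increasing the amount of training data, or changing how the data is split into training and test sets (reshuffling), may change the classification accuracy. It is also worth checking the data loading first: the head() output shows the file's header row read in as a data row, which leaves every column stored as strings.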
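
How sensitive a model is to C can be checked directly by refitting it over a range of values. The sketch below does this on synthetic data from `make_classification`, used as a stand-in because the diabetes file is not available here; the data, the grid of C values and `max_iter=1000` are all assumptions made for illustration, so the scores say nothing about the diabetes model itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 samples with 7 features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=16
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=16
)

# Refit with increasingly weak regularization (larger C = smaller penalty).
scores = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(penalty="l2", C=C, max_iter=1000)
    model.fit(X_tr, y_tr)
    scores[C] = model.score(X_te, y_te)  # accuracy on the held-out test set

for C, acc in scores.items():
    print(f"C={C:<6} test accuracy={acc:.3f}")
```

If the accuracies barely move across several orders of magnitude of C, regularization strength is unlikely to be the bottleneck, and attention is better spent on the data itself.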