Importing useful libraries and setting the random state

In [391]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix, f1_score, cohen_kappa_score

%matplotlib inline

random_state = 42

## POINT 1

Loading the data and replacing the '?' character with the proper null encoding (i.e. np.nan)

In [392]:
df = pd.read_csv('horse-colic.csv', header = None)
df = df.replace('?', np.nan)
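As a small illustration of the replacement step, the sketch below applies the same `replace('?', np.nan)` call to a made-up three-row CSV held in memory. The values are hypothetical and are not taken from `horse-colic.csv`. It also shows `na_values='?'`, a `read_csv` option that performs the same recoding while parsing.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real file.
raw = "2,1,38.5,?\n1,?,39.2,4\n2,1,?,3\n"

# Same two steps as in the notebook: load without a header, then recode '?'.
toy = pd.read_csv(io.StringIO(raw), header=None)
toy = toy.replace('?', np.nan)

# Alternative: let read_csv recode '?' while parsing.
toy_alt = pd.read_csv(io.StringIO(raw), header=None, na_values='?')

nulls_per_column = toy.isna().sum().tolist()
nulls_per_column_alt = toy_alt.isna().sum().tolist()
print(nulls_per_column, nulls_per_column_alt)
```

Both routes mark the same cells as null. With `na_values`, the affected columns are parsed as numbers right away.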

Renaming the columns to numbers starting from 1, as shown in the description file 

In [393]:
df.columns = range(1,29)

Removing columns named 3, 25, 26, 27, 28.

In [394]:
df = df.drop([3,25,26,27,28], axis = 1)

Exploring the data

In [395]:
df.head()

,1,2,4,5,6,7,8,9,10,11,...,15,16,17,18,19,20,21,22,23,24
0,2,1,38.5,66,28,3.0,3.0,,2,5.0,...,,,3.0,5.0,45.0,8.4,,,2,2
1,1,1,39.2,88,20,,,4.0,1,3.0,...,,,4.0,2.0,50.0,85.0,2.0,2.0,3,2
2,2,1,38.3,40,24,1.0,1.0,3.0,1,3.0,...,,,1.0,1.0,33.0,6.7,,,1,2
3,1,9,39.1,164,84,4.0,1.0,6.0,2,2.0,...,2.0,5.0,3.0,,48.0,7.2,3.0,5.3,2,1
4,2,1,37.3,104,35,,,6.0,2,,...,,,,,74.0,7.4,,,2,2


Showing the count of nulls for each remaining column.

In [396]:
df.isna().sum()

1       1
2       0
4      60
5      24
6      58
7      56
8      69
9      47
10     32
11     55
12     44
13     56
14    104
15    106
16    247
17    102
18    118
19     29
20     33
21    165
22    198
23      1
24      0
dtype: int64
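Absolute counts are easier to judge next to the share of rows they represent. The sketch below, on a hypothetical toy frame, pairs `isna().sum()` with `isna().mean()`, which gives the fraction of nulls per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column labels mimic the numeric names used above.
toy = pd.DataFrame({1: [2.0, np.nan, 1.0, 2.0],
                    2: [1.0, 1.0, 9.0, 1.0],
                    4: [np.nan, np.nan, 38.3, np.nan]})

null_counts = toy.isna().sum()    # absolute number of nulls per column
null_share = toy.isna().mean()    # fraction of nulls per column

summary = pd.DataFrame({'nulls': null_counts, 'share': null_share})
print(summary.sort_values('nulls', ascending=False))
```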

## POINT 2

Using column 23 as target for classification

In [397]:
target = 23

Dropping rows where the target is null.

In [398]:
df = df.dropna(axis=0, subset=[target])

Storing the names of all columns except the target, so that they can be reused after the imputation phase

In [399]:
column_names = df.loc[:, df.columns != target].columns

Imputing the nulls on the predicting columns using the column means via sklearn.impute.SimpleImputer (i.e. replacing missing values using the mean along each column)

In [400]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df_predicting = pd.DataFrame(imputer.fit_transform(df.loc[:, df.columns != target]), columns = column_names, index = df.index)  # keep the row labels aligned with df[target]
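The cell above learns the column means from every row of the dataset. A possible variation, which is not what this notebook does, is to learn the means from the training rows only and reuse them on the test rows. The sketch below shows that on hypothetical toy frames named `train` and `test`, which stand in for a split made beforehand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: 'train' and 'test' stand in for a split made beforehand.
train = pd.DataFrame({'a': [1.0, 3.0, np.nan], 'b': [10.0, np.nan, 30.0]})
test = pd.DataFrame({'a': [np.nan, 5.0], 'b': [np.nan, 50.0]}, index=[7, 9])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Learn the column means from the training rows only ...
train_imp = pd.DataFrame(imputer.fit_transform(train),
                         columns=train.columns, index=train.index)
# ... and reuse them for the test rows; passing index= keeps the row labels.
test_imp = pd.DataFrame(imputer.transform(test),
                        columns=test.columns, index=test.index)

print(imputer.statistics_)   # per-column means learned from 'train'
print(test_imp)
```

The learned means are stored in `statistics_`, so the values used to fill the test rows can be inspected directly.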

Separating the dataset into the predicting part X and the target part y

In [401]:
X = df_predicting
y = df[target]

## POINT 3

Splitting the dataset into training and test sets

In [402]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = random_state)

Training, tuning and testing a Decision Tree classification model with cross-validation via GridSearchCV

In [403]:
score = 'accuracy'
cv = 5 # number of splits for cross-validation
model_param = {'criterion':['gini', 'entropy'], 'max_depth':list(range(1,10)), 'min_samples_split': list(range(2,10))}
model_est = DecisionTreeClassifier(random_state=random_state)

model = GridSearchCV(model_est, model_param, scoring=score, cv=cv)
model.fit(X_train,y_train);
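After fitting, a `GridSearchCV` object exposes which parameter combination won and how it scored. The sketch below repeats the same grid on synthetic three-class data from `make_classification`, which is an assumption made only to keep the example self-contained. The names `X_demo`, `y_demo` and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (hypothetical, not the horse-colic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': list(range(1, 10)),
              'min_samples_split': list(range(2, 10))}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)                  # combination with the best mean CV accuracy
print(round(search.best_score_, 3))         # mean cross-validated accuracy of that combination
print(round(search.score(X_te, y_te), 3))   # accuracy of the refitted best tree on held-out rows
n_candidates = len(search.cv_results_['params'])
```

The grid has 2 x 9 x 8 = 144 candidates, each fitted once per fold. With the default `refit=True`, the best candidate is refitted on the whole training set, and that refitted estimator is what `predict` uses.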

Printing the resulting confusion matrix on the test set, normalized over the true labels (each row sums to 1)

In [404]:
y_pred = model.predict(X_test)

print("Confusion matrix:\n{}".format(confusion_matrix(y_test, y_pred, normalize = 'true')))

Confusion matrix:
[[0.8372093  0.13953488 0.02325581]
 [0.4        0.6        0.        ]
 [0.58333333 0.33333333 0.08333333]]
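To make the `normalize = 'true'` option concrete, the sketch below uses a handful of hypothetical labels and checks that the normalized matrix equals the raw counts divided by the row totals. It follows that the diagonal holds the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels, chosen only to make the arithmetic easy to follow.
y_true = ['1', '1', '1', '1', '2', '2', '3', '3']
y_hat  = ['1', '1', '1', '2', '1', '2', '1', '3']

cm_counts = confusion_matrix(y_true, y_hat)
cm_norm = confusion_matrix(y_true, y_hat, normalize='true')

# Each row is divided by the number of true samples of that class ...
manual = cm_counts / cm_counts.sum(axis=1, keepdims=True)
# ... so the diagonal holds the per-class recall.
per_class_recall = recall_score(y_true, y_hat, average=None)

print(cm_counts)
print(cm_norm)
```

Read this way, the diagonal of the matrix printed above is the recall of each class on the test set.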


Showing the macro-averaged F1 score (selected with average='macro') and the Cohen's kappa score on the test set.

In [405]:
f1_model1 = f1_score(y_test, y_pred, average='macro')
print(f1_model1)

0.4934143870314083


In [406]:
cohen_kappa_model1 = cohen_kappa_score(y_test, y_pred)
print(cohen_kappa_model1)

0.34299191374663085
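Both metrics can be derived from a confusion matrix. The sketch below recomputes them by hand on hypothetical labels (not the notebook's predictions) and compares the results with the scikit-learn functions. Macro F1 is the unweighted mean of the per-class F1 values. Kappa compares the observed agreement with the agreement expected by chance from the row and column totals.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score

# Hypothetical labels, not the notebook's predictions.
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_hat  = [1, 1, 1, 2, 1, 2, 1, 3]

cm = confusion_matrix(y_true, y_hat)
tp = np.diag(cm)
precision = tp / cm.sum(axis=0)   # column sums = predicted counts per class
recall = tp / cm.sum(axis=1)      # row sums = true counts per class
f1_per_class = 2 * precision * recall / (precision + recall)
macro_f1_manual = f1_per_class.mean()   # unweighted mean over classes

n = cm.sum()
p_observed = tp.sum() / n                                      # plain accuracy
p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2    # agreement expected by chance
kappa_manual = (p_observed - p_expected) / (1 - p_expected)

print(macro_f1_manual, f1_score(y_true, y_hat, average='macro'))
print(kappa_manual, cohen_kappa_score(y_true, y_hat))
```

The manual formulas assume that every class is predicted at least once. Otherwise the precision of the missing class is undefined.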


## POINT 4

Repeating step 3 with a KNeighborsClassifier rather than a Decision Tree

In [407]:
score = 'accuracy'
cv = 5 # number of splits for cross-validation
model2_param = [{"n_neighbors": list(range(1, 16))}]
model2_est = KNeighborsClassifier()

model2 = GridSearchCV(model2_est, model2_param, scoring=score, cv=cv)
model2.fit(X_train,y_train);
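With a single tuned parameter, it can help to look at the cross-validated score of every candidate rather than only the winner. The sketch below does so through `cv_results_`, again on synthetic data from `make_classification`. The data and the names `X_demo`, `y_demo` and `knn_search` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data (hypothetical, not the horse-colic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=42)

knn_search = GridSearchCV(KNeighborsClassifier(),
                          [{'n_neighbors': list(range(1, 16))}],
                          scoring='accuracy', cv=5)
knn_search.fit(X_demo, y_demo)

# One row per candidate value of n_neighbors, with its mean CV accuracy.
results = pd.DataFrame(knn_search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score').head())
print(knn_search.best_params_)
```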

In [408]:
y_pred2 = model2.predict(X_test)

print("Confusion matrix:\n{}".format(confusion_matrix(y_test, y_pred2, normalize = 'true')))

Confusion matrix:
[[0.88372093 0.11627907 0.        ]
 [0.6        0.35       0.05      ]
 [0.5        0.33333333 0.16666667]]


In [409]:
f1_model2 = f1_score(y_test, y_pred2, average='macro')
print(f1_model2)

0.4744107744107744


In [410]:
cohen_kappa_model2 = cohen_kappa_score(y_test, y_pred2)
print(cohen_kappa_model2)

0.2659909122684375


## POINT 5

Comparing the F1 scores and the Cohen's kappa scores, the first model (i.e. the Decision Tree classifier) has been selected.
The Decision Tree has a slightly higher macro F1 score (F1 is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0).
The gap in Cohen's kappa is larger. Kappa is a number between -1 and 1 that expresses the level of agreement between two annotators on a classification problem, here the true labels and the predictions: 1 means complete agreement, while zero or lower means agreement no better than chance. This larger gap is what let me easily choose between the two classifiers.

In [411]:
print(f'Comparison between model scores: the f1 score of model 1 is {f1_model1} wrt model 2 {f1_model2}. The cohen kappa scores are: {cohen_kappa_model1} and {cohen_kappa_model2} ')

Comparison between model scores: the f1 score of model 1 is 0.4934143870314083 wrt model 2 0.4744107744107744. The cohen kappa scores are: 0.34299191374663085 and 0.2659909122684375 
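The same comparison can be laid out as a small table. The sketch below copies the four scores printed above into a DataFrame and computes the gap between the two models for each metric. The row labels are chosen here for illustration.

```python
import pandas as pd

# Scores copied from the outputs printed above.
scores = pd.DataFrame(
    {'f1_macro': [0.4934143870314083, 0.4744107744107744],
     'cohen_kappa': [0.34299191374663085, 0.2659909122684375]},
    index=['DecisionTree', 'KNeighbors'])

# Difference between the two models for each metric (positive favours the tree).
gap = scores.loc['DecisionTree'] - scores.loc['KNeighbors']
print(scores.round(3))
print(gap.round(3))
```

The gap is about 0.019 for macro F1 and about 0.077 for Cohen's kappa.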


In this case, an easy-to-implement approach with a tuned Decision Tree performs better than a KNN approach on this test set, as the scores above show.