# Predicting whether a patient's biomechanical features are abnormal or normal
by Kevin Young

I will be using the dataset provided by UCI Machine Learning.

The data contains 310 patient records, each described by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine:

- pelvic incidence
- pelvic tilt
- lumbar lordosis angle
- sacral slope
- pelvic radius
- grade of spondylolisthesis

## Preparing the data
First I import the data file into a pandas dataframe. Fortunately, there are no missing values in the data set, so no cleaning is needed.

In [1]:
import pandas as pd

patients_data = pd.read_csv('column_2C_weka.csv')
patients_data.head()

| | pelvic_incidence | pelvic_tilt numeric | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis | class |
|---|---|---|---|---|---|---|---|
| 0 | 63.027818 | 22.552586 | 39.609117 | 40.475232 | 98.672917 | -0.2544 | Abnormal |
| 1 | 39.056951 | 10.060991 | 25.015378 | 28.99596 | 114.405425 | 4.564259 | Abnormal |
| 2 | 68.832021 | 22.218482 | 50.092194 | 46.613539 | 105.985135 | -3.530317 | Abnormal |
| 3 | 69.297008 | 24.652878 | 44.311238 | 44.64413 | 101.868495 | 11.211523 | Abnormal |
| 4 | 49.712859 | 9.652075 | 28.317406 | 40.060784 | 108.168725 | 7.918501 | Abnormal |


Now I convert the dataframe into NumPy arrays for use with scikit-learn: one array holding the feature data, one holding the class labels, plus a list of the feature names.

In [2]:
# Define the feature columns once and reuse the list for selection
feature_names = ['pelvic_incidence', 'pelvic_tilt numeric', 'lumbar_lordosis_angle', 'sacral_slope', 'pelvic_radius', 'degree_spondylolisthesis']

all_features = patients_data[feature_names].values

all_classes = patients_data['class'].values

all_features

array([[  63.0278175 ,   22.55258597,   39.60911701,   40.47523153,
          98.67291675,   -0.25439999],
       [  39.05695098,   10.06099147,   25.01537822,   28.99595951,
         114.4054254 ,    4.56425864],
       [  68.83202098,   22.21848205,   50.09219357,   46.61353893,
         105.9851355 ,   -3.53031731],
       ..., 
       [  61.44659663,   22.6949683 ,   46.17034732,   38.75162833,
         125.6707246 ,   -2.70787952],
       [  45.25279209,    8.69315736,   41.5831264 ,   36.55963472,
         118.5458418 ,    0.21475017],
       [  33.84164075,    5.07399141,   36.64123294,   28.76764934,
         123.9452436 ,   -0.19924909]])

Next I standardise the input features so that each one has zero mean and unit variance.

In [3]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
all_features_scaled

array([[ 0.14708636,  0.50136873, -0.6651769 , -0.18495031, -1.4476468 ,
        -0.70805942],
       [-1.24586434, -0.74876898, -1.45300075, -1.0415207 , -0.26438488,
        -0.57955637],
       [ 0.4843695 ,  0.46793218, -0.09926175,  0.2730833 , -0.89768556,
        -0.79542095],
       ..., 
       [ 0.05520137,  0.51561812, -0.31097748, -0.31356364,  0.58289256,
        -0.77348834],
       [-0.88582307, -0.88565951, -0.55861259, -0.47712775,  0.04702109,
        -0.69554822],
       [-1.54892681, -1.24785954, -0.82539423, -1.05855695,  0.45311695,
        -0.70658867]])
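
As a quick check on what `StandardScaler` does, the sketch below standardises a small made-up array by hand: each column has its mean subtracted and is then divided by its population standard deviation (`ddof=0`, which is what `StandardScaler` uses). The array `toy` and its values are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Toy feature matrix (hypothetical values, 4 samples x 2 features)
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

# Column-wise standardisation: z = (x - mean) / std
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)          # population std (ddof=0)
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.mean(axis=0))     # column means, approximately 0
print(toy_scaled.std(axis=0))      # column standard deviations, approximately 1
```

After this transform, features measured on very different ranges contribute on a comparable scale, which matters for distance-based and gradient-based models.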

## Logistic Regression

Since this is a binary classification problem, I will first try logistic regression and measure its accuracy with 10-fold cross-validation.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression()
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.8193548387096774
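
One caveat: scaling the full dataset before cross-validation lets the scaler see the held-out rows of every fold. A minimal sketch of one way to avoid this, assuming a scikit-learn `Pipeline` is acceptable here, is shown below. It uses synthetic data from `make_classification` as a stand-in for `all_features` and `all_classes`, so the score it prints says nothing about the patient data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 310 samples, 6 features, 2 classes (not the patient data)
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)

# Inside cross_val_score the scaler is re-fitted on the training part of each fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline_scores = cross_val_score(pipeline, X, y, cv=10)

print(pipeline_scores.mean())
```

On a dataset of this size the difference from scaling up front is likely to be small, but the pipeline keeps the evaluation strictly out-of-sample.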

## Decision Trees

Before using the DecisionTreeClassifier, I will create a train/test split of the data - 80% for training and 20% for testing.

In [5]:
import numpy
from sklearn.model_selection import train_test_split

numpy.random.seed(1234)

training_inputs, testing_inputs, training_classes, testing_classes = train_test_split(
    all_features_scaled, all_classes, train_size=0.8, random_state=1)



Now I fit a DecisionTreeClassifier to the training data.

In [6]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=1)

clf.fit(training_inputs, training_classes)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

Next I measure the accuracy of the decision tree model on the test data.

In [7]:
clf.score(testing_inputs, testing_classes)

0.80645161290322576
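
For a classifier, `score` returns plain accuracy: the fraction of test rows predicted correctly. Assuming the 20% split of 310 rows leaves 62 test rows, the reported value corresponds to 50 correct predictions, which the arithmetic below checks.

```python
n_rows = 310
n_train = int(0.8 * n_rows)     # rows used for training
n_test = n_rows - n_train       # rows held out for testing

reported_accuracy = 0.80645161290322576
n_correct = round(reported_accuracy * n_test)   # implied number of correct predictions

print(n_test, n_correct, n_correct / n_test)
```

With only 62 test rows, a single extra correct or incorrect prediction moves the accuracy by about 1.6 percentage points.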

A single train/test split gives a noisy accuracy estimate, so I now use K-fold cross-validation (K=10) to get a more reliable measure of how well the model generalises.

In [8]:
clf = DecisionTreeClassifier(random_state=1)

cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.78709677419354851
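
To make explicit what `cross_val_score` does with `cv=10`, here is a sketch that runs the folds by hand. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so the manual loop should reproduce the same per-fold scores. The data is synthetic (`make_classification`) and is used only so that the example runs on its own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and class labels
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)
tree = DecisionTreeClassifier(random_state=1)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    fold_model = clone(tree)                       # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])     # train on nine folds
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))  # test on the tenth

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(tree, X, y, cv=10)

print(manual_scores.mean(), manual_scores.std())
print(np.allclose(manual_scores, auto_scores))
```

Looking at the spread of the per-fold scores, and not only their mean, gives a sense of how much the comparison between models can be trusted.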

Now I will also try a RandomForestClassifier to see whether its accuracy is any better.

In [9]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, random_state=1)
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.80322580645161279

## Support Vector Machines

svm.SVC supports several kernels, which can differ in performance. I will try linear, rbf, sigmoid and poly and see which gives the highest accuracy.

In [10]:
from sklearn import svm

C = 1.0
svc = svm.SVC(kernel='linear', C=C)
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.83548387096774201

In [11]:
svc = svm.SVC(kernel='rbf', C=C)
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.81612903225806444

In [12]:
svc = svm.SVC(kernel='sigmoid', C=C)
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.7870967741935484

In [13]:
svc = svm.SVC(kernel='poly', C=C)
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.71612903225806457
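
The four kernel runs can also be written as a single loop, which keeps the settings identical across kernels and makes the comparison easier to read. This is a sketch on synthetic stand-in data (`make_classification`), so its scores are not the ones reported for the patient data.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient features and classes
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)
X_scaled = StandardScaler().fit_transform(X)

# Mean 10-fold accuracy for each kernel, all with the same C
kernel_scores = {}
for kernel in ['linear', 'rbf', 'sigmoid', 'poly']:
    svc = svm.SVC(kernel=kernel, C=1.0)
    kernel_scores[kernel] = cross_val_score(svc, X_scaled, y, cv=10).mean()

# Print kernels from best to worst
for kernel, score in sorted(kernel_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(kernel, ":", score)
```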

## K-Nearest Neighbours

To find the best value of K, I will loop over values of K from 1 to 50 and compare the cross-validated accuracy obtained for each.

In [14]:
from sklearn import neighbors

for i in range(1, 51):
    # n_neighbors must follow the loop variable so that each K is actually tested
    clf = neighbors.KNeighborsClassifier(n_neighbors=i)
    cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)
    print(i, ":", cv_scores.mean())

The run recorded here only measured K=10, so that is the one score shown:

10 : 0.790322580645
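
The same search over K can also be written with `GridSearchCV`, which runs the cross-validation for every candidate value and records the best one. A minimal sketch, using synthetic stand-in data in place of the patient data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled patient features and classes
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)
X_scaled = StandardScaler().fit_transform(X)

# Try every K from 1 to 50 with 10-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': list(range(1, 51))},
                      cv=10)
search.fit(X_scaled, y)

print(search.best_params_, search.best_score_)
```

The per-K mean scores are available in `search.cv_results_['mean_test_score']` if the full curve is of interest.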


## Naive Bayes

In [15]:
from sklearn.naive_bayes import MultinomialNB

scaler = preprocessing.MinMaxScaler()
all_features_minmax = scaler.fit_transform(all_features)

clf = MultinomialNB()
cv_scores = cross_val_score(clf, all_features_minmax, all_classes, cv=10)

cv_scores.mean()

0.67741935483870974
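
`MultinomialNB` is designed for count-like, non-negative features, which is why the inputs are rescaled to the range [0, 1] above. For continuous measurements, `GaussianNB` is a common alternative that needs no rescaling. The sketch below shows how it could be evaluated in the same way. It runs on synthetic stand-in data, so no accuracy is claimed for the patient dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the six continuous features and two classes
X, y = make_classification(n_samples=310, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1)

# GaussianNB models each feature as normally distributed within each class
gnb = GaussianNB()
gnb_scores = cross_val_score(gnb, X, y, cv=10)

print(gnb_scores.mean())
```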

## Neural Networks

In [16]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def create_model():
    model = Sequential()
    model.add(Dense(10, input_dim=6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)
cv_scores = cross_val_score(estimator, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.73870966732501986

## Conclusion

The best-performing model was the SVM with a linear kernel, which yielded the highest accuracy at 83.5%.
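
For reference, the scores reported in the sections above can be collected and ranked in a few lines. The numbers are copied from the recorded outputs; the K-nearest-neighbours entry is the score recorded for K=10.

```python
# Mean accuracies as recorded in the sections above
reported_scores = {
    'Logistic regression': 0.8193548387096774,
    'Decision tree': 0.78709677419354851,
    'Random forest': 0.80322580645161279,
    'SVM (linear)': 0.83548387096774201,
    'SVM (rbf)': 0.81612903225806444,
    'SVM (sigmoid)': 0.7870967741935484,
    'SVM (poly)': 0.71612903225806457,
    'KNN (K=10)': 0.790322580645,
    'Naive Bayes (multinomial)': 0.67741935483870974,
    'Neural network': 0.73870966732501986,
}

# Rank the models from highest to lowest accuracy
for name, score in sorted(reported_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {score:.1%}")

best_model = max(reported_scores, key=reported_scores.get)
print("Best:", best_model)
```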