# Biomechanical features of orthopedic patients
Classifying patients based on six biomechanical features

url: https://www.kaggle.com/uciml/biomechanical-features-of-orthopedic-patients/data

Method to use: k-nearest neighbors (k-NN)

## Exploring the data

In [1]:
# load data
bio <- read.csv("data.csv", stringsAsFactors = FALSE)

str(bio)

'data.frame':	310 obs. of  7 variables:
 $ pelvic_incidence        : num  63 39.1 68.8 69.3 49.7 ...
 $ pelvic_tilt.numeric     : num  22.55 10.06 22.22 24.65 9.65 ...
 $ lumbar_lordosis_angle   : num  39.6 25 50.1 44.3 28.3 ...
 $ sacral_slope            : num  40.5 29 46.6 44.6 40.1 ...
 $ pelvic_radius           : num  98.7 114.4 106 101.9 108.2 ...
 $ degree_spondylolisthesis: num  -0.254 4.564 -3.53 11.212 7.919 ...
 $ class                   : chr  "Abnormal" "Abnormal" "Abnormal" "Abnormal" ...


**Variable of interest**: class

In [2]:
table(bio$class)


Abnormal   Normal 
     210      100 
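
The counts show that the two classes are not balanced. As a minimal sketch, the same counts can be expressed as proportions, which also gives the accuracy of a trivial classifier that always predicts the majority class. The counts are copied from the `table()` output above; the names `class_counts`, `class_share` and `baseline_accuracy` are only illustrative.

```r
# Class counts copied from the table() output above
class_counts <- c(Abnormal = 210, Normal = 100)

# Share of each class, in percent
class_share <- round(100 * prop.table(class_counts), 1)
print(class_share)

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(class_counts) / sum(class_counts)
print(round(baseline_accuracy, 3))
```

Any k-NN result is worth comparing against this baseline rather than against 50%.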

Check for empty or missing values in the class column

In [3]:
# count empty cases
sum(bio$class == "")
# count na and nan values
sum(is.na(bio$class))
sum(is.nan(bio$class))
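
The checks above look only at the `class` column. A possible extension, sketched here on a small made-up data frame rather than on `bio` itself, counts missing values and empty strings in every column at once. The data frame `toy` and its values are assumptions for illustration.

```r
# Toy data frame standing in for `bio` (values are made up for illustration)
toy <- data.frame(
  pelvic_radius = c(98.7, NA, 106.0),
  class = c("Abnormal", "", "Normal"),
  stringsAsFactors = FALSE
)

# Missing values per column
na_per_column <- colSums(is.na(toy))

# Empty strings per column; which() ignores comparisons that evaluate to NA
empty_per_column <- sapply(toy, function(col) length(which(col == "")))

print(na_per_column)
print(empty_per_column)
```

The same two expressions should work unchanged with `bio` in place of `toy`.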

Check the range of each feature

In [4]:
summary(bio)

 pelvic_incidence pelvic_tilt.numeric lumbar_lordosis_angle  sacral_slope   
 Min.   : 26.15   Min.   :-6.555      Min.   : 14.00        Min.   : 13.37  
 1st Qu.: 46.43   1st Qu.:10.667      1st Qu.: 37.00        1st Qu.: 33.35  
 Median : 58.69   Median :16.358      Median : 49.56        Median : 42.40  
 Mean   : 60.50   Mean   :17.543      Mean   : 51.93        Mean   : 42.95  
 3rd Qu.: 72.88   3rd Qu.:22.120      3rd Qu.: 63.00        3rd Qu.: 52.70  
 Max.   :129.83   Max.   :49.432      Max.   :125.74        Max.   :121.43  
 pelvic_radius    degree_spondylolisthesis    class          
 Min.   : 70.08   Min.   :-11.058          Length:310        
 1st Qu.:110.71   1st Qu.:  1.604          Class :character  
 Median :118.27   Median : 11.768          Mode  :character  
 Mean   :117.92   Mean   : 26.297                            
 3rd Qu.:125.47   3rd Qu.: 41.287                            
 Max.   :163.07   Max.   :418.543                            

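The features span very different ranges (for example, degree_spondylolisthesis runs from about -11 to 419, while pelvic_tilt.numeric stays below 50), so we need to normalize them. Otherwise the feature with the widest range would dominate the k-NN distance calculation.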

In [5]:
normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}
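
To illustrate what min-max normalization does, here is a small worked example on two made-up vectors that differ only in scale. The vectors `small_scale` and `large_scale` are hypothetical; `normalize` is repeated so the snippet runs on its own.

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical measurements on two very different scales
small_scale <- c(1, 2, 3, 4, 5)
large_scale <- c(10, 20, 30, 40, 50) * 10

# Both vectors map to the same values between 0 and 1
print(normalize(small_scale))
print(normalize(large_scale))
```

Note that a constant column would make `max(x) - min(x)` zero and produce `NaN`, so this function assumes every feature takes at least two distinct values.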

In [6]:
bio_n <- as.data.frame(lapply(bio[1:6], normalize))

In [7]:
summary(bio_n)

 pelvic_incidence pelvic_tilt.numeric lumbar_lordosis_angle  sacral_slope   
 Min.   :0.0000   Min.   :0.0000      Min.   :0.0000        Min.   :0.0000  
 1st Qu.:0.1956   1st Qu.:0.3076      1st Qu.:0.2058        1st Qu.:0.1849  
 Median :0.3139   Median :0.4093      Median :0.3183        Median :0.2687  
 Mean   :0.3313   Mean   :0.4304      Mean   :0.3394        Mean   :0.2738  
 3rd Qu.:0.4507   3rd Qu.:0.5122      3rd Qu.:0.4385        3rd Qu.:0.3639  
 Max.   :1.0000   Max.   :1.0000      Max.   :1.0000        Max.   :1.0000  
 pelvic_radius    degree_spondylolisthesis
 Min.   :0.0000   Min.   :0.00000         
 1st Qu.:0.4369   1st Qu.:0.02947         
 Median :0.5182   Median :0.05313         
 Mean   :0.5145   Mean   :0.08695         
 3rd Qu.:0.5956   3rd Qu.:0.12185         
 Max.   :1.0000   Max.   :1.00000         

## Creating training and test datasets

In [8]:
# see https://stackoverflow.com/a/17200430

smp_size <- floor(0.75 * nrow(bio_n))

set.seed(123)
train_ind <- sample(seq_len(nrow(bio_n)), size = smp_size)

bio_train <- bio_n[train_ind, ]
bio_test <- bio_n[-train_ind, ]
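
In this notebook the features are normalized on the full data set before splitting, so the test rows contribute to each feature's min and max. A stricter variant, sketched below on random toy data, takes the min and max from the training rows only and applies them to both sets. The data frame `toy` and the helper `rescale_with` are hypothetical and not part of the original analysis.

```r
set.seed(123)

# Toy numeric data standing in for bio[1:6] (random values, illustration only)
toy <- data.frame(a = runif(20, 0, 100), b = runif(20, -10, 400))

# Same 75/25 split logic as above
smp_size <- floor(0.75 * nrow(toy))
train_ind <- sample(seq_len(nrow(toy)), size = smp_size)

# Column-wise min and max taken from the training rows only
train_min <- sapply(toy[train_ind, ], min)
train_max <- sapply(toy[train_ind, ], max)

# Hypothetical helper: rescale every column with the given min and max
rescale_with <- function(df, lo, hi) {
  as.data.frame(Map(function(col, l, h) (col - l) / (h - l), df, lo, hi))
}

toy_train <- rescale_with(toy[train_ind, ], train_min, train_max)
toy_test  <- rescale_with(toy[-train_ind, ], train_min, train_max)

print(summary(toy_train))
print(summary(toy_test))
```

With this approach the training features lie in [0, 1], while test values may fall slightly outside that interval.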

Get the class labels for the same training and test rows

In [9]:
bio_train_labels <- bio[train_ind, c("class")]

bio_test_labels <- bio[-train_ind, c("class")]

## Training a model on the data

In [10]:
install.packages("class", quiet = TRUE)

In [11]:
library("class", quietly = TRUE)

In [12]:
bio_test_pred <- knn(train = bio_train, test = bio_test, cl = bio_train_labels, k = 21)
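
`knn()` assigns each test row the majority class among its k nearest training rows, using Euclidean distance. The idea can be sketched in base R on made-up data. The helper `knn_one` is hypothetical and simplified: it classifies one point at a time and resolves ties by taking the first label, whereas `class::knn` breaks ties at random.

```r
# Toy training set: two well-separated groups (made-up values)
train <- data.frame(x = c(0.1, 0.2, 0.15, 0.8, 0.9, 0.85),
                    y = c(0.1, 0.15, 0.2, 0.8, 0.85, 0.9))
labels <- c("Normal", "Normal", "Normal", "Abnormal", "Abnormal", "Abnormal")

# Hypothetical helper: classify a single point by majority vote among k neighbours
knn_one <- function(train, labels, point, k) {
  # Euclidean distance from the point to every training row
  diffs <- sweep(as.matrix(train), 2, point)
  distances <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows
  nearest <- labels[order(distances)[seq_len(k)]]
  # Most frequent label; which.max() takes the first one on ties
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

print(knn_one(train, labels, point = c(0.2, 0.2), k = 3))
print(knn_one(train, labels, point = c(0.9, 0.8), k = 3))
```

Because the vote depends directly on distances, features on larger scales would dominate it, which is why the features were normalized first.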

## Evaluating model performance

In [13]:
install.packages("gmodels", quiet = TRUE)

In [14]:
library("gmodels", quietly = TRUE)

In [15]:
CrossTable(x = bio_test_labels, y = bio_test_pred, prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  78 

 
                | bio_test_pred 
bio_test_labels |  Abnormal |    Normal | Row Total | 
----------------|-----------|-----------|-----------|
       Abnormal |        40 |        12 |        52 | 
                |     0.769 |     0.231 |     0.667 | 
                |     0.909 |     0.353 |           | 
                |     0.513 |     0.154 |           | 
----------------|-----------|-----------|-----------|
         Normal |         4 |        22 |        26 | 
                |     0.154 |     0.846 |     0.333 | 
                |     0.091 |     0.647 |           | 
                |     0.051 |     0.282 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        44 |        34 |        78 | 
                |     0.564
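
The overall accuracy for k = 21 can be read off the table as the share of cases on the diagonal. A minimal sketch, with the four counts copied from the output above and rows taken as actual labels, columns as predicted labels:

```r
# Counts copied from the CrossTable output above (rows = actual, columns = predicted)
confusion <- matrix(c(40, 12,
                      4, 22),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("Abnormal", "Normal"),
                                    predicted = c("Abnormal", "Normal")))

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(confusion)) / sum(confusion)
print(round(accuracy, 3))
```

That is 62 of 78 test cases. The most frequent error is an Abnormal patient classified as Normal (12 cases).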

## Improving model performance

### Transformation

In [16]:
bio_z <- as.data.frame(scale(bio[-7]))
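
`scale()` with its default arguments standardizes each column to z-scores: it subtracts the column mean and divides by the sample standard deviation. A small check on a made-up vector (`x` is hypothetical) shows the equivalence:

```r
# Hypothetical feature values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops its attributes
z_scale <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))
```

Unlike min-max normalization, z-scores are not confined to a fixed interval, so extreme values remain visible as large scores.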

In [17]:
summary(bio_z)

 pelvic_incidence  pelvic_tilt.numeric lumbar_lordosis_angle  sacral_slope     
 Min.   :-1.9928   Min.   :-2.4078     Min.   :-2.0443       Min.   :-2.20418  
 1st Qu.:-0.8161   1st Qu.:-0.6870     1st Qu.:-0.8047       1st Qu.:-0.71569  
 Median :-0.1048   Median :-0.1184     Median :-0.1277       Median :-0.04089  
 Mean   : 0.0000   Mean   : 0.0000     Mean   : 0.0000       Mean   : 0.00000  
 3rd Qu.: 0.7183   3rd Qu.: 0.4574     3rd Qu.: 0.5966       3rd Qu.: 0.72577  
 Max.   : 4.0227   Max.   : 3.1862     Max.   : 3.9782       Max.   : 5.84632  
 pelvic_radius     degree_spondylolisthesis
 Min.   :-3.5922   Min.   :-0.9946         
 1st Qu.:-0.5415   1st Qu.:-0.6574         
 Median : 0.0261   Median :-0.3868         
 Mean   : 0.0000   Mean   : 0.0000         
 3rd Qu.: 0.5667   3rd Qu.: 0.3991         
 Max.   : 3.3903   Max.   :10.4435         

In [18]:
bio_train_z <- bio_z[train_ind, ]
bio_test_z <- bio_z[-train_ind, ]

bio_train_labels_z <- bio[train_ind, 7]
bio_test_labels_z <- bio[-train_ind, 7]

bio_test_pred_z <- knn(train = bio_train_z, test = bio_test_z, cl = bio_train_labels_z, k = 21)
CrossTable(x = bio_test_labels_z, y = bio_test_pred_z, prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  78 

 
                  | bio_test_pred_z 
bio_test_labels_z |  Abnormal |    Normal | Row Total | 
------------------|-----------|-----------|-----------|
         Abnormal |        40 |        12 |        52 | 
                  |     0.769 |     0.231 |     0.667 | 
                  |     0.909 |     0.353 |           | 
                  |     0.513 |     0.154 |           | 
------------------|-----------|-----------|-----------|
           Normal |         4 |        22 |        26 | 
                  |     0.154 |     0.846 |     0.333 | 
                  |     0.091 |     0.647 |           | 
                  |     0.051 |     0.282 |           | 
------------------|-----------|-----------|-----------|
     Column Total |        44 |        34 |        78 

### Alternative values of k

In [22]:
run_tests <- function(k_value) {
    print(k_value)
    bio_test_pred <- knn(train = bio_train, test = bio_test, cl = bio_train_labels, k = k_value)
    CrossTable(x = bio_test_labels, y = bio_test_pred, prop.chisq = FALSE)
}

In [23]:
run_tests(1)
run_tests(5)
run_tests(11)
run_tests(15)
run_tests(21)
run_tests(27)

[1] 1

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  78 

 
                | bio_test_pred 
bio_test_labels |  Abnormal |    Normal | Row Total | 
----------------|-----------|-----------|-----------|
       Abnormal |        44 |         8 |        52 | 
                |     0.846 |     0.154 |     0.667 | 
                |     0.898 |     0.276 |           | 
                |     0.564 |     0.103 |           | 
----------------|-----------|-----------|-----------|
         Normal |         5 |        21 |        26 | 
                |     0.192 |     0.808 |     0.333 | 
                |     0.102 |     0.724 |           | 
                |     0.064 |     0.269 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        49 |        29 |        78 | 
                |    
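
Comparing full cross tables for many values of k is tedious. One possible alternative, sketched here on simulated data rather than on the patient data, collects the test-set accuracy for each k into a single data frame. The simulated `features` and `labels` and the helper `accuracy_for_k` are assumptions for illustration only.

```r
library(class)

set.seed(123)

# Simulated two-class data standing in for the normalized features (illustration only)
n <- 100
features <- data.frame(f1 = c(rnorm(n, 0), rnorm(n, 2)),
                       f2 = c(rnorm(n, 0), rnorm(n, 2)))
labels <- factor(rep(c("Normal", "Abnormal"), each = n))

# 75/25 split, as in the notebook
train_ind <- sample(seq_len(nrow(features)), size = floor(0.75 * nrow(features)))

# Hypothetical helper: test-set accuracy for a given k
accuracy_for_k <- function(k_value) {
  pred <- knn(train = features[train_ind, ], test = features[-train_ind, ],
              cl = labels[train_ind], k = k_value)
  mean(as.character(pred) == as.character(labels[-train_ind]))
}

k_values <- c(1, 5, 11, 15, 21, 27)
results <- data.frame(k = k_values, accuracy = sapply(k_values, accuracy_for_k))
print(results)
```

With `bio_train`, `bio_test` and the corresponding label vectors substituted in, the same loop would summarize the runs above in one table.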

## Evaluation

For k = 1, taking Abnormal as the positive class (TP = 44, FN = 8, FP = 5, TN = 21):

Precision = 44/49 = 0.898

Recall = 44/52 = 0.8462

F = 2 × (0.898 × 0.8462)/(0.898 + 0.8462) = 0.8713

Informedness = 0.8462 + (21/26) - 1 = 0.6538

Markedness = 0.898 + (21/29) - 1 = 0.6221
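
With Abnormal taken as the positive class and the k = 1 table read as rows = actual, columns = predicted, these metrics can be recomputed directly from the four cell counts. The sketch below uses the standard definitions; the variable names are illustrative.

```r
# Counts from the k = 1 table above, with Abnormal as the positive class
tp <- 44  # actual Abnormal, predicted Abnormal
fn <- 8   # actual Abnormal, predicted Normal
fp <- 5   # actual Normal, predicted Abnormal
tn <- 21  # actual Normal, predicted Normal

precision   <- tp / (tp + fp)   # positive predictive value
recall      <- tp / (tp + fn)   # sensitivity
specificity <- tn / (tn + fp)
npv         <- tn / (tn + fn)   # negative predictive value

f1           <- 2 * precision * recall / (precision + recall)
informedness <- recall + specificity - 1
markedness   <- precision + npv - 1
accuracy     <- (tp + tn) / (tp + fn + fp + tn)

print(round(c(precision = precision, recall = recall, f1 = f1,
              informedness = informedness, markedness = markedness,
              accuracy = accuracy), 4))
```

Precision divides by the column total (all cases predicted Abnormal), while recall divides by the row total (all cases that are actually Abnormal).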


...


For k = 1, k-NN correctly classifies (44 + 21)/78 ≈ 83% of the test cases.