**Student**: José Carlos Viana Filho

# Dataset

**Biomechanical features of orthopedic patients**

Classifying patients based on six features

url: https://www.kaggle.com/uciml/biomechanical-features-of-orthopedic-patients/data

# Data preparation

In [1]:
# load database
ortho <- read.csv("ortho.csv", stringsAsFactors = FALSE)

str(ortho)
table(ortho$class)

'data.frame':	310 obs. of  7 variables:
 $ pelvic_incidence        : num  63 39.1 68.8 69.3 49.7 ...
 $ pelvic_tilt.numeric     : num  22.55 10.06 22.22 24.65 9.65 ...
 $ lumbar_lordosis_angle   : num  39.6 25 50.1 44.3 28.3 ...
 $ sacral_slope            : num  40.5 29 46.6 44.6 40.1 ...
 $ pelvic_radius           : num  98.7 114.4 106 101.9 108.2 ...
 $ degree_spondylolisthesis: num  -0.254 4.564 -3.53 11.212 7.919 ...
 $ class                   : chr  "Abnormal" "Abnormal" "Abnormal" "Abnormal" ...



Abnormal   Normal 
     210      100 

Target variable: *class*

This variable takes two possible values: *Abnormal* and *Normal*.

## Checking for empty or missing values

In [2]:
sum(ortho$class == "")
sum(is.na(ortho$class))
sum(is.nan(ortho$class))

The target variable has no missing or empty values.
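The checks above cover only the target column. As a minimal sketch, assuming a small made-up data frame `df` (hypothetical, standing in for `ortho`), the same idea can be extended to every column at once:

```r
# Hypothetical data frame standing in for `ortho`; the values are made up.
df <- data.frame(
    pelvic_radius = c(98.7, NA, 106.0),
    class = c("Abnormal", "", "Normal"),
    stringsAsFactors = FALSE
)

# Count NA values per column
na_counts <- colSums(is.na(df))

# Count empty strings per column (only meaningful for character columns)
empty_counts <- sapply(df, function(col) sum(!is.na(col) & col == ""))

na_counts
empty_counts
```

In this toy data frame, `pelvic_radius` has one `NA` and `class` has one empty string; on the real data both vectors would be expected to contain only zeros if the dataset is complete.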

## Normalizing the values

Before proceeding, the variables used to predict *class* should be normalized, since distance-based methods such as kNN are sensitive to differences in scale. First, let's check their current state:

In [3]:
summary(ortho[1:6])

 pelvic_incidence pelvic_tilt.numeric lumbar_lordosis_angle  sacral_slope   
 Min.   : 26.15   Min.   :-6.555      Min.   : 14.00        Min.   : 13.37  
 1st Qu.: 46.43   1st Qu.:10.667      1st Qu.: 37.00        1st Qu.: 33.35  
 Median : 58.69   Median :16.358      Median : 49.56        Median : 42.40  
 Mean   : 60.50   Mean   :17.543      Mean   : 51.93        Mean   : 42.95  
 3rd Qu.: 72.88   3rd Qu.:22.120      3rd Qu.: 63.00        3rd Qu.: 52.70  
 Max.   :129.83   Max.   :49.432      Max.   :125.74        Max.   :121.43  
 pelvic_radius    degree_spondylolisthesis
 Min.   : 70.08   Min.   :-11.058         
 1st Qu.:110.71   1st Qu.:  1.604         
 Median :118.27   Median : 11.768         
 Mean   :117.92   Mean   : 26.297         
 3rd Qu.:125.47   3rd Qu.: 41.287         
 Max.   :163.07   Max.   :418.543         

The ranges already differ considerably from one variable to another (for example, *degree_spondylolisthesis* goes up to 418.5, while *pelvic_tilt.numeric* peaks at 49.4). We need a function to perform the normalization:

In [4]:
normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}
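To make the behavior of `normalize` concrete, here is a small worked example on a made-up vector (the values are illustrative, not taken from the dataset). Min-max scaling maps the minimum to 0 and the maximum to 1 and keeps the relative distances in between.

```r
normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}

x <- c(10, 20, 30, 50)   # illustrative values
x_n <- normalize(x)
x_n                      # 0.00 0.25 0.50 1.00
range(x_n)               # 0 1
```

Note that a constant vector would yield `NaN` (0 divided by 0), so this function assumes each column has at least two distinct values.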

Now we can normalize the data:

In [5]:
ortho_n <- as.data.frame(lapply(ortho[1:6], normalize))
ortho_n['class'] <- as.factor(ortho$class)

Checking the normalized values:

In [6]:
summary(ortho_n)

 pelvic_incidence pelvic_tilt.numeric lumbar_lordosis_angle  sacral_slope   
 Min.   :0.0000   Min.   :0.0000      Min.   :0.0000        Min.   :0.0000  
 1st Qu.:0.1956   1st Qu.:0.3076      1st Qu.:0.2058        1st Qu.:0.1849  
 Median :0.3139   Median :0.4093      Median :0.3183        Median :0.2687  
 Mean   :0.3313   Mean   :0.4304      Mean   :0.3394        Mean   :0.2738  
 3rd Qu.:0.4507   3rd Qu.:0.5122      3rd Qu.:0.4385        3rd Qu.:0.3639  
 Max.   :1.0000   Max.   :1.0000      Max.   :1.0000        Max.   :1.0000  
 pelvic_radius    degree_spondylolisthesis      class    
 Min.   :0.0000   Min.   :0.00000          Abnormal:210  
 1st Qu.:0.4369   1st Qu.:0.02947          Normal  :100  
 Median :0.5182   Median :0.05313                        
 Mean   :0.5145   Mean   :0.08695                        
 3rd Qu.:0.5956   3rd Qu.:0.12185                        
 Max.   :1.0000   Max.   :1.00000                        

Now all values share the same *range*, [0, 1], so no variable has an undue impact on the analyses simply because of its scale.

# Creating the training and test datasets

To build the learning models and evaluate them, we split the dataset into a training set and a test set, assigning 75% of the rows to training and 25% to testing (232 and 78 rows, respectively). The split is random.

In [7]:
# see https://stackoverflow.com/a/17200430

smp_size <- floor(0.75 * nrow(ortho_n))

# set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ortho_n)), size = smp_size)

ortho_train <- ortho_n[train_ind, ]
ortho_test <- ortho_n[-train_ind, ]
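As a quick sanity check of this kind of split, the sketch below repeats the same steps on a synthetic data frame `toy` (a hypothetical stand-in for `ortho_n` with the same number of rows and class counts) and verifies the sizes and that the two sets do not overlap.

```r
# Synthetic stand-in for `ortho_n`: 310 rows with the same class counts
toy <- data.frame(
    x = seq_len(310),
    class = factor(rep(c("Abnormal", "Normal"), times = c(210, 100)))
)

smp_size <- floor(0.75 * nrow(toy))   # 232 rows for training

set.seed(123)
train_ind <- sample(seq_len(nrow(toy)), size = smp_size)

toy_train <- toy[train_ind, ]
toy_test <- toy[-train_ind, ]

nrow(toy_train)                               # 232
nrow(toy_test)                                # 78
length(intersect(toy_train$x, toy_test$x))    # 0, the two sets are disjoint

# A purely random split does not guarantee equal class proportions
prop.table(table(toy_train$class))
prop.table(table(toy_test$class))
```

If the class proportions in the two sets turn out to differ noticeably, sampling within each class separately (a stratified split) is one possible alternative.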

# Machine learning functions

The following functions run the different machine learning methods.

In [8]:
# install packages
list.of.packages <- c("class", "janitor", "e1071", "C50", "klaR", "randomForest", "OneR")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()
                                   [,"Package"])]
if(length(new.packages)) install.packages(new.packages, quiet = TRUE)

In [9]:
# load libs
library("class", quietly = TRUE, warn.conflicts = FALSE)
library("janitor", quietly = TRUE, warn.conflicts = FALSE)
library("e1071", quietly = TRUE, warn.conflicts = FALSE)
library("C50", quietly = TRUE, warn.conflicts = FALSE)
library("klaR", quietly = TRUE, warn.conflicts = FALSE)
library("randomForest", quietly = TRUE, warn.conflicts = FALSE)
library("OneR", quietly = TRUE, warn.conflicts = FALSE)

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.


In [10]:
# kNN
exec_knn <- function(k_value) {
    ortho_test_pred <- knn(train = ortho_train[-7], test = ortho_test[-7],
                           cl = ortho_train$class, k = k_value)
    datf <- data.frame(ortho_test$class, ortho_test_pred)
    return (tabyl(datf, ortho_test.class, ortho_test_pred))
}
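To illustrate what `knn` from the `class` package does, here is a minimal sketch on made-up, well-separated 2-D points (the data and the names `train_x`, `train_y` and `test_x` are illustrative assumptions, not part of the notebook). Each test point is assigned the majority class among its `k` nearest training points.

```r
library(class)  # provides knn(); ships with standard R installations

# Made-up, well-separated 2-D points (illustrative only)
train_x <- data.frame(a = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
                      b = c(0.10, 0.20, 0.15, 0.90, 0.80, 0.85))
train_y <- factor(c("Normal", "Normal", "Normal",
                    "Abnormal", "Abnormal", "Abnormal"))
test_x <- data.frame(a = c(0.12, 0.88), b = c(0.18, 0.82))

# Each test point receives the majority class among its k nearest training points
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
pred
```

The first test point sits inside the "Normal" cluster and the second inside the "Abnormal" cluster, so they are labeled accordingly.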

In [11]:
# naive bayes
exec_naiveBayes <- function() {
    ortho_classifier <- NaiveBayes(class ~ ., data = ortho_train)
    ortho_test_pred <- predict(ortho_classifier, ortho_test)
    datf <- data.frame(ortho_test$class, ortho_test_pred)
    return (tabyl(datf, ortho_test.class, class))
}

In [12]:
# decision tree
exec_decision_tree <- function() {
    dt_model <- C5.0(ortho_train[-7], ortho_train$class)
    ortho_test_pred <- predict(dt_model, ortho_test[-7])
    datf <- data.frame(ortho_test$class, ortho_test_pred)
    return (tabyl(datf, ortho_test.class, ortho_test_pred))
}

In [13]:
# random forest
exec_random_forest <- function() {
    model <- randomForest(class ~ ., data = ortho_train)
    pred <- predict(model, ortho_test)
    datf <- data.frame(ortho_test$class, pred)
    return (tabyl(datf, ortho_test.class, pred))
}

In [14]:
# OneR
exec_OneR <- function() {
    model <- OneR(class ~ ., data = ortho_train)
    pred <- predict(model, ortho_test)
    datf <- data.frame(ortho_test$class, pred)
    return (tabyl(datf, ortho_test.class, pred))
}

# Metric functions

The functions below compute the metrics of interest, used to evaluate each machine learning technique.
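As a reference point, these metrics can be worked through by hand on a small confusion matrix. The counts below are made up for illustration; rows are actual classes, columns are predicted classes, and *Abnormal* is treated as the positive class (an assumption of this sketch).

```r
# Made-up counts: rows = actual class, columns = predicted class
cm <- matrix(c(40, 10,
                5, 45),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Abnormal", "Normal"),
                             predicted = c("Abnormal", "Normal")))

tp <- cm["Abnormal", "Abnormal"]   # 40
fn <- cm["Abnormal", "Normal"]     # 10
fp <- cm["Normal", "Abnormal"]     # 5
tn <- cm["Normal", "Normal"]       # 45

precision_v <- tp / (tp + fp)                    # 40 / 45
recall_v <- tp / (tp + fn)                       # 40 / 50 = 0.8
specificity_v <- tn / (tn + fp)                  # 45 / 50 = 0.9
npv_v <- tn / (tn + fn)                          # 45 / 55
accuracy_v <- (tp + tn) / sum(cm)                # 85 / 100 = 0.85
f_measure_v <- 2 * precision_v * recall_v / (precision_v + recall_v)
informedness_v <- recall_v + specificity_v - 1   # 0.7
markedness_v <- precision_v + npv_v - 1

round(c(accuracy = accuracy_v, precision = precision_v, recall = recall_v,
        f_measure = f_measure_v, informedness = informedness_v,
        markedness = markedness_v), 3)
```

Informedness combines recall and specificity (both conditioned on the actual class), while markedness combines precision and negative predictive value (both conditioned on the predicted class).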

In [15]:
# Confusion table layout (tabyl): rows are the actual classes and columns the
# predicted ones; column 1 holds the row labels. "Abnormal" is the positive class.

# true positive: actual Abnormal, predicted Abnormal
tp_value <- function(ctable) {
    return (ctable[1,2])
}

# true negative: actual Normal, predicted Normal
tn_value <- function(ctable) {
    return (ctable[2,3])
}

# false positive: actual Normal, predicted Abnormal
fp_value <- function(ctable) {
    return (ctable[2,2])
}

# false negative: actual Abnormal, predicted Normal
fn_value <- function(ctable) {
    return (ctable[1,3])
}

# precision
precision <- function(ctable) {
    return (tp_value(ctable)/(tp_value(ctable) + fp_value(ctable)))
}

# recall
recall <- function(ctable) {
    return (tp_value(ctable)/(tp_value(ctable) + fn_value(ctable)))
}

# accuracy
accuracy <- function(ctable) {
    return ((tp_value(ctable) + tn_value(ctable))/
            (tp_value(ctable) + tn_value(ctable) + fp_value(ctable) + 
                                                   fn_value(ctable)))
}

# predicted positive condition rate
pred_pos_cond_rt <- function(ctable) {
    return ((tp_value(ctable) + fp_value(ctable))/
            (tp_value(ctable) + fp_value(ctable) + tn_value(ctable) +
                                                  fn_value(ctable)))
}

# f measure
f_measure <- function(ctable) {
    return ((2 * precision(ctable) * recall(ctable))/(precision(ctable) + 
                                                      recall(ctable)))
}

# informedness: recall + specificity - 1
informedness <- function(ctable) {
    return (recall(ctable) + (tn_value(ctable)/(tn_value(ctable) + 
                                                fp_value(ctable))) - 1)
}

# markedness: precision + negative predictive value - 1
markedness <- function(ctable) {
    return (precision(ctable) + (tn_value(ctable)/(tn_value(ctable) + 
                                                   fn_value(ctable))) - 1)
}

# Running the machine learning techniques

In [16]:
# data used to build the table of techniques and evaluation metrics
data_vec <- c()

# list with the results of each technique
ml_results <- list()

# store the result of each technique
ml_results[[1]] <- exec_knn(1)
ml_results[[2]] <- exec_knn(5)
ml_results[[3]] <- exec_knn(10)
ml_results[[4]] <- exec_decision_tree()
ml_results[[5]] <- exec_naiveBayes()
ml_results[[6]] <- exec_random_forest()

for(ml_result in ml_results) {
    data_vec <- append(data_vec, accuracy(ml_result))
    data_vec <- append(data_vec, precision(ml_result))
    data_vec <- append(data_vec, recall(ml_result))
    data_vec <- append(data_vec, f_measure(ml_result))
    data_vec <- append(data_vec, informedness(ml_result))
    data_vec <- append(data_vec, markedness(ml_result))
}

In [17]:
result_matrix <- matrix(round(data_vec, digits = 3), ncol = 6, byrow = TRUE)
colnames(result_matrix) <- c("Accuracy", "Precision", "Recall", "F Measure", "Informedness", "Markedness")
rownames(result_matrix) <- c("kNN(k = 1)", "kNN(k = 5)","kNN(k = 10)", "decision tree", "naive bayes","random forest")
result_matrix <- as.table(result_matrix)
result_matrix

              Accuracy Precision Recall F Measure Informedness Markedness
kNN(k = 1)       0.833     0.898  0.846     0.871        0.654      0.622
kNN(k = 5)       0.808     0.878  0.827     0.851        0.596      0.567
kNN(k = 10)      0.744     0.833  0.769     0.800        0.462      0.433
decision tree    0.846     0.917  0.846     0.880        0.692      0.650
naive bayes      0.769     0.972  0.673     0.795        0.635      0.567
random forest    0.859     0.918  0.865     0.891        0.712      0.677
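To compare the techniques at a glance, the rows can be ranked by a chosen metric. The sketch below re-enters the Accuracy and F Measure columns from the table above by hand into a hypothetical data frame `scores` (in the notebook itself one would index `result_matrix` directly) and sorts by F Measure.

```r
# Accuracy and F Measure copied from the results table above
scores <- data.frame(
    technique = c("kNN(k = 1)", "kNN(k = 5)", "kNN(k = 10)",
                  "decision tree", "naive bayes", "random forest"),
    accuracy = c(0.833, 0.808, 0.744, 0.846, 0.769, 0.859),
    f_measure = c(0.871, 0.851, 0.800, 0.880, 0.795, 0.891)
)

# Sort from best to worst F Measure
ranked <- scores[order(scores$f_measure, decreasing = TRUE), ]
ranked

# Technique with the highest accuracy
scores$technique[which.max(scores$accuracy)]
```

On this particular split, random forest ranks first on both accuracy and F Measure, followed by the decision tree. A single 75/25 split is a limited basis for comparison, so the ordering may change with a different seed.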