# Simple binary classification (Sonar, Mines vs. Rocks)

In this notebook we fit five ML algorithms (linear regression, logistic regression, SVM, kNN, and decision trees) to a dataset of sonar signals bounced off rocks and mines at a variety of aspect angles. The aim is to predict whether an object is a mine or a rock from the 60 features recorded for each sonar return (in the original dataset, the energy in 60 frequency bands).

Let's read the data. The label of each record is "R" if the object is a rock and "M" if it is a mine. We will assign the Boolean value 0 (`false`) to class "R" and 1 (`true`) to class "M".

In [2]:
using CSV
using DataFrames

data = DataFrame(CSV.File("data/sonar.csv", header = 0) )
rename!(data, :Column61 => :class)
data.class = Bool.(replace(data.class, "R" => 0, "M" => 1));

Let's check if the classes are balanced.

In [3]:
using StatsBase
countmap(data.class)

Dict{Bool, Int64} with 2 entries:
  0 => 97
  1 => 111

As we can see, the classes are slightly imbalanced: 97 rocks versus 111 mines. Since we don’t have much data to work with, random over-sampling of the minority class is a reasonable way to balance them.

In [4]:
using MLDataPattern

X = Matrix(data[:, 1:60])';
y = data.class;
X_bal, y_bal = oversample((X, y));


data_bal = DataFrame(Matrix(X_bal'));
data_bal.class = y_bal;

countmap(data_bal.class)

Dict{Bool, Int64} with 2 entries:
  0 => 111
  1 => 111
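The random over-sampling step above can be sketched in plain Julia. This is only an illustration of the idea (keep every row, then redraw minority rows with replacement until the classes match); `random_oversample_indices` is a hypothetical helper, and it is not necessarily how `oversample` from MLDataPattern is implemented.

```julia
using Random

# Hypothetical helper (not part of MLDataPattern): return row indices in which
# every class appears as often as the largest class.
function random_oversample_indices(labels::AbstractVector; rng = Random.default_rng())
    classes = unique(labels)
    groups = Dict(c => findall(==(c), labels) for c in classes)
    target = maximum(length, values(groups))      # size of the largest class
    idx = Int[]
    for c in classes
        members = groups[c]
        append!(idx, members)                     # keep every original row
        # draw the missing rows from the same class, with replacement
        append!(idx, rand(rng, members, target - length(members)))
    end
    return idx
end

# Labels with the same class counts as the sonar data: 97 rocks, 111 mines.
toy_labels = vcat(falses(97), trues(111))
toy_idx = random_oversample_indices(toy_labels; rng = MersenneTwister(1))
toy_balanced = toy_labels[toy_idx]
println(count(!, toy_balanced), " rocks, ", count(toy_balanced), " mines")
```

The extra rock rows are exact copies of existing records, so no new information is created; the classes simply carry equal weight during training.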

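Next, we partition the balanced data set into two disjoint subsets by random assignment. We can do this by combining `splitobs()` with `shuffleobs()`. Keep in mind that, because over-sampling was applied before the split, copies of the same rock record can end up in both subsets, which may make the test metrics somewhat optimistic.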

In [5]:
using MLDataUtils 
train, test = splitobs(shuffleobs(data_bal), at = 0.7);

The training set now has 155 records (70% of the data) and the test set has 67 records (30%). Their classes are distributed as follows:

In [43]:
countmap(train.class)

Dict{Bool, Int64} with 2 entries:
  0 => 76
  1 => 79

In [44]:
countmap(test.class)

Dict{Bool, Int64} with 2 entries:
  0 => 35
  1 => 32
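The shuffle-then-split step can be sketched with the standard library alone. This is a minimal illustration under the assumption that the cut point is the rounded fraction of rows; `shuffled_split` is a hypothetical helper, not the MLDataUtils implementation.

```julia
using Random

# Hypothetical helper: shuffle the row indices 1:n, then cut at fraction `at`.
function shuffled_split(n::Integer, at::Real; rng = Random.default_rng())
    perm = randperm(rng, n)            # random permutation of 1:n
    n_train = round(Int, at * n)       # number of rows in the first subset
    return perm[1:n_train], perm[n_train+1:end]
end

toy_train_idx, toy_test_idx = shuffled_split(222, 0.7; rng = MersenneTwister(42))
println(length(toy_train_idx), " training rows, ", length(toy_test_idx), " test rows")
```

With 222 rows and `at = 0.7` this gives the same subset sizes as above (155 and 67); which rows land in each subset depends on the random seed.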

### Confusion Matrix

We will evaluate each model with a confusion matrix. For a binary problem this is a table with two rows and two columns that reports the number of true positives, false negatives, false positives, and true negatives, and it is often used to evaluate the performance of a classification model. Here the positive class is "mine" (class 1).

Several metrics are derived from this matrix, among them precision, accuracy, sensitivity, and specificity. Which metric is most appropriate depends on the particular case and, specifically, on the "cost" associated with each type of classification error.

In [45]:
using EvalMetrics

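# Print the confusion matrix and four derived metrics for a set of predictions.
# `test` holds the actual labels and `pred` the predicted labels (Bool vectors).
function conf_matrix_metrics(test, pred)
    # `counts` comes from StatsBase (loaded earlier): rows = actual class, columns = predicted class
    conf_matrix = counts(test, pred)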
    precision = EvalMetrics.precision(test, pred)
    accuracy = EvalMetrics.accuracy(test, pred)
    sensitivity = EvalMetrics.sensitivity(test, pred)
    specificity = EvalMetrics.specificity(test, pred)
    
    println("The Confusion Matrix of this model is ", conf_matrix)
    println("The precision of this model is ", precision)
    println("The accuracy of this model is ", accuracy)
    println("The sensitivity of this model is ", sensitivity)
    println("The specificity of this model is ", specificity)
end

conf_matrix_metrics (generic function with 1 method)

## Linear Regression
Let's train a linear regression model, treating the Boolean class as a numeric response (0 or 1).

In [46]:
using GLM
fm = Term(:class) ~ sum(term.(names(train[:,1:60])))
linearRegressor = lm(fm, train)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

class ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 + x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x51 + x52 + x53 + x54 + x55 + x56 + x57 + x58 + x59 + x60

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)    Lower 95%   Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.256354     0.358865  -0.71    0.4768   -0.968889    0.456181
x1            2.71934      2.57833    1.05    0.2943   -2.39999     7.83867
x2            2.14467      2.97614    0.72    0.4729   -3.7

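Take a look at the decision values. Since this is an ordinary least-squares fit, they are not probabilities and can fall outside the interval [0, 1].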

In [47]:
LinR_pred = Float64.(GLM.predict(linearRegressor,test))
println(LinR_pred)

[0.34028862845542024, -0.09225465426152925, 0.11111374600980137, 0.3614691190514965, 0.49586143623864054, -0.5002456603018126, -0.16833342051042316, 0.0791976241575987, 0.026722537855394343, 0.753884284210186, 0.29105572879212066, 0.6970420281554053, 0.2685946142612683, 0.7081114475226417, 0.7720175195009962, 1.310273348866756, 0.1820959614243846, 0.10243076304886312, 0.6170823980654903, 0.25431176798835226, -0.20454965634635375, 0.543804988942905, 0.22177147442544268, 0.5481446059007851, 0.22069792773862779, 0.3073836358865745, 0.19198859380004318, 0.8254855722954937, 0.09999938548382084, -0.00678567075794833, 0.3257110633909801, 0.7143266084008066, 0.347947598539777, 0.5965694104493858, 0.5639554063679135, 1.2735633572926235, 0.5415262741680118, 0.9389182076348137, 0.8419093186006387, 0.3778607652930825, 0.734411503202917, 0.434839161102965, 0.018017066473264676, 0.8340076301654891, 0.836322623185664, 0.03464203446809984, 1.5611457751268223, 0.2163872255215501, 0.8475343922489843, 0.

### Prediction

Convert the decision values to classes using a threshold of 0.5.

In [48]:
y_test = Bool.(test.class)
y_LinR_pred = Bool.([if x < 0.5 0 else 1 end for x in LinR_pred]);

println("label_act  = ", y_test)
println("label_pred = ", y_LinR_pred)

label_act  = Bool[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
label_pred = Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1]


Now let's look at the model's confusion matrix along with the previously mentioned metrics.

In [49]:
conf_matrix_metrics(y_test, y_LinR_pred)

The Confusion Matrix of this model is [32 3; 7 25]
The precision of this model is 0.8928571428571429
The accuracy of this model is 0.8507462686567164
The sensitivity of this model is 0.78125
The specificity of this model is 0.9142857142857143
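The four metrics above can be recomputed by hand from the printed confusion matrix. The sketch below assumes the layout `[TN FP; FN TP]` (rows are the actual class, columns the predicted class); this layout is inferred from the class counts of the test set (35 rocks, 32 mines) and reproduces the reported values. The variable names are illustrative only.

```julia
# Confusion matrix of the linear regression model, as printed above.
# Assumed layout: rows = actual class (0, 1), columns = predicted class (0, 1).
cm = [32 3; 7 25]
tn, fp = cm[1, 1], cm[1, 2]   # actual rocks: predicted rock / predicted mine
fn, tp = cm[2, 1], cm[2, 2]   # actual mines: predicted rock / predicted mine

precision_   = tp / (tp + fp)        # share of predicted mines that are mines
accuracy_    = (tp + tn) / sum(cm)   # share of all predictions that are correct
sensitivity_ = tp / (tp + fn)        # share of actual mines that are detected
specificity_ = tn / (tn + fp)        # share of actual rocks that are recognized

println("precision   = ", precision_)
println("accuracy    = ", accuracy_)
println("sensitivity = ", sensitivity_)
println("specificity = ", specificity_)
```

Sensitivity and specificity each look at one row of the matrix, while precision looks at one column, which is why a model can score well on one and poorly on another.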


## Logistic Regression
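Let's train a logistic regression model. Note that the call below passes `ProbitLink()`, so strictly speaking it fits a probit model; a logistic regression in the strict sense would use `LogitLink()` instead. The outputs shown here were produced with the probit link.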

In [50]:
using GLM
fm = Term(:class) ~ sum(term.(names(train[:,1:60])))
logitRegressor = glm(fm, train, Binomial(), ProbitLink())

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

class ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 + x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x51 + x52 + x53 + x54 + x55 + x56 + x57 + x58 + x59 + x60

Coefficients:
──────────────────────────────────────────────────────────────────────────────────────────
                   Coef.      Std. Error      z  Pr(>|z|)        Lower 95%       Upper 95%
──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)    -81.2141    81116.7        -0.00    0.9992  -159067.0             1.58905e5
x1             267.682         5.36437e5   0.0

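Take a look at the predicted probabilities. Most of them are numerically 0 or 1, which, together with the very large coefficients and standard errors above, suggests that the training classes are (almost) perfectly separable by the 60 features.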

In [51]:
LogR_pred = Float64.(GLM.predict(logitRegressor,test))
println(LogR_pred)

[1.0, 7.619292719840532e-38, 1.6146773455955954e-21, 9.022785414499048e-290, 1.3958802761706985e-10, 0.0, 0.0, 2.504665041316085e-91, 0.0, 1.0, 4.9128041627586666e-182, 1.0, 1.779626452770876e-78, 1.5148091663210064e-9, 1.0, 1.0, 5.2682718445099516e-126, 0.0, 1.6310504520692574e-44, 2.1187835536423834e-141, 0.0, 0.0, 4.4924624946141885e-12, 1.0, 1.0, 1.0, 3.829418860799304e-68, 1.0, 2.6020983159332347e-19, 0.0, 2.920432075600862e-9, 1.0, 0.9999992293806472, 7.332103004666202e-14, 7.475695087271257e-20, 1.0, 0.0, 1.0, 1.0, 1.4468857609932417e-108, 0.9999137952339894, 1.0, 8.03283854e-316, 1.0, 1.0, 2.964926973219872e-31, 1.0, 2.0516710308879897e-128, 1.0, 1.3060064549100277e-243, 0.0, 1.0, 9.009376017075312e-126, 7.073579727149234e-24, 1.0, 1.0, 0.0, 1.0, 1.0, 2.1369180941509546e-246, 1.7764860756122216e-256, 1.0, 1.0, 1.6146773455955954e-21, 1.0, 1.623271112125105e-9, 1.0]
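Many of the probabilities above are exactly `0.0` or `1.0`. A minimal sketch of why this happens: when the coefficients are very large, the linear predictor lies far from zero and the inverse link saturates in floating-point arithmetic. The example uses the logistic function for simplicity (an assumption for illustration; the model above uses `ProbitLink()`, whose inverse link is the normal CDF and saturates in the same way), and the values of `z` are made up.

```julia
# Logistic (sigmoid) inverse link: maps a linear predictor z to a probability.
sigmoid(z) = 1 / (1 + exp(-z))

# Illustrative linear-predictor values, not taken from the fitted model.
for z in (-800.0, -5.0, 0.0, 5.0, 40.0)
    p = sigmoid(z)
    # same rule as in the notebook: p < 0.5 gives class 0, otherwise class 1
    println("z = ", z, "  p = ", p, "  class = ", p < 0.5 ? 0 : 1)
end
```

For `z = 40.0` the term `exp(-z)` is smaller than half the machine epsilon, so the sum rounds to `1.0`; for `z = -800.0` the term `exp(-z)` overflows to `Inf` and the result is `0.0`.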


### Prediction

Convert the probability scores to classes using a threshold of 0.5.

In [52]:
y_test = Bool.(test.class)
y_LogR_pred = Bool.([if x < 0.5 0 else 1 end for x in LogR_pred]);

println("label_act  = ", y_test)
println("label_pred = ", y_LogR_pred)

label_act  = Bool[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
label_pred = Bool[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1]


Now let's look at the model's confusion matrix along with the previously mentioned metrics.

In [53]:
conf_matrix_metrics(y_test, y_LogR_pred)

The Confusion Matrix of this model is [30 5; 8 24]
The precision of this model is 0.8275862068965517
The accuracy of this model is 0.8059701492537313
The sensitivity of this model is 0.75
The specificity of this model is 0.8571428571428571


## Support Vector Machine (SVM)
Let's train an SVM model with the default settings of `svmtrain`.

In [54]:
using LIBSVM

X_train = Matrix(train[:, 1:60])'
X_test = Matrix(test[:, 1:60])'
y_train = Bool.(train.class)
y_test = Bool.(test.class)

model = svmtrain(X_train, y_train);

y_SVM_pred, SVM_pred = svmpredict(model, X_test);

Take a look at the decision values.

In [55]:
SVM_pred

2×67 Matrix{Float64}:
 0.280507  0.0640837  -0.0935976  -0.430027  …  1.14289  0.258246  1.05152
 0.0       0.0         0.0         0.0          0.0      0.0       0.0

### Prediction

`svmpredict` has already returned the predicted classes alongside the decision values, so no conversion is needed here.

In [56]:
println("label_act  = ", y_test)
println("label_pred = ", y_SVM_pred)

label_act  = Bool[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
label_pred = Bool[1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]


Now let's look at the model's confusion matrix along with the previously mentioned metrics.

In [57]:
conf_matrix_metrics(y_test, y_SVM_pred)

The Confusion Matrix of this model is [15 20; 5 27]
The precision of this model is 0.574468085106383
The accuracy of this model is 0.6268656716417911
The sensitivity of this model is 0.84375
The specificity of this model is 0.42857142857142855


## k Nearest Neighbors
Let's train a k-nearest-neighbors model. We set $k=3$; the reader can verify that this value gives better results on the metrics above than other values of $k$.

In [58]:
using NearestNeighbors

X_train = Matrix(train[:, 1:60])'
X_test = Matrix(test[:, 1:60])'
y_train = Bool.(train.class)
y_test = Bool.(test.class)

k = 3

kdtree_train = KDTree(X_train)
idxs, dists = knn(kdtree_train, X_test, k, true);

Take a look at the labels of the three nearest training neighbors of each test point (one column per test point).

In [59]:
nbors_labels = y_train[hcat(idxs...)'];
nbors_labels'

3×67 adjoint(::BitMatrix) with eltype Bool:
 1  0  0  1  0  0  0  0  0  1  0  1  1  …  0  0  1  1  1  1  1  1  0  1  0  1
 1  0  0  1  1  0  0  0  0  1  0  1  0     0  0  1  1  0  1  1  1  0  1  0  1
 0  0  0  1  1  0  0  0  0  0  0  1  1     0  0  1  1  0  1  1  1  0  1  0  1

### Prediction

Convert the neighbor labels to a class by majority vote.

In [60]:
y_kNN_pred = [argmax(countmap(nbors_labels[i, :]))  for i in 1:size(nbors_labels)[1]];
println("label_act  = ", y_test)
println("label_pred = ", y_kNN_pred)

label_act  = Bool[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
label_pred = Bool[1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
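The neighbor search and majority vote can also be sketched without any package. The brute-force version below is only an illustration on made-up 2-D data (`knn_predict` and the `toy_*` names are hypothetical); the notebook itself uses a `KDTree` from NearestNeighbors, which finds the same neighbors more efficiently.

```julia
# Hypothetical brute-force k-NN classifier for Bool labels.
# Columns are observations, as in the matrices passed to KDTree above.
function knn_predict(X_train::AbstractMatrix, y_train::AbstractVector{Bool},
                     X_test::AbstractMatrix, k::Integer)
    preds = Vector{Bool}(undef, size(X_test, 2))
    for j in axes(X_test, 2)
        # squared Euclidean distance from test point j to every training point
        d2 = vec(sum(abs2, X_train .- X_test[:, j]; dims = 1))
        nearest = partialsortperm(d2, 1:k)   # indices of the k closest points
        votes = count(y_train[nearest])      # neighbors labelled 1 (mine)
        preds[j] = 2 * votes > k             # majority vote; odd k avoids ties
    end
    return preds
end

# Toy data: two well-separated clusters in 2-D.
toy_X_train = [0.0 0.1 0.2 1.0 1.1 0.9;
               0.0 0.2 0.1 1.0 0.9 1.1]
toy_y_train = [false, false, false, true, true, true]
toy_X_test  = [0.05 1.05;
               0.05 0.95]
toy_pred = knn_predict(toy_X_train, toy_y_train, toy_X_test, 3)
println(toy_pred)
```

Choosing an odd `k` for a two-class problem guarantees that the vote never ties.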


Now let's look at the model's confusion matrix along with the previously mentioned metrics.

In [61]:
conf_matrix_metrics(y_test, y_kNN_pred)

The Confusion Matrix of this model is [30 5; 4 28]
The precision of this model is 0.8484848484848485
The accuracy of this model is 0.8656716417910447
The sensitivity of this model is 0.875
The specificity of this model is 0.8571428571428571


## Decision Trees
Let's train a decision tree model. We set `max_depth = 7`; the reader can verify that this value gives better results on the metrics above than other depths.

In [62]:
using DecisionTree

features = Matrix(train[:, 1:60])
labels = train.class

features = float.(features)
labels   = Bool.(labels);

model = DecisionTreeClassifier(max_depth=7)
DecisionTree.fit!(model, features, labels);

Take a look at the predicted class probabilities (first row: class 0, second row: class 1).

In [63]:
DT_pred = predict_proba(model, Matrix(test)[:, 1:60]);
DT_pred'

2×67 adjoint(::Matrix{Float64}) with eltype Float64:
 0.0  1.0  1.0  0.0  1.0  1.0  1.0  1.0  …  0.0  1.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     1.0  0.0  1.0  0.0  1.0  0.0  1.0

### Prediction

Obtain the predicted classes directly with `predict`.

In [64]:
y_test = Bool.(test.class)
y_DT_pred = DecisionTree.predict(model, Matrix(test)[:, 1:60]);

println("label_act  = ", y_test)
println("label_pred = ", y_DT_pred)

label_act  = Bool[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
label_pred = Bool[1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1]


Now let's look at the model's confusion matrix along with the previously mentioned metrics.

In [65]:
conf_matrix_metrics(y_test, y_DT_pred)

The Confusion Matrix of this model is [29 6; 6 26]
The precision of this model is 0.8125
The accuracy of this model is 0.8208955223880597
The sensitivity of this model is 0.8125
The specificity of this model is 0.8285714285714286


## Algorithm with the highest performance measure

Let's compare the metrics of all five models side by side.

In [66]:
models = ["Linear Regression", "Logistic Regression", "SVM", "kNN", "Decision Tree"]
preds = [y_LinR_pred, y_LogR_pred, y_SVM_pred, y_kNN_pred, y_DT_pred]
y_test = Bool.(test.class)
n = length(models)

function metrics()
    
    for i in 1:n
        println("The Precision of the ", models[i], " model is ", EvalMetrics.precision(y_test, preds[i]))
    end
    
    println("")
    
    for i in 1:n
        println("The Accuracy of the ", models[i], " model is ", EvalMetrics.accuracy(y_test, preds[i]))
    end
    
    println("")
    
    for i in 1:n
        println("The Sensitivity of the ", models[i], " model is ", EvalMetrics.sensitivity(y_test, preds[i]))
    end
    
    println("")
    
    for i in 1:n
        println("The Specificity of the ", models[i], " model is ", EvalMetrics.specificity(y_test, preds[i]))
    end
    
end

metrics (generic function with 1 method)

In [67]:
metrics()

The Precision of the Linear Regression model is 0.8928571428571429
The Precision of the Logistic Regression model is 0.8275862068965517
The Precision of the SVM model is 0.574468085106383
The Precision of the kNN model is 0.8484848484848485
The Precision of the Decision Tree model is 0.8125

The Accuracy of the Linear Regression model is 0.8507462686567164
The Accuracy of the Logistic Regression model is 0.8059701492537313
The Accuracy of the SVM model is 0.6268656716417911
The Accuracy of the kNN model is 0.8656716417910447
The Accuracy of the Decision Tree model is 0.8208955223880597

The Sensitivity of the Linear Regression model is 0.78125
The Sensitivity of the Logistic Regression model is 0.75
The Sensitivity of the SVM model is 0.84375
The Sensitivity of the kNN model is 0.875
The Sensitivity of the Decision Tree model is 0.8125

The Specificity of the Linear Regression model is 0.9142857142857143
The Specificity of the Logistic Regression model is 0.8571428571428571
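The Specificity of the SVM model is 0.42857142857142855
The Specificity of the kNN model is 0.8571428571428571
The Specificity of the Decision Tree model is 0.8285714285714286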

In this classification problem a true negative (a rock correctly identified as a rock) poses no risk to anyone, whereas a false negative (a mine classified as a rock) could cost someone their life or a limb, so this is the error we must avoid. The model we choose should therefore favor high sensitivity over high specificity, because sensitivity measures the share of actual mines that are detected.

The models with the highest sensitivity are kNN and SVM, with approximately 88% and 84% respectively. However, the precision of kNN is approximately 85% against 57% for SVM, so a much larger share of the objects that kNN flags as mines really are mines. Likewise, the accuracy of kNN is approximately 87% against 63% for SVM, so kNN makes a larger share of correct predictions overall. In conclusion,

#### for this problem, kNN is the best of the five models.
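The selection rule used in this comparison (highest sensitivity first, precision as the tie-breaker) can be written down explicitly. The sketch below hard-codes the metric values reported above, with precision rounded to four decimals; the variable names are illustrative, and the ranking criterion is just one reasonable way to encode the reasoning in the text.

```julia
# Metrics as reported above (precision rounded to four decimals).
model_names = ["Linear Regression", "Logistic Regression", "SVM", "kNN", "Decision Tree"]
sens = [0.78125, 0.75, 0.84375, 0.875, 0.8125]
prec = [0.8929, 0.8276, 0.5745, 0.8485, 0.8125]

# Rank by sensitivity first; precision breaks ties (tuples compare lexicographically).
ranking = sortperm(collect(zip(sens, prec)); rev = true)
for i in ranking
    println(rpad(model_names[i], 20), " sensitivity = ", sens[i], "  precision = ", prec[i])
end

best_model = model_names[first(ranking)]
println("Selected model: ", best_model)
```

Ranking on sensitivity alone already puts kNN first on this test split; precision only matters here for judging how costly the runner-up (SVM) would be in false alarms.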