# Simple Binary Classification

The purpose of this exercise is to study the binary classification problem by applying several methods to the same dataset. This time we study the Sonar dataset.

The Sonar dataset involves predicting whether an object is a mine or a rock given the strength of sonar returns at different angles. It is a binary (2-class) classification problem, and the classes are not balanced. There are 208 observations with 60 input variables and 1 output variable.

The CSV file contains 111 records of sonar signals bounced off mines from different angles, and 97 records of sonar signals bounced off rocks under similar conditions.

In [1]:
#Load the data
using CSV
using DataFrames 
using StatsBase
using Statistics

df = DataFrame(CSV.File("C:/Users/maria/Desktop/Universidad/2022-I/Matemáticas para ML/Databases/sonar.csv"));

In [2]:
#Rename the columns of the dataframe to Y, Y_1, ..., Y_60
N = size(df, 2);
rename!(df, fill(:Y, N), makeunique = true)

| Row | Y | Y_1 | Y_2 | Y_3 | Y_4 | Y_5 | Y_6 | Y_7 | Y_8 |
|-----|---|-----|-----|-----|-----|-----|-----|-----|-----|
|  | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.228 | 0.2431 | 0.3771 | 0.5598 |
| 3 | 0.01 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.059 | 0.0649 | 0.1209 | 0.2467 | 0.3564 |
| 5 | 0.0286 | 0.0453 | 0.0277 | 0.0174 | 0.0384 | 0.099 | 0.1201 | 0.1833 | 0.2105 |
| 6 | 0.0317 | 0.0956 | 0.1321 | 0.1408 | 0.1674 | 0.171 | 0.0731 | 0.1401 | 0.2083 |
| 7 | 0.0519 | 0.0548 | 0.0842 | 0.0319 | 0.1158 | 0.0922 | 0.1027 | 0.0613 | 0.1465 |
| 8 | 0.0223 | 0.0375 | 0.0484 | 0.0475 | 0.0647 | 0.0591 | 0.0753 | 0.0098 | 0.0684 |
| 9 | 0.0164 | 0.0173 | 0.0347 | 0.007 | 0.0187 | 0.0671 | 0.1056 | 0.0697 | 0.0962 |
| 10 | 0.0039 | 0.0063 | 0.0152 | 0.0336 | 0.031 | 0.0284 | 0.0396 | 0.0272 | 0.0323 |


As mentioned before, the dataset has a class imbalance, which we now verify.

In [3]:
countmap(df.Y_60)

Dict{String, Int64} with 2 entries:
  "M" => 111
  "R" => 96

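Note that the count shows 96 rocks rather than 97. The most likely cause is that `CSV.File` treats the first line of the file as a header by default, so the first record is consumed as column names; passing `header = false` would keep it.

In addition, for ease of data processing we replace the labels in the last column of the dataframe: the "R" that designates rocks becomes 0 and the "M" that designates mines becomes 1.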

This procedure is a binary label encoding. It is closely related to One Hot Encoding, which is the process of taking a categorical feature and transforming it into several binary features, one per category.
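To make the distinction concrete, here is a minimal sketch in plain Julia that compares a single 0/1 label column with a one-hot encoding of the same labels. The `labels` vector is a hypothetical stand-in for the last column of the dataframe, not data from the notebook.

```julia
# Hypothetical labels standing in for the class column (assumption, not notebook data)
labels = ["R", "M", "M", "R"]

# Binary (label) encoding: a single numeric column, "R" -> 0.0 and "M" -> 1.0
encoded = [l == "M" ? 1.0 : 0.0 for l in labels]

# One-hot encoding: one indicator column per class
classes = sort(unique(labels))                      # ["M", "R"]
onehot = [l == c ? 1.0 : 0.0 for l in labels, c in classes]

println(encoded)
println(onehot)
```

With only two classes, the "M" indicator column of the one-hot matrix carries the same information as the single encoded column, which is why one 0/1 column is enough here.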

In [4]:
df.Y_60 .= replace(df.Y_60, "R" => "0", "M" => "1");
df.Y_60 = parse.(Float64, df.Y_60);

There are different ways to handle class imbalance. A simple but popular strategy that works for data containers is to either under- or over-sample the data according to the class distribution. That is, the data container is re-sampled in such a way that the class distribution in the resulting container is approximately uniform.

For this exercise we work with oversampling. This approach generates a re-balanced version of the data by repeatedly sampling existing observations in such a way that every class has at least `fraction` times the number of observations of the largest class (the default, `fraction = 1`, is used here).

One consequence of oversampling is that it can lead the model to overfit, since the minority class is filled with repeated observations. Let's see how it works.
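The idea behind random oversampling can be sketched in a few lines of plain Julia. This is an illustration of the technique only, not the implementation used by MLDataPattern; the function name `random_oversample` and the toy data are assumptions made for the example.

```julia
using Random

# Randomly oversample so that every class has as many observations as the largest one.
# X holds one observation per column; y holds the class labels.
# `random_oversample` is a hypothetical helper, not a library function.
function random_oversample(rng, X::AbstractMatrix, y::AbstractVector)
    classes = unique(y)
    counts = Dict(c => count(==(c), y) for c in classes)
    target = maximum(values(counts))
    keep = collect(1:length(y))            # every original observation is kept
    for c in classes
        idx = findall(==(c), y)
        extra = target - length(idx)
        # draw the missing observations with replacement from the same class
        append!(keep, rand(rng, idx, extra))
    end
    shuffle!(rng, keep)
    return X[:, keep], y[keep]
end

# Toy data: 3 observations of class 1.0 and 1 observation of class 0.0
X = [1.0 2.0 3.0 4.0; 10.0 20.0 30.0 40.0]
y = [1.0, 1.0, 1.0, 0.0]
Xb, yb = random_oversample(MersenneTwister(1), X, y)
println(count(==(1.0), yb), " vs ", count(==(0.0), yb))
```

Because the added observations are exact copies, a train/test split made after oversampling may place copies of the same observation in both sets, which can make the test accuracy look optimistic.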

In [5]:
#Handling class imbalance with oversampling: the data is organized and the new balanced dataframe is generated.
using MLDataPattern
A = Array(df[:, 1:60]);
b = df.Y_60;
#oversample expects one observation per column, hence the transpose
A_bal, b_bal = oversample((A', b));
df2 = [A_bal' b_bal];
df3 = DataFrame(df2)
#Rename the columns to X, X_1, ..., X_60
M = size(df3, 2);
rename!(df3, fill(:X, M), makeunique = true)

| Row | X | X_1 | X_2 | X_3 | X_4 | X_5 | X_6 | X_7 | X_8 |
|-----|---|-----|-----|-----|-----|-----|-----|-----|-----|
|  | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0123 | 0.0022 | 0.0196 | 0.0206 | 0.018 | 0.0492 | 0.0033 | 0.0398 | 0.0791 |
| 2 | 0.0274 | 0.0242 | 0.0621 | 0.056 | 0.1129 | 0.0973 | 0.1823 | 0.1745 | 0.144 |
| 3 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.059 | 0.0649 | 0.1209 | 0.2467 | 0.3564 |
| 4 | 0.0335 | 0.0258 | 0.0398 | 0.057 | 0.0529 | 0.1091 | 0.1709 | 0.1684 | 0.1865 |
| 5 | 0.0201 | 0.0165 | 0.0344 | 0.033 | 0.0397 | 0.0443 | 0.0684 | 0.0903 | 0.1739 |
| 6 | 0.0599 | 0.0474 | 0.0498 | 0.0387 | 0.1026 | 0.0773 | 0.0853 | 0.0447 | 0.1094 |
| 7 | 0.0261 | 0.0266 | 0.0223 | 0.0749 | 0.1364 | 0.1513 | 0.1316 | 0.1654 | 0.1864 |
| 8 | 0.0107 | 0.0453 | 0.0289 | 0.0713 | 0.1075 | 0.1019 | 0.1606 | 0.2119 | 0.3061 |
| 9 | 0.013 | 0.0006 | 0.0088 | 0.0456 | 0.0525 | 0.0778 | 0.0931 | 0.0941 | 0.1711 |
| 10 | 0.0221 | 0.0065 | 0.0164 | 0.0487 | 0.0519 | 0.0849 | 0.0812 | 0.1833 | 0.2228 |


In [6]:
#Split the Train and Test Data
using Random
Random.seed!(1)
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df3,.75)

(155×61 DataFrame. Omitted printing of 54 columns
│ Row │ X       │ X_1     │ X_2     │ X_3     │ X_4     │ X_5     │ X_6     │
│     │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0123  │ 0.0022  │ 0.0196  │ 0.0206  │ 0.018   │ 0.0492  │ 0.0033  │
│ 2   │ 0.0274  │ 0.0242  │ 0.0621  │ 0.056   │ 0.1129  │ 0.0973  │ 0.1823  │
│ 3   │ 0.0762  │ 0.0666  │ 0.0481  │ 0.0394  │ 0.059   │ 0.0649  │ 0.1209  │
│ 4   │ 0.0335  │ 0.0258  │ 0.0398  │ 0.057   │ 0.0529  │ 0.1091  │ 0.1709  │
│ 5   │ 0.0599  │ 0.0474  │ 0.0498  │ 0.0387  │ 0.1026  │ 0.0773  │ 0.0853  │
│ 6   │ 0.013   │ 0.0006  │ 0.0088  │ 0.0456  │ 0.0525  │ 0.0778  │ 0.0931  │
│ 7   │ 0.0221  │ 0.0065  │ 0.0164  │ 0.0487  │ 0.0519  │ 0.0849  │ 0.0812  │
│ 8   │ 0.0071  │ 0.0103  │ 0.0135  │ 0.0494  │ 0.0253  │ 0.0806  │ 0.0701  │
│ 9   │ 0.01    │ 0.01

In [7]:
#Size of the train and test data
println("Examples used for training:", size(train,1))
println("Examples used for testing:", size(test,1))

Examples used for training:155
Examples used for testing:67


As mentioned initially, we are going to use different algorithms to tackle the binary classification problem. Before working with each algorithm, it is important to choose a performance measure that allows us to quantify how well an algorithm works.

In our case we choose accuracy, a common metric for evaluating classification models in machine learning.

This measure tells us the fraction of observations that are classified correctly, and it is appropriate when the errors in predicting each class are equally important. The accuracy is given by:

\begin{equation*}
\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}
\end{equation*}

where $TP = $ True Positives, $TN= $ True Negatives, $FP=$ False Positives and $FN = $ False Negatives.
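As a small worked example of the formula, the sketch below counts the four outcomes for a pair of label vectors and evaluates the accuracy, taking 1.0 as the positive class. The helper names `confusion_counts` and `accuracy` and the vectors `pred` and `truth` are hypothetical, chosen only for illustration.

```julia
using Statistics

# Count the four outcomes of a binary classifier, with 1.0 as the positive class
function confusion_counts(pred::AbstractVector, truth::AbstractVector)
    tp = count((pred .== 1.0) .& (truth .== 1.0))
    tn = count((pred .== 0.0) .& (truth .== 0.0))
    fp = count((pred .== 1.0) .& (truth .== 0.0))
    fn = count((pred .== 0.0) .& (truth .== 1.0))
    return (tp = tp, tn = tn, fp = fp, fn = fn)
end

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy(c) = (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)

# Hypothetical predictions and labels, for illustration only
pred  = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
truth = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
c = confusion_counts(pred, truth)
println(c)
println(accuracy(c))

# The same value is obtained as the mean of the element-wise comparison
println(accuracy(c) ≈ mean(pred .== truth))
```

The last line shows why the notebook can compute accuracy as `mean(prediction .== b_test)`: the numerator counts the matches and the denominator is the total number of observations.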

In [8]:
A_train = Array(train[:, 1:60]);
b_train = train.X_60;
A_test = Array(test[:, 1:60]);
b_test = test.X_60;

#transpose data: LIBSVM and NearestNeighbors expect one observation per column
A_train_t = A_train';
A_test_t = A_test';

In [9]:
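#Linear Regression (fitted on the 0/1 labels, then thresholded at 0.5)
#Note: the formula below lists X and X_2 to X_59; column X_1 is not included.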
using GLM

fm = @formula(X_60 ~ X+X_2+X_3+X_4+X_5+X_6+X_7+X_8+X_9+X_10+X_11+X_12+X_13+X_14+X_15+X_16+X_17+X_18+X_19+X_20+X_21+X_22+X_23+X_24+X_25+X_26+X_27+X_28+X_29+X_30+X_31+X_32+X_33+X_34+X_35+X_36+X_37+X_38+X_39+X_40+X_41+X_42+X_43+X_44+X_45+X_46+X_47+X_48+X_49+X_50+X_51+X_52+X_53+X_54+X_55+X_56+X_57+X_58+X_59);
linearRegressor = lm(fm, train);
prediction1 = GLM.predict(linearRegressor, test);

#Classification
prediction_class1 = [if x < 0.5 0.0 else 1.0 end for x in prediction1];
prediction1 =  prediction_class1;

#Accuracy calculation
accuracy1 = mean(prediction1 .== b_test);
println("The accuracy of the model is: ", accuracy1)

The accuracy of the model is: 0.7761194029850746


In [10]:
#Probit regression: a Binomial GLM with ProbitLink() (logistic regression would use LogitLink())
logit = glm(fm, train, Binomial(), ProbitLink())
prediction2 = GLM.predict(logit, test)

#Classification
prediction_class2 = [if x < 0.5 0.0 else 1.0 end for x in prediction2];
prediction2 = prediction_class2

#Accuracy calculation
accuracy2 = mean(prediction2 .== b_test);
println("The accuracy of the model is: ", accuracy2)

The accuracy of the model is: 0.7164179104477612


In [11]:
#SVM
using LIBSVM

#run model
svmmodel = svmtrain(A_train_t, b_train);
prediction3, decision_values = svmpredict(svmmodel, A_test_t);

#Accuracy calculation
accuracy3 = mean(prediction3 .== b_test);
println("The accuracy of the model is: ", accuracy3)

The accuracy of the model is: 0.6268656716417911


In [12]:
#KNN
using NearestNeighbors

kdtree = KDTree(A_train_t);

#run model
k = 5;
idxs, dists = knn(kdtree, A_test_t, k, true);

#post-process: look up the training labels of the k nearest neighbors of each test point
idxs_matrix = hcat(idxs...);
idxs_matrix_t = idxs_matrix';
knn_class = b_train[idxs_matrix_t];
df_knn_class = DataFrame(knn_class);

#Make Predictions
prediction4 = [];
for i = 1:size(df_knn_class,1)
    pred = argmax(countmap(df_knn_class[i, :]))
    push!(prediction4, pred)
end

#Accuracy calculation
accuracy4 = mean(prediction4 .== b_test);
println("The accuracy of the model is: ", accuracy4)

The accuracy of the model is: 0.7014925373134329


In [13]:
#Decision Tree
using DecisionTree
dtmodel = DecisionTreeClassifier()
DecisionTree.fit!(dtmodel, A_train, b_train)

#Make predictions
prediction5 = DecisionTree.predict(dtmodel, A_test)

#Accuracy calculation
accuracy5 = mean(prediction5 .== b_test);
println("The accuracy of the model is: ", accuracy5)

The accuracy of the model is: 0.7014925373134329


According to the results obtained and the chosen performance measure, the algorithm with the highest performance on this binary classification problem is Linear Regression, with an accuracy of approximately $77.61\%$.
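The comparison can be summarized with a short sketch that ranks the accuracies printed above and converts each one back into a count of correct predictions on the 67 test examples. The model names in `results` are labels chosen for this example; the numbers are copied from the cell outputs.

```julia
# Accuracies printed by the cells above, and the test-set size reported earlier
n_test = 67
results = [
    ("Linear regression", 0.7761194029850746),
    ("GLM (Binomial, ProbitLink)", 0.7164179104477612),
    ("SVM", 0.6268656716417911),
    ("KNN (k = 5)", 0.7014925373134329),
    ("Decision tree", 0.7014925373134329),
]

# Sort from best to worst and recover the number of correct predictions
ranked = sort(results, by = last, rev = true)
for (name, acc) in ranked
    correct = round(Int, acc * n_test)
    println(rpad(name, 28), correct, "/", n_test, "  (", round(100 * acc, digits = 1), "%)")
end
```

Seen as counts, the best model classifies 52 of 67 test examples correctly and the next best 48, so the ranking rests on a difference of a few examples in a small test set.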