# Simple binary classification

In this notebook we review some of the techniques and tools that come up often in classification problems. We take a brief look at some of the most commonly used models for binary classification and apply them to the sonar dataset (https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)). We then choose a performance measure, evaluate each model with it on the data, and determine which model suits the problem better.

# Reading the data

First we read the dataset and give a label $v_i$, $1\leq i \leq 60$, to each of the $60$ features; we name the last column $t$, for the type of the material (rock or mine).

In [12]:
using CSV, DataFrames

data1 = DataFrame(CSV.read("sonar1.all-data", DataFrame))

| Row | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 |
|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.02 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 |
| 2 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 |
| 3 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.228 | 0.2431 | 0.3771 | 0.5598 |
| 4 | 0.01 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 |
| 5 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.059 | 0.0649 | 0.1209 | 0.2467 | 0.3564 |
| 6 | 0.0286 | 0.0453 | 0.0277 | 0.0174 | 0.0384 | 0.099 | 0.1201 | 0.1833 | 0.2105 |
| 7 | 0.0317 | 0.0956 | 0.1321 | 0.1408 | 0.1674 | 0.171 | 0.0731 | 0.1401 | 0.2083 |
| 8 | 0.0519 | 0.0548 | 0.0842 | 0.0319 | 0.1158 | 0.0922 | 0.1027 | 0.0613 | 0.1465 |
| 9 | 0.0223 | 0.0375 | 0.0484 | 0.0475 | 0.0647 | 0.0591 | 0.0753 | 0.0098 | 0.0684 |
| 10 | 0.0164 | 0.0173 | 0.0347 | 0.007 | 0.0187 | 0.0671 | 0.1056 | 0.0697 | 0.0962 |


We see that the column of labels is of categorical type:

In [13]:
data1.t

208-element PooledArrays.PooledVector{String, UInt32, Vector{UInt32}}:
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 "R"
 ⋮
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"
 "M"

This is not ideal for some of the models we are going to use, such as the regression models, so we assign a binary label to the values $M$ and $R$, namely $1$ and $0$ respectively.

In [28]:
#here we call some of the packages that we need for this notebook
using Plots
using GLM
using Lathe
using StatsBase
using MLBase
using ClassImbalance
using ROCAnalysis
using MLDataPattern

In [15]:
#here we replace the categorical labels for numerical ones
data1.t .= replace.(data1.t, "R" => "0");  
data1.t .= replace.(data1.t, "M" => "1");

In [16]:
# here we parse the last column to float number so that the models treat the values as numbers and not strings
data1.t = parse.(Float64, data1.t);
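
The replace-then-parse steps above can also be written as a single lookup. The following is a minimal sketch on a made-up label vector (the names `labels` and `encoding` are illustrative and do not appear in the notebook); it is meant to show the idea, not to replace the cells above.

```julia
# Toy label vector standing in for the column t (illustrative data only)
labels = ["R", "M", "M", "R"]

# Map each string label directly to its numeric code: rock => 0.0, mine => 1.0
encoding = Dict("R" => 0.0, "M" => 1.0)
t = [encoding[l] for l in labels]

println(t)
```

A dictionary lookup has the side benefit of raising an error on any unexpected label instead of silently leaving it as a string.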

In [17]:
# we can see the data as a matrix and note that the data is of type Float64
A = Matrix(data1[:, 1:61])

208×61 Matrix{Float64}:
 0.02    0.0371  0.0428  0.0207  …  0.018   0.0084  0.009   0.0032  0.0
 0.0453  0.0523  0.0843  0.0689     0.014   0.0049  0.0052  0.0044  0.0
 0.0262  0.0582  0.1099  0.1083     0.0316  0.0164  0.0095  0.0078  0.0
 0.01    0.0171  0.0623  0.0205     0.005   0.0044  0.004   0.0117  0.0
 0.0762  0.0666  0.0481  0.0394     0.0072  0.0048  0.0107  0.0094  0.0
 0.0286  0.0453  0.0277  0.0174  …  0.0057  0.0027  0.0051  0.0062  0.0
 0.0317  0.0956  0.1321  0.1408     0.0092  0.0143  0.0036  0.0103  0.0
 0.0519  0.0548  0.0842  0.0319     0.0085  0.0047  0.0048  0.0053  0.0
 0.0223  0.0375  0.0484  0.0475     0.0065  0.0093  0.0059  0.0022  0.0
 0.0164  0.0173  0.0347  0.007      0.0032  0.0035  0.0056  0.004   0.0
 0.0039  0.0063  0.0152  0.0336  …  0.0042  0.0003  0.0053  0.0036  0.0
 0.0123  0.0309  0.0169  0.0313     0.0026  0.0092  0.0009  0.0044  0.0
 0.0079  0.0086  0.0055  0.025      0.0059  0.0058  0.0059  0.0032  0.0
 ⋮                               ⋱      

# Fixing class imbalance

Now we make some preliminary adjustments to the data before we start training the models. First, we can see that the labels of the data are imbalanced:

In [35]:
# countmap builds a dictionary that counts how many 0's and 1's the label column has
countmap(data1.t, alg = :dict)

Dict{Float64, Int64} with 2 entries:
  0.0 => 97
  1.0 => 111

We see that we have more mines (111) than rocks (97). An imbalance like this is not ideal for training, since it can bias the model toward the larger class. To fix that, we use the smote function from the ClassImbalance package, which oversamples the minority class with synthetic observations and undersamples the majority class so that the two end up balanced. See https://docs.juliahub.com/ClassImbalance/2Pjhq/0.8.7/autodocs/.
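
The oversampling idea behind SMOTE is to create a synthetic observation on the segment between a point and one of its nearest neighbors of the same class. The sketch below illustrates that single step on toy data; `synthetic_point` is a hypothetical helper written for this illustration and is not the ClassImbalance implementation.

```julia
using Random, LinearAlgebra

# Hypothetical helper: build one synthetic point from row i of X
# by interpolating toward one of its k nearest neighbors.
function synthetic_point(X::Matrix{Float64}, i::Int, k::Int, rng::AbstractRNG)
    # Euclidean distance from row i to every row of X
    d = [norm(X[i, :] .- X[j, :]) for j in 1:size(X, 1)]
    d[i] = Inf                           # exclude the point itself
    neighbors = partialsortperm(d, 1:k)  # indexes of the k nearest rows
    j = rand(rng, neighbors)             # pick one neighbor at random
    w = rand(rng)                        # interpolation weight in [0, 1)
    return X[i, :] .+ w .* (X[j, :] .- X[i, :])
end

rng = MersenneTwister(1)
X_min = [0.0 0.0; 1.0 0.0; 0.0 1.0; 1.0 1.0]  # toy class with 4 observations
s = synthetic_point(X_min, 1, 2, rng)
println(s)
```

Because the new point is a convex combination of two existing observations, it always stays inside the region already covered by that class.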

In [37]:
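y1 = data1.t; # 0 = rock, 1 = mine; smote's convention is 0 = majority, 1 = minority, although here class 1 is the slightly larger one
X1 = A # A still contains the label column (column 61), so the balanced matrix keeps the labels too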
X0, y0 = smote(X1, y1, k = 5, pct_under = 150, pct_over = 200)

([0.055190842921485045 0.10654403845014279 … 0.010744737227115586 1.0; 0.06092555568490212 0.10236512329507327 … 0.011536351525261862 1.0; … ; 0.029267295078158225 0.06099083219303244 … 0.010264985135203186 1.0; 0.11509465825283761 0.2071876148765947 … 0.02021217458125348 1.0], [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0  …  1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

In [38]:
X0

666×61 Matrix{Float64}:
 0.0551908  0.106544   0.114441   0.130378   …  0.0202619   0.0107447   1.0
 0.0609256  0.102365   0.109162   0.137642      0.0232983   0.0115364   1.0
 0.0545872  0.113696   0.118691   0.142543      0.0212974   0.0114938   1.0
 0.0712     0.0901     0.1276     0.1497        0.0095      0.0068      1.0
 0.0067     0.0096     0.0024     0.0058        0.0051      0.0031      0.0
 0.0084     0.0153     0.0291     0.0432     …  0.0072      0.0045      0.0
 0.019      0.0038     0.0642     0.0452        0.0055      0.0122      0.0
 0.0131     0.0387     0.0329     0.0078        0.0015      0.0085      1.0
 0.0079     0.0086     0.0055     0.025         0.0059      0.0032      0.0
 0.0353     0.0713     0.0326     0.0272        0.0093      0.0053      0.0
 0.026      0.0192     0.0254     0.0061     …  0.0044      0.0077      0.0
 0.0598141  0.05267    0.0553135  0.0466552     0.0120306   0.0101718   1.0
 0.0196297  0.0427703  0.0536532  0.0763964     0.0163798   0.00

In [40]:
countmap(y0, alg = :dict)

Dict{Float64, Int64} with 2 entries:
  0.0 => 333
  1.0 => 333

We see that we ended up with a bigger dataset that has an equal number of observations of each class.

In [42]:
# here we convert the data back into a DataFrame (:auto generates the column names x1, ..., x61)
data1_balanced = DataFrame(X0, :auto)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0551908 | 0.106544 | 0.114441 | 0.130378 | 0.138722 | 0.0773536 | 0.0988919 | 0.0609983 |
| 2 | 0.0609256 | 0.102365 | 0.109162 | 0.137642 | 0.178345 | 0.0792192 | 0.115255 | 0.0777299 |
| 3 | 0.0545872 | 0.113696 | 0.118691 | 0.142543 | 0.139727 | 0.0606311 | 0.0822241 | 0.0697095 |
| 4 | 0.0712 | 0.0901 | 0.1276 | 0.1497 | 0.1284 | 0.1165 | 0.1285 | 0.1684 |
| 5 | 0.0067 | 0.0096 | 0.0024 | 0.0058 | 0.0197 | 0.0618 | 0.0432 | 0.0951 |
| 6 | 0.0084 | 0.0153 | 0.0291 | 0.0432 | 0.0951 | 0.0752 | 0.0414 | 0.0259 |
| 7 | 0.019 | 0.0038 | 0.0642 | 0.0452 | 0.0333 | 0.069 | 0.0901 | 0.1454 |
| 8 | 0.0131 | 0.0387 | 0.0329 | 0.0078 | 0.0721 | 0.1341 | 0.1626 | 0.1902 |
| 9 | 0.0079 | 0.0086 | 0.0055 | 0.025 | 0.0344 | 0.0546 | 0.0528 | 0.0958 |
| 10 | 0.0353 | 0.0713 | 0.0326 | 0.0272 | 0.037 | 0.0792 | 0.1083 | 0.0687 |


In [43]:
# and we reuse the original name for the balanced dataset
data1 = data1_balanced;

# Splitting the data

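The next step is to divide the data into two sets, namely the training and testing sets; we fit the models to the training data and then study the performance of each model on the testing set. Note that, because the balancing was done before the split, resampled copies of the same original observation can land in both sets, so the accuracies reported below are likely optimistic.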
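
To make the idea of a random split concrete, here is a minimal sketch that shuffles the row indexes and cuts them at a given fraction. The helper `split_indices` is hypothetical and written only for this illustration; it is not how Lathe implements `TrainTestSplit`, and the sizes it produces need not match the ones obtained in this notebook.

```julia
using Random

# Hypothetical helper: shuffle 1:n and cut the permutation at the fraction `at`
function split_indices(n::Int, at::Float64; rng = Random.default_rng())
    idx = shuffle(rng, 1:n)        # random permutation of the row indexes
    n_train = round(Int, at * n)   # number of rows assigned to training
    return idx[1:n_train], idx[n_train+1:end]
end

train_idx, test_idx = split_indices(666, 0.75; rng = MersenneTwister(42))
println(length(train_idx), " training rows, ", length(test_idx), " testing rows")
```

With a DataFrame `df`, the two sets would then be `df[train_idx, :]` and `df[test_idx, :]`.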

In [48]:
# We use TrainTestSplit from the Lathe package
using Lathe.preprocess: TrainTestSplit

train, test = TrainTestSplit(data1,.75);

In [51]:
train

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0545872 | 0.113696 | 0.118691 | 0.142543 | 0.139727 | 0.0606311 | 0.0822241 | 0.0697095 |
| 2 | 0.0712 | 0.0901 | 0.1276 | 0.1497 | 0.1284 | 0.1165 | 0.1285 | 0.1684 |
| 3 | 0.0067 | 0.0096 | 0.0024 | 0.0058 | 0.0197 | 0.0618 | 0.0432 | 0.0951 |
| 4 | 0.0084 | 0.0153 | 0.0291 | 0.0432 | 0.0951 | 0.0752 | 0.0414 | 0.0259 |
| 5 | 0.0131 | 0.0387 | 0.0329 | 0.0078 | 0.0721 | 0.1341 | 0.1626 | 0.1902 |
| 6 | 0.0079 | 0.0086 | 0.0055 | 0.025 | 0.0344 | 0.0546 | 0.0528 | 0.0958 |
| 7 | 0.0598141 | 0.05267 | 0.0553135 | 0.0466552 | 0.105536 | 0.075782 | 0.0851496 | 0.0463111 |
| 8 | 0.0196297 | 0.0427703 | 0.0536532 | 0.0763964 | 0.0842833 | 0.0874583 | 0.106551 | 0.234061 |
| 9 | 0.0369696 | 0.0506365 | 0.0610604 | 0.0871369 | 0.0913171 | 0.0812356 | 0.0440225 | 0.0953988 |
| 10 | 0.0116 | 0.0179 | 0.0449 | 0.1096 | 0.1913 | 0.0924 | 0.0761 | 0.1092 |


In [111]:
test

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0551908 | 0.106544 | 0.114441 | 0.130378 | 0.138722 | 0.0773536 | 0.0988919 | 0.0609983 |
| 2 | 0.0609256 | 0.102365 | 0.109162 | 0.137642 | 0.178345 | 0.0792192 | 0.115255 | 0.0777299 |
| 3 | 0.019 | 0.0038 | 0.0642 | 0.0452 | 0.0333 | 0.069 | 0.0901 | 0.1454 |
| 4 | 0.0353 | 0.0713 | 0.0326 | 0.0272 | 0.037 | 0.0792 | 0.1083 | 0.0687 |
| 5 | 0.026 | 0.0192 | 0.0254 | 0.0061 | 0.0352 | 0.0701 | 0.1263 | 0.108 |
| 6 | 0.0201 | 0.0116 | 0.0123 | 0.0245 | 0.0547 | 0.0208 | 0.0891 | 0.0836 |
| 7 | 0.0251473 | 0.0464832 | 0.0557241 | 0.0727519 | 0.0643764 | 0.0962323 | 0.127121 | 0.262736 |
| 8 | 0.0139 | 0.0222 | 0.0089 | 0.0108 | 0.0215 | 0.0136 | 0.0659 | 0.0954 |
| 9 | 0.0206 | 0.0132 | 0.0533 | 0.0569 | 0.0647 | 0.1432 | 0.1344 | 0.2041 |
| 10 | 0.0025 | 0.0309 | 0.0171 | 0.0228 | 0.0434 | 0.1224 | 0.1947 | 0.1661 |


We see that the function gave us a training set of size $523$ and a testing set of size $143$.

We are now ready to train some models and make predictions. We start with the regression models:

# Logistic Regression

The first classification model we are going to use is logistic regression. The model assigns to each observation a probability of belonging to class $1$, which we then turn into a label by thresholding.
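
As a minimal sketch of what "assigning a probability" means here, the logistic model passes a linear combination of the features through the sigmoid function and then thresholds the result. The coefficients and the observation below are made up for illustration; they are not the ones fitted in this notebook.

```julia
# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical intercept and coefficients for a model with two features
b0 = -1.0
b = [2.0, 0.5]

x = [0.9, 0.8]                       # one made-up observation
p = sigmoid(b0 + sum(b .* x))        # estimated probability of class 1
label = p < 0.5 ? 0 : 1              # same 0.5 threshold used in the notebook

println("probability = ", p, ", label = ", label)
```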

For this model we make use of the package GLM (Generalized Linear Models):

In [86]:
# We build the model on the train data
fm = @formula(x61 ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + x22 + x23 + x24 + x25 + x26 + x27 + x28 + x29 + x30 + x31 + x32 + x33 + x34 + x35 + x36 + x37 + x38 + x39 + x40 + x41 + x42 + x43 + x44 + x45 + x46 + x47 + x48 + x49 + x50 + x51 + x52 + x53 + x54 + x55 + x56 + x57 + x58 + x59 + x60)
# note: ProbitLink() fits a probit model; LogitLink() is the link for logistic regression proper
logit = glm(fm, train, Binomial(), ProbitLink())

# We make the prediction on the test data
prediction = GLM.predict(logit,test)

# Now we convert the probability that the model assigned into one of the labels, using 0.5 as the threshold
prediction_class = [if x < 0.5 0 else 1 end for x in prediction]

# here we can contrast the actual labels (1st col) with the predictions made by the model (2nd col)
prediction_df = DataFrame(y_actual = test.x61, y_predicted = prediction_class, prob_predicted = prediction)

| Row | y_actual | y_predicted | prob_predicted |
|---|---|---|---|
| | Float64 | Int64 | Float64⍰ |
| 1 | 1.0 | 1 | 1.0 |
| 2 | 1.0 | 1 | 1.0 |
| 3 | 0.0 | 0 | 9.62143e-92 |
| 4 | 0.0 | 0 | 5.83262e-10 |
| 5 | 0.0 | 0 | 0.0 |
| 6 | 0.0 | 0 | 0.0 |
| 7 | 1.0 | 1 | 1.0 |
| 8 | 0.0 | 0 | 2.11919e-9 |
| 9 | 0.0 | 0 | 2.13968e-11 |
| 10 | 0.0 | 0 | 1.36418e-9 |


# Performance measure: Accuracy

But what does the above information tell us about how well the model worked? One of the simplest ways to measure the performance of a model is the accuracy: the fraction of labels in the test data that the model predicted correctly. This measure should be treated carefully, because it can be misleading when the data is not properly balanced (see https://www.machinelearningplus.com/julia/logistic-regression-in-julia-practical-guide-with-examples/), but for now we just want an idea of how the models compare with each other on the same training and testing data.
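
The following toy example, with made-up labels, sketches both the computation of accuracy and why it can mislead on imbalanced data: a "model" that always predicts the majority class still scores 90%.

```julia
using Statistics

# Made-up, heavily imbalanced labels: one positive among ten observations
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A trivial predictor that always answers 0
y_pred = zeros(Int, length(y_true))

# Accuracy is the mean of the element-wise comparison
acc = mean(y_true .== y_pred)
println("accuracy = ", acc * 100, " %")
```

The predictor never finds the single positive observation, yet its accuracy looks high, which is why the classes were balanced before training.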

In [96]:
# in order to measure accuracy, we take the mean of the following list
# this counts 1 if the model predicted correctly and 0 otherwise
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted;

In [95]:
# Accuracy Score
accuracy = mean(prediction_df.correctly_classified)
print("Accuracy of the model is : ",accuracy*100, " %")

Accuracy of the model is : 97.9020979020979 %

# Linear Regression

Next up we have linear regression. It is not usually used as a classification model, mainly because it doesn't do a very good job predicting labels for unseen data (see: https://stats.stackexchange.com/questions/22381/why-not-approach-classification-through-regression), as the performance of the model will show:

In [98]:
#first we set up the model
N = nrow(train) # size of the train data (523 here)
A = Matrix(train[:, 1:60])
A = [ones(N,1) A];
b = Array(train.x61);

In [99]:
# we solve the linear system
xhat=A\b;
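
The backslash operator above returns the least-squares solution of the (overdetermined) system. As a minimal sketch on toy data, the same pattern looks like this, with the prediction written in matrix form; the arrays `X_train`, `X_test` and their values are made up for illustration.

```julia
using LinearAlgebra

# Toy training data generated exactly by b = 0 + 2*x1 + 1*x2
X_train = [0.0 1.0; 1.0 0.0; 1.0 1.0; 2.0 1.0]
b = [1.0, 2.0, 3.0, 5.0]

# Prepend a column of ones for the intercept and solve in the least-squares sense
A = [ones(size(X_train, 1)) X_train]
xhat = A \ b

# Predict for new observations with one matrix-vector product
X_test = [3.0 2.0]
pred = [ones(size(X_test, 1)) X_test] * xhat

println("coefficients: ", xhat)
println("prediction:   ", pred)
```

Writing the prediction as a matrix-vector product uses every coefficient at once and avoids spelling out one term per feature.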

In [100]:
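#now we make the prediction on the test data
# note: the `;` after the x26 term below ends the statement, so only the intercept and the first 26 features
# contribute to pred; removing it would include all 60 features (and change the outputs shown below)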
pred = xhat[1]*ones(143,1) + xhat[2]*test.x1 + xhat[3]*test.x2 + xhat[4]*test.x3 + xhat[5]*test.x4 + xhat[6]*test.x5 + xhat[7]*test.x6 + xhat[8]*test.x7 + xhat[9]*test.x8 + xhat[10]*test.x9 + xhat[11]*test.x10 + xhat[12]*test.x11 + xhat[13]*test.x12 + xhat[14]*test.x13 + xhat[15]*test.x14 + xhat[16]*test.x15 + xhat[17]*test.x16 + xhat[18]*test.x17 + xhat[19]*test.x18 + xhat[20]*test.x19 + xhat[21]*test.x20 + xhat[22]*test.x21 + xhat[23]*test.x22 + xhat[24]*test.x23 + xhat[25]*test.x24 + xhat[26]*test.x25 + xhat[27]*test.x26; + xhat[28]*test.x27 + xhat[29]*test.x28 + xhat[30]*test.x29 + xhat[31]*test.x30 + xhat[32]*test.x31 + xhat[33]*test.x32 + xhat[34]*test.x33 + xhat[35]*test.x34 + xhat[36]*test.x35 + xhat[37]*test.x36 + xhat[38]*test.x37 + xhat[39]*test.x38 + xhat[40]*test.x39 + xhat[41]*test.x40 + xhat[42]*test.x41 + xhat[43]*test.x42 + xhat[44]*test.x43 + xhat[45]*test.x44 + xhat[46]*test.x45 + xhat[47]*test.x46 + xhat[48]*test.x47 + xhat[49]*test.x48 + xhat[50]*test.x49 + xhat[51]*test.x50 + xhat[52]*test.x51 + xhat[53]*test.x52 + xhat[54]*test.x53 + xhat[55]*test.x54 + xhat[56]*test.x55 + xhat[57]*test.x56 + xhat[58]*test.x57 + xhat[59]*test.x58 + xhat[60]*test.x59 + xhat[61]*test.x60

#we see the values that the model gave to each of the observations (they are not probabilities: some are negative)
pred = DataFrame(pred, :auto)

| Row | x1 |
|---|---|
| | Float64 |
| 1 | 0.516064 |
| 2 | 0.262142 |
| 3 | 0.0712967 |
| 4 | -0.220874 |
| 5 | -0.105759 |
| 6 | -0.186986 |
| 7 | 0.314711 |
| 8 | -0.0768633 |
| 9 | -0.114551 |
| 10 | -0.0711872 |


In [101]:
#now we use 0.5 as a threshold once again and assign the prediction label to the data
pred_class = [if x < 0.5 0 else 1 end for x in pred.x1];

In [102]:
# here we do something similar to the logistic regression model in order to measure accuracy
# (we pass the column pred.x1 rather than the whole DataFrame pred)
pred_df = DataFrame(y_actual = test.x61, y_predicted = pred_class, prob_predicted = pred.x1);
pred_df.correctly_classified = pred_df.y_actual .== pred_df.y_predicted;

In [105]:
# Accuracy Score
accuracy = mean(pred_df.correctly_classified)
print("Accuracy of the model is : ",accuracy*100, " %")

Accuracy of the model is : 62.23776223776224 %

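We can see that the accuracy is significantly lower than that of the logistic regression model, and only about 12 points above the 50% that random guessing would reach on balanced classes. Note that the prediction cell above ends its statement at the x26 term, so this score reflects only the intercept and the first 26 features.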

# SVM

Now we use the Support Vector Machine model for the data:

In [71]:
using LIBSVM # SVM library
using RDatasets
using Printf
using Statistics, LinearAlgebra, Random

In [106]:
# Here we set up the model on the train data
X = Matrix(train[:, 1:60])' # the X input has to be transposed
y = train.x61
X0 = Matrix(test[:, 1:60])'
model = svmtrain(X, y); # svmtrain makes the svm model

# Now we test on the test data
(predicted_labels, decision_values) = svmpredict(model, X0);

# And finally we measure accuracy
@printf "Accuracy of the model is: %.2f%%\n" mean((predicted_labels .== test.x61))*100

Accuracy of the model is: 82.52%


We can see that the model did better than Linear Regression but not better than Logistic Regression in terms of accuracy.

# Decision Trees

Next up we have decision trees: the model builds a tree that classifies the data by choosing, at each node, the feature and threshold that best split the data into subsets that are as uniform in class as possible.
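
To illustrate how a single threshold can be chosen, the sketch below scans the candidate thresholds of one feature and keeps the one with the lowest weighted impurity. Gini impurity is used here only as an example criterion, and `gini` and `best_threshold` are hypothetical helpers; DecisionTree.jl may use a different criterion and search strategy.

```julia
# Gini impurity of a label vector: 0 when all labels are equal
gini(y) = isempty(y) ? 0.0 : 1.0 - sum((count(==(c), y) / length(y))^2 for c in unique(y))

# Hypothetical helper: best threshold for one feature x with labels y
function best_threshold(x::Vector{Float64}, y::Vector{Int})
    best_t, best_score = NaN, Inf
    xs = sort(unique(x))
    for i in 1:length(xs)-1
        t = (xs[i] + xs[i+1]) / 2                 # midpoint between consecutive values
        left, right = y[x .< t], y[x .>= t]       # labels on each side of the split
        score = (length(left) * gini(left) + length(right) * gini(right)) / length(y)
        if score < best_score
            best_t, best_score = t, score
        end
    end
    return best_t, best_score
end

# Made-up feature values: small values are class 0, large values are class 1
x = [0.01, 0.02, 0.03, 0.08, 0.09, 0.10]
y = [0, 0, 0, 1, 1, 1]
t, score = best_threshold(x, y)
println("threshold = ", t, ", weighted impurity = ", score)
```

A full tree repeats this search over every feature and then recurses on the two resulting subsets.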

In [107]:
using DecisionTree # This is the Julia library for Decision Trees model

Xt = Matrix(train[:, 1:60]);
yt = train.x61;

tmodel = DecisionTreeClassifier() # creates a tree model; the desired depth of the tree can optionally be specified

DecisionTree.fit!(tmodel, Xt, yt) # we fit the previous model to the train data
# and we can see the tree, which features it took and the corresponding thresholds
print_tree(tmodel)

Feature 4, Threshold 0.0606
L-> Feature 12, Threshold 0.22875
    L-> Feature 19, Threshold 0.8186
        L-> Feature 1, Threshold 0.0382
            L-> Feature 28, Threshold 0.9929
                L-> Feature 4, Threshold 0.05776505489091939
                    L-> Feature 29, Threshold 0.19855
                        L-> 1.0 : 1/1
                        R-> 0.0 : 159/159
                    R-> 1.0 : 1/1
                R-> 1.0 : 1/1
            R-> Feature 15, Threshold 0.5374218518637437
                L-> 1.0 : 3/3
                R-> 0.0 : 2/2
        R-> Feature 27, Threshold 0.83045
            L-> 1.0 : 6/6
            R-> 0.0 : 3/3
    R-> Feature 27, Threshold 0.7672713000639921
        L-> Feature 55, Threshold 0.016444641023191486
            L-> Feature 51, Threshold 0.021096420459993263
                L-> Feature 2, Threshold 0.01925
                    L-> Feature 11, Threshold 0.19531994396532976
                        L-> 0.0 : 1/1
                        R-> 1.

In [110]:
# here we test the model on the test data
Xt0 = Matrix(test[:,1:60])
yt_hat = DecisionTree.predict(tmodel, Xt0) # the DecisionTree package has its own predict function; a new name keeps the training labels yt intact

# and finally we measure the accuracy of the model
accuracy = mean(yt_hat .== test.x61)
print("accuracy of the model is: ", accuracy*100, "%")

accuracy of the model is: 95.1048951048951%

# KNN
Finally, we have the K-Nearest Neighbors model for classification: given a point $x$ that we want to classify, the model takes the $k$ nearest neighbors of $x$ and assigns to $x$ the label that most of those neighbors have. To avoid ties between the two classes, one should choose an odd value of $k$. Also, $k$ should not be too big, since taking too many neighbors can be counterproductive.
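
The voting rule can be sketched by brute force, computing the distance from the query point to every training observation. The helper `knn_predict` and the toy data are made up for this illustration; the cells of this notebook use the NearestNeighbors package instead.

```julia
using LinearAlgebra

# Hypothetical helper: X holds one observation per column, y holds 0.0/1.0 labels
function knn_predict(X::Matrix{Float64}, y::Vector{Float64}, x::Vector{Float64}, k::Int)
    d = [norm(X[:, j] .- x) for j in 1:size(X, 2)]   # distance to every observation
    nearest = partialsortperm(d, 1:k)                # indexes of the k closest ones
    votes = y[nearest]
    # majority vote; with odd k and two classes there are no ties
    return count(==(1.0), votes) > k / 2 ? 1.0 : 0.0
end

# Two made-up clusters: class 0 near the origin, class 1 near (1, 1)
X_toy = [0.0 0.1 0.2 1.0 1.1 1.2;
         0.0 0.1 0.0 1.0 0.9 1.0]
y_toy = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

println(knn_predict(X_toy, y_toy, [0.05, 0.05], 3))
println(knn_predict(X_toy, y_toy, [1.05, 0.95], 3))
```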

A brute-force search for the neighbors is expensive in terms of computation time, but thanks to the $k$-dimensional tree structure (see https://en.wikipedia.org/wiki/K-d_tree) we can partition the set of points and search for neighbors efficiently.

The NearestNeighbors package of Julia implements the concepts mentioned above:

In [78]:
using NearestNeighbors #we add the knn package for julia
# remember  X = Matrix(train[:, 1:60])'
#           y = train.x61
#           X0 = Matrix(test[:, 1:60])'

kdtree = KDTree(X) # create the k-dimensional tree of the data

k = 5 # choose an odd and not too big value for k

index_knn, distances = knn(kdtree, X0, k, true) # indexes of the 5 nearest neighbors of each test point in X0, and the corresponding distances

# now we put that information back into a matrix where we can read it

index_knn_matrix = hcat(index_knn...)
# each row holds the indexes of the nearest neighbors of the respective observation of the test dataset
index_knn_matrix_t = permutedims(index_knn_matrix)

143×5 Matrix{Int64}:
  25  489  431    1   19
 296  489  274  484   25
 390  405  185   91  354
 171   84   83   12  337
 389  459  428  277  406
 286  267  194  444   54
 403  442  409  125   98
 521  263  316  363  142
  75  256  275  282  320
 428  277  406  389  459
 248  446  488  165  516
 171   84   83   12  337
 479   85  161  411   62
   ⋮                 
 195  122  300  333    3
 161  479   85  411  515
 457  443  310  208  376
 427  356  166  204  514
 248  446  488  165  516
 262  383  202   97  107
 251   18  274  216  296
 457  443  310  208  376
 330  302  433  149  303
 228    7  440  179    9
 258  188   71  415  366
 442  409  403  239   96

In [83]:
#here we take the classes (labels) of the nearest neighbors to each observation from the test data
knn_classes = y[index_knn_matrix_t]

# now we make the prediction y_hat by taking the class that appears the most among those 5 neighbors:
# countmap (from the StatsBase package) counts the classes and argmax returns the most frequent one

y_hat = [
    argmax(countmap(knn_classes[i, :], alg = :dict))
    for i in axes(knn_classes, 1)
]


143-element Vector{Float64}:
 1.0
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 ⋮
 1.0
 0.0
 0.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 0.0
 1.0

In [114]:
# and finally we check accuracy

accuracy = mean(y_hat .== test.x61)

print("accuracy of the model: ", accuracy*100, "%")

accuracy of the model: 95.1048951048951%

# Which model did better?

Based merely on the accuracy of each model, the one that did best was the logistic regression model (97.9%), although the decision tree and KNN also reached a high accuracy (95.1% each). To compare the models in more depth, we can use other performance measures, such as the confusion matrix.
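
As a minimal sketch of that next step, the four entries of a binary confusion matrix can be counted directly from the actual and predicted labels. The helper `confusion_counts` and the label vectors are made up for illustration and are not results from the models above.

```julia
# Hypothetical helper: counts of true/false positives and negatives for 0/1 labels
function confusion_counts(y_true, y_pred)
    tp = count((y_true .== 1) .& (y_pred .== 1))   # mines predicted as mines
    fp = count((y_true .== 0) .& (y_pred .== 1))   # rocks predicted as mines
    fn = count((y_true .== 1) .& (y_pred .== 0))   # mines predicted as rocks
    tn = count((y_true .== 0) .& (y_pred .== 0))   # rocks predicted as rocks
    return (tp = tp, fp = fp, fn = fn, tn = tn)
end

# Made-up labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

c = confusion_counts(y_true, y_pred)
prec = c.tp / (c.tp + c.fp)   # fraction of predicted mines that are mines
rec = c.tp / (c.tp + c.fn)    # fraction of actual mines that were found

println(c)
println("precision = ", prec, ", recall = ", rec)
```

Unlike accuracy, these counts separate the two kinds of error, which matters when mistaking a mine for a rock is costlier than the opposite mistake.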
