## Support Vector Machine (SVM) Classifier

An SVM classifier finds a hyperplane that separates the instances of different classes in feature space.

This method is effective on high-dimensional problems.
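To make the idea of a separating hyperplane concrete, the sketch below fits a linear SVM to a small made-up two-cluster dataset (not the data used in this notebook) and recovers the hyperplane from the fitted model. The names `toy`, `toy_model`, `w`, `b`, and `manual` are illustrative assumptions.

```r
library(e1071)

set.seed(1)
# Hypothetical toy data: two 2-D clusters, one per class
toy <- data.frame(
  x1 = c(rnorm(20, mean = -2), rnorm(20, mean = 2)),
  x2 = c(rnorm(20, mean = -2), rnorm(20, mean = 2)),
  y  = factor(rep(c("a", "b"), each = 20))
)

toy_model <- svm(y ~ ., data = toy, kernel = "linear", cost = 10, scale = FALSE)

# For a linear kernel the decision function is f(x) = w . x + b
w <- t(toy_model$coefs) %*% toy_model$SV  # 1 x 2 weight vector
b <- -toy_model$rho                       # intercept

# Decision values computed by hand; the sign gives the side of the hyperplane
manual <- as.matrix(toy[, c("x1", "x2")]) %*% t(w) + b
head(manual)
```

The hand-computed values should agree with the ones `predict(toy_model, toy, decision.values = TRUE)` attaches as the `"decision.values"` attribute.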

In [3]:
library(e1071)
training_set <- read.csv("../Data/PreProcess/processed_training_data_split.csv")
validation_set <- read.csv("../Data/PreProcess/processed_verification_data_split.csv")

In [4]:
column_to_drop<-c("X.1","X")
training_set<-training_set[,!(names(training_set) %in% column_to_drop)] # drop the desired columns
validation_set<-validation_set[,!(names(validation_set) %in% column_to_drop)] # drop the desired columns

In [5]:
training_set$id <- factor(training_set$id)
training_set

id,population,permit,construction_year,quality_group,quantity,Internal,Lake.Nyasa,Lake.Rukwa,Lake.Tanganyika,⋯,X18,X19,X20,X21,X24,X40,X60,X80,X90,X99
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
functional,-0.07262067,0.6889358,-1.2105682,0.3440186,0.81768317,0,0,0,0,⋯,0,0,0,1,0,0,0,0,0,0
functional,-0.08339009,0.6889358,0.0000000,0.3440186,-0.08578228,0,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
functional,-0.61825111,0.6889358,-1.2105682,-3.1172670,0.81768317,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
non functional,-0.07624876,0.6889358,0.0000000,0.3440186,0.81768317,1,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
non functional,-0.07624876,0.6889358,0.0000000,-0.8097433,0.81768317,0,0,0,1,⋯,0,0,0,0,0,0,0,0,0,0
non functional,0.48793255,0.6889358,0.0000000,-3.1172670,-0.08578228,0,0,0,1,⋯,0,0,0,0,0,0,0,0,0,0
functional,-0.08339009,0.6889358,0.0000000,0.3440186,0.81768317,0,0,0,0,⋯,1,0,0,0,0,0,0,0,0,0
functional,0.13555158,-1.5302158,-1.4092612,-3.1172670,0.81768317,0,0,0,0,⋯,0,0,0,0,0,0,1,0,0,0
functional,-0.07262067,-1.5302158,0.9750546,0.3440186,-0.98924773,0,1,0,0,⋯,0,0,0,0,0,0,0,0,0,0
functional,-0.61825111,0.6889358,-1.2105682,0.3440186,0.81768317,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


## SVM Hyperparameters

The SVM classifier has several hyperparameters, including:

* Kernel type: linear, polynomial, radial basis, or sigmoid. It changes the shape of the dividing surface.
* Degree: needed only for the polynomial kernel.
* Gamma: affects the shape of the surface; needed for every kernel except linear.
* coef0: needed for the polynomial and sigmoid kernels.
* cost: the cost of constraint violation.
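To show where `gamma`, `coef0`, and `degree` enter, the four kernels can be written out in a few lines of base R, following the formulas given in the e1071 documentation. This is a sketch for intuition only: the function names are illustrative and not part of e1071, which evaluates kernels internally.

```r
# u and v are numeric feature vectors of equal length
linear_kernel  <- function(u, v) sum(u * v)
poly_kernel    <- function(u, v, gamma, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}
radial_kernel  <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
sigmoid_kernel <- function(u, v, gamma, coef0 = 0) tanh(gamma * sum(u * v) + coef0)

u <- c(1, 2)
v <- c(3, 4)
linear_kernel(u, v)                                    # 11
poly_kernel(u, v, gamma = 0.5, coef0 = 1, degree = 2)  # (0.5 * 11 + 1)^2 = 42.25
radial_kernel(u, u, gamma = 0.5)                       # 1, identical points
```

The linear kernel takes none of the three parameters, which is why they matter only for the other kernel types.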

The SVM classifier is trained on the training set; accuracy is then measured on the validation set.

In [15]:
svm_model <- svm(id ~ ., data=training_set, kernel="linear", cost=10, scale=FALSE)  # No need to scale as data is already processed
print(svm_model)


Call:
svm(formula = id ~ ., data = training_set, kernel = "linear", cost = 10, 
    scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 

Number of Support Vectors:  30679



## Training set accuracy

In [63]:
svm.predict <- predict(svm_model, training_set[1:100,])

In [66]:
classAgreement(table(pred=svm.predict, true=training_set$id[1:100]))
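`classAgreement()` reports several agreement measures; its `diag` element is the share of observations on the main diagonal of the table, which is the accuracy. The same figure can be computed directly from a confusion table in base R, as in this sketch. The label vectors are made up for illustration.

```r
# Hypothetical predicted and true labels, for illustration only
lvls <- c("functional", "non functional")
pred <- factor(c("functional", "functional", "non functional",
                 "functional", "non functional"), levels = lvls)
true <- factor(c("functional", "non functional", "non functional",
                 "functional", "non functional"), levels = lvls)

confusion <- table(pred = pred, true = true)

# Accuracy = correctly classified observations / all observations
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # 0.8 (4 of the 5 labels match)
```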

## Validation set accuracy

In [68]:
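validation_complete <- na.omit(validation_set)  # predict() skips rows with missing values, so keep only complete rows
svm.predict <- predict(svm_model, validation_complete)

In [69]:
classAgreement(table(pred=svm.predict, true=validation_complete$id))  # predictions and labels cover the same rows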

## Testing with polynomial SVM

In [6]:
svm_model_poly <- svm(id ~ ., data=training_set, kernel="polynomial", cost=20, scale=FALSE)

In [7]:
svm.predict <- predict(svm_model_poly, training_set[1:500,])
classAgreement(table(pred=svm.predict, true=training_set$id[1:500]))

In [8]:
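validation_complete <- na.omit(validation_set)  # predict() skips rows with missing values, so keep only complete rows
svm.predict <- predict(svm_model_poly, validation_complete)
classAgreement(table(pred=svm.predict, true=validation_complete$id))  # predictions and labels cover the same rows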

Increasing cost increases accuracy on both the training and validation sets, but it also increases overfitting.

* With cost=20, training accuracy = 72% and validation accuracy = 66%.
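One way to examine this trade-off is to sweep `cost` and compare training and held-out accuracy side by side. The sketch below does so on the built-in `iris` data rather than the dataset used in this notebook; the 70/30 split, the cost grid, and the helper name `accuracy` are arbitrary assumptions for illustration.

```r
library(e1071)

set.seed(42)
# iris stands in for the real data; the 70/30 split is arbitrary
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
valid <- iris[-train_idx, ]

# Fraction of rows whose predicted class equals the true class
accuracy <- function(model, data) mean(predict(model, data) == data$Species)

costs <- c(0.1, 1, 10, 100)
results <- do.call(rbind, lapply(costs, function(cost) {
  model <- svm(Species ~ ., data = train, kernel = "polynomial", cost = cost)
  data.frame(cost = cost,
             training_accuracy = accuracy(model, train),
             validation_accuracy = accuracy(model, valid))
}))
print(results)
```

A growing gap between the two accuracy columns as `cost` rises is the usual sign of overfitting. e1071 also provides `tune.svm()` for a cross-validated search over such a grid.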

## Testing with radial basis SVM 

In [11]:
svm_model_radial <- svm(id ~ ., data=training_set, kernel="radial", cost=10, scale=FALSE)

In [12]:
svm.predict <- predict(svm_model_radial, training_set[1:500,])
classAgreement(table(pred=svm.predict, true=training_set$id[1:500]))

In [13]:
validation_complete <- na.omit(validation_set)  # predict() skips rows with missing values, so keep only complete rows
svm.predict <- predict(svm_model_radial, validation_complete)
classAgreement(table(pred=svm.predict, true=validation_complete$id))  # predictions and labels cover the same rows