# Support Vector Machine

This notebook uses a Support Vector Machine (SVM), alongside several baseline classifiers, to classify data extracted from the LIDC-IDRI dataset.

The `.csv` file employed in this version contains a **clean and analyzed** dataset derived from the raw data using the `pylidc`, `pyradiomics`, and deep feature extraction methods.

The relevant methods can be found in the **csv_cleanup** folder.

## Importing libraries and Datasets

We will begin by importing the relevant and necessary libraries.

In [142]:
# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
import warnings

Next, we will load the cleaned dataset into a pandas DataFrame for further processing.

In [143]:
df = pd.read_csv('clean_rfe50.csv')
df.head()

Unnamed: 0,is_cancer,resnet3d_feature_33,resnet3d_feature_41,resnet3d_feature_52,resnet3d_feature_63,resnet3d_feature_64,resnet3d_feature_66,resnet3d_feature_67,resnet3d_feature_68,resnet3d_feature_72,...,original_firstorder_Median,original_firstorder_Minimum,original_firstorder_RobustMeanAbsoluteDeviation,original_firstorder_Skewness,original_firstorder_Variance,original_gldm_DependenceNonUniformityNormalized,original_glrlm_GrayLevelNonUniformity,original_glrlm_LongRunEmphasis,original_glrlm_RunLengthNonUniformity,original_glrlm_ShortRunEmphasis
0,2,0.339816,0.562524,0.442979,0.730204,0.333658,0.491633,0.398934,0.413691,0.399483,...,0.364344,0.011082,0.001734,0.648394,0.005157,0.21098,0.87881,0.15793,0.324954,0.496851
1,2,0.526288,0.736178,0.486639,0.796184,0.340574,0.547593,0.475688,0.532664,0.69038,...,0.161537,0.034874,0.0,0.359984,0.000511,0.465737,0.890961,0.364065,0.24092,0.253756
2,0,0.616598,0.428909,0.104794,0.619367,0.502572,0.125043,0.291206,0.484636,0.376556,...,0.224852,0.000876,0.0,0.610007,0.002463,0.448261,0.874003,0.314858,0.249673,0.428741
3,2,0.395332,0.431717,0.251262,0.73569,0.667631,0.282189,0.237252,0.245101,0.460938,...,0.368496,0.00453,0.004242,0.659629,0.006736,0.313686,0.804823,0.218568,0.269539,0.537671
4,1,0.432535,0.491978,0.446434,0.571172,0.847354,0.442363,0.294933,0.210586,0.340075,...,0.570872,0.018355,0.16319,0.800094,0.029328,0.247511,0.736251,0.118588,0.323211,0.432529
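The `Unnamed: 0` column in the preview above is most likely the row index that was saved when the CSV was written. Because the features are later built by dropping only `is_cancer`, that column would be treated as a feature. The sketch below shows one way to avoid this, assuming the column really is a saved index; the tiny CSV text and the `feature_a`/`feature_b` names are hypothetical stand-ins for `clean_rfe50.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for clean_rfe50.csv: the empty first header cell
# is what pandas reports as "Unnamed: 0" when reading the file back.
csv_text = """,is_cancer,feature_a,feature_b
0,2,0.34,0.56
1,2,0.53,0.74
2,0,0.62,0.43
"""

# Default read: the saved index comes back as an ordinary column.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))  # ['Unnamed: 0', 'is_cancer', 'feature_a', 'feature_b']
print(list(df_indexed.columns))  # ['is_cancer', 'feature_a', 'feature_b']
```

Note that removing the column changes the feature matrix, so the accuracies reported in this notebook would need to be recomputed.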


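After that, we need to split the features and target variable. We also define a 10-fold cross-validation splitter; note that the evaluation function below scores each model on a single 70/30 hold-out split rather than on these folds.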

In [144]:
X = df.drop(columns=['is_cancer'])  
y = df['is_cancer']  

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
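As a minimal sketch of how the `kfold` splitter above could drive a 10-fold evaluation, the example below passes it to `cross_val_score`. The data here is synthetic (generated with `make_classification`) and only stands in for the real feature matrix; `X_demo` and `y_demo` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic three-class data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=42,
)

# Same splitter configuration as in the notebook.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# One accuracy value per fold; the model is refit on each training fold.
scores = cross_val_score(
    LogisticRegression(max_iter=700), X_demo, y_demo,
    cv=kfold, scoring='accuracy',
)

print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

Reporting the mean and spread over folds gives a less split-dependent estimate than a single hold-out accuracy.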

In [156]:
def model_evaluation(X, y, imbalanced=False):
    """Fit each model on a 70/30 hold-out split and print its accuracy and confusion matrix."""
    # Define models
    if not imbalanced:
        models = {
            'Logistic Regression': LogisticRegression(max_iter=700),
            'Random Forest': RandomForestClassifier(),
            'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'),
            'SVM': SVC(),
            'Gauss': GaussianNB()
        }
    else:
        models = {
            'Random Forest': RandomForestClassifier(class_weight='balanced'),
        }

    # Define test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train and evaluate models
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        confusion = confusion_matrix(y_test, predictions)
        
        print(f"{name} Accuracy: {accuracy:.2f}\n")
        print(f"{name} Confusion Matrix:\n{confusion}\n")


In [157]:
model_evaluation(X,y)

Logistic Regression Accuracy: 0.58

Logistic Regression Confusion Matrix:
[[ 62 190   8]
 [ 50 334  30]
 [  4  49  61]]

Random Forest Accuracy: 0.56

Random Forest Confusion Matrix:
[[ 75 176   9]
 [ 77 305  32]
 [  3  50  61]]

XGBoost Accuracy: 0.54

XGBoost Confusion Matrix:
[[ 87 165   8]
 [ 96 286  32]
 [  2  56  56]]

SVM Accuracy: 0.57

SVM Confusion Matrix:
[[ 21 231   8]
 [ 18 367  29]
 [  1  51  62]]

Gauss Accuracy: 0.51

Gauss Confusion Matrix:
[[133  82  45]
 [130 180 104]
 [  8  15  91]]



Because the accuracy was poor (between 0.51 and 0.58 across all models), we decided to transform the dataset into a binary classification problem.

In [158]:
# Removes ambiguous option
#mask = df['is_cancer'] == 1
#df = df[~mask]

# Maps ambiguous values to cancer or not cancer
df_amb_cancer = df.copy()
df_amb_not_cancer = df.copy()

# Remapping required for the tree-based models (labels expected to be 0 and 1)
df_amb_cancer['is_cancer'] = df_amb_cancer['is_cancer'].replace(2, 1)

df_amb_not_cancer['is_cancer'] = df_amb_not_cancer['is_cancer'].replace({1: 0, 2: 1})


In [159]:
df_amb_cancer.head()

Unnamed: 0,is_cancer,resnet3d_feature_33,resnet3d_feature_41,resnet3d_feature_52,resnet3d_feature_63,resnet3d_feature_64,resnet3d_feature_66,resnet3d_feature_67,resnet3d_feature_68,resnet3d_feature_72,...,original_firstorder_Median,original_firstorder_Minimum,original_firstorder_RobustMeanAbsoluteDeviation,original_firstorder_Skewness,original_firstorder_Variance,original_gldm_DependenceNonUniformityNormalized,original_glrlm_GrayLevelNonUniformity,original_glrlm_LongRunEmphasis,original_glrlm_RunLengthNonUniformity,original_glrlm_ShortRunEmphasis
0,1,0.339816,0.562524,0.442979,0.730204,0.333658,0.491633,0.398934,0.413691,0.399483,...,0.364344,0.011082,0.001734,0.648394,0.005157,0.21098,0.87881,0.15793,0.324954,0.496851
1,1,0.526288,0.736178,0.486639,0.796184,0.340574,0.547593,0.475688,0.532664,0.69038,...,0.161537,0.034874,0.0,0.359984,0.000511,0.465737,0.890961,0.364065,0.24092,0.253756
2,0,0.616598,0.428909,0.104794,0.619367,0.502572,0.125043,0.291206,0.484636,0.376556,...,0.224852,0.000876,0.0,0.610007,0.002463,0.448261,0.874003,0.314858,0.249673,0.428741
3,1,0.395332,0.431717,0.251262,0.73569,0.667631,0.282189,0.237252,0.245101,0.460938,...,0.368496,0.00453,0.004242,0.659629,0.006736,0.313686,0.804823,0.218568,0.269539,0.537671
4,1,0.432535,0.491978,0.446434,0.571172,0.847354,0.442363,0.294933,0.210586,0.340075,...,0.570872,0.018355,0.16319,0.800094,0.029328,0.247511,0.736251,0.118588,0.323211,0.432529


In [160]:
df_amb_not_cancer.head()

Unnamed: 0,is_cancer,resnet3d_feature_33,resnet3d_feature_41,resnet3d_feature_52,resnet3d_feature_63,resnet3d_feature_64,resnet3d_feature_66,resnet3d_feature_67,resnet3d_feature_68,resnet3d_feature_72,...,original_firstorder_Median,original_firstorder_Minimum,original_firstorder_RobustMeanAbsoluteDeviation,original_firstorder_Skewness,original_firstorder_Variance,original_gldm_DependenceNonUniformityNormalized,original_glrlm_GrayLevelNonUniformity,original_glrlm_LongRunEmphasis,original_glrlm_RunLengthNonUniformity,original_glrlm_ShortRunEmphasis
0,1,0.339816,0.562524,0.442979,0.730204,0.333658,0.491633,0.398934,0.413691,0.399483,...,0.364344,0.011082,0.001734,0.648394,0.005157,0.21098,0.87881,0.15793,0.324954,0.496851
1,1,0.526288,0.736178,0.486639,0.796184,0.340574,0.547593,0.475688,0.532664,0.69038,...,0.161537,0.034874,0.0,0.359984,0.000511,0.465737,0.890961,0.364065,0.24092,0.253756
2,0,0.616598,0.428909,0.104794,0.619367,0.502572,0.125043,0.291206,0.484636,0.376556,...,0.224852,0.000876,0.0,0.610007,0.002463,0.448261,0.874003,0.314858,0.249673,0.428741
3,1,0.395332,0.431717,0.251262,0.73569,0.667631,0.282189,0.237252,0.245101,0.460938,...,0.368496,0.00453,0.004242,0.659629,0.006736,0.313686,0.804823,0.218568,0.269539,0.537671
4,0,0.432535,0.491978,0.446434,0.571172,0.847354,0.442363,0.294933,0.210586,0.340075,...,0.570872,0.018355,0.16319,0.800094,0.029328,0.247511,0.736251,0.118588,0.323211,0.432529


Let's try again:

In [161]:
X_cancer = df_amb_cancer.drop(columns=['is_cancer'])  
y_cancer = df_amb_cancer['is_cancer']  

X_not = df_amb_not_cancer.drop(columns=['is_cancer'])  
y_not = df_amb_not_cancer['is_cancer'] 

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

In [162]:
warnings.filterwarnings('ignore')

model_evaluation(X_cancer,y_cancer)
print("------------------------------------------")
model_evaluation(X_not,y_not)

Logistic Regression Accuracy: 0.68

Logistic Regression Confusion Matrix:
[[ 56 204]
 [ 50 478]]

Random Forest Accuracy: 0.66

Random Forest Confusion Matrix:
[[ 67 193]
 [ 73 455]]

XGBoost Accuracy: 0.65

XGBoost Confusion Matrix:
[[ 83 177]
 [100 428]]

SVM Accuracy: 0.67

SVM Confusion Matrix:
[[ 15 245]
 [ 14 514]]

Gauss Accuracy: 0.62

Gauss Confusion Matrix:
[[165  95]
 [201 327]]

------------------------------------------
Logistic Regression Accuracy: 0.89

Logistic Regression Confusion Matrix:
[[646  28]
 [ 60  54]]

Random Forest Accuracy: 0.90

Random Forest Confusion Matrix:
[[648  26]
 [ 55  59]]

XGBoost Accuracy: 0.88

XGBoost Confusion Matrix:
[[639  35]
 [ 59  55]]

SVM Accuracy: 0.89

SVM Confusion Matrix:
[[646  28]
 [ 60  54]]

Gauss Accuracy: 0.77

Gauss Confusion Matrix:
[[511 163]
 [ 19  95]]



Accuracy improved markedly when ambiguous cases were mapped to "not cancer" (0.89 versus 0.68 for Logistic Regression). Much of this gain likely reflects class imbalance rather than better discrimination: with this mapping, 674 of the 788 test samples (about 86%) are "not cancer," so a classifier that always predicts "not cancer" would already reach roughly 0.86 accuracy.

Such an imbalance biases the models toward the more prevalent class, which can compromise their ability to detect the less common "cancer" class.
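To make the effect of the imbalance concrete, the sketch below compares each Logistic Regression accuracy with the accuracy of always predicting the majority class. It uses only the confusion matrices printed above; the dictionary keys and the `results` name are illustrative.

```python
# Logistic Regression confusion matrices copied from the output above.
# Each row is a true class, so row sums give the test-set class counts.
confusions = {
    "ambiguous -> cancer": [[56, 204], [50, 478]],
    "ambiguous -> not cancer": [[646, 28], [60, 54]],
}

results = {}
for name, matrix in confusions.items():
    class_counts = [sum(row) for row in matrix]
    total = sum(class_counts)
    baseline = max(class_counts) / total              # always predict the majority class
    accuracy = (matrix[0][0] + matrix[1][1]) / total  # diagonal = correct predictions
    results[name] = (baseline, accuracy)
    print(f"{name}: baseline {baseline:.3f}, model {accuracy:.3f}")

# ambiguous -> cancer: baseline 0.670, model 0.678
# ambiguous -> not cancer: baseline 0.855, model 0.888
```

In both configurations the model is only a few points above the majority-class baseline, which is why accuracy alone is a weak indicator here.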

Taking Logistic Regression, one of the best-performing models (accuracy 0.89, close to Random Forest's 0.90), we obtained the following confusion matrix:

\[
\begin{bmatrix}
646 & 28 \\
60 & 54 \\
\end{bmatrix}
\]

This means that the model correctly identified 646 of the 674 "not cancer" instances (~95.8%), showing strong performance in recognizing non-cancer cases. The remaining 28 instances (~4.15% of the "not cancer" test samples) were incorrectly classified as "cancer."

It also correctly identified 54 of the 114 cancer cases (~47.4%), indicating only limited effectiveness in recognizing true cancer instances. This low true positive rate may reflect the predominance of the "not cancer" class in the dataset, which could be causing the model to lean toward "not cancer" predictions.

The remaining 60 "cancer" cases (~52.6%) were misclassified as "not cancer." This is a significant observation, as failing to detect actual cancer cases could have serious clinical implications. The high count of false negatives suggests that the model is not sensitive enough to cancer cases in this configuration.
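The figures discussed above can be derived directly from the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels, with "cancer" as the positive class) and computes sensitivity, specificity, precision, and balanced accuracy.

```python
# Logistic Regression confusion matrix (ambiguous mapped to "not cancer").
# Layout assumed: rows = true labels, columns = predicted labels.
tn, fp = 646, 28   # true "not cancer": correctly rejected / false alarms
fn, tp = 60, 54    # true "cancer": missed / correctly detected

sensitivity = tp / (tp + fn)   # recall on cancer cases
specificity = tn / (tn + fp)   # recall on non-cancer cases
precision = tp / (tp + fp)     # share of "cancer" predictions that are correct
balanced_accuracy = (sensitivity + specificity) / 2

print(f"Sensitivity:       {sensitivity:.3f}")        # 0.474
print(f"Specificity:       {specificity:.3f}")        # 0.958
print(f"Precision:         {precision:.3f}")          # 0.659
print(f"Balanced accuracy: {balanced_accuracy:.3f}")  # 0.716
```

Balanced accuracy (0.716) is far lower than the plain accuracy of 0.89, which reflects the weak sensitivity on the minority class.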

Let's try running a **class-weighted Random Forest Classifier** (`class_weight='balanced'`):

In [163]:
warnings.filterwarnings('ignore')

model_evaluation(X_cancer,y_cancer,True)
print("------------------------------------------")
model_evaluation(X_not,y_not,True)

Random Forest Accuracy: 0.66

Random Forest Confusion Matrix:
[[ 56 204]
 [ 66 462]]

------------------------------------------
Random Forest Accuracy: 0.89

Random Forest Confusion Matrix:
[[650  24]
 [ 66  48]]



## Optimization Possibilities

If the following are not enough:
 - treating ambiguous cases as not_cancer versus treating them as cancer
 - SMOTE for imbalanced classes
 - class weights for imbalanced classes (possibly evaluated with a weighted or balanced accuracy)

then try:
 - feature scaling for the SVM
 - encoding features into intervals (binning) for the Random Forest
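Two of the ideas listed above, feature scaling for the SVM and class weights for imbalanced classes, can be combined in a single scikit-learn pipeline. The sketch below is illustrative only: it runs on synthetic imbalanced data from `make_classification` rather than on the LIDC-IDRI features, and the variable names are assumptions. SMOTE is not shown because it lives in the separate `imbalanced-learn` package.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data with roughly 85% / 15% class balance (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42,
)

# The scaler is fit on the training data only, then applied before the SVM;
# class_weight='balanced' penalizes errors on the minority class more heavily.
model = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))
model.fit(X_train, y_train)
predictions = model.predict(X_test)

balanced_acc = balanced_accuracy_score(y_test, predictions)
minority_recall = recall_score(y_test, predictions, pos_label=1)

print(f"Balanced accuracy: {balanced_acc:.2f}")
print(f"Minority-class recall: {minority_recall:.2f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which also makes the pipeline safe to pass to cross-validation.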