# Supported Vector Machine

This code utilizes a Support Vector Machine (SVM) for classification of data extracted from the LIDC-IDRI dataset.

The `.csv` file employed in this version contains a **clean and analyzed** dataset derived from the raw data using the `pylidc`, `pyradiomics`, and deep feature extraction methods.

The relevant methods can be found in the **csv_cleanup** folder.

## Importing libraries and Datasets

We will begin by importing the relevant and necessary libraries.

In [None]:
# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import warnings
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold


Next, we will convert the three datasets into pandas DataFrames for further processing.

In [85]:
df = pd.read_csv('clean_rfe50.csv')
df.head()

Unnamed: 0,is_cancer,resnet3d_feature_33,resnet3d_feature_41,resnet3d_feature_52,resnet3d_feature_63,resnet3d_feature_64,resnet3d_feature_66,resnet3d_feature_67,resnet3d_feature_68,resnet3d_feature_72,...,original_firstorder_Median,original_firstorder_Minimum,original_firstorder_RobustMeanAbsoluteDeviation,original_firstorder_Skewness,original_firstorder_Variance,original_gldm_DependenceNonUniformityNormalized,original_glrlm_GrayLevelNonUniformity,original_glrlm_LongRunEmphasis,original_glrlm_RunLengthNonUniformity,original_glrlm_ShortRunEmphasis
0,2,0.339816,0.562524,0.442979,0.730204,0.333658,0.491633,0.398934,0.413691,0.399483,...,0.364344,0.011082,0.001734,0.648394,0.005157,0.21098,0.87881,0.15793,0.324954,0.496851
1,2,0.526288,0.736178,0.486639,0.796184,0.340574,0.547593,0.475688,0.532664,0.69038,...,0.161537,0.034874,0.0,0.359984,0.000511,0.465737,0.890961,0.364065,0.24092,0.253756
2,0,0.616598,0.428909,0.104794,0.619367,0.502572,0.125043,0.291206,0.484636,0.376556,...,0.224852,0.000876,0.0,0.610007,0.002463,0.448261,0.874003,0.314858,0.249673,0.428741
3,2,0.395332,0.431717,0.251262,0.73569,0.667631,0.282189,0.237252,0.245101,0.460938,...,0.368496,0.00453,0.004242,0.659629,0.006736,0.313686,0.804823,0.218568,0.269539,0.537671
4,1,0.432535,0.491978,0.446434,0.571172,0.847354,0.442363,0.294933,0.210586,0.340075,...,0.570872,0.018355,0.16319,0.800094,0.029328,0.247511,0.736251,0.118588,0.323211,0.432529


After that, we need to split the features and target variable. We will also be using 10-fold cross validation for every model.

In [None]:
X = df.drop(columns=['is_cancer'])  
y = df['is_cancer']  

In [105]:
def model_evaluation(X, y, imbalanced=False):
    # Define models
    models = {
        #'Logistic Regression': LogisticRegression(max_iter=700),
        'Random Forest': RandomForestClassifier(class_weight='balanced' if imbalanced else None),
        #'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'),
        'SVM': SVC(kernel='rbf'),
        'Gaussian Naive Bayes': GaussianNB()
    }
    
    # Define 10-fold cross-validation
    skf = StratifiedKFold(n_splits=10, shuffle=True)
    
    # Train and evaluate models with 10-fold cross-validation
    for name, model in models.items():
        accuracies = []
        misclass_percentages = []  # List to store misclassification percentages for each class

        for train_index, test_index in skf.split(X, y):
            # Use iloc to select by position
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
            
            # Apply SMOTE only if imbalanced
            if imbalanced:
                smote = SMOTE()
                X_train, y_train = smote.fit_resample(X_train, y_train)
            
            # Fit the model and predict
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            
            # Calculate accuracy
            accuracy = accuracy_score(y_test, predictions)
            accuracies.append(accuracy)
            
            # Calculate misclassification percentage for each class
            confusion = confusion_matrix(y_test, predictions)
            class_errors = []
            
            for i, class_total in enumerate(confusion.sum(axis=1)):
                if class_total > 0:
                    misclassified = class_total - confusion[i, i]  # Misclassified instances for this class
                    misclass_percentage = (misclassified / class_total) * 100
                else:
                    misclass_percentage = 0.0
                class_errors.append(misclass_percentage)
            
            misclass_percentages.append(class_errors)
        
        # Calculate mean accuracy and mean misclassification percentages per class
        mean_accuracy = np.mean(accuracies)
        mean_misclass_percentages = np.mean(misclass_percentages, axis=0)
        
        print(f"{name} Mean Accuracy: {mean_accuracy:.2f}")
        for i, error in enumerate(mean_misclass_percentages):
            print(f"Class {i} Mean Misclassification Percentage: {error:.2f}%")
        print()


In [106]:
print("For reference:\n Class 0: not cancer\n Class 1: ambiguous\n Class 2: cancer\n")

print("Without SMOTE:")
model_evaluation(X,y,True)

print("\n-------------\n")

print("With SMOTE:")
model_evaluation(X,y,True)

For reference:
 Class 0: not cancer
 Class 1: ambiguous
 Class 2: cancer

Without SMOTE:
Random Forest Mean Accuracy: 0.56
Class 0 Mean Misclassification Percentage: 50.87%
Class 1 Mean Misclassification Percentage: 42.94%
Class 2 Mean Misclassification Percentage: 33.12%

SVM Mean Accuracy: 0.53
Class 0 Mean Misclassification Percentage: 37.66%
Class 1 Mean Misclassification Percentage: 57.99%
Class 2 Mean Misclassification Percentage: 24.53%

Gaussian Naive Bayes Mean Accuracy: 0.48
Class 0 Mean Misclassification Percentage: 40.78%
Class 1 Mean Misclassification Percentage: 68.66%
Class 2 Mean Misclassification Percentage: 17.08%


-------------

With SMOTE:
Random Forest Mean Accuracy: 0.56
Class 0 Mean Misclassification Percentage: 51.45%
Class 1 Mean Misclassification Percentage: 42.23%
Class 2 Mean Misclassification Percentage: 31.15%

SVM Mean Accuracy: 0.54
Class 0 Mean Misclassification Percentage: 37.18%
Class 1 Mean Misclassification Percentage: 57.35%
Class 2 Mean Misclassi

Because our accuracy was **absolutely devious**, we decided to transform the dataset into a binary classification problem.

In [107]:
# Removes ambiguous option
df_rem = df.copy()
before = df_rem.shape[0] # n_linhas antes

mask = df_rem['is_cancer'] == 1
df_rem = df_rem[~mask]
after = df_rem.shape[0] # n_linhas depois

df_rem['is_cancer'] = df_rem['is_cancer'].replace(2, 1)
print(f"Row size changed from {before} to {after} (lost {before-after} rows).")

# Mostrar diferença no número de linhas quando removemos casos ambíguos

# Maps ambiguous values to cancer or not cancer
df_amb_cancer = df.copy()
df_amb_not_cancer = df.copy()

# Mapeamento necessário para trees (expected 0 and 1)
df_amb_cancer['is_cancer'] = df_amb_cancer['is_cancer'].replace(2, 1)
df_amb_not_cancer['is_cancer'] = df_amb_not_cancer['is_cancer'].replace({1: 0, 2: 1})

Row size changed from 2626 to 1238 (lost 1388 rows).


In [108]:
df_rem.head()

Unnamed: 0,is_cancer,resnet3d_feature_33,resnet3d_feature_41,resnet3d_feature_52,resnet3d_feature_63,resnet3d_feature_64,resnet3d_feature_66,resnet3d_feature_67,resnet3d_feature_68,resnet3d_feature_72,...,original_firstorder_Median,original_firstorder_Minimum,original_firstorder_RobustMeanAbsoluteDeviation,original_firstorder_Skewness,original_firstorder_Variance,original_gldm_DependenceNonUniformityNormalized,original_glrlm_GrayLevelNonUniformity,original_glrlm_LongRunEmphasis,original_glrlm_RunLengthNonUniformity,original_glrlm_ShortRunEmphasis
0,1,0.339816,0.562524,0.442979,0.730204,0.333658,0.491633,0.398934,0.413691,0.399483,...,0.364344,0.011082,0.001734,0.648394,0.005157,0.21098,0.87881,0.15793,0.324954,0.496851
1,1,0.526288,0.736178,0.486639,0.796184,0.340574,0.547593,0.475688,0.532664,0.69038,...,0.161537,0.034874,0.0,0.359984,0.000511,0.465737,0.890961,0.364065,0.24092,0.253756
2,0,0.616598,0.428909,0.104794,0.619367,0.502572,0.125043,0.291206,0.484636,0.376556,...,0.224852,0.000876,0.0,0.610007,0.002463,0.448261,0.874003,0.314858,0.249673,0.428741
3,1,0.395332,0.431717,0.251262,0.73569,0.667631,0.282189,0.237252,0.245101,0.460938,...,0.368496,0.00453,0.004242,0.659629,0.006736,0.313686,0.804823,0.218568,0.269539,0.537671
6,0,0.275632,0.454981,0.361014,0.542304,0.659543,0.325105,0.451855,0.178754,0.195805,...,0.453218,0.00832,0.354365,0.870931,0.072326,0.308885,0.574462,0.183526,0.196528,0.378695


In [109]:
df_amb_cancer.head()

Unnamed: 0,is_cancer,resnet3d_feature_33,resnet3d_feature_41,resnet3d_feature_52,resnet3d_feature_63,resnet3d_feature_64,resnet3d_feature_66,resnet3d_feature_67,resnet3d_feature_68,resnet3d_feature_72,...,original_firstorder_Median,original_firstorder_Minimum,original_firstorder_RobustMeanAbsoluteDeviation,original_firstorder_Skewness,original_firstorder_Variance,original_gldm_DependenceNonUniformityNormalized,original_glrlm_GrayLevelNonUniformity,original_glrlm_LongRunEmphasis,original_glrlm_RunLengthNonUniformity,original_glrlm_ShortRunEmphasis
0,1,0.339816,0.562524,0.442979,0.730204,0.333658,0.491633,0.398934,0.413691,0.399483,...,0.364344,0.011082,0.001734,0.648394,0.005157,0.21098,0.87881,0.15793,0.324954,0.496851
1,1,0.526288,0.736178,0.486639,0.796184,0.340574,0.547593,0.475688,0.532664,0.69038,...,0.161537,0.034874,0.0,0.359984,0.000511,0.465737,0.890961,0.364065,0.24092,0.253756
2,0,0.616598,0.428909,0.104794,0.619367,0.502572,0.125043,0.291206,0.484636,0.376556,...,0.224852,0.000876,0.0,0.610007,0.002463,0.448261,0.874003,0.314858,0.249673,0.428741
3,1,0.395332,0.431717,0.251262,0.73569,0.667631,0.282189,0.237252,0.245101,0.460938,...,0.368496,0.00453,0.004242,0.659629,0.006736,0.313686,0.804823,0.218568,0.269539,0.537671
4,1,0.432535,0.491978,0.446434,0.571172,0.847354,0.442363,0.294933,0.210586,0.340075,...,0.570872,0.018355,0.16319,0.800094,0.029328,0.247511,0.736251,0.118588,0.323211,0.432529


In [110]:
df_amb_not_cancer.head()

Unnamed: 0,is_cancer,resnet3d_feature_33,resnet3d_feature_41,resnet3d_feature_52,resnet3d_feature_63,resnet3d_feature_64,resnet3d_feature_66,resnet3d_feature_67,resnet3d_feature_68,resnet3d_feature_72,...,original_firstorder_Median,original_firstorder_Minimum,original_firstorder_RobustMeanAbsoluteDeviation,original_firstorder_Skewness,original_firstorder_Variance,original_gldm_DependenceNonUniformityNormalized,original_glrlm_GrayLevelNonUniformity,original_glrlm_LongRunEmphasis,original_glrlm_RunLengthNonUniformity,original_glrlm_ShortRunEmphasis
0,1,0.339816,0.562524,0.442979,0.730204,0.333658,0.491633,0.398934,0.413691,0.399483,...,0.364344,0.011082,0.001734,0.648394,0.005157,0.21098,0.87881,0.15793,0.324954,0.496851
1,1,0.526288,0.736178,0.486639,0.796184,0.340574,0.547593,0.475688,0.532664,0.69038,...,0.161537,0.034874,0.0,0.359984,0.000511,0.465737,0.890961,0.364065,0.24092,0.253756
2,0,0.616598,0.428909,0.104794,0.619367,0.502572,0.125043,0.291206,0.484636,0.376556,...,0.224852,0.000876,0.0,0.610007,0.002463,0.448261,0.874003,0.314858,0.249673,0.428741
3,1,0.395332,0.431717,0.251262,0.73569,0.667631,0.282189,0.237252,0.245101,0.460938,...,0.368496,0.00453,0.004242,0.659629,0.006736,0.313686,0.804823,0.218568,0.269539,0.537671
4,0,0.432535,0.491978,0.446434,0.571172,0.847354,0.442363,0.294933,0.210586,0.340075,...,0.570872,0.018355,0.16319,0.800094,0.029328,0.247511,0.736251,0.118588,0.323211,0.432529


Let's try again:

In [111]:
X_cancer = df_amb_cancer.drop(columns=['is_cancer'])  
y_cancer = df_amb_cancer['is_cancer']  

X_not = df_amb_not_cancer.drop(columns=['is_cancer'])  
y_not = df_amb_not_cancer['is_cancer'] 

X_bin = df_rem.drop(columns=['is_cancer'])  
y_bin = df_rem['is_cancer']  

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

In [112]:
warnings.filterwarnings('ignore')

# SMOTE applied in all of them due to class shifts
print("For reference:\n Class 0: not cancer\n Class 1: ambiguous + cancer\n")

model_evaluation(X_cancer,y_cancer,True)
print("------------------------------------------")

print("For reference:\n Class 0: not cancer + ambiguous\n Class 1: cancer\n")

model_evaluation(X_not,y_not,True)
print("------------------------------------------")

print("For reference:\n Class 0: not cancer\n Class 1: cancer\n")
model_evaluation(X_bin,y_bin,True)

For reference:
 Class 0: not cancer
 Class 1: ambiguous + cancer

Random Forest Mean Accuracy: 0.65
Class 0 Mean Misclassification Percentage: 47.05%
Class 1 Mean Misclassification Percentage: 28.42%

SVM Mean Accuracy: 0.62
Class 0 Mean Misclassification Percentage: 29.21%
Class 1 Mean Misclassification Percentage: 42.26%

Gaussian Naive Bayes Mean Accuracy: 0.61
Class 0 Mean Misclassification Percentage: 29.32%
Class 1 Mean Misclassification Percentage: 44.07%

------------------------------------------
For reference:
 Class 0: not cancer + ambiguous
 Class 1: cancer

Random Forest Mean Accuracy: 0.88
Class 0 Mean Misclassification Percentage: 8.62%
Class 1 Mean Misclassification Percentage: 35.26%

SVM Mean Accuracy: 0.85
Class 0 Mean Misclassification Percentage: 13.28%
Class 1 Mean Misclassification Percentage: 23.75%

Gaussian Naive Bayes Mean Accuracy: 0.76
Class 0 Mean Misclassification Percentage: 25.59%
Class 1 Mean Misclassification Percentage: 15.46%

----------------------

The accuracy of the model significantly improved when ambiguous cases were classified as "not cancer." This improvement may be due to an imbalance in class distributions, where "not cancer" and "ambiguous" (and their combination) cases constitute the largest portion of the data. 

Such an imbalance could lead to overfitting, with the model favoring predictions for the more prevalent classes, potentially compromising the model’s generalizability to less common classes.

The tester function includes several imbalanced learning adjustments; however, none of these methods significantly improved cancer prediction when ambiguous cases were classified as cancer, as was our objective.

Our reasoning is that if there is uncertainty regarding a cancer diagnosis—and even the AI model is uncertain—it is in the patient’s best interest to recommend further testing, rather than assuming they are cancer-free based solely on achieving higher accuracy.

### Summary of Findings

Removing ambiguity from the dataset improves cancer prediction performance. Adding ambiguity and applying SMOTE did not enhance results, often leading to a 15% error rate across classes. We recommend excluding ambiguous cases, as this approach has proven effective. Cross-validation and optional SMOTE application are implemented, allowing for the addition of other models for comparison.

---

Taking into account the best model, Logistic Regression, we obtained the following confusion matrix:

\[
\begin{bmatrix}
646 & 28 \\
60 & 54 \\
\end{bmatrix}
\]

This means that the model correctly identified 646 out of 647 instances as "not cancer," showing strong performance in recognizing non-cancer cases (incorrectly classified 28 instances, ~4.33% of all available data for not_cancer).

It also correctly identified 54 out of 114 cancer cases, indicating some effectiveness in recognizing true cancer instances. However, the lower true positive rate may reflect an imbalance due to the predominant class ("not cancer") in the dataset, which could be causing the model to lean toward "not cancer" predictions.

We conclude that 60 cases of "cancer" were misclassified as "not cancer." This is a significant observation, as failing to detect actual cancer cases could have serious clinical implications. The higher count of false negatives suggests that the model may not be sensitive enough to cancer cases in this configuration.

---

## Optimization Possibilities

if:
 - treat ambiguous as not_cancer vs ambiguous as cancer
 - SMOTE for unbalanced classes
 - weights for unbalanced classes (maybe using weighted accuracy?)

is not enough, then try:
 - scaling for svm
 - encoding into intervals for random forest