# Support Vector Machine

# TODO: create clean CSVs for testing later

# Base Idea: [link to study](https://pubs.aip.org/aip/acp/article-abstract/2655/1/020103/2888254/Classification-of-normal-and-nodule-lung-images?redirectedFrom=fulltext)

This code utilizes a Support Vector Machine (SVM) for classification of data extracted from the LIDC-IDRI dataset.

The `.csv` file used in this version contains a **cleaned and analyzed** dataset derived from the raw data with `pylidc`, `pyradiomics`, and deep feature extraction methods.

The relevant methods can be found in the **csv_cleanup** folder.

## Importing libraries and Datasets

We begin by importing the required libraries.

In [13]:
# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

Next, we load the dataset into a pandas DataFrame for further processing.

In [14]:
df = pd.read_csv('2d_semiclean.csv')
df.head()

Unnamed: 0,patient_nodule_count,is_cancer,radiomics_original_shape_Elongation,radiomics_original_shape_Flatness,radiomics_original_shape_LeastAxisLength,radiomics_original_shape_MajorAxisLength,radiomics_original_shape_Maximum2DDiameterColumn,radiomics_original_shape_Maximum2DDiameterRow,radiomics_original_shape_Maximum2DDiameterSlice,radiomics_original_shape_Maximum3DDiameter,...,radiomics_original_glrlm_RunVariance,radiomics_original_glrlm_ShortRunEmphasis,radiomics_original_glszm_GrayLevelNonUniformity,radiomics_original_glszm_LargeAreaEmphasis,radiomics_original_glszm_SizeZoneNonUniformity,radiomics_original_glszm_SizeZoneNonUniformityNormalized,radiomics_original_glszm_SmallAreaEmphasis,radiomics_original_glszm_ZoneEntropy,radiomics_original_glszm_ZonePercentage,radiomics_original_glszm_ZoneVariance
0,1,2,0.972773,0.858091,19.588162,22.827597,31.038346,32.075141,26.977005,34.406945,...,30.375465,0.190547,1.0,29463184.0,1.0,1.0,3.394066e-08,-3.203427e-16,0.000184,0.0
1,1,2,0.795595,0.747403,21.632902,28.944112,36.659875,30.9453,37.724947,40.384819,...,51.76955,0.211422,1.0,203119504.0,1.0,1.0,4.92321e-09,-3.203427e-16,7e-05,0.0
2,1,0,0.768615,0.549141,14.561133,26.5162,26.873199,31.855156,25.552264,32.630357,...,16.420147,0.232529,1.0,6461764.0,1.0,1.0,1.547565e-07,-3.203427e-16,0.000393,0.0
3,2,2,0.836843,0.690356,16.647824,24.114835,26.959597,30.922607,24.750701,30.922607,...,16.833843,0.14264,1.0,10504081.0,1.0,1.0,9.520109e-08,-3.203427e-16,0.000309,0.0
4,3,2,0.66754,0.638922,7.316637,11.451529,11.145619,12.548361,12.934109,13.26797,...,2.783696,0.287526,1.0,68121.0,1.0,1.0,1.467976e-05,-3.203427e-16,0.003831,0.0


In [15]:
# One-hot encoding: bin each float column into equal-width intervals,
# with one 0/1 indicator column per interval
df_encoded = df.copy()

# Select only columns that have floating point values
float_columns = df_encoded.select_dtypes(include='float').columns

# Define function to apply one-hot encoding for different intervals
def one_hot_encode_intervals(column, intervals):
    # Create an empty DataFrame to store one-hot encoded columns
    encoded_df = pd.DataFrame()

    # For each interval, create a new column with 1 if the value falls within the interval, else 0
    for i, (low, high) in enumerate(intervals):
        encoded_df[f'{column}_interval_{i}'] = df_encoded[column].apply(lambda x: 1 if low <= x < high else 0)
    
    return encoded_df

# Iterate over each floating point column and apply one-hot encoding based on intervals
for column in float_columns:
    # Define intervals for each floating point column (you can customize this based on the data)
    min_val = df_encoded[column].min()
    max_val = df_encoded[column].max()
    
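    # Example: Divide into 4 equal-width intervals (you can adjust as needed)
    # Note: each interval is half-open [low, high), so a value equal to max_val
    # falls outside the last interval and gets 0 in all four columns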
    intervals = [(min_val, min_val + (max_val - min_val) * 0.25),
                 (min_val + (max_val - min_val) * 0.25, min_val + (max_val - min_val) * 0.5),
                 (min_val + (max_val - min_val) * 0.5, min_val + (max_val - min_val) * 0.75),
                 (min_val + (max_val - min_val) * 0.75, max_val)]
    
    # Apply one-hot encoding for this column
    encoded_intervals_df = one_hot_encode_intervals(column, intervals)
    
    # Drop the original floating-point column and concatenate the new one-hot encoded columns
    df_encoded = df_encoded.drop(columns=[column]).join(encoded_intervals_df)

# The df_encoded now contains one-hot encoded columns for the floating point columns.
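
The same binning can be sketched more compactly with `pd.cut` and `pd.get_dummies`. This is an illustrative alternative, not the code that produced the results in this notebook: the helper name `bin_and_one_hot` and the toy DataFrame are assumptions made for the example. With `right=False` the bins are left-closed, like the loop above, but `pd.cut` widens the last edge slightly so the column maximum is assigned to the final bin.

```python
import pandas as pd


def bin_and_one_hot(frame, n_bins=4):
    """Replace each float column with n_bins equal-width indicator columns.

    Hypothetical helper, for illustration only.
    """
    float_cols = frame.select_dtypes(include="float").columns
    pieces = [frame.drop(columns=float_cols)]
    for col in float_cols:
        # Left-closed equal-width bins; pd.cut extends the last edge a little
        # so the maximum value is assigned to the final bin.
        binned = pd.cut(frame[col], bins=n_bins, labels=list(range(n_bins)), right=False)
        # One indicator column per bin, named '<col>_interval_<i>'
        dummies = pd.get_dummies(binned, prefix=f"{col}_interval").astype(int)
        pieces.append(dummies)
    return pd.concat(pieces, axis=1)


# Toy data (assumed): one integer column that is left alone, one float column
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5], "x": [0.0, 1.0, 2.0, 3.0, 4.0]})
encoded = bin_and_one_hot(toy)
print(encoded)
```

Because the bin labels are categorical, every interval gets a column even when no value falls into it, which keeps the column layout stable across datasets.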


In [16]:
# Step 2: Extract features (X) and labels (y)
# df holds the feature columns plus the label column 'is_cancer'
# Note: 'Unnamed: 0' (apparently a row index saved in the CSV) stays in X as a feature
X = df.drop(columns=['is_cancer'])  # Drop the label column to get features
y = df['is_cancer']  # Target variable (lung nodule classification)

# Same split for the interval-encoded DataFrame
X_enc = df_encoded.drop(columns=['is_cancer'])  # Drop the label column to get features
y_enc = df_encoded['is_cancer']  # Target variable (lung nodule classification)

In [17]:
# Step 3: Data preprocessing (scaling)
# SVMs are sensitive to feature scale, so standardize to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on the original features, then transform them

# Note: this refits the same scaler object, now on the encoded features
X_scaled_enc = scaler.fit_transform(X_enc)
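
Fitting the scaler on the full dataset before cross-validation lets the test rows of each fold influence the scaling statistics. One common way to avoid this is to wrap the scaler and the classifier in a scikit-learn pipeline, so the scaler is refit on the training part of every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the LIDC-IDRI features; the data and the `*_demo` names are assumptions, and this is not the procedure that produced the scores reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumed), not the real feature table
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is fit only on the training rows of each fold
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale"))
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=42)

demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=kfold_demo, scoring="accuracy")
print(demo_scores.mean(), demo_scores.std())
```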

In [18]:

# Step 4: Setting up the SVM model
# We will use a basic SVM with an RBF kernel (a common default for non-linear decision boundaries)
# C controls the regularization strength; gamma='scale' derives the kernel width from the data
svm_model = SVC(kernel='rbf', C=1, gamma='scale')
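
For reference, scikit-learn documents `gamma='scale'` as `1 / (n_features * X.var())`, computed from the data passed to `fit`. The sketch below checks this on synthetic data by comparing a model that uses `gamma='scale'` with one that receives the manually computed value. The data and the names `gamma_manual`, `svm_scale`, and `svm_manual` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumed)
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

# Documented definition of gamma='scale'
gamma_manual = 1.0 / (X_demo.shape[1] * X_demo.var())

svm_scale = SVC(kernel="rbf", C=1, gamma="scale").fit(X_demo, y_demo)
svm_manual = SVC(kernel="rbf", C=1, gamma=gamma_manual).fit(X_demo, y_demo)

# Both models should produce the same decision values
same = np.allclose(svm_scale.decision_function(X_demo),
                   svm_manual.decision_function(X_demo))
print(gamma_manual, same)
```

On standardized features the overall variance is close to 1, so `gamma='scale'` is then roughly `1 / n_features`.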


In [19]:

# Step 5: Performing 10-fold cross-validation
# Define KFold with 10 splits
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
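
`KFold` splits the rows without looking at the labels, so class proportions can drift from fold to fold. `StratifiedKFold` is a commonly used alternative for classification that keeps them roughly constant. The sketch below compares the two on a hypothetical imbalanced label vector; the data and the helper `positive_rate_per_fold` are assumptions, and the scores in this notebook were obtained with plain `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))  # Feature values do not affect the split


def positive_rate_per_fold(splitter):
    """Fraction of class-1 samples in each test fold."""
    return [y_demo[test_idx].mean() for _, test_idx in splitter.split(X_demo, y_demo)]


plain = positive_rate_per_fold(KFold(n_splits=10, shuffle=True, random_state=42))
stratified = positive_rate_per_fold(StratifiedKFold(n_splits=10, shuffle=True, random_state=42))

print("KFold:          ", np.round(plain, 2))
print("StratifiedKFold:", np.round(stratified, 2))
```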

In [20]:
# Cross-validation on the original (scaled) features: one accuracy score per fold
cv_scores = cross_val_score(svm_model, X_scaled, y, cv=kfold, scoring='accuracy')

# Output the results
print(f"Cross-validation accuracy scores for each fold: {cv_scores}")
print(f"Mean accuracy: {np.mean(cv_scores)}")
print(f"Standard deviation of accuracy: {np.std(cv_scores)}")

print('-------------------------------------------------------')


# Cross-validation on the interval-encoded (scaled) features, using the same folds
cv_scores = cross_val_score(svm_model, X_scaled_enc, y_enc, cv=kfold, scoring='accuracy')

# Output the results
print(f"Cross-validation accuracy scores for each fold: {cv_scores}")
print(f"Mean accuracy: {np.mean(cv_scores)}")
print(f"Standard deviation of accuracy: {np.std(cv_scores)}")

Cross-validation accuracy scores for each fold: [0.69433962 0.69811321 0.63257576 0.68939394 0.65909091 0.70454545
 0.6780303  0.70454545 0.70454545 0.68560606]
Mean accuracy: 0.6850786163522014
Standard deviation of accuracy: 0.022120194873270032
-------------------------------------------------------
Cross-validation accuracy scores for each fold: [0.66037736 0.68679245 0.625      0.68560606 0.63257576 0.68560606
 0.64772727 0.67045455 0.68181818 0.67045455]
Mean accuracy: 0.6646412235563177
Standard deviation of accuracy: 0.021531943834780666
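
Both runs reuse the same `kfold` object with a fixed `random_state`, so fold *i* contains the same rows in each run and the scores can be compared fold by fold. The sketch below copies the printed scores into arrays and reports the per-fold difference. The names `raw_scores`, `encoded_scores`, and `diff` are introduced here for illustration.

```python
import numpy as np

# Scores copied from the output above
raw_scores = np.array([0.69433962, 0.69811321, 0.63257576, 0.68939394, 0.65909091,
                       0.70454545, 0.6780303, 0.70454545, 0.70454545, 0.68560606])
encoded_scores = np.array([0.66037736, 0.68679245, 0.625, 0.68560606, 0.63257576,
                           0.68560606, 0.64772727, 0.67045455, 0.68181818, 0.67045455])

# Positive values mean the original features scored higher in that fold
diff = raw_scores - encoded_scores
print(f"Mean difference (original - encoded): {diff.mean():.4f}")
print(f"Folds where the original features score higher: {(diff > 0).sum()} of {len(diff)}")
```

With these scores, the original features come out ahead in every fold, by about 0.02 accuracy on average. The difference is small relative to the fold-to-fold spread, so it is best read as an indication rather than a firm conclusion.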
