# vertispine 3C

## Machine learning classifier that predicts one of 3 spinal conditions (normal or one of two abnormalities) from 6 biomechanical features derived from X-ray images, deployed at https://vertispine.vercel.app/

### Created by Jan Drmota for BOA x Stryker Hackathon 2024

This version (vertispine3C) predicts one of three classes: normal, hernia or spondylolisthesis.

### Instructions
Please select "Run > Run All Cells" from the top menubar. You may need to restart the kernel after installing the libraries by going to "Kernel > Restart Kernel" in the top menubar, and then re-run all cells with "Run > Run All Cells".

### Machine learning algorithm

The data was examined and the k-nearest neighbors (kNN) machine learning algorithm (MLA) was selected due to the nature of the data and task.
The task is classification, and kNN is a widely used MLA for it because it is non-parametric: it makes no assumption about the statistical distribution of the data.
Because all six features are numeric measurements, the distance between data points is meaningful, which makes kNN a relevant choice.
There are downsides to this algorithm, including slower prediction (as computation is deferred to the prediction stage) and reduced performance in high-dimensional data spaces, but I believe that for the task at hand, given the size and nature of the data, it is the most appropriate choice.
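
To make the idea of distance-based voting concrete, here is a minimal from-scratch sketch of kNN. It is not the implementation used in this notebook (which relies on scikit-learn's KNeighborsClassifier); the function name knn_predict and the toy data are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy, made-up data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.1])))  # prints A
print(knn_predict(X_toy, y_toy, np.array([5.0, 5.1])))   # prints B
```

Note that all the work happens inside knn_predict at query time, which is why prediction is the slow stage for kNN.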

### Downloading dataset
The dataset was downloaded from [UC Irvine's Machine Learning Repository](https://archive.ics.uci.edu/dataset/212/vertebral+column) as outlined in the instructions and stored in a data folder in the root directory of the project:
> ./data/column_3C_weka.arff

### Installing libraries

Python 3 and pip3 were used to create this notebook. The following commands install the necessary libraries (I was using Anaconda, which already comes with all of them):

> pip3 install numpy<br>
> pip3 install pandas<br>
> pip3 install scipy<br>
> pip3 install scikit-learn<br>
> pip3 install matplotlib<br>

Alternatively, install them from the requirements.txt file (**you may need to restart the kernel by going to "Kernel > Restart Kernel" in the top menubar and then re-run all cells with "Run > Run All Cells"**):

In [None]:
#jupyter nbconvert --to script --execute --stdout vertispine3C.ipynb | python

In [385]:
try:
    get_ipython()  # Check if running in a Jupyter notebook, if yes run the following command to install libraries:
    %pip install -r requirements.txt
except NameError:
    # This checks if the notebook is run from the command line with nbconvert, 
    # then the modules will have to be installed with the command pip install -r requirements.txt in the command line
    print("Not in IPython or Jupyter, if you get an error that libraries/modules not found then install packages with the command: pip install -r requirements.txt")

Note: you may need to restart the kernel to use updated packages.


### Importing libraries

In [333]:
import os  # needed by load_dataset() to check a user-supplied path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib
from scipy.io import arff
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from joblib import Parallel, delayed

### Importing data from .arff file
We read the downloaded file from the data folder with arff from SciPy's input/output module scipy.io. arff.loadarff returns a (data, meta) tuple, so we pass only the first element (the data) into our pandas dataframe. The following function first tries the default location and, if the file is not found there, asks you to enter where the file is.

In [335]:
default_dataset_path = "./data/column_3C_weka.arff"

def load_dataset():
    try:
        # Attempt to load the dataset from the default path
        print(f"Trying to load dataset from default path: {default_dataset_path}")
        arff_file = arff.loadarff(default_dataset_path)
    except FileNotFoundError:
        # If the default path is invalid, ask the user for an alternative path
        user_path = input("Default dataset path not found. Please enter the path to your dataset: ")
        if not os.path.exists(user_path):
            raise FileNotFoundError(f"Dataset not found at {user_path}. Please check the path and try again.")
        arff_file = arff.loadarff(user_path)
    return arff_file

# Load the dataset
try:
    arff_data = load_dataset()
    df = pd.DataFrame(arff_data[0])
    print("Dataset loaded successfully!")
except FileNotFoundError as e:
    print(e)

Trying to load dataset from default path: ./data/column_3C_weka.arff
Dataset loaded successfully!
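
As a small illustration of the (data, meta) tuple described above, the sketch below parses a tiny made-up ARFF document held in memory instead of the real file. The attribute values and the names arff_text and toy_df are assumptions for the example only.

```python
import io
import pandas as pd
from scipy.io import arff

# Tiny made-up ARFF file held in memory (not the real dataset)
arff_text = """@relation toy
@attribute pelvic_tilt numeric
@attribute class {Hernia,Normal}
@data
22.5,Hernia
13.6,Normal
"""

# loadarff accepts a file name or a file-like object and returns (data, meta)
data, meta = arff.loadarff(io.StringIO(arff_text))
toy_df = pd.DataFrame(data)

print(meta.names())        # attribute names read from the header
print(toy_df["class"][0])  # nominal values come back as bytes, e.g. b'Hernia'
```

The byte-string labels seen here are the reason the class column needs a type conversion before modelling.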


### Data shape
We first examine what the shape of the stored data is. From the UC Irvine repository we expect 310 instances and 7 variables in our imported dataframe. We confirm this below:

In [337]:
print(df.shape)

(310, 7)


### Data structure
We can confirm the expected variables and observe the data by printing our dataframe:

In [339]:
df

| | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis | class |
|---|---|---|---|---|---|---|---|
| 0 | 63.027817 | 22.552586 | 39.609117 | 40.475232 | 98.672917 | -0.254400 | b'Hernia' |
| 1 | 39.056951 | 10.060991 | 25.015378 | 28.995960 | 114.405425 | 4.564259 | b'Hernia' |
| 2 | 68.832021 | 22.218482 | 50.092194 | 46.613539 | 105.985135 | -3.530317 | b'Hernia' |
| 3 | 69.297008 | 24.652878 | 44.311238 | 44.644130 | 101.868495 | 11.211523 | b'Hernia' |
| 4 | 49.712859 | 9.652075 | 28.317406 | 40.060784 | 108.168725 | 7.918501 | b'Hernia' |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 305 | 47.903565 | 13.616688 | 36.000000 | 34.286877 | 117.449062 | -4.245395 | b'Normal' |
| 306 | 53.936748 | 20.721496 | 29.220534 | 33.215251 | 114.365845 | -0.421010 | b'Normal' |
| 307 | 61.446597 | 22.694968 | 46.170347 | 38.751628 | 125.670725 | -2.707880 | b'Normal' |
| 308 | 45.252792 | 8.693157 | 41.583126 | 36.559635 | 118.545842 | 0.214750 | b'Normal' |


### Missing data
We confirm there is no missing data in our dataframe by running isna(), which converts the dataframe into Booleans (True, i.e. 1, where a value is missing and False, i.e. 0, where it is present), and then summing each column:

In [341]:
print(df.isna().sum())

pelvic_incidence            0
pelvic_tilt                 0
lumbar_lordosis_angle       0
sacral_slope                0
pelvic_radius               0
degree_spondylolisthesis    0
class                       0
dtype: int64


There is no missing data that needs to be imputed, so we can continue with building our model.

### Assigning MLA variables
We assign to the variable X our independent input features (all columns except the last, "class") and to the variable y our dependent target (the "class" column) from our dataframe.
The ARFF loader stores the class labels as byte strings (shown as b'Hernia' in the dataframe above), so the column is converted to pandas' string type to obtain plain text labels.

In [344]:
X = df.drop(columns=["class"])
df["class"] = df["class"].astype("string")
y = df["class"]
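
How astype("string") treats byte values may depend on the pandas version, which is an assumption worth checking in your own environment. As a hedged alternative, the sketch below decodes byte-string labels explicitly with the .str.decode accessor. The series raw_labels is made-up data in the same form the ARFF loader produces.

```python
import pandas as pd

# Made-up labels in the same byte-string form that the ARFF loader returns
raw_labels = pd.Series([b"Hernia", b"Normal", b"Spondylolisthesis"], name="class")

# Decode each bytes value into ordinary text
decoded_labels = raw_labels.str.decode("utf-8")

print(decoded_labels.tolist())
```

If the predicted labels ever show up with a b'...' prefix, an explicit decode like this is one way to remove it.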

### Scaling data
Because kNN is distance-based, scaling (standardising) the data is important so that differences in scale between the independent features do not affect our results. We use scikit-learn's StandardScaler from its preprocessing module for this.

In [346]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
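
Standardisation replaces each value x with z = (x - mean) / std, computed per feature. The sketch below, on made-up numbers, checks that a manual z-score matches what StandardScaler produces. The array X_demo is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements on very different scales (two features, four samples)
X_demo = np.array([[40.0, 0.5],
                   [60.0, 1.5],
                   [50.0, 1.0],
                   [70.0, 2.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))  # prints True
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.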

### Assigning train/test split
Next we prepare our training and testing subsets from the scaled X and y variables. We use the conventional 80/20 training/testing split and the commonly used seed of 42 for reproducible randomness. We also want the target classes to be distributed in the same proportions in the training and testing subsets, which is why the split is stratified on y (our target "class"). The train_test_split function from the scikit-learn library is used.

In [348]:
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(X_scaled, y, test_size=0.20, random_state=42, stratify=y)
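
To show what stratify does, this sketch splits a made-up, imbalanced label array and counts the classes in each subset. The 60/30/10 class mix and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 60 / 30 / 10 across three classes
y_demo = np.array(["A"] * 60 + ["B"] * 30 + ["C"] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Count each class in the training and testing subsets
for label in ["A", "B", "C"]:
    print(label, int((y_tr == label).sum()), int((y_te == label).sum()))
```

Both subsets keep the original 6:3:1 class ratio, which matters when one class is much rarer than the others.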

### Cross validation parameters
Here we create a dictionary of candidate parameter values for our kNN that the cross-validation will test. We will try between 2 and 20 neighbours.

In [350]:
parameter = {
    'n_neighbors': np.arange(2, 21, 1),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'algorithm': ['auto']
}
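
The number of combinations the grid search will try is the product of the list lengths. The sketch below counts them with scikit-learn's ParameterGrid, using a copy of the grid defined above under the illustrative name grid. Note that scikit-learn's 'minkowski' metric with the default p=2 is the same distance as 'euclidean', so those two options are expected to score identically.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
grid = {
    'n_neighbors': np.arange(2, 21, 1),                  # 19 values: 2..20
    'weights': ['uniform', 'distance'],                  # 2 values
    'metric': ['euclidean', 'manhattan', 'minkowski'],   # 3 values
    'algorithm': ['auto'],                               # 1 value
}

n_candidates = len(ParameterGrid(grid))
print(n_candidates)  # 19 * 2 * 3 * 1 = 114
```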

### Stratified K-Fold cross validation
We use stratified K-fold cross-validation with grid search rather than relying only on a conventional train/test split. A single train/test split may not be representative, as it is only one run and can give an overly optimistic or pessimistic estimate by chance. Although we may get a lower accuracy from cross-validation, it is a more reliable estimate of performance because the model is evaluated several times on different parts of the data rather than just once; in our case we use 10 folds. Across the folds, every sample in the training set is used for validation exactly once, rather than only a single fixed held-out portion.

In [352]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
knn = KNeighborsClassifier()
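
The following sketch illustrates how a stratified 10-fold split behaves, using made-up labels rather than the real training set. The names y_demo, skf_demo and fold_counts are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 50 samples of class 0, 30 of class 1, 20 of class 2
y_demo = np.array([0] * 50 + [1] * 30 + [2] * 20)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

validated = []
fold_counts = []
for train_idx, val_idx in skf_demo.split(X_demo, y_demo):
    validated.extend(val_idx.tolist())
    # Class counts inside this validation fold
    fold_counts.append(np.bincount(y_demo[val_idx], minlength=3).tolist())

print(fold_counts[0])
# Every sample index appears in exactly one validation fold
print(sorted(validated) == list(range(len(y_demo))))  # prints True
```

Each validation fold mirrors the overall class proportions, and together the folds cover every sample once.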

### Hyperparameter tuning with grid search
We then cross-validate each candidate kNN configuration on the training set to see which one yields the best results.

In [354]:
knn_cv = GridSearchCV(knn, param_grid=parameter, cv=skf, verbose=1)
knn_cv.fit(X_train_scaled, y_train)
print("Best Parameters:", knn_cv.best_params_)

Fitting 10 folds for each of 114 candidates, totalling 1140 fits
Best Parameters: {'algorithm': 'auto', 'metric': 'euclidean', 'n_neighbors': 13, 'weights': 'distance'}
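
In the cells above, the scaler is fitted on the full dataset before the split and the grid search. A common variation, sketched here as an assumption rather than as this notebook's method, places the scaler inside the pipeline passed to GridSearchCV so that it is re-fitted on the training portion of each fold. The synthetic data from make_classification, the reduced grid and the names pipe, small_grid and search are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the vertebral column dataset)
X_syn, y_syn = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42
)

# Scaler and classifier are tuned together, so the scaler is re-fitted
# on the training part of every fold
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step-prefixed parameter names address the 'knn' step of the pipeline
small_grid = {"knn__n_neighbors": [3, 5, 7], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(
    pipe, param_grid=small_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)
search.fit(X_syn, y_syn)

print(search.best_params_)
```

With this layout, search.best_estimator_ is already a fitted scaler-plus-kNN pipeline that could be saved directly.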


### Evaluate best kNN on test set
The final accuracy is measured on the held-out test set, which was not used during the grid search.

In [356]:
best_knn = knn_cv.best_estimator_
best_knn.fit(X_train_scaled, y_train)
y_pred = best_knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred) * 100
print("Final Accuracy on Test Set: {:.2f}%".format(accuracy))

Final Accuracy on Test Set: 79.03%
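
Accuracy is the number of correct predictions divided by the total number of predictions. With 310 rows and a 20% test split, the test set should hold 62 samples, so 79.03% would correspond to 49 correct predictions. Accuracy alone does not show which classes get confused; a confusion matrix is one way to see that. The sketch below uses made-up labels for six hypothetical patients, not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["Hernia", "Normal", "Spondylolisthesis"]

# Made-up true and predicted labels for six hypothetical patients
y_true_demo = ["Normal", "Normal", "Hernia",
               "Spondylolisthesis", "Spondylolisthesis", "Hernia"]
y_pred_demo = ["Normal", "Hernia", "Hernia",
               "Spondylolisthesis", "Spondylolisthesis", "Normal"]

# Accuracy = correct predictions / all predictions (4 of 6 here)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)
print("{:.2f}%".format(acc_demo * 100))  # prints 66.67%

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm_demo)
```

In this toy example all errors are Hernia/Normal mix-ups, which the single accuracy figure would hide.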


### Creating a pipeline for our best kNN
Lastly, we combine the scaler and our best kNN into a pipeline that we can save and use on real-life data.

In [358]:
pipeline = Pipeline([
    ('scaler', scaler),
    ('knn', best_knn)
])

### Saving our pipeline into a file

In [360]:
joblib.dump(pipeline, 'vertispine3CMLPipeline.pkl')

['vertispine3CMLPipeline.pkl']
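
As a self-contained illustration of the save and load round trip, the sketch below fits a small scaler-plus-kNN pipeline on made-up data, writes it to a temporary directory with joblib and checks that the restored pipeline gives the same predictions. The data, the file name demo_pipeline.pkl and the variable names are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: two clusters on very different feature scales
X_demo = np.array([[1.0, 100.0], [1.2, 110.0], [0.9, 95.0],
                   [3.0, 300.0], [3.1, 310.0], [2.9, 290.0]])
y_demo = np.array(["low", "low", "low", "high", "high", "high"])

demo_pipeline = Pipeline([("scaler", StandardScaler()),
                          ("knn", KNeighborsClassifier(n_neighbors=3))])
demo_pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")  # hypothetical file name
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

new_points = np.array([[1.1, 105.0], [3.0, 305.0]])
print(restored.predict(new_points))  # prints ['low' 'high']
```

Because the scaler travels with the classifier, new data can be passed in raw, unscaled form.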

### Loading our saved pipeline file with best kNN and using it on real-world data
Here we load the saved pipeline from the file and run a test on 4 selected samples (as if they were data from real life). The expected classes are Normal, Normal, Hernia and Spondylolisthesis; they are for illustrative purposes only. The pipeline scales the data and classifies it using the best_knn we saved.
This is deployed in the [vertispine web-app at https://vertispine.vercel.app/](https://vertispine.vercel.app/)

In [362]:
loaded_pipeline = joblib.load('vertispine3CMLPipeline.pkl')

d_two = {'pelvic_incidence': [45.252792, 33.841641, 74.433593, 70.952728], 'pelvic_tilt': [8.693157, 5.073991, 41.557331, 20.159931], 'lumbar_lordosis_angle': [41.583126, 36.641233, 27.700000, 62.859109], 'sacral_slope': [36.559635, 28.767649, 32.876262, 50.792797], 'pelvic_radius': [118.545842, 123.945244, 107.949304, 116.177932], 'degree_spondylolisthesis': [0.214750, -0.199249, 5.000089, 32.522331]}

df_two = pd.DataFrame(data=d_two)

two_prediction = loaded_pipeline.predict(df_two)

print(two_prediction)

['Normal' 'Normal' 'Hernia' 'Spondylolisthesis']
