<hr>

## TABLE OF CONTENTS

1. [Packages](#1.-PACKAGES)

2. [Datasets](#2.-DATASETS)

3. [Analysis & Visualization](#3.-ANALYSIS-&-VISUALIZATION)

4. [Preprocessing & Pipeline](#4.-PREPROCESSING-&-PIPELINE)

5. [Model Prediction & Evaluation](#5.-MODEL-PREDICTION-&-EVALUATION)

6. [Model Comparison](#6.-MODEL-COMPARISON)

7. [Conclusion](#7.-CONCLUSION)

<hr>

## 1. PACKAGES

### Essential Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Machine Learning Libraries

In [None]:
# data pipeline:
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# preprocessing:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# model selection:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# models:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# metrics:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

<hr>

## 2. DATASETS

### Read CSV Files

In [None]:
df_2C = pd.read_csv('datasets/orthopedic_2C.csv')
df_3C = pd.read_csv('datasets/orthopedic_3C.csv')

In [None]:
print(df_2C.shape)
print(df_3C.shape)

In [None]:
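# The two DataFrames should share the same six feature columns and values:
if df_2C.iloc[:, :6].equals(df_3C.iloc[:, :6]):
    # Display the first 10 rows and the 6 feature columns of df_2C
    display(df_2C.iloc[:10, :6])
else:
    # Make a mismatch visible instead of silently showing nothing
    print('Feature columns of df_2C and df_3C differ.')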

In [None]:
print(df_2C['class'].unique())
print(df_3C['class'].unique())

<hr>

## 3. ANALYSIS & VISUALIZATION

### Class Distribution

#### df_2C

In [None]:
distribution_2C = pd.DataFrame({
    'Class': df_2C['class'].value_counts().index,
    'Count': df_2C['class'].value_counts().values,
    'Percentage (%)': round(df_2C['class'].value_counts(normalize=True)*100,2)
}).reindex(['Normal', 'Abnormal'])

distribution_2C = distribution_2C.reset_index(drop=True)

display(distribution_2C.style.hide(axis="index"))

#### df_3C

In [None]:
distribution_3C = pd.DataFrame({
    'Class': df_3C['class'].value_counts().index,
    'Count': df_3C['class'].value_counts().values,
    'Percentage (%)': round(df_3C['class'].value_counts(normalize=True)*100,2)
}).reindex(['Normal', 'Hernia', 'Spondylolisthesis'])

distribution_3C = distribution_3C.reset_index(drop=True)

display(distribution_3C.style.hide(axis="index"))

### Missing Values

In [None]:
display(df_2C.isnull().sum())
display(df_3C.isnull().sum())

<hr>

## 4. PREPROCESSING & PIPELINE

### Encoding

* OneHotEncoder for df_2C's target variable y (with 2 classes)

* LabelEncoder for df_3C's target variable y (with 3 classes)

In [None]:
# Initialize Encoder
oh_encoder = OneHotEncoder(sparse_output=False)
lb_encoder = LabelEncoder()

#### df_2C

In [None]:
# Make a copy to preserve the original data
df_2C_encoded = df_2C.copy()

# Apply OneHot encoder to 'class' and transform to an array
encoded_array_2C = oh_encoder.fit_transform(df_2C_encoded[['class']])
# Create a dataframe from the encoded array
encoded_df_2C = pd.DataFrame(encoded_array_2C, columns=oh_encoder.get_feature_names_out(['class']))
# Concatenate the encoded dataframe with the original dataframe
df_2C_encoded = pd.concat([df_2C_encoded, encoded_df_2C], axis=1)
# Drop the original 'class' column and the redundant 'class_Normal' column
df_2C_encoded.drop(['class','class_Normal'], axis=1, inplace=True)

# Value Counts:
display(df_2C_encoded['class_Abnormal'].value_counts().sort_index(ascending=False))

display(df_2C_encoded.head())

#### df_3C

In [None]:
# Make a copy to preserve the original data
df_3C_encoded = df_3C.copy()

# Custom classes: 'Normal'=0, 'Hernia'=1, 'Spondylolisthesis'=2
custom_classes = ['Normal', 'Hernia', 'Spondylolisthesis']
# Force the custom order by setting classes_ directly (this bypasses fit)
lb_encoder.classes_ = np.array(custom_classes)
# Transform using the custom order
df_3C_encoded['class'] = lb_encoder.transform(df_3C_encoded['class'])

# Value Counts:
display(df_3C_encoded['class'].value_counts().sort_index(ascending=False))

display(df_3C_encoded.head())
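
A fitted `LabelEncoder` orders its classes alphabetically, which is why the cell above sets `classes_` by hand. As a possible alternative, the same custom order can be written as an explicit dictionary and applied with pandas `Series.map`. The sketch below runs on a small made-up series; `toy_classes`, `class_order`, and `class_to_code` are illustrative names, not part of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only (not the real dataset)
toy_classes = pd.Series(['Hernia', 'Normal', 'Spondylolisthesis', 'Normal'])

# A fitted LabelEncoder stores its classes in sorted (alphabetical) order
default_encoder = LabelEncoder().fit(toy_classes)
default_order = list(default_encoder.classes_)

# Explicit mapping that enforces the custom order: Normal=0, Hernia=1, Spondylolisthesis=2
class_order = ['Normal', 'Hernia', 'Spondylolisthesis']
class_to_code = {name: code for code, name in enumerate(class_order)}
toy_encoded = toy_classes.map(class_to_code)

print(default_order)
print(toy_encoded.tolist())
```

With a dictionary, any label missing from the mapping becomes `NaN` rather than raising an error, so it is worth checking the result for missing values.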

### Scaling

- As checked in Section 2, df_2C and df_3C share the same six features; they differ only in the target column. <br>
We therefore scale the features from df_2C only and use them to train all models.

In [None]:
# Initialize scaler:
numerical_transformer = StandardScaler()

In [None]:
numerical_cols = df_2C_encoded.columns[0:6].tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
    ],
    remainder='passthrough'
)
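
To illustrate what this preprocessor does, here is a minimal sketch on made-up data: `StandardScaler` inside a `ColumnTransformer` rescales each listed column to zero mean and unit variance. The frame `toy_X` and its column names are assumptions for the example, not the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (illustration only)
toy_X = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0],
    'feature_b': [10.0, 20.0, 30.0, 40.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['feature_a', 'feature_b'])],
    remainder='passthrough'
)

# fit_transform learns each column's mean and standard deviation, then rescales
toy_scaled = toy_preprocessor.fit_transform(toy_X)

print(toy_scaled.mean(axis=0))  # per-column means after scaling
print(toy_scaled.std(axis=0))   # per-column standard deviations after scaling
```

Inside a `Pipeline`, the scaling statistics are learned from the data passed to `fit` and reused unchanged in `predict`, so the test set does not influence them.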

### Pipeline

#### Four pipelines
- DecisionTree Model
    - df_2C
    - df_3C
- Naive Bayes Model
    - df_2C
    - df_3C

In [None]:
# Classifier objects:
dt_classifier = DecisionTreeClassifier(random_state=42)
nb_classifier = GaussianNB()

##### DecisionTree model pipeline for df_2C

In [None]:
dt_pipeline_2C = Pipeline([
    # clone so this pipeline fits its own preprocessor instead of a shared instance
    ('preprocessor', clone(preprocessor)),
    ('classifier', clone(dt_classifier))
])

##### DecisionTree model pipeline for df_3C

In [None]:
dt_pipeline_3C = Pipeline([
    # clone so this pipeline fits its own preprocessor instead of a shared instance
    ('preprocessor', clone(preprocessor)),
    ('classifier', clone(dt_classifier))
])

##### Naive Bayes model pipeline for df_2C

In [None]:
nb_pipeline_2C = Pipeline([
    # clone so this pipeline fits its own preprocessor instead of a shared instance
    ('preprocessor', clone(preprocessor)),
    ('classifier', clone(nb_classifier))
])

##### Naive Bayes model pipeline for df_3C

In [None]:
nb_pipeline_3C = Pipeline([
    # clone so this pipeline fits its own preprocessor instead of a shared instance
    ('preprocessor', clone(preprocessor)),
    ('classifier', clone(nb_classifier))
])
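
A note on `clone`: with caching disabled (the default), `Pipeline` stores the step objects it is given rather than copying them, so an estimator instance passed to several pipelines is shared between them. The sketch below, on made-up data, shows the effect for a scaler; `X_a`, `X_b`, and the `pipe_*` names are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: two training sets with very different feature means
X_a = np.array([[0.0], [1.0], [2.0], [3.0]])
X_b = X_a + 100.0
y = np.array([0, 0, 1, 1])

shared_scaler = StandardScaler()

# Both pipelines hold the very same scaler object
pipe_a = Pipeline([('scaler', shared_scaler), ('classifier', GaussianNB())])
pipe_b = Pipeline([('scaler', shared_scaler), ('classifier', GaussianNB())])
pipe_a.fit(X_a, y)
pipe_b.fit(X_b, y)  # refits the shared scaler on X_b
shared_mean_a = pipe_a.named_steps['scaler'].mean_[0]

# With clone, each pipeline gets its own unfitted copy
pipe_c = Pipeline([('scaler', clone(shared_scaler)), ('classifier', GaussianNB())])
pipe_d = Pipeline([('scaler', clone(shared_scaler)), ('classifier', GaussianNB())])
pipe_c.fit(X_a, y)
pipe_d.fit(X_b, y)
cloned_mean_c = pipe_c.named_steps['scaler'].mean_[0]

print(shared_mean_a)  # mean seen by pipe_a's scaler after pipe_b was fitted
print(cloned_mean_c)  # mean seen by pipe_c's own scaler
```

Predicting immediately after each `fit` hides this effect; it only surfaces when an earlier pipeline is reused after a later one has been fitted.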

### Splitting

In [None]:
X = df_2C_encoded[['pelvic_incidence',
 'pelvic_tilt',
 'lumbar_lordosis_angle',
 'sacral_slope',
 'pelvic_radius',
 'degree_spondylolisthesis']]

y_2C = df_2C_encoded['class_Abnormal']
y_3C = df_3C_encoded['class']

X_train_2C, X_test_2C, y_train_2C, y_test_2C = train_test_split(X, y_2C, test_size=0.2, random_state=28)
print(X_train_2C.shape, y_train_2C.shape, X_test_2C.shape, y_test_2C.shape)

X_train_3C, X_test_3C, y_train_3C, y_test_3C = train_test_split(X, y_3C, test_size=0.2, random_state=82)
print(X_train_3C.shape, y_train_3C.shape, X_test_3C.shape, y_test_3C.shape)
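
If the class proportions should be the same in the train and test sets, `train_test_split` accepts a `stratify` argument. This is an optional variation, not what the cell above does; the sketch uses made-up labels (`toy_X`, `toy_y`) purely to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 5 samples of class 0 and 15 of class 1
toy_X = np.arange(20).reshape(-1, 1)
toy_y = np.array([0] * 5 + [1] * 15)

# stratify=toy_y keeps the 1:3 class ratio in both splits
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(np.bincount(toy_y_train))  # class counts in the training split
print(np.bincount(toy_y_test))   # class counts in the test split
```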

<hr>

## 5. MODEL PREDICTION & EVALUATION

### DecisionTree Model

#### df_2C

In [None]:
dt_pipeline_2C.fit(X_train_2C, y_train_2C)
y_pred_dt_2C = dt_pipeline_2C.predict(X_test_2C)
accuracy_score(y_test_2C, y_pred_dt_2C)

#### df_3C

In [None]:
dt_pipeline_3C.fit(X_train_3C, y_train_3C)
y_pred_dt_3C = dt_pipeline_3C.predict(X_test_3C)
accuracy_score(y_test_3C, y_pred_dt_3C)

### Gaussian Naive Bayes Model

#### df_2C

In [None]:
nb_pipeline_2C.fit(X_train_2C, y_train_2C)
y_pred_nb_2C = nb_pipeline_2C.predict(X_test_2C)
accuracy_score(y_test_2C, y_pred_nb_2C)

#### df_3C

In [None]:
nb_pipeline_3C.fit(X_train_3C, y_train_3C)
y_pred_nb_3C = nb_pipeline_3C.predict(X_test_3C)
accuracy_score(y_test_3C, y_pred_nb_3C)
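
Accuracy alone can hide how a model behaves on each class. The metrics imported in Section 1 (`confusion_matrix`, `classification_report`, `f1_score`) give a per-class view. The sketch below applies them to made-up labels and predictions (`toy_y_true`, `toy_y_pred`); it does not report results for the models above.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Made-up labels and predictions for illustration only (0 = Normal, 1 = Abnormal)
toy_y_true = np.array([0, 0, 1, 1, 1, 1])
toy_y_pred = np.array([0, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_y_true, toy_y_pred)

# F1 score for the positive class (label 1)
toy_f1 = f1_score(toy_y_true, toy_y_pred)

print(toy_cm)
print(classification_report(toy_y_true, toy_y_pred, target_names=['Normal', 'Abnormal']))
```

The same calls could be applied to each pair of test labels and predictions above, for example `y_test_2C` and `y_pred_dt_2C`.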

<hr>

## 6. MODEL COMPARISON

<hr>

## 7. CONCLUSION