# Performance-level analysis

Having shown in the previous phase of the study that rotations can influence feature values, and more markedly so for wavelet-based features, we now ask whether this influence translates into model performance.

Therefore, in this part of the study, we train models on $R_0$ features selected from either 1) wavelet decomposition (WD) based features only, $X_{WD}$, or 2) non-WD-based features only, $X_{nWD}$. Note that:
$$
\begin{align*}
X_{WD} \cap X_{nWD} &= \emptyset \\ 
X_{WD} \cup X_{nWD} &= X
\end{align*}
$$
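
The partition above can be checked mechanically. The following is a minimal sketch on a toy feature index, assuming features are keyed by an `(ImagingFilter, Category, Name)` MultiIndex and that WD features are exactly those whose filter name starts with `wavelet-`; the toy feature rows are illustrative only, not taken from the dataset.

```python
import pandas as pd

# Toy feature index mimicking the (ImagingFilter, Category, Name) layout.
index = pd.MultiIndex.from_tuples(
    [
        ("original", "firstorder", "Energy"),
        ("wavelet-HHH", "firstorder", "Energy"),
        ("wavelet-LLL", "glcm", "Contrast"),
        ("logarithm", "ngtdm", "Busyness"),
    ],
    names=["ImagingFilter", "Category", "Name"],
)

# Assumption: a feature is WD-based iff its filter name starts with "wavelet-".
is_wd = index.get_level_values("ImagingFilter").str.startswith("wavelet-")
idx_wd = index[is_wd]
idx_nwd = index[~is_wd]

# The two groups are disjoint and together cover every feature.
assert set(idx_wd).isdisjoint(set(idx_nwd))
assert set(idx_wd) | set(idx_nwd) == set(index)
print(len(idx_wd), len(idx_nwd))
```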

For the data at hand, the NSCLC lesions were classified into four subtypes: large cell (LCC), adenocarcinoma (ACC), squamous cell carcinoma (SCC), and not otherwise specified (nos). Some cases have no histology information available and are labelled NA. In this study, we exclude patients labelled nos or NA. See the manuscript for more information.

## Study design

![Study design](./img/study_design.png)

## Reference

Manuscript in preparation:

>    *Decoding the Rotation Effect: A Retrospective Study on Lesion Orientation and Wavelet Decomposition in Radiomics*

# Usage

This notebook runs on the features extracted using the script provided. The features are stored as an HDF data store in `../data/RadFeatures_raw_fine_res.h5`. For the data structure, please see [`1_features_analysis.ipynb`](./1_feature_analysis.ipynb).

For more details, please see [README](./README.md)

## Radiomics pipeline


In [129]:
import pandas as pd
import scipy 
import numpy as np
import SimpleITK as sitk
import os, sys
import mri_radiomics_toolkit as mradtk
import itertools
import sklearn.utils as skutils
import sklearn.linear_model as linear_model
import tqdm.auto
import joblib
import warnings
from IPython.display import *
from pathlib import Path
from mnts.mnts_logger import MNTSLogger
from typing import Tuple, Union, Iterable, Any, Optional
from tqdm import auto
from sklearn import pipeline
from functools import partial
np.warnings = warnings # workaround: restore `np.warnings` for code that still references it


# define for easy usage
mdprint = lambda x: display(Markdown(x))

# add package to path without installing it
sys.path.append(Path(".").absolute().__str__())

# List out the version of the packages required
np_ver = np.__version__
sp_ver = scipy.__version__
sitk_ver = sitk.__version__
pd_ver = pd.__version__
display({
    'Numpy version': np_ver, 
    'Scipy version': sp_ver, 
    'SITK version': sitk_ver,
    'Pandas version': pd_ver
})

'''Configurations'''
MNTSLogger.set_global_log_level('debug')
os.chdir('/media/storage/Source/Repos/wavelet_analysis/src')
src_root = Path(".")
resource_root = src_root / "resources"
data_root = Path("../data")
idregpat = r"^([\w\d]+-\d+)" # For pairing the IDs (raw string so the regex escapes are kept as-is)
clinical = data_root.joinpath("NSCLC-clinical-Oct 2019.csv")
clinical_df = pd.read_csv(clinical, index_col=0)
radiomics_data_path = data_root / "RadFeatures_raw_fine_res.h5"

# exclude list
exclude_pids = [
    'LUNG1-128', # Cannot load
    'LUNG1-246', # Cannot load
]

clinical_df.drop(exclude_pids, inplace=True)
display(clinical_df)

{'Numpy version': '1.24.3',
 'Scipy version': '1.10.1',
 'SITK version': '2.2.1',
 'Pandas version': '2.0.1'}

[2024-02-24 04:38:09,358-INFO] (global) Setting log level info -> debug
[2024-02-24 04:38:09,359-INFO] (preliminary_feature_filtering) Setting log level info -> debug
[2024-02-24 04:38:09,359-INFO] (ANOVA) Setting log level info -> debug
[2024-02-24 04:38:09,360-INFO] (sup-featselect) Setting log level info -> debug
[2024-02-24 04:38:09,360-INFO] (model-building) Setting log level info -> debug


| PatientID | age | clinical.T.Stage | Clinical.N.Stage | Clinical.M.Stage | Overall.Stage | Histology | gender | Survival.time | deadstatus.event |
|---|---|---|---|---|---|---|---|---|---|
| LUNG1-001 | 78.7515 | 2.0 | 3 | 0 | IIIb | large cell | male | 2165 | 1 |
| LUNG1-002 | 83.8001 | 2.0 | 0 | 0 | I | squamous cell carcinoma | male | 155 | 1 |
| LUNG1-003 | 68.1807 | 2.0 | 3 | 0 | IIIb | large cell | male | 256 | 1 |
| LUNG1-004 | 70.8802 | 2.0 | 1 | 0 | II | squamous cell carcinoma | male | 141 | 1 |
| LUNG1-005 | 80.4819 | 4.0 | 2 | 0 | IIIb | squamous cell carcinoma | male | 353 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| LUNG1-418 | 53.6712 | 2.0 | 0 | 0 | I | adenocarcinoma | male | 346 | 1 |
| LUNG1-419 | 66.5096 | 4.0 | 1 | 0 | IIIb | squamous cell carcinoma | male | 2772 | 0 |
| LUNG1-420 | 73.3808 | 2.0 | 1 | 0 | II | squamous cell carcinoma | male | 2429 | 1 |
| LUNG1-421 | 61.7041 | 2.0 | 2 | 0 | IIIa | squamous cell carcinoma | female | 369 | 1 |


# Patient characteristics

In [130]:
mdprint("## For all patients")
mdprint("### Summary characteristics")
display(clinical_df.describe())

# summarize stage information
stages_df = pd.concat([clinical_df['clinical.T.Stage'].value_counts(),
                       clinical_df['Clinical.N.Stage'].value_counts(),
                       clinical_df['Clinical.M.Stage'].value_counts()], axis=1)
stages_df.columns = ['T', 'N', 'M']
stages_df.index = stages_df.index.astype('int')
stages_df.sort_index(inplace=True)

mdprint("### Stages distribution")
display(stages_df.fillna(0))

# summarize histology
mdprint("## For patients with histology")
mdprint("### Summary of histology")
hist_df = clinical_df['Histology'].value_counts(dropna=False)
display(hist_df.to_frame())

# get the list of patients with histology
clinical_df_whist = clinical_df.loc[clinical_df['Histology'] != 'nos'].copy()['Histology'].to_frame()
clinical_df_whist.dropna(inplace=True)
mdprint("### Summary characteristics")
display(clinical_df_whist['Histology'].value_counts(dropna=False))


## For all patients

### Summary characteristics

|  | age | clinical.T.Stage | Clinical.N.Stage | Clinical.M.Stage | Survival.time | deadstatus.event |
|---|---|---|---|---|---|---|
| count | 398.0 | 419.0 | 420.0 | 420.0 | 420.0 | 420.0 |
| mean | 68.094625 | 2.47494 | 1.352381 | 0.030952 | 984.385714 | 0.888095 |
| std | 10.058821 | 1.1329 | 1.218246 | 0.295542 | 1030.134266 | 0.315625 |
| min | 33.6849 | 1.0 | 0.0 | 0.0 | 10.0 | 0.0 |
| 25% | 61.3012 | 2.0 | 0.0 | 0.0 | 260.25 | 1.0 |
| 50% | 68.7137 | 2.0 | 2.0 | 0.0 | 545.5 | 1.0 |
| 75% | 75.887025 | 4.0 | 2.0 | 0.0 | 1393.0 | 1.0 |
| max | 91.7043 | 5.0 | 4.0 | 3.0 | 4454.0 | 1.0 |


### Stages distribution

| Stage | T | N | M |
|---|---|---|---|
| 0 | 0.0 | 170.0 | 415.0 |
| 1 | 93.0 | 22.0 | 1.0 |
| 2 | 155.0 | 141.0 | 0.0 |
| 3 | 52.0 | 84.0 | 4.0 |
| 4 | 117.0 | 3.0 | 0.0 |
| 5 | 2.0 | 0.0 | 0.0 |


## For patients with histology

### Summary of histology

| Histology | count |
|---|---|
| squamous cell carcinoma | 152 |
| large cell | 114 |
| nos | 62 |
| adenocarcinoma | 51 |
| NaN | 41 |


### Summary characteristics

Histology
squamous cell carcinoma    152
large cell                 114
adenocarcinoma              51
Name: count, dtype: int64

# Load the radiomics features

In [131]:
from feature_robustness_analysis.io import standardize_df

# Load the features
features_df = pd.read_hdf(radiomics_data_path)
features_df = standardize_df(features_df.T).drop('diagnostics')

# Extract the data that has histology
X_all = features_df[clinical_df_whist.index]
display(X_all)

Unnamed: 0_level_0,Unnamed: 1_level_0,PID,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,...,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421
Unnamed: 0_level_1,Unnamed: 1_level_1,LessionCode,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,...,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Unnamed: 0_level_2,Unnamed: 1_level_2,Rotation,R00,R10,R15,R20,R25,R30,R35,R40,R45,R05,...,R40,R45,R05,R50,R55,R60,R65,R70,R75,R80
ImagingFilter,Category,Name,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3,Unnamed: 22_level_3,Unnamed: 23_level_3
original,shape,Elongation,0.727227,0.727191,0.726599,0.727014,0.727373,0.727265,0.72736,0.726967,0.727177,0.726972,...,0.776835,0.777889,0.776214,0.776608,0.775722,0.776907,0.778633,0.776914,0.779277,0.776845
original,shape,Flatness,0.545216,0.544209,0.544994,0.544926,0.545092,0.545003,0.545036,0.544964,0.545055,0.544997,...,0.64158,0.641207,0.638647,0.64071,0.64095,0.641733,0.642394,0.640914,0.642063,0.641025
original,shape,LeastAxisLength,45.743563,45.623419,45.692782,45.679115,45.680463,45.6729,45.666421,45.673603,45.674558,45.680083,...,32.729696,32.711779,32.624467,32.707135,32.71242,32.734819,32.734656,32.710618,32.705374,32.716974
original,shape,MajorAxisLength,83.899907,83.834408,83.840833,83.826295,83.803233,83.802969,83.786089,83.810238,83.798134,83.817069,...,51.014199,51.01595,51.08376,51.048268,51.037436,51.010049,50.957259,51.037448,50.937962,51.038495
original,shape,Maximum2DDiameterColumn,93.932927,91.57324,93.998896,88.915466,92.216668,90.272626,93.393283,81.266196,80.683213,93.831345,...,60.829732,49.110512,56.892624,49.89078,65.012957,58.218198,48.573555,54.90505,60.55475,60.931553
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
logarithm,ngtdm,Busyness,45.928779,45.625421,45.470045,44.903059,44.483675,45.160686,44.200094,45.096904,45.060095,45.08508,...,25.992902,84.892336,21.900443,31.159373,27.268277,26.907348,24.824137,31.296444,24.028008,85.112309
logarithm,ngtdm,Coarseness,0.000078,0.000078,0.000078,0.000079,0.00008,0.000079,0.00008,0.000079,0.000079,0.000079,...,0.000314,0.000232,0.000355,0.000381,0.0003,0.000372,0.000328,0.000375,0.000336,0.000229
logarithm,ngtdm,Complexity,45.426285,44.946281,44.629187,44.638569,44.209293,44.616998,43.964332,44.640065,44.531808,44.592335,...,14.883252,8.759975,11.874364,7.385874,15.791292,8.908219,13.773907,7.240072,13.051764,8.806948
logarithm,ngtdm,Contrast,0.011986,0.011613,0.011569,0.011789,0.011729,0.011723,0.011758,0.011683,0.011661,0.01171,...,0.025759,0.034133,0.017177,0.024017,0.027496,0.020257,0.023956,0.025214,0.022269,0.033644


## Handling patients with multiple segmentations

A few dozen patients have multiple segmentations. Some of them were already filtered out in phase 1 of the study, but a few remain. Because the original histology information provided by Aerts et al. does not specify the site of the biopsy or histology specimen, we cannot be sure that individually segmented, discontinuous masses in the same patient share the same histology. Therefore, we inspected each of these cases and excluded those with segmentations on both sides of the lung, which have the highest probability of differing in histology.
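
The filtering described above can be sketched on a toy table. This is an illustrative example only: it assumes columns keyed by `(PID, LessionCode, Rotation)`, that a lesion code of `'A'` or an empty string marks the lesion to keep, and it uses made-up patient IDs. A boolean mask is used so that IDs on the exclusion list that are absent from the table are simply ignored.

```python
import pandas as pd

# Toy feature table: one feature row, columns keyed by (PID, LessionCode, Rotation).
columns = pd.MultiIndex.from_tuples(
    [
        ("P-001", "", "R00"),   # single lesion, no lesion code
        ("P-002", "A", "R00"),  # first lesion of a multi-lesion patient
        ("P-002", "B", "R00"),  # additional lesion, to be dropped
        ("P-003", "A", "R00"),  # patient on the manual exclusion list
    ],
    names=["PID", "LessionCode", "Rotation"],
)
toy = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], index=["Energy"], columns=columns)

# Hypothetical exclusion list; "P-999" is not in the table on purpose.
exclude = ["P-003", "P-999"]

pids = toy.columns.get_level_values("PID")
codes = toy.columns.get_level_values("LessionCode")

# Keep only un-excluded patients and only the lesion coded '' or 'A'.
keep = codes.isin(["", "A"]) & ~pids.isin(exclude)
filtered = toy.loc[:, keep]
print(filtered.columns.get_level_values("PID").tolist())
```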

In [132]:

#! secondary drop list for removing human errors and uncertain histology
# exclude list
exclude_pids = [
    'LUNG1-019', # Erroneous segmentation
    'LUNG1-326', # both side tumors
    'LUNG1-353', # both side tumors
    'LUNG1-372', # both side tumors
    'LUNG1-399', # both side tumors
]


# * check if any of them are still in the dataset with histology available
curr_patients = X_all.columns.get_level_values(0).unique()
secondary_exclude = pd.Series({k: k in curr_patients for k in exclude_pids})
display(secondary_exclude.to_frame())
mdprint(f"> Note: {sum(secondary_exclude)} more patients were excluded.")

# * exclude all these cases if they are in the feature list
X_all.drop(exclude_pids, level=0, axis=1, inplace=True)

# * finally, only look at the largest lesion
volumes = X_all.T.query("LessionCode != 'A' and LessionCode != ''")
X_all = X_all.drop(volumes.index, axis=1)
display(X_all.columns.get_level_values(0).unique())

Unnamed: 0,0
LUNG1-019,False
LUNG1-326,True
LUNG1-353,False
LUNG1-372,True
LUNG1-399,True


> Note: 3 more patients were excluded.

Index(['LUNG1-001', 'LUNG1-002', 'LUNG1-003', 'LUNG1-004', 'LUNG1-005',
       'LUNG1-006', 'LUNG1-007', 'LUNG1-008', 'LUNG1-009', 'LUNG1-010',
       ...
       'LUNG1-412', 'LUNG1-413', 'LUNG1-414', 'LUNG1-415', 'LUNG1-416',
       'LUNG1-417', 'LUNG1-418', 'LUNG1-419', 'LUNG1-420', 'LUNG1-421'],
      dtype='object', name='PID', length=314)

# Radiomics Pipeline

## Segregating the data

We are looking at:
1. Only WD features ($X_{WD}$)
2. Only non-WD features ($X_{nWD}$)
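
As a small illustration of how the WD group can be identified, the sketch below enumerates the eight sub-band names produced by one level of a 3D wavelet decomposition (a high-pass `H` or low-pass `L` filter along each of the three axes). It assumes the `wavelet-XYZ` naming convention seen in the feature tables of this notebook.

```python
import itertools

# Each axis is filtered with either a high-pass (H) or a low-pass (L) filter,
# so one decomposition level in 3D yields 2**3 = 8 sub-bands.
subbands = ["".join(p) for p in itertools.product("HL", repeat=3)]
wd_names = [f"wavelet-{s}" for s in subbands]
print(wd_names)
```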


## Variable explanations

| Variable     | Description                                                                 |
|--------------|-----------------------------------------------------------------------------|
| `y`          | Ground-truth label of histology                                             |
| `y_onehot`   | One-hot version of `y`                                                      |
| `X_all`      | Radiomics features                                                          |
| `X_all_R0`   | Radiomics features of R0 samples only, meaning no rotations were applied.   |
| `X_WD`       | Radiomics features from WD filters only                                     |
| `X_nWD`      | Radiomics features from non-WD filters only, excludes shape features        |


In [133]:
# getting WD only features
wd_names = [f'wavelet-{"".join(x)}' for x in itertools.product(*([('H', 'L')] * 3))]
X_WD = X_all.loc[wd_names]
mdprint("### WD only features ($X_{WD}$)")
display(X_WD)

# getting non-WD only features
X_nWD = X_all.loc[features_df.index.get_level_values(0).difference(wd_names)]  # use the public Index API
X_nWD.drop('shape', axis=0, level=1, inplace=True) # Exclude shape features
mdprint("### non-WD only features ($X_{nWD}$)")
display(X_nWD)

mdprint("### Number of Cases")
display(len(X_all.columns.get_level_values(0).unique()))

### WD only features ($X_{WD}$)

Unnamed: 0_level_0,Unnamed: 1_level_0,PID,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,...,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421
Unnamed: 0_level_1,Unnamed: 1_level_1,LessionCode,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,...,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Unnamed: 0_level_2,Unnamed: 1_level_2,Rotation,R00,R10,R15,R20,R25,R30,R35,R40,R45,R05,...,R40,R45,R05,R50,R55,R60,R65,R70,R75,R80
ImagingFilter,Category,Name,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3,Unnamed: 22_level_3,Unnamed: 23_level_3
wavelet-HHH,firstorder,10Percentile,-0.006612,-0.006196,-0.013886,-0.006577,-0.01325,-0.008982,-0.013165,-0.012083,-0.012539,-0.008109,...,-0.008625,-0.00945,-0.004487,-0.00885,-0.008706,-0.007158,-0.008265,-0.007802,-0.004764,-0.006709
wavelet-HHH,firstorder,90Percentile,0.00661,0.006206,0.013969,0.00655,0.013243,0.008964,0.013173,0.012032,0.012501,0.008119,...,0.008604,0.009448,0.004355,0.008731,0.008656,0.007076,0.008173,0.007675,0.004637,0.006633
wavelet-HHH,firstorder,Energy,5.289175,4.731673,21.671994,5.352313,19.912227,9.243655,19.57019,18.155947,19.487921,7.513284,...,2.947578,3.422126,0.765366,2.859875,3.048037,1.898348,2.498022,2.079198,0.816922,1.706881
wavelet-HHH,firstorder,Entropy,0.999999,0.999994,1.0,0.999998,0.999999,1.0,1.0,0.999994,0.999999,0.999999,...,1.0,1.0,0.999969,0.999994,0.999999,0.999993,0.999999,0.999993,0.999983,0.999991
wavelet-HHH,firstorder,InterquartileRange,0.006416,0.006034,0.014099,0.006432,0.013452,0.009105,0.013432,0.011901,0.012461,0.008212,...,0.007888,0.008973,0.003965,0.008322,0.008156,0.006803,0.007883,0.007461,0.004327,0.006256
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wavelet-LLL,ngtdm,Busyness,12.206095,12.084919,12.210817,12.106545,13.302507,12.138941,12.606074,12.131731,12.14532,12.122981,...,7.967374,7.997804,8.006495,8.022501,7.982128,7.968822,7.950204,8.014583,8.001931,7.975322
wavelet-LLL,ngtdm,Coarseness,0.000074,0.000075,0.000074,0.000075,0.000074,0.000074,0.000074,0.000074,0.000074,0.000074,...,0.000288,0.000287,0.000287,0.000286,0.000288,0.000288,0.000289,0.000287,0.000287,0.000288
wavelet-LLL,ngtdm,Complexity,268.600376,266.508771,267.322944,269.083699,254.001143,266.886398,253.146454,268.091613,266.979944,268.486217,...,77.895082,78.169391,79.025926,78.63185,78.055036,78.521336,78.488515,79.007824,78.783269,78.415622
wavelet-LLL,ngtdm,Contrast,0.007902,0.00778,0.007823,0.007841,0.008292,0.007813,0.008292,0.007808,0.007804,0.007801,...,0.026512,0.026585,0.026521,0.026654,0.026667,0.026883,0.026849,0.026886,0.026393,0.02654


### non-WD only features ($X_{nWD}$)

Unnamed: 0_level_0,Unnamed: 1_level_0,PID,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,LUNG1-001,...,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421,LUNG1-421
Unnamed: 0_level_1,Unnamed: 1_level_1,LessionCode,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,...,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Unnamed: 0_level_2,Unnamed: 1_level_2,Rotation,R00,R10,R15,R20,R25,R30,R35,R40,R45,R05,...,R40,R45,R05,R50,R55,R60,R65,R70,R75,R80
ImagingFilter,Category,Name,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3,Unnamed: 22_level_3,Unnamed: 23_level_3
exponential,firstorder,10Percentile,0.845402,0.846659,0.845255,0.846421,0.847258,0.847848,0.84756,0.847367,0.846483,0.846459,...,0.900503,0.898515,0.896344,0.892309,0.89989,0.893221,0.898307,0.893001,0.902949,0.898655
exponential,firstorder,90Percentile,1.217135,1.218058,1.217956,1.216305,1.215841,1.216766,1.215064,1.216991,1.217017,1.216774,...,1.307269,1.313072,1.324495,1.336001,1.306099,1.331274,1.309346,1.334103,1.311155,1.311217
exponential,firstorder,Energy,207278.584093,207214.460545,207320.315036,206926.752697,206813.382665,206966.821053,206595.382453,207042.620621,207059.677897,206989.601842,...,63979.477715,64449.067992,65369.964399,66140.315023,63954.593498,65783.457254,64173.677039,66034.270991,64373.730666,64278.642993
exponential,firstorder,Entropy,0.721847,0.721132,0.722075,0.719982,0.720081,0.721796,0.71849,0.719716,0.719103,0.720305,...,0.610504,0.611646,0.610186,0.613595,0.611709,0.613259,0.61141,0.615349,0.601304,0.609373
exponential,firstorder,InterquartileRange,0.091773,0.091441,0.091546,0.091008,0.090921,0.091103,0.090164,0.091291,0.091231,0.091285,...,0.136083,0.139877,0.142837,0.149687,0.136409,0.147705,0.137651,0.149249,0.137405,0.139024
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
squareroot,ngtdm,Busyness,53.399118,53.872637,53.726684,52.589216,51.525783,52.707852,50.912929,52.838895,52.897289,52.682349,...,21.31336,45.291231,22.931501,36.893394,22.03511,32.656496,21.329294,52.065002,21.107449,45.227866
squareroot,ngtdm,Coarseness,0.000062,0.000062,0.000062,0.000063,0.000064,0.000063,0.000065,0.000063,0.000063,0.000063,...,0.000296,0.000341,0.000317,0.00027,0.000286,0.000237,0.000296,0.0002,0.000298,0.000342
squareroot,ngtdm,Complexity,62.567682,62.681291,62.342269,61.82298,60.94195,61.877806,60.266377,61.966589,61.806882,61.744362,...,20.183038,5.622871,17.332919,11.354792,20.570032,22.757081,20.527665,18.738581,20.31667,5.633295
squareroot,ngtdm,Contrast,0.025009,0.024508,0.024428,0.024707,0.024492,0.0246,0.024392,0.024522,0.024478,0.024524,...,0.038687,0.035457,0.042239,0.031276,0.039809,0.045506,0.038971,0.052508,0.037549,0.035245


### Number of Cases

314

## Steps for feature selection:

```mermaid
flowchart TD
    AD[("All Patients <br> (R_0 features)")] 
    T[("Training folds")]
    H[("Testing fold")]
    AD --> kfold[5-fold CV Splitter]
    kfold --> |80%|T 
    kfold --> |20%|H
    T ---> fine
    T --> training[Model Training]
    AD --> |X_WD / X_nWD|pre(Preliminary feature selection)
    pre --> |Surviving features|fine(Fine feature selection)
    fine --> |Surviving features|training
    H --> testing(Cross-validation)
    training --> testing
    testing --> res("5-fold results averaged")
```

Note that this feature selection process does not include the testing pipeline.

### 1. Preliminary feature selection

Features are first filtered on statistical criteria. First, `VarianceThreshold` is applied to remove features that are almost identical across all cases. Second, ANOVA is used to identify features that show statistically significant mean differences across the three target classes.

**This step was conducted prior to splitting the patients into folds.**
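
The two filtering steps can be sketched with scikit-learn on synthetic data. This is a minimal illustration, not the toolkit's implementation: `VarianceThreshold` and `f_classif` are used as stand-ins, and the variance threshold (`1e-3`) and significance level (`0.05`) are assumed values chosen for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, f_classif

rng = np.random.default_rng(0)
n_cases = 90
y = np.repeat([0, 1, 2], n_cases // 3)  # three synthetic classes

X = rng.normal(size=(n_cases, 5))
X[:, 0] += y    # feature 0: class-dependent mean (informative)
X[:, 1] = 1.0   # feature 1: constant across all cases

# Step 1: drop near-constant features (threshold is an assumed value).
vt = VarianceThreshold(threshold=1e-3)
X_var = vt.fit_transform(X)
kept_after_var = np.flatnonzero(vt.get_support())

# Step 2: one-way ANOVA F-test per feature across the three classes.
f_stat, p_values = f_classif(X_var, y)
alpha = 0.05  # assumed significance level
kept = kept_after_var[p_values < alpha]
print("surviving feature columns:", kept.tolist())
```

Because this step sees all cases before the folds are split, the surviving set is shared by every fold.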

### 2. Fine feature selection

Features were further selected using a more sophisticated technique, namely the elastic net. The elastic net can be seen as a combination of LASSO and ridge regression, which use $L_1$ and $L_2$ regularization, respectively. It was fitted in each run of the K-fold iteration using only the training-fold data. Therefore, each run may select a slightly different set of features from either the $X_{WD}$ or $X_{nWD}$ group.

**This step is conducted within the K-fold cross-validation iteration**
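
A minimal sketch of per-fold elastic-net selection is given below, on synthetic data. It assumes an elastic-net-penalized logistic regression from scikit-learn as the selector and a plain logistic regression as the downstream model; the toolkit used in this notebook may differ, and the values of `C`, `l1_ratio`, and the selection threshold are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 40)          # three synthetic classes
X = rng.normal(size=(120, 10))
X[:, 0] += 2.0 * y                    # feature 0 is strongly informative

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
selected_per_fold, scores = [], []
for train_idx, test_idx in cv.split(X, y):
    # Elastic net = mixed L1/L2 penalty; l1_ratio=0.5 weighs them equally.
    enet = LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000
    )
    model = make_pipeline(
        StandardScaler(),                       # fitted on the training fold only
        SelectFromModel(enet, threshold=1e-5),  # keep features with non-zero weight
        LogisticRegression(max_iter=1000),
    )
    model.fit(X[train_idx], y[train_idx])
    support = model.named_steps["selectfrommodel"].get_support()
    selected_per_fold.append(np.flatnonzero(support))
    scores.append(model.score(X[test_idx], y[test_idx]))

print([s.tolist() for s in selected_per_fold])
print(f"mean accuracy: {np.mean(scores):.3f}")
```

Keeping the scaler and selector inside the pipeline ensures that the testing fold never influences which features are selected in that run.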

In [134]:
'''Preparations'''

from sklearn.preprocessing import *
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn import metrics, ensemble
from mri_radiomics_toolkit.utils import *
from mri_radiomics_toolkit.feature_selection import *
from mri_radiomics_toolkit.model_building import cv_grid_search, neg_log_loss
from mri_radiomics_toolkit.models.cards import multi_class_cv_grid_search_card
from mri_radiomics_toolkit.perf_metric import top_k_accuracy_score_
from functools import partial
import sklearn
import warnings
warnings.filterwarnings('ignore')

# * set the global logger level to debug
MNTSLogger.set_global_log_level('debug')

# * Code the classes into integers
label_map = {
    key:  i for i, key in enumerate(clinical_df_whist['Histology'].unique())
}
mdprint("## Mapping")
display(label_map)

# * Get the ground-truth
y = clinical_df_whist['Histology'].replace(label_map)

# * Get the R0 original features for training
X_all_R0 = X_all.reorder_levels(axis=1, order=[2, 0, 1])['R00']
X_WD_R0 = X_WD.reorder_levels(axis=1, order=[2, 0, 1])['R00']
X_nWD_R0 = X_nWD.reorder_levels(axis=1, order=[2, 0, 1])['R00']

# * Map y to match indices of X
y = y.loc[X_all_R0.columns.droplevel(1)]
y.index = X_all_R0.columns
# Convert y into one-hot vector
y_onehot = pd.get_dummies(y)

# * Sanity check
assert X_all_R0.T.index.identical(y_onehot.index), \
    f"Index is different: {X_all_R0.T.index.difference(y_onehot.index)}"

# * Rest of the features
rot_degs = [f'R{i:02d}' for i in range(0, 81, 5)]
rot_degs.remove('R00')
print(rot_degs)
X_all_rest = X_all.reorder_levels(axis=1, order=[2, 0, 1])[rot_degs]
X_WD_rest = X_WD.reorder_levels(axis=1, order=[2, 0, 1])[rot_degs]
X_nWD_rest = X_nWD.reorder_levels(axis=1, order=[2, 0, 1])[rot_degs]

# * Choose feature level normalizers
normalizer_cls = partial(StandardScaler)

[2024-02-24 04:38:15,036-INFO] (global) Setting log level info -> debug
[2024-02-24 04:38:15,037-INFO] (preliminary_feature_filtering) Setting log level info -> debug
[2024-02-24 04:38:15,037-INFO] (ANOVA) Setting log level info -> debug
[2024-02-24 04:38:15,037-INFO] (sup-featselect) Setting log level info -> debug
[2024-02-24 04:38:15,037-INFO] (model-building) Setting log level info -> debug


## Mapping

{'large cell': 0, 'squamous cell carcinoma': 1, 'adenocarcinoma': 2}

['R05', 'R10', 'R15', 'R20', 'R25', 'R30', 'R35', 'R40', 'R45', 'R50', 'R55', 'R60', 'R65', 'R70', 'R75', 'R80']


## Sanity check

Now let's check whether the feature normalization step functions properly. We fit the normalization on the $R_0$ features and then apply it to the $R_{05}$ to $R_{80}$ features to see how many features have their mean and variance shifted correctly.

In [135]:
# * Sanity check if Z-score is valid for all rotations 
normalizer = normalizer_cls()
X_all_zscored = normalizer.fit_transform(X_all_R0.T).T
X_all_zscored = pd.DataFrame(data=X_all_zscored, index=X_all_rest['R10'].index, columns=X_all_rest['R10'].columns)

# for each rotation
_t = []
for rot in rot_degs:
    # check how many features are not valid under the R0 mean/var
    _zscored = normalizer.transform(X_all_rest[rot].T).T
    _zscored = pd.DataFrame(_zscored, index=X_all_rest[rot].index, columns=X_all_rest[rot].columns)
    _zscored_mean = _zscored.mean(axis=1)
    _zscored_mean.name = 'Mean'
    _zscored_var = _zscored.var(axis=1)
    _zscored_var.name = 'Var'

    # Report the features that failed to normalize (mean not within 0.1 of zero)
    _zscored_mean = _zscored_mean[~np.isclose(_zscored_mean.values, 0, atol=1E-01)]
    mdprint(f"## {rot}")
    display(_zscored_mean.to_frame().join(_zscored_var))
    

## R05

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | shape | Sphericity | 0.138331 | 0.982426 |
| lbp-3D-m1 | glcm | Contrast | -0.138789 | 0.871582 |
| lbp-3D-m1 | glcm | Correlation | 0.244365 | 0.847523 |
| lbp-3D-m1 | glcm | DifferenceAverage | -0.118109 | 0.913178 |
| lbp-3D-m1 | glcm | DifferenceVariance | -0.152764 | 0.885028 |
| ... | ... | ... | ... | ... |
| wavelet-HHH | gldm | SmallDependenceEmphasis | -0.103954 | 0.972004 |
| wavelet-HHH | gldm | SmallDependenceLowGrayLevelEmphasis | -0.117836 | 0.886939 |
| wavelet-HHH | ngtdm | Complexity | 0.110444 | 1.002913 |
| wavelet-HHH | ngtdm | Contrast | 0.110406 | 1.001659 |


## R10

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformityNormalized | 0.248234 | 1.238423 |
| original | glszm | SmallAreaEmphasis | 0.250340 | 1.324205 |
| lbp-3D-m1 | firstorder | Minimum | 0.113047 | 9.028754 |
| lbp-3D-m1 | glcm | Contrast | -0.143049 | 0.867206 |
| lbp-3D-m1 | glcm | Correlation | 0.253539 | 0.834224 |
| ... | ... | ... | ... | ... |
| square | glszm | SmallAreaLowGrayLevelEmphasis | 0.101114 | 1.076459 |
| squareroot | glszm | SizeZoneNonUniformityNormalized | 0.111457 | 1.004760 |
| squareroot | glszm | SmallAreaEmphasis | 0.124087 | 0.907123 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.196317 | 1.046797 |


## R15

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformityNormalized | 0.480832 | 1.461612 |
| original | glszm | SmallAreaEmphasis | 0.480498 | 1.348799 |
| original | glszm | ZoneEntropy | -0.102839 | 1.014792 |
| lbp-3D-m1 | glcm | Contrast | -0.147424 | 0.858020 |
| lbp-3D-m1 | glcm | Correlation | 0.268719 | 0.832434 |
| ... | ... | ... | ... | ... |
| square | glszm | SmallAreaLowGrayLevelEmphasis | 0.118485 | 0.904553 |
| squareroot | glszm | SizeZoneNonUniformityNormalized | 0.145539 | 1.029799 |
| squareroot | glszm | SmallAreaEmphasis | 0.154088 | 0.963578 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.304821 | 1.075881 |


## R20

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformity | 0.133695 | 1.455065 |
| original | glszm | SizeZoneNonUniformityNormalized | 0.570905 | 1.401857 |
| original | glszm | SmallAreaEmphasis | 0.577467 | 1.351007 |
| original | glszm | SmallAreaHighGrayLevelEmphasis | 0.105220 | 1.006400 |
| original | glszm | ZoneEntropy | -0.109100 | 0.975651 |
| ... | ... | ... | ... | ... |
| squareroot | glszm | SizeZoneNonUniformityNormalized | 0.273090 | 0.999240 |
| squareroot | glszm | SmallAreaEmphasis | 0.266323 | 0.918749 |
| squareroot | glszm | SmallAreaLowGrayLevelEmphasis | 0.107820 | 1.170568 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.455325 | 1.106770 |


## R25

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformity | 0.161251 | 1.411239 |
| original | glszm | SizeZoneNonUniformityNormalized | 0.711305 | 1.521912 |
| original | glszm | SmallAreaEmphasis | 0.714966 | 1.272452 |
| original | glszm | SmallAreaHighGrayLevelEmphasis | 0.148192 | 1.018771 |
| original | glszm | SmallAreaLowGrayLevelEmphasis | 0.113556 | 1.212287 |
| ... | ... | ... | ... | ... |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.570246 | 1.137837 |
| logarithm | glszm | SmallAreaEmphasis | 0.579286 | 0.894778 |
| logarithm | glszm | SmallAreaHighGrayLevelEmphasis | 0.131696 | 1.215930 |
| logarithm | glszm | SmallAreaLowGrayLevelEmphasis | 0.113487 | 1.278254 |


## R30

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformity | 0.156722 | 1.420258 |
| original | glszm | SizeZoneNonUniformityNormalized | 0.790371 | 1.311072 |
| original | glszm | SmallAreaEmphasis | 0.811405 | 1.150372 |
| original | glszm | SmallAreaHighGrayLevelEmphasis | 0.138318 | 0.992310 |
| original | glszm | SmallAreaLowGrayLevelEmphasis | 0.125093 | 1.221968 |
| ... | ... | ... | ... | ... |
| logarithm | glrlm | RunEntropy | 0.108784 | 1.124724 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.595109 | 1.080855 |
| logarithm | glszm | SmallAreaEmphasis | 0.608124 | 0.868857 |
| logarithm | glszm | SmallAreaHighGrayLevelEmphasis | 0.147721 | 1.226457 |


## R35

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformity | 0.158253 | 1.368626 |
| original | glszm | SizeZoneNonUniformityNormalized | 0.736556 | 1.406923 |
| original | glszm | SmallAreaEmphasis | 0.731892 | 1.433582 |
| original | glszm | SmallAreaHighGrayLevelEmphasis | 0.124010 | 0.959875 |
| original | glszm | SmallAreaLowGrayLevelEmphasis | 0.117705 | 1.288333 |
| ... | ... | ... | ... | ... |
| squareroot | glszm | SmallAreaLowGrayLevelEmphasis | 0.127929 | 1.274701 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.602634 | 1.201598 |
| logarithm | glszm | SmallAreaEmphasis | 0.611406 | 0.973284 |
| logarithm | glszm | SmallAreaHighGrayLevelEmphasis | 0.119894 | 1.129383 |


## R40

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformity | 0.144354 | 1.377057 |
| original | glszm | SizeZoneNonUniformityNormalized | 0.677905 | 1.467344 |
| original | glszm | SmallAreaEmphasis | 0.694964 | 1.306453 |
| original | glszm | SmallAreaHighGrayLevelEmphasis | 0.109126 | 0.958226 |
| original | glszm | SmallAreaLowGrayLevelEmphasis | 0.124940 | 1.196756 |
| ... | ... | ... | ... | ... |
| squareroot | glszm | SizeZoneNonUniformityNormalized | 0.330999 | 1.019325 |
| squareroot | glszm | SmallAreaEmphasis | 0.330723 | 0.925258 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.569780 | 1.074128 |
| logarithm | glszm | SmallAreaEmphasis | 0.562989 | 1.089687 |


## R45

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | glszm | SizeZoneNonUniformity | 0.124807 | 1.249399 |
| original | glszm | SizeZoneNonUniformityNormalized | 0.735998 | 1.486157 |
| original | glszm | SmallAreaEmphasis | 0.759603 | 1.272390 |
| original | glszm | SmallAreaHighGrayLevelEmphasis | 0.135706 | 0.979778 |
| original | glszm | SmallAreaLowGrayLevelEmphasis | 0.124730 | 1.150371 |
| ... | ... | ... | ... | ... |
| squareroot | glszm | SizeZoneNonUniformityNormalized | 0.375731 | 1.089293 |
| squareroot | glszm | SmallAreaEmphasis | 0.375819 | 0.931238 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.562048 | 1.314333 |
| logarithm | glszm | SmallAreaEmphasis | 0.568027 | 0.981185 |


## R50

| ImagingFilter | Category | Name | Mean | Var |
|---|---|---|---|---|
| original | shape | Maximum2DDiameterSlice | 0.120069 | 1.240359 |
| original | glszm | SizeZoneNonUniformity | 0.127621 | 1.324493 |
| original | glszm | SizeZoneNonUniformityNormalized | 0.695276 | 1.431698 |
| original | glszm | SmallAreaEmphasis | 0.731690 | 1.128258 |
| original | glszm | SmallAreaHighGrayLevelEmphasis | 0.114307 | 0.982606 |
| ... | ... | ... | ... | ... |
| squareroot | glszm | SmallAreaEmphasis | 0.371544 | 0.809046 |
| logarithm | glszm | SizeZoneNonUniformityNormalized | 0.541511 | 1.100374 |
| logarithm | glszm | SmallAreaEmphasis | 0.555471 | 0.873850 |
| logarithm | glszm | SmallAreaHighGrayLevelEmphasis | 0.127879 | 1.190874 |


## R55

ImagingFilter,Category,Name,Mean,Var
original,shape,Maximum2DDiameterColumn,-0.105494,0.837348
original,shape,Maximum2DDiameterSlice,0.100430,1.165279
original,glszm,SizeZoneNonUniformity,0.135765,1.327921
original,glszm,SizeZoneNonUniformityNormalized,0.747528,1.567715
original,glszm,SmallAreaEmphasis,0.742626,1.363063
...,...,...,...,...
logarithm,glrlm,RunEntropy,0.113713,1.069244
logarithm,glszm,SizeZoneNonUniformityNormalized,0.608870,1.197802
logarithm,glszm,SmallAreaEmphasis,0.621244,0.845118
logarithm,glszm,SmallAreaHighGrayLevelEmphasis,0.160779,1.226113


## R60

ImagingFilter,Category,Name,Mean,Var
original,shape,Maximum2DDiameterColumn,-0.108001,0.847065
original,shape,Maximum2DDiameterSlice,0.111652,1.244884
original,glszm,SizeZoneNonUniformity,0.154205,1.414326
original,glszm,SizeZoneNonUniformityNormalized,0.730771,1.512981
original,glszm,SmallAreaEmphasis,0.739829,1.339912
...,...,...,...,...
squareroot,glszm,SmallAreaLowGrayLevelEmphasis,0.111903,1.377803
logarithm,glszm,SizeZoneNonUniformityNormalized,0.547126,1.080735
logarithm,glszm,SmallAreaEmphasis,0.558707,0.864436
logarithm,glszm,SmallAreaHighGrayLevelEmphasis,0.128430,1.227178


## R65

ImagingFilter,Category,Name,Mean,Var
original,shape,Maximum2DDiameterColumn,-0.111174,0.810421
original,shape,Maximum2DDiameterSlice,0.132789,1.243881
original,glszm,SizeZoneNonUniformity,0.139793,1.352090
original,glszm,SizeZoneNonUniformityNormalized,0.692958,1.524554
original,glszm,SmallAreaEmphasis,0.695097,1.364734
...,...,...,...,...
squareroot,glszm,SmallAreaEmphasis,0.301986,0.907134
squareroot,glszm,SmallAreaLowGrayLevelEmphasis,0.106347,1.149559
logarithm,glszm,SizeZoneNonUniformityNormalized,0.504875,1.024296
logarithm,glszm,SmallAreaEmphasis,0.541663,0.800382


## R70

ImagingFilter,Category,Name,Mean,Var
original,shape,Maximum2DDiameterColumn,-0.114994,0.828483
original,shape,Maximum2DDiameterSlice,0.148855,1.269793
original,glszm,SizeZoneNonUniformity,0.120958,1.211389
original,glszm,SizeZoneNonUniformityNormalized,0.604010,1.457064
original,glszm,SmallAreaEmphasis,0.626666,1.216167
...,...,...,...,...
squareroot,glszm,SmallAreaEmphasis,0.256275,0.970936
squareroot,glszm,SmallAreaLowGrayLevelEmphasis,0.140456,1.590537
logarithm,glszm,SizeZoneNonUniformityNormalized,0.495136,1.192869
logarithm,glszm,SmallAreaEmphasis,0.494545,1.044675


## R75

ImagingFilter,Category,Name,Mean,Var
original,shape,Maximum2DDiameterColumn,-0.125052,0.795584
original,shape,Maximum2DDiameterSlice,0.160462,1.314497
original,glszm,SizeZoneNonUniformityNormalized,0.498632,1.446732
original,glszm,SmallAreaEmphasis,0.517690,1.355889
lbp-3D-m1,glcm,Contrast,-0.143546,0.860111
...,...,...,...,...
square,glszm,SmallAreaLowGrayLevelEmphasis,0.228191,1.168555
squareroot,glszm,SizeZoneNonUniformityNormalized,0.278025,1.008856
squareroot,glszm,SmallAreaEmphasis,0.274028,0.960056
logarithm,glszm,SizeZoneNonUniformityNormalized,0.354936,1.039237


## R80

ImagingFilter,Category,Name,Mean,Var
original,shape,Maximum2DDiameterColumn,-0.122978,0.809664
original,shape,Maximum2DDiameterSlice,0.133891,1.254803
original,glszm,SizeZoneNonUniformityNormalized,0.343869,1.209709
original,glszm,SmallAreaEmphasis,0.364188,1.187443
lbp-3D-m1,glcm,Contrast,-0.134769,0.877347
...,...,...,...,...
square,glszm,SmallAreaLowGrayLevelEmphasis,0.113818,1.035776
squareroot,glszm,SizeZoneNonUniformityNormalized,0.197163,0.895272
squareroot,glszm,SmallAreaEmphasis,0.216189,0.781709
logarithm,glszm,SizeZoneNonUniformityNormalized,0.278294,1.089923


## Steps for model training:

Models were trained within repeated K-fold iterations. In this study, we ran 5-fold cross-validation 50 times, separately for the $X_{WD}$ and $X_{nWD}$ features. This gives a total of 500 runs (250 each using exclusively $X_{WD}$ or $X_{nWD}$). In each run, we performed the following:

1. Grid search over several classifiers for the optimal hyperparameters to discriminate between NSCLC subtypes.
2. Using the optimal hyperparameters found in step 1, train the model on the K-fold training set, limited to $R_0$ features (from either $X_{WD}$ or $X_{nWD}$).
3. Test the trained model on the K-fold testing sets of the $R_0$ to $R_{80}$ feature sets (from either $X_{WD}$ or $X_{nWD}$).
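
The run count stated above follows from 50 repeats × 5 folds × 2 feature groups. The following is a minimal sketch of that bookkeeping only. It assumes scikit-learn's `RepeatedStratifiedKFold` as a stand-in for the explicit loop used later in this notebook, and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

N, K = 50, 5                       # repeats and folds, matching `global_config`
feature_groups = ["WD", "nWD"]     # the two feature sets compared in this study

# Hypothetical toy labels: three balanced classes, 20 samples each
y_toy = np.repeat([0, 1, 2], 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant when counting splits

splitter = RepeatedStratifiedKFold(n_splits=K, n_repeats=N, random_state=0)
runs_per_group = sum(1 for _ in splitter.split(X_toy, y_toy))
total_runs = runs_per_group * len(feature_groups)
print(runs_per_group, total_runs)  # 250 runs per feature group, 500 in total
```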


### Grid Search

Using `GridSearchCV`, we tested different hyperparameters of the chosen classifiers to determine the optimal set used during K-fold cross-validation. The hyperparameter grid is defined in the variable `clf_card`.
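
As an illustration of how a grid keyed like `clf_card` can drive a pipeline with a `'passthrough'` placeholder step, here is a minimal sketch using plain scikit-learn. The toy data, the reduced grid `toy_card`, and the use of `StandardScaler` are assumptions made for this example; the notebook itself calls the toolkit's `cv_grid_search`, whose internals are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the radiomics features (samples x features)
X_toy, y_toy = make_classification(n_samples=120, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

# The 'classification' step is a placeholder that each grid fills in
pipe = Pipeline([
    ("standardization", StandardScaler()),   # assumed normalizer for this sketch
    ("classification", "passthrough"),
])

# Reduced, hypothetical grid with the same key layout as `clf_card`
toy_card = {
    "Logistic Regression": {
        "classification": [LogisticRegression(max_iter=1000)],
        "classification__C": [0.1, 1, 10],
    },
    "KNN": {
        "classification": [KNeighborsClassifier()],
        "classification__n_neighbors": [3, 5],
    },
}

best_estimators = {}
for name, grid in toy_card.items():
    search = GridSearchCV(pipe, param_grid=grid, scoring="roc_auc_ovr_weighted",
                          cv=3, refit=True)
    search.fit(X_toy, y_toy)
    best_estimators[name] = search.best_estimator_   # refitted pipeline per classifier
```

Running one search per classifier keeps a best estimator for every model family, which matches how the later cells iterate over `be.items()`.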

In [136]:
'''Grid search for optimal training hyperparameters'''
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imb_Pipeline

# Grid search is scored with weighted one-vs-rest ROC AUC; each fold is then scored with accuracy
scoring_func_name = "roc_auc_ovr_weighted"
scoring_func = metrics.accuracy_score
global_config = {
    'ANOVA_pthres': 0.05, 
    'var_thres': 1E-4,
    'alpha': 0.01, 
    'l1_ratio': 0.9, 
    'n_features': 50, 
    'N': 50,
    'K': 5
}
switch = {
    'AddGridSearch': True,  # Perform grid search inside the K-fold loop rather than once up front with shared hyperparameters
    'AddSMOTE': False,      # Add SMOTE during K-fold training. Otherwise, it is only applied during feature selection
    'UpdateStandardization': True # Update the normalization parameters using rotated training data
}

verbose=False
clf_card = {
    'Support Vector Machine': {
        # Takes too many iterations for SVM to converge with tol < 1E-3. RBF kernel notably takes
        # up to 100,000 iterations to converge with tol = 1E-3, which is only 0.1% of the score function
        'classification': [sklearn.svm.SVC(tol=1E-4, max_iter=-1, probability=True,  decision_function_shape='ovr',verbose=verbose, 
                                           )],
        'classification__kernel': ['linear'],
        'classification__C': [0.1, 1, 10, 100],
    },
    'Support Vector Machine (rbf)': {
        # Takes too many iterations for SVM to converge with tol < 1E-3. RBF kernel notably takes
        # up to 100,000 iterations to converge with tol = 1E-3
        'classification': [sklearn.svm.SVC(tol=1E-4, max_iter=-1, probability=True,  decision_function_shape='ovr', verbose=verbose, 
                                           )],
        'classification__kernel': ['rbf'],
        'classification__C': [0.1, 1, 10, 100],
        'classification__shrinking': [False, True], 
        'classification__gamma': ['auto', 'scale']
    },
    'Logistic Regression': {
        'classification': [sklearn.linear_model.LogisticRegression(penalty='elasticnet',
                                                           solver='saga', tol=1E-4,
                                                           max_iter=10000,
                                                           verbose=verbose)],
        'classification__C': [0.1, 1, 100, 1000],
        'classification__l1_ratio': [0.1, 0.9]
    },
    'Random Forest': {
        'classification': [sklearn.ensemble.RandomForestClassifier(verbose=verbose)],
        'classification__n_estimators': [20, 30, 50, 100],
        'classification__criterion': ['gini', 'entropy'],
        'classification__max_depth': [None, 5, 10, 20]
    }, 
    'KNN': {
        'classification': [sklearn.neighbors.KNeighborsClassifier()], 
        'classification__n_neighbors': [3, 5, 7, 10]
    }, 
}

def normalize_n_feature_selection(X: pd.DataFrame,
                                  y: Union[pd.Series, pd.DataFrame],
                                  X_hold_out: pd.DataFrame,
                                  **kwargs) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Index]:
    """Performs normalization and supervised feature selection on given data.

    This function first applies normalization to the data. It then uses Synthetic Minority 
    Over-sampling Technique (SMOTE) to balance the classes before performing supervised 
    feature selection. Feature selection is performed using Elastic Net and only features 
    with non-zero coefficients are selected.

    Args:
        X (pd.DataFrame): 
            The input features to be normalized and for feature selection. Each row is a feature
            and each column is a sample.
        y (Union[pd.Series, pd.DataFrame]): 
            The target variable corresponding to the input features.
        X_hold_out (pd.DataFrame): 
            The hold-out set of input features to be transformed based on the transformations 
            applied on X.
        **kwargs: 
            Additional parameters to be passed to the normalization and feature selection functions.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.Index]:
            X (pd.DataFrame): 
                The normalized and feature-selected dataframe of input features. 
            X_hold_out (pd.DataFrame): 
                The transformed and feature-selected dataframe of hold-out input features.
            sup_selected_feats.index (pd.Index): 
                The index of features selected during supervised feature selection.
            
    .. notes::
        The number of selected features is capped at `global_config['n_features']` to keep it to a reasonable number. 
        The output DataFrames `X` and `X_hold_out` will have their index sorted as a side effect of the feature selection process.
        The output are normalized using `normalizer_cls()`. If you don't want normalized output
        please use `sup_selected_feats.index` output to select the features from the raw input `X`.
    """
    # Perform normalization
    normalizer = normalizer_cls()
    X_normed = normalizer.fit_transform(X.T).T
    X = pd.DataFrame(data=X_normed, index=X.index, columns=X.columns)
    X_hold_out_normed = normalizer.transform(X_hold_out.T).T
    X_hold_out = pd.DataFrame(data=X_hold_out_normed, index=X_hold_out.index, columns=X_hold_out.columns)
    
    # Smote to balance class before supervised feature selection
    smote_sampler = SMOTE()
    X_smote, y_smote = smote_sampler.fit_resample(X.T, y) # X.T: (n_samples, n_features)
    X_smote = X_smote.T # X_smote: (n_features, n_samples)
    y_smote = pd.DataFrame(y_smote, index=X_smote.columns)
    y = y_smote
    
    # Perform fine feature selection on the training data passed in; a feature is kept
    # if its elastic-net coefficient is non-zero. Note: set `n_trials=1` to use a single ElasticNet
    sup_selected_feats = supervised_features_selection(
        X_smote,
        y.to_frame() if isinstance(y, pd.Series) else y,
        alpha=global_config['alpha'],
        l1_ratio=global_config['l1_ratio'], 
        n_trials=1,
        boosting=False,
        n_features=global_config['n_features'],    # Keep the number of selected features limited to a reasonable number
    )
    X = X.loc[sup_selected_feats.index] # Note that output has its index sorted
    if X_hold_out is not None:
        X_hold_out = X_hold_out.loc[X.index] # Also apply results to X_hold_out
    return X, X_hold_out, sup_selected_feats.index


def get_best_hyperparameters(X: pd.DataFrame, 
                             gt: Union[pd.DataFrame, pd.Series]) -> Tuple[dict, dict, pd.Index]:
    r"""
    Get best hyperparameters for a given dataset and ground truth labels using cross-validated grid search.
    
    The function first standardizes the input data using `normalizer_cls`, then applies preliminary feature
    filtering. After that, it performs cross-validated grid search to find the best hyperparameters for the
    model, and returns the models refitted with these best hyperparameters.

    Args:
        X (pd.DataFrame):
            The input data, where each row is a feature and each column is a sample.
        gt (Union[pd.DataFrame, pd.Series]):
            The ground truth labels for each sample.

    Returns:
        Tuple[dict, dict, pd.Index]
            A tuple containing three elements:
                - best_params: A dictionary of the best hyperparameters found by the grid search.
                - best_estimators: A dictionary of the best models, keyed by classifier name.
                - X_train.index: The index of the features retained after feature filtering and selection.

    Raises:
        ValueError
            If the input parameters are not in the expected format or if the grid search does not converge.
    """
    # Zscore normalization
    zscored = normalizer_cls().fit_transform(X.T).T
    X = pd.DataFrame(data=zscored, index=X.index, columns=X.columns)

    # Use custom pipeline that changes the standardization
    clf = sklearn.pipeline.Pipeline([
        ('standardization', normalizer_cls()),
        ('classification', 'passthrough')
    ]) 

    gt = gt.to_frame() if isinstance(gt, pd.Series) else gt

    # Feature selection
    X_train, _ = preliminary_feature_filtering(X, None, gt, 
                                               p_thres=global_config['ANOVA_pthres'], var_thres = global_config['var_thres'])
    
    # Fine feature selection
    __, __, chosen_feats = normalize_n_feature_selection(
        X_train,
        gt,
        X_train # Place-holder, have no use.
    )
    X_train = X_train.loc[chosen_feats]

    
    # Perform CV grid search for best hyperparams
    best_params, results, predict_table, best_estimators = cv_grid_search(
        X_train.T, gt.astype('int'),
        param_grid_dict=clf_card,
        scoring=scoring_func_name,
        clf=clf, 
        refit=True, 
        n_jobs=10
    )
    return best_params, best_estimators, X_train.index

X_R0 = {
    'WD' : X_WD_R0,
    'nWD': X_nWD_R0
}

mdprint("## Hyperparameters grid search")
try:
    if not switch['AddGridSearch']:
        with open(resource_root / 'pipelines_v2.joblib', 'rb') as f:
            best_estimators = joblib.load(f)

        with open(resource_root / 'bestparams_v2.joblib', 'rb') as f:
            best_params = joblib.load(f)

        with open(resource_root / 'prelim_features_v2.joblib', 'rb') as f:
            prelim_features = joblib.load(f)
        mdprint("Successfully loaded checkpoints.")
    else:
        mdprint("Not loading best estimators because GridSearch is done within loop")
except Exception as e:
    # Grid search is slow, so its results are checkpointed and recomputed only when loading fails
    mdprint("Checkpoints not found, re-do initial hyperparameter grid search")
    best_params = {}
    best_estimators = {}
    prelim_features = {}
    for name, _X in X_R0.items():
        # ! Add SMOTE if it is switched on
        if switch['AddSMOTE']:
            smote_sampler = SMOTE()
            _X, _y = smote_sampler.fit_resample(_X.T, y) # X.T: (n_samples, n_features)
            _X = _X.T # X_smote: (n_features, n_samples)
            _y = pd.DataFrame(_y, index=_X.columns)
        else:
            _y = y.copy()
        
        # Perform preliminary feature filtering
        best_params[name], best_estimators[name], prelim_features[name] = get_best_hyperparameters(_X, _y)

    # check if the best_estimators has converged properly
    for feat_gp in best_estimators:
        for model_name, model in best_estimators[feat_gp].items():
            try:
                if model[1].fit_status_:
                    msg = f"{(feat_gp, model_name)} did not converge for the best_estimator."
                    print(msg)
            except AttributeError:
                continue
            except Exception as e:
                raise e

    # Save the best estimators, best hyperparameters, and preliminary features as checkpoints
    with open(resource_root / 'pipelines_v2.joblib', 'wb') as f:
        joblib.dump(best_estimators, f)

    with open(resource_root / 'bestparams_v2.joblib', 'wb') as f:
        joblib.dump(best_params, f)

    with open(resource_root / 'prelim_features_v2.joblib', 'wb') as f:
        joblib.dump(prelim_features, f)

## Hyperparameters grid search

Checkpoints not found, re-do initial hyperparameter grid search

[2024-02-24 04:38:19,366-INFO] (preliminary_feature_filtering) Dropping 'Diganostics' column.
[2024-02-24 04:38:19,367-ERROR] (preliminary_feature_filtering) Diagnostics column generated by PyRadiomics is not found or error occurs.
[2024-02-24 04:38:19,367-INFO] (preliminary_feature_filtering) Dropping 'Diganostics' column.
[2024-02-24 04:38:19,367-INFO] (preliminary_feature_filtering) Dropping features with low variance with threshold 0.0001...
[2024-02-24 04:38:19,376-INFO] (preliminary_feature_filtering) Dropped 0 features for first feature set.
[2024-02-24 04:38:19,376-INFO] (preliminary_feature_filtering) Second feature set not found. ICC filtering skipped. 
[2024-02-24 04:38:19,377-INFO] (preliminary_feature_filtering) Dropping features using T-test or ANOVA with p-value threshold: 0.05
[2024-02-24 04:38:19,378-INFO] (preliminary_feature_filtering) Using ANOVA
[2024-02-24 04:38:19,522-DEBUG] (ANOVA) Normality Shapiro results: 
                                                     

## Run K-fold cross validation
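
Because some patients contribute more than one lesion, the folds in the cell below are built on patients rather than lesions. The following is a minimal sketch of that idea, assuming labels indexed by a hypothetical `(Patient, Lesion)` MultiIndex. The toy data and the leak counter are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 30 patients with two lesions each, index = (Patient, Lesion)
patients = [f"P{i:02d}" for i in range(30)]
index = pd.MultiIndex.from_product([patients, ["L1", "L2"]], names=["Patient", "Lesion"])
y_toy = pd.Series(np.repeat(np.arange(30) % 3, 2), index=index, name="Subtype")

# Keep one label per patient (the first lesion) so stratification is done on patients
unique_patients = y_toy.index.get_level_values(0).unique()
first_lesion = ~y_toy.droplevel(1).index.duplicated(keep="first")
patient_labels = y_toy[first_lesion]

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
leaks = 0
for train_idx, test_idx in splitter.split(unique_patients, patient_labels):
    train_patients = unique_patients[train_idx]
    test_patients = unique_patients[test_idx]
    # Pull every lesion that belongs to the selected patients
    in_train = y_toy.index.get_level_values(0).isin(train_patients)
    in_test = y_toy.index.get_level_values(0).isin(test_patients)
    y_train, y_test = y_toy[in_train], y_toy[in_test]
    # Count patients that appear on both sides of the split (should stay at zero)
    leaks += len(set(y_train.index.get_level_values(0)) & set(y_test.index.get_level_values(0)))
```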

In [137]:
from feature_robustness_analysis.io import prepend_index_level

# Make sure the random seed is reset
np.random.seed(None)    


# Mute the logger because it is too noisy
MNTSLogger.set_global_verbosity(False)
MNTSLogger.set_global_log_level('info')

# Setup performance metric
# scoring_func = partial(top_k_accuracy_score_, k=1)

# * Perform N times K-fold
N = global_config['N']
K = global_config['K']
preds = []
trials = []
final_features = []
fit_failed = []
show_sub_progress = False
for n in auto.trange(N):
    # Because some patients have more than one lesion, the K folds are constructed by 
    # splitting patients instead of splitting lesions to avoid data leakage.
    unique_patients = y.index.get_level_values(0).unique()
    unique_y = y.droplevel(1).index.duplicated(keep='first')
    unique_y = y[~unique_y]
    
    # K-fold splitter
    kfold_splitter = StratifiedKFold(n_splits=K, shuffle=True)
    
    # Prepare the results to be recorded per trial
    trial_row = pd.Series(
        index=pd.MultiIndex.from_product(
            [list(X_R0.keys()), 
             list(range(K)),
             list(clf_card.keys()), 
             tuple(['Performance']), 
             tuple(['R00'] + rot_degs)], 
            names=['Feature Category', 'Fold #', 'Model', 'Pocket', 'Rotations']
        ),
        name=f'Trial-{n:03d}'
    )
    # Train and test for each feature group ('WD', 'nWD')
    for fold_number, (train_set, test_set) in enumerate(kfold_splitter.split(unique_patients, unique_y)):
        # configure X and y
        train_set_patients = unique_patients[train_set]
        test_set_patients = unique_patients[test_set]
        
        # Loop feature group
        for feat_group in X_R0:
            # * Get preliminarily selected features (computed outside this loop)
            pre_selected_features = prelim_features[feat_group]
            
            _X_n_train = X_R0[feat_group][train_set_patients].loc[pre_selected_features].copy()
            _y_n_train = y[train_set_patients].copy()
            _X_n_test = X_R0[feat_group][test_set_patients].loc[pre_selected_features].copy()
            _y_n_test = y[test_set_patients].copy()
            
            # * Perform supervised feature selection and normalization
            _X_n_train, _X_n_test, chosen_feats = normalize_n_feature_selection(
                _X_n_train, 
                _y_n_train.to_frame(), 
                _X_n_test
            )
            _features_row = pd.Series(chosen_feats.to_list(), name=(n, fold_number, feat_group)) 
            final_features.append(_features_row) # store this
            
            # If too few features remain, raise a warning
            if _X_n_train.shape[0] <5:
                warnings.warn(f"The current loop {n}, fold {fold_number} of {feat_group} has very few features ({_X_n_train.shape[0]}) survived.")
            
            #! Grid search for optimized parameters
            if switch['AddGridSearch']:
                clf = sklearn.pipeline.Pipeline([
                    ('standardization', normalizer_cls()),
                    ('classification', 'passthrough')
                ]) 
                _, __, ___, be = cv_grid_search(
                    _X_n_train.T, _y_n_train.astype('int'),
                    param_grid_dict=clf_card,
                    scoring=scoring_func_name,
                    clf=clf, 
                    refit=True, 
                    n_jobs=10
                )
            else:
                be = best_estimators[feat_group]
            
            # * Loop for each ML model
            for model_name, model in be.items():
                # Training with R0 features again
                _model = sklearn.base.clone(model)
                
                with warnings.catch_warnings():
                    warnings.filterwarnings('error', category=sklearn.exceptions.ConvergenceWarning)
                    warnings.filterwarnings('ignore', category=FutureWarning)
                    try:
                        if switch['AddSMOTE']:
                            #! SMOTE
                            smote_sampler = SMOTE()
                            X_smote, y_smote = smote_sampler.fit_resample(_X_n_train.T, _y_n_train) # X.T: (n_samples, n_features)
                            X_smote = X_smote.T # X_smote: (n_features, n_samples)
                            y_smote = pd.DataFrame(y_smote, index=X_smote.columns)
                            _model.fit(X_smote.T, y_smote)
                        else:
                            #! No SMOTE
                            _model.fit(_X_n_train.T, _y_n_train.to_frame())
                    except sklearn.exceptions.ConvergenceWarning as w:
                        # record fail to fit
                        failed2fit = {
                            'Trial #': n, 
                            'Fold #': fold_number, 
                            'Model': model_name, 
                            'Feature Category': feat_group
                        }
                        fit_failed.append(failed2fit)
                        print(f"Fail to fit: {failed2fit}")
                        continue
                    
                
                # * Test with R0 features
                _pred = _model.predict(_X_n_test.T)
                _pred_proba = _model.predict_proba(_X_n_test.T)
                _pred_row = pd.DataFrame(
                    _pred_proba, index=_y_n_test.index,
                    columns=pd.MultiIndex.from_tuples([
                        (n, 'predict_proba', f"Fold {fold_number}", feat_group, model_name, 'R00', a) for a in range(_pred_proba.shape[1])
                    ])
                )
                preds.append(_pred_row)
                # also get predict() because predict() and predict_proba() can give diff results
                # based on what is said on sklearn. 
                _pred_row = pd.DataFrame(
                    _pred, index=_y_n_test.index,
                    columns=pd.MultiIndex.from_tuples([
                        (n, 'predict', f"Fold {fold_number}", feat_group, model_name, 'R00', -1) # Class -1 means result from `predict()`, it is usually integer representing class
                    ])
                )
                preds.append(_pred_row)
                
                # * Run also for training set
                _pred_train = _model.predict(_X_n_train.T)
                _pred_row = pd.DataFrame(
                    _pred_train, index=_y_n_train.index, 
                    columns=pd.MultiIndex.from_tuples([
                        (n, 'train', f"Fold {fold_number}", feat_group, model_name, 'R00', -1) # Class -1 means result from `predict()`, it is usually integer representing class
                    ])
                )
                preds.append(_pred_row)
                
                _score = scoring_func(_y_n_test, _pred)
                trial_row[feat_group, fold_number, model_name, 'Performance', 'R00'] = _score
                # Test with rotated features
                for rot in rot_degs:
                    _X_rot_train = X_all_rest[rot][_X_n_train.columns].loc[chosen_feats]
                    _X_rot_test = X_all_rest[rot][_X_n_test.columns].loc[chosen_feats]

                    if switch['UpdateStandardization']:
                        # ! update feature normalization parameters for rotated features using the training set
                        test_norm = normalizer_cls()
                        test_norm.fit(_X_rot_train.T) # Fit using training set mean/var
                        _X_rot_test_normed = test_norm.transform(_X_rot_test.T).T
                        _pred = _model[1].predict(_X_rot_test_normed.T)
                        _pred_proba = _model[1].predict_proba(_X_rot_test_normed.T)
                        _pred_train = _model[1].predict(test_norm.fit_transform(_X_rot_train.T))
                    else:
                        # ! no feature normalization update, directly use trained parameters
                        _pred = _model.predict(_X_rot_test.T)
                        _pred_proba = _model.predict_proba(_X_rot_test.T)
                        _pred_train = _model.predict(_X_rot_train.T)
                        
                    #* build row with header [num_trial, pred_func, num_fold, feat_group, model_name, rotation, class]
                    _pred_row = pd.DataFrame(
                        _pred_proba,
                        index=_y_n_test.index,
                        columns=pd.MultiIndex.from_tuples([
                            (n, 'predict_proba', f"Fold {fold_number}", feat_group, model_name, rot, a) for a in range(_pred_proba.shape[1])
                        ])
                    )
                    preds.append(_pred_row)
                    _pred_row = pd.DataFrame(
                        _pred,
                        index=_y_n_test.index,
                        columns=pd.MultiIndex.from_tuples([
                            (n, 'predict', f"Fold {fold_number}", feat_group, model_name, rot, -1)
                        ])
                    )
                    preds.append(_pred_row)
                    _pred_row = pd.DataFrame(
                        _pred_train,
                        index=_y_n_train.index,
                        columns=pd.MultiIndex.from_tuples([
                            (n, 'train', f"Fold {fold_number}", feat_group, model_name, rot, -1)
                        ])
                    )
                    preds.append(_pred_row)
                    _score = scoring_func(_y_n_test, _pred)
                    trial_row[feat_group, fold_number, model_name, 'Performance', rot] = _score
    trials.append(trial_row)
    
# Unmute the logger
MNTSLogger.set_global_verbosity(True)

# Build dataframes
df_preds = pd.concat(preds, axis=1)
df_preds.columns.names=['Iteration # (n)', 'Pred Func', 'Fold #', 'Feature Category', 'Classifier', 'Rotation', 'Classes']
df_preds.sort_index(axis=1, level=0, inplace=True)
df_preds.sort_index(axis=0, level=0, inplace=True)
df_features = pd.concat(final_features, axis=1)
df_features.columns.names = ['Iteration # (n)', 'Fold #', 'Feature Group']
df_features.fillna("", inplace=True)
df_trials = pd.concat(trials, axis=1).T

  0%|          | 0/50 [00:00<?, ?it/s]
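
The `UpdateStandardization` switch in the cell above decides whether rotated features are re-normalized with statistics from the rotated training set before being passed to the classifier step (`_model[1]`), or sent through the whole pipeline unchanged. The sketch below contrasts the two options on toy data. The "rotated" features here are a hypothetical per-feature scale and offset of the unrotated values, which is an assumption made purely for illustration and not a model of the real rotation effect; `StandardScaler` and `LogisticRegression` are likewise stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=150, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)
X_train, X_test = X_toy[:100], X_toy[100:]
y_train, y_test = y_toy[:100], y_toy[100:]

# Hypothetical "rotated" features: per-feature positive scale and offset
rng = np.random.default_rng(0)
scale = rng.uniform(0.5, 2.0, size=X_toy.shape[1])
shift = rng.normal(size=X_toy.shape[1])
X_rot_train, X_rot_test = X_train * scale + shift, X_test * scale + shift

# Train once on unrotated features only
model = Pipeline([
    ("standardization", StandardScaler()),
    ("classification", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
pred_r0 = model.predict(X_test)

# Option 1: refit the scaler on rotated training data, then call only the classifier step
rot_scaler = StandardScaler().fit(X_rot_train)
pred_updated = model[1].predict(rot_scaler.transform(X_rot_test))

# Option 2: reuse the whole pipeline, including the scaler fitted on unrotated data
pred_frozen = model.predict(X_rot_test)

agree_updated = float(np.mean(pred_updated == pred_r0))
agree_frozen = float(np.mean(pred_frozen == pred_r0))
print(agree_updated, agree_frozen)
```

Under this toy assumption, refitting the scaler removes the scale and offset, so option 1 should reproduce the unrotated predictions almost exactly, while option 2 is exposed to the shift.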

In [139]:
# Tabulate how many of the runs failed to converge
df_fit_failed = {}
for v in fit_failed:
    idx = tuple(v.values())
    df_fit_failed[idx] = True
if len(df_fit_failed) > 0:
    df_fit_failed = pd.Series(df_fit_failed, name='Not converged')
    df_fit_failed.index.names = v.keys()
    mdprint("## Model failed to converge ")
    display(df_fit_failed.to_frame())
    display(df_fit_failed.groupby(["Feature Category", "Model"]).count().to_frame())
else:
    print("All fits were successful.")

All fits were successful.


In [141]:
# Save output as excels
output_dir = data_root
excel_dir = output_dir / "n_k-fold_cv_results_3classes.xlsx"

# Save to Excel. This does not always work because of the huge size of the tables, so it is disabled for now
# with pd.ExcelWriter(excel_dir, mode='w', engine='openpyxl') as excel_writer:
#     df_preds.fillna("train").T.to_excel(excel_writer, sheet_name='Predictions')
#     df_features.fillna("").to_excel(excel_writer, sheet_name="Selected Features")
#     df_trials.fillna("").to_excel(excel_writer, sheet_name="Performances")

# Also save to HDF, this is quick and more reliable than excel.
with pd.HDFStore(excel_dir.with_suffix('.h5')) as hdf_file:
    df_preds.fillna("train").T.to_hdf(hdf_file, key = 'Predictions')
    df_features.fillna("").to_hdf(hdf_file    , key = "Selected Features")
    df_trials.fillna("").to_hdf(hdf_file      , key = "Performances")

# Selected features analysis

Note that all other analyses were conducted in `3_performance_analysis.ipynb`.
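
The features tallied in this section come from the elastic-net step inside `normalize_n_feature_selection`, which keeps features whose coefficients are non-zero. The sketch below illustrates that idea with scikit-learn's `ElasticNet` and the `alpha` and `l1_ratio` values from `global_config`. It is not the toolkit's `supervised_features_selection`: the toy data, the binary target, the cap `n_features_cap`, the `StandardScaler` normalizer, and the omission of SMOTE are all assumptions of this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; rows are features and columns are samples, as in this notebook
X_arr, y_toy = make_classification(n_samples=200, n_features=30, n_informative=5,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr.T, index=[f"feat_{i:02d}" for i in range(30)])

n_features_cap = 10   # hypothetical cap, playing the role of global_config['n_features']

# Normalize in (n_samples, n_features) layout, then fit the elastic net on the labels
X_normed = StandardScaler().fit_transform(X_toy.T)
enet = ElasticNet(alpha=0.01, l1_ratio=0.9, max_iter=10000)
enet.fit(X_normed, y_toy)

# Keep features with non-zero coefficients, largest magnitude first, up to the cap
coefs = pd.Series(enet.coef_, index=X_toy.index)
selected = coefs[coefs != 0].abs().sort_values(ascending=False).iloc[:n_features_cap]

# Index sorted, mirroring the side effect noted in the function's docstring
X_selected = X_toy.loc[selected.index.sort_values()]
```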

In [152]:
df_features.columns.names = ['Trial #', 'Fold #', 'Feature Category']
for feat_gp in df_features.columns.get_level_values('Feature Category').unique():
    mdprint(f"### {feat_gp}")
    _df = df_features.xs(feat_gp, level='Feature Category', axis=1)
    
    # count frequency, note that each feature is unique in each trial (i.e., in each column)
    feat_list = _df.values.flatten().tolist()
    union_feats = set(feat_list)
    counts = pd.Series({feat_name: feat_list.count(feat_name) for feat_name in union_feats}, name="Counts")
    selected_chance = counts / float(_df.shape[1])
    selected_chance = selected_chance.apply(lambda x: f"{x * 100:.02f}%")
    selected_chance.name = 'Percentage'
    counts = pd.concat([counts, selected_chance],axis=1)
    counts.drop('', inplace=True)
    mdprint("#### Feature counts")
    display(counts.sort_values(by='Counts', ascending=False).iloc[:50])
    
    # average number of features
    s = _df.apply(lambda x: (x != '').sum())
    s.name = 'Number of selected features'
    mdprint("#### Number of features")
    display(s.to_frame())
    
    mdprint(f"Average number of features selected: {s.mean()}")

### WD

#### Feature counts

Feature,Counts,Percentage
"(wavelet-LLH, glcm, ClusterTendency)",250,100.00%
"(wavelet-HHL, glcm, ClusterTendency)",250,100.00%
"(wavelet-LHL, glszm, LowGrayLevelZoneEmphasis)",250,100.00%
"(wavelet-LHH, glrlm, RunVariance)",249,99.60%
"(wavelet-HLH, glcm, MCC)",249,99.60%
"(wavelet-LLH, glszm, SmallAreaLowGrayLevelEmphasis)",248,99.20%
"(wavelet-HHH, glrlm, ShortRunLowGrayLevelEmphasis)",247,98.80%
"(wavelet-LLH, glrlm, ShortRunEmphasis)",247,98.80%
"(wavelet-LHH, firstorder, InterquartileRange)",247,98.80%
"(wavelet-LLH, firstorder, Minimum)",242,96.80%


#### Number of features

Trial #,Fold #,Number of selected features
0,0,25
0,1,21
0,2,25
0,3,25
0,4,28
...,...,...
49,0,24
49,1,26
49,2,26
49,3,27


Average number of features selected: 24.86

### nWD

#### Feature counts

Feature,Counts,Percentage
"(lbp-3D-k, glszm, ZonePercentage)",250,100.00%
"(exponential, glszm, SizeZoneNonUniformityNormalized)",250,100.00%
"(lbp-3D-m1, glcm, Correlation)",250,100.00%
"(exponential, firstorder, Kurtosis)",247,98.80%
"(exponential, ngtdm, Strength)",247,98.80%
"(squareroot, gldm, DependenceNonUniformityNormalized)",246,98.40%
"(exponential, glcm, ClusterShade)",246,98.40%
"(square, firstorder, 10Percentile)",246,98.40%
"(logarithm, glcm, InverseVariance)",246,98.40%
"(lbp-3D-k, glszm, SmallAreaLowGrayLevelEmphasis)",242,96.80%


#### Number of features

Trial #,Fold #,Number of selected features
0,0,21
0,1,20
0,2,21
0,3,22
0,4,23
...,...,...
49,0,22
49,1,22
49,2,17
49,3,21


Average number of features selected: 21.12
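
The counts and percentages above are selection frequencies over the 250 runs of each feature group. A compact way to compute the same quantities is sketched below with `value_counts`, using a small hypothetical table laid out like `df_features` (one column per run, padded with empty strings).

```python
import pandas as pd

# Hypothetical selections: one column per (trial, fold) run, padded with ""
toy_features = pd.DataFrame({
    (0, 0): ["feat_a", "feat_b", "feat_c"],
    (0, 1): ["feat_a", "feat_c", ""],
    (1, 0): ["feat_a", "feat_b", ""],
    (1, 1): ["feat_a", "", ""],
})
n_runs = toy_features.shape[1]

# Flatten all selections, drop the padding, and count each feature once per run
flat = pd.Series(toy_features.to_numpy().ravel())
counts = flat[flat != ""].value_counts()

summary = pd.DataFrame({
    "Counts": counts,
    "Percentage": counts / n_runs * 100,   # share of runs that selected the feature
})

# Average number of features selected per run
mean_selected = (toy_features != "").sum(axis=0).mean()
print(summary)
print(mean_selected)
```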