## Gait Video Study 
### Traditional ML algorithms under joint task + subject generalization frameworks: a) train on some subjects in W -> test on a separate set of subjects in WT, and b) train on some subjects in VBW -> test on a separate set of subjects in VBWT, to classify HOA/MS/PD strides and subjects
#### Remember to add the original frame count of each stride (before downsampling via smoothing) as an additional feature, so that the model receives information about the subject's speed

1. Save the optimal hyperparameters, confusion matrices and ROC curves for each algorithm.
2. Do not feed points with x, y, z, confidence = 0, 0, 0, 0 to the model: they mark missing values, not data points, so treat them before they reach the model.
3. Normalize the features (z-score normalization) before feeding them to the model.
4. The traditional models require a fixed-size 1D input for each training/testing sample, so we use summary statistics (range, CoV, and asymmetry between the right and left limbs) as the input features.
5. Implementation of task+subject generalization framework 1 (train on some subjects in W, test on a separate set of subjects in WT): there are 32 subjects in training/W, 26 subjects in testing/WT, and 25 subjects common to both. The (32-25) = 7 subjects available only in W always stay in the training set, and the (26-25) = 1 subject available only in WT always stays in the test set; the cross-validation folds are built from the 25 common subjects. With 5-fold cross-validation on the 25 common subjects, each fold therefore trains on 20 + 7 subjects and tests on 5 + 1 subjects: the 20 and 5 subjects change from fold to fold, while the 7 and 1 subjects stay fixed.
6. We use stratified group 5-fold cross-validation.
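
The fold construction in items 5 and 6 can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's `compute_train_test_indices_split`: the helper `make_folds`, the toy subject IDs and their cohort assignment are all hypothetical, and the round-robin dealing is a simple stand-in for a stratified group splitter such as scikit-learn's `StratifiedGroupKFold`.

```python
import random

def make_folds(common, train_only, test_only, n_folds=5, seed=0):
    """Hypothetical helper. `common` maps subject ID -> cohort label.

    Returns a list of (train_subjects, test_subjects) pairs, one per fold.
    """
    rng = random.Random(seed)
    by_cohort = {}
    for pid, cohort in common.items():
        by_cohort.setdefault(cohort, []).append(pid)

    # Shuffle within each cohort, then lay the cohorts end to end so that
    # dealing subjects out round-robin spreads every cohort across the folds.
    ordered = []
    for cohort in sorted(by_cohort):
        pids = sorted(by_cohort[cohort])
        rng.shuffle(pids)
        ordered.extend(pids)

    folds = []
    for k in range(n_folds):
        held_out = set(ordered[k::n_folds])  # common subjects tested in fold k
        train = (set(common) - held_out) | set(train_only)  # rotating + fixed
        test = held_out | set(test_only)                    # rotating + fixed
        folds.append((sorted(train), sorted(test)))
    return folds

# Toy IDs and cohorts (assumed), chosen only to match the 25 / 7 / 1 counts.
common = {f"c{i:02d}": ("HOA", "MS", "PD")[i % 3] for i in range(25)}
train_only = [f"w{i}" for i in range(7)]
test_only = ["wt0"]

folds = make_folds(common, train_only, test_only)
for train, test in folds:
    print(len(train), len(test))  # 27 6 on every fold
```

In the actual framework the training rows would then be the W strides of the training subjects and the test rows the WT strides of the test subjects, so no subject contributes strides to both sides of a fold.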

In [17]:
from importlib import reload
from ml_utils.imports import *

from ml_utils import cross_gen_traditionalML
reload(cross_gen_traditionalML)
from ml_utils.cross_gen_traditionalML import extract_train_test_common_PIDs, compute_train_test_indices_split
from ml_utils.cross_gen_traditionalML import models, design, plot_ROC, run_ml_models

In [18]:
path = 'C:\\Users\\Rachneet Kaur\\Box\\Gait Video Project\\GaitVideoData\\video\\'
data_path = path+'traditional_methods_dataframe.csv'
results_path = 'C:\\Users\\Rachneet Kaur\\Box\\Gait Video Project\\MLresults\\'

data = pd.read_csv(data_path, index_col= 0)
display(data.head())
        
#Whether to save the results (confusion matrices and ROC plots) or not
save_results = True 

Unnamed: 0,key,cohort,trial,scenario,video,PID,stride_number,frame_count,label,right hip-x-CoV,...,ankle-z-asymmetry,heel-x-asymmetry,heel-y-asymmetry,heel-z-asymmetry,toe 1-x-asymmetry,toe 1-y-asymmetry,toe 1-z-asymmetry,toe 2-x-asymmetry,toe 2-y-asymmetry,toe 2-z-asymmetry
0,GVS_212_T_T1_1,HOA,BW,SLWT,GVS_212_T_T1,212,1,46,0,0.046077,...,14.426173,3.407379,10.662441,0.830365,0.50257,31.450487,8.644012,5.236678,31.182183,8.215725
1,GVS_212_T_T1_2,HOA,BW,SLWT,GVS_212_T_T1,212,2,39,0,0.021528,...,1.360847,5.155307,11.363806,4.333776,1.025647,28.2664,2.671081,6.678294,15.058825,4.903579
2,GVS_212_T_T1_3,HOA,BW,SLWT,GVS_212_T_T1,212,3,56,0,0.034394,...,1.341021,8.625363,7.159495,3.366152,1.759968,17.545787,5.921325,8.243491,9.578638,3.008162
3,GVS_212_T_T1_4,HOA,BW,SLWT,GVS_212_T_T1,212,4,53,0,0.028511,...,2.375934,6.728268,0.098235,0.999027,0.541911,7.843339,4.279617,0.748023,19.471731,5.086056
4,GVS_212_T_T1_5,HOA,BW,SLWT,GVS_212_T_T1,212,5,44,0,0.025213,...,8.525816,1.775282,0.03321,9.166863,1.354601,6.674183,8.47948,4.373622,0.315168,11.795593
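
The `frame_count`, `-CoV` and `-asymmetry` columns above are per-stride summaries. The sketch below shows how such features might be derived for one left/right joint pair along one axis, while skipping keypoints recorded as (0, 0, 0, 0). The helper names are hypothetical, and the asymmetry formula (a percent symmetry index on the side means) is an assumption: the notebook does not show how the dataframe was actually built.

```python
import statistics

def clean(points):
    """Drop keypoints recorded as (0, 0, 0, 0), which mark missing detections."""
    return [p for p in points if tuple(p) != (0, 0, 0, 0)]

def stride_features(right, left, axis=0):
    """right/left: per-frame (x, y, z, confidence) tuples for one joint pair."""
    # Raw number of frames in the stride, kept as a proxy for walking speed.
    feats = {"frame_count": len(right)}
    means = {}
    for side, points in (("right", right), ("left", left)):
        values = [p[axis] for p in clean(points)]
        mean = statistics.fmean(values)
        means[side] = mean
        feats[f"{side}-range"] = max(values) - min(values)
        # Coefficient of variation; undefined when the mean is zero.
        feats[f"{side}-CoV"] = statistics.pstdev(values) / abs(mean)
    # ASSUMPTION: percent symmetry index; the source does not define asymmetry.
    feats["asymmetry"] = 100 * abs(means["right"] - means["left"]) / (
        0.5 * (abs(means["right"]) + abs(means["left"])))
    return feats

# Toy stride of three frames; the middle right-side frame is a missing detection.
right = [(1.0, 0.2, 0.1, 0.9), (0, 0, 0, 0), (3.0, 0.3, 0.1, 0.8)]
left = [(2.0, 0.2, 0.1, 0.9), (4.0, 0.2, 0.1, 0.9), (6.0, 0.2, 0.1, 0.9)]
features = stride_features(right, left)
print(features)
```

Dropping the missing frames is only one possible treatment; interpolating them from neighbouring frames would be another.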


### main()

#### Task+subject generalization together framework 1: train on walking (W) and test on walking while talking (WT) to classify HOA/MS/PD strides and subjects 

In [19]:
#We are training on some subjects of trial W and testing on separate remaining subjects of trial WT
train_framework = 'W'
test_framework = 'WT'

#Extracting the list of PIDs/subjects that are only included in the training set, only included in the testing set 
#and common PIDs in both training and testing sets 
train_pids, test_pids, common_pids = extract_train_test_common_PIDs(data, train_framework, test_framework)
design()


Original number of subjects in training task W are: 32
Original number of subjects in testing task WT are: 26
Common number of subjects across train and test frameworks:  25
Common subjects across train and test frameworks:  [404, 405, 406, 407, 408, 409, 410, 411, 310, 311, 313, 314, 318, 320, 321, 322, 323, 212, 213, 214, 215, 216, 217, 218, 219]
Number of subjects only in training framework:  7
Subjects only in training framework:  [102, 112, 113, 115, 312, 123, 124]
Number of subjects only in test framework:  1
Subjects only in test framework:  [403]
******************************************
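
The split printed above amounts to set arithmetic on the subject IDs of each trial. A minimal sketch, using the PIDs from that output; `split_pids` is a hypothetical helper and not the actual `extract_train_test_common_PIDs`, whose implementation is not shown here.

```python
# PIDs per trial, reconstructed from the output printed above.
common = [404, 405, 406, 407, 408, 409, 410, 411, 310, 311, 313, 314, 318,
          320, 321, 322, 323, 212, 213, 214, 215, 216, 217, 218, 219]
pids_W = set(common) | {102, 112, 113, 115, 312, 123, 124}
pids_WT = set(common) | {403}

def split_pids(train_pids_all, test_pids_all):
    """Hypothetical helper: return (train-only, test-only, common) PID lists."""
    train_all, test_all = set(train_pids_all), set(test_pids_all)
    return (sorted(train_all - test_all),
            sorted(test_all - train_all),
            sorted(train_all & test_all))

train_only, test_only, common_pids = split_pids(pids_W, pids_WT)
print(len(pids_W), len(pids_WT), len(common_pids))  # 32 26 25
print(train_only)  # [102, 112, 113, 115, 123, 124, 312]
print(test_only)   # [403]
```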


In [20]:
#Trial W for training 
trialW = data[data['scenario']==train_framework] #Full trial W with all 32 subjects 
#Trial WT for testing 
trialWT = data[data['scenario']==test_framework] #Full trial WT with all 26 subjects 

#Full training data stats 
print ('Number of subjects in trial W in each cohort:\n', trialW.groupby('PID').first()['cohort'].value_counts())
print('Strides in complete training set: ', len(trialW))
print ('HOA, MS and PD strides in complete training set:\n', trialW['cohort'].value_counts())
design()

#Full testing data stats 
print ('Number of subjects in trial WT in each cohort:\n', trialWT.groupby('PID').first()['cohort'].value_counts())
print('Strides in complete testing set: ', len(trialWT))
print ('HOA, MS and PD strides in complete testing set:\n', trialWT['cohort'].value_counts())
design()

#Training only data with strides from W
train_only_trialW = trialW[trialW.PID.isin(train_pids)] #subset of trial W with subjects only present in trial W but not in trial WT
print ('Number of subjects only in trial W in each cohort:\n', train_only_trialW.groupby('PID').first()['cohort'].value_counts())
print('Strides of subjects only in trial W: ', len(train_only_trialW))
print ('HOA, MS and PD strides in of subjects only in trial W :\n', train_only_trialW['cohort'].value_counts())
design()

#Testing only data with strides from WT
test_only_trialWT = trialWT[trialWT.PID.isin(test_pids)] #subset of trial WT with subjects only present in trial WT but not in trial W
print ('Number of subjects only in trial WT in each cohort:\n', test_only_trialWT.groupby('PID').first()['cohort'].value_counts())
print('Strides of subjects only in trial WT: ', len(test_only_trialWT))
print ('HOA, MS and PD strides in of subjects only in trial WT :\n', test_only_trialWT['cohort'].value_counts())
design()

#Training data with strides from W for common PIDs in trials W and WT
train_trialW_commonPID = trialW[trialW.PID.isin(common_pids)] #subset of trial W with common subjects in trial W and WT
print ('Number of subjects common to trials W and WT in each cohort:\n', train_trialW_commonPID.groupby('PID').first()['cohort'].value_counts())
print('Strides in trial W of subjects common to trials W and WT: ', len(train_trialW_commonPID))
print ('HOA, MS and PD strides in trial W of subjects common to trials W and WT:\n', train_trialW_commonPID['cohort'].value_counts())
design()

#Testing data with strides from WT for common PIDs in trials W and WT
test_trialWT_commonPID = trialWT[trialWT.PID.isin(common_pids)] #subset of trial WT with common subjects in trial W and WT
print ('Number of subjects common to trials W and WT in each cohort:\n', test_trialWT_commonPID.groupby('PID').first()['cohort'].value_counts())
print('Strides in trial WT of subjects common to trials W and WT: ', len(test_trialWT_commonPID))
print ('HOA, MS and PD strides in trial WT of subjects common to trials W and WT:\n', test_trialWT_commonPID['cohort'].value_counts())
design()

Number of subjects in trial W in each cohort:
 HOA    14
MS     10
PD      8
Name: cohort, dtype: int64
Strides in complete training set:  1380
HOA, MS and PD strides in complete training set:
 HOA    658
MS     389
PD     333
Name: cohort, dtype: int64
******************************************
Number of subjects in trial WT in each cohort:
 MS     9
PD     9
HOA    8
Name: cohort, dtype: int64
Strides in complete testing set:  1050
HOA, MS and PD strides in complete testing set:
 PD     367
HOA    351
MS     332
Name: cohort, dtype: int64
******************************************
Number of subjects only in trial W in each cohort:
 HOA    6
MS     1
Name: cohort, dtype: int64
Strides of subjects only in trial W:  372
HOA, MS and PD strides in of subjects only in trial W :
 HOA    324
MS      48
Name: cohort, dtype: int64
******************************************
Number of subjects only in trial WT in each cohort:
 PD    1
Name: cohort, dtype: int64
Strides of subjects only in trial 

In [21]:
cols_to_drop = ['key', 'cohort', 'trial', 'scenario', 'video', 'stride_number', 'label', 'PID']
X_train_common = train_trialW_commonPID.drop(cols_to_drop, axis = 1)
Y_train_common = train_trialW_commonPID[['PID', 'label']]

train_test_concatenated = pd.concat([trialW, trialWT], axis = 0).reset_index().drop('index', axis = 1)

#Shuffling the concatenated data
train_test_concatenated = shuffle(train_test_concatenated, random_state = 0)

#Computing the X (91 features), Y (PID, label) for the models 
X_full = train_test_concatenated.drop(cols_to_drop, axis=1)
Y_full = train_test_concatenated[['PID', 'label']]
print (X_full.shape, Y_full.shape) #1380 strides from W + 1050 strides from WT = 2430 rows

#Computing the training and test set indices for the CV folds 
train_indices, test_indices = compute_train_test_indices_split(train_test_concatenated, X_train_common, Y_train_common, \
                                                               train_pids, test_pids, train_framework, test_framework)
framework = 'task_and_subject_WtoWT' #Defining the task generalization framework of interest

(2430, 91) (2430, 2)
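
The z-score normalization required in item 3 is usually fitted on the training rows of each fold and then applied unchanged to the test rows, so that no test statistics leak into training. Whether `run_ml_models` does this internally is not visible in this notebook; the sketch below only illustrates the idea with hypothetical helpers and toy values.

```python
import statistics

def fit_zscore(rows):
    """Column-wise mean and standard deviation, estimated from training rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_zscore(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy feature rows (assumed values): [frame_count, some CoV feature].
train = [[46.0, 0.02], [39.0, 0.04], [56.0, 0.03]]
test = [[53.0, 0.05]]

means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
test_z = apply_zscore(test, means, stds)  # reuses the training statistics
print(train_z)
print(test_z)
```

With scikit-learn the same pattern is typically written as a `StandardScaler` fitted on the training fold and used to transform both folds.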


In [None]:
ml_models = ['random_forest', 'adaboost', 'kernel_svm', 'gbm', 'xgboost', 'knn', 'decision_tree',  'linear_svm', 
             'logistic_regression', 'mlp']
# ml_models = ['logistic_regression']
metrics = run_ml_models(ml_models, X_full, Y_full, train_indices, test_indices, framework, results_path, save_results)

random_forest


In [None]:
metrics
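
The frameworks classify both strides and subjects. How `run_ml_models` turns stride-level predictions into subject-level ones is not shown here; one common choice, assumed purely for illustration, is a majority vote over each subject's strides. The helper and the toy predictions below are hypothetical.

```python
from collections import Counter, defaultdict

def subject_predictions(pids, stride_preds):
    """Majority vote of stride-level predicted classes per subject (assumed rule)."""
    votes = defaultdict(Counter)
    for pid, pred in zip(pids, stride_preds):
        votes[pid][pred] += 1
    # On a tie, most_common keeps the class that was encountered first.
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}

# Toy stride-level predictions: three strides for one subject, two for another.
pids = [212, 212, 212, 403, 403]
stride_preds = [0, 0, 1, 2, 2]
subject_level = subject_predictions(pids, stride_preds)
print(subject_level)  # {212: 0, 403: 2}
```

Averaging the predicted class probabilities per subject would be an alternative aggregation rule.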

#### Task+subject generalization together framework 2: train on virtual beam walking (VBW) and test on virtual beam walking while talking (VBWT) to classify HOA/MS/PD strides and subjects 

In [None]:
#We are training on some subjects of trial VBW and testing on separate remaining subjects of trial VBWT
#(these trials are selected with the labels 'SLW' and 'SLWT' in the 'scenario' column)
train_framework = 'SLW'
test_framework = 'SLWT'

#Extracting the list of PIDs/subjects that are only included in the training set, only included in the testing set 
#and common PIDs in both training and testing sets 
train_pids, test_pids, common_pids = extract_train_test_common_PIDs(data, train_framework, test_framework)
design()


In [None]:
#Trial VBW for training 
trialVBW = data[data['scenario']==train_framework] #Full trial VBW with all 22 subjects 
#Trial VBWT for testing 
trialVBWT = data[data['scenario']==test_framework] #Full trial VBWT with all 21 subjects 

#Full training data stats 
print ('Number of subjects in trial VBW in each cohort:\n', trialVBW.groupby('PID').first()['cohort'].value_counts())
print('Strides in complete training set: ', len(trialVBW))
print ('HOA, MS and PD strides in complete training set:\n', trialVBW['cohort'].value_counts())
design()

#Full testing data stats 
print ('Number of subjects in trial VBWT in each cohort:\n', trialVBWT.groupby('PID').first()['cohort'].value_counts())
print('Strides in complete testing set: ', len(trialVBWT))
print ('HOA, MS and PD strides in complete testing set:\n', trialVBWT['cohort'].value_counts())
design()

#Training only data with strides from VBW
train_only_trialVBW = trialVBW[trialVBW.PID.isin(train_pids)] #subset of trial VBW with subjects only present in trial VBW but not in trial VBWT
print ('Number of subjects only in trial VBW in each cohort:\n', train_only_trialVBW.groupby('PID').first()['cohort'].value_counts())
print('Strides of subjects only in trial VBW: ', len(train_only_trialVBW))
print ('HOA, MS and PD strides in of subjects only in trial VBW :\n', train_only_trialVBW['cohort'].value_counts())
design()

#Testing only data with strides from VBWT
test_only_trialVBWT = trialVBWT[trialVBWT.PID.isin(test_pids)] #subset of trial VBWT with subjects only present in trial VBWT but not in trial VBW
print ('Number of subjects only in trial VBWT in each cohort:\n', test_only_trialVBWT.groupby('PID').first()['cohort'].value_counts())
print('Strides of subjects only in trial VBWT: ', len(test_only_trialVBWT))
print ('HOA, MS and PD strides in of subjects only in trial VBWT :\n', test_only_trialVBWT['cohort'].value_counts())
design()

#Training data with strides from VBW for common PIDs in trials VBW and VBWT
train_trialVBW_commonPID = trialVBW[trialVBW.PID.isin(common_pids)] #subset of trial VBW with common subjects in trial VBW and VBWT
print ('Number of subjects common to trials VBW and VBWT in each cohort:\n', train_trialVBW_commonPID.groupby('PID').first()['cohort'].value_counts())
print('Strides in trial VBW of subjects common to trials VBW and VBWT: ', len(train_trialVBW_commonPID))
print ('HOA, MS and PD strides in trial VBW of subjects common to trials VBW and VBWT:\n', train_trialVBW_commonPID['cohort'].value_counts())
design()

#Testing data with strides from VBWT for common PIDs in trials VBW and VBWT
test_trialVBWT_commonPID = trialVBWT[trialVBWT.PID.isin(common_pids)] #subset of trial VBWT with common subjects in trial VBW and VBWT
print ('Number of subjects common to trials VBW and VBWT in each cohort:\n', test_trialVBWT_commonPID.groupby('PID').first()['cohort'].value_counts())
print('Strides in trial VBWT of subjects common to trials VBW and VBWT: ', len(test_trialVBWT_commonPID))
print ('HOA, MS and PD strides in trial VBWT of subjects common to trials VBW and VBWT:\n', test_trialVBWT_commonPID['cohort'].value_counts())
design()

In [None]:
cols_to_drop = ['key', 'cohort', 'trial', 'scenario', 'video', 'stride_number', 'label', 'PID']
X_train_common_VBWtoVBWT = train_trialVBW_commonPID.drop(cols_to_drop, axis = 1)
Y_train_common_VBWtoVBWT = train_trialVBW_commonPID[['PID', 'label']]

train_test_concatenated_VBWtoVBWT = pd.concat([trialVBW, trialVBWT], axis = 0).reset_index().drop('index', axis = 1)

#Shuffling the concatenated data
train_test_concatenated_VBWtoVBWT = shuffle(train_test_concatenated_VBWtoVBWT, random_state = 0)

#Computing the X (91 features), Y (PID, label) for the models 
X_full_VBWtoVBWT = train_test_concatenated_VBWtoVBWT.drop(cols_to_drop, axis=1)
Y_full_VBWtoVBWT = train_test_concatenated_VBWtoVBWT[['PID', 'label']]
print (X_full_VBWtoVBWT.shape, Y_full_VBWtoVBWT.shape) #rows = strides from VBW + strides from VBWT

#Computing the training and test set indices for the CV folds 
train_indices_VBWtoVBWT, test_indices_VBWtoVBWT = compute_train_test_indices_split(train_test_concatenated_VBWtoVBWT,\
                                                                X_train_common_VBWtoVBWT, Y_train_common_VBWtoVBWT, \
                                                               train_pids, test_pids, train_framework, test_framework)
framework = 'task_and_subject_VBWtoVBWT' #Defining the task generalization framework of interest

In [None]:
ml_models = ['random_forest', 'adaboost', 'kernel_svm', 'gbm', 'xgboost', 'knn', 'decision_tree',  'linear_svm', 
             'logistic_regression', 'mlp']
metrics_VBWtoVBWT = run_ml_models(ml_models, X_full_VBWtoVBWT, Y_full_VBWtoVBWT, train_indices_VBWtoVBWT, test_indices_VBWtoVBWT,\
                                  framework, results_path, save_results)

In [None]:
metrics_VBWtoVBWT