# Notebook for the creation of the FRELSA dataframe

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Matteo Leghissa

To run this notebook, you need the wave 5 and wave 6 data from the ELSA (English Longitudinal Study of Ageing) study [https://www.elsa-project.ac.uk/].

A frailty variable based on the FFP (Fried's Frailty Phenotype) definition is created from wave 6 data.
The best variables for classifying this frailty label are selected from waves 5 and 6 using the MultiSURF algorithm.
Seven ML architectures are then trained on the classification task (detection with wave 6 and prediction with wave 5).

Let us start by importing all necessary modules:

In [2]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, GridSearchCV, train_test_split, StratifiedKFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from skrebate import multisurf
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from preprocess import load_data, preprocess_frailty_db, add_fried_w6, load_w6, load_w5
from models import get_cv_metrics
pd.options.mode.chained_assignment = None

## Computation of the Fried Frailty Phenotype

Let us open the wave 6 data (both core data and nurse visit data) and merge them into a single dataframe.

Now we can compute the frailty level for the wave 6 patients using the merged dataframe.
We drop all the variables used for the computation, and all the patients whose frailty level could not be computed.

The frailty is computed using an adaptation of the FFP definition, as described in the following table:

![FFPtable](img/FFP-table.png)
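
As a rough illustration of how an FFP-style label is usually derived, the sketch below counts five binary criteria and maps the count to a class, using the thresholds of the original Fried definition (0 criteria: non-frail, 1 to 2: pre-frail, 3 or more: frail). The column names and the helper `ffp_label` are hypothetical, and the adaptation in the table above may define the individual criteria differently; this is not the code of `add_fried_w6`.

```python
import numpy as np
import pandas as pd

# Hypothetical column names: one 0/1 flag per Fried criterion.
CRITERIA = ["weight_loss", "exhaustion", "weakness", "slowness", "low_activity"]


def ffp_label(flags: pd.DataFrame) -> pd.Series:
    """Map the five binary criteria to a frailty label.

    0 criteria -> "non-frail", 1-2 -> "pre-frail", 3-5 -> "frail".
    Rows with any missing criterion get a missing label.
    """
    # skipna=False makes the score NaN as soon as one criterion is missing.
    score = flags[CRITERIA].sum(axis=1, skipna=False)
    return pd.cut(score, bins=[-0.5, 0.5, 2.5, 5.5],
                  labels=["non-frail", "pre-frail", "frail"])


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0], [1, 0, 0, 1, 0], [1, 1, 1, 0, 0], [1, np.nan, 0, 0, 0]],
    columns=CRITERIA, index=pd.Index([101, 102, 103, 104], name="idauniq"))
labels = ffp_label(toy)
print(labels)
```

Calling `labels.dropna()` would then remove the participants whose frailty level could not be computed.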

We then save the dataframe with the new frailty variable added, to avoid repeating this step in the future.

In [None]:
core_data_path = "data/raw/wave_6_elsa_data_v2.tab"
nurse_data_path = "data/raw/wave_6_elsa_nurse_data_v2.tab"
frelsa, ffp = add_fried_w6(elsa_w6_merged=load_w6(core_data_path=core_data_path, nurse_data_path=nurse_data_path),
                     drop_columns=True, drop_rows=True)
frelsa["FFP"] = ffp
frelsa.to_csv('data/raw/wave_6_frailty_FFP_data.tab', sep='\t', index_label='idauniq', quoting=3, escapechar='\\')
frelsa.head()

## Feature selection

We can now load the wave 6 data we just saved, with its frailty label.

We can also load wave 5 data, and filter it to only keep the wave 5 patients who were still present in wave 6 and had their frailty level computed.

In [None]:
X_w6, y_w6 = load_data(file_name="wave_6_frailty_FFP_data.tab", folder_path="data/raw/", target_variable="FFP",
                     index="idauniq")
data_file_w5 = "wave_5_elsa_data_v4.tab"
X_w5 = load_w5(core_data_path="data/raw/" + str(data_file_w5), index_col='idauniq', acceptable_features=None,
            acceptable_idauniq=X_w6.index, drop_frailty_columns=None)
y_w5 = y_w6.loc[X_w5.index]
y_w5.sort_index(inplace=True)
X_w5.sort_index(inplace=True)
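
The alignment between wave 5 features and wave 6 labels can be illustrated with plain pandas. The frames below are toy stand-ins with made-up values, and the sketch assumes only that both objects are indexed by `idauniq`; the internals of `load_w5` are not reproduced here.

```python
import pandas as pd

# Toy stand-ins for the real frames; values are made up for illustration.
X_w5_all = pd.DataFrame({"age": [70, 64, 81, 59]},
                        index=pd.Index([4, 1, 3, 2], name="idauniq"))
y_w6_toy = pd.Series([0, 1, 1], index=pd.Index([3, 1, 5], name="idauniq"), name="FFP")

# Keep only wave 5 participants that also have a wave 6 frailty label.
common = X_w5_all.index.intersection(y_w6_toy.index)
X_w5_toy = X_w5_all.loc[common].sort_index()
y_w5_toy = y_w6_toy.loc[common].sort_index()

# After sorting, features and labels are aligned row by row.
print(X_w5_toy.index.tolist(), y_w5_toy.tolist())
```

Sorting both objects by index guarantees that row `i` of the features and element `i` of the labels refer to the same participant, which matters once the data are passed to scikit-learn as arrays.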

We then preprocess both wave 5 and wave 6 data to scale the values, and perform feature selection with the MultiSURF algorithm.

In [None]:
X_w6, y_w6 = preprocess_frailty_db(X=X_w6, y=y_w6, replace_missing_value=True, regex_list=None,
                                   replace_negatives=np.nan, replace_nan=None, rm_constant_features=True, min_max=True, group_frailty=True)
X_w5, y_w5 = preprocess_frailty_db(X=X_w5, y=y_w5, replace_missing_value=True, regex_list=None,
                                   replace_negatives=np.nan, replace_nan=None, rm_constant_features=True,
                                   min_max=True, group_frailty=True)

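# Keep the 50 highest-ranked features for each wave.
# Note: multisurf_feature_selection is not among the imports of the first cell;
# it must be imported from the project module that defines it before running this cell.
n_features = 50
multisurf_feature_selection(X=X_w6, y=y_w6, n_features=n_features, discrete_threshold=20, n_jobs=-1, save_features=True,
                            file_path="data/best_features/wave_6_features.tab")
multisurf_feature_selection(X=X_w5, y=y_w5, n_features=n_features, discrete_threshold=20, n_jobs=-1, save_features=True,
                            file_path="data/best_features/wave_5_features.tab")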
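
The options passed to `preprocess_frailty_db` (`replace_negatives=np.nan`, `rm_constant_features=True`, `min_max=True`) suggest three steps: treating negative codes as missing, removing constant columns, and min-max scaling. The sketch below shows what such steps could look like in pandas. `basic_preprocess` is a hypothetical helper written for illustration; the project function may behave differently.

```python
import numpy as np
import pandas as pd


def basic_preprocess(X: pd.DataFrame) -> pd.DataFrame:
    """Negative codes -> NaN, drop constant columns, min-max scale to [0, 1]."""
    X = X.mask(X < 0)                           # negative values become NaN
    X = X.loc[:, X.nunique(dropna=True) > 1]    # remove constant features
    col_min, col_max = X.min(), X.max()         # NaN is ignored by default
    return (X - col_min) / (col_max - col_min)


# Toy data, made up for illustration: "b" is constant once -1 is masked.
toy = pd.DataFrame({"a": [1, 3, -9, 5], "b": [2, 2, 2, -1], "c": [10, 20, 30, 40]})
scaled = basic_preprocess(toy)
print(scaled)
```

Constant columns are removed before scaling because their range is zero, which would otherwise cause a division by zero.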

This step is computationally costly and can take a long time, especially on machines with few CPU cores (`n_jobs=-1` conventionally requests all available cores).

The names of the best features are saved to a file, so this step does not have to be repeated.
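
The cells further down read the saved feature names from a column labelled `'0'`. One way such a file could arise is sketched below: a plain list written through an unnamed `DataFrame` gets the column label `0`, which `read_csv` returns as the string `'0'`. The scores and variable names are hypothetical, and this is not necessarily how `multisurf_feature_selection` writes its output.

```python
import os
import tempfile

import pandas as pd

# Hypothetical relevance scores, e.g. as produced by a Relief-style ranker.
scores = pd.Series({"var_a": 0.12, "var_b": 0.40, "var_c": 0.05, "var_d": 0.31})
n_features = 2
best = scores.sort_values(ascending=False).head(n_features).index.tolist()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_features.tab")
    # A DataFrame built from a plain list has a single column labelled 0.
    pd.DataFrame(best).to_csv(path, sep="\t", escapechar="\\")
    # On reload, the header is parsed as the string '0'.
    reloaded = pd.read_csv(path, sep="\t", escapechar="\\")["0"].tolist()

print(reloaded)
```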

## Model training

Let us define in a dictionary the different classifiers we want to train, with all the hyperparameters we want to try out.

In [None]:
classifiers = {"SVM_linear": [SVC(kernel='linear'), {'C': [0.1, 1, 10]}],
               "SVM_rbf": [SVC(kernel='rbf'), {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}],
               "MLP": [MLPClassifier(),
                       {'hidden_layer_sizes': [(100, 50,), (100, 75, 25,)],
                        'activation': ['relu', 'tanh'], 'alpha': [0.001, 0.0001], 'max_iter': [2000]}],
               "DT": [DecisionTreeClassifier(), {'max_depth': [5, 10, 20]}],
               "RF": [RandomForestClassifier(), {'max_depth': [5, 10, 20], 'n_estimators': [20, 50, 100]}],
               "LR": [LogisticRegression(), {'C': [0.1, 1, 10], 'max_iter': [2000]}]
               }
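
A dictionary of this shape (name mapped to `[estimator, parameter grid]`) lends itself to a simple loop over `GridSearchCV`. The sketch below shows one possible way to consume it, on synthetic data and with a reduced dictionary so that it runs quickly. It is an assumption about usage, not the implementation of `get_cv_metrics`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a reduced dictionary, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
toy_classifiers = {
    "DT": [DecisionTreeClassifier(random_state=0), {"max_depth": [2, 5]}],
    "LR": [LogisticRegression(), {"C": [0.1, 1], "max_iter": [2000]}],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

best_params = {}
for name, (estimator, grid) in toy_classifiers.items():
    # Exhaustively evaluate every combination in the grid with cross-validation.
    search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=cv)
    search.fit(X_toy, y_toy)
    best_params[name] = search.best_params_
    print(name, search.best_params_, round(search.best_score_, 3))
```

Putting fixed settings such as `max_iter` in the grid as single-element lists, as the dictionary above does, keeps them visible in the reported best parameters.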

### Detection models (wave 6)

Now we can read the file with the best features that we just saved and train the models specified in the dictionary above, optionally including a Voting Classifier.

We then save the results and, if needed, the models in pickle files.

In [None]:
X, y = X_w6, y_w6

# Load selected variables
prediction_multisurf_variables = pd.read_csv("data/best_features/wave_6_features.tab", sep='\t', 
                                             escapechar='\\')['0'].tolist()
X = X.loc[:, prediction_multisurf_variables]

# Eliminate specific repeated features (every column whose name contains 'ff')
X.drop(list(X.filter(regex = 'ff.*')), axis = 1, inplace = True)

# Cross-validation settings: metrics, number of folds, random seed and epochs
scoring = ['accuracy', 'precision_macro', 'f1_macro', 'recall_macro']
folds = 10
seed = 10
epochs = 1000

# Train models and save results
saved_model_path = "data/models/detection/model"
saved_results_path = "data/metrics/detection/results"
get_cv_metrics(X=X, y=y, scoring=scoring, voting_classifier=True, random_state=seed, epochs=epochs, cv=folds,
               results_file_path=saved_results_path, model_file_path=saved_model_path)
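
The four metrics listed in `scoring` can be computed in one pass with scikit-learn's `cross_validate`, including for a voting ensemble. The sketch below uses synthetic data, fewer folds, and an arbitrary choice of three base estimators with hard voting; how `get_cv_metrics` actually builds its Voting Classifier is not shown in this notebook, so treat this as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=10)

# Hard voting: each base model casts one vote per sample.
voter = VotingClassifier(
    estimators=[("LR", LogisticRegression(max_iter=2000)),
                ("DT", DecisionTreeClassifier(max_depth=5, random_state=10)),
                ("RF", RandomForestClassifier(n_estimators=20, random_state=10))],
    voting="hard")

scoring = ["accuracy", "precision_macro", "f1_macro", "recall_macro"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
cv_out = cross_validate(voter, X_toy, y_toy, scoring=scoring, cv=cv)

# cross_validate returns one array per metric under the key "test_<metric>".
metrics = {name: cv_out[f"test_{name}"].mean() for name in scoring}
print(metrics)
```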

And now let us check the results:

In [7]:
detection_results=pd.read_csv("data/metrics/detection/results.tab", sep='\t', escapechar='\\')
detection_results

| | Model | params | accuracy | precision_macro | f1_macro | recall_macro |
|---|---|---|---|---|---|---|
| 0 | SVM_linear | {'C': 1} | 0.738831 | 0.750903 | 0.732101 | 0.733536 |
| 1 | SVM_rbf | {'C': 1, 'gamma': 0.1} | 0.727707 | 0.733096 | 0.722859 | 0.723556 |
| 2 | MLP | {'activation': 'tanh', 'alpha': 0.0001, 'hidde... | 0.670752 | 0.670278 | 0.669749 | 0.670092 |
| 3 | DT | {'max_depth': 5} | 0.712427 | 0.716769 | 0.707991 | 0.70899 |
| 4 | RF | {'max_depth': 5, 'n_estimators': 100} | 0.73279 | 0.740454 | 0.727498 | 0.728113 |
| 5 | LR | {'C': 0.1, 'max_iter': 2000} | 0.737511 | 0.741253 | 0.733509 | 0.733782 |
| 6 | VC | | 0.731096 | 0.778618 | 0.684094 | 0.610036 |
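
To compare the models at a glance, the detection metrics shown above can be ranked with pandas. The sketch below copies the metric values from that table into a small frame (the `params` column is left out), sorts by `f1_macro`, and computes the gap between macro precision and macro recall.

```python
import pandas as pd

# Metric values copied from the detection results shown above.
results = pd.DataFrame(
    {"accuracy": [0.738831, 0.727707, 0.670752, 0.712427, 0.73279, 0.737511, 0.731096],
     "precision_macro": [0.750903, 0.733096, 0.670278, 0.716769, 0.740454, 0.741253, 0.778618],
     "f1_macro": [0.732101, 0.722859, 0.669749, 0.707991, 0.727498, 0.733509, 0.684094],
     "recall_macro": [0.733536, 0.723556, 0.670092, 0.70899, 0.728113, 0.733782, 0.610036]},
    index=["SVM_linear", "SVM_rbf", "MLP", "DT", "RF", "LR", "VC"])

# Rank the models by macro F1, best first.
ranked = results.sort_values("f1_macro", ascending=False)
print(ranked)

# Difference between macro precision and macro recall for each model.
gap = (results["precision_macro"] - results["recall_macro"]).round(3)
print(gap.sort_values(ascending=False))
```

On these numbers, logistic regression has the highest macro F1 and the linear SVM the highest accuracy, while the Voting Classifier shows by far the largest gap between macro precision and macro recall.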


### Prediction models (wave 5)

Repeat the same process for wave 5.

Note that the two processes are independent: to train only the prediction models, we can skip the section above.

In [None]:
X, y = X_w5, y_w5

# Load selected variables
prediction_multisurf_variables = pd.read_csv("data/best_features/wave_5_features.tab", sep='\t', 
                                             escapechar='\\')['0'].tolist()
X = X.loc[:, prediction_multisurf_variables]

# Eliminate specific repeated features (every column whose name contains 'ff')
X.drop(list(X.filter(regex = 'ff.*')), axis = 1, inplace = True)

# Cross-validation settings: metrics, number of folds, random seed and epochs
scoring = ['accuracy', 'precision_macro', 'f1_macro', 'recall_macro']
folds = 10
seed = 10
epochs = 1000

# Train models and save results
saved_model_path = "data/models/prediction/model"
saved_results_path = "data/metrics/prediction/results"
get_cv_metrics(X=X, y=y, scoring=scoring, voting_classifier=True, random_state=seed, epochs=epochs, cv=folds,
               results_file_path=saved_results_path, model_file_path=saved_model_path)

Let us check the results that we just saved.

In [8]:
# Read the prediction metrics saved by the cell above (not the detection ones)
prediction_results=pd.read_csv("data/metrics/prediction/results.tab", sep='\t', escapechar='\\')
prediction_results

The table originally recorded for this cell was identical to the detection results, because the cell read `data/metrics/detection/results.tab`, so it is omitted here. Running the cell with the path above displays the wave 5 metrics.
