# Define (1) prediction problem, (2) superfeatures and (3) data split

This tutorial shows how to define the prediction problem (what to predict) and the superfeatures (groups of features that are usually missing jointly) for your dataset.

Note: This tutorial also defines a data split (train/val/test) and saves it for later loading.

In [1]:
%load_ext autoreload
%autoreload 2

### Define paths 

In [3]:
# which dataset to work on 
dataset_name   = "miiv_test"


In [23]:
# data specifications
data_dir           = "../../../data/ts/" + dataset_name + "/MCAR_1/"

# earlier csv inputs (static + temporal EAV); superseded by the parquet files below
# data_file          = data_dir + dataset_name + '_static.csv.gz'
# temporal_data_file = data_dir + dataset_name + '_ts_eav.csv.gz'

# static and temporal (wide format) parquet files that are actually loaded
data_file          = data_dir + dataset_name + '_MCAR_1_static.parquet'
temporal_data_file = data_dir + dataset_name + '_MCAR_1_ts_wide.parquet'

# file to save problem
problem_file = data_dir + 'problem/' + 'problem.yaml'

# file to save superfeatures
superfeature_mapping_file = data_dir + 'superfeatures.csv'

# file for datasplit 
folds_file = data_dir + 'folds/' + 'fold_list.hkl'

## Define problem

We define the problem by specifying what we want to predict. The problem is saved in a .yaml file for faster loading.

In [1]:
from afa.data_modelling.problem.utils import load_problem_specs, save_problem_specs

In [34]:
# define problem specifications
problem_specs = { 'label_name' : ['label'], 
                  'problem'    : 'online',
                  'treatment' : None ,
                  'max_seq_len' : 120}

# save
save_problem_specs( problem_specs  = problem_specs , problem_file = problem_file ) 

In [8]:
problem_specs = load_problem_specs(problem_file = problem_file)
problem_specs

{'label_name': ['label'],
 'max_seq_len': 120,
 'problem': 'online',
 'treatment': None}

## Define superfeature_mapping

Superfeatures group multiple features that are usually acquired or missing jointly. For example, an image can be seen as a superfeature whose pixels are the features.   
They are thus especially important for defining the missingness process. 
If no superfeatures are defined, the default assumption is that every feature is its own superfeature. 
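
The default described above (every unmapped feature is its own superfeature) can be sketched in plain Python. This is only an illustration under assumptions: `complete_superfeature_mapping` is a hypothetical helper and not part of the `afa` package, and the feature names are taken from the example outputs in this tutorial.

```python
def complete_superfeature_mapping(mapping, feature_names):
    """Return a copy of `mapping` in which every feature that is not yet
    assigned to a superfeature becomes its own superfeature.

    Hypothetical helper for illustration only (not the afa implementation).
    """
    # copy the existing groups so the input dictionary is not modified
    completed = {name: list(features) for name, features in mapping.items()}
    # collect all features that already belong to some superfeature
    already_mapped = {f for features in mapping.values() for f in features}
    for feature in feature_names:
        if feature not in already_mapped:
            # singleton superfeature named after the feature itself
            completed[feature] = [feature]
    return completed


mapping = {"ABG": ["ph", "pco2", "po2"], "CRP": ["crp"]}
features = ["ph", "pco2", "po2", "crp", "hr", "age"]
completed = complete_superfeature_mapping(mapping, features)
print(completed)
```

Here `hr` and `age` are not listed in any group, so each ends up as a singleton superfeature, while `ABG` and `CRP` are left unchanged.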

Note: The superfeature generation for synthetic data is already included in the preparation00 tutorial. 

You can test the superfeature mapping by loading the data with the specified file in tutorial_classification_static.ipynb

### Option 1: Create superfeature mapping directly via a .csv file
Fill a .csv file by 
- listing the superfeature names as column names
- writing the feature names below the corresponding superfeature (columns can have different lengths). Make sure the feature names are spelled exactly as they appear in the loaded dataframe. 
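
As a rough sketch of what such a file could look like, the snippet below writes a mapping to CSV text with one column per superfeature and reads it back. It uses only the Python standard library. That shorter columns are padded with empty cells is an assumption about the layout, and `mapping_to_csv_text` / `csv_text_to_mapping` are hypothetical helpers, not `afa` functions.

```python
import csv
import io
from itertools import zip_longest


def mapping_to_csv_text(mapping):
    """Write superfeature names as the header row and feature names below.

    Shorter columns are padded with empty cells (assumed layout).
    """
    names = list(mapping)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)
    # zip_longest pads columns of different length with empty strings
    for row in zip_longest(*(mapping[n] for n in names), fillvalue=""):
        writer.writerow(row)
    return buffer.getvalue()


def csv_text_to_mapping(text):
    """Read the column layout back into a dictionary, skipping empty cells."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {
        name: [row[i] for row in body if i < len(row) and row[i] != ""]
        for i, name in enumerate(header)
    }


mapping = {"CBC": ["hgb", "mcv", "plt"], "ABG": ["ph", "pco2"], "CRP": ["crp"]}
csv_text = mapping_to_csv_text(mapping)
print(csv_text)
roundtrip = csv_text_to_mapping(csv_text)
```

The round trip recovers the original dictionary, which is a quick way to confirm that a hand-written file has the intended column structure.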

### Option 2: Define them here and save the mapping 
A second option is to define them as a dictionary and save it. 

In [2]:
from afa.data_modelling.datasets.superfeatures.utils import save_name_mapping

In [43]:
superfeature2feature_name_mapping = \
    { 'superX0' : ['X0'], 
      'superY'  : ['Y' ], 
      'superX0_ts' : ['X0_ts'], 
      'superX1_ts' : ['X1_ts'], 
      'superX2_ts' : ['X2_ts', 'X3_ts'] }

save_name_mapping( superfeature2feature_name_mapping , mapping_file  = superfeature_mapping_file   )  
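
Because feature names in the mapping must match the dataframe columns exactly, a small check before saving can catch spelling mistakes. The following is a minimal sketch; `find_unknown_features` is a hypothetical helper, and the column list is assumed to come from the loaded dataframe (for example `list(df.columns)`).

```python
def find_unknown_features(mapping, column_names):
    """Return, per superfeature, the feature names that are not in `column_names`.

    Hypothetical helper for illustration; superfeatures without problems are omitted.
    """
    known = set(column_names)
    return {
        superfeature: [f for f in features if f not in known]
        for superfeature, features in mapping.items()
        if any(f not in known for f in features)
    }


# 'x3_ts' is deliberately misspelled (lowercase x) to show the check
mapping = {"superX0": ["X0"], "superX2_ts": ["X2_ts", "x3_ts"]}
columns = ["X0", "Y", "X0_ts", "X1_ts", "X2_ts", "X3_ts"]
unknown = find_unknown_features(mapping, columns)
print(unknown)  # {'superX2_ts': ['x3_ts']}
```

An empty result means every feature name in the mapping was found among the columns.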

### Option 3: Prepared superfeature mappings (not recommended)
Lastly, for specific datasets, the creation of the superfeature mapping is stored and can be executed automatically. 

In [10]:
from afa.configurations.data_settings.define_data_settings_ts import generate_superfeature_mapping_ts
superfeature2feature_name_mapping = generate_superfeature_mapping_ts( dataset_name ,  data_dir = data_dir )



### Test by loading superfeature mapping 

#### Test 1: load superfeature mapping by itself 

In [11]:
from afa.data_modelling.datasets.superfeatures.utils import load_superfeature2feature_name_mapping
superfeature2feature_name_mapping = load_superfeature2feature_name_mapping( superfeature_mapping_file) 

In [12]:
superfeature2feature_name_mapping

{'CBC': ['hgb', 'mcv', 'mch', 'mchc', 'plt', 'wbc'],
 'diff_from_CBC': ['lymph', 'neut'],
 'BMP': ['glu', 'bun', 'bicar', 'crea', 'na', 'k', 'cl', 'ca'],
 'CMP_without_BMP': ['alb', 'bili', 'alp', 'ast', 'alt'],
 'ABG': ['ph', 'pco2', 'po2'],
 'CRP': ['crp']}

#### Test 2: load dataset with problem and superfeature mapping 

In [13]:
from afa.data_modelling.datasets.data_loader.data_loader_ts import DataLoader_ts

In [24]:
# load data
data_loader = DataLoader_ts( data_file                  = data_file,
                             temporal_data_file         = temporal_data_file,
                             superfeature_mapping_file  = superfeature_mapping_file,
                             problem_file               = problem_file)
dataset = data_loader.load() 

Padding sequences: 100%|██████████| 100/100 [00:00<00:00, 1459.23it/s]
Padding sequences: 100%|██████████| 100/100 [00:00<00:00, 1476.61it/s]
Padding sequences: 100%|██████████| 100/100 [00:00<00:00, 1725.42it/s]


In [25]:
# check superfeature mapping
dataset.superfeature2feature_name_mapping

{'CBC': ['hgb', 'mcv', 'mch', 'mchc', 'plt', 'wbc'],
 'diff_from_CBC': ['lymph', 'neut'],
 'BMP': ['glu', 'bun', 'bicar', 'crea', 'na', 'k', 'cl', 'ca'],
 'CMP_without_BMP': ['alb', 'bili', 'alp', 'ast', 'alt'],
 'ABG': ['ph', 'pco2', 'po2'],
 'CRP': ['crp'],
 'sbp': ['sbp'],
 'fio2': ['fio2'],
 'resp': ['resp'],
 'ptt': ['ptt'],
 'weight': ['weight'],
 'tnt': ['tnt'],
 'map': ['map'],
 'fgn': ['fgn'],
 'methb': ['methb'],
 'bili_dir': ['bili_dir'],
 'phos': ['phos'],
 'ckmb': ['ckmb'],
 'mg': ['mg'],
 'ck': ['ck'],
 'inr_pt': ['inr_pt'],
 'sex': ['sex'],
 'hr': ['hr'],
 'dbp': ['dbp'],
 'bnd': ['bnd'],
 'temp': ['temp'],
 'height': ['height'],
 'urine': ['urine'],
 'o2sat': ['o2sat'],
 'cai': ['cai'],
 'be': ['be'],
 'lact': ['lact'],
 'age': ['age']}

In [26]:
# check if resulting feature/superfeature names are correct 
dataset.feature_name

{'temporal': ['alb',
  'alp',
  'alt',
  'ast',
  'be',
  'bicar',
  'bili',
  'bili_dir',
  'bnd',
  'bun',
  'ca',
  'cai',
  'ck',
  'ckmb',
  'cl',
  'crea',
  'crp',
  'dbp',
  'fgn',
  'fio2',
  'glu',
  'hgb',
  'hr',
  'inr_pt',
  'k',
  'lact',
  'lymph',
  'map',
  'mch',
  'mchc',
  'mcv',
  'methb',
  'mg',
  'na',
  'neut',
  'o2sat',
  'pco2',
  'ph',
  'phos',
  'plt',
  'po2',
  'ptt',
  'resp',
  'sbp',
  'temp',
  'tnt',
  'urine',
  'wbc'],
 'data': ['age', 'sex', 'height', 'weight'],
 'treatment': None,
 'label': ['label'],
 'super_data': ['weight', 'sex', 'height', 'age'],
 'super_temporal': ['CBC',
  'diff_from_CBC',
  'BMP',
  'CMP_without_BMP',
  'ABG',
  'CRP',
  'sbp',
  'fio2',
  'resp',
  'ptt',
  'tnt',
  'map',
  'fgn',
  'methb',
  'bili_dir',
  'phos',
  'ckmb',
  'mg',
  'ck',
  'inr_pt',
  'hr',
  'dbp',
  'bnd',
  'temp',
  'urine',
  'o2sat',
  'cai',
  'be',
  'lact']}

## Define data split 

In [27]:
# define the datasplit 
dataset.multi_split( prob_list = [0.4,0.4,0.2], split_names = ["train", "val", "test"])

# save the datasplit 
dataset.save_folds( data_dir ) 
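
To make the role of `prob_list` and `split_names` concrete, here is a minimal sketch of a random split into named subsets with the given proportions. It is an assumption about the general idea only: how `multi_split` actually assigns samples (seeding, stratification, number of folds) is not shown in this tutorial, and `assign_splits` is a hypothetical helper.

```python
import random


def assign_splits(n_samples, prob_list, split_names, seed=0):
    """Randomly partition sample indices into named splits.

    Hypothetical illustration, not the afa implementation.
    """
    if len(prob_list) != len(split_names):
        raise ValueError("prob_list and split_names must have the same length")
    if abs(sum(prob_list) - 1.0) > 1e-9:
        raise ValueError("prob_list must sum to 1")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    splits = {}
    start = 0
    cumulative = 0.0
    for position, (name, prob) in enumerate(zip(split_names, prob_list)):
        cumulative += prob
        # the last split takes all remaining samples to avoid rounding gaps
        is_last = position == len(split_names) - 1
        end = n_samples if is_last else round(cumulative * n_samples)
        splits[name] = sorted(indices[start:end])
        start = end
    return splits


splits = assign_splits(100, [0.4, 0.4, 0.2], ["train", "val", "test"])
print({name: len(idx) for name, idx in splits.items()})
```

Every sample index lands in exactly one split, so the three subsets are disjoint and together cover the whole dataset.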

In [28]:
# reload data, now including the saved datasplit
data_loader = DataLoader_ts(   data_file                  = data_file,
                               temporal_data_file         = temporal_data_file,
                               superfeature_mapping_file  = superfeature_mapping_file,
                               problem_file               = problem_file, 
                               folds_file                 = folds_file)
dataset = data_loader.load() 

Padding sequences: 100%|██████████| 100/100 [00:00<00:00, 2079.88it/s]
Padding sequences: 100%|██████████| 100/100 [00:00<00:00, 2039.45it/s]
Padding sequences: 100%|██████████| 100/100 [00:00<00:00, 1790.13it/s]


In [29]:
data = dataset.get_data(fold = 0, split = "train") 

In [30]:
data.keys()

dict_keys(['feature', 'label', 'treatment', 'temporal_feature', 'time', 'superR', 'temporal_superR'])