# Data Pre-processing Notebook
In this notebook I'm running the data pre-processing phase on the Randomized DCE dataset.

**Author**: Arthur G.

## Loading Dependencies
Importing and setting up all the dependencies for this notebook.

In [1]:
# importing libs
import os
import warnings

import h5py
import numpy as np
import pandas as pd

## Helper Functions
A set of helper functions to automate data loading and pre-processing.

In [2]:
def get_subject_class(
    name: str,
    mapped_data: dict,
    hdf_data_object: h5py.Group
) -> tuple:
    """Load every mapped subject's data array and class from an HDF5 group."""
    SUBJECT_IDS = list(mapped_data.keys())
    
    # data storages
    subjects_data_arrays = []
    subject_data_targets = []
    
    # reading data arrays and targets
    for subject_id in SUBJECT_IDS:
        subjects_data_arrays.append(np.array(hdf_data_object.get(subject_id)))
        subject_data_targets.append(mapped_data[subject_id])
    
    # checking the subject with the smallest record for normalized windowing
    min_subject_cols = min([size.shape[-1] for size in subjects_data_arrays])
    current_subject_rows = subjects_data_arrays[0].shape[0]
    print(f"For '{name}', the smallest record is: {min_subject_cols} (with {current_subject_rows} rows)")
        
    return subjects_data_arrays, subject_data_targets


def calc_window_number(name: str, base_record_size: int) -> None:
    """Print every window count that splits a record of the given size evenly."""
    possible_samples = []
    
    for num in range(1, base_record_size+1):
        res = base_record_size % num
        
        if res == 0:
            window_size = base_record_size // num
            possible_samples.append(f"samples: {num} (size: {window_size})")          
            
    print(f"{name} possible windowing strategy: \n {possible_samples} \n\n")
    

def array_to_img(
    data_matrix: np.ndarray,
    target_value: str,
    n_samples: int,
    rows: int,
    cols: int, 
    axis: int = 1
) -> tuple:
    """Convert a numpy data matrix into an image dataset and its targets."""
    if axis not in [0, 1]:
        raise ValueError("Axis not valid!")
    
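    # processing data matrix
    raw_data_matrix = data_matrix[:,0:cols]
    samples_col = cols // n_samples
    splitted_data_matrix = np.array_split(raw_data_matrix, n_samples, axis=axis)
    # transpose each (rows, samples_col) window before adding the channel axis;
    # a bare reshape keeps the buffer order and would mix rows within an image
    data = np.array(splitted_data_matrix).transpose(0, 2, 1).reshape((n_samples, 1, samples_col, rows))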
    
    # processing sample's target
    target = np.array([target_value]*n_samples)
        
    return data, target
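
To make the windowing step concrete, here is a minimal sketch on a toy matrix. It assumes each window should be stored as a single-channel image with time steps along the first image axis and the original rows along the second. The helper name `window_matrix` and the toy sizes are illustrative and not part of this notebook's pipeline.

```python
import numpy as np


def window_matrix(data_matrix: np.ndarray, n_samples: int, cols: int) -> np.ndarray:
    """Split a (rows, cols) matrix into n_samples single-channel windows."""
    if cols % n_samples != 0:
        raise ValueError("cols must be divisible by n_samples")

    # keep only the first `cols` columns, then cut them into equal windows
    windows = np.split(data_matrix[:, :cols], n_samples, axis=1)

    # stack to (n_samples, rows, window_size), swap the last two axes so
    # time runs along the first image axis, then add a channel axis
    stacked = np.stack(windows).transpose(0, 2, 1)
    return stacked[:, np.newaxis, :, :]


# toy record: 2 rows, 7 columns (the last column is dropped by cols=6)
toy = np.arange(14).reshape(2, 7)
images = window_matrix(toy, n_samples=3, cols=6)

print(images.shape)   # (3, 1, 2, 2)
print(images[0, 0])   # first window: columns 0-1 of both rows, transposed
```

The sketch transposes explicitly because a plain `reshape` to the same target shape keeps the buffer order, so values from different rows would end up in the same image row.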

## Loading Data
Loading the raw randomized dataset from the H5 file as well as its separate metadata.

Loading metadata.

In [4]:
# subjects metadata.
metadata = pd.read_excel(os.path.join("..", "references", "cognitive_training_metadata.xlsx"))
metadata["Operador"] = metadata.Operador.map(lambda x: x.replace("'", ""))
metadata["Classe"] = metadata.Classe.map(lambda x: x.replace("'", ""))

# mapping subject id to class
subject_to_class_map = metadata[["Operador", "Classe"]].set_index("Operador").to_dict()["Classe"]
metadata.head()

| Unnamed: 0 | Operador | Classe | Produtividade | prodClassesArray de | prodClassesArray até | DCEDataArray de | DCEDataArray até |
|---|---|---|---|---|---|---|---|
| 0 | Sujeito01_AAB_36_GCA_RT_OA_ICA_C3 | B | 9751.651897 | 1 | 177 | 1 | 53100 |
| 1 | Sujeito01_AAB_36_GCA_RTpos_OA_ICA_C3 | B | 9751.651897 | 178 | 355 | 53101 | 106500 |
| 2 | Sujeito01_AAB_36_GCD_RT_OA_ICA_C3 | B | 9647.2438 | 356 | 532 | 106501 | 159600 |
| 3 | Sujeito01_AAB_36_GCD_RTpos_OA_ICA_C3 | B | 9647.2438 | 533 | 709 | 159601 | 212700 |
| 4 | Sujeito02_ASC_33_GCA_RT_OA_ICA_C3 | D | 6523.4397 | 710 | 887 | 212701 | 266100 |
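
The subject-to-class lookup built above can be sketched on a toy table. This is a minimal illustration that assumes the raw cells wrap their values in single quotes, as the `replace` calls suggest. The subject IDs and classes in it are made up, and `toy_metadata` and `toy_map` are hypothetical names.

```python
import pandas as pd

# toy metadata with the same column names as the spreadsheet;
# the subject IDs and classes below are made up for illustration
toy_metadata = pd.DataFrame(
    {
        "Operador": ["'subject_a'", "'subject_b'", "'subject_c'"],
        "Classe": ["'B'", "'D'", "'B'"],
    }
)

# strip the stray single quotes with the vectorized string accessor
for column in ("Operador", "Classe"):
    toy_metadata[column] = toy_metadata[column].str.replace("'", "", regex=False)

# build the subject -> class lookup
toy_map = dict(zip(toy_metadata["Operador"], toy_metadata["Classe"]))
print(toy_map)
```

The `dict(zip(...))` form yields the same kind of mapping as `set_index(...).to_dict()[...]` when subject IDs are unique.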


Loading randomized data.

In [5]:
# reading h5py file
data = h5py.File(os.path.join("..", "data", "raw", "bancoDCEComParametrosRandomizados.h5"), "r")

# reading each randomized dataset
original_data, original_target = get_subject_class("Original data", subject_to_class_map, data.get("Original"))
similar_m_data, similar_m_target = get_subject_class("Similar M data", subject_to_class_map, data.get("M rand Similar"))
large_tau_data, large_tau_target = get_subject_class("Large Tau data", subject_to_class_map, data.get("Tau rand Grande"))
smaller_tau_data, smaller_tau_target = get_subject_class("Smaller Tau data", subject_to_class_map, data.get("Tau rand Menor"))
similar_tau_data, similar_tau_target = get_subject_class("Similar Tau data", subject_to_class_map, data.get("Tau rand Similar"))

data.close()

For 'Original data', the smallest record is: 53228 (with 6 rows)
For 'Similar M data', the smallest record is: 52862 (with 12 rows)
For 'Large Tau data', the smallest record is: 52862 (with 6 rows)
For 'Smaller Tau data', the smallest record is: 53512 (with 6 rows)
For 'Similar Tau data', the smallest record is: 53216 (with 6 rows)


## Data Pre-processing
In this section I'm running the data pre-processing step, which consists of:
+ Windowing of each subset (normalized by the subject with the smallest record from each group).
+ Organization into different dataset files for further machine learning modeling.

I'm starting by finding a workable number of windows for each subset of raw data. *Based on earlier tests, I changed the base record size (the number of columns of the subject with the smallest record)* used to normalize the split, because some of the subsets ended up with too few columns, which may hurt the performance of our algorithms. The `calc_window_number` calls below use 52839 as the base record size for every subset.
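
The divisor search described above can be sketched as a function that returns the options instead of printing them. The second helper shows one possible way to reach a base size that splits evenly: truncate a record length to a multiple of the desired window count. Both helper names are hypothetical. The value 52862 is the smallest record length reported by the loading cell, and whether 52839 was actually chosen this way is an assumption.

```python
def window_options(base_record_size: int) -> list[tuple[int, int]]:
    """Return every (n_windows, window_size) pair that splits the record evenly."""
    return [
        (n_windows, base_record_size // n_windows)
        for n_windows in range(1, base_record_size + 1)
        if base_record_size % n_windows == 0
    ]


def truncate_to_multiple(record_size: int, n_windows: int) -> int:
    """Largest length <= record_size that divides evenly into n_windows."""
    return (record_size // n_windows) * n_windows


options = window_options(52839)
print(len(options))                      # 16 even splits
print((513, 103) in options)             # True
print(truncate_to_multiple(52862, 513))  # 52839
```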

In [5]:
calc_window_number("Original data", 52839)
calc_window_number("Similar M data", 52839)
calc_window_number("Large Tau data", 52839)
calc_window_number("Smaller Tau data", 52839)
calc_window_number("Similar Tau data", 52839)

Original data possible windowing strategy: 
 ['samples: 1 (size: 52839)', 'samples: 3 (size: 17613)', 'samples: 9 (size: 5871)', 'samples: 19 (size: 2781)', 'samples: 27 (size: 1957)', 'samples: 57 (size: 927)', 'samples: 103 (size: 513)', 'samples: 171 (size: 309)', 'samples: 309 (size: 171)', 'samples: 513 (size: 103)', 'samples: 927 (size: 57)', 'samples: 1957 (size: 27)', 'samples: 2781 (size: 19)', 'samples: 5871 (size: 9)', 'samples: 17613 (size: 3)', 'samples: 52839 (size: 1)'] 


Similar M data possible windowing strategy: 
 ['samples: 1 (size: 52839)', 'samples: 3 (size: 17613)', 'samples: 9 (size: 5871)', 'samples: 19 (size: 2781)', 'samples: 27 (size: 1957)', 'samples: 57 (size: 927)', 'samples: 103 (size: 513)', 'samples: 171 (size: 309)', 'samples: 309 (size: 171)', 'samples: 513 (size: 103)', 'samples: 927 (size: 57)', 'samples: 1957 (size: 27)', 'samples: 2781 (size: 19)', 'samples: 5871 (size: 9)', 'samples: 17613 (size: 3)', 'samples: 52839 (size: 1)'] 


Large Tau dat

### Samples Split
In this step I'm splitting the raw DCE data into samples for AI/ML modeling.

Starting with the original DCE data (calculated based on Takens' theorem).

In [19]:
original_data_samples = []
original_data_target = []

for idx in np.arange(len(original_data)):
    # splitting samples for original data
    current_samples, current_targets = array_to_img(
        original_data[idx],
        original_target[idx],
        n_samples=513, 
        rows=6, 
        cols=52839
    )
    
    # storing original data
    original_data_samples.append(current_samples)
    original_data_target.extend(current_targets)
    
original_data_samples = np.vstack(original_data_samples)
original_data_target = np.array(original_data_target)

print(f"Original data shape: {original_data_samples.shape}")

Original data shape: (30780, 1, 103, 6)


Now for the Similar M data.

In [7]:
similar_m_data_samples = []
similar_m_data_target = []

for idx in np.arange(len(similar_m_data)):
    # splitting samples for similar m data
    current_samples, current_targets = array_to_img(
        similar_m_data[idx],
        similar_m_target[idx],
        n_samples=513,
        rows=12,
        cols=52839
    )

    # storing similar m data
    similar_m_data_samples.append(current_samples)
    similar_m_data_target.extend(current_targets)

similar_m_data_samples = np.vstack(similar_m_data_samples)
similar_m_data_target = np.array(similar_m_data_target)

print(f"Similar M data shape: {similar_m_data_samples.shape}")

Similar M data shape: (30780, 1, 103, 12)


And now for large Tau data.

In [8]:
large_tau_data_samples = []
large_tau_data_target = []

for idx in np.arange(len(large_tau_data)):
    # splitting samples for large tau data
    current_samples, current_targets = array_to_img(
        large_tau_data[idx],
        large_tau_target[idx],
        n_samples=513,
        rows=6,
        cols=52839
    )

    # storing large tau data
    large_tau_data_samples.append(current_samples)
    large_tau_data_target.extend(current_targets)

large_tau_data_samples = np.vstack(large_tau_data_samples)
large_tau_data_target = np.array(large_tau_data_target)

print(f"Large Tau data shape: {large_tau_data_samples.shape}")

Large Tau data shape: (30780, 1, 103, 6)


For smaller Tau data.

In [9]:
smaller_tau_data_samples = []
smaller_tau_data_target = []

for idx in np.arange(len(smaller_tau_data)):
    # splitting samples for smaller tau data
    current_samples, current_targets = array_to_img(
        smaller_tau_data[idx],
        smaller_tau_target[idx],
        n_samples=513,
        rows=6,
        cols=52839
    )

    # storing smaller tau data
    smaller_tau_data_samples.append(current_samples)
    smaller_tau_data_target.extend(current_targets)

smaller_tau_data_samples = np.vstack(smaller_tau_data_samples)
smaller_tau_data_target = np.array(smaller_tau_data_target)

print(f"Smaller Tau data shape: {smaller_tau_data_samples.shape}")

Smaller Tau data shape: (30780, 1, 103, 6)


And finally for similar Tau data.

In [10]:
similar_tau_data_samples = []
similar_tau_data_target = []

for idx in np.arange(len(similar_tau_data)):
    # splitting samples for similar tau data
    current_samples, current_targets = array_to_img(
        similar_tau_data[idx],
        similar_tau_target[idx],
        n_samples=513,
        rows=6,
        cols=52839
    )

    # storing similar tau data
    similar_tau_data_samples.append(current_samples)
    similar_tau_data_target.extend(current_targets)

similar_tau_data_samples = np.vstack(similar_tau_data_samples)
similar_tau_data_target = np.array(similar_tau_data_target)

print(f"Similar Tau data shape: {similar_tau_data_samples.shape}")

Similar Tau data shape: (30780, 1, 103, 6)
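
The five loops above repeat the same steps, so they could be folded into one helper. The sketch below is a hedged illustration on toy data, not the code that produced the shapes reported here. `build_samples` is a hypothetical name, and it assumes time steps go on the first image axis. With 513 windows per record, the reported 30780 samples correspond to 30780 / 513 = 60 subject records.

```python
import numpy as np


def build_samples(records, targets, n_samples, cols):
    """Window every subject record and repeat its label once per window."""
    samples, labels = [], []

    for record, target in zip(records, targets):
        # cut the first `cols` columns into n_samples equal windows
        windows = np.stack(np.split(record[:, :cols], n_samples, axis=1))
        # (n_samples, rows, size) -> (n_samples, 1, size, rows)
        samples.append(windows.transpose(0, 2, 1)[:, np.newaxis])
        labels.extend([target] * n_samples)

    return np.vstack(samples), np.array(labels)


# toy stand-in: 4 subjects, 6 rows each, records of different lengths
rng = np.random.default_rng(0)
toy_records = [rng.normal(size=(6, length)) for length in (40, 37, 45, 36)]
toy_targets = ["B", "D", "B", "A"]

X, y = build_samples(toy_records, toy_targets, n_samples=4, cols=36)
print(X.shape, y.shape)  # (16, 1, 9, 6) (16,)
```

Each subset would then need a single call with its own `rows`-shaped records, for example 12 rows for the Similar M data.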


## Data Serialization
In this final step I'm serializing the pre-processed data to NPY files.

In [11]:
# original data serialization
np.save(
    os.path.join("..", "data", "processed", "original_processed_data.npy"),
    original_data_samples,
    allow_pickle=True,
    fix_imports=True
)

np.save(
    os.path.join("..", "data", "processed", "original_processed_targets.npy"),
    original_data_target,
    allow_pickle=True,
    fix_imports=True
)

# similar m data serialization
np.save(
    os.path.join("..", "data", "processed", "similar_m_processed_data.npy"),
    similar_m_data_samples,
    allow_pickle=True,
    fix_imports=True
)

np.save(
    os.path.join("..", "data", "processed", "similar_m_processed_targets.npy"),
    similar_m_data_target,
    allow_pickle=True,
    fix_imports=True
)

# large tau data serialization
np.save(
    os.path.join("..", "data", "processed", "large_tau_processed_data.npy"),
    large_tau_data_samples,
    allow_pickle=True,
    fix_imports=True
)

np.save(
    os.path.join("..", "data", "processed", "large_tau_processed_targets.npy"),
    large_tau_data_target,
    allow_pickle=True,
    fix_imports=True
)

# smaller tau data serialization
np.save(
    os.path.join("..", "data", "processed", "smaller_tau_processed_data.npy"),
    smaller_tau_data_samples,
    allow_pickle=True,
    fix_imports=True
)

np.save(
    os.path.join("..", "data", "processed", "smaller_tau_processed_targets.npy"),
    smaller_tau_data_target,
    allow_pickle=True,
    fix_imports=True
)

# similar tau data serialization
np.save(
    os.path.join("..", "data", "processed", "similar_tau_processed_data.npy"),
    similar_tau_data_samples,
    allow_pickle=True,
    fix_imports=True
)

np.save(
    os.path.join("..", "data", "processed", "similar_tau_processed_targets.npy"),
    similar_tau_data_target,
    allow_pickle=True,
    fix_imports=True
)
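
A quick round-trip check can confirm that the saved files load back with matching samples and targets. The sketch below uses toy arrays and a temporary directory rather than the real `data/processed` files. The file names and array sizes in it are illustrative assumptions.

```python
import os
import tempfile

import numpy as np

# toy arrays standing in for one subset's samples and targets
toy_samples = np.zeros((8, 1, 5, 6), dtype=np.float64)
toy_targets = np.array(["B", "D"] * 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "toy_processed_data.npy")
    targets_path = os.path.join(tmp_dir, "toy_processed_targets.npy")

    np.save(data_path, toy_samples)
    np.save(targets_path, toy_targets)

    # numeric and fixed-width string arrays load without pickle support
    loaded_samples = np.load(data_path, allow_pickle=False)
    loaded_targets = np.load(targets_path, allow_pickle=False)

# one label per sample, and identical content after the round trip
assert loaded_samples.shape[0] == loaded_targets.shape[0]
assert np.array_equal(loaded_samples, toy_samples)
assert np.array_equal(loaded_targets, toy_targets)
print(loaded_samples.shape, loaded_targets.dtype)
```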