# Transforming datasets

Transforms can be applied to a dataset to obtain a new dataset. Here we show how to apply a sequence of transformations to a multi modal dataset, generating a new multi modal dataset.
All transformations implement the `librep.base.transform.Transform` interface.

## Transforming MultiModal Datasets

MultiModal datasets allow features to be partitioned into windows.
Librep provides `librep.datasets.multimodal.TransformMultiModalDataset`, which applies the same transformation to every window of the dataset. Transformations can also be chained: they are applied in sequence and produce a new multi modal dataset of type `ArrayMultiModalDataset`. The window slices are recalculated automatically if a transformation adds or removes features from the windows.


The operation of `librep.datasets.multimodal.TransformMultiModalDataset` is illustrated in the figure below.
Suppose we have two transforms (`Transform 1` and `Transform 2`) that implement the `librep.base.transform.Transform` interface, and a MultiModal dataset (*e.g.* `ArrayMultiModalDataset`, `PandasMultiModalDataset`, or any other class inheriting from `librep.datasets.multimodal.MultiModalDataset`).
With `librep.datasets.multimodal.TransformMultiModalDataset`, `Transform 1` is applied to each window of the input dataset, generating a new dataset. `Transform 2` is then applied to each window of the resulting dataset.

![A windowed dataset transformation figure](./images/windowed-dataset-transform.svg "Windowed dataset transform")
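
The per-window behavior described above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not Librep's actual implementation: `apply_per_window`, `center`, and `min_max` are hypothetical names, and the sketch assumes that the transformed windows are joined column-wise.

```python
import numpy as np


def apply_per_window(X, window_slices, transforms):
    """Hypothetical helper: apply each transform, in order, to every window."""
    for transform in transforms:
        # Apply the current transform to each window independently
        outputs = [transform(X[:, start:end]) for start, end in window_slices]

        # Recompute the slices from the width of each transformed window
        window_slices, offset = [], 0
        for out in outputs:
            window_slices.append((offset, offset + out.shape[1]))
            offset += out.shape[1]

        # Join the windows column-wise (assumption) to form the next input
        X = np.concatenate(outputs, axis=1)
    return X, window_slices


def center(window):
    # Keeps the number of features: subtract each sample's mean
    return window - window.mean(axis=1, keepdims=True)


def min_max(window):
    # Reduces each window to 2 features: per-sample minimum and maximum
    return np.stack([window.min(axis=1), window.max(axis=1)], axis=1)


X = np.arange(12).reshape(2, 6)  # 2 samples, 6 features
new_X, new_slices = apply_per_window(X, [(0, 3), (3, 6)], [center, min_max])

print(new_slices)   # [(0, 2), (2, 4)]
print(new_X.shape)  # (2, 4)
```

Each window shrinks from 3 features to 2 after the second transform, so the slices move from `(0, 3), (3, 6)` to `(0, 2), (2, 4)`.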

> **_NOTE 1_**: `librep.datasets.multimodal.TransformMultiModalDataset` assumes that the transformation can be applied to each window in parallel and that it does not overwrite the contents, that is, it generates a new dataset.

> **_NOTE 2_**: For now, `librep.datasets.multimodal.TransformMultiModalDataset` always produces an `ArrayMultiModalDataset`, regardless of the type of the input dataset.

> **_NOTE 3_**: For now, `librep.datasets.multimodal.TransformMultiModalDataset` applies each transformation to all windows of the multi modal dataset. Other ways of controlling how transforms are applied are not supported yet.

> **_NOTE 4_**: For now, transforms must receive NumPy arrays. If you are using `PandasMultiModalDataset`, set the `as_array` parameter to `True`.

Let's create two transforms: 

- `SumTransform` adds `value` (passed as a parameter) to every element of the dataset (or of each window, if using a MultiModal dataset). This does not change the number of features per window.
- `MeanTransform` calculates the mean value of each sample of the dataset (or of each window, if using a MultiModal dataset), reducing the features to a single feature (per window).

> **_NOTE_**: Transforms can add or remove features or samples per window. When a single transform is applied, the number of samples generated must be the same for all windows.
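
One plausible reason for this constraint is that NumPy cannot concatenate arrays with different numbers of rows. This assumes the transformed windows are joined column-wise into a single array, which is an assumption about the implementation rather than something stated here. A minimal sketch:

```python
import numpy as np

window_a = np.zeros((4, 1))  # 4 samples, 1 feature
window_b = np.zeros((4, 3))  # 4 samples, 3 features
window_c = np.zeros((3, 1))  # 3 samples: one sample was dropped

# Different numbers of features per window are fine
joined = np.concatenate([window_a, window_b], axis=1)
print(joined.shape)  # (4, 4)

# Different numbers of samples per window cannot be joined
try:
    np.concatenate([window_a, window_c], axis=1)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```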

In [2]:
import numpy as np

from librep.base.transform import Transform
from librep.datasets.multimodal import ArrayMultiModalDataset, TransformMultiModalDataset

In [3]:
# This transform adds value to every element of the array X
class SumTransform(Transform):
    def __init__(self, value: int):
        self.value = value

    def transform(self, X):
        return X + self.value


# This transform iterates over each sample of X and calculates its mean
# It returns an array of shape (n_samples, 1), which is why expand_dims
# is used at the end
class MeanTransform(Transform):
    def transform(self, X):
        samples = []
        for x in X:
            samples.append(np.mean(x))
        return np.expand_dims(np.array(samples), axis=1)
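
The loop in `MeanTransform` can also be written as a single vectorized NumPy call. The sketch below compares both forms using plain functions (`row_mean_loop` and `row_mean_vectorized` are illustrative names, not part of Librep) and assumes `X` is a 2D array of shape `(n_samples, n_features)`.

```python
import numpy as np


def row_mean_loop(X):
    # Same logic as MeanTransform.transform: one mean per sample
    samples = []
    for x in X:
        samples.append(np.mean(x))
    return np.expand_dims(np.array(samples), axis=1)


def row_mean_vectorized(X):
    # keepdims=True keeps the result 2D, with shape (n_samples, 1)
    return np.mean(X, axis=1, keepdims=True)


window = np.array([[10, 11], [14, 15], [18, 19], [22, 23]])
loop_result = row_mean_loop(window)
vectorized_result = row_mean_vectorized(window)

print(vectorized_result.shape)  # (4, 1)
print(np.allclose(loop_result, vectorized_result))  # True
```

The vectorized form avoids the Python-level loop, which usually matters for windows with many samples.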

Let's create an `ArrayMultiModalDataset` dataset with 4 samples and 4 features. Columns 0 and 1 will be window 0 and columns 2 and 3 will be window 1.

In [4]:
samples = np.arange(16).reshape(4, 4)
labels = np.array([0, 0, 1, 1])

multi_modal_dataset = ArrayMultiModalDataset(
    X=samples, y=labels,
    window_slices=[(0, 2), (2, 4)], # window 0 is composed of columns 0 and 1
                                    # window 1 is composed of columns 2 and 3
    window_names=["a", "b"]         # Optional parameter giving the name
                                    # of each window
)

print(f"There are {len(multi_modal_dataset)} samples")
print(f"Number of windows: {multi_modal_dataset.num_windows}")
print(f"The window slices: {multi_modal_dataset.window_slices}")

There are 4 samples
Number of windows: 2
The window slices: [(0, 2), (2, 4)]


Now, we will instantiate the transform objects and create a transform chain.
To create a chain, we instantiate a `TransformMultiModalDataset` object and pass a list of transform objects as the `transforms` parameter.

In [5]:
# Instantiate the SumTransform with value = 10
# So 10 will be added to every element of each window
sum_transform = SumTransform(value=10)
# Instantiate the MeanTransform object
mean_transform = MeanTransform()

# Create a transformer that transforms a dataset
# The sum_transform and mean_transform will be applied to each window
# of the dataset, sequentially
transformer = TransformMultiModalDataset(
    transforms=[sum_transform, mean_transform]
)

We can now apply the sequence of transforms to a dataset. To do so, we call the `TransformMultiModalDataset` object, passing the `MultiModalDataset` as the input parameter. The result is a new `ArrayMultiModalDataset` with the transformed samples. The new `window_slices` are calculated automatically.

In [6]:
transformed_dataset = transformer(multi_modal_dataset)

print(f"There are {len(transformed_dataset)} samples")
print(f"Number of windows: {transformed_dataset.num_windows}")
print(f"The window slices: {transformed_dataset.window_slices}")

There are 4 samples
Number of windows: 2
The window slices: [(0, 1), (1, 2)]


In [7]:
samples, labels = transformed_dataset[:]
print(f"Samples:\n{samples}")
print(f"Labels:\n{labels}")

Samples:
[[10.5 12.5]
 [14.5 16.5]
 [18.5 20.5]
 [22.5 24.5]]
Labels:
[0 0 1 1]
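
These values can be checked by hand with plain NumPy, without Librep: add 10 to each window and take the per-sample mean. The variable names below are illustrative.

```python
import numpy as np

X = np.arange(16).reshape(4, 4)

# Window "a" is columns 0-1, window "b" is columns 2-3
window_a = (X[:, 0:2] + 10).mean(axis=1, keepdims=True)
window_b = (X[:, 2:4] + 10).mean(axis=1, keepdims=True)

# One feature per window, placed side by side
expected = np.hstack([window_a, window_b])
print(expected)
```

For the first sample, window "a" holds `[0, 1]`, which becomes `[10, 11]` after the sum and `10.5` after the mean, matching the first printed value.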


The same transform can be applied to another `MultiModalDataset`.

In [8]:
# Set the seed to allow reproducibility.
# This will generate the same random numbers
np.random.seed(0)
samples = np.random.random(16).reshape(4, 4)
labels = np.array([0, 0, 1, 1])

multi_modal_dataset_2 = ArrayMultiModalDataset(
    X=samples, y=labels,
    window_slices=[(0, 2), (2, 4)], # window 0 is composed of columns 0 and 1
                                    # window 1 is composed of columns 2 and 3
    window_names=["a", "b"]         # Optional parameter giving the name
                                    # of each window
)

samples, labels = multi_modal_dataset_2[:]
print(f"Samples:\n{samples}")
print(f"Labels:\n{labels}")

Samples:
[[0.5488135  0.71518937 0.60276338 0.54488318]
 [0.4236548  0.64589411 0.43758721 0.891773  ]
 [0.96366276 0.38344152 0.79172504 0.52889492]
 [0.56804456 0.92559664 0.07103606 0.0871293 ]]
Labels:
[0 0 1 1]


In [9]:
transformed_dataset_2 = transformer(multi_modal_dataset_2)

print(f"There are {len(transformed_dataset_2)} samples")
print(f"Number of windows: {transformed_dataset_2.num_windows}")
print(f"The window slices: {transformed_dataset_2.window_slices}")

There are 4 samples
Number of windows: 2
The window slices: [(0, 1), (1, 2)]


In [10]:
samples, labels = transformed_dataset_2[:]
print(f"Samples:\n{samples}")
print(f"Labels:\n{labels}")

Samples:
[[10.63200144 10.57382328]
 [10.53477446 10.66468011]
 [10.67355214 10.66030998]
 [10.7468206  10.07908268]]
Labels:
[0 0 1 1]
