# Creating datasets

### Outline

In this tutorial you will learn how to organize data for training, and in particular the difference between:

- datasets
- dataloaders 
- datamodules

We will also look at some helper functions for creating:

- datasets from COLVAR files
- time-lagged datasets

In a nutshell:
- [datasets](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) are objects which store the input data as well as additional quantities like labels or weights that are going to be used in the training. 
- [dataloaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) wrap an iterable around datasets to allow for easy access to data (as well as collating inputs into batches). 
- [datamodules](https://pytorch-lightning.readthedocs.io/en/1.8.1/data/datamodule.html) encapsulate all the steps needed to process data, e.g. splitting the datasets and creating the dataloaders.

### Datasets

We subclassed `torch.utils.data.Dataset` into a `DictionaryDataset` which stores the information inside a dictionary and returns a dictionary with the batched data when sliced. 

The **keys** depend on the kind of learning task:
- Unsupervised: "data" (,"weights")
- Supervised
    - Regression: "data", "target" (,"weights")
    - Classification: "data", "labels"
- Time-lagged: "data", "data_lag" (,"weights","weights_lag")

The **values** can be either `torch.Tensor` objects or NumPy arrays / lists, which are passed to `torch.Tensor()`.

In [6]:
import torch
from mlcolvar.data import DictionaryDataset

# the constructor takes a dictionary as input.
n_samples, n_features = 100, 2
dataset = DictionaryDataset({'data': torch.rand((n_samples,n_features)),
                             'target': torch.rand((n_samples,))
                             })

dataset

DictionaryDataset( "data": [100, 2], "target": [100] )

If the dataset is indexed with a string, it returns the corresponding value of the underlying dictionary;
if it is indexed with an integer or a slice, it returns a dictionary with every entry sliced accordingly:

In [7]:
# access with a key 
print('dataset["data"] -->', dataset["data"].shape )
# access the 0-th element
print('\ndataset[0] =', dataset[0] )
# slice the dataset
print('\ndataset[0:3] =', dataset[0:3] )

dataset["data"] --> torch.Size([100, 2])

dataset[0] = {'data': tensor([0.0238, 0.6240]), 'target': tensor(0.3217)}

dataset[0:3] = {'data': tensor([[0.0238, 0.6240],
        [0.6782, 0.4476],
        [0.8055, 0.8887]]), 'target': tensor([0.3217, 0.6375, 0.5045])}
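
The two access modes shown above can be illustrated with a minimal pure-Python sketch. The class `TinyDictDataset` is a hypothetical stand-in written for this example (plain lists instead of tensors); it is not the mlcolvar implementation, which may differ in its details.

```python
class TinyDictDataset:
    """Hypothetical stand-in for a dictionary-backed dataset (not mlcolvar code)."""

    def __init__(self, dictionary):
        # Every entry must describe the same number of samples.
        lengths = {len(v) for v in dictionary.values()}
        if len(lengths) != 1:
            raise ValueError("all values must have the same number of samples")
        self._dict = {k: list(v) for k, v in dictionary.items()}

    def __len__(self):
        return len(next(iter(self._dict.values())))

    def __getitem__(self, index):
        # A string selects one entry of the underlying dictionary.
        if isinstance(index, str):
            return self._dict[index]
        # An integer or a slice is applied to every entry.
        return {k: v[index] for k, v in self._dict.items()}

    def __setitem__(self, key, value):
        # New keys (e.g. "weights") must match the number of samples.
        if len(value) != len(self):
            raise ValueError("new entry must have one value per sample")
        self._dict[key] = list(value)


toy = TinyDictDataset({"data": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
                       "target": [0.1, 0.2, 0.3]})
first = toy[0]      # dictionary holding the 0-th sample of every entry
head = toy[0:2]     # dictionary holding the first two samples of every entry
toy["weights"] = [1.0, 1.0, 0.5]
```

Dispatching on the type of the index keeps a single `__getitem__` for both uses, so the same object can act as a dictionary and as a sequence of samples.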


You can also add keys to the dataset, e.g. to assign different weights to the samples:

In [8]:
dataset['weights'] = torch.rand(100)

dataset

DictionaryDataset( "data": [100, 2], "target": [100], "weights": [100] )

### Dataloaders

Dataloaders wrap an iterable around the dataset so that samples can be easily collated into batches and used for training/validation. We subclassed `torch.utils.data.DataLoader` into a `FastDictionaryLoader`, which takes a `DictionaryDataset` as input. See its documentation for further details.

Typically the dataset is split across training and validation sets and then the dataloaders are created.

In [9]:
from mlcolvar.data import FastDictionaryLoader

# create train/valid dataloader
train_loader = FastDictionaryLoader(dataset[:80],batch_size=40)
valid_loader = FastDictionaryLoader(dataset[80:],batch_size=20)

train_loader

FastDictionaryLoader(length=80, batch_size=40, shuffle=True)
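
To make the idea of batching a dictionary of arrays concrete, here is a minimal pure-Python sketch. The generator `iterate_batches` and its `seed` argument are hypothetical names introduced for this example; it only illustrates the concept (shuffle the indices once, then slice all entries with the same indices) and is not how `FastDictionaryLoader` is implemented.

```python
import random


def iterate_batches(dictionary, batch_size, shuffle=True, seed=None):
    """Yield dictionaries of batched samples (illustrative sketch only)."""
    n_samples = len(next(iter(dictionary.values())))
    indices = list(range(n_samples))
    if shuffle:
        # Shuffle the sample order once per pass over the data.
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        batch_idx = indices[start:start + batch_size]
        # Use the same indices for every key so samples stay aligned.
        yield {k: [v[i] for i in batch_idx] for k, v in dictionary.items()}


toy = {"data": [[float(i), float(-i)] for i in range(10)],
       "target": [float(i) for i in range(10)]}

batches = list(iterate_batches(toy, batch_size=4, shuffle=False))
batch_sizes = [len(b["data"]) for b in batches]   # the last batch is smaller

shuffled = list(iterate_batches(toy, batch_size=4, shuffle=True, seed=0))
seen = sorted(x for b in shuffled for x in b["target"])
```

Each sample appears exactly once per pass, and the last batch is simply shorter when the number of samples is not a multiple of the batch size.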

### Datamodule

The `lightning.LightningDataModule` object can be used to simplify and organize the data-processing tasks described above. Here we subclassed it into a `DictionaryDataModule`, which takes care of 1) shuffling, 2) splitting the dataset, and 3) creating the dataloaders. Note that it is meant to be used together with a `lightning.Trainer`.

In [10]:
from mlcolvar.data import DictionaryDataModule

# (1) lengths by fraction
datamodule = DictionaryDataModule(dataset, lengths = [0.8,0.2], batch_size = 10 )
print('#1 --> ', datamodule ) 

# (2) lengths as number of elements
datamodule = DictionaryDataModule(dataset, lengths = [75,20,5], 
                                    batch_size = [25,10,5],             # different batch sizes for each dataloader
                                    shuffle = [True, False, False] )    # specify per-dataloader options

print('\n#2 --> ', datamodule ) 

#1 -->  DictionaryDataModule(dataset -> DictionaryDataset( "data": [100, 2], "target": [100], "weights": [100] ),
		     train_loader -> FastDictionaryLoader(length=0.8, batch_size=10, shuffle=True),
		     valid_loader -> FastDictionaryLoader(length=0.2, batch_size=10, shuffle=True))

#2 -->  DictionaryDataModule(dataset -> DictionaryDataset( "data": [100, 2], "target": [100], "weights": [100] ),
		     train_loader -> FastDictionaryLoader(length=75, batch_size=25, shuffle=True),
		     valid_loader -> FastDictionaryLoader(length=20, batch_size=10, shuffle=False),
			test_loader =FastDictionaryLoader(length=5, batch_size=5, shuffle=False))
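
The two ways of specifying `lengths` (fractions or element counts) can be sketched in plain Python as follows. The helper `split_indices` is hypothetical, and the rounding rule for fractions (floor, then hand out the remainder one sample at a time) is one possible convention chosen for this example; the datamodule may use a different one.

```python
import math
import random


def split_indices(n_samples, lengths, shuffle=True, seed=0):
    """Split range(n_samples) into consecutive subsets (illustrative sketch)."""
    if all(isinstance(x, float) for x in lengths):
        # Fractions: convert to integer counts that add up to n_samples.
        if not math.isclose(sum(lengths), 1.0):
            raise ValueError("fractional lengths must sum to 1")
        counts = [math.floor(n_samples * x) for x in lengths]
        for i in range(n_samples - sum(counts)):
            counts[i % len(counts)] += 1
    else:
        # Integers: used directly as the number of elements per subset.
        counts = [int(x) for x in lengths]
        if sum(counts) > n_samples:
            raise ValueError("lengths exceed the number of samples")

    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)

    subsets, start = [], 0
    for count in counts:
        subsets.append(indices[start:start + count])
        start += count
    return subsets


train_idx, valid_idx = split_indices(100, [0.8, 0.2])
three_way = split_indices(100, [75, 20, 5], shuffle=False)
```

Shuffling the indices before splitting means that each subset samples the whole dataset rather than a contiguous stretch of it, which matters when the data come from a single trajectory.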


### I/O helper functions

#### Creating datasets from file

It is of course possible to load the data from files (e.g. with the `load_dataframe` function) and then create a dataset. For convenience, the function `create_dataset_from_files` creates the dataset directly from files. It covers the following settings:

1) **unsupervised learning**: one or more files are merged together in an unlabeled dataset

In [11]:
from mlcolvar.utils.io import create_dataset_from_files

filenames = [ "data/muller-brown/unbiased/high-temp/COLVAR" ]

# load data into dataset
dataset, df = create_dataset_from_files(filenames, 
                                        create_labels=False,
                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes
                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)

Class 0 dataframe shape:  (5001, 11)

 - Loaded dataframe (5001, 11): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']
 - Descriptors (5001, 2): ['p.x', 'p.y']
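
Since the descriptors are selected with a regular expression, it helps to see how such a pattern behaves on the column names. The sketch below mimics the selection with Python's `re.search`, which is how the `regex` option of the pandas `.filter` method matches labels. The helper `select_columns` and the extra column `p_x` are hypothetical and only serve the illustration.

```python
import re

columns = ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias',
           'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']


def select_columns(names, regex):
    """Keep the names in which the pattern is found anywhere (re.search)."""
    pattern = re.compile(regex)
    return [name for name in names if pattern.search(name)]


# The pattern used above selects the two position components.
selected = select_columns(columns, 'p.x|p.y')

# An unescaped dot matches any character, so a hypothetical column
# named 'p_x' would be selected as well ...
loose = select_columns(columns + ['p_x'], 'p.x|p.y')

# ... while an anchored pattern with an escaped dot would not match it.
strict = select_columns(columns + ['p_x'], r'^p\.[xy]$')
```

For the COLVAR file above both patterns give the same result; the stricter form is only needed when other column names could match by accident.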


In [34]:
df.head(5)

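|   | time | p.x | p.y | p.z | ene | pot.bias | pot.ene_bias | lwall.bias | lwall.force2 | uwall.bias | uwall.force2 |
|---|------|-----|-----|-----|-----|----------|--------------|------------|--------------|------------|--------------|
| 0 | 0.0 | 0.5 | 0.0 | 0.0 | 6.580981 | 6.580981 | 6.580981 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.285803 | 0.351447 | 0.0 | 11.50674 | 11.50674 | 11.50674 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2.0 | -0.004293 | 0.59071 | 0.0 | 11.821637 | 11.821637 | 11.821637 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3.0 | -0.530208 | 0.714688 | 0.0 | 16.812886 | 16.812886 | 16.812886 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 4.0 | -1.015236 | 0.978306 | 0.0 | 8.821514 | 8.821514 | 8.821514 | 0.0 | 0.0 | 0.0 | 0.0 |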


2) **classification**: in this case each file contains samples of a different class

In [38]:
from mlcolvar.utils.io import create_dataset_from_files

filenames = [ f"data/muller-brown/unbiased/state-{i}/COLVAR" for i in range(2) ]

# load data into dataset
dataset, df = create_dataset_from_files(filenames, 
                                        create_labels=True,
                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes
                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)

Class 0 dataframe shape:  (2001, 12)
Class 1 dataframe shape:  (2001, 12)

 - Loaded dataframe (4002, 12): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2', 'labels']
 - Descriptors (4002, 2): ['p.x', 'p.y']
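
The labeling logic (one integer label per input file, in the order the files are given) can be sketched without any dependency. The functions `read_colvar` and `merge_with_labels` are hypothetical helpers written for this example, the in-memory "files" contain made-up numbers, and the parser only handles the `#! FIELDS` header line of a COLVAR file; it is not a replacement for `load_dataframe` or `create_dataset_from_files`.

```python
import io


def read_colvar(handle):
    """Parse a COLVAR-like text stream into a list of row dictionaries."""
    fields, rows = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#!"):
            tokens = line.split()
            # The header has the form: "#! FIELDS name1 name2 ..."
            if len(tokens) > 1 and tokens[1] == "FIELDS":
                fields = tokens[2:]
            continue
        if fields is None:
            raise ValueError("data found before the '#! FIELDS' header")
        values = [float(x) for x in line.split()]
        rows.append(dict(zip(fields, values)))
    return rows


def merge_with_labels(handles):
    """Concatenate the rows of all files, labeling each row by file index."""
    merged = []
    for label, handle in enumerate(handles):
        for row in read_colvar(handle):
            row["labels"] = label
            merged.append(row)
    return merged


# Toy in-memory files with made-up values, standing in for state-0 / state-1.
state_0 = io.StringIO("#! FIELDS time p.x p.y\n0.0 -0.5 1.4\n1.0 -0.6 1.5\n")
state_1 = io.StringIO("#! FIELDS time p.x p.y\n0.0 0.6 0.0\n")

rows = merge_with_labels([state_0, state_1])
labels = [row["labels"] for row in rows]
```

The merged table has one extra `labels` column, which is consistent with the loaded dataframe above having 12 columns instead of 11.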


#### Create time-lagged datasets

In time-lagged tasks, one deals not with single configurations but with pairs of configurations $\{x(t),x(t+\tau)\}$ separated in time by a lag time $\tau$. The `mlcolvar.utils.timelagged` module contains some helper functions for this, in particular `create_timelagged_dataset`.

Notes:
- If logweights are given (e.g. beta*bias), the search for time-lagged configurations is performed in rescaled time [McCarthy and Parrinello, JCP 2017].
- The resulting dataset will contain the keys 'data', 'data_lag' as well as 'weights' and 'weights_lag', where the weights are all equal to ones in the unbiased case.
- The actual search for time-lagged configurations is performed by the function `find_time_lagged_configurations`, which however is not supposed to be called directly.
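
As a rough illustration of what "rescaled time" means in the first note, the sketch below accumulates time increments weighted by the exponential of the logweights, so that frames with a large weight count for more time. The function `rescaled_time` is hypothetical; it assumes uniformly spaced frames and omits any normalization or sanitization of the logweights that the library may apply.

```python
import math


def rescaled_time(t, logweights):
    """Cumulative sum of dt * exp(logweight) (illustrative sketch only)."""
    dt = t[1] - t[0]            # assumes uniformly spaced frames
    tprime, total = [], 0.0
    for logweight in logweights:
        total += dt * math.exp(logweight)
        tprime.append(total)
    return tprime


t = [0.0, 1.0, 2.0, 3.0]

# With all logweights equal to zero, every frame advances the clock by dt.
unbiased = rescaled_time(t, [0.0, 0.0, 0.0, 0.0])

# Frames with larger logweights advance the rescaled clock faster.
biased = rescaled_time(t, [0.0, math.log(2.0), 0.0, math.log(3.0)])
```

Searching for pairs separated by a fixed lag in this rescaled clock then pairs each frame with a partner that is closer or farther in simulation time, depending on the weights in between.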

In [47]:
from mlcolvar.utils.timelagged import create_timelagged_dataset

X = torch.rand((100,20)) 
t = torch.arange(100)

# returns configurations at time t as well as time t+tau
dataset = create_timelagged_dataset(X, t, 
                                    lag_time=10, 
                                    logweights=None )

dataset



DictionaryDataset( "data": [88, 20], "data_lag": [88, 20], "weights": [88], "weights_lag": [88] )
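
For the unbiased case, the pairing itself can be sketched in a few lines of plain Python. The function `timelagged_pairs` is a hypothetical helper for this example: for each frame it takes the first later frame whose time is at least `lag_time` ahead and drops frames without a partner. It is not the library's search routine, whose boundary conventions evidently differ (the output above contains 88 pairs for 100 frames, whereas this sketch would give 90).

```python
import bisect


def timelagged_pairs(x, t, lag_time):
    """Pair each x[i] with the first frame at time >= t[i] + lag_time."""
    data, data_lag = [], []
    for i, t_i in enumerate(t):
        # t must be sorted in increasing order for the bisection to work.
        j = bisect.bisect_left(t, t_i + lag_time)
        if j < len(t):
            data.append(x[i])
            data_lag.append(x[j])
    return {"data": data, "data_lag": data_lag}


frames = list(range(20))    # toy one-dimensional "configurations"
times = list(range(20))     # uniformly spaced times

pairs = timelagged_pairs(frames, times, lag_time=5)
n_pairs = len(pairs["data"])
```

Because the search only relies on the times being sorted, the same routine would also work with non-uniform times, such as a rescaled time axis.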