# Using the FSMol Dataset

The `FSMolDataset` class provides access to the train/valid/test tasks of the few-shot dataset. Create an instance from the data directory with `FSMolDataset.from_directory("/path/to/dataset")`, then obtain an iterable over the task files with `FSMolDataset.get_task_reading_iterable`. This method accepts a callable that defines how a list of task files is read, and it permits multithreaded data loading. The default implementation returns an iterable over `FSMolTask` objects, each containing an entire task's worth of individually featurised molecules (`MoleculeDatapoint` objects).

In [None]:
import os
import sys

sys.path.insert(0, os.path.join(os.getcwd(), "../fs_mol"))

from data import FSMolDataset, DataFold

In [None]:
dataset = FSMolDataset.from_directory(os.path.join(os.getcwd(), "../dataset/")) # path to the dataset here
test_task_iterable = dataset.get_task_reading_iterable(DataFold.TEST)

In [None]:
# Take the first task from the iterable for inspection
task = next(iter(test_task_iterable))

In [None]:
type(task)

In [None]:
task.name

The iterable can be used to assemble batches of tasks, as seen in `maml_train.py`. 

## Drawing Molecule Samples -> Making a Task Sample

In practice, all methods require that we sample from the `FSMolTask`s. This is accomplished by building on the `TaskSampler` abstract class, which requires a `sample` method. As an example of this, the few-shot learning methods in `FS-Mol` utilise stratified sampling, where the resulting balance of classes reflects that in the overall dataset. 

Because the datasets are small, training requires a minimum support set size and a minimum query set size, and we also supply an upper limit on the desired query set size. When a task's dataset cannot be sampled to meet these requirements, an error is thrown and caught, enabling training to continue.
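The sampling behaviour described above can be illustrated with a small, self-contained sketch. It is not the FS-Mol implementation: `stratified_split`, `TaskTooSmallError`, and the parameter names are hypothetical, and tasks are reduced to plain lists of binary labels. The support set is drawn per class in proportion to class frequency, the query set is capped at an upper size, and a task that cannot meet the minimum sizes raises an error that the caller catches and skips.

```python
import random
from collections import defaultdict


class TaskTooSmallError(Exception):
    """Hypothetical error type for tasks that cannot satisfy the request."""


def stratified_split(labels, train_size, min_test_size, max_test_size, seed=0):
    """Split indices into a stratified support set and a capped query set."""
    if len(labels) < train_size + min_test_size:
        raise TaskTooSmallError(
            f"need {train_size + min_test_size} samples, got {len(labels)}"
        )

    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, rest = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Each class contributes in proportion to its frequency in the task.
        k = round(train_size * len(idxs) / len(labels))
        train_idx.extend(idxs[:k])
        rest.extend(idxs[k:])

    # Rounding can leave the support set off by a few items; correct it.
    rng.shuffle(rest)
    while len(train_idx) < train_size:
        train_idx.append(rest.pop())
    while len(train_idx) > train_size:
        rest.append(train_idx.pop())

    # The query set is capped, but may be smaller if the task runs out of data.
    return train_idx, rest[:max_test_size]


labels = [1] * 10 + [0] * 30  # 25% positives
train_idx, test_idx = stratified_split(labels, 16, 8, 256)
num_train_positives = sum(labels[i] for i in train_idx)

# A task that is too small raises, and the caller simply skips it.
skipped = False
try:
    stratified_split([0, 1] * 5, 16, 8, 256)
except TaskTooSmallError:
    skipped = True
```

In this toy example the support set keeps the 25% positive rate of the task, and the query set contains the 24 remaining molecules rather than the 256 requested.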

In [None]:
from data import StratifiedTaskSampler

In [None]:
task_sampler = StratifiedTaskSampler(
    train_size_or_ratio=16,
    valid_size_or_ratio=0.0,
    test_size_or_ratio=256,
    allow_smaller_test=True,
)

In [None]:
task_sample = task_sampler.sample(task)

Applying a sampler returns a sample from the task, which contains a support/train set, validation set, and query/test set. In this case, the task is too small to return all requested testing samples, so it returns the maximum available.

In [None]:
print(f"Number of train samples: {len(task_sample.train_samples)}")
print(f"Number of test samples: {len(task_sample.test_samples)}")
print(f"Number of valid samples: {len(task_sample.valid_samples)}")

In [None]:
type(task_sample.train_samples[0])

A balanced sampler (equal proportions of both classes) and a random sampler (random draws from the entire task dataset) are also implemented in `data/fsmol_task_sampler.py`.

## Custom task reading functions

Line 215 of `maml_train.py` implements a custom task reading function that is consumed by `dataset.get_task_reading_iterable(task_reader_fn=...)`. It serves as an example of custom reading: the data is simultaneously read in from disk and sampled by the `StratifiedTaskSampler`.
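The read-and-sample pattern can be sketched without FS-Mol as follows. Everything here is a stand-in: `FAKE_DISK`, `load_task`, `sample_task`, and `read_and_sample` are hypothetical names, tasks are plain label lists, and the exact signature that `task_reader_fn` must have should be checked against the FS-Mol source. The point is only the shape of the function: it receives a list of task paths, loads each task, samples it immediately, and skips tasks that cannot be sampled.

```python
from typing import Dict, Iterator, List, Tuple

# Stand-in for task files on disk: path -> list of binary labels.
FAKE_DISK: Dict[str, List[int]] = {
    "task_a": [0, 1] * 20,         # 40 datapoints
    "task_b": [0, 1] * 3,          # 6 datapoints: too small to sample
    "task_c": [0] * 25 + [1] * 5,  # 30 datapoints
}


def load_task(path: str) -> List[int]:
    """Hypothetical loader; a real reader would parse the task file."""
    return FAKE_DISK[path]


def sample_task(labels: List[int], train_size: int = 16) -> Dict[str, List[int]]:
    """Hypothetical sampler: first `train_size` items form the support set."""
    if len(labels) <= train_size:
        raise ValueError("task too small to sample")
    return {"train": labels[:train_size], "test": labels[train_size:]}


def read_and_sample(paths: List[str]) -> Iterator[Tuple[str, Dict[str, List[int]]]]:
    """Read each task and sample it in the same pass, skipping failures."""
    for path in paths:
        task = load_task(path)
        try:
            yield path, sample_task(task)
        except ValueError:
            # The sampling error is caught so that iteration continues.
            continue


sampled = list(read_and_sample(["task_a", "task_b", "task_c"]))
names = [name for name, _ in sampled]
```

Because sampling happens inside the reader, the consumer of the iterable only ever sees ready-to-use task samples, and undersized tasks never reach the training loop.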

## Batching task samples

Line 113 of `maml_train.py` also demonstrates the batching functionality provided in `FS-Mol`: the train samples of an `FSMolTaskSample` are batched, in a `TFGraphBatchIterable`, so that they are suitable for mini-batch gradient descent.

This is an instance of the more general `FSMolBatchIterable`, which uses an `FSMolBatcher` to return complete batches of samples.
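One common way to form complete mini-batches of graphs is to fill each batch until a node budget is reached. The sketch below illustrates that idea only; whether `TFGraphBatchIterable` behaves exactly this way is an assumption, and `batch_by_node_budget` is a hypothetical helper that works on a list of per-molecule node counts rather than on real samples.

```python
from typing import Iterator, List


def batch_by_node_budget(
    node_counts: List[int], max_num_nodes: int
) -> Iterator[List[int]]:
    """Yield lists of sample indices whose total node count fits the budget."""
    batch: List[int] = []
    nodes_in_batch = 0
    for idx, num_nodes in enumerate(node_counts):
        if num_nodes > max_num_nodes:
            # A single graph larger than the budget can never fit; skip it.
            continue
        if nodes_in_batch + num_nodes > max_num_nodes:
            # The current batch is full: emit it and start a new one.
            yield batch
            batch, nodes_in_batch = [], 0
        batch.append(idx)
        nodes_in_batch += num_nodes
    if batch:
        # Emit the final, possibly partially filled, batch.
        yield batch


batches = list(batch_by_node_budget([30, 40, 50, 20, 120, 10], max_num_nodes=100))
```

Here the molecules with 30 and 40 nodes form the first batch, those with 50, 20, and 10 nodes form the second, and the 120-node molecule is dropped because it exceeds the budget on its own.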

In [None]:
from utils.maml_data_utils import TFGraphBatchIterable

In [None]:
batched_data = TFGraphBatchIterable(
                samples=task_sample.train_samples,
                shuffle=True,
                max_num_nodes=100,
            )

In [None]:
# Iterate over all batches, keeping the last one for inspection
for batch in batched_data:
    data = batch


In [None]:
data[1]

Another implementation of the batcher is found in `MultitaskTaskSampleBatchIterable`, as consumed on Line 532 of `fs_mol/multitask_train.py`. Note, however, that `FSMolDataset` can be used flexibly, as demonstrated by, for instance, `fs_mol/mat_test.py`.