## Defining a Task

In the last tutorial we learned the basic structure of a `NeuralTree`. In this section we will explore `Task` objects, which define the interface
between a `NeuralTree` and the datasets we will use to train it.

### A Dataset

Every task starts with a dataset. In this example we will use the GFP fluorescence dataset (TODO add link).

In [1]:
from cortex.data.dataset import TAPEFluorescenceDataset

dataset = TAPEFluorescenceDataset(
    root='./.cache',
    download=True,
    train=True,
)
dataset[0]

OrderedDict([('tokenized_seq',
              'S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T L S Y G V Q C F S R Y P D H M K Q H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H K I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E R Y K'),
             ('log_fluorescence', 3.8237006664276123)])
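
The printed item suggests that each example is an ordered mapping from column names to values, with the sequence stored as space-separated residue tokens. As a rough illustration only, the toy class below mimics that item layout. `ToyFluorescenceDataset` and its sample values are hypothetical and are not part of `cortex`.

```python
from collections import OrderedDict


class ToyFluorescenceDataset:
    """Hypothetical stand-in that mimics the item layout shown above."""

    def __init__(self, sequences, targets):
        if len(sequences) != len(targets):
            raise ValueError("sequences and targets must have the same length")
        self._sequences = list(sequences)
        self._targets = list(targets)

    def __len__(self):
        return len(self._sequences)

    def __getitem__(self, index):
        # Separate residues with spaces, matching the 'tokenized_seq' format above.
        tokenized = " ".join(self._sequences[index])
        return OrderedDict(
            [
                ("tokenized_seq", tokenized),
                ("log_fluorescence", float(self._targets[index])),
            ]
        )


# Made-up sequences and targets, for illustration only.
toy_dataset = ToyFluorescenceDataset(["SKGEEL", "SKGAEL"], [3.8, 1.3])
print(toy_dataset[0])
```

Any object with `__len__` and `__getitem__` follows the usual map-style dataset pattern, which is why indexing with `dataset[0]` works in the cell above.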

### A Task Data Module

The `cortex` package uses the `lightning` package to handle data loading and distributed training.
`TaskDataModule` subclasses `lightning.LightningDataModule`.

In [2]:
from cortex.data.data_module import TaskDataModule
from omegaconf import DictConfig

dataset_cfg = DictConfig(
    {
        '_target_': 'cortex.data.dataset.TAPEFluorescenceDataset',
        'root': './.cache',
        'download': True,
        'train': "???",  # OmegaConf marker for a mandatory value that is not set yet
    }
)

data_module = TaskDataModule(
    batch_size=2,
    dataset_config=dataset_cfg,
)

train_loader = data_module.train_dataloader()
batch = next(iter(train_loader))
print(batch)

OrderedDict([('tokenized_seq', ['S K G E E L F T G A V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T L S Y G V Q C F S R Y P D H M K Q H D F F K S A M P E G Y V Q E R A I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q D T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K', 'S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T S G E L P V P W P T L V T T L S Y G V Q C F S R Y P D H M K Q H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K']), ('log_fluore
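
The batch printed above has the same keys as a single dataset item, with each value gathered across the examples in the batch. The following is a minimal sketch of that collation step, assuming every item is a mapping with identical keys. `collate_items` is a hypothetical helper, and the real data loader likely converts numeric columns to tensors rather than plain lists.

```python
from collections import OrderedDict


def collate_items(items):
    """Gather per-example mappings into one mapping of lists (hypothetical helper)."""
    if not items:
        raise ValueError("cannot collate an empty batch")
    batch = OrderedDict()
    for key in items[0]:
        # Keep the key order of the first item; every item must provide each key.
        batch[key] = [item[key] for item in items]
    return batch


# Made-up items with the same keys as the dataset example above.
items = [
    OrderedDict([("tokenized_seq", "S K G"), ("log_fluorescence", 3.8)]),
    OrderedDict([("tokenized_seq", "S K A"), ("log_fluorescence", 1.3)]),
]
toy_batch = collate_items(items)
print(toy_batch)
```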

### A Task Object

A task object in `cortex` determines how a batch of data from a dataloader is passed to a `NeuralTree` during training.

In [3]:
from cortex.task import RegressionTask

task = RegressionTask(
    data_module=data_module,
    input_map={"protein_seq": ["tokenized_seq"]},  # {root_key: [input_key]}
    outcome_cols=["log_fluorescence"],  # [*target_keys]
    leaf_key="log_fluorescence_0"  # name of leaf node
)

formatted_batch = task.format_batch(batch)
print(formatted_batch)

{'root_inputs': {'protein_seq': {'inputs': array([['S K G E E L F T G A V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T L S Y G V Q C F S R Y P D H M K Q H D F F K S A M P E G Y V Q E R A I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q D T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'],
       ['S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T S G E L P V P W P T L V T T L S Y G V Q C F S R Y P D H M K Q H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G 
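
Judging from the printed structure, `format_batch` regroups the batch columns by root node, and the later cells read per-leaf targets from `formatted_batch["leaf_targets"]`. The sketch below reproduces only that regrouping on plain Python lists. `format_batch_sketch` is hypothetical and is not the `cortex` implementation, and the `"targets"` key name is an assumption because the printed output is truncated before that part.

```python
def format_batch_sketch(batch, input_map, outcome_cols, leaf_key):
    """Regroup a collated batch by root and leaf node (illustrative only)."""
    root_inputs = {}
    for root_key, input_cols in input_map.items():
        columns = [batch[col] for col in input_cols]
        # One row per example and one column per input key,
        # like the (batch_size, 1) array printed above.
        rows = [list(row) for row in zip(*columns)]
        root_inputs[root_key] = {"inputs": rows}

    target_columns = [batch[col] for col in outcome_cols]
    target_rows = [list(row) for row in zip(*target_columns)]
    # Assumed key name: the real structure under 'leaf_targets' is not shown.
    leaf_targets = {leaf_key: {"targets": target_rows}}
    return {"root_inputs": root_inputs, "leaf_targets": leaf_targets}


# Made-up batch with the same column names as the tutorial dataset.
example_batch = {
    "tokenized_seq": ["S K G", "S K A"],
    "log_fluorescence": [3.8, 1.3],
}
formatted = format_batch_sketch(
    example_batch,
    input_map={"protein_seq": ["tokenized_seq"]},
    outcome_cols=["log_fluorescence"],
    leaf_key="log_fluorescence_0",
)
print(formatted)
```

Keeping the mapping from dataset columns to node names in the task means the same tree can, in principle, be paired with datasets that use different column names.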

## Usage

Now we will instantiate a `NeuralTree` as in the last tutorial, but this time we will use Hydra to simplify the instantiation.
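
Hydra's `instantiate` utility builds an object from a config whose `_target_` key names a dotted import path, passing the remaining keys as keyword arguments. The dataset config earlier in this section uses the same convention. The sketch below shows the core idea with a standard-library target. `instantiate_sketch` is a hypothetical, heavily simplified helper; the real utility also handles nested configs, positional arguments, and partial instantiation.

```python
import importlib


def instantiate_sketch(config, **overrides):
    """Build an object from a mapping with a '_target_' dotted path (simplified)."""
    kwargs = {key: value for key, value in config.items() if key != "_target_"}
    # Keyword overrides take precedence over values stored in the config.
    kwargs.update(overrides)
    module_path, _, attr_name = config["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# A standard-library target keeps the example self-contained.
fraction_cfg = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
value = instantiate_sketch(fraction_cfg)
overridden = instantiate_sketch(fraction_cfg, denominator=8)
print(value, overridden)
```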

In [4]:
import hydra
from omegaconf import OmegaConf

with hydra.initialize(config_path="./hydra"):
    cfg = hydra.compose(config_name="2_defining_a_task")
    OmegaConf.set_struct(cfg, False)

tree = hydra.utils.instantiate(cfg.tree)
tree.build_tree(cfg)
tree

using vocab from /Users/stantos5/.venv/cortex-docs/lib/python3.10/site-packages/cortex/assets/protein_seq_tokenizer_32/vocab.txt


SequenceModelTree(
  (root_nodes): ModuleDict(
    (protein_seq): Conv1dRoot(
      (tok_encoder): Embedding(32, 32, padding_idx=1)
      (pos_encoder): SinePosEncoder(
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (encoder): Sequential(
        (0): Apply(
          (module): Expression()
        )
        (1): Conv1dResidBlock(
          (conv_1): Conv1d(32, 32, kernel_size=(3,), stride=(1,), padding=same, bias=False)
          (conv_2): Conv1d(32, 32, kernel_size=(3,), stride=(1,), padding=same, bias=False)
          (norm_1): MaskLayerNorm1d((32, 1), eps=1e-05, elementwise_affine=True)
          (norm_2): MaskLayerNorm1d((32, 1), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.0, inplace=False)
        )
        (2): Conv1dResidBlock(
          (conv_1): Conv1d(32, 32, kernel_size=(3,), stride=(1,), padding=same, bias=False)
          (conv_2): Conv1d(32, 32, kernel_size=(3,), stride=(1,), padding=same, bias=False)
          (norm_1): MaskLayer

In [5]:
tree_output = tree(formatted_batch["root_inputs"])
tree_output.leaf_outputs["log_fluorescence_0"].loc

tensor([[0.0381],
        [0.0486]], grad_fn=<MulBackward0>)

### Computing a task loss

In [6]:
leaf_key = "log_fluorescence_0"
leaf_node = tree.leaf_nodes[leaf_key]

loss = leaf_node.loss(
    leaf_outputs=tree_output.leaf_outputs[leaf_key],
    root_outputs=tree_output.root_outputs["protein_seq"],
    **formatted_batch["leaf_targets"][leaf_key]
)
print(loss)

tensor(19.7825, grad_fn=<MulBackward0>)
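
The leaf output exposes a `.loc` attribute, and the evaluation step later in this section reports the same value under `nll`, which suggests the loss is a negative log-likelihood under a predicted distribution. As an illustration only, the sketch below computes the negative log-likelihood for a Gaussian prediction. The Gaussian form is an assumption that the source does not confirm, and the actual `cortex` loss may differ in form or scaling.

```python
import math


def gaussian_nll(loc, scale, target):
    """Negative log-likelihood of `target` under Normal(loc, scale)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    z = (target - loc) / scale
    return 0.5 * z * z + math.log(scale) + 0.5 * math.log(2.0 * math.pi)


def mean_gaussian_nll(locs, scales, targets):
    """Average the per-example NLL over a batch."""
    values = [gaussian_nll(m, s, y) for m, s, y in zip(locs, scales, targets)]
    return sum(values) / len(values)


# Hypothetical predictions and targets, not taken from the notebook run.
print(mean_gaussian_nll([0.0, 0.1], [1.0, 1.0], [3.0, 1.0]))
```

Under this assumption, predictions far from the targets relative to the predicted scale are penalized quadratically, so an untrained model with outputs near zero would be expected to start with a large loss.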


### Evaluating task output

In [7]:
leaf_node.evaluate(
    outputs=tree_output.leaf_outputs[leaf_key],
    **formatted_batch["leaf_targets"][leaf_key]
)

{'nll': 19.782459259033203,
 'nrmse': 0.9881033301353455,
 's_rho': 0.9999999999999999}
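
The metric names suggest a normalized root-mean-square error (`nrmse`) and Spearman's rank correlation (`s_rho`). The sketch below shows one way such metrics can be computed in plain Python. Normalizing the RMSE by the standard deviation of the targets is an assumption, since `cortex` may use a different convention, and the input values are hypothetical.

```python
import math


def nrmse(predictions, targets):
    """RMSE divided by the population standard deviation of the targets."""
    n = len(targets)
    mean_target = sum(targets) / n
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    variance = sum((t - mean_target) ** 2 for t in targets) / n
    if variance == 0:
        raise ValueError("targets must not all be equal")
    return math.sqrt(mse / variance)


def ranks(values):
    """Rank values from 1 to n; ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(predictions, targets):
    """Pearson correlation computed on the ranks of both inputs."""
    rx, ry = ranks(predictions), ranks(targets)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical targets paired with two predictions in increasing order.
print(spearman_rho([0.0381, 0.0486], [1.0, 2.0]))
print(spearman_rho([0.0381, 0.0486], [2.0, 1.0]))
```

With a batch of only two distinct values, Spearman's rank correlation can only be +1 or -1, so the `s_rho` value above says only that the two predictions are ordered like their targets. Evaluating on a larger batch gives a more informative estimate.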