In this notebook we are going to explain how to use a new dataset with `openretina`.

For this example, we are going to use data from Maheswaranathan et al. (2023): [Interpreting the retinal neural code for natural scenes: From computations to neurons](https://doi.org/10.1016/j.neuron.2023.06.007).

Along the way, we are also going to address some questions that can arise regarding the process for your own data. 

In [1]:
import logging
import os

import lightning

from openretina.data_io.base import MoviesTrainTestSplit, ResponsesTrainTestSplit, compute_data_info
from openretina.data_io.base_dataloader import multiple_movies_dataloaders
from openretina.data_io.cyclers import LongCycler, ShortCycler
from openretina.models.core_readout import CoreReadout
from openretina.utils.file_utils import get_cache_directory, get_local_file_path
from openretina.utils.h5_handling import load_dataset_from_h5, load_h5_into_dict
from openretina.utils.misc import CustomPrettyPrinter

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)  # to display logs in jupyter notebooks

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

pp = CustomPrettyPrinter(indent=4, max_lines=40)

First, let's set the cache directory for the data and models.

In [2]:
# The default directory for downloads will be ~/openretina_cache
# To change this, uncomment the following line and change its path
# os.environ["OPENRETINA_CACHE_DIRECTORY"] = "/Data/"

# You can then check if that directory has been correctly set by running:
get_cache_directory()

'/home/baptiste/openretina_cache'

Let's now download the data from HuggingFace.

In [3]:
data_path = get_local_file_path(
    "https://huggingface.co/datasets/open-retina/open-retina/blob/main/baccus_lab/maheswaranathan_2023/neural_code_data.zip"
)

2025-04-30 13:04:21,543 - INFO - Fetching file list for open-retina/open-retina...


2025-04-30 13:04:21,765 - INFO - Found extracted folder at /home/baptiste/openretina_cache/baccus_lab/maheswaranathan_2023/neural_code_data


Now let's inspect the structure of this dataset.

In [4]:
!ls $data_path/ganglion_cell_data

15-10-07  15-11-21a  15-11-21b


We can see that the ganglion cell data is structured by sessions. We are going to pick session `15-10-07` to use throughout the examples in this notebook.

In [5]:
!ls $data_path/ganglion_cell_data/15-10-07

naturalscene.h5  whitenoise.h5


Inside each session we have files for two different types of stimuli.

Let's load the file dealing with whitenoise and inspect it.

In [6]:
whitenoise_file = load_h5_into_dict(os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"))

pp.pprint(whitenoise_file)

Loading HDF5 file contents:   0%|          | 0/30 [00:00<?, ?item/s]

{   'spikes': {   'cell01': numpy.ndarray(shape=(43927,)),
                  'cell02': numpy.ndarray(shape=(11819,)),
                  'cell03': numpy.ndarray(shape=(12423,)),
                  'cell04': numpy.ndarray(shape=(37662,)),
                  'cell05': numpy.ndarray(shape=(10976,)),
                  'cell06': numpy.ndarray(shape=(11654,)),
                  'cell07': numpy.ndarray(shape=(17792,)),
                  'cell08': numpy.ndarray(shape=(4566,)),
                  'cell09': numpy.ndarray(shape=(36307,))},
    'test': {   'repeats': {   'cell01': numpy.ndarray(shape=(6, 5997)),
                               'cell02': numpy.ndarray(shape=(6, 5997)),
                               'cell03': numpy.ndarray(shape=(6, 5997)),
                               'cell04': numpy.ndarray(shape=(6, 5997)),
                               'cell05': numpy.ndarray(shape=(6, 5997)),
                               'cell06': numpy.ndarray(shape=(6, 5997)),
                               

We can see that at the first level of the `.h5` hierarchy the data is split into `train`, `test` and `spikes`. 

`spikes` contains the spike times for each neuron, which we can ignore.

`train` and `test` are structured similarly: they both contain numpy arrays for the stimulus, time (mapping to the spike indices in `spikes`) and the response. The latter is saved with different binnings: depending on the bin width chosen in time, the same sequence of spike times can be grouped into different firing rate representations.

We can see that the stimulus and the response arrays share the time dimensions. These are the data we are interested in for model fitting.
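
To make the idea of binning concrete, here is a minimal sketch of how a list of spike times could be turned into firing rates at two bin widths. The function name and the toy spike times are hypothetical, and this is not necessarily how the firing rates stored in the dataset were computed (they may, for instance, be smoothed).

```python
import numpy as np


def bin_spike_times(spike_times, t_start, t_stop, bin_width):
    """Count spikes in consecutive bins of `bin_width` seconds and convert counts to Hz."""
    n_bins = int(round((t_stop - t_start) / bin_width))
    edges = t_start + bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts / bin_width


# Toy spike train in seconds (hypothetical values, not taken from the dataset)
spike_times = np.array([0.005, 0.012, 0.031, 0.047, 0.093])

rate_10ms = bin_spike_times(spike_times, t_start=0.0, t_stop=0.1, bin_width=0.01)
rate_20ms = bin_spike_times(spike_times, t_start=0.0, t_stop=0.1, bin_width=0.02)

print(rate_10ms.shape, rate_20ms.shape)
```

The same spikes yield arrays of different lengths, which is why the response must be binned at the stimulus frame rate before fitting.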

---

Now let's see how we can load this to use with `openretina`. 

# Loading data

What we need is the matching stimulus and response pairs for training and testing. We will then need to feed them inside the two classes that handle their data, respectively `ResponsesTrainTestSplit` and `MoviesTrainTestSplit`.

Let's briefly print the classes' help information, so we can see which arguments they expect:


In [None]:
MoviesTrainTestSplit?

Init signature:
MoviesTrainTestSplit(
    train: jaxtyping.Float[ndarray, 'channels train_time height width'],
    test_dict: dict = <factory>,
    test: dataclasses.InitVar[jaxtyping.Float[ndarray, 'channels test_time height width'] | None] = None,
    stim_id: Optional[str] = None,
    random_sequences: Optional[numpy.ndarray] = None,
    norm_mean: Optional[float] = None,
    norm_std: Optional[float] = None,
) -> None
Docstring:      MoviesTrainTestSplit(train: jaxtyping.Float[ndarray, 'channels train_time height width'], test_dict: dict = <factory>, test: dataclasses.InitVar[jaxtyping.Float[ndarray, 'channels test_time height width'] | None] = None, stim_id: Optional[str] = None, random_sequences: Optional[numpy.ndarray] = None, norm_mean: Optional[float] = None, norm_std: Optional[float] = None)


In [None]:
ResponsesTrainTestSplit?

Init signature:
ResponsesTrainTestSplit(
    train: jaxtyping.Float[ndarray, 'neurons train_time'],
    test_dict: dict = <factory>,
    test: dataclasses.InitVar[jaxtyping.Float[ndarray, 'neurons test_time'] | None] = None,
    test_by_trial: jaxtyping.Float[ndarray, 'trials neurons test_time'] | None = None,
    stim_id: str | None = None,
    session_kwargs: dict[str, typing.Any] = <factory>,
) -> None
Docstring:      ResponsesTrainTestSplit(train: jaxtyping.Float[ndarray, 'neurons train_time'], test_dict: dict = <factory>, test: dataclasses.InitVar[jaxtyping.Float[ndarray, 'neurons test_time'] | None] = None, test_by_trial: jaxtyping.Float[ndarray, 'trials neurons test_time'] | None = None, stim_id: str | None = None, session_kwargs: dict[str, typing.Any] = <factory>)
File:

Let's now start importing the data that we will feed into these classes:

In [None]:
test_stimulus = load_dataset_from_h5(
    os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "test/stimulus"
)
test_response = load_dataset_from_h5(
    os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "test/response/firing_rate_20ms"
)

train_stimulus = load_dataset_from_h5(
    os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "train/stimulus"
)
train_response = load_dataset_from_h5(
    os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "train/response/firing_rate_20ms"
)

print(f"Train stimulus shape: {train_stimulus.shape}")
print(f"Train response shape: {train_response.shape}")

print(f"Test stimulus shape: {test_stimulus.shape}")
print(f"Test response shape: {test_response.shape}")

Train stimulus shape: (359802, 50, 50)
Train response shape: (9, 359802)
Test stimulus shape: (5996, 50, 50)
Test response shape: (9, 5997)


Looking at the shapes of the arrays we just imported, we need to make some small adjustments to match the assumptions that the classes within `openretina` make. 

- The stimulus needs to be 4-dimensional, with shape `color_channels x time x height x width`: in this case the channel dimension is missing.
- The responses need to have shape `n_neurons x time`: this is already the case here.
- The time dimensions of the stimuli and responses should match exactly: here the test response has one extra time bin, which we are simply going to cut.

Let's do all of this here:

In [None]:
test_stimulus = test_stimulus.reshape(1, -1, test_stimulus.shape[1], test_stimulus.shape[2])
train_stimulus = train_stimulus.reshape(1, -1, train_stimulus.shape[1], train_stimulus.shape[2])

test_response = test_response[:, :-1]

print(f"Train stimulus shape: {train_stimulus.shape}")
print(f"Train response shape: {train_response.shape}")

print(f"Test stimulus shape: {test_stimulus.shape}")
print(f"Test response shape: {test_response.shape}")

Train stimulus shape: (1, 359802, 50, 50)
Train response shape: (9, 359802)
Test stimulus shape: (1, 5996, 50, 50)
Test response shape: (9, 5996)
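
When adapting these steps to other data, it can help to check the three requirements listed above programmatically rather than by eye. The helper below is a hypothetical sketch (it is not part of `openretina`) that raises an error as soon as one of them is violated.

```python
import numpy as np


def check_stimulus_response_shapes(stimulus, response):
    """Raise ValueError if the arrays do not follow the layout described above."""
    if stimulus.ndim != 4:
        raise ValueError(f"stimulus must be channels x time x height x width, got {stimulus.shape}")
    if response.ndim != 2:
        raise ValueError(f"response must be neurons x time, got {response.shape}")
    if stimulus.shape[1] != response.shape[1]:
        raise ValueError(
            f"time dimensions differ: stimulus has {stimulus.shape[1]} bins, response has {response.shape[1]}"
        )


# Toy arrays standing in for a real session (hypothetical sizes)
stim = np.zeros((1, 20, 5, 5))
resp = np.zeros((3, 20))
check_stimulus_response_shapes(stim, resp)  # passes silently
```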


Before finally initialising the two classes, we should normalise the stimuli (and optionally the responses). This is mostly done to stabilise training, as an overly wide input range can lead to exploding gradients.

In [None]:
train_stim_mean = train_stimulus.mean()
train_stim_std = train_stimulus.std()

norm_train_stimulus = (train_stimulus - train_stim_mean) / train_stim_std
norm_test_stimulus = (test_stimulus - train_stim_mean) / train_stim_std
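
Note that the test stimulus is normalised with the statistics of the training stimulus. Responses are left untouched in this notebook. If you do want to normalise them, one possible approach, sketched here under the assumption that per-neuron scaling suits your data, is to divide each neuron's responses by the standard deviation of its training responses. The function name and the toy data are hypothetical.

```python
import numpy as np


def scale_responses(train_resp, test_resp, eps=1e-8):
    """Divide each neuron's responses (neurons x time) by the std of its training responses."""
    std = train_resp.std(axis=1, keepdims=True)
    scale = np.where(std > eps, std, 1.0)  # leave silent neurons unscaled
    return train_resp / scale, test_resp / scale, scale


# Toy responses for 3 neurons (hypothetical data)
rng = np.random.default_rng(0)
toy_train = rng.poisson(5.0, size=(3, 1000)).astype(float)
toy_test = rng.poisson(5.0, size=(3, 100)).astype(float)

scaled_train, scaled_test, scale = scale_responses(toy_train, toy_test)
print(scaled_train.std(axis=1))
```

Scaling without subtracting the mean keeps the responses non-negative, and returning `scale` lets you map model predictions back to the original units.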

Finally, we can initialise the classes:

In [None]:
single_stimulus = MoviesTrainTestSplit(
    train=norm_train_stimulus,
    test=norm_test_stimulus,
    stim_id="whitenoise",
    norm_mean=train_stim_mean,
    norm_std=train_stim_std,
)

single_response = ResponsesTrainTestSplit(
    train=train_response,
    test=test_response,
    stim_id="whitenoise",
)

### Q: How to do this step with your data?

What matters for how the pipeline is configured within `openretina` is that, for each session, you end up with two numpy arrays, one for the stimuli and one for the responses, sampled at the same rate (i.e. with exactly the same length in the time dimension).

This might require some resampling if it is not the case already, and the workflow will vary depending on how your data is exported. This decision and implementation are the responsibility of the user.
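
As one example of such resampling, the sketch below averages a response recorded at a higher rate into bins that match the stimulus frames. It assumes the response rate is an integer multiple of the stimulus frame rate. The function name and the toy array are hypothetical.

```python
import numpy as np


def average_into_frames(response, factor):
    """Average every `factor` consecutive samples of a neurons x time array into one bin."""
    n_neurons, n_samples = response.shape
    n_frames = n_samples // factor
    trimmed = response[:, : n_frames * factor]  # drop an incomplete trailing bin, if any
    return trimmed.reshape(n_neurons, n_frames, factor).mean(axis=2)


# Toy response: 2 neurons, 10 samples, to be reduced by a factor of 5
fast_response = np.arange(20, dtype=float).reshape(2, 10)
frame_response = average_into_frames(fast_response, factor=5)
print(frame_response)
```

If the two rates are not integer multiples of each other, interpolation or re-binning from spike times is likely a better fit than this reshaping trick.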

### Q: What if I do not have a train and a test split in my data?

The train/test split is completely arbitrary, but it is sometimes a direct consequence of certain experimental design choices. For example, test stimuli usually have been repeated multiple times, so that an average response can be computed, along with different estimates of SNR or response reliability. Training stimuli on the other hand tend to have a lower number of repeats, often only 1.

If all your stimuli have multiple repeats by design and no clear train/test separation, you can decide which parts to use for training and which for testing, for example with an 80% / 20% split. It is recommended to use the **average test trace** across repetitions for testing. During training, in contrast, some noise can be beneficial, so it is recommended to use the single repeats (this also gives you more training data).

If your data has no clear trial/repetition structure, and you only have 1 repeat per stimulus, you can similarly decide arbitrarily how to split your data and how much to leave for testing. In this case, however, you can expect lower test performance than you would get if your test responses were averaged across multiple trials. The reason is simply that averaging over trials reduces noise, which is otherwise treated as ground-truth signal when computing test performance.
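
For the case with multiple repeats, a split along the time axis could look like the sketch below. It assumes responses stored as `trials x neurons x time`, the layout expected by the `test_by_trial` argument shown earlier. The function name and the 80/20 proportion are illustrative choices, and the stimulus has to be cut at the same time index.

```python
import numpy as np


def split_repeated_responses(responses_by_trial, test_fraction=0.2):
    """Split trials x neurons x time responses in time; average the test part over trials."""
    n_time = responses_by_trial.shape[-1]
    n_test = int(round(n_time * test_fraction))
    split = n_time - n_test
    train_by_trial = responses_by_trial[..., :split]  # single repeats kept for training
    test_by_trial = responses_by_trial[..., split:]
    test_mean = test_by_trial.mean(axis=0)  # average test trace, neurons x test_time
    return train_by_trial, test_mean, test_by_trial


# Toy data: 4 repeats, 3 neurons, 100 time bins (hypothetical)
rng = np.random.default_rng(1)
toy_responses = rng.random((4, 3, 100))
train_by_trial, test_mean, test_by_trial = split_repeated_responses(toy_responses)
print(train_by_trial.shape, test_mean.shape, test_by_trial.shape)
```

To train on the single repeats, each repeat of `train_by_trial` still needs to be paired with the matching stimulus segment, for example by treating repeats as separate stretches of training data.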

---

# Dataloading

We are now ready to initialise a dataloader with the stimuli and responses we extracted. Note that dataloader functions within `openretina` assume that you input a **dictionary** of stimuli and responses, where keys are session names and values are instances of the `ResponsesTrainTestSplit` and `MoviesTrainTestSplit` classes, like the ones we just created.

We make this assumption to accommodate multiple experimental sessions for training, which is the usual case. 
If you indeed have data from multiple sessions, you have two options moving forward.

1. Manually repeat what we have done above for all sessions
2. **Recommended**: code up your personal *data_io* functions / modules, one for the stimuli and one for the responses. The output of these functions should be two dictionaries that share the keys (i.e. the session names), and have as values the different `ResponsesTrainTestSplit` and `MoviesTrainTestSplit` objects. If you take this route, you can insert such functions inside `openretina.data_io.your_dataset_name` and if you feel like sharing, submit a PR to us such that we can include your dataset in the repository! To see a worked example, check out how we coded up the functions to do so for the current dataset at `openretina.data_io.maheswaranathan_2023`.
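
A first building block for the second option is a function that discovers which sessions are available. The sketch below assumes the folder layout of this dataset (one sub-folder per session, one `.h5` file per stimulus type). The function name is hypothetical and the demo runs on a temporary directory with empty placeholder files. Each returned file could then be loaded and wrapped as shown earlier in this notebook, using the session name as the shared key of both dictionaries.

```python
import os
import tempfile


def find_session_files(base_path, stim_type):
    """Map each session folder under `base_path` to its `<stim_type>.h5` file, if present."""
    session_files = {}
    for session in sorted(os.listdir(base_path)):
        candidate = os.path.join(base_path, session, f"{stim_type}.h5")
        if os.path.isfile(candidate):
            session_files[session] = candidate
    return session_files


# Demo on a temporary folder that mimics the layout (placeholder files only)
with tempfile.TemporaryDirectory() as base:
    for session in ["15-10-07", "15-11-21a"]:
        os.makedirs(os.path.join(base, session))
        open(os.path.join(base, session, "whitenoise.h5"), "w").close()
    os.makedirs(os.path.join(base, "empty_session"))  # has no whitenoise file
    session_files = find_session_files(base, "whitenoise")

session_names = list(session_files)
print(session_names)
```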


To keep things simple, here we simply initialise one-item dictionaries for the stimuli and responses we just extracted.

In [None]:
stimuli = {"15-10-07": single_stimulus}

responses = {"15-10-07": single_response}

We are now ready to feed our matching dictionaries of stimuli and responses to a dataloader.

In [None]:
dataloaders = multiple_movies_dataloaders(neuron_data_dictionary=responses, movies_dictionary=stimuli)

Creating movie dataloaders:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
stimuli['15-10-07'].train.shape

(1, 359802, 50, 50)

In [None]:
pp.pprint(dataloaders)

defaultdict(<class 'dict'>,
            {   'test': {   '15-10-07': torch.utils.data.DataLoader(Dataset: MovieDataSet with 9 neuron responses to a movie of shape [1, 5996, 50, 50].)},
                'train': {   '15-10-07': torch.utils.data.DataLoader(Dataset: MovieDataSet with 9 neuron responses to a movie of shape [1, 358800, 50, 50].)},
                'validation': {   '15-10-07': torch.utils.data.DataLoader(Dataset: MovieDataSet with 9 neuron responses to a movie of shape [1, 1000, 50, 50].)}})


# Initialising a simple model

Digital twin models, like ML models in general, depend heavily on the data they are trained and evaluated on, and this extends to the model architecture. In practice:

- The shape of the input stimulus will influence the shape of the convolutional kernels, and is therefore a parameter at model creation
- The number of sessions and the neurons in each session will, in turn, influence the structure and number of parameters in the readout networks, and are also parameters at model creation.

To get this information from the data, pass it to the model and store it, `openretina` has a utility function, `compute_data_info`, which takes as arguments the same two dictionaries that are fed to the dataloader function.

In [None]:
data_info = compute_data_info(neuron_data_dictionary=responses, movies_dictionary=stimuli)

# Display the data info
pp.pprint(data_info)

{   'input_shape': (1, 50, 50),
    'movie_norm_dict': {   '15-10-07': {   'norm_mean': np.float64(64.0036423232778),
                                           'norm_std': np.float64(63.99999989635505)}},
    'n_neurons_dict': {'15-10-07': 9},
    'sessions_kwargs': {'15-10-07': {}}}


Now we can initialise a model:

In [None]:
model = CoreReadout(
    in_shape=(1, 100, 50, 50),  # Note that data_info does not include time, we add a dummy time dimension here.
    hidden_channels=[32, 64],
    temporal_kernel_sizes=[3, 3],
    spatial_kernel_sizes=[7, 7],
    n_neurons_dict=data_info["n_neurons_dict"],
)

  core = SimpleCoreWrapper(
2025-04-29 13:57:12,627 - INFO - in_shape_readout=torch.Size([64, 68, 44, 44])
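
Instead of typing the spatial dimensions by hand, the 4-dimensional `in_shape` can be derived from the `input_shape` entry of `data_info`. The helper below is a hypothetical convenience function, and the choice of 100 frames simply mirrors the dummy value used above.

```python
def add_time_dimension(input_shape, n_frames):
    """Turn (channels, height, width) into (channels, time, height, width)."""
    channels, height, width = input_shape
    return (channels, n_frames, height, width)


# Same value as the 'input_shape' entry printed by compute_data_info above
data_info_example = {"input_shape": (1, 50, 50)}

in_shape = add_time_dimension(data_info_example["input_shape"], n_frames=100)
print(in_shape)  # (1, 100, 50, 50)
```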


# Training

The last step before initiating training is now to wrap the training, validation and testing dataloaders (which are, in fact, dictionaries of dataloaders) into a `Cycler` object, which is a utility that will go through the data for each session.

(Note that we still need to do this in our one-session running example, because `dataloaders["train"]`, `dataloaders["validation"]` and `dataloaders["test"]` will still be dictionaries, in this case of only one item. Feel free to inspect the `dataloaders` dictionary we created further to make sense of this.)
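
To give an intuition for what cycling over a dictionary of dataloaders means, here is a simplified, pure-Python illustration. It is not the implementation of `LongCycler` or `ShortCycler`, whose stopping and repetition rules may differ (see `openretina.data_io.cyclers` for the actual code). The function name and toy "loaders" are hypothetical.

```python
def round_robin_batches(loaders):
    """Yield (session_key, batch) pairs, alternating between sessions until all are exhausted."""
    iterators = {key: iter(loader) for key, loader in loaders.items()}
    while iterators:
        for key in list(iterators):  # copy the keys so entries can be removed while looping
            try:
                yield key, next(iterators[key])
            except StopIteration:
                del iterators[key]


# Lists stand in for per-session dataloaders (hypothetical)
toy_loaders = {"session_a": ["a0", "a1", "a2"], "session_b": ["b0"]}
batches = list(round_robin_batches(toy_loaders))
print(batches)
```

Each yielded item carries the session key along with the batch, which is what lets a model with one readout per session route the batch to the right readout.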

In [None]:
train_loader = LongCycler(dataloaders["train"])
val_loader = ShortCycler(dataloaders["validation"])
test_loader = ShortCycler(dataloaders["test"])

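Here we will just check that the training loop runs: with `fast_dev_run=True`, Lightning runs a single batch and suppresses logging and checkpointing.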

In [None]:
trainer = lightning.Trainer(fast_dev_run=True)

trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)

Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Running in `fast_dev_run` mode: will run the requested loop using 1 batch(es). Logging and checkpointing is suppressed.
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                        | Params | Mode 
-------------------------------------------------------------------------
0 | core             | SimpleCoreWrapper           | 106 

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=1` reached.


It is recommended to set up training using a training script, either a custom one or using our unified interface. More on that below.

# Using new data with our unified training script

`openretina` comes with a few command line scripts, among which `openretina train`. This calls our training script, which uses `hydra` for config management.

A few things are needed to run training on a completely new dataset using our training script:

1. Creating data_io and dataloading functions for the new dataset, and placing them in the `openretina/data_io` submodule. Earlier parts of this notebook dealt with this.
2. Creating data_io and dataloader configs for the new dataset, and placing them in the appropriate folders in `configs`.
3. Creating an "outer" config, to place as a direct child of the `configs` folder.

Let's go through these step by step.

### 1. Creating a custom sub-module for the new dataset.

Doing something similar to what we did in this notebook, you would need to code up functions for the stimuli and for the responses that create two dictionaries that share the keys (i.e. the session names), and have as values `ResponsesTrainTestSplit` and `MoviesTrainTestSplit` objects. 

Extending the example in this notebook, we already provide such functions for the `maheswaranathan_2023` dataset under `openretina.data_io.maheswaranathan_2023`.

### 2. Creating config files for data_io and dataloading

Once such dataloading functions are in place, we need to make sure they are used correctly in the training script.
This is how dataloading happens in the training script:

```{python}
movies_dict = hydra.utils.call(cfg.data_io.stimuli)
neuron_data_dict = hydra.utils.call(cfg.data_io.responses)

dataloaders = hydra.utils.instantiate(
    cfg.dataloader,
    neuron_data_dictionary=neuron_data_dict,
    movies_dictionary=movies_dict,
)
```

Let's break this down in the case for the stimuli.

`hydra.utils.call` is calling a function which is found in the config at `data_io.stimuli`. 
In the main configuration files folder (called `configs`), we have different subfolders for different possibilities of configuration options. In the `data_io` folder we have different `YAML` files dealing with the data_io functions. There, we created a file called `maheswaranathan_2023.yaml` which looks like this:

```{yaml}
stimuli:
  _target_: openretina.data_io.maheswaranathan_2023.stimuli.load_all_stimuli
  _convert_: object
  base_data_path: ${paths.data_dir}
  stim_type: "naturalscene"
  normalize_stimuli: true


responses:
  _target_: openretina.data_io.maheswaranathan_2023.responses.load_all_responses
  _convert_: object
  base_data_path: ${paths.data_dir}
  stim_type: "naturalscene"
  response_type: "firing_rate_20ms"
  fr_normalization: 1.0
```

When we call `hydra.utils.call(cfg.data_io.stimuli)`, Hydra looks up the `stimuli` key in our configuration and finds that it specifies a function to call:
- `_target_`: Specifies the fully qualified path of the function that should be called, in this case `openretina.data_io.maheswaranathan_2023.stimuli.load_all_stimuli`.
- `_convert_`: Controls how Hydra converts config values before passing them as arguments. With `object`, nested config nodes arrive as plain Python dictionaries and lists (and structured configs as instances of their dataclass) rather than as OmegaConf containers.
- The rest are arguments specific to the function that we coded up.
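
Conceptually, resolving `_target_` boils down to importing a function from its dotted path and calling it with the remaining keys as keyword arguments. The sketch below illustrates just that idea with the standard library. It is not Hydra's implementation: it ignores interpolation, recursion and `_convert_`, and the function name `call_target` is hypothetical.

```python
import importlib


def call_target(config):
    """Import the callable named by `_target_` and call it with the other keys as kwargs."""
    # Keys wrapped in underscores (e.g. _target_, _convert_) are treated as directives here
    kwargs = {k: v for k, v in config.items() if not (k.startswith("_") and k.endswith("_"))}
    module_name, _, attr_name = config["_target_"].rpartition(".")
    func = getattr(importlib.import_module(module_name), attr_name)
    return func(**kwargs)


# Stand-in target from the standard library instead of an openretina function
result = call_target({"_target_": "datetime.timedelta", "_convert_": "object", "days": 1, "hours": 12})
print(result)
```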

Importantly then, when adding the configuration for a new dataset, the user should specify in a new config file under `data_io` which functions should be called, and with which parameters, so that they return the dictionaries mapping session names to `ResponsesTrainTestSplit` and `MoviesTrainTestSplit` objects.

The same holds for dataloading.


### 3. Creating an "outer" config.

Once `data_io` functions are coded up, and `data_io` configs are created, these will need to be referenced in an "outer" config file which orchestrates the run. A template is present under `configs/template_outer_config.yaml`.

Here is what the template looks like:
```{yaml}

defaults:
  - data_io: ??? # For new data, create data_io config and put its name here
  - dataloader: ??? # For new data, create dataloader config and put its name here
  - model: base_core_readout
  - training_callbacks:
    - early_stopping
    - lr_monitor
    - model_checkpoint
  - logger:
    - tensorboard
    - csv
  - trainer: default_deterministic
  - hydra: default
  - _self_ # values in this config will overwrite the defaults

exp_name: example_experiment_new_data
seed: 42
check_stimuli_responses_match: false

paths:
  cache_dir: ${oc.env:OPENRETINA_CACHE_DIRECTORY} # Remote files are downloaded to this location
  # If data_dir is a local path, data will be read from there. If a remote link, the target will be downloaded to cache_dir.
  data_dir: ??? # Choose the location of the data. Should be used in data_io functions.
  log_dir: "." # Used as parent for output_dir. Will store train logs.
  output_dir: ${hydra:runtime.output_dir} # Modify in the "hydra/default.yaml" config

# Overwrite model defaults with specifics for the current data input format
model:
  in_shape: ???
  hidden_channels: ???
  spatial_kernel_sizes: ??? 
  # Can over-ride further model defaults here.
```

#### Breaking down the template

##### 1. Defaults section:
- Hydra uses the `defaults` section to compose configurations from different files.
- Each line here references a specific configuration file, stored in subdirectories within `configs/`.
- For example, `data_io: ???` means that a specific data_io config must be created and provided (e.g., `maheswaranathan_2023`).
- Similarly, `dataloader: ???` ensures that a dataloader configuration is selected.
- `_self_` ensures that values defined later in this file override the defaults.
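
The effect of placing `_self_` last can be pictured as a recursive dictionary merge in which the outer config is applied on top of the defaults. The sketch below shows only this intuition. Hydra's actual composition handles more cases (config groups, packages, list entries), and the default values used here are hypothetical.

```python
def deep_merge(base, override):
    """Return a new dict where values from `override` replace or extend those in `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical model defaults, then the values set in the outer config
model_defaults = {"model": {"hidden_channels": [32, 64], "learning_rate": 0.01}}
outer_config = {"model": {"hidden_channels": [16, 32]}, "seed": 42}

composed = deep_merge(model_defaults, outer_config)
print(composed)
```

Keys that the outer config does not mention, such as `learning_rate` in this toy example, keep their default values.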

##### 2. Run specific variables
- `exp_name`: The experiment name, which helps organize logs and outputs.
- `seed`: A fixed seed for reproducibility.
- `check_stimuli_responses_match`: A debugging flag to ensure that stimuli and responses are aligned correctly.

##### 3. File paths
- `cache_dir`: The base directory for downloads, if any need to happen.
- `data_dir`: The location of the dataset, which can be referenced in data_io functions using `${paths.data_dir}`. If `${paths.data_dir}` is a remote path, its contents will be downloaded to `cache_dir`, and the path of the downloaded files will be used when loading the data.
- `log_dir`: Parent folder for the logs, which is used by `output_dir`.
- `output_dir`: Where logs, model checkpoints, and results will be saved. Uses `log_dir` as the parent, and the sub-folder structure is set by hydra.

##### 4. Model specific overrides
This section defines the input shape and architecture details, overriding the default model configuration if needed.
- ``in_shape``, ``hidden_channels``, and ``spatial_kernel_sizes`` are left as placeholders (???), meaning they should be specified based on the dataset used.

---


#### Filling in the Configuration for `maheswaranathan_2023`
Now, let’s see how this template is filled in for an actual experiment using `maheswaranathan_2023`:

```{yaml}
defaults:
  - data_io: maheswaranathan_2023
  - dataloader: maheswaranathan_2023
  - model: base_core_readout
  - training_callbacks:
    - early_stopping
    - lr_monitor
    - model_checkpoint
  - logger:
    - tensorboard
    - csv
  - trainer: default_deterministic
  - hydra: default
  - _self_
```

Instead of ``???``, we now explicitly specify ``maheswaranathan_2023`` for both ``data_io`` and ``dataloader``.
The remaining configuration choices (e.g., logging, training callbacks, trainer) stay the same as the template, but could also be modified further. We provide different options in the respective folders.

Continuing:
```{yaml}
exp_name: core_readout_maheswaranathan
seed: 42
check_stimuli_responses_match: false

paths:
  cache_dir: null # Assume we already downloaded and unzipped manually the data
  data_dir: ${oc.env:HOME}/baccus_data/neural_code_data/ganglion_cell_data/ # Say we downloaded it in home
  log_dir: "." # Save logs in the current directory
  output_dir: ${hydra:runtime.output_dir} # Keep hydra default for sub-folders, which we set in configs/hydra/default.yaml

model:
  in_shape: [1, 100, 50, 50]
  hidden_channels: [16, 32]
  spatial_kernel_sizes: [15, 11]
```
- The experiment is now named "core_readout_maheswaranathan", which will be used in logs and outputs.
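- The dataset location is explicitly set to ``${oc.env:HOME}/baccus_data/neural_code_data/ganglion_cell_data/``. Since the data is assumed to be downloaded already, ``cache_dir`` is ``null`` here; it should still be defined by the user if ``data_dir`` points to a remote location.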
- The model section is defined, containing a few over-rides of the defaults for `base_core_readout`:
  - ``in_shape: [1, 100, 50, 50]`` represents the input dimensions for the dataset (channels, time, height, width).
  - ``hidden_channels: [16, 32]`` defines the number of channels in each convolutional layer.
  - ``spatial_kernel_sizes: [15, 11]`` specifies the spatial kernel sizes.

---

Once an outer config is specified, running training with the specified options is done via the command line with:

```{bash}
openretina train --config-name "maheswaranathan_2023_core_readout"
```

Here you need to replace `"maheswaranathan_2023_core_readout"` with the name of your outer `YAML` config.

---

# Conclusion

In this tutorial, we walked through the process of integrating a new dataset into `openretina` and getting started with training on it. While setting up a new dataset can be challenging, taking a structured approach makes it much more manageable, despite the initial learning curve. If you run into issues, don’t hesitate to reach out, and explore the Hydra and OpenRetina documentation for more details.