# Providing data - Level 1
## How to tell Hyrax what data to give to a model.

A model will need data for training and inference.
Hyrax ``Dataset`` classes define how to read data and provide it for use.
(If you haven't seen the Hyrax Dataset intro notebook, it's worth a look.)

Our goals here will be to:
- Show how to define what data Hyrax should provide to a model with a ``DataProvider``.
- Explore the ``DataProvider`` to understand the form of the data being provided.

To accomplish these goals, we'll use the built-in ``HyraxRandomDataset`` class as
a stand-in for a real data source.

As always, we'll start by creating an instance of the Hyrax class.

In [1]:
from hyrax import Hyrax

h = Hyrax()

[2025-09-11 21:18:43,512 hyrax:INFO] Runtime Config read from: /Users/drew/code/hyrax/src/hyrax/hyrax_default_config.toml


Next we'll try to tell Hyrax that we want to use the ``HyraxRandomDataset`` as the source for our data provider.

In [2]:
h.config["model_inputs"] = {
    "random_ds": {
        "dataset_class": "HyraxRandomDataset",
    },
}

# Prepare the model_inputs
d = h.prepare()

# Print a sample of the data
d.sample_data()

[2025-09-11 21:18:46,431 hyrax.data_sets.data_provider:INFO] No fields were specified for 'random_ds'. The request will be modified to select all by default. You can specify `fields` in `model_inputs`.
[2025-09-11 21:18:46,462 hyrax.prepare:INFO] Finished Prepare


{'random_ds': {'image': array([[[0.08925092, 0.773956  , 0.6545715 , 0.43887842, 0.43301523],
          [0.8585979 , 0.08594561, 0.697368  , 0.20146948, 0.09417731],
          [0.52647895, 0.9756223 , 0.73575234, 0.7611397 , 0.71747726],
          [0.78606427, 0.51322657, 0.12811363, 0.8397482 , 0.45038593],
          [0.5003519 , 0.370798  , 0.1825496 , 0.92676497, 0.78156745]],
  
         [[0.6438651 , 0.40241432, 0.8227616 , 0.5454291 , 0.44341415],
          [0.45045954, 0.22723871, 0.09213591, 0.55458474, 0.8878898 ],
          [0.0638172 , 0.85829127, 0.8276311 , 0.27675968, 0.6316644 ],
          [0.16522902, 0.7580877 , 0.70052296, 0.35452592, 0.06791997],
          [0.970698  , 0.44568747, 0.89312106, 0.677919  , 0.7783835 ]]],
        dtype=float32),
  'label': np.int64(0),
  'meta_field_1': np.float64(50.0),
  'meta_field_2': np.float64(33.333333333333336),
  'object_id': '19'}}

Hooray! It looks like we were able to tell Hyrax what dataset to use, instantiate that dataset with ``h.prepare()`` and then print out a sample of the data.

Note that the configuration is called ``model_inputs`` because it defines what will be input to the model.
And yes, "inputs" is plural; we'll get to that in the next notebook.

This particular configuration is the minimum required to tell Hyrax to use data from ``HyraxRandomDataset`` as the input for model training and inference.

There are two important points to call out:
1) The dictionary key ``"random_ds"`` is up to the user. You can call it whatever you like as long as it's unique.
2) The value of ``"dataset_class"`` is the name of the class you want to use to read in data.
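
To make those two points concrete, here is a minimal sketch that builds a dictionary in the same shape as the ``model_inputs`` configuration above. The helper name ``make_model_inputs`` is hypothetical and is not part of Hyrax; it only produces a plain Python dictionary and makes the "unique name" rule explicit.

```python
def make_model_inputs(entries):
    """Build a model_inputs-style dict from (friendly_name, dataset_class) pairs.

    Hypothetical helper for illustration only; it is not part of Hyrax.
    """
    model_inputs = {}
    for friendly_name, dataset_class in entries:
        # The friendly name is the user's choice, but it must be unique.
        if friendly_name in model_inputs:
            raise ValueError(f"Duplicate friendly name: {friendly_name!r}")
        # "dataset_class" names the class that reads the data.
        model_inputs[friendly_name] = {"dataset_class": dataset_class}
    return model_inputs


model_inputs = make_model_inputs([("random_ds", "HyraxRandomDataset")])
print(model_inputs)
```

The resulting dictionary has the same structure as the one assigned to ``h.config["model_inputs"]`` in the cell above.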

## If you only want _some_ fields

In the minimal configuration Hyrax will use **all** the fields that a dataset class provides.
Often this isn't what you want. 
The ``DataProvider`` instance allows you to see what fields are available,
and ``model_inputs`` can be updated to only request the desired fields.

In [3]:
d.fields()

{'random_ds': ['image', 'label', 'meta_field_1', 'meta_field_2', 'object_id']}

In [4]:
h.config["model_inputs"] = {
    "random_ds": {
        "dataset_class": "HyraxRandomDataset",
        "fields": ["image", "meta_field_2"],  # <- Request only specific fields.
    },
}

# Prepare the model_inputs
d = h.prepare()

# Print a sample of the data
d.sample_data()

[2025-09-11 21:18:46,593 hyrax.prepare:INFO] Finished Prepare


{'random_ds': {'image': array([[[0.08925092, 0.773956  , 0.6545715 , 0.43887842, 0.43301523],
          [0.8585979 , 0.08594561, 0.697368  , 0.20146948, 0.09417731],
          [0.52647895, 0.9756223 , 0.73575234, 0.7611397 , 0.71747726],
          [0.78606427, 0.51322657, 0.12811363, 0.8397482 , 0.45038593],
          [0.5003519 , 0.370798  , 0.1825496 , 0.92676497, 0.78156745]],
  
         [[0.6438651 , 0.40241432, 0.8227616 , 0.5454291 , 0.44341415],
          [0.45045954, 0.22723871, 0.09213591, 0.55458474, 0.8878898 ],
          [0.0638172 , 0.85829127, 0.8276311 , 0.27675968, 0.6316644 ],
          [0.16522902, 0.7580877 , 0.70052296, 0.35452592, 0.06791997],
          [0.970698  , 0.44568747, 0.89312106, 0.677919  , 0.7783835 ]]],
        dtype=float32),
  'meta_field_2': np.float64(33.333333333333336)}}

Huzzah! We were able to look at all the data fields that the ``HyraxRandomDataset`` provides,
then update ``model_inputs`` to only request "image" and "meta_field_2".
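
Before editing ``model_inputs``, it can help to compare the fields you plan to request against the listing from ``d.fields()``. The sketch below does that with plain Python. The function ``unknown_fields`` is a hypothetical helper, not a Hyrax API, and the ``available`` dictionary is copied by hand from the output above.

```python
# Copied from the output of d.fields() shown above.
available = {
    "random_ds": ["image", "label", "meta_field_1", "meta_field_2", "object_id"],
}


def unknown_fields(available, friendly_name, requested):
    """Return the requested field names that the dataset does not list.

    Hypothetical helper for illustration only; it is not part of Hyrax.
    """
    known = set(available[friendly_name])
    return [name for name in requested if name not in known]


print(unknown_fields(available, "random_ds", ["image", "meta_field_2"]))  # → []
print(unknown_fields(available, "random_ds", ["image", "redshift"]))  # → ['redshift']
```

An empty list means every requested field appears in the listing, so the ``fields`` entry only names fields the dataset provides.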

## Examine some of the data
Now that our ``DataProvider`` is configured to only return the data fields that we want, let's take a moment to look more carefully at the data itself.

We've already seen the ``DataProvider.sample_data()`` function.
It attempts to return the first data sample that it can.
Since Hyrax retrieves data by index, we can also ask for the sample at a specific index, such as 12.

In [5]:
d[12]

{'random_ds': {'image': array([[[0.6162841 , 0.77899635, 0.73904294, 0.13455218, 0.8260549 ],
          [0.536068  , 0.3230834 , 0.51422286, 0.96698344, 0.85757214],
          [0.83369845, 0.4627993 , 0.7841378 , 0.38508946, 0.68616545],
          [0.63956326, 0.24979866, 0.26646328, 0.06113309, 0.13976836],
          [0.4336186 , 0.47787726, 0.16512072, 0.4168893 , 0.6041775 ]],
  
         [[0.23256993, 0.33823687, 0.3675118 , 0.5050539 , 0.36639243],
          [0.7369446 , 0.32749552, 0.43389672, 0.37946403, 0.18291575],
          [0.68574333, 0.23544616, 0.29687643, 0.7927519 , 0.9488579 ],
          [0.6708431 , 0.916348  , 0.8541906 , 0.48091042, 0.77029014],
          [0.32836115, 0.46049702, 0.5354348 , 0.96049297, 0.84856045]]],
        dtype=float32),
  'meta_field_2': np.float64(29.333333333333332)}}

No big surprise here, this looks very similar to the output of ``d.sample_data()``.
Note: the returned data is a nested dictionary whose outer key is the same as the friendly name we chose in ``model_inputs``, "random_ds".
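
Here is a small sketch of unpacking a sample with that nested shape. The ``sample`` dictionary is a hand-built stand-in that uses nested lists of zeros in place of the NumPy array Hyrax returned above, so the block runs without Hyrax or NumPy installed.

```python
# Hand-built stand-in with the same layout as the output of d[12] above.
# The real "image" is a float32 NumPy array; nested lists are used here instead.
sample = {
    "random_ds": {
        "image": [[[0.0] * 5 for _ in range(5)] for _ in range(2)],
        "meta_field_2": 29.333333333333332,
    }
}

# The outer key is the friendly name; the inner keys are the requested fields.
fields = sample["random_ds"]
image = fields["image"]

print(sorted(fields))  # → ['image', 'meta_field_2']
print(len(image), len(image[0]), len(image[0][0]))  # → 2 5 5
```

With the real NumPy array, ``image.shape`` would report the same dimensions in one step.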

Of course, any integer index from 0 up to, but not including, the size of the data will work.
But how do we know the size of the data?
``len(...)`` of course!

In [7]:
len(d)

100
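
Knowing the length makes it easy to pick indices at random. The sketch below uses a hypothetical ``StandInProvider`` class in place of the real ``DataProvider``; the only assumption is that the object supports ``len(...)`` and integer indexing, as ``d`` does above. The values it returns are placeholders, not Hyrax data.

```python
import random


class StandInProvider:
    """Hypothetical stand-in that supports len() and integer indexing like `d`."""

    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        # This stand-in only accepts indices from 0 to size - 1.
        if not 0 <= index < self._size:
            raise IndexError(index)
        # Placeholder payload in the same nested layout as the real samples.
        return {"random_ds": {"meta_field_2": float(index)}}


provider = StandInProvider(100)
rng = random.Random(0)  # seeded so the picks are repeatable

# randrange(n) returns an integer in [0, n), so every pick is a valid index.
indices = [rng.randrange(len(provider)) for _ in range(3)]
samples = [provider[i] for i in indices]

for i, s in zip(indices, samples):
    print(i, s)
```

In a live session the same two lines that build ``indices`` and ``samples`` should work with ``d`` in place of ``provider``.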

## Persisting the configuration

As with all configurations in Hyrax, ``model_inputs`` will be saved in the configuration .toml file along with any results.
Our example would look like this:

``` toml
[model_inputs]
[model_inputs.random_ds]
dataset_class = 'HyraxRandomDataset'
fields = ['image', 'meta_field_2']
```