# 6. Using datasets

The `nitrain.Dataset` class provides everything you need to map collections of images and related metadata. This chapter introduces the basic functionality and structure of the class so you can get started. Once you know the basics, it is straightforward to build on them with the features covered in later chapters.

## Prerequisites

Besides nitrain, this chapter uses ants and numpy to create images, pandas to create some basic csv files, and standard operating system tools to create directories that mimic what your data looks like when it is not loaded into memory.

```python
import nitrain as nt
import ants
import numpy as np
import pandas as pd
import os
from tempfile import TemporaryDirectory
```

## Basic example

To create a dataset, you need to pass in `inputs` and `outputs` arguments. In the most basic example of image classification, you would pass in a list of images as inputs and a list of class labels as outputs.

```python
images = [ants.from_numpy(np.ones((100,100))) * i for i in range(10)]
labels = [i for i in range(10)]

dataset = nt.Dataset(inputs=images,
                     outputs=labels)
```

Now our dataset is mapped! We can retrieve a record from the dataset via indexing.

```python
x, y = dataset[0]
print(x)
```

```
ANTsImage
	 Pixel Type : float (float32)
	 Components : 1
	 Dimensions : (100, 100)
	 Spacing    : (1.0, 1.0)
	 Origin     : (0.0, 0.0)
	 Direction  : [1. 0. 0. 1.]
```



Retrieving multiple records at once is also possible by indexing with a slice.

```python
x_list, y_list = dataset[3:5]
print(x_list)
print(y_list)
```

```
[ANTsImage
	 Pixel Type : float (float32)
	 Components : 1
	 Dimensions : (100, 100)
	 Spacing    : (1.0, 1.0)
	 Origin     : (0.0, 0.0)
	 Direction  : [1. 0. 0. 1.]
, ANTsImage
	 Pixel Type : float (float32)
	 Components : 1
	 Dimensions : (100, 100)
	 Spacing    : (1.0, 1.0)
	 Origin     : (0.0, 0.0)
	 Direction  : [1. 0. 0. 1.]
]
[3, 4]
```


We can also print the dataset to understand a bit more of its structure.

```python
print(dataset)
```

```
Dataset (n=10)
     Inputs     : <nitrain.readers.memory.MemoryReader object at 0x1326f5690>
     Outputs    : <nitrain.readers.memory.MemoryReader object at 0x1326f5dd0>
     Transforms : {}
```



As you can see, our dataset has a `MemoryReader` in both the inputs and the outputs slot. You will learn more about readers in a later chapter, but the short explanation is that readers are what the dataset uses to fetch records from a variety of sources. Since our images and labels already exist in memory, a `MemoryReader` is inferred.
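To make the reader idea concrete, here is a minimal sketch of a dataset that delegates indexing to readers. `ToyMemoryReader` and `ToyDataset` are hypothetical names invented for this illustration; this is not nitrain's implementation, only one way the behaviour shown above could arise.

```python
class ToyMemoryReader:
    """Hypothetical stand-in for a reader: serves values already held in memory."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        # Works for both integers and slices because it defers to the list.
        return self.values[idx]


class ToyDataset:
    """Hypothetical stand-in for a dataset pairing an input reader with an output reader."""

    def __init__(self, inputs, outputs):
        # Plain sequences are wrapped in a memory reader, mirroring the inference described above.
        self.inputs = inputs if isinstance(inputs, ToyMemoryReader) else ToyMemoryReader(inputs)
        self.outputs = outputs if isinstance(outputs, ToyMemoryReader) else ToyMemoryReader(outputs)
        if len(self.inputs) != len(self.outputs):
            raise ValueError("inputs and outputs must have the same number of records")

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # An integer returns one (x, y) record; a slice returns a pair of lists.
        return self.inputs[idx], self.outputs[idx]


toy = ToyDataset(inputs=[f"image{i}" for i in range(10)], outputs=list(range(10)))
x, y = toy[0]
x_list, y_list = toy[3:5]
print(x, y)
print(x_list, y_list)
```

Because the dataset only ever asks its readers for the record at a given index, swapping the in-memory reader for one that loads from disk would not change how the dataset is used.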

## Loading from file

When our data does not exist in memory already, we need to actually specify the source of the data with a reader. Let's start with a scenario where our images are stored in a folder and we still want to perform classification. How would the class labels be stored?

One common possibility is for our class labels to be stored in a csv file. Then our folder might look like this:

```
mydata/
  participants.csv
  img0.nii.gz
  img1.nii.gz
  ...
  img9.nii.gz
```

Let's create this dataset in a temporary folder.

```python
tmpfolder = TemporaryDirectory()
base_dir = tmpfolder.name

# save images
for i in range(10):
    ants.image_write(images[i], os.path.join(base_dir, f'img{i}.nii.gz'))

# create and save participants.csv
dataframe = pd.DataFrame({'labels': labels})
dataframe.to_csv(os.path.join(base_dir, 'participants.csv'))
```

Listing the files in the directory shows us exactly what we expect.

```python
print(os.listdir(base_dir))
```

```
['img6.nii.gz', 'img4.nii.gz', 'img8.nii.gz', 'participants.csv', 'img0.nii.gz', 'img2.nii.gz', 'img7.nii.gz', 'img5.nii.gz', 'img9.nii.gz', 'img1.nii.gz', 'img3.nii.gz']
```


In this case, we want to read images from the folder and we want to read the class labels from a column in the participants.csv file. In nitrain, this corresponds to using the `ImageReader` and the `ColumnReader` classes. Here is what this looks like.

```python
from nitrain import readers
dataset = nt.Dataset(inputs=readers.ImageReader('img*.nii.gz'),
                     outputs=readers.ColumnReader('labels', base_file='participants.csv'),
                     base_dir=base_dir)
```

The `ImageReader` class lets us map images from a glob-like pattern, while the `ColumnReader` lets us map column values from csv-like files. We also pass in a `base_dir` to make things simpler. We can read a record from this dataset exactly as before.

```python
x, y = dataset[3]
print(x)
print(x.mean())
print(y)
```

```
ANTsImage
	 Pixel Type : float (float32)
	 Components : 1
	 Dimensions : (100, 100)
	 Spacing    : (1.0, 1.0)
	 Origin     : (0.0, 0.0)
	 Direction  : [1. 0. 0. 1.]

3.0
3
```


As you can see, nitrain read the image from file and aligned it with its label. This covers the scenario of reading images from file and values from csv-like files.
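The alignment described above can be illustrated with the standard library alone. The sketch below assumes that record `i` pairs the `i`-th file matched by the glob pattern (in sorted order) with the `i`-th row of the csv file. That ordering rule is an assumption made for this illustration, not a statement about nitrain's internals, and the empty placeholder files stand in for real images.

```python
import csv
import glob
import os
from tempfile import TemporaryDirectory

with TemporaryDirectory() as demo_dir:
    # Create empty placeholder files named like the images in this chapter.
    for i in range(10):
        open(os.path.join(demo_dir, f"img{i}.nii.gz"), "w").close()

    # Write a csv file with a single 'labels' column.
    with open(os.path.join(demo_dir, "participants.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["labels"])
        writer.writerows([[i] for i in range(10)])

    # Expand the glob pattern; sorting makes the order deterministic.
    paths = sorted(glob.glob(os.path.join(demo_dir, "img*.nii.gz")))

    # Read the 'labels' column row by row.
    with open(os.path.join(demo_dir, "participants.csv"), newline="") as f:
        column = [int(row["labels"]) for row in csv.DictReader(f)]

    # Assumption: record i pairs the i-th matched file with the i-th csv row.
    records = [(os.path.basename(p), label) for p, label in zip(paths, column)]

print(records[3])
```

Note that a plain lexicographic sort places `img10.nii.gz` before `img2.nii.gz`, so with more than ten files it is worth zero-padding the numbers in file names or checking the pairing explicitly.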

### Folder names as labels

Another common scenario for image classification is one where images are sorted into folders based on their class. In this case, the class labels are not stored in a csv-like file but are instead contained in the folder names themselves.

The dataset would therefore look like this:

```
mydata/
  class0/
    img1.nii.gz
    ...
  class1/
    img1.nii.gz
    ...
  ...
```

Let's create that dataset now in a temporary folder to use as a reference.

```python
tmpfolder = TemporaryDirectory()
base_dir = tmpfolder.name

# save images
for i in range(10):
    os.mkdir(os.path.join(base_dir, f'class{i}'))
    ants.image_write(images[i], os.path.join(base_dir, f'class{i}/img1.nii.gz'))
```

As before, we can list the main directory to see the structure. We can also show what's in one of the folders.

```python
print('Main folder:', sorted(os.listdir(base_dir)))
print('Sub-folder (class0):', sorted(os.listdir(os.path.join(base_dir, 'class0'))))
print('Sub-folder (class1):', sorted(os.listdir(os.path.join(base_dir, 'class1'))))
```

```
Main folder: ['class0', 'class1', 'class2', 'class3', 'class4', 'class5', 'class6', 'class7', 'class8', 'class9']
Sub-folder (class0): ['img1.nii.gz']
Sub-folder (class1): ['img1.nii.gz']
```


This scenario is handled by a small update to the glob pattern in our `ImageReader` and with a different kind of reader for the outputs called `FolderNameReader`.

```python
dataset = nt.Dataset(inputs=readers.ImageReader('*/img*.nii.gz'),
                     outputs=readers.FolderNameReader('*/img*.nii.gz'),
                     base_dir=base_dir)
```

Reading a record shows that the result is (nearly) the same as before.

```python
x, y = dataset[3]

print(x)
print(x.mean())
print(y)
```

```
ANTsImage
	 Pixel Type : float (float32)
	 Components : 1
	 Dimensions : (100, 100)
	 Spacing    : (1.0, 1.0)
	 Origin     : (0.0, 0.0)
	 Direction  : [1. 0. 0. 1.]

3.0
class3
```


The difference is that, because our folders were named "class0" rather than just "0", the `FolderNameReader` returned the full folder name as a string. We can change this by telling the `FolderNameReader` to format the values as integers instead.

```python
dataset = nt.Dataset(inputs=readers.ImageReader('*/img*.nii.gz'),
                     outputs=readers.FolderNameReader('*/img*.nii.gz', format='integer'),
                     base_dir=base_dir)

x, y = dataset[3]
print(y)
```

```
3
```


Now we get the same thing as before. This demonstrates the flexibility of many readers in nitrain.
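For intuition, the sketch below shows one way a folder name could be turned into a label. The helper `folder_label` and its `as_integer` flag are hypothetical. This chapter does not show how `FolderNameReader` performs the conversion for `format='integer'`, so the rule used here (take the digits at the end of the folder name) is purely an assumption.

```python
import re
from pathlib import PurePosixPath


def folder_label(path, as_integer=False):
    """Hypothetical helper: use the name of a file's parent folder as its label."""
    name = PurePosixPath(path).parent.name
    if not as_integer:
        return name
    # Assumption: the integer label is the run of digits at the end of the folder name.
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError(f"no trailing integer in folder name {name!r}")
    return int(match.group(1))


print(folder_label("mydata/class3/img1.nii.gz"))
print(folder_label("mydata/class3/img1.nii.gz", as_integer=True))
```

Raising an error for folder names without digits, rather than silently returning a default, keeps mislabeled records from slipping into training unnoticed.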

## Multiple inputs or outputs

In the scenarios above, we had only a single image as input and a single value as output. However, readers can be arbitrarily combined to return multiple inputs or multiple outputs in whatever format you need.

Let's start with a simpler scenario where we want to perform image segmentation, i.e., predict a label image from another image. Say that our folder looks like this:

```
mydata/
   img1.nii.gz
   img1-seg.nii.gz
   img2.nii.gz
   img2-seg.nii.gz
   ...
```


We can create this folder and then map a nitrain dataset using the `ImageReader` class as input and output.