# Reading a dataset

In this tutorial, we will learn how to read Open3D-ML datasets.

You can use any dataset available in the `ml3d.datasets` namespace. For this example, we will use the `SemanticKITTI` dataset. Other datasets are loaded the same way, but their parameters may differ.

To read a dataset in this example, we will supply the following parameters:

- Dataset path (`dataset_path`)
- Cache directory (`cache_dir`)
- Dataset splits (for training, validation, and testing)

> **For more theoretical background information on dataset splitting, please refer to these articles:**
>
> https://machinelearningcompass.com/dataset_optimization/split_data_the_right_way/
>
> https://www.freecodecamp.org/news/key-machine-learning-concepts-explained-dataset-splitting-and-random-forest/

## Creating a global dataset object

First, we import the Open3D-ML PyTorch library:

In [None]:
# import torch
import open3d.ml.torch as ml3d

We then create a `dataset` object, initializing it with the *path*, *cache directory*, and *splits*. This `dataset` can read all the files inside the `dataset_path` directory:

In [None]:
# Read a dataset by specifying the path. We are also providing the cache directory and splits.
dataset = ml3d.datasets.SemanticKITTI(dataset_path='SemanticKITTI/',
                                      cache_dir='./logs/cache',
                                      training_split=['00'],
                                      validation_split=['01'],
                                      test_split=['01'])

A few words regarding the *split* parameters: here, we divide the `SemanticKITTI` dataset content into three parts:

1. `training_split` for training. This part usually contains 70-75% of the global `dataset` content.
2. `validation_split` for validation. This part usually accounts for 10-15% of the global `dataset` content.
3. `test_split` for testing. It contains the test data, and its size varies.
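
As a rough illustration of the proportions above, the sketch below partitions a list of sequence IDs into three disjoint lists. The helper `partition_sequences`, the 70/15/15 fractions, and the twenty placeholder IDs are assumptions made for this example; they are not part of Open3D-ML, and real datasets often prescribe their own splits.

```python
def partition_sequences(sequences, train_frac=0.7, val_frac=0.15):
    """Partition an ordered list of sequence IDs into three disjoint lists.

    Hypothetical helper for illustration only; not an Open3D-ML function.
    Whatever is left after the training and validation parts becomes the test part.
    """
    n_train = round(len(sequences) * train_frac)
    n_val = round(len(sequences) * val_frac)
    train = sequences[:n_train]
    val = sequences[n_train:n_train + n_val]
    test = sequences[n_train + n_val:]
    return train, val, test


# Placeholder sequence IDs '00' .. '19' (assumed for the example).
sequences = [f"{i:02d}" for i in range(20)]
train, val, test = partition_sequences(sequences)
print(len(train), len(val), len(test))
```

The resulting lists could then be passed as the `training_split`, `validation_split`, and `test_split` arguments shown earlier.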

Note the `SemanticKITTI` dataset folder structure:

![dataset_structure](https://user-images.githubusercontent.com/93158890/162548755-28c541d3-3557-4903-a9a1-cc685d16dfc2.jpg)

The three *split* parameters instruct Open3D-ML to reference the following folder locations:

- `training_split=['00']` points to `'SemanticKITTI/dataset/sequences/00/'`
- `validation_split=['01']` points to `'SemanticKITTI/dataset/sequences/01/'`
- `test_split=['01']` points to `'SemanticKITTI/dataset/sequences/01/'`
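
The mapping listed above can be sketched in a few lines of plain Python. The helper `split_directories` is hypothetical and only mirrors the folder layout described in this tutorial; it does not reproduce how Open3D-ML resolves paths internally.

```python
from pathlib import PurePosixPath


def split_directories(dataset_path, sequences):
    """Return the sequence folder for each sequence ID in a split.

    Hypothetical helper: it assumes the '<dataset_path>/dataset/sequences/<id>'
    layout described in this tutorial.
    """
    root = PurePosixPath(dataset_path) / "dataset" / "sequences"
    return [str(root / seq) for seq in sequences]


training_dirs = split_directories("SemanticKITTI/", ["00"])
validation_dirs = split_directories("SemanticKITTI/", ["01"])
print(training_dirs, validation_dirs)
```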

> Note: **dataset split directories usually contain numerous point cloud files.** In our example, we included only one point cloud file for speed and convenience.

## Creating dataset split objects to query the data

Next, we will create **dedicated** dataset split objects, one for each split portion we would like to query.

First, we use the `get_split()` method to create a `train_split` subset for training from the global `dataset` we initialized above:

In [None]:
# Split the dataset for 'training'. You can get the other splits by passing 'validation' or 'test'
train_split = dataset.get_split('training')

Now, let's do the same for validation:

In [None]:
# Similarly, get the validation split.
val_split = dataset.get_split('validation')

Finally, we create a `test_split` subset for testing:

In [None]:
# Get test split
test_split = dataset.get_split('test')

## Determining the size of dataset splits

Let's see how large our *split* portions are:

In [None]:
# Get length of splits
print(len(train_split))
print(len(val_split))
print(len(test_split))

Above, Open3D-ML prints the number of point cloud files it found in the `'00/'` and `'01/'` subdirectories of `'SemanticKITTI/dataset/sequences/'`, which we specified earlier in the `training_split=['00']`, `validation_split=['01']`, and `test_split=['01']` parameters of the `dataset`.

## Querying dataset splits for data

In this section, we are using the `train_split` dataset split object as an example. The procedure is identical for the other dataset splits, `val_split` and `test_split`.

To extract the data from `train_split`, we can iterate over it with the index `i` (ranging from `0` to `len(train_split) - 1`) and call the `get_data()` method:

In [None]:
# Query splits for data, index should be from `0` to `len(split) - 1`
for i in range(len(train_split)):
    data = train_split.get_data(i)
    print(data)
    break

Each `data` object retrieved in the above `for` loop is a dictionary of points (`'point'`), features (`'feat'`), and labels (`'label'`), as we will see below:

In [None]:
data = train_split.get_data(0)  # Dictionary of `point`, `feat`, and `label`
print(data.keys())

- The **`'point'`** key value contains a set of 3D points/coordinates - X, Y, and Z:

![dataset_coordinates](https://user-images.githubusercontent.com/93158890/162549410-6369cbd0-b835-4216-ba54-945e3f591395.jpg)

- The **`'feat'`** (*features*) key value contains per-point features, such as RGB color information, for each of the above points. It may be empty for datasets that provide no such features.

- The **`'label'`** key value indicates which class each point belongs to, e.g. *pedestrian, vehicle, traffic light*.
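
To make the structure of these dictionaries more concrete, here is a minimal sketch that walks over a split and counts how many points carry each label. `ToySplit` is a hypothetical stand-in that only mimics the `len()` and `get_data()` interface used in this tutorial, and its contents are made-up toy values, so the example runs without Open3D-ML or any dataset files.

```python
from collections import Counter


class ToySplit:
    """Hypothetical stand-in for a dataset split object (not an Open3D-ML class)."""

    def __init__(self, items):
        self._items = items

    def __len__(self):
        return len(self._items)

    def get_data(self, idx):
        return self._items[idx]


def count_labels(split):
    """Count label occurrences across every item in a split."""
    counts = Counter()
    for i in range(len(split)):
        data = split.get_data(i)
        counts.update(data['label'])  # one label per point
    return counts


# Made-up toy data: two tiny "point clouds" with the same three keys.
toy_split = ToySplit([
    {'point': [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], 'feat': None, 'label': [1, 1]},
    {'point': [(0.0, 1.0, 0.0)], 'feat': None, 'label': [2]},
])
label_counts = count_labels(toy_split)
print(label_counts)
```

The same `count_labels` function should accept `train_split` as well, assuming its labels can be iterated point by point. For large arrays, a vectorized count would likely be faster.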

### Querying dataset splits for attributes

We can also extract the attributes of the corresponding point cloud:

In [None]:
# Dictionary containing information about the data, e.g. name, path, split, etc.
attr = train_split.get_attr(0)
print(attr)

Attributes returned are: `'idx'` (*index*), `'name'`, `'path'`, and `'split'`.
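
One possible use of these attributes is a lookup table from item name to file path. The sketch below assumes only the `len()` and `get_attr()` interface and the four keys listed above. `ToyAttrSplit` and the attribute values in it are hypothetical placeholders, not real Open3D-ML output.

```python
class ToyAttrSplit:
    """Hypothetical stand-in exposing len() and get_attr() (not an Open3D-ML class)."""

    def __init__(self, attrs):
        self._attrs = attrs

    def __len__(self):
        return len(self._attrs)

    def get_attr(self, idx):
        return self._attrs[idx]


def paths_by_name(split):
    """Build a {name: path} dictionary from the attributes of every item."""
    index = {}
    for i in range(len(split)):
        attr = split.get_attr(i)
        index[attr['name']] = attr['path']
    return index


# Placeholder attribute values, made up for the example.
toy_split = ToyAttrSplit([
    {'idx': 0, 'name': 'item_a', 'path': 'some/dir/item_a.bin', 'split': 'training'},
    {'idx': 1, 'name': 'item_b', 'path': 'some/dir/item_b.bin', 'split': 'training'},
])
name_to_path = paths_by_name(toy_split)
print(name_to_path)
```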

In [None]:
# Support for the Open3D-ML visualizer in Jupyter Notebooks is in progress.
# View the frames using the visualizer:
# vis = ml3d.vis.Visualizer()
# vis.visualize_dataset(dataset, 'training', indices=range(len(train_split)))