# Reading a Dataset

In this tutorial, we are going to learn how to read Open3D-ML datasets.

You can use any dataset available in the `ml3d.datasets` namespace. This example uses the `SemanticKITTI` dataset. The same steps apply to the other datasets, but note that the constructor parameters may vary from one dataset to another.

To read the dataset in this example, we supply the following parameters:

- Dataset path (`dataset_path`)
- Cache directory (`cache_dir`)
- Dataset splits (`training_split`, `validation_split`, and `test_split`)

> **For more background on dataset splitting, see these articles:**
>
> https://machinelearningcompass.com/dataset_optimization/split_data_the_right_way/
>
> https://www.freecodecamp.org/news/key-machine-learning-concepts-explained-dataset-splitting-and-random-forest/

## Creating a Global Dataset Object

First, we import the `open3d.ml.torch` module under the name `ml3d`:

In [1]:
#import torch
import open3d.ml.torch as ml3d

Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.

--------------------------------------------------------------------------------

 Using the Open3D PyTorch ops with CUDA 11 may have stability issues!

 We recommend to compile PyTorch from source with compile flags
   '-Xcompiler -fno-gnu-unique'

 or use the PyTorch wheels at
   https://github.com/isl-org/open3d_downloads/releases/tag/torch1.8.2


 Ignore this message if PyTorch has been compiled with the aforementioned
 flags.

 See https://github.com/isl-org/Open3D/issues/3324 and
 https://github.com/pytorch/pytorch/issues/52663 for more information on this
 problem.

--------------------------------------------------------------------------------



2022-04-06 16:20:21.687818: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-04-06 16:20:21.687846: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


We then create a global `dataset` object, initializing it with the dataset *path*, the *cache directory*, and the *splits*. This `dataset` covers all files found under the location given in the `dataset_path='<path>'` parameter:

In [2]:
# Read a dataset by specifying the path. We are also providing the cache directory and splits.
dataset = ml3d.datasets.SemanticKITTI(
    dataset_path='SemanticKITTI/',
    cache_dir='./logs/cache',
    training_split=['00'],
    validation_split=['01'],
    test_split=['01'],
)

A few words about the *split* parameters: they divide the `SemanticKITTI` dataset content into three parts:

1. `training_split` for training. This part usually contains 70-75% of the global `dataset` content.
2. `validation_split` for validation. This part usually accounts for 10-15% of the global `dataset` content.
3. `test_split` for testing. It contains the test data, and its size varies.
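
The proportions above can be sketched in a few lines of plain Python. This is only an illustration: `split_sequences` is a hypothetical helper, not part of Open3D-ML, and the twenty sequence IDs and the 70/15/15 fractions are assumptions chosen for the example. Many datasets define their own official splits, which should take precedence.

```python
# Hypothetical helper (not part of Open3D-ML): partition sequence IDs by fraction.
def split_sequences(sequences, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) lists of sequence IDs, keeping the input order."""
    n = len(sequences)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = sequences[:n_train]
    val = sequences[n_train:n_train + n_val]
    test = sequences[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


# Assumed example IDs "00" .. "19"; a real dataset defines its own sequences.
sequences = [f"{i:02d}" for i in range(20)]
training_split, validation_split, test_split = split_sequences(sequences)
print(len(training_split), len(validation_split), len(test_split))
```

The resulting lists have the same shape as the `training_split`, `validation_split`, and `test_split` arguments passed to the dataset constructor above.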

Note the `SemanticKITTI` dataset folder structure:

![dataset_splits](https://user-images.githubusercontent.com/93158890/160633509-e190d8ea-7d3c-4eea-a00f-29ba215b7e69.jpg)

The three *split* parameters instruct Open3D-ML to reference the following folder locations (in this example, the validation and test splits both point to sequence `01`):

- `training_split=['00']` points to `'SemanticKITTI/dataset/sequences/00/'`
- `validation_split=['01']` points to `'SemanticKITTI/dataset/sequences/01/'`
- `test_split=['01']` points to `'SemanticKITTI/dataset/sequences/01/'`
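
The mapping from split values to folders can be mirrored with a small path helper. This is a sketch of the folder layout described above, not of how Open3D-ML resolves paths internally; `sequence_dirs` is a hypothetical name introduced for illustration.

```python
from pathlib import PurePosixPath


def sequence_dirs(dataset_path, sequence_ids):
    """Hypothetical helper: build <dataset_path>/dataset/sequences/<id> for each ID."""
    root = PurePosixPath(dataset_path) / "dataset" / "sequences"
    return [str(root / seq_id) for seq_id in sequence_ids]


# The same split values that were passed to the dataset constructor.
splits = {"training": ["00"], "validation": ["01"], "test": ["01"]}
for split_name, sequence_ids in splits.items():
    print(split_name, sequence_dirs("SemanticKITTI", sequence_ids))
```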

> Note: **Dataset split directories usually contain numerous pointcloud files.** In our example we included only one pointcloud file for extra speed and convenience.

## Creating Dataset Split Objects to Query the Data

Next, we create a **dedicated** object for each split so that we can query that portion of the dataset.

First, we use the `get_split()` method to create a `train_split` object for training from the global `dataset` we initialized above:

In [3]:
# Split the dataset for 'training'. You can get the other splits by passing 'validation' or 'test'
train_split = dataset.get_split('training')

INFO - 2022-04-06 16:22:26,972 - semantickitti - Found 1 pointclouds for training


Now, let's do the same for validation:

In [4]:
# Similarly, get the validation split.
val_split = dataset.get_split('validation')

INFO - 2022-04-06 16:22:31,932 - semantickitti - Found 1 pointclouds for validation


Finally, we create a `test_split` subset for testing:

In [5]:
# Get test split
test_split = dataset.get_split('test')

INFO - 2022-04-06 16:22:33,834 - semantickitti - Found 1 pointclouds for test


## Determining the Size of Dataset Splits

Let's see how large our *split* portions are:

In [6]:
# Get length of splits
print(len(train_split))
print(len(val_split))
print(len(test_split))

1
1
1


Above, we print the number of pointcloud files that Open3D-ML found in the `'00/'` and `'01/'` subdirectories of `'SemanticKITTI/dataset/sequences/'`, which we specified earlier through the `training_split=['00']`, `validation_split=['01']`, and `test_split=['01']` parameters of the `dataset`.
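
To cross-check these numbers without Open3D-ML, one could count the scan files on disk directly. The sketch below assumes that scans are stored as `*.bin` files inside a `velodyne/` subfolder of each sequence directory; `count_scans` is a hypothetical helper, not an Open3D-ML function.

```python
from pathlib import Path


def count_scans(dataset_path, sequence_ids):
    """Hypothetical helper: count *.bin files under
    <dataset_path>/dataset/sequences/<id>/velodyne/ for the given IDs."""
    total = 0
    for seq_id in sequence_ids:
        velodyne_dir = Path(dataset_path) / "dataset" / "sequences" / seq_id / "velodyne"
        # A missing directory simply contributes zero files.
        total += sum(1 for _ in velodyne_dir.glob("*.bin"))
    return total


print(count_scans("SemanticKITTI", ["00"]))  # training sequences
print(count_scans("SemanticKITTI", ["01"]))  # validation and test sequences
```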

## Querying Dataset Splits for Data

In this section, we are using the `train_split` dataset split object as an example. The procedure would be identical for all other dataset splits - `val_split` and `test_split`.

To extract data from `train_split`, we call its `get_data()` method with an index `i` ranging from `0` to `len(train_split) - 1`:

In [7]:
# Query splits for data, index should be from `0` to `len(split) - 1`
for i in range(len(train_split)):
    data = train_split.get_data(i)
    print(data)
    break

{'point': array([[ 5.2305943e+01,  2.2989707e-02,  1.9779946e+00],
       [ 5.3259735e+01,  1.0695236e-01,  2.0099745e+00],
       [ 5.3284321e+01,  2.7487758e-01,  2.0109341e+00],
       ...,
       [ 3.8249431e+00, -1.4261885e+00, -1.7655631e+00],
       [ 3.8495324e+00, -1.4222100e+00, -1.7755738e+00],
       [ 3.8631279e+00, -1.4142324e+00, -1.7805853e+00]], dtype=float32), 'feat': None, 'label': array([0, 0, 0, ..., 9, 9, 9], dtype=int32)}


Each `data` object returned by `get_data()` in the above `for` loop is a dictionary of points (`'point'`), features (`'feat'`), and labels (`'label'`), as we will see below:

In [8]:
data = train_split.get_data(0) # Dictionary of `point`, `feat`, and `label`
print(data.keys())

dict_keys(['point', 'feat', 'label'])
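
Because an entry such as `'feat'` can be `None`, it helps to inspect the dictionary defensively before using it. The following is a minimal sketch: `summarize` is a hypothetical helper, and the small `data` dictionary is a stand-in built from plain lists rather than real output of `get_data()`.

```python
def summarize(data):
    """Hypothetical helper: describe each entry of a get_data()-style dictionary."""
    summary = {}
    for key, value in data.items():
        if value is None:
            summary[key] = "missing"
        else:
            # NumPy arrays expose .shape; plain sequences fall back to their length.
            shape = getattr(value, "shape", (len(value),))
            summary[key] = f"{type(value).__name__} with shape {tuple(shape)}"
    return summary


# Stand-in data with the same keys as the dictionary printed above.
data = {
    "point": [[52.3, 0.02, 1.98], [53.26, 0.11, 2.01]],
    "feat": None,
    "label": [0, 0],
}
for key, description in summarize(data).items():
    print(key, "->", description)
```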


- The value under the **`'point'`** key contains the 3D coordinates (X, Y, and Z) of each point:

![dataset_coordinates](https://user-images.githubusercontent.com/93158890/160503607-76e77f7a-56be-4fba-91c5-7e35019f68e8.jpg)

- The value under the **`'feat'`** (*features*) key contains per-point features, such as RGB color information, when the dataset provides them. In this example it is `None`, as the output above shows.

- The value under the **`'label'`** key holds the class that each point belongs to, e.g.: *pedestrian, vehicle, traffic light*, etc.
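
A quick way to get a feel for the labels is to count how many points carry each label ID. The sketch below uses a short stand-in list in place of `data['label']`; `label_histogram` is a hypothetical helper, and the mapping from label IDs to class names is dataset-specific and not shown here.

```python
from collections import Counter


def label_histogram(labels):
    """Hypothetical helper: map each label ID to the number of points carrying it."""
    # int() normalizes NumPy integer scalars to plain Python ints.
    return Counter(int(label) for label in labels)


# Stand-in for data['label']; the real array comes from train_split.get_data(0).
labels = [0, 0, 0, 9, 9, 9]
for label_id, count in sorted(label_histogram(labels).items()):
    print(label_id, count)
```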

### Querying Dataset Splits for Attributes

We can also retrieve the attributes of the corresponding pointcloud with the `get_attr()` method:

In [9]:
attr = train_split.get_attr(0)
print(attr)  # Dictionary containing information about the data e.g. name, path, split, etc.

{'idx': 0, 'name': '00_000001', 'path': 'SemanticKITTI/dataset/sequences/00/velodyne/000001.bin', 'split': 'training'}


The attributes returned are: `'idx'` (*index*), `'name'`, `'path'`, and `'split'`.
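
These attributes are convenient for tracing a sample back to its source file. As a sketch, the sequence and frame can be recovered from the `'path'` value, assuming the `.../sequences/<sequence>/velodyne/<frame>.bin` layout visible in the output above; `describe_scan` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def describe_scan(attr):
    """Hypothetical helper: derive the sequence and frame from a get_attr()-style dict."""
    path = PurePosixPath(attr["path"])
    return {
        "sequence": path.parent.parent.name,  # folder two levels above the file
        "frame": path.stem,                   # file name without the .bin extension
        "split": attr["split"],
    }


# The attribute dictionary printed above, copied here as a literal.
attr = {
    "idx": 0,
    "name": "00_000001",
    "path": "SemanticKITTI/dataset/sequences/00/velodyne/000001.bin",
    "split": "training",
}
print(describe_scan(attr))
```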

In [14]:
# Support for the Open3D-ML visualizer in Jupyter notebooks is in progress.
# Once available, view the frames using the visualizer:
#vis = ml3d.vis.Visualizer()
#vis.visualize_dataset(dataset, 'training',indices=range(len(train_split)))