# Pymicro's Datasets

This first tutorial will introduce you to the creation and deletion of Pymicro's datasets. 

## I - Create and Open datasets with the SampleData class

In this first section, we will see how to create *SampleData* datasets, or open pre-existing ones. These two operations are performed by instantiating a SampleData class object. 

### Import SampleData and get help

First, you need to import the `SampleData` class. We will import it under the alias `SD` by executing:

In [None]:
from pymicro.core.samples import SampleData as SD

Before starting to create our datasets, we will take a look at the `SampleData` class documentation to discover the arguments of the class constructor. You can read it on the `pymicro.core` package [API doc page](../../pymicro.core.rst), or print it interactively by executing:
```python
>>> help(SD)
```
or, if you are working with a Jupyter notebook, by executing the magic command:
```
>>> ?SD
```

**Do not hesitate to systematically use the `help` function or the `"?"` magic command to get information on methods when you encounter a new one. All SampleData methods are documented with explanatory docstrings that detail the method arguments and return values.**

### Dataset creation

The class docstring is divided into several sections, one of which lists the class constructor arguments.
Let us review them one by one.

* **filename**: basename of the HDF5/XDMF pair of files of the dataset

This is the first and only mandatory argument of the class constructor. If this string corresponds to an existing file, the SampleData class will open this file and create a class instance to interact with the already existing dataset. **If the filename does not correspond to an existing file, the class will create a new dataset, which is what we want to do here.**
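
The open-or-create rule described above can be pictured with a small standalone sketch. The helper `expected_action` is hypothetical and is not part of Pymicro. It assumes that `filename` is a basename without extension and that the HDF5 file of the pair carries the `.h5` extension, as in this tutorial. It only checks the file system, so it can be run without creating any real dataset.

```python
from pathlib import Path
import tempfile


def expected_action(filename, directory="."):
    """Return 'open' if '<filename>.h5' exists in `directory`, else 'create'.

    Hypothetical helper, not part of Pymicro. Assumes `filename` is a
    basename and that the HDF5 file uses the '.h5' extension.
    """
    hdf5_path = Path(directory) / f"{filename}.h5"
    return "open" if hdf5_path.is_file() else "create"


# Demonstration in a scratch directory, so that no real dataset is touched
with tempfile.TemporaryDirectory() as scratch:
    print(expected_action("my_first_dataset", scratch))  # prints: create
    # Simulate an already existing dataset file
    (Path(scratch) / "my_first_dataset.h5").touch()
    print(expected_action("my_first_dataset", scratch))  # prints: open
```

Such a check may be handy before calling the constructor, when you are not sure whether a dataset with the same basename is already present in the target directory.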

Let us create a SampleData dataset:

In [None]:
data = SD(filename='my_first_dataset')

That is it. The class has created a new HDF5/XDMF pair of files, and bound the interface to this dataset to the variable `data`. The code returned no message, so how can we know that the dataset has been created?

When the file name is not an absolute path, the default behavior of the class is to create the dataset in the current working directory. Let us print the content of this directory!

In [None]:
import os # load python module to interact with operating system
cwd = os.getcwd() # get current directory
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only HDF5 files
print('Our dataset files:')
for file in file_list:
    if file.endswith('.h5'):
        print(file)

The file *my_first_dataset.h5* has indeed been created. If you want the class to print information about the dataset creation, you can set the **verbose** argument to `True`. This activates the *verbose* mode of the class, in which the class instance prints a lot of information about what it is doing. This flag can also be set on an existing instance with the `set_verbosity` method:

In [None]:
data.set_verbosity(True)

Let us now close our dataset, and see if the class instance prints information about it:

In [None]:
del data

<div class="alert alert-info">

**Note** 
    
It is good practice to always delete your `SampleData` instances once you are done working with a dataset, or if you want to re-open it. As the class instance keeps the files open as long as it exists, deleting it ensures that the files are properly closed. Otherwise, files may be closed at unpredictable times or stay open, and your datasets may behave in undesired ways.

</div>

The class indeed prints some messages during the instance destruction. As you can see, the class instance writes the data it stores into the HDF5 file, and then closes the dataset instance and the files.
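
One general Python detail is worth keeping in mind here: `del` removes a *name*, and the destructor only runs once no reference to the object remains. The sketch below uses a toy class, `ToyDataset`, which is purely illustrative and is not Pymicro code. It records when its destructor is called, relying on the reference counting of CPython, where the destructor runs as soon as the last reference disappears.

```python
class ToyDataset:
    """Toy stand-in for a class that closes its files when destroyed.

    Illustrative only, this is not Pymicro code.
    """

    closed_names = []  # names of the toy datasets whose destructor has run

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Called when the object is destroyed (last reference removed)
        ToyDataset.closed_names.append(self.name)


data = ToyDataset("my_first_dataset")
alias = data  # second reference to the same object

del data  # removes one name only, the object is still alive
print(ToyDataset.closed_names)  # prints: []

del alias  # last reference removed: CPython calls __del__ right away
print(ToyDataset.closed_names)  # prints: ['my_first_dataset']
```

If a `del` statement does not seem to close your dataset, it may be that another variable (or, in a notebook, a displayed cell output) still refers to the same instance.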

### Dataset opening and verbose mode

Let us now try to create a new SD instance for the same dataset file `"my_first_dataset"`. **As the HDF5 dataset already exists, this new *SampleData* instance will open it and synchronize with it.** With the **verbose** mode activated, *SampleData* class instances display messages about the actions performed by the class (creating or deleting data items, for instance).

In [None]:
data = SD(filename='my_first_dataset', verbose=True)

You can see that the printed information states that the dataset file *my_first_dataset.h5* has been opened, and not created, because we provided the class constructor with a **filename** that already existed.

Some information about the dataset content is also printed by the class in *verbose* mode. This information can be retrieved with specific methods that will be detailed in the next section of this Notebook. Let us focus for now on one part of it.

The printed information reveals that our dataset content is composed of only one **data item**, a Group data object named `/`.

This group is the **Root** Group of the dataset. Each dataset necessarily has a Root Group, automatically created along with the dataset. You can see that this Group has no parent group, and already has a *Child*, named `Index`. This particular data object will be presented in the third section of this Notebook. You can also observe that the Root Group already has *attributes* (recall from the introduction Notebook that they are Name/Value pairs used to store metadata in datasets). Two of those attributes match arguments of the SampleData class constructor:


* the **description** attribute
* the **sample_name** attribute

**The description and sample_name are not modified in the dataset when reading a dataset. These SD constructor arguments are only used when creating a dataset**. They are string metadata whose role is to give a general name/title to the dataset, and a general description. 
However, they can be set to a new value after the dataset creation with the methods `set_sample_name` and `set_description`, used a little further in this Notebook.

Now we know how to open a dataset previously created with *SampleData*. We may also want to create a new dataset under the name of an already existing one, overwriting it. The *SampleData* constructor allows this, as we will see in the next subsection. But first, we will close our dataset again:

In [None]:
del data

### Overwriting datasets

The **overwrite_hdf5** argument of the class constructor, if set to `True`, removes the `filename` dataset if it already exists and creates a new empty one:

In [None]:
data = SD(filename='my_first_dataset',  verbose=True, overwrite_hdf5=True)

As you can see, the dataset files have been overwritten, as requested. We will now close our dataset again and continue to see the possibilities offered by the class constructor.

In [None]:
del data

Our dataset is now closed and we can move on to other ways to create and remove datasets.
    
**Up to now, no mechanism is implemented in the class to protect datasets from being overwritten. Be careful with your data when using this functionality!**
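
Since the class offers no protection, one simple precaution is to copy the dataset files yourself before overwriting them. The sketch below is one possible way to do it with the standard library only. The helper `backup_dataset` is hypothetical and is not part of Pymicro, and the `.xdmf` extension of the second file of the pair is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def backup_dataset(filename, directory=".", extensions=(".h5", ".xdmf"),
                   suffix=".bak"):
    """Copy existing dataset files to '<name><ext>.bak', return the copies.

    Hypothetical helper, not part of Pymicro. The '.xdmf' extension is an
    assumption, files that do not exist are simply skipped.
    """
    backups = []
    for ext in extensions:
        source = Path(directory) / f"{filename}{ext}"
        if source.is_file():
            target = source.with_name(source.name + suffix)
            shutil.copy2(source, target)  # copies content and metadata
            backups.append(target)
    return backups


# Demonstration in a scratch directory, with a fake dataset file
with tempfile.TemporaryDirectory() as scratch:
    (Path(scratch) / "my_first_dataset.h5").write_bytes(b"valuable data")
    saved = backup_dataset("my_first_dataset", scratch)
    print([path.name for path in saved])  # prints: ['my_first_dataset.h5.bak']
```

Calling such a helper right before a constructor call with `overwrite_hdf5=True` leaves you with a copy to fall back on. Make sure that no class instance is still holding the dataset files open when you copy them.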

### Copying datasets

One last thing that may be interesting to do with already existing dataset files is to create a new dataset that is a copy of them, associated with a new class instance. This is useful, for instance, when you want to try new processing on a set of valuable data without risking damage to it.

To do this, you may use the `copy_sample` method of the *SampleData* class. Its main arguments are:

* `src_sample_file`: basename of the dataset files to copy (*source file*)
* `dst_sample_file`: basename of the dataset to create as a copy of the source (*destination file*)
* `get_object`: if `False`, the method will just create the new dataset files and close them. If `True`, the method will leave the files open and return a *SampleData* instance that you may use to interact with your new dataset.

Let us try to create a copy of our first dataset:

In [None]:
data2 = SD.copy_sample(src_sample_file='my_first_dataset', dst_sample_file='dataset_copy', get_object=True)

In [None]:
cwd = os.getcwd() # get current directory
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only files that start with our dataset basename
print('Our dataset files:')
for file in file_list:
    if file.startswith('dataset_copy'):
        print(file)

The `dataset_copy.h5` HDF5 file has indeed been created, and is a copy of `my_first_dataset.h5`.

Note that `copy_sample` is a *static method* that can be called without a *SampleData* instance. Note also that it has an `overwrite` argument, which allows overwriting an already existing `dst_sample_file`. Like the class constructor, it also has an `autodelete` argument, which we will discover in the next subsection.

### Automatically removing dataset files

On some occasions, we may want to remove our dataset files after using our *SampleData* class instance. This can be the case, for instance, if you are trying some new data processing, or using the class for visualization purposes, and are not interested in keeping your test data.

The class has an **autodelete** attribute for this purpose. If it is set to `True`, the class destructor will remove the dataset file pair in addition to deleting the class instance. The class constructor and the `copy_sample` method also have an **autodelete** argument, which, if `True`, will automatically set the class instance **autodelete** attribute to `True`.

To illustrate this feature, we will try to change the *autodelete* attribute of our copied dataset to `True`, and remove it.

In [None]:
# set the autodelete attribute to True
data2.autodelete = True
# Set the verbose mode on for copied dataset  
data2.set_verbosity(True)

In [None]:
# Close copied dataset
del data2

In *verbose* mode, the class destructor ends by printing a message confirming the removal of the dataset files, as you can see in the cell above.
Let us verify that they have effectively been deleted:

In [None]:
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only files that start with our dataset basename
print('Our copied dataset files:')
for file in file_list:
    if file.startswith('dataset_copy'):
        print(file)

As you can see, the dataset file has been removed. Now we can also open and remove our first dataset using the **autodelete** option of the class constructor:

In [None]:
data = SD(filename='my_first_dataset',  verbose=True, autodelete=True)

print(f'Is autodelete mode on ? {data.autodelete}')

del data

In [None]:
file_list = os.listdir(cwd) # get content of current work directory
print(file_list,'\n')

# now print only files that start with our dataset basename
print('Our dataset files:')
for file in file_list:
    if file.startswith('my_first_dataset'):
        print(file)

<div class="alert alert-info">

**Note** 
    
Using the **autodelete** option is useful when you are using the class for trials or tests, and do not want to keep the dataset files on your computer.

</div>
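
To picture the mechanism used in this subsection, here is a toy sketch of a destructor that removes its file only when an `autodelete` flag is on. The `ScratchFile` class is purely illustrative and is not Pymicro's implementation. It handles a single empty file and relies on CPython calling the destructor as soon as the last reference is deleted.

```python
import os
import tempfile


class ScratchFile:
    """Toy illustration of an autodelete flag (not Pymicro's implementation)."""

    def __init__(self, path, autodelete=False):
        self.path = path
        self.autodelete = autodelete
        # Create the file if it is missing, leave it untouched otherwise
        with open(self.path, "a"):
            pass

    def __del__(self):
        # On destruction, remove the file only if the autodelete flag is on
        if self.autodelete and os.path.exists(self.path):
            os.remove(self.path)


with tempfile.TemporaryDirectory() as scratch:
    kept_path = os.path.join(scratch, "kept.h5")
    removed_path = os.path.join(scratch, "removed.h5")

    kept = ScratchFile(kept_path)
    removed = ScratchFile(removed_path)
    removed.autodelete = True  # flag switched on after creation, as for data2

    del kept, removed  # destructors run here in CPython
    kept_exists = os.path.exists(kept_path)
    removed_exists = os.path.exists(removed_path)
    print(kept_exists, removed_exists)  # prints: True False
```

The flag is read only at destruction time, which is why it can be changed at any moment during the life of the instance, as we did above with `data2.autodelete = True`.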

**This first tutorial of the Data Management with Pymicro User Guide is now finished. You should now know how to create, open, or remove SampleData datasets.**