# Create Predefined Custom Data Models for your datasets

This tutorial will teach you how to define classes derived from `SampleData`, in order to create datasets with an automatically generated data model that is tailored to a specific need.

## I - SampleData derived classes 

The `SampleData` class allows you to create and interact with complex HDF5 datasets. New datasets are created empty, and can be constructed freely according to the needs of the user. When using the class to work with many datasets that should share the same internal organization and content, users have to rebuild this internal data model for each new dataset. In addition, in order to define scripts or classes that batch process data items found in each of these datasets, they have to make sure that the names and/or paths of these items are identical in all datasets.

These considerations show that the automatic generation of a non-empty and specific *data model* is a useful addition to the features of `SampleData`. For that purpose, the class implements two simple mechanisms through class inheritance, which are the subject of the present tutorial.

### Custom Data Model

The `SampleData` class defines a minimal data model that structures all created datasets. This data model is an organized collection of data item *indexnames, paths* and *types*, provided via two dictionaries:

1. `minimal_content_index_dic`: the path of each data item in the data model
2. `minimal_content_type_dic`: the type of each data item in the data model

#### The content index dictionary

Each item of this dictionary defines a data item of the data model. Its key will be the *indexname* given to the data item in the dataset, and the item value must be a string giving a valid path for the data item in the dataset. When a dataset is created, the class will automatically create a data item for each key of this dictionary, and set its path in the dataset with the associated value in the dictionary. 

For the `SampleData` class, this dictionary is empty: no data model is prescribed. Hence, datasets that are created with `SampleData` are empty (they just have a Root Group, as explained in a previous [tutorial](./Datasets_Files.ipynb)). To create datasets with a prescribed data model, the idea is to implement a class derived from `SampleData` with a non-empty `minimal_content_index_dic` that implements the desired data model.

This dictionary should hence look like this:

```python
minimal_content_index_dic = {'item1': '/path_to_item1',
                             'item2': '/path_to_item1/path_to_item2',
                             'item3': '/path_to_item3',
                             'item4': '/path_to_item1/path_to_item4',
                             '...': '...'}
```

An item of the form `'wrongitem': '/undeclared_item/path_to_wrong_item'` would not have a valid path, because its parent `/undeclared_item` is not declared in the dictionary.

The dictionary example just above would lead to the creation of at least 4 data items, with names `item1`, `item2`, `item3` and `item4`, with items 1 and 3 being directly attached to the dataset *Root Group*, and items 2 and 4 being children of item 1.
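
The validity rule for paths can be checked before the dictionary is handed to a class. The sketch below is a minimal, standalone illustration and is not part of `SampleData`: the helper name `undeclared_parents` is hypothetical, and it assumes that a path is valid when its parent is either the Root Group or the path of another declared item.

```python
import posixpath


def undeclared_parents(index_dic):
    """Return the indexnames whose parent path is not declared in index_dic.

    Hypothetical helper written for this tutorial, not part of SampleData.
    """
    declared = set(index_dic.values())
    invalid = []
    for name, path in index_dic.items():
        parent = posixpath.dirname(path)
        # An item attached to the Root Group ('/') needs no declared parent.
        if parent != '/' and parent not in declared:
            invalid.append(name)
    return invalid


index_dic = {'item1': '/path_to_item1',
             'item2': '/path_to_item1/path_to_item2',
             'item3': '/path_to_item3',
             'item4': '/path_to_item1/path_to_item4',
             'wrongitem': '/undeclared_item/path_to_wrong_item'}

print(undeclared_parents(index_dic))  # prints ['wrongitem']
```

Running such a check on a data model before using it in a derived class makes path mistakes visible early, without creating any file.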

#### The content type dictionary

The second dictionary must have the same keys as the `minimal_content_index_dic`. **Its values must be valid *SampleData* data item types**. The `minimal_content_type_dic` prescribes the type of each data item that is automatically created at dataset creation, with the names and paths specified by `minimal_content_index_dic`.

Possible values and associated data types are (see previous tutorials for description of these data types):

* `Group`: creates a HDF5 group data item
* `2DImage`, `3DImage`, or `Image`: creates an empty Image group
* `2DMesh`, `3DMesh`, `Mesh`: creates an empty Mesh group
* `data_array`: creates an empty Data Array
* `field_array`: creates an empty Field Array (its path must be a child of an Image or Mesh group)
* `string_array`: creates an empty String Array
* a `numpy.dtype` or a `tables.IsDescription` class ([see here](https://www.pytables.org/usersguide/libref/declarative_classes.html#the-isdescription-class) and [the tutorial on basic data items](./Data_Items.ipynb)): creates an empty Structured Table

This dictionary should share the same keys as the `minimal_content_index_dic` dictionary, and should look like this:

```python
# `array_np` is assumed to be a NumPy structured array defined beforehand
minimal_content_type_dic = {'item1': '3DMesh',
                            'item2': 'field_array',
                            'item3': 'data_array',
                            'item4': array_np.dtype,
                            '...': '...'}
```

In this case, the first item would be created as a *Mesh Group*, the second as a field data item stored in this mesh, the third as a data array attached to the *Root Group*, and the last as a *Structured Table* attached to the Mesh Group.
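
Since the two dictionaries must stay consistent, a quick check can catch mismatches before a dataset is created. The following is a minimal sketch, not a feature of `SampleData`: the helper `check_data_model` and the constant `STRING_TYPES` are hypothetical names, the accepted string values are simply copied from the list above, and any non-string value is assumed to be a `numpy.dtype` or a `tables.IsDescription` class.

```python
# Type names listed in this tutorial; non-string values are assumed to be
# a numpy.dtype or a tables.IsDescription class and are not checked here.
STRING_TYPES = {'Group', '2DImage', '3DImage', 'Image', '2DMesh', '3DMesh',
                'Mesh', 'data_array', 'field_array', 'string_array'}


def check_data_model(index_dic, type_dic):
    """Return a list of problems found in the two dictionaries.

    Hypothetical helper written for this tutorial, not part of SampleData.
    """
    problems = []
    # Both dictionaries must declare exactly the same indexnames.
    for name in sorted(set(index_dic) ^ set(type_dic)):
        problems.append(f'"{name}" is missing from one of the dictionaries')
    # String values must be one of the known type names.
    for name, item_type in type_dic.items():
        if isinstance(item_type, str) and item_type not in STRING_TYPES:
            problems.append(f'"{name}" has unknown type "{item_type}"')
    return problems


index_dic = {'item1': '/path_to_item1',
             'item2': '/path_to_item1/path_to_item2',
             'item3': '/path_to_item3'}
type_dic = {'item1': '3DMesh',
            'item2': 'field_arr',   # typo on purpose
            'item4': 'data_array'}  # key mismatch on purpose

for problem in check_data_model(index_dic, type_dic):
    print(problem)
```

With the deliberately faulty dictionaries above, the check reports the two mismatched keys and the misspelled type name.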

*****
These two dictionaries are returned by the `minimal_data_model` method of the `SampleData` class. They are used during the dataset object initialization, to create the prescribed data model, and populate it with empty objects, with the right names and organization. This allows you to prescribe a set of names and paths that form a particular data model that all objects created by the class should have.

To create a dataset class with the above data model, its implementation should thus look like this at this stage:

```python
class MyDatasets(SampleData):
    """Example of SampleData derived class.
    
       This is how to implement a class of datasets with a custom data model.
    """
    
    def minimal_data_model(self):
        
        minimal_content_index_dic = {'item1': '/path_to_item1',
                                    'item2': '/path_to_item1/path_to_item2',
                                    'item3': '/path_to_item3',
                                    'item4': '/path_to_item1/path_to_item4'}
        minimal_content_type_dic = {'item1': '3DMesh',
                                    'item2': 'field_array',
                                    'item3': 'data_array',
                                    'item4':  array_np.dtype}
        
        return minimal_content_index_dic, minimal_content_type_dic
```

These dictionaries are labeled as the **minimal data model**, as they only prescribe the data items and organization that will be generated in each dataset created by the subclass. The user is free to enrich the datasets with any additional data item (see the previous tutorials to learn how to do it).

To sum up, to build an interface that creates and interacts with datasets following a prescribed data model, you have to:

1. Implement a new class that inherits from `SampleData`
2. Override the `minimal_data_model` method and write your data model in the two dictionaries returned by the method

You will then get a class derived from *SampleData* (hence with all its methods and features), that creates datasets with this prescribed data model.
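
Because the data model is returned by a method, a class derived from your own subclass can reuse the parent model and extend it. The sketch below illustrates this with plain Python only: `BaseDatasets` is a stand-in written for this example (it is not `SampleData`, and no HDF5 file is created), and the item names are placeholders.

```python
class BaseDatasets:
    """Stand-in for a SampleData derived class (illustration only)."""

    def minimal_data_model(self):
        index_dic = {'item1': '/path_to_item1',
                     'item2': '/path_to_item1/path_to_item2'}
        type_dic = {'item1': '3DMesh',
                    'item2': 'field_array'}
        return index_dic, type_dic


class ExtendedDatasets(BaseDatasets):
    """Reuse the parent data model and add one more item to it."""

    def minimal_data_model(self):
        index_dic, type_dic = super().minimal_data_model()
        # Add the new item to both dictionaries, with the same key.
        index_dic['item3'] = '/path_to_item3'
        type_dic['item3'] = 'data_array'
        return index_dic, type_dic


index_dic, type_dic = ExtendedDatasets().minimal_data_model()
print(sorted(index_dic))  # prints ['item1', 'item2', 'item3']
```

Building the dictionaries inside the method, as done here, means each call returns fresh objects, so extending them in a child class does not modify the parent model.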

### Custom initialization

The other mechanism that is important for designing subclasses of *SampleData* is the specification of initialization commands that are run each time a dataset is opened. These operations can include, for instance, the definition of class attributes, prints, sanity checks, etc. The `_after_file_open` method of the `SampleData` class has been designed to this end. It is called by the class constructor after opening the HDF5 dataset and loading the dataset Index and data tree in the class instance.

To create your custom dataset initialization routine, you can hence override the `_after_file_open` method in your derived class, and implement your initialization procedure. For instance, if you want your class to warn the user that a data item is empty in the dataset, you could implement your class as follows:

```python
class MyDatasets(SampleData):
    """Example of SampleData derived class.
    
       This is how to implement a class of datasets with a custom data model
       and initialization procedure.
    """
    
    def minimal_data_model(self):
        """Define data model of MyDatasets class."""
        minimal_content_index_dic = {'item1': '/path_to_item1',
                                    'item2': '/path_to_item1/path_to_item2',
                                    'item3': '/path_to_item3',
                                    'item4': '/path_to_item1/path_to_item4'}
        minimal_content_type_dic = {'item1': '3DMesh',
                                    'item2': 'field_array',
                                    'item3': 'data_array',
                                    'item4':  array_np.dtype}
        
        return minimal_content_index_dic, minimal_content_type_dic
        
    def _after_file_open(self):
        """Initialization procedure for MyDatasets."""
        
        if self._is_empty('item3'):
            print('Warning: data array "item3" is empty in the dataset !')
        else:
            print('"item3" is not empty !')
        return
        
```
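
The call order described above can be pictured with a small stand-in. `FakeSampleData` below is written for this illustration only and does not reproduce the `SampleData` constructor: it merely records that the file is "opened" before `_after_file_open` runs, which is why the hook can rely on the dataset being loaded.

```python
class FakeSampleData:
    """Simplified stand-in for SampleData (illustration only, no HDF5 I/O)."""

    def __init__(self, filename):
        self.filename = filename
        self.events = []
        # 1. Open the file and load the index (only simulated here).
        self.events.append('file opened')
        # 2. Run the subclass-specific initialization hook.
        self._after_file_open()

    def _after_file_open(self):
        """Default hook: do nothing."""


class MyFakeDatasets(FakeSampleData):
    """Subclass that customizes the initialization hook."""

    def _after_file_open(self):
        # The hook runs inside the constructor, once the file is open.
        self.events.append('custom initialization')
        self.ready = True


data = MyFakeDatasets('my_dataset')
print(data.events)  # prints ['file opened', 'custom initialization']
```

Note that the subclass does not need to override the constructor: overriding the hook is enough to have the custom code run at every opening.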

## II - A practical example : The Microstructure Class

The `Microstructure` class has been designed to build datasets representing polycrystalline material samples. The `Microstructure` class also offers many application specific methods to interact with polycrystalline materials datasets, that are detailed in dedicated pages of this User's guide. 

Following the principles detailed in the previous section, the `Microstructure` class is implemented as a subclass of the `SampleData` class:
```python
class Microstructure(SampleData):
```

Let us review its prescribed data model and initialization procedure to use it as a practical example of custom data model creation. 

### Class data model

The code of the `minimal_data_model` method of the *Microstructure* class is replicated below:

```python
    def minimal_data_model(self):
        """Data model for a polycrystalline microstructure.

        Specify the minimal contents of the hdf5 (Group names, paths and group
        types) in the form of a dictionary {content: location}. This extends
        `~pymicro.core.SampleData.minimal_data_model` method.

        :return: a tuple containing the two dictionnaries.
        """
        minimal_content_index_dic = {'Image_data': '/CellData',
                                     'grain_map': '/CellData/grain_map',
                                     'phase_map': '/CellData/phase_map',
                                     'mask': '/CellData/mask',
                                     'Mesh_data': '/MeshData',
                                     'Grain_data': '/GrainData',
                                     'GrainDataTable': ('/GrainData/'
                                                        'GrainDataTable'),
                                     'Phase_data': '/PhaseData'}
        minimal_content_type_dic = {'Image_data': '3DImage',
                                    'grain_map': 'field_array',
                                    'phase_map': 'field_array',
                                    'mask': 'field_array',
                                    'Mesh_data': 'Mesh',
                                    'Grain_data': 'Group',
                                    'GrainDataTable': GrainData,
                                    'Phase_data': 'Group'}
        return minimal_content_index_dic, minimal_content_type_dic
```
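
To read the hierarchy encoded in this data model, the paths can be grouped by parent. The snippet below is only an illustration: the paths are copied from the method above, and the grouping code is plain Python that does not use `pymicro`.

```python
from collections import defaultdict
import posixpath

# Paths copied from the Microstructure data model shown above.
index_dic = {'Image_data': '/CellData',
             'grain_map': '/CellData/grain_map',
             'phase_map': '/CellData/phase_map',
             'mask': '/CellData/mask',
             'Mesh_data': '/MeshData',
             'Grain_data': '/GrainData',
             'GrainDataTable': '/GrainData/GrainDataTable',
             'Phase_data': '/PhaseData'}

# Map each parent path to the indexnames of its direct children.
children = defaultdict(list)
for name, path in index_dic.items():
    children[posixpath.dirname(path)].append(name)

for parent, names in children.items():
    print(f'{parent}: {names}')
```

The result shows four items attached to the Root Group, three fields stored in the `/CellData` image group, and the grain table stored in the `/GrainData` group.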

### Datasets initialization

The `_after_file_open` method of the `Microstructure` class consists of the following lines of code:

```python
    def _after_file_open(self):
        """Initialization code to run after opening a Sample Data file."""
        self.grains = self.get_node('GrainDataTable')
        if self._file_exist:
            self.active_grain_map = self.get_attribute('active_grain_map',
                                                       'CellData')
            if self.active_grain_map is None:
                self.set_active_grain_map()
            self._init_phase(phase)
            if not hasattr(self, 'active_phase_id'):
                self.active_phase_id = 1
        else:
            self.set_active_grain_map()
            self._init_phase(phase)
            self.active_phase_id = 1
        return
```

When opening a dataset, a class attribute `grains` is associated with the *Structured Array* node `GrainDataTable`. This `grains` attribute is used by many of the class methods. Hence, the `_after_file_open` method is used here to ensure that this attribute is properly associated with the *GrainDataTable* data item each time the dataset is opened. The class initialization also executes the `_init_phase` and `set_active_grain_map` methods, which serve a similar purpose for other data items.

### Creating a Microstructure dataset

To conclude this tutorial, we will create a Microstructure object and look at its content. The class constructor arguments are similar to those of the `SampleData` class.

```python
# import the Microstructure class
from pymicro.crystal.microstructure import Microstructure

# create a microstructure dataset
micro = Microstructure(filename='test_microstructure', autodelete=True)

# print the content of the microstructure dataset
print(micro)

# print class attributes that are initialized by the _after_file_open method
print('The Grain object has been initialized:')
print(micro.grains)

# close the dataset
del micro
```

```
Adding empty field /CellData/grain_map to mesh group /CellData
Adding empty field /CellData/phase_map to mesh group /CellData
Adding empty field /CellData/mask to mesh group /CellData
new phase added: unknown
Microstructure
* name: micro
* lattice: Lattice (Symmetry.cubic) a=1.000, b=1.000, c=1.000 alpha=90.0, beta=90.0, gamma=90.0

Dataset Content Index :
------------------------:
index printed with max depth `3` and under local root `/`

	 Name : Image_data                                H5_Path : /CellData 	
	 Name : Mesh_data                                 H5_Path : /MeshData 	
	 Name : Grain_data                                H5_Path : /GrainData 	
	 Name : Phase_data                                H5_Path : /PhaseData 	
	 Name : grain_map                                 H5_Path : /CellData/grain_map 	
	 Name : Image_data_Field_index                    H5_Path : /CellData/Field_index 	
	 Name : phase_map                                 H5_Path : /CellData/phase_map 	
	 Name : ma
```

The dataset has indeed been created with a content that conforms to the data model prescribed by the `minimal_data_model` method of the `Microstructure` class. Each of these items corresponds to data used systematically to study a polycrystalline material sample. In this context, the implementation of the data model serves the following purposes:

* it can be used as a standard data model for polycrystalline data sets, thus promoting data exchange and interoperability
* it allows the implementation of a high level interface with reduced complexity to interact with these data items

The interface is provided by the `Microstructure` class, which allows you to perform data processing operations that are frequently used in material science on polycrystalline datasets. In addition, the pre-existing data model facilitates the implementation of new processing functionalities within the class. This is illustrated here by the `grains` class attribute, which has been associated with the `GrainDataTable` data item in the dataset, as shown by the printed information above. This attribute provides an accessible and explicit object to get information and apply processing on the data describing the grains of the microstructure represented by the dataset.

This concludes this short tutorial on creating custom data models with `SampleData`. The features and use of the `Microstructure` class are detailed in a dedicated part of this User Guide.