# Multiscale Programming Tutorial
*By R. Bulanadi, 28/01/20*

***

This tutorial covers programming more complicated features into the Multiscale project. Most of this is covered briefly in comments in the core module, but it is explained in more detail here.
***

### The Cheat Sheet

**How do I access attributes?**

To access attributes in a function, there are two steps:
1. Pass the name of the attribute into `m_apply` under the argument `use_attrs`. For example, `use_attrs = 'catch_rate'` would call the attribute `catch_rate` from the source.
2. Give your custom function a kwarg named `source_###`, where `###` is the attribute called above. In this example, the function would need an argument `source_catch_rate`.
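
As a minimal sketch of the function side of these two steps, the hypothetical custom function below accepts a `source_catch_rate` kwarg. The function name `scale_by_catch_rate`, the default value, and the direct call are all assumptions for illustration; in practice `m_apply` would supply the kwarg after being given `use_attrs = 'catch_rate'`.

```python
def scale_by_catch_rate(data, source_catch_rate=1):
    """Hypothetical custom function: scale each entry by an attribute value.

    The default of 1 is an assumption, so the function still runs when the
    source has no 'catch_rate' attribute and the kwarg is never passed.
    """
    return [value * source_catch_rate for value in data]


# Direct call, standing in for what m_apply would do with use_attrs='catch_rate'
scaled = scale_by_catch_rate([1, 2, 3], source_catch_rate=45)
print(scaled)
```

Giving the kwarg a default is one way to cope with sources that lack the attribute, since the kwarg is simply not passed in that case.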

**How do I write attributes?**

* Some generic attributes are written by default by `write_generic_attributes`, such as `path`, `time`, and `operation_number`. As this is called by `m_apply` and the defunct `write_output_f`, these should always be written.
* Additional kwargs passed into `m_apply` will always be written as an attribute by `m_apply`
* By extension, as source attributes become additional kwargs, source attributes will also be written as a new attribute by `m_apply`
* Attributes can be copied and propagated by using the argument `prop_attrs` in `m_apply`.
* If you want an intermediate result or side effect to be saved as an attribute, then use `hdf5_dict` in the custom function, and pass those attributes as kwargs to `hdf5_dict`. Return the result.

**How are unusual inputs/outputs dealt with?**

* *Multiple inputs per output:* `l_apply` can be used if more than one input is needed.
* *Multiple outputs per input:* In `m_apply`, pass a list of names into `output_names`
* *Multiple inputs, and multiple outputs per input:* Pass a list of `output_names` into `l_apply`
* *Dimension changes:* By default, `m_apply` should be able to transform m-dimensional data into n-dimensional data
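
For the multiple-outputs case, the custom function returns a tuple, and `m_apply` pairs each element with the entry of `output_names` at the same position. A rough sketch of that pairing follows, using a hypothetical function `split_extremes` and a plain dict in place of an `.hdf5` file:

```python
def split_extremes(data):
    """Hypothetical custom function returning two outputs as a tuple."""
    return min(data), max(data)


output_names = ['Minimum', 'Maximum']
result = split_extremes([3, 1, 4, 1, 5])

# m_apply only writes when the number of names matches the number of outputs
if len(output_names) == len(result):
    named_outputs = dict(zip(output_names, result))
else:
    named_outputs = {}
print(named_outputs)
```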

***

### m_apply

In [5]:
def m_apply(filename, function, in_paths, output_names=None, folder_names = None,
            use_attrs = None, prop_attrs = None, increment_proc = True, **kwargs):
    
    #Convert in_paths to a list if not already
    if type(in_paths) != list:
        in_paths = [in_paths]
    
    #Guess output_names (aka channel names) if not given
    if output_names is None:
        output_names = in_paths[0].rsplit('/', 1)[1]
    
    #Guess folder_names (aka sample names) if not given
    if folder_names is None:
        folder_names = in_paths[0].rsplit('/', 2)[1]
    
    #Convert output_names to list if not already
    if type(output_names) != list:
        output_names = [output_names]
    
    #Convert prop_attrs to list if it exists, but not already a list
    if prop_attrs is not None:
        if type(prop_attrs) != list:
            prop_attrs = [prop_attrs]
            
    #Convert use_attrs to list if it exists, but not already a list
    if use_attrs is not None:
        if type(use_attrs) != list:
            use_attrs = [use_attrs]
    
    #Convert file to hdf5 if not already
    if filename.split('.')[-1] != 'hdf5':
        if os.path.isfile(filename.split('.')[0] + '.hdf5'):
            filename = filename.split('.')[0] + '.hdf5'
        else:
            try:
                read_file.tohdf5(filename)
                filename = filename.split('.')[0] + '.hdf5'
                print('The file does not have an hdf5 extension. It has been converted.')
            except:
                print('The given filename does not have an hdf5 extension, and it was not possible ' \
                        'to convert it. Please use an hdf5 file with m_apply')
                
    #Open hdf5 file to extract data, attributes, and run function
    data_list = []
    prop_attr_keys = []
    prop_attr_vals = []
    use_attr_keys = []
    use_attr_vals = []
    with h5py.File(filename, 'r') as f:
        for path in in_paths:
            data_list.append(np.array(f[path]))
            if prop_attrs is not None:
                for prop_attr in prop_attrs:
                    if (prop_attr not in prop_attr_keys) and (prop_attr in f[path].attrs):
                        prop_attr_keys.append(prop_attr)
                        prop_attr_vals.append(f[path].attrs[prop_attr])
            if use_attrs is not None:
                for use_attr in use_attrs:
                    if (use_attr not in use_attr_keys) and (use_attr in f[path].attrs):
                        use_attr_keys.append(use_attr)
                        use_attr_vals.append(f[path].attrs[use_attr])
                #Pass every attribute found so far to the function as a 'source_' kwarg
                for key_num in range(len(use_attr_keys)):
                    kwargs['source_'+use_attr_keys[key_num]] = use_attr_vals[key_num]
        result = function(*data_list, **kwargs)
    
    #End function if no result is calculated
    if result is None:
        return None

    #Convert result to tuple if not already
    if type(result) != tuple:
        result = tuple([result])
    
    #Open hdf5 file to write new data, attributes
    with h5py.File(filename, 'a') as f:
        num_proc = len(f['process'].keys())
        if increment_proc:
            num_proc = num_proc + 1
        out_folder_location = ('process/' + str(num_proc).zfill(3) + '-' + function.__name__ + '/'
                               + folder_names)
        fproc = f.require_group(out_folder_location)
        
        if (len(output_names) == len(result)):
            for i in range(len(output_names)):
                name = output_names[i]
                data = result[i]
                if type(data)==dict:
                    if 'hdf5_dict' in data:
                        dataset = create_dataset_from_dict(f[out_folder_location], name, data)
                        if prop_attrs is not None:
                            dataset = propagate_attrs(dataset, prop_attr_keys, prop_attr_vals)
                    else:
                        dataset = f[out_folder_location].create_dataset(name, data=data)
                        if prop_attrs is not None:
                            dataset = propagate_attrs(dataset, prop_attr_keys, prop_attr_vals)
                else:
                    dataset = f[out_folder_location].create_dataset(name, data=data)
                    if prop_attrs is not None:
                        dataset = propagate_attrs(dataset, prop_attr_keys, prop_attr_vals)
                write_generic_attributes(fproc[name], out_folder_location+'/', in_paths, name)
                #Write kwargs as attributes of each dataset that was created
                for key, value in kwargs.items():
                    dataset.attrs[key] = value
        else:
            print('Error: Unequal amount of outputs and output names')
    return result

`m_apply` requires three arguments, natively has five optional arguments, and passes all additional keyword arguments to the declared function. The required arguments are:

1. `filename`: The filename of the `.hdf5` file operated on
2. `function`: The function applied to the file
3. `in_paths`: An explicit path (or list of multiple paths) that lead to the dataset passed to the function call. In the custom functions, this will pass as the first (and successive, in case of multiple paths) positional argument to said custom function. Note that all other arguments passed will be keyword arguments.

The optional arguments are:

1. `output_names`: The name of the dataset produced in the `.hdf5` file. If this is left unset, the output inherits the name of the first source dataset. If `output_names` is a list where `len(output_names) > 1`, `m_apply` expects the function to return that many outputs and writes one dataset per name. This is required if additional outputs are desired.
2. `folder_names`: The name of the folder that the outputs lie in. This is only a single string; it cannot be a list, like `output_names` can be.
3. `use_attrs`: Attributes that will be given to the function. This can either be a string, or a list of strings. `m_apply` will look through each source to find an attribute that bears the same name as the string declared. This is done in order; if multiple sources are used, and each has the same attribute, the attribute of the first source will be used. When the custom function is then called, an additional kwarg is submitted, bearing the name of the string, preceded by `'source_'`. So, if a file has attributes `'base_attack':134` and `'move':'outrage'`, then declaring `use_attrs = ['base_attack', 'move']` would pass two additional arguments to the custom function: `source_base_attack = 134` and `source_move = 'outrage'`. If the attribute is not present, then the extra kwarg will not be passed; if the kwarg is needed, make sure the custom function accounts for its absence.
4. `prop_attrs`: These are attributes that are simply propagated from the source into the destination. As with `use_attrs`, these will be searched in order, and ignored if not found.
5. `increment_proc`: This should not be set directly, and only exists to interface with `l_apply`. By default, the process number increases on each call (from 001, to 002, to 003, ...), so if `l_apply` were to operate 5 times, 5 distinct processes would be made. `l_apply` thus sets this to `False` on subsequent operations to ensure the folder is kept the same.
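
The defaults for `output_names` and `folder_names`, and the resulting output location, come from simple string handling on the first input path. The sketch below repeats that handling outside of `m_apply`; the helper `guess_output_location` and the example path `'datasets/Sample1/Phase'` are assumptions for illustration and are not part of Multiscale.

```python
def guess_output_location(in_path, function_name, num_proc):
    """Rebuild the default output location from a source path.

    Mirrors the string handling shown in m_apply above.
    """
    output_name = in_path.rsplit('/', 1)[1]   # last path component
    folder_name = in_path.rsplit('/', 2)[1]   # second-to-last path component
    out_folder = ('process/' + str(num_proc).zfill(3) + '-' + function_name
                  + '/' + folder_name)
    return out_folder, output_name


location = guess_output_location('datasets/Sample1/Phase', 'm_sum', 1)
print(location)
```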

Finally, optional kwargs can be provided:

1. `**kwargs`: These kwargs are passed into the custom function. `m_apply` also puts them to a second use: every kwarg passed in is automatically written as an attribute of the output dataset. Thus, setting a kwarg `opponent_type = 'Steel'` will cause an attribute called `opponent_type` to be written with a value of `'Steel'`. Following the above example, as `use_attrs` automatically passes its values as kwargs to the custom function, these are also written as attributes. Thus, given the arguments defined with `use_attrs`, both `source_base_attack = 134` and `source_move = 'outrage'` will also be written as attributes.
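
To make the interaction between `use_attrs` and `**kwargs` concrete, here is a simplified emulation that uses plain dicts in place of the attributes stored in the `.hdf5` file. The helper `collect_source_kwargs` is hypothetical and only approximates the lookup in `m_apply`; the returned dict corresponds to what would be passed to the custom function and then written as attributes.

```python
def collect_source_kwargs(source_attrs, use_attrs, **kwargs):
    """Emulate how use_attrs become 'source_' kwargs (simplified sketch).

    source_attrs is a list of dicts, one per source, standing in for the
    attributes stored on each source dataset.
    """
    if not isinstance(use_attrs, list):
        use_attrs = [use_attrs]
    for attrs in source_attrs:             # sources are searched in order
        for name in use_attrs:
            key = 'source_' + name
            if key not in kwargs and name in attrs:
                kwargs[key] = attrs[name]  # first source with the attribute wins
    return kwargs


sources = [{'base_attack': 134, 'move': 'outrage'}, {'base_attack': 100}]
written = collect_source_kwargs(sources, ['base_attack', 'move', 'level'],
                                opponent_type='Steel')
print(written)
```

Here `level` is absent from every source, so no `source_level` kwarg is produced.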

***

### l_apply and path_search

In [8]:
def l_apply(filename, function, all_input_criteria, output_names = None, folder_names = None, 
            prop_attrs = None, repeat = None, **kwargs):
    all_in_path_list = path_search(filename, all_input_criteria, repeat)
    all_in_path_list = list(map(list, zip(*all_in_path_list)))
    increment_proc = True
    start_time = time.time()
    for path_num in range(len(all_in_path_list)):
        m_apply(filename, function, all_in_path_list[path_num], output_names = output_names,
                folder_names = folder_names, increment_proc = increment_proc,
                prop_attrs = prop_attrs, **kwargs)
        progress_report(path_num+1, len(all_in_path_list), start_time, function.__name__,
                        all_in_path_list[path_num])
        increment_proc = False    
        
def path_search(filename, all_input_criteria, repeat = None):
    if type(all_input_criteria) != list:
        all_input_criteria = [all_input_criteria]
    if type(all_input_criteria[0]) != list:
        all_input_criteria = [all_input_criteria]
    
    with h5py.File(filename, 'r') as f:
        all_path_list = find_paths_of_all_subgroups(f, 'datasets')
        all_path_list.extend(find_paths_of_all_subgroups(f, 'process'))
        
        all_in_path_list = []
        list_lengths = []
        for each_data_type in all_input_criteria:
            in_path_list = []
            for each_criteria in each_data_type:
                for path in all_path_list:
                    if fnmatch.fnmatch(path, each_criteria):
                        in_path_list.append(path)
            all_in_path_list.append(in_path_list)
            list_lengths.append(len(in_path_list))
        if len(list_lengths) == 1:
            if list_lengths[0] == 0:
                print('No Input Datafiles found!')
        else:
            if len(set(list_lengths)) != 1:
                if repeat is None:
                    print('Input lengths not equal, and repeat not set! Extra files will be omitted.')
                else:
                    largest_list_length = np.max(list_lengths)
                    list_multiples = []
                    for length in list_lengths:
                        if largest_list_length%length != 0:
                            print('At least one path list length is not a factor of the largest path '\
                                  'list length. Extra files will be omitted.')
                        list_multiples.append(largest_list_length//length)
                    if (repeat == 'block') or (repeat == 'b'):
                        for list_num in range(len(list_multiples)):
                            all_in_path_list[list_num] = np.repeat(all_in_path_list[list_num],
                                                                   list_multiples[list_num])
                    if (repeat == 'alt') or (repeat == 'a'):
                        for list_num in range(len(list_multiples)):
                            old_path_list = all_in_path_list[list_num]
                            new_path_list = []
                            for repeat_iter in range(list_multiples[list_num]):
                                new_path_list.extend(old_path_list)
                            all_in_path_list[list_num] = new_path_list
    return all_in_path_list

`l_apply` effectively calls `m_apply` several times, so its arguments are nearly identical to those of `m_apply`. The one difference is that `in_paths`, which explicitly named the source datasets, has been replaced with two new arguments:

1. `all_input_criteria`: The use of this argument is described in the Intermediate tutorial. Effectively, this is a two-dimensional list; if fewer than two dimensions are submitted, the extra dimensions are added automatically. The inner list is a list of search conditions that can be used with wildcards. If multiple conditions are given, the matches for all of them are collected, so searching for `[['*Phase*', '*Amplitude*']]` would find the datasets matching either condition and send them to `m_apply` one at a time.

The outer list is only used if the custom function calls for two or more datasets; in that case, a second condition could be applied. Searching for `[['*Phase*'], ['*Amplitude*']]` would thus send two arrays to the custom function on each operation; a phase, and an amplitude.

2. `repeat`: This increases the ease of using `l_apply` when multiple datasets are needed, and some of these datasets need to be used multiple times. This can occur if, for example, `n` parameters need to be applied to `8n` arrays. In this case, you would want the `n` parameters to repeat 8 times. There are two options for how to use `repeat`. The first, `alt`, or `a`, repeats the shorter path list in its entirety; so a path list `ABC` would repeat to be `ABCABCABC...`. The other option, `block`, or `b`, repeats each individual component, causing 'phase separation' in a classical sense. A path list `ABC` would thus become `AAA...BBB...CCC...`, where the blocks of `A`, `B` and `C` are all of equal length.
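
The two `repeat` schemes can be sketched with plain lists. The helper `repeat_paths` below is hypothetical; `path_search` works out the multiple itself from the list lengths, whereas here it is passed in directly.

```python
def repeat_paths(paths, multiple, repeat):
    """Repeat a short path list using the 'alt' or 'block' scheme."""
    if repeat in ('alt', 'a'):
        # Repeat the whole list: ABC -> ABCABC...
        return paths * multiple
    if repeat in ('block', 'b'):
        # Repeat each entry in turn: ABC -> AA...BB...CC...
        return [path for path in paths for _ in range(multiple)]
    return paths


alt = repeat_paths(['A', 'B', 'C'], 2, 'alt')
block = repeat_paths(['A', 'B', 'C'], 2, 'block')
print(alt)
print(block)
```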

The function `path_search` can also be called separately from `l_apply`. This allows more complicated custom functions to utilise the search conditions and wildcards, even if they cannot necessarily use `l_apply` or `m_apply` in their entirety.
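
The matching step itself can be tried without an `.hdf5` file. The sketch below applies the same `fnmatch` logic to a hand-written list of paths; the helper `match_criteria` and the example paths are assumptions for illustration, and the final transposition is the one `l_apply` performs before calling `m_apply`.

```python
import fnmatch


def match_criteria(all_paths, all_input_criteria):
    """Return one list of matching paths per entry of the outer list."""
    # Wrap the criteria until they form a two-dimensional list
    if not isinstance(all_input_criteria, list):
        all_input_criteria = [all_input_criteria]
    if not isinstance(all_input_criteria[0], list):
        all_input_criteria = [all_input_criteria]
    matches = []
    for data_type in all_input_criteria:
        found = []
        for criterion in data_type:
            found.extend(p for p in all_paths if fnmatch.fnmatch(p, criterion))
        matches.append(found)
    return matches


paths = ['datasets/S1/Phase', 'datasets/S1/Amplitude',
         'datasets/S2/Phase', 'datasets/S2/Amplitude']
one_input = match_criteria(paths, ['*Phase*', '*Amplitude*'])
two_inputs = match_criteria(paths, [['*Phase*'], ['*Amplitude*']])
# Transpose so that each call receives one path per input
calls = list(map(list, zip(*two_inputs)))
print(one_input)
print(calls)
```

With a single inner list, every match becomes its own call; with two inner lists, each call receives one phase path and one amplitude path.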
***

### hdf5_dict and Attribute Writing

In [16]:
def hdf5_dict(data, **kwargs):
    data_dict = {
        'hdf5_dict':True,
        'data':data
    }
    data_dict.update(kwargs)
    return data_dict

As described above, there are five ways that attributes are written. The first four are handled within `m_apply`:

* Basic attributes are written by `write_generic_attributes`, which is called by `m_apply`
* kwargs passed into `m_apply` are written as attributes
* Source attributes passed into `m_apply` via `use_attrs`, which become kwargs, are also written
* Attributes can be copied and propagated from sources by using the argument `prop_attrs` in `m_apply`.

The final method - using `hdf5_dict` - needs to be used in custom functions. This allows intermediate or side results to be saved as attributes. To use `hdf5_dict`, it must be called within the function. Consider the custom function m_sum below:

In [17]:
def m_sum(*args):
    total = 0
    for arg in args:
        total = total+arg
        
    return total

This function, if called by `m_apply`, would add the entries of a list together. Say that an attribute reporting the number of entries added together was desired. This value, `input_count`, could be defined as `len(args)`. Simply returning `input_count` alongside the total would cause Multiscale to treat it as another dataset, so it instead needs to be combined with the actual dataset as a dict:

In [18]:
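def m_sum(*args):
    total = 0
    for arg in args:
        total = total+arg
    
    #Count the inputs so that the count can be saved as an attribute
    input_count = len(args)
    
    #Bundle the dataset with its attribute instead of returning total alone
    result = hdf5_dict(total, input_count=input_count)
    return result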

`hdf5_dict` then creates a `dict`, which in this case has three keys: `'hdf5_dict'`, which marks it as being made by the function of the same name; `'data'`, which contains the dataset; and `'input_count'`, which is the attribute, whose name is defined in the `hdf5_dict` function call. If additional attributes were desired, these would be additional keys in the dict.
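
To see the resulting dict directly, a version of `m_sum` can be called on a few numbers outside of `m_apply`. This is only a sketch: `hdf5_dict` is copied from the definition above so the example runs on its own, and plain integers stand in for datasets.

```python
def hdf5_dict(data, **kwargs):
    # Copied from the definition above so this example is self-contained
    data_dict = {
        'hdf5_dict': True,
        'data': data
    }
    data_dict.update(kwargs)
    return data_dict


def m_sum(*args):
    total = 0
    for arg in args:
        total = total + arg
    # Bundle the sum with the number of inputs as an attribute
    return hdf5_dict(total, input_count=len(args))


result = m_sum(1, 2, 3)
print(result)
```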

After this, `create_dataset_from_dict` is called by `m_apply` to write these attributes.

In [19]:
def create_dataset_from_dict (dataset, name, dict_data):
    dataset = dataset.create_dataset(name, data = dict_data['data'])
    for key, value in dict_data.items():
        if (key != 'hdf5_dict') and (key != 'data'):
            dataset.attrs[key] = value
    return dataset
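
The filtering that `create_dataset_from_dict` performs can be checked without `h5py`. The hypothetical helper below, `split_hdf5_dict`, returns the dataset and the attributes as plain Python objects instead of writing them to a group:

```python
def split_hdf5_dict(dict_data):
    """Separate the dataset from its attributes (illustrative stand-in)."""
    data = dict_data['data']
    # Every key other than the two reserved ones is treated as an attribute
    attrs = {key: value for key, value in dict_data.items()
             if key not in ('hdf5_dict', 'data')}
    return data, attrs


data, attrs = split_hdf5_dict({'hdf5_dict': True, 'data': 6, 'input_count': 3})
print(data, attrs)
```

One consequence of this filtering is that `'hdf5_dict'` and `'data'` are reserved keys, so they cannot be used as attribute names.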