In [1]:
%reload_kedro

2020-08-12 10:35:57,141 - root - INFO - ** Kedro project optimus_pkg
2020-08-12 10:35:57,146 - root - INFO - Defined global variable `context` and `catalog`
2020-08-12 10:35:57,156 - root - INFO - Registered line magic `run_viz`


# Tag Dictionary

The Tag Dictionary (TagDict) is a central means of configuration for Optimus pipelines. It is the place where tag-specific configuration can be set and serves as a tool for communicating with subject matter experts.

Underlying the tag dictionary is a simple CSV file. To make interactions with the data as simple as possible, we provide multiple tools:

- a `TagDict` class which allows for simple query operations
- a custom Kedro data set which allows the CSV to be read and turned into an instance of `TagDict`, and vice versa

> As a general rule, we suggest using the tag dictionary to store all information about an individual tag, such as the expected range, the clear name, or a mapping to one or more models. Higher-level parametrization (e.g. per dataset) or parametrization of pipeline nodes is likely better placed in the pipeline's `conf` section.


## Key columns
The tag dictionary is designed so that users can add any columns they want: add a column to the underlying CSV and it is reflected in the `TagDict` object. A small number of columns are required for the tag dictionary to function properly, and these are validated whenever a `TagDict` object is created or loaded from a CSV.

The minimum columns required to construct an instance of `TagDict` are:

| column              | description                                                                                                                                           | 
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | 
| tag                 | tag name (key)                                                                                                                                        | 
| name                | human-readable clear name                                                                                                                             | 
| tag_type            | functional type. One of {"input", "output", "state", "control", "on_off"}                                                                             | 
| data_type           | data type. One of {"numeric", "categorical", "boolean", "datetime"}                                                                                   | 
| unit                | unit of measurement                                                                                                                                   | 
| range_min           | lower limit of tag range (values that the measurement can physically take)                                                                            | 
| range_max           | upper limit of tag range (values that the measurement can physically take)                                                                            | 
| on_off_dependencies | names (keys) of on/off tags which determine the current tag's on/off state. If one of the dependencies is off, the current tag is considered off, too | 
| derived             | indicates whether a tag is an original sensor reading or artificially created / derived                                                               | 
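
The kind of validation described above can be sketched with the standard library alone. This is a minimal illustration, not the actual `TagDict` validation: the helper `check_tag_dict_csv` is hypothetical, the column names and allowed values are taken from the table above, and the duplicate-key check is an assumption based on `tag` being described as the key.

```python
import csv
import io

# Required columns, as listed in the table above.
REQUIRED_COLUMNS = [
    "tag", "name", "tag_type", "data_type", "unit",
    "range_min", "range_max", "on_off_dependencies", "derived",
]
# Allowed values, as listed in the table above.
TAG_TYPES = {"input", "output", "state", "control", "on_off"}
DATA_TYPES = {"numeric", "categorical", "boolean", "datetime"}


def check_tag_dict_csv(text):
    """Hypothetical helper: return a list of problems found in a tag dictionary CSV."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing required columns: {missing}"]

    problems = []
    seen = set()
    for row in reader:
        tag = row["tag"]
        # Assumption: tag names act as keys and must therefore be unique.
        if tag in seen:
            problems.append(f"duplicate tag: {tag}")
        seen.add(tag)
        if row["tag_type"] not in TAG_TYPES:
            problems.append(f"{tag}: unknown tag_type {row['tag_type']!r}")
        if row["data_type"] not in DATA_TYPES:
            problems.append(f"{tag}: unknown data_type {row['data_type']!r}")
    return problems


sample_csv = (
    "tag,name,tag_type,data_type,unit,range_min,range_max,on_off_dependencies,derived\n"
    "engine,Engine On/Off Sensor,on_off,boolean,on/off,False,True,,False\n"
    "speedometer,Speedometer,speed,numeric,mph,0,150,engine,False\n"
)
problems = check_tag_dict_csv(sample_csv)
print(problems)
```

Here the second row uses `speed`, which is not one of the listed tag types, so it is reported as a problem.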

## Extra Columns

Some columns are not mandatory in the tag dictionary but are commonly used by modular pipelines and shared code. Always consider using these columns before inventing new column names, so that you maintain compatibility with other Optimus solutions.

The common extra columns used in an instance of `TagDict` are:

| column              | description                                                                                                                                           | 
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | 
| area                | plant area                                                                                                                                            | 
| sub_area            | plant sub-area                                                                                                                                        | 
| op_min              | lower limit of operating range (values that should be considered for a control variable)                                                              | 
| op_max              | upper limit of operating range (values that should be considered for a control variable)                                                              | 
| max_delta           | maximum change from current value allowed during optimization                                                                                         | 
| constraint_set      | set of permissible values for control                                                                                                                 | 
| agg_window_length   | length of window over which to aggregate during static feature creation                                                                               | 
| agg_method          | static feature creation aggregation method                                                                                                            | 
| notes               | free-text notes                                                                                                                                       | 
| model_feature       | indicates tag as a feature of the model                                                                                                               | 
| model_target        | indicates tag as the model target                                                                                                                     | 
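
To make the relationship between `op_min`, `op_max`, and `max_delta` concrete, the sketch below shows one plausible way they could combine into the bounds a control may take in a single optimization step. The helper `optimization_bounds` is hypothetical and the combination rule is an assumption; the optimizer used in a given Optimus pipeline may treat these columns differently.

```python
def optimization_bounds(current, op_min, op_max, max_delta=None):
    """Hypothetical helper: bounds for a control value in one optimization step.

    Assumes the operating range and the maximum allowed change from the
    current value are both hard limits, so the tighter one wins on each side.
    """
    lower, upper = op_min, op_max
    if max_delta is not None:
        lower = max(lower, current - max_delta)
        upper = min(upper, current + max_delta)
    return lower, upper


# Accelerator currently at 45 % travel, operating range 0-50, at most 10 change.
bounds = optimization_bounds(current=45.0, op_min=0.0, op_max=50.0, max_delta=10.0)
print(bounds)  # (35.0, 50.0)
```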

## Creating Tag Dict

It is possible to create a `TagDict` programmatically, and this may be useful to get started. Let's take a self-driving car example, and create the tag dictionary with only a single tag, engine, in it.

In [2]:
from optimus_pkg.core.tag_management import TagDict
td = TagDict.from_dict({
    "engine": {
        'name': 'Engine On/Off Sensor', 
        'tag_type': 'on_off', 
        'data_type': 'boolean',
        'unit': 'on/off', 
        'range_min': False, 
        'range_max': True,
        'on_off_dependencies': [], 
        'derived': False, 
    },
})
td.to_frame()

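|     | tag    | name                 | tag_type | data_type | unit   | range_min | range_max | on_off_dependencies | derived |
| --- | ------ | -------------------- | -------- | --------- | ------ | --------- | --------- | ------------------- | ------- |
| 0   | engine | Engine On/Off Sensor | on_off   | boolean   | on/off | False     | True      |                     | False   |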


In this example we used the `TagDict.from_dict` method, but you can also construct a `TagDict` from JSON or a pandas `DataFrame`. At a minimum, we had to provide all of the required columns to describe the engine tag.

## Adding Tags

Once you have a `TagDict`, it is also possible to add new tags to it programmatically. This can be done through the `add_tag` method of the dictionary. Let's extend our self-driving car example with 4 further tags: speedometer, accelerator, ron, and cof.

In [3]:
td.add_tag({
        'tag': ['speedometer', 'accelerator', 'ron', 'cof'],
        'name': ['Speedometer', 'Accelerator Pedal', 'Research Octane Number', 'Road Coefficient of Friction'], 
        'tag_type': ['output', 'control', 'input', 'state'], 
        'data_type': ['numeric', 'numeric', 'categorical', 'numeric'],
        'unit': ['mph', 'travel %', 'RON', 'ratio'], 
        'range_min': [0.0,0.0,'93',0.0], 
        'range_max': [150.0,100.0,'99',2.0],
        'on_off_dependencies': ['engine',None,None,None], 
        'derived': [False,False,False,True], 
    })
td.to_frame()

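|     | tag         | name                         | tag_type | data_type   | unit     | range_min | range_max | on_off_dependencies | derived |
| --- | ----------- | ---------------------------- | -------- | ----------- | -------- | --------- | --------- | ------------------- | ------- |
| 0   | engine      | Engine On/Off Sensor         | on_off   | boolean     | on/off   | False     | True      |                     | False   |
| 1   | speedometer | Speedometer                  | output   | numeric     | mph      | 0         | 150       | engine              | False   |
| 2   | accelerator | Accelerator Pedal            | control  | numeric     | travel % | 0         | 100       |                     | False   |
| 3   | ron         | Research Octane Number       | input    | categorical | RON      | 93        | 99        |                     | False   |
| 4   | cof         | Road Coefficient of Friction | state    | numeric     | ratio    | 0         | 2         |                     | True    |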


## Querying The Tag Dict

Most users will interact with the tag dictionary through an instance of the `TagDict` class. The `TagDict` class is a thin wrapper around the underlying `DataFrame` that makes it easy to query tag information.

### Reading Tag Properties

To get all information for a single tag, simply subset the tag dict instance:

In [4]:
td["engine"]

{'tag': 'engine',
 'name': 'Engine On/Off Sensor',
 'tag_type': 'on_off',
 'data_type': 'boolean',
 'unit': 'on/off',
 'range_min': False,
 'range_max': True,
 'on_off_dependencies': [],
 'derived': False}

### Check If A Tag Exists
To check whether the tag dict contains information about a given tag, use the `in` operator:

In [5]:
"engine" in td

True
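
The subsetting and `in` behaviour shown above is what a thin wrapper typically provides through `__getitem__` and `__contains__`. The toy class below is only an illustration of that idea: `MiniTagDict` is a hypothetical stand-in backed by a plain dictionary rather than a `DataFrame`, and returning a copy from `__getitem__` is a design choice of this sketch, not a documented property of `TagDict`.

```python
class MiniTagDict:
    """Toy stand-in for TagDict, keyed by tag name (not the real class)."""

    def __init__(self, rows):
        # Index the rows by their tag name for direct lookup.
        self._rows = {row["tag"]: dict(row) for row in rows}

    def __getitem__(self, tag):
        if tag not in self._rows:
            raise KeyError(f"Tag '{tag}' not found in tag dictionary")
        # Return a copy so callers cannot mutate the internal state.
        return dict(self._rows[tag])

    def __contains__(self, tag):
        return tag in self._rows


mini = MiniTagDict([{"tag": "engine", "tag_type": "on_off"}])
print(mini["engine"])
print("engine" in mini, "gearbox" in mini)
```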

### Dependencies
Dependencies can be queried via the `dependencies` and `dependents` methods.

In [6]:
td.dependencies("speedometer")

{'engine'}

In [7]:
td.dependents("engine")

{'speedometer'}
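
The two lookups are inverses of each other over the `on_off_dependencies` column. The sketch below mimics them with a plain dictionary and adds the on/off rule from the required-columns table (a tag is off if any of its dependencies is off). All names here are hypothetical, and only direct dependencies are resolved; whether the real methods follow chains of dependencies is not stated in this document.

```python
# Each tag mapped to its on_off_dependencies, mirroring the example above.
ON_OFF_DEPENDENCIES = {
    "engine": [],
    "speedometer": ["engine"],
    "accelerator": [],
}


def dependencies(tag):
    """Tags that `tag` directly depends on."""
    return set(ON_OFF_DEPENDENCIES[tag])


def dependents(tag):
    """Tags that list `tag` among their direct dependencies."""
    return {t for t, deps in ON_OFF_DEPENDENCIES.items() if tag in deps}


def is_on(tag, on_off_readings):
    """A tag counts as off as soon as any of its dependencies reads off."""
    return all(on_off_readings[dep] for dep in dependencies(tag))


print(dependencies("speedometer"))  # {'engine'}
print(dependents("engine"))         # {'speedometer'}
print(is_on("speedometer", {"engine": False}))  # False
```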

### Retrieve Sets of Tags

The `select` method can be used to quickly retrieve sets of tags.

Using `select` without any arguments will return all tags in the `TagDict`. 

Similar to 
```sql
select tag from tags
```

In [8]:
td.select()

['engine', 'speedometer', 'accelerator', 'ron', 'cof']

Using `select` with only a column name yields all tags with non-zero, non-null entries in that column. This is helpful for boolean flags, such as the assignment of tags to a model.

Similar to 
```sql
select tag from tags 
where derived is not null 
and derived > 0
```

In [9]:
td.select("derived")

['cof']

Using `select` with a column name and a value *x* yields all tags with the column entry *x*. This is helpful for filtering, e.g. by tag type.

Similar to 
```sql
select tag from tags
where tag_type = 'on_off'
```

In [10]:
td.select("tag_type", "on_off")

['engine']

Using `select` with a column name and a callable *f* yields all tags where *f(column)* evaluates to True.

Similar to
```sql
select tag from tags
where lambda_udf("tag_type") = True
```

In [11]:
td.select("tag_type", lambda col: col in ["control", "output"])

['speedometer', 'accelerator']

Using `select` with only a callable _f_ passed through the `condition` argument yields all tags where _f(row)_ evaluates to True.

Similar to 

```sql
select tag from tags
where lambda_udf(*) = True
```

In [12]:
td.select(condition=lambda row: row["data_type"] == "numeric" and row["range_max"] > 0 and row["derived"])

['cof']
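
The four calling patterns above can be summarised in a small pure-Python sketch. This is not the `TagDict.select` implementation: the function operates on a list of row dictionaries, the parameter names `filter_col` and `filter_val` are hypothetical (only `condition` appears in this document), and combining a column filter with `condition` using a logical AND is an assumption.

```python
ROWS = [
    {"tag": "engine", "tag_type": "on_off", "data_type": "boolean", "range_max": True, "derived": False},
    {"tag": "speedometer", "tag_type": "output", "data_type": "numeric", "range_max": 150.0, "derived": False},
    {"tag": "accelerator", "tag_type": "control", "data_type": "numeric", "range_max": 100.0, "derived": False},
    {"tag": "ron", "tag_type": "input", "data_type": "categorical", "range_max": "99", "derived": False},
    {"tag": "cof", "tag_type": "state", "data_type": "numeric", "range_max": 2.0, "derived": True},
]


def select(rows, filter_col=None, filter_val=None, condition=None):
    """Return tag names matching the column filter and/or the row condition."""
    selected = []
    for row in rows:
        if filter_col is not None:
            value = row.get(filter_col)
            if filter_val is None:
                keep = bool(value)               # drops None, 0 and False
            elif callable(filter_val):
                keep = bool(filter_val(value))   # f(column entry)
            else:
                keep = value == filter_val       # exact match
            if not keep:
                continue
        if condition is not None and not condition(row):
            continue                             # f(row) must be truthy
        selected.append(row["tag"])
    return selected


print(select(ROWS))
print(select(ROWS, "derived"))
print(select(ROWS, "tag_type", "on_off"))
print(select(ROWS, "tag_type", lambda col: col in ["control", "output"]))
print(select(ROWS, condition=lambda row: row["data_type"] == "numeric" and row["derived"]))
```

Note that a row-level condition should test `data_type` before comparing range values, as in the example above: `ron` stores its range as strings, and comparing a string with a number raises a `TypeError` in Python.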

## Loading and Saving the TagDict
While the `TagDict` class operates on an underlying pandas `DataFrame`, it can be exported to, and created from, multiple formats. Currently supported formats are `csv`, `excel`, and `json`.

See `optimus_pkg.core.tag_management.io` for the corresponding Kedro DataSet classes.

### CSV

The `catalog.yml` entry for a CSV-based `TagDict` looks like:

```yaml
td:
  type: optimus_pkg.core.tag_management.TagDictCSVLocalDataSet
  filepath: path/to/my/tag_dict.csv
  layer: raw
```

This is an extension of the [Kedro Pandas CSV Dataset](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.CSVDataSet.html).

### JSON

The `catalog.yml` entry for a JSON-based `TagDict` looks like:

```yaml
td:
  type: optimus_pkg.core.tag_management.TagDictJSONLocalDataSet
  filepath: path/to/my/tag_dict.json
  layer: raw
```

This is an extension of the [Kedro Text Dataset](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.text.TextDataSet.html).

### Excel

The `catalog.yml` entry for an Excel-based `TagDict` looks like:

```yaml
td:
  type: optimus_pkg.core.tag_management.TagDictExcelLocalDataSet
  filepath: path/to/my/tag_dict.xlsx
  layer: raw
  load_args:
     sheet_name: "my_sheet"  # if excluded, default is first sheet in workbook
  save_args:
     sheet_name: "my_sheet"
```

This is an extension of the [Kedro Pandas Excel Dataset](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.ExcelDataSet.html).

## Extending the Tag Dictionary

Many of the modular pipelines within Optimus use the `TagDict` for parameterization. The advantage is that it makes collaboration with domain experts easy, since they can help construct the tag dictionary. As you create your own modular pipelines, use the tag dictionary to encode domain-led parameterization.

For example, you may want to specify permitted operating ranges (as expected by the optimizer) that are narrower than the range from `range_min` to `range_max`. In this case, we could extend the tag dictionary to include two new attributes: `op_min` and `op_max`.


In [13]:
# get copy of Tag Dict data
tag_dict_df = td.to_frame()

# add column to df
tag_dict_df["op_min"] = None
tag_dict_df["op_max"] = None

# set min and max for accelerator
tag_dict_df.loc[tag_dict_df["tag"] == "accelerator", "op_min"] = 0
tag_dict_df.loc[tag_dict_df["tag"] == "accelerator", "op_max"] = 50

# recreate tag dict
td = TagDict(tag_dict_df)

td.to_frame()

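|     | tag         | name                         | tag_type | data_type   | unit     | range_min | range_max | on_off_dependencies | derived | op_min | op_max |
| --- | ----------- | ---------------------------- | -------- | ----------- | -------- | --------- | --------- | ------------------- | ------- | ------ | ------ |
| 0   | engine      | Engine On/Off Sensor         | on_off   | boolean     | on/off   | False     | True      |                     | False   |        |        |
| 1   | speedometer | Speedometer                  | output   | numeric     | mph      | 0         | 150       | engine              | False   |        |        |
| 2   | accelerator | Accelerator Pedal            | control  | numeric     | travel % | 0         | 100       |                     | False   | 0.0    | 50.0   |
| 3   | ron         | Research Octane Number       | input    | categorical | RON      | 93        | 99        |                     | False   |        |        |
| 4   | cof         | Road Coefficient of Friction | state    | numeric     | ratio    | 0         | 2         |                     | True    |        |        |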
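
When adding operating ranges like this, it can be worth checking that each one actually lies inside the tag's physical range. The sketch below is a hypothetical sanity check on plain row dictionaries; `check_operating_range` is not part of `TagDict`, and treating a missing `op_min` or `op_max` as "no operating range" is an assumption.

```python
def check_operating_range(row):
    """Hypothetical check: return a message if the operating range is inconsistent."""
    op_min, op_max = row.get("op_min"), row.get("op_max")
    if op_min is None or op_max is None:
        return None  # assumption: no operating range configured for this tag
    if op_min > op_max:
        return f"{row['tag']}: op_min {op_min} is greater than op_max {op_max}"
    if op_min < row["range_min"] or op_max > row["range_max"]:
        return f"{row['tag']}: operating range lies outside range_min/range_max"
    return None


accelerator = {"tag": "accelerator", "range_min": 0.0, "range_max": 100.0,
               "op_min": 0, "op_max": 50}
too_wide = dict(accelerator, op_max=120)

print(check_operating_range(accelerator))  # None
print(check_operating_range(too_wide))
```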
