This notebook presents the various elements of the DTW component of the predictive system. The actual module is stored in the `libdtw.py` file.

## Preliminaries: Data Loading
The `libdtw.py` module contains 2 functions:
- `load_data(n_to_keep=50, data_path = "data/ope3_26.pickle")`
- `assign_ref(data)`

and one class:
- `Dtw(json_obj = False)`

We first illustrate the usage of the two functions.

`load_data(n_to_keep=50, data_path = "data/ope3_26.pickle")` loads the pickle file located at `data_path`. The file must contain a dictionary whose keys are the batch IDs and whose values are lists of dictionaries representing the PVs (with keys `name, start, end, values`).

The function first identifies the median batch (with respect to duration, measured as the number of values of the first PV of each batch) to be used as reference. It then keeps the `n_to_keep` batches closest to the reference in terms of duration. The output dictionary has an explicit `reference` key holding the ID of the reference batch.

In [None]:
import pickle

import numpy as np


def load_data(n_to_keep=50, data_path = "data/ope3_26.pickle"):
    """
    Load data of operation 3.26, only the n_to_keep batches with duration closer to the median one
    are selected
    """
    with open(data_path, "rb") as infile:
        data = pickle.load(infile)

    operation_length = list()
    pv_dataset = list()
    for _id, pvs in data.items():
        operation_length.append((len(pvs[0]['values']), _id))
        pv_list = list()
        for pv_dict in pvs:
            pv_list.append(pv_dict['name'])
        pv_dataset.append(pv_list)

    median_len = np.median([l for l, _id in operation_length])

    # Select the batches closest to the median batch:
    # compute the distance of each batch from the median duration
    centered = [(abs(l-median_len), _id) for l, _id in operation_length]
    selected = sorted(centered)[:n_to_keep]

    med_id = selected[0][1]

    all_ids = list(data.keys())
    for _id in all_ids:
        if _id not in [x[1] for x in selected]:
            _ = data.pop(_id)

    data['reference'] = med_id

    return data
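
The selection logic can be tried on a toy data set without any pickle file. The following is a minimal sketch: the batch IDs, the PV name and the `start`/`end` values are hypothetical placeholders, and only the structure (batch ID mapped to a list of PV dictionaries) follows the format described above.

```python
import numpy as np

# Toy data set (hypothetical IDs, PV name and start/end placeholders):
# batch ID -> list of PV dictionaries
toy_data = {
    "b1": [{"name": "pv_a", "start": 0, "end": 3, "values": [1.0, 2.0, 3.0]}],
    "b2": [{"name": "pv_a", "start": 0, "end": 5, "values": [1.0, 2.0, 3.0, 4.0, 5.0]}],
    "b3": [{"name": "pv_a", "start": 0, "end": 9, "values": list(range(9))}],
}

# Duration of a batch = number of samples of its first PV
lengths = [(len(pvs[0]["values"]), _id) for _id, pvs in toy_data.items()]
median_len = np.median([l for l, _ in lengths])

# Sort by distance from the median duration and keep the closest n_to_keep batches
n_to_keep = 2
selected = sorted((abs(l - median_len), _id) for l, _id in lengths)[:n_to_keep]

reference_id = selected[0][1]          # batch closest to the median duration
kept_ids = [_id for _, _id in selected]
print(reference_id, kept_ids)
```

With durations 3, 5 and 9 the median is 5, so `b2` becomes the reference and `b3`, the batch farthest from the median, is discarded.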

`assign_ref(data)` computes the median batch (according to duration) and sets it as the `reference` of the data set. It is useful when the data loaded with `load_data()` is modified before its actual use.

In [None]:
from copy import copy

import numpy as np


def assign_ref(data):
    data = copy(data)
    operation_length = list()
    pv_dataset = list()
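    for _id, pvs in data.items():
        if _id == 'reference':
            # Skip a 'reference' entry left by load_data(): its value is a batch ID, not a list of PVs
            continue
        operation_length.append((len(pvs[0]['values']), _id))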
        pv_list = list()
        for pv_dict in pvs:
            pv_list.append(pv_dict['name'])
        pv_dataset.append(pv_list)

    median_len = np.median([l for l, _id in operation_length])

    # Sort all the batches by their distance from the median duration
    # (no batch is discarded here)
    centered = [(abs(l-median_len), _id) for l, _id in operation_length]
    selected = sorted(centered)

    med_id = selected[0][1]  # 5153

    all_ids = list(data.keys())
    for _id in all_ids:
        if _id not in [x[1] for x in selected]:
            _ = data.pop(_id)

    data['reference'] = med_id

    return data
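
One detail of the reference selection is worth spelling out: `sorted()` compares the `(distance, batch_id)` tuples element by element, so when two batches are equally distant from the median duration the one with the smaller ID is chosen. A small sketch with hypothetical batch IDs and durations:

```python
import numpy as np

# (duration, batch ID) pairs; IDs and durations are hypothetical
lengths = [(4, "batch_b"), (6, "batch_a"), (10, "batch_c"), (2, "batch_d")]

# With an even number of batches the median is the mean of the two middle durations
median_len = np.median([l for l, _ in lengths])

# Tuples are compared element by element: distance first, then batch ID
centered = sorted((abs(l - median_len), _id) for l, _id in lengths)
reference_id = centered[0][1]
print(median_len, reference_id)
```

Here the median duration is 5, `batch_a` and `batch_b` are both at distance 1, and `batch_a` is selected only because its ID sorts first.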

## Dtw class
The `Dtw(json_obj)` class contains all the methods used to set up the DTW algorithm, optimize the variable weights, and compute the alignment. It contains the following methods (grouped by topic here for clarity):

##### data handling
- `__init__`
- `convert_data_from_json`
- `add_query`
- `get_scaling_parameters`
- `remove_const_feats`
- `scale_pv`
- `convert_to_mvts`

##### dtw implementation
- `comp_dist_matrix`
- `comp_acc_dist_matrix`
- `comp_acc_element`
- `get_warping_path`
- `call_dtw`
- `dtw`
- `get_ref_prefix_length`
- `itakura`
- `extreme_itakura`
- `check_open_ended`

##### step pattern selection utilities
- `time_distortion`
- `avg_time_distortion`
- `avg_distance`
- `get_p_max`
- `get_global_p_max`

##### variables weights optimization
- `reset_weights`
- `compute_mld`
- `extract_single_feat`
- `weight_optimization_single_batch`
- `weight_optimization_step`
- `optimize_weights`
- `get_weight_variables`

##### visualization
- `distance_cost_plot`
- `plot_weights`
- `plot_by_name`
- `do_warp`
- `plot_warped_curves`

##### misc
- `online_scale`
- `online_query`
- `generate_train_set`

We now examine each method.



### Data handling
`__init__(json_obj=False, random_weights = True, scaling='group')` initializes the `Dtw` class.

- `json_obj` is the dictionary containing the data in the output format of `load_data()`. 
- `random_weights`: if True, the variable weights are initialized to randomly chosen values in the [0.1, 1] interval.
- `scaling` is the scaling strategy for the PVs: `group` scales the PVs according to the values of the PVs in the reference batch, `single` scales the PVs as individual entities.

The initialization consists of structuring the data inside the `Dtw` object, removing the features that are constant in the reference batch, filtering out the batches that do not contain all the PVs of the reference batch, and finally setting the variable weights.

In [None]:
def __init__(self, json_obj=False, random_weights = True, scaling='group'):
    """
    Initialization of the class.
    json_obj: contains the data in the usual format
    """
    if not json_obj:
        pass
    else:
        self.convert_data_from_json(deepcopy(json_obj))
        #self.scale_params = self.get_scaling_parameters()
        self.remove_const_feats()
        self.reset_weights(random=random_weights)
        self.scaling = scaling

---
`convert_data_from_json(json_obj)` takes as input the data object described above and separates it into reference and query batches. It also initializes the internal structure of the `Dtw` object, including the containers that store intermediate results so that subsequent operations do not repeat calculations, and computes the scaling parameters (minimum and maximum of each PV in the reference batch).

In [None]:
def convert_data_from_json(self, json_obj):
        """
        Stores in self.data a dictionary containing all the data, organized as:
        ref_id: the ID of the reference batch
        reference: reference batch in the usual format (list of dictionaries)
        queries: dictionary in which the keys are the query batches' IDs and the values are
        the actual batches (lists of dictionaries)
        num_queries: number of query batches in the data set
        The remaining keys are containers filled in by later computations.
        Also stores in self.scale_params the (min, max) of each PV in the reference batch.
        """
        ref_id = json_obj["reference"]
        reference = json_obj[ref_id]
        queries = {key: batch for key, batch in json_obj.items() if key !=
                   "reference" and key != ref_id}

        self.data = {"ref_id": ref_id,
                     "reference": reference,
                     "queries": queries,
                     "num_queries": len(queries),
                     "warpings": dict(),
                     "distances": dict(),
                     'warp_dist': dict(),
                     "queriesID": list(queries.keys()),
                     "time_distortion": defaultdict(dict),
                     "distance_distortion": defaultdict(dict),
                     'warpings_per_step_pattern': defaultdict(dict),
                     'feat_weights': 1.0}

        self.data_open_ended = {"ref_id": ref_id,
                                "reference": reference,
                                "queries": defaultdict(list),
                                'warp_dist': dict()}
        scale_params = dict()

        for pv_dict in self.data['reference']:
            pv_name = pv_dict['name']
            pv_min = min(pv_dict['values'])
            pv_max = max(pv_dict['values'])
            scale_params[pv_name] = (pv_min, pv_max)

        self.scale_params = scale_params
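
The core of the method, splitting the input into reference and queries and reading the scaling parameters off the reference batch, can be reproduced on a toy input. This is a sketch outside the class: the batch IDs and PV names are hypothetical, and the `start`/`end` keys are omitted because this step does not use them.

```python
# Toy input in the output format of load_data() (hypothetical IDs and PV names)
json_obj = {
    "b1": [{"name": "pv_a", "values": [0.0, 2.0, 4.0]},
           {"name": "pv_b", "values": [7.0, 7.0, 7.0]}],
    "b2": [{"name": "pv_a", "values": [1.0, 3.0]},
           {"name": "pv_b", "values": [6.0, 8.0]}],
    "reference": "b1",
}

ref_id = json_obj["reference"]
reference = json_obj[ref_id]

# Every entry that is neither the 'reference' key nor the reference batch is a query
queries = {k: v for k, v in json_obj.items() if k not in ("reference", ref_id)}

# (min, max) of each PV, computed on the reference batch only
scale_params = {pv["name"]: (min(pv["values"]), max(pv["values"])) for pv in reference}
print(list(queries), scale_params)
```

Note that `pv_b` has identical minimum and maximum in the reference batch; this is the condition that `remove_const_feats()` later uses to discard it.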

---
`add_query(batch_dict)` adds a batch (in the form of a dictionary `{batch_id: list_of_PVs}`) to the `Dtw` object, updating the relevant structures and filtering the PVs. If the batch does not contain all the necessary PVs, it is not added.

In [None]:
def add_query(self, batch_dict):
    _id, pvs = list(batch_dict.items())[0]
    self.data['queries'][_id] = pvs
    self.data['num_queries'] += 1
    self.data['queriesID'].append(_id)
    self.remove_const_feats()

---
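`get_scaling_parameters()` computes, for each PV, the median of the per-batch minima and the median of the per-batch maxima across the reference and all the queries. Its call in `__init__()` is currently commented out, so the method can probably be omitted (to be checked).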

In [None]:
def get_scaling_parameters(self):
        """
        Computes the parameters necessary for scaling the features as a 'group'.
        This means considering the median minimum and the median maximum of a variable across
        all the data set. This seems to create problems, since the distributions of the minimum
        and the maximum are too spread out. This method is kept just in case of future use and
        to help removing non-informative (constant) features.
        Returns, for each PV name, the array [median_min, median_max].
        """
        scale_params = dict()

        for pv_dict in self.data['reference']:
            pv_name = pv_dict['name']
            pv_min = min(pv_dict['values'])
            pv_max = max(pv_dict['values'])

            scale_params[pv_name] = [[pv_min], [pv_max]]

        for _id, batch in self.data['queries'].items():
            for pv_dict in batch:
                pv_name = pv_dict['name']
                pv_min = min(pv_dict['values'])
                pv_max = max(pv_dict['values'])

                scale_params[pv_name][0].append(pv_min)
                scale_params[pv_name][1].append(pv_max)

        pv_names = scale_params.keys()
        for pv_name in pv_names:
            scale_params[pv_name] = np.median(scale_params[pv_name], axis=1)

        return scale_params
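
The final `np.median(..., axis=1)` call works because each entry of `scale_params` is a pair of equally long lists, one with the minima and one with the maxima. A small sketch with made-up values for a single PV:

```python
import numpy as np

# Per-batch minima and maxima of one PV (hypothetical values):
# the reference batch first, then two query batches
pv_mins = [0.0, 1.0, 0.5]
pv_maxs = [10.0, 12.0, 50.0]

# Row 0 holds the minima, row 1 the maxima; axis=1 takes the median of each row
group_min, group_max = np.median([pv_mins, pv_maxs], axis=1)
print(group_min, group_max)
```

The median keeps the outlying maximum (50) from stretching the range, but a query whose values reach 50 would then be scaled well above 1.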

---
`remove_const_feats()` removes the features that are constant in the reference batch from all the batches, reference and queries. If a query batch does not contain all the PVs of the filtered reference one, it is removed from the `Dtw` object. 

A PV that is constant in the reference batch is not necessarily constant in the queries. However, since every DTW alignment is performed with respect to the single reference batch, such a PV would add no information.

In [None]:
def remove_const_feats(self):
        """
        Removes non-informative features (features with low variability)
        """
        const_feats = list()
        for pv_name, avg_range in self.scale_params.items():
            if abs(avg_range[0]-avg_range[1]) < 1e-6:
                const_feats.append(pv_name)
        #const_feats.append('ba_TCzWpXo')
        #const_feats.append('ba_TCfg3Yxn')
        #const_feats.append('ba_FQYXdr6Q0')


        initial_queries = list(self.data['queries'].keys())
        print('Number of queries before filtering: %d'%len(initial_queries))

        self.data['reference'] = list(filter(lambda x: x['name'] not in const_feats, self.data['reference']))
        pv_names = [pv['name'] for pv in self.data['reference']]
        for _id in initial_queries:
            self.data['queries'][_id] = list(filter(lambda x: x['name']  in pv_names, self.data['queries'][_id]))
            if len(self.data['queries'][_id]) != len(self.data['reference']):
                _ = self.data['queries'].pop(_id)
        print('Number of queries after filtering: %d'%len(self.data['queries']))

        self.data['num_queries'] = len(self.data['queries'])
        self.data['queriesID'] = list(self.data['queries'].keys())
        self.pv_names = pv_names
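
The two filtering steps (dropping the PVs that are constant in the reference, then dropping the queries that lack any of the remaining PVs) can be followed on a toy example. This is a sketch outside the class, with hypothetical batch IDs and PV names:

```python
# Toy reference and queries (hypothetical IDs and PV names)
reference = [{"name": "pv_a", "values": [0.0, 2.0, 4.0]},
             {"name": "pv_b", "values": [7.0, 7.0, 7.0]}]   # constant in the reference
queries = {
    "q1": [{"name": "pv_a", "values": [1.0, 3.0]},
           {"name": "pv_b", "values": [6.0, 8.0]}],
    "q2": [{"name": "pv_b", "values": [5.0, 9.0]}],          # pv_a is missing
}

# PVs whose range in the reference batch is (numerically) zero
const_feats = [pv["name"] for pv in reference
               if abs(max(pv["values"]) - min(pv["values"])) < 1e-6]

# Drop the constant PVs from the reference
reference = [pv for pv in reference if pv["name"] not in const_feats]
pv_names = [pv["name"] for pv in reference]

filtered = {}
for _id, batch in queries.items():
    kept = [pv for pv in batch if pv["name"] in pv_names]
    # Keep the query only if it contains every PV left in the reference
    if len(kept) == len(reference):
        filtered[_id] = kept

print(const_feats, list(filtered))
```

`pv_b` is removed everywhere even though it varies in `q1`, and `q2` is discarded because it has no `pv_a`.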

---
`scale_pv(pv_name, pv_values, mode="single")` takes as parameters the `pv_name`, used to look up the right scaling parameters in the structure where they are stored, the `pv_values` to scale in the form of a list, and the `mode` of scaling:

- `single` scales each PV independently of the others, computing the min and max of the PV in the current batch and scaling to the [0, 1] interval. This is suited to offline scenarios.
- `group` scales the PVs with respect to the values of the same PV in the reference batch. This way, the online and offline scenarios are treated equally from this point of view.

The scaling formula adopted is:
$$X_{scaled} = \frac{X_{original} - \min}{\max - \min}$$
where min and max are those of the PV in the current batch for `single` scaling, and those of the same PV in the reference batch for `group` scaling. With `single` scaling, a PV that is constant in a query is mapped by default to the constant value 0.5.

In [None]:
def scale_pv(self, pv_name, pv_values, mode="single"):
        """
        Scales features in two possible ways:
            'single': the feature is scaled according to the values it assumes in the current batch
            'group': the feature is scaled according to the range it assumes in the reference batch
        """
        if mode == "single":
            pv_min = min(pv_values)
            pv_max = max(pv_values)
            if abs(pv_max-pv_min) > 1e-6:
                scaled_pv_values = (np.array(pv_values)-pv_min)/(pv_max-pv_min)
            else:
                scaled_pv_values = .5 * np.ones(len(pv_values))
        elif mode == "group":
            pv_min, pv_max = self.scale_params[pv_name]
            scaled_pv_values = (np.array(pv_values)-pv_min)/(pv_max-pv_min)
        else:
            raise ValueError("mode must be 'single' or 'group', got %r" % mode)
        return scaled_pv_values
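
The difference between the two modes is easiest to see on numbers. The sketch below uses a hypothetical stand-alone helper, `min_max_scale`, instead of the class method, and an assumed reference range of [0, 8]. As a simplification, the helper applies the 0.5 fallback whenever the range is close to zero, whereas `scale_pv` applies it only in `single` mode.

```python
import numpy as np

def min_max_scale(values, v_min, v_max):
    """Hypothetical helper: (x - min) / (max - min), or 0.5 everywhere if the range is ~0."""
    values = np.asarray(values, dtype=float)
    if abs(v_max - v_min) < 1e-6:
        return np.full(len(values), 0.5)
    return (values - v_min) / (v_max - v_min)

query_values = [2.0, 4.0, 6.0]
ref_min, ref_max = 0.0, 8.0   # assumed range of the same PV in the reference batch

# 'single': min and max come from the query itself
single = min_max_scale(query_values, min(query_values), max(query_values))

# 'group': min and max come from the reference batch
group = min_max_scale(query_values, ref_min, ref_max)

print(single, group)
```

With `single` scaling the query always spans the whole [0, 1] interval, while with `group` scaling it keeps its position relative to the reference range. For the same reason, `group`-scaled values fall outside [0, 1] whenever a query exceeds the range of the reference batch.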