[FEA] DTW computation for time series with different lengths #31
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master      #31      +/-   ##
==========================================
+ Coverage   97.27%   97.34%   +0.07%
==========================================
  Files          79       79
  Lines        3265     3318      +53
==========================================
+ Hits         3176     3230      +54
+ Misses         89       88       -1
```
Thank you very much for your PR! I don't have much to add, you did an awesome job.
Regarding the computation of the Sakoe-Chiba band and the Itakura parallelogram, I would probably have done some things differently.
Sakoe-Chiba band: I would have added the absolute difference between both lengths to the window_size, with something like this:
```python
from math import ceil

import numpy as np

def sakoe_chiba_band_proposed(n_timestamps_1, n_timestamps_2, window_size=1):
    if isinstance(window_size, (int, np.integer)):
        window_size_ = window_size
    elif isinstance(window_size, (float, np.floating)):
        # The original snippet referenced an undefined ``n_timestamps``;
        # the first series' length is assumed here.
        window_size_ = ceil(window_size * (n_timestamps_1 - 1))
    lower_shift = max(0, n_timestamps_1 - n_timestamps_2) + window_size_
    upper_shift = max(0, n_timestamps_2 - n_timestamps_1) + window_size_ + 1
    region = np.array([np.arange(n_timestamps_1) for _ in range(2)])
    region += np.array([-lower_shift, upper_shift]).reshape(2, 1)
    # Both rows bound column indices, i.e. positions in the second series.
    region[0] = np.clip(region[0], 0, n_timestamps_2)
    region[1] = np.clip(region[1], 0, n_timestamps_2)
    return region
```
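As a quick sanity check, the band above can be exercised standalone. This is a minimal sketch that keeps only the integer window_size path and assumes both rows of `region` bound indices into the second series:

```python
import numpy as np

def sakoe_chiba_band_sketch(n_timestamps_1, n_timestamps_2, window_size=1):
    # Integer window only; the band is widened by the length difference.
    lower_shift = max(0, n_timestamps_1 - n_timestamps_2) + window_size
    upper_shift = max(0, n_timestamps_2 - n_timestamps_1) + window_size + 1
    region = np.array([np.arange(n_timestamps_1) for _ in range(2)])
    region += np.array([-lower_shift, upper_shift]).reshape(2, 1)
    # Both rows bound column indices, i.e. positions in the second series.
    region = np.clip(region, 0, n_timestamps_2)
    return region

region = sakoe_chiba_band_sketch(6, 4, window_size=1)
print(region)
# → [[0 0 0 0 1 2]
#    [2 3 4 4 4 4]]
```

The last column still contains the end point (2 <= 3 < 4), so a warping path between the two series remains feasible.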
Itakura parallelogram: I would have rescaled the slope at the beginning of the function, without adding the points at the borders (technically they are not in the parallelogram), something like this:
```python
import numpy as np

def itakura_parallelogram_proposed(sz1, sz2, max_slope):
    # Rescale the slopes once, at the beginning of the function.
    max_slope = float(max_slope) * max(sz2, sz1) / float(min(sz2, sz1))
    min_slope = 1 / max_slope
    max_slope *= (sz2 / sz1)
    min_slope *= (sz2 / sz1)

    # The lower bound is the maximum of the two lower sides of the
    # parallelogram; the upper bound is the minimum of the two upper sides.
    lower_bound = np.empty((2, sz1))
    lower_bound[0] = min_slope * np.arange(sz1)
    lower_bound[1] = ((sz2 - 1) - max_slope * (sz1 - 1)
                      + max_slope * np.arange(sz1))
    lower_bound = np.round(lower_bound, 2)
    lower_bound = np.ceil(np.max(lower_bound, axis=0))

    upper_bound = np.empty((2, sz1))
    upper_bound[0] = max_slope * np.arange(sz1)
    upper_bound[1] = ((sz2 - 1) - min_slope * (sz1 - 1)
                      + min_slope * np.arange(sz1))
    upper_bound = np.round(upper_bound, 2)
    upper_bound = np.floor(np.min(upper_bound, axis=0) + 1)

    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    return region
```
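To see what the bounds look like, the four side equations from the function above can be evaluated on a small hand-checked example (sz1 = sz2 = 4, max_slope = 2):

```python
import numpy as np

sz1, sz2, max_slope = 4, 4, 2.0

# Rescaled slopes, as in the function above.
max_slope = max_slope * max(sz2, sz1) / min(sz2, sz1)  # 2.0
min_slope = 1 / max_slope                              # 0.5
max_slope *= sz2 / sz1
min_slope *= sz2 / sz1

x = np.arange(sz1)
# Maximum of the two lower sides, minimum of the two upper sides.
lower = np.ceil(np.round(np.maximum(
    min_slope * x, (sz2 - 1) - max_slope * (sz1 - 1) + max_slope * x), 2))
upper = np.floor(np.round(np.minimum(
    max_slope * x, (sz2 - 1) - min_slope * (sz1 - 1) + min_slope * x), 2) + 1)

print(lower.astype('int64'))  # → [0 1 1 3]
print(upper.astype('int64'))  # → [1 3 3 4]
```

Both end points (0, 0) and (3, 3) lie in the region, and the bounds are symmetric about the diagonal, as expected for equal lengths.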
Visualization: two small functions to see the results (`itakura_mask` is assumed to be available; it computes the boolean mask of the parallelogram):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_region(region, sz1, sz2):
    mask = np.zeros((sz1, sz2))
    for i in range(sz1):
        for j in np.arange(*region[:, i]):
            mask[i, j] = 1.
    plt.imshow(mask, origin='lower', cmap='Greys')
    for i in range(sz2):
        for j in range(sz1):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz2, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz1, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz2 - 0.5))
    plt.ylim((-0.5, sz1 - 0.5))

def plot_itakura(sz1, sz2, max_slope):
    mask = itakura_mask(sz1, sz2, max_slope)
    max_slope = float(max_slope) * max(sz1, sz2) / float(min(sz1, sz2))
    min_slope = 1 / max_slope
    max_slope *= (sz1 / sz2)
    min_slope *= (sz1 / sz2)
    sz = max(sz1, sz2)
    plt.figure(figsize=(6, 6))
    plt.imshow(mask, origin='lower')
    x = np.arange(-1, sz + 1)
    # The four sides of the parallelogram.
    plt.plot(x, max_slope * x, 'b')
    plt.plot(x, min_slope * x, 'r')
    plt.plot(x, ((sz1 - 1) - max_slope * (sz2 - 1)) + max_slope * x, 'g')
    plt.plot(x, ((sz1 - 1) - min_slope * (sz2 - 1)) + min_slope * x, 'y')
    for i in range(sz2):
        for j in range(sz1):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz2 - 0.5))
    plt.ylim((-0.5, sz1 - 0.5))
    plt.show()
```
If we cannot agree on the definitions, we can ask some experts (I am thinking of Eamonn Keogh) ;)
pyts/metrics/dtw.py (outdated)

```
    The size of the first time series.

n_timestamps_2 : int (optional, default None)
    The size of the second time series. if None, set to `n_timestamps_1`.
```
Nit: 'if' should be capitalized.
```
                  + max_slope * np.arange(n_timestamps))
lower_bound = np.round(lower_bound, 2)
lower_bound = np.ceil(np.max(lower_bound, axis=0))
# upper_bound[0] = max_slope * x
```
Likewise
I added them to make the code clearer for future collaborators (who could be us :) ), because it took me a while to figure out what `lower_bound` and `upper_bound` refer to.
Indeed, it's a good idea (there aren't many comments in my code...). Maybe adding a short sentence could be helpful, something like:
'lower_bounds' and 'upper_bounds' are the equations defining the sides of the parallelogram
You can also add the description of this new feature in
So you're half right about Sakoe-Chiba: in the original paper the region is defined as |i - j| <= window_size, which requires window_size to be larger than the difference between the lengths. But that window_size is symmetric with respect to i = j. My proposal was to make it symmetric with respect to the relative diagonal. For a window_size = 0.1, I think it is more natural to have a band around the relative diagonal than a band around i = j (figures omitted), which forces window_size = |sz1 - sz2| and does not allow j > i. If you are aware of other references about this, I'd be happy to have a look. About Itakura: rescaling the min_slope at the beginning will actually lead to a much lower min_slope (and hence a parallelogram that is not symmetric with respect to the relative diagonal; I am still working on that at the moment).
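To make the argument concrete, here is a hypothetical comparison (lengths 8 and 4, window_size = 1) showing that the classic |i - j| <= window_size band misses the end point while a band of the same width around the relative diagonal contains it:

```python
import numpy as np

sz1, sz2, window_size = 8, 4, 1
i, j = np.meshgrid(np.arange(sz1), np.arange(sz2), indexing='ij')

# Classic band, symmetric with respect to the diagonal i = j.
classic = np.abs(i - j) <= window_size

# Band around the relative diagonal j = i * (sz2 - 1) / (sz1 - 1).
scale = (sz2 - 1) / (sz1 - 1)
relative = np.abs(j - scale * i) <= window_size

print(classic[sz1 - 1, sz2 - 1])   # → False (the end point is unreachable)
print(relative[sz1 - 1, sz2 - 1])  # → True
```

With the classic definition, window_size would have to grow to at least |sz1 - sz2| = 4 before any path could reach (7, 3) at all.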
I agree with your new proposal for the Sakoe-Chiba band. For the Itakura parallelogram, it is a parallelogram and not a rhombus: there is no need for symmetry with respect to the relative diagonal. What is needed is two pairs of parallel sides. For instance, take a rectangle whose width is much lower than its height: you will see that it is impossible to have a diagonal that is symmetric with regard to both sides.
Yes, you are right, symmetry is not required. But taking min_slope = 1 / max_slope after rescaling only works if n2 > n1. If, for example, n2 = 0.5 * n1 and you take max_slope = 2, then max_slope becomes 1 and min_slope becomes 1 as well. min_slope must be lower than n2 / n1 to include the end point, which is why I make sure that max_slope >= n2 / n1 and min_slope <= n2 / n1.
Those scales are tricky... I don't get the same results when I swap the lengths:

```python
n_timestamps_1 = 8
n_timestamps_2 = 4
max_slope = 1
```

If I'm not wrong, you removed the rounding on
Regarding the computation of the Itakura parallelogram, I will still advocate for my proposal for the moment:

```python
def itakura_parallelogram_proposed(sz1, sz2, max_slope):
    max_slope *= max(sz2, sz1) / min(sz2, sz1)
    min_slope = 1 / max_slope
    max_slope *= (sz2 / sz1)
    min_slope *= (sz2 / sz1)
    lower_bound = np.empty((2, sz1))
    lower_bound[0] = min_slope * np.arange(sz1)
    lower_bound[1] = ((sz2 - 1) - max_slope * (sz1 - 1)
                      + max_slope * np.arange(sz1))
    lower_bound = np.round(lower_bound, 2)
    lower_bound = np.ceil(np.max(lower_bound, axis=0))
    upper_bound = np.empty((2, sz1))
    upper_bound[0] = max_slope * np.arange(sz1)
    upper_bound[1] = ((sz2 - 1) - min_slope * (sz1 - 1)
                      + min_slope * np.arange(sz1))
    upper_bound = np.round(upper_bound, 2)
    upper_bound = np.floor(np.min(upper_bound, axis=0) + 1)
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    return region
```

It has several upsides:
```python
def itakura ...
    min_slope = 1 / max_slope
    scale_max = (sz2 - 1) / (sz1 - 2)
    max_slope *= scale_max
    max_slope = max(1., max_slope)
    scale_min = (sz2 - 2) / (sz1 - 1)
    min_slope *= scale_min
    min_slope = min(1., min_slope)
```

As you noticed, for a path to be feasible, max_slope must be >= 1 and min_slope <= 1. This agrees with the previous implementation when n1 = n2 and does not overestimate the size of the parallelogram by multiplying the slopes with n2 / n1 twice. I'll add an example for region plots and update the doc :)
Maybe something like this (`_check_region` is the clipping helper from pyts/metrics/dtw.py):

```python
import matplotlib.pyplot as plt
import numpy as np

def sakoe(n_timestamps_1, n_timestamps_2, window_size):
    scale = (n_timestamps_2 - 1) / (n_timestamps_1 - 1)
    window_size = max(window_size, 1 / (2 * scale), 2 * scale)
    lower_bound = scale * np.arange(n_timestamps_1) - window_size
    lower_bound = np.ceil(np.round(lower_bound, 1))
    upper_bound = scale * np.arange(n_timestamps_1) + window_size
    upper_bound = np.floor(np.round(upper_bound, 1)) + 1
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    region = _check_region(region, n_timestamps_1, n_timestamps_2)
    return region

def plot_sakoe(sz1, sz2, window_size):
    region = sakoe(sz1, sz2, window_size)
    scale = (sz2 - 1) / (sz1 - 1)
    window_size = max(window_size, 1 / (2 * scale), 2 * scale)
    mask = np.zeros((sz2, sz1))
    for i, (j, k) in enumerate(region.T):
        mask[j:k, i] = 1.
    plt.figure(figsize=(sz1 / 2, sz2 / 2))
    plt.imshow(mask, origin='lower', cmap='Wistia')
    sz = max(sz1, sz2)
    x = np.arange(-1, sz + 1)
    plt.plot(x, scale * x - window_size, 'b')
    plt.plot(x, scale * x + window_size, 'g')
    plt.plot(x, scale * x, 'black')
    for i in range(sz1):
        for j in range(sz2):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz1 - 0.5))
    plt.ylim((-0.5, sz2 - 0.5))
    plt.show()
```
The band is way too large as it is; it is a bit better with this code:

```python
import matplotlib.pyplot as plt
import numpy as np

def sakoe(n_timestamps_1, n_timestamps_2, window_size):
    scale = (n_timestamps_2 - 1) / (n_timestamps_1 - 1)
    if n_timestamps_2 > n_timestamps_1:
        window_size = max(window_size, scale / 2)
    else:
        window_size = max(window_size, scale * 2)
    lower_bound = scale * np.arange(n_timestamps_1) - window_size
    lower_bound = np.ceil(np.round(lower_bound, 1))
    upper_bound = scale * np.arange(n_timestamps_1) + window_size
    upper_bound = np.floor(np.round(upper_bound, 1)) + 1
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    region = _check_region(region, n_timestamps_1, n_timestamps_2)
    return region

def plot_sakoe(sz1, sz2, window_size):
    region = sakoe(sz1, sz2, window_size)
    scale = (sz2 - 1) / (sz1 - 1)
    if sz2 > sz1:
        window_size = max(window_size, scale / 2)
    else:
        window_size = max(window_size, scale * 2)
    mask = np.zeros((sz2, sz1))
    for i, (j, k) in enumerate(region.T):
        mask[j:k, i] = 1.
    plt.figure(figsize=(sz1 / 2, sz2 / 2))
    plt.imshow(mask, origin='lower', cmap='Wistia')
    sz = max(sz1, sz2)
    x = np.arange(-1, sz + 1)
    plt.plot(x, scale * x - window_size, 'b')
    plt.plot(x, scale * x + window_size, 'g')
    plt.plot(x, scale * x, 'black')
    for i in range(sz1):
        for j in range(sz2):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz1 - 0.5))
    plt.ylim((-0.5, sz2 - 0.5))
    plt.show()
```

The only issue is that, when both the upper and lower bounds cross an admissible point, there are two points in the feasible region (see the third column, for instance).
Nice adaptation!

```python
if sz2 > sz1:
    window_size = max(window_size, scale / 2)
elif sz1 > sz2:
    window_size = max(window_size, 0.5)
```

The lowest allowed window_size is not symmetric. When the matrix is almost flat, the margin between the bounds should be 1, hence a 0.5 window. Plus, we should leave the case sz1 = sz2 untouched.
Thanks for the great PR! I left a few comments (mostly nitpicking comments, and one regarding the API).
pyts/metrics/dtw.py (outdated)

```
    Each cell whose distance with the diagonale is lower than or equal to
    'window_size' becomes a valid cell for the path.

relative_window_size : bool (default True)
```
scikit-learn has one single parameter for integers and floats (see for instance `min_samples_split` and `min_samples_leaf` in `sklearn.ensemble.RandomForestClassifier`). I prefer having a single parameter, and I think that it would be more consistent with scikit-learn's API. But it's open to discussion :)
I agree that having 2 params is not ideal either, but having different outcomes with `1` and `1.` gave me some headaches as well. I'll revert back to the old API, but we can make the distinction clearer when raising errors.
The error messages may not be the best indeed. However, the description in the docstring is pretty explicit imho:

Lines 387 to 392 in 5b4310f:

```
window_size : int or float (default = 0.1)
    The window above and below the diagonale. If float, it must be between
    0 and 1, and the actual window size will be computed as
    ``ceil(window_size * (n_timestamps - 1))``. Each cell whose distance
    with the diagonale is lower than or equal to 'window_size' becomes a
    valid cell for the path.
```
Yes, I made some small changes to the docstring and the error messages. I can live with that :)
Thanks, I think we're good to go!
Final review (some little details) :)
Thanks once again for your great PR!
One of the features of the DTW metric is that it can compare time series of different lengths; this is what this PR implements. All region methods are updated following the same logic, by rescaling to the relative diagonal (n_timestamps_2 / n_timestamps_1). Here is the updated output of the plot_dtw example with x = x[::2] (figure omitted).
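For context, DTW itself handles unequal lengths naturally once a region (or the full matrix) is given: the cumulative cost matrix is simply rectangular. The following is a minimal unconstrained sketch with a squared local cost, not pyts's actual implementation:

```python
import numpy as np

def dtw_naive(x, y):
    """Classic DTW between two 1-D series that may differ in length."""
    n, m = len(x), len(y)
    # cost[i, j] = cheapest cumulative cost of aligning x[:i] with y[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = (x[i - 1] - y[j - 1]) ** 2
            # A step can come from the left, from below, or from the diagonal.
            cost[i, j] = dist + min(cost[i - 1, j],
                                    cost[i, j - 1],
                                    cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])

x = np.array([0., 1., 2., 3.])
print(dtw_naive(x, x))       # → 0.0
print(dtw_naive(x, x[::2]))  # x[::2] == [0., 2.]; → 1.4142... (sqrt(2))
```

Restricting the inner loop to a Sakoe-Chiba band or Itakura parallelogram only changes which (i, j) cells are visited; the recurrence stays the same.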