
[FEA] DTW computation for time series with different lengths #31

Merged: 23 commits into johannfaouzi:master on Sep 22, 2019

Conversation

@hichamjanati (Contributor) commented on Sep 8, 2019:

One of the features of the DTW metric is that it can compare time series of different lengths; this PR implements that.

All region methods are updated following the same logic by rescaling to the relative diagonal (n_timestamps_2 / n_timestamps_1).
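For reference, here is a minimal sketch of how the resulting constraint region is laid out through the public API, using the parameter names quoted later in this thread (the call is illustrative, not part of the diff):

from pyts.metrics import sakoe_chiba_band

# A region is a (2, n_timestamps_1) integer array: for each index i of the
# first series, region[0, i] is the first valid index of the second series
# and region[1, i] is one past the last valid index.
region = sakoe_chiba_band(n_timestamps_1=10, n_timestamps_2=5, window_size=2)
print(region.shape)  # (2, 10)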

Here is the updated output of the plot_dtw example with x = x[::2]

[Screenshot: updated plot_dtw output with x = x[::2]]

@codecov (bot) commented on Sep 8, 2019:

Codecov Report

Merging #31 into master will increase coverage by 0.07%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #31      +/-   ##
==========================================
+ Coverage   97.27%   97.34%   +0.07%     
==========================================
  Files          79       79              
  Lines        3265     3318      +53     
==========================================
+ Hits         3176     3230      +54     
+ Misses         89       88       -1
Impacted Files                            Coverage Δ
pyts/classification/tests/test_knn.py    100% <ø> (ø) ⬆️
pyts/metrics/tests/test_dtw.py           100% <100%> (ø) ⬆️
pyts/metrics/dtw.py                      98.91% <100%> (+0.43%) ⬆️
pyts/classification/knn.py               100% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@johannfaouzi (Owner) left a comment:

Thank you very much for your PR! I don't have much to add, you did an awesome job.

Regarding the computation of the Sakoe-Chiba band and the Itakura parallelogram, I would probably have done some things differently.

Sakoe-Chiba band: I would have added the absolute difference between the two lengths to the window_size, with something like this:

from math import ceil

import numpy as np


def sakoe_chiba_band_proposed(n_timestamps_1, n_timestamps_2, window_size=1):
    if isinstance(window_size, (int, np.integer)):
        window_size_ = window_size
    elif isinstance(window_size, (float, np.floating)):
        # Fraction of the length of the first series (the original snippet
        # referenced an undefined `n_timestamps`; `n_timestamps_1` is assumed)
        window_size_ = ceil(window_size * (n_timestamps_1 - 1))

    # Widen the band by the difference between the two lengths
    lower_shift = max(0, n_timestamps_1 - n_timestamps_2) + window_size_
    upper_shift = max(0, n_timestamps_2 - n_timestamps_1) + window_size_ + 1

    # region[0, i] / region[1, i]: first / one-past-last valid index for column i
    region = np.array([np.arange(n_timestamps_1) for _ in range(2)])
    region += np.array([-lower_shift, upper_shift]).reshape(2, 1)
    region[0] = np.clip(region[0], 0, n_timestamps_1)
    region[1] = np.clip(region[1], 0, n_timestamps_2)
    return region

Itakura parallelogram: I would have rescaled the slope at the beginning of the function, without adding the points at the borders (technically they are not in the parallelogram), something like this:

import numpy as np


def itakura_parallelogram_proposed(sz1, sz2, max_slope):
    # Rescale the slope so that admissible values do not depend on the sizes
    max_slope = float(max_slope) * max(sz2, sz1) / float(min(sz2, sz1))
    min_slope = 1 / max_slope
    max_slope *= (sz2 / sz1)
    min_slope *= (sz2 / sz1)

    # Lower side of the parallelogram: maximum of the two lower edges
    lower_bound = np.empty((2, sz1))
    lower_bound[0] = min_slope * np.arange(sz1)
    lower_bound[1] = ((sz2 - 1) - max_slope * (sz1 - 1)
                      + max_slope * np.arange(sz1))
    lower_bound = np.round(lower_bound, 2)  # guard against float precision before ceil
    lower_bound = np.ceil(np.max(lower_bound, axis=0))

    # Upper side of the parallelogram: minimum of the two upper edges
    upper_bound = np.empty((2, sz1))
    upper_bound[0] = max_slope * np.arange(sz1)
    upper_bound[1] = ((sz2 - 1) - min_slope * (sz1 - 1)
                      + min_slope * np.arange(sz1))
    upper_bound = np.round(upper_bound, 2)  # guard against float precision before floor
    upper_bound = np.floor(np.min(upper_bound, axis=0) + 1)

    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    return region
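For instance (an illustrative call, sizes chosen arbitrarily):

region = itakura_parallelogram_proposed(10, 6, max_slope=2)
print(region.shape)  # (2, 10): per-column [lower, upper) bounds into the second series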

Visualization: Two small functions to see the results

import matplotlib.pyplot as plt
import numpy as np


def plot_region(region, sz1, sz2):
    # Fill the admissible cells given by the (2, sz1) region array
    mask = np.zeros((sz1, sz2))
    for i in range(sz1):
        for j in np.arange(*region[:, i]):
            mask[i, j] = 1.

    plt.imshow(mask, origin='lower', cmap='Greys')

    # Overlay every grid point
    for i in range(sz2):
        for j in range(sz1):
            plt.plot(i, j, 'o', color='green', ms=2)

    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz2, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz1, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)

    plt.xlim((-0.5, sz2 - 0.5))
    plt.ylim((-0.5, sz1 - 0.5))


def plot_itakura(sz1, sz2, max_slope):
    # `itakura_mask` is assumed to return the boolean mask of the region
    # (it is not defined in this snippet)
    mask = itakura_mask(sz1, sz2, max_slope)

    # Same rescaling as in the region function, with the axes swapped for plotting
    max_slope = float(max_slope) * max(sz1, sz2) / float(min(sz1, sz2))
    min_slope = 1 / max_slope
    max_slope *= (sz1 / sz2)
    min_slope *= (sz1 / sz2)

    sz = max(sz1, sz2)

    plt.figure(figsize=(6, 6))
    plt.imshow(mask, origin='lower')

    # The four sides of the parallelogram
    plt.plot(np.arange(-1, sz + 1), max_slope * np.arange(-1, sz + 1), 'b')
    plt.plot(np.arange(-1, sz + 1), min_slope * np.arange(-1, sz + 1), 'r')
    plt.plot(np.arange(-1, sz + 1),
             ((sz1 - 1) - max_slope * (sz2 - 1)) + max_slope * np.arange(-1, sz + 1), 'g')
    plt.plot(np.arange(-1, sz + 1),
             ((sz1 - 1) - min_slope * (sz2 - 1)) + min_slope * np.arange(-1, sz + 1), 'y')

    for i in range(sz2):
        for j in range(sz1):
            plt.plot(i, j, 'o', color='green', ms=2)

    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, max(sz1, sz2), 1), minor=True)
    ax.set_yticks(np.arange(-.5, max(sz1, sz2), 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)

    plt.xlim((-0.5, sz2 - 0.5))
    plt.ylim((-0.5, sz1 - 0.5))

    plt.show()

We can ask some experts (I'm thinking of Eamonn Keogh) if we can't agree on the definitions ;)

Quoted from the diff:

    The size of the first time series.

n_timestamps_2 : int (optional, default None)
    The size of the second time series. if None, set to `n_timestamps_1`.
@johannfaouzi (Owner):

Nit: capitalize 'if'.

Quoted from the diff:

                      + max_slope * np.arange(n_timestamps))
    lower_bound = np.round(lower_bound, 2)
    lower_bound = np.ceil(np.max(lower_bound, axis=0))
    # upper_bound[0] = max_slope * x
@johannfaouzi (Owner):

Likewise.

@hichamjanati (Contributor, Author):

I added them to make the code clearer for future collaborators (who could be us :) ), because it took me a while to figure out what lower_bound and upper_bound refer to.

@johannfaouzi (Owner):

Indeed, it's a good idea (there aren't many comments in my code...). Maybe adding a short sentence could be helpful, something like:
'lower_bounds' and 'upper_bounds' are the equations defining the sides of the parallelogram.

@johannfaouzi (Owner):

You can also add the description of this new feature in doc/changelog.rst for version 0.10.0 with your name.

@hichamjanati (Contributor, Author):

So you're half right about Sakoe-Chiba: in the original paper the region is defined as |i - j| <= window_size, which requires window_size to be larger than the difference between the lengths. But that window is symmetric with respect to the diagonal i = j. My proposal was to make it symmetric with respect to the relative diagonal. I think it is more natural for a window_size = 0.1 to have:

[Screenshot: band symmetric around the relative diagonal]

Than to have:

[Screenshot: band from the original definition, around i = j]

which forces window_size = |sz1 - sz2| and does not allow j > i.
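To make the comparison concrete, here is a rough sketch of the two conditions (the helper names are mine, not pyts code):

def in_band_original(i, j, window_size):
    # Sakoe & Chiba (1978): band around the identity diagonal j = i
    return abs(i - j) <= window_size


def in_band_relative(i, j, n1, n2, window_size):
    # Band centered on the relative diagonal j = i * (n2 - 1) / (n1 - 1)
    scale = (n2 - 1) / (n1 - 1)
    return abs(j - scale * i) <= window_size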

If you are aware of other references about this, I'd be happy to have a look.

About Itakura: rescaling min_slope at the beginning will actually lead to a much lower min_slope (and hence not a symmetric parallelogram with respect to the relative diagonal; I am still working on that at the moment).

@johannfaouzi (Owner) commented on Sep 10, 2019:

I agree with your new proposal for the Sakoe-Chiba band.

For the Itakura parallelogram, it is a parallelogram and not a rhombus: there is no need for symmetry with respect to the relative diagonal. What is needed is two pairs of parallel sides. For instance, take a rectangle with a width much smaller than its height: you will see that it is impossible for the diagonal to be symmetric with respect to both sides.

@hichamjanati (Contributor, Author):

Yes, you are right, symmetry is not required. But taking min_scale = 1 / max_scale afterwards only works if n2 > n1. If, e.g., n2 = 0.5 * n1 and you take scale = 2, then max_scale = 1 and min_scale = 1 as well. min_scale must be lower than n2 / n1 to include the end point, which is why I make sure that max_scale >= n2 / n1 and min_scale <= n2 / n1.

@johannfaouzi (Owner):

Those scales are tricky... I don't get the same results when I swap n_timestamps_1 and n_timestamps_2 with your latest code:

n_timestamps_1 = 8
n_timestamps_2 = 4
max_slope = 1

[Screenshots: Itakura regions before and after swapping the two lengths]

If I'm not wrong, you removed the rounding on lower_bound and upper_bound. I added it to prevent float-precision issues before using ceil and floor: when the true value is 1.0 but the computed value is 0.9999 or 1.00001, you don't get the expected results (it occurred when I was testing the function the first time).
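A tiny illustration of the kind of precision issue I mean (the numbers are made up for the example):

import numpy as np

x = (0.1 + 0.2) * 10            # mathematically 3, but evaluates to 3.0000000000000004
print(np.ceil(x))               # 4.0 -> one cell too many
print(np.ceil(np.round(x, 2)))  # 3.0 -> the intended bound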

@johannfaouzi (Owner):

Regarding the computation of the Itakura parallelogram, I will still advocate for my proposal for the moment:

import numpy as np


def itakura_parallelogram_proposed(sz1, sz2, max_slope):
    # Rescale the slopes so that max_slope >= 1 always yields a feasible region
    max_slope *= max(sz2, sz1) / min(sz2, sz1)
    min_slope = 1 / max_slope
    max_slope *= (sz2 / sz1)
    min_slope *= (sz2 / sz1)

    # Lower side: maximum of the two lower edges of the parallelogram
    lower_bound = np.empty((2, sz1))
    lower_bound[0] = min_slope * np.arange(sz1)
    lower_bound[1] = ((sz2 - 1) - max_slope * (sz1 - 1)
                      + max_slope * np.arange(sz1))
    lower_bound = np.round(lower_bound, 2)  # guard against float precision
    lower_bound = np.ceil(np.max(lower_bound, axis=0))

    # Upper side: minimum of the two upper edges of the parallelogram
    upper_bound = np.empty((2, sz1))
    upper_bound[0] = max_slope * np.arange(sz1)
    upper_bound[1] = ((sz2 - 1) - min_slope * (sz1 - 1)
                      + min_slope * np.arange(sz1))
    upper_bound = np.round(upper_bound, 2)  # guard against float precision
    upper_bound = np.floor(np.min(upper_bound, axis=0) + 1)

    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    return region

It has several upsides:

  • The possible values for max_slope do not depend on the sizes: max_slope >= 1
  • It always provides a feasible region for max_slope >= 1
  • It has a point reflection
  • Swapping n_timestamps_1 and n_timestamps_2 leads to a region with the same shape

@hichamjanati (Contributor, Author) commented on Sep 16, 2019:

def itakura(sz1, sz2, max_slope):  # signature assumed; only the slope rescaling is shown
    min_slope = 1 / max_slope

    # scale_max / scale_min bound the slopes so that the parallelogram is at
    # least one cell thick everywhere (see the explanation below)
    scale_max = (sz2 - 1) / (sz1 - 2)
    max_slope *= scale_max
    max_slope = max(1., max_slope)

    scale_min = (sz2 - 2) / (sz1 - 1)
    min_slope *= scale_min
    min_slope = min(1., min_slope)
    ...

As you noticed, for the region to be feasible, max_slope must be >= 1 and min_slope <= 1.
With slope = 1, we want the narrowest path possible.
To avoid broken paths, both the width and the height of the parallelogram must be >= 1, so that it includes at least one square everywhere. This constraint leads to the scale_max and scale_min bounds on the slopes.
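As a sanity check, every column of a region should then contain at least one admissible cell (the helper below is just for this comment, not part of the PR):

import numpy as np


def has_no_empty_column(region):
    # region has shape (2, n_timestamps_1); row 0 is the first valid index,
    # row 1 is one past the last valid index for each column
    lower, upper = region
    return bool(np.all(upper - lower >= 1))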

[Screenshots: Itakura parallelograms with the adjusted slope bounds]

This agrees with the previous implementation when n1 = n2 and does not overestimate the size of the parallelogram by multiplying the slopes by n2 / n1 twice.

I'll add an example for region plots and update the doc :)

@johannfaouzi (Owner):

Great! It is indeed better to have a single admissible path (the 'diagonal') when max_slope and window_size are set to the minimum values that they can have.

Maybe we should use the Cartesian equations for the Sakoe-Chiba band too; it seems to be broken for small values of window_size:

[Screenshot: broken Sakoe-Chiba band for a small window_size]

@johannfaouzi (Owner):

Maybe something like this:

import numpy as np
import matplotlib.pyplot as plt


def sakoe(sz1, sz2, window_size):
    # the original snippet used the global names n_timestamps_1/2; sz1/sz2 assumed here
    scale = (sz2 - 1) / (sz1 - 1)
    window_size = max(window_size, 1 / (2 * scale), 2 * scale)

    lower_bound = scale * np.arange(sz1) - window_size
    lower_bound = np.ceil(np.round(lower_bound, 1))
    upper_bound = scale * np.arange(sz1) + window_size
    upper_bound = np.floor(np.round(upper_bound, 1)) + 1
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    # _check_region: pyts helper that clips the region to valid indices
    region = _check_region(region, sz1, sz2)
    return region


def plot_sakoe(sz1, sz2, window_size):

    region = sakoe(sz1, sz2, window_size)
    # Recompute the effective window size used inside `sakoe`
    scale = (sz2 - 1) / (sz1 - 1)
    window_size = max(window_size, 1 / (2 * scale), 2 * scale)

    # Fill the admissible cells
    mask = np.zeros((sz2, sz1))
    for i, (j, k) in enumerate(region.T):
        mask[j:k, i] = 1.

    plt.figure(figsize=(sz1 / 2, sz2 / 2))
    plt.imshow(mask, origin='lower', cmap='Wistia')

    sz = max(sz1, sz2)
    x = np.arange(-1, sz + 1)

    # Band boundaries and the relative diagonal
    plt.plot(x, scale * x - window_size, 'b')
    plt.plot(x, scale * x + window_size, 'g')
    plt.plot(x, (sz2 - 1) / (sz1 - 1) * x, 'black')

    for i in range(sz1):
        for j in range(sz2):
            plt.plot(i, j, 'o', color='green', ms=2)

    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, max(sz1, sz2), 1), minor=True)
    ax.set_yticks(np.arange(-.5, max(sz1, sz2), 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)

    plt.xlim((-0.5, sz1 - 0.5))
    plt.ylim((-0.5, sz2 - 0.5))

    plt.show()

[Screenshots: resulting Sakoe-Chiba bands for several size and window combinations]

@johannfaouzi (Owner):

The band is way too large as it is; it is a bit better with this code:

import numpy as np
import matplotlib.pyplot as plt


def sakoe(sz1, sz2, window_size):
    # the original snippet used the global names n_timestamps_1/2; sz1/sz2 assumed here
    scale = (sz2 - 1) / (sz1 - 1)

    if sz2 > sz1:
        window_size = max(window_size, scale / 2)
    else:
        window_size = max(window_size, scale * 2)

    lower_bound = scale * np.arange(sz1) - window_size
    lower_bound = np.ceil(np.round(lower_bound, 1))
    upper_bound = scale * np.arange(sz1) + window_size
    upper_bound = np.floor(np.round(upper_bound, 1)) + 1
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    # _check_region: pyts helper that clips the region to valid indices
    region = _check_region(region, sz1, sz2)
    return region


def plot_sakoe(sz1, sz2, window_size):

    region = sakoe(sz1, sz2, window_size)
    # Recompute the effective window size used inside `sakoe`
    scale = (sz2 - 1) / (sz1 - 1)
    if sz2 > sz1:
        window_size = max(window_size, scale / 2)
    else:
        window_size = max(window_size, scale * 2)

    # Fill the admissible cells
    mask = np.zeros((sz2, sz1))
    for i, (j, k) in enumerate(region.T):
        mask[j:k, i] = 1.

    plt.figure(figsize=(sz1 / 2, sz2 / 2))
    plt.imshow(mask, origin='lower', cmap='Wistia')

    sz = max(sz1, sz2)
    x = np.arange(-1, sz + 1)

    # Band boundaries and the relative diagonal
    plt.plot(x, scale * x - window_size, 'b')
    plt.plot(x, scale * x + window_size, 'g')
    plt.plot(x, (sz2 - 1) / (sz1 - 1) * x, 'black')

    for i in range(sz1):
        for j in range(sz2):
            plt.plot(i, j, 'o', color='green', ms=2)

    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, max(sz1, sz2), 1), minor=True)
    ax.set_yticks(np.arange(-.5, max(sz1, sz2), 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)

    plt.xlim((-0.5, sz1 - 0.5))
    plt.ylim((-0.5, sz2 - 0.5))

    plt.show()

[Screenshot: Sakoe-Chiba band with the updated code]

The only issue is that, when both the upper and lower bounds cross an admissible point, there are two points in the feasible region (see the third column for instance).

@hichamjanati (Contributor, Author):

Nice adaptation!
To fix the last point when n1 > n2, I made this small change:

    if sz2 > sz1:
        window_size = max(window_size, scale / 2)
    elif sz1 > sz2:
        window_size = max(window_size, 0.5)
    # the case sz1 == sz2 is left untouched

The lowest allowed window_size is not symmetric: when the matrix is almost flat, the margin between the bounds should be 1, hence a 0.5 window. Plus, we should leave the case sz1 = sz2 untouched.

@johannfaouzi (Owner) left a comment:
Thanks for the great PR! I left a few comments (mostly nitpicking comments, and one regarding the API).

Quoted from pyts/metrics/dtw.py:
Each cell whose distance with the diagonale is lower than or equal to
'window_size' becomes a valid cell for the path.

relative_window_size : bool (default True)
@johannfaouzi (Owner):

scikit-learn uses a single parameter that accepts both integers and floats (see for instance min_samples_split and min_samples_leaf in sklearn.ensemble.RandomForestClassifier).

I prefer having a single parameter, and I think that it would be more consistent with scikit-learn's API. But it's open to discussion :)

@hichamjanati (Contributor, Author):

I agree that having two parameters is not ideal either, but having different outcomes for 1 and 1. gave me some headaches as well. I'll revert to the old API, but we can make the distinction clearer when raising errors.

@johannfaouzi (Owner):

Indeed, the error messages may not be the best. However, the description in the docstring is pretty explicit imho:

pyts/metrics/dtw.py, lines 387 to 392 at 5b4310f:

window_size : int or float (default = 0.1)
The window above and below the diagonale. If float, it must be between
0 and 1, and the actual window size will be computed as
``ceil(window_size * (n_timestamps - 1))``. Each cell whose distance
with the diagonale is lower than or equal to 'window_size' becomes a
valid cell for the path.
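In other words, something along these lines (a sketch of the convention, not the actual pyts code; the helper name is mine):

from math import ceil

import numpy as np


def _resolve_window_size(window_size, n_timestamps):
    # integers are an absolute window; floats are a fraction of the length
    if isinstance(window_size, (int, np.integer)):
        return window_size
    return ceil(window_size * (n_timestamps - 1))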

@hichamjanati (Contributor, Author):

Yes, I made some small changes to the docstring and the error messages. I can live with that :)

@hichamjanati (Contributor, Author):

> Thanks for the great PR! I left a few comments (mostly nitpicking comments, and one regarding the API).

Thanks, I think we're good to go!

@johannfaouzi (Owner) left a comment:

Final review (some little details) :)

@johannfaouzi merged commit 706f302 into johannfaouzi:master on Sep 22, 2019.
@johannfaouzi (Owner):

Thanks once again for your great PR!

@hichamjanati deleted the dtw-diffsize branch on September 22, 2019 07:18.
@hichamjanati changed the title from "[WIP] DTW computation for time series with different lengths" to "[FEA] DTW computation for time series with different lengths" on Sep 22, 2019.