[FEA] DTW computation for time series with different lengths #31
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master      #31      +/-   ##
==========================================
+ Coverage   97.27%   97.34%   +0.07%
==========================================
  Files          79       79
  Lines        3265     3318      +53
==========================================
+ Hits         3176     3230      +54
+ Misses         89       88       -1
```
Thank you very much for your PR! I don't have much to add, you did an awesome job.
Regarding the computation of the Sakoe-Chiba band and the Itakura parallelogram, I would probably have done some things differently.
Sakoe-Chiba band: I would have added the absolute difference between both lengths to the window_size, with something like this:
```python
from math import ceil

import numpy as np

def sakoe_chiba_band_proposed(n_timestamps_1, n_timestamps_2, window_size=1):
    if isinstance(window_size, (int, np.integer)):
        window_size_ = window_size
    elif isinstance(window_size, (float, np.floating)):
        # The original snippet referenced an undefined ``n_timestamps``;
        # the first series' length is assumed here.
        window_size_ = ceil(window_size * (n_timestamps_1 - 1))
    lower_shift = max(0, n_timestamps_1 - n_timestamps_2) + window_size_
    upper_shift = max(0, n_timestamps_2 - n_timestamps_1) + window_size_ + 1
    region = np.array([np.arange(n_timestamps_1) for _ in range(2)])
    region += np.array([-lower_shift, upper_shift]).reshape(2, 1)
    # Both rows bound column indices, i.e. positions in the second series.
    region[0] = np.clip(region[0], 0, n_timestamps_2)
    region[1] = np.clip(region[1], 0, n_timestamps_2)
    return region
```
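As a quick sanity check, the band above can be exercised standalone. This is a minimal sketch that keeps only the integer window_size path and assumes both rows of `region` bound indices into the second series:

```python
import numpy as np

def sakoe_chiba_band_sketch(n_timestamps_1, n_timestamps_2, window_size=1):
    # Integer window only; the band is widened by the length difference.
    lower_shift = max(0, n_timestamps_1 - n_timestamps_2) + window_size
    upper_shift = max(0, n_timestamps_2 - n_timestamps_1) + window_size + 1
    region = np.array([np.arange(n_timestamps_1) for _ in range(2)])
    region += np.array([-lower_shift, upper_shift]).reshape(2, 1)
    # Both rows bound column indices, i.e. positions in the second series.
    region = np.clip(region, 0, n_timestamps_2)
    return region

region = sakoe_chiba_band_sketch(6, 4, window_size=1)
print(region)
# → [[0 0 0 0 1 2]
#    [2 3 4 4 4 4]]
```

The last column still contains the end point (2 <= 3 < 4), so a warping path between the two series remains feasible.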
Itakura parallelogram: I would have rescaled the slope at the beginning of the function, without adding the points at the borders (technically they are not in the parallelogram), something like this:
```python
import numpy as np

def itakura_parallelogram_proposed(sz1, sz2, max_slope):
    # Rescale the slopes once, at the beginning of the function.
    max_slope = float(max_slope) * max(sz2, sz1) / float(min(sz2, sz1))
    min_slope = 1 / max_slope
    max_slope *= (sz2 / sz1)
    min_slope *= (sz2 / sz1)

    # The lower bound is the maximum of the two lower sides of the
    # parallelogram; the upper bound is the minimum of the two upper sides.
    lower_bound = np.empty((2, sz1))
    lower_bound[0] = min_slope * np.arange(sz1)
    lower_bound[1] = ((sz2 - 1) - max_slope * (sz1 - 1)
                      + max_slope * np.arange(sz1))
    lower_bound = np.round(lower_bound, 2)
    lower_bound = np.ceil(np.max(lower_bound, axis=0))

    upper_bound = np.empty((2, sz1))
    upper_bound[0] = max_slope * np.arange(sz1)
    upper_bound[1] = ((sz2 - 1) - min_slope * (sz1 - 1)
                      + min_slope * np.arange(sz1))
    upper_bound = np.round(upper_bound, 2)
    upper_bound = np.floor(np.min(upper_bound, axis=0) + 1)

    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    return region
```
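To see what the bounds look like, the four side equations from the function above can be evaluated on a small hand-checked example (sz1 = sz2 = 4, max_slope = 2):

```python
import numpy as np

sz1, sz2, max_slope = 4, 4, 2.0

# Rescaled slopes, as in the function above.
max_slope = max_slope * max(sz2, sz1) / min(sz2, sz1)  # 2.0
min_slope = 1 / max_slope                              # 0.5
max_slope *= sz2 / sz1
min_slope *= sz2 / sz1

x = np.arange(sz1)
# Maximum of the two lower sides, minimum of the two upper sides.
lower = np.ceil(np.round(np.maximum(
    min_slope * x, (sz2 - 1) - max_slope * (sz1 - 1) + max_slope * x), 2))
upper = np.floor(np.round(np.minimum(
    max_slope * x, (sz2 - 1) - min_slope * (sz1 - 1) + min_slope * x), 2) + 1)

print(lower.astype('int64'))  # → [0 1 1 3]
print(upper.astype('int64'))  # → [1 3 3 4]
```

Both end points (0, 0) and (3, 3) lie in the region, and the bounds are symmetric about the diagonal, as expected for equal lengths.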
Visualization: two small functions to see the results (`itakura_mask` is assumed to be available; it computes the boolean mask of the parallelogram):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_region(region, sz1, sz2):
    mask = np.zeros((sz1, sz2))
    for i in range(sz1):
        for j in np.arange(*region[:, i]):
            mask[i, j] = 1.
    plt.imshow(mask, origin='lower', cmap='Greys')
    for i in range(sz2):
        for j in range(sz1):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz2, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz1, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz2 - 0.5))
    plt.ylim((-0.5, sz1 - 0.5))

def plot_itakura(sz1, sz2, max_slope):
    mask = itakura_mask(sz1, sz2, max_slope)
    max_slope = float(max_slope) * max(sz1, sz2) / float(min(sz1, sz2))
    min_slope = 1 / max_slope
    max_slope *= (sz1 / sz2)
    min_slope *= (sz1 / sz2)
    sz = max(sz1, sz2)
    plt.figure(figsize=(6, 6))
    plt.imshow(mask, origin='lower')
    x = np.arange(-1, sz + 1)
    # The four sides of the parallelogram.
    plt.plot(x, max_slope * x, 'b')
    plt.plot(x, min_slope * x, 'r')
    plt.plot(x, ((sz1 - 1) - max_slope * (sz2 - 1)) + max_slope * x, 'g')
    plt.plot(x, ((sz1 - 1) - min_slope * (sz2 - 1)) + min_slope * x, 'y')
    for i in range(sz2):
        for j in range(sz1):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz2 - 0.5))
    plt.ylim((-0.5, sz1 - 0.5))
    plt.show()
```
If we cannot agree on the definitions, we can ask some experts (I am thinking of Eamonn Keogh) ;)
pyts/metrics/dtw.py (outdated)

```
    The size of the first time series.

n_timestamps_2 : int (optional, default None)
    The size of the second time series. if None, set to `n_timestamps_1`.
```
Nit: 'if' should be capitalized.
```
                  + max_slope * np.arange(n_timestamps))
lower_bound = np.round(lower_bound, 2)
lower_bound = np.ceil(np.max(lower_bound, axis=0))
# upper_bound[0] = max_slope * x
```
Likewise
I added them to make the code clearer for future collaborators (who could be us :) ), because it took me a while to figure out what `lower_bound` and `upper_bound` refer to.
Indeed, it's a good idea (there aren't many comments in my code...). Maybe adding a short sentence could be helpful, something like:
'lower_bounds' and 'upper_bounds' are the equations defining the sides of the parallelogram
You can also add the description of this new feature in
So you're half right about Sakoe-Chiba: in the original paper the region is defined as |i - j| <= window_size, which requires window_size to be larger than the difference between the lengths. But that window_size is symmetric with respect to i = j. My proposal was to make it symmetric with respect to the relative diagonal. For a window_size = 0.1, I think it is more natural to have a band around the relative diagonal than a band around i = j (figures omitted), which forces window_size = |sz1 - sz2| and does not allow j > i. If you are aware of other references about this, I'd be happy to have a look. About Itakura: rescaling the min_slope at the beginning will actually lead to a much lower min_slope (and hence a parallelogram that is not symmetric with respect to the relative diagonal; I am still working on that at the moment).
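To make the argument concrete, here is a hypothetical comparison (lengths 8 and 4, window_size = 1) showing that the classic |i - j| <= window_size band misses the end point while a band of the same width around the relative diagonal contains it:

```python
import numpy as np

sz1, sz2, window_size = 8, 4, 1
i, j = np.meshgrid(np.arange(sz1), np.arange(sz2), indexing='ij')

# Classic band, symmetric with respect to the diagonal i = j.
classic = np.abs(i - j) <= window_size

# Band around the relative diagonal j = i * (sz2 - 1) / (sz1 - 1).
scale = (sz2 - 1) / (sz1 - 1)
relative = np.abs(j - scale * i) <= window_size

print(classic[sz1 - 1, sz2 - 1])   # → False (the end point is unreachable)
print(relative[sz1 - 1, sz2 - 1])  # → True
```

With the classic definition, window_size would have to grow to at least |sz1 - sz2| = 4 before any path could reach (7, 3) at all.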
I agree with your new proposal for the Sakoe-Chiba band. For the Itakura parallelogram, it is a parallelogram and not a rhombus: there is no need for symmetry with respect to the relative diagonal. What is needed is two pairs of parallel sides. For instance, take a rectangle whose width is much lower than its height: you will see that it is impossible to have a diagonal that is symmetric with regard to both sides.
Yes, you are right, symmetry is not required. But taking min_slope = 1 / max_slope after rescaling only works if n2 > n1. If, for example, n2 = 0.5 * n1 and you take max_slope = 2, then max_slope becomes 1 and min_slope becomes 1 as well. min_slope must be lower than n2 / n1 to include the end point, which is why I make sure that max_slope >= n2 / n1 and min_slope <= n2 / n1.
Those scales are tricky... I don't get the same results when I swap the lengths:

```python
n_timestamps_1 = 8
n_timestamps_2 = 4
max_slope = 1
```

If I'm not wrong, you removed the rounding on
Regarding the computation of the Itakura parallelogram, I will still advocate for my proposal for the moment:

```python
def itakura_parallelogram_proposed(sz1, sz2, max_slope):
    max_slope *= max(sz2, sz1) / min(sz2, sz1)
    min_slope = 1 / max_slope
    max_slope *= (sz2 / sz1)
    min_slope *= (sz2 / sz1)
    lower_bound = np.empty((2, sz1))
    lower_bound[0] = min_slope * np.arange(sz1)
    lower_bound[1] = ((sz2 - 1) - max_slope * (sz1 - 1)
                      + max_slope * np.arange(sz1))
    lower_bound = np.round(lower_bound, 2)
    lower_bound = np.ceil(np.max(lower_bound, axis=0))
    upper_bound = np.empty((2, sz1))
    upper_bound[0] = max_slope * np.arange(sz1)
    upper_bound[1] = ((sz2 - 1) - min_slope * (sz1 - 1)
                      + min_slope * np.arange(sz1))
    upper_bound = np.round(upper_bound, 2)
    upper_bound = np.floor(np.min(upper_bound, axis=0) + 1)
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    return region
```

It has several upsides:
```python
def itakura ...
    min_slope = 1 / max_slope
    scale_max = (sz2 - 1) / (sz1 - 2)
    max_slope *= scale_max
    max_slope = max(1., max_slope)
    scale_min = (sz2 - 2) / (sz1 - 1)
    min_slope *= scale_min
    min_slope = min(1., min_slope)
```

As you noticed, for a path to be feasible, max_slope must be >= 1 and min_slope <= 1. This agrees with the previous implementation when n1 = n2 and does not overestimate the size of the parallelogram by multiplying the slopes with n2 / n1 twice. I'll add an example for region plots and update the doc :)
Maybe something like this (`_check_region` is the clipping helper from pyts/metrics/dtw.py):

```python
import matplotlib.pyplot as plt
import numpy as np

def sakoe(n_timestamps_1, n_timestamps_2, window_size):
    scale = (n_timestamps_2 - 1) / (n_timestamps_1 - 1)
    window_size = max(window_size, 1 / (2 * scale), 2 * scale)
    lower_bound = scale * np.arange(n_timestamps_1) - window_size
    lower_bound = np.ceil(np.round(lower_bound, 1))
    upper_bound = scale * np.arange(n_timestamps_1) + window_size
    upper_bound = np.floor(np.round(upper_bound, 1)) + 1
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    region = _check_region(region, n_timestamps_1, n_timestamps_2)
    return region

def plot_sakoe(sz1, sz2, window_size):
    region = sakoe(sz1, sz2, window_size)
    scale = (sz2 - 1) / (sz1 - 1)
    window_size = max(window_size, 1 / (2 * scale), 2 * scale)
    mask = np.zeros((sz2, sz1))
    for i, (j, k) in enumerate(region.T):
        mask[j:k, i] = 1.
    plt.figure(figsize=(sz1 / 2, sz2 / 2))
    plt.imshow(mask, origin='lower', cmap='Wistia')
    sz = max(sz1, sz2)
    x = np.arange(-1, sz + 1)
    plt.plot(x, scale * x - window_size, 'b')
    plt.plot(x, scale * x + window_size, 'g')
    plt.plot(x, scale * x, 'black')
    for i in range(sz1):
        for j in range(sz2):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz1 - 0.5))
    plt.ylim((-0.5, sz2 - 0.5))
    plt.show()
```
The band is way too large as it is; it is a bit better with this code:

```python
import matplotlib.pyplot as plt
import numpy as np

def sakoe(n_timestamps_1, n_timestamps_2, window_size):
    scale = (n_timestamps_2 - 1) / (n_timestamps_1 - 1)
    if n_timestamps_2 > n_timestamps_1:
        window_size = max(window_size, scale / 2)
    else:
        window_size = max(window_size, scale * 2)
    lower_bound = scale * np.arange(n_timestamps_1) - window_size
    lower_bound = np.ceil(np.round(lower_bound, 1))
    upper_bound = scale * np.arange(n_timestamps_1) + window_size
    upper_bound = np.floor(np.round(upper_bound, 1)) + 1
    region = np.asarray([lower_bound, upper_bound]).astype('int64')
    region = _check_region(region, n_timestamps_1, n_timestamps_2)
    return region

def plot_sakoe(sz1, sz2, window_size):
    region = sakoe(sz1, sz2, window_size)
    scale = (sz2 - 1) / (sz1 - 1)
    if sz2 > sz1:
        window_size = max(window_size, scale / 2)
    else:
        window_size = max(window_size, scale * 2)
    mask = np.zeros((sz2, sz1))
    for i, (j, k) in enumerate(region.T):
        mask[j:k, i] = 1.
    plt.figure(figsize=(sz1 / 2, sz2 / 2))
    plt.imshow(mask, origin='lower', cmap='Wistia')
    sz = max(sz1, sz2)
    x = np.arange(-1, sz + 1)
    plt.plot(x, scale * x - window_size, 'b')
    plt.plot(x, scale * x + window_size, 'g')
    plt.plot(x, scale * x, 'black')
    for i in range(sz1):
        for j in range(sz2):
            plt.plot(i, j, 'o', color='green', ms=2)
    ax = plt.gca()
    ax.set_xticks(np.arange(-.5, sz, 1), minor=True)
    ax.set_yticks(np.arange(-.5, sz, 1), minor=True)
    ax.grid(which='minor', color='b', linestyle='--', linewidth=1)
    plt.xlim((-0.5, sz1 - 0.5))
    plt.ylim((-0.5, sz2 - 0.5))
    plt.show()
```

The only issue is that, when both the upper and lower bounds cross an admissible point, there are two points in the feasible region (see the third column, for instance).
Nice adaptation!

```python
if sz2 > sz1:
    window_size = max(window_size, scale / 2)
elif sz1 > sz2:
    window_size = max(window_size, 0.5)
```

The lowest allowed window_size is not symmetric. When the matrix is almost flat, the margin between the bounds should be 1, hence a 0.5 window. Plus, we should leave the case sz1 = sz2 untouched.
Thanks for the great PR! I left a few comments (mostly nitpicking comments, and one regarding the API).
pyts/metrics/dtw.py (outdated)

```
    Each cell whose distance with the diagonale is lower than or equal to
    'window_size' becomes a valid cell for the path.

relative_window_size : bool (default True)
```
scikit-learn has one single parameter for integers and floats (see for instance `min_samples_split` and `min_samples_leaf` in `sklearn.ensemble.RandomForestClassifier`). I prefer having a single parameter, and I think that it would be more consistent with scikit-learn's API. But it's open to discussion :)
I agree that having 2 params is not ideal either, but having different outcomes with `1` and `1.` gave me some headaches as well. I'll revert back to the old API, but we can make the distinction clearer when raising errors.
The error messages may not be the best indeed. However, the description in the docstring is pretty explicit imho:

Lines 387 to 392 in 5b4310f:

```
window_size : int or float (default = 0.1)
    The window above and below the diagonale. If float, it must be between
    0 and 1, and the actual window size will be computed as
    ``ceil(window_size * (n_timestamps - 1))``. Each cell whose distance
    with the diagonale is lower than or equal to 'window_size' becomes a
    valid cell for the path.
```
Yes, I made some small changes to the docstring and the error messages. I can live with that :)
Thanks, I think we're good to go!
Final review (some little details) :)
Thanks once again for your great PR!
One of the features of the DTW metric is that it can compare time series of different lengths; this is what this PR implements. All region methods are updated following the same logic, by rescaling to the relative diagonal (n_timestamps_2 / n_timestamps_1). Here is the updated output of the plot_dtw example with x = x[::2] (figure omitted).
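For context, DTW itself handles unequal lengths naturally once a region (or the full matrix) is given: the cumulative cost matrix is simply rectangular. The following is a minimal unconstrained sketch with a squared local cost, not pyts's actual implementation:

```python
import numpy as np

def dtw_naive(x, y):
    """Classic DTW between two 1-D series that may differ in length."""
    n, m = len(x), len(y)
    # cost[i, j] = cheapest cumulative cost of aligning x[:i] with y[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = (x[i - 1] - y[j - 1]) ** 2
            # A step can come from the left, from below, or from the diagonal.
            cost[i, j] = dist + min(cost[i - 1, j],
                                    cost[i, j - 1],
                                    cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])

x = np.array([0., 1., 2., 3.])
print(dtw_naive(x, x))       # → 0.0
print(dtw_naive(x, x[::2]))  # x[::2] == [0., 2.]; → 1.4142... (sqrt(2))
```

Restricting the inner loop to a Sakoe-Chiba band or Itakura parallelogram only changes which (i, j) cells are visited; the recurrence stays the same.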