Initial commit

keonlee9420 · Jun 16, 2021 · c89d8c2
commit c89d8c2
Showing 4 changed files with 323 additions and 0 deletions.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2021 Keon Lee
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,52 @@
+# Soft DTW Loss Function for PyTorch in CUDA
+
+This is a PyTorch implementation of [Soft-DTW: a Differentiable Loss Function for Time-Series](https://arxiv.org/abs/1703.01541) that `supports batched computation`, is `CUDA-friendly`, and is `feasible to use as a final loss`. I can confirm that a (sequential) model can be trained with this as its final loss! The following image shows the training logs of a TTS model using the Soft-DTW loss function.
+
+<p align="center">
+    <img src="figs/sdtw_cuda_loss.png" width="80%">
+</p>
+
+There are some previous implementations:
+1. [mblondel's soft-dtw](https://github.com/mblondel/soft-dtw)
+2. [lyprince's sdtw_pytorch](https://github.com/lyprince/sdtw_pytorch)
+3. [Maghoumi's pytorch-softdtw-cuda](https://github.com/Maghoumi/pytorch-softdtw-cuda)
+
+But they either lack CUDA-friendly batch computation or do not consider the Jacobian w.r.t. the input matrix, which is necessary for use as a final loss in recent deep learning frameworks. The current implementation satisfies all of these conditions.
+
+# Usage
+
+Same as [Maghoumi's pytorch-softdtw-cuda](https://github.com/Maghoumi/pytorch-softdtw-cuda):
+```python
+import torch
+from sdtw_cuda_loss import SoftDTW
+
+# Create the sequences on the GPU (only the CUDA path is implemented)
+batch_size, len_x, len_y, dims = 8, 15, 12, 5
+x = torch.rand((batch_size, len_x, dims), device="cuda", requires_grad=True)
+y = torch.rand((batch_size, len_y, dims), device="cuda")
+
+# Create the "criterion" object
+sdtw = SoftDTW(use_cuda=True, gamma=0.1)
+
+# Compute the loss value
+loss = sdtw(x, y)  # Just like any torch.nn.xyzLoss()
+
+# Aggregate and call backward()
+loss.mean().backward()
+```
+Unlike in the previous work, however, the backward pass here computes the gradient w.r.t. the input sequence `x`.
+
+# Note
+In the current implementation, only `use_cuda=True` is supported. However, a CPU version can be added easily by following [Maghoumi's pytorch-softdtw-cuda](https://github.com/Maghoumi/pytorch-softdtw-cuda).
+
+# Citation
+
+```
+@misc{lee2021soft_dtw_loss,
+  author = {Lee, Keon},
+  title = {Soft-DTW-Loss},
+  year = {2021},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/keonlee9420/Soft-DTW-Loss}}
+}
+```
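
The README's note says a CPU version is easy to add. As a rough sketch of what a reference for small inputs could look like, the pure-Python function below evaluates the same recursion as `compute_softdtw_cuda` for a single pair of sequences. The names `softmin3`, `sq_euclidean`, and `soft_dtw_reference` are hypothetical and not part of this commit. The loop is sequential, unbatched, and has no bandwidth pruning, so it is only meant for sanity-checking tiny examples.

```python
import math


def softmin3(a, b, c, gamma):
    """Smoothed minimum of three values: -gamma * log(sum(exp(-v / gamma)))."""
    vals = [-a / gamma, -b / gamma, -c / gamma]
    vmax = max(vals)
    if vmax == -math.inf:  # all three predecessors are unreachable
        return math.inf
    # Subtract vmax before exponentiating for numerical stability
    total = sum(math.exp(v - vmax) for v in vals)
    return -gamma * (math.log(total) + vmax)


def sq_euclidean(x, y):
    """Pairwise squared Euclidean distances between the rows of x and y."""
    return [[sum((a - b) ** 2 for a, b in zip(xi, yj)) for yj in y] for xi in x]


def soft_dtw_reference(x, y, gamma=1.0):
    """Soft-DTW value for one pair of sequences given as lists of feature vectors."""
    D = sq_euclidean(x, y)
    n, m = len(x), len(y)
    # R has one extra leading row and column; only R[0][0] is reachable at the start
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + softmin3(
                R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma
            )
    return R[n][m]


if __name__ == "__main__":
    x = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]]
    y = [[0.0, 0.1], [2.0, 0.9]]
    print(soft_dtw_reference(x, y, gamma=0.1))
```

Because the soft minimum never exceeds the hard minimum, the returned value is at most the classic DTW cost, and it approaches that cost as `gamma` shrinks.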
diff --git a/figs/sdtw_cuda_loss.png b/figs/sdtw_cuda_loss.png
diff --git a/sdtw_cuda_loss.py b/sdtw_cuda_loss.py
@@ -0,0 +1,250 @@
+import math
+
+import torch
+from numba import cuda
+from torch.autograd import Function
+
+# ----------------------------------------------------------------------------------------------------------------------
+@cuda.jit
+def compute_softdtw_cuda(D, gamma, bandwidth, max_i, max_j, n_passes, R):
+    """
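+    :param D: Pairwise distances, shape (batch, max_i, max_j)
+    :param max_i, max_j: Lengths of the two sequences (they may differ)
+    :param n_passes: 2 * max(max_i, max_j) - 1 (enough passes to cover every anti-diagonal)
+    :param R: Accumulated costs, shape (batch, max_i + 2, max_j + 2), filled in place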
+    """
+    # Each block processes one pair of examples
+    b = cuda.blockIdx.x
+    # We launch max(max_i, max_j) threads per block, which is at least the number of
+    # elements on the largest anti-diagonal
+    tid = cuda.threadIdx.x
+
+    # Compute I, J, the zero-based indices in [0, max_i) and [0, max_j)
+
+    # The row index is always the same as tid
+    I = tid
+
+    inv_gamma = 1.0 / gamma
+
+    # Go over each anti-diagonal. Only threads that fall on the current anti-diagonal do any work
+    for p in range(n_passes):
+
+        # The index is actually 'p - tid' but need to force it in-bounds
+        J = max(0, min(p - tid, max_j - 1))
+
+        # For simplicity, we define i, j which start from 1 (offset from I, J)
+        i = I + 1
+        j = J + 1
+
+        # Only compute if element[i, j] is on the current anti-diagonal, and also is within bounds
+        if I + J == p and (I < max_i and J < max_j):
+            # Don't compute if outside bandwidth
+            if not (abs(i - j) > bandwidth > 0):
+                r0 = -R[b, i - 1, j - 1] * inv_gamma
+                r1 = -R[b, i - 1, j] * inv_gamma
+                r2 = -R[b, i, j - 1] * inv_gamma
+                rmax = max(max(r0, r1), r2)
+                rsum = math.exp(r0 - rmax) + math.exp(r1 - rmax) + math.exp(r2 - rmax)
+                softmin = -gamma * (math.log(rsum) + rmax)
+                R[b, i, j] = D[b, i - 1, j - 1] + softmin
+
+        # Wait for other threads in this block
+        cuda.syncthreads()
+
+# ----------------------------------------------------------------------------------------------------------------------
+@cuda.jit
+def compute_softdtw_backward_cuda(D, R, inv_gamma, bandwidth, max_i, max_j, n_passes, E):
+    k = cuda.blockIdx.x
+    tid = cuda.threadIdx.x
+
+    # Indexing logic is the same as above, however, the anti-diagonal needs to
+    # progress backwards
+    I = tid
+
+    for p in range(n_passes):
+        # Reverse the order to make the loop go backward
+        rev_p = n_passes - p - 1
+
+        # convert tid to I, J, then i, j
+        J = max(0, min(rev_p - tid, max_j - 1))
+
+        i = I + 1
+        j = J + 1
+
+        # Only compute if element[i, j] is on the current anti-diagonal, and also is within bounds
+        if I + J == rev_p and (I < max_i and J < max_j):
+
+            if math.isinf(R[k, i, j]):
+                R[k, i, j] = -math.inf
+
+            # Don't compute if outside bandwidth
+            if not (abs(i - j) > bandwidth > 0):
+                a = math.exp((R[k, i + 1, j] - R[k, i, j] - D[k, i + 1, j]) * inv_gamma)
+                b = math.exp((R[k, i, j + 1] - R[k, i, j] - D[k, i, j + 1]) * inv_gamma)
+                c = math.exp((R[k, i + 1, j + 1] - R[k, i, j] - D[k, i + 1, j + 1]) * inv_gamma)
+                E[k, i, j] = E[k, i + 1, j] * a + E[k, i, j + 1] * b + E[k, i + 1, j + 1] * c
+
+        # Wait for other threads in this block
+        cuda.syncthreads()
+
+# ----------------------------------------------------------------------------------------------------------------------
+def jacobean_product_squared_euclidean(X, Y, Bt):
+    '''
+    jacobean_product_squared_euclidean(X, Y, Bt):
+
+    Jacobian product of squared Euclidean distance matrix and alignment matrix.
+    See equations 2 and 2.5 of https://arxiv.org/abs/1703.01541
+    '''
+    # X: (batch, dims, len_x), Y: (batch, dims, len_y), Bt: (batch, len_y, len_x)
+    ones = torch.ones_like(Y)  # same shape, dtype, and device as Y
+    return 2 * (ones.matmul(Bt) * X - Y.matmul(Bt))
+
+class _SoftDTWCUDA(Function):
+    """
+    CUDA implementation is inspired by the diagonal one proposed in https://ieeexplore.ieee.org/document/8400444:
+    "Developing a pattern discovery method in time series data and its GPU acceleration"
+    """
+
+    @staticmethod
+    def forward(ctx, X, Y, D, gamma, bandwidth):
+        dev = D.device
+        dtype = D.dtype
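+        # Wrap the scalars as tensors so they can be stored with save_for_backward()
+        gamma = torch.tensor([gamma], dtype=dtype, device=dev)
+        bandwidth = torch.tensor([bandwidth], dtype=dtype, device=dev)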
+
+        B = D.shape[0]
+        N = D.shape[1]
+        M = D.shape[2]
+        threads_per_block = max(N, M)
+        n_passes = 2 * threads_per_block - 1
+
+        # Prepare the output array
+        R = torch.full((B, N + 2, M + 2), math.inf, device=dev, dtype=dtype)
+        R[:, 0, 0] = 0
+
+        # Run the CUDA kernel.
+        # Set CUDA's grid size to be equal to the batch size (every CUDA block processes one sample pair)
+        # Set the CUDA block size to be equal to the length of the longer sequence (equal to the size of the largest diagonal)
+        compute_softdtw_cuda[B, threads_per_block](cuda.as_cuda_array(D.detach()),
+                                                   gamma.item(), bandwidth.item(), N, M, n_passes,
+                                                   cuda.as_cuda_array(R))
+        ctx.save_for_backward(D, X, Y, R, gamma, bandwidth)
+        return R[:, -2, -2]
+
+    @staticmethod
+    def backward(ctx, grad_output):
+        dev = grad_output.device
+        dtype = grad_output.dtype
+        D, X, Y, R, gamma, bandwidth = ctx.saved_tensors
+
+        B = D.shape[0]
+        N = D.shape[1]
+        M = D.shape[2]
+        threads_per_block = max(N, M)
+        n_passes = 2 * threads_per_block - 1
+
+        D_ = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev)
+        D_[:, 1:N + 1, 1:M + 1] = D
+
+        R[:, :, -1] = -math.inf
+        R[:, -1, :] = -math.inf
+        R[:, -1, -1] = R[:, -2, -2]
+
+        E = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev)
+        E[:, -1, -1] = 1
+
+        # Grid and block sizes are set the same way as in the forward() call above
+        compute_softdtw_backward_cuda[B, threads_per_block](cuda.as_cuda_array(D_),
+                                                            cuda.as_cuda_array(R),
+                                                            1.0 / gamma.item(), bandwidth.item(), N, M, n_passes,
+                                                            cuda.as_cuda_array(E))
+        E = E[:, 1:N + 1, 1:M + 1]
+        G = jacobean_product_squared_euclidean(X.transpose(1,2), Y.transpose(1,2), E.transpose(1,2)).transpose(1,2)
+
+        return grad_output.view(-1, 1, 1).expand_as(G) * G, None, None, None, None
+
+# ----------------------------------------------------------------------------------------------------------------------
+class SoftDTW(torch.nn.Module):
+    """
+    The soft-DTW loss module (only the CUDA implementation is available)
+    """
+
+    def __init__(self, use_cuda, gamma=1.0, normalize=False, bandwidth=None, dist_func=None):
+        """
+        Initializes a new instance using the supplied parameters
+        :param use_cuda: Flag indicating whether the CUDA implementation should be used
+        :param gamma: sDTW's gamma parameter
+        :param normalize: Flag indicating whether to perform normalization
+                          (as discussed in https://github.com/mblondel/soft-dtw/issues/10#issuecomment-383564790)
+        :param bandwidth: Sakoe-Chiba bandwidth for pruning. Passing 'None' will disable pruning.
+        :param dist_func: Optional point-wise distance function to use. If 'None', the default squared Euclidean distance function is used.
+        """
+        super(SoftDTW, self).__init__()
+
+        assert use_cuda, "Only the CUDA version is supported."
+
+        self.normalize = normalize
+        self.gamma = gamma
+        self.bandwidth = 0 if bandwidth is None else float(bandwidth)
+        self.use_cuda = use_cuda
+
+        # Set the distance function
+        if dist_func is not None:
+            self.dist_func = dist_func
+        else:
+            self.dist_func = SoftDTW._euclidean_dist_func
+
+    def _get_func_dtw(self, x, y):
+        """
+        Checks the inputs and selects the proper implementation to use.
+        """
+        bx, lx, dx = x.shape
+        by, ly, dy = y.shape
+        # Make sure the dimensions match
+        assert bx == by  # Equal batch sizes
+        assert dx == dy  # Equal feature dimensions
+
+        # Each CUDA block needs one thread per element of the longer sequence
+        if lx > 1024 or ly > 1024:
+            raise ValueError("SoftDTW: Cannot use CUDA because the sequence length > 1024 (the maximum block size supported by CUDA)")
+
+        # Only the CUDA implementation is available
+        return _SoftDTWCUDA.apply
+
+    @staticmethod
+    def _euclidean_dist_func(x, y):
+        """
+        Calculates the squared Euclidean distance between every pair of timesteps in x and y
+        """
+        n = x.size(1)
+        m = y.size(1)
+        d = x.size(2)
+        x = x.unsqueeze(2).expand(-1, n, m, d)
+        y = y.unsqueeze(1).expand(-1, n, m, d)
+        return torch.pow(x - y, 2).sum(3)
+
+    def forward(self, X, Y):
+        """
+        Compute the soft-DTW value between X and Y
+        :param X: One batch of examples, batch_size x len_x x dims
+        :param Y: The other batch of examples, batch_size x len_y x dims
+        :return: The soft-DTW value of each pair, shape (batch_size,)
+        """
+
+        # Check the inputs and get the correct implementation
+        func_dtw = self._get_func_dtw(X, Y)
+
+        if self.normalize:
+            # Stack everything up and run
+            x = torch.cat([X, X, Y])
+            y = torch.cat([Y, X, Y])
+            D = self.dist_func(x, y)
+            # Pass the stacked batches so that X, Y, and D share the same batch size
+            out = func_dtw(x, y, D, self.gamma, self.bandwidth)
+            out_xy, out_xx, out_yy = torch.split(out, X.shape[0])
+            return out_xy - 1 / 2 * (out_xx + out_yy)
+        else:
+            D_xy = self.dist_func(X, Y)
+            return func_dtw(X, Y, D_xy, self.gamma, self.bandwidth)
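
The backward pass relies on `jacobean_product_squared_euclidean` to turn the alignment matrix `E` into a gradient w.r.t. `X`. The sketch below restates that closed form in plain Python for one pair of sequences and compares it against a central finite difference of the weighted cost sum over n, m of E[n][m] * ||x_n - y_m||^2. All function names here are hypothetical and not part of this commit, and the layout is (length, dims) per sequence rather than the transposed, batched tensors used in the file.

```python
def jacobian_product_sq_euclidean(x, y, E):
    """Gradient w.r.t. x of sum_{n,m} E[n][m] * ||x[n] - y[m]||^2."""
    grad = []
    for n, xn in enumerate(x):
        row_sum = sum(E[n])  # corresponds to the ones.matmul(Bt) term
        grad.append([
            2.0 * (row_sum * xn[k] - sum(E[n][m] * y[m][k] for m in range(len(y))))
            for k in range(len(xn))
        ])
    return grad


def weighted_cost(x, y, E):
    """Sum of squared Euclidean distances weighted by the alignment matrix E."""
    return sum(
        E[n][m] * sum((a - b) ** 2 for a, b in zip(x[n], y[m]))
        for n in range(len(x))
        for m in range(len(y))
    )


def finite_difference_grad(x, y, E, eps=1e-6):
    """Central finite-difference gradient of weighted_cost w.r.t. x."""
    grad = [[0.0] * len(row) for row in x]
    for n in range(len(x)):
        for k in range(len(x[n])):
            original = x[n][k]
            x[n][k] = original + eps
            up = weighted_cost(x, y, E)
            x[n][k] = original - eps
            down = weighted_cost(x, y, E)
            x[n][k] = original  # restore the perturbed entry
            grad[n][k] = (up - down) / (2 * eps)
    return grad


if __name__ == "__main__":
    x = [[0.0, 1.0], [2.0, -1.0]]
    y = [[1.0, 1.0], [0.5, 0.0], [3.0, 2.0]]
    E = [[0.2, 0.5, 0.3], [0.1, 0.0, 0.9]]
    print(jacobian_product_sq_euclidean(x, y, E))
    print(finite_difference_grad(x, y, E))
```

The two printed gradients should agree up to rounding, since the weighted cost is quadratic in `x`.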