
Optimized memory usage and speed for covar type "full" #23

Open · wants to merge 31 commits into master
Conversation

DeMoriarty

Improved speed and memory usage with the following optimizations:

  1. The (N, K, 1, D) * (1, K, D, D) matmul at line 275 is replaced with an equivalent matmul (K, N, D) * (K, D, D). The former is interpreted by cuBLAS as a batched matrix-vector product, while the latter is a batched matrix-matrix product, which is more efficient on GPUs (see the sketch after this list).

  2. In two consecutive iterations of fit, _estimate_log_prob was being called twice with the same input, once in _e_step and once in __score. Now weighted_log_probs is computed only once, in __score of the previous iteration, and cached to be reused in _e_step of the next iteration.

  3. At line 342, mu was originally obtained by element-wise multiplication and summation; this is now simplified to a matmul.

  4. At line 346, the batched vector outer product followed by a summation is rewritten as a single batched matmul, which is more efficient on GPUs.

  5. Computations in _m_step and _estimate_log_prob are split into smaller chunks to prevent OOM as much as possible.

  6. Added an option to choose the dtype of the covariance matrix. torch.linalg.eigvals is used to compute log_det if covariance_data_type = torch.float; otherwise the Cholesky decomposition is used.

  7. Replaced some of the tensor-scalar and tensor-tensor additions/multiplications with their in-place counterparts to reduce unnecessary memory allocation.
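For illustration, here is a minimal, self-contained sketch of optimization 1; the shapes and variable names are made up for the example and are not taken from the PR's code.

```python
import torch

# Toy shapes for illustration only; not the PR's actual variables.
N, K, D = 512, 16, 8
x_mu = torch.randn(N, K, 1, D)        # (x - mu) as row vectors per sample/component
precision = torch.randn(1, K, D, D)   # one precision matrix per component

# Original form: broadcast matmul, which cuBLAS treats as N*K matrix-vector products.
out_broadcast = x_mu.matmul(precision)                      # (N, K, 1, D)

# Regrouped form: K matrix-matrix products of shape (N, D) @ (D, D).
out_batched = torch.bmm(x_mu.squeeze(-2).permute(1, 0, 2),  # (K, N, D)
                        precision.squeeze(0))               # (K, D, D) -> (K, N, D)
out_batched = out_batched.permute(1, 0, 2).unsqueeze(-2)    # back to (N, K, 1, D)

assert torch.allclose(out_broadcast, out_batched, atol=1e-5)
```

On GPUs the second form runs as batched GEMMs, which is what the description above attributes the speed-up to.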

Benchmark results

Remaining issues:

  1. When covariance_data_type = "float" and both n_components and n_features are large, covar contains NaN.

Owner
@ldeecke ldeecke left a comment

Very nice work, and thank you for putting so much thought into this! Impressive speed-ups! 👍

Left a couple of comments (apologies for the delay!). In particular, I'm curious to hear your ideas on whether we should move the optimizations for covariance_type=full elsewhere, as this could benefit readability w.r.t. the underlying EM mechanism.

@@ -0,0 +1,39 @@
# Benchmark
Owner

Awesome results, thanks for sharing these!

Before merging with master, I would suggest removing benchmark.md.

@@ -1,9 +1,11 @@
import torch
import numpy as np
import math
Owner

gmm.py:5 imports from math, so it'd make sense to either replace all occurrences of pi or import ceil alongside it.


from math import pi
from scipy.special import logsumexp
from utils import calculate_matmul, calculate_matmul_n_times
from utils import calculate_matmul, calculate_matmul_n_times, find_optimal_splits
from tqdm import tqdm
Owner

I'd recommend removing this to keep the repository light on dependencies — users that require this functionality can always add it.

return check_available_ram(device) >= size


def find_optimal_splits(n, get_required_memory, device="cpu", safe_mode=True):
Owner

safe_mode doesn't seem to get passed on to will_it_fit.
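A hypothetical sketch of the wiring this comment points at; the internals of find_optimal_splits and the signatures below are assumed for illustration and are not taken from the PR.

```python
import math

def check_available_ram(device="cpu"):
    # Stand-in for the PR's helper; pretend 8 GB are free.
    return 8 * 1024 ** 3

def will_it_fit(size, device="cpu", safe_mode=True):
    if not safe_mode:
        return True                              # caller opted out of the check
    return check_available_ram(device) >= size

def find_optimal_splits(n, get_required_memory, device="cpu", safe_mode=True):
    # Double the number of splits until a chunk fits, forwarding safe_mode explicitly.
    splits = 1
    while splits < n:
        if will_it_fit(get_required_memory(math.ceil(n / splits)), device, safe_mode):
            return splits
        splits *= 2
    return n
```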

@@ -188,7 +203,8 @@ def predict(self, x, probs=False):
"""
x = self.check_size(x)

weighted_log_prob = self._estimate_log_prob(x) + torch.log(self.pi)
weighted_log_prob = self._estimate_log_prob(x)
weighted_log_prob.add_(torch.log(self.pi))
Owner

While carrying this out in-place preserves memory, spreading it across two lines here and in lines 369 and 466 decreases readability somewhat. Alternatively, I reckon this could be moved into _estimate_log_prob.
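A minimal sketch of that alternative (the helper name and shapes are illustrative, not from the PR): keep the in-place add, but hide it behind a single call so the call sites stay one-liners.

```python
import torch

def weighted_log_prob(log_prob: torch.Tensor, pi: torch.Tensor) -> torch.Tensor:
    # Fold log(pi_k) into the component log-likelihoods in place,
    # avoiding a second (n, k, 1) allocation at the call site.
    log_prob.add_(torch.log(pi))
    return log_prob

log_prob = torch.randn(4, 3, 1)        # toy (n, k, 1) log-likelihoods
pi = torch.full((1, 3, 1), 1.0 / 3)    # toy mixing weights
print(weighted_log_prob(log_prob, pi).shape)   # torch.Size([4, 3, 1])
```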


log_det = self._calculate_log_det(precision) #[K, 1]

x_mu_T_precision_x_mu = torch.empty(N, K, 1, device=x.device, dtype=x.dtype)
Owner

Unless there are reservations/concerns, I would consider moving this into its own utility function, in the interest of preserving readability of the code (happy to take care of this once it has been merged).
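Something along these lines could live in utils.py; this is only a sketch under assumed shapes ((K, N, D) differences and (K, D, D) precisions), not the PR's code.

```python
import torch

def quadratic_form_chunked(x_mu, precision, chunk_size):
    # Computes (x - mu)^T precision (x - mu) per sample and component,
    # processing components in chunks to bound peak memory.
    # x_mu: (K, N, D), precision: (K, D, D) -> returns (N, K, 1).
    K, N, D = x_mu.shape
    out = torch.empty(N, K, 1, device=x_mu.device, dtype=x_mu.dtype)
    for k0 in range(0, K, chunk_size):
        k1 = min(k0 + chunk_size, K)
        x_mu_T_p = torch.bmm(x_mu[k0:k1], precision[k0:k1])   # (k, N, D)
        out[:, k0:k1, 0] = (x_mu_T_p * x_mu[k0:k1]).sum(-1).t()
    return out
```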

eps = (torch.eye(self.n_features) * self.eps).to(x.device)
var = torch.sum((x - mu).unsqueeze(-1).matmul((x - mu).unsqueeze(-2)) * resp.unsqueeze(-1), dim=0,
keepdim=True) / torch.sum(resp, dim=0, keepdim=True).unsqueeze(-1) + eps
var = torch.empty(1, K, D, D, device=x.device, dtype=resp.dtype)
Owner

Nice! 👍

Same thought as before, however: given the additional complexity that's introduced here, it might make sense to define these optimizations in some other place.
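For reference, a self-contained sketch (toy shapes, not the PR's exact code) showing that the weighted sum of outer products above is equivalent to a single batched matmul per component:

```python
import torch

N, K, D = 256, 8, 5
x = torch.randn(N, 1, D, dtype=torch.double)
mu = torch.randn(1, K, D, dtype=torch.double)
resp = torch.rand(N, K, 1, dtype=torch.double)
eps = torch.eye(D, dtype=torch.double) * 1.e-6

# Element-wise outer products followed by a sum, as in the original code path.
var_a = torch.sum((x - mu).unsqueeze(-1).matmul((x - mu).unsqueeze(-2)) * resp.unsqueeze(-1),
                  dim=0, keepdim=True) / torch.sum(resp, dim=0, keepdim=True).unsqueeze(-1) + eps

# Batched matmul: (K, D, N) @ (K, N, D) -> (K, D, D), one GEMM per component.
x_mu = (x - mu).permute(1, 0, 2)                      # (K, N, D)
weighted = x_mu * resp.permute(1, 0, 2)               # responsibilities as row weights
var_b = torch.bmm(weighted.transpose(1, 2), x_mu)     # (K, D, D)
var_b = (var_b / resp.sum(dim=0).unsqueeze(-1)).unsqueeze(0) + eps

assert torch.allclose(var_a, var_b)
```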

covariance_type: str
eps: float
init_params: str
covariance_data_type: str or torch.dtype
Owner

Since mu is getting matched against this type, might as well go ahead and introduce this as dtype altogether, right?
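One way to do that while still accepting the string aliases (hypothetical helper, not part of the PR):

```python
import torch

_DTYPE_ALIASES = {"float": torch.float32, "double": torch.float64}

def _resolve_covariance_dtype(covariance_data_type):
    # Accept either a torch.dtype or the string aliases used so far, and store a
    # torch.dtype internally so mu/var/x can be matched against it directly.
    if isinstance(covariance_data_type, torch.dtype):
        return covariance_data_type
    return _DTYPE_ALIASES[covariance_data_type]

print(_resolve_covariance_dtype("double"))        # torch.float64
print(_resolve_covariance_dtype(torch.float32))   # torch.float32
```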

@@ -15,30 +17,31 @@ class GaussianMixture(torch.nn.Module):
probabilities are shaped (n, k, 1) if they relate to an individual sample,
or (1, k, 1) if they assign membership probabilities to one of the mixture components.
"""
def __init__(self, n_components, n_features, covariance_type="full", eps=1.e-6, init_params="kmeans", mu_init=None, var_init=None):
def __init__(self, n_components, n_features, covariance_type="full", eps=1.e-6, init_params="kmeans", mu_init=None, var_init=None, covariance_data_type="double"):
Owner

Any reservations against going with "float" as the default type (it matches the torch.Tensor default)?

log_2pi = d * np.log(2. * pi)

log_det = self._calculate_log_det(precision)
x = x.to(var.dtype)
Owner

Since self.covariance_data_type has been allocated, maybe use that instead?
