Fixes torch DDP distributed metric computation for AUROC #3234
Conversation
…t for distributed metric computation
ludwig/features/base_feature.py
Outdated
@@ -352,7 +353,10 @@ def get_metrics(self):
    try:
        metric_vals[metric_name] = get_scalar_from_ludwig_metric(metric_fn)
    except Exception as e:
-       logger.error(f"Caught exception computing metric: {metric_name}. Exception: {e}")
+       logger.error(
This should just be:
logger.exception(f"Caught exception computing metric: {metric_name}")
Then you get the stack trace for free.
Hm, is that true? We're not re-raising the exception, so it seems like it would get swallowed. With the original line of code,
logger.error(f"Caught exception computing metric: {metric_name}. Exception: {e}")
I was only seeing the exception string, but not the stack trace. With the suggested line of code, the exception is not included in the log message, so I would expect it to get swallowed entirely.
logger.exception != logger.error
Ah, didn't catch that. Okay will try that. Thanks!
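For reference, the difference between the two calls can be shown in a few lines. This is a minimal sketch; the failing `get_scalar_from_ludwig_metric` stub here is hypothetical, standing in for any metric computation that raises:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def get_scalar_from_ludwig_metric(metric_fn):
    # Hypothetical stub that always fails, to trigger the except block.
    raise ValueError("metric state is empty")

metric_name = "roc_auc"
try:
    get_scalar_from_ludwig_metric(None)
except Exception as e:
    # logger.error emits only the formatted message; the traceback is
    # lost unless you interpolate the exception into the string yourself.
    logger.error(f"Caught exception computing metric: {metric_name}. Exception: {e}")
    # logger.exception emits the same message at ERROR level *and*
    # appends the active traceback, with no need to re-raise.
    logger.exception(f"Caught exception computing metric: {metric_name}")
```

Because `logger.exception` reads the in-flight exception from `sys.exc_info()`, it only behaves this way when called from inside an `except` block.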
ludwig/utils/horovod_utils.py
Outdated
# This is to match the output of the torchmetrics gather_all_tensors function
# and ensures that the return value is usable by torchmetrics.compute downstream.
if len(result.shape) >= 2:
Interesting, so my understanding is that the DDP allgather contract is:
tensor[*shape] -> [tensor[*shape]] * world_size
While Horovod's is:
tensor[*shape] -> tensor[world_size, *shape]
In the previous implementation, we split the tensor along the rank dimension (dim=0) into [tensor[*shape]] * world_size to match the DDP format. But with this change, it seems that if the input is a scalar, meaning the allgathered output is a 1D vector, then we just return it as-is (unless it's a bool, in which case we iterate over it and turn it into a list of bool tensors).
Seems like the output format is potentially inconsistent then, right? It could be a list of tensors or it could be a single tensor. Am I missing something?
The other aspect here is the difference between casting to a list vs calling split. Running this in a terminal, I get:
>>> t
tensor([[1., 1., 2.],
[2., 3., 1.],
[5., 2., 4.]])
>>> list(t)
[tensor([1., 1., 2.]), tensor([2., 3., 1.]), tensor([5., 2., 4.])]
>>> t.split(1, dim=0)
(tensor([[1., 1., 2.]]), tensor([[2., 3., 1.]]), tensor([[5., 2., 4.]]))
So it seems like the difference here is that split preserves the rank of the tensor, while list removes the dimension that's being split upon entirely. So that does seem to better align with the DDP format.
As such, is the right thing to do here just to change lines 77 and 78 to:
gathered = _HVD.allgather(result)
gathered_result = list(gathered)
In other words, why only do this when len(result.shape) >= 2?
Otherwise, it makes sense, as I do believe DDP allgather does not add an extra dimension to the tensors being gathered, as the code seemed to be doing previously.
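To make the two contracts concrete, here's a small runnable sketch; the `hvd_to_ddp_format` helper name is illustrative, not a function in the codebase:

```python
import torch

def hvd_to_ddp_format(gathered: torch.Tensor) -> list:
    # Horovod allgather stacks per-rank tensors along dim 0:
    #   tensor[*shape] -> tensor[world_size, *shape]
    # torchmetrics' gather_all_tensors (DDP) instead returns:
    #   [tensor[*shape]] * world_size
    # list(t) iterates over dim 0 and drops that dimension, unlike
    # t.split(1, dim=0), which keeps a leading dimension of size 1.
    return list(gathered)

# Simulated output of a 3-rank Horovod allgather of shape-(3,) tensors.
gathered = torch.tensor([[1., 1., 2.], [2., 3., 1.], [5., 2., 4.]])

per_rank = hvd_to_ddp_format(gathered)
print(len(per_rank))              # 3 per-rank tensors
print(tuple(per_rank[0].shape))   # rank dimension removed: (3,)

split_result = gathered.split(1, dim=0)
print(tuple(split_result[0].shape))  # rank dimension kept: (1, 3)
```

This is the list-vs-split distinction from the terminal session above, packaged as the conversion the reviewer is describing.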
Oh interesting, so double-checking this, it seems like the guard I've added is pretty much always triggered. I believe this is because of these two lines:
https://github.com/ludwig-ai/ludwig/blob/master/ludwig/utils/horovod_utils.py#L63-L66
https://github.com/ludwig-ai/ludwig/blob/master/ludwig/utils/horovod_utils.py#L73-L75
Tensors always have a rank of at least one because of the first code block, and the second code block ensures that the rank is at least two. I'll remove the guard since it is not necessary.
LGTM
Should we add this for 0.7.3?
This PR ensures that AUROC can be computed for the torch DDP strategy. Before this change, ludwig.modules.metric_modules.BinaryAUROCMetric was not being instantiated correctly, meaning that the override for BinaryAUROCMetric.update was not called. This in turn meant that the target was not being correctly cast into a boolean. The fix was directly subclassing torchmetrics.classification.BinaryAUROC. Before this change, we were subclassing torchmetrics.AUROC, which overrides the __new__ method and messes up class inheritance.
Once class inheritance was working correctly, we unveiled an issue with horovod_utils.all_gather_tensors. The issue was previously hidden because of the broken class inheritance, which meant that we were actually using torchmetrics' all_gather_tensors function for binary AUROC computation.
The proposed fix for horovod_utils.all_gather_tensors was to ensure that the returned list of tensors matched the shape of those returned by torchmetrics' all_gather_tensors.