
mm checks and correct int dtypes if A is on the gpu #658

Merged: 11 commits merged into master from bug/657-mm-int-gpu on Dec 4, 2020

Conversation

@coquelin77 (Member) commented Aug 25, 2020

Description

Fixes the bug by casting integer tensors to floats before the mm operation on GPUs.

Issue/s resolved: #657

Changes proposed:

  • for matmul: cast tensors to floats if the devices are GPUs and the dtypes are ints (see the sketch below)
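
A minimal sketch of the idea (the helper name and the exact casting rule are illustrative assumptions, not the actual code in heat/core/linalg/basics.py):

import torch

def _mm_int_gpu_workaround(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # torch (at the time of this PR) had no integer matmul kernel on CUDA,
    # so integer operands are cast to a floating dtype before the local mm
    if a.is_cuda and not a.is_floating_point():
        a = a.double() if a.dtype == torch.int64 else a.float()
    if b.is_cuda and not b.is_floating_point():
        b = b.double() if b.dtype == torch.int64 else b.float()
    return a @ b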

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

@mtar (Collaborator) commented Aug 25, 2020

GPU cluster tests are currently disabled on this Pull Request.

@codecov (bot) commented Aug 25, 2020

Codecov Report

Merging #658 (48004d8) into master (0c9c763) will decrease coverage by 0.12%.
The diff coverage is 79.19%.


@@            Coverage Diff             @@
##           master     #658      +/-   ##
==========================================
- Coverage   97.54%   97.41%   -0.13%     
==========================================
  Files          87       87              
  Lines       18219    18331     +112     
==========================================
+ Hits        17771    17857      +86     
- Misses        448      474      +26     
Impacted Files                           Coverage Δ
heat/core/linalg/basics.py               90.14% <62.65%> (-4.12%) ⬇️
heat/core/linalg/tests/test_basics.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mtar (Collaborator) commented Aug 25, 2020

test this please

@bhagemeier (Member) commented

It works, but the results are now dependent on whether you compute on the GPU or CPU. Do we have other places where we do similar stuff? The actual difference at first sight is the difference in the data type returned (int64 vs. float64). Further down the road, we're losing 11 bits of precision for very large numbers. This may be a corner case, but I wanted to point it out before approving this.

>>> a = ht.array([[0, 1], [2, 3]])
>>> b = ht.array([[4, 5], [6, 7]])
>>> ht.dot(a, b)
DNDarray([[ 6,  7],
          [26, 31]], dtype=ht.int64, device=cpu:0, split=None)
>>> a.gpu()
DNDarray([[0, 1],
          [2, 3]], dtype=ht.int64, device=gpu:0, split=None)
>>> b.gpu()
DNDarray([[4, 5],
          [6, 7]], dtype=ht.int64, device=gpu:0, split=None)
>>> ht.dot(a, b)
DNDarray([[ 6.,  7.],
          [26., 31.]], dtype=ht.float64, device=gpu:0, split=None)

@coquelin77 (Member, author) commented

As far as I know, this is the only place where we do a specific type change for GPUs. However, the torch function required for this operation, namely addmm_cuda, is not implemented for integer types, so we are forced to work around it. Even if we were to do this:

r = a_block @ b_block  # multiply the local blocks
c[c_start0 : c_start0 + mB, c_start1 : c_start1 + nB] += r  # accumulate into the output slice

It still fails with the same error; unfortunately, I do not see a way around this. We can loop back to it once this function is implemented in torch.
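
For reference, a minimal torch-only repro of the underlying failure (assuming a CUDA device is available; the error text is from torch versions current at the time and may differ in newer releases):

>>> import torch
>>> a = torch.arange(4, dtype=torch.int64, device="cuda").reshape(2, 2)
>>> a @ a
RuntimeError: "addmm_cuda" not implemented for 'Long'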

@mtar (Collaborator) commented Sep 1, 2020

The dtype of the returned array changes. As a user, I would expect it to be the same as the input.

@coquelin77 (Member, author) commented

Again, the operation we need is not implemented in torch, so this cannot be done a different way in this specific scenario. It could throw a warning, but that seems excessive.

@mtar (Collaborator) commented Sep 1, 2020

Can you change the type afterwards before returning?

@coquelin77 (Member, author) commented

The precision is already lost even if the type is changed back. We can cast back at the end, but that requires yet another check in all 8 return locations.
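
To make the loss concrete: float64 carries a 53-bit mantissa, so int64 values above 2**53 do not survive the round trip (a small CPU example; only the dtype conversion matters here):

>>> import torch
>>> x = torch.tensor([2**53 + 1])  # 9007199254740993, defaults to int64
>>> x.double().long()              # rounds to the nearest representable float64
tensor([9007199254740992])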

@Markus-Goetz (Member) commented

Even if we do lose precision, we should be consistent on the types. For GPU, cast the output tensor back to int64. The case where you actually do an int64 mm on GPU is relatively seldom to begin with anyway.

@mtar (Collaborator) commented Sep 22, 2020

rerun tests

@coquelin77 (Member, author) commented

This now casts back to the initial promoted dtype for ints on GPUs.
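
A sketch of the resulting pattern (the helper name is illustrative; in the PR the cast-back happens at the actual return locations in heat/core/linalg/basics.py):

import torch

def _mm_cast_back(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # the dtype the caller expects, per normal type promotion
    out_dtype = torch.promote_types(a.dtype, b.dtype)
    if a.is_cuda and not out_dtype.is_floating_point:
        # compute in float (no integer mm kernel on CUDA), then cast back
        return (a.double() @ b.double()).to(out_dtype)
    return a @ b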

@bhagemeier (Member) previously approved these changes Dec 4, 2020:

Looks good now from my point of view.

@bhagemeier (Member) left a comment:

Tests have been added. I have a feeling that codecov is having trouble just now. Therefore, I vote for ignoring the minute (0.01%) decrease in coverage.

@bhagemeier bhagemeier merged commit 6a10e94 into master Dec 4, 2020
@bhagemeier bhagemeier deleted the bug/657-mm-int-gpu branch December 8, 2021 11:10
Development

Successfully merging this pull request may close these issues: matmul fails with ints on GPU (#657).