Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Add the normalize_index function #2677

Conversation

gabrieldemarmiesse
Copy link
Contributor

@gabrieldemarmiesse gabrieldemarmiesse commented May 16, 2024

This PR is some kind of mix between #2386 and #2384 which have issues (aborting) or have too many conflicts because the PR is too big.

This PR solves part of #2251 and #2337

We try here to give the ground work for indexing correctly. This function added can then be used wherever we work with sequences.

Two things I noticed during development:

  1. The debug_assert does not run in unit tests. Is there any way to enable it? We currently have out-of-bounds bugs in our test suite.
  2. The null terminator is causing pain, again, again, and again. Do we have any plans to make it optional when working with String? I opened [Feature Request] Better handling of null terminator in strings #2678 to discuss this.

To avoid to fix those issues in this PR, I used the normalize_index on the __refitem__ of InlineArray which doesn't have widespread use yet and isn't impacted by out-of-bounds bugs.

My recommendation would be to merge this PR then to rebase #2386 and #2384 on it. We should also afterwards fix the out of bounds bugs that can be triggered in the test suite by enabling debug_assert.

The diff might seem big, but no worries, it's mostly the licenses and docstrings :)

@JoeLoser
Copy link
Collaborator

JoeLoser commented May 16, 2024

Re your (1) comment about MOJO_ENABLE_ASSERTIONS, we need to fix up the library and its tests. I see we've regressed in the past month or two. We were building/working fine with assertions enabled, but we don't have internal CI with this mode.

I think we can do something like MOJO_ENABLE_ASSERTIONS=1 ./stdlib/scripts/run-tests.sh (may need to propagate MOJO_ENABLE_ASSERTIONS through in the lit.cfg.py. When I ran this recently in the open source, I think we had 12 test failures or so. We should fix these up and then add CI for it (preferably in a matrix for with and without assertions enabled). Thanks for the reminder; feel free to file a GitHub issue, and if you're so inclined, work on it or someone else can! 😄

@gabrieldemarmiesse
Copy link
Contributor Author

Here you go: #2687 :)

@laszlokindrat
Copy link
Contributor

Indexer will be available in the next nightly, please rebase after that!

@gabrieldemarmiesse
Copy link
Contributor Author

@laszlokindrat Feel free to review when you have the time. The rebase is done :)

Copy link
Contributor

@laszlokindrat laszlokindrat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! Let's have a discussion on some of these design and implementation questions, before we move forward.

stdlib/src/collections/index_normalization.mojo Outdated Show resolved Hide resolved
var normalized_idx = int(index)
if normalized_idx < 0:
normalized_idx += size
var normalized_index = normalize_index[container_name="InlineArray"](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is subtly different behavior: debug_assert can be disabled (in which case there is no bound checking, no overhead, but bad things can happen); the implementation you introduced, however, always performs the bounds check, which guarantees that this becomes more expensive.

@JoeLoser Any thoughts on whether this should be a "safe" method or not?

Copy link
Contributor Author

@gabrieldemarmiesse gabrieldemarmiesse May 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. This is intentional. I followed the instructions from @JoeLoser here: #2386 (comment)

Since having a performant access to the elements in any array is desirable, we could get some inspiration from rust which does have the same challenges. Rust solves it by having safe access with [i] and unsafe access with get_unchecked .

What I would recommend here is to do something similar, as unsafe things must be explicitly unsafe using an appropriate name (from the stdlib vision document). When this PR gets merged, I can open an issue (which can be a good first issue!) to add a second method to the InlineArray unsafe_get() where normalized_index isn't called, and we don't wrap the index when there are negative values. A debug_assert will be added in unsafe_get() for better debuggability.

I'll answer here concerning the heap memory allocation. I think there are a few points to add:

  • The heap allocation will be used only when the "safe" get is used. Thus if the platform doesn't support it, the developper can use unsafe_get()
  • We should strive for good error messages in the general/safe case. It's fine to have bad/no error messages in the other situations.
  • When we have SSO and can control the size of the internal String buffer, we can make it big enough in this precise code path to make sure that the String bytes are always allocated on the stack (buffer) and never on the heap. When this is done, the platforms where heap allocations aren't possible will be able to use the safe method to get an item by index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for clarifying, I'm glad you already had this discussion. I'm okay with the goal here, but it might be risky to merge this since it will introduce overhead for existing (internal) users. I can push this forward, but it would be great if you could fast follow with the implementation of an unsafe accessors.

Copy link
Contributor Author

@gabrieldemarmiesse gabrieldemarmiesse May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it in this PR now instead. Let's not take unnecessary risks with internal breakages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laszlokindrat I added unsafe_get to have fast access :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing, thanks!

stdlib/src/collections/index_normalization.mojo Outdated Show resolved Hide resolved
Copy link
Contributor

@laszlokindrat laszlokindrat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you! Please rebase on the latest nightly, and I'll pull this in.

var normalized_idx = int(index)
if normalized_idx < 0:
normalized_idx += size
var normalized_index = normalize_index[container_name="InlineArray"](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for clarifying, I'm glad you already had this discussion. I'm okay with the goal here, but it might be risky to merge this since it will introduce overhead for existing (internal) users. I can push this forward, but it would be great if you could fast follow with the implementation of an unsafe accessors.

stdlib/src/collections/_index_normalization.mojo Outdated Show resolved Hide resolved
Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>
@gabrieldemarmiesse gabrieldemarmiesse force-pushed the add_basics_of_indexer_and_normalization branch from 9439280 to 5d859b1 Compare May 20, 2024 18:54
Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>
Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>
Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>
@laszlokindrat
Copy link
Contributor

!sync

modularbot pushed a commit that referenced this pull request May 21, 2024
[External] [stdlib] Apply Indexer trait to stdlib containers

The second half of #2384

I have intentionally omitted `static_tuple.mojo` to avoid conflicting
with #2677

Co-authored-by: bgreni <42788181+bgreni@users.noreply.github.com>
Closes #2722
MODULAR_ORIG_COMMIT_REV_ID: 365298938b36c3f4f9ad0067dfcb3856de369875
martinvuyk pushed a commit to martinvuyk/mojo that referenced this pull request May 24, 2024
[External] [stdlib] Apply Indexer trait to stdlib containers

The second half of modularml#2384

I have intentionally omitted `static_tuple.mojo` to avoid conflicting
with modularml#2677

Co-authored-by: bgreni <42788181+bgreni@users.noreply.github.com>
Closes modularml#2722
MODULAR_ORIG_COMMIT_REV_ID: 365298938b36c3f4f9ad0067dfcb3856de369875
modularbot pushed a commit that referenced this pull request May 25, 2024
[External] [stdlib] Add method `unsafe_get` to `List`

See #2677 (comment)
for the background about this method.

We are currently missing methods to access elements in collections
without any bounds checks or wraparound, for maximum performance. I
suggest that we introduce `unsafe_get` and `unsafe_set` to our List-like
collections. This is equivalent to
* https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked
*
https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked_mut
in Rust.

This should prove useful to makers of high performance libraries like
Max :p

We can then make `__getitem__` and `__setitem__` as safe as we want
without impacting the power users.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes #2800
MODULAR_ORIG_COMMIT_REV_ID: 5321c191d83262f240c5d8d5ac77afccaa00a7ba
@laszlokindrat
Copy link
Contributor

@gabrieldemarmiesse Sorry this has been taking so long, but we are running into several internal CI failures related to this and a lot has changed recently. Could you please rebase this on the next nightly? Note that there will be a _get_reference_unsafe methods that probably makes your unsafe_get obsolete.

Signed-off-by: gabrieldemarmiesse <gabrieldemarmiesse@gmail.com>
@gabrieldemarmiesse
Copy link
Contributor Author

@laszlokindrat done!

Av1nag pushed a commit to Av1nag/mojo that referenced this pull request May 27, 2024
[External] [stdlib] Add method `unsafe_get` to `List`

See modularml#2677 (comment)
for the background about this method.

We are currently missing methods to access elements in collections
without any bounds checks or wraparound, for maximum performance. I
suggest that we introduce `unsafe_get` and `unsafe_set` to our List-like
collections. This is equivalent to
* https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked
*
https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked_mut
in Rust.

This should prove useful to makers of high performance libraries like
Max :p

We can then make `__getitem__` and `__setitem__` as safe as we want
without impacting the power users.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes modularml#2800
MODULAR_ORIG_COMMIT_REV_ID: 5321c191d83262f240c5d8d5ac77afccaa00a7ba
Av1nag pushed a commit to Av1nag/mojo that referenced this pull request May 27, 2024
[External] [stdlib] Add method `unsafe_get` to `List`

See modularml#2677 (comment)
for the background about this method.

We are currently missing methods to access elements in collections
without any bounds checks or wraparound, for maximum performance. I
suggest that we introduce `unsafe_get` and `unsafe_set` to our List-like
collections. This is equivalent to
* https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked
*
https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked_mut
in Rust.

This should prove useful to makers of high performance libraries like
Max :p

We can then make `__getitem__` and `__setitem__` as safe as we want
without impacting the power users.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes modularml#2800
MODULAR_ORIG_COMMIT_REV_ID: 5321c191d83262f240c5d8d5ac77afccaa00a7ba

Signed-off-by: Avinag <udayagiriavinag@gmail.com>
Av1nag pushed a commit to Av1nag/mojo that referenced this pull request May 27, 2024
[External] [stdlib] Add method `unsafe_get` to `List`

See modularml#2677 (comment)
for the background about this method.

We are currently missing methods to access elements in collections
without any bounds checks or wraparound, for maximum performance. I
suggest that we introduce `unsafe_get` and `unsafe_set` to our List-like
collections. This is equivalent to
* https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked
*
https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked_mut
in Rust.

This should prove useful to makers of high performance libraries like
Max :p

We can then make `__getitem__` and `__setitem__` as safe as we want
without impacting the power users.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes modularml#2800
MODULAR_ORIG_COMMIT_REV_ID: 5321c191d83262f240c5d8d5ac77afccaa00a7ba
Av1nag pushed a commit to Av1nag/mojo that referenced this pull request May 27, 2024
[External] [stdlib] Add method `unsafe_get` to `List`

See modularml#2677 (comment)
for the background about this method.

We are currently missing methods to access elements in collections
without any bounds checks or wraparound, for maximum performance. I
suggest that we introduce `unsafe_get` and `unsafe_set` to our List-like
collections. This is equivalent to
* https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked
*
https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked_mut
in Rust.

This should prove useful to makers of high performance libraries like
Max :p

We can then make `__getitem__` and `__setitem__` as safe as we want
without impacting the power users.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes modularml#2800
MODULAR_ORIG_COMMIT_REV_ID: 5321c191d83262f240c5d8d5ac77afccaa00a7ba

Signed-off-by: Avinag <udayagiriavinag@gmail.com>
@laszlokindrat
Copy link
Contributor

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label May 27, 2024
@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added merged-internally Indicates that this pull request has been merged internally merged-externally Merged externally in public mojo repo labels May 27, 2024
modularbot pushed a commit that referenced this pull request May 28, 2024
[External] [stdlib] Add the `normalize_index` function

This PR is some kind of mix between
#2386 and
#2384 which have issues (aborting)
or have too many conflicts because the PR is too big.

This PR solves part of #2251 and
#2337

We try here to give the ground work for indexing correctly. This
function added can then be used wherever we work with sequences.

Two things I noticed during development:
1) The `debug_assert` does not run in unit tests. Is there any way to
enable it? We currently have out-of-bounds bugs in our test suite.
2) The null terminator is causing pain, again, again, and again. Do we
have any plans to make it optional when working with String? I opened
#2678 to discuss this.

To avoid to fix those issues in this PR, I used the `normalize_index` on
the `__refitem__` of `InlineArray` which doesn't have widespread use yet
and isn't impacted by out-of-bounds bugs.

My recommendation would be to merge this PR then to rebase
#2386 and
#2384 on it. We should also
afterwards fix the out of bounds bugs that can be triggered in the test
suite by enabling debug_assert.

The diff might seem big, but no worries, it's mostly the licenses and
docstrings :)

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes #2677
MODULAR_ORIG_COMMIT_REV_ID: 66e7121a6a333c16284eb33a89eb85c034c296c3
@modularbot
Copy link
Collaborator

Landed in cb2e75b! Thank you for your contribution 🎉

@modularbot modularbot closed this May 28, 2024
@ConnorGray
Copy link
Collaborator

Now that this has been merged, we should start using it more widely. @gabrieldemarmiesse @laszlokindrat is there any reason I might have missed that means I should hold off on filing an issue for using this more widely?

@gabrieldemarmiesse
Copy link
Contributor Author

I think we can open an issue. At the time of merge there were a few PRs in flight, we didn't know if there was work left to do on the subject.

Now that the dust has settled, it's true that it would be a good issue to beginners. Let's not forget in the issue to mention that we would like on function change per PR to avoid a big PR changing all the indexing for the stdlib

@ConnorGray
Copy link
Collaborator

Thanks Gabriel. Filed #2948 to track this. From grepping through the stdlib code it looks like there aren't that many uses of this pattern overall, so the size of resulting PRs may end up not to be too huge. I included a note though asking contributors not to do more than 4-5 call-sites in one PR.

modularbot pushed a commit that referenced this pull request Jun 7, 2024
[External] [stdlib] Apply Indexer trait to stdlib containers

The second half of #2384

I have intentionally omitted `static_tuple.mojo` to avoid conflicting
with #2677

Co-authored-by: bgreni <42788181+bgreni@users.noreply.github.com>
Closes #2722
MODULAR_ORIG_COMMIT_REV_ID: 365298938b36c3f4f9ad0067dfcb3856de369875
modularbot pushed a commit that referenced this pull request Jun 7, 2024
[External] [stdlib] Add method `unsafe_get` to `List`

See #2677 (comment)
for the background about this method.

We are currently missing methods to access elements in collections
without any bounds checks or wraparound, for maximum performance. I
suggest that we introduce `unsafe_get` and `unsafe_set` to our List-like
collections. This is equivalent to
* https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked
*
https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked_mut
in Rust.

This should prove useful to makers of high performance libraries like
Max :p

We can then make `__getitem__` and `__setitem__` as safe as we want
without impacting the power users.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes #2800
MODULAR_ORIG_COMMIT_REV_ID: 5321c191d83262f240c5d8d5ac77afccaa00a7ba
modularbot pushed a commit that referenced this pull request Jun 7, 2024
[External] [stdlib] Add the `normalize_index` function

This PR is some kind of mix between
#2386 and
#2384 which have issues (aborting)
or have too many conflicts because the PR is too big.

This PR solves part of #2251 and
#2337

We try here to give the ground work for indexing correctly. This
function added can then be used wherever we work with sequences.

Two things I noticed during development:
1) The `debug_assert` does not run in unit tests. Is there any way to
enable it? We currently have out-of-bounds bugs in our test suite.
2) The null terminator is causing pain, again, again, and again. Do we
have any plans to make it optional when working with String? I opened
#2678 to discuss this.

To avoid to fix those issues in this PR, I used the `normalize_index` on
the `__refitem__` of `InlineArray` which doesn't have widespread use yet
and isn't impacted by out-of-bounds bugs.

My recommendation would be to merge this PR then to rebase
#2386 and
#2384 on it. We should also
afterwards fix the out of bounds bugs that can be triggered in the test
suite by enabling debug_assert.

The diff might seem big, but no worries, it's mostly the licenses and
docstrings :)

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes #2677
MODULAR_ORIG_COMMIT_REV_ID: 66e7121a6a333c16284eb33a89eb85c034c296c3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants