Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

part: added least func #96

Merged
merged 2 commits into from
Jun 15, 2022
Merged

part: added least func #96

merged 2 commits into from
Jun 15, 2022

Conversation

thorfour
Copy link
Contributor

@thorfour thorfour commented Jun 11, 2022

Move least of a part into a function. Lifted out all the calls in table where we fetched the least to use this helper function.

Also changed add part calls in addparttogranule to use the internal form that takes the least row so it doesn't need to be called again.

This addresses #86

@thorfour thorfour marked this pull request as ready for review June 11, 2022 04:02
@brancz
Copy link
Member

brancz commented Jun 14, 2022

So if I'm understanding this right, it all boils down to parquet readers not being safe for concurrent reading (because the values buffer is concurrently reused)?

@thorfour
Copy link
Contributor Author

So if I'm understanding this right, it all boils down to parquet readers not being safe for concurrent reading (because the values buffer is concurrently reused)?

That's my understanding

@brancz
Copy link
Member

brancz commented Jun 14, 2022

In that case, this seems like we're maybe fixing one specific case, but not the general problem and other cases where this data race can happen. I feel like we should then instead create a new *parquet.File instead of reusing one instance. If this ends up being too expensive we should use a sync.Pool to reuse them safely.

@thorfour
Copy link
Contributor Author

In that case, this seems like we're maybe fixing one specific case, but not the general problem and other cases where this data race can happen. I feel like we should then instead create a new *parquet.File instead of reusing one instance. If this ends up being too expensive we should use a sync.Pool to reuse them safely.

That may be a solution. I want to write another read based unit test to see if it exacerbates this problem. That said, I think this change is still good because it lifts and shifts code that makes the code overall more readable.

@brancz
Copy link
Member

brancz commented Jun 15, 2022

That makes sense. Happy with that!

@brancz brancz merged commit 6d2b52a into main Jun 15, 2022
@brancz brancz deleted the fix-data-race-2 branch June 15, 2022 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants