
Attempting to fix seemingly random failures on CI #224

Merged: 15 commits merged into main on Jan 11, 2024

Conversation

dotsdl (Member) commented Jan 9, 2024

Unfortunately, I am unable to reproduce failures like this locally. My working hypothesis is that our use of fixtures is impacting us in hard-to-pin-down ways, so I am scoping them down gradually here.

codecov-commenter commented Jan 9, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison: base (fe8551d) 82.19% compared to head (6b451a3) 81.75%.

File | Patch % | Missing Lines
alchemiscale/interface/client.py | 50.00% | 1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
- Coverage   82.19%   81.75%   -0.45%     
==========================================
  Files          23       23              
  Lines        2937     2937              
==========================================
- Hits         2414     2401      -13     
- Misses        523      536      +13     


dotsdl (Member, Author) commented Jan 10, 2024

After much gnashing of teeth, I think I have narrowed down the cause of the random CI failures to a race condition. My hypothesis, which has yet to be invalidated:

  • the call to user_client.create_network completes, but when user_client.get_network_transformations is called later, no Transformation ScopedKeys are returned, suggesting that the server-side query doesn't see them yet
  • adding a while loop to call get_network_transformations repeatedly until Transformations are populated appears to address the issue
    • mostly: there are cases where the Transformation objects are present, but their connecting nodes aren't fully populated yet
  • it remains unclear how this race condition could arise; it may be that the calls to Neo4j that create the AlchemicalNetwork via py2neo return before Neo4j has finished creating all nodes and relationships in the DB, and the low amount of processing power allocated to the CI worker means this creation may still be incomplete when the next call in the tests occurs

The solution may be to try pulling the full AlchemicalNetwork, retrying until it succeeds, and only then proceed to the remaining tests. A code comment explaining why this is necessary would be sufficient.
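The polling workaround described above can be sketched as follows. This is a hypothetical illustration: the helper name, timeout, and interval are mine, not the project's code; `user_client.get_network_transformations` follows the name used in this discussion.

```python
import time

def wait_for_transformations(user_client, network_sk,
                             timeout=30.0, interval=0.5):
    """Poll until server-side Transformation creation becomes visible.

    The timeout ensures a genuine failure surfaces as an error instead
    of the test hanging forever. Values here are illustrative.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        tf_sks = user_client.get_network_transformations(network_sk)
        if tf_sks:
            return tf_sks
        time.sleep(interval)
    raise TimeoutError(
        f"Transformations for {network_sk} not visible after {timeout}s"
    )
```

The same pattern could wrap pulling the full AlchemicalNetwork, retrying until all nodes and relationships are populated before the remaining tests run.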

dotsdl (Member, Author) commented Jan 11, 2024

6 successful attempts in a row! I think we might be good. 😁

@dotsdl dotsdl merged commit 697a936 into main Jan 11, 2024
4 checks passed
@dotsdl dotsdl deleted the ci-stochastic-failure-fix branch January 11, 2024 19:00