
adjust test_parallel bound #3281

Merged
merged 2 commits into piskvorky:develop on Mar 22, 2022

Conversation

austereantelope
Contributor

Hi,

A few days ago I created PR #3278, where I proposed tightening test_parallel's assertion bound (self.assertLess(neighbor_rank, 20)) from 20 to 2. I verified the new bound locally by running the test 500 times to make sure it is not flaky (>99% pass rate).

However, after the PR was merged I noticed some CI runs failing on this test. After taking a closer look at the test and my experiment script, I realized the issue is caused by the parallelism in the test: it uses 4 workers for training, and the asserted value can be affected by that parallelism. In my script, however, I ran each test under taskset -c coreAffinity, which pins the test to a single core and therefore misses the variance caused by the 4 workers.
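For reference, a minimal sketch of this kind of re-run measurement is below. The pytest node ID and the taskset toggle are assumptions for illustration, not taken from my actual experiment script.

```python
# Sketch only: estimate a single test's pass rate by re-running it many times,
# optionally pinned to one CPU core with taskset (as in the original experiment).
# The test node ID below is an assumption -- adjust it to wherever test_parallel lives.
import subprocess

TEST_ID = "gensim/test/test_word2vec.py::TestWord2VecModel::test_parallel"  # assumed path

def estimate_pass_rate(runs=500, pin_to_core=None):
    """Run the test `runs` times and return the fraction of passing runs."""
    passed = 0
    for _ in range(runs):
        cmd = ["python", "-m", "pytest", "-q", TEST_ID]
        if pin_to_core is not None:
            # taskset restricts the process to one core, which is exactly what hid
            # the scheduling variance introduced by the 4 training workers.
            cmd = ["taskset", "-c", str(pin_to_core)] + cmd
        passed += subprocess.run(cmd, capture_output=True).returncode == 0
    return passed / runs

# estimate_pass_rate(500)                 # full parallelism, closer to CI behaviour
# estimate_pass_rate(500, pin_to_core=0)  # pinned to core 0, as in the first experiment
```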

I re-ran the experiment without taskset to get a more accurate bound for the test. Below I include the experiment results.

To improve this test, I propose adjusting the bound to 5 (the blue dotted line in the plot). The new bound has a catch rate of 0.78 (a +0.2 increase compared to the original bound of 20) while still keeping a >99% pass rate (the test is not flaky: I ran the updated test 500 times and observed >99% passes). I think this is a good balance between improving the bug-detection ability of the test and keeping its flakiness low.
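To make the two metrics concrete, here is a small sketch with made-up neighbor_rank samples (the real distributions are in the plot above) showing how a candidate bound scores on pass rate versus catch rate:

```python
# Pass rate: how often correct code stays under the bound (we want this near 1).
# Catch rate: how often a seeded bug pushes neighbor_rank to or above the bound.
def pass_rate(ranks_from_correct_code, bound):
    return sum(r < bound for r in ranks_from_correct_code) / len(ranks_from_correct_code)

def catch_rate(ranks_from_buggy_code, bound):
    return sum(r >= bound for r in ranks_from_buggy_code) / len(ranks_from_buggy_code)

# Hypothetical samples, purely to show the shape of the trade-off for a bound of 5:
correct = [0, 1, 1, 2, 0, 3, 1, 4, 0, 1]
buggy = [3, 9, 6, 12, 2, 8, 15, 4, 7, 11]
print(pass_rate(correct, 5), catch_rate(buggy, 5))  # -> 1.0 0.7
```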

@piskvorky @gojomo @mpenkov Do you guys think this makes sense? Please let me know if this looks good or if you have any other suggestions.

@codecov

codecov bot commented Dec 25, 2021

Codecov Report

Merging #3281 (de90480) into develop (ac3bbcd) will decrease coverage by 2.06%.
The diff coverage is n/a.

❗ Current head de90480 differs from pull request most recent head 4934788. Consider uploading reports for the commit 4934788 to get more accurate results.

@@             Coverage Diff             @@
##           develop    #3281      +/-   ##
===========================================
- Coverage    81.43%   79.37%   -2.07%     
===========================================
  Files          122       68      -54     
  Lines        21052    11781    -9271     
===========================================
- Hits         17144     9351    -7793     
+ Misses        3908     2430    -1478     
Impacted Files Coverage Δ
gensim/models/ldamulticore.py 75.29% <0.00%> (-15.62%) ⬇️
gensim/scripts/glove2word2vec.py 76.19% <0.00%> (-7.15%) ⬇️
gensim/corpora/wikicorpus.py 93.75% <0.00%> (-1.04%) ⬇️
gensim/matutils.py 77.23% <0.00%> (-0.90%) ⬇️
gensim/similarities/docsim.py 23.95% <0.00%> (-0.76%) ⬇️
gensim/models/rpmodel.py 89.47% <0.00%> (-0.53%) ⬇️
gensim/utils.py 71.86% <0.00%> (-0.12%) ⬇️
gensim/corpora/dictionary.py 94.17% <0.00%> (-0.09%) ⬇️
gensim/models/hdpmodel.py 71.27% <0.00%> (-0.08%) ⬇️
gensim/models/ldamodel.py 87.27% <0.00%> (-0.07%) ⬇️
... and 90 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@austereantelope
Contributor Author

@piskvorky Please let me know if this looks good to you

@gojomo
Collaborator

gojomo commented Jan 12, 2022

I appreciate the extra rigor & attention!

Still, my concerns from #3278 apply. The CI virtual machines/containers/whatever have historically had a higher propensity to trigger even rare, threading-related failures. Even a 99% or 99.9% pass rate, on a single test, will be a recurring annoyance to people who suffer random failures after making unrelated changes. And we have multiple tests with similar risks - so (for example) if we were to tune 10 independent tests to each have "99.5%" reliability, running the full test suite would still fail about 5% of the time.
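To spell out that arithmetic (using the hypothetical numbers from the example above):

```python
# If 10 independent tests each pass with probability 0.995, the suite is green
# only when all 10 pass, so the failure probability compounds:
p_single = 0.995
p_suite_fail = 1 - p_single ** 10   # ≈ 0.049
print(f"{p_suite_fail:.1%}")        # -> 4.9%, i.e. roughly 1 red build in 20
```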

So: if rigorously tuning individual tests (which is a great idea!), the target should be something like 99.999% reliability for each - ideally as confirmed on the CI systems themselves.

@austereantelope
Contributor Author

@gojomo Thanks for your feedback. I ran the test locally 10,000 times. The bound I suggested (5) has a pass rate of 0.9963; we can achieve a 0.9999 pass rate if we set the bound to 8. Do you think that is a reasonable pass rate for this test?
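One way such a bound can be chosen from the observed ranks is to take the smallest value that meets a target pass rate; a sketch with made-up data (not the actual measurements or the script I used):

```python
def smallest_bound(observed_ranks, target_pass_rate):
    """Smallest assertLess() bound whose empirical pass rate meets the target."""
    n = len(observed_ranks)
    for bound in range(1, max(observed_ranks) + 2):
        if sum(rank < bound for rank in observed_ranks) / n >= target_pass_rate:
            return bound

# Toy usage with made-up ranks (the real data came from the 10,000 local runs):
ranks = [0, 1, 1, 2, 0, 3, 1, 4, 0, 7]
print(smallest_bound(ranks, target_pass_rate=0.90))  # -> 5 (9 of the 10 ranks are below 5)
```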

@piskvorky
Owner

piskvorky commented Jan 14, 2022

@austereantelope I believe the underlying concern is capturing variance that lies outside what merely re-running the tests more times, in a fixed environment, can reveal.

For instance, I'm sure the existing tests worked reliably well on the computers of the people who originally wrote them. But then the tests gave us trouble when deployed elsewhere, such as the CI system, with different OSes / CPUs / thread scheduling / Travis vs Github Actions / whatever. And then we had to add hacks to relax their bounds or skip them altogether.

Or we'd have to make the test data and/or algorithm genuinely more robust to such variance (which can be a hard academic task, and way more work).

To be clear, your PR here is interesting and a net benefit to Gensim already, in my estimation. The worry is just that something similar may happen again.

@austereantelope
Contributor Author

@piskvorky I agree that the underlying variability we see in this test is due to the parallelism (the test is called test_parallel) and that it behaves differently on different machines (developer vs. CI). However, the current bound of 2 is too tight even on my development machine. Unfortunately, I don't think we can feasibly test what the appropriate bound should be on CI (running the test 100,000 times across all possible CI configurations).

The bound I suggested in the initial version of this PR (5, pass rate of 0.9963) or the one from the 10,000 local runs (8, pass rate of 0.9999) might be a good middle ground for now, and I can continue to monitor CI failures to see if we need to loosen the bound further.

@piskvorky piskvorky added this to the Next release milestone Feb 18, 2022
@mpenkov
Collaborator

mpenkov commented Mar 22, 2022

Merging. Thank you @austereantelope!

@mpenkov mpenkov merged commit acd8308 into piskvorky:develop Mar 22, 2022