Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestParserReadNetworkingConfig could be randomly failing. #9382

Open
azr opened this issue Jun 9, 2020 · 7 comments
Open

TestParserReadNetworkingConfig could be randomly failing. #9382

azr opened this issue Jun 9, 2020 · 7 comments
Labels
bug need-more-tests tech-debt Issues and pull requests related to addressing technical debt or improving the codebase

Comments

@azr
Copy link
Contributor

azr commented Jun 9, 2020

A circle-ci build on linux failed, then worked after a retry:

2020/06/09 13:17:12 Invalid command : [VERSION=1,0]
--- FAIL: TestParserReadNetworingConfig (0.00s)
    driver_parser_test.go:1016: error parsing networking-example: Unexpected format for VERSION entry : [answer VNET_1_DHCP yes]

It may be worth investigating.

https://circleci.com/gh/hashicorp/packer/58370

@azr azr added bug need-more-tests tech-debt Issues and pull requests related to addressing technical debt or improving the codebase labels Jun 9, 2020
@azr
Copy link
Contributor Author

azr commented Jun 9, 2020

Hey @arizvisa ! About this issue, do you still have the context in mind ? It looks like to me like a race-condition, do you have some time to take a look at this ? Could this cause some trouble down the line too ? Cheers !

@SwampDragons
Copy link
Contributor

I think this must be a major resource constraint issue because I've run this test over a hundred times, with lots of other stuff going on in my env, to try to catch it and I haven't managed to.

@arizvisa
Copy link
Contributor

Oh yeah. It's totally a race condition. Right here at the beginning of the code in ReadNetworkingConfig, there's two consumers for the same channel (rows). I didn't have fusion when I wrote this a long time ago and plus was a newb at golang.

        consumeFile(fd)
        tokenized := tokenizeNetworkingConfig(fromfile)
        rows := splitNetworkingConfig(tokenized)
        entries := parseNetworkingConfig(rows)

        // parse the version
        parsed_version, err := networkingReadVersion(<-rows)
        if err != nil {
                return NetworkingConfig{}, err
        }

Having tests sure is awesome for catching this. It's crazy to think that by sheer luck this race condition hasn't really manifested itself until this refactor.

I'll open up a PR in a sec. I think I can read the version and then hand off the channel to parseNetworkingConfig after confirming it (which appears to be the right thing to do anyways). Although, I'm not quite sure how to reproduce this as @SwampDragons mentioned. So I can only hope my logic is sound... Is there some sane way to repro this to ensure it's squashed?

@arizvisa
Copy link
Contributor

Okay. I think PR #9387 should fix this race.

I tried to keep things linearly defined so that there's only one consumer of each channel at a given time. I went through the other parsers to see if this pattern affected anything else but couldn't spot anything.

The reading of the version isn't in a goro so it should block until it's able to read it. Once the version has been confirmed, then we set the parser off which is free to consume rows as much as it needs to.

@arizvisa
Copy link
Contributor

In terms of reproducing this, go test -race doesn't seem to do anything. Reason being is that it's oriented around memory reads/writes due to being based on AddressSanitizer (ASAN). So in essense only memory accesses are instrumented in order to keep track of more than one thread accessing the same piece of memory. For some reason this doesn't affect channels which I thought used a lock and a buffer of some kind which should be a memory access between two coop'd threads...

But as a general way to reproduce this and avoid this happening again, the only thing I can think of is to test after each fork/clone of a thread. Maybe sleep or blocking-syscall before reading a channel in the parent, so that a child will get woken up by the OS while the parent is blocking. As an example, if you put a time.Sleep(1.0) before the call to networkingReadVersion, you can manifest the error as the children (parseNetworkingConfig) are woken up to consume said channel.

Typically, your OS won't wake up the child thread until your parent thread blocks on some call...so something to maybe think about when dealing with goro-heavy code (which seems to be all of my code, hah), would be to do a test where I sleep in the parent, and sleep in the child to ensure that things work as they're actually intended.

@arizvisa
Copy link
Contributor

Anyways, sorry bout that. Saw you guys just did your release. :-/

@azr
Copy link
Contributor Author

azr commented Jun 10, 2020

Hey @arizvisa thanks for fixing this so quickly ! I think this case should not happen too much. I'll review the PR carefully now 🙂 !

The thing with chans is that they are meant to be used across processes so simply using chans will never trigger the race detector. In other words the race detector will detect unprotected 'write + read' calls, but a chan is a mean of synchronisation across goroutines and sometimes as avoidance against race-conditions.

Another thing about go testing is that it caches the results of tests, so if a test passes once, no mater how often you will re-run it the go toolchain will simply say 'success' without running it. This is especially dangerous for randomly failing tests because it sorts of irons them out. I think a specific option for it is coming soon but for now running go test -count=1 ... ( from golang/go#24573 ) will remove caching.

SwampDragons added a commit that referenced this issue Jun 10, 2020
Fixed a race in the ReadNetworkingConfig implementation from the vmware builders
@nywilken nywilken changed the title TestParserReadNetworingConfig could be randomly failing. TestParserReadNetworkingConfig could be randomly failing. Jul 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug need-more-tests tech-debt Issues and pull requests related to addressing technical debt or improving the codebase
Projects
None yet
Development

No branches or pull requests

3 participants