
runtime: possible memory corruption #49075

Open
katiehockman opened this issue Oct 19, 2021 · 44 comments

@katiehockman (Member) commented Oct 19, 2021

OSS-Fuzz reported an issue a few weeks ago that we suspect is memory corruption caused by the runtime. This started on August 16th, so it is likely a Go 1.17 issue.

A slice-bounds-out-of-range panic is being reported from calls to regexp.MustCompile(`,\s*`).Split.
However, this is not reproducible with the inputs provided by OSS-Fuzz, so we suspect something else is going on.

Below are some of the panic logs:

panic: runtime error: slice bounds out of range [:18416820578376] with length 59413

goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c0001c801f, 0x76dbc0}, 0xffffffffffffffff)
	regexp/regexp.go:1266 +0x61c
github.com/google/gonids.(*Rule).option(0x10c000068000, {0x100c000096970, {0x10c0001c8016, 0x8}}, 0x10c00029a040)
	github.com/google/gonids/parser.go:675 +0x36cf
github.com/google/gonids.parseRuleAux({0x10c0001c8000, 0x630000350400}, 0x0)
	github.com/google/gonids/parser.go:943 +0x6b3
github.com/google/gonids.ParseRule(...)
	github.com/google/gonids/parser.go:972
github.com/google/gonids.FuzzParseRule({0x630000350400, 0x0, 0x10c000000601})
	github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
	./main.1689543426.go:21

panic: runtime error: slice bounds out of range [628255583:13888]

goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c00033601f, 0x76dbc0}, 0xffffffffffffffff)
	regexp/regexp.go:1266 +0x617
github.com/google/gonids.(*Rule).option(0x10c00026cc00, {0x100c00026e190, {0x10c000336016, 0x7}}, 0x10c0001a4300)
	github.com/google/gonids/parser.go:675 +0x36cf
github.com/google/gonids.parseRuleAux({0x10c000336000, 0x62f00064a400}, 0x0)
	github.com/google/gonids/parser.go:943 +0x6b3
github.com/google/gonids.ParseRule(...)
	github.com/google/gonids/parser.go:972
github.com/google/gonids.FuzzParseRule({0x62f00064a400, 0x0, 0x10c000000601})
	github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
	./main.1689543426.go:21
AddressSanitizer:DEADLYSIGNAL

panic: runtime error: slice bounds out of range [473357973:29412]

goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c0002a001f, 0x76dbc0}, 0xffffffffffffffff)
	regexp/regexp.go:1266 +0x617
github.com/google/gonids.(*Rule).option(0x10c0001b0180, {0x100c000280100, {0x10c0002a0016, 0xb}}, 0x10c0001ae040)
	github.com/google/gonids/parser.go:675 +0x36cf
github.com/google/gonids.parseRuleAux({0x10c0002a0000, 0x632000930800}, 0x0)
	github.com/google/gonids/parser.go:943 +0x6b3
github.com/google/gonids.ParseRule(...)
	github.com/google/gonids/parser.go:972
github.com/google/gonids.FuzzParseRule({0x632000930800, 0x0, 0x10c000000601})
	github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
	./main.1689543426.go:21

From rsc@:

The relevant code is processing the [][]int returned from regexp.(*Regexp).FindAllStringIndex.
That [][]int is prepared by repeated append:

func (re *Regexp) FindAllStringIndex(s string, n int) [][]int {
    if n < 0 {
        n = len(s) + 1
    }
    var result [][]int
    re.allMatches(s, nil, n, func(match []int) {
        if result == nil {
            result = make([][]int, 0, startSize)
        }
        result = append(result, match[0:2])
    })
    return result
}

Each of the match[0:2] being appended is prepared in regexp.(*Regexp).doExecute by:

dstCap = append(dstCap, m.matchcap...)

appending to a zero-length, non-nil slice to copy m.matchcap.

And each of the m.matchcap is associated with the *regexp.machine m, which is kept in a sync.Pool for reuse.

The specific corruption is that the integers in the [][]int are clearly not valid indexes (they look like pointers),
which suggests that either one of the appends is accidentally losing the reference during GC,
or something in sync.Pool is wonky.
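
For illustration, the reuse-and-copy pattern described above looks roughly like the following. This is a simplified sketch with stand-in names (machine, machinePool, findIndex are placeholders), not the actual regexp internals:

package main

import (
	"fmt"
	"sync"
)

// machine stands in for regexp's internal matching machine, whose
// matchcap slice is reused across matches via a sync.Pool.
type machine struct {
	matchcap []int
}

var machinePool = sync.Pool{
	New: func() interface{} { return &machine{matchcap: make([]int, 2)} },
}

// findIndex mimics the shape of doExecute: take a pooled machine, record
// the match bounds in matchcap, copy them out with append, return the
// machine to the pool, and hand back the copied bounds.
func findIndex() []int {
	m := machinePool.Get().(*machine)
	m.matchcap[0], m.matchcap[1] = 3, 7 // pretend these are match bounds

	var dstCap []int
	dstCap = append(dstCap, m.matchcap...) // the copy referred to above

	machinePool.Put(m)
	return dstCap[0:2]
}

func main() {
	fmt.Println(findIndex()) // [3 7]
}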

This could also be something strange that OSS-Fuzz is doing, and doesn't necessarily represent a real-world use case.

/cc @golang/security

@randall77 (Contributor) commented Oct 19, 2021

How often does this happen? Is there any way we could reproduce?
If reproducible, we could turn off the sync.Pool usage and see if anything changes.

@cherrymui cherrymui added this to the Go1.18 milestone Oct 19, 2021
@rsc (Contributor) commented Oct 20, 2021

@catenacyber says it is still happening multiple times a day on OSS-Fuzz.

Philippe, do you have any hints about an easy way to get a reproduction case on our own machines?
I looked briefly at the instructions for running oss-fuzz itself and they were a bit daunting.

@catenacyber commented Oct 20, 2021

How often does this happen?

Around 50 times a day on oss-fuzz

Is there any way we could reproduce?

I did not manage to reproduce it myself, did not try very hard though...

I looked briefly at the instructions for running oss-fuzz itself and they were a bit daunting.

Well, the hard thing is that this bug does not reproduce for a specific input.
But running it should be ok cf https://google.github.io/oss-fuzz/getting-started/new-project-guide/#testing-locally
That is

  • install docker
  • cd /path/to/oss-fuzz
  • python infra/helper.py build_image gonids
  • python infra/helper.py build_fuzzers gonids
  • python infra/helper.py run_fuzzer --corpus-dir=<path-to-temp-corpus-dir> gonids fuzz_parserule

Then, I guess, you need to wait an hour and relaunch the fuzzer if it did not trigger the bug, repeating until it does.

we could turn off the sync.Pool usage and see if anything changes.

Is there some environment variable to do so ?

@catenacyber commented Oct 20, 2021

Maybe oss-fuzz uses -fork=2 as an extra argument to run_fuzzer

@randall77 (Contributor) commented Oct 20, 2021

we could turn off the sync.Pool usage and see if anything changes.

Is there some environment variable to do so ?

No, you'd have to edit the code to replace pool allocations with new or make.

@catenacyber commented Oct 20, 2021

Is there some environment variable to do so ?

No, you'd have to edit the code to replace pool allocations with new or make.

So, rebuild the standard library?

@randall77 (Contributor) commented Oct 20, 2021

Just edit it; the rebuild will be automatic.
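
For illustration, the kind of edit meant here might look something like this. It is a hypothetical sketch; matchPool, machine, and the function names are stand-ins, not the exact identifiers in regexp:

package main

import "sync"

// machine and matchPool stand in for regexp's pooled matching state.
type machine struct {
	matchcap []int
}

var matchPool sync.Pool

// getMachine shows the pooled allocation next to the pool-free alternative.
func getMachine() *machine {
	// Pooled version (roughly what the library does today):
	//   if m, ok := matchPool.Get().(*machine); ok {
	//       return m
	//   }
	// Pool-free version for debugging: always allocate a fresh machine.
	return &machine{matchcap: make([]int, 0, 2)}
}

// putMachine likewise skips the pool so nothing is ever reused.
func putMachine(m *machine) {
	// Pooled version: matchPool.Put(m)
	_ = m // pool-free version: drop it and let the GC reclaim it
}

func main() {
	m := getMachine()
	putMachine(m)
}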

@catenacyber commented Oct 20, 2021

So, I opened google/oss-fuzz#6623 with regexp.go not using the sync package.

@rsc (Contributor) commented Oct 20, 2021

google/oss-fuzz#6623 looks worth a shot. Thanks.

@catenacyber commented Oct 22, 2021

It looks like the bug is still happening but much less often.

Here is one of the latest stack traces:

panic: runtime error: slice bounds out of range [:107271103185152] with length 45246

goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c00010801f, 0x76dbc0}, 0xffffffffffffffff)
	regexp/regexp.go:1260 +0x61c
github.com/google/gonids.(*Rule).option(0x10c00034a180, {0x100c0007c4350, {0x10c000108016, 0x6}}, 0x10c000334080)
	github.com/google/gonids/parser.go:675 +0x36cf
github.com/google/gonids.parseRuleAux({0x10c000108000, 0x62e0000d8400}, 0x0)
	github.com/google/gonids/parser.go:943 +0x6b3
github.com/google/gonids.ParseRule(...)
	github.com/google/gonids/parser.go:972
github.com/google/gonids.FuzzParseRule({0x62e0000d8400, 0x0, 0x10c000000601})
	github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
	./main.3230035416.go:21
AddressSanitizer:DEADLYSIGNAL

regexp.go:1260 seems to confirm that this is the modified regexp.go file without sync.Pool, right?

Any more clues?
Any more debug assertions to insert into regexp.go?

@josharian (Contributor) commented Oct 22, 2021

This started on August 16th, so is likely a Go 1.17 issue.

Could you bisect to a particular commit that introduced the corruption?

@catenacyber commented Oct 22, 2021

Could you bisect to a particular commit that introduced the corruption?

oss-fuzz uses the latest Go release, so it switched from 1.16.x to 1.17 on August 16th, but we do not know exactly which commit in this major release introduced the buggy behavior...

@randall77 (Contributor) commented Oct 22, 2021

I have another possible theory:

regexp.(*Regexp).Split(0x10c0000b2640, {0x10c0001c801f, 0x76dbc0}, 0xffffffffffffffff)
	regexp/regexp.go:1266 +0x61c

Is there any way you can get at the regexp and the contents of the string? In this case, 0x10c0000b2640 is the address of a Regexp, the first field of which is a string I'd like to see. Also, we'd need the very long string at address 0x10c0001c801f for 0x76dbc0 bytes (7MB+!).
If you have any way for us to reproduce that execution state, we could grab those values ourselves.

My theory is that regexp is using the code in internal/bytealg pretty hard, and maybe it's tripping up on some weird corner case. The string pointers always end in 0x1f, which is maybe one of those corner cases. That might explain the intermittency - if the same test case gets allocated in a different place, it might not trigger the bug.
Not sure why we would hit it only for 1.17. https://go-review.googlesource.com/c/go/+/310184 seems to be the only CL of note in internal/bytealg for 1.17, and it looks reasonable to me.

@catenacyber commented Oct 22, 2021

I put the hex-encoded version of one string here:
https://catenacyber.fr/stringhex.txt

@dgryski (Contributor) commented Oct 22, 2021

If it's an alignment issue, could we try to reproduce by mmap'ing a large block and creating a string with all the different possible alignments / ending pointer bytes?

@randall77 (Contributor) commented Oct 22, 2021

What is the regexp that is being used?

@catenacyber commented Oct 22, 2021

regexp.MustCompile(`,\s*`).Split(fuzzedinput, -1)
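
For reference, a minimal standalone version of that Split call, with a short made-up input standing in for the fuzzed one:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`,\s*`)
	// Made-up stand-in for the fuzzed rule text.
	pieces := re.Split("a, b,c,   d", -1)
	fmt.Println(len(pieces), pieces) // 4 [a b c d]
}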

@catenacyber commented Oct 22, 2021

So, there are quite a lot of pieces for Split to return...

@randall77 (Contributor) commented Oct 22, 2021

Nothing obvious with the string+regexp you posted. Putting it at different alignments doesn't seem to trigger anything.
The string you posted doesn't seem long enough, though. It is only 227927 bytes, and all the failure traces above have a string that is 0x76dbc0 = 7789504 bytes.

@catenacyber commented Oct 22, 2021

The string I posted comes with this stack trace:

panic: runtime error: slice bounds out of range [699494522:64250]

goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c000ac201f, 0x962c00}, 0xffffffffffffffff)
	regexp/regexp.go:1266 +0x617
github.com/google/gonids.(*Rule).option(0x10c000066000, {0x0, {0x10c000ac2016, 0x10c000044000}}, 0x10c0001dc000)
	github.com/google/gonids/parser.go:675 +0x36c5
github.com/google/gonids.parseRuleAux({0x10c000ac2000, 0x37a78}, 0x0)
	github.com/google/gonids/parser.go:942 +0x6b3
github.com/google/gonids.ParseRule(...)
	github.com/google/gonids/parser.go:971
github.com/google/gonids.FuzzParseRule({0x7f369cb75800, 0x0, 0x10c000000601})
	github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
	./main.3953748960.go:21

@catenacyber commented Oct 22, 2021

When running manually, it seems my string is always aligned on 16 bytes, like 0x10c0001185a0.

@randall77 (Contributor) commented Oct 22, 2021

That doesn't seem right. The string posted is only 227927 bytes, but the stack trace shows that it is 9841664 bytes.

To unalign a string, do:

s2 := (strings.Repeat("*", i) + s)[i:]
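
A minimal sketch of sweeping alignments that way, assuming the comma-splitting pattern from above and a stand-in string in place of the dumped one:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	re := regexp.MustCompile(`,\s*`)
	s := strings.Repeat("a,", 100000) // stand-in for the dumped string

	// Re-run the failing call with the string shifted to every offset in a
	// 32-byte window, so its data pointer ends in different low bytes
	// (e.g. ...1f, as seen in the crash traces).
	for i := 0; i < 32; i++ {
		s2 := (strings.Repeat("*", i) + s)[i:]
		parts := re.Split(s2, -1)
		fmt.Println(i, len(parts))
	}
}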

@catenacyber commented Oct 22, 2021

So, the string must have been corrupted before the call to regexp.(*Regexp).Split

@randall77 (Contributor) commented Oct 22, 2021

I'm not sure I understand. How did you get the string, if not just dumping the 0x962c00 bytes starting at address 0x10c000ac201f ?

@catenacyber commented Oct 23, 2021

I'm not sure I understand. How did you get the string, if not just dumping the 0x962c00 bytes starting at address 0x10c000ac201f ?

oss-fuzz provides the stack trace and the input that triggered it.
When that input is run through gonids.FuzzParseRule, it eventually reaches gonids.(*Rule).option, which calls regexp.(*Regexp).Split, and that is where I got the string.

@randall77 (Contributor) commented Oct 24, 2021

Hm, then I guess the values in the stack traces are incorrect. Which can happen, especially with regabi introduced in 1.17.
I don't have any ideas on how to figure out the problem without bisecting through the 1.17 commits.
I am unable to reproduce on my laptop. It never crashes for me...
Speaking of which, it's always possible that it is a bad machine. Have you reproduced on different machines?

Another thing to try, run with GODEBUG=cpu.all=off. That will disable most of the complicated bytealg code.

@thepudds commented Oct 24, 2021

I am unable to reproduce on my laptop. It never crashes for me...

FWIW, I was not able to reproduce either using the oss-fuzz local execution steps on a clean Linux VM (amd64, Ubuntu 20.04) following the directions outlined above in #49075 (comment), with 3 separate runs that totaled about 24 hours.

I also ran for about 24 hours using dvyukov/go-fuzz against the same gonids fuzz target, which also did not crash (although I don't think there is an enormous amount of signal from that result, given libFuzzer and go-fuzz have different process restart strategies, different mutations, different propensity for creating large inputs, and so on).

@catenacyber some questions for you:

  1. I am curious if you are able to reproduce in a clean environment using those steps you outlined for local execution?

  2. I wonder if some additional steps might be needed, such as perhaps downloading a snapshot of the oss-fuzz working corpus (whereas, if I followed correctly, the oss-fuzz local execution steps you outlined above result in using only a seed corpus downloaded from https://rules.emergingthreats.net?).

  3. I wonder if -max_len needs to be set, or alternatively, perhaps if the live oss-fuzz working corpus has larger inputs that push libFuzzer to use larger examples? Following the steps above, my local execution currently reports:

INFO: -max_len is not provided; libFuzzer will not generate inputs larger 
than 4096 bytes
INFO: seed corpus: files: 67174 min: 1b max: 3228b total: 19520296b rss: 57Mb

That said, it might be that size of inputs in the corpus is not meaningful if this turns out to be due to some corruption.

  4. If I followed, it looks like the live gonids Dockerfile on the oss-fuzz repo is using the attempted workaround of removing use of sync.Pool, which I think you said resulted in "It looks like the bug is still happening but much less often." If Keith or others are trying to reproduce, it might make sense to not use that attempted workaround when trying to reproduce locally in order to increase frequency?

In any event, sorry if anything I wrote is off base, but curious for your thoughts.

@catenacyber commented Oct 25, 2021

Hm, then I guess the values in the stack traces are incorrect. Which can happen, especially with regabi introduced in 1.17. I don't have any ideas on how to figure out the problem without bisecting through the 1.17 commits. I am unable to reproduce on my laptop. It never crashes for me... Speaking of which, it's always possible that it is a bad machine. Have you reproduced on different machines?

I did not manage to reproduce, so this may indeed be a hardware/OS issue...

Another thing to try, run with GODEBUG=cpu.all=off. That will disable most of the complicated bytealg code.

I just opened google/oss-fuzz#6650

@catenacyber commented Oct 25, 2021

  1. I am curious if you are able to reproduce in a clean environment using those steps you outlined for local execution?

I am not able to reproduce it myself... I am just seeing the crashes on oss-fuzz.

  2. I wonder if some additional steps might be needed, such as perhaps downloading a snapshot of the oss-fuzz working corpus (whereas, if I followed correctly, the oss-fuzz local execution steps you outlined above result in using only a seed corpus downloaded from https://rules.emergingthreats.net?).

You can download the oss-fuzz public corpus
https://storage.googleapis.com/gonids-backup.clusterfuzz-external.appspot.com/corpus/libFuzzer/gonids_fuzz_parserule/public.zip

  3. I wonder if -max_len needs to be set, or alternatively, perhaps if the live oss-fuzz working corpus has larger inputs that push libFuzzer to use larger examples? Following the steps above, my local execution currently reports:
INFO: -max_len is not provided; libFuzzer will not generate inputs larger 
than 4096 bytes
INFO: seed corpus: files: 67174 min: 1b max: 3228b total: 19520296b rss: 57Mb

That said, it might be that size of inputs in the corpus is not meaningful if this turns out to be due to some corruption.

I think this max_len gets ignored somehow...

  4. If I followed, it looks like the live gonids Dockerfile on the oss-fuzz repo is using the attempted workaround of removing use of sync.Pool, which I think you said resulted in "It looks like the bug is still happening but much less often." If Keith or others are trying to reproduce, it might make sense to not use that attempted workaround when trying to reproduce locally in order to increase frequency?

I am reverting the change to get a steadier flow of bugs back.

In any event, sorry if anything I wrote is off base, but curious for your thoughts.

Nothing was off base...

@catenacyber commented Nov 4, 2021

Another thing to try, run with GODEBUG=cpu.all=off. That will disable most of the complicated bytealg code.

The bug has still been happening for the last 10 days since google/oss-fuzz#6650 was merged.

@thepudds commented Nov 4, 2021

Is that setting of the env variable only happening at build time?

I am not in front of a full-size computer, and it would probably be smarter for me to let someone else chime in, but I had thought GODEBUG=cpu.all=off was a runtime flag?

@catenacyber commented Nov 5, 2021

Is that setting of the env variable only happening at build time?

I used it only at build time, not at runtime.
Should I try this at runtime ?

@ianlancetaylor (Contributor) commented Nov 5, 2021

The setting of GODEBUG is only useful at run time, not at build time.

@catenacyber commented Nov 6, 2021

The setting of GODEBUG is only useful at run time, not at build time.

Can I set the environment variable while the program is running? Or does it need to be set beforehand? (That would be harder to set up...)

@ianlancetaylor (Contributor) commented Nov 7, 2021

Let's first find out whether it actually fixes the problem.

@FiloSottile (Contributor) commented Nov 8, 2021

I think what @catenacyber is asking is whether GODEBUG can be set after program start and still be effective for debugging or if it's only checked at start.

@catenacyber commented Nov 8, 2021

I think what @catenacyber is asking is whether GODEBUG can be set after program start and still be effective for debugging or if it's only checked at start.

Indeed, thanks Filippo.
If it is not effective when set after program start, trying this on oss-fuzz will require a complex change.

@ianlancetaylor (Contributor) commented Nov 8, 2021

whether GODEBUG can be set after program start and still be effective for debugging or if it's only checked at start.

Yes, I didn't answer that question because it's complicated and depends on the exact value of GODEBUG. I was hoping to first discover whether setting GODEBUG in the environment when running the program fixes the problem. If it does, then we can figure out whether setting it while running the program will work.

@catenacyber commented Nov 9, 2021

Thanks Ian, you answered my question.
GODEBUG likely needs to be set before the program starts... which is complicated on oss-fuzz... which is the only place where the bug reproduces...

If it does, then we can figure out whether setting it while running the program will work.

Then there would be no point in doing it for me.
The issue is that it is easier to set GODEBUG from within the running fuzz target than from outside of it before launching it...

So, I guess I will think on it...
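
For concreteness, setting it from the fuzz target would look something like the sketch below (hypothetical, assuming it lives in the gonids package next to fuzz.go; whether the runtime still honors the value this late is exactly the open question, since as far as I understand the cpu settings are read during runtime startup, before package init functions run):

package gonids

import "os"

// init runs before the fuzz target's code, but only after the Go runtime
// has already initialized, so GODEBUG settings that are only read at
// startup (such as cpu.all) may not take effect from here.
func init() {
	os.Setenv("GODEBUG", "cpu.all=off")
}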

@catenacyber commented Nov 10, 2021

@asraa
Is there any more precise way to reproduce than running python infra/helper.py run_fuzzer gonids fuzz_parserule? (That is, more environment options to match the sandbox, more precise hardware details...)

Could these be provided in the oss-fuzz logs?

@catenacyber commented Nov 16, 2021

Side note on reproducing the issue:
It happened 35 times on November 14th, among more than 2400 runs.
So it happens about once in 100 runs...

@catenacyber commented Nov 28, 2021

No luck in reproducing it in 500 runs...

@randall77 (Contributor) commented Nov 29, 2021

Is there anything you can tell us about the machines that are running this task, when they fail?
Are they physical machines? From some cloud? Can you give us /proc/cpuinfo or some other description of the processors?
Are they Intel or AMD?
What OS are you running? What version?

@cristaloleg commented Nov 29, 2021

After re-reading this issue yesterday, I recalled a post by Aleksey Shipilev about memory testing; faulty hardware memory might be the case here too (showing up less often than on the JVM, due to the smaller number of pointers): https://shipilev.net/jvm/test-your-memory/
