Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic in page.go on bigendian arch #56

Open
pavolloffay opened this issue Aug 30, 2023 · 21 comments
Open

Panic in page.go on bigendian arch #56

pavolloffay opened this issue Aug 30, 2023 · 21 comments
Labels
bug Something isn't working

Comments

@pavolloffay
Copy link

Hi, we are seeing the following error on IBM Z 64bit with Grafana Tempo and Parquet format. Weird is that the exact same setup works well on other architectures.

I have tried rebuilding Tempo with -tags purego and got the same panic.

Any help would be appreciated.

panic: runtime error: index out of range [7] with length 7

goroutine 1221 [running]:
github.com/segmentio/parquet-go.(*byteArrayPage).index(...)
	/remote-source/tempo/app/vendor/github.com/segmentio/parquet-go/page.go:982
github.com/segmentio/parquet-go.(*byteArrayDictionary).lookupString(0xc0002201c0, {0xc001108000, 0x1b, 0x400}, {{0xc001068000, 0x1b, 0x18}})
	/remote-source/tempo/app/vendor/github.com/segmentio/parquet-go/dictionary_purego.go:42 +0x162
github.com/segmentio/parquet-go.(*byteArrayDictionary).Lookup(0xc0002201c0, {0xc001108000, 0x1b, 0x400}, {0xc001068000, 0x1b, 0x3e8})
	/remote-source/tempo/app/vendor/github.com/segmentio/parquet-go/dictionary.go:748 +0x142
github.com/segmentio/parquet-go.(*indexedPageValues).ReadValues(0xc000fe5490, {0xc001068000, 0x1b, 0x3e8})
	/remote-source/tempo/app/vendor/github.com/segmentio/parquet-go/dictionary.go:1332 +0xd6
github.com/segmentio/parquet-go.(*repeatedPageValues).ReadValues(0xc000aa3e40, {0xc001068000, 0x3e8, 0x3e8})
	/remote-source/tempo/app/vendor/github.com/segmentio/parquet-go/page_values.go:98 +0x192
github.com/grafana/tempo/pkg/parquetquery.(*ColumnIterator).iterate.func3.2(0xc00115ff48, 0xc00115fec8, 0xc00018f740, {0xc001068000, 0x3e8, 0x3e8}, 0x3e8, {0x2815978, 0xc000ef6fc0}, {0x2829640, ...})
	/remote-source/tempo/app/pkg/parquetquery/iters.go:406 +0x26c
github.com/grafana/tempo/pkg/parquetquery.(*ColumnIterator).iterate.func3(0xc00018f740, 0xc00115ff48, 0xc00115fec8, {0xc001068000, 0x3e8, 0x3e8}, 0x3e8, {0x2815978, 0xc000ef6fc0}, {0x2821868, ...})
	/remote-source/tempo/app/pkg/parquetquery/iters.go:459 +0x1e4
github.com/grafana/tempo/pkg/parquetquery.(*ColumnIterator).iterate(0xc00018f740, {0x2815978, 0xc000ef6fc0}, 0x3e8)
	/remote-source/tempo/app/pkg/parquetquery/iters.go:464 +0x676
github.com/grafana/tempo/pkg/parquetquery.NewColumnIterator.func1()
	/remote-source/tempo/app/pkg/parquetquery/iters.go:299 +0x42
created by github.com/grafana/tempo/pkg/parquetquery.(*ColumnIterator).next
	/remote-source/tempo/app/pkg/parquetquery/iters.go:485 +0x60
@achille-roussel
Copy link
Collaborator

Hello @pavolloffay, thanks for reporting!

We might need a bit more details to be able to track down this issue. Would you be able to share a file that causes the panic? Or steps to generate one?

@achille-roussel achille-roussel added the bug Something isn't working label Sep 1, 2023
@pavolloffay
Copy link
Author

The IBM Z (s390x) is big-endian. Is this causing the issue?

https://github.com/search?q=repo%3Aparquet-go%2Fparquet-go%20LittleEndian&type=code

@pavolloffay
Copy link
Author

@achille-roussel see #58 to reproduce failing tests on s390x

@pavolloffay
Copy link
Author

The panic in #58 happens as well in page.go, so we could take it as a reproducer.

@pavolloffay pavolloffay changed the title Panic in page.go Panic in page.go on bigendian arch Oct 12, 2023
@JavaPerformance
Copy link

Hi all, may I ask what the status of this is please, it is blocking us from upgrading Tempo.

@JavaPerformance
Copy link

@pavolloffay I'm going to work on a patch to Parquet, not quite sure I can do it but I'll try. Without it we cant upgrade. Are you in Zurich?

@achille-roussel
Copy link
Collaborator

Hello @JavaPerformance, progress on the resolution was tracked in #58

There hasn't been much progress lately, but it would be extremely helpful if you have the cycles and can pick it up. I don't have access to a system running on a big-endian architecture, so it's a bit tricky to investigate the remaining issues.

@srinivas-pokala
Copy link

@pavolloffay , @achille-roussel I have tried reproducing the issue on s390x platform(IBM-Z)

I am able reproduce the issue, below is the stack trace which I have got from the panic.

--- FAIL: TestOpenFile (0.01s)
    --- PASS: TestOpenFile/testdata/alltypes_dictionary.parquet (0.00s)
    --- PASS: TestOpenFile/testdata/alltypes_plain.parquet (0.00s)
    --- PASS: TestOpenFile/testdata/alltypes_plain.snappy.parquet (0.00s)
    --- FAIL: TestOpenFile/testdata/alltypes_tiny_pages.parquet (0.01s)
panic: runtime error: index out of range [620756992] with length 731 [recovered]
        panic: runtime error: index out of range [620756992] with length 731

goroutine 1642 [running]:
testing.tRunner.func1.2({0x961d80, 0xc0000287c8})
        testing/testing.go:1526 +0x2e8
testing.tRunner.func1()
        testing/testing.go:1529 +0x406
panic({0x961d80, 0xc0000287c8})
        runtime/panic.go:884 +0x23a
github.com/parquet-go/parquet-go.(*byteArrayPage).index(0xc00104d9d0, 0x25000000)
        github.com/parquet-go/parquet-go/page.go:981 +0x140
github.com/parquet-go/parquet-go.(*byteArrayDictionary).lookupString(0xc00104d9d0, {0xc000276000, 0xd, 0x400}, {{0xc00100ec00, 0xd, 0x18}})
        github.com/parquet-go/parquet-go/dictionary_purego.go:42 +0x102
github.com/parquet-go/parquet-go.(*byteArrayDictionary).Lookup(0xc00104d9d0, {0xc000276000, 0xd, 0x400}, {0xc00100ec00, 0xd, 0x2a})
        github.com/parquet-go/parquet-go/dictionary.go:748 +0x134
github.com/parquet-go/parquet-go.(*indexedPageValues).ReadValues(0xc00017de70, {0xc00100ec00, 0xd, 0x2a})
        github.com/parquet-go/parquet-go/dictionary.go:1332 +0x206
github.com/parquet-go/parquet-go.(*optionalPageValues).ReadValues(0xc0004c29e0, {0xc00100ec00, 0x2a, 0x2a})
        github.com/parquet-go/parquet-go/page_values.go:40 +0x580
github.com/parquet-go/parquet-go_test.printColumns(0xc0031c56c0, 0xc000b34ea0, {0xc000548508, 0x4})
        github.com/parquet-go/parquet-go/file_test.go:85 +0x8f6
github.com/parquet-go/parquet-go_test.printColumns(0xc0031c56c0, 0xc000b34000, {0x9abe3e, 0x2})
        github.com/parquet-go/parquet-go/file_test.go:119 +0x12f8
github.com/parquet-go/parquet-go_test.TestOpenFile.func1(0xc0031c56c0)
        github.com/parquet-go/parquet-go/file_test.go:50 +0x8e0
testing.tRunner(0xc0031c56c0, 0xc00094a780)
        testing/testing.go:1576 +0x128
created by testing.(*T).Run
        testing/testing.go:1629 +0x50a

In theTestOpenFile table driven test, the test "TestOpenFile/testdata/alltypes_tiny_pages.parquet:" which causing the issue.
When I have debugged further this issue with failing test, the index [620756992] which storing in little-endian fashion, It's actually trying to store the index value(of type int32[]) as 0x25, but it 's storing as 0x25000000 . Due to this panic has encountered.
Below is the code snippets which causing the index value storing in endian reverse through copy function:
https://github.com/parquet-go/parquet-go/blob/main/encoding/rle/rle.go#L379

		} else {
			j := i + bitpack.ByteCount(bitWidth)

			if j > len(src) {
				return dst, fmt.Errorf("decoding run-length block of %d values: %w", count, io.ErrUnexpectedEOF)
			}

			bits := [4]byte{}
===>>>>		copy(bits[:], src[i:j])
			dst = appendRepeat(dst, bits[:], count)
			i = j
		}
	}

	return dst, nil
}

After fixing this with endian check and if Big endian, swapped the Bytes. There are other failures which I have observed after this fix which already causing the same test to failure.
whatever this path above mentioned is executed when "bitpacked " boolean variable is false, that's working fine now with endian check and swapping int32 bytes.
But, for bitpacked boolean become true this if case is not handling propely the src[] int32 data. Below is the code snippet for if case.
https://github.com/parquet-go/parquet-go/blob/main/encoding/rle/rle.go#L358

if bitpacked {
			offset := len(dst)
			length := int(count * bitWidth)
			dst = resize(dst, offset+4*8*int(count))

			// The bitpack.UnpackInt32 function requires the input to be padded
			// or the function panics. If there is enough room in the input
			// buffer we can use it, otherwise we have to copy it to a larger
			// location (which should rarely happen).
			in := src[i : i+length]
			if (cap(in) - len(in)) >= bitpack.PaddingInt32 {
				in = in[:cap(in)]
			} else {
				buf = resize(buf, len(in)+bitpack.PaddingInt32)
				copy(buf, in)
				in = buf
			}

			out := unsafecast.BytesToInt32(dst[offset:])
			bitpack.UnpackInt32(out, in, bitWidth)
			i += length
		} else {

It would be really helpful any pointer's on bitpacked true if case execution code path. I am suspecting the issue lies in this code flow.

@JavaPerformance
Copy link

Hi all, is there any update on this, I know that the s390x Tempo is small but we are doing interesting things and being stuck on Tempo 1.5 is holding us back. @joe-elliott

@joe-elliott
Copy link
Collaborator

unfortunately, i think those that using big endian architectures are going to have to dig this one out on their own. we do not have the bandwidth or incentive right now.

@JavaPerformance
Copy link

@joe-elliott I understand, I'll make an effort. Forgive me ask the odd naive question in the process.

@joe-elliott
Copy link
Collaborator

no worries! i just want to be clear about where we are on this.

huge thx for making an attempt 🙏

@JavaPerformance
Copy link

I've been playing around, although there will probably be many other places in the code that have to be worked on I started with rle.go that @srinivas-pokala found. The private encodeInt32 decodeInt32 work just fine and pass the simple test I wrote. However the exported function DecodeInt32 is a problem and fails on BigEndian because of the unsafecasts (I was lazy and just replicated the unsafecasts from DecodeInt32 in my test that directly calls the provate decodeIn32. At least it gives me somewhere to start.

@joe-elliott is RLE used often? could it be the reason for the panic in page.go or is this a case of fix this anyway but keep looking?

=== RUN TestEncodeDecodeInt32
rle_test.go:10: **********************************************************
rle.go:42: Encode
rle.go:51: Decode
rle.go:64: Convert decoded bytes back to int32
rle.go:73: Compare src and int32Decoded
rle.go:79: decoded: 1, src: 1
rle.go:79: decoded: 2, src: 2
rle.go:79: decoded: 3, src: 3
rle.go:79: decoded: 4, src: 4
rle.go:79: decoded: 5, src: 5
rle.go:79: decoded: 6, src: 6
rle.go:79: decoded: 7, src: 7
rle.go:79: decoded: 8, src: 8
rle_test.go:14: **********************************************************
--- PASS: TestEncodeDecodeInt32 (0.00s)

=== RUN TestEncodeDecodeInt32
rle_test.go:10: **********************************************************
rle.go:109: Encode
rle.go:118: Decode
rle.go:131: Convert decoded bytes back to int32
rle.go:140: Compare src and int32Decoded
rle.go:146: decoded: 1, src: 16777216
rle.go:148: Mismatch at index 0: expected 1, got 16777216
rle.go:146: decoded: 2, src: 33554432
rle.go:148: Mismatch at index 1: expected 2, got 33554432
rle.go:146: decoded: 3, src: 50331648
rle.go:148: Mismatch at index 2: expected 3, got 50331648
rle.go:146: decoded: 4, src: 67108864
rle.go:148: Mismatch at index 3: expected 4, got 67108864
rle.go:146: decoded: 5, src: 83886080
rle.go:148: Mismatch at index 4: expected 5, got 83886080
rle.go:146: decoded: 6, src: 100663296
rle.go:148: Mismatch at index 5: expected 6, got 100663296
rle.go:146: decoded: 7, src: 117440512
rle.go:148: Mismatch at index 6: expected 7, got 117440512
rle.go:146: decoded: 8, src: 134217728
rle.go:148: Mismatch at index 7: expected 8, got 134217728
rle_test.go:14: **********************************************************
--- FAIL: TestEncodeDecodeInt32 (0.00s)
FAIL
FAIL github.com/parquet-go/parquet-go 0.003s
FAIL

@JavaPerformance
Copy link

@joe-elliott I've started work on this, A few of those question I warned you about :)

  1. Binary compatibility. Do you think it is necessary to maintain the portability of stores between systems of different endianism? Additionally Would you imagine or require the mixing of endianisms in a network?

  2. Testing. many of the tests use hard-coded data, either I write alternative tests or the big endian code translates to little endian externally (the same as question 1 I guess)

@joe-elliott
Copy link
Collaborator

Binary compatibility. Do you think it is necessary to maintain the portability of stores between systems of different endianism? Additionally Would you imagine or require the mixing of endianisms in a network?

Are you asking from a parquet-go perspective? Like should a file written by a big endian system be readable by a little endian system? I think ideally the storage format would not be dependent on the cpu architecture it was created with.

also, it seems like the parquet specification itself has endian requirements and we'd like the repo to be consistent:

https://parquet.apache.org/docs/file-format/data-pages/encodings/

Testing. many of the tests use hard-coded data, either I write alternative tests or the big endian code translates to little endian externally (the same as question 1 I guess)

Yeah, I think same answer as above. Presumably the tests are reading parquet files that are written using the standard encodings detailed above and a big endian system should be able to read those.

cc @achille-roussel in case he has a different viewpoint.

@JavaPerformance
Copy link

Agreed on both points.

*_bigendian.go and *_bigendian_test.go OK for you?

@forsaken628
Copy link

forsaken628 commented Apr 4, 2024

binary.LittleEndian is safe
but unsafecast package not, it not working on BigEndian platform.

// Slice converts the data slice of type []From to a slice of type []To sharing
// the same backing array. The length and capacity of the returned slice are
// scaled according to the size difference between the source and destination
// types.
//
// Note that the function does not perform any checks to ensure that the memory
// layouts of the types are compatible, it is possible to cause memory
// corruption if the layouts mismatch (e.g. the pointers in the From are different
// than the pointers in To).
func Slice[To, From any](data []From) []To {
// This function could use unsafe.Slice but it would drop the capacity
// information, so instead we implement the type conversion.
var zf From
var zt To
var s = (*slice)(unsafe.Pointer(&data))
s.len = int((uintptr(s.len) * unsafe.Sizeof(zf)) / unsafe.Sizeof(zt))
s.cap = int((uintptr(s.cap) * unsafe.Sizeof(zf)) / unsafe.Sizeof(zt))
return *(*[]To)(unsafe.Pointer(s))
}

"the layouts mismatch", It refers to cast little endian file to big endian int

There is no way to fix unsafecast package without incurring significant performance overhead

@JavaPerformance
Copy link

@forsaken628 I realise that I'll have to take a performance hit on Mainframe but my changes won't affect other platforms which will continue using unsafe casts and copies and maintain their existing performance.

@Vishwanatha-HD
Copy link

Hi @pavolloffay,
Srinivas and myself are working on providing the Parquet support on s390x, in order to help the Openshift team by enabling the Tempo Operator support..

We see that many of the test cases are failing right now on s390x, and we tried our best to handle the testcase failure by changing endianness of the data types.. Even though, it helps in succeeding at some place, but it eventually fails at later stages.

These are testcase groups that were failing initially when we started looking into the parquet issue.
TestSplitBlockFilter/INT96
TestBuffer/boolean/
TestBuffer/int32
TestBuffer/float
TestBuffer/uint32
TestBufferRoundtripNestedRepeated
TestBufferRoundtripNestedRepeatedPointer
TestBufferSeekToRow
TestColumnPageIndex/file
TestConvertValue/string_to_int96 – Fixed this issue.
TestDictionary/BOOLEAN
TestDictionary/FLOAT
TestDictionary/INT(32,false)
TestOpenFile/testdata/alltypes_tiny_pages.parquet

Out of the above testcases, we have successfully fixed “TestConvertValue/string_to_int96” testcase…

We need your help and support in understanding the complete parquet package and its functionality, so that we can work on fixing the other testcase failures..
We want to know how the data is getting stored and retrieved as part of Parquet file format..

Warm Regards,
Vishwa..

@achille-roussel
Copy link
Collaborator

achille-roussel commented Apr 25, 2024

@Vishwanatha-HD feel free to send your questions, I'll do my best to respond promptly.

We can chat on the #parquet-go Gopher Slack channel for more direct communication as well!

@JavaPerformance
Copy link

JavaPerformance commented Apr 25, 2024

@Vishwanatha-HD @achille-roussel

Hi Vishwa, I have been attempting a port for a little while, but struggling. If you haven't already you might want to look at sparse/array.go and add 8- sizeof type to the Index functions. For example:

func (a Uint8Array) Index(i int) uint8 {
	return *(*uint8)(unsafe.Add(a.index(i), 7))
}

func (a Uint32Array) Index(i int) uint32 {
	return *(*uint32)(unsafe.Add(a.index(i), 4))
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants