Regression between old and new version for loading parquet #15

Open
Shogan opened this issue Jan 25, 2022 · 11 comments

@Shogan

Shogan commented Jan 25, 2022

Previously with an older release of dsq I could do a basic SQL select on a parquet file.

With the latest release (0.2.0), I get this error:

panic: Missing type equality condition for unknown merge.

Command:

./dsq ~/Downloads/part-00030.snappy.parquet "SELECT * FROM {}"

If you can't reproduce this easily, I can see if I can put a sample parquet file together and attach it to this issue.

Stacktrace:

goroutine 1 [running]:
github.com/multiprocessio/datastation/runner.shapeMerge({{0x4c4111c, 0x7}, 0x0, 0x0, 0x0, 0x0}, {{0x4c4111c, 0x7}, 0x0, 0x0, ...})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:248 +0x648
github.com/multiprocessio/datastation/runner.objectMerge({0x4bd1da0}, {0x4bd1da0})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:209 +0x1b9
github.com/multiprocessio/datastation/runner.shapeMerge({{0x4c3f476, 0x6}, 0x0, 0xc000010168, 0x0, 0x0}, {{0x4c3f476, 0x6}, 0x0, 0xc000010198, ...})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:232 +0x115
github.com/multiprocessio/datastation/runner.getArrayShape({0x7ffeefbff919, 0x30}, {0xc0005183c0, 0x3, 0x6}, 0x16)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:268 +0x3e5
github.com/multiprocessio/datastation/runner.GetShape({0x7ffeefbff919, 0x4ac43e0}, {0x4ad1700, 0xc00051c4f8}, 0x0)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:277 +0x245
github.com/multiprocessio/datastation/runner.ShapeFromFile({0xc0005bb570, 0x4adfe60}, {0x7ffeefbff919, 0x30}, 0x2710, 0x7ffeefbff919)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:328 +0x16c
main.getShape({0xc0005bb570, 0x30}, {0x7ffeefbff919, 0x30})
	/Users/runner/work/dsq/dsq/main.go:46 +0x4e
main.main()
	/Users/runner/work/dsq/dsq/main.go:202 +0xac7
@eatonphil
Member

Aha! Yes please do send me a sample. I thought that code path wasn't possible.

@eatonphil
Member

Hey @Shogan ping on a sample to help me reproduce this :/

@eatonphil
Member

What I'm going to do in the meantime is drop this panic. It's demonstrating a real bug but maybe you don't care about this particular column.

Instead it will just log some info about the column and you still won't be able to query that column until I fix the bug.

The main fix happens in this PR: multiprocessio/datastation#162.
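
A minimal Go sketch of the "log instead of panic" behavior described above. It is not the datastation implementation: the `Shape` type, `mergeShapes`, and `inferColumns` are hypothetical names invented for illustration, and the real shape inference is more involved. The idea is that a column whose inferred types cannot be merged is logged and dropped, while the remaining columns stay queryable.

```go
package main

import (
	"fmt"
	"log"
)

// Shape is a hypothetical, simplified stand-in for an inferred column type.
type Shape struct {
	Kind string // for example "scalar", "object", "array", "unknown"
}

// mergeShapes combines two inferred shapes for the same column.
// On a mismatch it returns an error instead of panicking.
func mergeShapes(column string, a, b Shape) (Shape, error) {
	if a.Kind == b.Kind {
		return a, nil
	}
	return Shape{}, fmt.Errorf("column %q: cannot merge shape kinds %q and %q", column, a.Kind, b.Kind)
}

// inferColumns merges per-row shapes. A column that fails to merge is
// logged and excluded from the result; it stays excluded for later rows.
func inferColumns(rows []map[string]Shape) map[string]Shape {
	merged := map[string]Shape{}
	skipped := map[string]bool{}
	for _, row := range rows {
		for col, s := range row {
			if skipped[col] {
				continue
			}
			prev, seen := merged[col]
			if !seen {
				merged[col] = s
				continue
			}
			m, err := mergeShapes(col, prev, s)
			if err != nil {
				log.Printf("skipping column: %v", err)
				delete(merged, col)
				skipped[col] = true
				continue
			}
			merged[col] = m
		}
	}
	return merged
}

func main() {
	rows := []map[string]Shape{
		{"id": {Kind: "scalar"}, "ts": {Kind: "scalar"}},
		{"id": {Kind: "scalar"}, "ts": {Kind: "unknown"}},
	}
	cols := inferColumns(rows)
	fmt.Println(len(cols), "queryable column(s)")
}
```

With this approach a query over the surviving columns still works, and the log line names the column that was dropped.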

@grawlinson

Just started updating the AUR package to 0.6.0, and I'm seeing a potential regression that may be related to this. For reference, the version I am updating from (0.5.0) passes all tests successfully.

Here's the test output from ./scripts/test.py:

STARTING: SQL count for csv pipe
  SUCCESS

STARTING: SQL count for csv file
  SUCCESS

STARTING: SQL count for tsv pipe
  SUCCESS

STARTING: SQL count for tsv file
  SUCCESS

STARTING: SQL count for parquet pipe
  FAILURE
1c1,31
< 1000
\ No newline at end of file
---
> panic: runtime error: index out of range [576457816924784844] with length 115816
>
> goroutine 1 [running]:
> github.com/goccy/go-json/internal/encoder.CompileToGetCodeSet(0xc000f30ee0, 0x562e4598cb2c)
>       github.com/goccy/go-json@v0.9.4/internal/encoder/compiler_norace.go:11 +0x1df
> github.com/goccy/go-json.encode(0xc0018ec000, {0xc000caf9e0, 0xc001898b60})
>       github.com/goccy/go-json@v0.9.4/encode.go:224 +0xd0
> github.com/goccy/go-json.marshal({0xc000caf9e0, 0xc001898b60}, {0x0, 0x0, 0x1})
>       github.com/goccy/go-json@v0.9.4/encode.go:148 +0xba
> github.com/goccy/go-json.MarshalWithOption(...)
>       github.com/goccy/go-json@v0.9.4/json.go:186
> github.com/goccy/go-json.Marshal({0xc000caf9e0, 0xc001898b60})
>       github.com/goccy/go-json@v0.9.4/json.go:171 +0x2a
> github.com/multiprocessio/go-json.(*StreamEncoder).EncodeRow(0xc00047b5c0, {0xc000caf9e0, 0xc001898b60})
>       github.com/multiprocessio/go-json@v0.0.0-20220308002443-61d497dd7b9e/encoder.go:57 +0x1dd
> github.com/multiprocessio/datastation/runner.transformParquet.func1(0x0)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:120 +0xc6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriter({0x562e47751fe0, 0xc000131040}, 0xc000f311d8)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:36 +0xf6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriterFile(...)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:51
> github.com/multiprocessio/datastation/runner.transformParquet({0x562e47787be8, 0xc0006cc000}, {0x562e47751fe0, 0xc000131040})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:105 +0xd8
> github.com/multiprocessio/datastation/runner.transformParquetFile({0xc000044140, 0x562e4774cee0}, {0x562e47751fe0, 0xc000131040})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:142 +0xec
> github.com/multiprocessio/datastation/runner.TransformReader({0x562e4774cee0, 0xc00047a000}, {0x0, 0x0}, {{0x562e46b7d36e, 0x562e46b79c73}, {0x0, 0x0}}, {0x562e47751fe0, 0xc000131040})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/http.go:262 +0x325
> main._main()
>       github.com/multiprocessio/dsq/main.go:211 +0x968
> main.main()
>       github.com/multiprocessio/dsq/main.go:376 +0x19
\ No newline at end of file


STARTING: SQL count for parquet file
  FAILURE
1c1,33
< 1000
\ No newline at end of file
---
> panic: runtime error: index out of range [576457833567912774] with length 115816
>
> goroutine 1 [running]:
> github.com/goccy/go-json/internal/encoder.CompileToGetCodeSet(0xc000ebef68, 0x55b244f0fb2c)
>       github.com/goccy/go-json@v0.9.4/internal/encoder/compiler_norace.go:11 +0x1df
> github.com/goccy/go-json.encode(0xc001546a90, {0xc000627920, 0xc0015095f0})
>       github.com/goccy/go-json@v0.9.4/encode.go:224 +0xd0
> github.com/goccy/go-json.marshal({0xc000627920, 0xc0015095f0}, {0x0, 0x0, 0x1})
>       github.com/goccy/go-json@v0.9.4/encode.go:148 +0xba
> github.com/goccy/go-json.MarshalWithOption(...)
>       github.com/goccy/go-json@v0.9.4/json.go:186
> github.com/goccy/go-json.Marshal({0xc000627920, 0xc0015095f0})
>       github.com/goccy/go-json@v0.9.4/json.go:171 +0x2a
> github.com/multiprocessio/go-json.(*StreamEncoder).EncodeRow(0xc000592600, {0xc000627920, 0xc0015095f0})
>       github.com/multiprocessio/go-json@v0.0.0-20220308002443-61d497dd7b9e/encoder.go:57 +0x1dd
> github.com/multiprocessio/datastation/runner.transformParquet.func1(0x0)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:120 +0xc6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriter({0x55b246cd4fe0, 0xc00015a740}, 0xc000ebf260)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:36 +0xf6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriterFile(...)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:51
> github.com/multiprocessio/datastation/runner.transformParquet({0x55b246d0abe8, 0xc000152ab0}, {0x55b246cd4fe0, 0xc00015a740})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:105 +0xd8
> github.com/multiprocessio/datastation/runner.transformParquetFile({0x7ffc520f99c6, 0x1b}, {0x55b246cd4fe0, 0xc00015a740})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:142 +0xec
> github.com/multiprocessio/datastation/runner.TransformFile({0x7ffc520f99c6, 0x1b}, {{0x0, 0x1ff}, {0x0, 0xc000b7f428}}, {0x55b246cd4fe0, 0xc00015a740})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:554 +0x1ab
> main.evalFileInto({0x7ffc520f99c6, 0x1b}, 0x0)
>       github.com/multiprocessio/dsq/main.go:47 +0xc5
> main._main()
>       github.com/multiprocessio/dsq/main.go:236 +0xb29
> main.main()
>       github.com/multiprocessio/dsq/main.go:376 +0x19
\ No newline at end of file

@eatonphil
Member

Rats! No, I don't think it's related. I redid the way JSON encoding/decoding works, so it's not surprising there's a bug. But it is surprising that it's in one of the files covered by automated testing!

@eatonphil
Member

I'm having trouble reproducing this though. In GitHub Actions this test passes, as it does on my MBP and on my Fedora Linux dev machine.

I also tried building dsq and running the tests in an Arch Linux container, and the tests passed.

Can you tell me any more about your machine/environment? I'm surprised it worked for you before and now breaks.

@Shogan
Author

Shogan commented Mar 9, 2022

> Hey @Shogan ping on a sample to help me reproduce this :/

Hi @eatonphil, sorry it took me so long. I was on vacation for a while and not checking notifications. I've had a look at the parquet file I was querying and unfortunately I can't easily provide it here, as it contains sensitive data.

However, in trying to load it up and edit it to remove said data, using a viewer tool, I got an exception about INT96 being unsupported. So this parquet data I'm working with uses INT96. Could that be an issue? I believe it is incompatible with avro.

I used the same version of dsq that I used when I started this thread to load some other parquet file examples I found here and it worked fine for these.

@eatonphil
Member

> However, in trying to load it up and edit it to remove said data, using a viewer tool, I got an exception about INT96 being unsupported. So this parquet data I'm working with uses INT96. Could that be an issue? I believe it is incompatible with avro.

Ok I'll try making a dataset with INT96 in it and see if that causes an issue.

But also, if you don't need that column right now, newer versions of dsq won't crash when this happens. They just won't be able to load that column for querying.

@Shogan
Author

Shogan commented Mar 11, 2022

> > However, in trying to load it up and edit it to remove said data, using a viewer tool, I got an exception about INT96 being unsupported. So this parquet data I'm working with uses INT96. Could that be an issue? I believe it is incompatible with avro.
>
> Ok I'll try making a dataset with INT96 in it and see if that causes an issue.
>
> But also, if you don't need that column right now, newer versions of dsq won't crash when this happens. They just won't be able to load that column for querying.

Nice! Confirmed version 0.6.0 works. Thanks @eatonphil 🎉

@grawlinson

> I'm having trouble reproducing this though. In GitHub Actions this test passes, as it does on my MBP and on my Fedora Linux dev machine.
>
> I also tried building dsq and running the tests in an Arch Linux container, and the tests passed.
>
> Can you tell me any more about your machine/environment? I'm surprised it worked for you before and now breaks.

I can reproduce it on two different machines, both running Arch Linux with Go 1.18, on Intel and AMD CPUs.

I'm just building in a clean chroot (based on systemd-nspawn) following our Go packaging guidelines, so the build flags could be a factor (but I doubt it).

I've currently skipped the failing parquet tests.

@eatonphil
Member

Gotcha! It's the -buildmode=pie flag that is exposing this crash (I don't know whether or not to say it's causing the crash).

I'm going to make a separate issue about -buildmode=pie. I don't know whether I'll be able to fix it though.
