Zero copy import when the schema is known #261

darabos · 2022-07-19T13:42:25Z

Resolves #258.

No import button! The corresponding Python code is:

lk.importParquet(eager='no', filename='/home/darabos/eg.parquet', schema='name: String, age: Double')

Outstanding issues:

- Currently you can only "import" a file this way once, because LynxKite assumes the file will never change. This could be avoided with a version parameter, the same way it is done with export operations.
- Add the three parameters: imported_columns, limit, and sql.
- Tests, documentation.
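
The version parameter idea from the first item can be sketched as follows. This is a minimal illustration in plain Python, not LynxKite's actual caching code: `operation_key` and the `version` argument are hypothetical names, and the sketch assumes an operation is identified by a hash of its parameters. Under that assumption, bumping the version yields a new identity, so the file would be read again.

```python
import hashlib
import json


def operation_key(filename, schema, version=0):
    """Return a stable identifier for a hypothetical import operation.

    Illustrative only: this is an assumption about how parameter-based
    identity could work, not LynxKite's real implementation.
    """
    params = {'filename': filename, 'schema': schema, 'version': version}
    # sort_keys makes the serialization deterministic, so equal parameters
    # always produce the same key.
    encoded = json.dumps(params, sort_keys=True).encode('utf-8')
    return hashlib.sha256(encoded).hexdigest()


path = '/home/darabos/eg.parquet'
schema = 'name: String, age: Double'
first = operation_key(path, schema)
again = operation_key(path, schema)
bumped = operation_key(path, schema, version=1)

print(first == again)   # same parameters, same identity
print(first == bumped)  # a new version gives a new identity
```

With identical parameters the key is unchanged, which matches the "import once" behavior described above; only the changed version produces a different key.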

darabos · 2022-08-05T11:32:06Z

Tuck also wants to read Date and Timestamp columns. I've added these.

This is useful outside of this PR too. But it will be easier to test them together. Hopefully we end up merging both changes.

tuckging · 2022-08-19T09:15:14Z

> Tuck also wants to read Date and Timestamp columns. I've added these.
>
> This is useful outside of this PR too. But it will be easier to test them together. Hopefully we end up merging both changes.

Thanks @darabos, seems to work great! 🙌 We can merge this for now, but I've come across another datatype below. If it's trivial to fix, we can include it in this PR as well!

decimal(8,0)
decimal(18,2)
decimal(12,0)
decimal(9,2)

tuckging · 2022-08-19T09:26:46Z

PS you might want to pin openjdk in conda-env.yml for CI to start passing again

- sbt
- openjdk==11.0.15 # if left unpinned, sbt now brings in unsupported jdk17

darabos · 2022-08-19T09:50:28Z

> PS you might want to pin openjdk in conda-env.yml for CI to start passing again
>
> - sbt
> - openjdk==11.0.15 # if left unpinned, sbt now brings in unsupported jdk17

Haha, @lacca0 just discovered the same 2 minutes ago! We will look into changing the code to be compatible with Java 17. But this workaround is very useful until then!

> Thanks @darabos, seems to work great! 🙌

Awesome, thanks a lot for testing!

> We can merge this for now, but I've come across another datatype below. If it's trivial to fix, we can include it in this PR as well!

Do you have a small test file or pyspark script to generate one? I want to make sure I'm targeting the right type.

tuckging · 2022-08-19T10:35:37Z

> Do you have a small test file or pyspark script to generate one? I want to make sure I'm targeting the right type.

from pyspark.sql import functions as F
spark.range(100).withColumn('decimal_col', F.rand().cast('decimal(18,2)'))

darabos · 2022-08-19T16:12:57Z

Thanks! I've changed the code to support all types that Spark supports. I let Spark parse what you enter.
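
One reason delegating the parsing to Spark is attractive: type names such as decimal(18,2) contain commas, so splitting the schema string on "," breaks. The sketch below is a hand-rolled splitter in plain Python that shows the problem. It is only an illustration; `split_schema` is a hypothetical helper and is neither Spark's parser nor the code in this PR.

```python
def split_schema(schema):
    """Split 'name: Type, ...' into (name, type) pairs.

    Commas inside parentheses, as in decimal(18,2), do not end a field.
    """
    fields, depth, current = [], 0, ''
    for ch in schema:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        if ch == ',' and depth == 0:
            # A top-level comma closes the current field.
            fields.append(current)
            current = ''
        else:
            current += ch
    if current.strip():
        fields.append(current)
    pairs = []
    for field in fields:
        name, _, type_name = field.partition(':')
        pairs.append((name.strip(), type_name.strip()))
    return pairs


text = 'name: String, price: decimal(18,2)'
naive = text.split(',')      # three pieces: the decimal type is cut in half
pairs = split_schema(text)   # two fields, decimal(18,2) kept whole
print(naive)
print(pairs)
```

Even this small splitter ignores nested types, quoting, and validation of type names, which is the kind of detail that an existing parser already handles.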

If the table does not match the specified schema you get an error. I changed the error formatting a bit to make sure this includes the important bit about what didn't match.
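
As a rough illustration of an error that names what didn't match, here is a minimal sketch in plain Python. It assumes schemas are represented as dicts from column name to type name; `check_schema` and the message wording are hypothetical and do not reproduce the actual LynxKite error.

```python
def check_schema(expected, actual):
    """Raise ValueError listing every expected column that is missing
    from `actual` or has a different type there.

    Both arguments map column name to type name.
    """
    problems = []
    for name, expected_type in expected.items():
        if name not in actual:
            problems.append(
                f'{name}: expected {expected_type}, but the column is missing')
        elif actual[name] != expected_type:
            problems.append(
                f'{name}: expected {expected_type}, found {actual[name]}')
    if problems:
        raise ValueError('Schema mismatch: ' + '; '.join(problems))


specified = {'name': 'string', 'age': 'double'}
in_file = {'name': 'string', 'age': 'decimal(18,2)'}
try:
    check_schema(specified, in_file)
    message = None
except ValueError as e:
    message = str(e)
print(message)
```

Collecting all mismatches before raising means one failed import reports every offending column rather than only the first.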

I've also added documentation. I'll add unit tests on Monday and then we can merge it.

darabos · 2022-08-24T14:04:16Z

@lacca0 I'm going to merge this PR to simplify merge conflicts with the 1-jar change. But it's still open for your comments! (I'll just have to address them in a separate PR.) Thanks!

darabos added 3 commits July 19, 2022 15:24

Zero-copy import operation.

f5c4c34

Make ReadParquetWithSchema a case class to fix equality check.

36dfa44

SerializableTypes for Date and Timestamp.

e884d3c

darabos added 4 commits August 19, 2022 15:17

Merge branch 'main' into darabos-zero-copy

013a558

Let Spark parse the schema.

62fc7a5

Documentation for the new feature.

3f60204

A bit more detailed error reporting.

8caf221

darabos added 3 commits August 24, 2022 10:54

Fix AggregateOnNeighborsTest.

750793a

Array support in ReadParquetWithSchema.

92e42d4

Test for ReadParquetWithSchema.

f8aab30

darabos requested a review from lacca0 August 24, 2022 08:57

darabos changed the title ~~[WIP] Zero copy import when the schema is known~~ Zero copy import when the schema is known Aug 24, 2022

darabos mentioned this pull request Aug 24, 2022

Force conda to use java11 #266

Merged

Merge branch 'main' into darabos-zero-copy

95bdd2e

darabos merged commit c92af70 into main Aug 24, 2022

darabos deleted the darabos-zero-copy branch August 24, 2022 14:04
