Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zero copy import when the schema is known #261

Merged
merged 11 commits into from Aug 24, 2022
Merged

Zero copy import when the schema is known #261

merged 11 commits into from Aug 24, 2022

Conversation

darabos
Copy link
Contributor

@darabos darabos commented Jul 19, 2022

Resolves #258.

image

No import button! The corresponding Python code is:

lk.importParquet(eager='no', filename='/home/darabos/eg.parquet', schema='name: String, age: Double')

Outstanding issues:

  • Currently you can only "import" a file this way once. LynxKite assumes it will never change. This could be avoided with a version parameter, same as its done with export operations.
  • Add the three parameters: imported_columns, limit, and sql.
  • Tests, documentation.

@darabos
Copy link
Contributor Author

darabos commented Aug 5, 2022

Tuck also wants to read Date and Timestamp columns. I've added these.

image

This is useful outside of this PR too. But it will be easier to test them together. Hopefully we end up merging both changes.

@tuckging
Copy link
Contributor

Tuck also wants to read Date and Timestamp columns. I've added these.

This is useful outside of this PR too. But it will be easier to test them together. Hopefully we end up merging both changes.

Thanks @darabos, seems to work great! 🙌 We can merge this for now, but I've come across another datatype below. If it's trivial to fix, we can include it in this PR as well!

decimal(8,0)
decimal(18,2)
decimal(12,0)
decimal(9,2)

@tuckging
Copy link
Contributor

PS you might want to pin openjdk in conda-env.yml for CI to start passing again

- sbt
- openjdk==11.0.15 # if left unpinned, sbt now brings in unsupported jdk17

@darabos
Copy link
Contributor Author

darabos commented Aug 19, 2022

PS you might want to pin openjdk in conda-env.yml for CI to start passing again

- sbt
- openjdk==11.0.15 # if left unpinned, sbt now brings in unsupported jdk17

Haha, @lacca0 just discovered the same 2 minutes ago! We will look into changing the code to be compatible with Java 17. But this workaround is very useful until then!

Thanks @darabos, seems to work great! 🙌

Awesome, thanks a lot for testing!

We can merge this for now, but I've come across another datatype below. If it's trivial to fix, we can include it in this PR as well!

Do you have a small test file or pyspark script to generate one? I want to make sure I'm targeting the right type.

@tuckging
Copy link
Contributor

Do you have a small test file or pyspark script to generate one? I want to make sure I'm targeting the right type.

from pyspark.sql import functions as F
spark.range(100).withColumn('decimal_col', F.rand().cast('decimal(18,2)'))

@darabos
Copy link
Contributor Author

darabos commented Aug 19, 2022

Thanks! I've changed the code to support all types that Spark supports. I let Spark parse what you enter.

image

If the table does not match the specified schema you get an error. I changed the error formatting a bit to make sure this includes the important bit about what didn't match.

image

I've also added documentation. I'll add unit tests on Monday and then we can merge it.

@darabos darabos requested a review from lacca0 August 24, 2022 08:57
@darabos darabos changed the title [WIP] Zero copy import when the schema is known Zero copy import when the schema is known Aug 24, 2022
@darabos
Copy link
Contributor Author

darabos commented Aug 24, 2022

@lacca0 I'm going to merge this PR to simplify merge conflicts with the 1-jar change. But it's still open for your comments! (I'll just have to address them in a separate PR.) Thanks!

@darabos darabos merged commit c92af70 into main Aug 24, 2022
@darabos darabos deleted the darabos-zero-copy branch August 24, 2022 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Zero-copy Parquet imports
2 participants