Can we force columns by using col.types, and does it do what it is supposed to do? #15741

Closed
wendycwong opened this issue Sep 11, 2023 · 3 comments

@wendycwong

H2O-3 parsing optimizes memory use by choosing the smallest storage unit that can hold a column's data. No rounding or truncation occurs in this process.

However, if a user wants to store integers as doubles, can this be done with the col_types argument when calling h2o.import_file?

wendycwong added the bug and Verification labels Sep 11, 2023
wendycwong self-assigned this Sep 11, 2023
wendycwong removed the bug label Sep 11, 2023
@wendycwong

Okay, I have done the experiment. I have a frame that contains the following:

image

The frame is saved as a CSV file. Next, I call import_file with both columns specified as real:

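import h2o
col_types = ["real", "real"]  # assumed form: both columns requested as real, as described above
csv_df = h2o.import_file("/Users/wendycwong/temp/df_parquet.csv", col_types=col_types)
print(csv_df.types)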

Here is the result:

image

Note that the second column is still parsed as integer.

The only way to force a real column is to do this:

image

@wendycwong

wendycwong commented Sep 12, 2023

I am going to add a flag, force_col_types. When it is set to true, the column types from the Parquet schema are used.

For other parsers, users can set force_col_types to true to force the column types to match those specified in col_types.

This flag is needed for numeric columns because H2O-3 tries to optimize memory use by using the minimum memory needed to store each column. For example, if your column is declared double but contains only integers, H2O-3 will store it as a column of integers. There is no truncation or rounding, so no precision is lost.
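
The storage choice described above can be illustrated with a small sketch. This is not H2O-3's actual implementation: the function names `smallest_storage` and `round_trips_exactly` and the list of candidate types are assumptions made for illustration. It only shows why storing integer-valued doubles as integers loses no precision.

```python
import struct

# Hypothetical candidate integer widths, narrowest first: (name, struct code).
INT_FORMATS = [("int8", "b"), ("int16", "h"), ("int32", "i"), ("int64", "q")]


def smallest_storage(values):
    """Pick the narrowest type that holds every value exactly (illustrative only)."""
    # Any fractional (or non-finite) value forces floating-point storage.
    if not all(float(v).is_integer() for v in values):
        return "float64"
    for name, code in INT_FORMATS:
        try:
            for v in values:
                struct.pack("<" + code, int(v))  # raises if the value is out of range
            return name
        except (struct.error, OverflowError):
            continue
    return "float64"  # integer-valued but too large for int64


def round_trips_exactly(values, code):
    """Check that packing as integers and reading back as doubles changes nothing."""
    fmt = f"<{len(values)}{code}"
    packed = struct.pack(fmt, *(int(v) for v in values))
    restored = [float(x) for x in struct.unpack(fmt, packed)]
    return restored == [float(v) for v in values]
```

For example, `smallest_storage([1.0, 2.0, 3.0])` selects the one-byte type, and `round_trips_exactly([1.0, 2.0, 3.0], "b")` confirms the values come back unchanged.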

However, if you want to specify your own numeric columns as integer or double, you need to set force_col_types to true and list the types in col_types.

For Parquet files, if you set force_col_types = true, the column types in the Parquet schema determine the final column types. You will miss out on the memory optimization provided by H2O-3.
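
A minimal sketch of how the flag might be used from Python, assuming it is exposed as a `force_col_types` keyword on `h2o.import_file` as planned above. The helper names `write_sample_csv` and `import_with_forced_types`, the column names, and the sample values are hypothetical. The H2O call needs a running cluster, so it is wrapped in a function that is not called here.

```python
import csv


def write_sample_csv(path):
    """Write a two-column CSV: one fractional column, one integer-valued column."""
    rows = [("a", "b"), (1.5, 1), (2.5, 2), (3.5, 3)]  # hypothetical sample data
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def import_with_forced_types(path):
    """Import the CSV, asking H2O-3 to honor col_types instead of optimizing storage."""
    import h2o  # imported lazily so the CSV helper works without H2O installed

    h2o.init()
    col_types = ["real", "real"]  # request both columns as real
    # Assumption: force_col_types is accepted as a keyword, per the plan in this thread.
    frame = h2o.import_file(path, col_types=col_types, force_col_types=True)
    return frame.types
```

If the flag behaves as described, `import_with_forced_types(write_sample_csv("sample.csv"))` should report both columns as real instead of storing column `b` as integers.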

wendycwong added this to the 3.44.0.1 milestone Sep 12, 2023
@wendycwong

When an integer or double column is specified as numeric in col_types, the parser has the final say on whether to parse it as integer or double. Hence, force_col_types does not handle this case.
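
The rules discussed in this thread can be summarized as a small decision function. This is a model of the described behavior, not H2O-3 source code: the function name `final_column_type` and the type strings are assumptions, and it covers only the numeric cases mentioned here.

```python
NUMERIC_TYPES = {"int", "real"}


def final_column_type(requested, inferred, force_col_types=False):
    """Model of the rule described in this thread (not H2O-3 source code).

    requested: type given in col_types ("int", "real", or "numeric")
    inferred:  type the parser would pick on its own ("int" or "real")
    """
    if requested == "numeric":
        # "numeric" leaves the choice to the parser, with or without the flag.
        return inferred
    if force_col_types and requested in NUMERIC_TYPES:
        # The flag makes the requested type win over the memory optimization.
        return requested
    # Without the flag, the memory optimization decides.
    return inferred
```

Under this model, a column of integers requested as real stays integer without the flag, becomes real with the flag, and is left to the parser when requested as numeric.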

wendycwong added a commit that referenced this issue Sep 13, 2023
wendycwong added a commit that referenced this issue Sep 20, 2023
wendycwong added a commit that referenced this issue Sep 28, 2023
wendycwong added a commit that referenced this issue Sep 28, 2023
…fied in parquet schema or in col_types.

GH-15741: finished forcing column types for non-Parquet parser
GH-15741: Add force_col_types to parquet parser
GH-15741: added force_col_types support to R client

Co-authored-by: Marek Novotný <marek.novotny@h2o.ai>
wendycwong added a commit that referenced this issue Oct 6, 2023
wendycwong added a commit that referenced this issue Oct 9, 2023
wendycwong added a commit that referenced this issue Oct 10, 2023
* GH-15741: added force_col_types parameter to force column types specified in parquet schema or in col_types.
GH-15741: finished forcing column types for non-Parquet parser
GH-15741: Add force_col_types to parquet parser
GH-15741: added force_col_types support to R client

Co-authored-by: Marek Novotný <marek.novotny@h2o.ai>

* use elegant idea from Tomas Frydas to perform integer to double column type conversion.

* remove commented out code and extra space.

* remove runtime dependencies to build.gradle.

---------

Co-authored-by: Marek Novotný <marek.novotny@h2o.ai>