Can we force column types by using col.types, and does it do what it is supposed to do? #15741
Comments
Okay, I have done the experiment. I have a frame that contains the following: It is saved as a CSV file. Next, I ran import_file with both columns specified as real: csv_df = h2o.import_file("/Users/wendycwong/temp/df_parquet.csv", col_types=col_types) Here is the result: Note that the second column is still parsed as integer. The only way to force a real column is to do this:
I am going to add a flag, force_col_types, to make sure that the Parquet schema column types are used when force_col_types is set to true. For other parsers, users can set force_col_types to force the column types to match what is specified in col_types. This flag is needed for numeric columns because H2O-3 tries to optimize memory use by storing each column in the minimum memory needed. For example, if your column is declared double but contains only integers, H2O-3 will store it as a column of integers. There is no truncation or rounding, so no precision is lost. However, if you want to specify your own numeric columns as integer or double, you need to set force_col_types to true and declare the types in col_types. For Parquet files, if you set force_col_types = true, the column types in the Parquet schema will be used to determine the final column types. In that case you will miss out on the memory optimization provided by H2O-3.
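For anyone who wants to try the flag described above, here is a minimal Python sketch. It assumes an H2O-3 release that already includes the force_col_types parameter from this issue and a cluster that has been started with h2o.init(). The helper names, the "data.csv" path and the "double" type names are placeholders for illustration, not taken from the issue, and the type names accepted by col_types may differ between versions.

```python
def forced_import_kwargs(path, col_types):
    """Build the keyword arguments for an import that forces column types.

    Hypothetical helper: it only bundles the arguments so they can be
    inspected before a cluster is involved.
    """
    return {
        "path": path,
        "col_types": col_types,    # requested type for each column
        "force_col_types": True,   # ask the parser to keep the requested types
    }


def import_with_forced_types(path, col_types):
    """Import a file and ask H2O-3 to honor col_types.

    Assumes h2o is installed, a cluster is running, and the installed
    version supports force_col_types.
    """
    import h2o  # imported lazily so the helpers can be defined without h2o
    return h2o.import_file(**forced_import_kwargs(path, col_types))


# Example call (placeholder path and type names, needs a running cluster):
# h2o.init()
# frame = import_with_forced_types("data.csv", ["double", "double"])
# print(frame.types)
```

Keeping the argument construction separate from the actual import is only a convenience for checking what will be sent; calling h2o.import_file directly with the same keyword arguments should be equivalent.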
Integer and double columns are both specified as numeric in col_types, and the parser gets the final say on whether to parse such a column as integer or double. Hence, force_col_types will not deal with this case.
…fied in parquet schema or in col_types.
* GH-15741: added force_col_types parameter to force column types specified in parquet schema or in col_types.
  GH-15741: finished forcing column types for non-parquet parser
  GH-15741: Add force_col_types to parquet parser
  GH-15741: added force_col_types support to R client
  Co-authored-by: Marek Novotný <marek.novotny@h2o.ai>
* use elegant idea from Tomas Frydas to perform integer to double column type conversion.
* remove commented out code and extra space.
* remove runtime dependencies to build.gradle.
---------
Co-authored-by: Marek Novotný <marek.novotny@h2o.ai>
H2O-3 parsing optimizes memory use by finding the smallest unit that can store each column of data. There is no rounding or truncation in this process.
However, if a user wants to store integers as doubles, can this be done using the col.types argument when calling h2o.import_file?
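To make the storage behavior in question concrete, here is a toy Python model. It is not H2O-3's parser logic, only a sketch of the idea that a column whose values are all whole numbers can be stored as integers without losing anything, unless the declared type is forced. The function name and the "int"/"real" labels are made up for this illustration, and the model assumes the declared type can represent every value.

```python
def stored_type(values, declared="real", force_col_types=False):
    """Return the type a toy parser would use to store a numeric column.

    Toy model only: without forcing, a column is stored as "int" when
    every value is a whole number, otherwise as "real". With forcing,
    the declared type is kept as-is.
    """
    if force_col_types:
        return declared
    all_whole = all(float(v).is_integer() for v in values)
    return "int" if all_whole else "real"


column = [1.0, 2.0, 3.0]

# Declared real, but stored as int because every value is a whole number.
print(stored_type(column))

# Converting to int and back gives the same values, so nothing is lost.
print([float(int(v)) for v in column] == column)

# Forcing keeps the declared type even though integers would fit.
print(stored_type(column, declared="real", force_col_types=True))
```

The trade-off is the one raised in this issue: forcing the declared type gives predictable column types, at the cost of the smaller storage the parser would otherwise choose.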