When importing parquet file and CSV, have option to include partitionBy column in frame #8207

exalate-issue-sync · 2023-05-11T19:32:36Z

Currently, when H2O imports partitioned Parquet or CSV files, the partitionBy column is missing from the resulting frame.

However, when Spark reads partitioned Parquet, the partition column is included in the resulting Spark DataFrame.

Example:

{code:python}
# Create a Spark DataFrame from partitioned Parquet
df_1 = spark.read.parquet("frame_1.parquet")
df_1.head()
{code}

The Spark DataFrame has 5 columns, including the partition column RT:

{quote}Row(SERIALNO=673102, SPORDER=5, PUMA=100, Row_Number=21342, RT='P'){quote}

{code:python}
# Create an H2O frame from partitioned Parquet
h_frame1 = h2o.import_file("hdfs://mr-0xyz://user/UID/frame_1.parquet")
{code}

The H2O frame has 4 columns (the RT column is missing):

{quote}
SERIALNO  SPORDER  PUMA  Row_Number
84        1        2600  0
154       1        2500  1
{quote}
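For context on why the column goes missing: when Spark writes a dataset with partitionBy, the partition values are stored in directory names of the form `column=value` rather than inside the data files, so a reader that only looks at file contents never sees them. The following is a minimal sketch of recovering those values from a path. The function name `partition_values` and the file name `part-00000.parquet` are assumptions made for illustration; this is not H2O's or Spark's actual implementation.

```python
from pathlib import PurePosixPath


def partition_values(file_path, root):
    """Return {column: value} parsed from Hive-style `column=value` directory names."""
    relative = PurePosixPath(file_path).relative_to(root)
    values = {}
    # Skip the last component: it is the data file itself, not a partition directory.
    for part in relative.parts[:-1]:
        column, sep, value = part.partition("=")
        if sep:  # only directories of the form column=value carry partition data
            values[column] = value
    return values


# Hypothetical layout under the dataset from the example above.
example = partition_values("frame_1.parquet/RT=P/part-00000.parquet", "frame_1.parquet")
print(example)
```

With the assumed layout, the sketch recovers `RT` with the value `P`, which matches the extra column in the Spark row shown earlier.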

exalate-issue-sync · 2023-05-11T19:32:37Z

Neema Mashayekhi commented: Example of the new feature:

Python:

{code:python}df = h2o.import_file(path=pyunit_utils.locate("smalldata/partitioned/partitioned_arilines/"), partition_by=["Year", "IsArrDelayed"]){code}

R:

{code:r}df <- h2o.importFile(path = locate("smalldata/partitioned/partitioned_arilines/"), partition_by=c("Year", "IsArrDelayed")){code}
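To illustrate the effect the `partition_by` argument is meant to have, here is a rough pure-Python sketch that appends the requested partition columns to rows read from a single file. Only the parameter name `partition_by` and the column names come from the comment above; the helper `add_partition_columns`, the directory layout, and the sample row are hypothetical, and the sketch says nothing about how H2O implements the import internally.

```python
def add_partition_columns(rows, file_path, partition_by):
    """Append the requested partition columns to every row read from one file."""
    # Collect column=value pairs from the directory names in the path.
    found = dict(
        part.split("=", 1) for part in file_path.split("/")[:-1] if "=" in part
    )
    missing = [name for name in partition_by if name not in found]
    if missing:
        raise ValueError(f"path {file_path!r} lacks partition columns {missing}")
    extra = {name: found[name] for name in partition_by}
    # Values stay strings here; a real importer would also infer column types.
    return [{**row, **extra} for row in rows]


# Hypothetical file path and row, for illustration only.
rows = [{"Origin": "SFO", "Dest": "JFK"}]
path = "partitioned_arilines/Year=1987/IsArrDelayed=YES/part-0.csv"
result = add_partition_columns(rows, path, ["Year", "IsArrDelayed"])
print(result)
```

Raising an error when a requested column is absent from the path is a design choice of this sketch; silently skipping it would hide a mismatch between `partition_by` and the directory layout.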

h2o-ops · 2023-05-14T22:30:40Z

JIRA Issue Migration Info

Jira Issue: PUBDEV-7430
Assignee: Pavel Pscheidl
Reporter: Neema Mashayekhi
State: Resolved
Fix Version: 3.30.0.7
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#4786

h2o-ops closed this as completed May 14, 2023

h2o-ops added the fixVersion/3.30.0.7 label May 14, 2023
