Currently, when importing partitioned Parquet or CSV files into H2O, the partitionBy column is not present in the resulting frame.
However, when Spark reads the same partitioned Parquet data, the partition column is included in the new Spark Frame.
Example:
{code:python}# Create Spark Frame from partitioned parquet
df_1 = spark.read.parquet("frame_1.parquet")
df_1.head(){code}
#Spark Frame has 5 columns (including ‘RT’ ← partitioned column)
{quote}Row(SERIALNO=673102, SPORDER=5, PUMA=100, Row_Number=21342, RT='P'){quote}
{code:python}# Create H2O Frame from partitioned parquet
h_frame1 = h2o.import_file("hdfs://mr-0xyz://user/UID/frame_1.parquet"){code}
#h2o frame has 4 columns (RT column is missing):
{quote}SERIALNO SPORDER PUMA Row_Number
84 1 2600 0
154 1 2500 1{quote}
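For context on why the column goes missing: when data is written with partitionBy, Spark stores the partition values in `key=value` directory names rather than inside the data files, so a reader that looks only at file contents never sees them. The snippet below is a minimal sketch of recovering those values from a path. It is not H2O's implementation; the helper `partition_values` and the file names under `frame_1.parquet` are hypothetical and used only for illustration, and the sketch ignores any escaping of special characters in directory names.

```python
from pathlib import PurePosixPath


def partition_values(file_path, root):
    """Return {column: value} parsed from key=value directories between root and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    values = {}
    for part in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = part.partition("=")
        if sep:  # ignore directories that are not of the form key=value
            values[key] = value  # values stay strings; no type inference here
    return values


# Hypothetical layout for a dataset partitioned by RT (file names are made up)
root = "frame_1.parquet"
files = [
    "frame_1.parquet/RT=P/part-00000.parquet",
    "frame_1.parquet/RT=H/part-00001.parquet",
]
for f in files:
    print(f, partition_values(f, root))
```

A reader that wants to match Spark's behaviour would append each recovered value as a constant column to the rows loaded from that file.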
Neema Mashayekhi commented: Example of new feature:
Python:
{code:python}df = h2o.import_file(path=pyunit_utils.locate("smalldata/partitioned/partitioned_arilines/"), partition_by=["Year", "IsArrDelayed"]){code}
R:
{code:r}df <- h2o.importFile(path = locate("smalldata/partitioned/partitioned_arilines/"), partition_by=c("Year", "IsArrDelayed")){code}
JIRA Issue Migration Info
Jira Issue: PUBDEV-7430
Assignee: Pavel Pscheidl
Reporter: Neema Mashayekhi
State: Resolved
Fix Version: 3.30.0.7
Attachments: N/A
Development PRs: Available
Linked PRs from JIRA
#4786