H2O import_file() fails to import "date" column from parquet (Spark dataframe) #7262

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 3 comments

@exalate-issue-sync

To reproduce:

  • Follow the steps below. The Parquet folder is also attached as a zip.

Create parquet in Pyspark:

{code:python}
from pyspark.sql import functions as F

# `spark` is assumed to be the active SparkSession (predefined in the PySpark shell)
columns = ["col_A", "date_string"]
data = [("Java", "2020-01-30"), ("Python", "2020-01-31"), ("Scala", "2020-02-01")]
df = spark.createDataFrame(data).toDF(*columns)

# Cast as date and make as new column
df2 = df.select(F.col("col_A"), F.col("date_string"), F.to_date(F.col("date_string"), "yyyy-MM-dd").alias("date_converted"))

# Save on hdfs
df2.write.save('date_for_testing_100643.parquet', format='parquet')
{code}

Load parquet in H2O/SW:

{noformat}
# Load in Sparkling Water
import h2o
from pysparkling import *
hc = H2OContext.getOrCreate()

# Load parquet as H2O frame
h2o.import_file('hdfs://mr-0xg10.0xdata.loc:8020/user/neema/date_for_testing_100643.parquet')
{noformat}

Returns:

[Screenshot: image-20211029-013413.png]

In the output above, date_converted is imported as an int; it should be a date.
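The integer values are not shown in the text of this report, so the following is only a guess at what they might be: the Parquet format stores its DATE logical type as a 32-bit integer counting days since 1970-01-01, and a reader that ignores that annotation would surface the raw integers. A minimal Python sketch, using only the standard library and the sample dates from the reproduction above, shows how such values map back to calendar dates. The helper names `date_to_days` and `days_to_date` are hypothetical and are not part of H2O or Spark.

```python
from datetime import date, timedelta

# Parquet's DATE logical type counts whole days from the Unix epoch.
EPOCH = date(1970, 1, 1)


def date_to_days(d: date) -> int:
    """Encode a calendar date as days since 1970-01-01 (hypothetical helper)."""
    return (d - EPOCH).days


def days_to_date(days: int) -> date:
    """Decode days since 1970-01-01 back into a calendar date (hypothetical helper)."""
    return EPOCH + timedelta(days=days)


# Sample dates taken from the reproduction steps in this issue.
samples = ["2020-01-30", "2020-01-31", "2020-02-01"]

encoded = [date_to_days(date.fromisoformat(s)) for s in samples]
decoded = [days_to_date(n).isoformat() for n in encoded]

print(encoded)
print(decoded)
```

If the imported column holds values such as 18291, decoding them this way would recover 2020-01-30. Whether that matches the screenshot has not been checked here.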

@h2o-ops-ro
Collaborator

JIRA Issue Details

Jira Issue: PUBDEV-8397
Assignee: krasinski
Reporter: Neema Mashayekhi
State: Resolved
Fix Version: 3.34.0.5
Attachments: Available (Count: 2)
Development PRs: Available

@h2o-ops-ro
Collaborator

Attachments From Jira

Attachment Name: date_for_testing_100643.parquet.zip
Attached By: Neema Mashayekhi
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8397/date_for_testing_100643.parquet.zip

Attachment Name: image-20211029-013413.png
Attached By: Neema Mashayekhi
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8397/image-20211029-013413.png

@h2o-ops-ro
Collaborator

Linked PRs from JIRA

#5884
