[FSTORE-414] feature_view.create_train_test_split returns empty df #854
Conversation
I wonder what the initial problem was here. Was it that the conversion didn't work for string-based date columns?
python/hsfs/engine/spark.py
Outdated
    result_dfs = {}
    ts_type = dataset.select(event_time).dtypes[0][1]
    ts_col = (
        unix_timestamp(col(event_time)) * 1000
The udf should be able to handle "date" and "timestamp" as well, so we do not need the if-clause.

The problem was that we expect the event time column to have second precision, and in the backend we multiply the value by 1000 for comparison. However, Davit used a column that is in milliseconds, so the comparison did not work properly. @davitbzh Like Till said, we should handle millisecond values in the Java client as well.
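The mismatch described above can be illustrated with plain Python. This is a sketch with illustrative values, not hsfs code; `backend_compare` is a hypothetical stand-in for the backend's cutoff comparison:

```python
# Sketch of the precision mismatch described above (illustrative values only,
# not hsfs code). The backend assumes the event time column has second
# precision and multiplies it by 1000 before comparing against a cutoff
# that is already in milliseconds:
def backend_compare(event_time, cutoff_ms):
    return event_time * 1000 <= cutoff_ms

seconds = 1_640_995_200          # 2022-01-01 00:00:00 UTC, 10 digits
millis = 1_640_995_200_000       # the same instant in milliseconds, 13 digits
cutoff_ms = 1_641_081_600_000    # 2022-01-02 00:00:00 UTC, in milliseconds

print(backend_compare(seconds, cutoff_ms))  # True: the row is included
print(backend_compare(millis, cutoff_ms))   # False: scaled twice, row is dropped
```

A millisecond-precision column gets scaled by 1000 a second time, pushing every value far past any cutoff, which is consistent with the empty training split reported in this issue.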
    def _get_start_time(self):
        # minimum start time is 1 second
        return 1000
Is this change necessary?
If we eventually support only 10-digit timestamps, yes.
But for now we should accept any timestamp in seconds, right?
python/hsfs/engine/spark.py
Outdated
        return util.convert_event_time_to_timestamp(event_time)

    # registering the UDF
    _convert_event_time_to_timestamp = udf(_check_event_time_type, LongType())
can use util.convert_event_time_to_timestamp directly here.
Well, I don't want to put this part in util.convert_event_time_to_timestamp:

    # for backward compatibility
    if isinstance(event_time, int) and len(str(event_time)) == 13:
        event_time = int(event_time / 1000)
In util.py, we have

    elif isinstance(event_time, int):
        if event_time == 0:
            raise ValueError("Event time should be greater than 0.")
        # jdbc supports timestamp precision up to second only.
        if len(str(event_time)) < 13:
            event_time = event_time * 1000
        return event_time

so the code below is redundant:

    if isinstance(event_time, int) and len(str(event_time)) == 13:
        event_time = int(event_time / 1000)
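For reference, here is a minimal, runnable sketch of the integer branch quoted from util.py above. This is a simplification: the real util.convert_event_time_to_timestamp also handles str and datetime inputs, which are omitted here.

```python
# Sketch of the integer branch of util.convert_event_time_to_timestamp
# quoted above (simplified; str and datetime handling omitted).
def convert_event_time_to_timestamp(event_time):
    if isinstance(event_time, int):
        if event_time == 0:
            raise ValueError("Event time should be greater than 0.")
        # jdbc supports timestamp precision up to second only, so
        # 10-digit (second) epochs are scaled up to milliseconds;
        # 13-digit values are assumed to already be in milliseconds.
        if len(str(event_time)) < 13:
            event_time = event_time * 1000
        return event_time
    raise TypeError(f"Unsupported event time type: {type(event_time)}")

print(convert_event_time_to_timestamp(1_640_995_200))      # 1640995200000
print(convert_event_time_to_timestamp(1_640_995_200_000))  # 1640995200000
```

Since this function already normalizes everything to 13-digit milliseconds, dividing 13-digit values back down by 1000 afterwards would just undo the scaling, which is why the second snippet is redundant.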
Honestly, this is wrong:

    if len(str(event_time)) < 13:
        event_time = event_time * 1000

but I don't know how to replace it without breaking the API. If we don't use this, then how do we guarantee that event_time will have second granularity?

    if isinstance(event_time, int) and len(str(event_time)) == 13:
        event_time = int(event_time / 1000)
I can check for either 10 or 13 digits?
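The "10 or 13 digits" idea could look like the following. This is a hypothetical validator sketched for discussion, not hsfs code; it rejects anything that is not a 10-digit (second) or 13-digit (millisecond) epoch and always returns milliseconds:

```python
# Hypothetical validator for the "10 or 13 digits" idea discussed above:
# accept only 10-digit (second) or 13-digit (millisecond) epochs and
# always return milliseconds. A sketch, not hsfs code.
def normalize_epoch_ms(event_time):
    digits = len(str(event_time))
    if digits == 10:
        # second-precision epoch: scale up to milliseconds
        return event_time * 1000
    if digits == 13:
        # already millisecond precision
        return event_time
    raise ValueError(
        f"Expected a 10- or 13-digit epoch timestamp, got {digits} digits"
    )

print(normalize_epoch_ms(1_640_995_200))      # seconds -> 1640995200000
print(normalize_epoch_ms(1_640_995_200_000))  # milliseconds -> unchanged
```

Note the caveat raised earlier in this thread: any digit-count heuristic misclassifies legitimate second-precision epochs before 2001-09-09 (which have fewer than 10 digits), so it narrows but does not eliminate the ambiguity.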
Co-authored-by: kennethmhc <kennethmhc@users.noreply.github.com>
kennethmhc left a comment:
LGTM!
Co-authored-by: kennethmhc <kennethmhc@users.noreply.github.com>
This PR adds/fixes/changes...
Updated the conversion to epoch milliseconds in the Spark engine and used the util function to be consistent.
JIRA Issue: -
Priority for Review: -
Related PRs: -
How Has This Been Tested?
Checklist For The Assigned Reviewer: