-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create epoch extraction op #2178
Conversation
ibis/pandas/execution/temporal.py
Outdated
| return pd.Series( | ||
| (pd.DatetimeIndex(data) - pd.Timestamp('1970-01-01')) | ||
| // pd.Timedelta('1s') | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is suboptimal. Not sure if pandas has a specific function to get the underlying data from a datetime, but at least this works, and should be faster:
return data.astype(int) // int(1e9)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @datapythonista do you know if the result is the same? or could it have any round difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The internal representation is the number of nanoseconds since epoch. Didn't check, but I assume the .astype(int) is returning that internal representation without errors. I think the // will ignore second decimals, not round to the closer. But among both implementations, if there are differences, I'd go with my implementation, since it's taking the epoch directly, not with an intermediary delta. But I'd say it should be exactly the same.
ibis/tests/all/test_temporal.py
Outdated
| if attr == 'epoch': | ||
| expected = pd.Series( | ||
| (pd.DatetimeIndex(df.timestamp_col) - pd.Timestamp('1970-01-01')) | ||
| // pd.Timedelta('1s') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's very useful to write a test with the same implementation as the tested function. If there is no better way to test this here, I'd remove epoch from this parametrization, and implement a separate test where you check that execute_epoch returns 10 for 1970-01-01 0:00:10 or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is just the same implementation for pandas. so I think we should keep it here. and maybe add an extra test just for pandas. how does it sound to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, didn't realize you'd be comparing with the results of other backends.
|
@datapythonista thanks for the review. the current PR will add epoch for all backends (as much as possible). I just need to stop working on that right now because I am investigating the CI problem. I hope to be back to this PR soon. thanks again for the review. |
5768ab0
to
64f2d11
Compare
|
@datapythonista do you know why it is not working on windows? ps: the pandas version used is 1.0.4 |
|
this PR is ready for review. thanks! the error in LinuxTest py38_sql_parquet happened after tests succeed. as I reran that before .. probably it is related to that, according to the message error: |
|
a friendly reminder about this PR. thanks! |
|
rebased! ready for review again. thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks @xmnlab
Added couple of suggestions, but looks ok as it is too.
ibis/clickhouse/operations.py
Outdated
| @@ -600,6 +605,7 @@ def _string_like(translator, expr): | |||
| ops.ExtractDay: unary('toDayOfMonth'), | |||
| ops.ExtractDayOfYear: unary('toDayOfYear'), | |||
| ops.ExtractQuarter: unary('toQuarter'), | |||
| ops.ExtractEpoch: _extract_epoch, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would make sense to directly use this? Feels like using one liners will make the code more readable, and it's simple enough IMO. Same would apply to other backends.
| ops.ExtractEpoch: _extract_epoch, | |
| ops.ExtractEpoch: lambda translator, expr: _call(translator, 'toRelativeSecondNum', *expr.op().args), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is not a big deal, I prefer to keep this. but if it would block this PR I can do it if no problem.
| def _extract(fmt): | ||
| def translator(t, expr): | ||
| def _extract(fmt, output_type=sa.SMALLINT): | ||
| def translator(t, expr, output_type=output_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit weird. I'd remove this, I think it should still work:
| def translator(t, expr, output_type=output_type): | |
| def translator(t, expr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I did it because I had a side effect without that.
Co-authored-by: Marc Garcia <garcia.marc@gmail.com>
|
thanks for the review @datapythonista, I applied your suggestion about the release note. thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, question on the api
ibis/expr/api.py
Outdated
| @@ -3425,6 +3426,7 @@ def _date_sub(left, right): | |||
| day_of_week=_day_of_week, | |||
| day_of_year=_extract_field('day_of_year', ops.ExtractDayOfYear), | |||
| quarter=_extract_field('quarter', ops.ExtractQuarter), | |||
| epoch=_extract_field('epoch', ops.ExtractEpoch), | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
epoch_seconds or seconds_since_epoch would be a more descriptive name i think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @jreback , changed to epoch_seconds
|
@datapythonista @jreback the changes requested were applied. thanks! |
|
thanks @xmnlab |
Added epoch extraction operation to Clickhouse, CSV, Impala, MySQL, OmniSciDB, Pandas, Parquet, PostgreSQL, PySpark, SQLite and Spark
Extra information about epoch int32 vs int64: