Create epoch extraction op #2178

Merged
merged 3 commits into from
Jul 9, 2020
Conversation

Contributor

xmnlab commented Apr 7, 2020

Added an epoch extraction operation to the ClickHouse, CSV, Impala, MySQL, OmniSciDB, Pandas, Parquet, PostgreSQL, PySpark, SQLite and Spark backends.

Extra information about epoch int32 vs int64:

return pd.Series(
    (pd.DatetimeIndex(data) - pd.Timestamp('1970-01-01'))
    // pd.Timedelta('1s')
)
Contributor

I think this is suboptimal. Not sure if pandas has a specific function to get the underlying data from a datetime, but at least this works, and should be faster:

return data.astype(int) // int(1e9)

Contributor Author

Thanks @datapythonista. Do you know if the result is the same, or could there be a rounding difference?

Contributor

The internal representation is the number of nanoseconds since the epoch. I didn't check, but I assume .astype(int) returns that internal representation without errors. I think // will drop fractional seconds rather than round to the nearest second. If the two implementations differ, I'd go with mine, since it takes the epoch value directly rather than going through an intermediate delta. But I'd say the results should be exactly the same.
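
As a rough illustration of the two approaches compared above, the sketch below computes epoch seconds both ways so the results can be checked against each other. The function names are made up for this example and are not part of ibis. The explicit casts to `datetime64[ns]` and `int64` are an assumption added here so that the nanosecond arithmetic does not depend on the platform or on the resolution pandas infers.

```python
import pandas as pd


def epoch_seconds_via_timedelta(data: pd.Series) -> pd.Series:
    """Subtract the epoch, then floor-divide the resulting deltas by one second."""
    deltas = pd.DatetimeIndex(data) - pd.Timestamp('1970-01-01')
    return pd.Series(deltas // pd.Timedelta('1s'), index=data.index)


def epoch_seconds_via_int64(data: pd.Series) -> pd.Series:
    """Read the nanosecond representation directly, then floor-divide by 1e9."""
    nanoseconds = data.astype('datetime64[ns]').astype('int64')
    return nanoseconds // 1_000_000_000


# 10.7 seconds after the epoch: floor division drops the fraction instead of rounding up.
example = pd.Series(pd.to_datetime(['1970-01-01 00:00:10.000',
                                    '1970-01-01 00:00:10.700']))
via_timedelta = epoch_seconds_via_timedelta(example)
via_int64 = epoch_seconds_via_int64(example)
```

For these inputs both functions should produce the same integer seconds, which matches the expectation in the comment above that the two implementations agree.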

ibis/tests/all/test_temporal.py
if attr == 'epoch':
    expected = pd.Series(
        (pd.DatetimeIndex(df.timestamp_col) - pd.Timestamp('1970-01-01'))
        // pd.Timedelta('1s')
Contributor

Not sure if it's very useful to write a test with the same implementation as the tested function. If there is no better way to test this here, I'd remove epoch from this parametrization, and implement a separate test where you check that execute_epoch returns 10 for 1970-01-01 0:00:10 or something like that.

Contributor Author

It is the same implementation only for pandas, so I think we should keep it here and maybe add an extra test just for pandas. How does that sound to you?

Contributor

Makes sense, didn't realize you'd be comparing with the results of other backends.
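
The fixed-value check suggested earlier in this thread could look roughly like the sketch below. `epoch_seconds` is a hypothetical stand-in for the pandas implementation under test; it is not the actual `execute_epoch` function in ibis, whose signature is not shown in this conversation.

```python
import pandas as pd


def epoch_seconds(data: pd.Series) -> pd.Series:
    """Hypothetical stand-in for the function under test."""
    return data.astype('datetime64[ns]').astype('int64') // 1_000_000_000


def test_epoch_seconds_known_values():
    # Hand-computed expectations, so the test does not mirror the implementation.
    data = pd.Series(pd.to_datetime(['1970-01-01 00:00:00',
                                     '1970-01-01 00:00:10',
                                     '1970-01-02 00:00:00']))
    result = epoch_seconds(data)
    assert result.tolist() == [0, 10, 86400]
```

The expected values are written out by hand (one day is 86400 seconds), so a bug in the implementation cannot silently cancel out against the same bug in the test.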

datapythonista added the labels feature (Features or general enhancements) and pandas (The pandas backend) on Apr 13, 2020
Contributor Author

xmnlab commented Apr 13, 2020

@datapythonista thanks for the review. This PR will add epoch for all backends (as far as possible). I need to pause work on it for now because I am investigating the CI problem, but I hope to get back to it soon. Thanks again for the review.

pep8speaks commented May 30, 2020

Hello @xmnlab! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-29 15:39:28 UTC

xmnlab force-pushed the add-epoch-op branch 2 times, most recently from 5768ab0 to 64f2d11, May 30, 2020 22:34
Contributor Author

xmnlab commented May 31, 2020

@datapythonista do you know why it is not working on Windows?

PS: the pandas version used is 1.0.4

ibis\tests\all\test_temporal.py:62: in test_timestamp_extract
    expected = df.timestamp_col.astype(int) // int(1e9)
c:\miniconda\envs\ibis36\lib\site-packages\pandas\core\generic.py:5698: in astype
    new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
c:\miniconda\envs\ibis36\lib\site-packages\pandas\core\internals\managers.py:582: in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
c:\miniconda\envs\ibis36\lib\site-packages\pandas\core\internals\managers.py:442: in apply
    applied = getattr(b, f)(**kwargs)
c:\miniconda\envs\ibis36\lib\site-packages\pandas\core\internals\blocks.py:2223: in astype
    return super().astype(dtype=dtype, copy=copy, errors=errors)
c:\miniconda\envs\ibis36\lib\site-packages\pandas\core\internals\blocks.py:625: in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
c:\miniconda\envs\ibis36\lib\site-packages\pandas\core\dtypes\cast.py:841: in astype_nansafe
    raise TypeError(f"cannot astype a datetimelike from [{arr.dtype}] to [{dtype}]")
E   TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]
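
The traceback shows the cast targeting `int32`. On Windows with the NumPy versions of that time, the built-in `int` maps to a 32-bit integer, which cannot hold nanosecond timestamps. One possible workaround, sketched below under the assumption that the column is `datetime64[ns]`, is to name the 64-bit type explicitly. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data standing in for df.timestamp_col.
timestamp_col = pd.Series(
    pd.to_datetime(['1970-01-01 00:00:10', '2020-05-31 00:00:00'])
).astype('datetime64[ns]')

# An explicit 64-bit cast does not depend on what plain `int` means on the platform.
expected = timestamp_col.astype(np.int64) // int(1e9)
```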

xmnlab marked this pull request as ready for review June 4, 2020 18:59
Contributor Author

xmnlab commented Jun 4, 2020

This PR is ready for review. Thanks!

The error in LinuxTest py38_sql_parquet happened after the tests succeeded. Since I had rerun that job before, the failure is probably related to the rerun, judging by the error message:

Information, ApplicationInsightsTelemetrySender correlated 2 events with X-TFS-Session 393403b6-d6fa-4c5c-a451-8ddc09a9b2e2
##[error]Artifact LinuxCondaEnvironment-38-main already exists for build 2565.

Contributor Author

xmnlab commented Jun 10, 2020

a friendly reminder about this PR. thanks!

Contributor Author

xmnlab commented Jun 12, 2020

rebased! ready for review again. thanks

Contributor

datapythonista left a comment

lgtm, thanks @xmnlab

Added a couple of suggestions, but it looks OK as it is too.

docs/source/release/index.rst Outdated
@@ -600,6 +605,7 @@ def _string_like(translator, expr):
    ops.ExtractDay: unary('toDayOfMonth'),
    ops.ExtractDayOfYear: unary('toDayOfYear'),
    ops.ExtractQuarter: unary('toQuarter'),
    ops.ExtractEpoch: _extract_epoch,
Contributor

Would it make sense to use this directly? I feel a one-liner would make the code more readable, and it's simple enough IMO. The same would apply to the other backends.

Suggested change
- ops.ExtractEpoch: _extract_epoch,
+ ops.ExtractEpoch: lambda translator, expr: _call(translator, 'toRelativeSecondNum', *expr.op().args),

Contributor Author

If it is not a big deal, I prefer to keep this. But if it would block this PR, I can change it, no problem.

- def _extract(fmt):
-     def translator(t, expr):
+ def _extract(fmt, output_type=sa.SMALLINT):
+     def translator(t, expr, output_type=output_type):
Contributor

This feels a bit weird. I'd remove this, I think it should still work:

Suggested change
- def translator(t, expr, output_type=output_type):
+ def translator(t, expr):

Contributor Author

actually I did it because I had a side effect without that.
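
The thread does not say what the side effect was. As general background only, one Python behavior that the `output_type=output_type` default-argument idiom guards against is late binding of closure variables: a nested function looks up an enclosing variable when it is called, not when it is defined. The sketch below uses made-up names and plain strings in place of SQLAlchemy types, and it is an assumption that this is related to the issue seen in the PR.

```python
def make_late_bound():
    """Closures read `output_type` at call time, so all see the last value."""
    translators = []
    for output_type in ('SMALLINT', 'BIGINT'):
        def translator(value):
            return (value, output_type)
        translators.append(translator)
    return translators


def make_early_bound():
    """A default argument captures the value at definition time."""
    translators = []
    for output_type in ('SMALLINT', 'BIGINT'):
        def translator(value, output_type=output_type):
            return (value, output_type)
        translators.append(translator)
    return translators


late = make_late_bound()
early = make_early_bound()
# late[0]('x') reports 'BIGINT', while early[0]('x') keeps 'SMALLINT'.
```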

Co-authored-by: Marc Garcia <garcia.marc@gmail.com>
Contributor Author

xmnlab commented Jun 21, 2020

thanks for the review @datapythonista, I applied your suggestion about the release note. thanks!

Contributor

jreback left a comment

looks good, question on the api

ibis/expr/api.py Outdated
@@ -3425,6 +3426,7 @@ def _date_sub(left, right):
    day_of_week=_day_of_week,
    day_of_year=_extract_field('day_of_year', ops.ExtractDayOfYear),
    quarter=_extract_field('quarter', ops.ExtractQuarter),
    epoch=_extract_field('epoch', ops.ExtractEpoch),
Contributor

epoch_seconds or seconds_since_epoch would be a more descriptive name, I think.

Contributor Author

thanks @jreback , changed to epoch_seconds

Contributor Author

xmnlab commented Jul 1, 2020

@datapythonista @jreback the changes requested were applied. thanks!

jreback added this to the Next Feature Release milestone Jul 9, 2020
jreback merged commit 32b8380 into ibis-project:master Jul 9, 2020
Contributor

jreback commented Jul 9, 2020

thanks @xmnlab

Labels
feature Features or general enhancements pandas The pandas backend
4 participants