PERF: optimize NaT lookups in cython modules #24008

Merged
merged 9 commits into from Dec 2, 2018

jbrockmendel
Member

By making NaT a cdef'd object that we can cimport, we remove a module-level namespace lookup from every "if obj is NaT" check. Since we do this check a lot, avoiding these global lookups buys a measurable speedup:

In [3]: vals = np.array([pd.NaT for _ in range(10**6)])
In [4]: %timeit pd.to_datetime(vals)

master:

10 loops, best of 3: 33.1 ms per loop
100 loops, best of 3: 29.9 ms per loop
10 loops, best of 3: 25.8 ms per loop
10 loops, best of 3: 20 ms per loop
10 loops, best of 3: 31 ms per loop
10 loops, best of 3: 19.8 ms per loop
10 loops, best of 3: 26.6 ms per loop
10 loops, best of 3: 32.2 ms per loop

PR:

10 loops, best of 3: 13.6 ms per loop
10 loops, best of 3: 20.6 ms per loop
10 loops, best of 3: 14.1 ms per loop
100 loops, best of 3: 14.7 ms per loop
10 loops, best of 3: 14.4 ms per loop
10 loops, best of 3: 20.9 ms per loop
10 loops, best of 3: 20.8 ms per loop
100 loops, best of 3: 16.6 ms per loop
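
To put a rough number on the two sets of timings above, here is a minimal sketch that summarizes them with the standard library. The values are copied from the runs above; the variable and function names are illustrative only, and with eight noisy runs per branch the ratio should be read as indicative rather than precise.

```python
from statistics import mean

# Per-loop timings in milliseconds, copied from the %timeit runs above.
master_ms = [33.1, 29.9, 25.8, 20.0, 31.0, 19.8, 26.6, 32.2]
pr_ms = [13.6, 20.6, 14.1, 14.7, 14.4, 20.9, 20.8, 16.6]


def summarize(label, timings):
    """Return (label, best, mean) for a list of timings in ms."""
    return label, min(timings), mean(timings)


for label, best, avg in (summarize("master", master_ms), summarize("PR", pr_ms)):
    print(f"{label:>6}: best {best:.1f} ms, mean {avg:.1f} ms")

# Ratio of mean timings; values below 1 mean the PR branch is faster.
ratio = mean(pr_ms) / mean(master_ms)
print(f"PR / master (mean): {ratio:.2f}")
```

On these runs the PR's mean time is roughly 60% of master's, which is consistent with the claim that the saved lookups add up.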

@pep8speaks

Hello @jbrockmendel! Thanks for submitting the PR.

@jbrockmendel
Member Author

Also moved some NaTType properties from the Python class to the Cython class, as those are supposed to be marginally more efficient.

@gfyoung gfyoung added Timeseries Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Nov 30, 2018
@jbrockmendel
Member Author

Travis failures are pickle plotting, not clearly related.

@codecov

codecov bot commented Dec 1, 2018

Codecov Report

Merging #24008 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24008      +/-   ##
==========================================
- Coverage   42.46%   42.46%   -0.01%     
==========================================
  Files         161      161              
  Lines       51557    51554       -3     
==========================================
- Hits        21892    21890       -2     
+ Misses      29665    29664       -1

Flag       Coverage Δ
#single    42.46% <100%> (-0.01%) ⬇️

Impacted Files                     Coverage Δ
pandas/core/arrays/datetimes.py    63.41% <ø> (-0.08%) ⬇️
pandas/core/arrays/period.py       36.97% <100%> (-0.15%) ⬇️
pandas/core/reshape/tile.py        11.69% <0%> (+0.06%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b0610b...8dc29f7. Read the comment docs.

@codecov

codecov bot commented Dec 1, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@9d85b22). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #24008   +/-   ##
=========================================
  Coverage          ?   42.44%           
=========================================
  Files             ?      161           
  Lines             ?    51559           
  Branches          ?        0           
=========================================
  Hits              ?    21886           
  Misses            ?    29673           
  Partials          ?        0

Flag       Coverage Δ
#single    42.44% <100%> (?)

Impacted Files                     Coverage Δ
pandas/core/arrays/datetimes.py    63.41% <ø> (ø)
pandas/core/arrays/period.py       36.97% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d85b22...05c2ce0. Read the comment docs.

Contributor

@jreback jreback left a comment

I find the introduction of NAT extra-confusing. What actually does this buy?

@jbrockmendel
Member Author

I find the introduction of NAT extra-confusing. What actually does this buy?

A ton of dict lookups. Consider nattype.is_null_datetimelike, in particular the line "elif val is NaT:". The C code generated for this in the status quo is:

  __pyx_t_4 = __Pyx_GetModuleGlobalName(__pyx_n_s_NaT); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 686, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_t_1 = (__pyx_v_val == __pyx_t_4);
  __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
  __pyx_t_3 = (__pyx_t_1 != 0);
  if (__pyx_t_3) {

In the PR the C code is:

  __pyx_t_1 = (__pyx_v_val == ((PyObject *)__pyx_v_6pandas_5_libs_6tslibs_7nattype_NAT));
  __pyx_t_3 = (__pyx_t_1 != 0);
  if (__pyx_t_3) {

In the status quo, it has to look up "NaT" in the module-level namespace dict (and incref/decref the result). The PR avoids that dict lookup, and the savings add up because we do this lookup a lot.
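
The same distinction can be seen in pure Python. This is only an analogy, not what Cython generates: the sentinel and function names below are hypothetical stand-ins, and the bytecode is CPython's rather than compiled C. The first function resolves the sentinel through the module globals on every call; the second compares against a reference that was bound once when the function was defined.

```python
import dis

# Stand-in sentinel; hypothetical, not the real pandas NaT.
NAT_SENTINEL = object()


def is_nat_global(val):
    # The name is resolved in the module's globals dict on every call.
    return val is NAT_SENTINEL


def is_nat_bound(val, _nat=NAT_SENTINEL):
    # The sentinel was bound once at definition time; no globals lookup here.
    return val is _nat


def opnames(func):
    """Return the set of bytecode operation names used by func."""
    return {ins.opname for ins in dis.get_instructions(func)}


print("LOAD_GLOBAL" in opnames(is_nat_global))
print("LOAD_GLOBAL" in opnames(is_nat_bound))
```

Both functions return the same answers; only the way the sentinel is reached differs, which mirrors the difference between the two C snippets above.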

@jreback
Contributor

jreback commented Dec 2, 2018

IMHO this adds a huge amount of mental overhead. I would either call this c_NaT, leave it as NaT but make it cimport-able (not sure if that is possible), or make this a private attribute _NaT and add a cdef accessor function to return it.

@jbrockmendel
Member Author

leave it as NaT but make it a c-importable (not sure if that is possible)

Agreed that would be ideal, but AFAICT cython won't allow NaT to be both cimport-able and import-able.

or make this a private attribute _NaT and put a cdef accessor function to return this.

Can you elaborate on what you have in mind with "cdef accessor function"?

I'm broadly indifferent between NAT vs c_NaT vs _NaT in terms of what we call the cdef'd version within nattype. I much prefer cimport NAT as NaT within other cython modules so only the imports are affected and not the rest of the code.

@jreback
Contributor

jreback commented Dec 2, 2018

Can you elaborate on what you have in mind with "cdef accessor function"?

I meant this, but this is not going to help the problem.

cdef get_c_nat():
    return _NaT

How about renaming what you are calling NAT -> c_NaT (and importing it as such)? Then I think I could be on board. The distinction would be very clear and explicit, with no wondering whether this is a C object or not. Yes, references in the Cython modules would change, but you are already changing them.

Contributor

@jreback jreback left a comment

some comments remain, otherwise lgtm.

cdef readonly:
    int64_t value
    object freq
# cdef readonly:
Contributor

can you remove

Contributor

or do we do this elsewhere to remind of the attributes?

Member Author

Yes, this is the pattern we used for _TSObject

@jreback jreback added this to the 0.24.0 milestone Dec 2, 2018
@jreback
Contributor

jreback commented Dec 2, 2018

lgtm. do we have sufficient asv's for this?

@jbrockmendel
Member Author

do we have sufficient asv's for this?

The %timeit results in the OP are all that's available ATM. I'm poking at an idea that would make it easier to identify what asvs to run for any particular commit/PR, will see how that pans out.
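
For reference, the %timeit snippet from the OP could be turned into an asv benchmark along these lines. This is a sketch only: the class and method names are hypothetical and no such benchmark is claimed to exist in the pandas suite. It relies on the asv conventions that `setup` runs before timing and that methods prefixed with `time_` are the ones timed.

```python
class NaTToDatetime:
    """Hypothetical asv benchmark mirroring the %timeit snippet in the OP."""

    def setup(self):
        # Imported lazily so this class can be defined without pandas installed;
        # a real benchmark file would normally import at module level.
        import numpy as np
        import pandas as pd

        self.pd = pd
        # Object array holding one million NaT values, as in the OP.
        self.vals = np.array([pd.NaT for _ in range(10**6)])

    def time_to_datetime_all_nat(self):
        # Only this call is timed; the array construction happens in setup.
        self.pd.to_datetime(self.vals)
```

Keeping the array construction in `setup` means the timing covers only the conversion, which is where the NaT checks happen.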

@jreback jreback merged commit a88ae2b into pandas-dev:master Dec 2, 2018
@jreback
Contributor

jreback commented Dec 2, 2018

thanks!

@jbrockmendel jbrockmendel deleted the eke_out branch December 2, 2018 23:10
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance Timeseries

4 participants