
PERF: changed default value of cache parameter to True in to_datetime function #26043

Merged
22 commits merged into pandas-dev:master from anmyachev:to_datetime_cache_true on Jul 4, 2019

Conversation

@anmyachev
Contributor

commented Apr 10, 2019

  • closes #N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@anmyachev anmyachev referenced this pull request Apr 10, 2019
4 of 4 tasks complete
@codecov

commented Apr 10, 2019

Codecov Report

Merging #26043 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26043      +/-   ##
==========================================
- Coverage    91.9%   91.89%   -0.01%     
==========================================
  Files         175      175              
  Lines       52485    52485              
==========================================
- Hits        48235    48230       -5     
- Misses       4250     4255       +5
Flag Coverage Δ
#multiple 90.45% <ø> (ø) ⬆️
#single 40.78% <ø> (-0.1%) ⬇️

Impacted Files Coverage Δ
pandas/core/tools/datetimes.py 84.59% <ø> (ø) ⬆️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.62% <0%> (-0.11%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6d9b702...317267c.

@codecov

commented Apr 10, 2019

Codecov Report

Merging #26043 into master will increase coverage by 0.07%.
The diff coverage is 95.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26043      +/-   ##
==========================================
+ Coverage   91.79%   91.86%   +0.07%     
==========================================
  Files         180      179       -1     
  Lines       50934    50728     -206     
==========================================
- Hits        46753    46600     -153     
+ Misses       4181     4128      -53
Flag Coverage Δ
#multiple 90.45% <95.45%> (-0.03%) ⬇️
#single 41.1% <50%> (-0.94%) ⬇️

Impacted Files Coverage Δ
pandas/core/tools/datetimes.py 85.67% <95.45%> (+0.61%) ⬆️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/base.py 91.83% <0%> (-8.17%) ⬇️
pandas/plotting/_misc.py 59.49% <0%> (-5.38%) ⬇️
pandas/plotting/_matplotlib/converter.py 58.43% <0%> (-5.24%) ⬇️
pandas/io/excel/_openpyxl.py 84.71% <0%> (-3.23%) ⬇️
pandas/core/config_init.py 92.91% <0%> (-3.17%) ⬇️
pandas/io/formats/printing.py 85.56% <0%> (-1.65%) ⬇️
pandas/core/internals/managers.py 95.21% <0%> (-0.94%) ⬇️
pandas/core/internals/construction.py 95.93% <0%> (-0.82%) ⬇️
... and 80 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7ab9ff5...2b1fa85.

@anmyachev

Contributor Author

commented Apr 10, 2019

@jreback apparently I didn't fully understand the reasons for the CI build failures last time (#25990 (comment)), and the problem turned out to be different (possibly in the master commit from which I branched).

The ASV results will follow later.

@mroeschke

Member

commented Apr 10, 2019

One downside of changing the default to True is that this bug (an edge case) will be more evident to users: #22305

Additionally, can you show ASV performance benchmarks for converting a small number of arguments? I suspect there will be a performance penalty with a small number of arguments and cache=True. Curious to see if there's an impact in this case (and how significant it is).
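
For context, a minimal sketch of the trade-off being discussed here, with illustrative data and sizes (not taken from the PR): the cache pays off when the input contains many repeated date strings and is pure overhead when every value is unique.

import pandas as pd

# Many repeated date strings: the cache lets each unique string be parsed once.
repeated = pd.date_range('2000-01-01', periods=100).strftime('%Y-%m-%d').tolist() * 100
pd.to_datetime(repeated, cache=True)   # caching helps here

# All-unique date strings: building the cache does no useful work.
unique = pd.date_range('2000-01-01', periods=10000).strftime('%Y-%m-%d').tolist()
pd.to_datetime(unique, cache=False)    # cheaper to skip the cache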

@anmyachev

Contributor Author

commented Apr 12, 2019

To give a picture of the overall situation:
asv continuous -f 1.05 origin/master to_datetime_cache_true -a warmup_time=1 -a sample_time=1:

master patch ratio test_name
1.81±0.01s failed n/a timeseries.ToDatetimeNONISO8601.time_different_offset
489±2μs 2.33±0.01ms 4.77 timeseries.TimeDatetimeConverter.time_convert
4.52±0.02ms 9.22±0.2ms 2.04 reshape.Cut.time_cut_datetime(4)
5.16±0.03ms 10.0±0.07ms 1.94 reshape.Cut.time_cut_datetime(10)
18.9±0.1ms 31.6±0.4ms 1.67 plotting.TimeseriesPlotting.time_plot_regular_compat
19.8±0.1ms 32.7±0.1ms 1.65 plotting.TimeseriesPlotting.time_plot_irregular
2.23±0.02ms 3.46±0.02ms 1.55 timeseries.ToDatetimeISO8601.time_iso8601
2.25±0.02ms 3.47±0.02ms 1.54 timeseries.ToDatetimeISO8601.time_iso8601_format
2.22±0.01ms 3.42±0.02ms 1.54 timeseries.ToDatetimeISO8601.time_iso8601_nosep
2.26±0.01ms 3.48±0.02ms 1.54 timeseries.ToDatetimeISO8601.time_iso8601_format_no_sep
1.58±0.01ms 2.34±0.01ms 1.49 timeseries.ResampleDatetetime64.time_resample
11.3±0.09ms 16.5±0.6ms 1.46 reshape.Cut.time_qcut_datetime(10)
10.6±0.05ms 15.4±0.2ms 1.46 reshape.Cut.time_qcut_datetime(4)
1.47±0.01ms 1.98±0.01ms 1.35 timeseries.ResampleDataFrame.time_method('min')
1.46±0.01ms 1.96±0ms 1.34 timeseries.ResampleDataFrame.time_method('max')
1.44±0.01ms 1.90±0.02ms 1.32 io.csv.ReadCSVParseDates.time_baseline
1.73±0.01ms 2.24±0.02ms 1.30 io.csv.ReadCSVParseDates.time_multiple_date
24.6±0.2ms 29.9±0.2ms 1.21 reshape.Cut.time_cut_datetime(1000)
947±4μs 1.10±0ms 1.16 timeseries.ResampleDataFrame.time_method('mean')
1.78±0.01ms 2.03±0.02ms 1.15 groupby.Datelike.time_sum('date_range')
3.59±0.02ms 4.08±0.04ms 1.14 timeseries.ToDatetimeYYYYMMDD.time_format_YYYYMMDD
43.3±0.5ms 48.5±0.4ms 1.12 reshape.Cut.time_qcut_datetime(1000)
1.34±0.02ms 1.46±0.01ms 1.09 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
9.02±0.1ms 9.79±0.07ms 1.09 io.sql.ReadSQLTable.time_read_sql_table_parse_dates
1.41±0.03ms 1.53±0ms 1.08 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
1.73±0.02ms 1.88±0.01ms 1.08 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd')
1.68±0.01ms 1.82±0.01ms 1.08 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
2.34±0.02ms 2.51±0.05ms 1.08 rolling.VariableWindowMethods.time_rolling('DataFrame', '1h', 'float', 'kurt')
1.44±0ms 1.54±0.01ms 1.07 algorithms.Hashing.time_series_int
92.4±2ms 98.8±0.6ms 1.07 binary_ops.Ops.time_frame_multi_and(False, 1)
456±3μs 484±4μs 1.06 frame_methods.Isnull.time_isnull_floats_no_null
1.44±0.01ms 1.53±0ms 1.06 algorithms.Hashing.time_series_timedeltas
719±7ns 761±10ns 1.06 period.PeriodProperties.time_property('min', 'hour')
5.85±0.2ms 6.17±0.07ms 1.06 frame_methods.Equals.time_frame_float_unequal
1.45±0ms 1.52±0.01ms 1.05 algorithms.Hashing.time_series_float
3.13±0.04ms 3.30±0.02ms 1.05 rolling.VariableWindowMethods.time_rolling('Series', '1h', 'float', 'min')
713±4ns 749±20ns 1.05 period.PeriodProperties.time_property('min', 'quarter')
141±0.6ms 134±2ms 0.95 strings.Methods.time_rpartition
105±0.7μs 99.8±0.6μs 0.95 indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('datetime', 'unique_monotonic_inc')
3.23±0.05μs 3.04±0.04μs 0.94 offset.OnOffset.time_on_offset(<YearBegin: month=1>)
1.74±0.01ms 1.64±0.01ms 0.94 rolling.Methods.time_rolling('DataFrame', 1000, 'float', 'skew')
3.59±0.02μs 3.37±0.1μs 0.94 timedelta.TimedeltaConstructor.time_from_np_timedelta
231±6ns 216±2ns 0.93 timestamp.TimestampProperties.time_dayofweek(tzutc(), None)
24.0±2ms 20.7±0.2ms 0.86 indexing.NonNumericSeriesIndexing.time_getitem_list_like('string', 'nonunique_monotonic_inc')
318±5ms 187±4ms 0.59 io.stata.StataMissing.time_read_stata('tw')
283±5ms 162±3ms 0.57 io.stata.StataMissing.time_write_stata('tw')
326±6ms 184±4ms 0.57 io.stata.StataMissing.time_read_stata('ty')
338±5ms 190±6ms 0.56 io.stata.StataMissing.time_read_stata('th')
337±5ms 187±5ms 0.55 io.stata.StataMissing.time_read_stata('tm')
331±8ms 181±5ms 0.55 io.stata.StataMissing.time_read_stata('tq')
264±10ms 129±3ms 0.49 io.stata.Stata.time_write_stata('tw')
220±5ms 97.0±2ms 0.44 io.stata.Stata.time_read_stata('tw')
242±5ms 99.4±3ms 0.41 io.stata.Stata.time_read_stata('th')
242±1ms 98.1±3ms 0.41 io.stata.Stata.time_read_stata('tq')
238±4ms 96.0±2ms 0.40 io.stata.Stata.time_read_stata('tm')
239±6ms 91.4±2ms 0.38 io.stata.Stata.time_read_stata('ty')
336±10ms 17.0±0.1ms 0.05 timeseries.ToDatetimeFormat.time_exact
324±3ms 16.0±0.3ms 0.05 timeseries.ToDatetimeFormat.time_no_exact
111±4ms 3.60±0.03ms 0.03 timeseries.ToDatetimeFormatQuarters.time_infer_quarter
931±4ms 2.56±0.01ms 0.00 timeseries.ToDatetimeNONISO8601.time_same_offset

@mroeschke I'll see what can be done with #22305.

It seems that I ran into another kind of error; has it been mentioned before?

By a small number of arguments, do you mean 10-100?

@mroeschke

Member

commented Apr 12, 2019

What other error did you run into? And sure, around 50 arguments where there are no duplicate arguments to parse.

@anmyachev

Contributor Author

commented Apr 13, 2019

That wasn't the right branch, so I will do a push --force. Sorry.

@anmyachev anmyachev force-pushed the anmyachev:to_datetime_cache_true branch from 29530f4 to 317267c Apr 13, 2019

@anmyachev

Contributor Author

commented Apr 13, 2019

@mroeschke I have provided a workaround for #22305 in PR #26078. Could you take a look?

@anmyachev

Contributor Author

commented Apr 15, 2019

First of all, the ASV benchmark:

from pandas import date_range, to_datetime


class ToDatetimeCacheSmallCount(object):

    params = [True, False]
    param_names = ['cache']

    def setup(self, cache):
        N = 50
        rng = date_range(start='1/1/2000', periods=N)
        self.unique_date_strings = rng.strftime('%Y-%m-%d').tolist()

    def time_unique_date_strings(self, cache):
        to_datetime(self.unique_date_strings, cache=cache)

asv run -E existing -b ^timeseries.ToDatetimeCacheSmallCount -a warmup_time=1 -a sample_time=3:

cache test_time
True 501±20μs
False 335±0.9μs

I also ran the test with 5000 elements (to be more confident in the numbers):

size    time increase
50      50%
5000    48%
@anmyachev

Contributor Author

commented Apr 15, 2019

I do not know yet what the error is; first I want to rebase onto master.

@anmyachev anmyachev force-pushed the anmyachev:to_datetime_cache_true branch from 317267c to d45c434 Apr 15, 2019

@anmyachev

Contributor Author

commented Apr 15, 2019

When I run timeseries.ToDatetimeNONISO8601.time_different_offset, the following error appears:
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
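
A minimal sketch of how that error can arise, assuming the benchmark parses strings that carry different UTC offsets (illustrative data, not the actual benchmark input):

import pandas as pd

same_offset = ['2000-01-01 00:00:00+05:00'] * 3
diff_offset = ['2000-01-01 00:00:00+05:00',
               '2000-01-01 00:00:00+06:00',
               '2000-01-01 00:00:00+07:00']

pd.to_datetime(same_offset)            # fine: a single fixed offset
pd.to_datetime(diff_offset, utc=True)  # fine: everything is converted to UTC

# Mixed offsets without utc=True cannot be represented by a single
# datetime64 dtype; depending on the code path taken (for example the
# cache construction), this is where the ValueError quoted above appears.
# pd.to_datetime(diff_offset)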

@jreback

Contributor

commented May 12, 2019

can you merge master and update

@anmyachev anmyachev force-pushed the anmyachev:to_datetime_cache_true branch from d45c434 to 12eac47 May 12, 2019

@jorisvandenbossche

Member

commented May 14, 2019

What is the performance impact for the (rather typical I think) case with all unique datetimes?

@anmyachev can you provide some timings for that? (or point to the benchmark result in one of your previous comments that represent that case)

@jorisvandenbossche

Member

commented May 14, 2019

See also comment of @TomAugspurger on the read_csv PR: #25990 (comment)

@anmyachev

Contributor Author

commented May 15, 2019

What is the performance impact for the (rather typical I think) case with all unique datetimes?

@anmyachev can you provide some timings for that? (or point to the benchmark result in one of your previous comments that represent that case)

@jorisvandenbossche results for the case with all unique datetimes (performance decreases by roughly 2x):

asv run -E existing -b ToDatetimeCacheSmallCount -a warmup_time=0.5 -a sample_time=1:

cache   count=50   count=500   count=5000    count=100000
True    574±3μs    705±10μs    1.74±0.01ms   35.0±0.9ms
False   319±2μs    391±8μs     977±0μs       15.4±0.08ms

The ASV benchmark code:

from pandas import date_range, to_datetime


class ToDatetimeCacheSmallCount(object):

    params = ([True, False], [50, 500, 5000, 100000])
    param_names = ['cache', 'count']

    def setup(self, cache, count):
        rng = date_range(start='1/1/1971', periods=count)
        self.unique_date_strings = rng.strftime('%Y-%m-%d').tolist()

    def time_unique_date_strings(self, cache, count):
        to_datetime(self.unique_date_strings, cache=cache)

@jreback
Contributor

left a comment

need a whatsnew note in the performance section

pandas/core/tools/datetimes.py
@jreback

Contributor

commented May 15, 2019

is this orthogonal to #26097?

@vnlitvin

Contributor

commented May 15, 2019

is this orthogonal to #26097

Not sure... that PR is trying to address a bug that manifests when cache=True, so if the default is changed, exposure to that bug would be higher. But these could still potentially be applied independently.

@anmyachev anmyachev force-pushed the anmyachev:to_datetime_cache_true branch from 12eac47 to 81f54c0 May 17, 2019

@jreback jreback added this to the 0.25.0 milestone May 19, 2019

@jreback

Contributor

commented May 19, 2019

small comment, pls merge master and ping on green.

pandas/core/tools/datetimes.py (outdated)

@anmyachev anmyachev force-pushed the anmyachev:to_datetime_cache_true branch from 5210268 to 98e18a8 Jul 3, 2019

@jreback
jreback approved these changes Jul 3, 2019
@jreback

Contributor

commented Jul 3, 2019

lgtm. ping on green.

@jorisvandenbossche jorisvandenbossche referenced this pull request Jul 3, 2019
@pep8speaks


commented Jul 3, 2019

Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 332:80: E501 line too long (83 > 79 characters)

Review thread on the following code excerpt:

kwds['cache_dates'] = do_cache
read_csv(self.data(self.StringIO_input), header=None,
         parse_dates=[0], **kwds)
try:

@jreback
Contributor

commented Jul 3, 2019

@TomAugspurger ok method of handling?

@TomAugspurger
Contributor

commented Jul 3, 2019

Seems fine.

@TomAugspurger
Contributor

commented Jul 3, 2019

Although... I worry it would incorrectly catch a TypeError in the function? The other way might be to check pandas.__version__?

@jreback
Contributor

commented Jul 3, 2019

hmm, let me see what i can do
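
For reference, a self-contained sketch of the two compatibility approaches being weighed in this thread: feature detection via try/except versus an explicit version check. The data, the 0.25.0 version cutoff, and the surrounding structure are assumptions for illustration, not the code that was merged.

from io import StringIO
from distutils.version import LooseVersion

import pandas as pd
from pandas import read_csv

data = StringIO("2019-01-01\n2019-01-02\n2019-01-03\n")
do_cache = True

# Option 1: feature-detect via TypeError. The concern raised above is that
# this could also swallow an unrelated TypeError raised inside read_csv.
try:
    df = read_csv(data, header=None, parse_dates=[0], cache_dates=do_cache)
except TypeError:
    data.seek(0)
    df = read_csv(data, header=None, parse_dates=[0])

# Option 2: gate on the pandas version instead. Using 0.25.0 here is an
# assumption for the release that added the cache_dates keyword.
data.seek(0)
kwds = {}
if LooseVersion(pd.__version__) >= LooseVersion('0.25.0'):
    kwds['cache_dates'] = do_cache
df = read_csv(data, header=None, parse_dates=[0], **kwds)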

TomAugspurger and others added 4 commits Jul 3, 2019
@jreback

Contributor

commented Jul 3, 2019

patched an edge case that was showing up in the ASVs

@jreback jreback merged commit ce567de into pandas-dev:master Jul 4, 2019

14 checks passed

codecov/patch 95.45% of diff hit (target 50%)
codecov/project 92.78% (target 82%)
continuous-integration/travis-ci/pr The Travis CI build passed
pandas-dev.pandas Build #20190704.3 succeeded
pandas-dev.pandas (Checks) Checks succeeded
pandas-dev.pandas (Docs) Docs succeeded
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded
@jreback

Contributor

commented Jul 4, 2019

thanks @anmyachev

Projects: None yet
8 participants