
PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime #25754

Merged

merged 42 commits into pandas-dev:master from anmyachev:datehelpers-performance-patch on May 12, 2019

Conversation

@anmyachev (Contributor) commented Mar 17, 2019

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@jreback (Contributor) left a comment:

woa what benchmark are u trying to make better? this is adding a huge amount of complexity

@codecov commented Mar 17, 2019

Codecov Report

Merging #25754 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25754      +/-   ##
==========================================
+ Coverage   91.25%   91.25%   +<.01%     
==========================================
  Files         172      172              
  Lines       52977    52971       -6     
==========================================
- Hits        48342    48338       -4     
+ Misses       4635     4633       -2
Flag        Coverage Δ
#multiple   89.83% <100%> (ø) ⬆️
#single     41.74% <50%> (-0.01%) ⬇️

Impacted Files           Coverage Δ
pandas/io/parsers.py     95.39% <100%> (+0.04%) ⬆️
pandas/util/testing.py   89.08% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 707c720...846f0bf.

@codecov commented Mar 17, 2019

Codecov Report

Merging #25754 into master will decrease coverage by 0.12%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25754      +/-   ##
==========================================
- Coverage   92.04%   91.91%   -0.13%     
==========================================
  Files         175      174       -1     
  Lines       52291    50708    -1583     
==========================================
- Hits        48131    46610    -1521     
+ Misses       4160     4098      -62
Flag        Coverage Δ
#multiple   90.42% <ø> (-0.17%) ⬇️
#single     41.18% <ø> (+0.31%) ⬆️

Impacted Files                      Coverage Δ
pandas/io/gbq.py                    78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py                97.01% <0%> (-0.12%) ⬇️
pandas/core/arrays/categorical.py   95.95% <0%> (ø) ⬆️
pandas/core/indexes/range.py        97.97% <0%> (+0.01%) ⬆️
pandas/core/base.py                 98% <0%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7bfbd81...5dda33c.

@anmyachev (Contributor, author) commented Mar 17, 2019

I'm trying to speed up the read_csv function. Manually, I ran asv as follows:
asv continuous -f 1.05 origin HEAD -b io.csv
and got these results:

before after ratio test name
[707c720] [846f0bf]
master datehelpers-performance-patch
2.98±0.06ms 2.78±0.02ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
2.56±0.04ms 2.38±0.01ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
1.94±0.03ms 1.79±0.04ms 0.93 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd')
3.27±0.08ms 3.02±0.04ms 0.92 io.csv.ReadUint64Integers.time_read_uint64
1.55±0.03ms 1.38±0.02ms 0.89 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')

During further performance testing, one of the datasets showed a 25% decrease in read_csv runtime.

@jreback (Contributor) commented Mar 17, 2019

these are tiny numbers - and likely not the cause of an actual slowdown - read_csv is io bound for the most part

@anmyachev (Contributor, author) commented Mar 18, 2019

I have written a simple benchmark to demonstrate the impact.
Download the dataset here: https://vincentarelbundock.github.io/Rdatasets/csv/mosaicData/Birthdays.csv
and run this simple script:

#!/usr/bin/env python
import sys
import time
import pandas as pd

print("pandas version - %s" % pd.__version__)

# usage: <script> <dataset.csv> [minutes to run, default 1]
if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

# call read_csv repeatedly until the time budget is spent
count_call = 0
test_result_time = 0
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1
print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)

Without the proposed patch:

pandas version - 0.25.0.dev0+266.g707c7201a
167 calls read_csv on .\Birthdays.csv per 60.212000 seconds
average time per call read_csv - 0.360551

With:

pandas version - 0.25.0.dev0+267.g846f0bfa5
208 calls read_csv on .\Birthdays.csv per 60.246000 seconds
average time per call read_csv - 0.289644
@jreback (Contributor) commented Mar 18, 2019

@anmyachev can u show the effect on the asv benchmarks for csv

@vnlitvin (Contributor) commented Mar 18, 2019

Here's the makeshift script that directly measures time spent in _concat_date_cols:

#!/usr/bin/env python
import sys
import time
import pandas as pd

import pandas.io.parsers

# wrap _concat_date_cols so the time spent inside it is accumulated
old_concat_date_cols = pandas.io.parsers._concat_date_cols
total = 0
def my_concat_date_cols(*a, **kw):
    global total
    start = time.time()
    try:
        return old_concat_date_cols(*a, **kw)
    finally:
        total += time.time() - start
pandas.io.parsers._concat_date_cols = my_concat_date_cols

print("pandas version - %s" % pd.__version__)

# usage: <script> <dataset.csv> [minutes to run, default 1]
if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

# call read_csv repeatedly until the time budget is spent
count_call = 0
test_result_time = 0
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1
print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)
print("average time for _concat_date_cols() - %f" % (total / count_call))

And here are its results:

pandas     read_csv average time    _concat_date_cols average time
Original   0.418585                 0.076364
Patched    0.310918                 0.007838

As you can see here, with the patch _concat_date_cols gets 10x faster.
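
For context, the pure-Python helper being measured is, in simplified paraphrase (edge cases and Python 2 branches omitted, so not the exact pandas source), a pair of comprehensions that pay Python-level overhead per cell and per row:

import numpy as np

# Simplified paraphrase of the pre-PR pure-Python _concat_date_cols:
# one str() call per cell, plus a generator expression and a str.join
# per row in the multi-column case.
def _concat_date_cols(date_cols):
    if len(date_cols) == 1:
        return np.array([str(x) for x in date_cols[0]], dtype=object)
    return np.array([' '.join(str(y) for y in x)
                     for x in zip(*date_cols)],
                    dtype=object)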

@jreback (Contributor) commented Mar 18, 2019

@anmyachev see the docs: http://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

you would need to see the effect on the current perf test suite & write a new benchmark in the same style.
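
For readers unfamiliar with the suite: an asv benchmark is a plain class whose setup() runs untimed and whose time_* methods are timed. A minimal sketch in that style (the body is illustrative, not the exact code later added in 61d8e6e) that exercises _concat_date_cols by making read_csv combine two columns:

from io import StringIO

import pandas as pd


class ReadCSVConcatDatetime:
    # asv calls setup() untimed, then times time_read_csv() repeatedly
    def setup(self):
        rows = ('2019-03-{:02d},{:02d}:00:00'.format(i % 28 + 1, i % 24)
                for i in range(50000))
        self.data = 'date,time\n' + '\n'.join(rows)

    def time_read_csv(self):
        # parse_dates=[['date', 'time']] concatenates the two columns,
        # which routes through _concat_date_cols
        pd.read_csv(StringIO(self.data), parse_dates=[['date', 'time']])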

@anmyachev (Contributor, author) commented Mar 18, 2019

The original csv benchmarks were mostly not measuring _concat_date_cols; most of their time was spent in the parsing itself. I have composed new benchmarks (see 61d8e6e) that highlight the speedup brought by this PR.

master datehelpers-performance-patch speedup test name
28.6±0.3ms 24.2±0.2ms 0.85 io.csv.ReadCSVConcatDatetime.time_read_csv
12.1±0.2ms 7.17±0.3ms 0.59 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
17.1±0.1ms 6.90±0.1ms 0.40 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
16.2±0.4ms 5.04±0.2ms 0.31 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
@jreback (Contributor) commented Mar 18, 2019

@anmyachev those are not much of a speedup. This PR adds a huge amount of complexity / c-code. I would rather you cythonize things if need be. Am -1 on adding c-code.

@vnlitvin (Contributor) commented Mar 18, 2019

I would rather you cythonize things if need be.

I have implemented a small benchmark for _does_string_look_like_datetime. It clearly shows that Cython in certain cases like tight loops or lots of exception raising/handling can lag behind C code quite a bit:

before after ratio test case
[707c720] [e66883e]
pre-patch datehelpers-performance-patch
105±3ms 32.9±0.05ms 0.31 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
168±4ms 46.1±1ms 0.27 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')
296±5ms 41.5±0.4ms 0.14 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')
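
The win comes from sidestepping Python's exception machinery. A minimal Cython sketch of the "conversion to float without exceptions" idea (illustrative only, not the merged code, which lives in pandas/_libs/tslibs/parsing.pyx):

from libc.stdlib cimport strtod

# Parse a float via C strtod so that a failed parse costs a flag check
# instead of a raised-and-caught ValueError per non-numeric token.
cdef inline bint try_parse_float(str value, double *result):
    cdef:
        bytes buf = value.encode('utf-8')
        const char *data = buf
        char *endptr = NULL
    result[0] = strtod(data, &endptr)
    # success only if at least one character was consumed and nothing
    # unparsed trails the number
    return endptr != data and endptr[0] == 0

Compared with a Python-level try/except around float(value), the failure path here is just a pointer comparison, which is why a never-a-float input like '2Q2005' shows the biggest gain above.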
@vnlitvin (Contributor) commented Mar 18, 2019

Also note that the benchmarks currently work on a quite limited amount of data to keep them fast; if I patch them to resemble real-life scenarios (using 5M lines instead of 50K), the results would be:

before after ratio test case
[707c720] [e66883e]
pre-patch datehelpers-performance-patch
3.38±0s 2.77±0s 0.82 io.csv.ReadCSVConcatDatetime.time_read_csv
1.19±0.01s 584±9ms 0.49 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
1.73±0.01s 482±10ms 0.28 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
1.66±0.01s 414±2ms 0.25 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')

That yields at least a 0.5s saving on reading (or more than 1 second in the best case for our optimization).

@mroeschke (Member) commented Mar 18, 2019

Is there any way to align _does_string_look_like_datetime closer to what your C implementation does, or are the speedups purely language-specific? I am not the biggest fan of the maintenance burden of this approach either.

@vnlitvin (Contributor) commented Mar 19, 2019

@mroeschke we did our best to bring our improvements from C to Cython for _does_string_look_like_datetime (and the new benchmarks show it's now on par with the C version).

We're now trying to rewrite _concat_date_cols in Cython as well; we'll see how that impacts the benchmark numbers.
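
In broad strokes, such a rewrite keeps the per-row work in a typed loop over a preallocated object array. A hedged sketch of the idea (assuming string-convertible inputs; not the implementation that ultimately landed):

import numpy as np

cimport numpy as cnp

cnp.import_array()


def concat_date_cols_sketch(tuple date_cols):
    cdef:
        Py_ssize_t i, j
        Py_ssize_t rows = len(date_cols[0])
        Py_ssize_t ncols = len(date_cols)
        cnp.ndarray[object] result = np.empty(rows, dtype=object)

    if ncols == 1:
        # single column: just stringify each cell
        for i in range(rows):
            result[i] = str(date_cols[0][i])
        return result

    for i in range(rows):
        # join the i-th field of every column with a single space
        result[i] = ' '.join([str(date_cols[j][i]) for j in range(ncols)])
    return result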

@anmyachev anmyachev force-pushed the anmyachev:datehelpers-performance-patch branch from d545d3a to 2e2d588 Mar 22, 2019

@pep8speaks commented Mar 22, 2019

Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-09 21:12:33 UTC
@anmyachev (Contributor, author) commented Mar 22, 2019

@jreback, @mroeschke what do you say about this Cython version of _concat_date_cols?
It loses some performance relative to C, but not by much.
Specific numbers will follow.

@jreback (Contributor) commented Mar 22, 2019

@anmyachev pls remove all c-code. further this is not passing.

@vnlitvin vnlitvin force-pushed the anmyachev:datehelpers-performance-patch branch from 2e2d588 to 7074b96 Mar 22, 2019

@vnlitvin (Contributor) commented Mar 22, 2019

The performance loss of Cython vs C code is noticeable but not big:
asv continuous -f 1.05 datehelpers-performance-patch datehelpers-performance-patch-purec -b io.parsers -e -a sample_time=2 -a warmup_time=2

before after ratio test name
[7074b96] [6ba6fb91]
datehelpers-performance-patch datehelpers-performance-patch-purec
365±2μs 347±3μs 0.95 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, )
954±20μs 899±10μs 0.94 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, )
82.3±0.2μs 62.3±2μs 0.76 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, <class 'list'>)
@jreback (Contributor) commented Mar 22, 2019

@anmyachev we are -1 on adding really any c-code. this increases complexity exponentially. if you have a perf improvement with cython great. but even that must be clear code.

@vnlitvin (Contributor) commented Mar 22, 2019

@jreback I've removed the C code and rebased to the latest master.
Here are the latest benchmark results, run via asv continuous -f 1.05 upstream/master datehelpers-performance-patch -b io.parsers -e -a sample_time=2 -a warmup_time=2:

before after ratio test name
[f2bcb9f] [764aafd]
master datehelpers-performance-patch
105±0.6ms 36.5±1ms 0.35 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
1.74±0.01ms 601±8μs 0.35 io.parsers.ConcatDateCols.time_check_concat(1234567890, 1, <class 'list'>)
161±0.6ms 51.3±1ms 0.32 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')
6.97±0.02ms 1.41±0.01ms 0.20 io.parsers.ConcatDateCols.time_check_concat(1234567890, 2, <class 'list'>)
291±2ms 50.6±2ms 0.17 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')
4.92±0.1ms 722±6μs 0.15 io.parsers.ConcatDateCols.time_check_concat(1234567890, 1, )
13.7±0.6ms 1.75±0.02ms 0.13 io.parsers.ConcatDateCols.time_check_concat(1234567890, 2, )
3.10±0.2ms 372±4μs 0.12 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, )
9.97±0.4ms 919±10μs 0.09 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, )
5.91±0.04ms 540±6μs 0.09 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, <class 'list'>)
1.12±0.03ms 81.0±0.5μs 0.07 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, <class 'list'>)

P.S. While the gain might look small, please look at the ratio more than at the raw time units, as we run the benchmark on an Intel(R) Core(TM) i9-7900X (raw timings would be several times bigger on, for example, a data scientist's laptop). Also, the gain scales linearly: for more real-world cases with 10M lines in a csv file (multiplied by the number of date columns) it would be several seconds saved.

@jreback (Contributor) commented Mar 22, 2019

P.S. While the gain might look small, please look at the ratio more than at the raw time units, as we run the benchmark on an Intel(R) Core(TM) i9-7900X (raw timings would be several times bigger on, for example, a data scientist's laptop). Also, the gain scales linearly: for more real-world cases with 10M lines in a csv file (multiplied by the number of date columns) it would be several seconds saved.

of course, the only time we care about the absolute numbers is to avoid the impact of function calls (e.g. some micro benchmarks are severely impacted by just 1 or 2 function-calling conventions). also we are mindful of not making the absolute time too big, as the benchmark suite is large.

@vnlitvin (Contributor) commented Mar 22, 2019

@jreback I hope I fixed all CI issues (we will see soon), and this patch is as compact as we can make it while keeping the same speedup.

@vnlitvin (Contributor) commented Mar 22, 2019

@jreback pinging on green

@jreback (Contributor) commented Mar 22, 2019

can you show the entire suite of csv parsing benchmark results?

@anmyachev (Contributor, author) commented Mar 22, 2019

@jreback, here are the latest benchmark results, run via asv continuous -f 1.05 origin/master datehelpers-performance-patch -b io.csv -e -a sample_time=2 -a warmup_time=2:

before after ratio test_name
[f2bcb9f] [8caea7f]
master datehelpers-performance-patch
1.76±0.03ms 1.65±0.02ms 0.94 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
1.40±0.01ms 1.30±0.01ms 0.93 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
1.47±0.01ms 1.35±0.01ms 0.92 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
27.4±0.3ms 22.0±0.2ms 0.80 io.csv.ReadCSVConcatDatetime.time_read_csv
12.0±0.1ms 6.95±0.07ms 0.58 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
17.6±0.2ms 6.37±0.08ms 0.36 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
16.0±0.1ms 4.83±0.1ms 0.30 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')

@anmyachev anmyachev changed the title PERF: rewrited _concat_date_cols function on C with removing extra co… PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime Mar 22, 2019

@jreback (Contributor) commented Mar 22, 2019

@anmyachev maybe you didn't see my comment. I want ALL of the csv asv benchmarks. I want to see if your changes actually matter beyond micro-benchmarking.

@anmyachev anmyachev force-pushed the anmyachev:datehelpers-performance-patch branch from 713fb97 to c06a662 May 8, 2019

@anmyachev (Contributor, author) commented May 8, 2019

@jreback rebase & green

@jreback (Contributor) approved these changes May 9, 2019, with a comment:

tiny comment, ping on green.

Review thread on pandas/_libs/tslibs/parsing.pyx (resolved).
@anmyachev (Contributor, author) commented May 9, 2019

@jreback ping on green

@jreback jreback merged commit b48d1ff into pandas-dev:master May 12, 2019

11 checks passed

continuous-integration/travis-ci/pr - The Travis CI build passed
pandas-dev.pandas - Build #20190509.18 succeeded
pandas-dev.pandas (Checks_and_doc) - Checks_and_doc succeeded
pandas-dev.pandas (Linux py35_compat) - Linux py35_compat succeeded
pandas-dev.pandas (Linux py36_locale_slow) - Linux py36_locale_slow succeeded
pandas-dev.pandas (Linux py36_locale_slow_old_np) - Linux py36_locale_slow_old_np succeeded
pandas-dev.pandas (Linux py37_locale) - Linux py37_locale succeeded
pandas-dev.pandas (Linux py37_np_dev) - Linux py37_np_dev succeeded
pandas-dev.pandas (Windows py36_np15) - Windows py36_np15 succeeded
pandas-dev.pandas (Windows py37_np141) - Windows py37_np141 succeeded
pandas-dev.pandas (macOS py35_macos) - macOS py35_macos succeeded
@jreback (Contributor) commented May 12, 2019

thanks @anmyachev

nice work!
