
PERF: Improve performance of hash sets in read_csv #25804

Merged
merged 11 commits into pandas-dev:master from anmyachev:khash-na-sets on Mar 22, 2019

Conversation

@vnlitvin (Contributor) commented Mar 20, 2019

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Improved the performance of checking for string presence in hash sets by:

  1. implementing a lookup table for the first character of entries, which speeds up negative answers (see the sketch below)
  2. increasing the size of the hash table to reduce the probability of hash collisions

The larger bucket size is needed because of khash's search algorithm: it computes a hash of the key being searched, takes its last N bits (where N corresponds to the bucket size), and, if a collision happens, performs an open-addressing probe over the table, calling a comparison function on the keys. When the keys are strings (or, more precisely, char* values), that comparison function is strcmp(), whose cost is linear in the string length, so we want collisions to be rare.
Most lookups in these sets happen when values are checked against the N/A table, which by default has 17 entries. Before the change the bucket size is 32, so the hash collision probability is 17/32 (~53%); after the patch the bucket size is 256 and the probability drops to 17/256 (~6.6%).
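
To make the first-character filter concrete, here is a minimal standalone sketch in C. It is illustrative only: the PR's actual implementation (kh_str_starts_t and kh_get_str_starts_item in pandas/_libs/src/klib/khash_python.h, both visible in the build log further down) wraps a khash string hash set, while this toy version uses a plain array, and every name in the sketch is invented for the example.

/* first_byte_filter.c -- minimal standalone sketch of the first-character
 * lookup table (illustrative only; the PR's real implementation wraps a
 * khash string hash set instead of the toy array used here). */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 64                /* plenty for the 17-entry N/A table */

typedef struct {
    const char *entries[MAX_ENTRIES]; /* stand-in for the khash string set */
    int n_entries;
    char starts[256];                 /* 1 if some entry begins with this byte */
} str_starts_set;

static void set_add(str_starts_set *s, const char *key) {
    s->entries[s->n_entries++] = key;
    s->starts[(unsigned char)key[0]] = 1;
}

static int set_contains(const str_starts_set *s, const char *key) {
    /* Cheap negative answer: if no entry starts with the key's first
     * byte, skip the full lookup (and its strcmp calls) entirely. */
    if (!s->starts[(unsigned char)key[0]])
        return 0;
    for (int i = 0; i < s->n_entries; i++)  /* the khash lookup in the PR */
        if (strcmp(s->entries[i], key) == 0)
            return 1;
    return 0;
}

int main(void) {
    str_starts_set na = {{0}};
    set_add(&na, "NA");
    set_add(&na, "NULL");
    set_add(&na, "#N/A");
    printf("%d %d\n", set_contains(&na, "NULL"),   /* 1 */
                      set_contains(&na, "12.5"));  /* 0: no entry starts with '1' */
    return 0;
}

In a typical CSV most tokens fail the first-byte check, so the common "key not in table" path returns without touching the hash table or calling strcmp() at all.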

I will add asv results and whatsnew later.

@vnlitvin changed the title from "Khash na sets" to "Improve performance of hash sets in read_csv" on Mar 20, 2019

@codecov commented Mar 20, 2019

Codecov Report

Merging #25804 into master will increase coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #25804      +/-   ##
==========================================
+ Coverage   91.27%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       53002    53002              
==========================================
+ Hits        48375    48376       +1     
+ Misses       4627     4626       -1
Flag Coverage Δ
#multiple 89.83% <ø> (ø) ⬆️
#single 41.76% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/util/testing.py 89.4% <0%> (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85c3f82...da9d4e6.

@codecov commented Mar 20, 2019

Codecov Report

Merging #25804 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #25804   +/-   ##
=======================================
  Coverage   91.27%   91.27%           
=======================================
  Files         173      173           
  Lines       53002    53002           
=======================================
  Hits        48375    48375           
  Misses       4627     4627
Flag Coverage Δ
#multiple 89.83% <ø> (ø) ⬆️
#single 41.77% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/excel/_base.py 92.48% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85c3f82...2ceb850.

@vnlitvin (Contributor, Author) commented Mar 20, 2019

Results of running asv continuous -f 1.05 upstream/master HEAD -b io.csv -a sample_time=1 -a warmup_time=1:

     before       after   ratio  benchmark
  [fbe2523]   [da9d4e6]
   (master)   (khash-na-sets)
1.34±0.02ms 1.26±0.01ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
13.7±0.2ms 12.8±0.3ms 0.94 io.csv.ReadCSVSkipRows.time_skipprows(None)
1.41±0.01ms 1.31±0.02ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
1.42±0.01ms 1.32±0.01ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
32.9±1ms 28.8±0.2ms 0.87 io.csv.ReadCSVCategorical.time_convert_direct
12.7±0.2ms 10.9±0.1ms 0.86 io.csv.ReadCSVThousands.time_thousands(',', ',')
13.2±0.08ms 11.3±0.1ms 0.86 io.csv.ReadCSVThousands.time_thousands('|', ',')
10.7±0.07ms 9.06±0.09ms 0.84 io.csv.ReadCSVThousands.time_thousands('|', None)
10.9±0.1ms 9.17±0.2ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', None)

@vnlitvin changed the title from "Improve performance of hash sets in read_csv" to "PERF: Improve performance of hash sets in read_csv" on Mar 20, 2019

@vnlitvin (Contributor, Author) commented Mar 21, 2019

More precise benchmarking:
asv continuous -f 1.01 upstream/master HEAD -b io.csv -a sample_time=2 -a warmup_time=2

     before       after   ratio  benchmark
  [fbe2523]   [da9d4e6]
   (master)   (khash-na-sets)
1.55±0.02ms 1.49±0ms 0.96 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
1.55±0.01ms 1.48±0.01ms 0.96 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
13.2±0.4ms 12.5±0.1ms 0.95 io.csv.ReadCSVSkipRows.time_skipprows(None)
1.56±0.02ms 1.48±0.01ms 0.95 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
1.84±0.02ms 1.74±0.01ms 0.95 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
1.81±0.02ms 1.72±0.01ms 0.95 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
1.56±0.02ms 1.47±0.01ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
1.84±0.01ms 1.73±0.02ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
1.57±0.03ms 1.47±0.03ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
1.34±0.01ms 1.24±0.03ms 0.92 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
1.35±0.02ms 1.24±0.01ms 0.92 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
1.43±0.02ms 1.30±0.01ms 0.91 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
1.43±0.03ms 1.29±0.01ms 0.90 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
31.5±0.6ms 28.3±0.3ms 0.90 io.csv.ReadCSVCategorical.time_convert_direct
2.95±0.07ms 2.61±0.06ms 0.89 io.csv.ReadUint64Integers.time_read_uint64
13.1±0.1ms 11.2±0.05ms 0.86 io.csv.ReadCSVThousands.time_thousands('|', ',')
10.8±0.04ms 9.12±0.08ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', None)
12.8±0.04ms 10.8±0.1ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', ',')
11.1±0.2ms 9.11±0.05ms 0.82 io.csv.ReadCSVThousands.time_thousands('|', None)

@jreback added this to the 0.25.0 milestone on Mar 22, 2019

@chris-b1 (Contributor) left a comment

good stuff! looks good to me

@@ -71,9 +73,6 @@ cdef:
float64_t NEGINF = -INF


cdef extern from "errno.h":

@chris-b1 (Contributor) commented Mar 22, 2019

Pulling out errno could ideally have been a separate PR, but I'm fine with doing it here.

# Resize the hash table to make it almost empty; this
# reduces the number of hash collisions on lookup, so the
# "key not in table" case is faster.
# Note that this trades table memory footprint for lookup speed.

@WillAyd (Member) commented Mar 22, 2019

Is there a ballpark estimate for how much more memory this might use? If anything, we get more read_csv issues about high memory usage than about slow parsing, so I just want to be careful about the tradeoffs we make, especially since we don't have any memory benchmarks set up.

@vnlitvin (Author, Contributor) commented Mar 22, 2019

The worst case is where bucket_size was increased from 128 to 1024: memory for a 128-bucket table (on x86_64) is ~3.6 KB, and for a 1024-bucket table ~21.04 KB.

For the default case the bucket size goes from 32 to 256, so memory grows from ~1.67 KB to ~6.04 KB.

I think it's acceptable to trade about 15 more KB (there are typically three such sets: N/A values, true-ish values, and false-ish values) for up to 15% more parsing speed.

P.S. The memory size formula, where B is bucket_size:

sizeof(khint32_t) * 4 + sizeof(void*) * 3 + B * (sizeof(khint32_t) + sizeof(char*) + sizeof(size_t)) + 256 * sizeof(int)

which gives 1064 + B * 20 bytes (again on x86_64).
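
A quick sanity check of that arithmetic, assuming x86_64 sizes (8-byte pointers, 4-byte int); the helper name is made up for the example:

#include <stdio.h>
#include <stdint.h>

typedef uint32_t khint32_t;  /* as in khash.h */

/* Footprint of one table with B buckets, per the formula above;
 * on x86_64 this evaluates to 1064 + 20 * B bytes. */
static size_t table_bytes(size_t B) {
    return sizeof(khint32_t) * 4 + sizeof(void *) * 3
         + B * (sizeof(khint32_t) + sizeof(char *) + sizeof(size_t))
         + 256 * sizeof(int);
}

int main(void) {
    printf("B=32:   %zu bytes\n", table_bytes(32));    /* 1704  (~1.67 KB) */
    printf("B=256:  %zu bytes\n", table_bytes(256));   /* 6184  (~6.04 KB) */
    printf("B=1024: %zu bytes\n", table_bytes(1024));  /* 21544 (~21.04 KB) */
    return 0;
}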

@WillAyd (Member) commented Mar 22, 2019

My math might be fuzzy, but assuming someone is parsing, say, 10M records and this increases the size of each record by 15 KB, would that require roughly 150 MB more memory?

@vnlitvin (Author, Contributor) commented Mar 22, 2019

It is 15 KB per parser, not per record :)
So in a typical case (when one parser exists at a time) it's only 15 KB more.

@WillAyd (Member) commented Mar 22, 2019

Ha, thanks for clarifying. I suppose that's quite the difference.

@jreback merged commit cd057c7 into pandas-dev:master on Mar 22, 2019

8 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
pandas-dev.pandas: Build #20190320.55 succeeded
pandas-dev.pandas (Checks_and_doc): Checks_and_doc succeeded
pandas-dev.pandas (Linux py36_locale_slow): Linux py36_locale_slow succeeded
pandas-dev.pandas (Linux py37_locale): Linux py37_locale succeeded
pandas-dev.pandas (Linux py37_np_dev): Linux py37_np_dev succeeded
pandas-dev.pandas (Windows py36_np14): Windows py36_np14 succeeded
pandas-dev.pandas (macOS py35_macos): macOS py35_macos succeeded
@jreback (Contributor) commented Mar 22, 2019

thanks @vnlitvin

@gfyoung (Member) commented Mar 22, 2019

@vnlitvin : Nice work!

@vnlitvin deleted the anmyachev:khash-na-sets branch on Mar 25, 2019

@jreback (Contributor) commented Mar 27, 2019

@vnlitvin we are getting tons of build warnings like the ones below (full log: https://travis-ci.org/pandas-dev/pandas/jobs/511269649); please see if you can do anything about this.

In file included from pandas/_libs/parsers.c:663:0:
pandas/_libs/src/klib/khash_python.h:109:27: note: expected ‘struct kh_str_starts_t *’ but argument is of type ‘const struct kh_str_starts_t *’
 khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key) {
                           ^
pandas/_libs/parsers.c:26118:7: warning: passing argument 2 of ‘kh_get_str_starts_item’ discards ‘const’ qualifier from pointer target type [enabled by default]
       __pyx_t_1 = (kh_get_str_starts_item(__pyx_v_na_hashset, __pyx_v_word) != 0);
       ^
In file included from pandas/_libs/parsers.c:663:0:
pandas/_libs/src/klib/khash_python.h:109:27: note: expected ‘char *’ but argument is of type ‘const char *’
 khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key) {
                           ^
pandas/_libs/parsers.c: In function ‘__pyx_f_6pandas_5_libs_7parsers__try_bool_flex_nogil’:

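For reference, the warning class here is a const-qualifier mismatch: the Cython-generated caller passes const pointers into a function declared with non-const parameters. Below is a minimal reproduction of the pattern together with the obvious const-correct fix; this is a sketch of the fix class, not the actual patch that later landed in khash_python.h.

#include <stdio.h>
#include <string.h>

/* Declared with a plain `char *`, this function triggers
 * "passing argument ... discards 'const' qualifier" whenever the
 * caller holds a `const char *`. Const-qualifying the parameter
 * silences it, since the lookup never writes through the pointer. */
static int is_na_like(const char *key) {      /* was: char *key */
    return strncmp(key, "NA", 2) == 0;
}

int main(void) {
    const char *word = "NA_value";  /* plays the role of __pyx_v_word */
    printf("%d\n", is_na_like(word));
    return 0;
}
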
anmyachev added a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019
