fix text delimiter (#631)
lhoestq committed Sep 15, 2020
1 parent dc31de1 commit f38a871
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion datasets/text/text.py
@@ -45,8 +45,11 @@ def pa_parse_options(self):
         if self.parse_options is not None:
             parse_options = self.parse_options
         else:
+            # To force the one-column setting, we set an arbitrary character
+            # that is not in text files as the delimiter, such as \b or \v.
+            # \b is the backspace control character (not the bell, which is \a).
             parse_options = pac.ParseOptions(
-                delimiter="\r",
+                delimiter="\b",
                 quote_char=False,
                 double_quote=False,
                 escape_char=False,
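The one-column trick described in the comment can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `csv` module as a stand-in for the PyArrow CSV reader that the commit actually configures, and the helper name `parse_rows` and the sample text are hypothetical. It shows that a delimiter which never occurs in the text leaves each line as a single field, whereas a delimiter that does occur splits lines into several fields.

```python
import csv
import io


def parse_rows(text, delimiter):
    """Parse `text` line by line and return the list of fields for each line.

    Hypothetical helper: the standard-library csv reader stands in for
    PyArrow's CSV reader here, so the exact behavior may differ.
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,     # the character used to split a line into fields
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    # Skip empty rows so that blank lines do not produce empty records.
    return [row for row in reader if row]


sample = 'hello, world\nshe said "hi", then left\n'

# A delimiter that occurs in the text splits each line into several fields.
split_rows = parse_rows(sample, ",")

# A control character assumed to be absent from the text keeps each line whole.
single_column_rows = parse_rows(sample, "\b")

print(split_rows)
print(single_column_rows)
```

The same reasoning applies to any candidate delimiter: if the character can appear in the input, some lines will be split into more than one column.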

1 comment on commit f38a871

@github-actions

PyArrow==0.17.1

Benchmark: benchmark_array_xd.json

metric | new | old | diff
read_batch_formatted_as_numpy after write_array2d | 0.016857 | 0.011353 | 0.005504
read_batch_formatted_as_numpy after write_flattened_sequence | 0.015019 | 0.011008 | 0.004010
read_batch_formatted_as_numpy after write_nested_sequence | 0.043166 | 0.038508 | 0.004658
read_batch_unformated after write_array2d | 0.026843 | 0.023109 | 0.003734
read_batch_unformated after write_flattened_sequence | 0.206152 | 0.275898 | -0.069746
read_batch_unformated after write_nested_sequence | 0.203696 | 0.323480 | -0.119784
read_col_formatted_as_numpy after write_array2d | 0.007539 | 0.007986 | -0.000446
read_col_formatted_as_numpy after write_flattened_sequence | 0.005249 | 0.004328 | 0.000921
read_col_formatted_as_numpy after write_nested_sequence | 0.008867 | 0.004250 | 0.004617
read_col_unformated after write_array2d | 0.039222 | 0.037052 | 0.002170
read_col_unformated after write_flattened_sequence | 0.200717 | 0.258489 | -0.057772
read_col_unformated after write_nested_sequence | 0.222045 | 0.293841 | -0.071796
read_formatted_as_numpy after write_array2d | 0.153290 | 0.128546 | 0.024744
read_formatted_as_numpy after write_flattened_sequence | 0.114726 | 0.075646 | 0.039080
read_formatted_as_numpy after write_nested_sequence | 0.365272 | 0.419271 | -0.053999
read_unformated after write_array2d | 0.521903 | 0.043533 | 0.478370
read_unformated after write_flattened_sequence | 0.173513 | 0.255139 | -0.081626
read_unformated after write_nested_sequence | 0.186316 | 0.283200 | -0.096884
write_array2d | 0.073184 | 0.141683 | -0.068499
write_flattened_sequence | 1.529456 | 1.452155 | 0.077302
write_nested_sequence | 1.571035 | 1.492716 | 0.078319

Benchmark: benchmark_indices_mapping.json

metric | new | old | diff
select | 0.036982 | 0.037411 | -0.000430
shard | 0.018389 | 0.014526 | 0.003864
shuffle | 0.022690 | 0.176557 | -0.153867
sort | 0.073680 | 0.737135 | -0.663455
train_test_split | 0.022298 | 0.296338 | -0.274041

Benchmark: benchmark_iterating.json

metric | new | old | diff
read 5000 | 0.172394 | 0.215209 | -0.042815
read 50000 | 1.706023 | 2.077655 | -0.371632
read_batch 50000 10 | 1.017322 | 1.504120 | -0.486798
read_batch 50000 100 | 0.933920 | 1.541195 | -0.607275
read_batch 50000 1000 | 0.991862 | 1.468490 | -0.476628
read_formatted numpy 5000 | 6.114918 | 4.584777 | 1.530141
read_formatted pandas 5000 | 5.231305 | 3.745712 | 1.485593
read_formatted tensorflow 5000 | 7.364401 | 5.269862 | 2.094539
read_formatted torch 5000 | 6.432920 | 4.565676 | 1.867243
read_formatted_batch numpy 5000 10 | 0.608128 | 0.424275 | 0.183853
read_formatted_batch numpy 5000 1000 | 0.009618 | 0.007607 | 0.002010
shuffled read 5000 | 0.201471 | 0.226044 | -0.024574
shuffled read 50000 | 2.136877 | 2.268929 | -0.132052
shuffled read_batch 50000 10 | 1.467121 | 55.444624 | -53.977503
shuffled read_batch 50000 100 | 1.384009 | 6.876477 | -5.492467
shuffled read_batch 50000 1000 | 1.303871 | 2.142072 | -0.838201
shuffled read_formatted numpy 5000 | 6.306812 | 4.805227 | 1.501585
shuffled read_formatted_batch numpy 5000 10 | 5.216238 | 6.500664 | -1.284426
shuffled read_formatted_batch numpy 5000 1000 | 6.627813 | 0.075469 | 6.552344

Benchmark: benchmark_map_filter.json

metric | new | old | diff
filter | 307.155476 | 1.841788 | 305.313688
map fast-tokenizer batched | 12.770226 | 8.074308 | 4.695918
map identity | 14.282731 | 10.191392 | 4.091339
map identity batched | 0.425252 | 0.680424 | -0.255172
map no-op batched | 0.244559 | 0.534201 | -0.289642
map no-op batched numpy | 0.726522 | 0.579283 | 0.147239
map no-op batched pandas | 0.556347 | 0.434364 | 0.121983
map no-op batched pytorch | 0.705303 | 0.540337 | 0.164965
map no-op batched tensorflow | 1.462794 | 1.386936 | 0.075858
PyArrow==1.0

Benchmark: benchmark_array_xd.json

metric | new | old | diff
read_batch_formatted_as_numpy after write_array2d | 0.015875 | 0.011353 | 0.004522
read_batch_formatted_as_numpy after write_flattened_sequence | 0.015070 | 0.011008 | 0.004062
read_batch_formatted_as_numpy after write_nested_sequence | 0.040236 | 0.038508 | 0.001728
read_batch_unformated after write_array2d | 0.025894 | 0.023109 | 0.002784
read_batch_unformated after write_flattened_sequence | 0.282335 | 0.275898 | 0.006437
read_batch_unformated after write_nested_sequence | 0.310002 | 0.323480 | -0.013478
read_col_formatted_as_numpy after write_array2d | 0.007160 | 0.007986 | -0.000825
read_col_formatted_as_numpy after write_flattened_sequence | 0.004371 | 0.004328 | 0.000043
read_col_formatted_as_numpy after write_nested_sequence | 0.006208 | 0.004250 | 0.001958
read_col_unformated after write_array2d | 0.039021 | 0.037052 | 0.001968
read_col_unformated after write_flattened_sequence | 0.290034 | 0.258489 | 0.031545
read_col_unformated after write_nested_sequence | 0.304972 | 0.293841 | 0.011131
read_formatted_as_numpy after write_array2d | 0.140354 | 0.128546 | 0.011808
read_formatted_as_numpy after write_flattened_sequence | 0.118841 | 0.075646 | 0.043195
read_formatted_as_numpy after write_nested_sequence | 0.403039 | 0.419271 | -0.016233
read_unformated after write_array2d | 0.362367 | 0.043533 | 0.318834
read_unformated after write_flattened_sequence | 0.283464 | 0.255139 | 0.028325
read_unformated after write_nested_sequence | 0.287474 | 0.283200 | 0.004274
write_array2d | 0.077705 | 0.141683 | -0.063978
write_flattened_sequence | 1.619907 | 1.452155 | 0.167752
write_nested_sequence | 1.586076 | 1.492716 | 0.093360

Benchmark: benchmark_indices_mapping.json

metric | new | old | diff
select | 0.035805 | 0.037411 | -0.001607
shard | 0.020142 | 0.014526 | 0.005616
shuffle | 0.033527 | 0.176557 | -0.143029
sort | 0.069552 | 0.737135 | -0.667583
train_test_split | 0.063776 | 0.296338 | -0.232562

Benchmark: benchmark_iterating.json

metric | new | old | diff
read 5000 | 0.224212 | 0.215209 | 0.009003
read 50000 | 2.269918 | 2.077655 | 0.192263
read_batch 50000 10 | 1.584141 | 1.504120 | 0.080021
read_batch 50000 100 | 1.477960 | 1.541195 | -0.063235
read_batch 50000 1000 | 1.507770 | 1.468490 | 0.039280
read_formatted numpy 5000 | 6.038871 | 4.584777 | 1.454094
read_formatted pandas 5000 | 5.029923 | 3.745712 | 1.284211
read_formatted tensorflow 5000 | 7.266376 | 5.269862 | 1.996514
read_formatted torch 5000 | 6.411179 | 4.565676 | 1.845503
read_formatted_batch numpy 5000 10 | 0.595821 | 0.424275 | 0.171546
read_formatted_batch numpy 5000 1000 | 0.009590 | 0.007607 | 0.001983
shuffled read 5000 | 0.292775 | 0.226044 | 0.066731
shuffled read 50000 | 2.715916 | 2.268929 | 0.446988
shuffled read_batch 50000 10 | 10.674218 | 55.444624 | -44.770407
shuffled read_batch 50000 100 | 2.469730 | 6.876477 | -4.406747
shuffled read_batch 50000 1000 | 1.682323 | 2.142072 | -0.459750
shuffled read_formatted numpy 5000 | 6.225780 | 4.805227 | 1.420552
shuffled read_formatted_batch numpy 5000 10 | 1.706477 | 6.500664 | -4.794187
shuffled read_formatted_batch numpy 5000 1000 | 0.024300 | 0.075469 | -0.051169

Benchmark: benchmark_map_filter.json

metric | new | old | diff
filter | 287.403763 | 1.841788 | 285.561975
map fast-tokenizer batched | 12.747225 | 8.074308 | 4.672917
map identity | 13.229372 | 10.191392 | 3.037980
map identity batched | 0.746156 | 0.680424 | 0.065733
map no-op batched | 0.480254 | 0.534201 | -0.053947
map no-op batched numpy | 0.719122 | 0.579283 | 0.139839
map no-op batched pandas | 0.548265 | 0.434364 | 0.113901
map no-op batched pytorch | 0.697374 | 0.540337 | 0.157037
map no-op batched tensorflow | 1.408796 | 1.386936 | 0.021860