Commit

minor typo in text dataset
lhoestq committed Sep 29, 2020
1 parent 6427e26 commit e3a66a5
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion datasets/text/text.py
@@ -104,7 +104,7 @@ def _generate_tables(self, files):
 file,
 read_options=self.config.pa_read_options,
 parse_options=self.config.pa_parse_options,
-convert_options=self.config.convert_options,
+convert_options=self.config.pa_convert_options,
 )
 # Uncomment for debugging (will print the Arrow table size and elements)
 # logger.warning(f"pa_table: {pa_table} num rows: {pa_table.num_rows}")
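
The one-line change swaps a raw config attribute for its `pa_`-prefixed counterpart, matching the two neighbouring arguments. A minimal sketch of why such a prefix can matter, assuming (this is not confirmed by the diff) that the `pa_` accessors resolve an optional user value to a default. `SketchConfig` and `DEFAULT_CONVERT_OPTIONS` are hypothetical names used only for illustration, not the library's actual classes:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a library default; not a real datasets/pyarrow object.
DEFAULT_CONVERT_OPTIONS = {"column_types": {"text": "string"}}


@dataclass
class SketchConfig:
    # Raw user-supplied value; stays None when the user passes nothing.
    convert_options: Optional[dict] = None

    @property
    def pa_convert_options(self) -> dict:
        # Resolved value: fall back to the default when nothing was supplied.
        return self.convert_options or DEFAULT_CONVERT_OPTIONS


config = SketchConfig()
print(config.convert_options)     # raw field, unset here
print(config.pa_convert_options)  # resolved field, falls back to the default
```

Under that assumption, passing the raw attribute would silently skip the default, which is the kind of slip a small typo fix like this one would address.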

1 comment on commit e3a66a5

@github-actions

PyArrow==0.17.1

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.018009 / 0.011353 (0.006657)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.022218 / 0.011008 (0.011210)
read_batch_formatted_as_numpy after write_nested_sequence | 0.046469 / 0.038508 (0.007961)
read_batch_unformated after write_array2d | 0.032797 / 0.023109 (0.009688)
read_batch_unformated after write_flattened_sequence | 0.220746 / 0.275898 (-0.055152)
read_batch_unformated after write_nested_sequence | 0.242999 / 0.323480 (-0.080481)
read_col_formatted_as_numpy after write_array2d | 0.006491 / 0.007986 (-0.001495)
read_col_formatted_as_numpy after write_flattened_sequence | 0.004560 / 0.004328 (0.000232)
read_col_formatted_as_numpy after write_nested_sequence | 0.006823 / 0.004250 (0.002573)
read_col_unformated after write_array2d | 0.049560 / 0.037052 (0.012508)
read_col_unformated after write_flattened_sequence | 0.223361 / 0.258489 (-0.035128)
read_col_unformated after write_nested_sequence | 0.244784 / 0.293841 (-0.049057)
read_formatted_as_numpy after write_array2d | 0.160581 / 0.128546 (0.032034)
read_formatted_as_numpy after write_flattened_sequence | 0.124722 / 0.075646 (0.049075)
read_formatted_as_numpy after write_nested_sequence | 0.444459 / 0.419271 (0.025187)
read_unformated after write_array2d | 0.527612 / 0.043533 (0.484079)
read_unformated after write_flattened_sequence | 0.218037 / 0.255139 (-0.037102)
read_unformated after write_nested_sequence | 0.241927 / 0.283200 (-0.041272)
write_array2d | 0.085116 / 0.141683 (-0.056567)
write_flattened_sequence | 1.932446 / 1.452155 (0.480291)
write_nested_sequence | 2.154961 / 1.492716 (0.662245)

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.041053 / 0.037411 (0.003641)
shard | 0.021030 / 0.014526 (0.006505)
shuffle | 0.068127 / 0.176557 (-0.108430)
sort | 0.112415 / 0.737135 (-0.624720)
train_test_split | 0.026665 / 0.296338 (-0.269673)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.222396 / 0.215209 (0.007187)
read 50000 | 2.203507 / 2.077655 (0.125852)
read_batch 50000 10 | 1.293408 / 1.504120 (-0.210712)
read_batch 50000 100 | 1.220966 / 1.541195 (-0.320229)
read_batch 50000 1000 | 1.213752 / 1.468490 (-0.254738)
read_formatted numpy 5000 | 7.001330 / 4.584777 (2.416553)
read_formatted pandas 5000 | 5.724911 / 3.745712 (1.979199)
read_formatted tensorflow 5000 | 8.379560 / 5.269862 (3.109699)
read_formatted torch 5000 | 7.274708 / 4.565676 (2.709032)
read_formatted_batch numpy 5000 10 | 0.705436 / 0.424275 (0.281161)
read_formatted_batch numpy 5000 1000 | 0.011592 / 0.007607 (0.003985)
shuffled read 5000 | 0.251209 / 0.226044 (0.025164)
shuffled read 50000 | 2.632579 / 2.268929 (0.363651)
shuffled read_batch 50000 10 | 1.784517 / 55.444624 (-53.660107)
shuffled read_batch 50000 100 | 1.646901 / 6.876477 (-5.229576)
shuffled read_batch 50000 1000 | 1.648350 / 2.142072 (-0.493722)
shuffled read_formatted numpy 5000 | 7.063928 / 4.805227 (2.258701)
shuffled read_formatted_batch numpy 5000 10 | 9.848406 / 6.500664 (3.347742)
shuffled read_formatted_batch numpy 5000 1000 | 6.744743 / 0.075469 (6.669274)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 14.061736 / 1.841788 (12.219948)
map fast-tokenizer batched | 15.686935 / 8.074308 (7.612627)
map identity | 15.750708 / 10.191392 (5.559316)
map identity batched | 0.485882 / 0.680424 (-0.194542)
map no-op batched | 0.307097 / 0.534201 (-0.227104)
map no-op batched numpy | 0.840332 / 0.579283 (0.261049)
map no-op batched pandas | 0.615750 / 0.434364 (0.181386)
map no-op batched pytorch | 0.789713 / 0.540337 (0.249376)
map no-op batched tensorflow | 1.689284 / 1.386936 (0.302348)

PyArrow==1.0

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.018163 / 0.011353 (0.006810)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.015830 / 0.011008 (0.004822)
read_batch_formatted_as_numpy after write_nested_sequence | 0.046993 / 0.038508 (0.008485)
read_batch_unformated after write_array2d | 0.032564 / 0.023109 (0.009455)
read_batch_unformated after write_flattened_sequence | 0.353320 / 0.275898 (0.077421)
read_batch_unformated after write_nested_sequence | 0.384420 / 0.323480 (0.060940)
read_col_formatted_as_numpy after write_array2d | 0.006641 / 0.007986 (-0.001344)
read_col_formatted_as_numpy after write_flattened_sequence | 0.004502 / 0.004328 (0.000174)
read_col_formatted_as_numpy after write_nested_sequence | 0.007044 / 0.004250 (0.002794)
read_col_unformated after write_array2d | 0.051873 / 0.037052 (0.014820)
read_col_unformated after write_flattened_sequence | 0.353664 / 0.258489 (0.095175)
read_col_unformated after write_nested_sequence | 0.390518 / 0.293841 (0.096677)
read_formatted_as_numpy after write_array2d | 0.158759 / 0.128546 (0.030212)
read_formatted_as_numpy after write_flattened_sequence | 0.126254 / 0.075646 (0.050607)
read_formatted_as_numpy after write_nested_sequence | 0.460981 / 0.419271 (0.041710)
read_unformated after write_array2d | 0.444840 / 0.043533 (0.401307)
read_unformated after write_flattened_sequence | 0.356006 / 0.255139 (0.100867)
read_unformated after write_nested_sequence | 0.385239 / 0.283200 (0.102040)
write_array2d | 0.097803 / 0.141683 (-0.043879)
write_flattened_sequence | 1.947943 / 1.452155 (0.495788)
write_nested_sequence | 1.976930 / 1.492716 (0.484213)

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.045358 / 0.037411 (0.007946)
shard | 0.023561 / 0.014526 (0.009035)
shuffle | 0.064437 / 0.176557 (-0.112120)
sort | 0.088872 / 0.737135 (-0.648263)
train_test_split | 0.037984 / 0.296338 (-0.258354)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.279943 / 0.215209 (0.064734)
read 50000 | 2.827717 / 2.077655 (0.750062)
read_batch 50000 10 | 1.950951 / 1.504120 (0.446831)
read_batch 50000 100 | 1.842821 / 1.541195 (0.301626)
read_batch 50000 1000 | 1.886546 / 1.468490 (0.418056)
read_formatted numpy 5000 | 6.947632 / 4.584777 (2.362855)
read_formatted pandas 5000 | 5.783562 / 3.745712 (2.037850)
read_formatted tensorflow 5000 | 8.411217 / 5.269862 (3.141356)
read_formatted torch 5000 | 7.252046 / 4.565676 (2.686369)
read_formatted_batch numpy 5000 10 | 0.702115 / 0.424275 (0.277840)
read_formatted_batch numpy 5000 1000 | 0.012195 / 0.007607 (0.004588)
shuffled read 5000 | 0.319544 / 0.226044 (0.093500)
shuffled read 50000 | 3.354396 / 2.268929 (1.085467)
shuffled read_batch 50000 10 | 2.468838 / 55.444624 (-52.975786)
shuffled read_batch 50000 100 | 2.322722 / 6.876477 (-4.553755)
shuffled read_batch 50000 1000 | 2.297542 / 2.142072 (0.155469)
shuffled read_formatted numpy 5000 | 6.968169 / 4.805227 (2.162942)
shuffled read_formatted_batch numpy 5000 10 | 5.091950 / 6.500664 (-1.408714)
shuffled read_formatted_batch numpy 5000 1000 | 11.570187 / 0.075469 (11.494718)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 14.442717 / 1.841788 (12.600929)
map fast-tokenizer batched | 15.906987 / 8.074308 (7.832679)
map identity | 15.864440 / 10.191392 (5.673048)
map identity batched | 0.925982 / 0.680424 (0.245559)
map no-op batched | 0.610975 / 0.534201 (0.076774)
map no-op batched numpy | 0.823180 / 0.579283 (0.243897)
map no-op batched pandas | 0.605155 / 0.434364 (0.170791)
map no-op batched pytorch | 0.789713 / 0.540337 (0.249376)
map no-op batched tensorflow | 1.669337 / 1.386936 (0.282401)
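
Each benchmark cell above has the form `new / old (diff)`, where the parenthesised value appears to be `new - old` up to rounding of the printed digits. A small sketch for pulling those triples out of a row and checking that relationship; `CELL` and `parse_cells` are hypothetical helper names and not part of the benchmark tooling:

```python
import re

# Matches one cell of the form "new / old (diff)", e.g. "0.485882 / 0.680424 (-0.194542)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str):
    """Return a list of (new, old, diff) float triples found in a benchmark row."""
    return [tuple(float(x) for x in m.groups()) for m in CELL.finditer(row)]


# Two cells copied from the benchmark_map_filter.json results for PyArrow==0.17.1.
row = "14.061736 / 1.841788 (12.219948) 0.485882 / 0.680424 (-0.194542)"

for new, old, diff in parse_cells(row):
    # The reported diff should equal new - old, allowing for rounding in the printout.
    assert abs((new - old) - diff) < 1e-5
    print(f"new={new:.6f} old={old:.6f} diff={diff:+.6f} ratio={new / old:.2f}x")
```

A positive diff means the new measurement is larger than the old one; the ratio column printed here is an extra convenience and is not reported by the bot.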
