Revert default value for label_schema and add TODO
lewtun committed May 18, 2021
1 parent 290b583 commit 3578dbd
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion src/datasets/tasks/text_classification.py
@@ -9,7 +9,9 @@
 class TextClassification(TaskTemplate):
     task = "text-classification"
     input_schema = Features({"text": Value("string")})
-    label_schema = Features
+    # TODO(lewtun): Since we update this in __post_init__ do we need to set a default? We'll need it for __init__ so
+    # investigate if there's a more elegant approach.
+    label_schema = Features({"labels": ClassLabel})
     labels: List[str]
     text_column: str = "text"
     label_column: str = "labels"
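
The TODO asks whether a class-level default is needed when `__post_init__` overwrites the value anyway. The following is a minimal sketch of that pattern, not the library's implementation: it assumes a plain `@dataclass`, and it uses ordinary dicts as stand-ins for `Features` and `ClassLabel`. The class name `TextClassificationSketch` is hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class TextClassificationSketch:
    # Un-annotated assignments are plain class attributes, not dataclass
    # fields, so they never become __init__ parameters.
    task = "text-classification"
    # Placeholder default (stand-in for Features({"labels": ClassLabel})).
    label_schema = {"labels": None}

    labels: List[str]
    text_column: str = "text"
    label_column: str = "labels"

    def __post_init__(self):
        # Shadow the class-level placeholder with a per-instance schema
        # built from the labels that were actually passed in.
        self.label_schema = {"labels": sorted(set(self.labels))}


template = TextClassificationSketch(labels=["pos", "neg", "pos"])
field_names = [f.name for f in fields(TextClassificationSketch)]
```

In this sketch the class attribute only matters when `label_schema` is read from the class itself; every instance carries the value computed in `__post_init__`.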

1 comment on commit 3578dbd

@github-actions

PyArrow==1.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.025441 / 0.011353 (0.014088) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.019449 / 0.011008 (0.008441) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.055382 / 0.038508 (0.016874) |
| read_batch_unformated after write_array2d | 0.038827 / 0.023109 (0.015717) |
| read_batch_unformated after write_flattened_sequence | 0.368524 / 0.275898 (0.092626) |
| read_batch_unformated after write_nested_sequence | 0.410399 / 0.323480 (0.086919) |
| read_col_formatted_as_numpy after write_array2d | 0.012129 / 0.007986 (0.004144) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005488 / 0.004328 (0.001160) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010785 / 0.004250 (0.006535) |
| read_col_unformated after write_array2d | 0.051961 / 0.037052 (0.014909) |
| read_col_unformated after write_flattened_sequence | 0.368565 / 0.258489 (0.110076) |
| read_col_unformated after write_nested_sequence | 0.396760 / 0.293841 (0.102919) |
| read_formatted_as_numpy after write_array2d | 0.177782 / 0.128546 (0.049236) |
| read_formatted_as_numpy after write_flattened_sequence | 0.146633 / 0.075646 (0.070987) |
| read_formatted_as_numpy after write_nested_sequence | 0.471253 / 0.419271 (0.051982) |
| read_unformated after write_array2d | 0.435123 / 0.043533 (0.391590) |
| read_unformated after write_flattened_sequence | 0.381210 / 0.255139 (0.126071) |
| read_unformated after write_nested_sequence | 0.416773 / 0.283200 (0.133573) |
| write_array2d | 1.749890 / 0.141683 (1.608207) |
| write_flattened_sequence | 1.903013 / 1.452155 (0.450859) |
| write_nested_sequence | 2.056719 / 1.492716 (0.564003) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.010392 / 0.018006 (-0.007614) |
| get_batch_of_1024_rows | 0.522243 / 0.000490 (0.521753) |
| get_first_row | 0.002523 / 0.000200 (0.002323) |
| get_last_row | 0.000080 / 0.000054 (0.000026) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.045251 / 0.037411 (0.007839) |
| shard | 0.029062 / 0.014526 (0.014536) |
| shuffle | 0.033942 / 0.176557 (-0.142615) |
| sort | 0.046722 / 0.737135 (-0.690414) |
| train_test_split | 0.033989 / 0.296338 (-0.262349) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.564327 / 0.215209 (0.349118) |
| read 50000 | 5.314727 / 2.077655 (3.237072) |
| read_batch 50000 10 | 2.413609 / 1.504120 (0.909490) |
| read_batch 50000 100 | 2.170949 / 1.541195 (0.629754) |
| read_batch 50000 1000 | 2.120207 / 1.468490 (0.651717) |
| read_formatted numpy 5000 | 7.995179 / 4.584777 (3.410403) |
| read_formatted pandas 5000 | 6.872148 / 3.745712 (3.126436) |
| read_formatted tensorflow 5000 | 10.026221 / 5.269862 (4.756359) |
| read_formatted torch 5000 | 8.493658 / 4.565676 (3.927981) |
| read_formatted_batch numpy 5000 10 | 0.779925 / 0.424275 (0.355650) |
| read_formatted_batch numpy 5000 1000 | 0.011022 / 0.007607 (0.003415) |
| shuffled read 5000 | 0.725276 / 0.226044 (0.499232) |
| shuffled read 50000 | 7.327676 / 2.268929 (5.058747) |
| shuffled read_batch 50000 10 | 3.151590 / 55.444624 (-52.293034) |
| shuffled read_batch 50000 100 | 2.444535 / 6.876477 (-4.431942) |
| shuffled read_batch 50000 1000 | 2.343496 / 2.142072 (0.201424) |
| shuffled read_formatted numpy 5000 | 7.323991 / 4.805227 (2.518764) |
| shuffled read_formatted_batch numpy 5000 10 | 5.696013 / 6.500664 (-0.804651) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.103172 / 0.075469 (7.027703) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 11.413595 / 1.841788 (9.571807) |
| map fast-tokenizer batched | 13.653036 / 8.074308 (5.578728) |
| map identity | 38.889572 / 10.191392 (28.698180) |
| map identity batched | 0.889133 / 0.680424 (0.208709) |
| map no-op batched | 0.575178 / 0.534201 (0.040977) |
| map no-op batched numpy | 0.804861 / 0.579283 (0.225578) |
| map no-op batched pandas | 0.619867 / 0.434364 (0.185503) |
| map no-op batched pytorch | 0.707774 / 0.540337 (0.167437) |
| map no-op batched tensorflow | 1.538204 / 1.386936 (0.151268) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.025214 / 0.011353 (0.013861) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016990 / 0.011008 (0.005982) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.052219 / 0.038508 (0.013711) |
| read_batch_unformated after write_array2d | 0.046521 / 0.023109 (0.023412) |
| read_batch_unformated after write_flattened_sequence | 0.386508 / 0.275898 (0.110610) |
| read_batch_unformated after write_nested_sequence | 0.442021 / 0.323480 (0.118541) |
| read_col_formatted_as_numpy after write_array2d | 0.012807 / 0.007986 (0.004821) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005458 / 0.004328 (0.001130) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.014496 / 0.004250 (0.010246) |
| read_col_unformated after write_array2d | 0.062762 / 0.037052 (0.025710) |
| read_col_unformated after write_flattened_sequence | 0.383613 / 0.258489 (0.125124) |
| read_col_unformated after write_nested_sequence | 0.434073 / 0.293841 (0.140232) |
| read_formatted_as_numpy after write_array2d | 0.186083 / 0.128546 (0.057537) |
| read_formatted_as_numpy after write_flattened_sequence | 0.136554 / 0.075646 (0.060908) |
| read_formatted_as_numpy after write_nested_sequence | 0.495603 / 0.419271 (0.076332) |
| read_unformated after write_array2d | 0.478908 / 0.043533 (0.435375) |
| read_unformated after write_flattened_sequence | 0.398082 / 0.255139 (0.142943) |
| read_unformated after write_nested_sequence | 0.445721 / 0.283200 (0.162521) |
| write_array2d | 1.832014 / 0.141683 (1.690331) |
| write_flattened_sequence | 1.986883 / 1.452155 (0.534729) |
| write_nested_sequence | 2.032006 / 1.492716 (0.539289) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.010011 / 0.018006 (-0.007996) |
| get_batch_of_1024_rows | 0.538763 / 0.000490 (0.538273) |
| get_first_row | 0.003523 / 0.000200 (0.003323) |
| get_last_row | 0.000081 / 0.000054 (0.000026) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.043793 / 0.037411 (0.006382) |
| shard | 0.030339 / 0.014526 (0.015813) |
| shuffle | 0.034104 / 0.176557 (-0.142453) |
| sort | 0.051894 / 0.737135 (-0.685242) |
| train_test_split | 0.039226 / 0.296338 (-0.257113) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.508349 / 0.215209 (0.293140) |
| read 50000 | 5.127225 / 2.077655 (3.049570) |
| read_batch 50000 10 | 2.248116 / 1.504120 (0.743996) |
| read_batch 50000 100 | 1.935411 / 1.541195 (0.394217) |
| read_batch 50000 1000 | 1.963999 / 1.468490 (0.495509) |
| read_formatted numpy 5000 | 7.502316 / 4.584777 (2.917539) |
| read_formatted pandas 5000 | 6.583454 / 3.745712 (2.837742) |
| read_formatted tensorflow 5000 | 9.297719 / 5.269862 (4.027858) |
| read_formatted torch 5000 | 8.160927 / 4.565676 (3.595250) |
| read_formatted_batch numpy 5000 10 | 0.740024 / 0.424275 (0.315748) |
| read_formatted_batch numpy 5000 1000 | 0.018992 / 0.007607 (0.011385) |
| shuffled read 5000 | 0.683572 / 0.226044 (0.457527) |
| shuffled read 50000 | 6.794658 / 2.268929 (4.525730) |
| shuffled read_batch 50000 10 | 3.077151 / 55.444624 (-52.367473) |
| shuffled read_batch 50000 100 | 2.382918 / 6.876477 (-4.493559) |
| shuffled read_batch 50000 1000 | 2.451405 / 2.142072 (0.309332) |
| shuffled read_formatted numpy 5000 | 7.600892 / 4.805227 (2.795665) |
| shuffled read_formatted_batch numpy 5000 10 | 5.850560 / 6.500664 (-0.650104) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.786832 / 0.075469 (7.711363) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 11.795553 / 1.841788 (9.953766) |
| map fast-tokenizer batched | 13.583042 / 8.074308 (5.508734) |
| map identity | 42.343114 / 10.191392 (32.151721) |
| map identity batched | 0.930137 / 0.680424 (0.249714) |
| map no-op batched | 0.667822 / 0.534201 (0.133621) |
| map no-op batched numpy | 0.820117 / 0.579283 (0.240834) |
| map no-op batched pandas | 0.686073 / 0.434364 (0.251710) |
| map no-op batched pytorch | 0.750325 / 0.540337 (0.209987) |
| map no-op batched tensorflow | 1.657982 / 1.386936 (0.271046) |
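
Each benchmark cell above has the shape `new / old (diff)`, and the numbers suggest that `diff` is `new - old` computed before rounding to six decimals. The sketch below parses one cell and checks that relationship. The helper names `parse_cell` and `is_consistent` are hypothetical, and the tolerance is an assumption chosen to absorb the rounding of the printed values.

```python
import re

# Matches cells such as "0.010392 / 0.018006 (-0.007614)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check that the reported diff equals new - old up to rounding."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


first_row = parse_cell("0.010392 / 0.018006 (-0.007614)")
rounded_ok = is_consistent("0.038827 / 0.023109 (0.015717)")
```

A negative `diff` means the new timing is lower than the old one for that metric.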
