Implement Dataset add_item (#1870)
* Test Dataset.add_item

* Implement Dataset.add_item

* tmp

* Use InMemoryTable for new item

* Add dataset_dict and arrow_path for tests

* Fix test Dataset.add_item

* Add docstring

* Return new Dataset

* Fix test with returned new dataset

* Test multiple InMemoryTables are consolidated

* Test for consolidated InMemoryTables after multiple calls

* Add versionadded to docstring

* Add method docstring to the docs

* Simplify cast schema

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

albertvillanova and lhoestq committed Apr 23, 2021
1 parent d843090 commit 1f83a89
Showing 3 changed files with 50 additions and 1 deletion.
4 changes: 3 additions & 1 deletion docs/source/package_reference/main_classes.rst
@@ -14,7 +14,9 @@ Main classes
The base class :class:`datasets.Dataset` implements a Dataset backed by an Apache Arrow table.

.. autoclass:: datasets.Dataset
-    :members: from_file, from_buffer, from_pandas, from_dict,
+    :members:
+        add_item,
+        from_file, from_buffer, from_pandas, from_dict,
         data, cache_files, num_columns, num_rows, column_names, shape,
         unique,
         flatten_, cast_, remove_columns_, rename_column_,
19 changes: 19 additions & 0 deletions src/datasets/arrow_dataset.py
@@ -2858,6 +2858,25 @@ def add_elasticsearch_index(
        )
        return self

    def add_item(self, item: dict):
        """Add item to Dataset.

        .. versionadded:: 1.6

        Args:
            item (dict): Item data to be added.

        Returns:
            :class:`Dataset`
        """
        item_table = InMemoryTable.from_pydict({k: [v] for k, v in item.items()})
        # Cast the one-row item table to the dataset's schema
        schema = pa.schema(self.features.type)
        item_table = item_table.cast(schema)
        # Concatenate the existing data with the new row and wrap the result in a new Dataset
        table = concat_tables([self._data, item_table])
        return Dataset(table)


def concatenate_datasets(
dsets: List[Dataset],
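The core flow of the new method (wrap each scalar into a one-element column, cast the resulting one-row table to the existing schema, then concatenate) can be sketched without pyarrow. This is a minimal, hypothetical stand-in, not the `datasets` API: here a "table" is just a dict of column name to list of values, and `add_row` mimics `add_item`:

```python
def add_row(table: dict, item: dict) -> dict:
    """Return a new table with `item` appended as a single row (the input is untouched)."""
    # Wrap each scalar in a one-element list, mirroring InMemoryTable.from_pydict({k: [v], ...})
    item_table = {k: [v] for k, v in item.items()}
    # "Cast" the item to the table's schema: coerce each value to the column's existing type
    schema = {k: type(v[0]) for k, v in table.items()}
    item_table = {k: [schema[k](v[0])] for k, v in item_table.items()}
    # Concatenate column-wise and return a new table, as add_item returns a new Dataset
    return {k: table[k] + item_table[k] for k in table}
```

As in the real method, a mismatched item type (e.g. a float added to an int column) is coerced by the cast step rather than changing the dataset's schema.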
28 changes: 28 additions & 0 deletions tests/test_arrow_dataset.py
@@ -1948,6 +1948,34 @@ def test_concatenate_datasets_duplicate_columns(dataset):
    assert "duplicated" in str(excinfo.value)


@pytest.mark.parametrize("in_memory", [False, True])
@pytest.mark.parametrize(
    "item",
    [
        {"col_1": "4", "col_2": 4, "col_3": 4.0},
        {"col_1": "4", "col_2": "4", "col_3": "4"},
        {"col_1": 4, "col_2": 4, "col_3": 4},
        {"col_1": 4.0, "col_2": 4.0, "col_3": 4.0},
    ],
)
def test_dataset_add_item(item, in_memory, dataset_dict, arrow_path):
    dataset = (
        Dataset(InMemoryTable.from_pydict(dataset_dict))
        if in_memory
        else Dataset(MemoryMappedTable.from_file(arrow_path))
    )
    dataset = dataset.add_item(item)
    assert dataset.data.shape == (5, 3)
    expected_features = {"col_1": "string", "col_2": "int64", "col_3": "float64"}
    assert dataset.data.column_names == list(expected_features.keys())
    for feature, expected_dtype in expected_features.items():
        assert dataset.features[feature].dtype == expected_dtype
    # Multiple InMemoryTables are consolidated into one block; a memory-mapped
    # dataset keeps its original block plus one in-memory block
    assert len(dataset.data.blocks) == (1 if in_memory else 2)
    dataset = dataset.add_item(item)
    assert dataset.data.shape == (6, 3)
    assert len(dataset.data.blocks) == (1 if in_memory else 2)


@pytest.mark.parametrize("keep_in_memory", [False, True])
@pytest.mark.parametrize(
"features",
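The block-count assertions in the test rely on consecutive in-memory tables being consolidated when concatenated, so repeated `add_item` calls do not pile up one tiny block per added row. A hypothetical sketch of that consolidation rule, where `("in_memory", rows)` tuples stand in for table blocks (this is not the library's actual block representation):

```python
def consolidate_blocks(blocks: list) -> list:
    """Merge runs of adjacent in-memory blocks; memory-mapped blocks stay separate."""
    out = []
    for kind, rows in blocks:
        if out and kind == "in_memory" and out[-1][0] == "in_memory":
            # Extend the previous in-memory block instead of appending a new one
            out[-1] = ("in_memory", out[-1][1] + rows)
        else:
            out.append((kind, rows))
    return out
```

This is why the test expects one block for a fully in-memory dataset but two for a memory-mapped one (the original mapped block plus a single consolidated in-memory block), no matter how many items are added.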

2 comments on commit 1f83a89

@github-actions

Show benchmarks

PyArrow==1.0.0 and PyArrow==latest

[Automated benchmark report comparing new vs. old timings for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, and benchmark_map_filter.json]

CML watermark

@github-actions

Show benchmarks

[A second automated benchmark report in the same format, again under PyArrow==1.0.0 and PyArrow==latest]

CML watermark
