Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Casting type from Array2D int to Array2D float crashes #6291

Closed
AlanBlanchet opened this issue Oct 10, 2023 · 1 comment · Fixed by #6297
Closed

Casting type from Array2D int to Array2D float crashes #6291

AlanBlanchet opened this issue Oct 10, 2023 · 1 comment · Fixed by #6297

Comments

@AlanBlanchet
Copy link

AlanBlanchet commented Oct 10, 2023

Describe the bug

I am on a school project and the initial type for feature annotations are Array2D(shape=(None, 4)). I am trying to cast this type to a float64 and pyarrow gives me this error :

Traceback (most recent call last):
  File "/home/alan/dev/ClassezDesImagesAvecDesAlgorithmesDeDeeplearning/src/sdd/data/dataset.py", line 141, in <module>
    dataset = StanfordDogsDataset(size, 5).original(True).demo()
  File "<attrs generated init __main__.StanfordDogsDataset>", line 4, in __init__
  File "/home/alan/dev/ClassezDesImagesAvecDesAlgorithmesDeDeeplearning/src/sdd/data/dataset.py", line 33, in __attrs_post_init__
    self.dataset = self.dataset.cast_column(
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2110, in cast_column
    return self.cast(features)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2055, in cast
    dataset = dataset.map(
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 2328, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 2287, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 2287, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 2143, in cast_array_to_feature
    return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1833, in wrapper
    return func(array, *args, **kwargs)
  File "/home/alan/.cache/pypoetry/virtualenvs/sdd-2XWLAjSi-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1967, in array_cast
    return pa_type.wrap_array(array)
  File "pyarrow/types.pxi", line 1369, in pyarrow.lib.BaseExtensionType.wrap_array
TypeError: Incompatible storage type for extension<arrow.py_extension_type<Array2DExtensionType>>: expected list<item: list<item: double>>, got list<item: list<item: int32>>

Steps to reproduce the bug

dataset = datasets.load_dataset("Alanox/stanford-dogs", split="full")
dataset = dataset.cast_column("annotations", Array2D((None, 4), "float64"))

Expected behavior

It should simply cast the column feature type to a float64 without error

Environment info

datasets == 2.14.5

@mariosasko
Copy link
Collaborator

Thanks for reporting! I've opened a PR with a fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants