
When using dataset.map() if passed Features types do not match what is returned from the mapped function, execution does not except in an obvious way #4352

Open
plamb-viso opened this issue May 14, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@plamb-viso

Describe the bug

Recently I was trying to use .map() to preprocess a dataset. I defined the expected Features and passed them into .map() like dataset.map(preprocess_data, features=features). My Features keys matched what came out of preprocess_data, but the types I had defined for them did not match the types that actually came back. Because of this, I ended up with tracebacks deep inside arrow_dataset.py and arrow_writer.py whose exceptions did not make the problem clear. In short, I hit overflows and the OS killing processes while Arrow was attempting to write. It wasn't until I dug into def write_batch and its loop over the columns that I figured out what was going on.

It seems like .map() could check, for at least one example from the dataset, that the returned data's types match the types declared in the features param, and error out with a clear exception if they don't. This would make the cause of the issue much more understandable and save people time. This could be construed as a feature request, but it feels more like a bug to me.
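To make the suggestion concrete, here is a rough sketch of what such a first-example check might look like. This is purely illustrative: nesting_depth and check_first_example are hypothetical names, not part of the datasets API, and the check only compares list-nesting depth, not dtypes.

```python
# Hypothetical sketch of the first-example check proposed above: before
# Arrow writes anything, compare the list-nesting depth of each returned
# column against the depth the declared feature expects.

def nesting_depth(value):
    """Count how many list levels wrap the scalar values."""
    depth = 0
    while isinstance(value, list) and value:
        depth += 1
        value = value[0]
    return depth

def check_first_example(example, expected_depths):
    """Raise a clear error if a column's nesting doesn't match the schema."""
    for column, expected in expected_depths.items():
        actual = nesting_depth(example[column])
        if actual != expected:
            raise TypeError(
                f"Column '{column}': mapped function returned nesting depth "
                f"{actual}, but the declared feature expects depth {expected}"
            )

# A Sequence(Value('int64')) feature expects depth 1, but the mapped
# function returned a batched, depth-2 list:
example = {"input_ids": [[101, 2023, 102]]}
try:
    check_first_example(example, {"input_ids": 1})
except TypeError as e:
    print(e)
```

An error raised here, before any Arrow writing starts, would point directly at the offending column instead of surfacing later as an opaque OverflowError.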

Steps to reproduce the bug

I don't have explicit code to reproduce the bug, but I'll show an example.

Code prior to the fix:

from datasets import (
    Array2D, Array3D, ClassLabel, Dataset, Features, Sequence, Value
)

def preprocess_data(examples):
    # returns an encoded data dict whose keys match the features,
    # but whose value types do not match
    ...

def get_encoded_data(data):
    dataset = Dataset.from_pandas(data)
    unique_labels = data['audit_type'].unique().tolist()
    features = Features({
        'image': Array3D(dtype="uint8", shape=(3, 224, 224)),
        'input_ids': Sequence(feature=Value(dtype='int64')),
        'attention_mask': Sequence(Value(dtype='int64')),
        'token_type_ids': Sequence(Value(dtype='int64')),
        'bbox': Array2D(dtype="int64", shape=(512, 4)),
        'label': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
    })

    encoded_dataset = dataset.map(preprocess_data, features=features, remove_columns=dataset.column_names)

The Features set that fixed it:

    features = Features({
        'image': Sequence(Array3D(dtype="uint8", shape=(3, 224, 224))),
        'input_ids': Sequence(Sequence(feature=Value(dtype='int64'))),
        'attention_mask': Sequence(Sequence(Value(dtype='int64'))),
        'token_type_ids': Sequence(Sequence(Value(dtype='int64'))),
        'bbox': Sequence(Array2D(dtype="int64", shape=(512, 4))),
        'label': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
    })

The difference between my original code (which was based on the documentation) and the working code is the extra Sequence(...) wrapper around 4 of the 5 features: I am working with paginated data, and the doc examples are not.
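The shape difference can be illustrated in plain Python (no datasets dependency; the token values below are made up). With paginated documents the preprocessing returns a list of pages per example, which adds one level of nesting to every field:

```python
# Single-page document: 'input_ids' is a flat list of token ids,
# matching Sequence(Value('int64')).
single_page = {"input_ids": [101, 2023, 102]}

# Paginated document: one token list per page, so the value matches
# Sequence(Sequence(Value('int64'))) instead.
paginated = {"input_ids": [[101, 2023, 102], [101, 4248, 1012, 102]]}

# The outer list is pages, so every element of 'input_ids' is itself a list:
assert all(isinstance(page, list) for page in paginated["input_ids"])
# ...whereas the single-page version holds scalars directly:
assert all(isinstance(tok, int) for tok in single_page["input_ids"])
```

The same extra level applies to the image, bbox, and mask fields, which is why each of them needed an additional Sequence(...) wrapper.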

Expected results

Dataset.map() validates the data types for each Feature on the first iteration and errors out with a clear message if they do not match.

Actual results

Depending on the value of writer_batch_size, execution errors out when Arrow attempts to write, because the types do not match; the error messages, however, don't make this obvious.

Example errors:

OverflowError: There was an overflow with type <class 'list'>. Try to reduce writer_batch_size to have batches smaller than 2GB.
(offset overflow while concatenating arrays)
zsh: killed     python doc_classification.py

UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

Environment info

datasets version: 2.1.0
Platform: macOS-12.2.1-arm64-arm-64bit
Python version: 3.9.12
PyArrow version: 6.0.1
Pandas version: 1.4.2

@lhoestq
Member

lhoestq commented May 16, 2022

Hi! Thanks for reporting :) datasets usually raises a pa.lib.ArrowInvalid error if the feature types don't match.

It would be awesome if we had a way to reproduce the OverflowError in this case, to better understand what happened and be able to provide a better error message.
