
When using dataset.map() if passed Features types do not match what is returned from the mapped function, execution does not except in an obvious way #4352

Open
plamb-viso opened this issue May 14, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@plamb-viso

Describe the bug

Recently I was trying to use .map() to preprocess a dataset. I defined the expected Features and passed them into .map() like dataset.map(preprocess_data, features=features). My Features keys matched what came out of preprocess_data, but the types I had defined for them did not match the types that actually came back. Because of this, I ended up with tracebacks deep inside arrow_dataset.py and arrow_writer.py whose exceptions did not make the problem clear. In short, I hit overflows and the OS killing processes while Arrow was attempting to write. It wasn't until I dug into def write_batch and its loop over the columns that I figured out what was going on.

It seems like .map() could check, for at least one example from the dataset, that the returned data's types match the types declared in the features param, and error out with a clear exception if they don't. This would make the cause of the issue much more understandable and save people time. This could be construed as a feature request, but it feels more like a bug to me.
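To make the suggestion concrete, here is a rough sketch of what such a first-example check might look like. This is purely illustrative: nesting_depth and check_first_example are hypothetical names, not part of the datasets API, and the check only compares list-nesting depth, not dtypes.

```python
# Hypothetical sketch of the first-example check proposed above: before
# Arrow writes anything, compare the list-nesting depth of each returned
# column against the depth the declared feature expects.

def nesting_depth(value):
    """Count how many list levels wrap the scalar values."""
    depth = 0
    while isinstance(value, list) and value:
        depth += 1
        value = value[0]
    return depth

def check_first_example(example, expected_depths):
    """Raise a clear error if a column's nesting doesn't match the schema."""
    for column, expected in expected_depths.items():
        actual = nesting_depth(example[column])
        if actual != expected:
            raise TypeError(
                f"Column '{column}': mapped function returned nesting depth "
                f"{actual}, but the declared feature expects depth {expected}"
            )

# A Sequence(Value('int64')) feature expects depth 1, but the mapped
# function returned a batched, depth-2 list:
example = {"input_ids": [[101, 2023, 102]]}
try:
    check_first_example(example, {"input_ids": 1})
except TypeError as e:
    print(e)
```

An error raised here, before any Arrow writing starts, would point directly at the offending column instead of surfacing later as an opaque OverflowError.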

Steps to reproduce the bug

I don't have explicit code to reproduce the bug, but I'll show an example.

Code prior to the fix:

from datasets import (
    Array2D, Array3D, ClassLabel, Dataset, Features, Sequence, Value
)

def preprocess_data(examples):
    # returns an encoded data dict whose keys match the features,
    # but whose value types do not match
    ...

def get_encoded_data(data):
    dataset = Dataset.from_pandas(data)
    unique_labels = data['audit_type'].unique().tolist()
    features = Features({
        'image': Array3D(dtype="uint8", shape=(3, 224, 224)),
        'input_ids': Sequence(feature=Value(dtype='int64')),
        'attention_mask': Sequence(Value(dtype='int64')),
        'token_type_ids': Sequence(Value(dtype='int64')),
        'bbox': Array2D(dtype="int64", shape=(512, 4)),
        'label': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
    })

    encoded_dataset = dataset.map(preprocess_data, features=features, remove_columns=dataset.column_names)

The Features set that fixed it:

    features = Features({
        'image': Sequence(Array3D(dtype="uint8", shape=(3, 224, 224))),
        'input_ids': Sequence(Sequence(feature=Value(dtype='int64'))),
        'attention_mask': Sequence(Sequence(Value(dtype='int64'))),
        'token_type_ids': Sequence(Sequence(Value(dtype='int64'))),
        'bbox': Sequence(Array2D(dtype="int64", shape=(512, 4))),
        'label': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
    })

The difference between my original code (which was based on the documentation) and the working code is the extra Sequence(...) wrapper around 4 of the 5 features: I am working with paginated data, and the doc examples are not.
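The shape difference can be illustrated in plain Python (no datasets dependency; the token values below are made up). With paginated documents the preprocessing returns a list of pages per example, which adds one level of nesting to every field:

```python
# Single-page document: 'input_ids' is a flat list of token ids,
# matching Sequence(Value('int64')).
single_page = {"input_ids": [101, 2023, 102]}

# Paginated document: one token list per page, so the value matches
# Sequence(Sequence(Value('int64'))) instead.
paginated = {"input_ids": [[101, 2023, 102], [101, 4248, 1012, 102]]}

# The outer list is pages, so every element of 'input_ids' is itself a list:
assert all(isinstance(page, list) for page in paginated["input_ids"])
# ...whereas the single-page version holds scalars directly:
assert all(isinstance(tok, int) for tok in single_page["input_ids"])
```

The same extra level applies to the image, bbox, and mask fields, which is why each of them needed an additional Sequence(...) wrapper.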

Expected results

Dataset.map() validates the data types for each Feature on the first iteration and errors out with a clear message if they do not match.

Actual results

Depending on the value of writer_batch_size, execution errors out when Arrow attempts to write, because the types do not match; the error messages, however, don't make this obvious.

Example errors:

OverflowError: There was an overflow with type <class 'list'>. Try to reduce writer_batch_size to have batches smaller than 2GB.
(offset overflow while concatenating arrays)
zsh: killed     python doc_classification.py

UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

Environment info

datasets version: 2.1.0
Platform: macOS-12.2.1-arm64-arm-64bit
Python version: 3.9.12
PyArrow version: 6.0.1
Pandas version: 1.4.2

@lhoestq
Member

lhoestq commented May 16, 2022

Hi! Thanks for reporting :) datasets usually raises a pa.lib.ArrowInvalid error if the feature types don't match.

It would be awesome if we had a way to reproduce the OverflowError in this case, to better understand what happened and be able to provide a better error message.
