When using dataset.map(), if the passed Features types do not match what the mapped function returns, execution does not fail in an obvious way
#4352
Labels
bug
Describe the bug
Recently I was trying to use .map() to preprocess a dataset. I defined the expected Features and passed them into .map() like dataset.map(preprocess_data, features=features). My expected Features keys matched what came out of preprocess_data, but the types I had defined for them did not match the types that came back. Because of this, I ended up in tracebacks deep inside arrow_dataset.py and arrow_writer.py with exceptions that did not make clear what the problem was. In short, I ended up with overflows and the OS killing processes when Arrow was attempting to write. It wasn't until I dug into def write_batch and the loop that iterates over cols that I figured out what was going on.
It seems like .map() could check, for at least one instance from the dataset, that the returned data's types match the types provided by the features param, and error out with a clear exception if they don't. This would make the cause of the issue much more understandable and save people time. This could be construed as a feature, but it feels more like a bug to me.
Steps to reproduce the bug
I don't have explicit code to reproduce the bug, but I'll show an example.
Code prior to the fix:
The Features set that fixed it:
The difference between my original code (which was based on documentation) and the working code is the addition of Sequence(...) to 4/5 features, as I am working with paginated data and the doc examples are not.
Expected results
Dataset.map() attempts to validate the data types for each Feature on the first iteration and errors out if they are not validated.
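A minimal sketch of what such a first-example check could look like, written in plain Python rather than against the datasets API. Everything here is hypothetical and for illustration only: the spec format (a Python type for a scalar, a one-element list for a sequence), the helper names, and the sample columns are assumptions, not the reporter's code or the library's internals. The extra list level in the corrected spec plays the role that Sequence(...) plays in the real Features.

```python
from typing import Any, Callable, Dict


def describe(spec: Any) -> str:
    """Render a spec such as [[int]] as 'sequence of sequence of int'."""
    if isinstance(spec, list):
        return f"sequence of {describe(spec[0])}"
    return spec.__name__


def check_value(value: Any, spec: Any, path: str) -> None:
    """Raise TypeError if value does not match the shape and type in spec."""
    if isinstance(spec, list):
        if not isinstance(value, (list, tuple)):
            raise TypeError(
                f"{path}: expected {describe(spec)}, got {type(value).__name__}"
            )
        for i, item in enumerate(value):
            check_value(item, spec[0], f"{path}[{i}]")
    elif not isinstance(value, spec) or (isinstance(value, bool) and spec is not bool):
        # bool is a subclass of int, so it is rejected explicitly for non-bool specs.
        raise TypeError(
            f"{path}: expected {describe(spec)}, got {type(value).__name__}"
        )


def validate_first_example(
    function: Callable[[Dict[str, Any]], Dict[str, Any]],
    example: Dict[str, Any],
    schema: Dict[str, Any],
) -> None:
    """Run function on one example and check its declared columns against schema."""
    output = function(example)
    for name, value in output.items():
        if name in schema:
            check_value(value, schema[name], name)


def preprocess_data(example: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical paginated output: one list of token ids per page.
    return {"input_ids": [[101, 7592], [101, 2088]], "label": 1}


# Declared types copied from a non-paginated example: one level of nesting too few.
declared = {"input_ids": [int], "label": int}
# Corrected types: an extra sequence level for the paginated column.
corrected = {"input_ids": [[int]], "label": int}

try:
    validate_first_example(preprocess_data, {}, declared)
    message = ""
except TypeError as err:
    message = str(err)

print(message)  # input_ids[0]: expected int, got list

# The corrected schema passes without raising.
validate_first_example(preprocess_data, {}, corrected)
```

Failing on the first example names the offending column and the expected versus actual type before any batch reaches the writer, which is the clear exception this issue asks for.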
Actual results
Based on the value of writer_batch_size, execution errors out when Arrow attempts to write because the types do not match, though its error messages don't make this obvious.
Example errors:
Environment info
datasets version: 2.1.0
Platform: macOS-12.2.1-arm64-arm-64bit
Python version: 3.9.12
PyArrow version: 6.0.1
Pandas version: 1.4.2