
Adding support for generic multi dimensional tensors and auxillary image data for multimodal datasets #363

Merged

Conversation

@eltoto1219 (Contributor)

nlp/features.py:

The main factory class is MultiArray. Every time this class is called, a corresponding pyarrow extension array class and extension type class are generated (and added to the globals for future use) for a given root data type and set of dimensions/shape. I provide examples of working with this in datasets/lxmert_pretraining_beta/test_multi_array.py.
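
For reference, a minimal standalone sketch of the idea (hypothetical class and field names, not the actual MultiArray code) showing how a pyarrow extension type for fixed-shape 2D tensors can be defined and instantiated:

    import numpy as np
    import pyarrow as pa

    class Matrix2DType(pa.PyExtensionType):
        # hypothetical extension type: list<list<dtype>> storage for a fixed 2D shape
        def __init__(self, shape=(5, 5), dtype="float32"):
            self._shape = shape
            self._dtype = dtype
            pa.PyExtensionType.__init__(self, pa.list_(pa.list_(pa.from_numpy_dtype(np.dtype(dtype)))))

        def __reduce__(self):
            # required so the type can be pickled and reconstructed when read back
            return Matrix2DType, (self._shape, self._dtype)

    # wrap two 5x5 matrices as an extension array over list<list<float32>> storage
    storage = pa.array([np.ones((5, 5)).tolist()] * 2, type=pa.list_(pa.list_(pa.float32())))
    ext_arr = pa.ExtensionArray.from_storage(Matrix2DType((5, 5), "float32"), storage)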

src/nlp/arrow_writer.py

I had to add a method for writing batches that include extension array types, because despite having a unique class for each multidimensional array shape, pyarrow is unable to write any other "array-like" data class to a batch object unless it is of type pyarrow.ExtensionType. The problem is that when writing multiple batches, the order of the schema and the data to be written gets mixed up (the pyarrow datatype in the schema only shows up as ExtensionArray, while each ExtensionArray subclass has a different shape)... possibly I am missing something here and would be grateful if anyone else could take a look!

datasets/lxmert_pretraining_beta/lxmert_pretraining_beta.py & datasets/lxmert_pretraining_beta/to_arrow_data.py:

I have begun adding the data from the original LXMERT paper (https://arxiv.org/abs/1908.07490) hosted here: (https://github.com/airsplay/lxmert). The reason I am not pulling from the source of truth for each individual dataset is that it seems there will also need to be functionality to aggregate multimodal datasets to create a pre-training corpus (:sleepy: ).
For now, this is just being used to test and run edge cases for the MultiArray feature, so I've labeled it as "beta_pretraining"!

(still working on the pretraining, just wanted to push out the new functionality sooner than later)

@eltoto1219 changed the title from "Adding support for an generic multi dimensional tensors and auxillary image data for multimodal datasets" to "Adding support for generic multi dimensional tensors and auxillary image data for multimodal datasets" on Jul 9, 2020
@lhoestq (Member) left a comment

Really cool ! I left some comments in the code.

This is going in the right direction, great job !
If I can help you with some aspects of Apache Arrow, feel free to contact me (here on GitHub or on Slack)

Also should we make this PR a "draft PR" until things are ready to get merged ?

Resolved review comments on src/nlp/features.py (4), src/nlp/arrow_writer.py, and datasets/lxmert_pretraining_beta/test_multi_array.py.
@eltoto1219 marked this pull request as a draft on July 10, 2020
@eltoto1219 (Contributor, Author)

Thank you! I just marked this as a draft PR. It would probably be better to create specific Array2D and Array3D classes as needed instead of a generic MultiArray for now; it should simplify the code a lot too, so I'll update it as such. Also, I was meaning to reply earlier, but I wanted to thank you for the testing script you sent me since it ended up being tremendously helpful.

@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch 2 times, most recently from 145feb5 to 9072e05, on July 11, 2020
@eltoto1219 (Contributor, Author)

Okay, I just converted the MultiArray class to Array2D, and got rid of all those "globals()"!

The main issue I had was that when including a pa.ExtensionType as a column, the ordinary methods to batch the data would not work and threw a mysterious error. So I first cleaned up my code to order each row to match the schema (when extension types are included, the row comes out disordered), and then made each row a pa.Table and concatenated all the tables. Also, each n-dimensional vector class we implement will be size invariant, which is some good news.
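
Roughly, the per-row workaround looks like this (a simplified sketch with assumed helper names, not the exact arrow_writer code):

    import pyarrow as pa

    def row_to_table(row, schema):
        # build each column in schema order; extension columns go through from_storage,
        # since pa.array cannot construct a builder for extension types directly
        arrays = []
        for f in schema:
            if isinstance(f.type, pa.ExtensionType):
                storage = pa.array([row[f.name]], type=f.type.storage_type)
                arrays.append(pa.ExtensionArray.from_storage(f.type, storage))
            else:
                arrays.append(pa.array([row[f.name]], type=f.type))
        return pa.Table.from_arrays(arrays, schema=schema)

    def rows_to_table(rows, schema):
        # one single-row table per example, concatenated at the end
        return pa.concat_tables([row_to_table(row, schema) for row in rows])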

@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch from ed53c8b to b554845 on July 14, 2020
@lhoestq (Member) left a comment

That's clean! It looks like we're getting closer and closer to something very cool here :)
And thanks for changing MultiArray to Array2D.

Could you give more details about the error you have with the code in write_on_file ?

Also I left some comments (mostly nitpicking).

Resolved review comments on src/nlp/features.py (3) and src/nlp/arrow_writer.py (3).
@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch from b088aa8 to 0cbcb9b on July 16, 2020
@eltoto1219 (Contributor, Author) commented on Jul 16, 2020

Okay awesome! I just added your suggestions and changed up my recursive functions.

Here is the traceback for when I use the original code in the write_on_file method:

Traceback (most recent call last):
  File "<stdin>", line 33, in <module>
  File "/home/eltoto/nlp/src/nlp/arrow_writer.py", line 214, in finalize
    self.write_on_file()
  File "/home/eltoto/nlp/src/nlp/arrow_writer.py", line 134, in write_on_file
    pa_array = pa.array(self.current_rows, type=self._type)
  File "pyarrow/array.pxi", line 269, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 38, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 106, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: MakeBuilder: cannot construct builder for type extension<arrow.py_extension_type>

shell returned 1

I think when trying to cast an extension array within a list of dictionaries, some method gets called that bugs out Arrow, and that method somehow doesn't get called when adding a single row to a table and then appending multiple tables together. I tinkered with this for a while but could not find any workaround.

In case this new method causes bad compression or worse performance, we can explicitly set the batch size in the pa.Table.to_batches(batch_size) method, which will return a list of batches. Perhaps we can also check that the batch size does not get too large by converting the table to batches after every X rows are appended to it, following the batch_size check below.

@lhoestq (Member) commented on Jul 21, 2020

I think when trying to cast an extension array within a list of dictionaries, some method gets called that bugs out Arrow, and that method somehow doesn't get called when adding a single row to a table and then appending multiple tables together. I tinkered with this for a while but could not find any workaround.

Indeed that's weird.

In case this new method causes bad compression or worse performance, we can explicitly set the batch size in the pa.Table.to_batches(batch_size) method, which will return a list of batches. Perhaps we can also check that the batch size does not get too large by converting the table to batches after every X rows are appended to it, following the batch_size check below.

The argument of pa.Table.to_batches is not batch_size but max_chunksize, which means that right now it would have no effect (each chunk is of length 1).

We can fix that just by doing entries.combine_chunks().to_batches(max_chunksize=batch_size). In that case it would write in chunks of 1000, which is what we want. I don't think it will slow down the writing by much, but we may have to do a benchmark just to make sure. If speed is OK we could even replace the original code to always write chunks this way.
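
A rough illustration of the chunking behavior with a plain integer column (just a sketch, not the writer code):

    import pyarrow as pa

    # a table made of many single-row chunks, e.g. from concatenating per-row tables
    entries = pa.concat_tables([pa.Table.from_pydict({"x": [i]}) for i in range(3000)])

    # merge the tiny chunks first, then re-split into batches of at most 1000 rows
    batches = entries.combine_chunks().to_batches(max_chunksize=1000)
    assert [b.num_rows for b in batches] == [1000, 1000, 1000]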

Do you still have errors that need to be fixed ?

@eltoto1219 (Contributor, Author)

@lhoestq Nope all should be good!

Would you like me to add the entries.combine_chunks().to_batches(max_chunksize=batch_size) code + benchmark?

@lhoestq (Member) commented on Jul 24, 2020

@lhoestq Nope all should be good!

Awesome :)

I think it would be good to start to add some tests then.
You already have test_multi_array.py which is a good start, maybe you can place it in /tests and make it a unittest.TestCase ?

Would you like me to add the entries.combine_chunks().to_batches(max_chunksize=batch_size) code + benchmark?

That would be interesting. We don't want reading/writing to be the bottleneck of dataset processing, for example in terms of speed. Maybe we could test the write + read speed of different datasets:

  • write speed + read speed of a dataset with nlp.Array2D features
  • write speed + read speed of a dataset with nlp.Sequence(nlp.Sequence(nlp.Value("float32"))) features
  • write speed + read speed of a dataset with nlp.Sequence(nlp.Value("float32")) features (same data but flattened)

It will be interesting to see the influence of .combine_chunks() on the Array2D test too.

What do you think ?

@lhoestq (Member) commented on Jul 24, 2020

Well actually it looks like we're still having the print(dataset[0]) error, no?

@lhoestq (Member) commented on Jul 24, 2020

I just tested your code to try to understand better.

  • First thing you must know: we've switched from dataset._data.to_pandas to dataset._data.to_pydict by default when we call dataset[0] in "Change features vs schema logic" #423. Right now it raises an error, but it can be fixed by adding this method to ExtensionArray2D:

        def to_pylist(self):
            return self.to_numpy().tolist()

  • Second, I noticed that ExtensionArray2D.to_numpy() always returns a (5, 5) shape in your example. I thought ExtensionArray was for possibly multiple examples, so I was expecting a shape like (1, 5, 5). Did I miss something?
    Therefore, when I apply the fix I mentioned (adding to_pylist), it returns one example per row of each image (in your example of 2 images of shape 5x5, I get len(dataset._data.to_pydict()["image"]) == 10 # True).

    [EDIT] I changed the reshape step in ExtensionArray2D.to_numpy() to

        numpy_arr = numpy_arr.reshape(len(self), *ExtensionArray2D._construct_shape(self.storage))

    and it did the job: len(dataset._data.to_pydict()["image"]) == 2 # True

  • Finally, I was able to make to_pandas work though, by implementing a custom array dtype in pandas with arrow conversion (I got inspiration from here and here).

Maybe you could add me to your repo so I can open a PR to add these changes to your branch?

@lhoestq (Member) commented on Jul 24, 2020

combine_chunks doesn't seem to work btw:
ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type>

@eltoto1219 (Contributor, Author)

(quoting @lhoestq's suggestions above about adding tests and benchmarking write/read speed for Array2D vs. nested Sequence features)

Ya! That should be no problem at all. I'll use the timeit module and get back to you with the results sometime over the weekend.
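
A rough sketch of the kind of timing harness this could use (the write_fn/read_fn helpers are hypothetical, one pair per feature type being compared):

    import timeit

    def benchmark(name, write_fn, read_fn, repeat=3):
        # time a full write and a full read, keeping the best of a few runs
        write_s = min(timeit.repeat(write_fn, number=1, repeat=repeat))
        read_s = min(timeit.repeat(read_fn, number=1, repeat=repeat))
        print(f"{name}: write {write_s:.3f}s, read {read_s:.3f}s")

    # e.g. benchmark("Array2D", write_array2d, read_array2d)
    #      benchmark("Sequence(Sequence(float32))", write_nested, read_nested)
    #      benchmark("Sequence(float32) flattened", write_flat, read_flat)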

@eltoto1219 (Contributor, Author)

Thank you for all your help getting the pandas conversion and row indexing for the dataset to work! For print(dataset[0]), I considered the workaround of doing print(dataset["col_name"][0]) a temporary solution, but ya, I was never able to figure out how to get it to work previously. I'll add you to my repo right now; let me know if you do not see the invite. Lastly, it is strange how the to_batches method is not working, so I can check that out while I add some speed tests and move the multi-dim test under the unit tests this weekend.

@lhoestq (Member) commented on Jul 27, 2020

I created the PR :)
I also tested to_batches and it works on my side

@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch from db72b56 to 81f120c on July 30, 2020
@eltoto1219 (Contributor, Author)

Sorry for the bit of a delay! I just added the tests, merged your PR into my fork, and added some speed tests. It should be fairly easy to add more tests if we need to. Do you think there is anything else to check out?

@lhoestq (Member) commented on Jul 30, 2020

Cool thanks for adding the tests :)

Next step is merge master into this branch.
Not sure I understand what you did in your last commit, but it looks like you discarded all the changes from master ^^'

We've done some changes in the features logic on master, so let me know if you need help merging it.

As soon as we've merged from master, we'll have to make sure that we have extensive tests and we'll be good to go!
About the lxmert dataset, we can probably keep it for another PR as soon as we have working 2d features. What do you think ?

@thomwolf (Member)

We might want to merge this after tomorrow's release though to avoid potential side effects @lhoestq

@lhoestq (Member) commented on Jul 30, 2020

Yep I'm sure we can have it not for tomorrow's release but for the next one ;)

@eltoto1219 (Contributor, Author)

Haha, when I tried to rebase, I ran into some conflicts. In that last commit, I restored features.py from the previous commit on the branch in my fork, because upon updating to master, the pandas dtype manager and pandas extension types disappeared. If you could help me with merging in what is needed, that would help a lot.

Other than that, ya, let me go ahead and move the dataloader code out of this PR. Perhaps we could discuss in the Slack channel soon about what to do with that, because we can either just support the pretraining corpus for LXMERT or try to implement the full COCO and Visual Genome datasets (+VQA, +GQA), which I'm sure people would be pretty happy about.

Also we can talk more tests soon too when you are free.

Good luck on the release tomorrow, guys!

@lhoestq (Member) commented on Aug 20, 2020

I'll add Array3D, 4D.. tomorrow but it should take only a few lines. The rest won't change

@thomwolf (Member) left a comment

Very nice addition! I love it @eltoto1219 @lhoestq

Resolved review comment on src/nlp/arrow_dataset.py.

src/nlp/arrow_dataset.py, review comment on this hunk:

    else:
        outputs = self._data[key].to_pylist()
        outputs = self._data.to_pandas(types_mapper=pandas_types_mapper)[key].to_list()

Member: Same question here

Two more resolved review comments on src/nlp/arrow_dataset.py.
@@ -45,13 +125,12 @@ class ArrowWriter(object):

    def __init__(
        self,
        data_type: Optional[pa.DataType] = None,
Member: Good cleanup!

Resolved review comments on src/nlp/features.py (2).

    def to_numpy(self):
        numpy_arr = Array2DExtensionType._generate_flatten(self.storage, self.dims)
        numpy_arr = numpy_arr.reshape(len(self), *Array2DExtensionArray._construct_shape(self.storage))
Member: Is this zero copy?

Member:
In 2D, these two lines are equivalent to self.storage.flatten().flatten().to_numpy(zero_copy=True).reshape(...).
The zero-copy conversion from arrow to numpy is done in _generate_flatten.
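
For illustration, a standalone version of that zero-copy path (assuming list<list<double>> storage with no nulls, not the exact library code):

    import numpy as np
    import pyarrow as pa

    # two 5x5 matrices stored as list<list<double>>, analogous to the Array2D storage
    storage = pa.array([np.arange(25, dtype="float64").reshape(5, 5).tolist()] * 2,
                       type=pa.list_(pa.list_(pa.float64())))

    # flatten twice to reach the primitive values, view them as numpy without copying,
    # then reshape to (num_rows, 5, 5)
    values = storage.flatten().flatten()
    numpy_arr = values.to_numpy(zero_copy_only=True).reshape(len(storage), 5, 5)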

@@ -479,7 +667,9 @@ def generate_from_dict(obj: Any):

    if class_type == Sequence:
        return Sequence(feature=generate_from_dict(obj["feature"]), length=obj["length"])
    return class_type(**obj)

    field_names = set(f.name for f in fields(class_type))
Member: Why do we need to do this (curious)?

Member:
Feature types like Value etc. are serialized in JSON format when saving the dataset info.
If in the future we add fields to these feature types, they would still be loadable from older versions of nlp.
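
A minimal sketch of that forward-compatibility idea (using a stand-in dataclass rather than the real nlp.Value, not the exact library code):

    from dataclasses import dataclass, fields

    @dataclass
    class Value:
        dtype: str
        id: str = None

    def generate_from_dict_sketch(class_type, obj):
        # only pass the keys the dataclass actually declares, so a dict serialized
        # with extra fields by a newer version can still be deserialized
        field_names = set(f.name for f in fields(class_type))
        return class_type(**{k: v for k, v in obj.items() if k in field_names})

    # a dict carrying an extra "future_option" key still loads into Value
    value = generate_from_dict_sketch(Value, {"dtype": "float32", "future_option": True})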

Resolved review comment on tests/test_array_2d.py.
@lhoestq (Member) commented on Aug 21, 2020

I took your comments into account and I added Array[3-5]D.
I changed the storage type to fixed-size lists. I had to update the to_numpy function because of that: slicing a FixedSizeListArray returns a view of the original array, while in the previous case slicing a ListArray copies the storage.
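
A small illustration of the new storage (assumed 5x5 float32 shape, not the library's exact code):

    import pyarrow as pa

    # fixed-size list storage: each row is a 5x5 float32 matrix
    storage_type = pa.list_(pa.list_(pa.float32(), 5), 5)
    rows = [[[float(i * 5 + j) for j in range(5)] for i in range(5)] for _ in range(2)]
    arr = pa.array(rows, type=storage_type)

    # per the comment above, slicing a FixedSizeListArray returns a view over the same
    # value buffers, unlike slicing a ListArray, so to_numpy can stay zero-copy after slicing
    view = arr.slice(1, 1)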

@lhoestq requested a review from thomwolf on August 21, 2020
@thomwolf (Member) left a comment

LGTM!

    id: Optional[str] = None
    # Automatically constructed
    _type: str = field(default="Array2D", init=False, repr=False)

class _ArrayXD:
Member: XD!

]


@parameterized.named_parameters(get_array_feature_types())
Member: Nice!

@lhoestq force-pushed the support_multi_dim_tensors_for_images branch from 52ace77 to 375858c on August 24, 2020
@lhoestq merged commit c93a19e into huggingface:master on Aug 24, 2020