
Adding support for generic multi dimensional tensors and auxillary image data for multimodal datasets #363

Merged

Conversation

@eltoto1219 (Contributor)

nlp/features.py:

The main factory class is MultiArray. Every time this class is called, a corresponding pyarrow extension array class and extension type class are generated (and added to the globals for future use) for a given root data type and set of dimensions/shape. I provide examples of working with this in datasets/lxmert_pretraining_beta/test_multi_array.py.
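
For reference, a minimal standalone sketch of the idea (hypothetical class and field names, not the actual MultiArray code) showing how a pyarrow extension type for fixed-shape 2D tensors can be defined and instantiated:

    import numpy as np
    import pyarrow as pa

    class Matrix2DType(pa.PyExtensionType):
        # hypothetical extension type: list<list<dtype>> storage for a fixed 2D shape
        def __init__(self, shape=(5, 5), dtype="float32"):
            self._shape = shape
            self._dtype = dtype
            pa.PyExtensionType.__init__(self, pa.list_(pa.list_(pa.from_numpy_dtype(np.dtype(dtype)))))

        def __reduce__(self):
            # required so the type can be pickled and reconstructed when read back
            return Matrix2DType, (self._shape, self._dtype)

    # wrap two 5x5 matrices as an extension array over list<list<float32>> storage
    storage = pa.array([np.ones((5, 5)).tolist()] * 2, type=pa.list_(pa.list_(pa.float32())))
    ext_arr = pa.ExtensionArray.from_storage(Matrix2DType((5, 5), "float32"), storage)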

src/nlp/arrow_writer.py

I had to add a method for writing batches that include extension array types, because despite having a unique class for each multidimensional array shape, pyarrow is unable to write any other "array-like" data class to a batch object unless it is of type pyarrow.ExtensionType. The problem is that when writing multiple batches, the order of the schema and the data to be written gets mixed up (the pyarrow datatype in the schema only shows up as ExtensionArray, while each ExtensionArray subclass has a different shape)... possibly I am missing something here and would be grateful if anyone else could take a look!

datasets/lxmert_pretraining_beta/lxmert_pretraining_beta.py & datasets/lxmert_pretraining_beta/to_arrow_data.py:

I have begun adding the data from the original LXMERT paper (https://arxiv.org/abs/1908.07490) hosted here: (https://github.com/airsplay/lxmert). The reason I am not pulling from the source of truth for each individual dataset is that it seems there will also need to be functionality to aggregate multimodal datasets to create a pre-training corpus (:sleepy: ).
For now, this is just being used to test and run edge cases for the MultiArray feature, so I've labeled it as "beta_pretraining"!

(still working on the pretraining, just wanted to push out the new functionality sooner than later)

@eltoto1219 changed the title from "Adding support for an generic multi dimensional tensors and auxillary image data for multimodal datasets" to "Adding support for generic multi dimensional tensors and auxillary image data for multimodal datasets" on Jul 9, 2020
@lhoestq (Member) left a comment

Really cool ! I left some comments in the code.

This is going in the right direction, great job !
If I can help you with some aspects of Apache Arrow, feel free to contact me (here on GitHub or on Slack)

Also should we make this PR a "draft PR" until things are ready to get merged ?

Resolved review comments on src/nlp/features.py (4), src/nlp/arrow_writer.py, and datasets/lxmert_pretraining_beta/test_multi_array.py.
@eltoto1219 marked this pull request as a draft on July 10, 2020
@eltoto1219 (Contributor, Author)

Thank you! I just marked this as a draft PR. It would probably be better to create specific Array2D and Array3D classes as needed instead of a generic MultiArray for now; it should simplify the code a lot too, so I'll update it as such. Also, I was meaning to reply earlier, but I wanted to thank you for the testing script you sent me since it ended up being tremendously helpful.

@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch 2 times, most recently from 145feb5 to 9072e05, on July 11, 2020
@eltoto1219 (Contributor, Author)

Okay, I just converted the MultiArray class to Array2D, and got rid of all those "globals()"!

The main issue I had was that when including a pa.ExtensionType as a column, the ordinary methods to batch the data would not work and threw a mysterious error. So I first cleaned up my code to order each row to match the schema (when extension types are included, the row comes out disordered), and then made each row a pa.Table and concatenated all the tables. Also, each n-dimensional vector class we implement will be size invariant, which is some good news.
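
Roughly, the per-row workaround looks like this (a simplified sketch with assumed helper names, not the exact arrow_writer code):

    import pyarrow as pa

    def row_to_table(row, schema):
        # build each column in schema order; extension columns go through from_storage,
        # since pa.array cannot construct a builder for extension types directly
        arrays = []
        for f in schema:
            if isinstance(f.type, pa.ExtensionType):
                storage = pa.array([row[f.name]], type=f.type.storage_type)
                arrays.append(pa.ExtensionArray.from_storage(f.type, storage))
            else:
                arrays.append(pa.array([row[f.name]], type=f.type))
        return pa.Table.from_arrays(arrays, schema=schema)

    def rows_to_table(rows, schema):
        # one single-row table per example, concatenated at the end
        return pa.concat_tables([row_to_table(row, schema) for row in rows])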

@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch from ed53c8b to b554845 on July 14, 2020
@lhoestq (Member) left a comment

That's clean! It looks like we're getting closer and closer to something very cool here :)
And thanks for changing MultiArray to Array2D.

Could you give more details about the error you have with the code in write_on_file ?

Also I left some comments (mostly nitpicking).

Resolved review comments on src/nlp/features.py (3) and src/nlp/arrow_writer.py (3).
@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch from b088aa8 to 0cbcb9b on July 16, 2020
@eltoto1219 (Contributor, Author) commented on Jul 16, 2020

Okay awesome! I just added your suggestions and changed up my recursive functions.

Here is the traceback for when I use the original code in the write_on_file method:

Traceback (most recent call last):
  File "<stdin>", line 33, in <module>
  File "/home/eltoto/nlp/src/nlp/arrow_writer.py", line 214, in finalize
    self.write_on_file()
  File "/home/eltoto/nlp/src/nlp/arrow_writer.py", line 134, in write_on_file
    pa_array = pa.array(self.current_rows, type=self._type)
  File "pyarrow/array.pxi", line 269, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 38, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 106, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: MakeBuilder: cannot construct builder for type extension<arrow.py_extension_type>

shell returned 1

I think when trying to cast an extension array within a list of dictionaries, some method gets called that bugs out Arrow, and that method somehow doesn't get called when adding a single row to a table and then appending multiple tables together. I tinkered with this for a while but could not find any workaround.

In case this new method causes bad compression or worse performance, we can explicitly set the batch size in the pa.Table.to_batches(batch_size) method, which will return a list of batches. Perhaps we can also check that the batch size does not get too large by converting the table to batches after every X rows are appended to it, following the batch_size check below.

@lhoestq (Member) commented on Jul 21, 2020

I think when trying to cast an extension array within a list of dictionaries, some method gets called that bugs out Arrow, and that method somehow doesn't get called when adding a single row to a table and then appending multiple tables together. I tinkered with this for a while but could not find any workaround.

Indeed that's weird.

In case this new method causes bad compression or worse performance, we can explicitly set the batch size in the pa.Table.to_batches(batch_size) method, which will return a list of batches. Perhaps we can also check that the batch size does not get too large by converting the table to batches after every X rows are appended to it, following the batch_size check below.

The argument of pa.Table.to_batches is not batch_size but max_chunksize, which means that right now it would have no effect (each chunk is of length 1).

We can fix that just by doing entries.combine_chunks().to_batches(max_chunksize=batch_size). In that case it would write in chunks of 1000, which is what we want. I don't think it will slow down the writing by much, but we may have to do a benchmark just to make sure. If speed is OK we could even replace the original code to always write chunks this way.
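
A rough illustration of the chunking behavior with a plain integer column (just a sketch, not the writer code):

    import pyarrow as pa

    # a table made of many single-row chunks, e.g. from concatenating per-row tables
    entries = pa.concat_tables([pa.Table.from_pydict({"x": [i]}) for i in range(3000)])

    # merge the tiny chunks first, then re-split into batches of at most 1000 rows
    batches = entries.combine_chunks().to_batches(max_chunksize=1000)
    assert [b.num_rows for b in batches] == [1000, 1000, 1000]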

Do you still have errors that need to be fixed ?

@eltoto1219 (Contributor, Author)

@lhoestq Nope all should be good!

Would you like me to add the entries.combine_chunks().to_batches(max_chunksize=batch_size) code + benchmark?

@lhoestq (Member) commented on Jul 24, 2020

@lhoestq Nope all should be good!

Awesome :)

I think it would be good to start to add some tests then.
You already have test_multi_array.py which is a good start, maybe you can place it in /tests and make it a unittest.TestCase ?

Would you like me to add the entries.combine_chunks().to_batches(max_chunksize=batch_size) code + benchmark?

That would be interesting. We don't want reading/writing to be the bottleneck of dataset processing, for example in terms of speed. Maybe we could test the write + read speed of different datasets:

  • write speed + read speed of a dataset with nlp.Array2D features
  • write speed + read speed of a dataset with nlp.Sequence(nlp.Sequence(nlp.Value("float32"))) features
  • write speed + read speed of a dataset with nlp.Sequence(nlp.Value("float32")) features (same data but flattened)

It will be interesting to see the influence of .combine_chunks() on the Array2D test too.

What do you think ?

@lhoestq (Member) commented on Jul 24, 2020

Well actually it looks like we're still having the print(dataset[0]) error, no?

@lhoestq (Member) commented on Jul 24, 2020

I just tested your code to try to understand better.

  • First thing you must know: we've switched from dataset._data.to_pandas to dataset._data.to_pydict by default when we call dataset[0] in "Change features vs schema logic" #423. Right now it raises an error, but it can be fixed by adding this method to ExtensionArray2D:

        def to_pylist(self):
            return self.to_numpy().tolist()

  • Second, I noticed that ExtensionArray2D.to_numpy() always returns a (5, 5) shape in your example. I thought ExtensionArray was for possibly multiple examples, so I was expecting a shape like (1, 5, 5). Did I miss something?
    Therefore, when I apply the fix I mentioned (adding to_pylist), it returns one example per row of each image (in your example of 2 images of shape 5x5, I get len(dataset._data.to_pydict()["image"]) == 10 # True).

    [EDIT] I changed the reshape step in ExtensionArray2D.to_numpy() to

        numpy_arr = numpy_arr.reshape(len(self), *ExtensionArray2D._construct_shape(self.storage))

    and it did the job: len(dataset._data.to_pydict()["image"]) == 2 # True

  • Finally, I was able to make to_pandas work though, by implementing a custom array dtype in pandas with arrow conversion (I got inspiration from here and here).

Maybe you could add me to your repo so I can open a PR to add these changes to your branch?

@lhoestq (Member) commented on Jul 24, 2020

combine_chunks doesn't seem to work btw:
ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type>

@eltoto1219 (Contributor, Author)

(quoting @lhoestq's suggestions above about adding tests and benchmarking write/read speed for Array2D vs. nested Sequence features)

Ya! That should be no problem at all. I'll use the timeit module and get back to you with the results sometime over the weekend.
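
A rough sketch of the kind of timing harness this could use (the write_fn/read_fn helpers are hypothetical, one pair per feature type being compared):

    import timeit

    def benchmark(name, write_fn, read_fn, repeat=3):
        # time a full write and a full read, keeping the best of a few runs
        write_s = min(timeit.repeat(write_fn, number=1, repeat=repeat))
        read_s = min(timeit.repeat(read_fn, number=1, repeat=repeat))
        print(f"{name}: write {write_s:.3f}s, read {read_s:.3f}s")

    # e.g. benchmark("Array2D", write_array2d, read_array2d)
    #      benchmark("Sequence(Sequence(float32))", write_nested, read_nested)
    #      benchmark("Sequence(float32) flattened", write_flat, read_flat)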

@eltoto1219 (Contributor, Author)

Thank you for all your help getting the pandas conversion and row indexing for the dataset to work! For print(dataset[0]), I considered the workaround of doing print(dataset["col_name"][0]) a temporary solution, but ya, I was never able to figure out how to get it to work previously. I'll add you to my repo right now; let me know if you do not see the invite. Lastly, it is strange how the to_batches method is not working, so I can check that out while I add some speed tests and move the multi-dim test under the unit tests this weekend.

@lhoestq (Member) commented on Jul 27, 2020

I created the PR :)
I also tested to_batches and it works on my side

@eltoto1219 force-pushed the support_multi_dim_tensors_for_images branch from db72b56 to 81f120c on July 30, 2020
@eltoto1219 (Contributor, Author)

Sorry for the bit of a delay! I just added the tests, merged your PR into my fork, and added some speed tests. It should be fairly easy to add more tests if we need to. Do you think there is anything else to check out?

@lhoestq (Member) commented on Jul 30, 2020

Cool thanks for adding the tests :)

Next step is merge master into this branch.
Not sure I understand what you did in your last commit, but it looks like you discarded all the changes from master ^^'

We've done some changes in the features logic on master, so let me know if you need help merging it.

As soon as we've merged from master, we'll have to make sure that we have extensive tests and we'll be good to go!
About the lxmert dataset, we can probably keep it for another PR as soon as we have working 2d features. What do you think ?

@thomwolf (Member)

We might want to merge this after tomorrow's release though to avoid potential side effects @lhoestq

@lhoestq (Member) commented on Jul 30, 2020

Yep I'm sure we can have it not for tomorrow's release but for the next one ;)

@eltoto1219 (Contributor, Author)

Haha, when I tried to rebase, I ran into some conflicts. In that last commit, I restored features.py from the previous commit on the branch in my fork, because upon updating to master, the pandas dtype manager and pandas extension types disappeared. If you could help me with merging in what is needed, that would help a lot.

Other than that, ya, let me go ahead and move the dataloader code out of this PR. Perhaps we could discuss in the Slack channel soon about what to do with that, because we can either just support the pretraining corpus for LXMERT or try to implement the full COCO and Visual Genome datasets (+VQA, +GQA), which I'm sure people would be pretty happy about.

Also we can talk more tests soon too when you are free.

Good luck on the release tomorrow, guys!

@lhoestq (Member) commented on Aug 20, 2020

I'll add Array3D, 4D.. tomorrow but it should take only a few lines. The rest won't change

@thomwolf (Member) left a comment

Very nice addition! I love it @eltoto1219 @lhoestq

Resolved review comment on src/nlp/arrow_dataset.py.

src/nlp/arrow_dataset.py, review comment on this hunk:

    else:
        outputs = self._data[key].to_pylist()
        outputs = self._data.to_pandas(types_mapper=pandas_types_mapper)[key].to_list()

Member: Same question here

Two more resolved review comments on src/nlp/arrow_dataset.py.
@@ -45,13 +125,12 @@ class ArrowWriter(object):

    def __init__(
        self,
        data_type: Optional[pa.DataType] = None,
Member: Good cleanup!

Resolved review comments on src/nlp/features.py (2).

    def to_numpy(self):
        numpy_arr = Array2DExtensionType._generate_flatten(self.storage, self.dims)
        numpy_arr = numpy_arr.reshape(len(self), *Array2DExtensionArray._construct_shape(self.storage))
Member: Is this zero copy?

Member:
In 2D, these two lines are equivalent to self.storage.flatten().flatten().to_numpy(zero_copy=True).reshape(...).
The zero-copy conversion from arrow to numpy is done in _generate_flatten.
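
For illustration, a standalone version of that zero-copy path (assuming list<list<double>> storage with no nulls, not the exact library code):

    import numpy as np
    import pyarrow as pa

    # two 5x5 matrices stored as list<list<double>>, analogous to the Array2D storage
    storage = pa.array([np.arange(25, dtype="float64").reshape(5, 5).tolist()] * 2,
                       type=pa.list_(pa.list_(pa.float64())))

    # flatten twice to reach the primitive values, view them as numpy without copying,
    # then reshape to (num_rows, 5, 5)
    values = storage.flatten().flatten()
    numpy_arr = values.to_numpy(zero_copy_only=True).reshape(len(storage), 5, 5)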

@@ -479,7 +667,9 @@ def generate_from_dict(obj: Any):

    if class_type == Sequence:
        return Sequence(feature=generate_from_dict(obj["feature"]), length=obj["length"])
    return class_type(**obj)

    field_names = set(f.name for f in fields(class_type))
Member: Why do we need to do this (curious)?

Member:
Feature types like Value etc. are serialized in JSON format when saving the dataset info.
If in the future we add fields to these feature types, they would still be loadable from older versions of nlp.
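
A minimal sketch of that forward-compatibility idea (using a stand-in dataclass rather than the real nlp.Value, not the exact library code):

    from dataclasses import dataclass, fields

    @dataclass
    class Value:
        dtype: str
        id: str = None

    def generate_from_dict_sketch(class_type, obj):
        # only pass the keys the dataclass actually declares, so a dict serialized
        # with extra fields by a newer version can still be deserialized
        field_names = set(f.name for f in fields(class_type))
        return class_type(**{k: v for k, v in obj.items() if k in field_names})

    # a dict carrying an extra "future_option" key still loads into Value
    value = generate_from_dict_sketch(Value, {"dtype": "float32", "future_option": True})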

Resolved review comment on tests/test_array_2d.py.
@lhoestq (Member) commented on Aug 21, 2020

I took your comments into account and I added Array[3-5]D.
I changed the storage type to fixed-size lists. I had to update the to_numpy function because of that: slicing a FixedSizeListArray returns a view of the original array, while in the previous case slicing a ListArray copies the storage.
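
A small illustration of the new storage (assumed 5x5 float32 shape, not the library's exact code):

    import pyarrow as pa

    # fixed-size list storage: each row is a 5x5 float32 matrix
    storage_type = pa.list_(pa.list_(pa.float32(), 5), 5)
    rows = [[[float(i * 5 + j) for j in range(5)] for i in range(5)] for _ in range(2)]
    arr = pa.array(rows, type=storage_type)

    # per the comment above, slicing a FixedSizeListArray returns a view over the same
    # value buffers, unlike slicing a ListArray, so to_numpy can stay zero-copy after slicing
    view = arr.slice(1, 1)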

@lhoestq requested a review from thomwolf on August 21, 2020
@thomwolf (Member) left a comment

LGTM!

    id: Optional[str] = None
    # Automatically constructed
    _type: str = field(default="Array2D", init=False, repr=False)

class _ArrayXD:
Member: XD!

]


@parameterized.named_parameters(get_array_feature_types())
Member: Nice!

@lhoestq force-pushed the support_multi_dim_tensors_for_images branch from 52ace77 to 375858c on August 24, 2020
@lhoestq merged commit c93a19e into huggingface:master on Aug 24, 2020