Adding support for generic multi-dimensional tensors and auxiliary image data for multimodal datasets #363
Conversation
Really cool! I left some comments in the code.
This is going in the right direction, great job!
If I can help you with any aspects of Apache Arrow, feel free to contact me (here on GitHub or on Slack).
Also, should we make this PR a "draft PR" until things are ready to be merged?
Thank you! I just marked this as a draft PR. It would probably be better to create specific Array2D and Array3D classes as needed instead of a generic MultiArray for now; it should simplify the code a lot too, so I'll update it as such. Also, I meant to reply earlier, but I wanted to thank you for the testing script you sent me, since it ended up being tremendously helpful.
Force-pushed from 145feb5 to 9072e05 (Compare)
Okay, I just converted the MultiArray class to Array2D and got rid of all those "globals()"! The main issue I had was that when including a pa.ExtensionType as a column, the ordinary methods for batching the data would not work and threw a mysterious error, so I first cleaned up my code to order each row to match the schema (because when including extension types, the row comes back disordered), then made each row a pa.Table and concatenated all the tables. Also, each n-dimensional vector class we implement will be size-invariant, which is good news.
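To make that workaround concrete, here is a minimal sketch (my own illustration, not the PR's actual code) of building one single-row pa.Table per example with columns reordered to match the schema, then concatenating; the helper name rows_to_table and the example schema are hypothetical:

    import pyarrow as pa

    def rows_to_table(rows, schema):
        # Hypothetical helper: one single-row table per example, columns
        # reordered to match the schema, then one concatenation at the end.
        tables = []
        for row in rows:
            arrays = [
                pa.array([row[name]], type=schema.field(name).type)
                for name in schema.names
            ]
            tables.append(pa.Table.from_arrays(arrays, schema=schema))
        return pa.concat_tables(tables)

    schema = pa.schema([("id", pa.int64()), ("embedding", pa.list_(pa.float32()))])
    table = rows_to_table(
        [{"embedding": [0.1, 0.2], "id": 0}, {"id": 1, "embedding": [0.3, 0.4]}],
        schema,
    )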
Force-pushed from ed53c8b to b554845 (Compare)
That's clean! It looks like we're getting closer and closer to something very cool here :)
And thanks for changing MultiArray to Array2D.
Could you give more details about the error you got with the original code in write_on_file?
Also I left some comments (mostly nitpicking).
Force-pushed from b088aa8 to 0cbcb9b (Compare)
Okay awesome! I just added your suggestions and changed up my recursive functions. Here is the traceback for when I use the original code in the write_on_file method:
I think that when trying to cast an extension array within a list of dictionaries, some method gets called that bugs out Arrow, and somehow it doesn't get called when adding a single row to a table and then appending multiple tables together. I tinkered with this for a while but could not find any workaround. In case this new method causes bad compression or worse performance, we can explicitly set the batch size in the pa.Table.to_batches(batch_size) method, which returns a list of batches. Perhaps we can check that the batch size is not too large by converting the table to batches after every X rows are appended to it, following the batch_size check below.
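A hedged sketch of that safeguard (in recent pyarrow the keyword is max_chunksize; older releases used chunksize — the threshold and helper name here are illustrative):

    import pyarrow as pa

    MAX_ROWS_PER_BATCH = 1000  # illustrative threshold

    def rechunk(table: pa.Table):
        # Collapse the many single-row chunks produced by repeated table
        # concatenation, then split back into bounded-size record batches.
        return table.combine_chunks().to_batches(max_chunksize=MAX_ROWS_PER_BATCH)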
Indeed that's weird.
The argument of … We can fix that just by doing … Do you still have errors that need to be fixed?
@lhoestq Nope, all should be good! Would you like me to add the entries.combine_chunks().to_batches(batch_size) code + benchmark?
Awesome :) I think it would be good to start adding some tests then.
That would be interesting. We don't want reading/writing to be the bottleneck of dataset processing, for example in terms of speed. Maybe we could test the write + read speed on different datasets:
What do you think?
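One possible shape for such a benchmark, using the standard-library timeit module (the writer_factory interface is assumed, loosely modeled on an ArrowWriter-style write/finalize API):

    import timeit

    def benchmark_write(writer_factory, examples, repeats=3):
        # Time a full write of `examples` and keep the best of `repeats` runs.
        def run():
            writer = writer_factory()
            for example in examples:
                writer.write(example)
            writer.finalize()
        return min(timeit.repeat(run, number=1, repeat=repeats))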
Well actually, it looks like we're still having the …
I just tested your code to try to understand it better.

    def to_pylist(self):
        return self.to_numpy().tolist()

[EDIT] I changed the reshape step to numpy_arr = numpy_arr.reshape(len(self), *ExtensionArray2D._construct_shape(self.storage)) and it did the job:
Maybe you could add me to your repo so I can open a PR to add these changes to your branch?
Ya! That should be no problem at all. I'll use the timeit module and get back to you with the results sometime over the weekend.
Thank you for all your help getting the pandas and row indexing for the dataset to work! For …
I created the PR :)
Force-pushed from db72b56 to 81f120c (Compare)
Sorry for the bit of delay! I just added the tests, the PR into my fork, and some speed tests. It should be fairly easy to add more tests if we need them. Do you think there is anything else to check out?
Cool, thanks for adding the tests :) The next step is to merge master into this branch. We've made some changes to the features logic on master, so let me know if you need help merging it. As soon as we've merged from master, we'll have to make sure that we have extensive tests and we'll be good to go!
We might want to merge this after tomorrow's release though, to avoid potential side effects @lhoestq
Yep, I'm sure we can have it not for tomorrow's release but for the next one ;)
Haha, when I tried to rebase I ran into some conflicts. In that last commit, I restored features.py from the previous commit on the branch in my fork, because upon updating to master, the pandas dtype manager and pandas extension types disappeared. If you could help me with merging in what is needed, that would help a lot. Other than that, let me go ahead and move the dataloader code out of this PR. Perhaps we could discuss in the Slack channel soon about what to do with it, because we can either just support the pretraining corpus for LXMERT or try to implement the full COCO and Visual Genome datasets (+VQA +GQA), which I'm sure people would be pretty happy about. Also, we can talk more about tests soon too, when you are free. Good luck on the release tomorrow, guys!
Force-pushed from 81f120c to 845a461 (Compare)
I'll add Array3D, 4D... tomorrow, but it should only take a few lines. The rest won't change.
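If the Array2D dataclass visible in one of the review diffs below is the template, each extra dimensionality is presumably just a parallel dataclass; a purely illustrative sketch (the real Array3D in the PR may differ):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Array3D:
        id: Optional[str] = None
        # Automatically constructed
        _type: str = field(default="Array3D", init=False, repr=False)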
Very nice addition! I love it @eltoto1219 @lhoestq
src/nlp/arrow_dataset.py
Outdated

         else:
-            outputs = self._data[key].to_pylist()
+            outputs = self._data.to_pandas(types_mapper=pandas_types_mapper)[key].to_list()
Same question here
@@ -45,13 +125,12 @@ class ArrowWriter(object):

    def __init__(
        self,
        data_type: Optional[pa.DataType] = None,
Good cleanup!
src/nlp/features.py
Outdated

    def to_numpy(self):
        numpy_arr = Array2DExtensionType._generate_flatten(self.storage, self.dims)
        numpy_arr = numpy_arr.reshape(len(self), *Array2DExtensionArray._construct_shape(self.storage))
Is this zero copy?
In 2D, these two lines are equivalent to self.storage.flatten().flatten().to_numpy(zero_copy=True).reshape(...).
The zero-copy conversion from arrow to numpy is done in _generate_flatten.
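As a standalone illustration of that zero-copy path (not the PR's code; note that pyarrow's actual keyword is zero_copy_only):

    import pyarrow as pa

    # A list<list<int64>> array holding two 2x3 matrices.
    arr = pa.array([
        [[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [10, 11, 12]],
    ])

    # Each .flatten() strips one list level; the final primitive array can
    # then be viewed as numpy without copying.
    flat = arr.flatten().flatten()
    np_arr = flat.to_numpy(zero_copy_only=True).reshape(len(arr), 2, 3)
    assert np_arr.shape == (2, 2, 3)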
@@ -479,7 +667,9 @@ def generate_from_dict(obj: Any):

     if class_type == Sequence:
         return Sequence(feature=generate_from_dict(obj["feature"]), length=obj["length"])
-    return class_type(**obj)
+
+    field_names = set(f.name for f in fields(class_type))
Why do we need to do this (curious)?
Feature types like Value etc. are serialized in JSON format when saving the dataset info.
If in the future we add fields to these feature types, they would still be loadable by older versions of nlp.
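A simplified, self-contained sketch of why filtering on fields(class_type) buys that forward compatibility (the Value dataclass here is a stand-in, not the library's full definition):

    from dataclasses import dataclass, field, fields
    from typing import Optional

    @dataclass
    class Value:
        dtype: str
        id: Optional[str] = None
        _type: str = field(default="Value", init=False, repr=False)

    def from_dict_sketch(obj: dict) -> Value:
        obj = dict(obj)
        obj.pop("_type", None)  # upstream, this key selects the class
        # Drop any keys the current dataclass doesn't declare, so JSON
        # written by a newer version (with extra fields) still loads.
        known = {f.name for f in fields(Value) if f.init}
        return Value(**{k: v for k, v in obj.items() if k in known})

    print(from_dict_sketch({"dtype": "int64", "_type": "Value", "new_field": 1}))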
I took your comments into account and I added Array[3-5]D.
LGTM!
    id: Optional[str] = None
    # Automatically constructed
    _type: str = field(default="Array2D", init=False, repr=False)

class _ArrayXD:
XD!
]

@parameterized.named_parameters(get_array_feature_types())
Nice!
Force-pushed from 52ace77 to 375858c (Compare)
nlp/features.py:
The main factory class is MultiArray; every time this class is called, a corresponding pyarrow extension array and type class is generated (and added to the list of globals for future use) for a given root data type and set of dimensions/shape. I provide examples of working with this in datasets/lxmert_pretraining_beta/test_multi_array.py.
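As a rough illustration of that factory pattern (my own sketch against the modern pyarrow ExtensionType API, not the PR's implementation), generating and caching one extension type per dtype/shape combination might look like:

    import pyarrow as pa

    _GENERATED_TYPES = {}

    def multi_array_type(dtype: pa.DataType, shape: tuple) -> pa.ExtensionType:
        # Hypothetical factory: one cached ExtensionType subclass per
        # (dtype, shape), with one nested list level per dimension.
        key = (str(dtype), shape)
        if key not in _GENERATED_TYPES:
            storage = dtype
            for _ in shape:
                storage = pa.list_(storage)

            class TensorType(pa.ExtensionType):
                def __init__(self):
                    super().__init__(storage, f"multi_array<{dtype}>{shape}")

                def __arrow_ext_serialize__(self):
                    return b""

                @classmethod
                def __arrow_ext_deserialize__(cls, storage_type, serialized):
                    return cls()

            _GENERATED_TYPES[key] = TensorType
        return _GENERATED_TYPES[key]()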
src/nlp/arrow_writer.py:
I had to add a method for writing batches that include extension array types, because despite having a unique class for each multidimensional array shape, pyarrow is unable to write any other "array-like" data class to a batch object unless it is of type pyarrow.ExtensionType. The problem with this is that when writing multiple batches, the order of the schema and the data to be written get mixed up (the pyarrow datatype in the schema only shows up as ExtensionArray, but each ExtensionArray subclass has a different shape)... possibly I am missing something here and would be grateful if anyone else could take a look!
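Continuing the hypothetical multi_array_type sketch above, wrapping plain nested-list storage in the extension type and writing it against a schema that carries that type could look like this (again an illustration, not the PR's write_on_file code):

    import pyarrow as pa

    ext_type = multi_array_type(pa.float32(), (2, 2))  # from the sketch above

    # Wrap plain nested-list storage in the extension type, then build a
    # record batch against a schema that declares that type.
    storage = pa.array(
        [[[1.0, 2.0], [3.0, 4.0]]],
        type=pa.list_(pa.list_(pa.float32())),
    )
    ext_arr = pa.ExtensionArray.from_storage(ext_type, storage)
    schema = pa.schema([pa.field("matrix", ext_type)])
    batch = pa.record_batch([ext_arr], schema=schema)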
datasets/lxmert_pretraining_beta/lxmert_pretraining_beta.py & datasets/lxmert_pretraining_beta/to_arrow_data.py:
I have begun adding the data from the original LXMERT paper (https://arxiv.org/abs/1908.07490), hosted here: https://github.com/airsplay/lxmert. The reason I am not pulling from the source of truth for each individual dataset is that it seems there will also need to be functionality for aggregating multimodal datasets to create a pre-training corpus (:sleepy:).
For now, this is just being used to test and run edge cases for the MultiArray feature, so I've labeled it as "beta_pretraining"!
(Still working on the pretraining; I just wanted to push out the new functionality sooner rather than later.)