TF2 porting: sequence feature #682
Conversation
This looks great and seems like a solid starting point. All the shapes are correct.
While working on this, I noticed that the method signatures for input features and output features differ, and that output features are missing one of the options that input features have. I just want to confirm that this is the intended design.
Yes, it is intended. The reason is that in TF1 there was a mechanism to specify the scope of the tf variables that would be created, and users could specify that those variables be tied to variables initialized for another input. This is useful if, for instance, I have two pieces of text and want to verify whether they match: I want the two text encoders to have the same structure and weights. Now in TF2, as the encoder is an object, it's much easier to just pass the encoder object and that's it.
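The TF2 weight-tying idea described here can be sketched with a plain Keras layer. This is a minimal illustration only (not Ludwig's actual encoder classes): calling the same layer object on two inputs reuses one set of variables.

```python
import tensorflow as tf

# Illustrative sketch: in TF2, sharing an encoder object ties the weights,
# with no need for TF1-style variable scopes.
shared_encoder = tf.keras.layers.Dense(64, activation="relu")

text_a = tf.random.uniform([2, 32])
text_b = tf.random.uniform([2, 32])

# Both calls go through the same kernel and bias variables.
encoded_a = shared_encoder(text_a)
encoded_b = shared_encoder(text_b)

# Only one kernel and one bias exist, unlike two separate encoder instances.
assert len(shared_encoder.trainable_variables) == 2
```

Two separate `Dense(64)` instances would instead create four variables and learn independent representations for the two texts.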
Status update

For the past couple of days, I've spent my time reviewing documentation and prototyping code related to sequence models:
I'm planning to apply what I learned in this exercise to porting Ludwig's Sequence Decoders to utilize TFA. If you look at the code, which is provided at the end, you may notice that some of the methods' return signatures differ from what we do in Ludwig. Right now I think I can still adhere to the current return signatures by using dictionary data structures. At the moment I'm expecting no changes to Ludwig's internal processing; I'm planning to have any changes occur in the specific output feature or decoder class methods. Let me know if there is anything I need to be aware of.

Sequence Encoder/Decoder Prototype

Notes:
Prototype output
This all looks great! It can be a blueprint to use as a reference for how to implement things in Ludwig. In Ludwig there are obviously more options (attention and beam search, to name two important ones), but this is a really good starting point.
Here things may be different, as we currently expect encoders to return just one tensor. We can extend what we did for the output features' prediction methods and return a dictionary with keys instead, so that components down the road can use whatever they want from that dictionary. Actually, the previous version of Ludwig was already returning dictionaries, so most components should already work fine if we decide to go this route.
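A minimal sketch of this dictionary-returning pattern (the key names such as `encoder_output` are assumptions for illustration, not necessarily Ludwig's final keys):

```python
import tensorflow as tf

# Hypothetical encoder whose call() returns a dict instead of a single
# tensor, so downstream components can pick whichever tensor they need.
class ToyEncoder(tf.keras.layers.Layer):
    def __init__(self, state_size=8):
        super().__init__()
        self.rnn = tf.keras.layers.SimpleRNN(
            state_size, return_sequences=True, return_state=True
        )

    def call(self, inputs):
        sequence, final_state = self.rnn(inputs)
        return {
            "encoder_output": sequence,          # [batch, seq, state]
            "encoder_final_state": final_state,  # [batch, state]
        }

enc = ToyEncoder()
out = enc(tf.random.uniform([2, 5, 3]))
```

A combiner could then read `out["encoder_output"]`, while a decoder that wants to initialize its state could read `out["encoder_final_state"]`, without changing the encoder's return signature.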
In Ludwig there's already a mechanism that allows reusing the embeddings in the encoder, so that
As in Ludwig the decoder is agnostic of the encoder, it may be the case that there is no encoder state, or that the encoder state is not of the required size. In the first case, one could either initialize the state to all zeros, or initialize it to a vector of weights that are learned. The literature suggests the latter works better. We don't have to implement it from the get-go, but again, something to keep in mind.

Another consideration: your example is an example of what in Ludwig is called the generator decoder. There's also the tagger decoder (which is simpler), which has a different structure, makes different assumptions on the inputs, and provides outputs in a different way. Let me know if you want me to explain how that works, but again, it's much simpler than the generator decoder, and once we make the generator work, the tagger will be much easier. Let me know if you have any doubts and I'll be happy to clarify.
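The two initial-state options (all zeros vs. a learned vector) can be sketched like this; class and method names are illustrative only, not Ludwig's actual API:

```python
import tensorflow as tf

# Sketch of the two decoder initial-state strategies: all zeros, or a
# single trainable vector broadcast over the batch dimension.
class DecoderInitialState(tf.keras.layers.Layer):
    def __init__(self, state_size, learned=False):
        super().__init__()
        self.state_size = state_size
        self.learned = learned
        if learned:
            # One trainable vector, shared across the whole batch.
            self.initial_state = self.add_weight(
                name="initial_state", shape=[state_size], initializer="zeros"
            )

    def call(self, encoder_inputs):
        batch_size = tf.shape(encoder_inputs)[0]
        if self.learned:
            # Tile the learned vector to [batch, state_size].
            return tf.tile(self.initial_state[tf.newaxis, :], [batch_size, 1])
        return tf.zeros([batch_size, self.state_size])

zeros_state = DecoderInitialState(16)(tf.zeros([4, 7]))
learned_state = DecoderInitialState(16, learned=True)(tf.zeros([4, 7]))
```

The learned variant costs only `state_size` extra parameters, which is why it is a cheap option to add later.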
Glad you pointed this out. I was looking at the decoder code and noticed there did not appear to be a link between the encoder and decoder states, so this saved me a follow-up question. For now I'll plan to use all-zero initial states; that's the easiest to implement.
I was thinking of using a dictionary as you described. So I'll proceed with this direction.
If the tagger is simpler, I'll start with that. If I have questions, I'll reach back out. In a reply from several weeks ago, you mentioned starting to port code to use
I would assume that every encoder returns a dictionary with at least a
Yes, basically it assumes that the input is a sequence, so that the shape is
Don't remember exactly the context; in general I'm not familiar with it.

A final consideration: it could be the case that

https://www.tensorflow.org/tutorials/text/text_generation could be a valuable example; it does not use anything from
re:
Just to confirm my understanding. https://github.com/uber/ludwig/blob/212122c87930277ff6ef055dac11f9e694de53d8/ludwig/models/ecd.py#L53-L58 With this change the return from
Would you like me to take a slight pause on the sequence work and submit a new PR to implement the "encoder returns a dictionary" change for the current Numerical, Binary and Categorical encoders? This way, if you have other folks working on TF2 porting, this would establish a baseline for that work. Or should I just include these changes in this PR on the sequence feature?
I think it's more flexible, so yes, I would say it would be a good pattern to follow. The concat combiner would need to be changed accordingly too (and all the other combiners, but that can be done later).
This commit 2ebf5a5 implements use of
The Numerical, Binary and Categorical encoders and the ConcatCombiner class were updated to reflect the change. I used the recently added unit test
Looks good. Can I ask you to create another PR just for this commit? And also add the same thing for image features? This way other people working on the branch will have this updated pattern.
Maybe one question: is "accuracy" in the table below "row-wise accuracy"?
Yes. This requires another little bit of explanation (I guess eventually we can collect all these explanations and put them in the developer's guide :) ).

Now, the way I previously implemented this mechanism was to collect these vectors of 1s and 0s for each batch, concatenate them for the length of the whole set, perform the and operation with these vectors from all the output features, and finally sum and divide. That's why in the sequence feature you see those

Finally, the reason why the name
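The collect-and-combine computation described above can be shown with a tiny NumPy sketch (the feature names and values are made up for illustration):

```python
import numpy as np

# Per-feature vectors of 1s/0s over the whole set, where 1 means that
# datapoint was predicted entirely correctly for that output feature.
feature_a_correct = np.array([1, 0, 1, 1, 0])  # e.g. a sequence feature
feature_b_correct = np.array([1, 1, 1, 0, 0])  # e.g. a category feature

# AND across output features, then sum and divide by the dataset size:
# a datapoint counts only if it is correct on every output feature.
combined_correct = np.logical_and(feature_a_correct, feature_b_correct)
combined_accuracy = combined_correct.sum() / combined_correct.size
# 2 of the 5 datapoints are correct on both features -> 0.4
```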
This is what I understand: "accuracy" in the table is the concept of "combined accuracy". For the moment, this is out of scope. Of the remaining values in the table, we have "loss" (or combined loss). I pushed commits for "last_accuracy" and "perplexity". I'm now going to focus on "token accuracy" and "edit distance".

I was looking at the function

Right now the Ludwig data generator for sequences always returns a sequence of

In looking at

Am I correct in my understanding of how I could generate variable-length sequences? If this is true, then I can be more robust in my testing, and we can use this to enhance the unit tests for Ludwig.
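One simple way to produce the kind of variable-length test data discussed here is a small generator like the following (a hypothetical sketch for unit testing, not Ludwig's actual data generator; it assumes 0 is the padding id):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate a [batch, max_len] integer batch where each row has a random
# true length and the remainder is zero padding.
def make_padded_batch(batch_size, max_len, vocab_size):
    lengths = rng.integers(1, max_len + 1, size=batch_size)
    batch = np.zeros((batch_size, max_len), dtype=np.int64)
    for i, n in enumerate(lengths):
        # Token ids start at 1 so that 0 is reserved for padding.
        batch[i, :n] = rng.integers(1, vocab_size, size=n)
    return batch, lengths

batch, lengths = make_padded_batch(4, 6, 10)
```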
feat: add determination for tf dtype on predictions
refactor: api fix up and code clean-up
This commit de7b133 contained these changes:
I added some code that could be used as a basis to handle this kind of situation in the `SequenceOutputFeature.__init__()` method:
We need to use either then in the
After completing this work, I thought more about it. Do you think

The next drop should include the remaining accuracy metrics.
Accuracy in the table of a sequence feature is row-wise accuracy.
let's keep in mind we want to reintroduce them later on.
Correct.
I believe this is incorrect. The logic of the contrib Decoder I was using in TF1 is that it keeps on generating until an
If all elements of the batch reach
In the code you'll find various pieces where I do manipulation and fill zeros to make sure the lengths of the true sequences and the predicted sequences are the same, and then I mask the loss to avoid considering
Yes.
That is fine, but consider that there are a bunch of tricky things here. In particular, the tagger decoder and the generator decoder behave quite differently in this regard.

The tagger expects an embedded sequence of fixed length as input (including padding 0 vectors), produces a sequence of fixed length (including padding 0s), and compares it with the ground truth sequence of the same fixed length (including padding 0s) to obtain a loss for each element of the sequence; it then masks this vector of losses, figuring out where the padding is. There is no

The generator, on the other hand, can have all sorts of lengths for the reasons I explained before, and also has

So, to answer your questions: you should test the tagger with sequences of exactly the same length; we should actually add a value error / assert in the tagger if the lengths of the inputs and the ground truth don't match. For testing the generator, yes, you want to have sequences of different lengths.
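The suggested value error / assert for the tagger could look like this (function and shape conventions are hypothetical, for illustration only):

```python
import tensorflow as tf

# Sketch of a length check for a tagger-style decoder: inputs are
# [batch, seq, hidden], targets are [batch, seq], and both sequence
# lengths must match exactly (padding included).
def check_tagger_lengths(inputs, targets):
    if inputs.shape[1] != targets.shape[1]:
        raise ValueError(
            "Tagger decoder requires inputs and targets with the same "
            f"sequence length, got {inputs.shape[1]} and {targets.shape[1]}"
        )

# Matching lengths pass silently.
check_tagger_lengths(tf.zeros([2, 5, 8]), tf.zeros([2, 5]))
```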
There is already a function in Ludwig to do that:
Got it. Now we also have to make sure that when we are calculating the metrics / losses the datatype makes no difference. In the docs for
Unfortunately it can actually happen. Think about classifying (or, conversely, embedding, on the encoder side) the users of a social network platform. I actually had to deal with something that was on the verge of int32 already. Also, this being only the predictions, it's not a huge cost to pay in terms of memory (it's just a rank-2 tensor, not a rank-3 tensor, which would be much more expensive).
from this guidance
I'm going to follow that approach; it is simpler. This commit def27bd does this.

re: sequence lengths... In the earlier post, I wanted to point out that in TF2 the determination of sequence lengths occurs in different components than in TF1. In TF1 the sequence lengths are computed in the sequence decoder. In TF2, sequence lengths will probably have to be computed in several different places, i.e., the loss and metric functions. For now I'll focus on finishing computing all the required metrics with fixed-length sequences. We can come back later to deal with variable-length sequences.
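Since lengths may need to be recomputed in several places in TF2 (losses, metrics), a small shared helper is a natural fit. As an illustration only (Ludwig has its own helper for this; the name and the assumption that 0 is the padding id are mine):

```python
import tensorflow as tf

# Recover per-row sequence lengths from a padded [batch, seq] tensor,
# assuming 0 is the padding id.
def sequence_lengths(sequences):
    used = tf.cast(tf.not_equal(sequences, 0), tf.int32)
    return tf.reduce_sum(used, axis=1)

lengths = sequence_lengths(tf.constant([[5, 2, 7, 0, 0],
                                        [1, 0, 0, 0, 0]]))
# -> [3, 1]
```

Any loss or metric that needs masking can then call this one helper instead of duplicating the length logic.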
Sounds good. Let's make sure the same thing happens in the categorical feature too.
That's fine. I believe we can have an additional tensor returned by the
…5802/ludwig into tf2_sequence_feature
…wig into tf2_sequence_feature
I think we can merge what we have here, as the sequence input feature already works with the embed encoder. I just ported the parallelcnn encoder too (yet to be thoroughly tested). And if we merge, I can work on the other encoders while you keep working on the decoders.
Sounds good
An update on the sequence feature porting: I ported all the encoders and cleaned up a lot of the code underneath. The encoding part of the sequence features can be considered done.
Great... I've been looking at the next set of work for the decoders. From what I can tell, I have to retrofit attention back into the Tagger decoder. Right now I'm looking over the TFA attention APIs and trying to map out how to implement them. I may have a few questions in the next day or two. Once I'm done with that, I'll convert the Generator decoder. Once I have something to post, I'll create a new PR. Making progress.
Attention is not needed for the Tagger anyway; it's only needed for the generator. Again, the tagger should be very simple to deal with; the generator is the more complicated one.
OK. Before moving on to the generator decoder, I just want to make sure I accounted for the functionality you want in the Tagger. The reason I asked about attention is that I saw this code when I started the work:

At this time, these

I have not explicitly ported this over as well:

Finally, this Tagger functionality was replaced by a single

If this is correct, I'll start looking at the Generator decoder.
Oh, I see now; I completely forgot about it. It was an experimental thing anyway; I plan to reintroduce it in the future. Please add a todo here saying "todo tf2 add feed forward attention".
where hidden is a
For the timeseries, it's very simple: the dense output dimension should just be 1, no more than that.
That's correct. As we discussed, under the hood what happens is reshape-matmul-reshape, which is slightly more complex than what happens in the category classifier decoder case, but
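The reshape-matmul-reshape idea can be seen directly in TF2, where a Dense layer applied to a rank-3 tensor projects each timestep independently (shapes below are illustrative):

```python
import tensorflow as tf

# Per-timestep projection, as in the tagger decoder: Keras Dense on a
# rank-3 input is effectively reshape-matmul-reshape under the hood.
hidden = tf.random.uniform([2, 5, 8])   # [batch, seq, hidden]
num_classes = 4
projection = tf.keras.layers.Dense(num_classes)
logits = projection(hidden)             # [batch, seq, num_classes]

# For a timeseries decoder, the output dimension would simply be 1,
# giving one real value per timestep.
```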
Sounds great! If you want, you can do a PR just for the Tagger stuff and then another one for the Generator.
This is the start of porting the sequence feature to TF2. There is still more to be done. Current state of code:

- Added stubs in `sequence_feature.py`. These will be filled in as work progresses.
- Implemented `SequencePassthroughEncoder` and `SequenceEmbedEncoder` as subclasses of `tf.keras.layers.Layer`. Integrated existing functions to work with the Layer subclass.

From a small test both encoders work. The test passed an `input` array of shape [2, 5] through the `SequencePassthroughEncoder` and `SequenceEmbedEncoder`. Results of the test:

- `reduce_output=None` resulted in an output tensor with shape [2, 5, 1]
- `embedding_size=3, reduce_output='sum'` resulted in an output tensor with shape [2, 3]
- `embedding_size=3, reduce_output='None'` resulted in an output tensor with shape [2, 5, 3]

Does this look correct?

Next steps for me are to get a working decoder and `full_experiment()` to complete successfully for a simple model.

This is the simple test case and its output.

Test output