TF2 porting: category feature #667
Conversation
refactor: comment out legacy code to be cleaned up later
This commit (cd78001) adds the changes above.

From my perspective, I plan to do the following to finish the category feature:
Let me know if I missed anything.
Great stuff!
There are a bunch of things that should be tested here:
- all parameters of the embed encoder (pretrained embeddings etc)
- all parameters of the loss
Maybe adding some unit tests here would be useful.
Also, after we get it to work entirely, it will be a good idea to do two things:
- make `Embed` more TF2-esque (I will study how TF2 embedding layers work to see if it's a good idea or not)
- unpack the weighted cross entropy loss into a bunch of different losses, one for the sampled case and one for the weighted case (as the one-hots are required only in the weighted case and slow things down quite a bit when there are a lot of classes); see the sketch after this comment
I think this is a good occasion for also improving the code structure here, what do you think?
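Just to make that split concrete, here is a rough sketch of what the two separate losses could look like. The class names, constructor parameters, and the choice of subclassing `tf.keras.losses.Loss` are all assumptions for illustration, not the actual Ludwig API:

```python
import tensorflow as tf


# Sampled case: integer labels, no one-hot materialization; the loss needs the
# decoder weights/biases and the last hidden representation instead of logits.
class SampledSoftmaxCrossEntropyLoss(tf.keras.losses.Loss):
    def __init__(self, decoder_weights, decoder_biases, num_classes, negative_samples=25):
        super().__init__()
        self.decoder_weights = decoder_weights
        self.decoder_biases = decoder_biases
        self.num_classes = num_classes
        self.negative_samples = negative_samples

    def call(self, y_true, hidden):
        return tf.nn.sampled_softmax_loss(
            weights=self.decoder_weights,
            biases=self.decoder_biases,
            labels=tf.expand_dims(tf.cast(y_true, tf.int64), -1),
            inputs=hidden,
            num_sampled=self.negative_samples,
            num_classes=self.num_classes,
        )


# Weighted case: one-hot labels are built only here, so the sampled path never
# pays that cost when the number of classes is large.
class WeightedSoftmaxCrossEntropyLoss(tf.keras.losses.Loss):
    def __init__(self, num_classes, class_weights):
        super().__init__()
        self.num_classes = num_classes
        self.class_weights = tf.constant(class_weights, dtype=tf.float32)

    def call(self, y_true, logits):
        y_true = tf.cast(y_true, tf.int32)
        one_hot = tf.one_hot(y_true, depth=self.num_classes)
        weights = tf.gather(self.class_weights, y_true)
        return weights * tf.nn.softmax_cross_entropy_with_logits(labels=one_hot, logits=logits)
```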
ludwig/features/category_feature.py
```python
def _setup_loss(self):
    self.train_loss_function = SoftmaxCrossEntropyLoss(
        num_classes=self.num_classes,
        feature_loss=self.loss,
```
Maybe here we can have all the parameters unpacked, and pass `**self.loss` to the call instead of the dictionary.
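For illustration, the call site above could then become roughly the following (the exact keys inside `self.loss`, and therefore the resulting keyword arguments, are an assumption):

```python
def _setup_loss(self):
    self.train_loss_function = SoftmaxCrossEntropyLoss(
        num_classes=self.num_classes,
        **self.loss,  # the dictionary's entries become keyword arguments
    )
```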
```python
def __init__(
        self,
        num_classes=0,
        feature_loss=None,
```
Unpack all the needed parameters from the feature loss into separate parameters.
```python
def __init__(
        self,
        num_classes=0,
        feature_loss=None,
```
Also here, unpack the parameters.
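A sketch of what the unpacked signature could look like; the parameter names below are assumptions based on typical categorical loss options, not necessarily the exact keys in Ludwig's loss dictionary:

```python
class SoftmaxCrossEntropyLoss:
    def __init__(
        self,
        num_classes=0,
        class_weights=None,      # instead of feature_loss['class_weights']
        sampled=False,           # instead of feature_loss['sampled']
        negative_samples=25,     # instead of feature_loss['negative_samples']
        sampler=None,            # instead of feature_loss['sampler']
        **kwargs
    ):
        self.num_classes = num_classes
        self.class_weights = class_weights
        self.sampled = sampled
        self.negative_samples = negative_samples
        self.sampler = sampler
```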
re: these two comments
I understand this to mean making the loss dictionary parameter more explicit in the function signature, as in the case of https://github.com/uber/ludwig/blob/126c1ca4df0c2f7a121cf68dadab01309f2fae43/ludwig/models/modules/loss_modules.py#L330. Is this the correct interpretation of the guidance?
One more question re: the unit tests.
Is the focus of the 'unit test' on the loss function itself, i.e., that it returns correct values, or on something else?
I'm thinking you mean the former ('returns correct values'), but I wanted to confirm. In this case, do you have favorite test cases that I can use to bootstrap this work?
Yes.
So, I would imagine a test to check that the different losses (sampled, non-sampled, weighted, non-weighted) work for different parameters, ideally producing the correct values. I don't have specific tests, but one thing I usually did when implementing that functionality was to take a dataset with a big output vocabulary, say 10000, and compare computation times and loss values between the regular softmax and the sampled one. Ideally the value of the loss is not too distant (although it will certainly be different, in particular at the beginning), while the sampled one should be much faster. Not sure this helps, but it's a good way to at least assess whether the sampled softmax is running correctly.
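For reference, a minimal standalone sketch of that sanity check (vocabulary size, batch size, hidden size, and number of negative samples are arbitrary choices here):

```python
import time

import tensorflow as tf

num_classes, batch_size, hidden_size = 10000, 128, 256
weights = tf.Variable(tf.random.normal([num_classes, hidden_size]))
biases = tf.Variable(tf.zeros([num_classes]))
hidden = tf.random.normal([batch_size, hidden_size])
labels = tf.random.uniform([batch_size], maxval=num_classes, dtype=tf.int64)

# Full softmax: project onto all classes, then cross entropy on integer labels.
start = time.time()
logits = tf.matmul(hidden, weights, transpose_b=True) + biases
full_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
full_time = time.time() - start

# Sampled softmax: only a small set of negative classes is evaluated per batch.
start = time.time()
sampled_loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=weights, biases=biases,
        labels=tf.expand_dims(labels, -1),  # expects shape [batch, num_true]
        inputs=hidden, num_sampled=64, num_classes=num_classes))
sampled_time = time.time() - start

print('full softmax:    loss={:.4f} time={:.4f}s'.format(float(full_loss), full_time))
print('sampled softmax: loss={:.4f} time={:.4f}s'.format(float(sampled_loss), sampled_time))
```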
I have a few questions on the sampled softmax loss functions:
To be honest, I implemented that part more than 2 years ago and I don't remember all the details. What I believe to be true is the following:
Yes, they serve the same purpose, but some functions (specifically the weighted losses, for instance) need the labels as one-hots, while others, like the sampled softmax I believe, need them as integers; that's why you see both used.
For categorical, it is always one. For other types of features (sets, for instance) it may be more than one, but for now don't worry and consider it to be 1. If you need it, I can give you a general explanation of the intuition behind sampled softmax, as it seems it can be confusing; it may help you understand better what's going on with that loss.
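To illustrate the two label formats being discussed (a tiny standalone example, not Ludwig code):

```python
import tensorflow as tf

num_classes = 10
int_labels = tf.constant([3, 7, 0, 2], dtype=tf.int64)       # integer labels, shape [batch]
one_hot_labels = tf.one_hot(int_labels, depth=num_classes)   # one-hot labels, shape [batch, num_classes]

# The weighted losses consume the one-hot form (so per-class weights can be applied),
# while tf.nn.sampled_softmax_loss wants integer labels of shape [batch, num_true],
# with num_true = 1 for a category feature.
sampled_softmax_labels = tf.expand_dims(int_labels, -1)       # shape [batch, 1]
```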
Yes they are weights and biases from the
Thank you for the clarifications. I understand the points above. And thank you for the pointer; I believe these will retrieve the weights and biases, respectively, so I'll proceed as described above.
I just noticed a potential naming conflict with the sampled softmax loss function.

Case 1: the current code invokes it in one place.
Case 2: there is a second, unrelated use of the same name.

This conflict surfaced because of #667 (comment), so the elements in the model definition are involved. I'm planning to resolve this conflict by changing one of the names. Let me know if I'm missing something.
Sounds perfect.
I'm closer to getting sampled softmax working. The current hurdle involves one particular call; the issue is in line 216. As I understand this part, and from looking over the TF2 code base, I believe that to complete the sampled softmax function we have to return an additional value. I wanted to run this by you before making any changes. I also noticed at least one other person has started contributing to the TF2 code base, so if a change has to be made, I wanted to coordinate it with you to minimize merge conflicts.
…format for predictions
The last set of commits fixes the numeric output feature issue. The resolution was to wrap the Keras MSE metric class with a Ludwig-specific wrapper class to extract the required values (a rough sketch of this wrapper pattern is included after this comment). Summary of what has been adapted to support the new prediction format:
If the approach proposed in #667 (comment) is reasonable, I can do the retrofit of the
I'll wait to hear the verdict before continuing the work.
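As a concrete illustration of the wrapper idea mentioned above, here is a rough sketch; the class name and the assumed dictionary layout of the predictions are mine, not the actual Ludwig implementation:

```python
import tensorflow as tf


class MSEMetricWrapper(tf.keras.metrics.MeanSquaredError):
    """Adapts the Keras MSE metric to a dictionary-based prediction format."""

    def update_state(self, y_true, predictions, sample_weight=None):
        # `predictions` is assumed to be a dict such as {'predictions': ..., 'logits': ...};
        # only the tensor the Keras metric actually needs is passed through.
        return super().update_state(
            y_true, predictions['predictions'], sample_weight=sample_weight)
```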
Ok, I checked everything. I approve of this design :) Great job! One minor naming thing aside, you can move on and adapt this to binary and numerical. Once that is done, we can look at the encoding side, to see if there's a way to use the TF2 Embedding layer or to transform Ludwig's embedding mechanism into a TF2 Layer (which would be useful for serialization and deserialization purposes), and we should test the loading and assignment of pre-trained embeddings carefully.
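One possible shape for that, purely as an illustration (the class name and constructor parameters are assumptions, not a proposal for the exact Ludwig API): an embedding encoder written as a `tf.keras.layers.Layer` built on `tf.keras.layers.Embedding`, so that serialization and pre-trained weight assignment go through standard Keras machinery.

```python
import tensorflow as tf


class EmbedLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_size=256, pretrained_embeddings=None,
                 embeddings_trainable=True, dropout=0.0, **kwargs):
        super().__init__(**kwargs)
        if pretrained_embeddings is not None:
            # hand pre-trained vectors to the standard Keras initializer hook
            initializer = tf.keras.initializers.Constant(pretrained_embeddings)
        else:
            initializer = 'uniform'
        self.embed = tf.keras.layers.Embedding(
            vocab_size, embedding_size,
            embeddings_initializer=initializer,
            trainable=embeddings_trainable)
        self.dropout = tf.keras.layers.Dropout(dropout)

    def call(self, inputs, training=None):
        return self.dropout(self.embed(inputs), training=training)
```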
Thank you. I understand my next steps are:
Oops... just a point of clarification: I may have had a typo in my earlier response. What I've been using as 'last_hidden' you want renamed to 'final_hidden'.
Question on coding convention: is there a preference for which form to use?
tl;dr: do as you please. Since the beginning I've been using
Status update:
Re: test other samplers. While I was working on this, I noticed there was no existing unit test for simple features, e.g., numerical, binary, and category features. So I decided to add a new test harness.
Example usage:
The above example specifies two test runs
In the case of the categorical feature, here are the test cases for each sampler.
Here is the pytest log for the
re: the category feature sampler tests. These tests show the code runs to completion without issue. I still have to look at how to test whether the correct values are computed. What do you think?
This all sounds great, thank you! Adding @msaisumanth to the discussion here, as he was the one who implemented most of the tests, to give an opinion and suggestions, but it all looks good to me.
Sounds good. I'll hold off on pushing any other changes to this PR until I hear back. In the meantime, I can work offline on more robust testing of the category feature and sampled softmax. re: your suggestion:
I understand the overall procedure to be:
Some questions on details:
Given the above choices, I'm thinking the comparison should be the first one, loss per batch, because that loss drives the gradient calculation.
I figured I should push this fix before the merge. While working on the sampled softmax testing, I noticed that the values stored in
This commit 73ef884 fixes the above issue. Format of data written to
A general consideration that probably answers all the questions: the reason I didn't add a test like this to the suite before is that it will be really slow compared to most other tests. As you noticed, having 10000 categories means that there have to be at least 10000 datapoints, and if you turn on stratified sampling in the synthesizer you'll get 1 sample per class, which is not great for comparing loss values; you are probably better off with a greater number, say 100000. Now, generating data and training a model on 100000 datapoints may take minutes, and it could be too expensive for a regular CI task.

Moreover, as you noticed, the idea of a similar loss or similar accuracy is very fuzzy: the range of what is acceptable may vary depending on the number of categories and the number of negative samples (in theory, with 10000 categories and 9999 negative samples the losses should be identical, for instance), so it's really difficult to give you a precise epsilon that is always acceptable, and also a precise set of parameters. The number of epochs may also vary depending on data size and number of classes. This is another reason why I didn't include it in my suite of tests and just used it as a sanity check.

Given this, I don't know if there's a mechanism in pytest to add a test case and tag it so that it is ignored but still available for people to run manually; that would definitely be useful. A possible alternative is to use a dataset with a relatively wide set of classes and see what the performance of sampled vs. non-sampled is in terms of loss and accuracy trends. This will not allow us to test the speed aspect (with, say, 20 classes, the speed advantage of the sampled softmax will not show up), but it will give us confidence that the mechanism works overall. A possible example dataset for this could be the ATIS intent classification task, but it would require us to have the sequence encoders already ported to TF2.
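On the pytest question: there is such a mechanism, namely custom markers. A minimal sketch (the marker name `slow` is an arbitrary choice):

```python
import pytest


# "slow" is an arbitrary marker name; register it under `markers` in pytest.ini /
# setup.cfg so pytest does not warn about an unknown marker.
@pytest.mark.slow
def test_sampled_softmax_sanity_check():
    # the expensive sampled-vs-full softmax comparison described above would go here
    ...
```

CI can then deselect it with `pytest -m "not slow"`, while it remains available to run manually with `pytest -m slow`.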
Thank you for the response. I have a related question. When running with
Have you seen this message before? I've done a search on 'gradients do not exist for variables'. I found a few references on Stack Overflow and in TensorFlow's GitHub issues. Based on my understanding, none seemed to point to a solution. I've tried different settings for
No, I've never seen it; it could be a new TF2 thing. It partially makes sense, as only parts of those variables should receive gradients when using sampled softmax (only the rows corresponding to the positive label and the negative sampled ones at each batch), but if they don't receive gradients at all, that's problematic. Does it happen always or intermittently?
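A generic way to see exactly which variables end up with `None` gradients (which is what triggers that warning) is to inspect the tape output directly. A small sketch, with the model, loss function, and batch passed in as hypothetical arguments:

```python
import tensorflow as tf


def report_missing_gradients(model, loss_fn, inputs, targets):
    """Print the names of trainable variables that received no gradient."""
    with tf.GradientTape() as tape:
        outputs = model(inputs, training=True)
        loss = loss_fn(targets, outputs)
    gradients = tape.gradient(loss, model.trainable_variables)
    for variable, gradient in zip(model.trainable_variables, gradients):
        if gradient is None:
            print('no gradient for:', variable.name)
```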
@jimthompson5802 The tests look great! Thanks a lot |
So I'm merging this.
Thank you for the merge. Which one would you like me to focus on next? And how do you want to keep track of the other one so it does not fall through the cracks?
We can use this column to keep track: https://github.com/uber/ludwig/projects/1#column-8037588 . I will update it with the 2 items we discussed. I believe the TF2ization will be fundamental when we reintroduce the model saving functionality, and I believe we will have to find new solutions for loading pre-trained embeddings, but at the same time I believe there is a bigger and more fundamental thing to do, which is tackling sequence features. If we tackle them, then text, timeseries, and audio fall into place easily, while the remaining features are pretty straightforward. So, if you are ok with it, let's work on the sequence features first.
re: the above, I understand you'd like me to work on the sequence feature next, similar to the work I did for the binary and category features. I'll get started and submit a new PR for this work.
Yes. Consider that the sequence feature is the most complex one to tackle, because it has the greatest number of different encoders and because it has two decoders: the tagger should be easy, but you already had a taste of the generator decoder :) But this time it will be easier because there won't be any mix of TF1 and TF2; it will be pure TF2, which will make things much easier.
An additional note: @ydudin3 ported the image input feature to TF2 using TF layers (ResNet is still WIP). I did an additional pass on both the conv and fc layers to expose almost all their parameters all the way up the stack. The work done on Stacked2DCNN that you can see in the commits is a good blueprint for how to do the same kind of work for both the Embed encoder for category features and all the encoders for the sequence features.
Thank you for the pointer to Stacked2DCNN. I'll look at its implementation.
Here is the start of converting the category feature to TF2 eager execution. More work is needed to complete it.
The following have been implemented:
At this point the training phase completes without error. As noted above, only the loss function has been implemented.
The predict phase fails because of an incomplete implementation of the predictions() method and missing metric functions. The work to be done is similar to what I did with the binary feature.

Since this is the first time I've implemented an encoder and decoder, I'd appreciate it if you would take a look at how I implemented them.
I'm attaching the training data I'm using for testing. Included in the zip file is a log file from a test run.
ludwig_category_feature.zip
Here is the model definition I'm using