TF2 porting: initial work #639
Conversation
I thought it would be good to establish a baseline for the unit tests. In doing this, I noticed strange behavior. The initial run of the unit tests led to this result:
This is the complete log file. All the errors were instances of this example:
It's not clear, at least to me, how this may be related to TF2. I'll note that the travis-ci run encounters the same problem as above. Have you encountered this kind of problem before? These are the results of the
Here is the complete log file for this run. Whatever the issue is with
Commit 1dbfdfb disables running
@w4nderlust A question... this code fragment in
No, never. I think it's fine to disable the test temporarily and then re-add it when the rest of the stuff is done. There are not so many tests failing (I expected worse, to be honest), but there's still quite some work to do. Regarding the regularizers, my understanding is that in TF2 all regularizers are actually Keras regularizers. So I tried to replace the old ones with the Keras ones, but it did not work. So I believe we can ignore the regularizers for now, port everything to TF2, and only reintroduce them later.
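For context, the kind of replacement being attempted would look something like this sketch (the layer and coefficient are illustrative, not the actual Ludwig code):

```python
import tensorflow as tf

# A TF1-style tf.contrib.layers.l2_regularizer(scale) maps onto a Keras
# regularizer attached directly to the layer it should penalize:
dense = tf.keras.layers.Dense(
    64,
    kernel_regularizer=tf.keras.regularizers.l2(0.01),
)

# In TF2 the accumulated penalties are exposed as layer.losses / model.losses,
# and in a custom training loop they must be added to the task loss manually:
#   total_loss = task_loss + tf.add_n(model.losses)
```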
Commit 6689168 eliminates 5 out of 10 errors in the unit tests. Local running of unit tests:
Commit b2673eb fixes 1 of the 5 remaining errors. Local testing:
Just a quick update... 3 of the 4 remaining errors in the unit tests relate to sequence-related models. Re: the seq2seq test, I made some progress in the sense that errors now occur later in processing. I'm slowly reverse engineering the conversion from
A quick update... I'm focused on fixing the multiple seq2seq unit test. I'm assuming getting this to work will inform how to resolve the other failing tests. From what I can tell, I've accounted for all "input parameters" and converted the deprecated calls. Although the multiple seq2seq test still fails, it now fails later in processing than when I first started working on it. I'm assuming that's progress. :-) The current error appears to involve some dependency on the legacy code. I've pushed all changes to date to this PR. Any feedback will be appreciated, especially feedback on whether the data structures are mapped correctly to the TF2 equivalent usage.
@w4nderlust Re: an earlier comment
You may have noticed that the number of failures in the unit tests took a turn for the worse in some of the commits (04470ae, 7e4261c). This occurred because I changed this line of code https://github.com/uber/ludwig/blob/2f55f8bd4922890a738d89dc10d953298037e25d/ludwig/models/modules/recurrent_modules.py#L36 to use a different call. For the moment I went back to the original.
I've been looking at the last set of errors and could use a second set of eyes. The tests now fail with
As I understand the error, the code is expecting a vector but instead received a scalar value. The likely cause is that I mis-mapped a data structure when I converted to the tfa API set. Here is an example error stack trace:
The full pytest log file, if needed.
Your work has been great! I'm suggesting a refocus just because it seems unfair to me to ask you to spend your time getting that deeper level of understanding. On the other hand, if you are willing to and you believe it's something beneficial to you, we can definitely keep on working on it. It's your call; I just don't want you to feel obliged, because I understand how much of a learning curve there is :)
I'm still willing to work on the TF1-to-TF2 upgrade. While it is a lot of effort and progress is slow, I'm getting practical experience in making deep learning software work. So if you're OK with the current pace of work and with me bugging you with questions, I'd like to continue. We can have this understanding: if, for any reason, you want to take over the task, just let me know. At the same time, if I feel overwhelmed, I'll reach out to you and pass the work back to you. These can help me in my work.
Basically, setting boundaries for the work will help me focus in the right area.
That's a perfect arrangement for me.
Let me do a pass on the current recurrent_modules in the PR to remove everything related to timeseries; that way things will be much cleaner to begin with.
Makes sense. Part of the reason it is difficult is that you expect some things to work in a certain way in TF2, that you just need to replace calls with corresponding ones, but then we discover it's not the case, so one needs to adjust. But that's fine, it's an iterative process. Regarding the graph-building part, the function you should look at is
Thank you for the guidance. As a proof of concept, I'm planning to demonstrate eager execution on this model:
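(The model definition itself isn't reproduced here; a hypothetical minimal one matching the description, with a single numerical input feature and a single output feature so that only the fully connected stack is exercised, might be:)

```python
# Hypothetical minimal Ludwig model definition dictionary: one numerical
# input feature and one numerical output feature, so the only network
# exercised is the fully connected stack.
model_definition = {
    'input_features': [{'name': 'x1', 'type': 'numerical'}],
    'output_features': [{'name': 'y', 'type': 'numerical'}],
}
```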
This way I only have to focus on one type of feature and a simple fully connected layer. If I can get this to work for the train, validation, and test functions, it will form the foundation for the rest of the work. I'll provide updates as the work progresses.
Kind of "Daily stand-up" :-)What was accomplishedThis code and resulting output is a proof-of-concept showing creation and use
My current thinking on where the above code fragments are inserted in the Ludwig code structure

To fit in the above code fragments, I'm expecting to remove/replace surrounding code in the cited locations. This fragment
is added in this region. I'm leaning toward Alternative 2. This fragment
is added in this region: https://github.com/uber/ludwig/blob/1fda07c555f3f3d3a9f989ab63441c8021a50391/ludwig/models/modules/fully_connected_modules.py#L122-L137
This fragment
is added in this region: https://github.com/uber/ludwig/blob/1fda07c555f3f3d3a9f989ab63441c8021a50391/ludwig/models/model.py#L203-L222
This fragment
is added in this region: https://github.com/uber/ludwig/blob/1fda07c555f3f3d3a9f989ab63441c8021a50391/ludwig/models/model.py#L531-L566
This fragment
is added in this region: https://github.com/uber/ludwig/blob/1fda07c555f3f3d3a9f989ab63441c8021a50391/ludwig/models/model.py#L857-L893

Next steps

Blockers

None
A few comments on this.
Oh, weird: I added a commit to remove the timeseries stuff from recurrent modules, and I pushed it the usual way, but it got merged into the tf2_porting branch. I guess we can continue from there.
I understand this branch was accidentally merged. When you write, "I guess we can continue from there," are you suggesting that I continue pushing changes to this branch?
I think so. Anyway, it was merged into the tf2_porting branch, not into master, which is OK.
I believe I now understand how the custom eager-mode training loop works. Using the advanced tutorial you pointed out, I worked out a native TF2 custom training example. This code will serve as a guide for future work. My next step is to begin the work to modify Ludwig for eager mode. Note: while working on the sample program, it appeared that training was not converging. While researching this, I found this issue. It's not clear whether this is relevant or not. Any thoughts about this? This is the native TF2 sample program:
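(The original listing isn't preserved here; below is a minimal sketch of a native TF2 custom training loop of the kind described. The model, data, and hyperparameters are illustrative stand-ins.)

```python
import tensorflow as tf

# Illustrative stand-in model: a single fully connected layer.
class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(1)

    def call(self, inputs, training=False):
        return self.dense(inputs)

model = MyModel()
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

# Synthetic data purely for illustration.
x = tf.random.normal((256, 4))
y = tf.reduce_sum(x, axis=1, keepdims=True)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

# Eager-mode custom training loop: forward pass under GradientTape,
# then apply the gradients with the optimizer.
for epoch in range(5):
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            predictions = model(x_batch, training=True)
            loss = loss_fn(y_batch, predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    print(f'epoch {epoch}: loss {loss.numpy():.4f}')
```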
Following this comment, you may want to check the shape of
I'm now looking at how to adapt the Ludwig data pipeline to support the design described in the previous post. From what I can tell, the current design converts training data primarily into a dictionary structure, which is fed into the current TF1 training loop. For TF2 eager execution, I have to convert the pre-processed training data to a tf.data.Dataset. Does this sound like a reasonable approach? Are there other factors I need to be aware of? For example, is the
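For concreteness, the conversion being asked about would look roughly like this (the column names are illustrative):

```python
import numpy as np
import tensorflow as tf

# A batcher-style dictionary of column name -> ndarray...
data = {
    'x1': np.random.rand(100).astype(np.float32),
    'x2': np.random.rand(100).astype(np.float32),
    'y': np.random.rand(100).astype(np.float32),
}

# ...becomes a tf.data.Dataset that yields dictionaries of batched tensors.
dataset = tf.data.Dataset.from_tensor_slices(data).batch(32)
for batch in dataset:
    print({name: tensor.shape for name, tensor in batch.items()})
```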
I'm not sure you actually need to do the conversion to a tf dataset. I believe the inputs to MyModel.call() in your example will gladly accept numpy ndarrays, which is what the Ludwig batcher provides. Actually, the Ludwig batcher provides a dictionary with the column names as keys and ndarrays as values, so internally, in the overall model call, the different columns' ndarrays need to be dispatched to the right subparts of the model (the different input encoders). The way to do it is by using the name of the column. If my model contains a dictionary of input_feature_name -> encoder function (let's call it self.input_feature_encoders), when a new input batch arrives you just use the name of the column to obtain the output of the encoder, and you do the same for the decoders, something like:
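A sketch of that dispatch pattern (the dictionary attribute names follow the comment above; the combiner and decoder wiring are assumptions):

```python
import tensorflow as tf

class MyModel(tf.keras.Model):
    def __init__(self, input_feature_encoders, combiner, output_feature_decoders):
        super().__init__()
        # input column name -> encoder function,
        # output column name -> decoder function
        self.input_feature_encoders = input_feature_encoders
        self.combiner = combiner
        self.output_feature_decoders = output_feature_decoders

    def call(self, batch, training=False):
        # batch is the dictionary the Ludwig batcher provides
        # (column name -> ndarray); dispatch each column by name.
        encoder_outputs = {
            name: encoder(batch[name])
            for name, encoder in self.input_feature_encoders.items()
        }
        combined = self.combiner(encoder_outputs)
        return {
            name: decoder(combined)
            for name, decoder in self.output_feature_decoders.items()
        }
```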
Does this make sense?
OK... as I understand the guidance, I should add these two dictionary attributes to the MyModel class:
The dictionary keys will be the input/output column names, i.e., 'x1', 'x2', 'y'. The dictionary values will be the functions that encode the data type of the column. Depending on the answer to the next question, I may have a follow-up question on the encoding functions. Re:
Please point me to the relevant code section for the batcher. The values returned by the "Ludwig batcher": have these already been pre-processed? For example, if the column is a numerical data type, has it been "z-scored" if that was requested, and have missing values been filled in? Or does this function
I see now how the Batcher function works, so no need to point out the code section. I'm still curious to know whether the values returned by
have already been pre-processed.
With some experimentation and use of the debugger, I believe I have the answer to my question, "Has the data been pre-processed?" at this point in the code
From what I can tell, the answer is "Yes." If this is true, then some of the changes I thought needed to happen are no longer relevant. Again, thank you for the guidance.
Yes, data is already preprocessed when it is provided as a batch. There are other PRs for refactoring that preprocessing code to make it more generic, but they could be completely independent from this one, I believe. And yes, the image is an example of an autoencoder with an encoder and a decoder. In Ludwig you have a different encoder for each input feature, a different decoder for each output feature, and a combiner. Maybe the presentations here https://w4nderlu.st/projects/ludwig or Ludwig's whitepaper can help you get a better understanding of this notion. Let me know if you have other questions.
I had a similar issue due to the fact that wandb uses a context manager, which conflicts with pytest. It was solved by using a fixture. This may not be the problem here, since it is working in the master branch. In any case, let me know if you need any help with this test.
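(The actual fixture isn't preserved in this thread; a hypothetical shape for it, which simply keeps wandb from syncing during tests, could be:)

```python
import pytest

# Hypothetical fixture: run wandb in offline/dryrun mode so its context
# manager does not conflict with pytest's own lifecycle.
@pytest.fixture(autouse=True)
def wandb_dryrun(monkeypatch):
    monkeypatch.setenv('WANDB_MODE', 'dryrun')
```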
@borisdayma thank you for offering to help! This branch has many things that are not working yet, so this is just one of many, but when we get to the point where most things work, we'll start work on the contribs, and at that point your experience will likely be valuable. Please monitor the open PR (this one was merged in error): #646. I will mention you there when needed. Thank you again!
I'm stuck with converting tf.contrib.seq2seq.TrainingHelper to TF 2.2; please help me.
Use tensorflow_addons instead.
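Concretely, the tensorflow_addons equivalent of the TF1 helper is a sampler; a sketch of the replacement (shapes and layer sizes are illustrative):

```python
import tensorflow as tf
import tensorflow_addons as tfa

vocab_size, embed_size, units, batch_size, max_len = 100, 16, 32, 4, 7

# TF1: helper = tf.contrib.seq2seq.TrainingHelper(inputs, sequence_length)
# In tensorflow_addons, helpers became samplers; TrainingSampler is the
# rough equivalent for teacher-forced training.
sampler = tfa.seq2seq.TrainingSampler()
cell = tf.keras.layers.LSTMCell(units)
decoder = tfa.seq2seq.BasicDecoder(
    cell, sampler, output_layer=tf.keras.layers.Dense(vocab_size)
)

# Illustrative embedded target inputs and their lengths.
decoder_inputs = tf.random.normal((batch_size, max_len, embed_size))
sequence_lengths = tf.fill([batch_size], max_len)
initial_state = cell.get_initial_state(batch_size=batch_size, dtype=tf.float32)

outputs, state, lengths = decoder(
    decoder_inputs,
    initial_state=initial_state,
    sequence_length=sequence_lengths,
)
print(outputs.rnn_output.shape)  # (batch_size, max_len, vocab_size)
```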
Thanks @w4nderlust, I fixed it, but now I have a problem with
It always returns TypeError: call() got an unexpected keyword argument 'training', although I followed the tutorial on Stack Overflow. Can you explain why I get this issue?
It's very difficult for me to help out without further context, unfortunately. Plus, this is not really Ludwig related :) What I suggest you do is look at Ludwig's Sequence decoders module and compare it with your implementation. Hopefully that helps.
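For anyone hitting the same error: a common cause is a custom layer or model whose call() signature doesn't accept the training keyword that Keras forwards through __call__. A minimal sketch of the fix (the layer name is illustrative):

```python
import tensorflow as tf

class MyDecoderLayer(tf.keras.layers.Layer):
    # Keras forwards training=... through __call__, so call() must accept it.
    # Defining call(self, inputs) without the keyword raises:
    #   TypeError: call() got an unexpected keyword argument 'training'
    def call(self, inputs, training=None):
        return inputs
```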
@w4nderlust This is a test of sending results of my work. Initially I tried pushing my changes directly to uber/ludwig, but it failed because I did not have write permission to uber/ludwig. I'm now trying to submit my changes through a PR for the tf2_porting branch in my fork jimthompson5802/ludwig. The target for this PR is the uber/ludwig branch tf2_porting. If this works, then I can just add commits to this PR.

The Docker image I use for my development environment is built with the updated requirements.txt on the tf2_porting branch, which contains

With this initial set of commits, train completes and some, not all, of the data are saved for TensorBoard. For my test, I'm using the Titanic Survivor example. Here is the log from training: tf2_sample_train_log.txt
Here is a screenshot of TensorBoard for the data that was collected.
Let me know what you think.