Generated samples look bad, but training seems ok #34

Open · grisaitis opened this issue Feb 17, 2018 · 12 comments

@grisaitis (Contributor)

I'm getting crappy-looking samples from networks that seem to train properly. Loss converges nicely (see last screenshot), but samples don't compare with @hardmaru's results.

Here are some examples... Any ideas for what could be causing this?

[four screenshots of generated handwriting samples]

This is after training for 10920 iterations (30 * 364) with python train.py --rnn_size 400 --num_layers 3. Default args produce similar results.

Training and validation loss look fine:

[screenshot of training and validation loss curves]

Has anyone been able to train and produce great results with a recent TensorFlow? I wonder if some defaults have changed, or if interface changes have resulted in a bad setup for the loss function.

I suspect the loss function is the issue: the loss values "look good" but the actual results look bad, so it's basically optimizing for the wrong thing.

Any ideas appreciated. My next steps are to review how the loss is defined and maybe compare it with other implementations (https://greydanus.github.io/2016/08/21/handwriting/, https://github.com/snowkylin/rnn-handwriting-generation).

@grisaitis (Contributor, Author)

I believe 681cee9 introduced an issue. That code produces models with these weird outputs, but the previous commit does not. I tried this with two TensorFlow versions: 1.0.0 and 1.3.0, which was the latest version as of November 2017.

I'll look into this a bit more... Some arguments changed for legacy_seq2seq.rnn_decoder and MultiRNNCell.

@grisaitis (Contributor, Author)

I should add that I modified 681cee9 slightly so that MultiRNNCell uses args.num_layers and not args.rnn_size, but that shouldn't be an issue.
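For reference, this is the kind of stacking I mean. A minimal sketch (TF 1.x, with args as in train.py's flags); the LSTMCell type here is just for illustration, not necessarily the cell the repo uses:

    import tensorflow as tf  # TF 1.x

    # One cell per layer (args.num_layers), each of width args.rnn_size --
    # not args.rnn_size copies of the cell.
    cells = [tf.contrib.rnn.LSTMCell(args.rnn_size) for _ in range(args.num_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)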

@zxMaxpainxz

Very helpful comment! I'm also facing the same problem, but whenever I change the legacy_seq2seq.rnn_decoder and MultiRNNCell arguments, I get a ValueError. Has anyone solved this problem?

@bbrito commented Aug 19, 2018

Any news on this issue? Has anyone managed to solve it?

@partha2409

Hi, I am using TensorFlow 1.4 and I am able to reproduce this issue. Did any of you manage to fix it? TIA

@bbrito commented Oct 17, 2018

I tried the other repos that @grisaitis suggested. You may get better results with them.

@partha2409

Hi, I found the bug.

In line 58 of model.py, the cell always takes zero_state as its initial state. This is fine while training but not when generating samples.

While training, we pass 50 batches of 300 timesteps. Inside tf.contrib.legacy_seq2seq.rnn_decoder, the state input of each timestep is automatically updated with the state output of the preceding timestep. For the next minibatch, the states are reinitialised to zero before starting.

While generating samples, we generate one point at a time. This means that every time we run the session, tf.contrib.legacy_seq2seq.rnn_decoder is called for a single timestep. Now note the bug in line 58: every time we call the rnn_decoder we set the state to zero_state. So each timestep of the generated sample is produced without any knowledge of the previous state output.

The fix: every time we call sess.run to generate samples, we need to fetch the previous state and pass it in when generating the next point, instead of always passing zero_state. I was able to generate proper results with that fix :) Hope it helps.
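To make that concrete, here is a minimal sketch of the intended sampling loop with the state threaded across sess.run calls. The names model.input_data, model.state_in, model.state_out, model.output_params and the sample_from_mixture helper are placeholders for illustration, not the repo's exact API:

    import numpy as np
    import tensorflow as tf  # TF 1.x

    num_points = 800  # how many points to generate
    # Start from the zero state once, then carry the state across sess.run calls
    # so each generated point sees the whole stroke history.
    prev_state = sess.run(cell.zero_state(batch_size=1, dtype=tf.float32))
    prev_x = np.zeros((1, 1, 3), dtype=np.float32)  # seed point: [dx, dy, eos]

    strokes = []
    for _ in range(num_points):
        feed = {model.input_data: prev_x, model.state_in: prev_state}
        params, prev_state = sess.run([model.output_params, model.state_out], feed)
        next_point = sample_from_mixture(params)  # hypothetical MDN sampling helper
        strokes.append(next_point)
        prev_x = next_point.reshape(1, 1, 3)  # feed the sampled point back in
    # The bug amounts to resetting prev_state to the zero state on every iteration.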

@bbrito commented Nov 9, 2018

@partha2409 did you try that fix? As far as I understand, in the sample function we set the previous state to next_state, which is one of the outputs of the network. I do not believe this was the problem.

@partha2409

Hi, I tried the fix and it worked. You are partially correct: in the sample function, on line 275 we assign the previous state to next_state, and on line 235 we pass the new state value to the "state_in" variable. However, note that on line 58 the state_in variable is not used at all and the zero state is always used. It should be similar to line 57, which is commented out, where the state_in variable is passed to the rnn_decoder.

@bbrito commented Nov 9, 2018

You are right, well spotted! What performance did you get during training? Which parameters did you use?

@falaktheoptimist

We're facing a similar issue. Could you explain the exact change that you made in the code, @partha2409? Also, what results are you getting after the fix? I tried uncommenting the line above as you mentioned, but that does not seem to work either.

@aleksandersawicki

As mentioned earlier, the problem occurs in the rnn_decoder call.

Instead of

        outputs, state_out = tf.contrib.legacy_seq2seq.rnn_decoder(
            inputs, zero_state, cell, loop_function=None, scope='rnnlm')

it could be

        self.state_in_as_tuple = (self.state_in[0, :, :], self.state_in[1, :, :])
        outputs, state_out = tf.contrib.legacy_seq2seq.rnn_decoder(
            inputs, self.state_in_as_tuple, cell, loop_function=None, scope='rnnlm')
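If I read it right, this slices the single state_in placeholder into the tuple form that rnn_decoder accepts as an initial state, which is essentially the commented-out line 57 that @partha2409 pointed to. The sample loop then only has to feed state_in with the state_out fetched from the previous sess.run call.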


