hidden state raises NotImplementedError() #21

Closed
SvenGroen opened this issue Jun 4, 2020 · 8 comments

Comments

@SvenGroen

SvenGroen commented Jun 4, 2020

Hey,
I am trying to integrate your convLSTM cell into an existing model I have.
I did this in the following way:

    def forward(self, x):
        out = my_model(x)
        out = out.unsqueeze(1)  # make sure dimensions fit
        out, self.hidden = self.convlstm(out, self.hidden)
        return out[-1].squeeze(1)

self.hidden is None on the first run, but not None on the second, leading to:

    # Implement stateful ConvLSTM
    if hidden_state is not None:
        raise NotImplementedError()

which is in your ConvLSTM module (lines 141-142).

would this be implemented just by:

    # Implement stateful ConvLSTM
    if hidden_state is not None:
        hidden_state = hidden_state

or am I misunderstanding something?
What is supposed to happen here?

Thanks for any help :-D

@nicoladainese96

Hi,
looking at the case in which hidden_state is None you can see that the expected content of the variable is something like:
[(h_1, c_1), ..., (h_{num_layers}, c_{num_layers})]

So, in short, I think the variable last_state_list can be reused as hidden_state in the next forward pass, provided the flag return_all_layers was set to True during initialization (because you need to pass the hidden and cell states of all layers, not just the last one).

As a clean fix I would just use an if so that the hidden state is only initialized when it is not provided:
if hidden_state is None: hidden_state = self._init_hidden(batch_size=b, image_size=(h, w))
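
For what it's worth, a rough sketch of how the top of forward could look with that check in place (the parameter name input_tensor and the batch-first layout are assumptions based on the snippets in this thread; the rest of the method stays as it is):

def forward(self, input_tensor, hidden_state=None):
    # input_tensor: (b, t, c, h, w), assuming batch_first=True
    b, _, _, h, w = input_tensor.size()
    if hidden_state is None:
        # no previous state provided: start from zeros as before
        hidden_state = self._init_hidden(batch_size=b, image_size=(h, w))
    # ... rest of the original forward: loop over layers and time steps,
    # using hidden_state[layer_idx] as the initial (h, c) of each layer ...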

I will try it myself in the next few days and update this post if needed.
Cheers,
Nicola

@SvenGroen
Author

Hi,
in my use case I have a stream of images (video) and the LSTM receives the current model's prediction [t] and the previous two:
so my input is [t-2, t-1, t].
In my case it would not make sense to reinitialize the hidden state every time, since this would delete all previous information.

I actually tried out, self.hidden = self.convlstm(out, self.hidden) with return_all_layers=False, which is basically the same as reusing the last_state_list as you suggested, since that is what gets returned.

It actually works okay... I am still experimenting to see if it increases performance.

I have not tried to set return_all_layers=True yet for complexity reasons. I have set num_layers=1.

Let me know if you find a good solution for dealing with the hidden state! :)
Cheers,
Sven

@nicoladainese96

In general, your solution only works because with a single layer setting return_all_layers to True or False makes no difference. For the general case, if you set it to True it runs without any error (I still have to check whether it learns properly, but I guess it will).
Regarding the video stream: in theory, if you provide the right hidden state (the default option) you would get the same result as feeding the previous 2 states, i.e. starting with the hidden state at t-2 and then taking as result lstm(state[t], hidden_recomputed[t]). Or am I missing something?

@SvenGroen
Author

To be honest I am massively confused :D
So far I concatenated t-2 and t-1 with t like this:
#out.shape: [bs, 1, #channels, height, width]
out = torch.cat(self.old_pred + [out], dim=1)
#out.shape: [bs, 3, #channels, height, width]
out, self.hidden = self.lstm(out, self.hidden)
self.hidden = [tuple(state.detach() for state in i) for i in self.hidden]  # had to detach it because of a retain_graph error
out = out[0]
self.old_pred[0] = self.old_pred[1]  # oldest at position 0
self.old_pred[1] = out.detach()  # newest at position 1

It is the first time I am working with LSTMs (for my bachelor thesis) and so far they have given me a lot of headaches 😄

@nicoladainese96

Hi,
I see your confusion.
I never worked with video and not much with LSTMs in general, but my suggestion would be to simplify your problem by avoiding the concatenation.
Usually frames are stacked together when we want to implicitly model a dynamic effect by introducing a time dimension in the input, but this is not necessary if you are already using an LSTM, because it automatically remembers previous states (that's its job, after all).
I also suspect that with all those detach calls you have probably killed the gradient flow at some point, and back-propagation might not work properly because of that.
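
To make that concrete, a minimal sketch of the per-frame version, based on the forward posted at the top of this issue (illustrative only):

def forward(self, x):        # x: the current frame only, [bs, channels, height, width]
    out = my_model(x)
    out = out.unsqueeze(1)   # add a time dimension -> [bs, 1, channels, height, width]
    # no torch.cat with old predictions: the ConvLSTM's hidden state
    # already carries the information from t-1, t-2, ...
    out, self.hidden = self.convlstm(out, self.hidden)
    return out[-1].squeeze(1)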

@SvenGroen
Author

SvenGroen commented Aug 23, 2020

The stacking was just done because the tensors are fed into the LSTM cell in a loop anyway (lines 158-161).

Yes, but detaching was (so far) the only way for me to enable the model to run without running into memory issues.

I guess if I want to use information from [t-2, t-1, t] correctly, I would need to detach after every 2 batches and specify retain_graph=True (see https://discuss.pytorch.org/t/convolutional-lstm-retain-graph-error/85200/4, at the bottom).

EDIT: I think I see what you meant with stacking.
If I feed in a stack each batch, I basically feed the same timesteps into the network multiple times, e.g.:

at t:
[t-2, t-1, t]

at t+1:
[t-1, t, t+1]

at t+2:
[t, t+1, t+2]

so t will be processed 3x.
Just putting in t and keeping the backpropagation graph for 3 batches would be correct... is this what you meant?

@nicoladainese96

Hi,
sorry for the late reply.
Yes, as you said in the edit, feeding just t should be fine and I think you won't have to detach anything at all, even though I'm not 100% sure.

My implementation is in Reinforcement Learning, so it is somewhat different, but in one case I'm feeding an input tensor x of shape (time, batch, channels, resolution, resolution) together with the initial hidden state (detached from the graph because that's how I get it; I never tried keeping its graph).

In the other case I'm using the network with torch.no_grad() and basically feeding, in a loop, the updated hidden state together with the new input of shape (1, batch, channels, resolution, resolution).

But if you look at the implementation, there is nothing that makes me think the input or the hidden state can't have the computational graph attached. Usually the gradients accumulated on the network's weights are reset by calling optimizer.zero_grad(), whereas the variables in a batch, together with their gradients, are discarded automatically after a while.

To wrap up just try

# x.shape: [B, T, C, W, H]
out, unused_hidden = self.lstm(x)

if you have batches of videos that do not need previous context (hidden_state = a bunch of zeros).
If for some reason you prefer a loop, do something like:

xs = [x[:, i, ...].unsqueeze(1) for i in range(x.shape[1])]  # get a list of T tensors of shape (B, 1, C, W, H)
hidden_state = self.lstm._init_hidden(batch_size=x.shape[0],
                                      image_size=x.shape[-2:])
outputs = []
for t in range(x.shape[1]):
    y, hidden_state = self.lstm(xs[t], hidden_state)
    outputs.append(y)

outputs = torch.cat(outputs, dim=1)  # I didn't check this step, but you just want to get back an output of shape (B, T, HC, W, H)

And that's it, no detach at all.

Let me know how it goes :)

@SvenGroen
Author

Hi,
Wow, I was not aware that you could use ConvLSTMs in Reinforcement Learning, sounds impressive 😮.

I am actually in the middle of testing and waiting for the results :D
So what I have found out so far is:

There are 2 possibilities:

  1. Not feeding in the hidden state:
    out, unused_hidden = self.lstm(x)
    which reinitializes the hidden state with zeros every time the LSTM is called
    (this automatically detaches the old hidden state, since the variable gets newly initialized).

  2. Feeding in the hidden state but detaching it manually, otherwise the following error is thrown:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Setting loss.backward(retain_graph=True) will increase the memory usage with every batch and will only work for a couple of batches until memory is full.

Solution: retain_graph=False + detach hidden state

--> No error + keeps the hidden values from the previous run (see the sketch after the sources below).
(sources:
https://discuss.pytorch.org/t/solved-why-we-need-to-detach-variable-which-contains-hidden-representation/1426/14
https://discuss.pytorch.org/t/solved-training-a-simple-rnn/9055/22
https://discuss.pytorch.org/t/initialization-of-first-hidden-state-in-lstm-and-truncated-bptt/58384/2)
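
For reference, a minimal sketch of what option 2 can look like as a small wrapper (illustrative names; it assumes the stateful fix discussed above, i.e. the ConvLSTM accepts a previously returned state as hidden_state):

import torch.nn as nn

class StatefulConvLSTM(nn.Module):
    def __init__(self, convlstm):
        super().__init__()
        self.convlstm = convlstm
        self.hidden = None          # carried over between batches

    def forward(self, x):           # x: [bs, t, channels, height, width]
        out, self.hidden = self.convlstm(x, self.hidden)
        # keep the state values but cut the graph of previous batches,
        # so backward() only runs through the current batch
        self.hidden = [tuple(s.detach() for s in layer) for layer in self.hidden]
        return out

    def reset(self):
        self.hidden = None          # call this at the start of a new video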

I am testing right now if there is a performance difference between:

A.) Inserting [t-2, t-1, t] in a loop each batch and detaching as described in 2.

or

B.) Inserting [t] and only detaching the hidden state every 3 batches, keeping the graph in between (it will be small enough to stay within my memory limit).
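
One way B could be written without retain_graph at all is to sum the per-frame losses over the 3-batch window and call backward once per window (a sketch with hypothetical loop and variable names):

TBPTT_STEPS = 3                       # hypothetical window length
window_loss = 0.0

for step, (frame, target) in enumerate(loader):
    pred = model(frame)               # model keeps self.hidden between calls
    window_loss = window_loss + criterion(pred, target)
    if (step + 1) % TBPTT_STEPS == 0:
        optimizer.zero_grad()
        window_loss.backward()        # the graph spans only the last 3 frames
        optimizer.step()
        # cut the graph at the window boundary, keep the state values
        model.hidden = [tuple(s.detach() for s in layer) for layer in model.hidden]
        window_loss = 0.0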

I have already tested Option A with hidden state option 1 vs. Option A with hidden state option 2, and the second one is ~2-8 percentage points better (depending on the metric). But this might very well be due to my use case, since the frames are so similar...
I'll let you know what else I might find out.

Cheers,
Sven
