
About inference in real-time #3

Closed
liamsun2019 opened this issue Apr 6, 2022 · 10 comments

@liamsun2019

Hi author,

Thanks for your excellent work. I did some training and tests based on your paper and code, and the results are good. I am now curious about real-time inference. My intention is to estimate the 3D coordinates while a video is playing back. According to your strategy and demo code, estimating a center frame requires the 2D poses both before and after it, which means the 3D pose of a given frame cannot be obtained until the 2D poses after it have been computed. For real-time inference, however, the 2D pose sequence after a certain frame is not yet available during playback.

I am now in a dilemma. I already have a 2D pose estimator that achieves a good balance between accuracy and speed, even after quantization and deployment on a mobile device. My plan is to combine it with P-STMO to build a real-time 3D pose estimator, i.e., first get the 2D poses and then recover the 3D pose. Actually, I am a little confused about the training strategy. My understanding is that the frames before the current frame should be enough for prediction, so why are the frames after the current one also collected for training? That is what I see in your training code. My naive idea is to use only the preceding frames as the input sequence for inference, excluding the following frames. I'd appreciate your comments, big thanks.

@paTRICK-swk
Owner

Frames after the current one are used to maintain the continuity of the movement. For real-time inference, you can replace the symmetric convolutions in MOFA with causal convolutions, as used in this paper.
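
(For readers landing here: a minimal PyTorch sketch of what a causal 1D convolution looks like as a drop-in for a symmetric Conv1d; the class name and defaults are illustrative, not part of this repo.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution whose output at frame t depends only on frames <= t.

    Instead of padding symmetrically on both sides, all padding goes on
    the left, so no future frames are needed at inference time.
    """
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Left padding that keeps output length equal to input length.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              dilation=dilation, padding=0)

    def forward(self, x):
        # x: (batch, channels, frames)
        x = F.pad(x, (self.left_pad, 0))  # pad the past only, never the future
        return self.conv(x)

x = torch.randn(1, 256, 243)                  # batch, channels, 243 frames
y = CausalConv1d(256, 256, kernel_size=3)(x)
print(y.shape)                                # torch.Size([1, 256, 243])
```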

@liamsun2019
Author

Thanks for your prompt reply. My understanding is that strided_transformer_encoder.py is supposed to be the MOFA module. But for Conv1d, I don't see any logic for dilation/padding/kernel_size in this module; it looks like the 'dilation' argument is always left at its default, i.e., 1. Could you explain the replacement with causal convolution in more detail?

On the other hand, I ran some tests where only left_padding is applied to the input training sequence, i.e., right_padding is always 0. The resulting model also achieves good accuracy on Human3.6M.
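
(A minimal sketch of this left-only padding, assuming a 2D keypoint sequence of shape (frames, joints, 2); left_padding/right_padding are the names used in the comment above, not variables in the released code.)

```python
import numpy as np

def pad_left_only(keypoints_2d, receptive_field=243):
    """Pad a 2D pose sequence so every frame has a full causal window.

    keypoints_2d: (frames, joints, 2). The left side is padded by
    repeating the first frame; right_padding is always 0, so no
    future frames are ever required.
    """
    left_padding = receptive_field - 1
    right_padding = 0
    return np.pad(keypoints_2d,
                  ((left_padding, right_padding), (0, 0), (0, 0)),
                  mode='edge')

seq = np.random.randn(100, 17, 2)   # 100 frames, 17 joints
print(pad_left_only(seq).shape)     # (342, 17, 2)
```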

@paTRICK-swk
Owner

For more details about causal convolution, please refer to Figure 6 of the paper I mentioned. It performs 1D convolutions only on the frames up to the current one. This approach is essentially the same as your implementation (right_padding=0).

@liamsun2019
Copy link
Author

Appreciate your help. Will ask you for advice in case of further questions. This issue could be closed.


@vicentowang

vicentowang commented Jul 21, 2022

> Frames after the current one are used to maintain the continuity of the movement. For real-time inference, you can replace the symmetric convolutions in MOFA with causal convolutions, as used in this paper.

@paTRICK-swk @liamsun2019
I just changed the pad parameter from 121 to 243, which means the many-to-one frame aggregator will target the last frame. Is that the right way to achieve real-time 3D pose estimation?

@paTRICK-swk
Owner

> I just changed the pad parameter from 121 to 243, which means the many-to-one frame aggregator will target the last frame. Is that the right way to achieve real-time 3D pose estimation?

No, changing the parameter from 121 to 243 will still pad both the left and right sides of the current frame. You need to modify the convolution itself to achieve real-time 3D pose estimation. You can refer to this repo for causal convolutions.
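
(To make the distinction concrete, a small sketch of which frames each scheme reads when predicting frame t; receptive_field=243 follows the thread, and the function names are made up for illustration.)

```python
def symmetric_window(t, receptive_field=243):
    """Frames read to predict frame t with symmetric padding (121 per side)."""
    half = receptive_field // 2                   # 121
    return range(t - half, t + half + 1)          # includes t+1 .. t+121

def causal_window(t, receptive_field=243):
    """Frames read to predict frame t with causal convolutions."""
    return range(t - receptive_field + 1, t + 1)  # past frames only

print(max(symmetric_window(500)))  # 621: needs 121 future frames -> not real-time
print(max(causal_window(500)))     # 500: no future frames -> real-time capable
```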

@vicentowang

vicentowang commented Aug 16, 2022

> No, changing the parameter from 121 to 243 will still pad both the left and right sides of the current frame. You need to modify the convolution itself to achieve real-time 3D pose estimation. You can refer to this repo for causal convolutions.

I mean changing the pad parameter from 121 to 243 so that the supervised frame becomes the last frame, i.e., the network learns to predict the pose of the last frame. In that case, symmetric versus causal convolutions may make little difference.

@Edu4444

Edu4444 commented Sep 19, 2022

> My understanding is that strided_transformer_encoder.py is supposed to be the MOFA module. But for Conv1d, I don't see any logic for dilation/padding/kernel_size in this module; it looks like the 'dilation' argument is always left at its default, i.e., 1. Could you explain the replacement with causal convolution in more detail?
>
> On the other hand, I ran some tests where only left_padding is applied to the input training sequence, i.e., right_padding is always 0. The resulting model also achieves good accuracy on Human3.6M.

> For more details about causal convolution, please refer to Figure 6 of the paper I mentioned. It performs 1D convolutions only on the frames up to the current one. This approach is essentially the same as your implementation (right_padding=0).

Hello. I'm also interested in causal convolutions for real-time processing, but I can't find left_padding or right_padding in the code. Where are these variables?

@noahcoolboy

Have you been successful at training the causal model? It would save me quite some money if it has already been trained.
