About inference in real-time #3
Hi author,

Thanks for your excellent work. I did some training and testing based on your paper and code, and the results are good. I am now curious about real-time inference. My intention is to estimate the 3D coordinates while a video is being played back. According to your strategy and demo code, the estimation for a center frame needs the 2D poses before and after it, which means the 3D pose of a given frame cannot be obtained until the 2D poses after it have been computed. For real-time inference, however, the 2D pose sequence after a certain frame is not yet available while the video is playing.

I am now in a dilemma. I already have a 2D pose estimator that achieves a good balance between accuracy and speed, even after quantization and deployment on a mobile device. My plan is to combine it with P-STMO to build a real-time 3D pose estimator, i.e., first get the 2D poses and then infer the 3D pose. Actually, I am a little confused about the training strategy. My understanding is that the frames before the current frame should be enough for prediction, so why are the frames after the current one also collected during training? That is what I see in your training code. My naive idea is to use only the "before" frames as the input sequence at inference time, excluding the "after" frames. I'd appreciate your comment, big thanks.
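A minimal sketch of the real-time setup described above, assuming a model that only needs past frames; `estimate_2d` and `model_3d` are placeholders for the mobile 2D estimator and the 3D lifter, not actual P-STMO APIs:

```python
from collections import deque
import numpy as np

# Hypothetical streaming loop: keep a sliding buffer of the most recent 2D
# poses and predict the 3D pose of the newest frame from past frames only.
RECEPTIVE_FIELD = 243
buffer_2d = deque(maxlen=RECEPTIVE_FIELD)

def on_new_frame(frame):
    buffer_2d.append(estimate_2d(frame))      # placeholder 2D estimator
    window = np.stack(buffer_2d)              # (n, J, 2), n <= 243
    if len(window) < RECEPTIVE_FIELD:
        # replicate the oldest pose until enough past context accumulates
        pad = np.repeat(window[:1], RECEPTIVE_FIELD - len(window), axis=0)
        window = np.concatenate([pad, window], axis=0)
    return model_3d(window)                   # placeholder 3D lifter
```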
Frames after the current one are used to maintain the continuity of the movement. You can replace the symmetric convolutions in MOFA with causal convolutions, which are used in this paper, for real-time inference.
Thanks for your prompt reply. My understanding is that strided_transformer_encoder.py is supposed to be the MOFA module, but I don't see any logic for dilation/padding/kernel_size around its Conv1d layers; the dilation argument seems to always be left at its default value, i.e., 1. Could you explain the replacement with causal convolutions in more detail? On another note, I ran some tests in which only left_padding is applied to the input training sequence, i.e., right_padding is always 0. The resulting model also achieves good accuracy on Human3.6M.
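For concreteness, a minimal sketch of what left-padding-only training input might look like, assuming 2D poses stored as a `(T, J, 2)` array; the function and parameter names are illustrative, not the actual P-STMO arguments:

```python
import numpy as np

# Hypothetical sketch of the experiment described above: pad the training
# sequence on the left only (right_padding = 0) before slicing windows.
def pad_sequence(poses_2d, left_pad, right_pad):
    """poses_2d: (T, J, 2); edge-replicate padding along the time axis."""
    return np.pad(poses_2d, ((left_pad, right_pad), (0, 0), (0, 0)), mode="edge")

seq_symmetric = pad_sequence(poses, 121, 121)  # original: frame t is centered
seq_causal    = pad_sequence(poses, 242, 0)    # tested: frame t is the last frame
```

With `right_pad = 0`, the supervised frame sits at the end of each window, so inference never has to wait for future frames.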
For more details about causal convolution, please refer to Figure 6 in the paper I mentioned. It performs 1D convolutions only over the current frame and the frames before it. This approach is essentially the same as your implementation (right_padding=0).
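A minimal PyTorch sketch of such a causal convolution, assuming an `(N, C, T)` input; this class is illustrative and not part of the P-STMO code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative causal 1D convolution: all padding is placed on the left,
# so the output at frame t only depends on frames <= t.
class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (N, C, T)
        x = F.pad(x, (self.left_pad, 0))       # pad only the past side
        return self.conv(x)                    # output length stays T
```

The difference is simply where the padding goes: a symmetric convolution pads `(kernel_size - 1) // 2` frames on both sides and therefore depends on future frames, while the causal version shifts all of that padding to the past.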
Appreciate your help. I will ask you for advice if further questions come up. This issue can be closed.
> Frames after the current one are used to maintain the continuity of the movement. You can replace the symmetric convolutions in MOFA with causal convolutions, which are used in this paper, for real-time inference.
I mean changing the pad parameter from 121 to 243 to make the supervised frame the last frame, so that the network learns to predict the pose of the last frame; then symmetric convolutions and causal convolutions may make no big difference.
No, changing the parameter from 121 to 243 will still pad both the left and the right side of the current frame. You need to modify the convolution to achieve real-time 3D pose estimation. You can refer to this repo for causal convolutions.
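A small sketch of the padding arithmetic in question, assuming `pad` counts frames on each side of the supervised frame; the names and numbers are illustrative:

```python
# Illustrative only: with symmetric padding, supervising frame t reads
# frames t-pad .. t+pad, so increasing pad never removes future frames.
def frames_needed(t, pad):
    return (t - pad, t + pad)

print(frames_needed(t=0, pad=121))   # (-121, 121): needs 121 future frames
print(frames_needed(t=0, pad=243))   # (-243, 243): still needs future frames

def frames_needed_causal(t, receptive_field=243):
    """Causal windowing: frame t is the last frame, so no future frames."""
    return (t - receptive_field + 1, t)

print(frames_needed_causal(t=0))     # (-242, 0): only past (padded) frames
```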
Hello. I'm also interested in causal convolutions for real-time processing, but I am not able to find left_padding or right_padding in the code.
Have you been successful at training the causal model? It would save me quite some money if it has already been trained.