
What does the paper mean by different temporal extents being used on the 60f network? #5

@ajay9022 commented Feb 15, 2019

I was reading the paper Long-term Temporal Convolutions for Action Recognition and read that they tried different temporal extents t ∈ {20, 40, 60, 80, 100} on the 60f network.

I don't get the term "temporal extent" used here. Can you also explain what a 60f network means?

From this link I learned that a video is made up of many clips and that each clip consists of some x frames. Does that hold true in this paper too?

@gulvarol (Owner) commented

The temporal extent is simply the number of input frames (the clip length) of the network. t ∈ {20, 40, 60, 80, 100} is not all on the 60f network; each value is a separate network: 20 frames (20f), 40 frames (40f), 60 frames (60f), and so on. We also did experiments with different input resolutions. And yes, the clip/video terminology from that link holds here as well: a video is made up of clips of some x frames each.
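To make the terminology concrete, here is a minimal NumPy sketch, not the paper's actual Torch code: the 58x58 spatial size, the channels-first layout, and the `make_clip_input` helper are illustrative assumptions. It only shows that the temporal extent t is the number of frames stacked into one input volume:

```python
import numpy as np

def make_clip_input(video_frames, t):
    """Stack the first t consecutive frames into one network input volume."""
    assert len(video_frames) >= t, "video shorter than the clip length"
    clip = video_frames[:t]                   # (t, H, W, C)
    return np.transpose(clip, (3, 0, 1, 2))   # (C, t, H, W), channels first

video = np.random.rand(240, 58, 58, 3)        # a fake 240-frame RGB video
for t in (20, 40, 60, 80, 100):               # the temporal extents tried in the paper
    x = make_clip_input(video, t)
    print(f"{t}f network input shape: {x.shape}")
```

So a "60f network" is simply the network whose input volume contains 60 frames.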

@ajay9022 (Author) commented Feb 15, 2019

So that means for the 60f network the input is 60 frames, right?

Also, one of the takeaways of the paper is that inputs with higher temporal extent give better accuracy. So inputs with more frames are recognised better than inputs with fewer frames.

Does that mean the gap between 2 consecutive frames at temporal extent = 100 is smaller than at temporal extent = 60? Because then, for the same video, the 60-frame input would consist of fewer frames spaced farther apart.

@gulvarol (Owner) commented

Yes, 60f means 60 frames.

The gap between 2 consecutive frames is the same for 60f and 100f, since we always sample consecutive frames from the original video. More randomness in that sampling could improve the results, but this is not something we investigated in that paper.
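For illustration, a minimal sketch of what "always sample consecutive frames" means at training time; `sample_training_clip` is a hypothetical helper, not code from this repository:

```python
import numpy as np

def sample_training_clip(num_frames, t, rng=None):
    """Return t *consecutive* frame indices starting at a random position.

    The spacing between neighbouring frames is always 1, regardless of t,
    so 60f and 100f clips have identical frame-to-frame gaps; a 100f clip
    simply covers a longer span of the video.
    """
    rng = rng or np.random.default_rng()
    start = rng.integers(0, num_frames - t + 1)   # random, not necessarily 0
    return np.arange(start, start + t)

print(sample_training_clip(240, 60))    # e.g. frames 113..172
print(sample_training_clip(240, 100))   # e.g. frames 57..156
```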

@ajay9022 (Author) commented

That means a video of 240 frames, when fed into a 60f network, will only use the first 60 frames and ignore the remaining 180. Surely that means we are feeding the network incomplete information, which would hurt the accuracy of recognising the video.

Just to confirm, did I get that right?

@gulvarol (Owner) commented

This is explained in the last two paragraphs of Section 3.3 of the paper. At training time, we take a random (not necessarily the first) 60-frame clip. At test time, we slide a window over the video and average the scores of the resulting clips. Using only 1 clip would of course reduce the accuracy.
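A minimal sketch of that test-time procedure, assuming the temporal stride of 4 frames reported in Section 3.3; `score_clip` is a hypothetical stand-in for one forward pass of the network:

```python
import numpy as np

def window_starts(num_frames, t, stride=4):
    """Start indices of the t-frame sliding windows used at test time."""
    return list(range(0, num_frames - t + 1, stride))

starts = window_starts(240, 60)   # [0, 4, 8, ..., 180] -> 46 clips
print(len(starts), "clips:", starts[:3], "...", starts[-1])

# Assuming a hypothetical score_clip(start) returning class probabilities
# for one 60-frame clip, the video-level prediction averages over clips:
# scores = np.stack([score_clip(s) for s in starts])
# prediction = np.argmax(scores.mean(axis=0))
```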

@ajay9022 (Author) commented Mar 18, 2019

Does sliding windows mean sliding through frames 1-60, then 5-64, then 9-68, and so on, since a stride of 4 frames is given in the paper? Or does it mean something else?

@ajay9022 (Author) commented

Can you explain the last paragraph of Section 3.3 a bit more? I didn't get how the cropping is done:

At test time, a video is divided into t-frame clips with a temporal stride of 4 frames. Each clip is further tested with 10 crops, namely the 4 corners and the center, together with their horizontal flips. The video score is obtained by averaging over clip scores and crop scores. If the number of frames in a video is less than the clip size, we pad the input by repeating the last frames to fill the missing volume.

Also, do different clips in a given video show different actions? Why are we focusing on clips of a video rather than on the complete video at once?

Does the clip size mean the number of frames in a given clip? Again, how is a given video divided into clips? Do clips in a video share common frames?
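To unpack the quoted paragraph, here is a minimal sketch of the 10-crop testing and the padding of short videos. `ten_crops` and `pad_short_video` are hypothetical helpers under simplifying assumptions; in particular, the padding below repeats only the very last frame:

```python
import numpy as np

def ten_crops(clip, ch, cw):
    """4 corner crops + the center crop, plus a horizontal flip of each."""
    t, H, W, _ = clip.shape
    offsets = [(0, 0), (0, W - cw), (H - ch, 0), (H - ch, W - cw),
               ((H - ch) // 2, (W - cw) // 2)]
    crops = [clip[:, y:y + ch, x:x + cw] for y, x in offsets]
    crops += [c[:, :, ::-1] for c in crops]   # horizontal flips
    return crops                              # 10 crops in total

def pad_short_video(frames, t):
    """Repeat the last frame until the video is at least t frames long."""
    if len(frames) >= t:
        return frames
    pad = np.repeat(frames[-1:], t - len(frames), axis=0)
    return np.concatenate([frames, pad], axis=0)

clip = np.random.rand(60, 70, 70, 3)                             # illustrative sizes
print(len(ten_crops(clip, 58, 58)))                              # -> 10
print(pad_short_video(np.random.rand(40, 70, 70, 3), 60).shape)  # -> (60, 70, 70, 3)
```

As for the other questions: with a clip size of 60 frames and a stride of 4, neighbouring clips share 56 frames, so clips do overlap; the clip size is indeed the number of frames in a clip; and since the per-clip, per-crop scores are all averaged into one video-level score, the whole video is still taken into account.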
