
regression #3

Open
raijinspecial opened this issue Mar 31, 2021 · 17 comments

@raijinspecial

Beautiful work as usual, thanks for this implementation.

I'm curious whether you have tried using this for a regression task? I have tried using TimeSformer without success so far. I know the signal is there, because I can learn it with a small 3D CNN trained from scratch, so I suspect my understanding of how and where to modify the transformer is the culprit. The output is a 1D vector with len == num_frames. Any suggestions are very appreciated!

@tcapelle

tcapelle commented May 23, 2021

This is a pure code implementation; there are no experiments, training code, or tests.
I am currently using this and TimeSformer for regression. You don't need to modify anything, just set num_classes to the number of regressors and use MSELoss.
The output of these types of models comes from the cls_token attending to the other inputs. You can see that the head is super simple:

self.mlp_head = nn.Linear(dim, num_classes)
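
For anyone reading along, a minimal sketch of that regression setup (the dim value, batch size, and feature tensor below are placeholders, not the repo's actual defaults):

    import torch
    import torch.nn as nn

    dim = 512                              # placeholder embedding size
    mlp_head = nn.Linear(dim, 1)           # 1 output = one regressed value; use N outputs for N regressors

    features = torch.randn(8, dim)         # (batch, dim) - stand-in for the cls_token features
    targets = torch.randn(8, 1)            # one real-valued target per clip

    preds = mlp_head(features)             # (8, 1)
    loss = nn.MSELoss()(preds, targets)    # regression loss instead of cross entropy
    loss.backward()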

@monajalal

@tcapelle

What do you mean by "number of regressors"?

I initially had classification-based transformer code and then converted it to a regressor.

I am not sure whether the following is correct. Is 1 correct here, or what should I set it to?

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(emb_dim),
            nn.Linear(emb_dim, 1) # is this 1 correct for regression? 
        )

Previously, it was:
nn.Linear(emb_dim, num_classes)

@Taimoor-R

@tcapelle

What do you mean by "number of regressors"?

I initially had classification-based transformer code and then converted it to a regressor.

I am not sure whether the following is correct. Is 1 correct here, or what should I set it to?

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(emb_dim),
            nn.Linear(emb_dim, 1) # is this 1 correct for regression? 
        )

Previously, it was: nn.Linear(emb_dim, num_classes)

Hi, did you figure out how to use TimeSformer for regression tasks? I am trying to do the same but have had no luck.

@tcapelle

tcapelle commented Jan 4, 2023

Yeah, that's it!
You put as many outputs as there are variables to regress. If you only have one-dimensional regression, then 1 is it.
My only takeaway is that most regression problems can be converted to classification problems by binning the outputs. Instead of predicting the price of a good in, let's say, a range of [0, 100], you predict the probability of the value falling in the bins [0, 10], [10, 20], ..., [90, 100]. This way you get a probabilistic model that can be trained with a standard cross-entropy loss. It's a very useful trick.
The tricky part is creating a data pipeline to train this model; good luck 👍.
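
A rough sketch of that binning trick (the range, number of bins, and batch values here are arbitrary placeholders):

    import torch
    import torch.nn as nn

    num_bins = 10
    bin_edges = torch.linspace(0, 100, num_bins + 1)         # [0, 10, 20, ..., 100]

    def to_bin_index(y):
        # map a continuous target in [0, 100] to a class index in 0..num_bins-1
        return torch.clamp(torch.bucketize(y, bin_edges[1:-1]), max=num_bins - 1)

    logits = torch.randn(4, num_bins)                        # stand-in for the model's output over bins
    targets = to_bin_index(torch.tensor([3.0, 55.0, 99.9, 12.5]))
    loss = nn.CrossEntropyLoss()(logits, targets)            # standard cross entropy on the bin index

    # the expected value of the predicted distribution gives a point estimate back in the original range
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    point_estimate = (logits.softmax(dim=-1) * bin_centers).sum(dim=-1)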

@Taimoor-R

Yeah, that's it! You put as many outputs as there are variables to regress. If you only have one-dimensional regression, then 1 is it. My only takeaway is that most regression problems can be converted to classification problems by binning the outputs. Instead of predicting the price of a good in, let's say, a range of [0, 100], you predict the probability of the value falling in the bins [0, 10], [10, 20], ..., [90, 100]. This way you get a probabilistic model that can be trained with a standard cross-entropy loss. It's a very useful trick. The tricky part is creating a data pipeline to train this model; good luck 👍.

Thank you for the quick response. So let's say that I am hoping to use the pretrained TimeSformer model for regression instead of classification, for example using negative Pearson loss, with each frame of the video having a unique numeric label/ground truth. So essentially the training data would be a 60-second video broken into frames with corresponding values/labels for each frame. So in this case we will only have a one-dimensional regression, am I right?

@tcapelle

tcapelle commented Jan 5, 2023

Thank you for the quick response. So let's say that I am hoping to use the pretrained TimeSformer model for regression instead of classification, for example using negative Pearson loss, with each frame of the video having a unique numeric label/ground truth. So essentially the training data would be a 60-second video broken into frames with corresponding values/labels for each frame. So in this case we will only have a one-dimensional regression, am I right?

I think that TimeSformer expects a 5D tensor of the form:

frames = torch.randn(2, 5, 3, 256, 256) # (batch x frames x channels x height x width)

So you have to construct a dataloader that generates this. When I used these models I trained from scratch, so I was not carefully checking what input the model expects; I used the model as an architecture.

For training, construct a dataloader that, for each batch of videos, gives you a batch of values. How you label these snippets of video is up to you (you will have to subsample or reduce the input size, as the model cannot ingest inputs that are too long).
I was training using 10 frames of video that came from a camera with one image per minute, so a 10-minute sequence, and estimating the average movement speed. So I predicted one value for this 10-frame tensor (bs, 10, 128, 128).

I hope that clarifies the strategy to follow.

Another quick tip: you can create a super simple dataloader by stacking the full video together and then just slicing it randomly; here you have an example (a rough sketch of the idea is below).
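
The linked example isn't reproduced here, but a minimal sketch of the slicing idea could look like this (shapes, names, and clip length are all illustrative, not taken from the repo):

    import torch
    from torch.utils.data import DataLoader, Dataset

    class VideoSliceDataset(Dataset):
        """Stacks the full video into one tensor and serves fixed-length slices by start index."""
        def __init__(self, frames, targets, clip_len=10):
            # frames: (total_frames, channels, height, width), targets: (total_frames,)
            self.frames, self.targets, self.clip_len = frames, targets, clip_len

        def __len__(self):
            return self.frames.shape[0] - self.clip_len

        def __getitem__(self, idx):
            clip = self.frames[idx : idx + self.clip_len]            # (clip_len, C, H, W)
            target = self.targets[idx : idx + self.clip_len].mean()  # one value per clip, e.g. average speed
            return clip, target

    frames = torch.randn(600, 3, 128, 128)    # a dummy "full video" stacked along the time axis
    targets = torch.randn(600)                # one value per frame
    loader = DataLoader(VideoSliceDataset(frames, targets), batch_size=2, shuffle=True)
    clips, ys = next(iter(loader))            # clips: (2, 10, 3, 128, 128), ys: (2,)

With shuffle=True the start indices come out in random order, which gives the random slicing described above.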

@Taimoor-R

Thank you so much for the quick and detailed response. I am sorry for asking so many questions; I am new to the whole video transformer domain. I just have a follow-up question: my dataloader looks something like this.

It contains video frames and the pulse signal corresponding to them.
Frames are put in a 4D tensor with size [c x d x w x h].

    train_loader = torch.utils.data.DataLoader(
        pulse,  # dataset whose samples are (frames, labels)
        batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True, sampler=sampler)
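
One thing to watch with that layout: if the dataset returns frames as [c x d x w x h], the batches from this loader come out as (batch, channels, frames, height, width), while the models discussed above expect (batch, frames, channels, height, width), so a permute is likely needed somewhere. A minimal sketch, assuming those shapes:

    import torch

    frames = torch.randn(4, 3, 32, 128, 128)          # (batch, channels, frames/depth, height, width)
    frames_for_model = frames.permute(0, 2, 1, 3, 4)  # -> (batch, frames, channels, height, width)
    print(frames_for_model.shape)                     # torch.Size([4, 32, 3, 128, 128])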

@tcapelle

tcapelle commented Jan 5, 2023

Hope this clarifies my idea:
[image attachment]

@Taimoor-R

@tcapelle hi, thanks for all the help regarding the dataloader. I am sorry to bother you yet again. I was having some trouble understanding where this issue arises from and why, as the only thing I changed is the dataloaders.
[screenshot attachment: 2023-01-09 at 12 38 36]

@Taimoor-R

I have pinpointed where the issue is: it seems like my train dataloader doesn't provide the values in bold in for cur_iter, (inputs, labels, _, meta) in enumerate(train_loader). I don't understand how to resolve this though, as I am not using their dataloaders. The dataloader I am using works in the following way, where pulse_3d returns: sample = (frames, labels)
[screenshot attachment: 2023-01-09 at 13 49 51]
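
If you want to keep reusing that training loop as-is, one simple workaround is to make the Dataset return the two extra values the loop unpacks; in that codebase the third item is typically a sample index and meta a dict of extra info. A minimal sketch (names like samples are hypothetical stand-ins for whatever pulse_3d currently does):

    from torch.utils.data import Dataset

    class PulseDataset(Dataset):
        def __init__(self, samples):
            self.samples = samples             # list of (frames, labels) pairs

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            frames, labels = self.samples[idx]
            return frames, labels, idx, {}     # pad with a sample index and an empty meta dict

The other direction works too: edit the loop to unpack only (inputs, labels) if you never use the index or meta.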

@tcapelle

tcapelle commented Jan 9, 2023

Sorry, I can't help you with this. Maybe ask on the PyTorch forums?

@Taimoor-R

I will try asking there, but I don't think it's a PyTorch issue, is it? I believe it comes from the dataloader; apparently the dataloader should yield inputs, labels, _, meta, as seen in the following snippet from train_net.py (TimeSformer).

[screenshot attachment: 2023-01-09 at 14 07 41]

@tcapelle

tcapelle commented Jan 9, 2023

sorry, don't know.

@Taimoor-R

sorry, don't know.

Thank you for all the help. Just a tiny follow-up: for the TimeSformer, did you use the code provided by Facebook, or did you manage to find some other script?

@tcapelle

tcapelle commented Jan 9, 2023

I used @lucidrains' implementation

@Taimoor-R

But @lucidrains' implementation doesn't have trainer code, does it?

@Taimoor-R

Taimoor-R commented Jan 29, 2023

hi @tcapelle, using TimeSformer (orange line) for regression compared to a 3D CNN (pink line), my results are quite weird. I am adding a screenshot of the loss (MSE) vs. epoch graph for training and validation. Note: each video is broken into chunks of 32 consecutive frames, each with their corresponding ground-truth values. The model predicts one value per frame fed in, so for 32 frames it outputs 32 values.
[screenshot attachment: 2023-01-26 at 01 05 48]
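
For reference, a minimal sketch of the per-frame setup described there (dimensions are placeholders, and whether the head reads a pooled CLS token or per-frame tokens depends on the implementation), together with a common formulation of the negative Pearson loss used for pulse signals:

    import torch
    import torch.nn as nn

    clip_len, dim = 32, 512
    head = nn.Linear(dim, clip_len)        # one regressed value per frame in the 32-frame chunk

    pooled = torch.randn(8, dim)           # (batch, dim) - stand-in for the transformer's pooled features
    preds = head(pooled)                   # (8, 32)
    targets = torch.randn(8, clip_len)     # ground-truth value for each of the 32 frames

    def neg_pearson(pred, target, eps=1e-8):
        # 1 - Pearson correlation per sequence, averaged over the batch
        pred = pred - pred.mean(dim=1, keepdim=True)
        target = target - target.mean(dim=1, keepdim=True)
        corr = (pred * target).sum(dim=1) / (pred.norm(dim=1) * target.norm(dim=1) + eps)
        return (1 - corr).mean()

    loss = neg_pearson(preds, targets)     # or nn.MSELoss()(preds, targets)
    loss.backward()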
