
Feature proposals #249

Closed · EelcoHoogendoorn opened this issue Mar 18, 2021 · 14 comments

Comments

@EelcoHoogendoorn
Contributor

EelcoHoogendoorn commented Mar 18, 2021

Hi; first of all thanks for making this library; it looks very useful.

There are two papers that I'd like to try to implement in your repo:

Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere

This paper introduces a contrastive loss function with properties quite different from the more-cited ones. Of all the loss functions I've tried in my projects, this is the one I had the most success with, and also the one that a priori seems the most sensible to me among what's currently been published. The paper provides source code, and it's really a trivially implemented additional loss function. I've got experience implementing it, plus some generalizations I came up with that I could provide, so it should be an easy addition to this package. My motivation for contributing is a selfish one: having it included in the benchmarks here would help me better understand its relative strengths and weaknesses on different types of datasets.
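For reference, the loss in that paper decomposes into an alignment term and a uniformity term, and the reference implementation really is just a couple of lines. A minimal PyTorch sketch, assuming `x` and `y` are L2-normalized (N, D) embeddings of the two views of the same images:

```python
import torch

def align_loss(x, y, alpha=2):
    # Mean distance between positive pairs; rows of x and y correspond.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # Log of the mean pairwise Gaussian potential within the batch;
    # minimized when embeddings spread uniformly over the hypersphere.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```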

Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage

I also recently came across this paper. It also provides torch code (rather ugly, in my opinion). Implementing it in this package first would be easier for me than implementing it in my private repos, since it would be easier to properly test the implementation using the infrastructure provided here. The title is quite descriptive; given the importance of batch size for within-batch mining techniques, and the fact that many of us are working with single GPUs, being able to scale batch sizes arbitrarily is super useful, and I think this paper has the right idea of how to go about it.
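The trick, as I understand it, is to decouple the full-batch contrastive loss from backpropagation through the encoder, which only ever sees sub-batches; peak memory then scales with the sub-batch size rather than the batch size. A rough sketch of the idea (the names and the single-tensor `loss_fn` are mine, not the paper's code):

```python
import torch

def gradient_cached_step(encoder, loss_fn, batch, sub_batch_size, optimizer):
    # Stage 1: embed all sub-batches without building the autograd graph.
    with torch.no_grad():
        reps = torch.cat([encoder(chunk) for chunk in batch.split(sub_batch_size)])

    # Stage 2: compute the full-batch contrastive loss with respect to the
    # representations only, and cache the per-representation gradients.
    reps = reps.requires_grad_(True)
    loss = loss_fn(reps)
    loss.backward()

    # Stage 3: re-run each sub-batch with the graph enabled and backprop
    # the cached gradients into the encoder parameters.
    optimizer.zero_grad()
    for chunk, grad_chunk in zip(batch.split(sub_batch_size),
                                 reps.grad.split(sub_batch_size)):
        encoder(chunk).backward(grad_chunk)
    optimizer.step()
    return loss.item()
```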

The contribution guide says to discuss such additions first, so tell me what you think!

@IgorSusmelj
Contributor

Hi @EelcoHoogendoorn,
Thank you for your kind words and for proposing the new methods. Give us a day or two to look into the papers.
They look very interesting :)

@philippmwirth
Contributor

Hi @EelcoHoogendoorn and thank you for contacting us :) We would be more than happy about your contribution!

I quickly glanced over the papers you listed and both are very closely related to what we're trying to do:

  • Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
    Interpretability of most self-supervised frameworks is still very poor. A low loss does not always correlate with good representations. Our goal is that our users know how to get the best representations for their dataset, and the proposed loss function could help us do exactly that.
  • Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage
    This paper is also very interesting as it addresses the fact that not many of our users have access to multiple GPUs, while self-supervised training often relies on large batch sizes. However, this seems to be a more involved topic and it would probably be good to get a more in-depth understanding of what an implementation could look like.

Therefore, I'd suggest we create two follow-up issues (one for each paper). Since the implementation of the alignment & uniformity loss is rather straightforward and you seem to have experience with it, you could start working on it anytime. In the meantime, we could discuss what an implementation of the second paper would look like and where it would fit in the package.

Would that work for you?

@EelcoHoogendoorn
Contributor Author

Yeah; the second paper is more subtle from an implementation point of view; it involves some low-level hackery that might clash with the design decisions of lightly. Or maybe not; I haven't looked at it deeply enough to tell. I'll get started with the first paper, for sure.

@philippmwirth
Contributor

philippmwirth commented Mar 18, 2021

Great, I'll create the follow-up issue for the alignment and uniformity loss. Let me know if I can help you in any other way!

The issue is open: #250. @EelcoHoogendoorn, just leave a quick comment there once you have started so I can assign it to you :)

@EelcoHoogendoorn
Contributor Author

EelcoHoogendoorn commented Mar 19, 2021

On the topic of feature proposals: I don't see any issues mentioning transformers/attention yet. The official CLIP repo contains a pretty clean torch implementation thereof, and other minor variations like the BotNet paper should be easy to integrate as parametrisable options as well. Training a pure transformer model from scratch is kind of an exercise in masochism if you ask me, and has been shown to work poorly unless you have CLIP-scale data and training resources to match. But hybrid conv-transformer architectures appear to be more data-efficient than plain convnets in a number of benchmarks I've seen, so this could be a very practical addition.

@philippmwirth
Contributor

To my understanding, BotNet is more of an architectural choice for the backbone neural network. We purposely left the backbone for self-supervised training as a parameter so that any user can bring their own. Having said that, it's true that we currently offer the ResNet backbone as a default architecture and it might make sense to switch the default to something better in the future.

@EelcoHoogendoorn
Contributor Author

Yeah, I can imagine that supporting every model out there isn't something you want in the scope of your project, and that a focus on the functionality specific to self-supervised learning is enough of a challenge.

The reason I found this library was frustration with the incomplete comparisons (more often than not, I suspect, intentionally so) made in the recently published literature. A library like this is a nice platform for independent benchmarking using standardized and community-reviewed implementations, providing a living document that could be much more useful than any old paper. Keeping up with the state of the art in vision is going to require transformers going forward, I think.

But yeah, I don't know if it's in scope or not; adding the ViT/CLIP models specifically wouldn't be a lot of work since there is quality code out there, and I do think it would add to the utility of the library.

@EelcoHoogendoorn
Contributor Author

CLIP's transformer implementation aside, adding support for CLIP's multi-modal loss function would fall within the scope of the project, I imagine? I guess you can make an argument both ways: that CLIP is a supervised method in the sense that it comes with image-text pairs; but I'd say it isn't, since the parallel training of the language model is rather the essence of it. Leaving text-image pairs aside, I bet there are other modality pairs that could be trained in a similar contrastive manner, and for which an actual public dataset exists.

@philippmwirth
Contributor

Thanks for your input, you're making some good points. Let me read up and get back to you.

@IgorSusmelj
Contributor

Regarding the backbones: I don't think it's necessary to implement them directly in lightly. The framework is already very flexible. It's very easy to use the models from timm, which also has transformer architectures (https://github.com/rwightman/pytorch-image-models).
I do think, though, that it would add a lot of value to have more examples of how to use them.
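Something along these lines, for instance; a minimal sketch with a generic SimCLR-style projection head, not necessarily the exact head lightly expects:

```python
import timm
import torch.nn as nn

# Any timm model with the classifier stripped (num_classes=0) acts as a
# pooled feature extractor, including the transformer architectures.
backbone = timm.create_model('vit_small_patch16_224', num_classes=0)

# Generic two-layer projection head on top of the backbone features;
# the exact head and wiring lightly expects may differ.
projection_head = nn.Sequential(
    nn.Linear(backbone.num_features, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)
model = nn.Sequential(backbone, projection_head)
```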

I like multi-modal. We should definitely look into it. Who knows, maybe one day you could combine lightly with huggingface :)

@philippmwirth
Contributor

I agree, we could highlight the flexibility more and maybe even add some benchmarks with other backbones?

Multi-modal should be very easy to implement (from scratch or simply as a symmetrized version of the NTXEntLoss).
So we can open up an issue for this as well :)
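For concreteness, the symmetrized version could look roughly like this; a sketch with illustrative names, not lightly's actual NTXEntLoss API:

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb and text_emb are (N, D) tensors; row i of each is a pair.
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions, as in CLIP: each image should
    # match its own text, and each text its own image.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```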

@philippmwirth
Contributor

I opened a ticket for the multi-modal loss function: #264

@EelcoHoogendoorn
Contributor Author

While playing around over the last few weeks and investigating collapse, I decided to try IterNorm. It seems to work pretty well.

Today I discovered these guys beat me to it:
https://arxiv.org/pdf/2007.06346.pdf

It seems like a pretty elegant loss function that could also be included. It does not involve any hyperparameters, which is nice. I'm not sure about all the sub-batch mumbo jumbo; it seems like they are trying to fix the same problems IterNorm already solved.
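For anyone skimming, the gist as I read it, in a loose sketch; the full-batch eigendecomposition below stands in for the paper's Cholesky / sub-batch machinery, and the names are mine:

```python
import torch

def whiten(z, eps=1e-5):
    # ZCA-whiten a batch of embeddings so their covariance is the identity.
    z = z - z.mean(dim=0)
    cov = z.t() @ z / (z.size(0) - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov + eps * torch.eye(z.size(1), device=z.device))
    return z @ (eigvecs @ torch.diag(eigvals.rsqrt()) @ eigvecs.t())

def whitening_mse_loss(z1, z2):
    # MSE between whitened embeddings of two views of the same images;
    # positive pairs only, with no negatives and no temperature.
    z = whiten(torch.cat([z1, z2]))
    v1, v2 = z.chunk(2)
    return (v1 - v2).pow(2).sum(dim=1).mean()
```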

@philippmwirth
Contributor

Hi @EelcoHoogendoorn, I will close this issue due to inactivity. Feel free to reopen it if you think we haven't addressed it properly.
