Feature proposals #249
Hi @EelcoHoogendoorn and thank you for contacting us :) We would be more than happy about your contribution! I quickly glanced over the papers you listed and both are very closely related to what we're trying to do:
Therefore, I'd suggest we create two follow-up issues (one for each paper). Since the implementation of the alignment & uniformity loss is rather straightforward and you seem to have experience with it, you could start working on it anytime. In the meantime, we could discuss what an implementation of the second paper would look like and where it would fit in the package. Would that work for you?
Yeah; the second paper is more subtle from an implementation point of view; the implementation involves some low-level hackery that might clash with design decisions of lightly. Or maybe not; I haven't looked at it deeply enough to tell. I'll get started with the first paper first, for sure.
Great, I'll create the follow-up issue for the alignment and uniformity loss. Let me know if I can help you in any other way! The issue is open: #250. @EelcoHoogendoorn, just leave a quick comment there once you have started so I can assign it to you :)
On the topic of feature proposals; I don't see any issues mentioning transformers/attention yet. The official CLIP repo contains a pretty clean torch implementation thereof; and other minor variations like the BotNet paper should be easy to integrate as parametrisable options as well. Training a pure transformer model from scratch is kind of an exercise in masochism if you ask me, and has been shown to work poorly, unless you have CLIP-scale data and training resources to match. But hybrid conv-transformer architectures appear to be more data-efficient than plain convnets in a number of benchmarks I've seen, so they could be a very practical addition.
To my understanding, BotNet is more of an architectural choice for the backbone neural network. We purposely left the backbone for self-supervised training as a parameter so that any user can bring their own. Having said that, it's true that we currently offer the ResNet backbone as a default architecture and it might make sense to switch the default to something better in the future.
Yeah, I can imagine that supporting every model out there isn't something you want as part of the scope of your project; and I can imagine that a focus on the functionality specific to self-supervised learning is enough of a challenge. The reason I found this library was frustration with the incomplete comparisons (more often than not, I suspect, intentional) made in the recently published literature. A library like this is a nice platform for independent benchmarking using standardized and community-reviewed implementations, providing a living document that could be much more useful than any old paper. Keeping up with the state of the art in vision is going to require transformers going forward, I think. But yeah, I don't know if it's in scope or not; adding the ViT/CLIP models specifically wouldn't be a lot of work since there is quality code out there, and I do think it would add to the utility of the library.
CLIP's transformer implementation aside, adding support for CLIP's multi-modal loss function would fall under the scope of the project, I imagine? I guess you can make an argument both ways: that CLIP is a supervised method in the sense that it comes with image-text pairs; but I'd say it isn't, since the parallel training of the language model is rather the essence of it. Leaving aside text-image pairs, I bet there are other multi-modal datasets that could be trained in a similar contrastive manner, for which an actual public dataset exists.
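For reference, the symmetric image-text contrastive objective from the CLIP paper is short to sketch. The numpy version below is only an illustration of the idea, not lightly's or CLIP's actual API; the function names and the fixed temperature are my own choices:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an image-text similarity matrix (CLIP-style sketch)."""
    # L2-normalize both modalities so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N); matching pairs on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerically stable log-softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

In a real implementation the embeddings would come from two separate encoders (image and text towers) trained jointly, and the temperature is usually a learned parameter.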
Thanks for your input, you're making some good points. Let me read up and get back to you.
Regarding the backbones: I don't think it's necessary to implement them directly in lightly. The framework is already very flexible, and it's very easy to use the models from timm, which also has transformer architectures (https://github.com/rwightman/pytorch-image-models). I like multi-modal. We should definitely look into it. Who knows, maybe one day you could combine lightly with huggingface :)
I agree, we could highlight the flexibility more and maybe even add some benchmarks with other backbones? Multi-modal should be very easy to implement (from scratch or simply as a symmetrized version of the
I opened a ticket about the multi-modal loss function #264 |
Playing around over the last few weeks and investigating collapse, I decided to experiment with IterNorm. It seems to work pretty well. Today I discovered these guys beat me to it: Seems like a pretty elegant loss function that could also be included. It does not include any hyperparameters, which is nice. Not sure about all the sub-batch mumbo jumbo; it seems like they are trying to fix the same problems IterNorm already solved.
Hi @EelcoHoogendoorn I will close this issue due to inactivity. Feel free to re-open it if you think we haven't addressed it properly. |
Hi; first of all thanks for making this library; it looks very useful.
There are two papers that I'd like to try and implement in your repo:
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
This paper introduces a contrastive loss function with some quite different properties from the more-cited ones. Out of all the loss functions I've tried for my projects, this is the one I had the most success with, and also the one that a priori seems the most sensible to me of what's currently been published. The paper provides source code, and it's really a trivially implemented additional loss function. I've got experience implementing it, plus some generalizations I came up with that I could provide. It should be an easy addition to this package. My motivation for contributing is a selfish one, as having it included here in the benchmarks would help me better understand its relative strengths and weaknesses on different types of datasets.
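For concreteness, the two terms from the paper decompose easily. Below is a minimal numpy sketch assuming L2-normalized embeddings; the exponents alpha=2 and t=2 are the paper's defaults, and the function names are my own:

```python
import numpy as np

def align_loss(x, y, alpha=2):
    """Alignment: mean distance between positive pairs (rows of x and y)."""
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniform_loss(x, t=2):
    """Uniformity: log of the mean pairwise Gaussian potential.

    Minimized when the embeddings are spread uniformly over the hypersphere.
    """
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)          # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

The total objective is a weighted sum, e.g. `align_loss(x, y) + uniform_loss(x)`; in a torch version the same expressions run on tensors with `torch.pdist` replacing the pairwise-distance computation.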
Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage
I also recently came across this paper. It also provides (rather ugly, imo) torch code. Implementing it in this package first would be easier for me than implementing it in my private repos, since it'd be easier to properly test the implementation using the infrastructure provided here. The title is quite descriptive: given the importance of batch size for within-batch mining techniques, and the fact that many of us are working with single GPUs, being able to scale batch sizes arbitrarily is super useful, and I think this paper has the right idea of how to go about it.
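The core trick of the paper is gradient caching: embed the full batch in small chunks without storing activations, compute the contrastive loss gradient with respect to all embeddings at once, then re-encode each chunk and backpropagate its cached embedding gradient. Here is a toy numpy illustration using a linear encoder so the chain rule can be written by hand (a real implementation would use torch autograd, as the paper's code does; all names and the toy loss are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, E, CHUNK = 64, 8, 4, 16            # batch, input dim, embed dim, sub-batch size
X = rng.normal(size=(N, D))
W = rng.normal(size=(E, D)) * 0.1        # toy linear "encoder": z = x @ W.T

# Pass 1: embed the full batch in small chunks, keeping only the embeddings.
# (In a deep net this pass would run under torch.no_grad(), so peak memory
# depends on CHUNK, not on N.)
Z = np.concatenate([X[i:i + CHUNK] @ W.T for i in range(0, N, CHUNK)])

# Compute the loss and its gradient w.r.t. the embeddings on the FULL batch.
# Toy loss: mean squared embedding norm; dL/dZ = 2 Z / N.
loss = np.mean(np.sum(Z ** 2, axis=1))
dL_dZ = 2.0 * Z / N

# Pass 2: re-encode each chunk and backpropagate the cached embedding
# gradients through it, accumulating the parameter gradient chunk by chunk.
dL_dW = np.zeros_like(W)
for i in range(0, N, CHUNK):
    dL_dW += dL_dZ[i:i + CHUNK].T @ X[i:i + CHUNK]   # chain rule for z = x @ W.T

# Sanity check: identical to the gradient of a single full-batch pass.
dL_dW_full = dL_dZ.T @ X
assert np.allclose(dL_dW, dL_dW_full)
```

The point is that the loss still "sees" all N embeddings at once (which is what within-batch mining needs), while the memory-heavy encoder passes only ever hold one chunk.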
The contribution guide mentions discussing such additions first; so tell me what you think!