It's a theoretical question; of course the first thought is to change the architecture from an RN50 to a transformer if one aims for better overall performance. That said, there is a limit to the complexity in the data that an RN50 image encoder is able to learn. Could the performance of an RN50 image encoder trained on a larger LAION-2B or LAION-5B dataset somehow be estimated? What are the expected performance gains? Perhaps someone in this group has even tried to train the RN50 image encoder on these large LAION datasets?
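Short of actually training, one rough way to attempt such an estimate is to fit a scaling curve to zero-shot accuracies of existing RN50 checkpoints trained on increasing amounts of data and extrapolate to LAION-2B/5B scale. A minimal sketch, assuming the error (1 - accuracy) falls off as a power of the number of samples seen; the data points below are placeholders, not real results, and would need to be replaced with measured numbers:

```python
import numpy as np

# PLACEHOLDER (samples seen, ImageNet zero-shot top-1) pairs for RN50 checkpoints
# trained on increasingly large datasets -- substitute real measurements here.
samples = np.array([3e6, 12e6, 400e6])
accuracy = np.array([0.17, 0.31, 0.60])

# Assumption: log(1 - acc) is roughly linear in log(samples), i.e. the error
# follows a power law in dataset size.
slope, intercept = np.polyfit(np.log(samples), np.log(1.0 - accuracy), 1)

for n in (2e9, 5e9):  # LAION-2B / LAION-5B scale
    pred = 1.0 - np.exp(intercept + slope * np.log(n))
    print(f"{n:.0e} samples -> predicted top-1 ~ {pred:.3f}")
```

The power-law assumption itself is the weak point: accuracy saturates as the architecture's capacity is reached, which is exactly the "limit of complexity" concern above, so such an extrapolation gives at best an optimistic upper bound.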
@PatCH0816 they might do better, but given their performance relative to the ViT models in the original paper (for the amount of compute), it didn't seem that enticing. In case you haven't noticed, those ResNets aren't like normal ResNets: they have a fairly large self-attention pooling layer at the end, which fixes their input resolution (unless you interpolate the pos embed), just like a ViT.
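A rough sketch of what that pos embed interpolation could look like is below. It assumes the OpenCLIP-style modified ResNet layout, where `visual.attnpool.positional_embedding` has shape `(grid*grid + 1, dim)` with the pooled token first; the function name and attribute access are illustrative, not a supported API.

```python
import torch
import torch.nn.functional as F

def interpolate_attnpool_pos_embed(attnpool, new_grid: int):
    """Resize the attention-pool positional embedding so the modified ResNet
    can take a different input resolution. Assumes the embedding is laid out
    as (old_grid*old_grid + 1, dim) with the pooled/CLS-style token first."""
    pos = attnpool.positional_embedding.detach()   # (old_grid**2 + 1, dim)
    cls_tok, grid_tok = pos[:1], pos[1:]
    old_grid = int(grid_tok.shape[0] ** 0.5)
    dim = grid_tok.shape[1]

    # reshape to (1, dim, old_grid, old_grid), resample spatially, flatten back
    grid_tok = grid_tok.reshape(old_grid, old_grid, dim).permute(2, 0, 1).unsqueeze(0)
    grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_tok = grid_tok.squeeze(0).permute(1, 2, 0).reshape(new_grid * new_grid, dim)

    attnpool.positional_embedding = torch.nn.Parameter(
        torch.cat([cls_tok, grid_tok], dim=0))

# e.g. for an RN50 (32x downsample) at 288px input: new grid is 288 // 32 = 9
# interpolate_attnpool_pos_embed(model.visual.attnpool, 9)
```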
I have recently pushed some ConvNeXt-Base models. The `convnext_base_w` models (see open_clip/src/open_clip/pretrained.py, lines 162 to 171 at commit 2ca5893) at 256x256 are sized to be roughly equivalent in compute to the RN50x4 in the original paper. They perform quite a bit better than that when trained with LAION-2B, and they are better than the ViT-B-16 for ImageNet zero-shot. More tests need to be performed…
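For anyone wanting to try those weights, a minimal zero-shot sketch with open_clip might look like the following; the pretrained tag is an assumption, so check `open_clip.list_pretrained()` for the exact names currently published:

```python
import torch
import open_clip
from PIL import Image

# model/pretrained tag pair is an assumption -- verify with open_clip.list_pretrained()
model, _, preprocess = open_clip.create_model_and_transforms(
    "convnext_base_w", pretrained="laion2b_s13b_b82k")
tokenizer = open_clip.get_tokenizer("convnext_base_w")
model.eval()

labels = ["a dog", "a cat", "a car"]
text = tokenizer([f"a photo of {label}" for label in labels])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # any local test image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```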