Description
After reading the CLIP paper, I'm highly impressed by its ability to perform zero-shot transfer and generalize across image-text tasks without task-specific fine-tuning. The contrastive learning objective, combined with large-scale pretraining on internet image-text pairs, lets CLIP match the accuracy of a supervised ResNet-50 on ImageNet without using any of ImageNet's labeled training examples, which is a significant achievement.
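For anyone else reading along, here is my understanding of the training objective described above: a symmetric contrastive (InfoNCE-style) loss over a batch of image-text pairs, where matched pairs sit on the diagonal of the similarity matrix. This is a minimal NumPy sketch based on my reading of the paper, not code from this repo; the function name and toy data are my own.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits (InfoNCE-style sketch)."""
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def xent_diagonal(l):
        # Cross-entropy where the correct class for row i is column i
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))

# Toy check: near-identical pairs should give a much lower loss than shuffled pairs
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))       # well-matched pairs
loss_matched = clip_contrastive_loss(img, txt)
loss_shuffled = clip_contrastive_loss(img, txt[::-1])
```

If that reading is right, it also explains why the batch size matters so much: each pair's negatives come from the rest of the batch.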
However, I have a few questions regarding future improvements:
1. Model variants: Are there any plans to release additional CLIP model variants with different architectures or training strategies?
2. Fine-tuning support: While CLIP excels at zero-shot learning, is there an official recommendation or upcoming support for fine-tuning it on specific datasets?
3. Performance on complex queries: Have there been any internal evaluations or planned improvements for handling more complex, multi-part queries?
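On the fine-tuning question, the pattern I've seen used (the paper's linear-probe evaluation, not official fine-tuning guidance) is to train a small classifier on frozen CLIP features. A minimal sketch, assuming you've already precomputed a feature matrix; the helper name and toy data are hypothetical:

```python
import numpy as np

def train_linear_probe(features, labels, lr=0.5, steps=300):
    """Logistic-regression probe on frozen features (binary case for brevity)."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        z = features @ w + b
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid
        grad = p - labels              # dL/dz for the log loss
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

# Toy stand-in for CLIP embeddings: two separable clusters
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(-1.0, 0.3, size=(50, 4)),
                        rng.normal(+1.0, 0.3, size=(50, 4))])
labels = np.concatenate([np.zeros(50), np.ones(50)])
w, b = train_linear_probe(feats, labels)
preds = (feats @ w + b > 0).astype(float)
accuracy = (preds == labels).mean()
```

In practice I'd use scikit-learn's logistic regression on the real features, but I'm curious whether end-to-end fine-tuning is recommended over probes like this.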
Looking forward to any insights on these points. Thanks for the amazing work!