smolvlm2 #2690
Conversation
merveenoyan left a comment:
otherwise looks good!
smolvlm2.md
Outdated
> To demonstrate our vision in small video models, we've built three practical applications that showcase the versatility of these models.
>
> ### iPhone Video Understanding
we should move the use cases right after the section "SmolVLM2 2.2B: Our New Star Player for Vision and Video", and put the ToC right here imo for better readability
also no need to make the collection a subsection, we can just mention it in the TL;DR
I like the idea of diving into the applications. Do we really need a ToC for this post? (Just unsure)
adjusted
> thumbnail: /blog/assets/paligemma2/thumbnail.png
> author: ariG23498
> - date: Feb 19, 2024
> + date: Feb 19, 2025
Thanks!
smolvlm2.md
Outdated
> - [Technical Details](#technical-details)
> - [SmolVLM2 2.2B: Our New Star Player for Vision and Video](#smolvlm2-22b-our-new-star-player-for-vision-and-video)
> - [Going Even Smaller: Meet the 500M and 256M Video Models](#going-even-smaller-meet-the-500m-and-256m-video-models)
> - [Using SomlVLM2 and Fine-tuning it with Transformers and MLX](#using-somlvlm2-and-fine-tuning-it-with-transformers-and-mlx)
Suggested change:
- - [Using SomlVLM2 and Fine-tuning it with Transformers and MLX](#using-somlvlm2-and-fine-tuning-it-with-transformers-and-mlx)
+ - [Using SmolVLM2 and Fine-tuning it with Transformers and MLX](#using-smolvlm2-and-fine-tuning-it-with-transformers-and-mlx)
adjusted
smolvlm2.md
Outdated
> ## Using SomlVLM2 and Fine-tuning it with Transformers and MLX
Suggested change:
- ## Using SomlVLM2 and Fine-tuning it with Transformers and MLX
+ ## Using SmolVLM2 and Fine-tuning it with Transformers and MLX
adjusted
merveenoyan left a comment:
add call to action
> ## Read More
Suggested change:
- ## Read More
+ ## Read More
+ We are looking forward to seeing all the things you'll build with SmolVLM2!
+ If you'd like to learn more about the SmolVLM family of models, feel free to read the following.
adjusted
> ```
>
> ### Fine-tuning SmolVLM2
I will add the video FT script to SmolLM repo today and we can link to my existing notebook for image FT to smol-vision
pcuenca left a comment:
Awesome 🔥 Made a few optional suggestions.
> author: mfarre
> thumbnail: /blog/assets/smolvlm2/banner.png
> date: Feb 20, 2025
> tags:
don't we have one for vlm? Perhaps we should :)
we actually have one, added
_blog.yml
Outdated
> - vision
>
> - local: smolvlm2
>   title: "SmolVLM2 Family: Bringing Video Understanding to Every Device"
| title: "SmolVLM2 Family: Bringing Video Understanding to Every Device" | |
| title: "SmolVLM2: Bringing Video Understanding to Every Device" |
Maybe more concise? Not sure we need the "family" term here, but your call!
100%
smolvlm2.md
Outdated
> @@ -0,0 +1,315 @@
> ---
> title: "SmolVLM2 Family: Bringing Video Understanding to Every Device"
Same (if you agree)
> - user: orrzohar
>   guest: true
>   org: Stanford
> - user: mfarre
The main author should be inferred from _blog.yml, but there was a bug and it was taken from the first author in this list. I think it has been fixed recently, but I'd recommend putting mfarre first just in case.
Suggested change:
- - user: orrzohar
-   guest: true
-   org: Stanford
- - user: mfarre
+ - user: mfarre
+ - user: orrzohar
+   guest: true
+   org: Stanford
set Orr as author
smolvlm2.md
Outdated
> SmolVLM2 represents a fundamental shift in how we think about video understanding - moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
>
> We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready from day zero.
Suggested change:
- We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready from day zero.
+ We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.
I still need to clean up some code and submit a few PRs to the repos 😅 Hope we can get it done in time.
smolvlm2.md
Outdated
> },
> ]
>
> # infer like above
I'd show a complete snippet so people can copy/paste.
added inference code
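For reference, a complete, copy-pasteable snippet along those lines might look like the sketch below. It is not the blog's final code: the checkpoint name, image URL, and generation settings are illustrative, and it assumes SmolVLM2 loads through `AutoModelForImageTextToText` in a recent transformers release.

```python
# Hedged sketch of a complete image-inference snippet; model id, image URL,
# and generation settings are illustrative assumptions.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

# The chat template builds and tokenizes the full multimodal prompt in one call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```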
Still missing cc @merveenoyan
smolvlm2.md
Outdated
> ## Read More
> [SmolVLM2 - Collection with Models and Demos](https://huggingface.co/collections/HuggingFaceTB/smolvlm2-smallest-video-lm-ever-67ab6b5e84bf8aaa60cb17c7)
> [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm)
Maybe replace with a conclusion still. We can still link to the collection there, but not to the previous blog post imo (we can do it elsewhere in the post), since this would not be a more in-depth reading.
smolvlm2.md
Outdated
> ### Transformers
>
> There are two ways to infer SmolVLM2 models, one is through a chat template and the other one gives you more control by passing in media through the processor.
Suggested change:
- There are two ways to infer SmolVLM2 models, one is through a chat template and the other one gives you more control by passing in media through the processor.
+ The easiest way to run inference with the SmolVLM2 models is through the conversational API – applying the chat template takes care of preparing all inputs automatically.
Removing the processor use because we are not showing it later.
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
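As a small aside on "the chat template takes care of preparing all inputs": a hedged illustration below renders the prompt with `tokenize=False` so you can inspect what the processor builds from a conversation before running full generation. The model id is the released 2.2B checkpoint; the image URL is a placeholder, and the exact placeholder tokens depend on the processor's template.

```python
# Minimal sketch: render the chat-template prompt as text without loading the model.
# The image URL is a placeholder; it is not fetched when tokenize=False.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/frame.jpg"},
            {"type": "text", "text": "What is happening here?"},
        ],
    },
]

# tokenize=False returns the formatted prompt string instead of model-ready tensors.
print(processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False))
```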
smolvlm2.md
Outdated
> ## Using SmolVLM2 and Fine-tuning it with Transformers and MLX
Suggested change:
- ## Using SmolVLM2 and Fine-tuning it with Transformers and MLX
+ ## Using SmolVLM2 with Transformers and MLX
maybe?
Super cool! Just some suggestions & nits.
I could also add a Transformers.js example code section soon (requires installing from main at the moment, but I will put out a release for it soon).
> ## TL;DR: SmolVLM can now watch 📺 with even better visual understanding
>
> SmolVLM2 represents a fundamental shift in how we think about video understanding - moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
Suggested change:
- SmolVLM2 represents a fundamental shift in how we think about video understanding - moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
+ SmolVLM2 represents a fundamental shift in how we think about video understanding — moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
smolvlm2.md
Outdated
> We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.
Suggested change:
- We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.
+ We are releasing models in three sizes ([2.2B](https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct), [500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) and [256M](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)), MLX (Python _and_ Swift APIs) and ONNX ready from day zero.
Added links for people to easily navigate to the models. Could replace with a collection link though.
Maybe not needed to mention ONNX. Feel free to remove.
Btw, is it intentional that the naming is different for each model?
e.g., https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct vs. https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct
smolvlm2.md
Outdated
> We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.
>
> To demonstrate our vision in small video models, we've built three practical applications that showcase the versatility of these models.
Suggested change:
- To demonstrate our vision in small video models, we've built three practical applications that showcase the versatility of these models.
+ To demonstrate our vision for small video models, we've built three practical applications that showcase the versatility of these models.
smolvlm2.md
Outdated
> </center>
> </td>
> <td valign="top" style="border: none;">
> We've created an iPhone app running SmolVLM2 completely locally. Using our 500M model, users can analyze and understand video content directly on their device - no cloud required. Interested in building iPhone video processing apps with AI models running locally? We're releasing it very soon - <a href="https://huggingface.co/datasets/HuggingFaceTB/smolvlm2-iphone-waitlist" target="_blank">fill this form to test and build with us!</a>
Suggested change:
- We've created an iPhone app running SmolVLM2 completely locally. Using our 500M model, users can analyze and understand video content directly on their device - no cloud required. Interested in building iPhone video processing apps with AI models running locally? We're releasing it very soon - <a href="https://huggingface.co/datasets/HuggingFaceTB/smolvlm2-iphone-waitlist" target="_blank">fill this form to test and build with us!</a>
+ We've created an iPhone app that runs SmolVLM2 completely locally. Using our 500M model, users can analyze and understand video content directly on their device — no cloud required. Interested in building iPhone video processing apps with AI models running locally? We're releasing it very soon — <a href="https://huggingface.co/datasets/HuggingFaceTB/smolvlm2-iphone-waitlist" target="_blank">fill in this form to test and build with us!</a>
smolvlm2.md
Outdated
> </center>
> </td>
> <td valign="top" style="border: none;">
> Available as a Hugging Face Space, this application takes long-form videos (1+ hours) and automatically extracts the most significant moments. We've tested it extensively with soccer matches and other lengthy events, making it a powerful tool for content summarization. <a href="https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator" target="_blank">Try it yourself in our demo space.</a>
Suggested change:
- Available as a Hugging Face Space, this application takes long-form videos (1+ hours) and automatically extracts the most significant moments. We've tested it extensively with soccer matches and other lengthy events, making it a powerful tool for content summarization. <a href="https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator" target="_blank">Try it yourself in our demo space.</a>
+ Available as a Hugging Face Space, this application takes long-form videos (1+ hours) and automatically extracts the most significant moments. We've tested it extensively with soccer matches and other lengthy events, making it a powerful tool for content summarization. <a href="https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator" target="_blank">Try it yourself with our demo space.</a>
> We are introducing three new models with 256M, 500M and 2.2B parameters. The 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models represent **the smallest video language models ever released**.
>
> While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families on the 2B range and we lead the pack in the even smaller space.
Suggested change:
- While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families on the 2B range and we lead the pack in the even smaller space.
+ While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families in the 2B range and we lead the pack in the even smaller space.
Maybe also add a link to Video-MME on HF?
Ah I see the footnote. Maybe add a link to it here.
> While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families on the 2B range and we lead the pack in the even smaller space.
>
> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2-videomme2.png" width="50%" alt="SmolVLM2 Performance">
smolvlm2.md
Outdated
> When it comes to video tasks, 2.2B is a good bang for the buck. Across the different scientific benchmarks where we evaluated it we want to highlight its performance on Video-MME where it outperforms all existing 2B models.
Suggested change:
- When it comes to video tasks, 2.2B is a good bang for the buck. Across the different scientific benchmarks where we evaluated it we want to highlight its performance on Video-MME where it outperforms all existing 2B models.
+ When it comes to video tasks, 2.2B is a good bang for the buck. Across the various scientific benchmarks we evaluated it on, we want to highlight its performance on Video-MME where it outperforms all existing 2B models.
Maybe a better way to phrase this too. Original wording confused me.
smolvlm2.md
Outdated
> #### Video Inference
>
> You can pass videos through a chat template by passing in {“type”: “video”, “path”:{video_path}”. See below complete example.
Suggested change:
- You can pass videos through a chat template by passing in {“type”: “video”, “path”:{video_path}”. See below complete example.
+ You can pass videos through a chat template by passing in `{"type": "video", "path": video_path}`. See below for a complete example.
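For context, a complete video example along those lines might look like the following sketch. It mirrors the image snippet earlier with a `{"type": "video", "path": ...}` entry swapped in; the checkpoint name and local video path are placeholders, and decoding the video requires a video backend (e.g. PyAV) to be installed.

```python
# Hedged sketch of video inference via the chat template; "path/to/video.mp4"
# is a placeholder, and model id / generation settings are illustrative.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path/to/video.mp4"},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

# The processor samples frames from the video and builds the multimodal prompt.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```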
smolvlm2.md
Outdated
> #### Interleaving Image, Text and Video
>
> You can interleave image, video and text together by passing in `<image>` and `<video>` tokens inside text, cutting text through and inserting image lines in between.
Probably fine to remove this section due to performance, and also it's a pretty niche use-case imo.
> ## Read More
Suggested addition:
+ We would like to thank Raushan Turganbay, Arthur Zucker and Pablo Montalvo Leroux for their contribution of the model to transformers.
added
(I just created it 😂)
merveenoyan left a comment:
added notebook
smolvlm2.md
Outdated
> ### Fine-tuning SmolVLM2
Suggested change:
- You can fine-tune SmolVLM2 on videos using transformers 🤗
+ We have fine-tuned the 500M variant in Colab on video-caption pairs from the [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) dataset for demonstration purposes. Since the 500M variant is small, it's better to apply full fine-tuning instead of QLoRA or LoRA, while you can try QLoRA on the 2.2B variant. You can find the fine-tuning notebook [here](https://github.com/huggingface/smollm/blob/main/vision/finetuning/SmolVLM2_Video_FT.ipynb).

Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
md file. You can also specify `guest` or `org` for the authors. Here is an example of a complete PR: #2382
Getting a Review
Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.
Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.