
Conversation

Contributor

@mfarre mfarre commented Feb 20, 2025

Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check that you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content in https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.

Here is an example of a complete PR: #2382

Getting a Review

Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.

Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.

Contributor

@merveenoyan merveenoyan left a comment

otherwise looks good!

smolvlm2.md Outdated

To demonstrate our vision in small video models, we've built three practical applications that showcase the versatility of these models.

### iPhone Video Understanding
Contributor

we should move the use cases right after section SmolVLM2 2.2B: Our New Star Player for Vision and Video, and put ToC right here imo for better readability

Contributor

also no need to make collection a subsection, we can just mention in TLDR

Member

I like the idea of diving in into the applications. Do we really need a ToC for this post? (Just unsure)

Contributor Author

adjusted

thumbnail: /blog/assets/paligemma2/thumbnail.png
author: ariG23498
date: Feb 19, 2024
date: Feb 19, 2025
Member

Thanks!

smolvlm2.md Outdated
- [Technical Details](#technical-details)
- [SmolVLM2 2.2B: Our New Star Player for Vision and Video](#smolvlm2-22b-our-new-star-player-for-vision-and-video)
- [Going Even Smaller: Meet the 500M and 256M Video Models](#going-even-smaller-meet-the-500m-and-256m-video-models)
- [Using SomlVLM2 and Fine-tuning it with Transformers and MLX](#using-somlvlm2-and-fine-tuning-it-with-transformers-and-mlx)
Contributor

Suggested change
- [Using SomlVLM2 and Fine-tuning it with Transformers and MLX](#using-somlvlm2-and-fine-tuning-it-with-transformers-and-mlx)
- [Using SmolVLM2 and Fine-tuning it with Transformers and MLX](#using-smolvlm2-and-fine-tuning-it-with-transformers-and-mlx)

Contributor Author

adjusted

smolvlm2.md Outdated



## Using SomlVLM2 and Fine-tuning it with Transformers and MLX
Contributor

Suggested change
## Using SomlVLM2 and Fine-tuning it with Transformers and MLX
## Using SmolVLM2 and Fine-tuning it with Transformers and MLX

Contributor Author

adjusted

Contributor

@merveenoyan merveenoyan left a comment

add call to action




## Read More
Contributor

Suggested change
## Read More
## Read More
We are looking forward to seeing all the things you'll build with SmolVLM2!
If you'd like to learn more about the SmolVLM family of models, feel free to read the following.

Contributor Author

adjusted

```

### Fine-tuning SmolVLM2

Contributor

I will add the video FT script to SmolLM repo today and we can link to my existing notebook for image FT to smol-vision

Member

@pcuenca pcuenca left a comment

Awesome 🔥 Made a few optional suggestions.

author: mfarre
thumbnail: /blog/assets/smolvlm2/banner.png
date: Feb 20, 2025
tags:
Member

don't we have one for vlm? Perhaps we should :)

Contributor Author

we actually have one, added

_blog.yml Outdated
- vision

- local: smolvlm2
title: "SmolVLM2 Family: Bringing Video Understanding to Every Device"
Member

Suggested change
title: "SmolVLM2 Family: Bringing Video Understanding to Every Device"
title: "SmolVLM2: Bringing Video Understanding to Every Device"

Maybe more concise? Not sure we need the "family" term here, but your call!

Contributor Author

100%

smolvlm2.md Outdated
@@ -0,0 +1,315 @@
---
title: "SmolVLM2 Family: Bringing Video Understanding to Every Device"
Member

Same (if you agree)

Comment on lines +5 to +8
- user: orrzohar
guest: true
org: Stanford
- user: mfarre
Member

The main author should be inferred from _blog.yml, but there was a bug and it was taken from the first author in this list. I think it has been fixed recently, but I'd recommend putting mfarre first just in case.

Suggested change
- user: orrzohar
guest: true
org: Stanford
- user: mfarre
- user: mfarre
- user: orrzohar
guest: true
org: Stanford

Contributor Author

set Orr as author

smolvlm2.md Outdated

SmolVLM2 represents a fundamental shift in how we think about video understanding - moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.

We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready from day zero.
Member

Suggested change
We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready from day zero.
We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.

I still need to cleanup some code and submit a few PRs to the repos 😅 Hope we can get it done in time.

smolvlm2.md Outdated
},
]

# infer like above
Member

I'd show a complete snippet so people can copy/paste.

Contributor Author

added inference code
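
For readers following along, a complete snippet in that spirit might look like the sketch below. This is not the exact code from the post: the model id is taken from the links in this thread, while the example image URL, prompt, and generation settings are illustrative assumptions.

```python
# Minimal sketch of chat-template inference with transformers (assumptions noted above).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The chat template handles image fetching and preprocessing.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```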

Comment on lines 309 to 311



Member

Still missing cc @merveenoyan

smolvlm2.md Outdated
Comment on lines 313 to 315
## Read More
[SmolVLM2 - Collection with Models and Demos](https://huggingface.co/collections/HuggingFaceTB/smolvlm2-smallest-video-lm-ever-67ab6b5e84bf8aaa60cb17c7)
[SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm)
Member

Maybe replace with a conclusion still. We can still link to the collection there, but not to the previous blog post imo (we can do it elsewhere in the post), since this would not be a more in-depth reading.

smolvlm2.md Outdated

### Transformers

There are two ways to infer SmolVLM2 models, one is through a chat template and the other one gives you more control by passing in media through the processor.
Member

Suggested change
There are two ways to infer SmolVLM2 models, one is through a chat template and the other one gives you more control by passing in media through the processor.
The easiest way to run inference with the SmolVLM2 models is through the conversational API – applying the chat template takes care of preparing all inputs automatically.

Removing the processor use because we are not showing it later.

mfarre and others added 2 commits February 20, 2025 11:56
Co-authored-by: Merve Noyan <merveenoyan@gmail.com>
smolvlm2.md Outdated



## Using SmolVLM2 and Fine-tuning it with Transformers and MLX
Member

Suggested change
## Using SmolVLM2 and Fine-tuning it with Transformers and MLX
## Using SmolVLM2 with Transformers and MLX

maybe?

Contributor

@xenova xenova left a comment

Super cool! Just some suggestions & nits.

I could also add a Transformers.js example code section soon (requires installing from main at the moment, but I will put out a release for it soon).


## TL;DR: SmolVLM can now watch 📺 with even better visual understanding

SmolVLM2 represents a fundamental shift in how we think about video understanding - moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
Contributor

Suggested change
SmolVLM2 represents a fundamental shift in how we think about video understanding - moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
SmolVLM2 represents a fundamental shift in how we think about video understanding – moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.

smolvlm2.md Outdated
Comment on lines 22 to 23
We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.

Contributor

Suggested change
We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.
We are releasing models in three sizes ([2.2B](https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct), [500M](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) and [256M](https://huggingface.co/HuggingFaceTB/SmolVLM2-256M-Video-Instruct)), MLX (Python _and_ Swift APIs) and ONNX ready from day zero.

Added links for people to easily navigate to the models. Could replace with a collection link though.

Maybe not needed to mention ONNX. Feel free to remove.

Btw, is it intentional that the naming is different for each model?
e.g., https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct vs. https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct

smolvlm2.md Outdated
We are releasing models in three sizes (2.2B, 500M and 256M), MLX ready (Python _and_ Swift APIs) from day zero.


To demonstrate our vision in small video models, we've built three practical applications that showcase the versatility of these models.
Contributor

Suggested change
To demonstrate our vision in small video models, we've built three practical applications that showcase the versatility of these models.
To demonstrate our vision for small video models, we've built three practical applications that showcase the versatility of these models.

smolvlm2.md Outdated
</center>
</td>
<td valign="top" style="border: none;">
We've created an iPhone app running SmolVLM2 completely locally. Using our 500M model, users can analyze and understand video content directly on their device - no cloud required. Interested in building iPhone video processing apps with AI models running locally? We're releasing it very soon - <a href="https://huggingface.co/datasets/HuggingFaceTB/smolvlm2-iphone-waitlist" target="_blank">fill this form to test and build with us!</a>
Contributor

Suggested change
We've created an iPhone app running SmolVLM2 completely locally. Using our 500M model, users can analyze and understand video content directly on their device - no cloud required. Interested in building iPhone video processing apps with AI models running locally? We're releasing it very soon - <a href="https://huggingface.co/datasets/HuggingFaceTB/smolvlm2-iphone-waitlist" target="_blank">fill this form to test and build with us!</a>
We've created an iPhone app that runs SmolVLM2 completely locally. Using our 500M model, users can analyze and understand video content directly on their device – no cloud required. Interested in building iPhone video processing apps with AI models running locally? We're releasing it very soon – <a href="https://huggingface.co/datasets/HuggingFaceTB/smolvlm2-iphone-waitlist" target="_blank">fill in this form to test and build with us!</a>

smolvlm2.md Outdated
</center>
</td>
<td valign="top" style="border: none;">
Available as a Hugging Face Space, this application takes long-form videos (1+ hours) and automatically extracts the most significant moments. We've tested it extensively with soccer matches and other lengthy events, making it a powerful tool for content summarization. <a href="https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator" target="_blank">Try it yourself in our demo space.</a>
Contributor

Suggested change
Available as a Hugging Face Space, this application takes long-form videos (1+ hours) and automatically extracts the most significant moments. We've tested it extensively with soccer matches and other lengthy events, making it a powerful tool for content summarization. <a href="https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator" target="_blank">Try it yourself in our demo space.</a>
Available as a Hugging Face Space, this application takes long-form videos (1+ hours) and automatically extracts the most significant moments. We've tested it extensively with soccer matches and other lengthy events, making it a powerful tool for content summarization. <a href="https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator" target="_blank">Try it yourself with our demo space.</a>


We are introducing three new models with 256M, 500M and 2.2B parameters. The 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models represent **the smallest video language models ever released**.

While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families on the 2B range and we lead the pack in the even smaller space.
Contributor

Suggested change
While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families on the 2B range and we lead the pack in the even smaller space.
While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families in the 2B range and we lead the pack in the even smaller space.

Contributor

Maybe also add a link to Video-MME on HF?

Contributor

Ah I see the footnote. Maybe add a link to it here.


While they're small in size, they outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families on the 2B range and we lead the pack in the even smaller space.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2-videomme2.png" width="50%" alt="SmolVLM2 Performance">
Contributor

Indeed, and it seems to only be for some text.

smolvlm2.md Outdated



When it comes to video tasks, 2.2B is a good bang for the buck. Across the different scientific benchmarks where we evaluated it we want to highlight its performance on Video-MME where it outperforms all existing 2B models.
Contributor

Suggested change
When it comes to video tasks, 2.2B is a good bang for the buck. Across the different scientific benchmarks where we evaluated it we want to highlight its performance on Video-MME where it outperforms all existing 2B models.
When it comes to video tasks, 2.2B is a good bang for the buck. Across the various scientific benchmarks we evaluated it on, we want to highlight its performance on Video-MME where it outperforms all existing 2B models.

Maybe a better way to phrase this too. Original wording confused me.

smolvlm2.md Outdated

#### Video Inference

You can pass videos through a chat template by passing in {“type”: “video”, “path”:{video_path}”. See below complete example.
Contributor

Suggested change
You can pass videos through a chat template by passing in {“type”: “video”, “path”:{video_path}. See below complete example.
You can pass videos through a chat template by passing in `{"type": "video", "path": {video_path}`. See below for a complete example.
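
For reference, a complete example following that pattern might look like the sketch below. This is not the post's own snippet: the model id comes from the links earlier in this thread, and the video path, prompt, and generation settings are placeholders.

```python
# Minimal sketch of video inference via the chat template (placeholders noted above).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

video_path = "path/to/video.mp4"  # placeholder: any local video file
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

# The chat template extracts and preprocesses video frames automatically.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```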

smolvlm2.md Outdated

#### Interleaving Image, Text and Video

You can interleave image, video and text together by passing in `<image>` and `<video>` tokens inside text, cutting text through and inserting image lines in between.
Contributor

Probably fine to remove this section due to performance, and also it's a pretty niche use-case imo.



## Read More

Contributor

Suggested change
We would like to thank Raushan Turganbay, Arthur Zucker and Pablo Montalvo Leroux for their contribution of the model to transformers.

Contributor Author

added

mfarre and others added 2 commits February 20, 2025 14:09
(I just created it 😂)
Contributor

@merveenoyan merveenoyan left a comment

added notebook

smolvlm2.md Outdated

### Fine-tuning SmolVLM2


Contributor

Suggested change
You can fine-tune SmolVLM2 on videos using transformers 🤗
We have fine-tuned the 500M variant in Colab on video-caption pairs from the [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) dataset for demonstration purposes. Since the 500M variant is small, it's better to apply full fine-tuning instead of QLoRA or LoRA; meanwhile, you can try applying QLoRA to the 2.2B variant. You can find the fine-tuning notebook [here](https://github.com/huggingface/smollm/blob/main/vision/finetuning/SmolVLM2_Video_FT.ipynb).
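
As a rough illustration of the QLoRA route mentioned for the 2.2B variant, the setup might look something like the sketch below. This is an assumption-laden outline, not the notebook's code: the target module names and hyperparameters are guesses for a Llama-style text backbone.

```python
# Rough QLoRA-style setup sketch for the 2.2B variant (hyperparameters are assumptions).
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

# Load the base model in 4-bit to keep memory low (QLoRA-style quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Attach small trainable LoRA adapters; module names assume a Llama-style backbone.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```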

@mfarre mfarre merged commit 6deaa9d into main Feb 20, 2025
1 check passed
@mfarre mfarre deleted the miquel/smolvlm2 branch February 20, 2025 13:47