# Set up your environment

Before we can run any code, we need to set up our development environment. This means installing libraries -- or other people's code -- that our own code will call.

`pip` is the package management tool for the Python programming language.

`torch` (PyTorch) is one of the primary machine learning libraries used to build and run AI models.

`transformers` is the Hugging Face library that makes it easy to load and use different models in your own code.

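`pillow` is an imaging library used to open, display, create, and manipulate images in Python.

`ipywidgets` provides the interactive widgets that Jupyter uses to render things like progress bars.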

In [1]:
%pip install -U torch transformers pillow ipywidgets

Note: you may need to restart the kernel to use updated packages.
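
After restarting the kernel, it can help to confirm that the packages are actually installed. The cell below is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name introduced here for illustration, not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing.

    Hypothetical helper for this notebook; it only reads package metadata
    and does not import the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names exactly as they were passed to pip in the install cell.
for name in ("torch", "transformers", "pillow", "ipywidgets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line prints `not installed`, re-run the install cell and restart the kernel before continuing.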


# Import Statements

Our code will call code from other libraries. These libraries need to be imported before they can be used. It is best practice to place `import` statements at the top of your code, before anything you write yourself.

In [2]:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image


# Choose a Model

AI advancements are being made every day, and AI models are constantly being created/updated. When choosing a model, there are a few key factors to consider:

1. The most important factor in choosing a model is picking one that can **solve your problem given your type of inputs**. If you want to classify images, you'll need a vision model; a chatbot needs a model with Text-to-Text capabilities; some models are "multimodal", meaning they can output predictions from multiple types of input (text, video, audio, images, etc.).
2. The next most important factor is picking a model that's the **right "size" for your environment**. The size of a model is usually measured in the number of **parameters** it uses, ranging from hundreds of thousands for the smallest models to hundreds of billions for the largest. More parameters require more RAM. In my experience, a newer laptop with 48GB of RAM can run a model with 20B parameters slowly; mobile devices can run models up to 7B parameters.
3. Finally, test different models against your data. The models you get from HuggingFace are **pretrained**. The output generated by a model depends a lot on the data it saw during the pretraining step. Since different model providers use different data sets during pretraining, you won't know what kind of response you'll get from a model until you try it!
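
To make the link between parameter count and RAM concrete, here is a rough back-of-the-envelope sketch. It assumes memory is dominated by the weights (parameters times bytes per parameter) and ignores activations and other runtime overhead, so treat the results as a lower bound. `weight_memory_gb` and `BYTES_PER_PARAM` are hypothetical names introduced for illustration.

```python
# Approximate storage cost of one parameter in common numeric formats.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,  # 4-bit quantization packs two parameters into one byte
}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Estimate the memory needed for the weights alone, in gigabytes (1e9 bytes).

    This is an assumption-laden estimate: real usage is higher because of
    activations, caches, and framework overhead.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# The two model sizes mentioned in the list above.
for label, num_params in (("7B", 7e9), ("20B", 20e9)):
    estimates = ", ".join(
        f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB"
        for dtype in ("float32", "bfloat16", "int4")
    )
    print(f"{label} parameters -> {estimates}")
```

By this estimate, a 20B-parameter model in `bfloat16` needs roughly 40 GB for its weights, which is why it only just fits on a 48GB machine. Smaller numeric formats reduce the footprint further, usually at some cost in output quality.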

In [3]:
MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"  # @param ["HuggingFaceTB/SmolVLM-Instruct", "moonshotai/Kimi-VL-A3B-Instruct", "Qwen/Qwen2.5-VL-7B-Instruct"] {"allow-input": true, "isTemplate": true}
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [5]:
# Load the processor and weights for the model selected above
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # 2 bytes per parameter instead of float32's 4
    # flash_attention_2 requires the separately installed flash-attn package
    attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

model.safetensors:   0%|          | 0.00/4.49G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/136 [00:00<?, ?B/s]