# Deploying Gemma3: Step-by-Step Guide


## Setup

In [1]:
%%capture
!pip install --upgrade vastai
!pip install --upgrade openai

In [5]:
%%bash
export VAST_API_KEY="<YOUR_VAST_API_KEY>"  # replace with your own key; never publish a real key
vastai set api-key $VAST_API_KEY

Your api key has been saved in /root/.config/vastai/vast_api_key


### Choosing Hardware

To deploy the Gemma3 model on Vast.ai, we need an instance that meets the following requirements:

1. GPU memory:
   - Enough to hold the Gemma3 model weights (4B parameters) with headroom for the KV cache; the search below asks for at least 12 GB

2. At least one direct port that we can forward for:
   - vLLM's OpenAI-compatible API server
   - External access to the model endpoint
   - Secure request routing

3. At least 100 GB of disk space to hold the model weights and any other files we download
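
As a rough sanity check on the GPU memory requirement, the weight footprint can be estimated from the parameter count. This is a back-of-envelope sketch, not a measurement: it assumes a nominal 4 billion parameters stored at 16-bit precision (2 bytes each), and the helper name `estimate_weight_memory_gb` is made up for illustration. Real usage is higher because the KV cache and activations also need memory.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal 4B parameters at 16-bit precision.
weights_gb = estimate_weight_memory_gb(4e9)
print(f"Weights: ~{weights_gb:.0f} GB")
```

Under these assumptions the weights take about 8 GB, which leaves some room on a 12 GB card for everything else.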

In [11]:
%%bash
vastai search offers "compute_cap >= 750 \
gpu_ram >= 12 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 100 \
rentable = true"

ID        CUDA   N  Model        PCIE  cpu_ghz  vCPUs     RAM  Disk  $/hr    DLP    DLP/$   score  NV Driver   Net_up  Net_down  R      Max_Days  mach_id  status    host_id  ports  country            
19442347  12.8  1x  RTX_5070_Ti  12.8  8.5      10.0     32.1  1521  0.1740  76.1   437.17  316.0  570.133.07  824.1   822.5     99.3   250.7     34604    verified  55116    499    Vietnam,_VN        
18395983  12.6  1x  H100_SXM     54.9  3.7      24.0    290.2  564   1.5370  345.4  224.71  312.6  560.35.05   815.1   735.3     99.8   256.6     32302    verified  169960   2047   India,_IN          
19442135  12.8  1x  RTX_5080     12.8  8.5      10.0     32.1  1511  0.2007  84.4   420.52  312.1  570.133.07  860.2   868.9     99.1   250.7     34619    verified  55116    499    Vietnam,_VN        
19180837  12.4  1x  H200         52.6  4.0      24.0    258.0  2257  3.2009  455.4  142.27  309.9  550.127.05  1530.2  17453.0   99.9   74.4      32676    verified  97732    4999   ,_US           
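
If you capture the search output as text, the cheapest offer can be picked out programmatically. The sketch below assumes the whitespace-separated column layout shown above and only relies on the `ID` and `$/hr` columns; the helper name `cheapest_offer` is hypothetical, and the sample rows are the first ten columns of the results above.

```python
def cheapest_offer(table_text: str) -> tuple[str, float]:
    """Return (offer_id, price_per_hour) for the cheapest row in the table."""
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    # Locate columns by name; both appear before any multi-word header.
    id_col, price_col = header.index("ID"), header.index("$/hr")
    rows = [ln.split() for ln in lines[1:]]
    best = min(rows, key=lambda row: float(row[price_col]))
    return best[id_col], float(best[price_col])


sample = """
ID        CUDA   N  Model        PCIE  cpu_ghz  vCPUs     RAM  Disk  $/hr
19442347  12.8  1x  RTX_5070_Ti  12.8  8.5      10.0     32.1  1521  0.1740
18395983  12.6  1x  H100_SXM     54.9  3.7      24.0    290.2  564   1.5370
19442135  12.8  1x  RTX_5080     12.8  8.5      10.0     32.1  1511  0.2007
"""
offer_id, price = cheapest_offer(sample)
print(offer_id, price)
```

Price is only one criterion; reliability (`R`) and network speed may matter just as much for a serving workload.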

### Deploying the Server via Vast Template

Choose a machine from the search results, then copy its ID into `INSTANCE_ID` below.

We will deploy a template that:
1. Uses the `vllm/vllm-openai:latest` Docker image, which gives us an OpenAI-compatible server.
2. Forwards port `8000`, the default port of vLLM's OpenAI-compatible server, to the outside of the container.
3. Passes `--model google/gemma-3-4b-it --max-model-len 8192` on to the default entrypoint (the server itself).
4. Uses `--tensor-parallel-size 1` by default.
5. Uses `--gpu-memory-utilization 0.90` by default.
6. Ensures that we have 100 GB of disk space.
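
To make the server options listed above concrete, the following sketch assembles them into one argument string. It only illustrates how the flags fit together; the helper `build_vllm_args` is hypothetical, and how the template actually passes these arguments internally is not shown in this guide.

```python
import shlex


def build_vllm_args(
    model: str,
    max_model_len: int,
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.90,
) -> str:
    """Join the vLLM server flags named in the template into one shell-safe string."""
    args = [
        "--model", model,
        "--max-model-len", str(max_model_len),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    return shlex.join(args)


print(build_vllm_args("google/gemma-3-4b-it", 8192))
```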


In [15]:
%%bash
# A newer-architecture GPU such as an L40S, A100, or A6000 is recommended; a T4 will fail with an error.
export INSTANCE_ID='15727356'
vastai create instance $INSTANCE_ID --disk 100 \
  --template_hash c7e768487d8bb520ae2e0bea9844ea0f

Started. {'success': True, 'new_contract': 19520053}
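
The `new_contract` value in this output is the ID of the instance that was just created, and it is the ID needed later to destroy the instance. A minimal sketch for extracting it, assuming the output keeps the `Started. {...}` shape shown above (the helper name `parse_new_contract` is made up for illustration):

```python
import ast


def parse_new_contract(cli_output: str) -> int:
    """Extract the new_contract ID from a 'Started. {...}' line."""
    start = cli_output.index("{")
    end = cli_output.rindex("}") + 1
    # The payload is a Python-style dict literal (single quotes, True), not JSON.
    result = ast.literal_eval(cli_output[start:end])
    if not result.get("success"):
        raise RuntimeError(f"Instance creation failed: {result}")
    return int(result["new_contract"])


instance_id = parse_new_contract("Started. {'success': True, 'new_contract': 19520053}")
print(instance_id)
```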


### Verify Setup

In [16]:
%%bash
export VAST_IP_ADDRESS="64.62.194.198"
export VAST_PORT="20576"
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "google/gemma-3-4b-it",
           "prompt": "Hello, how are you?",
           "max_tokens": 1000
         }'


{"id":"cmpl-d7e958d49b7f4bbe85eddd173c7abf02","object":"text_completion","created":1744966568,"model":"google/gemma-3-4b-it","choices":[{"index":0,"text":"\n\nI'm working on a project that involves a lot of data processing and analysis, specifically with the Pandas library in Python. I'm currently facing an issue where I'm trying to perform a simple operation like filtering rows based on a condition","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":7,"total_tokens":57,"completion_tokens":50,"prompt_tokens_details":null}}
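
The server needs some time to download the weights and start, so the first request may fail if it is sent too early. One way to handle this is to poll until the server responds. The sketch below assumes the vLLM server answers `GET /v1/models` once it is ready; the function names, attempt count, and delay are illustrative choices, not values from the template.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(base_url: str, probe=server_is_up,
                     attempts: int = 30, delay: float = 10.0) -> bool:
    """Call probe(base_url) until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False


# Example (uses the address from the cell above; replace with your own):
# wait_until_ready("http://64.62.194.198:20576")
```

Passing the probe as a parameter keeps the retry loop easy to test without a live server.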



## Usage

### Setup Model

In [22]:
VAST_IP_ADDRESS="64.62.194.198"
VAST_PORT="20576"

openai_api_key = "EMPTY"
openai_api_base = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
model_name = "google/gemma-3-4b-it"

### Request

In [24]:
from openai import OpenAI

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model=model_name,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Hello, how are you today"},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)


Chat completion output: Hello there! I’m doing well, thank you for asking! As an AI, I don’t really *feel* in the same way humans do, but my systems are running smoothly and I’m ready to chat. 😊

How are *you* doing today? Is there anything you’d like to talk about or anything I can help you with?
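
For longer replies it can be nicer to print tokens as they arrive instead of waiting for the full response. The sketch below uses the `stream=True` option of the OpenAI client's chat completions call; the helper `stream_chat` is hypothetical and expects the `client` and `model_name` objects defined in the cells above.

```python
def stream_chat(client, model: str, prompt: str) -> str:
    """Print the reply incrementally and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # content may be None on role-only or final chunks
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)


# Example, reusing the objects from the cells above:
# stream_chat(client, model_name, "Hello, how are you today")
```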


## Delete Machine

In [26]:
%%bash
# Destroy the Vast.ai instance (use the new_contract ID returned by the create command, or look it up in the Vast.ai UI)
export INSTANCE_ID='19520053'
vastai destroy instance $INSTANCE_ID

destroying instance 19520053.
