# **Quickstart Guide for mini-VLA**


* ### *Welcome to mini-VLA! This notebook will help you train a Vision-Language-Action model from scratch.*
> ##### We will:
> 1. Collect Data: Watch an expert script solve a task and record it.
> 2. Train: Teach the AI to mimic that expert behavior.
> 3. Test: See if the AI can solve the task on its own using only visual input.


* ### Prerequisites
> ##### Before running this notebook, ensure you have:
> 1. Cloned the repository: You should be running this file from inside the mini-vla folder.
> 2. Created the Environment:<br>&nbsp;*conda create --name mini-vla python=3.10*<br>&nbsp;*conda activate mini-vla*

#### Let's install the dependencies, if they are not already installed :D

In [1]:
import os
import sys
!pip install -r requirements.txt



* ### Let's ensure we are in the project root
> ##### Ignore lines 4-13; they suppress some harmless warnings to keep the output clean.
> ##### If there is a GPU on your machine, we will use it to
> * speed up training
> * handle more demanding tasks


In [2]:
import torch

# Check for GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
print(f"Current working directory: {os.getcwd()}")
print(f"PyTorch Version: {torch.__version__}")

Using device: cpu
Current working directory: C:\Users\abhay\mini-vla
PyTorch Version: 2.9.1+cpu


* ### **Step 1:** Generating the Dataset
  > We need data to train our model.
  > Since we don't have a *real robot*, we use the *Meta-World simulator*.
  > The command below runs an "expert script", a hard-coded algorithm that knows exactly how to solve the task, and records its movements into a .npz file.
> ##### **Note:** You might see some warnings from the simulator. These are safe to ignore!

In [3]:
print("Collecting expert demonstration data... (This might take a minute)")
!python -W ignore -m scripts.collect_data \
  --env-name push-v3 \
  --camera-name corner \
  --episodes 10 \
  --max-steps 100 \
  --output-path data/metaworld_push_bc.npz

Collecting expert demonstration data... (This might take a minute)
Episode 1/10 finished after 50 steps, success=0
Episode 2/10 finished after 50 steps, success=0
Episode 3/10 finished after 50 steps, success=0
Episode 4/10 finished after 50 steps, success=0
Episode 5/10 finished after 50 steps, success=0
Episode 6/10 finished after 50 steps, success=0
Episode 7/10 finished after 50 steps, success=0
Episode 8/10 finished after 50 steps, success=0
Episode 9/10 finished after 50 steps, success=0
Episode 10/10 finished after 50 steps, success=0
Saved Meta-World push dataset to data/metaworld_push_bc.npz
  images: (500, 480, 480, 3)
  states: (500, 39)
  actions: (500, 4)
  text_ids: (500, 6)
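
Before training, it can help to confirm what the `.npz` file actually contains. The sketch below lists each stored array with its shape and dtype using `numpy.load`. So that it runs without the real dataset, it builds a tiny synthetic file that reuses the key names printed above (`images`, `states`, `actions`, `text_ids`). The dtypes and the small image size are assumptions for illustration, and `summarize_npz` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as data:
        return {
            name: (data[name].shape, str(data[name].dtype))
            for name in data.files
        }


# Tiny synthetic stand-in for the real dataset (dtypes are assumed).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(
    demo_path,
    images=np.zeros((5, 8, 8, 3), dtype=np.uint8),
    states=np.zeros((5, 39), dtype=np.float32),
    actions=np.zeros((5, 4), dtype=np.float32),
    text_ids=np.zeros((5, 6), dtype=np.int64),
)

for name, (shape, dtype) in summarize_npz(demo_path).items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Pointing `summarize_npz` at `data/metaworld_push_bc.npz` should report the same shapes as the collection log, provided the file was written as shown.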


* ### **Step 2:** Training the VLA Model (Diffusion Policy)
  
> #### Now we train the model: it takes the images we collected and tries to predict the correct arm movements.
> 1. **Epochs:** How many times the model sees the entire dataset. (The command below asks for 50; fewer epochs run faster, but 50+ generally gives better results. The recorded output below comes from a 30-epoch run.)
> 2. **Batch Size:** How many examples the model learns from at once.
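
To make the two settings concrete, the arithmetic below works out how many batches one epoch contains and how many gradient updates a full run performs, using the 500 samples reported by the data-collection step. It assumes the training script keeps the final, smaller batch (the `drop_last=False` behaviour); this is not confirmed by the notebook, so the helper exposes it as a parameter.

```python
import math


def training_steps(num_samples, batch_size, epochs, drop_last=False):
    """Return (batches per epoch, total gradient updates)."""
    if drop_last:
        # The incomplete final batch is discarded.
        batches_per_epoch = num_samples // batch_size
    else:
        # The incomplete final batch is kept (assumed default).
        batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs


batches, total = training_steps(num_samples=500, batch_size=64, epochs=50)
print(f"{batches} batches per epoch, {total} gradient updates in total")
# prints: 8 batches per epoch, 400 gradient updates in total
```

A few hundred updates is very few for an image-conditioned policy, which is one reason this quick run is not expected to produce a working controller.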

In [4]:
print("Starting Training...")
!python -W ignore -m scripts.train \
  --dataset-path data/metaworld_push_bc.npz \
  --epochs 50 \
  --batch-size 64 \
  --save-path checkpoints/model.pt \
  --device cpu


Starting Training...
Epoch 1/30  loss=0.9151
Epoch 2/30  loss=0.9820
Epoch 3/30  loss=0.9925
Epoch 4/30  loss=1.0365
Epoch 5/30  loss=0.9990
Epoch 6/30  loss=0.9695
Epoch 7/30  loss=0.9728
Epoch 8/30  loss=0.9903
Epoch 9/30  loss=1.0427
Epoch 10/30  loss=0.9745
Epoch 11/30  loss=0.9865
Epoch 12/30  loss=0.9918
Epoch 13/30  loss=0.9872
Epoch 14/30  loss=0.9866
Epoch 15/30  loss=1.0253
Epoch 16/30  loss=0.9900
Epoch 17/30  loss=0.9712
Epoch 18/30  loss=1.0005
Epoch 19/30  loss=0.9696
Epoch 20/30  loss=1.0347
Epoch 21/30  loss=0.9950
Epoch 22/30  loss=0.9803
Epoch 23/30  loss=0.9858
Epoch 24/30  loss=1.0346
Epoch 25/30  loss=1.0022
Epoch 26/30  loss=0.9524
Epoch 27/30  loss=1.0439
Epoch 28/30  loss=0.9852
Epoch 29/30  loss=0.9856
Epoch 30/30  loss=1.0095
Saved checkpoint: checkpoints/model.pt
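
The loss values above stay close to 1.0 rather than falling steadily. One rough way to check a log for progress is to compare the average loss over the first few epochs with the average over the last few. The sketch below parses lines in the `Epoch i/N  loss=x` format shown above. The helper names are hypothetical, and the sample log is a four-line excerpt from the recorded output.

```python
import re


def parse_losses(log_text):
    """Extract the loss value from each 'Epoch i/N  loss=x' line."""
    return [float(value) for value in re.findall(r"loss=([0-9]*\.?[0-9]+)", log_text)]


def mean(values):
    return sum(values) / len(values)


def loss_trend(losses, window=5):
    """Mean of the last `window` losses minus the mean of the first `window`.

    A clearly negative value suggests the loss is going down.
    """
    return mean(losses[-window:]) - mean(losses[:window])


sample_log = """
Epoch 1/30  loss=0.9151
Epoch 2/30  loss=0.9820
Epoch 29/30  loss=0.9856
Epoch 30/30  loss=1.0095
"""

losses = parse_losses(sample_log)
print(f"Parsed {len(losses)} losses; trend = {loss_trend(losses, window=2):+.4f}")
```

A trend near zero or above it, as in this excerpt, indicates the model has not yet started to fit the data.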


* ### **Step 3:** Testing the Trained VLA Model in Simulation
  > #### The model will now try to control the robot using only the camera images.
> 1. **episodes:** The number of times the robot attempts the task; higher numbers give a more accurate success rate but take longer.
> 2. **max-steps:** The time limit for each attempt; if too low, the robot may be cut off before it finishes the task.
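
To see why more episodes give a more accurate success rate, treat each episode as an independent success/failure trial. The standard error of the measured rate is then sqrt(p(1 - p) / n), which shrinks as the number of episodes n grows. The sketch below is a generic statistical illustration, not something computed by the test script, and the function name is hypothetical.

```python
import math


def success_rate_standard_error(success_rate, episodes):
    """Standard error of a success rate estimated from independent episodes."""
    return math.sqrt(success_rate * (1.0 - success_rate) / episodes)


# Worst case for uncertainty is a true success rate of 0.5.
for n in (10, 100):
    se = success_rate_standard_error(0.5, n)
    print(f"{n} episodes: standard error ~ {se:.3f}")
# prints:
# 10 episodes: standard error ~ 0.158
# 100 episodes: standard error ~ 0.050
```

With 10 episodes the estimate can easily be off by 15 percentage points or more, so small differences between two checkpoints are not meaningful at that sample size.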

In [5]:
print("Testing the trained model...")
!python -W ignore -m scripts.test \
  --checkpoint checkpoints/model.pt \
  --env-name push-v3 \
  --episodes 10 \
  --max-steps 150 \
  --instruction "push the object to the goal" \
  --device cpu \
  --save-video \
  --video-dir videos

Testing the trained model...
[test] Loading checkpoint from checkpoints/model.pt
[test] Meta-World MT1 env: push-v3
[test] state_dim=39, action_dim=4, obs_shape=(480, 480, 3)
[test] Episode 1/10: reward=5.494, steps=150
[test] Saved video to videos\push-v3_ep001.mp4
[test] Episode 2/10: reward=20.047, steps=150
[test] Saved video to videos\push-v3_ep002.mp4
[test] Episode 3/10: reward=4.110, steps=150
[test] Saved video to videos\push-v3_ep003.mp4
[test] Episode 4/10: reward=5.846, steps=150
[test] Saved video to videos\push-v3_ep004.mp4
[test] Episode 5/10: reward=4.393, steps=150
[test] Saved video to videos\push-v3_ep005.mp4
[test] Episode 6/10: reward=5.389, steps=150
[test] Saved video to videos\push-v3_ep006.mp4
[test] Episode 7/10: reward=3.940, steps=150
[test] Saved video to videos\push-v3_ep007.mp4
[test] Episode 8/10: reward=4.881, steps=150
[test] Saved video to videos\push-v3_ep008.mp4
[test] Episode 9/10: reward=4.340, steps=150
[test] Saved video to videos\push-v3_ep009.

* ### **Step 4:** Visualizing the Results *Right Here* in the Notebook
  > ##### After running the cell below, the saved .mp4 result videos will play inline.

In [6]:
from IPython.display import Video, display
import glob
import os

# 1. Get the list of all MP4 files in the videos folder, sorted so they play in episode order
video_files = sorted(glob.glob("videos/*.mp4"))

# 2. Check if we found any files
if video_files:
    print(f"Found {len(video_files)} videos. Playing them now...\n")
    
    # 3. Loop through EVERY file in the list
    for video_path in video_files:
        print(f"Playing: {os.path.basename(video_path)}")
        display(Video(video_path, embed=True))
else:
    print("No video found. Check if the test step ran correctly.")

Found 10 videos. Playing them now...

Playing: push-v3_ep001.mp4


Playing: push-v3_ep002.mp4


Playing: push-v3_ep003.mp4


Playing: push-v3_ep004.mp4


Playing: push-v3_ep005.mp4


Playing: push-v3_ep006.mp4


Playing: push-v3_ep007.mp4


Playing: push-v3_ep008.mp4


Playing: push-v3_ep009.mp4


Playing: push-v3_ep010.mp4


## *Conclusion:*
#### Congratulations! You have successfully built and run an end-to-end VLA pipeline. You collected expert data, trained a behavior-cloning policy, and tested it in a 3D simulation.
> If the arm fails (as seen above) to execute the instruction "push the object to the goal", it is due to the very short training (50 epochs or fewer) and the tiny dataset (10 episodes), both chosen to make the code run fast on your machine.
>##### To get a robot that actually succeeds, you would typically need:
> *   More Data: ~100+ expert episodes.
> *   Longer Training: ~100-200 epochs.
> *   GPU Acceleration: To handle the heavier computation.
#####  *Feel free to increase the episodes and epochs in the commands above if you have the time and hardware!*
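
As a starting point for a larger run, the sketch below assembles the data-collection and training commands with bigger values. It reuses only the flags that appear in the cells above. The helper `build_commands` is hypothetical, and passing `cuda` to `--device` is an assumption, since the notebook only demonstrates `cpu`.

```python
import shlex


def build_commands(episodes=100, epochs=100, device="cpu",
                   dataset_path="data/metaworld_push_bc.npz"):
    """Return (collect_cmd, train_cmd) as argument lists."""
    collect = [
        "python", "-W", "ignore", "-m", "scripts.collect_data",
        "--env-name", "push-v3",
        "--camera-name", "corner",
        "--episodes", str(episodes),
        "--max-steps", "100",
        "--output-path", dataset_path,
    ]
    train = [
        "python", "-W", "ignore", "-m", "scripts.train",
        "--dataset-path", dataset_path,
        "--epochs", str(epochs),
        "--batch-size", "64",
        "--save-path", "checkpoints/model.pt",
        "--device", device,  # "cuda" is assumed to be accepted
    ]
    return collect, train


for cmd in build_commands(episodes=100, epochs=150, device="cuda"):
    print(shlex.join(cmd))
```

The printed lines can be pasted into a notebook cell with a leading `!`, or the argument lists can be passed to `subprocess.run`.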
   




