# **Quickstart Guide for mini-VLA**


* ### *Welcome to mini-VLA! This notebook will help you train a Vision-Language-Action model from scratch.*
> ##### We will:
> 1. Collect Data: Watch an expert script solve a task and record it.
> 2. Train: Teach the AI to mimic that expert behavior.
> 3. Test: See if the AI can solve the task on its own using only visual input.


* ### Prerequisites
> ##### Before running this notebook, ensure you have:
> 1. Cloned the repository: You should be running this file from inside the mini-vla folder.
> 2. Created the Environment:<br>&nbsp;`conda create --name mini-vla python=3.10`<br>&nbsp;`conda activate mini-vla`

#### Let's install the dependencies, if not already installed :D

In [2]:
!pip install -r requirements.txt



* ### Let's ensure we are in the project root

In [3]:
import os
import sys
import torch

print(f"Current working directory: {os.getcwd()}")
print(f"PyTorch Version: {torch.__version__}")

Current working directory: C:\Users\abhay\mini-vla
PyTorch Version: 2.9.1+cpu
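
The cell above only prints the working directory. As a minimal sketch, the check can be made explicit by looking for entries that the later commands rely on. The choice of `requirements.txt` and a `scripts` folder as markers is an assumption based on the commands used in this notebook, and `missing_project_entries` is a hypothetical helper, not part of mini-vla.

```python
from pathlib import Path

# Assumption: these two entries mark the mini-vla project root. Both names are
# taken from the commands used in this notebook (pip install -r requirements.txt
# and python -m scripts.<name>).
EXPECTED_ENTRIES = ("requirements.txt", "scripts")


def missing_project_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


missing = missing_project_entries(Path.cwd())
if missing:
    print(f"Probably not the project root, missing: {missing}")
else:
    print("Looks like the project root.")
```

If the first message appears, change into the mini-vla folder before running the remaining cells.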


* ### **Step 1:** Generating the Dataset
  > We need data to train our model.
  > Since we don't have a *real robot*, we use the *Meta-World simulator*.
  > The command below runs an "Expert Script" (a hard-coded algorithm that knows exactly how to solve the task) and records its movements into a .npz file.
> ##### **Note:** You might see some warnings from the simulator. These are safe to ignore!

In [None]:
print("Collecting expert demonstration data... (This might take a minute)")
!python -W ignore -m scripts.collect_data \
  --env-name push-v3 \
  --camera-name corner \
  --episodes 10 \
  --max-steps 100 \
  --output-path data/metaworld_push_bc.npz
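
Once the collection step has finished, it can help to look inside the recorded .npz file before training. The sketch below makes no assumption about which array names the collection script writes; it lists whatever keys are present along with their shapes and dtypes. `summarize_npz` is a hypothetical helper written for this notebook, not part of mini-vla.

```python
from pathlib import Path

import numpy as np


def summarize_npz(path):
    """Map each array name in an .npz file to its (shape, dtype name)."""
    with np.load(path) as data:
        return {key: (data[key].shape, str(data[key].dtype)) for key in data.files}


dataset_path = Path("data/metaworld_push_bc.npz")
if dataset_path.exists():
    for key, (shape, dtype) in summarize_npz(dataset_path).items():
        print(f"{key}: shape={shape}, dtype={dtype}")
else:
    print(f"{dataset_path} not found. Run the collection cell first.")
```

The leading dimension of each array should give a rough idea of how many samples were recorded across the 10 episodes.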

* ### **Step 2:** Training the VLA Model (Diffusion Policy)
  
> #### Now we train the model. It takes the images we collected and tries to predict the correct arm movements.
> 1. **Epochs:** How many times the model sees the entire dataset. (The cell below uses 50; a smaller value such as 5 runs faster but gives worse results.)
> 2. **Batch Size:** How many examples the model learns from at once.
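
To see how epochs and batch size interact, here is a rough back-of-the-envelope sketch of how many weight updates a training run performs. It assumes one training sample per recorded timestep, that every episode runs for the full 100 steps (so 1000 samples is an upper bound), and that the last partial batch of each epoch is kept. None of these details are confirmed by the training script.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover the dataset once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


def total_updates(num_samples, batch_size, epochs):
    """Total number of weight updates over the whole training run."""
    return steps_per_epoch(num_samples, batch_size) * epochs


# Assumption: 10 episodes x 100 steps each, one sample per step (upper bound).
num_samples = 10 * 100

print(steps_per_epoch(num_samples, batch_size=64))           # 16
print(total_updates(num_samples, batch_size=64, epochs=50))  # 800
```

Under these assumptions, raising `--epochs` scales the number of updates linearly, while a larger `--batch-size` reduces the number of updates per epoch.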

In [None]:
print("Starting Training...")
!python -W ignore -m scripts.train \
  --dataset-path data/metaworld_push_bc.npz \
  --epochs 50 \
  --batch-size 64 \
  --save-path checkpoints/model.pt \
  --device cpu


* ### **Step 3:** Testing the Trained VLA Model in Simulation
  > #### The trained model will now try to control the robot using only the camera images.
> 1. **episodes:** The number of times the robot attempts the task; higher numbers give a more accurate success rate but take longer.
> 2. **max-steps:** The time limit for each attempt; if too low, the robot may be cut off before it finishes the task.
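
To illustrate why more episodes give a more accurate success rate, the sketch below computes the standard error of a measured rate using the usual normal approximation for a proportion. The success counts are made-up inputs for illustration, not results from this notebook, and `success_rate_with_stderr` is a hypothetical helper.

```python
import math


def success_rate_with_stderr(successes, episodes):
    """Return (success rate, standard error) for a number of test episodes."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= successes <= episodes:
        raise ValueError("successes must be between 0 and episodes")
    rate = successes / episodes
    # Normal approximation to the binomial; rough for very small episode counts.
    stderr = math.sqrt(rate * (1 - rate) / episodes)
    return rate, stderr


# Illustrative inputs only: the same 30% success rate measured on 10 vs 100 episodes.
for successes, episodes in [(3, 10), (30, 100)]:
    rate, stderr = success_rate_with_stderr(successes, episodes)
    print(f"{episodes} episodes: {rate:.2f} ± {stderr:.2f}")
```

With 10 episodes the uncertainty is roughly three times larger than with 100, which is why a single 10-episode run should be read as a rough estimate.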

In [None]:
print("Testing the trained model...")
!python -W ignore -m scripts.test \
  --checkpoint checkpoints/model.pt \
  --env-name push-v3 \
  --episodes 10 \
  --max-steps 150 \
  --instruction "push the object to the goal" \
  --device cpu \
  --save-video \
  --video-dir videos

* ### **Step 4:** Visualizing the results *right here* in the notebook
  > ##### After running the cell below, the recorded .mp4 video(s) will play inline.

In [None]:
from IPython.display import Video, display
import glob
import os

# 1. Get the list of all MP4 files in the videos folder
video_files = sorted(glob.glob("videos/*.mp4"))  # sorted for a stable playback order

# 2. Check if we found any files
if video_files:
    print(f"Found {len(video_files)} videos. Playing them now...\n")
    
    # 3. Loop through EVERY file in the list
    for video_path in video_files:
        print(f"Playing: {os.path.basename(video_path)}")
        display(Video(video_path, embed=True))
else:
    print("No video found. Check if the test step ran correctly.")

## *Conclusion:*
#### Congratulations! You have successfully built and run an end-to-end VLA pipeline. You collected expert data, trained a behavior-cloning policy, and tested it in a 3D simulation.
> If the arm fails (as seen above) to execute the instruction "push the object to the goal", the likely cause is the very short training (50 epochs or fewer) and the tiny dataset (10 episodes or fewer), both chosen so the code runs fast on your machine.
>##### To get a robot that actually succeeds, you would typically need:
> *   More Data: ~100+ expert episodes.
> *   Longer Training: ~100-200 epochs.
> *   GPU Acceleration: To handle the heavier computation.
#####  *Feel free to increase the episodes and epochs in the commands above if you have the time and hardware!*
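
For the GPU suggestion, a minimal sketch of checking whether PyTorch can see a CUDA device is shown below. The training and test cells in this notebook pass `--device cpu`; whether the scripts accept `cuda` as a value is an assumption, and `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to the CPU setting.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Assumption: the scripts accept this value for their --device flag.
print(f"Suggested --device value: {device}")
```

On a CPU-only PyTorch build such as the one shown at the top of this notebook (`2.9.1+cpu`), this reports `cpu`.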
   




