Fine-tune Qwen3-VL-2B via LoRA for natural language robot dog navigation.
Takes a camera image + text instruction, outputs a discrete navigation action (MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, STOP).
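Mapping model text back to a discrete action could be as simple as the following sketch (an illustration of what `parse_action()` in `src/actions.py` might look like, not the actual implementation):

```python
ACTIONS = ("MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

def parse_action(text: str) -> str:
    """Return the first known action name found in the model output, else STOP."""
    upper = text.upper()
    for action in ACTIONS:
        if action in upper:
            return action
    return "STOP"  # fail safe for unparseable output
```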
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

```
QwenNav/
├── configs/
│   └── lora_config.yaml     # LoRA + training hyperparameters
├── src/
│   ├── actions.py           # Action space + system prompt + parse_action()
│   ├── model.py             # ParentVLA wrapper + LoRA application
│   ├── dataset.py           # Navigation dataset loader (JSONL → chat format)
│   ├── train.py             # Training loop with label masking + gradient accumulation
│   ├── inference.py         # Single-image inference (image + instruction → action)
│   └── evaluate.py          # Validation evaluation (accuracy, per-class, confusion matrix)
├── scripts/
│   └── download_data.py     # Download + convert StreamVLN data + train/val split
├── tests/
│   └── verify_pipeline.py   # CPU end-to-end pipeline tests
└── requirements.txt
```
The following are generated locally and excluded from git (see .gitignore):

- `data/`: StreamVLN images (23 GB), annotations, and JSONL training files
- `checkpoints/`: saved LoRA adapters after training
- `venv/`: Python virtual environment
- Base model: Qwen3-VL-2B-Instruct (vision encoder + LLM, 2.1B params)
- Wrapper: `ParentVLA` freezes all weights, then injects LoRA
- LoRA targets: `q_proj` + `v_proj` in LLM attention layers (rank=8, alpha=16)
- Trainable parameters: 1,605,632 (0.08% of total)
- Vision encoder: completely frozen
- Label masking: loss computed only on assistant action tokens (everything else masked with -100)
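A minimal sketch of this setup using `peft` (the actual `ParentVLA` internals may differ; the auto class and freezing order here are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Load the base model and freeze everything, vision encoder included
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
for param in model.parameters():
    param.requires_grad = False

# Inject rank-8 LoRA adapters into the LLM attention q/v projections
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], bias="none")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~1.6M trainable, ~0.08% of total
```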
- **Data pipeline**
  - StreamVLN annotations downloaded (10,819 episodes)
  - StreamVLN R2R images downloaded (23 GB, 10,819 scene directories)
  - Converted to JSONL training format (12,365 samples in `data/train.jsonl`)
  - Training-set action distribution spans all four classes: MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, STOP
  - Train/val split available via `--split` (90/10, seed=42)
- **Model pipeline**
  - Qwen3-VL-2B-Instruct cached locally (4 GB, SafeTensors)
  - ParentVLA wrapper applies LoRA, freezes base weights
  - Collate function tokenizes chat-format messages and masks non-action labels
- **CPU end-to-end test**
  - Synthetic images → JSONL → dataset → collate → forward → backward → optimizer step
  - Confirmed: only LoRA params get gradients, frozen params untouched (see the gradient-check sketch after this list)
  - Confirmed: label masking isolates action tokens correctly
- **Evaluation**
  - `src/evaluate.py` runs inference on the validation set and reports accuracy
  - Per-class accuracy and confusion matrix for detailed analysis
- **Train on GPU**: run `python -m src.train` on a machine with a GPU. The config is set for batch_size=1 with 4-step gradient accumulation; adjust in `configs/lora_config.yaml`.
- **Inference on robot**: use `src/inference.py` in a loop: camera frame → model → action → execute → repeat. The adapter loads on top of the frozen base model.
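The gradient check from the CPU test amounts to something like this (a sketch; `model` and `batch` come from the pipeline above):

```python
# After one forward/backward pass, only LoRA adapter weights should carry gradients
loss = model(**batch).loss
loss.backward()

for name, param in model.named_parameters():
    if "lora_" in name:
        assert param.requires_grad and param.grad is not None, f"{name} missing grad"
    else:
        assert param.grad is None, f"frozen param {name} unexpectedly has a grad"
```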
Potential improvements:
- Add a learning rate scheduler (warmup + cosine decay; sketched below)
- Add validation loss tracking during training
- Experiment with LoRA rank (16, 32) or targeting more modules (k_proj, o_proj)
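For the scheduler, something like `transformers.get_cosine_schedule_with_warmup` would fit; the step counts below are placeholders, not tuned values:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
# Placeholder step counts: warm up briefly, then cosine-decay to zero
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=2000
)
# In the training loop: optimizer.step(); scheduler.step(); optimizer.zero_grad()
```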
```bash
# If starting fresh:
python scripts/download_data.py --login
python scripts/download_data.py --annotations-only
python scripts/download_data.py --download-images
python scripts/download_data.py --convert --max-episodes 200

# Split train.jsonl into 90% train + 10% val (seed=42, deterministic)
python scripts/download_data.py --split
```

This reads `data/train.jsonl`, shuffles, splits 90/10, and writes back `data/train.jsonl` + `data/val.jsonl`.
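The split is deterministic; its logic is roughly equivalent to this sketch (not the actual `scripts/download_data.py` code):

```python
import random

lines = open("data/train.jsonl").readlines()
random.seed(42)              # fixed seed → reproducible split
random.shuffle(lines)
cut = int(0.9 * len(lines))  # 90% train, 10% val
open("data/train.jsonl", "w").writelines(lines[:cut])
open("data/val.jsonl", "w").writelines(lines[cut:])
```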
Each line in the JSONL training file:

```json
{"image": "/abs/path/to/frame.jpg", "instruction": "Navigate to the kitchen", "action": "TURN_LEFT"}
```

Train with:

```bash
python -m src.train
```

Hyperparameters are in `configs/lora_config.yaml`.
Run inference on a single image:

```bash
python -m src.inference --image "path/to/frame.jpg" --instruction "Go to the kitchen"
```

With a trained LoRA adapter:

```bash
python -m src.inference --image "path/to/frame.jpg" --instruction "Go to the kitchen" --adapter "checkpoints/"
```

```bash
# Evaluate base model on validation set
python -m src.evaluate --data "data/val.jsonl"

# Evaluate with a trained LoRA adapter
python -m src.evaluate --data "data/val.jsonl" --adapter "checkpoints/"
```

Reports overall accuracy, per-class accuracy, and a confusion matrix.
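On the robot, the camera → model → action → execute loop described above could look like this sketch (`get_camera_frame`, `run_inference`, and `execute_action` are hypothetical placeholders for the robot's APIs and the `src/inference.py` logic):

```python
import time

ACTIONS = {"MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"}

def navigation_loop(instruction: str) -> None:
    """Camera frame → model → action → execute → repeat."""
    while True:
        frame = get_camera_frame()                  # hypothetical camera API
        action = run_inference(frame, instruction)  # wraps src/inference.py logic
        if action not in ACTIONS:
            action = "STOP"                         # fail safe on unparseable output
        execute_action(action)                      # hypothetical robot control API
        if action == "STOP":
            break
        time.sleep(0.5)                             # pace the control loop
```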