QwenNav

Fine-tune Qwen3-VL-2B via LoRA for natural language robot dog navigation.

The model takes a camera image and a text instruction and outputs a discrete navigation action (MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, STOP).
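
A minimal sketch of what that action interface can look like. The real definitions live in src/actions.py; the fallback behavior and exact parsing rule below are assumptions.

from enum import Enum

class Action(str, Enum):
    MOVE_FORWARD = "MOVE_FORWARD"
    TURN_LEFT = "TURN_LEFT"
    TURN_RIGHT = "TURN_RIGHT"
    STOP = "STOP"

def parse_action(text: str) -> Action:
    """Map raw model output to one of the four discrete actions.
    Falls back to STOP if no known action token is found."""
    upper = text.upper()
    for action in Action:
        if action.value in upper:
            return action
    return Action.STOP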

Setup

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Project Structure

QwenNav/
├── configs/
│   └── lora_config.yaml       # LoRA + training hyperparameters
├── src/
│   ├── actions.py              # Action space + system prompt + parse_action()
│   ├── model.py                # ParentVLA wrapper + LoRA application
│   ├── dataset.py              # Navigation dataset loader (JSONL → chat format)
│   ├── train.py                # Training loop with label masking + gradient accumulation
│   ├── inference.py            # Single-image inference (image + instruction → action)
│   └── evaluate.py             # Validation evaluation (accuracy, per-class, confusion matrix)
├── scripts/
│   └── download_data.py        # Download + convert StreamVLN data + train/val split
├── tests/
│   └── verify_pipeline.py      # CPU end-to-end pipeline tests
└── requirements.txt

The following are generated locally and excluded from git (see .gitignore):

  • data/ — StreamVLN images (23 GB), annotations, and JSONL training files
  • checkpoints/ — saved LoRA adapters after training
  • venv/ — Python virtual environment

Architecture

  • Base model: Qwen3-VL-2B-Instruct (vision encoder + LLM, 2.1B params)
  • Wrapper: ParentVLA freezes all weights, then injects LoRA
  • LoRA targets: q_proj + v_proj in LLM attention layers (rank=8, alpha=16); see the sketch after this list
  • Trainable parameters: 1,605,632 (0.08% of total)
  • Vision encoder: completely frozen
  • Label masking: loss computed only on assistant action tokens (everything else masked with -100)
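
A rough sketch of the LoRA wrapping described above, using peft. The actual ParentVLA wrapper lives in src/model.py; the dropout, bias, and task_type values here are assumptions.

from peft import LoraConfig, get_peft_model

def apply_lora(base_model):
    # Freeze every base weight, including the vision encoder.
    for param in base_model.parameters():
        param.requires_grad = False

    config = LoraConfig(
        r=8,                                  # rank from configs/lora_config.yaml
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # LLM attention projections only
        lora_dropout=0.0,                     # assumed
        bias="none",
        task_type="CAUSAL_LM",                # assumed
    )
    model = get_peft_model(base_model, config)
    model.print_trainable_parameters()        # expect roughly 1.6M trainable params
    return model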

Workflow

Current status

  1. Data pipeline

    • StreamVLN annotations downloaded (10,819 episodes)
    • StreamVLN R2R images downloaded (23 GB, 10,819 scene directories)
    • Converted to JSONL training format (12,365 samples in data/train.jsonl)
    • Action distribution across the training set spans all four classes: MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, STOP
    • Train/val split available via --split (90/10, seed=42)
  2. Model pipeline

    • Qwen3-VL-2B-Instruct cached locally (4 GB, SafeTensors)
    • ParentVLA wrapper applies LoRA, freezes base weights
    • Collate function tokenizes chat-format messages and masks non-action labels (sketched after this list)
  3. CPU end-to-end test

    • Synthetic images → JSONL → dataset → collate → forward → backward → optimizer step
    • Confirmed: only LoRA params get gradients, frozen params untouched
    • Confirmed: label masking isolates action tokens correctly
  4. Evaluation

    • src/evaluate.py runs inference on validation set and reports accuracy
    • Per-class accuracy and confusion matrix for detailed analysis
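
As referenced in step 2, a minimal sketch of the label masking idea; the real collate lives in the dataset/training code, and the index bookkeeping here is simplified.

import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_labels(input_ids: torch.Tensor, action_start: int, action_end: int) -> torch.Tensor:
    """Copy input_ids into labels, then blank out everything except the
    assistant action tokens so the loss is computed only on them."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    labels[action_start:action_end] = input_ids[action_start:action_end]
    return labels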

What's next

  1. Train on GPU — run python -m src.train on a machine with a GPU. Config is set for batch_size=1 with 4-step gradient accumulation. Adjust in configs/lora_config.yaml.

  2. Inference on robot — use src/inference.py in a loop: camera frame → model → action → execute → repeat. The adapter loads on top of the frozen base model.

  3. Potential improvements:

    • Add a learning rate scheduler (warmup + cosine decay); see the sketch after this list
    • Add validation loss tracking during training
    • Experiment with LoRA rank (16, 32) or targeting more modules (k_proj, o_proj)
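
For the scheduler bullet above, a hedged sketch using transformers' built-in cosine schedule with warmup; the learning rate and step counts are placeholders, and trainable_params stands for the LoRA parameters.

from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

optimizer = AdamW(trainable_params, lr=1e-4)   # lr is a placeholder
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,        # placeholder: a few percent of total steps
    num_training_steps=3000,     # placeholder: total optimizer steps
)
# Call scheduler.step() after each optimizer.step() in the training loop.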

Data

StreamVLN

# If starting fresh:
python scripts/download_data.py --login
python scripts/download_data.py --annotations-only
python scripts/download_data.py --download-images
python scripts/download_data.py --convert --max-episodes 200

Train/Val Split

# Split train.jsonl into 90% train + 10% val (seed=42, deterministic)
python scripts/download_data.py --split

This reads data/train.jsonl, shuffles, splits 90/10, and writes back data/train.jsonl + data/val.jsonl.
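
Roughly what that split does; the real logic is in scripts/download_data.py, and the file handling here is a simplified assumption.

import json
import random

with open("data/train.jsonl") as f:
    samples = [json.loads(line) for line in f]

random.Random(42).shuffle(samples)   # deterministic shuffle, seed=42
cut = int(len(samples) * 0.9)        # 90% train / 10% val

with open("data/train.jsonl", "w") as f:
    f.writelines(json.dumps(s) + "\n" for s in samples[:cut])
with open("data/val.jsonl", "w") as f:
    f.writelines(json.dumps(s) + "\n" for s in samples[cut:])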

Data format

Each line in the JSONL training file is a single training sample:

{"image": "/abs/path/to/frame.jpg", "instruction": "Navigate to the kitchen", "action": "TURN_LEFT"}

Training

python -m src.train

Hyperparameters are in configs/lora_config.yaml.

Inference

python -m src.inference --image "path/to/frame.jpg" --instruction "Go to the kitchen"

With a trained LoRA adapter:

python -m src.inference --image "path/to/frame.jpg" --instruction "Go to the kitchen" --adapter "checkpoints/"
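
Under the hood, loading an adapter on top of the frozen base model follows the usual peft pattern. A sketch, where base_model is a placeholder for the Qwen3-VL-2B-Instruct instance loaded as in src/model.py:

from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "checkpoints/")
model.eval()
# Generation then runs exactly as with the base model; the LoRA weights
# are applied on top of the frozen parameters.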

Evaluation

# Evaluate base model on validation set
python -m src.evaluate --data "data/val.jsonl"

# Evaluate with a trained LoRA adapter
python -m src.evaluate --data "data/val.jsonl" --adapter "checkpoints/"

Reports overall accuracy, per-class accuracy, and a confusion matrix.
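
A minimal sketch of how those metrics can be computed from predicted vs. ground-truth actions; the real implementation is src/evaluate.py, this only illustrates the bookkeeping.

from collections import Counter, defaultdict

ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]

def summarize(preds, targets):
    per_class = defaultdict(Counter)
    confusion = Counter()                    # (true, predicted) -> count
    for p, t in zip(preds, targets):
        per_class[t]["total"] += 1
        per_class[t]["correct"] += int(p == t)
        confusion[(t, p)] += 1

    overall = sum(c["correct"] for c in per_class.values()) / len(targets)
    print(f"overall accuracy: {overall:.3f}")
    for a in ACTIONS:
        total = per_class[a]["total"]
        if total:
            print(f"{a}: {per_class[a]['correct'] / total:.3f} ({total} samples)")
    return confusion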
