Fine-tune Qwen3-VL-2B via LoRA for natural language robot dog navigation.
Takes a camera image + text instruction, outputs a discrete navigation action (MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, STOP).
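Mapping model text back to a discrete action could be as simple as the following sketch (an illustration of what `parse_action()` in `src/actions.py` might look like, not the actual implementation):

```python
ACTIONS = ("MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

def parse_action(text: str) -> str:
    """Return the first known action name found in the model output, else STOP."""
    upper = text.upper()
    for action in ACTIONS:
        if action in upper:
            return action
    return "STOP"  # fail safe for unparseable output
```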
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

```
QwenNav/
├── configs/
│   └── lora_config.yaml     # LoRA + training hyperparameters
├── src/
│   ├── actions.py           # Action space + system prompt + parse_action()
│   ├── model.py             # ParentVLA wrapper + LoRA application
│   ├── dataset.py           # Navigation dataset loader (JSONL → chat format)
│   ├── train.py             # Training loop with label masking + gradient accumulation
│   ├── inference.py         # Single-image inference (image + instruction → action)
│   └── evaluate.py          # Validation evaluation (accuracy, per-class, confusion matrix)
├── scripts/
│   └── download_data.py     # Download + convert StreamVLN data + train/val split
├── tests/
│   └── verify_pipeline.py   # CPU end-to-end pipeline tests
└── requirements.txt
```
The following are generated locally and excluded from git (see .gitignore):

- `data/`: StreamVLN images (23 GB), annotations, and JSONL training files
- `checkpoints/`: saved LoRA adapters after training
- `venv/`: Python virtual environment
- Base model: Qwen3-VL-2B-Instruct (vision encoder + LLM, 2.1B params)
- Wrapper: `ParentVLA` freezes all weights, then injects LoRA
- LoRA targets: `q_proj` + `v_proj` in LLM attention layers (rank=8, alpha=16)
- Trainable parameters: 1,605,632 (0.08% of total)
- Vision encoder: completely frozen
- Label masking: loss computed only on assistant action tokens (everything else masked with -100)
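A minimal sketch of this setup using `peft` (the actual `ParentVLA` internals may differ; the auto class and freezing order here are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Load the base model and freeze everything, vision encoder included
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
for param in model.parameters():
    param.requires_grad = False

# Inject rank-8 LoRA adapters into the LLM attention q/v projections
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], bias="none")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~1.6M trainable, ~0.08% of total
```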
- **Data pipeline**
  - StreamVLN annotations downloaded (10,819 episodes)
  - StreamVLN R2R images downloaded (23 GB, 10,819 scene directories)
  - Converted to JSONL training format (12,365 samples in `data/train.jsonl`)
  - Training-set action distribution spans all four classes: MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, STOP
  - Train/val split available via `--split` (90/10, seed=42)
- **Model pipeline**
  - Qwen3-VL-2B-Instruct cached locally (4 GB, SafeTensors)
  - ParentVLA wrapper applies LoRA, freezes base weights
  - Collate function tokenizes chat-format messages and masks non-action labels
- **CPU end-to-end test**
  - Synthetic images → JSONL → dataset → collate → forward → backward → optimizer step
  - Confirmed: only LoRA params get gradients, frozen params untouched (see the gradient-check sketch after this list)
  - Confirmed: label masking isolates action tokens correctly
- **Evaluation**
  - `src/evaluate.py` runs inference on the validation set and reports accuracy
  - Per-class accuracy and confusion matrix for detailed analysis
- **Train on GPU**: run `python -m src.train` on a machine with a GPU. The config is set for batch_size=1 with 4-step gradient accumulation; adjust in `configs/lora_config.yaml`.
- **Inference on robot**: use `src/inference.py` in a loop: camera frame → model → action → execute → repeat. The adapter loads on top of the frozen base model.
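The gradient check from the CPU test amounts to something like this (a sketch; `model` and `batch` come from the pipeline above):

```python
# After one forward/backward pass, only LoRA adapter weights should carry gradients
loss = model(**batch).loss
loss.backward()

for name, param in model.named_parameters():
    if "lora_" in name:
        assert param.requires_grad and param.grad is not None, f"{name} missing grad"
    else:
        assert param.grad is None, f"frozen param {name} unexpectedly has a grad"
```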
Potential improvements:
- Add a learning rate scheduler (warmup + cosine decay; sketched below)
- Add validation loss tracking during training
- Experiment with LoRA rank (16, 32) or targeting more modules (k_proj, o_proj)
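For the scheduler, something like `transformers.get_cosine_schedule_with_warmup` would fit; the step counts below are placeholders, not tuned values:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
# Placeholder step counts: warm up briefly, then cosine-decay to zero
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=2000
)
# In the training loop: optimizer.step(); scheduler.step(); optimizer.zero_grad()
```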
```bash
# If starting fresh:
python scripts/download_data.py --login
python scripts/download_data.py --annotations-only
python scripts/download_data.py --download-images
python scripts/download_data.py --convert --max-episodes 200

# Split train.jsonl into 90% train + 10% val (seed=42, deterministic)
python scripts/download_data.py --split
```

This reads `data/train.jsonl`, shuffles, splits 90/10, and writes back `data/train.jsonl` + `data/val.jsonl`.
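The split is deterministic; its logic is roughly equivalent to this sketch (not the actual `scripts/download_data.py` code):

```python
import random

lines = open("data/train.jsonl").readlines()
random.seed(42)              # fixed seed → reproducible split
random.shuffle(lines)
cut = int(0.9 * len(lines))  # 90% train, 10% val
open("data/train.jsonl", "w").writelines(lines[:cut])
open("data/val.jsonl", "w").writelines(lines[cut:])
```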
Each line in the JSONL training file:

```json
{"image": "/abs/path/to/frame.jpg", "instruction": "Navigate to the kitchen", "action": "TURN_LEFT"}
```

Train with:

```bash
python -m src.train
```

Hyperparameters are in `configs/lora_config.yaml`.
Run inference on a single image:

```bash
python -m src.inference --image "path/to/frame.jpg" --instruction "Go to the kitchen"
```

With a trained LoRA adapter:

```bash
python -m src.inference --image "path/to/frame.jpg" --instruction "Go to the kitchen" --adapter "checkpoints/"
```

```bash
# Evaluate base model on validation set
python -m src.evaluate --data "data/val.jsonl"

# Evaluate with a trained LoRA adapter
python -m src.evaluate --data "data/val.jsonl" --adapter "checkpoints/"
```

Reports overall accuracy, per-class accuracy, and a confusion matrix.
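On the robot, the camera → model → action → execute loop described above could look like this sketch (`get_camera_frame`, `run_inference`, and `execute_action` are hypothetical placeholders for the robot's APIs and the `src/inference.py` logic):

```python
import time

ACTIONS = {"MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"}

def navigation_loop(instruction: str) -> None:
    """Camera frame → model → action → execute → repeat."""
    while True:
        frame = get_camera_frame()                  # hypothetical camera API
        action = run_inference(frame, instruction)  # wraps src/inference.py logic
        if action not in ACTIONS:
            action = "STOP"                         # fail safe on unparseable output
        execute_action(action)                      # hypothetical robot control API
        if action == "STOP":
            break
        time.sleep(0.5)                             # pace the control loop
```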