Danxu Liu1,4 *, Di Wang2,4 *, Hebaixu Wang2,4 *, Haoyang Chen2,4 *, Wentao Jiang2, Yilin Cheng3,4, Haonan Guo2,4, Wei Cui1 †, Jing Zhang2,4 †.
1 Beijing Institute of Technology, 2 Wuhan University, 3 Fudan University, 4 Zhongguancun Academy.
* Equal contribution. † Corresponding authors.
Update | Abstract | Datasets | Pre-training | Usage | Statement
2026.3.24
- The code for pretraining and classification fine-tuning is released!
2026.3.23
- SARMAE pretrained weights are publicly available on Hugging Face and Baidu Netdisk.
2026.3.16
- SAR-1M dataset is publicly available on Hugging Face and Baidu Netdisk.
2026.2.21
- The paper is accepted by CVPR 2026! 🎉🎉🎉
2025.12.19
- The paper is posted on arXiv! (arXiv SARMAE)
Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks.
Figure 1. Overview of the SARMAE pretraining framework. The framework consists of two branches: (i) a SAR branch following the MAE architecture with Speckle-Aware Representation Enhancement (SARE) to handle inherent speckle noise, and (ii) an optical branch using a frozen DINOv3 encoder. For paired SAR-optical data, Semantic Anchor Representation Constraint (SARC) aligns SAR features with semantic-rich optical representations. Unpaired SAR images are processed solely through the SAR branch.
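SARE injects SAR-specific speckle noise into the masked-autoencoder inputs during pretraining. A minimal sketch of such an injection, assuming the common multiplicative gamma speckle model (the paper's exact formulation may differ; `add_speckle`, `looks`, and `ratio` are illustrative names loosely mirroring the `--noise_ratio` flag):

```python
import torch

def add_speckle(img: torch.Tensor, looks: float = 4.0, ratio: float = 0.5) -> torch.Tensor:
    """Multiplicative gamma speckle: I_noisy = I * n, with n ~ Gamma(L, L),
    so E[n] = 1 and the noise strength decreases as the number of looks L grows."""
    if torch.rand(()) > ratio:  # inject noise into only a fraction of samples
        return img
    noise = torch.distributions.Gamma(looks, looks).sample(img.shape)
    return img * noise

x = torch.rand(2, 1, 224, 224)          # toy single-channel SAR patches
y = add_speckle(x, looks=4.0, ratio=1.0)
print(y.shape)                          # same shape as the input
```

Because the gamma noise has unit mean, the expected intensity of the image is preserved while local texture is corrupted, which is what makes speckle hard for fine-grained representation learning.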
Figure 2. The organization of data sources in SAR-1M.
SAR-1M is a large-scale synthetic aperture radar (SAR) image dataset designed for SAR representation learning. The dataset contains over one million SAR images, and about 75% of the SAR samples are paired with geographically aligned optical images, enabling multimodal remote sensing studies.
Environment:
- Python 3.8.20
- PyTorch 1.12.1+cu113
- torchvision 0.13.1+cu113
- timm 0.6.13
conda create -n sarmae python=3.8 -y
conda activate sarmae
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
- Preparing with SAR-1M: Download SAR-1M. The indices of paired images are provided in `paired.json`, while those of unpaired images are listed in `unpaired.json`. To extend SAR-1M with additional pretraining data, append the corresponding image indices to these JSON files.
- Pretraining: take ViT-B as an example (batch size: 4096 = 8 GPUs × 512)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port 20003 \
train_mae_contrastive.py \
--model mae_vit_base_patch16 \
--data_path ./data \
--enable_sar_noise \
--noise_ratio 0.5 --random_noise \
--noise_min 0.0 --noise_max 0.7 \
--output_dir ./output_vitb \
--batch_size 512 --epochs 300 \
--lr 1e-4 --mae_loss_weight 1 --alignment_loss_weight 0.8 \
--loss_schedule cosine \
--sar_pretrained ./mae_pretrain_vit_base.pth \
--dinov3_pretrained ./dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth \
--freeze_optical_completely \
--clip_grad 1.0
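The `--mae_loss_weight`, `--alignment_loss_weight`, and `--loss_schedule cosine` flags suggest the reconstruction and alignment losses are combined with a scheduled weight. A hypothetical sketch of such a cosine schedule (the actual form and direction in `train_mae_contrastive.py` may differ):

```python
import math

def alignment_weight(epoch: int, total_epochs: int, base_weight: float = 0.8) -> float:
    """Hypothetical cosine decay of the alignment-loss weight over training,
    starting at base_weight (cf. --alignment_loss_weight 0.8) and ending at 0."""
    return base_weight * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

# total loss per step would then be:
#   L = mae_loss_weight * L_reconstruction + alignment_weight(epoch) * L_alignment
print(round(alignment_weight(0, 300), 3))    # 0.8 at the start of training
print(round(alignment_weight(300, 300), 3))  # 0.0 at the end
```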
- Fine-tuning: an example of evaluating the pretrained ViT-B weights on the FUSAR-Ship dataset
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 20005 \
main_finetune.py \
--dataset 'fusar' --data_path /data/FUSAR \
--model 'vit_base_patch16' \
--batch_size 8 --epochs 30 --exp_num=5 \
--finetune './SARMAE_vit_Base.pth'
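As noted in the preparation step, SAR-1M can be extended by appending image indices to the JSON index files. A minimal sketch, assuming each file stores a flat JSON list of indices (adapt if the actual schema differs):

```python
import json

def append_indices(json_path: str, new_indices: list) -> None:
    """Append new image indices to paired.json / unpaired.json,
    skipping indices that are already present."""
    with open(json_path) as f:
        indices = json.load(f)
    existing = set(indices)
    indices.extend(i for i in new_indices if i not in existing)
    with open(json_path, "w") as f:
        json.dump(indices, f)

# example: append_indices('paired.json', ['extra_0001', 'extra_0002'])
```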
| Pretrain | Backbone | Input size | Pretrained model |
|---|---|---|---|
| SARMAE | ViT-B | 224 × 224 | Weights |
| SARMAE | ViT-L | 224 × 224 | Weights |
Coming Soon.
Figure 3. SARMAE outperforms SOTA methods on multiple datasets. 1: 40-shot; 2: 30% labeled; a: multi-class; b: water.
| Method | FUSAR-SHIP (40-shot) | FUSAR-SHIP (30%) | MSTAR (40-shot) | MSTAR (30%) | SAR-ACD (30%) |
|---|---|---|---|---|---|
| ResNet-50 | - | 58.41 | - | 89.94 | 59.70 |
| Swin Transformer | - | 60.79 | - | 82.97 | 67.50 |
| BEiT | 59.70 | 71.13 | 40.70 | 69.75 | 79.77 |
| LoMaR | 82.70 | - | 77.00 | - | - |
| SAR-JEPA | 85.80 | - | 91.60 | - | - |
| SUMMIT | - | 71.91 | - | 98.39 | 84.25 |
| SARMAE(ViT-B) | 89.30 | 92.92 | 96.70 | 99.61 | 95.06 |
| SARMAE(ViT-L) | 90.86 | 92.80 | 97.24 | 98.92 | 95.63 |
Table 1. Performance comparison (Top1 Accuracy, %) of different methods on the target classification task.
| Method | SARDet-100k | SSDD | Method | RSAR |
|---|---|---|---|---|
| ImageNet | 52.30 | 66.40 | RoI Transformer | 35.02 |
| Deformable DETR | 50.00 | 52.60 | Def. DETR | 46.62 |
| Swin Transformer | 53.80 | 40.70 | RetinaNet | 57.67 |
| ConvNeXt | 55.10 | - | ARS-DETR | 61.14 |
| CATNet | - | 64.66 | R3Det | 63.94 |
| MSFA | 56.40 | - | ReDet | 64.71 |
| SARAFE | 57.30 | 67.50 | O-RCNN | 64.82 |
| SARMAE(ViT-B) | 57.90 | 68.10 | SARMAE(ViT-B) | 66.80 |
| SARMAE(ViT-L) | 63.10 | 69.30 | SARMAE(ViT-L) | 72.20 |
Table 2. Performance comparison (mAP, %) of different methods on horizontal and oriented object detection tasks.
| Method | Industrial Area | Natural Area | Land Use | Water | Housing | Other | mIoU | Water IoU |
|---|---|---|---|---|---|---|---|---|
| FCN | 37.78 | 71.58 | 1.24 | 72.76 | 67.69 | 39.05 | 48.35 | 85.95 |
| ANN | 41.23 | 72.92 | 0.97 | 75.95 | 68.40 | 56.01 | 52.58 | 87.32 |
| PSPNet | 33.99 | 72.31 | 0.93 | 76.51 | 68.07 | 57.07 | 51.48 | 87.13 |
| DeepLab V3+ | 40.62 | 70.67 | 0.55 | 72.93 | 69.96 | 34.53 | 48.21 | 87.53 |
| PSANet | 40.70 | 69.46 | 1.33 | 69.46 | 68.75 | 32.68 | 47.14 | 86.18 |
| DANet | 39.56 | 72.00 | 1.00 | 74.95 | 67.79 | 56.28 | 39.56 | 89.29 |
| SARMAE(ViT-B) | 65.87 | 75.65 | 29.20 | 84.01 | 73.23 | 71.21 | 66.53 | 92.31 |
| SARMAE(ViT-L) | 65.84 | 78.04 | 29.47 | 87.12 | 75.22 | 69.34 | 67.51 | 93.06 |
Table 3. Performance comparison of semantic segmentation methods on multiple classes and water classes.
If you find SARMAE helpful, please give a ⭐ and cite it as follows:
@misc{liu2025sarmaemaskedautoencodersar,
title={SARMAE: Masked Autoencoder for SAR Representation Learning},
author={Danxu Liu and Di Wang and Hebaixu Wang and Haoyang Chen and Wentao Jiang and Yilin Cheng and Haonan Guo and Wei Cui and Jing Zhang},
year={2025},
eprint={2512.16635},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16635},
}
- This project is released under the CC BY-NC 4.0 license.
- For any other questions please contact Danxu Liu at bit.edu.cn or gmail.com.


