Danxu Liu1,4 *, Di Wang2,4 *, Hebaixu Wang2,4 *, Haoyang Chen2,4 *, Wentao Jiang2, Yilin Cheng3,4, Haonan Guo2,4, Wei Cui1 †, Jing Zhang2,4 †.
1 Beijing Institute of Technology, 2 Wuhan University, 3 Fudan University, 4 Zhongguancun Academy.
* Equal contribution. † Corresponding authors.
Update | Abstract | Datasets | Pre-training | Usage | Statement
2026.3.24
- The code for pretraining and classification fine-tuning is released!
2026.3.23
- SARMAE pretrained weights are publicly available on Hugging Face and Baidu Netdisk.
2026.3.16
- SAR-1M dataset is publicly available on Hugging Face and Baidu Netdisk.
2026.2.21
- The paper is accepted by CVPR 2026! 🎉🎉🎉
2025.12.19
- The paper is posted on arXiv! (arXiv SARMAE)
Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks.
Figure 1. Overview of the SARMAE pretraining framework. The framework consists of two branches: (i) a SAR branch following the MAE architecture with Speckle-Aware Representation Enhancement (SARE) to handle inherent speckle noise, and (ii) an optical branch using a frozen DINOv3 encoder. For paired SAR-optical data, Semantic Anchor Representation Constraint (SARC) aligns SAR features with semantic-rich optical representations. Unpaired SAR images are processed solely through the SAR branch.
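SARE injects SAR-specific speckle noise into the masked-autoencoder inputs during pretraining. A minimal sketch of such an injection, assuming the common multiplicative gamma speckle model (the paper's exact formulation may differ; `add_speckle`, `looks`, and `ratio` are illustrative names loosely mirroring the `--noise_ratio` flag):

```python
import torch

def add_speckle(img: torch.Tensor, looks: float = 4.0, ratio: float = 0.5) -> torch.Tensor:
    """Multiplicative gamma speckle: I_noisy = I * n, with n ~ Gamma(L, L),
    so E[n] = 1 and the noise strength decreases as the number of looks L grows."""
    if torch.rand(()) > ratio:  # inject noise into only a fraction of samples
        return img
    noise = torch.distributions.Gamma(looks, looks).sample(img.shape)
    return img * noise

x = torch.rand(2, 1, 224, 224)          # toy single-channel SAR patches
y = add_speckle(x, looks=4.0, ratio=1.0)
print(y.shape)                          # same shape as the input
```

Because the gamma noise has unit mean, the expected intensity of the image is preserved while local texture is corrupted, which is what makes speckle hard for fine-grained representation learning.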
Figure 2. The organization of data sources in SAR-1M.
SAR-1M is a large-scale synthetic aperture radar (SAR) image dataset designed for SAR representation learning. The dataset contains over one million SAR images, and about 75% of the SAR samples are paired with geographically aligned optical images, enabling multimodal remote sensing studies.
Environment:
- Python 3.8.20
- PyTorch 1.12.1+cu113
- torchvision 0.13.1+cu113
- timm 0.6.13
conda create -n sarmae python=3.8 -y
conda activate sarmae
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
- Preparing with SAR-1M: Download SAR-1M. The indices of paired images are provided in `paired.json`, while those of unpaired images are listed in `unpaired.json`. To extend SAR-1M with additional pretraining data, append the corresponding image indices to these JSON files.
- Pretraining: take ViT-B as an example (batch size: 4096 = 8 GPUs × 512)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port 20003 \
train_mae_contrastive.py \
--model mae_vit_base_patch16 \
--data_path ./data \
--enable_sar_noise \
--noise_ratio 0.5 --random_noise \
--noise_min 0.0 --noise_max 0.7 \
--output_dir ./output_vitb \
--batch_size 512 --epochs 300 \
--lr 1e-4 --mae_loss_weight 1 --alignment_loss_weight 0.8 \
--loss_schedule cosine \
--sar_pretrained ./mae_pretrain_vit_base.pth \
--dinov3_pretrained ./dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth \
--freeze_optical_completely \
--clip_grad 1.0
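The `--mae_loss_weight`, `--alignment_loss_weight`, and `--loss_schedule cosine` flags suggest the reconstruction and alignment losses are combined with a scheduled weight. A hypothetical sketch of such a cosine schedule (the actual form and direction in `train_mae_contrastive.py` may differ):

```python
import math

def alignment_weight(epoch: int, total_epochs: int, base_weight: float = 0.8) -> float:
    """Hypothetical cosine decay of the alignment-loss weight over training,
    starting at base_weight (cf. --alignment_loss_weight 0.8) and ending at 0."""
    return base_weight * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

# total loss per step would then be:
#   L = mae_loss_weight * L_reconstruction + alignment_weight(epoch) * L_alignment
print(round(alignment_weight(0, 300), 3))    # 0.8 at the start of training
print(round(alignment_weight(300, 300), 3))  # 0.0 at the end
```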
- Fine-tuning: an example of evaluating the pretrained ViT-B weights on the FUSAR-Ship dataset
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 20005 \
main_finetune.py \
--dataset 'fusar' --data_path /data/FUSAR \
--model 'vit_base_patch16' \
--batch_size 8 --epochs 30 --exp_num=5 \
--finetune './SARMAE_vit_Base.pth'
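As noted in the preparation step, SAR-1M can be extended by appending image indices to the JSON index files. A minimal sketch, assuming each file stores a flat JSON list of indices (adapt if the actual schema differs):

```python
import json

def append_indices(json_path: str, new_indices: list) -> None:
    """Append new image indices to paired.json / unpaired.json,
    skipping indices that are already present."""
    with open(json_path) as f:
        indices = json.load(f)
    existing = set(indices)
    indices.extend(i for i in new_indices if i not in existing)
    with open(json_path, "w") as f:
        json.dump(indices, f)

# example: append_indices('paired.json', ['extra_0001', 'extra_0002'])
```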
| Pretrain | Backbone | Input size | Pretrained model |
|---|---|---|---|
| SARMAE | ViT-B | 224 × 224 | Weights |
| SARMAE | ViT-L | 224 × 224 | Weights |
Coming Soon.
Figure 3. SARMAE outperforms SOTA methods on multiple datasets. 1: 40-shot; 2: 30% labeled; a: multi-class; b: water.
| Method | FUSAR-SHIP (40-shot) | FUSAR-SHIP (30%) | MSTAR (40-shot) | MSTAR (30%) | SAR-ACD (30%) |
|---|---|---|---|---|---|
| ResNet-50 | - | 58.41 | - | 89.94 | 59.70 |
| Swin Transformer | - | 60.79 | - | 82.97 | 67.50 |
| BEiT | 59.70 | 71.13 | 40.70 | 69.75 | 79.77 |
| LoMaR | 82.70 | - | 77.00 | - | - |
| SAR-JEPA | 85.80 | - | 91.60 | - | - |
| SUMMIT | - | 71.91 | - | 98.39 | 84.25 |
| SARMAE(ViT-B) | 89.30 | 92.92 | 96.70 | 99.61 | 95.06 |
| SARMAE(ViT-L) | 90.86 | 92.80 | 97.24 | 98.92 | 95.63 |
Table 1. Performance comparison (Top1 Accuracy, %) of different methods on the target classification task.
| Method | SARDet-100k | SSDD | Method | RSAR |
|---|---|---|---|---|
| ImageNet | 52.30 | 66.40 | RoI Transformer | 35.02 |
| Deformable DETR | 50.00 | 52.60 | Def. DETR | 46.62 |
| Swin Transformer | 53.80 | 40.70 | RetinaNet | 57.67 |
| ConvNeXt | 55.10 | - | ARS-DETR | 61.14 |
| CATNet | - | 64.66 | R3Det | 63.94 |
| MSFA | 56.40 | - | ReDet | 64.71 |
| SARAFE | 57.30 | 67.50 | O-RCNN | 64.82 |
| SARMAE(ViT-B) | 57.90 | 68.10 | SARMAE(ViT-B) | 66.80 |
| SARMAE(ViT-L) | 63.10 | 69.30 | SARMAE(ViT-L) | 72.20 |
Table 2. Performance comparison (mAP, %) of different methods on horizontal and oriented object detection tasks.
| Method | Industrial Area | Natural Area | Land Use | Water | Housing | Other | mIoU | Water IoU |
|---|---|---|---|---|---|---|---|---|
| FCN | 37.78 | 71.58 | 1.24 | 72.76 | 67.69 | 39.05 | 48.35 | 85.95 |
| ANN | 41.23 | 72.92 | 0.97 | 75.95 | 68.40 | 56.01 | 52.58 | 87.32 |
| PSPNet | 33.99 | 72.31 | 0.93 | 76.51 | 68.07 | 57.07 | 51.48 | 87.13 |
| DeepLab V3+ | 40.62 | 70.67 | 0.55 | 72.93 | 69.96 | 34.53 | 48.21 | 87.53 |
| PSANet | 40.70 | 69.46 | 1.33 | 69.46 | 68.75 | 32.68 | 47.14 | 86.18 |
| DANet | 39.56 | 72.00 | 1.00 | 74.95 | 67.79 | 56.28 | 39.56 | 89.29 |
| SARMAE(ViT-B) | 65.87 | 75.65 | 29.20 | 84.01 | 73.23 | 71.21 | 66.53 | 92.31 |
| SARMAE(ViT-L) | 65.84 | 78.04 | 29.47 | 87.12 | 75.22 | 69.34 | 67.51 | 93.06 |
Table 3. Performance comparison of semantic segmentation methods on multiple classes and water classes.
If you find SARMAE helpful, please give a ⭐ and cite it as follows:
@misc{liu2025sarmaemaskedautoencodersar,
title={SARMAE: Masked Autoencoder for SAR Representation Learning},
author={Danxu Liu and Di Wang and Hebaixu Wang and Haoyang Chen and Wentao Jiang and Yilin Cheng and Haonan Guo and Wei Cui and Jing Zhang},
year={2025},
eprint={2512.16635},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16635},
}
- This project is released under the CC BY-NC 4.0 license.
- For any other questions please contact Danxu Liu at bit.edu.cn or gmail.com.


