Official implementation of the ICRA 2026 paper:
SLNet is a super-lightweight PyTorch framework for 3D point cloud classification and segmentation. The architecture integrates:
- Nonparametric Adaptive Point Embedding (NAPE)
- Lightweight Geometric Modulation Units (GMU)
- Parameter-free normalization
- Compact residual MLP refinement
The design objective is maximum accuracy-per-parameter efficiency, achieving strong performance with:
- extremely low parameter count
- minimal memory footprint
- low computational complexity
- no attention mechanisms
- no graph convolutions
- no heavy residual stacks
This makes SLNet particularly suitable for edge devices, embedded GPUs, and resource-constrained systems such as NVIDIA Jetson platforms.
.
├── attention/
│ └── *.png # Attention map images comparing NAPE, NAPE+GMU, and DGCNN
├── checkpoints/ # Trained model checkpoints
├── data/
│ └── dataset/ # All datasets
├── decoder/ # Segmentation decoders
├── encoder/ # NAPE, GMU, backbone modules
├── model_slnet_t/ # SLNet-T model (SLNet-Transformer)
├── pointnet2_ops_lib/ # Custom CUDA/C++ ops (compiled)
├── pytorch3d/ # PyTorch3D source (compiled from source)
├── scripts/ # Training / evaluation scripts
├── tasks/ # Python entry points
├── utils/ # Utilities (logging, loaders, metrics)
├── requirements.txt
└── README.md
SLNet-T is trained on an NVIDIA RTX 5090 GPU; details of its installation can be found in this link.
SLNet-S/M supports:
- ✅ x86_64 Linux + NVIDIA Desktop GPU
- ✅ ARM64 NVIDIA Jetson Orin (JetPack 6.x)
git clone https://github.com/m-saeid/SLNet
cd SLNet

python3 -m venv ~/venvs/slnet
source ~/venvs/slnet/bin/activate
pip install --upgrade pip setuptools wheel

pip install -r requirements.txt

Install the PyTorch build matching your CUDA version.
Example (CUDA 12.1):
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu121

Verify:
python - << 'EOF'
import torch
print("CUDA:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
EOF

sudo apt update
sudo apt install gcc-10 g++-10
export CC=gcc-10
export CXX=g++-10

Optional: add the export lines to ~/.bashrc for persistence.
cd pointnet2_ops_lib
pip install .
cd ..

pip install pytorch3d

Or build from source:
git clone https://github.com/facebookresearch/pytorch3d
cd pytorch3d
pip install -e .
cd ..

python - << 'EOF'
import torch
import pointnet2_ops
import pytorch3d
print("CUDA:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
EOF
⚠️ Jetson Orin GPUs are Ampere (sm_87), and CUDA ≥ 12.0 removes legacy architectures. Always set TORCH_CUDA_ARCH_LIST=8.7 when compiling CUDA extensions.
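The architecture string can also be read from the device instead of being typed by hand. The following is a minimal sketch, assuming PyTorch is already installed; `arch_string` and `detect_arch_list` are hypothetical helper names and are not part of this repository.

```python
import os


def arch_string(major: int, minor: int) -> str:
    """Format a compute capability the way TORCH_CUDA_ARCH_LIST expects, e.g. (8, 7) -> "8.7"."""
    return f"{major}.{minor}"


def detect_arch_list(default: str = "8.7") -> str:
    """Return the arch string of GPU 0, or `default` if torch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper still works without PyTorch
    except ImportError:
        return default
    if not torch.cuda.is_available():
        return default
    major, minor = torch.cuda.get_device_capability(0)
    return arch_string(major, minor)


if __name__ == "__main__":
    # These assignments only affect this process and any build it launches;
    # they do not change the parent shell.
    os.environ.setdefault("TORCH_CUDA_ARCH_LIST", detect_arch_list())
    os.environ.setdefault("FORCE_CUDA", "1")
    print("TORCH_CUDA_ARCH_LIST =", os.environ["TORCH_CUDA_ARCH_LIST"])
```

On a Jetson Orin this should report 8.7; if it reports something else, the exported value should match the device.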
Before building CUDA extensions:
sudo nvpmodel -m 0
sudo jetson_clocks

git clone https://github.com/m-saeid/SLNet
cd SLNet

python3 -m venv ~/venvs/slnet
source ~/venvs/slnet/bin/activate
pip install --upgrade pip setuptools wheel

pip install -r requirements.txt

For JetPack 6.2 (CUDA 12.6), the following Jetson-optimized index works:
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
  --index-url https://pypi.jetson-ai-lab.io/jp6/cu126

Jetson uses:
- ARM64 (aarch64)
- CUDA integrated with JetPack
- Ampere GPU (sm_87)
Standard PyPI wheels do not provide CUDA acceleration on Jetson.
python - << 'EOF'
import torch
print("Torch:", torch.__version__)
print("CUDA:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
EOF

Expected output:
CUDA: True
Device: Orin (nvgpu)
sudo apt update
sudo apt install gcc-10 g++-10
export CC=gcc-10
export CXX=g++-10

Optional (persistent):
echo 'export CC=gcc-10' >> ~/.bashrc
echo 'export CXX=g++-10' >> ~/.bashrc

Set CUDA architecture:
export TORCH_CUDA_ARCH_LIST="8.7"
export FORCE_CUDA=1

Build and install:
cd pointnet2_ops_lib
pip install -e . --no-build-isolation
cd ..

If building fails, ensure the following line is present inside:
- pointnet2_ops_lib/setup.py
- pointnet2_ops_lib/pointnet2_ops/pointnet2_utils.py

os.environ["TORCH_CUDA_ARCH_LIST"] = "8.7"

Expected result:
Successfully built pointnet2_ops
Successfully installed pointnet2_ops-3.0.0

Verify:
python - << 'EOF'
import pointnet2_ops._ext
print("pointnet2_ops extension loaded successfully")
EOF

git clone https://github.com/facebookresearch/pytorch3d
cd pytorch3d
export FORCE_CUDA=1
export TORCH_CUDA_ARCH_LIST="8.7"
export CUDA_HOME=/usr/local/cuda
export MAX_JOBS=2
pip install -e . --no-build-isolation
cd ..

This compiles CUDA kernels for:
- knn_points
- ball_query
- sample_farthest_points

Verify PyTorch3D CUDA:
python - << 'EOF'
import torch
from pytorch3d.ops import knn_points
print("CUDA available:", torch.cuda.is_available())
EOF

python - << 'EOF'
import torch
import pointnet2_ops
import pytorch3d
print("CUDA:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
EOF

Expected output:
CUDA: True
Device: Orin (nvgpu)

- GPU Architecture: sm_87
- Always export:
export TORCH_CUDA_ARCH_LIST=8.7
export FORCE_CUDA=1
- Use performance mode when compiling:
sudo nvpmodel -m 0
sudo jetson_clocks
After completing these steps, SLNet runs with full CUDA acceleration on both:
- ✅ Desktop NVIDIA GPUs
- ✅ Jetson Orin devices
Place datasets under:
data/dataset/
Required structure:
data/
└── dataset/
├── modelnet40_ply_hdf5_2048/
├── s3dis/Stanford3dDataset_v1.2_Aligned_Version/
├── scanobject/h5_files/
├── shapenetcore_partanno_segmentation_benchmark_v0_normal/
└── modelnet_fewshot/
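Before launching any task, it can help to confirm that the folders above are in place. The following is a small sketch that only checks for the directory names listed in the required structure; `missing_datasets` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path

# Sub-directories taken from the required structure above.
EXPECTED_DIRS = [
    "modelnet40_ply_hdf5_2048",
    "s3dis/Stanford3dDataset_v1.2_Aligned_Version",
    "scanobject/h5_files",
    "shapenetcore_partanno_segmentation_benchmark_v0_normal",
    "modelnet_fewshot",
]


def missing_datasets(root="data/dataset"):
    """Return the expected sub-directories that do not exist under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    missing = missing_datasets()
    if missing:
        print("Missing dataset folders:")
        for name in missing:
            print("  -", name)
    else:
        print("All dataset folders found.")
```

Run it from the repository root so that the relative path `data/dataset` resolves correctly.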
S3DIS Dataset: Fix the known problem in the raw S3DIS data by following this link, then run the following command to prepare the dataset.
python data/preprocess_s3dis.py --raw data/dataset/s3dis/Stanford3dDataset_v1.2_Aligned_Version --out data/dataset/s3dis/processed/Stanford3dDataset_v1.2_Aligned_Version

Other Datasets:
No manual preprocessing is required. All sampling and normalization are handled internally.
All tasks are launched via scripts/:
cd SLNet
bash scripts/run_all_test.sh
bash scripts/run_modelnet.sh
bash scripts/run_scanobject.sh
bash scripts/run_shapenet.sh
bash scripts/run_s3dis.sh
bash scripts/run_fewshot.sh

Unified launcher for Jetson devices:
source ~/venvs/slnet/bin/activate
cd SLNet
python tasks/cls_modelnet.py --device cuda

This runs the main model directly with GPU acceleration, without the shell scripts.
source ~/venvs/slnet/bin/activate
cd SLNet
# ModelNet40
bash scripts/run_modelnet.sh
# Few-shot
bash scripts/run_fewshot.sh

import torch
from pytorch3d.ops import knn_points, ball_query, sample_farthest_points
print("Torch CUDA:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
x = torch.randn(1, 1000, 3, device='cuda')
y = torch.randn(1, 1000, 3, device='cuda')
d, i, _ = knn_points(x, y, K=3)
print("knn_points CUDA:", d.is_cuda, i.is_cuda)
idx = ball_query(x, y, radius=0.2, K=5)
print("ball_query CUDA:", idx[0].is_cuda)
pts, idx2 = sample_farthest_points(x, K=100)
print("fps CUDA:", pts.is_cuda, idx2.is_cuda)

Expected output: all True.
bash scripts/eval_model.sh

Computes:
- GFLOPs
- peak GPU memory
- inference latency
- parameter count
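For readers who want to reproduce two of these numbers outside the script, the sketch below shows one common way to measure latency and parameter count. It is an illustration under stated assumptions, not a description of what `scripts/eval_model.sh` does internally: `measure_latency_ms` and `count_parameters_m` are hypothetical names, and the warm-up and iteration counts are arbitrary.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=10, iters=50, sync=None):
    """Return the median wall-clock time of `fn()` in milliseconds.

    `sync` is an optional zero-argument callable run around each timed call;
    for CUDA work, pass torch.cuda.synchronize so queued kernels are included.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    times = []
    for _ in range(iters):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        times.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(times)


def count_parameters_m(model):
    """Parameter count in millions for a torch.nn.Module-like object."""
    return sum(p.numel() for p in model.parameters()) / 1e6
```

With a PyTorch model, a call might look like `measure_latency_ms(lambda: model(x), sync=torch.cuda.synchronize)` inside a `torch.no_grad()` block.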
bash scripts/attention_map.sh

Outputs PNG comparisons between:
- NAPE
- NAPE + GMU
- DGCNN (2nd layer)
- encoder/ – NAPE, GMU, backbone
- decoder/ – segmentation heads
- model_slnet_t/ – SLNet-T model
- tasks/ – training/eval entry points
- scripts/ – automation wrappers
- utils/ – loaders, metrics, logging
- attention/ – visualization outputs
- checkpoints/ – trained weights
All benchmark results below follow the exact evaluation protocol described in the paper. Unless otherwise stated, profiling is performed on a single RTX 2080 (CUDA 11.8) with identical compile flags and cuDNN benchmarking disabled for determinism.
| Model | OA (%) | mAcc (%) | Param (M) | FLOPs (G) | Mem (MB) | Time (ms) | NetScore | NetScore+ |
|---|---|---|---|---|---|---|---|---|
| PointNet | 90.04 | 86.52 | 3.47 | 0.44 | 30.20 | 0.53 | 76.34 | 70.29 |
| PointNet++ (SSG) | 92.31 | 89.65 | 1.47 | 0.85 | 34.15 | 5.07 | 77.61 | 66.41 |
| DGCNN | 92.82 | 89.09 | 1.81 | 2.68 | 86.22 | 6.05 | 71.84 | 58.25 |
| CurveNet | 93.38 | - | 2.14 | 0.32 | 29.45 | 11.70 | 80.34 | 67.65 |
| PointMLP | 93.66 | 90.99 | 13.23 | 15.67 | 94.80 | 5.96 | 55.69 | 41.93 |
| SLNet-S | 93.64 | 89.46 | 0.14 | 0.31 | 19.61 | 0.95 | 92.45 | 86.10 |
| SLNet-M | 93.92 | 91.10 | 0.54 | 1.21 | 32.09 | 1.75 | 80.72 | 72.00 |
| Model | OA (%) | mAcc (%) | NetScore | NetScore+ |
|---|---|---|---|---|
| PointNet | 91.39 | 88.79 | 76.60 | 70.58 |
| PointNet++ | 94.02 | 92.40 | 77.96 | 66.77 |
| DGCNN | 94.03 | 92.64 | 72.07 | 58.49 |
| CurveNet | 94.12 | 92.65 | 80.59 | 67.91 |
| PointMLP | 95.33 | 94.30 | 56.00 | 42.24 |
| SLNet-S | 94.53 | 92.21 | 92.65 | 86.30 |
| SLNet-M | 94.81 | 93.76 | 88.53 | 79.78 |
| Model | 5w-10s | 5w-20s | 10w-10s | 10w-20s |
|---|---|---|---|---|
| PointNet | 52.0 | 57.8 | 46.6 | 35.2 |
| PointCNN | 65.4 | 68.6 | 46.6 | 50.0 |
| Point-NN | 88.8 | 90.9 | 79.9 | 84.9 |
| NPNet | 92.0 | 93.2 | 82.5 | 87.6 |
| SLNet-S | 84.0 | 89.0 | 75.5 | 93.5 |
| SLNet-M | 89.0 | 95.0 | 80.0 | 94.0 |
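Reading the column headers as N-way K-shot settings (for example, 5w-10s as 5 classes with 10 labelled samples each), one evaluation episode can be sketched as below. This only illustrates the general few-shot protocol; the number of query samples, the helper name `sample_episode`, and the toy labels are assumptions and are not taken from the paper or from `scripts/run_fewshot.sh`.

```python
import random


def sample_episode(labels, n_way, k_shot, n_query, rng=None):
    """Return (support, query) index lists for one N-way K-shot episode.

    `labels[i]` is the class of sample i; each sampled class contributes
    `k_shot` support samples and `n_query` disjoint query samples.
    """
    rng = rng or random.Random()
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query samples")
    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += picked[:k_shot]
        query += picked[k_shot:]
    return support, query


# Toy data: 8 classes with 40 samples each (hypothetical, not ModelNet40).
labels = [c for c in range(8) for _ in range(40)]
support, query = sample_episode(labels, n_way=5, k_shot=10, n_query=20, rng=random.Random(0))
print(len(support), len(query))
```

Accuracy for a setting is then typically averaged over many such episodes.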
| Model | OA (%) | mAcc (%) | Param (M) | FLOPs (G) | Mem (MB) | Time (ms) | NetScore | NetScore+ |
|---|---|---|---|---|---|---|---|---|
| PointNet | 68.2 | 63.4 | 3.46 | 0.44 | 30.17 | 0.52 | 71.52 | 65.64 |
| PointNet++ | 77.9 | 75.4 | 1.47 | 0.85 | 34.12 | 5.02 | 74.70 | 63.52 |
| DGCNN | 78.1 | 73.6 | 1.80 | 2.68 | 86.19 | 6.23 | 68.87 | 55.22 |
| PointMLP | 85.4 | 83.9 | 13.23 | 15.67 | 94.80 | 5.96 | 54.09 | 40.33 |
| SLNet-S | 83.45 | 81.45 | 0.12 | 0.26 | 16.87 | 0.94 | 91.76 | 85.75 |
| SLNet-M | 84.25 | 82.86 | 0.48 | 1.02 | 26.47 | 1.69 | 80.12 | 71.85 |
| Model | ins-IoU | cls-IoU | Param (M) | FLOPs (G) | Mem (MB) | Time (ms) | NetScore | NetScore+ |
|---|---|---|---|---|---|---|---|---|
| PointNet | 83.7 | 80.4 | 8.34 | 5.79 | 124.39 | 1.90 | 59.37 | 47.50 |
| PointNet++ (SSG) | 85.1 | 81.9 | 1.40 | 1.12 | 42.20 | 5.74 | 74.57 | 62.65 |
| DGCNN | 85.2 | 82.3 | 1.46 | 4.96 | 157.26 | 9.86 | 68.01 | 52.06 |
| PointMLP | 86.1 | 84.6 | 16.75 | 6.26 | 121.04 | 4.42 | 56.88 | 43.24 |
| SLNet-S | 85.27 | 83.99 | 1.24 | 0.90 | 46.86 | 3.30 | 76.49 | 65.54 |
| SLNet-M | 85.53 | 84.45 | 1.90 | 2.33 | 49.45 | 4.37 | 70.60 | 58.93 |
Measured on Jetson Orin Nano (1024 points, batch size 4).
| Dataset | Model | GFLOPs | Param (M) | Mem (MB) | Time (ms) |
|---|---|---|---|---|---|
| ModelNet40 | SLNet-S | 0.31 | 0.14 | 21.30 | 17.83 |
| ModelNet40 | SLNet-M | 1.22 | 0.55 | 33.00 | 30.11 |
| ScanObjectNN | SLNet-S | 0.26 | 0.12 | 17.87 | 16.36 |
| ScanObjectNN | SLNet-M | 1.02 | 0.48 | 27.37 | 27.43 |
| ShapeNetPart | SLNet-S | 0.83 | 1.24 | 30.09 | 38.50 |
| ShapeNetPart | SLNet-M | 2.25 | 1.90 | 48.27 | 57.81 |
NetScore:
NetScore = 20 log10( a² / sqrt(p · m) )
NetScore+:
NetScore+ = 20 log10( a² / ( sqrt(p · m) · (t · r)^(1/4) ) )
Where:
- a = accuracy (%)
- p = parameters (M)
- m = FLOPs (G)
- t = latency (ms)
- r = peak memory (MB)
Higher is better for both metrics.
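The two formulas translate directly into code. The sketch below assumes the units used in the benchmark tables (accuracy in percent, parameters in millions, FLOPs in G, latency in ms, memory in MB); with those units the PointNet row of the first table is reproduced to within rounding of the tabulated inputs. The function names are illustrative.

```python
import math


def netscore(a, p, m):
    """NetScore = 20 * log10(a^2 / sqrt(p * m))."""
    return 20.0 * math.log10(a ** 2 / math.sqrt(p * m))


def netscore_plus(a, p, m, t, r):
    """NetScore+ = 20 * log10(a^2 / (sqrt(p * m) * (t * r)^(1/4)))."""
    return 20.0 * math.log10(a ** 2 / (math.sqrt(p * m) * (t * r) ** 0.25))


# PointNet row of the first benchmark table:
# OA 90.04 %, 3.47 M parameters, 0.44 GFLOPs, 0.53 ms, 30.20 MB.
print(round(netscore(90.04, 3.47, 0.44), 2))
print(round(netscore_plus(90.04, 3.47, 0.44, 0.53, 30.20), 2))
```

Because both scores are logarithmic, halving the parameter count at fixed accuracy adds about 3 points to NetScore.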
This section provides a visual breakdown of SLNet components and efficiency characteristics.
Pipeline:
- NAPE Front-End – Nonparametric Adaptive Point Embedding
- GMU Front-End – Geometric Modulation Unit
- Four Encoder Stages
  - Farthest Point Sampling
  - Local grouping (ball query / kNN)
  - Parameter-free normalization
  - Lightweight residual MLP refinement
- Task Heads
  - Classification head with global pooling
  - U-Net–style decoder for segmentation with feature propagation and global context
The architecture eliminates attention layers and heavy graph convolutions while preserving geometric awareness.
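The sampling and grouping steps of each encoder stage can be illustrated in a few lines of plain Python. This is a didactic sketch only: the repository relies on compiled CUDA ops (`pointnet2_ops`, PyTorch3D) for these steps, and the function names below are hypothetical.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return indices of `k` points, each farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be in 1..len(points)")
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def knn_group(points, center_idx, k):
    """Indices of the `k` nearest points to points[center_idx] (the centre included)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], points[center_idx]))
    return order[:k]


pts = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0)]
centers = farthest_point_sampling(pts, k=3)
groups = [knn_group(pts, c, k=2) for c in centers]
print(centers, groups)
```

Each sampled centre, together with its group, forms the local neighbourhood that the normalization and residual MLP steps then operate on.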
The Nonparametric Adaptive Point Embedding (NAPE) block maps raw 3D coordinates using:
- Fused Gaussian RBF bases
- Cosine harmonic bases
- Adaptive bandwidth estimation
- Parameter-free geometric encoding
This produces geometry-aware embeddings without learned convolutional kernels.
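To make the idea of a parameter-free embedding concrete, the toy sketch below builds features from Gaussian RBF and cosine bases with a bandwidth estimated from the data. It is not the paper's formulation: the choice of bandwidth (standard deviation of all coordinates), the fixed centres and frequencies, and the simple concatenation of the two bases are assumptions made purely for illustration.

```python
import math
import statistics


def adaptive_sigma(points):
    """Bandwidth from the data: std. dev. of all coordinates (an illustrative choice)."""
    coords = [c for p in points for c in p]
    return statistics.pstdev(coords) or 1.0  # avoid a zero bandwidth


def embed_point(point, centers, sigma, freqs):
    """Concatenate Gaussian RBF and cosine features for each of the 3 coordinates."""
    feats = []
    for x in point:
        feats += [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]
        feats += [math.cos(f * x) for f in freqs]
    return feats


cloud = [(0.0, 0.0, 0.0), (1.0, 0.5, -0.5), (-1.0, 0.25, 0.75)]
sigma = adaptive_sigma(cloud)
centers = [-1.0, 0.0, 1.0]  # fixed RBF centres (hypothetical)
freqs = [1.0, 2.0]          # fixed cosine frequencies (hypothetical)
embedding = [embed_point(p, centers, sigma, freqs) for p in cloud]
print(len(embedding), len(embedding[0]))  # 3 points, 3 * (3 + 2) = 15 features each
```

Nothing in this sketch is learned, which is the property the text above refers to as nonparametric.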
GMU performs lightweight geometric feature modulation via:
- Local context aggregation
- Channel-wise modulation
- Minimal parameter overhead
- Residual refinement coupling
It acts as a geometry-sensitive gating mechanism between embedding and encoder refinement.
Radar plot comparing NetScore and NetScore+ for:
- SLNet-S
- SLNet-M
- PointNet
- PointNet++
- DGCNN
- PointMLP
SLNet-S dominates in composite efficiency metrics due to its ultra-low parameter count and reduced FLOPs.
Comparison on ModelNet40 samples:
- Top row: DGCNN stage-2 EdgeConv gradient saliency
- Middle row: NAPE gradient saliency (channel width 16)
- Bottom row: NAPE gradient saliency (channel width 32)
NAPE produces sharper and more semantically focused responses despite operating at the input embedding stage.
- NVIDIA GPU (RTX / A-series)
- CUDA ≥ 11.8
- Python ≥ 3.9
| Component | Value |
|---|---|
| Device | Jetson Orin Nano Engineering |
| JetPack | 6.2.2 |
| CUDA | 12.6 |
| Driver | 540.5.0 |
| Python | 3.10.12 |
| RAM | 8GB |
| Swap | 3.7GB |
| Storage | 56GB (35GB free) |
Performance mode:
sudo nvpmodel -m 0
sudo jetson_clocks

torch==2.8.0
torchvision==0.23.0
torchaudio==2.8.0
pytorch3d==0.7.8
numpy==1.26.4
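A quick way to compare an environment against the versions listed above is to query the installed package metadata. The sketch below uses only the standard library; `check_pins` is a hypothetical helper, and the distribution names are assumed to match the package names as written.

```python
from importlib import metadata

# Versions listed above for the Jetson setup.
PINS = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "torchaudio": "2.8.0",
    "pytorch3d": "0.7.8",
    "numpy": "1.26.4",
}


def installed_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every mismatch."""
    report = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Compare the public version only: local build tags such as "+cu126" are ignored.
        if found is None or found.split("+")[0] != expected:
            report[name] = (expected, found)
    return report


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty report means every listed package is installed at the listed version.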
SLNet is explicitly designed for:
- embedded inference
- robotics perception
- autonomous systems
- edge AI
- low-power GPUs
- mobile robotics
- real-time 3D perception
with full CUDA acceleration and minimal memory pressure.
@article{saeid2026slnet,
title={SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition},
author={Saeid, Mohammad and Salarpour, Amir and MohajerAnsari, Pedram and Pes{\'e}, Mert D},
journal={...},
year={2026}
}

📧 Questions? Reach out to: imm.saeid@gmail.com


