An AI decompiler targeting MLIR. The model learns to lift binaries to structured control flow IR; deterministic compiler passes then lower that IR to C.
               ┌─────────────────────────┐
               │        AI Model         │
               │    p-code → SCF MLIR    │
               └────────────┬────────────┘
                            │
                            ▼
               ┌─────────────────────────┐
               │  SCF + arith + memref   │
               │      (target repr)      │
               └──────┬──────────┬───────┘
                      │          │
         ┌────────────┘          └────────────┐
         ▼                                    ▼
┌─────────────────┐                  ┌─────────────────┐
│ Pipeline A:     │                  │ Pipeline B:     │
│ SCF → LLVM      │                  │ SCF → EmitC     │
│ → LLVM IR       │                  │ → C code        │
│ → x86 obj       │                  │                 │
└─────────────────┘                  └─────────────────┘
         │                                    │
  (verification:                       (human-readable
   recompile &                          decompilation
   diff binary)                         output)
Polygeist (aka cgeist) is the key tool for
generating training pairs. It compiles C/C++ directly to SCF-level MLIR using Clang's
frontend, emitting into scf, affine, memref, and arith dialects rather than
going through LLVM IR.
This gives you matched (binary, SCF MLIR) pairs:
Source.c ──┬── cgeist ──→ SCF MLIR (training target)
           │
           └── clang -O2 ──→ binary (training input)
Polygeist is pinned to a specific LLVM commit and must be built against its bundled LLVM/Clang source tree (system packages won't work):
git clone https://github.com/llvm/Polygeist
cd Polygeist
git submodule update --init --depth 1
# Build bundled LLVM + MLIR + Clang
mkdir -p llvm-project/build && cd llvm-project/build
cmake -G Ninja ../llvm \
-DLLVM_ENABLE_PROJECTS="mlir;clang" \
-DLLVM_TARGETS_TO_BUILD="host" \
-DLLVM_USE_LINKER=lld \
-DLLVM_INCLUDE_BENCHMARKS=OFF \
-DCMAKE_BUILD_TYPE=Release
ninja
# Build Polygeist
cd ../../ && mkdir build && cd build
cmake -G Ninja .. \
-DMLIR_DIR=$PWD/../llvm-project/build/lib/cmake/mlir \
-DCLANG_DIR=$PWD/../llvm-project/build/lib/cmake/clang \
-DCMAKE_BUILD_TYPE=Release \
-DPOLYGEIST_USE_LINKER=lld
ninja

build.sh auto-detects cgeist at ~/.local/src/Polygeist/build/bin/cgeist
or on PATH. If not found, the forward pipeline is skipped.
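
The lookup order described above could be sketched as follows. This is a hypothetical helper written for illustration, not the actual contents of build.sh; the function name `find_cgeist` is an assumption.

```shell
# Hypothetical sketch of the cgeist lookup (not the real build.sh source):
# try the fixed build-tree path first, then fall back to PATH.
find_cgeist() (
    candidate="$HOME/.local/src/Polygeist/build/bin/cgeist"
    if [ -x "$candidate" ]; then
        printf '%s\n' "$candidate"
    elif command -v cgeist >/dev/null 2>&1; then
        command -v cgeist
    else
        return 1    # caller should skip the forward pipeline
    fi
)

# Usage:
#   if CGEIST=$(find_cgeist); then "$CGEIST" --version
#   else echo "cgeist not found; skipping forward pipeline" >&2
#   fi
```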
# C → SCF-level MLIR
cgeist sample.c -S --function='*' -O2 \
--raise-scf-to-affine \
--memref-fullrank \
-o sample_scf.mlir
# Same source → binary (training input)
clang -O2 sample.c -o sample.bin -c
objcopy -O binary sample.bin sample.raw

Key cgeist flags for training data:
- --raise-scf-to-affine: Produce higher-level affine loops where possible
- --memref-fullrank: Keep memref shapes for array access patterns
- -O0 through -O2: Generate pairs at different optimization levels
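
To produce pairs at every optimization level, the commands above can be generated in a loop. The sketch below only prints the commands so you can review them or pipe them to sh. The helper name `pair_commands` and the `<stem>_O<n>` output naming are assumptions of this example, and it does not handle paths containing spaces.

```shell
# Hypothetical helper: print the cgeist/clang/objcopy commands for one C file
# at each optimization level. Output file names are an assumption.
pair_commands() (
    src=$1
    stem=${src%.c}
    for opt in O0 O1 O2; do
        # Training target: SCF-level MLIR from cgeist
        printf '%s\n' "cgeist $src -S --function='*' -$opt --raise-scf-to-affine --memref-fullrank -o ${stem}_${opt}_scf.mlir"
        # Training input: object file, then its raw bytes
        printf '%s\n' "clang -$opt -c $src -o ${stem}_${opt}.bin"
        printf '%s\n' "objcopy -O binary ${stem}_${opt}.bin ${stem}_${opt}.raw"
    done
)

# Usage:
#   pair_commands sample.c        # review the commands
#   pair_commands sample.c | sh   # run them
```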
sample.c → cgeist → sample_cgeist.mlir (SCF + arith + memref)
Generates the target-level MLIR representation directly from C source. This is the forward direction used for training data generation.
sample_cgeist.mlir
→ mlir-opt (--lower-affine --convert-scf-to-cf --convert-cf-to-llvm ...)
→ mlir-translate --mlir-to-llvmir
→ llc -filetype=obj
→ sample_cgeist.o (ELF x86-64)
Compiles the SCF MLIR to a native object file. In the full pipeline, this lets you verify the model's output by diffing the compiled MLIR against the original input binary.
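
One way to do that diff is to compare the raw .text bytes of the two object files. The sketch below assumes GNU objcopy is installed; the helper name `same_text` is hypothetical. A byte-exact comparison is a strict check: semantically equivalent code can still differ in register allocation or instruction scheduling, so treat a mismatch as a prompt for closer inspection rather than proof of an error.

```shell
# Hypothetical helper: succeed when two object files have identical .text
# bytes. Exit status: 0 = identical, 1 = different, 2 = extraction failed.
same_text() (
    a=$(mktemp) && b=$(mktemp) || exit 2
    trap 'rm -f "$a" "$b"' EXIT
    objcopy -O binary --only-section=.text "$1" "$a" || exit 2
    objcopy -O binary --only-section=.text "$2" "$b" || exit 2
    cmp -s "$a" "$b"
)

# Usage:
#   same_text sample_cgeist.o original.o && echo "text sections match"
```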
sample_pure_emitc.mlir
→ mlir-translate --mlir-to-cpp
→ decompiled.cpp
The human-facing decompilation output. Requires system MLIR 20+ (EmitC
lvalue/for/func ops were added after MLIR 18).
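
A quick way to check that requirement before running Pipeline B is to parse the version banner. This sketch assumes `mlir-translate --version` prints a line containing "LLVM version X.Y.Z", as LLVM tools normally do; the helper names are hypothetical.

```shell
# Extract the major version from version-banner text read on stdin
# (a line such as "  LLVM version 20.1.0").
llvm_major() {
    sed -n 's/.*LLVM version \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Succeed when the major version found on stdin is at least 20.
emitc_ready() (
    major=$(llvm_major)
    [ -n "$major" ] && [ "$major" -ge 20 ]
)

# Usage:
#   mlir-translate --version | emitc_ready || echo "Pipeline B needs MLIR 20+" >&2
```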
sample.c → clang → sample.ll → mlir-translate --import-llvm → sample_llvm_dialect.mlir
Shows what LLVM-level MLIR looks like. This is the "floor" representation that the AI model would need to raise to SCF level.
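
To see how far this floor sits from SCF, you can count the unstructured branches the import produces. The commands in the comments use standard clang and mlir-translate flags, but the -O2 level is an assumption; `count_llvm_branches` is a hypothetical helper that counts lines, which matches the op count when each op sits on its own line.

```shell
# Commands behind the arrow diagram above (the -O2 level is an assumption):
#   clang -O2 -S -emit-llvm sample.c -o sample.ll
#   mlir-translate --import-llvm sample.ll -o sample_llvm_dialect.mlir

# Count lines containing an unstructured branch op (llvm.br, llvm.cond_br).
count_llvm_branches() {
    grep -c -E 'llvm\.(cond_br|br)([^A-Za-z_]|$)' "$1"
}

# Usage:
#   count_llvm_branches sample_llvm_dialect.mlir
```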
- scf.while → EmitC: convert-scf-to-emitc does not lower scf.while. Workaround: convert while loops to bounded for loops, or handle them in the memref→ptr shim before EmitC translation.
- Dynamic memref → EmitC: convert-memref-to-emitc only handles static shapes. The decompiler shim needs to convert memref<?xi32> to !emitc.ptr<i32> before the standard pipeline.
- No automatic raise: There is no LLVM dialect → SCF pass (that's what your AI model does). --lift-cf-to-scf exists but operates on the cf dialect, not the LLVM dialect, so it is only useful if you first strip LLVM-specific ops.
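
The first two limitations can be caught before running Pipeline B with a simple text scan. This is a hypothetical pre-flight check, not part of build.sh: it uses plain pattern matching, so comments or unusual formatting can fool it.

```shell
# Hypothetical pre-flight check: report constructs that the stock EmitC
# conversions do not handle. Returns 1 if any blocker is found.
emitc_blockers() (
    file=$1
    status=0
    if grep -q 'scf\.while' "$file"; then
        echo "scf.while present: not lowered by convert-scf-to-emitc"
        status=1
    fi
    # A '?' inside memref<...> marks a dynamic dimension.
    if grep -q 'memref<[^>]*?' "$file"; then
        echo "dynamic memref present: needs the memref to !emitc.ptr shim"
        status=1
    fi
    return $status
)

# Usage:
#   emitc_blockers sample_cgeist.mlir || echo "fix the above before EmitC" >&2
```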
src/
  sample.c                  # Reference C source
mlir/
  sample_pure_emitc.mlir    # EmitC dialect (translates to C via mlir-translate)
  (generated by build.sh:)
  sample_cgeist.mlir        # SCF MLIR from cgeist (forward pipeline)
  sample_cgeist_llvm.mlir   # LLVM dialect (Pipeline A intermediate)
  sample_cgeist.ll          # LLVM IR (Pipeline A intermediate)
  sample_llvm_dialect.mlir  # MLIR from importing clang LLVM IR (reference)
  sample.ll                 # LLVM IR from clang (reference)
output/                     # Generated by build.sh
  decompiled.cpp            # C code from EmitC (Pipeline B)
  sample_cgeist.s           # x86 assembly (Pipeline A)
  sample_cgeist.o           # x86 object (Pipeline A)
build.sh                    # Runs all pipelines
The target representation (SCF + arith + memref) is good because:
- Architecture-independent: Same MLIR target regardless of input binary's ISA
- Structured control flow: scf.for, scf.while, scf.if map directly to C
- Typed memory access: memref.load/memref.store with type info
- Deterministic lowering: MLIR → C is mechanical, not learned
- Verification path: MLIR → x86 lets you check semantic correctness
The model never needs to learn what C looks like. It learns binary → MLIR, and
deterministic tooling handles MLIR → C.
# Install toolchain (any supported LLVM version, e.g. 18, 20, 23; Pipeline B needs MLIR 20+)
sudo apt install mlir-<ver>-tools llvm-<ver> clang-<ver>
# Run all pipelines
chmod +x build.sh && ./build.sh