aletheia

An AI decompiler that targets MLIR. The model learns to lift binaries to a structured control flow IR; deterministic compiler passes then lower that IR to C.

Architecture

                          ┌─────────────────────────┐
                          │      AI Model           │
                          │   p-code → SCF MLIR     │
                          └────────────┬────────────┘
                                       │
                                       ▼
                          ┌─────────────────────────┐
                          │  SCF + arith + memref   │
                          │     (target repr)       │
                          └──────┬──────────┬───────┘
                                 │          │
                    ┌────────────┘          └────────────┐
                    ▼                                    ▼
          ┌─────────────────┐                  ┌─────────────────┐
          │  Pipeline A:    │                  │  Pipeline B:    │
          │  SCF → LLVM     │                  │  SCF → EmitC    │
          │  → LLVM IR      │                  │  → C code       │
          │  → x86 obj      │                  │                 │
          └─────────────────┘                  └─────────────────┘
                    │                                    │
             (verification:                       (human-readable
              recompile &                          decompilation
              diff binary)                          output)

Training Data: Polygeist

Polygeist (its frontend binary is cgeist) is the key tool for generating training pairs. It compiles C/C++ directly to SCF-level MLIR using Clang's frontend, emitting into the scf, affine, memref, and arith dialects rather than going through LLVM IR.

This gives you matched (binary, SCF MLIR) pairs:

Source.c ──┬── cgeist ──→ SCF MLIR (training target)
           │
           └── clang -O2 ──→ binary (training input)
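
One way to hold such a pair in memory is a small record keyed by the source it came from. This is a minimal sketch, not part of the repository: the names TrainingPair and make_pair are hypothetical, and the binary and MLIR contents are passed in directly rather than produced by running clang or cgeist.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class TrainingPair:
    """One (binary, SCF MLIR) example derived from a single C source file."""
    source_id: str   # short hash of the C source (hypothetical field)
    binary: bytes    # training input: bytes produced by clang
    scf_mlir: str    # training target: text produced by cgeist


def make_pair(c_source: str, binary: bytes, scf_mlir: str) -> TrainingPair:
    # Hash the source text so both halves trace back to the same file.
    digest = hashlib.sha256(c_source.encode("utf-8")).hexdigest()
    return TrainingPair(source_id=digest[:16], binary=binary, scf_mlir=scf_mlir)
```

Keying on a hash of the source makes it straightforward to spot duplicate sources, although whether the project deduplicates this way is an assumption.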

Building Polygeist

Polygeist is pinned to a specific LLVM commit and must be built against its bundled LLVM/Clang source tree (system packages won't work):

git clone https://github.com/llvm/Polygeist
cd Polygeist
git submodule update --init --depth 1

# Build bundled LLVM + MLIR + Clang
mkdir -p llvm-project/build && cd llvm-project/build
cmake -G Ninja ../llvm \
  -DLLVM_ENABLE_PROJECTS="mlir;clang" \
  -DLLVM_TARGETS_TO_BUILD="host" \
  -DLLVM_USE_LINKER=lld \
  -DLLVM_INCLUDE_BENCHMARKS=OFF \
  -DCMAKE_BUILD_TYPE=Release
ninja

# Build Polygeist
cd ../../ && mkdir build && cd build
cmake -G Ninja .. \
  -DMLIR_DIR=$PWD/../llvm-project/build/lib/cmake/mlir \
  -DCLANG_DIR=$PWD/../llvm-project/build/lib/cmake/clang \
  -DCMAKE_BUILD_TYPE=Release \
  -DPOLYGEIST_USE_LINKER=lld
ninja

build.sh auto-detects cgeist at ~/.local/src/Polygeist/build/bin/cgeist or on PATH. If not found, the forward pipeline is skipped.
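
The detection described above can be sketched in Python as follows. This is an illustration only, not the contents of build.sh: the function name find_cgeist is hypothetical, and checking the fixed path before PATH is an assumption about the lookup order.

```python
import os
import shutil
from pathlib import Path
from typing import Optional

# Location named in this README; everything else here is illustrative.
DEFAULT_CGEIST = Path.home() / ".local/src/Polygeist/build/bin/cgeist"


def find_cgeist(default: Path = DEFAULT_CGEIST) -> Optional[str]:
    """Return a path to cgeist, or None if the forward pipeline should be skipped."""
    # Assumed order: the fixed build location first, then PATH.
    if default.is_file() and os.access(default, os.X_OK):
        return str(default)
    return shutil.which("cgeist")
```

A caller would treat a None result as the signal to skip the forward pipeline rather than as an error.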

Using Polygeist for Training Data Generation

# C → SCF-level MLIR
cgeist sample.c -S --function='*' -O2 \
  --raise-scf-to-affine \
  --memref-fullrank \
  -o sample_scf.mlir

# Same source → object file → raw bytes (training input)
clang -O2 sample.c -o sample.bin -c
objcopy -O binary sample.bin sample.raw

Key cgeist flags for training data:

--raise-scf-to-affine: Produce higher-level affine loops where possible
--memref-fullrank: Keep memref shapes for array access patterns
-O0 through -O2: Generate pairs at different optimization levels
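
To generate pairs at each optimization level, the cgeist invocation can be built as an argument list per level. The sketch below only constructs the commands and does not run them; the helper names and the output file naming scheme are hypothetical, while the flags are the ones listed above.

```python
from typing import List


def cgeist_command(source: str, opt_level: int, output: str) -> List[str]:
    """Build the cgeist argv for one source file at one optimization level."""
    if opt_level not in (0, 1, 2):
        raise ValueError(f"unsupported optimization level: {opt_level}")
    return [
        "cgeist", source, "-S",
        "--function=*",            # as a list element, '*' is never shell-expanded
        f"-O{opt_level}",
        "--raise-scf-to-affine",
        "--memref-fullrank",
        "-o", output,
    ]


def commands_for_all_levels(source: str) -> List[List[str]]:
    # Hypothetical naming scheme: sample.c -> sample_O0_scf.mlir, and so on.
    stem = source.rsplit(".", 1)[0]
    return [cgeist_command(source, lvl, f"{stem}_O{lvl}_scf.mlir") for lvl in (0, 1, 2)]
```

Each list can be passed to subprocess.run as-is, which avoids shell quoting issues around the wildcard.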

Pipelines Demonstrated

Forward: C → SCF MLIR (Polygeist/cgeist)

sample.c → cgeist → sample_cgeist.mlir (SCF + arith + memref)

Generates the target-level MLIR representation directly from C source. This is the forward direction used for training data generation.

Pipeline A: SCF MLIR → x86 (verification path)

sample_cgeist.mlir
  → mlir-opt (--lower-affine --convert-scf-to-cf --convert-cf-to-llvm ...)
  → mlir-translate --mlir-to-llvmir
  → llc -filetype=obj
  → sample_cgeist.o (ELF x86-64)

Compiles the SCF MLIR to a native object file. In the full pipeline, this verifies the model's output: compile the predicted MLIR and diff the result against the original input binary.
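
The README does not specify how the diff is computed. As one possibility, the code bytes of the two objects could be compared directly, falling back to a similarity score when they differ. The names DiffResult and compare_code_bytes are hypothetical, and byte equality is a stricter criterion than semantic equivalence, so this is a sketch of a first-pass check only.

```python
import difflib
from dataclasses import dataclass


@dataclass
class DiffResult:
    identical: bool
    similarity: float  # 1.0 means byte-for-byte equal


def compare_code_bytes(original: bytes, recompiled: bytes) -> DiffResult:
    """Compare two code-byte sequences, e.g. the extracted bytes of two objects."""
    if original == recompiled:
        return DiffResult(identical=True, similarity=1.0)
    # SequenceMatcher accepts any sequence of hashable items, including bytes.
    matcher = difflib.SequenceMatcher(None, original, recompiled, autojunk=False)
    return DiffResult(identical=False, similarity=matcher.ratio())
```

A mismatch here does not prove the prediction is wrong, since register allocation or instruction scheduling can differ between compilers while behavior stays the same.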

Pipeline B: EmitC MLIR → C code (decompilation output)

sample_pure_emitc.mlir
  → mlir-translate --mlir-to-cpp
  → decompiled.cpp

The human-facing decompilation output. Requires system MLIR 20+ (EmitC lvalue/for/func ops were added after MLIR 18).

Reference: C → LLVM IR → MLIR LLVM dialect

sample.c → clang → sample.ll → mlir-translate --import-llvm → sample_llvm_dialect.mlir

Shows what LLVM-level MLIR looks like. This is the "floor" representation that the AI model would need to raise to SCF level.

Known MLIR Gaps

scf.while → EmitC: convert-scf-to-emitc does not lower scf.while. Workaround: convert while to bounded for loops, or handle in the memref→ptr shim before EmitC translation.
Dynamic memref → EmitC: convert-memref-to-emitc only handles static shapes. The decompiler shim needs to convert memref<?xi32> to !emitc.ptr<i32> before the standard pipeline.
No automatic raise: There is no LLVM dialect → SCF pass (that is the AI model's job). --lift-cf-to-scf exists but operates on the cf dialect, not the LLVM dialect; it is useful if you first strip LLVM-specific ops.
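
The first two gaps can be detected before running Pipeline B by scanning the MLIR text for the constructs that will not lower. The sketch below is a textual heuristic, not an MLIR parser, and the function name find_emitc_blockers is hypothetical.

```python
import re
from typing import List

# A '?' inside memref<...> marks a dynamic dimension (or dynamic layout value).
_DYNAMIC_MEMREF = re.compile(r"memref<[^>]*\?[^>]*>")


def find_emitc_blockers(mlir_text: str) -> List[str]:
    """List constructs in MLIR text that the EmitC path described above cannot lower."""
    blockers = []
    for lineno, line in enumerate(mlir_text.splitlines(), start=1):
        if "scf.while" in line:
            blockers.append(f"line {lineno}: scf.while")
        if _DYNAMIC_MEMREF.search(line):
            blockers.append(f"line {lineno}: dynamic memref")
    return blockers
```

An empty result suggests the module avoids these two gaps; it does not guarantee that the EmitC conversion will succeed.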

Files

src/
  sample.c                  # Reference C source

mlir/
  sample_pure_emitc.mlir    # EmitC dialect (translates to C via mlir-translate)
  (generated by build.sh:)
  sample_cgeist.mlir        # SCF MLIR from cgeist (forward pipeline)
  sample_cgeist_llvm.mlir   # LLVM dialect (Pipeline A intermediate)
  sample_cgeist.ll          # LLVM IR (Pipeline A intermediate)
  sample_llvm_dialect.mlir  # MLIR from importing clang LLVM IR (reference)
  sample.ll                 # LLVM IR from clang (reference)

output/                     # Generated by build.sh
  decompiled.cpp            # C code from EmitC (Pipeline B)
  sample_cgeist.s           # x86 assembly (Pipeline A)
  sample_cgeist.o           # x86 object (Pipeline A)

build.sh                    # Runs all pipelines

AI Model Training Notes

The target representation (SCF + arith + memref) suits this task because:

Architecture-independent: Same MLIR target regardless of input binary's ISA
Structured control flow: scf.for, scf.while, scf.if map directly to C
Typed memory access: memref.load/memref.store with type info
Deterministic lowering: MLIR → C is mechanical, not learned
Verification path: MLIR → x86 lets you check semantic correctness

The model never needs to learn what C looks like. It learns binary → MLIR, and deterministic tooling handles MLIR → C.

Quick Start

# Install toolchain (any supported LLVM version, e.g. 18, 20, 23; Pipeline B needs 20+)
sudo apt install mlir-<ver>-tools llvm-<ver> clang-<ver>

# Run all pipelines
chmod +x build.sh && ./build.sh
