igoforth/aletheia

aletheia

An AI decompiler targeting MLIR. The model learns to lift binaries to structured control-flow IR (SCF-level MLIR); deterministic compiler passes then lower that IR to C.

Architecture

                          ┌─────────────────────────┐
                          │      AI Model           │
                          │   p-code → SCF MLIR     │
                          └────────────┬────────────┘
                                       │
                                       ▼
                          ┌─────────────────────────┐
                          │  SCF + arith + memref   │
                          │     (target repr)       │
                          └──────┬──────────┬───────┘
                                 │          │
                    ┌────────────┘          └────────────┐
                    ▼                                    ▼
          ┌─────────────────┐                  ┌─────────────────┐
          │  Pipeline A:    │                  │  Pipeline B:    │
          │  SCF → LLVM     │                  │  SCF → EmitC    │
          │  → LLVM IR      │                  │  → C code       │
          │  → x86 obj      │                  │                 │
          └─────────────────┘                  └─────────────────┘
                    │                                    │
             (verification:                       (human-readable
              recompile &                          decompilation
              diff binary)                          output)

Training Data: Polygeist

Polygeist is the key tool for generating training pairs. Its cgeist driver compiles C/C++ directly to SCF-level MLIR using Clang's frontend, emitting the scf, affine, memref, and arith dialects rather than going through LLVM IR.

This gives you matched (binary, SCF MLIR) pairs:

Source.c ──┬── cgeist ──→ SCF MLIR (training target)
           │
           └── clang -O2 ──→ binary (training input)

Building Polygeist

Polygeist is pinned to a specific LLVM commit and must be built against its bundled LLVM/Clang source tree (system packages won't work):

git clone https://github.com/llvm/Polygeist
cd Polygeist
git submodule update --init --depth 1

# Build bundled LLVM + MLIR + Clang
mkdir -p llvm-project/build && cd llvm-project/build
cmake -G Ninja ../llvm \
  -DLLVM_ENABLE_PROJECTS="mlir;clang" \
  -DLLVM_TARGETS_TO_BUILD="host" \
  -DLLVM_USE_LINKER=lld \
  -DLLVM_INCLUDE_BENCHMARKS=OFF \
  -DCMAKE_BUILD_TYPE=Release
ninja

# Build Polygeist
cd ../../ && mkdir build && cd build
cmake -G Ninja .. \
  -DMLIR_DIR=$PWD/../llvm-project/build/lib/cmake/mlir \
  -DCLANG_DIR=$PWD/../llvm-project/build/lib/cmake/clang \
  -DCMAKE_BUILD_TYPE=Release \
  -DPOLYGEIST_USE_LINKER=lld
ninja

build.sh auto-detects cgeist at ~/.local/src/Polygeist/build/bin/cgeist or on PATH. If not found, the forward pipeline is skipped.
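
The detection described above could look roughly like the sketch below. It is not taken from build.sh: the function name `find_cgeist` and the `CGEIST_DEFAULT` override variable are assumptions made for illustration. Only the default path and the PATH fallback come from the text.

```shell
#!/bin/sh
# Hypothetical helper: print the path to cgeist, or fail if none is found.
# CGEIST_DEFAULT is an assumed override for the default build location.
find_cgeist() {
  default_path="${CGEIST_DEFAULT:-$HOME/.local/src/Polygeist/build/bin/cgeist}"
  if [ -x "$default_path" ]; then
    printf '%s\n' "$default_path"
    return 0
  fi
  # command -v prints the resolved path when cgeist is on PATH,
  # and returns non-zero otherwise.
  command -v cgeist 2>/dev/null
}

if CGEIST=$(find_cgeist); then
  echo "forward pipeline: using $CGEIST"
else
  echo "forward pipeline: cgeist not found, skipping"
fi
```

Checking the fixed build location first keeps a locally built cgeist ahead of any other copy that happens to be on PATH.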

Using Polygeist for Training Data Generation

# C → SCF-level MLIR
cgeist sample.c -S --function='*' -O2 \
  --raise-scf-to-affine \
  --memref-fullrank \
  -o sample_scf.mlir

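# Same source → binary (training input); -c compiles to an object file without linking
clang -O2 -c sample.c -o sample.bin
# Raw memory dump of the object's contents (symbols and relocations are discarded)
objcopy -O binary sample.bin sample.raw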

Key cgeist flags for training data:

  • --raise-scf-to-affine: Produce higher-level affine loops where possible
  • --memref-fullrank: Keep memref shapes for array access patterns
  • -O0 through -O2: Generate pairs at different optimization levels
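
To produce pairs at several optimization levels, the two commands above can be repeated per level. The sketch below only prints the commands so they can be reviewed first. The function name `pair_commands` and the `<outdir>/<name>_O<n>.*` layout are assumptions for illustration, not part of this repository.

```shell
#!/bin/sh
# Print (not run) the commands that build one (binary, MLIR) pair per
# optimization level. The layout <outdir>/<name>_O<n>.* is an assumption.
# Paths containing spaces are not handled in this sketch.
pair_commands() {
  src=$1
  outdir=$2
  name=$(basename "$src" .c)
  for opt in 0 1 2; do
    stem="$outdir/${name}_O$opt"
    echo "cgeist $src -S --function='*' -O$opt --raise-scf-to-affine --memref-fullrank -o $stem.mlir"
    echo "clang -O$opt -c $src -o $stem.bin"
    echo "objcopy -O binary $stem.bin $stem.raw"
  done
}

# Dry run: review the commands, then pipe them to sh to execute them.
pair_commands sample.c pairs
```

Running `mkdir -p pairs && pair_commands sample.c pairs | sh` would execute the printed commands, assuming cgeist, clang, and objcopy are all on PATH.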

Pipelines Demonstrated

Forward: C → SCF MLIR (Polygeist/cgeist)

sample.c → cgeist → sample_cgeist.mlir (SCF + arith + memref)

Generates the target-level MLIR representation directly from C source. This is the forward direction used for training data generation.

Pipeline A: SCF MLIR → x86 (verification path)

sample_cgeist.mlir
  → mlir-opt (--lower-affine --convert-scf-to-cf --convert-cf-to-llvm ...)
  → mlir-translate --mlir-to-llvmir
  → llc -filetype=obj
  → sample_cgeist.o (ELF x86-64)

Compiles the SCF MLIR to a native object file. In the full pipeline, this lets you verify the model's output by diffing the object compiled from its MLIR against the original input binary.
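
One way to sketch the diff step is to compare the raw code bytes of the two objects. The helper names below are hypothetical, and byte-for-byte equality is an assumption about what "diff" means here. It is the strictest possible check, so semantically equivalent code with different instruction selection or register allocation would be reported as a mismatch.

```shell
#!/bin/sh
# Extract the raw .text bytes of an object file (requires binutils objcopy).
extract_text() {
  objcopy -O binary --only-section=.text "$1" "$2"
}

# Succeed only when both raw code dumps are byte-identical.
same_code_bytes() {
  cmp -s "$1" "$2"
}

# Hypothetical usage, assuming original.o is the object the model was given:
#   extract_text original.o original.text
#   extract_text sample_cgeist.o roundtrip.text
#   same_code_bytes original.text roundtrip.text && echo "match"
```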

Pipeline B: EmitC MLIR → C code (decompilation output)

sample_pure_emitc.mlir
  → mlir-translate --mlir-to-cpp
  → decompiled.cpp

The human-facing decompilation output. Requires system MLIR 20+ (EmitC lvalue/for/func ops were added after MLIR 18).
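
A script can guard Pipeline B on the installed version. The sketch below parses the text printed by `--version`; it assumes that output contains a line of the form `LLVM version <major>.<minor>.<patch>`, and the helper names are hypothetical.

```shell
#!/bin/sh
# Read --version output on stdin and print the LLVM major version.
# Assumes a line such as "LLVM version 20.1.8" appears in the text.
llvm_major() {
  sed -n 's/.*LLVM version \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed when the given major version meets the MLIR 20+ requirement.
supports_pipeline_b() {
  [ -n "$1" ] && [ "$1" -ge 20 ]
}

# Hypothetical usage:
#   major=$(mlir-translate --version | llvm_major)
#   supports_pipeline_b "$major" || echo "skipping Pipeline B"
```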

Reference: C → LLVM IR → MLIR LLVM dialect

sample.c → clang → sample.ll → mlir-translate --import-llvm → sample_llvm_dialect.mlir

Shows what LLVM-level MLIR looks like. This is the "floor" representation that the AI model would need to raise to SCF level.

Known MLIR Gaps

  1. scf.while → EmitC: convert-scf-to-emitc does not lower scf.while. Workaround: convert while loops to bounded for loops, or handle them in the memref→ptr shim before EmitC translation.

  2. Dynamic memref → EmitC: convert-memref-to-emitc only handles static shapes. The decompiler shim needs to convert memref<?xi32> to !emitc.ptr<i32> before the standard pipeline.

  3. No automatic raise: There is no LLVM dialect → SCF pass (that is the AI model's job). --lift-cf-to-scf exists but operates on the cf dialect, not the LLVM dialect, so it is useful if you first strip LLVM-specific ops.
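
Before sending a file down Pipeline B, a quick text scan can flag the first two gaps. This is a rough sketch with a hypothetical function name. It matches plain text rather than parsing MLIR, so it can also flag ops that only appear in comments.

```shell
#!/bin/sh
# Print one line per known gap that the given MLIR file would hit.
emitc_gaps() {
  file=$1
  if grep -q 'scf\.while' "$file"; then
    echo "scf.while: not lowered by convert-scf-to-emitc"
  fi
  # A '?' inside memref<...> marks a dynamic dimension, e.g. memref<?xi32>.
  if grep -q 'memref<[^>]*?' "$file"; then
    echo "dynamic memref: needs the memref-to-ptr shim"
  fi
}

# Hypothetical usage:
#   emitc_gaps mlir/sample_cgeist.mlir
```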

Files

src/
  sample.c                  # Reference C source

mlir/
  sample_pure_emitc.mlir    # EmitC dialect (translates to C via mlir-translate)
  (generated by build.sh:)
  sample_cgeist.mlir        # SCF MLIR from cgeist (forward pipeline)
  sample_cgeist_llvm.mlir   # LLVM dialect (Pipeline A intermediate)
  sample_cgeist.ll          # LLVM IR (Pipeline A intermediate)
  sample_llvm_dialect.mlir  # MLIR from importing clang LLVM IR (reference)
  sample.ll                 # LLVM IR from clang (reference)

output/                     # Generated by build.sh
  decompiled.cpp            # C code from EmitC (Pipeline B)
  sample_cgeist.s           # x86 assembly (Pipeline A)
  sample_cgeist.o           # x86 object (Pipeline A)

build.sh                    # Runs all pipelines

AI Model Training Notes

The target representation (SCF + arith + memref) is a good fit because:

  • Architecture-independent: Same MLIR target regardless of input binary's ISA
  • Structured control flow: scf.for, scf.while, scf.if map directly to C
  • Typed memory access: memref.load/memref.store with type info
  • Deterministic lowering: MLIR → C is mechanical, not learned
  • Verification path: MLIR → x86 lets you check semantic correctness

The model never needs to learn what C looks like. It learns binary → MLIR, and deterministic tooling handles MLIR → C.

Quick Start

# Install toolchain (any supported LLVM version, e.g. 18, 20, 23; Pipeline B requires MLIR 20+)
sudo apt install mlir-<ver>-tools llvm-<ver> clang-<ver>

# Run all pipelines
chmod +x build.sh && ./build.sh
