v0.17.17

@Goldziher Goldziher released this 21 May 13:54
v0.17.17
eff3c3d

Fixed

  • alef-e2e: remove hardcoded project-name special-casing across codegen. Alef is a generic generator — literal consumer-repo or product names in emitted source/output strings are bugs. Sweep across alef-e2e codegen: (1) elixir.rs derives the per-project Elixir module via module_path instead of literal Kreuzcrawl; (2) streaming_assertions.rs adds accessor_with_module_qualifier so the C# crawl-event branch uses the namespace qualifier and the Rust branch uses the cargo crate name; (3) php.rs drops the unreachable fallback that hardcoded \HtmlToMarkdown\ConversionOptions; (4) c.rs replaces the hardcoded LiterllmDefaultClientChatStreamStreamHandle with {pascal_prefix}DefaultClientChatStreamStreamHandle; (5) java.rs templates the FormatMetadataDisplay import via {java_group_id} and removes default_java_nested_types(); (6) csharp.rs removes default_csharp_nested_types(); (7) rust/http.rs doc-comment rewording — emitted code already used the dep_name parameter correctly. Downstream impact: kreuzberg/kreuzcrawl alef.toml must now declare nested_types explicitly under their [crates.e2e.calls.*] sections.

  • alef-backend-go: exclude struct types from unresolved-type fallback. The Wave 7 Go fallback emitted *json.RawMessage for all Named types not in enum_names or data_enum_names, but legitimate struct types in the same generated package were incorrectly treated as external/unresolved. This caused type mismatches when struct types like OcrConfig and ChunkingConfig appeared in fields of other structs (e.g., ExtractionConfig.Ocr was emitted as *json.RawMessage instead of *OcrConfig), breaking Go compilation with "cannot use &v as *json.RawMessage" errors. Added a struct_names HashSet that tracks all non-opaque struct type names, is passed to gen_struct_type, and is included in the unresolved check so real struct types fall through to correctly typed fields. Unblocks Go binding compilation. (crates/alef-backend-go/src/gen_bindings/mod.rs, crates/alef-backend-go/src/gen_bindings/types.rs)
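
The resolution order this entry describes can be modelled roughly as follows. This is an illustrative Python sketch, not the generator's Rust code: the helper name `go_field_type` and the typed-pointer shape returned for known names are assumptions.

```python
def go_field_type(named, enum_names, data_enum_names, struct_names):
    """Pick a Go field type for a Named IR type (hypothetical helper)."""
    known = enum_names | data_enum_names | struct_names
    if named in known:
        # Types generated in the same package keep a typed field
        # (assumed here to be a pointer, as in *OcrConfig).
        return f"*{named}"
    # Everything else is treated as an unresolved external type.
    return "*json.RawMessage"


structs = {"OcrConfig", "ChunkingConfig", "ExtractionConfig"}
print(go_field_type("OcrConfig", set(), set(), structs))
print(go_field_type("SomeExternalType", set(), set(), structs))
```

Without the `struct_names` set, `OcrConfig` would take the fallback branch, which is the mismatch the fix removes.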

  • alef-e2e/java: derive streaming DTO imports from declared adapters instead of hardcoding kreuzcrawl types. The Java test-file generator unconditionally emitted import <pkg>.CrawlEvent; import <pkg>.CrawlStreamRequest; import <pkg>.BatchCrawlStreamRequest; for every project with a streaming fixture, leaking kreuzcrawl-specific type names into the generic polyglot generator. Other consumers (liter-llm, kreuzberg) ship streaming with different DTOs (e.g., ChatCompletionChunk + ChatCompletionRequest) and never declare any Crawl* types, so the maven build failed with three cannot find symbol errors on every streaming-touched test class. The fix iterates adapters, filters to AdapterPattern::Streaming, and imports each adapter's item_type (skipping ChatCompletionChunk which is already emitted above) plus request_type (stripped of its Rust path prefix). For kreuzcrawl this produces the same three imports as before; for liter-llm it produces none extra. (crates/alef-e2e/src/codegen/java.rs)
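
As a rough model of the adapter-driven import derivation, consider the Python sketch below. The adapter dictionary layout, the `com.example` package, the `kreuzcrawl::` path prefix, and the de-duplication step are all assumptions made for illustration; only the type names come from the entry above.

```python
def streaming_imports(package, adapters, already_imported=("ChatCompletionChunk",)):
    """Collect Java import lines for streaming adapters (hypothetical model)."""
    names = []
    for adapter in adapters:
        if adapter["pattern"] != "streaming":
            continue
        # request_type may carry a Rust path prefix such as `my_crate::Foo`.
        request = adapter["request_type"].rsplit("::", 1)[-1]
        for name in (adapter["item_type"], request):
            if name not in already_imported and name not in names:
                names.append(name)
    return [f"import {package}.{name};" for name in names]


# Hypothetical adapter declarations for a crawl-style project.
CRAWL_ADAPTERS = [
    {"pattern": "streaming", "item_type": "CrawlEvent",
     "request_type": "kreuzcrawl::CrawlStreamRequest"},
    {"pattern": "streaming", "item_type": "CrawlEvent",
     "request_type": "kreuzcrawl::BatchCrawlStreamRequest"},
]
for line in streaming_imports("com.example", CRAWL_ADAPTERS):
    print(line)
```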

  • alef-e2e/elixir: handle binary/list return types in min/max_length assertions and zero-arg functions. Elixir e2e tests for binary-returning functions like render_pdf_page_to_png failed with String.length/1 type errors because the assertion logic didn't account for binaries (which use byte_size/1) or lists (byte arrays from Rustler). The min/max_length assertion now pattern-matches all three cases: is_binary → byte_size, is_list → length, else → String.length. Additionally, zero-argument functions were incorrectly receiving the harness' internal setup dictionary from the fixture input; the argument builder now filters out the setup key when no explicit args are configured so functions with no parameters are called with no arguments. Added alef_e2e_format_to_string/1 helper for FormatMetadata struct display conversion (pattern-matches on nested fields like metadata.image.format to extract displayable strings, falling back to struct inspection). Fixes Elixir e2e: pdf_test, smoke_image_png, extractors_list, register_document_extractor. (crates/alef-e2e/src/codegen/elixir.rs)
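
The three-way length dispatch might be emitted along these lines. This is a Python sketch of a text emitter; the function name and the exact shape of the generated Elixir are assumptions, not the generator's actual output.

```python
def elixir_length_check(var, op, bound):
    """Emit Elixir source for a min/max length assertion (hypothetical emitter)."""
    return "\n".join([
        "len =",
        "  cond do",
        f"    is_binary({var}) -> byte_size({var})",
        f"    is_list({var}) -> length({var})",
        f"    true -> String.length({var})",
        "  end",
        f"assert len {op} {bound}",
    ])


print(elixir_length_check("result", ">=", 1))
```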

  • alef-backend-csharp: emit gen_adapter_wrapper for streaming adapters. Adds the IAsyncEnumerable<ItemType> public method derived from AdapterConfig. Wires &config.adapters through gen_bindings::emit. (crates/alef-backend-csharp/src/gen_bindings/methods.rs, crates/alef-backend-csharp/src/gen_bindings/mod.rs)

  • alef-backend-rustler: restore use_keyword_opts = trailing_keyword_count >= 2 threshold. A single trailing optional param (e.g. config: Option<T>) now stays positional with \\ nil instead of collapsing to opts \\ []. Aligns with the common config-parameter pattern where a single JSON string or nil is passed positionally. (crates/alef-backend-rustler/src/gen_bindings/mod.rs)
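
A minimal Python sketch of the threshold, assuming a hypothetical `elixir_params` helper that receives the required parameter names and the trailing optional ones separately:

```python
def elixir_params(required, trailing_optional):
    """Render an Elixir parameter list (hypothetical helper)."""
    use_keyword_opts = len(trailing_optional) >= 2
    params = list(required)
    if use_keyword_opts:
        # Two or more trailing optionals collapse into one keyword list.
        params.append("opts \\\\ []")
    else:
        # A single trailing optional stays positional with a nil default.
        params.extend(f"{name} \\\\ nil" for name in trailing_optional)
    return ", ".join(params)


print(elixir_params(["html"], ["config"]))
print(elixir_params(["html"], ["config", "timeout"]))
```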

  • alef-backend-go: emit *json.RawMessage for unresolved external-crate Named types. Go struct fields with unresolved Named types (e.g., types from external crates that alef cannot resolve to a struct definition) now fall back to *json.RawMessage instead of attempting to reference a non-existent Go struct. This allows JSON round-tripping of opaque external types without requiring the binding generator to understand their shape. Fields are emitted with omitempty tag to handle null/absent values gracefully. Prevents compile errors when polyglot repos embed external types via re-exports. (crates/alef-backend-go/src/gen_bindings/types.rs)

  • alef-e2e/dart: required string args use positional syntax to match hand-written facades. Commit 2572cb72 reverted the prior required → positional, optional → named heuristic in favour of "always emit named", citing FRB v2's all-named convention. The revert was correct for liter-llm (whose chat/embed calls go through the from_json path that emits req: named, and the client_factory path that hardcodes its own arg shape), but broke every polyglot repo whose Dart surface is a hand-written facade wrapping the FRB bridge: H2mBridge.convert(String html, {ConversionOptions? options}), TreeSitterLanguagePackBridge.process(String source, ProcessConfig config), KreuzbergBridge.extractBytes(Uint8List content, String mimeType, [ExtractionConfig? config]). All three failed Dart compilation with Too few positional arguments because the codegen emitted process(source: 'x', _config) against a positional signature. Restored the required → positional, optional → named policy (matching the json_object handler at line 743 and the bytes/file_path handlers above), which mirrors the Rust idiom every facade follows. Unblocks tslp, h2m, and kreuzberg Dart e2e compilation. (crates/alef-e2e/src/codegen/dart.rs)

  • alef-backend-pyo3: emit _rust.CrawlStreamRequest and _rust.BatchCrawlStreamRequest in adapter wrappers. The generated api.py streaming adapter wrappers constructed the facade dataclass CrawlStreamRequest(url=url) and passed it to engine.crawl_stream(req), but the PyO3 native method enforces strict type identity against _rust.CrawlStreamRequest. Tests failed with TypeError: 'CrawlStreamRequest' object is not an instance of 'CrawlStreamRequest' (same name, different class object). The wrapper now constructs the native pyclass via _rust.CrawlStreamRequest(url=url) (and _rust.BatchCrawlStreamRequest(urls=urls)), matching the type the underlying method expects. Unblocks ~8 python streaming e2e tests. (crates/alef-backend-pyo3/src/gen_bindings/functions.rs)

  • alef-backend-php: walk raw typ.fields (not binding_fields) for trait-bridge withers. Commit 4e18a6a5 added a with_<field> wither emission for trait-bridge / opaque-handle optional fields (so PHP callers can attach a visitor after constructing options), but iterated binding_fields(&typ.fields) which filters out binding_excluded entries — and trait-bridge fields like ConversionOptions.visitor are flagged binding-excluded so they don't appear in the generated __construct / from_json parameter list. Net result: the wither method was never emitted, every html-to-markdown PHP visitor e2e (54 tests) errored with Call to undefined method ConversionOptions::with_visitor(). Switched the wither loop to walk typ.fields.iter() directly so trait-bridge fields are reached. The wither remains gated on Option<Named> where Named ∈ opaque_types ∪ bridge_type_aliases, so non-bridge excluded fields are still skipped. (crates/alef-backend-php/src/gen_bindings/types.rs)

  • alef-e2e/zig: omit fixtures whose target language is outside [crates.zig].languages. The Zig binding statically compiles a subset of tree-sitter grammars (it does not currently dynamically load parsers at runtime), but the e2e generator emitted tests for every fixture regardless of the fixture's input.language. Fixtures like smoke_bibtex therefore generated tests that failed to load their parser. Adds a new languages: Vec<String> field to ZigConfig and a filter in the Zig codegen that consults both input.language and input.config.language (mirroring the WASM filter from f9e0ff50). When the list is set and non-empty, fixtures whose target grammar is not in the list are omitted entirely from the generated test file — not emitted as it.skip() placeholders. Defaults to empty (all fixtures included), preserving prior behaviour for hosts that haven't opted in. tree-sitter-language-pack Zig e2e: every non-static-set fixture (e.g. smoke_actionscript, smoke_bibtex) auto-excluded. (crates/alef-core/src/config/languages.rs, crates/alef-e2e/src/codegen/zig.rs)
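
The filter semantics can be sketched in Python as below. The fixture dictionary layout, the precedence of `input.language` over `input.config.language`, and the choice to keep fixtures that name no language at all are assumptions for illustration.

```python
def keep_fixture(fixture, languages):
    """Decide whether a fixture is emitted for Zig (hypothetical model)."""
    if not languages:
        return True  # empty list: every fixture is included
    inp = fixture.get("input") or {}
    target = inp.get("language") or (inp.get("config") or {}).get("language")
    if target is None:
        return True  # assumption: fixtures without a target language are kept
    return target in languages


fixtures = [
    {"name": "smoke_python", "input": {"language": "python"}},
    {"name": "smoke_bibtex", "input": {"config": {"language": "bibtex"}}},
]
kept = [f["name"] for f in fixtures if keep_fixture(f, ["python", "rust"])]
print(kept)
```

Fixtures that fail the check are dropped from the output entirely rather than rendered as skipped tests.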

  • alef-backend-wasm: generate camelCase-aware Input DTOs for config-like function parameters. WASM function parameters of types like ProcessConfig (ending with "Config"/"Options"/"Settings"/"Params") failed to deserialize camelCase JSON from JS because serde_wasm_bindgen::from_value expected snake_case field names. The JS test suite passed {chunkMaxSize: 50} but the Rust ProcessConfig { chunk_max_size: ... } field wasn't found, silently defaulting to None, causing chunking to silently fail with empty chunks arrays. The fix generates an Input DTO struct (e.g., ProcessConfigInput) with #[serde(default, rename_all = "camelCase")] before each function that takes such a parameter, deserializes the JS value into the DTO, then converts to the core type via a generated From impl. All config-like parameters now correctly round-trip camelCase JSON from JavaScript. Tree-sitter-language-pack wasm e2e: chunks tests now passing. (crates/alef-backend-wasm/src/gen_bindings/functions.rs, crates/alef-backend-wasm/src/template_env.rs, new templates: gen_input_dto.jinja, serde_config_required.jinja, serde_config_optional.jinja)
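
To make the naming mismatch concrete, here is a small Python model of the suffix check and the camelCase mapping. The helper names are hypothetical, and the real fix relies on serde's `rename_all = "camelCase"` rather than hand-written conversion.

```python
CONFIG_SUFFIXES = ("Config", "Options", "Settings", "Params")


def is_config_like(type_name):
    """True for parameter types that get an Input DTO (per the suffix rule)."""
    return type_name.endswith(CONFIG_SUFFIXES)


def to_camel_case(field):
    """chunk_max_size -> chunkMaxSize"""
    head, *rest = field.split("_")
    return head + "".join(part.capitalize() for part in rest)


def from_js(js_object, rust_fields):
    """Map camelCase JS keys onto snake_case fields; absent keys become None."""
    return {f: js_object.get(to_camel_case(f)) for f in rust_fields}


print(is_config_like("ProcessConfig"))
print(from_js({"chunkMaxSize": 50}, ["chunk_max_size", "language"]))
```

Looking up `chunk_max_size` directly in the JS object would find nothing, which is the silent-default behaviour described above.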

  • alef-backend-php: emit with_visitor wither on ConversionOptions for trait-bridge opaque types. E2e PHP tests call $options->with_visitor($visitorHandle) to set trait-bridge visitor handles on options objects, but the generator only checked IR-level opaque types and skipped bridge type aliases like VisitorHandle. When a struct has an Option<NamedType> field where the named type is a trait-bridge alias, the generator now emits a wither method with_field_name(value: NamedType) -> Self that accepts the unwrapped type and wraps it in Some() before assignment. Collects both IR opaque types and config.trait_bridges[*].type_alias into an all_opaque_types set before field iteration. Fixes html-to-markdown PHP e2e visitor tests from 208/262 to 262/262 (54 visitor tests now passing). (crates/alef-backend-php/src/gen_bindings/types.rs)

  • alef-backend-csharp: emit primitive option fields as nullable to preserve Rust defaults. Option-config types (with typ.has_default) were emitting primitive/string fields with C# default values (false, 0, ""), which, when serialized with WhenWritingNull, caused problems for fields whose Rust default differs from the C# default. The fix makes all non-optional primitive fields in option-config types nullable in C#, defaulting to null. With WhenWritingNull serialization, unset fields stay null and are stripped from the JSON, letting Rust apply its own defaults; explicitly set fields (even those matching Rust defaults) are written to the JSON, preserving user intent. This relies on the wrapper template change that uses WhenWritingNull for FFI input serialization. Takes html-to-markdown C# e2e from 175/262 to 262/262 (all tests passing). (crates/alef-backend-csharp/src/gen_bindings/types.rs, templates/wrapper_class_header.jinja)

  • alef-backend-csharp: restore [JsonConverter] attribute on generated enums. Commit 342a0f0c (fix alef-backend-java enum serialization) accidentally re-introduced a conditional if needs_custom_converter check in the C# enum template and enums.rs converter generation, which disabled [JsonConverter(typeof(EnumNameJsonConverter))] emission on standard snake_case enums. Without the converter, System.Text.Json serializes enum variants as numeric values (0, 1) instead of string names ("function", "method"), breaking all assertions that depend on serialized JSON containing variant names. The fix restores unconditional converter generation and attribute emission for all enums (not just those with non-standard naming), ensuring enum-to-string conversion works consistently. Fixes 18 tree-sitter-language-pack C# e2e test failures (410/410 now passing). (crates/alef-backend-csharp/src/gen_bindings/enums.rs, crates/alef-backend-csharp/templates/enum_header.jinja)

  • alef-e2e/swift: aggregate every stringy accessor when contains asserts against a Vec<DTO> field. XCTAssertTrue(result.imports().map { $0.source().toString() }.contains("os")) previously relied on result_field_accessor naming a single "primary" accessor (e.g. imports → source, structure → kind), which fails whenever the asserted value lives on a sibling field (ImportInfo.items for from pathlib import Path, StructureItem.name for MyConfig). The codegen now walks the element type's IR fields, classifies every String/Option<String>/Vec<String>/serde-enum field as a "stringy" accessor, and emits a contains(where: { item in … }) closure that gathers every text-bearing value into a [String] before substring-matching the expected value — mirroring python's _alef_e2e_item_texts. Vec<String> accessors are flattened via .map { $0.as_str().toString() } (swift-bridge wraps borrowed RustString elements as RustStringRef, which exposes as_str() from SwiftBridgeCore.swift, not toString()). The aggregator only fires when the element type carries ≥2 stringy fields, leaving the existing single-accessor path untouched for trivial cases. Unblocks 2 process tests (testProcessPythonImportsDetail, testProcessRustStructureName) in tree-sitter-language-pack swift e2e. (crates/alef-e2e/src/codegen/swift.rs, crates/alef-e2e/src/field_access.rs)

  • alef-e2e/java: gate FormatMetadataDisplay.java emission on presence of FormatMetadata in Java assert_enum_fields. The helper was previously emitted unconditionally for every Java e2e harness and imports dev.kreuzberg.FormatMetadata, a sealed interface that only exists in the kreuzberg binding crate. Other polyglot repos (e.g. tree-sitter-language-pack) without that type failed Java compilation with cannot find symbol: class FormatMetadata. The generator now walks the resolved Java call overrides (call plus all named calls) and emits the helper only when at least one assert_enum_fields entry maps to "FormatMetadata". tree-sitter-language-pack Java e2e: 0 errors → 410/410 tests passing. (crates/alef-e2e/src/codegen/java.rs)

Added

  • alef-e2e/c: emit visitor test category for C FFI bindings. The C e2e generator previously filtered out all visitor fixtures and panicked if any reached render_test_file. It now collects visitor fixtures into a separate list, generates test_visitor.c with per-fixture static callbacks and the full HTMHtmVisitorCallbacks + htm_visitor_create + htm_options_set_visitor_handle + htm_convert + JSON assertion pattern, adds visitor forward declarations to test_runner.h, wires all visitor tests into main.c, and includes test_visitor.c in the Makefile SRCS. Adds ~54 visitor tests to the C e2e suite (bringing html-to-markdown C e2e from 208 to 262 total). (crates/alef-e2e/src/codegen/c.rs)

Fixed

  • alef-e2e/r: four codegen fixes to close the kreuzberg R e2e gap. (1) build_args_string now returns an empty argument string whenever args = [] is declared, regardless of fixture input shape; the previous fall-through emitted positional list(...) from harness metadata (e.g. setup.lazy_init_required for Go's eager-init shim), producing unused argument errors on no-arg wrappers like list_document_extractors(). (2) Empty Vec<String> args (element_type = "String") now emit character(0) instead of c(); c() is NULL in R and extendr rejects it with Expected Strings got Null for Vec<String> Rust signatures. (3) Per-call extra_args is now honoured for R (mirroring Ruby/Zig/Swift): the values are appended verbatim after declared args, so render_pdf_page_to_png can fill in extendr-required positionals (dpi, password) with NULL when the fixture omits them. (4) Terminal metadata.format accessors are wrapped with a new .alef_format_value() helper emitted into setup-fixtures.R that collapses the internally-tagged FormatMetadata enum ({image: {format: "PNG", ...}, excel: NULL, ...} under simplifyVector = FALSE) down to the inner format string, matching the assertion expectation. The codegen also threads result_is_bytes into render_assertion so min_length/max_length assertions on raw-byte returns use length() instead of nchar() (nchar() operates element-wise on raw vectors, breaking the scalar expect_true contract). Combined, the kreuzberg R e2e suite goes from 153/158 to 159/160 with only the env-dependent tesseract-not-registered failure remaining. (crates/alef-e2e/src/codegen/r.rs)

  • alef-backend-csharp: strip nulls (not defaults) when serializing config objects to FFI. Commit 980e6f10 introduced JsonSerializationOptions (no DefaultIgnoreCondition) for FFI-input serialization so explicit false/0 weren't silently elided, but it then included null for every C# nullable field (PreprocessingPreset? Preset = null, etc.). When the Rust source declares the corresponding field as non-Option (e.g. PreprocessingOptions { preset: PreprocessingPreset }), serde deserialisation chokes on "preset": null and the whole options object is dropped — regressed html-to-markdown C# e2e from 7 failures to 87. Switched JsonSerializationOptions to DefaultIgnoreCondition.WhenWritingNull, which drops null-valued nullable fields (so required Rust fields fall back to Rust defaults) while still serialising explicit false/0 (so the original WhenWritingDefault regression stays fixed). (crates/alef-backend-csharp/src/gen_bindings/types.rs)
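
The difference between the three System.Text.Json ignore conditions can be modelled in a few lines of Python. This is only a model of the serialization behaviour; the field names are hypothetical examples.

```python
import json


def to_ffi_json(fields, policy):
    """Serialize a config object under a System.Text.Json-style ignore policy."""
    if policy == "WhenWritingDefault":
        # Drops null AND type defaults, so explicit false/0 are lost.
        kept = {k: v for k, v in fields.items() if v not in (None, False, 0, "")}
    elif policy == "WhenWritingNull":
        # Drops only nulls; explicit false/0 still reach Rust.
        kept = {k: v for k, v in fields.items() if v is not None}
    else:  # "Never": nulls are written too
        kept = dict(fields)
    return json.dumps(kept, sort_keys=True)


options = {"preset": None, "remove_navigation": False, "compact_tables": True}
for policy in ("WhenWritingDefault", "Never", "WhenWritingNull"):
    print(policy, to_ffi_json(options, policy))
```

Only the `WhenWritingNull` output both keeps the explicit `false` and omits the `null` that a non-Option Rust field cannot deserialize.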

  • alef-backend-jni: treat empty-string complex-param payload as None for optional params. The Kotlin/Java JNI client emits options?.let { mapper.writeValueAsString(it) } ?: "" for Option<DTO> parameters — an empty string is the legacy host-language sentinel for "caller passed null". The Rust JNI shim previously fed that empty string into serde_json::from_str::<DTO> unconditionally, which fails with EOF while parsing a value at line 1 column 0 and throws RuntimeException from every call that omits options (e.g. HtmlToMarkdownRs.convert("…", null)). For optional complex params the shim now checks is_empty() first and yields None without invoking serde, leaving the existing non-empty parse path intact for genuine payloads. Required params remain strict (empty payload still raises). Fixes every kotlin_android null options call in html-to-markdown e2e. (crates/alef-backend-jni/src/gen_shims.rs)
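
In Python terms, the shim's new behaviour is roughly the following. The function name is hypothetical, and `json.loads` stands in for `serde_json::from_str`.

```python
import json


def parse_complex_param(payload, optional):
    """Decode a JSON payload passed across the JNI boundary (model only)."""
    if optional and payload == "":
        # Empty string is the host-side sentinel for "caller passed null".
        return None
    # Required params stay strict: an empty payload raises here.
    return json.loads(payload)


print(parse_complex_param("", optional=True))
print(parse_complex_param('{"heading_style": "atx"}', optional=True))
```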

  • alef-backend-kotlin-android: drop trait-bridge type_alias field when its param_name is in kotlin_android.exclude_functions. When a host configured [crates.kotlin_android].exclude_functions = ["visitor"] to suppress the bridge function (e.g. because the JNI trait-handle bridge isn't implemented yet in alef-backend-jni), the visitor function was filtered out of the module facade but ConversionOptions.visitor: VisitorHandle? (and ConversionOptionsUpdate.visitor) was still emitted on the data classes. Since VisitorHandle itself has no Kotlin representation in this configuration, every Kotlin file referencing the data class failed to compile with Unresolved reference 'VisitorHandle'. The fix collects an effective_excluded_types set in gen_bindings::emit that includes any trait_bridge.type_alias whose param_name matches a kotlin_android.exclude_functions entry (or whose exclude_languages lists kotlin_android), then drops both the alias type itself and any field whose TypeRef references it before the data-class emission. Mirrors the existing exclude_types-driven filter in alef-backend-kotlin-android/src/lib.rs for the case where the user opts out of the bridge function alone. Fixes html-to-markdown kotlin_android e2e compilation. (crates/alef-backend-kotlin-android/src/gen_bindings.rs)

  • alef-e2e/kotlin_android: emit .orEmpty() for Map.get(key) assertions in kotlin_android style. Kotlin's Map<K, V>.get(key) returns V?, so calling .trim() directly on the result fails kotlin_android compilation with Only safe (?.) or non-null asserted (!!.) calls are allowed on a nullable receiver of type 'String?'. The assertion emitter previously short-circuited field_is_optional to false for any field path with has_map_access, regardless of target. The branch now returns kotlin_android_style instead, so kotlin_android assertions on map-access paths coalesce the nullable receiver via .orEmpty() before invoking .trim()/.contains(). The kotlin/JVM target keeps its legacy behaviour to avoid churning unrelated snapshots — Java records' platform types make the missing .orEmpty() harmless there. Resolves 10 MetadataTest compile errors in html-to-markdown kotlin_android e2e (testOgBasicTags, testOgMultipleTags, testTwitterCardTags). (crates/alef-e2e/src/codegen/kotlin.rs)

  • alef-backend-kotlin: fully-qualify kotlin.collections.List/Map inside sealed-class data variants whose siblings shadow the stdlib name. When a sealed enum variant carries the simple name List (or Map), Kotlin resolves bare List<T> inside the sealed body to the nested data class rather than to kotlin.collections.List. The MetadataBlock variant of NodeContent declared val entries: List<String>, which the compiler rejected with No type arguments expected for 'data class List : NodeContent'. render_type_ref_disambiguated now consults variant_names: when "List" (resp. "Map") is present as a sibling variant, generic emissions use kotlin.collections.List<…> / kotlin.collections.Map<…, …> so the stdlib type wins over the nested shadow. Fixes html-to-markdown kotlin_android NodeContent.kt compilation. (crates/alef-backend-kotlin/src/gen_bindings/object_wrapper.rs)
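
A simplified Python model of the disambiguation rule is shown below. The function signature is an assumption; the real `render_type_ref_disambiguated` operates on the IR's TypeRef rather than on strings.

```python
def render_generic(base, type_args, variant_names):
    """Render a Kotlin generic type, qualifying names shadowed by sibling variants."""
    if base in ("List", "Map") and base in variant_names:
        # A sibling variant named List/Map shadows the stdlib type.
        base = f"kotlin.collections.{base}"
    return f"{base}<{', '.join(type_args)}>"


siblings = {"List", "MetadataBlock", "Text"}
print(render_generic("List", ["String"], siblings))
print(render_generic("Map", ["String", "String"], siblings))
```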

  • alef-e2e/ruby: skip all empty-string config values, not just marked enum fields. When a fixture's config object contained a key with an empty string (e.g., embedding_model: ""), the Ruby codegen only skipped it if that key was registered in the call's enum_fields map. For enum-typed fields discovered during fixture rendering but not pre-declared in alef.toml, the empty string was rendered as a literal '', causing deserialization errors like "Unknown embedding preset: ". The fix: skip empty-string values unconditionally in the config builder loop — all empty strings are invalid for enum fields, regardless of whether they were pre-declared. Resolves 2 Ruby e2e failures: embed_texts_async_preset_switch and embed_texts_batch. Ruby: 88→90/91.

  • alef-e2e/swift: fix two visitor-method codegen bugs that left every visitor test silently inert. (1) swift_visitor_params emitted _ ctx: String for every callback, but the swift backend declares the protocol method with _ ctx: NodeContext (a typealias to RustBridge.NodeContext). Swift overload resolution treats the mismatched signature as a brand-new method, so the local visitor class never overrode the protocol's default implementation and every callback silently returned .continue — fixtures using .custom/.skip actions produced unchanged output. Now emits _ ctx: NodeContext, matching the protocol declaration exactly so overrides take effect. (2) swift_action_body interpolated optional String? parameters (e.g. visit_video's src) directly via \(src), which Swift renders as Optional("tutorial.mp4") for fixtures whose template is [VIDEO: {src}] — comparing against [VIDEO: tutorial.mp4] always failed. Added a swift_visitor_param_is_optional table mirroring the ? suffix in swift_visitor_params so optional-typed placeholders emit \(src ?? "") and unwrap to the underlying string. Combined, these two fixes take html-to-markdown swift_e2e from "compiles but 72/262 visitor assertions fail" to 262/262 passing. (crates/alef-e2e/src/codegen/swift_visitors.rs)

  • alef-e2e/swift: emit visitor callback actions with correct case naming and tuple-variant label. Fixture-driven callback action codegen (swift_action_body in swift_visitors.rs) emitted .continue_ and .custom("payload"), both inconsistent with how the swift backend declares VisitResult: the Continue unit variant is case `continue` (backtick-escaped because continue is a Swift keyword), and the Custom(String) tuple variant is case custom(field0: String) — swift-bridge synthesises field0: labels for single-field tuple variants. Without the corrections, every fixture using a custom or continue action produced 'VisitResult' has no member 'continue_' or missing argument label 'field0:' in call. Now emits .`continue` and .custom(field0: "payload"). (crates/alef-e2e/src/codegen/swift_visitors.rs)

  • alef-backend-swift, alef-e2e/swift: unbreak Swift e2e visitor compilation across three fronts. Three independent codegen bugs combined to keep html-to-markdown's swift_e2e suite at hundreds of compile errors. (1) The trait-bridge protocol default extension emitted return .continue_ (Rust-side trailing-underscore escape style) for every visitor method, but the actual enum cases are emitted via swift_case_ident which uses Swift-idiomatic backtick escapes (case `continue`) — producing 'VisitResult' has no member 'continue_' at every callback site. The default extension now derives the return literal from the first unit (no-field) variant of the result enum (swift_case_ident(&variant.name.to_lower_camel_case())), so the generated return . literal matches whatever the enum declaration emits. (2) The trait-bridge {options_type}FromJsonWith{Field} shim (e.g. conversionOptionsFromJsonWithVisitor) was only emitted as a swift-bridge extern "Rust" function in the RustBridge module — there was no top-level forwarder in the user-facing module, so e2e tests calling HtmlToMarkdown.conversionOptionsFromJsonWithVisitor(json, handle) saw module 'HtmlToMarkdown' has no member named 'conversionOptionsFromJsonWithVisitor'. The swift backend now emits a public top-level wrapper alongside the make{Trait}Handle factory, forwarding into RustBridge.{options_fn}. (3) Every e2e test file imported both the user-facing module and RustBridge, but each opaque extern "Rust" { type T; } declaration produces a public class T in RustBridge that collides with the first-class Swift Codable enum/struct of the same name — VisitResult ambiguity was the most prominent, blocking every visitor callback signature. With the new top-level forwarder in place, the e2e codegen no longer needs import RustBridge (test files only reference public-module symbols), so the line is dropped from render_test_file. 
    Combined, these three fixes take html-to-markdown swift_e2e from "fails at module compile" to a runnable test suite. (crates/alef-backend-swift/src/gen_bindings.rs, crates/alef-e2e/src/codegen/swift.rs)

  • alef-e2e/typescript: handle FormatMetadata assertions with display helper function. TypeScript e2e codegen for optional FormatMetadata fields (e.g., metadata.format) was applying String(...) to the tagged-enum object, which returns [object Object] instead of the format string. Now emits a _alefE2eFormatMetadataDisplay() helper that pattern-matches the FormatMetadata tagged-enum variant and extracts the format field if present. Resolves Node e2e smoke_image_png test assertion failure. (crates/alef-e2e/templates/typescript/helpers.jinja, crates/alef-e2e/templates/typescript/assertion.jinja, crates/alef-e2e/src/codegen/typescript/assertions.rs)

  • alef-backend-java: emit Builders for all serializable types (has_serde=true) in Auto mode, even without has_default. When a Rust type has #[derive(Serialize, Deserialize)] but implements Default manually (not via #[derive(Default)]), the alef extractor marks has_default=false, causing the Java backend to skip Builder emission. Without a Builder, Jackson deserializes nested fields like PreprocessingOptions using the record constructor directly, applying Java defaults (all false/0) instead of Rust defaults. This causes serialized options sent to Rust to override preset-level settings — e.g., {"preset":"Aggressive"} deserializes with removeNavigation=false, silencing the preset's intent. The fix: in should_emit_builder() Auto mode, force Builder emission for any type where has_serde=true, regardless of has_default. All serde types benefit from a Builder to ensure nested deserialization respects Rust defaults. Additionally fixed import generation to check will_emit_builder (instead of only typ.has_default) when deciding whether to import java.util.Optional, java.util.List, java.util.Map, @JsonProperty, and @JsonPOJOBuilder so Builders get required imports even when emitted for has_serde-only types. Fixes 2 failing html-to-markdown Java e2e tests: testOptionsPreprocessingAggressive and testOptionsPreprocessingRemoveForms (all 262 Java e2e tests now pass). (crates/alef-backend-java/src/gen_bindings/types.rs)

  • alef-e2e/zig: skip chunks_have_heading_context synthetic-field assertion instead of emitting a derived predicate. heading_context on TextChunk is Option<HeadingContext> with #[serde(skip_serializing_if = "Option::is_none")], so chunks without a heading context produce no JSON key at all. The previous codegen emitted a Zig predicate that called c.object.get("heading_context") and required the value to be non-null for every chunk, which spuriously failed on extraction results where some chunks legitimately have no heading. Matching the Ruby codegen's behaviour, the assertion is now emitted as a // skipped: comment. Fixes kreuzberg's config_chunking_prepend_heading_context zig e2e (final blocker getting kreuzberg zig to 88/88 passing). (crates/alef-e2e/src/codegen/zig.rs)

  • alef-backend-csharp: re-apply separate JsonSerializationOptions for FFI parameter serialization. The C# generator used a single JsonSerializerOptions with DefaultIgnoreCondition.WhenWritingDefault for both deserializing FFI responses and serializing input parameters (like ConversionOptions) to pass to Rust. When a test explicitly set an option to false/0/null, the serializer skipped writing that field, so Rust received incomplete JSON and applied defaults, overwriting the caller's intent. This fix had landed previously but was accidentally removed in a refactoring. Restored: added a second JsonSerializationOptions (without WhenWritingDefault) used when serializing Named parameters and config objects in wrappers, streaming methods, and record-level methods; deserialization continues to use the original JsonOptions for sparse response handling. Fixes 7 failing html-to-markdown C# e2e tests, including Test_FormSelectOptions, Test_FormInputElements, Test_OptionsPreprocessingEnabledFalseSkipsCleanup, Test_OptionsCompactTablesTrue, and Test_OptionsPreprocessingRemoveNavigationFalseKeepsNav. (crates/alef-backend-csharp/src/gen_bindings/types.rs, methods.rs, and templates)

  • alef-backend-zig: de-duplicate VisitorHandle (trait-bridge type_alias) emission. The zig backend emitted trait-bridge type_alias types twice: once at the top of the file as pub const VisitorHandle = *anyopaque; (the correct form, referenced by struct fields like visitor: ?VisitorHandle and by the bridge factory html_visitor_handle_from_vtable), and again later as a struct wrapper pub const VisitorHandle = struct { _handle: *anyopaque, ... } through the generic opaque-handle emission loop. Zig rejects the duplicate declaration with a duplicate struct member error at file scope, failing every zig e2e test compile. The opaque-handle loop now filters out any type whose name matches a configured [[trait_bridges]].type_alias (respecting the bridge's exclude_languages = ["zig"] setting), so the trait-bridge contract — a raw *anyopaque pointer — is preserved as the single emission. Fixes html-to-markdown zig e2e compilation across all 9 test files. (crates/alef-backend-zig/src/gen_bindings/mod.rs)

  • alef-e2e/brew: route subcommand based on fixture tags (crawl → "crawl", map → "map", else "scrape"). Brew e2e codegen was rendering every fixture with the default subcommand (hardcoded "scrape"), so fixtures tagged "crawl" or "map" were invoked as kreuzcrawl scrape URL instead of kreuzcrawl crawl URL or kreuzcrawl map URL. This caused fixture failures because the CLI flags and output shape differ across subcommands. Now render_test_function inspects the fixture's tags and calls determine_subcommand() to route: if tags contain "crawl" use "crawl", if tags contain "map" use "map", else use the default. Fixes brew e2e test routing for all crawl/map fixtures. (crates/alef-e2e/src/codegen/brew.rs)
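
The routing rule above can be sketched as follows. The entry names `determine_subcommand()` but not its signature, so the parameters here are assumptions.

```rust
/// Picks the CLI subcommand for a fixture from its tags.
/// Hypothetical signature: only the function name appears in the release note.
fn determine_subcommand<'a>(tags: &[&str], default: &'a str) -> &'a str {
    if tags.contains(&"crawl") {
        "crawl"
    } else if tags.contains(&"map") {
        "map"
    } else {
        // Untagged fixtures keep the configured default (previously always "scrape").
        default
    }
}
```

A fixture tagged both "crawl" and "map" resolves to "crawl" in this sketch, following the order in which the entry lists the checks.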

  • alef-e2e/python, ruby, php: emit dict/hash literals for tagged-enum arrays in build_args_and_setup. Python/Ruby/PHP e2e codegen was emitting constructor calls with kwargs (e.g., PageAction(selector="#open", type="click")) for array arguments with tagged-enum element types like PageAction, but the bindings (PyO3, Magnus, ext-php-rs) expect dict/hash/array literals (e.g., {"selector": "#open", "type": "click"}). Fixed two code paths: (1) in the options_via == "dict" branch of build_args_and_setup, when element_type is set and value is an array of objects, emit dict literals instead of constructor calls; (2) in the json_object && element_type branch, when element_type is not BatchBytesItem/BatchFileItem and value is an array, emit array of dict/hash literals. Resolves ~36 Python, Ruby, and PHP e2e failures in interaction tests (interact_click_element, interact_type_field, interact_fill_form, interact_action_sequence, etc.). (crates/alef-e2e/src/codegen/python/test_function.rs:769-788, crates/alef-e2e/src/codegen/ruby.rs:1435-1453, crates/alef-e2e/src/codegen/php.rs:1475-1492)
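
As a rough illustration of the dict-literal emission for the Python case, assuming string-only field values. The function names are hypothetical, and Rust's `{:?}` escaping is used as a stand-in for proper Python string escaping.

```rust
/// Renders one fixture object as a Python dict literal.
/// Values are limited to strings to keep the sketch short.
fn python_dict_literal(fields: &[(&str, &str)]) -> String {
    let body: Vec<String> = fields
        .iter()
        // `{:?}` wraps the text in double quotes and escapes embedded quotes.
        .map(|(key, value)| format!("{:?}: {:?}", key, value))
        .collect();
    format!("{{{}}}", body.join(", "))
}

/// Renders an array of objects as a Python list of dict literals,
/// instead of constructor calls such as `PageAction(selector=..., type=...)`.
fn python_list_of_dicts(items: &[Vec<(&str, &str)>]) -> String {
    let rendered: Vec<String> = items.iter().map(|item| python_dict_literal(item)).collect();
    format!("[{}]", rendered.join(", "))
}
```

For the `PageAction` example in this entry, the sketch yields `[{"selector": "#open", "type": "click"}]`.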

  • alef-e2e/swift: stop skipping json_object args with scalar element_type. Swift e2e codegen flagged every json_object arg without options_via as unresolvable and emitted XCTSkipIf(true, ...) stubs. Args with a scalar element_type (String, bool, i*/u*, f32/f64) describe Vec<T> Rust parameters that the swift-bridge surface exposes as native Swift [T] arrays — these construct cleanly from array literals and never needed the opaque-options path. The unresolvable-arg check now excludes scalar-element json_objects, so tslp's download_languages call (args = [{ name = "names", field = "languages", type = "json_object", element_type = "String" }]) emits real download(names: [...]) invocations instead of skip placeholders. Resolves 4 skipped tslp Swift e2e download tests (download_empty_list, download_invalid_language, download_multiple_languages, download_single_language). (crates/alef-e2e/src/codegen/swift.rs)
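
The revised check might look roughly like this. `Arg`, `is_scalar_element_type`, and `is_unresolvable` are hypothetical names, and the concrete integer widths are an assumed expansion of the "i*/u*" shorthand.

```rust
/// Hypothetical, simplified view of one fixture arg from alef.toml.
struct Arg<'a> {
    arg_type: &'a str,
    options_via: Option<&'a str>,
    element_type: Option<&'a str>,
}

/// True for the element types the entry lists as scalar.
fn is_scalar_element_type(element_type: &str) -> bool {
    matches!(
        element_type,
        "String" | "bool" | "f32" | "f64"
            | "i8" | "i16" | "i32" | "i64" | "isize"
            | "u8" | "u16" | "u32" | "u64" | "usize"
    )
}

/// A json_object arg without options_via is unresolvable unless its
/// element type is scalar, in which case it maps to a native Swift array.
fn is_unresolvable(arg: &Arg) -> bool {
    arg.arg_type == "json_object"
        && arg.options_via.is_none()
        && !arg.element_type.map_or(false, is_scalar_element_type)
}
```

Under this sketch the `names` arg with `element_type = "String"` is no longer flagged, while a json_object arg with a non-scalar element type and no `options_via` still is.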

  • alef-scaffold/elixir: drop nonexistent lib/ and checksum-*.exs from mix.exs files: list. The scaffolded Elixir mix.exs unconditionally advertised ~w(lib native .formatter.exs mix.exs README* checksum-*.exs), but lib/ is only written when at least one non-OptionsField trait bridge GenServer is emitted, and checksum-*.exs is only produced by mix rustler_precompiled.download — which alef does not wire into the publish workflow. mix hex.publish validates every entry on disk before contacting the registry and aborted with Missing files: lib, checksum-*.exs. The scaffold now omits lib unless a bridge populates it, drops checksum-*.exs entirely (consumers fall back to building from source via plain rustler), and — when crates.output.elixir points outside packages/elixir/lib/ — appends the same relative *.ex glob already used for elixirc_paths so the externally-located source actually ships in the Hex tarball. Re-applies the fix originally landed as 0d987874 which was lost in a later force-push to main. Surfaced on spikard Hex publish (Goldziher/spikard runs 26215742325 and 26222483530). (crates/alef-scaffold/src/languages/elixir.rs, crates/alef-scaffold/src/tests.rs)
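
A sketch of how the `files:` list could be assembled under the new rules. The function name, its parameters, and the glob in the usage note are hypothetical.

```rust
/// Builds the entries for the `files:` list in the scaffolded mix.exs.
/// `has_bridge_lib`: whether a trait-bridge GenServer was written under lib/.
/// `external_source_glob`: the relative *.ex glob when the Elixir output
/// lives outside the package's lib/ directory.
fn hex_files(has_bridge_lib: bool, external_source_glob: Option<&str>) -> Vec<String> {
    let mut files = Vec::new();
    if has_bridge_lib {
        files.push("lib".to_string());
    }
    // checksum-*.exs is intentionally absent: nothing in the workflow produces it.
    for entry in ["native", ".formatter.exs", "mix.exs", "README*"] {
        files.push(entry.to_string());
    }
    if let Some(glob) = external_source_glob {
        files.push(glob.to_string());
    }
    files
}
```

Calling `hex_files(false, None)` lists only entries that exist on disk in the no-bridge case, which is what `mix hex.publish` validates.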

  • alef-backend-swift: emit custom init(from decoder:) for first-class Codable structs whose Rust source has #[derive(Default)] or impl Default. Swift's auto-synthesised Codable decoder rejects JSON that omits any non-Optional declared property, so JSON produced by Rust serializers using #[serde(default)] or #[serde(skip_serializing_if = "...")] (e.g. {"language":"python"} decoding into ProcessConfig, or empty-Vec fields on ProcessResult round-tripped via serde_json) failed with keyNotFound. The Swift backend now emits a custom decoder whenever TypeDef.has_default is true; each field uses decodeIfPresent + ?? <fallback> with the per-field literal from FieldDef.typed_default (BoolLiteral/IntLiteral/FloatLiteral/StringLiteral) or a type-based default ([]/[:]/false/0/""/nil). Fields with no safe Swift fallback (e.g. nested Named structs) decode without ?? and rely on the nested type's own decoder. CodingKeys is force-emitted alongside the custom decoder. Reduces tslp Swift e2e failures from 355 → <10. (crates/alef-backend-swift/src/gen_bindings.rs)
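
To make the per-field rule concrete, here is a hedged sketch of a generator helper that emits one decoder line per field. The helper names, the Swift type spellings, and the use of plain `decode` for fields without a fallback are assumptions, not the backend's actual output.

```rust
/// Type-based Swift fallback literal, or None when no safe fallback exists.
fn type_based_fallback(swift_type: &str) -> Option<&'static str> {
    if swift_type.starts_with('[') {
        // `[K: V]` is a dictionary, `[T]` an array.
        return Some(if swift_type.contains(':') { "[:]" } else { "[]" });
    }
    match swift_type {
        "Bool" => Some("false"),
        "String" => Some("\"\""),
        "Int" | "Int32" | "Int64" | "UInt" | "UInt32" | "UInt64" | "Float" | "Double" => Some("0"),
        _ => None, // e.g. nested Named structs
    }
}

/// Emits the body line of `init(from decoder:)` for one field.
/// `typed_default` is the per-field literal, when the IR carries one.
fn decode_line(name: &str, swift_type: &str, typed_default: Option<&str>) -> String {
    // Optionals need no `??`: decodeIfPresent already yields nil when the key is absent.
    if let Some(inner) = swift_type.strip_suffix('?') {
        return format!("self.{name} = try container.decodeIfPresent({inner}.self, forKey: .{name})");
    }
    match typed_default.or_else(|| type_based_fallback(swift_type)) {
        Some(fallback) => format!(
            "self.{name} = try container.decodeIfPresent({swift_type}.self, forKey: .{name}) ?? {fallback}"
        ),
        // Assumption: fields without a safe fallback use a plain decode.
        None => format!("self.{name} = try container.decode({swift_type}.self, forKey: .{name})"),
    }
}
```

With this shape, a JSON payload like `{"language":"python"}` leaves every omitted defaulted field at its fallback instead of raising `keyNotFound`.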

  • alef-e2e/java: handle FormatMetadata assertions with sealed-interface pattern matching. Java e2e codegen for optional FormatMetadata fields (e.g., metadata.format) was incorrectly applying enum coercion .map(v -> v.getValue()).orElse("") which fails because FormatMetadata is a sealed interface, not a Java enum. Now registers FormatMetadata types via assert_enum_fields in call overrides, passes the type map to render_assertion, and applies FormatMetadataDisplay.toDisplayString() for all FormatMetadata fields instead of enum-specific handling. Resolves Java e2e smoke test compilation errors on metadata.format assertions. (crates/alef-e2e/src/codegen/java.rs, kreuzberg/alef.toml)

  • alef-e2e/wasm: omit (not auto-skip) fixtures for languages outside static-compiled set. When [crates.wasm].languages was set to a curated list (e.g., 31 languages), the codegen logic set base_include = false for fixtures with input.language not in that list, but then emitted them as it.skip() tests instead of dropping them entirely. This produced 281 skip placeholders in tslp wasm e2e with the message "language X not in WASM's static-compiled set". The fix: replace the auto-skip branch with a simple continue, so fixtures for unsupported languages are omitted from the test file entirely — no skip clutter, no spurious test count. Resolves wasm e2e fixture emission. (crates/alef-e2e/src/codegen/wasm.rs)
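
The change from auto-skip to omission amounts to a `continue` in the fixture loop, sketched here with hypothetical types. Keeping fixtures that have no `input.language` is an assumption.

```rust
/// Hypothetical, simplified fixture: an id plus the optional input.language.
struct Fixture<'a> {
    id: &'a str,
    language: Option<&'a str>,
}

/// Returns the ids of fixtures that get a test emitted.
/// Fixtures for languages outside the static-compiled set are dropped
/// entirely rather than rendered as `it.skip()` placeholders.
fn fixtures_to_emit<'a>(fixtures: &[Fixture<'a>], supported: &[&str]) -> Vec<&'a str> {
    let mut emitted = Vec::new();
    for fixture in fixtures {
        if let Some(language) = fixture.language {
            if !supported.iter().any(|s| *s == language) {
                continue; // omit: no skip stub, no inflated test count
            }
        }
        emitted.push(fixture.id);
    }
    emitted
}
```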

  • alef-backend-rustler: emit default-arg signatures for params with | nil in typespec, and use keyword-opts collapsing for any trailing optionals. When a function had params marked optional in the @spec (type includes | nil) but not in the Rust IR (param not Option<T>), the Elixir wrapper counted trailing optionals only via p.optional, missing those with | nil typespecs. This caused two failures: (1) when emitting arity variants, no defaults were added (e.g., def embed_texts_async(texts, config) for config: String.t() | nil), breaking intermediate-arity calls like embed_texts_async(["text"]); (2) when deciding whether to use keyword-opts collapsing, only 2+ trailing optionals triggered the opts \\ [] form, leaving 1-optional functions with positional signatures that conflicted with e2e codegen's keyword emission. Fix: (1) when counting trailing optionals, check p.optional || type_str.contains("| nil") so params with Option Rust types are counted even if the IR doesn't mark them .optional; (2) lower the keyword-opts threshold from >= 2 to >= 1 so e2e tests calling with keyword syntax (e.g., extract_file_sync(path, config: "...")) map correctly to the opts \\ [] form. Fixes 46 Elixir e2e failures across smoke, extraction, and embedding categories. (crates/alef-backend-rustler/src/gen_bindings/mod.rs)
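
The two-part fix can be sketched as below. `Param`, `trailing_optional_count`, and `wrapper_head` are hypothetical names, and the emitted `def` head is simplified to the signature only.

```rust
/// Hypothetical, simplified view of one wrapper parameter.
struct Param<'a> {
    name: &'a str,
    optional: bool,     // set by the Rust IR
    type_str: &'a str,  // Elixir typespec for the param
}

/// Counts trailing optional params, treating a `| nil` typespec as
/// optional even when the IR flag is unset.
fn trailing_optional_count(params: &[Param]) -> usize {
    params
        .iter()
        .rev()
        .take_while(|p| p.optional || p.type_str.contains("| nil"))
        .count()
}

/// Renders the wrapper head, collapsing any trailing optionals
/// (threshold >= 1, previously >= 2) into a keyword list.
fn wrapper_head(fn_name: &str, params: &[Param]) -> String {
    let optionals = trailing_optional_count(params);
    let required = &params[..params.len() - optionals];
    let mut args: Vec<&str> = required.iter().map(|p| p.name).collect();
    if optionals >= 1 {
        args.push("opts \\\\ []");
    }
    format!("def {}({})", fn_name, args.join(", "))
}
```

For `embed_texts_async(texts, config)` with `config: String.t() | nil`, the sketch produces `def embed_texts_async(texts, opts \\ [])`, so both `embed_texts_async(["text"])` and keyword-style calls resolve.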

  • alef-backend-csharp: fully-qualify all Marshal calls with global::System.Runtime.InteropServices. C# codegen was emitting bare Marshal.* references in multiple locations: gen_visitor.rs (PtrToStringUTF8 conversions), trait_bridge.rs (StringToCoTaskMemUTF8 callbacks), and gen_bindings/methods.rs (JSON deserialization). Template-based codegen in named_param_handle_from_json.jinja and multiple other templates also had bare Marshal references. Without the global:: qualifier, these break compilation with CS0103: name 'Marshal' does not exist. Now all emissions use global::System.Runtime.InteropServices.Marshal.* consistently. Resolves C# e2e compilation errors in ExtractionResult.cs and related wrapper classes. (crates/alef-backend-csharp/src/gen_visitor.rs, crates/alef-backend-csharp/src/trait_bridge.rs, crates/alef-backend-csharp/src/gen_bindings/methods.rs, crates/alef-backend-csharp/templates/*)