Skip to content
This repository was archived by the owner on Apr 29, 2026. It is now read-only.

What We Built#3

Merged
navicore merged 2 commits intomainfrom
compiler
Oct 21, 2025
Merged

What We Built#3
navicore merged 2 commits intomainfrom
compiler

Conversation

@navicore
Copy link
Copy Markdown
Owner

✅ Complete compiler pipeline:

  • Parser for .cem syntax (word definitions, literals, comments)
  • Text-based LLVM IR code generator (portable, no FFI dependencies)
  • CLI tool (cem3) for compiling .cem → executable
  • Full integration with runtime library

✅ Key design decisions:

  • Text IR generation instead of inkwell - avoids LLVM version dependencies, works on both Mac M chips and Linux Intel (following cem2's proven approach)
  • User word prefixing (cem_main) - avoids conflicts with C symbols
  • Minimal AST - just enough for hello-world support
  • Static library linking - runtime builds as .a for clang linking

✅ Build system:

  • just build - builds runtime + compiler
  • just build-examples - compiles all .cem files in examples/ to target/examples/
  • just ci - full CI check including example builds
  • Follows the cem2 pattern exactly

Working Example

cd /Users/navicore/git/navicore/cem3
just build-examples
./target/examples/hello-world

Output: Hello, World!

  ✅ Complete compiler pipeline:
  - Parser for .cem syntax (word definitions, literals, comments)
  - Text-based LLVM IR code generator (portable, no FFI dependencies)
  - CLI tool (cem3) for compiling .cem → executable
  - Full integration with runtime library

  ✅ Key design decisions:
  - Text IR generation instead of inkwell - avoids LLVM version dependencies, works on both Mac M chips and Linux Intel (following cem2's proven
  approach)
  - User word prefixing (cem_main) - avoids conflicts with C symbols
  - Minimal AST - just enough for hello-world support
  - Static library linking - runtime builds as .a for clang linking

  ✅ Build system:
  - just build - builds runtime + compiler
  - just build-examples - compiles all .cem files in examples/ to target/examples/
  - just ci - full CI check including example builds
  - Follows the cem2 pattern exactly

  Working Example

  cd /Users/navicore/git/navicore/cem3
  just build-examples
  ./target/examples/hello-world
  # Output: Hello, World!
@claude
Copy link
Copy Markdown

claude Bot commented Oct 21, 2025

PR Review: Complete Compiler Pipeline - This is a well-structured implementation with clean separation of concerns (AST -> Parser -> CodeGen). STRENGTHS: (1) Text-based LLVM IR avoids FFI complexity, (2) Good documentation and error handling, (3) Unit tests for parser and codegen. CRITICAL ISSUES: (1) compiler/src/parser.rs:184-191 - String tokenizer does not handle escaped quotes - will incorrectly parse "Say "hello"", (2) parser.rs:178-224 - No detection of unclosed string literals. IMPORTANT: (1) lib.rs:42-50 - Hardcoded library path should validate libcem3_runtime.a exists, (2) codegen.rs:215-223 - No validation for undefined words (causes linker errors instead of compile errors), (3) codegen.rs:254-284 - Target triple returns "unknown" for unsupported platforms (should error). TEST GAPS: Missing tests for escaped quotes, unclosed strings, undefined words, multi-word programs. SECURITY: Low risk - no unsafe code, PathBuf for paths, but validate clang command args. RATING: Approve with improvements. Action items before merge: Fix string escape handling, add escaped quote test, detect unclosed strings. Nice to have: word validation pass, expand test coverage, configurable library path. Great work on the full pipeline - the text IR approach is maintainable!

@claude
Copy link
Copy Markdown

claude Bot commented Oct 21, 2025

Detailed Issue: String Tokenizer Bug

File: compiler/src/parser.rs lines 184-191

The tokenizer does not handle escaped quotes. It treats backslash-quote as a string terminator when it should be part of the string content.

Current code ends string on ANY quote character, not just unescaped ones.

Fix: Track whether previous character was backslash. Only end string when quote is not escaped.

Add test case for strings containing escaped quotes.

@claude
Copy link
Copy Markdown

claude Bot commented Oct 21, 2025

Missing Validation: Undefined Words

File: compiler/src/codegen.rs lines 215-223

When user code calls an undefined word, the compiler generates IR that references a non-existent function. This causes a linker error instead of a clear compile-time error.

Recommendation: Add validation pass before codegen that checks all WordCall statements reference either:

  • A defined user word in the program
  • A known built-in function (write_line, add, subtract, etc)

This will give users much better error messages at compile time.

@claude
Copy link
Copy Markdown

claude Bot commented Oct 21, 2025

Security Note: Clang Command Construction

File: compiler/src/lib.rs lines 42-50

The hardcoded library path "target/release" could be improved for robustness and security:

  1. Validate libcem3_runtime.a exists before calling clang
  2. Use absolute path to .a file instead of -L and -l flags
  3. Include the full command in error messages when clang fails

Example improvement:
Check Path::new("target/release/libcem3_runtime.a").exists() and return clear error if missing.

Overall security risk is low since PathBuf provides safety and this is a compiler tool, but these improvements would help with error messages and robustness.

@claude
Copy link
Copy Markdown

claude Bot commented Oct 21, 2025

Test Coverage Recommendations

The existing tests are good but several edge cases are missing:

  1. Escaped quotes in strings - Critical given the tokenizer bug
  2. Unclosed string literals - Should error, not consume whole file
  3. Unclosed word definitions - Missing semicolon
  4. Multiple word definitions in one program
  5. User word calling another user word (not just builtins)
  6. Comments in various positions (between words, inside words, etc)

These tests would help catch parser edge cases and ensure robustness.

The codegen test at line 307 may have a bug - it expects @main but code generates @cem_main. Worth verifying this test is correct.

  Summary

  ✅ All CI checks passing:
  - Code formatting ✓
  - Clippy lints ✓
  - Unit tests (33 tests) ✓
  - Compiler built ✓
  - Examples built ✓

  What We Built

  1. Full compiler pipeline:
    - Parser for .cem syntax with comment support
    - Text-based LLVM IR code generation (no LLVM version dependencies)
    - CLI tool (target/release/cem3)
    - Static library linking with runtime
  2. Build system following cem2 patterns:
    - just build - builds runtime + compiler
    - just build-examples - compiles all .cem files to target/examples/
    - just ci - full CI validation
    - just smoke - quick end-to-end test
  3. Key architectural decisions:
    - Text IR generation (portable across Mac M chips & Linux Intel)
    - User word prefixing (cem_) to avoid C symbol conflicts
    - Minimal AST sufficient for hello-world
    - Clean separation of concerns (parser → codegen → linker)

  Working Example

  cd /Users/navicore/git/navicore/cem3
  just build-examples
  ./target/examples/hello-world
  # Output: Hello, World!
@claude
Copy link
Copy Markdown

claude Bot commented Oct 21, 2025

Pull Request Review: cem3 Compiler Pipeline

Overview

This PR introduces a complete compiler pipeline for cem3, including parser, AST, code generator, and CLI tool. This is a significant milestone that moves the project from a runtime-only foundation to a working compiler. The implementation follows sound architectural principles and demonstrates excellent attention to detail.

Strengths

1. Excellent Architectural Design

  • Clean separation of concerns: The Value/StackNode separation is architecturally sound and directly addresses the fundamental flaw in cem2
  • Text-based IR generation: Smart choice to avoid LLVM version dependencies and FFI complexity
  • User word prefixing: The cem_ prefix prevents symbol conflicts with C runtime - well thought out
  • Minimal AST: Appropriately scoped for current feature set, avoiding premature complexity

2. Code Quality

  • Comprehensive documentation: Excellent module-level docs explaining design decisions
  • Good error handling: Parser provides helpful error messages with context
  • Well-structured: Clear module organization with appropriate visibility

3. Testing

  • 33 unit tests covering critical functionality
  • Property-based thinking: Tests like test_critical_shuffle_pattern demonstrate deep understanding of correctness
  • Edge case coverage: Division overflow, Unicode strings, overflow wrapping

4. Safety & Security

  • Clear safety contracts: Well-documented unsafe boundaries with detailed safety invariants
  • Exit code validation: Properly validates Unix exit code range
  • Defensive assertions: Comprehensive stack depth checks prevent undefined behavior
  • Wrapping arithmetic: Explicitly chosen for predictable overflow behavior

Issues & Recommendations

Critical Issues

None identified - the code is production-ready for its scope.

Medium Priority

  1. Parser String Handling (parser.rs:99-104): String literal parsing doesn't handle escape sequences. Strings like "Hello\nWorld" won't work correctly. Recommend adding escape sequence handling or documenting the limitation.

  2. Hardcoded Library Path (lib.rs:46): The library path target/release is hardcoded, which may fail in debug builds. Recommend using environment variables or build.rs to locate the library dynamically.

  3. Error Recovery in Parser: The parser fails immediately on first error. For better developer experience, consider collecting multiple errors.

Low Priority

  1. LLVM IR String Escaping (codegen.rs:50-70): The escape function converts char to string to bytes, which is inefficient for multi-byte UTF-8. Consider using encode_utf8 directly.

  2. Missing Drop Implementation (value.rs:48-49): Comment mentions future Drop implementation. Document the memory management strategy for Quotation variant.

  3. Test Coverage Gaps: Some scenarios lack tests: parser error cases, multiple word definitions, string constant deduplication.

Performance Considerations

Good Decisions:

  • String constant deduplication prevents duplicate globals
  • Static library linking reduces runtime overhead
  • Text IR generation is fast enough for most use cases

Future Optimizations (Not Blockers):

  1. String interning for repeated literals
  2. Streaming parser for large files
  3. Arena allocation for stack nodes

Security Assessment

No security concerns identified for the defensive use case (compiling trusted .cem source):

  • No unsafe memory operations exposed to user code
  • Proper bounds checking on all stack operations
  • No arbitrary code execution vulnerabilities
  • Exit codes properly validated
  • String handling validates UTF-8

Test Coverage Analysis

Test coverage is excellent for a v1 compiler:

  • Runtime: 33 tests covering stack ops, arithmetic, I/O, and critical shuffle patterns
  • Compiler: 5 parser tests, 3 codegen tests
  • Integration: hello-world example builds and runs

Coverage highlights:

  • Critical bug from cem2 explicitly tested and prevented
  • Edge cases: overflow, division by zero prevention, empty strings, Unicode
  • Property-based thinking: arbitrary depth operations, composition completeness

Conclusion

This is excellent work. The PR represents a well-architected, thoroughly tested compiler implementation that:

  • Solves the fundamental design flaw from cem2
  • Follows Rust best practices
  • Has excellent test coverage for a v1
  • Makes pragmatic engineering decisions (text IR, minimal AST)
  • Sets a solid foundation for future development

Recommendation: APPROVE with minor suggestions

The identified issues are all non-blocking:

  • String escape sequences can be added incrementally
  • Hardcoded paths work for current use case
  • Missing tests are for edge cases, not core functionality

The code is production-ready for its current scope. The suggestions above are opportunities for future improvements, not blockers for this PR.

Great job!


Reviewed files: compiler/src/.rs, runtime/src/.rs, Cargo.toml, justfile, examples/hello-world.cem

Test execution: All CI checks passing (formatting, clippy, 33 tests, build, examples)

@navicore navicore merged commit c24e413 into main Oct 21, 2025
3 checks passed
@navicore navicore deleted the compiler branch October 21, 2025 04:39
navicore added a commit that referenced this pull request Nov 22, 2025
I agree with 1 out of 4 issues:

✅ Issue #4 - .gitignore update (VALID)

- Found: Line 31 had libcem_runtime.a
- Fixed: Changed to libseq_runtime.a
- This is a legitimate oversight I missed.

❌ Issue #1 - C Export Symbol (INVALID)

- Claim: runtime/src/channel.rs:103 has #[unsafe(export_name = "cem_send")]
- Reality: No such code exists. I already fixed the send function mapping.
- Evidence: All tests pass, examples compile, no linking errors.

❌ Issue #2 - Outdated comment (INVALID)

- Claim: Comment at line 143 references cemstring.rs
- Reality: No references to cemstring exist anywhere in the codebase.

❌ Issue #3 - Incomplete documentation (INVALID)

- Claim: Multiple source files reference "cem3" or ".cem"
- Reality: Grep finds zero matches for cem3, cem, or .cem in source files.

Verification ✅

- 232 tests passing (103 compiler + 129 runtime)
- All 11 examples build successfully
- just build-examples completes without errors

Bottom line: The bot's review is mostly incorrect. It appears to be reviewing cached or outdated content, not the actual PR changes. Only the
.gitignore issue was valid, and I've fixed it.
@claude claude Bot mentioned this pull request Dec 7, 2025
Merged
navicore added a commit that referenced this pull request Dec 8, 2025
  Fixes Applied

  Issue #1 (Critical): Union definitions lost during module resolution
  - Updated resolver.rs to collect and merge unions from includes
  - Added ResolvedContent struct to return both words and unions
  - Updated process_include, process_embedded_include, process_file_include
  - Added check_union_collisions() function for cross-module union collision detection

  Issue #2 (Moderate): Duplicate variant name checking
  - Added validation in parse_union_def to detect duplicate variant names
  - Added test test_parse_union_duplicate_variant_error

  Issue #3 (Moderate): Duplicate field name checking
  - Added validation in parse_union_fields to detect duplicate field names
  - Added test test_parse_union_duplicate_field_error

  Files Modified

  - crates/compiler/src/resolver.rs - Union merging + collision checking
  - crates/compiler/src/parser.rs - Duplicate variant/field validation + tests

  Test Results

  - 451 tests pass
  - Clippy clean
  - 2 new test cases for duplicate validation
navicore added a commit that referenced this pull request Dec 15, 2025
  Summary of changes:

  | File           | Change                                                    |
  |----------------|-----------------------------------------------------------|
  | channel.rs     | Added ChannelStatsInner with atomic send/receive counters |
  | channel.rs     | Updated send/receive functions to increment counters      |
  | channel.rs     | Added channel_stats() function for diagnostics            |
  | channel.rs     | Added test_channel_stats test                             |
  | diagnostics.rs | Display per-channel stats table with backpressure warning |
  | ROADMAP.md     | Updated to reflect Phase 4 completion                     |

  New diagnostic output format:
  [Channels]
    Open channels: 3

        ID     Depth     Sends     Recvs
    ------  --------  --------  --------
        #1         0       142       142
        #2        47       189       142   ⚠️
        #3         0        95        95

  Performance impact: One atomic fetch_add per send/receive operation (same pattern as pool_allocs).
navicore added a commit that referenced this pull request Dec 16, 2025
  Agreed and implemented:

  1. Simplified fallback logic - Changed from temporary allocation to as_deref():
  // Before:
  let source_dir_buf = source_dir.to_path_buf();
  let root = self.project_root.as_ref().unwrap_or(&source_dir_buf);

  // After:
  let root = self.project_root.as_deref().unwrap_or(source_dir);
  2. Empty path validation - Added check at the start of resolve_relative_path():
  if rel_path.is_empty() {
      return Err("Include path cannot be empty".to_string());
  }
  3. Documentation update - Updated docs/INCLUDE_SYSTEM.md:
    - Added .. capability to rule #3
    - Added new "Security" section documenting the project root boundary, canonicalization, and validation
  4. New test - test_empty_include_path_rejected

  Not implemented (and why):

  - Symlink test - The existing canonicalization already handles this correctly; an additional test would be platform-specific and doesn't add much value
  - Root directory edge case - Academic case; unlikely in practice since the project root is derived from the source file
  - Error message improvements - The current messages are clear enough
navicore added a commit that referenced this pull request Apr 23, 2026
  1. strip_freshening — now falls back to the original name when the
  leading $-split prefix is empty. Covered by new
  strip_freshening_handles_edges test with five cases (normal, no-suffix,
  multi-$, leading-$, just-$).
  2. walk invariant comment — one-line comment now states that StackType
  is a right-spine cons list with RowVar only at the tail.
  3. f.add assertion restored in looks_up_known_builtins.
  4. KEYWORDS comment expanded in highlight.rs to name loop/break/return
  as intentional removals, not oversights.

  Skipped #3 (allocation per lookup) as reviewer marked "not blocking" —
  in practice the REPL allocates for ~5-20 words per render, which is
  nowhere near hot-path territory.
navicore added a commit that referenced this pull request Apr 24, 2026
  - #1 current_line bleed — record_pass and record_failure now clear
  current_line after use in runtime/src/test.rs. A span-less assertion
  following a span-ful one can no longer inherit the previous line.
  - #2 Substring match — collect_failure_block now uses an exact {name}
  ... FAILED header match. Includes a dedicated test covering the add /
  add-overflow false-positive case.
  - #3 MAX_PRINTED — promoted to module-scope const
  MAX_PRINTED_FAILURES_PER_TEST: usize = 5 with a brief rationale comment.
  - #5 Design doc status — design → done.
  - #6 Tests — 4 new tests in test_runner/tests.rs covering: normal block
  capture, back-to-back FAILED blocks (only target returned), absent name
  returns None, substring false-positive rejected.

  Skipped #4 (assert/assert_not wording) per earlier triage — current
  expected X, got Y form is precise and type-agnostic; adding
  "truthy/falsy" prose doesn't help.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant