Support pm-gawk-style persistent global variables across scripts by remapping global slots at execution time #462

@bertysentry

Summary

Add support for persistent global variables shared across independently compiled scripts, in the spirit of GNU awk's pm-gawk, while preserving Jawk's current slot-based runtime execution model.

The preferred direction is to remap global slots at execution boundaries when switching to a different compiled script, instead of replacing runtime variable access with HashMap<String, Object> lookups or introducing per-access cell indirection.

Context

Jawk currently compiles AWK identifiers to fixed offsets and executes them through array-backed global and local slots. That design is efficient because:

  • variable names are resolved at compile time
  • runtime variable access is reduced to slot dereference
  • locals and globals have simple, predictable layouts

The problem is that each compiled program has its own global layout.

Example:

  • script 1 globals: A -> 0, B -> 1, C -> 2
  • script 2 globals: C -> 0, D -> 1

If script 1 runs first and then script 2 runs later in the same persistent memory context, script 2 should observe the value previously assigned to C, even though C is compiled to a different slot offset in script 2.

Persistence therefore requires name-based identity across runs, but not necessarily name-based lookups during execution.
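The mismatch can be shown with a minimal sketch. The class and method names below are hypothetical and are not part of Jawk; plain Java maps and arrays stand in for the compiled layout and the AVM's global slots.

```java
import java.util.Map;

/** Hypothetical illustration only; these names are not Jawk APIs. */
final class LayoutMismatchSketch {

    /**
     * Reads a global the way a slot-based runtime would: the name is resolved
     * to an offset by the script's own compiled layout, then the slot is read.
     */
    static Object readBySlot(Object[] slots, Map<String, Integer> compiledOffsets, String name) {
        return slots[compiledOffsets.get(name)];
    }
}
```

If the slot array left behind by script 1 (A, B, C at offsets 0, 1, 2) is reused unchanged, script 2 resolves C to offset 0 and reads the value of A. The slots have to be rearranged, or looked up by name, before script 2 starts.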

Desired behavior

We want to support a persistence model where:

  • user-defined global variables survive across script executions
  • independently compiled scripts can observe the same global variables by name
  • locals remain local to their function invocation and are not persistent
  • built-in AWK variables such as NR, NF, FS, and RS remain runtime-managed and non-persistent
  • persistent functions are treated as a separate concern

Proposed direction

Use execution-boundary global slot remapping.

High-level idea

When a new compiled script is about to run:

  1. read that script's compiled global name-to-offset mapping
  2. rebuild the AVM global slot array so the new script's expected offsets contain the correct persistent values
  3. preserve any previously known persistent globals that are not referenced by the new script by appending them after the new script's globals
  4. replace the AVM's current global-layout metadata with the new layout
  5. execute the script normally with unchanged slot-based opcodes

Conceptually:

  • persistence is keyed by variable name between runs
  • execution remains keyed by slot offset within a run

Example

After script 1 runs:

  • 0 -> A
  • 1 -> B
  • 2 -> C

Now script 2 is about to run and expects:

  • 0 -> C
  • 1 -> D

The AVM would rebuild the global layout for script 2 as:

  • 0 -> C with the value previously stored for C
  • 1 -> D with no prior value
  • 2 -> A
  • 3 -> B

That allows script 2 to execute with its own compiled offsets while preserving previously created globals for future runs.
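The remap step for this example could look roughly like the sketch below. Everything in it is an assumption made for illustration: the Layout type, the remap method, the use of null for "no prior value", and the requirement that the next script's offsets are dense (0 to n-1) are not taken from Jawk's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of execution-boundary remapping; not Jawk's API. */
final class GlobalSlotRemapSketch {

    /** A global layout: names.get(i) is the global stored in slots[i]. */
    static final class Layout {
        final List<String> names;
        final Object[] slots; // null stands for "no prior value" in this sketch

        Layout(List<String> names, Object[] slots) {
            this.names = names;
            this.slots = slots;
        }
    }

    /**
     * Builds the layout for the next script. Assumes nextOffsets is dense,
     * i.e. it uses each offset from 0 to nextOffsets.size() - 1 exactly once.
     */
    static Layout remap(Layout previous, Map<String, Integer> nextOffsets) {
        // Name-based view of the previous run, in previous slot order.
        Map<String, Object> byName = new LinkedHashMap<>();
        for (int i = 0; i < previous.names.size(); i++) {
            byName.put(previous.names.get(i), previous.slots[i]);
        }

        // The next script's globals go exactly where its compiled code expects them.
        String[] referenced = new String[nextOffsets.size()];
        for (Map.Entry<String, Integer> e : nextOffsets.entrySet()) {
            referenced[e.getValue()] = e.getKey();
        }
        List<String> names = new ArrayList<>(Arrays.asList(referenced));

        // Globals the next script does not reference are appended after them.
        for (String name : byName.keySet()) {
            if (!nextOffsets.containsKey(name)) {
                names.add(name);
            }
        }

        // Fill the new slot array by name; unknown names stay null.
        Object[] slots = new Object[names.size()];
        for (int i = 0; i < slots.length; i++) {
            slots[i] = byName.get(names.get(i));
        }
        return new Layout(names, slots);
    }
}
```

Called with the layout left by script 1 and the offsets C -> 0, D -> 1, this produces the names C, D, A, B in that order. The appended globals keep their previous relative order because the name-based view is a LinkedHashMap.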

Why this direction

This approach preserves the current runtime strengths:

  • no hot-path hash lookup for global reads or writes
  • no opcode-level change for DEREFERENCE, ASSIGN, PLUS_EQ, and similar operations
  • locals can remain slot-based and frame-scoped exactly as they are today
  • independently compiled scripts keep their own offset layouts without conflict

It also appears less invasive than introducing a GlobalCell indirection layer for every global slot access.

Required runtime changes

This is not a zero-refactor change. At minimum, the runtime will need to:

  • retain persistent global values across executions instead of discarding them on reset
  • retain enough metadata to map current slots back to global names
  • rebuild the globals array when switching to a different compiled global layout
  • update the AVM's active global name/offset metadata after each remap

In other words, Jawk still needs a canonical name-based view of persistent globals at execution boundaries, even if the interpreter itself remains slot-based during execution.
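One possible shape for that canonical view is a name-keyed store that is touched only at execution boundaries. The sketch below is a variant in which unreferenced globals stay in the store rather than in extra slots. The class and its capture, materialize, and clear methods are hypothetical, and null again stands in for an uninitialized global.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical boundary-only store for persistent globals; not Jawk's API. */
final class PersistentGlobalsSketch {

    private final Map<String, Object> valuesByName = new HashMap<>();

    /** After a run: copy every initialized slot back under its global name. */
    void capture(Map<String, Integer> compiledOffsets, Object[] slots) {
        for (Map.Entry<String, Integer> e : compiledOffsets.entrySet()) {
            Object value = slots[e.getValue()];
            if (value != null) { // uninitialized globals are not recorded
                valuesByName.put(e.getKey(), value);
            }
        }
    }

    /** Before a run: build the slot array in the script's own compiled layout. */
    Object[] materialize(Map<String, Integer> compiledOffsets) {
        Object[] slots = new Object[compiledOffsets.size()];
        for (Map.Entry<String, Integer> e : compiledOffsets.entrySet()) {
            slots[e.getValue()] = valuesByName.get(e.getKey()); // null if never set
        }
        return slots;
    }

    /** Drops all persistent state. */
    void clear() {
        valuesByName.clear();
    }
}
```

The map is consulted once per global per execution, so reads and writes during the run still go straight to the slot array.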

Important design constraints

  • Uninitialized globals must retain correct AWK semantics after remapping.
  • Array vs scalar behavior must remain correct across runs.
  • Built-in runtime-managed variables must stay outside this persistence/remapping scheme unless explicitly designed otherwise.
  • The remap logic should preserve previously known globals even when the next script does not reference them.
  • The design should be explicit about whether it assumes a single sequential AVM or must also support concurrent execution safely.
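The first two constraints can be made concrete with a small check that runs while values are restored. This is only a sketch under stated assumptions: the Kind enum and check method are hypothetical, an AWK array is modelled as a java.util.Map, null models an uninitialized global, and failing fast on a mismatch is one possible policy rather than a decided behavior.

```java
import java.util.Map;

/** Hypothetical restore-time check for array vs scalar use; not Jawk's API. */
final class RestoredKindCheckSketch {

    enum Kind { SCALAR, ARRAY }

    /**
     * Verifies that a restored value matches how the next script uses the name.
     * An uninitialized global (null here) is accepted for either use, so the
     * script can still turn it into a scalar or an array on first access.
     */
    static void check(String name, Kind expected, Object restored) {
        if (restored == null) {
            return;
        }
        Kind actual = (restored instanceof Map) ? Kind.ARRAY : Kind.SCALAR;
        if (actual != expected) {
            throw new IllegalStateException(
                "Persistent global " + name + " was stored as " + actual
                    + " but the next script uses it as " + expected);
        }
    }
}
```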

Non-goals

This issue does not cover:

  • persistent local variables
  • replacing all runtime variable access with name-based map lookups
  • persistent user-defined functions
  • persistence file format or on-disk heap implementation details

Those may be follow-up issues.

Open questions

  • What should be the canonical persistent representation between runs: Map<String, Object>, name-to-slot metadata plus a globals array, or another structure?
  • Should remapping happen only when switching to a different compiled program, or on every execution?
  • How should array/scalar misuse checks interact with values restored from a previous script?
  • Should built-in but materialized globals such as ARGC, ARGV, or ENVIRON participate in this scheme, or remain special cases?
  • How should the runtime expose load/save/clear operations for persistent state?
  • What concurrency guarantees, if any, should be provided?

Why not a full HashMap<String, Object> runtime

A full name-based runtime would simplify cross-program identity, but it would also:

  • make every global variable access pay a map lookup cost
  • require reworking scope handling that the current slot model already solves well
  • cut across the existing tuple/interpreter contract instead of extending it

For Jawk, that looks like the wrong tradeoff if persistent globals can be achieved by remapping slots at execution boundaries.

Acceptance direction

A successful implementation should make it possible for:

  1. one compiled script to assign a user-defined global variable
  2. a separate compiled script, executed later in the same persistence context, to read or modify that same variable by name
  3. both scripts to keep their own compiled slot layouts without conflict
  4. the AVM to remap globals for the active script without changing hot-path opcode behavior
  5. local variables and built-in runtime state to remain non-persistent unless explicitly designed otherwise

Reference

GNU awk persistent memory manual:
https://www.gnu.org/software/gawk/manual/pm-gawk/pm-gawk.html
