Implement AWK strnum semantics for input-derived numeric strings #476

@bertysentry

Summary

Jawk should implement AWK's strnum / numeric-string semantics. Today, input values and script-created strings appear to use the same runtime representation, so comparisons cannot distinguish an input-derived numeric string from a plain string literal.

Related: #110

Current implementation context

Runtime values are passed around as Object, currently including values such as Long, Double, String, and UninitializedObject.

Relevant methods:

  • JRT.compare2(Object, Object, boolean) for comparisons
  • JRT.toDouble(Object) for numeric operations

This is enough for arithmetic conversion, but not enough for AWK comparison semantics, because AWK needs to distinguish three value attributes:

  • number: numeric value, usually from arithmetic
  • string: ordinary string, such as a string literal or a string operation result
  • strnum: string from user/input sources that fully looks like a number

Required comparison rule

For <, <=, >, >=, ==, and !=, AWK chooses string or numeric comparison based on the operand attributes:

left / right    string    number     strnum
string          string    string     string
number          string    numeric    numeric
strnum          string    numeric    numeric

A pure string operand forces string comparison. Otherwise, comparison is numeric.
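The matrix reduces to a single predicate, which can be sketched as follows. This is a minimal illustration, not Jawk code: the `Attr` enum and the `isNumericComparison` method are hypothetical names introduced only for this example.

```java
// Hypothetical sketch: Attr and isNumericComparison are illustrative names,
// not part of Jawk's existing API.
final class ComparisonMatrixSketch {

    /** The three value attributes described above. */
    enum Attr { STRING, NUMBER, STRNUM }

    /** A comparison is numeric unless at least one operand is a plain string. */
    static boolean isNumericComparison(Attr left, Attr right) {
        return left != Attr.STRING && right != Attr.STRING;
    }

    public static void main(String[] args) {
        // Print the full 3x3 matrix, one row per left-hand attribute.
        for (Attr left : Attr.values()) {
            StringBuilder row = new StringBuilder(left.toString());
            for (Attr right : Attr.values()) {
                row.append(' ').append(isNumericComparison(left, right) ? "numeric" : "string");
            }
            System.out.println(row);
        }
    }
}
```

Running `main` prints one row per left-hand attribute, which can be checked against the table above.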

Important behavior to preserve

Arithmetic operations should still use permissive numeric-prefix conversion:

echo 2x    | gawk '{ print($1 + 1) }'   # 3
echo 2.3x  | gawk '{ print($1 + 1) }'   # 3.3
echo 2x.3x | gawk '{ print($1 + 1) }'   # 3
echo 2e+02 | gawk '{ print($1 + 1) }'   # 201

But comparisons must not simply parse numeric prefixes:

echo 2x    | gawk '{ print($1 < 10) }'  # 0, string comparison: "2x" < "10"
echo 2x.3x | gawk '{ print($1 < 10) }'  # 0, string comparison
echo 2e01  | gawk '{ print($1 < 10) }'  # 0, numeric comparison: 20 < 10
echo 9     | gawk '{ print($1 < 10) }'  # 1, numeric comparison: 9 < 10
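The two behaviors above call for two different checks: a permissive prefix conversion for arithmetic, and a strict "whole string is a number" test for deciding whether an input value is a strnum. The Java sketch below shows one way to separate them. All names (`NumericPrefixSketch`, `toNumber`, `looksNumeric`) are hypothetical, and the sketch assumes decimal syntax only, with leading and trailing blanks tolerated. It does not handle locale-specific decimal points or special values such as inf and nan.

```java
// Hypothetical sketch: names are illustrative and do not exist in Jawk.
// Assumption: decimal syntax only (no hexadecimal, no inf/nan, no locale handling).
final class NumericPrefixSketch {

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Skips spaces, tabs, and newlines starting at index i. */
    private static int skipBlanks(String s, int i) {
        while (i < s.length() && (s.charAt(i) == ' ' || s.charAt(i) == '\t' || s.charAt(i) == '\n')) {
            i++;
        }
        return i;
    }

    /** Index just past the longest decimal number starting at 'start'; returns 'start' if there is none. */
    static int numberEnd(String s, int start) {
        int n = s.length();
        int i = start;
        if (i < n && (s.charAt(i) == '+' || s.charAt(i) == '-')) i++;
        int digits = 0;
        while (i < n && isDigit(s.charAt(i))) { i++; digits++; }
        if (i < n && s.charAt(i) == '.') {
            i++;
            while (i < n && isDigit(s.charAt(i))) { i++; digits++; }
        }
        if (digits == 0) return start; // a sign or a dot alone is not a number
        // The exponent is consumed only if at least one exponent digit follows.
        if (i < n && (s.charAt(i) == 'e' || s.charAt(i) == 'E')) {
            int j = i + 1;
            if (j < n && (s.charAt(j) == '+' || s.charAt(j) == '-')) j++;
            int expStart = j;
            while (j < n && isDigit(s.charAt(j))) j++;
            if (j > expStart) i = j;
        }
        return i;
    }

    /** Permissive conversion for arithmetic: uses the numeric prefix, or 0 if there is none. */
    static double toNumber(String s) {
        int start = skipBlanks(s, 0);
        int end = numberEnd(s, start);
        return end == start ? 0.0 : Double.parseDouble(s.substring(start, end));
    }

    /** Strict check for strnum tagging: the whole string, apart from surrounding blanks, is a number. */
    static boolean looksNumeric(String s) {
        int start = skipBlanks(s, 0);
        int end = numberEnd(s, start);
        return end > start && skipBlanks(s, end) == s.length();
    }

    public static void main(String[] args) {
        for (String field : new String[] {"2x", "2.3x", "2x.3x", "2e+02", "2e01", "9", "0x10"}) {
            System.out.println(field + ": toNumber=" + toNumber(field) + ", looksNumeric=" + looksNumeric(field));
        }
    }
}
```

With this split, `2x` converts to 2 for arithmetic but is not tagged as strnum, so a comparison against 10 falls back to string comparison, as in the gawk examples.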

Attribute propagation examples

Assignment should preserve the attribute:

echo 9 | gawk '{ x = $1; print(x < 10) }'   # 1

String operations should produce plain strings:

echo 9 | gawk '{ x = $1 ""; print(x < 10) }'   # 0, because "9" < "10" is false

String literals are plain strings, not strnum:

gawk 'BEGIN { print("9" < 10) }'   # 0
gawk 'BEGIN { print(9 < "10") }'   # 0

Numeric operations produce numeric values:

echo 9 | gawk '{ x = $1 + 0; print(x < 10) }'   # 1

Hexadecimal note

By default, GNU awk does not treat input such as 0x10 as hexadecimal during ordinary string-to-number conversion:

echo 0x10 | gawk '{ print($1 + 1) }'   # 1
echo 0x10 | gawk '{ print($1 < 10) }'  # 1, string comparison: "0x10" < "10"

If Jawk later supports strtonum(), that can explicitly parse hexadecimal input.

Long and Integer

While we're working on this, we could stop checking the result of every arithmetic operation to see whether it is actually an integer and should be returned as a Long instead of a Double. That check is only needed when the number is converted to a String, so deferring it to that point should improve the performance of arithmetic operations. We would then have only three value types: StrNum, Double, and String (plus Uninitialized).
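As a rough illustration of deferring the integer check to string conversion, the sketch below keeps arithmetic results as plain doubles and only tests for an integral value when text is needed. The class and method names are hypothetical. The non-integral branch is a placeholder: a real implementation would apply CONVFMT or OFMT there, which this sketch does not attempt. The 1e15 bound is an arbitrary safety margin chosen for the example.

```java
// Hypothetical sketch: NumberToStringSketch and toAwkString are illustrative names.
final class NumberToStringSketch {

    /** Arithmetic stays in double; no Long/Double decision is made here. */
    static double add(double a, double b) {
        return a + b;
    }

    /** The integral check happens only when a number must become text. */
    static String toAwkString(double d) {
        // Assumption: values below 1e15 in magnitude convert to long without loss.
        if (d == Math.rint(d) && Math.abs(d) < 1e15) {
            return Long.toString((long) d);
        }
        // Placeholder: CONVFMT/OFMT formatting is out of scope for this sketch.
        return Double.toString(d);
    }

    public static void main(String[] args) {
        System.out.println(toAwkString(add(2.0, 1.0)));
        System.out.println(toAwkString(3.3));
    }
}
```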

Possible implementation approach

Consider introducing a dedicated internal representation such as StrNum, or a broader scalar type, so JRT.compare2(...) can distinguish:

  • plain String
  • input-derived numeric string / strnum
  • actual numeric values (Long, Double)

StrNum should preserve the original string, and may optionally cache its numeric value.
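A minimal sketch of such a type, and of a comparison routine that uses it, might look like the following. It is not the proposed implementation: `StrNumSketch`, its nested `StrNum` class, and the `compare` method are hypothetical stand-ins, not the real `JRT.compare2(...)`. The sketch assumes a `StrNum` is only created after the input has been validated as fully numeric, ignores uninitialized values, and uses Java's code-unit string ordering rather than locale collation.

```java
// Hypothetical sketch: none of these names exist in Jawk today.
final class StrNumSketch {

    /** Input-derived string that fully looks like a number; keeps the original text. */
    static final class StrNum {
        private final String text;
        private Double cached; // numeric value, computed on first use

        /** Assumption: the caller has already validated that text is fully numeric. */
        StrNum(String text) {
            this.text = text;
        }

        double toDouble() {
            if (cached == null) {
                cached = Double.parseDouble(text.trim());
            }
            return cached;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    private static double toDouble(Object o) {
        if (o instanceof StrNum) return ((StrNum) o).toDouble();
        return ((Number) o).doubleValue(); // Long or Double
    }

    private static String toText(Object o) {
        if (o instanceof Double) {
            double d = (Double) o;
            // Integral doubles print without a decimal point.
            if (d == Math.rint(d) && Math.abs(d) < 1e15) return Long.toString((long) d);
        }
        return o.toString();
    }

    /** Returns a negative, zero, or positive value; a plain String operand forces string comparison. */
    static int compare(Object left, Object right) {
        boolean numeric = !(left instanceof String) && !(right instanceof String);
        if (numeric) {
            double a = toDouble(left);
            double b = toDouble(right);
            return a < b ? -1 : (a > b ? 1 : 0);
        }
        return toText(left).compareTo(toText(right));
    }

    public static void main(String[] args) {
        System.out.println(compare(new StrNum("9"), 10.0) < 0); // strnum vs number: numeric
        System.out.println(compare("9", 10.0) < 0);             // string vs number: string
    }
}
```

Because the wrapper keeps the original text, printing a field such as `2e01` still yields the input as written, while comparisons use the cached numeric value.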

Acceptance criteria

  • Input-derived values that fully look numeric are tagged as strnum.
  • Input-derived values that do not fully look numeric remain plain strings.
  • Assignment preserves the attribute.
  • String operations produce plain strings.
  • Numeric operations produce numeric values and keep permissive numeric-prefix parsing.
  • Comparisons use the AWK string / number / strnum matrix.
  • Regression tests cover the examples above.
  • Issue #110 ("Conversion from string to numbers doesn't follow POSIX specification") is updated or closed when this is fixed.
