Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

toLeopard: Static number casting issues in equality operator + NaN leaking #123

Open
towerofnix opened this issue Dec 17, 2023 · 5 comments
Labels
bug Something isn't working compatibility Mismatch or currently unsupported language behavior fmt: Leopard Pertains to Leopard format (JavaScript)

Comments

@towerofnix
Copy link
Member

Branching from #9 (comment).

We try to use === for operator_equals where possible, but at the moment are too greedy. Casting to a number doesn't work exactly the same in Cast.compare and that's something sb-edit is currently unaware of.

We only use === in operator_equals if at least one of the arguments can be parsed as a number. Since it's impossible to represent NaN this way (parseNumber only returns true if the value is non-NaN after casting to number), we don't need to worry about (...) === NaN type comparisons (which are treated as String(...).toLowerCase() === "nan" in Scratch).

Therefore, if a block expression returns NaN, it can't be equal to the written number value. This will be useful later.

We similarly have an issue where the division operator can return NaN (zero divided by zero) and this is treated as satisfying number inputs. While this is fine for certain situations, in the context of most math operators (in Scratch), NaN is supposed to be treated as zero. Thus NaN may leak from operators which currently identify as satisfying Number, and this is a compatibility issue.

My proposal is to implement a "strict" mode for casting to number in Leopard: this.toNumber(..., true). It would report NaN for any values which can never resolve a numerical comparison (in Scratch). This isn't drastically complicated; it mostly means returning NaN for cases we already detect (and currently return 0 for). It also means specifically banning the values null and whitespace-only strings from casting to zero (see Cast.isWhiteSpace.)

Then replace InputShape.Number with three new shapes:

  • InputShape.NumberPossiblyNaN: A number value which was representable as a number in the first place or is now represented as NaN.
    • This is indirectly satisfied by NumberNotNaN.
    • Directly satisfied by the division operator (0/0 = NaN) if, and only if, both of its inputs are blocks or one of its inputs is zero.
    • Probably not directly satisfied by any other ops, but check operator_mathop for sure.
    • To cast: this.toNumber((...), true)
  • InputShape.NumberNotNaN: A number value which was representable as a number in the first place or is now represented as zero.
    • This is not indirectly satisfied by NumberPossiblyNaN.
    • Directly satisfied by most math operators. (Because math ops treat NaN as zero, NaN will immediately be "squashed" into zero if it's passed into another math op, rather than leaking. Scratch casts NaN to zero every time; we will, when we have to, by making these math ops take NumberNotNaN inputs.)
    • To cast: this.toNumber((...))
  • InputShape.LooseNumber: A number value for which 0 and NaN are dynamically treated as equal.
    • Indirectly satisfied by both NumberNotNaN and NumberPossiblyNaN.
    • Not directly satisfied by anything.
    • To cast: this.toNumber((...))

Then differentiate the desired input types for all existing uses of InputShape.Number. For example, math blocks require NumberNotNaN. Equals-comparison of numbers requires NumberPossiblyNaN. Blocks like "repeat", "wait", and "say and wait" all benefit from using LooseNumber because they will still behave correctly if provided NaN, treating that value as zero, and don't require casting more strictly like equals-comparison - casting to zero instead of NaN is fine.

We also need to adapt staticBlockInputToLiteral. Since it already takes a desired input shape, this doesn't necessitate an (internal) interface change, conveniently. At the moment values like 3ee3e3e3e33eee333 always get returned as strings, which is clearly dangerous. (These are functionally NaN being returned for the desired input shape Number!) Non-numeric values should be cast to zero or NaN adapting to the desired input shape.

In summary, we want to avoid using the casting provided by NumberPossiblyNaN everywhere except for equals-comparison; we also want to ensure that NaN never leaks out of division and into operators that are supposed to treat NaN as zero but don't currently perform any dynamic casting for any number-shaped reporters.

Although the implementation for all this is relatively trivial, it's quite a bit to reason about when ensuring Leopard number behavior matches Scratch number behavior. But these tools are necessary, and together, enable rather clean translation. In the vast majority of cases, only equals-comparison is affected; division is also affected when both operands are a block or one operand is zero, but not when one of the operands is a non-zero constant - likely a large portion of uses of division in general. And inputs which don't care whether passed NaN or zero don't get affected at all, regardless what the reporter satisfies.

@towerofnix towerofnix added bug Something isn't working compatibility Mismatch or currently unsupported language behavior fmt: Leopard Pertains to Leopard format (JavaScript) labels Dec 17, 2023
@towerofnix
Copy link
Member Author

I like NumberNotNaN as a name, but NumberPossiblyNaN is a bit strange. It makes sense from a "satisfies" angle ("block divided by block is possibly NaN") but not so much from a "desires" angle ("I'd like a number, possibly NaN, please"). It's also a little funny to have NumberPossiblyNaN be true for NumberNotNaN, even if that makes sense in terms of set theory or practice.

Possibly better: StrictlyCastNumber, ZeroCastNumber, LooselyCastNumber? This mainly has a more direct analogue in Leopard, i.e. this.toNumber(value, strict) but also at least connotates a sense of "strictness" which is counter to zero-casting. It makes NaN only implied by any of the terminology, though. Maybe it gets a better sense of what to think about when using those input shapes?

Regardless, there'll obviously be documentation in the InputShape enum clearly delineating all three.

@PullJosh
Copy link
Collaborator

I'm glad you're thinking about this.

I'm not totally clear on what the difference is between NumberPossiblyNaN vs LooseNumber. Am I correct in saying that NumberPossiblyNaN is what a block satisfies when it could return a number or NaN (such as division) and LooseNumber is what an input requires when it's okay with receiving NaNs? If so, why are they two separate things?

I'm curious what nuance I'm missing.

@towerofnix
Copy link
Member Author

@PullJosh Just to clarify, I'm going to refer to NumberPossiblyNaN as StrictlyCastNumber, because that's the name I've been using in the (so far pending/uncreated) pull request. Likewise LooseNumber is LooselyCastNumber. Those are only different names - they have the exact same meanings as NumberPossiblyNaN and LooseNumber in my first comment.

The point of these differentiations is to provide more expressive capability to the desiring input. That is, we have a baseline amount of detail that the reporter's spec (in toLeopard) has to define - whether or not it may be returning NaN - and then the desired input type is used to build on that baseline information further, adapting it (via casting) as desired.

That's the sake of input types in general, which is fairly clear, but what's less obvious is that the desired type fundamentally has more control than the satisfying one (the reporter spec).

Our general goal is to transition as smoothly as possible from a satisfying input type to the desired one. (This is why we have a concept of a "satisfying" input type in the first place - so that we don't unnecessarily cast an expression which satisfies boolean to a desired boolean, for example.)

I think the part that's most challenging to wrap your head around is that an input type is, in truth, just a keyword declaring an agreement about what value an expression could evaluate to. That's the full sum of its intrinsic meaning... but intrinsic meaning has nothing to do with actual behavior.

It's like if I wrote in a dictionary that one quality of a skilled receptionist is to be agreeable. Well, OK, that's a dictionary definition that all involved parties might agree on (let's say so!)... but that same, single definition has very different practical implications for the receptionist and the client. For the receptionist, it indicates what standard they should be looking to meet; for the client, it indicates what standard they're looking to judge by (consciously or not). The dictionary definition has useful meaning, but it doesn't actually do anything on its own. Its practical effect always depends on its interpretation by a specific party - its extrinsic meaning.

Likewise, the extrinsic meaning of "a number that might be NaN" has to be interpreted twofold: once by the satisfying reporter, once by the desiring input. For the satisfying input, there's only one interpretation: "If my expression's value is or might be NaN, then I need to declare so."

But for the desiring input, there are two interpretations. One is, "I require that you most certainly give me a number. And if it's NaN, then I require that you leave it as it is." The other is, "I require that you most certainly give me a number. But really, I don't mind what that number is. It's okay if it's NaN, but you can give me zero instead, and I'll treat that the same as NaN. Whatever works better for you."

See how all three of those interpretations are in line with a keyword that simply means "a number that might be NaN"? I think that's the reason it isn't terribly obvious that you might want to differentiate between those two desiring interpretations, especially when there's only one applicable satisfying interpretation.

I only alluded to the key detail briefly in my original post:

InputShape.[LooselyCastNumber]:
... Not directly satisfied by anything.

This belies that LooselyCastNumber is only used to express an additional detail by the desiring input. It provides absolutely no meaning for the satisfying reporter, and so no reporter spec would ever directly satisfy it.

Because the desiring input is the one which, in the end, has more expressive power, I decided to reframe the terminology in terms of how the expression gets cast, instead of what its value might be. That's how the second batch of names (which I've been using in this comment) came about.

StrictlyCastNumber and LooselyCastNumber have the exact same meaning for the satisfying reporter (so only one of them - incidentally the
first one - is used when declaring a reporter's satisfied input type). But they have different meanings for the desiring input type, as I described above.

Individual reporter specs never differentiate between a desired input type of StrictlyCastNumber or LooselyCastNumber - after all, what could they possibly glean? It isn't relevant to them.

Instead, it's the in-between casting (performed at the very bottom of blockToJS) which differentiates. Its job is expressly to bridge the satisfied input type to the desired one. It really does care if the value is desired to be cast "strictly" or "loosely", because it's the one which will enact that casting!

So at the crux of it, why do we differentiate between strict and loose casting? The answer comes from the third party in play here - the goal of the caster itself, which is to bridge those two values as smoothly as possible.

The trick is that casting a number strictly is less smooth than casting it loosely (read: to zero) - mainly, from a purely syntactical standpoint! Zero casting a number uses toNumber(expression); strict casting uses toNumber(expression, true).

(On a less "aesthetic" level - although aesthetics are mostly what count for the caster - strict casting also fundamentally conveys more information than zero casting. Most non-number values get cast to NaN in strict casting; in loose casting they are cast as zero, and so cannot be differentiated from an expression which was actually zero in the first place. This detail turns out to be absolutely critical for the equals comparison operator... but it's also highly inappropriate for other operations to concern themselves with, particularly when they don't care if the value is zero or NaN. In general, inputs shouldn't be asking for more information than they actually need.)


To boil everything down:

  1. Try to decouple the extrinsic, practical meanings of a keyword from its more inherent, intrinsic one. The way that satisfying reporters interpret a given keyword is fundamentally different from how desiring inputs interpret that same keyword.
  2. It's possible for there to be more than one interpretation (i.e. extrinsic meaning) for the same keyword by on either or both ends. In our case, the desiring input needs to be able to express different expectations for casting behavior.
  3. We need to have enough keywords to express all extrinsic satisfying meanings and all extrinsic desiring meanings. It's the job of the casting behavior to bridge arbitrary satisfying meanings to arbitrary desiring meanings, so that's where it's most likely for differentiation to actually occur.

While I think there's an intuition to this that can be understood fairly easily once you "get" it, it's all rather a bit to wrap your head around in the first place. If I gave myself permission to shake up the status quo and express this in a simpler way from the start, what might I change?

Well - personally I think the hardest parts are:

  • Mentally decoupling extrinsic meanings from the completely unstated intrinsic meanings (we use code comments to try to explain them, but that's not really what keywords mean...)
  • Realizing that keywords are literally just terms that exist to express the full set of extrinsic meanings of both sides of the conversation, and it just happens that the same keyword is usually. but not always, suitable for both sides of that conversation (...see what I mean?)
  • Bothering to memorize extrinsic meanings (or a mental system of intuiting them) in the first place, let alone the fact that one of the keywords (LooselyCastNumber) literally doesn't carry any extrinsic meaning on one side of the conversation, arbitrarily

All said, my alternate approach would be to just state extrinsic meanings on their own, explicitly, and NOT try to bundle the extrinsic meanings for satisfying and desiring together. They do both center around the same fundamental intrinsic meanings, yes (such as "a string" or "a number that might be NaN"), but it muddies the waters to try to put them together using one word for two extrinsic meanings.

I'd call those extrinsic meanings "traits", and define one set for satisfying, one for desiring:

enum DesirableTraits {
    IsBoolean,
    IsNumber,
    IsString,

    // only applicable alongside IsNumber
    IsCastToNaN,
    IsCastToZero,
    // if *neither* is desired then it's treated
    // the same as "LooselyCastNumber"
}

enum SatisfyingTraits {
    IsBoolean,
    IsNumber,
    IsString,

    // only applicable alongside IsNumber
    IsNotNaN,
    // if not satisfied then it's treated as though
    // it may be NaN
}

It would be possible - and common - to specify multiple traits, both when satisfying and when desiring. Then the caster compares desired traits with satisfied ones to figure out the smoothest way to bridge, performing the same work it currently does, just with more coherent terminology.

I think there's a simpler intuitive logic to this, but to lay it out explicitly:

  • If desiredTraits includes IsNumber:
    • If desiredTraits includes IsCastToNaN:
      • If satisfiedTraits includes IsNumber:
        • Return expression (as-is)
      • Return this.toNumber(expression, true) (NaN when appropriate)
    • If desiredTraits includes IsCastToZero:
      • If satisfiedTraits includes IsNumber:
        • If satisfiedTraits includes IsNotNaN:
          • Return expression (as-is)
        • Return this.toNumber(expression) (casts NaN to zero)
      • Return this.toNumber(expression) (never NaN)
    • If satisfiedTraits includes IsNumber:
      • Return expression (as-is)
    • Return this.toNumber(expression) (simpler syntax, less extraneous information)

It'd take a little extra work to go through block definitions with this system, but IMO it would be worth it, because this is a lot easier to parse and intuitively understand in my opinion, and is a lot more extensible for defining more kinds of desirable and satisfying traits later on, if that becomes relevant.

Sorry for the devastatingly long reply LOL. Thank you very much for your time checking out and considering all this!

@PullJosh
Copy link
Collaborator

Thank you for the novel. (Sincerely) This helped me understand the situation a lot better. I definitely prefer your method of explicitly declaring both desirable traits and satisfying traits separately, although it does feel a little funny that they almost always (but not always always) come in perfectly matching pairs.

For now, it's probably best to tack on the three slightly different variations of Number/possibly NaN values as you wrote in your original post, but if these kinds of situations keep popping up repeatedly then maybe we would want to revisit this and follow the differentiated input/output traits you described.

@towerofnix
Copy link
Member Author

You got it! Since the situation's laid out pretty clearly (including notes for revisiting), I'm not worried about putting this on the backburner and coming back to traits sometime later on.

I have a WIP branch with basically all the work done, so apart from putting together a Scratch project that actually tests all the expected behavior, it's pretty much good to go!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working compatibility Mismatch or currently unsupported language behavior fmt: Leopard Pertains to Leopard format (JavaScript)
Projects
None yet
Development

No branches or pull requests

2 participants