RFC: Custom data validators in schemas #797

nvie · 2023-04-04T14:47:38Z

Motivation

When designing a schema for a data format, it is often necessary to perform specific validations on the data, beyond what is possible with the basic string or number types. For example, a schema may need to validate that a given field is an integer, or that another field contains a valid email address. In such cases, it is common to define custom validation functions or use regular expressions to perform these validations.

However, defining custom validation functions can be time-consuming, error-prone, and may result in non-standard schemas. Using regular expressions can also be complex and may not cover all possible cases. Moreover, it can be challenging to share these validation functions or regexes across multiple schemas or projects.

To address these issues, this proposal suggests the addition of a selected set of globally available and well-known types that can be freely used in the schema language to perform more specific validations. These types could include commonly used types such as Int, PositiveInt, Email, and regexes.

Proposal

This proposal suggests the addition of a set of built-in types that can be used in the schema language to perform more specific validations. These types would be globally available and well-known, and could include the following:

Positive: a non-negative number (>= 0)
Int: an integer number
PositiveInt: a non-negative integer number (>= 0)
Email: a valid email address
ISO8601¹: a valid ISO-formatted date string

An example:

type Storage {
  age: Positive
  counter: Int
  numVisits: PositiveInt
  email: Email
  createdAt: ISO8601
}

Users would be able to use those types as if they were built-in types, but they are part of a standard library of "pluggable types" that we would offer and document.
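
To make the idea of a "standard library of pluggable types" more concrete, here is a minimal TypeScript sketch of what the runtime predicates might look like. The `validators` object and the specific checks are assumptions for illustration only: the email and ISO 8601 patterns in particular are deliberately loose and are not the checks Liveblocks would necessarily ship.

```typescript
// Hypothetical runtime predicates for the proposed pluggable types.
const isNumber = (v: unknown): v is number =>
  typeof v === "number" && Number.isFinite(v);

const validators = {
  // The proposal defines Positive as >= 0, so zero is accepted.
  Positive: (v: unknown): boolean => isNumber(v) && v >= 0,
  Int: (v: unknown): boolean => isNumber(v) && Number.isInteger(v),
  PositiveInt: (v: unknown): boolean =>
    isNumber(v) && Number.isInteger(v) && v >= 0,
  // Loose shape check only: something@something.tld without whitespace.
  Email: (v: unknown): boolean =>
    typeof v === "string" && /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(v),
  // Loose shape check for strings like 2023-04-04T14:47:38Z,
  // followed by a Date.parse sanity check.
  ISO8601: (v: unknown): boolean =>
    typeof v === "string" &&
    /^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$/.test(v) &&
    !Number.isNaN(Date.parse(v)),
};
```

Each predicate only answers "is this value acceptable?", so the same table could back both the built-in aliases and any later additions.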

Precedence

True built-ins like string, number, etc. cannot be overridden in the language, i.e. the following document would be invalid:

type Storage { x: number }
type number { }
#    ^^^^^^ ❌ Cannot redefine a built-in type

But for these new data validation types, we may need to be a bit more flexible. It may be a bit of a contrived example, but suppose someone has already defined this schema:

type Storage { x: Phone }
type Phone { type: "iPhone" | "Pixel", model: string, ... }

And then at one point in the future we would introduce a Phone validator, for validating phone numbers. The question now is: how do we interpret this old, already-existing, and working schema? In this case, for this document, Phone should mean the locally-defined object type. But in a document without a type Phone { ... } definition, it would mean the custom data validator.

In other words: these data validators should be allowed to be redefined in a schema document, in which case you can no longer refer to the custom validator that the Liveblocks runtime provides.
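
The precedence rule could be sketched as a small name-resolution function. This is only an illustration: `resolveTypeName`, `assertCanDefine`, and the contents of `BUILTINS` are hypothetical names and values, not part of the existing language implementation.

```typescript
// Hypothetical sketch of the lookup order described above.
type Resolution =
  | { kind: "builtin"; name: string }
  | { kind: "local"; name: string }
  | { kind: "validator"; name: string }
  | { kind: "unknown"; name: string };

// Illustrative subset of true built-ins; the real list lives in the language.
const BUILTINS: ReadonlySet<string> = new Set(["string", "number"]);

// True built-ins can never be redefined by a schema document.
function assertCanDefine(name: string): void {
  if (BUILTINS.has(name)) {
    throw new Error(`Cannot redefine a built-in type '${name}'`);
  }
}

function resolveTypeName(
  name: string,
  localTypes: ReadonlySet<string>,
  validators: ReadonlySet<string>
): Resolution {
  if (BUILTINS.has(name)) return { kind: "builtin", name };
  // A locally defined type shadows a data validator of the same name.
  if (localTypes.has(name)) return { kind: "local", name };
  if (validators.has(name)) return { kind: "validator", name };
  return { kind: "unknown", name };
}
```

With this order, an existing schema that defines its own `Phone` keeps working after a `Phone` validator is introduced, while a schema without that definition picks up the validator.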

Regex literals

type Storage {
  sha: /^[0-9a-f]{7,40}$/i
  postalCode: /^\d{5}$/
}

A special case in the syntax of the language would be reserved for regexes, because those would have to be parameterized, so the proposal is to allow using regex literals directly in type positions, like in the postalCode field in the example.

What is the expected behavior if the regex anchors are not present? That is, how should /\d{5}/ behave compared to /^\d{5}$/? Is /\d{5}/ expected to match "abc123456xyz"? It will in JavaScript.
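
For reference, this is how plain JavaScript regexes behave today, together with one possible policy for the schema language. The `matchesWhole` helper is hypothetical and the "always match the whole value" policy is an assumption, not something this RFC has decided.

```typescript
const unanchored = /\d{5}/;
const anchored = /^\d{5}$/;

// Unanchored patterns match anywhere inside the string.
const unanchoredHit = unanchored.test("abc123456xyz"); // true ("12345" is found)
const anchoredMiss = anchored.test("abc123456xyz"); // false
const anchoredHit = anchored.test("12345"); // true

// One possible policy: implicitly anchor every regex literal in a schema,
// so that a pattern always has to match the entire value.
function matchesWhole(pattern: RegExp, value: string): boolean {
  const whole = new RegExp(`^(?:${pattern.source})$`, pattern.flags);
  return whole.test(value);
}
```

Under that policy, /\d{5}/ and /^\d{5}$/ would accept exactly the same values, at the cost of diverging from what JavaScript developers may expect.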

Warning: Allowing free-form regexes will make the room using the schema vulnerable to ReDoS. The impact is scoped to the individual room and does not extend to the rest of the Liveblocks infrastructure. If developers use inefficient regexes, schema validation in their rooms will be slower. This may be good to call out in documentation.

Alternatively, we can use a library like recheck to check if the regex pattern is potentially unsafe, and disallow those patterns when creating the schema for the first time. (We would not do this check again every time we load the schema.)

Allowing ranges for numeric types

A common case is to allow numeric values within certain ranges, e.g. a number between 0 and 100, or -100 to 100. It may be possible to allow such ranges on any numeric type, i.e. on number, Int, Positive, etc.

The proposal is to allow optional range specifiers, that look like:

type Storage {
  minus3To19Inclusive: number[-3..19]       # -3, -2.999, -1, 0, ..., 18.999, 19 👍
  atLeastThree: number[3..]                 # 3, 3.001, 4, ..., 999999999, ... 👍
  moreThanThree: number[>3..]               # 3.001, 4, ..., 999999999, ... 👍
  percentage: number[0..100]                # 0, 1, 3.0001, ..., 99.999, 100 👍
  ratio: number[0..<1]                      # 0, 0.0001, 0.5, 0.8, 0.999 👍
  percentageButNotHundred: number[0..<100]  # 0, 1, 3.0001, ..., 99.999 👍
  empty: number[3..1]                       # <nothing>
}

The syntax is:

<numeric type> "[" ( ">"? <minbound> )? ".." ( "<"? <maxbound> )? "]"

Either <minbound> or <maxbound> must be provided, or both. The range [..] is not valid syntax. (That would be the default range.)
The <minbound> may be prefixed with a > sign to exclude the min bound from the range.
The <maxbound> may be prefixed with a < sign to exclude the max bound from the range.
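
The grammar and the three rules above could be implemented roughly as follows. This is a sketch only: `parseRange`, `inRange`, and the `NumericRange` shape are hypothetical names, and the number format accepted for bounds (optional sign, optional decimals) is an assumption.

```typescript
interface NumericRange {
  min: number | undefined;
  max: number | undefined;
  minExclusive: boolean; // true when the min bound is written as ">min"
  maxExclusive: boolean; // true when the max bound is written as "<max"
}

// "[" ( ">"? <minbound> )? ".." ( "<"? <maxbound> )? "]"
const RANGE_RE =
  /^\[(?:(>)?(-?\d+(?:\.\d+)?))?\.\.(?:(<)?(-?\d+(?:\.\d+)?))?\]$/;

function parseRange(text: string): NumericRange {
  const m = RANGE_RE.exec(text);
  if (m === null) throw new Error(`Invalid range syntax: ${text}`);
  const [, gt, minText, lt, maxText] = m;
  // At least one bound is required, so "[..]" is rejected.
  if (minText === undefined && maxText === undefined) {
    throw new Error("The range [..] is not valid syntax");
  }
  return {
    min: minText === undefined ? undefined : Number(minText),
    max: maxText === undefined ? undefined : Number(maxText),
    minExclusive: gt !== undefined,
    maxExclusive: lt !== undefined,
  };
}

function inRange(value: number, r: NumericRange): boolean {
  if (r.min !== undefined && (r.minExclusive ? value <= r.min : value < r.min)) {
    return false;
  }
  if (r.max !== undefined && (r.maxExclusive ? value >= r.max : value > r.max)) {
    return false;
  }
  return true;
}
```

Note that a range such as [3..1] parses fine and simply accepts no value, matching the "empty" example above.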

Allowing this syntax for all numeric types has the benefit that it works even for custom data validators, like Int or PositiveInt:

type Storage {
  homepageGame: Int[-2..2]     # -2, -1, 0, 1, 2 (but not -1.34)
  example: PositiveInt[-2..2]  # 0, 1, 2  (the lower bound of -2 has no effect here)
}

Note: with this syntax, we don't need a specific type named Positive or PositiveInt. You could use number[0..] and Int[0..] respectively, which would be exactly the same.

Reject, don't clamp

Suppose you have this schema:

type Storage {
  age: Int
}

If a client does root.set('age', 33.5), then we will reject this message. We will not automatically round numbers to the nearest integer, because there can be cases where automatically changing data would have pretty terrible consequences depending on the app.
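
In code, "reject, don't clamp" means the decoder returns an error instead of a corrected value. A minimal sketch, where `decodeInt` and the `DecodeResult` shape are hypothetical names used only for illustration:

```typescript
type DecodeResult =
  | { ok: true; value: number }
  | { ok: false; error: string };

function decodeInt(value: unknown): DecodeResult {
  if (typeof value === "number" && Number.isInteger(value)) {
    return { ok: true, value };
  }
  // No Math.round() here: the value is rejected, never silently altered.
  return { ok: false, error: `Expected an Int, got ${JSON.stringify(value)}` };
}

const accepted = decodeInt(33); // { ok: true, value: 33 }
const rejected = decodeInt(33.5); // { ok: false, error: "Expected an Int, got 33.5" }
```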

Ranges on string types

Range syntax can also be useful on strings, to express a minimum or maximum string length. The syntax is similar:

type Storage {
  max30: string[0..30]
  alsoMax30: string[..30]   # same as max30
  min7: string[7..]
  between7and10: string[7..10]
}

The difference with numeric ranges is that on string types, range bounds must be non-negative integers, and there is no "exclusion" operator (i.e. you cannot do string[>3..<8], only string[4..7]).
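
The stricter rules for string ranges might look like this in code. The helper names are hypothetical, and two details are assumptions that the RFC does not settle: that [..] is rejected just as for numeric ranges, and that length is measured in UTF-16 code units (JavaScript's `String.prototype.length`).

```typescript
interface LengthRange {
  min: number;
  max: number;
}

// Only non-negative integer bounds, and no ">" or "<" exclusion markers.
const LENGTH_RANGE_RE = /^\[(\d+)?\.\.(\d+)?\]$/;

function parseLengthRange(text: string): LengthRange {
  const m = LENGTH_RANGE_RE.exec(text);
  if (m === null) throw new Error(`Invalid string range: ${text}`);
  const [, minText, maxText] = m;
  if (minText === undefined && maxText === undefined) {
    throw new Error("The range [..] is not valid syntax");
  }
  return {
    min: minText === undefined ? 0 : Number(minText),
    max: maxText === undefined ? Number.POSITIVE_INFINITY : Number(maxText),
  };
}

function hasLengthIn(value: string, r: LengthRange): boolean {
  // Assumption: length counts UTF-16 code units, not user-perceived characters.
  return value.length >= r.min && value.length <= r.max;
}
```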

Like with numeric types, this syntax would automatically also work on all string-like types, although it's questionable how useful it would be:

type Storage {
  emailMax30: Email[..30]      # only short email addresses?
  url: URL[30..]               # only long URLs?
  nonEmptyString: string[1..]  # non-empty string
}

Not super useful for these cases, but… it would be a free side-benefit from this design ¯\_(ツ)_/¯

Implementation details

An interesting implementation detail is in which part of the system to implement these types. Ultimately, a type like Email is a runtime validator, so intuitively it may belong to our private backend repo.

However, the language itself has to know whether Email is going to be a string-like type or a number-like type.

For the language (= parser + checker), it's important to know that Email is a string-like type, because it must be able to reject a union like this:

a: string | Email   # ❌ Type 'Email' cannot appear in a union with 'string'
a: "hey" | Email    # ❌ Type 'Email' cannot appear in a union with '"hey"'
a: number | Email   # ✅
a: 42 | Email       # ✅

Similarly, it must know which types are number-like types:

a: Int | number    # ❌ Type 'number' cannot appear in a union with 'Int'
a: Int | 42        # ❌ Type '42' cannot appear in a union with 'Int'
a: Int | string    # ✅
a: Int | "hey"     # ✅

It will depend on the runtime implementation (in our private backend repo) what meaning is given to these global types, but the language will have to know upfront what sort of data those pluggable parts will produce.
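
The union rules shown above could be checked along these lines. This is a simplified sketch that works on type names as plain strings rather than on real AST nodes. `checkUnion`, `baseKind`, and `CustomTypes` are hypothetical names, and the choice to only flag conflicts that involve at least one custom type is an assumption.

```typescript
interface CustomTypes {
  numberLike: string[];
  stringLike: string[];
}

type BaseKind = "string" | "number" | "other";

function isCustom(member: string, custom: CustomTypes): boolean {
  return custom.numberLike.includes(member) || custom.stringLike.includes(member);
}

function baseKind(member: string, custom: CustomTypes): BaseKind {
  // string, string literals like "hey", and string-like validators
  if (member === "string" || /^".*"$/.test(member) || custom.stringLike.includes(member)) {
    return "string";
  }
  // number, number literals like 42, and number-like validators
  if (member === "number" || /^-?\d+(\.\d+)?$/.test(member) || custom.numberLike.includes(member)) {
    return "number";
  }
  return "other";
}

// Returns an error message for the first conflict found, or null if the union is fine.
function checkUnion(members: string[], custom: CustomTypes): string | null {
  const seen: string[] = [];
  for (const member of members) {
    const kind = baseKind(member, custom);
    for (const earlier of seen) {
      const involvesCustom = isCustom(member, custom) || isCustom(earlier, custom);
      if (kind !== "other" && involvesCustom && baseKind(earlier, custom) === kind) {
        return `Type '${member}' cannot appear in a union with '${earlier}'`;
      }
    }
    seen.push(member);
  }
  return null;
}
```

The checker never needs to know what Email or Int actually validate, only which base kind of data they produce.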

Therefore, this proposal suggests making this configuration part of the parse() call, as an argument, passing the knowledge about these types down to the language.

import { parse } from "@liveblocks/schema";

parse(
  text,
  {
    custom: {
      numberLike: ["Int", "PositiveInt"],
      stringLike: ["Email", "URL"],
    }
  }
)

With this configuration, the parser and type checker will have enough knowledge to correctly interpret the unknown global types that may be found in a schema text, while not having to know much else about them.

Take this schema text as an example:

type Storage {
  foo: Foo | number | null
}

Then:

// ❌ Unknown type Foo on line 2
parse(text)

And:

// ❌ Type 'number' cannot appear in a union with type 'Foo'
parse(text, { custom: { numberLike: ["Foo"] } })

But:

// ✅
parse(text, { custom: { stringLike: ["Foo"] } })

The parser will not return a StringType AST node for that Foo instance, but a StringLikeType AST node, which will carry the alias "Foo" as payload. The schema validation runtime then has all the knowledge to know how to interpret that type, and to build a decoder for it that performs the adequate validation.
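
On the runtime side, turning such a node into a decoder could look roughly like this. Only the names StringType and StringLikeType come from this proposal. The node shapes (`_kind`, `alias`), the `buildDecoder` function, and the `stringLikeValidators` registry are assumptions made for this sketch.

```typescript
// Hypothetical AST node shapes for the two string cases.
type TypeNode =
  | { _kind: "StringType" }
  | { _kind: "StringLikeType"; alias: string };

type Decoder = (value: unknown) => boolean;

// Runtime-side registry, keyed by the alias carried on the AST node.
// The email check is deliberately loose and purely illustrative.
const stringLikeValidators: Record<string, (s: string) => boolean> = {
  Email: (s) => /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(s),
};

function buildDecoder(node: TypeNode): Decoder {
  switch (node._kind) {
    case "StringType":
      return (v) => typeof v === "string";
    case "StringLikeType": {
      const check = stringLikeValidators[node.alias];
      if (check === undefined) {
        throw new Error(`No runtime validator registered for '${node.alias}'`);
      }
      // A string-like value must be a string first, then pass the custom check.
      return (v) => typeof v === "string" && check(v);
    }
  }
}
```

This keeps the split described above: the language only records the alias, and the runtime decides what the alias means.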

¹ Deliberately not using the Date type here, because it might suggest that JS Date instances would go to/from the server, which is not the case. These are strings that would be in the ISO8601 format.


GuillaumeSalles · 2023-04-05T00:07:46Z

Clean proposal! These validators are definitely useful!

Positive: a positive number (>= 0)

Int: an integer number

PositiveInt: a positive integer number

Email: a valid email address

ISO8601¹: a valid ISO-formatted date string

Can Int, PositiveInt, Email conflict with user defined types? Does lowercasing these types solve this issue?

Regex literals

Regexes can introduce security issues (vulnerability to DoS attacks: 1, 2). I doubt this will have any practical impact. Our infra will not be impacted because every room is sandboxed. But maybe we should let the developer know of the risks? 🤔

Allowing ranges for numeric types

I like the syntax!

Clamping or rejecting?

Definitely rejecting. Schema validation should never modify the behavior of an incoming operation.

Implementation details

As discussed earlier, I'm not sure the custom param is necessary here, but if you believe it creates a better separation of concerns, I'll trust you on this :)

nvie · 2023-04-05T07:19:47Z

Can Int, PositiveInt, Email conflict with user defined types?

Great point, I forgot to mention that in the proposal. I've added the Precedence section.

Does lowercasing these types solve this issue?

It would avoid the potential name clashes. But I personally think it would look unnecessarily alien and unfamiliar to users. Example:

type Storage {
  name: string
  age: positiveint
  email: email
  list: LiveList<int[0..10] | iso8601>
}

I think using capital casing here is more common and familiar and communicates that these are "just types", except that they happen to be defined elsewhere, and not in this document. But I fully admit it's a matter of taste.

Regexes can introduce security issues (vulnerability to DoS attacks: 1, 2). I doubt this will have any practical impact. Our infra will not be impacted because every room is sandboxed. But maybe we should let the developer know of the risks? 🤔

Great point, I've added a warning callout about it.

Definitely rejecting. Schema validation should never modify the behavior of an incoming operation.

Thanks for confirming my hunch 🙏 ! I've removed the open question from the document and replaced it with this decision.

nvie added the "feature request" label Apr 4, 2023

nvie self-assigned this Apr 4, 2023

nvie changed the title from "RFC: Custom data validators" to "RFC: Custom data validators in schemas" Apr 4, 2023

nvie mentioned this issue Apr 4, 2023

RFC: Custom data validators liveblocks/liveblocks-schema#37

Closed
