Base64 Encoding data #661

puffyqi · 2024-02-19T09:09:21Z

Documentation

I have consulted the documentation, and my question isn't answered there.

Nuget Package

JsonSchema.Net

Package Version

3.0.15

How can I help? Please provide as much context as possible.

I have a field in json schema that requires the string to be in base64 encoded.
I tried to put this in format:

"definitions": {
"File": {
"type": "object",
"properties": {
"subfield1": {
"type": "string",
“format”: “base64”
}
}
}
}

but when i try to validate using the playground https://json-everything.net/json-schema , it does not work correctly even if I put invalid base64 data. I have tried using "contentEncoding": "base64" also not working. Any ideas how to make it work?

Thanks

Code of Conduct

I agree to follow this project's Code of Conduct

elgonzo · 2024-02-19T15:56:20Z

The Json Schema specification does not define "base64" as a format specifier. So trying that isn't going to work unless the chosen schema validator has a custom implementation for a custom "base64" format.

On the other hand, the Json Schema spec defines "base64" as a content encoding for the contentEncoding keyword. But the contentEncoding keyword is an annotation, not an assertion. The current Json Schema validation spec goes even so far as to stipulate that schema validators shall not do contentEncoding-based parsing/decoding/validation by default. And even if a validator is doing a contentEncoding-based validation, the particulars of the validation and the returned results is left unspecified by the spec. (https://json-schema.org/draft/2020-12/draft-bhutton-json-schema-validation-01#section-8.2). So, this is also not looking too great, either...

Any ideas how to make it work?

Yes. RFC4648-conforming Base64 is a rather simple data format with a very specific and limited set of characters, arranged in 4-character tuples (https://datatracker.ietf.org/doc/html/rfc4648#section-4 ). This can be relatively easily modeled by a regex pattern (using the pattern keyword instead of format).

The case-insensitive pattern for one full non-padded 4-char tuple is basically: [a-z0-9+/]{4}.

The last 4-char tuple (or the single tuple if the decoded data is <= 3 bytes in length) however can be a full 4-char tuple or padded tuple in the form of XX== or XXX=, the regex pattern for the last tuple in the encoded Base64 string thus being (with white-spaces for legibility): ( [a-z0-9+/]{4} | [a-z0-9+/]{2}(=|[a-z0-9+/])= )

Therefore, the full regex pattern that matches non-empty Base64-strings that consist of one or more 4-character tuples:

(?i)^([a-z0-9+/]{4})*([a-z0-9+/]{4}|[a-z0-9+/]{2}(=|[a-z0-9+/])=)$

If the schema you are designing should be interoperable/compatible with various different schema validators (which is probably a good idea to follow regardless of whether you have this particular requirement right now or not), it's preferable to not use the (?i) inline-flag in the regex pattern (as it is not part of the ECMAScript/JavaScript regex dialect the Json Schema regex specification for the pattern keyword refers to). Instead, use both the a-z and A-Z ranges in the character classes to achieve case-insensitive matching, like for example so:

^([a-zA-Z0-9+/]{4})*([a-zA-Z0-9+/]{4}|[a-zA-Z0-9+/]{2}(=|[a-zA-Z0-9+/])=)$

(No guarantee about me not making a silly error here when plonking down the regex pattern here. Please do verify it yourself before putting it into production code...)

puffyqi · 2024-02-20T07:19:10Z

Thank you for the detailed response. The regex works fine :)

The Json Schema specification does not define "base64" as a format specifier. So trying that isn't going to work unless the chosen schema validator has a custom implementation for a custom "base64" format.

On the other hand, the Json Schema spec defines "base64" as a content encoding for the contentEncoding keyword. But the contentEncoding keyword is an annotation, not an assertion. The current Json Schema validation spec goes even so far as to stipulate that schema validators shall not do contentEncoding-based parsing/decoding/validation by default. And even if a validator is doing a contentEncoding-based validation, the particulars of the validation and the returned results is left unspecified by the spec. (https://json-schema.org/draft/2020-12/draft-bhutton-json-schema-validation-01#section-8.2). So, this is also not looking too great, either...

Any ideas how to make it work?

Yes. RFC4648-conforming Base64 is a rather simple data format with a very specific and limited set of characters, arranged in 4-character tuples (https://datatracker.ietf.org/doc/html/rfc4648#section-4 ). This can be relatively easily modeled by a regex pattern (using the pattern keyword instead of format).

The case-insensitive pattern for one full non-padded 4-char tuple is basically: [a-z0-9+/]{4}.

The last 4-char tuple (or the single tuple if the decoded data is <= 3 bytes in length) however can be a full 4-char tuple or padded tuple in the form of XX== or XXX=, the regex pattern for the last tuple in the encoded Base64 string thus being (with white-spaces for legibility): ( [a-z0-9+/]{4} | [a-z0-9+/]{2}(=|[a-z0-9+/])= )

Therefore, the full regex pattern that matches non-empty Base64-strings that consist of one or more 4-character tuples:
(?i)^([a-z0-9+/]{4})*([a-z0-9+/]{4}|[a-z0-9+/]{2}(=|[a-z0-9+/])=)$
If the schema you are designing should be interoperable/compatible with various different schema validators (which is probably a good idea to follow regardless of whether you have this particular requirement right now or not), it's preferable to not use the (?i) inline-flag in the regex pattern (as it is not part of the ECMAScript/JavaScript regex dialect the Json Schema regex specification for the pattern keyword refers to). Instead, use both the a-z and A-Z ranges in the character classes to achieve case-insensitive matching, like for example so:
^([a-zA-Z0-9+/]{4})*([a-zA-Z0-9+/]{4}|[a-zA-Z0-9+/]{2}(=|[a-zA-Z0-9+/])=)$
(No guarantee about me not making a silly error here when plonking down the regex pattern here. Please do verify it yourself before putting it into production code...)

gregsdennis · 2024-02-20T07:42:05Z

Please read through the documentation. You will find that @elgonzo's response is correct.

Another solution is to create a custom format. Information for doing so can be found in the linked documentation.

puffyqi added the question Further information is requested label Feb 19, 2024

puffyqi closed this as completed Feb 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Base64 Encoding data #661

Base64 Encoding data #661

puffyqi commented Feb 19, 2024

elgonzo commented Feb 19, 2024 •

edited

Loading

puffyqi commented Feb 20, 2024

gregsdennis commented Feb 20, 2024

Base64 Encoding data #661

Base64 Encoding data #661

Comments

puffyqi commented Feb 19, 2024

Documentation

Nuget Package

Package Version

How can I help? Please provide as much context as possible.

Code of Conduct

elgonzo commented Feb 19, 2024 • edited Loading

puffyqi commented Feb 20, 2024

gregsdennis commented Feb 20, 2024

elgonzo commented Feb 19, 2024 •

edited

Loading