Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base64 Encoding data #661

Closed
2 tasks done
puffyqi opened this issue Feb 19, 2024 · 3 comments
Closed
2 tasks done

Base64 Encoding data #661

puffyqi opened this issue Feb 19, 2024 · 3 comments
Labels
question Further information is requested

Comments

@puffyqi
Copy link

puffyqi commented Feb 19, 2024

Documentation

  • I have consulted the documentation, and my question isn't answered there.

Nuget Package

JsonSchema.Net

Package Version

3.0.15

How can I help? Please provide as much context as possible.

I have a field in json schema that requires the string to be in base64 encoded.
I tried to put this in format:

"definitions": {
"File": {
"type": "object",
"properties": {
"subfield1": {
"type": "string",
“format”: “base64”
}
}
}
}

but when i try to validate using the playground https://json-everything.net/json-schema , it does not work correctly even if I put invalid base64 data. I have tried using "contentEncoding": "base64" also not working. Any ideas how to make it work?

Thanks

Code of Conduct

  • I agree to follow this project's Code of Conduct
@puffyqi puffyqi added the question Further information is requested label Feb 19, 2024
@elgonzo
Copy link

elgonzo commented Feb 19, 2024

The Json Schema specification does not define "base64" as a format specifier. So trying that isn't going to work unless the chosen schema validator has a custom implementation for a custom "base64" format.

On the other hand, the Json Schema spec defines "base64" as a content encoding for the contentEncoding keyword. But the contentEncoding keyword is an annotation, not an assertion. The current Json Schema validation spec goes even so far as to stipulate that schema validators shall not do contentEncoding-based parsing/decoding/validation by default. And even if a validator is doing a contentEncoding-based validation, the particulars of the validation and the returned results is left unspecified by the spec. (https://json-schema.org/draft/2020-12/draft-bhutton-json-schema-validation-01#section-8.2). So, this is also not looking too great, either...


Any ideas how to make it work?

Yes. RFC4648-conforming Base64 is a rather simple data format with a very specific and limited set of characters, arranged in 4-character tuples (https://datatracker.ietf.org/doc/html/rfc4648#section-4 ). This can be relatively easily modeled by a regex pattern (using the pattern keyword instead of format).

The case-insensitive pattern for one full non-padded 4-char tuple is basically: [a-z0-9+/]{4}.

The last 4-char tuple (or the single tuple if the decoded data is <= 3 bytes in length) however can be a full 4-char tuple or padded tuple in the form of XX== or XXX=, the regex pattern for the last tuple in the encoded Base64 string thus being (with white-spaces for legibility): ( [a-z0-9+/]{4} | [a-z0-9+/]{2}(=|[a-z0-9+/])= )

Therefore, the full regex pattern that matches non-empty Base64-strings that consist of one or more 4-character tuples:

(?i)^([a-z0-9+/]{4})*([a-z0-9+/]{4}|[a-z0-9+/]{2}(=|[a-z0-9+/])=)$

If the schema you are designing should be interoperable/compatible with various different schema validators (which is probably a good idea to follow regardless of whether you have this particular requirement right now or not), it's preferable to not use the (?i) inline-flag in the regex pattern (as it is not part of the ECMAScript/JavaScript regex dialect the Json Schema regex specification for the pattern keyword refers to). Instead, use both the a-z and A-Z ranges in the character classes to achieve case-insensitive matching, like for example so:

^([a-zA-Z0-9+/]{4})*([a-zA-Z0-9+/]{4}|[a-zA-Z0-9+/]{2}(=|[a-zA-Z0-9+/])=)$

(No guarantee about me not making a silly error here when plonking down the regex pattern here. Please do verify it yourself before putting it into production code...)

@puffyqi
Copy link
Author

puffyqi commented Feb 20, 2024

Thank you for the detailed response. The regex works fine :)

The Json Schema specification does not define "base64" as a format specifier. So trying that isn't going to work unless the chosen schema validator has a custom implementation for a custom "base64" format.

On the other hand, the Json Schema spec defines "base64" as a content encoding for the contentEncoding keyword. But the contentEncoding keyword is an annotation, not an assertion. The current Json Schema validation spec goes even so far as to stipulate that schema validators shall not do contentEncoding-based parsing/decoding/validation by default. And even if a validator is doing a contentEncoding-based validation, the particulars of the validation and the returned results is left unspecified by the spec. (https://json-schema.org/draft/2020-12/draft-bhutton-json-schema-validation-01#section-8.2). So, this is also not looking too great, either...

Any ideas how to make it work?

Yes. RFC4648-conforming Base64 is a rather simple data format with a very specific and limited set of characters, arranged in 4-character tuples (https://datatracker.ietf.org/doc/html/rfc4648#section-4 ). This can be relatively easily modeled by a regex pattern (using the pattern keyword instead of format).

The case-insensitive pattern for one full non-padded 4-char tuple is basically: [a-z0-9+/]{4}.

The last 4-char tuple (or the single tuple if the decoded data is <= 3 bytes in length) however can be a full 4-char tuple or padded tuple in the form of XX== or XXX=, the regex pattern for the last tuple in the encoded Base64 string thus being (with white-spaces for legibility): ( [a-z0-9+/]{4} | [a-z0-9+/]{2}(=|[a-z0-9+/])= )

Therefore, the full regex pattern that matches non-empty Base64-strings that consist of one or more 4-character tuples:

(?i)^([a-z0-9+/]{4})*([a-z0-9+/]{4}|[a-z0-9+/]{2}(=|[a-z0-9+/])=)$

If the schema you are designing should be interoperable/compatible with various different schema validators (which is probably a good idea to follow regardless of whether you have this particular requirement right now or not), it's preferable to not use the (?i) inline-flag in the regex pattern (as it is not part of the ECMAScript/JavaScript regex dialect the Json Schema regex specification for the pattern keyword refers to). Instead, use both the a-z and A-Z ranges in the character classes to achieve case-insensitive matching, like for example so:

^([a-zA-Z0-9+/]{4})*([a-zA-Z0-9+/]{4}|[a-zA-Z0-9+/]{2}(=|[a-zA-Z0-9+/])=)$

(No guarantee about me not making a silly error here when plonking down the regex pattern here. Please do verify it yourself before putting it into production code...)

@puffyqi puffyqi closed this as completed Feb 20, 2024
@gregsdennis
Copy link
Collaborator

Please read through the documentation. You will find that @elgonzo's response is correct.

Another solution is to create a custom format. Information for doing so can be found in the linked documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants