Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: create the jauthchk tool - check on the contents of an .author.json file #69

Closed
lcn2 opened this issue Feb 11, 2022 · 288 comments
Assignees
Labels
invalid This doesn't seem right

Comments

@lcn2
Copy link
Contributor

lcn2 commented Feb 11, 2022

We need to create the jauthchk tool in order to help verify that contents of an .author.json file found within an entry directory.

This tool will primarily be used by other tools (not humans). As such it should behave like fnamchk in that if all is well, it should not print anything and simply exit 0. If there are problems found with the .author.json file, then warning messages should be printed to stderr AND the jauthchk tool should exit with a non-zero status. The use of a -v level may be use to assist in debugging.

The jauthchk tool is primarily a stand alone tool. As a sanity check, the mkiocccentry program should execute the jauthchk code AFTER .author.json file has been created and before the compressed tarball is formed. If mkiocccentry program sees a 0 exit status, then all is well. For a non-zero exit code, the tool probably should abort because any problems detected by jauthchk based on what mkiocccentry wrote into .author.json indicates there is a serious mismatch between what mkiocccentry is doing and what jauthchk expects.

The following might be how mkiocccentry output is changed with the use of this tool (and the other tool):

Is the above list a correct list of files in your entry? [yn]: y

Checking the format of .info.json ...
... all appears well with the .info.json file.

Checking the format of .author.json ...
... all appears well with the .author.json file.

About to run the tar command to form the compressed tarball ...

As a stand alone tool, the jauthchk tool will be invoked by other tools as part of the IOCCC submission process. That process is beyond the scope of this repo. Suffice it to sat the the IOCCC judges will use this tool is part of their submission workflow.

Here is a possible command line usage message:

jauthchk [-h] [-v level] [-V] file

	-h			print help message and exit 0
	-v level		set verbosity level: (def level: 0)
	-V			print version string and exit

        file                    path to a .author.json file

exit codes:

        0                       no errors or warnings detected
        >0                      some error(s) and/or warning(s) were detected

NOTE: We mention file above even though the canonical filename will be .author.json. The tool should NOT check, nor object to using a different filename.

The mkiocccentry tool will need to invoke this tool. As such a similar method used to find and specify the location of txzchk should be used. As this tool is one of 2 tools being considered, we recommend the following of added to the mkiocccentry command line:

	-j /path/to/jinfochk	path to jinfochk executable used by txzchk (def: ./jinfochk)
	-J /path/to/jauthchk	path to jauthchk executable used by txzchk (def: ./jauthchk)

IMPORTANT: While it might be tempting to consider depending on some general JSON checker, we do NOT need nor want that. It is important that the mkiocccentry GitHub repo remain stand alone. I.e., all the code needed by someone wishing to enter the IOCCC (beside a C compiler, make, tar, cp, ls) should found in this GitHub repo alone. As there is NO standard JSON tool in widespread distribution the all of the code for this tool needs to reside in this repo only.

IMPORTANT: We do not need a general JSON format checker. We only need to verify that the file contains the JSON needed and only the JSON needed for the judges to process IOCCC entries.

While is it NOT recommended, if someone wishes to edit their .author.json and re-create the compressed tarball we cannot stop them. As such mkiocccentry should be STRICT on what is writes into .author.json AND jauthchk should be permissive (but not to a fault) in what is considers as OK.

This tool should neither generate an error, nor warn if someone were to reformat the JSON. And as JSON is not order dependent, of someone wishes to reorder the JSON elements, that is fine. As long as all the requirement JSON elements are present, and no new JSON elements are found, and the version string matches, all is OK.

Should something go wrong and a change to the JSON is required during an open IOCCC, the judges will preserve the older JSON check tools and use those against older JSON formats. This there is no need for a >= version check: a version string match seems good enough.

See the a followup comment for details on the checks needed against an .author.json file.

@lcn2 lcn2 added enhancement New feature or request help wanted Extra attention is needed labels Feb 11, 2022
@lcn2
Copy link
Contributor Author

lcn2 commented Feb 11, 2022

The following is a guide as to what needs to be checked in an .author.json file. The values found below are simply examples.

To recap an import point made above:

IMPORTANT: We do not need a general JSON format checker. We only need to verify that the file contains the JSON needed and only the JSON needed for the judges to process IOCCC entries.

While is it NOT recommended, if someone wishes to edit their .author.json and re-create the compressed tarball we cannot stop them. As such mkiocccentry should be STRICT on what is writes into .author.json AND jinfochk should be permissive (but not to a fault) in what is considers as OK.

This tool should neither generate an error, nor warn if someone were to reformat the JSON. And as JSON is not order dependent, of someone wishes to reorder the JSON elements, that is fine. As long as all the requirement JSON elements are present, and no new JSON elements are found, and the version string matches, all is OK.

See www.json.org for details of the JSON format.

{

All files must start with a {.

	"IOCCC_info_version" : "1.7 2022-02-04",

The IOCCC_info_version value MUST match INFO_VERSION from limit_ioccc.h.

	"ioccc_contest" : "IOCCC28",

The ioccc_contest value MUST match IOCCC_CONTEST from limit_ioccc.h.

	"ioccc_year" : 2022,

The ioccc_year value MUST match IOCCC_YEAR from limit_ioccc.h.

	"mkiocccentry_version" : "0.35 2022-02-07",

The mkiocccentry_version value MUST match MKIOCCCENTRY_VERSION from limit_ioccc.h.

	"iocccsize_version" : "28.7 2022-02-01",

The iocccsize_version value MUST match IOCCCSIZE_VERSION from limit_ioccc.h.

	"IOCCC_contest_id" : "test",

The IOCCC_contest_id value MUST be either test or a valid Contest ID (a UUID with version UUID_VERSION and variant UUID_VARIANT as defined in limit_ioccc.h`).

	"entry_num" : 0,

The entry_num value must be >= 0 and <= MAX_ENTRY_NUM as defined in limit_ioccc.h.

	"authors" : [
		{
			"name" : "author name",
			"location_code" : "CC",
			"location_name" : "Cocos (Keeling) Islands (the)",
			"email" : "test@example.com",
			"url" : "https:\/\/example.com\/index.html",
			"twitter" : "@twitter",
			"github" : "@github",
			"affiliation" : "an affiliation",
			"winner_handle" : "author-last",
			"author_number" : 0
		}
	],

The authors must be a JSON array. In that array there MUST be exactly one of:

  • name
  • location_code
  • location_name
  • email
  • url
  • twitter
  • github
  • affiliation
  • winner_handle
  • author_number

The values conform with the restrictions imposed by the get_author_info() function from mkiocccentry.c.

NOTE: When the user declined to enter a value (where permitted by the get_author_info() function), the value will be the JSON special value null.

NOTE: Due to a design flaw of the JSON spec, the last value of the authors array cannot be followed by a "," (ASCII comma). All value except the last value of the manifest array must be followed by a "," (ASCII comma).

	"formed_timestamp" : 1644618833,

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h.

	"formed_timestamp_usec" : 631668,

The formed_timestamp_usec value must be an integer >= 0 and <= 999999.

	"timestamp_epoch" : "Thr Jan  1 00:00:00 1970 UTC",

The timestamp_epoch value must match TIMESTAMP_EPOCH as defined in limit_ioccc.h.

	"min_timestamp" : 1643987926

The min_timestamp value must match MIN_TIMESTAMP as defined in limit_ioccc.h.

	"formed_UTC" : "Fri Feb 11 23:09:42 2022 UTC"

The formed_UTCvalue must be a date string in the same format as the output of the following date command:

date "+%a %b %d %H:%M:%S %Y UTC"
}

All files must end with a }.

To be valid, the .author.json file must have exactly one of each of the above mentioned JSON "string: : "value" (or JSON array in the case of manifest). No other JSON elements are allowed.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

Comments, suggestions, corrections and clarifications for the above long comment are welcome.

We recommend that you copy only the relevant parts of the long comment when you do. :-)

If/where needed, we will attempt to modify the long comments in place above, where and when possible.

@xexyl
Copy link
Contributor

xexyl commented Feb 12, 2022

See www.json.org for details of the JSON format.

This will be helpful to me. Thanks.

{

All files must start with a {.

That's easy enough to check.

	"IOCCC_info_version" : "1.7 2022-02-04",

The IOCCC_info_version value MUST match INFO_VERSION from limit_ioccc.h.

	"ioccc_contest" : "IOCCC28",

The ioccc_contest value MUST match IOCCC_CONTEST from limit_ioccc.h.

For the above two we can just use strtok_r() on : and then when finding the right first field do a strcmp() on the defines.

	"ioccc_year" : 2022,

The ioccc_year value MUST match IOCCC_YEAR from limit_ioccc.h.

In this case strtol() can be called and do a simple comparison. Should it be an error if someone were to change it to be:

"ioccc_year" : "2022",

I.e. add quotes or something else. That is a general question for all the other fields and the other file as well: if the user changes it so that it's not the right type (for example - int to string) should this be an error in validation OR should the quotes be ignored?

Since you're only after if the right fields (with the right values) are there is it okay if the correct fields are there at the same time as the format being wrong (by format being wrong I mean not validly JSON)?

	"IOCCC_contest_id" : "test",

The IOCCC_contest_id value MUST be either test or a valid Contest ID (a UUID with version UUID_VERSION and variant UUID_VARIANT as defined in limit_ioccc.h`).

As for this one: how do you propose this is tested? "test" is easy enough but what about the rest? Should it do something like fnamchk does?

What might be helpful here: can you provide an actual UUID string that might be valid in the contest so that testing this tool can be easier?

	"authors" : [
		{
			"name" : "author name",
			"location_code" : "CC",
			"location_name" : "Cocos (Keeling) Islands (the)",
			"email" : "test@example.com",
			"url" : "https:\/\/example.com\/index.html",
			"twitter" : "@twitter",
			"github" : "@github",
			"affiliation" : "an affiliation",
			"author_number" : 0
		}
	],

The authors must be a JSON array. In that array there MUST be exactly one of:

Here this makes me think that the format of the JSON file does in fact matter - since you say it must be an array and the fields must be exactly the below. Does this hold everywhere else too?

  • name
  • location_code
  • location_name
  • email
  • url
  • twitter
  • github
  • affiliation
  • author_number

The values conform with the restrictions imposed by the get_author_info() function from mkiocccentry.c.

That should be helpful.

NOTE: When the user declined to enter a value (where permitted by the get_author_info() function), the value will be the JSON special value null.

Good to know.

NOTE: Due to a design flaw of the JSON spec, the last value of the authors array cannot be followed by a "," (ASCII comma). All value except the last value of the manifest array must be followed by a "," (ASCII comma).

That's interesting bit of information. Thanks for clarifying this.

	"formed_timestamp" : 1644618833,

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h.

Which integer type do you suggest? time_t?

	"formed_UTC" : "Fri Feb 11 23:09:42 2022 UTC"

The formed_UTCvalue must be a date string in the same format as the output of the following date command:

date "+%a %b %d %H:%M:%S %Y UTC"
}

Do you have a recommended way to go about this? The first thing that popped into my head is strptime() but I'm not sure if that's the best/better way to go about it.

--

These comments apply to this tool as well as what I'll say about the next one (if any comments/questions):

Hopefully these questions can start a good conversation on it and help bring clarity. Although I started this I'm not sure if I can finish this one. I might be able to but I think I'll have to take a break for a few days and maybe return to it in the middle of next week. Really what it comes down to is learning a bit about JSON (after your replies).

That being said I might start writing a parser that simply separates the field name and value into a list (I don't mean a linked list necessarily). I don't think that'll be today though: I'd rather get your feedback first and then process it.

It's possible tomorrow morning I can work on this a bit but if not there's no other time I can tomorrow. Monday I should be able to do a bit of it but I might be away a while since it is after all my 40th birthday (that's also - as I said - why I'll be gone most of tomorrow).

@xexyl
Copy link
Contributor

xexyl commented Feb 12, 2022

Something occurred to me that you did not address: should certain characters be verified that they're escaped? For example should the URL have the /s escaped like it is printed out by mkiocccentry?

Also should the mkiocccentry tool check for characters like \ in URLs? (Maybe it already does and I don't recall).

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

In this case strtol() can be called and do a simple comparison. Should it be an error if someone were to change it to be:

"ioccc_year" : "2022",
I.e. add quotes or something else. That is a general question for all the other fields and the other file as well: if the user changes it so that it's not the right type (for example - int to string) should this be an error in validation OR should the quotes be ignored?

Since you're only after if the right fields (with the right values) are there is it okay if the correct fields are there at the same time as the format being wrong (by format being wrong I mean not validly JSON)?

The value 2022 without quotes is correct. It is valid JSON (see the spec). The value "2022", a string, would be an error for that numeric value.

We will reply later to your other comments.

@xexyl
Copy link
Contributor

xexyl commented Feb 12, 2022

The value 2022 without quotes is correct. It is valid JSON (see the spec). The value "2022", a string, would be an error for that numeric value.

Does that mean that with quotes it should be considered invalid in the context of the tool? I get that impression but want to be sure.

We will reply later to your other comments.

Thank you. I'll probably look at it tomorrow - or else later on today if I get a chance.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

The value 2022 without quotes is correct. It is valid JSON (see the spec). The value "2022", a string, would be an error for that numeric value.

Does that mean that with quotes it should be considered invalid in the context of the tool? I get that impression but want to be sure.

We will reply later to your other comments.

Thank you. I'll probably look at it tomorrow - or else later on today if I get a chance.

Yes. If the value is numeric, then there MUST be no quotes around the value.
In JSON:

"123" != 123

Only strings are names appear to be in double quotes. JSON values such these are not double quoted:

  • true
  • false
  • null
  • 12345
  • -123
  • 12.345
  • 12.345e-7

In the JSON used by IOCCC, we do not have use for numeric values that are non-integers. So the last 2 value forms may be safely ignored.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

For the above two we can just use strtok_r() on : and then when finding the right first field do a strcmp() on the defines.

We are sure that you remember that the JSON elements may come in any order, and that whitespace can change without impacting the JSON validity, and that string such as "string string2" can have whitespace within them, such as in a name.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

What might be helpful here: can you provide an actual UUID string that might be valid in the contest so that testing this tool can be easier?

Here is a sample UUID:

12345678-1234-4321-abcd-1234567890ab

For more info see this other comment.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

Here this makes me think that the format of the JSON file does in fact matter - since you say it must be an array and the fields must be exactly the below. Does this hold everywhere else too?

Probably, yes.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

	"formed_timestamp" : 1644618833,

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h.

Which integer type do you suggest? time_t?

JSON numbers can be of any length. JSON numbers are typeless. JSON integers are just a string of decimal digits of any length.

See the number section of the JSON spec.

You need not support huge multi-precision numbers. Instead try to form a long long.

You might want to look at the length of the characters of a JSON number. Now LLONG_MAX is:

0x7fffffffffffffff == 9223372036854775807

And in decimal, 9223372036854775807 has 19 digits. So define:

#define LLONG_MAX_BASE10_DIGITS (19)

Then if the length (not counting and leading - sign) of the JSON number exceeds LLONG_MAX_BASE10_DIGITS, reject it as being too large.

Then use strtoll(3) to convert the characters of the JSON number into a long long, taking care (as perviously discussed) about detecting when errno is changed to non-zero (by presetting it to 0 before the strtoll(3) call) and doing all that stuff about rejecting LLONG_MIN and LLONG_MAX, and finally assigning it to a long long value.

While the checking length (ignoring any leading - sign) of the JSON number is option, it might still be a good idea in case strtoll(3) gets really confused with a huge number of digits.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

Do you have a recommended way to go about this? The first thing that popped into my head is strptime() but I'm not sure if that's the best/better way to go about it.

The use of strptime() is good idea.

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 12, 2022

It is hard for us to determine what parts of a long reply need to be responded to and what are just comments.
It is easy for us to miss something within such a long reply, sorry.

Perhaps single issue messages might make that easier? Anyway if we missed a question, please ask it again, perhaps as 1 question (or 1 question set) per post?

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

Yes. If the value is numeric, then there MUST be no quotes around the value. In JSON:

"123" != 123

This makes sense.

Only strings are names appear to be in double quotes. JSON values such these are not double quoted:

  • true
  • false
  • null
  • 12345
  • -123
  • 12.345
  • 12.345e-7

In the JSON used by IOCCC, we do not have use for numeric values that are non-integers. So the last 2 value forms may be safely ignored.

This is helpful, thanks. I'll also take a look at the JSON documentation you provided - but tomorrow. Just quickly going through this with any comments and then turning in for the night.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

For the above two we can just use strtok_r() on : and then when finding the right first field do a strcmp() on the defines.

We are sure that you remember that the JSON elements may come in any order, and that whitespace can change without impacting the JSON validity, and that string such as "string string2" can have whitespace within them, such as in a name.

Yes I do but it's good that you made it clear. Appreciate that. One possibility is stripping the spaces out but I'll worry about the technicalities when working on it.

Edit: Ah but you're saying it because of my reference to strcmp(). Yes this does indeed matter. I haven't actually looked at the strings in question yet so this was just quick thoughts on my part. Thanks for pointing this out.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

What might be helpful here: can you provide an actual UUID string that might be valid in the contest so that testing this tool can be easier?

Here is a sample UUID:

12345678-1234-4321-abcd-1234567890ab

For more info see this other comment.

Thanks. A short bit ago I remembered the tool uuidgen but it's helpful that you provide a specific one I can test. Still because the parsing is already in fnamchk that's probably sufficient - I wasn't sure on the part of variants though so this is helpful.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

Here this makes me think that the format of the JSON file does in fact matter - since you say it must be an array and the fields must be exactly the below. Does this hold everywhere else too?

Probably, yes.

Thanks for confirming this.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h.
Which integer type do you suggest? time_t?

JSON numbers can be of any length. JSON numbers are typeless. JSON integers are just a string of decimal digits of any length.

In that case (this is just a quick thought): since the values are of limited range in C it might be possible to just use strspn() (to verify that it only has digits) and then (when comparing to a max) use strcmp() (since one can check for <, > etc. - not just == 0). This is just a quick thought though and maybe I'll come up with a different way (or you have a preference and you cay say).

See the number section of the JSON spec.

Will do.

You need not support huge multi-precision numbers. Instead try to form a long long.

Okay.

You might want to look at the length of the characters of a JSON number. Now LLONG_MAX is:
0x7fffffffffffffff == 9223372036854775807

Yep. And I actually wondered about this. At least if you mean for each JSON number count the number of digits. Is this what you mean?

And in decimal, 9223372036854775807 has 19 digits. So define:

#define LLONG_MAX_BASE10_DIGITS (19)

Then if the length (not counting and leading - sign) of the JSON number exceeds LLONG_MAX_BASE10_DIGITS, reject it as being too large.

Good idea. I actually wrote a function years ago that counts the number of digits in an int - but since this will be parsed as a string (initially) I can just use strlen().

Then use strtoll(3) to convert the characters of the JSON number into a long long, taking care (as perviously discussed) about detecting when errno is changed to non-zero (by presetting it to 0 before the strtoll(3) call) and doing all that stuff about rejecting LLONG_MIN and LLONG_MAX, and finally assigning it to a long long value.

That sounds good and reasonable.

While the checking length (ignoring any leading - sign) of the JSON number is option, it might still be a good idea in case strtoll(3) gets really confused with a huge number of digits.

And any + too for that matter (I don't know if that's valid in JSON but it certainly is in the strto*() functions.

The way I read this is:

  • Get the length (strlen()) and make sure it's not longer than the number of digits in a LLONG_MAX (as above).
  • And then try using strtoll() to verify that it works (making sure to check for any errors etc.).

Is that what you're saying? That seems reasonable to me at a quick glance.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

Do you have a recommended way to go about this? The first thing that popped into my head is strptime() but I'm not sure if that's the best/better way to go about it.

The use of strptime() is good idea.

Okay I'll consider that then. Thanks.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

It is hard for us to determine what parts of a long reply need to be responded to and what are just comments. It is easy for us to miss something within such a long reply, sorry.

No need to be sorry. Actually I'm sorry: I thought just quoting the specific parts would be enough. I'll make sure to do a single question in a single comment in the future. I should know this but I tend to write a lot - apologies!

Perhaps single issue messages might make that easier? Anyway if we missed a question, please ask it again, perhaps as 1 question (or 1 question set) per post?

I'll ask anything again - in a single message - if anything else comes up (or I should say when something else comes up). But no worries if you missed anything.

Perhaps you have some things you can add to my above comments and I'll read them in the morning. I'm going to check the other thread and if nothing else was posted there I'll turn in for the night.

Take good care and enjoy the rest of your day! More tomorrow. I'm sure I'll make a pull request in the morning but almost certainly after that nothing until Monday.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

Something occurred to me that you did not address: should certain characters be verified that they're escaped? For example should the URL have the /s escaped like it is printed out by mkiocccentry?

Also should the mkiocccentry tool check for characters like \ in URLs? (Maybe it already does and I don't recall).

Ah, there's this question you did miss. Should the JSON validator detect / without a \ before it?

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 13, 2022

Ah, there's this question you did miss. Should the JSON validator detect / without a \ before it?

We think so. The JSON spec seems to suggest that for some reason, /'s needs to be -escaped.

See json_putc() in mkiocccentry for the list of chars that need to be escaped:

/*
 * json_putc - print a UTF-8 character with JSON encoding
 *
 * JSON string encoding JSON string encoding.
 *
 * These escape characters are required by JSON:
 *
 *     old              new
 *     --------------------
 *      "               \"
 *      /               \/
 *      \               \\
 *      <backspace>     \b      (\x08)
 *      <tab>           \t      (\x09)
 *      <newline>       \n      (\x0a)
 *      <vertical tab>  \f      (\x0c)
 *      <enter>         \r      (\x0d)
 *
 * These escape characters are implied by JSON due to
 * HTML and XML encoding, although not strictly required:
 *
 *     old              new
 *     --------------------
 *      <               \u003C
 *      >               \u003E
 *      &               \u0026
 *
 * These escape characters are implied by JSON to let humans
 * view JSON without worrying about characters that might
 * not display / might not be printable:
 *
 *     old              new
 *     --------------------
 *      \x00-\x07       \u0000 - \u0007
 *      \x0e-\x1f       \u0005 - \x001f
 *      \x7f-\xff       \u007f - \u00ff
 *
 * See:
 *
 *      https://developpaper.com/escape-and-unicode-encoding-in-json-serialization/
 *
 * NOTE: We chose to not escape '%' as was suggested by the above URL
 *       because it is neither required by JSON nor implied by JSON.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

Ah, there's this question you did miss. Should the JSON validator detect / without a \ before it?

We think so. The JSON spec seems to suggest that for some reason, /'s needs to be -escaped.

See json_putc() in mkiocccentry for the list of chars that need to be escaped:

/*
 * json_putc - print a character with JSON encoding
 *
 * JSON string encoding JSON string encoding.  We will encode as follows:
 *
 *     old              new
 *     --------------------
 *      "               \"
 *      /               \/
 *      \               \\
 *      <backspace>     \b
 *      <vertical tab>  \f
 *      <tab>           \t
 *      <enter>         \r
 *      <newline>       \n
 *      <               \u003C
 *      >               \u003E
 *      &               \uoo26
 *      %               \u0025
 *      \x80-\xff       \u0080 - \u00ff
 */

That's very helpful, thanks. I guess that means when parsing the fields one will have to keep track of the previous character too so that when they encounter a character that has to be escaped and the previous character was not \ then it's an error so when the tool ends it'll return non-zero which means that if running in mkiocccentry the latter will abort too.

Perhaps there could be a function that does the checking: it would take the previous character and the current character and if the current character is one of the above and the previous character is not then it's an error. That would make it more modular and cleaner.

Well as you'll see I did a pull request with two checks added to txzchk that I think you'll appreciate (one might be called redundant and I almost had it so that a file called JUST . be only a file called JUST . but I made it that AND an invalidly named dot file). The other one is checking for non-portable characters in the FULL filename (not just the basename).

The latter test can be modified a bit so that it can be used in the json checker (but since this is only basename perhaps the code in mkiocccentry will suffice: move that part of the function that checks extra data files to a new function).

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

On the subject of escaping: I checked the mkiocccentry comments like you suggested and I went to the website you referred to (https://developpaper.com/escape-and-unicode-encoding-in-json-serialization/). This in turn made me wonder:

What to do about characters > \u00ff? I guess this is an error and so the tool should return a value > 0?

I haven't started working on any of the parsing yet; for now I'm wanting to get these things clarified. I might work on some of the parsing in a little bit but I'm not sure if I'll have the time and energy (time yes for now but not sure if I have both).

@lcn2
Copy link
Contributor Author

lcn2 commented Feb 13, 2022

We can assume UTF-8 throughout the tool chain. As the encoding article recommends, JSON tools should insist that the \ be followed ONLY by these characters:

  • "
  • /
  • \
  • b
  • f
  • t
  • r
  • n
  • uxxxx

The uxxxx is a special case where xxxx are hex characters 0123456789abcdefABCDEF (either lower or upper case).

While tools are insistent in terms of what they produce, they should be generous in terms of what they accept.

Moreover, JSON tools should flag as an error, if any of the following UTF-8 characters are found when NOT preceded by a \:

  • "
  • /
  • ASCII backspace \b
  • ASCII formfeed \f
  • ASCII tab \t
  • ASCII return \r
  • ASCII newline \n
  • u (unless what follows are 4 hex [0-9a-fA-F] encoding chars)
  • \ (unless what follows is an above mentioned escape sequence)

The following encodings are encouraged (and implied) but not required by the JSON definition:

  • [\x00-\x07]
  • [\x0b]
  • [\x0e-\x1f]
  • [\x7f-\xff]

We suggest that you may wish to create a utility function that converts JSON encoded strings into a malloced un-encoded string. That is:

char * json_decode(char const *json_string)

The json_decode() function should return a malloced UTF-8 string where the JSON -escaped characters are de-converted, or return NULL if the string was improperly encoded. The dbg() function should be used by json_decode() to inform the user of any encoding problems found. The json_decode() function should not call an error function, but rather return NULL to let the calling function decide what to do.

And what you are at it, the following reverse utility function should be written:

char * json_encode(char const *utf8_string)

The json_encode() function should return a malloced JSON encoded string, or NULL if there is a malloc error. The json_encode() function should not call an error function, but rather return NULL to let the calling function decide what to do. Also it should NOT enclose the string in " (ASCII double-quote) but rather let the calling function do that.

The json_encode() and json_decode() functions should be proper inverses of each other, unless they return NULL.

For testing and "general tool usefulness", two utilities are in order:

jstrencode [-h] [-v level] [-V] [string ...]
jstrdecode [-h] [-v level] [-V] [string ...]

These utility tools write to stdout the JSON string encoding or decoding of their input. If no string args are given, then data should be read in stdin until EOF. They should exit 0 is all is well, or exit non-0 if an error (such as NULL is returned) is encountered.

As with Unix tools, the output of one should be able to be fed into the other, so:

jstrencode < foo | jstrdecode > bar
if ! cmp foo bar; then
    echo "foo and bar differ"
fi 

Then you can add to the test rule, testing for JSON string encoding/decoding as well as tests for detecting improper JSON string encoding.

jstrdecode '\error' >/dev/null
status = "$?"
if [[ $status != 0 ]]; then
    echo "Improper decoding not detected" 1>&2
    exit "$status"
fi

Etc.

@xexyl
Copy link
Contributor

xexyl commented Feb 13, 2022

Please let me know if this reply is okay or if it should be split off into other messages. I wasn't quite sure of this since it's all one reply I'm replying to. If necessary I'll rewrite it at another time. Just let me know and I'll be happy to do that - then you can disregard the below (I just am about to head off for the day - well I'll be at the computer a bit longer but I'm afraid I'm done with this for the day).

I was thinking one of the next things I'll do is add to the processing of lines identifying which field it is so that the only thing left to be done is to parse it (the arrays will of course be different but I'll worry about that later). As you know I also already detect whether the file starts with a { and ends with a }. I think the tools you suggest below should be done before this though so I'll work on that - but not until tomorrow (probably tomorrow).

Anyway reply below. Have a great rest of your day!

We can assume UTF-8 throughout the tool chain. As the encoding article recommends, JSON tools should insist that the \ be followed ONLY by these characters:

In other words: if any other character has a \ before it is an error, right?

While tools are insistent in terms of what they produce, they should be generous in terms of what they accept.

But not when validating the file, right? I mean if it has for example \_ it should fail, right? That's just an example; I'm not sure how you mean generous in what they accept: in what way should they be generous?

We suggest that you may wish to create a utility function that converts JSON encoded strings into a malloced un-encoded string. That is:

char * json_decode(char const *json_string)
char * json_encode(char const *utf8_string)

I guess that the json_putc() function can help with this since it kind of does that already. Then as for the inverse it's just a matter of reversing the logic. This is a good idea you have.

The json_encode() and json_decode() functions should be proper inverses of each other, unless they return NULL.

Makes sense.

For testing and "general tool usefulness", two utilities are in order:

jstrencode [-h] [-v level] [-V] [string ...]
jstrdecode [-h] [-v level] [-V] [string ...]

In other words the above functions can be put in json.c and the two tools you suggest would make use of those functions. That sounds like a good idea and probably the next thing to be done.

As with Unix tools, the output of one should be able to be fed into the other, so:

jstrencode < foo | jstrdecode > bar
if ! cmp foo bar; then
    echo "foo and bar differ"
fi 

Of course. Just like my Enigma machine does. I made sure to design it that way!

Then you can add to the test rule, testing for JSON string encoding/decoding as well as tests for detecting improper JSON string encoding.

jstrdecode '\error' >/dev/null
status = "$?"
if [[ $status != 0 ]]; then
    echo "Improper decoding not detected" 1>&2
    exit "$status"
fi

I guess you're referring to make test with new scripts? I thought of making some test scripts to test fnamchk and txzchk but I haven't got to that yet - not sure if it's necessary.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

As for running it with valgrind I just did:

valgrind -v --leak-check=yes --track-origins=yes --leak-resolution=high --read-var-info=yes --leak-check=full --show-leak-kinds=all make test

and there were no errors and the only memory leaks that I could see were from make itself. So unless it doesn't actually catch what the script does we're in good shape it looks like. I'll update the Makefile to note the improved valgrind flags and I might add a test target that runs it with valgrind: not sure of that though.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

Update: I'm playing with a function I wrote years ago that extracts expressions (or fields) delimited by :s (originally it was ;s) but does not remove quotes. This might be useful for extraction of the fields. I have it in json.c and also have a util function matching_quote (finds the matching closing un-escaped '") in util.c. I however really do need a break and I might not do any more of it today: I'm really not sure. I'll have to change the parser of the array extraction and any place that this function might prove useful and the keyword is might: it might not even be of value but I think it might because it keeps "s that are in the string.

It's re-entrant as well and this is the customised strtok_r() I referred to before. It has been a very long time since I've used it though so it might not be of value: that's to be determined at a later time. For now it's a break but whether I'll return to it today or tomorrow I don't know. What I do know is I don't think I'll merge the two tools into jchk.c just yet. But maybe that'll be done soon: not sure.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

As for running it with valgrind I just did:

valgrind -v --leak-check=yes --track-origins=yes --leak-resolution=high --read-var-info=yes --leak-check=full --show-leak-kinds=all make test

and there were no errors and the only memory leaks that I could see were from make itself. So unless it doesn't actually catch what the script does we're in good shape it looks like. I'll update the Makefile to note the improved valgrind flags and I might add a test target that runs it with valgrind: not sure of that though.

This has also been done and pushed.

@lcn2
Copy link
Contributor Author

lcn2 commented Mar 7, 2022

About that that reminds me: is there a reason we have -O specified during development? Because that does make it harder to run through a debugger even with debugging symbols generated. Of course debugging symbols could be disabled once it's all ready and specifying some optimisation level would be good too. But for development I'm not so sure it's needed and it might be an issue in some cases. Thoughts?

We can turn off -O during development.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

About that that reminds me: is there a reason we have -O specified during development? Because that does make it harder to run through a debugger even with debugging symbols generated. Of course debugging symbols could be disabled once it's all ready and specifying some optimisation level would be good too. But for development I'm not so sure it's needed and it might be an issue in some cases. Thoughts?

We can turn off -O during development.

Probably a good idea. Want me to do that? Do you want -O0 or just not specifying any -O at all?

@lcn2
Copy link
Contributor Author

lcn2 commented Mar 7, 2022

Two bools that have to be there: past_winner and default_handle. Additionally a string author_handle that probably cannot be longer than MAX_NAME_LEN and has to be at least ` char long.

Are those correct or am I missing anything there?

We thought those were already added.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

Two bools that have to be there: past_winner and default_handle. Additionally a string author_handle that probably cannot be longer than MAX_NAME_LEN and has to be at least ` char long.
Are those correct or am I missing anything there?

We thought those were already added.

They are indeed. If I didn't you did. I only saw that later on and I forgot to update this. Sorry!

@lcn2
Copy link
Contributor Author

lcn2 commented Mar 7, 2022

About that that reminds me: is there a reason we have -O specified during development? Because that does make it harder to run through a debugger even with debugging symbols generated. Of course debugging symbols could be disabled once it's all ready and specifying some optimisation level would be good too. But for development I'm not so sure it's needed and it might be an issue in some cases. Thoughts?

We can turn off -O during development.

Probably a good idea. Want me to do that? Do you want -O0 or just not specifying any -O at all?

Change:

COPT= -O3 -g3
#COPT= -O0 -g

to this:

#COPT= -O3 -g3
COPT= -O0 -g
``

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

#COPT= -O3 -g3
COPT= -O0 -g

Done and pushed.

@lcn2
Copy link
Contributor Author

lcn2 commented Mar 7, 2022

The next commit will the the 500th commit. Thank you @xexyl for helping with a number of those 500!

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

The next commit will the the 500th commit. Thank you @xexyl for helping with a number of those 500!

Thank you! That means a great deal to me! I'm proud and happy to participate! I'm having a lot of fun even if I have to take a lot of breaks and many of the days I can't do much.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

I'm hoping to make at least one additional commit today but I'm not sure of that. In jinfochk.c I have some clean ups but since I also have debug calls I've not yet committed it. It's tempting to go ahead remove those debugging symbols just so I could be at the 500th commit.

Well I think I will do that because it's good to clean it up and it's a source of pride: something I have a difficult time with. Expect a pull request soon!

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

There.

Now I'll work a bit more on the extraction of the array but I'm guessing I won't work at it much longer. I did get some important insight today at least which will probably help with the actual task at hand when I'm working on it next.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

Just had an idea with the array... going to play with it and if it works I'll commit it. If not I'll think it over when not working on it.

Update: Will return to this later: either later today or tomorrow morning. Not sure which yet.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

I took another moment and I think I solved this problem! Both make test and running the tool on that file pass!

Looking a bit more and if so I'll commit and push. That'll probably be all I do for the day.

@xexyl
Copy link
Contributor

xexyl commented Mar 7, 2022

Pushed. Now I'm done for now. More tomorrow if not later today.

Enjoy your time with Leo and stay safe! Cheers.

@lcn2
Copy link
Contributor Author

lcn2 commented Mar 7, 2022

Pushed. Now I'm done for now. More tomorrow if not later today.

Enjoy your time with Leo and stay safe! Cheers.

Thanks, the demo with Leo went well.

Leo suggested a change in a mkiocccenty message that will be done in the next fix.

@xexyl
Copy link
Contributor

xexyl commented Mar 8, 2022

Pushed. Now I'm done for now. More tomorrow if not later today.
Enjoy your time with Leo and stay safe! Cheers.

Thanks, the demo with Leo went well.

Leo suggested a change in a mkiocccenty message that will be done in the next fix.

Great! Looking forward to it. I will be going to sleep soon so that will be tomorrow.

I do have a thought that might help solve the comma issue but that’s also for tomorrow. Hope you have a great night!

@lcn2
Copy link
Contributor Author

lcn2 commented May 15, 2022

This issue is pending the completion of the JSON parser and closing of #156 "Enhancement: finish the C-based general JSON parser".

@xexyl
Copy link
Contributor

xexyl commented May 23, 2022

jauthchk-assigned-to-xexyl-by-ioccc-judges

I'm honoured! Thank you!

@lcn2
Copy link
Contributor Author

lcn2 commented Jun 15, 2022

See comment 1155885090 for changes to the chk code / chk warning and error facilities.

@lcn2
Copy link
Contributor Author

lcn2 commented Jun 21, 2022

Closing this request in favor of issue #259.

@lcn2 lcn2 closed this as completed Jun 21, 2022
@lcn2 lcn2 added invalid This doesn't seem right and removed enhancement New feature or request help wanted Extra attention is needed labels Jun 21, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

3 participants