Decide on handling of RFC variations #27
Comments
(Independent of Python 3's excellent UTF-8 processing) ... I would highly recommend using UTF-8 as the common storage method, independent of the RIRs'/IRRs' pathetic ASCII mindset. There's no going back to ASCII at this point (just look at the /e flag on JPNIC's whois service to see unstructured non-ASCII text at its best). Mapping inbound ASCII to UTF-8 should be nearly the default. |
My initial phrasing was a bit unclear - I clarified it. Internally we will use UTF-8, but of course inputs and outputs of ISO-8859-1 shouldn't be hard, same for ASCII input. However, as far as I can find, allowing any characters that are not ASCII is an RFC deviation, so it requires a business decision. Another question would be whether to allow any UTF-8 character, or limit them. Allowing any would allow emoji, for example. |
Update on key-certs: 15 key-cert objects were incorrectly parsed. However, another 5 cannot be imported as they are old PGP-2 keys, which are no longer supported by gpg 2.1 (since 2014). What to do with these also requires a business decision. |
Here's another one from ARIN to deal with:
(set members are a lookup field, and therefore always validated) We could simply strip out the =20 part if it occurs on the end of a line, but that doesn't feel very clean either. |
=20 is a space badly decoded from the inbound email to the IRR. See https://productforums.google.com/forum/?noredirect=true#!topic/gmail/WCklpQTrJMk or similar pages. That said, I’d drop the whole line as it’s not valid syntax. Or write an irrlint command so that someone can produce an error report to post somewhere. |
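As a rough illustration of the "strip it, but report it" idea discussed above, here is a minimal sketch. The helper name `strip_trailing_qp_space` is hypothetical and not part of IRRD; it only handles a `=20` token (the quoted-printable encoding of a space) at the very end of a line and leaves any `=20` in the middle of a value alone.

```python
import re
from typing import Tuple

# Hypothetical helper for illustration only, not an IRRD function.
# Matches one or more "=20" tokens at the very end of a line.
_TRAILING_QP_SPACE = re.compile(r"(?:=20)+$")


def strip_trailing_qp_space(line: str) -> Tuple[str, bool]:
    """Return the line without trailing "=20" residue and a flag that
    says whether anything was removed, so a caller could log or report it."""
    cleaned = _TRAILING_QP_SPACE.sub("", line)
    return cleaned, cleaned != line
```

The returned flag could feed a lint-style error report instead of silently rewriting the data; whether to strip, reject, or only report remains the open decision in this thread.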
Following up on encodings: after running into some encoding issues, I ran chardet on all the files I got from rr.ntt.net, which is very thorough (and really slow, this took about an hour):
If this detection is correct, encodings are all over the place, and for some we aren't even very sure - but it's something. |
The =20 issue shows up all over the place, but mainly in plain text areas.
For example:
However, there are rampant syntax errors all over the place that clearly affect data. Here are indented lines, which should not be indented, but which are being handled as part of the previous descr or remarks lines:
This is clearly a mistake, but it should be fixed by the end user:
Sigh. Plenty of syntax checking needs to be done. Sigh. Sigh. Sigh. |
Note that Registro.br is an RPSL routing registry, not an RPSL-ng routing registry, so one decision to make in irrd4 is whether or not to support legacy notations. |
@mahtin Nice finds. For the plain-text fields, =20 is ugly but valid, so it's not a concern there. The clearly unintended indents are entirely valid objects, so there's not much we can do there. A future improvement could be that IRRD warns the user when a line was (correctly) interpreted as a continuation of the previous line, yet starts with a valid attribute name, i.e. they may have made a mistake. Of course, that doesn't help for objects received from mirrors. @rubenskuhl After having a quick look, it may not be too hard to support as it's still quite similar, but rr.ntt.net does not include it now, so it would be out of scope for the IRRD4 phase 1 project. |
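The warning idea above could look roughly like the following sketch. It assumes that a continuation line is one starting with a space, a tab or a `+`, and it uses a small illustrative attribute list; `KNOWN_ATTRIBUTES` and `find_suspicious_continuations` are hypothetical names, not IRRD code.

```python
import re
from typing import Iterable, List, Tuple

# Illustrative subset only (an assumption); a real check would use the
# full list of attribute names known to the parser.
KNOWN_ATTRIBUTES = frozenset(
    {"descr", "remarks", "origin", "mnt-by", "changed", "source"}
)

# Something that looks like "name:" at the start of the continued text.
_ATTR_PREFIX = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*):")


def find_suspicious_continuations(
    object_text: str, known: Iterable[str] = KNOWN_ATTRIBUTES
) -> List[Tuple[int, str]]:
    """Return (line number, attribute name) for each continuation line
    whose content starts with a known attribute name."""
    known_lower = {name.lower() for name in known}
    findings = []
    for lineno, line in enumerate(object_text.splitlines(), start=1):
        if not line or line[0] not in " \t+":
            continue  # a regular attribute line, not a continuation
        body = line[1:].lstrip(" \t")
        match = _ATTR_PREFIX.match(body)
        if match and match.group(1).lower() in known_lower:
            findings.append((lineno, match.group(1).lower()))
    return findings
```

The object would still be parsed exactly as before; the findings are only hints that could be shown to a user submitting the object.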
This issue has been split into issues 47-62 in this repo. |
Based on the current state of our parser, there are a few areas where we may need to adjust validation.
A full overview of all errors in non-strict mode for all databases is available on irrd01.dashcare.nl:parser_results/nonstrict-others.errors, and a run on the NTTCOM db with strict mode in nttcom.errorsum with the full objects for context listed in nttcom.errors. For the difference between strict and non-strict, see the docs.
The errors about being unable to read public PGP keys should be assumed to be parser bugs at this time, as currently all key-certs generate an error. Other than that, all these errors require a business decision.
In strict mode on NTTCOM:
In other databases, on non-strict mode:
20013, which is not valid as this should be AS20013. As this is a primary key of the route object, validation is enabled for this field even in non-strict.
*mb. Although we generally allow unknown attributes to be added, this is an invalid attribute name syntax - not just one we don't know.
These checks verify whether IRRD isn't being too strict, but they don't validate whether it's being too flexible, so we also need to check whether the current list of permitted attributes isn't too permissive.
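The two failures above (an origin of 20013 instead of AS20013, and the attribute name *mb) can be illustrated with a minimal sketch. The function names are hypothetical and the exact rules are assumptions: AS numbers are taken to be "AS" followed by a decimal number in the 32-bit range, and attribute names are assumed to start with a letter and otherwise contain only letters, digits, hyphens and underscores. IRRD's actual rules may differ.

```python
import re

# Assumed syntax rules for illustration; not taken from IRRD's parser.
_AS_NUMBER = re.compile(r"AS([0-9]+)", re.IGNORECASE)
_ATTRIBUTE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")

MAX_ASN = 4294967295  # assumption: allow 32-bit AS numbers


def is_valid_as_number(value: str) -> bool:
    """True for values like "AS20013"; a bare "20013" is rejected."""
    match = _AS_NUMBER.fullmatch(value.strip())
    return match is not None and int(match.group(1)) <= MAX_ASN


def is_valid_attribute_name(name: str) -> bool:
    """True if the name is syntactically plausible, known or not."""
    return _ATTRIBUTE_NAME.fullmatch(name) is not None
```

Under these assumptions, `is_valid_as_number("20013")` and `is_valid_attribute_name("*mb")` both return `False`, while an unknown but well-formed attribute name still passes the syntax check.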
Lastly, RFC 2622 strictly defines ASCII as the allowed character set, but RIPE, TC, GT, BBOI and ARIN use ISO-8859-1 characters. AFRINIC and APNIC use UTF-8. Many others are just ASCII. What should the permitted character set be, and assuming we don't want to stick to pure 7-bit ASCII, what encoding should be used for queries, NRTM and data export?
(Internally, Python code almost always uses UTF-8, as will the database.)
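To make the encoding question more concrete, here is a minimal sketch of how mixed inbound data could be normalised to Python strings before being stored as UTF-8. It assumes only the three encodings named above (ASCII, UTF-8, ISO-8859-1); `decode_inbound` is a hypothetical helper, not an IRRD function.

```python
from typing import Tuple


def decode_inbound(raw: bytes) -> Tuple[str, str]:
    """Decode inbound bytes, returning the text and the encoding used.

    ASCII and UTF-8 are tried first because they fail loudly on
    invalid input. ISO-8859-1 is the last resort: it maps every byte
    to a character, so it never fails.
    """
    for encoding in ("ascii", "utf-8"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

Because the ISO-8859-1 fallback accepts any byte sequence, data in some other 8-bit encoding would be decoded without an error but with wrong characters, so this approach only makes sense if the set of source encodings really is limited to these three. The encoding to use for queries, NRTM and exports is a separate decision that this sketch does not address.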