Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide on handling of RFC variations #27

Closed
mxsasha opened this issue May 14, 2018 · 10 comments
Closed

Decide on handling of RFC variations #27

mxsasha opened this issue May 14, 2018 · 10 comments
Assignees

Comments

@mxsasha
Copy link
Collaborator

mxsasha commented May 14, 2018

Based on the current state of our parser, there are a few areas where we may need to adjust validation.

A full overview of all errors in non-strict mode for all databases is available on irrd01.dashcare.nl:parser_results/nonstrict-others.errors, and a run on the NTTCOM db with strict mode in nttcom.errorsum with the full objects for context listed in nttcom.errors. For the difference between strict and non-strict, see the docs.

The errors about being unable to read public PGP keys should be assumed to be parser bugs at this time, as currently all key-certs generate an error. Other than that, all these errors require a business decision.

In strict mode on NTTCOM:

  • (6x) MAIL-FROM attributes are used in NTTCOM but will not be supported in IRRDv4.
  • (40x) as-name is missing on some aut-nums in NTTCOM, but this is mandatory in RFC 2622
  • (a few hundred times) In many cases, the tech-c/admin-c of objects contains a person or department name, which is not a reference to a nic-hdl of a person/role object. For some or all, there is even no person/role object under that same name. In RFC 2622, these must be nic-handles.

In other databases, on non-strict mode:

  • (1x) ARIN includes a route object with an origin of 20013, which is not valid as this should be AS20013. As this is a primary key of the route object, validation is enabled for this field even in non-strict.
  • (1x) RIPE contains one object with an attribute named *mb. Although we generally allow unknown attributes to be added, this is an invalid attribute name syntax - not just one we don't know.
  • (63x) Objects in ARIN (IPv6) and SAVVIS (IPv4) contain network prefixes in primary key fields which have host bits enabled, e.g. 2605:4d00::1/32 or 23.15.142.1/23.
  • (8x) Similar to NTTCOM, some tech-c/admin-c contain people's names in RADB - some of which are in this case also a nic-hdl on a person/role.
  • (6x) Level3 still uses ASDOT notation in a few objects.

These checks verify whether IRRD isn't being too strict, but they don't validate whether it's being too flexible, so we also need to check whether the current list of permitted attributes isn't too permissive.

Lastly, RFC 2622 strictly defines ASCII as the allowed character set, but RIPE, TC, GT, BBOI and ARIN use ISO-8859-1 characters. AFRINIC and APNIC use UTF-8. Many others are just ASCII. What should the permitted character set be, and assuming we don't want to stick to pure 7-bit ASCII, what encoding should be used for queries, NRTM and data export?
(Internally, Python code almost always uses UTF-8, as will the database.)

@mxsasha mxsasha added this to the IRRDv4 phase 1 milestone May 14, 2018
@mxsasha mxsasha mentioned this issue May 14, 2018
6 tasks
@mahtin
Copy link

mahtin commented May 14, 2018

(independent of Python3's excellent UTF-8 processing) ... I would highly recommend using UTF-8 as the common storage method independent of the RIRs/IIRs pathetic ASCII mindset. There's no going back to ASCII at this point (just look at the /e flag on jpnic's whois service to see unstructured non-ascii text at it's best).

Mapping inbound ASCII to UTF-8 should be nearly the default.

@mxsasha
Copy link
Collaborator Author

mxsasha commented May 14, 2018

My initial phrasing was a bit unclear - I clarified it. Internally we will use UTF-8, but of course inputs and outputs of ISO-8859-1 shouldn't be hard, same for ASCII input.

However, as far as I can find, allowing any characters that are not ASCII is an RFC deviation, so it requires a business decision. Another question would be whether to allow any UTF-8 character, or limit them. Allowing any would allow emoji, for example.

@mxsasha
Copy link
Collaborator Author

mxsasha commented May 15, 2018

Update on key-certs: 15 key-cert objects were incorrectly parsed. However, another 5 can not be imported as they are old PGP-2 keys, which are no longer supported by gpg 2.1 (since 2014). What to do with these also requires a business decision.

@mxsasha
Copy link
Collaborator Author

mxsasha commented Jun 28, 2018

Here's another one from ARIN to deal with:

as-set:         AS11976:AS-CUSTOMERS
descr:          Fidelity Communications
members:        AS11976
members:        AS31919
members:        AS46611
members:        AS17286=20
members:        AS5638
mnt-by:         MNT-FIDN
changed:        nocadmins@fidnet.com 20100806
source:         ARIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ERROR: Invalid set AS17286=20: component AS17286=20 is not a valid AS number nor a valid set name

(set members are a lookup field, and therefore always validated)

We could simply strip out the =20 part if it occurs on the end of a line, but that doesn't feel very clean either.

@mahtin
Copy link

mahtin commented Jun 28, 2018

=20 is a space badly decoded from the inbound email to the IRR. See https://productforums.google.com/forum/?noredirect=true#!topic/gmail/WCklpQTrJMk or similar pages.

That said; I’d drop the whole line as it’s not valid syntax. Or write an irrlint command so that someone can produce an error report to post somewhere.

@mxsasha
Copy link
Collaborator Author

mxsasha commented Jun 28, 2018

Following up on encodings: after running into some encoding issues, I ran chardet on all the files I got from rr.ntt.net, which is very thorough (and really slow, this took about an hour):

afrinic.db: utf-8 with confidence 0.99
altdb.db: Windows-1252 with confidence 0.73
apnic.db: Windows-1254 with confidence 0.4888600622828866
arin-whois.db: ascii with confidence 1.0
arin.db: Windows-1252 with confidence 0.7299344525030538
bboi.db: ISO-8859-1 with confidence 0.73
bell.db: ascii with confidence 1.0
gt.db: ISO-8859-1 with confidence 0.73
internal.db: ascii with confidence 1.0
jpirr.db: ascii with confidence 1.0
level3.db: Windows-1252 with confidence 0.7299986263799446
nttcom.db: ISO-2022-JP with confidence 0.99
radb.db: Windows-1254 with confidence 0.5017295329798788
rgnet.db: ascii with confidence 1.0
ripe.db: Windows-1254 with confidence 0.45201648599143607
savvis.db: ISO-8859-1 with confidence 0.73
tc.db: Windows-1254 with confidence 0.4542790826372332

If this detection is correct, encodings are all over the place, and for some we aren't even very sure - but it's something.

@mahtin
Copy link

mahtin commented Jun 28, 2018

The =20 issue shows up all over the place; but mainly in plain text areas.

$ gunzip < radb.db.gz | egrep '=20$' | cut -d' ' -f1 | sort | uniq -c
   3 
   1 changed:
  51 descr:
 571 remarks:
$

For example:

route:      152.234.192.0/19
descr:      Proxy-registered route object
remarks:    ---------------------------------
remarks:    For SECURITY issues, please send messages to
remarks:    origin AS administrator
remarks:    (e.g. whois -h whois.registro.br <origin>)
remarks:    ---------------------------------
remarks:    This route is from Oi's AS  which
remarks:    is being exported under origin AS below.
remarks:    =20=20=20=20=20
remarks:    Please contact BGPOI-RADB if you have any
remarks:    questions regarding this object.
remarks:    =20=20=20=20=20
origin:     AS7738
notify:     ip-bgp@oi.net.br
mnt-by:     MAINT-AS8167
changed:    ip-bgp@oi.net.br 20150401
source:     RADB
route:      138.185.168.0/22
descr:      Transito OPENX =E2=80=93 Navinet LTDA
origin:     AS28152
mnt-by:     MAINT-AS22356
changed:    epafras@durand.com.br 20170627  #12:27:48Z =20
source:     RADB

However, there's rampant syntax errors all over the place that cleanly affect data. Here's are indented lines; which should not be; however are being handled as part of the previous descr or remarks lines:

as-set:     AS7473:AS-CUSTOMERS:AS23991
descr:      STIX customer ASes through AS23991
            members:   AS23991, AS38210
admin-c:    DNS34678
tech-c:     TKH147
notify:     noc@ix.singtel.com
mnt-by:     MAINT-AS7473
changed:    jatadero@singtel.com 20060830
source:     RADB
as-set:     AS7473:AS-CUSTOMERS:AS17494
descr:      STIX customer ASes through AS17494
            members:   AS17494, AS24432, AS38067,
            AS38067, AS38138, AS38136,
            AS38071, AS24122, AS38073,
            AS38192, AS24507, AS38230,
                   AS38200
admin-c:    DNS34678
tech-c:     TKH147
notify:     noc@ix.singtel.com
mnt-by:     MAINT-AS7473
changed:    arnold@singtel.com 20070212
source:     RADB
route:      196.49.28.0/24
descr:      Client name  IXPN-MGMT-ABUJA
            descr:         Internet Exchange of Nigeria Ltd - ABUJA
origin:     AS36932
mnt-by:     MAINT-AS36994
changed:    ashwin.lalla@vodacom.co.za 20170712  #08:06:08Z
source:     RADB

This is clearly a mistake; but should be fixed by the end-user:

aut-num:        AS3634
as-name:        sfasu-as
descr:  Stephen F. Austin State University
  import: from AS4557 action pref=10;
          accept ANY
  import: from AS114 action pref=20;
          accept ANY
  import: from AS2914 action pref=20;
          accept ANY
  export: to AS114 announce AS3634
  export: to AS4557 announce AS3634
  export: to AS2914 announce AS3634
admin-c:JG2220
tech-c: JG2220
notify:jgarner@sfasu.edu
mnt-by:    MAINT-AS3634
changed:   jgarner@sfasu.edu 20000114
source:      RADB

Sigh.

Plenty to syntax checking needs to be done. Sigh. Sigh. Sigh.

@rubenskuhl
Copy link

Note that Registro.br is an RPSL routing registry, not an RPSL-ng routing registry, so one decision to make in irrd4 is having or not support for legacy notations.

@mxsasha
Copy link
Collaborator Author

mxsasha commented Jul 3, 2018

@mahtin Nice finds. For the plain-text fields, =20 it's ugly but valid, so not a concern there. The clearly unintended indents are entirely valid objects, so not much we can do there. A future improvement could be that IRRD warns the user when a line was (correctly) interpreted as continuation of the previous line, yet starts with a valid attribute name, i.e. they may have made a mistake. Of course, that doesn't help for objects received from mirrors.

@rubenskuhl after having a quick look, it may not be too hard as it's still quite similar, but rr.ntt.net does not include it now, so it would be out of scope for the IRRD4 phase 1 project.

@mxsasha
Copy link
Collaborator Author

mxsasha commented Jul 9, 2018

This issue has been split into issues 47-62 in this repo.

@mxsasha mxsasha closed this as completed Jul 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants