-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds libraries for Punycode and IDNA conversions. #958
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is almost done! Now we're done bickering over rulesets and we're defining the interface for other NSE programmers and streamlining the code. Take a look at these changes and let me know if you have time to keep working on it.
nselib/idna.lua
Outdated
-- Since this library is dependent on punycode and vice-versa, we need to | ||
-- import each of the libraries into another. This prevents idna from entering | ||
-- into a recursive loop. | ||
package.loaded["idna"] = _ENV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can eliminate this confusion if we divide the functions cleanly between the two libraries. In punycode.lua
we should only handle Punycode encoding and decoding of labels. These functions should be in idna.lua
instead of punycode.lua
:
- Splitting names into labels based on RFC 3490 separators.
- Unicode de/encoding. In fact, we could even make this the responsibility of the caller, and only accept tables of code points for Unicode inputs (i.e. never a string of encoded bytes)
- Joining labels with "."
This has the potential to really simplify a lot of the code here. Most likely, scripts and other libraries will only require idna.lua
, since that's the higher-level interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
nselib/idna.lua
Outdated
decoded_tbl = concat_table_in_tables(decoded_tbl) | ||
|
||
-- Regular expressions (RFC 3490 separators) | ||
for index, cp in ipairs(decoded_tbl) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mapping is actually done in idnaMappings
, so we don't need to do the additional checks here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
nselib/idna.lua
Outdated
for index, cp in ipairs(decoded_tbl) do | ||
local lookup = idnaMappings[cp] | ||
if type(lookup) == "table" then | ||
if lookup.status == "deviation" then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no reason to keep looping over and over the code points. One loop is sufficient. Inside the loop:
- do the lookup in
idnaMappings
- Replace the code point if it's a number
- if it's a table, check the status. You can do
if transitionalProcessing and lookup.status == "deviation" then
to handle that case. - handle the code point as needed.
In other words, handle each code point exactly once. The output of idnaMappings
is always final; if it's mapped, then it is mapped to one or more valid code points which do not need to be checked again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
nselib/idna.lua
Outdated
-- Since this library is dependent on punycode and vice-versa, we need to | ||
-- import each of the libraries into another. This prevents idna from entering | ||
-- into a recursive loop. | ||
package.loaded["idna"] = _ENV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid this confusion by limiting punycode.lua
to simply encoding and decoding labels instead of whole names. In other words, only idna.lua
is responsible for:
- Separating and joining names and labels on RFC 3490 separators.
- Unicode-to-bytes encoding/decoding. In fact, we could require the caller to handle this instead, though that might lead to code duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
nselib/idna.lua
Outdated
decoded_tbl = concat_table_in_tables(decoded_tbl) | ||
|
||
-- Regular expressions (RFC 3490 separators) | ||
for index, cp in ipairs(decoded_tbl) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are already handled by idnaMappings
, so no need to check separately for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this piece of code.
nselib/idna.lua
Outdated
-- @param checkBidi Boolean flag to represent if the input is of Bidi type. | ||
-- @param checkJoiners Boolean flag to check for ContextJ rules in input. | ||
-- @param useSTD3ASCIIRules Boolean value to represent ASCII rules. | ||
-- @param transitionalProcessing Boolean value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add default values to the NSEdoc and note any flags that are not supported. We can't rely on consumers to read the code; at best they will look at the documentation, so it must be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
nselib/idna.lua
Outdated
local inputString = unicode.encode(codepoints, encoder) | ||
|
||
-- Checks for invalid domain codepoints and proceeds further. | ||
if not match(inputString, "[^a-z0-9_@=+/%-%(%)%%]") then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reference for these valid code points/characters? I'm surprised to see _@=+()%
in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I deleted this piece of code.
-- incrementing `n` each time, so we'll fix that now: | ||
if (floor(i / out) > maxInt - n) then | ||
--error('overflow'); | ||
return nil, "Overflow exception occurred." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we checking for this and other punycode errors when we call these functions from idna.lua
? We should be doing so and passing the error up to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
punycode.mapLabels
is the only function being used in idna.lua
and this function is returned to the caller, so if there is an error the error will be returned to the caller along with nil
as response value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, effectively it becomes the job of the caller to handle the errors being thrown.
function encode_label(s, decoder) | ||
|
||
local flag = false | ||
local decoded_tbl = unicode.decode(s, decoder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember, for simplicity we can require callers of punycode functions to already do the decoding and pass in a table of code points. After all, this is what idna.lua
has after doing the mapping/validation, so it doesn't make sense to keep encoding and decoding between calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought decoding here will help the NSE developers who import this function exclusively.
end | ||
|
||
-- Table of punycode test cases. | ||
local testCases = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you convert the libraries to move the domain splitting/joining and Unicode stuff to idna.lua
, you'll have to change these test cases to each be a single label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Almost all the requested changes have been done, please review the modified code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add missing requires for math
and table
to both, address these 2 typos, and commit! Thanks so much!
nselib/idna.lua
Outdated
-- To use this part of code, add disallowed_STD3_mapped and disallowed_STD3_valid | ||
-- codepoints to idnaMappings.lua. For now, we ignore these because idnaMappings.lua | ||
-- is set to support only for the latest version of IDNA. | ||
if UseSTD3ASCIIRules then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is miscapitalized and therefore not referring to useSTD3ASCIIRules
from the parameters list.
nselib/punycode.lua
Outdated
local delimiterCodePoint = 0x002E | ||
local delimiter = unicode.encode({0x002E}, encoder) | ||
|
||
codepoints = unicode.decode(input, decoder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to make this local
No description provided.