Content of an element can also contain a tag terminator ('>') #22

Closed
wants to merge 2 commits into from


matsumotosyu

The content of an element can also contain a tag terminator ('>'), even outside CDATA sections, comments, and processing instructions. (A tag opener ('<') cannot appear there.)

Citation.
https://www.w3.org/TR/xml/#NT-content

[43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
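
To make production [14] concrete, here is a minimal sketch of a check for it. `isCharData` is a hypothetical helper written for illustration, not a function from this repository, and it covers only the two constraints in the production (no '<' or '&', and no ']]>' sequence); it does not validate that every character is a legal XML Char.

```javascript
// Hypothetical helper: tests a string against production [14] CharData.
function isCharData(text) {
  // [^<&]* : neither '<' nor '&' may appear in character data.
  const hasForbiddenChar = /[<&]/.test(text);
  // The subtracted part of the production excludes any string containing ']]>'.
  const hasCdataClose = text.includes(']]>');
  return !hasForbiddenChar && !hasCdataClose;
}
```

Under this sketch a lone '>' is accepted, which is the point of this pull request, while '<', '&' and ']]>' are rejected.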

It follows that the example below is well-formed XML. (I'm sorry.)
(I also checked it with the validator at https://www.w3schools.com/xml/xml_validator.asp )

<x some="・・>・・" any='・・"・・>・・'>・・・(A string that does not contain '<', but can contain '>'.)・・>・・</x>

So I reworked the logic for finding the end of a tag as follows.

・To find the closing '>' of the tag that contains the current cursor, read one character at a time from the current cursor position and apply the following rules, skipping over quoted strings.

  1. If the character at the cursor is a " (double quote), jump the cursor to the next " (double quote).†1†3
  2. If the character at the cursor is a ' (single quote), jump the cursor to the next ' (single quote).†2†3
  3. If the character at the cursor is '>', treat it as the end of the tag.

†1 A " (double quote) cannot appear inside a string enclosed in " (double quotes).
†2 A ' (single quote) cannot appear inside a string enclosed in ' (single quotes).
†3 If the matching closing quote does not exist, the error is handled in the same way as before.
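
The three rules above can be sketched roughly as follows. This is an illustration only, not the code in lib/parser.js: `findTagEnd`, its parameters, and the `-1` return value for the error case are all assumptions made for this example.

```javascript
// Hypothetical sketch of rules 1-3; not the actual parser implementation.
// `xml` is the input string, `start` is an index inside a tag (after its '<').
// Returns the index of the '>' that ends the tag, or -1 if none is found.
function findTagEnd(xml, start) {
  let i = start;
  while (i < xml.length) {
    const c = xml[i];
    if (c === '"' || c === "'") {
      // Rules 1 and 2: jump to the next quote of the same kind.
      const close = xml.indexOf(c, i + 1);
      if (close === -1) {
        return -1; // †3: unterminated attribute value (assumed error signal)
      }
      i = close + 1; // continue scanning after the closing quote
      continue;
    }
    if (c === '>') {
      return i; // Rule 3: the first unquoted '>' ends the tag.
    }
    i++;
  }
  return -1; // input ended before the tag was closed
}

const sample = `<x some="a>b" any='c"d>e'>text>more</x>`;
const end = findTagEnd(sample, 1);
console.log(sample.slice(0, end + 1)); // <x some="a>b" any='c"d>e'>
```

Scanning for a single quote character is enough here because, per †1 and †2, a quoted attribute value cannot contain its own delimiter.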

The logic above is based on the following definitions.

・The grammar of the start tag (STag): only a tag name, Attributes, and whitespace (S) can appear in a start tag.

Citation.
https://www.w3.org/TR/xml/#sec-starttags

[40] STag ::= '<' Name (S Attribute)* S? '>' [WFC: Unique Att Spec]

EmptyElemTag has the same structure as the start tag (STag)

[44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>' [WFC: Unique Att Spec]

The grammar of Attribute

[41] Attribute ::= Name Eq AttValue [VC: Attribute Value Type] [WFC: No External Entity References] [WFC: No < in Attribute Values]

The grammar of AttValue and the productions it depends on

[10] AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';' [WFC: Legal Character]
[67] Reference ::= EntityRef | CharRef
[68] EntityRef ::= '&' Name ';' [WFC: Entity Declared] [VC: Entity Declared] [WFC: Parsed Entity] [WFC: No Recursion]
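
As a rough illustration of productions [10] and [66] to [68], the sketch below builds a regular expression for AttValue. `NAME`, `REFERENCE`, `ATT_VALUE` and `isAttValue` are hypothetical names, and `NAME` is deliberately simplified to ASCII (the real Name production allows many more Unicode characters). The well-formedness constraints (Legal Character, Entity Declared, and so on) are not checked.

```javascript
// Assumption: Name is simplified to ASCII letters, digits, '_', ':', '.', '-'.
const NAME = '[A-Za-z_:][A-Za-z0-9_:.\\-]*';

// [67] Reference ::= EntityRef | CharRef
const REFERENCE = `(?:&${NAME};|&#[0-9]+;|&#x[0-9a-fA-F]+;)`;

// [10] AttValue: a double-quoted or single-quoted value. Neither form may
// contain '<', a bare '&', or its own quote character; '>' is allowed.
const ATT_VALUE = new RegExp(
  `^(?:"(?:[^<&"]|${REFERENCE})*"|'(?:[^<&']|${REFERENCE})*')$`
);

// Hypothetical helper: true if `s` (including its quotes) matches AttValue.
function isAttValue(s) {
  return ATT_VALUE.test(s);
}
```

Note that '>' is accepted inside either quoted form, which is why the tag-end search has to skip quoted strings before looking for '>'.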

I also updated the assumptions in the test code and added test cases.
I would appreciate it if you would consider merging these changes.

@nikku
Owner

nikku commented Apr 7, 2020

Thanks for the continued work on the topic.

I see that this will have a positive impact on the code base and usage.


nikku commented Apr 8, 2020

Merged via f12ad15.

@nikku nikku closed this Apr 8, 2020

nikku commented Apr 8, 2020

I was able to further simplify the skipping logic (and save some bytes) via 2f208e2.

Thanks for your great work 🙏.
