Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handle double quoted string #342

Closed
wants to merge 1 commit into from
Closed

Fix handle double quoted string #342

wants to merge 1 commit into from

Conversation

sters
Copy link

@sters sters commented Jan 28, 2023

Background

This Example 7.5 Double Quoted Line Breaks case is not working properly. Some related cases as well.
https://yaml.org/spec/1.2.2/#example-double-quoted-line-breaks

I'd like to handle correctly this behavior.

What are these changes

  • Edited on scanner/scanner.go
    • Added utility function: isWhiteChar
      • Confirm the char is space or tab.
    • Added utility function: isOnlyWhiteToLineEnds
      • Confirm only white between current index - newline.
    • Added new function: scanNextEmptyLines
      • This function is aimed to scan the next empty lines and skip do scan.
    • Added docStartLine field in Scanner.
      • I'm not sure a: |\n Text\n seems not worked properly. It will be used to fix this issue with scanLiteral changes.
    • Changes in scanDoubleQuote
      • When a newline is found, skip the next empty lines. To ignore new line if next empty lines exist.
      • When white is found and it continues to a newline, ignore it.
      • When escape char \is found, ignore the next char.
  • Also added some more testcases.

@goccy
Copy link
Owner

goccy commented Mar 19, 2023

Thank you for your great PR . I've commented

@goccy goccy added the reviewed label Mar 19, 2023
@sters
Copy link
Author

sters commented Mar 20, 2023

@goccy

Thank you for your great PR . I've commented

Thank you for your review. Where can I find your comments?

return c == ' ' || c == '\t'
}

func (s *Scanner) isOnlyWhiteToLineEnds(src []rune, size, idx int) bool {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not get the intent of the name of this function from the process. Is there a more appropriate name ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@goccy

Thank you for commenting.

Before begin, in line 126, maybe isWhiteSpaceChar is more clear.

  • From YAML spec, blank space + tab characters combined symbol is called White Space Characters.
  • This function aims to detect whether a single char is included in the White Space Characters or not.
  • Also, isNewLineChar is already exists. Following isXxxxChar is better.

And then, How about isWhiteSpaceCharUntilNewLineChar on this function?

  • Just combined: isWhiteSpaceChar until NewLineChar?
  • BTW, from YAML spec, it says Line Break Characters, but there is already exists isNewLineChar. So NewLine is more clear instead BreakLine.

I can provide some more ideas with these combinations:

  1. is/are
  2. WhiteSpaceChar/WhiteSpaceChars
  3. To/Until
  4. NewLine/LineBreak/LineEnds

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this name ( isWhiteSpaceCharUntilNewLineChar or isOnlyWhiteToLineEnds ) itself is good. What I am wondering is that with this English meaning, the following string would be true. But in fact, it is false.

<white-space><white-space><new-line>

Specifically, I don't understand why it is returning false when isWhiteChar 🤔 .

Copy link
Author

@sters sters Mar 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry. I remembered the details wrong actually this PR was created 2 months ago. And I found there are some mistakes such as for i := idx + 1; idx < size; idx++ (i is not incrementing) 🙇


YAML spec says...

All leading and trailing white space characters on each line are excluded from the content.

ref https://yaml.org/spec/1.2.2/#example-double-quoted-line-breaks

There is already covered "leading" part.

		} else if s.isWhiteChar(c) && isFirstLineChar {
			continue

ref. https://github.com/goccy/go-yaml/pull/342/files#diff-7d1df5e49f35d124e1d892865498cde1749af35afbb0f0022499a4b1b09a91eaR365-R366

This isOnlyWhiteToLineEnds function aims to cover "trailing" part.


Now, I have another idea for this topic:

  • Create a new process to support the YAML spec instead this isOnlyWhiteToLineEnds discussion.
  • In scanDoubleQuote function, if whitespace is found, try to scan the next whitespaces until another char.
    • If there are multiple whitespaces, keep them as the buffer.
    • If another char is newline, drop those whitespaces and continue to scanDoubleQuote loop.
    • If another char is not newline, the buffer is joined to value to handle as whitespace chars and continue to scanDoubleQuote loop.
    • If reached the size limit, same as "another char is not newline" case. (btw, this is the unexpected case because it should have " as end of double quote)
  • notes: This logic can be support "leading" and "trailing" both part.

The reason for why "scan" instead "check", it avoids multiple times check whitespaces like this:

aaa<whites><whites><whites><newline>bbb

What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding image:

} else if s.isWhiteChar(c) {
    // scan multiple whitespaces
    buf := []rune{}
    for ; idx < size; idx++ {
        c := src[idx]
        if !s.isWhiteChar(c) {
            break
        }
        buf = append(buf, c)
    }

    // skip if whitespaces are in leading or trailing
    if isFirstLineChar || s.isNewLineChar(src[idx]) {
        idx-- // handle latest char in main scan loop.
        continue
    }

    // handle as whitespace chars if intermediate whitespaces
    value = append(value, buf...)
    idx-- // handle latest char in main scan loop.
    continue

@sters
Copy link
Author

sters commented Sep 4, 2023

@goccy

Thanks for reviewing this PR.

I don't have enough time for handle this PR for now. Let me close PR.
And I'll revisit this when I have enough time. 🙇

@sters sters closed this Sep 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants