Correct spacing between block tags and flow tags #10

Closed
onizet opened this issue Oct 20, 2017 · 1 comment

onizet commented Oct 20, 2017

[Copied from codeplex]

In the constructor of class HtmlEnumerator, this line:

```c#
html = Regex.Replace(html, @"(\s*)(</?(p |div|br|body)[^>]*/?>)(\s*)", "$2", RegexOptions.Multiline| RegexOptions.IgnoreCase);
```

must be changed to:

```c#
html = Regex.Replace(html, @"(\s*)(</?(\bp\b|\bdiv\b|\bbr\b|\bbody\b)[^>]*/?>)(\s*)", "$2", RegexOptions.Multiline | RegexOptions.IgnoreCase);
```

This way the regex matches whole tag names rather than leading characters
(example: it matches the tag 'p' but not the tag 'pre').
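
To illustrate the difference, here is a minimal sketch that runs both alternations over a `<pre>` block. The class and member names (`TagSpacingDemo`, `Strip`, `PrefixPattern`, `BoundaryPattern`) are made up for this example and are not part of the library. The prefix pattern is written with a plain `p`, which is an assumption: the old line as pasted above shows `p ` with a space, which may be a copy artifact.

```c#
using System;
using System.Text.RegularExpressions;

public static class TagSpacingDemo
{
    // Assumed prefix-style alternation: "p" also matches the start of "pre".
    public const string PrefixPattern =
        @"(\s*)(</?(p|div|br|body)[^>]*/?>)(\s*)";

    // Word-boundary alternation proposed in this issue.
    public const string BoundaryPattern =
        @"(\s*)(</?(\bp\b|\bdiv\b|\bbr\b|\bbody\b)[^>]*/?>)(\s*)";

    // Removes the white space around every tag matched by the pattern.
    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, "$2",
            RegexOptions.Multiline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<p>\nHello</p>\n<pre> \nHello\n  world!!!  </pre>";

        // Escape line breaks so the difference is visible on one line.
        Console.WriteLine(Strip(html, PrefixPattern).Replace("\n", "\\n"));
        Console.WriteLine(Strip(html, BoundaryPattern).Replace("\n", "\\n"));
    }
}
```

With the prefix pattern, the white space just inside `<pre>` and `</pre>` is stripped along with the white space around `<p>`. With the word-boundary pattern, the `<pre>` block is left untouched.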

Another problem: the spaces after a flow tag (example: <b> or <i>) are deleted.
To retain these spaces, you can modify this line of code in the MoveUntilMatch function of the HtmlEnumerator class:

while ((success = en.MoveNext()) && (current = en.Current.Trim('\n', '\r')).Length == 0) ;

changed to:

while ((success = en.MoveNext()) && (current = en.Current.Trim('\r')).Length == 0) ;
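
A minimal sketch of what this change does to a token that holds only a line break. The token list and the helper `NextNonEmpty` are hypothetical stand-ins for the enumerator used by MoveUntilMatch, not the library's actual tokenization:

```c#
using System;
using System.Collections.Generic;

public static class TrimSkipDemo
{
    // Hypothetical stand-in for the loop in MoveUntilMatch: advance until a
    // token is non-empty after trimming the given characters and return it.
    // Returns an empty string when the tokens run out.
    public static string NextNonEmpty(IEnumerable<string> tokens, params char[] trimChars)
    {
        using (IEnumerator<string> en = tokens.GetEnumerator())
        {
            string current = string.Empty;
            bool success;
            while ((success = en.MoveNext())
                   && (current = en.Current.Trim(trimChars)).Length == 0) { }
            return success ? current : string.Empty;
        }
    }

    public static void Main()
    {
        // Assumed token stream: a bare line break followed by a flow tag.
        var tokens = new List<string> { "\n", "<i>" };

        // Old behaviour: the line break trims to "" and is skipped.
        Console.WriteLine(NextNonEmpty(tokens, '\n', '\r').Replace("\n", "\\n"));

        // New behaviour: the line break survives and is returned as a token.
        Console.WriteLine(NextNonEmpty(tokens, '\r').Replace("\n", "\\n"));
    }
}
```

Under these assumptions, trimming only `'\r'` lets the line break between `</b>` and the following text reach the caller, where it can be rendered as a space.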

This is an HTML example to parse:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta content="text/html; charset=ISO-8859-1"
 http-equiv="content-type" />
  <title></title>
</head>
<body>
<p>
Hello <b>beautiful</b>
world!!!</p>
<p>
Hello <b>beautiful</b>
<i>world!!!</i></p>
  <p>Lorem Ipsum</p>
 <pre> 
Hello
  world!!!  </pre>
</body>
</html>

This is the current behaviour:
[image: linebreakold]

This is the new behaviour with the modified code:
[image: linebreaknew]

onizet wrote Dec 17, 2015 at 2:47 PM

I did test your first fix, but I didn't have much time for extensive testing.
I'm glad you came back with this troubleshooting, because I found the same bug but didn't make the link with the regex changes.
So thanks, you made my day :-)

onizet wrote Dec 17, 2015 at 3:50 PM

If I'm not mistaken, I only need \bp\b, because the other tag names (div, br, body) don't collide with any other HTML tags.
So I can keep just:

html = Regex.Replace(html, @"(\s*)(</?(\bp\b|div|br|body)[^>]*/?>)(\s*)", "$2", RegexOptions.Multiline| RegexOptions.IgnoreCase);

giorand wrote Dec 17, 2015 at 4:32 PM

You're right.
Keeping the other \b would only be for the sake of logical consistency.

onizet wrote Dec 17, 2015 at 5:29 PM

About your statement:
"Another problem: the spaces after a flow tag (example: <b> or <i>) are deleted"
If you paste your HTML into a browser, you will see that they are deleted there too.

Associated with changeset 90889: This is a major commit about the RowSpan bug (#13058, #12781, #13689). It also includes the fix from giorand about spaces.

giorand wrote Dec 18, 2015 at 7:46 AM

You're right, in the browser there is a space between the words 'beautiful' and 'world'.
But if you parse it with the current DLL, the result in Word 2013 is 'beautifulworld' without a space (as you can see in the first image).

onizet wrote Jan 12, 2016 at 9:25 PM

Just to let you know that I'm still working on this issue, which I consider major.

@onizet onizet self-assigned this Oct 20, 2017

onizet commented Jan 12, 2018

Rules for correct white space handling:
https://www.w3.org/TR/css-text-3/#white-space-processing
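
As a rough sketch of the core idea in those rules, assuming the default `white-space: normal` behaviour: runs of spaces, tabs and line breaks collapse to a single space, while preformatted text (such as `<pre>`) is kept as is. This is a deliberately simplified reading. It ignores the trimming of spaces at line starts and ends, the language-dependent treatment of line breaks, and everything else in the specification. `WhiteSpaceSketch` and `Collapse` are hypothetical names, not part of the library.

```c#
using System;
using System.Text.RegularExpressions;

public static class WhiteSpaceSketch
{
    // Simplified collapsing: every run of spaces, tabs and line breaks becomes
    // one space, unless the text is preformatted (assumed flag, e.g. for <pre>).
    public static string Collapse(string text, bool preserve = false)
    {
        if (preserve)
        {
            return text;
        }
        return Regex.Replace(text, @"[ \t\r\n]+", " ");
    }

    public static void Main()
    {
        // The line break after </b> becomes a space instead of disappearing.
        Console.WriteLine(Collapse("Hello <b>beautiful</b>\nworld!!!"));

        // Preformatted content keeps its spacing and line breaks.
        Console.WriteLine(Collapse(" \nHello\n  world!!!  ", preserve: true));
    }
}
```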

@onizet onizet closed this as completed Jul 8, 2024