html conversion to xml leaves many tags unclosed #792

Closed
davidmichaelkarr opened this issue Jan 12, 2019 · 3 comments

Comments

@davidmichaelkarr

I'm sure I'm doing something wrong, or have some misconception about this, but I'm trying to process the HTML files in the JDK 1.8 javadoc tree with tidy so that I can run "xpath" on them. So far, tidy is not fully converting them to valid XML. I'm using the options "--html --xmlout". The first thing I see is that all the "meta" and "link" elements in the "head" elements are left unchanged, that is, still unclosed.
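To illustrate why unclosed "meta" and "link" elements matter for XPath tooling, here is a minimal Python sketch. The two markup strings are hypothetical examples, not excerpts from the actual javadoc files, and the standard-library `xml.etree.ElementTree` parser stands in for whatever XML tool is really in use. It only demonstrates that a strict XML parser rejects the unclosed form and accepts the self-closed one.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments for illustration; not copied from the JDK javadoc.
UNCLOSED = (
    '<html><head><meta charset="utf-8">'
    '<link rel="stylesheet" href="s.css">'
    '<title>t</title></head><body/></html>'
)
CLOSED = (
    '<html><head><meta charset="utf-8"/>'
    '<link rel="stylesheet" href="s.css"/>'
    '<title>t</title></head><body/></html>'
)


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(UNCLOSED))  # False: <meta> and <link> are never closed
    print(is_well_formed(CLOSED))    # True
```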

@geoffmcl
Contributor

@davidmichaelkarr thanks for the issue... but not sure what you are doing... I guess it seems something like convert html to xml?

If that is it, I am not sure such conversion is a primary purpose of tidy...

I just know there have been no code changes, fixes, or improvements in the --output-xml yes option code in many, MANY, years... and maybe there are other existing xml issues/bugs... need to check...

But maybe something can be done... when we know what to fix, improve, etc...

First let's try to understand the exact configuration options you tried - please be precise -

  • --html - What option is this?
  • --xmlout - What option is this?

Then a small sample of the problem(s)... and what version of tidy are you using?

Can you construct a minimum input case, show what config you set, what output you get, and what output you expect... thanks...

@davidmichaelkarr
Author

I eventually found the "-axsml" option, and that appears to consistently work, so I think I'm ok for now.

@geoffmcl
Contributor

@davidmichaelkarr assuming you meant option -asxml, or -asxhtml - please take care...

That is --output-xml yes, or in a cfg file, output-xml: yes, and glad to hear you found the output consistent... from whatever version of tidy you are using...
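As a rough sketch of the workflow discussed here (run tidy with -asxml, then query the result with XPath), the Python below may help. The helper names `tidy_command`, `run_tidy`, and `link_hrefs` are hypothetical, as is the `SAMPLE` string, which only imitates the kind of XHTML one might get back. The sketch assumes the output carries the XHTML namespace, so path expressions need a namespace prefix; without it, a query such as `./head/link` matches nothing.

```python
import subprocess
import xml.etree.ElementTree as ET

# Namespace that XHTML documents declare on the root element.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample standing in for tidy's output on one javadoc page.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<meta name="date" content="2019-01-12" />'
    '<link rel="stylesheet" type="text/css" href="stylesheet.css" />'
    '<title>Sample</title></head><body></body></html>'
)


def tidy_command(path):
    """Build the argument list for a quiet tidy run with -asxml."""
    return ["tidy", "-quiet", "-asxml", path]


def run_tidy(path):
    """Run tidy on a file and return its standard output as text.

    tidy exits with status 1 when it only reports warnings, so the
    return code is deliberately not treated as a failure here.
    """
    result = subprocess.run(tidy_command(path), capture_output=True, text=True)
    return result.stdout


def link_hrefs(xhtml):
    """Return the href of every <link> in <head>, using a namespaced path."""
    root = ET.fromstring(xhtml)
    return [el.get("href") for el in root.findall("./x:head/x:link", XHTML_NS)]


if __name__ == "__main__":
    print(link_hrefs(SAMPLE))  # ['stylesheet.css']
```

If the real output contains named entities such as `&nbsp;`, this particular parser will reject them; tidy's numeric-entities option may be worth trying in that case.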

Just searching around, and there may be other related issues, found open #767, but that is with -ashtml... still searching...

So maybe -asxml could also have problems... but would appreciate a new issue on this/these...

Meantime closing this... thanks...
