Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update docs #332

Merged
merged 22 commits into from
Nov 6, 2017
Merged

Update docs #332

merged 22 commits into from
Nov 6, 2017

Conversation

twm
Copy link
Contributor

@twm twm commented Apr 15, 2017

When trying to figure out how to sanitize some HTML I noticed that the docs have lagged behind the implementation. So I removed the mention of HTMLSanitizer and then... things got out of hand. Summary of changes:

  • Add tox -e doc environment so that it's easy to build the docs.
  • Remove documentation of HTMLTokenizer, as it is now private.
  • Remove documentation of HTMLSanitizer, as it no longer exists.
  • Add basic documentation of the treeadapters package.
  • Linkify many references to classes, functions, and modules.
  • Fix various Sphinx warnings and formatting issues.
  • Add __version__ to html5lib.__all__
  • Remove sub-modules from html5lib.treeadapters.__all__ (as this caused Sphinx warnings and didn't really make sense).

I didn't squash as suggested in CONTRIBUTING.md because recent PRs don't seem to be following that procedure. I am happy to do so if you like, or to try to break this into smaller PRs.

twm added 14 commits April 15, 2017 08:22
It runs together in the built HTML.
It's not much use if it's private.
Run "tox -e doc" to build the documentation in doc/_build.
Right now the docs have entries for re-exports like
html5lib.__init__.HTMLParser, including full class documentation. This
is redundant with the docs for html5lib.html5parser.HTMLParser, which
is a public name anyway, so I think that it is best to be explicit that
this is a re-export.
HTMLTokenizer is now a private API (I cannot find a public export).
HTMLSanitizer no longer exists as a tokenizer, and has been replaced
with a filter.
@willkg
Copy link
Contributor

willkg commented Apr 15, 2017

The sanitizer still exists, but it got rewritten as a filter and is now in html5lib.filters.sanitizer. You can see it here:

https://github.com/html5lib/html5lib-python/blob/17499b9763a090f7715af49555d21fe4b558958b/html5lib/filters/sanitizer.py

Just a drive-by comment on the off chance it helps.

@twm
Copy link
Contributor Author

twm commented Apr 15, 2017

@willkg Yup, and the new form is documented. The old docs just weren't removed.

twm added a commit to twm/yarrharr that referenced this pull request Jul 24, 2017
Still lots more to do, as html5lib's sanitization likes to escape tags
instead of dropping them.

I ended up fixing up the html5lib docs while working on this:
html5lib/html5lib-python#332
@willkg willkg added this to the 1.0 milestone Oct 3, 2017
Copy link
Contributor

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really sorry this has been sitting around so long. I appreciate you working on it and fixing so many issues!

I have some comments. If you have time, can you look through them? If not, let me know and I can work through them.

Thank you! Looking forward to landing this!

@@ -19,7 +28,8 @@
from .serializer import serialize

__all__ = ["HTMLParser", "parse", "parseFragment", "getTreeBuilder",
"getTreeWalker", "serialize"]
"getTreeWalker", "serialize", "__version__"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want __version__ to get imported when importing *.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As __version__ is part of the public interface of this module, which __all__ defines, I think it should be here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't think about __all__ that way. It's really for specifying what gets exported if the user does from html5lib import *. I don't want __version__ to get exported in that scenario. Please undo this change.


# this has to be at the top level, see how setup.py parses this
#: Distribution version number, which asymptotically approaches 1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this. Amongst other things, it's not going to be true for much longer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank goodness! Will do.

@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
``LintError`` exceptions on invalid tag and attribute names, invalid
:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really an AssertionError? If so, we should write up an issue to change that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the implementation is basically all assert statements:

assert namespace is None or isinstance(namespace, text_type)
assert namespace != ""
assert isinstance(name, text_type)
assert name != ""
assert isinstance(token["data"], dict)
if (not namespace or namespace == namespaces["html"]) and name in voidElements:
assert type == "EmptyTag"
else:
assert type == "StartTag"
if type == "StartTag" and self.require_matching_tags:
open_elements.append((namespace, name))
for (namespace, name), value in token["data"].items():
assert namespace is None or isinstance(namespace, text_type)
assert namespace != ""
assert isinstance(name, text_type)
assert name != ""
assert isinstance(value, text_type)
elif type == "EndTag":
namespace = token["namespace"]
name = token["name"]
assert namespace is None or isinstance(namespace, text_type)
assert namespace != ""
assert isinstance(name, text_type)
assert name != ""
if (not namespace or namespace == namespaces["html"]) and name in voidElements:
assert False, "Void element reported as EndTag token: %(tag)s" % {"tag": name}
elif self.require_matching_tags:
start = open_elements.pop()
assert start == (namespace, name)
elif type == "Comment":
data = token["data"]
assert isinstance(data, text_type)
elif type in ("Characters", "SpaceCharacters"):
data = token["data"]
assert isinstance(data, text_type)
assert data != ""
if type == "SpaceCharacters":
assert data.strip(spaceCharacters) == ""
elif type == "Doctype":
name = token["name"]
assert name is None or isinstance(name, text_type)
assert token["publicId"] is None or isinstance(name, text_type)
assert token["systemId"] is None or isinstance(name, text_type)
elif type == "Entity":
assert isinstance(token["name"], text_type)
elif type == "SerializerError":
assert isinstance(token["data"], text_type)
else:
assert False, "Unknown token type: %(type)s" % {"type": type}



Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:
html5lib provides several filters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that what follows is a bulleted list, can you add a : to the end of this line?


* :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
* filters, to manipulate the token stream.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leader has "a few tools", but this list has two items. Further, the items seem like a sentence. I'd either change the bullet list into a sentence or unsentencify the bullet items. Maybe something like this?:

html5lib provides two tools for consuming token streams:

* :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes
* filters for manipulating the token stream

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up going with a sentence, rather than the bulleted list, as there aren't exactly two (really there's the serializer and a bunch of filters), so it's really two categories of token consumer, but that distinction isn't really useful to point out.

@@ -1,13 +1,8 @@
html5lib Package
================

:mod:`html5lib` Package
-----------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why take the header out here?

Copy link
Contributor Author

@twm twm Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise there are two headers in a row with exactly the same text.

@twm
Copy link
Contributor Author

twm commented Nov 2, 2017

I think that I have addressed all the issues you noted as appropriate. Please let me know if anything else is required! Thanks!

Copy link
Contributor

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one outstanding issue with __version__ being listed in __all__. Otherwise this is good to go. Thank you!

@willkg willkg merged commit 69606e5 into html5lib:master Nov 6, 2017
@willkg
Copy link
Contributor

willkg commented Nov 6, 2017

@twm Thank you so much for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants