Update docs #332

twm · 2017-04-15T19:25:19Z

When trying to figure out how to sanitize some HTML I noticed that the docs have lagged behind the implementation. So I removed the mention of HTMLSanitizer and then... things got out of hand. Summary of changes:

* Add tox -e doc environment so that it's easy to build the docs.
* Remove documentation of HTMLTokenizer, as it is now private.
* Remove documentation of HTMLSanitizer, as it no longer exists.
* Add basic documentation of the treeadapters package.
* Linkify many references to classes, functions, and modules.
* Fix various Sphinx warnings and formatting issues.
* Add __version__ to html5lib.__all__.
* Remove sub-modules from html5lib.treeadapters.__all__ (as this caused Sphinx warnings and didn't really make sense).

I didn't squash as suggested in CONTRIBUTING.md because recent PRs don't seem to be following that procedure. I am happy to do so if you like, or to try to break this into smaller PRs.

Right now the docs have entries for re-exports like html5lib.__init__.HTMLParser, including full class documentation. This is redundant with the docs for html5lib.html5parser.HTMLParser, which is a public name anyway, so I think that it is best to be explicit that this is a re-export.

HTMLTokenizer is now a private API (I cannot find a public export). HTMLSanitizer no longer exists as a tokenizer, and has been replaced with a filter.

willkg · 2017-04-15T20:05:52Z

The sanitizer still exists, but it got rewritten as a filter and is now in html5lib.filters.sanitizer. You can see it here:

https://github.com/html5lib/html5lib-python/blob/17499b9763a090f7715af49555d21fe4b558958b/html5lib/filters/sanitizer.py

Just a drive-by comment on the off chance it helps.

twm · 2017-04-15T20:18:49Z

@willkg Yup, and the new form is documented. The old docs just weren't removed.

Still lots more to do, as html5lib's sanitization likes to escape tags instead of dropping them. I ended up fixing up the html5lib docs while working on this: html5lib/html5lib-python#332

willkg

I'm really sorry this has been sitting around so long. I appreciate you working on it and fixing so many issues!

I have some comments. If you have time, can you look through them? If not, let me know and I can work through them.

Thank you! Looking forward to landing this!

willkg · 2017-10-31T18:14:28Z

html5lib/__init__.py

@@ -19,7 +28,8 @@
 from .serializer import serialize

 __all__ = ["HTMLParser", "parse", "parseFragment", "getTreeBuilder",
-           "getTreeWalker", "serialize"]
+           "getTreeWalker", "serialize", "__version__"]


willkg: I don't think we want __version__ to get imported when importing *.

twm: As __version__ is part of the public interface of this module, which __all__ defines, I think it should be here.

willkg: I wouldn't think about __all__ that way. It's really for specifying what gets exported if the user does from html5lib import *. I don't want __version__ to get exported in that scenario. Please undo this change.
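
To make the disagreement concrete, here is a minimal sketch of how `__all__` interacts with a star import. The module name `demo_pkg`, its `parse` function, and the version string are hypothetical stand-ins built in memory for illustration; none of this is html5lib's code.

```python
import sys
import types

# Hypothetical stand-in for a package's __init__.py; not html5lib's code.
SOURCE = '''
__version__ = "0.0.0-example"
__all__ = ["parse"]  # __version__ is deliberately left out


def parse(text):
    return text.strip()
'''

# Build the module in memory and register it so it can be imported by name.
demo = types.ModuleType("demo_pkg")
exec(SOURCE, demo.__dict__)
sys.modules["demo_pkg"] = demo

# Run a star import in a scratch namespace and record which names arrive.
scratch = {}
exec("from demo_pkg import *", scratch)
star_names = sorted(name for name in scratch if name != "__builtins__")

print(star_names)        # names copied by the star import
print(demo.__version__)  # still reachable as a module attribute
```

Under these assumptions the star import copies only the names listed in `__all__`, while `demo_pkg.__version__` stays available through explicit attribute access, which is the behaviour willkg asks for.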

willkg · 2017-10-31T18:14:45Z

html5lib/__init__.py


 # this has to be at the top level, see how setup.py parses this
+#: Distribution version number, which asymptotically approaches 1.


willkg: I'd remove this. Amongst other things, it's not going to be true for much longer.

twm: Thank goodness! Will do.

willkg · 2017-10-31T18:16:46Z

doc/movingparts.rst

@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
  the document

 * :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
-  ``LintError`` exceptions on invalid tag and attribute names, invalid
+  :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid


willkg: Is it really an AssertionError? If so, we should write up an issue to change that.

twm: Yeah, the implementation is basically all assert statements:

html5lib-python/html5lib/filters/lint.py

Lines 24 to 79 in 7bbde54

        assert namespace is None or isinstance(namespace, text_type)
        assert namespace != ""
        assert isinstance(name, text_type)
        assert name != ""
        assert isinstance(token["data"], dict)
        if (not namespace or namespace == namespaces["html"]) and name in voidElements:
            assert type == "EmptyTag"
        else:
            assert type == "StartTag"
        if type == "StartTag" and self.require_matching_tags:
            open_elements.append((namespace, name))
        for (namespace, name), value in token["data"].items():
            assert namespace is None or isinstance(namespace, text_type)
            assert namespace != ""
            assert isinstance(name, text_type)
            assert name != ""
            assert isinstance(value, text_type)

    elif type == "EndTag":
        namespace = token["namespace"]
        name = token["name"]
        assert namespace is None or isinstance(namespace, text_type)
        assert namespace != ""
        assert isinstance(name, text_type)
        assert name != ""
        if (not namespace or namespace == namespaces["html"]) and name in voidElements:
            assert False, "Void element reported as EndTag token: %(tag)s" % {"tag": name}
        elif self.require_matching_tags:
            start = open_elements.pop()
            assert start == (namespace, name)

    elif type == "Comment":
        data = token["data"]
        assert isinstance(data, text_type)

    elif type in ("Characters", "SpaceCharacters"):
        data = token["data"]
        assert isinstance(data, text_type)
        assert data != ""
        if type == "SpaceCharacters":
            assert data.strip(spaceCharacters) == ""

    elif type == "Doctype":
        name = token["name"]
        assert name is None or isinstance(name, text_type)
        assert token["publicId"] is None or isinstance(name, text_type)
        assert token["systemId"] is None or isinstance(name, text_type)

    elif type == "Entity":
        assert isinstance(token["name"], text_type)

    elif type == "SerializerError":
        assert isinstance(token["data"], text_type)

    else:
        assert False, "Unknown token type: %(type)s" % {"type": type}
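
If the asserts were ever replaced with a dedicated exception, as the question above hints, the checks might look roughly like the following. This is a sketch only: `LintError` and `lint_tokens` are hypothetical names, the token dictionaries are simplified, and this is not the code in `html5lib.filters.lint`. One practical difference is that explicit `raise` statements still run under `python -O`, which strips `assert` statements.

```python
class LintError(Exception):
    """Hypothetical dedicated exception; the quoted lint.py uses bare asserts."""


TAG_TYPES = ("StartTag", "EndTag", "EmptyTag")
TEXT_TYPES = ("Characters", "SpaceCharacters")


def lint_tokens(tokens):
    """Yield tokens unchanged, raising LintError on the first invalid one."""
    for token in tokens:
        kind = token.get("type")
        if kind in TAG_TYPES:
            name = token.get("name")
            # Tag names must be non-empty text.
            if not isinstance(name, str) or name == "":
                raise LintError("invalid tag name: %r" % (name,))
        elif kind in TEXT_TYPES:
            data = token.get("data")
            # Character data must be non-empty text.
            if not isinstance(data, str) or data == "":
                raise LintError("invalid character data: %r" % (data,))
        else:
            raise LintError("unknown token type: %r" % (kind,))
        yield token


stream = [
    {"type": "StartTag", "name": "p"},
    {"type": "Characters", "data": "hi"},
    {"type": "EndTag", "name": "p"},
]
checked = list(lint_tokens(stream))  # passes through untouched
```

A caller could then catch `LintError` specifically instead of the much broader `AssertionError`.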

willkg · 2017-10-31T18:17:25Z

doc/movingparts.rst



 Filters
 ~~~~~~~

-You can alter the stream content with filters provided by html5lib:
+html5lib provides several filters


Given that what follows is a bulleted list, can you add a : to the end of this line?

willkg · 2017-10-31T18:19:54Z

doc/movingparts.rst


+* :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
+* filters, to manipulate the token stream.


willkg: The leader has "a few tools", but this list has two items. Further, the items read like a sentence. I'd either change the bullet list into a sentence or unsentencify the bullet items. Maybe something like this:

html5lib provides two tools for consuming token streams:

* :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes
* filters for manipulating the token stream

twm: I ended up going with a sentence rather than the bulleted list. There aren't exactly two tools (really there's the serializer and a bunch of filters), so it's more like two categories of token consumer, but that distinction isn't really useful to point out.
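
The two categories of token consumer can be illustrated with a toy pipeline. This is a sketch under simplifying assumptions, not html5lib's API: `uppercase_text` and `render` are hypothetical names, and the tokens are bare dictionaries that carry only a `type` plus a `name` or `data` field.

```python
def uppercase_text(tokens):
    """A filter: consumes a token stream and yields a modified token stream."""
    for token in tokens:
        if token["type"] == "Characters":
            # Copy the token rather than mutating the caller's dictionary.
            token = dict(token, data=token["data"].upper())
        yield token


def render(tokens):
    """A toy serializer: consumes a token stream and produces bytes."""
    parts = []
    for token in tokens:
        if token["type"] == "StartTag":
            parts.append("<%s>" % token["name"])
        elif token["type"] == "EndTag":
            parts.append("</%s>" % token["name"])
        elif token["type"] == "Characters":
            parts.append(token["data"])
    return "".join(parts).encode("utf-8")


stream = [
    {"type": "StartTag", "name": "p"},
    {"type": "Characters", "data": "hi"},
    {"type": "EndTag", "name": "p"},
]

# Filters can be chained freely; the serializer sits at the end of the chain.
output = render(uppercase_text(stream))
print(output)
```

Because a filter both accepts and returns a token stream, any number of filters can be stacked before the serializer, which is why "the serializer and a bunch of filters" describes the arrangement better than "two tools".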

willkg · 2017-10-31T18:23:29Z

doc/html5lib.rst

@@ -1,13 +1,8 @@
 html5lib Package
 ================

-:mod:`html5lib` Package
-----------------------


willkg: Why take the header out here?

twm: Otherwise there are two headers in a row with exactly the same text.

twm · 2017-11-02T03:39:40Z

I think that I have addressed all the issues you noted as appropriate. Please let me know if anything else is required! Thanks!

willkg

I have one outstanding issue with __version__ being listed in __all__. Otherwise this is good to go. Thank you!

willkg · 2017-11-06T20:03:41Z

@twm Thank you so much for this!

twm added 14 commits April 15, 2017 08:22

Fix formatting of docstring example

224d9f4

It runs together in the built HTML.

Use with, it's idiomatic

3fb6af3

Fix typo in changelog

ba63e09

Export and document html5lib.__version__

6b99d52

It's not much use if it's private.

Add a documentation env to tox.ini

323d736

Run "tox -e doc" to build the documentation in doc/_build.

Remove docs for HTMLTokenizer and HTMLSanitizer

abf6224

HTMLTokenizer is now a private API (I cannot find a public export). HTMLSanitizer no longer exists as a tokenizer, and has been replaced with a filter.

Fix Sphinx title underline warnings

8554098

Open in binary mode for Python 3

c8fca0e

Update and expand "moving parts" doc

637826f

Add treeadapters package doc

254fc90

Remove duplicate header

deb4206

Link to the spec

2909867

Add myself to AUTHORS

739dcf0

willkg added this to the 1.0 milestone Oct 3, 2017

willkg added 2 commits October 31, 2017 14:11

Merge branch 'master' into update-docs

cbaf304

Merge branch 'master' into update-docs

fc69044

willkg reviewed Oct 31, 2017

twm added 5 commits November 1, 2017 20:01

Add missing colon

f25d7c0

Rework token stream intro

5eb89cc

Merge remote-tracking branch 'upstream/master' into update-docs

d270666

Remove textual backticks in changelog

cb2702c

reST doesn't support nesting inline markup, so this shows up in rendered form with the backticks.

Asymptote no more

1084ed0

willkg requested changes Nov 2, 2017

Remove __version__ from __all__

deb98bb

willkg approved these changes Nov 6, 2017

willkg merged commit 69606e5 into html5lib:master Nov 6, 2017

willkg mentioned this pull request Nov 9, 2017

Links to tokenizer.py and sanitizer.py broken #313

Closed

hugovk mentioned this pull request Dec 4, 2017

update CHANGES.rst and AUTHORS.rst for 1.0 release #382

Closed
