create re-usable Cleaner class #125

Closed
jvanasco opened this Issue Apr 29, 2014 · 3 comments

jvanasco commented Apr 29, 2014

While glancing at the source, the inner workings of clean() caught my eye.

The current structure creates 2 object instances to clean a single bit of text. If you do a lot of cleaning to a standard configuration, a factory pattern works really well.

I tried something and set up a test suite to see if there would be a noticeable difference. html5lib tends to be bulky and annoying, and it often lends itself well to caching.

Using the factory pattern below to cache a "cleaner", we average a 15% speedup over the default clean() if you clean more than 10 items. If you're cleaning hundreds of items, the savings seem to be over 40%.

# Assumes this sits next to clean() in bleach 1.x's __init__.py, where html5lib,
# force_unicode, _render, BleachSanitizer and the ALLOWED_* defaults are in scope.
class _AbstractBleachCleaner(object):
    """Caches the `sanitizer` and `parser` objects and exposes a `clean` method."""
    sanitizer = None
    parser = None

    def clean(self, text):
        """Clean `text` using the cached parser."""
        if not text:
            return ''
        text = force_unicode(text)
        return _render(self.parser.parseFragment(text))

def cleaner_factory(tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
                    styles=ALLOWED_STYLES, strip=False, strip_comments=True):
    """Generate a custom bleach cleaner that can be re-used."""
    # Subclass the sanitizer once so the settings are baked into the class.
    class s(BleachSanitizer):
        allowed_elements = tags
        allowed_attributes = attributes
        allowed_css_properties = styles
        strip_disallowed_elements = strip
        strip_html_comments = strip_comments
    # Build the html5lib parser once; every clean() call reuses it.
    p = html5lib.HTMLParser(tokenizer=s)
    customCleaner = _AbstractBleachCleaner()
    customCleaner.sanitizer = s
    customCleaner.parser = p
    return customCleaner

myCleaner = cleaner_factory()
print(myCleaner.clean(sample_text))  # sample_text: any HTML string to clean
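
For anyone who wants to check the "build once, reuse" effect on their own data, here is a rough, stdlib-only sketch of how such a comparison could be timed. It does not use bleach or html5lib: `_TextOnly`, `ReusableCleaner`, `clean_fresh` and `compare` are hypothetical stand-ins made up for illustration, and the stand-in parser simply drops all tags, so the numbers it prints say nothing about bleach itself.

```python
import timeit
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Stand-in 'sanitizer' (NOT bleach): keeps text nodes, drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


class ReusableCleaner:
    """Builds the parser once and reuses it for every clean() call."""

    def __init__(self):
        self._parser = _TextOnly()

    def clean(self, text):
        if not text:
            return ''
        parser = self._parser
        parser.reset()        # clear parser state left over from the last call
        parser.parts = []
        parser.feed(text)
        parser.close()        # flush any text still buffered
        return ''.join(parser.parts)


def clean_fresh(text):
    """Mirrors the one-shot style: construct everything anew on each call."""
    return ReusableCleaner().clean(text)


def compare(fresh, reused, texts, repeat=5):
    """Return (fresh_seconds, reused_seconds, fractional_saving) for `texts`."""
    t_fresh = min(timeit.repeat(lambda: [fresh(t) for t in texts],
                                number=10, repeat=repeat))
    t_reused = min(timeit.repeat(lambda: [reused(t) for t in texts],
                                 number=10, repeat=repeat))
    return t_fresh, t_reused, (t_fresh - t_reused) / t_fresh


if __name__ == "__main__":
    samples = ['<p>Hello <b>world</b></p>'] * 200
    cached = ReusableCleaner()
    t_fresh, t_reused, saving = compare(clean_fresh, cached.clean, samples)
    print("fresh: %.4fs  reused: %.4fs  saving: %.0f%%"
          % (t_fresh, t_reused, saving * 100))
```

To measure the real thing, one would pass bleach's own clean function as `fresh` and `myCleaner.clean` from the factory above as `reused`. Taking the minimum over several repeats is just one way to damp timing noise.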

@jvanasco jvanasco changed the title from Docs suggestion, possible feature change to Docs suggestion, possible feature Apr 29, 2014

@jsocol jsocol added this to the v2.0 milestone Apr 29, 2014

Member

jsocol commented Apr 29, 2014

I want to take a deeper look at Bleach's structure. It's always had the dubious honor of being my first major Python project, and there are lots of things I dislike, both from a code-organization standpoint (so much in __init__.py) and from a complexity/refactoring standpoint.

I don't know exactly when or how, but I want this to be a design consideration for bleach 2.0. Something closer to what you've done may be backported to 1.x once we've nailed down the API. (Though it depends on the EOL plan for 1.x.)

Member

willkg commented Mar 3, 2017

I finally looked into this and did some tinkering with the Bleach 2.0 rewrite.

I like the idea of having a Cleaner class that captures settings and reuses html5lib bits that can be reused. I'm going to implement that for Bleach 2.0.

@willkg willkg changed the title from Docs suggestion, possible feature to create re-usable Cleaner class Mar 3, 2017

Member

willkg commented Mar 4, 2017

This is done in PR #257. It'll be in Bleach 2.0.

@willkg willkg closed this Mar 4, 2017

jvanasco commented Mar 4, 2017
