Made protocols configurable. #149
This is something we need too. Would appreciate a merge of this to master. |
What are the specifics of the use case here? What protocols do you want to allow through? |
In our case we parse HTML from e-mail messages. There, the "cid" protocol is used to link to inline images. |
+1 for this feature as well |
@AndreasMalecki @DavidMuller What's your use case? |
I'm also looking at using bleach to clean HTML email messages. I haven't gotten to trying the cid-protocol yet, but I'll need it eventually. |
@nickburlett What're you cleaning HTML messages for? Is it to display in a browser? Is it for storage? Something else? |
@willkg Our use case involves allowing certain iOS "URL schemes". For example, to direct a user to the SMS app on their iPhone, we would like to be able to preserve
# current behavior
In [8]: sms_string = '<a href="sms:">Launch Messages App</a>'
In [9]: bleach.clean(sms_string)
Out[9]: u'<a>Launch Messages App</a>' |
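The behavior requested here can be sketched without bleach itself. The following is a minimal, standard-library-only illustration of a configurable protocol allowlist; `href_allowed` and `DEFAULT_PROTOCOLS` are hypothetical names invented for this example, not bleach's API, and the check is deliberately simplified (it is not a hardened sanitizer).

```python
import re

# RFC 3986 scheme syntax: a letter followed by letters, digits, "+", "-" or ".".
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def href_allowed(href, allowed_protocols):
    """Return True if href has no scheme, or its scheme is in the allowlist.

    Hypothetical helper for illustration only.
    """
    match = _SCHEME_RE.match(href.strip())
    if match is None:
        # No scheme found: treat as a relative URL or fragment.
        return True
    return match.group(1).lower() in allowed_protocols


# Hypothetical default list; the caller decides what to add.
DEFAULT_PROTOCOLS = ["http", "https", "mailto"]

print(href_allowed("sms:", DEFAULT_PROTOCOLS))            # rejected by default
print(href_allowed("sms:", DEFAULT_PROTOCOLS + ["sms"]))  # allowed once configured
```

The point of the sketch is that the allowlist is an argument rather than a hard-coded constant, so an application that needs `sms:` links can opt in without patching the library.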
@willkg my plan is for display in a browser. |
As a general workaround, this works for now, it's just incredibly ugly: |
@willkg is this a feature you guys are considering merging? |
It's still open, so it's still in progress. Generally with bleach I want to add as little as humanly possible. For now, every code change and every new feature needs to be very compelling and have a well defined and documented reason to exist. I'm still wrapping my head around the underlying problem. It'd help to have an issue in the issue tracker that walks through the problem, the impact of the problem, what kinds of things the problem prevents, the work-arounds and then possible solutions. I haven't looked at this since last week. Generally, I'm wondering the following:
That's where I'm at. |
We required some additional acceptable protocols like "smb" and had to use a monkey patch. |
|
I think the main question (in my mind) is whether the developer needs configurable protocols or whether there are just additional protocols that bleach should accept as allowed by default. If there are use cases where you need arbitrary protocols (I think mobile phones might work by having each app register a unique protocol used to open that app up?) then I think it needs to be configurable, since there is no way to enumerate all possible protocols that a user of bleach may want to use. If the use cases are simply "here's some additional safe protocols that should be allowed" then I think the way forward would be to just add some more protocols to the list of allowed protocols. |
@dstufft: I believe the developer needs configurable protocols. The set of safe protocols varies by use case. Not everyone will want the |
+1 for this feature as well. We're trying to support custom protocols (in our case, the protocol is actually more proprietary than the protocols specified in this conversation) in one of our projects and would benefit from this PR. |
I'm on board for adding this feature. I see the compelling use case and I don't see other viable solutions with the current architecture. When I get a chance, I'll work through the PR and we can move forward. I'll try to do it by the end of Friday. |
Thanks @willkg! Looking forward to this one. Great lib and thanks for your work. |
@@ -43,6 +44,8 @@

ALLOWED_STYLES = []

ALLOWED_PROTOCOLS = copy.copy(HTMLSanitizer.acceptable_protocols)
Given that we're now "owning this" and we want to document it explicitly, I think it's prudent that, instead of copying it with copy.copy, we make it explicitly defined here:
ALLOWED_PROTOCOLS = [
u'ed2k', u'ftp', u'http', u'https', u'irc', u'mailto', u'news', u'gopher', u'nntp',
u'telnet', u'webcal', u'xmpp', u'callto', u'feed', u'urn', u'aim', u'rsync', u'tag',
u'ssh', u'sftp', u'rtsp', u'afs', u'data'
]
That way our list won't change between versions of html5lib and bleach can explicitly declare what we think is appropriate regardless of what html5lib says.
Related to this, I'm not sure I'd consider that list a list of "safe protocols". I think I'd want to trim it down to a much smaller subset. Maybe this much more conservative set:
ALLOWED_PROTOCOLS = [u'http', u'https', u'mailto']
Anyone have thoughts on that?
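To make the trade-off concrete, here is a small sketch contrasting the two approaches. `UpstreamSanitizer` is a hypothetical stand-in for the upstream class (its list is invented for the example and is not html5lib's real value), and `extend_protocols` is an illustrative helper, not part of bleach.

```python
import copy


class UpstreamSanitizer:
    """Hypothetical stand-in for an upstream sanitizer class."""
    # Invented for this example; may change whenever upstream changes.
    acceptable_protocols = ["http", "https", "mailto", "ftp", "data"]


# Option A: snapshot whatever upstream currently ships.
# The effective default silently follows upstream releases.
COPIED_PROTOCOLS = copy.copy(UpstreamSanitizer.acceptable_protocols)

# Option B: declare the list explicitly, so it only changes
# when this project deliberately edits it.
EXPLICIT_PROTOCOLS = ["http", "https", "mailto"]


def extend_protocols(base, extra):
    """Return a new list with extra protocols appended; base is not mutated."""
    return list(base) + [p for p in extra if p not in base]


# An e-mail application could opt in to "cid" on top of the conservative default.
email_protocols = extend_protocols(EXPLICIT_PROTOCOLS, ["cid"])
print(email_protocols)
```

With an explicit list, anything beyond the conservative default (such as `cid` here) is an opt-in decision made by the caller.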
If you decide to change the default protocols to a smaller list I would suggest a big BACKWARDS INCOMPATIBLE warning on the next release, as suddenly all sorts of links would stop working in people's code unless they also went through the protocols one by one and checked if anyone used them or not. We have people using at least five of the ones removed.
@EmilStenstrom Which protocols? Where would you need this "BACKWARDS INCOMPATIBLE" warning? Is a note in the CHANGES file enough?
I'm guessing based on seeing a lot of user generated posts on our platform but: ftp, rtsp, nntp, webcal, feed.
We tend to look in the version history on PyPI so that would be ideal for us, but a CHANGES file is ok too, just a bit harder to stumble over when you're doing a big "let's update some libraries" drive.
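For anyone who wants to check their own content rather than guess, one possible audit is sketched here. It uses only the standard library; `SchemeCounter`, `count_schemes`, and the sample posts are all hypothetical and exist only for illustration. It counts the URL schemes that appear in `href` and `src` attributes so they can be compared against a proposed smaller default list.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# RFC 3986 scheme syntax: a letter followed by letters, digits, "+", "-" or ".".
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


class SchemeCounter(HTMLParser):
    """Collect the URL schemes used in href/src attributes of one document."""

    def __init__(self):
        super().__init__()
        self.schemes = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                match = _SCHEME_RE.match(value.strip())
                if match:
                    self.schemes[match.group(1).lower()] += 1


def count_schemes(documents):
    """Return a Counter of schemes across an iterable of HTML strings."""
    total = Counter()
    for doc in documents:
        parser = SchemeCounter()  # fresh parser per document
        parser.feed(doc)
        parser.close()
        total.update(parser.schemes)
    return total


# Hypothetical sample content.
posts = [
    '<a href="ftp://example.com/file">file</a>',
    '<a href="webcal://example.com/cal">cal</a> <a href="http://example.com">home</a>',
]
used = count_schemes(posts)
proposed_defaults = {"http", "https", "mailto"}
at_risk = sorted(set(used) - proposed_defaults)
print(at_risk)
```

Any scheme left in `at_risk` is one whose links would be stripped under the proposed defaults unless it is added back explicitly.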
I think the This needs documentation, too. At a minimum, we should add a new section to |
@AndreasMalecki ^^^ Are these changes you want to work on? If not, I can take what you started and finish it up some time this month. |
+1, I also want this feature, because XSS is possible in cases such as the following.
I want to allow the iframe tag, but I cannot reject the data protocol. |
@willkg Okay. I will do that within the week. |
@willkg So, what's the decision concerning the default protocols? Limit them to the three you mentioned or stay backwards compatible? My favorite would be compatibility but, as you suggested, to define the list separately for bleach. |
@AndreasMalecki There were a couple of thoughts on severely limiting the Given that, I think we should limit them. I think for now we should go with:
Having said that, I'm still curious about situations where this isn't great, so I'll spend some time talking with people I know who use bleach and see what they think. If anything comes out of that, I'll write up an issue and PR to fix the list. Thank you for working on this! |
@willkg You're welcome. I implemented the requested changes. Anything else I should do? |
Travis is green and this looks good to me. Thank you for doing the work! I'll merge it now. I'll also add a note to Thank you again! |
Really excited to see this feature in master. Are you guys planning to publish a new release to PyPI soon? Looks like version 1.4.2 does not contain the commits that power this feature. |
…sion of bleach containing these commits is released
Hey guys, I was wondering if there is a plan to create a formal release (including this merged PR) on PyPI? Our team is pretty excited for these changes 👍 |
For one of my projects, I required configurable protocols for the href attribute of anchor tags when using bleach.clean. This pull request exposes the allowed protocols in bleach.clean.