Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add escape parameter to to_html() #2919

Merged
merged 2 commits into from Apr 10, 2013

Conversation

gdraps
Copy link
Contributor

@gdraps gdraps commented Feb 23, 2013

Treating DataFrame content as plain text, rather than HTML markup, by escaping
everything (#2617) seems like the right default for to_html(), however, if a DataFrame contains HTML (example) or text already HTML escaped, it results in either unwanted escaping or double-escaping.

Changes in this PR:

  • make HTML escaping programmable through a new to_html() parameter named
    escape (default True), allowing users to restore old to_html() behavior (<=0.10.0) by setting escape=False.
  • add & to the list of HTML chars escaped, so strings that happen to contain
    HTML escape sequences or reserved entities, such as "&lt;", are displayed
    properly.

@ghost
Copy link

ghost commented Feb 25, 2013

Good addition, but I think there are actually 3 cases to be handled, and
disagree with your choice of default behaviour.

Escaping <> by default is important to prevent XSS. your PR
keeps that and that's fine.

Running the following example in qtconsole/ipnb (or opening the HTML in a browser,I guess)

import pandas as pd
class AsHtml(object):
   def _repr_html_(self):
       return pd.DataFrame(["<b>str<ing1 &amp; &lt; &a</b>"]).to_html()
AsHtml()

currently produces

<b>str<ing1 & < &a</b>

and with your PR it produces:

<b>str<ing1 &amp; &lt; &a</b>

It seems to me that when a user has HTML in his frame and is using to_html(), he
would usually want those html entities to display properly (as & not & for example).

So there are 3 cases:

  1. escape only <> but display html entities properly (the current default)
  2. escape everything (your PR's suggested default behaviour)
  3. don't escape anything (escape=False in your PR)

I think 1 should stay the default, and 2/3 should become possible by
specifying the escape argument.

@gdraps
Copy link
Contributor Author

gdraps commented Feb 25, 2013

Thanks for the feedback. It's interesting because I've come to embrace black/white escaping rules (i.e., text is markup or not) based on experience with web frameworks. Mixing < and &amp; in the same string and expecting &amp; to render as simply & when escape=True is not what I would expect, but I may be biased.

In web frameworks, the general practice is to explicitly identify markup and treat all other text as suspect. Below are two examples of how you'd add markup to non-markup text with markupsafe, written by the author of flask and also used by flask.

>>> from markupsafe import Markup
>>> bold = Markup("<b>%s</b>")
>>> print bold % "str<ing1 & < &a"
<b>str&lt;ing1 &amp; &lt; &amp;a</b>
>>> print Markup("<b>") + "str<ing1 & < &a" + Markup("</b>")
<b>str&lt;ing1 &amp; &lt; &amp;a</b>

Furthermore, one must explicitly identify strings as Markup to prevent double-escaping.

>>> print bold % Markup("str&lt;ing1 &amp; &lt; &amp;a")
<b>str&lt;ing1 &amp; &lt; &amp;a</b>
>>> print bold % "str&lt;ing1 &amp; &lt; &amp;a"
<b>str&amp;lt;ing1 &amp;amp; &amp;lt; &amp;amp;a</b>

I'd be interested to hear your feedback on the following patterns and whether there is a better API for pandas users to safely markup DataFrame content.

import pandas as pd
from markupsafe import Markup
from IPython.core.display import display_html

bold = Markup("<b>%s</b>")
display_html(pd.DataFrame(["str<ing1 & < &a"]).to_html(), raw=True)
display_html(pd.DataFrame([bold % "str<ing1 & < &a"]).to_html(escape=False), raw=True)
display_html(pd.DataFrame([bold % Markup("str&lt;ing1 &amp; &lt; &amp;a")]).to_html(escape=False), raw=True)

@ghost
Copy link

ghost commented Feb 25, 2013

That's nice, but I'm not sure that addresses the problem I mentioned.
The case not covered is when we want tags to be escaped (for example
<script src=http://evil.com/diabolical.js></script>), but '&amp' to display
as &.

Let me put it this way: suppose a user of pandas is happy with the way
things currently work. Suppose your PR is merged, what would the user
have to do to keep things behaving as they are?

If you're aware of security issues that make merely <> escaping ineffective,
please speak up.

Also note that GH displays '&' when you write '&' in a comment, and GH
is quite a popular web application :)

@ghost
Copy link

ghost commented Feb 27, 2013

I'm going 180 on this.
http://wonko.com/post/html-escaping
https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet

In a nutshell, you can't tell what markup context the output might somehow end up embedded in
and so many characters are potentially dangerous, including quotes, ampersands,
percentage signs, braces and alarmingly, still others.

It seems an unlikely attack vector, but common experience shows that's bad reasoning.
So +1 for an escape smackdown, and thanks @gdraps for stirring this back up.

@wesm
Copy link
Member

wesm commented Apr 9, 2013

Merge status?

@ghost
Copy link

ghost commented Apr 9, 2013

it's the right thing to do but i'm afraid it'd really inconvenience some users.
defaults bikeshedding?
dunno.

@gdraps
Copy link
Contributor Author

gdraps commented Apr 9, 2013

For strict compatibility with 0.10.1, either a third escape mode can be added (e.g., 'compat' in addition to True/False), in which & is not escaped, or & escaping can be removed from this PR. The main goal was to add an escape parm to give users the ability to restore 0.10.0 behavior.

@wesm
Copy link
Member

wesm commented Apr 9, 2013

I'm comfortable with you merging as long as you put something in the What's New so we can point to it

@gdraps
Copy link
Contributor Author

gdraps commented Apr 9, 2013

Rebased to master and added a few words to RELEASE.rst and v0.11.0.txt. Let me know if I only need to update one or the other.

@wesm wesm merged commit f3d01cb into pandas-dev:master Apr 10, 2013
@wesm
Copy link
Member

wesm commented Apr 10, 2013

merged, thanks!

@jreback
Copy link
Contributor

jreback commented Apr 10, 2013

Failing in master on p3.3

======================================================================
FAIL: test_to_html_escaped (pandas.tests.test_format.TestDataFrameFormatting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.3_with_system_site_packages/lib/python3.3/site-packages/pandas-0.11.0.dev_d749b91-py3.3-linux-x86_64.egg/pandas/tests/test_format.py", line 307, in test_to_html_escaped
    self.assertEqual(xp, rs)
AssertionError: '<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: rig [truncated]... != '<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: rig [truncated]...
Diff is 1049 characters long. Set self.maxDiff to None to see it.
----------------------------------------------------------------------
Ran 2960 tests in 122.311s

@gdraps
Copy link
Contributor Author

gdraps commented Apr 10, 2013

Shucks, I can look into it later today.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2013

FYI These are failing in 3.2

======================================================================
ERROR: test_rplot1 (pandas.tests.test_rplot.TestRPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tests/test_rplot.py", line 242, in test_rplot1
    self.plot.render(self.fig)
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 881, in render
    adjust_subplots(fig, axes_grid, last_trellis, new_layers[-1])
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 759, in adjust_subplots
    label1 = "%s = %s" % (trellis.by[0], trellis.group_grid[index / trellis.cols][index % trellis.cols][0])
TypeError: list indices must be integers, not float
======================================================================
ERROR: test_rplot2 (pandas.tests.test_rplot.TestRPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tests/test_rplot.py", line 252, in test_rplot2
    self.plot.render(self.fig)
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 881, in render
    adjust_subplots(fig, axes_grid, last_trellis, new_layers[-1])
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 753, in adjust_subplots
    label1 = "%s = %s" % (trellis.by[1], trellis.group_grid[index / trellis.cols][index % trellis.cols])
TypeError: list indices must be integers, not float
======================================================================
ERROR: test_rplot3 (pandas.tests.test_rplot.TestRPlot)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tests/test_rplot.py", line 262, in test_rplot3
    self.plot.render(self.fig)
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 881, in render
    adjust_subplots(fig, axes_grid, last_trellis, new_layers[-1])
  File "/home/travis/virtualenv/python3.2_with_system_site_packages/lib/python3.2/site-packages/pandas-0.11.0.dev_fc8a679-py3.2-linux-x86_64.egg/pandas/tools/rplot.py", line 756, in adjust_subplots
    label1 = "%s = %s" % (trellis.by[0], trellis.group_grid[index / trellis.cols][index % trellis.cols])
TypeError: list indices must be integers, not float
----------------------------------------------------------------------
Ran 3093 tests in 300.638s

@gdraps
Copy link
Contributor Author

gdraps commented Apr 11, 2013

Ok, test_to_html_escaped failed due to unsafe reliance on dict key ordering. The character escapes were stored in a dict, even though & must be escaped first to prevent double escaping. py3.3's hash randomization was wonderfully effective at uncovering this every other run. I pushed a new commit to this branch with the dict replaced by an ordered dict. Let me know if I should open a new PR instead.

@ghost
Copy link

ghost commented Apr 11, 2013

cherry picked in master. Try and get travis-ci installed, it'll do the py3 testing for you
if you don't use tox.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants