-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indent HTML lists correctly (Issue 1073) #1170
base: master
Are you sure you want to change the base?
Conversation
Some debugging and polish needed.
Some variables need tweaking. Needs testing. Code reuse unsatisfactory.
Some variables need tweaking. Needs testing. Code reuse unsatisfactory.
Potentially significant issues with tests: 1. test_html_ln_outside_p - IndexError: list index out of range. 2. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected. Code reuse unsatisfactory.
Need feedback for handling <dd> and <blockquote>. Potentially significant issues with tests: 1. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected. Need feedback for whether or not the new indentation that contradicts old tests is satisfactory. Code reuse unsatisfactory. Need feedback.
Bug present: bullets are made one per line instead of one per paragraph. Saving progress before introducing a `Bullet` class.
Feature implemented. Testing and adjustments of tests needed.
Prevented from `Paragraph.top_margin` being added to `pdf.y` of first lines of paragraphs with bullets.
Prevented from `Paragraph.top_margin` being added to `pdf.y` of first lines of paragraphs with bullets. `<ul>` and `<ol>` tags now cause a creation of a paragraph with the string `\n` being used to generate a fragment of the height `list_pseudo_margin`. Adjusted defaults for `li_tag_indent`.
Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line. Dealing with the issue of inappropriately large distance between `<dt>` and their child `<dd>` elements when `Paragraph.top_margin` is 0.
Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line.
# Conflicts: # fpdf/html.py
Adjusted old tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work so far, but the devil is in the detail...
As I'm sure you've noticed, the interplay between the HTML parser, text regions, line wrapping, and rendering is non-trivial. I've added some pointers of how to fix the parts that don't quite add up yet.
…`list_vertical_margin`. Removed the `MultiLineBreak.indent` attribute. Added a test for long `<ol>` bullets.
# Conflicts: # CHANGELOG.md
…_bullet_frag_and_tl`.
…_bullet_frag_and_tl`.
# Conflicts: # fpdf/text_region.py
…ags_and_tl` method and in the `Bullet` class.
Edited `TextRegion.md` to reflect the introduced changes for `Paragraph`s. Added tests for `Paragraph` generation in `test_html.py`
Reformatted |
# Conflicts: # CHANGELOG.md # test/html/html_features.pdf
Now that the code has largely settled, let's look at the results.
The indents for lists are now much smaller than before. The previous default indent depended on the font size (5 x width of NBSP). This now seems to have changed to 5 "document units". With the default of mm this is too small. With the document units set to eg. inches, it will be way too large. You will have to pick a reasonable size in mm (eg. 8 or 10), and then make sure that if the document units aren't mm, this value gets converted appropriately before being used. The documentation also must clearly state that Note that top and bottom margins of The docstring for |
Regarding tests with small indents, do you want me to pass Regarding the new margins between elements in I might not have time to confidently deal with the 'magic numbers' tonight, so I will likely be pushing the changes tomorrow, and not today. Should I just do the conversion into appropriate units with them? If so, would you prefer for me to intentionally change them a little in order for them to look nicer, or would you prefer a more exact conversion? EDIT: Going to note that, currently, due to the |
…margin values in `html.py` to the chosen document unit of measurement. Adjusted default tag indent values. Moved the `Paragraph` docstring to the `ParagraphCollectorMixin.paragraph()` method. Changed the `CustomPDF` class in `test_html_customize_ul` to have non-static attributes `li_tag_indent` and `ul_bullet_char`. Adjusted tests.
@@ -270,8 +270,8 @@ def __init__( | |||
self, | |||
pdf, | |||
image_map=None, | |||
li_tag_indent=5, | |||
dd_tag_indent=10, | |||
li_tag_indent=30, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are still magic numbers (overly large now for mm), which will vary in their effect depending on the current document units.
pdf.k
contains the scale factor between the current document units and typographical points (pica), which the PDF format uses internally. To get the conversion factor between an arbitrary unit and pica, you can use eg, util.get_scale_factor("mm")
.
You will also need test cases specifically for verifying that any constants defined internally really give the same result even when the user has set different document units.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pdf.k
contains the scale factor between the current document units and typographical points (pica), which the PDF format uses internally. To get the conversion factor between an arbitrary unit and pica, you can use eg,util.get_scale_factor("mm")
.
pdf.k
is already defined using util.get_scale_factor
, though.
Are there any cases where the needed conversion factor isn't going to be the same as the scale factor?
You will also need test cases specifically for verifying that any constants defined internally really give the same result even when the user has set different document units
I have already introduced test_html_measurement_units
. Do I need to have tests for something not covered by it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pdf.k is already defined using util.get_scale_factor, though.
Yes of course, when the user sets the document units, then unit util.get_scale_factor()
is used to populate self.k
with the appropriate conversion factor. But if we define built-in constants in mm, then we need to use it directly, to make sure we're independent of the document units. It's a bit tricky to keep track of what unit any specific value is in, so we need to to be careful and validate as much as possible.
I have already introduced test_html_measurement_units
I didn't see that, will check it out.
edit: Now I see. They look the same, which is good. Actually, you could check against the same file with all units, to verify that automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 tests are failing:
test_html_customize_ul
test_html_li_tag_indent
test_html_measurement_units
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have enough time to deal with things quite right now, so I will be tackling this stuff in a few hours instead.
Yes of course, when the user sets the document units, then unit
util.get_scale_factor()
is used to populateself.k
with the appropriate conversion factor. But if we define built-in constants in mm, then we need to use it directly, to make sure we're independent of the document units. It's a bit tricky to keep track of what unit any specific value is in, so we need to to be careful and validate as much as possible
Where should we get the information about the units from, in that case? Should I introduce a separate FPDF.unit
attribute?
Now I see. They look the same, which is good. Actually, you could check against the same file with all units, to verify that automatically
Any advice on how to do that elegantly?
Changing FPDF.k
seems inelegant, but I don't see another way of handling this without introducing changes to FDPF
. So, I take it, I should introduce FPDF.unit
?
3 tests are failing: ...
My bad. I didn't do the tests this time.
EDIT: I don't seem to understand the task regarding the replacement of self.pdf.k
with something else when converting constants within html.py
.
Yes of course, when the user sets the document units, then unit
util.get_scale_factor()
is used to populateself.k
with the appropriate conversion factor. But if we define built-in constants in mm, then we need to use it directly, to make sure we're independent of the document units
What do we pass to util.get_scale_factor()
if we are supposed to not use document units here? What are we trying to convert them to if we aren't expressing the constants in the chosen document units.
If it is not the document units that we want to convert the constants to, then I'm at a loss as to what we are trying to do.
EDIT 2: Going to just update the test files for now.
Will be introducing the FPDF.unit
property in the next commit, so that the measurement unit test files can be consolidated into a single one.
Not sure what is required regarding the replacement of self.pdf.k
in html.py
if we are not supposed to be converting the constants in accordance with the chosen document units, which is currently represented in FPDF
objects with the scale factor FPDF.k
. Unless stopped, I will be replacing relevant usages of self.pdf.k
with get_scale_factor(self.pdf.unit)
individually.
Again, though, I am confused as to why we should be replacing the usage of self.pdf.k
with anything else there, so I'd like to receive further input regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where should we get the information about the units from, in that case? Should I introduce a separate FPDF.unit attribute?
That would just be a redundant copy of pdf.k
. You can convert between all units if you use points as an intermediate value.
- From current units to points:
x * pdf.k
- From mm to points:
x * util.get_scale_factor("mm")
- From mm to current units:
x * util.get_scale_factor("mm") / pdf.k
- etc.
It seems to me that you're actually giving the default values in points (30 pts ~= 12 mm). Is this intentional?
Now I see. They look the same, which is good. Actually, you could check against the same file with all units, to verify that automatically
Any advice on how to do that elegantly?
Just use the same file name for all units. They should result in identical output, after all. (If you're very unlucky though, then rounding errors will make them just a tiny bit different...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that you're actually giving the default values in points (30 pts ~= 12 mm). Is this intentional?
Yes. This way the conversion is just a simple division by self.pdf.k
. I am still unsure why we want to replace that with division by util.get_scale_factor()
, and what to pass to the function as a parameter, unless I am to introduce FPDF.unit
. I do question the utility of making additional function calls when we already store the relevant information in self.pdf.k
.
Just use the same file name for all units
The question is how to best go about it?
Without storing information about units in FPDF
, the solution seems to be to modify pdf.k
, and, as you noted, there might be rounding errors.
Introducing the FPDF. unit
property attribute with a setter that also modifies FPDF.k
introduces some redundancy. I have also encountered some issues with this approach - got misaligned indents and failure to generate pages that use centimetres or inches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that you're actually giving the default values in points (30 pts ~= 12 mm). Is this intentional?
Yes. This way the conversion is just a simple division by
self.pdf.k
Readability (for humans) is much more important than computational efficiency in most cases, and especially for a conversion that happens at most once for every table (in some cases once per program run). And where converting a constant is really too expensive, it needs a comment explaining what unit it is and what that means in real world terms.
Actually, any dimensional constant needs a comment specifying its unit anyway (unless the name already makes it clear).
I am still unsure why we want to replace that with division by util.get_scale_factor()
I explained that several times already:
pdf.k
is the conversion factor for the current document units. You don't know what the user will supply for those (could be anything), and you don't want to know.- You call
util.get_scale_factor(<unit>)
directly if you need the conversion factor for a fixed and known unit (for constants).
and what to pass to the function as a parameter
Have a look at the definition of util.get_scale_factor()
. That might make things more clear.
Just use the same file name for all units
The question is
how to best go about it?
Just don't vary the file name for each iteration. There's no problem at all to have several tests compare against the same file.
as you noted, there might be rounding errors.
Why do you let a hypothetical problem stop you from trying? This is a theoretical possibility, but since we only store a very limited number of decimal digits in the PDF file, differences through rounding errors are actually very rare (which is why I wrote "if you're very unlucky").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explained that several times already:
pdf.k is the conversion factor for the current document units. You don't know what the user will supply for those (could be anything), and you don't want to know.
You callutil.get_scale_factor(<unit>)
directly if you need the conversion factor for a fixed and known unit (for constants).
Do I understand it correctly, then, that I should introduce the attribute FPDF.unit
, potentially as a property, and then convert default values of relevant variables with * get_scale_factor('mm') / get_scale_factor(self.pdf.unit)
? This approach seems to currently do the same thing as * get_scale_factor('mm') / self.pdf.k
while making this conversion potentially not cohere with the scale factor of the document itself. In addition, it seems to contradict what you said previously regarding the addition of the FPDF.unit
attribute:
That would just be a redundant copy of
pdf.k
Have a look at the definition of
util.get_scale_factor()
. That might make things more clear.
I have. We need to pass the chosen unit as a parameter, and currently there is no FPDF.attribute
, meaning that to get the relevant unit, we would have to compare FPDF.k
to possible return values of utils.get_scale_factor()
.
For the time being, I will assume that I misunderstood something, return the default values of relevant variables to what they were before, and convert them to document units with * get_scale_factor('mm') / self.pdf.k
. If need be, I will implement conversion using get_scale factor()
instead of self.pdf.k
.
Why do you let a hypothetical problem stop you from trying?
Math brain, being used to preferring perfect precision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I repeat: You don't need and want to care about what the document units actually are (or how they might be named). There's a reason those are not actually stored anywhere. All you need is the conversion factor stored in FPDF.k
. This was set once using get_scale_factor(<user_supplied_unit>)
, and there's no point in repeating that.
You can use the same function to produce conversion factors for units of your own chosing.
Most size values in fpdf2 are stored and passed around in document units. So if you also store your constants in document units, that will save you many repeated conversions when they are used with other modules.
If you look at FPDF.__init__()
, you'll find initializations like follows, converting the values from points, but with comments noting the mm/cm sizes:
self.line_width = 0.567 / self.k # line width (0.2 mm)
...
# Page margins (1 cm)
margin = (7200 / 254) / self.k
You can follow this pattern for simplicity, or you can do somethin like this:
mm_to_doc_u = get_scale_factor("mm") / self.pdf.k
XXX = 123 * mm_to_doc_u # this constant value
YYY = 456 * mm_to_doc_u # another constant value
Why do you let a hypothetical problem stop you from trying?
Math brain, being used to preferring perfect precision.
Alas, the real world is utterly imprecise... 😉
The default resolution of the PDF format (eg. for images) is 72 dpi. We position elements more accurately than that, but any difference below 0.01 mm or so has no practical relevance, because it is imperceptible to the human eye.
Minor changes to HTML contents of `test_html_measurement_units()`.
docs/TextRegion.md
Outdated
@@ -84,6 +84,10 @@ For more typographical control, you can use the following arguments. Most of tho | |||
* line_height (float, optional) - factor by which the line spacing will be different from the font height. (default: by region) | |||
* top_margin (float, optional) - how much spacing is added above the paragraph. No spacing will be added at the top of the paragraph if the current y position is at (or above) the top margin of the page. (Default: 0.0) | |||
* bottom_margin (float, optional) - Those two values determine how much spacing is added below the paragraph. No spacing will be added at the bottom if it would result in overstepping the bottom margin of the page. (Default: 0.0) | |||
* indent (float, optional): determines the indentation of the paragraph. (Default: 0.0) | |||
* bullet_rel_x_displacement (float, optional) - determines the relative displacement of the bullet along the x-axis. The distance is between the rightmost point of the bullet to the leftmost point of the paragraph's text. (Default: 2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "bullet_right_margin" describes more accurately what this does?
docs/TextRegion.md
Outdated
@@ -84,6 +84,10 @@ For more typographical control, you can use the following arguments. Most of tho | |||
* line_height (float, optional) - factor by which the line spacing will be different from the font height. (default: by region) | |||
* top_margin (float, optional) - how much spacing is added above the paragraph. No spacing will be added at the top of the paragraph if the current y position is at (or above) the top margin of the page. (Default: 0.0) | |||
* bottom_margin (float, optional) - Those two values determine how much spacing is added below the paragraph. No spacing will be added at the bottom if it would result in overstepping the bottom margin of the page. (Default: 0.0) | |||
* indent (float, optional): determines the indentation of the paragraph. (Default: 0.0) | |||
* bullet_rel_x_displacement (float, optional) - determines the relative displacement of the bullet along the x-axis. The distance is between the rightmost point of the bullet to the leftmost point of the paragraph's text. (Default: 2.0) | |||
* bullet_rel_y_displacement (float, optional) - determines the relative displacement of the bullet along the y-axis. The distance is between the topmost point of the bullet and the topmost point of the paragraph's text. (Default: 0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And "bullet_top_margin" ?
Edit: I just noticed your earlier suggestion to remove this one entirely. That's a good idea, since it isn't really needed at the moment, and might stand in the way of other possible enhancements later.
…associated class attributes and docstring mentions to `bullet_r_margin`, `bullet_t_margin`, `bullet.r_margin`, `bullet.t_margin`.
@@ -352,7 +359,7 @@ def __init__( | |||
# "inserted" is a special attribute indicating that a cell has be inserted in self.table_row | |||
|
|||
if not tag_indents: | |||
tag_indents = {} | |||
tag_indents = {k: v / self.pdf.k for k, v in DEFAULT_TAG_INDENTS.items()} | |||
if dd_tag_indent != DEFAULT_TAG_INDENTS["dd"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to compare between different units. The user supplied argument is supposed to be in document units, but your currently store the default values in points. For the assignment below you then convert document units to points, which also doesn't looks correct.
You should probably compare against tag_indents
instead and not convert the user input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the assignment below you then convert document units to points, which also doesn't looks correct
I'm not seeing that anywhere.
Although, it seems that I made some changes for naught tonight. More specifically, the conversion of li_tag_indent
and dd_tag_indent
. That was an error, as those values are only relevant if they differ from the defaults, meaning that they shouldn't be subject to this conversion. I'm going back on those changes.
As such, this note seems to not be correct:
Those are still magic numbers (overly large now for mm), which will vary in their effect depending on the current document units
as the generated files evidently do not differ with change in document units provided that those default values in question are used.
You should probably compare against tag_indents instead and not convert the user input
Assuming you are talking about the change in the if
blocks for non-default li_tag_indent
and dd_tag_indent
, yeah, that was an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing that anywhere.
You actually do. Your next paragraph explains exactly what the problem is. 👍
As such, this note seems to not be correct:
At that time it wasn't clear to me that the value of 30 is actually in points (problem of readability!).
That works for DEFAULT_TAG_INDENTS
, but for li_tag_indent
and dd_tag_indent
the default unit is mm, which would result in ridiculously large indents.
Actually, since those two parameters are deprecated and their default value should never get used, it might make sense to default them to None, which will make it simpler to determine whether they have been set by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You actually do. Your next paragraph explains exactly what the problem is
But there was conversion from points to document units, and not the other way around.
In any case, seems like that matter is resolved now anyway.
That works for
DEFAULT_TAG_INDENTS
, but forli_tag_indent
anddd_tag_indent
the default unit is mm, which would result in ridiculously large indents
Considering that we only want to deal with those if the user passes the relevant parameters, and in that case we want to handle them regardless of whether or not they are equal to relevant entries in DEFAULT_TAG_INDENTS
, I'm making them have the value of None
by default, so that's not going to be an issue going forward anyway.
Actually, since those two parameters are deprecated and their default value should never get used, it might make sense to default them to None
Yeah. And there was another reason to handle them that way anyway.
self.ol_type = [] # when inside a <ol> tag, can be "a", "A", "i", "I" or "1" | ||
self.bullet = [] | ||
if list_vertical_margin is None: | ||
self.list_vertical_margin = 0.3 / self.pdf.k |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This converts 0.3 points (0.12 mm) to document units.
You may have seen self.font_size
being converted like this a few times. The pitfall being that self.font_size
is derived from pdf.font_size_pt
, which actually is in points. We should probably rename self.font_size
in this module also to self.font_size_pt
to make this more obvious.
Besides the "magical number" problem in this specific instance, there are more cases where you do a similarly incorrect conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this conversion incorrect, and in what sense is it a 'magical number' if it's simply the default value for the vertical margin, considering that the conversion is handled?
EDIT: Do you want the default value of HTML2PDF.list_vertical_margin
to depend on something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so the conversion is intentional, but then the amount doesn't look like something you'd want.
I think the default here used to be 2 mm (implemented incorrectly without taking document units into account).
If you wanted to specify the same amount in points and convert it to document units for passing on, then 2 * util.get_scale_factor("mm")
would result in ~5.67 points, not 0.3.
Your current value amounts to roughly a tenth of a mm, so you could just as well leave the margin away.
And again, a constant like that should come with a comment specifying its unit and how much that is in mm (most people are not aware of how big a typographical point is).
And no, I don't see this depending on any other size. What did you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And again, a constant like that should come with a comment specifying its unit and how much that is in mm (most people are not aware of how big a typographical point is)
Given that there were no such comments prior to my intrusion, I assumed that I shouldn't be changing that. Where else should I introduce such comments, if you don't mind me asking?
And no, I don't see this depending on any other size. What did you have in mind?
I am/was confused about why we should make get_scale_factor()
calls instead of just using self.pdf.k
. Do I understand it correctly now that you want me to return the relevant default values to how they were prior to my changes, and to convert them with * get_scale_factor('mm') / self.pdf.k
into the appropriate units?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that there were no such comments prior to my intrusion, I assumed that I shouldn't be changing that. Where else should I introduce such comments, if you don't mind me asking?
This module still contains a lot of legacy code of rather low quality. Any improvement to that is welcome. So far we have just fixed the parts we were touching anyway when adding features. I recommend you do the same, in order to keep the focus on your actual goal. At some later point someone will have to go through every line with a fine toothed comb, but that is a separate and rather tedious task, requiring a good grasp of the whole code base.
I've commented on recommended unit use further above.
… `dd_tag_indent` in `html.py`.
Actually, I see an issue with how the default values for |
Fixes #1073
Implements paragraph indentation via adjustments of
pdf.x
instead of using a natural number of whitespaces.Breaks up list items into individual paragraphs.
Checklist:
The GitHub pipeline is OK (green),
meaning that both
pylint
(static code analyzer) andblack
(code formatter) are happy with the changes of this PR.A unit test is covering the code added / modified by this PR
This PR is ready to be merged
In case of a new feature, docstrings have been added, with also some documentation in the
docs/
folderA mention of the change is present in
CHANGELOG.md
Not sure how to actually name the
HTML2FPDF.list_pseudo_margin
attribute. It is used for determining the height of the\n
line created when a<ul>
or<ol>
starting tag is handled.Should the re-implementation of paragraph indentation be reflected in a doctstring, even though the
tag_indents
parameter ofwrite_html()
has not been touched? If so, where should it be placed?By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.