-
-
Notifications
You must be signed in to change notification settings - Fork 9.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the additional newline generated by iter_lines() caused by a '\r\n' pair being separated in two different chunks. #3984
Conversation
a6796ad
to
8307525
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, this patch looks broadly right! I've left a few minor structural notes in the diff.
Additionally, it'd be good if you could add a test that reproduces this problem and demonstrates that it's fixed. Should be easily enough done with careful choice of body and chunk size.
requests/models.py
Outdated
@@ -820,6 +827,9 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter= | |||
for line in lines: | |||
yield line | |||
|
|||
# check if the current chunk ends with '\r' | |||
last_chunk_ends_with_cr = chunk.endswith('\r' if decode_unicode else b'\r') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's hoist the result of these two else
calls to the top of the function, rather than check it twice each time around the loop:
carriage_return = u'\r' if decode_unicode else b'\r'
line_feed = u'\n' if decode_unicode else b'\r'
requests/models.py
Outdated
# Edge case: if the last chunk ends with '\r', and the current chunk starts with \n, | ||
# they should be merged and treated as only "one" new line separator '\r\n'. | ||
# So the first '\n' in this chunk should be skipped since it's just the second half of | ||
# the CRLF pair ('\r\n') rather than another new line break. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should expand this comment to explain that this only affects the splitlines
case because splitlines
will treat any of \r
, \r\n
, and \n
as newlines, and so splitting \r\n
into two chunks will get misleading results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will do that.
requests/models.py
Outdated
@@ -820,6 +827,9 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter= | |||
for line in lines: | |||
yield line | |||
|
|||
# check if the current chunk ends with '\r' | |||
last_chunk_ends_with_cr = chunk.endswith('\r' if decode_unicode else b'\r') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good to hoist this into the block that holds the splitlines
call, as we don't need to do this processing in any other situation.
@Lukasa Thanks for the review and I already modified the code by following your instructions. |
@Lukasa While trying to fix the unit tests, I noted that treating "\r\n" as two line breaks is actually the intended behavior based on the current unit tests. While this behavior is incompatible with the default of python (universal newline) and is not explained in the requests API doc either, what
Either way, the current code needs some fix since different chunk size can yield different results. Another issue with the current uint tests is it only works with python2. For working with python3, delimiter should be bytes rather than str if decode_unicode is not True, which is not addressed in the unit tests. |
requests/models.py
Outdated
# The last chunk ends with '\r', so the '\n' at chunk[0] | ||
# is just the second half of a '\r\n' pair rather than a | ||
# new line break. Just skip it. | ||
chunk = chunk[1:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to keep in mind here is the final line of the file ending in \r\n
. If we split at \r
, we get a final one byte line of \n
. This code makes chunk equivalent to ''
and ''.splitlines()
will result in an empty list. You'll get an index out of bounds with the call lines[-1]
below.
Probably a rarer edge case but something we should cover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks.
The unit tests should be updated:
Yup, that should change as well. |
It seems sensible to have the test data default be in bytes since that's what we should be receiving from urllib3. We can have a |
@Lukasa @nateprewitt Just fixed all of the unit tests and the mentioned chunk[1:] problems. All related tests should pass now. Also provide a small test program to demonstrate the bug.
Thanks! |
requests/models.py
Outdated
@@ -813,13 +831,17 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter= | |||
# | |||
# If we're using `splitlines()`, we only do this if the chunk | |||
# ended midway through a line. | |||
incomplete_line = lines[-1] and lines[-1][-1] == chunk[-1] | |||
incomplete_line = lines[-1] and lines[-1][-1] and lines[-1][-1] == chunk[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has the condition been doubled up here? Under what circumstances will line[-1][-1]
be falsey?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I forgot to remove them. It should be prevented by the check for chunk[1:].
requests/models.py
Outdated
if delimiter or incomplete_line: | ||
pending = lines.pop() | ||
|
||
for line in lines: | ||
yield line | ||
|
||
# check if the current chunk ends with '\r' | ||
if delimiter is None: | ||
last_chunk_ends_with_cr = chunk.endswith(carriage_return) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this should be brought into the if branch above.
tests/test_requests.py
Outdated
# the chunks containing a single '\n', it emits '' as a line -- whereas | ||
# `.splitlines()` combines with the '\r' and splits on `\r\n`. | ||
# decode_unicode=True, output unicode strings | ||
assert list(r.iter_lines(decode_unicode=True, delimiter='\r\n')) == unicode_mock_data.split('\r\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests run on Python 2, so Unicode strings must be prefixed with u
tests/test_requests.py
Outdated
([b'a\r\n',b'\rb\n'], ['a', '', 'b'], ['a', '\rb\n']), | ||
([b'a\nb', b'c'], ['a', 'bc'], ['a\nbc']), | ||
([b'a\n', b'\rb', b'\r\nc'], ['a', '', 'b', 'c'], ['a\n\rb', 'c']), | ||
([b'a\r\nb', b'', b'c'], ['a', 'bc'], ['a', 'bc']) # Empty chunk with pending data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, your test seems to show that all the unprefixed strings are intended to be Unicode: they need to be prefixed
Codecov Report
@@ Coverage Diff @@
## proposed/3.0.0 #3984 +/- ##
==================================================
+ Coverage 89.44% 89.48% +0.04%
==================================================
Files 15 15
Lines 1932 1941 +9
==================================================
+ Hits 1728 1737 +9
Misses 204 204
Continue to review full report at Codecov.
|
@Lukasa Thanks for the detailed review. All problems are cleared now. |
@@ -796,7 +800,23 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter= | |||
if delimiter: | |||
lines = chunk.split(delimiter) | |||
else: | |||
# Python splitlines() supports the universal newline (PEP 278). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't universal newline support dropped in Python 3? I know you can no longer get a TextIOBuffer object that supports that. How does that affect splitting strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, splitlines() actually supports a superset of universal newline.
FYI: https://docs.python.org/3/library/stdtypes.html#str.splitlines
The unicode strings supports a superset of universal newline.
However, the new python 3 bytes behaves differently.
https://docs.python.org/3/library/stdtypes.html#bytes.splitlines
In both cases, however, the '\r', '\n', and '\r\n' rule still apply.
requests/models.py
Outdated
@@ -813,7 +833,7 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter= | |||
# | |||
# If we're using `splitlines()`, we only do this if the chunk | |||
# ended midway through a line. | |||
incomplete_line = lines[-1] and lines[-1][-1] == chunk[-1] | |||
incomplete_line = lines[-1] and lines[-1][-1] == chunk[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there an extra space between and
and lines[-1][-1]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, this is looking really good. A few minor stylistic notes for you.
requests/models.py
Outdated
# Python splitlines() supports the universal newline (PEP 278). | ||
# That means, '\r', '\n', and '\r\n' are all treated as end of | ||
# line. If the last chunk ends with '\r', and the current chunk | ||
# starts with \n, they should be merged and treated as only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: as there are inverted commas around '\r'
in the line above, they should be added here around \n
.
requests/models.py
Outdated
# That means, '\r', '\n', and '\r\n' are all treated as end of | ||
# line. If the last chunk ends with '\r', and the current chunk | ||
# starts with \n, they should be merged and treated as only | ||
# "one" new line separator '\r\n' by splitlines(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: should be *one*
, not "one"
.
@Lukasa Done! :) |
requests/models.py
Outdated
lines = chunk.splitlines() | ||
# check if the current chunk ends with '\r' | ||
last_chunk_ends_with_cr = chunk.endswith(carriage_return) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh, just noticed a subtle misbehaviour. Imagine we get the following chunks:
[b'this is a string\r', b'\n', b'\nso is this']
This will not do the right thing: we'll strip two newline characters, rather than emit an empty line. That is, we'll emit: [b'this is a string', b'so is this']
instead of [b'this is a string', b'', b'so is this']
, which is what we should emit.
This is because a chunk that is just a \n
causes an early continue, above, without resetting the value of the last_chunk_ends_with_cr
boolean. Essentially it gets completely ignored. That's not right.
It'd be good to add a test case (or modify an existing one) to include that case, and then fix it up by hoisting this flag up before the if not chunk: continue
clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lukasa Nice catch! Sorry that I overlooked such a simple check. Just add a simple fix along with a test case for it. BTW, I did not hoist the flag per your request. It will look somewhat like this:
# a temp variable is needed here and I cannot find a good variable naming that is not confusing
old_last_chunk_ends_with_cr = last_chunk_ends_with_cr
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
if old_last_chunk_ends_with_cr and chunk.startswith(line_feed):
chunk = chunk[1:]
if not chunk:
continue
Also this yields an unnecessary method call endwiths('\r') since when chun[1:] is empty we already know the result is False. So I prefer just setting the flag to False though it looks like code duplication at first glance.
if not chunk:
last_chunk_ends_with_cr = False
Simple and readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm. I don't know if I think it's more readable to do it this way. It means there are two places in the code that set this variable, instead of just one, which forces the reader to ask themselves why it was done this way. I don't know that the efficiency gains are worth it. 🤔
assert result != mock_data.splitlines() | ||
assert result[2] == '' | ||
assert result[4] == '' | ||
assert result == mock_data.splitlines() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably collapse this line into the one above it as with the other assertions:
assert list(r.iter_lines()) == mock_data.splitlines()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece of code seems to be outdated. I didnt' see it in the current HEAD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, you're right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is now looking good except for my possible concern about the readability of our two alternative approaches. I'd like either @sigmavirus24 or @nateprewitt to weigh in there, if possible.
I think I'm missing something. I've looked through the PR but I don't see alternative approaches or readability issues. This could just be the woefully immature GitHub Review system hiding it from me though. |
Yeah, it is. See this discussion. |
requests/models.py
Outdated
# starts with '\n', they should be merged and treated as only | ||
# *one* new line separator '\r\n' by splitlines(). | ||
# This rule only applies when splitlines() is used. | ||
if last_chunk_ends_with_cr and chunk.startswith(line_feed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I understand what concerns @Lukasa has, I agree. I think this could be better written as:
chunk_startswith_line_feed = last_chunk_ends_with_cr and chunk.startswith(line_feed)
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
if chunk_startswith_line_feed:
chunk = chunk[1:]
if not chunk:
continue
lines = chunk.splitlines()
This properly explains the condition on this line with a descriptive variable name. It avoids the old_last_chunk_ends_with_cr
variable and it keeps things concise and easy to reason about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'm on board with @sigmavirus24's suggestion. That seems the most readable option out of what's been suggested.
The only minor input I have is standardizing the use of _
in ends_with
vs startswith
in the variable names.
@Lukasa Fixed as suggested by @sigmavirus24 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me too. Do you want to squash the commits down just for cleanliness sake? Then I'll go ahead and merge.
…n' pair being separated in two different chunks.
64ff760
to
458df8f
Compare
@Lukasa @sigmavirus24 git squashed. Thanks for your patience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all, thank you for working with us to get this ready to go. Well done! ✨ 🍰 ✨
When a "\r\n" CRLF pair is splitted into two chunks accidentally, iter_lines() generates two line breaks instead of one. This patch fixes the incorrect behavior and closes issue #3980.