
Fix the additional newline generated by iter_lines() caused by a '\r\n' pair being separated in two different chunks. #3984

Merged
1 commit merged into proposed/3.0.0 on Apr 26, 2017

Conversation

@PCMan commented Apr 21, 2017

When a "\r\n" CRLF pair is accidentally split across two chunks, iter_lines() generates two line breaks instead of one. This patch fixes the incorrect behavior and closes issue #3980.
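
For illustration only (not part of the patch), the effect can be reproduced with plain splitlines() on a body whose CRLF falls across a chunk boundary; the 6-byte chunking below is an arbitrary choice:

body = b"line1\r\nline2"
chunks = [body[i:i + 6] for i in range(0, len(body), 6)]   # [b'line1\r', b'\nline2']

# Splitting each chunk independently produces a spurious empty line, because the
# trailing b'\r' and the leading b'\n' are each counted as their own line break.
lines = []
for chunk in chunks:
    lines.extend(chunk.splitlines())

print(lines)              # [b'line1', b'', b'line2']
print(body.splitlines())  # [b'line1', b'line2']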

@Lukasa (Member) left a comment

Cool, this patch looks broadly right! I've left a few minor structural notes in the diff.

Additionally, it'd be good if you could add a test that reproduces this problem and demonstrates that it's fixed. It should be easy enough to do with a careful choice of body and chunk size.

@@ -820,6 +827,9 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
for line in lines:
yield line

# check if the current chunk ends with '\r'
last_chunk_ends_with_cr = chunk.endswith('\r' if decode_unicode else b'\r')
Member

Let's hoist the results of these two conditional expressions to the top of the function, rather than evaluating them twice each time around the loop:

carriage_return = u'\r' if decode_unicode else b'\r'
line_feed = u'\n' if decode_unicode else b'\n'

# Edge case: if the last chunk ends with '\r', and the current chunk starts with \n,
# they should be merged and treated as only "one" new line separator '\r\n'.
# So the first '\n' in this chunk should be skipped since it's just the second half of
# the CRLF pair ('\r\n') rather than another new line break.
Member

We should expand this comment to explain that this only affects the splitlines case, because splitlines will treat any of \r, \r\n, and \n as newlines, and so splitting \r\n across two chunks gives misleading results.
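
For reference, the splitlines() behaviour being described (shown here on bytes; the same applies to str):

b'a\rb\nc\r\nd'.splitlines()  # [b'a', b'b', b'c', b'd'] -- \r, \n and \r\n all end a line
b'a\r'.splitlines()           # [b'a']
b'\nb'.splitlines()           # [b'', b'b'] -- a stray leading \n yields an empty line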

Author

OK, will do that.

@@ -820,6 +827,9 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
for line in lines:
yield line

# check if the current chunk ends with '\r'
last_chunk_ends_with_cr = chunk.endswith('\r' if decode_unicode else b'\r')
Member

It would also be good to hoist this into the block that holds the splitlines call, as we don't need to do this processing in any other situation.

@PCMan (Author) commented Apr 21, 2017

@Lukasa Thanks for the review; I have already modified the code following your instructions.
However, there seem to be some issues with the unit tests. I'll take a look at them.

@PCMan (Author) commented Apr 21, 2017

@Lukasa While trying to fix the unit tests, I noticed that treating "\r\n" as two line breaks is actually the intended behavior according to the current unit tests. This behavior is incompatible with Python's default (universal newlines) and is not explained in the requests API docs either, so what delimiter=None does should be well defined and documented before we try to fix this. The unit tests should be updated to reflect the change as well.

  • If treating "\r\n" as two line breaks is intended, then the fix is to avoid splitlines() and split the lines on either '\r' or '\n' ourselves (see the sketch below), and to document this in the API doc.
  • Otherwise, we should emulate the behavior of splitlines(), follow Python's universal newline handling, and fix all of the unit tests and the API doc to reflect this behavior change.
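
A minimal sketch of the difference between the two options above, using re.split for option 1 (the use of re here is my own illustration, not part of the patch):

import re

body = b"line1\r\nline2\rline3\nline4"

# Option 1: '\r' and '\n' are independent breaks, so '\r\n' produces an empty line.
print(re.split(b'[\r\n]', body))   # [b'line1', b'', b'line2', b'line3', b'line4']

# Option 2: universal-newline style, as splitlines() does: '\r\n' counts as one break.
print(body.splitlines())           # [b'line1', b'line2', b'line3', b'line4']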

Either way, the current code needs a fix, since different chunk sizes can yield different results.

Another issue with the current unit tests is that they only work with Python 2. To work with Python 3, delimiter should be bytes rather than str when decode_unicode is not True, which is not addressed in the unit tests.
@Lukasa Your suggestions are welcome. Thank you.

# The last chunk ends with '\r', so the '\n' at chunk[0]
# is just the second half of a '\r\n' pair rather than a
# new line break. Just skip it.
chunk = chunk[1:]
Member

One thing to keep in mind here is the final line of the body ending in \r\n. If the chunk boundary falls right after the \r, we get a final one-byte chunk of \n. This code then makes chunk equal to '', and ''.splitlines() results in an empty list, so the call to lines[-1] below raises an IndexError.

Probably a rarer edge case, but something we should cover.
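
A minimal demonstration of that edge case, assuming the chunk boundary falls right after the '\r':

chunk = b'\n'               # final chunk of a body that ends with b'\r\n'
chunk = chunk[1:]           # skip the '\n' belonging to the previous chunk's '\r'
lines = chunk.splitlines()  # b''.splitlines() == []
lines[-1]                   # IndexError: index out of range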

Author

Fixed. Thanks.

@Lukasa (Member) commented Apr 21, 2017

The unit tests should be updated: splitlines is clearly intended to be part of the behaviour of the code and so the unit tests as written should be considered as codifying the v2 behaviour, rather than guiding the v3. We can change them for v3.

> To work with Python 3, delimiter should be bytes rather than str when decode_unicode is not True, which is not addressed in the unit tests.

Yup, that should change as well.

@nateprewitt (Member)

It seems sensible to have the test data default be in bytes since that's what we should be receiving from urllib3. We can have a parametrize toggle if we want to test both True and False for decode_unicode.
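
A rough sketch of what such a parametrized test might look like; the test name, body, and mock setup here are my own (modelled on the repro script below), not the actual test-suite code:

import pytest
import requests

@pytest.mark.parametrize('decode_unicode', (True, False))
def test_iter_lines_crlf_split_across_chunks(decode_unicode):
    data = b"line1\r\nline2"

    def mock_iter_content(chunk_size=1, decode_unicode=None):
        # Deliver the body in chunks that cut the CRLF pair in half.
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            yield chunk.decode('utf-8') if decode_unicode else chunk

    r = requests.Response()
    r._content_consumed = True
    r.iter_content = mock_iter_content

    expected = (data.decode('utf-8') if decode_unicode else data).splitlines()
    assert list(r.iter_lines(chunk_size=6, decode_unicode=decode_unicode)) == expected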

@PCMan (Author) commented Apr 22, 2017

@Lukasa @nateprewitt I've just fixed all of the unit tests and the mentioned chunk[1:] problems. All related tests should pass now. I've also provided a small test program to demonstrate the bug:

import requests

test_content = b"line1\r\nline2"

def mock_iter_content(chunk_size=1, decode_unicode=None):
    for i in range(0, len(test_content), chunk_size):
        yield test_content[i: i + chunk_size]

r = requests.Response()
r._content_consumed = True
r.iter_content = mock_iter_content

# With chunk_size=6 the body arrives as b'line1\r' and b'\nline2', splitting the CRLF
# pair; before the fix this yields [b'line1', b'', b'line2'] instead of [b'line1', b'line2'].
assert list(r.iter_lines(chunk_size=6)) == test_content.splitlines()

Thanks!

@@ -813,13 +831,17 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
#
# If we're using `splitlines()`, we only do this if the chunk
# ended midway through a line.
incomplete_line = lines[-1] and lines[-1][-1] == chunk[-1]
incomplete_line = lines[-1] and lines[-1][-1] and lines[-1][-1] == chunk[-1]
Member

Why has the condition been doubled up here? Under what circumstances will lines[-1][-1] be falsey?

Author

Sorry, I forgot to remove it. That case should already be prevented by the check added for chunk[1:].

if delimiter or incomplete_line:
pending = lines.pop()

for line in lines:
yield line

# check if the current chunk ends with '\r'
if delimiter is None:
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
Member

Again, this should be brought into the if branch above.

# the chunks containing a single '\n', it emits '' as a line -- whereas
# `.splitlines()` combines with the '\r' and splits on `\r\n`.
# decode_unicode=True, output unicode strings
assert list(r.iter_lines(decode_unicode=True, delimiter='\r\n')) == unicode_mock_data.split('\r\n')
Member

These tests run on Python 2, so Unicode strings must be prefixed with u

([b'a\r\n',b'\rb\n'], ['a', '', 'b'], ['a', '\rb\n']),
([b'a\nb', b'c'], ['a', 'bc'], ['a\nbc']),
([b'a\n', b'\rb', b'\r\nc'], ['a', '', 'b', 'c'], ['a\n\rb', 'c']),
([b'a\r\nb', b'', b'c'], ['a', 'bc'], ['a', 'bc']) # Empty chunk with pending data
Member

Again, your test seems to show that all the unprefixed strings are intended to be Unicode: they need to be prefixed.

@codecov-io commented Apr 22, 2017

Codecov Report

Merging #3984 into proposed/3.0.0 will increase coverage by 0.04%.
The diff coverage is 100%.


@@                Coverage Diff                 @@
##           proposed/3.0.0    #3984      +/-   ##
==================================================
+ Coverage           89.44%   89.48%   +0.04%     
==================================================
  Files                  15       15              
  Lines                1932     1941       +9     
==================================================
+ Hits                 1728     1737       +9     
  Misses                204      204
Impacted Files Coverage Δ
requests/models.py 94.33% <100%> (+0.11%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@PCMan (Author) commented Apr 22, 2017

@Lukasa Thanks for the detailed review. All problems are cleared now.

@@ -796,7 +800,23 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
if delimiter:
lines = chunk.split(delimiter)
else:
# Python splitlines() supports the universal newline (PEP 278).
Contributor

Wasn't universal newline support dropped in Python 3? I know you can no longer get a TextIOBuffer object that supports that. How does that affect splitting strings?

@PCMan (Author) commented Apr 22, 2017

Well, str.splitlines() actually supports a superset of universal newlines.
FYI: https://docs.python.org/3/library/stdtypes.html#str.splitlines
Unicode strings are split on a superset of the universal newline set.
However, Python 3 bytes behaves differently:
https://docs.python.org/3/library/stdtypes.html#bytes.splitlines
In both cases, though, the '\r', '\n', and '\r\n' rule still applies.
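
A quick illustration of that difference on Python 3 (\x1c is the FILE SEPARATOR control character, one of the extra line boundaries recognised by str.splitlines()):

'a\x1cb'.splitlines()    # ['a', 'b']    -- str splits on a superset of \r, \n, \r\n
b'a\x1cb'.splitlines()   # [b'a\x1cb']   -- bytes splits only on \r, \n and \r\n
'a\r\nb'.splitlines()    # ['a', 'b']    -- the \r/\n/\r\n rule holds for both types
b'a\r\nb'.splitlines()   # [b'a', b'b']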

@@ -813,7 +833,7 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
#
# If we're using `splitlines()`, we only do this if the chunk
# ended midway through a line.
incomplete_line = lines[-1] and lines[-1][-1] == chunk[-1]
incomplete_line = lines[-1] and  lines[-1][-1] == chunk[-1]
Contributor

Why is there an extra space between the and keyword and lines[-1][-1]?

Author

Removed.

@Lukasa (Member) left a comment

Cool, this is looking really good. A few minor stylistic notes for you.

# Python splitlines() supports the universal newline (PEP 278).
# That means, '\r', '\n', and '\r\n' are all treated as end of
# line. If the last chunk ends with '\r', and the current chunk
# starts with \n, they should be merged and treated as only
Member

Minor nit: as there are inverted commas around '\r' in the line above, they should be added here around \n.

# That means, '\r', '\n', and '\r\n' are all treated as end of
# line. If the last chunk ends with '\r', and the current chunk
# starts with \n, they should be merged and treated as only
# "one" new line separator '\r\n' by splitlines().
Member

Minor nit: should be *one*, not "one".

@PCMan (Author) commented Apr 23, 2017

@Lukasa Done! :)

lines = chunk.splitlines()
# check if the current chunk ends with '\r'
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
Member

Oooh, just noticed a subtle misbehaviour. Imagine we get the following chunks:

[b'this is a string\r', b'\n', b'\nso is this']

This will not do the right thing: we'll strip two newline characters, rather than emit an empty line. That is, we'll emit: [b'this is a string', b'so is this'] instead of [b'this is a string', b'', b'so is this'], which is what we should emit.

This is because a chunk that is just a \n causes an early continue, above, without resetting the value of the last_chunk_ends_with_cr boolean. Essentially it gets completely ignored. That's not right.

It'd be good to add a test case (or modify an existing one) to include that case, and then fix it up by hoisting this flag up before the if not chunk: continue clause.
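
A sketch of such a test case, mirroring the existing repro script; the expected output assumes the fixed behaviour described above:

import requests

chunks = [b'this is a string\r', b'\n', b'\nso is this']

def mock_iter_content(chunk_size=1, decode_unicode=None):
    for chunk in chunks:
        yield chunk

r = requests.Response()
r._content_consumed = True
r.iter_content = mock_iter_content

# The lone b'\n' chunk merely completes the first CRLF; the next chunk's leading
# b'\n' starts a genuinely empty line and must not be swallowed.
assert list(r.iter_lines()) == [b'this is a string', b'', b'so is this']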

@PCMan (Author) commented Apr 25, 2017

@Lukasa Nice catch! Sorry that I overlooked such a simple check. I've just added a simple fix along with a test case for it. BTW, I did not hoist the flag as you requested, because it would look something like this:

                # a temp variable is needed here, and I cannot find a good name for it that is not confusing
                old_last_chunk_ends_with_cr = last_chunk_ends_with_cr
                last_chunk_ends_with_cr = chunk.endswith(carriage_return)
                if old_last_chunk_ends_with_cr and chunk.startswith(line_feed):
                    chunk = chunk[1:]
                    if not chunk:
                        continue

Also, this makes an unnecessary endswith('\r') call, since when chunk[1:] is empty we already know the result is False. So I prefer just setting the flag to False, even though it looks like code duplication at first glance:

if not chunk:
    last_chunk_ends_with_cr = False

Simple and readable.

Member

Hrm. I don't know if I think it's more readable to do it this way. It means there are two places in the code that set this variable, instead of just one, which forces the reader to ask themselves why it was done this way. I don't know that the efficiency gains are worth it. 🤔

assert result != mock_data.splitlines()
assert result[2] == ''
assert result[4] == ''
assert result == mock_data.splitlines()
Member

We can probably collapse this line into the one above it as with the other assertions:

assert list(r.iter_lines()) == mock_data.splitlines()

Author

This piece of code seems to be outdated. I didn't see it in the current HEAD.

Member

Yup, you're right.

@Lukasa (Member) left a comment

So this is now looking good except for my possible concern about the readability of our two alternative approaches. I'd like either @sigmavirus24 or @nateprewitt to weigh in there, if possible.

@sigmavirus24 (Contributor)

> So this is now looking good except for my possible concern about the readability of our two alternative approaches.

I think I'm missing something. I've looked through the PR but I don't see alternative approaches or readability issues. This could just be the woefully immature GitHub Review system hiding it from me though.

@Lukasa (Member) commented Apr 25, 2017

Yeah, it is. See the discussion above.

# starts with '\n', they should be merged and treated as only
# *one* new line separator '\r\n' by splitlines().
# This rule only applies when splitlines() is used.
if last_chunk_ends_with_cr and chunk.startswith(line_feed):
Contributor

Now that I understand what concerns @Lukasa has, I agree. I think this could be better written as:

chunk_startswith_line_feed = last_chunk_ends_with_cr and chunk.startswith(line_feed)
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
if chunk_startswith_line_feed:
    chunk = chunk[1:]
    if not chunk:
        continue
lines = chunk.splitlines()

This properly explains the condition on this line with a descriptive variable name. It avoids the old_last_chunk_ends_with_cr variable and it keeps things concise and easy to reason about.
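
For context, here is a minimal sketch of how the delimiter-is-None path of iter_lines() might read with this suggestion folded in. This is an approximation written from the discussion in this thread, not the exact code that was merged:

def iter_lines(self, chunk_size=512, decode_unicode=None, delimiter=None):  # 512 ~ ITER_CHUNK_SIZE
    pending = None
    last_chunk_ends_with_cr = False
    carriage_return = u'\r' if decode_unicode else b'\r'
    line_feed = u'\n' if decode_unicode else b'\n'

    for chunk in self.iter_content(chunk_size=chunk_size, decode_unicode=decode_unicode):
        if pending is not None:
            chunk = pending + chunk

        if delimiter:
            lines = chunk.split(delimiter)
        else:
            # Skip a leading '\n' that is the second half of a CRLF pair split across
            # chunk boundaries, then remember whether this chunk itself ends in '\r'.
            skip_first_char = last_chunk_ends_with_cr and chunk.startswith(line_feed)
            last_chunk_ends_with_cr = chunk.endswith(carriage_return)
            if skip_first_char:
                chunk = chunk[1:]
                if not chunk:
                    continue
            lines = chunk.splitlines()

        # The chunk ended mid-line if the last piece still ends with the chunk's last character.
        incomplete_line = lines and lines[-1] and lines[-1][-1] == chunk[-1]
        if delimiter or incomplete_line:
            pending = lines.pop()
        else:
            pending = None

        for line in lines:
            yield line

    if pending is not None:
        yield pending

In this shape the flag is assigned in exactly one place, which was the readability concern raised above.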

Member

Yep, I'm on board with @sigmavirus24's suggestion. That seems the most readable option out of what's been suggested.

The only minor input I have is standardizing the use of _ in ends_with vs startswith in the variable names.

@PCMan (Author) commented Apr 26, 2017

@Lukasa Fixed as suggested by @sigmavirus24.

@sigmavirus24 (Contributor) left a comment

This looks fine to me.

@Lukasa (Member) left a comment

Looks good to me too. Do you want to squash the commits down just for cleanliness sake? Then I'll go ahead and merge.

…n' pair being separated in two different chunks.
@PCMan (Author) commented Apr 26, 2017

@Lukasa @sigmavirus24 git squashed. Thanks for your patience.

@Lukasa (Member) left a comment

Not at all, thank you for working with us to get this ready to go. Well done! ✨ 🍰 ✨
