
Fix the additional newline generated by iter_lines() caused by a '\r\n' pair being separated in two different chunks. #3984

Merged
1 commit merged into proposed/3.0.0 on Apr 26, 2017

Conversation

@PCMan commented Apr 21, 2017

When a "\r\n" CRLF pair is accidentally split across two chunks, iter_lines() generates two line breaks instead of one. This patch fixes the incorrect behavior and closes issue #3980.
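
For illustration only (not part of the patch), the effect can be reproduced with plain splitlines() on a body whose CRLF falls across a chunk boundary; the 6-byte chunking below is an arbitrary choice:

body = b"line1\r\nline2"
chunks = [body[i:i + 6] for i in range(0, len(body), 6)]   # [b'line1\r', b'\nline2']

# Splitting each chunk independently produces a spurious empty line, because the
# trailing b'\r' and the leading b'\n' are each counted as their own line break.
lines = []
for chunk in chunks:
    lines.extend(chunk.splitlines())

print(lines)              # [b'line1', b'', b'line2']
print(body.splitlines())  # [b'line1', b'line2']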

@Lukasa (Member) left a comment

Cool, this patch looks broadly right! I've left a few minor structural notes in the diff.

Additionally, it'd be good if you could add a test that reproduces this problem and demonstrates that it's fixed. It should be easy enough to do with a careful choice of body and chunk size.

@@ -820,6 +827,9 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
for line in lines:
yield line

# check if the current chunk ends with '\r'
last_chunk_ends_with_cr = chunk.endswith('\r' if decode_unicode else b'\r')
Member

Let's hoist the results of these two conditional expressions to the top of the function, rather than evaluating them twice each time around the loop:

carriage_return = u'\r' if decode_unicode else b'\r'
line_feed = u'\n' if decode_unicode else b'\n'

# Edge case: if the last chunk ends with '\r', and the current chunk starts with \n,
# they should be merged and treated as only "one" new line separator '\r\n'.
# So the first '\n' in this chunk should be skipped since it's just the second half of
# the CRLF pair ('\r\n') rather than another new line break.
Member

We should expand this comment to explain that this only affects the splitlines case, because splitlines will treat any of \r, \r\n, and \n as newlines, and so splitting \r\n across two chunks gives misleading results.
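
For reference, the splitlines() behaviour being described (shown here on bytes; the same applies to str):

b'a\rb\nc\r\nd'.splitlines()  # [b'a', b'b', b'c', b'd'] -- \r, \n and \r\n all end a line
b'a\r'.splitlines()           # [b'a']
b'\nb'.splitlines()           # [b'', b'b'] -- a stray leading \n yields an empty line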

Author

OK, will do that.

@@ -820,6 +827,9 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
for line in lines:
yield line

# check if the current chunk ends with '\r'
last_chunk_ends_with_cr = chunk.endswith('\r' if decode_unicode else b'\r')
Member

It would also be good to hoist this into the block that holds the splitlines call, as we don't need to do this processing in any other situation.

@PCMan (Author) commented Apr 21, 2017

@Lukasa Thanks for the review; I have already modified the code following your instructions.
However, there seem to be some issues with the unit tests. I'll take a look at them.

@PCMan (Author) commented Apr 21, 2017

@Lukasa While trying to fix the unit tests, I noticed that treating "\r\n" as two line breaks is actually the intended behavior according to the current unit tests. This behavior is incompatible with Python's default (universal newlines) and is not explained in the requests API docs either, so what delimiter=None does should be well defined and documented before we try to fix this. The unit tests should be updated to reflect the change as well.

  • If treating "\r\n" as two line breaks is intended, then the fix is to avoid splitlines() and split the lines on either '\r' or '\n' ourselves (see the sketch below), and to document this in the API doc.
  • Otherwise, we should emulate the behavior of splitlines(), follow Python's universal newline handling, and fix all of the unit tests and the API doc to reflect this behavior change.
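
A minimal sketch of the difference between the two options above, using re.split for option 1 (the use of re here is my own illustration, not part of the patch):

import re

body = b"line1\r\nline2\rline3\nline4"

# Option 1: '\r' and '\n' are independent breaks, so '\r\n' produces an empty line.
print(re.split(b'[\r\n]', body))   # [b'line1', b'', b'line2', b'line3', b'line4']

# Option 2: universal-newline style, as splitlines() does: '\r\n' counts as one break.
print(body.splitlines())           # [b'line1', b'line2', b'line3', b'line4']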

Either way, the current code needs a fix, since different chunk sizes can yield different results.

Another issue with the current unit tests is that they only work with Python 2. To work with Python 3, delimiter should be bytes rather than str when decode_unicode is not True, which is not addressed in the unit tests.
@Lukasa Your suggestions are welcome. Thank you.

# The last chunk ends with '\r', so the '\n' at chunk[0]
# is just the second half of a '\r\n' pair rather than a
# new line break. Just skip it.
chunk = chunk[1:]
Member

One thing to keep in mind here is the final line of the body ending in \r\n. If the chunk boundary falls right after the \r, we get a final one-byte chunk of \n. This code then makes chunk equal to '', and ''.splitlines() results in an empty list, so the call to lines[-1] below raises an IndexError.

Probably a rarer edge case, but something we should cover.
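
A minimal demonstration of that edge case, assuming the chunk boundary falls right after the '\r':

chunk = b'\n'               # final chunk of a body that ends with b'\r\n'
chunk = chunk[1:]           # skip the '\n' belonging to the previous chunk's '\r'
lines = chunk.splitlines()  # b''.splitlines() == []
lines[-1]                   # IndexError: index out of range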

Author

Fixed. Thanks.

@Lukasa (Member) commented Apr 21, 2017

The unit tests should be updated: splitlines is clearly intended to be part of the behaviour of the code and so the unit tests as written should be considered as codifying the v2 behaviour, rather than guiding the v3. We can change them for v3.

> To work with Python 3, delimiter should be bytes rather than str when decode_unicode is not True, which is not addressed in the unit tests.

Yup, that should change as well.

@nateprewitt (Member)

It seems sensible to have the test data default be in bytes since that's what we should be receiving from urllib3. We can have a parametrize toggle if we want to test both True and False for decode_unicode.
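
A rough sketch of what such a parametrized test might look like; the test name, body, and mock setup here are my own (modelled on the repro script below), not the actual test-suite code:

import pytest
import requests

@pytest.mark.parametrize('decode_unicode', (True, False))
def test_iter_lines_crlf_split_across_chunks(decode_unicode):
    data = b"line1\r\nline2"

    def mock_iter_content(chunk_size=1, decode_unicode=None):
        # Deliver the body in chunks that cut the CRLF pair in half.
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            yield chunk.decode('utf-8') if decode_unicode else chunk

    r = requests.Response()
    r._content_consumed = True
    r.iter_content = mock_iter_content

    expected = (data.decode('utf-8') if decode_unicode else data).splitlines()
    assert list(r.iter_lines(chunk_size=6, decode_unicode=decode_unicode)) == expected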

@PCMan (Author) commented Apr 22, 2017

@Lukasa @nateprewitt I've just fixed all of the unit tests and the mentioned chunk[1:] problems. All related tests should pass now. I've also provided a small test program to demonstrate the bug:

import requests

test_content = b"line1\r\nline2"

def mock_iter_content(chunk_size=1, decode_unicode=None):
    for i in range(0, len(test_content), chunk_size):
        yield test_content[i: i + chunk_size]

r = requests.Response()
r._content_consumed = True
r.iter_content = mock_iter_content

# With chunk_size=6 the body arrives as b'line1\r' and b'\nline2', splitting the CRLF
# pair; before the fix this yields [b'line1', b'', b'line2'] instead of [b'line1', b'line2'].
assert list(r.iter_lines(chunk_size=6)) == test_content.splitlines()

Thanks!

@@ -813,13 +831,17 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
#
# If we're using `splitlines()`, we only do this if the chunk
# ended midway through a line.
incomplete_line = lines[-1] and lines[-1][-1] == chunk[-1]
incomplete_line = lines[-1] and lines[-1][-1] and lines[-1][-1] == chunk[-1]
Member

Why has the condition been doubled up here? Under what circumstances will lines[-1][-1] be falsey?

Author

Sorry, I forgot to remove it. That case should already be prevented by the check added for chunk[1:].

if delimiter or incomplete_line:
pending = lines.pop()

for line in lines:
yield line

# check if the current chunk ends with '\r'
if delimiter is None:
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
Member

Again, this should be brought into the if branch above.

# the chunks containing a single '\n', it emits '' as a line -- whereas
# `.splitlines()` combines with the '\r' and splits on `\r\n`.
# decode_unicode=True, output unicode strings
assert list(r.iter_lines(decode_unicode=True, delimiter='\r\n')) == unicode_mock_data.split('\r\n')
Member

These tests run on Python 2, so Unicode strings must be prefixed with u

([b'a\r\n',b'\rb\n'], ['a', '', 'b'], ['a', '\rb\n']),
([b'a\nb', b'c'], ['a', 'bc'], ['a\nbc']),
([b'a\n', b'\rb', b'\r\nc'], ['a', '', 'b', 'c'], ['a\n\rb', 'c']),
([b'a\r\nb', b'', b'c'], ['a', 'bc'], ['a', 'bc']) # Empty chunk with pending data
Member

Again, your test seems to show that all the unprefixed strings are intended to be Unicode: they need to be prefixed.

@codecov-io commented Apr 22, 2017

Codecov Report

Merging #3984 into proposed/3.0.0 will increase coverage by 0.04%.
The diff coverage is 100%.


@@                Coverage Diff                 @@
##           proposed/3.0.0    #3984      +/-   ##
==================================================
+ Coverage           89.44%   89.48%   +0.04%     
==================================================
  Files                  15       15              
  Lines                1932     1941       +9     
==================================================
+ Hits                 1728     1737       +9     
  Misses                204      204
Impacted Files Coverage Δ
requests/models.py 94.33% <100%> (+0.11%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@PCMan (Author) commented Apr 22, 2017

@Lukasa Thanks for the detailed review. All problems are cleared now.

@@ -796,7 +800,23 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
if delimiter:
lines = chunk.split(delimiter)
else:
# Python splitlines() supports the universal newline (PEP 278).
Contributor

Wasn't universal newline support dropped in Python 3? I know you can no longer get a TextIOBuffer object that supports that. How does that affect splitting strings?

@PCMan (Author) commented Apr 22, 2017

Well, str.splitlines() actually supports a superset of universal newlines.
FYI: https://docs.python.org/3/library/stdtypes.html#str.splitlines
Unicode strings are split on a superset of the universal newline set.
However, Python 3 bytes behaves differently:
https://docs.python.org/3/library/stdtypes.html#bytes.splitlines
In both cases, though, the '\r', '\n', and '\r\n' rule still applies.
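
A quick illustration of that difference on Python 3 (\x1c is the FILE SEPARATOR control character, one of the extra line boundaries recognised by str.splitlines()):

'a\x1cb'.splitlines()    # ['a', 'b']    -- str splits on a superset of \r, \n, \r\n
b'a\x1cb'.splitlines()   # [b'a\x1cb']   -- bytes splits only on \r, \n and \r\n
'a\r\nb'.splitlines()    # ['a', 'b']    -- the \r/\n/\r\n rule holds for both types
b'a\r\nb'.splitlines()   # [b'a', b'b']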

@@ -813,7 +833,7 @@ def iter_lines(self, chunk_size=ITER_CHUNK_SIZE, decode_unicode=None, delimiter=
#
# If we're using `splitlines()`, we only do this if the chunk
# ended midway through a line.
incomplete_line = lines[-1] and lines[-1][-1] == chunk[-1]
incomplete_line = lines[-1] and  lines[-1][-1] == chunk[-1]
Contributor

Why is there an extra space between the and keyword and lines[-1][-1]?

Author

Removed.

@Lukasa (Member) left a comment

Cool, this is looking really good. A few minor stylistic notes for you.

# Python splitlines() supports the universal newline (PEP 278).
# That means, '\r', '\n', and '\r\n' are all treated as end of
# line. If the last chunk ends with '\r', and the current chunk
# starts with \n, they should be merged and treated as only
Member

Minor nit: as there are inverted commas around '\r' in the line above, they should be added here around \n.

# That means, '\r', '\n', and '\r\n' are all treated as end of
# line. If the last chunk ends with '\r', and the current chunk
# starts with \n, they should be merged and treated as only
# "one" new line separator '\r\n' by splitlines().
Member

Minor nit: should be *one*, not "one".

@PCMan (Author) commented Apr 23, 2017

@Lukasa Done! :)

lines = chunk.splitlines()
# check if the current chunk ends with '\r'
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
Member

Oooh, just noticed a subtle misbehaviour. Imagine we get the following chunks:

[b'this is a string\r', b'\n', b'\nso is this']

This will not do the right thing: we'll strip two newline characters, rather than emit an empty line. That is, we'll emit: [b'this is a string', b'so is this'] instead of [b'this is a string', b'', b'so is this'], which is what we should emit.

This is because a chunk that is just a \n causes an early continue, above, without resetting the value of the last_chunk_ends_with_cr boolean. Essentially it gets completely ignored. That's not right.

It'd be good to add a test case (or modify an existing one) to include that case, and then fix it up by hoisting this flag up before the if not chunk: continue clause.
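
A sketch of such a test case, mirroring the existing repro script; the expected output assumes the fixed behaviour described above:

import requests

chunks = [b'this is a string\r', b'\n', b'\nso is this']

def mock_iter_content(chunk_size=1, decode_unicode=None):
    for chunk in chunks:
        yield chunk

r = requests.Response()
r._content_consumed = True
r.iter_content = mock_iter_content

# The lone b'\n' chunk merely completes the first CRLF; the next chunk's leading
# b'\n' starts a genuinely empty line and must not be swallowed.
assert list(r.iter_lines()) == [b'this is a string', b'', b'so is this']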

@PCMan (Author) commented Apr 25, 2017

@Lukasa Nice catch! Sorry that I overlooked such a simple check. I've just added a simple fix along with a test case for it. BTW, I did not hoist the flag as you requested, because it would look something like this:

                # a temp variable is needed here, and I cannot find a good name for it that is not confusing
                old_last_chunk_ends_with_cr = last_chunk_ends_with_cr
                last_chunk_ends_with_cr = chunk.endswith(carriage_return)
                if old_last_chunk_ends_with_cr and chunk.startswith(line_feed):
                    chunk = chunk[1:]
                    if not chunk:
                        continue

Also, this makes an unnecessary endswith('\r') call, since when chunk[1:] is empty we already know the result is False. So I prefer just setting the flag to False, even though it looks like code duplication at first glance:

if not chunk:
    last_chunk_ends_with_cr = False

Simple and readable.

Member

Hrm. I don't know if I think it's more readable to do it this way. It means there are two places in the code that set this variable, instead of just one, which forces the reader to ask themselves why it was done this way. I don't know that the efficiency gains are worth it. 🤔

assert result != mock_data.splitlines()
assert result[2] == ''
assert result[4] == ''
assert result == mock_data.splitlines()
Member

We can probably collapse this line into the one above it as with the other assertions:

assert list(r.iter_lines()) == mock_data.splitlines()

Author

This piece of code seems to be outdated. I didn't see it in the current HEAD.

Member

Yup, you're right.

@Lukasa (Member) left a comment

So this is now looking good except for my possible concern about the readability of our two alternative approaches. I'd like either @sigmavirus24 or @nateprewitt to weigh in there, if possible.

@sigmavirus24 (Contributor)

> So this is now looking good except for my possible concern about the readability of our two alternative approaches.

I think I'm missing something. I've looked through the PR but I don't see alternative approaches or readability issues. This could just be the woefully immature GitHub Review system hiding it from me though.

@Lukasa (Member) commented Apr 25, 2017

Yeah, it is. See the discussion above.

# starts with '\n', they should be merged and treated as only
# *one* new line separator '\r\n' by splitlines().
# This rule only applies when splitlines() is used.
if last_chunk_ends_with_cr and chunk.startswith(line_feed):
Contributor

Now that I understand what concerns @Lukasa has, I agree. I think this could be better written as:

chunk_startswith_line_feed = last_chunk_ends_with_cr and chunk.startswith(line_feed)
last_chunk_ends_with_cr = chunk.endswith(carriage_return)
if chunk_startswith_line_feed:
    chunk = chunk[1:]
    if not chunk:
        continue
lines = chunk.splitlines()

This properly explains the condition on this line with a descriptive variable name. It avoids the old_last_chunk_ends_with_cr variable and it keeps things concise and easy to reason about.
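
For context, here is a minimal sketch of how the delimiter-is-None path of iter_lines() might read with this suggestion folded in. This is an approximation written from the discussion in this thread, not the exact code that was merged:

def iter_lines(self, chunk_size=512, decode_unicode=None, delimiter=None):  # 512 ~ ITER_CHUNK_SIZE
    pending = None
    last_chunk_ends_with_cr = False
    carriage_return = u'\r' if decode_unicode else b'\r'
    line_feed = u'\n' if decode_unicode else b'\n'

    for chunk in self.iter_content(chunk_size=chunk_size, decode_unicode=decode_unicode):
        if pending is not None:
            chunk = pending + chunk

        if delimiter:
            lines = chunk.split(delimiter)
        else:
            # Skip a leading '\n' that is the second half of a CRLF pair split across
            # chunk boundaries, then remember whether this chunk itself ends in '\r'.
            skip_first_char = last_chunk_ends_with_cr and chunk.startswith(line_feed)
            last_chunk_ends_with_cr = chunk.endswith(carriage_return)
            if skip_first_char:
                chunk = chunk[1:]
                if not chunk:
                    continue
            lines = chunk.splitlines()

        # The chunk ended mid-line if the last piece still ends with the chunk's last character.
        incomplete_line = lines and lines[-1] and lines[-1][-1] == chunk[-1]
        if delimiter or incomplete_line:
            pending = lines.pop()
        else:
            pending = None

        for line in lines:
            yield line

    if pending is not None:
        yield pending

In this shape the flag is assigned in exactly one place, which was the readability concern raised above.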

Member

Yep, I'm on board with @sigmavirus24's suggestion. That seems the most readable option out of what's been suggested.

The only minor input I have is standardizing the use of _ in ends_with vs startswith in the variable names.

@PCMan (Author) commented Apr 26, 2017

@Lukasa Fixed as suggested by @sigmavirus24.

@sigmavirus24 (Contributor) left a comment

This looks fine to me.

@Lukasa (Member) left a comment

Looks good to me too. Do you want to squash the commits down just for cleanliness sake? Then I'll go ahead and merge.

…n' pair being separated in two different chunks.
@PCMan (Author) commented Apr 26, 2017

@Lukasa @sigmavirus24 git squashed. Thanks for your patience.

@Lukasa (Member) left a comment

Not at all, thank you for working with us to get this ready to go. Well done! ✨ 🍰 ✨
