BUG: Use size_t to avoid array index overflow; add missing malloc of error_msg #17040

Closed
jeffknupp wants to merge 14 commits into pandas-dev:master from jeffknupp:16790-core-on-large-csv

Conversation

jeffknupp
Contributor

Fix a few locations where a parser's error_msg buffer is written to
without having been previously allocated. This manifested as a double
free during exception handling code making use of the error_msg.

Additionally, use size_t/ssize_t where array indices or lengths will
be stored. Previously, int32_t was used and would overflow on columns
with very large amounts of data (i.e. greater than INTMAX bytes).

…error_msg

Fix a few locations where a parser's `error_msg` buffer is written to
without having been previously allocated. This manifested as a double
free during exception handling code making use of the `error_msg`.

Additionally, use `size_t/ssize_t` where array indices or lengths will
be stored. Previously, int32_t was used and would overflow on columns
with very large amounts of data (i.e. greater than INTMAX bytes).
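For readers skimming the thread, the class of bug being fixed looks roughly like this (a minimal sketch with illustrative names, not the actual pandas tokenizer code; the real fix simply adds the malloc ahead of each snprintf into error_msg):

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative struct; the real parser_t lives in pandas' C tokenizer. */
    typedef struct {
        char *error_msg;
    } parser_t;

    static void report_error(parser_t *self, const char *what) {
        size_t bufsize = 100;

        /* Before the fix, some error paths wrote straight into error_msg
         * without this allocation, so snprintf scribbled through an
         * uninitialized pointer and the later free() of error_msg during
         * exception handling corrupted the heap (the observed double free). */
        self->error_msg = (char *)malloc(bufsize);
        if (self->error_msg != NULL)
            snprintf(self->error_msg, bufsize, "error: %s", what);
    }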
@gfyoung gfyoung added Bug IO CSV read_csv, to_csv labels Jul 20, 2017
char *data # pointer to data to be processed
int datalen # amount of data available
int datapos
size_t datalen # amount of data available
Member

Fix the spacing so that the hashtags line up like before.

Contributor Author

If I could do that and keep it under 80 columns, I would have

Member

Why can't you move the "pointer to data to be processed" one further right?

Contributor Author

Oops! Sorry, was looking at wrong file! Will fix in next push.

int words_cap
size_t *word_starts # where we are in the stream
size_t words_len
size_t words_cap

char *pword_start # pointer to stream start of current field
Member

Modify the spacing so that it's realigned with the hashtags below.

int header_start # header row start
int header_end # header row end
ssize_t header_start # header row start
ssize_t header_end # header row end
Member

Modify the spacing so that it aligns with the hashtag above.

Member

@gfyoung gfyoung Jul 20, 2017

Also, why ssize_t here (and not size_t like above)? We should either use one or the other with these changes. I think either would work since all of these fields must be >= 0.

Contributor Author

In some specific places, -1 is used as a sentinel value. This patch should exhaustively identify all of those cases or change them to not need to use a negative number.
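Concretely, the sentinel pattern looks like this (a minimal sketch; the field name mirrors the header_start field in the diff above, everything else is illustrative):

    #include <stdint.h>

    typedef struct {
        int64_t header_start;   /* -1 is the sentinel for "no header row" */
    } parser_state_t;           /* illustrative struct, not the real parser_t */

    static int has_header(const parser_state_t *p) {
        /* With a signed type the sentinel test is direct.  Had the field been
         * size_t, assigning -1 would silently store SIZE_MAX and a `>= 0`
         * check would always be true, so the sentinel could not be expressed. */
        return p->header_start >= 0;
    }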

Member

Ah, I see. Also, I think in light of what @wesm said below, it might be preferable to use int64_t instead given the cross-platform compatibility issue.

@wesm
Member

wesm commented Jul 20, 2017

Very honestly, I'm not sure I'm comfortable with this many unsigned integers. My rule of thumb is:

  • Use int for things that are unlikely to ever be big enough to worry about an int32 overflow (like the number of columns in a table -- if you had 2 billion columns, you would probably run into other problems sooner)
  • Use int64_t for things that are expected to be able to overflow a 32-bit signed integer
  • Use size_t exclusively when dealing with malloc, memcpy, and other C standard library functions, and for handling C++ loop indices in the STL (for (size_t i = 0; i < foo.size(); ++i))

In Google's C++ style guide they say

Some people, including some textbook authors, recommend using unsigned types to represent
numbers that are never negative. This is intended as a form of self-documentation. However, in C, 
the advantages of such documentation are outweighed by the real bugs it can introduce. Consider:

for (unsigned int i = foo.Length()-1; i >= 0; --i) ...

This code will never terminate! Sometimes gcc will notice this bug and warn you, but often it will not.
Equally bad bugs can occur when comparing signed and unsigned variables. Basically, C's type
promotion scheme causes unsigned types to behave differently than one might expect.

So, document that a variable is non-negative using assertions. Don't use an unsigned type.
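For concreteness, the loop pitfall described in that quote, together with the signed-index form suggested in this thread, in a small standalone sketch:

    #include <stddef.h>
    #include <stdint.h>

    void walk_backwards(const char *buf, size_t len) {
        /* Buggy form from the quote: with an unsigned index, `i >= 0` is
         * always true, and decrementing past 0 wraps around to SIZE_MAX:
         *
         *     for (size_t i = len - 1; i >= 0; --i) { ... }
         */

        /* Signed 64-bit index, in line with the int64_t suggestion below. */
        for (int64_t i = (int64_t)len - 1; i >= 0; --i) {
            (void)buf[i];   /* visits buf[len-1] .. buf[0] */
        }

        /* Or keep the unsigned type but test before decrementing. */
        for (size_t i = len; i-- > 0; ) {
            (void)buf[i];
        }
    }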

@gfyoung
Member

gfyoung commented Jul 20, 2017

Very honestly, I'm not sure I'm comfortable with this many unsigned integers.

That's fair. In that case, ssize_t would be preferable to size_t.

@wesm
Member

wesm commented Jul 20, 2017

ssize_t isn't a standard integer type; it's a POSIX typedef. For best cross-platform compatibility the best integer to use for values that are expected to be large is int64_t

@jeffknupp
Contributor Author

jeffknupp commented Jul 20, 2017

@wesm FWIW, almost all of the changed fields are malloc-ed, memcpy-ed or realloc-ed (the source of the original bug). I tried to keep the surface area of the change as small as possible but still address the places where a) an int32_t would overflow and b) one would reasonably expect a (s)size_t.

For me, size_t always seemed natural because a) it's what malloc() expects and b) it's what sizeof() returns. The reason why it is (and must be) unsigned, then, becomes clear: if it weren't, one wouldn't be able to allocate the full addressable space via malloc() or accurately report the size of a type greater than int64_t max, while both are perfectly valid uses of the respective functions (IMHO).

That said, it sounds like you're -1 on size_t, which is fine. I'll fix all changed types to be int64_t.
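The overflow itself is easy to show in isolation (a standalone sketch; the 2.5 GB figure is illustrative, any single column over INT_MAX bytes wraps the same way):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* A single CSV column holding ~2.5 GB of character data. */
        int64_t col_bytes = 2500000000LL;

        /* Stored in a 32-bit signed int, the length wraps to a negative value,
         * so any capacity arithmetic or malloc()/realloc() size derived from it
         * is garbage.  Kept in int64_t, it survives intact and can be converted
         * to size_t only at the allocation call site. */
        int32_t narrow = (int32_t)col_bytes;

        printf("as int64_t: %lld bytes\n", (long long)col_bytes);
        printf("as int32_t: %d\n", (int)narrow);
        return 0;
    }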

@codecov

codecov bot commented Jul 20, 2017

Codecov Report

Merging #17040 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17040      +/-   ##
==========================================
- Coverage   91.02%      91%   -0.02%     
==========================================
  Files         161      161              
  Lines       49308    49308              
==========================================
- Hits        44883    44874       -9     
- Misses       4425     4434       +9
Flag                    Coverage Δ
#multiple               88.78% <ø> (ø) ⬆️
#single                 40.19% <ø> (-0.07%) ⬇️

Impacted Files          Coverage Δ
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/frame.py    97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e58225...e04d12a. Read the comment docs.

@codecov

codecov bot commented Jul 20, 2017

Codecov Report

Merging #17040 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17040      +/-   ##
==========================================
+ Coverage   91.02%   91.03%   +<.01%     
==========================================
  Files         161      161              
  Lines       49308    49351      +43     
==========================================
+ Hits        44883    44927      +44     
+ Misses       4425     4424       -1
Flag                           Coverage Δ
#multiple                      88.81% <100%> (+0.03%) ⬆️
#single                        40.26% <66.66%> (ø) ⬆️

Impacted Files                 Coverage Δ
pandas/conftest.py             96.77% <100%> (+0.34%) ⬆️
pandas/io/gbq.py               25% <0%> (-58.34%) ⬇️
pandas/core/missing.py         84.36% <0%> (-0.89%) ⬇️
pandas/core/generic.py         92% <0%> (-0.31%) ⬇️
pandas/core/ops.py             91.86% <0%> (-0.17%) ⬇️
pandas/core/frame.py           97.75% <0%> (-0.11%) ⬇️
pandas/core/sparse/array.py    91.42% <0%> (-0.05%) ⬇️
pandas/core/panel.py           96.92% <0%> (-0.03%) ⬇️
pandas/core/sparse/frame.py    94.24% <0%> (ø) ⬆️
pandas/io/html.py              84.85% <0%> (ø) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e58225...6a1ba23. Read the comment docs.

@jeffknupp
Contributor Author

jeffknupp commented Jul 20, 2017

As an aside, Google's style guide seems a bit all-over-the-map. They suggest not using unsigned types and in the very same breath dictate that assertions should be preferred for "documenting" that a value must be non-negative. Of course, the only problem with that is that assert is a macro dependent on the NDEBUG macro, which itself is not part of the C++ standard library. Also, I can't imagine finding a widely used version of gcc (well, g++ I guess) that wouldn't warn about the type difference in a comparison of the form they present. If the compiler can't make that determination, you may as well just give up... :trollface:

Enough bike-shedding, though! I made the requested changes and pushed. Please review at your convenience.


char *pword_start # pointer to stream start of current field
int word_start # position start of current field
char *pword_start # pointer to stream start of current field
Member

So close! One more space.

int header_start # header row start
int header_end # header row end
int64_t header_start # header row start
int64_t header_end # header row end
Member

I'm going to let OCD take over and suggest that you add a couple more spaces to align with the hashtag above. 😄

@gfyoung
Member

gfyoung commented Jul 20, 2017

@jeffknupp : Do you have a test to reproduce the issue (I know that was problematic in the original issue, but I'm curious what the current status on that is now)?

Also, add an entry in the whatsnew.

@gfyoung
Member

gfyoung commented Jul 20, 2017

@jeffknupp : Can you check that the issue in #14696 is handled by your changes?

@jeffknupp
Contributor Author

jeffknupp commented Jul 21, 2017 via email

@wesm
Member

wesm commented Jul 21, 2017

@jeffknupp sorry about my bike shedding. I agree it's often a lot of worrying over nothing but thanks for indulging my OCDs

@jeffknupp
Contributor Author

@gfyoung Added what I think is an appropriate set of entries in whatsnew as well as some heavy-duty horizontal alignment 😉.

I have a very simple test (which I gave to @wesm and believe was linked to from the original issue) which would cause a simple script of the form:

import pandas as pd

try:
    df = pd.read_csv('test_file.gz2', low_memory=False)
except:
    print('Failed')

to dump core (and, after the allocation issue was fixed, to simply print an "out of memory" error). It required a data file that was ~300 MB after bzip2 compression, for which there should be a link.

@gfyoung
Member

gfyoung commented Jul 21, 2017

as well as some heavy-duty horizontal alignment

It's beautiful, thanks. 😄

I have a very simple test

Ah, okay. It seems like we would have to add this to the repository, though I don't want to commit something like that given the file size without further discussion. Thoughts?

@jeffknupp
Contributor Author

@wesm was faced with a similar issue for a similar problem in apache/arrow, which is where this first manifested itself. Let me dig around and see what he did (though I remember an email in the past few days specifically mentioning that the test included would only be run if a specific flag was set, and wouldn't run by default).

@jeffknupp
Contributor Author

So, it looks like it works aside from Python 2.7 on Windows? I'm looking at the failing tests now in appveyor. Would like to get this buttoned up tonight but may have to punt to the morning.

Contributor

@jreback jreback left a comment

so if you can add a test which triggers this before the fixes and works after. ideally you simply generate in memory a large enough string that you need. (and mark this as a slow test). There are a couple of examples where we do something like this already in the parser tests.

int datalen # amount of data available
int datapos
int64_t chunksize # Number of bytes to prepare for each chunk
char *data # pointer to data to be processed
Contributor

pls line up

@@ -1336,6 +1339,7 @@ cdef class TextReader:
kh_destroy_str(table)

cdef _get_column_name(self, Py_ssize_t i, Py_ssize_t nused):
cdef int64_t j
Contributor

was this not defined before?

Contributor Author

No, and for the life of me I couldn't figure out how it was working given all other instances seem to require it (left as an exercise for the reader as to why it worked but should be defined with the proper type here, regardless).

@@ -507,7 +517,7 @@ static int end_line(parser_t *self) {
TRACE((
"end_line: ERROR!!! self->lines(%zu) >= self->lines_cap(%zu)\n",
self->lines, self->lines_cap))
int bufsize = 100;
size_t bufsize = 100;
self->error_msg = (char *)malloc(bufsize);
snprintf(self->error_msg, bufsize,
Contributor

I thought we were going to pre-allocate the error msg buffer?

Contributor Author

I think doing so is a logically separate change and has some subtleties (e.g. what should the allocated size be, and what happens when more is needed down the road, since it's not obvious someone has to check their (possibly dynamic) output length against the buffer size). Feel free to open an issue and I'll submit a patch if I have time.
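For reference, the pre-allocation alternative being discussed would look roughly like the following (a sketch only; ERROR_MSG_CAP is an assumed capacity, and picking it is exactly the open question mentioned above):

    #include <stdio.h>
    #include <stdlib.h>

    #define ERROR_MSG_CAP 256   /* assumed capacity; choosing it is the hard part */

    typedef struct {
        char *error_msg;        /* illustrative struct, not the real parser_t */
    } parser_t;

    static int parser_init(parser_t *self) {
        self->error_msg = (char *)malloc(ERROR_MSG_CAP);
        if (self->error_msg == NULL)
            return -1;
        self->error_msg[0] = '\0';
        return 0;
    }

    static void parser_set_error(parser_t *self, const char *msg) {
        /* Every error path writes into the one pre-allocated buffer;
         * snprintf never writes past ERROR_MSG_CAP, longer messages truncate. */
        snprintf(self->error_msg, ERROR_MSG_CAP, "%s", msg);
    }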

Contributor

well if you'd like to open an issue that is fine, the original PR was closed and didn't have an issue.

@@ -1281,20 +1299,26 @@ int parser_trim_buffers(parser_t *self) {
}

/* trim line_start, line_fields */
new_cap = _next_pow2(self->lines) + 1;
if ((int)new_cap < self->lines_cap) {
if (new_cap < 1024 * 1024 * 1024) {
Contributor

why are you doing this ?

Contributor Author

Sorry, that must have been left over from old test code; will revert.

@jeffknupp
Contributor Author

@jreback Right, that's the testing solution I originally proposed on the issue. Marking it as a "slow" test, though, won't work, as someone looking to be exhaustive in their testing who includes the slow tests will be surprised when one fails because "slow" actually meant "big" in this case.

Is there already a separate category for "large memory" tests? It looks like that's what @wesm did for a very similar issue involving the same test data in apache/arrow

@jeffknupp
Contributor Author

@jreback Note that the issue is not related to the underlying architecture of the machine. Rather, it's due to the fact that using ints in calls to malloc/realloc exposes you to integer overflow issues when trying to allocate anything more than INT_MAX / 2 bytes (even a 32-bit system can, by definition, address more than INT_MAX bytes).

To trigger the issue, one simply needs to create a CSV where at least one column is larger than INT_MAX bytes. However, since that value is typically exactly 2 GB, you'll need a machine with at least that much RAM. In reality, you'll need quite a bit more since a) you need to make at least one copy of the data for the parser and b) you need to make sure you don't start swapping so badly that you take down the system, while also avoiding being killed by the OOM killer.

So, tl;dr, any test for this issue will need to be run on a machine with sufficient RAM, where "sufficient" is likely in the > 6GB range.

@pep8speaks

pep8speaks commented Jul 23, 2017

Hello @jeffknupp! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 23, 2017 at 14:07 Hours UTC

@jeffknupp
Contributor Author

@jreback @gfyoung OK, this should be considered complete.

I added a new pytest option, --run-highmemory, which enables tests marked high_memory. They are not run by default as I don't want to destroy your numerous CI environments 😄 .

The new test, marked as high_memory, is crude but effective. The whole test takes roughly 8GB on my MBP and maybe 20 seconds to complete. I have confirmed:

  • If run against current master, pytest itself crashes (which I think we can consider "a failure"), as was observed in the issue
  • If run on the new code with --run-highmemory enabled, the test passes

Note that since I'm not aware of a nontrivial way of constructing a test to see if the interpreter itself will crash, test coverage will likely complain that the two new error_msg allocations are not covered (since this code fixes both the crashes at those locations and the code that was incorrectly invoking them in the first place). I hope that's OK as I'm not sure how much more time I can devote to this specific issue.

@gfyoung
Member

gfyoung commented Jul 23, 2017

I hope that's OK as I'm not sure how much more time I can devote to this specific issue.

Understood. Could you at least patch the lint errors on Travis right now? It would be great to get all builds passing. 😄

@jeffknupp
Contributor Author

@gfyoung because you asked so nicely... 😄 . A bit of acrobatics for 80-character width. 120 is the new 80.

@@ -25,6 +28,18 @@
from .dtypes import DtypeTests


@pytest.mark.high_memory
def test_bytes_exceed_2gb():
Member

I would refactor to add this into the CParser tests. You don't necessarily have to have the test run for the low-memory c-parser (you can just have it skip if that is the case).

Contributor Author

I considered that, but given that the parser tests in general are all constructed to run all of the tests under one of three parser setups (and this only applies to one of them), it seemed like it would be shoe-horning it in just to keep things "pretty". There were a few other locations where one-off tests for the parsers were created because they had very specific functions, so I thought this was closer to the latter case.

There was also text in the documentation that pointed to the old unittest style tests being deprecated in favor of normal pytest-style tests.

Of course, all that said, I'm happy to change if you disagree.

csv = StringIO('strings\n' + '\n'.join(
['x' * (1 << 20) for _ in range(2100)]))
df = read_csv(csv, low_memory=False)
assert not df.empty
Member

Ideally, it would be great if we could check the actual contents. Then again, I'm not sure I want to instantiate another copy of this DataFrame in memory 😄

Contributor

this is fine - we care about segfault

Contributor Author

Thanks. Would this be mergeable, then? I've addressed @gfyoung 's comments which I believe were the last outstanding.

@@ -201,7 +201,8 @@ I/O
^^^

- Bug in :func:`read_csv` in which non integer values for the header argument generated an unhelpful / unrelated error message (:issue:`16338`)

- Bug in :func:`read_csv` in which passing a CSV with at least one very large (i.e. more than 2^31 - 1 bytes) column along with ``low_memory=False`` would cause an integer overflow. The result was an always unsuccessful attempt to allocate an enormous buffer and then reporting "Out of memory." (:issue:`16798`).
- Bug in :func:`read_csv` in which some error paths were assigning error messages to the internal tokenizer's ``error_msg`` field without first allocating the memory. When this happened as part of exception handling, it resulted in a double ``free`` and the program halted due to a ``SIGSEGV`` (:issue:`16798`).
Member

@gfyoung gfyoung Jul 23, 2017

I wonder if we can make these descriptions less technical. Our (less technical) audience might not understand everything written here. For example, you could write:

- Bug in :func:`read_csv` in which passing a CSV with at least one very large...would cause Python to run out of memory.
- Bug in :func:`read_csv` in which memory mismanagement with error messages caused Python to crash

True, it doesn't capture the entire error description as before, but it gets the general idea across I hope. The ellipses mean I took your entire sentence from what precedes "would cause"

Contributor Author

Sure. I was a bit hesitant to write these entries for this exact reason (I wasn't sure who the audience was and what the expectations were about entries). I've removed a lot of the detail while still giving the reader enough information to determine if the bug they just saw is, in fact, one of the two listed.

Contributor

@jreback jreback left a comment

lgtm merge in green

@jeffknupp
Contributor Author

Finally all green. Thanks for your help @jreback @gfyoung @wesm @codecov[bot] @pep8speaks ...

@jreback
Contributor

jreback commented Jul 23, 2017

thanks @jeffknupp

@jreback jreback closed this in 8d7d3fb Jul 23, 2017
@wesm
Member

wesm commented Jul 23, 2017

thanks @jeffknupp!!

@jreback
Contributor

jreback commented Jul 24, 2017

ok this is causing the 32-bit builds to segfault

https://travis-ci.org/MacPython/pandas-wheels/jobs/256710160

no easy way to debug this in travis

can u setup a vm and see what is happening ?

thanks

@gfyoung
Member

gfyoung commented Jul 24, 2017

@jreback : We could always just drop support for 32-bit. Then this won't be an issue. 😄

On a more serious note, it would be great if we could somehow add 32-bit testing to Travis. I think numpy might have a way to do it without using Docker (the current workaround based on what I've seen on Travis issues), but I had issues previously setting it up. Perhaps it might be worthwhile to revisit.

In any case, I suspect the issue might be the malloc-ing for error messages (note that it uses size_t instead of int64_t unlike the other changes). I don't see at this point how the other changes could impact this. What do you think?
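If that hunch were right, the 32-bit failure mode would involve the kind of silent narrowing sketched below (purely illustrative of how 64-bit lengths interact with a 32-bit size_t, not a confirmed diagnosis):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* A length tracked as int64_t in the parser. */
        int64_t nbytes = (int64_t)3 << 30;          /* 3 GiB */

        /* On a 32-bit build, size_t is 32 bits wide, so handing the value to
         * malloc()/memcpy() (or printing it with a %zu format) silently
         * truncates it; on a 64-bit build the same code is harmless. */
        size_t as_size = (size_t)nbytes;

        printf("int64_t value: %lld\n", (long long)nbytes);
        printf("as size_t:     %zu\n", as_size);
        return 0;
    }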

@jeffknupp
Contributor Author

I'm not sure how/why a 32-bit Travis worker would behave any differently than a 64-bit one in this case. Unless the 32-bit worker had less RAM, there's no reason why it should segfault (unless there is a regression somewhere else?).

Also, @gfyoung , just FYI, malloc(3) is meant to take a size_t (for obvious reasons) as seen in its declaration (and those of all the other memory allocation related functions):

SYNOPSIS
     #include <stdlib.h>

     void *
     calloc(size_t count, size_t size);

     void
     free(void *ptr);

     void *
     malloc(size_t size);

     void *
     realloc(void *ptr, size_t size);

     void *
     reallocf(void *ptr, size_t size);

     void *
     valloc(size_t size);

@jeffknupp
Contributor Author

@jreback Also, AFAICT, you weren't actually running the new test, since --run-high-memory wasn't passed as an argument to py.test according to the Travis run logs...

@jreback
Contributor

jreback commented Jul 24, 2017

I created #17063 to track this.

@jeffknupp

your new test is not enabled, but this is not the issue. a previously passing 32-bit build now segfaults. Pls try to replicate / fix this issue. If this cannot be tracked down soon I will revert this commit.

We currently DO support 32-bit builds. That may be re-evaluated at some point (even for 0.21.0). But that should not be coupled with this issue.

@gfyoung
Member

gfyoung commented Jul 24, 2017

Also, @gfyoung , just FYI, malloc(3) is meant to take a size_t (for obvious reasons)

@jeffknupp : Doh! You're absolutely right. Then again, segfaulting with this commit is defying logic as it is right now. 😄

@jeffknupp
Contributor Author

@jreback @gfyoung is there any way for us to discuss this in more detail via a more synchronous form of communication? I'm a bit confused about the failure @jreback pointed to and how one can match up what ran in each MacPython/pandas-wheels build.

Do you guys have Slack/IRC/AIM/ICQ/Prodigy that we could sync up on?

@gfyoung
Member

gfyoung commented Jul 24, 2017

is there any way for us to discuss this in more detail via a more synchronous form of communication?

To have more synchronous communication, you all need to be available at the same time 😄. Unfortunately, I'm not sure that's possible ATM.

Do you guys have Slack/IRC/AIM/ICQ/Prodigy that we could sync up on?

Not that I know of. We have Gitter, though I'm not sure that's going to alleviate the issue.

I'm a bit confused about the failure @jreback pointed to and how one can match up what ran in each MacPython/pandas-wheels build.

You're not alone in the confusion. 😢 I don't think any of us have a clear idea of what the failure is, as debugging on Travis is very difficult. What @jreback is suggesting is that you spin up a 32-bit Linux VM and try running tests in that VM to see if you can reproduce the segfault when running tests.

rs2 added a commit to rs2/pandas that referenced this pull request Aug 30, 2017
BUG: Use size_t to avoid array index overflow; add missing malloc of error_msg

Fix a few locations where a parser's `error_msg` buffer is written to
without having been previously allocated. This manifested as a double
free during exception handling code making use of the `error_msg`.
Additionally, use `size_t/ssize_t` where array indices or lengths will
be stored. Previously, int32_t was used and would overflow on columns
with very large amounts of data (i.e. greater than INTMAX bytes).

xref pandas-dev#14696
closes pandas-dev#16798

Author: Jeff Knupp <jeff.knupp@enigma.com>
Author: Jeff Knupp <jeff@jeffknupp.com>

Closes pandas-dev#17040 from jeffknupp/16790-core-on-large-csv and squashes the following commits:

6a1ba23 [Jeff Knupp] Clear up prose
a5d5677 [Jeff Knupp] Fix linting issues
4380c53 [Jeff Knupp] Fix linting issues
7b1cd8d [Jeff Knupp] Fix linting issues
e3cb9c1 [Jeff Knupp] Add unit test plus '--high-memory' option, *off by default*.
2ab4971 [Jeff Knupp] Remove debugging code
2930eaa [Jeff Knupp] Fix line length to conform to linter rules
e4dfd19 [Jeff Knupp] Revert printf format strings; fix more comment alignment
3171674 [Jeff Knupp] Fix some leftover size_t references
0985cf3 [Jeff Knupp] Remove debugging code; fix type cast
669d99b [Jeff Knupp] Fix linting errors re: line length
1f24847 [Jeff Knupp] Fix comment alignment; add whatsnew entry
e04d12a [Jeff Knupp] Switch to use int64_t rather than size_t due to portability concerns.
d5c75e8 [Jeff Knupp] BUG: Use size_t to avoid array index overflow; add missing malloc of error_msg

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
Labels
Bug, IO CSV (read_csv, to_csv)

Successfully merging this pull request may close these issues.

Pandas core dumps when reading large CSV file using read_csv(..., low_memory=False)
5 participants