ERR: qcut uniquess checking #14455

Closed
wants to merge 184 commits into from

Contributor

@ashishsingal1 ashishsingal1 commented Oct 19, 2016

Add option to drop non-unique bins.

@codecov-io

codecov-io commented Oct 19, 2016

Current coverage is 85.25% (diff: 75.00%)

Merging #14455 into master will increase coverage by <.01%

@@             master     #14455   diff @@
==========================================
  Files           140        140          
  Lines         50631      50633     +2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43166      43168     +2   
  Misses         7465       7465          
  Partials          0          0          

Powered by Codecov. Last update c31ea34...4450884

@@ -172,11 +176,13 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
         quantiles = q
     bins = algos.quantile(x, quantiles)
     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
-                         precision=precision, include_lowest=True)
+                         precision=precision, include_lowest=True,
+                         duplicate_edges='raise')
Member

This should pass duplicate_edges through rather than hard-coding 'raise'.

Member

Personally, I prefer an errors keyword, as it is compatible with other functions.

Contributor

agree, commented above

@sinhrks
Member

sinhrks commented Oct 20, 2016

Thanks for the PR. Can you add tests and a whatsnew entry?

@sinhrks sinhrks added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Error Reporting (Incorrect or improved errors from pandas) labels Oct 20, 2016
@sinhrks sinhrks added this to the 0.19.1 milestone Oct 20, 2016
dubourg and others added 2 commits October 20, 2016 06:25
Author: Iván Vallés Pérez <ivanvallesperez@gmail.com>

Closes #14434 from ivallesp/add-check-for-merge-indices and squashes the following commits:

e18b7c9 [Iván Vallés Pérez] Add some checks for assuring that the left_index and right_index parameters have correct types. Tests added.
@jreback jreback changed the title Update tile.py ERR: qcut uniquess checking Oct 20, 2016
@@ -141,6 +142,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
         as a scalar.
     precision : int
         The precision at which to store and display the bins labels
+    duplicate_edges : {'raise', 'drop'}, optional
Contributor

duplicate_edges -> errors='raise', 'drop'

Contributor

actually, I like duplicates=raise, drop

@@ -191,7 +197,11 @@ def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False,
ids = bins.searchsorted(x, side=side)

Contributor

Check all the valid possibilities for errors and raise otherwise (in other words, passing a bad value should raise an informative message):

if errors not in ['raise', 'drop']:
    raise ValueError("invalid value for errors parameter, valid are: raise, drop")

-        raise ValueError('Bin edges must be unique: %s' % repr(bins))
+        if (duplicate_edges == 'raise'):
+            raise ValueError('Bin edges must be unique: %s'
+                             % repr(bins))
Contributor

Expand this message to say that you can force edges to be unique by passing errors='drop'.
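To make the review requests above concrete, here is a minimal standalone sketch of what they seem to ask for: validate the option first, then either raise with a message that points to the escape hatch or drop the repeated edges. The helper name `check_bin_edges` is hypothetical and is not part of the diff, and the keyword is spelled `duplicates` here only because that was the last name floated in the review. The naming was still under discussion, and the message wording is illustrative.

```python
def check_bin_edges(bins, duplicates="raise"):
    """Hypothetical helper, not pandas code: validate the option,
    then raise on, or drop, duplicate bin edges."""
    # Reject unknown option values up front with an informative message.
    if duplicates not in ("raise", "drop"):
        raise ValueError(
            "invalid value for 'duplicates' parameter, "
            "valid options are: raise, drop"
        )

    # Assumes the edges arrive sorted, as quantiles do.
    unique_bins = sorted(set(bins))
    if len(unique_bins) < len(bins):
        if duplicates == "raise":
            # Tell the caller how to opt in to dropping duplicates.
            raise ValueError(
                "Bin edges must be unique: %r. You can drop duplicate "
                "edges by passing duplicates='drop'" % (list(bins),)
            )
        bins = unique_bins
    return list(bins)
```

With the default, `check_bin_edges([0, 0, 1, 3])` raises, which preserves the existing behavior; passing `duplicates="drop"` returns `[0, 1, 3]`.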

@jreback
Contributor

jreback commented Oct 20, 2016

Looks pretty good. Some more error checking and formatting is needed. Please add a whatsnew note in 0.19.1 (make an Enhancements section). Be sure to say that the default is the existing behavior.

@jreback
Contributor

jreback commented Oct 20, 2016

Please add some tests. Use the example from the original issue, exercising both options to duplicates, and pass some invalid values as well (and assert those raise).
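The shape of the requested tests could look roughly like the sketch below. It is written against a small pure-Python stand-in (`qcut_edges`, a hypothetical name) rather than the real `qcut`, so it runs without pandas. The input list is illustrative and is not taken from the original issue, and the `duplicates` keyword name is an assumption based on the review discussion.

```python
import unittest


def qcut_edges(values, q, duplicates="raise"):
    """Stand-in for the edge computation: q + 1 evenly spaced quantiles
    (linear interpolation), followed by the duplicate-edge check."""
    if duplicates not in ("raise", "drop"):
        raise ValueError("invalid value for 'duplicates' parameter")
    data = sorted(values)
    edges = []
    for i in range(q + 1):
        pos = (len(data) - 1) * i / q
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        edges.append(data[lo] + (data[hi] - data[lo]) * (pos - lo))
    unique = sorted(set(edges))
    if len(unique) < len(edges):
        if duplicates == "raise":
            raise ValueError("Bin edges must be unique: %r" % (edges,))
        edges = unique
    return edges


class TestQcutDuplicates(unittest.TestCase):
    # Illustrative data: the repeated zeros make two quantile edges coincide.
    values = [0, 0, 0, 0, 1, 2, 3]

    def test_drop_removes_repeated_edges(self):
        self.assertEqual(qcut_edges(self.values, 3, duplicates="drop"),
                         [0, 1, 3])

    def test_raise_is_the_default(self):
        with self.assertRaises(ValueError):
            qcut_edges(self.values, 3)
        with self.assertRaises(ValueError):
            qcut_edges(self.values, 3, duplicates="raise")

    def test_invalid_option_raises(self):
        with self.assertRaises(ValueError):
            qcut_edges(self.values, 3, duplicates="foo")
```

The three cases mirror the request: both valid options are exercised, the default is shown to match the explicit `'raise'`, and an invalid value is asserted to raise. The class can be run with `python -m unittest`.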

@jreback jreback removed this from the 0.19.1 milestone Oct 20, 2016
``pivot_table`` raises ``TypeError`` when ``index`` or ``columns`` is array-like and
``values`` is not specified.

Author: sinhrks <sinhrks@gmail.com>

Closes #14380 from sinhrks/pivot_table_bug and squashes the following commits:

be426db [sinhrks] BUG: pivot_table may raise TypeError without values
@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Oct 20, 2016
* BUG: underflow on Timestamp creation

* undo change to lower bound

* change lower bound; but keep rounding to us
@ashishsingal1
Contributor Author

ashishsingal1 commented Oct 20, 2016

Thanks for the feedback -- this is my first PR on an open source project -- will make the changes and resubmit tomorrow. Had some trouble building my branch on Windows.

USE_CASE_RANGE is a GNU C feature. This change will activate
USE_CASE_RANGE on any platform when using GNU C and not on any platform
when a different compiler is being used.

closes #14373
@jreback
Contributor

jreback commented Oct 20, 2016

contributing docs are here; there is a section on creating a windows env.

chris-b1 and others added 6 commits October 21, 2016 19:37
1) Add checks to ensure that add overflow does not
   occur in either the positive or negative direction.
2) Add benchmarks to ensure that operations involving
   this checked add function are not significantly impacted.
The mention of panels that are created is not correct. You get a multi-index
Since we don't support Python 2.6 anymore, the `check_output` method
from `subprocess` is at our disposal. Follow-up to #14447. xref
#14439 (comment): https://github.com/pandas-dev/pandas/issues/14439#issuecomment-254522055

Author: gfyoung <gfyoung17@gmail.com>

Closes #14465 from gfyoung/merge-pr-refactor and squashes the following commits:

e267d2b [gfyoung] MAINT: Use check_output when merging.
…y input

closes #13139

Added test case to check for invalid input (empty string) on pd.eval('')
and df.query(''). Used existing helper function (_check_expression).

Author: Thiago Serafim <thiago.serafim@gmail.com>

Closes #14473 from tserafim/issue#13139 and squashes the following commits:

77483dd [Thiago Serafim] ERR: correctly raise ValueError on empty input to pd.eval() and df.query() (#13139)
9a5c55f [Thiago Serafim] Fix GH13139: better error message on invalid pd.eval and df.query input
patniharshit and others added 24 commits December 12, 2016 06:10
…on_normalize

Author: dickreuter <dickreuter@yahoo.com>

Closes #14583 from dickreuter/json_normalize_enhancement and squashes the following commits:

701c140 [dickreuter] adjusted formatting
3c94206 [dickreuter] shortened lines to pass linting
2028924 [dickreuter] doc changes
d298588 [dickreuter] Fixed as instructed in pull request page
bcfbf18 [dickreuter] Avoids exception when pandas.io.json.json_normalize
closes #14778

Please see
regex search on long columns by first converting to Categorical, avoid
melting all dataframes with all the id variables, and wait with trying
to convert the "time" variable to `int` until last), and clear up the
docstring.

Author: nuffe <erik.cfr@gmail.com>

Closes #14779 from nuffe/wide2longfix and squashes the following commits:

df1edf8 [nuffe] asv_bench: fix indentation and simplify
dc13064 [nuffe] Set docstring to raw literal to allow backslashes to be printed (still had to escape them)
295d1e6 [nuffe] Use pd.Index in doc example
1c49291 [nuffe] Can of course get rid negative lookahead now that suffix is a regex
54c5920 [nuffe] Specify the suffix with a regex
5747a25 [nuffe] ENH/DOC: wide_to_long performance and functionality improvements (#14779)
+ Add doc explaining parse_date limitation
- [x] closes #12651
- [x] passes `git diff upstream/master | flake8 --diff`

Author: adrian-stepien <adrian-stepien@users.noreply.github.com>

Closes #14098 from adrian-stepien/doc/12651 and squashes the following commits:

4427e28 [adrian-stepien] DOC: Improved links between expanding and cum* (#12651)
8466669 [adrian-stepien] DOC: Improved links between expanding and cum* (#12651)
30164f3 [adrian-stepien] DOC: Correct link from b/ffill to fillna
Passing `'0.5min'` as a frequency string should generate 30 second
intervals, rather than five minute intervals. By recursively increasing
resolution until one is found for which the frequency is an integer,
this commit ensures that that's the case for resolutions from days to
microseconds.

Fixes #8419
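
The resolution-stepping idea in this commit message can be sketched as follows. This is not the code from that commit: it is a small iterative illustration (the message describes a recursive one), the function name `to_integer_freq` is hypothetical, and the unit table is an assumption that only loosely mirrors pandas offset aliases.

```python
from fractions import Fraction

# Units from coarse to fine, each with the factor to the next finer unit.
# The alias spellings are assumptions made for this sketch.
_UNITS = [("D", 24), ("H", 60), ("min", 60), ("S", 1000),
          ("ms", 1000), ("us", None)]
_NAMES = [name for name, _ in _UNITS]


def to_integer_freq(value, unit):
    """Rewrite `value unit` as an integer multiple of the same or a finer unit."""
    amount = Fraction(str(value))  # exact arithmetic, no float rounding
    idx = _NAMES.index(unit)
    # Step to finer units until the multiple becomes a whole number.
    while amount.denominator != 1:
        factor = _UNITS[idx][1]
        if factor is None:
            raise ValueError("not an integer multiple of microseconds")
        amount *= factor
        idx += 1
    return int(amount), _NAMES[idx]
```

Under these assumptions, `to_integer_freq("0.5", "min")` yields 30 of the seconds unit, which is the behavior the commit message describes for `'0.5min'`.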
`cpplint` was introduced in #14740, and this commit extends it to check
other `*.c` and `*.h` files. Currently, they all reside in
`pandas/src`, and this commit expands the lint to check all of the
following:

1) `datetime` (dir)
2) `ujson` (dir)
3) `period_helper.c`
4) All header files

The parser directory was handled in #14740, and the others have been
deliberately omitted per the discussion
(see https://github.com/pandas-dev/pandas/pull/14740#issuecomment-265260209).

Author: gfyoung <gfyoung17@gmail.com>

Closes #14814 from gfyoung/c-style-continue and squashes the following commits:

27d4d46 [gfyoung] MAINT: Style check *.c and *.h files
Always return `SparseArray` and `SparseSeries` for
`SparseArray.cumsum()` and `SparseSeries.cumsum()` respectively,
regardless of `fill_value`.

Closes #12855.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14771 from gfyoung/sparse-return-type and squashes the following commits:

83314fc [gfyoung] API: Return sparse objects always for cumsum
BUG: Fixed KDE plot to ignore missing values

 closes #14821

* fixed kde plot to ignore the missing values
* added comment to elaborate the changes made
* added a release note in whatsnew/0.19.2
* added test to check for  missing values and cleaned up whatsnew doc
* added comment to refer the issue
* modified to fit lint checks
* replaced ._xorig with .get_xdata()
xref #13745

provides a modest speedup for all string hashing. The
key thing is, it will release the GIL on more operations where this is
possible (mainly factorize).
can be easily extended to value_counts() and .duplicated() (for strings)

Author: Jeff Reback <jeff@reback.net>

Closes #14859 from jreback/string and squashes the following commits:

98f46c2 [Jeff Reback] PERF: use StringHashTable for strings in factorizing
# Conflicts:
#	pandas/tools/tile.py
@jorisvandenbossche
Member

@ashishsingal1 something went wrong with your rebase. Can you do:

git fetch upstream
git rebase upstream/master
git push -f

That should normally solve it

@ashishsingal1
Contributor Author

Trouble rebasing, going to start over with a new PR.

@jorisvandenbossche jorisvandenbossche modified the milestones: No action, 0.20.0 Dec 28, 2016
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Development

Successfully merging this pull request may close these issues.

qcut() should make sure the bins bounderies are unique before passing them to _bins_to_cuts