BUG: unwanted numeric coercion after groupby-apply #14423

Closed
waqarmalik opened this Issue Oct 14, 2016 · 2 comments

waqarmalik commented Oct 14, 2016 edited by jreback

xref #14873 (boolean casts)
xref #14849 (datetime)

A small, complete example of the issue

import pandas as pd

def predictions(tool):
    out = pd.Series(index=['p1', 'p2', 'useTime'], dtype=object)
    if 'step1' in list(tool.State):
        out['p1'] = str(tool[tool.State == 'step1'].Machine.values[0])
    if 'step2' in list(tool.State):
        out['p2'] = str(tool[tool.State == 'step2'].Machine.values[0])
        out['useTime'] = str(tool[tool.State == 'step2'].oTime.values[0])
    return out


df1 = pd.DataFrame({'Key': ['B', 'B', 'A', 'A'],
                   'State': ['step1', 'step2', 'step1', 'step2'],
                   'oTime': ['', '2016-09-19 05:24:33', '', '2016-09-19 23:59:04'],
                   'Machine': ['23', '36L', '36R', '36R']})

df2 = df1.copy()
df2.oTime = pd.to_datetime(df2.oTime)


pred1 = df1.groupby('Key').apply(predictions)
pred2 = df2.groupby('Key').apply(predictions)

print(pred1)
print(pred2)

Actual Output:

      p1   p2              useTime
Key                               
A    36R  36R  2016-09-19 23:59:04
B     23  36L  2016-09-19 05:24:33
       p1   p2                        useTime
Key                                          
A     NaN  36R  2016-09-19T23:59:04.000000000
B    23.0  36L  2016-09-19T05:24:33.000000000

Expected Output

pred1 and pred2 should have the same values in column p1.
pred1 is correct, whereas in pred2 column p1 is coerced to float64: '36R' becomes NaN and '23' becomes 23.0.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Contributor

jreback commented Oct 14, 2016

So I think we have a duplicate of this already; need to search for it. In any event, I think it's doing a coercing conversion. This should strictly be a soft conversion from object -> numeric. So the following works (though I think the existing code should actually work correctly; maybe something is not getting passed through).

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 3c376e3..a86e6d6 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -10,6 +10,7 @@ from pandas.compat import(
     zip, range, long, lzip,
     callable, map
 )
+import pandas as pd
 from pandas import compat
 from pandas.compat.numpy import function as nv
 from pandas.compat.numpy import _np_version_under1p8
@@ -3446,7 +3447,7 @@ class NDFrameGroupBy(GroupBy):
                 # as we are stacking can easily have object dtypes here
                 so = self._selected_obj
                 if (so.ndim == 2 and so.dtypes.isin(_DATELIKE_DTYPES).any()):
-                    result = result._convert(numeric=True)
+                    result = result.apply(lambda x: pd.to_numeric(x, errors='ignore'))
                     date_cols = self._selected_obj.select_dtypes(
                         include=list(_DATELIKE_DTYPES)).columns
                     date_cols = date_cols.intersection(result.columns)

a pull-request with tests would be welcome.
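
To illustrate the distinction drawn above between a coercing and a soft conversion, here is a minimal sketch. The helper name soft_to_numeric is hypothetical and not part of pandas; it only approximates what errors='ignore' does in the diff (convert a column when every value parses, otherwise return it untouched), and it is not the code that was eventually merged.

```python
import pandas as pd


def soft_to_numeric(col):
    """Hypothetical helper: convert only if every value parses as a number."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        # At least one value is not numeric, so leave the column as it was.
        return col


# Column p1 as it looks before any conversion (values taken from the report).
p1 = pd.Series(['36R', '23'], index=['A', 'B'], dtype=object)

# Coercing conversion: unparseable strings are replaced by NaN.
coerced = pd.to_numeric(p1, errors='coerce')

# Soft conversion: the column is returned unchanged because '36R' is not numeric.
soft = soft_to_numeric(p1)

# A column in which every string is numeric still gets converted.
all_digits = soft_to_numeric(pd.Series(['1', '2'], dtype=object))

print(coerced)
print(soft)
print(all_digits.dtype)
```

The coerced result reproduces the symptom in pred2 (NaN for '36R', 23.0 for '23'), while the soft variant keeps both strings intact.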

as an aside, what you are doing inside the .apply is completely inefficient and non-idiomatic.

jreback added this to the Next Major Release milestone Oct 14, 2016

jreback changed the title from Weird behavior in groupby-apply to BUG: unwanted numeric coercion after groupby-apply Oct 14, 2016

waqarmalik commented Oct 14, 2016 edited

Tested and the suggested change works on a much larger data set too.

As an aside, I'd like to find better ways to do it -- groupby followed by extracting key parameters from each group. I couldn't devise a way to make aggregate work. Could you provide some suggestions on improving this? I've set up another page on Stack Overflow for the discussion.

http://stackoverflow.com/questions/40032039/pandas-groupby-apply-weird-behavior-with-series
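
One possible way to avoid .apply for this kind of extraction is to reshape with pivot. This is only a sketch, not something proposed in the thread: it assumes the first row per (Key, State) pair is the one wanted (mirroring .values[0] in the original function), and it reindexes on the two state names so that a missing state yields missing values rather than an error.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['B', 'B', 'A', 'A'],
                   'State': ['step1', 'step2', 'step1', 'step2'],
                   'oTime': pd.to_datetime([None, '2016-09-19 05:24:33',
                                            None, '2016-09-19 23:59:04']),
                   'Machine': ['23', '36L', '36R', '36R']})

# Keep the first row for each (Key, State) pair so pivot sees unique entries.
first = df.drop_duplicates(['Key', 'State'])

states = ['step1', 'step2']

# One wide table per value column: rows are keys, columns are states.
machines = first.pivot(index='Key', columns='State',
                       values='Machine').reindex(columns=states)
times = first.pivot(index='Key', columns='State',
                    values='oTime').reindex(columns=states)

# Assemble the same three columns that predictions() produced.
pred = pd.DataFrame({'p1': machines['step1'],
                     'p2': machines['step2'],
                     'useTime': times['step2']})

print(pred)
print(pred.dtypes)
```

Because each value column is pivoted separately, the Machine strings and the datetimes never share a container, so no cross-column type inference is involved.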

jreback modified the milestone: 0.20.0, Next Major Release Mar 13, 2017

gwpdt added a commit to gwpdt/pandas that referenced this issue Mar 14, 2017

BUG: Group-by numeric type-coericion with datetime
GH Bug #14423

During a group-by/apply on a DataFrame, in the presence of one or more
DateTime-like columns, Pandas would incorrectly coerce the type of all
other columns to numeric.  E.g. a String column would be coerced to
numeric, producing NaNs.

Fix the issue, and add a test.
46d12c2

gwpdt added a commit to gwpdt/pandas that referenced this issue Mar 16, 2017

TST: Rename and expand test_numeric_coercion
Rename test_numeric_coercion to
test_apply_numeric_coercion_when_datetime, and add tests for GH #15421
and #14423
e1ed104

jreback closed this in 37e5f78 Mar 16, 2017

AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

gwpdt + AnkurDedania BUG: Group-by numeric type-coercion with datetime
closes #14423
closes #15421
closes #15670

During a group-by/apply
on a DataFrame, in the presence of one or more  DateTime-like columns,
Pandas would incorrectly coerce the type of all  other columns to
numeric.  E.g. a String column would be coerced to  numeric, producing
NaNs.

Author: Greg Williams <gregwill@pdtpartners.com>

Closes #15680 from gwpdt/bugfix14423 and squashes the following commits:

e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion
0a15674 [Greg Williams] CLN: move import, add whatsnew entry
c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes)
46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime
7d3333c

mattip added a commit to mattip/pandas that referenced this issue Apr 3, 2017

gwpdt + mattip BUG: Group-by numeric type-coercion with datetime
closes #14423
closes #15421
closes #15670

During a group-by/apply
on a DataFrame, in the presence of one or more  DateTime-like columns,
Pandas would incorrectly coerce the type of all  other columns to
numeric.  E.g. a String column would be coerced to  numeric, producing
NaNs.

Author: Greg Williams <gregwill@pdtpartners.com>

Closes #15680 from gwpdt/bugfix14423 and squashes the following commits:

e1ed104 [Greg Williams] TST: Rename and expand test_numeric_coercion
0a15674 [Greg Williams] CLN: move import, add whatsnew entry
c8844e0 [Greg Williams] CLN: PEP8 (whitespace fixes)
46d12c2 [Greg Williams] BUG: Group-by numeric type-coericion with datetime
0c2afdc