reset_index() on MultiIndexed empty dataframe does not preserve dtypes #19602

Closed
alberto-dellera opened this issue Feb 8, 2018 · 12 comments · Fixed by #34942
Labels
MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@alberto-dellera

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame( data=[[0,0,0]], columns=['level_1','level_2','payload'] )

# make dataframe empty
df = df[ df.payload == -1 ]

# columns are all int64 here

print(df.info())
#output: level_1    0 non-null int64
#output: level_2    0 non-null int64
#output: payload    0 non-null int64

# set MultiIndex - levels are still int64 
df = df.set_index(['level_1','level_2'])

print(str(df.index.levels[0].dtype))
print(str(df.index.levels[1].dtype))
#output: int64
#output: int64

# reset_index - former-levels columns are now float64
df = df.reset_index()

print(df.info())
#output: level_1    0 non-null float64
#output: level_2    0 non-null float64
#output: payload    0 non-null int64

Problem description

The dtypes are preserved, however, if either
a) the index is not a MultiIndex, or
b) the dataframe is not empty.

(b) is a big issue for programs that compute subsets of dataframes that can sometimes
be empty, since downstream code might expect a certain dtype and fail when it finds
a float64 instead.
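
A minimal sketch of the two cases (a) and (b) where the dtypes survive, reusing the column names from the code sample above. The variable names `single_empty` and `multi_nonempty` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 0]], columns=["level_1", "level_2", "payload"])
empty = df[df.payload == -1]

# (a) empty dataframe, but a plain (single-level) index
single_empty = empty.set_index("level_1").reset_index()

# (b) MultiIndex, but the dataframe is not empty
multi_nonempty = df.set_index(["level_1", "level_2"]).reset_index()

print(single_empty.dtypes)
print(multi_nonempty.dtypes)
```

In both cases the former index columns should still be reported as int64.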

Real-world scenario: sampling a system (a collection of processes or threads) at regular intervals
and collecting some measures (CPU used, or other resources or figures); a very common strategy
in performance investigation software (e.g. see Oracle's v$active_session_history).
Here, the natural index is (sample_time, process_id), with sample_time being datetime64 (a time series).
Even more naturally, we want to compute differences of sample_time, yielding a timedelta64,
and divide them by np.timedelta64(1,'s') to get the elapsed time in seconds; but when the
initial dataframe is empty, we end up dividing float64 by np.timedelta64(1,'s') and get an exception.
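
A small sketch of the non-empty path described above, assuming made-up sample data; the names `samples`, `cpu_used`, and `elapsed_seconds` are illustrative and not from the original report.

```python
import numpy as np
import pandas as pd

samples = pd.DataFrame(
    {
        "sample_time": pd.to_datetime(
            ["2018-02-08 10:00:00", "2018-02-08 10:00:05", "2018-02-08 10:00:15"]
        ),
        "process_id": [42, 42, 42],
        "cpu_used": [1, 3, 4],
    }
).set_index(["sample_time", "process_id"])

# Bring sample_time back as a column, then compute elapsed seconds between samples.
flat = samples.reset_index()
elapsed_seconds = flat["sample_time"].diff() / np.timedelta64(1, "s")
print(elapsed_seconds.tolist())
```

This works because sample_time is still datetime64 after reset_index(); in the empty case reported here it comes back as float64, and the division is no longer valid.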

An obvious workaround is to check for empty dataframes after EVERY reset_index()
and coerce the float64 columns back to their correct dtype - but that easily becomes a maintenance/coverage nightmare :O
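
One way to centralise that workaround is a wrapper like the sketch below. `reset_index_keep_dtypes` is a hypothetical helper, not part of pandas, and it assumes every index level has a name.

```python
import pandas as pd


def reset_index_keep_dtypes(df):
    """Hypothetical helper: reset_index() that restores level dtypes on empty frames."""
    index = df.index
    if isinstance(index, pd.MultiIndex):
        # Record the dtype of each named level before resetting.
        level_dtypes = {name: level.dtype for name, level in zip(index.names, index.levels)}
    else:
        level_dtypes = {index.name: index.dtype}

    out = df.reset_index()
    if out.empty:
        # Only coerce in the empty case, which is where the dtypes were lost.
        out = out.astype(
            {name: dtype for name, dtype in level_dtypes.items() if name in out.columns}
        )
    return out


df = pd.DataFrame([[0, 0, 0]], columns=["level_1", "level_2", "payload"])
empty = df[df.payload == -1].set_index(["level_1", "level_2"])
result = reset_index_keep_dtypes(empty)
print(result.dtypes)
```

Calling the wrapper instead of reset_index() directly keeps the empty-frame check in one place, though every call site still has to remember to use it.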

Expected Output

Columns restored by reset_index() keep their initial dtype.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.1
openpyxl: 2.4.9
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), MultiIndex, and Effort Medium labels on Feb 9, 2018
@TomAugspurger
Contributor

Yeah, I could see these preserving the index dtypes. Interested in making a PR?

@alberto-dellera
Author

@TomAugspurger : unfortunately I am not skilled enough for a PR yet, sorry :(

@TomAugspurger
Contributor

I just came across this code for an unrelated PR. The issue may be at

pandas/pandas/core/frame.py

Lines 3387 to 3389 in a214915

if mask.all():
    values = np.empty(len(mask))
    values.fill(np.nan)

mask is empty, but mask.all() is true, so we do np.empty(), which is float.

If we check for mask.all() and len(mask) on that line, things may be fixed. Haven't tried yet.
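
The behaviour described above can be reproduced with NumPy alone. The guard function `should_fill_with_nan` is a hypothetical illustration of the "check mask.all() and len(mask)" idea, not the code that ended up in pandas.

```python
import numpy as np

empty_mask = np.array([], dtype=bool)

# all() over an empty array is vacuously True ...
print(empty_mask.all())

# ... and np.empty() defaults to float64 when no dtype is given.
values = np.empty(len(empty_mask))
print(values.dtype)


def should_fill_with_nan(mask):
    """Hypothetical guard: take the all-NaN branch only for a non-empty, all-True mask."""
    return len(mask) > 0 and bool(mask.all())


print(should_fill_with_nan(empty_mask))
print(should_fill_with_nan(np.array([True, True])))
```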

@allComputableThings

Would something like this resolve it?

 if mask.all(): 
     values = np.empty(len(mask), dtype=index.dtype) 
     values.fill(np.nan) 

@TomAugspurger
Contributor

TomAugspurger commented Feb 27, 2018 via email

@vaibhawc

Did anything happen in this direction?
Recently I came across such a problem.

@TomAugspurger
Contributor

Still open. @vaibhawc can you investigate the issue and make a PR?

@vaibhawc

My issue was resolved by what @stuz5000 suggested. Though I didn't quite understand how it happened.
:| Should I make a PR?

@TomAugspurger
Contributor

Sure, give that a shot.

@vaibhawc

vaibhawc commented Nov 1, 2018

Ok, could you please help me with the base and compare branches?

@TomAugspurger
Contributor

Contributing guidelines are at http://pandas-docs.github.io/pandas-docs-travis/contributing.html. Post if you have any issues.

@agralak-queueco

btw, as a quick workaround, you can copy the previous dtypes and assign them back after grouping:

    old_dtypes = dict(data.dtypes)
    filtered = ...  # insert your filter here
    # restore the saved dtypes for columns that still exist, and set dtypes for any new columns
    filtered = filtered.astype({
        **{key: value for key, value in old_dtypes.items() if key in filtered.columns},
        **{'new_columns': np.int16},
    })
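
A runnable sketch of this workaround applied to the frame from the original report. The filter and the variable name `restored` are assumptions for illustration, and the dtypes are captured while the levels are still ordinary columns.

```python
import pandas as pd

data = pd.DataFrame([[0, 0, 0]], columns=["level_1", "level_2", "payload"])

# Save the dtypes before the levels are moved into the index.
old_dtypes = dict(data.dtypes)

# Example filter that yields an empty frame, followed by the round trip through a MultiIndex.
filtered = data[data.payload == -1].set_index(["level_1", "level_2"]).reset_index()

# Assign the saved dtypes back for every column that still exists.
restored = filtered.astype(
    {col: dtype for col, dtype in old_dtypes.items() if col in filtered.columns}
)
print(restored.dtypes)
```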
