Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Error when creating sparse dataframe with nan column label #8822

Merged
merged 1 commit into from
Apr 11, 2015

Conversation

artemyk
Copy link
Contributor

@artemyk artemyk commented Nov 15, 2014

Right now the following raises an exception:

from pandas import DataFrame, Series
from numpy import nan
nan_colname = DataFrame(Series(1.0,index=[0]),columns=[nan])
nan_colname_sparse = nan_colname.to_sparse()

This is because sparse dataframes use a dictionary to store information about columns, with the column label as the key. nan's do not equal themselves and create problems as dictionary keys. This avoids the issue by uses a dataframe to store this information.

@jreback
Copy link
Contributor

jreback commented Nov 15, 2014

  • why would you ever want to actually have a nan as a column label?
  • nans cause many many issues in indexes. They are technically allowed, but highly discouraged
  • the columns are not stored in a dictionary, a SparseDataFrame is very much like a regular DataFrame internally
  • this is not allowed in a DataFrame now, why should SparseDataFrame support it?
  • You could need a whole battery of tests for this to be considered. eg. sdf[nan] will simply fail

@jreback jreback added the Sparse Sparse Data Type label Nov 15, 2014
@artemyk
Copy link
Contributor Author

artemyk commented Nov 15, 2014

@jreback, I totally agree with you. Except that this is not allowed in DataFrame now -- you can create a DataFrame with nan columns, but not a SparseDataFrame.

However, I am working on a get_dummies() that returns a sparse dataframe instead of dense.
There is a test of get_dummies which creates a dataframe with a nan column name (see last part of test_include_na in pandas/tests). So I wrote this in order to pass that test.

I would argue that if this shouldn't be done, it should be explicitly forbidden. E.g. so that people don't write tests creating using DataFrames with nan column names.

@jreback
Copy link
Contributor

jreback commented Nov 15, 2014

it can be done. just not like you are doing it. You can do it from a dict only (or set the columns explicity). Its implicity allowed, even though if you try to index with more than one nan in an index it will fail.

I know their are some tests / behavior which allow this. Its a problem. Not sure how much work it would be to either not disallow it. I agree with your sentiments.

So not allowing expansion of this (have nixed a couple of pr's that tried to do this). it is a can of worms.

@artemyk
Copy link
Contributor Author

artemyk commented Nov 15, 2014

OK, I will soon commit a pull request that fails a test due to this. Maybe we can discuss it further there. The issue can probably be avoided with a simple test for this edge case.

@jreback
Copy link
Contributor

jreback commented Nov 15, 2014

ok, sure

@artemyk
Copy link
Contributor Author

artemyk commented Nov 15, 2014

@jreback This might be necessary after all, for the relatively normal following case:

>>> print pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True)

    1    2   NaN
0    1    0    0
1    0    1    0
2    0    0    1

@jreback
Copy link
Contributor

jreback commented Nov 15, 2014

that's an ok case
lol and see how it is constructed

@artemyk
Copy link
Contributor Author

artemyk commented Nov 18, 2014

@jreback Not sure how to get sparse get_dummies with dummy_na without this.
BTW, I realize SparseDataFrame doesn't store columns in dictionaries, but during one part of its creation, it does store column information in a dict w. column name as key (that it then passes to dict_to_manager).
I suppose another variation would be to create this dict using (True, None) as the key for np.nan-named columns, and (False, colname) as the key for non-np.nan-named columns. Not sure if this is better than using a regular DataFrame (I kind of like that this enforces that if a column name can be a valid column in a DataFrame, it works here, and if not, it should fail here also).

@jreback
Copy link
Contributor

jreback commented Nov 18, 2014

the usual way to do this is to stringify the nans, e.g. use 'nan' then subsitute it out at the end. Much simpler that way (or 2 be honest prob better), if a bit tricker on access.

@artemyk
Copy link
Contributor Author

artemyk commented Nov 18, 2014

But then if one of the values is 'nan' (in string form), then np.nan's and 'nan's would get confused. I think a

def safe_key(val):
  isnan = np.isnan(val)
  return (isnan, val if not isnan else None)

would be more robust.

@artemyk
Copy link
Contributor Author

artemyk commented Nov 21, 2014

@jreback I'm wondering how to proceed with this. The basic issue is that w/o this PR, the following raises an error:

import pandas as pd
import numpy as np
pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True).to_sparse()

What are your thoughts? Keep as is? Or merge, but using dicts instead of a dataframe to store column information? (and if so --- because it's more lightweight?)

@jreback
Copy link
Contributor

jreback commented Nov 21, 2014

@artemyk I am excited you are working on this. As we need a person really interested in sparese!

I haven't had a chance to look at how best to do this. Will get back to you next week.

@artemyk
Copy link
Contributor Author

artemyk commented Nov 21, 2014

@jreback OK, sounds good!

@jreback
Copy link
Contributor

jreback commented Mar 25, 2015

@artemyk can you rebase this and I'll take a look......

@artemyk
Copy link
Contributor Author

artemyk commented Mar 26, 2015

@jreback Rebased

@artemyk
Copy link
Contributor Author

artemyk commented Apr 8, 2015

@jreback ?

@@ -1663,6 +1663,11 @@ def test_as_blocks(self):
self.assertEqual(list(df_blocks.keys()), ['float64'])
assert_frame_equal(df_blocks['float64'], df)

def test_nan_columnname(self):
nan_colname = DataFrame(Series(1.0,index=[0]),columns=[nan])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue as a comment here

@jreback
Copy link
Contributor

jreback commented Apr 8, 2015

pls add a release note (use this PR number as the issue number)

@jreback jreback added this to the 0.16.1 milestone Apr 8, 2015
@jreback jreback added the Bug label Apr 8, 2015
Support for nan columns

Fix

Trigger Travis CI

jreback fixes

Release note update
@artemyk
Copy link
Contributor Author

artemyk commented Apr 10, 2015

@jreback Ready to merge, I think

jreback added a commit that referenced this pull request Apr 11, 2015
BUG: Error when creating sparse dataframe with nan column label
@jreback jreback merged commit 700f6eb into pandas-dev:master Apr 11, 2015
@jreback
Copy link
Contributor

jreback commented Apr 11, 2015

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants