BUG: Error when creating sparse dataframe with nan column label #8822

artemyk · 2014-11-15T01:33:18Z

Right now the following raises an exception:

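from pandas import DataFrame, Series
from numpy import nan

# Dense DataFrame whose only column label is nan
nan_colname = DataFrame(Series(1.0, index=[0]), columns=[nan])
# Converting it to sparse raises an exception before this fix
nan_colname_sparse = nan_colname.to_sparse()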

This is because sparse DataFrames use a dictionary to store information about columns, with the column label as the key. NaNs do not equal themselves, which causes problems when they are used as dictionary keys. This PR avoids the issue by using a DataFrame to store this information.
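
To illustrate the point about dictionary keys, here is a minimal standard-library sketch. It uses plain Python floats and a plain dict, not the pandas internals, and the variable names are made up for the example.

```python
# Two separate nan objects, as you would get from two independent computations
nan_a = float("nan")
nan_b = float("nan")

# A dict keyed by a nan label, standing in for per-column bookkeeping
columns = {nan_a: "column data"}

# nan never compares equal, not even to itself
equal_to_itself = (nan_a == nan_a)

# Looking up the very same object succeeds, because dict lookup
# checks object identity before falling back to ==
same_object_found = nan_a in columns

# Looking up a different nan object misses: identity differs and == is False
other_nan_found = nan_b in columns

print(equal_to_itself, same_object_found, other_nan_found)
```

So whether a nan key can be found again depends on which nan object is used for the lookup, which is what makes it fragile as a column key.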

jreback · 2014-11-15T01:38:06Z

why would you ever want to actually have a nan as a column label?
nans cause many many issues in indexes. They are technically allowed, but highly discouraged
the columns are not stored in a dictionary, a SparseDataFrame is very much like a regular DataFrame internally
this is not allowed in a DataFrame now, why should SparseDataFrame support it?
You would need a whole battery of tests for this to be considered, e.g. sdf[nan] will simply fail

artemyk · 2014-11-15T01:51:14Z

@jreback, I totally agree with you, except on one point: this is allowed in DataFrame now. You can create a DataFrame with nan columns, but not a SparseDataFrame.

However, I am working on a get_dummies() that returns a sparse dataframe instead of dense.
There is a test of get_dummies which creates a dataframe with a nan column name (see last part of test_include_na in pandas/tests). So I wrote this in order to pass that test.

I would argue that if this shouldn't be done, it should be explicitly forbidden, e.g. so that people don't write tests using DataFrames with nan column names.

jreback · 2014-11-15T01:58:32Z

It can be done, just not the way you are doing it. You can do it from a dict only (or by setting the columns explicitly). It's implicitly allowed, even though it will fail if you try to index with more than one nan in an index.

I know there are some tests / behavior which allow this. It's a problem. Not sure how much work it would be to disallow it. I agree with your sentiments.

So I am not allowing expansion of this (I have nixed a couple of PRs that tried to do this). It is a can of worms.

artemyk · 2014-11-15T02:01:05Z

OK, I will soon commit a pull request that fails a test due to this. Maybe we can discuss it further there. The issue can probably be avoided with a simple test for this edge case.

jreback · 2014-11-15T02:01:59Z

ok, sure

artemyk · 2014-11-15T03:08:14Z

@jreback This might be necessary after all, for the relatively normal following case:

>>> print pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True)

    1    2   NaN
0    1    0    0
1    0    1    0
2    0    0    1

jreback · 2014-11-15T03:10:05Z

that's an ok case
lol and see how it is constructed

artemyk · 2014-11-18T01:59:02Z

@jreback Not sure how to get sparse get_dummies with dummy_na without this.
BTW, I realize SparseDataFrame doesn't store columns in dictionaries, but during one part of its creation, it does store column information in a dict w. column name as key (that it then passes to dict_to_manager).
I suppose another variation would be to create this dict using (True, None) as the key for np.nan-named columns, and (False, colname) as the key for non-np.nan-named columns. Not sure if this is better than using a regular DataFrame (I kind of like that this enforces that if a column name can be a valid column in a DataFrame, it works here, and if not, it should fail here also).

jreback · 2014-11-18T02:07:00Z

The usual way to do this is to stringify the nans, e.g. use 'nan', then substitute it out at the end. Much simpler that way (or, to be honest, probably better), if a bit trickier on access.
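
A rough sketch of the stringify-then-substitute idea, using only the standard library. The helper names and the NAN_PLACEHOLDER constant are hypothetical and chosen for illustration; this is not code from pandas.

```python
import math

NAN_PLACEHOLDER = "nan"  # hypothetical stand-in string for nan labels


def is_nan(label):
    # Only floats can be nan; guard so strings and ints pass through safely
    return isinstance(label, float) and math.isnan(label)


def stringify_nans(labels):
    # Swap each nan label for the placeholder so it behaves as a dict key
    return [NAN_PLACEHOLDER if is_nan(x) else x for x in labels]


def restore_nans(labels):
    # Swap the placeholder back to nan once the dict is no longer needed
    return [float("nan") if x == NAN_PLACEHOLDER else x for x in labels]


labels = [1, 2, float("nan")]
as_keys = stringify_nans(labels)
columns = {key: [] for key in as_keys}  # every key is now an ordinary value
restored = restore_nans(list(columns))
```

The round trip relies on the placeholder never appearing as a genuine label.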

artemyk · 2014-11-18T03:07:48Z

But then if one of the values is 'nan' (in string form), then np.nan's and 'nan's would get confused. I think a

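import numpy as np

def safe_key(val):
    # np.isnan raises TypeError for strings, so only test float labels
    isnan = isinstance(val, float) and bool(np.isnan(val))
    return (isnan, None if isnan else val)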

would be more robust.
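
The difference between the two keying schemes can be sketched with the standard library alone. This variant of safe_key uses math.isnan instead of np.isnan so the example has no dependencies; that substitution, and the is_nan helper, are assumptions made for the illustration.

```python
import math


def is_nan(label):
    # Guard with isinstance so string labels do not raise TypeError
    return isinstance(label, float) and math.isnan(label)


def safe_key(label):
    # (True, None) for any nan label, (False, label) for everything else
    nan = is_nan(label)
    return (nan, None if nan else label)


# A nan label, the literal string 'nan', and an ordinary float label
labels = [float("nan"), "nan", 1.0]

# Stringifying merges the nan label with the string 'nan'
stringified = ["nan" if is_nan(x) else x for x in labels]
distinct_stringified = len(set(stringified))

# Tuple keys keep all three labels apart
tuple_keys = [safe_key(x) for x in labels]
distinct_tuple_keys = len(set(tuple_keys))

# Two different nan objects still map to the same key
same_key_for_any_nan = safe_key(float("nan")) == safe_key(float("nan"))

print(distinct_stringified, distinct_tuple_keys, same_key_for_any_nan)
```

With tuple keys, every nan label collapses to one well-behaved key, while the string 'nan' stays a separate label.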

artemyk · 2014-11-21T17:27:15Z

@jreback I'm wondering how to proceed with this. The basic issue is that w/o this PR, the following raises an error:

import pandas as pd
import numpy as np
pd.get_dummies(pd.Series([1,2,np.nan]), dummy_na=True).to_sparse()

What are your thoughts? Keep as is? Or merge, but using dicts instead of a dataframe to store column information? (and if so --- because it's more lightweight?)

jreback · 2014-11-21T17:54:11Z

@artemyk I am excited you are working on this, as we need a person really interested in sparse!

I haven't had a chance to look at how best to do this. Will get back to you next week.

artemyk · 2014-11-21T17:58:03Z

@jreback OK, sounds good!

jreback · 2015-03-25T23:20:01Z

@artemyk can you rebase this and I'll take a look......

artemyk · 2015-03-26T04:51:27Z

@jreback Rebased

artemyk · 2015-04-08T21:44:53Z

@jreback ?

jreback · 2015-04-08T21:46:13Z

pandas/sparse/tests/test_sparse.py

@@ -1663,6 +1663,11 @@ def test_as_blocks(self):
        self.assertEqual(list(df_blocks.keys()), ['float64'])
        assert_frame_equal(df_blocks['float64'], df)

+    def test_nan_columnname(self):
+        nan_colname = DataFrame(Series(1.0,index=[0]),columns=[nan])


add the issue as a comment here

jreback · 2015-04-08T21:47:14Z

pls add a release note (use this PR number as the issue number)

artemyk · 2015-04-10T16:26:20Z

@jreback Ready to merge, I think

jreback · 2015-04-11T18:35:20Z

thanks!

artemyk force-pushed the sparse_with_nancols branch from 4e10a64 to 66a4714 Compare November 15, 2014 01:36

jreback added the Sparse Sparse Data Type label Nov 15, 2014

artemyk mentioned this pull request Nov 15, 2014

ENH: Allow get_dummies to return SparseDataFrame #8823

Closed

artemyk force-pushed the sparse_with_nancols branch from 35ba20e to e9ed3d8 Compare November 17, 2014 22:01

artemyk force-pushed the sparse_with_nancols branch from 5d04093 to c776988 Compare March 26, 2015 04:50

jreback reviewed Apr 8, 2015

jreback added this to the 0.16.1 milestone Apr 8, 2015

jreback added the Bug label Apr 8, 2015

artemyk force-pushed the sparse_with_nancols branch from c776988 to b0e5ee3 Compare April 8, 2015 21:55

Fix to allow sparse dataframes to have nan column labels

7879205

Support for nan columns Fix Trigger Travis CI jreback fixes Release note update

artemyk force-pushed the sparse_with_nancols branch from b0e5ee3 to 7879205 Compare April 9, 2015 17:50

jreback added a commit that referenced this pull request Apr 11, 2015

Merge pull request #8822 from artemyk/sparse_with_nancols

700f6eb

BUG: Error when creating sparse dataframe with nan column label

jreback merged commit 700f6eb into pandas-dev:master Apr 11, 2015

kernc mentioned this pull request Jul 12, 2017

PERF: SparseDataFrame._init_dict uses intermediary dict, not DataFrame #16883

Merged

