Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG SparseDataFrame with dense Series (#19374) #19377

Merged
merged 1 commit into from Jan 27, 2018
Merged

BUG SparseDataFrame with dense Series (#19374) #19377

merged 1 commit into from Jan 27, 2018

Conversation

datapythonista
Copy link
Member

@datapythonista datapythonista commented Jan 24, 2018

@gfyoung gfyoung added Bug Sparse Sparse Data Type labels Jan 24, 2018
def test_constructor_from_unknown_type(self):
class Unknown:
pass
pytest.raises(TypeError, SparseDataFrame, Unknown())
Copy link
Member

@gfyoung gfyoung Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check error message for all of your pytest.raises calls.

@@ -199,6 +199,31 @@ def test_constructor_from_series(self):
# without sparse value raises error
# df2 = SparseDataFrame([x2_sparse, y])

def test_constructor_from_dense_series(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference issue number under all of your added tests.

@gfyoung
Copy link
Member

gfyoung commented Jan 24, 2018

@datapythonista : Looks pretty good so far. Don't forget to add a whatsnew entry.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note as well.

x = Series(np.random.randn(10000), name='a')
assert isinstance(x, Series)
df = SparseDataFrame(x)
assert isinstance(df, SparseDataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

construct an expected SDF and compare

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

construct a DataFrame and use .to_sparse() to construct an expected frame. we do not want to do all of these little checks, we already have well established comparison functions, e.g. tm.assert_sparse_equal for this

assert isinstance(df, SparseDataFrame)
assert df.columns == ['b']

# No column name available
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for the rest

@@ -95,6 +95,13 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None,
dtype=dtype, copy=copy)
elif isinstance(data, DataFrame):
mgr = self._init_dict(data, data.index, data.columns, dtype=dtype)
elif isinstance(data, Series):
if columns is None and data.name is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a test to hit this case

@datapythonista
Copy link
Member Author

Thanks a lot for the comments. Sorry about the whatsnew, I added it when opening the PR, but forgot to add it to the commit.

@jreback, I didn't find a way to construct the expected SparseDataFrame without using the constructor that is being tested itself. But I added the comparison of the length and sum of the original series and the new SparseDataFrame, which I think should be enough (and it helped find a bug in my previous code).

Addressed all the other comments, let me know if you see anything else. Thanks!

elif len(columns) != 1:
raise ValueError('columns must be of length one '
'if data is of type Series')
mgr = self._init_dict(data.to_frame(columns[0]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't construct like this, actually make dict

x = Series(np.random.randn(10000), name='a')
assert isinstance(x, Series)
df = SparseDataFrame(x)
assert isinstance(df, SparseDataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

construct a DataFrame and use .to_sparse() to construct an expected frame. we do not want to do all of these little checks, we already have well established comparison functions, e.g. tm.assert_sparse_equal for this

if columns is None:
if data.name is None:
raise ValueError('cannot pass a series '
'w/o a name or columns')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need any of this, you can simply all to_manager with the column (in a list)

@datapythonista
Copy link
Member Author

Thanks for the feedback @jreback, I misunderstood what the columns argument had to do, and I was overcomplicating the code.

Now it simply creates the DataFrame with the Series name if it has one, or it uses 0 as the column if it doesn't. This seems to be consistent with the Series constructor and with creating a SparseDataFrame from a DataFrame, which ignores unknown columns.

I'm not 100% sure I understood what you mean by "actually contruct dic", please let me know if this new version doesn't address this too.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small edits, other lgtm. ping on green.

# GH 19393
# series with name
x = Series(np.random.randn(10000), name='a')
assert isinstance(x, Series)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this assert

# series with name
x = Series(np.random.randn(10000), name='a')
assert isinstance(x, Series)
res = SparseDataFrame(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer result and expected

x = Series(np.random.randn(10000), name='a')
assert isinstance(x, Series)
res = SparseDataFrame(x)
assert isinstance(res, SparseDataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the assert

# series with no name
x = Series(np.random.randn(10000))
assert isinstance(x, Series)
res = SparseDataFrame(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same w.r.t naming & asserts

@jreback jreback added this to the 0.23.0 milestone Jan 26, 2018
…r, and providing useful error messages for other types (#19374)
@datapythonista
Copy link
Member Author

Thanks once again for the comments @jreback, seems that I followed the conventions of an old test. Addressed your comments, should be all right now.

@jreback jreback merged commit eec7f57 into pandas-dev:master Jan 27, 2018
@jreback
Copy link
Contributor

jreback commented Jan 27, 2018

thanks @datapythonista

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SparseDataFrame with dense Series or unknown type
3 participants