wide_to_long should verify uniqueness #16382

Closed
ohiobicyclist opened this Issue May 18, 2017 · 10 comments

@ohiobicyclist

ohiobicyclist commented May 18, 2017

Code Sample, a copy-pastable example if possible

# This code produces stacked tables in 0.19.2 but, respectively, an error and a blank table in 0.20.1
import pandas as pd
pd.__version__
seed = []
for i in range(14):
    seed.append([1, 2, 3, 4, 5])
seed.append([1] * 5)
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_B1", "A_A2", "B_B2", "A_A3", "B_B3", "A_A4", 
                   "B_B4", "A_A5", "B_B5", "A_A6", "B_B6", "A_A7", "B_B7", "x"]
test_df
# 0.19.2: a table with the 'i' field all as '1'.  0.20.1:  NotImplementedError
pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
# 0.19.2: a table; 0.20.1: an empty table
# this line "clears" the error above in 0.20.1 by assigning each row a unique identifier.
test_df["x"] = test_df.apply(lambda row: row["A_A1"], axis=1)
pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")

Problem description

The changelog lists "performance improvements" for pd.wide_to_long, but this is not an improvement for me; for these corner cases I would rather have the old behavior. Are these not mainstream enough to support?

Expected Output

1)

 		A_A 	B_B
x 	colname 		
1 	1 	1 	1
1 	1 	2
1 	1 	3
1 	1 	4
1 	1 	5
1 	2 	1
1 	2 	2
1 	2 	3
1 	2 	4
1 	2 	5
1 	3 	1
1 	3 	2
1 	3 	3
1 	3 	4
1 	3 	5
1 	4 	1
1 	4 	2
1 	4 	3
1 	4 	4
1 	4 	5
1 	5 	1
1 	5 	2
1 	5 	3
1 	5 	4
1 	5 	5
2 	1 	1
2 	1 	2
2 	1 	3
2 	1 	4
2 	1 	5
... 	... 	...
6 	5 	1
6 	5 	2
6 	5 	3
6 	5 	4
6 	5 	5
7 	1 	1
7 	1 	2
7 	1 	3
7 	1 	4
7 	1 	5
7 	2 	1
7 	2 	2
7 	2 	3
7 	2 	4
7 	2 	5
7 	3 	1
7 	3 	2
7 	3 	3
7 	3 	4
7 	3 	5
7 	4 	1
7 	4 	2
7 	4 	3
7 	4 	4
7 	4 	5
7 	5 	1
7 	5 	2
7 	5 	3
7 	5 	4
7 	5 	5

175 rows x 2 columns

2)

 		A_ 	B_
x 	colname 		
1 	A1 	1.0 	NaN
2 	A1 	2.0 	NaN
3 	A1 	3.0 	NaN
4 	A1 	4.0 	NaN
5 	A1 	5.0 	NaN
1 	A2 	1.0 	NaN
2 	A2 	2.0 	NaN
3 	A2 	3.0 	NaN
4 	A2 	4.0 	NaN
5 	A2 	5.0 	NaN
1 	A3 	1.0 	NaN
2 	A3 	2.0 	NaN
3 	A3 	3.0 	NaN
4 	A3 	4.0 	NaN
5 	A3 	5.0 	NaN
1 	A4 	1.0 	NaN
2 	A4 	2.0 	NaN
3 	A4 	3.0 	NaN
4 	A4 	4.0 	NaN
5 	A4 	5.0 	NaN
1 	A5 	1.0 	NaN
2 	A5 	2.0 	NaN
3 	A5 	3.0 	NaN
4 	A5 	4.0 	NaN
5 	A5 	5.0 	NaN
1 	A6 	1.0 	NaN
2 	A6 	2.0 	NaN
3 	A6 	3.0 	NaN
4 	A6 	4.0 	NaN
5 	A6 	5.0 	NaN
... 	... 	... 	...
1 	B2 	NaN 	1.0
2 	B2 	NaN 	2.0
3 	B2 	NaN 	3.0
4 	B2 	NaN 	4.0
5 	B2 	NaN 	5.0
1 	B3 	NaN 	1.0
2 	B3 	NaN 	2.0
3 	B3 	NaN 	3.0
4 	B3 	NaN 	4.0
5 	B3 	NaN 	5.0
1 	B4 	NaN 	1.0
2 	B4 	NaN 	2.0
3 	B4 	NaN 	3.0
4 	B4 	NaN 	4.0
5 	B4 	NaN 	5.0
1 	B5 	NaN 	1.0
2 	B5 	NaN 	2.0
3 	B5 	NaN 	3.0
4 	B5 	NaN 	4.0
5 	B5 	NaN 	5.0
1 	B6 	NaN 	1.0
2 	B6 	NaN 	2.0
3 	B6 	NaN 	3.0
4 	B6 	NaN 	4.0
5 	B6 	NaN 	5.0
1 	B7 	NaN 	1.0
2 	B7 	NaN 	2.0
3 	B7 	NaN 	3.0
4 	B7 	NaN 	4.0
5 	B7 	NaN 	5.0

70 rows x 2 columns

pd_wide_to_long_changes.zip

Output of pd.show_versions()

# Output of pd.show_versions() (after upgrading):

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

TomAugspurger commented May 18, 2017

@nuffe does this seem related to #14779

@erikcs
Contributor

erikcs commented May 18, 2017

@ohiobicyclist sorry, but I do not understand what you intend to do (with either the 0.19.2 or 0.20.1 version).

From the wide_to_long docstring in 0.20:

Each row of these wide variables are assumed to be uniquely identified by
i (can be a single column name or a list of column names)

This is not directly checked, which is why you get a NotImplementedError much further down the stack trace.

After you fix this and supply a unique row identifier, you forget to specify the correct stubnames (as you did in the first example), which is why an empty data frame is returned (empty because no columns matching those stubs were found):

pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")
Out[20]:
B_B3	A_A4	A_A1	A_A6	A_A7	B_B7	B_B1	B_B5	B_B6	B_B4	A_A2	B_B2	A_A3	A_A5	A_	B_
x	colname																
pd.wide_to_long(test_df, ["A_A", "B_B"], i="x", j="colname")
A_A	B_B
x	colname		
1	1	1	1
1	1	2
1	1	3
1	1	4
1	1	5
1	2	1
1	2	2
1	2	3
1	2	4
1	2	5
1	3	1
...

which returns a new df with 5 * 7 = 35 rows, because the df had 7 matching columns per stub and 5 id rows.
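The docstring's uniqueness assumption can also be verified up front rather than discovered deep in the stack trace. A minimal sketch with hypothetical data (the id column `x` here is deliberately non-unique):

```python
import pandas as pd

# Hypothetical wide frame whose intended id column "x" repeats.
df = pd.DataFrame({"A1": [1, 2, 3], "B1": [4, 5, 6], "x": [1, 1, 2]})

# Check the assumption before reshaping instead of letting
# wide_to_long fail further down the stack:
id_is_unique = df["x"].is_unique
if not id_is_unique:
    print("id variable 'x' does not uniquely identify each row")
```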


@ohiobicyclist

ohiobicyclist commented May 18, 2017

@nuffe -- I'm a little less worried about the first example: worst comes to worst, I can just reset_index (twice if the index isn't already unique) and use that as a unique column. It's annoying that code that works in 0.19.2 breaks in 0.20.1 because of this, but at least there is a work-around. However, I contend the following: after changing "A_Ax"/"B_Bx" to "A_Ax"/"B_Ax", "A_" and "B_" ARE the correct stubnames, with the stacked suffixes being "A1", "A2", "A3", etc.
all a substrings notebook.zip

Yet even that doesn't produce a table with data under 0.20.1.
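The reset_index work-around mentioned above can be sketched as follows (hypothetical data; the promoted positional index serves as the unique i column):

```python
import pandas as pd

# Hypothetical wide frame where the candidate id column "x" is not unique.
df = pd.DataFrame({"A1": [1, 2], "A2": [3, 4], "x": [1, 1]})

# Promote the (unique) positional index to a column and use it as i.
df = df.reset_index()  # adds a unique "index" column
long_df = pd.wide_to_long(df, ["A"], i="index", j="num")
```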


@TomAugspurger
Contributor

TomAugspurger commented May 18, 2017

@ohiobicyclist can you post the code for the example that should work here in the issue?

@ohiobicyclist

ohiobicyclist commented May 18, 2017

Sure, thanks.

import pandas as pd
seed = []
for i in range(15):
    seed.append([1, 2, 3, 4, 5])
test_df = pd.DataFrame(seed).T
test_df.columns = ["A_A1", "B_A1", "A_A2", "B_A2", "A_A3", "B_A3", "A_A4", 
                   "B_A4", "A_A5", "B_A5", "A_A6", "B_A6", "A_A7", "B_A7", "x"]
pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname")

@TomAugspurger
Contributor

TomAugspurger commented May 18, 2017
That example has the same issue with the stubnames, right? It should be A_A and B_A, I think:

In [13]: pd.wide_to_long(test_df, ['A_A', 'B_A'], i='x', j='y').head()
Out[13]:
     A_A  B_A
x y
1 1    1    1
2 1    2    2
3 1    3    3
4 1    4    4
5 1    5    5
@erikcs
Contributor

erikcs commented May 18, 2017
@ohiobicyclist when you specify stubnames=["A_", "B_"], wide_to_long by default expects to find numeric suffixes (as does Stata; this is in the documentation), but that is not the case in your example. You can handle this with the suffix argument (note suffix='' below):

In [7]: pd.wide_to_long(test_df, ["A_", "B_"], i="x", j="colname", suffix='').head()
Out[7]:
            A_  B_
x colname
1 A1       1.0 NaN
  A2       1.0 NaN
  A3       1.0 NaN
  A4       1.0 NaN
  A5       1.0 NaN
@erikcs
Contributor

erikcs commented May 20, 2017
Was this helpful, @ohiobicyclist? (I am not sure whether your exact use case is better handled by another function, because judging from your examples I am not sure exactly what you need to do.)

But I agree that checking for a unique i should have been done (I can send a PR for this, @TomAugspurger).


@jreback
Contributor

jreback commented May 20, 2017

@nuffe yes pls.

@ohiobicyclist

ohiobicyclist commented May 20, 2017
This (suffix="") is helpful. What makes wide_to_long better than melt for my application is the dual-column (or multi-column) stacked output. A somewhat clearer example would be the column names
["key_sampleid1", "value_sampleid1", "key_sampleid2", "value_sampleid2", "key_otherid1", "value_otherid1"], because it can nicely stack the 3 ids with key and value. suffix="" takes care of that.

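That key/value use case can be sketched like this (hypothetical data; note that suffix="" was the 0.20.1 spelling, while later pandas versions anchor the suffix as a regex, so a permissive pattern such as suffix=".+" plays the same role there):

```python
import pandas as pd

# Hypothetical columns in the key/value shape described above.
df = pd.DataFrame({
    "key_sampleid1": ["a"], "value_sampleid1": [10],
    "key_sampleid2": ["b"], "value_sampleid2": [20],
    "key_otherid1":  ["c"], "value_otherid1":  [30],
})
df["row"] = range(len(df))  # unique id column to use as i

# Non-numeric suffixes need an explicit suffix pattern; both stubs
# stack side by side, keyed by the shared suffix in "source".
long_df = pd.wide_to_long(df, ["key_", "value_"], i="row",
                          j="source", suffix=".+")
```

The result has one row per suffix (sampleid1, sampleid2, otherid1), with the "key_" and "value_" columns stacked next to each other.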

erikcs added a commit to erikcs/pandas that referenced this issue May 20, 2017

@TomAugspurger changed the title from "Wide_to_long has lost functionality compared to 0.19.2" to "wide_to_long should verify uniqueness" May 22, 2017

TomAugspurger added a commit that referenced this issue May 23, 2017

BUG: wide_to_long should check for unique id vars (#16382) (#16403)
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 29, 2017

BUG: wide_to_long should check for unique id vars (#16382) (#16403)
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)

(cherry picked from commit 04356a8)

TomAugspurger added a commit that referenced this issue May 30, 2017

BUG: wide_to_long should check for unique id vars (#16382) (#16403)
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)

(cherry picked from commit 04356a8)

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

BUG: wide_to_long should check for unique id vars (#16382) (#16403)
* BUG: wide_to_long should check for unique id vars (#16382)

* Fix uncaught lint error

* Add whatsnew note (bug fix)