
ENH/DOC: wide_to_long performance and docstring clarification #14779

Closed · wants to merge 6 commits

Conversation

@erikcs (Contributor) commented Dec 1, 2016

Please see #14778 for details.

I make wide_to_long a bit faster (avoid slow regex search on long columns by first converting to Categorical, avoid melting all dataframes with all the id variables, and wait with trying to convert the "time" variable to int until last), and clear up the docstring.
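The Categorical trick described above can be sketched in isolation (an illustrative reconstruction, not the PR's actual code): the regex runs once per unique category instead of once per row.

```python
import pandas as pd

# A long Series of repeated column labels.
labels = pd.Series(['A2010', 'A2011', 'B2010', 'B2011'] * 1000)

# Naive approach: the regex is applied to every element.
naive = labels.str.match(r'^A')

# Categorical approach: the regex runs only over the unique categories
# (4 calls here), and the result is broadcast back via the codes.
cat = labels.astype('category')
per_category = cat.cat.categories.str.match(r'^A')
fast = pd.Series(per_category[cat.cat.codes.to_numpy()])

assert (naive.to_numpy() == fast.to_numpy()).all()
```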

@erikcs erikcs closed this Dec 1, 2016
@jreback (Contributor) commented Dec 1, 2016

FYI, you can simply push to this PR as you update.

@erikcs erikcs reopened this Dec 1, 2016
@sinhrks sinhrks added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Docs labels Dec 1, 2016
new = df[id_vars].set_index(i).join(mstubs)

try:
new.index.set_levels(new.index.levels[-1].astype(int), level=-1,
Reviewer comment (Contributor):
when / why does this raise? can you provide a comment

@erikcs (Contributor, Author) replied Dec 1, 2016:

This is just the same int conversion attempt done in the original code, since this "time" column may contain strings that cannot necessarily be converted to integers. As in the original, the index is set to [i, j], which is why this operation is done on the index at the end. I will add a comment, sorry for the confusion.
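The trailing conversion can be sketched like this (a minimal reconstruction, not the PR's exact code): the last index level holds the "time" suffixes, which may or may not be integer-like.

```python
import pandas as pd

# The last index level holds the 'j' suffixes, possibly as strings.
idx = pd.MultiIndex.from_product([[0, 1], ['2010', '2011']],
                                 names=['id', 'year'])
try:
    # Succeeds only when every suffix parses as an integer.
    idx = idx.set_levels(idx.levels[-1].astype(int), level=-1)
except ValueError:
    # Non-numeric suffixes such as 'one'/'two' stay as strings.
    pass

assert list(idx.levels[-1]) == [2010, 2011]
```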

@jreback (Contributor) commented Dec 1, 2016

@jreback jreback added the Performance Memory or execution speed performance label Dec 1, 2016
@erikcs (Contributor, Author) commented Dec 1, 2016

asv added:

[  0.00%] · For pandas commit hash 4014f118:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running reshape.wide_to_long_big.time_wide_to_long_big                                                                                 130.66ms
[ 50.00%] · For pandas commit hash 06f26b51:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running reshape.wide_to_long_big.time_wide_to_long_big                                                                                    2.05s

       before     after       ratio
   [06f26b51] [4014f118]
-     2.05s   130.66ms      0.06  reshape.wide_to_long_big.time_wide_to_long_big

@codecov-io commented Dec 2, 2016

Current coverage is 85.27% (diff: 96.15%)

Merging #14779 into master will increase coverage by <.01%

@@             master     #14779   diff @@
==========================================
  Files           144        144          
  Lines         50981      50989     +8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43470      43481    +11   
+ Misses         7511       7508     -3   
  Partials          0          0          

Powered by Codecov. Last update cb2d6eb...df1edf8

new = df[id_vars].set_index(i).join(mstubs)

# The index of the new dataframe is [i, j], if the j column is a time
# variable, try to convert this to integer.
Reviewer comment (Contributor):

not sure I understand what you are doing. can you show the index before / after

@erikcs (Contributor, Author) replied:

Jeff, here is an example:

In [8]: N = 3
   ...: df = pd.DataFrame({ 'A 2010': np.random.rand(N),
   ...:                     'A 2011': np.random.rand(N),
   ...:                     'B 2010': np.random.rand(N),
   ...:                     'B 2011': np.random.rand(N),
   ...:                      'X' : np.random.randint(N, size=N),
   ...:     })
   ...: df['id'] = df.index
   ...: df
   ...:
Out[8]:
     A 2010    A 2011    B 2010    B 2011  X  id
0  0.731823  0.790627  0.236080  0.727762  1   0
1  0.820396  0.474342  0.614218  0.363226  0   1
2  0.463291  0.210859  0.332595  0.061011  0   2

before the Try/Except

In [9]: before = pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
   ...: before.index
   ...:
Out[9]:
MultiIndex(levels=[[0, 1, 2], [u' 2010', u' 2011']],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

after

In [10]: after= pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
    ...: after.index
    ...:
Out[10]:
MultiIndex(levels=[[0, 1, 2], [2010, 2011]],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

which is the same as on the master branch

In [11]: master = pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
    ...: master.index
    ...:
Out[11]:
MultiIndex(levels=[[0, 1, 2], [2010, 2011]],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

Why the original author did the Try before converting to int:

In [13]: df2 = pd.DataFrame({ 'A one': np.random.rand(N),
    ...:                     'A two': np.random.rand(N),
    ...:                     'B one': np.random.rand(N),
    ...:                     'B two': np.random.rand(N),
    ...:                      'X' : np.random.randint(N, size=N),
    ...:     })
    ...: df2
    ...:
Out[13]:
      A one     A two     B one     B two  X
0  0.315281  0.684260  0.397193  0.531613  1
1  0.156044  0.749942  0.923540  0.383348  0
2  0.577983  0.507933  0.226466  0.937341  0

In long format:

In [15]: df2['id'] = df2.index
    ...: pd.wide_to_long(df2, ['A', 'B'], i='id', j='year')
    ...:
Out[15]:
         X         A         B
id year
0   one  1  0.315281  0.397193
1   one  0  0.156044  0.923540
2   one  0  0.577983  0.226466
0   two  1  0.684260  0.531613
1   two  0  0.749942  0.383348
2   two  0  0.507933  0.937341

Reviewer comment (Contributor):
I don't like the auto coercing of the strings -> ints. This is not very idiomatic and unexpected. I would leave the columns as strings.

@erikcs (Contributor, Author) replied Dec 4, 2016:

Fixed, but regarding a character that separates the stub name from the variable part:

In [7]: df = pd.DataFrame({ 'A.2010': np.random.rand(N),
   ...:                     'A.2011': np.random.rand(N),
   ...:                     'B.2010': np.random.rand(N),
   ...:                     'B.2011': np.random.rand(N),
   ...:                      'X' : np.random.randint(N, size=N),
   ...:     })
   ...:
   ...: df
   ...:
Out[7]:
     A.2010    A.2011    B.2010    B.2011  X
0  0.873404  0.467946  0.569808  0.358077  1
1  0.780154  0.554582  0.668437  0.810530  1
2  0.884003  0.555784  0.246305  0.038423  2
In [8]: df['id'] = df.index
   ...: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year')
   ...:
Out[8]:
         X        A.        B.
id year
0  2010  1  0.873404  0.569808
1  2010  1  0.780154  0.668437
2  2010  2  0.884003  0.246305
0  2011  1  0.467946  0.358077
1  2011  1  0.554582  0.810530
2  2011  2  0.555784  0.038423

A user might expect the new separating character (.) to be stripped, like reshape in R does.

@jreback (Contributor) commented Dec 4, 2016

needs a whatsnew entry (0.20.)

@jreback (Contributor) commented Dec 4, 2016

you could add an argument to specify the split (or make it take a regex)

yes it should get stripped

@@ -88,6 +88,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of ``wide_to_long`` (:issue:`14779`)
Reviewer comment (Contributor):
pd.wide_to_long()

@@ -875,7 +875,7 @@ def lreshape(data, groups, dropna=True, label=None):
return DataFrame(mdata, columns=id_cols + pivot_cols)


def wide_to_long(df, stubnames, i, j):
def wide_to_long(df, stubnames, i, j, sep=""):
Reviewer comment (Contributor):
maybe make it sep='\s+' whitespace?

@erikcs (Contributor, Author) replied:

hmm strange, rstrip doesn't seem to recognise that?

In [13]: 'A (quarterly) '.rstrip('\s+')
Out[13]: 'A (quarterly) '
In [14]: 'A (quarterly) '.rstrip(" ")
Out[14]: 'A (quarterly)'
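For reference, str.rstrip treats its argument as a set of individual characters to remove, not a regex, which explains the behaviour above. A pattern-based strip needs re.sub anchored at the end:

```python
import re

s = 'A (quarterly) '

# rstrip('\\s+') removes trailing '\\', 's', and '+' characters; none of
# those appear at the end of s, so the string is unchanged.
assert s.rstrip(r'\s+') == 'A (quarterly) '

# To strip by regex pattern, anchor the pattern at the end with re.sub.
assert re.sub(r'\s+$', '', s) == 'A (quarterly)'
```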

@@ -890,8 +890,9 @@ def wide_to_long(df, stubnames, i, j):
The name of the id variable.
j : str
The name of the subobservation variable.
stubend : str
Regex to match for the end of the stubs.
sep : str, optional
Reviewer comment (Contributor):

specify what the default is

exp_frame = exp_frame.set_index(['id', 'year'])[["X", "A", "B"]]
long_frame = wide_to_long(df, ['A', 'B'], 'id', 'year')
tm.assert_frame_equal(long_frame, exp_frame)

Reviewer comment (Contributor):

can you add some tests with sep (and maybe some that have an invalid sep)?

@erikcs (Contributor, Author) replied:

What input were you thinking of?

if a nonsense separator is passed nothing is stripped:

In [15]: df = pd.DataFrame({'A.2010': np.random.rand(3),
    ...:                    'A.2011': np.random.rand(3),
    ...:                    'B.2010': np.random.rand(3),
    ...:                    'B.2011': np.random.rand(3),
    ...:                    'X' : np.random.randint(3, size=3)})
    ...: df['id'] = df.index
    ...: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year', sep="nope")
    ...:
Out[15]:
         X        A.        B.
id year
0  2010  2  0.330193  0.728615
1  2010  0  0.710791  0.601923
2  2010  1  0.066218  0.618455
0  2011  2  0.597949  0.324131
1  2011  0  0.024911  0.968051
2  2011  1  0.310596  0.866798
In [16]: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year', sep=",,")
Out[16]:
         X        A.        B.
id year
0  2010  2  0.330193  0.728615
1  2010  0  0.710791  0.601923
2  2010  1  0.066218  0.618455
0  2011  2  0.597949  0.324131
1  2011  0  0.024911  0.968051
2  2011  1  0.310596  0.866798

@jreback (Contributor) commented Dec 4, 2016

further, not sure if http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt
needs any updates?

@jreback (Contributor) commented Dec 5, 2016

that's a regex

@erikcs (Contributor, Author) commented Dec 5, 2016

Sorry, I didn't get that? And melt looks pretty rock solid.

@jreback (Contributor) commented Dec 5, 2016

@Nuffe I was referring to wide_to_long in the docs (e.g. is the example ok).

you are splitting on the sep character, but it could also be a regex, so doing something like

In [7]: re.split('\s+','A  2010')
Out[7]: ['A', '2010']

In [8]: re.split('\s+','A 2010')
Out[8]: ['A', '2010']

is probably reasonable

@erikcs (Contributor, Author) commented Dec 5, 2016

To keep things simple I propose we break the API of wide_to_long and change the signature to adhere more to how R's reshape does this.

wide_to_long(df, varying, i, j, sep=' ')

I.e. the user passes the names of the time-varying columns as a list varying, and each name x is expected (checked) to adhere to the following rules (more or less what reshape assumes):

- based on sep, x is split into exactly two strings
- sep (a single character) is constrained to be non-alphanumeric: \s, ., ;, etc.
- the first part is the 'stubname', the second part is the 'time' part

In [3]: df = pd.DataFrame({"A 1970" : {0 : "a", 1 : "b", 2 : "c"},
   ...:                     "A 1980" : {0 : "d", 1 : "e", 2 : "f"},
   ...:                     "B 1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
   ...:                     "B 1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
   ...:                     "X"     : dict(zip(range(3), np.random.randn(3)))
   ...:                    })
   ...: df['id'] = df.index
   ...: df
   ...:
Out[3]:
  A 1970 A 1980  B 1970  B 1980         X  id
0      a      d     2.5     3.2  0.136953   0
1      b      e     1.2     1.3 -1.238109   1
2      c      f     0.7     0.1  1.249809   2
In [4]: varying = ['A 1970', 'A 1980', 'B 1970', 'B 1980']
   ...: pd.wide_to_long(df, varying, i='id', j='year', sep=' ')
   ...:
Out[4]:
                X  A    B
id year
0  1970  0.136953  a  2.5
1  1970 -1.238109  b  1.2
2  1970  1.249809  c  0.7
0  1980  0.136953  d  3.2
1  1980 -1.238109  e  1.3
2  1980  1.249809  f  0.1

The user can easily construct the varying list with a regex, a doc example can show this.

If the existing columns do not adhere to the above specification, they need to be changed to a suitable format first. A doc example can show how this can easily be done with a regex with a backreference.
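A hypothetical version of both doc examples (column names and patterns are made up for illustration): rename with a backreference, then build the varying list with a regex.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 6),
                  columns=['A1970', 'A1980', 'B1970', 'B1980', 'X', 'id'])

# Insert the separator '.' between the captured stub and numeric suffix.
df.columns = df.columns.str.replace(r'^([AB])(\d+)$', r'\1.\2', regex=True)

# Build the varying list by matching '<stub>.<4-digit year>'.
varying = df.columns[df.columns.str.match(r'^[AB]\.\d{4}$')].tolist()

assert varying == ['A.1970', 'A.1980', 'B.1970', 'B.1980']
```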

What do you think?

@jreback (Contributor) commented Dec 5, 2016

the varying should be a list of tuples
space separated elements in lists are not pythonic

but otherwise looks ok

can this be backward compat?

@erikcs (Contributor, Author) commented Dec 5, 2016

I didn't understand the first comment: varying is just the names of all columns that should be varying. For a sample dataframe it could be df.iloc[:, 4:11].columns.tolist(). That the names happen to be space separated is just how the data columns ended up looking in messy real-world data.

And I do not think this can be made backward compat because the varying argument would be different, stubnames would now be computed inside the function.

The old doc example, where there is no single-character separator, e.g. varying = ['A1970', 'A1980', 'B1970', 'B1980'], will only work by first converting the column names to the allowed format with

df.columns.str.replace('([A-B])', '\\1.')
Index([u'A.1970', u'A.1980', u'B.1970', u'B.1980', u'X', u'id'], dtype='object')

then calling wide_to_long with varying = ['A.1970', 'A.1980', 'B.1970', 'B.1980']

I do not know if this really is considered too unwieldy? R's reshape interface is perhaps not the most user friendly, but with plenty of doc examples its flexibility could be appreciated.

The original function author seems to have tried to mimic Stata's reshape, which essentially only takes the stubnames as argument. The problem is that in Stata column names are highly constrained (they cannot, for example, contain whitespace or non-alphanumeric characters), while in pandas they can be any utf8 string, which makes it much harder to generalize.

So if we want to preserve the original author's intention, where the user only supplies stubnames as in Stata, we need to impose some strict assumptions on the column names passed: the only kind of (varying) column names we can accept are of the type PrefixPostfix, where Prefix and Postfix are alphanumeric. These are the only ones Stata's reshape needs to consider, and what the original wide_to_long function implicitly assumes. We can also handle PrefixSepPostfix, where Sep is a single separating character.

Perhaps it is just better to make this implicit assumption explicit and keep its "Stata like" interface? And make it robust to this specification (PrefixSepPostfix), because the master branch function breaks with plenty of variations of it.

(sorry for the messiness here, but I ended up spending some time familiarizing myself with R's less user-friendly approach to this problem and Stata's more user-friendly but less flexible approach)

@erikcs (Contributor, Author) commented Dec 5, 2016

So here is an attempt to make the original interface more robust. These two examples fail on the master branch, but should produce the correct results shown below:

In [12]: df = pd.DataFrame({
    ...:         'A11': ['a11', 'a22', 'a33'],
    ...:         'A12': ['a21', 'a22', 'a23'],
    ...:         'B11': ['b11', 'b12', 'b13'],
    ...:         'B12': ['b21', 'b22', 'b23'],
    ...:         'BB11': [1, 2, 3],
    ...:         'BB12': [4, 5, 6],
    ...:         'BBBX' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[12]:
   A11  A12  B11  B12  BB11  BB12  BBBX  BBBZ  id
0  a11  a21  b11  b21     1     4    91    91   0
1  a22  a22  b12  b22     2     5    92    92   1
2  a33  a23  b13  b23     3     6    93    93   2

In [13]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year')
Out[13]:
         BBBX  BBBZ    A    B  BB
id year
0  11      91    91  a11  b11   1
1  11      92    92  a22  b12   2
2  11      93    93  a33  b13   3
0  12      91    91  a21  b21   4
1  12      92    92  a22  b22   5
2  12      93    93  a23  b23   6
In [14]: df = pd.DataFrame({
    ...:         'A(quarterly)2011': ['a11', 'a22', 'a33'],
    ...:         'A(quarterly)2012': ['a21', 'a22', 'a23'],
    ...:         'B(quarterly)2011': ['b11', 'b12', 'b13'],
    ...:         'B(quarterly)2012': ['b21', 'b22', 'b23'],
    ...:         'BB(quarterly)2011': [1, 2, 3],
    ...:         'BB(quarterly)2012': [4, 5, 6],
    ...:         'BBBX' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[14]:
  A(quarterly)2011 A(quarterly)2012 B(quarterly)2011 B(quarterly)2012  \
0              a11              a21              b11              b21
1              a22              a22              b12              b22
2              a33              a23              b13              b23

   BB(quarterly)2011  BB(quarterly)2012  BBBX  BBBZ  id
0                  1                  4    91    91   0
1                  2                  5    92    92   1
2                  3                  6    93    93   2

In [15]: pd.wide_to_long(df, ['A(quarterly)', 'B(quarterly)', 'BB(quarterly)'], i='id', j='year')
Out[15]:
         BBBX  BBBZ A(quarterly) B(quarterly)  BB(quarterly)
id year
0  2011    91    91          a11          b11              1
1  2011    92    92          a22          b12              2
2  2011    93    93          a33          b13              3
0  2012    91    91          a21          b21              4
1  2012    92    92          a22          b22              5
2  2012    93    93          a23          b23              6

The first one fails because the regex confuses the shared substrings in the id_vars and value_vars, the second one because of the parentheses.

Assuming a Prefix(Optional Sep)Postfix structure on the "time" variables, I tried to make it robust:

In [16]: df = pd.DataFrame({
    ...:         'A11': ['a11', 'a22', 'a33'],
    ...:         'A12': ['a21', 'a22', 'a23'],
    ...:         'B11': ['b11', 'b12', 'b13'],
    ...:         'B12': ['b21', 'b22', 'b23'],
    ...:         'BB11': [1, 2, 3],
    ...:         'BB12': [4, 5, 6],
    ...:         'Acat' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[16]:
   A11  A12  Acat  B11  B12  BB11  BB12  BBBZ  id
0  a11  a21    91  b11  b21     1     4    91   0
1  a22  a22    92  b12  b22     2     5    92   1
2  a33  a23    93  b13  b23     3     6    93   2

raises a ValueError: ('Ambiguous names: ', ['A11', 'A12', 'Acat']) .

While the following works

In [18]: df = pd.DataFrame({
    ...:         'A-11': ['a11', 'a22', 'a33'],
    ...:         'A-12': ['a21', 'a22', 'a23'],
    ...:         'B-11': ['b11', 'b12', 'b13'],
    ...:         'B-12': ['b21', 'b22', 'b23'],
    ...:         'BB-11': [1, 2, 3],
    ...:         'BB-12': [4, 5, 6],
    ...:         'Acat' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[18]:
  A-11 A-12  Acat B-11 B-12  BB-11  BB-12  BBBZ  id
0  a11  a21    91  b11  b21      1      4    91   0
1  a22  a22    92  b12  b22      2      5    92   1
2  a33  a23    93  b13  b23      3      6    93   2

In [19]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year', sep='-')
Out[19]:
         Acat  BBBZ    A    B  BB
id year
0  11      91    91  a11  b11   1
1  11      92    92  a22  b12   2
2  11      93    93  a33  b13   3
0  12      91    91  a21  b21   4
1  12      92    92  a22  b22   5
2  12      93    93  a23  b23   6

@jreback (Contributor) commented Dec 5, 2016

cc @jseabold any thought here.

@Nuffe ideally we want to make this back-compat (you can do introspection in the code to figure out what you are passed and such). And have as simple an API as possible.

@erikcs (Contributor, Author) commented Dec 6, 2016

I have maintained the user-friendly (and evidently Stata-inspired) interface (and stated what structure this function assumes on the column names), and tried to fix the mistakes that arise with various "pathological" inputs, for example stubnames that contain groups sharing a substring (which I discovered when I tried different examples in Stata and compared them to wide_to_long).

Notes
-----
All extra variables are treated as extra id variables. This simply uses
`pandas.melt` under the hood, but is hard-coded to "do the right thing"
in a typical case.
"""
# For robustness, escape every user input string we use in a regex
import re
Reviewer comment (Contributor):

can be imported at the top of the file

# For ex. AA2011, AA2012, AAkitten have inconsistent postfix
for k, vars in enumerate(value_vars):
    stripped = map(lambda x: x.replace(stubs[k], ""), vars)
    is_digit = [s.isdigit() for s in stripped]
Reviewer comment (Contributor):

you have tests for this?

@erikcs (Contributor, Author) replied Dec 7, 2016:

considering the comment below on not using a regex to find the id_vars: perhaps just formulate a consistency check and warn the user if, for example, an inferred value_var has mixed types?

For example: at the end, check whether the new data frame's 'j' index contains both ints and strings and warn about this? If the stubnames supplied are ['AA2011', 'AA2012'] and df contains a column named Acat, the new dataframe's j column will have levels 2011, 2012, cat. Likewise, if stubnames contains ['CatOne', 'CatTwo'] and df has a column named Cat3000, the new j index will have levels One, Two, 3000.

The only way to disambiguate the first case is to take an optional stubendtype parameter denoting the stubends are numbers. The second case is not possible to disambiguate (tried in Stata)
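The proposed consistency check could be sketched like this (hypothetical, not what was merged): flag the suffix sets that mix numeric and non-numeric values.

```python
import warnings

import pandas as pd

# Hypothetical check: the inferred 'j' suffixes from the Acat example above
# mix numeric ('2011', '2012') and non-numeric ('cat') values.
suffixes = pd.Index(['2011', '2012', 'cat'])
numeric = suffixes.str.isdigit()

if numeric.any() and not numeric.all():
    warnings.warn("wide_to_long: inferred suffixes mix numeric and "
                  "non-numeric values")
```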

# two resulting value_vars lists
if len(value_vars_flattened + id_vars) != len(df.columns):
    value_vars_augmented = map(lambda x: get_var_names(
        df, "^{0}".format(re.escape(x))), stubnames)
Reviewer comment (Contributor):

this looks fragile. I would just raise here

@erikcs (Contributor, Author) replied Dec 6, 2016:

Or instead of doing a search for the id_vars in the first place, would it not be simpler just to do:

id_vars = set(df.columns.tolist()).difference(value_vars_flattened)?

(then do some consistency checks)
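That suggestion in runnable form (column names are illustrative); sorting keeps the result deterministic, since sets are unordered:

```python
import pandas as pd

# Everything that is not a melted value column is an id column.
df = pd.DataFrame(columns=['A11', 'A12', 'B11', 'B12', 'X', 'id'])
value_vars_flattened = ['A11', 'A12', 'B11', 'B12']

id_vars = sorted(set(df.columns).difference(value_vars_flattened))
assert id_vars == ['X', 'id']
```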

# This regex is needed to avoid multiple "greedy" matches with stubs
# that have overlapping substrings
# (for example A2011, A2012 are separate from AA2011, AA2012)
value_vars = list(map(lambda x: get_var_names(
Reviewer comment (Contributor):

ideally you would just look for a match of a letter followed by a non-letter (or vice versa), I think that is more robust.

@erikcs (Contributor, Author) replied Dec 6, 2016:

But in the case of string stems, the three groups here will not be captured:

Aone, Atwo, Bone, Btwo, BBone, BBtwo

A negative lookahead ^B(?!B) could be more robust? I.e. the regex would be "^{0}(?!{1})".format(re.escape(x), x[-1]). That one would capture the three groups here and ignore, for example, BBBrating.
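The lookahead idea, checked against the example groups above:

```python
import re

columns = ['Aone', 'Atwo', 'Bone', 'Btwo', 'BBone', 'BBtwo', 'BBBrating']

# '^B(?!B)' matches the 'B' stub only when no further 'B' follows, so the
# 'B' and 'BB' groups stay separate, and 'BBBrating' is excluded from both.
b_only = [c for c in columns if re.match(r'^B(?!B)', c)]
bb_only = [c for c in columns if re.match(r'^BB(?!B)', c)]

assert b_only == ['Bone', 'Btwo']
assert bb_only == ['BBone', 'BBtwo']
```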

@erikcs (Contributor, Author) commented Dec 8, 2016

I found another "Stata like" use case wide_to_long doesn't handle: if the data frame does not have a single id column that identifies the wide variables, Stata users can supply the necessary column names, e.g. ['id1', 'id2']. Currently the user would have to generate a new column that identifies the ['id1', 'id2'] combination and pass that as id.

(I am going to add the option of supplying a list of 'id' variables; it will require another short rewrite, since I have to move from join to merge to handle the new multilevel index)

@erikcs (Contributor, Author) commented Dec 9, 2016

Sometimes AppVeyor/Travis fails with unrelated tests (like test_bar_log_subplots right now). Any hint on what I should do here? I've looked at the failing cases (unrelated plot methods), and there doesn't seem to be any mutated state, e.g. a changed random number seed (tests pass on my OS X laptop in both Python 2 and Python 3 virtual envs). Thanks

@jreback (Contributor) commented Dec 10, 2016

can you rebase? problem with AppVeyor which I just fixed

Commits:
- Speed up by avoiding big copies, and regex on categorical column
- Add functionality to deal with "pathological" input
- Add docstring examples and more test cases
in the wide format, to be stripped from the names in the long format.
For example, if your column names are A-suffix1, A-suffix2, you
can strip the hyphen by specifying `sep`='-'
numeric_suffix : bool, default True
Reviewer comment (Contributor):

I would rather call this suffix='\d+', IOW use a regex to match this, no?

@erikcs (Contributor, Author) replied:

Yes, that makes more sense

Going from long back to wide just takes some creative use of `unstack`

>>> w = l.reset_index().set_index(['famid', 'birth', 'age']).unstack()
>>> w.columns = [name + suffix for name, suffix in w.columns.tolist()]
Reviewer comment (Contributor):

use this:

In [28]: Index(w.columns).str.join('')
Out[28]: Index(['ht1', 'ht2'], dtype='object')
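On recent pandas versions the same flattening can be done with map over the MultiIndex columns (the .str accessor no longer applies to a MultiIndex); a minimal sketch:

```python
import pandas as pd

# Two-level columns like those produced by unstack.
w = pd.DataFrame({('ht', '1'): [2.8], ('ht', '2'): [3.4]})

# Join each (name, suffix) tuple into a flat label.
w.columns = w.columns.map(''.join)

assert list(w.columns) == ['ht1', 'ht2']
```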

if any(map(lambda s: s in df.columns.tolist(), stubnames)):
    raise ValueError("stubname can't be identical to a column name")

if not isinstance(stubnames, list):
Reviewer comment (Contributor):

we usually use is_list_like, IOW, you can pass a non-string iterable, can you update the doc-string as well

@@ -716,6 +716,204 @@ def test_stubs(self):

self.assertEqual(stubs, ['inc', 'edu'])

def test_separating_character(self):
    np.random.seed(123)
Reviewer comment (Contributor):
can you add this issue number as a comment

@jreback jreback added this to the 0.20.0 milestone Dec 11, 2016
@jreback (Contributor) commented Dec 11, 2016

lgtm. some minor comments.

@jorisvandenbossche

Commits:
- Use is_list_like
- Add GH ticket #
@erikcs (Contributor, Author) commented Dec 11, 2016

@jreback another minor issue: Sphinx doesn't print the \ in the docstring suffix section. The only way I managed to get it printed was to set the entire docstring to a raw string literal and escape the backslash: '\\d+'

@jorisvandenbossche (Member) commented:

> set the entire docstring to a raw string literal

You can indeed do that, but normally then the escaping should not be needed
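Indeed, with a raw docstring no extra escaping is needed; a minimal check (the function name is hypothetical):

```python
def wide_to_long_stub(suffix=r'\d+'):
    r"""Hypothetical stub: a raw docstring keeps the backslash in ``\d+``
    intact for Sphinx to render."""
    return suffix

# The literal backslash survives in both the docstring and the default.
assert r'\d+' in wide_to_long_stub.__doc__
assert wide_to_long_stub() == r'\d+'
```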

@jorisvandenbossche (Member) left a review comment:

I didn't follow the full discussion above, but there was some talk about backwards compatibility. What is the conclusion on that? Is the last version back compat or are there changes in behaviour?


def setup(self):
vars = 'ABCD'
nyrs = 20
Reviewer comment (Member):

Can you fix up the indentation here?

idobs = dict(zip(range(nidvars), np.random.rand(nidvars, N)))

self.df = pd.concat([pd.DataFrame(idobs), pd.DataFrame(yearobs)],
                    axis=1)
Reviewer comment (Member):

I think you can also do something like DataFrame(np.random.randn(N, nidvars + len(yrvars)), columns=list(range(nidvars)) + yrvars) to make it a bit simpler

@erikcs (Contributor, Author) commented Dec 11, 2016

@jorisvandenbossche Yes, this version is back compat. The PR got a bit lengthy because I did more than I anticipated (it was originally a simple PR for a quick speed improvement, but I discovered afterwards that there were several use cases the original function couldn't handle).

@jreback jreback closed this in 86233e1 Dec 13, 2016
@jreback (Contributor) commented Dec 13, 2016

thanks @Nuffe very nice PR, and you were very responsive!

if you want to tackle other issues would be much appreciated!

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
closes pandas-dev#14778

Please see pandas-dev#14778 for details. I make wide_to_long a bit faster (avoid slow
regex search on long columns by first converting to Categorical, avoid
melting all dataframes with all the id variables, and wait with trying
to convert the "time" variable to `int` until last), and clear up the
docstring.

Author: nuffe <erik.cfr@gmail.com>

Closes pandas-dev#14779 from nuffe/wide2longfix and squashes the following commits:

df1edf8 [nuffe] asv_bench: fix indentation and simplify
dc13064 [nuffe] Set docstring to raw literal to allow backslashes to be printed (still had to escape them)
295d1e6 [nuffe] Use pd.Index in doc example
1c49291 [nuffe] Can of course get rid negative lookahead now that suffix is a regex
54c5920 [nuffe] Specify the suffix with a regex
5747a25 [nuffe] ENH/DOC: wide_to_long performance and functionality improvements (pandas-dev#14779)
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Successfully merging this pull request may close these issues.

ENH/DOC: wide_to_long performance and docstring clarification
5 participants