
ENH/DOC: wide_to_long performance and docstring clarification #14779

Closed · wants to merge 6 commits

Conversation

@erikcs (Contributor) commented Dec 1, 2016

Please see #14778 for details.

I make wide_to_long a bit faster (avoid slow regex search on long columns by first converting to Categorical, avoid melting all dataframes with all the id variables, and wait with trying to convert the "time" variable to int until last), and clear up the docstring.
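The Categorical trick described above can be sketched in isolation (an illustrative reconstruction, not the PR's actual code): the regex runs once per unique category instead of once per row.

```python
import pandas as pd

# A long Series of repeated column labels.
labels = pd.Series(['A2010', 'A2011', 'B2010', 'B2011'] * 1000)

# Naive approach: the regex is applied to every element.
naive = labels.str.match(r'^A')

# Categorical approach: the regex runs only over the unique categories
# (4 calls here), and the result is broadcast back via the codes.
cat = labels.astype('category')
per_category = cat.cat.categories.str.match(r'^A')
fast = pd.Series(per_category[cat.cat.codes.to_numpy()])

assert (naive.to_numpy() == fast.to_numpy()).all()
```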

@erikcs erikcs closed this Dec 1, 2016
@jreback (Contributor) commented Dec 1, 2016

FYI, you can simply push to this PR as you update.

@erikcs erikcs reopened this Dec 1, 2016
@sinhrks sinhrks added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Docs labels Dec 1, 2016
new = df[id_vars].set_index(i).join(mstubs)

try:
new.index.set_levels(new.index.levels[-1].astype(int), level=-1,
Reviewer comment (Contributor):
when / why does this raise? can you provide a comment

@erikcs (Contributor, Author) replied Dec 1, 2016:

This is just the same int conversion attempt done in the original code, since this "time" column may contain strings that cannot necessarily be converted to integers. As in the original, the index is set to [i, j], which is why this operation is done on the index at the end. I will add a comment, sorry for the confusion.
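The trailing conversion can be sketched like this (a minimal reconstruction, not the PR's exact code): the last index level holds the "time" suffixes, which may or may not be integer-like.

```python
import pandas as pd

# The last index level holds the 'j' suffixes, possibly as strings.
idx = pd.MultiIndex.from_product([[0, 1], ['2010', '2011']],
                                 names=['id', 'year'])
try:
    # Succeeds only when every suffix parses as an integer.
    idx = idx.set_levels(idx.levels[-1].astype(int), level=-1)
except ValueError:
    # Non-numeric suffixes such as 'one'/'two' stay as strings.
    pass

assert list(idx.levels[-1]) == [2010, 2011]
```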

@jreback (Contributor) commented Dec 1, 2016

@jreback jreback added the Performance Memory or execution speed performance label Dec 1, 2016
@erikcs (Contributor, Author) commented Dec 1, 2016

asv added:

[  0.00%] · For pandas commit hash 4014f118:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running reshape.wide_to_long_big.time_wide_to_long_big                                                                                 130.66ms
[ 50.00%] · For pandas commit hash 06f26b51:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running reshape.wide_to_long_big.time_wide_to_long_big                                                                                    2.05s

       before     after       ratio
   [06f26b51] [4014f118]
-     2.05s   130.66ms      0.06  reshape.wide_to_long_big.time_wide_to_long_big

@codecov-io commented Dec 2, 2016

Current coverage is 85.27% (diff: 96.15%)

Merging #14779 into master will increase coverage by <.01%

@@             master     #14779   diff @@
==========================================
  Files           144        144          
  Lines         50981      50989     +8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43470      43481    +11   
+ Misses         7511       7508     -3   
  Partials          0          0          

Powered by Codecov. Last update cb2d6eb...df1edf8

new = df[id_vars].set_index(i).join(mstubs)

# The index of the new dataframe is [i, j], if the j column is a time
# variable, try to convert this to integer.
Reviewer comment (Contributor):

not sure I understand what you are doing. can you show the index before / after

@erikcs (Contributor, Author) replied:

Jeff, here is an example:

In [8]: N = 3
   ...: df = pd.DataFrame({ 'A 2010': np.random.rand(N),
   ...:                     'A 2011': np.random.rand(N),
   ...:                     'B 2010': np.random.rand(N),
   ...:                     'B 2011': np.random.rand(N),
   ...:                      'X' : np.random.randint(N, size=N),
   ...:     })
   ...: df['id'] = df.index
   ...: df
   ...:
Out[8]:
     A 2010    A 2011    B 2010    B 2011  X  id
0  0.731823  0.790627  0.236080  0.727762  1   0
1  0.820396  0.474342  0.614218  0.363226  0   1
2  0.463291  0.210859  0.332595  0.061011  0   2

before the Try/Except

In [9]: before = pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
   ...: before.index
   ...:
Out[9]:
MultiIndex(levels=[[0, 1, 2], [u' 2010', u' 2011']],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

after

In [10]: after= pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
    ...: after.index
    ...:
Out[10]:
MultiIndex(levels=[[0, 1, 2], [2010, 2011]],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

which is the same as on the master branch

In [11]: master = pd.wide_to_long(df, ['A', 'B'], i='id', j='year')
    ...: master.index
    ...:
Out[11]:
MultiIndex(levels=[[0, 1, 2], [2010, 2011]],
           labels=[[0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1]],
           names=[u'id', u'year'])

Why the original author did the Try before converting to int:

In [13]: df2 = pd.DataFrame({ 'A one': np.random.rand(N),
    ...:                     'A two': np.random.rand(N),
    ...:                     'B one': np.random.rand(N),
    ...:                     'B two': np.random.rand(N),
    ...:                      'X' : np.random.randint(N, size=N),
    ...:     })
    ...: df2
    ...:
Out[13]:
      A one     A two     B one     B two  X
0  0.315281  0.684260  0.397193  0.531613  1
1  0.156044  0.749942  0.923540  0.383348  0
2  0.577983  0.507933  0.226466  0.937341  0

In long format:

In [15]: df2['id'] = df2.index
    ...: pd.wide_to_long(df2, ['A', 'B'], i='id', j='year')
    ...:
Out[15]:
         X         A         B
id year
0   one  1  0.315281  0.397193
1   one  0  0.156044  0.923540
2   one  0  0.577983  0.226466
0   two  1  0.684260  0.531613
1   two  0  0.749942  0.383348
2   two  0  0.507933  0.937341

Reviewer comment (Contributor):
I don't like the auto coercing of the strings -> ints. This is not very idiomatic and unexpected. I would leave the columns as strings.

@erikcs (Contributor, Author) replied Dec 4, 2016:

Fixed, but regarding a character that separates the stub name from the variable part:

In [7]: df = pd.DataFrame({ 'A.2010': np.random.rand(N),
   ...:                     'A.2011': np.random.rand(N),
   ...:                     'B.2010': np.random.rand(N),
   ...:                     'B.2011': np.random.rand(N),
   ...:                      'X' : np.random.randint(N, size=N),
   ...:     })
   ...:
   ...: df
   ...:
Out[7]:
     A.2010    A.2011    B.2010    B.2011  X
0  0.873404  0.467946  0.569808  0.358077  1
1  0.780154  0.554582  0.668437  0.810530  1
2  0.884003  0.555784  0.246305  0.038423  2
In [8]: df['id'] = df.index
   ...: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year')
   ...:
Out[8]:
         X        A.        B.
id year
0  2010  1  0.873404  0.569808
1  2010  1  0.780154  0.668437
2  2010  2  0.884003  0.246305
0  2011  1  0.467946  0.358077
1  2011  1  0.554582  0.810530
2  2011  2  0.555784  0.038423

A user might expect the new separating character (.) to be stripped, like reshape in R does.

@jreback (Contributor) commented Dec 4, 2016

needs a whatsnew entry (0.20.)

@jreback (Contributor) commented Dec 4, 2016

you could add an argument to specify the split (or make it take a regex)

yes it should get stripped

@@ -88,6 +88,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved performance of ``wide_to_long`` (:issue:`14779`)
Reviewer comment (Contributor):
pd.wide_to_long()

@@ -875,7 +875,7 @@ def lreshape(data, groups, dropna=True, label=None):
return DataFrame(mdata, columns=id_cols + pivot_cols)


def wide_to_long(df, stubnames, i, j):
def wide_to_long(df, stubnames, i, j, sep=""):
Reviewer comment (Contributor):
maybe make it sep='\s+' whitespace?

@erikcs (Contributor, Author) replied:

hmm strange, rstrip doesn't seem to recognise that?

In [13]: 'A (quarterly) '.rstrip('\s+')
Out[13]: 'A (quarterly) '
In [14]: 'A (quarterly) '.rstrip(" ")
Out[14]: 'A (quarterly)'
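For reference, str.rstrip treats its argument as a set of individual characters to remove, not a regex, which explains the behaviour above. A pattern-based strip needs re.sub anchored at the end:

```python
import re

s = 'A (quarterly) '

# rstrip('\\s+') removes trailing '\\', 's', and '+' characters; none of
# those appear at the end of s, so the string is unchanged.
assert s.rstrip(r'\s+') == 'A (quarterly) '

# To strip by regex pattern, anchor the pattern at the end with re.sub.
assert re.sub(r'\s+$', '', s) == 'A (quarterly)'
```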

@@ -890,8 +890,9 @@ def wide_to_long(df, stubnames, i, j):
The name of the id variable.
j : str
The name of the subobservation variable.
stubend : str
Regex to match for the end of the stubs.
sep : str, optional
Reviewer comment (Contributor):

specify what the default is

exp_frame = exp_frame.set_index(['id', 'year'])[["X", "A", "B"]]
long_frame = wide_to_long(df, ['A', 'B'], 'id', 'year')
tm.assert_frame_equal(long_frame, exp_frame)

Reviewer comment (Contributor):

can you add some tests with sep (and maybe some that have an invalid sep)?

@erikcs (Contributor, Author) replied:

What input were you thinking of?

if a nonsense separator is passed nothing is stripped:

In [15]: df = pd.DataFrame({'A.2010': np.random.rand(3),
    ...:                    'A.2011': np.random.rand(3),
    ...:                    'B.2010': np.random.rand(3),
    ...:                    'B.2011': np.random.rand(3),
    ...:                    'X' : np.random.randint(3, size=3)})
    ...: df['id'] = df.index
    ...: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year', sep="nope")
    ...:
Out[15]:
         X        A.        B.
id year
0  2010  2  0.330193  0.728615
1  2010  0  0.710791  0.601923
2  2010  1  0.066218  0.618455
0  2011  2  0.597949  0.324131
1  2011  0  0.024911  0.968051
2  2011  1  0.310596  0.866798
In [16]: pd.wide_to_long(df, ['A.', 'B.'], i='id', j='year', sep=",,")
Out[16]:
         X        A.        B.
id year
0  2010  2  0.330193  0.728615
1  2010  0  0.710791  0.601923
2  2010  1  0.066218  0.618455
0  2011  2  0.597949  0.324131
1  2011  0  0.024911  0.968051
2  2011  1  0.310596  0.866798

@jreback (Contributor) commented Dec 4, 2016

further, not sure if http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt
needs any updates?

@jreback (Contributor) commented Dec 5, 2016

that's a regex

@erikcs (Contributor, Author) commented Dec 5, 2016

Sorry, I didn't get that? And melt looks pretty rock solid.

@jreback (Contributor) commented Dec 5, 2016

@Nuffe I was referring to wide_to_long in the docs (e.g. is the example ok).

you are splitting on the sep character, but it could also be a regex, so doing something like

In [7]: re.split('\s+','A  2010')
Out[7]: ['A', '2010']

In [8]: re.split('\s+','A 2010')
Out[8]: ['A', '2010']

is probably reasonable

@erikcs (Contributor, Author) commented Dec 5, 2016

To keep things simple I propose we break the API of wide_to_long and change the signature to adhere more to how R's reshape does this.

wide_to_long(df, varying, i, j, sep=' ')

I.e. the user passes the names of the time-varying columns as a list varying, and each name x is expected (checked) to adhere to the following rules (more or less what reshape assumes):

- based on sep, x is split into exactly two strings
- sep (a single character) is constrained to be non-alphanumeric: \s, ., ;, etc.
- the first part is the 'stubname', the second part is the 'time' part

In [3]: df = pd.DataFrame({"A 1970" : {0 : "a", 1 : "b", 2 : "c"},
   ...:                     "A 1980" : {0 : "d", 1 : "e", 2 : "f"},
   ...:                     "B 1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
   ...:                     "B 1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
   ...:                     "X"     : dict(zip(range(3), np.random.randn(3)))
   ...:                    })
   ...: df['id'] = df.index
   ...: df
   ...:
Out[3]:
  A 1970 A 1980  B 1970  B 1980         X  id
0      a      d     2.5     3.2  0.136953   0
1      b      e     1.2     1.3 -1.238109   1
2      c      f     0.7     0.1  1.249809   2
In [4]: varying = ['A 1970', 'A 1980', 'B 1970', 'B 1980']
   ...: pd.wide_to_long(df, varying, i='id', j='year', sep=' ')
   ...:
Out[4]:
                X  A    B
id year
0  1970  0.136953  a  2.5
1  1970 -1.238109  b  1.2
2  1970  1.249809  c  0.7
0  1980  0.136953  d  3.2
1  1980 -1.238109  e  1.3
2  1980  1.249809  f  0.1

The user can easily construct the varying list with a regex, a doc example can show this.

If the existing columns do not adhere to the above specification, they need to be changed to a suitable format first. A doc example can show how this can easily be done with a regex with a backreference.
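A hypothetical version of both doc examples (column names and patterns are made up for illustration): rename with a backreference, then build the varying list with a regex.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 6),
                  columns=['A1970', 'A1980', 'B1970', 'B1980', 'X', 'id'])

# Insert the separator '.' between the captured stub and numeric suffix.
df.columns = df.columns.str.replace(r'^([AB])(\d+)$', r'\1.\2', regex=True)

# Build the varying list by matching '<stub>.<4-digit year>'.
varying = df.columns[df.columns.str.match(r'^[AB]\.\d{4}$')].tolist()

assert varying == ['A.1970', 'A.1980', 'B.1970', 'B.1980']
```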

What do you think?

@jreback (Contributor) commented Dec 5, 2016

the varying should be a list of tuples
space separated elements in lists are not pythonic

but otherwise looks ok

can this be backward compat?

@erikcs (Contributor, Author) commented Dec 5, 2016

I didn't understand the first comment: varying is just the names of all columns that should be varying. For a sample dataframe it could be df.iloc[:, 4:11].columns.tolist(). That the names happen to be space separated is just how the data columns ended up looking in messy real-world data.

And I do not think this can be made backward compat because the varying argument would be different, stubnames would now be computed inside the function.

The old doc example, where there is no single-character separator, e.g. varying = ['A1970', 'A1980', 'B1970', 'B1980'], will only work by first converting the column names to the allowed format with

df.columns.str.replace('([A-B])', '\\1.')
Index([u'A.1970', u'A.1980', u'B.1970', u'B.1980', u'X', u'id'], dtype='object')

then calling wide_to_long with varying = ['A.1970', 'A.1980', 'B.1970', 'B.1980']

I do not know if this really is considered too unwieldy? R's reshape interface is perhaps not the most user friendly, but with plenty of doc examples its flexibility could be appreciated.

The original function author seems to have tried to mimic Stata's reshape, which essentially only takes the stubnames as argument. The problem is that in Stata column names are highly constrained (they cannot, for example, contain whitespace or non-alphanumeric characters), while in pandas they can be any utf8 string, which makes it much harder to generalize.

So if we want to preserve the original author's intention, where the user only supplies stubnames as in Stata, we need to impose some strict assumptions on the column names passed: the only kind of (varying) column names we can accept are of the type PrefixPostfix, where Prefix and Postfix are alphanumeric. These are the only ones Stata's reshape needs to consider, and what the original wide_to_long function implicitly assumes. We can also handle PrefixSepPostfix, where Sep is a single separating character.

Perhaps it is just better to make this implicit assumption explicit and keep its "Stata like" interface? And make it robust to this specification (PrefixSepPostfix), because the master branch function breaks with plenty of variations of it.

(sorry for the messiness here, but I ended up spending some time familiarizing myself with R's less user-friendly approach to this problem and Stata's more user-friendly but less flexible approach)

@erikcs (Contributor, Author) commented Dec 5, 2016

So here is an attempt to make the original interface more robust. These two examples fail on the master branch, but should produce the correct results shown below:

In [12]: df = pd.DataFrame({
    ...:         'A11': ['a11', 'a22', 'a33'],
    ...:         'A12': ['a21', 'a22', 'a23'],
    ...:         'B11': ['b11', 'b12', 'b13'],
    ...:         'B12': ['b21', 'b22', 'b23'],
    ...:         'BB11': [1, 2, 3],
    ...:         'BB12': [4, 5, 6],
    ...:         'BBBX' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[12]:
   A11  A12  B11  B12  BB11  BB12  BBBX  BBBZ  id
0  a11  a21  b11  b21     1     4    91    91   0
1  a22  a22  b12  b22     2     5    92    92   1
2  a33  a23  b13  b23     3     6    93    93   2

In [13]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year')
Out[13]:
         BBBX  BBBZ    A    B  BB
id year
0  11      91    91  a11  b11   1
1  11      92    92  a22  b12   2
2  11      93    93  a33  b13   3
0  12      91    91  a21  b21   4
1  12      92    92  a22  b22   5
2  12      93    93  a23  b23   6
In [14]: df = pd.DataFrame({
    ...:         'A(quarterly)2011': ['a11', 'a22', 'a33'],
    ...:         'A(quarterly)2012': ['a21', 'a22', 'a23'],
    ...:         'B(quarterly)2011': ['b11', 'b12', 'b13'],
    ...:         'B(quarterly)2012': ['b21', 'b22', 'b23'],
    ...:         'BB(quarterly)2011': [1, 2, 3],
    ...:         'BB(quarterly)2012': [4, 5, 6],
    ...:         'BBBX' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[14]:
  A(quarterly)2011 A(quarterly)2012 B(quarterly)2011 B(quarterly)2012  \
0              a11              a21              b11              b21
1              a22              a22              b12              b22
2              a33              a23              b13              b23

   BB(quarterly)2011  BB(quarterly)2012  BBBX  BBBZ  id
0                  1                  4    91    91   0
1                  2                  5    92    92   1
2                  3                  6    93    93   2

In [15]: pd.wide_to_long(df, ['A(quarterly)', 'B(quarterly)', 'BB(quarterly)'], i='id', j='year')
Out[15]:
         BBBX  BBBZ A(quarterly) B(quarterly)  BB(quarterly)
id year
0  2011    91    91          a11          b11              1
1  2011    92    92          a22          b12              2
2  2011    93    93          a33          b13              3
0  2012    91    91          a21          b21              4
1  2012    92    92          a22          b22              5
2  2012    93    93          a23          b23              6

The first one fails because the regex confuses the shared substrings in the id_vars and value_vars, the second one because of the parentheses.

Assuming a Prefix(Optional Sep)Postfix structure on the "time" variables, I tried to make it robust:

In [16]: df = pd.DataFrame({
    ...:         'A11': ['a11', 'a22', 'a33'],
    ...:         'A12': ['a21', 'a22', 'a23'],
    ...:         'B11': ['b11', 'b12', 'b13'],
    ...:         'B12': ['b21', 'b22', 'b23'],
    ...:         'BB11': [1, 2, 3],
    ...:         'BB12': [4, 5, 6],
    ...:         'Acat' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[16]:
   A11  A12  Acat  B11  B12  BB11  BB12  BBBZ  id
0  a11  a21    91  b11  b21     1     4    91   0
1  a22  a22    92  b12  b22     2     5    92   1
2  a33  a23    93  b13  b23     3     6    93   2

raises a ValueError: ('Ambiguous names: ', ['A11', 'A12', 'Acat']) .

While the following works

In [18]: df = pd.DataFrame({
    ...:         'A-11': ['a11', 'a22', 'a33'],
    ...:         'A-12': ['a21', 'a22', 'a23'],
    ...:         'B-11': ['b11', 'b12', 'b13'],
    ...:         'B-12': ['b21', 'b22', 'b23'],
    ...:         'BB-11': [1, 2, 3],
    ...:         'BB-12': [4, 5, 6],
    ...:         'Acat' : [91, 92, 93],
    ...:         'BBBZ' : [91, 92, 93]
    ...:     })
    ...: df['id'] = df.index
    ...: df
    ...:
Out[18]:
  A-11 A-12  Acat B-11 B-12  BB-11  BB-12  BBBZ  id
0  a11  a21    91  b11  b21      1      4    91   0
1  a22  a22    92  b12  b22      2      5    92   1
2  a33  a23    93  b13  b23      3      6    93   2

In [19]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year', sep='-')
Out[19]:
         Acat  BBBZ    A    B  BB
id year
0  11      91    91  a11  b11   1
1  11      92    92  a22  b12   2
2  11      93    93  a33  b13   3
0  12      91    91  a21  b21   4
1  12      92    92  a22  b22   5
2  12      93    93  a23  b23   6

@jreback (Contributor) commented Dec 5, 2016

cc @jseabold any thought here.

@Nuffe ideally we want to make this back-compat (you can do introspection in the code to figure out what you are passed and such). And have as simple an API as possible.

@erikcs (Contributor, Author) commented Dec 6, 2016

I have maintained the user-friendly (and evidently Stata-inspired) interface (and stated what structure this function assumes on the column names), and tried to fix the mistakes that arise with various "pathological" inputs, for example stubnames that contain groups sharing a substring (which I discovered when I tried different examples in Stata and compared them to wide_to_long).

Notes
-----
All extra variables are treated as extra id variables. This simply uses
`pandas.melt` under the hood, but is hard-coded to "do the right thing"
in a typical case.
"""
# For robustness, escape every user input string we use in a regex
import re
Reviewer comment (Contributor):

can be imported at the top of the file

# For ex. AA2011, AA2012, AAkitten have inconsistent postfix
for k, vars in enumerate(value_vars):
    stripped = map(lambda x: x.replace(stubs[k], ""), vars)
    is_digit = [s.isdigit() for s in stripped]
Reviewer comment (Contributor):

you have tests for this?

@erikcs (Contributor, Author) replied Dec 7, 2016:

considering the comment below on not using a regex to find the id_vars: perhaps just formulate a consistency check and warn the user if, for example, an inferred value_var has mixed types?

For example: at the end, check whether the new data frame's 'j' index contains both ints and strings and warn about this? If the stubnames supplied are ['AA2011', 'AA2012'] and df contains a column named Acat, the new dataframe's j column will have levels 2011, 2012, cat. Likewise, if stubnames contains ['CatOne', 'CatTwo'] and df has a column named Cat3000, the new j index will have levels One, Two, 3000.

The only way to disambiguate the first case is to take an optional stubendtype parameter denoting the stubends are numbers. The second case is not possible to disambiguate (tried in Stata)
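The proposed consistency check could be sketched like this (hypothetical, not what was merged): flag the suffix sets that mix numeric and non-numeric values.

```python
import warnings

import pandas as pd

# Hypothetical check: the inferred 'j' suffixes from the Acat example above
# mix numeric ('2011', '2012') and non-numeric ('cat') values.
suffixes = pd.Index(['2011', '2012', 'cat'])
numeric = suffixes.str.isdigit()

if numeric.any() and not numeric.all():
    warnings.warn("wide_to_long: inferred suffixes mix numeric and "
                  "non-numeric values")
```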

# two resulting value_vars lists
if len(value_vars_flattened + id_vars) != len(df.columns):
    value_vars_augmented = map(lambda x: get_var_names(
        df, "^{0}".format(re.escape(x))), stubnames)
Reviewer comment (Contributor):

this looks fragile. I would just raise here

@erikcs (Contributor, Author) replied Dec 6, 2016:

Or instead of doing a search for the id_vars in the first place, would it not be simpler just to do:

id_vars = set(df.columns.tolist()).difference(value_vars_flattened)?

(then do some consistency checks)
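That suggestion in runnable form (column names are illustrative); sorting keeps the result deterministic, since sets are unordered:

```python
import pandas as pd

# Everything that is not a melted value column is an id column.
df = pd.DataFrame(columns=['A11', 'A12', 'B11', 'B12', 'X', 'id'])
value_vars_flattened = ['A11', 'A12', 'B11', 'B12']

id_vars = sorted(set(df.columns).difference(value_vars_flattened))
assert id_vars == ['X', 'id']
```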

# This regex is needed to avoid multiple "greedy" matches with stubs
# that have overlapping substrings
# (for example A2011, A2012 are separate from AA2011, AA2012)
value_vars = list(map(lambda x: get_var_names(
Reviewer comment (Contributor):

ideally you would just look for a match of a letter followed by a non-letter (or vice versa), I think that is more robust.

@erikcs (Contributor, Author) replied Dec 6, 2016:

But in the case of string stems, the three groups here will not be captured:

Aone, Atwo, Bone, Btwo, BBone, BBtwo

A negative lookahead ^B(?!B) could be more robust? I.e. the regex would be "^{0}(?!{1})".format(re.escape(x), x[-1]). That one would capture the three groups here and ignore, for example, BBBrating.
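The lookahead idea, checked against the example groups above:

```python
import re

columns = ['Aone', 'Atwo', 'Bone', 'Btwo', 'BBone', 'BBtwo', 'BBBrating']

# '^B(?!B)' matches the 'B' stub only when no further 'B' follows, so the
# 'B' and 'BB' groups stay separate, and 'BBBrating' is excluded from both.
b_only = [c for c in columns if re.match(r'^B(?!B)', c)]
bb_only = [c for c in columns if re.match(r'^BB(?!B)', c)]

assert b_only == ['Bone', 'Btwo']
assert bb_only == ['BBone', 'BBtwo']
```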

@erikcs (Contributor, Author) commented Dec 8, 2016

I found another "Stata like" use case wide_to_long doesn't handle: if the data frame does not have a single id column that identifies the wide variables, Stata users can supply the necessary column names, e.g. ['id1', 'id2']. Currently the user would have to generate a new column that identifies the ['id1', 'id2'] combination and pass that as id.

(I am going to add the option of supplying a list of 'id' variables; it will require another short rewrite, since I have to move from join to merge to handle the new multilevel index)

@erikcs (Contributor, Author) commented Dec 9, 2016

Sometimes AppVeyor/Travis fails with unrelated tests (like test_bar_log_subplots right now). Any hint on what I should do here? I've looked at the failing cases (unrelated plot methods), and there doesn't seem to be any mutated state, e.g. a changed random number seed (tests pass on my OS X laptop in both Python 2 and Python 3 virtual envs). Thanks

@jreback (Contributor) commented Dec 10, 2016

can you rebase? problem with AppVeyor which I just fixed

Commits:
- Speed up by avoiding big copies, and regex on categorical column
- Add functionality to deal with "pathological" input
- Add docstring examples and more test cases
in the wide format, to be stripped from the names in the long format.
For example, if your column names are A-suffix1, A-suffix2, you
can strip the hyphen by specifying `sep`='-'
numeric_suffix : bool, default True
Reviewer comment (Contributor):

I would rather call this suffix='\d+', IOW use a regex to match this, no?

@erikcs (Contributor, Author) replied:

Yes, that makes more sense

Going from long back to wide just takes some creative use of `unstack`

>>> w = l.reset_index().set_index(['famid', 'birth', 'age']).unstack()
>>> w.columns = [name + suffix for name, suffix in w.columns.tolist()]
Reviewer comment (Contributor):

use this:

In [28]: Index(w.columns).str.join('')
Out[28]: Index(['ht1', 'ht2'], dtype='object')
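On recent pandas versions the same flattening can be done with map over the MultiIndex columns (the .str accessor no longer applies to a MultiIndex); a minimal sketch:

```python
import pandas as pd

# Two-level columns like those produced by unstack.
w = pd.DataFrame({('ht', '1'): [2.8], ('ht', '2'): [3.4]})

# Join each (name, suffix) tuple into a flat label.
w.columns = w.columns.map(''.join)

assert list(w.columns) == ['ht1', 'ht2']
```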

if any(map(lambda s: s in df.columns.tolist(), stubnames)):
    raise ValueError("stubname can't be identical to a column name")

if not isinstance(stubnames, list):
Reviewer comment (Contributor):

we usually use is_list_like, IOW, you can pass a non-string iterable, can you update the doc-string as well

@@ -716,6 +716,204 @@ def test_stubs(self):

self.assertEqual(stubs, ['inc', 'edu'])

def test_separating_character(self):
    np.random.seed(123)
Reviewer comment (Contributor):
can you add this issue number as a comment

@jreback jreback added this to the 0.20.0 milestone Dec 11, 2016
@jreback (Contributor) commented Dec 11, 2016

lgtm. some minor comments.

@jorisvandenbossche

Commits:
- Use is_list_like
- Add GH ticket #
@erikcs (Contributor, Author) commented Dec 11, 2016

@jreback another minor issue: Sphinx doesn't print the \ in the docstring suffix section. The only way I managed to get it printed was to set the entire docstring to a raw string literal and escape the backslash: '\\d+'

@jorisvandenbossche (Member) commented:

> set the entire docstring to a raw string literal

You can indeed do that, but normally then the escaping should not be needed
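Indeed, with a raw docstring no extra escaping is needed; a minimal check (the function name is hypothetical):

```python
def wide_to_long_stub(suffix=r'\d+'):
    r"""Hypothetical stub: a raw docstring keeps the backslash in ``\d+``
    intact for Sphinx to render."""
    return suffix

# The literal backslash survives in both the docstring and the default.
assert r'\d+' in wide_to_long_stub.__doc__
assert wide_to_long_stub() == r'\d+'
```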

@jorisvandenbossche (Member) left a review comment:

I didn't follow the full discussion above, but there was some talk about backwards compatibility. What is the conclusion on that? Is the last version back compat or are there changes in behaviour?


def setup(self):
vars = 'ABCD'
nyrs = 20
Reviewer comment (Member):

Can you fix up the indentation here?

idobs = dict(zip(range(nidvars), np.random.rand(nidvars, N)))

self.df = pd.concat([pd.DataFrame(idobs), pd.DataFrame(yearobs)],
                    axis=1)
Reviewer comment (Member):

I think you can also do something like DataFrame(np.random.randn(N, nidvars + len(yrvars)), columns=list(range(nidvars)) + yrvars) to make it a bit simpler

@erikcs (Contributor, Author) commented Dec 11, 2016

@jorisvandenbossche Yes, this version is back compat. The PR got a bit lengthy because I did more than I anticipated (it was originally a simple PR for a quick speed improvement, but I discovered afterwards that there were several use cases the original function couldn't handle).

@jreback jreback closed this in 86233e1 Dec 13, 2016
@jreback (Contributor) commented Dec 13, 2016

thanks @Nuffe very nice PR, and you were very responsive!

if you want to tackle other issues would be much appreciated!

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
closes pandas-dev#14778

Please see pandas-dev#14778 for details. I make wide_to_long a bit faster (avoid slow
regex search on long columns by first converting to Categorical, avoid
melting all dataframes with all the id variables, and wait with trying
to convert the "time" variable to `int` until last), and clear up the
docstring.

Author: nuffe <erik.cfr@gmail.com>

Closes pandas-dev#14779 from nuffe/wide2longfix and squashes the following commits:

df1edf8 [nuffe] asv_bench: fix indentation and simplify
dc13064 [nuffe] Set docstring to raw literal to allow backslashes to be printed (still had to escape them)
295d1e6 [nuffe] Use pd.Index in doc example
1c49291 [nuffe] Can of course get rid negative lookahead now that suffix is a regex
54c5920 [nuffe] Specify the suffix with a regex
5747a25 [nuffe] ENH/DOC: wide_to_long performance and functionality improvements (pandas-dev#14779)
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Successfully merging this pull request may close these issues.

ENH/DOC: wide_to_long performance and docstring clarification
5 participants