Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: set_value / get_value #15269

Closed
jreback opened this issue Jan 30, 2017 · 18 comments · Fixed by #17739
Closed

DEPR: set_value / get_value #15269

jreback opened this issue Jan 30, 2017 · 18 comments · Fixed by #17739
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Jan 30, 2017

we already have too many public indexers.....

@jreback jreback added Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Jan 30, 2017
@jreback jreback added this to the 0.20.0 milestone Jan 30, 2017
@bluss
Copy link

bluss commented Jan 31, 2017

Is there a fast replacement for set_value? Using loc in its place seems to be slow.

In [328]: columns = list("abcdef")
In [329]: dx = 0.01; xs = np.arange(0, 1, step=dx);
In [330]: df = pd.DataFrame(index=xs)
In [331]: %%timeit
     ...: for x in xs:
     ...:   for c in columns:
     ...:     df.set_value(x, c, 1)
     ...: 
The slowest run took 8.40 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.24 ms per loop

In [333]: df = pd.DataFrame(index=xs)
In [334]: %%timeit
     ...: for x in xs:
     ...:   for c in columns:
     ...:     df.loc[x, c] = 1
     ...: 
10 loops, best of 3: 96.4 ms per loop

@jorisvandenbossche
Copy link
Member

In your example, df.at[x, c] = 1 is also faster than loc

@jorisvandenbossche
Copy link
Member

Sparked by #15268

Personally, I never use those functions, so would also not regret them being gone (and would welcome the cleaning up of namespace).

But, are there genuine cases where these methods can be useful? (compared to the other indexing methods) Why were they added in the first place?

On StackOverflow, it seems mainly mentioned for its speed (eg http://stackoverflow.com/questions/13842088/set-value-for-particular-cell-in-pandas-dataframe/24517695#24517695)

cc @pandas-dev/pandas-core

@jreback
Copy link
Contributor Author

jreback commented Jan 31, 2017

IIRC these were always there :> (e.g. way before .iloc/.loc even).

replacements are well-supported .iat/.at. (which internally do almost what these functions do, they should be performant, yet support all of the pandas types). These routines are 'raw' and don't do any validation / checking.

These are not idiomatic (e.g. calling functions to set/get values). And confusing to beginners.

@wesm
Copy link
Member

wesm commented Jan 31, 2017

Once upon a time, I spent a lot of time making get_value and set_value extremely fast. Over time the performance has degraded significantly.

In the course of rewriting the scalar value access code paths for pandas 2, things will get fast again, so I'm not sure how to proceed given that will occur at some point in the future

@bluss
Copy link

bluss commented Jan 31, 2017

.at is indeed fast, but still slows down that loop example by 2-3x. .at is barely documented, too.

@jreback
Copy link
Contributor Author

jreback commented Jan 31, 2017

@bluss

you know you can just do this right:

In [7]: %timeit pd.DataFrame(1.0, index=xs, columns=list('abcde'))
10000 loops, best of 3: 165 µs per loop

@bluss
Copy link

bluss commented Jan 31, 2017

Yes, the actual use case does not write just the same value, but a value specific to the current index and column. I've noticed that if I do it wrong, building the DataFrame overshadows all other computation, but with .set_value it is fine (and .at too, slower but not dominating).

@dov
Copy link

dov commented Aug 15, 2017

I was refered to this issue from #17256, where chris-b1 suggested that I use loc instead of set_value. I thought that I would share a small test that I made in building a dataframe through various methods. Indeed I switched from loc to set_value because of a performance problem with loc. Consider the following test program that shows the huge difference in efficiency:

import pandas as pd
import numpy as numpy
pd.options.display.float_format = '{:,.0f}'.format
import time

df = pd.DataFrame(numpy.random.rand(1000,100)*100)
df.loc[:,'A'] = None
df.loc[:,'B'] = None
df.loc[:,'C'] = None

t0 = time.time()
for idx,row in df.iterrows():
  row.loc[('A','B','C')] = (100+idx,200+idx,300+idx)
  df.loc[idx] = row
print 'First:  ', time.time()-t0

t0 = time.time()
for idx,row in df.iterrows():
  row.loc[('A','B','C')] = (100+idx,200+idx,300+idx)
print 'Second: ', time.time()-t0

t0 = time.time()
for idx,row in df.iterrows():
  df.set_value(idx,'A', 100+idx)
  df.set_value(idx,'B', 200+idx)
  df.set_value(idx,'C', 300+idx)
print 'Third:  ', time.time()-t0

t0 = time.time()
for idx,row in df.iterrows():
  df.loc[idx,'A'] = 100+idx
  df.loc[idx,'B'] = 200+idx
  df.loc[idx,'C'] = 300+idx
print 'Fourth: ', time.time()-t0

t0 = time.time()
for idx,row in df.iterrows():
  df.loc[idx,('A','B','C')] = (100+idx,200+idx,300+idx)
print 'Fifth:  ', time.time()-t0

On my home box this gives the output:

First:   13.1010141373
Second:  0.216350078583
Third:   0.0418322086334
Fourth:  0.731967926025
Fifth:   0.692142963409

Using set_value in the third example is more than 15 times faster than using loc in the forth and the fifth exmple! In a real world problem I reduced my run time from 54 to 6 seconds, just by doing these kind of changes. Needless to say, until these problems are solved, will I continue to use set_value.

@jreback
Copy link
Contributor Author

jreback commented Aug 15, 2017

@dov you should simply use .at and .iat. .set_value / .get_value are going to be deprecated. Iteratively setting values is an anti-pattern.

@dov
Copy link

dov commented Aug 16, 2017

Thanks @jreback, I missed the .at accessor. There are just too many different options! Adding the following to my above script:

t0 = time.time()
for idx,row in df.iterrows():
  df.at[idx,'A'] = 100+idx
  df.at[idx,'B'] = 200+idx
  df.at[idx,'C'] = 300+idx
print 'Sixth: ', time.time()-t0

AIdx = len(df.columns)-3
BIdx = len(df.columns)-2
CIdx = len(df.columns)-1

t0 = time.time()
for idx,row in df.iterrows():
  df.iat[idx,AIdx] = 100+idx
  df.iat[idx,BIdx] = 200+idx
  df.iat[idx,CIdx] = 300+idx
print 'Seventh: ', time.time()-t0

gives the additional output

Sixth:   0.0538790225983
Seventh: 0.0564727783203

I.e. it is "only" about 20% slower than the set_value() calls. I can live with that, but I still think it is strange to deprecate the fastest option.

Regarding the anti-pattern of the loop, I agree. But it can just be considered as a stress test run a thousand times. On the other hand I often do something non-pandas related (e.g. image processing) and you just want to store the result in an existing dataframe.

@jreback
Copy link
Contributor Author

jreback commented Aug 16, 2017

.ie. it is "only" about 20% slower than the set_value() calls. I can live with that, but I still think it is strange to deprecate the fastest option.

@dov well you can have correct, or slightly faster. I would always take correct.

(e.g. image processing)

this most certainly is an anti-pattern, there are vectorized methods in other libraries.

jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 5, 2017
jreback added a commit that referenced this issue Oct 5, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
@PointyShinyBurning
Copy link

What's the right way to set a single value in a method chain given this deprecation?

There's got to be something faster/more readable than:

%timeit df.set_value(1, "a", 1).mean()
__main__:1: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
68.6 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit df.assign(a=lambda f: f.a.mask(f.a.index==1,1)).mean()
632 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@jreback
Copy link
Contributor Author

jreback commented Mar 21, 2018

the deprecate warning is pretty explicit

@PointyShinyBurning
Copy link

Yes, but .at[1, "a"] = 1 returns None, so I have to break the method chain and give the intermediate variable a name. I have a lot of code that's patterned like https://tomaugspurger.github.io/method-chaining.html which I thought was a style pandas aimed to support?

I can roll my own but still losing speed/readability quite a bit:

def set_value(df, index, col, val):
    new_df = df.copy()
    df.at[index, col] = val
    return new_df


%timeit df.pipe(set_value, 1, "a", 1).mean()
160 µs ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@jreback
Copy link
Contributor Author

jreback commented Mar 21, 2018

you are missing the point
setting a single item at a time is non performant and an anti-pattern

sure you can do it but in a
method chain doesn’t make any sense

@PointyShinyBurning
Copy link

I understand that it's an anti-pattern to construct a dataframe by doing it 2000 times. I admit I don't really understand why it's wrong to correct, say, the first value in my data because the instrument it comes from has a warm up period?

(there also isn't a good pattern to set on a slice in a method chain either, of course, but that's a separate issue from removing functionality that already exists)

@LinuxIsCool
Copy link

@PointyShinyBurning has a really good point and I think this issue should be re-opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants