DEPR: set_value / get_value #15269
Is there a fast replacement for
In [328]: columns = list("abcdef")
In [329]: dx = 0.01; xs = np.arange(0, 1, step=dx);
In [330]: df = pd.DataFrame(index=xs)
In [331]: %%timeit
     ...: for x in xs:
     ...:     for c in columns:
     ...:         df.set_value(x, c, 1)
     ...:
The slowest run took 8.40 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.24 ms per loop
In [333]: df = pd.DataFrame(index=xs)
In [334]: %%timeit
     ...: for x in xs:
     ...:     for c in columns:
     ...:         df.loc[x, c] = 1
     ...:
10 loops, best of 3: 96.4 ms per loop
In your example, |
Sparked by #15268

Personally, I never use those functions, so would also not regret them being gone (and would welcome the cleaning up of namespace).

But, are there genuine cases where these methods can be useful (compared to the other indexing methods)? Why were they added in the first place? On StackOverflow, it seems mainly mentioned for its speed (e.g. http://stackoverflow.com/questions/13842088/set-value-for-particular-cell-in-pandas-dataframe/24517695#24517695)

cc @pandas-dev/pandas-core
IIRC these were always there :> (e.g. way before the replacements were well-supported). These are not idiomatic (e.g. calling functions to set/get values) and are confusing to beginners.
Once upon a time, I spent a lot of time making get_value and set_value extremely fast. Over time the performance has degraded significantly. In the course of rewriting the scalar value access code paths for pandas 2, things will get fast again, so I'm not sure how to proceed given that will occur at some point in the future.
you know you can just do this right:
Yes, the actual use case does not write just the same value, but a value specific to the current index and column. I've noticed that if I do it wrong, building the DataFrame overshadows all other computation, but with .set_value it is fine (and .at too, slower but not dominating).
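The per-cell use case described above can be sketched in two ways: setting scalars with `.at` on a pre-allocated frame, or computing into a plain NumPy array and building the DataFrame once. This is only an illustration under stated assumptions: `cell_value` is a hypothetical stand-in for the real per-cell computation, and no timings are claimed here.

```python
import numpy as np
import pandas as pd

columns = list("abcdef")
xs = np.arange(0, 1, step=0.01)


def cell_value(x, c):
    # Hypothetical per-cell computation (assumption, not from the thread).
    return x * (columns.index(c) + 1)


# Option 1: pre-allocate the frame, then set one scalar at a time with .at.
df_at = pd.DataFrame(np.nan, index=xs, columns=columns)
for x in xs:
    for c in columns:
        df_at.at[x, c] = cell_value(x, c)

# Option 2: fill a plain NumPy array and construct the DataFrame once,
# so no pandas indexing happens inside the loop.
values = np.empty((len(xs), len(columns)))
for i, x in enumerate(xs):
    for j, c in enumerate(columns):
        values[i, j] = cell_value(x, c)
df_bulk = pd.DataFrame(values, index=xs, columns=columns)
```

Both options produce the same frame; the second keeps the loop entirely outside pandas, which may matter when the frame is only a container for results computed elsewhere.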
I was referred to this issue from #17256, where chris-b1 suggested that I use
import pandas as pd
import numpy as numpy
pd.options.display.float_format = '{:,.0f}'.format
import time
df = pd.DataFrame(numpy.random.rand(1000,100)*100)
df.loc[:,'A'] = None
df.loc[:,'B'] = None
df.loc[:,'C'] = None
t0 = time.time()
for idx,row in df.iterrows():
    row.loc[('A','B','C')] = (100+idx,200+idx,300+idx)
    df.loc[idx] = row
print 'First: ', time.time()-t0
t0 = time.time()
for idx,row in df.iterrows():
    row.loc[('A','B','C')] = (100+idx,200+idx,300+idx)
print 'Second: ', time.time()-t0
t0 = time.time()
for idx,row in df.iterrows():
    df.set_value(idx,'A', 100+idx)
    df.set_value(idx,'B', 200+idx)
    df.set_value(idx,'C', 300+idx)
print 'Third: ', time.time()-t0
t0 = time.time()
for idx,row in df.iterrows():
    df.loc[idx,'A'] = 100+idx
    df.loc[idx,'B'] = 200+idx
    df.loc[idx,'C'] = 300+idx
print 'Fourth: ', time.time()-t0
t0 = time.time()
for idx,row in df.iterrows():
    df.loc[idx,('A','B','C')] = (100+idx,200+idx,300+idx)
print 'Fifth: ', time.time()-t0
On my home box this gives the output:
Using |
@dov you should simply use |
Thanks @jreback, I missed the
t0 = time.time()
for idx,row in df.iterrows():
    df.at[idx,'A'] = 100+idx
    df.at[idx,'B'] = 200+idx
    df.at[idx,'C'] = 300+idx
print 'Sixth: ', time.time()-t0
AIdx = len(df.columns)-3
BIdx = len(df.columns)-2
CIdx = len(df.columns)-1
t0 = time.time()
for idx,row in df.iterrows():
    df.iat[idx,AIdx] = 100+idx
    df.iat[idx,BIdx] = 200+idx
    df.iat[idx,CIdx] = 300+idx
print 'Seventh: ', time.time()-t0
gives the additional output
I.e. it is "only" about 20% slower than the
Regarding the anti-pattern of the loop, I agree. But it can just be considered a stress test run a thousand times. On the other hand, I often do something non-pandas related (e.g. image processing) and just want to store the result in an existing dataframe.
@dov well you can have correct, or slightly faster. I would always take correct.
this most certainly is an anti-pattern, there are vectorized methods in other libraries.
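For the specific benchmark above, where each of the three new columns depends only on the row index, a vectorized version needs no row loop at all. The following is a minimal sketch of that idea; it assumes the default integer index used in the benchmark and is not meant to cover the case where each value comes from an external computation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 100) * 100)

# The benchmark uses the default 0..999 integer index, so the row labels
# can be taken as a NumPy array and offset in one step per column.
idx = df.index.to_numpy()

df["A"] = 100 + idx
df["B"] = 200 + idx
df["C"] = 300 + idx
```

Each assignment writes a whole column at once, which replaces the 3000 scalar writes performed by the looped versions.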
…l, SparseSeries, SparseDataFrame closes pandas-dev#15269
…l, SparseSeries, SparseDataFrame (pandas-dev#17739) closes pandas-dev#15269
What's the right way to set a single value in a method chain given this deprecation? There's got to be something faster/more readable than:
%timeit df.set_value(1, "a", 1).mean()
__main__:1: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
68.6 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df.assign(a=lambda f: f.a.mask(f.a.index==1,1)).mean()
632 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The deprecation warning is pretty explicit.
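As the warning text says, `.at[]` and `.iat[]` are the suggested accessors. A minimal sketch of how they map onto the deprecated calls, using a small made-up frame purely for illustration:

```python
import pandas as pd

# Small illustrative frame (made-up data).
df = pd.DataFrame({"a": [0.0, 0.0, 0.0], "b": [1.0, 2.0, 3.0]})

# Label-based scalar access, in place of df.set_value(1, "a", 1.0)
# and df.get_value(1, "a").
df.at[1, "a"] = 1.0
value = df.at[1, "a"]

# Position-based scalar access: row position 2, column position 0.
df.iat[2, 0] = 5.0
```

Note that these assignments are statements rather than expressions, so unlike the `df.set_value(...).mean()` call timed above, they cannot be placed directly inside a method chain.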
Yes, I can roll my own, but I'm still losing quite a bit of speed/readability:
def set_value(df, index, col, val):
    # Work on a copy so the caller's frame is left untouched.
    new_df = df.copy()
    new_df.at[index, col] = val
    return new_df
%timeit df.pipe(set_value, 1, "a", 1).mean()
160 µs ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
you are missing the point sure you can do it but in a |
I understand that it's an anti-pattern to construct a dataframe by doing it 2000 times. I admit I don't really understand why it's wrong to correct, say, the first value in my data because the instrument it comes from has a warm-up period. (There also isn't a good pattern to set on a slice in a method chain either, of course, but that's a separate issue from removing functionality that already exists.)
@PointyShinyBurning has a really good point and I think this issue should be re-opened.
we already have too many public indexers.....