New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance for .loc and .iloc compared to .ix #6683

Closed
brownmk opened this Issue Mar 21, 2014 · 14 comments

Comments

Projects
None yet
7 participants
@brownmk

brownmk commented Mar 21, 2014

When using indices, we are encouraged to use .loc instead of .ix. Also the "SettingWithCopyWarning:" recommends us to use .loc instead. But it seems the performance of .loc and .iloc is 20-30 times slower than .ix (I am using Pandas 0.13.1)

.ix takes 4.54897093773 sec
.iloc takes 111.531260967 sec
.loc takes 92.8014230728 sec

The costs for .loc and .iloc seems too high. Thanks!
-- test code ---

!/usr/bin/env python

import pandas as pd
import numpy as np
import time

t=pd.DataFrame(data=np.zeros([100000,6]), columns=['A','B','C','D','E','F'])
start=time.time()
for i in t.index:
for j in ['A','B','C','D','E']:
t.ix[i,j]
t1=time.time()
print t1-start
for i in xrange(100000):
for j in xrange(6):
t.iloc[i,j]
t2=time.time()
print t2-t1
for i in t.index:
for j in ['A','B','C','D','E']:
t.loc[i,j]
t3=time.time()
print t3-t2

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 21, 2014

Contributor

this is on the order a few function calls the difference validating the input

what exactly are you trying to do?

If you are actually iterating thru the elements, then all of these approaches are wrong.

The indexers

Contributor

jreback commented Mar 21, 2014

this is on the order a few function calls the difference validating the input

what exactly are you trying to do?

If you are actually iterating thru the elements, then all of these approaches are wrong.

The indexers

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 21, 2014

Contributor
In [16]: %timeit t.ix[100,'A']
100000 loops, best of 3: 4.67 ᄉs per loop

In [17]: %timeit t.loc[100,'A']
10000 loops, best of 3: 142 ᄉs per loop

In [18]: %timeit t.iloc[100,0]
10000 loops, best of 3: 113 ᄉs per loop

If you really really want to access individual elements, then use iat/at

In [23]: %timeit t.iat[100,0]
100000 loops, best of 3: 8.8 ᄉs per loop

In [24]: %timeit t.at[100,'A']
100000 loops, best of 3: 5.8 ᄉs per loop
Contributor

jreback commented Mar 21, 2014

In [16]: %timeit t.ix[100,'A']
100000 loops, best of 3: 4.67 ᄉs per loop

In [17]: %timeit t.loc[100,'A']
10000 loops, best of 3: 142 ᄉs per loop

In [18]: %timeit t.iloc[100,0]
10000 loops, best of 3: 113 ᄉs per loop

If you really really want to access individual elements, then use iat/at

In [23]: %timeit t.iat[100,0]
100000 loops, best of 3: 8.8 ᄉs per loop

In [24]: %timeit t.at[100,'A']
100000 loops, best of 3: 5.8 ᄉs per loop
@brownmk

This comment has been minimized.

Show comment
Hide comment
@brownmk

brownmk Mar 21, 2014

Thanks! Will use .at instead. Could you explain a bit why is .loc so much slower in access an element (while it appear to work find with a range selection t.loc[:100, ['A','E']])? Why does not .ix suffer from that?

brownmk commented Mar 21, 2014

Thanks! Will use .at instead. Could you explain a bit why is .loc so much slower in access an element (while it appear to work find with a range selection t.loc[:100, ['A','E']])? Why does not .ix suffer from that?

@brownmk

This comment has been minimized.

Show comment
Hide comment
@brownmk

brownmk Mar 21, 2014

Since .ix is always the fastest in all cases, what is against using .ix instead of .at? Thanks!

brownmk commented Mar 21, 2014

Since .ix is always the fastest in all cases, what is against using .ix instead of .at? Thanks!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 21, 2014

Contributor

ix can very subtly give wrong results (use an index of say even numbers)

you can use whatever function you want; ix is still there, but it doesn't provide the guarantees that loc provides, namely that it won't interpret a number as a location

but 2 my first point, if you are actually iterating over elements their are much better ways to do it.

Contributor

jreback commented Mar 21, 2014

ix can very subtly give wrong results (use an index of say even numbers)

you can use whatever function you want; ix is still there, but it doesn't provide the guarantees that loc provides, namely that it won't interpret a number as a location

but 2 my first point, if you are actually iterating over elements their are much better ways to do it.

@jreback jreback closed this Mar 21, 2014

@csaladenes

This comment has been minimized.

Show comment
Hide comment
@csaladenes

csaladenes Oct 1, 2014

I usually convert the pandas to a Python dictionary (which is for many applications enough after you did the sorting/slicing specific to pandas). Then iteration is about 3-4 orders of magnitude faster.

csaladenes commented Oct 1, 2014

I usually convert the pandas to a Python dictionary (which is for many applications enough after you did the sorting/slicing specific to pandas). Then iteration is about 3-4 orders of magnitude faster.

@sergeny

This comment has been minimized.

Show comment
Hide comment
@sergeny

sergeny Dec 24, 2014

This does not really answer the question. While these approaches might be impractical, all of the documentation suggests using .loc or .iloc, or at, or .iat over .ix, yet .ix is consistently faster. Are there any advantages to not using .ix? Better type checking? What if I am completely sure that all my indexes only contain integers, and nothing else?

I've profiled the example above, inspired by the recent issue with .loc[list] (#9126), and indeed, even iloc remains slower than ix.

In fact, even .at is slower than .ix for retrieving a single element. This is because .at is calling pd.core.indexing._AtIndexer._covert_key, which is the most time-consuming function call, whereas .ix is just not doing it.

 a=pd.DataFrame(data=np.zeros([10000,6]), columns=['A','B','C','D','E','F'])
In [58]: %timeit  [a.at[5235,'C'] for i in xrange(100000)]
1 loops, best of 3: 668 ms per loop

In [58]: %prun  [a.at[5235,'C'] for i in xrange(100000)]

ncalls tottime percall cumtime percall filename:lineno(function)
100000 0.226 0.000 0.393 0.000 indexing.py:1526(_convert_key)
100000 0.147 0.000 0.569 0.000 frame.py:1623(get_value)
100000 0.142 0.000 1.117 0.000 indexing.py:1498(getitem)
100000 0.116 0.000 0.253 0.000 internals.py:3469(get_values)
1 0.105 0.105 1.280 1.280 :1()
200000 0.083 0.000 0.095 0.000 index.py:606(is_integer)
100000 0.065 0.000 0.065 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}

In [60]: %timeit  [a.ix[5235,'C'] for i in xrange(100000)]
1 loops, best of 3: 446 ms per loop


In [62]: %prun  [a.ix[5235,'C'] for i in xrange(100000)]

ncalls tottime percall cumtime percall filename:lineno(function)
100000 0.146 0.000 0.532 0.000 frame.py:1623(get_value)
100000 0.110 0.000 0.704 0.000 indexing.py:61(getitem)
100000 0.104 0.000 0.231 0.000 internals.py:3469(get_values)
1 0.100 0.100 0.862 0.862 :1()
100000 0.060 0.000 0.060 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}

sergeny commented Dec 24, 2014

This does not really answer the question. While these approaches might be impractical, all of the documentation suggests using .loc or .iloc, or at, or .iat over .ix, yet .ix is consistently faster. Are there any advantages to not using .ix? Better type checking? What if I am completely sure that all my indexes only contain integers, and nothing else?

I've profiled the example above, inspired by the recent issue with .loc[list] (#9126), and indeed, even iloc remains slower than ix.

In fact, even .at is slower than .ix for retrieving a single element. This is because .at is calling pd.core.indexing._AtIndexer._covert_key, which is the most time-consuming function call, whereas .ix is just not doing it.

 a=pd.DataFrame(data=np.zeros([10000,6]), columns=['A','B','C','D','E','F'])
In [58]: %timeit  [a.at[5235,'C'] for i in xrange(100000)]
1 loops, best of 3: 668 ms per loop

In [58]: %prun  [a.at[5235,'C'] for i in xrange(100000)]

ncalls tottime percall cumtime percall filename:lineno(function)
100000 0.226 0.000 0.393 0.000 indexing.py:1526(_convert_key)
100000 0.147 0.000 0.569 0.000 frame.py:1623(get_value)
100000 0.142 0.000 1.117 0.000 indexing.py:1498(getitem)
100000 0.116 0.000 0.253 0.000 internals.py:3469(get_values)
1 0.105 0.105 1.280 1.280 :1()
200000 0.083 0.000 0.095 0.000 index.py:606(is_integer)
100000 0.065 0.000 0.065 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}

In [60]: %timeit  [a.ix[5235,'C'] for i in xrange(100000)]
1 loops, best of 3: 446 ms per loop


In [62]: %prun  [a.ix[5235,'C'] for i in xrange(100000)]

ncalls tottime percall cumtime percall filename:lineno(function)
100000 0.146 0.000 0.532 0.000 frame.py:1623(get_value)
100000 0.110 0.000 0.704 0.000 indexing.py:61(getitem)
100000 0.104 0.000 0.231 0.000 internals.py:3469(get_values)
1 0.100 0.100 0.862 0.862 :1()
100000 0.060 0.000 0.060 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}

@xgdgsc

This comment has been minimized.

Show comment
Hide comment
@xgdgsc

xgdgsc Oct 8, 2015

Contributor

Are you sure this is fixed? I still experience 30 time faster when using ix instead of loc when accessing with something like df.ix['row', 'col'] with pandas 0.16.2. When running %timeit df.ix['row', 'col'], I get a message:

The slowest run took 10.01 times longer than the fastest. This could mean that an intermediate result is being cached 

when using loc, this message doesn' t appear and it is much slower.

@csaladenes Thanks. Converting to dict is indeed much faster!

Contributor

xgdgsc commented Oct 8, 2015

Are you sure this is fixed? I still experience 30 time faster when using ix instead of loc when accessing with something like df.ix['row', 'col'] with pandas 0.16.2. When running %timeit df.ix['row', 'col'], I get a message:

The slowest run took 10.01 times longer than the fastest. This could mean that an intermediate result is being cached 

when using loc, this message doesn' t appear and it is much slower.

@csaladenes Thanks. Converting to dict is indeed much faster!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 8, 2015

Contributor

@xgdgsc yes this is fixed.

.loc does quite a bit of inference and is much more typesafe than .ix. You can still use .ix if you'd like. But if this perf diff actually matters (and we are talking microseconds here). Then you are just doing things wrong. You shouldn't be repeatedly calling these function in a loop. There are much much better ways of doing things.

Contributor

jreback commented Oct 8, 2015

@xgdgsc yes this is fixed.

.loc does quite a bit of inference and is much more typesafe than .ix. You can still use .ix if you'd like. But if this perf diff actually matters (and we are talking microseconds here). Then you are just doing things wrong. You shouldn't be repeatedly calling these function in a loop. There are much much better ways of doing things.

@xgdgsc

This comment has been minimized.

Show comment
Hide comment
@xgdgsc

xgdgsc Oct 8, 2015

Contributor

Yes. I shouldn't be repeatedly calling these function in a loop.

Contributor

xgdgsc commented Oct 8, 2015

Yes. I shouldn't be repeatedly calling these function in a loop.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 8, 2015

Contributor

Here's a notebook that I show sometimes that explains why this is not a good idea: https://github.com/jreback/StrataNYC2015/blob/master/performance/8.%20indexing.ipynb

Contributor

jreback commented Oct 8, 2015

Here's a notebook that I show sometimes that explains why this is not a good idea: https://github.com/jreback/StrataNYC2015/blob/master/performance/8.%20indexing.ipynb

@ys198918

This comment has been minimized.

Show comment
Hide comment
@ys198918

ys198918 May 24, 2016

i used .ix instead of .loc,but its still very slow,almost the same.do you know why ?@jreback

ys198918 commented May 24, 2016

i used .ix instead of .loc,but its still very slow,almost the same.do you know why ?@jreback

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback May 24, 2016

Contributor

@ys198918 if this perf difference actually matters to you then you are iterating, which is wrong. you need to do vectorized things. iterating is much less performant.

Contributor

jreback commented May 24, 2016

@ys198918 if this perf difference actually matters to you then you are iterating, which is wrong. you need to do vectorized things. iterating is much less performant.

@markyoder

This comment has been minimized.

Show comment
Hide comment
@markyoder

markyoder Dec 14, 2017

Why does PANDAS need so many different methods to access data elements? Given the esoteric complexity -- use .at vs .loc or .xs, or .ix (and if you use the wrong mode of access, your code will run like a rocket sled plowing through a wall of jello), i am still trying to figure out what is so great about PANDAS?

markyoder commented Dec 14, 2017

Why does PANDAS need so many different methods to access data elements? Given the esoteric complexity -- use .at vs .loc or .xs, or .ix (and if you use the wrong mode of access, your code will run like a rocket sled plowing through a wall of jello), i am still trying to figure out what is so great about PANDAS?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment