# A Closer Look at Python List Comprehensions

Inspired by Trey Hunner's blog post (https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/), I'm taking a deeper look at list comprehensions. List comprehensions in Python are definitely not just syntactic sugar. They compile to different byte code than an explicit for loop, and they run faster. Stay with me and let's see why.

I ran this notebook on my 21" iMac Retina.

## An Explicit Loop

We start with a typical task in Python - transform an existing list into a new one.

In [1]:
def func(item):
    return item > 0

old_list = range(1, 10000)

def f1(my_list):
    new_list = []
    for item in my_list:
        if func(item):
            new_list.append(item)
    return new_list

Let's use the dis module to examine what's going on at the byte code level.

In [2]:
import dis
dis.dis(f1)

  7           0 BUILD_LIST               0
              3 STORE_FAST               1 (new_list)

  8           6 SETUP_LOOP              39 (to 48)
              9 LOAD_FAST                0 (my_list)
             12 GET_ITER
        >>   13 FOR_ITER                31 (to 47)
             16 STORE_FAST               2 (item)

  9          19 LOAD_GLOBAL              0 (func)
             22 LOAD_FAST                2 (item)
             25 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             28 POP_JUMP_IF_FALSE       13

 10          31 LOAD_FAST                1 (new_list)
             34 LOAD_ATTR                1 (append)
             37 LOAD_FAST                2 (item)
             40 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             43 POP_TOP
             44 JUMP_ABSOLUTE           13
        >>   47 POP_BLOCK

 11     >>   48 LOAD_FAST                1 (new_list)
             51 RETURN_VALUE


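Whew! That's a lot of byte codes. The first column is the line number within the cell we entered. The second is the byte offset of each instruction, followed by the instruction name, its argument, and (in parentheses) what that argument refers to. A >> marks an instruction that is the target of a jump. With that in hand, it's simple enough to see what's going on.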

First, BUILD_LIST with an argument of 0 creates an empty list and pushes it onto the stack, and STORE_FAST pops it and stores it in the local variable new_list. This is the byte code corresponding to line 7 of the cell, new_list = [].

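The SETUP_LOOP at offset 6 sets up the loop, which ends at offset 48 in our listing, and LOAD_FAST pushes a reference to my_list onto the stack. Next, GET_ITER replaces that reference with an iterator, and we start the loop. FOR_ITER calls next() on the iterator and pushes the result onto the stack; when the iterator is exhausted, it jumps to offset 47 instead. Note that in Python 3 our old_list is actually a range object rather than a true list, but the byte code is the same for any iterable.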

There are a few other bookkeeping instructions here, but note the LOAD_ATTR byte code. This looks up the append attribute on the list object and builds a bound method, and it does so on every pass through the loop. As we shall see below, this repeated lookup is a significant and easily avoided cost.

The complete list of byte codes for version 3.5 is at https://docs.python.org/3.5/library/dis.html. Take a look at this page and refer back to it as we go along.
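The same information can also be inspected programmatically rather than by reading the listing by eye. Here is a minimal sketch using `dis.get_instructions`; the helper name `instructions_touching` is made up for illustration, and the opcode names it reports depend on the interpreter version (newer CPython releases may show `LOAD_METHOD` or a different form of `LOAD_ATTR` than the 3.5 listing above).

```python
import dis


def func(item):
    return item > 0


def f1(my_list):
    new_list = []
    for item in my_list:
        if func(item):
            new_list.append(item)
    return new_list


def instructions_touching(fn, name):
    """Return (offset, opname) pairs for instructions whose argument is `name`.

    Hypothetical helper for illustration; it is not part of the dis module.
    """
    return [
        (ins.offset, ins.opname)
        for ins in dis.get_instructions(fn)
        if ins.argval == name
    ]


# Every instruction in f1 that refers to the name "append".
# The offsets and opcode names vary between Python versions.
append_ops = instructions_touching(f1, "append")
print(append_ops)
```

Because the instruction sits inside the loop body, whatever it is called on your interpreter, it runs once for every item that passes the test.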

## A Simple Performance Gain

We noted above that the instruction at offset 34 loads an attribute - the append method of list - every time the loop appends an item. This is the first thing we investigate. First let's load the timeit module, which we'll use to measure the difference.

In [3]:
import timeit

The [`timeit`](https://docs.python.org/3/library/timeit.html) module disables garbage collection.

Now we make a small change to the function defined above, and time both of them.

In [4]:
def f2(my_list):
    new_list = []
    my_append = new_list.append
    for item in my_list:
        if func(item):
            my_append(item)
    return new_list

First we time the original loop.

In [5]:
timeit.timeit(stmt="f1(old_list)", setup="from __main__ import f1; from __main__ import old_list", number=20000)

33.73541982700408

Next, we time the new incarnation, by loading the attribute before the loop starts.

In [6]:
timeit.timeit(stmt="f2(old_list)", setup="from __main__ import f2; from __main__ import old_list", number=20000)

27.061378504993627

Caching the method lookup saves about 6.7 seconds over 20,000 calls, or roughly 20% of the original running time.

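We should note here that timeit turns off garbage collection while running the statement, which removes one source of noise from the timings. Nothing is cached between runs: the statement really is executed 20,000 times. The setup code is run only once, and its time is not included in the result.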
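A single timeit run can still be disturbed by whatever else the machine is doing. A common way to reduce that is to run several trials and keep the fastest. The sketch below does this with `timeit.repeat`; the helper name `best_time` and the small `number` and `repeat` values are my own choices for illustration, not what produced the timings in this post. Passing a lambda also adds one extra function call per run, so compare results only against other measurements taken the same way.

```python
import timeit


def func(item):
    return item > 0


def f1(my_list):
    new_list = []
    for item in my_list:
        if func(item):
            new_list.append(item)
    return new_list


def best_time(fn, arg, number=20, repeat=3):
    """Time `number` calls of fn(arg), `repeat` times, and return the fastest trial.

    Hypothetical helper for illustration; the result is in seconds per trial.
    """
    trials = timeit.repeat(lambda: fn(arg), number=number, repeat=repeat)
    return min(trials)


old_list = range(1, 10000)
elapsed = best_time(f1, old_list)
print("best of 3 trials: {:.4f} s for 20 calls".format(elapsed))
```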

Additionally, here are the actual byte codes, for your amusement and enlightenment.

In [7]:
dis.dis(f2)

  2           0 BUILD_LIST               0
              3 STORE_FAST               1 (new_list)

  3           6 LOAD_FAST                1 (new_list)
              9 LOAD_ATTR                0 (append)
             12 STORE_FAST               2 (my_append)

  4          15 SETUP_LOOP              36 (to 54)
             18 LOAD_FAST                0 (my_list)
             21 GET_ITER
        >>   22 FOR_ITER                28 (to 53)
             25 STORE_FAST               3 (item)

  5          28 LOAD_GLOBAL              1 (func)
             31 LOAD_FAST                3 (item)
             34 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             37 POP_JUMP_IF_FALSE       22

  6          40 LOAD_FAST                2 (my_append)
             43 LOAD_FAST                3 (item)
             46 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             49 POP_TOP
             50 JUMP_ABSOLUTE           22
        >>   53 POP_BLOCK

  7     >>   54 LO

We see at offset 40 that the loop now does a LOAD_FAST of a local variable instead of looking up an attribute.

## The Next Step - List Comprehension

As we stated before, a list comprehension is not simply syntactic sugar, but is compiled very differently at the byte code level. Let's see why.

In [8]:
def f3(my_list):
    return [item for item in my_list if func(item)]

In [9]:
timeit.timeit(stmt="f3(old_list)", setup="from __main__ import f3; from __main__ import old_list", number=20000)

25.034677872012253

We picked up a bit of time here, about two seconds over f2, which is less than I expected. However, the gain over the original function, nearly nine seconds, is quite significant. But let's see those byte codes.

In [10]:
dis.dis(f3)

  2           0 LOAD_CONST               1 (<code object <listcomp> at 0x104ba6780, file "<ipython-input-8-8c46db21b7b3>", line 2>)
              3 LOAD_CONST               2 ('f3.<locals>.<listcomp>')
              6 MAKE_FUNCTION            0
              9 LOAD_FAST                0 (my_list)
             12 GET_ITER
             13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             16 RETURN_VALUE


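There's much less code here, but that's because the loop has moved. The comprehension is compiled into a separate code object, <listcomp>, which f3 wraps in a function (MAKE_FUNCTION) and calls with an iterator over my_list (GET_ITER, then CALL_FUNCTION). The loop still runs as byte code inside that nested code object, but there it appends with a dedicated LIST_APPEND instruction instead of looking up and calling the append method each time.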
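One way to check what the comprehension compiles to is to collect the opcode names from f3 together with any code objects nested inside it. This is a sketch; the helper name `all_opnames` is made up for illustration. It assumes CPython: on 3.5 through 3.11 the comprehension body is a nested `<listcomp>` code object stored in the function's constants, while on 3.12 and later it is inlined into f3 itself, so the helper walks both.

```python
import dis
import types


def func(item):
    return item > 0


def f3(my_list):
    return [item for item in my_list if func(item)]


def all_opnames(code):
    """Collect opcode names from a code object and any code objects nested in it.

    Hypothetical helper for illustration; it is not part of the dis module.
    """
    names = [ins.opname for ins in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(all_opnames(const))
    return names


opnames = all_opnames(f3.__code__)

# The loop and the append both show up as ordinary byte code instructions.
print("FOR_ITER" in opnames, "LIST_APPEND" in opnames)
```

Calling `dis.dis` on the nested code object itself, where there is one, prints the full listing in the same format as the earlier cells.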

At this point, the timings show that when optimizing for performance, both a list comprehension and caching method lookups give significant gains over the plain loop. You can imagine that these gains are important when doing anything with large datasets.
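To put the three measurements side by side, here is the arithmetic as a short sketch. The numbers are the timings reported in the cells above; the dictionary labels and the helper name `percent_saved` are my own.

```python
# Timings reported above, in seconds for 20,000 calls each.
timings = {
    "f1 (plain loop)": 33.73541982700408,
    "f2 (cached append)": 27.061378504993627,
    "f3 (list comprehension)": 25.034677872012253,
}

baseline = timings["f1 (plain loop)"]


def percent_saved(seconds, reference):
    """Return how much less time `seconds` is than `reference`, as a percentage."""
    return 100.0 * (reference - seconds) / reference


savings = {name: percent_saved(t, baseline) for name, t in timings.items()}

for name, pct in savings.items():
    print("{:<26} {:5.1f}% less time than f1".format(name, pct))
```

On these numbers, caching the lookup saves about 20% and the comprehension about 26%, both relative to the original loop.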

## Conclusion

1. Prefer list comprehensions for speed and Pythonic code.
2. If you can't write a loop as a list comprehension, cache bound methods in a local variable to avoid repeated attribute lookups inside the loop.
3. You can use a list comprehension to loop over two lists.
4. The same principles apply for dictionary and set comprehensions.
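Points 3 and 4 weren't demonstrated above, so here is a small sketch of the syntax. The variable names and sample data are made up for illustration, and I haven't timed these forms in this post.

```python
letters = ["a", "b"]
numbers = [1, 2, 3]

# Two for clauses: every (letter, number) combination, with letters varying slowest.
pairs = [(letter, number) for letter in letters for number in numbers]

# Parallel iteration over two lists: zip stops at the end of the shorter one.
repeated = [letter * number for letter, number in zip(letters, numbers)]

# Dictionary comprehension, with the same optional filter clause as a list comprehension.
squares = {n: n * n for n in numbers if n > 1}

# Set comprehension: duplicate results collapse into one element.
parities = {n % 2 for n in numbers}

print(pairs)
print(repeated)
print(squares)
print(parities)
```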