# Faster Programs

Initially when you start programming you probably don't worry too much about how *fast* your code executes. As long as it works, you are pretty happy, but eventually you will try to solve computationally intensive programs where the speed of the code *does* matter. 
The ever increasing speed of modern hardware has given us the luxury to often not worry too much about the need to optimize our code, however, there are still many cases where it is highly desirable to write more efficient programs. Unfortunately, fewer and fewer programmers are truly good at code optimization, which is evident in the many programs that run far slower than they should.

When to optimize the code you write:
* When it is relatively easy to do so, you should always write the faster version.
* When the computation you do is time consuming, and needs to be run over and over again.
* When the computation is so very time consuming that you would not get an answer unless you have fast code.
* When the computation is so very very time consuming that it requires a large computer farm, you should definitely spend time to improve the execution speed of your code!

When to not optimize the code (much):
* When you run your code once or a few times, and optimizing the code takes longer than running it.

Note that to get maximum speed, code optimiation by writing code better, is required in *every* computer language. In general, the fastest programming language is C (or Assembly), but this is only true for properly written C. For all the compiled languages (FORTRAN, C, C++, JAVA), you can get large speed boosts when the code is properly written for speed for that language (and perhaps the particularities of the processor you are using!). For a language like Python, which normally is interpreted, the speed increases can be tremendous when you are careful in how you write your code. 

Ultimately, if you are interested in writing the fastest code possible, you should switch to C (or possibly Rust or C++). You can see honest benchmarks of languages at the [The Computer Language Benchmark Game](https://benchmarksgame-team.pages.debian.net/benchmarksgame/). However, don't forget that it doesn't matter how fast the programming language is if you cannot use it properly to create well functioning code. So often your program may be slower, but you will get results faster, by using a simpler programming language.

## Optimizing Loops

Most of the slowdown in any programming language comes from "loops". Python is especially slow when you write a standard loop, so the first place to look to speed up your code is to change the loops.

In order to optimize any code, we need to be able to compare the execution times of that code. One way to do this in python is to use the magic statement `%%timeit`, which will run the code in the cell multiple times and then compute the averate time of execution. If you only want to run the code *once*, you can use `%%time` instead.

We start with a really simple example. Consider a loop where you want to fill a large array with a set of computed values for the function $x^3 - 4 x^2 - 3 x +1$. The standard "C" program style to do this would be:

In [1]:
N = 10000

In [2]:
%%timeit
result=[]
for x in range(N):
    result.append(x**3 - 4*x**2 - 3*x +1)

7.01 ms ± 9.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


So on my computer this code takes on average 6.56 ms. If you are only going to run this *once*, then there is no need to do this faster. But now consider a case where this loop is called 10000 times. At that point, it would take a full minute to get the answer. *Still* not a problem in most cases, but you can start seeing a reason to optimize.

## List Comprehension

Very often you will find that for computations there is no need to write a "for" loop. We can change the computation and use *list comprehension* to make it go a lot faster. Here is the same computation, using list comprehension:

In [3]:
%%timeit
result=[x**3 - 4*x**2 - 3*x +1 for x in range(N)]

6.56 ms ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


That is actually *not* an impressive speedup. List comprehension actually does not significantly speed up code. It makes the code more readable and more compact, and sometimes helps because you can reduce memory use. The code you write in a list comprehension still is *interpreted code*, not fast optimized compiled code.

It is still worth knowing about list comprehension, so there is a separate notebook: [11_Loops_and_List_Comprehension](https://github.com/mholtrop/Phys601/blob/master/Notebooks/11_Loops_and_List_Comprehension.ipynb) to help you get used to them.

## Old Style Optimization 

There is an old trick in programming that works in most programming languages in most situations. What you need to do is find ways to reduce the *number of operations* in the loop. Typical examples are storing the results of an expensive computations like $\sin(\theta)$, if it is needed more than once. You also want to keep computations that do not update for each loop iteration outside of that loop.

It is fairly simple to update our polynomial. First, in most languages, it is usually faster on a computer to compute $x*x*x$ instead of $x^3$. The latter calls a general routine that can also handle $x^{2.3}$, or $x^1000$, and is thus more complicated, often containing a loop. Rewriting the loop, we get:

In [4]:
%%timeit
result=[x*x*x - 4*x*x - 3*x +1 for x in range(N)]

2.59 ms ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Next, condiser that $x^3 - 4 x^2 - 3 x +1 = x*x*x-4*x*x-3*x+1 = x (x (x-4) -3) +1$ (check for yourself) and that the last statement has more brackets but fewer operations: 2 multiplications and 3 additions, versus 5 multiplications and 3 additions. So re-writing the sequence of operations has a net gain of nearly a factor of two:

In [5]:
%%timeit
result=[ x*(x*(x - 4) - 3) +1 for x in range(N)]

1.67 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


For the list comprehension (and a normal loop), this is a reasonable speedup of ~7.01/1.67 = ~4, with very little work on our end.

## Use Numpy

There are good reasons to use Numpy: convenience and speed. Lets do the same computation with Numpy:

In [7]:
import numpy as np   # This should be outside the %%timeit, since importing is slow 
x = np.array(N)

In [8]:
%%timeit
result = x*x*x - 4*x*x - 3*x +1

2.35 µs ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Now that is a huge speedup, about 700 times faster. 

For *Numpy* however, the "old style optimizations" do not quite work the same way. The reasons are complicated and deal with the internal ways numpy works. Probably the best way to know what will be fastest is to try it:

In [9]:
%%timeit
result = x*(x*(x - 4) - 3) +1

3.39 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


We can try some other ways, but I don't expect this to improve the speed.

In [10]:
%%timeit
x = np.array(N)
result = np.power(x,3) - 4*np.square(x)- 3*x +1

2.92 µs ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [11]:
from numpy.polynomial.polynomial import polyval

In [13]:
%%timeit
result =  polyval(x,[1,-3,-4,1])

6.91 µs ± 23.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Clearly in this case using Numpy is the correct choice for a numerical evaluation, and a fairly straight forward use of Numpy was the fastest. There are many functions available in Numpy, and it makes sense to use them.

## Compile your Python

There is a lot of interest in "compiled python code" to speed up the execution of Python programs. In the ideal world, you could magically make the Python code run faster without changing anything. This does not exist yet, but many smart people are working on this problem and already a lot of progress has been made. 

There are several systems for Python that change the Python code into optimized *compiled* code. Each of these systems come with their own set of rules for what Python they can handle, and how to write the code so that it optimizes properly. Each of these systems keeps improving, so it is a moving target. This notebook cannot go into too much detail, but we can do a quick exploration. The amount of speedup that you can get using the compiled Python depends a lot on what you are doing and how you do it. If there are a lot of steps to your computation, and if you can have all the calculations performed inside the compiled Python functions, you may see a significant gain. If you were already making efficient use of Numpy, your speedup may be disappointing. 

The systems for compiling Python that I know of are [Cython](https://cython.org) and [Numba](http://numba.pydata.org). Another system Psyco is no longer maintained. There is also a completely separate implementation of standard Python (usually referred to as CPython) called [PyPy](http://pypy.org) that uses Just In Time (JIT) compilation, and is supposed to be faster than CPython. I have never used this, since it requires a completely separate installation of Python, and where I need fast code I tend to use pure C++.

## Cython

When you browse the Cython documentation you will see that a lot of it is dedicated to calling C or C++ functions from Python. This gets complicated quickly. In combination with ipython (and thus, inside notebooks), there are some interesting things to try. See [cythonmagic](https://ipython.org/ipython-doc/2/config/extensions/cythonmagic.html) for details, and some good information is [here as well](https://ipython-books.github.io/55-accelerating-python-code-with-cython/)

First, we load the Jupyther cython extension into our notebook:

In [14]:
%load_ext cython

To allow Cython to optimize things, we need to put as much of the computation into Cython compiled functions. So we need to change our approach a little. We write the code using only basic Python, but start with `%%cython -a`. The -a will anotate our code to let us know where Cython was effective in optimizine.  

In [29]:
%%cython -a 
import numpy as np
import cython

def myfun2():
    N=10000
    out=np.zeros(N,dtype="int")
    for i in range(N):
        x = i
        out[i]=(x*(x*(x - 4) - 3) +1)
    return(out)   


We see that there is *lots* of "Python interaction" in the code, so we cannot expect much of a speedup! Test it:

In [30]:
%timeit myfun2()

1.2 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


A bit faster than the fastest pure Python implementation, but now what we were looking for. We need to make the Python code more C like to do better, so that we can remove the "Python interaction" as much as possible.

We try again:

In [35]:
%%cython -a 
import numpy as np
def myfun2():
    cdef int N=10000
    cdef long[:] out = np.zeros(N,dtype="long")
    cdef int x
    for x in range(N):
        out[x]=(x*(x*(x - 4) - 3) +1)
    return(out)   



In [36]:
%timeit myfun2()

12.1 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


So that is an additional 100x speedup. It is impressive, but we *still* were better off just using Numpy. There may be cases however where the code you need cannot be efficiently cast into a Numpy style, and in those cases these tricks can help you get much faster Python. The *cost* to you is that the code is no longer simple Python! Note that if you make any error, the error messages that you get are now completely obscure.

## Numba

OK, to be honest, Cython is too difficult. That breaks the Python promise. [Numba](http://numba.pydata.org) promises to make all this easier. Lets simply follow the documented steps to add a "Numba decorator", and leave the code alone.


In [44]:
import numba
import numpy as np

@numba.jit(nopython=True)
def myfun3():
    N=10000
    out=np.zeros(N)
    for i in range(N):
        x = i
        out[i]=(x*(x*(x - 4) - 3) +1)
    return(out)   

list = myfun3() # This step compiles the code.

array([ 1.0000000e+00, -5.0000000e+00, -1.3000000e+01, ...,
        9.9870048e+11,  9.9900025e+11,  9.9930008e+11])

In [45]:
%timeit myfun3()

13.7 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


That is very promising. The result is almost as fast as Cython, and far easier to accomplish. There are a lot of details and additional tricks to make Numba go even faster. To learn about those, read the documentation!



# Exercises:

## Need for speed

Here is an example of a small function that returns all the prime numbers up to a given integer. It uses a fairly fast algorithm called the "Sieve of Aristophanes":

In [None]:
import numpy as np
def primeSieve_slow(sieveSize):
    ''' Returns a list of prime numbers calculated using
    the Sieve of Eratosthenes algorithm.'''
    sieve = [True] * sieveSize  # First we set all the numbers to True, then mark off the ones that are not prime.
    sieve[0] = False # zero and one are not prime numbers
    sieve[1] = False
    # create the sieve
    for i in range(2, sieveSize):
        if sieve[i] == True:                          # Only need to check numbers that still may be primes.
            pointer = i * 2                           # This would be prime times 2, so not prime.
            while pointer < sieveSize:
                sieve[pointer] = False                # We thus mark it False
                pointer += i                          # And move to then next multiple of prime.
                                                      # Until the while condition is false.
    # compile the list of primes
    primes = []
    for i in range(sieveSize):
        if sieve[i] == True:
            primes.append(i)
    return primes

In [None]:
%%time
ps = primeSieve_slow(10000000)   # Get all primes less than 10 Million

Now improve the speed of this code. For each suggestion, time the new version and see how much gain it gives.
1. Can you reduce the length of the main for loop? (Hint: does it need to go to sieveSize?)
1. Can you replace the last 5 lines with a single line using list comprehension?
2. You already know that 0,1 are not prime, and that [2,3] are prime, and then every even number is not prime. Can you rewrite the code so it never bothers with the even numbers, other than "2"?
3. The `while pointer < sieveSize:` is still a troublesome loop. Can it be eliminated using array slicing? (Hint: this is a more difficult optimization!)

## No more loops

1. Write a list comprehension statement that replaces the following code:
```
out=[]
for i in range(1000):
    x = i**3 - 98*i**2 + 450*i - 6956
    if x > -10 and x < 10:
        out.append(i)
print(out)
```
You may use 2 lines, one to define a lambda function.