# Speeding Up Python Code

## Making Your Programs Run Faster

https://www.loginradius.com/blog/async/speed-up-python-code/

Your program runs too slow and you’d like to speed it up without the assistance of more
extreme solutions, such as C extensions or a just-in-time (JIT) compiler.

While the first rule of optimization might be to “not do it,” the second rule is almost
certainly “don’t optimize the unimportant.” To that end, if your program is running slow,
you might start by profiling your code. 

More often than not, you’ll find that your program spends its time in a few hotspots,
such as inner data processing loops. Once you’ve identified those locations, you can use
the no-nonsense techniques presented in the following sections to make your program
run faster.
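
As a minimal sketch of that profiling step, the standard-library `cProfile` and `pstats` modules can show where the time goes. The functions `slow_sum` and `main` are hypothetical stand-ins for your own code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical hotspot: a tight pure-Python loop
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    return [slow_sum(20_000) for _ in range(50)]


# Collect profiling data only around the call we care about
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Print the five most expensive entries, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

For a whole script, running `python -m cProfile -s cumulative somescript.py` from the command line gives a similar report without changing the code.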

### Use functions

A lot of programmers start using Python as a language for writing simple scripts. When
writing scripts, it is easy to fall into a practice of simply writing code with very little
structure. For example:

In [None]:
# somescript.py
import sys
import csv

with open(sys.argv[1]) as f:
    for row in csv.reader(f):
        # Some kind of processing
        pass

A little-known fact is that code defined in the global scope like this runs slower than
code defined in a function. The speed difference has to do with the implementation of
local versus global variables (operations involving locals are faster). So, if you want to
make the program run faster, simply put the scripting statements in a function:

In [None]:
# somescript.py
import sys
import csv

def main(filename):
    with open(filename) as f:
        for row in csv.reader(f):
            # Some kind of processing
            pass
        
main(sys.argv[1])

The speed difference depends heavily on the processing being performed, but in our
experience, speedups of 15-30% are not uncommon.
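
One way to see the local-versus-global difference is to compare the bytecode of the two forms with the standard `dis` module. This is only a sketch: exact opcode names vary between Python versions, so it just checks for the `NAME` and `FAST` families.

```python
import dis

# The same loop compiled as module-level code...
module_src = "n = 1000\nwhile n > 0:\n    n -= 1\n"
module_code = compile(module_src, "<module>", "exec")


# ...and as the body of a function
def countdown(n):
    while n > 0:
        n -= 1


module_ops = {ins.opname for ins in dis.get_instructions(module_code)}
function_ops = {ins.opname for ins in dis.get_instructions(countdown)}

# Module-level code reads and writes n through name (dictionary) lookups
print(sorted(op for op in module_ops if "NAME" in op))
# Inside a function, n lives in a fast array slot for locals
print(sorted(op for op in function_ops if "FAST" in op))
```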

**Example**

In [None]:
# 01_function_no.py
n = 1000000

while n > 0:
    n -= 1

In [None]:
# 01_function_yes.py
def countdown(n):
    while n > 0:
        n -= 1

countdown(1000000)

    time python 01_function_no.py
    time python 01_function_yes.py

### Selectively eliminate attribute access

Every use of the dot (.) operator to access attributes comes with a cost. Under the covers,
this triggers special methods, such as `__getattribute__()` and `__getattr__()`, which
often lead to dictionary lookups.

You can often avoid attribute lookups by using the `from module import name` form of
import as well as making selected use of bound methods. To illustrate, consider the
following code fragment:

In [None]:
# 02_access.py
import math

def compute_roots(nums):
    result = []
    for n in nums:
        result.append(math.sqrt(n))
    return result

# Test
nums = range(1000000)
for n in range(10):
    r = compute_roots(nums)

When tested on our machine, this program runs in about 40 seconds. Now change the
compute_roots() function as follows:

In [None]:
# 02_noaccess.py
from math import sqrt

def compute_roots(nums):
    result = []
    result_append = result.append
    for n in nums:
        result_append(sqrt(n))
    return result
    
# Test
nums = range(1000000)
for n in range(10):
    r = compute_roots(nums)

This version runs in about 29 seconds. The only difference between the two versions of
code is the elimination of attribute access. Instead of using math.sqrt(), the code uses
sqrt(). The result.append() method is additionally placed into a local variable result_append and reused in the inner loop.

However, it must be emphasized that these changes only make sense in frequently executed
code, such as loops. So, this optimization really only makes sense in carefully
selected places.
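
To check the effect on your own machine without waiting for the full scripts, a smaller in-process comparison with `timeit` might look like the sketch below. The names `roots_attr` and `roots_local` are hypothetical, and the absolute numbers will differ from the timings quoted above.

```python
import math
from math import sqrt
from timeit import timeit


def roots_attr(nums):
    # Looks up math.sqrt and result.append on every iteration
    result = []
    for n in nums:
        result.append(math.sqrt(n))
    return result


def roots_local(nums):
    # Resolves the bound method once, outside the loop
    result = []
    append = result.append
    for n in nums:
        append(sqrt(n))
    return result


nums = range(100_000)
t_attr = timeit(lambda: roots_attr(nums), number=10)
t_local = timeit(lambda: roots_local(nums), number=10)
print(f"attribute access: {t_attr:.3f}s, no attribute access: {t_local:.3f}s")
```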

### Understand locality of variables

As previously noted, local variables are faster than global variables. For frequently accessed
names, speedups can be obtained by making those names as local as possible.
For example, consider this modified version of the compute_roots() function just
discussed:

In [None]:
import math

def compute_roots(nums):
    sqrt = math.sqrt
    result = []
    result_append = result.append
    for n in nums:
        result_append(sqrt(n))
    return result

# Test
nums = range(1000000)
for n in range(10):
    r = compute_roots(nums)

In this version, sqrt has been lifted from the math module and placed into a local
variable. If you run this code, it now runs in about 25 seconds (an improvement over
the previous version, which took 29 seconds). That additional speedup is due to a local
lookup of sqrt being a bit faster than a global lookup of sqrt.

Locality arguments also apply when working in classes. In general, looking up a value
such as self.name will be considerably slower than accessing a local variable. In inner
loops, it might pay to lift commonly accessed attributes into a local variable. For example:

In [None]:
# Slower
class SomeClass:
    # ... (rest of the class)
    def method(self):
        for x in s:
            op(self.value)

# Faster
class SomeClass:
    # ... (rest of the class)
    def method(self):
        value = self.value
        for x in s:
            op(value)
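
The fragment above is schematic (`s` and `op` are not defined). A runnable sketch of the same idea, using a hypothetical `Scaler` class, could look like this:

```python
from timeit import timeit


class Scaler:
    def __init__(self, value, items):
        self.value = value
        self.items = items

    def total_slow(self):
        # self.value is looked up on every iteration
        total = 0
        for x in self.items:
            total += x * self.value
        return total

    def total_fast(self):
        # The attribute is lifted into a local variable once
        value = self.value
        total = 0
        for x in self.items:
            total += x * value
        return total


scaler = Scaler(3, range(100_000))
t_slow = timeit(scaler.total_slow, number=20)
t_fast = timeit(scaler.total_fast, number=20)
print(f"self.value in loop: {t_slow:.3f}s, local variable: {t_fast:.3f}s")
```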

### Avoid gratuitous abstraction

Any time you wrap up code with extra layers of processing, such as decorators, properties,
or descriptors, you’re going to make it slower. As an example, consider this class:

In [1]:
class A:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    @property
    def y(self):
        return self._y
    
    @y.setter
    def y(self, value):
        self._y = value

Now, try a simple timing test:

In [2]:
from timeit import timeit
a = A(1,2)

In [5]:
timeit('a.x', 'from __main__ import a')

0.11272381200024029

In [6]:
timeit('a.y', 'from __main__ import a')

0.2618356349998976

As you can observe, accessing the property y is not just slightly slower than a simple
attribute x, it’s about 2.3 times slower in the timings above. If this difference matters, you should ask yourself
if the definition of y as a property was really necessary. If not, simply get rid of it and
go back to using a simple attribute instead. Just because it might be common for programs
in another programming language to use getter/setter functions, that doesn’t
mean you should adopt that programming style for Python.

### Use the built-in containers

Built-in data types such as strings, tuples, lists, sets, and dicts are all implemented in C,
and are rather fast. If you’re inclined to make your own data structures as a replacement
(e.g., linked lists, balanced trees, etc.), it may be rather difficult if not impossible to match
the speed of the built-ins. Thus, you’re often better off just using them.
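
As a rough illustration, the sketch below compares a hand-written linked-list queue with `collections.deque`. The `Node` and `LinkedQueue` classes are hypothetical and exist only for this comparison.

```python
from collections import deque
from timeit import timeit


class Node:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


class LinkedQueue:
    """Hand-written FIFO queue, for comparison only."""

    def __init__(self):
        self.head = self.tail = None

    def push(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node

    def pop(self):
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None
        return node.value


def drain_linked(n):
    q = LinkedQueue()
    for i in range(n):
        q.push(i)
    return [q.pop() for _ in range(n)]


def drain_deque(n):
    q = deque()
    for i in range(n):
        q.append(i)
    return [q.popleft() for _ in range(n)]


t_linked = timeit(lambda: drain_linked(50_000), number=5)
t_deque = timeit(lambda: drain_deque(50_000), number=5)
print(f"hand-written queue: {t_linked:.3f}s, collections.deque: {t_deque:.3f}s")
```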

[collections — Container datatypes](https://docs.python.org/3.8/library/collections.html)

### Avoid making unnecessary data structures or copies

Sometimes programmers get carried away with making unnecessary data structures
when they just don’t have to. For example, someone might write code like this:

In [9]:
sequence = tuple(range(50))

values = [x for x in sequence]
squares = [x*x for x in values]

Perhaps the thinking here is to first collect a bunch of values into a list and then to start
applying operations such as list comprehensions to it. However, the first list is completely
unnecessary. Simply write the code like this:

In [10]:
squares = [x*x for x in sequence]

Related to this, be on the lookout for code written by programmers who are overly
paranoid about Python’s sharing of values. Overuse of functions such as `copy.deepcopy()`
may be a sign of code that’s been written by someone who doesn’t fully understand
or trust Python’s memory model. In such code, it may be safe to eliminate many
of the copies.
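
To make the sharing behaviour concrete, here is a small sketch of what assignment, a shallow copy, and a deep copy each do. The `record` dictionary is made up for illustration.

```python
import copy

record = {"name": "AAPL", "tags": ["tech", "nasdaq"]}

alias = record                   # no copy: both names refer to the same dict
shallow = copy.copy(record)      # new dict, but the inner list is still shared
deep = copy.deepcopy(record)     # new dict and a new inner list

print(alias is record)                   # same object
print(shallow["tags"] is record["tags"])  # inner list shared
print(deep["tags"] is record["tags"])     # inner list duplicated
```

A deep copy is only needed when the code really will mutate nested data that must stay independent. Otherwise plain assignment or a shallow copy does less work.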

### Discussion

Before optimizing, it’s usually worthwhile to study the algorithms that you’re using first.
You’ll get a much bigger speedup by switching to an O(n log n) algorithm than by
trying to tweak the implementation of an O(n**2) algorithm.

If you’ve decided that you still must optimize, it pays to consider the big picture. As a
general rule, you don’t want to apply optimizations to every part of your program,
because such changes are going to make the code hard to read and understand. Instead,
focus only on known performance bottlenecks, such as inner loops.

You need to be especially wary interpreting the results of micro-optimizations. For
example, consider these two techniques for creating a dictionary:
    

In [11]:
a = {
'name' : 'AAPL',
'shares' : 100,
'price' : 534.22
}

In [12]:
b = dict(name='AAPL', shares=100, price=534.22)

The latter choice has the benefit of less typing (you don’t need to quote the key names).
However, if you put the two code fragments in a head-to-head performance battle, you’ll
find that using dict() runs three times slower! With this knowledge, you might be
inclined to scan your code and replace every use of dict() with its more verbose alternative.
However, a smart programmer will only focus on parts of a program where
it might actually matter, such as an inner loop. In other places, the speed difference just
isn’t going to matter at all.
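
If you want to reproduce the comparison yourself, a sketch with `timeit` follows. The ratio depends on the Python version and machine, so treat the factor quoted above as indicative only.

```python
from timeit import timeit

literal_time = timeit(
    "{'name': 'AAPL', 'shares': 100, 'price': 534.22}", number=200_000
)
call_time = timeit(
    "dict(name='AAPL', shares=100, price=534.22)", number=200_000
)

print(f"literal: {literal_time:.3f}s, dict(): {call_time:.3f}s")
print(f"dict() / literal ratio: {call_time / literal_time:.1f}")
```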

If, on the other hand, your performance needs go far beyond the simple techniques in
this recipe, you might investigate the use of tools based on just-in-time (JIT) compilation
techniques. For example, the PyPy project is an alternate implementation of the Python interpreter 
that analyzes the execution of your program and generates native machine
code for frequently executed parts. It can sometimes make Python programs run an
order of magnitude faster, often approaching (or even exceeding) the speed of code
written in C. When this recipe was originally written, PyPy did not yet fully support
Python 3; current releases do, as the PyPy section below shows. You might also consider the Numba
project. Numba is a dynamic compiler where you annotate selected Python functions
that you want to optimize with a decorator. Those functions are then compiled into
native machine code through the use of LLVM. It too can produce significant performance
gains, and like PyPy, it now supports Python 3.

Last, but not least, the words of John Ousterhout come to mind: “The best performance
improvement is the transition from the nonworking to the working state.” Don’t worry
about optimization until you need to. Making sure your program works correctly is
usually more important than making it run fast (at least initially).

## PyPy

[PyPy: Faster Python With Minimal Effort](https://realpython.com/pypy-faster-python/)

Python is one of the most popular programming languages among developers, but it has certain limitations. For example, depending on the application, it can be up to 100 times as slow as some lower-level languages. That’s why many companies rewrite their applications in another language once Python’s speed becomes a bottleneck for users. But what if there was a way to keep Python’s awesome features and improve its speed? Enter PyPy.

https://doc.pypy.org/en/latest/introduction.html

PyPy is a very compliant Python interpreter that is a worthy alternative to CPython 2.7, 3.6, and soon 3.7. By installing and running your application with it, you can gain noticeable speed improvements. How much of an improvement you’ll see depends on the application you’re running.

### Python and PyPy

The Python language specification is used in a number of implementations such as CPython (written in C), Jython (written in Java), IronPython (written for .NET), and PyPy (written in Python).

CPython is the original implementation of Python and is by far the most popular and most maintained. When people refer to Python, they more often than not mean CPython. You’re probably using CPython right now!

However, because it’s a high-level interpreted language, CPython has certain limitations and won’t win any medals for speed. That’s where PyPy can come in handy. Since it adheres to the Python language specification, PyPy requires no change in your codebase and can offer significant speed improvements thanks to the features you’ll see below.

Now, you may be wondering why CPython doesn’t implement PyPy’s awesome features if they use the same syntax. The reason is that **implementing those features would require huge changes to the source code and would be a major undertaking**.

Without diving too much into theory, let’s see PyPy in action.

#### Installation

You can download a prebuilt binary for your OS and architecture, or install PyPy with a version manager such as pyenv, as shown below.

https://doc.pypy.org/en/latest/install.html

List the available versions:
    
    pyenv install --list | grep pypy

Install it:
    
    pyenv install -v pypy3.7-7.3.4

#### PyPy in Action

You now have PyPy installed and you’re ready to see it in action! To do that, create a Python file called script.py and put the following code in it:

In [None]:
import time

start_time = time.time()

total = 0
for i in range(1, 10000):
    for j in range(1, 10000):
        total += i + j

print(f"The result is {total}")

end_time = time.time()
print(f"It took {end_time-start_time:.2f} seconds to compute")

This script uses two nested for loops, each running from 1 to 9,999, to sum i + j over every pair of values, and then prints the result.

Try running it with Python. 

    pyenv local 3.9.5
    python script.py

    The result is 999800010000
    It took 48.34 seconds to compute

Now run it with PyPy:

    pyenv local pypy3.7-7.3.4
    python script.py

    The result is 999800010000
    It took 0.49 seconds to compute

In this small synthetic benchmark, PyPy is roughly 99 times as fast as Python (48.34 seconds versus 0.49 seconds)!
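
Because both runs use the same `python` command, it can help to confirm which interpreter actually executed the script. A minimal sketch using the standard `platform` and `sys` modules:

```python
import platform
import sys

implementation = platform.python_implementation()  # e.g. "CPython" or "PyPy"
version = platform.python_version()

print(f"Running on {implementation} {version}")
print(f"Executable: {sys.executable}")
```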

For more serious benchmarks, you can take a look at the PyPy Speed Center, where the developers run nightly benchmarks with different executables.

https://speed.pypy.org/

Keep in mind that how PyPy affects the performance of your code depends on what your code is doing. There are some situations in which PyPy is actually slower, as you’ll see later. However, on geometric average, it’s 4.3 times as fast as Python.

## PyPy and Its Features

Historically, PyPy has referred to two things:

1. A dynamic language framework for generating interpreters for dynamic languages
2. A Python implementation using that framework

You’ve already seen the second meaning in action by installing PyPy and running a small script with it. The Python implementation you used was written using a dynamic language framework called RPython, just like CPython was written in C and Jython was written in Java.

But weren’t you told earlier that PyPy was written in Python? Well, that’s a little bit of a simplification. The reason PyPy became known as a Python interpreter written in Python (and not in RPython) is that RPython uses the same syntax as Python.

To clear everything up, here’s how PyPy is produced:

1. The source code is written in RPython.
2. The RPython translation toolchain is applied to the code, which basically makes the code more efficient. It also compiles the code down into machine code, which is why Mac, Windows, and Linux users have to download different versions.
3. A binary executable is produced. This is the Python interpreter that you used to run your small script.

Keep in mind that you don’t need to go through all these steps to use PyPy. The executable is already available for you to install and use.

Also, since it’s very confusing to use the same word for both the framework and the implementation, the team behind PyPy decided to move away from this double usage. Now, PyPy refers only to the Python implementation. The framework is referred to as the RPython translation toolchain.

Next, you’ll learn about the features that make PyPy better and faster than Python in some cases.

#### Just-In-Time (JIT) Compiler

Before getting into what JIT compilation is, let’s take a step back and review the properties of [compiled](https://en.wikipedia.org/wiki/Compiled_language) languages such as C and [interpreted languages](https://en.wikipedia.org/wiki/Interpreted_language) such as JavaScript.

**Compiled** programming languages are more performant but are harder to port to different CPU architectures and operating systems. **Interpreted** programming languages are more portable, but their performance is much worse than that of compiled languages. These are the two extremes of the spectrum.

Then there are programming languages such as Python that do a mix of both compilation and interpretation. Specifically, Python is first compiled into an **intermediate bytecode**, which is then interpreted by CPython. This makes the code perform better than code written in a purely interpreted programming language, and it maintains the portability advantage.

However, the performance is still nowhere near that of the compiled version. The reason is that the compiled code can do a lot of optimizations that just aren’t possible with bytecode.

That’s where the **just-in-time (JIT) compiler** comes in. It tries to get the best of both worlds by doing some real compilation into machine code and some interpretation. In a nutshell, here are the steps JIT compilation takes to provide faster performance:

1. Identify the most frequently used components of the code, such as a function in a loop.
2. Convert those parts into machine code during runtime.
3. Optimize the generated machine code.
4. Swap the previous implementation with the optimized machine code version.

Remember the two nested loops at the beginning of the tutorial? PyPy detected that the same operation was being executed over and over again, compiled it into machine code, optimized the machine code, and then swapped the implementations. That’s why you saw such a big improvement in speed.

#### Garbage Collection

Whenever you create variables, functions, or any other objects, your computer allocates memory to them. Eventually, some of those objects will no longer be needed. If you don’t clean them up, then your computer may run out of memory and crash your program.

In programming languages such as C and C++, you usually have to deal with this problem manually. Other programming languages such as Python and Java do it for you automatically. This is called automatic garbage collection, and there are several techniques for accomplishing it.

CPython uses a technique called reference counting. Essentially, a Python object’s reference count is incremented whenever the object is referenced, and it’s decremented when the object is dereferenced. When the reference count is zero, CPython automatically calls the memory deallocation function for that object. It’s a straightforward and effective technique, but there’s a catch.

When the reference count of a large tree of objects becomes zero, all the related objects are freed. As a result, you have a potentially long pause during which your program doesn’t progress at all.
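
On CPython, the reference count described above can be observed with `sys.getrefcount()`. This sketch is CPython-specific, and the absolute numbers matter less than how they change.

```python
import sys

data = ["some", "object"]
base = sys.getrefcount(data)        # includes a temporary reference for the call itself

alias = data                        # a second name now refers to the same list
with_alias = sys.getrefcount(data)  # one higher than before

del alias                           # dropping the name decrements the count again
after_del = sys.getrefcount(data)

print(base, with_alias, after_del)
```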

Also, there’s a use case in which reference counting simply doesn’t work. Consider the following code:

In [None]:
class A(object):
    pass

a = A()
a.some_property = a
del a

In the code above, you define a new class. Then, you create an instance of the class and assign it to be a property on itself. Finally, you delete the instance.

At this point, the instance is no longer accessible. However, reference counting doesn’t delete the instance from memory because it has a reference to itself, so the reference count is not zero. This problem is called a reference cycle, and it can’t be solved using reference counting.

This is where CPython uses another tool called the cyclic garbage collector. It walks over all objects in memory starting from known roots like the type object. It then identifies all reachable objects and frees unreachable objects since they aren’t alive anymore. This solves the reference cycle problem. However, it can create even more noticeable pauses when there are a large number of objects in memory.
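
The following sketch makes the reference cycle and its cleanup visible on CPython. It uses a weak reference as a probe and temporarily disables automatic collection so that the cyclic collector only runs when called explicitly.

```python
import gc
import weakref


class A:
    pass


gc.disable()                  # keep the cyclic collector from running on its own
a = A()
a.some_property = a           # the instance now references itself
probe = weakref.ref(a)        # a weak reference does not keep the object alive

del a
alive_after_del = probe() is not None      # reference counting alone cannot free it

gc.collect()                  # run the cyclic garbage collector explicitly
alive_after_collect = probe() is not None
gc.enable()

print(alive_after_del, alive_after_collect)
```

On CPython the instance is still alive after `del a` and is gone only after `gc.collect()`.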

PyPy, on the other hand, doesn’t use reference counting. Instead, it uses only the second technique, the cycle finder. That is, it periodically walks over live objects starting from the roots. This gives PyPy some advantage over CPython since it doesn’t bother with reference counting, making the total time spent in memory management less than in CPython.

Also, instead of doing everything in one major undertaking like CPython, PyPy splits the work into a variable number of pieces and runs each piece until none are left. This approach adds just a few milliseconds after each minor collection rather than adding hundreds of milliseconds in one go like CPython.

Garbage collection is complex and has many more details that go beyond the scope of this tutorial. You can find more information about PyPy’s garbage collection in the documentation.

### Limitations of PyPy

PyPy isn’t a silver bullet and may not always be the most suitable tool for your task. It may even make your application perform much slower than CPython. That’s why it’s important that you keep the following limitations in mind.

#### It Doesn’t Work Well With C Extensions

**PyPy works best with pure Python applications**. Whenever you use a C extension module, it runs much slower than in CPython. The reason is that PyPy can’t optimize C extension modules since they’re not fully supported. In addition, PyPy has to emulate reference counting for that part of the code, making it even slower.

In such cases, the PyPy team recommends taking out the CPython extension and replacing it with a pure Python version so that JIT can see it and do its optimizations. If that’s not an option, then you’ll have to use CPython.

With that being said, the core team is working on C extensions. Some packages have already been ported to PyPy and work just as fast.

#### It Only Works Well With Long-Running Programs

Imagine you want to go to a shop that is very close to your home. You can either go on foot or drive.

Your car is clearly much faster than your feet. However, think about what it would require you to do:

- Go to your garage.
- Start your car.
- Warm the car up a little.
- Drive to the shop.
- Find a parking spot.
- Repeat the process on your way back.

There’s a lot of overhead involved in driving a car, and it’s not always worth it if the place you want to go is nearby!

Now think about what would happen if you wanted to go to a neighboring city fifty miles away. It would certainly be worth it to drive there instead of going on foot.

Although the difference in speed isn’t quite so noticeable as in the above analogy, the same is true with PyPy and CPython.

When you run a script with PyPy, it does a lot of things to make your code run faster. If the script is too small, then the overhead will cause your script to run slower than in CPython. On the other hand, if you have a long-running script, then that overhead can pay significant performance dividends.

To see for yourself, run the following small script in both CPython and PyPy:

In [None]:
import time

start_time = time.time()

for i in range(100):
    print(i)

end_time = time.time()
print(f"It took {end_time-start_time:.10f} seconds to compute")

There’s a small delay at the beginning when you run it with PyPy, while CPython runs it instantly. In exact numbers, it takes 0.0004873276 seconds to run it on a 2015 MacBook Pro with CPython and 0.0019447803 seconds to run it with PyPy.

#### It Doesn’t Do Ahead-Of-Time Compilation

As you saw at the beginning of this tutorial, PyPy isn’t a fully compiled Python implementation. It compiles Python code, but it isn’t a compiler for Python code. Because of the inherent dynamism of Python, it’s impossible to compile Python into a standalone binary and reuse it.

PyPy is a runtime interpreter that is faster than a fully interpreted language, but it’s slower than a fully compiled language such as C.

> PyPy is a fast and capable alternative to CPython. By running your script with it, you can get a major speed improvement without making a single change to your code. But it’s not a silver bullet. It has some limitations, and you’ll need to test your program to see if PyPy can be of help.

## Numba

- [Make python fast with numba](https://thedatafrog.com/en/articles/make-python-fast-numba/)
- [Boost python with your GPU (numba+CUDA)](https://thedatafrog.com/en/articles/boost-python-gpu/)
- [Speed Up your Algorithms Part 2— Numba](https://towardsdatascience.com/speed-up-your-algorithms-part-2-numba-293e554c5cc1)
- [numba_tutorial_scipy2017](https://github.com/gforsyth/numba_tutorial_scipy2017/tree/master/notebooks)

https://numba.pydata.org/

Numba is a just-in-time compiler for Python that works best on code that uses NumPy arrays and functions, and loops. The most common way to use Numba is through its collection of decorators that can be applied to your functions to instruct Numba to compile them. When a call is made to a Numba decorated function it is compiled to machine code “just-in-time” for execution and all or part of your code can subsequently run at native machine code speed!

Out of the box Numba works with the following:

- OS: Windows (32 and 64 bit), OSX and Linux (32 and 64 bit)
- Architecture: x86, x86_64, ppc64le. Experimental on armv7l, armv8l (aarch64).
- GPUs: Nvidia CUDA. Experimental on AMD ROC.
- CPython
- NumPy 1.15 - latest

- Works within the standard Python interpreter, and does not replace it
- Integrates tightly with NumPy
- Compatible with both multithreaded and distributed computing paradigms
- Can be targeted at non-CPU hardware

### A JIT Compiler for Python

Numba reads the Python bytecode for a decorated function and combines this with information about the types of the input arguments to the function. It analyzes and optimizes your code, and finally uses the LLVM compiler library to generate a machine code version of your function, tailored to your CPU capabilities. This compiled version is then used every time your function is called.

- An open-source, function-at-a-time compiler library for Python
- Compiler toolbox for different targets and execution models:
    - single-threaded CPU, multi-threaded CPU, GPU
    - regular functions, “universal functions” (array functions), etc
- Speedup: 2x (compared to basic NumPy code) to 200x (compared to pure Python)
- Combine ease of writing Python with speeds approaching FORTRAN
- BSD licensed (including GPU compiler)
- Goal is to empower scientists who make tools for themselves and other scientists

Numba may be best understood by what it **is not**:
- Replacement Python interpreter: PyPy, Pyston, Pyjion
    - Hard to implement
    - Difficult (but not impossible) to maintain compatibility with existing Python extensions
    - Does not address non-CPU targets
- Translator of Python to C/C++: Cython, Pythran, Theano, ShedSkin, Nuitka
    - Static analysis of dynamic languages is limiting
    - Ahead-of-time generated code is either underspecialized (both in data types and CPU capabilities) or bloated to cover all variants
    - JIT compilation requires C/C++ compiler on end user system 

### Numba Features

- Detects CPU model during compilation and optimizes for that target
- Automatic type inference: No need to give type signatures for functions
- Dispatches to multiple type-specializations for the same function
- Call out to C libraries with CFFI and ctypes
- Special "callback" mode for creating C callbacks to use with external libraries
- Optional caching to disk, and ahead-of-time creation of shared libraries
- Compiler is extensible with new data types and functions

### Installation

Numba has wheels available, so it can be installed with pip:

    pip install numba

Numba is often used as a core package, so its dependencies are kept to an absolute minimum. However, extra packages can be installed as follows to provide additional functionality:
- `scipy` - enables support for compiling numpy.linalg functions.
- `colorama` - enables support for color highlighting in backtraces/error messages.
- `pyyaml` - enables configuration of Numba via a YAML config file.
- `icc_rt` - allows the use of the Intel SVML (high performance short vector math library, x86_64 only). Installation instructions are in the performance tips.

### Will Numba work for my code?

This depends on what your code looks like. If your code is numerically oriented (does a lot of math), uses NumPy a lot, and/or has a lot of loops, then Numba is often a good choice. In these examples we’ll apply the most fundamental of Numba’s JIT decorators, `@jit`, to try and speed up some functions to demonstrate what works well and what does not.

Numba works well on code that looks like this:

In [None]:
from numba import jit
import numpy as np

x = np.arange(100).reshape(10, 10)

@jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

print(go_fast(x))

It won’t work very well, if at all, on code that looks like this:

In [None]:
from numba import jit
import pandas as pd

x = {'a': [1, 2, 3], 'b': [20, 30, 40]}

@jit
def use_pandas(a): # Function will not benefit from Numba jit
    df = pd.DataFrame.from_dict(a) # Numba doesn't know about pd.DataFrame
    df += 1                        # Numba doesn't understand what this is
    return df.cov()                # or this!

print(use_pandas(x))

Note that Pandas is not understood by Numba and as a result Numba would simply run this code via the interpreter but with the added cost of the Numba internal overheads!

The Numba `@jit` decorator fundamentally operates in two compilation modes, `nopython` mode and `object` mode. In the go_fast example above, `nopython=True` is set in the `@jit` decorator; this is instructing Numba to operate in `nopython` mode. The behaviour of the `nopython` compilation mode is to essentially compile the decorated function so that it will run entirely without the involvement of the Python interpreter. This is the recommended and best-practice way to use the Numba jit decorator as it leads to the best performance.

Should the compilation in `nopython` mode fail, Numba can compile using `object` mode. This is a fallback mode for the `@jit` decorator if `nopython=True` is not set (as seen in the use_pandas example above). In this mode Numba will identify loops that it can compile and compile those into functions that run in machine code, and it will run the rest of the code in the interpreter. For best performance avoid using this mode! Note that this automatic fallback describes older Numba releases: from Numba 0.59 onward, `@jit` defaults to `nopython` mode, so object mode has to be requested explicitly (for example with `forceobj=True`).

### How to measure the performance of Numba?

First, recall that Numba has to compile your function for the argument types given before it executes the machine code version of your function. This takes time. However, once the compilation has taken place Numba caches the machine code version of your function for the particular types of arguments presented. If it is called again with the same types, it can reuse the cached version instead of having to compile again.

A really common mistake when measuring performance is to not account for the above behaviour and to time code once with a simple timer that includes the time taken to compile your function in the execution time.

For example:

In [1]:
from numba import jit
import numpy as np
import time

x = np.arange(100).reshape(10, 10)

@jit(nopython=True)
def go_fast(a): # Function is compiled and runs in machine code
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

# DO NOT REPORT THIS... COMPILATION TIME IS INCLUDED IN THE EXECUTION TIME!
start = time.time()
go_fast(x)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time.time()
go_fast(x)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))

Elapsed (with compilation) = 0.9285945892333984
Elapsed (after compilation) = 0.00021004676818847656


A good way to measure the impact Numba JIT has on your code is to time execution using the `timeit` module functions; these measure multiple iterations of execution and, as a result, can be made to account for the compilation time in the first execution.
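A minimal sketch of that approach, reusing `go_fast` from the example above: one warm-up call keeps compilation out of the timed region, and `timeit.repeat` then measures many calls. The repeat and iteration counts are arbitrary choices, and the `ImportError` fallback is an assumption added only so the sketch still runs where Numba is not installed.

```python
import timeit

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without Numba, use a no-op decorator
    # so the timing pattern can still be run (nothing is compiled then).
    def njit(func):
        return func


@njit
def go_fast(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace


x = np.arange(100).reshape(10, 10)

# Warm-up call: compilation for this argument type happens here,
# outside the timed region.
go_fast(x)

# Five independent measurements of 1000 calls each.
timings = timeit.repeat(lambda: go_fast(x), repeat=5, number=1000)
best_per_call = min(timings) / 1000
print("Best time per call: %.3g s" % best_per_call)
```

Taking the minimum over the repeats is a common convention, since the slower runs are usually disturbed by other activity on the machine rather than by the function itself.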

As a side note, if compilation time is an issue, Numba JIT supports on-disk caching of compiled functions and also has an Ahead-Of-Time compilation mode.

https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#jit-decorator-cache

### How fast is it?
Assuming Numba can operate in `nopython` mode, or at least compile some loops, it will target compilation to your specific CPU. The speedup varies by application but can be one to two orders of magnitude. Numba has a performance guide that covers common options for gaining extra performance.

https://numba.readthedocs.io/en/stable/user/performance-tips.html#performance-tips

https://www.infoworld.com/article/3247799/what-is-llvm-the-power-behind-swift-rust-clang-and-more.html

### When is Numba a Good Idea?
- Numerical algorithms
- Data is in the form of NumPy arrays, or (more broadly) flat data buffers
- Performance bottleneck is a handful of well encapsulated functions
- Example use cases:
    - Compiling user-defined functions to call from another algorithm (like an optimizer)
    - Creating "missing" NumPy/SciPy functions (librosa)
    - Rapidly prototyping GPU algorithms (FBPIC)
    - Constructing specialized Python compilers (HPAT, OAMap)

### Example 1

Here is a function that can take a bit of time. This function takes a list of numbers, and returns the standard deviation of these numbers.

In [5]:
import math

def std(xs):
    # compute the mean
    mean = 0
    for x in xs: 
        mean += x
    mean /= len(xs)
    # compute the variance
    ms = 0
    for x in xs:
        ms += (x-mean)**2
    variance = ms / len(xs)
    std = math.sqrt(variance)
    return std

As we can see in the code, we need to loop twice on the sample of numbers: first to compute the mean, and then to compute the variance, which is the square of the standard deviation.
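Because the function divides by `len(xs)` rather than `len(xs) - 1`, it computes the population standard deviation, which is also what `np.std` returns by default. A quick sanity check on a small hand-picked sample (the sample values are an illustrative assumption, chosen so the mean is 5 and the variance is 4), compared against `statistics.pstdev` from the standard library:

```python
import math
import statistics


def std(xs):
    # first pass: compute the mean
    mean = 0
    for x in xs:
        mean += x
    mean /= len(xs)
    # second pass: compute the mean squared deviation (variance)
    ms = 0
    for x in xs:
        ms += (x - mean) ** 2
    return math.sqrt(ms / len(xs))


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, variance 4
result = std(sample)
print(result)                     # 2.0
print(statistics.pstdev(sample))  # 2.0
```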

Obviously, the more numbers in the sample, the more time the function will take to complete. Let's start with 10 million numbers, drawn from a Gaussian distribution of unit standard deviation:

In [8]:
import numpy as np
a = np.random.normal(0, 1, 10000000)

In [9]:
%time std(a)

CPU times: user 13.4 s, sys: 0 ns, total: 13.4 s
Wall time: 13.4 s


1.000082316558144

In [12]:
%time std(a)

CPU times: user 13 s, sys: 0 ns, total: 13 s
Wall time: 13 s


1.000082316558144

The function takes about 13 seconds to compute the standard deviation of the sample.

Now, let's import the njit decorator from numba, and decorate our std function to create a new function:

In [16]:
from numba import njit

To ensure that only `nopython` mode is used, and that an exception is raised if compilation fails, use the `@njit` or `@jit(nopython=True)` decorator (the first is a convenience alias for the second).

In [17]:
@njit
def c_std(xs):
    # compute the mean
    mean = 0
    for x in xs: 
        mean += x
    mean /= len(xs)
    # compute the variance
    ms = 0
    for x in xs:
        ms += (x-mean)**2
    variance = ms / len(xs)
    std = math.sqrt(variance)
    return std

Keep in mind that the first time the function is called, Numba needs to compile it, which takes a bit of time, so the first measurement understates the speedup. In the runs below, the first call takes about 0.5 s (compared with roughly 13 s for the pure Python version), and the second call, which reuses the compiled code, takes about 42 ms.

In [18]:
# the first call also includes the compilation time
%time c_std(a)

CPU times: user 490 ms, sys: 27.7 ms, total: 518 ms
Wall time: 528 ms


1.000082316558144

In [19]:
%time c_std(a)

CPU times: user 41.8 ms, sys: 0 ns, total: 41.8 ms
Wall time: 41.9 ms


1.000082316558144

Example with the built-in NumPy function:

In [15]:
%time np.std(a)

CPU times: user 53.5 ms, sys: 28.3 ms, total: 81.8 ms
Wall time: 79.5 ms


1.0000823165582082

### Example 2

Whilst NumPy has developed a strong idiom around the use of vector operations, Numba is perfectly happy with loops too. For users familiar with C or Fortran, writing Python in this style will work fine in Numba (after all, LLVM gets a lot of use in compiling C lineage languages).

In [1]:
import numpy as np

original = np.arange(0.0, 30.0, 0.01, dtype='f4')
shuffled = original.copy()
np.random.shuffle(shuffled)

In [2]:
def bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

In [3]:
%time bubblesort(shuffled)

CPU times: user 5.8 s, sys: 28.2 ms, total: 5.83 s
Wall time: 5.84 s


In [7]:
@njit
def c_bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

In [8]:
%time c_bubblesort(shuffled)

CPU times: user 533 ms, sys: 44.1 ms, total: 577 ms
Wall time: 576 ms


In [9]:
%time c_bubblesort(shuffled)

CPU times: user 15.5 ms, sys: 0 ns, total: 15.5 ms
Wall time: 15.6 ms
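Note that `bubblesort` sorts its argument in place, so after the first timed call `shuffled` is already sorted, and the later timings measure the compiled function on sorted input (no swaps are needed). A minimal sketch of a fairer measurement, assuming a fresh copy of the shuffled data for every run. The helper name `work`, the smaller array size (chosen to keep the sketch quick), and the `ImportError` fallback are assumptions of this sketch, not part of the original example.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without Numba, use a no-op decorator
    # so the measurement pattern can still be run (nothing is compiled then).
    def njit(func):
        return func


@njit
def c_bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            if X[i] > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp


original = np.arange(0.0, 5.0, 0.01, dtype='f4')  # smaller than in the text
shuffled = original.copy()
np.random.shuffle(shuffled)

# Warm-up on a copy: triggers compilation and leaves `shuffled` untouched.
work = shuffled.copy()
c_bubblesort(work)

# Timed run on a fresh, still unsorted copy.
work = shuffled.copy()
start = time.perf_counter()
c_bubblesort(work)
elapsed = time.perf_counter() - start

print("Sorted correctly:", np.array_equal(work, original))
print("Elapsed (after compilation) = %.3g s" % elapsed)
```

The same copy-before-timing pattern applies to the pure Python `bubblesort`, so that both versions are measured on identical unsorted input.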
