<h1 style="color: teal">Lecture 2 - Performance</h1>

<strong style="color: #1B2A49">E. Margarita Palacios Vargas<br>
Fundación Universitaria Konrad Lorenz</strong>

---

<h2 style="color: teal">Performance</h2>

We already discussed that <strong>performance measurements are stochastic</strong> — i.e., repeated runs of the same program can produce slightly different execution times.

Recapping, we can measure:

- **Counts** — how often an event occurs
- **Duration** — the time taken for some interval or operation
- **Size** — the amount of data or memory used by a variable

Next, we will look at how to measure performance in <strong>Python</strong>.
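
As a first taste, here is a minimal sketch of the three kinds of measurement using only the standard library. The sample text is made up for illustration, and `sys.getsizeof` reports only the size of the string object itself.

```python
import sys
import time
from collections import Counter

text = "performance measurements are stochastic " * 1000  # made-up sample data

# Duration: wall-clock time needed to build the character counts.
start = time.perf_counter()
counts = Counter(text)
elapsed = time.perf_counter() - start

# Count: how often does a given character occur?
e_count = counts["e"]

# Size: memory used by the string object, in bytes.
size_bytes = sys.getsizeof(text)

print(f"'e' occurs {e_count} times")
print(f"building the counts took {elapsed:.6f} s")
print(f"the string occupies {size_bytes} bytes")
```

Running it several times should give the same count and size but slightly different durations, which is the stochastic behaviour mentioned above.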

---

<h2 style="color: teal">2.1. Measuring CPU time</h2>

<h4>1. <code>timeit</code></h4>

The <code>timeit</code> module measures the execution time of small pieces of code. By default it uses <code>time.perf_counter</code>, so it reports wall-clock time rather than CPU time.

From the terminal, you can run:

```bash
python -m timeit "my_function()"
```

or, if you want to time a function inside a script:

```bash
python -m timeit -s "import my_script" "my_script.main()"
```

In a <strong>Jupyter Notebook</strong>, you can use the <strong>magic command</strong> <code>%timeit</code>:

```python
%timeit my_function()
```

In [4]:
import numpy as np

%timeit [x**4 for x in range(10000)]
%timeit np.arange(10000)**4 # The vectorized NumPy version runs faster than the pure-Python list comprehension

3.13 ms ± 285 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
102 μs ± 4.04 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Because timings are stochastic, a single measurement is not reliable; the usual remedy is statistics. The magic command runs the code that follows it many times and reports the mean and standard deviation of the time per loop.
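
Outside a notebook, a similar summary can be computed with the standard-library `timeit.repeat`. The following is a minimal sketch in the spirit of `%timeit`, not a reimplementation of it; the function `fourth_powers` and the constants `REPEAT` and `NUMBER` are names chosen for this illustration.

```python
import statistics
import timeit


def fourth_powers(n=1_000):
    """Same kind of workload as the list comprehension above, but smaller."""
    return [x**4 for x in range(n)]


REPEAT, NUMBER = 7, 100  # 7 runs of 100 loops each

# Each entry is the total time for NUMBER calls, so divide to get per-loop time.
totals = timeit.repeat(fourth_powers, repeat=REPEAT, number=NUMBER)
per_loop = [t / NUMBER for t in totals]

mean = statistics.mean(per_loop)
std = statistics.pstdev(per_loop)
print(f"{mean * 1e6:.1f} µs ± {std * 1e6:.1f} µs per loop "
      f"(mean ± std. dev. of {REPEAT} runs, {NUMBER} loops each)")
```

The numbers will differ from run to run and from machine to machine, which is why the spread is reported alongside the mean.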

We can also time functions. Keep in mind which magic command you are using:

**Note:** **`%timeit`** → one line, **`%%timeit`** → whole cell  

In [6]:
%%timeit # Times the whole cell; here that is only the function definition, not a call
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

192 ns ± 4.62 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [12]:
# The %%timeit cell above only timed the `def` statement and did not keep sum2d in the namespace, so we define it again
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

a = np.ones((2048, 2048)) 
a.size == 2048 ** 2 # Elements

True

In [10]:
%timeit sum2d(a)

1.84 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


---

<h4>2. <code>njit</code> from Numba</h4>

The <code>njit</code> decorator from the <strong>Numba</strong> library performs <strong>Just-In-Time (JIT) compilation</strong> of Python functions to optimized machine code, often achieving speeds comparable to compiled languages like C or Fortran.

<strong>Note:</strong> <code>@njit</code> is equivalent to <code>@jit(nopython=True)</code> and is now the recommended usage. The older <code>@jit</code> form is still supported, but its <em>object mode fallback</em> behavior has been deprecated. See the <a href="https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit">Numba documentation</a>.


In [13]:
from numba import njit

a = np.ones((2048, 2048))

In [14]:
@njit
def sum2dv3(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [19]:
%timeit sum2d(a)
%timeit sum2dv3(a) # njit's compiled code runs faster

1.9 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
9.88 ms ± 354 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


---

<h4>3. <code>numexpr</code></h4>

The `numexpr` library in Python is designed to efficiently evaluate numerical expressions on arrays. It provides a way to accelerate numerical computations, especially those involving large arrays, by optimizing memory usage and utilizing multiple CPU cores.

In [2]:
import numexpr as ne

In [5]:
a = np.random.rand(100000)
b = np.random.rand(100000)
%timeit np.sin(a) + np.log(b)
%timeit ne.evaluate("sin(a) + log(b)")

6.83 ms ± 373 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.05 ms ± 145 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%timeit 2*a + 3*b
%timeit ne.evaluate("2*a + 3*b")

<h2 style="color: teal">2.2. Measuring size</h2>

In [6]:
x = np.array([1.3, 2.4, 3.3])

In [7]:
x.data # Buffer object pointing to the start of the array's data in memory

<memory at 0x000002AC3D75D900>

In [8]:
# 'data' = A 2-tuple whose first argument is a Python integer
# that points to the data-area storing the array contents.
x.__array_interface__

{'data': (2938757333936, False),
 'strides': None,
 'descr': [('', '<f8')],
 'typestr': '<f8',
 'shape': (3,),
 'version': 3}

In [9]:
# Size (number of elements of the array)
x.size

3

In [10]:
# Memory size of one array element (in bytes)
x.itemsize

8

In [8]:
# Memory size of the full array (in bytes); same value as x.nbytes
x.itemsize * x.size

24
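
The same bookkeeping is possible for plain Python objects. As a rough sketch, using the standard-library `sys.getsizeof` function and `array` module (neither is part of the lecture's NumPy examples), compare a list of floats with a packed array of doubles:

```python
import sys
from array import array

values = [1.3, 2.4, 3.3]

# A list stores references to separate float objects, so its footprint is
# the list container plus each individual float object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array('d') stores raw 8-byte doubles contiguously, like a float64 array.
packed = array("d", values)
data_bytes = packed.itemsize * len(packed)  # same formula as itemsize * size

print(f"list + float objects: {list_bytes} bytes")
print(f"packed data only:     {data_bytes} bytes")
```

The exact value of `list_bytes` depends on the Python implementation and version; the point is only that per-object overhead makes it noticeably larger than the raw data.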

<h2 style="color: teal">2.3. Profiling</h2>

Profiling in Python means analyzing the performance of your code to identify bottlenecks and areas that can be optimized. Python ships with two built-in profilers, `cProfile` and `profile`, which are native (<i>i.e., they do not require additional software</i>). We will cover both, followed by `line_profiler`, a third-party package that must be installed separately.

<h4>1. <code>cProfile</code></h4>

<strong>Syntax (on bash):</strong>
```bash
python -m cProfile my_script.py
```

In [7]:
from os import system # Runs a shell command and returns its exit status

In [4]:
system("cat examples/Example0.py")

1

What do the columns of the `cProfile` output mean?
- `ncalls`: This column shows the number of times each function was called during the execution of the program.
- `tottime`: This column indicates the total time (in seconds) spent in each function excluding time spent in its subfunctions. It's the "internal" time spent exclusively in the function itself.
- `percall`: This column shows the average time (in seconds) spent in each function call, calculated as tottime / ncalls.
- `cumtime`: This column represents the cumulative time (in seconds) spent in the function and all its subfunctions. It includes the time spent in the function itself and all the functions called from it.
- `percall` (cumtime): Average cumulative time (in seconds) per primitive call, calculated as cumtime / primitive calls. (If there’s no recursion, primitive calls ≈ ncalls, so the numbers will look the same.)
- `filename:lineno(function)`: This column provides information about the location of the function in your code, including the filename, line number, and function name.

By default, the output is ordered by "standard name" (the text of the `filename:lineno(function)` column). Pass `-s cumtime` to sort by the cumtime column instead, which helps you quickly identify the functions that consume the most overall time. These are potential candidates for optimization. You will want to look at functions with **high cumtime and ncalls values**.

Use `cProfile` to profile code and view stats:

```python
import cProfile, pstats

# Run and show results
cProfile.run("print('Hello profiling!')")

# Save results to a file
cProfile.run("print('Hello profiling!')", "profiler")

# Load and inspect saved results
stats = pstats.Stats("profiler")
stats.print_stats()
```
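
To profile a single call and sort the report, `cProfile.Profile` can be combined with `pstats.Stats.sort_stats`. The sketch below assumes a made-up workload (`square` and `slow_square_sum` are hypothetical functions) and keeps only the five most expensive entries by cumulative time.

```python
import cProfile
import io
import pstats


def square(x):
    return x * x


def slow_square_sum(n):
    """Hypothetical workload: one helper call per item."""
    return sum(square(i) for i in range(n))


# Profile only the region between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = slow_square_sum(100_000)
profiler.disable()

# Write the report to a string buffer, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the report, `square` should show a large `ncalls` value with a small `percall`, while `slow_square_sum` accumulates the time of everything it calls in `cumtime`.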

`cProfile` can also be invoked as a module to profile a given script (or module). The syntax is as follows.
```bash
python -m cProfile [-o output_file] [-s sort_order] (-m module | myscript.py)
```
Let us see an example. Run
```bash
python -m cProfile examples/factorial.py
```

<h4>2. <code>profile</code></h4>

The `profile` module is the other profiler in the standard library. It is written in pure Python and exposes the same interface as `cProfile`, but it adds considerably more overhead, so `cProfile` is usually the better default. Like `cProfile`, it reports function calls and their time consumption, and you can use it to profile specific parts of your code.
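
Because `profile` is implemented in pure Python, it typically slows the profiled code down more than `cProfile` does. A minimal sketch to observe this on your own machine follows; `work` and `timed_runcall` are hypothetical helpers written for this illustration.

```python
import cProfile
import profile
import time


def work(n=50_000):
    """Hypothetical workload with many small function calls."""
    return sum(abs(i) for i in range(n))


def timed_runcall(profiler_cls):
    """Run work() under the given profiler and return (result, seconds)."""
    profiler = profiler_cls()
    start = time.perf_counter()
    result = profiler.runcall(work)
    return result, time.perf_counter() - start


result_c, seconds_c = timed_runcall(cProfile.Profile)
result_py, seconds_py = timed_runcall(profile.Profile)

print(f"cProfile: {seconds_c:.4f} s")
print(f"profile:  {seconds_py:.4f} s")
```

Both profilers return the same result from `work`; only the measured wall-clock time differs.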

In [11]:
import profile

In [9]:
def main():
	x = [1.0] * (2048 * 2048) 
	a = str(x[0]) 
	a += " is a one..." 
	del x			
	print(a)

profiler = profile.Profile()
profiler.runcall(main)
profiler.print_stats()

1.0 is a one...
         34 function calls in 0.031 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.031    0.031    0.031    0.031 987953020.py:1(main)
        2    0.000    0.000    0.000    0.000 :0(__exit__)
        1    0.000    0.000    0.000    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(get)
        2    0.000    0.000    0.000    0.000 :0(getpid)
        1    0.000    0.000    0.000    0.000 :0(is_done)
        2    0.000    0.000    0.000    0.000 :0(isinstance)
        2    0.000    0.000    0.000    0.000 :0(items)
        2    0.000    0.000    0.000    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(print)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        2    0.000    0.000    0.000    0.000 :0(write)
        1    0.000    0.000    0.000    0.000 iostream.py:138(_event_pipe)
        1    0.000    0.000    0.000    0.000 iostream.py:259(sche

<h4>3. <code>line_profiler</code></h4>

The output of `profile` has the same function-level format as that of `cProfile`. To measure performance line by line, we need a different tool, `line_profiler`:
1. Install `line_profiler` using `pip` (or `conda`).
```bash
pip install line-profiler
```
2. In the .py file that you want to analyze, put the decorator `@profile` above each function that you want to profile.
3. Run `kernprof` (the script `kernprof.py`, found [here](https://github.com/pyutils/line_profiler/blob/main/kernprof.py), but also inside `examples`) on your .py file. The `-l` flag enables line-by-line profiling.
```bash
kernprof -l examples/Example1.py
```
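4. View the line-by-line results. By default, `kernprof` saves them in the current directory as `<script name>.lprof` (here, `Example1.py.lprof`):
```bash
python -m line_profiler Example1.py.lprof
```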

Try to do this for the example files in `examples/`
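
If you do not have a script at hand, the following made-up file (not one of the files in `examples/`) shows what step 2 looks like. The `try`/`except` fallback is an addition for this sketch, so that the script also runs with plain `python` when `kernprof` has not defined `profile`.

```python
# hypothetical_example.py: a made-up script for illustration only
try:
    profile  # `kernprof -l` injects this name into builtins
except NameError:
    def profile(func):
        """No-op fallback so the script also runs without kernprof."""
        return func


@profile
def build_squares(n):
    squares = []
    for i in range(n):
        squares.append(i * i)
    return squares


if __name__ == "__main__":
    print(sum(build_squares(1_000)))
```

When run through `kernprof -l`, each line inside `build_squares` gets its own row of hits and timings in the report.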

You can also run `line_profiler` directly from Python, for example inside a notebook, without going through `kernprof`:

In [14]:
from line_profiler import LineProfiler

In [15]:
def main(a,b,c):
	print("a = ", a)
	print("b = ", b)
	print(np.dot(a,b))
	print(a @ b)

a = np.array([[1,2],[4,3]])
b = np.array([[1,2],[4,3]])
c = np.arange(2) + 1

lp = LineProfiler()
lp_wrapper = lp(main)
lp_wrapper(a,b,c)
lp.print_stats()

a =  [[1 2]
 [4 3]]
b =  [[1 2]
 [4 3]]
[[ 9  8]
 [16 17]]
[[ 9  8]
 [16 17]]
Timer unit: 1e-07 s

Total time: 0.0018448 s

Could not find file C:\Users\Margarita\AppData\Local\Temp\ipykernel_15040\1664494061.py
Are you sure you are running this program from the same directory
that you ran the profiler from?
Continuing without the function's contents.

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           
     2         1       9447.0   9447.0     51.2  
     3         1       3180.0   3180.0     17.2  
     4         1       3113.0   3113.0     16.9  
     5         1       2708.0   2708.0     14.7  

