
# Benchmarking

Sources:
- [Python 101: An Intro to Benchmarking Your Code](https://dzone.com/articles/python-101-an-intro-to-benchmarking-your-code)
- [Profiling Python using cProfile: a concrete case](https://julien.danjou.info/guide-to-python-profiling-cprofile-concrete-case-carbonara/)

What does it mean to benchmark one's code? The main idea behind benchmarking or profiling is to figure out how fast your code executes and where the bottlenecks are. The main reason to do this sort of thing is for optimization. You will run into situations where you need your code to run faster because your business needs have changed. When this happens, you will need to figure out what parts of your code are slowing it down.

## Timing Your Program

### Unix time command

If you simply want to time your whole program, it’s usually easy enough to use something
like the Unix time command. For example:

    time python3 someprogram.py

In [1]:
#someprogram.py
import math

def f1(degrees):
    return math.cos(degrees)

def f2(degrees):
    e = 2.718281828459045
    return ((e**(degrees * 1j) + e**-(degrees * 1j)) / 2).real


if __name__ == '__main__':
    print('Starting')
    for i in range(500000): 
        f1(i)
        f2(i)
    print('Finished')

Starting
Finished


<div class="post-text" itemprop="text">
        <p><strong>Real, User and Sys process time statistics</strong></p>

<p>One of these things is not like the other.  Real refers to actual elapsed time; User and Sys refer to CPU time used <em>only by the process.</em></p>

<ul>
<li><p><strong>Real</strong> is wall clock time - time from start to finish of the call.  This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).</p></li>
<li><p><strong>User</strong> is the amount of CPU time spent in user-mode code (outside the kernel) <em>within</em> the process.  This is only actual CPU time used in executing the process.  Other processes and time the process spends blocked do not count towards this figure.</p></li>
<li><p><strong>Sys</strong> is the amount of CPU time spent in the kernel within the process.  This means CPU time spent executing system calls <em>within the kernel,</em> as opposed to library code, which is still running in user-space.  Like 'user', this is only CPU time used by the process.  See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.</p></li>
</ul>

<p><code>User+Sys</code> will tell you how much actual CPU time your process used.  Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by <code>Real</code> (which usually occurs).  Note that in the output these figures include the <code>User</code> and <code>Sys</code> time of all child processes (and their descendants) as well when they could have been collected, e.g. by <code>wait(2)</code> or <code>waitpid(2)</code>, although the underlying system calls return the statistics for the process and its children separately.</p>

<p><strong>Origins of the statistics reported by <code>time (1)</code></strong></p>

<p>The statistics reported by <code>time</code> are gathered from various system calls.  'User' and 'Sys' come from <a href="https://docs.oracle.com/cd/E23823_01/html/816-5168/wait-3c.html#scrolltoc" rel="noreferrer"><code>wait (2)</code></a> (<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/wait.html" rel="noreferrer">POSIX</a>) or <a href="https://linux.die.net/man/2/times" rel="noreferrer"><code>times (2)</code></a> (<a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/times.html" rel="noreferrer">POSIX</a>), depending on the particular system.  'Real' is calculated from a start and end time gathered from the <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/gettimeofday.html" rel="noreferrer"><code>gettimeofday (2)</code></a> call.  Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by <code>time</code>.</p>

<p>On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time, as different threads or processes may run in parallel.  Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors.</p>

<p><strong>A brief primer on Kernel vs. User mode</strong></p>

<p>On Unix, or any protected-memory operating system, <a href="https://en.wikipedia.org/wiki/Kernel_mode#Supervisor_mode" rel="noreferrer">'Kernel' or 'Supervisor'</a> mode refers to a <a href="https://en.wikipedia.org/wiki/Process_management_(computing)#Processor_modes" rel="noreferrer">privileged mode</a> that the CPU can operate in.  Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code.  An example of such an action might be manipulation of the <a href="https://en.wikipedia.org/wiki/Memory_management_unit" rel="noreferrer">MMU</a> to gain access to the address space of another process.  Normally, <a href="https://en.wikipedia.org/wiki/User_space" rel="noreferrer">user-mode</a> code cannot do this (with good reason), although it can request <a href="https://en.wikipedia.org/wiki/Shared_memory" rel="noreferrer">shared memory</a> from the kernel, which <em>could</em> be read or written by more than one process.  In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.</p>

<p>The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode.  In order to switch to kernel mode you have to issue a specific instruction (often called a <a href="https://en.wikipedia.org/wiki/Trap_(computing)" rel="noreferrer"><em>trap</em></a>) that switches the CPU to running in kernel mode <em>and runs code from a specific location held in a jump table.</em>  For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode.  You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.</p>

<p>The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program.  Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode.  It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.</p>

<p><strong>More about 'sys'</strong></p>

<p>There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like <code>malloc</code> or <code>fread</code>/<code>fwrite</code> will invoke these kernel functions and that then will count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to <code>malloc</code> will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call into the kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then <code>malloc</code> will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might also use <code>malloc</code> and the like in the background, which will again have some time in 'sys' then.</p>
    </div>
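The real/user/sys split can also be observed from inside Python. The following is a minimal sketch using `os.times()` and `time.perf_counter()` from the standard library; the helper name `measure` and the `busy` workload are illustrative assumptions, and the exact numbers depend on the machine and its load.

```python
import os
import time

def measure(func, *args):
    """Return (real, user, sys) seconds consumed while running func(*args)."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    func(*args)
    wall_after = time.perf_counter()
    cpu_after = os.times()
    return (wall_after - wall_before,
            cpu_after.user - cpu_before.user,
            cpu_after.system - cpu_before.system)

def busy(n):
    # CPU-bound work: almost all of the elapsed time is user-mode CPU time
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    for label, func, arg in (('sleep', time.sleep, 0.2), ('busy', busy, 2_000_000)):
        real, user, sys_time = measure(func, arg)
        print(f'{label}: real={real:.3f} user={user:.3f} sys={sys_time:.3f}')
```

A sleeping process is blocked, so its real time grows while user and sys stay near zero; the busy loop should show user time close to real time.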

### timeit

Python comes with a module called timeit. You can use it to time small code snippets. The timeit module uses platform-specific time functions so that you will get the most accurate timings possible.

The timeit module has a command-line interface, but it can also be imported. We will start out by looking at how to use timeit from the command line. Open up a terminal and try the following examples:

    python3 -m timeit "[ord(x) for x in 'abcdfghi']"

    python3 -m timeit "[chr(int(x)) for x in '123456789']"

What’s going on here? When you call Python on the command line and pass it the “-m” option, you are telling it to look up a module and run it as the main program. The statement to time is given as a quoted argument. An optional “-s” argument supplies setup code that runs once and is not timed, so the statement you want to measure must not be passed through “-s”. timeit then picks a loop count, repeats the measurement several times (5 by default in recent Python 3 releases), and reports the best time per loop. For these silly examples, you won’t see much difference.

Your output will likely be slightly different as it is dependent on your computer’s specifications.

Let’s write a silly function and see if we can time it from the command line:

In [2]:
# simple_func.py
def my_function():
    try:
        1 / 0
    except ZeroDivisionError:
        pass

All this function does is cause an error that is promptly ignored. Yes, it’s another silly example. To get timeit to run this code on the command line, we will need to import the code into its namespace, so make sure you have changed your current working directory to be in the same folder that this script is in. Then run the following:

    python3 -m timeit "import simple_func; simple_func.my_function()"

Here we import the module and then call the function. Note that the import and the function call are separated by a semicolon and that the Python code is in quotes. Now we’re ready to learn how to use timeit inside an actual Python script.

### Importing timeit for Testing

> [timeit — Measure execution time of small code snippets](https://docs.python.org/3/library/timeit.html)

Using the timeit module inside your code is also pretty easy. We’ll use the same silly script from before and show you how below:

In [4]:
def my_function():
    try:
        1 / 0
    except ZeroDivisionError:
        pass

if __name__ == "__main__":
    import timeit
    # To give the timeit module access to functions you define, 
    # you can pass a setup parameter which contains an import statement
    setup = "from __main__ import my_function"
    print(timeit.timeit("my_function()", setup=setup, number=10000))

0.014531105000060052


Here we check to see if the script is being run directly (i.e. not imported). If it is, we import timeit, create a setup string that imports the function into timeit’s namespace, and then call timeit.timeit. We pass the statement to time as a string, followed by the setup string and the number of executions.

Note, however, that the number of repetitions is determined automatically only when the command-line interface is used. When called from code, timeit() runs the statement number times, which defaults to 1,000,000.
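If you want behaviour closer to the command-line interface from within a script, one option is `timeit.Timer.autorange()` (available since Python 3.6), which picks a loop count so that a run takes at least about 0.2 seconds. The sketch below is only an illustration; passing the callable directly instead of a statement string is a convenience chosen here, and `repeat=3` is an arbitrary value.

```python
import timeit

def my_function():
    try:
        1 / 0
    except ZeroDivisionError:
        pass

# A Timer accepts either a statement string or a zero-argument callable
timer = timeit.Timer(my_function)

# autorange() returns (number of loops, total time taken for that many loops)
number, total = timer.autorange()

# Repeat the measurement and keep the best run, as the command line does
best = min(timer.repeat(repeat=3, number=number)) / number

if __name__ == '__main__':
    print(f'{number} loops, best of 3: {best * 1e6:.3f} usec per loop')
```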

Another option is to pass globals() to the globals parameter, which will cause the code to be executed within your current global namespace. This can be more convenient than individually specifying imports.

In [None]:
def f(x):
    return x**2
def g(x):
    return x**4
def h(x):
    return x**8

if __name__ == "__main__":
    import timeit
    print(timeit.timeit('[func(42) for func in (f,g,h)]', globals=globals()))

### Use a Decorator

Writing your own timer is a lot of fun too, although it may not be as accurate as just using timeit depending on the use case.

For example, you may already know that your code spends most of its time in a few
selected functions. For selected profiling of functions, a short decorator can be useful.

In [None]:
# timethis.py
import time
from functools import wraps

def timethis(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        r = func(*args, **kwargs)
        end = time.perf_counter()
        print(f'{func.__module__}.{func.__name__} : {end - start}')
        return r
    return wrapper

To use this decorator, you simply place it in front of a function definition to get timings
from it. For example:

In [None]:
# decorator_test.py

from timethis import timethis

@timethis
def countdown(n):
    while n > 0:
        n -= 1
        
if __name__ == '__main__':
    countdown(10000000)

You will notice that the decorator accepts a function and defines another function inside of it. The nested function records the time before calling the passed-in function, waits for it to return, and then records the end time. Now we know how long the function took to run, so we print it out. The wrapper returns the result of the function call, and the decorator returns the wrapper itself; that’s what the two return statements are for.

In practice, the functions worth timing are those that connect to databases (or run large queries), fetch websites, run threads, or do other things that take a while to complete.

When making performance measurements, be aware that any results you get are approximations.
The time.perf_counter() function used in the solution provides the
highest-resolution timer possible on a given platform. However, it still measures wall-clock
time, and can be impacted by many different factors, such as machine load.

If you are interested in process time as opposed to wall-clock time, use time.process_time() instead. For example:

In [8]:
from functools import wraps
import time

def timethis_process_time(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.process_time()
        r = func(*args, **kwargs)
        end = time.process_time()
        print(f'{func.__module__}.{func.__name__} : {end - start}')
        return r
    return wrapper

@timethis_process_time
def countdown(n):
    #time.sleep(2)
    while n > 0:
        n -= 1
        
if __name__ == '__main__':
    countdown(10000000)

__main__.countdown : 1.0159407149999993


Last, but not least, if you’re going to perform detailed timing analysis, make sure to read
the documentation for the time, timeit, and other associated modules, so that you have
an understanding of important platform-related differences and other pitfalls.

> - **perf_counter()**: Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid.
> - **process_time()**: Return the value (in fractional seconds) of the sum of the system and user CPU time of the current process. It does not include time elapsed during sleep. It is process-wide by definition. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid.
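To check how these clocks behave on your own platform, the standard library offers `time.get_clock_info()`. The following is a small sketch; the helper name `describe_clocks` is an assumption made for illustration, and the reported resolution and implementation differ between operating systems.

```python
import time

def describe_clocks(names=('perf_counter', 'process_time', 'time')):
    """Return a dict mapping each clock name to its time.get_clock_info() record."""
    return {name: time.get_clock_info(name) for name in names}

if __name__ == '__main__':
    for name, info in describe_clocks().items():
        print(f'{name:13s} resolution={info.resolution:.1e}s '
              f'monotonic={info.monotonic} adjustable={info.adjustable} '
              f'implementation={info.implementation}')
```

A clock that is reported as adjustable can be changed by the system administrator or by time synchronisation, which is one reason to prefer perf_counter() over time.time() for measuring intervals.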

### Create a Timing Context Manager

To time a block of statements, you can define a context manager.

In [9]:
# contextmanager_time.py
import time
from contextlib import contextmanager

@contextmanager
def timeblock(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        print(f'{label} : {end - start}')

Here is an example of how the context manager works:

In [10]:
if __name__ == '__main__':
    with timeblock('counting'):
        n = 10000000
        while n > 0:
            n -= 1

counting : 2.248141140677035


### Making a Stopwatch Timer

You want to be able to record the time it takes to perform various tasks. The time module contains various functions for performing timing-related tasks.
However, it’s often useful to put a higher-level interface on them that mimics a
stopwatch. For example:

In [11]:
# stopwatch_timer.py
import time

class Timer:
    def __init__(self, func=time.perf_counter):
        self.elapsed = 0.0
        self._func = func
        self._start = None

    def start(self):
        if self._start is not None:
            raise RuntimeError('Already started')
        self._start = self._func()

    def stop(self):
        if self._start is None:
            raise RuntimeError('Not started')
        end = self._func()
        self.elapsed += end - self._start
        self._start = None

    def reset(self):
        self.elapsed = 0.0

    @property
    def running(self):
        return self._start is not None

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.stop()

This class defines a timer that can be started, stopped, and reset as needed by the user.
It keeps track of the total elapsed time in the elapsed attribute. Here is an example that
shows how it can be used:

In [12]:
if __name__ == '__main__':
    def countdown(n):
        while n > 0:
            n -= 1

    # Use 1: Explicit start/stop
    t = Timer()
    t.start()
    countdown(1000000)
    t.stop()
    print(t.elapsed)
    
    # Use 2: As a context manager
    with t:
        countdown(1000000)

    print(t.elapsed)

    with Timer() as t2:
        countdown(1000000)
        
    print(t2.elapsed)

0.15078234393149614
0.2844322547316551
0.10340904537588358


This recipe provides a simple yet very useful class for making timing measurements and
tracking elapsed time. It’s also a nice illustration of how to support the context-management
protocol and the with statement.

One issue in making timing measurements concerns the underlying time function used
to do it. As a general rule, the accuracy of timing measurements made with functions
such as time.time() varies according to the operating system (the older time.clock()
was deprecated in Python 3.3 and removed in Python 3.8). In contrast, the
time.perf_counter() function always uses the highest-resolution timer available on the
system.

As shown, the time recorded by the Timer class is made according to wall-clock time,
and includes all time spent sleeping. If you only want the amount of CPU time used by
the process, use time.process_time() instead. For example:

In [None]:
t = Timer(time.process_time)
with t:
    countdown(1000000)
print(t.elapsed)

Both time.perf_counter() and time.process_time() return a “time” in fractional
seconds. However, the actual value of the time doesn’t have any particular meaning. To
make sense of the results, you have to call the functions twice and compute a time
difference.

## Profiling Your Program

A profile is a set of statistics that describes how often and for how long various parts of the program executed. So if you are trying to optimize a script’s runtime, or a particular function is taking too much time, you can profile the script to narrow down the issue. These statistics can be formatted into reports via the pstats module.

> The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes (for that, there is timeit for reasonably accurate results). This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions, and so the C code would seem faster than any Python one.

### cProfile

> - Tracks time spent in functions.
> - Not intended for production use, because profiling adds overhead.

The Python standard library provides two different implementations of the same profiling interface:
1. cProfile is recommended for most users; it’s a C extension with reasonable overhead that makes it suitable for profiling long-running programs. Based on lsprof, contributed by Brett Rosen and Ted Czotter.
2. profile, a pure Python module whose interface is imitated by cProfile, but which adds significant overhead to profiled programs. If you’re trying to extend the profiler in some way, the task might be easier with this module.

[The Python Profilers](https://docs.python.org/3.8/library/profile.html)

Python comes with its own code profilers built-in. There is the profile module and the cProfile module. The profile module is pure Python, but it will add a lot of overhead to anything you profile, so it’s usually recommended that you go with cProfile, which has a similar interface but is much faster.

You can call cProfile on the command line in much the same way as we did with the timeit module. The main difference is that you pass a Python script to it instead of just a snippet:

    python3 -m cProfile -o someprogram_results.cprof someprogram.py

In [None]:
import pstats

data = pstats.Stats('skripte/someprogram_results.cprof')
data.sort_stats('cumulative').print_stats('someprogram') 
# cumtime: the time spent in this function plus all the functions it called

In [13]:
data.sort_stats('tottime').print_stats('someprogram') 

Sun Apr 12 09:06:49 2020    skripte/someprogram_results.cprof

         1500067 function calls in 1.121 seconds

   Ordered by: internal time
   List reduced from 51 to 4 due to restriction <'someprogram'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   500000    0.638    0.000    0.638    0.000 someprogram.py:6(f2)
        1    0.239    0.239    1.121    1.121 someprogram.py:14(main)
   500000    0.153    0.000    0.244    0.000 someprogram.py:3(f1)
        1    0.000    0.000    1.121    1.121 someprogram.py:1(<module>)




<pstats.Stats at 0x7fbc49e60d30>

[snakeviz](https://jiffyclub.github.io/snakeviz) is a graphical viewer for cProfile results.

In [6]:
# profiler_test.py
import cProfile
cProfile.run("[x for x in range(150000)]")

         4 function calls in 0.023 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.020    0.020    0.020    0.020 <string>:1(<listcomp>)
        1    0.003    0.003    0.023    0.023 <string>:1(<module>)
        1    0.000    0.000    0.023    0.023 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




Let’s break this down a bit. The first line shows that there were four function calls. The next line tells us how the results are ordered. According to the documentation, standard name refers to the far right column. There are a number of columns here:
- **ncalls** is the number of calls made.
- **tottime** is the total time spent in the given function, excluding time spent in calls to sub-functions.
- **percall** is the quotient of tottime divided by ncalls.
- **cumtime** is the cumulative time spent in this and all subfunctions. It’s even accurate for recursive functions!
- The **second percall** column is the quotient of cumtime divided by primitive calls.
- **filename:lineno(function)** provides the respective data of each function.
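The same columns can be produced without leaving Python by driving `cProfile.Profile` and `pstats.Stats` directly. The sketch below makes a few assumptions: the helper name `profile_report` and the `workload` function are made up for illustration, and `f1`/`f2` are modelled on the someprogram.py example.

```python
import cProfile
import io
import math
import pstats

def f1(degrees):
    return math.cos(degrees)

def f2(degrees):
    e = math.e
    return ((e ** (degrees * 1j) + e ** -(degrees * 1j)) / 2).real

def workload(n):
    for i in range(n):
        f1(i)
        f2(i)

def profile_report(func, *args, limit=5):
    """Profile func(*args) and return the pstats report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # strip_dirs() shortens file paths; sort by cumulative time, show `limit` rows
    stats.strip_dirs().sort_stats('cumulative').print_stats(limit)
    return stream.getvalue()

if __name__ == '__main__':
    print(profile_report(workload, 20000))
```

Sorting by 'cumulative' puts the callers that dominate the run at the top, while sorting by 'tottime' highlights the functions whose own code is expensive.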

### line_profiler

There’s a neat 3rd party project called line_profiler that is designed to profile the time each individual line takes to execute. It also includes a script called kernprof for profiling Python applications and scripts using line_profiler. Just use pip to install the package. Here’s how:

    pip install line_profiler

To actually use the line_profiler, we will need some code to profile. But first, I need to explain how line_profiler works when you call it on the command line. You will actually be calling line_profiler by calling the kernprof script. I thought that was a bit confusing the first time I used it, but that’s just the way it works. Here’s the normal way to use it:

    kernprof -l silly_functions.py

This will print out the following message when it finishes: Wrote profile results to silly_functions.py.lprof. This is a binary file that we can’t view directly. When we run kernprof though, it will actually inject an instance of LineProfiler into your script’s `__builtins__` namespace. The instance will be named profile and is meant to be used as a decorator. With that in mind, we can actually write our script:

In [None]:
# silly_functions.py
import time
@profile
def fast_function():
    print("I'm a fast function!")
@profile
def slow_function():
    time.sleep(2)
    print("I'm a slow function")
if __name__ == '__main__':
    fast_function()
    slow_function()

So now we have two functions decorated with something that isn’t imported. If you try to run this script directly with the Python interpreter, you will get a NameError because “profile” is not defined, so remember to remove the decorators after you have profiled your code. Run the script through kernprof instead:

    kernprof -l silly_functions.py
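An alternative to removing the decorators is to define a pass-through fallback when `profile` has not been injected. This is a common workaround rather than something line_profiler requires; the sketch assumes kernprof places `profile` in the `builtins` module, as described above, and the shortened sleep and the return values are only there to keep the example quick to run.

```python
import builtins
import time

# Under `kernprof -l`, `profile` is found in builtins. When the script is run
# with plain Python it is missing, so define a decorator that changes nothing.
if not hasattr(builtins, 'profile'):
    def profile(func):
        return func

@profile
def fast_function():
    return 'fast'

@profile
def slow_function():
    time.sleep(0.1)
    return 'slow'

if __name__ == '__main__':
    print(fast_function())
    print(slow_function())
```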

Let’s back up and learn how to actually view the results of our profiler. There are two methods we can use. The first is to use the line_profiler module to read our results file:

    python3 -m line_profiler silly_functions.py.lprof

The alternate method is to just use kernprof in verbose mode by passing it -v:

    kernprof -l -v silly_functions.py

Regardless of which method you use, you should end up seeing something like the following printed to your screen:

    I'm a fast function!
    I'm a slow function
    Wrote profile results to silly_functions.py.lprof
    Timer unit: 1e-06 s
    Total time: 3.4e-05 s
    File: silly_functions.py
    Function: fast_function at line 3
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         3                                           @profile
         4                                           def fast_function():
         5         1           34     34.0    100.0      print("I'm a fast function!")
    Total time: 2.001 s
    File: silly_functions.py
    Function: slow_function at line 7
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         7                                           @profile
         8                                           def slow_function():
         9         1      2000942 2000942.0    100.0      time.sleep(2)
        10         1           59     59.0      0.0      print("I'm a slow function")

You will notice that the source code is printed out with the timing information for each line. There are six columns of information here. Let’s find out what each one means.

<ul>
 <li><strong>Line #</strong> – The line number of the code that was profiled</li>
 <li><strong>Hits</strong> – The number of times that particular line was executed</li>
 <li><strong>Time</strong> – The total amount of time the line took to execute (in the timer’s unit). The timer unit can be seen at the beginning of the output</li>
 <li><strong>Per Hit</strong> – The average amount of time that line of code took to execute (in timer units)</li>
 <li><strong>% Time</strong> – The percentage of time spent on the line relative to the total amount of time spent in said function</li>
 <li><strong>Line Contents</strong> – The actual source code that was executed</li>
</ul>

If you happen to be an IPython user, then you might want to know that line_profiler provides an IPython magic command (%lprun, available after running %load_ext line_profiler) that allows you to specify functions to profile and even which statement to execute.

### memory_profiler

Another great 3rd party profiling package is memory_profiler. The memory_profiler module can be used for monitoring memory consumption in a process, or you can use it for a line-by-line analysis of the memory consumption of your code. Since it’s not included with Python, we’ll have to install it. You can use pip for this:

    
    pip install memory_profiler

Once it’s installed, we need some code to run it against. The memory_profiler actually works in much the same way as line_profiler in that when you run it, memory_profiler will inject an instance of itself into `__builtins__` named profile that you are supposed to use as a decorator on the function you are profiling. Here’s a simple example:

In [None]:
# memo_prof.py 
@profile
def mem_func():
    lots_of_numbers = list(range(1500))
    x = ['letters'] * (5 ** 10)
    del lots_of_numbers
    return None
if __name__ == '__main__':
    mem_func()

In this example, we create a list that contains 1500 integers. Then we create a list with 9765625 (5 to the 10th power) references to a string. Finally we delete the first list and return. Unlike line_profiler, memory_profiler doesn’t need a separate script to do the actual profiling. Instead you can just run Python and use its **-m** option on the command line to load the module and run it against our script:

    python3 -m memory_profiler memo_prof.py 

The columns of the resulting report are pretty self-explanatory. We have the line numbers and then the amount of memory used after that line was executed. Next comes an increment field, which tells us the difference in memory between the current line and the previous one. The very last column is the code itself.
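To get a feel for what the increment figure represents without installing anything, you can take a similar before/after measurement with the standard library’s `tracemalloc` module. Note that this swaps in a different tool: tracemalloc counts memory allocated by Python objects, whereas memory_profiler reports the memory of the whole process, so the numbers are not directly comparable. The helper name `traced` is an assumption, and a smaller list (5 ** 8 items) is used to keep the example light.

```python
import tracemalloc

def traced(build):
    """Run build() and return (result, bytes still allocated, peak bytes above start)."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        result = build()
        after, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, after - before, peak - before

if __name__ == '__main__':
    x, increment, peak = traced(lambda: ['letters'] * (5 ** 8))
    print(f'increment: {increment / 2**20:.2f} MiB, peak: {peak / 2**20:.2f} MiB')
```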

The memory_profiler also includes mprof which can be used to create full memory usage reports over time instead of line-by-line. It’s very easy to use; just take a look:

    mprof run memo_prof.py

mprof can also create a graph that shows you how your application consumed memory over time. To get the graph, all you need to do is:

    mprof plot

### profilehooks

The last third party package that we will look at in this chapter is called profilehooks. It is a collection of decorators specifically designed for profiling functions. To install profilehooks, just do the following:

    pip install profilehooks

Now that we have it installed, let’s re-use the example from the last section and modify it slightly to use profilehooks:

In [None]:
# profhooks.py
from profilehooks import profile

@profile
def mem_func():
    lots_of_numbers = list(range(1500))
    x = ['letters'] * (5 ** 10)
    del lots_of_numbers
    return None
if __name__ == '__main__':
    mem_func()

All you need to do to use profilehooks is import it and then decorate the function that you want to profile. If you run the code above, you will get output similar to the following sent to stdout:

    *** PROFILER RESULTS ***
    mem_func (c:\Users\mike\Dropbox\Scripts\py3\profhooks.py:3)
    function called 1 times
             3 function calls in 0.096 seconds
       Ordered by: cumulative time, internal time, call count
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.096    0.096    0.096    0.096 profhooks.py:3(mem_func)
            1    0.000    0.000    0.000    0.000 {range}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            0    0.000             0.000          profile:0(profiler)

The output for this package appears to follow that of the cProfile module from Python’s standard library. You can refer to the descriptions of the columns earlier in this chapter to see what these mean. The profilehooks package has two more decorators. The first one we will look at is called timecall. It gives us the coarse run time of the function:

In [None]:
# profhooks2.py
from profilehooks import timecall

@timecall
def mem_func():
    lots_of_numbers = list(range(1500))
    x = ['letters'] * (5 ** 10)
    del lots_of_numbers
    return None

if __name__ == '__main__':
    mem_func()

Finally I just want to mention that you can also run profilehooks on the command line using Python’s -m flag:

    
    python -m profilehooks mymodule.py

## Making Your Programs Run Faster

Your program runs too slow and you’d like to speed it up without the assistance of more
extreme solutions, such as C extensions or a just-in-time (JIT) compiler.

While the first rule of optimization might be to “not do it,” the second rule is almost
certainly “don’t optimize the unimportant.” To that end, if your program is running slow,
you might start by profiling your code. 

More often than not, you’ll find that your program spends its time in a few hotspots,
such as inner data processing loops. Once you’ve identified those locations, you can use
the no-nonsense techniques presented in the following sections to make your program
run faster.
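
As a minimal sketch of finding a hotspot, the following uses `cProfile` and `pstats` from the standard library. The functions `slow_inner`, `fast_helper`, and `workload` are made-up stand-ins for your own code.

```python
import cProfile
import io
import pstats


def slow_inner(n):
    # Deliberately heavy inner loop: this is the hotspot we expect to see.
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_helper():
    return sum(range(100))


def workload():
    fast_helper()
    return slow_inner(200_000)


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the report in a string so it can be printed or inspected.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the printed table, the rows with the largest `tottime` are usually the places worth optimizing first.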

### Use functions

A lot of programmers start using Python as a language for writing simple scripts. When
writing scripts, it is easy to fall into a practice of simply writing code with very little
structure. For example:

In [None]:
# somescript.py
import sys
import csv

with open(sys.argv[1]) as f:
    for row in csv.reader(f):
        # Some kind of processing
        pass

A little-known fact is that code defined in the global scope like this runs slower than
code defined in a function. The speed difference has to do with the implementation of
local versus global variables (operations involving locals are faster). So, if you want to
make the program run faster, simply put the scripting statements in a function:

In [None]:
# somescript.py
import sys
import csv

def main(filename):
    with open(filename) as f:
        for row in csv.reader(f):
            # Some kind of processing
            pass
        
main(sys.argv[1])

The speed difference depends heavily on the processing being performed, but in our
experience, speedups of 15-30% are not uncommon.

**Example**

In [None]:
# 01_function_no.py
n = 1000000

while n > 0:
    n -= 1

In [None]:
# 01_function_yes.py
def countdown(n):
    while n > 0:
        n -= 1

countdown(1000000)

    time python 01_function_no.py
    time python 01_function_yes.py
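
If you would rather compare the two variants from a single script, the following sketch compiles each one as module-level source and times it with `time.perf_counter()`. The helper names `run_once` and `best_of` are assumptions made for this illustration, and the loop count is reduced so the comparison finishes quickly.

```python
import time

# Same loop as 01_function_no.py: n is a module-level (global) name.
GLOBAL_SRC = """
n = 200000
while n > 0:
    n -= 1
"""

# Same loop as 01_function_yes.py: n is a local variable of countdown().
FUNC_SRC = """
def countdown(n):
    while n > 0:
        n -= 1

countdown(200000)
"""


def run_once(src):
    """Execute src as if it were a module and return the elapsed seconds."""
    code = compile(src, "<bench>", "exec")
    namespace = {}
    start = time.perf_counter()
    exec(code, namespace)
    return time.perf_counter() - start


def best_of(src, repeats=5):
    """Take the minimum of several runs to reduce timing noise."""
    return min(run_once(src) for _ in range(repeats))


global_time = best_of(GLOBAL_SRC)
func_time = best_of(FUNC_SRC)
print(f"global scope: {global_time:.4f} s")
print(f"in function:  {func_time:.4f} s")
```

The exact ratio depends on the Python version and the machine, so the numbers are best read as a relative comparison.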

### Selectively eliminate attribute access

Every use of the dot (.) operator to access attributes comes with a cost. Under the covers,
this triggers special methods, such as `__getattribute__()` and `__getattr__()`, which
often lead to dictionary lookups.

You can often avoid attribute lookups by using the from module import name form of
import as well as making selected use of bound methods. To illustrate, consider the
following code fragment:

In [None]:
# 02_access.py
import math

def compute_roots(nums):
    result = []
    for n in nums:
        result.append(math.sqrt(n))
    return result

# Test
nums = range(1000000)
for n in range(10):
    r = compute_roots(nums)

When tested on our machine, this program runs in about 40 seconds. Now change the
compute_roots() function as follows:

In [None]:
# 02_noaccess.py
from math import sqrt

def compute_roots(nums):
    result = []
    result_append = result.append
    for n in nums:
        result_append(sqrt(n))
    return result
    
# Test
nums = range(1000000)
for n in range(10):
    r = compute_roots(nums)

This version runs in about 29 seconds. The only difference between the two versions of
code is the elimination of attribute access. Instead of using math.sqrt(), the code uses
sqrt(). The result.append() method is additionally placed into a local variable result_append and reused in the inner loop.

However, it must be emphasized that these changes only make sense in frequently executed
code, such as loops. So, this optimization really only makes sense in carefully
selected places.
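
To check the effect on your own machine, a small `timeit` harness along these lines can run both variants side by side. The function names `roots_attr` and `roots_local` are made up for this sketch, and the input is smaller than in the scripts above so it completes quickly.

```python
import math
from math import sqrt
from timeit import timeit


def roots_attr(nums):
    # Attribute lookups (math.sqrt, result.append) on every iteration.
    result = []
    for n in nums:
        result.append(math.sqrt(n))
    return result


def roots_local(nums):
    # sqrt is imported directly; the bound method is looked up only once.
    result = []
    result_append = result.append
    for n in nums:
        result_append(sqrt(n))
    return result


nums = range(100_000)
t_attr = timeit(lambda: roots_attr(nums), number=10)
t_local = timeit(lambda: roots_local(nums), number=10)
print(f"with attribute access:    {t_attr:.3f} s")
print(f"without attribute access: {t_local:.3f} s")
```

Both functions return the same list, so any difference in the timings comes from the lookups alone.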

### Understand locality of variables

As previously noted, local variables are faster than global variables. For frequently accessed
names, speedups can be obtained by making those names as local as possible.
For example, consider this modified version of the compute_roots() function just
discussed:

In [None]:
import math

def compute_roots(nums):
    sqrt = math.sqrt
    result = []
    result_append = result.append
    for n in nums:
        result_append(sqrt(n))
    return result

# Test
nums = range(1000000)
for n in range(10):
    r = compute_roots(nums)

In this version, sqrt has been lifted from the math module and placed into a local
variable. If you run this code, it now runs in about 25 seconds (an improvement over
the previous version, which took 29 seconds). That additional speedup is due to a local
lookup of sqrt being a bit faster than a global lookup of sqrt.

Locality arguments also apply when working in classes. In general, looking up a value
such as self.name will be considerably slower than accessing a local variable. In inner
loops, it might pay to lift commonly accessed attributes into a local variable. For example:

In [None]:
# Slower
class SomeClass:
    # ...
    def method(self):
        for x in s:
            op(self.value)

# Faster
class SomeClass:
    # ...
    def method(self):
        value = self.value
        for x in s:
            op(value)
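
The fragment above is schematic (`s` and `op` are not defined). A runnable sketch of the same idea might look like this; the class names `SlowSum` and `FastSum` and the summing loop are assumptions chosen only to make the comparison measurable.

```python
from timeit import timeit


class SlowSum:
    def __init__(self, value):
        self.value = value

    def total(self, count):
        acc = 0
        for _ in range(count):
            acc += self.value  # attribute lookup on every iteration
        return acc


class FastSum:
    def __init__(self, value):
        self.value = value

    def total(self, count):
        value = self.value  # lifted into a local variable once
        acc = 0
        for _ in range(count):
            acc += value
        return acc


slow = SlowSum(3)
fast = FastSum(3)
t_slow = timeit(lambda: slow.total(100_000), number=20)
t_fast = timeit(lambda: fast.total(100_000), number=20)
print(f"self.value in loop: {t_slow:.3f} s")
print(f"local variable:     {t_fast:.3f} s")
```

Note that lifting is only safe when the attribute does not change while the loop is running.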

### Avoid gratuitous abstraction

Any time you wrap up code with extra layers of processing, such as decorators, properties,
or descriptors, you’re going to make it slower. As an example, consider this class:

In [1]:
class A:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    @property
    def y(self):
        return self._y
    
    @y.setter
    def y(self, value):
        self._y = value

Now, try a simple timing test:

In [2]:
from timeit import timeit
a = A(1,2)

In [5]:
timeit('a.x', 'from __main__ import a')

0.11272381200024029

In [6]:
timeit('a.y', 'from __main__ import a')

0.2618356349998976

As you can observe, accessing the property y is not just slightly slower than a simple
attribute x; in the run shown above it’s about 2.3 times slower. If this difference matters, you should ask yourself
if the definition of y as a property was really necessary. If not, simply get rid of it and
go back to using a simple attribute instead. Just because it might be common for programs
in another programming language to use getter/setter functions, that doesn’t
mean you should adopt that programming style for Python.

### Use the built-in containers

Built-in data types such as strings, tuples, lists, sets, and dicts are all implemented in C,
and are rather fast. If you’re inclined to make your own data structures as a replacement
(e.g., linked lists, balanced trees, etc.), it may be rather difficult if not impossible to match
the speed of the built-ins. Thus, you’re often better off just using them.

[collections — Container datatypes](https://docs.python.org/3.8/library/collections.html)
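
As a small illustration, the sketch below uses two containers from `collections` where one might otherwise be tempted to hand-roll a linked-list queue or a counting loop. The sample data is invented for the example.

```python
from collections import Counter, deque

# deque gives O(1) appends and pops at both ends, so it works well as a queue.
queue = deque(["a", "b", "c"])
queue.append("d")
first = queue.popleft()

# Counter tallies hashable items without a manual dictionary loop.
words = "the quick brown fox jumps over the lazy dog the end".split()
counts = Counter(words)
most_common = counts.most_common(1)

print(first, list(queue))
print(most_common)
```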

### Avoid making unnecessary data structures or copies

Sometimes programmers get carried away with making unnecessary data structures
when they just don’t have to. For example, someone might write code like this:

In [9]:
sequence = tuple(range(50))

values = [x for x in sequence]
squares = [x*x for x in values]

Perhaps the thinking here is to first collect a bunch of values into a list and then to start
applying operations such as list comprehensions to it. However, the first list is completely
unnecessary. Simply write the code like this:

In [10]:
squares = [x*x for x in sequence]

Related to this, be on the lookout for code written by programmers who are overly
paranoid about Python’s sharing of values. Overuse of functions such as copy.deepcopy()
may be a sign of code that’s been written by someone who doesn’t fully understand
or trust Python’s memory model. In such code, it may be safe to eliminate many
of the copies.
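
The following sketch shows when a copy matters and when it does not. The dictionaries are made-up sample data. Plain assignment shares the object, a shallow copy is enough when the values are immutable, and `copy.deepcopy()` is only needed when nested mutable objects must be isolated.

```python
import copy

portfolio = {"AAPL": 534.22, "IBM": 91.1}

alias = portfolio          # no copy: both names refer to the same dict
shallow = dict(portfolio)  # new dict; the float values are immutable, so this is safe
shallow["AAPL"] = 0.0      # does not affect portfolio

nested = {"AAPL": [100, 50]}
shallow_nested = dict(nested)        # inner list is still shared
deep_nested = copy.deepcopy(nested)  # inner list is duplicated
nested["AAPL"].append(25)

print(alias is portfolio, portfolio["AAPL"])
print(shallow_nested["AAPL"], deep_nested["AAPL"])
```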

### Discussion

Before optimizing, it’s usually worthwhile to study the algorithms that you’re using first.
You’ll get a much bigger speedup by switching to an O(n log n) algorithm than by
trying to tweak the implementation of an O(n**2) algorithm.
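
As a hedged illustration, consider checking a list for duplicates. The function names are invented for this sketch. The first version compares every pair, which is O(n**2). The second sorts first and compares neighbors, which is O(n log n).

```python
from timeit import timeit


def has_duplicates_quadratic(items):
    # Compare every pair: O(n**2) comparisons in the worst case.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_sorted(items):
    # Sort once (O(n log n)), then equal items must be adjacent.
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


data = list(range(1000))  # worst case: no duplicates at all
t_quadratic = timeit(lambda: has_duplicates_quadratic(data), number=1)
t_sorted = timeit(lambda: has_duplicates_sorted(data), number=1)
print(f"quadratic: {t_quadratic:.4f} s")
print(f"sorted:    {t_sorted:.4f} s")
```

The gap widens quickly as the input grows, which no amount of tuning the nested loops would recover.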

If you’ve decided that you still must optimize, it pays to consider the big picture. As a
general rule, you don’t want to apply optimizations to every part of your program,
because such changes are going to make the code hard to read and understand. Instead,
focus only on known performance bottlenecks, such as inner loops.

You need to be especially wary when interpreting the results of micro-optimizations. For
example, consider these two techniques for creating a dictionary:
    

In [11]:
a = {
'name' : 'AAPL',
'shares' : 100,
'price' : 534.22
}

In [12]:
b = dict(name='AAPL', shares=100, price=534.22)

The latter choice has the benefit of less typing (you don’t need to quote the key names).
However, if you put the two code fragments in a head-to-head performance battle, you’ll
find that using dict() runs three times slower! With this knowledge, you might be
inclined to scan your code and replace every use of dict() with its more verbose alternative.
However, a smart programmer will only focus on parts of a program where
it might actually matter, such as an inner loop. In other places, the speed difference just
isn’t going to matter at all.
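
To run that head-to-head comparison yourself, a `timeit` sketch along these lines should do. The measured ratio varies with the Python version and hardware, so it may not match the figure quoted above.

```python
from timeit import timeit

a = {'name': 'AAPL', 'shares': 100, 'price': 534.22}
b = dict(name='AAPL', shares=100, price=534.22)

# Each statement is executed one million times.
literal_time = timeit(
    "{'name': 'AAPL', 'shares': 100, 'price': 534.22}", number=1_000_000
)
call_time = timeit(
    "dict(name='AAPL', shares=100, price=534.22)", number=1_000_000
)

print(f"literal: {literal_time:.3f} s")
print(f"dict():  {call_time:.3f} s")
```

Even when the difference is large in relative terms, it amounts to a fraction of a microsecond per dictionary, which is why it only matters in hot loops.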

If, on the other hand, your performance needs go far beyond the simple techniques in
this recipe, you might investigate the use of tools based on just-in-time (JIT) compilation
techniques. For example, the PyPy project is an alternate implementation of the Python interpreter 
that analyzes the execution of your program and generates native machine
code for frequently executed parts. It can sometimes make Python programs run an
order of magnitude faster, often approaching (or even exceeding) the speed of code
written in C. When this text was first written, PyPy did not yet fully support Python 3;
the project has since shipped Python 3-compatible releases, so check its current
compatibility notes before ruling it out. You might also consider the Numba
project. Numba is a dynamic compiler where you annotate selected Python functions
that you want to optimize with a decorator. Those functions are then compiled into
native machine code through the use of LLVM. It too can produce significant performance
gains. Early Python 3 support in Numba was somewhat experimental, but current
releases target Python 3.

Last, but not least, the words of John Ousterhout come to mind: “The best performance
improvement is the transition from the nonworking to the working state.” Don’t worry
about optimization until you need to. Making sure your program works correctly is
usually more important than making it run fast (at least initially).