# VM Aware Programming in Python

> Sources: Materials from the following write-ups were modified for this in-class example:<br>
> 1. <a href = "https://www.codementor.io/@satwikkansal/python-practices-for-efficient-code-performance-memory-and-usability-aze6oiq65">Python Practices for Efficient Code: Performance, Memory, and Usability</a><br>
> 2. <a href = "https://realpython.com/python-memory-management/">Python Memory Management</a>

### Revisiting Loop Unrolling and Registers

Because Python runs on an interpreter that hides memory management from the programmer, some of our techniques will not improve performance. However, other techniques will!

In [3]:
def func( count, value ):
    return count + value

In [4]:
def no_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        for count in range(0, 5):
            
            the_array[idx] = func( count, the_array[idx] )
            sum_val += the_array[idx]

Because CPython compiles our code to bytecode and runs it on a virtual machine rather than directly on the hardware, optimizations that work in C or C++, such as holding a value in an intermediate register, have only a modest impact on computing performance in Python.

In [5]:
def reg_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        arr_idx = the_array[idx]
        
        for count in range(0, 5):
            
            arr_idx = func( count, arr_idx )
            sum_val += arr_idx
            
        the_array[idx] = arr_idx

However, techniques such as loop unrolling do have a measurable impact. The Python interpreter still needs to interact with instructions across multiple cache blocks or pages, and removing the inner loop removes its per-iteration iterator and branch overhead, which likely also reduces branch prediction misses.

In [6]:
def unroll_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        arr_idx = the_array[idx]
        
        arr_idx = func( 0, arr_idx )
        arr_idx = func( 1, arr_idx )
        arr_idx = func( 2, arr_idx )
        arr_idx = func( 3, arr_idx )
        arr_idx = func( 4, arr_idx )
        
        the_array[idx] = arr_idx
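
One way to see what unrolling removes is to compare the bytecode of a looped and an unrolled function. The sketch below is only an illustration: `add_looped`, `add_unrolled`, and `opnames` are hypothetical names, and the addition is written inline rather than through `func` so that the loop is the only difference between the two versions.

```python
import dis

def add_looped(value):
    # Same shape as the inner loop in no_opt / reg_opt
    for count in range(0, 5):
        value = count + value
    return value

def add_unrolled(value):
    # Same arithmetic with the loop written out by hand
    value = 0 + value
    value = 1 + value
    value = 2 + value
    value = 3 + value
    value = 4 + value
    return value

def opnames(function):
    """Return the list of bytecode instruction names for a function."""
    return [instr.opname for instr in dis.get_instructions(function)]

looped_ops = opnames(add_looped)
unrolled_ops = opnames(add_unrolled)

# FOR_ITER is the instruction that advances a for loop's iterator
print("FOR_ITER in looped:  ", looped_ops.count("FOR_ITER"))
print("FOR_ITER in unrolled:", unrolled_ops.count("FOR_ITER"))
```

The looped version executes `FOR_ITER` and rebinds `count` on every pass, while the unrolled version contains no loop instructions at all. The exact instruction names outside of `FOR_ITER` vary between Python versions.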

Python's interpreter does not use a preprocessor, so there are no explicit macros. However, you will be able to observe that writing a macro equivalent (that is, writing the operation inline instead of calling the function) improves performance, because it avoids the overhead of a function call.

> Note: When writing Python code in industry, be sure to adhere to your company's standards. Modularity and code cleanliness are important, especially if they do not mind a performance tradeoff. However, if they do need improved performance, you have another tool in your toolkit.

In [7]:
def macro_equiv_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        arr_idx = the_array[idx]
        
        arr_idx = 0 + arr_idx
        arr_idx = 1 + arr_idx
        arr_idx = 2 + arr_idx
        arr_idx = 3 + arr_idx
        arr_idx = 4 + arr_idx
        
        the_array[idx] = arr_idx
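
The cost of the call can be made visible by disassembling a version that calls `func` and a version that adds inline. This is a minimal sketch: `with_call`, `inlined`, and `call_ops` are hypothetical names, and the filter on `"CALL"` is an assumption that covers the call instruction names used by recent CPython releases (for example `CALL_FUNCTION` or `CALL`).

```python
import dis

def func(count, value):
    return count + value

def with_call(value):
    # Goes through the function-call machinery
    return func(1, value)

def inlined(value):
    # The "macro equivalent": the body of func written in place
    return 1 + value

def call_ops(function):
    """Names of the call-related instructions in a function's bytecode."""
    return [
        instr.opname
        for instr in dis.get_instructions(function)
        if "CALL" in instr.opname
    ]

print("with_call:", call_ops(with_call))
print("inlined:  ", call_ops(inlined))
```

Every call instruction makes the interpreter set up a new frame for `func`, run its bytecode, and return, which is work the inlined version never does.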

In [8]:
def test_opt( array_test_size ):
    
    the_array = [0] * array_test_size

    print("No Opt")
    %timeit -r1 no_opt( array_test_size, the_array )
    
    print("Reg Opt")
    %timeit -r1 reg_opt( array_test_size, the_array )
    
    print("Unroll Opt")
    %timeit -r1 unroll_opt( array_test_size, the_array )
    
    print("Macro Equivalent Opt")
    %timeit -r1 macro_equiv_opt( array_test_size, the_array )

In [9]:
test_size = 1024
test_opt( test_size )

No Opt
971 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Reg Opt
800 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Unroll Opt
473 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Macro Equivalent Opt
209 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)


In [10]:
test_size = 2048
test_opt( test_size )

No Opt
1.93 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Reg Opt
1.61 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Unroll Opt
944 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Macro Equivalent Opt
418 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)


In [11]:
test_size = 16384
test_opt( test_size )

No Opt
15.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
Reg Opt
12.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
Unroll Opt
7.56 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
Macro Equivalent Opt
3.49 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
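
The `%timeit` magic only works inside IPython or Jupyter, and with `-r1` there is a single run, so the reported standard deviation is always 0. As a rough sketch of the same kind of measurement in a plain script, the standard-library `timeit` module can repeat the timing several times. The names `with_call`, `inlined`, and `best_time` are hypothetical, and the two functions are simplified stand-ins for the versions above, not the same code.

```python
import timeit

def func(count, value):
    return count + value

def with_call(the_array):
    # One function call per element
    for idx in range(len(the_array)):
        the_array[idx] = func(1, the_array[idx])

def inlined(the_array):
    # Same arithmetic written inline
    for idx in range(len(the_array)):
        the_array[idx] = 1 + the_array[idx]

def best_time(function, array_size, repeat=5, number=100):
    """Best per-call time in seconds over several repeated timings."""
    the_array = [0] * array_size
    timings = timeit.repeat(
        lambda: function(the_array), repeat=repeat, number=number
    )
    # The minimum is the run least disturbed by other activity on the machine
    return min(timings) / number

for name, function in [("with_call", with_call), ("inlined", inlined)]:
    print(f"{name}: {best_time(function, 1024) * 1e6:.1f} µs per call")
```

The absolute numbers depend on the machine and the Python version, so they should be compared against each other rather than against the timings shown above.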


# Python Memory Management

Python is an <b>interpreted programming language</b>. Your Python code actually gets compiled down to more computer-readable instructions called <b>bytecode</b>. These instructions get interpreted by a virtual machine when you run your code.
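
The compile step can also be triggered by hand with the built-in `compile()` function. The snippet below is a small sketch of that idea; the expression and the variable values are arbitrary choices made for illustration.

```python
source = "x + y"

# compile() turns source text into a code object that holds bytecode
code_object = compile(source, "<example>", "eval")

# The raw bytecode is stored as a bytes object on the code object
print(type(code_object.co_code))

# The names the bytecode will look up when it runs
print(code_object.co_names)

# The virtual machine runs the bytecode when the code object is evaluated
result = eval(code_object, {"x": 100, "y": 25})
print(result)
```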

> Have you ever seen a <code>.pyc</code> file or a <code>__pycache__</code> folder? Those hold the bytecode that gets interpreted by the virtual machine. Let's generate some disassembled bytecode!

In [18]:
def simple():
    x = int(input("x? "))
    y = int(input("y? "))

    print(x + y)

In [19]:
# Next, we will import the dis library, which will help us view the bytecode
import dis

In [20]:
# Run example
simple()

x? 100
y? 25
125


In [21]:
# By passing simple to dis.dis, we can see the bytecode the compiler generated for it
# Functions are put in cache and then are cleared when done (like cramming for an exam)
dis.dis(simple)

  2           0 LOAD_GLOBAL              0 (int)
              2 LOAD_GLOBAL              1 (input)
              4 LOAD_CONST               1 ('x? ')
              6 CALL_FUNCTION            1
              8 CALL_FUNCTION            1
             10 STORE_FAST               0 (x)

  3          12 LOAD_GLOBAL              0 (int)
             14 LOAD_GLOBAL              1 (input)
             16 LOAD_CONST               2 ('y? ')
             18 CALL_FUNCTION            1
             20 CALL_FUNCTION            1
             22 STORE_FAST               1 (y)

  5          24 LOAD_GLOBAL              2 (print)
             26 LOAD_FAST                0 (x)
             28 LOAD_FAST                1 (y)
             30 BINARY_ADD
             32 CALL_FUNCTION            1
             34 POP_TOP
             36 LOAD_CONST               0 (None)
             38 RETURN_VALUE
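
The listing reads most easily as a program for a stack machine: loads push values, and `BINARY_ADD` pops two values and pushes their sum. The toy interpreter below is a sketch of that idea only. It is not how CPython is implemented, `run` is a hypothetical name, and it handles just four of the instruction names that appear in the listing.

```python
def run(instructions, local_values):
    """Interpret a tiny subset of stack instructions; a toy, not CPython."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            # Push the value of a local variable
            stack.append(local_values[arg])
        elif opname == "LOAD_CONST":
            # Push a constant
            stack.append(arg)
        elif opname == "BINARY_ADD":
            # Pop two operands and push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            # Hand the top of the stack back to the caller
            return stack.pop()
        else:
            raise ValueError(f"unsupported instruction: {opname}")
    raise ValueError("program ended without RETURN_VALUE")

# The x + y portion of the listing, written as (opname, argument) pairs
program = [
    ("LOAD_FAST", "x"),
    ("LOAD_FAST", "y"),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]

result = run(program, {"x": 100, "y": 25})
print(result)
```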


In [27]:
bytecode = dis.Bytecode(simple)
for instr in bytecode:
    print(instr)

Instruction(opname='LOAD_GLOBAL', opcode=116, arg=0, argval='int', argrepr='int', offset=0, starts_line=2, is_jump_target=False)
Instruction(opname='LOAD_GLOBAL', opcode=116, arg=1, argval='input', argrepr='input', offset=2, starts_line=None, is_jump_target=False)
Instruction(opname='LOAD_CONST', opcode=100, arg=1, argval='x? ', argrepr="'x? '", offset=4, starts_line=None, is_jump_target=False)
Instruction(opname='CALL_FUNCTION', opcode=131, arg=1, argval=1, argrepr='', offset=6, starts_line=None, is_jump_target=False)
Instruction(opname='CALL_FUNCTION', opcode=131, arg=1, argval=1, argrepr='', offset=8, starts_line=None, is_jump_target=False)
Instruction(opname='STORE_FAST', opcode=125, arg=0, argval='x', argrepr='x', offset=10, starts_line=None, is_jump_target=False)
Instruction(opname='LOAD_GLOBAL', opcode=116, arg=0, argval='int', argrepr='int', offset=12, starts_line=3, is_jump_target=False)
Instruction(opname='LOAD_GLOBAL', opcode=116, arg=1, argval='input', argrepr='input', offs

## Back to CPython Memory Management

The memory management algorithms and structures exist in the CPython code, in C.