# VM Aware Programming in Python

> Sources: Materials from the following write-ups were modified for this in-class example:<br>
> 1. <a href = "https://www.codementor.io/@satwikkansal/python-practices-for-efficient-code-performance-memory-and-usability-aze6oiq65">Python Practices for Efficient Code: Performance, Memory, and Usability</a><br>
> 2. <a href = "https://heather.cs.ucdavis.edu/matloff/public_html/Python/PyIterGen.pdf">Tutorial on Python Iterators and Generators</a><br>
> 3. <a href = "https://medium.com/learning-better-ways-of-interpretting-and-using/python-generators-memory-efficient-programming-tool-41f09077353c">Python Generators: Memory-efficient programming tool</a><br>
> 4. <a href = "https://realpython.com/python-gil/">What Is the Python Global Interpreter Lock (GIL)?</a>

### Revisiting Loop Unrolling and Registers

Because Python uses an interpreter that hides memory management from the programmer, some of our techniques will not improve performance. However, other techniques will!

In [3]:
def func( count, value ):
    return count + value

In [4]:
def no_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        for count in range(0, 5):
            
            the_array[idx] = func( count, the_array[idx] )
            sum_val += the_array[idx]

Because the interpreter, not the programmer, decides how values map onto the machine, optimizations that work in C or C++, such as holding a value in an intermediate register, have comparatively little impact on computing performance in Python.

In [5]:
def reg_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        arr_idx = the_array[idx]
        
        for count in range(0, 5):
            
            arr_idx = func( count, arr_idx )
            sum_val += arr_idx
            
        the_array[idx] = arr_idx

However, since the Python interpreter still needs to interact with instructions across multiple cache blocks or pages, techniques such as loop unrolling do have a measurable impact: unrolling removes the inner loop's per-iteration overhead and reduces branch prediction misses.

In [6]:
def unroll_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        arr_idx = the_array[idx]
        
        arr_idx = func( 0, arr_idx )
        arr_idx = func( 1, arr_idx )
        arr_idx = func( 2, arr_idx )
        arr_idx = func( 3, arr_idx )
        arr_idx = func( 4, arr_idx )
        
        the_array[idx] = arr_idx

Python's interpreter does not use a preprocessor, so there are no explicit macros. However, you will be able to observe that writing a macro equivalent (that is, writing the code inline instead of calling the function) improves performance further.

> Note: When writing Python code in industry, be sure to adhere to your company's standards. Modularity and code cleanliness are important, especially if they do not mind a performance tradeoff. However, if they do need improved performance, you have another tool in your toolkit.
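To see where the saving comes from, here is a minimal sketch that uses the standard `dis` module to list the call-related bytecode instructions in each version. The helper names `with_call`, `inlined`, and `call_opcodes` are made up for this illustration, and the exact opcode names depend on the CPython version.

```python
import dis

def func(count, value):
    return count + value

def with_call(value):
    # Goes through a full Python function call
    return func(1, value)

def inlined(value):
    # "Macro equivalent": the body of func written in place
    return 1 + value

def call_opcodes(f):
    # Collect every bytecode instruction whose name mentions CALL
    return [ins.opname for ins in dis.get_instructions(f) if "CALL" in ins.opname]

print("with_call:", call_opcodes(with_call))
print("inlined:  ", call_opcodes(inlined))
```

The inlined version should show no call instructions at all, which is the work the macro equivalent avoids on every iteration.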

In [7]:
def macro_equiv_opt( array_size, the_array ):
    
    sum_val = 0
    
    for idx in range(0, array_size):
        
        arr_idx = the_array[idx]
        
        arr_idx = 0 + arr_idx
        arr_idx = 1 + arr_idx
        arr_idx = 2 + arr_idx
        arr_idx = 3 + arr_idx
        arr_idx = 4 + arr_idx
        
        the_array[idx] = arr_idx

In [8]:
def test_opt( array_test_size ):
    
    the_array = [0] * array_test_size

    print("No Opt")
    %timeit -r1 no_opt( array_test_size, the_array )
    
    print("Reg Opt")
    %timeit -r1 reg_opt( array_test_size, the_array )
    
    print("Unroll Opt")
    %timeit -r1 unroll_opt( array_test_size, the_array )
    
    print("Macro Equivalent Opt")
    %timeit -r1 macro_equiv_opt( array_test_size, the_array )

In [9]:
test_size = 1024
test_opt( test_size )

No Opt
971 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Reg Opt
800 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Unroll Opt
473 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Macro Equivalent Opt
209 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)


In [10]:
test_size = 2048
test_opt( test_size )

No Opt
1.93 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Reg Opt
1.61 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Unroll Opt
944 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)
Macro Equivalent Opt
418 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1000 loops each)


In [11]:
test_size = 16384
test_opt( test_size )

No Opt
15.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
Reg Opt
12.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
Unroll Opt
7.56 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
Macro Equivalent Opt
3.49 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)
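The `%timeit` magic only exists inside IPython and Jupyter. As a rough sketch of how the same comparison could be run from a plain script, the standard `timeit` module can be used instead. The helper `best_time` and its `repeats` and `loops` parameters are assumptions made for this example, and only two of the four variants are repeated here.

```python
import timeit

def func(count, value):
    return count + value

def no_opt(array_size, the_array):
    for idx in range(array_size):
        for count in range(5):
            the_array[idx] = func(count, the_array[idx])

def macro_equiv_opt(array_size, the_array):
    for idx in range(array_size):
        arr_idx = the_array[idx]
        arr_idx = 0 + arr_idx
        arr_idx = 1 + arr_idx
        arr_idx = 2 + arr_idx
        arr_idx = 3 + arr_idx
        arr_idx = 4 + arr_idx
        the_array[idx] = arr_idx

def best_time(fn, array_size, repeats=3, loops=20):
    # Seconds per call, taking the fastest of several repeats
    the_array = [0] * array_size
    timer = timeit.Timer(lambda: fn(array_size, the_array))
    return min(timer.repeat(repeat=repeats, number=loops)) / loops

if __name__ == "__main__":
    for fn in (no_opt, macro_equiv_opt):
        print(fn.__name__, best_time(fn, 1024))
```

Absolute numbers will differ from the notebook output above depending on the machine and Python version.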


## Generators

A Python <b>generator</b> is a function which returns a generator iterator (just an object we can iterate over) by using the <code>yield</code> keyword.

Generators are a <b>memory-efficient approach</b> to processing huge datasets. They process the data incrementally and do not allocate memory for all the results at the same time. Generators are beneficial for implementing data science pipelines for huge datasets in a resource-constrained environment (in terms of RAM).

Thus, generators become an effective tool to improve the <b>scalability</b> of a program and make it more responsive to user requests.
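As a small illustration of the memory claim, the sketch below compares a list holding every result with a generator expression that produces the same results one at a time. Note that `sys.getsizeof` reports only the container itself (for the list, its array of references), so the real gap is larger than the numbers suggest.

```python
import sys

N = 100_000

squares_list = [n * n for n in range(N)]   # every result held in memory at once
squares_gen = (n * n for n in range(N))    # results produced one at a time

list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)

print("list container bytes:", list_bytes)
print("generator bytes:     ", gen_bytes)

# Both can still be consumed the same way
print(sum(squares_list) == sum(squares_gen))
```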

Python uses iterator objects, which are objects that follow the iteration protocol, to step through the elements of a collection. In the underlying CPython interpreter, every Python object is a <code>PyObject</code> whose <code>Py_ssize_t ob_refcnt</code> member keeps track of the number of references that point to the object.

Let’s take a look at a brief code example to demonstrate how reference counting works:

In [10]:
import sys

a = []
b = a
sys.getrefcount(a)

3

The list object was referenced by:
<ol>
    <li>a</li>
    <li>b</li> 
    <li>the argument passed to <code>sys.getrefcount(a)</code></li>
</ol>

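We now know that the argument passed to <code>sys.getrefcount(a)</code> counts as a separate, temporary reference: just as a value is copied into an argument register before a function call, the call holds its own reference to the list until it returns.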

> This is an example of why we were careful to clear all registers <b>the moment they go out of scope</b> when we were learning assembly, because otherwise a tool like the Python interpreter would quickly run out of memory. Where we would clear a variable using <code>add x19, x0, x0</code>, the Python interpreter decrements <code>ob_refcnt</code>, and once the count reaches zero the object's memory can be reused. Without this fundamental programming skill, Python would <b>not be possible</b>.
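The decrement can be observed directly. The sketch below records the count before binding a second name, after binding it, and after deleting it. Only the differences are meaningful, because `sys.getrefcount` always includes its own temporary argument reference and the absolute values can vary between Python versions.

```python
import sys

a = []

before = sys.getrefcount(a)       # baseline count for the list
b = a                             # a second name now refers to the same list
after_bind = sys.getrefcount(a)
del b                             # the name b goes away, so the count drops again
after_del = sys.getrefcount(a)

print(after_bind - before)        # one extra reference while b exists
print(after_del - before)         # back to the baseline
```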

### Complexity of Programming Iterators and Generators as a Solution 

When we define custom iterators, we are forced to write fairly complex code, as shown below. In this example, we create a simple class called Counter that acts as a countdown machine.

In [11]:
import time

class Counter:
    
    # A countdown iterator that starts from a maximum count
    def __init__(self, max_limit):
        self.number = max_limit
        
    # __iter__ returns the iterator object itself
    def __iter__(self):
        return self
    
    # __next__ produces the next value, or signals that iteration is over
    def __next__(self):
        
        if self.number <= 0:
            print("End of Class Counter")
            raise StopIteration   # We will discuss the purpose of StopIteration soon
        
        else:
            time.sleep(1)
            current = self.number
            self.number -= 1
            return current   # hand one value back to the for loop

Once we have defined this class, we have to instantiate it and iterate through it using a for loop. (Doesn’t it look horribly cumbersome?)

In [12]:
c = Counter(10)

for x in c:
    print(x)

10
9
8
7
6
5
4
3
2
1
End of Class Counter


### Coroutines and Subroutines

When we call a normal Python function, execution starts at the function's first line and continues until a <code>return</code> statement, an <code>exception</code>, or the end of the function (which is seen as an implicit <code>return None</code>) is encountered.

There are times, though, when it's beneficial to have the ability to create a "function" which, instead of simply returning a single value, is able to yield a series of values. To do so, such a function would need to be able to "save its work," so to speak.

When a <b>generator</b> function reaches a <code>yield</code> statement, the "state" of the generator function is frozen; the values of all variables are saved and the next line of code to be executed is recorded until <code>next()</code> is called again.
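A short sketch of that saved state: the hypothetical `running_total` generator below keeps its local variable `total` alive between calls to `next()`, which a normal function could not do without external storage.

```python
def running_total(values):
    total = 0                 # local state that survives between yields
    for v in values:
        total += v
        yield total           # pause here; resume on the next call to next()

gen = running_total([5, 10, 20])

first = next(gen)    # 5
second = next(gen)   # 15, because total was remembered
third = next(gen)    # 35

print(first, second, third)
```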

In [13]:
def simple_generator_function():
    yield 1
    yield 2
    yield 3
    
our_generator = simple_generator_function()

next(our_generator)

next(our_generator)

next(our_generator)

3

In [14]:
# What happens when we call next a fourth time?

next(our_generator)

StopIteration: 

### For loop for accessing the Generator

One solution is to let a <code>for</code> loop drive the generator: the loop calls <code>next()</code> for us and stops cleanly when <code>StopIteration</code> is raised, so no error message appears.

In [15]:
for value in simple_generator_function():
    print(value)

1
2
3


### Using StopIteration

Another solution is to use try/except with <code>StopIteration</code> in order to prevent error messages.

In [16]:
our_generator = simple_generator_function()

while True:
    try:
        print( next(our_generator) )

    except StopIteration:
        break

1
2
3
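A third option, sketched here, is the two-argument form of the built-in `next()`, which returns a default value instead of raising `StopIteration`. The sentinel name `DONE` is an assumption for this example.

```python
def simple_generator_function():
    yield 1
    yield 2
    yield 3

our_generator = simple_generator_function()

DONE = object()   # unique sentinel that the generator can never yield
values = []

while True:
    value = next(our_generator, DONE)   # returns DONE when the generator is exhausted
    if value is DONE:
        break
    values.append(value)

print(values)   # [1, 2, 3]
```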


In [None]:
# Multiple yield statements with the same variable

def multi_yield():
    yield_str = "This will print the first string"
    yield yield_str
    yield_str = "This will print the second string"
    yield yield_str
    
multi_obj = multi_yield()

while True:
    try:
        print(next(multi_obj))

    except StopIteration:
        break

### Revising the Counter Example

Doesn't this look much simpler?

In [None]:
def Counter(limit):
    n = limit
    
    while n > 0:
        time.sleep(1)
        yield n
        n -= 1
    
    print("End of Generator Counter")

In [None]:
c = Counter(10)

for x in c:
    print(x)

### We can see the improvement in times for obtaining all the elements in a Fibonacci sequence

In [None]:
# Compare run times for printing Fibonacci sequence

# For this example, we assume fib_num is at least 2
def fibonacci( fib_num ):
    
    fib_array = []
    
    fib_1 = 1
    fib_2 = 1
    
    fib_array.append(fib_1)
    fib_array.append(fib_2)
    
    for val in range(2, fib_num):
        
        next_val = fib_1 + fib_2
        fib_array.append(next_val)
        
        fib_2 = fib_1
        fib_1 = next_val
        
    return fib_array

In [None]:
def fibonacci_generator( fib_num ):
    
    fib_1 = 1
    fib_2 = 1
    
    yield fib_1
    yield fib_2
    
    for val in range(2, fib_num):
        
        next_val = fib_1 + fib_2
        yield next_val
        fib_2 = fib_1
        fib_1 = next_val

In [None]:
def time_fib( fib_num ):

    fib_array = fibonacci( fib_num )

In [None]:
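def time_fib_yield( fib_num ):

    # Note: this only creates the generator object; no Fibonacci values
    # are computed until something iterates over it
    fib_yield = fibonacci_generator( fib_num )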

In [None]:
def fib_yield_test( fib_num ):

    print("Fibonacci Test:")
    %timeit -r1 time_fib( fib_num )
    
    print("Fibonacci Yield Test")
    %timeit -r1 time_fib_yield( fib_num )

In [None]:
fib_num = 10
fib_yield_test( fib_num )

In [None]:
fib_num = 34
fib_yield_test( fib_num )

In [None]:
fib_num = 177
fib_yield_test( fib_num )
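Keep in mind that creating a generator does no work until it is iterated, so a like-for-like timing has to consume every value from both versions. The sketch below does that with `sum`; the helper names `consume_list` and `consume_generator` are assumptions for this example. Once both are fully consumed, the main difference is expected to be memory footprint rather than time, and results will vary by machine.

```python
import timeit

def fibonacci(fib_num):
    fib_array = [1, 1]
    fib_1, fib_2 = 1, 1
    for _ in range(2, fib_num):
        next_val = fib_1 + fib_2
        fib_array.append(next_val)
        fib_2, fib_1 = fib_1, next_val
    return fib_array

def fibonacci_generator(fib_num):
    fib_1, fib_2 = 1, 1
    yield fib_1
    yield fib_2
    for _ in range(2, fib_num):
        next_val = fib_1 + fib_2
        yield next_val
        fib_2, fib_1 = fib_1, next_val

def consume_list(fib_num):
    return sum(fibonacci(fib_num))             # builds the whole list first

def consume_generator(fib_num):
    return sum(fibonacci_generator(fib_num))   # one value in flight at a time

if __name__ == "__main__":
    for fn in (consume_list, consume_generator):
        seconds = timeit.timeit(lambda: fn(177), number=1000)
        print(fn.__name__, seconds)
```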

### How can we use Generators to improve our programs?

<ul>
    <li>For large numbers/data crunching, you can use libraries like <a href = "https://numpy.org/">NumPy</a>, which gracefully handles memory management.</li>
    <li>Don't use <code>+</code> for generating long strings</li>
    <ul>
        <li>In Python, <code>str</code> is immutable, so the left and right strings have to be copied into the new string for every pair of concatenations. If you concatenate four strings of length 10, you'll be copying (10+10) + ((10+10)+10) + (((10+10)+10)+10) = 90 characters instead of just 40 characters.</li>
    </ul>
</ul>
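The copy count in the string example above can be checked with a few lines of arithmetic. This is only a model of the naive copying cost (the function names are made up for the illustration); CPython can sometimes extend a string in place and avoid part of this work.

```python
def chars_copied_by_plus(num_strings, length):
    copied = 0
    current = length               # length of the first string
    for _ in range(num_strings - 1):
        current += length          # result of this concatenation
        copied += current          # every character of the result is copied
    return copied

def chars_copied_by_join(num_strings, length):
    # join allocates the final string once and copies each piece once
    return num_strings * length

print(chars_copied_by_plus(4, 10))   # 90
print(chars_copied_by_join(4, 10))   # 40
```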

In [None]:
def add_string_with_plus(iters):
    s = ""
    for i in range(iters):
        s += "xyz"
    assert len(s) == 3*iters

    return s
    
    
def add_string_with_join(iters):
    l = []
    for i in range(iters):
        l.append("xyz")
    s = "".join(l)
    assert len(s) == 3*iters

    return s    
    
    
def add_string_with_format(iters):
    fs = "{}"*iters
    s = fs.format(*(["xyz"]*iters))
    assert len(s) == 3*iters

    return s
    
    
def string_yield(iters):
    
    for i in range(iters):
        append = "xyz"
        yield append


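def add_string_with_yield(iters):
    
    parts = []
    
    the_string_yield = string_yield( iters )
    
    # Collect each yielded piece until the generator is exhausted
    while True:
        try:
            parts.append( next(the_string_yield) )
        except StopIteration:
            break
            
    # Join once at the end so the result is a string, like the other versions
    s = "".join(parts)
    assert len(s) == 3*iters

    return s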

In [None]:
def convert_list_print( iter_test ):
    
    print( "Number of iterations: ", iter_test )

    print("add_string_with_plus test:")
    print( add_string_with_plus( iter_test ) )
    
    print("add_string_with_join:")
    print( add_string_with_join( iter_test ) )
    
    print("add_string_with_format tests:")
    print( add_string_with_format( iter_test ) )
    
    print("add_string_with_yield:")
    print( add_string_with_yield( iter_test ) )

In [None]:
convert_list_print( 10 )

In [None]:
def convert_list_test( iter_test ):
    
    print( "Number of iterations: ", iter_test )

    print("add_string_with_plus test:")
    %timeit -r1 add_string_with_plus( iter_test )
    
    print("add_string_with_join:")
    %timeit -r1 add_string_with_join( iter_test )
    
    print("add_string_with_format tests:")
    %timeit -r1 add_string_with_format( iter_test )
    
    print("add_string_with_yield:")
    %timeit -r1 add_string_with_yield( iter_test )

In [None]:
convert_list_test( 10000 )

# Python Global Interpreter Lock

### Protecting <code>ob_refcnt</code>

The problem with Python objects was that this reference count variable needed <b>protection</b> from race conditions where two threads increase or decrease its value simultaneously.

If this happens, it can cause either leaked memory that is never released or, even worse, incorrectly release the memory while a reference to that object still exists.

This reference count variable can be kept safe by adding <b>locks</b> to all data structures that are shared across threads so that they are not modified inconsistently.
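As an analogy for protecting a shared count, here is a minimal sketch in which four threads update one counter and a `threading.Lock` makes each update atomic. The `SafeCounter` class and `hammer` function are hypothetical names for this illustration; CPython's own reference counts are protected inside the interpreter, not by user-level code like this.

```python
import threading

class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below
        with self._lock:
            self.value += 1

def hammer(counter, times):
    for _ in range(times):
        counter.increment()

counter = SafeCounter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(4)]

for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)   # 40000
```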

> <b>Operating Systems Preview/Review</b>: Adding a lock to each object or groups of objects means multiple locks will exist which can cause another problem—Deadlocks.

For our Computer Architecture purposes, per-object locks would have another side effect: <b>decreased performance</b> caused by the repeated acquisition and release of locks.

So the <b>GIL</b> is designed as a single lock on the interpreter, so there are no deadlocks.


### So if GIL can cause so many issues, why did the inventors of Python use it?

Let's ask them!

<a href="https://youtu.be/KVKufdTphKs?si=-xYqQ97WrOrW7A6N&t=732" target="_blank">
 <img src="http://img.youtube.com/vi/KVKufdTphKs/mqdefault.jpg" alt="Watch the video" width="240" height="180" border="10" />
</a>

#### Memory and Thread Safe Programs

A lot of extensions were being written for the existing C libraries whose features were needed in Python. To prevent inconsistent changes, these C extensions required thread-safe memory management, which the GIL provided.

C libraries that were not thread-safe became easier to integrate. And these C extensions became one of the reasons why Python was readily adopted by different communities.

## The Impact on Multi-Threaded Python Programs

When you look at a typical Python program—or any computer program for that matter—there’s a difference between those that are CPU-bound in their performance and those that are I/O-bound.

#### I/O Bound Programs

A program that spends most of its time waiting for input or output, for example waiting for an interrupt before it can continue.

> Example: A mouse driver waiting for you to use the mouse.
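A rough sketch of why I/O-bound programs suffer less from the GIL: a waiting thread releases the GIL, so two waits can overlap. Here `time.sleep` stands in for waiting on a device, and the function names are assumptions for this example.

```python
import time
from threading import Thread

def fake_io(seconds):
    time.sleep(seconds)   # stands in for waiting on a device or the network

def run_sequential(seconds):
    start = time.time()
    fake_io(seconds)
    fake_io(seconds)
    return time.time() - start

def run_threaded(seconds):
    t1 = Thread(target=fake_io, args=(seconds,))
    t2 = Thread(target=fake_io, args=(seconds,))
    start = time.time()
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return time.time() - start

sequential_elapsed = run_sequential(0.2)
threaded_elapsed = run_threaded(0.2)

print("sequential:", sequential_elapsed)
print("threaded:  ", threaded_elapsed)
```

The threaded run should take roughly one wait instead of two, since neither thread needs the GIL while it sleeps.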

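#### CPU Bound Programs

A program that spends most of its time computing, so its speed is limited by the CPU. We would normally speed up such a program with multiple threads (we will study multithreading later this semester). Because there is one lock, multithreaded CPU-bound programs in Python essentially become single-threaded programs. This is where programmers get frustrated.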


### Multithreading Example

The reality is that, while the GIL provided memory-safe programming in Python, programmers still need improvements in performance that are brought about with multithreading.

Let’s have a look at a simple CPU-bound program that performs a countdown:

In [None]:
# No multithreading

import time

COUNT = 50000000

def countdown(n):
    while n>0:
        n -= 1

start = time.time()
countdown(COUNT)
end = time.time()

print('Time taken in seconds: ', end - start)

In [None]:
# But now let's try multithreading

from threading import Thread

t1 = Thread(target=countdown, args=(COUNT//2,))
t2 = Thread(target=countdown, args=(COUNT//2,))

start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
end = time.time()

print('Time taken in seconds -', end - start)

As you can see, both versions take almost the same amount of time to finish. In the multi-threaded version, the GIL prevented the CPU-bound threads from executing in parallel.

## Why Hasn’t the GIL Been Removed Yet?

The developers of Python receive a lot of complaints regarding this, but a language as popular as Python cannot bring a change as significant as the removal of the GIL without causing backward incompatibility issues.

Removing the GIL would have made Python 3 slower in comparison to Python 2 in single-threaded performance.

Python 3 did add starvation prevention, since Python’s GIL was known to starve I/O-bound threads by not giving them a chance to acquire the GIL from CPU-bound threads.

Example: Here is the behavior of running two CPU-bound threads on a single CPU system. As you will observe, the threads nicely alternate with each other after long periods of computation.
![image.png](attachment:image.png)

Here is an example of GIL contention on a dual-core CPU. This image shows red regions which indicate times where the operating system has scheduled a Python thread on one of the cores, but it can't run because the thread on the other core is holding the GIL.
![image-2.png](attachment:image-2.png)


So what happens when an I/O-bound thread competes with a CPU-bound thread? The other thread (thread 2) is just mindlessly spinning.
![image-3.png](attachment:image-3.png)

As you would expect, most of the time is spent running the CPU-bound thread. However, when I/O is received, there is a flurry of activity that takes place in the I/O thread. Let's zoom in on that region and see what's happening.
![image-4.png](attachment:image-4.png)

In this graph, you're seeing how difficult it is for the I/O-bound thread to get the GIL in order to perform its small amount of processing.

### What can you do?

<b>Multi-processing vs multi-threading</b>: The most popular way is to use a multi-processing approach where you use multiple processes instead of threads. Each Python process gets its own Python interpreter and memory space so the GIL won’t be a problem. 

> <b>Issue!</b> - Per the <a href = "https://docs.python.org/3/library/multiprocessing.html">multiprocessing</a> documentation, some functionality such as <code>multiprocessing.Pool</code> does not work in an interactive interpreter like a Jupyter notebook, because the child processes must be able to import the <code>__main__</code> module. So the code is embedded in this Jupyter Notebook for your reference, but it may not work as intended if you try to run it here.

In [15]:
# function to print cube of given num 
def print_cube(num): 
    
    print("Cube: {}".format(num * num * num)) 


# function to print square of given num 
def print_square(num): 
    
    print("Square: {}".format(num * num)) 
    

def test_prints( num ):
    
    print_square( num )
    print_cube( num )
    
    print("Done!") 

In [17]:
# importing the multiprocessing module 
import multiprocessing 
  
if __name__ == "__main__": 
    
    # creating processes 
    p1 = multiprocessing.Process(target=print_square, args=(10, )) 
    p2 = multiprocessing.Process(target=print_cube, args=(10, )) 
  
    # starting process 1 
    p1.start() 
    # starting process 2 
    p2.start() 
  
    # wait until process 1 is finished 
    p1.join() 
    # wait until process 2 is finished 
    p2.join() 
  
    # both processes finished 
    print("Done!") 

Done!


In [16]:
test_prints( 10 )

Square: 100
Cube: 1000
Done!
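The CPU-bound countdown from the multithreading example can be split across processes in the same way. This is a sketch meant to be saved and run as a standalone script rather than in a notebook cell; `COUNT` is reduced from the earlier value to keep the run short, and `countdown` returns its final value only so the result can be checked.

```python
import time
from multiprocessing import Process

COUNT = 10_000_000   # assumed smaller value so the script finishes quickly

def countdown(n):
    while n > 0:
        n -= 1
    return n

if __name__ == "__main__":
    # Each process has its own interpreter and its own GIL
    p1 = Process(target=countdown, args=(COUNT // 2,))
    p2 = Process(target=countdown, args=(COUNT // 2,))

    start = time.time()
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    end = time.time()

    print('Time taken in seconds -', end - start)
```

On a machine with at least two cores, this version would be expected to finish faster than the threaded one, although process start-up adds some overhead.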
