# Other ways to speed up code

# Parallelism

Naturally, if your code runs on multiple processors at once, it can potentially finish faster.  Most machines these days have multiple processor cores.  If you're not sure how many you're working with, you can ask with a function call from multiprocess, a module for using multiple processors.  (There is another module, multiprocessing, that is nearly identical but causes problems in notebooks.)



In [1]:
!pip install multiprocess

import multiprocess as mp
print(mp.cpu_count())

2


The key class used for multiprocessor work is Pool, which is constructed with the number of worker processes as an argument.  Pool.map(function, iterable) applies the function in parallel to all the items in the iterable, or at least in as parallel a way as the number of processors allows.

In [2]:
my_pool = mp.Pool(mp.cpu_count())

def is_even(x):
  return x % 2 == 0

result = my_pool.map(is_even, [5,6,7,8,9,10])
print(result)
my_pool.close()

[False, True, False, True, False, True]


The above method is synchronous, meaning the program can't proceed until all the results are in.  It's also possible to start the work asynchronously with map_async(): the processes work in the background, and get() blocks until the results are ready.

In [4]:
my_pool = mp.Pool(mp.cpu_count())
result = my_pool.map_async(is_even, [5,6,7,8,9,10])
# We could do something else here
print(result.get())  # blocks until all the results are ready
my_pool.close()

[False, True, False, True, False, True]


You may have also heard of multiple threads being used to increase performance, but this mostly happens in other languages.  The standard Python interpreter, CPython, has a global interpreter lock (GIL) that lets only one thread execute Python bytecode at a time, so threads don't save any time on CPU-bound work; they take turns rather than running in parallel.  On the other hand, we're allowed to explicitly split the work between processes, each of which runs its own interpreter.
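The effect of threads taking turns can be observed with a small timing experiment. This is only a sketch: the function names are made up for illustration, and it assumes a standard CPython build with the global interpreter lock (GIL) enabled. On such a build the two timings usually come out similar, because the threads cannot execute Python bytecode at the same time.

```python
import threading
import time


def count_down(n):
    """CPU-bound busy loop made of pure Python bytecode."""
    while n > 0:
        n -= 1


def run_sequential(n, repeats):
    """Run the busy loop `repeats` times, one after another; return seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, repeats):
    """Run the same busy loops, each in its own thread; return seconds."""
    threads = [
        threading.Thread(target=count_down, args=(n,)) for _ in range(repeats)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000
    print(f"sequential: {run_sequential(n, 2):.3f} s")
    print(f"threaded:   {run_threaded(n, 2):.3f} s")
```

The exact numbers depend on the machine and the Python version, so treat any single run as indicative rather than conclusive.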

To create a more heavyweight parallel process, similar to how a thread might act, we can create a class that inherits from mp.Process and overrides its run() method.  The process is started with start(), and we can wait for it to finish with join().

In [5]:
import time

class Process(mp.Process):
  def run(self):
    time.sleep(1)
    print('Hello, multiprocessing!')

p1 = Process()
p2 = Process()

p1.start()
p2.start()

p1.join()
p2.join()

Hello, multiprocessing!
Hello, multiprocessing!


Note through all this that there's some overhead in talking to the operating system to set up the multiple processes, and in passing data between them, so there's no guarantee that small input sizes will see any speedup.

In [6]:
def square(x):
  return x ** 2

# This will actually be slower because of the overhead of starting up processes
def distributed_squaring(n):
  my_pool = mp.Pool(mp.cpu_count())
  result = my_pool.map(square, range(n))
  my_pool.close()
  return result

%time distributed_squaring(10000)

def sequential_squaring(n):
  a = [x ** 2 for x in range(n)]
  # we don't return it to avoid printing the giant list

%time sequential_squaring(10000)

CPU times: user 52.2 ms, sys: 19.1 ms, total: 71.3 ms
Wall time: 82.9 ms
CPU times: user 3.25 ms, sys: 3.44 ms, total: 6.69 ms
Wall time: 6.71 ms


# Vectorization



Vectorization is a way of optimizing away for loops in Python, which are noticeably slower than loops in compiled languages.  If you can write the same code without saying "for," and instead treat what you're trying to do as a vector operation, the resulting code will probably be faster.  (Until the interpreter is optimized to do this automatically.)

As an example, we first time a for loop that multiplies every element of an array by 2, and then time a single multiplication that scales the whole vector at once.

In [7]:
import numpy as np

def for_loop_mult(n):
  original_list = np.array(range(n))
  for i in range(n):
    original_list[i] *= 2
%time for_loop_mult(12345678)

CPU times: user 6.13 s, sys: 948 ms, total: 7.08 s
Wall time: 7.09 s


In [8]:
import numpy as np

def vectorized_mult(n):
  return np.array(range(n)) * 2

%time vectorized_mult(12345678)

CPU times: user 1.26 s, sys: 178 ms, total: 1.44 s
Wall time: 1.45 s


array([       0,        2,        4, ..., 24691350, 24691352, 24691354])

Note again here that the gains aren't really guaranteed, and subtleties of the inner workings of Python could cause a non-vectorized version to run faster.
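Both timings above include the cost of building the array with np.array(range(n)). To compare only the loop against the vectorized multiplication, one option is to build the array once and time the two approaches with timeit. This is a sketch under that assumption; the function names are hypothetical and the array is deliberately smaller so it runs quickly.

```python
import timeit

import numpy as np


def loop_double(values):
    """Double each element with an explicit Python-level loop."""
    result = values.copy()
    for i in range(len(result)):
        result[i] *= 2
    return result


def vectorized_double(values):
    """Double every element in a single NumPy operation."""
    return values * 2


if __name__ == "__main__":
    data = np.arange(100_000)  # built once, outside the timed code
    assert np.array_equal(loop_double(data), vectorized_double(data))
    loop_time = timeit.timeit(lambda: loop_double(data), number=5)
    vec_time = timeit.timeit(lambda: vectorized_double(data), number=5)
    print(f"loop:       {loop_time:.4f} s")
    print(f"vectorized: {vec_time:.4f} s")
```

Repeating the measurement several times, as timeit does here, helps smooth out noise from whatever else the machine is doing.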

# Compilers

Python is an interpreted language, and interpreted languages are thought to be slow, since the interpreter typically has to make sense of the source code on the fly while the program is running.  It's generally faster to compile the program ahead of time, translating it into machine code, assembly language, or something else low-level.

The most popular implementation of Python does compile the program before running it, though.  It's compiled into "bytecode," a similarity shared with Java: a low-level set of instructions that can be somewhat optimized away from what the code literally says to do.  When a module is imported, its compiled version is cached as a .pyc file (in a __pycache__ directory) and lingers there for later runs; the script you launch directly from the command line is recompiled each time.

Since this compilation is always done, the main speed gain of having a .pyc file around comes from just loading the program without needing to compile it again.  Once loaded, the program runs just as fast whether or not there was pre-existing bytecode.



An advantage to using a compiler, as Python does automatically, is that the compiler can detect when some code could be written more efficiently and perform those optimizations itself.  (CPython's bytecode compiler only does this for simple cases, such as evaluating constant expressions ahead of time.)  This is why small optimizations for speed sometimes don't have the intended consequences.
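One way to see such an optimization is to compile a small piece of source and inspect the result with the standard dis module. The sketch below assumes CPython, where a constant arithmetic expression is typically evaluated at compile time; this is an implementation detail and other Python implementations may behave differently. The variable name in the source string is made up for illustration.

```python
import dis

source = "seconds_per_day = 60 * 60 * 24"
code = compile(source, "<example>", "exec")

# If the compiler folded the expression, the product is stored as a
# single constant rather than being computed when the code runs.
print(code.co_consts)

# Disassemble to see the bytecode instructions the interpreter executes.
dis.dis(code)
```

Writing 86400 by hand instead of the product would therefore make no measurable difference at run time, which is one example of a small optimization with no effect.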

The compilation into bytecode that I'm describing is done by CPython, the most popular implementation of Python, but other interpreters and compilers exist, and some could conceivably be faster than CPython's combined approach of compilation and interpretation, especially in particular situations.

# Profiling

A profiler is a very important tool in making code run faster.  As we've discussed, the interpreter/compiler can make it very difficult to reason about where the slowest part of the code lies.  A profiler lets you determine with accuracy where the bottlenecks are, so you don't waste time trying to optimize the wrong thing. 



Here, it tells us that constructing the numpy array takes a significant share of the time in for_loop_mult(): about 1.46 of the 6.64 seconds.

In [9]:
import cProfile
cProfile.run('for_loop_mult(12345678)')

         5 function calls in 6.640 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    5.179    5.179    6.639    6.639 <ipython-input-7-ff6398c5d16c>:3(for_loop_mult)
        1    0.000    0.000    6.640    6.640 <string>:1(<module>)
        1    0.000    0.000    6.640    6.640 {built-in method builtins.exec}
        1    1.460    1.460    1.460    1.460 {built-in method numpy.array}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




tottime is the time spent in a function itself, excluding the functions it calls; cumtime includes them.  The "percall" column next to each divides the adjacent time by the number of calls on the far left.

For more on the profiler, see [the documentation.](https://docs.python.org/3/library/profile.html)
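The default report is ordered by name, which is rarely the most useful order. A possible refinement, sketched below, is to collect the statistics with cProfile.Profile and sort them by cumulative time using the standard pstats module. The workload functions slow_sum and square, and the helper profile_report, are hypothetical names used only for this illustration.

```python
import cProfile
import io
import pstats


def square(x):
    return x * x


def slow_sum(n):
    """Hypothetical workload: one helper call per item."""
    return sum(square(i) for i in range(n))


def profile_report(func, *args, limit=5):
    """Profile func(*args) and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_report(slow_sum, 100_000))
```

Sorting by cumulative time puts the functions that account for the most total time at the top, which is usually where to start looking.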

# Simple slow things

There are two kinds of operations that are generally much slower than the others: printing things and asking for memory.  Both require some negotiation with the operating system, and so both will be slower than simple arithmetic operations.  (Other operating system business is also slow, such as asking for a socket in networking, but these are the two most common.)

In [10]:
def count_and_print(n, pr):
  b = 0
  for i in range(n):
    if pr:
      print('I is now ' + str(i))  # Printing is slow; it shows up as a socket send in cProfile
    b += 2
  return b

%time count_and_print(10000, True)

%time count_and_print(10000, False)



Streaming output truncated to the last 5000 lines.
I is now 5004
I is now 5005
I is now 5006
[...]
I is now 5069
I is now 50

20000
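If the output is needed but the per-line cost is not, one common approach is to build the text first and hand it to print() in a single call. The sketch below illustrates the idea; the function names are hypothetical, and the captured() helper exists only to check that both versions produce the same text. How much time this saves depends on where the output is going.

```python
import io
from contextlib import redirect_stdout


def print_each(n):
    """One print() call per line: n separate writes."""
    for i in range(n):
        print('I is now ' + str(i))


def print_batched(n):
    """Build all the lines first, then print them in a single call."""
    lines = ['I is now ' + str(i) for i in range(n)]
    print('\n'.join(lines))


def captured(func, n):
    """Run func(n) and return everything it printed as one string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(n)
    return buffer.getvalue()


if __name__ == "__main__":
    # Same text either way; only the number of print() calls differs.
    assert captured(print_each, 1000) == captured(print_batched, 1000)
    print_batched(5)
```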

In [11]:
def count_and_allocate(n, alloc):
  b = 0
  for i in range(n):
    if alloc:
      a = np.ones(n)
    b += 2
  return b

%time count_and_allocate(10000,True)

%time count_and_allocate(10000,False)

cProfile.run('count_and_allocate(10000, True)')

CPU times: user 81 ms, sys: 0 ns, total: 81 ms
Wall time: 82.1 ms
CPU times: user 730 µs, sys: 0 ns, total: 730 µs
Wall time: 734 µs
         50004 function calls in 0.078 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.005    0.000    0.058    0.000 <__array_function__ internals>:2(copyto)
        1    0.005    0.005    0.078    0.078 <ipython-input-11-a70544ac48f8>:1(count_and_allocate)
        1    0.000    0.000    0.078    0.078 <string>:1(<module>)
    10000    0.001    0.000    0.001    0.000 multiarray.py:1071(copyto)
    10000    0.007    0.000    0.073    0.000 numeric.py:149(ones)
        1    0.000    0.000    0.078    0.078 {built-in method builtins.exec}
    10000    0.051    0.000    0.051    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
    10000    0.008    0.000    0.008    0.000 {built-in method numpy.empty}
        1    0.000    0.000    0.000    0.000 {meth
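When the array is the same on every pass through the loop, a simple way to avoid the repeated requests for memory is to create it once before the loop starts. The following is a minimal sketch of that idea with hypothetical function names; it reads one element per pass only so that the loop does some work with the array.

```python
import numpy as np


def allocate_every_iteration(n):
    """Asks for a fresh n-element array on every pass through the loop."""
    total = 0.0
    for _ in range(n):
        a = np.ones(n)
        total += float(a[0])
    return total


def allocate_once(n):
    """Creates the array a single time and reuses it on every pass."""
    total = 0.0
    a = np.ones(n)  # single allocation, outside the loop
    for _ in range(n):
        total += float(a[0])
    return total


if __name__ == "__main__":
    assert allocate_every_iteration(1000) == allocate_once(1000)
```

This only applies when the loop does not need a fresh array each time; if the contents change on every pass, the array has to be rebuilt or refilled.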

# Final thoughts

Nothing shown here offers more than a constant-factor speedup (except perhaps changes prompted by profiling), so it's worthwhile to optimize your approach from a big-O perspective first.  After that, time the code using a profiler, optimize away very expensive operations, and use parallelism for anything that parallelizes well.
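As an illustration of what a big-O improvement looks like, compare two ways of checking a list for duplicates. This is a generic example rather than anything from the code above, and the function names are hypothetical. The first compares every pair of items, so its work grows quadratically; the second remembers what it has seen in a set, so its work grows roughly linearly for hashable items.

```python
def has_duplicates_quadratic(items):
    """Compare every pair of items: O(n^2) comparisons."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Track seen items in a set: O(n) expected time for hashable items."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


if __name__ == "__main__":
    sample = list(range(2000)) + [7]
    assert has_duplicates_quadratic(sample) == has_duplicates_linear(sample)
```

No amount of parallelism or vectorization applied to the first version changes how its cost grows with the input, which is why the algorithmic change comes first.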