# Compute Clusters and parallelization

Parallelization, or concurrency, is a vital part of your data analysis toolkit. Imagine your analysis takes 3 hours to run on one data point. Run in series, 100 data points would take 300 hours, which is about 2 weeks! On an 8-core home computer you could get it down to about 2 days, and on a large enough cluster you could do it in one pass, a little over 3 hours!

In short, if you have a lot of options to explore, or have to do the same long calculation on many things, then you can speed up your code by a factor roughly proportional to the number of cores in your CPU.
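As a rough sanity check on the numbers above, here is a minimal sketch of the arithmetic. It assumes every data point takes the same time, that workers never wait on each other, and that set-up overhead is negligible. The function name `estimated_hours` is made up for illustration.

```python
import math

def estimated_hours(n_points, hours_per_point, n_workers):
    """Idealized wall-clock time: each worker handles one data point per pass."""
    passes = math.ceil(n_points / n_workers)  # rounds needed to cover every point
    return passes * hours_per_point

for workers in (1, 8, 100):
    hours = estimated_hours(100, 3, workers)
    print(f"{workers:>3} workers: {hours} hours ({hours / 24:.1f} days)")
```

With 8 workers the 100 points need 13 passes, so the idealized estimate is 39 hours, which is where the "about 2 days" figure comes from.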

Today, you don't even need a cluster. If your advisor is open to paying for a new computer, consider getting a multicore processor such as a 32- or 64-core AMD Threadripper or a 16-core Intel i9 and building your own ([it's easier than you think](https://www.howtogeek.com/187797/dont-be-intimidated-building-your-own-computer-is-easier-than-youd-think/)). The cost would be $3-4K and you wouldn't need to wait for any cluster.

## contents


1.   Multithreading vs multiprocessing
2.   Thread safety
3.   Multiprocessing example code
4.   Multithreading example code
5.   CHPC usage (Mallinckrodt Institute of Radiology)
6.   Physics HPC facility
7.   GPU acceleration with CUDA




## Multithreading vs multiprocessing

Modern computers have separate modules (cores) within the same central processing unit (CPU). It's always been the case that networking, USB, and memory components are separate from the CPU and much slower than the CPU. Having your separate CPU cores working on your program at the same time is multiprocessing in a nutshell. Having your CPU cores work on the next part of your problem while they wait for the other components is multithreading in a nutshell. 

If your problem is mostly calculation based and the data is loaded beforehand and held in RAM (held in a variable), then your problem is "CPU bound" and fit for multiprocessing. All of my experience is with this type of problem.

If your problem involves back and forth communication over the internet or with another device, or if your problem involves loading a datapoint, doing something relatively simple (filtering) and then immediately saving it again, then your problem is "I/O bound" and multithreading is probably best. 
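To make the I/O-bound case concrete, the sketch below fakes a slow device with `time.sleep`. This is an assumption standing in for a real network or disk call, and `fake_download` is a made-up name. While one thread sleeps the others can run, so four 0.2-second waits should finish in roughly 0.2 seconds rather than 0.8.

```python
import concurrent.futures
import time

def fake_download(item):
    """Stand-in for an I/O-bound task: the CPU is idle while we wait."""
    time.sleep(0.2)  # pretend we are waiting on a network or disk
    return item * 2

items = [1, 2, 3, 4]

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map returns results in the same order as the inputs
    results = list(executor.map(fake_download, items))
elapsed = time.perf_counter() - start

print(results)
print(f"threaded run took {elapsed:.2f} s for {len(items)} waits of 0.2 s each")
```

If you replace the sleep with a heavy calculation, the threads no longer overlap usefully, and that is the CPU-bound case where multiprocessing is the better fit.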

Multiprocessing has a pretty big negative. For CPU-bound problems that are short and simple, multiprocessing can take longer than running in series or multithreading, because it involves relatively lengthy set-up and tear-down.

Multiprocessing also has an (almost) limitless advantage: the more cores you can run on, the bigger the speed-up you'll get. However, you'll quickly run out of RAM if you keep too much stored in memory, because each process holds its own copy of whatever you send it. So it is important to avoid having variables that store the same thing, and to be very careful about what you send to each process.
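One way to see what "be careful about what you send" means: `multiprocessing` pickles each argument before handing it to a worker, so the size of the pickled task is a rough proxy for the copying cost. The sketch below compares sending a large list with sending a small recipe (a seed and a length) that the worker could use to build the data itself. The names `build_data`, `big_task` and `small_task` are hypothetical.

```python
import pickle
import random

def build_data(seed, n_values):
    """Recreate the data inside the worker from a small, cheap-to-send recipe."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_values)]

# Option 1: ship the full data set to the worker
big_task = build_data(seed=0, n_values=100_000)

# Option 2: ship only the recipe and let the worker call build_data itself
small_task = (0, 100_000)

big_bytes = len(pickle.dumps(big_task))
small_bytes = len(pickle.dumps(small_task))
print(f"pickled data:   {big_bytes} bytes")
print(f"pickled recipe: {small_bytes} bytes")
```

The same idea applies to files: passing a file name and letting each worker load only its own piece is usually much cheaper than loading everything up front and sending it out.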

[https://realpython.com/python-concurrency/](https://realpython.com/python-concurrency/)

## Thread safety


Several things can go wrong when trying multiprocessing or multithreading. All of these problems are caused by code that is not written in a "threadsafe" way. In the contexts you are likely to encounter while doing data analysis thread safety means that none of your different processes or workers are modifying the same files or variables, and that you have protected your functions so that they cannot be interrupted while running. 



*   Always define the portion to be run in parallel in a self-contained function (you need to do this anyway)
*   Never define a variable in a script that is later used in one of the functions. Always either define the variable within the function itself, or pass a *copy* of the variable as an argument to the function
*   When using multithreading, use `threading.local` to give each thread its own private copy of any state your function needs, so that threads cannot overwrite each other's data.
*   Do not allow separate workers or threads to write to or modify the same file. It's generally OK to let them read the same file if you open it read-only.
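The `threading.local` point can be sketched with the standard library alone. In the example below each thread keeps a private counter on a `threading.local` object, so no thread can see or overwrite another thread's count. The names `running_count` and `per_thread` are illustrative, not from the workshop code.

```python
import concurrent.futures
import threading

thread_local = threading.local()

def running_count(item):
    """Count how many items this particular thread has handled so far."""
    if not hasattr(thread_local, "count"):
        thread_local.count = 0  # first call in this thread: create its private counter
    thread_local.count += 1
    return (threading.get_ident(), thread_local.count)

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(running_count, range(12)))

# The highest count reported by each thread is the number of items it handled
per_thread = {}
for thread_id, count in results:
    per_thread[thread_id] = max(per_thread.get(thread_id, 0), count)

print(per_thread)
```

How the 12 items are divided between threads varies from run to run, but the per-thread counts always add up to 12 because no update is ever lost.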

[https://en.wikipedia.org/wiki/Thread_safety](https://en.wikipedia.org/wiki/Thread_safety)

[https://en.wikipedia.org/wiki/Reentrancy_(computing)](https://en.wikipedia.org/wiki/Reentrancy_(computing))

[https://en.wikipedia.org/wiki/Race_condition](https://en.wikipedia.org/wiki/Race_condition)

[https://realpython.com/python-concurrency/](https://realpython.com/python-concurrency/)

# Multiprocessing: The Ultimate Kamehameha

If your function takes more than a few seconds the gain in running it multiple times at once will outweigh the cost of the long initialization time. The bigger the computer the more speed gains you will get. 

[multiprocessing module documentation](https://python.readthedocs.io/en/latest/library/multiprocessing.html)

In [1]:
!rm get_max_eig.py
!wget https://raw.githubusercontent.com/jojker/PML_Workshops/master/Summer%202019/Day%201%20-%20Process%20and%20Design%20for%20Rapid%20Progress/Ex%205%20-%20Computing%20clusters%2C%20parallelization%20and%20GPUs/get_max_eig.py

rm: cannot remove 'get_max_eig.py': No such file or directory
--2019-07-17 23:12:38--  https://raw.githubusercontent.com/jojker/PML_Workshops/master/Summer%202019/Day%201%20-%20Process%20and%20Design%20for%20Rapid%20Progress/Ex%205%20-%20Computing%20clusters%2C%20parallelization%20and%20GPUs/get_max_eig.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180 [text/plain]
Saving to: ‘get_max_eig.py’


2019-07-17 23:12:43 (3.51 MB/s) - ‘get_max_eig.py’ saved [180/180]



In [2]:
import multiprocessing
import time
import numpy as np
from get_max_eig import get_max_eig

matsize=1000 # when set to 100 multithreading is faster, set to 1000 and multiprocessing is
N_threads=2 # we only have two cpus in google colab

# Joint output is a special kind of object that gets written to by each worker 
# as they work (instead of collecting everything with a function output assignment)
"""
joint_output=multiprocessing.Queue()
"""

# define a function which does NOT modify ANY global (previously defined) variables except a Queue
# NOTE, for Windows environments, when running "interactively" (in ipython, an 
# IDE or in a notebook) the function you use must be imported and cannot be defined in the same script/session
"""
def get_max_eig(chunk):
  idx=chunk[1] 
  arry=chunk[0]
  eigvals=np.linalg.eigvalsh(arry)
  joint_output.put((eigvals.max(),idx))
  return (eigvals.max(),idx)
"""
        


#if __name__ == "__main__":
# example 1, generating then separating arrays for independent work
data=[(np.random.rand(matsize,matsize),x) for x in range(20)] # pool.map handles chunking for us!
# # example 2, splitting an array and working on the pieces
# data = np.random.rand(20,20)
# chunks = [(x,ndx) for ndx,x in enumerate(np.array_split(data, N_threads))]
print('confirm the data splitting')
print([x[0].shape for x in data]) # verify the splitting

start_time = time.time()
pool = multiprocessing.Pool(N_threads)
print('doing stuff')
gathered_chunks=[x for x in pool.map(get_max_eig, data)]
print('stuff is done')
maxeig=[x[0] for x in gathered_chunks]
index_out=[x[1] for x in gathered_chunks]
maxeig_ndx=maxeig.index(max(maxeig))
maxeig_ndx=gathered_chunks[maxeig_ndx][1]
maxeig=max(maxeig)
maxeig_array=data[maxeig_ndx]
print('display maximum eigenvalue')
print(maxeig)
print('display the array which gives the maximum eigenvalue')
print(maxeig_array)
"""# do it again but with the joint output instead of the returned list
joint_output=[joint_output.get() for x in range(joint_output.qsize())]
# extra step because we can't index on joint_output.get()
maxeig2=[x[0] for x in joint_output]
index_out2=[x[1] for x in joint_output]
maxeig_ndx2=maxeig2.index(max(maxeig2))
maxeig_ndx2=joint_output[maxeig_ndx2][1]
maxeig2=max(maxeig2)
maxeig_array2=data[maxeig_ndx2]
print('pool.map output indices')
print(index_out)
print('joint_output indices')
print(index_out2)
print('does maxeig match?')
print(maxeig2==maxeig)
print('does maxeig_array match?')
print(maxeig_array2==maxeig_array)"""

elapsed_time = time.time() - start_time
print(f"It took {elapsed_time} seconds to process {len(data)} matrices")

confirm the data splitting
[(1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000)]
doing stuff
stuff is done
display maximum eigenvalue
500.7909554915367
display the array which gives the maximum eigenvalue
(array([[0.37778615, 0.04055442, 0.5150767 , ..., 0.03943158, 0.73087757,
        0.0720365 ],
       [0.01054151, 0.13469659, 0.65589384, ..., 0.73879981, 0.80468291,
        0.52541305],
       [0.19337457, 0.70387836, 0.92017039, ..., 0.15074992, 0.97590554,
        0.89088734],
       ...,
       [0.22036265, 0.04114682, 0.89241079, ..., 0.68879522, 0.18420228,
        0.56430834],
       [0.14627942, 0.5395083 , 0.28127738, ..., 0.83322918, 0.9966674 ,
        0.84475125],
       [0.49913105, 0.35256568, 0.20839051, ..., 0.1089864 , 0.72080256,
        0.90

# Multithreading: an opening maneuver

Multithreading is most useful for preprocessing raw data or converting it to a significantly different format. It is not so useful for analysis that involves large matrix calculations or other lengthy computation.

[threading module documentation](https://python.readthedocs.io/en/stable/library/threading.html)

In [3]:
import concurrent.futures
import threading
import time
import numpy as np

matsize=1000 # when set to 100 multithreading is faster, set to 1000 and multiprocessing is
thread_local = threading.local()
N_threads=2 # set to 2 to make a fair comparison with multiprocessing


def threadsafe_eigs(chunk):
    if not hasattr(thread_local, "get_max_eig"):
        thread_local.get_max_eig = get_max_eig
    return thread_local.get_max_eig(chunk)


def find_best_matrix(chunks):
    with concurrent.futures.ThreadPoolExecutor(max_workers=N_threads) as executor:
        gathered_chunks=[x for x in executor.map(threadsafe_eigs, chunks)]
        maxeig=[x[0] for x in gathered_chunks]
        maxeig_ndx=maxeig.index(max(maxeig))
        maxeig_ndx=gathered_chunks[maxeig_ndx][1]
        maxeig=max(maxeig)
        maxeig_array=chunks[maxeig_ndx]
        print('display maximum eigenvalue')
        print(maxeig)
        print('display the array which gives the maximum eigenvalue')
        print(maxeig_array)
        
        
def get_max_eig(chunk):
  idx=chunk[1]
  arry=chunk[0]
  eigvals=np.linalg.eigvalsh(arry)
  return (eigvals.max(),idx)
        
          
    

if __name__ == "__main__":
  # example 1, generating then separating arrays for independent work
  data=[(np.random.rand(matsize,matsize),x) for x in range(20)] # executor.map handles chunking for us!
  # # example 2, splitting an array and working on the pieces
  # data = np.random.rand(20,20)
  # chunks = [(x,ndx) for ndx,x in enumerate(np.array_split(data, N_threads))]
  print([x[0].shape for x in data]) # verify the splitting


  start_time = time.time()
  find_best_matrix(data)
  elapsed_time = time.time() - start_time
  print(f"It took {elapsed_time} seconds to process {len(data)} matrices")


[(1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000), (1000, 1000)]
display maximum eigenvalue
500.73135604672143
display the array which gives the maximum eigenvalue
(array([[0.3149457 , 0.2068915 , 0.22573461, ..., 0.02841787, 0.95432047,
        0.74206671],
       [0.69095165, 0.313048  , 0.71878623, ..., 0.83661787, 0.33232412,
        0.94947128],
       [0.27470046, 0.26825978, 0.53783859, ..., 0.25204323, 0.86274221,
        0.65192078],
       ...,
       [0.50701093, 0.3294186 , 0.17844519, ..., 0.15431113, 0.66597849,
        0.39329202],
       [0.42368908, 0.54263458, 0.10372451, ..., 0.43460681, 0.03996008,
        0.03100962],
       [0.04301358, 0.9771131 , 0.22318735, ..., 0.7582938 , 0.7450777 ,
        0.73792397]]), 17)
It took 16.927886486053467 seconds to

**Parallelization isn't always faster**

Numpy and other libraries are often already optimized pretty well. Don't start with parallelization unless you're really sure you'll need it. In our example with finding the eigenvalues it's actually faster to run without any parallelization (on the Colab runtime). 

If we increase the size of the matrices, use a different eigenvalue solver, or have a machine that offers a larger number of cores, then it may be faster to run in parallel. This just underscores that it's important to check your assumptions and time different versions of your code.
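A small timing helper makes this kind of comparison routine. The sketch below is one way to do it with `time.perf_counter`. The names `best_time` and `sum_of_squares` are hypothetical, and the toy workload is only a placeholder for your own serial, threaded, or multiprocessing version.

```python
import time

def best_time(func, *args, repeats=3):
    """Run func(*args) several times; return (fastest wall-clock seconds, last result)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings), result

def sum_of_squares(n):
    """Toy CPU-bound workload standing in for a real analysis function."""
    return sum(i * i for i in range(n))

seconds, value = best_time(sum_of_squares, 100_000)
print(f"fastest of 3 runs: {seconds:.4f} s, result {value}")
```

Taking the fastest of a few repeats reduces the noise from whatever else the machine happens to be doing, which matters on shared runtimes.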

In [4]:
# time the process in series on the CPU (no parallelization)
import time
import numpy as np

matsize=1000 # when set to 100 multithreading is faster, set to 1000 and multiprocessing is


if __name__ == "__main__":
  # example 1, generating then separating arrays for independent work
  start_time = time.time()
  data=[np.random.rand(matsize,matsize) for x in range(20)] # no chunking needed, we loop over the list in series
  maxeig=[np.linalg.eigvalsh(x).max() for x in data]
  maxeig_ndx=maxeig.index(max(maxeig))
  maxeig=max(maxeig)
  maxeig_array=data[maxeig_ndx]
  print('display maximum eigenvalue')
  print(maxeig)
  print('display the array which gives the maximum eigenvalue')
  print(maxeig_array)

    
  elapsed_time = time.time() - start_time
  print(f"It took {elapsed_time} seconds to process {len(data)} matrices")

display maximum eigenvalue
500.6734298972959
display the array which gives the maximum eigenvalue
[[0.98449371 0.43177062 0.61564363 ... 0.38888417 0.42846851 0.984318  ]
 [0.35795872 0.76382852 0.65613542 ... 0.28824312 0.99909606 0.76161075]
 [0.1807745  0.56936573 0.31718037 ... 0.58533865 0.94706122 0.53129491]
 ...
 [0.48159875 0.65400046 0.62950296 ... 0.21707006 0.25133658 0.67652186]
 [0.72972474 0.79512367 0.93688658 ... 0.39924894 0.60771342 0.75055381]
 [0.52199843 0.90473609 0.82041336 ... 0.53058552 0.44492439 0.94395564]]
It took 3.8288612365722656 seconds to process 20 matrices


# WashU's Center for High Performance Computing
**FORMERLY Mallinckrodt Center for High Performance Computing**

The multiprocessing module used previously does NOT let you use multiple cores across multiple computers. It only lets you use different cores on the same computer. The CHPC is set up to enable one program to use the cores of multiple separate machines. It uses the Message Passing Interface (MPI), so we need to use a different python module: mpi4py. 

However if you are running your code on only one compute node then multiprocessing will suit you well. 

## Set up :



1.   [Request and get access](https://www.mir.wustl.edu/research/research-support-facilities/center-for-high-performance-computing-chpc/services/request-an-account) (click the link to the left)
2.   Learn to use and set up your favorite SFTP client and SSH: [FileZilla](https://filezilla-project.org/) or [MobaXterm](https://mobaxterm.mobatek.net/) (MobaXterm is for Windows and can run Graphical User Interface software remotely)
3.   Always run:    
`export PATH=/export/Anaconda3-5.2.0/bin:$PATH`    
`module load openmpi-2.0.0-intel-15.0.1` (or some other MPI module if you need to run MPI)
4.   Set up a virtual environment    
`conda create --name environment_name python=3`    
`source activate environment_name`    
This lets you install your own packages. From inside the environment: `conda install name_of_package`    
or from outside the environment: `conda install --name name_of_env name_of_package`
5.   Email alerts go to the email address you registered with in step 1. If you want to change the address, do the following.    
Go to your home directory `home/your_username/`    
Create a text file named `.forward` (not "something.forward", not ".forward.txt", just ".forward")    
Put one line and one line only in it, your email address: "some_account@some_domain.name"    
Note: you need to explicitly enable email notifications in the next step.
6.   PBS script    
example below
7.   MPI    
example below

## other resources
[CHPC link](https://www.mir.wustl.edu/research/research-support-facilities/center-for-high-performance-computing-chpc/services) - [on campus wiki](http://mgt2.chpc.wustl.edu/wiki119/index.php/Main_Page)

[queuing system](https://www.mir.wustl.edu/research/research-support-facilities/center-for-high-performance-computing-chpc/for-researchers/queuing-system) - [on campus wiki](http://mgt2.chpc.wustl.edu/wiki119/index.php/Queuing_System)
[PBS](https://www.chpc.utah.edu/documentation/software/pbs-scheduler.php)


software availability and limitations
[software availability](https://www.mir.wustl.edu/research/research-support-facilities/center-for-high-performance-computing-chpc/services/software) - [on campus wiki](http://mgt2.chpc.wustl.edu/wiki119/index.php/Software)

[data storage rules](https://www.mir.wustl.edu/research/research-support-facilities/center-for-high-performance-computing-chpc/for-researchers/rules-and-guidelines) - [on campus wiki](http://mgt2.chpc.wustl.edu/wiki119/index.php/FAQ#Storage) (very low data storage limits without paying)

Monitoring current usage
[Ganglia](http://mgt2.chpc.wustl.edu/ganglia)

In [0]:
# PBS script example (in the real script file, the #!/bin/bash line below must be the very first line)

#!/bin/bash

# NOTE: #PBS comments request resources and have to come BEFORE any commands
# NOTE: after adapting this script to your needs you run it with the command:
# qsub this_script_file_name
# NOTE: the # comes immediately before PBS, no space


# This option will send email when the job starts/finishes
# To receive this email, you'll need a .forward file in your home
# directory that contains the email address you wish to receive your
# email at.
#PBS -m be




# Give the job a name
#PBS -N JKJ_test_job1




# Specify the resources needed:

# ** EXAMPLE 1/3 ** This assumes the job requires less 
# than 3GB of memory.  If you increase the memory requested, it
# will limit the number of jobs you can run per node, so only  
# increase it when necessary (i.e. the job gets killed for violating
# the memory limit). 
# Because it requests only one node it *should* work with multiprocessing
# #PBS -l nodes=1:ppn=2,walltime=00:15:00,mem=3gb

# ** EXAMPLE 2/3 ** Specify 32 cores by specifying 4 nodes with 8 cores each
# determine the nodes and cores by referring to the table on the page below:
# https://www.mir.wustl.edu/research/research-support-facilities/center-for-high-performance-computing-chpc/for-researchers/queuing-system
# each row is a node you can pick and the second column is the max "ppn".
# Because it requests 4 nodes use mpi4py not multiprocessing.
# #PBS -l nodes=4:ppn=8,walltime=24:00:00

# ** EXAMPLE 3/3 ** Specify a number of gpus
# #PBS -l nodes=1:ppn=1:gpus=1


# actually specify the resources
#PBS -l nodes=1:ppn=2,walltime=00:15:00,mem=3gb

# Specify the default queue for the fastest nodes
#PBS -q dque









# ** BEGIN OPTIONAL SECTION ** for issuing terminal commands, used when you didn't 
# already go to the right directory or need to shuttle files around.
# NOTE: to run these commands delete the # and the space after it

# cd Into the run directory; I'll create a new directory to run under
# cd /scratch/mtobias
# mkdir freesurfer
# cd freesurfer

# copy in the input data to the current directory
# cp /export/freesurfer/subjects/sample-001.mgz .
# copy in the input data to a subdirectory
# cp /export/freesurfer/subjects/sample-002.mgz ./data_folder/

# load software packages (other than your python scripts) see the software available below
# https://www.mir.wustl.edu/research/research-support-facilities/center-for-high-performance-computing-chpc/services/software
# module load package_name

# create environment variables (i.e. set the paths)
export PATH=/export/Anaconda3-5.2.0/bin:$PATH
export PATH=/usr/lib64/openmpi-1.4/bin/:$PATH
  
# access your virtual environment with everything installed correctly
source activate environment_name
  
# ** END OPTIONAL SECTION **







# run your python script with MPI
mpirun -np 2 python test1.py

In [0]:
# MPI example script AKA "test1.py" in the PBS script above

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
name = MPI.Get_processor_name()

print("Hello from rank {0} of {1} on {2}".format(rank, size, name))


def get_max_eigs(chunk,rank):
  # find the largest eigenvalue in this worker's chunk and remember where it was
  maxeig=-np.inf # eigvalsh returns real eigenvalues, so a real starting value is enough
  best_ndx=None
  for idx, arry in enumerate(chunk):
    eigval=np.linalg.eigvalsh(arry).max()
    if eigval>=maxeig:
      print("eigval is {0} on rank {1} index {2}".format(eigval,rank,idx))
      maxeig=eigval
      best_ndx=(rank,idx) # keep only the location of the current maximum
  return (maxeig,best_ndx)



# assemble data to be operated on in parallel
if rank == 0: # if the root node AKA "controller node"
  # load data from files at the root only (loading is not shown here)
  
  
  # example 1, generating then separating arrays for independent work
  data=[np.random.rand(10,10) for x in range(20)]
  # one chunk (a list of matrices) per MPI process; scatter needs exactly `size` chunks
  chunks=[list(chunk) for chunk in np.array_split(data,size)]
  data=[]
  # # example 2, splitting an array and working on the pieces
  # data = np.random.rand(20,20)
  # chunks = np.array_split(data, size)
  print([len(x) for x in chunks]) # verify the splitting
else:
  chunks = None
  
# send the data off to the different workers
chunk = comm.scatter(chunks, root=0)

# perform the calculation that is done on all workers
chunk_result=get_max_eigs(chunk,rank)
# WARNING AVOID FILE READ/WRITES (SAVING) UNTIL YOU UNDERSTAND THREAD SAFETY

# collect the data from the separate workers
gathered_chunks = comm.gather(chunk_result, root=0) # gather items to the root

if rank == 0: # do a final calculation at the root ONLY
  maxeig=[x[0] for x in gathered_chunks]
  best=maxeig.index(max(maxeig)) # which worker found the overall maximum
  maxeig_ndx=gathered_chunks[best][1] # (rank, idx) of the winning matrix
  maxeig=max(maxeig)
  maxeig_array=chunks[maxeig_ndx[0]][maxeig_ndx[1]]
  print(maxeig)
  print(maxeig_array)



# Physics department High Performance Computing facility


## Set up :



1.   Get access by emailing Sai (sai@physics.wustl.edu). Ask him to set up a data folder too, unless you can live with a low data storage limit.    
Get a list of the machines available and their capabilities here: [https://web.physics.wustl.edu/intranet/Pages/Computing/hpcCenter.php](https://web.physics.wustl.edu/intranet/Pages/Computing/hpcCenter.php)
1.   Learn to use and set up your favorite SFTP client and SSH: [FileZilla](https://filezilla-project.org/) or [MobaXterm](https://mobaxterm.mobatek.net/) (for windows to run Graphical User Interface software remotely)
1.   Log in to any node and run `hpcload` to get a list of machines and their current usage
1.   Log in to a node with acceptable capabilities and usage
1.   First run the command `htop` and hit F5 to go to "tree view" to view all the scripts that every user is running.    
The number of workers allocated is 1/2 the number of lines for Matlab and multiprocessing methods.    
 **Watch for a while as resource usage can fluctuate/spike.** 
1.   **Determine the number of cores you want to allocate based on the previous step.** Remember that the optimal solution to "[tragedy of the commons](https://en.wikipedia.org/wiki/Tragedy_of_the_commons)" type resource usage is for every new user to request X% of whatever remains (I guess 75%).
1.   You shouldn't need to set up the paths like on the CHPC
1.   Set up a virtual environment    
`conda create --name environment_name python=3`    
`source activate environment_name`    
This lets you install your own packages. From inside the environment: `conda install name_of_package`    
or from outside the environment: `conda install --name name_of_env name_of_package`
1.   No email alerts unless you write a function to do it: [https://realpython.com/python-send-email/](https://realpython.com/python-send-email/)
    
    
The cluster is very free and open, and some users don't check very thoroughly before running.    
Remember, if someone's (my) code spikes to high usage every once in a while, and its allocation combined with yours goes above 100% of the machine's capabilities, then the spike will last longer. If allocation is too far above 100%, everyone's code will get slower and slower.    
Always take the time to watch on `htop`. Never allocate above 100%. The one exception is when the existing code and the new code both have low usage with rare spikes, and there is a low probability that two spikes will co-occur.
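The allocation rule of thumb above can be written down as a tiny helper. This is only a sketch: `workers_to_request` is a made-up name, the 75% default is the guess quoted in the steps above, and `busy_cores` is whatever you estimated from watching `htop`.

```python
import math

def workers_to_request(total_cores, busy_cores, fraction=0.75):
    """Request a fixed fraction of the cores that are currently free, rounded down."""
    free_cores = max(total_cores - busy_cores, 0)  # never go negative on an overloaded machine
    return math.floor(free_cores * fraction)

# Example: a 32-core node where other users already occupy about 12 cores
print(workers_to_request(32, 12))
```

The result can be used as the pool size for your job. If it comes out as 0, the node is already full and it is better to pick another machine or wait.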

## GPU acceleration

If your computer has a modern graphics card, it is possible to parallelize your code *thousands* of times by having your Graphics Processing Unit (GPU) do the work instead of your CPU. However, there are some pretty big catches. The memory available for each "chunk" is pretty limited, meaning you can't send large programs to each of the workers, and the GPU handles information in a very different way. So the software available to do it is also pretty limited, and depends on having CUDA or OpenCL correctly configured on your machine.

Unless you want to get into different programming languages and write a lot of lines of code just to handle tedious tasks, you will be reliant on Python libraries such as CuPy. CuPy has versions of common NumPy functions that have been adapted to run on GPUs. CuPy also lets you run custom functions, called "kernels". Another option is to use the GPU capabilities of the Numba library, which is better at enabling you to write custom functions without much tedium, but has limited expressivity. Numba has the benefit of also being able to handle parallelization across multiple CPU cores, but in the same constrained way it handles GPU computation.

### DO THIS
Change your runtime type to include GPU acceleration (see the "Runtime" menu above), then run the following cells.

In [0]:
# check if you have Cuda
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


In [0]:
# Install CuPy for the CUDA version above
!pip install cupy-cuda100




In [0]:
# time the process on the GPU
import time
import numpy as np
import cupy as cp

matsize=1000 # when set to 100 multithreading is faster, set to 1000 and multiprocessing is


if __name__ == "__main__":
  # example 1, generating then separating arrays for independent work
  start_time = time.time()
  data=[cp.array(np.random.rand(matsize,matsize)) for x in range(20)] # cp.array copies each matrix to the GPU
  maxeig=[cp.linalg.eigvalsh(x).max() for x in data]
  maxeig_ndx=maxeig.index(max(maxeig))
  maxeig=max(maxeig)
  maxeig_array=data[maxeig_ndx]
  print('display maximum eigenvalue')
  print(maxeig)
  print('display the array which gives the maximum eigenvalue')
  print(maxeig_array)

    
  elapsed_time = time.time() - start_time
  print(f"It took {elapsed_time} seconds to process {len(data)} matrices")

display maximum eigenvalue
501.02482052606683
display the array which gives the maximum eigenvalue
[[0.39853323 0.88142266 0.37750285 ... 0.6924593  0.70819448 0.0519846 ]
 [0.93231736 0.55569323 0.31035479 ... 0.43053994 0.56119633 0.32046951]
 [0.22635327 0.92344492 0.10111251 ... 0.93197159 0.90083852 0.79663857]
 ...
 [0.62390399 0.57356271 0.95838915 ... 0.57640116 0.11349385 0.27786203]
 [0.76045963 0.27599598 0.87402397 ... 0.05726385 0.54164207 0.60788906]
 [0.7960042  0.3108535  0.7417449  ... 0.16183558 0.75761693 0.76943369]]
It took 4.932931184768677 seconds to process 20 matrices
