# Analyzing Performance

## Using Linux built-in performance measurement tools

One of the most widely used tools (as you may have read in the textbook) is `time` ([man page](https://man7.org/linux/man-pages/man1/time.1.html)), which ships with Linux and BSD (among other operating systems) and is usually located at `/usr/bin/time`.  To use it, put `/usr/bin/time` at the beginning of the line where you call `python` or your compiled program in your `%%qsub` cells.

By default it prints the CPU time spent in the program's own code (user), the CPU time spent in the kernel on the program's behalf (system), the wall-clock time (elapsed), the percentage of CPU used, and several statistics about memory usage.

You can customize the output with the `--format="..."` option, replacing the ellipsis with a printf-style format string.

For instance, if we want to time the program `sleep 5`, with the format string `"real %e system %S cpu %P avg_ram_kb %K"`, our line would look like:

`/usr/bin/time --format="real %e system %S cpu %P avg_ram_kb %K" sleep 5`

**NOTE**: The output from this tool goes to standard error (`STDIN.eNNNNNN`) rather than standard output, so make sure you look through both the standard output and the standard error files.  Make sure to look through the `STDIN.oNNNNNN` file too, so that you have the job number and know which run used which parameters.
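Once the timing line has landed in the error file, it can be handy to turn it back into numbers. The following is a minimal sketch that assumes the exact format string from the example above; the helper name `parse_time_line` and the sample line are hypothetical and only for illustration.

```python
import re

# Hypothetical helper: matches one line produced by the format string
# "real %e system %S cpu %P avg_ram_kb %K" from the example above.
TIME_LINE = re.compile(
    r"real (?P<real>[\d.]+) system (?P<system>[\d.]+) "
    r"cpu (?P<cpu>\d+)% avg_ram_kb (?P<avg_ram_kb>\d+)"
)

def parse_time_line(line):
    """Return the timing fields as numbers, or None if the line does not match."""
    match = TIME_LINE.search(line)
    if match is None:
        return None
    return {
        "real": float(match["real"]),
        "system": float(match["system"]),
        "cpu_percent": int(match["cpu"]),
        "avg_ram_kb": int(match["avg_ram_kb"]),
    }

# Illustrative line only, not output from a real run.
sample = "real 5.00 system 0.00 cpu 0% avg_ram_kb 0"
print(parse_time_line(sample))
```

To use it on a real job, read the `STDIN.eNNNNNN` file line by line and keep the lines for which the helper does not return `None`.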

Try using this tool on your code in the following cells!

In [None]:
%%writefile ParallelCode.py

import numpy as np
import multiprocessing
import random
import timeit 
import time
from multiprocessing import Array

# definition of monte carlo method to find pi
def monte_carlo_definition():
    # get random x and y points
    x = random.uniform(-1.0,1.0)
    y = random.uniform(-1.0,1.0)
    
    # determine if the point falls inside the unit circle
    return np.square(x) + np.square(y) <= 1

# mini parallelized method
def parallel_monte_carlo_sub_act(sizeOfSubArray, completeSum, i):
    # count how many of this worker's sizeOfSubArray random points land inside the circle
    completeSum[i] = sum(monte_carlo_definition() for j in range(sizeOfSubArray))

# parallelized method main call
def monte_carlo_parallel(numberOfSampleTrials: int = 1000000, numberOfProcesses: int = 6):
    sizeOfSubArray = numberOfSampleTrials // numberOfProcesses # number of samples each worker draws
    completeSum = Array('i', [0]*numberOfProcesses, lock=False) # shared array holding each worker's count of in-circle points
    processes = [] # list that will hold the started processes
    
    # start one process per worker (numberOfProcesses in total)
    for i in range(numberOfProcesses):
        process = multiprocessing.Process(target=parallel_monte_carlo_sub_act, args=(sizeOfSubArray, completeSum, i))
        
        processes.append(process)
        process.start()
    
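    # wait for every worker to finish before reading the results
    for process in processes:
        process.join()
    # divide by the number of samples actually drawn, which can be slightly less than
    # numberOfSampleTrials when it is not a multiple of numberOfProcesses
    pi = 4.0 * sum(completeSum) / (sizeOfSubArray * numberOfProcesses)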
    print("Parallel approximation of pi with size "+str(numberOfSampleTrials)+" and "+str(numberOfProcesses)+" workers: ", end=' ')
    print(pi)
    return pi
    
def main() -> None:
    start_time = time.time()
    monte_carlo_parallel(100000000, 2)
    print("--- %s seconds to complete ---" % (time.time() - start_time))

if __name__ == '__main__':
    #print("Replace this with proper timeit calls")
    timer_main = timeit.Timer(main)
    timer_main.repeat(repeat=1, number=1)
    

In [None]:
import cfxmagic 

In [None]:
%%qsub 
cd $PBS_O_WORKDIR
/usr/bin/time python ParallelCode.py

In [None]:
!qstat

In [None]:
!qdel 1906339

## Collecting run data

To collect enough data, submit at least 10 different variations of your code, such as the following example variations (based on the Monte Carlo example):

1. Run with draw number size 10000000 and 2 workers
1. Run with draw number size 10000000 and 4 workers
1. Run with draw number size 10000000 and 6 workers
1. Run with draw number size 10000000 and 8 workers
1. Run with draw number size 10000000 and 10 workers
1. Run with draw number size 10000000 and 12 workers
1. Run with draw number size 10000000 and 14 workers
1. Run with draw number size 10000000 and 16 workers
1. Run with draw number size 100000000 and 8 workers
1. Run with draw number size 100000000 and 16 workers
1. Run with draw number size 1000000000 and 8 workers
1. Run with draw number size 1000000000 and 16 workers
1. Run with draw number size 10000000000 and 8 workers
1. Run with draw number size 10000000000 and 16 workers

Now use the cells below to run your job for the different variations (you can either run the variations programmatically or run each one manually and note the data down).
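If you prefer the programmatic route, one possible shape for it is sketched below. It times each variation in-process with `time.perf_counter` rather than submitting separate jobs, and it uses a stand-in workload (`placeholder_workload`, a hypothetical name) so that it runs by itself; `run_variations` is likewise an illustrative helper, not part of the notebook.

```python
import time

def run_variations(func, variations):
    """Call func(samples, workers) for each variation and record the wall time."""
    results = []
    for samples, workers in variations:
        start = time.perf_counter()
        func(samples, workers)
        elapsed = time.perf_counter() - start
        results.append((samples, workers, elapsed))
    return results

# Stand-in workload so the sketch runs on its own; swap in
# monte_carlo_parallel from ParallelCode.py to time the real code.
def placeholder_workload(samples, workers):
    return sum(range(samples // workers))

variations = [(100_000, workers) for workers in (2, 4, 6, 8)]
for samples, workers, elapsed in run_variations(placeholder_workload, variations):
    print(f"{samples} samples, {workers} workers: {elapsed:.4f} s")
```

The returned list of `(samples, workers, seconds)` tuples can be copied straight into the plotting cells later on.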

In [None]:
# times to complete for each (seconds) using 'time' module
# using method monte_carlo_parallel(numberOfSamples, numberOfWorkers)

#monte_carlo_parallel(10000000, 2)
# 26.37137222290039 seconds to complete

#monte_carlo_parallel(10000000, 4)
# 12.796449184417725 seconds to complete

#monte_carlo_parallel(10000000, 6)
# 8.501582622528076 seconds to complete

#monte_carlo_parallel(10000000, 8)
# 6.437229156494141 seconds to complete

#monte_carlo_parallel(10000000, 10)
# 5.096891403198242 seconds to complete

#monte_carlo_parallel(10000000, 12)
# 4.514995336532593 seconds to complete

#monte_carlo_parallel(10000000, 14)
# 5.060871839523315 seconds to complete

#monte_carlo_parallel(10000000, 16)
# 4.5333778858184814 seconds to complete

#monte_carlo_parallel(10000000, 18)
# 4.334452152252197 seconds to complete

#monte_carlo_parallel(10000000, 20)
# 4.195279359817505 seconds to complete
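Raw wall times are easier to interpret once they are converted to speedup and efficiency. The sketch below uses the timings recorded above; because no single-worker run was recorded, it takes the 2-worker run as the baseline, which is an assumption of this example. The function names are illustrative.

```python
# Wall times (seconds) recorded above for 10,000,000 samples.
workers = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
seconds = [26.37137222290039, 12.796449184417725, 8.501582622528076,
           6.437229156494141, 5.096891403198242, 4.514995336532593,
           5.060871839523315, 4.5333778858184814, 4.334452152252197,
           4.195279359817505]

# Assumption: the 2-worker run is the baseline (no 1-worker run was recorded).
baseline_workers, baseline_seconds = workers[0], seconds[0]

def relative_speedup(t):
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / t

def relative_efficiency(n, t):
    """Speedup divided by the factor by which the worker count grew."""
    return relative_speedup(t) / (n / baseline_workers)

for n, t in zip(workers, seconds):
    print(f"{n:2d} workers: speedup {relative_speedup(t):.2f}x, "
          f"efficiency {relative_efficiency(n, t):.2f}")
```

An efficiency close to 1 means the extra workers are being used well; values that fall as workers are added indicate the plateau in wall time.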


We can also submit to different types of machines.  The Intel(R) Core(tm) processors differ in specifications from the Intel(R) Xeon(tm) processors.  To switch between the Intel Core nodes and the Intel Xeon nodes, call `qsub` with different node properties, as shown:

In [None]:
%%qsub -l nodes=1:core:ppn=2
cd $PBS_O_WORKDIR
python ParallelCode.py

In [None]:
!qstat

In [None]:
%%qsub -l nodes=1:xeon:ppn=2
cd $PBS_O_WORKDIR
python ParallelCode.py

## Generating plots

We would like to generate plots from the data we collected above.  Ideally, we'd generate data files that we could simply import and plot, but this time it's okay to create lists of data manually, as this isn't really a class on data analysis.

A nice video explaining plotting in Jupyter notebooks is available at: https://www.youtube.com/watch?v=Hr4yh1_4GlQ

Practice by plotting the number of workers (or another variable, such as data draw size) against a response (such as wall time, cpu time, memory usage, etc.):

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import numpy as np
import matplotlib.pyplot as plt

# all points saved as array
x = np.array([2,4,6,8,10,12,14,16,18,20])
y = np.array([26.37137222290039,12.796449184417725,8.501582622528076,6.437229156494141,5.096891403198242,4.514995336532593,5.060871839523315,4.5333778858184814,4.334452152252197,4.195279359817505])

# cubic (degree 3) polynomial of best fit
model = np.poly1d(np.polyfit(x, y, 3))

# x values at which to evaluate the fitted curve
polyline = np.linspace(2, 20, 50)

# plot the line of best fit with red line
plt.plot(polyline, model(polyline), '--', color='red')

#add points to plot
plt.scatter(x, y)

# values for the x axis
plt.xticks([0,2,4,6,8,10,12,14,16,18,20])

# values for the y axis
plt.yticks([0,3,6,9,12,15,18,21,24,27,30])

# title of graph
plt.title("Number of workers vs. Time to complete (secs) with 10,000,000 points")

# labels for each axis
plt.xlabel("Number of workers", fontsize = 18)
plt.ylabel("Time to complete", fontsize = 16)

print("Line of best fit equation:\n "+str(model)+"\n")

plt.show()
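The cubic polynomial is one choice of curve. As an alternative that is not part of the original notebook, a model of the form `seconds ≈ a + b / workers` has a built-in plateau (`a`) and can be fitted with ordinary least squares on `1 / workers`. The sketch below does this in plain Python; `fit_inverse_model` is a hypothetical helper name.

```python
def fit_inverse_model(workers, seconds):
    """Least-squares fit of seconds ≈ a + b / workers; returns (a, b)."""
    u = [1.0 / n for n in workers]  # the model is linear in u = 1 / workers
    count = len(u)
    mean_u = sum(u) / count
    mean_t = sum(seconds) / count
    covariance = sum((ui - mean_u) * (ti - mean_t) for ui, ti in zip(u, seconds))
    variance = sum((ui - mean_u) ** 2 for ui in u)
    b = covariance / variance
    a = mean_t - b * mean_u
    return a, b

# Timings recorded earlier for 10,000,000 samples.
workers = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
seconds = [26.37137222290039, 12.796449184417725, 8.501582622528076,
           6.437229156494141, 5.096891403198242, 4.514995336532593,
           5.060871839523315, 4.5333778858184814, 4.334452152252197,
           4.195279359817505]

a, b = fit_inverse_model(workers, seconds)
print(f"fitted model: seconds ≈ {a:.3f} + {b:.3f} / workers")
```

Under this model, `a` estimates the time that adding workers cannot remove and `b` the work that is divided among them.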



## Presenting data/plots inline with Markdown text

Now that you have your data and know how to plot your data, create Markdown and code cells below to answer the following questions in report format (incorporating the code cells to generate plots):

1. What software application did you choose to attempt to parallelize or augment?
2. How did you parallelize or augment your chosen software application?
3. How did the throughput or latency of your software application change as you increased the number of resources (workers, CPUs, etc.)?
4. Were there any differences between the Linux system performance measurement tools and your language-based measurement tools?  What may be the cause of that?
5. What would you change if you were to attempt this project again?

In [None]:
%%markdown 

# Monte Carlo method for finding pi parallelization report

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I chose to focus on parallelizing the Monte Carlo method for 
finding pi because I wanted to see how the run time of the program changed with an increase in workers and 
how much time is taken depending on the sample size. I ultimately found that the program is a lot faster when 
more workers are operating; however, the speedup eventually plateaus, and beyond that point adding more workers 
doesn't change the time to complete much.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I parallelized the software application by dividing the sample size given by the user
into smaller sizes that each worker would be in charge of. The smaller size was calculated by dividing the sample size 
by the number of workers the user wanted. Each worker (or process, as it is called in the program)
is created using the multiprocessing module and is then in charge of using the definition of the Monte Carlo method to create
random points, determine whether they are valid, and save the count of valid points in its own slot of a shared array. After all of the workers
complete their respective tasks, the counts are summed in the main method to determine the estimate of pi. The
main method then prints the amount of time taken to complete the entire process and the estimate of pi
for the given run.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; As I stated in the beginning, the time taken to complete the task decreases with 
an increased number of workers; however, it plateaus after reaching a certain number of workers. The reduction in 
latency as the number of workers increased is significant, as can be seen in the reduced 
time to complete in the graph below (figure 2.1). Conversely, if the number of workers is held constant, the latency 
increases drastically as the sample size increases, and eventually the program never returns an output
due to the amount of work that needs to be completed. In figure 2.1, we can see that after reaching 10 workers
the time needed to complete the program stays between roughly 4 and 5 seconds. Based on this information we can assume that 
the program has reached a plateau in terms of how efficient it can be. Where this plateau occurs ultimately varies because every user has
a different CPU and operating system that affect the performance times. 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; There was a slight difference between the measurement tools, but it
is small enough to be negligible: the times they reported differed by around 0.0003 seconds.
One possible explanation is that the Python measurement tool needed to perform 
a couple more tasks to complete the same measurement as the Linux system performance measurement tools. For the purpose
of this program, I decided to use the 'time' module from the Python standard library because
I found it easier to understand and it gave me the elapsed time in seconds.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; One thing I would change is the presentation of the data collected. I would 
do this by creating two tables, one in which the number of workers increases with a fixed sample size
(such as the graph provided below) and another in which the sample size increases with a fixed number of workers. This would 
demonstrate not only how parallelization and the number of workers can decrease the time needed to complete a 
program, but also how the time grows as the sample size increases until the task 
would take too long to complete.




In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import numpy as np
import matplotlib.pyplot as plt

print("Figure 2.1")

# all points saved as array
x = np.array([2,4,6,8,10,12,14,16,18,20])
y = np.array([26.37137222290039,12.796449184417725,8.501582622528076,6.437229156494141,5.096891403198242,4.514995336532593,5.060871839523315,4.5333778858184814,4.334452152252197,4.195279359817505])

# cubic (degree 3) polynomial of best fit
model = np.poly1d(np.polyfit(x, y, 3))

# x values at which to evaluate the fitted curve
polyline = np.linspace(2, 20, 50)

# plot the line of best fit with red line
plt.plot(polyline, model(polyline), '--', color='red')

#add points to plot
plt.scatter(x, y)

# values for the x axis
plt.xticks([0,2,4,6,8,10,12,14,16,18,20])

# values for the y axis
plt.yticks([0,3,6,9,12,15,18,21,24,27,30])

# title of graph
plt.title("Number of workers vs. Time to complete (secs) with 10,000,000 points")

# labels for each axis
plt.xlabel("Number of workers", fontsize = 18)
plt.ylabel("Time to complete", fontsize = 16)

plt.show()