# Assignment 2: Performance Metrics & First Week Flop/s

This assignment has some questions that you need to answer with text, and some code that you need to write.

You should put all of your textual answers in this notebook: `Insert->Insert Cell Below` to create a new cell below
the question, and `Cell->Cell Type->Markdown` to make it a cell for entering text.

You will test your code on the compute nodes of pace-ice, and that is also where we will evaluate it.
Please complete the text portions when you are logged into a head node working locally, and leave the compute nodes for when you actually need them.

**Due: Tuesday, September 2, 9:30 am**

**Total: 10 pts + 2 bonus pts (1 for working on a node with GPUs, 1 for optimizing the flop/s code)**

## Performance Metrics

In class we talked about the _strong-scaling efficiency_ of a parallel algorithm / machine pair: $H_f(P) = T_f(1) / (P T_f(P))$.

We then talked about the _weak-scaling efficiency_ of algorithm $f$ that can be applied to different problem sizes $N$: $E_f(N,P) = T_f(N/P,1) / T_f(N,P)$.

The question came up of how they are related to each other.

First, the notion of strong scaling doesn't have a concept of problem size, so let's add it: let's define

$$H_f(N,P) = T_f(N,1) / (P T_f(N,P)).$$

This is simply strong-scaling efficiency for each problem instance individually.
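
As a rough illustration of the two definitions, the sketch below computes both efficiencies from measured run times. The helper names `strong_eff` and `weak_eff` and all timing values are hypothetical; they are not part of the assignment code.

```bash
# strong_eff T1 P TP: T_f(N,1) / (P * T_f(N,P))
strong_eff() {
    awk -v t1="$1" -v p="$2" -v tp="$3" 'BEGIN { printf "%.3f\n", t1 / (p * tp) }'
}

# weak_eff TS TP: T_f(N/P,1) / T_f(N,P)
weak_eff() {
    awk -v ts="$1" -v tp="$2" 'BEGIN { printf "%.3f\n", ts / tp }'
}

# Hypothetical timings in seconds: T_f(N,1)=16, T_f(N/P,1)=3, T_f(N,P)=5, P=4.
strong_eff 16 4 5   # 16 / (4 * 5), prints 0.800
weak_eff 3 5        # 3 / 5, prints 0.600
```

Both helpers only divide the timings as the formulas prescribe, so they can be reused with real measurements from any run.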

**Question 1 (1 pt):** Show that the relative order of strong and weak scaling efficiency (whether $H_f(N,P) < E_f(N,P)$ or $E_f(N,P) < H_f(N,P)$) can be related to the efficiency of the serial algorithm, that is, whether $T_f(N,1)$ as a function of $N$ exhibits superlinear or sublinear behavior.

**Answer**

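$$H_f(N,P) = T_f(N,1) / (P T_f(N,P)),$$
$$E_f(N,P) = T_f(N/P,1) / T_f(N,P).$$
**Then**
$$\frac{H_f(N,P)}{E_f(N,P)} = \frac{T_f(N,1) / (P T_f(N,P))}{T_f(N/P,1) / T_f(N,P)},$$
**which simplifies to**
$$\frac{H_f(N,P)}{E_f(N,P)} = \frac{T_f(N,1)}{P \, T_f(N/P,1)}.$$
**If $T_f(N,1)$ is superlinear in $N$**, one problem of size $N$ costs more than $P$ problems of size $N/P$:
$$P \, T_f(N/P,1) < T_f(N,1),$$
**thus**
$$\frac{H_f(N,P)}{E_f(N,P)} > 1,$$
which means $E_f(N,P) < H_f(N,P)$. **If $T_f(N,1)$ is sublinear**, the inequality reverses and $H_f(N,P) < E_f(N,P)$. If $T_f(N,1)$ is exactly linear, the two efficiencies are equal.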
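
A worked example may make the ratio $T_f(N,1) / (P \, T_f(N/P,1))$ concrete. Assume, purely for illustration, a power-law serial cost $T_f(N,1) = c N^{\alpha}$ with constants $c > 0$ and $\alpha > 0$ (this model is an assumption, not something given in the assignment). Then

$$\frac{H_f(N,P)}{E_f(N,P)} = \frac{c N^{\alpha}}{P \, c \, (N/P)^{\alpha}} = P^{\alpha - 1}.$$

For $P > 1$, a superlinear cost ($\alpha > 1$) gives $P^{\alpha-1} > 1$ and so $H_f(N,P) > E_f(N,P)$; a sublinear cost ($\alpha < 1$) gives $H_f(N,P) < E_f(N,P)$; and a linear cost ($\alpha = 1$) makes the two efficiencies equal.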


## PACE-ICE

**Head node exercise 1 (1 pt):** What command should you run from a head node to see a list of all the compute nodes in `coc-ice` and their availability? [Resource for this question: the [orientation slides](http://pace.gatech.edu/sites/default/files/pace-ice_orientation_2.pdf)]

**ANSWER**

In [1]:
pace-check-queue coc-ice


	** NEW FEATURE : add '-s' to pace-check-queue to list 
	** scheduler features for each node

=== coc-ice Queue Summary: ====
	Last Update                            : 09/01/2019 18:45:02
	Number of Nodes (Accepting Jobs/Total) : 48/49 (97.96%)
	Number of Cores (Used/Total)           : 2/916 ( 0.22%)
	Amount of Memory (Used/Total) (MB)     : 123561/8105865 ( 1.52%)
  Hostname       tasks/np Cpu%  loadav%  used/totmem(MB)   Mem%   Accepting Jobs? 
rich133-h35-15-r   1/28    3.6     1.0      3044/131126     2.3    Yes (free)             
rich133-h35-16-l   0/28    0.0     1.2      2912/131126     2.2    Yes (free)             
rich133-h35-16-r   0/28    0.0     0.5      3074/131126     2.3    Yes (free)             
rich133-h35-17-l   0/28    0.0     0.7      5062/518966     1.0    Yes (free)             
rich133-h35-17-r   0/28    0.0     0.5      5068/518966     1.0    Yes (free)             
rich133-h35-18-l   0/28    0.0     0.4      5234/518966     1.0    Yes (free)             
ri

Try it out: open up this notebook on a head node and compare the list you get to the [orientation slides](http://pace.gatech.edu/sites/default/files/pace-ice_orientation_2.pdf).  You'll see that it has grown, and they haven't updated the orientation slides.  We'll just have to find out what all these new nodes are for ourselves.

---
### A word on running jupyter on pace-ice:

As we discussed in class, screen refresh can be a bit laggy if you try to run a jupyter notebook through a browser opened on the head node or a compute node.  See the [guide](../../notes/logistics/compute-node-notebook.ipynb) in the notes for instructions on running the jupyter server on the compute nodes and the browser on your own computer.  You don't have to work directly in the notebook: you can work on your answers in the terminal, and then paste them into the notebook, as long as you're confident that they are correct.

---

**Head node exercise 2 (1 pt):** From the output of the above answer, you can probably see that we have a few different types of nodes to work with.  Fill in the blanks in the list below, describing the _properties_ of the different types. [Command line tools you might want to use: `qnodes`, `grep`, `sort`]

1. #__ nodes with __ CPU core(s) and no GPUs
2. #__ nodes with __ CPU core(s) and #__ GPU(s) of type NVIDIA Tesla ____.
3. #__ nodes with __ CPU core(s) and #__ GPU(s) of type NVIDIA Tesla ____ (this group includes `rich133-s42-21.pace.gatech.edu`, even though the type of GPU is not listed for this node)
4. One node (`rich133-s30-20.pace.gatech.edu`) with __ CPU core(s) (currently offline).

**Answer**
1. $23$ nodes with $28$ CPU core(s) and no GPUs.
2. $12$ nodes with $12$ CPU core(s) and $2$ GPU(s) of type NVIDIA Tesla K40.
3. $12$ nodes with $8$ CPU core(s) and $1$ GPU(s) of type NVIDIA Tesla P100, and $1$ node (`rich133-s42-21`) with $8$ CPU core(s) and $2$ GPU(s) of type NVIDIA Tesla P100.
4. One node (`rich133-s30-20.pace.gatech.edu`) with $24$ CPU core(s) and $2$ GPU(s) of type NVIDIA Tesla K40 (currently offline).
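
Counts like the ones in this answer can be tallied with a `grep | sort | uniq -c` pipeline. The sketch below shows the idea on a few lines of hypothetical stand-in text shaped like `qnodes` output. The node names, the `sample_nodes` function, and the `tally_cores` helper are all made up for illustration.

```bash
# Hypothetical stand-in for qnodes output (not real cluster data).
sample_nodes() {
    cat <<'EOF'
node-a
     np = 28
node-b
     np = 28
node-c
     np = 8
     gpus = 1
EOF
}

# Keep only the core-count lines, group identical ones, and count each group.
tally_cores() {
    grep 'np = ' | sort | uniq -c
}

# Prints one line per distinct core count, prefixed by how many nodes have it.
sample_nodes | tally_cores
```

On the real cluster the same helper could read `qnodes` directly, for example `qnodes | tally_cores`.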

In [2]:
# List every compute node, then print its core count (np) and GPU count (gpus, if any).
Nodes="$(qnodes|grep '^rich')"
for i in $Nodes; do
    echo "$i"
    qnodes "$i"|grep np
    qnodes "$i"|grep gpus
done



rich133-h35-15-r.pace.gatech.edu
     np = 28
rich133-h35-16-l.pace.gatech.edu
     np = 28
rich133-h35-16-r.pace.gatech.edu
     np = 28
rich133-h35-17-l.pace.gatech.edu
     np = 28
rich133-h35-17-r.pace.gatech.edu
     np = 28
rich133-h35-18-l.pace.gatech.edu
     np = 28
rich133-h35-18-r.pace.gatech.edu
     np = 28
rich133-k33-17.pace.gatech.edu
     np = 8
     gpus = 2
rich133-k40-17.pace.gatech.edu
     np = 8
     gpus = 1
rich133-k40-18.pace.gatech.edu
     np = 8
     gpus = 1
rich133-k40-20-l.pace.gatech.edu
     np = 28
rich133-k40-20-r.pace.gatech.edu
     np = 28
rich133-k40-21-l.pace.gatech.edu
     np = 28
rich133-k40-21-r.pace.gatech.edu
     np = 28
rich133-k40-22-l.pace.gatech.edu
     np = 28
rich133-k40-22-r.pace.gatech.edu
     np = 28
rich133-k40-23-l.pace.gatech.edu
     np = 28
rich133-k40-23-r.pace.gatech.edu
     np = 28
rich133-k40-24-l.pace.gatech.edu
     np = 28
rich133-k40-24-r.pace.gatech.edu
     np = 28
rich133-k40-25-l.pace.gatech.edu
     np = 28
r

In [6]:
printf "All Nodes CPU, CPU cores count:\n"
qnodes|grep "np"|sort|uniq -c
printf "All Nodes with GPU, GPU count: \n"
qnodes|grep gpus|sort|uniq -c
printf "2 GPUs with Tesla P100 count total:\n"
qnodes|grep "gpu\[0]"|grep "gpu\[1]"|grep 'P100'|sort|uniq -c|awk '{s+=$1} END {print s}'
printf "2 GPUs with Tesla K40m count total:\n"
qnodes|grep "gpu\[0]"|grep "gpu\[1]"|grep 'K40m'|sort|uniq -c|awk '{s+=$1} END {print s}'
printf "Access rich133-s30-20.pace.gatech.edu info:\n"
qnodes rich133-s30-20.pace.gatech.edu 
qnodes rich133-s42-21.pace.gatech.edu

All Nodes CPU, CPU cores count:
     12      np = 12
      1      np = 24
     23      np = 28
     13      np = 8
All Nodes with GPU, GPU count: 
     12      gpus = 1
     14      gpus = 2
2 GPUs with Tesla P100 count total:
1
2 GPUs with Tesla K40m count total:
13
Access rich133-s30-20.pace.gatech.edu info:
rich133-s30-20.pace.gatech.edu
     state = down,offline
     power_state = Running
     np = 24
     properties = core24,mhz2400,ib,ibQDR,localdisk,nvidiagpu,teslak40,ssd,5-2620v3,intel,rhel6
     ntype = cluster
     status = rectime=1566377608,macaddr=40:8d:5c:65:77:74,cpuclock=Fixed,varattr=,jobs=,state=free,size=87755940kb:88117536kb,netload=240088208,gres=,message=ERROR Health check failed:  [pace_mce] /var/log/mcelog has at least one uncorrectable error,loadave=0.05,ncpus=12,physmem=132177880kb,availmem=133108192kb,totmem=134275028kb,idletime=224,nusers=0,nsessions=0,uname=Linux rich133-s30-20.pace.gatech.edu 2.6.32-573.12.1.el6.x86_64 #1 SMP Mon Nov 23 12:55:32 EST 2015 x

**Head node exercise 3 (1 pt):** For the next questions, I need you to log in to compute nodes to find out about them, but you need to be able to specify which type of compute nodes you are accessing.

For each of the types of nodes 1, 2, and 3 in the question above, give me a `qsub` command to start a `jupyter_notebook_script.sh` job on that type of node, with the following requirements:

* The job should give you exclusive access to one node and all its cores and devices.
* The job should begin in the CSE6230 directory.
* The job should end after 30 minutes.

[Resources: [compute-node-notebook.ipynb](../../notes/logistics/compute-node-notebook.ipynb), [orientation slides](http://pace.gatech.edu/sites/default/files/pace-ice_orientation_2.pdf)]

In [19]:
# put the qsub command for type 1 in this cell
qsub -l nodes=rich133-h35-18-l.pace.gatech.edu:ppn=28,walltime=00:30:00 $CSE6230_DIR/utils/jupyter_notebook_job.sh -w $CSE6230_DIR

106309.ice-sched.pace.gatech.edu


In [1]:
# put the qsub command for type 2 in this cell
qsub -l nodes=rich133-s30-12.pace.gatech.edu:ppn=12:gpus=2,walltime=00:30:00 $CSE6230_DIR/utils/jupyter_notebook_job.sh -w $CSE6230_DIR

106245.ice-sched.pace.gatech.edu


In [1]:
# put the qsub command for type 3 in this cell
qsub -l nodes=rich133-s42-23.pace.gatech.edu:ppn=8:gpus=1,walltime=00:30:00  $CSE6230_DIR/utils/jupyter_notebook_job.sh -w $CSE6230_DIR



106209.ice-sched.pace.gatech.edu


## What have we got to work with?

Now, we need to switch from a notebook running on the head node to one running on a compute node, so `File->Save and Checkpoint` this notebook and `File->Close and Halt` it.  (Now would also be a good time to `git add` and `git commit` changes to this file.)  Use one of your interactive job scripts to connect to a compute node and run the notebook there.
See you on the other side!

---

Okay, you're running on the compute node.

**Compute node exercise 1 (2 pt):** Using bash scripting (`awk`, `grep`, `sed`) or any other tool you like (you could, e.g., write a python script in a separate file and call it, as long as you `git add` it), set the variables in the cell below so that the printout that follows is correct.  Your script should be correct on any type of compute node.

Resources: the file `/proc/cpuinfo`, the utility `nvidia-smi`; if you are very new to using a shell command line and the utilities that go with it in linux, please look at the [training slides on Linux](https://pace.gatech.edu/training) from PACE.

Note: when you run a command in backticks, you can assign its value to a variable like

```bash
MY_FILES=`ls -al`
```

Also note: whenever you encounter a new program or utility `AAA` that you've never used before, `man AAA` or `AAA --help` are the first places to go if you want to know what different command line flags do.

Some [one-liners](https://en.wikipedia.org/wiki/One-liner_program) that you may find useful:

* `grep -P -m 1 -o -e "(?<=XXX\s: ).*" YYY`: look in file `YYY` for the first line containing the string `XXX`, followed by exactly one whitespace character (such as a tab) and then ": ", and print what comes after that on the line
* `wc -l YYY` counts the number of lines in file `YYY`
* most command line utilities that read files can also read the output of a previous command through a pipe `|`, for example to count the lines of a directory listing (roughly the number of files in the directory):

```bash
ls -al | wc -l
```

* `grep -c "ZZZ" YYY` counts the number of lines in file `YYY` that contain the string `ZZZ`
* Nodes without GPUs won't have the `nvidia-smi` utility.  You can tell when a utility is unavailable if `which AAA` returns an error.  If you want to write a one-liner that only runs command `AAA` when `nvidia-smi` is available, you can do that like this:

```bash
(which nvidia-smi &> /dev/null) && (AAA)
```
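
The same guard can be wrapped in a small function that also reports when the tool is missing. This is a minimal sketch: `run_if_available` and the tool name `no-such-tool-xyz` are hypothetical, and it uses the shell builtin `command -v` as a stand-in for `which`.

```bash
# Run a command only if a prerequisite tool is on the PATH.
# Usage: run_if_available TOOL COMMAND [ARGS...]
run_if_available() {
    tool="$1"
    shift
    if command -v "$tool" > /dev/null 2>&1; then
        "$@"                 # tool found: run the requested command
    else
        echo "unavailable"   # tool missing: print a fallback instead
    fi
}

run_if_available sh echo "sh found"                    # prints "sh found"
run_if_available no-such-tool-xyz echo "not printed"   # prints "unavailable"
```

On a compute node the first argument would be `nvidia-smi`, so that GPU queries are skipped on CPU-only nodes.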

In [241]:
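# CPU model: the text after "model name<TAB>: " on the first matching line.
CPU_NAME=`grep -P -m 1 -o -e "(?<=model name\s: ).*" /proc/cpuinfo`
# /proc/cpuinfo has one "model name" line per logical core.
CORE_COUNT=`grep -c 'model name' /proc/cpuinfo`
# Query GPUs only when nvidia-smi exists, so CPU-only nodes print no errors.
GPU_NAME=`(which nvidia-smi &> /dev/null) && (nvidia-smi -q | grep 'Product Name' | uniq | grep -o "Tesla.*")`
GPU_COUNT=`(which nvidia-smi &> /dev/null) && (nvidia-smi -q | grep -c 'Product Name')`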

In [242]:
echo "This node has ${CORE_COUNT} cores: its architecture is (Manufacturer, Product Id) ${CPU_NAME}"
if [[ ! $GPU_COUNT || $GPU_COUNT == 0 ]] ;  then
    echo "This node has no GPUs"
else
    echo "This node has ${GPU_COUNT} GPUs: its/their architecture is (Manufacturer, Product Id) ${GPU_NAME}"
fi

This node has 28 cores: its architecture is (Manufacturer, Product Id) Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
This node has no GPUs


**Compute node exercise 2 (1 pt):** After you have logged out of the compute node, use whatever resources published on the web you can find to estimate the peak _single precision_ (aka FP32) flop/s of this node (you only need to do this step for one of the types of nodes, not all of them).

[Resources: [ark.intel.com](https://ark.intel.com), [wikipedia:FLOPS](https://en.wikipedia.org/wiki/FLOPS), [wikichip](https://en.wikichip.org), our notebook on [processors](../../notes/processors/processors-alone.ipynb)]

In [1]:
lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('

CPU(s):                28
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2


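I have chosen node type **1**.  The peak flop/s for this node is **2150.4** gigaflop/s.  Here is how I calculated that:
$$\text{flop/s} = \text{sockets} \times \frac{\text{cores}}{\text{socket}} \times \frac{\text{cycles}}{\text{second}} \times \frac{\text{flops}}{\text{cycle}}$$
(source: wikipedia:FLOPS)

Each core can issue 2 FMA instructions per cycle on 8 FP32 lanes (256-bit AVX2), and each FMA counts as 2 flops, so $2 \times 8 \times 2 = 32$ flops/cycle:
$$\text{Peak} = 2 \times 14 \times 2.4\ \text{GHz} \times 32\ \text{flops/cycle}$$
$$\text{Peak} = 2150.4\ \text{Gflop/s}$$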
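
The arithmetic in this estimate can be checked with a short script. This is only a sketch: `peak_gflops` is a hypothetical helper name, and the inputs are the socket and core counts from the `lscpu` output, the 2.40 GHz base clock from the CPU name, and the 32 flops/cycle figure used in the answer.

```bash
# Peak estimate in Gflop/s: sockets * cores per socket * GHz * flops per cycle per core.
# Usage: peak_gflops SOCKETS CORES_PER_SOCKET GHZ FLOPS_PER_CYCLE
peak_gflops() {
    awk -v s="$1" -v c="$2" -v ghz="$3" -v fpc="$4" \
        'BEGIN { printf "%.1f\n", s * c * ghz * fpc }'
}

# Node type 1: 2 sockets, 14 cores each, 2.4 GHz, 32 FP32 flops/cycle/core.
peak_gflops 2 14 2.4 32   # prints 2150.4
```

The same helper can be rerun with other clock rates, for instance a sustained clock under vector load if that is known, to see how sensitive the estimate is to that input.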



## Flop/s fever

We've got to scratch that itch: we just want to go fast.  Okay, let's get it out of our system, and we'll look at more practical computations in future assignments.

You should choose one of the node types for this task.  Because this is more complex if multiple devices are involved,
**1 bonus point** is earned for choosing a node with GPUs.

**Compute node exercise 3 (2 pts):** The command below will compile and run essentially the following computation:

```C
for (i = 0; i < Nh; i++) { /* this loop will run on the "host" (CPUs) */
  for (j = 0; j < T; j++) {
    ah[i] = ah[i] * b + c;
  }
}

for (i = 0; i < Nd; i++) { /* this loop will run on the "device" (GPUs) */
  for (j = 0; j < T; j++) {
    ad[i] = ad[i] * b + c;
  }
}
```
And it will report the flop/s for the whole calculation.

`Nh` array entries will be on the host and `Nd` entries will be on each of the devices.  Try to find values of `Nh`, `Nd`, and `T`, and (optionally) compiler optimization flags that give you the highest flop/s.  Things to consider:

- Try to make your whole computation run for about a second.
- The time reported is the maximum time for any device: if one sits idle while the other finishes, it will rob you of flop/s.
- I suggest looking at one type of device at a time: set one of `Nh` or `Nd` to zero.  Once you've found your best flop/s for that device, optimize the other, and then try to strike a balance.
- Experiment with the merits of putting more weight on `Nh` and `Nd` vs more weight on `T`.
  Try to use **Little's Law** to make sure that you have enough parallelism to keep the pipelines filled.
- You can also choose to pass the option `Bs=X` to control the thread block size for the GPU, where `X` is a power of 2 between 64 and 2048.
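
Little's Law says that the work in flight needed to keep a pipeline full equals latency times throughput. A rough sketch of that estimate for one CPU core follows. The helper name `independent_fmas` is hypothetical, and the three input values are assumed example figures, so replace them with numbers for the device you are actually using.

```bash
# Little's Law: work in flight = latency * throughput.
# Usage: independent_fmas LATENCY_CYCLES ISSUE_PER_CYCLE SIMD_LANES
independent_fmas() {
    latency_cycles="$1"    # cycles until an FMA result is available
    issue_per_cycle="$2"   # vector FMA instructions issued per cycle
    simd_lanes="$3"        # FP32 elements per vector register
    echo $((latency_cycles * issue_per_cycle * simd_lanes))
}

# Assumed example: 5-cycle latency, 2 FMA issues per cycle, 8 FP32 lanes.
independent_fmas 5 2 8   # prints 80
```

Because each update of `ah[i]` in the `j` loop depends on the previous one, the independent work has to come from different values of `i`. Under these assumed figures, each core would need about 80 distinct array entries in flight.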

In [3]:
dmesg | grep cache

Dentry cache hash table entries: 67108864 (order: 17, 536870912 bytes)
Inode-cache hash table entries: 33554432 (order: 16, 268435456 bytes)
Mount-cache hash table entries: 256
PCI: old code would have set cacheline size to 32 bytes, but clflush_size = 64
PCI: pci_cache_line_size set to 64 bytes
IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
ehci_hcd 0000:00:1a.0: cache line size of 64 is not supported
ehci_hcd 0000:00:1d.0: cache line size of 64 is not supported
xhci_hcd 0000:00:14.0: cache line size of 64 is not supported
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
CacheFiles: Can't set xattr on cache [451881] (err 95)


In [41]:
lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                28
On-line CPU(s) list:   0-27
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2399.777
BogoMIPS:              4799.30
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13
NUMA node1 CPU(s):     14-27


***Trials:***
Since only the CPU is used for node type 1, I set Nd=0 and started with Nh=256, doubling Nh each time and monitoring the flop/s rate.  A relatively high peak occurred at Nh=8388608, T=12, giving a flop/s rate of 6.6e+11 to 6.9e+11.

In [156]:
make clean

make run_fma_prof Nh=8388608 Nd=0 T=12 COPTFLAGS='-O3 -xHost' CUOPTFLAGS='-O3' 

rm -f *.o *.optrpt *.so fma_prof fma_prof_opt
icc -g -Wall -std=c99 -fPIC -O3 -xHost -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -c -o fma_prof.o fma_prof.c
icc -g -Wall -std=c99 -fPIC -O3 -xHost -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -c -o fma_omp.o fma_omp.c
icc -g -Wall -std=c99 -fPIC -O3 -xHost -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -c -o fma_loop_host.o fma_loop_host.c
nvcc -ccbin=icpc -Xcompiler '-fPIC' -O3 -dc -o fma_cuda.o fma_cuda.cu
nvcc -ccbin=icpc -Xcompiler '-fPIC' -O3 -dc -o fma_loop_dev.o fma_loop_dev.cu
nvcc -ccbin=icpc -Xcompiler '-fPIC' -dlink  fma_cuda.o fma_loop_dev.o -o fma_cuda_link.o
icpc -qopenmp -shared -Wl,-soname,libfma_cuda.so -o libfma_cuda.so fma_cuda_link.o fma_cuda.o fma_loop_dev.o -L/usr/local/pacerepov1/cuda/8.0.44/lib64 -Wl,-rpath,/usr/local/pacerepov1/cuda/8.0.44/lib64 -lcudart
icpc -qopenmp -o fma_prof fma_prof.o fma_omp.o fma_loop_host.o libfma_cuda.so -Wl,-rpath,.
OMP_PROC_BIND=spread OMP_NUM_THREADS=28  ./fma_

**Compute Node Exercise 4 (Bonus 1 pt):** Now let's see if we can make any transformations to the code to make a difference.

We will run the same program, but with fused multiply add loops that you have tried to optimize.  You should edit the files
`fma_loop_host_opt.c` and/or `fma_loop_dev_opt.cu`: they start out exactly the same as the reference implementations used above.

In [17]:
diff fma_loop_host.c fma_loop_host_opt.c

21,22c21,22
<   for (int i = 0; i < N; i++) {
<     for (int j = 0; j < T; j++) {
---
>   #pragma unroll(12)
>   for (int i = 0; i < N; i+=8) {
24c24,35
<     }
---
>       a[i+1] = a[i+1]*b + c;
>       a[i+2] = a[i+2]*b + c;
>       a[i+3] = a[i+3]*b + c;
>       a[i+4] = a[i+4] * b + c;
>       a[i+5] = a[i+5]*b + c;
>       a[i+6] = a[i+6]*b + c;
>       a[i+7] = a[i+7]*b + c;
> 
> 
> 
> 
>     


: 1

In [20]:
diff fma_loop_dev.cu fma_loop_dev_opt.cu

See if you can exploit vectorization, instruction level parallelism, and/or loop transformations to get a boost.

In [21]:
make clean
make run_fma_prof_opt Nh=8388608 Nd=0 T=12 COPTFLAGS='-O3 -xHost' CUOPTFLAGS='-O3' # modify this for peak flop/s

rm -f *.o *.optrpt *.so fma_prof fma_prof_opt
icc -g -Wall -std=c99 -fPIC -O3 -xHost -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -c -o fma_prof.o fma_prof.c
icc -g -Wall -std=c99 -fPIC -O3 -xHost -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -c -o fma_omp.o fma_omp.c
icc -g -Wall -std=c99 -fPIC -O3 -xHost -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -c -o fma_loop_host.o fma_loop_host.c
nvcc -ccbin=icpc -Xcompiler '-fPIC' -O3 -dc -o fma_cuda.o fma_cuda.cu
nvcc -ccbin=icpc -Xcompiler '-fPIC' -O3 -dc -o fma_loop_dev.o fma_loop_dev.cu
nvcc -ccbin=icpc -Xcompiler '-fPIC' -dlink  fma_cuda.o fma_loop_dev.o -o fma_cuda_link.o
icpc -qopenmp -shared -Wl,-soname,libfma_cuda.so -o libfma_cuda.so fma_cuda_link.o fma_cuda.o fma_loop_dev.o -L/usr/local/pacerepov1/cuda/8.0.44/lib64 -Wl,-rpath,/usr/local/pacerepov1/cuda/8.0.44/lib64 -lcudart
icpc -qopenmp -o fma_prof fma_prof.o fma_omp.o fma_loop_host.o libfma_cuda.so -Wl,-rpath,.
icc -g -Wall -std=c99 -fPIC -O3 -xHost -I/usr/l

## Submitting this work

**Workstation exercise 1 (1 pt):** When you have completed the rest of this assignment, `git add` the changes to this file, the source files you modified, and any scripts you added, and `git commit` them.  Having committed your changes, you should `git push` them to the private repository that you have on `github.gatech.edu`.

Our TA Han Sol Suh will email each of you an individualized [deploy key](https://developer.github.com/v3/guides/managing-deploy-keys/) that will allow him to read the contents of your repository.  

**Assignments need to be formally submitted to canvas,** but the totality of your submission on canvas should be a git revision hash or branch name indicating the version of your repository we should use to grade the assignment.