# Final Project

## Due Thursday, December 12th at 5:00 PM EDT, no extensions

## 1. Running a community benchmark (15 pts)

## I worked in a pair with ZiFan Jiang. <br/>Please grade his version as our final submission. For documentation purposes, I have also included some of my own work that is synced with his.

For those working in pairs, one version of this question will be graded.

You are asked to take all of the necessary steps to run a meaningful benchmark code on pace-ice.  First we should talk about what a meaningful benchmark is.

### A meaningful benchmark should:

### a. Help someone working in a non-HPC domain understand / predict how useful a particular machine is to solving their problem.

### b. Report a machine-independent measure of performance, to allow for fair comparison and portability.

### c. Have an algorithm-independent statement of what the problem is (i.e. phrased in terms of inputs and outputs), to avoid artificially constraining the implementations.

### d. Be as simple as possible, so that the results of the benchmark are explainable and reproducible.

With these criteria in mind, you are welcome to select any accepted community benchmark with an open-source implementation.

- The benchmark implementation must be *open*, so that we may see what exactly is being run.
- An "accepted community" benchmark should ideally have a website describing itself, published benchmark results, and (ideally) a peer-reviewed in-depth description.


### Here are some recommendations that you could choose from:

### [HPLinpack](http://www.netlib.org/benchmark/hpl/): Dense Linear Algebra

### [HPCG](http://hpcg-benchmark.org/): Iterative Sparse Linear Algebra

### [Graph500](https://graph500.org/): Data-Intensive Graph Algorithms

### [HPGMG](http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/): Multilevel PDE Solvers

### [LAMMPS](https://lammps.sandia.gov/index.html) ([benchmarks](https://lammps.sandia.gov/bench.html)): Molecular Dynamics

### [TensorFlow](https://www.tensorflow.org) ([benchmarks](https://github.com/tensorflow/benchmarks)): Machine Learning

---

**1.1 (1 pts):** In a cell below, tell me which benchmark you are choosing.  Provide a link.  If the benchmark is actually a suite of benchmarks, tell me which one you would like to focus on.  If there are citations for the benchmark, give me those, too, please.  After that, give:

- As complete a description as possible of the *problem* being solved.  Include scaling parameters like problem size $N$, and any other "free" parameters that can change between different runs of the benchmark.

- As complete a description as possible of the *value* of the benchmark: what quantity is being reported?

Then, tell me which type of pace-ice node you intend to use to test the benchmark.

#### 1.1 Answer:

We chose the HPLinpack benchmark: [HPLinpack](http://www.netlib.org/benchmark/hpl/)<br/>
HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers <br/>

The version is: <br/>
HPL 2.3, by Antoine Petitet, Clint Whaley, Jack Dongarra <br/>

> HPL is a High-Performance Linpack benchmark implementation.

> HPL is written in a portable ANSI C and requires an MPI implementation as well as either BLAS or VSIPL library. Such choice of software dependencies gives HPL both portability and performance.

> The HPL package provides a testing and timing program to quantify the accuracy of the obtained solution as well as the time it took to compute it. The best performance achievable by this software on your system depends on a large variety of factors. Nonetheless, with some restrictive assumptions on the interconnection network, the algorithm described here and its attached implementation are scalable in the sense that their parallel efficiency is maintained constant with respect to the per processor memory usage.

In our implementation for this project, <br/>

all of the following packages are included in the module mkl/19.0.

- HPL relies on an efficient implementation of the Basic Linear Algebra Subprograms (BLAS) (http://www.netlib.org/blas/blas-3.8.0.tgz).

- The MPI module we already had is mvapich2 2.3, an MPI-based software package; alternative versions can be found at (http://mvapich.cse.ohio-state.edu/downloads/).



For the problem being solved: <br/>
> The code solves a uniformly random system of linear equations and reports time and floating-point execution rate using a standard formula for operation count. Specifically, HPL generates a linear system of equations of order N and solves it using LU decomposition with partial row pivoting.<br/>
For problem size N, HPL credits the solve with $\frac{2}{3}N^3 + \frac{3}{2}N^2$ floating-point operations; it measures the wall time taken and divides the operation count by that time to obtain the FLOP rate. <br/>
Since HPL performs its computation on an N x N array of double-precision (DP) elements, and each element requires sizeof(double) = 8 bytes, the matrix for a problem of size N occupies 8N^2 bytes.<br/>
NB = 192 is the block size used for the Broadwell processors.<br/>
P and Q are the dimensions of the process grid; the product P x Q should typically equal the number of MPI processes.<br/>


> Here are the parameters:
```
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
```
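
As a rough check on the quantities described above, the operation count, matrix footprint, and Gflops rate can be computed directly from N. This is a minimal sketch, assuming the operation count that the HPL source uses for its Gflops figure, $\frac{2}{3}N^3 + \frac{3}{2}N^2$. The function names `hpl_flops`, `hpl_matrix_bytes`, and `hpl_gflops` are illustrative and are not part of HPL.

```bash
#!/usr/bin/env bash
# Illustrative helpers (not part of HPL): operation count, matrix footprint,
# and Gflops rate for a problem of order N.

# Floating-point operations credited to one solve: 2/3*N^3 + 3/2*N^2.
hpl_flops() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", (2.0 / 3.0) * n * n * n + 1.5 * n * n }'
}

# Bytes needed to store the N x N double-precision matrix: 8*N^2.
hpl_matrix_bytes() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", 8 * n * n }'
}

# Gflops rate for a problem of order N ($1) solved in $2 seconds of wall time.
hpl_gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        printf "%.2f\n", ((2.0 / 3.0) * n * n * n + 1.5 * n * n) / t / 1e9
    }'
}
```

For example, `hpl_matrix_bytes 19200` prints `2949120000`, so a matrix of order 19200 occupies about 2.95 GB.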

For values of the benchmark: <br/>
> The HPL benchmark is used as the reference benchmark that provides data for the TOP500 list and thus ranks supercomputers worldwide. It includes a testing program that quantifies the accuracy of the obtained solution as well as the time it took to compute it. For each run it reports the wall time and the resulting FLOPS rate.


Which nodes on pace-ice do we intend to use?<br/>

> We chose CPU-only nodes on the pace-ice cluster.  <br/>
They have 28 CPU cores (14 physical cores); the model is Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz. <br/>
We intend to run the benchmark tests on 1 CPU and 4 CPU clusters.

**1.2 (4 pts):** In your own words, give me your assessment of the quality of the benchmark according to the four points (a), (b), (c), and (d) above.

- a. Describe some applications where the benchmark problem is relevant.  Benchmarks must walk a fine line between being too specific to one application but very predictive, versus being general to lots of applications while being too simple to predict the performance of any application very well.  Do you think the benchmark you chose does a good job with this balance?

- b. What assumptions does your benchmark make about the kind of machine that it is run on?  Do you think that those assumptions are reasonable?  Let's make this question very concrete: let's say you have access to [TaihuLight](https://en.wikipedia.org/wiki/Sunway_TaihuLight), whose nodes are neither really CPUs or GPUs, but somewhere in between.  Could your benchmark run on this machine?  If not, propose a way that you could change the benchmark to make it more portable.

- c. How exactly does your benchmark specify the way the problem is solved?  If your benchmark is for a particular algorithm or a particular code, do you think that the results of the benchmark would help you predict the performance of a different code/algorithm solving the same problem on the same machine?

- d. One measure of the complexity of a benchmark is how difficult it would be to write a reference implementation from scratch (one that solves the problem, if not in a "high-performance" way).  If you had to guess, how big would a team have to be to do that: (i) one dedicated programmer; (ii) a team of about a dozen (like a research lab); (iii) an Organization (like a division of a company or a government agency)?  Give your reasoning (by, e.g. measuring lines of code in the implementation you will be working with)

#### 1.2 Answer:

- a: <br/>
HPLinpack (HPL) is a scalable addition to the original Linpack benchmark suite: its parallel efficiency remains constant with respect to per-process memory usage, so it scales well. It also reports a single result (time and FLOPS), which makes results easy to compare across different machines.<br/>
In this sense, the benchmark is very good at being general to lots of applications. <br/>
However, it only tests the solution of a **dense linear system** and leaves out the many other operations that could be taken into account in a performance evaluation. The best results are therefore achieved by fine-tuning a particular machine for solving dense linear systems, which biases the reported results. <br/>
That is to say, the benchmark is too simple to represent all kinds of applications, but it can be very predictive for machines or applications that specifically solve dense linear systems. <br/>


- b: <br/>
HPL has been designed as a portable benchmark that performs well for large problem sizes on hundreds of nodes and more.
As a user, you do not need to modify the main algorithm to run it, and it relies only on an MPI and BLAS environment. So, theoretically, it can be run on clusters with GPUs, CPUs, or a hybrid of both. Of course, the implementation needs to be optimized to achieve the best performance on different machine configurations. For example, Nvidia has its own version of the HPL benchmark for Nvidia GPUs. As for **TaihuLight**, it actually holds third place on the TOP500 list at the moment, with an HPL result (Rmax) of 93,014.6 TFlop/s against a theoretical peak (Rpeak) of 125,435.9 TFlop/s.<br/>



- c: <br/>
The main HPL algorithm solves a uniformly random system of linear equations and reports time and floating-point execution rate using a standard formula for operation count. Specifically, HPL generates a linear system of equations of order N and solves it using LU decomposition with partial row pivoting. 
For problem size N, HPL credits the solve with $\frac{2}{3}N^3 + \frac{3}{2}N^2$ floating-point operations; it measures the wall time taken and divides the operation count by that time to obtain the FLOP rate. <br/>
As I see it, HPL provides a reference benchmark for predicting a machine's computation speed, particularly for dense linear system problems. It emphasizes computation but does not emphasize the interconnect enough; thus, if a problem relies heavily on communication, the benchmark is not well suited to it. On the other hand, if a problem relies mostly on computation, the benchmark results can be useful for prediction. 

- d: <br/>
HPL is the portable reference implementation of the HPLinpack benchmark. To estimate the time and labor needed to build something similar from scratch, I examined the HPL source code and its main algorithm (https://www.netlib.org/benchmark/hpl/algorithm.html). The original source code has more than 10000 lines, and the algorithm is straightforward. By my measure, building a reference implementation from scratch would need a research lab (about a dozen people) working for one to several months.




**1.3 (1 pts):** Try to prepare for some of the logistics ahead of you.  Answer the following questions:

- a. Where / how will you obtain the source for the benchmark driver and implementation that you will be using? (Regarding how: is it a tarball, repository, or other?)

- b. What software environments will you need to build and run the benchmark? (e.g. Does it use raw `make`? Autotools?  CMake?  Is it python/pip/conda?  Does it need MPI?  OpenMP?  Cuda?)

#### 1.3 Answer:
- a: <br/>
The source code can be obtained from (http://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz) as a tarball, and the dependency package is BLAS (http://www.netlib.org/blas/blas-3.8.0.tgz). The detailed process can be found in the 1.4 answer, where I have documented a step-by-step workflow for installing and running the benchmark.


- b: <br/>
The benchmark needs to be run on a compute node,
and the following modules need to be loaded:
```
module use $CSE6230_DIR/modulefiles
module unload cse6230/core
module load cse6230/gcc-omp-gpu
```
The benchmark can be built locally once downloaded, and it requires an MPI package to run. In our case, the MPI software is already installed.

**1.4 (1 pts):** Successfully install and run your benchmark

Include in this directory an example **job submission script** that runs your benchmark code.

#### Step 1 Benchmark installation:
I created a directory named "hpl" under cse6230/final to contain all the necessary files for the benchmark.

In [7]:
cd ~
cd /nv/coc-ice/mguo34/cse6230/final/hpl
ls

BLAS-3.8.0  blas-3.8.0.tgz  hpl-2.3  hpl-2.3.tar.gz


##### About Downloading the packages

hpl-2.3.tar.gz can be downloaded at (http://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz) <br/>
The benchmark tool is hpl-2.3, but BLAS is one of the packages needed to build the executables.
In our project, the linear algebra package is contained in the Intel mkl module, so we do not need to install it from scratch.

##### About building the tool packages
###### Load Modules

The necessary modules always need to be loaded first; to save trouble, we always load:
```bash
module load cse6230/core
module load mkl/19.0
```

###### HPL:

After downloading the packages and uploading them to the final/hpl folder:

`mguo34@coc-ice:~/cse6230/final/hpl$ tar -xzvf hpl-2.3.tar.gz` <br/>

`mguo34@coc-ice:~/cse6230/final/hpl$ cd hpl-2.3/setup` <br/>

To generate a template:<br/>

`mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3/setup$ sh make_generic ` <br/>

Then name our makefile Make.Linux and copy it to the parent directory <br/>

`mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3/setup$  cp Make.UNKNOWN ../Make.Linux  ` <br/>

And go back to hpl/hpl-2.3/ directory <br/>
`mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3/setup$ cd .. ` <br/>

Now we have a Make.Linux template to work on. 

```bash
mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ ls
acinclude.m4  BUGS          config.sub    COPYRIGHT  INSTALL     Makefile.am  Make.top  README   THANKS
aclocal.m4    ChangeLog     configure     depcomp    install-sh  Makefile.in  man       setup    TODO
AUTHORS       compile       configure.ac  HISTORY    lib         Make.Linux   missing   src      TUNING
bin           config.guess  COPYING       include    Makefile    makes        NEWS      testing  www

```
Open it with any editor. It shows a template that is not yet complete: we need to add the dependency paths so that the tool can be built successfully. It needs mvapich2, BLAS, and hpl. <br/>

```bash
mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ vi Make.Linux
```


###### BLAS:
The linear algebra package is already contained in the mkl/19.0 module, so we simply load this module and modify the dependency in the Make.Linux file, as described below:

###### MODIFY MAKE FILE:
After successfully getting all the necessary files, we are ready to modify the Make.Linux file.
Here is the final version of the Make.Linux that I modified for building:


In [None]:
cd /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3
pwd
cat Make.Linux
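
The contents of Make.Linux are not reproduced here. As a hedged sketch of the kind of edit involved, the helper below rewrites a single `NAME = value` line in a Make.<arch> file. `set_make_var` is a hypothetical helper, not part of HPL. The variable names in the usage comments (`ARCH`, `TOPdir`, `CC`, `LAinc`) are ones the HPL template defines, and the values are inferred from the compiler lines in the build log, so treat them as assumptions rather than as the actual file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set NAME to VALUE in an HPL Make.<arch> file.
# Assumes each variable sits on its own "NAME = value" line and that VALUE
# contains no "|", "&", or backslash (they are special to this sed command).
set_make_var() {
    local file=$1 name=$2 value=$3
    sed "s|^${name}[[:space:]]*=.*|${name} = ${value}|" "$file" > "${file}.tmp" &&
        mv "${file}.tmp" "$file"
}

# Possible usage, with values inferred from the build log (assumptions):
#   set_make_var Make.Linux ARCH   Linux
#   set_make_var Make.Linux TOPdir /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3
#   set_make_var Make.Linux CC     mpiicc
#   set_make_var Make.Linux LAinc  "-I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/"
```

Because the pattern requires whitespace or `=` directly after the name, setting `CC` leaves `CCFLAGS` and `CCNOOPT` untouched.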

###### COMPILE:
Now go to the hpl-2.3 directory and start compiling: <br/>
First Make clean:<br/>
 `mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ make arch=Linux clean_arch_all`<br/>
Then Make:<br/>
 `mguo34@coc-ice:~/cse6230/final/hpl/hpl-2.3$ make arch=Linux`<br/>

In [5]:
cd /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3
make arch=Linux clean_arch_all
make arch=Linux

make -f Make.top clean_arch_all  arch=Linux
make[1]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
make -f Make.top clean_arch_src arch=Linux
make[2]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
/bin/rm -f -r src/auxil/Linux
/bin/rm -f -r src/blas/Linux
/bin/rm -f -r src/comm/Linux
/bin/rm -f -r src/grid/Linux
/bin/rm -f -r src/panel/Linux
/bin/rm -f -r src/pauxil/Linux
/bin/rm -f -r src/pfact/Linux
/bin/rm -f -r src/pgesv/Linux
make[2]: Leaving directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
make -f Make.top clean_arch_tst arch=Linux
make[2]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
/bin/rm -f -r testing/matgen/Linux
/bin/rm -f -r testing/timer/Linux
/bin/rm -f -r testing/pmatgen/Linux
/bin/rm -f -r testing/ptimer/Linux
/bin/rm -f -r testing/ptest/Linux
make[2]: Leaving directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
/bin/rm -f -r bin/Linux include/Linux lib/Linux
make[1]: Leaving directory `/nv/coc-i

cp makes/Make.ptimer   testing/ptimer/Linux/Makefile
cp makes/Make.ptest    testing/ptest/Linux/Makefile
make[1]: Leaving directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
make -f Make.top build_src       arch=Linux
make[1]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
( cd src/auxil/Linux;         make )
make[2]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/src/auxil/Linux'
mpiicc -o HPL_dlacpy.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_dlacpy.c
mpiicc -o HPL_dlatcpy.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/

touch lib.grd
make[2]: Leaving directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/src/blas/Linux'
( cd src/comm/Linux;          make )
make[2]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/src/comm/Linux'
mpiicc -o HPL_1ring.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_1ring.c
mpiicc -o HPL_1rinM.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z n

mpiicc -o HPL_reduce.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_reduce.c
mpiicc -o HPL_all_reduce.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_all_reduce.c
mpiicc -o HPL_barrier.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I

mpiicc -o HPL_dlaswp10N.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_dlaswp10N.c
mpiicc -o HPL_dlaswp01N.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_dlaswp01N.c
mpiicc -o HPL_dlaswp01T.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Li

mpiicc -o HPL_dlocswpN.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_dlocswpN.c
mpiicc -o HPL_dlocswpT.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_dlocswpT.c
mpiicc -o HPL_pdmxswp.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I

mpiicc -o HPL_pdlaswp00T.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_pdlaswp00T.c
mpiicc -o HPL_perm.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_perm.c
mpiicc -o HPL_logsort.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /us

ar r /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/lib/Linux/libhpl.a  HPL_pipid.o            HPL_plindx0.o          HPL_pdlaswp00N.o HPL_pdlaswp00T.o       HPL_perm.o             HPL_logsort.o HPL_plindx10.o         HPL_plindx1.o          HPL_spreadN.o HPL_spreadT.o          HPL_rollN.o            HPL_rollT.o HPL_equil.o            HPL_pdlaswp01N.o       HPL_pdlaswp01T.o HPL_pdupdateNN.o       HPL_pdupdateNT.o       HPL_pdupdateTN.o HPL_pdupdateTT.o       HPL_pdtrsv.o           HPL_pdgesv0.o HPL_pdgesvK1.o         HPL_pdgesvK2.o         HPL_pdgesv.o
echo /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/lib/Linux/libhpl.a 
/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/lib/Linux/libhpl.a
touch lib.grd
make[2]: Leaving directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/src/pgesv/Linux'
make[1]: Leaving directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
make -f Make.top build_tst       arch=Linux
make[1]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3'
( cd testing/

ar r /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/lib/Linux/libhpl.a  HPL_ptimer.o           HPL_ptimer_cputime.o   HPL_ptimer_walltime.o
echo /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/lib/Linux/libhpl.a 
/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/lib/Linux/libhpl.a
touch lib.grd
make[2]: Leaving directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/testing/ptimer/Linux'
( cd testing/ptest/Linux;     make )
make[2]: Entering directory `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/testing/ptest/Linux'
mpiicc -o HPL_pddriver.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include -I/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/include/Linux -I /usr/local/pacerepov1/intel/compiler/19.0/mkl/include/ -I /usr/local/pacerepov1/intel/compiler/16.0/impi/5.1.1.109/include64  -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -nocompchk  ../HPL_pddriver.c
mpiicc -o HPL_pdinfo.o -c -DAdd_ -DF77_INTEGER=int -DStringSunStyle  -I/nv/coc-ic

#### Step 2 load modules and run a test:

In [1]:
module load cse6230/core
module load mkl/19.0

|                                                                         |
|       A note about python/3.6:                                          |
|       PACE is lacking the staff to install all of the python 3          |
|       modules, but we do maintain an anaconda distribution for          |
|       both python 2 and python 3. As conda significantly reduces        |
|       the overhead with package management, we would much prefer        |
|       to maintain python 3 through anaconda.                            |
|                                                                         |
|       All pace installed modules are visible via the module avail       |
|       command.                                                          |
|                                                                         |


###### CHECK MY HARDWARE MODEL:

In [4]:
CPU=$(cat /proc/cpuinfo | grep "model name" | tail -1)
COUNT=$(cat /proc/cpuinfo | grep processor | wc -l)
MEM=$(cat /proc/meminfo |grep MemTotal | tail -1)
echo "CPU : $CPU"
echo "Total CPU Cores : $COUNT"
echo "$MEM"

CPU : model name	: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Total CPU Cores : 28
MemTotal:       132176424 kB


As suggested, we should use the memory available per node to determine the problem size N, but I noticed that if N is too large (over 100000), the running time is too long. Due to time limits, we reduced N to less than 100000, and I assumed that the maximum memory is 10% of the available memory, which is around 12907 MB. By calculation, the N for 10% of memory is 36480. <br/>
And since we are going to test with 28 MPI processes, I set P and Q to 4 and 7. I used these parameters to configure the ***HPL.dat*** file, which is the configuration file located in `/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/bin/Linux`
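
The sizing described above can be sketched as two small helpers. This is a minimal sketch under stated assumptions: it takes memory in kB as reported by /proc/meminfo (treated as units of 1024 bytes), counts only the 8N^2 bytes of the matrix, and rounds N down to a multiple of NB. The safety margin and rounding behind the figure quoted above are not stated, so this sketch is not guaranteed to reproduce it. The names `hpl_problem_size` and `hpl_grid` are illustrative.

```bash
#!/usr/bin/env bash
# Illustrative helpers (names are not from HPL).

# Largest multiple of NB ($3) whose N x N double-precision matrix fits in
# the fraction $2 of the memory $1, given in kB as in /proc/meminfo.
hpl_problem_size() {
    awk -v kb="$1" -v frac="$2" -v nb="$3" 'BEGIN {
        n = int(sqrt(kb * 1024 * frac / 8))   # 8 bytes per matrix element
        print int(n / nb) * nb                # round down to a multiple of NB
    }'
}

# Most nearly square process grid P x Q with P <= Q and P * Q = NP ($1).
hpl_grid() {
    local np=$1 p best=1
    for (( p = 1; p * p <= np; p++ )); do
        if (( np % p == 0 )); then
            best=$p
        fi
    done
    echo "$best $(( np / best ))"
}
```

`hpl_grid 28` prints `4 7`, which matches the grid chosen above for 28 MPI processes.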

In [9]:
cd /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/bin/Linux
pwd

/nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/bin/Linux


###### A single-node run: 1 node, 28 cores, 28 MPI processes, N = n × 19200 for n = 1, 2, 3, 4, 5, and NB = 192. I saved the HPL.dat file as HPL_single_node.dat in the same directory as the original HPL.dat

In [1]:
cd /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/bin/Linux
mpirun -np 28 ./xhpl

HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   19200    38400    57600    76800    96000 
NB     :     192 
PMAP   : Row-major process mapping
P      :       4 
Q      :       7 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed
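
The parameter values echoed above come from HPL.dat. As a hedged sketch, the hypothetical helper `write_hpl_dat` below emits a file in HPL's 31-line input layout for a list of problem sizes, one NB, and one P x Q grid. The algorithmic settings (PFACT, NBMIN, NDIV, RFACT, BCAST, DEPTH, SWAP, L1, U) mirror the values listed above; the residual threshold, equilibration flag, and memory alignment are HPL's stock defaults and are assumptions here, since they are not visible in the output shown.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write an HPL.dat for one NB and one P x Q grid.
# Usage: write_hpl_dat "N1 N2 ..." NB P Q > HPL.dat
write_hpl_dat() {
    local ns=$1 nb=$2 p=$3 q=$4
    local count
    count=$(echo "$ns" | wc -w | tr -d ' ')   # number of problem sizes
    cat <<EOF
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
${count}            # of problems sizes (N)
${ns}  Ns
1            # of NBs
${nb}          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
${p}            Ps
${q}            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
EOF
}
```

For the single-node run this would be called as `write_hpl_dat "19200 38400 57600 76800 96000" 192 4 7 > HPL.dat`. HPL reads the leading values on each line and treats the rest as comments, so the descriptive text is only there for the reader.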

###### An additional run: N = n × 19200 for n = 6, and NB = 192. I saved the HPL.dat file as HPL_single_node_patch.dat in the same directory as the original HPL.dat

In [5]:
cd /nv/coc-ice/mguo34/cse6230/final/hpl/hpl-2.3/bin/Linux
mpirun -np 28 ./xhpl

HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  115200 
NB     :     192 
PMAP   : Row-major process mapping
P      :       4 
Q      :       7 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL
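
The result rows themselves are not reproduced in the output above. As a hedged sketch of how they could be collected for plotting, the hypothetical helper `hpl_results` below extracts N, Time, and Gflops from saved xhpl output. It assumes each result row has seven fields in the order given in the legend (T/V, N, NB, P, Q, Time, Gflops) and that the T/V code starts with `W`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "N Time Gflops" for every result row found in
# the given HPL output file(s), or on standard input if none are given.
# Assumes rows of the form: T/V N NB P Q Time Gflops, with T/V starting in "W".
hpl_results() {
    awk 'NF == 7 && $1 ~ /^W/ && $2 ~ /^[0-9]+$/ { print $2, $6, $7 }' "$@"
}

# Possible usage (file name is an assumption):
#   mpirun -np 28 ./xhpl > hpl_single_node.out
#   hpl_results hpl_single_node.out
```

The numeric check on the second field keeps banner lines such as the "Written by" credit from being mistaken for results.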

![single_node](figs/Figure_1.png)
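
For batch runs, the interactive commands above can be wrapped in a job submission script. The sketch below is hypothetical: `make_hpl_job` prints a script that assumes a PBS/Torque scheduler, that the job is submitted from the directory holding xhpl and HPL.dat, and that the two modules loaded earlier are sufficient. The job name and walltime are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a PBS job script that runs xhpl.
# Usage: make_hpl_job NODES PPN WALLTIME > hpl.pbs
# Assumes a PBS/Torque scheduler and submission from the xhpl directory.
make_hpl_job() {
    local nodes=$1 ppn=$2 walltime=$3
    local np=$(( nodes * ppn ))   # one MPI process per requested core
    cat <<EOF
#!/bin/bash
#PBS -N hpl
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${walltime}
#PBS -j oe

cd "\$PBS_O_WORKDIR"
module load cse6230/core
module load mkl/19.0
mpirun -np ${np} ./xhpl
EOF
}
```

`make_hpl_job 1 28 01:00:00 > hpl.pbs` would produce a script for the single-node, 28-process case, which could then be submitted with `qsub hpl.pbs`. P x Q in HPL.dat has to match the process count the script requests.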

**1.5 (3 pts):** Develop a performance model for your benchmark

In 1.1, you chose a performance metric of your benchmark, let's call it $V$.  Your benchmark will solve a problem with some parameters (problem size, the choice of matrix / network / etc.), let's call those parameters $N$.  The node that you chose to run on will have some machine parameters (The number of cores, the type of GPU, the bandwidth from main memory, etc., etc.), let's call them $P$.  Give an expression 
for $V(N,P)$ for your benchmark, and describe how you arrived at it.  You should use your discretion when choosing the level of detail.  If it is hard to develop a closed-form performance model for the whole benchmark, but there are a few key kernels that happens repeatedly in your benchmark (a stencil application, an iteration of stochastic gradient descent, etc.), you can give performance models for those benchmarks(s) instead.

If it is difficult to formulate your expression in terms of machine parameters, try to develop an expression
with coefficients that measure the rates at which the machine can do some lower-level operations (for example, the rate at which a GPU can sum an array).  If you have these coefficients, you should give a plausible description for how the architecture of the machine affects those rates.

**1.6 (2 pts):** Gather statistics for the performance metric

Include in this directory the **job script(s)** that you use to gather statistics for the performance metric on pace-ice.  Additionally, describe what steps you've taken to ensure the quality of the statistics: how are you accounting for variability / noise?  Does your benchmark show different performance on the first run than on subsequent runs?

If you are running your benchmark for multiple problem instances ($N$), include a plot of the performance metric for the different problem instances. (You can include error bars for maximum/minimum values of the performance metric for the same problem instance to convey variability.)

**1.7 (3 pts):** Compare your performance model to your statistics

If your performance model allowed you to make a concrete prediction of $V(N,P)$ before running the benchmark,
compare (in a table or plot) the predictions and the actual measurements.

If your performance model includes coefficients that could not be estimated ahead of time, use measurements gathered during your experiments to get empirical values of those coefficients to fit the model to the data.  Once you do this, answer the following question: is there a plausible explanation (in terms of the architecture of the machine and the nature of the algorithm) for why these coefficients have the value that they do?

Present additional timings and/or machine performance metrics (either for the full benchmark or key kernels) and make a case for why you think they best demonstrate what the biggest bottleneck to the performance of the benchmark is.

## 2. Paper report (5 pts)

**Completed separately, not as a team**

Choose one of the two Gordon Bell Prize finalists this year:

- [A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations](https://dl.acm.org/citation.cfm?id=3357156)
- [Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing: 46 PFLOPS simulation of a metallic dislocation system](https://dl.acm.org/citation.cfm?id=3357157)

Read the paper, and answer the following questions:

- **a.** What is the problem being solved?

- **b.** What are the most important kernels of the algorithm for solving the problem? (In this context consider a kernel to be a subproblem that is relevant to more than just the specific problem from a.)

- **c.** What about the important kernels and/or the size of the problem make this a challenging problem?

- **d.** Summarize the innovation of this paper.

- **e.** What model is used that combines the problem parameters and machine parameters to predict performance?

- **f.** Any paper that is submitted for a prize contains some marketing, and maybe some attempts to [fool the masses](https://blogs.fau.de/hager/archives/category/fooling-the-masses).  If you could ask the authors to submit one additional figure with the performance measurements of an experiment, what would you choose, and why?

### Paper Report:  
I chose this paper to review:
[A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations](https://dl.acm.org/citation.cfm?id=3357156)
#### a. What is the problem being solved? <br/>
The paper proposes an algorithm that improves the computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver. The restructured QT simulator can treat realistic nanoelectronic devices made of more than 10,000 atoms within a 14× shorter duration than the original code needs to handle a system with 1,000 atoms, on the same number of CPUs/GPUs and with the same physical accuracy.

#### b. What are the most important kernels of the algorithm for solving the problem? (In this context consider a kernel to be a subproblem that is relevant to more than just the specific problem from a.) <br/>
The kernels are the solution of the non-linear GF and SSE equations, which yield the electrical and energy currents or the temperature distribution of a given device. These equations must be solved iteratively until convergence is reached. <br/>
The GF state computes the boundary conditions, casts them into self-energies, solves for the Green’s Functions, and extracts physical observables (current, density) from them. The state consists of two concurrent Maps, one for the electrons and one for the phonons (§ 3.1). The SSE state computes the scattering self-energies Σ and Π.<br/>
The SSE computation takes most of the simulation time, which makes it the most important kernel to improve. The Data-Centric (DaCe) SSE algorithm proposed in the paper reduces data movement and communication by orders of magnitude.



#### c. What about the important kernels and/or the size of the problem make this a challenging problem? <br/>

Quoting the paper:

large-scale QT simulations are bound by both communication volume and memory requirements. The former inhibits strong scaling, as simulation time includes nanostructure-dependent point-to-point communication patterns, which become infeasible when increasing node count. The memory bottleneck is a direct result of the former. It hinders large simulations due to the increased memory requirements w.r.t. atom count. Transforming the QT simulation algorithm to minimize communication is thus the key to simultaneously model larger devices and increase scalability on different supercomputers.

#### d. Summarize the innovation of this paper. <br/>
The paper proposed a paradigm change by rewriting the quantum transport problem as implemented in OMEN from a data-centric perspective.
- Data Ingestion: We stage the material and use chunked broadcast to deliver it to nodes. This reduced Piz Daint startup time at full-scale from ~30 minutes to under two.
- Load Balancing: Similar to OMEN, we divide work among electrons and phonons unevenly, so as to reduce imbalance.
- Communication Avoidance: We reformulate communication in a non-natural way from a physics perspective, leading to two orders of magnitude reduction in volume.
- Boundary Conditions: We pipeline contour integral calculation on the GPUs, computing concurrently and accumulating resulting matrices using on-GPU reduction.
- Sparsity Utilization: We tune and investigate different data-centric transformations on sparse Hamiltonian blocks in GF, using a combination of sparse and dense matrices.
- Pipelining: The DaCe framework automatically generates copy/compute and compute/compute overlap, resulting in 60 auto-generated CUDA streams.
- Computational Innovations: We reformulate SSE computations using data-centric transformations. Using fission and data layout transformations, we reshape the job into a stencil-like strided-batched GEMM operation, where the DaCe implementation yields up to 4.8× speedup over cuBLAS.


#### e. What model is used that combines the problem parameters and machine parameters to predict performance? <br/>

Component benchmarks (tasklets of matrix multiplication) were run on a single node and GPU to predict the computation costs of different methods.<br/>
For communication, a lower bound on the completion time of each call was predicted by dividing the amount of data each node must send by the injection bandwidth of that node. <br/>
For scalability, strong-scaling and weak-scaling runs, in addition to extreme-scale runs, were also analyzed. 
However, the paper does not predict the overall performance; the authors present the measured performance as an improvement comparison in each area, reporting efficiency as the achieved flop/s divided by the measured peak flop/s.
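
The two quantities described in this answer are simple ratios, and a minimal sketch makes the arithmetic concrete. The function names and all numbers below are hypothetical placeholders chosen for illustration; they are not values from the paper.

```python
def comm_time_lower_bound(bytes_sent_per_node: float, injection_bw_bytes_per_s: float) -> float:
    """Lower bound on a communication call: data a node must send / its injection bandwidth."""
    return bytes_sent_per_node / injection_bw_bytes_per_s


def flop_efficiency(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Fraction of peak: achieved flop/s divided by peak flop/s."""
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical inputs: 2 GB sent per node over a 10 GB/s injection link,
# and 3.5 Tflop/s achieved on a device with a 7 Tflop/s peak.
t_min = comm_time_lower_bound(2e9, 1e10)
eff = flop_efficiency(3.5e12, 7e12)
print(f"communication takes at least {t_min:.2f} s; efficiency is {eff:.0%}")
```

The bound ignores latency and network contention, so a measured call can only be slower than it, never faster.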

#### f. Any paper that is submitted for a prize contains some marketing, and maybe some attempts to fool the masses. If you could ask the authors to submit one additional figure with the performance measurements of an experiment, what would you choose, and why? <br/>
The paper indicates that OMEN depends on multiple external libraries, some of which are not necessarily optimized for every architecture (e.g., IBM POWER9). On the other hand, SDFGs are compiled on the target architecture and depend only on a few optimized libraries provided by the architecture vendor (e.g., MKL, cuBLAS, ESSL), whose implementations can always be replaced by SDFGs for further tuning and transformations. <br/>
The paper also has a table comparing results for the DaCe, Python, and manually tuned C++ OMEN implementations, which indicates that the performance-oriented reconstruction of SSE generates a speedup of 9.97×.<br/>
As a reader, I am very curious how OMEN was manually tuned, since it does not appear to have been fully optimized for the tests that follow. I would like to ask for figures showing how OMEN was tuned, as a detailed complement to the performance experiments and, for reproducibility, to validate the fidelity of the comparison.