# Final Project

## Due Thursday, December 12th at 5:00 PM EDT, no extensions

Those working in pairs should submit the same version of this notebook.

## 1. Running a community benchmark (15 pts)

You are asked to take all of the necessary steps to run a meaningful benchmark code on pace-ice.  First we should talk about what a meaningful benchmark is.

### A meaningful benchmark should:

### a. Help someone working in a non-HPC domain understand / predict how useful a particular machine is to solving their problem.

### b. Report a machine-independent measure of performance, to allow for fair comparison and portability.

### c. Have an algorithm-independent statement of what the problem is (i.e. phrased in terms of inputs and outputs), to avoid artificially constraining the implementations.

### d. Be as simple as possible, so that the results of the benchmark are explainable and reproducible.

With these criteria in mind, you are welcome to select any accepted community benchmark with an open-source implementation.

- The benchmark implementation must be *open*, so that we may see what exactly is being run.
- An "accepted community" benchmark should ideally have a website describing itself, published benchmark results, and (ideally) a peer-reviewed in-depth description.


### Here are some recommendations that you could choose from:

### [HPLinpack](http://www.netlib.org/benchmark/hpl/): Dense Linear Algebra

### [HPCG](http://hpcg-benchmark.org/): Iterative Sparse Linear Algebra

### [Graph500](https://graph500.org/): Data-Intensive Graph Algorithms

### [HPGMG](http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/): Multilevel PDE Solvers

### [LAMMPS](https://lammps.sandia.gov/index.html) ([benchmarks](https://lammps.sandia.gov/bench.html)): Molecular Dynamics

### [TensorFlow](tensorflow.org) ([benchmarks](https://github.com/tensorflow/benchmarks)): Machine Learning

---

**1.1 (1 pts):** In a cell below, tell me which benchmark you are choosing.  Provide a link.  If the benchmark is actually a suite of benchmarks, tell me which one you would like to focus on.  If there are citations for the benchmark, give me those, too, please.  After that, give:

- As complete a description as possible of the *problem* being solved.  Include scaling parameters like problem size $N$, and any other "free" parameters that can change between different runs of the benchmark.

- As complete a description as possible of the *value* of the benchmark: what quantity is being reported?

Then, tell me which type of pace-ice node you intend to use to test the benchmark.

We are going to choose [HPLinpack](http://www.netlib.org/benchmark/hpl/).

HPL solves a dense linear system $Ax = b$ using LU decomposition. Its main parameters are $N$, the order of the input matrix; $NB$, the block size, i.e. the smallest group of elements that is kept together throughout the computation; and the process grid $P \times Q$, the number of MPI processes arranged in a two-dimensional grid.

HPL reports two values: the wall-clock time of the solve and the floating-point rate (flops) achieved during the run. Note that the flop rate is cumulative over all processes.
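
As a minimal sketch of how the reported rate relates to the reported time, assuming HPL charges the solve with $\frac{2}{3}N^3 + \frac{3}{2}N^2$ floating-point operations (the function name `hpl_gflops` and the example numbers are ours, for illustration only):

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Flop rate in Gflop/s for an order-n solve that took `seconds`.

    Assumes the operation count (2/3) n^3 + (3/2) n^2, which is our
    understanding of what HPL uses when it converts time to Gflops.
    """
    flop_count = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flop_count / seconds / 1e9


# Hypothetical run: N = 10000 solved in 1.0 s (made-up timing).
print(f"{hpl_gflops(10_000, 1.0):.2f} Gflop/s")
```

Because the operation count depends only on $N$, the time and the flop rate carry the same information for a fixed problem size.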

We plan to use nodes with 28 CPU cores, and we will start by running on a single node.

**1.2 (4 pts):** In your own words, give me your assessment of the quality of the benchmark according to the four points (a), (b), (c), and (d) above.

- a. Describe some applications where the benchmark problem is relevant.  Benchmarks must walk a fine line between being too specific to one application but very predictive, versus being general to lots of applications while being too simple to predict the performance of any application very well.  Do you think the benchmark you chose does a good job with this balance?

- b. What assumptions does your benchmark make about the kind of machine that it is run on?  Do you think that those assumptions are reasonable?  Let's make this question very concrete: let's say you have access to [TaihuLight](https://en.wikipedia.org/wiki/Sunway_TaihuLight), whose nodes are neither really CPUs nor GPUs, but somewhere in between.  Could your benchmark run on this machine?  If not, propose a way that you could change the benchmark to make it more portable.

- c. How exactly does your benchmark specify the way the problem is solved?  If your benchmark is for a particular algorithm or a particular code, do you think that the results of the benchmark would help you predict the performance of a different code/algorithm solving the same problem on the same machine?

- d. One measure of the complexity of a benchmark is how difficult it would be to write a reference implementation from scratch (one that solves the problem, if not in a "high-performance" way).  If you had to guess, how big would a team have to be to do that: (i) one dedicated programmer; (ii) a team of about a dozen (like a research lab); (iii) an Organization (like a division of a company or a government agency)?  Give your reasoning (by, e.g., measuring lines of code in the implementation you will be working with).

$\textbf{1.2.a}$ Dense linear systems arise in several applications. One example is simulating the evolution of celestial systems such as galaxies. Another is modeling the behavior of biofluid systems.

The benchmark is a general-purpose tool: it is not tailored to any prescribed application and simply solves the underlying mathematical problem.

This makes it relevant to any project that needs a dense-matrix solver.

$\textbf{1.2.b}$ The benchmark makes no specific assumption about the type of machine it runs on: HPL simply calls routines from the linear algebra library that the user provides. On any nontrivial architecture, HPL can run as long as the user supplies a linear algebra implementation and an MPI library.

$\textbf{1.2.c}$ The algorithm used in the benchmark, LU factorization, is the standard direct method for solving a dense linear system. In other words, if another program contains a dense solver, it very likely uses the same algorithm as this benchmark. If solving dense systems is the main part or the bottleneck of that program, the result of this benchmark helps evaluate the program's performance on the machine.

Also, since LU factorization is well understood and its cost is highly predictable, one can readily tell what the benchmark is doing and use the HPL result to make predictions about other tasks.

$\textbf{1.2.d}$ 
Total lines of code in HPL: 12009

In assignment 6, we implemented a matrix-vector product using MPI, starting from skeleton code with 857 lines. After plenty of work, we arrived at a working version of 1042 lines in one week. So we observe: 2 dedicated students, 7 days, ~150 lines of code. Given that we also spent time on other courses and on debugging pointers, and assuming that an experienced programmer does not make this type of mistake at all, we double that rate to ~300 lines of code per week for one programmer. At that pace, one programmer needs about 40 weeks (12009 / 300) to finish the project.

A dozen people could speed this up roughly 10-15 times, finishing in about 3-4 weeks.

In a company, where a product manager assigns different parts of the program to different teams, we could speed up at least 20 times, so the work should take about 2 weeks or less.

Note: in all counts above, comments and blank lines in source files are ignored.
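
The arithmetic behind this kind of estimate can be sketched as follows. This is only an illustration under the stated assumptions (a fixed rate of 300 lines per week for one experienced programmer, and ideal scaling with team size); the helper name `weeks_needed` is ours.

```python
HPL_LINES = 12009        # lines of code counted in HPL (comments and blanks excluded)
LINES_PER_WEEK = 300     # assumed rate for one experienced programmer


def weeks_needed(total_lines: int, lines_per_week: float, speedup: float = 1.0) -> float:
    """Weeks to write `total_lines`, assuming the team is `speedup` times faster
    than one programmer (an idealized assumption that ignores coordination cost)."""
    return total_lines / (lines_per_week * speedup)


for label, speedup in [("one programmer", 1), ("dozen, 10x", 10),
                       ("dozen, 15x", 15), ("company, 20x", 20)]:
    print(f"{label:>15}: {weeks_needed(HPL_LINES, LINES_PER_WEEK, speedup):5.1f} weeks")
```

Lines of code per week is a crude proxy, so the output is at best an order-of-magnitude guide.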

**1.3 (1 pts):** Try to prepare for some of the logistics ahead of you.  Answer the following questions:

- a. Where / how will you obtain the source for the benchmark driver and implementation that you will be using? (Regarding how: is it a tarball, repository, or other?)

- b. What software environments will you need to build and run the benchmark? (e.g. Does it use raw `make`? Autotools?  CMake?  Is it python/pip/conda?  Does it need MPI?  OpenMP?  Cuda?)

The source code is downloaded as a tarball from the HPL link above.

Dependency:

- [Intel MKL](https://software.intel.com/en-us/mkl)
- [MPI](https://www.open-mpi.org/)

Installation:

- To install MKL, download the tarball and run `install.sh`.

- To install MPI, follow the official [documentation](https://www.open-mpi.org/faq/?category=building). On pace-ice, MPI is already provided as a module.

- To install HPL, download the source tarball and follow the `INSTALL` file. Remember to edit the makefile so that its variables point to the correct MPI and MKL installations.

**1.4 (1 pts):** Successfully install and run your benchmark

Include in this directory an example **job submission script** that runs your benchmark code.

For details, see `hpl_job.sh`.

**1.5 (3 pts):** Develop a performance model for your benchmark

In 1.1, you chose a performance metric of your benchmark, let's call it $V$.  Your benchmark will solve a problem with some parameters (problem size, the choice of matrix / network / etc.), let's call those parameters $N$.  The node that you chose to run on will have some machine parameters (the number of cores, the type of GPU, the bandwidth from main memory, etc.), let's call them $P$.  Give an expression for $V(N,P)$ for your benchmark, and describe how you arrived at it.  You should use your discretion when choosing the level of detail.  If it is hard to develop a closed-form performance model for the whole benchmark, but there are a few key kernels that happen repeatedly in your benchmark (a stencil application, an iteration of stochastic gradient descent, etc.), you can give performance models for those kernel(s) instead.

If it is difficult to formulate your expression in terms of machine parameters, try to develop an expression
with coefficients that measure the rates at which the machine can do some lower-level operations (for example, the rate at which a GPU can sum an array).  If you have these coefficients, you should give a plausible description for how the architecture of the machine affects those rates.

$\textbf{Performance Model of the Benchmark}$

The program for solving a dense linear system has two steps. The first is the LU factorization of the dense matrix; the second is a pair of triangular solves (one with $L$, one with $U$) to obtain the solution.

The chosen performance metric is the amount of work (problem size) that can be solved per unit time.

Assume that the dense matrix $M$ has shape [$l\times l$]. The total number of usable threads (MPI processes) is $P$.

The benchmark distributes the large matrix by "wrapping": the large [$l\times l$] matrix is partitioned into small blocks $N_{ab}$ of size [$n_b\times n_b$], so that $M$ is composed of [$\frac{l}{n_b}\times\frac{l}{n_b}$] blocks. The blocks are then dealt out to the threads in sequence. For example, thread 1 is assigned block [$N_{11}$], thread 2 is assigned [$N_{12}$], and so on, as in the picture below (from [http://www.netlib.org/benchmark/hpl/algorithm.html]):

![wrapping](./wrapping.PNG)
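
The wrapped assignment can be sketched as a two-dimensional block-cyclic map. This is a minimal illustration, assuming a $\sqrt{P}\times\sqrt{P}$ process grid with row-major rank numbering and 0-based indices; the names `owner` and `owner_rank` are ours, not HPL's.

```python
def owner(a: int, b: int, p_rows: int, p_cols: int) -> tuple[int, int]:
    """Grid coordinates (process row, process column) that hold block N_ab.

    Blocks are dealt out cyclically in both directions, so block row `a`
    wraps around the process rows and block column `b` around the columns.
    """
    return a % p_rows, b % p_cols


def owner_rank(a: int, b: int, p_rows: int, p_cols: int) -> int:
    """Linear rank of the owner, assuming row-major numbering of the grid."""
    pr, pc = owner(a, b, p_rows, p_cols)
    return pr * p_cols + pc


# Print the owner of each block of a 4 x 4 block matrix on a 2 x 2 grid.
for a in range(4):
    print(" ".join(str(owner_rank(a, b, 2, 2)) for b in range(4)))
```

Each process ends up with blocks spread over the whole matrix, which keeps the load balanced as the factorization shrinks the active region.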

$\textbf{First, let's examine the LU factorization.}$

After all the submatrices are assigned, the LU factorization can begin. Each iteration $i$ of the LU factorization consists of four steps:

$\textbf{(1)}$ LU-factorize the diagonal submatrix [$N_{ii}$];

$\textbf{(2)}$ for all submatrices in the same column $i$ below [$N_{ii}$] (matrices [$N_{ji}$], $j > i$), compute the coefficients of the L-matrix $L_{ji}$;

$\textbf{(3)}$ for all submatrices in the same row $i$ to the right of [$N_{ii}$] (matrices [$N_{ij}$], $j > i$), compute the coefficients of the U-matrix $U_{ij}$;

$\textbf{(4)}$ for the lower-right part (matrices [$N_{jk}$], $j > i$ and $k > i$), update each submatrix by subtracting the product of the L-matrix block in column $i$ and the U-matrix block in row $i$, i.e. $N_{jk} \leftarrow N_{jk} - L_{ji}U_{ik}$.

The picture below (from [http://www.netlib.org/utk/papers/factor/node7.html]) shows the four steps:

![LU](./LU.PNG)
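
The four steps can be sketched sequentially in plain Python. This is a simplified illustration, not HPL's implementation: it runs on one process, does no pivoting (so it assumes a matrix for which that is safe, such as a diagonally dominant one), and the names `blocked_lu` and `reconstruct` are ours.

```python
def blocked_lu(a, nb):
    """In-place blocked LU without pivoting; L (unit lower) and U share `a`."""
    n = len(a)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # (1) factor the diagonal block N_ii
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # (2) column panel: L_ji = N_ji * U_ii^{-1}
        for i in range(k1, n):
            for k in range(k0, k1):
                for m in range(k0, k):
                    a[i][k] -= a[i][m] * a[m][k]
                a[i][k] /= a[k][k]
        # (3) row panel: U_ij = L_ii^{-1} * N_ij
        for j in range(k1, n):
            for k in range(k0, k1):
                for m in range(k0, k):
                    a[k][j] -= a[k][m] * a[m][j]
        # (4) trailing update: N_jk -= L_ji * U_ik
        for i in range(k1, n):
            for j in range(k1, n):
                a[i][j] -= sum(a[i][m] * a[m][j] for m in range(k0, k1))
    return a


def reconstruct(lu):
    """Multiply the packed factors back together to check the result."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                out[i][j] += l_ik * lu[k][j]
    return out


# Small diagonally dominant test matrix (illustrative values only).
n = 5
a0 = [[(n if i == j else 0) + 1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]
lu = blocked_lu([row[:] for row in a0], nb=2)
err = max(abs(reconstruct(lu)[i][j] - a0[i][j]) for i in range(n) for j in range(n))
print("max reconstruction error:", err)
```

In the parallel setting, steps (2) to (4) are where the broadcasts modeled below come in, since the blocks involved live on different processes.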

$\textbf{(1)}$ Now let's look at the $i$th iteration. The LU factorization of a [4$\times$4] matrix is illustrated below (from [https://blog.csdn.net/billbliss/article/details/78559289]).

![LUmat](./LUmat.PNG)

The total number of operations in this process can be written as

$$
\begin{split}
n(\text{LU-factorization of a matrix})=1(n_b-1)(\text{1st column, 1 division; }n_b-1\text{ elements are computed})\\
+ 2(n_b-1)(\text{1st row, 1 multiplication and 1 subtraction; }n_b-1\text{ elements})\\
+ 3(n_b-2)(\text{2nd column, 2 divisions and 1 subtraction; }n_b-2\text{ elements})\\
+ 4(n_b-2)(\text{2nd row, 2 divisions and 2 subtractions; }n_b-2\text{ elements})\\
+ \dots + (n_b-2)1 + (n_b-1)1
\end{split}
$$
$$=\sum_{x=1}^{n_b/2}(n_b-x)(2x+1)=\sum_{x=1}^{n_b/2}(n_b\times2x-2x^2+x)=2n_b\frac{(1+n_b/2)n_b/2}{2}-2\frac{n_b/2(n_b/2+1)(n_b+1)}{6}+(n_b/2)^2$$
$$\approx\frac{n_b^3}{6}+\frac{n_b^2}{2}$$
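
A quick numerical check of the leading-order behaviour: the sketch below evaluates the sum $\sum_{x=1}^{n_b/2}(n_b-x)(2x+1)$ directly and compares it with the approximation $n_b^3/6+n_b^2/2$. The function names are ours, and the check only confirms that the two agree to within about a percent for large $n_b$, not that the lower-order terms match exactly.

```python
def diag_block_ops(nb: int) -> int:
    """Direct evaluation of sum_{x=1}^{nb/2} (nb - x)(2x + 1)."""
    return sum((nb - x) * (2 * x + 1) for x in range(1, nb // 2 + 1))


def diag_block_ops_approx(nb: int) -> float:
    """Closed-form approximation nb^3/6 + nb^2/2 used in the model."""
    return nb**3 / 6 + nb**2 / 2


for nb in (64, 256, 1024):
    ratio = diag_block_ops(nb) / diag_block_ops_approx(nb)
    print(f"nb = {nb:5d}  sum / approximation = {ratio:.4f}")
```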

$\textbf{(2) and (3):}$ After the LU factorization of the submatrix [$N_{ii}$] ($0\leq i\leq l/n_b$), the benchmark broadcasts the result [L($n_b\times n_b$)] to the column of submatrices [$N_{ji},j>i$] and the result [U($n_b\times n_b$)] to the row of submatrices [$N_{ij},j>i$]. These submatrices then compute their coefficients.

$\sqrt{P}$ threads holding the submatrices in the $i$th column each receive a message of size $n_b^2/2$:

$$T(\text{column broadcast})=(\lambda+n_b^2/2\mu)\log_2(\sqrt{P})$$

These threads need to compute the coefficients of $(l/n_b-i)/l\times l/\sqrt{P}=(l/n_b-i)/\sqrt{P}$ submatrices:

$n(\text{column computation})=(l/n_b-i)/\sqrt{P}\times n_b(\text{rows of a submatrix})\times[1(\text{1st column})+3(\text{2nd column})+\dots+(2n_b-1)(n_b\text{th column})]$
$$=(l/n_b-i)/\sqrt{P}\times n_b\times\frac{(1+2n_b-1)n_b}{2}\approx(l/n_b-i)/\sqrt{P}\times n_b^3$$

$\sqrt{P}$ threads holding the submatrices in the $i$th row each receive a message of size $n_b^2/2$:

$$T(\text{row broadcast})=(\lambda+n_b^2/2\mu)\log_2(\sqrt{P})$$
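
The broadcast cost used here can be sketched as a small helper. We read $\lambda$ as the per-message latency and $\mu$ as the transfer time per word, and assume a tree broadcast among the $\sqrt{P}$ processes of one grid row or column; the numeric values in the example are placeholders, not measurements.

```python
import math


def bcast_time(words: float, p: int, lam: float, mu: float) -> float:
    """Modelled broadcast time (lam + words * mu) * log2(sqrt(p)).

    words : message size in words
    p     : total number of processes (the broadcast spans sqrt(p) of them)
    lam   : latency per message in seconds (assumed meaning of lambda)
    mu    : transfer time per word in seconds (assumed meaning of mu)
    """
    return (lam + words * mu) * math.log2(math.sqrt(p))


# Placeholder machine parameters, for illustration only.
lam, mu = 1e-6, 1e-9
nb = 100
print("panel broadcast, P = 25:", bcast_time(nb**2 / 2, 25, lam, mu), "s")
```

With a single process the logarithm is zero, so the model charges no communication, as expected.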

These threads also need to compute the coefficients of $(l/n_b-i)/l\times l/\sqrt{P}=(l/n_b-i)/\sqrt{P}$ submatrices:

$n(\text{row computation})=(l/n_b-i)/\sqrt{P}\times n_b(\text{rows of a submatrix})\times[2(\text{1st column})+4(\text{2nd column})+\dots+2(n_b-1)((n_b-1)\text{th column})]$
$$=(l/n_b-i)/\sqrt{P}\times n_b\times\frac{(2+2(n_b-1))(n_b-1)}{2}\approx(l/n_b-i)/\sqrt{P}\times n_b^3$$


$\textbf{(4)}$ After that, the remaining part of the large matrix $M$, i.e. the submatrices $N_{jk}$ ($j>i$, $k>i$), must be updated to remove the contribution of the part already factorized. $N_{ik}$ and $N_{ji}$ broadcast their results to the remaining submatrices. Each thread receives two messages: one of size $n_b\times (l/n_b-i)/\sqrt{P}$ from the thread holding the $i$th row, and one of size $(l/n_b-i)/\sqrt{P}\times n_b$ from the thread holding the $i$th column.

$$T(\text{broadcast on row direction}) = T(\text{broadcast on column direction})=\left[\lambda+(n_b\times (l/n_b-i)/\sqrt{P})\mu\right]\log_2(\sqrt{P})$$

After the communication, each thread performs a matrix multiplication and a matrix subtraction. For one thread,

$$n(\text{update the remain part})=[n_b^2(\text{multiplication})+n_b^2(\text{subtract})]\times[(l/n_b-i)/\sqrt{P}]^2$$


There are $l/n_b$ iterations in the whole LU factorization. Summing the operation counts and communication times over all iterations gives:

$$\sum_{i=0}^{l/n_b-1}\left[n(\text{LU-factorization of a matrix})+n(\text{column computation})+n(\text{row computation})+n(\text{update the remain part})\right]$$

$$\approx l\times\left(\frac{n_b^2}{6}+\frac{n_b}{2}\right)+\frac{l^2}{2}n_b/\sqrt{P}+\frac{l^2}{2}n_b/\sqrt{P}+\frac{2}{3P}\frac{l^3}{n_b}$$

$$=l\times\left(\frac{n_b^2}{6}+\frac{n_b}{2}\right)+l^2n_b/\sqrt{P}+\frac{2}{3P}\frac{l^3}{n_b}$$
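
As a sanity check on the summation, the sketch below adds up the per-iteration counts from steps (1) to (4) explicitly and compares the total with the closed form above. It uses the approximated per-iteration expressions exactly as written in this section; the function names and the example parameters are ours.

```python
import math


def lu_ops_summed(l: int, nb: int, p: int) -> float:
    """Sum the per-iteration operation counts over i = 0 .. l/nb - 1."""
    k = l // nb
    root_p = math.sqrt(p)
    total = 0.0
    for i in range(k):
        rem = k - i                                # remaining block rows/columns
        diag = nb**3 / 6 + nb**2 / 2               # step (1)
        col = rem / root_p * nb**3                 # step (2)
        row = rem / root_p * nb**3                 # step (3)
        upd = 2 * nb**2 * (rem / root_p) ** 2      # step (4)
        total += diag + col + row + upd
    return total


def lu_ops_closed(l: int, nb: int, p: int) -> float:
    """Closed form: l (nb^2/6 + nb/2) + l^2 nb / sqrt(P) + 2 l^3 / (3 P nb)."""
    return l * (nb**2 / 6 + nb / 2) + l**2 * nb / math.sqrt(p) + 2 / (3 * p) * l**3 / nb


l, nb, p = 10_000, 100, 25
print("summed / closed form:", lu_ops_summed(l, nb, p) / lu_ops_closed(l, nb, p))
```

The small remaining gap comes from replacing the discrete sums $\sum (l/n_b-i)$ and $\sum (l/n_b-i)^2$ by their leading terms.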

$$\sum_{i=0}^{l/n_b-1}[T(\text{column broadcast})+T(\text{row broadcast})+T(\text{broadcast on row direction})+T(\text{broadcast on column direction})]$$

$$=2\frac{l}{n_b}\times(\lambda+n_b^2/2\mu)\log_2(\sqrt{P})+2\left[\frac{l}{n_b}\times\lambda+\left[n_b\times \frac{(l/n_b)^2}{2}/\sqrt{P}\right]\mu\right]\log_2(\sqrt{P})$$


After the matrix [$M$] is LU-factorized, backward substitution is executed. 

$\textbf{Second, let's examine the backward substitution.}$

The backward substitution is done twice (once for the L matrix and once for the U matrix).

Consider solving for the $i$th subvector of the solution vector, whose length is also $n_b$.

$\textbf{First}$, the thread holding submatrix $N_{i,i-1}$ needs to receive the result for the previous, $(i-1)$th, subvector. The other threads holding submatrices $N_{ij}$ ($j<i-1$) should already have received the results they need (the $j$th subvector) from the thread holding $N_{jj}$ (once a thread obtains its solution, it broadcasts it to the remaining threads that need it).

$$T(\text{receiving the latest result, subvector } i-1)=\lambda+\mu n_b$$

$\textbf{Then}$ all threads holding the submatrices $N_{ij}$ ($j<i$) in the $i$th row need to compute their matrix multiplication (a block times a subvector):

$$n(N_{ij}\text{compute their matrix multiplication})=2\frac{in_b}{\sqrt{P}}\times n_b$$

$\textbf{Third}$, these results need to be reduced onto the thread holding submatrix $N_{ii}$:

$$T(\text{reducing the results of matrix multiplication } i-1)=\log_2(\sqrt{P})\left[\lambda+\frac{in_b}{\sqrt{P}}\mu\right]$$

$\textbf{Finally}$, the thread holding [$N_{ii}$] computes the $i$th subvector:

$$n(\text{solving the i th subvector})=1(\text{1st row divide})+(1+1+1)(\text{2nd row multiply-subtract-divide})+(2\times2+1)+\dots+[2\times(n_b-1)+1]=n_b^2$$

Summing the operation counts and communication times:

$$\sum_{i=0}^{l/n_b}[n(N_{ij}\text{compute their matrix multiplication})+n(\text{solving the i th subvector})]=\sum_{i=0}^{l/n_b}\left[2\frac{in_b}{\sqrt{P}}\times n_b+n_b^2\right]$$

$$=\frac{(l/n_b)^2}{2}\times2\frac{n_b^2}{\sqrt{P}}+\frac{l}{n_b}\times n_b^2=\frac{l^2}{\sqrt{P}}+ln_b$$

$$\sum_{i=0}^{l/n_b}[T(\text{receiving the latest result, subvector } i-1)+T(\text{reducing the results of matrix multiplication } i-1)]$$
$$=\sum_{i=0}^{l/n_b}\left[\lambda+\mu n_b+\log_2(\sqrt{P})\left[\lambda+\frac{in_b}{\sqrt{P}}\mu\right]\right]=\frac{l}{n_b}(\lambda+\mu n_b)+\frac{l}{n_b}\log_2(\sqrt{P})\lambda+\frac{l}{n_b}\log_2(\sqrt{P})\times\frac{(l/n_b)^2}{2}\times\frac{n_b}{\sqrt{P}}\mu$$

$$=\frac{l}{n_b}(\log_2\sqrt{P}+1)\lambda+\left[l+\frac{l^3}{n_b^2}\log_2\sqrt{P}\frac{1}{2\sqrt{P}}\right]\mu$$


Since the substitution is performed twice, the operation count and the communication time spent on computing the solution vector must be multiplied by 2.

$\textbf{Now let's compute the total time of computation.}$

$$T_{\text{benchmark}}(l, n_b, P)=T(\text{LU-factorize})+T(\text{backward substitution})$$
$$=\left[l\times\left(\frac{n_b^2}{6}+\frac{n_b}{2}\right)+l^2n_b/\sqrt{P}+\frac{2}{3P}\frac{l^3}{n_b}\right]/\text{flops}+2\frac{l}{n_b}\times(\lambda+n_b^2/2\mu)\log_2(\sqrt{P})+2\left[\frac{l}{n_b}\times\lambda+\left[n_b\times \frac{(l/n_b)^2}{2}/\sqrt{P}\right]\mu\right]\log_2(\sqrt{P})$$
$$+2\left[\frac{l^2}{\sqrt{P}}+ln_b\right]/\text{flops}+2\frac{l}{n_b}(\log_2\sqrt{P}+1)\lambda+2\left[l+\frac{l^3}{n_b^2}\log_2\sqrt{P}\frac{1}{2\sqrt{P}}\right]\mu$$

In this model, the first term and the last term are the two most important. To simplify the formula, the time can also be expressed as

$$T_{\text{benchmark}}(l, n_b, P)\approx \left[l\times\left(\frac{n_b^2}{6}+\frac{n_b}{2}\right)+l^2n_b/\sqrt{P}+\frac{2}{3P}\frac{l^3}{n_b}\right]/\text{flops}+2\left[l+\frac{l^3}{n_b^2}\log_2\sqrt{P}\frac{1}{2\sqrt{P}}\right]\mu$$
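
For evaluating the model numerically, the full expression for $T_{\text{benchmark}}$ can be transcribed into a function. This is a direct transcription under our reading of the notation: `flops` is the per-core flop rate in flop/s, $n_b^2/2\mu$ means $(n_b^2/2)\,\mu$, and the values of `lam` and `mu` in the example are placeholders rather than the measured ones.

```python
import math


def t_benchmark(l, nb, p, flops, lam, mu):
    """Modelled run time in seconds: LU factorization plus two substitutions."""
    root_p = math.sqrt(p)
    log_p = math.log2(root_p)
    # Operation counts (per thread), converted to time by dividing by flops.
    lu_ops = l * (nb**2 / 6 + nb / 2) + l**2 * nb / root_p + 2 / (3 * p) * l**3 / nb
    sub_ops = 2 * (l**2 / root_p + l * nb)
    # Communication during the LU factorization.
    lu_comm = (2 * (l / nb) * (lam + nb**2 / 2 * mu) * log_p
               + 2 * ((l / nb) * lam + nb * (l / nb) ** 2 / 2 / root_p * mu) * log_p)
    # Communication during the two substitutions.
    sub_comm = (2 * (l / nb) * (log_p + 1) * lam
                + 2 * (l + l**3 / nb**2 * log_p / (2 * root_p)) * mu)
    return (lu_ops + sub_ops) / flops + lu_comm + sub_comm


# 30e9 flop/s per core is the rate used in this report; lam and mu are placeholders.
for p in (1, 4, 9, 16, 25):
    t = t_benchmark(l=10_000, nb=100, p=p, flops=30e9, lam=1e-6, mu=1e-9)
    print(f"P = {p:2d}  predicted time = {t:.4f} s")
```

Setting `lam = mu = 0` isolates the computation terms, which is a convenient way to see how much of the prediction comes from communication.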

**1.6 (2 pts):** Gather statistics for the performance metric

Include in this directory the **job script(s)** that you use to gather statistics for the performance metric on pace-ice.  Additionally, describe what steps you've taken to ensure the quality of the statistics: how are you accounting for variability / noise?  Does your benchmark show different performance on the first run than on subsequent runs?

If you are running your benchmark for multiple problem instances ($N$), include a plot of the performance metric for the different problem instances. (You can include error bars for maximum/minimum values of the performance metric for the same problem instance to convey variability.)

In the HPL configuration file, one can specify multiple values for each parameter, and the program will run every combination, which is very handy.

We took advantage of that and ran HPL several times. Data are extracted from the `HPL.out` files using `grep` and `awk` and then converted to CSV. Further analysis is done in another `.ipynb` file in the same directory.
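
The same extraction can also be sketched in Python. This is an illustrative alternative, not the script we ran: it assumes result lines of the form `T/V  N  NB  P  Q  Time  Gflops` whose first field starts with `W`, and the sample text below is made up purely to exercise the parser.

```python
import csv
import io
import re

FIELDS = ["T/V", "N", "NB", "P", "Q", "Time", "Gflops"]

# One result line: variant code, four integers, time, Gflops.
RESULT_LINE = re.compile(
    r"^(W\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([0-9.]+)\s+([0-9.eE+-]+)\s*$"
)


def parse_hpl_output(text: str) -> list[dict]:
    """Return one dict per result line found in the text of an HPL.out file."""
    rows = []
    for line in text.splitlines():
        m = RESULT_LINE.match(line)
        if m:
            tv, n, nb, p, q, secs, gflops = m.groups()
            rows.append({"T/V": tv, "N": int(n), "NB": int(nb), "P": int(p),
                         "Q": int(q), "Time": float(secs), "Gflops": float(gflops)})
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize parsed rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample in the assumed layout (not a real measurement).
SAMPLE = """\
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       10000   100     5     5               1.00              6.668e+02
"""

print(to_csv(parse_hpl_output(SAMPLE)))
```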

**1.7 (3 pts):** Compare your performance model to your statistics

If your performance model allowed you to make a concrete prediction of $V(N,P)$ before running the benchmark,
compare (in a table or plot) the predictions and the actual measurements.

If your performance model includes coefficients that could not be estimated ahead of time, use measurements gathered during your experiments to get empirical values of those coefficients to fit the model to the data.  Once you do this, answer the following question: is there a plausible explanation (in terms of the architecture of the machine and the nature of the algorithm) for why these coefficients have the value that they do?

Present additional timings and/or machine performance metrics (either for the full benchmark or key kernels) and make a case for why you think they best demonstrate what the biggest bottleneck to the performance of the benchmark is.

Notes:

- HPL has many parameters other than N, NB and the process grid. We tried combinations of them in our very first runs and observed that they do not make a significant difference in overall performance (<0.1%). We therefore fixed a single value for each to save run time.

- The values of $\lambda$, $\mu$ and the peak flop rate: we use the $\lambda$ and $\mu$ calculated in the MPI assignments. In our runs, we observe the flop rate saturating at ~30G per core, which is close to the value calculated in hw2 (2.1T / 2 / 28 = 37.5G, 30/37.5 = 80%). We simply use 30G as the peak flop rate per core in our model.
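
The per-core comparison in the note above is simple arithmetic, sketched here for reference. The divisor of 2 is taken from the text as is (we do not restate what it represents), and the function name is ours.

```python
def per_core_peak(node_peak: float, divisor: float, cores: int) -> float:
    """Per-core peak rate: node peak divided by the divisor and the core count."""
    return node_peak / divisor / cores


peak = per_core_peak(2.1e12, 2, 28)   # 2.1T / 2 / 28, numbers from the note above
observed = 30e9                        # observed saturation rate per core
print(f"per-core peak = {peak / 1e9:.1f}G, observed / peak = {observed / peak:.0%}")
```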

<img src="N_T_1_100_raw.png"> <img src="N_T_1_100.png"> <img src="N_flops_1_100.png">
We first ran the problem with only one process, to see some properties of the problem before introducing parallelism and to check whether our model works. The results show that our model successfully predicts the trend and, as the log-scale figure shows more clearly, that our predicted time is a constant factor smaller than the observed time. From the third figure, we observe that the flop rate increases with problem size. The curve looks irregular, which we attribute to N = 2000 being an outlier.

<img src="N_T_25_100_raw.png"> <img src="N_T_25_100.png"> <img src="N_flops_25_100.png">
We then repeated the experiment with 25 processes and observed a similar trend in the time, again with a constant-factor difference between our prediction and the observed time. In the third figure we see more clearly that the flop rate grows with problem size and saturates at some point. We do not need to try larger N to conclude that the peak flop rate per core here is ~30G.

<img src="NB_T_10000_25.png"> <img src="NB_flops_10000_25.png">
Next we fixed N to see whether NB makes a difference. Both figures show that NB does not significantly affect performance. This time we do not even need a log scale to see that our model has the correct trend but a smaller absolute value. Although the flop-rate plot appears to vary a lot, the absolute difference is very small.

<img src="grid_T_10000_100_log.png"> <img src="grid_flops_25_100.png">
Then we fixed N and NB and ran exactly the same problem with different numbers of processes. The run time did drop, but the flop rate also dropped. That is reasonable: for the same problem, more processes means the data are more distributed and each process holds less data, so processes spend more time waiting for each other instead of computing.
Note that the flop rate with 28 processes is even lower than with 25 processes. This is because a square process grid suits this kind of problem better, as the HPL documentation also notes:

    HPL likes "square" or slightly flat process grids. Unless you are using a very small process grid, stay away from the 1-by-Q and P-by-1 process grids.
    
However, our model only considers the case $P = Q$, so the last point from our model is not meaningful. Still, we successfully predict the drop from 1 process to 25 processes.

###### Summary
From the analysis above, we see that our model always gives a value that is ~10 times smaller than the actual time, regardless of the N, NB or grid we use. Since we are using a measured flop rate, the discrepancy must come from underestimating some operations. Our model counts operations optimistically: we only count time spent actually computing or inside MPI communication, not time spent waiting for other processes, which could account for some of the difference.

During the analysis, we noticed that the performance is not very stable. It can vary when the parameters used by MPI change, and we think the outliers in the charts above may come from this.
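
One way to quantify a constant-factor gap between model and measurement is to fit a single scale factor in log space, i.e. the geometric mean of the measured-to-predicted ratios. The sketch below shows the idea; the function name is ours and the numbers are placeholders, not our measurements.

```python
import math


def fit_scale(measured: list[float], predicted: list[float]) -> float:
    """Least-squares scale factor c in log space, so that measured ~ c * predicted.

    Minimizing sum (log m - log p - log c)^2 gives log c = mean(log(m / p)),
    which is the geometric mean of the ratios.
    """
    logs = [math.log(m / p) for m, p in zip(measured, predicted)]
    return math.exp(sum(logs) / len(logs))


# Placeholder timings in seconds (illustrative only).
predicted = [0.01, 0.08, 0.60]
measured = [0.11, 0.75, 6.30]
print(f"fitted scale factor: {fit_scale(measured, predicted):.2f}")
```

A factor that stays roughly constant across N, NB and grid sizes would support the reading that the model captures the trend but misses a multiplicative cost.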

## 2. Paper report (5 pts)

Choose one of the two Gordon Bell Prize finalists this year:

- [A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations](https://dl.acm.org/citation.cfm?id=3357156)
- [Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing: 46 PFLOPS simulation of a metallic dislocation system](https://dl.acm.org/citation.cfm?id=3357157)

Read the paper, and answer the following questions:

- **a.** What is the problem being solved?

- **b.** What are the most important kernels of the algorithm for solving the problem? (In this context consider a kernel to be a subproblem that is relevant to more than just the specific problem from a.)

- **c.** What about the important kernels and/or the size of the problem make this a challenging problem?

- **d.** Summarize the innovation of this paper.

- **e.** What model is used that combines the problem parameters and machine parameters to predict performance?

- **f.** Any paper that is submitted for a prize contains some marketing, and maybe some attempts to [fool the masses](https://blogs.fau.de/hager/archives/category/fooling-the-masses).  If you could ask the authors to submit one additional figure with the performance measurements of an experiment, what would you choose, and why?