# Project 1: Molecular Dynamics with OpenMP

This assignment is due in two weeks' time, by **9:30 am on Thursday October 3rd**.

**You may work in pairs on this assignment:** When you officially submit this project on Canvas, you should indicate in the text submission field on Canvas:

- Who, if anyone, you are working with
- If you are working in pairs, indicate whether the repository to be graded is yours or your partner's.
- Which commit of your repository you would like to be graded (we will grade the `master` branch by default if no choice is made)

**Which type of node are you using?** Because OpenMP can be used to program the GPUs, you may choose to optimize the application for any of the three types of nodes used in this class.  Declare the type of node you would like to use here:

**This notebook should be run on a node with 28 CPU cores and no GPUs**

If you will not use the GPUs, you should use the following modules:

In [1]:
module use $CSE6230_DIR/modulefiles
module load cse6230/core

|                                                                         |
|       A note about python/3.6:                                          |
|       PACE is lacking the staff to install all of the python 3          |
|       modules, but we do maintain an anaconda distribution for          |
|       both python 2 and python 3. As conda significantly reduces        |
|       the overhead with package management, we would much prefer        |
|       to maintain python 3 through anaconda.                            |
|                                                                         |
|       All pace installed modules are visible via the module avail       |
|       command.                                                          |
|                                                                         |


If you will use the GPUs, you should use the following modules:

In [None]:
# module use $CSE6230_DIR/modulefiles
# module load cse6230/gcc-omp-gpu

(I've included a set of makefile rules for GNU-based builds: you can use `make MAKERULES=gcc` wherever you would use `make` and it should work.  You should do this if you are using the `gcc-omp-gpu` module or if you are developing on your laptop and don't have the Intel compilers.)

## About this program

The code for this assignment started out almost exactly the same as your third assignment with interacting particles.  We saw in that assignment the way that $O(n^2)$ interactions in an $n$-body simulation dominate the rest of the operations.  This project shows an attempt to return that work complexity from $O(n^2)$ back down to $O(n)$ or thereabouts.
  
Some of the potentials that define interactions in molecular dynamics decay *quite* rapidly.  So rapidly that it is not a terrible approximation to assign to each particle an effective **radius $r$**.  If two particles are not touching (that is, if their centers are more than $2r$ apart), then the interaction can safely be ignored (particularly if it will be drowned out by the background *Brownian* noise that we saw last week).  Inside of $2r$, the overlapping particles start pushing each other apart.

If you'd like to see the particulars of this assignment's interaction force, you can look at `steric.h`, so called because the force approximates [steric effects](https://en.wikipedia.org/wiki/Steric_effects).

In [2]:
pygmentize steric.h

#if !defined(STERIC_H)
#define      STERIC_H

#include <math.h>


/* This kernel should be called if the distance between two particles is less
 * than twice the particle radius */
static inline void
force_in_range (double k, /* The interaction strength (may be scaled by the time step already) */
                double r, /* The radius of a particle.  Two particles interact if they intersect */
                double R, /* The distance between these two particles */
                double dx, double dy, double dz, /* The displacement from particle 2 to particle 1 *

(If you find part of your program is compute bound, you are welcome to change the implementations in `steric.h`, as long as you still calculate the same function.)

Now, suppose that our particles bounce around and repel each other until they are roughly in equilibrium.  We would expect that they would be well spread out, and that the chance of any two particles interacting would be no more likely than two particles placed at random.

A particle interacts with any particle within a range of $2r$, which means that around each particle there is a sphere with volume $V_p = \frac{4}{3}\pi (2r)^3\approx 33.5 r^3$: any particle whose center is outside of that sphere does not interact.  Suppose the volume of the periodic domain is $V_D$, and there are $N_p$ particles.  If the other $N_p - 1$ particles are distributed at random, then we expect $V_p (N_p - 1)/ V_D$ of those particles to interact with the particle in question.  Counting each pair once, we might therefore expect $N_p V_p (N_p - 1) / (2 V_D)$ interactions in total.

What's the point of this calculation?  Well, when we run a periodic simulation, we are trying to approximate a larger domain with a fixed *density* of particles per volume.  Thus, if we consider $\phi = N_p/ V_D$ to be a fixed density of the problem we are trying to simulate, then the number of interactions is $\approx V_p (N_p - 1) \phi / 2$.
*We should expect the number of interactions to scale linearly with the number of particles if we keep $\phi$ fixed.*
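
The estimate above can be checked against the numbers that `cloud` prints at startup. The following is a minimal sketch, not code from the project: the function names are hypothetical, and it assumes (based on the printed values, e.g. 1.07233 per particle for 256 particles with $L = 20$, $r = 1$) that the program uses $N_p$ in place of $N_p - 1$.

```c
#include <assert.h>

static const double PI = 3.14159265358979323846;

/* Volume V_p of the sphere of radius 2r inside which another particle's
 * center must lie for the two particles to interact. */
static double interaction_volume(double r)
{
  double range = 2. * r;
  return 4. / 3. * PI * range * range * range;
}

/* Expected interaction partners per particle in a periodic cube of width L.
 * Assumption: N_p is used instead of N_p - 1, which reproduces the values
 * printed by the program. */
static double expected_per_particle(int Np, double L, double r)
{
  assert(L > 0.);
  return interaction_volume(r) * Np / (L * L * L);
}

/* Expected number of interacting pairs overall (each pair counted once). */
static double expected_pairs(int Np, double L, double r)
{
  return 0.5 * Np * expected_per_particle(Np, L, r);
}
```

Calling `expected_per_particle(256, 20., 1.)` and `expected_pairs(256, 20., 1.)` gives roughly 1.07233 and 137.258, the same figures reported in the program output for the smallest benchmark.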

So how can we exploit the fact that only $O(N_p)$ interactions are expected instead of $O(N_p^2)$?  In our acceleration routine, we should try to rule out particles from interacting with each other.

One way to do this is *binning*: we divide up our periodic domain $[-L/2,L/2)^3$ into a grid of $b$ boxes per dimension, $b^3$ boxes total.  An algorithm would look like the following:

1. Given each particle's coordinates, assign it to the appropriate box.
2. If the length of a box $(L / b)$ is longer than $2r$, then every particle can only interact with particles
  - in its own box,
  - in neighboring boxes.
3. So loop over neighboring boxes and create a list of *pairs of particles* that are close enough to interact.
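
The index arithmetic behind these steps can be sketched as follows. This is an illustration only: the helper names are hypothetical and are not the functions used by the `IX` object in the project, and the sketch assumes coordinates have already been wrapped into $[-L/2, L/2)$.

```c
#include <assert.h>

/* Map a coordinate x in [-L/2, L/2) to a box index in [0, b). */
static int box_index(double x, double L, int b)
{
  assert(L > 0. && b > 0);
  int i = (int) ((x + 0.5 * L) / L * b);
  if (i < 0) i = 0;          /* guard against round-off at the lower edge */
  if (i >= b) i = b - 1;     /* guard against round-off at the upper edge */
  return i;
}

/* Wrap a (possibly out-of-range) box index periodically into [0, b). */
static int wrap_index(int i, int b)
{
  return ((i % b) + b) % b;
}

/* Flatten a triple of box indices into a single box number in [0, b^3). */
static int flat_box(int ix, int iy, int iz, int b)
{
  return (ix * b + iy) * b + iz;
}

/* Fill out[27] with the box itself and its 26 periodic neighbors. */
static void neighbor_boxes(int ix, int iy, int iz, int b, int out[27])
{
  int n = 0;
  for (int dx = -1; dx <= 1; dx++)
    for (int dy = -1; dy <= 1; dy++)
      for (int dz = -1; dz <= 1; dz++)
        out[n++] = flat_box(wrap_index(ix + dx, b), wrap_index(iy + dy, b),
                            wrap_index(iz + dz, b), b);
}
```

A pair search would then compare each particle only against particles stored in those 27 boxes and keep the pairs whose periodic distance is below $2r$. Note that with fewer than 3 boxes per dimension the wrapped neighbor list contains repeated boxes, so a real implementation has to avoid counting pairs twice in that case.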

This is what is done now in `accelerate.c`: there is an interaction "object" that handles the internals of binning particles into boxes: it returns a list of pairs on request.

The previous $O(N_p^2)$ calculation is available for comparison and debugging purposes.

In [3]:
sed -n '54,85 p' accelerate.c | pygmentize -l c

static void
accelerate_ix (Accel accel, Vector X, Vector U)
{
  IX ix = accel->ix;
  int Np = X->Np;
  int Npairs;
  ix_pair *pairs;
  double L = accel->L;
  double k = accel->k;
  double r = accel->r;

  for (int i = 0; i < Np; i++) {
    for (int j = 0; j < 3; j++) {
      IDX(U,j,i) = 0.;
    }
  }

  IXGetPairs (ix, X, 2.*r, &Npairs, &pairs);
  for (int p = 0; p < Npairs; p++) {
    int i = pairs[p].p[0];
    int j = pairs[p].p[1];
    double du[3];

    force (k, r, L, IDX(X,0,i), IDX(X,1,i), IDX(X,2,i), IDX(X,0

## Your task

You're free to make just about any changes you'd like to the code.  The `cloud` program is currently a functioning serial program with a small amount of OpenMP already mixed in.  Below is a sequence of problems of increasing size $N_p$ but fixed density.

You should specify OpenMP environment variables before this loop that will be used by the programs.

In [1]:
make clean
export OMP_NUM_THREADS=14
export OMP_PROC_BIND=spread
export OMP_SCHEDULE=static
export COPTFLAGS='-O3'

rm -f *.o cloud


In [2]:
for N_p in 1 2 4 8 16 32; do
  this_L=`echo "$N_p 0.333 20." | awk '{ print ($3 * $1^$2); }'`
  this_T=`echo "$N_p 25600" | awk '{ print ($2 / ($1 * $1)); }'`
  make runcloud NP=$(( 256*$N_p )) L=$this_L NT=$this_T PERF="perf stat"
done

make --silent clean
make --silent cloud
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
perf stat ./cloud 256 25600 1.e-4 100. 1. 20 1.
[./cloud] NUM_POINTS=256, NUM_STEPS=25600, CHUNK_SIZE=25600, DT=0.0001, K=100, D=1, L=20, R=1
With 256 particles of radius 1 and a box width of 20.000000, the volume fraction is 0.134041.
The interaction volume is 33.5103, so we expect 1.07

icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
perf stat ./cloud 4096 100 1.e-4 100. 1. 50.3503 1.
[./cloud] NUM_POINTS=4096, NUM_STEPS=100, CHUNK_SIZE=100, DT=0.0001, K=100, D=1, L=50.3503, R=1
With 4096 particles of radius 1 and a box width of 50.350300, the volume fraction is 0.134413.
The interaction volume is 33.5103, so we expect 1.07531 interactions per particle, 2202.23 overall.

 Performance counter stats for './cloud 4096 100 1.e-4 100. 1. 50.3503 1.':

       2575.459773      task-clock (msec)         #   13.297 CPUs utilized          
               161      context-switches          #    0.063 K/sec                  
                26      cpu-migrations            #    0.010 K/sec                  
             1,302      page-faults 

However, your code must still be correct: an effective diffusion coefficient can be computed for the type of particles you are simulating.  The following diffusion coefficient calculation should stay in the range of 0.77-0.92:

In [3]:
make checkcloud NP=512 L=25.198421 NT=51000 CHUNK=1000

make --silent clean
make --silent cloud
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
./cloud 512 51000 1.e-4 100. 1. 25.198421 1. 1000 check | python3 check.py
Diffusion constant: [ 0.87511503]


## Grading

### 4 pts: Hassle-free usage: if the bash script that is generated by `jupyter nbconvert` from this notebook runs without issue

### 6 pts: For code that correctly parallelizes all critical kernels (including the binning calculations in `interactions.c`)
 
- A correct diffusion coefficient is required for correctness
- If your code is not correct, points can be salvaged with *legible code* that describes what changes you are making

### 6 pts: Speed.  Any (correct) code that is the fastest on one of the benchmark problem sizes (for the node type you have selected) automatically gets 6 pts.  Code that does not outperform the initial version on any benchmark gets no points.  1 point will be available for each benchmark problem that shows non-trivial improvements in performance.

**Significant improvement is defined by the following speedups for the benchmark problems:**

- `N_p = 1`: 2x
- `N_p = 2`: 5x
- `N_p = 4, 8, 16, 32`: 9x

### 4 pts: Report.  In a cell below this one, describe the optimizations that you made and why you made them.

- Full points will require evidence (such as a screenshot) from `hpctoolkit` or some other profiling utility that motivates or justifies your changes.
- Points will be awarded for optimizations that you tried that did not work as long as you have a good explanation for why you tried them and why they didn't work.

# ----------------------------------------Report-------------------------------------------

## General Procedures & Approach:
### 1. Run the original serial version of the code and record the running time and diffusion constant as the baseline for later optimization benchmarks
### 2. Use hpcviewer to examine how the running time is divided and find which parts of the code take the most time and need parallelization
### 3. Modify that code, adding parallelization to speed it up while preserving correctness, then run the modified code to measure the improvement over the original benchmarks

### Using hpcviewer to find the code components that need parallelization:
```
make clean
export OMP_NUM_THREADS=14
export OMP_PROC_BIND=spread
export OMP_SCHEDULE=static
export COPTFLAGS='-O3'
make checkcloud NP=512 L=25.198421 NT=51000 CHUNK=1000 PERF="hpcrun"
ls -d hpctoolkit-*
hpcstruct ./cloud
ll cloud.hpcstruct
hpcprof -S cloud.hpcstruct hpctoolkit-cloud-measurements-*.ice-sched.pace.gatech.edu
```

**The results**:


![title](hpc.png)

### Although the test case runs only briefly and loading modules takes up a noticeable share of the time, a closer look at the computational part shows that every call path involving `IXGetPairs` consumes a significant percentage of the computation time. This drew my attention to `interactions.c`, and I concluded that it is the file that most needs parallelization to speed up the program.

### However, the other files need to be examined as well, so that as much of the code as possible is parallelized and the speedup gets closer to optimal, in line with Amdahl's law.
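
As a reminder of what Amdahl's law states (a general result, not a measurement from this project): if a fraction $f$ of the serial running time can be parallelized perfectly across $p$ threads, the best achievable speedup is

$$S(p) = \frac{1}{(1 - f) + f/p}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}.$$

For instance, with a purely hypothetical $f = 0.9$ and $p = 14$ threads, $S = 1/(0.1 + 0.9/14) \approx 6.1$, so the remaining serial 10% already limits the speedup to well under 14.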

### Some loops iterate only over the `Np` particles and may contribute only a trivial performance improvement when parallelized. I tried parallelizing them as well to see whether that improves or lowers performance. I noticed that when such a loop contains parts that must execute serially (critical sections), parallelizing it can actually slow the code down with my current settings, whereas loops without such parts are usually faster after the modification.

### The loops that take significant time are the ones that iterate over the pairs. Those loops consume a lot of computation and must be parallelized.

### While examining `interactions.c`, I focused on `IXGetPairs`. I tested several `for` loops within this function, and the one with the largest impact is the pair-interaction calculation, which checks each particle against the particles in its own box and in the neighboring boxes.

### Using this part of the code as an example, I started by applying `#pragma omp parallel for collapse(3)` to the three nested `for` loops to collapse them into one. I added the clauses `default(shared)` and `private(p1,p2,d2,dx,dy,dz,idx,idy,idz,bp,neigh_idx,neigh_idy,neigh_idz,neigh_bp)` to specify which variables are shared and which are private to each thread, to avoid segmentation faults. Some steps call `IXPushPair(ix,p1,p2);`, the function that updates the list of pairs; I protected these calls with `#pragma omp critical` so that only one thread updates the list at a time.
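
The pattern described above (a collapsed triple loop over boxes with a critical section around the shared push) can be illustrated with a reduced, self-contained sketch. It is not the project's `IXGetPairs`: the types `pair_t` and `pair_list_t`, the linked-list arrays `head` and `next`, and the function names are all hypothetical, and the sketch only collects pairs that share a box, with no distance check and no neighbor boxes.

```c
#include <assert.h>

typedef struct { int p[2]; } pair_t;

/* Hypothetical fixed-capacity pair list shared by all threads. */
typedef struct {
  pair_t *pairs;
  int     count;
  int     capacity;
} pair_list_t;

/* Append one pair; not thread safe on its own. */
static void push_pair(pair_list_t *list, int p1, int p2)
{
  assert(list->count < list->capacity);
  list->pairs[list->count].p[0] = p1;
  list->pairs[list->count].p[1] = p2;
  list->count++;
}

/* Collect every pair of particles that share a box in a b x b x b grid.
 * head[box] is the first particle in a box (-1 if empty) and next[p] is the
 * following particle in the same box (-1 at the end of the chain). */
static void same_box_pairs(int b, const int *head, const int *next,
                           pair_list_t *list)
{
  /* Variables declared inside the loop body are private to each thread
   * automatically, so no explicit private() list is needed here. */
  #pragma omp parallel for collapse(3) default(shared)
  for (int ix = 0; ix < b; ix++) {
    for (int iy = 0; iy < b; iy++) {
      for (int iz = 0; iz < b; iz++) {
        int box = (ix * b + iy) * b + iz;
        for (int p1 = head[box]; p1 >= 0; p1 = next[p1]) {
          for (int p2 = next[p1]; p2 >= 0; p2 = next[p2]) {
            /* Only one thread at a time may modify the shared list. */
            #pragma omp critical
            push_pair(list, p1, p2);
          }
        }
      }
    }
  }
}
```

The order of the collected pairs depends on thread scheduling, so only the set of pairs is reproducible. Because every push goes through the same critical section, the pushes are serialized; whether that matters depends on how much work each thread does between pushes.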

### I examined the other files in a similar way. For example, `accelerate.c` has a loop over the pairs, so I made sure to parallelize that as well. I placed `#pragma omp parallel for` before `for (int p = 0; p < Npairs; p++)` and protected the updates `IDX(U,d,i) += du[d];` and `IDX(U,d,j) -= du[d];` by putting `#pragma omp atomic update` on each of them.
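
A reduced sketch of this atomic-update pattern is shown below. It is an illustration under stated assumptions rather than the project's `accelerate_ix`: the flat `3*i + d` array layout replaces the `IDX` macro, the function name is hypothetical, and a plain displacement stands in for the real `force` kernel.

```c
#include <assert.h>

typedef struct { int p[2]; } pair_t;

/* Accumulate equal and opposite contributions for each pair into U.
 * X and U hold 3 doubles per particle (hypothetical layout). */
static void accumulate_pairs(int Npairs, const pair_t *pairs,
                             const double *X, double *U)
{
  assert(Npairs >= 0);
  #pragma omp parallel for schedule(static)
  for (int p = 0; p < Npairs; p++) {
    int i = pairs[p].p[0];
    int j = pairs[p].p[1];
    double du[3];

    /* Stand-in for the real force kernel: displacement from j to i. */
    for (int d = 0; d < 3; d++) du[d] = X[3 * i + d] - X[3 * j + d];

    for (int d = 0; d < 3; d++) {
      /* Different pairs can touch the same particle, so each update of the
       * shared array must be atomic. */
      #pragma omp atomic update
      U[3 * i + d] += du[d];
      #pragma omp atomic update
      U[3 * j + d] -= du[d];
    }
  }
}
```

Atomic updates protect a single memory location each, which is usually cheaper than a critical section around the whole loop body, although contention can still cost time when many pairs share a particle.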

### I also noticed that, besides those two major changes, increasing `boxdim` to `8` speeds the code up dramatically. This might be because `boxdim = 8` lets the cache be used more efficiently. I tried several values larger than `8`, but `8` works best in my code.
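
Independently of any cache effect, the box count also changes how many candidate partners have to be examined per particle. The sketch below is a rough estimate only: it assumes that `boxdim` is the number of boxes per dimension (the $b$ of the binning description), that particles are uniformly distributed, and the function names are hypothetical.

```c
#include <assert.h>

/* Largest number of boxes per dimension such that the box length L / b is
 * still at least the interaction range 2r. */
static int max_boxes_per_dim(double L, double r)
{
  assert(L > 0. && r > 0.);
  int b = (int) (L / (2. * r));
  return b < 1 ? 1 : b;
}

/* Expected number of candidate partners examined per particle when the 27
 * boxes around it are searched, assuming uniformly distributed particles. */
static double candidates_per_particle(int Np, int b)
{
  assert(b >= 3);
  return 27. * (double) Np / ((double) b * b * b);
}
```

For the smallest benchmark (256 particles, $L = 20$, $r = 1$) this gives at most 10 boxes per dimension, and about 13.5 candidates per particle with 8 boxes per dimension versus 108 with 4, so a finer grid means far fewer distance checks as long as the boxes stay at least $2r$ wide.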

### Besides those changes, I also parallelized other parts of the code where doing so does not slow the program down overall. I commented those changes in the code.

### I decided to use 14 threads: the CPU I selected has 28 cores, but 14 are physical cores, and using 14 gives better memory locality. I set `OMP_PROC_BIND=spread` to spread the threads evenly across the 2 sockets.


## ----------------------------------------Original Sim Results--------------------------------------------

### Original Running Time for N_p in 1 2 4 8 16 32 benchmarks results: (reference baseline)

**N_p = 1**
perf stat ./cloud 256 25600 1.e-4 100. 1. 20 1.
[./cloud] NUM_POINTS=256, NUM_STEPS=25600, CHUNK_SIZE=25600, DT=0.0001, K=100, D=1, L=20, R=1
With 256 particles of radius 1 and a box width of 20.000000, the volume fraction is 0.134041.
The interaction volume is 33.5103, so we expect 1.07233 interactions per particle, 137.258 overall.

 Performance counter stats for './cloud 256 25600 1.e-4 100. 1. 20 1.':

      14033.139682      task-clock (msec)         #    1.001 CPUs utilized          
                46      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               584      page-faults               #    0.042 K/sec                  
    46,229,603,407      cycles                    #    3.294 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    75,044,615,473      instructions              #    1.62  insns per cycle        
    11,007,339,609      branches                  #  784.382 M/sec                  
       157,650,330      branch-misses             #    1.43% of all branches        

      14.024503229 seconds time elapsed

**N_p = 2**
perf stat ./cloud 512 6400 1.e-4 100. 1. 25.1926 1.
[./cloud] NUM_POINTS=512, NUM_STEPS=6400, CHUNK_SIZE=6400, DT=0.0001, K=100, D=1, L=25.1926, R=1
With 512 particles of radius 1 and a box width of 25.192600, the volume fraction is 0.134134.
The interaction volume is 33.5103, so we expect 1.07307 interactions per particle, 274.707 overall.

 Performance counter stats for './cloud 512 6400 1.e-4 100. 1. 25.1926 1.':

      12527.597975      task-clock (msec)         #    1.000 CPUs utilized          
                45      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               590      page-faults               #    0.047 K/sec                  
    41,277,734,659      cycles                    #    3.295 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    72,071,693,973      instructions              #    1.75  insns per cycle        
    10,776,279,176      branches                  #  860.203 M/sec                  
        69,080,128      branch-misses             #    0.64% of all branches        

      12.524094910 seconds time elapsed

**N_p = 4**
perf stat ./cloud 1024 1600 1.e-4 100. 1. 31.7334 1.
[./cloud] NUM_POINTS=1024, NUM_STEPS=1600, CHUNK_SIZE=1600, DT=0.0001, K=100, D=1, L=31.7334, R=1
With 1024 particles of radius 1 and a box width of 31.733400, the volume fraction is 0.134227.
The interaction volume is 33.5103, so we expect 1.07381 interactions per particle, 549.792 overall.

 Performance counter stats for './cloud 1024 1600 1.e-4 100. 1. 31.7334 1.':

      11870.610461      task-clock (msec)         #    1.000 CPUs utilized          
                44      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               600      page-faults               #    0.051 K/sec                  
    39,094,559,769      cycles                    #    3.293 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    70,487,474,610      instructions              #    1.80  insns per cycle        
    10,634,329,558      branches                  #  895.854 M/sec                  
        35,599,301      branch-misses             #    0.33% of all branches        

      11.867447082 seconds time elapsed

**N_p = 8**
perf stat ./cloud 2048 400 1.e-4 100. 1. 39.9723 1.
[./cloud] NUM_POINTS=2048, NUM_STEPS=400, CHUNK_SIZE=400, DT=0.0001, K=100, D=1, L=39.9723, R=1
With 2048 particles of radius 1 and a box width of 39.972300, the volume fraction is 0.13432.
The interaction volume is 33.5103, so we expect 1.07456 interactions per particle, 1100.35 overall.

 Performance counter stats for './cloud 2048 400 1.e-4 100. 1. 39.9723 1.':

      11536.999394      task-clock (msec)         #    1.001 CPUs utilized          
                42      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               620      page-faults               #    0.054 K/sec                  
    38,015,400,842      cycles                    #    3.295 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    69,629,918,286      instructions              #    1.83  insns per cycle        
    10,546,673,223      branches                  #  914.161 M/sec                  
        17,820,102      branch-misses             #    0.17% of all branches        

      11.531132050 seconds time elapsed

**N_p = 16**
perf stat ./cloud 4096 100 1.e-4 100. 1. 50.3503 1.
[./cloud] NUM_POINTS=4096, NUM_STEPS=100, CHUNK_SIZE=100, DT=0.0001, K=100, D=1, L=50.3503, R=1
With 4096 particles of radius 1 and a box width of 50.350300, the volume fraction is 0.134413.
The interaction volume is 33.5103, so we expect 1.07531 interactions per particle, 2202.23 overall.

 Performance counter stats for './cloud 4096 100 1.e-4 100. 1. 50.3503 1.':

      11378.946088      task-clock (msec)         #    1.000 CPUs utilized          
                42      context-switches          #    0.004 K/sec                  
                 1      cpu-migrations            #    0.000 K/sec                  
               660      page-faults               #    0.058 K/sec                  
    37,480,302,368      cycles                    #    3.294 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    69,195,354,797      instructions              #    1.85  insns per cycle        
    10,503,186,584      branches                  #  923.037 M/sec                  
         8,877,825      branch-misses             #    0.08% of all branches        

      11.374951562 seconds time elapsed

**N_p = 32**
perf stat ./cloud 8192 25 1.e-4 100. 1. 63.4227 1.
[./cloud] NUM_POINTS=8192, NUM_STEPS=25, CHUNK_SIZE=25, DT=0.0001, K=100, D=1, L=63.4227, R=1
With 8192 particles of radius 1 and a box width of 63.422700, the volume fraction is 0.134507.
The interaction volume is 33.5103, so we expect 1.07605 interactions per particle, 4407.52 overall.

 Performance counter stats for './cloud 8192 25 1.e-4 100. 1. 63.4227 1.':

      11812.290255      task-clock (msec)         #    1.000 CPUs utilized          
                65      context-switches          #    0.006 K/sec                  
                11      cpu-migrations            #    0.001 K/sec                  
               741      page-faults               #    0.063 K/sec                  
    37,368,880,889      cycles                    #    3.164 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    68,978,118,423      instructions              #    1.85  insns per cycle        
    10,481,876,455      branches                  #  887.370 M/sec                  
         4,038,593      branch-misses             #    0.04% of all branches        

      11.809938282 seconds time elapsed
      

***Checkcloud Running Results***
./cloud 512 51000 1.e-4 100. 1. 25.198421 1. 1000 check | python3 check.py
Diffusion constant: $\color{red}{\text{0.87511503}}$

## ------------------------------------------My Sim Results-----------------------------------------------

### My Sim Running Time for N_p in 1 2 4 8 16 32 benchmarks results:

**N_p = 1**
perf stat ./cloud 256 25600 1.e-4 100. 1. 20 1.
[./cloud] NUM_POINTS=256, NUM_STEPS=25600, CHUNK_SIZE=25600, DT=0.0001, K=100, D=1, L=20, R=1
With 256 particles of radius 1 and a box width of 20.000000, the volume fraction is 0.134041.
The interaction volume is 33.5103, so we expect 1.07233 interactions per particle, 137.258 overall.

 Performance counter stats for './cloud 256 25600 1.e-4 100. 1. 20 1.':

      31784.406274      task-clock (msec)         #   13.949 CPUs utilized          
               219      context-switches          #    0.007 K/sec                  
                30      cpu-migrations            #    0.001 K/sec                  
             1,215      page-faults               #    0.038 K/sec                  
    92,097,754,887      cycles                    #    2.898 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    58,541,257,163      instructions              #    0.64  insns per cycle        
    12,874,122,692      branches                  #  405.045 M/sec                  
        43,474,275      branch-misses             #    0.34% of all branches        

       2.278661266 seconds time elapsed

**N_p = 2**
perf stat ./cloud 512 6400 1.e-4 100. 1. 25.1926 1.
[./cloud] NUM_POINTS=512, NUM_STEPS=6400, CHUNK_SIZE=6400, DT=0.0001, K=100, D=1, L=25.1926, R=1
With 512 particles of radius 1 and a box width of 25.192600, the volume fraction is 0.134134.
The interaction volume is 33.5103, so we expect 1.07307 interactions per particle, 274.707 overall.

 Performance counter stats for './cloud 512 6400 1.e-4 100. 1. 25.1926 1.':

      20912.348196      task-clock (msec)         #   13.916 CPUs utilized          
               191      context-switches          #    0.009 K/sec                  
                28      cpu-migrations            #    0.001 K/sec                  
             1,213      page-faults               #    0.058 K/sec                  
    60,593,358,225      cycles                    #    2.897 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    31,500,583,732      instructions              #    0.52  insns per cycle        
     6,687,590,741      branches                  #  319.791 M/sec                  
        38,127,702      branch-misses             #    0.57% of all branches        

       1.502749218 seconds time elapsed

**N_p = 4**
perf stat ./cloud 1024 1600 1.e-4 100. 1. 31.7334 1.
[./cloud] NUM_POINTS=1024, NUM_STEPS=1600, CHUNK_SIZE=1600, DT=0.0001, K=100, D=1, L=31.7334, R=1
With 1024 particles of radius 1 and a box width of 31.733400, the volume fraction is 0.134227.
The interaction volume is 33.5103, so we expect 1.07381 interactions per particle, 549.792 overall.

 Performance counter stats for './cloud 1024 1600 1.e-4 100. 1. 31.7334 1.':

       6753.714915      task-clock (msec)         #   13.741 CPUs utilized          
               168      context-switches          #    0.025 K/sec                  
                25      cpu-migrations            #    0.004 K/sec                  
             1,232      page-faults               #    0.182 K/sec                  
    19,565,521,246      cycles                    #    2.897 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    17,146,628,921      instructions              #    0.88  insns per cycle        
     3,298,032,857      branches                  #  488.329 M/sec                  
        22,155,852      branch-misses             #    0.67% of all branches        

       0.491489405 seconds time elapsed

**N_p = 8**
perf stat ./cloud 2048 400 1.e-4 100. 1. 39.9723 1.
[./cloud] NUM_POINTS=2048, NUM_STEPS=400, CHUNK_SIZE=400, DT=0.0001, K=100, D=1, L=39.9723, R=1
With 2048 particles of radius 1 and a box width of 39.972300, the volume fraction is 0.13432.
The interaction volume is 33.5103, so we expect 1.07456 interactions per particle, 1100.35 overall.

 Performance counter stats for './cloud 2048 400 1.e-4 100. 1. 39.9723 1.':

       3689.260087      task-clock (msec)         #   13.524 CPUs utilized          
               158      context-switches          #    0.043 K/sec                  
                26      cpu-migrations            #    0.007 K/sec                  
             1,285      page-faults               #    0.348 K/sec                  
    10,685,955,434      cycles                    #    2.897 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    13,111,959,987      instructions              #    1.23  insns per cycle        
     2,407,775,871      branches                  #  652.645 M/sec                  
        11,916,528      branch-misses             #    0.49% of all branches        

       0.272796408 seconds time elapsed

**N_p = 16**
perf stat ./cloud 4096 100 1.e-4 100. 1. 50.3503 1.
[./cloud] NUM_POINTS=4096, NUM_STEPS=100, CHUNK_SIZE=100, DT=0.0001, K=100, D=1, L=50.3503, R=1
With 4096 particles of radius 1 and a box width of 50.350300, the volume fraction is 0.134413.
The interaction volume is 33.5103, so we expect 1.07531 interactions per particle, 2202.23 overall.

 Performance counter stats for './cloud 4096 100 1.e-4 100. 1. 50.3503 1.':

       2605.471459      task-clock (msec)         #   13.256 CPUs utilized          
               174      context-switches          #    0.067 K/sec                  
                24      cpu-migrations            #    0.009 K/sec                  
             1,307      page-faults               #    0.502 K/sec                  
     7,545,272,217      cycles                    #    2.896 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    10,867,292,516      instructions              #    1.44  insns per cycle        
     1,879,808,932      branches                  #  721.485 M/sec                  
         7,777,019      branch-misses             #    0.41% of all branches        

       0.196545814 seconds time elapsed


**N_p = 32**
perf stat ./cloud 8192 25 1.e-4 100. 1. 63.4227 1.
[./cloud] NUM_POINTS=8192, NUM_STEPS=25, CHUNK_SIZE=25, DT=0.0001, K=100, D=1, L=63.4227, R=1
With 8192 particles of radius 1 and a box width of 63.422700, the volume fraction is 0.134507.
The interaction volume is 33.5103, so we expect 1.07605 interactions per particle, 4407.52 overall.

 Performance counter stats for './cloud 8192 25 1.e-4 100. 1. 63.4227 1.':

       2064.510807      task-clock (msec)         #   13.213 CPUs utilized          
               156      context-switches          #    0.076 K/sec                  
                22      cpu-migrations            #    0.011 K/sec                  
             1,396      page-faults               #    0.676 K/sec                  
     5,977,566,463      cycles                    #    2.895 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
     9,742,087,488      instructions              #    1.63  insns per cycle        
     1,602,216,101      branches                  #  776.075 M/sec                  
         4,027,578      branch-misses             #    0.25% of all branches        

       0.156242913 seconds time elapsed

***Checkcloud Running Results***
./cloud 512 51000 1.e-4 100. 1. 25.198421 1. 1000 check | python3 check.py
Diffusion constant: $\color{red}{\text{0.87511503}}$. $\color{blue}{\text{This matches the original output, which validates the correctness of the code.}}$

## ---------------------------------------Results Comparison--------------------------------------------

| Trials | Original Version Benchmarks | Modified Version Benchmarks | Improvement |
| --- | --- | --- | --- |
| N_p = 1 | 14.024503229 s | 2.278661266 s | $\color{green}{\text{6.15X}}$ |
| N_p = 2 | 12.524094910 s | 1.502749218 s | $\color{green}{\text{8.33X}}$ |
| N_p = 4 | 11.867447082 s | 0.491489405 s | $\color{green}{\text{24.15X}}$ |
| N_p = 8 | 11.531132050 s | 0.272796408 s | $\color{green}{\text{42.27X}}$ |
| N_p = 16 | 11.374951562 s | 0.196545814 s | $\color{green}{\text{57.87X}}$ |
| N_p = 32 | 11.809938282 s | 0.156242913 s | $\color{green}{\text{75.59X}}$ |
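
The Improvement column appears to be the ratio of the two elapsed times in each row. Assuming that reading, with $T_{\text{orig}}$ and $T_{\text{mod}}$ as shorthand (not names from the report) for the original and modified elapsed times, the first and last rows work out as:

$$
S(N_p) = \frac{T_{\text{orig}}(N_p)}{T_{\text{mod}}(N_p)}, \qquad
S(1) = \frac{14.024503229}{2.278661266} \approx 6.15, \qquad
S(32) = \frac{11.809938282}{0.156242913} \approx 75.59
$$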

**The modifications satisfy the requirement of a significant improvement in the benchmarks.**

## ----------------------------------------End of the Report-----------------------------------------------

## Advice

- **My experience in the past:** Detailed git histories are correlated with better performance!
- **Understand your code before you try to change it:**
    - In addition to profiling utilities, it might be useful to add timers to
      individual routines.  The division of the program into objects that control
      different aspects of the code should make it easy to, say, add a timer
      in one place without changing the whole program.
- **Simple problem parameters that can be changed:**
    - The number of boxes per dimension
    - The layout of vectors (array-of-structures or structure-of-arrays? see `vector.h`)
    - The data structures used to assign particles to boxes (is a linked list really best?)
- **Avoid memory and other resource contention:**
    - Anytime multiple threads are trying to write to locations close to each
      other, it makes it difficult and expensive to make sure each thread has
      an up-to-date copy of the memory that is changing.  This would happen,
      for example, if many threads are writing to the `pairs` list in
      the interactions routine.  Consider allocating a separate workspace for
      each thread by, for example, giving each thread its own `pairs` array.
      Then, once all threads are done computing their pairs, you can combine
      the separate arrays into one array, or even change the interface of the
      `interactions()` function so that multiple lists are returned.
- **Find ways to avoid recomputing from scratch:**
    - Can you use the layout of the particles from the last time step to help you
      bin or find pairs in the next time step?
- **You get to choose how many threads we use to evaluate your code:**
    - There's nothing inherently wrong with achieving your best performance
      using fewer than the maximum number of threads available on a node.  The
      problem may simply not have enough concurrency to support every thread.
- **Read through these performance slides for ideas:** [From Archer](https://www.archer.ac.uk/training/course-material/2015/12/ShMem_OpenMP_York/Slides/L09-performance.pdf)
- **Reread the molecular dynamics notes from Prof. Chow to make sure you understand what we're trying to accomplish:** [Molecular dynamics and cell lists](https://www.cc.gatech.edu/%7Eechow/ipcc/hpc-course/05_celllist.pdf)