
Benchmark #8

Open
gha3mi opened this issue Jan 17, 2024 · 17 comments

Comments

@gha3mi commented Jan 17, 2024

Hi,

I'm currently working on the ForBenchmark project and have generated some results for the dot_product here. If you are interested, you can add your dot_product implementation to this benchmark. Fpm makes it easy to include it as a dependency, and a Python script will generate the results.

Best,
Ali

@jalvesz (Owner) commented Jan 18, 2024

Brilliant! Your project looks awesome, I'll take a look at that!

Thanks for sharing it!

@gha3mi (Author) commented Jan 19, 2024

Thanks! I've been thinking about a tool and a place to test Fortran fpm packages, not with the intention of competing, but with the aim of improving the packages.

@jalvesz (Owner) commented Jan 19, 2024

That's a very good initiative. How are you thinking about proceeding? Would you like PRs to centralize the benchmarks and try to have them published with a GitHub Action?

I forked your project to try it out. I managed to get results with gfortran, but I'm hitting a few dependency issues with ifort and ifx.

I thought that having a companion sphinx-gallery would be a good way of having the plots neatly organized.

@gha3mi (Author) commented Jan 19, 2024

> That's a very good initiative. How are you thinking about proceeding? Would you like PRs to centralize the benchmarks and try to have them published with a GitHub Action?

Yes, exactly. I think this may be the easiest way to get the results.

> I forked your project to try it out. I managed to get results with gfortran, but I'm hitting a few dependency issues with ifort and ifx.

Each benchmark has an index. In your test, I noticed that the last one, for fprod_kahan, has the same index, 6, as fprod:

```fortran
call bench%start_benchmark(7, 'kahan', "a = fprod_kahan(u, v)", [p]) ! here 6 -> 7
```

Here are the flags I used for each compiler: fpm.rsp. I used LAPACK and BLAS. I also ran a dot benchmarking test using GitHub Actions here. The last step, creating a pull request, fails and needs some work; however, the benchmarks with gfortran, ifort, ifx, and nvfortran work. Could it be an issue with LAPACK and BLAS in your case?

> I thought that having a companion sphinx-gallery would be a good way of having the plots neatly organized.

It looks great. However, I am not familiar with it. I will take a look at it. If you could provide it, that would be great.

@jalvesz (Owner) commented Jan 19, 2024

> `! here 6 -> 7`

Oops, that was a typo; fixed.

Yes, I saw the dependencies and managed to install BLAS/LAPACK for running with gfortran. But for the Intel compilers I had to add a bunch of link options (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html), and I still have other compile errors and no time to actually solve them :s ... I'll continue later and take another look... Also, I was thinking that the benchmarks would be more interesting with `-O3` instead of `-Ofast`, which even compiler developers do not recommend: https://fortran-lang.discourse.group/t/is-ofast-in-gfortran-or-fast-flag-in-intel-fortran-safe-to-use/2755/4

Regarding setting up a sphinx gallery connected to a project, here they have an example and describe how to connect it with a source project.

For inspiration, I always look at PyVista: they have a GitHub repo with all the sources and a secondary repo that is fed automatically (https://github.com/pyvista/pyvista/tree/main/doc). Something like this should work with a ForBenchmark-doc repo :)

@gha3mi (Author) commented Jan 19, 2024

> Yes, I saw the dependencies and managed to install BLAS/LAPACK for running with gfortran. But for the Intel compilers I had to add a bunch of link options (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html), and I still have other compile errors and no time to actually solve them :s ... I'll continue later and take another look... Also, I was thinking that the benchmarks would be more interesting with `-O3` instead of `-Ofast`, which even compiler developers do not recommend: https://fortran-lang.discourse.group/t/is-ofast-in-gfortran-or-fast-flag-in-intel-fortran-safe-to-use/2755/4

You can also use `-qmkl` instead of `-llapack` and `-lblas`. By the way, you are right; I will replace `-Ofast` with `-O3`.

> Regarding setting up a sphinx gallery connected to a project, here they have an example and describe how to connect it with a source project.

> For inspiration, I always look at PyVista: they have a GitHub repo with all the sources and a secondary repo that is fed automatically (https://github.com/pyvista/pyvista/tree/main/doc). Something like this should work with a ForBenchmark-doc repo :)

Alright, I will take a look at it.

Thank you! If you find the time, you can send a pull request for the dot_product or any other implementations.

@jalvesz (Owner) commented Jan 19, 2024

Perfect! If you get started with that, here are a few dependencies that I use for Sphinx projects:

```shell
pip install numpydoc pydata-sphinx-theme sphinxcontrib-bibtex jupyter_sphinx sphinx_panels pythreejs
```

- `numpydoc` for the documentation style within the Python scripts
- `pydata-sphinx-theme` gives the white theme used by PyVista
- `sphinxcontrib-bibtex` enables adding a `.bib` file that can be used to cite references within the project, à la LaTeX
- `jupyter_sphinx`, `sphinx_panels`, `pythreejs` for integrating Jupyter-notebook-like content
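For reference, a minimal `conf.py` wiring these extensions together might look like the sketch below. The extension and theme names match the pip packages above; the project name, gallery paths, and bib file are placeholders, not something from either repo:

```python
# conf.py -- minimal Sphinx configuration sketch (placeholder values)
project = "ForBenchmark"            # placeholder project name

extensions = [
    "numpydoc",                     # NumPy-style docstring support
    "sphinx_gallery.gen_gallery",   # builds the example gallery
    "sphinxcontrib.bibtex",         # BibTeX citations, a la LaTeX
]

html_theme = "pydata_sphinx_theme"  # the white theme used by PyVista

bibtex_bibfiles = ["references.bib"]  # hypothetical .bib file

sphinx_gallery_conf = {
    "examples_dirs": "../examples",   # where the plotting scripts would live
    "gallery_dirs": "auto_examples",  # generated gallery output
}
```

A secondary `ForBenchmark-doc` repo could then build this configuration in CI, following the PyVista pattern mentioned above.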

@jalvesz (Owner) commented Jan 21, 2024

> You can also use `-qmkl` instead of `-llapack` and `-lblas`

So this worked; I had to comment out `link = ["lapack", "blas"]` in the fpm.toml.

A couple of questions: how do you measure the speedup? It seems like the ratio is inverted when I look at the plots and the values in the data. I didn't check where you compute it, but I would have expected something like speed-up = time_reference / time_new_method, so that a speed-up > 1 implies faster. But this is not what I saw with the dot products.

Is the reference value systematically the benchmark placed first?

@gha3mi (Author) commented Jan 21, 2024

> So this worked; I had to comment out `link = ["lapack", "blas"]` in the fpm.toml.

Perfect!

> A couple of questions: how do you measure the speedup? It seems like the ratio is inverted when I look at the plots and the values in the data. I would have expected something like speed-up = time_reference / time_new_method, so that a speed-up > 1 implies faster. But this is not what I saw with the dot products.

You can find it here: link to the code. Yes, you are right; the ratio is currently inverted. Thank you for pointing it out. Feel free to send a pull request (PR) if you have time; otherwise I will change it as soon as possible.
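Once inverted, the convention could look like the following (a Python sketch; `times` is a hypothetical dict of mean elapsed times per method, with the first entry taken as the reference, loosely mirroring the post-processing script):

```python
# Speedup relative to the first (reference) method:
# speedup > 1 means the method is faster than the reference.
times = {                    # hypothetical mean elapsed times [s]
    "dot_product": 1.0e-4,   # reference (first entry)
    "m1": 5.0e-5,            # twice as fast as the reference
    "kahan": 1.1e-4,         # slightly slower than the reference
}

reference = next(iter(times.values()))  # dicts preserve insertion order
speedup = {name: reference / t for name, t in times.items()}
print(speedup)
```

With these made-up numbers, `m1` gets a speedup of 2 and `kahan` falls below 1, which matches the expected reading of the plots.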

> Is the reference value systematically the benchmark placed first?

Yes, exactly. I tried to provide a demo example with some comments here: link to the demo.

@jalvesz (Owner) commented Jan 21, 2024

The results change quite a bit from one run to another; for instance, here with ifort and the flags `-O3 -mtune=native -xHost -qmkl -qopenmp -ipo -DINT64`, just two subsequent runs of the bench:

[figure: dot_ifort_speedup, two subsequent runs]

This is very nice and interesting; from a statistical point of view I think it is more than acceptable. I was just wondering, then, how the actual time of the function could be separated from the intermediate operations included to prevent excessive optimization. This extra time also changes the ratio, since ratio = (time_ref + C)/(time_i + C) is influenced by that constant C. Maybe an internal measurement in the loop should be done to capture those lines and subtract it from the time captured by the bench object?
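The bias is easy to see with made-up numbers (illustrative only, not benchmark data): a fixed per-iteration overhead C pulls the measured ratio toward 1 relative to the true speedup.

```python
# Effect of a constant overhead C on the measured speedup ratio.
# All numbers below are illustrative, not real measurements.
time_ref, time_i = 1.0e-4, 5.0e-5   # true times: method i is 2x faster
C = 5.0e-5                          # overhead of the anti-optimization lines

true_ratio = time_ref / time_i                  # ~2.0
measured_ratio = (time_ref + C) / (time_i + C)  # ~1.5, biased toward 1
print(true_ratio, measured_ratio)
```

The larger C is relative to the kernel time, the closer the measured ratio gets to 1, which is why subtracting it matters most for small problem sizes.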

Oh, I just saw that the label of the abscissa should be updated to the method name?

@gha3mi (Author) commented Jan 21, 2024

I am working on the speed-up plots. I will write to you again here.

@jalvesz (Owner) commented Jan 21, 2024

I'm wondering if something like this could help to have a clearer view:

```fortran
call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
time = 0._rk !> a variable defined as time(0:1)
do nl = 1,bench%nloops
  time(0) = time(0) + timer() !> timer() is a function pointer using the selected method
  u = u + real(nl,rk) ! to prevent the compiler from optimizing (loop-invariant)
  v = v + real(nl,rk) ! to prevent the compiler from optimizing (loop-invariant)
  time(1) = time(1) + timer()
  a = dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops, extract_time = time(1)-time(0)) !> an optional argument to extract
!! the time that is not associated with the function being benchmarked
```

?

@gha3mi (Author) commented Jan 22, 2024

I changed the speed-up plot to show all problem sizes:

[figure: dot_ifort_speedup]

I also tried plotting the weighted average speed-up; however, I'm not sure whether this provides valuable insight:

[figure: dot_ifort_speedup_avg]
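A size-weighted average of that kind could be computed as follows (a sketch with made-up numbers; weighting by problem size is one possible choice, not necessarily what the script does):

```python
# Weighted average speedup across problem sizes, weighting larger
# problems more heavily (illustrative data, not real measurements).
sizes = [100, 1_000, 10_000]   # hypothetical problem sizes
speedups = [1.2, 1.5, 1.8]     # hypothetical speedup of one method per size

weighted_avg = sum(s * w for s, w in zip(speedups, sizes)) / sum(sizes)
print(weighted_avg)
```

With this weighting, the average is dominated by the largest problem size, which may or may not be the behavior one wants to summarize.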

> The results change quite a bit from one run to another.

I think there are many factors involved, such as the temperature of the CPU, other processes running during benchmarking, different random numbers, ... However, I updated the code to use the same random numbers consistently.
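A minimal illustration of that fix (in Python with the standard library, purely illustrative; the actual benchmark presumably seeds Fortran's random number generator with fixed values):

```python
import random

def make_inputs(n, seed=42):
    """Generate the same u, v vectors on every run, so that timing
    differences come from the code, not from the data (illustrative)."""
    rng = random.Random(seed)  # dedicated generator with a fixed seed
    u = [rng.random() for _ in range(n)]
    v = [rng.random() for _ in range(n)]
    return u, v

u1, v1 = make_inputs(1000)
u2, v2 = make_inputs(1000)
# u1 == u2 and v1 == v2 on every call with the same seed
```

This removes one source of run-to-run variation, though CPU temperature and background load remain.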

> This extra time also changes the ratio, since ratio = (time_ref + C)/(time_i + C) is influenced by that constant C.

I noticed this before. But if you measure this time, you then need to account for the cost of the timer function itself! In my opinion, for large problem sizes this could simply be neglected. Or maybe measure it once outside the benchmarking object and then subtract it.
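Measuring the timer cost once up front could look like this (a Python sketch with `perf_counter` standing in for `cpu_time`; the function name is hypothetical, but the idea transfers directly to Fortran):

```python
import time

def timer_overhead(n=100_000):
    """Estimate the average cost of one timer call by timing
    back-to-back calls (calibration to run once, up front)."""
    start = time.perf_counter()
    for _ in range(n):
        time.perf_counter()
    return (time.perf_counter() - start) / n

overhead = timer_overhead()
print(f"approx. timer overhead: {overhead:.2e} s per call")
```

The estimate could then be subtracted from the per-loop times, instead of calling the timer extra times inside the benchmark loop.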

[edited:] Please check the latest results for the dot product generated by the GitHub Actions workflow: https://github.com/gha3mi/forbenchmark/tree/main/benchmarks/dot

@jalvesz (Owner) commented Jan 22, 2024

> [edited:] Please check the latest results for the dot product generated by the GitHub Actions workflow: https://github.com/gha3mi/forbenchmark/tree/main/benchmarks/dot

Excellent! These results are very interesting! I'll push a version as is, though locally I had to:

- remove `link = ["lapack", "blas"]` from the fpm.toml (when running with ifort and ifx)
- switch `-llapack -lblas` for `-qmkl` for both ifort and ifx in the fpm.rsp
- oh, and for inlining with ifx: `-flto=full` instead of `-ipo`
  (I haven't tested nvfortran yet, as I have to clean up my install)

@jalvesz (Owner) commented Jan 22, 2024

I tried something:

```fortran
time = 0._rk
call bench%start_benchmark(7,'kahan', "a = fprod_kahan(u,v)",[p])
do nl = 1,bench%nloops
  time(0) = timer()
  u = u + real(nl,rk) ! to prevent the compiler from optimizing (loop-invariant)
  v = v + real(nl,rk) ! to prevent the compiler from optimizing (loop-invariant)
  time(1) = time(1) + timer() - time(0)
  a = fprod_kahan(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, 'inner time: ', time(1)/bench%nloops

! ...

real(8) function timer() result(y)
  call cpu_time(y)
end function
```

And got results along the lines of:

```text
Meth.: kahan; Des.: a = fprod_kahan(u,v) ; Argi.:100000
 Elapsed time :     0.000060600 [s]
 Speedup      :  0.987 [-]
 Performance  :  1.650 [GFLOPS]

 inner time:  5.958200000000069E-005
```

So basically most of the time is actually spent in the two lines added to avoid the optimization, and the dot product itself is almost transparent!

Maybe it would be better to test with larger arrays, or to split the loop in a different way.
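One way to split the loop differently would be to time a baseline loop containing only the anti-optimization updates and subtract it (a Python sketch of the idea, with stand-in operations; not the actual Fortran bench):

```python
import time

def bench(fn, n):
    """Time n iterations of fn and return total elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        fn(i)
    return time.perf_counter() - start

n = 10_000
u = [1.0] * 100                      # stand-in data

def overhead_only(i):
    u[0] = u[0] + float(i)           # stand-in for the u/v updates

def overhead_plus_kernel(i):
    u[0] = u[0] + float(i)           # same updates...
    s = sum(x * x for x in u)        # ...plus a stand-in for the dot product

# Estimated kernel time = (updates + kernel) - (updates alone)
kernel_time = bench(overhead_plus_kernel, n) - bench(overhead_only, n)
print(f"estimated kernel time: {kernel_time / n:.3e} s per call")
```

This avoids calling the timer inside the loop at all, at the cost of assuming the two loops incur the same update overhead.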

@jalvesz (Owner) commented Jan 22, 2024

This might be more representative:

```fortran
a = 0._rk
do nl = 1,bench%nloops
  a = a + dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, a
```

The cumulative variable plus the print forces the compiler to actually perform the computation so that the correct value can be printed. I removed m2 as its curve was stagnating:
ifort: [figure: dot_ifort_speedup_avg]

ifx: [figure: dot_ifx_speedup_avg]

gfortran: [figure: dot_gfortran_speedup_avg]

@gha3mi (Author) commented Jan 22, 2024

Thanks! I merged your PR. Today was busy; I'll take a look at the last messages later.
