# Requirements

In [1]:
import math
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.compiler import SourceModule

We determine $\pi$ as the ratio between a circle of radius 1 and the square that circumscribes it. The area of the circle will be approximated by the number of randomly selected points that fall into it, compared to the (larger) number of points that fall into the square.

If we choose $x$ and $y$ independently from a uniform distribution $[0, 1[$, then $(x, y)$ represents a point and lies in a circle with radius 1 if $x^2 + y^2 \le 1$.  Since this is only one quarter of the circle and circumscribing square, we get $\pi$ by dividing the number of points in the circle by the total number of points, and multiplying by 4.

# Implementation

To generate random numbers for each thread on the GPU, we use the curand library, which is a C++ library.  Since we use plain C for our CUDA code, we have to make sure that the header file is read ouside of an extermal C block.  The `SourceModule` add this automatically by default, so we make sure this isn't done by specifying the appropriate option.  In the source code, we implement our kernel in an external C blcok.

The random number generator is intialized using the `curand_init` function that takes a seed as its first argument.  We ensure that it is unique for each thread by adding a thread-specific constant to the clock time.

Random numbers are sampled from a uniform distribution using the `curand_uniform` function.

In [48]:
source_code = '''
    #include <curand_kernel.h>
    
    typedef unsigned long long cu_long;
    
    extern "C" {
        __global__ void estimate_pi(cu_long nr_tries, cu_long *nr_hits) {
            curandState rand_state;
            int thread_id = blockIdx.x*blockDim.x + threadIdx.x;
            curand_init((cu_long) clock() + (cu_long) thread_id,
                        (cu_long) 0, (cu_long) 0, &rand_state);
            float x, y;
            for (cu_long i = 0; i < nr_tries; ++i) {
                x = curand_uniform(&rand_state);
                y = curand_uniform(&rand_state);
                if (x*x + y*y < 1.0f) {
                    nr_hits[thread_id]++;
                }
            }
        }
    }
'''

kernels = SourceModule(no_extern_c=True, source=source_code)
pi_kernel = kernels.get_function('estimate_pi')

Set the number of threads per block and the number of blocks per grid, and create an array of the appropriate size to store the counts for each thread.  Also specify the number of points to try

In [44]:
threads_per_block = 32
blocks_per_grid = 512
total_threads = threads_per_block*blocks_per_grid
nr_hits = gpuarray.zeros((total_threads, ), dtype=np.uint64)
nr_tries = np.uint64(2**24)

Now we can execute the kernel and compute the value of $\pi$.

In [45]:
pi_kernel(nr_tries, nr_hits, grid=(blocks_per_grid, 1, 1), block=(threads_per_block, 1, 1))

In [46]:
pi_computed = 4.0*np.sum(nr_hits.get())/(nr_tries*total_threads)

Checking the accuracy as comared with $\pi$'s true value shows that it is correct upto a millionth.

In [47]:
for tol in np.logspace(-1, -12, num=12):
    print(f'{tol:.1e} {math.isclose(pi_computed, math.pi, rel_tol=tol)}')

1.0e-01 True
1.0e-02 True
1.0e-03 True
1.0e-04 True
1.0e-05 True
1.0e-06 True
1.0e-07 False
1.0e-08 False
1.0e-09 False
1.0e-10 False
1.0e-11 False
1.0e-12 False
