In [1]:
from numba import cuda
import numpy as np
from timeit import default_timer as timer

In [62]:
# How much device memory does the kernel use?
N = 1024  # number of elements
blockspergrid = 1  # blocks * threads must be >= N: each thread writes one element, so uncovered elements are never computed
threadsperblock = 1024  # 1024 is the per-block thread limit on this GPU

# The commented-out signature declares 2-D float32 arrays, but the data below is 1-D float64, so it does not match
@cuda.jit(max_registers=1)#('void(float32[:,:], float32[:,:], float32[:,:])') # Assign types explicitly
def kernel_1(a,b,out):
    x = cuda.grid(1)
    if x < out.shape[0]:  # bounds guard: valid indices are 0 .. shape[0] - 1
        out[x] = a[x] + b[x]

a = cuda.to_device((np.random.rand(N)).astype(np.float64))
b = cuda.to_device(3*np.ones(N).astype(np.float64))
out = cuda.device_array_like(a) # Empty array

start = timer()
kernel_1[blockspergrid, threadsperblock](a, b, out)  # the first call also triggers JIT compilation; the launch itself is asynchronous
t_ = timer() - start

print("Time consumed %f s" % t_)
print(a.copy_to_host()[1023])
print(b.copy_to_host()[1023])
print(out.copy_to_host()[1023])
print(out.copy_to_host().shape)

Time consumed 0.504155 s
0.12010806372454963
3.0
3.1201080637245497
(1024,)
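
The comment in the cell above asks why the result is incomplete when blocks x threads is smaller than the array length. Each thread computes exactly one element, so elements with no thread assigned are never written. A minimal sketch of the usual ceiling-division sizing, in plain Python so it runs without a GPU (the helper name `blocks_needed` is hypothetical, not part of Numba):

```python
def blocks_needed(n_elements, threads_per_block):
    """Smallest number of blocks such that blocks * threads_per_block >= n_elements."""
    return (n_elements + threads_per_block - 1) // threads_per_block

for n in (1, 1024, 1025, 10_000):
    blocks = blocks_needed(n, 1024)
    print(n, "elements ->", blocks, "block(s),", blocks * 1024, "threads")
```

When the array length is not a multiple of the block size, the last block contains surplus threads, which is why the kernel needs a bounds check on `x`.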


In [63]:
sample = np.random.rand(1)
print(sample.nbytes, type(sample[0]))

8 <class 'numpy.float64'>
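
Since each float64 element takes 8 bytes, the device memory needed by the three arrays of the first kernel can be estimated directly. This is only a sketch of the array payload; it assumes nothing else is counted (context overhead, code, and alignment padding are ignored).

```python
import struct

N = 1024
bytes_per_float64 = struct.calcsize("d")  # size of a C double, 8 bytes
n_arrays = 3                              # a, b and out

total_bytes = n_arrays * N * bytes_per_float64
print(total_bytes, "bytes =", total_bytes / 1024, "KiB")
```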


In [102]:
# How much device memory does the kernel use?
N = 1
blockspergrid = 1
threadsperblock = 2

# No explicit signature is given, so Numba infers the argument types on the first call
@cuda.jit#('void(float32[:,:], float32[:,:], float32[:,:])') # Assign types explicitly
def kernel_1(a,b,out):
    x = cuda.grid(1)
    if x < 1:  # only thread 0 does the work, looping over every element serially
        for j in range(out.shape[0]):
            out[j] = a[j] + b[j]
        
a = cuda.to_device(3*np.ones((N), dtype=np.float64))
b = cuda.to_device(3*np.ones((N),dtype=np.float64))
out = cuda.device_array_like(a) # Empty array

start = timer()
kernel_1[blockspergrid, threadsperblock](a, b, out)
t_ = timer() - start

print("Time consumed %f s" % t_)
print(a.copy_to_host())
print(b.copy_to_host())
print(out.copy_to_host())
print(out.copy_to_host().shape)

CudaAPIError: [201] Call to cuMemAlloc results in CUDA_ERROR_INVALID_CONTEXT

In [242]:
print(np.array([1.]).dtype)
print(np.arange(10).nbytes)

float64
40


# Memory Hierarchy

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses).

Shared memory is useful when threads within a block need to reuse data that has already been loaded or computed from global memory. Instead of reading it from global memory again, a thread stores it in shared memory, where the other threads of the same block can see and reuse it.

Global memory: 4 GB on the GTX 1050 Ti used here (`total=4294967296` bytes in the memory info below).

`cudaMalloc` always allocates global memory

Each `double` variable and each `long long` variable uses two registers.

If an SM can provide 10 registers per thread at full occupancy and each thread uses 11 registers, the number of threads that can be resident on the SM is reduced.
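
The effect of per-thread register use on the number of resident threads can be estimated with simple arithmetic. The sketch below assumes the limits published for compute capability 6.1 (65,536 32-bit registers and at most 2,048 resident threads per SM) and ignores register allocation granularity, so the numbers are a rough upper bound. The function name is hypothetical.

```python
REGISTERS_PER_SM = 65536    # assumed: 32-bit registers per SM (compute capability 6.1)
MAX_THREADS_PER_SM = 2048   # assumed: resident-thread limit per SM (compute capability 6.1)

def max_resident_threads(registers_per_thread):
    """Upper bound on resident threads per SM for a given register use per thread."""
    limited_by_registers = REGISTERS_PER_SM // registers_per_thread
    return min(MAX_THREADS_PER_SM, limited_by_registers)

for regs in (16, 32, 33, 64, 128):
    print(regs, "registers/thread ->", max_resident_threads(regs), "threads/SM")
```

Under these assumptions, going from 32 to 33 registers per thread is already enough to push the SM below its thread limit.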

Local memory is private to each thread, but it is not a separate physical memory: it is kept in global memory.

All variables that cannot be kept in registers, because there are not enough of them, go to local memory, which is part of global device memory and has much higher latency than registers. This is called register spilling.

It's very important to try to keep all variables in registers. The impact of register spilling is often underestimated by new CUDA developers. I ran some tests in which I artificially doubled the amount of memory used by each thread, causing register spilling without any additional computation, and the computation time increased 5 times! In small CUDA applications the number of registers is usually sufficient. You can find out how many variables go to local memory by following the instructions in the pdf above.

# Some info

In [70]:
cuda.current_context().get_memory_info()

_MemoryInfo(free=3530332569, total=4294967296)

In [71]:
cuda.detect()

Found 1 CUDA devices
id 0    b'GeForce GTX 1050 Ti'                              [SUPPORTED]
                      compute capability: 6.1
                           pci device id: 0
                              pci bus id: 1
Summary:
	1/1 devices are supported


True

In [88]:
cuda.gpus

<numba.cuda.cudadrv.devices._DeviceList at 0x21750827898>

In [86]:
cuda.current_context()


<CUDA context c_void_p(2299135068240) of device 0>

In [90]:
cuda.cudadrv.driver.Context #?

numba.cuda.cudadrv.driver.Context

In [98]:
cuda.cudadrv.driver.Device(0).name#.compute_capability

b'GeForce GTX 1050 Ti'

In [99]:
cuda.cudadrv.driver.Device(0).compute_capability

(6, 1)

# Float type review

Clarifying images

<img src="img/image032.jpg" width="480" height="480">

<img src="img/image034.jpg" width="480" height="480">

See the picture above.

Algorithm to get from 6 to -6 (two's complement negation):

It is enough to copy the original bit pattern from right to left until the first 1 has been copied, and then replace each of the remaining bits with its complement.
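
The shortcut described above can be sketched on bit strings as follows. The 8-bit width and the function name are assumptions made for illustration.

```python
def negate_twos_complement(bits):
    """Negate a two's-complement bit string: keep everything from the
    rightmost 1 to the end, complement every bit to its left."""
    last_one = bits.rfind("1")
    if last_one == -1:
        return bits  # zero is its own negation
    head, tail = bits[:last_one], bits[last_one:]
    flipped = "".join("1" if bit == "0" else "0" for bit in head)
    return flipped + tail

six = format(6, "08b")                   # '00000110'
minus_six = negate_twos_complement(six)  # '11111010'
print(six, "->", minus_six)
```

Applying the function twice returns the original pattern, which is a quick sanity check.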

<img src="img/image036.jpg" width="480" height="480">

<b>IEEE 754</b>:
       
    – Single-precision numbers include a 1-bit sign, an 8-bit exponent field, and a 23-bit fraction, for a total of 32 bits.

$$\pm \text{mantissa} \times 2^{\text{exp}}$$

<img src="img/norm_repr.png" width="480" height="480">

<img src="img/mant.png" width="480" height="480">

There are special cases that require their own encodings:

    – Infinities (overflow, or a nonzero number divided by zero)
    – NaN (invalid operations such as 0/0)
    For example:
        – Single precision: 8 bits in e → 256 codes;
         11111111 is reserved for special cases → 255 codes;
         one code (00000000) is reserved for zero and subnormal numbers → 254 codes;
         these must cover both positive and negative exponents → actual exponents
         from -126 to +127

The exp field represents the exponent as a biased number.

– It contains the actual exponent plus 127 for single precision, so the actual exponent is the field value minus 127.

Example:
    
    exp = 4 => 4 - 127 = -123

    exp = 93 => 93 - 127 = -34

<img src="img/bin_to_dec.jpg" width="480" height="480">

Because the mantissa actually has an implicit leading 1 (it looks like 1.11000000000...), we use (1 + 0.75).
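
To check the field layout, the bias of 127, and the implicit leading 1 on a concrete number, the bits of a float32 can be unpacked with the standard `struct` module. This is a minimal sketch for normalized values only; the function names are hypothetical, and -3.5 is an arbitrary test value chosen because its mantissa is also 1.75.

```python
import struct

def decode_float32(value):
    """Split a number, stored as IEEE 754 single precision, into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exp_field = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits, without the implicit leading 1
    return sign, exp_field, fraction

def rebuild_normal(sign, exp_field, fraction):
    """Rebuild a normalized value: (-1)^sign * (1 + fraction / 2^23) * 2^(exp_field - 127)."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exp_field - 127)

sign, exp_field, fraction = decode_float32(-3.5)
print(sign, exp_field, exp_field - 127, fraction / 2**23)
print(rebuild_normal(sign, exp_field, fraction))
```

For -3.5 the stored exponent field is 128, the actual exponent is 1, and the fraction is 0.75, so the value is -(1 + 0.75) * 2^1.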

###### Special values

<img src="img/Special.png" width="240" height="240">

###### Range

<img src="img/Range.jpg" width="360" height="360">

Issues

<img src="img/Roundoff.jpg" width="360" height="360">

In [224]:
print(np.array([33554431], dtype=np.float32)) # some integers can't be represented exactly by float32
print(np.array([33554431], dtype=np.int32))

[33554432.]
[33554431]
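
The reason is that a float32 has only 24 significant bits (23 stored plus the implicit leading 1), so above 2^24 not every integer has its own encoding. A small sketch using the standard `struct` module to round to single precision (the helper `to_float32` is hypothetical):

```python
import struct

def to_float32(value):
    """Round a number to the nearest float32 and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

for n in (2**24 - 1, 2**24, 2**24 + 1, 2**25 - 1):
    print(n, "->", int(to_float32(n)))
```

Here 2^24 + 1 is stored as 2^24, and 2^25 - 1 = 33554431 is stored as 33554432, matching the cell above.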


In [226]:
print(np.array([.10], dtype=np.float32))
# Some simple decimal numbers cannot be represented exactly in binary to begin with
# 0.10 (base 10) = 0.0001100110011... (base 2)
print(np.array([0.1], dtype=np.float32)*np.array([3]))
print(1/3)

[0.1]
[0.3]
0.3333333333333333
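
The printed `[0.1]` and `[0.3]` look exact only because the output is shortened. The value that is actually stored can be inspected with the standard `decimal` and `fractions` modules. The sketch below uses Python's built-in float (double precision); the same effect occurs in float32 with fewer bits.

```python
from decimal import Decimal
from fractions import Fraction

exact_decimal = Decimal(0.1)    # exact decimal expansion of the stored binary value
exact_fraction = Fraction(0.1)  # the same value as a ratio of integers

print(exact_decimal)
print(exact_fraction)
print(0.1 * 3 == 0.3)
```

The denominator of the fraction is a power of two, which is why 1/10 can only be approximated.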


###### Example

To get a feel for floating-point operations, we’ll do an addition example.

    – To keep it simple, we’ll use base-10 scientific notation.
    – Assume the mantissa has four digits and the exponent has one digit.

99.99 + 0.161 = 100.151

When normalized:
        
        9.999 * 10^1
        1.610 * 10^-1

To add them, rewrite the operand with the smaller exponent so that both exponents match: 1.610 * 10^-1 = 0.01610 * 10^1.

Kept to four mantissa digits, this becomes

    0.016

<b>This can result in a loss of least significant digits (the rightmost 1 in this case). But rewriting the number with the larger exponent instead could result in a loss of the most significant digits, which is much worse.</b>

    9.999 * 10^1 = 999.9 * 10^-1, whose leading digits do not fit in a d.ddd mantissa

<img src="img/Add.jpg" width="360" height="360">

<img src="img/Steps35.jpg" width="360" height="360">
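
The four-digit decimal addition can be reproduced with the standard `decimal` module by limiting the precision. Note one difference from the manual steps: `decimal` rounds the exact sum once instead of rounding the shifted operand first, although the final result is the same for these inputs.

```python
from decimal import Decimal, localcontext

a = Decimal("99.99")   # 9.999 * 10^1
b = Decimal("0.161")   # 1.610 * 10^-1

with localcontext() as ctx:
    ctx.prec = 4        # four significant digits, as in the example
    rounded = a + b

with localcontext() as ctx:
    ctx.prec = 10       # enough digits to hold the exact sum
    exact = a + b

print("exact:", exact, "four digits:", rounded, "error:", exact - rounded)
```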

Another example

<img src="img/Example_.jpg" width="360" height="360">

<img src="img/Epsilon.jpg" >

In [144]:
a = np.array([1.0000_000596], dtype=np.float32)
b = np.array([1.0000_0000], dtype=np.float32)
print(a,a+b)
a==b

[1.] [2.]


array([ True])

In [145]:
a = np.array([1.0000_000597], dtype=np.float32)
b = np.array([1.0000_000000], dtype=np.float32)
a==b

array([False])

In [160]:
a = np.array([1.0000_001], dtype=np.float32)
b = np.array([1.0000_001], dtype=np.float32)
a+b

array([2.0000002], dtype=float32)

In [156]:
a = np.array([1.0000_0001], dtype=np.float32)
b = np.array([1.0000_0010], dtype=np.float32)
a+b

array([2.], dtype=float32)
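
The thresholds seen in the cells above follow from the float32 machine epsilon, 2^-23 (about 1.19e-07): a value closer to 1.0 than half of that, 2^-24 (about 5.96e-08), is rounded back to 1.0. A minimal sketch with the standard `struct` module, where `to_float32` is a hypothetical helper that rounds to single precision:

```python
import struct

def to_float32(value):
    """Round a number to the nearest float32 and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

eps32 = 2.0 ** -23     # gap between 1.0 and the next float32
half_ulp = eps32 / 2   # 2**-24, the rounding threshold near 1.0

print(eps32, half_ulp)
print(to_float32(1.0 + 5.96e-8) == 1.0)          # just below the threshold
print(to_float32(1.0 + 5.97e-8) == 1.0 + eps32)  # just above the threshold
```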

In [206]:
# Double
a = np.array([1.00000_00000_00000_1], dtype=np.float64)
b = np.array([1.00000_00000_00000_0], dtype=np.float64)
print(a==b)

[ True]


In [207]:
# Double
a = np.array([1.00000_00000_00003], dtype=np.float64)
b = np.array([1.00000_00000_00001], dtype=np.float64)
print("%.15f" % (a+b)[0])

2.000000000000004


In [222]:
# Double
a = np.array([4.00000_00000_00000], dtype=np.float64)
b = np.array([0.00000_00000_00002], dtype=np.float64)
print("%.15f" % (a*b)[0])

0.000000000000008


In [232]:
0.1+0.2

0.30000000000000004
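
For double precision the same reasoning applies with an epsilon of 2^-52, which explains why 1.0000000000000001 compared equal to 1.0 above. Because of results like `0.1 + 0.2`, floats are usually compared with a tolerance rather than with `==`. A short sketch using only the standard library:

```python
import math
import sys

eps64 = sys.float_info.epsilon      # gap between 1.0 and the next double

print(eps64 == 2.0 ** -52)
print(1.0 + eps64 / 2 == 1.0)       # half a gap is rounded away
print(0.1 + 0.2 == 0.3)             # exact comparison fails
print(math.isclose(0.1 + 0.2, 0.3)) # comparison with a relative tolerance
```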