# Numba Typing Exercises

This notebook explains Numba's typing mechanisms and how to deal with some of the typing issues you may encounter. It covers:

* How to display the typing of functions and understand the output,
* Examination of different typings of the same function,
* How to understand and fix typing errors,
* Some CUDA-specific issues related to performance and occupancy.

The notebook as published in the Git repository includes all the output from a previous run, because some of the output (e.g. temporary variable names, register counts, etc.) may vary slightly with different versions of Numba or different CUDA toolkits. We suggest clearing all the output and working through the notebook, keeping the published version as a reference in case the output of Numba appears to differ from the description given in the text.

We'll begin by importing some required packages. We use the `@njit` decorator for CPU-targeted examples, which is shorthand for `@jit(nopython=True)`. Typing in nopython mode has stricter requirements than typing in object mode, and those requirements are what lead to better performance, so nopython mode is the better choice for learning about typing. The `@cuda.jit` decorator is used for the CUDA-targeted examples.

In [1]:
from numba import njit, cuda
import numpy as np

## Inspecting the typing

Throughout this notebook we will use `inspect_types()` extensively to inspect the results of the typing algorithm. We'll start with a very simple example:

In [2]:
@njit
def f(a, b):
    return a + b

Let's call this function with a pair of `float32`s to force a typing:

In [3]:
f(np.float32(1), np.float32(2))

3.0

Now we'll inspect the typing for this call:

In [4]:
f.inspect_types()

f (float32, float32)
--------------------------------------------------------------------------------
# File: <ipython-input-2-7136c8bb3c8c>
# --- LINE 1 --- 

@njit

# --- LINE 2 --- 

def f(a, b):

    # --- LINE 3 --- 
    # label 0
    #   a = arg(0, name=a)  :: float32
    #   b = arg(1, name=b)  :: float32
    #   $6binary_add.2 = a + b  :: float32
    #   del b
    #   del a
    #   $8return_value.3 = cast(value=$6binary_add.2)  :: float32
    #   del $6binary_add.2
    #   return $8return_value.3

    return a + b




The output of `inspect_types()` is a printout of the function's source code annotated with the Numba IR for each line, and the type of each IR node. Note that `del` nodes have no type, as they simply delete an existing variable.

Types are separated from IR with a double colon. In one example from above:

```
$8return_value.3 = cast(value=$6binary_add.2)  :: float32
```

the type of `$8return_value.3` (which is also the type returned by `cast(value=$6binary_add.2)`) is `float32`.

## An example with branching

When a variable takes a value from multiple different control flow paths (i.e. *branches*), a unification is needed to determine a type that is suitable for representing the types across all the different control flow paths. We can explore unification using a simple function with a branch in it:

In [5]:
@njit
def select(a, b, c):
    if c:
        ret = a
    else:
        ret = b
    return ret

We'll start by calling the function with `a` and `b` both as `float32` for a first example:

In [6]:
select(np.float32(1), np.float32(2), True)

1.0

If we inspect the typing we get:

In [7]:
select.inspect_types()

select (float32, float32, bool)
--------------------------------------------------------------------------------
# File: <ipython-input-5-edbb7033ccdc>
# --- LINE 1 --- 

@njit

# --- LINE 2 --- 

def select(a, b, c):

    # --- LINE 3 --- 
    # label 0
    #   a = arg(0, name=a)  :: float32
    #   b = arg(1, name=b)  :: float32
    #   c = arg(2, name=c)  :: bool
    #   branch c, 6, 12

    if c:

        # --- LINE 4 --- 
        # label 6
        #   del c
        #   del b
        #   ret = a  :: float32
        #   del a
        #   jump 16

        ret = a

    # --- LINE 5 --- 

    else:

        # --- LINE 6 --- 
        # label 12
        #   del c
        #   del a
        #   ret.1 = b  :: float32
        #   del b

        ret = b

    # --- LINE 7 --- 
    #   jump 16
    # label 16
    #   ret.2 = phi(incoming_values=[Var(ret.1, <ipython-input-5-edbb7033ccdc>:6), Var(ret, <ipython-input-5-edbb7033ccdc>:4)], incoming_blocks=[12, 6])  :: float32
    #   del ret.1
    # 

We see that where the value of a variable can come from two separate branches, there is a *phi node*: `ret.2 = phi(incoming_values=...)`. The `incoming_values` track the different sources of this variable - in this example, `ret` from the `if` side of the branch, and `ret.1` from the `else` side of the branch.

The type of the phi node (`float32` in this case) is the type resulting from unification of the types of all the incoming values.
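
For numeric scalars, the result of unification can be anticipated with NumPy's promotion rules. The sketch below is an analogy only: it assumes that Numba's unification of number types agrees with `np.promote_types`, and it does not call into Numba's type inference at all.

```python
import numpy as np

# Analogy only: NumPy's promotion rule for a pair of scalar dtypes.
same = np.promote_types(np.float32, np.float32)
mixed = np.promote_types(np.float32, np.float64)

print(same)   # float32
print(mixed)  # float64
```

Pairs with no numeric promotion, such as a tuple and a scalar, have no counterpart in this analogy.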

### Another typing of the branching function

If we call the function with a `float32` and a `float64`, we get another typing:

In [8]:
select(np.float32(1), np.float64(2), True)

1.0

Now let's inspect types again. This time, there will be two sets of typings - one for the `(float32, float32, boolean)` call earlier, and another for the `(float32, float64, boolean)` call in the previous cell. If we call `inspect_types()` with no arguments, it will print out the typings for all sets of argument types that have been seen so far. In order to focus on just the case we are interested in, we can pass the `signature` keyword argument with a tuple of Numba types to get the typing for a specific set of argument types. Numba types are imported from `numba` - for a comprehensive list of them, see [Types and signatures](http://numba.pydata.org/numba-doc/latest/reference/types.html#types-and-signatures) in the Numba documentation.

In [9]:
from numba import float32, float64, boolean
select.inspect_types(signature=(float32, float64, boolean))

select (float32, float64, bool)
--------------------------------------------------------------------------------
# File: <ipython-input-5-edbb7033ccdc>
# --- LINE 1 --- 

@njit

# --- LINE 2 --- 

def select(a, b, c):

    # --- LINE 3 --- 
    # label 0
    #   a = arg(0, name=a)  :: float32
    #   b = arg(1, name=b)  :: float64
    #   c = arg(2, name=c)  :: bool
    #   branch c, 6, 12

    if c:

        # --- LINE 4 --- 
        # label 6
        #   del c
        #   del b
        #   ret = a  :: float32
        #   del a
        #   jump 16

        ret = a

    # --- LINE 5 --- 

    else:

        # --- LINE 6 --- 
        # label 12
        #   del c
        #   del a
        #   ret.1 = b  :: float64
        #   del b

        ret = b

    # --- LINE 7 --- 
    #   jump 16
    # label 16
    #   ret.2 = phi(incoming_values=[Var(ret.1, <ipython-input-5-edbb7033ccdc>:6), Var(ret, <ipython-input-5-edbb7033ccdc>:4)], incoming_blocks=[12, 6])  :: float64
    #   del ret.1
    # 

Here we see the types from each branch (`ret = a  :: float32` and `ret.1 = b  :: float64`) have unified to `float64` at the phi node.

### Failing unification

Sometimes unification can fail. If we try to choose between a tuple and a scalar:

In [10]:
select((1, 2), 3.0, False)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot unify float64 and UniTuple(int64 x 2) for 'ret.2', defined at <ipython-input-5-edbb7033ccdc> (7)

File "<ipython-input-5-edbb7033ccdc>", line 7:
def select(a, b, c):
    <source elided>
        ret = b
    return ret
    ^

[1] During: typing of assignment at <ipython-input-5-edbb7033ccdc> (7)

File "<ipython-input-5-edbb7033ccdc>", line 7:
def select(a, b, c):
    <source elided>
        ret = b
    return ret
    ^


The typing fails at unification: `Cannot unify float64 and UniTuple(int64 x 2) for 'ret.2'`.

When a typing error occurs, we can debug the propagation of type information by setting the environment variable `NUMBA_DEBUG_TYPEINFER` to `1`, or setting `numba.config.DEBUG_TYPEINFER` to `True`. It helps to also dump the Numba IR to understand the results of propagation better, so we should also set `numba.config.DUMP_IR` to `True` (or use the corresponding environment variable `NUMBA_DUMP_IR`). The debug output won't appear in the Jupyter notebook, but we can get the output by re-running this example as an external script:

In [11]:
%%script python
from numba import njit
from numba import config


@njit
def select(a, b, c):
    if c:
        ret = a
    else:
        ret = b
    return ret

config.DEBUG_TYPEINFER = True
config.DUMP_IR=True
select((1, 2), 3.0, False)

--------------------------------IR DUMP: select---------------------------------
label 0:
    a = arg(0, name=a)                       ['a']
    b = arg(1, name=b)                       ['b']
    c = arg(2, name=c)                       ['c']
    branch c, 6, 12                          ['c']
label 6:
    ret = a                                  ['a', 'ret']
    jump 16                                  []
label 12:
    ret = b                                  ['b', 'ret']
    jump 16                                  []
label 16:
    $18return_value.1 = cast(value=ret)      ['$18return_value.1', 'ret']
    return $18return_value.1                 ['$18return_value.1']

--------------------------------IR DUMP: select---------------------------------
label 0:
    a = arg(0, name=a)                       ['a']
    b = arg(1, name=b)                       ['b']
    c = arg(2, name=c)                       ['c']
    branch c, 6, 12                          ['c']
label 6:
    ret = a         

Traceback (most recent call last):
  File "<stdin>", line 15, in <module>
  File "/home/gmarkall/numbadev/numba/numba/core/dispatcher.py", line 401, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/gmarkall/numbadev/numba/numba/core/dispatcher.py", line 344, in error_rewrite
    reraise(type(e), e, None)
  File "/home/gmarkall/numbadev/numba/numba/core/utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot unify float64 and UniTuple(int64 x 2) for 'ret.2', defined at <stdin> (11)

File "<stdin>", line 11:
<source missing, REPL/exec in use?>

[1] During: typing of assignment at <stdin> (11)

File "<stdin>", line 11:
<source missing, REPL/exec in use?>



CalledProcessError: Command 'b'from numba import njit\nfrom numba import config\n\n\n@njit\ndef select(a, b, c):\n    if c:\n        ret = a\n    else:\n        ret = b\n    return ret\n\nconfig.DEBUG_TYPEINFER = True\nconfig.DUMP_IR=True\nselect((1, 2), 3.0, False)\n'' returned non-zero exit status 1.

Numba has printed a dump of the type of all variables after each propagation step. The type inference happens on the IR in [Static Single Assignment (SSA) form](https://en.wikipedia.org/wiki/Static_single_assignment_form) so the names of variables after propagation carry a "version number" - e.g. `ret`, `ret.1`, `ret.2`, etc - however, the IR dump does not presently print the version numbers of each variable, so it can be a little tricky to work out which variable each versioned variable refers to.

The different versions of the variable make up the set that is being unified, so we can see that the variable `ret` has a set of `{UniTuple(int64 x 2), float64, float64}` from its versions `ret`, `ret.1` and `ret.2`.

A general strategy for debugging typing issues is to examine the changes in the types of variables at each propagate step, to determine how a typing error is occurring.

### Exercises

Execute the code in the following cell, and try to locate the typing of `x` in the output. Try to understand the message accompanying the `TypingError` (which begins with `Invalid use of Function(...`). You may find it easier to run this example on the terminal to avoid a lot of scrolling through a frame in the IPython notebook.

In [12]:
%%script python
from numba import njit
from numba import config
import numpy as np
config.DEBUG_TYPEINFER = True
config.DUMP_IR = True

@njit
def array_vs_scalar():
    x = np.zeros(20)
    x[0] = 10
    x[0, 1] = 20

array_vs_scalar()

----------------------------IR DUMP: array_vs_scalar----------------------------
label 0:
    $2load_global.0 = global(np: <module 'numpy' from '/home/gmarkall/miniconda3/envs/numba/lib/python3.8/site-packages/numpy/__init__.py'>) ['$2load_global.0']
    $4load_method.1 = getattr(value=$2load_global.0, attr=zeros) ['$2load_global.0', '$4load_method.1']
    $const6.2 = const(int, 20)               ['$const6.2']
    $8call_method.3 = call $4load_method.1($const6.2, func=$4load_method.1, args=[Var($const6.2, <stdin>:9)], kws=(), vararg=None) ['$4load_method.1', '$8call_method.3', '$const6.2']
    x = $8call_method.3                      ['$8call_method.3', 'x']
    $const12.4 = const(int, 10)              ['$const12.4']
    $const16.6 = const(int, 0)               ['$const16.6']
    x[$const16.6] = $const12.4               ['$const12.4', '$const16.6', 'x']
    $const20.7 = const(int, 20)              ['$const20.7']
    $const_0 = const(int, 0)                 ['$const_0']
    $const_1 = c

Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File "/home/gmarkall/numbadev/numba/numba/core/dispatcher.py", line 401, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/gmarkall/numbadev/numba/numba/core/dispatcher.py", line 344, in error_rewrite
    reraise(type(e), e, None)
  File "/home/gmarkall/numbadev/numba/numba/core/utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<built-in function setitem>) with argument(s) of type(s): (array(float64, 1d, C), Tuple(Literal[int](0), Literal[int](1)), Literal[int](20))
 * parameterized
In definition 0:
    All templates rejected with literals.
In definition 1:
    All templates rejected without literals.
In definition 2:
    All templates rejected with literals.
In definition 3:
    All templates rejected without literals.
In definition 4:
    All templates rejected with

CalledProcessError: Command 'b'from numba import njit\nfrom numba import config\nimport numpy as np\nconfig.DEBUG_TYPEINFER = True\nconfig.DUMP_IR = True\n\n@njit\ndef array_vs_scalar():\n    x = np.zeros(20)\n    x[0] = 10\n    x[0, 1] = 20\n\narray_vs_scalar()\n'' returned non-zero exit status 1.

The next example uses a function that is unsupported on the CUDA target. Numba tries to implement the array's `sum()` method using the `array_sum_impl` internal function, which you will see in the output. Try to determine which function is unsupported (in the message beginning with `Use of unsupported NumPy function...`) and locate the call to it in the IR for `array_sum_impl`.

In [13]:
%%script python
from numba import cuda
from numba import config
import numpy as np
config.DEBUG_TYPEINFER = True
config.DUMP_IR = True

@cuda.jit
def sum_reduce(x):
    x[0] = x.sum()

x = np.ones(10)
sum_reduce(x)

------------------------------IR DUMP: sum_reduce-------------------------------
label 0:
    x = arg(0, name=x)                       ['x']
    $4load_method.1 = getattr(value=x, attr=sum) ['$4load_method.1', 'x']
    $6call_method.2 = call $4load_method.1(func=$4load_method.1, args=[], kws=(), vararg=None) ['$4load_method.1', '$6call_method.2']
    $const10.4 = const(int, 0)               ['$const10.4']
    x[$const10.4] = $6call_method.2          ['$6call_method.2', '$const10.4', 'x']
    $const14.5 = const(NoneType, None)       ['$const14.5']
    $16return_value.6 = cast(value=$const14.5) ['$16return_value.6', '$const14.5']
    return $16return_value.6                 ['$16return_value.6']

______________________________________________________________________
REWRITING (RewriteConstSetitems):
    x = arg(0, name=x)                       ['x']
    $4load_method.1 = getattr(value=x, attr=sum) ['$4load_method.1', 'x']
    $6call_method.2 = call $4load_method.1(func=$4load_method.1, a

Traceback (most recent call last):
  File "<stdin>", line 12, in <module>
  File "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 758, in __call__
    kernel = self.specialize(*args)
  File "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 769, in specialize
    kernel = self.compile(argtypes)
  File "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 784, in compile
    kernel = compile_kernel(self.py_func, argtypes,
  File "/home/gmarkall/numbadev/numba/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 57, in compile_kernel
    cres = compile_cuda(pyfunc, types.void, args, debug=debug, inline=inline)
  File "/home/gmarkall/numbadev/numba/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 40, in compile_cuda
    cres = compiler.

CalledProcessError: Command 'b'from numba import cuda\nfrom numba import config\nimport numpy as np\nconfig.DEBUG_TYPEINFER = True\nconfig.DUMP_IR = True\n\n@cuda.jit\ndef sum_reduce(x):\n    x[0] = x.sum()\n\nx = np.ones(10)\nsum_reduce(x)\n'' returned non-zero exit status 1.

## Branch Elimination

Sometimes Numba can eliminate code from dead branches, if it can determine that the branch will never run for a given set of argument types - this can avoid a unification error that would otherwise have occurred if Numba could not eliminate these dead branches. The next example demonstrates this capability when it does work, and also when it doesn't.

In [14]:
@njit
def branch_elim_example(a, b, cond):
    if cond is None:
        return a
    else:
        return b

This call, where `cond` is `None`, succeeds due to the elision of the `else` branch:

In [15]:
branch_elim_example(1, (1, 2), None)

1

In the following call branch elimination fails, forcing an attempt to unify two things that cannot be unified:

In [16]:
branch_elim_example(1, (1, 2), True)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Can't unify return type from the following types: UniTuple(int64 x 2), int64
Return of: IR name '$16return_value.1', type 'UniTuple(int64 x 2)', location: 
File "<ipython-input-14-483ae52ee5e6>", line 6:
def branch_elim_example(a, b, cond):
    <source elided>
    else:
        return b
        ^
Return of: IR name '$12return_value.1', type 'int64', location: 
File "<ipython-input-14-483ae52ee5e6>", line 4:
def branch_elim_example(a, b, cond):
    <source elided>
    if cond is None:
        return a
        ^

The following cell contains the same function and call, but run with `%%script` so you can inspect the IR and typing if you wish.

In [17]:
%%script python
from numba import njit
from numba import config
import numpy as np
config.DEBUG_TYPEINFER = True
config.DUMP_IR = True

@njit
def branch_elim_example(a, b, cond):
    if cond is None:
        return a
    else:
        return b
    
branch_elim_example(1, (1, 2), True)

--------------------------IR DUMP: branch_elim_example--------------------------
label 0:
    a = arg(0, name=a)                       ['a']
    b = arg(1, name=b)                       ['b']
    cond = arg(2, name=cond)                 ['cond']
    $const4.1 = const(NoneType, None)        ['$const4.1']
    $6compare_op.2 = cond is $const4.1       ['$6compare_op.2', '$const4.1', 'cond']
    branch $6compare_op.2, 10, 14            ['$6compare_op.2']
label 10:
    $12return_value.1 = cast(value=a)        ['$12return_value.1', 'a']
    return $12return_value.1                 ['$12return_value.1']
label 14:
    $16return_value.1 = cast(value=b)        ['$16return_value.1', 'b']
    return $16return_value.1                 ['$16return_value.1']

--------------------------IR DUMP: branch_elim_example--------------------------
label 0:
    a = arg(0, name=a)                       ['a']
    b = arg(1, name=b)                       ['b']
    cond = arg(2, name=cond)                 ['cond']
 

Traceback (most recent call last):
  File "<stdin>", line 14, in <module>
  File "/home/gmarkall/numbadev/numba/numba/core/dispatcher.py", line 401, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/gmarkall/numbadev/numba/numba/core/dispatcher.py", line 344, in error_rewrite
    reraise(type(e), e, None)
  File "/home/gmarkall/numbadev/numba/numba/core/utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Can't unify return type from the following types: UniTuple(int64 x 2), int64
Return of: IR name '$12return_value.1', type 'int64', location: 
File "<stdin>", line 10:
<source missing, REPL/exec in use?>
Return of: IR name '$16return_value.1', type 'UniTuple(int64 x 2)', location: 
File "<stdin>", line 12:
<source missing, REPL/exec in use?>


CalledProcessError: Command 'b'from numba import njit\nfrom numba import config\nimport numpy as np\nconfig.DEBUG_TYPEINFER = True\nconfig.DUMP_IR = True\n\n@njit\ndef branch_elim_example(a, b, cond):\n    if cond is None:\n        return a\n    else:\n        return b\n    \nbranch_elim_example(1, (1, 2), True)\n'' returned non-zero exit status 1.

### General summary of Branch Elimination

* Branch elimination can sometimes remove dead code and prevent unification errors.
* In practice, if you find that only some calls to a function fail to unify, then branch elimination may be involved.

# CUDA-specific issues

This section looks at a few issues where performance on CUDA can be impacted due to the typing. These are:

* Widening unification
* Widening arithmetic, and its propagation
* The typing of integer arithmetic
* Register usage control

## Widening unification

Unification of types can result in a type that is larger than any of the types from the set that was unified. This first example uses the CPU target because it makes for a simpler example, but the general idea of widening unification applies to the CUDA target as well.

In [18]:
@njit
def select(a, b, threshold, value):
    if threshold < value:
        r = a
    else:
        r = b
    return r

a = np.float32(1)
b = np.int32(2)
select(a, b, 10, 11)  # Call with (float32, int32, int64, int64)

1.0

After the call, we can inspect the typing:

In [19]:
select.inspect_types()

select (float32, int32, int64, int64)
--------------------------------------------------------------------------------
# File: <ipython-input-18-0f89bde90efd>
# --- LINE 1 --- 

@njit

# --- LINE 2 --- 

def select(a, b, threshold, value):

    # --- LINE 3 --- 
    # label 0
    #   a = arg(0, name=a)  :: float32
    #   b = arg(1, name=b)  :: int32
    #   threshold = arg(2, name=threshold)  :: int64
    #   value = arg(3, name=value)  :: int64
    #   $6compare_op.2 = threshold < value  :: bool
    #   del value
    #   del threshold
    #   branch $6compare_op.2, 10, 16

    if threshold < value:

        # --- LINE 4 --- 
        # label 10
        #   del b
        #   del $6compare_op.2
        #   r = a  :: float32
        #   del a
        #   jump 20

        r = a

    # --- LINE 5 --- 

    else:

        # --- LINE 6 --- 
        # label 16
        #   del a
        #   del $6compare_op.2
        #   r.1 = b  :: int32
        #   del b

        r = b

    # --- LINE 7 --- 

### Exercises

Using the typing output above:

* Determine the return type. What was it?
* Why was this type chosen instead of one of the types in the set?
* Fix the above code so that the return type is no wider than any of the input types.

## Width of constants

The default width of constants and the propagation of their width can have an effect on the typing that results in slower code due to the use of double precision units, and increased register usage. We will build up an example step-by-step to see the impact on the propagated types and the knock-on effects on the LLVM IR and PTX code. 

We begin with a very simple example, where we assign a constant to an array element:

In [20]:
from numba import void

@cuda.jit(void(float32[:]))
def assign_constant(x):
    x[0] = 2.0

Now let's see the typing:

In [21]:
assign_constant.inspect_types()

_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE (array(float32, 1d, A),)
--------------------------------------------------------------------------------
# File: <ipython-input-20-f028b2d99a7d>
# --- LINE 3 --- 

@cuda.jit(void(float32[:]))

# --- LINE 4 --- 

def assign_constant(x):

    # --- LINE 5 --- 
    # label 0
    #   x = arg(0, name=x)  :: array(float32, 1d, A)
    #   $const2.0 = const(float, 2.0)  :: float64
    #   $const6.2 = const(int, 0)  :: Literal[int](0)
    #   x[0] = $const2.0
    #   del x
    #   del $const6.2
    #   del $const2.0
    #   $const10.3 = const(NoneType, None)  :: none
    #   $12return_value.4 = cast(value=$const10.3)  :: none
    #   del $const10.3
    #   return $12return_value.4

    x[0] = 2.0




The constant has a type of `float64`. Now let's look at what LLVM does with that, by viewing the LLVM IR after LLVM optimizations:

In [22]:
print(assign_constant.inspect_llvm())

source_filename = "<string>"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@"_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__errcode__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__tidx__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__ctaidx__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__tidy__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__ctaidy__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__tidz__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable

It turns out that the LLVM optimizer was able to convert this back to a 32-bit constant: `store float 2.000000e+00, float* %arg.x.4, align 4`.

We see a similar width in the PTX:

In [23]:
print(assign_constant.inspect_asm())

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-27506705
// Cuda compilation tools, release 10.2, V10.2.89
// Based on LLVM 3.4svn
//

.version 6.5
.target sm_70
.address_size 64

	// .globl	_ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE
.visible .global .align 4 .u32 _ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__errcode__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__tidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__ctaidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__tidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__ctaidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19assign_constant$248E5ArrayIfLi1E1A7mutable7alignedE__tidz__;
.visible .global .align 4 .u32 

Correspondingly, we have `mov.u32 	%r1, 1073741824;`. So far, so good.
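
The integer in that `mov.u32` instruction is the bit pattern of the 32-bit float 2.0, which confirms that a single-precision constant is being stored. A small standard-library sketch to check the encoding:

```python
import struct

# Reinterpret the 4 bytes of the float32 value 2.0 as an unsigned 32-bit integer.
bits = struct.unpack("<I", struct.pack("<f", 2.0))[0]

print(bits)       # 1073741824
print(hex(bits))  # 0x40000000
```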

### Increasing complexity slightly - in-place addition

Now let's build up the example a little - instead of assigning a constant, we add a constant to the array element:

In [24]:
@cuda.jit(void(float32[:]))
def add_constant(x):
    x[0] += 2.0

If we inspect the types, we see:

In [25]:
add_constant.inspect_types()

_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE (array(float32, 1d, A),)
--------------------------------------------------------------------------------
# File: <ipython-input-24-25f8fdc40e3e>
# --- LINE 1 --- 

@cuda.jit(void(float32[:]))

# --- LINE 2 --- 

def add_constant(x):

    # --- LINE 3 --- 
    # label 0
    #   x = arg(0, name=x)  :: array(float32, 1d, A)
    #   $const4.1 = const(int, 0)  :: Literal[int](0)
    #   $8binary_subscr.4 = static_getitem(value=x, index=0, index_var=$const4.1)  :: float32
    #   $const10.5 = const(float, 2.0)  :: float64
    #   $12inplace_add.6 = inplace_binop(fn=<built-in function iadd>, immutable_fn=<built-in function add>, lhs=$8binary_subscr.4, rhs=$const10.5, static_lhs=Undefined, static_rhs=Undefined)  :: float64
    #   del $const10.5
    #   del $8binary_subscr.4
    #   x[0] = $12inplace_add.6
    #   del x
    #   del $const4.1
    #   del $12inplace_add.6
    #   $const18.7 = const(NoneType, None)  :: none
  

Again the typing of the constant is `float64`, and also the addition of the `float32` and `float64` (`$8binary_subscr.4` plus `$const10.5` stored in `$12inplace_add.6`) results in a `float64`.

But does the addition result in a 64-bit operation in the LLVM IR?

In [26]:
print(add_constant.inspect_llvm())

source_filename = "<string>"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@"_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__errcode__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__tidx__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__ctaidx__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__tidy__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__ctaidy__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__tidz__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__ctaidz__" 

No! Again the LLVM optimizer has managed to reduce this to a 32-bit operation: `fadd float %.4957, 2.000000e+00`.

The PTX corresponds:

In [27]:
print(add_constant.inspect_asm())

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-27506705
// Cuda compilation tools, release 10.2, V10.2.89
// Based on LLVM 3.4svn
//

.version 6.5
.target sm_70
.address_size 64

	// .globl	_ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE
.visible .global .align 4 .u32 _ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__errcode__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__tidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__ctaidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__tidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__ctaidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__16add_constant$249E5ArrayIfLi1E1A7mutable7alignedE__tidz__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__16

As expected, we see `add.f32 	%f2, %f1, 0f40000000;`, where `0f40000000` is PTX's hexadecimal notation for the single-precision constant 2.0.

### Bringing in another addition

As well as adding a constant, we'll now add another array element:

In [28]:
@cuda.jit(void(float32[:], float32[:]))
def add_constant_2(x, y):
    x[0] += y[0] + 2.0

We would expect the IR to contain more `float64` operations:

In [29]:
add_constant_2.inspect_types()

_ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE (array(float32, 1d, A), array(float32, 1d, A))
--------------------------------------------------------------------------------
# File: <ipython-input-28-aa462724f951>
# --- LINE 1 --- 

@cuda.jit(void(float32[:], float32[:]))

# --- LINE 2 --- 

def add_constant_2(x, y):

    # --- LINE 3 --- 
    # label 0
    #   x = arg(0, name=x)  :: array(float32, 1d, A)
    #   y = arg(1, name=y)  :: array(float32, 1d, A)
    #   $const4.1 = const(int, 0)  :: Literal[int](0)
    #   $8binary_subscr.4 = static_getitem(value=x, index=0, index_var=$const4.1)  :: float32
    #   $const12.6 = const(int, 0)  :: Literal[int](0)
    #   $14binary_subscr.7 = static_getitem(value=y, index=0, index_var=$const12.6)  :: float32
    #   del y
    #   del $const12.6
    #   $const16.8 = const(float, 2.0)  :: float64
    #   $18binary_add.9 = $14binary_subscr.7 + $const16.8  :: float64
    #   del $const16.8


What happens this time in the LLVM IR? Let's see:

In [30]:
print(add_constant_2.inspect_llvm())

source_filename = "<string>"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@"_ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__errcode__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__tidx__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__ctaidx__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__tidy__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__ctaidy__" = local_unnamed_addr global i32 0
@"_ZN6cudapy8__main__19add_c

Instead of operations on 32-bit floats, we now see casts (`fpext` / `fptrunc`) between 32- and 64-bit values, and operations on 64-bit values (`fadd double`). This time, the optimizer couldn't save us!
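
One plausible explanation for the difference (an interpretation, not something the compiler output states) is rounding. A single addition performed in `float64` and then truncated gives the same result as performing it directly in `float32`, so narrowing it is safe. Here two additions are chained in `float64` with a single truncation at the end, and narrowing them would introduce an extra intermediate rounding that can change the answer. The following pure-Python sketch emulates `float32` rounding with `struct`; the helper `to_f32` and the input values are illustrative choices, picked so that every `float64` sum involved is exact:

```python
import struct


def to_f32(value):
    """Round a Python float (a float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# 3 * 2**-25 is exactly representable in float32.
x = y = 3 * 2.0**-25

# As typed by Numba: both additions in float64, one truncation at the end.
wide = to_f32(x + (y + 2.0))

# Hypothetical narrowed version: each addition rounds to float32.
narrow = to_f32(x + to_f32(y + 2.0))

print(wide == narrow)  # False: the two evaluation strategies disagree
```

Because the two strategies can produce different values, an optimizer that preserves the semantics of the typed code cannot replace the `float64` additions with `float32` ones here.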

NVVM doesn't help us in this case either:

In [31]:
print(add_constant_2.inspect_asm())

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-27506705
// Cuda compilation tools, release 10.2, V10.2.89
// Based on LLVM 3.4svn
//

.version 6.5
.target sm_70
.address_size 64

	// .globl	_ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE
.visible .global .align 4 .u32 _ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__errcode__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__tidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__ctaidx__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5ArrayIfLi1E1A7mutable7alignedE__tidy__;
.visible .global .align 4 .u32 _ZN6cudapy8__main__19add_constant_2$2410E5ArrayIfLi1E1A7mutable7alignedE5

Similarly we see casts (e.g. `cvt.f64.f32`) and operations on 64-bit values (e.g. `add.f64`).

### Exercise:

* Fix the typing of the `add_constant_2` function with an appropriate cast.
* Re-run the inspection of the typing, LLVM, and PTX to verify that the width of operations is reduced.

## Register usage

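We can find out the register usage of the kernel from its `regs` attribute. The private `_func` path used here reflects the Numba version this notebook was run with; more recent releases expose the same figure through the kernel's `get_regs_per_thread()` method: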

In [32]:
add_constant_2._func.get().attrs.regs

8

With the original typing, this gives 8 registers on my setup. With the "corrected" typing, fewer registers are needed - 6 in my case. In general, reducing the width of operations reduces register usage and can increase occupancy.

## Controlling register usage by parameter

The `max_registers` keyword argument of the `@cuda.jit` decorator can also be used to limit register usage, which can be helpful once register usage cannot be reduced any further through code changes.

This only has an effect for kernels of a minimum level of complexity - the following is about the size of the simplest example for which it can be seen to take effect:

In [33]:
@cuda.jit
def busy_arithmetic(x, y, a):
    a = y[0]
    b = 2.0
    c = y[1] / 6
    d = y[2] % 8
    e = y[3] * y[4]
    for i in range(a):
        a += 2
        b -= c
        e *= d
        x[0] += a * b + c * d - e

x = np.empty(32, dtype=np.float32)
y = np.empty(32, dtype=np.float32)
kernel = busy_arithmetic.specialize(x, y, 5)

Note that here we used the `specialize()` method of the CUDA-jitted kernel. It gives us a compiled kernel typed for a particular set of arguments without launching it, which is convenient when we only want to experiment with one typing of a function.

Let's examine the register usage of the kernel:

In [34]:
kernel._func.get().attrs.regs

36

Now if we redefine the kernel with the `max_registers` keyword argument and inspect the register usage:

In [35]:
@cuda.jit(max_registers=24)
def busy_arithmetic_maxreg_24(x, y, a):
    a = y[0]
    b = 2.0
    c = y[1] / 6
    d = y[2] % 8
    e = y[3] * y[4]
    for i in range(a):
        a += 2
        b -= c
        e *= d
        x[0] += a * b + c * d - e
        
kernel_maxreg_24 = busy_arithmetic_maxreg_24.specialize(x, y, 5)
kernel_maxreg_24._func.get().attrs.regs

24

We see that the register usage is reduced to the level we requested. However, the `max_registers` kwarg is a request rather than a guarantee, so the optimizer may not be able to honor it in full. For example:

In [36]:
@cuda.jit(max_registers=20)
def busy_arithmetic_maxreg_20(x, y, a):
    a = y[0]
    b = 2.0
    c = y[1] / 6
    d = y[2] % 8
    e = y[3] * y[4]
    for i in range(a):
        a += 2
        b -= c
        e *= d
        x[0] += a * b + c * d - e
        
kernel_maxreg_20 = busy_arithmetic_maxreg_20.specialize(x, y, 5)
kernel_maxreg_20._func.get().attrs.regs

24

The register usage was reduced, but only to 24, which was the minimum achievable.
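
To put the drop from 36 to 24 registers in context, a rough upper bound on occupancy can be computed from the size of the register file alone. This is a simplified model: the limits below are assumed values for a compute capability 7.0 device (matching the `sm_70` target in the PTX listings), the function name is made up for illustration, and the model ignores register allocation granularity, block size, and shared memory, all of which can lower the real figure:

```python
# Assumed per-SM limits for a compute capability 7.0 device; other
# devices have different values.
REGISTERS_PER_SM = 65536
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32


def register_limited_occupancy(regs_per_thread):
    """Upper bound on occupancy when registers are the only constraint."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    resident_warps = min(MAX_WARPS_PER_SM, REGISTERS_PER_SM // regs_per_warp)
    return resident_warps / MAX_WARPS_PER_SM


for regs in (36, 24):
    print(regs, register_limited_occupancy(regs))
# 36 0.875
# 24 1.0
```

Under these assumptions, 36 registers per thread leaves room for 56 of the 64 possible warps, while 24 registers per thread no longer limits occupancy at all.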

## Integer arithmetic width

Numba strongly prefers using `int64` values for all integer arithmetic. Let's consider an example:

In [37]:
@cuda.jit
def index_computation(x):
    i = cuda.grid(1)                     # int32

    if i < x.shape[0]:                   # x.shape[0] will be int64
        for j in range(3):               # range_iter_int64
            x[i, j] = (i * 2) + (j * 3)  # int64 computations

x = np.zeros((1024, 3), dtype=np.int32)
kernel = index_computation.specialize(x)

Now if we inspect the typing:

In [38]:
kernel.inspect_types()

_ZN6cudapy8__main__22index_computation$2414E5ArrayIiLi2E1C7mutable7alignedE (array(int32, 2d, C),)
--------------------------------------------------------------------------------
# File: <ipython-input-37-7e0a8164e397>
# --- LINE 1 --- 

@cuda.jit

# --- LINE 2 --- 

def index_computation(x):

    # --- LINE 3 --- 
    # label 0
    #   x = arg(0, name=x)  :: array(int32, 2d, C)
    #   $2load_global.0 = global(cuda: <module 'numba.cuda' from '/home/gmarkall/numbadev/numba/numba/cuda/__init__.py'>)  :: Module(<module 'numba.cuda' from '/home/gmarkall/numbadev/numba/numba/cuda/__init__.py'>)
    #   $4load_method.1 = getattr(value=$2load_global.0, attr=grid)  :: Macro(<class 'numba.cuda.cudadecl.Cuda_grid'>)
    #   del $4load_method.1
    #   del $2load_global.0
    #   $const6.2 = const(int, 1)  :: Literal[int](1)
    #   $8call_method.3 = call ptx.grid.1d($const6.2, func=ptx.grid.1d, args=[Var($const6.2, <ipython-input-37-7e0a8164e397>:3)], kws=(), vararg=None)  :: (int64,) -> int32

We see that most of the arithmetic happens using `int64` values, and the range iterates over `int64` (the `range_iter_int64` type).

We can attempt to reduce the width of arithmetic operations using casts, but it requires a lot of casts:

In [39]:
from numba import int32

@cuda.jit
def index_computation_int32(x):
    i = cuda.grid(1)                     # int32

    if i < int32(x.shape[0]):            # Attempt to compare using int32 arithmetic
        for j in range(int32(3)):        # Force iteration over int32 - a range_iter_int32
            x[i, j] = int32(int32(int32(i) * int32(2))
                            + int32(int32(j) * int32(3)))
                                         # Attempt to make all constants and operations int32

kernel_int32 = index_computation_int32.specialize(x)

If we have been successful, we should see reduced register usage for the `index_computation_int32` kernel:

In [40]:
kernel._func.get().attrs.regs

12

In [41]:
kernel_int32._func.get().attrs.regs

14

We have actually made things worse! It is often better not to try to reduce the width of `int64` operations, because doing so results in a mix of `int32` and `int64` values, which ends up requiring more registers.

### Exercise:

* Inspect the IR, LLVM, and PTX to see where `int64` computations remain in `index_computation_int32`.

# Summary

Throughout the course of this notebook, we have:

* Seen how to use `inspect_types()` to view the typing of jitted functions.
* Examined *phi nodes* and looked at the unification of types at phi nodes.
* Seen how calls with different argument types result in different specialisations of a function, each with its own typing.
* Examined typing errors:
  * Unification failures, and how to determine what failed to unify.
  * Use of a variable with inconsistent typing throughout the function (e.g. 1D array vs. 2D array).
  * Use of unsupported functions, or functions implemented using unsupported functions in the CUDA target.
* Seen an example of branch elimination, and how it sometimes succeeds in allowing typings with arguments that could otherwise have resulted in unification errors.
* Looked at CUDA-specific issues, mainly related to register usage:
  * When widening unification occurs, and how to prevent it.
  * When widening arithmetic occurs, and how to avoid it for floating point types.
  * How integer arithmetic strongly prefers `int64`, and how it can be counterproductive to try to reduce it to `int32` and narrower types.
* Seen how to control register usage using the `max_registers` keyword argument.