# Part Two: Making it run on GPU

The last post presented an autograd library in its simplest form. Recall that in the title of that post I said we were making it run on CPU? That part is actually numpy's doing. When we do `self.data * other.grad`, the `__mul__` method is provided by the numpy library, which runs on CPU as compiled C code. Because the `__mul__` happens on numpy arrays, we don't have to worry about it polluting our `_backward` function, which is updated with every call to `.mul()`. For this to work on GPU, and to allow easy switching between the two device types, we need something like numpy that runs on GPU. There are different ways to do this, and in this notebook I will present the approach tinygrad has taken.
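
To make the dispatch concrete, here is a minimal sketch in plain Python. The `CPUBuffer` class is hypothetical and only stands in for a numpy array; the point is that `a * b` is just a call to `a.__mul__(b)`, so whatever type sits in `.data` decides where the multiplication actually runs.

```python
class CPUBuffer:
    """Hypothetical stand-in for a numpy array: it owns the arithmetic."""
    def __init__(self, values):
        self.values = list(values)

    def __mul__(self, other):
        # Python routes `a * b` here; a GPU backend would launch a kernel instead
        return CPUBuffer(x * y for x, y in zip(self.values, other.values))

a = CPUBuffer([2.0])
b = CPUBuffer([3.0])
c = a * b            # same as a.__mul__(b)
print(c.values)      # [6.0]
```

Swapping the backend then comes down to swapping the class that provides `__mul__`.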

Running things on GPU requires us to write some GPU-specific code; I'll use PyOpenCL. First, let's look at how it works:

In [4]:
import pyopencl as cl
import numpy as np

# Boilerplate for openCL
cl_ctx = cl.create_some_context(answers=[0,2])  # change if you don't have mac
cl_queue = cl.CommandQueue(cl_ctx)

# First create the value in numpy  that lives on CPU
a = np.array([2], dtype=np.float32)
b = np.array([3], dtype=np.float32)

# Then convert them into openCL buffer that lives on the GPU
a = cl.Buffer(cl_ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)
b = cl.Buffer(cl_ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=b)

# Create the destination buffer (output value). This may seem weird if you come from an interpreted language background
# but it is common in low level programming to first allocate a variable/space before writing data to it
c = cl.Buffer(cl_ctx, cl.mem_flags.WRITE_ONLY, 4) # output is a single float32, which takes up 4 bytes

# Write the actual openCL GPU code that would do the calculation
prg = cl.Program(cl_ctx, """
    __kernel void mul(
        __global const float *a_g, __global const float *b_g, __global float *res_g)
    {
      int gid = get_global_id(0);
      res_g[gid] = a_g[gid] * b_g[gid];
    }
    """).build()

# The .build() step will create a method of the same name on the prg instance, and we execute it
prg.mul(cl_queue, [c.size//4], None, a, b, c)

# Data on GPU is not directly accessible, we have to copy it to numpy in order to view it
# So again, allocate a variable and then paste the content into it with the enqueue_copy method
data = np.empty((1,), dtype=np.float32)
cl.enqueue_copy(cl_queue, data, c)
print(data) # output [6.]

[6.]
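
If the kernel's indexing looks unfamiliar, here is a rough CPU emulation in plain Python. This is not how OpenCL executes (a real device runs the work-items in parallel), and `mul_kernel` and `launch` are hypothetical names I made up for the sketch. It only shows that every work-item gets its own `gid` and handles one element, and that the global size passed as `[c.size//4]` is simply the number of float32 elements in the output.

```python
from array import array

def mul_kernel(gid, a_g, b_g, res_g):
    # Body of the OpenCL kernel, with gid passed in instead of get_global_id(0)
    res_g[gid] = a_g[gid] * b_g[gid]

def launch(kernel, global_size, *buffers):
    # Hypothetical launcher: a real device would run these work-items in parallel
    for gid in range(global_size):
        kernel(gid, *buffers)

a = array("f", [2.0, 4.0])      # "f" is a 4-byte float, like np.float32
b = array("f", [3.0, 5.0])
c = array("f", [0.0, 0.0])      # pre-allocated output, like the WRITE_ONLY buffer

size_in_bytes = len(c) * c.itemsize
launch(mul_kernel, size_in_bytes // c.itemsize, a, b, c)   # one work-item per element
print(list(c))  # [6.0, 20.0]
```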


We want to replace the `*` and `+` operations (`__mul__` and `__add__`) from part 1 with GPU kernels like the one above. We can keep most things as they are and add `__mul__` and `__add__` methods to the Tensor class. Let's start with multiplication.

Two things to note:
1. With the added implementation, the `.data` attribute is an OpenCL buffer rather than a numpy array, so any code that still uses numpy must be modified to keep the types consistent.

2. `__mul__` and `__add__` operate directly on the tensor rather than on `.data`; therefore `.grad` also needs to be a tensor.
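
On point 2, a line such as `self.grad += other * out.grad` only works when both sides are objects that implement `__add__`. A minimal sketch with a hypothetical `Box` class (plain Python, no GPU involved) shows what Python does with `+=` when a class defines `__add__` but not `__iadd__`: it calls `__add__` and rebinds the name to the new object.

```python
class Box:
    """Hypothetical stand-in for a Tensor that only defines __add__."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Returns a brand-new Box; neither operand is modified
        return Box(self.value + other.value)

grad = Box(0.0)
before = grad
grad += Box(1.5)          # no __iadd__, so this runs grad = grad.__add__(Box(1.5))

print(grad.value)         # 1.5
print(grad is before)     # False: the name now points at a new object
print(before.value)       # 0.0: the original was left untouched
```

With GPU-backed tensors, this means every `+=` allocates a fresh output buffer rather than updating the gradient in place.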


In [3]:
import numpy as np
import pyopencl as cl
# Boilerplate for openCL
cl_ctx = cl.create_some_context(answers=[0,2])  # change if you don't have mac
cl_queue = cl.CommandQueue(cl_ctx)
class Tensor:
  def __mul__(self, other):
    # Output is a single float32 (4 bytes). READ_WRITE because later kernels read this buffer as an input
    out = cl.Buffer(cl_ctx, cl.mem_flags.READ_WRITE, 4)

    # Write the actual openCL GPU code that would do the calculation
    prg = cl.Program(cl_ctx, """
        __kernel void mul(
            __global const float *a_g, __global const float *b_g, __global float *res_g)
        {
          int gid = get_global_id(0);
          res_g[gid] = a_g[gid] * b_g[gid];
        }
        """).build()

    # The .build() step will create a method of the same name on the prg instance, and we execute it
    prg.mul(cl_queue, [out.size//4], None, self.data, other.data, out)
    return Tensor(out, (self, other))
  
  def __add__(self, other):
    # Output is a single float32 (4 bytes). READ_WRITE because later kernels read this buffer as an input
    out = cl.Buffer(cl_ctx, cl.mem_flags.READ_WRITE, 4)

    # Write the actual openCL GPU code that would do the calculation
    prg = cl.Program(cl_ctx, """
        __kernel void add(
            __global const float *a_g, __global const float *b_g, __global float *res_g)
        {
          int gid = get_global_id(0);
          res_g[gid] = a_g[gid] + b_g[gid];
        }
        """).build()

    # The .build() step will create a method of the same name on the prg instance, and we execute it
    prg.add(cl_queue, [out.size//4], None, self.data, other.data, out)
    return Tensor(out, (self, other))
  
  def __init__(self, data, children=(), requires_grad=True) -> None:
    self.data = data
    self._prev = set(children)

    # Convert grad to openCL buffer
    grad = np.zeros(1, dtype=np.float32)  # float32 to match the kernel's float arguments
    grad = cl.Buffer(cl_ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=grad)

    self.grad = Tensor(grad, requires_grad=False) if requires_grad else None

    self._backward = lambda: None

  def mul(self, other):
    out = self * other
    def _backward():
      self.grad += other * out.grad
      other.grad += self * out.grad
    out._backward = _backward
    return out
    
  def build_topo(self):
    topo = []
    visited = set()
    def _build_topo(v):
      if v not in visited:
        visited.add(v)
        for child in v._prev:
          _build_topo(child)
        topo.append(v)
    _build_topo(self)
    return topo

  def backward(self):

    # Convert to openCL buffer
    grad = np.ones((1,), dtype=np.float32)  # float32 to match the kernel's float arguments
    grad = cl.Buffer(cl_ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=grad)
    self.grad = Tensor(grad)
    tree = self.build_topo()
    for tensor in reversed(tree):
      tensor._backward()
  
  def __repr__(self):
    return f"<Tensor with data: {self.data}, grad: {self.grad}>"
  

# Use float32 on the host so the bytes match what the kernels expect
a = cl.Buffer(cl_ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=np.array([2], dtype=np.float32))
a = Tensor(a)
b = cl.Buffer(cl_ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=np.array([3], dtype=np.float32))
b = Tensor(b)
c = a.mul(b)
d = c.mul(b)
d.backward()
print(a)

<Tensor with data: <pyopencl._cl.Buffer object at 0x110a72990>, grad: <Tensor with data: <pyopencl._cl.Buffer object at 0x110a73650>, grad: <Tensor with data: <pyopencl._cl.Buffer object at 0x110a73770>, grad: None>>>


In the above cell, we managed to get the operations to run on GPU without crashing, but there are a few important issues:
1. If you examine the output, the grad is a tensor that itself has a grad, nested a few layers deep. This should be fixed, because the gradient itself should not have a gradient. In other words, we need a mechanism that determines, across the board, whether the gradient should be tracked.

2. In our CPU implementation, we used `Tensor.data.shape` to determine the shape of the data. A GPU buffer doesn't have this attribute, so I took a shortcut and assumed the shape is always 1.

3. We have to initialize the data as a buffer first and then construct the tensor. This is fine at first glance, but if you examine the operations in `__mul__` and `__add__`, the output buffers allocated there need to be handled just as meticulously.
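
One possible way to think about issues 2 and 3 (a sketch of the idea, not necessarily what tinygrad does) is to keep the shape next to the raw storage, so the size of an output can be derived instead of hard-coded to 4 bytes. In the plain-Python sketch below, `ShapedBuffer` and its `like` method are hypothetical names, and `array("f")` stands in for device memory holding float32 values.

```python
from array import array
from math import prod

class ShapedBuffer:
    """Hypothetical wrapper: raw storage plus the shape a bare buffer lacks."""
    def __init__(self, shape, values=None):
        self.shape = tuple(shape)
        n = prod(self.shape)
        # array("f") stands in for device memory holding float32 values
        self.raw = array("f", values if values is not None else [0.0] * n)
        assert len(self.raw) == n, "values do not match shape"

    @property
    def nbytes(self):
        # Number of bytes a real allocation would need
        return len(self.raw) * self.raw.itemsize

    @classmethod
    def like(cls, other):
        # Allocate a zeroed output of the same shape, as __mul__/__add__ would need
        return cls(other.shape)

a = ShapedBuffer((2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
out = ShapedBuffer.like(a)
print(out.shape, out.nbytes)  # (2, 3) 24
```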

The first change we will make is to specify whether the Tensor needs a gradient, and have it store this as an attribute. The gradient will always be set to None at the beginning.

```python
def __init__(self, data, device=Device.DEFAULT, requires_grad=True):
```
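
As a rough sketch of how such a flag might behave, here is a version in plain Python floats instead of GPU buffers. It leaves out the `device` argument, keeps the `children` argument from the earlier cell, and `_accumulate` is a hypothetical helper I added for the illustration. The gradient starts as None and is stored as a plain value, so it never nests.

```python
class Tensor:
    def __init__(self, data, children=(), requires_grad=True):
        self.data = data
        self.requires_grad = requires_grad
        self.grad = None                      # always None until backward() runs
        self._prev = set(children)
        self._backward = lambda: None

    def _accumulate(self, g):
        # Hypothetical helper: store gradients as plain values, never as tracked Tensors
        if self.requires_grad:
            self.grad = g if self.grad is None else self.grad + g

    def mul(self, other):
        out = Tensor(self.data * other.data, (self, other))
        def _backward():
            self._accumulate(other.data * out.grad)
            other._accumulate(self.data * out.grad)
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                       # seed d(out)/d(out)
        for t in reversed(topo):
            t._backward()

a, b = Tensor(2.0), Tensor(3.0)
d = a.mul(b).mul(b)
d.backward()
print(a.grad, b.grad)   # 9.0 12.0
```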