# Part Three: combining CPU and GPU operations
We would like to refactor and combine what we have in parts 1 and 2 into a unified
interface like PyTorch's, where you pass in a device type of either CPU or GPU and
the rest of the operations work automatically. In other words, we extract a
shared representation across all the operations (addition, multiplication, etc.)
and all the device types (currently just CPU and GPU). This will take us to a minimal
version of tinygrad.

The first change is to Tensor initialization.

Tensor data can be passed in as a GPU buffer, a numpy array, or a plain Python list. The constructor needs to detect which one it received and convert it to the desired device.

```python
def __init__(self, data, shape, device):
  self.device = device
  self.data = _move_data(data, shape, device)
```



Let's look at how `_move_data` can be implemented.

In [2]:
import numpy as np
import pyopencl as cl

cl_ctx = cl.create_some_context(answers=[0,2])  # change if you don't have mac
cl_queue = cl.CommandQueue(cl_ctx)

def _move_data(data, shape, device):
  # Convert `data` (list, numpy array, or OpenCL buffer) to the target device.
  if device == 'CPU':
    if isinstance(data, list):
      return np.array(data, dtype=np.float32)
    elif isinstance(data, cl.Buffer):
      ret = np.empty(shape, dtype=np.float32)
      cl.enqueue_copy(cl_queue, ret, data)  # destination first, then source
      return ret
    elif isinstance(data, np.ndarray):
      return data
  elif device == 'GPU':
    if isinstance(data, list):
      data = np.array(data, dtype=np.float32)
    if isinstance(data, cl.Buffer):
      return data
    elif isinstance(data, np.ndarray):
      # COPY_HOST_PTR copies the host array into the new device buffer.
      flags = cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR
      return cl.Buffer(cl_ctx, flags, hostbuf=data)
  raise TypeError(f"cannot move {type(data).__name__} to {device}")
    

a = np.array([2], dtype=np.float32)
a = cl.Buffer(cl_ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)
ret = _move_data(a, (1,), "CPU")

The second change is to always initialize the gradient, here to `None`, and to add an attribute recording whether the tensor requires a gradient.
```python
def __init__(self, requires_grad=True):
  self.grad = None
  self.requires_grad = requires_grad
```

Now consider what happens when a tensor that requires grad is multiplied by another tensor: the output tensor inherits the `requires_grad` attribute from its inputs.
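
The inheritance rule can be sketched without any GPU or numpy code. The `MiniTensor` class below is a hypothetical stand-in for the real Tensor and only tracks the flag. It assumes the output requires grad when either input does, which is the rule the dispatch code later in this part uses.

```python
class MiniTensor:
  """Hypothetical stand-in for Tensor; holds one float and the grad flag."""
  def __init__(self, value, requires_grad=True):
    self.value = value
    self.requires_grad = requires_grad
    self.grad = None

  def __mul__(self, other):
    # Assumption: the result requires grad if either input does.
    out_requires_grad = self.requires_grad or other.requires_grad
    return MiniTensor(self.value * other.value, out_requires_grad)

w = MiniTensor(3.0, requires_grad=True)   # e.g. a weight we want to train
x = MiniTensor(2.0, requires_grad=False)  # e.g. input data
c = MiniTensor(5.0, requires_grad=False)  # e.g. a constant

print((w * x).requires_grad)  # True
print((x * c).requires_grad)  # False
```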

The next change is to abstract the algebraic operations. If you look at parts 1 and 2, the implementations of multiplication and addition are very similar along several axes: GPU versus CPU, autograd versus no-grad, and multiplication versus addition. In tinygrad, all of these boil down to a single dispatch call whose arguments are shifted around to cover the six sub-scenarios above.

To do that, we look at what `__add__` and `__mul__` have in common. First, both are binary operations, so on the CPU they can be represented as:

In [4]:
def binary_op_cpu(op, x, y):
  if op == 'ADD':
    ret = x + y
  elif op == 'MUL':
    ret = x * y
  else:
    raise ValueError(f"unsupported op: {op}")
  return ret

Things are a bit more complicated on the GPU:

In [5]:
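def binary_op_gpu(op, x, y, ret):
  # x, y and ret are cl.Buffer objects holding float32 values.
  if op == 'ADD':
    code = '+'
  elif op == 'MUL':
    code = '*'
  else:
    raise ValueError(f"unsupported op: {op}")

  # Doubled braces are literal braces inside an f-string.
  prg = cl.Program(cl_ctx, f"""
      __kernel void binary_op(
          __global const float *a_g, __global const float *b_g, __global float *res_g)
      {{
      int gid = get_global_id(0);
      res_g[gid] = a_g[gid] {code} b_g[gid];
      }}
      """).build()
  # ret.size is in bytes; each float32 takes 4, so this is one work-item per element.
  prg.binary_op(cl_queue, [ret.size//4], None, x, y, ret)
  return ret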
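
To check what the f-string expands to without needing an OpenCL device, the kernel text can be generated on its own. This is only a sketch: `kernel_source` and `OP_SYMBOLS` are hypothetical names introduced here, not part of the code above.

```python
OP_SYMBOLS = {'ADD': '+', 'MUL': '*'}  # same op names as in binary_op_cpu

def kernel_source(op):
  # Hypothetical helper: returns the OpenCL C source for one binary op.
  code = OP_SYMBOLS[op]  # raises KeyError for an unsupported op
  return f"""
__kernel void binary_op(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] {code} b_g[gid];
}}
"""

print(kernel_source('MUL'))
```

The only line that differs between the ADD and MUL kernels is the one that computes `res_g[gid]`, which is why a single template is enough for every element-wise binary op.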

Next is how to actually define `__mul__` and `__add__` on the tensor. We want the tensor to have these two methods and, when they are called, to automatically call the correct binary op after checking the device type. We also don't want to keep writing `__mul__`, `__add__`, `__truediv__` and so on as actual methods, but rather register them in a loop.

In [None]:
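class Tensor:
  # Simplified constructor for this cell: data is already on `device`.
  def __init__(self, data, device, requires_grad=True):
    self.data = data
    self.device = device
    self.requires_grad = requires_grad
    self.grad = None

def make_dispatch(op):
  # Build the method for one op, so each registered method knows which op it is.
  def dispatch(x, y):
    device = x.device
    requires_grad = any(t.requires_grad for t in (x, y))
    if device == 'CPU':
      ret = binary_op_cpu(op, x.data, y.data)
    elif device == 'GPU':
      # The output buffer has the same size in bytes as the input buffer.
      ret = cl.Buffer(cl_ctx, cl.mem_flags.READ_WRITE, x.data.size)
      ret = binary_op_gpu(op, x.data, y.data, ret)
    return Tensor(ret, device, requires_grad)
  return dispatch

ops = {"__add__": "ADD", "__mul__": "MUL"}
for name, op in ops.items():
  setattr(Tensor, name, make_dispatch(op))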

This removes a lot of boilerplate at the cost of some added complexity. Given that we may have many more than two operations, the overhead is well worth it.
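
One detail to watch when registering methods in a loop is that each registered function has to know which operation it stands for. A minimal sketch of one way to do that, using a hypothetical `Scalar` class and Python's standard `operator` module instead of the real Tensor and device code:

```python
import operator

class Scalar:
  """Hypothetical stand-in for Tensor that wraps one Python float."""
  def __init__(self, value):
    self.value = value

def make_method(fn):
  # A factory gives each method its own `fn`. A closure written directly in
  # the loop body would see only the last value of the loop variable.
  def method(self, other):
    return Scalar(fn(self.value, other.value))
  return method

for name, fn in {"__add__": operator.add, "__mul__": operator.mul}.items():
  setattr(Scalar, name, make_method(fn))

a, b = Scalar(2.0), Scalar(3.0)
print((a + b).value)  # 5.0
print((a * b).value)  # 6.0
```

Adding another operation then only means adding one more entry to the dictionary.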

Let's not forget that we also need to keep track of the computation graph for the backward pass when the operation involves a Tensor of interest, and not keep such information when it is a gradient Tensor or one with `requires_grad` intentionally set to False. So we need an extra layer of abstraction between Tensor and the ops, and this layer holds the backward method.

The concept is that we create an object that stores the information necessary for backpropagation, and if the Tensor does not require grad, we discard this object. In the subsequent backward call, we terminate when we do not see this object.
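
The idea can be sketched on plain floats before wiring it into the real classes. Everything below (`Node`, `MulCtx`, `mul`, and the recursive `backward`) is hypothetical and only illustrates the concept for multiplication, using the product rule: for z = x * y, the gradients are y * grad and x * grad.

```python
class Node:
  """Hypothetical scalar tensor: a value, a grad, and an optional context."""
  def __init__(self, value, requires_grad=True):
    self.value = value
    self.requires_grad = requires_grad
    self.grad = None
    self._ctx = None  # set only when the result requires grad

class MulCtx:
  """Stores what backward needs for z = x * y."""
  def __init__(self, x, y):
    self.parents = (x, y)

  def backward(self, grad_output):
    x, y = self.parents
    return y.value * grad_output, x.value * grad_output

def mul(x, y):
  requires_grad = x.requires_grad or y.requires_grad
  out = Node(x.value * y.value, requires_grad)
  if requires_grad:
    out._ctx = MulCtx(x, y)  # keep the graph edge
  return out                 # otherwise no context is stored at all

def backward(t, grad_output=1.0):
  if t.requires_grad:
    t.grad = grad_output if t.grad is None else t.grad + grad_output
  if t._ctx is None:
    return  # leaf or no-grad tensor: nothing further to walk
  for parent, g in zip(t._ctx.parents, t._ctx.backward(grad_output)):
    backward(parent, g)

a, b = Node(2.0), Node(3.0)
c = mul(a, b)
backward(c)
print(a.grad, b.grad)  # 3.0 2.0

d = mul(Node(2.0, requires_grad=False), Node(3.0, requires_grad=False))
print(d._ctx)  # None
```

The plain recursion is enough for this illustration. A fuller implementation would typically visit the graph in topological order so that each node is processed once.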

To start, dispatch will just call this middle layer:
```python
def dispatch(x, y):
  return fxn.apply(x, y)
```

Here `fxn` is the middle-layer class. All such classes share a base class called `Function`, with variants such as `Add` and `Mul`:

In [None]:
class Function:
  def __init__(self, device, requires_grad, x, y):
    self.device = device
    self.requires_grad = requires_grad
    self.x = x
    self.y = y
  @classmethod
  def apply(cls, x, y):
    device = x.device
    requires_grad = x.requires_grad or y.requires_grad
    ctx = cls(device, requires_grad, x, y)
    # Call forward through the class so ctx is passed exactly once.
    ret_data = cls.forward(ctx, x.data, y.data)
    ret = Tensor(ret_data, device, requires_grad)
    ret._ctx = ctx
    return ret

It took me a while to figure out this inversion of control. The awkwardness at first glance is the price of extracting a unified interface for as many operations as possible. Next, let's define the child classes, each of which has a `forward` and a `backward` method.

In [None]:
class Mul(Function):
  def forward(ctx, x, y):
    ctx.save_for_backward(x, y)
    return ctx.binary_op('MUL', x, y)

  def backward(ctx, grad_output):
    x,y = ctx.saved_tensors
    tmp = ctx.buffer(grad_output.shape)
    # skipping for now

As you can see, the base class needs two more methods: `save_for_backward` to save the tensors and `binary_op` to pick the implementation for the device. Let's look at the full code below:

In [None]:
class Function:
  def __init__(self, device, requires_grad, x, y):
    self.device = device
    self.requires_grad = requires_grad
    self.x = x
    self.y = y
    self.saved_tensors = []

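  @classmethod
  def apply(cls, x, y):
    device = x.device
    requires_grad = x.requires_grad or y.requires_grad
    ctx = cls(device, requires_grad, x, y)
    # Call forward through the class so ctx is passed exactly once.
    ret_data = cls.forward(ctx, x.data, y.data)
    ret = Tensor(ret_data, device, requires_grad)
    ret._ctx = ctx
    return ret

  def save_for_backward(self, *tensors):
    self.saved_tensors.extend(tensors)

  def binary_op(self, op, x, y):
    # Route to the implementation for this context's device.
    if self.device == 'CPU':
      return binary_op_cpu(op, x, y)
    elif self.device == 'GPU':
      # The GPU version needs an output buffer as large as the input (in bytes).
      ret = cl.Buffer(cl_ctx, cl.mem_flags.READ_WRITE, x.size)
      return binary_op_gpu(op, x, y, ret)

class Mul(Function):
  def forward(ctx, x, y):
    ctx.save_for_backward(x, y)
    return ctx.binary_op('MUL', x, y)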

  def backward(ctx, grad_output):
    x,y = ctx.saved_tensors
    tmp = ctx.buffer(grad_output.shape)
    # skipping for now

class 