All transformations learned by deep neural networks can be reduced to a handful of tensor operations (or tensor functions) applied to tensors of numeric data. It's possible to add tensors, multiply tensors, and so on.

### Element-wise operations

- The relu operation and addition are element-wise operations that are applied independently to each entry in the tensors being considered.
- This means these operations are highly amenable to massively parallel implementations (vectorized implementations).
- A naive Python implementation of an element-wise operation uses a for loop over the entries, as in the two functions below.

ReLU

In [2]:
def naive_relu(x):
    assert len(x.shape) == 2  # x is a rank-2 NumPy tensor
    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x

Addition

In [4]:
def naive_add(x,y):
    assert len(x.shape) == 2
    assert x.shape == y.shape
    x = x.copy()
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x

In practice, when dealing with NumPy arrays, these operations are available as well-optimized built-in NumPy functions. They run as compiled, vectorized loops and, where applicable, delegate the heavy lifting to a Basic Linear Algebra Subprograms (BLAS) implementation (low-level, highly parallel, efficient tensor-manipulation routines that are typically implemented in Fortran or C).

So in NumPy, you can do the following element-wise operations, and they will be blazing fast:

In [8]:
import time
import numpy as np
x = np.random.random((20,100))
y = np.random.random((20,100))

t0 = time.time()

for _ in range(1000):
    z = x + y
    z = np.maximum(z, 0)
    
print("Took: {0:2f} s".format(time.time() - t0))

Took: 0.008999 s


Meanwhile, the naive version takes several hundred times longer:

In [9]:
t0 = time.time()
for _ in range(1000):
    z = naive_add(x, y)
    z = naive_relu(z)
    
print("Took: {0:2f} s".format(time.time() - t0))

Took: 4.779998 s
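The timings only matter if both versions compute the same values. The following is a minimal sketch of such a check, not part of the original notebook. It assumes inputs shifted by 0.5 so that some sums are negative and the ReLU actually clips something, and the names `z_fast` and `z_slow` are purely illustrative.

```python
import numpy as np

# Shift the inputs so that roughly half of the sums are negative;
# with values in [0, 1) the ReLU would never change anything.
x = np.random.random((20, 100)) - 0.5
y = np.random.random((20, 100)) - 0.5

# Vectorized version: element-wise addition followed by element-wise ReLU
z_fast = np.maximum(x + y, 0)

# Loop-based version: the same computation, one entry at a time
z_slow = np.empty_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        z_slow[i, j] = max(x[i, j] + y[i, j], 0)

print(np.array_equal(z_fast, z_slow))
```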


Likewise, when running TensorFlow code on a GPU, element-wise operations are executed via fully vectorized CUDA implementations that can best utilize the highly parallel GPU chip architecture.

#### Broadcasting

- What happens with addition when the shapes of the two tensors being added differ?
- When possible, and if there's no ambiguity, the smaller tensor will be broadcast to match the shape of the larger tensor.

- Broadcasting consists of two steps:
	- Axes (called broadcast axes) are added to the smaller tensor to match the ndim of the larger tensor.
	- The smaller tensor is repeated along these new axes to match the full shape of the larger tensor.
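To make "when there's no ambiguity" more concrete, here is a small sketch using `np.broadcast_shapes` (available in NumPy 1.20 and later), which reports the shape two operands would broadcast to without allocating any data. NumPy aligns shapes from the trailing axis, and each pair of sizes must either match or contain a 1. The variable names are illustrative only.

```python
import numpy as np

# (32, 10) and (10,): the trailing axes match, so a leading axis can be added
compatible = np.broadcast_shapes((32, 10), (10,))
print(compatible)

# (32, 10) and (32,): the trailing axes are 10 and 32, which conflict
try:
    np.broadcast_shapes((32, 10), (32,))
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("cannot broadcast:", err)
```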

Let's look at a concrete example. Consider X with shape (32,10) and y with shape (10,):

In [20]:
import numpy as np
X = np.random.random((32,10))
y = np.random.random((10,))

In [21]:
X

array([[0.52040019, 0.11364975, 0.79379005, 0.2826918 , 0.81832712,
        0.30051143, 0.69614454, 0.95671423, 0.19789542, 0.80455648],
       [0.59027225, 0.26480535, 0.71874721, 0.38362018, 0.87490661,
        0.24829111, 0.92076838, 0.40430915, 0.77322922, 0.55622764],
       [0.8216292 , 0.65650194, 0.13979498, 0.04938335, 0.92196324,
        0.75124839, 0.16512904, 0.98321442, 0.89604952, 0.88466392],
       [0.56739504, 0.21704017, 0.19072295, 0.37317848, 0.48155987,
        0.68713767, 0.13410949, 0.49334919, 0.98036186, 0.35933354],
       [0.70157012, 0.4401658 , 0.90793161, 0.69475223, 0.1767852 ,
        0.15887621, 0.89699658, 0.27212976, 0.28393734, 0.23981372],
       [0.66217616, 0.76878214, 0.94084423, 0.46567931, 0.86021226,
        0.06449526, 0.72396059, 0.00609476, 0.9005034 , 0.2455131 ],
       [0.35030332, 0.7098137 , 0.93271561, 0.39008344, 0.68197308,
        0.27242743, 0.37274262, 0.80335723, 0.47622014, 0.03283213],
       [0.94184011, 0.64927805, 0.0535324

In [22]:
y

array([0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
       0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ])

First, we add an empty first axis to y, whose shape becomes (1,10):

In [23]:
y = np.expand_dims(y, axis=0)
y

array([[0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ]])

Then, we repeat y 32 times along this new axis, so that we end up with a tensor Y with shape (32,10), where Y[i, :] == y for i in range(0, 32):

In [26]:
Y = np.concatenate([y] * 32, axis=0)

In [27]:
Y

array([[0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ],
       [0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ],
       [0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ],
       [0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ],
       [0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ],
       [0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ],
       [0.65730044, 0.71045142, 0.53709828, 0.44315067, 0.66300613,
        0.82663649, 0.88057878, 0.9424155 , 0.84131204, 0.8494456 ],
       [0.65730044, 0.71045142, 0.5370982

- At this point, we can proceed to add X and Y, because they have the same shape.
- In terms of implementation, no new rank-2 tensor is created, because that would be terribly inefficient.
- The repetition operation is entirely virtual: it happens at the algorithmic level rather than at the memory level. But thinking of the vector being repeated 32 times along a new axis is a helpful mental model.
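One way to see that the repetition is virtual is to look at `np.broadcast_to`, which returns a read-only view rather than a copy. This is a sketch for illustration, not necessarily what NumPy does internally when adding two arrays; `vec` and `vec_view` are made-up names.

```python
import numpy as np

vec = np.random.random((10,))

# A (32, 10) view of the same 10 numbers; nothing is copied
vec_view = np.broadcast_to(vec, (32, 10))

print(vec_view.shape)                   # shape of the broadcast view
print(vec_view.strides)                 # stride 0 along the repeated axis
print(np.shares_memory(vec_view, vec))  # the view reuses vec's memory
```

A stride of 0 along the first axis means that every "row" of the view reads the same underlying memory, so the 32 repetitions cost no extra storage.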

Here's what a naive implementation would look like:

In [28]:
def naive_add_matrix_and_vector(x, y):
    assert len(x.shape) == 2  # x is a rank-2 NumPy tensor
    assert len(y.shape) == 1  # y is a NumPy vector
    assert x.shape[1] == y.shape[0]
    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]
    return x
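In everyday NumPy code, none of this has to be written by hand, since `X + y` broadcasts automatically. The sketch below, with freshly generated data, checks that the built-in result matches the explicit `expand_dims` and `concatenate` construction from the walkthrough.

```python
import numpy as np

X = np.random.random((32, 10))
y = np.random.random((10,))

# Explicit construction: add a leading axis, then repeat 32 times along it
Y = np.concatenate([np.expand_dims(y, axis=0)] * 32, axis=0)

# Built-in broadcasting: y is treated as if it had shape (32, 10)
Z = X + y

print(Z.shape)
print(np.array_equal(Z, X + Y))
```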