# Hacker's Guide to Neural Networks
* Python code w/ personal notes and experiments from [Andrej Karpathy's tutorial](http://karpathy.github.io/neuralnets/)

---
# Real-valued Circuits

### Circuit with Single Gate
$f\left(x,y\right)\ =\ xy$

In [1]:
def forwardMultiplyGate(x,y):
    return x * y

forwardMultiplyGate(-2, 3)

-6

#### Strategy 1: Random Local Search
* throw numbers at the wall and see what sticks

In [2]:
x, y = -2, 3
best_x, best_y = x, y
best_out = -float("inf")
tweak_amount = 0.01

In [3]:
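import random
for i in range(100):
    # one random offset in [-1, 1), shared by x and y in this run
    random_ = (random.random() * 2 - 1)
    x_try = x + tweak_amount * random_
    y_try = y + tweak_amount * random_
    out = forwardMultiplyGate(x_try, y_try)
    
    # keep the best trial seen so far
    if (out > best_out):
        best_out, best_x, best_y = out, x_try, y_try

# `out` is the output of the last trial, not best_out
best_x, best_y, out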

(-1.9900748582097127, 3.009925141790287, -5.993390245162583)
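
A variant of the search, as a minimal sketch: it draws a separate random offset for each input and reports the best output rather than the last one. The names `forward_multiply_gate` and `random_local_search`, and the `seed` parameter, are made up for this illustration and are not part of the tutorial.

```python
import random

def forward_multiply_gate(x, y):
    return x * y

def random_local_search(x, y, tweak_amount=0.01, trials=100, seed=0):
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    best_x, best_y = x, y
    best_out = forward_multiply_gate(x, y)  # start from the unperturbed output
    for _ in range(trials):
        # draw an independent offset in [-1, 1) for each input
        x_try = x + tweak_amount * (rng.random() * 2 - 1)
        y_try = y + tweak_amount * (rng.random() * 2 - 1)
        out = forward_multiply_gate(x_try, y_try)
        if out > best_out:
            best_out, best_x, best_y = out, x_try, y_try
    return best_x, best_y, best_out

best_x, best_y, best_out = random_local_search(-2, 3)
```

Because the search starts from the unperturbed output, `best_out` can never end up below the starting value of -6.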

#### Strategy 2: Numerical Gradient
* estimate each derivative by nudging one input at a time and measuring how the output changes

In [4]:
x, y = -2, 3
out = forwardMultiplyGate(x, y)
h = 0.0001

out_x = forwardMultiplyGate(x + h, y)
x_derivative = float((out_x - out) / h)
out_y = forwardMultiplyGate(x, y + h)
y_derivative = float((out_y - out) / h)

x_derivative, y_derivative

(3.00000000000189, -2.0000000000042206)

$\frac{df\left(x,y\right)}{dx}=\frac{f\left(x+h,\ y\right)\ -\ f\left(x,y\right)}{h}$

Think of the derivative as a signal for each input. Its sign says which way to move that input to increase the function: if it is (+), increasing the input increases the output; if it is (-), decreasing the input does. Its magnitude says how strongly the output responds. In other words, it is the slope we measure when we nudge the input by h and compare against the original output.
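
The sign reading above can be checked directly. This is a small sketch with a hypothetical helper, `numerical_gradient`, that applies the same forward difference to any function and list of inputs.

```python
def forward_multiply_gate(x, y):
    return x * y

def numerical_gradient(f, inputs, h=1e-4):
    """Forward-difference estimate of the derivative with respect to each input."""
    base = f(*inputs)
    grads = []
    for i in range(len(inputs)):
        nudged = list(inputs)
        nudged[i] += h  # nudge only the i-th input
        grads.append((f(*nudged) - base) / h)
    return grads

x, y, h = -2.0, 3.0, 1e-4
grad_x, grad_y = numerical_gradient(forward_multiply_gate, [x, y], h)

# positive derivative for x: raising x raises the output
raises_output = forward_multiply_gate(x + h, y) > forward_multiply_gate(x, y)
# negative derivative for y: raising y lowers the output
lowers_output = forward_multiply_gate(x, y + h) < forward_multiply_gate(x, y)
```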

In [5]:
step_size = 0.01
x += step_size * x_derivative
y += step_size * y_derivative
out_new = forwardMultiplyGate(x, y)

out_new

-5.87059999999986

The step size is the key here. With a (+) step size, each input moves along its derivative, which _increases_ the function a little at a time. Flip it to (-) and the inputs move the opposite way, so the function decreases. The step size scales how far we move; the derivatives supply the direction, whatever their signs.
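
To make the effect of the sign concrete, here is a rough sketch that takes one step with a positive and one with a negative step size from the same starting point. The function names `f`, `gradient`, and `take_step` are assumptions for this example only.

```python
def f(x, y):
    return x * y

def gradient(x, y, h=1e-4):
    # forward-difference estimate of (df/dx, df/dy)
    out = f(x, y)
    return (f(x + h, y) - out) / h, (f(x, y + h) - out) / h

def take_step(x, y, step_size):
    dx, dy = gradient(x, y)
    return x + step_size * dx, y + step_size * dy

x, y = -2.0, 3.0
base = f(x, y)
up = f(*take_step(x, y, 0.01))     # positive step: follows the gradient
down = f(*take_step(x, y, -0.01))  # negative step: moves against the gradient
```

With these inputs, `up` lands above `base` and `down` lands below it.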

Try more iterations to see whether following the gradient really keeps increasing the function.

It does! Switching the step to (-)step moves against the gradient instead.

In [6]:
step_size = 0.01
x, y = -1, 1
h = 0.0001

for i in range(10):
    
    out = forwardMultiplyGate(x, y)
    
    out_x = forwardMultiplyGate(x + h, y)
    x_derivative = (out_x - out) / h
    
    out_y = forwardMultiplyGate(x, y + h)
    y_derivative = (out_y - out) / h
    
    x += step_size * x_derivative
    y += step_size * y_derivative
    out_new = forwardMultiplyGate(x, y)
    
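    # note: the "old x" / "old y" slots are filled with out_x and out_y,
    # i.e. f(x + h, y) and f(x, y + h), not the previous values of x and y
    print "old x: %s, x derivative: %s, new x: %s\nold y: %s, y derivative: %s, new y: %s\nout: %s\n" % \
    (out_x, x_derivative, x, out_y, y_derivative, y, out_new)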

old x: -0.9999, x derivative: 1.0, new x: -0.99
old y: -1.0001, y derivative: -1.0, new y: 0.99
out: -0.9801

old x: -0.980001, x derivative: 0.99, new x: -0.9801
old y: -0.980199, y derivative: -0.99, new y: 0.9801
out: -0.96059601

old x: -0.960498, x derivative: 0.9801, new x: -0.970299
old y: -0.96069402, y derivative: -0.9801, new y: 0.970299
out: -0.941480149401

old x: -0.941383119501, x derivative: 0.970299, new x: -0.96059601
old y: -0.941577179301, y derivative: -0.970298999999, new y: 0.96059601
out: -0.922744694428

old x: -0.922648634827, x derivative: 0.96059601, new x: -0.9509900499
old y: -0.922840754029, y derivative: -0.96059601, new y: 0.9509900499
out: -0.904382075009

old x: -0.904286976004, x derivative: 0.950990049901, new x: -0.941480149401
old y: -0.904477174014, y derivative: -0.950990049899, new y: 0.941480149401
out: -0.886384871716

old x: -0.886290723701, x derivative: 0.941480149401, new x: -0.932065347907
old y: -0.886479019731, y derivative: -0.94148014

#### Strategy 3: Analytical Gradient
* for our function, the partial derivatives with respect to x and y turn out to be y and x, respectively.

$\frac{df\left(x,y\right)}{dx}=\frac{\left(x+h\right)y\ -\ xy}{h}=y$

Instead of probing with h, compute the derivatives directly at each step from the closed-form expression.
Note that the sign of a derivative gives the direction of change: it is the rate at which the output changes when a variable is increased a little. If that rate is (+), the output grows; if it is (-), the output shrinks.
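
Spelled out step by step, in the same notation as the formula above, the h cancels for both inputs:

$\frac{df\left(x,y\right)}{dx}=\frac{\left(x+h\right)y\ -\ xy}{h}=\frac{xy+hy-xy}{h}=\frac{hy}{h}=y$

$\frac{df\left(x,y\right)}{dy}=\frac{x\left(y+h\right)\ -\ xy}{h}=\frac{xy+xh-xy}{h}=\frac{xh}{h}=x$

At x = -2, y = 3 this gives 3 and -2, which is what the numerical estimate in In [4] returned up to floating-point error.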

In [7]:
x, y = -2, 3 # re-initialize (step_size is still 0.01 from the earlier cell)
x_gradient, y_gradient = y, x # analytical gradient: df/dx = y, df/dy = x

x += step_size * x_gradient
y += step_size * y_gradient
out_new = forwardMultiplyGate(x, y)

out_new

-5.8706

### Circuits with Multiple Gates

In [8]:
def forwardMultiplyGate(a, b): return a * b
def forwardAddGate(a, b): return a + b

def forwardCircuit(x, y, z):
    q = forwardAddGate(x, y)
    f = forwardMultiplyGate(q, z) 
    return f

x, y, z = -2, 5, -4
forwardCircuit(x, y, z)

-12

#### Backpropagation
* the chain rule is really, really useful

In [9]:
x, y, z = -2, 5, -4
q = forwardAddGate(x, y)
f = forwardMultiplyGate(q, z)
print f

derivative_f_wrt_z = q
derivative_f_wrt_q = z

derivative_q_wrt_x = 1
derivative_q_wrt_y = 1

derivative_f_wrt_x = derivative_f_wrt_q * derivative_q_wrt_x
derivative_f_wrt_y = derivative_f_wrt_q * derivative_q_wrt_y

step_size = 0.01
x += step_size * derivative_f_wrt_x
y += step_size * derivative_f_wrt_y
z += step_size * derivative_f_wrt_z

print forwardMultiplyGate(forwardAddGate(x, y), z)

-12
-11.5924


One way to think about this is as a function-fitting problem (in our simple case, the gradients just increase the function, which doesn't optimize anything yet). Backpropagation applies the chain rule to tell each gate (or layer) how much, and in which direction, to adjust its inputs to change the final output. It is a backward pass in which each gate passes its gradient on to the gates before it. Formally, the chain rule gives the sensitivity of the final output to each weight (and every other variable). Each gate only communicates locally with its neighbours, yet together they produce the desired change globally.
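
One way to gain confidence in the chain-rule gradients is to compare them against forward differences on the whole circuit. The sketch below does that for the same inputs; the function names are invented for this example.

```python
def forward_circuit(x, y, z):
    q = x + y
    return q * z

def backprop_gradients(x, y, z):
    q = x + y
    df_dq, df_dz = z, q      # multiply gate: each input's gradient is the other input
    dq_dx, dq_dy = 1.0, 1.0  # add gate: both inputs pass the gradient through
    # chain rule: multiply the local gradients along each path
    return df_dq * dq_dx, df_dq * dq_dy, df_dz

def numerical_gradients(x, y, z, h=1e-4):
    base = forward_circuit(x, y, z)
    return (
        (forward_circuit(x + h, y, z) - base) / h,
        (forward_circuit(x, y + h, z) - base) / h,
        (forward_circuit(x, y, z + h) - base) / h,
    )

analytic = backprop_gradients(-2.0, 5.0, -4.0)
numeric = numerical_gradients(-2.0, 5.0, -4.0)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```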

Let's try with more iterations.

In [10]:
x, y, z = -2, 5, -4
step_size = 0.01

for i in range(20):
    
    # q is not recomputed inside this loop; it keeps the value from In [9] (q = 3)
    dfdz = q
    dfdq = z

    dqdx = 1
    dqdy = 1

    dfdx = dfdq * dqdx
    dfdy = dfdq * dqdy

    x += step_size * dfdx
    y += step_size * dfdy
    z += step_size * dfdz
    
    print forwardMultiplyGate(forwardAddGate(x, y), z)

-11.5924
-11.191964
-10.798638
-10.412368
-10.0331
-9.66078
-9.295354
-8.936768
-8.584968
-8.2399
-7.90151
-7.569744
-7.244548
-6.925868
-6.61365
-6.30784
-6.008384
-5.715228
-5.428318
-5.1476
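
The loop above reads `q` as it was left by the earlier cell. As a hedged alternative, the sketch below recomputes `q` from the current `x` and `y` on every iteration, so the gradient for `z` tracks the inputs as they move. The function name `ascend` is an assumption, and its later outputs will differ slightly from the numbers printed above.

```python
def ascend(x, y, z, step_size=0.01, steps=20):
    outputs = []
    for _ in range(steps):
        q = x + y            # forward pass: refresh q from the current inputs
        df_dz, df_dq = q, z  # multiply gate
        df_dx = df_dq * 1.0  # add gate passes the gradient through
        df_dy = df_dq * 1.0
        x += step_size * df_dx
        y += step_size * df_dy
        z += step_size * df_dz
        outputs.append((x + y) * z)
    return outputs

outputs = ascend(-2.0, 5.0, -4.0)
```

The first entry matches the first printed value, since `q` is still 3 on the first pass.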


Now, experiment with the chain rule by adding a basic cost function at the end.

Result: it works! The inputs move to minimize the cost instead of just ascending the function  
(c gradually decreases, so f gets closer to k). Beautiful.

In [11]:
x, y, z, k = -1, 3, 4, 6
step_size = 0.001

for i in range(20):    
   
    f = forwardMultiplyGate(forwardAddGate(x, y), z)
    c = ((f - k) ** 2) / 2
    
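    # q is not recomputed here; it keeps the value from In [9] (q = 3),
    # while x + y is about 2 in this cell
    dfdz = q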
    dfdq = z

    dqdx = 1
    dqdy = 1

    dfdx = dfdq * dqdx
    dfdy = dfdq * dqdy
    
    dcdf = k - f # negative of dc/df = f - k, so adding it below descends the cost instead of ascending it
    dcdx = dcdf * dfdx
    dcdy = dcdf * dfdy
    dcdz = dcdf * dfdz

    x += step_size * dcdx
    y += step_size * dcdy
    z += step_size * dcdz

    print "x: %s\t y: %s\t z: %s\nf: %s\t c: %s\n" % (round(x, 4), round(y, 4), round(z, 4), round(f, 4), round(c, 4))

x: -1.008	 y: 2.992	 z: 3.994
f: 8.0	 c: 2.0

x: -1.0157	 y: 2.9843	 z: 3.9882
f: 7.9241	 c: 1.8511

x: -1.0231	 y: 2.9769	 z: 3.9827
f: 7.8513	 c: 1.7137

x: -1.0302	 y: 2.9698	 z: 3.9773
f: 7.7816	 c: 1.587

x: -1.037	 y: 2.963	 z: 3.9722
f: 7.7147	 c: 1.4701

x: -1.0435	 y: 2.9565	 z: 3.9672
f: 7.6506	 c: 1.3622

x: -1.0498	 y: 2.9502	 z: 3.9625
f: 7.589	 c: 1.2625

x: -1.0559	 y: 2.9441	 z: 3.9579
f: 7.5299	 c: 1.1703

x: -1.0617	 y: 2.9383	 z: 3.9535
f: 7.4732	 c: 1.0852

x: -1.0673	 y: 2.9327	 z: 3.9492
f: 7.4188	 c: 1.0064

x: -1.0727	 y: 2.9273	 z: 3.9451
f: 7.3665	 c: 0.9336

x: -1.0779	 y: 2.9221	 z: 3.9412
f: 7.3162	 c: 0.8663

x: -1.0829	 y: 2.9171	 z: 3.9373
f: 7.268	 c: 0.8039

x: -1.0877	 y: 2.9123	 z: 3.9337
f: 7.2216	 c: 0.7462

x: -1.0924	 y: 2.9076	 z: 3.9302
f: 7.1771	 c: 0.6927

x: -1.0968	 y: 2.9032	 z: 3.9267
f: 7.1342	 c: 0.6432

x: -1.1011	 y: 2.8989	 z: 3.9235
f: 7.093	 c: 0.5974

x: -1.1053	 y: 2.8947	 z: 3.9203
f: 7.0534	 c: 0.5549

x: -1.1092	 y: 2.8908	 z:
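
A possible cleanup of the cost experiment, offered as a sketch rather than a replacement: it recomputes `q` each iteration, keeps `dc_df` as the true derivative `f - k`, and subtracts the gradient to descend. The name `fit_to_target` is hypothetical, and the values it produces will not match the printed output exactly.

```python
def fit_to_target(x, y, z, k, step_size=0.001, steps=20):
    history = []
    for _ in range(steps):
        # forward pass
        q = x + y
        f = q * z
        c = (f - k) ** 2 / 2.0
        history.append((f, c))
        # backward pass
        dc_df = f - k                  # derivative of the cost with respect to f
        df_dx, df_dy, df_dz = z, z, q  # chain rule through the add and multiply gates
        # gradient descent: step against the gradient of the cost
        x -= step_size * dc_df * df_dx
        y -= step_size * dc_df * df_dy
        z -= step_size * dc_df * df_dz
    return x, y, z, history

x, y, z, history = fit_to_target(-1.0, 3.0, 4.0, 6.0)
costs = [c for _, c in history]
```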

### Single Neuron
* using sigmoid activation

In [12]:
class Unit(object):
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad
        
class MultiplyGate(object):
    def forward(self, u0, u1):
        self.u0 = u0
        self.u1 = u1
        self.utop = Unit(u0.value * u1.value, 0.0)
        return self.utop
    
    def backward(self):
        self.u0.grad += self.u1.value * self.utop.grad
        self.u1.grad += self.u0.value * self.utop.grad
        
class AddGate(object):
    def forward(self, u0, u1):
        self.u0 = u0
        self.u1 = u1
        self.utop = Unit(u0.value + u1.value, 0.0)
        return self.utop
    
    def backward(self):
        self.u0.grad += 1 * self.utop.grad
        self.u1.grad += 1 * self.utop.grad
    
x = Unit(2,0)
y = Unit(-3,0)
print x.value, x.grad

m = MultiplyGate()
print m.forward(x,y).value

a = AddGate()
print a.forward(x, y).value

2 0
-6
-1


sigmoid function    
$sig\left(x\right)\ =\ \frac{1}{1+e^{-x}}$

derivative   
$\frac{dsig\left(x\right)}{dx}=\ sig\left(x\right)\left(1-sig\left(x\right)\right)$

In [13]:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

class SigmoidGate(object):
    def forward(self, u0):
        self.u0 = u0
        self.utop = Unit(sigmoid(self.u0.value), 0.0)
        return self.utop
    
    def backward(self):
        s = sigmoid(self.u0.value)
        self.u0.grad += (s * (1 - s)) * self.utop.grad

print sigmoid(3)
sg = SigmoidGate()
print sg.forward(x).value

0.952574126822
0.880797077978
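
The derivative identity for the sigmoid can be sanity-checked the same way as the earlier gates, with a forward difference. This is only a sketch; `sigmoid_derivative` and `forward_difference` are names chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward_difference(f, x, h=1e-4):
    return (f(x + h) - f(x)) / h

points = [-3.0, 0.0, 2.0, 3.0]
max_error = max(
    abs(sigmoid_derivative(p) - forward_difference(sigmoid, p)) for p in points
)
```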


__Neuron (forward pass)__  
* dot product of inputs & weights, plus c (bias term), then fed to the sigmoid
* it yields a single value

__Neuron (backward pass)__   
* update the inputs and weights with the computed gradients; the gates must run backward in the reverse of the forward order
* start with gradient 1 at the final operation, which is the sigmoid
* it yields updates to every value (not just a single value)
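
For a neuron this small, both passes can also be written in closed form, which may help when reading the gate-based version. The sketch below assumes the same five values a, b, c, x, y; the function names are illustrative only.

```python
import math

def neuron_forward(a, b, c, x, y):
    # dot product plus bias, then sigmoid
    return 1.0 / (1.0 + math.exp(-(a * x + b * y + c)))

def neuron_backward(a, b, c, x, y):
    s = neuron_forward(a, b, c, x, y)
    ds = s * (1.0 - s)  # sigmoid's local gradient, with gradient 1 at the output
    # chain rule through the add gates (factor 1) and the multiply gates
    return {"a": x * ds, "b": y * ds, "c": ds, "x": a * ds, "y": b * ds}

out = neuron_forward(1.0, 2.0, -3.0, -1.0, 3.0)
grads = neuron_backward(1.0, 2.0, -3.0, -1.0, 3.0)
```

Each gradient is the sigmoid's slope times whatever the value was multiplied by in the forward pass (or 1 for the bias).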

In [14]:
class Neuron(object):
    
    def __init__(self):
        self.mulg0, self.mulg1 = MultiplyGate(), MultiplyGate()
        self.addg0, self.addg1 = AddGate(), AddGate()
        self.sg0 = SigmoidGate()

    def forward(self, a, b, c, x, y):
        ax = self.mulg0.forward(a, x)
        by = self.mulg1.forward(b, y)
        axby = self.addg0.forward(ax, by)
        axbyc = self.addg1.forward(axby, c)
        self.s = self.sg0.forward(axbyc)
        return self.s
    
    def backward(self, a, b, c, x, y):
        step_size = 0.01
        self.s.grad = 1.0
        
        self.sg0.backward()
        self.addg1.backward()
        self.addg0.backward()
        self.mulg1.backward()
        self.mulg0.backward()
        
        a.value += step_size * a.grad
        b.value += step_size * b.grad
        c.value += step_size * c.grad
        x.value += step_size * x.grad
        y.value += step_size * y.grad
        
        return "\nvalues:\na: %s, x: %s\nb: %s, y: %s\nc: %s\n" % (a.value, x.value, b.value, y.value, c.value) 

In [15]:
# input
a, b = Unit(1.0, 0.0), Unit(2.0, 0.0)
x, y = Unit(-1.0, 0.0), Unit(3.0, 0.0)
c = Unit(-3.0, 0.0)

# process / output
n = Neuron()
print "forward pass 1: ", n.forward(a, b, c, x, y).value
print "\nbackward pass 1: ", n.backward(a, b, c, x, y)
print "forward pass 2: ", n.forward(a, b, c, x, y).value

forward pass 1:  0.880797077978

backward pass 1:  
values:
a: 0.998950064146, x: -0.998950064146
b: 2.00314980756, y: 3.00209987171
c: -2.99895006415

forward pass 2:  0.882550181622


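Numerical Gradients Check
* just checking that the gradients above are computed correctly.
* they match the gradients implied by the backward pass (value change divided by step size) to within about 4e-5, the error expected from a forward difference with h = 0.0001!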

In [16]:
a, b, c, x, y = 1, 2, -3, -1, 3
h = 0.0001

def forwardCircuitFast(a, b, c, x, y):
    return 1 / (1 + math.exp(-(a*x + b*y + c)))

a_grad = (forwardCircuitFast(a+h, b, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h
b_grad = (forwardCircuitFast(a, b+h, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h
c_grad = (forwardCircuitFast(a, b, c+h, x, y) - forwardCircuitFast(a, b, c, x, y)) / h
x_grad = (forwardCircuitFast(a, b, c, x+h, y) - forwardCircuitFast(a, b, c, x, y)) / h
y_grad = (forwardCircuitFast(a, b, c, x, y+h) - forwardCircuitFast(a, b, c, x, y)) / h

print a_grad, b_grad, c_grad, x_grad, y_grad

-0.104997583592 0.314944774835 0.104989587345 0.104989587345 0.209971178827
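
The small gaps in the forward-difference estimates shrink if the difference is centered around the point instead. Below is a sketch of that check against the closed-form gradients; `centered_gradients` is a hypothetical helper, and centered differences are a swapped-in technique, not what the cell above uses.

```python
import math

def forward_circuit_fast(a, b, c, x, y):
    return 1.0 / (1.0 + math.exp(-(a * x + b * y + c)))

def centered_gradients(f, params, h=1e-4):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        # slope across the point, using both sides
        grads.append((f(*up) - f(*down)) / (2.0 * h))
    return grads

a, b, c, x, y = 1.0, 2.0, -3.0, -1.0, 3.0
s = forward_circuit_fast(a, b, c, x, y)
ds = s * (1.0 - s)
analytic = [x * ds, y * ds, ds, a * ds, b * ds]  # order: a, b, c, x, y
centered = centered_gradients(forward_circuit_fast, [a, b, c, x, y])
max_gap = max(abs(n - g) for n, g in zip(centered, analytic))
```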
