In [1]:
# http://karpathy.github.io/neuralnets/
#
# Via Hacker News:

# This eventually turned into Andrej Karpathy's class at
# Stanford, CS231n. The class notes are here:
# http://cs231n.github.io/
#
# Much of the compute graph and backprop material in the
# Hacker's Guide is covered in this specific class,
# starting at about this time:
# https://www.youtube.com/watch?v=i94OvYb6noo&t=207s

# Note: This guide by Karpathy has been converted from JS to Python by jbnunn

In [4]:
"""
f(x,y)=xy
"""
def forward_multiply_gate(x, y):
    return x*y

forward_multiply_gate(2, -3)

-6

In [5]:
"""
The problem we are interested in studying looks as follows:

We provide a given circuit some specific input values (e.g. x = -2, y = 3)
The circuit computes an output value (e.g. -6)
The core question then becomes: How should one tweak the input slightly to increase the output?

In this case, in what direction should we change x,y to get a number larger than -6? 
Note that, for example, x = -1.99 and y = 2.99 gives x * y = -5.95, which is higher than -6.0. 
Don’t get confused by this: -5.95 is better (higher) than -6.0.
It’s an improvement of 0.05, even though the magnitude of -5.95 (the distance from zero) happens to be lower.
"""

forward_multiply_gate(-1.99, 2.99)

-5.950100000000001

In [7]:
## Strategy 1: Random Local Search
import random
x = -2
y = 3

# Try changing x, y by small random amounts and keep track of what works
tweak_amount = 0.01
best_out = -float('inf')  # negative infinity, so any real output beats it
best_x = x
best_y = y

for k in range(100):
    # random.random() is uniform in [0, 1), so each tweak lies in
    # [-tweak_amount, tweak_amount)
    x_try = x + tweak_amount * (random.random() * 2 - 1)
    y_try = y + tweak_amount * (random.random() * 2 - 1)
    out = forward_multiply_gate(x_try, y_try)

    if out > best_out:
        # Best improvement so far: remember it
        best_out = out
        best_x = x_try
        best_y = y_try

# We should end up with something greater than -6. The exact values vary
# from run to run, but the total cannot exceed about -5.95.
print("Best X: {}, Best Y: {}, Total: {}".format(best_x, best_y, best_x * best_y))


In [8]:
"""
So, we’re done, right? Not quite: This is a perfectly fine strategy 
for tiny problems with a few gates if you can afford the compute 
time, but it won’t do if we want to eventually consider huge 
circuits with millions of inputs. 

It turns out that we can do much better.
"""    

## Strategy 2: Numerical Gradient

"""
Here's how to imagine this: picture pulling on the output value to make it
larger. The pull translates into a force on x that pushes the output above -6, e.g.
"""
forward_multiply_gate(x+1, y)

-3

In [9]:
"""
We've just learned about the "derivative" of the output value with respect to 
its inputs (x and y).

The derivative can be thought of as a force on each input as we pull on the 
output to become higher.

It's a very simple procedure. Instead of pulling on the circuit’s output, we’ll 
iterate over every input one by one, increase it very slightly and look at what 
happens to the output value. The amount the output changes in response is the 
derivative.

Here's the formula for the derivative with respect to x:

∂f(x, y)     f(x + h, y) - f(x, y)
--------  =  ---------------------
   ∂x                  h


A "derivative" is with respect to a single input. The gradient is the collection
of ALL the derivatives, represented as a concatenated list (a vector, not shown).

"""
x = -2
y = 3
h = 0.0001

derivative_x = (forward_multiply_gate(x + h, y) - forward_multiply_gate(x, y)) / h
print(derivative_x)

derivative_y = (forward_multiply_gate(x, y + h) - forward_multiply_gate(x, y)) / h
print(derivative_y)

# See what happens when we turn the knob x to x + h
x_h = forward_multiply_gate(x + h, y)
print("x+h: ", x_h)

# See what happens when we turn the knob y to y + h
y_h = forward_multiply_gate(x, y + h)
print("y+h: ", y_h)

out = forward_multiply_gate(derivative_x, derivative_y)  # product of the two derivatives, about 3 * -2
print(out)

3.00000000000189
-2.0000000000042206
x+h:  -5.9997
y+h:  -6.0002
-6.000000000016442
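
The same probing procedure extends to any number of inputs. The sketch below is one possible generalization, not part of the original guide: the helper name `numerical_gradient` and its signature are assumptions made for illustration. It nudges one input at a time by `h` and collects the resulting derivatives into a list, which is the gradient.

```python
def forward_multiply_gate(x, y):
    return x * y


def numerical_gradient(f, inputs, h=1e-4):
    """Approximate the gradient of f at `inputs` using forward differences.

    Hypothetical helper for illustration; f is called once for the base
    value and once more per input.
    """
    base = f(*inputs)
    grad = []
    for i in range(len(inputs)):
        bumped = list(inputs)
        bumped[i] += h  # nudge only the i-th input
        grad.append((f(*bumped) - base) / h)
    return grad


grad = numerical_gradient(forward_multiply_gate, [-2, 3])
print(grad)  # close to [3.0, -2.0], up to floating-point error
```

Because the loop evaluates the circuit once per input, the cost grows linearly with the number of inputs.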


In [10]:
step_size = 0.01
x = -2
y = 3
output = forward_multiply_gate(x, y) # Before, -6
x = x + (step_size * derivative_x) # x becomes -1.97
y = y + (step_size * derivative_y) # y becomes 2.98

output_new = forward_multiply_gate(x, y) # -5.87, which achieves the goal of being greater than the original -6
print(output_new)

-5.87059999999986
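
A single step already improves the output. As a rough sketch, assuming we simply repeat the same update with a freshly computed numerical gradient each time, the output keeps climbing. The helper `numerical_gradient_xy` and the choice of 5 steps are illustrative assumptions, not part of the original guide.

```python
def forward_multiply_gate(x, y):
    return x * y


def numerical_gradient_xy(x, y, h=1e-4):
    # Hypothetical helper: forward-difference derivatives for both inputs
    out = forward_multiply_gate(x, y)
    dx = (forward_multiply_gate(x + h, y) - out) / h
    dy = (forward_multiply_gate(x, y + h) - out) / h
    return dx, dy


x, y = -2.0, 3.0
step_size = 0.01
outputs = [forward_multiply_gate(x, y)]  # starts at -6

for _ in range(5):
    dx, dy = numerical_gradient_xy(x, y)
    x += step_size * dx  # move each input along its derivative
    y += step_size * dy
    outputs.append(forward_multiply_gate(x, y))

print(outputs)  # each entry is larger than the one before it
```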


In [None]:
## Strategy 3: Analytic Gradient
"""
In the previous section we evaluated the gradient by probing the 
circuit’s output value, independently for every input. This procedure 
gives you what we call a numerical gradient. This approach, however, 
is still expensive because we need to compute the circuit’s output as 
we tweak every input value independently a small amount. So the 
complexity of evaluating the gradient is linear in the number of inputs. 
But in practice we will have hundreds, thousands or (for neural networks) 
even tens to hundreds of millions of inputs, and the circuits aren’t 
just one multiply gate but huge expressions that can be expensive to 
compute. We need something better.

Luckily, there is an easier and much faster way to compute the gradient: 
we can use calculus to derive a direct expression for it that will be as
simple to evaluate as the circuit’s output value. We call this an analytic
gradient and there will be no need for tweaking anything.

Final point: The analytic derivative requires no tweaking of the inputs. 
It can be derived using mathematics (calculus).
"""

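For the multiply gate the calculus is short: with f(x, y) = xy, the derivative with respect to x is y, and the derivative with respect to y is x. The sketch below uses a hypothetical helper name, `analytic_gradient_multiply`, and checks the result against the numerical estimate from Strategy 2.

```python
def forward_multiply_gate(x, y):
    return x * y


def analytic_gradient_multiply(x, y):
    # For f(x, y) = x * y: df/dx = y and df/dy = x
    return y, x


x, y = -2, 3
h = 0.0001

grad_x, grad_y = analytic_gradient_multiply(x, y)

# Numerical estimates, for comparison only
numeric_x = (forward_multiply_gate(x + h, y) - forward_multiply_gate(x, y)) / h
numeric_y = (forward_multiply_gate(x, y + h) - forward_multiply_gate(x, y)) / h
print(grad_x, numeric_x)
print(grad_y, numeric_y)

# Same update as before, with no probing of the circuit
step_size = 0.01
output_new = forward_multiply_gate(x + step_size * grad_x, y + step_size * grad_y)
print(output_new)  # about -5.8706
```

The analytic version reads the gradient directly from the inputs, so it needs no extra evaluations of the circuit.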
