# Hacker's Guide to Neural Networks
* Python code w/ personal notes and experiments from [Andrej Karpathy's tutorial](http://karpathy.github.io/neuralnets/)

---
# Real-valued Circuits

### Circuit with Single Gate
$f\left(x,y\right)\ =\ xy$

In [1]:
def forwardMultiplyGate(x,y):
    return x * y

forwardMultiplyGate(-2, 3)

-6

#### Strategy 1: Random Local Search
* throw numbers at the wall and see what sticks

In [2]:
x, y = -2, 3
best_x, best_y = x, y
best_out = -float("inf")
tweak_amount = 0.01

In [3]:
import random
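for i in range(100):
    # one random offset in [-1, 1), applied to both x and y;
    # Karpathy's original draws an independent offset for each input
    random_ = (random.random() * 2 - 1)
    x_try = x + tweak_amount * random_
    y_try = y + tweak_amount * random_
    out = forwardMultiplyGate(x_try, y_try)
    
    # keep the candidate only if it beats the best output so far
    if (out > best_out):
        best_out, best_x, best_y = out, x_try, y_try

# note: `out` is the output of the last trial, not best_out
best_x, best_y, out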

(-1.9901778661695548, 3.009822133830445, -6.005996011041122)
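
A variant of the search above, as a minimal sketch: it draws an independent random offset for x and for y, and it returns `best_out` rather than the output of the last trial. The name `random_local_search` and the `seed` parameter are my own additions (the seed is there only so that repeated runs agree).

```python
import random

def forwardMultiplyGate(x, y):
    return x * y

def random_local_search(x, y, tweak_amount=0.01, tries=100, seed=0):
    """Hypothetical helper: keep the best of `tries` random nudges around (x, y)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    best_x, best_y = x, y
    best_out = forwardMultiplyGate(x, y)  # start from the un-nudged output
    for _ in range(tries):
        # independent offsets in [-tweak_amount, tweak_amount) for each input
        x_try = x + tweak_amount * (rng.random() * 2 - 1)
        y_try = y + tweak_amount * (rng.random() * 2 - 1)
        out = forwardMultiplyGate(x_try, y_try)
        if out > best_out:
            best_out, best_x, best_y = out, x_try, y_try
    return best_x, best_y, best_out

best_x, best_y, best_out = random_local_search(-2, 3)
```

Because `best_out` starts at the un-nudged output, the result can never be worse than the starting point.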

#### Strategy 2: Numerical Gradient
* finding the derivatives by tweaking the knobs for each pass

In [4]:
x, y = -2, 3
out = forwardMultiplyGate(x, y)
h = 0.0001

out_x = forwardMultiplyGate(x + h, y)
x_derivative = float((out_x - out) / h)
out_y = forwardMultiplyGate(x, y + h)
y_derivative = float((out_y - out) / h)

x_derivative, y_derivative

(3.00000000000189, -2.0000000000042206)

$\frac{df\left(x,y\right)}{dx}=\lim_{h\to0}\frac{f\left(x+h,\ y\right)\ -\ f\left(x,y\right)}{h}$

The code above approximates this limit with a small, fixed h.

Think of the derivative as a signal for each input. Its sign says which way to nudge that input to make the output go up, and its magnitude says how strongly the output responds. A (+) derivative means that increasing the input increases the function; a (-) derivative means that increasing the input decreases it. Taken together, the derivatives of all inputs pull the function in the direction where it increases (hence, the gradient). The nudge h is just the small probe used to measure that slope on the original function.
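
To see the signs and magnitudes concretely, here is a small sketch that swaps the forward difference used above for a centered difference (probing h on both sides of the point). The helper name `numerical_gradient` is mine, not from the tutorial.

```python
def forwardMultiplyGate(x, y):
    return x * y

def numerical_gradient(f, x, y, h=0.0001):
    """Centered-difference estimate of (df/dx, df/dy) at (x, y)."""
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy

x_derivative, y_derivative = numerical_gradient(forwardMultiplyGate, -2, 3)
# x_derivative is close to 3: nudging x up raises the output
# y_derivative is close to -2: nudging y up lowers the output
```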

In [5]:
step_size = 0.01
x += step_size * x_derivative
y += step_size * y_derivative
out_new = forwardMultiplyGate(x, y)

out_new

-5.87059999999986

The step size is the key here. When it is (+), each input moves along its derivative, so the function _increases_ (following the gradient). Turn it (-) and the inputs move the opposite way, so the function decreases. It scales, little by little, the force and direction given by the derivatives: the circuit follows the derivatives, whatever their signs.
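
A quick sketch of that sign flip, using the same forward-difference derivatives as above. The helper `gradient_step` is a hypothetical name, introduced only for this comparison.

```python
def forwardMultiplyGate(x, y):
    return x * y

def gradient_step(x, y, step_size, h=0.0001):
    """One update of (x, y) along the numerical gradient, scaled by step_size."""
    out = forwardMultiplyGate(x, y)
    x_derivative = (forwardMultiplyGate(x + h, y) - out) / h
    y_derivative = (forwardMultiplyGate(x, y + h) - out) / h
    return x + step_size * x_derivative, y + step_size * y_derivative

x, y = -2, 3
start = forwardMultiplyGate(x, y)
uphill = forwardMultiplyGate(*gradient_step(x, y, 0.01))     # positive step: output rises
downhill = forwardMultiplyGate(*gradient_step(x, y, -0.01))  # negative step: output falls
```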

<u>Expt</u>: Try to see with more iterations if the gradient is really towards increasing the function..   
Result: It is! With a negative step size, though, the inputs move against the gradient and the function decreases instead.

In [6]:
step_size = 0.01
x, y = -1, 1
h = 0.0001

for i in range(5):
    
    out = forwardMultiplyGate(x, y)
    
    out_x = forwardMultiplyGate(x + h, y)
    x_derivative = (out_x - out) / h
    
    out_y = forwardMultiplyGate(x, y + h)
    y_derivative = (out_y - out) / h
    
    x += step_size * x_derivative
    y += step_size * y_derivative
    out_new = forwardMultiplyGate(x, y)
    
    print "f(x+h, y): %s, x derivative: %s, new x: %s\nf(x, y+h): %s, y derivative: %s, new y: %s\nout: %s\n" % \
    (out_x, x_derivative, x, out_y, y_derivative, y, out_new)

f(x+h, y): -0.9999, x derivative: 1.0, new x: -0.99
f(x, y+h): -1.0001, y derivative: -1.0, new y: 0.99
out: -0.9801

f(x+h, y): -0.980001, x derivative: 0.99, new x: -0.9801
f(x, y+h): -0.980199, y derivative: -0.99, new y: 0.9801
out: -0.96059601

f(x+h, y): -0.960498, x derivative: 0.9801, new x: -0.970299
f(x, y+h): -0.96069402, y derivative: -0.9801, new y: 0.970299
out: -0.941480149401

f(x+h, y): -0.941383119501, x derivative: 0.970299, new x: -0.96059601
f(x, y+h): -0.941577179301, y derivative: -0.970298999999, new y: 0.96059601
out: -0.922744694428

f(x+h, y): -0.922648634827, x derivative: 0.96059601, new x: -0.9509900499
f(x, y+h): -0.922840754029, y derivative: -0.96059601, new y: 0.9509900499
out: -0.904382075009



#### Strategy 3: Analytical Gradient
* for our function, it turns out that the derivative of f with respect to x is y, and with respect to y is x.

$\frac{df\left(x,y\right)}{dx}=\frac{\left(x+h\right)y\ -\ xy}{h}=y$

Instead of probing with h, compute the derivatives directly at each step, because the math gives them in closed form.
Btw, the sign of the derivative is the sign of the rate of change, i.e. what happens to the output when a variable is increased a little bit. If that rate is (+), the output grows; if it is (-), the output shrinks.

In [7]:
x, y = -2, 3 # re-initialize
x_gradient, y_gradient = y, x # analytic gradient: df/dx = y, df/dy = x

x += step_size * x_gradient
y += step_size * y_gradient
out_new = forwardMultiplyGate(x, y)

out_new

-5.8706

### Circuits with Multiple Gates

In [8]:
def forwardMultiplyGate(a, b): return a * b
def forwardAddGate(a, b): return a + b

def forwardCircuit(x, y, z):
    q = forwardAddGate(x, y)
    f = forwardMultiplyGate(q, z) 
    return f

x, y, z = -2, 5, -4
forwardCircuit(x, y, z)

-12

### Backpropagation
* the chain rule is really, really useful

In [9]:
x, y, z = -2, 5, -4
q = forwardAddGate(x, y)
f = forwardMultiplyGate(q, z)
print f

derivative_f_wrt_z = q
derivative_f_wrt_q = z

derivative_q_wrt_x = 1
derivative_q_wrt_y = 1

derivative_f_wrt_x = derivative_f_wrt_q * derivative_q_wrt_x
derivative_f_wrt_y = derivative_f_wrt_q * derivative_q_wrt_y

step_size = 0.01
x += step_size * derivative_f_wrt_x
y += step_size * derivative_f_wrt_y
z += step_size * derivative_f_wrt_z

print forwardMultiplyGate(forwardAddGate(x, y), z)

-12
-11.5924


One way to think about this is that the circuit is a function-fitting exercise; in our simple case, following the gradient just increases the function, which doesn't optimize anything useful yet. The chain rule, implemented via backpropagation, tells each layer how much and in which direction to adjust in order to move the final output. It is a backward pass, where each layer passes its gradient to the functions before it so that they adjust accordingly.

The chain rule tells us formally how sensitive the overall function is to each weight (and every other variable). We can sort of trust that each layer communicates locally with the layer before it to produce the desired computation globally.
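
As a sanity check on the chain rule, a small sketch that compares the backprop gradient of f = (x + y) * z against a numerical estimate. The function names `backprop_gradient` and `numerical_gradient` are my own.

```python
def forwardCircuit(x, y, z):
    return (x + y) * z

def backprop_gradient(x, y, z):
    """Chain rule for f = q * z with q = x + y."""
    q = x + y
    dfdq, dfdz = z, q          # multiply gate: each input gets the other's value
    dqdx, dqdy = 1.0, 1.0      # add gate: passes the gradient through unchanged
    return dfdq * dqdx, dfdq * dqdy, dfdz

def numerical_gradient(x, y, z, h=0.0001):
    """Forward-difference estimate, used only as a sanity check."""
    f0 = forwardCircuit(x, y, z)
    return ((forwardCircuit(x + h, y, z) - f0) / h,
            (forwardCircuit(x, y + h, z) - f0) / h,
            (forwardCircuit(x, y, z + h) - f0) / h)

analytic = backprop_gradient(-2, 5, -4)   # (-4.0, -4.0, 3)
numeric = numerical_gradient(-2, 5, -4)
```

The two tuples should agree up to floating-point noise, since f is linear in each input taken on its own.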

Let's try with more iterations..

In [10]:
x, y, z = -2, 5, -4
step_size = 0.01

for i in range(20):
    
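    # NOTE: q is not recomputed inside this loop; it keeps the value 3 from
    # In [9], so dfdz is slightly stale after the first iteration.
    # Recompute q = forwardAddGate(x, y) here for the exact gradient.
    dfdz = q
    dfdq = z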

    dqdx = 1
    dqdy = 1

    dfdx = dfdq * dqdx
    dfdy = dfdq * dqdy

    x += step_size * dfdx
    y += step_size * dfdy
    z += step_size * dfdz
    
    print forwardMultiplyGate(forwardAddGate(x, y), z)

-11.5924
-11.191964
-10.798638
-10.412368
-10.0331
-9.66078
-9.295354
-8.936768
-8.584968
-8.2399
-7.90151
-7.569744
-7.244548
-6.925868
-6.61365
-6.30784
-6.008384
-5.715228
-5.428318
-5.1476


<u>Expt:</u> Now, experiment with the chain rule by adding a basic cost function at the end. (a bit of a fast forward)   
Result: It works! It finds inputs that minimize the cost, instead of just ascending the function.   
(c gradually decreases, so f gets closer to k). Beautiful.

In [11]:
x, y, z, k = -1, 3, 4, 6
step_size = 0.001

for i in range(10):    
   
    f = forwardMultiplyGate(forwardAddGate(x, y), z)
    c = ((f - k) ** 2) / 2
    
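    # NOTE: q still holds 3 from In [9]; it is never recomputed as x + y here,
    # so dfdz is not the exact gradient. Use q = forwardAddGate(x, y) to fix.
    dfdz = q
    dfdq = z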

    dqdx = 1
    dqdy = 1

    dfdx = dfdq * dqdx
    dfdy = dfdq * dqdy
    
    # the true derivative of c w.r.t. f is (f - k); the sign is flipped here so that
    # the += updates below follow the opposite of the gradient and minimize the cost
    dcdf = k - f
    dcdx = dcdf * dfdx
    dcdy = dcdf * dfdy
    dcdz = dcdf * dfdz

    x += step_size * dcdx
    y += step_size * dcdy
    z += step_size * dcdz

    print "x: %s\t y: %s\t z: %s\nf: %s\t c: %s\n" % (round(x, 4), round(y, 4), round(z, 4), round(f, 4), round(c, 4))

x: -1.008	 y: 2.992	 z: 3.994
f: 8.0	 c: 2.0

x: -1.0157	 y: 2.9843	 z: 3.9882
f: 7.9241	 c: 1.8511

x: -1.0231	 y: 2.9769	 z: 3.9827
f: 7.8513	 c: 1.7137

x: -1.0302	 y: 2.9698	 z: 3.9773
f: 7.7816	 c: 1.587

x: -1.037	 y: 2.963	 z: 3.9722
f: 7.7147	 c: 1.4701

x: -1.0435	 y: 2.9565	 z: 3.9672
f: 7.6506	 c: 1.3622

x: -1.0498	 y: 2.9502	 z: 3.9625
f: 7.589	 c: 1.2625

x: -1.0559	 y: 2.9441	 z: 3.9579
f: 7.5299	 c: 1.1703

x: -1.0617	 y: 2.9383	 z: 3.9535
f: 7.4732	 c: 1.0852

x: -1.0673	 y: 2.9327	 z: 3.9492
f: 7.4188	 c: 1.0064



### Single Neuron
* using sigmoid activation, so the output varies smoothly between 0 and 1 instead of being just 0 or 1.

In [12]:
class Unit(object):
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad
        
class MultiplyGate(object):
    def forward(self, u0, u1):
        self.u0 = u0
        self.u1 = u1
        self.utop = Unit(u0.value * u1.value, 0.0)
        return self.utop
    
    def backward(self):
        self.u0.grad += self.u1.value * self.utop.grad
        self.u1.grad += self.u0.value * self.utop.grad
        
class AddGate(object):
    def forward(self, u0, u1):
        self.u0 = u0
        self.u1 = u1
        self.utop = Unit(u0.value + u1.value, 0.0)
        return self.utop
    
    def backward(self):
        self.u0.grad += 1 * self.utop.grad
        self.u1.grad += 1 * self.utop.grad
    
x = Unit(2,0)
y = Unit(-3,0)
print x.value, x.grad

m = MultiplyGate()
print m.forward(x,y).value

a = AddGate()
print a.forward(x, y).value

2 0
-6
-1


sigmoid function    
$sig\left(x\right)\ =\ \frac{1}{1+e^{-x}}$

derivative   
$\frac{dsig\left(x\right)}{dx}=\ sig\left(x\right)\left(1-sig\left(x\right)\right)$

In [13]:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

class SigmoidGate(object):
    def forward(self, u0):
        self.u0 = u0
        self.utop = Unit(sigmoid(self.u0.value), 0.0)
        return self.utop
    
    def backward(self):
        s = sigmoid(self.u0.value)
        self.u0.grad += (s * (1 - s)) * self.utop.grad

print sigmoid(3)
sg = SigmoidGate()
print sg.forward(x).value

0.952574126822
0.880797077978
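
To double-check the derivative formula used in `SigmoidGate.backward`, a short sketch comparing it with a centered-difference estimate at a few sample points. The helper names `sigmoid_grad` and `sigmoid_grad_numeric`, and the sample points, are my own choices.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative: sig(x) * (1 - sig(x))."""
    s = sigmoid(x)
    return s * (1 - s)

def sigmoid_grad_numeric(x, h=0.0001):
    """Centered-difference estimate, for comparison only."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# largest gap between the two over a few sample points
max_gap = max(abs(sigmoid_grad(v) - sigmoid_grad_numeric(v))
              for v in [-3.0, -1.0, 0.0, 2.0, 3.0])
```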


__Neuron (forward pass)__  
* dot product of input & weights, + c (bias term), then feed to sigmoid
* it yields a single value

__Neuron (backward pass)__   
* adjust the weights and inputs with the computed gradients; the order of the backward calls is important, since each gate needs the gradient of the gate after it
* start with gradient 1 at the final operation, which is the sigmoid
* it yields updates to every value (not just a single value)
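
The two passes can also be written in closed form, without gate objects. This is just a compact sketch of the same chain; the function names `neuron_forward` and `neuron_gradients` are hypothetical.

```python
import math

def neuron_forward(a, b, c, x, y):
    """sigmoid(a*x + b*y + c)"""
    return 1 / (1 + math.exp(-(a * x + b * y + c)))

def neuron_gradients(a, b, c, x, y):
    """Gradients of the neuron output w.r.t. (a, b, c, x, y) via the chain rule."""
    s = neuron_forward(a, b, c, x, y)
    ds = s * (1 - s)              # gradient flowing back through the sigmoid
    # add gates pass ds through; multiply gates scale it by the other input
    return {"a": x * ds, "b": y * ds, "c": ds, "x": a * ds, "y": b * ds}

s = neuron_forward(1.0, 2.0, -3.0, -1.0, 3.0)
grads = neuron_gradients(1.0, 2.0, -3.0, -1.0, 3.0)
```

Every gradient is the sigmoid's local slope `ds` times whatever the value was multiplied by on the way forward.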

In [14]:
class Neuron(object):
    
    def __init__(self):
        self.mulg0, self.mulg1 = MultiplyGate(), MultiplyGate()
        self.addg0, self.addg1 = AddGate(), AddGate()
        self.sg0 = SigmoidGate()

    def forward(self, a, b, c, x, y):
        ax = self.mulg0.forward(a, x)
        by = self.mulg1.forward(b, y)
        axby = self.addg0.forward(ax, by)
        axbyc = self.addg1.forward(axby, c)
        self.s = self.sg0.forward(axbyc)
        return self.s
    
    def backward(self, a, b, c, x, y):
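        step_size = 0.01
        # seed the chain with gradient 1 at the output; gate gradients accumulate
        # with +=, so zero every .grad before running a second backward pass
        self.s.grad = 1.0
        
        # backward calls run in the reverse order of the forward pass
        self.sg0.backward()
        self.addg1.backward()
        self.addg0.backward()
        self.mulg1.backward()
        self.mulg0.backward()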
        
        a.value += step_size * a.grad
        b.value += step_size * b.grad
        c.value += step_size * c.grad
        x.value += step_size * x.grad
        y.value += step_size * y.grad
        
        return "\nvalues:\na: %s, x: %s\nb: %s, y: %s\nc: %s\n" % (a.value, x.value, b.value, y.value, c.value) 

In [15]:
# input
a, b = Unit(1.0, 0.0), Unit(2.0, 0.0)
x, y = Unit(-1.0, 0.0), Unit(3.0, 0.0)
c = Unit(-3.0, 0.0)

# process / output
n = Neuron()
print "forward pass 1: ", n.forward(a, b, c, x, y).value
print "\nbackward pass 1: ", n.backward(a, b, c, x, y)
print "forward pass 2: ", n.forward(a, b, c, x, y).value

forward pass 1:  0.880797077978

backward pass 1:  
values:
a: 0.998950064146, x: -0.998950064146
b: 2.00314980756, y: 3.00209987171
c: -2.99895006415

forward pass 2:  0.882550181622


Numerical Gradients Check
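* just checking that the gradients from the backward pass above were computed correctly.
* they agree to about four decimal places (the forward difference with h = 0.0001 is only an approximation).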

In [16]:
a, b, c, x, y = 1, 2, -3, -1, 3
h = 0.0001

def forwardCircuitFast(a, b, c, x, y):
    return 1 / (1 + math.exp(-(a*x + b*y + c)))

a_grad = (forwardCircuitFast(a+h, b, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h
b_grad = (forwardCircuitFast(a, b+h, c, x, y) - forwardCircuitFast(a, b, c, x, y)) / h
c_grad = (forwardCircuitFast(a, b, c+h, x, y) - forwardCircuitFast(a, b, c, x, y)) / h
x_grad = (forwardCircuitFast(a, b, c, x+h, y) - forwardCircuitFast(a, b, c, x, y)) / h
y_grad = (forwardCircuitFast(a, b, c, x, y+h) - forwardCircuitFast(a, b, c, x, y)) / h

print a_grad, b_grad, c_grad, x_grad, y_grad

-0.104997583592 0.314944774835 0.104989587345 0.104989587345 0.209971178827


---
# Machine Learning

### Binary Classification
* linear classifier à la SVM: ax + by + c, with no activation function
* labels are 1 and -1
* stochastic gradient descent = pick a random example (input, label) and take a gradient step on it, over and over
* the weights define a linear boundary; ideal output:

<img style="float: left;" src="https://cdn.pbrd.co/images/H4G0bnR.png"/>

In [17]:
# data, labels, weights, bias initialization
data = [[1.2, 0.7], [-0.3, -0.5], [3.0, 0.1], [-0.1, -1.0], [-1.0, 1.1], [2.1, -3]]
labels = [1, -1, 1, -1, -1, 1]
a, b, c = 1, -2, -1

def check(a, b, c):
    num_cor = 0
    for i in range(len(data)):
        x, y, label = data[i][0], data[i][1], labels[i]
        score = a*x + b*y + c
        if ((score > 0.0 and label == 1) or (score < 0.0 and label == -1)): num_cor += 1
    return num_cor / float(len(data))
        
for l in range(400):
    i = int(random.random() * len(data))
    x, y, label = data[i][0], data[i][1], labels[i]

    if (l % 25 == 0): print l, check(a, b, c)
    score = a*x + b*y + c
    
    # +/- assignment for the gradient of the function
    pull = 0.0
    if (label == 1 and score < 1.0): pull = 1.0
    if (label == -1 and score > -1.0): pull = -1.0
    
    ss = 0.01
    a += ss * (x * pull - a) # -a regularization
    b += ss * (y * pull - b) # -b regularization
    c += ss * (pull)

0 0.666666666667
25 0.666666666667
50 0.833333333333
75 0.833333333333
100 0.833333333333
125 0.833333333333
150 0.833333333333
175 0.833333333333
200 0.833333333333
225 0.833333333333
250 0.833333333333
275 0.833333333333
300 0.833333333333
325 0.833333333333
350 0.833333333333
375 0.833333333333


hmm.. it gets stuck?

<u>Expt:</u> Removing the 'regularization pull'..    
Result: Seems like adding a regularization pull -a or -b (i.e. x \* pull - a) makes the circuit wiggle more (it can reach full accuracy, but steps out right after), whereas removing it makes the circuit converge faster and stay converged over the iterations. Without the regularization term, the update is just the chain rule: dfda or dfdb is derived as df \* dfdz \* dzda (z = ax + by), which evaluates to (+1 or -1) \* 1 \* (x or y). Maybe the wiggling happens because of the given data for this toy problem; perhaps in practice, regularization does prevent params from getting noisy.
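
To compare the two variants side by side, a minimal sketch with a hypothetical `reg` parameter: `reg=1.0` reproduces the update with the -a / -b pull, `reg=0.0` removes it. The `seed` argument is my addition so that both runs see the same sequence of examples; the resulting accuracies will still depend on the seed.

```python
import random

data = [[1.2, 0.7], [-0.3, -0.5], [3.0, 0.1], [-0.1, -1.0], [-1.0, 1.1], [2.1, -3]]
labels = [1, -1, 1, -1, -1, 1]

def accuracy(a, b, c):
    """Fraction of examples whose score has the same sign as the label."""
    correct = sum(1 for (x, y), label in zip(data, labels)
                  if (a * x + b * y + c) * label > 0)
    return correct / float(len(data))

def train(reg=0.0, iterations=400, ss=0.01, seed=0):
    """SGD on the margin-based 'pull'; `reg` scales the -a / -b regularization."""
    rng = random.Random(seed)      # seeded only to make runs repeatable
    a, b, c = 1.0, -2.0, -1.0
    for _ in range(iterations):
        i = rng.randrange(len(data))
        (x, y), label = data[i], labels[i]
        score = a * x + b * y + c
        pull = 0.0
        if label == 1 and score < 1.0: pull = 1.0
        if label == -1 and score > -1.0: pull = -1.0
        a += ss * (x * pull - reg * a)
        b += ss * (y * pull - reg * b)
        c += ss * pull
    return a, b, c

acc_with_reg = accuracy(*train(reg=1.0))
acc_without_reg = accuracy(*train(reg=0.0))
```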

In [18]:
# data, labels, weights, bias initialization
data = [[1.2, 0.7], [-0.3, -0.5], [3.0, 0.1], [-0.1, -1.0], [-1.0, 1.1], [2.1, -3]]
labels = [1, -1, 1, -1, -1, 1]
a, b, c = 1, -2, -1

def check(a, b, c):
    num_cor = 0
    for i in range(len(data)):
        x, y, label = data[i][0], data[i][1], labels[i]
        score = a*x + b*y + c
        if ((score > 0.0 and label == 1) or (score < 0.0 and label == -1)): num_cor += 1
    return num_cor / float(len(data))
        
for l in range(400):
    i = int(random.random() * len(data))
    x, y, label = data[i][0], data[i][1], labels[i]

    if (l % 25 == 0): print l, check(a, b, c), (a, b, c)
    score = a*x + b*y + c
    
    # +/- assignment for the gradient of the function
    pull = 0.0
    if (label == 1 and score < 1.0): pull = 1.0
    if (label == -1 and score > -1.0): pull = -1.0
    
    # removed regularization pull for the meantime, these are the derived gradients
    ss = 0.01
    a += ss * (x * pull) 
    b += ss * (y * pull) 
    c += ss * (pull)

0 0.666666666667 (1, -2, -1)
25 0.666666666667 (1.0729999999999993, -1.9100000000000008, -1.02)
50 0.666666666667 (1.1479999999999986, -1.8250000000000017, -1.04)
75 0.666666666667 (1.2279999999999975, -1.7150000000000027, -1.09)
100 0.666666666667 (1.3019999999999965, -1.5900000000000034, -1.1400000000000001)
125 0.666666666667 (1.423999999999996, -1.4870000000000048, -1.11)
150 0.666666666667 (1.4969999999999948, -1.384000000000006, -1.1800000000000002)
175 0.666666666667 (1.5859999999999945, -1.3100000000000067, -1.1400000000000001)
200 0.833333333333 (1.6379999999999937, -1.2290000000000074, -1.1900000000000002)
225 0.833333333333 (1.6759999999999935, -1.1880000000000077, -1.1800000000000002)
250 1.0 (1.7149999999999932, -1.137000000000008, -1.1800000000000002)
275 1.0 (1.7689999999999926, -1.0490000000000084, -1.2000000000000002)
300 1.0 (1.8199999999999923, -0.9910000000000088, -1.1900000000000002)
325 1.0 (1.8609999999999918, -0.9200000000000087, -1.2100000000000002)
350 1.0 (1.

### Generalizing the SVM into a Neural Network
* 3 neurons, ReLU (passes + values, zeroes out the rest) instead of sigmoid
* the SVM above was a single linear classifier; here, each of the 3 neurons is a linear classifier, plus 1 classifier at the end that combines them
* this is prototype code. in practical cases, refactor into cleaner data structures. :)

In [19]:
data = [[1.2, 0.7], [-0.3, -0.5], [3.0, 0.1], [-0.1, -1.0], [-1.0, 1.1], [2.1, -3]]
labels = [1, -1, 1, -1, -1, 1]
a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d4 = [random.uniform(-0.5, 0.5) for i in range(13)]

for l in range(400):
    
    # forward
    i = int(random.random() * len(data))
    x, y, label = data[i][0], data[i][1], labels[i]

    n1 = max(0, x*a1 + y*b1 + c1)
    n2 = max (0, x*a2 + y*b2 + c2)
    n3 = max(0, x*a3 + y*b3 + c3)
    score = n1*a4 + n2*b4 + n3*c4 + d4

    if (l % 25 == 0):
        num_cor = 0
        for j in range(len(data)):
            x_, y_, label_ = data[j][0], data[j][1], labels[j]
            n1_, n2_, n3_ = max(0, x_*a1 + y_*b1 + c1), max (0, x_*a2 + y_*b2 + c2), max(0, x_*a3 + y_*b3 + c3)
            score_ = n1_*a4 + n2_*b4 + n3_*c4 + d4
            if ((score_ > 0.0 and label_ == 1) or (score_ < 0.0 and label_ == -1)): num_cor += 1
            corr = num_cor / float(len(data))
        print l, corr
        if (corr == 1): 
            print "neuron 1: (%s, %s, %s) \nneuron 2: (%s, %s, %s) \nneuron 3: (%s, %s, %s) \nfinal: (%s, %s, %s, %s)" \
            % (a1, b1, c1, a2, b2, c2, a3, b3, c3, a4, b4, c4, d4)
    
    # backward, this will be a loooong chain
    pull = 0.0
    if (label == 1 and score < 1.0): pull = 1.0
    if (label == -1 and score > -1.0): pull = -1.0

    # f
    da4 = pull * n1
    db4 = pull * n2
    dc4 = pull * n3
    dd4 = pull

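    # backprop through ReLU: the gradient passes only where the neuron was active
    dn1 = pull * a4 if n1 > 0 else 0.0
    dn2 = pull * b4 if n2 > 0 else 0.0
    dn3 = pull * c4 if n3 > 0 else 0.0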

    # n1
    da1 = dn1 * x
    db1 = dn1 * y
    dc1 = dn1

    # n2
    da2 = dn2 * x
    db2 = dn2 * y
    dc2 = dn2

    # n3
    da3 = dn3 * x
    db3 = dn3 * y
    dc3 = dn3
    
    #  no regularization
    ss = 0.01
    a1 += ss * da1
    b1 += ss * db1
    c1 += ss * dc1
    a2 += ss * da2
    b2 += ss * db2
    c2 += ss * dc2
    a3 += ss * da3
    b3 += ss * db3
    c3 += ss * dc3
    a4 += ss * da4
    b4 += ss * db4
    c4 += ss * dc4
    d4 += ss * dd4

0 0.333333333333
25 0.5
50 0.5
75 0.5
100 0.833333333333
125 0.833333333333
150 0.833333333333
175 0.833333333333
200 1.0
neuron 1: (-0.346744012996, -0.358567036143, 0.74241592093) 
neuron 2: (0.693487215185, 0.14870506052, 0.903935126191) 
neuron 3: (-0.522922012574, -0.0359121877446, 0.30543168357) 
final: (-0.634862099528, 0.81128041528, -0.711938837451, -0.420354963045)
225 1.0
neuron 1: (-0.353776098942, -0.380744293954, 0.774221899072) 
neuron 2: (0.780730142067, 0.0240723078307, 0.945479377087) 
neuron 3: (-0.530845213059, -0.0611811905979, 0.341493286822) 
final: (-0.632412658661, 0.88403439774, -0.735309443519, -0.420354963045)
250 1.0
neuron 1: (-0.360320376291, -0.41339385957, 0.813430106261) 
neuron 2: (0.79897092719, -0.00198595663087, 0.954165465241) 
neuron 3: (-0.538321601378, -0.098571400752, 0.386358231583) 
final: (-0.690993386843, 0.860294471375, -0.766068394082, -0.470354963045)
275 1.0
neuron 1: (-0.36237898213, -0.433979917962, 0.834016164653) 
neuron 2: (0.8352

It might be good to interpret this in 2 ways. The first is the computational explanation: the 3 neurons compute 3 functions from the first batch of weights, and their outputs are fed into a final function that weighs them, all tuned via backprop. The classification task is monitored by checking the examples one by one and seeing whether the composed function's score matches the label of each input.

The second is the visual explanation: the structure lets the 3 neurons create 3 linear boundaries, which are themselves weighted at the end so that they 'battle' to form a final non-linear boundary, much like this:

<img style="float: left;" src="https://cdn.pbrd.co/images/H4FxROH.png"/>

In both explanations, it is clear that the hidden layer serves as a way to create more distinct classifiers that separate points into their respective classes. This kind of structure seems flexible enough to carry over to more dimensions and more labels.
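
A sketch of the gradient for this 3-neuron structure, with the ReLU gating on whether each hidden neuron was active, checked against a centered difference. It computes the gradient of the score only (multiply by `pull` to get the update). The dict-based layout, the function names and the parameter values are all my own illustrative choices, not taken from any run above.

```python
def forward(p, x, y):
    """Score of the 3-neuron ReLU network; p holds the 13 parameters by name."""
    n1 = max(0.0, p["a1"] * x + p["b1"] * y + p["c1"])
    n2 = max(0.0, p["a2"] * x + p["b2"] * y + p["c2"])
    n3 = max(0.0, p["a3"] * x + p["b3"] * y + p["c3"])
    score = p["a4"] * n1 + p["b4"] * n2 + p["c4"] * n3 + p["d4"]
    return score, (n1, n2, n3)

def backward(p, x, y):
    """Gradient of the score w.r.t. every parameter."""
    _, hidden = forward(p, x, y)
    grads = {"d4": 1.0}
    for k, (n, out_w) in enumerate(zip(hidden, ("a4", "b4", "c4")), start=1):
        grads[out_w] = n
        dn = p[out_w] if n > 0 else 0.0   # ReLU gate: no gradient for an inactive neuron
        grads["a%d" % k] = dn * x
        grads["b%d" % k] = dn * y
        grads["c%d" % k] = dn
    return grads

def numeric_grad(p, x, y, name, h=1e-5):
    """Centered difference on one parameter, as a sanity check."""
    hi, lo = dict(p), dict(p)
    hi[name] += h
    lo[name] -= h
    return (forward(hi, x, y)[0] - forward(lo, x, y)[0]) / (2 * h)

# arbitrary illustrative parameters (assumed, not from the notebook)
params = {"a1": 0.5, "b1": -0.3, "c1": 0.1, "a2": -0.4, "b2": 0.2, "c2": 0.3,
          "a3": 0.1, "b3": 0.6, "c3": -0.9, "a4": 0.7, "b4": -0.5, "c4": 0.2, "d4": 0.1}
grads = backward(params, 1.2, 0.7)
max_gap = max(abs(grads[name] - numeric_grad(params, 1.2, 0.7, name)) for name in params)
```

With these values only the first neuron is active for the input (1.2, 0.7), so the weights of neurons 2 and 3 get zero gradient.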

### Loss/Cost/Objective Function
* instead of a +/- 'pull', have a function to minimize as an objective, aptly called an objective (or loss or cost) function. i prefer cost.
* similar to what was done in one expt above
* think of cost as the gap between the computed output and the desired output

In [20]:
X = [[1.2, 0.7], [-0.3, 0.5], [3.0, 2.5]]
y = [1, -1, 1]
w = [0.1, 0.2, 0.3]
alpha = 0.1

def cost(X,y,w):
    total_cost = 0.0
    for i in range(len(X)):
        score = (X[i][0] * w[0]) + (X[i][1] * w[1]) + w[2]
        cost_i = max(0, -y[i] * score + 1)
        print "example %s: %s and label: %s\nscore: %s and cost: %s" % (i, X[i], y[i], score, cost_i)
        total_cost += cost_i
        
    reg_cost = alpha * (w[0]**2 + w[1]**2)
    total_cost += reg_cost
    print "\nregularization cost for current model is: %s\ntotal_cost: %s\n" % (reg_cost, total_cost)
    return total_cost

cost(X, y, w)

example 0: [1.2, 0.7] and label: 1
score: 0.56 and cost: 0.44
example 1: [-0.3, 0.5] and label: -1
score: 0.37 and cost: 1.37
example 2: [3.0, 2.5] and label: 1
score: 1.1 and cost: 0

regularization cost for current model is: 0.005
total_cost: 1.815



1.815
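
With the cost written down, its gradient gives the update directly. Here is a minimal sketch of one descent step on the same data. `cost_gradient` is a hypothetical helper, `cost` is re-declared without the prints, and the step size of 0.01 is an assumed value picked only for illustration.

```python
X = [[1.2, 0.7], [-0.3, 0.5], [3.0, 2.5]]
y = [1, -1, 1]
alpha = 0.1

def cost(X, y, w):
    """Hinge loss summed over the examples plus L2 regularization on w[0], w[1]."""
    total = 0.0
    for (x0, x1), label in zip(X, y):
        score = x0 * w[0] + x1 * w[1] + w[2]
        total += max(0, -label * score + 1)
    return total + alpha * (w[0] ** 2 + w[1] ** 2)

def cost_gradient(X, y, w):
    """Subgradient of cost() with respect to w."""
    grad = [2 * alpha * w[0], 2 * alpha * w[1], 0.0]   # regularization term
    for (x0, x1), label in zip(X, y):
        score = x0 * w[0] + x1 * w[1] + w[2]
        if -label * score + 1 > 0:     # only examples inside the margin contribute
            grad[0] += -label * x0
            grad[1] += -label * x1
            grad[2] += -label
    return grad

w = [0.1, 0.2, 0.3]
before = cost(X, y, w)
grad = cost_gradient(X, y, w)
step_size = 0.01   # assumed value, for illustration only
w_new = [wi - step_size * gi for wi, gi in zip(w, grad)]
after = cost(X, y, w_new)
```

The third example already has zero cost, so it contributes nothing to the gradient; only the two examples inside the margin and the regularization term move the weights.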