In [24]:
# You can train neural networks to convert a given dataset of "what you know" to a dataset of "what you want to know"
# Basically, you can train the network to interpret observations.
# First, convert the observation dataset into matrices so the information is interpretable for the network.
    # Convention: use one row for one observation (each set of on/off lights on a 3-light streetlight) 
    # and one column per observed item (whether each light in the set is on or off). 
    # Ideally, you want a "lossless representation" - the data and the matrix can be perfectly converted between each other.
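# A minimal sketch of the "lossless representation" idea above. The string format for the raw observations is an
# assumption made only for illustration: one character per light, "1" = on and "0" = off.

```python
import numpy as np

# Hypothetical raw observations: one string per streetlight reading.
observations = ["101", "011", "001"]

# One row per observation, one column per light.
matrix = np.array([[int(ch) for ch in obs] for obs in observations])

# Lossless check: rebuild the original strings from the matrix.
recovered = ["".join(str(value) for value in row) for row in matrix]

print(matrix)
print(recovered == observations)
```

# If the round trip gives back the original observations, no information was lost in the conversion.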
    
import numpy as np
weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

# input data pattern
# 0 = light is off, 1 = light is on in a 3-light horizontal stoplight at a crosswalk
streetlights = np.array([[1, 0, 1],
                       [0, 1, 1],
                       [0, 0, 1],
                       [1, 1, 1],
                       [0, 1, 1],
                       [1, 0, 1]])
# output data pattern 
# 0 = stop, 1 = walk
walk_vs_stop = np.array([[0],
                        [1],
                        [0],
                        [1],
                        [1],
                        [0]])

# First, we can turn streetlights into walk_vs_stop with a neural network, as before.
# This uses numpy arrays for easy elementwise addition and multiplication; otherwise it is the same as the previous networks.
print(streetlights[0] * [2, 2, 2], "elementwise multiplication")
print(streetlights[0] + [2, 2, 2], "elementwise addition")

for iteration in range(40):
    error_for_all_lights = 0
    for row in range(len(walk_vs_stop)):
        input = streetlights[row]
        goal_prediction = walk_vs_stop[row]
        
        # dot product = weighted sum: input * weights and addition of all items in vector to return a single number
        # The weighted sum of inputs finds perfect correlation between input and output by weighting decorrelated inputs to 0.
        # Basically, if the light is off (marked 0), it will have no effect on the outcome because 0 * anything = 0.
        # So anytime a light has an effect, it will not be 0 and it will be accounted for as a value that affects the outcome.
        prediction = input.dot(weights) 
        error = (prediction - goal_prediction) ** 2
        error_for_all_lights += error
        
        delta = prediction - goal_prediction 
        weights = weights - (alpha * (input * delta))

print("Error:" + str(error_for_all_lights) + " Prediction:" + str(prediction))

# Stochastic Gradient Descent
# The network goes through the training examples one at a time and iterates over it several times. This lets it update the 
# weights for all examples until the network is capable of predicting the correct answer when faced with all training examples.
# This was essentially what we did in ch5 to train the handwriting neural network. We did not have a separate error for the
# entire dataset, however. We just updated the error for each digit 0-9 and used that error for all instances of that digit
# in the dataset. This makes no difference to the network's learning because the weight update never uses the error value
# itself. It uses delta (prediction - goal_prediction), which is the slope of the squared error with respect to the
# prediction, up to a constant factor of 2.

# (Average/Full) Gradient Descent
# The network goes through the entire set of training examples and calculates the average weight_delta for the whole dataset.
# Then, the network changes the weights one time. The network does not change the weights for every data point.
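
# A rough sketch of full gradient descent on the same streetlight data, for comparison with the loop above. The
# vectorized form and the iteration count (1000) are my own choices, not something from the book. It makes only one
# weight update per pass over the data, so it needs more passes than the one-example-at-a-time version.

```python
import numpy as np

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1],
                         [1, 1, 1], [0, 1, 1], [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

initial_error = np.sum((streetlights.dot(weights) - walk_vs_stop) ** 2)

for iteration in range(1000):
    predictions = streetlights.dot(weights)      # one prediction per row
    deltas = predictions - walk_vs_stop          # one delta per row
    # Average of (input * delta) over the whole dataset.
    avg_weight_delta = streetlights.T.dot(deltas) / len(streetlights)
    # A single weight update per pass.
    weights = weights - alpha * avg_weight_delta

total_error = np.sum((streetlights.dot(weights) - walk_vs_stop) ** 2)
print("Initial error:", initial_error, "Final error:", total_error)
```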

# Batch Gradient Descent
# Updates the weights after n data points. Batch size is chosen by the user and is typically between 8 and 256. This will be
# discussed more later.
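
# A small sketch of the batch version. batch_size = 2 is an assumption chosen only because this dataset has 6 rows;
# real batch sizes are larger, as noted above. The weights are updated once per batch using the batch's average
# weight_delta.

```python
import numpy as np

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1],
                         [1, 1, 1], [0, 1, 1], [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1
batch_size = 2  # hypothetical; tiny because the dataset is tiny

for iteration in range(1000):
    for start in range(0, len(streetlights), batch_size):
        batch_input = streetlights[start:start + batch_size]
        batch_goal = walk_vs_stop[start:start + batch_size]

        deltas = batch_input.dot(weights) - batch_goal
        # Average weight_delta over this batch only, then update.
        avg_weight_delta = batch_input.T.dot(deltas) / len(batch_input)
        weights = weights - alpha * avg_weight_delta

total_error = np.sum((streetlights.dot(weights) - walk_vs_stop) ** 2)
print("Error:", total_error)
```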

[2 0 2] elementwise multiplication
[3 2 3] elementwise addition
Error:[0.00053374] Prediction:-0.0026256193329783125


In [1]:
# Overfitting
# There is an edge case where the network will predict the right answer but not actually learn. For example, what if the left
# and right weights were 0.5 and -0.5 respectively and our data point was [1, 0, 1]? Then the weighted sum (prediction) would 
# be 0. The prediction was correct (stop), but the network did not learn anything.

# Error is shared among all weights. If some weight configuration accidentally creates perfect correlation between the 
# prediction and the expected output (error = 0), then weights will not be updated properly and the network will not learn from
# this data point.
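
# The edge case above, written out as a quick check. The middle weight (0.0) is an assumption, since the note only
# gives the left and right weights; the middle light is off for this data point anyway.

```python
import numpy as np

weights = np.array([0.5, 0.0, -0.5])   # left, middle (assumed), right
alpha = 0.1
light = np.array([1, 0, 1])
goal = 0                               # stop

prediction = light.dot(weights)        # 0.5 + 0.0 - 0.5
delta = prediction - goal
new_weights = weights - alpha * (light * delta)

print("Prediction:", prediction, "Delta:", delta)
print("Weights unchanged:", np.array_equal(new_weights, weights))
```

# Delta is 0, so input * delta is 0 for every weight and nothing moves, even though these weights are not the rule.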

# Overfitting is really only a problem if you train only on data points that the network cannot learn from. The other data
# points should bump the weights out of this configuration, and you can continue learning as long as you see other data points.

# Networks should be exposed to plenty of data in order to make sure they learn the rule. They need to learn to generalize 
# instead of memorizing some specific examples and reacting accordingly.

# Conflicting Pressure
# Notice in our stoplight example, the third light is always on. How does the network know that it has to bring the weight down
# to 0 for walking even though there are both positive and negative pressures exerted on the weight for this light?
# There is something called "regularization" that forces weights with conflicting pressure to move toward 0; we will discuss it later.
# A weight with conflicting pressure doesn't really do anything except confuse, so it makes sense to silence it.
# With regularization, you can learn that the third light is useless more quickly than without. If you don't have regularization,
# you can still learn that the light is useless, but it won't happen until the first light (for stopping) and the second light
# (for walking) have already settled into their perfectly correlated weights.
# If the correlations weren't perfect, the network might have struggled to silence the unnecessary third weight. Regularization
# can help avoid that problem.
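
# One way to see conflicting pressure is to print the per-example weight_delta (input * delta) for every weight. The
# starting weights here are hypothetical, picked so that the deltas are easy to read. The update subtracts
# alpha * weight_delta, so a negative entry pushes a weight up and a positive entry pushes it down.

```python
import numpy as np

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])
weights = np.array([0.0, 0.0, 0.5])   # hypothetical starting point

deltas = streetlights.dot(weights) - walk_vs_stop
# Row i holds the weight_delta that example i exerts on each weight.
pressure = streetlights * deltas[:, np.newaxis]

print(pressure)
```

# In the third column the signs are mixed (conflicting pressure). In the middle column every nonzero entry has the
# same sign, so the examples agree about which way that weight should move.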

# If the network is given a dataset where the input has no correlation with the output (all weights have conflicting pressure),
# it won't be able to solve anything. In this case, you can create "intermediate data" in order to predict the output. You can
# do this by feeding the input to a network which then produces results (intermediate data/output layer 1). You can then put
# these results into another network and this network will be able to use the intermediate data to predict the output (output
# layer 2). 

# "Because the input dataset doesn't correlate with the output dataset, you'll use the input dataset to create an intermediate
# dataset that DOES have correlation with the output. It's kind of like cheating."

# So how do you figure out what the delta (the signed error, prediction - goal) values are in the first network, which takes the
# lights and outputs some data? The second network is the same as what we have been doing; it just takes the output from
# network 1 and gets trained to output a prediction of walk or stop. But the first network takes lights and has to output
# something. What is the delta value there? How do we know we are outputting the correct thing when we're just making up outputs?

# It turns out that the weights of the first network directly cause the second network's prediction (obviously, since we made
# this data from network 1's weights and we are using it as input for network 2). So the weights of network 1 directly influence
# the error of network 2. You can use the delta from network 2 to figure out the delta of network 1. Just multiply the delta from
# network 2 (only 1 value because it is the signed error for this one prediction) by each of network 2's weights. This gives
# one delta per network 1 output. Moving the delta back from network 2 to network 1 like this is called "backpropagation."
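
# A tiny worked sketch of moving delta backward, with no nonlinearity yet. All weight values are made up for
# illustration, and the first network has only 2 outputs to keep the numbers readable. The delta of each network 1
# output is network 2's delta times the network 2 weight attached to that output.

```python
import numpy as np

layer_0 = np.array([[1, 0, 1]])            # one row of streetlights
weights_0_1 = np.array([[0.1, -0.2],
                        [0.3, 0.4],
                        [-0.5, 0.6]])      # hypothetical: 3 lights -> 2 outputs
weights_1_2 = np.array([[0.7],
                        [-0.8]])           # hypothetical: 2 inputs -> 1 prediction
goal = np.array([[1]])

# Forward pass through both networks.
layer_1 = layer_0.dot(weights_0_1)
layer_2 = layer_1.dot(weights_1_2)

# Delta of network 2, then moved back to network 1.
layer_2_delta = layer_2 - goal
layer_1_delta = layer_2_delta.dot(weights_1_2.T)

print("layer_2_delta:", layer_2_delta)
print("layer_1_delta:", layer_1_delta)
```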


## VERY IMPORTANT!! ##

# Backpropagation
# Remember that delta tells you the direction and amount we have to adjust things to get the right answer. Here delta is
# prediction - goal, so if we get a result that is too low, delta is a negative number, and subtracting alpha * input * delta
# raises the result. This delta really means "if we want the result to be raised by x amount, we have to raise/lower all of
# these inputs into network 2 by some amount," and the inputs that go into network 2 come directly from network 1, where they
# are produced by the weights of network 1. So each network 1 output gets its own delta: network 2's delta multiplied by the
# network 2 weight attached to that output.

# 1 * 0.25 * 0.9 = 0.225
# 1 * 0.225 = 0.225
# For any 3 multiplications, I can accomplish the same thing in 2 multiplications.
# For every three-layer (1st input, 1st output -> 2nd input, 2nd output) network, there is a two-layer network (input, output)
# that does the same thing. So just making our two neural networks like this even with backpropagation is useless and does not
# give us any more power than just having one network. We can't actually make good predictions like this, yet. After all, if we
# set it up to learn right now, it'll probably just go in circles forever raising and lowering all the weights in both networks
# due to conflicting pressures. !! If no correlation exists, error will never reach 0 !!
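
# A quick numerical check of the claim that two stacked linear networks collapse into one. The weights are random and
# the generator seed is arbitrary; the point is only that multiplying the two weight matrices together gives a
# single two-layer network with identical predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
weights_0_1 = rng.uniform(-1, 1, size=(3, 4))   # 3 lights -> 4 intermediate values
weights_1_2 = rng.uniform(-1, 1, size=(4, 1))   # 4 intermediate values -> 1 prediction

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])

# Three-layer network with no nonlinearity.
three_layer = streetlights.dot(weights_0_1).dot(weights_1_2)

# Equivalent two-layer network: one (3, 1) weight matrix.
combined_weights = weights_0_1.dot(weights_1_2)
two_layer = streetlights.dot(combined_weights)

print(np.allclose(three_layer, two_layer))
```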

# Conditional Correlation (or Sometimes Correlation)
# Right now, all the outputs of the 1st network are 100% correlated with the inputs that go into the 1st network (in this case,
# the row of 3 lights). The only way to reduce the correlation of one of these outputs to an input is to give that output more
# correlation with another input (another light). The ratio of correlation (weight of each input) is different, but the output
# still has 100% correlation with an input.

# What if the outputs in the 1st network could select when to be correlated with the inputs? We can do that by looking at the 
# output and if it is negative, we set it to 0 instead of leaving it negative and sending that value to the 2nd network. When 
# this output is negative, it now has NO correlation to any inputs. Now an output can be perfectly correlated to the left light
# only when the right light is off. Otherwise, it's not correlated. If the weight for this output is positive for the left light
# and a HUGE negative number for the right light, whenever the right light is on, the result will be negative. Otherwise, it 
# will just be the positive output meaning the left light is currently on.
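
# A sketch of that one output with hand-picked (hypothetical) weights: +1 for the left light, 0 for the middle
# light and -10 for the right light. After setting negatives to 0, the output follows the left light only when the
# right light is off.

```python
import numpy as np

def relu(x):
    # Negative values become 0; positive values pass through.
    return (x > 0) * x

node_weights = np.array([1, 0, -10])   # hypothetical: left, middle, right

lights = np.array([[1, 0, 0],    # left on, right off
                   [1, 0, 1],    # left on, right on
                   [0, 0, 1],    # left off, right on
                   [0, 0, 0]])   # everything off

raw_output = lights.dot(node_weights)
node_output = relu(raw_output)

print("Before relu:", raw_output)
print("After relu: ", node_output)
```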

# Now a three-layer network can do more than a two-layer network. A two-layer network cannot have this conditional correlation.
# Three-layer networks can be nonlinear. Two-layer networks must be linear (100% correlation).
# There are many types of nonlinearity. The one just described is called "relu" (rectified linear unit).
# Basically, you need nonlinearities because otherwise you are just doing the same thing you did before with two-layer networks,
# just wasting more resources to make an extra layer.

## VERY IMPORTANT!! ##

In [16]:
import numpy as np
np.random.seed(1)

def relu(x):
    # This returns 0 for all numbers less than 0. Otherwise, it returns the number.
    return (x > 0) * x

alpha = 0.2
hidden_size = 4
streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([[1, 1, 0, 0]]).T # .T is the transpose, so we should get [[1], [1], [0], [0]]

# Here we just start with some random weights for our layers. The 1st network gets 3 * hidden_size because there are 3 input 
# lights and hidden_size is the number of outputs we have. So the 1st network has a total of 12 weights, 3 weights (1 from
# each light) for each output (4 outputs). The 2nd network gets 1 weight for each input (4 inputs from the 1st network's 
# outputs) so there are only 4 weights.
# np.random.random samples from [0, 1). To sample between a and b, b > a, multiply by b - a (in this case, 1 - (-1) = 2)
# and add a (in this case, -1); so here we are random sampling numbers between -1 and 1 for our weights
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

# The output of layer 1 was sent through the relu function, so negative values became 0. Layer 2 is just normal.
# This box is just a summary of one run of the next box. Disregard this box when running the code.
layer_0 = streetlights[0]
layer_1 = relu(np.dot(layer_0, weights_0_1))
layer_2 = np.dot(layer_1, weights_1_2)

In [25]:
import numpy as np
np.random.seed(1)

def relu(x):
    # This returns 0 for all numbers less than 0. Otherwise, it returns the number.
    return (x > 0) * x

def relu2deriv(output):
    # Returns True (acts as 1) where the relu output is > 0, otherwise False (acts as 0). This is the slope (derivative) of relu.
    return output > 0

alpha = 0.2
hidden_size = 4
streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([[1, 1, 0, 0]]).T # .T is the transpose, so we should get [[1], [1], [0], [0]]

# Here we just start with some random weights for our layers. The 1st network gets 3 * hidden_size because there are 3 input 
# lights and hidden_size is the number of outputs we have. So the 1st network has a total of 12 weights, 3 weights (1 from
# each light) for each output (4 outputs). The 2nd network gets 1 weight for each input (4 inputs from the 1st network's 
# outputs) so there are only 4 weights.
# np.random.random samples from [0, 1). To sample between a and b, b > a, multiply by b - a (in this case, 1 - (-1) = 2)
# and add a (in this case, -1); so here we are random sampling numbers between -1 and 1 for our weights
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        # streetlights[i] = [1, 1, 1]
        # streetlights[i:i+1] = [[1, 1, 1]]
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        
        # This is just (pred - goal_pred) ** 2 = error.
        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
        
        # This is just pred - goal_pred = delta. 
        layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])
        
        # The rest of this code is the network as we have already seen it. The only really new line of code is this one.
        # This line is the backpropagation step. It computes delta for the 1st network by multiplying the 2nd network's
        # current delta by the 2nd network's weights. We only adjust the weights for the hidden outputs that were active
        # this round (relu passed a positive value on to the 2nd network). To do this we multiply by 1 where the hidden
        # output was positive and by 0 where it was not (relu2deriv).
        # A hidden output that relu set to 0 did not contribute to the final error, so the weights feeding it should
        # not be changed.
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        
        # This is new_weights = weights - alpha * (input * delta)
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)
        
    if (iteration % 10 == 9):
        print("Error:" + str(layer_2_error))

Error:0.6342311598444467
Error:0.3583840767631751
Error:0.08301831133032973
Error:0.006467054957103672
Error:0.0003292669000750735
Error:1.5055622665134864e-05
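
# Once training is done, the learned weights can be used to predict all four rows in one forward pass. This is a
# sketch that repeats the training loop above (same seed and settings, without the error printing) so it runs on its
# own. The predict helper is a name I made up for illustration.

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

alpha = 0.2
hidden_size = 4
streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([[1, 1, 0, 0]]).T

weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

# Same training loop as above.
for iteration in range(60):
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        layer_2_delta = layer_2 - walk_vs_stop[i:i+1]
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)

def predict(lights):
    # Hypothetical helper: forward pass only, no weight updates.
    hidden = relu(np.dot(lights, weights_0_1))
    return np.dot(hidden, weights_1_2)

predictions = predict(streetlights)
print(np.round(predictions, 3))
print(walk_vs_stop)
```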


In [None]:
# If I want to train a network to determine if there is a cat in a picture, I cannot use a two-layer network. One pixel does not
# tell us anything about if there is a cat in the picture, but groups of pixels might tell us. This is why deep learning is 
# important. It is necessary to create intermediate layers where each output in the layer represents the presence or absence of
# a different configuration of inputs.

# So the network in the cat example can try to find groups of pixels that look like ears, eyes, etc. and if there are many 
# cat-like configurations present in the picture, the final network in the system has information that can help it determine if
# there is a cat in the picture.

# Now, build a three-layer neural network from memory!

