# Simple network exercise

Below you'll use Numpy to calculate the output of a simple network with two input nodes and one output node with a sigmoid activation function. Things you'll need to do:

* Implement the sigmoid function.
* Calculate the output of the network.

As a reminder, the sigmoid function is
$$
\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$
For the exponential, you can use Numpy's exponential function, np.exp.

And the output of the network is
$$
y = f(h) = \text{sigmoid}\left(\sum_i w_i x_i + b\right)
$$
For the weighted sum, you can do a simple element-wise multiplication and sum, or use Numpy's dot product function, np.dot.
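As a quick check that the two approaches agree, here's a minimal sketch. The arrays `x_demo`, `w_demo` and the bias `b_demo` are made-up values for illustration, not the inputs used in the exercise.

```python
import numpy as np

# Illustrative values only (not the exercise's inputs)
x_demo = np.array([1.0, 2.0, 3.0])
w_demo = np.array([0.5, -0.25, 0.1])
b_demo = 0.2

# Option 1: element-wise multiplication, then sum
h_elementwise = (x_demo * w_demo).sum() + b_demo

# Option 2: dot product
h_dot = np.dot(x_demo, w_demo) + b_demo

print(np.isclose(h_elementwise, h_dot))
```

Either form works for a single node; the dot product tends to be the more convenient one once there are many inputs.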

In [2]:
import numpy as np

def sigmoid(x):
    # TODO: Implement sigmoid function
    return 1 / (1 + np.exp(-x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

# TODO: Calculate the output
output = sigmoid(np.dot(inputs, weights) + bias)

print('Output:')
print(output)

Output:
0.432907095035


In [3]:
inputs

array([ 0.7, -0.3])

In [4]:
weights

array([ 0.1,  0.8])

In [6]:
np.dot(inputs, weights) - 1

-1.1699999999999999

In [7]:
sigmoid(-1.16)

0.23866728515708963

# Gradient descent: the code

From before we saw that one weight update can be calculated as:
$$
\Delta w_i = \eta \delta x_i
$$
with the error term δ as

$$
\delta = (y - \hat{y}) f'(h) = (y - \hat{y}) f'\left(\sum_i w_i x_i\right)
$$

Remember, in the above equation (y − ŷ) is the output error, and f′(h) refers to the derivative of the activation function, f(h). We'll call that derivative the output gradient.

Now I'll write this out in code for the case of only one output unit. We'll also be using the sigmoid as the activation function f(h).

In [8]:
# Defining the sigmoid function for activations (f(h))
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))
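
If you want to convince yourself that `sigmoid_prime` really is the derivative of `sigmoid`, one option is to compare it against a finite-difference estimate. This is only a sanity-check sketch; the test points in `h_values` and the step size `eps` are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

h_values = np.array([-2.0, 0.0, 0.5, 3.0])  # arbitrary test points
eps = 1e-6                                   # small step for the finite difference

# Central difference approximation of the slope of sigmoid at each point
numeric = (sigmoid(h_values + eps) - sigmoid(h_values - eps)) / (2 * eps)
analytic = sigmoid_prime(h_values)

print(np.allclose(numeric, analytic, atol=1e-8))
```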

In [9]:
# Input data
x = np.array([0.1, 0.3]) #x0, x1
# Target
y = 0.2

In [11]:
y

0.2

In [12]:
# Input to output weights
weights = np.array([-0.8, 0.5]) #w0, w1

# The learning rate, eta in the weight step equation
learnrate = 0.5

In [16]:
# the linear combination performed by the node (h in f(h) and f'(h))
# h = x[0]*weights[0] + x[1]*weights[1]
h = np.dot(x, weights)

# The neural network output (y-hat)
nn_output = sigmoid(h)

# output error (y - y-hat)
error = y - nn_output

# output gradient (f'(h))
output_grad = sigmoid_prime(h)

# error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step 
# del_w = [ learnrate * error_term * x[0],
#           learnrate * error_term * x[1]]
del_w = learnrate * error_term * x

In [17]:
del_w

array([-0.0039638 , -0.01189141])

In [6]:
del_w

[-0.0039638030790068828, -0.011891409237020648]

Test:

In [7]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
###       fewer variable names than in the above sample code

# TODO: Calculate the node's linear combination of inputs and weights
h = np.dot(x,w)

# TODO: Calculate output of neural network (y^)
nn_output = sigmoid(h)

# TODO: Calculate error of neural network
error = y - nn_output

# TODO: Calculate the error term
#       Remember, this requires the output gradient, which we haven't
#       specifically added a variable for.
error_term = error * sigmoid_prime(h)

# TODO: Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.689974481128
Amount of Error:
-0.189974481128
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]


# Implementing Gradient Descent

As an example, I'm going to have you use gradient descent to train a network on graduate school admissions data (found at http://www.ats.ucla.edu/stat/data/binary.csv). This dataset has three input features: GRE score, GPA, and the rank of the undergraduate school (numbered 1 through 4). Institutions with rank 1 have the highest prestige, those with rank 4 have the lowest.

The goal here is to predict if a student will be admitted to a graduate program based on these features. For this, we'll use a network with one output layer with one unit. We'll use a sigmoid function for the output unit activation.

**Data cleanup**

You might think there will be three input units, but we actually need to transform the data first. The rank feature is categorical: the numbers don't encode any sort of relative values. Rank 2 is not twice as much as rank 1, and rank 3 is not 1.5 times rank 2. Instead, we need to use dummy variables to encode rank, splitting the data into four new columns encoded with ones or zeros. Rows with rank 1 have a one in the rank 1 dummy column and zeros in all other columns. Rows with rank 2 have a one in the rank 2 dummy column and zeros in all other columns. And so on.
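
Here's a small sketch of what dummy encoding looks like with pandas. The `toy` frame is made up for illustration, and passing `dtype=int` is my own choice so the new columns show up as 0/1 integers.

```python
import pandas as pd

# Toy frame; the values are made up for illustration
toy = pd.DataFrame({'gre': [520, 700, 640], 'rank': [1, 3, 1]})

# One new 0/1 column per rank value that appears in the data
dummies = pd.get_dummies(toy['rank'], prefix='rank', dtype=int)
encoded = pd.concat([toy.drop('rank', axis=1), dummies], axis=1)

print(encoded)
```

Note that only the rank values present in the data get a column, so this toy frame ends up with `rank_1` and `rank_3` only. The real dataset contains all four ranks.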

We'll also need to standardize the GRE and GPA data, which means scaling the values so that they have zero mean and a standard deviation of 1. This is necessary because the sigmoid function squashes inputs that are very negative or very positive. The gradient at such inputs is close to zero, which means that the gradient descent step will go to zero too. Since the GRE and GPA values are fairly large, we would have to be really careful about how we initialize the weights, or the gradient descent steps will die off and the network won't train. Instead, if we standardize the data, we can initialize the weights easily and everyone is happy.
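
To make the squashing problem concrete, here's a rough sketch comparing the sigmoid gradient for raw and standardized values. The scores in `raw` and the single `weight` are hypothetical numbers picked for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Made-up GRE-like scores, for illustration only
raw = np.array([380.0, 520.0, 660.0, 800.0])

# Zero mean, unit standard deviation. NumPy's std uses ddof=0 by default,
# while pandas' Series.std uses ddof=1; the difference doesn't matter here.
standardized = (raw - raw.mean()) / raw.std()

weight = 0.1  # hypothetical weight

print(sigmoid_prime(weight * raw))           # gradients are essentially zero
print(sigmoid_prime(weight * standardized))  # gradients are close to 0.25
```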

This is just a brief run-through; you'll learn more about preparing data later. If you're interested in how I did this, check out the data_prep.py file in the programming exercise below.

Now that the data is ready, we see that there are six input features: gre, gpa, and the four rank dummy variables.

Data preparation: data_prep.py

In [8]:
import numpy as np
import pandas as pd

admissions = pd.read_csv('binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis = 1)

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:,field] = (data[field]-mean)/std

# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

# Mean Square Error
We're going to make a small change to how we calculate the error here. Instead of the SSE, we're going to use the mean of the squared errors (MSE). Now that we're using a lot of data, summing up all the weight steps can lead to really large updates that make the gradient descent diverge. To compensate for this, you'd need to use a quite small learning rate. Instead, we can just divide by the number of records in our data, m, to take the average. This way, no matter how much data we use, our learning rates will typically be in the range of 0.01 to 0.001. Then, we can use the MSE (shown below) to calculate the gradient and the result is the same as before, just averaged instead of summed.
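
The formula itself did not survive in this copy of the notebook. Assuming the SSE was written with the conventional factor of one half, and using μ as an index over the m records (my notation), the MSE would read:

```latex
E = \frac{1}{2m} \sum_{\mu} \left( y^{\mu} - \hat{y}^{\mu} \right)^2
```

The training code further down reports the plain mean of the squared errors without the 1/2, which only rescales the printed value.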



**Here's the general algorithm for updating the weights with gradient descent:**

* Set the weight step to zero: $$\Delta w_i = 0$$
* For each record in the training data:
    * Make a forward pass through the network, calculating the output $$\hat{y} = f\left(\sum_i w_i x_i\right)$$
    * Calculate the error term for the output unit, $$\delta = (y - \hat{y}) f'\left(\sum_i w_i x_i\right)$$
    * Update the weight step $$\Delta w_i = \Delta w_i + \delta x_i$$
* Update the weights $$w_i = w_i + \eta \Delta w_i / m$$ where η is the learning rate and m is the number of records. Here we're averaging the weight steps to help reduce any large variations in the training data.
* Repeat for e epochs.
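
The steps above can be sketched on a tiny synthetic problem before moving to the real data. Everything here (`X_demo`, `y_demo`, `eta`, `n_epochs`) is a stand-in I made up, so treat it as an illustration of the loop structure rather than the exercise solution.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)

# Synthetic stand-in data: 50 records, 3 features (not the admissions data)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(float)

m, n = X_demo.shape
w = rng.normal(scale=1 / n**0.5, size=n)
eta = 0.5       # learning rate
n_epochs = 100  # e in the text

def mse(weights):
    return np.mean((y_demo - sigmoid(X_demo @ weights)) ** 2)

loss_before = mse(w)
for _ in range(n_epochs):
    del_w = np.zeros(n)                            # set the weight step to zero
    for x, y in zip(X_demo, y_demo):               # for each record
        y_hat = sigmoid(np.dot(x, w))              # forward pass
        delta = (y - y_hat) * y_hat * (1 - y_hat)  # error term
        del_w += delta * x                         # update the weight step
    w += eta * del_w / m                           # update the weights
loss_after = mse(w)

print(loss_before, loss_after)
```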




You can also update the weights on each record instead of averaging the weight steps after going through all the records.
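
The per-record variant mentioned above might look like the sketch below. The data and the learning rate `eta` are made up; the only point is that the weights change inside the inner loop, with no accumulated step and no division by m.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)

# Synthetic stand-in data (not the admissions data)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 1] > 0).astype(float)

w = np.zeros(2)
eta = 0.1  # hypothetical learning rate for per-record updates

loss_before = np.mean((y_demo - sigmoid(X_demo @ w)) ** 2)
for _ in range(20):  # epochs
    for x, y in zip(X_demo, y_demo):
        y_hat = sigmoid(np.dot(x, w))
        delta = (y - y_hat) * y_hat * (1 - y_hat)
        w += eta * delta * x  # update immediately, no averaging
loss_after = np.mean((y_demo - sigmoid(X_demo @ w)) ** 2)

print(loss_before, loss_after)
```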

Remember that we're using the sigmoid for the activation function, $$f(h) = 1/(1 + e^{-h})$$

And the gradient of the sigmoid is $$f'(h) = f(h)(1 - f(h))$$

where h is the input to the output unit,

$$h = \sum_i w_i x_i$$

# Implementing with Numpy

For the most part, this is pretty straightforward with Numpy.

First, you'll need to initialize the weights. We want these to be small such that the input to the sigmoid is in the linear region near 0 and not squashed at the high and low ends. It's also important to initialize them randomly so that they all have different starting values and diverge, breaking symmetry. So, we'll initialize the weights from a normal distribution centered at 0. A good value for the scale is $$1/\sqrt{n}$$ where n is the number of input units. This keeps the input to the sigmoid low for increasing numbers of input units.

weights = np.random.normal(scale=1/n_features**.5, size=n_features)
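
To get a feel for why the 1/√n scale helps, here's a rough experiment. It assumes standardized inputs (I'm using synthetic normal data in `X_demo`) and compares the spread of the sigmoid's input for scaled weights against weights drawn with a fixed scale of 1.

```python
import numpy as np

rng = np.random.default_rng(0)

spread = {}
for n_inputs in (6, 100, 1000):
    # Synthetic standardized inputs: zero mean, unit variance
    X_demo = rng.normal(size=(2000, n_inputs))
    w_scaled = rng.normal(scale=1 / n_inputs**0.5, size=n_inputs)
    w_unit = rng.normal(scale=1.0, size=n_inputs)
    # Standard deviation of h = sum_i w_i x_i across the records
    spread[n_inputs] = (np.std(X_demo @ w_scaled), np.std(X_demo @ w_unit))
    print(n_inputs, spread[n_inputs])
```

With the scaled weights the spread of h should stay at roughly the same order of magnitude as n grows, while with unit-scale weights it grows roughly like √n and pushes the sigmoid into its flat regions.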

Numpy provides a function that calculates the dot product of two arrays, which conveniently calculates h for us. The dot product multiplies two arrays element-wise: the first element in array 1 is multiplied by the first element in array 2, and so on. Then, the products are summed.

# input to the output layer
output_in = np.dot(weights, inputs)

And finally, we can update $$\Delta w_i$$ and $$w_i$$ by incrementing them with weights += ... which is shorthand for weights = weights + ....

**Efficiency tip!**

You can save some calculations since we're using a sigmoid here. For the sigmoid function, $$f'(h) = f(h)(1 - f(h))$$. That means that once you calculate f(h), the activation of the output unit, you can reuse it to calculate the gradient for the error term.

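As a small illustration of the tip, the sketch below computes the gradient both ways for an arbitrary input `h` (my own example value). The two results agree, but the second version reuses the activation from the forward pass instead of evaluating the sigmoid again.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

h = 0.4  # arbitrary input to the output unit

# Recomputes sigmoid(h) twice inside the derivative
grad_recomputed = sigmoid(h) * (1 - sigmoid(h))

# Reuses the activation already computed in the forward pass
output = sigmoid(h)
grad_reused = output * (1 - output)

print(np.isclose(grad_recomputed, grad_reused))
```
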
# Programming Exercise

Implement gradient descent and train the network on the admissions data.

In [18]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# TODO: We haven't provided the sigmoid_prime function like we did in
#       the previous lesson to encourage you to come up with a more
#       efficient solution. If you need a hint, check out the comments
#       in solution.py from the previous lecture.


# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
# Weights are normally distributed and centered at 0; scale (the standard deviation) is 1/sqrt(n)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

#REPEAT FOR e epochs
for e in range(epochs):
    del_w = np.zeros(weights.shape) #step one of the algorithm
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Note: We haven't included the h variable from the previous
        #       lesson. You can add it if you want, or you can calculate
        #       the h together with the output

        #CALCULATE THE OUTPUT
        # TODO: Calculate the output: f(h)
        output = sigmoid(np.dot(x, weights))

        # TODO: Calculate the error: target minus the network output
        error = y - output

        #CALCULATE THE ERROR TERM
        # TODO: Calculate the error term
        error_term = error * output * (1-output) #f(h) and f'(h). This makes the code faster

        # UPDATE THE WEIGHT STEP
        # TODO: Calculate the change in weights for this sample
        #       and add it to the total weight change
        del_w += error_term * x

    #UPDATE THE WEIGHTS
    # TODO: Update weights using the learning rate and the average change in weights
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        #print(e)
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
tes_out = sigmoid(np.dot(features_test, weights))
predictions = tes_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Train loss:  0.2627609385
Train loss:  0.209286194093
Train loss:  0.200842929081
Train loss:  0.198621564755
Train loss:  0.197798513967
Train loss:  0.197425779122
Train loss:  0.197235077462
Train loss:  0.197129456251
Train loss:  0.197067663413
Train loss:  0.197030058018
Prediction accuracy: 0.725


In [19]:
features

          gre       gpa  rank_1  rank_2  rank_3  rank_4
209 -0.066657  0.289305       0       1       0       0
280  0.625884  1.445476       0       1       0       0
33   1.837832  1.603135       0       0       1       0
210  1.318426 -0.131120       0       0       0       1
93  -0.066657 -1.208461       0       1       0       0
84  -0.759199  0.552071       0       0       1       0
329 -0.759199 -1.208461       0       0       0       1
94   0.625884  0.131646       0       1       0       0
266 -0.239793 -0.393886       0       0       0       1
126  0.106478  0.394412       1       0       0       0


In [10]:
np.random.normal(scale=1 / n_features**.5, size=n_features)

array([ 0.64471093,  0.31330392, -0.19166212,  0.22149921, -0.18918948,
       -0.19013338])

In [11]:
for x, y in zip(features.values, targets):
    print(x, y)

[-0.06665712  0.28930534  0.          1.          0.          0.        ] 0
[ 0.62588442  1.44547565  0.          1.          0.          0.        ] 0
[ 1.83783211  1.60313523  0.          0.          1.          0.        ] 1
[ 1.31842596 -0.13112022  0.          0.          0.          1.        ] 0
[-0.06665712 -1.20846073  0.          1.          0.          0.        ] 0
[-0.75919866  0.55207132  0.          0.          1.          0.        ] 1
[-0.75919866 -1.20846073  0.          0.          0.          1.        ] 0
[ 0.62588442  0.13164576  0.          1.          0.          0.        ] 1
[-0.23979251 -0.3938862   0.          0.          0.          1.        ] 0
[ 0.10647826  0.39441173  1.          0.          0.          0.        ] 1
[ 0.97215519  1.39292245  0.          1.          0.          0.        ] 0
[-0.41292789  0.26302874  1.          0.          0.          0.        ] 1
[-0.23979251 -0.52526919  0.          0.          1.          0.        ] 0
[ -9.3233404

# Multilayer Perceptrons - Implementing the hidden layer 

Matrix operations check

In [20]:
inputs = np.array([1, 2, 3])
inputsT = inputs[:,None]

In [22]:
inputsT

array([[1],
       [2],
       [3]])

In [13]:
weightsinput = np.matrix([[4, 20], [15, 32], [30, 20]])
weightsinput.shape

(3, 2)

In [14]:
np.matmul(inputsT.T, weightsinput)

matrix([[124, 144]])

In [15]:
np.matmul(weightsinput.T, inputsT)

matrix([[124],
        [144]])

In [16]:
np.dot(inputsT.T, weightsinput)

matrix([[124, 144]])
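
The cells above use `np.matrix`, which NumPy's documentation no longer recommends. The same check can be done with plain arrays and the `@` operator; this sketch reuses the numbers from above, with `weights_demo` as my own name for the weight array.

```python
import numpy as np

inputs = np.array([1, 2, 3])                            # shape (3,)
weights_demo = np.array([[4, 20], [15, 32], [30, 20]])  # shape (3, 2)

# (3,) @ (3, 2) -> (2,): one value per hidden unit
row_result = inputs @ weights_demo

# (2, 3) @ (3, 1) -> (2, 1): same numbers, as a column
col_result = weights_demo.T @ inputs[:, None]

print(row_result)        # [124 144]
print(col_result.shape)  # (2, 1)
```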

### Programming quiz

Below, you'll implement a forward pass through a 4x3x2 network, with sigmoid activation functions for both layers.

Things to do:

* Calculate the input to the hidden layer.
* Calculate the hidden layer output.
* Calculate the input to the output layer.
* Calculate the output of the network.

In [23]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)

weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))


# TODO: Make a forward pass through the network

#hidden_layer_in = np.matmul(X[:,None].T, weights_input_to_hidden)
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output:')
print(hidden_layer_out)

#output_layer_in = np.matmul(hidden_layer_out, weights_hidden_to_output)
output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)

Hidden-layer Output:
[ 0.41492192  0.42604313  0.5002434 ]
Output-layer Output:
[ 0.49815196  0.48539772]


In [30]:
X.shape

(4,)

In [25]:
weights_input_to_hidden

array([[-0.02341534, -0.0234137 ,  0.15792128],
       [ 0.07674347, -0.04694744,  0.054256  ],
       [-0.04634177, -0.04657298,  0.02419623],
       [-0.19132802, -0.17249178, -0.05622875]])

In [26]:
np.dot(X, weights_input_to_hidden)

array([-0.34365494, -0.29801368,  0.00097362])

In [27]:
hidden_layer_out

array([ 0.41492192,  0.42604313,  0.5002434 ])

# Backpropagation

**Backpropagation exercise**

Below, you'll implement the code to calculate one backpropagation update step for two sets of weights. I wrote the forward pass; your goal is to code the backward pass.

Things to do

* Calculate the network's output error.
* Calculate the output layer's error term.
* Use backpropagation to calculate the hidden layer's error term.
* Calculate the change in weights (the delta weights) that results from propagating the errors back through the network.

In [18]:
import numpy as np


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])
#there are 3 inputs, 1 hidden layer with 2 nodes, 1 output (JF)
weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate output error
error = target - output

# TODO: Calculate error term for output layer
output_error_term = error  * (output) * (1-output)

# TODO: Calculate error term for hidden layer
hidden_error_term = weights_hidden_output * output_error_term * (hidden_layer_output) * (1-hidden_layer_output)
# hidden_error_term = np.dot(output_error_term, weights_hidden_output) * \
#                     hidden_layer_output * (1 - hidden_layer_output)


# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * x[:,None] * hidden_error_term[:,None].T 

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)


Change in weights for hidden layer to output layer:
[ 0.00804047  0.00555918]
Change in weights for input layer to hidden layer:
[[  1.77005547e-04  -5.11178506e-04]
 [  3.54011093e-05  -1.02235701e-04]
 [ -7.08022187e-05   2.04471402e-04]]
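
The input-to-hidden step relies on broadcasting a column of inputs against a row of error terms. As a sketch of what that does, the snippet below compares it with `np.outer`. The array `hidden_term_demo` holds made-up error terms, not the values computed in the exercise.

```python
import numpy as np

x_demo = np.array([0.5, 0.1, -0.2])         # 3 inputs, as in the exercise
hidden_term_demo = np.array([0.01, -0.02])  # made-up error terms for 2 hidden units

# (3, 1) * (2,) broadcasts to (3, 2):
# entry [i, j] is x_demo[i] * hidden_term_demo[j]
by_broadcast = x_demo[:, None] * hidden_term_demo
by_outer = np.outer(x_demo, hidden_term_demo)

print(by_broadcast.shape)
print(np.allclose(by_broadcast, by_outer))
```

The result has the same shape as the input-to-hidden weight matrix, which is what lets it be added to those weights directly.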


# Implementing backpropagation

Here's the general algorithm for updating the weights with backpropagation:

* Set the weight steps for each layer to zero
    * The input to hidden weights $$\Delta w_{ij} = 0$$
    * The hidden to output weights $$\Delta W_j = 0$$

* For each record in the training data:
    * Make a forward pass through the network, calculating the output $$\hat{y}$$
    * Calculate the error gradient in the output unit, $$\delta^o = (y - \hat{y}) f'(z)$$ where $$z = \sum_j W_j a_j$$, the input to the output unit.
    * Propagate the errors to the hidden layer $$\delta^h_j = \delta^o W_j f'(h_j)$$
    * Update the weight steps:
        * $$\Delta W_j = \Delta W_j + \delta^o a_j$$
        * $$\Delta w_{ij} = \Delta w_{ij} + \delta^h_j a_i$$
* Update the weights, where η is the learning rate and m is the number of records:
    * $$W_j = W_j + \eta \Delta W_j / m$$
    * $$w_{ij} = w_{ij} + \eta \Delta w_{ij} / m$$

* Repeat for e epochs.
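
One way to check a backpropagation implementation is to compare its weight steps against a numerical gradient of the loss. The sketch below does this for a single made-up record on a 3x2x1 network. The `loss` helper, the factor of 1/2 in it, and all the values are my assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(w_ih, w_ho, x, y):
    """Squared error with the conventional factor of 1/2 (an assumption)."""
    a = sigmoid(x @ w_ih)
    y_hat = sigmoid(a @ w_ho)
    return 0.5 * (y - y_hat) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=3)  # one made-up record
y = 1.0
w_ih = rng.normal(scale=0.5, size=(3, 2))
w_ho = rng.normal(scale=0.5, size=2)

# Backpropagation, following the steps listed above
a = sigmoid(x @ w_ih)
y_hat = sigmoid(a @ w_ho)
delta_o = (y - y_hat) * y_hat * (1 - y_hat)
delta_h = delta_o * w_ho * a * (1 - a)
step_ho = delta_o * a           # contribution to Delta W_j
step_ih = x[:, None] * delta_h  # contribution to Delta w_ij

# Central finite differences of the loss with respect to each w_ij
eps = 1e-6
numeric = np.zeros_like(w_ih)
for i in range(w_ih.shape[0]):
    for j in range(w_ih.shape[1]):
        bump = np.zeros_like(w_ih)
        bump[i, j] = eps
        numeric[i, j] = (loss(w_ih + bump, w_ho, x, y)
                         - loss(w_ih - bump, w_ho, x, y)) / (2 * eps)

# The weight step should be the negative gradient of the loss
print(np.allclose(step_ih, -numeric, atol=1e-8))
```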

## Backpropagation Exercise

Now you're going to implement the backprop algorithm for a network trained on the graduate school admission data. You should have everything you need from the previous exercises to complete this one.

Your goals here:

* Implement the forward pass.
* Implement the backpropagation algorithm.
* Update the weights.

In [31]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        print(e)
        print(x.shape)
        print(y.shape)
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)
        output = sigmoid(np.dot(hidden_output, weights_hidden_output))

        ## Backward pass ##
        # TODO: Calculate the network's prediction error
        error = y - output

        # TODO: Calculate error term for the output unit
        output_error_term = error * output * (1-output)

        ## propagate errors to hidden layer

        # TODO: Calculate the hidden layer's contribution to the error
        hidden_error = np.dot(output_error_term, weights_hidden_output)
        
        # TODO: Calculate the error term for the hidden layer
        hidden_error_term = hidden_error * hidden_output * (1-hidden_output)
        
        # TODO: Update the change in weights
        del_w_hidden_output += output_error_term * hidden_output
        del_w_input_hidden += hidden_error_term * x[:, None]

    # TODO: Update weights
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        # Use all training records here, not just the last x from the loop
        hidden_output = sigmoid(np.dot(features, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output,
                             weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))


[-0.06665712  0.28930534  0.          1.          0.          0.        ]
0
[ 0.62588442  1.44547565  0.          1.          0.          0.        ]
0
[ 1.83783211  1.60313523  0.          0.          1.          0.        ]
1
[ 1.31842596 -0.13112022  0.          0.          0.          1.        ]
0
[-0.06665712 -1.20846073  0.          1.          0.          0.        ]
0
[-0.75919866  0.55207132  0.          0.          1.          0.        ]
1
[-0.75919866 -1.20846073  0.          0.          0.          1.        ]
0
[ 0.62588442  0.13164576  0.          1.          0.          0.        ]
1
[-0.23979251 -0.3938862   0.          0.          0.          1.        ]
0
[ 0.10647826  0.39441173  1.          0.          0.          0.        ]
1
[ 0.97215519  1.39292245  0.          1.          0.          0.        ]
0
[-0.41292789  0.26302874  1.          0.          0.          0.        ]
1
[-0.23979251 -0.52526919  0.          0.          1.          0.        ]
0
[ -9.3233404

KeyboardInterrupt: 

In [20]:
weights_input_hidden.shape

(6, 2)

In [21]:
weights_hidden_output.shape

(2,)

In [22]:
output_error_term.shape

()

In [23]:
hidden_output

array([ 0.5453801 ,  0.37113363])