In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6) # set default figure size, 8in by 6in

# Video W4 01: Non-linear Hypothesis

[YouTube Video Link](https://www.youtube.com/watch?v=mcnvIWDnPns&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=44)

We have already seen some linear regression and logistic classification problems that are not
easily modeled using only linear assumptions.  A couple of times in the companion videos, it was
mentioned that the techniques we developed extend easily to non-linear hypotheses.
For example, we could expand a given set of features to include all of the quadratic combinations
(every product of two features, including each feature raised to the second power).
This can work OK if we only have a few tens to hundreds of original features, though we might have to expand
beyond quadratic and look at third power, fourth power combinations, etc.  For quadratic
features, as the video mentions, the number of inputs grows with the square of the number of
original features, so a problem with 100 original inputs would require about $10,000 / 2$
(roughly 5,000) quadratic combinations, and higher order powers grow even faster.  Thus for problems with more than
100 or so original inputs, it quickly becomes infeasible to model and solve these problems using
standard regression methods.
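
The growth described above can be checked with a little arithmetic.  The following is a minimal sketch, not something from the video: the function names are made up for illustration, and it assumes Python 3.8 or later for `math.comb`.  It counts the distinct products of $d$ features (allowing repeats such as $x_1^2$) and verifies the quadratic count by brute force on a small case.

```python
from itertools import combinations_with_replacement
from math import comb  # assumption: Python 3.8+, where math.comb is available


def num_terms_of_degree(n, d):
    """Count the distinct products of d features chosen (with repeats) from n features."""
    return comb(n + d - 1, d)


def num_quadratic_terms(n):
    """Count the distinct products x_i * x_j with i <= j, which is n * (n + 1) / 2."""
    return num_terms_of_degree(n, 2)


# brute force check on a small case: list every pair (i, j) with i <= j
n_small = 5
brute_force_count = len(list(combinations_with_replacement(range(n_small), 2)))
print(brute_force_count, num_quadratic_terms(n_small))

# quadratic and cubic term counts for increasing numbers of original features
for n in (10, 100, 1000):
    print(n, num_quadratic_terms(n), num_terms_of_degree(n, 3))
```

For 100 original features this gives 5,050 quadratic terms, in line with the rough $10,000 / 2$ estimate, and 171,700 cubic terms.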

So these next two weeks we will be looking at a different learning method, neural networks.  Neural
networks are able to create models or solutions using non-linear combinations of features, without
the combinatorial explosion in the number of feature combinations that we just saw.

# Video W4 02: Neurons and the Brain

[YouTube Video Link](https://www.youtube.com/watch?v=Nx3HVwg2uGA&index=45&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW)

# Video W4 03: Model Representation I

[YouTube Video Link](https://www.youtube.com/watch?v=wnSol2JRZeY&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=46)

Neural networks use a "hypothesis" or model for individual units of computation that is the same
as we have already seen.

<img src="../../figures/nn-model-logistic-unit.png">

What happens in a neural network is that the inputs $x_1, x_2, x_3$ are multiplied by what are known as
weights $w_1, w_2, w_3$, which are associated with the input wires in the diagram.  The weights are
the same as the $\theta_1, \theta_2, \theta_3$ parameters we have been using in the previous lectures
(but by convention they are usually referred to as weights rather than parameters in the
context of neural networks).  Each input is multiplied by its weight, and the products are
summed together.  This weighted sum is then passed through a non-linear function to produce the output of
the neural network unit.  A common output function, and the one used in these videos, is the logistic
(sigmoid) function, the same as we used for logistic regression:

$$
h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}
$$

But what makes a neural network different from logistic regression is that we organize the
computation of the hypothesis into many small logistic regression calculations, and we combine several of
these using 2 or more layers of the network.
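
A single unit as described above can be sketched in a few lines.  This is only an illustration: the function names and the weight and input values are made up, and we assume the weight vector carries a bias weight in position 0, following the bias convention used in our earlier regression notebooks.

```python
import numpy as np


def sigmoid(z):
    """The logistic (sigmoid) function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


def unit_output(weights, inputs):
    """Output of one logistic unit.

    weights: array of n + 1 values, where weights[0] is the bias weight (an assumption)
    inputs:  array of n raw input values, without the bias input
    """
    x = np.concatenate(([1.0], inputs))  # prepend the bias input x_0 = 1.0
    z = np.dot(weights, x)               # multiply inputs by weights and sum
    return sigmoid(z)                    # pass the sum through the non-linearity


# hypothetical weights and inputs, chosen so that the weighted sum is exactly 0
w = np.array([0.5, -1.0, 2.0, 0.25])
x = np.array([1.0, 0.5, -2.0])
print(unit_output(w, x))
```

With these values the weighted sum is $0.5 - 1.0 + 1.0 - 0.5 = 0$, so the unit outputs $g(0) = 0.5$.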

# Video W4 04: Model Representation II

[YouTube Video Link](https://www.youtube.com/watch?v=vuhueI_7324&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=47)

To summarize, this video looks in detail at the equations we will be using for a vectorized implementation of a
neural network.  We use the following example neural network with 3 inputs, 3 units in the hidden layer, and 1 output:

<img src="../../figures/nn-example-model.png">

The following equations give us statements for calculating what is known as the feed forward activation of the network:

$$
a_1^{(2)} = g\big( \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3 \big)
$$

$$
a_2^{(2)} = g\big( \Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3 \big)
$$

$$
a_3^{(2)} = g\big( \Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3 \big)
$$

$$
h_\Theta(x) = a_1^{(3)} = g\big( \Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)} \big)
$$

There are a lot of equations, but this is mostly just notation.  Be sure that you understand the basic ideas of the notation.  
The $a_1^{(2)}$, $a_2^{(2)}$, $a_3^{(2)}$ values are the activations (the outputs) of the units in the hidden layer, or layer 2.  Layer 1 refers to the raw inputs
used for the problem, thus the first layer of actual computing units is layer 2.  The output layer in this model network
can also be referred to as layer 3.  Notice that in all of the equations, the superscripts do not refer to raising the value
to a power; they simply refer to the layer numbers of the units in our network.  Values such as $\Theta_{10}^{(1)}$ represent
the parameters of the model we will be trying to learn.  As mentioned previously, in most neural network literature these
are referred to as the network weights, but in our companion videos we will continue to refer to these parameters as $\Theta$.
Notice that, as with our previous linear and logistic regression, we have a bias unit (not shown in figure) named $x_0$.
This bias unit will always have a value of $1.0$, but the weight from the bias unit can be set to any valid value.  Likewise,
each layer in such a network will also have a bias unit, for example in our equations the output layer uses $a_0^{(2)}$
which is a bias unit which will always have a value of $1.0$ as well.  

Also make sure you understand the notation of
the $\Theta_{12}^{(1)}$ parameters.  The superscript simply means this is a weight between the input layer 1 and the hidden
layer 2 (or you can think of it as the first set or layer of weights).  The subscript means this is the weight from input
$x_2$ to unit 1 of layer 2.  Many other texts on neural networks would separate the two indices with a comma, like this: $\Theta_{i,j}^{(1)}$.
Also notice that the order is reversed from what you might expect: $\Theta_{ij}^{(1)}$ is read as
the weight coming from unit $j$ in layer 1, going to unit $i$ in layer 2.  The reason for switching the order of these
is to make the definitions of the linear algebra operations on the matrices representing the network's weights straightforward.
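
To make the index convention concrete, here is a small sketch of how a subscript pair maps onto a NumPy array.  The matrix values are made up so that each entry spells out its own indices (the entry for $\Theta_{ij}$ holds $10i + j$), and the helper `weight` is a hypothetical name used only for this illustration.

```python
import numpy as np

# hypothetical layer 1 weights for 3 hidden units and 3 inputs plus a bias;
# each entry is 10 * i + j so that the value itself shows the subscript "ij"
Theta1 = np.array([[10.0, 11.0, 12.0, 13.0],
                   [20.0, 21.0, 22.0, 23.0],
                   [30.0, 31.0, 32.0, 33.0]])


def weight(Theta, i, j):
    """Return Theta_{ij}: the weight from unit j of the source layer (j = 0 is the
    bias unit) to unit i of the next layer, where units are numbered from 1."""
    # rows are destination units (1-based in the notation, 0-based in numpy),
    # columns are source units, with column 0 holding the bias weights
    return Theta[i - 1, j]


print(Theta1.shape)           # one row per hidden unit, one column per input plus bias
print(weight(Theta1, 1, 2))   # Theta_12: from input x_2 to unit 1 of layer 2
print(weight(Theta1, 3, 0))   # Theta_30: from the bias unit x_0 to unit 3 of layer 2
```

Because the destination unit indexes the row, each row of the matrix holds all of the weights feeding one unit, which is what lets a whole layer be computed with one matrix-vector product.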


Also as noted, we often refer to the sum of the $\Theta$ parameters times the inputs/activations by the values $z_1^{(2)}$.  For example:

$$
z_1^{(2)} = \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3
$$

With this definition, the activation of unit 1 in layer 2 becomes:

$$
a_1^{(2)} = g(z_1^{(2)})
$$

where you should recall that $g()$ represents our logistic or sigmoid function.  Given this, we can calculate all of the $z$
values for a layer using a single matrix operation, and then all of the activations for the layer by applying $g()$
elementwise to each of the values in the $z$ vector:

$$
z^{(2)} = \Theta^{(1)} \cdot x
$$
$$
a^{(2)} = g(z^{(2)})
$$

You should realize that $\Theta^{(1)}$ represents a $3 \times 4$ matrix (4 because of the bias unit input)
in the previous equations, and that the multiplication
shown is a $3 \times 4$ matrix multiplied by a $4 \times 1$ vector.  As discussed in the video, this results in
3 activation values, but we add a bias activation value for the hidden layer, thus getting a $4 \times 1$
vector.
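
The shapes involved in this vectorized step can be checked with a short sketch.  The weight and input values here are arbitrary placeholders, not learned parameters, and the variable names are our own.

```python
import numpy as np


def g(z):
    """The logistic (sigmoid) function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


# placeholder 3x4 weight matrix: 3 hidden units, 3 inputs plus the bias input
Theta1 = np.arange(12.0).reshape(3, 4) / 10.0 - 0.5

# input vector with the bias unit x_0 = 1.0 in the first position
x = np.array([1.0, 0.2, -0.4, 0.7])

z2 = np.dot(Theta1, x)                      # all 3 z values in one matrix operation
a2 = g(z2)                                  # elementwise sigmoid gives 3 activations
a2_with_bias = np.concatenate(([1.0], a2))  # prepend the bias activation a_0 = 1.0

print(z2.shape, a2.shape, a2_with_bias.shape)

# the first entry matches the unit-by-unit equation for a_1^(2)
a1_by_hand = g(Theta1[0, 0] * x[0] + Theta1[0, 1] * x[1]
               + Theta1[0, 2] * x[2] + Theta1[0, 3] * x[3])
print(np.isclose(a2[0], a1_by_hand))
```

Note that NumPy reports the vectors as 1-dimensional arrays of 3 and 4 elements rather than as explicit column matrices.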

This forward propagation of activation should remind you strongly of our previous logistic regression models.  Each unit in
a neural network is computing a single case of logistic regression.  However, when we have multiple layers in a neural network,
later layers are computing logistic regression on a set of features that the neural network has learned.

How we learn the set of $\Theta$ parameter weights in order to automatically learn whatever features work well to solve
the problem will be discussed next week, when we look at the backpropagation algorithm for trying to optimize the parameters
for a network to learn some model.



# Video W4 05: Examples and Intuitions I

[YouTube Video Link](https://www.youtube.com/watch?v=BhWlHvjEn3s&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=48)

**Simple Example: AND**

Let's first look at the network to compute the **AND** function.

$$
h_{\Theta}(x) = g \big( -30 x_0 + 20 x_1 + 20 x_2 \big)
$$

Here we have 2 inputs, and 1 bias unit.  This gives us a $1 \times 3$ matrix of `Theta` parameters.  We can define a matrix of
the inputs and the `Theta` parameters like this:

In [2]:
# parameters for AND function
Theta = np.array([[-30.0, 20.0, 20.0]]) # a 1x3 matrix
print(Theta.shape)

x = np.array([1.0, 0.0, 0.0]) # a 3 element vector, x_0 is our bias unit, and the next two values are x_1,x_2
print(x.shape)

(1, 3)
(3,)


In the next cell we will redefine our logistic function once again, and we will use matrix operations to compute the $z$
value and get our final output for the previous set of inputs.

In [3]:
def g(z):
    """The logistic or sigmoid function, given a scalar value, or a numpy array of values in z
    return the sigmoid of all values in z as our result.
    """
    return 1.0 / (1.0 + np.exp(-z))

z = np.dot(Theta, x)
print(z)
print(g(z))

[-30.]
[9.35762297e-14]


If we want to recreate the whole **AND** binary table, we can do the previous on all possible inputs:

In [4]:
# our basic set of inputs
Xraw = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
(rows, cols) = Xraw.shape
print(rows, cols)

# add in a column of 1.0s for the bias units
X = np.ones((rows, cols+1))
X[:, 1:] = Xraw
print(X)
print(X.shape)

4 2
[[1. 0. 0.]
 [1. 0. 1.]
 [1. 1. 0.]
 [1. 1. 1.]]
(4, 3)


In [5]:
def threshold(z):
    """Given a value or array of values, perform a threshold at 0.5"""
    return np.where(z >= 0.5, np.ones(z.shape), np.zeros(z.shape))
    
print("x_1 x_2 |     z |       g(z) | output")
for x in X:
    z = np.dot(Theta, x)
    #print x[1:], g(z), threshold(g(z))
    print("%3d %3d | %5.1f | %0.8f | %0.1f" % (x[1], x[2], z, g(z), threshold(g(z))))

x_1 x_2 |     z |       g(z) | output
  0   0 | -30.0 | 0.00000000 | 0.0
  0   1 | -10.0 | 0.00004540 | 0.0
  1   0 | -10.0 | 0.00004540 | 0.0
  1   1 |  10.0 | 0.99995460 | 1.0


**Simple Example: OR function**

Here are the `Theta` parameters for the **OR** function, and the truth table results for this function:

In [6]:
# parameters for OR function
Theta = np.array([[-10.0, 20.0, 20.0]]) # a 1x3 matrix

print("x_1 x_2 |     z |       g(z) | output")
for x in X:
    z = np.dot(Theta, x)
    #print x[1:], g(z), threshold(g(z))
    print("%3d %3d | %5.1f | %0.8f | %0.1f" % (x[1], x[2], z, g(z), threshold(g(z))))

x_1 x_2 |     z |       g(z) | output
  0   0 | -10.0 | 0.00004540 | 0.0
  0   1 |  10.0 | 0.99995460 | 1.0
  1   0 |  10.0 | 0.99995460 | 1.0
  1   1 |  30.0 | 1.00000000 | 1.0


# Video W4 06: Examples and Intuitions II

[YouTube Video Link](https://www.youtube.com/watch?v=QZqmNpEyiKI&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=49)

**NOT or Negation**

The example **NOT** function from the video:

In [7]:
# parameters for the not function
Theta = np.array([[10.0, -20.0]]) # a 1x2 matrix

# our basic set of inputs
Xraw = np.array([[0.0],
                 [1.0]])
(rows, cols) = Xraw.shape

# add in a column of 1.0s for the bias units
X = np.ones((rows, cols+1))
X[:, 1:] = Xraw

print("x_1 |     z |       g(z) | output")
for x in X:
    z = np.dot(Theta, x)
    #print x[1:], g(z), threshold(g(z))
    print("%3d | %5.1f | %0.8f | %0.1f" % (x[1], z, g(z), threshold(g(z))))

x_1 |     z |       g(z) | output
  0 |  10.0 | 0.99995460 | 1.0
  1 | -10.0 | 0.00004540 | 0.0


**Putting it together: $x_1 \; \textrm{XNOR} \; x_2$**

Before looking at the next cell you might want to try putting together the code from the previous cells to build the network
shown in the middle part of this video.  Notice that in this network we now have 2 layers of computing units, so we will need two sets of
`Theta` parameters: those for the weights from the inputs to the first activation layer, and those for the weights from
these intermediate activations to our final output.  We will then need to calculate the final output using a 2 step process.

Let's start by defining our two sets of `Theta` layer weights, and our original set of inputs we need for our binary
logical functions:

In [8]:
# our basic set of inputs
Xraw = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
(rows, cols) = Xraw.shape

# add in a column of 1.0s for the bias units
X = np.ones((rows, cols+1))
X[:, 1:] = Xraw

In [9]:
# weights from layer 1 to layer 2, a 2 rows by 3 column matrix.  First column is from the bias unit
# first row represents our simple x_1 AND x_2 feature
# second row represents the (NOT x_1) AND (NOT x_2) feature
Theta1 = np.array([[-30.0,  20.0,  20.0],
                   [ 10.0, -20.0, -20.0]])

# weights from layer 2 to layer 3, a 1 row by 3 column matrix, First column will be from the bias unit
# these weights represent a simple OR function
Theta2 = np.array([[-10.0,  20.0,  20.0]])

To calculate the final output, we need to perform two steps: compute the output activations for layer 2, then compute the final
output activation of layer 3.  This is the basis of feed forward activation.

In [10]:
x = X[0] # select one of the inputs as an example, you can use any input X[0] to X[3] to test out the forward activation
print(x)

z2 = np.dot(Theta1, x)
a2 = g(z2)
print(z2)
print(a2)

# we need to add in a bias activation to the 2 activations we calculated
a2_bias = np.ones( (3,))
a2_bias[1:] = a2

z3 = np.dot(Theta2, a2_bias)
a3 = g(z3)
print(z3)
print(a3)

[1. 0. 0.]
[-30.  10.]
[9.35762297e-14 9.99954602e-01]
[9.99909204]
[0.99995456]


You should try out the previous cell for all the inputs and make sure you understand what is happening.

Let's write a simple little function that will take two layers of `Theta` weights and some initial inputs, and compute
the outputs of the network using the feed forward activation method we just demonstrated:

In [11]:
def feedforward(x, Theta1, Theta2):
    """Given sets of Theta parameter weights from layer1 to layer2 and from layer2 to layer3
    of a 3 layer network, compute the final feed forward activation for a given set of inputs.
    """
    # activations of layer 2
    z2 = np.dot(Theta1, x)
    a2 = g(z2)
    
    # add a bias column to our activations
    a2_bias = np.ones( (a2.size+1, ) )
    a2_bias[1:] = a2
    
    # activations of layer 3
    z3 = np.dot(Theta2, a2_bias)
    a3 = g(z3)
    
    return a2, a3

Given our feed forward function, we can compute our complete **XNOR** truth table:

In [12]:
print("x_1 x_2 | a_1^2 a_2^2 | a_1^3")
for x in X:
    a2, a3 = feedforward(x, Theta1, Theta2)
    print("%3d %3d | %5.1f %5.1f | %5.1f" % (x[1], x[2], a2[0], a2[1], a3[0]))

x_1 x_2 | a_1^2 a_2^2 | a_1^3
  0   0 |   0.0   1.0 |   1.0
  0   1 |   0.0   0.0 |   0.0
  1   0 |   0.0   0.0 |   0.0
  1   1 |   1.0   0.0 |   1.0
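
The loop above feeds one input at a time through the network.  The same computation can be written for the whole input matrix at once.  The following is a sketch of one way to do that, not the approach from the video: `feedforward_batch` is a name we made up, and it assumes the inputs are stored one example per row with the bias column already added.

```python
import numpy as np


def g(z):
    """The logistic (sigmoid) function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


def feedforward_batch(X, Theta1, Theta2):
    """Feed forward activation for every row of X at once.

    X:      (m, n + 1) inputs, one example per row, first column is the bias 1.0
    Theta1: (h, n + 1) weights from layer 1 to layer 2
    Theta2: (k, h + 1) weights from layer 2 to layer 3
    Returns the layer 2 activations (m, h) and layer 3 activations (m, k).
    """
    A2 = g(np.dot(X, Theta1.T))                            # one row of activations per example
    A2_bias = np.hstack([np.ones((A2.shape[0], 1)), A2])   # prepend a column of bias activations
    A3 = g(np.dot(A2_bias, Theta2.T))
    return A2, A3


# the same inputs and XNOR weights used in the cells above
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
Theta1 = np.array([[-30.0,  20.0,  20.0],
                   [ 10.0, -20.0, -20.0]])
Theta2 = np.array([[-10.0,  20.0,  20.0]])

A2, A3 = feedforward_batch(X, Theta1, Theta2)
xnor = (A3[:, 0] >= 0.5).astype(int)  # threshold the output activations at 0.5
print(xnor)
```

Because the examples are stored as rows, the weight matrices are transposed in the products; this is just the per-example computation $\Theta \cdot x$ applied to every row.  The thresholded result reproduces the final column of the truth table: 1, 0, 0, 1.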


# Video W4 07: Multiclass Classification

[YouTube Video Link](https://www.youtube.com/watch?v=HzpptanxP6A&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=50)

All of our previous simple logic function networks had only a single unit in the final output layer.  However, just as
our hidden layers often had more than one activation unit, we can also have more than 1 output activation in our final
output layer.  This can be used to learn multiclass classification problems.

As discussed in the video, the most common way to do this is that if we have $N$ classes we want to learn in a classification
problem using a neural network, we build a network with $N$ units in the output layer.  We then train it so that, for
each input, only the output unit corresponding to the correct class is active (outputs 1), and all other units
output 0.
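
As a small illustration of this output encoding, the sketch below builds the target vectors for a 4 class problem and converts a set of output activations back into class labels.  The labels and activation values are made up for the example, `one_hot` is our own helper name, and picking the most active output unit with `np.argmax` is one common convention rather than something prescribed in the video.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Build target vectors with a 1.0 in the position of the correct class
    and 0.0 everywhere else, one row per example."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y


# hypothetical class labels (0 to 3) for four training examples
labels = np.array([2, 0, 3, 1])
Y = one_hot(labels, 4)
print(Y)

# hypothetical output layer activations for the same four examples
outputs = np.array([[0.10, 0.20, 0.90, 0.10],
                    [0.80, 0.10, 0.05, 0.20],
                    [0.10, 0.30, 0.20, 0.70],
                    [0.20, 0.60, 0.10, 0.10]])

# predict the class whose output unit is most active
predicted = np.argmax(outputs, axis=1)
print(predicted)
```

Each row of `Y` is the target we would train the $N$ output units toward for that example.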

In [14]:
import sys
sys.path.append("../../src") # add our class modules to the module search path (sys.path)

from ml_python_class.custom_funcs import version_information
version_information()

              Module   Versions
--------------------   ------------------------------------------------------------
         matplotlib:   ['3.3.0']
              numpy:   ['1.18.5']
             pandas:   ['1.0.5']
