# Neural Network Basics

### Brief Review of Machine Learning

In supervised learning, parametric models are those where the model is a function of a fixed form with a number of unknown _parameters_.  Together with a loss function and a training set, an optimizer can select parameters to minimize the loss with respect to the training set.  A common optimizer is stochastic gradient descent, which repeatedly tweaks the parameters slightly to move the loss "downhill" on a small batch of examples drawn from the training set.
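As a rough illustration of the idea, here is a minimal sketch of mini-batch stochastic gradient descent on a toy least-squares problem. The data, `learning_rate`, and `batch_size` are assumptions made up for this example; they are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 examples, 3 features, generated from known weights.
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)        # parameters to fit
learning_rate = 0.1    # assumed step size
batch_size = 20        # assumed mini-batch size

for step in range(500):
    # Draw a small batch of examples from the training set.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    X_batch, y_batch = X[idx], y[idx]

    # Gradient of the mean squared error on this batch with respect to w.
    residual = X_batch @ w - y_batch
    grad = 2.0 * X_batch.T @ residual / batch_size

    # Take a small step "downhill".
    w -= learning_rate * grad

print("fitted w:", w)
```

Each step only looks at one batch, so the path is noisy, but on a well-behaved problem like this one the parameters should end up close to `true_w`.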

## Part A:  Linear & Logistic Regression

You've likely seen linear regression before.  In linear regression, we fit a line (technically, hyperplane) that predicts a target variable, $y$, based on some features $x$.  The form of this model is affine (even if we call it "linear"):  

$$\hat{y} = xW + b$$

where $W$ and $b$ are weights and an offset, respectively, and are the parameters of this parametric model.  The loss function that the optimizer uses to fit these parameters is the squared error ($\|\hat{y} - y\|_2^2$) between the prediction and the ground truth in the training set.
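To make the affine form and the squared error concrete, here is a small sketch in NumPy. All of the numbers are hypothetical values chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
x = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # 2 examples, 2 features
W = np.array([[0.5],
              [-1.0]])       # weights, shape (2, 1)
b = 0.25                     # scalar offset
y = np.array([[-1.0],
              [-2.0]])       # ground truth, shape (2, 1)

y_hat = x @ W + b                          # affine transform, shape (2, 1)
squared_error = np.sum((y_hat - y) ** 2)   # squared L2 norm of the residual

print(y_hat.ravel())    # [-1.25 -2.25]
print(squared_error)    # 0.125
```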

You've also likely seen logistic regression, which is closely related to linear regression.  Logistic regression also fits a line (again, technically a hyperplane), this time separating the positive and negative examples of a binary classifier.  The form of this model is similar: 

$$\hat{y} = \sigma(xW + b)$$

where again $W$ and $b$ are the parameters of this model, and $\sigma$ is the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) which maps un-normalized scores ("logits") to values $\hat{y} \in [0,1]$ that represent probabilities. The loss function that the optimizer uses to fit these parameters is the [cross entropy](../information_theory.ipynb) between the prediction and the ground truth in the training set.
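A minimal sketch of the same pipeline in NumPy may help: logit, then sigmoid, then binary cross-entropy. The helper names `sigmoid` and `binary_cross_entropy` and the numeric values are assumptions for this example, and the loss uses the natural log (so it is measured in nats).

```python
import numpy as np

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    """Cross-entropy (in nats) between a 0/1 label y and a predicted probability y_hat."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical parameters, picked so that the logit comes out (almost exactly) zero.
x = np.array([1.0, -2.0, 0.5])
W = np.array([0.2, 0.1, -0.4])
b = 0.2

z = x @ W + b                             # the logit, a scalar
y_hat = sigmoid(z)                        # about 0.5
loss = binary_cross_entropy(1.0, y_hat)   # about ln(2) for a positive label

print(z, y_hat, loss)
```

A logit of zero sits exactly on the decision boundary, which is why the predicted probability is one half.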

This pattern of an affine transform, $xW + b$, occurs over and over in machine learning.

**We'll use logistic regression as our running example for the rest of this part.**


### Short Answer Questions

Imagine you want to implement logistic regression:

* `z = xW + b`
* `y_hat = sigmoid(z)`

Where:
1.  `x` is a 10-dimensional feature vector
2.  `W` is the weight vector
3.  `b` is the bias term

What are the dimensions of `W` and `b`?  Recall that in logistic regression, `z` is just a scalar (commonly referred to as the "logit").

Sketch a picture of the whole equation using rectangles to illustrate the dimensions of `x`, `W`, and `b`.  See examples below for inspiration (though please label each dimension).  We don't ask you to submit this, but make sure you can do it!  It's the "print" debugging statement of neural networks!  It's also useful for reading papers... if you can't draw the shapes of all the tensors, you don't (yet) know what's going on!
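If you want to check a sketch against NumPy, printing `.shape` is the quickest way. The snippet below assumes one possible convention, treating the single example as a row vector; placeholder values are used because only the shapes matter here.

```python
import numpy as np

# Assumed convention: a single example is a row vector.
x = np.ones((1, 10))     # 1 example, 10 features
W = np.zeros((10, 1))    # one weight per feature
b = 0.0                  # a single scalar offset

z = x @ W + b            # (1, 10) @ (10, 1) -> (1, 1), i.e. one logit

for name, value in [("x", x), ("W", W), ("z", z)]:
    print(name, value.shape)
```

Using 1-D arrays instead (`x` and `W` both of shape `(10,)`) would give a true scalar `z`; either convention works as long as it is applied consistently.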

## Part B: Batching

Let's say we want to perform inference using your model (parameters `W` and `b`) above on 10 examples instead of just 1. On modern hardware (especially GPUs), we can do this efficiently by *batching*.

To do this, we stack up the feature vectors in `x` like in the diagram below.  Note that changing the number of examples you run on (i.e. your batch size) *does not* affect the number of parameters in your model.  You're just running the same thing in parallel (instead of running the above one feature vector at a time).

![](batchaffine.png)

The red (# features) and blue (batch size) lines represent dimensions that are the same.
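The claim that batching is "the same thing in parallel" can be checked directly. This sketch uses hypothetical sizes (4 features, 6 examples) and random values, and compares one batched matrix product against a loop over single examples.

```python
import numpy as np

rng = np.random.default_rng(0)

num_features, batch_size = 4, 6                   # hypothetical sizes
W = rng.normal(size=(num_features, 1))            # parameters do not depend on batch size
b = 0.5
x = rng.normal(size=(batch_size, num_features))   # one example per row

# One matrix product for the whole batch.
z_batched = x @ W + b                             # shape (batch_size, 1)

# The same computation, one feature vector at a time.
z_looped = np.vstack([x_i @ W + b for x_i in x])  # shape (batch_size, 1)

print(z_batched.shape, np.allclose(z_batched, z_looped))
```

Only `x` and `z` grow with the batch; `W` and `b` keep their shapes.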

### Short Answer Questions

If we have 10 features and run the model in parallel on 30 examples, what are the dimensions of:

1. `W` ?
2. `b` ?
3. `x` ?
4. `z` ?

_Hint:_ remember that your model parameters stay fixed!

## Part C: Logistic Regression - NumPy Implementation

In this section, we'll implement logistic regression by hand and compute a few values to make sure we understand what's going on!

Let's say your model has the following parameters:

In [6]:
import numpy as np

W = np.array([45,6,3,25,-1])
b = 5

If you want to run the model on the following three examples:

* [1, 2, 3, 4, 5]
* [0, 0, 0, 0, 5]
* [-3, -4, -12, -1, 1]

Construct the x matrix **such that you compute the answer all in one big batch** and compute the probability of the positive class for each.

In [7]:
# Import sigmoid.
from scipy.special import expit as sigmoid
np.set_printoptions(suppress=True)

### YOUR CODE HERE
x = np.array([[1,2,3,4,5], [0,0,0,0,5], [-3,-4,-12,-1,1]])
print(x)
# z_1 = sigmoid(x*W+b)
# print("z_1: ", z_1)

z_2 = sigmoid(np.matmul(x,W)+b)
print("z_2: ", z_2)

z_3 = sigmoid(np.dot(x,W)+b)  # Check 
print("z_3: ", z_3)

### END YOUR CODE

[[  1   2   3   4   5]
 [  0   0   0   0   5]
 [ -3  -4 -12  -1   1]]
z_2:  [1.  0.5 0. ]
z_3:  [1.  0.5 0. ]


In [9]:
p=1
q=0.5
ce = -np.dot(p, np.log(q))  # from lesson 2
print(ce)

print(-np.mean(p*np.log(q) + (1-p)*np.log(1-q)))  # check

p=1
q=0.5
ce2 = -np.dot(p, np.log2(q))  # from lesson 2
print(ce2)
print(-np.mean(p*np.log2(q) + (1-p)*np.log2(1-q)))

0.6931471805599453
0.6931471805599453
1.0
1.0


In [None]:
# p=np.array([0,0,0,0,5])
# q=0.5
# ce = -np.dot(p, np.log(q))  # from lesson 2
# print(ce)

# print(-np.mean(p*np.log(q) + (1-p)*np.log(1-q)))  # check

### Short Answer Questions

1. What is the probability of the positive class for the second (middle) example?
2. What is the cross-entropy loss of the second example if its label is positive?

## Part D: NumPy Feed Forward Neural Network

Let's do the same procedure for a simple feed-forward neural network.

Imagine you have a 3-layer network.  Each hidden layer is size 10.  Just like before, you've already trained your model and you just want to run it forward.  For this exercise, let's say that each weight matrix is `np.ones(...)` and each bias term is `[-1, -2, -3, ..., -n]` if the bias term is $n$ long.  Compute the probability of the positive class for the three examples above, again in a single batch.

**Hint:  Draw the shapes of the matrices at each layer out on a piece of paper!  Include it with any questions you post to Piazza.**

Assume your model uses a sigmoid as the nonlinearity for all layers.

In [53]:
### YOUR CODE HERE

x = np.array([[1,2,3,4,5], [0,0,0,0,5], [-3,-4,-12,-1,1]])
y = 0
b = np.arange(-1, -11, -1)
outputSize = 1
hiddenSize = 10
inputSize = x.shape[1]
# print(inputSize)
# print(hiddenSize)
W_in = np.ones((inputSize, hiddenSize))
W_h = np.ones((hiddenSize, hiddenSize))
W_out = np.ones((hiddenSize, outputSize))

h1 = np.dot(x, W_in) + b
z1 = sigmoid(h1)

print('x: ', x.shape)
print('W_in', W_in.shape)
print('h1: ', h1.shape)
print('z1: ', z1.shape)
print('b1: ', b.shape)
print('b: ', b)

h2 = np.dot(z1, W_h) + b
z2 = sigmoid(h2)

print('z1: ', z1.shape)
print('W_h', W_h.shape)
print('h2: ', h2.shape)
print('z2: ', z2.shape)
print('b2: ', b.shape)
print('b: ', b)

h3 = np.dot(z2, W_h) + b
z3 = sigmoid(h3)

print('z2: ', z2.shape)
print('W_h', W_h.shape)
print('h3: ', h3.shape)
print('z3: ', z3.shape)
print('b3: ', b.shape)
print('b: ', b)

output = np.dot(z3, W_out) + b[0]
final_z = sigmoid(output)

print('z3: ', z3.shape)
print('W_out', W_out.shape)
print('h3: ', h3.shape)
print('final: ', final_z.shape)
print('b4: ', b[0])

print('final_z dims: ', final_z.shape)
print('final_z: \n', final_z)

p = 0
q = final_z

print('probability of 3rd example: ', final_z[2])
# ce = -np.dot(p, np.log2(q[2])) 
# print(ce)

print('cross-entropy loss: ', -np.mean(p*np.log2(q[2]) + (1-p)*np.log2(1-q[2])))

    
### END YOUR CODE

x:  (3, 5)
W_in (5, 10)
h1:  (3, 10)
z1:  (3, 10)
b1:  (10,)
b:  [ -1  -2  -3  -4  -5  -6  -7  -8  -9 -10]
z1:  (3, 10)
W_h (10, 10)
h2:  (3, 10)
z2:  (3, 10)
b2:  (10,)
b:  [ -1  -2  -3  -4  -5  -6  -7  -8  -9 -10]
z2:  (3, 10)
W_h (10, 10)
h3:  (3, 10)
z3:  (3, 10)
b3:  (10,)
b:  [ -1  -2  -3  -4  -5  -6  -7  -8  -9 -10]
z3:  (3, 10)
W_out (10, 1)
h3:  (3, 10)
final:  (3, 1)
b4:  -1
final_z dims:  (3, 1)
final_z: 
 [[0.99934417]
 [0.92744813]
 [0.41696628]]
probability of 3rd example:  [0.41696628]
cross-entropy loss:  0.7783487725678238


In [42]:
#### Alternative as a check.

from scipy.special import expit as sigmoid
x = np.array([[1,2,3,4,5], [0,0,0,0,5], [-3,-4,-12,-1,1]])

# y = np.arrary([0],[1])
y = 0
b = np.arange(-1, -11, -1)
class Neural_Network(object):
    def __init__(self):
        #parameters
        self.inputSize = 5
        self.outputSize = 1
        self.hiddenSize = 10

        #weights
        self.W1 = np.ones((self.inputSize, self.hiddenSize))
        self.W2 = np.ones((self.hiddenSize, self.hiddenSize))
        self.W3 = np.ones((self.hiddenSize, self.hiddenSize))
        self.W4 = np.ones((self.hiddenSize, self.outputSize))
    
    def forward(self, X):
        #forward propagation through our network
        self.z = np.dot(X, self.W1) # dot product of X (input, 3x5) and the first set of 5x10 weights
        self.z2 = self.sigmoid(self.z + b) # activation function
        self.z3 = np.dot(self.z2, self.W2) # dot product of hidden layer (z2, 3x10) and the second set of 10x10 weights
        self.z4 = self.sigmoid(self.z3 + b)
        self.z5 = np.dot(self.z4, self.W3)
        self.z6 = self.sigmoid(self.z5 + b)
        self.z7 = np.dot(self.z6, self.W4)
        o = self.sigmoid(self.z7  + b[0]) # final activation function
        return o
    
    def sigmoid(self, s):
    # activation function
        return 1/(1+np.exp(-s))

NN = Neural_Network()

out = NN.forward(x)

ce = -np.dot(y, np.log(out))

print("Predicted Output: \n", str(out))
print("Actual Output: \n", str(y))
print("Cross-Entropy: \n", str(ce))
print(out[2])
print(-np.dot(y, np.log(out[2])))
print(-np.mean(p*np.log2(q[2]) + (1-p)*np.log2(1-q[2])))

Predicted Output: 
 [[0.99934417]
 [0.92744813]
 [0.41696628]]
Actual Output: 
 0
Cross-Entropy: 
 [[-0.]
 [-0.]
 [-0.]]
[0.41696628]
[0.]
0.7783487725678238


3 dense layers with sigmoid (each with a 10-dimensional hidden layer output) + an affine with sigmoid output.

So, to be extra clear, your code should EITHER read something like this (# affines before the logistic head == # layers):

* `h1 = ...`
* `h2 = ...`
* `h3 = ...`
* `z_final = ...`
* `p = ...`
* `xe_loss = ...`

OR something like this (arguing # affines == # layers):

* `h1 = ...`
* `h2 = ...`
* `z_final = ...`
* `p = ...`
* `xe_loss = ...`

We'll accept either.
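For reference, here is a compact sketch of the first reading (three hidden layers plus a logistic head), written as a loop over layers. It is one possible way to organize the computation, not the required one; the local `sigmoid` helper and the variable names are assumptions, and the loss uses the natural log (nats), so divide by `np.log(2)` to get bits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[1, 2, 3, 4, 5],
              [0, 0, 0, 0, 5],
              [-3, -4, -12, -1, 1]], dtype=float)

hidden_size = 10
layer_sizes = [x.shape[1], hidden_size, hidden_size, hidden_size]

# Three hidden layers: all-ones weights, bias [-1, -2, ..., -n].
h = x
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = np.ones((n_in, n_out))
    b = -np.arange(1, n_out + 1)
    h = sigmoid(h @ W + b)            # (3, n_in) @ (n_in, n_out) -> (3, n_out)

# Logistic head: a single output unit, so the bias is just [-1].
W_out = np.ones((hidden_size, 1))
b_out = -np.arange(1, 2)
z_final = h @ W_out + b_out           # shape (3, 1)
p = sigmoid(z_final)

# Binary cross-entropy for the third example with a negative label (natural log).
label = 0.0
xe_loss = -(label * np.log(p[2, 0]) + (1 - label) * np.log(1 - p[2, 0]))

print(p.ravel())
print(xe_loss, xe_loss / np.log(2))   # nats, bits
```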

### Short Answer Questions

1.  What is the probability of the third example?
2.  What is the cross-entropy loss if its label is negative?

## Part E: Softmax

Recall that softmax(z) is a vector with the same length as z, and whose components are:  $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
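A minimal sketch of this formula in NumPy follows. The `softmax` helper is an assumption for this example; it subtracts the maximum logit before exponentiating, a common trick that leaves the result unchanged (the factor cancels between numerator and denominator) while avoiding overflow for large logits.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array of logits, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

q = softmax([1, 2, 3])
print(q, q.sum())

# Adding a constant to every logit does not change the probabilities.
q_big = softmax([1001, 1002, 1003])
print(np.allclose(q, q_big))

# Cross-entropy (natural log) when the correct class is the last one.
loss = -np.log(q[2])
print(loss)
```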

### Short Answer Questions

1. If the logits coming from the main body of the network are [1, 2, 3], what is the probability of the middle class?
2. What is the cross-entropy loss if the correct class is the last one (i.e. corresponding to logit=3)?
3. If you had such a three-class classification problem, what would the dimensions of W and b be for the last layer of the feed forward neural network above? 

In [16]:
logits = np.array([1,2,3])
exp_l = np.exp(logits)
print('outputs: ', exp_l)

sum_exp_l = np.sum(exp_l)
print('output sum: ', sum_exp_l)

q = exp_l/sum_exp_l
print('model probability (q): ', q)
print('probability of middle class: ', q[1])

p = 1
ce = -np.dot(p, np.log(q[2]))
print('CE: ', ce)  # cross-entropy loss in nats (natural log)


p = 1
ce2 = -np.dot(p, np.log2(q[2]))
print('CE2: ', ce2)  # the same loss in bits (log base 2)


outputs:  [ 2.71828183  7.3890561  20.08553692]
output sum:  30.19287485057736
model probability (q):  [0.09003057 0.24472847 0.66524096]
probability of middle class:  0.24472847105479767
CE:  0.4076059644443803
CE2:  0.5880511035406706


In [None]:
# https://cocoxu.github.io/courses/5525_slides_spring17/06_perceptron.pdf
# In practice, can represent this with one giant weight vector and repeated features for each category (10, 15 (3*5)) maybe
# hidden weights are (10,15) - final weight is (10,3) - b is (3,)
# B3 is not the last layer so it could be 10,15 and b3 is (3,)

w3 = np.ones((10,3))
# z3 = np.ones((10,15))#?
print(w3)
x = np.array([1,2,3])
b = np.array([-1,-2,-3])
# print(h)
# print(np.dot(W_out, z3))#?
print(w3-b)
print(b.shape)
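For the three-class question, a shape check along the following lines may be clearer than the scratch cell above. It is a sketch under assumptions: the hidden activations `h3` are placeholder values (all 0.5) rather than the real outputs of the network, and the bias follows the exercise's `[-1, -2, ..., -n]` convention.

```python
import numpy as np

batch_size, hidden_size, num_classes = 3, 10, 3

# Placeholder hidden activations (hypothetical values); only the shape matters here.
h3 = np.full((batch_size, hidden_size), 0.5)

W_out = np.ones((hidden_size, num_classes))   # (10, 3): one column of weights per class
b_out = np.array([-1.0, -2.0, -3.0])          # (3,): one bias per class

logits = h3 @ W_out + b_out                   # (3, 10) @ (10, 3) + (3,) -> (3, 3)

# Row-wise softmax: one probability distribution per example.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(logits.shape, probs.shape)
print(probs.sum(axis=1))
```

The bias is added (not subtracted) and broadcasts across the batch dimension, so every example gets the same per-class offset.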