# Neural Networks

* Synapse: the gap between one neuron and another.
* Information travels down the axon and excites the synapses onto other neurons, which can in turn fire by sending out spike trains.
* Neurons are computational units.
* Neurons are complicated. To a first approximation though (by definition) they are very simple.

(image of artificial neuron)

## Perceptron
1. Inputs x_i: think of them as firing rates or the strength of inputs.
2. Each input is multiplied by a weight w_i.
3. Activation: sum the w_i x_i and check whether the sum is >= the firing threshold. If it is, output 1; otherwise output 0. (The `Perceptron` class below compares with a strict `>`, so an activation exactly on the threshold outputs 0.)
4. Output

Artificial neurons can be tuned so that they fire under different conditions.

In [None]:
# Perceptron class

import numpy as np


class Perceptron:
    """
    This class models an artificial neuron with step activation function.
    """

    def __init__(self, weights = np.array([1]), threshold = 0):
        """
        Initialize weights and threshold based on input arguments. Note that no
        type-checking is being performed here for simplicity.
        """
        self.weights = weights
        self.threshold = threshold
    
    def activate(self,inputs):
        """
        Takes in @param inputs, a list of numbers equal to length of weights.
        @return the output of a threshold perceptron with given inputs based on
        perceptron weights and threshold.
        """ 

        # INSERT YOUR CODE HERE
        

        # TODO: calculate the strength with which the perceptron fires
        activation = np.dot(inputs, self.weights)
        
        
        # TODO: return 0 or 1 based on the threshold
        if activation > self.threshold:
            result = 1
        else:
            result = 0
            
        return result


def test():
    """
    A few tests to make sure that the perceptron class performs as expected.
    Nothing should show up in the output if all the assertions pass.
    """
    p1 = Perceptron(np.array([1, 2]), 0.)
    assert p1.activate(np.array([ 1,-1])) == 0 # < threshold --> 0
    assert p1.activate(np.array([-1, 1])) == 1 # > threshold --> 1
    assert p1.activate(np.array([ 2,-1])) == 0 # on threshold --> 0

if __name__ == "__main__":
    test()

### What sort of things can ANNs compute?
How powerful is a perceptron unit?

**A perceptron always computes a half-space: its decision boundary is a hyperplane (a line in 2D).**

- The hyperplane splits the input space (a 2D plane for two inputs) into the region that gets an output of 0 and the region that gets an output of 1.
    - The boundary comes from a linear inequality (e.g. when x_1 = 0, the output switches at x_2 = 1.5).
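
The half-plane picture can be checked numerically. This is only a sketch: the weights `w = (0.5, 0.5)` and threshold `theta = 0.75` are assumptions picked for illustration (the notes do not state them), chosen so that the boundary crosses x_2 = 1.5 when x_1 = 0. The helper names `fires` and `boundary_x2` are also made up.

```python
import numpy as np

# Assumed, illustrative parameters (not given in the notes).
w = np.array([0.5, 0.5])
theta = 0.75

def fires(x):
    """Return 1 if the weighted sum reaches the threshold, else 0."""
    return int(np.dot(w, x) >= theta)

def boundary_x2(x1):
    """x_2 value on the decision line w_1*x_1 + w_2*x_2 = theta."""
    return (theta - w[0] * x1) / w[1]

print(boundary_x2(0.0))              # where the line crosses the x_2 axis
print(fires(np.array([0.0, 2.0])))   # a point above the line
print(fires(np.array([0.0, 1.0])))   # a point below the line
```

Every point on one side of the line gets the same output, which is all a single unit can express.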

Computations expressible as perceptron units:
- AND (x_1 in {0,1}, x_2 in {0,1}).
- OR
- NOT (one variable: output 1 when x_1 = 0, output 0 when x_1 = 1), e.g. w_1 = -1, theta = 0.
- If we can combine perceptron units into a network, we can represent any boolean function.
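
A minimal sketch of these three gates as threshold units, using the `>=` convention from the perceptron description. The AND and OR weights and thresholds are assumptions (many other choices work); the NOT values are the ones listed above. The `perceptron` helper is a hypothetical name for this example.

```python
import numpy as np

def perceptron(weights, theta):
    """Build a threshold unit: outputs 1 when w.x >= theta, else 0."""
    weights = np.asarray(weights, dtype=float)
    return lambda *x: int(np.dot(weights, x) >= theta)

# Illustrative weight/threshold choices; many others work equally well.
AND = perceptron([0.5, 0.5], 1.0)   # assumed values
OR = perceptron([0.5, 0.5], 0.5)    # assumed values
NOT = perceptron([-1.0], 0.0)       # w_1 = -1, theta = 0 as in the notes

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```

The `Perceptron` class in these notes compares with a strict `>`, so with that class the thresholds would need to sit slightly lower than the ones used here.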

### Ways to train a perceptron
- Perceptron rule (thresholded output)
- Gradient descent (unthresholded output)

### Perceptron rule
 
- The threshold is folded into the weights: append a constant 1 to the inputs x, so that its weight plays the role of the (negative) threshold.
- Repeat while there is error:
$$\Delta w_i = \eta(y-\hat y)x_i$$
where
$$\hat y = \left(\sum_i w_ix_i \geq 0\right),$$

$\hat y$ is boolean and
$\eta$ is the learning rate.

If the data is linearly separable, the perceptron rule will find a separating line in finite time.
* But often it's not clear whether the data is linearly separable, especially if the data has many dimensions.
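
The "repeat while there is error" loop can be sketched as follows. This is an illustration under assumptions, not the course's reference solution: the function name `train_perceptron`, the zero initial weights, `eta = 0.1`, the `max_epochs` cap, and the OR training set are all made up for the example.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Perceptron rule with the threshold folded in as a bias weight."""
    # Append a constant 1 to every input so the last weight acts as -theta.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(Xb, y):
            y_hat = int(np.dot(w, x_i) >= 0)
            if y_hat != y_i:
                w += eta * (y_i - y_hat) * x_i
                errors += 1
        if errors == 0:        # "repeat while there is error"
            break
    return w

# OR is linearly separable, so the loop should stop well before the cap.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)
predictions = [int(np.dot(w, np.append(x, 1)) >= 0) for x in X]
print(w, predictions)
```

The `max_epochs` cap matters in practice: on data that is not linearly separable the error never reaches zero and the loop would otherwise run forever.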

In [None]:
# ----------
#
# In this exercise, you will update the perceptron class so that it can update
# its weights.
#
# Finish writing the update() method so that it updates the weights according
# to the perceptron update rule.
# 
# ----------

import numpy as np


class Perceptron:
    """
    This class models an artificial neuron with step activation function.
    """

    def __init__(self, weights = np.array([1]), threshold = 0):
        """
        Initialize weights and threshold based on input arguments. Note that no
        type-checking is being performed here for simplicity.
        """
        self.weights = weights
        self.threshold = threshold


    def activate(self, values):
        """
        Takes in @param values, a list of numbers equal to length of weights.
        @return the output of a threshold perceptron with given inputs based on
        perceptron weights and threshold.
        """
               
        # First calculate the strength with which the perceptron fires
        strength = np.dot(values,self.weights)
        
        # Then return 0 or 1 depending on strength compared to threshold  
        return int(strength > self.threshold)


    def update(self, values, train, eta=.1):
        """
        Takes in a 2D array @param values consisting of a LIST of inputs and a
        1D array @param train, consisting of a corresponding list of expected
        outputs. Updates internal weights according to the perceptron training
        rule using these values and an optional learning rate, @param eta.
        """
        
        # YOUR CODE HERE
        self.weights = self.weights.astype(float)
        
        # TODO: for each data point...
        for i in range(len(train)):
            # TODO: obtain the neuron's prediction for that point
            prediction = self.activate(values[i])
            print("prediction for i=", i, " : ", prediction)
            print("train for i=", i, " : ", train[i])
            # TODO: update self.weights based on prediction accuracy, learning
            # rate and input value
            for j in range(len(self.weights)):
                weight_delta = eta * (train[i] - prediction) * values[i][j]
                print("weight_delta for j=", j, " : ", weight_delta)
                self.weights[j] = self.weights[j] + weight_delta
                print("self.weights after j=", j, " is now ", self.weights)
            print("self.weights after ", i, " is now ", self.weights)

def test():
    """
    A few tests to make sure that the perceptron class performs as expected.
    Nothing should show up in the output if all the assertions pass.
    """
    def sum_almost_equal(array1, array2, tol = 1e-6):
        return sum(abs(array1 - array2)) < tol

    p1 = Perceptron(np.array([1,1,1]),0)
    print("p1 weights:", p1.weights)
    p1.update(np.array([[2,0,-3]]), np.array([1]))
    print("p1 weights:", p1.weights)
    print("should be equal to np.array([1.2, 1, 0.7])")
    # assert sum_almost_equal(p1.weights, np.array([1.2, 1, 0.7]))

    p2 = Perceptron(np.array([1,2,3]),0)
    print("p2 weights:", p2.weights)
    p2.update(np.array([[3,2,1],[4,0,-1]]),np.array([0,0]))
    print("p2 weights:", p2.weights)
    print("should be equal to np.array([0.7, 1.8, 2.9])")
    # assert sum_almost_equal(p2.weights, np.array([0.7, 1.8, 2.9]))

    p3 = Perceptron(np.array([3,0,2]),0)
    print("p3 weights:", p3.weights)
    p3.update(np.array([[2,-2,4],[-1,-3,2],[0,2,1]]),np.array([0,1,0]))
    print("p3 weights:", p3.weights)
    print("should be equal to np.array([2.7, -0.3, 1.7])")
    # assert sum_almost_equal(p3.weights, np.array([2.7, -0.3, 1.7]))

test()

### Building the XOR Network: Debugging

In [None]:
# ----------
#
# In this exercise, you will create a network of perceptrons that can represent
# the XOR function, using a network structure like those shown in the previous
# quizzes.
#
# You will need to do two things:
# First, create a network of perceptrons with the correct weights
# Second, define a procedure EvalNetwork() which takes in a list of inputs and
# outputs the value of this network.
#
# ----------

import numpy as np

class Perceptron:
    """
    This class models an artificial neuron with step activation function.
    """

    def __init__(self, weights = np.array([1]), threshold = 0):
        """
        Initialize weights and threshold based on input arguments. Note that no
        type-checking is being performed here for simplicity.
        """
        self.weights = weights
        self.threshold = threshold


    def activate(self, values):
        """
        Takes in @param values, a list of numbers equal to length of weights.
        @return the output of a threshold perceptron with given inputs based on
        perceptron weights and threshold.
        """
               
        # First calculate the strength with which the perceptron fires
        strength = np.dot(values,self.weights)
        
        # Then return 0 or 1 depending on strength compared to threshold  
        return int(strength > self.threshold)

            
# Part 1: Set up the perceptron network
Network = [
    # input layer, declare input layer perceptrons here
    [Perceptron(np.array([1.,0.])), Perceptron(np.array([0.5,0.5,])), Perceptron(np.array([0.0,1.0]))], \
    # output node, declare output layer perceptron here
    [Perceptron(np.array([1,-2,1]))]
]

# Part 2: Define a procedure to compute the output of the network, given inputs
def EvalNetwork(inputValues, Network):
    """
    Takes in @param inputValues, a list of input values, and @param Network
    that specifies a perceptron network. @return the output of the Network for
    the given set of inputs.
    """
    
    # YOUR CODE HERE
    # Feed the actual inputs into the first layer; each layer's outputs
    # then become the next layer's inputs.
    input = inputValues
    for layer in Network:
        output = []
        for perceptron in layer:
            perceptron_output = perceptron.activate(input)
            output.append(perceptron_output)
            print("pw: ", perceptron.weights, "input: ", input, "output: ", perceptron_output)
        output_temp = output
        input = output
                
    
    OutputValue = int(output_temp[0])  
    # Be sure your output value is a single number
    return OutputValue


def test():
    """
    A few tests to make sure that the perceptron class performs as expected.
    """
    print("0 XOR 0 = 0?:", EvalNetwork(np.array([0,0]), Network))
    print("0 XOR 1 = 1?:", EvalNetwork(np.array([0,1]), Network))
    print("1 XOR 0 = 1?:", EvalNetwork(np.array([1,0]), Network))
    print("1 XOR 1 = 0?:", EvalNetwork(np.array([1,1]), Network))

if __name__ == "__main__":
    test()

And I figured out I'd got my AND weights wrong. Missed out a threshold=1.0.

WOWW I'm such a moron.

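Ah whoops, the threshold needs to be just below 1 (e.g. 0.9999): `activate` compares with a strict `>`, and for inputs (1, 1) the AND unit's weighted sum is exactly 1.0, which is not greater than 1.0.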

In [None]:
# ----------
#
# In this exercise, you will create a network of perceptrons that can represent
# the XOR function, using a network structure like those shown in the previous
# quizzes.
#
# You will need to do two things:
# First, create a network of perceptrons with the correct weights
# Second, define a procedure EvalNetwork() which takes in a list of inputs and
# outputs the value of this network.
#
# ----------

import numpy as np

class Perceptron:
    """
    This class models an artificial neuron with step activation function.
    """

    def __init__(self, weights = np.array([1]), threshold = 0):
        """
        Initialize weights and threshold based on input arguments. Note that no
        type-checking is being performed here for simplicity.
        """
        self.weights = weights
        self.threshold = threshold


    def activate(self, values):
        """
        Takes in @param values, a list of numbers equal to length of weights.
        @return the output of a threshold perceptron with given inputs based on
        perceptron weights and threshold.
        """
               
        # First calculate the strength with which the perceptron fires
        strength = np.dot(values,self.weights)
        
        # Then return 0 or 1 depending on strength compared to threshold  
        return int(strength > self.threshold)

            
# Part 1: Set up the perceptron network
Network = [
    # input layer, declare input layer perceptrons here
    [Perceptron(np.array([1.,0.])), Perceptron(np.array([0.5,0.5,]), threshold=0.99999), Perceptron(np.array([0.0,1.0]))], \
    # output node, declare output layer perceptron here
    [Perceptron(np.array([1,-2,1]))]
]

# Part 2: Define a procedure to compute the output of the network, given inputs
def EvalNetwork(inputValues, Network):
    """
    Takes in @param inputValues, a list of input values, and @param Network
    that specifies a perceptron network. @return the output of the Network for
    the given set of inputs.
    """
    
    # YOUR CODE HERE
    input = inputValues
    for layer in Network:
        output = []
        for perceptron in layer:
            perceptron_output = perceptron.activate(input)
            output.append(perceptron_output)
            print("pw: ", perceptron.weights, "input: ", input, "output: ", perceptron_output)
        output_temp = output
        input = output
                
    
    OutputValue = int(output_temp[0])  
    # Be sure your output value is a single number
    return OutputValue


def test():
    """
    A few tests to make sure that the perceptron class performs as expected.
    """
    print("0 XOR 0 = 0?:", EvalNetwork(np.array([0,0]), Network))
    print("0 XOR 1 = 1?:", EvalNetwork(np.array([0,1]), Network))
    print("1 XOR 0 = 1?:", EvalNetwork(np.array([1,0]), Network))
    print("1 XOR 1 = 0?:", EvalNetwork(np.array([1,1]), Network))

if __name__ == "__main__":
    test()


### Gradient Descent
- More robust when the data is not linearly separable.

Activation:
$$a = \sum_i x_i w_i$$

Imagine the output is not thresholded.
-> Figure out weights such that the unthresholded value is as close to the target output as we can get.

$$E(w)=\frac{1}{2}\sum_{(x,y)\in D} (y-a)^2$$
 
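Take the partial derivative of E(w) with respect to w_i:

$$\frac{\partial E}{\partial w_i} = \sum_{(x,y)\in D}(y-a)(-x_i)$$

Stepping against this gradient changes each weight in proportion to $(y-a)x_i$, which looks like the perceptron rule with $a$ in place of $\hat y$.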
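
A minimal sketch of batch gradient descent on the unthresholded activation, using the error and gradient formulas above. The toy data, the learning rate `eta = 0.05`, and the iteration count are assumptions made up for illustration; the targets are generated from known weights so the result is easy to check.

```python
import numpy as np

def error(w, X, y):
    """E(w) = 1/2 * sum over the data of (y - a)^2, with a = X @ w."""
    return 0.5 * np.sum((y - X @ w) ** 2)

def gradient(w, X, y):
    """dE/dw_i = sum over the data of (y - a) * (-x_i)."""
    return -(y - X @ w) @ X

# Made-up toy data: y = 2*x_1 - 1*x_2 exactly, so the best w is (2, -1).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])

w = np.zeros(2)
eta = 0.05
for _ in range(2000):
    w -= eta * gradient(w, X, y)   # same as w += eta * sum((y - a) * x)

print(w, error(w, X, y))
```

If `eta` is too large for the data the updates diverge instead of settling, so the step size here was chosen small on purpose.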

### Comparison of learning rules
Perceptron rule: guaranteed to converge in finite time when the data is linearly separable.
$$\Delta w_i = \eta(y-\hat y)x_i$$

Gradient descent: based on calculus. More robust to datasets that are not linearly separable. Converges in the limit to a local optimum.
$$\Delta w_i = \eta(y-a)x_i$$
* Why not do gradient descent on $\hat y$? -> It's not differentiable because it's discontinuous (a step function).
* So we want to smooth out the threshold. -> Sigmoid.

### Advantages of having threshold vs returning 

### Tuning perceptron parameters
- 


### Inputs to perceptron networks
- A single perceptron is very much like linear regression. Therefore it should take the same kinds of inputs. However the outputs of perceptrons will generally be classifications, not numerical.
- A matrix

## Variation of Perceptrons

I just took the number of perceptrons in each layer and multiplied them together to get the total number of possible outcomes for the quiz. Somehow I think that's wrong.

As discussed in the previous lesson, to solve the problem of having only a very few discrete outputs from our neural net, we'll apply a transition function.

We'll start by letting you test out a variety of functions, numerically approximating their derivatives in order to apply a gradient descent update rule.

We have decided we need a function that is continuous (to avoid the discrete problem of perceptrons) but not linear (to allow us to represent non-linear functions). 
* Logistic function is appropriate

## Sigmoids

Sigmoid: S-like.

$$ \sigma(a) = \frac{1}{1+e^{-a}} $$

- As $a \to -\infty$, $\sigma(a) \to 0$
- As $a \to +\infty$, $\sigma(a) \to 1$

$$ D\sigma(a) = \sigma(a)(1-\sigma(a))$$
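
The derivative formula can be sanity-checked by approximating the derivative numerically with a central difference. This is a sketch only: the sample points and the step size `h = 1e-5` are arbitrary choices, and `sigmoid_prime` is a name introduced for this example.

```python
import numpy as np

def sigmoid(a):
    """Logistic function 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Closed-form derivative sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

a = np.linspace(-5.0, 5.0, 11)
h = 1e-5
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)  # central difference
print(np.max(np.abs(numeric - sigmoid_prime(a))))      # should be tiny
```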

Q: Difference between sigmoid unit and a single perceptron in a binary classification problem?
* Sigmoid unit will give more info but both give the same answer.
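
One way to see this: cutting the sigmoid output at 0.5 gives the same labels as a perceptron with threshold 0, while the raw sigmoid value also says how far the activation is from the boundary. The activation values below are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic function 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + np.exp(-a))

activations = np.array([-3.0, -0.5, 0.2, 4.0])  # made-up activations
hard = (activations > 0).astype(int)             # perceptron, threshold 0
soft = sigmoid(activations)                      # sigmoid unit
print(hard)
print(np.round(soft, 3))
print((soft > 0.5).astype(int))                  # same labels as `hard`
```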

Determine update rules using calculus.

### Potential problems with gradient descent
(to find locally optimal set of weights)
- Local extrema
- Lengthy run times
- Infinite loops
- Failure to completely converge

## Layered networks

### Additional layers don't give us more representational power if the units are all linear.
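
A quick numerical illustration of the heading: two stacked linear layers (no activation, no bias) compute exactly what one layer with the product of the weight matrices computes. The layer sizes and random weights are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden layer: 3 inputs -> 4 linear units
W2 = rng.normal(size=(2, 4))   # output layer: 4 hidden -> 2 linear units
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x)      # pass through both linear layers
one_layer = (W2 @ W1) @ x      # single equivalent weight matrix
print(np.allclose(two_layer, one_layer))
```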

(Neural net diagram)

If the entire neural net is made up of sigmoids, the mapping from input to output is differentiable in terms of the weights.
* That is, for any given weight in the network, we can figure out how moving it up or down by a little bit changes the mapping from input to output, so we can move each weight in the direction that reduces the error.

### Back-propagation
A computationally beneficial organisation of the chain rule.
- Information flows forward from input -> output.
- Error information flows back from output -> input.

If we replace the sigmoids with some other differentiable unit, this also works.
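
A small sketch of the chain rule organised this way, for a network with one hidden layer of sigmoids and a squared-error loss. Everything specific here is an assumption for illustration: the layer sizes, the random weights, the absence of bias terms, and the function names. The last lines compare one back-propagated gradient entry against a numerical estimate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W1, W2, x):
    """Forward pass: input -> hidden sigmoids -> output sigmoid."""
    h = sigmoid(W1 @ x)
    out = sigmoid(W2 @ h)
    return h, out

def loss(W1, W2, x, y):
    _, out = forward(W1, W2, x)
    return 0.5 * np.sum((y - out) ** 2)

def backprop(W1, W2, x, y):
    """Gradients of the squared error with respect to W1 and W2."""
    h, out = forward(W1, W2, x)
    delta_out = (out - y) * out * (1 - out)        # error at the output
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error pushed back
    return np.outer(delta_hid, x), np.outer(delta_out, h)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, y = np.array([1.0, 0.0]), np.array([1.0])

gW1, gW2 = backprop(W1, W2, x, y)

# Numerical check of one entry of W1 via a central difference.
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp, W2, x, y) - loss(Wm, W2, x, y)) / (2 * eps)
print(gW1[0, 0], numeric)
```

The two printed numbers should agree closely; the back-propagated version gets every entry of both gradients from a single forward and backward pass.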

The error function can have multiple local optima. -> Could just be stuck at an overall non-optimal weight setting.
* Imagine combining many parabolas in a higher dimensional space and considering the local minima that are quite high up.

### Optimising Weights

Methods
- Gradient Descent
- Advanced methods
    - Momentum terms in the gradient: instead of being stuck in a high bowl, you have enough energy to bounce out and pop over to the next one.
    - Higher order derivatives (look at how combinations of weights change the error, rather than individual weights, e.g. Hessians).
    - Randomised optimisation
    - Penalty for 'complexity' (cf. overfitting in decision trees and regression).
        - Networks get complex when we: add more nodes or more layers, have large weights.
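
The momentum update itself is short. The sketch below only shows the mechanics on an assumed toy objective f(x) = x**2, which has a single bowl, so it does not demonstrate escaping a local minimum. The values `eta = 0.1` and `beta = 0.9` are arbitrary choices.

```python
def grad(x):
    """Gradient of the toy objective f(x) = x**2 (an assumed example)."""
    return 2.0 * x

x, velocity = 5.0, 0.0
eta, beta = 0.1, 0.9          # learning rate and momentum coefficient
for _ in range(500):
    velocity = beta * velocity - eta * grad(x)  # keep part of the last step
    x += velocity

print(x)
```

Because part of the previous step carries over, the iterate overshoots and oscillates around the minimum before settling. That carried-over "energy" is what can take it past a small bump in a less friendly error surface.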

Some people in ML think optimisation and learning are the same thing.

## Restriction Bias
What are neural nets appropriate for?

Restriction bias tells you about the representational power of the data structure you're using.
- It is the set of hypotheses we're willing to consider.

e.g.
- Perceptrons -> linear: half-spaces.
- Networks of sigmoids -> much more complex.
    * So not much restriction at all.

Types of functions we can represent: 
* Boolean via network of threshold-like units.
* Continuous: Can do this with a single hidden layer as long as we can have as many units as we want. -> Each unit can worry about a small patch of the function. -> Output layer stitch the patches together.
* Arbitrary: Adding hidden layers to stitch patches together even if they have jumps between them.
    - But that means we have a **danger of overfitting**: We can represent the noise as well. 
        - Set max number of hidden layers.
        - Cross-validation: nodes to put in each layer, number of layers, max weights
    - Training a NN is an iterative process in which training error decreases as the number of iterations increases, whereas cross-validation error can start to increase at some point, so training for too long is itself a way to overfit.
    

## Preference Bias

The algorithm's selection of one representation over another
(e.g. decision trees prefer correct trees and splits with maximum information gain).

1. How do we start?
    - Initial weights: small, random values. (Randomness gives some variance: if we run multiple times, we don't want to get stuck in the same place every time. Small, because large weights make the network more complex and prone to overfitting.)
2. Preference bias
    - Prefer correct over incorrect
    - Prefer simpler over complex
    
**Occam's Razor**: Entities should not be multiplied unnecessarily.


...Better generalisation error?

## Summary
- Perceptrons: Linear threshold unit
- Networks can be put together to produce any boolean function
- Perceptron rule - finite time for linearly separable datasets
- General differentiable rule: Back propagation and gradient descent
- Preference and restriction bias


## I totally don't get this activation function sandbox quiz

Josh: ML x design

Teaching a computer to recognise sketches
* Feature extraction and engineering

Applications
- Teddy search
- Search Doodle 2.0 semantic search horse running annotating motions
- Simulate physics 
- TEDDY 3d mesh
- Shadow Draw / Sketch
- (Comparing comments)
    -> Understanding comments (good or bad)

Data: 250 categories, 80 images per category

Feature engineering
- Word expansion -> Synonyms
- Lower case normalise
- Bag of words could be two-word groups
- Convolution

Features
- Colors (RGB)
- Gradients (HOG)
- (Still feature engineering) Cluster using k-means

Train: 
...
Test if in 'Top k'

? NN extract features for you

