## Side notes 
_(code snippets, summaries, resources, etc.)_

__Further reading:__  ( \* = sources that have been integrated in this notebook)
- Used Udacity forum post and [this college course handout](http://www.idi.ntnu.no/~keithd/classes/advai/lectures/backprop.pdf) to get Sigmoid Programming Exercise working
- Useful Python code explanation of neural nets: [Neural Networks in Python](https://rolisz.ro/2013/04/18/neural-networks-in-python/)
- From Coursera/Stanford's [Machine Learning course](https://www.coursera.org/learn/machine-learning):
    - [Neural Networks: Representation](https://www.coursera.org/learn/machine-learning/home/week/4)
    - [Neural Networks: Learning](https://www.coursera.org/learn/machine-learning/home/week/5)
- [Github repo](https://github.com/mdlynch37/gradient_descent.git) (forked by mdlynch37) of Jupyter notebook based on gradient descent section of Stanford course
- \*Provides extra explanation helpful for mini-project: [_Neural Networks_ PDF by Udacity](https://www.evernote.com/shard/s37/nl/1033921335/50316007-f4a1-430e-a914-db8458a7830d/) (Evernote)
- \* [_Data Science from Scratch_ by Joel Grus (2016)](https://www.evernote.com/shard/s37/nl/1033921335/64072b2a-f2b7-4409-9d02-611bfc0f4901/) (Evernote)
    - Chapter 18: Neural Networks
    - Chapter 8: Gradient Descent
- [_Gradient Descent - Problem of Hiking Down a Mountain_ PDF worksheet by Udacity](https://www.evernote.com/shard/s37/nl/1033921335/f754539a-a88e-4ac1-85f3-dd5d705e4d37/) (Evernote)
- Calculus used for Sigmoid Function below is explained at [WolframMathWorld](http://mathworld.wolfram.com/SigmoidFunction.html)
- `sklearn`'s [Stochastic Gradient Descent module](http://scikit-learn.org/stable/modules/sgd.html)

# Neural Networks

## Summary of topics covered
![summary of neural networks](neural_networks_images/summary_neural_networks.png)



Two rules for neural network units:
1. Threshold (perceptrons)
2. Delta rule (gradient descent, using the sigmoid function on perceptron)

## Perceptrons
- Type of _neural net unit_ that:
    - Approximates a single neuron with _n binary inputs_. 
    - Computes a weighted sum of its inputs and “fires” if that weighted sum is zero or greater

![neural network in brain](neural_networks_images/neural_network_brain.png)

![artificial neural network](neural_networks_images/artificial_neural_network.png)

#### Mathematically, the perceptron computes output according to the following rule:
![perceptron rule formula](neural_networks_images/perceptron_rule_formula.png)

We can think of the perceptron as a hyperplane in n dimensions, perpendicular to the vector _w_ = (_w<sub>1</sub>, w<sub>2</sub>, . . . , w<sub>n</sub>_). The perceptron classifies things on one side of the hyperplane as positive and things on the other side as negative.
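
A minimal sketch of the rule in code, assuming the unit fires when the weighted sum is at or above the threshold. The function names, weights and threshold are made up for illustration and are not taken from the course material:

```python
def dot(v, w):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def perceptron_output(weights, threshold, x):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    return 1 if dot(weights, x) >= threshold else 0

# Hypothetical unit: weights (1, 1) and threshold 1.5 define the line x1 + x2 = 1.5
weights, threshold = [1.0, 1.0], 1.5
above = perceptron_output(weights, threshold, [1, 1])  # weighted sum 2.0, positive side
below = perceptron_output(weights, threshold, [1, 0])  # weighted sum 1.0, negative side
```

Points with `x1 + x2 >= 1.5` land on the positive side of the hyperplane, everything else on the negative side.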

### Power of a perceptron unit
![perceptron power](neural_networks_images/perceptron_power.png)

- Generalized to _halfplanes_
- Perceptrons will always be linear functions that compute hyperplanes

### Boolean logic with perceptrons
- Perceptrons with certain combinations of weights and inputs behave as a kind of "logic gate"
- These perceptrons can be combined to represent any boolean operator
- Particularly helpful for overcoming decision trees' problem with parity, i.e. the `XOR` operator (see below)

![perceptron boolean AND](neural_networks_images/perceptron_and.png)

![perceptron boolean OR](neural_networks_images/perceptron_or.png)

![perceptron boolean NOT](neural_networks_images/perceptron_not.png)

![perceptron boolean XOR](neural_networks_images/perceptron_xor.png)

__Summary graph__
![perceptron boolean summary](neural_networks_images/perceptron_boolean_summary.png)
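
One possible set of weights and thresholds for these gates, as a sketch. The values below are assumptions chosen for illustration (many other choices work) and may differ from the ones in the images:

```python
def fires(weights, threshold, inputs):
    """Threshold unit: 1 if the weighted sum reaches the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def AND(x1, x2):
    return fires([1, 1], 1.5, [x1, x2])   # only 1 + 1 clears 1.5

def OR(x1, x2):
    return fires([1, 1], 0.5, [x1, x2])   # any single 1 clears 0.5

def NOT(x1):
    return fires([-1], -0.5, [x1])        # 0 gives 0 >= -0.5, 1 gives -1 < -0.5

truth_table = [(a, b, AND(a, b), OR(a, b)) for a in (0, 1) for b in (0, 1)]
```

`XOR` is the one operator here that a single unit cannot represent, which is why it needs a combination of units.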

### Feed-Forward Neural Networks
The topology of a biological neural network (the brain) is enormously complicated, so it is often approximated with an idealized feed-forward neural network of discrete layers of neurons, each connected to the next. This typically entails:
- an _input layer_ that
    - receives inputs and feeds them forward unchanged
- one or more _“hidden layers”_
    - each of which consists of neurons that take the outputs of the previous layer
    - performs some calculation, and 
    - passes the result to the next layer
- an output layer
    - which produces the final outputs.
- each non-input neuron has a weight for each of its inputs plus a _bias_ weight; the bias input is always 1, so the bias weight can take on the role of the threshold during the dot product operation (shown in the drawn diagram below).
- for each neuron, we’ll sum up the products of its inputs and its weights.
- To make the perceptron's threshold differentiable, rather than outputting the `step_function` applied to that sum, we’ll output a smooth approximation of the step function: the _sigmoid function_

![step function vs sigmoid](neural_networks_images/step_vs_sigmoid.png)
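
A short sketch of the two functions side by side, assuming the step function fires at zero or above (as in the perceptron description earlier). The sample points are arbitrary:

```python
import math

def step_function(x):
    """Hard threshold: jumps from 0 to 1 at x = 0."""
    return 1 if x >= 0 else 0

def sigmoid(t):
    """Smooth approximation of the step function (the logistic function)."""
    return 1 / (1 + math.exp(-t))

# Compare the two at a few arbitrary points
comparison = [(x, step_function(x), sigmoid(x)) for x in (-10, -1, 0, 1, 10)]
```

Far from zero the two agree closely; near zero the sigmoid moves smoothly through 0.5 instead of jumping.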


By using a hidden layer, we are able to feed the output of an "and" neuron and the output of an "or" neuron into a "second input but not first input" neuron. The result is a network that performs "or, but not and," which is precisely XOR:

![hidden layer for XOR](neural_networks_images/hidden_layer_for_xor.png)
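
A sketch of such a network with sigmoid units. The weights are hand-picked illustrative values, scaled up so that each sigmoid output sits very close to 0 or 1; the last weight of every neuron is its bias weight:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def neuron_output(weights, inputs):
    """Sigmoid of the weighted sum; inputs already include the bias input of 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def feed_forward(neural_network, input_vector):
    """Return the outputs of every layer, feeding each layer's output to the next."""
    outputs = []
    for layer in neural_network:
        input_with_bias = input_vector + [1]          # append the bias input
        output = [neuron_output(neuron, input_with_bias) for neuron in layer]
        outputs.append(output)
        input_vector = output                         # becomes input to the next layer
    return outputs

xor_network = [
    [[20, 20, -30],    # "and" neuron
     [20, 20, -10]],   # "or" neuron
    [[-60, 60, -30]],  # "second input but not first input" neuron
]

results = {(x, y): feed_forward(xor_network, [x, y])[-1][0]
           for x in (0, 1) for y in (0, 1)}
```

`results` maps each input pair to the final output, which is close to 1 for `(0, 1)` and `(1, 0)` and close to 0 otherwise.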

Imagine we have a training set that consists of input vectors and corresponding target output vectors. For example, in our previous xor_network example, the input vector [1, 0] corresponded to the target output [1]. And imagine that our network has some set of weights.

__We then adjust the weights using the following algorithm:__
1. Run feed_forward on an input vector to produce the outputs of all the neurons in the network.
2. This results in an error for each output neuron — the difference between its output and its target.
3. Compute the gradient of this error as a function of the neuron’s weights, and adjust its weights in the direction that most decreases the error.
4. “Propagate” these output errors backward to infer errors for the hidden layer.
5. Compute the gradients of these errors and adjust the hidden layer’s weights in the same manner.

Typically we run this algorithm many times for our entire training set until the network converges (see code that follows in _Data Science from Scratch_).
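
A sketch of one pass of this algorithm for a network with a single hidden layer of sigmoid units and a squared-error loss. The starting weights, learning rate and function names are arbitrary assumptions for illustration, and this is not the book's exact code:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def feed_forward(network, input_vector):
    """Return (hidden_outputs, outputs); the last weight of each neuron is its bias."""
    hidden_layer, output_layer = network
    hidden_outputs = [sigmoid(dot(n, input_vector + [1])) for n in hidden_layer]
    outputs = [sigmoid(dot(n, hidden_outputs + [1])) for n in output_layer]
    return hidden_outputs, outputs

def squared_error(network, input_vector, targets):
    _, outputs = feed_forward(network, input_vector)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

def backpropagate(network, input_vector, targets, learning_rate=0.5):
    """Adjust the network's weights in place for one training example."""
    hidden_layer, output_layer = network
    hidden_outputs, outputs = feed_forward(network, input_vector)

    # error gradient at each output neuron; sigmoid'(a) = o * (1 - o)
    output_deltas = [o * (1 - o) * (o - t) for o, t in zip(outputs, targets)]

    # propagate the output errors back to the hidden layer (before any update)
    hidden_deltas = [h * (1 - h) * dot(output_deltas, [n[i] for n in output_layer])
                     for i, h in enumerate(hidden_outputs)]

    # adjust the output layer's weights against the gradient
    for delta, neuron in zip(output_deltas, output_layer):
        for j, h in enumerate(hidden_outputs + [1]):
            neuron[j] -= learning_rate * delta * h

    # adjust the hidden layer's weights in the same manner
    for delta, neuron in zip(hidden_deltas, hidden_layer):
        for j, x in enumerate(input_vector + [1]):
            neuron[j] -= learning_rate * delta * x

# arbitrary starting weights: two hidden neurons, one output neuron
network = [[[0.5, -0.4, 0.1], [-0.3, 0.2, 0.0]], [[0.7, -0.6, 0.05]]]
error_before = squared_error(network, [1, 0], [1])
backpropagate(network, [1, 0], [1])
error_after = squared_error(network, [1, 0], [1])
```

After the single update, `error_after` is smaller than `error_before` for this example; real training repeats the step over the whole training set many times.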

__Why use sigmoid instead of the simpler step_function?__ 
In order to train a neural network, we’ll need to use calculus, and in order to use calculus, we need smooth functions. The step function isn’t even continuous, and sigmoid is a good smooth approximation of it.

Note: Technically “sigmoid” refers to the shape of the function and “logistic” to this particular function, although people often use the terms interchangeably.

## Training Neural Networks
- That is, _given examples_, find weights that map inputs to outputs
- Rules for training covered below are
    1. Perceptron rule (with thresholds)
    2. Gradient descent or delta rule (unthresholded)

### Perceptron rule
- Binary output _is_ determined by a threshold
- __Geometrically, we might think of training as rotating the hyperplane to put the training data on the correct side of the boundary.__
- If data is _linearly separable_, the algorithm below will find a separating hyperplane in a finite number of iterations.
- The algorithm terminates when the weights no longer change from one iteration to the next, i.e. `actual y == y-hat` for every training example
- It can be hard to tell if data is linearly separable, especially with lots of dimensions
- If this algorithm does not terminate for a while, this could mean data is not linearly separable, but since _finite_ could be any number, we cannot be certain of that.
    - "if we could solve the halting problem, we could solve this, but not necessarily so that problem could be solved another way..."



![perceptron rule calculation part 1](neural_networks_images/perceptron_rule_calc_1.png)
![perceptron rule algorithm part 2](neural_networks_images/perceptron_rule_calc_2.png)
![perceptron rule algorithm part 3](neural_networks_images/perceptron_rule_calc_3.png)
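
A sketch of the perceptron rule, assuming the usual update `w_i += eta * (y - y_hat) * x_i` with the threshold folded in as the weight of a constant input of 1. The function names, the learning rate and the `AND` data set are illustrative choices:

```python
def predict(weights, x):
    """Thresholded output; the last weight acts on a constant bias input of 1."""
    total = sum(w_i * x_i for w_i, x_i in zip(weights, x + [1]))
    return 1 if total >= 0 else 0

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Repeat the update rule until a full pass changes nothing (or we give up)."""
    weights = [0.0] * (len(data[0][0]) + 1)
    for _ in range(max_epochs):
        changed = False
        for x, y in data:
            error = y - predict(weights, x)        # -1, 0 or +1
            if error != 0:
                changed = True
                for i, x_i in enumerate(x + [1]):
                    weights[i] += eta * error * x_i
        if not changed:                            # every example classified correctly
            return weights, True
    return weights, False                          # no convergence within max_epochs

# AND is linearly separable, so the rule should converge
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, converged = train_perceptron(and_data)
```

The `max_epochs` cap is a practical guard: when the loop hits it, we cannot tell whether the data is not linearly separable or simply needs more iterations.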

### Gradient Descent
(From _Data Science from Scratch_ Chapter 8 on topic)
- “Frequently when doing data science, we’ll be trying to find the best model for a certain situation. And usually “best” will mean something like “minimizes the error of the model” or “maximizes the likelihood of the data.” In other words, it will represent the solution to some sort of optimization problem.”
- “This means we’ll need to solve a number of optimization problems. And in particular, we’ll need to solve them from scratch. Our approach will be a technique called gradient descent, which lends itself pretty well to a from-scratch treatment. You might not find it super exciting in and of itself, but it will enable us to do exciting things throughout the book, so bear with me.”

#### The Idea Behind Gradient Descent
Suppose we have some function f that takes as input a vector of real numbers and outputs a single real number. One simple such function is:
```python
def sum_of_squares(v):
    """computes the sum of squared elements in v"""
    return sum(v_i ** 2 for v_i in v)
```
- We’ll frequently need to maximize (or minimize) such functions. 
    - That is, we need to find the input v that produces the largest (or smallest) possible value.
- For functions like ours, the _gradient_ (if you remember your calculus, this is the vector of partial derivatives) gives the input direction in which the function most quickly increases.
- Accordingly, one approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient (i.e., the direction that causes the function to increase the most), and repeat with the new starting point. 
- Similarly, you can try to minimize a function by taking small steps in the _opposite_ direction, as shown in Figure 8-1.

![gradient descent minimum](neural_networks_images/gradient_descent_minimum.png)

Note: “If a function has a unique global minimum, this procedure is likely to find it. If a function has multiple (local) minima, this procedure might “find” the wrong one of them, in which case you might re-run the procedure from a variety of starting points. If a function has no minimum, then it’s possible the procedure might go on forever.”

#### Estimating Gradient
If f is a function of one variable, its derivative at a point x measures how `f(x)` changes when we make a very small change to `x`. It is defined as the limit of the difference quotients:
```python
def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h

# essentially calculation for gradient 
# of a straight line.
```
as `h` approaches zero.
(Many a would-be calculus student has been stymied by the mathematical definition of limit. Here we’ll cheat and simply say that it means what you think it means.)

The derivative is the slope of the tangent line at $(x, f(x))$, while the difference quotient is the slope of the not-quite-tangent line that runs through $(x+h, f(x+h))$. As $h$ gets smaller and smaller, the not-quite-tangent line gets closer and closer to the tangent line (Figure 8-2).

Figure 8-2. Approximating a derivative with a difference quotient:
![difference quotient](neural_networks_images/difference_quotient.png)

Although we can’t take limits in Python, we can estimate derivatives by evaluating the difference quotient for a very small `h`. Figure 8-3 shows the results of one such estimation:
```python
from functools import partial
import matplotlib.pyplot as plt

def square(x):
    return x * x

def derivative(x):
    """exact derivative of square"""
    return 2 * x

# reuses difference_quotient as defined above
derivative_estimate = partial(
    difference_quotient, square, h=0.00001)

# plot to show they're basically the same
x = range(-10,10)
plt.title("Actual Derivatives vs. Estimates")
plt.plot(x, [derivative(x_i) for x_i in x],
        'rx', label='Actual')  # red  x
plt.plot(x, [derivative_estimate(x_i) for x_i in x],
        'b+', label='Estimate')  # blue +
plt.legend(loc=9)
plt.show()
```

Figure 8-3. Goodness of difference quotient approximation:
![derivatives: actual vs estimates](neural_networks_images/derivatives_actual_vs_estimates.png)


When `f` is a function of many variables, it has multiple _partial derivatives_, each indicating how `f` changes when we make small changes in just one of the input variables.

We calculate its `i`th partial derivative by treating it as a function of just its `i`th variable, holding the other variables fixed:
```python
def partial_difference_quotient(f, v, i, h):
    """compute the ith partial difference quotient of f at v"""
    # add h to just the ith element of v
    w = [v_j + (h if j == i else 0)
         for j, v_j in enumerate(v)]

    return (f(w) - f(v)) / h
```
after which we can estimate the gradient the same way:
```python
def estimate_gradient(f, v, h=0.00001):
    return [partial_difference_quotient(f, v, i, h)
            for i, _ in enumerate(v)]
```

Note: A major drawback to this “estimate using difference quotients” approach is that it’s computationally expensive. 
- If `v` has length `n`, `estimate_gradient` has to evaluate `f` on _2n_ different inputs. 
- If you’re repeatedly estimating gradients, you’re doing a whole lot of extra work.
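
A quick check of both points, as a sketch: the estimate is compared with the exact gradient of `sum_of_squares`, and a hypothetical counting wrapper (`counted_sum_of_squares`, not part of the book's code) records how many times `f` is evaluated:

```python
def sum_of_squares(v):
    return sum(v_i ** 2 for v_i in v)

def partial_difference_quotient(f, v, i, h):
    w = [v_j + (h if j == i else 0) for j, v_j in enumerate(v)]
    return (f(w) - f(v)) / h

def estimate_gradient(f, v, h=0.00001):
    return [partial_difference_quotient(f, v, i, h) for i, _ in enumerate(v)]

# hypothetical wrapper that counts evaluations of the target function
calls = {"n": 0}
def counted_sum_of_squares(v):
    calls["n"] += 1
    return sum_of_squares(v)

v = [1.0, -2.0, 3.0]
estimated = estimate_gradient(counted_sum_of_squares, v)
exact = [2 * v_i for v_i in v]                      # known gradient of sum_of_squares
max_error = max(abs(e - a) for e, a in zip(estimated, exact))
```

For this three-element `v`, `calls["n"]` ends up at 6 (two evaluations per partial derivative), and `max_error` is on the order of `h`.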

#### Using the Gradient
It’s easy to see that the `sum_of_squares` function is smallest when its input `v` is a vector of zeroes. But imagine we didn’t know that. Let’s use gradients to find the minimum among all three-dimensional vectors. We’ll just pick a random starting point and then take tiny steps in the opposite direction of the gradient until we reach a point where the gradient is very small:

```python
import math
import random

def distance(v, w):
    """Euclidean distance between v and w"""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def step(v, direction, step_size):
    """move step_size in the direction from v"""
    return [v_i + step_size * direction_i
            for v_i, direction_i in zip(v, direction)]

def sum_of_squares_gradient(v):
    return [2 * v_i for v_i in v]

# pick a random starting point
v = [random.randint(-10,10) for i in range(3)]

tolerance = 0.0000001

while True:
    gradient = sum_of_squares_gradient(v)   # compute the gradient at v
    next_v = step(v, gradient, -0.01)       # take a negative gradient step
    if distance(next_v, v) < tolerance:     # stop if we're converging
        break
    v = next_v                              # continue if we're not
```
If you run this, you’ll find that it always ends up with a `v` that’s very close to `[0,0,0]`. The smaller you make the tolerance, the closer it will get.

#### Choosing the Right Step Size
Although the rationale for moving against the gradient is clear, how far to move is not. Indeed, choosing the right step size is more of an art than a science. Popular options include:
- Using a fixed step size
- Gradually shrinking the step size over time
- At each step, choosing the step size that minimizes the value of the objective function

The last sounds optimal but is, in practice, a costly computation. We can approximate it by trying a variety of step sizes and choosing the one that results in the smallest value of the objective function:
```python
step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
```
It is possible that certain step sizes will result in invalid inputs for our function. So we’ll need to create a “safe apply” function that returns infinity (which should never be the minimum of anything) for invalid inputs:
```python
def safe(f):
    """return a new function that's the same as f,
    except that it outputs infinity whenever f produces an error"""
    def safe_f(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except Exception:               # avoid swallowing KeyboardInterrupt
            return float('inf')         # this means "infinity" in Python
    return safe_f
```

#### Putting It All Together
In the general case, we have some `target_fn` that we want to minimize, and we also have its `gradient_fn`. For example, the `target_fn` could represent the errors in a model as a function of its parameters, and we might want to find the parameters that make the errors as small as possible.
Furthermore, let’s say we have (somehow) chosen a starting value for the parameters `theta_0`. Then we can implement gradient descent as:
```python
def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
  """use gradient descent to find theta that minimizes target function"""

  step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]

  theta = theta_0                           # set theta to initial value
  target_fn = safe(target_fn)               # safe version of target_fn
  value = target_fn(theta)                  # value we're minimizing

  while True:
      gradient = gradient_fn(theta)
      next_thetas = [step(theta, gradient, -step_size)
                     for step_size in step_sizes]

      # choose the one that minimizes the error function
      next_theta = min(next_thetas, key=target_fn)
      next_value = target_fn(next_theta)

      # stop if we're "converging"
      if abs(value - next_value) < tolerance:
          return theta
      else:
          theta, value = next_theta, next_value
```

We called it `minimize_batch` because, for each gradient step, it looks at the entire data set (because `target_fn` returns the error on the whole data set). In the next section, we’ll see an alternative approach that only looks at one data point at a time.
Sometimes we’ll instead want to _maximize_ a function, which we can do by minimizing its negative (which has a corresponding negative gradient):

```python
def negate(f):
  """return a function that for any input x returns -f(x)"""
  return lambda *args, **kwargs: -f(*args, **kwargs)

def negate_all(f):
  """the same when f returns a list of numbers"""
  return lambda *args, **kwargs: [-y for y in f(*args, **kwargs)]

def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
  return minimize_batch(negate(target_fn),
                        negate_all(gradient_fn),
                        theta_0,
                        tolerance)
```
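
A self-contained usage sketch that restates the helpers above in compact form and maximizes a made-up concave function. `hill` and `hill_gradient` are hypothetical names, and the maximum at `(1, -2)` is chosen purely for illustration:

```python
def step(v, direction, step_size):
    return [v_i + step_size * d_i for v_i, d_i in zip(v, direction)]

def safe(f):
    def safe_f(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except Exception:
            return float('inf')
    return safe_f

def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    theta = theta_0
    target_fn = safe(target_fn)
    value = target_fn(theta)
    while True:
        gradient = gradient_fn(theta)
        next_thetas = [step(theta, gradient, -s) for s in step_sizes]
        next_theta = min(next_thetas, key=target_fn)   # best of the candidate steps
        next_value = target_fn(next_theta)
        if abs(value - next_value) < tolerance:        # stop if we're "converging"
            return theta
        theta, value = next_theta, next_value

def negate(f):
    return lambda *args, **kwargs: -f(*args, **kwargs)

def negate_all(f):
    return lambda *args, **kwargs: [-y for y in f(*args, **kwargs)]

def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    return minimize_batch(negate(target_fn), negate_all(gradient_fn),
                          theta_0, tolerance)

# hypothetical concave function with its maximum at (1, -2)
def hill(v):
    return -((v[0] - 1) ** 2 + (v[1] + 2) ** 2)

def hill_gradient(v):
    return [-2 * (v[0] - 1), -2 * (v[1] + 2)]

best = maximize_batch(hill, hill_gradient, [5.0, 5.0])
```

`best` ends up close to `[1, -2]`, though not exactly on it, because the loop stops as soon as the improvement drops below `tolerance`.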

### Stochastic Gradient Descent
(From _Data Science from Scratch_ Chapter 8 on topic)

As we mentioned before, often we’ll be using gradient descent to choose the parameters of a model in a way that minimizes some notion of error. Using the previous batch approach, each gradient step requires us to make a prediction and compute the gradient for the whole data set, which makes each step take a long time.

Now, usually these error functions are _additive_, which means that the predictive error on the whole data set is simply the sum of the predictive errors for each data point.

When this is the case, we can instead apply a technique called _stochastic gradient descent_, which computes the gradient (and takes a step) for only one point at a time. It cycles over our data repeatedly until it reaches a stopping point.

During each cycle, we’ll want to iterate through our data in a random order:

```python
def in_random_order(data):
    """generator that returns the elements of data in random order"""
    indexes = [i for i, _ in enumerate(data)]  # create a list of indexes
    random.shuffle(indexes)                    # shuffle them
    for i in indexes:                          # return the data in that order
        yield data[i]
```

And we’ll want to take a gradient step for each data point. This approach leaves the possibility that we might circle around near a minimum forever, so whenever we stop getting improvements we’ll decrease the step size and eventually quit:

```python
def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):

    data = list(zip(x, y))                      # a list, so it can be indexed and reused
    theta = theta_0                             # initial guess
    alpha = alpha_0                             # initial step size
    min_theta, min_value = None, float("inf")   # the minimum so far
    iterations_with_no_improvement = 0

    # if we ever go 100 iterations with no improvement, stop
    while iterations_with_no_improvement < 100:
        value = sum( target_fn(x_i, y_i, theta) for x_i, y_i in data )

        if value < min_value:
            # if we've found a new minimum, remember it
            # and go back to the original step size
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # otherwise we're not improving, so try shrinking the step size
            iterations_with_no_improvement += 1
            alpha *= 0.9

        # and take a gradient step for each of the data points
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))

    return min_theta
```

The stochastic version will typically be a lot faster than the batch version. Of course, we’ll want a version that maximizes as well:

```python
def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
  return minimize_stochastic(negate(target_fn),
                             negate_all(gradient_fn),
                             x, y, theta_0, alpha_0)
```
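
A self-contained sketch of `minimize_stochastic` in use, fitting the single parameter of `y = theta * x` on made-up, noise-free data. The data, the error functions and the two small vector helpers are assumptions for illustration (the book keeps its vector helpers in a separate module):

```python
import random

def vector_subtract(v, w):
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    return [c * v_i for v_i in v]

def in_random_order(data):
    indexes = list(range(len(data)))
    random.shuffle(indexes)
    for i in indexes:
        yield data[i]

def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    data = list(zip(x, y))
    theta, alpha = theta_0, alpha_0
    min_theta, min_value = None, float("inf")
    iterations_with_no_improvement = 0
    while iterations_with_no_improvement < 100:
        value = sum(target_fn(x_i, y_i, theta) for x_i, y_i in data)
        if value < min_value:
            min_theta, min_value = theta, value      # remember the best so far
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            iterations_with_no_improvement += 1
            alpha *= 0.9                             # shrink the step size
        for x_i, y_i in in_random_order(data):       # one gradient step per point
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))
    return min_theta

def squared_error(x_i, y_i, theta):
    return (y_i - theta[0] * x_i) ** 2

def squared_error_gradient(x_i, y_i, theta):
    return [-2 * x_i * (y_i - theta[0] * x_i)]

random.seed(0)                                       # reproducible shuffling
xs = [1, 2, 3, 4, 5]
ys = [3 * x_i for x_i in xs]                         # made-up data with true slope 3
theta = minimize_stochastic(squared_error, squared_error_gradient, xs, ys, [0.0])
```

Because the data is noise-free, every point pulls `theta` toward the same value and the result lands very close to `[3.0]`.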

In [1]:
import sys
sys.path.append(
    '../../../../../coding/data-science-from-scratch-joel-grus/code/')
# $python -i script.py
!python2 -i ../../../../../coding/data-science-from-scratch-joel-grus/code/gradient_descent.py

using the gradient
minimum v [-2.611200379320695e-06, -3.65568053104897e-06, -2.0889603034565536e-06]
minimum value 2.45461227155e-11

using minimize_batch
minimum v [0.0009969209968386874, 0.0006646139978924582, 0.0009969209968386874]
minimum value 2.42941471407e-06

### Gradient descent or delta rule
- When output _is not_ thresholded
- The perceptron rule outlined above works fine when the data is linearly separable, but can fail to converge otherwise.
- For more complicated data, we need a better training rule.
- Using a sigmoid function to measure how much error we have, we can differentiate it to adjust the weights in order to minimize the unit's output error.
- More robust to data sets that are not linearly separable
    - converges in the limit to a local optimum
- Relies on calculus to minimize the error, i.e. change the weights to push the error down
    - 1/2 in equation does not affect outcome but it makes result of partial derivative calculation cleaner.

![gradient descent calculation](neural_networks_images/gradient_descent_calc.png)
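
A sketch of the delta rule for a single unthresholded unit, assuming the usual update `w_i += eta * (y - a) * x_i`, where `a` is the raw weighted sum. The function name, the learning rate and the three data points (generated from the made-up target `y = 2*x1 - x2`) are illustrative:

```python
def train_linear_unit(data, eta=0.1, epochs=1000):
    """Gradient descent on the squared error of an unthresholded unit."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            a = sum(w_i * x_i for w_i, x_i in zip(weights, x))  # activation, no threshold
            for i, x_i in enumerate(x):
                weights[i] += eta * (y - a) * x_i               # step against the error gradient
    return weights

# made-up examples consistent with y = 2*x1 - x2
data = [([1, 0], 2), ([0, 1], -1), ([1, 1], 1)]
weights = train_linear_unit(data)
```

Unlike the perceptron rule, each update here is proportional to the size of the error, so the weights approach `[2, -1]` gradually rather than stopping after a finite number of corrections.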

#### Perceptron rule vs. gradient descent

![perceptron rule vs gradient descent](neural_networks_images/perceptron_vs_gradient_d.png)

### Sigmoid unit
- Hack on the gradient descent equation that allows `y-hat` to be substituted and differentiated instead of `a`.
- Uses a _sigmoid function_ to force this jump into a differentiable threshold
- Calculus used to get to final equation below can be explored at [WolframMathWorld - Sigmoid Function](http://mathworld.wolfram.com/SigmoidFunction.html)

![sigmoid for differentiable threshold](neural_networks_images/sigmoid.png)
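
The sigmoid's derivative has the convenient closed form `sigmoid(a) * (1 - sigmoid(a))`. A small numerical sanity check, as a sketch; the sample points and the helper `numeric_derivative` are arbitrary illustrative choices:

```python
import math

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

def sigmoid_derivative(a):
    """Closed form: sigmoid(a) * (1 - sigmoid(a))."""
    s = sigmoid(a)
    return s * (1 - s)

def numeric_derivative(f, a, h=1e-5):
    """Central difference quotient, used only as a check."""
    return (f(a + h) - f(a - h)) / (2 * h)

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(sigmoid_derivative(a) - numeric_derivative(sigmoid, a))
              for a in points)
```

`max_gap` comes out tiny, and the derivative peaks at `a = 0`, where it equals 0.25.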

### Backpropagation training algorithm
- "A computationally beneficial organization of the chain rule."
- Convenient method to compute derivatives with respect to all the different weights in the network
- Network learns through:
    - Information flows from inputs to outputs
    - Then, error information flows back from the outputs to the inputs
- Could also be called _error back propagation_
- Can be applied to units of another differentiable function
- Error function, in this case the sum of squared errors, can have multiple "local" optima / minima
    - a single unit's error function will have one local optimum, bottom of one parabola, but globally multiple parabolas are combined from all units.

![back propogation](neural_networks_images/back_propogation.png)

## Optimizing Weights, brief intro
- Techniques to address the problem of multiple local optima, which can cause the algorithm to get stuck in one minimum even if it is not the global optimum.
- for image below: red bullet points are aspects that add to a model's complexity

![optimizing weights](neural_networks_images/optimizing_weights.png)

## Restriction Bias
__definition__ of restriction bias:
- Describes the _representational power_ of a particular data structure, e.g. of a network of neurons
- Restricts the hypotheses that will be considered

### Evaluating restriction bias
- _perceptron unit:_ linear, only considering planes
    - to _Networks of perceptrons:_ allows boolean functions like `XOR`
    - to _Networks of units with sigmoids & other arbitrary functions:_ allows lots of layers and nodes that can become much more complex, not many restrictions at all
- Neural networks can represent _any_ mapping of inputs to outputs, like:
    - _boolean:_ with network of threshold-like units
    - _continuous:_ as long as smooth curves, connected / no jumps
        - using single hidden layer of nodes
        - each node covers some portion of function
        - nodes are then "stitched together" to give output
    - _arbitrary:_ functions that aren't continuous
        - requires two hidden layers
        - with additional hidden layer, output can be stitched together even with gaps in the function.
        
### Overfitting
- Danger of overfitting: a neural network can even represent the noise in our training set
- To address this, restrict the number of hidden nodes and layers in the network
    - Neural network can only capture as much of a function as its bounds allow
    - i.e. the particular network architecture can have restrictions even though an unbounded neural network will not.
- Other solutions are ones that are applied to other learners like:
    - Cross validation to decide how many nodes per layer, how large weights can get before stopping. 
- Complexity of a neural network is not only in the nodes and layers, but also in its weights, i.e. how _much_ it is trained

![restriction bias](neural_networks_images/restriction_bias.png)

## Preference Bias
__definition__ of preference bias:
- Characteristics that determine whether one subclass of algorithm would be selected over another.
    - e.g. preferred decision trees are correct ones, one with top nodes having the most information gain, ones that aren't longer than necessary, etc.

### Evaluating preference bias
- For _neural networks with gradient descent:_
    - prefers models with lower complexity (Occam's razor)
        
![preference bias](neural_networks_images/preference_bias.png)

# Neural Nets Mini-Project
## 1. Build a Perceptron
__question:__ What do you think the advantage of a perceptron is, compared with simply returning the dot product without a threshold?
- (???) Guaranteed finite convergence for linearly separable data. Faster learning/optimization with near certain determination of nature of data (whether linearly separable or not).

In [2]:
#-----------------------------------

#
#   In this exercise you will put the finishing 
#   touches on a perceptron class
#
#   Finish writing the activate() method by using 
#   numpy.dot and adding in the thresholded
#   activation function

from numpy import dot

class Perceptron:

    def activate(self,inputs):
        '''Takes in @param inputs, a list of numbers.
        @return the output of a threshold perceptron with
        given weights, threshold, and inputs.
        ''' 

        #YOUR CODE HERE

        #TODO: calculate the strength with which the perceptron fires
        w_x = dot(self.weights, inputs)        

        #TODO: return 0 or 1 based on the threshold
        result = 0 if w_x < self.threshold else 1
        return result

        
        
    def __init__(self,weights=None,threshold=None):
        if weights is not None:
            self.weights = weights
        if threshold is not None:
            self.threshold = threshold
            


## 2. Perceptron Update Rule

In [3]:
#-----------------------------------

#
#   In this exercise we write a perceptron class
#   which can update its weights
#
#   Your job is to finish the train method so that it implements the perceptron update rule

import numpy as np

class Perceptron:
    
    def activate(self,values):
        '''Takes in @param values, @param weights lists of numbers
        and @param threshold a single number.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        ''' 
               
        #First calculate the strength with which the perceptron fires
        strength = np.dot(values,self.weights)
        
        if strength>self.threshold:
            result = 1
        else:
            result = 0
            
        return result

    def update(self,values,train,eta=.1):
        '''Takes in a 2D array @param values consisting of a LIST of inputs
        and a 1D array @param train, consisting of a corresponding list of 
        expected outputs.
        Updates internal weights according to the perceptron training rule
        using these values and an optional learning rate, @param eta.
        '''
        #YOUR CODE HERE
        #update self.weights based on the training data
        
        for x, y in zip(values, train):
            y_pred = self.activate(x)
            for i, x_i in enumerate(x):
                self.weights[i] += (y - y_pred) * eta * x_i
        

    def __init__(self,weights=None,threshold=None):
        if weights is not None:
            self.weights = weights
        if threshold is not None:
            self.threshold = threshold
            

## 3. Build the XOR Network

In [4]:
#
#   In this exercise, you will create a network of perceptrons which
#   represent the xor function use the same network structure you used
#   in the previous quizzes.
#
#   You will need to do two things:
#   First, create a network of perceptrons with the correct weights
#   Second, define a procedure EvalNet() which takes in a list of 
#   inputs and outputs the value of this network.

import numpy as np

class Perceptron:

    def evaluate(self,values):
        '''Takes in @param values, @param weights lists of numbers
        and @param threshold a single number.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        ''' 
               
        #First calculate the strength with which the perceptron fires
        strength = np.dot(values,self.weights)
        
        #Then evaluate the return value of the perceptron
        if strength >= self.threshold:
            result = 1
        else:
            result = 0

        return result

    def __init__(self,weights=None,threshold=None):
        if weights is not None:
            self.weights = weights
        if threshold is not None:
            self.threshold = threshold

Network = [
    #input layer, declare perceptrons here
    # [OR_Perceptron, AND_Perceptron]
    [ Perceptron([1, 1], 1), Perceptron([.5, .5], 1) ],
    #output node, declare one perceptron here
    # output = OR_Perceptron - 2*AND_Perceptron
    [ Perceptron(weights=[1, -2], threshold=1) ]
]

def EvalNetwork(inputValues, Network):    
    p_or = Network[0][0].evaluate(inputValues)
    p_and = Network[0][1].evaluate(inputValues)
    p_xor = Network[1][0].evaluate([p_or, p_and])
    
    OutputValues = p_xor
    # Be sure your output values are single numbers
    return OutputValues

print '0 xor 0:', EvalNetwork([0,0], Network)
print '0 xor 1:', EvalNetwork([0,1], Network)
print '1 xor 0:', EvalNetwork([1,0], Network)
print '1 xor 1:', EvalNetwork([1,1], Network)


0 xor 0: 0
0 xor 1: 1
1 xor 0: 1
1 xor 1: 0


## 4. Activation Function Sandbox

In [5]:
#   Python Neural Networks code originally 
#   by Szabo Roland and used by permission

#   Modifications, comments, and exercise breakdowns 
#   by Mitchell Owen, (c) Udacity

#   Retrieved originally from
#   http://rolisz.ro/2013/04/18/neural-networks-in-python/

#   Neural Network Sandbox
#
#   Define an activation function activate(),
#   which takes in a number and returns a number.
#   Using test run you can see the performance of
#   a neural network running with that activation function.
#
import numpy as np


def activate(strength):
    return np.power(strength,2)
    
def activation_derivative(activate, strength):
    #numerically approximate
    return (activate(strength+1e-5)-activate(strength-1e-5))/(2e-5)

Output through webapp:

```
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
our weights [[ -9.39070541e-02  -1.04328949e-01   1.65772580e-01 ...,   3.95364078e-02
   -8.79811120e-02   1.45551742e-01]
 [ -1.81965442e-01   2.23494036e-01   2.29270054e-01 ...,   5.60283791e-02
    1.99415490e-01   2.41451736e-01]
 [  5.50110296e+01  -5.25212740e+01   1.74491734e+03 ...,   9.07396683e-01
    4.48984591e+02  -8.64517579e+01]
 ..., 
 [  1.73859117e-01   2.30827068e-01  -1.66113073e-01 ...,   2.34616668e-01
    2.72151263e-02   2.35588898e-01]
 [  1.07146784e-01  -3.38694889e-02   4.71807362e-02 ...,   2.25425026e-01
   -1.74156663e-01   1.57382086e-01]
 [  1.10227160e+02  -1.05335043e+02   3.48975870e+03 ...,   1.23151826e+00
    8.98238455e+02  -1.72644011e+02]]

 are being modified with deltas [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

 using the results matrix a: [[ 0.      0.      0.4375  0.9375  1.      1.      0.0625  0.      0.
   0.5625  1.      1.      0.625   0.3125  0.      0.      0.      0.875
   1.      1.      0.9375  0.      0.      0.      0.      0.6875  0.875
   0.8125  1.      0.125   0.      0.      0.      0.      0.      0.375
   1.      0.0625  0.      0.      0.      0.      0.      0.75    0.75    0.
   0.      0.      0.      0.      0.375   1.      0.4375  0.      0.      0.
   0.      0.      0.625   0.8125  0.      0.      0.      0.      1.    ]]
(65, 16) (1, 16) (1, 65)
our weights [[ -9.39070541e-02  -1.04328949e-01   1.65772580e-01 ...,   3.95364078e-02
   -8.79811120e-02   1.45551742e-01]
 [ -1.81965442e-01   2.23494036e-01   2.29270054e-01 ...,   5.60283791e-02
    1.99415490e-01   2.41451736e-01]
 [  5.50110296e+01  -5.25212740e+01   1.74491734e+03 ...,   9.07396683e-01
    4.48984591e+02  -8.64517579e+01]
 ..., 
 [  1.73859117e-01   2.30827068e-01  -1.66113073e-01 ...,   2.34616668e-01
    2.72151263e-02   2.35588898e-01]
 [  1.07146784e-01  -3.38694889e-02   4.71807362e-02 ...,   2.25425026e-01
   -1.74156663e-01   1.57382086e-01]
 [  1.10227160e+02  -1.05335043e+02   3.48975870e+03 ...,   1.23151826e+00
    8.98238455e+02  -1.72644011e+02]]

 are being modified with deltas [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

 using the results matrix a: [[ 0.      0.      0.4375  0.875   1.      0.5     0.      0.      0.      0.
   0.875   0.875   1.      0.875   0.      0.      0.      0.      0.      0.
   0.625   0.75    0.      0.      0.      0.      0.25    0.25    0.875
   0.5625  0.125   0.      0.      0.4375  1.      1.      1.      1.
   0.4375  0.      0.      0.375   0.75    1.      0.6875  0.0625  0.      0.
   0.      0.      0.125   1.      0.1875  0.      0.      0.      0.      0.
   0.375   0.8125  0.      0.      0.      0.      1.    ]]
(65, 16) (1, 16) (1, 65)
[[ 0 46  0  0  0  0  0  0  0  0]
 [ 0 53  0  0  0  0  0  0  0  0]
 [ 0 44  0  0  0  0  0  0  0  0]
 [ 0 37  0  0  0  0  0  0  0  0]
 [ 0 44  0  0  0  0  0  0  0  0]
 [ 0 39  0  0  0  0  0  0  0  0]
 [ 0 46  0  0  0  0  0  0  0  0]
 [ 0 40  0  0  0  0  0  0  0  0]
 [ 0 55  0  0  0  0  0  0  0  0]
 [ 0 46  0  0  0  0  0  0  0  0]]
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        46
          1       0.12      1.00      0.21        53
          2       0.00      0.00      0.00        44
          3       0.00      0.00      0.00        37
          4       0.00      0.00      0.00        44
          5       0.00      0.00      0.00        39
          6       0.00      0.00      0.00        46
          7       0.00      0.00      0.00        40
          8       0.00      0.00      0.00        55
          9       0.00      0.00      0.00        46

avg / total       0.01      0.12      0.02       450

```
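
`activation_derivative` in the sandbox code approximates the derivative with a central difference. A small sketch of how that approximation could be checked against the known derivative of the squaring activation, $2x$ (the step-size parameter `h` and the sample points are assumptions added for illustration):

```python
import numpy as np


def activate(strength):
    # Same activation as in the sandbox: f(x) = x^2
    return np.power(strength, 2)


def activation_derivative(activate, strength, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (activate(strength + h) - activate(strength - h)) / (2 * h)


points = np.array([-3.0, -0.5, 0.0, 2.0])
numeric = activation_derivative(activate, points)
analytic = 2 * points  # exact derivative of x^2

# Largest absolute disagreement between the two derivatives
max_error = np.max(np.abs(numeric - analytic))
```

For a quadratic function the central difference is exact apart from floating-point rounding, so `max_error` should be very small here.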

## 5. Sigmoid Programming Exercise

The logistic function used is the sigmoid function: $ f(x) = \frac{1}{1+e^{-x}} $
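
The gradient descent update in this exercise relies on the derivative of this function. As a short derivation using standard calculus (not taken from the course materials):

$$
\begin{aligned}
f'(x) &= \frac{d}{dx}\left(1+e^{-x}\right)^{-1} = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} \\
&= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = f(x)\,\bigl(1-f(x)\bigr)
\end{aligned}
$$

The last step uses $ 1 - f(x) = \frac{e^{-x}}{1+e^{-x}} $, so the derivative can be computed from the unit's output alone.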

In [7]:
#
#   As with the perceptron exercise, you will modify the
#   last functions of this sigmoid unit class
#
#   There are two functions for you to finish:
#   First, in activate(), write the sigmoid activation function
#
#   Second, in train(), write the gradient descent update rule
#
#   NOTE: the following exercises creating classes for functioning
#   neural networks are HARD, and are not efficient implementations.
#   Consider them an extra challenge, not a requirement!

import numpy as np
from math import exp
from scipy.optimize import fmin

class Sigmoid:
        
    def activate(self,values):
        '''Takes in @param values, a list of numbers.
        @return the output of a sigmoid unit with
        the unit's weights, given values as inputs.
        ''' 
               
        #First calculate the strength with which the perceptron fires
        strength = self.strength(values)
        self.last_input = strength
        
        #YOUR CODE HERE
        #modify strength using the sigmoid activation function
        
        result = self.sigmoid(strength)
        
        return result
        
    def strength(self,values):
        strength = np.dot(values,self.weights)
        return strength
        
        
    def update(self,values,train,eta=.1):
        '''
        Updates the sigmoid unit with expected return
        values @param train and learning rate @param eta
        
        By modifying the weights according to the gradient descent rule
        '''
        
        #YOUR CODE HERE
        #modify the perceptron training rule into a gradient descent
        #training rule. You will need the derivative of the
        #logistic function evaluated at the last input value.
        #Recall: d/dx logistic(x) = logistic(x)*(1-logistic(x))
        # NOTE: last_input not needed, because the derivative can be
        # written in terms of the unit's output y_pred
        
        y = train[0]
        y_pred = self.activate(values)
        for i in range(0,len(values)):
            delta_w = (eta * (y-y_pred) * y_pred *
                       (1 - y_pred) * values[i])
            
            self.weights[i] += delta_w


    def sigmoid(self, x):
        try:
            return 1 / (1 + exp(-x))
        except OverflowError:
            # exp(-x) overflows only for very negative x,
            # where the sigmoid approaches 0
            return 0.0

    # returns the value of the derivative of our function
    def derivative_sigmoid(self, x):
        return self.sigmoid(x)*(1-self.sigmoid(x))

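    # we set up this function to pass into the fmin algorithm;
    # fmin passes the value being optimized (the step) first,
    # followed by the extra args (x, direction)
    def update_weight(self, step, x, direction):
        x = x + direction*step
        return self.sigmoid(x)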

    def optimized_weight(self, x):

        x = 2  # algorithm starts at x=2
        precision = 0.0001

        x_list, y_list = [x], [self.sigmoid(x)]


        while True:
            direction = -self.derivative_sigmoid(x)

            # use scipy fmin function to find ideal step size.
            step_size = fmin(self.update_weight, 0.1, (x,direction), 
                             disp = False)
            x_new = x + (step_size * direction)

            x_list.append(x_new)
            y_list.append(self.sigmoid(x_new))

            if abs(x_new - x) < precision:
                return x_new
            else:
                x = x_new   

    def __init__(self,weights=None):
        if weights is not None:
            self.weights = weights
            
            
unit = Sigmoid(weights=[3,-2,1])
unit.update([1,2,3],[0])
print unit.weights
#Expected: [2.99075, -2.0185, .97225]

Attrerror 
0
0.880797077978
[2.990752195677017, -2.0184956086459658, 0.972256587031051]
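
The per-weight loop in `update` applies the delta rule $ \Delta w_i = \eta \, (y - \hat{y}) \, \hat{y} \, (1 - \hat{y}) \, x_i $. A minimal vectorized sketch of one such step, assuming NumPy arrays in place of lists (the names `stable_sigmoid` and `delta_rule_step` are hypothetical and not part of the exercise):

```python
import numpy as np


def stable_sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + np.exp(-x))
    z = np.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return z / (1.0 + z)


def delta_rule_step(weights, values, target, eta=0.1):
    """Return the weights after one gradient descent update."""
    weights = np.asarray(weights, dtype=float)
    values = np.asarray(values, dtype=float)
    y_pred = stable_sigmoid(np.dot(values, weights))
    # (y - y_pred) is the error; y_pred * (1 - y_pred) is the sigmoid slope
    gradient_term = (target - y_pred) * y_pred * (1.0 - y_pred)
    return weights + eta * gradient_term * values


# Same inputs as the exercise: weights [3, -2, 1], values [1, 2, 3], target 0
new_weights = delta_rule_step([3, -2, 1], [1, 2, 3], target=0)
```

With the exercise's inputs the weighted sum is 2, so the unit outputs roughly 0.88 and every weight is nudged downward in proportion to its input value, which agrees with the weights printed above.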
