# Neural Networks
Jonathan Balaban

Credits and thanks to:
- [Stephen Welch](https://github.com/stephencwelch/Neural-Networks-Demystified)
- [Harsh Pokharna](https://medium.com/technologymadeeasy/for-dummies-the-introduction-to-neural-networks-we-all-need-c50f6012d5eb#.93dgf0vg2)

Artificial Neural Networks (ANNs, also called connectionist systems) are a computational approach loosely modeled on brain function: a large collection of linked neural units. Each neural unit is connected to many others, and links can be excitatory or inhibitory in their effect on the activation state of connected units. Each unit typically has a summation function that combines the values of all its inputs. There may also be a threshold or limiting function on each connection and on the unit itself, so that the signal must surpass the limit before propagating to other neurons. These systems are trained rather than explicitly programmed, and they excel in areas where the solution or feature detection is difficult to express in a traditional computer program.

![Neuron](https://cdn-images-1.medium.com/max/1600/1*MnmwgNzk5YkMhC3Ttb09SQ.jpeg)

Neural networks typically consist of multiple layers (or a cube design), and the signal path traverses from front to back. Backpropagation runs the other way: the error measured at the output is propagated backward through the network to adjust the weights of earlier units, which requires training data where the correct result is known. More modern networks are a bit more free-flowing in terms of stimulation and inhibition, with connections interacting in a much more chaotic and complex fashion.

Dynamic neural networks are the most advanced, in that they can dynamically, based on rules, form new connections and even new neural units while disabling others (loosely similar to regularization). ANNs typically work with a few thousand to a few million neural units and millions of connections, which is still several orders of magnitude less complex than the human brain and closer to the computing power of an insect.

![Perceptron Neural Net](https://cdn-images-1.medium.com/max/1600/1*nRRXhhjSjKNpGn-T3yF2Ew.jpeg)
_A perceptron is the digital equivalent of a neuron, firing if strength of inputs exceeds its threshold `theta`_

![Neural Network](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/296px-Colored_neural_network.svg.png)
_General Neural Network with Hidden Layer_

## Derivatives and Gradient Descent
Because of the massive complexity of ANNs, we almost never have a "perfect" analytic solution and must resort to optimizing our weights. Weights are the unknown scalar values on branches/links/connections that amplify or reduce the relative strength of one input over another. So, to truly understand how ANNs are built and solved, let's review some calculus!

In [None]:
# plot y = x-squared


y = x^2. A lovely parabola! But how do we determine its rate of change? This skill will allow us to find a cost function's minimum and optimize our ANN weights. If your answer is to take the derivative of y with respect to x and find dy/dx = 2x, you would be correct. But what if computers don't know calculus? What if they were to approximate the change in y based on a tiny change in x, called epsilon?

We need to be careful not to make our denominator, epsilon, too small! We will lose floating-point precision, and a zero value would, of course, blow up our laptops (with a division-by-zero error).

In [None]:
# create our function


In [None]:
# define values


In [None]:
# calculate delta y / delta x


In [None]:
# compare with our known calculus solution
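
One way to fill in the cells above, offered as a minimal sketch rather than the notebook's official solution. The names `f`, `numerical_derivative`, and `epsilon`, the test point `x = 3.0`, and the learning rate are all illustrative assumptions. The code approximates the slope of y = x^2 from a tiny change in x, compares it with the calculus answer 2x, and then uses the same approximation to step downhill toward the minimum.

```python
def f(x):
    """The parabola y = x^2."""
    return x ** 2


def numerical_derivative(func, x, epsilon=1e-4):
    """Approximate d(func)/dx at x as (change in y) / (change in x)."""
    return (func(x + epsilon) - func(x)) / epsilon


x = 3.0
approx = numerical_derivative(f, x)
exact = 2 * x  # known calculus solution: dy/dx = 2x
print(approx, exact)

# Gradient descent: repeatedly step against the approximate slope.
learning_rate = 0.1  # assumed value, chosen only for illustration
position = 3.0
for _ in range(100):
    position -= learning_rate * numerical_derivative(f, position)
print(position)  # ends very close to the true minimum at x = 0
```

The forward difference used here carries an error on the order of epsilon, which is why the approximation is close to, but not exactly, 2x.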


Fantastic, so computers can mimic calculus to find local (hopefully global) minima. Keep in mind that gradient descent has its limits. You won't necessarily find a global minimum, a great solution in n iterations, or any solution at all, depending on your function. Yet, if we tried a brute-force "gridsearch" method by checking values, we would get smashed by THE CURSE OF DIMENSIONALITY!!

Imagine a simple ANN with 3x4 = 12 input-to-hidden connections, per the image above. Imagine we wanted to try 100 possible values for each weight. And imagine our fast computers could compute 100 calculations in one millisecond.

In [None]:
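# total time calculation in years
total = 100**12  # 100 candidate values for each of the 12 weights
seconds = total / 100 / 1000  # at 100 calculations per millisecond
seconds / (60 * 60 * 24 * 365)  # years: roughly 3.2e11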


#forgetthis

## ANN Structure

In [1]:
# build your own neural net!
import numpy as np

class Neural_Network(object):
    def __init__(self):        
        # hyperparameters, or structure of our network
        self.inputLayerSize = 2
        self.outputLayerSize = 1
        self.hiddenLayerSize = 3
        
        # weights, parameters to solve for
        self.W1 = np.random.randn(self.inputLayerSize,self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize,self.outputLayerSize)
        
    def forward(self, X):
        # propagate inputs through network
        self.z2 = np.dot(X, self.W1)
        self.a2 = self.sigmoid(self.z2)
        self.z3 = np.dot(self.a2, self.W2)
        yHat = self.sigmoid(self.z3) 
        return yHat
        
    def sigmoid(self, z):
        # sigmoid activation function
        return 1/(1+np.exp(-z))
    
    def sigmoidPrime(self,z):
        # derivative of sigmoid to optimize
        return np.exp(-z)/((1+np.exp(-z))**2)
    
    def costFunction(self, X, y):
        # compute cost for given X,y, use weights already stored in class
        self.yHat = self.forward(X)
        J = 0.5*sum((y-self.yHat)**2)
        return J
        
    def costFunctionPrime(self, X, y):
        # compute derivative with respect to W1 and W2 for a given X and y
        self.yHat = self.forward(X)
        
        delta3 = np.multiply(-(y-self.yHat), self.sigmoidPrime(self.z3))
        dJdW2 = np.dot(self.a2.T, delta3)
        
        delta2 = np.dot(delta3, self.W2.T)*self.sigmoidPrime(self.z2)
        dJdW1 = np.dot(X.T, delta2)  
        
        return dJdW1, dJdW2
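
The `sigmoidPrime` formula can be sanity-checked with the same finite-difference idea from the calculus review. The sketch below uses standalone functions (`sigmoid` and `sigmoid_prime` are hypothetical stand-ins for the class methods, written with `math` instead of NumPy) and compares the analytic derivative against the identity sigmoid(z) * (1 - sigmoid(z)) and against a numerical slope. The test points and `epsilon` are arbitrary choices.

```python
import math


def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + math.exp(-z))


def sigmoid_prime(z):
    """Analytic derivative, same formula as sigmoidPrime in the class."""
    return math.exp(-z) / (1 + math.exp(-z)) ** 2


epsilon = 1e-5
for z in (-2.0, 0.0, 1.5):
    analytic = sigmoid_prime(z)
    identity = sigmoid(z) * (1 - sigmoid(z))
    # central difference: slope between two points straddling z
    numeric = (sigmoid(z + epsilon) - sigmoid(z - epsilon)) / (2 * epsilon)
    print(z, analytic, identity, numeric)
```

All three columns should agree to many decimal places; a mismatch would point to a bug in the derivative, which in turn would corrupt the gradients returned by `costFunctionPrime`.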

## ANNs in Sklearn
Today, we'll take a look at [Multi-layer Perceptron (MLP)](http://scikit-learn.org/stable/modules/neural_networks_supervised.html) models in sklearn. [This guide on perceptrons](http://www.cprogramming.com/tutorial/AI/perceptron.html) might be helpful.

The advantages of MLP are:
- Capability to learn non-linear models.
- Capability to learn models in real-time (on-line learning) using partial_fit.

The disadvantages of MLP include:
- MLPs with hidden layers have a non-convex loss function with more than one local minimum, so different random weight initializations can lead to different validation accuracy.
- MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.
- MLP is sensitive to feature scaling.

In [7]:
# build simple neural net with sklearn
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.], [1., 0.]]
y = [0, 1, 1]

clf = MLPClassifier(hidden_layer_sizes=(5,2), solver='lbfgs')
clf.fit(X,y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5, 2), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [10]:
# predict new observations
clf.predict([[0,0]])

array([0])

`clf.coefs_` contains the weight matrices that constitute our model parameters.

In [13]:
# find parameters
print([coef.shape for coef in clf.coefs_])
clf.coefs_

[(2, 5), (5, 2), (2, 1)]


[array([[  1.09542421e-03,   1.99662861e-03,   1.62249937e+00,
          -2.03246470e+00,  -4.73470028e-02],
        [  2.84597711e-03,   1.61587452e-03,   4.99417159e-02,
          -9.37065139e-02,  -2.33447380e-02]]),
 array([[  5.48593070e-03,   3.73651668e-03],
        [  3.00633223e-03,  -3.47384967e-03],
        [  3.12619088e-03,  -1.47910849e+00],
        [ -1.89205740e-03,   2.43003446e+00],
        [ -5.16844163e-03,   6.40558372e-01]]),
 array([[-0.00609055],
        [-2.75765493]])]

In [14]:
# find intercepts
clf.intercepts_

[array([-0.91091071, -0.92073416,  2.79999024,  3.54706393,  1.05658564]),
 array([-0.61208439,  2.47748849]),
 array([ 11.23843769])]
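
To connect these arrays back to the hand-built network, the forward pass can be reproduced manually from `coefs_` and `intercepts_`. This is a sketch under stated assumptions: it refits the same toy model with a fixed `random_state` (added only for repeatability), and it relies on the default ReLU hidden activation and on the logistic output that `MLPClassifier` uses for a binary problem.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0., 0.], [1., 1.], [1., 0.]])
y = [0, 1, 1]

# random_state is an assumption, fixed here only to make the sketch repeatable
clf = MLPClassifier(hidden_layer_sizes=(5, 2), solver='lbfgs', random_state=1)
clf.fit(X, y)

activation = X
last = len(clf.coefs_) - 1
for i, (W, b) in enumerate(zip(clf.coefs_, clf.intercepts_)):
    z = activation @ W + b                   # weighted sum plus intercept
    if i < last:
        activation = np.maximum(z, 0)        # ReLU on hidden layers
    else:
        activation = 1 / (1 + np.exp(-z))    # logistic on the output layer

manual_proba = activation.ravel()            # probability of class 1
sklearn_proba = clf.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

Each layer is just a matrix multiply, an intercept, and an activation, which is the same structure as `forward` in the `Neural_Network` class.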

In [15]:
# predict probabilities
clf.predict_proba([[1,0],[0,1]])

array([[  2.71866410e-05,   9.99972813e-01],
       [  9.99868438e-01,   1.31562105e-04]])

### MLPs work well with multiclass and multilabel problems

In [20]:
# create MLP with alpha=1e-5, lbfgs solver, and one hidden layer
X = [[0., 0.], [1., 1.]]
y = [[0, 1], [1, 1]]
              
clf = MLPClassifier(solver='lbfgs', alpha = 1e-5, hidden_layer_sizes=(15))

clf.fit(X,y)

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=15, learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [21]:
# predict a new observation
clf.predict([[0,0]])

array([[0, 1]])

### Regression
The `MLPRegressor` class implements an MLP that trains using backpropagation with no activation function in the output layer, which can also be seen as using the identity function as activation function. Therefore, it uses the square error as the loss function, and the output is a set of continuous values. Amazingly, `MLPRegressor` also supports multi-output regression, in which a sample can have more than one target.
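
A minimal multi-output regression sketch follows. The synthetic data, the two target definitions, and the hyperparameters are all assumptions chosen for illustration; only `MLPRegressor` itself comes from the text above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 2))
# two continuous targets per sample (illustrative choices)
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] - X[:, 1]])

reg = MLPRegressor(hidden_layer_sizes=(20,), solver='lbfgs',
                   max_iter=1000, random_state=0)
reg.fit(X, Y)

pred = reg.predict(X[:5])
print(pred.shape)            # one row per sample, one column per target
print(reg.out_activation_)   # activation used on the output layer
```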

### Scaling and General Tips

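Multi-layer perceptrons are sensitive to feature scaling, so it is highly recommended to scale your data. For example, normalize each feature in X to [0, 1] or [-1, +1], or standardize it to zero mean and unit variance. Note that you must apply the same scaling to the test set for meaningful results: fit the scaler on the training set only, then reuse that fitted scaler to transform the test set, rather than fitting on both sets together!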
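
The scaling workflow can be sketched as follows. The synthetic dataset from `make_classification`, the choice of `StandardScaler`, and the network settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; do not refit

clf = MLPClassifier(hidden_layer_sizes=(10,), solver='lbfgs',
                    max_iter=1000, random_state=0)
clf.fit(X_train_scaled, y_train)
test_accuracy = clf.score(X_test_scaled, y_test)
print(test_accuracy)
```

Calling `fit_transform` on the test set instead would leak test-set statistics into preprocessing and make the reported accuracy less trustworthy.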

Finding a reasonable regularization parameter `alpha` is best done using `GridSearchCV`, usually in the range `10.0 ** -np.arange(1, 7)`.
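
A sketch of that search, assuming a toy `make_moons` dataset, a small network, and 3-fold cross-validation (none of which come from the original notebook). Wrapping the scaler and the classifier in a pipeline means the scaler is refit on each training fold, so scaling never sees the held-out fold.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), solver='lbfgs',
                  max_iter=1000, random_state=0),
)

# make_pipeline names each step after its lowercased class name
param_grid = {'mlpclassifier__alpha': 10.0 ** -np.arange(1, 7)}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

best_alpha = search.best_params_['mlpclassifier__alpha']
print(best_alpha, search.best_score_)
```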

L-BFGS converges faster and finds better solutions on small datasets. For relatively large datasets, Adam is performant and robust. SGD with momentum or Nesterov's momentum, on the other hand, can perform better than those two algorithms if the learning rate is correctly tuned.