We have that - 

$$z = w_1x_1+w_2x_2+\dots+w_mx_m = \sum_{j=1}^m x_jw_j = w^Tx$$

where $z$ is the net input calculated for a given sample.

Now let us define an activation function such that -

$$\phi(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ -1 & \text{otherwise} \end{cases}$$

For simplicity, we can rewrite the above equation by bringing $\theta$ to the left-hand side and defining a weight-zero $w_0 = -\theta$ with $x_0 = 1$. We get - 

$$z = w_0x_0+w_1x_1+w_2x_2+\dots+w_mx_m = \sum_{j=0}^m x_jw_j = w^Tx$$

and 

$$\phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

Here, **$\phi(z)$** is our activation function. 
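
The threshold rewrite above can be checked with a few lines of plain Python. This is a minimal sketch, not part of the original notebook: the function names `net_input` and `phi`, the threshold value, and the weights are all illustrative assumptions.

```python
def net_input(w, x):
    """Return z = w^T x, where w[0] is the bias weight and x[0] is fixed to 1."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def phi(z):
    """Threshold activation after the rewrite: 1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


theta = 0.5                  # illustrative threshold (assumed value)
w = [-theta, 0.4, 0.3]       # w_0 = -theta, followed by w_1 and w_2
x = [1.0, 1.0, 1.0]          # x_0 = 1, followed by the features x_1 and x_2

z = net_input(w, x)          # roughly -0.5 + 0.4 + 0.3 = 0.2
label = phi(z)               # z >= 0, so the predicted class is 1

# Original form of the rule: compare w_1*x_1 + w_2*x_2 against theta directly.
label_original_form = 1 if (0.4 * 1.0 + 0.3 * 1.0) >= theta else -1
```

Both forms give the same label; folding $\theta$ into $w_0$ only moves the comparison point to zero.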

The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to use a reductionist approach to mimic how a single neuron in the brain works: it either fires or it doesn't. Thus, Rosenblatt's initial perceptron rule is fairly
simple and can be summarized by the following steps:
1. Initialize the weights to 0 or small random numbers.
2. For each training sample $x^{(i)}$, perform the following steps:
   1. Compute the output value $\hat{y}$.
   2. Update the weights.

Here, the output value is the class label predicted by the threshold function $\phi(z)$ defined above.
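
The steps above can be sketched in plain Python. The text does not spell out the update itself, so this sketch assumes the commonly used perceptron rule $w_j := w_j + \eta\,(y^{(i)} - \hat{y}^{(i)})\,x_j^{(i)}$; the names `predict` and `fit_perceptron`, the learning rate `eta`, and the toy AND dataset are illustrative and not from the original.

```python
def predict(w, x):
    """Class label for one sample; w[0] is the bias weight (x_0 = 1)."""
    z = w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x))
    return 1 if z >= 0 else -1


def fit_perceptron(X, y, eta=0.5, n_epochs=10):
    """Train using the assumed update w_j += eta * (y - y_hat) * x_j."""
    w = [0.0] * (len(X[0]) + 1)           # step 1: initialize the weights to 0
    for _ in range(n_epochs):
        for x_i, target in zip(X, y):     # step 2: loop over training samples
            y_hat = predict(w, x_i)       # step 2.1: compute the output value
            update = eta * (target - y_hat)
            w[0] += update                # step 2.2: update the bias weight
            for j, x_ij in enumerate(x_i):
                w[j + 1] += update * x_ij  # ... and the feature weights
    return w


# Toy, linearly separable data: logical AND with labels in {-1, 1}.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]

w = fit_perceptron(X, y)
predictions = [predict(w, x_i) for x_i in X]
```

When a sample is classified correctly, `target - y_hat` is zero and the weights stay unchanged, so only misclassified samples move the decision boundary.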

The illustration below shows a perceptron where $\phi(z) = z$.

![](images/perceptron.png)

It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable and the learning rate is sufficiently small. If the two classes can't be separated by a linear decision boundary, we can set a maximum number of passes over the training dataset (epochs) and/or a threshold for the number of tolerated misclassifications; otherwise, the perceptron would never stop updating the weights.
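
The two stopping criteria can be illustrated with a small sketch. It assumes the standard perceptron update rule, and the names `fit_with_stopping`, `max_epochs`, and `max_errors` are hypothetical, chosen only for this example. XOR serves as a dataset that is not linearly separable, AND as one that is.

```python
def predict(w, x):
    """Class label for one sample; w[0] is the bias weight (x_0 = 1)."""
    z = w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x))
    return 1 if z >= 0 else -1


def fit_with_stopping(X, y, eta=0.5, max_epochs=20, max_errors=0):
    """Stop once an epoch makes <= max_errors updates or max_epochs is reached."""
    w = [0.0] * (len(X[0]) + 1)
    errors_per_epoch = []
    for _ in range(max_epochs):
        errors = 0
        for x_i, target in zip(X, y):
            update = eta * (target - predict(w, x_i))
            if update != 0.0:             # the sample was misclassified
                errors += 1
                w[0] += update
                for j, x_ij in enumerate(x_i):
                    w[j + 1] += update * x_ij
        errors_per_epoch.append(errors)
        if errors <= max_errors:          # tolerated number of misclassifications
            break
    return w, errors_per_epoch


X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# XOR is not linearly separable: every epoch makes at least one update,
# so only the epoch cap ends the training loop.
_, xor_errors = fit_with_stopping(X, [-1, 1, 1, -1], max_epochs=20)

# AND is linearly separable: training stops as soon as an epoch is error-free.
_, and_errors = fit_with_stopping(X, [-1, -1, -1, 1], max_epochs=20)
```

Here `xor_errors` holds one entry per allowed epoch, while `and_errors` ends early with a final entry of 0.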

### Perceptron vs Sigmoid Neuron

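In a perceptron, the output is binary: 0 or 1 in this convention (equivalently, -1 or 1 in the formulation above). In a sigmoid neuron, however, the inputs can take any value between 0 and 1 rather than just 0 or 1, and the output can take any value strictly between 0 and 1.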

Also, the perceptron's activation curve is essentially a step function. The sigmoid function, on the other hand, is a smooth S-shaped curve that equals 0.5 at $z = 0$; this midway value is usually used as the decision threshold.
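
To make the contrast concrete, the following sketch evaluates both functions at a few points. The function names `step` and `sigmoid` and the sample values in `zs` are illustrative choices, and the step function uses the 0/1 output convention of this section.

```python
import math


def step(z):
    """Perceptron-style output in the 0/1 convention."""
    return 1 if z >= 0 else 0


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)); always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


zs = [-4.0, -0.1, 0.0, 0.1, 4.0]
step_outputs = [step(z) for z in zs]         # jumps from 0 to 1 at z = 0
sigmoid_outputs = [sigmoid(z) for z in zs]   # rises smoothly, 0.5 at z = 0
```

A small change in $z$ around zero flips the step output completely, while the sigmoid output changes only slightly.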

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

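# Two interleaving half-moons: a small toy dataset that is not linearly separable
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Multi-layer perceptron with the default architecture, trained with the L-BFGS solver.
# No random_state is set on the classifier, so the accuracy can vary between runs.
nn = MLPClassifier(solver='lbfgs')
nn.fit(X_train, y_train)
y_pred = nn.predict(X_test)
accuracy_score(y_test, y_pred)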

0.88

Note that the `activation` keyword can take one of the following values: `relu`, `tanh`, `identity`, `logistic`. This is the activation function for the hidden layers; the default is `relu`.

  - `identity`, no-op activation, useful to implement a linear bottleneck, returns $f(x) = x$.
  - `logistic`, the logistic sigmoid function, returns $f(x) = \frac{1}{1 + \exp(-x)}$.
  - `tanh`, the hyperbolic tan function, returns $f(x) = \tanh(x)$.
  - `relu`, the rectified linear unit function, returns $f(x) = \max(0, x)$.
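
The four formulas listed above can be written out in plain Python for scalar inputs. This is only an illustration of the formulas, not scikit-learn's internal implementation; the function names and the `activations` dictionary are assumptions made for this sketch.

```python
import math


def identity(x):
    """f(x) = x"""
    return x


def logistic(x):
    """f(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """f(x) = tanh(x)"""
    return math.tanh(x)


def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


# Keys mirror the strings accepted by the `activation` keyword.
activations = {
    "identity": identity,
    "logistic": logistic,
    "tanh": tanh,
    "relu": relu,
}

# Evaluate every activation on a negative, a zero, and a positive input.
table = {name: [f(x) for x in (-2.0, 0.0, 2.0)] for name, f in activations.items()}
```

Comparing the rows of `table` shows the different output ranges: `logistic` stays within (0, 1), `tanh` within (-1, 1), `relu` clips negative inputs to 0, and `identity` leaves the input unchanged.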

Also, the `solver` keyword can take one of these values: `lbfgs`, `sgd`, `adam`. The default is `adam`.

In [1]:
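# Note: fetch_mldata was removed in scikit-learn 0.22; on newer versions,
# fetch_openml('mnist_784') is the usual replacement (its return value is structured differently).
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

# The second command raised OSError a few times because the internet connection was erratic and the downloaded data was corrupted.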

In [2]:
mnist.data.shape, mnist.keys(), mnist.COL_NAMES

((70000, 784),
 dict_keys(['DESCR', 'COL_NAMES', 'target', 'data']),
 ['label', 'data'])

In [6]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(mnist.data, mnist.target)

In [7]:
import network