# The Perceptron

Let us start by importing some of the libraries we will need and setting up our notebook session:

In [None]:
from __future__ import print_function
from __future__ import division

import math

import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

## A Basic Perceptron

Here is the blueprint of our perceptron, encoded as a plain Python class, together with a single auxiliary function. Below, we will walk through the details of each code component.

In [None]:
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

class Perceptron:
    def __init__(self, nb_features=None):
        self.weights = None
        self.nb_features = nb_features # i.e. the nb of incoming connections from other neurons
    
    def set_weights(self, weights=None):
        if weights is not None:
            self.weights = weights
        else:
            self.weights = np.ones(self.nb_features)
    
    def predict(self, feature_vectors, squash=False):
        scores = []
        for fv in feature_vectors:
            s = 0.0
            for i in range(self.nb_features):
                s += self.weights[i] * fv[i]
            scores.append(s)
        
        if squash:
            scores = [sigmoid(v) for v in scores]
        
        return scores

We start by predicting house prices in a really naive way, i.e. we assume that all characteristics are equally important, each having a (positive) weight of 1. We characterize each house along 5 integer variables, namely:
* the number of doors
* surface (in square meters)
* the number of bedrooms
* the number of bathrooms
* its age (in years)

First, we initialize our perceptron, and we specify that we will use 5 input features:

In [None]:
perceptron = Perceptron(nb_features=5)

Next, we set the weights of our perceptron; by default, our perceptron will assign an equal, positive weight of 1.0 to each feature:

In [None]:
perceptron.set_weights()
print(perceptron.weights)

Now, we create a dummy data set of 5 houses, each of which is represented as a fixed-length vector of five integers (the indices of which correspond to the bullet list above):

In [None]:
houses = [[2, 3, 5, 1, 8],
          [1, 2, 1, 3, 5],
          [4, 4, 2, 1, 3],
          [11, 6, 8, 9, 8],
          [9, 10, 8, 1, 30]]

In this data set, the last houses have a higher number of doors, bathrooms etc., so we would expect the perceptron to predict higher prices for them. We now feed this data set to the `predict()` function of our perceptron (cf. standard `sklearn` naming conventions), and we have it predict a price for each house:

In [None]:
prices = perceptron.predict(houses)
print(prices)

Indeed, we see that, for instance, the penultimate house has a relatively higher price. Of course, our feature weighting is now ridiculously naive. Clearly,
* the number of bathrooms should matter relatively more than the number of doors
* the age of a house should negatively affect the total price

We can now fix this by setting a more sensible weight vector, which should have five entries (one for each feature). Using these new weights, we should now obtain a more accurate estimate:

In [None]:
weights = [0.5, # doors
           0.9, # surface
           0.8, # bedrooms
           0.9, # bathrooms
           -0.5] # age
perceptron.set_weights(weights)
print(perceptron.predict(houses))
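
The weighted sum behind these predictions can be checked by hand. The following sketch recomputes the scores for the penultimate and the final house, using the same feature values and weights as above. The names `penultimate`, `final` and the helper `weighted_sum` are introduced here for illustration only.

```python
# Feature order: doors, surface, bedrooms, bathrooms, age
weights = [0.5, 0.9, 0.8, 0.9, -0.5]
penultimate = [11, 6, 8, 9, 8]
final = [9, 10, 8, 1, 30]

def weighted_sum(features, weights):
    # multiply each feature by its weight and add up the products
    return sum(w * f for w, f in zip(weights, features))

score_penultimate = weighted_sum(penultimate, weights)
score_final = weighted_sum(final, weights)
print(score_penultimate)  # approximately 21.4
print(score_final)        # approximately 5.8
```

The large negative contribution of the age feature (`-0.5 * 30`) is what pulls the final house's score down.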

Predicting the price of a house is simple enough and comes down to calculating the **weighted sum** of the feature values. We now see, for instance, that the final house is valued much lower than the penultimate house in the list, based on the assumption that, while it is much larger, it is also much older and will therefore require much more investment. Note that the predicted prices are not in an actual currency and vary wildly. In many applications, it is very common to **normalize** the predictions of a perceptron and 'squash' them to a range between 0 and 1. Historically, the **sigmoid** function has often been used for this. By setting `squash=True`, we can enable this behaviour in our perceptron:

In [None]:
print(perceptron.predict(houses, squash=True))

We now see that we obtain predictions which lie neatly between 0 and 1. In the context of deep learning, functions such as the sigmoid function are often called **activation** functions, because they control how strongly a particular neuron will be activated. In our perceptron, we have a single output neuron, the activation of which is controlled via a sigmoid activation. We call such an activation **element-wise**, because it is applied to each element in a list individually.
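
A few sample values illustrate how the sigmoid squashes its input: large negative scores end up near 0, a score of 0 maps to exactly 0.5, and large positive scores end up near 1. The sketch below reuses the `sigmoid` definition from above; the list `sample_scores` is made up for illustration.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

sample_scores = [-10.0, -1.0, 0.0, 1.0, 10.0]
# element-wise: the function is applied to each score separately
squashed = [sigmoid(s) for s in sample_scores]
for s, a in zip(sample_scores, squashed):
    print(s, '->', round(a, 4))
```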

## Speeding up our perceptron

The perceptron which we created will be slow, because, right now, it only relies on traditional, unoptimized Python code. Inspect for instance the code block in the `predict()` function above.

The many `for`-loops in this code will make it extremely slow to run. In Deep Learning (and in so many other fields), such code will be too slow to be usable in practice: it is much more common to use **vectorized routines** from specialized libraries, such as `numpy`, which can easily replace our naive `for`-loops. To be able to vectorize our code more efficiently, we will have to convert our Python lists to `numpy` arrays, which essentially are matrix objects that have a **`shape`** attribute. As will become clear below, the `shape` is an extremely important property of `numpy` arrays, and you will want to keep track of it frequently, e.g.:

In [None]:
houses = np.array(houses)
print(houses.shape)

weights = np.array(weights)
print(weights.shape)

This tells us that our dummy data set is now represented by a two-dimensional matrix which has 5 rows (corresponding to the number of houses) and 5 columns (corresponding to the number of features). Our `weights` variable, on the other hand, is a one-dimensional vector rather than a matrix, consisting of five numbers (scalars). We are now ready to redefine our Perceptron blueprint to be able to deal with such arrays more efficiently:

In [None]:
def v_sigmoid(x):
    # element-wise; eats any tensor
    return 1 / (1 + np.exp(-x))

class VectorizedPerceptron:
    def __init__(self, nb_features=None):
        self.weights = None
        self.nb_features = nb_features
    
    def set_weights(self, weights=None):
        if isinstance(weights, np.ndarray):
            self.weights = weights
        else:
            self.weights = np.ones(self.nb_features)
    
    def predict(self, feature_vectors, squash=False):
        scores = np.multiply(feature_vectors, self.weights).sum(axis=-1)
        
        if squash:
            scores = v_sigmoid(scores)
        
        return scores
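
The single line in `predict()` relies on `numpy` broadcasting: multiplying a matrix of shape `(5, 5)` with a vector of shape `(5,)` scales every row by the weight vector, and summing over the last axis then yields one score per house. The sketch below checks this against the matrix-vector product `np.dot`, which computes the same weighted sums; the intermediate variable names are only illustrative.

```python
import numpy as np

houses = np.array([[2, 3, 5, 1, 8],
                   [1, 2, 1, 3, 5],
                   [4, 4, 2, 1, 3],
                   [11, 6, 8, 9, 8],
                   [9, 10, 8, 1, 30]])
weights = np.array([0.5, 0.9, 0.8, 0.9, -0.5])

# broadcasting: each row of `houses` is multiplied element-wise by `weights`
products = np.multiply(houses, weights)
print(products.shape)   # (5, 5)

# summing over the last axis collapses each row into a single score
scores = products.sum(axis=-1)
print(scores.shape)     # (5,)

# the same weighted sums, written as a matrix-vector product
scores_dot = np.dot(houses, weights)
print(np.allclose(scores, scores_dot))  # True
```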

In fact, once you get used to such vectorized notation, vectorized code becomes much easier to read than stacks of for-loops. The result is the same:

In [None]:
v_perceptron = VectorizedPerceptron(nb_features=5)
v_perceptron.set_weights(weights)

print(v_perceptron.predict(houses))
print(v_perceptron.predict(houses, squash=True))

But is this faster? Let us create an artificial, large house data set to check this:

In [None]:
houses = np.random.uniform(low=-0.05, high=0.05, size=(1000, 500))
print(houses.shape)

In [None]:
# simple:
weights = list(np.random.uniform(low=-0.05, high=0.05, size=(500)))
perceptron = Perceptron(nb_features=500)
perceptron.set_weights(weights)

# vectorized:
v_perceptron = VectorizedPerceptron(nb_features=500)
v_perceptron.set_weights(np.array(weights))

Now, we can time the difference:

In [None]:
%timeit perceptron.predict(houses, squash=True)

In [None]:
%timeit v_perceptron.predict(houses, squash=True)

Yep, it seems to be quite a bit faster, and believe me, the difference grows even larger for larger data sets.
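
The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard `timeit` module. The two helper functions below are simplified stand-ins for the `predict()` methods above (without the squashing step), and the array sizes are chosen smaller to keep the run short; actual timings depend on the machine.

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
houses = rng.uniform(low=-0.05, high=0.05, size=(200, 50))
weights = rng.uniform(low=-0.05, high=0.05, size=50)

def predict_loops(feature_vectors, weights):
    # plain Python: one multiplication and addition per feature
    scores = []
    for fv in feature_vectors:
        s = 0.0
        for i in range(len(weights)):
            s += weights[i] * fv[i]
        scores.append(s)
    return scores

def predict_vectorized(feature_vectors, weights):
    # numpy: the whole computation in one vectorized expression
    return np.multiply(feature_vectors, weights).sum(axis=-1)

# run each variant 5 times and report the total time in seconds
t_loops = timeit.timeit(lambda: predict_loops(houses, weights), number=5)
t_vec = timeit.timeit(lambda: predict_vectorized(houses, weights), number=5)
print('loops:      %.4f s' % t_loops)
print('vectorized: %.4f s' % t_vec)
```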

## Optimizing the perceptron

Let us first implement a pre-historic approach to optimizing our perceptron. We first load the Boston Housing Prices Dataset that ships with `sklearn`:

In [None]:
# note: `load_boston` was removed in scikit-learn 1.2, so this cell requires an older release
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target

As you can see, this data set has information for 506 houses across 13 numerical features. The target vector which we would like to fit is a list of 506 prices, one for each house:

In [None]:
print(X.shape)
print(y.shape)
print(y[:10])

We start again with a stripped-down version of our `Perceptron` class: upon instantiation, objects from this class will create an array of weight parameters, i.e. really small values randomly sampled from a uniform distribution between -0.05 and +0.05:

In [None]:
class Perceptron:
    def __init__(self, nb_features=None):
        np.random.seed(156651)
        self.weights = np.random.uniform(low=-.05, high=0.05,
                                         size=nb_features)
    
    def predict(self, feature_vectors):
        return np.multiply(feature_vectors, self.weights).sum(axis=-1)


Additionally, our `Perceptron` has a `predict` method, which we can use to obtain house prices:

In [None]:
perceptron = Perceptron(nb_features=X.shape[1])
prices = perceptron.predict(X)
print(prices.shape)

To be able to measure the current quality of our system, we need a **loss function** or **objective function**: later, we will try to bring down the loss returned by this function. As a loss function, we start off with the relatively simple **mean squared error**: it compares the prices output by the system with the **gold standard** (in the `y` vector) and returns the mean of the squared differences between each predicted price and its gold standard equivalent:

In [None]:
def mean_squared_error(y_gold, y_pred):
    return np.mean((y_gold - y_pred) ** 2)
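
As a quick sanity check, the function can be applied to a tiny example that is easy to verify by hand. The numbers below are invented for illustration: the differences are 1, 0 and -2, the squares are 1, 0 and 4, and their mean is 5/3.

```python
import numpy as np

def mean_squared_error(y_gold, y_pred):
    return np.mean((y_gold - y_pred) ** 2)

y_gold = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

mse = mean_squared_error(y_gold, y_pred)
print(mse)  # (1 + 0 + 4) / 3, i.e. roughly 1.667

# a perfect prediction gives a loss of exactly zero
print(mean_squared_error(y_gold, y_gold))  # 0.0
```

Because the difference is squared, the order of the two arguments does not affect the result.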

Let us find out how large the loss is for our current model, which has randomly initialized parameters:

In [None]:
preds = perceptron.predict(X)
print(mean_squared_error(preds, y))

Wow, that's huge! We are now ready to add a naive `fit` method to optimize the parameters of our perceptron. In each **epoch** (i.e. one pass over the entire data set), we loop over each parameter individually and test two different situations: *hypothesis a*, in which we slightly increase the weight under scrutiny (by a small `learning_rate`), and *hypothesis b*, in which we slightly decrease it (again by a small `learning_rate`). Then, we adjust the parameters in the light of whichever hypothesis gave the best result (i.e. the lower loss). That goes like this:

In [None]:
class Perceptron:
    def __init__(self, nb_features=None):
        self.nb_features = nb_features
        np.random.seed(156651)
        self.weights = np.random.uniform(low=-0.05, high=0.05,
                                         size=self.nb_features)
    
    def predict(self, feature_vectors, weights=None):
        if not isinstance(weights, np.ndarray):
            return np.multiply(feature_vectors, self.weights).sum(axis=-1)
        else:
            return np.multiply(feature_vectors, weights).sum(axis=-1)
    
    def fit(self, X, y, learning_rate=0.1, nb_epochs=10):
        losses = []
        
        for e in range(nb_epochs):
            if e % 100 == 0 and e:
                learning_rate *= 0.9
            
            for idx in range(self.nb_features):
                # hypothesis a: we increase the parameter:
                weights_plus = self.weights.copy()
                weights_plus[idx] += learning_rate

                # hypothesis b: we decrease the parameter:
                weights_minus = self.weights.copy()
                weights_minus[idx] -= learning_rate
                
                # we obtain the predictions for both hypotheses a and b:
                plus_preds = self.predict(X, weights = weights_plus)
                minus_preds = self.predict(X, weights = weights_minus)
                
                # we compute the loss for both hypotheses a and b:
                plus_loss = mean_squared_error(plus_preds, y)
                minus_loss = mean_squared_error(minus_preds, y)
                
                # we keep whichever hypothesis yields the lower loss
                if plus_loss < minus_loss:
                    self.weights = weights_plus.copy()
                elif minus_loss < plus_loss:
                    self.weights = weights_minus.copy()
            
            # we calculate the loss
            loss = mean_squared_error(self.predict(X), y)
            losses.append(loss)
            
            if e and e % 20 == 0:
                print('Loss after epoch #', e + 1, '->', loss)
        
        return losses

If we now apply this `fit` method with a small enough learning rate, we see that we are able to gradually push down the loss over subsequent epochs, until the loss curve saturates and stabilizes:

In [None]:
p = Perceptron(nb_features=X.shape[1])
p.predict(X)
losses = p.fit(X, y, nb_epochs=500, learning_rate=0.001)

A graphical representation of the loss curve is easy enough to obtain:

In [None]:
plt.plot(losses)
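
To see the plus/minus search at work in a setting where the right answer is known, the sketch below applies the same idea to a small synthetic data set. Everything here is an assumption made for illustration: the data are generated from the hypothetical weights `true_weights`, and `fit_by_probing` is a simplified, stand-alone version of the `fit` method above (no learning-rate decay, no printing).

```python
import numpy as np

def mean_squared_error(y_gold, y_pred):
    return np.mean((y_gold - y_pred) ** 2)

def fit_by_probing(X, y, learning_rate=0.01, nb_epochs=200):
    weights = np.zeros(X.shape[1])
    losses = []
    for e in range(nb_epochs):
        for idx in range(len(weights)):
            # hypothesis a: nudge the weight up; hypothesis b: nudge it down
            plus, minus = weights.copy(), weights.copy()
            plus[idx] += learning_rate
            minus[idx] -= learning_rate
            plus_loss = mean_squared_error(y, X.dot(plus))
            minus_loss = mean_squared_error(y, X.dot(minus))
            # keep whichever hypothesis gives the lower loss
            if plus_loss < minus_loss:
                weights = plus
            elif minus_loss < plus_loss:
                weights = minus
        losses.append(mean_squared_error(y, X.dot(weights)))
    return weights, losses

rng = np.random.RandomState(0)
X_toy = rng.uniform(low=-1.0, high=1.0, size=(100, 3))
true_weights = np.array([1.0, -0.5, 0.25])   # hypothetical "ground truth"
y_toy = X_toy.dot(true_weights)

found, toy_losses = fit_by_probing(X_toy, y_toy)
print(found)
print(toy_losses[0], toy_losses[-1])
```

Since every weight is moved by a fixed step in each epoch, the weights found this way keep hovering around the optimum at roughly the scale of the learning rate instead of settling exactly on it.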

## Do It Yourself:

Add some simple modifications to the code above. Note that these exercises in fact introduce some key concepts that are very central in modern deep learning research.

* Adapt our Perceptron to show an adaptive, **decreasing learning rate**: after a number of epochs (e.g. 100), it would make sense to decrease the learning steps taken, for instance by a factor of three.
* Implement a form of **early stopping**: stop the training procedure, if the loss doesn't significantly go down anymore for a fixed number of epochs.
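
As one possible starting point for the second exercise, the sketch below isolates only the stopping criterion. The function name `should_stop` and its parameters `patience` and `min_delta` are assumptions chosen for illustration, not part of the code above; wiring the check into `fit` is left as part of the exercise.

```python
def should_stop(losses, patience=10, min_delta=1e-4):
    """Return True if the loss has not improved by at least `min_delta`
    during the last `patience` epochs."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta

# a loss curve that first drops and then flattens out
flat_curve = [10.0, 5.0, 2.0, 1.0] + [1.0] * 10
# a loss curve that keeps improving
falling_curve = [10.0 - 0.5 * i for i in range(14)]

print(should_stop(flat_curve))     # True
print(should_stop(falling_curve))  # False
```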

## A Perceptron in TensorFlow

(For the sake of reference, I include a similar implementation of our perceptron in `TensorFlow`, but we probably won't have time to work through it block by block...)

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target

X = np.array(X, dtype='float32')
y = np.array(y, dtype='float32')

In [None]:
# purely "symbolic variable"
train_input = tf.placeholder(tf.float32, [None, X.shape[1]])

In [None]:
train_output = tf.placeholder(tf.float32, [None])

In [None]:
rng = np.random.RandomState(156651)  # seeded random number generator
weights = tf.Variable(np.asarray(rng.uniform(low=-0.05, high=0.05,
                                 size=(X.shape[1])),
                     dtype='float32'))

In [None]:
model_output = tf.reduce_sum(tf.multiply(train_input, weights), 1)

In [None]:
mse_cost = tf.reduce_mean(tf.square(model_output - train_output),
                          reduction_indices=0)

In [None]:
train_step = tf.train.GradientDescentOptimizer(0.0000001).minimize(mse_cost)

In [None]:
init = tf.global_variables_initializer()

In [None]:
sess = tf.Session()
sess.run(init)

In [None]:
f = sess.run(model_output, feed_dict={train_input: X})
print(mean_squared_error(f, y))
g = sess.run(mse_cost, feed_dict={train_input: X, train_output: y})
print(g)

In [None]:
losses = []
for i in range(500):
    c = sess.run(mse_cost, feed_dict={train_input: X, train_output: y})
    losses.append(c)
    if i % 50 == 0:
        print(c)
    sess.run(train_step, feed_dict={train_input: X, train_output: y})

In [None]:
plt.plot(losses)
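
Conceptually, `GradientDescentOptimizer` replaces the plus/minus probing from the previous section with a step along the negative gradient of the loss. For the mean squared error of a linear model, that gradient can be written down directly as `2/N * X.T.dot(X.dot(w) - y)`, whereas TensorFlow derives it automatically. The sketch below implements this update in plain `numpy` on a small synthetic data set; the data, the learning rate and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X_toy = rng.uniform(low=-1.0, high=1.0, size=(100, 3))
true_weights = np.array([1.0, -0.5, 0.25])   # hypothetical "ground truth"
y_toy = X_toy.dot(true_weights)

weights = np.zeros(3)
learning_rate = 0.1
gd_losses = []

for step in range(500):
    errors = X_toy.dot(weights) - y_toy          # prediction minus gold standard
    gd_losses.append(np.mean(errors ** 2))       # mean squared error
    gradient = 2.0 / len(y_toy) * X_toy.T.dot(errors)
    weights -= learning_rate * gradient          # step against the gradient

print(gd_losses[0], gd_losses[-1])
print(weights)
```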

# Addendum

There is an additional notebook in the repo, [A simple implementation of ANN for MNIST](1.4 (Extra) A Simple Implementation of ANN for MNIST.ipynb), which contains a *naive* implementation of **SGD** and an **MLP** applied to the **MNIST** dataset.

This accompanies the online text http://neuralnetworksanddeeplearning.com/ . The book is highly recommended. 

---------------------------------