# Learning XOR
#### from http://www.deeplearningbook.org/ 

### define imports

In [1]:
import sys
import numpy as np
np.random.seed(123)  # for reproducibility
import matplotlib.pyplot as plt
%matplotlib inline

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import RMSprop, SGD

Using TensorFlow backend.


Let's try to learn the XOR function $$ y = f_{xor}(\mathbf{x}) $$ by means of deep learning.
For this, we try to fit our to-be-created model to the dataset

$$
\begin{align}
\mathbb{X} &= \Bigl\{[0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T\Bigr\} \\
\mathbf{y} &= \big[ 0, 1, 1, 0 \big]^T
\end{align}
$$

Our model shall provide the function
$$\hat{f}_{xor}(\mathbf{x};\mathbf{\theta})$$

### input / output data

In [2]:
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])
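
As a quick sanity check, the targets can be compared against NumPy's own XOR. This is only a sketch: it redefines `X` and `y` so that it runs on its own, and `y_check` is a name introduced here for illustration.

```python
import numpy as np

# Same dataset as in the cell above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# XOR of the two input columns, reshaped to a column vector like y
y_check = np.logical_xor(X[:, 0], X[:, 1]).astype(int).reshape(-1, 1)
print(np.array_equal(y, y_check))  # True
```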

## Linear Model
Suppose we choose a linear model with $ \mathbf{\theta} $ consisting of $\mathbf{w}$ and $b$. Then our model is defined to be

$$
\hat{f}(\mathbf{x}; \mathbf{w}, b) = \mathbf{x}^T\mathbf{w} + b.
$$

If we treat this as a regression problem, we can use the mean squared error (MSE) as the loss function. MSE is chosen here only because it keeps the math simple. This gives

$$
J(\mathbf{\theta}) = \frac{1}{4} \sum_{\mathbf{x}\in\mathbb{X}}\bigl(f_{xor}(\mathbf{x}) - \hat{f}_{xor}(\mathbf{x};\mathbf{\theta}) \bigr)^2
$$

Solving $\frac{\partial J}{\partial \theta} = 0$ is relatively simple and yields $\mathbf{w} = \mathbf{0}$ and $b = \frac{1}{2}$. 
__Thus, our linear model will output 0.5 for every input.__
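
The closed-form result can be checked numerically. The sketch below assumes the usual trick of appending a column of ones to the inputs so that $b$ becomes a third weight, and then asks NumPy for the least-squares solution; the names `X_aug`, `theta`, `predictions` and `mse` are introduced here for illustration only.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so that the bias b is fitted as a third weight
X_aug = np.hstack([X, np.ones((4, 1))])

# Least-squares solution of X_aug @ theta = y, i.e. the minimiser of the MSE
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:2], theta[2]

predictions = X_aug @ theta
mse = np.mean((y - predictions) ** 2)
print(w, b, predictions, mse)
```

Up to floating-point rounding, this should give $\mathbf{w} = \mathbf{0}$, $b = 0.5$, a prediction of 0.5 for each of the four inputs, and a loss of 0.25.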

In [3]:
# use mse as loss function and the RMSProp optimizer
loss_fn = 'mse'
rms = RMSprop(lr=0.01)
# create linear model
linearModel = Sequential()
linearModel.add(Dense(1, input_dim=2))
linearModel.compile(loss=loss_fn, optimizer=rms, metrics=['accuracy'])
# print model summary
print(linearModel.summary())
# train model
linearModel.fit(X, y, batch_size=4, epochs=100)
# print probabilities
print('Predictions: \n{}'.format(linearModel.predict_proba(X)))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1)                 3         
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
...

## 2-Layer Model

Clearly, our linear model is not capable of providing a solution.
This is expected: __XOR__ is a __non-linear__ operation and thus cannot be represented by a linear function.
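
One way to see this is a short argument by contradiction. Assume a linear model $\hat{f}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + b$ reproduced all four targets exactly (here $w_1, w_2$ denote the two components of $\mathbf{w}$). Evaluating it at the four inputs gives

$$
\begin{align}
\hat{f}([0,0]^T) &= b = 0 \\
\hat{f}([0,1]^T) &= w_2 + b = 1 \;\Rightarrow\; w_2 = 1 \\
\hat{f}([1,0]^T) &= w_1 + b = 1 \;\Rightarrow\; w_1 = 1 \\
\hat{f}([1,1]^T) &= w_1 + w_2 + b = 2 \neq 0
\end{align}
$$

The first three points already fix all three parameters, and the fourth point then contradicts the required output of 0, so no exact linear fit exists.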

### Solution
One way to solve this problem is to introduce an additional layer in our model with a non-linear activation function, i.e.

$$
\begin{align}
\mathbf{h} &= f^{(1)}(\mathbf{x}; \mathbf{W}, \mathbf{c}) \\
y &= f^{(2)}(\mathbf{h}; \mathbf{w}, b) \\
\text{and thus} \\
\hat{f}(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) &= f^{(2)}\bigl(f^{(1)}(\mathbf{x})\bigr)
\end{align}
$$

Now, the activation function in $f^{(1)}$ is often chosen to be a ReLU (rectified linear unit), i.e.
$g(z) = \max\{0, z\}$,
which is arguably the simplest non-linear function one could imagine: it is piecewise linear and differentiable everywhere except at $z = 0$. Our model function then becomes
$$
\hat{f}(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max\{0, \mathbf{W}^T\mathbf{x} + \mathbf{c}\} + b
$$

Now we can actually specify a solution directly using
$$
\begin{align}
\mathbf{W} &= \left[ \matrix{1 & 1\cr 1 & 1} \right] \\
\mathbf{c} &= \left[ \matrix{0 \cr -1} \right] \\
\mathbf{w} &= \left[ \matrix{1 \cr -2} \right] \quad\text{ and } \\
         b &= 0 \\
\end{align}
$$

We can step through the operations to see what these parameters lead to. Let $\mathbf{X}$ be the design matrix containing all four points in the binary input space, with one example per row:
$$\mathbf{X} = \left[ \matrix{0 & 0\cr 0 & 1\cr 1 & 0\cr 1 & 1} \right]$$

The first step is to multiply the input matrix by the first layer's weight matrix:
$$\mathbf{X}\mathbf{W} = \left[ \matrix{0 & 0\cr 1 & 1\cr 1 & 1\cr 2 & 2} \right]$$

Next, we add the bias vector $\mathbf{c}$ to obtain
$$\left[ \matrix{0 & -1\cr 1 & 0\cr 1 & 0\cr 2 & 1} \right]$$

As we see, all the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of $\mathbf{h}$ for each example, we apply the rectified linear transformation:
$$\left[ \matrix{0 & 0\cr 1 & 0\cr 1 & 0\cr 2 & 1} \right]$$

We finish by multiplying by the weight vector $\mathbf{w}$:
$$\left[ \matrix{0\cr 1\cr 1\cr 0} \right]$$
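
The same steps can be replayed in NumPy as a check of the hand-picked parameters. This is a minimal sketch; the intermediate names `XW`, `Z`, `H` and `y_hat` are chosen here for illustration.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # design matrix, one example per row
W = np.array([[1, 1], [1, 1]])                  # first-layer weights
c = np.array([0, -1])                           # first-layer bias
w = np.array([1, -2])                           # output weights
b = 0                                           # output bias

XW = X @ W             # multiply by the first layer's weight matrix
Z = XW + c             # add the bias vector (broadcast over the rows)
H = np.maximum(0, Z)   # rectified linear transformation
y_hat = H @ w + b      # multiply by the output weight vector
print(y_hat)  # [0 1 1 0]
```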

A neural net cannot simply guess the correct model parameters, but it can learn them over a number of training iterations. The solution it finds will probably differ from ours but still do the job.

In [59]:
## create 2-layer model
# set loss and optimizer functions
loss_fn = 'mse'
rms = RMSprop(lr=0.01)
# the model is a sequential model
betterModel = Sequential()
# add layers
betterModel.add(Dense(4, input_dim=2))
betterModel.add(Activation('relu')) # <-- this is our non-linearity
betterModel.add(Dense(1))
betterModel.add(Activation('sigmoid')) # <-- this squashes the output into the range (0, 1)
betterModel.compile(loss=loss_fn, optimizer=rms, metrics=['accuracy'])
# print model summary
print(betterModel.summary())
# train model
betterModel.fit(X, y, batch_size=4, epochs=500)
# print probabilities
print('Predictions: \n{}'.format(betterModel.predict_proba(X)))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_93 (Dense)             (None, 4)                 12        
_________________________________________________________________
activation_67 (Activation)   (None, 4)                 0         
_________________________________________________________________
dense_94 (Dense)             (None, 1)                 5         
_________________________________________________________________
activation_68 (Activation)   (None, 1)                 0         
Total params: 17
Trainable params: 17
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/