Before starting, run:
```
 conda install -c conda-forge ffmpeg
```

#### We are going to train a neural net to predict the piecewise-linear function shown below:

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math


n_points = 200
x = np.linspace(0, 2, n_points)
y = np.array([0] * int(n_points / 2) + list(x[:int(n_points / 2)])) * 2

plt.figure(figsize=(5, 2))
plt.plot(x, y, linewidth=2)
plt.title('ridiculously simple data')
plt.xlabel('a')
plt.ylabel('b')
plt.show()

This may look familiar: it has the same shape as the ReLU activation function, only shifted and rescaled.
f(x) = max(0, x)
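
To make the resemblance concrete, here is a small sketch that checks whether the data is a shifted, scaled ReLU. The helper `relu` and the variable `kink` are names introduced here for illustration; they are not part of the notebook's model code.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: element-wise max(0, z)."""
    return np.maximum(0, z)

# Rebuild the data exactly as in the cell above.
n_points = 200
x = np.linspace(0, 2, n_points)
y = np.array([0] * int(n_points / 2) + list(x[:int(n_points / 2)])) * 2

# The flat half ends where the second half of x begins (just past x = 1).
kink = x[n_points // 2]

# Compare the data with a ReLU shifted right by `kink` and scaled by 2.
same = np.allclose(y, 2 * relu(x - kink))
print(same)
```

If the check passes, a single ReLU neuron with weight 2 and bias `-2 * kink` can reproduce the data exactly, which is why one neuron is enough for this toy problem.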

We'll start with a one-neuron model on this data. As a reminder, this is how a neuron (or perceptron) works within the model:

![image](http://www.swanintelligence.com/images/2016q1/neuron.png)

We have two choices to make here. The first is the initialization of the weights: we let the layer's default initializer draw them at random (the code below does not set one explicitly). The second is the activation function: we choose ReLU because its shape resembles our data.
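
As a rough sketch of what the one-neuron model computes, the NumPy function below applies a weight, a bias, and the ReLU. The function name `neuron` and the hand-picked weights are hypothetical and only serve as illustration; `w0` and `w1` follow the naming used for the weight and bias in the cells below.

```python
import numpy as np

def neuron(x, w0, w1):
    """One ReLU neuron: weight w0 scales the input, bias w1 shifts it."""
    return np.maximum(0, w0 * x + w1)

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])

# Hypothetical weights, chosen by hand for illustration (not taken from Keras).
good = neuron(x, w0=2.0, w1=-2.0)   # kink at x = 1, slope 2
flat = neuron(x, w0=-1.0, w1=0.0)   # every pre-activation is <= 0
print(good)
print(flat)
```

The first call has the same shape as our data, while the second outputs zero everywhere, which already hints at why the starting weights matter.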

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation

np.random.seed(0)
model = Sequential()
model.add(Dense(units=1, input_dim=1))
model.add(Activation("relu"))
model.compile(loss='mean_squared_error', optimizer='sgd')

# print initial weights
weights = model.layers[0].get_weights()
w0 = weights[0][0][0]
w1 = weights[1][0]
'neural net initialized with weights w0: {w0:.2f}, w1: {w1:.2f}'.format(**locals())

In [None]:
from keras.callbacks import Callback

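class TrainingHistory(Callback):
    """Record the loss after every batch and a prediction snapshot every `save_every` batches."""

    def on_train_begin(self, logs=None):
        self.losses = []
        self.predictions = []
        self.i = 0
        self.save_every = 50

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.losses.append(logs.get('loss'))
        self.i += 1
        if self.i % self.save_every == 0:
            # self.model is set by Keras; X_train is the global defined in the next cell
            pred = self.model.predict(X_train, verbose=0)
            self.predictions.append(pred)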

In [None]:
history = TrainingHistory()
X_train = np.array(x, ndmin=2).T
Y_train = np.array(y, ndmin=2).T
model.fit(X_train,
          Y_train,
          epochs=2000,
          verbose=0,
          batch_size=50,
          callbacks=[history])

# print trained weights
weights = model.layers[0].get_weights()
w0 = weights[0][0][0]
w1 = weights[1][0]
'neural net weights after training w0: {w0:.2f}, w1: {w1:.2f}'.format(**locals())

In [None]:
plt.figure(figsize=(6, 3))
plt.plot(history.losses)
plt.ylabel('error')
plt.xlabel('iteration')
plt.title('training error')
plt.show()

In [None]:
# make the animation
import matplotlib.animation as animation
Writer = animation.writers['ffmpeg']
writer = Writer(fps=15, metadata=dict(artist='Me'), bitrate=1800)

fig = plt.figure(figsize=(5, 2.5))
plt.plot(x, y,  label='data')
line, = plt.plot(x, history.predictions[0],  label='prediction')
plt.legend(loc='upper left')

def init():
    line.set_data([], [])
    return line,

def update_line(num):
    line.set_xdata(x)
    line.set_ydata(history.predictions[num])
    return line,

ani = animation.FuncAnimation(fig, update_line, init_func=init, frames=len(history.predictions),
                                   interval=50, blit=True)

ani.save('neuron_training.mp4', writer=writer)

We can try tuning the batch size. With `optimizer='sgd'`, Keras uses stochastic gradient descent to update the weights: on each iteration it takes a random subset of our data (a batch) and performs one gradient descent step on the error for that subset. By default Keras uses 32 data points per iteration; above we set this to 50. Occasionally a batch is very skewed, and the weight update that is optimal for that batch actually makes the predictions worse for the whole data set.

The sample size for stochastic gradient descent is the `batch_size` parameter of the `Model.fit()` method. If we use a larger batch size, here all 200 points at once, we should see a monotonically decreasing error.
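
The following is a minimal NumPy sketch of one epoch of mini-batch gradient descent for a single ReLU neuron with mean squared error. It is not Keras's implementation; the function name `sgd_epoch`, the learning rate `lr`, and the starting weights are assumptions made for illustration.

```python
import numpy as np

def sgd_epoch(x, y, w0, w1, batch_size, lr=0.01, rng=None):
    """One pass over the data; returns updated weights and per-batch losses."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = rng.permutation(len(x))          # shuffle once per epoch
    losses = []
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        z = w0 * xb + w1                     # pre-activation
        err = np.maximum(0, z) - yb          # prediction error on this batch
        losses.append(np.mean(err ** 2))     # mean squared error
        active = (z > 0).astype(float)       # ReLU derivative: 1 where z > 0
        grad = 2 * err * active
        w0 -= lr * np.mean(grad * xb)        # d(loss)/d(w0)
        w1 -= lr * np.mean(grad)             # d(loss)/d(w1)
    return w0, w1, losses

x = np.linspace(0, 2, 200)
y = 2 * np.maximum(0, x - x[100])            # matches the notebook's data up to rounding

# Hypothetical starting weights, the same for both runs.
_, _, small = sgd_epoch(x, y, w0=0.5, w1=0.0, batch_size=50)
_, _, full = sgd_epoch(x, y, w0=0.5, w1=0.0, batch_size=200)
print(len(small), len(full))                 # weight updates per epoch
```

With `batch_size=50` each epoch makes four updates, each based on a different random quarter of the data, so the recorded loss can jump from batch to batch. With `batch_size=200` each epoch makes a single update computed on the whole data set.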

In [None]:
history = TrainingHistory()
model = Sequential()
model.add(Dense(units=1, input_dim=1))
model.add(Activation("relu"))
model.compile(loss='mean_squared_error', optimizer='sgd')
model.fit(X_train,
          Y_train,
          batch_size=200,
          epochs=2000,
          verbose=0,
          callbacks=[history])

plt.figure(figsize=(6, 3))
plt.plot(history.losses)
plt.ylabel('error')
plt.xlabel('iteration')
plt.title('training error')
plt.show()

Weight initialization also matters:

In [None]:
np.random.seed(2)
history = TrainingHistory()
model = Sequential()
model.add(Dense(units=1, input_dim=1))
model.add(Activation("relu"))
model.compile(loss='mean_squared_error', optimizer='sgd')

weights = model.layers[0].get_weights()
w0 = weights[0][0][0]
w1 = weights[1][0]
print('neural net initialized with weights w0: {w0:.2f}, w1: {w1:.2f}'.format(**locals()))

model.fit(X_train,
          Y_train,
          batch_size=200,
          epochs=2000,
          verbose=0,
          callbacks=[history])

weights = model.layers[0].get_weights()
w0 = weights[0][0][0]
w1 = weights[1][0]
print('neural net weights after training w0: {w0:.2f}, w1: {w1:.2f}'.format(**locals()))

fig = plt.figure(figsize=(5, 2.5))
plt.plot(x, y,  label='data')
line, = plt.plot(x, history.predictions[0],  label='prediction')
plt.xlabel('a')
plt.ylabel('b')
plt.legend(loc='upper left')

plt.figure(figsize=(5, 2.5))
plt.plot(history.losses)
plt.ylabel('error')
plt.xlabel('iteration')
plt.title('training error')
plt.show()

After training, the neuron's weights are unchanged from their initial values. This is known as the dying ReLU problem. If the initial weights map all our sample points to values smaller than 0, the ReLU maps every one of them to 0. Small changes in the weights still give 0, so the gradient is 0 and the weights never get updated.
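
The zero gradient can be checked by hand. The sketch below computes the mean-squared-error gradients for one ReLU neuron in plain NumPy; the function `gradients` and both sets of starting weights are hypothetical and chosen only to contrast a dead start with a live one.

```python
import numpy as np

x = np.linspace(0, 2, 200)
y = 2 * np.maximum(0, x - x[100])    # matches the notebook's data up to rounding

def gradients(w0, w1):
    """MSE gradients with respect to w0 and w1 for one ReLU neuron."""
    z = w0 * x + w1
    err = np.maximum(0, z) - y
    active = (z > 0).astype(float)   # ReLU derivative; 0 wherever z <= 0
    return np.mean(2 * err * active * x), np.mean(2 * err * active)

# Hypothetical "dead" start: w0 * x + w1 <= 0 for every x in [0, 2].
dead = gradients(w0=-0.5, w1=0.0)
# Hypothetical "live" start: the neuron is active on part of the data.
live = gradients(w0=0.5, w1=0.0)
print(dead)
print(live)
```

In the dead case both gradients are zero, so a gradient descent step leaves the weights exactly where they started, no matter how many epochs we run.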