# When keeping it ReLU goes wrong

ReLU activations are often recommended for regression-based neural network tasks.
In certain scenarios, even very simple ones as we will see below, ReLU activations
can actually impede the learning process of a neural net. 

Here, we construct a simple feed-forward network and ask it to learn the identity map on the unit interval.
That is, the network is expected to approximate the function $f(x) = x$ for $x \in (0,1)$. 
To do this we design a network with 1 input node, 1 output node, and a single hidden layer with 2 nodes.
This results in 4 weight and 3 bias parameters, where the first weight matrix is a column vector and the second is a row vector, each of length 2.

That is, for input $x$, let $W_1 \in \mathbb{R}^{2\times 1}$ and $W_2 \in \mathbb{R}^{1\times 2}$ be the first and second weight matrices with corresponding bias vectors $b_1 \in \mathbb{R}^2$ and $b_2 \in \mathbb{R}$. Then the network is the map $(0,1) \to \mathbb{R}^2 \to \mathbb{R}$ defined by

$$
    x \mapsto \text{relu}(W_1 x + b_1) \mapsto \text{relu}(W_2 \text{relu}(W_1 x + b_1) + b_2)
$$

We will see that the network defined above will be insufficient for the assigned task, and that in this case it is better to replace ReLU with linear activations. 
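
To see why linear activations are enough for this task, it helps to expand the network without the nonlinearity. The following short derivation uses only the definitions above; composing the two layers collapses them into a single affine map:

$$
    W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2)
$$

The identity map is recovered exactly whenever $W_2 W_1 = 1$ and $W_2 b_1 + b_2 = 0$, so the linear network only has to learn one slope and one offset.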

In [1]:
import numpy as np
import tensorflow as tf 
from sklearn.model_selection import train_test_split

In [16]:
# simulate data
n_samples = 2000
n_features = 1
np.random.seed(23)
y = np.random.uniform(size=n_samples)
# the input equals the target, shaped (n_samples, n_features)
x = y.reshape(n_samples, n_features).copy()

# split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=23)

# construct the network
def make_network(activations='relu'):
    model = tf.keras.Sequential()
    # connect input node to hidden layer
    model.add(tf.keras.layers.Dense(
        2,
        activation=activations,
        input_shape=(n_features,),
        bias_initializer=tf.keras.initializers.Zeros(),
    ))
    # connect hidden layer to output node
    model.add(tf.keras.layers.Dense(
        1,
        activation=activations,
        bias_initializer=tf.keras.initializers.Zeros(),
    ))
    # extract the initial weights
    initial_weights = [w.numpy() for w in model.weights]
    
    return model, initial_weights

In [17]:
tf.random.set_seed(23)
relu_model, relu_init = make_network('relu')

tf.random.set_seed(23)
linear_model, linear_init = make_network('linear')

# print weights
print('Initial ReLU model weights:')
print('  W_1 =', relu_init[0][0])
print('  b_1 =', relu_init[1])
print('  W_2 =', relu_init[2][:,0])
print('  b_2 =', relu_init[3])
print('Initial linear model weights:')
print('  W_1 =', linear_init[0][0])
print('  b_1 =', linear_init[1])
print('  W_2 =', linear_init[2][:,0])
print('  b_2 =', linear_init[3])

Initial ReLU model weights:
  W_1 = [-0.16988075 -1.1390355 ]
  b_1 = [0. 0.]
  W_2 = [ 1.2553526  -0.45677203]
  b_2 = [0.]
Initial linear model weights:
  W_1 = [-0.16988075 -1.1390355 ]
  b_1 = [0. 0.]
  W_2 = [ 1.2553526  -0.45677203]
  b_2 = [0.]


As we can see, both models have the same initial state; the only difference is their activations.
Let's see how they learn.

In [18]:
# train network with ReLU activations
relu_model.compile(optimizer='adam', loss='mae')
relu_history = relu_model.fit(x_train, y_train, epochs=10, batch_size=25)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [19]:
# train network with linear activations
linear_model.compile(optimizer='adam', loss='mae')
linear_history = linear_model.fit(x_train, y_train, epochs=10, batch_size=25)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


So what's going on here? Well, we can actually compute the activations of each network using the initial weights.

In [38]:
def relu(x):
    # return a new array instead of overwriting the caller's input in place
    return np.maximum(x, 0.0)

def compute_activations(x, weights, activations):
    assert activations in ['relu', 'linear']
    # get weights
    W1 = weights[0]
    b1 = weights[1]
    W2 = weights[2]
    b2 = weights[3]
    
    hidden = W1.dot(x) + b1
    
    if activations == 'relu':
        hidden = relu(hidden)
        final = relu(hidden.dot(W2) + b2)
    else:
        final = hidden.dot(W2) + b2
        
    return hidden, final

In [40]:
relu_hidden, relu_final = compute_activations(0.5, relu_init, 'relu')
linear_hidden, linear_final = compute_activations(0.5, linear_init, 'linear')

print('ReLU model:')
print('  hidden layer activations:', relu_hidden[0])
print('  final layer activations:', relu_final[0])
print('Linear model:')
print('  hidden layer activations:', linear_hidden[0])
print('  final layer activations:', linear_final[0])

ReLU model:
  hidden layer activations: [0. 0.]
  final layer activations: [0.]
Linear model:
  hidden layer activations: [-0.08494037 -0.56951773]
  final layer activations: [0.15350965]


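Since the initial weights for the input, $W_1$, are all negative, the biases start at zero, and every input $x$ is positive,
every hidden pre-activation is negative and ReLU maps it to zero. With no activations
in the hidden layer, the value to be activated in the final layer is $b_2 = 0$, hence no final layer activations.
The derivative of ReLU is also zero for negative inputs, so no gradient reaches $W_1$ or $b_1$, and the gradient
with respect to $W_2$ is proportional to the (zero) hidden activations. Training therefore cannot revive the hidden units.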
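
The dead-unit argument above can be checked by hand. The sketch below is a hand-written NumPy backward pass for the 1-2-1 ReLU network under MAE loss, not TensorFlow's own implementation. It assumes the common convention that the ReLU derivative is zero at and below zero, and the helper name `relu_network_gradients` and the sample inputs `x_demo` are hypothetical.

```python
import numpy as np

# Initial weights copied from the printed output above (Keras kernel layout)
W1 = np.array([[-0.16988075, -1.1390355]])    # shape (1, 2)
b1 = np.zeros(2)
W2 = np.array([[1.2553526], [-0.45677203]])   # shape (2, 1)
b2 = np.zeros(1)

def relu_network_gradients(x, y):
    """Manual MAE gradients, assuming relu'(z) = 1 if z > 0 else 0."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    n = x.shape[0]

    # forward pass
    z1 = x @ W1 + b1              # hidden pre-activations, shape (n, 2)
    h = np.maximum(z1, 0.0)       # hidden activations
    z2 = h @ W2 + b2              # output pre-activation, shape (n, 1)
    out = np.maximum(z2, 0.0)     # network output

    # backward pass
    d_out = np.sign(out - y) / n  # derivative of mean absolute error
    d_z2 = d_out * (z2 > 0)       # gate by the output ReLU
    d_h = d_z2 @ W2.T
    d_z1 = d_h * (z1 > 0)         # gate by the hidden ReLU

    return {
        'W1': x.T @ d_z1,
        'b1': d_z1.sum(axis=0),
        'W2': h.T @ d_z2,
        'b2': d_z2.sum(axis=0),
    }

# a few positive inputs on the unit interval; targets equal the inputs
x_demo = np.linspace(0.05, 0.95, 10)
grads = relu_network_gradients(x_demo, x_demo)
for name, g in grads.items():
    print(name, 'max |gradient| =', np.abs(g).max())
```

Under these assumptions every gradient is exactly zero, so an optimizer that relies only on these gradients has nothing to update.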

On the other hand, with linear activations and zero biases the final activation value is simply $x\, W_2 W_1$, where $W_2 W_1$ is the inner product of the two weight vectors.
Hence we can write the final activation as 

$$
    x |W_1| |W_2| \cos\theta
$$

where $\theta$ is the angle between the vectors $W_1$ and $W_2$.
This observation also sheds light on the ReLU case: even if all hidden units are activated, the final layer
is activated only when $\theta$ is acute, that is, when $\cos\theta > 0$.
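
As a quick numerical check of the inner-product formula, the sketch below plugs in the initial weights printed earlier and the same input $x = 0.5$ used above. The variable names (`w1`, `w2`, `x_demo`) are chosen here for illustration, and the biases are assumed to be zero as at initialization.

```python
import numpy as np

# Initial weights copied from the printed output above, as flat vectors
w1 = np.array([-0.16988075, -1.1390355])
w2 = np.array([1.2553526, -0.45677203])
x_demo = 0.5

# angle between the two weight vectors
cos_theta = w1.dot(w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
theta_deg = np.degrees(np.arccos(cos_theta))

# the same quantity computed two ways
direct = x_demo * w1.dot(w2)
via_angle = x_demo * np.linalg.norm(w1) * np.linalg.norm(w2) * cos_theta

print('theta (degrees):', theta_deg)
print('x * (W_2 . W_1):', direct)
print('x |W_1| |W_2| cos(theta):', via_angle)
```

Both expressions agree with the final layer activation reported for the linear model, and $\theta$ is acute for these weights, consistent with the positive output.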