# XOR Trajectory when learning

_Aside: Should do [Working efficiently with jupyter lab](https://florianwilhelm.info/2018/11/working_efficiently_with_jupyter_lab/)_

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib widget
#%matplotlib inline

In [2]:
import numpy as np
import matplotlib.pyplot as plt

Fetch our tools:

In [3]:
from nn import Network, Layer, IdentityLayer, AffineLayer, MapLayer
from nnbench import NNBench
from nnvis import NNVis

___

## Setup
Build our `xor` net:

In [4]:
net = Network()
net.extend(AffineLayer(2,2))
net.extend(MapLayer(np.tanh, lambda d: 1.0 - np.tanh(d)**2))
net.extend(AffineLayer(2,1))
net.extend(MapLayer(np.tanh, lambda d: 1.0 - np.tanh(d)**2))
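
The second argument to `MapLayer` appears to be the derivative of the activation (an assumption from how it is used here; `nn.py` is not shown). A quick finite-difference check with plain NumPy, independent of the `nn` module, confirms that `1 - tanh(x)**2` is the derivative of `tanh`:

```python
import numpy as np

# The derivative expression passed to MapLayer in the cell above.
dtanh = lambda d: 1.0 - np.tanh(d)**2

# Central finite difference of tanh at a few sample points.
x = np.linspace(-3.0, 3.0, 13)
h = 1e-6
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)

# Largest disagreement between the numeric and analytic derivatives.
max_err = np.max(np.abs(numeric - dtanh(x)))
print(max_err)  # expected to be tiny, limited by floating-point rounding
```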

Make a test bench and a visualizer:

In [5]:
bench = NNBench(net)
vis = NNVis(bench)

Prepare fixed training data for the learning process _[improve]_:

In [6]:
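# XOR in a +/-1 encoding: (input pair, target) for each corner of the square.
dat = [
    (np.array([-1, -1]), np.array([-1])),
    (np.array([-1, 1]), np.array([1])),
    (np.array([1, 1]), np.array([-1])),
    (np.array([1, -1]), np.array([1])),
]
dc = 0   # offset added to both inputs and targets
amp = 1  # peak-to-peak amplitude, so values become +/- amp/2
temp = [(d[0]*amp/2 + dc, d[1]*amp/2 + dc) for d in dat]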

bench.training_data = ((np.array([v[0] for v in temp]),
                        np.array([v[1] for v in temp])),)
bench.training_data_gen = bench.training_data_gen_fixed
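
As a cross-check on the table above, XOR in a ±1 encoding is the negated product of the two inputs. The sketch below rebuilds the same scaled inputs and targets with plain NumPy; it does not touch `bench`, and the names `inputs` and `targets` are illustrative only.

```python
import numpy as np

# The four corner inputs, in the same order as `dat` above.
x = np.array([[-1, -1], [-1, 1], [1, 1], [1, -1]])

# In the +/-1 encoding, XOR is the negated product of the two inputs.
y = -(x[:, 0] * x[:, 1]).reshape(-1, 1)

# Same scaling as the cell above: gain of amp/2 plus a dc offset.
dc, amp = 0, 1
inputs, targets = x * amp / 2 + dc, y * amp / 2 + dc
print(inputs.shape, targets.shape)
print(targets.ravel())
```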

Set the state to an ordinary example starting point, for consistent notebook behavior below. We also make it the checkpoint in the bench.

In [7]:
net.set_state_from_vector(np.array([-0.88681521, -1.28596788,  0.3248974 , -2.33838503,  0.34761944,
       -0.94541789,  1.99448043,  0.38704839, -3.8844268 ]))
bench.checkpoint_net()
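
The state vector has nine entries. That matches a count of weights plus biases for the two affine layers, assuming each `AffineLayer(n, m)` holds an `m × n` weight matrix and `m` biases (an assumption; the storage layout inside `nn.py` is not shown here). The helper below is hypothetical and only does the arithmetic:

```python
# Hypothetical helper: parameter count of an affine map from n_in inputs to n_out outputs.
def affine_param_count(n_in, n_out):
    return n_in * n_out + n_out  # weights plus biases

# AffineLayer(2,2) followed by AffineLayer(2,1); the tanh MapLayers carry no state.
total = affine_param_count(2, 2) + affine_param_count(2, 1)
print(total)  # 9
```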

## The track that learning takes
Let us examine the trajectory in state space during learning, and the loss function.
Each learning iteration changes the net state. We can examine those deltas.
Questions:
1. Are there regimes of direction-of-change (DoC) in state space, or does the DoC wander chaotically?
1. What are the spectral characteristics of the DoC? Length characteristics?
1. How do the DoC characteristics relate to the loss function, and its first difference?
1. How do these trajectories vary with learning rate? Are there clues in these to adapt the learning rate?
1. How do the trajectory characteristics vary across different starting nets?
1. How do these measures vary with the objective function of the learning process, that is, what you're trying to teach the net?
1. How do the different layers that hold learned state evolve? Do they settle at different times? How does a change in an upstream layer, as a consequence of learning, affect downstream layers? Do downstream changes affect upstream layers?
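
Question 4 can be explored on a toy problem before touching the net. The sketch below is not from the notebook's `nn`/`nnbench` code: it runs plain gradient descent on a hypothetical two-parameter quadratic bowl and computes the cosine between successive steps. With a small learning rate the steps keep their direction (cosine positive); with a rate large enough to overshoot along the steep axis, successive steps reverse (cosine negative). This suggests, without proving anything about the XOR net, that the step cosine might be a usable signal for adapting the rate.

```python
import numpy as np

def descend(eta, steps=20, k=np.array([1.0, 4.0]), x0=np.array([1.0, 1.0])):
    """Gradient descent on f(x) = 0.5 * sum(k * x**2); returns the visited states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * k * xs[-1])  # the gradient of f is k * x
    return np.array(xs)

def step_cosines(track):
    """Cosine of the angle between each step and the step that follows it."""
    d = np.diff(track, axis=0)
    num = np.einsum('...i,...i', d[:-1], d[1:])
    den = np.linalg.norm(d[:-1], axis=1) * np.linalg.norm(d[1:], axis=1)
    return num / den

slow = step_cosines(descend(eta=0.1))   # no overshoot on either axis
fast = step_cosines(descend(eta=0.45))  # overshoots the steep axis (eta * 4 > 1)
print(slow.min(), slow.max())
print(fast.min(), fast.max())
```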

In [16]:
bench.rollback_net()
#bench.net.eta = 0.25
bench.net.eta = 0.4101
learned_track = bench.learn_track(5000)

Find the angles between trajectory steps, from
$$
\begin{aligned}
\mathbf{a} \cdot \mathbf{b} &= \left\|\mathbf{a}\right\| \left\|\mathbf{b}\right\| \cos\theta \\
\cos\theta &= \frac{\mathbf{a} \cdot \mathbf{b}}{\left\|\mathbf{a}\right\| \left\|\mathbf{b}\right\|}
\end{aligned}
$$
where $\mathbf{a}$ and $\mathbf{b}$ are a state-space trajectory step and the succeeding step, respectively.
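
To make the formula concrete, here is a minimal sketch on a made-up 2-D trajectory (not data from the net): two steps in the same direction, a right-angle turn, then a reversal. The cosines come out as 1, 0 and -1 respectively.

```python
import numpy as np

# Toy trajectory in a 2-D state space: straight, right-angle turn, then reversal.
states = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [2.0, 0.0]])
steps = np.diff(states, axis=0)  # consecutive rows play the roles of a and b

dots = np.sum(steps[:-1] * steps[1:], axis=1)  # a . b for each consecutive pair
norms = np.linalg.norm(steps, axis=1)          # ||step|| for every step
cos_theta = dots / (norms[:-1] * norms[1:])
print(cos_theta)  # [ 1.  0. -1.]
```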

In [17]:
traja = bench.analyze_learning_track(learned_track)

In [24]:
# Development space for plotting:
if False:
    fig, ax = plt.subplots()  # Create a figure and an axes.
    traj_color = 'xkcd:red'
    loss_color = 'xkcd:blue'
    cos_color = 'xkcd:green'
    ax.set_xlabel('$n$')  # Add an x-label to the axes.
    ax.set_ylabel(r'$|\Delta state|$', color=traj_color)
    ax.tick_params(axis='y', labelcolor=traj_color)
    ax.set_title(rf"$\eta={bench.net.eta}$")  # Add a title to the axes.
    ax.set_xscale('log')
    ax.set_yscale('log')
    tnl, = ax.plot(traja.traj_L2, label="traj norm", color=traj_color)  # Plot some data on the axes.
    ax2 = ax.twinx()  # instantiate a second axes that shares the same x-axis
    ax2.tick_params(axis='y', labelcolor=loss_color)
    dll, = ax2.plot(traja.loss_steps, label=r"$\Delta loss$", color=loss_color)  # Plot some data on the axes.
    #dll, = ax2.plot(np.tanh(10*traja.loss_steps), label=r"$\tanh(\Delta loss)$", color=loss_color)  # Plot some data on the axes.
    cosl, = ax2.plot(traja.traj_cos, label=r"$\Delta state cosine$", color=cos_color)
    # Raw strings keep the LaTeX backslashes from being read as Python escapes.
    ax.legend([tnl, dll, cosl], [r"$\|\Delta state \|$", r"$\Delta loss$", r"$cos(\theta)\Delta$"])  # Add a legend.
    #ax.legend([tnl, dll, cosl], [r"$\|\Delta state \|$", r"$\tanh(\Delta loss)$", r"$cos(\theta)\Delta$"])  # Add a legend.
    #ax2.legend()  # Add a legend.
    fig.tight_layout()  # otherwise the right y-label is slightly clipped
    plt.show()
else:
    vis.plot_trajectory(traja)


## Conclusions
We asked:
1. Are there regimes of direction-of-change (DoC) in state space, or does the DoC wander chaotically? \
    _**Answer:** Regimes exist. Often the initial direction is maintained (cos near 1) up to some region where the
    direction becomes chaotic and the loss improves; the trajectory then falls into back-and-forth oscillations
    (cos near 0) that perfect the loss._
1. What are the spectral characteristics of the DoC? Length characteristics?
1. How do the DoC characteristics relate to the loss function, and its first difference?
1. How do these trajectories vary with learning rate? Are there clues in these to adapt the learning rate?
1. How do the trajectory characteristics vary across different starting nets?
1. How do these measures vary with the objective function of the learning process, that is, what you're trying to teach the net?
1. How do the different layers that hold learned state evolve? Do they settle at different times? How does a change in an upstream layer, as a consequence of learning, affect downstream layers? Do downstream changes affect upstream layers?

---

# Scratch

In [11]:
assert False, "Stop notebook execution here if entering from above"

AssertionError: Stop notebook execution here if entering from above

In [None]:
bench.randomize_net()
t = bench.learn(1000)

In [None]:
bench.net.state_vector()

In [None]:
net = bench.net
net([[-.5, -.5], [-.5, .5]])

In [None]:
net.layers

In [None]:
from nnbench import Thing

In [None]:
t = Thing(color='brown', weight=7)

In [None]:
t.cow = 'moo'

In [None]:
t.cow

In [None]:
t.color

---

In [None]:
# Boneyard

In [None]:
assert False, "Stop notebook execution, the rest is scrap"

Wrangle the state-space trajectory and the losses into form.

In [None]:
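# Assumes `learned_track` from `bench.learn_track` above, as (state, loss) pairs.
trajectory = np.vstack([v[0] for v in learned_track])
losses = np.vstack([v[1] for v in learned_track])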

Take first differences, which represent the changes at each step.

In [None]:
traj_steps = np.diff(trajectory, axis=0)
loss_steps = np.diff(losses, axis=0)
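
For reference, `np.diff(..., axis=0)` subtracts each row from the one after it, so the result has one fewer row than the input. A small sketch on a made-up three-state track (the values are illustrative, not taken from the net):

```python
import numpy as np

# Hypothetical track: three states of a 2-parameter net, one row per iteration.
toy_trajectory = np.array([[0.0, 0.0], [0.5, 1.0], [0.75, 1.5]])

# Row n+1 minus row n: the change in state made by each learning iteration.
toy_steps = np.diff(toy_trajectory, axis=0)
print(toy_steps.shape)  # (2, 2)
print(toy_steps)
```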

In [None]:
traj_steps[:5]

Find the L2 norm of the trajectory steps $\lVert traj \rVert$:

In [None]:
traj_L2 = np.sqrt(np.einsum('...i,...i', traj_steps, traj_steps))
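
The `einsum` expression above sums the squared components of each row. As a sanity check on random stand-in data (not the real trajectory), it agrees with `np.linalg.norm` taken along the last axis:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_steps = rng.normal(size=(5, 9))  # hypothetical: 5 steps of a 9-parameter state

# Row-wise dot product of each step with itself, then the square root.
via_einsum = np.sqrt(np.einsum('...i,...i', toy_steps, toy_steps))
# The same quantity from the library norm.
via_norm = np.linalg.norm(toy_steps, axis=-1)
print(np.allclose(via_einsum, via_norm))
```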

In [None]:
len(traj_L2), traj_L2[:5], traj_L2[-5:]

Find the angles between trajectory steps, from
$$
\begin{aligned}
\mathbf{a} \cdot \mathbf{b} &= \left\|\mathbf{a}\right\| \left\|\mathbf{b}\right\| \cos\theta \\
\cos\theta &= \frac{\mathbf{a} \cdot \mathbf{b}}{\left\|\mathbf{a}\right\| \left\|\mathbf{b}\right\|}
\end{aligned}
$$
where $\mathbf{a}$ and $\mathbf{b}$ are a state-space trajectory step and the succeeding step, respectively.

Find $\mathbf {a} \cdot \mathbf {b}$:

In [None]:
trajn_dot_nplus1 = np.einsum('...i,...i', traj_steps[:-1], traj_steps[1:])
trajn_dot_nplus1[:5], np.any(trajn_dot_nplus1 < 0)

Find $\left\|\mathbf {a} \right\|\left\|\mathbf {b} \right\|$:

In [None]:
traj_cos_denom = np.multiply(traj_L2[:-1], traj_L2[1:])

This will be the divisor. Some entries may be zero, so we check for them and guard the division.

In [None]:
len(traj_L2) - np.count_nonzero(traj_L2)

In [None]:
np.equal(traj_L2, 0)

Find $\cos \theta$ by dividing, excluding division by zero:

In [None]:
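# Pre-fill the output with zeros: with `where=`, skipped entries would otherwise be left uninitialized.
traj_cos = np.divide(trajn_dot_nplus1, traj_cos_denom, out=np.zeros_like(traj_cos_denom), where=traj_cos_denom!=0.0)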
traj_cos[:5], traj_cos[-5:], min(traj_cos), max(traj_cos)
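
One caveat with `np.divide(..., where=...)`: only the selected entries are written, and the rest keep whatever the output array already held, which is uninitialized memory unless `out=` is supplied. A minimal sketch of the guarded division on toy numbers, choosing 0 as the fill value for zero-length steps (that choice is an assumption, not something the notebook specifies):

```python
import numpy as np

num = np.array([1.0, 2.0, 3.0])
den = np.array([2.0, 0.0, 4.0])

# Pre-fill the output so entries skipped by `where` have a defined value (0 here).
safe = np.divide(num, den, out=np.zeros_like(num), where=den != 0.0)
print(safe)  # the middle entry stays 0 because its denominator is 0
```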

In [None]:
#traj_theta = np.arccos(traj_cos)
#traj_theta[:5], traj_theta[-5:]

In [None]:
net = Network()
net.extend(AffineLayer(2,2))
#leak = 0
#net.extend(MapLayer(lambda x: (x*(1+leak/2)+abs(x)*(1-leak/2))/2, lambda d: [leak,1][1 if d>0 else 0]))
#net.extend(MapLayer(lambda x: max(0, np.sign(x)) * x, lambda d: max(0, np.sign(d))))
net.extend(MapLayer(np.tanh, lambda d: 1.0 - np.tanh(d)**2))
net.extend(AffineLayer(2,1))
net.extend(MapLayer(np.tanh, lambda d: 1.0 - np.tanh(d)**2))
#sigmoid = lambda x: 1/(np.exp(x)+1)
#net.extend(MapLayer(sigmoid, lambda d: sigmoid(d)*(1-sigmoid(d))))
#net.extend(MapLayer(lambda x: max(0, np.sign(x)) * x, lambda d: max(0, np.sign(d))))