# Lecture 04: Introduction to Deep Learning and TensorFlow

TensorFlow is Google's open source Python library for deep learning. Code for TensorFlow generally has two parts:

1. **The Construction Phase:** This code generally defines abstract variables and how they are employed to compute the desired quantities. This part of the code is **declarative**. It specifies the structure of **what** is to be computed, but does not necessarily specify **how** it should be computed
2. **The Execution Phase:** This code executes the computation using actual numbers. A TensorFlow **session** is generated and it is used to initialize variables and evaluate results.

In [1]:
import tensorflow as tf

# This code defines a "computational graph" -- this is the "construction phase"

x = tf.Variable(2, name='x') # Initialize a variable x with value 5
y = tf.Variable(5, name='y') # Initialize a variable y with value 3
f = x*x*y+5 # Create a function of variables x and y

# This code executes the computational graph -- this is the "execution phase"

sess = tf.Session() # Create the tensorflow session
sess.run(x.initializer) # Initialize x
sess.run(y.initializer) # Initialize y
result = sess.run(f) # Evaluate f
print(result)
sess.close() # Close the tensorflow session

25


In [2]:
# Alternatively, we can use a with block:

with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()
    
print(result)

25


In [3]:
# And we can initialize all variables in one command as well

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    result = f.eval()
    
print(result)

25


In [4]:
# Inside Jupyter, we can use an interactive session 

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
result = f.eval()
print(result)
sess.close()

25


In [5]:
# Let's look at the computation graph!

import numpy as np

# Code taken from yaroslavv@github

# make things wide
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

from IPython.display import clear_output, Image, display, HTML

def strip_consts(graph_def, max_const_size=32):
    """Strip large constant values from graph_def."""
    strip_def = tf.GraphDef()
    for n0 in graph_def.node:
        n = strip_def.node.add() 
        n.MergeFrom(n0)
        if n.op == 'Const':
            tensor = n.attr['value'].tensor
            size = len(tensor.tensor_content)
            if size > max_const_size:
                tensor.tensor_content = "<stripped %d bytes>"%size
    return strip_def

def show_graph(graph_def=None, width=1200, height=800, max_const_size=32, ungroup_gradients=False):
    if not graph_def:
        graph_def = tf.get_default_graph().as_graph_def()
        
    """Visualize TensorFlow graph."""
    if hasattr(graph_def, 'as_graph_def'):
        graph_def = graph_def.as_graph_def()
    strip_def = strip_consts(graph_def, max_const_size=max_const_size)
    data = str(strip_def)
    if ungroup_gradients:
        data = data.replace('"gradients/', '"b_')
        #print(data)
    code = """
        <script>
          function load() {{
            document.getElementById("{id}").pbtxt = {data};
          }}
        </script>
        <link rel="import" href="https://tensorboard.appspot.com/tf-graph-basic.build.html" onload=load()>
        <div style="height:600px">
          <tf-graph-basic id="{id}"></tf-graph-basic>
        </div>
    """.format(data=repr(data), id='graph'+str(np.random.rand()))

    iframe = """
        <iframe seamless style="width:{}px;height:{}px;border:0" srcdoc="{}"></iframe>
    """.format(width, height, code.replace('"', '&quot;'))
    display(HTML(iframe))
    
show_graph()

## Logisitic Regression Revisited

We now set up logistic regression in a way that will eventually generalize when we want to train deep neural networks. For this setup, we will consider labeled data of the form $\{({\bf x}^{(i)}, {\bf y}^{(i)})\}_{i=1}^N$ where ${\bf x}^{(i)}\in\mathbb{R}^d$ and 

$$
{\bf y}^{(i)}\in\left\{\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}\right\}
$$

for all $i=1,\ldots, N$. Notice that $y^{(i)}\in\{-1,+1\}$ has been replaced with a **one-hot** vector representation of a class label. That is, each of the two classes is represented by a vector in $\mathbb{R}^2$ with exactly one entry equal to $1$ (the "hot" entry) and all other entries equal to $0$. The class label $-1$ is replaced with the vector

$$
\begin{pmatrix}1\\0\end{pmatrix}
$$

and the class label $+1$ is replaced with the vector

$$
\begin{pmatrix}0\\1\end{pmatrix}.
$$

Formerly, logistic regression attempted to fit $0$'s and $1$'s (which we changed to $-1$'s and $+1$'s for our own convenience). Now, we want to fit vectors. Fitting one-hot vectors with continuous functions can be done using the **softmax** function:

$$
\text{softmax}({\bf z}) = \begin{pmatrix}
\frac{e^{z_1}}{e^{z_1}+e^{z_2}}\\
\frac{e^{z_2}}{e^{z_1}+e^{z_2}}\\
\end{pmatrix}=\begin{pmatrix}
\frac{e^{-(z_2-z_1)}}{1+e^{-(z_2-z_1)}}\\ \frac{1}{1+e^{-(z_2-z_1)}}
\end{pmatrix} = \begin{pmatrix}
\text{logit}(-(z_2-z_1))\\ \text{logit}(z_2-z_1)
\end{pmatrix}.
$$

If we define a **weight matrix** $W\in M_{2, d}$ and a **bias vector** ${\bf b}\in \mathbb{R}^2$, we have that

$$
\text{softmax}\left(W{\bf x}+{\bf b}\right)=\begin{pmatrix}
\frac{\exp\left(\left({\bf w}^{(0)}\right)^T{\bf x}+b_0\right)}{\exp\left(\left({\bf w}^{(0)}\right)^T{\bf x}+b_0\right)+\exp\left(\left({\bf w}^{(1)}\right)^T{\bf x}+b_1\right)}\\ \frac{\exp\left(\left({\bf w}^{(1)}\right)^T{\bf x}+b_1\right)}{\exp\left(\left({\bf w}^{(0)}\right)^T{\bf x}+b_0\right)+\exp\left(\left({\bf w}^{(1)}\right)^T{\bf x}+b_1\right)}
\end{pmatrix}
$$

where
$$
W = \begin{pmatrix}
\left({\bf w}^{(0)}\right)^T\\
\left({\bf w}^{(1)}\right)^T
\end{pmatrix}
$$

holds the **weight vectors** for the classes $j=0,1$. Setting $z_1=\left({\bf w}^{(0)}\right)^T{\bf x}+b_0$ and $z_2=\left({\bf w}^{(1)}\right)^T{\bf x}+b_1$,

$$
z_2-z_1 = \left(\left({\bf w}^{(1)}\right)^T{\bf x}+b_1\right)-\left(\left({\bf w}^{(0)}\right)^T{\bf x}+b_0\right)=\left({\bf w}^{(1)}-{\bf w}^{(0)}\right)^T{\bf x} +\left(b_1-b_0\right) = {\bf w}^T{\bf x}+b
$$

where ${\bf w}={\bf w}^{(1)}-{\bf w}^{(0)}$ and $b=b_1-b_0$. This shows that, in the case of **binary classification** the softmax function encodes the probabilities of the different outcomes under a logisitic model using $\beta={\bf w}$ and $\beta_0=b$.

Previously, we used the maximum likelihood principle to derive an objective function for fitting logistic regression. It turns out we can generalize this objective function to the case of one-hot vectors using **cross entropy**:

$$
H({\bf p},{\bf q}) = -p_1\log q_1 -p_2\log q_2
$$

for ${\bf p},{\bf q}\in\mathbb{R}^2$. In particular, observe that

$$
H\left({\bf y}^{(i)},\text{softmax}(W{\bf x}^{(i)}+{\bf b})\right)=-y_0^{(i)}\log\text{logit}\left(-\left({\bf w}^T{\bf x}^{(i)}+b\right)\right) -y_1^{(i)}\log\text{logit}\left({\bf w}^T{\bf x}^{(i)}+b\right).
$$

We then see that the negative log-likelihood minimization coincides with minimizing

$$
\frac{1}{N} \sum_{i=1}^NH\left({\bf y}^{(i)},\text{softmax}(W{\bf x}^{(i)}+{\bf b})\right).
$$

In [6]:
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer() # Loads the Wisconsin Breast Cancer dataset (569 examples in 30 dimensions)

# Parameters for the data
dim_data = 30
num_labels = 2
num_examples = 569

# Parameters for training
learning_rate = 1e-6
num_train = 400

X = data['data'] # Data in rows
targets = data.target # 0-1 labels
labels = np.zeros((num_examples, num_labels))
for i in range(num_examples):
    labels[i,targets[i]]=1 # Conversion to one-hot representations

# Let's use TensorFlow to train logisitic regression 

x = tf.placeholder(tf.float32, shape=[None, dim_data])
y_ = tf.placeholder(tf.float32, shape=[None, num_labels])

W = tf.Variable(tf.zeros([dim_data, num_labels]))
b = tf.Variable(tf.zeros([num_labels]))

y_prime = tf.matmul(x, W) + b
y = tf.nn.softmax(y_prime)

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_prime))

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

train_accuracy = sess.run(accuracy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
train_cross_entropy = sess.run(cross_entropy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
print("Initial training accuracy %g, cross entropy %g" % (train_accuracy, train_cross_entropy))

W0 = np.zeros((dim_data, num_labels))

for i in range(10000):
    sess.run(train_step, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
    train_accuracy = sess.run(accuracy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
    train_cross_entropy = sess.run(cross_entropy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
    W1 = sess.run(W)
    if ((i % 1000) == 0):
        grads = sess.run(tf.gradients(cross_entropy, W), feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
        print("LR Step %d, training accuracy %g, cross entropy %g" % (i+1, train_accuracy, train_cross_entropy))
        print('Weight Residual: %g, Gradient SS: %g' % (np.sum((W0-W1)**2), np.sum(np.array(grads)**2)))
    W0=W1

logistic_test_accuracy = sess.run(accuracy, feed_dict={x: X[num_train:, :], y_: labels[num_train:, :]})
print("LR test accuracy: %g" % logistic_test_accuracy)
    
sess.close()

Initial training accuracy 0.4325, cross entropy 0.693147
LR Step 1, training accuracy 0.4325, cross entropy 0.661145
Weight Residual: 5.56539e-08, Gradient SS: 2416.43
LR Step 1001, training accuracy 0.9075, cross entropy 0.379833
Weight Residual: 9.25424e-11, Gradient SS: 92.4359
LR Step 2001, training accuracy 0.91, cross entropy 0.318759
Weight Residual: 4.06293e-11, Gradient SS: 40.5998
LR Step 3001, training accuracy 0.9125, cross entropy 0.288341
Weight Residual: 2.28051e-11, Gradient SS: 22.7941
LR Step 4001, training accuracy 0.9125, cross entropy 0.270037
Weight Residual: 1.47364e-11, Gradient SS: 14.7302
LR Step 5001, training accuracy 0.91, cross entropy 0.257651
Weight Residual: 1.04404e-11, Gradient SS: 10.4313
LR Step 6001, training accuracy 0.91, cross entropy 0.248608
Weight Residual: 7.84503e-12, Gradient SS: 7.84464
LR Step 7001, training accuracy 0.9125, cross entropy 0.241665
Weight Residual: 6.15131e-12, Gradient SS: 6.14417
LR Step 8001, training accuracy 0.91, cr

In [20]:
# This is another LR implementation that uses our derivation above

import tensorflow as tf
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer() # Loads the Wisconsin Breast Cancer dataset (569 examples in 30 dimensions)

# Parameters for the data
dim_data = 30
num_labels = 2
num_examples = 569

# Parameters for training
learning_rate = 1e-6
num_train = 400

X = data['data'] # Data in rows
X = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)
targets = data.target # 0-1 labels
labels = 2*targets - 1 # Converts 0-1 labels to -1 +1 labels

# Let's use TensorFlow to train logisitic regression 

x = tf.placeholder(tf.float32, shape=[None, dim_data+1])
y_ = tf.placeholder(tf.float32, shape=[None])

b = tf.Variable(tf.zeros([dim_data+1, 1]))

y = 1/tf.add(1.0,tf.exp(-tf.matmul(x,b)))

nll = tf.reduce_mean(tf.log(tf.add(1.0,tf.exp(-tf.multiply(y_ , tf.matmul(x, b))))))

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(nll)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

train_nll = sess.run(nll, feed_dict={x: X[:num_train, :], y_: labels[:num_train]})

for i in range(100):
    sess.run(train_step, feed_dict={x: X[:num_train, :], y_: labels[:num_train]})
    train_nll = sess.run(nll, feed_dict={x: X[:num_train, :], y_: labels[:num_train]})
    grad = np.array(sess.run(tf.gradients(nll, b), feed_dict={x: X[:num_train, :], y_: labels[:num_train]}))
    if ((i % 10) == 0):
        print("LR Step: %d, negative log-likelihood: %g, gradient norm: %g" % (i+1, train_nll, np.sqrt(np.sum(grad**2))))
            
test_logits = np.array(sess.run(y, feed_dict={x: X[num_train:, :]}))
test_labels = np.zeros(test_logits.size)
test_labels[test_labels>0.5] = 1
per_correct = 100 * (1 - np.count_nonzero(test_labels-targets[num_train:]))
print('Final test accuracy: %.1f percent' % per_correct)
sess.close()

LR Step: 1, negative log-likelihood: 0.688486, gradient norm: 43.2471
LR Step: 11, negative log-likelihood: 0.686288, gradient norm: 2.23525
LR Step: 21, negative log-likelihood: 0.686239, gradient norm: 2.17578
LR Step: 31, negative log-likelihood: 0.68618, gradient norm: 2.12342
LR Step: 41, negative log-likelihood: 0.686138, gradient norm: 2.07274
LR Step: 51, negative log-likelihood: 0.686118, gradient norm: 2.02369
LR Step: 61, negative log-likelihood: 0.686078, gradient norm: 1.97624
LR Step: 71, negative log-likelihood: 0.68601, gradient norm: 1.93034
LR Step: 81, negative log-likelihood: 0.685977, gradient norm: 1.88596
LR Step: 91, negative log-likelihood: 0.685963, gradient norm: 1.84305
Final test accuracy: -12900.0 percent


In the construction phase, we specified an **optimizer** (GradientDescentOptimizer) and passed it the cross entropy function for minimization. We also passed a **learning rate** parameter to the optimizer. This is equivalent to the step size for gradient descent with a fixed step size. 

Now, this performs gradient descent without backtracking, so we do not expect the value of the function to decrease monotonically. On the other hand, *we didn't have to compute any gradients by hand*. This is a feature of **automatic differentiation**, which automatically computes derivatives given a range of symbolic functions. Automatic differentiation gives us a way to quickly prototype applications of optimization, but it is often possible to make an implementation more efficient by computing gradients by hand. 

In [7]:
import matplotlib.pyplot as plt

def logit(x):
    return 1/(1+np.exp(-x))

def loglogit(x):
    return np.log(1+np.exp(-x))
    
def loglogitlikelihood(X,y):
    tildeX = np.hstack([np.ones((X.shape[0],1)), X])
    f = lambda b: np.mean(loglogit(np.multiply(y, tildeX@b)))
    df = lambda b: -tildeX.T @ np.multiply(y, logit(-np.multiply(y, tildeX@b))) / X.shape[0]
    d2f = lambda b: tildeX.T @ np.diag(logit(tildeX@b)*logit(-tildeX@b)) @ tildeX / X.shape[0]
    return f, df, d2f

def lr_accuracy(b, X, targets):
    tildeX = np.hstack([np.ones((X.shape[0],1)), X])
    y_guess = np.zeros(X.shape[0])
    y_guess[logit(tildeX@b) > 0.5] = 1
    return np.mean(np.equal(targets, y_guess))

def backtracking(x0, dx, f, df0, alpha=0.1, beta=0.5, verbose=False):
    '''
    Backtracking for general functions with illustrations
    :param x0: Previous point from backtracking, or initial guess
    :param dx: Incremental factor for updating x0
    :param f: Objective function
    :param df0: Gradient of f at x0
    :param alpha: Sloping factor of stopping criterion
    :param beta: "Agressiveness" parameter for backtracking steps
    :param verbose: Boolean for providing plots and data
    :return: x1, the next iterate in backtracking
    '''

    # Note that the definition below requires that dx and df0 have the same shape
    delta = alpha * np.sum(dx * df0) # A general, but memory intensive inner product
    
    t = 1 # Initialize t=beta^0
    f0 = f(x0) # Evaluate for future use
    x = x0 + dx # Initialize x_{0, inner}
    fx = f(x)
    
    if verbose:
        n=0
        xs = [x]
        fs = [fx]
        ts = [1] * 3
    
    while (not np.isfinite(fx)) or f0 + delta * t < fx:
        t = beta * t
        x = x0 + t * dx
        fx = f(x)
    ###################################### 
    
        if verbose:
            n += 1
            xs.append(x)
            fs.append(fx)
            ts.append(t)
            ts.pop(0)
            
    if verbose:
        # Display the function along the line search direction as a function of t
        s = np.linspace(-0.1*ts[-1], 1.1*ts[0], 100)
        xi = [0, 1.1*ts[0]]
        fxi = [f0, f0 + 1.1*ts[0]*delta]   
        y = np.zeros(len(s))
        
        for i in range(len(s)):
            y[i] = f(x0 + s[i]*dx) # Slow for vectorized functions

        plt.figure('Backtracking illustration')
        arm, =plt.plot(xi, fxi, '--', label='Armijo Criterion')
        fcn, =plt.plot(s, y, label='Objective Function')
        plt.plot([s[0], s[-1]], [0, 0], 'k--')
        pts =plt.scatter(ts, [0 for p in ts], label='Backtracking points for n=%d, %d, %d' % (n, n+1, n+2))
        plt.scatter(ts, [f(x0 + q*dx) for q in ts] , label='Backtracking values for n=%d, %d, %d' % (n, n+1, n+2))
        init =plt.scatter([0], [f0], color='black', label='Initial point')
        plt.xlabel('$t$')
        plt.ylabel('$f(x^{(k)}+t\Delta x^{(k+1)})$')
        plt.legend(handles=[arm, fcn, pts, init])
        plt.show()
        
        return x, xs, fs
    
    else:
        return x
    
y = 2*targets-1
f, df, d2f = loglogitlikelihood(X[:400,:], y[:400])
b = np.zeros(dim_data+1)
for i in range(60):
    if ((i % 10) == 0):
        acc = lr_accuracy(b, X[:400,:], targets[:400])
        val = f(b)
        print('LR Step %d: training_accuracy %g, cross entropy %g' % (i, acc, val))
        print('Gradient SS: %g' % np.sum(df(b)**2))
    b = backtracking(b, -np.linalg.solve(d2f(b), df(b)), f, df(b))
    
print("LR test accuracy: %g" % lr_accuracy(b, X[400:, :], targets[400:]))


LR Step 0: training_accuracy 0.4325, cross entropy 0.693147
Gradient SS: 27827
LR Step 10: training_accuracy 1, cross entropy 0.00242053
Gradient SS: 6.89282e-05
LR Step 20: training_accuracy 1, cross entropy 1.08748e-07
Gradient SS: 1.60579e-13
LR Step 30: training_accuracy 1, cross entropy 4.93713e-12
Gradient SS: 3.30979e-22
LR Step 40: training_accuracy 1, cross entropy 2.2482e-16
Gradient SS: 6.82198e-31




LR Step 50: training_accuracy 1, cross entropy 0
Gradient SS: 3.09717e-35
LR test accuracy: 0.928994


While Newton's method appears to minimize the cross entropy efficiently, we note that the two approaches both yield the same test accuracy. 

On the other hand, we can see the immediate advantage of automatic differentiation when we want to have **multiple layers** of activations. That is, we can consider what happens when we perform the mapping

$$
f_1({\bf x}; W^{(0)},{\bf b}^{(0)}) = \text{logit}\left(W^{(0)}{\bf x} + {\bf b}^{(0)}\right)
$$

and

$$
\varphi({\bf x}; W^{(0)}, W^{(1)}, {\bf b}^{(0)}, {\bf b}^{(1)}) = \text{softmax}\left(W^{(1)}f_1({\bf x}; W^{(0)}, {\bf b}^{(0)}) +{\bf b}^{(1)}\right).
$$

We can then use the cross entropy loss function to define

$$
f(W^{(0)}, W^{(1)}, {\bf b}^{(0)}, {\bf b}^{(1)}) = \frac{1}{N}\sum_{i=1}^N H\left(y^{(i)},\: \varphi\left({\bf x}^{(i)}; W^{(0)}, W^{(1)}, {\bf b}^{(0)}, {\bf b}^{(1)}\right)\right).
$$

Clearly, computing gradients and hardcoding optimization for this objective function is going to be difficult. On the other hand, here is TensorFlow code that does exactly this with relative ease.

In [10]:
# Let's use TensorFlow to train this shallow neural network

tf.reset_default_graph()

learning_rate = 1e-1
num_hidden_units = 1000



x = tf.placeholder(tf.float32, shape=[None, dim_data])
y_ = tf.placeholder(tf.float32, shape=[None, num_labels])

W0 = tf.Variable(tf.truncated_normal([dim_data, num_hidden_units], stddev=0.001), name='W_0')
b0 = tf.Variable(tf.zeros([num_hidden_units]), name='b_0')

W1 = tf.Variable(tf.truncated_normal([num_hidden_units, num_labels], stddev=0.001), name='W_1')
b1 = tf.Variable(tf.zeros([num_labels]), name='b_1')

y_prime = tf.matmul(tf.sigmoid(tf.matmul(x, W0) + b0), W1) + b1
y = tf.nn.softmax(y_prime)

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_prime))

#train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
train_step = tf.train.AdamOptimizer(1e-3).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

train_accuracy = sess.run(accuracy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
train_cross_entropy = sess.run(cross_entropy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
print("Initial training accuracy %g, cross entropy %g" % (train_accuracy, train_cross_entropy))

for i in range(5000):
    sess.run(train_step, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
    train_accuracy = sess.run(accuracy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
    train_cross_entropy = sess.run(cross_entropy, feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
    if ((i % 1000) == 0):
        grad0 = sess.run(tf.gradients(cross_entropy, W0), feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
        grad1 = sess.run(tf.gradients(cross_entropy, W1), feed_dict={x: X[:num_train, :], y_: labels[:num_train, :]})
        gss = np.sum(np.array(grad0)**2)+np.sum(np.array(grad1)**2)
        print("NN Step %d, training accuracy %g, cross entropy %g" % (i+1, train_accuracy, train_cross_entropy))
        print("Weight Gradient SS: %g" % gss)
    
nn_test_accuracy = sess.run(accuracy, feed_dict={x: X[num_train:, :], y_: labels[num_train:, :]})
print("NN test accuracy: %g" % nn_test_accuracy)
    
sess.close()

Initial training accuracy 0.4325, cross entropy 0.696698
NN Step 1, training accuracy 0.5675, cross entropy 0.705088
Weight Gradient SS: 28.7451
NN Step 1001, training accuracy 0.955, cross entropy 0.118233
Weight Gradient SS: 813.622
NN Step 2001, training accuracy 0.99, cross entropy 0.0386581
Weight Gradient SS: 52.3624
NN Step 3001, training accuracy 0.9875, cross entropy 0.0343358
Weight Gradient SS: 12.6499
NN Step 4001, training accuracy 0.9975, cross entropy 0.0150695
Weight Gradient SS: 0.333207
NN test accuracy: 0.95858


In [11]:
show_graph()

## Jacobians of Nested Functions

We have just performed black box optimization of parameters, and we would like to know what TensorFlow is doing under the hood. This leads us to some pretty interesting mathematical considerations.

Consider the functions $f_1:\mathbb{R}^{d_1}\times\mathbb{R}^n_1\rightarrow\mathbb{R}^{d_2}$ and $f_2:\mathbb{R}^{d_2}\times\mathbb{R}^{n_2}\rightarrow\mathbb{R}$ with $f_1({\bf x}_1,\Theta_1)$ and $f_2({\bf x}_2,\Theta_2)$, and set

$$
g({\bf x};\Theta_1,\Theta_2) = f_2(f_1({\bf x},\Theta_1),\Theta_2).
$$

The gradient of $g$ then has the block form

$$
\nabla_{\Theta_1,\Theta_2} g({\bf x};\Theta_1,\Theta_2)=\begin{pmatrix}
D_{\Theta_1}\: f_1({\bf x},\Theta_1)^T\nabla_{{\bf x}_2}\: f_2(f_1({\bf x},\Theta_1),\Theta_2)\\
\nabla_{\Theta_2}\: f_2(f_1({\bf x},\Theta_1),\Theta_2)
\end{pmatrix}
$$

Thus, we can often simplify our computation of the gradient by separating concerns. 

Now, we identify the functions $f_2$ with

$$
H({\bf y}, \text{softmax}(W^{(1)}{\bf x}_1+{\bf b}_1))
$$

and  $f_1$ with the function

$$
\text{logit}(W^{(0)}{\bf x}_0+{\bf b}_0).
$$

A good place to start is determining the Jacobian of the map $\text{log}(W{\bf x}+{\bf b})$ with respect to the parameters $W$ and ${\bf b}$. Now, we know that the map defined by ${\bf x}\mapsto A{\bf x}$ for the matrix $A$ has Jacobian $A$. We want to have some way of generalizing this to the case where $W\mapsto W{\bf x}$. In particular, $A\in M_{m,n}$ is a map from $\mathbb{R}^n$ to $\mathbb{R}^m$, so a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ effectively returns a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. The map $W\mapsto W{\bf x}$ is a linear map from $M_{m, n}\rightarrow \mathbb{R}^m$. We should also be able to view the Jacobian of this map as a linear map from $M_{m,n}$ to $\mathbb{R}^m$. 


## Multilinear Algebra and Elements of Tensor Calculus

### Tensors

A **tensor** is an $N$-way array of numbers, $\mathcal{A}=\left(a_{i_1,i_2,\ldots, i_N}\right)_{1\leq i_1\leq n_1,\:1\leq i_2\leq n_2,\ldots,1\leq i_n\leq n_N}$ where each $a_{i_1,i_2,\ldots, i_n}\in\mathbb{R}$. Letting $\mathbb{N}=\{1,2,3,\ldots\}$ denote the natural numbers, we say that ${\bf i}\in\mathbb{N}^N$ with ${\bf i}=(i_1,\ldots, i_N)$ is a **multi-index**, and we will write $a_{\bf i}$ instead of $a_{i_1,i_2,\ldots, i_N}$. We will also replace the system of inequalities $1\leq i_1\leq n_1,\:1\leq i_2\leq n_2,\ldots,1\leq i_n\leq n_N$ with ${\bf 1}\leq {\bf i}\leq {\bf n}$. This gives us the compact notation $\mathcal{A}=\left(a_{{\bf i}}\right)_{{\bf 1}\leq {\bf i}\leq{\bf n}}$. 

For example, an $m$ by $n$ matrix is a $2$-way array, and in this case $\mathcal{A}=\left(a_{i, j}\right)_{1\leq i\leq m, 1\leq j\leq n}$, which is the same as $\left(a_{(i,j)}\right)_{(1, 1)\leq (i, j)\leq (m,n)}$ in our compact notation.

We let $\mathcal{T}_{{\bf n}}$ denote the set of $n_1$ by $n_2$ by $\ldots$ by $n_N$ tensors where
$$
{\bf n} = \begin{pmatrix}
n_1 & n_2 &\cdots & n_N
\end{pmatrix}\in \mathbb{N}^N.
$$

For example, $M_{m,n}$ is the same as $\mathcal{T}_{(m, n)}$, and $\mathbb{R}^d$ is the same as $\mathcal{T}_{(d)}$.

We say that members of $\mathcal{T}_{\bf n}$ are $N$th order tensors whe ${\bf n}\in\mathbb{N}^n$. Thus, we call scalars $0$th order tensors, vectors are $1$st order tensors, matrices are $2$nd order tensors, and 3D arrays will be $3$rd order tensors.

### Tensor Contractions along a Single Index Pair

For a vector ${\bf n}\in\mathbb{N}^N$, and $i\in\{1,\ldots, N\}$, let ${\bf n}_{\setminus \{i\}}\in\mathbb{N}^{N-1}$ denote vector obtained by removing the $i$th entry of ${\bf n}$, and we let ${\bf n}\leftarrow_{i} k\in\mathbb{N}^N$ denote the vector obtained by replacing the $i$th entry of ${\bf n}$ with $k$. If ${\bf n}\in\mathbb{N}^N$ and ${\bf m}\in\mathbb{N}^M$, then we define 

$$
{\bf n}\oplus{\bf m} =\begin{pmatrix}
n_1 & \cdots & n_N & m_1 &\cdots &m_M
\end{pmatrix}\in \mathbb{N}^{N+M}.
$$

For example, if ${\bf n} = (4, 2, 6, 3)$ and ${\bf m}=(6, 7, 5)$, then ${\bf n}_{\setminus\{3\}}=(4, 2, 3)$, ${\bf m}_{\setminus\{1\}}=(7,5)$, and 

$$
{\bf n}_{\setminus\{3\}}\oplus {\bf m}_{\setminus\{1\}} = (4, 2, 3, 7, 5).
$$

We also have

$$
{\bf n}\leftarrow_{2} 1 = (4, 1, 6, 3).
$$

Now, suppose that there are indices $i$ and $j$ such that $n_{i}=m_{j}$ for index vectors ${\bf n}\in\mathbb{N}^N$ and ${\bf m}\in\mathbb{N}^M$. Then we can define a **tensor contraction operator** $c_{(i,j)}:\mathcal{T}_{\bf n}\times \mathcal{T}_{\bf m}\rightarrow \mathcal{T}_{{\bf n}_{\setminus\{i\}}\oplus{\bf m}_{\setminus\{j\}}}$

$$
c_{(i, j)}(\mathcal{A},\mathcal{B})_{{\bf k}_{\setminus\{i\}}\oplus{\bf l}_{\setminus\{j\}}} = \sum_{k_i=1}^{n_i} a_{{\bf k}\leftarrow_{\{i\}}k_i} b_{{\bf l}\leftarrow_{\{j\}}k_i}
$$

We say that that this is contraction of $\mathcal{A}$ and $\mathcal{B}$ over the $(i,j)$th index pair. 

For example, matrix-vector multiplication may be thought of as a contraction on $A\in \mathcal{T}_{(m,n)}$ and ${\bf x}\in\mathcal{T}_{(n)}$. We note that $(m,n)$ and $(n)$ agree at $i=2$ and $j=1$, and then

$$
(m,n)_{\setminus\{2\}}\oplus (n)_{\setminus\{1\}}= (m) \oplus () = (m) \text{ (by convention)}.
$$

We also note that

$$
(k_1, k_2)_{\setminus\{2\}}\oplus (l_1)_{\setminus\{1\}} = (k_1),
$$

so

$$
c_{(2, 1)}(A,{\bf x})_{(k_1)}=c_{(2, 1)}(A,{\bf x})_{(k_1, k_2)_{\setminus\{2\}}\oplus (l_1)_{\setminus\{1\}}}=\sum_{k=1}^n a_{(k_1,k_2)\leftarrow_{2}k}x_{(l_1)\leftarrow_{1} k}=\sum_{k=1}^n a_{k_1, k} x_k.
$$

It should be clear from the name, but TensorFlow supports such tensor operations!

In [2]:
import tensorflow as tf

A = tf.Variable([[3, 1], [5, 6]], name='A') 
B = tf.Variable([[2, -1], [3, 5]], name='B')
x = tf.Variable([2, -1], name='x') # 
f = tf.tensordot(A, x, [[1], [0]])
g = tf.tensordot(A, B, [[1], [0]])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    mvresult = f.eval()
    mmresult = g.eval()

print('We see that contraction generalizes matrix-vector and matrix-matrix multiplication:')
print(mvresult)
print(mmresult)

A = tf.Variable([[[3, 1], [5, 6]],[[-1, 1], [2, -2]]], name='A') # This is a 2 by 2 by 2 tensor
B = tf.Variable([[2, -1], [3, 5]], name='B') # This is a 2 by 2 tensor
f = tf.tensordot(A, B, [[1], [0]]) # Contraction along a single index pair

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    result = f.eval()

print('Contracting a 2 by 2 by 2 tensor with a 2 by 2 tensor along a single index pair yields a 2 by 2 by 2 tensor:')
print(result)


We see that contraction generalizes matrix-vector and matrix-matrix multiplication:
[5 4]
[[ 9  2]
 [28 25]]
Contracting a 2 by 2 by 2 tensor with a 2 by 2 tensor along a single index pair yields a 2 by 2 by 2 tensor:
[[[ 21  22]
  [ 20  29]]

 [[  4  11]
  [ -4 -11]]]


### Tensor Contractions in General

For a vector ${\bf n}\in\mathbb{N}^N$, and a subset of **distinct indices** $\{i_1,i_2,\ldots, i_k\}\subset\{1,\ldots, N\}$ (we abuse notation and use ${\bf i}$ to denote this subset), let ${\bf n}_{\setminus {\bf i}}\in\mathbb{N}^{N-k}$ denote vector obtained by removing the $i_1$th, $i_2$th, $\ldots$, and $i_k$th entries of ${\bf n}$, and let ${\bf n}\leftarrow_{{\bf i}} \kappa \in\mathbb{N}^N$ denote the vector obtained by replacing the $i_j$th entry of ${\bf n}$ with $\kappa_j$ for $j=1,\ldots, k$. 

For example, if ${\bf n} = (4, 2, 6, 3)$ and ${\bf m}=(6, 2, 5)$, then ${\bf n}_{\setminus (2, 3)}=(4, 3)$, ${\bf m}_{\setminus(2,1)}=(5)$. It should be noted that ${\bf m}_{(1, 2)}=(5)$ as well, but it will turn out that ordering is important for generalizing contractions. We also have

$$
{\bf n}\leftarrow_{(2,3)} (1,1) = (4, 1, 1, 3).
$$

Now, suppose that there are distinct subsets of indices ${\bf i}=\{i_1,\ldots, i_\kappa\}\subset\{1,\ldots, N\}$ and ${\bf j}=\{j_1,\ldots, j_\kappa\}\subset\{1,\ldots, M\}$ such that $n_{i_t}=m_{j_t}$ for all $t=1,\ldots, \kappa$ where ${\bf n}\in\mathbb{N}^N$ and ${\bf m}\in\mathbb{N}^M$. Then we can define a **tensor contraction operator** $c_{{\bf i},{\bf j}}:\mathcal{T}_{\bf n}\times \mathcal{T}_{\bf m}\rightarrow \mathcal{T}_{{\bf n}_{\setminus{\bf i}}\oplus{\bf m}_{\setminus{\bf j}}}$

$$
c_{{\bf i}, {\bf j}}(\mathcal{A},\mathcal{B})_{{\bf p}_{\setminus{\bf i}}\oplus{\bf q}_{\setminus {\bf j}}} = \sum_{{\bf 1}\leq {\bf k}\leq {{\bf n}_{\bf i}}} a_{{\bf p}\leftarrow_{\bf i}{\bf k}} b_{{\bf q}\leftarrow_{\bf j} {\bf k}}
$$

We say that that this is contraction of $\mathcal{A}$ and $\mathcal{B}$ over the ${\bf i}$ ${\bf j}$ index pairings. 

For example, the extension of the *inner product* to matrices may be thought of as a contraction on $A,B\in \mathcal{T}_{(m,n)}$:

$$
c_{(1, 2),(1,2))}(A,B)=c_{(2, 1)}(A,{\bf x})_{(p_1,p_2)_{\setminus\{1,2\}}\oplus (q_1, q_2)_{\setminus\{1,2\}}}=\sum_{k_1=1}^m\sum_{k_2=1}^n a_{(p_1,p_2)\leftarrow_{\{1,2\}}(k_1,k_2)}b_{(q_1, q_2)\leftarrow_{\{1,2\}}(k_1,k_2)}=\sum_{k_1=1}^m\sum_{k_2=1}^n a_{(k_1,k_2)}b_{(k_1,k_2)}
$$

This is called the **Frobenius inner product** on $M_{m,n}$. 

In [12]:
A = tf.Variable([[[3, 1], [5, 6]],[[-1, 1], [2, -2]]], name='A') # This is a 2 by 2 by 2 tensor
B = tf.Variable([[2, -1], [3, 5]], name='B') # This is a 2 by 2 tensor
f = tf.tensordot(A, B, [[1,2], [0,1]]) # Contraction along two indices

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    result = f.eval()

print('Contracting a 2 by 2 by 2 tensor with a 2 by 2 tensor along two indices yields a 2 dimensional vector:')
print(result)

Contracting a 2 by 2 by 2 tensor with a 2 by 2 tensor along two indices yields a 2 dimensional vector:
[50 -7]


## Group Problems

Consider the following 2 by 2 by 2 tensors:  $\mathcal{A}=\left(\begin{pmatrix}
1 & 2\\
-2 & 1
\end{pmatrix}, \begin{pmatrix}
1 & -1\\
-1 & 1
\end{pmatrix}\right)$ and $\mathcal{B}=\left(\begin{pmatrix}
1 & 2\\
3 & 4
\end{pmatrix}, \begin{pmatrix}
4 & 1\\
2 & 3
\end{pmatrix}\right)$
Contract these two tensors over the indices
1. ${\bf i}=\{1,2,3\}$ and ${\bf j}=\{1,2,3\}$
2. ${\bf i}=\{1,3,2\}$ and ${\bf j}=\{1,2,3\}$
3. ${\bf i}=\{1,3\}$ and ${\bf j}=\{1,2\}$
4. ${\bf i}=\{2,3\}$ and ${\bf j}=\{1,2\}$
5. ${\bf i}=\{1,2\}$ and ${\bf j}=\{2,3\}$
6. ${\bf i}=\{1\}$ and ${\bf j}=\{2\}$
7. ${\bf i}=\{1\}$ and ${\bf j}=\{3\}$
8. ${\bf i}=\{3\}$ and ${\bf j}=\{1\}$

In [5]:
A = tf.Variable([[[1, 2], [-2, 1]],[[1, -1], [-1, 1]]], name='A') # This is a 2 by 2 by 2 tensor
B = tf.Variable([[[1, 2], [3, 4]],[[4, 1], [2, 3]]], name='B') # This is a 2 by 2 by 2 tensor
f = tf.tensordot(A, B, [[0,2,1], [0,1, 2]]) # Contraction along two indices

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    result = f.eval()

print('Contracting a 2 by 2 by 2 tensor with a 2 by 2 tensor along two indices yields a 2 dimensional vector:')
print(result)

Contracting a 2 by 2 by 2 tensor with a 2 by 2 tensor along two indices yields a 2 dimensional vector:
11


### The Chain Rule for Tensor-Valued Functions

A **tensor-valued function** is a function $f:\mathcal{T}_{\bf n}\rightarrow\mathcal{T}_{\bf m}$. For example, if $A\in\mathcal{T}_{(m,n)}$ and ${\bf x}\in \mathcal{T}_{(n)}$, then $f(A)=A{\bf x}$ is a map from $\mathcal{T}_{(m,n)}$ to $\mathcal{T}_{(m)}$. We can write $f((x_{\bf j})_{{\bf 1}\leq {\bf j}\leq{\bf n}})=\left(f_{\bf i}((x_{\bf j})_{{\bf 1}\leq {\bf j}\leq{\bf n}})\right)_{{\bf 1}\leq {\bf i}\leq {\bf m}}$ to explicitly state the variables ($x_{\bf j}$'s) and component functions ($f_{\bf i}$'s) of such an $f$. 

We will say that $f\in C^1(\mathcal{T}_{\bf n};\mathcal{T}_{\bf m})$ if all the component functions of $f$ are continuous functions over the variables $\mathcal{X}=(x_{\bf i})_{{\bf 1}\leq {\bf i}\leq{\bf n}}$. For $f\in C^1(\mathcal{T}_{\bf n};\mathcal{T}_{\bf m})$, we can generalize the definition of the Jacobian of $f$ to $Df(\mathcal{X})\in\mathcal{T}_{{\bf m}\oplus{\bf n}}$ by defining

$$
(Df(\mathcal{X}))_{{\bf i}\oplus{\bf j}} = \frac{\partial f_{\bf i}}{\partial x_{\bf j}}(\mathcal{X}).
$$

For example, if $f\in C^1(\mathbb{R}^n;\mathbb{R}^m)=C^1(\mathcal{T}_{(n)};\mathcal{T}_{(m)}))$, then $Df(\mathcal{X})\in \mathcal{T}_{(m)\oplus(n)}=\mathcal{T}_{(m,n)}$ is defined by

$$
(Df(\mathcal{X}))_{(i,j)}=(Df(\mathcal{X}))_{(i)\oplus(j)}=\frac{\partial f_i}{\partial x_j}(\mathcal{X})
$$

for $\mathcal{X}\in\mathcal{T}_{(n)}=\mathbb{R}^n$. 

For $f\in C^1(\mathcal{T}_{\bf n};\mathcal{T}_{\bf k})$ and $g\in C^1(\mathcal{T}_{\bf k};\mathcal{T}_{\bf m})$, we have that the *composition* of $g$ with $f$, $g\circ f$ defined by $(g\circ f)(\mathcal{X})=g(f(\mathcal{X}))$, satisfies $(g\circ f)\in C^1(\mathcal{T}_{\bf n};\mathcal{T}_{\bf m})$. Note that $Df(\mathcal{X})\in \mathcal{T}_{{\bf k}\oplus{\bf n}}$ and $Dg(\mathcal{Y})\in \mathcal{T}_{{\bf m}\oplus{\bf k}}$, and the vectors $\widetilde{\bf n}={\bf k}\oplus{\bf n}$ and $\widetilde{\bf m}={\bf m}\oplus{\bf k}$ satisfy $\widetilde{n}_i=\widetilde{m}_{M+i}$ for $i=1,\ldots K$. In notation, the **chain rule** is

$$
\frac{\partial (g\circ f)_{\bf i}}{\partial x_{\bf j}}(\mathcal{X})=D(g\circ f)(\mathcal{X})_{{\bf i}\oplus{\bf j}} = c_{(M+1,\ldots, M+K), (1,\ldots, K)}(Dg(f(\mathcal{X}),Df(\mathcal{X}))_{{\bf i}\oplus{\bf j}}=\sum_{{\bf 1}\leq {\bf l}\leq {\bf k}} Dg(f(\mathcal{X}))_{{\bf i}\oplus {\bf l}} Df(\mathcal{X})_{{\bf l}\oplus{\bf j}} = \sum_{{\bf 1}\leq {\bf l}\leq {\bf k}} \frac{\partial g_{\bf i}}{\partial y_{\bf l}}(f(\mathcal{X})) \frac{\partial f_{\bf l}}{\partial x_{\bf j}}(\mathcal{X})
$$

That is, we contract over the indices arising from ${\bf k}$. This is clearly correct since the chain rule tells us to take Jacobians and sum over intervening variables. The only complicated thing is that we now have many, many indices.

### The Product Rule for Contractions

We start by introducing **slice** notation for tensors. For a tensor $\mathcal{X}\in\mathcal{T}_{{\bf m}\oplus{\bf n}}$ and multi-index ${\bf q}$ with ${\bf 1}\leq{\bf q}\leq {\bf n}$, $\mathcal{X}_{(\cdot, {\bf q})}\in\mathcal{T}_{\bf m}$ is such that 

$$
\left(\mathcal{X}_{(\cdot, {\bf q})}\right)_{\bf p}=\mathcal{X}_{{\bf p}\oplus{\bf q}}
$$

for all ${\bf p}$ with ${\bf 1}\leq {\bf p}\leq {\bf m}$.

We have seen that 

$$
\frac{\partial}{\partial x_i} \left(A({\bf x}) B({\bf x})\right)=\left(\frac{\partial}{\partial x_i}A({\bf x})\right) B({\bf x})+A({\bf x})\left(\frac{\partial}{\partial x_i} B({\bf x})\right) 
$$

where the partial derivative of a matrix valued function is understood to mean the partial derivative of each of the component functions. We have seen that matrix multiplication is generalized by tensor contractions, so it is natural to see how the product rule generalizes. So, suppose we have two tensor-valued functions $\mathcal{F}:\mathcal{T}_{\bf n}\rightarrow\mathcal{T}_{\bf k}$ and $\mathcal{G}:\mathcal{T}_{\bf n}\rightarrow\mathcal{T}_{{\bf k}^\prime}$, and pairs ${\bf i}$ and ${\bf j}$ where $k_{i_t}=k_{j_t}^\prime$ for all $t=1,\ldots, \kappa$. We then have a contraction

$$
c_{{\bf i},{\bf j}}:\mathcal{T}_{\bf k}\times\mathcal{T}_{{\bf k}^\prime}\rightarrow \mathcal{T}_{{\bf k}_{\setminus{\bf i}}\oplus{\bf k}^\prime_{\setminus{\bf j}}}
$$

and therefore $\mathcal{Q}(\mathcal{X})=c_{{\bf i},{\bf j}}(\mathcal{F}(\mathcal{X}), \mathcal{G}(\mathcal{X}))$ is a map from $\mathcal{T}_{\bf n}$ to $\mathcal{T}_{{\bf k}_{\setminus{\bf i}}\oplus{\bf k}^\prime_{\setminus{\bf j}}}$ with $D\mathcal{Q}(\mathcal{X})\in \mathcal{T}_{{\bf k}_{\setminus{\bf i}}\oplus{\bf k}^\prime_{\setminus{\bf j}}\oplus{\bf n}}$.

Using the 1D product rule, it is easy to establish that

$$
\frac{\partial}{\partial x_{\bf l}}c_{{\bf i},{\bf j}}(\mathcal{F}(\mathcal{X}), \mathcal{G}(\mathcal{X})) = c_{{\bf i},{\bf j}}\left(\frac{\partial \mathcal{F}}{\partial x_{\bf l}}(\mathcal{X}), \mathcal{G}(\mathcal{X})\right)+c_{{\bf i},{\bf j}}\left(\mathcal{F}(\mathcal{X}), \frac{\partial \mathcal{G}}{\partial x_{\bf l}}(\mathcal{X})\right)= c_{{\bf i},{\bf j}}\left(D \mathcal{F}(\mathcal{X})_{(\cdot,{\bf l})}, \mathcal{G}(\mathcal{X})\right)+c_{{\bf i},{\bf j}}\left(\mathcal{F}(\mathcal{X}), D\mathcal{G}(\mathcal{X})_{(\cdot,{\bf l})}\right)
$$

where the partial derivative of a tensor valued function is understood to be tensor of partial derivatives of the component functions. 

Now, set ${\bf p}={\bf k}\oplus{\bf n}$ and ${\bf q}={\bf k}^\prime\oplus{\bf n}$, and note that $p_{i_1}=k_{i_1}=k_{j_1}^\prime=q_{j_1}$. Because of these identifications, $c_{{\bf i},{\bf j}}$ "lifts" to the contractions $\widetilde{c}_{{\bf i},{\bf j}}:\mathcal{T}_{{\bf k}\oplus{\bf n}}\times\mathcal{T}_{{\bf k}^\prime}\rightarrow\mathcal{T}_{{\bf k}_{\setminus{\bf i}}\oplus{\bf k}^\prime_{\setminus{\bf j}}\oplus{\bf n}}$ and $\widetilde{c}_{{\bf i},{\bf j}}^\prime:\mathcal{T}_{{\bf k}}\times\mathcal{T}_{{\bf k}^\prime\oplus{\bf n}}\rightarrow\mathcal{T}_{{\bf k}_{\setminus{\bf i}}\oplus{\bf k}^\prime_{\setminus{\bf j}}\oplus{\bf n}}$, and the **tensor product rule** becomes

$$
D c_{{\bf i},{\bf j}}(\mathcal{F}(\mathcal{X}),\mathcal{G}(\mathcal{X}))=\widetilde{c}_{{\bf i},{\bf j}}(D\mathcal{F}(\mathcal{X}),\mathcal{G}(\mathcal{X}))+\widetilde{c}_{{\bf i},{\bf j}}^\prime(\mathcal{F}(\mathcal{X}),D\mathcal{G}(\mathcal{X}))
$$




### Example

Consider the map $W\mapsto W{\bf x}$. In this case, we have the following identifications:

1. $W\in M_{m,n}$ is identified with $\mathcal{X}\in\mathcal{T}_{(m,n)}$
2. $W\mapsto W$ is identified with $\mathcal{F}(\mathcal{X})=\mathcal{X}$, so $\mathcal{F}:\mathcal{T}_{(m,n)}\rightarrow\mathcal{T}_{(m,n)}$
3. $W\mapsto {\bf x}$ is identified with $\mathcal{G}:\mathcal{T}_{(m,n)}\rightarrow\mathcal{T}_{(n)}$
4. $W\mapsto W{\bf x}$ is identified with $c_{(2),(1)}(\mathcal{F}(\mathcal{X}), \mathcal{G}(\mathcal{X}))$

Sinc $\mathcal{F}$ is the **identity function** on $\mathcal{T}_{(m,n)}$, we will have that $D\mathcal{F}(\mathcal{X})=\mathcal{I}$ for all $\mathcal{X}\in\mathcal{T}_{(m,n)}$ where $\mathcal{I}\in \mathcal{T}_{(m,n)\oplus(m,n)}$ is the **identity tensor**. In particular,

$$
\mathcal{I}_{i,j,i^\prime,j^\prime} = 1 \text{ if }i=i^\prime, j=j^\prime\text{ and } 0\text{ otherwise}
$$

and

$$
c_{(1, 2), (1, 2)}(\mathcal{I}, \mathcal{A})= \mathcal{A}
$$

for all $\mathcal{A}\in \mathcal{T}_{(m,n)}$. On the other hand $D\mathcal{G}(\mathcal{X})={\bf 0}\in \mathcal{T}_{(n)\oplus(m,n)}$ is the **zero tensor** for all $\mathcal{X}\in\mathcal{T}_{(n)}$. We conclude that the Jacobian of $W\mapsto W{\bf x}$ is then

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x}).
$$


Even more concretely, suppose $W$ was a $2$ by $3$ matrix. Then 

$$
\mathcal{I}_{(\cdot,(1,1))}=\begin{pmatrix}
1 & 0 & 0\\
0 & 0 & 0
\end{pmatrix}
$$

$$
\mathcal{I}_{(\cdot,(1,2))}=\begin{pmatrix}
0 & 1 & 0\\
0 & 0 & 0
\end{pmatrix}
$$

$$
\mathcal{I}_{(\cdot,(1,3))}=\begin{pmatrix}
0 & 0 & 1\\
0 & 0 & 0
\end{pmatrix}
$$

$$
\mathcal{I}_{(\cdot,(2,1))}=\begin{pmatrix}
0 & 0 & 0\\
1 & 0 & 0
\end{pmatrix}
$$

$$
\mathcal{I}_{(\cdot,(2,2))}=\begin{pmatrix}
0 & 0 & 0\\
0 & 1 & 0
\end{pmatrix}
$$

$$
\mathcal{I}_{(\cdot,(2,3))}=\begin{pmatrix}
0 & 0 & 0\\
0 & 0 & 1
\end{pmatrix}
$$

and

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{(\cdot,(1,1))}=\begin{pmatrix}
1 & 0 & 0\\
0 & 0 & 0
\end{pmatrix}\begin{pmatrix} x_1\\ x_2\\ x_3\end{pmatrix} = \begin{pmatrix} x_1\\ 0\end{pmatrix}
$$

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{(\cdot,(1,2))}= \begin{pmatrix} x_2\\ 0\end{pmatrix}
$$

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{(\cdot,(1,3))}= \begin{pmatrix} x_3\\ 0\end{pmatrix}
$$

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{(\cdot,(2,1))}= \begin{pmatrix} 0\\ x_1\end{pmatrix}
$$

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{(\cdot,(2,2))}= \begin{pmatrix} 0\\ x_2\end{pmatrix}
$$

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{(\cdot,(2,3))}= \begin{pmatrix} 0\\ x_3\end{pmatrix}.
$$

Slicing in a different way, we have

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{((1,1),\cdot)}= {\bf x}
$$

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{((1,2),\cdot)}= {\bf 0}
$$

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{((2,1),\cdot)}= {\bf 0}
$$

and

$$
\widetilde{c}_{(2), (1)}(\mathcal{I},{\bf x})_{((2,2),\cdot)}= {\bf x}.
$$


### Example

Consider the map $h:M_{m,n}\times\mathbb{R}^m\rightarrow \mathbb{R}^m$ defined by $h(W,{\bf b})=\text{logit}(W{\bf x}+{\bf b})$ for some ${\bf x}\in\mathbb{R}^n$ and where $\text{logit}$ is asssumed to act in a vectorized manner. We can think of this as a composition $f(W,{\bf b})=W{\bf x}+{\bf b}$ with $g({\bf y})=\text{logit}({\bf y})$, where $f:M_{m,n}\times\mathbb{R}^m\rightarrow\mathbb{R}^m$ and $g:\mathbb{R}^m\rightarrow\mathbb{R}^m$. 

Because we have separated the variables as $(W,{\bf b})$, computation of the full Jacobian should be written as a pair of tensors, $(D_Wf(W,{\bf b}), D_{\bf b}f(W,{\bf b}))$ where $D_W f\in \mathcal{T}_{(m)\oplus(m,n)}$ is the Jacobian of $f$ with respect to the parameters $W$, and $D_{\bf b}f\in \mathcal{T}_{(m)\oplus(m)}$ is the Jacobian of $f$ with respect to the parameters ${\bf b}$. 

From the previous example, we can see that $D_Wf(W,{\bf b})=\widetilde{c}_{(2),(1)}(\mathcal{I},{\bf x})$ and we of course have that $D_{\bf b} f(W,{\bf b})= I$ where $I$ is the $m$ by $m$ identity matrix. On the other hand, $Dg({\bf y}) = \text{diag}(\text{logit}({\bf y})\cdot \text{logit}(-{\bf y}))$. By the chain rule, we have that

$$
D_W(g\circ f)(W, {\bf b}) = c_{(2), (1)}(Dg(f(W,{\bf b})), D_W f(W,{\bf b})).
$$

Considering the case when $W$ is $2$ by $3$, and hence $g:\mathbb{R}^2\rightarrow\mathbb{R}^2$, this Jacobian is a 3rd order tensor in $\mathcal{T}_{(2, 2, 3)}$, and we have

$$
D_W(g\circ f)(W, {\bf b})_{((1),\cdot)}=\begin{pmatrix}
\text{logit}({\bf e}_1^T(W{\bf x}+{\bf b}))\text{logit}(-{\bf e}_1^T(W{\bf x}+{\bf b})) & 0\\
0 &\text{logit}({\bf e}_2^T(W{\bf x}+{\bf b}))\text{logit}(-{\bf e}_2^T(W{\bf x}+{\bf b}))
\end{pmatrix}\begin{pmatrix}
x_1 & x_2 & x_3\\
0 & 0 & 0
\end{pmatrix}
$$

and

$$
D_W(g\circ f)(W, {\bf b})_{((2),\cdot)}=\begin{pmatrix}
\text{logit}({\bf e}_1^T(W{\bf x}+{\bf b}))\text{logit}(-{\bf e}_1^T(W{\bf x}+{\bf b})) & 0\\
0 &\text{logit}({\bf e}_2^T(W{\bf x}+{\bf b}))\text{logit}(-{\bf e}_2^T(W{\bf x}+{\bf b}))
\end{pmatrix}\begin{pmatrix}
0 & 0 & 0\\
x_1 & x_2 & x_3
\end{pmatrix}
$$



## Group Problems

For the following functions $g$ and $f$, compute the Jacobians of $f$ and $g$, and use the chain rule to compute the Jacobian of $g\circ f$.

1. $\displaystyle f\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}= \begin{pmatrix} x_{1, 1} x_{1, 2}\\ x_{2,1} x_{2, 2}\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_1 \\ y_2\end{pmatrix} = \begin{pmatrix}
y_1 + y_2\\
-y_1 + y_2
\end{pmatrix}$
2.  $\displaystyle f\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}= \begin{pmatrix} x_{1, 1} - x_{1, 2}\\ x_{2,1} + x_{2, 2}\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_1 \\ y_2\end{pmatrix} = \begin{pmatrix}
y_1^2\\
y_2^2
\end{pmatrix}$
3.  $\displaystyle f\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}= \begin{pmatrix} x_{1, 1}/x_{1, 2}\\ x_{2,1} /x_{2, 2}\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_1 \\ y_2\end{pmatrix} = \begin{pmatrix}
y_1y_2\\
y_1/y_2
\end{pmatrix}$
4.  $\displaystyle f\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}= \begin{pmatrix} e^{x_{1, 1}+x_{1, 2}}\\ e^{x_{2,1} + x_{2, 2}}\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_1 \\ y_2\end{pmatrix} = \begin{pmatrix}
\log(y_1) + \log(y_2)\\
y_1y_2
\end{pmatrix}$
5.  $\displaystyle f\left(\begin{pmatrix}x_{1, 1, 1} & x_{1, 1, 2}\\ x_{1,2,1} & x_{1,2, 2}\end{pmatrix},\begin{pmatrix}x_{2, 1, 1} & x_{2, 1, 2}\\ x_{2,2,1} & x_{2,2, 2}\end{pmatrix}\right)= \begin{pmatrix} x_{1,1,1}+x_{2, 2, 2}\\ x_{1, 2, 1}^2+x_{2, 1, 1}^2\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_1 \\ y_2\end{pmatrix} = \begin{pmatrix}
y_1 + y_2\\
-y_1 + y_2
\end{pmatrix}$
6.  $\displaystyle f\left(\begin{pmatrix}x_{1, 1, 1} & x_{1, 1, 2}\\ x_{1,2,1} & x_{1,2, 2}\end{pmatrix},\begin{pmatrix}x_{2, 1, 1} & x_{2, 1, 2}\\ x_{2,2,1} & x_{2,2, 2}\end{pmatrix}\right)= \begin{pmatrix} x_{1,1,1}-x_{2, 2, 2}+x_{1, 2, 1}-x_{1, 1, 2}\\ x_{1, 2, 1}-x_{2, 1, 1}+x_{1, 1, 2}\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_1 \\ y_2\end{pmatrix} = \begin{pmatrix}
y_1^2 - y_2^2\\
y_1^2 + y_2^2
\end{pmatrix}$
7.  $\displaystyle f\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}= \begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_{1,1} & y_{1, 2} \\ y_{2, 1}& y_{2,2}\end{pmatrix} = y_{1,1}y_{2,2}-y_{1,2}y_{2,1}$.
8.  $\displaystyle f\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}= \begin{pmatrix}x_{1, 1} & 0\\ 0 & x_{2, 2}\end{pmatrix}\begin{pmatrix}x_{1, 1} & x_{1, 2}\\ x_{2,1} & x_{2, 2}\end{pmatrix}$ and $\displaystyle g\begin{pmatrix}y_{1,1} & y_{1, 2} \\ y_{2, 1}& y_{2,2}\end{pmatrix} = y_{1,1}y_{2,2}-y_{1,2}y_{2,1}$.

Shapes of these Jacobians:

1. 2 by 2 by 2 tensor-valued $g\circ f$ -- verify the chain rule for the (1, 1, 1) entry
2. 2 by 2 by 2 as well
3. 2 by 2 by 2 as well
4. 2 by 2 by 2 as well
5. 2 by 2 by 2 by 2! -- verify the chain rule for the (1, 1, 1) entry
6. 2 by 2 by 2 by 2!
7. 2 by 2 -- verify the chain rule for the (1, 1) entry
8. 2 by 2
