<a href="https://colab.research.google.com/github/natrask/ENM5310/blob/main/Lecture0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We'll begin with a tutorial on TensorFlow for regression tasks. I choose to work in TensorFlow 1 because it provides the most transparent view of how the model maps onto the computational graph hiding "under the hood". These days most people prefer PyTorch/JAX, or at least TensorFlow 2. You can submit homework in whichever of these you prefer, but you may not use external libraries (e.g. only use Keras when given explicit permission in the assignment, and do not import libraries with pre-implemented architectures). This will let us understand architectures at a low level and avoid "code-glue", where we stick random bits of code together without understanding why or how they work. You can find tutorials and API documentation for TF1 [here](https://github.com/tensorflow/docs/tree/master/site/en/r1).

The following is a barebones TensorFlow environment with numpy and matplotlib functionality for manipulating and plotting data.

In [None]:
import numpy as np
import scipy.sparse.linalg
import matplotlib.pyplot as plt


import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
tf.reset_default_graph()
config = tf.ConfigProto()
sess = tf.Session(config=config)

Next we will demonstrate how to build up a tensorflow model which will implement a function of the form $f(x) = \theta_1 x + \theta_2$. In tensorflow 1 there are 3 fundamental objects: a constant, which is a tensor maintaining a constant value; a Variable, which is a tensor whose value will evolve via gradient descent or some other process for assignment, and a placeholder which is a tensor whose value will be provided from data as one processes through a dataset. Think of this as defining a "circuit" which data flows through, being loaded into placeholders and flowing to output nodes. When defining a model, the initial values of Variables must be specified, and the model must be initialized before evaluating. When specifying the dimension of a tensor the **None** keyword serves as a placeholder for an unknown length dimension (typically used for the number of data points so we can push in batches of varying size).

In [None]:
theta1 = tf.Variable(2.0,dtype=tf.float32)
theta2 = tf.constant(3.0,dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32,shape=(None))
f_of_x = theta1*x+theta2

Data can be evaluated by defining a dictionary which specifies what values should be fed into the placeholder values. This is the mechanism by which we'll feed batches of data - we will successively redefine the feed_dict to point to different slices of the dataset. Note how the **None** keyword allows one to provide different length tensors as input.

In [None]:
sess.run(tf.global_variables_initializer()) #initialize model
myDict1 = {x:3}
myDict2 = {x:5}
myDict3 = {x:[1,2]}
print('f(x) = 2*x+3')
print('f(3) = ' + str(sess.run(f_of_x,feed_dict=myDict1)))
print('f(5) = ' + str(sess.run(f_of_x,feed_dict=myDict2)))
print('[f(1),f(2)] = ' + str(sess.run(f_of_x,feed_dict=myDict3)))

f(x) = 2*x+3
f(3) = 9.0
f(5) = 13.0
[f(1),f(2)] = [5. 7.]
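
The idea of pointing the feed_dict at successive slices of a dataset can be sketched in plain NumPy. The helper name `iterate_batches` and the batch size of 4 are illustrative assumptions, not part of the notebook. In the TF1 session each slice would be passed as the value for the placeholder, as indicated in the final comment.

```python
import numpy as np

def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data`; the last slice may be shorter."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# A small dataset of 10 points, split into batches of (at most) 4
x_all = np.linspace(0.0, 1.0, 10).astype(np.float32)
batches = list(iterate_batches(x_all, batch_size=4))
batch_lengths = [len(b) for b in batches]
print(batch_lengths)

# With the placeholder x and session from above, each slice would be fed as:
#   sess.run(f_of_x, feed_dict={x: batch})
```

Because the placeholder leaves its length unspecified, the shorter final batch needs no special handling.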


While tensorflow is popularly known as a language for building up deep learning models, you should think of it primarily as an engine for performing automatic differentiation. Note that due to how TF calculates gradients, it is only valid to take the gradient of a scalar with respect to a tensor (i.e. no direct computation of Jacobians or Hessians); if you pass a non-scalar output to `tf.gradients`, it differentiates the sum of its entries. Pytorch has a similar issue, while Jax supports forward and reverse mode differentiation (i.e. derivatives of tensors with respect to tensors). Tensorflow 2 maintains a more flexible but less simple syntax for autodiff that you can read more about [here](https://www.tensorflow.org/guide/autodiff).

In [None]:
df_dx = tf.gradients(f_of_x,x)[0] #derivative with respect to the input x, which equals theta1 = 2
print('dfdx(x) = 2')
print('df(3)/dx = ' + str(sess.run(df_dx,feed_dict=myDict1)))
print('df(5)/dx = ' + str(sess.run(df_dx,feed_dict=myDict2)))
print('[df(1)/dx,df(2)/dx] = ' + str(sess.run(df_dx,feed_dict=myDict3)))

dfdx(x) = 2
df(3)/dx = 2.0
df(5)/dx = 2.0
[df(1)/dx,df(2)/dx] = [2. 2.]
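
As a sanity check on which derivative is being computed, here is a small finite-difference sketch in plain NumPy (no TensorFlow involved). For $f(x) = \theta_1 x + \theta_2$ the derivative with respect to the input is $\theta_1$, while the derivative with respect to the parameter $\theta_1$ is $x$. The step size `h` is an arbitrary choice for illustration.

```python
import numpy as np

theta1, theta2 = 2.0, 3.0

def f(x, t1=theta1, t2=theta2):
    """The linear model f(x) = theta1 * x + theta2."""
    return t1 * x + t2

x = np.array([1.0, 2.0])
h = 1e-6  # finite-difference step (illustrative choice)

# Central differences with respect to the input x and the parameter theta1
df_dx = (f(x + h) - f(x - h)) / (2 * h)
df_dtheta1 = (f(x, t1=theta1 + h) - f(x, t1=theta1 - h)) / (2 * h)

print(df_dx)             # approximately theta1 at each point
print(df_dtheta1)        # approximately x at each point
print(df_dtheta1.sum())  # sum over the batch
```

The last line corresponds to what differentiating a non-scalar output with respect to a scalar parameter would return, since `tf.gradients` sums over the entries of the output.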


Coupled with gradient descent methods, this gives an incredibly simple way to fit models to data. Below we show how to implement a simple gradient descent. As an example, we recalibrate the slope of the linear function to match data sampled from the function $f(x) = x+3$.

In [None]:
x_data = tf.constant(np.linspace(0,1,10),dtype=tf.float32)
y_data = x_data+3
RMS_LOSS = tf.sqrt( tf.reduce_mean( (y_data-f_of_x)**2 ) )
gradient_direction = tf.gradients(RMS_LOSS,theta1)[0]
learning_rate = 0.1
gradient_update = theta1.assign(theta1 - learning_rate*gradient_direction) #overwrite the value of theta1 using a gradient descent update

Here we take 100 steps and check that theta1 does in fact approach the correct value of 1.0.

In [None]:
x_feed = sess.run(x_data) #feed_dict values must be numpy arrays (or Python scalars/lists), not tf.Tensors
for training_step in range(100):
  sess.run(gradient_update,feed_dict={x:x_feed})
  print('(RMS_error, theta1) = (' + str(sess.run(RMS_LOSS,feed_dict={x:x_feed})) +','+str(sess.run(theta1)) + ')')
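
The same update can be reproduced in plain NumPy with a hand-derived gradient, which makes the role of the step length visible. This is a sketch under the same setup (data from $f(x) = x+3$, $\theta_2 = 3$ fixed, learning rate 0.1), not the notebook's TensorFlow code. For this particular RMS loss the gradient with respect to $\theta_1$ works out to $\pm\sqrt{\mathrm{mean}(x^2)}$ regardless of the distance from the optimum, so with a fixed learning rate the iterate ends up within one step length of 1.0 rather than landing on it exactly.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10)
y = x + 3.0
theta1, theta2 = 2.0, 3.0
learning_rate = 0.1

for step in range(100):
    residual = y - (theta1 * x + theta2)
    rms = np.sqrt(np.mean(residual ** 2))
    # Chain rule through the square root:
    # d(rms)/d(theta1) = -mean(residual * x) / rms
    grad = -np.mean(residual * x) / rms
    theta1 = theta1 - learning_rate * grad

step_length = learning_rate * np.sqrt(np.mean(x ** 2))
print(theta1, step_length)
```

Shrinking the learning rate tightens the final error but increases the number of steps needed to get there, which is the trade-off behind tuning the step length.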

We will see that there is a price to pay for this flexibility. Gradient descent is incredibly slow: the gradient tells us which direction to move, but not how far. The step length is a "knob" we have to pick, and it is the first of many, many hyperparameters we will learn about that must be tuned to get good answers. Other classical methods are often vastly faster. Below is a snippet that solves the same least squares problem via the normal equations, for comparison with the gradient descent above.

You can adapt this code for the first homework assignment. Take note of the linear algebra operations available in tensorflow, since these will be your workhorse for manipulating tensors; for others check the [tensorflow API](https://www.tensorflow.org/api_docs). I will do my best to point you toward necessary functions in the first few weeks to minimize the time you spend digging through the API. Other notable ones are tools to reshape/resize tensors ([reshape](https://www.tensorflow.org/api_docs/python/tf/reshape), [expand_dims](https://www.tensorflow.org/api_docs/python/tf/expand_dims)), perform reductions to compute sums/means of tensors ([reduce_sum](https://www.tensorflow.org/api_docs/python/tf/reduce_sum), [reduce_mean](https://www.tensorflow.org/api_docs/python/tf/reduce_mean)), and a transpose operation which can be used to permute indices of tensors ([transpose](https://www.tensorflow.org/api_docs/python/tf/transpose)). To perform dot products, matrix-vector products, and other contractions I suggest you use [einsum](https://www.tensorflow.org/api_docs/python/tf/einsum); there are other functions, but this one can perform any contraction operation once you learn to use it.
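
To get a feel for the index notation, here is a short sketch using `np.einsum`, which accepts the same subscript strings as `tf.einsum` for these explicit-output forms. The arrays are made-up illustrative values.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
A = np.arange(6.0).reshape(2, 3)  # a 2x3 matrix

# Repeated index that is absent from the output is summed over
dot = np.einsum('i,i->', a, b)        # dot product a . b
matvec = np.einsum('ij,j->i', A, a)   # matrix-vector product A a
A_T = np.einsum('ij->ji', A)          # transpose by permuting indices
gram = np.einsum('di,dj->ij', A, A)   # sum over d: equals A^T A

print(dot)
print(matvec)
print(gram.shape)
```

The last pattern, `'di,dj->ij'`, is the one that contracts over the data index `d` and is the form needed to assemble a normal-equations matrix.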

In [None]:
P1 = tf.ones(shape=(x_data.shape[0]))
P2 = x_data
P = tf.stack([P1,P2],axis=1)
M = tf.einsum('di,dj->ij',P,P)
rhs = tf.einsum('di,d->i',P,y_data)
theta_optimal = tf.linalg.solve(M,tf.expand_dims(rhs,-1))[:,0]

print('Optimal values for [theta2, theta1] (intercept, slope): ' + str(sess.run(theta_optimal)))

Optimal values for [theta2, theta1] (intercept, slope): [3.         0.99999976]
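
As an independent cross-check of the normal equations, the same system can be assembled in NumPy and compared against `np.linalg.lstsq`. This is a sketch in double precision, so the result will not show the float32 round-off seen above. Note the ordering: because the columns of the design matrix are $[1, x]$, the solution vector is (intercept, slope).

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10)
y = x + 3.0

# Design matrix with columns [1, x]
P = np.stack([np.ones_like(x), x], axis=1)

# Normal equations: (P^T P) theta = P^T y
M = P.T @ P
rhs = P.T @ y
theta = np.linalg.solve(M, rhs)  # ordered as (intercept, slope)

# Reference solution from a least squares solver
theta_lstsq, *_ = np.linalg.lstsq(P, y, rcond=None)

print(theta)
print(theta_lstsq)
```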


**Assignment 1:** Modify this code to perform maximum likelihood estimation of noisy data, fitting a model of the form: $y \sim \mathcal{N}(\theta_1 x + \theta_2, \sigma^2)$. Do this both by building a tensorflow model and applying gradient descent to a maximum likelihood function to solve for $(\theta_1,\theta_2,\sigma^2)$, and also by computing analytically a system of normal equations for $(\theta_1,\theta_2)$ and an expression for $\sigma^2$ by computing $\nabla MLL = 0$ with pencil and paper.

In [None]:
Ndata = 100
x_data = np.linspace(0,1,Ndata)
y_data = 1+