# Introduction to TensorFlow in Python
Not long ago, cutting-edge computer vision algorithms couldn’t differentiate between images of cats and dogs. Today, a skilled data scientist equipped with nothing more than a laptop can classify tens of thousands of objects with greater accuracy than the human eye. In this course, you will use TensorFlow 2.6 to develop, train, and make predictions with the models that have powered major advances in recommendation systems, image classification, and FinTech. You will learn both high-level APIs, which will enable you to design and train deep learning models in 15 lines of code, and low-level APIs, which will allow you to move beyond off-the-shelf routines. You will also learn to accurately predict housing prices, credit card borrower defaults, and images of sign language gestures.

**Instructor:** Isaiah Hull, senior economist at Sweden's Central Bank

In [35]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import constant, add, ones, matmul, multiply, reduce_sum, Variable

In [22]:
tf.__version__

'2.4.0'

In [26]:
def compute_gradient(x0):
    # Define x as a variable with an initial value of x0
    x = Variable(x0)
    # GradientTape is not among the names imported above, so access it through tf
    with tf.GradientTape() as tape:
        tape.watch(x)
        # Define y using the multiply operation
        y = multiply(x, x)
    # Return the gradient of y with respect to x
    return tape.gradient(y, x).numpy()

In [30]:
# Define a linear regression model
def linear_regression(intercept, slope, features):
    return intercept + features*slope

In [31]:
# Define a loss function to compute the MSE
def loss_function(intercept, slope, targets, features):
    # Compute the predictions for the linear model
    predictions = linear_regression(intercept, slope, features)
    
    # Return the loss
    return tf.keras.losses.mse(targets, predictions)

# $\star$ Chapter 1: Introduction to TensorFlow
Before you can build advanced models in TensorFlow 2, you will first need to understand the basics. In this chapter, you’ll learn how to define constants and variables, perform tensor addition and multiplication, and compute derivatives. Knowledge of linear algebra will be helpful, but not necessary.

### Constants and variables
* TensorFlow's two basic objects of computation are: **constants** and **variables**

#### What is TensorFlow?
* An open-source library for graph-based numerical computation
    * Developed by the Google Brain Team
* Low- and high-level APIs
    * Addition, multiplication, differentiation
    * Design and train machine learning models
* Important changes in TensorFlow 2.0
    * Eager execution enabled by default
        * Allows users to write simpler and more intuitive code
    * Model building with Keras and Estimators (high-level APIs)
        
#### What is a tensor?
* The TensorFlow documentation describes a **tensor** as a "generalization of vectors and matrices to potentially higher dimensions."
* If you're not familiar with linear algebra, think of a tensor as **a collection of numbers, which is arranged into a particular shape**.
    * 0-dimensional: point
    * 1-dimensional: line
    * etc
    
### Defining tensors in TensorFlow
* Each object defined below will be a `tf.Tensor` object

In [3]:
# import tensorflow as tf

# 0D Tensor (strictly, shape (1,) is a 1D tensor with one element; tf.ones(()) gives a true scalar)
d0 = tf.ones((1,))

# 1D Tensor
d1 = tf.ones((2,))

# 2D Tensor
d2 = tf.ones((2, 2))

# 3D Tensor
d3 = tf.ones((2, 2, 2))

If we want to print the array contained in one of these objects, we can apply the `.numpy()` method and pass the result to the `print` function.

In [4]:
# Print the 3D tensor
print(d3.numpy())

[[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]
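
As a quick check on the idea of dimensions, the sketch below inspects the shape and rank of a few tensors. The variable names are illustrative and not part of the course code, and it assumes TensorFlow 2 with eager execution. A tensor created with shape `(1,)` reports rank 1, whereas a true scalar has the empty shape `()`.

```python
import tensorflow as tf

# A true scalar has an empty shape and rank 0
scalar = tf.constant(3.0)
# A single-element vector has shape (1,) and rank 1
single = tf.ones((1,))
# Higher-rank tensors of ones
matrix = tf.ones((2, 2))
cube = tf.ones((2, 2, 2))

ranks = {}
for name, t in [("scalar", scalar), ("single", single),
                ("matrix", matrix), ("cube", cube)]:
    # len(t.shape) is the number of dimensions (the rank)
    ranks[name] = len(t.shape)
    print(name, t.shape, ranks[name])
```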


### Defining constants in TensorFlow
* A **constant** is the simplest category of tensor
* A constant does not change and cannot be trained
    * Immutable
    * Untrainable
* A constant can have any dimension
* In the code below, we've defined two constants:
    * `a` is a 2x3 tensor of 3s
    * `b` is a 2x2 tensor which is constructed from the 1-dimensional tensor: 1, 2, 3, 4

In [5]:
# from tensorflow import constant

# Define a 2x3 constant
a = constant(3, shape=[2, 3])

# Define a 2x2 constant
b = constant([1, 2, 3, 4], shape=[2, 2])

In [6]:
print(a.numpy())

[[3 3 3]
 [3 3 3]]


In [7]:
print(b.numpy())

[[1 2]
 [3 4]]


* Above we worked exclusively with the constant operation
* However, in some cases, there are more convenient options for defining certain types of special tensors
<img src='data/convenience_functions.png' width="400" height="200" align="center"/>

* Use the `zeros()` or `ones()` operations to generate a tensor of arbitrary (but defined) dimensions that is populated entirely with zeros or ones.
* Use the `zeros_like()` or `ones_like()` operations to populate a tensor with zeros or ones, copying the dimensions of the input tensor passed to it.
* Use the `fill()` operation to populate a tensor of arbitrary dimensions with the same scalar value in each element.

In [8]:
fill_ex = tf.fill([3, 3],7)

In [9]:
print(fill_ex.numpy())

[[7 7 7]
 [7 7 7]
 [7 7 7]]
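
The convenience operations listed above can be compared side by side. This is a minimal sketch; the tensor names (`template`, `sevens`, and so on) are made up for illustration.

```python
import tensorflow as tf

# A 2x3 tensor of zeros with an explicit shape
zeros_23 = tf.zeros([2, 3])

# A template tensor whose shape (and dtype) the *_like operations copy
template = tf.constant([[1, 2], [3, 4]])
ones_copy = tf.ones_like(template)
zeros_copy = tf.zeros_like(template)

# A 3x3 tensor with every element set to 7
sevens = tf.fill([3, 3], 7)

print(zeros_23.shape, ones_copy.shape, zeros_copy.shape, sevens.shape)
```

Here `ones_copy` and `zeros_copy` inherit both the 2x2 shape and the integer data type of `template`.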


### Defining and initializing variables
* Unlike a constant, a variable's value can change during computation
* The value of a variable is **shared**, **persistent**, and **modifiable**.
* A variable's **data type and shape are fixed**.

In [10]:
# import tensorflow as tf

# Define a variable
a0 = tf.Variable([1, 2, 3, 4, 5, 6], dtype=tf.float32)
a1 = tf.Variable([1, 2, 3, 4, 5, 6], dtype=tf.int16)

# Define a constant
b = tf.constant(2, tf.float32)

# Compute their product
c0 = tf.multiply(a0, b)
c1 = a0 * b

In [11]:
print(c0.numpy())
print(c1.numpy())

[ 2.  4.  6.  8. 10. 12.]
[ 2.  4.  6.  8. 10. 12.]


* Note that certain TensorFlow operations, such as `tf.multiply`, are overloaded, which allows us to use the simpler `a0*b` expression instead.
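
To make "modifiable, but with a fixed data type and shape" concrete, the sketch below updates a variable in place with `assign()` and `assign_add()`, then tries to assign a value of a different shape. The variable name `v` is illustrative, and the exact exception type is an assumption, so the code catches both likely candidates.

```python
import tensorflow as tf

# A variable holds a value that can be updated in place
v = tf.Variable([1.0, 2.0, 3.0], dtype=tf.float32)

# Overwrite the value, then add to it element-wise
v.assign([4.0, 5.0, 6.0])
v.assign_add([1.0, 1.0, 1.0])
print(v.numpy())

# The shape is fixed: assigning a 2-element value to a 3-element variable fails
shape_error = False
try:
    v.assign([1.0, 2.0])
except (ValueError, tf.errors.InvalidArgumentError):
    shape_error = True
print("Shape mismatch rejected:", shape_error)
```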

#### Exercises: Defining data as constants
Throughout this course, we will use `tensorflow` version 2.6.0 and will exclusively import the submodules needed to complete each exercise. This will usually be done for you, but you will do it in this exercise by importing `constant` from `tensorflow`.

After you have imported `constant`, you will use it to transform a `numpy` array, `credit_numpy`, into a `tensorflow` constant, `credit_constant`. This array contains feature columns from a dataset on credit card holders and is previewed in the image below. We will return to this dataset in later chapters.

Note that `tensorflow` 2 allows you to use data as either a `numpy` array or a `tensorflow` `constant` object. Using a constant will ensure that any operations performed with that object are done in `tensorflow`.

```
# Import constant from TensorFlow
from tensorflow import constant

# Convert the credit_numpy array into a tensorflow constant
credit_constant = constant(credit_numpy)

# Print constant datatype
print('\n The datatype is:', credit_constant.dtype)

# Print constant shape
print('\n The shape is:', credit_constant.shape)
```

#### Exercises: Defining variables
Unlike a constant, a variable's value can be modified. This will be useful when we want to train a model by updating its parameters.

Let's try defining and printing a variable. We'll then convert the variable to a `numpy` array, print again, and check for differences. Note that `Variable()`, which is used to create a variable tensor, has been imported from `tensorflow` and is available to use in the exercise.

```
# Define the 1-dimensional variable A1
A1 = Variable([1, 2, 3, 4])

# Print the variable A1
print('\n A1: ', A1)

# Convert A1 to a numpy array and assign it to B1
B1 = A1.numpy()

# Print B1
print('\n B1: ', B1)
```

### Basic operations
* TensorFlow has a model of computation that revolves around the use of graphs
* A TensorFlow graph contains edges and nodes, where the edges are tensors and the nodes are operations

<img src='data/tf_operation_flow.png' width="400" height="200" align="center"/>

### Applying the addition operator
* We first import the constant and add operations so that we may now define 0-, 1-, and 2-dimensional tensors. 

In [12]:
# Import constant and add from tensorflow
# from tensorflow import constant, add

# Define 0-dimensional tensors
A0 = constant([1])
B0 = constant([2])

# Define 1-dimensional tensors
A1 = constant([1, 2])
B1 = constant([3, 4])

# Define 2-dimensional tensors
A2 = constant([[1, 2], [3, 4]])
B2 = constant([[5, 6], [7, 8]])

### Applying the addition operator
* Finally, let's add them together using the operation for tensor addition
* Note that we can perform scalar addition with `A0` and `B0`, vector addition with `A1` and `B1`, and matrix addition with `A2` and `B2`
* The `add()` operation performs **element-wise addition** with two tensors
* **Element-wise addition requires that both tensors have the same shape:**
    * Scalar addition: 1 + 2 = 3
    * Vector addition: [1, 2] + [3, 4] = [4, 6]
    * Matrix addition:

```
A = [[1, 2],
     [3, 4]]
B = [[5, 6], 
     [7, 8]]
A + B = [[6, 8],
         [10,12]]
```
* Furthermore, the `add()` operator is **overloaded**
    * We can also perform addition using the plus symbol

In [13]:
# Perform tensor addition with add()
C0 = add(A0, B0)
C1 = add(A1, B1)
C2 = add(A2, B2)

In [14]:
print(C0.numpy())
print(C1.numpy())
print(C2.numpy())

[3]
[4 6]
[[ 6  8]
 [10 12]]
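
Because `add()` is overloaded, the plus symbol should give the same result. A minimal sketch that checks this for the 2-dimensional case:

```python
import tensorflow as tf

A2 = tf.constant([[1, 2], [3, 4]])
B2 = tf.constant([[5, 6], [7, 8]])

# Element-wise addition with the explicit operation and with the + symbol
C_add = tf.add(A2, B2)
C_plus = A2 + B2

# True when every element of the two results matches
same = bool(tf.reduce_all(tf.equal(C_add, C_plus)).numpy())
print(same)
```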


### How to perform multiplication in TensorFlow
* We will consider both element-wise and matrix multiplication
* **Element-wise multiplication** is performed using the `multiply()` operation
    * Tensors involved **must have the same shape**
* **Matrix multiplication** is performed with the `matmul()` operator
    * The `matmul(A, B)` operation multiplies `A` by `B`
    * **Note** that the number of columns of `A` must equal the number of rows of `B`
    
#### Applying the multiplication operators

In [15]:
# Import operators from tensorflow
# from tensorflow import ones, matmul, multiply

# Define tensors
A0 = ones(1)
A31 = ones([3, 1])
A34 = ones([3, 4])
A43 = ones([4, 3])

* What types of operations are valid on these tensors of ones?
    * We can perform element-wise multiplication of any tensor by itself
        * `multiply(A0, A0)`, `multiply(A31, A31)`, and `multiply(A34, A34)`
    * We can perform matrix multiplication with `matmul(A43, A34)`
        * but **not** `matmul(A43, A43)`
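
The shape rule for `matmul()` can be checked directly. In this sketch the valid product has shape 4x4, while the invalid one raises an error in eager mode. The exact exception class is an assumption, so both likely types are caught.

```python
import tensorflow as tf

A34 = tf.ones([3, 4])
A43 = tf.ones([4, 3])

# Valid: A43 has 3 columns and A34 has 3 rows, so the result is 4x4
product = tf.matmul(A43, A34)
print(product.shape)

# Invalid: A43 has 3 columns but A43 has 4 rows
invalid = False
try:
    tf.matmul(A43, A43)
except (tf.errors.InvalidArgumentError, ValueError):
    invalid = True
print("Incompatible shapes rejected:", invalid)
```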
        
### Summing over tensor dimensions
* The `reduce_sum()` operator sums over the dimensions of a tensor
* This can be used to sum over all dimensions of a tensor or just one
    * `reduce_sum(A)` sums over all dimensions of `A`
    * `reduce_sum(A, i)` sums over dimension `i`

In [16]:
# Import operations from tensorflow
# from tensorflow import ones, reduce_sum

# Define a 2x3x4 tensor of ones
F = ones([2, 3, 4])

* If we sum over all elements of `F`, we get 24, since the tensor contains 24 elements, all of which are 1

In [17]:
# Sum over all dimensions
D = reduce_sum(F)

# Sum over dimensions 0, 1, and 2
D0 = reduce_sum(F, 0)
D1 = reduce_sum(F, 1)
D2 = reduce_sum(F, 2)

* If we sum over dimension 0, we get a 3x4 matrix of 2s
* If we sum over dimension 1, we get a 2x4 matrix of 3s
* If we sum over dimension 2, we get a 2x3 matrix of 4s
* In each case, we reduce the size of the tensor by summing over one of its dimensions

In [18]:
print(D)

tf.Tensor(24.0, shape=(), dtype=float32)


In [19]:
print(D0)

tf.Tensor(
[[2. 2. 2. 2.]
 [2. 2. 2. 2.]
 [2. 2. 2. 2.]], shape=(3, 4), dtype=float32)


In [20]:
print(D1)

tf.Tensor(
[[3. 3. 3. 3.]
 [3. 3. 3. 3.]], shape=(2, 4), dtype=float32)


In [21]:
print(D2)

tf.Tensor(
[[4. 4. 4.]
 [4. 4. 4.]], shape=(2, 3), dtype=float32)


#### Exercises: Performing element-wise multiplication
Element-wise multiplication in TensorFlow is performed using two tensors with identical shapes. This is because the operation multiplies elements in corresponding positions in the two tensors. An example of an element-wise multiplication, denoted by the $\odot$ symbol, is shown below:

<img src='data/ex1_matmul.png' width="200" height="100" align="center"/>

In this exercise, you will perform element-wise multiplication, paying careful attention to the shape of the tensors you multiply. Note that `multiply()`, `constant()`, and `ones_like()` have been imported for you.

```
# Define tensors A1 and A23 as constants
A1 = constant([1, 2, 3, 4])
A23 = constant([[1, 2, 3], [1, 6, 4]])

# Define B1 and B23 to have the correct shape
B1 = ones_like(A1)
B23 = ones_like(A23)

# Perform element-wise multiplication
C1 = A1 * B1
C23 = A23 * B23

# Print the tensors C1 and C23
print('\n C1: {}'.format(C1.numpy()))
print('\n C23: {}'.format(C23.numpy()))
```

#### Exercises: Making predictions with matrix multiplication
In later chapters, you will learn to train linear regression models. This process will yield a vector of parameters that can be multiplied by the input data to generate predictions. In this exercise, you will use input data, `features`, and a target vector, `bill`, which are taken from a credit card dataset we will use later in the course.

<img src='data/mat_mult.png' width="400" height="200" align="center"/>

The matrix of input data, `features`, contains two columns: education level and age. The target vector, `bill`, is the size of the credit card borrower's bill.

Since we have not trained the model, you will enter a guess for the values of the parameter vector, `params`. You will then use `matmul()` to perform matrix multiplication of `features` by `params` to generate predictions, `billpred`, which you will compare with `bill`. Note that we have imported `matmul()` and `constant()`.

```
# Define features, params, and bill as constants
features = constant([[2, 24], [2, 26], [2, 57], [1, 37]])
params = constant([[1000], [150]])
bill = constant([[3913], [2682], [8617], [64400]])

# Compute billpred using features and params
billpred = matmul(features, params)

# Compute and print the error
error = bill-billpred
print(error.numpy())
```

### Advanced Operations
* In this lesson, we explore advanced operations:
    * `gradient()`
    * `reshape()`
    * `random()`
    
<img src='data/adv_ops.png' width="400" height="200" align="center"/>

* **`gradient()`:** 
    * We will use this function in conjunction with gradient tape
    * Computes the slope of a function at a point
* **`reshape()`:**
    * Changes the shape of a tensor (e.g. 10x10 to 100x1)
* **`random()`:**
    * Generates a tensor out of randomly-drawn values

#### Finding the optimum 
* In many ML problems, we will need to find the optimum (minimum or maximum) of a function 
    * **Minimum:** Lowest value of a loss function
    * **Maximum:** Highest value of objective function
* We can do this using the `gradient()` operation, which tells us the slope of a function at a point
    * We start this process by passing points to the gradient operation until we find one where the gradient is zero
    * **Optimum:** Find a point where gradient = 0
    * **Minimum:** Change in gradient > 0 (if it is increasing, we have a minimum)
    * **Maximum:** Change in gradient < 0 (if it is decreasing, we have a maximum)
  
<img src='data/fixed_gradient.png' width="400" height="200" align="center"/>

* The plot above shows the function `y = x`; notice that the gradient (the slope at a given point) is constant
* This is not true if we instead consider the function `y = x**2` ($y=x^2$)
    * When `x` is less than 0, `y` decreases when `x` increases
    * When `x` is greater than 0, `y` increases when `x` increases
    * Thus, the gradient is initially negative, but becomes positive for `x` larger than 0.
    * This means that `x = 0` **minimizes** `y`

<img src='data/varying_gradient.png' width="400" height="200" align="center"/>

### Gradients in TensorFlow
* We define `x` as `-1.0`
* We then define `y` as `x**2` *within an instance of gradient tape*.
* **Note** that we apply the `watch()` method to an instance of gradient tape and then pass the variable `x`.
* This will allow us to compute the rate of change of `y` with respect to `x`
* Next, we compute the gradient of `y` with respect to `x` using `tape`, our instance of gradient tape
* **Note that y is the first argument and x is the second**
* As written, the operation computes the slope of `y` at a point

In [23]:
# Import tensorflow under the alias tf
# import tensorflow as tf

# Define x
x = tf.Variable(-1.0)

# Define y within instance of GradientTape
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.multiply(x, x)
    
# Evaluate the gradient of y at x = -1
g = tape.gradient(y, x)
print(g.numpy())

-2.0


* Running the code and printing we find that the slope is -2 at `x = -1`, which means that `y` is initially decreasing in `x`, as seen in the graph above. 
* Much of the differentiation you do in deep learning models will be handled by high level APIs
* However, **gradient tape remains an invaluable tool for building advanced and custom models.**
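
The rule for classifying an optimum (gradient equal to zero, with the sign of the change in gradient deciding between minimum and maximum) can be checked with two nested gradient tapes. This is a sketch rather than part of the course code: the names `inner`, `outer`, `slope`, and `curvature` are illustrative.

```python
import tensorflow as tf

# Candidate optimum of y = x**2
x = tf.Variable(0.0)

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = tf.multiply(x, x)
    # First derivative, computed inside the outer tape so it is recorded
    slope = inner.gradient(y, x)
# Second derivative: the rate of change of the gradient
curvature = outer.gradient(slope, x)

print(slope.numpy(), curvature.numpy())
```

A slope of zero together with a positive `curvature` indicates that `x = 0` is a minimum of `y`.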

### Reshaping images as tensors
* A tool that is particularly useful for image classification problems: **reshaping**
* While some algorithms allow you to exploit the shape of the original image, others require you to `reshape` matrices into vectors before using them as inputs, as shown in the diagram

#### Reshaping a grayscale image
* Below we create a random grayscale image by drawing integers from the grayscale pixel range (0 to 255) and use these to populate a 2x2 matrix; note that `maxval` is exclusive, so the code as written draws values from 0 to 254
* We can then reshape this into a 4x1 vector

In [24]:
# Import tensorflow as alias tf
# import tensorflow as tf

# Generate grayscale image
gray = tf.random.uniform([2, 2], maxval=255, dtype='int32')

# Reshape grayscale image
gray = tf.reshape(gray, [2*2, 1])

<img src='data/reshape_grayscale.png' width="200" height="100"/>

#### How to reshape a color image
* For color images, we generate 3 such matrices to form a 2x2x3 tensor
* We could then reshape the image into a 4x3 tensor, as shown in the diagram

In [25]:
# Import tensorflow as alias tf
# import tensorflow as tf

# Generate color image
color = tf.random.uniform([2, 2, 3], maxval=255, dtype='int32')

# Reshape color image
color = tf.reshape(color, [2*2, 3])
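
To confirm what a reshape does, it helps to print the shapes before and after. The sketch below repeats the color example and also flattens the image into a single column; using `-1` to let TensorFlow infer a dimension is an addition for illustration and is not used in the cells above.

```python
import tensorflow as tf

# Random 2x2 color image with 3 channels
color = tf.random.uniform([2, 2, 3], maxval=255, dtype=tf.int32)

# One row per pixel, one column per channel
color_matrix = tf.reshape(color, [2 * 2, 3])

# A single column; -1 asks TensorFlow to infer the length (12 here)
color_vector = tf.reshape(color, [-1, 1])

print(color.shape, color_matrix.shape, color_vector.shape)
```

Reshaping changes only the arrangement of the elements, so the total number of elements (12) is the same in all three tensors.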

#### Exercises: Reshaping tensors
Later in the course, you will classify images of sign language letters using a neural network. In some cases, the network will take 1-dimensional tensors as inputs, but your data will come in the form of images, which will be either 2- or 3-dimensional tensors, depending on whether they are grayscale or color images.

The figure below shows grayscale and color images of the sign language letter A. The two images have been imported for you and converted to the numpy arrays `gray_tensor` and `color_tensor`. Reshape these arrays into 1-dimensional vectors using the `reshape` operation, which has been imported for you from `tensorflow`. Note that the shape of `gray_tensor` is 28x28 and the shape of `color_tensor` is 28x28x3.

<img src='data/asl_a.png' width="200" height="100" align="center"/>

```
# Reshape the grayscale image tensor into a vector
gray_vector = reshape(gray_tensor, (28*28, 1))

# Reshape the color image tensor into a vector
color_vector = reshape(color_tensor, (28*28*3, 1))
```

#### Exercises: Optimizing with gradients
You are given a loss function, $y = x^2$, which you want to minimize. You can do this by computing the slope using the `GradientTape()` operation at different values of `x`. If the slope is positive, you can decrease the loss by lowering `x`. If it is negative, you can decrease it by increasing `x`. This is how gradient descent works.

<img src='data/varying_gradient.png' width="300" height="150" align="center"/>

In practice, you will use a high level `tensorflow` operation to perform gradient descent automatically. In this exercise, however, you will compute the slope at `x` values of -1, 1, and 0. The following operations are available: `GradientTape()`, `multiply()`, and `Variable()`.

```
def compute_gradient(x0):
    # Define x as a variable with an initial value of x0
    x = Variable(x0)
    with GradientTape() as tape:
        tape.watch(x)
        # Define y using the multiply operation
        y = multiply(x, x)
    # Return the gradient of y with respect to x
    return tape.gradient(y, x).numpy()

# Compute and print gradients at x = -1, 1, and 0
print(compute_gradient(-1.0))
print(compute_gradient(1.0))
print(compute_gradient(0.0))
```
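
The rule described in this exercise (lower `x` when the slope is positive, raise it when the slope is negative) can be turned into a short loop. This is a minimal sketch of plain gradient descent, not the high-level operation used later in the course. The learning rate of 0.1 and the 50 steps are arbitrary choices for illustration.

```python
import tensorflow as tf

# Start away from the minimum of y = x**2
x = tf.Variable(-1.0)
learning_rate = 0.1  # assumed step size

for step in range(50):
    with tf.GradientTape() as tape:
        y = tf.multiply(x, x)
    slope = tape.gradient(y, x)
    # Move x against the slope: up if the slope is negative, down if positive
    x.assign_sub(learning_rate * slope)

print(x.numpy())
```

Each step multiplies `x` by 0.8 (since the slope is `2*x`), so `x` moves steadily toward the minimum at 0.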

#### Exercises: Working with image data
You are given a black-and-white image of a letter, which has been encoded as a tensor, `letter`. You want to determine whether the letter is an X or a K. You don't have a trained neural network, but you do have a simple model, `model`, which can be used to classify `letter`.

The 3x3 tensor, `letter`, and the 1x3 tensor, `model`, are available in the Python shell. You can determine whether `letter` is a K by multiplying `letter` by `model`, summing over the result, and then checking if it is equal to 1. As with more complicated models, such as neural networks, `model` is a collection of weights, arranged in a tensor.

Note that the functions `reshape()`, `matmul()`, and `reduce_sum()` have been imported from `tensorflow` and are available for use.

```
# Reshape model from a 1x3 to a 3x1 tensor
model = reshape(model, (3, 1))

# Multiply letter by model
output = matmul(letter, model)

# Sum over output and print prediction using the numpy method
prediction = reduce_sum(output)
print(prediction.numpy())
```

# $\star$ Chapter 2: Linear models
In this chapter, you will learn how to build, solve, and make predictions with models in TensorFlow 2. You will focus on a simple class of models – the linear regression model – and will try to predict housing prices. By the end of the chapter, you will know how to load and manipulate data, construct loss functions, perform minimization, make predictions, and reduce resource use with batch training.

### Input data
In the previous chapter, we focused on how to perform core TensorFlow operations. In this chapter, we will work towards training a linear model with TensorFlow. So far we've only generated data using functions like `ones` and `random.uniform`; when we train a machine learning model, however, we will want to import data from an external source (whether numeric, image, text, or other data). Beyond simply importing the data, **numeric data will need to be assigned a type, and text and image data will need to be converted to a usable format**. TensorFlow can import data itself, which is useful for complex data pipelines, but that would be unnecessarily complicated for what we do in this chapter.

#### Importing data for use in TensorFlow
* **Data can be imported using `tensorflow`**
    * Useful for managing complex pipelines 
    * Not necessary for this chapter
* **Simpler option used in this chapter**
    * Import data using `pandas`
    * Convert data to `numpy` array
    * Use in `tensorflow` without modification
    
#### How to import and convert data

```
# Import numpy and pandas
# import numpy as np
# import pandas as pd

# Load data from csv
housing = pd.read_csv('kc_housing.csv')

# Convert to numpy array 
housing = np.array(housing)
```

* We will focus on data stored in csv format in this chapter
* pandas also has methods for handling data in other formats
    * e.g. `read_json()`, `read_html()`, `read_excel()`
    
<img src='data/read_csv_params.png' width="400" height="200" align="center"/>

* The only required parameter to `read_csv` is the filepath or buffer
    * **Note:** Instead of a filepath, you can also provide a URL to load data.
* `sep` = delimiter; default is comma
    * **Note that if you do use whitespace as a delimiter, you will also need to set the `delim_whitespace` parameter to `True`** (default is `False`).
* Finally, if you are working with datasets that contain non-ASCII characters, you can specify the appropriate choice of encoding, so that your characters are correctly parsed.
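
The effect of the `sep` parameter can be seen without a file on disk by reading from an in-memory buffer. In this sketch the two rows of data are made up for illustration. For whitespace-delimited text it uses `sep=r'\s+'`, which the pandas documentation describes as equivalent to setting `delim_whitespace=True`.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data
semicolon_text = "price;waterfront\n221900;0\n538000;1\n"
semicolon_df = pd.read_csv(io.StringIO(semicolon_text), sep=';')

# The same hypothetical data, delimited by runs of spaces
space_text = "price waterfront\n221900   0\n538000   1\n"
space_df = pd.read_csv(io.StringIO(space_text), sep=r'\s+')

print(semicolon_df)
print(space_df)
```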

#### Using mixed type datasets
* How to transform imported data for use in TensorFlow

<img src='data/mixed_type_dfs.png' width="400" height="200" align="center"/>

#### Setting the data type
* Let's say we want to perform TensorFlow operations that require `price` to be a 32-bit floating point number and `waterfront` to be a boolean
* We can do this in two ways:
* 1. Array operation from NumPy
    * We select the relevant column in the DataFrame
    * Provide relevant column as first argument to `np.array`
    * Provide datatype as second argument

```
# Load KC dataset
housing = pd.read_csv('kc_housing.csv')

# Convert price column to float32
price = np.array(housing['price'], np.float32)

# Convert waterfront column to Boolean 
# np.bool_ is used because the np.bool alias was removed in recent NumPy releases
waterfront = np.array(housing['waterfront'], np.bool_)
```    
    
* 2. Cast operation from TensorFlow

```
# Load KC dataset
housing = pd.read_csv('kc_housing.csv')

# Convert price column to float32
price = tf.cast(housing['price'], tf.float32)

# Convert waterfront column to Boolean
waterfront = tf.cast(housing['waterfront'], tf.bool)
```
  
* While either `tf.cast` or `np.array` will work, `waterfront` will be a `tf.Tensor` under the former option, and a numpy array under the latter.
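
The difference between the two approaches shows up in the type of the result. The sketch below uses a tiny, made-up DataFrame in place of the King County file. Calling `.to_numpy()` before `tf.cast` is an optional choice made here to keep the conversion explicit.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical stand-in for the housing data
housing = pd.DataFrame({'price': [221900, 538000], 'waterfront': [0, 1]})

# Option 1: NumPy arrays with an explicit data type
price_np = np.array(housing['price'], np.float32)
waterfront_np = np.array(housing['waterfront'], np.bool_)

# Option 2: TensorFlow tensors produced by tf.cast
price_tf = tf.cast(housing['price'].to_numpy(), tf.float32)
waterfront_tf = tf.cast(housing['waterfront'].to_numpy(), tf.bool)

print(type(price_np).__name__, price_np.dtype)
print(type(price_tf).__name__, price_tf.dtype)
```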

#### Load data using pandas
Before you can train a machine learning model, you must first import data. There are several valid ways to do this, but for now, we will use a simple one-liner from `pandas`: `pd.read_csv()`. Recall from the video that the first argument specifies the path or URL. All other arguments are optional.

In this exercise, you will import the King County housing dataset, which we will use to train a linear model later in the chapter.

```
# Import pandas under the alias pd
import pandas as pd

# Assign the path to a string variable named data_path
data_path = 'kc_house_data.csv'

housing = pd.read_csv(data_path)

# Print the price column of housing
print(housing.price)
```

### Loss functions
* Loss functions play a fundamental role in ML and are a fundamental `tensorflow` operation
    * Used to train a model
    * Measure of model fit
* **Higher value $\Rightarrow$ worse fit**
    * Minimize the loss function (usually)
        * But in some cases we may also want to maximize a loss function instead (much less common)
        * If this is the case, we can always place a minus sign before the function we want to maximize and minimize it instead
        * For this reason, we will always talk about **loss functions and minimization** (because even in the off chance we want to maximize a function, we can always just use the trick mentioned above)
        
#### Common loss functions in TensorFlow
* TensorFlow operations for common loss functions include:
    * Mean squared error (MSE)
    * Mean absolute error (MAE)
    * Huber error
* **Loss functions are accessible from the `tf.keras.losses` module**
    * `tf.keras.losses.mse()`
    * `tf.keras.losses.mae()`
    * `tf.keras.losses.Huber()`
* The loss tells us to what degree our predictions are accurate
* Below we plot the MSE, MAE, and Huber loss for error values between -2 and 2

<img src='data/common_loss_funcs.png' width="400" height="200" align="center"/>

#### MSE
* Strongly penalizes outliers
* High (gradient) sensitivity near minimum 

#### MAE
* Scales linearly with size of error
* Low sensitivity near minimum 

#### Huber
* Similar to MSE near minimum
* Similar to MAE away from minimum


* **For greater sensitivity near the minimum, you will want to use the MSE or Huber loss.**
* **To minimize the impact of outliers, you will want to use the MAE or Huber loss.**
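
The differences between the three losses can be seen numerically by evaluating each one at a small error and at a large error. This is a sketch with made-up error values of 0.5 and 2.0, and it assumes the Huber threshold `delta=1.0`.

```python
import tensorflow as tf

huber = tf.keras.losses.Huber(delta=1.0)  # assumed threshold of 1.0

results = {}
for error in [0.5, 2.0]:
    # A single target of 0 and a prediction that misses by `error`
    target = tf.constant([0.0])
    prediction = tf.constant([error])
    results[error] = {
        'mse': float(tf.keras.losses.mse(target, prediction)),
        'mae': float(tf.keras.losses.mae(target, prediction)),
        'huber': float(huber(target, prediction)),
    }
    print(error, results[error])
```

For the small error the Huber loss is half the MSE (quadratic behavior), while for the large error it grows linearly like the MAE, so the large error is penalized far less than under the MSE.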

#### Defining a loss function
* Let's say we decide to use the MSE loss:

```
import tensorflow as tf

# Compute the MSE loss
loss = tf.keras.losses.mse(targets, predictions)
```
* In many cases, the training process will require us to supply a function that accepts our model's variables and data and returns a loss
* Here we'll first define a model, `linear_regression`, which takes the intercept, slope, and features as arguments and returns the model's predictions

```
# Define a linear regression model
def linear_regression(intercept, slope = slope, features = features):
    return intercept + features*slope
```

* Next we'll define a loss function called `loss_function` that accepts the slope and intercept of a linear model (the variables) and the input data (the targets and the features).
* It then makes a prediction and computes and returns the associated MSE loss

```
# Define a loss function to compute the MSE
def loss_function(intercept, slope, targets = targets, features = features):
    # Compute the predictions for the linear model
    # Pass features through so that non-default data (e.g. test data) is actually used
    predictions = linear_regression(intercept, slope, features)
    
    # Return the loss
    return tf.keras.losses.mse(targets, predictions)
```

* **Note that we've defined both functions to use default argument values for features and targets.**
* We will do this whenever we train on the full sample to simplify the code
* Also notice that we've nested TensorFlow's MSE loss function within a function that first uses the model to make predictions and then uses those predictions as an input to the MSE loss function
* We can then evaluate this function for a given set of parameter values and input data

```
# Compute the loss for test data inputs
loss_function(intercept, slope, test_targets, test_features)
```
* Note that if we had omitted the data arguments, `test_targets` and `test_features`, the loss function would have instead used the default `targets` and `features` arguments we set to evaluate model performance.
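
The snippets above rely on `features`, `targets`, and the test data already being defined. The following self-contained sketch fills them in with small, made-up tensors so the pattern can be run end to end. All data values are hypothetical, and the default for `slope` is dropped for simplicity.

```python
import tensorflow as tf

# Hypothetical training data that follows targets = 2 * features
features = tf.constant([1.0, 2.0, 3.0])
targets = tf.constant([2.0, 4.0, 6.0])

def linear_regression(intercept, slope, features=features):
    return intercept + features * slope

def loss_function(intercept, slope, targets=targets, features=features):
    # Predict with the supplied features, then score against the targets
    predictions = linear_regression(intercept, slope, features)
    return tf.keras.losses.mse(targets, predictions)

# Default (training) data: a perfect fit and a poor fit
perfect = loss_function(0.0, 2.0)
poor = loss_function(0.0, 1.0)

# Hypothetical test data passed explicitly
test_features = tf.constant([4.0, 5.0])
test_targets = tf.constant([8.0, 10.0])
test_loss = loss_function(0.0, 2.0, test_targets, test_features)

print(perfect.numpy(), poor.numpy(), test_loss.numpy())
```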

#### Exercises: Loss functions in TensorFlow

Compute the loss using data from the King County housing dataset. You are given a target, price, which is a tensor of house prices, and predictions, which is a tensor of predicted house prices. You will evaluate the loss function and print out the value of the loss.

```
# Import the keras module from tensorflow
from tensorflow import keras

# Compute the mean squared error (mse)
loss = keras.losses.mse(price, predictions)

# Print the mean squared error (mse)
print(loss.numpy())
```

#### Exercises: Modifying the loss function
In the previous exercise, you defined a `tensorflow` loss function and then evaluated it once for a set of actual and predicted values. In this exercise, you will compute the loss within another function called `loss_function()`, which first generates predicted values from the data and variables. The purpose of this is to construct a function of the trainable model variables that returns the loss. You can then repeatedly evaluate this function for different variable values until you find the minimum. In practice, you will pass this function to an optimizer in `tensorflow`. Note that `features` and `targets` have been defined and are available. Additionally, `Variable`, `float32`, and `keras` are available.

```
# Initialize a variable named scalar
# (dtype is passed by keyword; the second positional argument of Variable is trainable)
scalar = Variable(1.0, dtype=float32)

# Define the model
def model(scalar, features = features):
    return scalar * features

# Define a loss function
def loss_function(scalar, features = features, targets = targets):
    # Compute the predicted values
    predictions = model(scalar, features)

    # Return the mean absolute error loss
    return keras.losses.mae(targets, predictions)

# Evaluate the loss function and print the loss
print(loss_function(scalar).numpy())
```

### Linear Regression
* A **linear regression model assumes a linear relationship:**
    * $price = intercept + size * slope + error$
* **This is an example of a univariate regression.**
    * There is only one feature, `size`.
* **Multiple regression models have more than one feature**
    * e.g. `size` and `location`
    
```
# Define the targets and features 
price = np.array(housing['price'], np.float32)
size = np.array(housing['sqft_living'], np.float32)

# Define the intercept and slope
intercept = tf.Variable(0.1, dtype=tf.float32)
slope = tf.Variable(0.1, dtype=tf.float32)
```
* A univariate linear regression identifies the relationship between a single feature and the target tensor.

#### Exercises: Multiple linear regression
In most cases, performing a univariate linear regression will not yield a model that is useful for making accurate predictions. In this exercise, you will perform a multiple regression, which uses more than one feature.

You will use `price_log` as your target and `size_log` and `bedrooms` as your features. Each of these tensors has been defined and is available. You will also switch from using the mean squared error loss to the mean absolute error loss: `keras.losses.mae()`. Finally, the predicted values are computed as follows: `params[0] + feature1*params[1] + feature2*params[2]`. Note that we've defined a vector of parameters, `params`, as a variable, rather than using three variables. Here, `params[0]` is the intercept and `params[1]` and `params[2]` are the slopes.
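
The prediction formula and the mean absolute error loss can be sketched in plain NumPy as follows. The array values are hypothetical stand-ins for `size_log`, `bedrooms`, and `price_log`, and the helper names `predict` and `mae_loss` are assumptions, not part of the exercise.

```python
import numpy as np

def predict(params, feature1, feature2):
    # params[0] is the intercept; params[1] and params[2] are the slopes
    return params[0] + feature1 * params[1] + feature2 * params[2]

def mae_loss(params, targets, feature1, feature2):
    # Mean absolute error between targets and predictions
    return float(np.mean(np.abs(targets - predict(params, feature1, feature2))))

# Hypothetical values standing in for size_log, bedrooms, and price_log
size_log = np.array([7.0, 7.5, 8.0])
bedrooms = np.array([2.0, 3.0, 4.0])
price_log = np.array([12.0, 13.0, 14.0])

params = np.array([0.0, 1.0, 2.0])
example_loss = mae_loss(params, price_log, size_log, bedrooms)
print(example_loss)
```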

### Batch training
* We've now learned how to train a linear model to predict house prices.
* Now we will use batch training to handle large datasets

#### What is batch training?
* Let's pretend the KC Housing dataset is much larger, and we want to perform the training on a GPU, which has only a small amount of memory
* **Since you can't fit the entire dataset in memory, you will instead divide it into batches and then train on those batches sequentially.**

<img src='data/batch_training.png' width="600" height="300" align="center"/>

* A single pass over all of the batches is called an **epoch** and the process itself is called **batch training**.
* **Batch training** is extremely useful when working with large image datasets
* **Batch training will also allow you to update model weights and optimizer parameters after each batch, rather than at the end of an epoch.**

#### The chunksize parameter
* `pd.read_csv()` allows us to load data in batches
    * To avoid loading the entire dataset at once, the **`chunksize`** parameter specifies the batch size.
    
* Below, instead of loading the data in a single one-liner, we'll write a for loop that iterates through the data in steps of 100 examples
* Each 100 will be available as a batch, which we can use to extract columns, such as `price` and `size` in the housing dataset
* We can then convert these to numpy arrays and use them to train
* Being able to load data from CSV files in fixed-size batches using pandas allows us to handle datasets of tens or even hundreds of gigabytes without exceeding the memory constraints of our system.

```
# Import pandas and numpy
import pandas as pd
import numpy as np

# Load data in batches
for batch in pd.read_csv('kc_housing.csv', chunksize=100):
    # Extract price column
    price = np.array(batch['price'], np.float32)
    
    # Extract size column
    size = np.array(batch['size'], np.float32)
```
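
The behavior of `chunksize` can be checked without a large file. The sketch below builds a hypothetical 250-row CSV in memory and confirms that `pd.read_csv()` yields DataFrames of at most 100 rows each. The column values are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

# Build a small in-memory CSV with 250 rows (a stand-in for a large file)
rows = ["price,size"] + [f"{1000 + i},{50 + i}" for i in range(250)]
csv_file = io.StringIO("\n".join(rows))

batch_sizes = []
for batch in pd.read_csv(csv_file, chunksize=100):
    # Each batch is a DataFrame holding at most 100 rows
    price = np.array(batch["price"], np.float32)
    size = np.array(batch["size"], np.float32)
    batch_sizes.append(len(price))

print(batch_sizes)
```

The last batch is smaller than the others whenever the number of rows is not a multiple of `chunksize`.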

### Training a linear model in batches

```
# Define trainable variables
intercept = tf.Variable(0.1, dtype=tf.float32)
slope = tf.Variable(0.1, dtype=tf.float32)

# Define the model
def linear_regression(intercept, slope, features):
    return intercept + features * slope
    
# Compute predicted values and return loss function
def loss_function(intercept, slope, targets, features):
    predictions = linear_regression(intercept, slope, features)
    return tf.keras.losses.mse(targets, predictions)
    
# Define optimization operation
opt = tf.keras.optimizers.Adam()
```
#### The next step is to train the model in batches
* We do this, once again, by using a for loop and supplying a `chunksize` to the `read_csv()` function
* Note that we take each batch,
    * separate it into features and a target, 
    * convert those into numpy arrays,
    * and then pass them to the minimize operation
* Within the minimize operation, we pass the loss function as a lambda function and we supply a variable list that contains only the trainable parameters (`intercept` and `slope`).
* This loop will continue until we have stepped through all of the examples in the file read by `read_csv()`
* **Importantly, we did not ever need to have more than 100 examples in memory during the entire process.**
* Finally, we print our trained intercept and slope

```
# Load the data in batches from pandas
for batch in pd.read_csv('kc_housing.csv', chunksize=100):
    # Extract the target and feature columns
    price_batch = np.array(batch['price'], np.float32)
    size_batch = np.array(batch['lot_size'], np.float32)
    
    # Minimize the loss function
    opt.minimize(lambda: loss_function(intercept, slope, price_batch, size_batch), var_list=[intercept, slope])

# Print parameter values
print(intercept.numpy(), slope.numpy())
```

* **Note that we did not use default argument values for input data. This is because our input data was generated in batches during the training process.**

<img src='data/fullsample_vs_batch.png' width="400" height="200" align="center"/>

* In later chapters, we'll automate batch training by using high-level APIs
* Importantly, however, high-level APIs will not typically load the sample in batches by default, as we have done here (above).
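
To show what a per-batch update does under the hood, here is a minimal NumPy sketch of batch training for a univariate linear model. It uses plain gradient descent on the batch MSE rather than Adam, and the synthetic data, learning rate, and number of epochs are assumptions chosen so that the example converges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: targets = 3 + 2 * features, without noise
features = rng.uniform(0.0, 1.0, size=1000)
targets = 3.0 + 2.0 * features

intercept, slope = 0.1, 0.1
learning_rate = 0.1

for epoch in range(200):
    # Step through the data in batches of 100 examples
    for start in range(0, len(features), 100):
        x = features[start:start + 100]
        y = targets[start:start + 100]
        error = (intercept + slope * x) - y
        # Gradients of the batch MSE with respect to each parameter
        grad_intercept = 2.0 * np.mean(error)
        grad_slope = 2.0 * np.mean(error * x)
        # Update the parameters after every batch, not once per epoch
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope

print(intercept, slope)
```

Only one slice of 100 examples is used in each update, which mirrors the way the pandas loop above never needs the full dataset at once.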

#### Exercises: Preparing to batch train
Before we can train a linear model in batches, we must first define variables, a loss function, and an optimization operation. In this exercise, we will prepare to train a model that will predict `price_batch`, a batch of house prices, using `size_batch`, a batch of lot sizes in square feet. In contrast to the previous lesson, we will do this by loading batches of data using `pandas`, converting it to `numpy` arrays, and then using it to minimize the loss function in steps.

Note that you should not set default argument values for either the model or loss function, since we will generate the data in batches during the training process.

```
# Define the intercept and slope
intercept = Variable(10.0, dtype=float32)
slope = Variable(0.5, dtype=float32)

# Define the model
def linear_regression(intercept, slope, features):
    # Define the predicted values
    return intercept + slope * features

# Define the loss function
def loss_function(intercept, slope, targets, features):
    # Define the predicted values
    predictions = linear_regression(intercept, slope, features)

    # Define the MSE loss
    return keras.losses.mse(targets, predictions)
```

#### Exercises: Training a linear model in batches
In this exercise, we will train a linear regression model in batches, starting where we left off in the previous exercise. We will do this by stepping through the dataset in batches and updating the model's variables, `intercept` and `slope`, after each step. This approach will allow us to train with datasets that are otherwise too large to hold in memory.

Note that the loss function, `loss_function(intercept, slope, targets, features)`, has been defined for you. Additionally, `keras` has been imported for you and `numpy` is available as `np`. The trainable variables should be entered into `var_list` in the order in which they appear as loss function arguments.

```
# Initialize Adam optimizer
opt = keras.optimizers.Adam()

# Load data in batches
for batch in pd.read_csv('kc_house_data.csv', chunksize=100):
	size_batch = np.array(batch['sqft_lot'], np.float32)

	# Extract the price values for the current batch
	price_batch = np.array(batch['price'], np.float32)

	# Complete the loss, fill in the variable list, and minimize
	opt.minimize(lambda: loss_function(intercept, slope, price_batch, size_batch), var_list=[intercept, slope])

# Print trained parameters
print(intercept.numpy(), slope.numpy())
```

# $\star$ Chapter 3: Neural Networks
The previous chapters taught you how to build models in TensorFlow 2. In this chapter, you will apply those same tools to build, train, and make predictions with neural networks. You will learn how to define dense layers, apply activation functions, select an optimizer, and apply regularization to reduce overfitting. You will take advantage of TensorFlow's flexibility by using both low-level linear algebra and high-level Keras API operations to define and train models.

### Dense layers
* The dense layer is a frequently used component of neural networks
* Example dataset: UCI credit card defaults

<img src='data/uci_linreg.png' width="400" height="200" align="center"/>

#### Neural networks
* So how do we get from linear regression to a neural network?

<img src='data/uci_nn_map.png' width="400" height="200" align="center"/>

* Each hidden layer node takes our two inputs, multiplies them by their respective weights, and sums them together
* We also typically pass the hidden layer output to an activation function.
* Finally, we sum together the outputs of the hidden layers to compute our prediction for credit card default
* The entire process of generating a prediction is referred to as **forward propagation**
* In this chapter we will construct NNs with only 3 types of layers: an input layer, some number of hidden (dense) layers, and an output layer

<img src='data/simple_nn.png' width="400" height="200" align="center"/>

* Input layer: features
* Output layer: prediction
* Each hidden layer takes inputs from the previous layer, applies numerical weights to them, sums them together, and then applies an activation function
* In the NN map above, each hidden layer is a dense, or fully-connected layer
    * **A dense layer applies weights to *all* nodes from the previous layer.**

#### A simple dense layer
* We'll first define a constant tensor that contains the marital status and age data as the input layer
* We then initialize weights as a variable, since we will train those weights to predict the output from the inputs
* We also define a bias, which will play a similar role to the intercept in the linear regression model.
* Finally, we define a dense layer; note that we first perform a matrix multiplication of the inputs by the weights and assign that to the tensor named product
* We then add product to the bias and apply a non-linear transformation, in this case the sigmoid function
    * This is called the **activation function**
* Note that the bias is not associated with a feature and is analogous to the intercept in a linear regression
* Note that TensorFlow also comes with higher level operations, such as `tf.keras.layers.Dense`, which allows us to skip the linear algebra

```
import tensorflow as tf

# Define inputs (features)
inputs = tf.constant([[1.0, 35.0]])

# Define weights
weights = tf.Variable([[-0.05], [-0.01]])

# Define the bias
bias = tf.Variable([0.5])

# Multiply inputs (features) by the weights
product = tf.matmul(inputs, weights)

# Define dense layer
dense = tf.keras.activations.sigmoid(product + bias)
```

#### Defining a complete model
* **Note that, by default, a bias will be included.**
* Note that we've also passed inputs as an argument to the first dense layer
* Note that the second dense layer takes the first dense layer as an argument and also reduces the number of nodes
* The output layer reduces this again to one node.

```
import tensorflow as tf

# Define input (features) layer; `data` is assumed to hold the feature matrix
inputs = tf.constant(data, tf.float32)

# Define first dense layer
dense1 = tf.keras.layers.Dense(10, activation='sigmoid')(inputs)

# Define second dense layer
dense2 = tf.keras.layers.Dense(5, activation='sigmoid')(dense1)

# Define output (predictions) layer
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dense2)
```

#### High-level vs. low-level approach

* **High-level approach:**
    * Complex operations are wrapped in high-level API operations
    * High-level APIs such as `Keras` and `Estimators`
        * reduce the amount of code needed
    * Weights and the mathematical operations will typically be hidden by the layer constructor
    
* **Low-level approach:**
    * Linear-algebraic operations
        * Allow for the construction of any model
    
**TensorFlow allows us to use either approach or even combine them.**
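
As a small illustration of combining the two approaches, the sketch below builds a dense layer with the high-level API and then reproduces its output with explicit linear algebra, reusing the layer's own `kernel` and `bias` variables. The input values are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Hypothetical input: one borrower with two features (marital status, age)
inputs = tf.constant([[1.0, 35.0]], tf.float32)

# High-level approach: the layer creates and stores its own weights and bias
layer = tf.keras.layers.Dense(2, activation='sigmoid')
high_level = layer(inputs)

# Low-level approach: reuse the layer's weights in explicit linear algebra
low_level = tf.sigmoid(tf.matmul(inputs, layer.kernel) + layer.bias)

print(np.allclose(high_level.numpy(), low_level.numpy()))
```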

<img src='data/hi_lo_approach.png' width="600" height="300" align="center"/>

#### Exercises: The linear algebra of dense layers
There are two ways to define a dense layer in `tensorflow`. The first involves the use of low-level, linear algebraic operations. The second makes use of high-level `keras` operations. In this exercise, we will use the first method to construct the network shown in the image below.

<img src='data/dense_im.png' width="300" height="150" align="center"/>

The input layer contains 3 features -- education, marital status, and age -- which are available as `borrower_features`. The hidden layer contains 2 nodes and the output layer contains a single node.

For each layer, you will take the previous layer as an input, initialize a set of weights, compute the product of the inputs and weights, and then apply an activation function. Note that `Variable()`, `ones()`, `matmul()`, and `keras` have been imported from `tensorflow`.

In [32]:
borrower_features = np.array([[2., 2., 43.]], np.float32)

In [38]:
# Initialize bias1
bias1 = Variable(1.0)

# Initialize weights1 as 3x2 variable of ones
weights1 = tf.Variable(ones((3, 2)))

# Perform matrix multiplication of borrower_features and weights1
product1 = matmul(borrower_features, weights1)

# Apply sigmoid activation function to product1 + bias1
dense1 = tf.keras.activations.sigmoid(product1 + bias1)

# Print shape of dense1
print("\n dense1's output shape: {}".format(dense1.shape))


 dense1's output shape: (1, 2)


In [40]:
# Initialize bias2 and weights2
bias2 = Variable(1.0)
weights2 = Variable(ones((2, 1)))

# Perform matrix multiplication of dense1 and weights2
product2 = matmul(dense1, weights2)

# Apply activation to product2 + bias2 and print the prediction
prediction = tf.keras.activations.sigmoid(product2 + bias2)
print('\n prediction: {}'.format(prediction.numpy()[0,0]))
print('\n actual: 1')


 prediction: 0.9525741338729858

 actual: 1
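
The prediction above can be verified by hand. The NumPy sketch below repeats the same arithmetic with the exercise's values (weights of ones and biases of 1.0). The `sigmoid` helper is defined here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same values as the exercise: three features, weights of ones, biases of 1.0
borrower_features = np.array([[2.0, 2.0, 43.0]])
weights1, bias1 = np.ones((3, 2)), 1.0
weights2, bias2 = np.ones((2, 1)), 1.0

# Hidden layer: each node sees 2 + 2 + 43 + 1 = 48, so sigmoid(48) is about 1
dense1 = sigmoid(borrower_features @ weights1 + bias1)

# Output layer: sigmoid(1 + 1 + 1) = sigmoid(3)
prediction = sigmoid(dense1 @ weights2 + bias2)
print(prediction[0, 0])
```

The result is sigmoid(3), which is approximately 0.9526 and agrees with the value printed by the TensorFlow cell.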


#### Exercises: The low-level approach with multiple examples
In this exercise, we'll build further intuition for the low-level approach by constructing the first dense hidden layer for the case where we have multiple examples. We'll assume the model is trained and the first layer weights, `weights1`, and bias, `bias1`, are available. We'll then perform matrix multiplication of the `borrower_features` tensor by the `weights1` variable. Recall that the `borrower_features` tensor includes education, marital status, and age. Finally, we'll apply the sigmoid function to the elements of `products1 + bias1`, yielding `dense1`.

<img src='data/prod1.png' width="400" height="200" align="center"/>

Note that `matmul()` and `keras` have been imported from `tensorflow`.

In [41]:
# Compute the product of borrower_features and weights1
products1 = matmul(borrower_features, weights1)

# Apply a sigmoid activation function to products1 + bias1
dense1 = tf.keras.activations.sigmoid(products1 + bias1)

# Print the shapes of borrower_features, weights1, bias1, and dense1
print('\n shape of borrower_features: ', borrower_features.shape)
print('\n shape of weights1: ', weights1.shape)
print('\n shape of bias1: ', bias1.shape)
print('\n shape of dense1: ', dense1.shape)


 shape of borrower_features:  (1, 3)

 shape of weights1:  (3, 2)

 shape of bias1:  ()

 shape of dense1:  (1, 2)


#### Exercises: Using the dense layer operation
We've now seen how to define dense layers in `tensorflow` using linear algebra. In this exercise, we'll skip the linear algebra and let `keras` work out the details. This will allow us to construct the network below, which has 2 hidden layers and 10 features, using less code than we needed for the network with 1 hidden layer and 3 features.

<img src='data/ex_dense.png' width="400" height="200" align="center"/>

To construct this network, we'll need to define three dense layers, each of which takes the previous layer as an input, multiplies it by weights, and applies an activation function. Note that input data has been defined and is available as a 100x10 tensor: `borrower_features`. Additionally, the `keras.layers` module is available.

In [44]:
# Define the first dense layer
dense1 = tf.keras.layers.Dense(7, activation='sigmoid')(borrower_features)

# Define a dense layer with 3 output nodes
dense2 = tf.keras.layers.Dense(3, activation='sigmoid')(dense1)

# Define a dense layer with 1 output node
predictions = tf.keras.layers.Dense(1, activation='sigmoid')(dense2)

# Print the shapes of dense1, dense2, and predictions
print('\n shape of dense1: ', dense1.shape)
print('\n shape of dense2: ', dense2.shape)
print('\n shape of predictions: ', predictions.shape)


 shape of dense1:  (1, 7)

 shape of dense2:  (1, 3)

 shape of predictions:  (1, 1)


## Activation functions
* A typical hidden layer consists of two operations:
    * 1) **Linear:** Matrix multiplication
    * 2) **Non-linear:** Activation function
    
#### Why nonlinearities are important: a simple example
* Below we assume that the weight on age is 1 and the weight on bill amount is 2
* Note that ages are divided by 100 and the bill's amount is divided by 10,000
* We then perform the matrix multiplication step for all combinations of features

In [45]:
# Define example borrower features
young, old = 0.3, 0.6
low_bill, high_bill = 0.1, 0.5

# Apply matrix multiplication step for all feature combinations
young_high = 1.0 * young + 2.0 * high_bill
young_low = 1.0 * young + 2.0 * low_bill
old_high = 1.0 * old + 2.0 * high_bill
old_low = 1.0 * old + 2.0 * low_bill

* If we don't apply an activation function and we assume the bias is zero, we find that the impact of bill size on default does not depend on age:

In [46]:
# Difference in default predictions for young
print(young_high-young_low)

# Difference in default predictions for old
print(old_high-old_low)

0.8
0.8


* Note that our target is a binary variable that equals 1 when the borrower defaults. Predictions, however, will be real numbers between 0 and 1, where values over 0.5 are treated as predicting default.
* But what if we applied a sigmoid activation function?

In [47]:
# Difference in default predictions for young
print(tf.keras.activations.sigmoid(young_high).numpy() - tf.keras.activations.sigmoid(young_low).numpy())

# Difference in default predictions for old
print(tf.keras.activations.sigmoid(old_high).numpy()-tf.keras.activations.sigmoid(old_low).numpy())

0.16337562
0.14204395


In the above example, the impact of bill amount on default now depends on the borrower's age. In particular, we can see that the change in the predicted value for default is larger for young borrowers than it is for old borrowers.
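
A short derivation, offered as a sketch, explains this result. The slope of the sigmoid is

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)),$$

which is largest at $z = 0$ and shrinks as $z$ moves away from zero. Raising the bill from low to high moves the young borrower's input from $0.5$ to $1.3$ and the old borrower's input from $0.8$ to $1.6$. Both intervals have width $0.8$, but the second lies farther from zero, where the curve is flatter, so

$$\sigma(1.3) - \sigma(0.5) \approx 0.163 > \sigma(1.6) - \sigma(0.8) \approx 0.142.$$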

* In this course, we'll use the three most common activation functions: **sigmoid**, **relu**, and **softmax**

#### Sigmoid activation function
   * Used primarily in the output layer of binary classification problems
   * Low-level: `tf.keras.activations.sigmoid()`
   * High-level: `sigmoid`
   
#### ReLU activation function
   * Rectified linear unit (ReLU)
   * Typically used in all layers other than the output layer 
   * Low-level: `tf.keras.activations.relu()`
   * High-level: `relu`
   
#### Softmax activation function
   * Output layer (classification problems with >2 classes)
   * Outputs can be interpreted as predicted class probabilities in multiclass classification problems
   * Low-level: `tf.keras.activations.softmax()`
   * High-level: `softmax`
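
For reference, the three activation functions can be written in a few lines of NumPy. This is an illustrative sketch of the underlying formulas, not the TensorFlow implementation, and the input `z` is a made-up example.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Keeps positive values and replaces negative values with zero
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability, then normalize
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

z = np.array([[-1.0, 0.0, 2.0]])
print(sigmoid(z), relu(z), softmax(z))
```

Each row of the softmax output sums to 1, which is why it can be read as a set of class probabilities.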

#### Activation functions in neural networks

```
# Define input layer
inputs = tf.constant(borrower_features, tf.float32)

# Define dense layer 1
dense1 = tf.keras.layers.Dense(16, activation='relu')(inputs)

# Define dense layer 2
dense2 = tf.keras.layers.Dense(8, activation='sigmoid')(dense1)

# Define output layer
outputs = tf.keras.layers.Dense(4, activation='softmax')(dense2)
```

#### Binary classification problems
In this exercise, you will again make use of credit card data. The target variable, `default`, indicates whether a credit card holder defaults on his or her payment in the following period. Since there are only two options--default or not--this is a binary classification problem. While the dataset has many features, you will focus on just three: the size of the three latest credit card bills. Finally, you will compute predictions from your untrained network, `outputs`, and compare those to the target variable, `default`.

The tensor of features has been loaded and is available as `bill_amounts`. Additionally, the `constant()`, `float32`, and `keras.layers.Dense()` operations are available.

## Optimizers

* **Stochastic Gradient Descent** or **SGD** is an improved version of gradient descent that is less likely to get stuck in local minima.
* For simple problems, the SGD algorithm performs well
* (Below) Adam and RMS Prop require 10x as many iterations to achieve a similar loss

<img src='data/opt_performance.png' width="400" height="200" align="center"/>

#### Stochastic gradient descent optimizer
   * `tf.keras.optimizers.SGD()`
   * `learning_rate`: typically between 0.5 and 0.001, which will determine how quickly the model parameters adjust during training
   * **Simple and easy to interpret** (more so than most modern optimization algorithms)
   
#### RMS propagation optimizer 
   * **Root mean squared (RMS) propagation optimizer**
       * `tf.keras.optimizers.RMSprop()`
       * Applies different learning rates to each feature, which can be useful for high dimensional problems
       * **`momentum`:** allows it to build momentum
       * **`decay`**: setting a low value for the decay parameter will prevent momentum from accumulating over long periods during the training process
       
#### The adam optimizer
   * **Adaptive moment estimation (Adam) optimizer**
       * Generally a good first choice
       * Provides further improvements over RMS propagation
       * `learning_rate`
       * `beta_1`: similar to RMS prop, you can set the momentum to decay faster by lowering the `beta_1` parameter
   * **Performs well with default parameter values** (especially in comparison with RMS prop)
   
```
# Define the model function
def model(bias, weights, features = borrower_features):
    product = tf.matmul(features, weights)
    return tf.keras.activations.sigmoid(product+bias)
    
# Compute the predicted values and loss
def loss_function(bias, weights, targets = default, features = borrower_features):
    predictions = model(bias, weights, features)
    return tf.keras.losses.binary_crossentropy(targets, predictions)
    
# Minimize the loss function with RMS propagation
opt = tf.keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.9)
opt.minimize(lambda: loss_function(bias, weights), var_list=[bias, weights])
```


#### Exercises: The dangers of local minima
Consider the plot of the following loss function, `loss_function()`, which contains a global minimum, marked by the dot on the right, and several local minima, including the one marked by the dot on the left.

<img src='data/local_minima.png' width="400" height="200" align="center"/>

In this exercise, you will try to find the global minimum of `loss_function()` using `keras.optimizers.SGD()`. You will do this twice, each time with a different initial value of the input to `loss_function()`. First, you will use `x_1`, which is a variable with an initial value of 6.0. Second, you will use `x_2`, which is a variable with an initial value of 0.3. Note that `loss_function()` has been defined and is available.

```
# Initialize x_1 and x_2
x_1 = Variable(6.0, dtype=float32)
x_2 = Variable(0.3, dtype=float32)

# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)

for j in range(100):
	# Perform minimization using the loss function and x_1
	opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
	# Perform minimization using the loss function and x_2
	opt.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
```
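
The sensitivity to the starting point can be reproduced without TensorFlow. The sketch below runs plain (non-stochastic) gradient descent on a hypothetical double-well loss, which is not the `loss_function()` from the exercise, from two different initial values. The function, learning rate, and step count are all assumptions for illustration.

```python
def loss_function(x):
    # Hypothetical loss with a global minimum near x = -1.04
    # and a local minimum near x = 0.96
    return (x**2 - 1.0)**2 + 0.3 * x

def gradient(x):
    # Analytical derivative of loss_function
    return 4.0 * x * (x**2 - 1.0) + 0.3

def gradient_descent(x, learning_rate=0.01, steps=1000):
    # Repeatedly step against the gradient
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

x_left = gradient_descent(-2.0)
x_right = gradient_descent(2.0)
print(x_left, loss_function(x_left))
print(x_right, loss_function(x_right))
```

Starting from 2.0, the procedure settles in the shallower well on the right, even though the well on the left has a lower loss.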

```
# Initialize x_1 and x_2
x_1 = Variable(0.05, dtype=float32)
x_2 = Variable(0.05, dtype=float32)

# Define the optimization operation for opt_1 and opt_2
opt_1 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.99)
opt_2 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.0)

for j in range(100):
	opt_1.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Define the minimization operation for opt_2
	opt_2.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
```

### Training a network in TensorFlow
* Finding the global minimum can be difficult, even when we're minimizing a simple loss function
* We also saw that we could improve our chances by selecting better initial values for variables
* But what can we do for more challenging problems with many variables?
* The eggholder function, for example, has many local minima:

<img src='data/eggholder_function.png' width="700" height="350" align="center"/>

* It may be difficult to see a global minimum on the plot above, but it does have one.
* How can we select initial values for `x` and `y`, the two inputs to the eggholder function?
* Even worse, what if we have a loss function that depends on hundreds of variables?

#### Random initializers 
* We often need to initialize hundreds or thousands of variables 
    * `tf.ones()` may perform poorly (and simply will not work for many datasets)
    * It is tedious, and often infeasible, to initialize variables individually
* **Alternatively, draw initial values from distribution**
    * Random or algorithmic generation of initial values
    * We can draw them from a probability distribution
        * Normal
        * Uniform 
        * Glorot initializers (designed for ML algorithms)
        
#### Initializing variables in TensorFlow: low-level approach
* We can draw initial values from the random normal distribution or, alternatively, from the **truncated random normal distribution, which discards very large and very small draws**

```
# Define 500x500 random normal variable
weights = tf.Variable(tf.random.normal([500,500]))

# Define 500x500 truncated random normal variable
weights = tf.Variable(tf.random.truncated_normal([500,500]))
```

#### Initializing variables in TensorFlow: high-level approach
* We can also use the high-level approach by initializing a dense layer using the default keras option, currently the glorot uniform initializer, as we've done in all exercises thus far.
* If we instead wish to initialize values to zero, we can do this using the `kernel_initializer` parameter

```
# Define a dense layer with the default initializer
dense = tf.keras.layers.Dense(32, activation='relu')

# Define a dense layer with the zeros initializer
dense = tf.keras.layers.Dense(32, activation='relu', kernel_initializer='zeros')
```
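
For intuition about the default initializer, the Glorot uniform scheme draws weights from a uniform distribution on $[-limit, limit]$ with $limit = \sqrt{6 / (fan\_in + fan\_out)}$. The NumPy sketch below illustrates that formula. The helper name `glorot_uniform` and the 23x7 shape are assumptions, and this is not the Keras implementation itself.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Draw from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out)), limit

rng = np.random.default_rng(0)
weights, limit = glorot_uniform(23, 7, rng)
print(weights.shape, limit)
```

Because the limit shrinks as the layer gets wider, the scale of the initial weights adapts to the number of inputs and outputs.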

#### Neural networks and overfitting
* Overfitting is especially problematic for neural networks, which contain many parameters and are quite good at memorization
* A simple solution is to use **dropout**
    * This will force your network to develop more robust rules for classification, since it cannot rely on any particular nodes being passed to an activation function
    * This will tend to improve out-of-sample performance
    
```
# Define input data
inputs = np.array(borrower_features, np.float32)

# Define dense layer 1
dense1 = tf.keras.layers.Dense(32, activation='relu')(inputs)

# Define dense layer 2
dense2 = tf.keras.layers.Dense(16, activation='relu')(dense1)

# Apply a dropout operation
dropout1 = tf.keras.layers.Dropout(0.25)(dense2)

# Define output layer
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dropout1)
```
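
What a dropout operation does to its inputs during training can be sketched in NumPy: each element is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so that the expected value is unchanged. The helper `dropout` and the array of ones are assumptions for illustration. In Keras, the `Dropout` layer only drops values when it is called in training mode.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Zero each element with probability `rate`, then rescale the survivors
    # by 1 / (1 - rate) so the expected value of each element is unchanged
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 16))
dropped = dropout(activations, rate=0.25, rng=rng)

# Fraction of elements set to zero; close to 0.25 for a large array
print(np.mean(dropped == 0.0))
```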

#### Exercises: Initialization in TensorFlow
A good initialization can reduce the amount of time needed to find the global minimum. In this exercise, we will initialize weights and biases for a neural network that will be used to predict credit card default decisions. To build intuition, we will use the low-level, linear algebraic approach, rather than making use of convenience functions and high-level `keras` operations. We will also expand the set of input features from 3 to 23. Several operations have been imported from `tensorflow`: `Variable()`, `ones()`, and the `random` module.

In [52]:
from tensorflow import random

In [53]:
# Define the layer 1 weights
w1 = tf.Variable(tf.random.normal([23, 7]))

# Initialize the layer 1 bias
b1 = tf.Variable(ones([7]))

# Define the layer 2 weights
w2 = tf.Variable(random.normal([7, 1]))

# Define the layer 2 bias
b2 = tf.Variable(0.0, dtype=np.float32)

#### Exercises: Defining the model and loss function
In this exercise, you will train a neural network to predict whether a credit card holder will default. The features and targets you will use to train your network are available in the Python shell as `borrower_features` and `default`. You defined the weights and biases in the previous exercise.

Note that the `predictions` layer is defined as $\sigma(layer1 * w2 + b2)$, where $\sigma$ is the sigmoid activation, `layer1` is a tensor of nodes for the first hidden dense layer, `w2` is a tensor of weights, and `b2` is the bias tensor.

The trainable variables are `w1`, `b1`, `w2`, and `b2`. Additionally, the following operations have been imported for you: `keras.activations.relu()` and `keras.layers.Dropout()`.

```
# Define the model
def model(w1, b1, w2, b2, features = borrower_features):
    # Apply relu activation functions to layer 1
    layer1 = keras.activations.relu(matmul(features, w1) + b1)
    # Apply dropout rate of 0.25
    dropout = keras.layers.Dropout(0.25)(layer1)
    return keras.activations.sigmoid(matmul(dropout, w2) + b2)

# Define the loss function
def loss_function(w1, b1, w2, b2, features = borrower_features, targets = default):
    # Pass the features through so the loss uses the data it was given
    predictions = model(w1, b1, w2, b2, features)
    # Pass targets and predictions to the cross entropy loss
    return keras.losses.binary_crossentropy(targets, predictions)
```

#### Exercises: Training neural networks with TensorFlow
In the previous exercise, you defined a model, `model(w1, b1, w2, b2, features)`, and a loss function, `loss_function(w1, b1, w2, b2, features, targets)`, both of which are available to you in this exercise. You will now train the model and then evaluate its performance by predicting default outcomes in a test set, which consists of `test_features` and `test_targets` and is available to you. The trainable variables are `w1`, `b1`, `w2`, and `b2`. Additionally, the following operations have been imported for you: `keras.activations.relu()` and `keras.layers.Dropout()`.

```
# Train the model
for j in range(100):
    # Complete the optimizer
	opt.minimize(lambda: loss_function(w1, b1, w2, b2), 
                 var_list=[w1, b1, w2, b2])

# Make predictions with model using test features
model_predictions = model(w1, b1, w2, b2, test_features)

# Construct the confusion matrix
confusion_matrix(test_targets, model_predictions)
```
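
Because the model outputs probabilities, a confusion matrix requires converting them to class labels first. The sketch below shows one way to do this with a 0.5 threshold. The helper `confusion_matrix_sketch` and the example values are hypothetical and are not the `confusion_matrix()` function supplied in the exercise.

```python
import numpy as np

def confusion_matrix_sketch(targets, probabilities, threshold=0.5):
    # Convert predicted probabilities into 0/1 class labels
    labels = (np.asarray(probabilities).ravel() > threshold).astype(int)
    targets = np.asarray(targets).ravel().astype(int)
    # Rows index the actual class, columns index the predicted class
    matrix = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(targets, labels):
        matrix[actual, predicted] += 1
    return matrix

# Hypothetical targets and model outputs
test_targets = [0, 0, 1, 1, 1]
model_predictions = [0.2, 0.7, 0.9, 0.4, 0.6]
print(confusion_matrix_sketch(test_targets, model_predictions))
```

The diagonal entries count correct predictions, and the off-diagonal entries count false positives and false negatives.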

# $\star$ Chapter 4: High Level APIs
In the final chapter, you'll use high-level APIs in TensorFlow 2 to train a sign language letter classifier. You will use both the sequential and functional Keras APIs to train, validate, make predictions with, and evaluate models. You will also learn how to use the Estimators API to streamline the model definition and training process, and to avoid errors.

## Defining neural networks with Keras
* In this lesson, we will introduce the Keras sequential API and expand on our introduction of the Keras functional API
* A good way to construct a model in Keras is to use the sequential API

### The sequential API
* This API is simpler and makes strong assumptions about how you will construct your model 
* It assumes you have:
    * An input layer
    * Some number of hidden layers
    * An output layer
* All of these layers are ordered one after the other in a sequence
* Once we have defined the model object, we can simply "stack" layers on top of it sequentially using the `add` method
* If we want to check our model's architecture, we can use the `summary()` method
* **The model has now been defined, but it is not yet ready to be trained.**
* We must first perform a **compilation step**, where we specify the optimizer and loss function.

```
# Import tensorflow
from tensorflow import keras

# Define a sequential model
model = keras.Sequential()

# Define first hidden layer
model.add(keras.layers.Dense(16, activation='relu', input_shape=(28*28,)))

# Define second hidden layer
model.add(keras.layers.Dense(8, activation='relu'))

# Define output layer
model.add(keras.layers.Dense(4, activation='softmax'))

# Compile the model
model.compile('adam', loss='categorical_crossentropy')

# Summarize the model
print(model.summary())
```

* **Categorical crossentropy:** for use with classification problems that have more than two classes and one-hot encoded labels
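
As a small worked example, the categorical crossentropy for a single observation is the negative log of the probability the model assigns to the true class. The sketch below computes it by hand with NumPy; the label and the predicted probabilities are made-up values, not course data.

```python
import numpy as np

# One example with four classes; the true class is index 2 (one-hot encoded)
y_true = np.array([0.0, 0.0, 1.0, 0.0])
# Hypothetical softmax output from a model
y_pred = np.array([0.1, 0.2, 0.6, 0.1])

# Categorical crossentropy: -sum(y_true * log(y_pred))
loss = -np.sum(y_true * np.log(y_pred))
print(round(float(loss), 4))
```

Only the term for the true class contributes, so the loss here is `-log(0.6)`; a more confident correct prediction would push it toward zero.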

<img src='data/functional_API.png' width="700" height="350" align="center"/>

* **But what if we want to train two models jointly to predict the same target?**
    * Answer: use the functional API, which is designed for this case
* For example, say we have a set of 28x28 images and a set of 10 features of metadata.
* We want to use both to predict the image's class, but restrict how they interact in our model
* We'll start by using the Keras `Input` operation to define the input shapes for model 1 and model 2
* Next we define layer 1 and layer 2 as dense layers for model 1
    * **Note that we have to pass the previous layer as an argument if we use the functional API, but did not with the sequential**
* We then define layers 1 and 2 for model 2
* We then use the `add` layer in Keras to combine the outputs in a layer that merges the two models
* **Finally we define a functional model.**
    * **As inputs, it takes both the model 1 and model 2 inputs**
    * **As outputs it takes the merged layer**
    * The only thing left to do is compile the model and train    

```
# Import tensorflow
import tensorflow as tf

# Define model 1 input layer shape
model1_inputs = tf.keras.Input(shape=(28*28,))

# Define model 2 input layer shape
model2_inputs = tf.keras.Input(shape=(10,))

# Define layer 1 for model 1
model1_layer1 = tf.keras.layers.Dense(12, activation='relu')(model1_inputs)

# Define layer 2 for model 1
model1_layer2 = tf.keras.layers.Dense(4, activation='softmax')(model1_layer1)

# Define layer 1 for model 2
model2_layer1 = tf.keras.layers.Dense(8, activation='relu')(model2_inputs)

# Define layer 2 for model 2
model2_layer2 = tf.keras.layers.Dense(4, activation='softmax')(model2_layer1)

# Merge model 1 and model 2
merged = tf.keras.layers.add([model1_layer2, model2_layer2])

# Define a functional model
model = tf.keras.Model(inputs=[model1_inputs, model2_inputs], outputs=merged)

# Compile the model
model.compile('adam', loss='categorical_crossentropy')
```
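
To illustrate the remaining training step, the following sketch fits a two-input functional model on randomly generated data. The arrays `images`, `metadata`, and `labels` are hypothetical stand-ins for real data, and the model is rebuilt under new names so the block runs on its own.

```python
import numpy as np
import tensorflow as tf

# Hypothetical random data: 32 flattened images, 32 metadata rows, 4 classes
rng = np.random.default_rng(0)
images = rng.random((32, 784)).astype("float32")
metadata = rng.random((32, 10)).astype("float32")
labels = tf.keras.utils.to_categorical(rng.integers(0, 4, size=32), num_classes=4)

# Rebuild a small two-branch model in the same style as above
m1_in = tf.keras.Input(shape=(784,))
m2_in = tf.keras.Input(shape=(10,))
m1_out = tf.keras.layers.Dense(4, activation='softmax')(
    tf.keras.layers.Dense(12, activation='relu')(m1_in))
m2_out = tf.keras.layers.Dense(4, activation='softmax')(
    tf.keras.layers.Dense(8, activation='relu')(m2_in))
merged_out = tf.keras.layers.add([m1_out, m2_out])
joint_model = tf.keras.Model(inputs=[m1_in, m2_in], outputs=merged_out)
joint_model.compile('adam', loss='categorical_crossentropy')

# Inputs are passed as a list, in the same order as in tf.keras.Model(inputs=...)
joint_model.fit([images, metadata], labels, epochs=1, verbose=0)
predictions = joint_model.predict([images, metadata], verbose=0)
print(predictions.shape)
```

Because the merged layer adds two softmax outputs, each row of `predictions` sums to roughly 2 rather than 1, which is worth keeping in mind when interpreting the outputs as probabilities.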

#### Exercises: The sequential model in Keras
In chapter 3, we used components of the `keras` API in `tensorflow` to define a neural network, but we stopped short of using its full capabilities to streamline model definition and training. In this exercise, you will use the `keras` sequential model API to define a neural network that can be used to classify images of sign language letters. You will also use the `.summary()` method to print the model's architecture, including the shape and number of parameters associated with each layer.

Note that the images were reshaped from (28, 28) to (784,), so that they could be used as inputs to a dense layer. Additionally, note that `keras` has been imported from `tensorflow` for you.

```
# Define a Keras sequential model
model = keras.Sequential()

# Define the first dense layer
model.add(keras.layers.Dense(16, activation='relu', input_shape=(784,)))

# Define the second dense layer
model.add(keras.layers.Dense(8, activation='relu'))

# Define the output layer
model.add(keras.layers.Dense(4, activation='softmax'))

# Print the model architecture
print(model.summary())
```
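
The parameter counts reported by `.summary()` can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output node. The sketch below reproduces that arithmetic for the layer sizes used in this exercise; the variable names are illustrative.

```python
# Layer sizes from the exercise: 784 inputs -> 16 -> 8 -> 4
layer_sizes = [784, 16, 8, 4]

# A dense layer has (inputs * outputs) weights plus one bias per output node
param_counts = [n_in * n_out + n_out
                for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(param_counts)
print(param_counts, total_params)
```

Almost all of the parameters sit in the first layer, because it connects every one of the 784 pixels to each of its 16 nodes.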

#### Exercises: Compiling a sequential model
In this exercise, you will work towards classifying letters from the Sign Language MNIST dataset; however, you will adopt a different network architecture than what you used in the previous exercise. There will be fewer layers, but more nodes. You will also apply dropout to prevent overfitting. Finally, you will compile the model to use the `adam` optimizer and the `categorical_crossentropy` loss. You will also use a method in `keras` to summarize your model's architecture. Note that `keras` has been imported from `tensorflow` for you and a sequential `keras` model has been defined as `model`.

```
# Define the first dense layer
model.add(keras.layers.Dense(16, activation='sigmoid', input_shape=(784,)))

# Apply dropout to the first layer's output
model.add(keras.layers.Dropout(0.25))

# Define the output layer
model.add(keras.layers.Dense(4, activation='softmax'))

# Compile the model
model.compile('adam', loss='categorical_crossentropy')

# Print a model summary
print(model.summary())
```
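
For intuition about what the dropout layer does during training, here is a minimal NumPy sketch: it zeroes a random fraction of the activations and rescales the remaining ones by `1 / (1 - rate)` so that the expected total is unchanged. The function `dropout_sketch` is a hypothetical illustration, not the Keras implementation, and at prediction time Keras leaves the activations untouched.

```python
import numpy as np

def dropout_sketch(activations, rate, rng):
    """Zero out roughly `rate` of the entries and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
# Hypothetical output of a 16-node layer for a batch of 4 examples
layer_output = np.ones((4, 16))
dropped = dropout_sketch(layer_output, rate=0.25, rng=rng)
print(np.unique(dropped))
```

With a rate of 0.25, every surviving activation is multiplied by `1 / 0.75`, so the output contains only zeros and scaled copies of the original values.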

#### Exercises: Defining a multiple input model
In some cases, the sequential API will not be sufficiently flexible to accommodate your desired model architecture and you will need to use the functional API instead. If, for instance, you want to train two models with different architectures jointly, you will need to use the functional API to do this. In this exercise, we will see how to do this. We will also use the `.summary()` method to examine the joint model's architecture.

Note that `keras` has been imported from `tensorflow` for you. Additionally, the input layers of the first and second models have been defined as `m1_inputs` and `m2_inputs`, respectively. Note that the two models have the same architecture, but one of them uses a `sigmoid` activation in the first layer and the other uses a `relu`.

```
# For model 1, pass the input layer to layer 1 and layer 1 to layer 2
m1_layer1 = keras.layers.Dense(12, activation='sigmoid')(m1_inputs)
m1_layer2 = keras.layers.Dense(4, activation='softmax')(m1_layer1)

# For model 2, pass the input layer to layer 1 and layer 1 to layer 2
m2_layer1 = keras.layers.Dense(12, activation='relu')(m2_inputs)
m2_layer2 = keras.layers.Dense(4, activation='softmax')(m2_layer1)

# Merge model outputs and define a functional model
merged = keras.layers.add([m1_layer2, m2_layer2])
model = keras.Model(inputs=[m1_inputs, m2_inputs], outputs=merged)

# Print a model summary
print(model.summary())
```

### Training and validation with Keras
* Whenever we train and evaluate a model in TensorFlow, we typically use the same set of steps

#### Overview of training and evaluation
   1. Load and clean data
   2. Define model
   3. Train and validate model
   4. Evaluate model 

#### How to train a model

```
# Import tensorflow
import tensorflow as tf

# Define a sequential model
model = tf.keras.Sequential()

# Define the hidden layer
model.add(tf.keras.layers.Dense(16, activation='relu', input_shape=(784,)))

# Define the output layer
model.add(tf.keras.layers.Dense(4, activation='softmax'))

# Compile model
model.compile('adam', loss='categorical_crossentropy')

# Train model
model.fit(image_features, image_labels)
```

#### The fit() operation
* Notice that we only supplied two arguments to fit: features and labels
* Only required arguments:
    * `features`
    * `labels`
* However, there are also many optional arguments, including:
    * `batch_size`
    * `epochs`
    * `validation_split`
    
### Batch size and epochs

<img src='data/batch_vs_epoch.png' width="700" height="350" align="center"/>

* **Batch size:** the number of examples in each batch; `32` by default
* **Epochs:** The number of times you train on the **full** set of batches is referred to as the number of epochs.
* In the image above, the batch size is 5 and the number of epochs is 2.
* **Using multiple epochs allows the model to revisit the same batches, but with different model weights and possibly optimizer parameters, since they are updated after each batch.**
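
The relationship between batch size, epochs, and the number of weight updates is simple arithmetic. The sketch below uses the batch size and epoch count from the figure; the dataset size of 20 examples is an assumption for illustration.

```python
import math

n_examples = 20   # assumed dataset size, not stated in the figure
batch_size = 5
epochs = 2

# The final batch may be smaller than batch_size, hence the ceiling
batches_per_epoch = math.ceil(n_examples / batch_size)
# The weights are updated once per batch
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)
```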

#### Validation split
<img src='data/val_split.png' width="400" height="200" align="center"/>

* **`validation_split`:** divides the dataset into two parts, as shown above
* The first part is the training set and the second part is the validation set
* For example, setting a value of `0.2` will put 20% of the data in the validation set
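
Keras takes the validation examples from the end of the arrays you pass to `fit()`, before any shuffling. The NumPy sketch below mimics that split on toy data; it is an approximation of the behavior under that assumption, not the library's own code, and the arrays are made up.

```python
import numpy as np

# Toy data: 10 examples with one feature each
features = np.arange(10).reshape(10, 1)
labels = np.arange(10)
validation_split = 0.2

# The last 20% of the rows become the validation set
n_validation = int(len(features) * validation_split)
split_at = len(features) - n_validation
train_features, val_features = features[:split_at], features[split_at:]
train_labels, val_labels = labels[:split_at], labels[split_at:]
print(len(train_features), len(val_features))
```

One practical consequence: if the data is sorted (for example, by class), it is a good idea to shuffle it yourself before calling `fit()` with `validation_split`.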

```
# Train model with validation split
model.fit(features, labels, epochs=10, validation_split=0.20)
```
* The benefit of using a validation split is that you can see how your model performs on both the data it was trained on, *the training set*, **and** a separate dataset it was not trained on, *the validation set*. 
* Below, we can see the first 10 epochs of training:

<img src='data/valsplit_perf.png' width="400" height="200" align="center"/>

* Notice that we can see the training loss and validation loss separately
* Recall that if the training loss becomes substantially lower than the validation loss, this is an indication that we're overfitting
* **We should either terminate the training process before that point or add regularization or dropout.**
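
One way to terminate training automatically is the `EarlyStopping` callback in Keras, which watches the validation loss and stops once it has not improved for a set number of epochs. The sketch below runs on random data; the data, the `patience` value, and the name `demo_model` are assumptions chosen for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical random data standing in for the image features and labels
rng = np.random.default_rng(0)
demo_features = rng.random((200, 784)).astype("float32")
demo_labels = tf.keras.utils.to_categorical(rng.integers(0, 4, size=200), num_classes=4)

demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax'),
])
demo_model.compile('adam', loss='categorical_crossentropy')

# Stop if the validation loss fails to improve for 2 consecutive epochs
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=2, restore_best_weights=True)

history = demo_model.fit(demo_features, demo_labels, epochs=20,
                         validation_split=0.20,
                         callbacks=[early_stopping], verbose=0)
epochs_run = len(history.history['val_loss'])
print(epochs_run)
```

The `history.history` dictionary holds the per-epoch training and validation losses, so the same object can be used to plot the two curves against each other.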

### Changing the metric
* Another benefit of the high-level keras API is that we can swap less informative metrics, such as the loss, for ones that are more easily interpretable, such as the share of accurately classified examples. 
* We then apply fit to the model again with the same settings

```
# Recompile the model with the accuracy metric
model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train model with validation split
model.fit(features, labels, epochs=10, validation_split=0.20)
```

#### The evaluate() operation
* Finally, it's a good idea to split off a test set before you begin to train and validate

<img src='data/train_test_val.png' width="400" height="200" align="center"/>

* You can then use the `evaluate` operation to check performance on the test set at the end of the training process

```
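# Evaluate the test set; `test` is assumed here to be a dataset of (features, labels) batches.
# With separate arrays, pass the test features and test labels as two arguments instead.
model.evaluate(test)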
```
* Since you may tune model parameters in response to validation set performance, using a separate test set will provide you with further assurance that you have not overfitted.
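
The full train, validate, and test workflow can be sketched end to end as follows. The data is randomly generated and the 80/20 split point and the name `demo_model` are assumptions; with real data you would hold out the test set in the same way before any training.

```python
import numpy as np
import tensorflow as tf

# Hypothetical random data standing in for the image features and labels
rng = np.random.default_rng(0)
all_features = rng.random((100, 784)).astype("float32")
all_labels = tf.keras.utils.to_categorical(rng.integers(0, 4, size=100), num_classes=4)

# Split off the last 20 examples as a test set before training
train_features, test_features = all_features[:80], all_features[80:]
train_labels, test_labels = all_labels[:80], all_labels[80:]

demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax'),
])
demo_model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train and validate on the training portion only
demo_model.fit(train_features, train_labels, epochs=2,
               validation_split=0.20, verbose=0)

# With a compiled metric, evaluate() returns the loss followed by the metric
test_loss, test_accuracy = demo_model.evaluate(test_features, test_labels, verbose=0)
print(test_loss, test_accuracy)
```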

## Training models with the Estimators API
* The high-level **Estimators API** was elevated in importance in TensorFlow 2.0
* The **Estimators API** is a **high-level TensorFlow submodule**.
* Relative to the core, lower-level TensorFlow APIs and the high-level Keras API, model building in the Estimators API is **less flexible**.
    * This is because **it enforces a set of best practices by placing restrictions on model architecture and training.**
* The upside of using the Estimators API is that it allows for **faster deployment.**
* Models can be specified, trained, evaluated, and deployed with **less code.**
* Also, there are many **pre-made models** that can be instantiated by setting a handful of model parameters

### Model specification and training with the Estimators API
   1. **Define feature columns**, which specify the shape and type of your data
   2. **Load and transform your data** within a function
       * The output of this function will be a dictionary object of features and your labels
   3. **Define an estimator**
       * Here, we'll use premade estimators, but you can also define custom estimators with different architectures
   4. Apply train operation
       * Note that all model objects created through the Estimators API have `train`, `evaluate`, and `predict` operations.
       
       
#### Defining feature columns
* Note that when defining the numeric feature column, we supplied the dictionary key, **`size`**; we will do this for each feature column we create.
* For example, you may also want a *categorical* feature column for the number of rooms using `feature_column.categorical_column_with_vocabulary_list`.
* We can then merge these into a list of feature columns

```
# Import tensorflow under its standard alias
import tensorflow as tf

# Define a numeric feature column
size = tf.feature_column.numeric_column("size")

# Define a categorical feature column (integer vocabulary, matching the integer room counts in the data)
rooms = tf.feature_column.categorical_column_with_vocabulary_list("rooms", [1, 2, 3, 4, 5])

# Create feature column list; the categorical column is wrapped in an indicator
# column because DNN estimators only accept dense feature columns
features_list = [size, tf.feature_column.indicator_column(rooms)]
```

* **Note:** Alternatively, if we were using the sign language MNIST dataset, we'd define a list containing a single vector of features:

```
# Define a vector-valued feature column
features_list = [tf.feature_column.numeric_column('image', shape=(784,))]
```

#### Loading and transforming
* We next need to define a function that transforms our data, puts the features in a dictionary, and returns both the features and labels.

```
# Define input data function
def input_fn():
    # Define feature dictionary
    features = {"size":[1340,1690,2720], "rooms":[1, 3, 4]}
    # Define labels
    labels = [221900, 538000, 180000]
    return features, labels
```
* Note that we've simply taken three examples from the housing dataset for the sake of illustration
* Using them, we've defined a dictionary with the keys "size" and "rooms", which map to the feature columns we've defined
* Next we define a list or array of labels, which give the price of the house in this case, and then return the features and labels
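
Since the dictionary keys must match the names given to the feature columns, a quick consistency check on the input function can catch mismatches before training. The check below is a plain Python sketch, not part of the Estimators API; `feature_keys` is a hypothetical list holding the names used when the feature columns were defined.

```python
# Input function with the three illustrative housing examples
def input_fn():
    features = {"size": [1340, 1690, 2720], "rooms": [1, 3, 4]}
    labels = [221900, 538000, 180000]
    return features, labels

# Names used when the feature columns were defined (assumed here)
feature_keys = ["size", "rooms"]

features, labels = input_fn()
# Every feature column needs a matching key in the dictionary
missing_keys = [key for key in feature_keys if key not in features]
# Every feature needs one value per label
lengths_match = all(len(values) == len(labels) for values in features.values())
print(missing_keys, lengths_match)
```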

#### Define and train a regression estimator
* We can now define and train the estimator, but before we do that, we have to define what estimator we actually want to train
* If we're predicting house prices, we may want to use a deep neural network with a regression head using **`estimator.DNNRegressor`**.
* This allows us to **predict a continuous target.**

```
# Define a deep neural network regression
model0 = tf.estimator.DNNRegressor(feature_columns=features_list, hidden_units=[10,6,6,3])

# Train the regression model
model0.train(input_fn, steps = 20)
```
* Note that all we had to supply was the list of feature columns and the number of nodes in each hidden layer; the rest is handled automatically
* We then apply the train function, supply our input function, and train for 20 steps

#### Define and train a deep neural network (classifier)
* Alternatively, if we want to instead perform a classification task with a deep neural network, we just need to change the estimator to **`estimator.DNNClassifier`**, add the number of classes, **`n_classes`**, and then train again

```
# Define a deep neural network classifier
model1 = tf.estimator.DNNClassifier(feature_columns=features_list, hidden_units=[32, 16, 8], n_classes=4)

# Train the classifier
model1.train(input_fn, steps=20)
```
* You can also use:
    * Linear classifiers
    * Boosted trees
    * [Other common options](https://www.tensorflow.org/guide/estimators)

```
# Assumes numpy (np) is imported and a pandas DataFrame named `housing` is loaded
from tensorflow import feature_column, estimator

# Define feature columns for bedrooms and bathrooms
bedrooms = feature_column.numeric_column("bedrooms")
bathrooms = feature_column.numeric_column("bathrooms")

# Define the list of feature columns
feature_list = [bedrooms, bathrooms]

def input_fn():
    # Define the labels
    labels = np.array(housing['price'])
    # Define the features
    features = {'bedrooms': np.array(housing['bedrooms']),
                'bathrooms': np.array(housing['bathrooms'])}
    return features, labels

# Define the model and set the number of steps
model = estimator.DNNRegressor(feature_columns=feature_list, hidden_units=[2,2])
model.train(input_fn, steps=1)
```