# Intro to TensorFlow 2.6.0 in Python

Isaiah Hull is a senior economist in the research division at Sweden's Central Bank (Sveriges Riksbank) and the author of Machine Learning for Economics and Finance in TensorFlow 2. He holds a PhD in economics from Boston College and conducts research on computational economics, machine learning, and quantum computing.

TensorFlow Background
- What is it? Developed by Google. Open source library for graph-based numerical computation

Low and high level APIs
- addition, multiplication, differentiation
- ML Models

Changes in TensorFlow 2.0
- Eager execution by default - simpler, intuitive code
- model building with Keras and Estimators

What is a tensor?
- Generalization of vectors and matrices
- Tensor = collection of numbers in a particular shape
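- Analogy: cut a slice of bread into pieces. A single piece is a 0D tensor, a row of pieces is 1D, the whole slice is 2D, and a loaf of slices is 3D.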

# Defining tensors in TensorFlow

In [None]:
import tensorflow as tf

# 0D Tensor (0 dimensional): an empty shape gives a scalar
# (a shape of (1,) would instead give a 1D tensor with one element)
d0 = tf.ones(())

# 1D Tensor (1 dimensional)
d1 = tf.ones((2,))

# 2D Tensor (2 dimensional)
d2 = tf.ones((2,2))

# 3D Tensor (3 dimensional)
d3 = tf.ones((2,2,2))

In [None]:
# print 3D tensor
print(d3.numpy())

## Defining constants in TensorFlow
- a constant is the simplest category of tensor
- not trainable
- can have any dimension
- Note that TensorFlow 2 lets you supply data as either a NumPy array or a TensorFlow constant. Using a constant ensures that any operations performed with that object are done in TensorFlow.


In [None]:
from tensorflow import constant

# define a 2x3 constant (of 3s)
a = constant(3, shape=[2,3])

# define a 2x2 constant
b = constant([1,2,3,4], shape=[2,2])

## Use convenience functions to define certain types of constants

- operation # example
- tf.constant() # constant([1,2,3])
- tf.zeros() # zeros([2,2]) - tensor of 0s with an arbitrary shape
- tf.zeros_like() # zeros_like(input_tensor)
- tf.ones() # ones([2,2])
- tf.ones_like() # ones_like(input_tensor)
- tf.fill() # fill([3,3], 7)
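
The convenience functions listed above can be tried out as follows. This is a minimal sketch; `input_tensor` and the other variable names are placeholders chosen for illustration.

```python
import tensorflow as tf

# A 2x3 integer tensor used as the template for the *_like functions
input_tensor = tf.constant([[1, 2, 3], [4, 5, 6]])

# 2x2 tensor of 0s (float32 unless a dtype is given)
zeros_2x2 = tf.zeros([2, 2])

# Same shape and dtype as input_tensor, filled with 0s or 1s
zeros_like_input = tf.zeros_like(input_tensor)
ones_like_input = tf.ones_like(input_tensor)

# 3x3 tensor in which every entry is 7
sevens_3x3 = tf.fill([3, 3], 7)

print(zeros_like_input.shape, zeros_like_input.dtype)
print(sevens_3x3.numpy())
```

The `*_like` variants are convenient when the shape of an existing tensor is not known in advance.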

## Defining and initializing variables

In [None]:
import tensorflow as tf

# define a variable
a0 = tf.Variable([1,2,3,4,5,6], dtype=tf.float32)
a1 = tf.Variable([1,2,3,4,5,6], dtype=tf.int16)

# define a constant
b = tf.constant(2, tf.float32)

# compute their product
c0 = tf.multiply(a0, b)
c1 = a0*b # simpler option


## Example

In [None]:
# Import constant from TensorFlow
from tensorflow import constant

# Convert the credit_numpy array into a tensorflow constant
'''This array contains feature columns from a dataset on 
credit card holders and is previewed in the image below. 
We will return to this dataset in later chapters.'''
credit_constant = constant(credit_numpy)

# Print constant datatype
print('\n The datatype is:', credit_constant.dtype)

# Print constant shape
print('\n The shape is:', credit_constant.shape)

# The datatype is: <dtype: 'float64'>

# The shape is: (30000, 4)

In [None]:
'''Defining variables
Unlike a constant, a variable's value can be modified. 
This will be useful when we want to train a model by 
updating its parameters.

Let's try defining and printing a variable. 
We'll then convert the variable to a numpy array, 
print again, and check for differences. 
Note that Variable(), which is used to create a 
variable tensor, has been imported from tensorflow and 
is available to use in the exercise.'''

# Define the 1-dimensional variable A1
A1 = Variable([1, 2, 3, 4])

# Print the variable A1
print('\n A1: ', A1)

# Convert A1 to a numpy array and assign it to B1
B1 = A1.numpy()

# Print B1
print('\n B1: ', B1)
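
Since the point of a variable is that its value can be modified, a short sketch of in-place updates may help. The name `v` and the values used are illustrative only.

```python
import tensorflow as tf

# A variable holds a value that can be changed after creation
v = tf.Variable([1.0, 2.0, 3.0])

# Overwrite the stored values (the new value must have the same shape)
v.assign([4.0, 5.0, 6.0])

# Add to the stored values in place
v.assign_add([1.0, 1.0, 1.0])

print(v.numpy())
```

Optimizers rely on this kind of in-place update when they adjust model parameters during training.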

## Basic operations

2. What is a TensorFlow operation?
 - TensorFlow has a model of computation that revolves around the use of graphs. A TensorFlow graph contains edges and nodes, where the edges are tensors and the nodes are operations.

3. What is a TensorFlow operation?
 - In the graph shown, which was drawn using TensorFlow, the const operations define 2 by 2 constant tensors. Two tensors are summed using the add operation.

4. What is a TensorFlow operation?
 - Another two tensors are then summed using the add operation.

5. What is a TensorFlow operation?
 - Finally, the resulting matrices are multiplied together with the matmul operation.

6. Applying the addition operator
 - Let's start with the addition operator. We will first import the constant and add operations. We may now use constant to define 0-dimensional, 1-dimensional, and 2-dimensional tensors.

7. Applying the addition operator
 - Finally, let's add them together using the operation for tensor addition. Note that we can perform scalar addition with A0 and B0, vector addition with A1 and B1, and matrix addition with A2 and B2.

8. Performing tensor addition
 - The add() operation performs element-wise addition with two tensors. Each pair of tensors added must have the same shape. Element-wise addition of the scalars 1 and 2 yields the scalar 3. Element-wise addition of the vectors 1,2 and 3,4 yields the vector 4,6. Element-wise addition of the matrices 1,2,3,4 and 5,6,7,8 yields the matrix 6,8,10,12. Furthermore, the add operator is overloaded, which means that we can also perform addition using the plus symbol.
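
The scalar, vector, and matrix additions described above can be reproduced with the following sketch. The tensor names follow the text; the code itself is a reconstruction, not the original slide code.

```python
import tensorflow as tf

# 0-dimensional tensors (scalars)
A0 = tf.constant(1)
B0 = tf.constant(2)

# 1-dimensional tensors (vectors)
A1 = tf.constant([1, 2])
B1 = tf.constant([3, 4])

# 2-dimensional tensors (matrices)
A2 = tf.constant([[1, 2], [3, 4]])
B2 = tf.constant([[5, 6], [7, 8]])

# Element-wise addition with the add operation
C0 = tf.add(A0, B0)
C1 = tf.add(A1, B1)

# The + symbol is overloaded and performs the same operation
C2 = A2 + B2

print(C0.numpy(), C1.numpy(), C2.numpy())
```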

9. How to perform multiplication in TensorFlow
  - We will consider both element-wise and matrix multiplication.
  - For element-wise multiplication, which is performed with the multiply operation, the tensors involved must have the same shape.
  - For instance, you may want to multiply the vector 1,2,3 by 3,4,5 or 1,2 by 3,4. 
  - For matrix multiplication, you use the matmul() operator.
  - Note that performing matmul(A,B) requires that the number of columns of A equal the number of rows of B.

10. Applying the multiplication operators
 - Let's look at some examples of multiplication in TensorFlow. We'll import the ones operator, along with the two types of multiplication we will use. We will also define a scalar, A0, a 3 by 1 vector of ones, A31, a 3 by 4 matrix of ones, A34, and a 4 by 3 matrix of ones, A43. What operations can be performed using these tensors of ones? We can perform element-wise multiplication of any tensor by itself, such as A0 by A0, A31 by A31, or A34 by A34. We can also perform matrix multiplication of A43 by A34, but not A43 by A43, because the number of columns of the first matrix (3) does not equal the number of rows of the second (4).
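
A sketch of the multiplication examples described above, using the tensor names from the text. The try/except around the invalid product is an addition for illustration, and the exact exception type is assumed to be one of the two listed.

```python
import tensorflow as tf

# Tensors of ones with the shapes named in the text
A31 = tf.ones([3, 1])
A34 = tf.ones([3, 4])
A43 = tf.ones([4, 3])

# Element-wise multiplication needs identical shapes
E34 = tf.multiply(A34, A34)

# Matrix multiplication: columns of A43 (3) match rows of A34 (3)
M44 = tf.matmul(A43, A34)

# A43 by A43 fails: 3 columns do not match 4 rows
shape_mismatch = False
try:
    tf.matmul(A43, A43)
except (tf.errors.InvalidArgumentError, ValueError):
    shape_mismatch = True

print(E34.shape, M44.shape, shape_mismatch)
```

Each entry of `M44` is a sum of three products of ones, so the result is a 4 by 4 matrix of 3s.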

11. Summing over tensor dimensions
 - Finally, we end this lesson by discussing summation over tensors, which is performed using the reduce_sum() operator. This can be used to sum over all dimensions of a tensor or just one.
 - reduce_sum(A) # sums over all dimensions of A
 - reduce_sum(A, i) # sums over dimension i
 - Let's see how this works in practice. We will import ones and reduce_sum from tensorflow. We will then define a 2 by 3 by 4 tensor that consists of ones.

12. Summing over tensor dimensions
 - If we sum over all elements of A, we get 24, since the tensor contains 24 elements, all of which are 1. If we sum over dimension 0, we get a 3 by 4 matrix of 2s. If we sum over 1, we get a 2 by 4 matrix of 3s. And if we sum over 2, we get a 2 by 3 matrix of 4s. In each case, we reduce the size of the tensor by summing over one of its dimensions.
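
The summation results described above can be checked with a short sketch. The variable names are chosen here for illustration.

```python
import tensorflow as tf

# A 2x3x4 tensor of ones
A = tf.ones([2, 3, 4])

# Sum over all dimensions: one scalar
total = tf.reduce_sum(A)

# Sum over a single dimension: that dimension is removed from the shape
sum_dim0 = tf.reduce_sum(A, 0)  # 3x4 matrix
sum_dim1 = tf.reduce_sum(A, 1)  # 2x4 matrix
sum_dim2 = tf.reduce_sum(A, 2)  # 2x3 matrix

print(total.numpy())
print(sum_dim0.shape, sum_dim1.shape, sum_dim2.shape)
```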

### Examples - Performing element-wise multiplication
- Element-wise multiplication in TensorFlow is performed using two tensors with identical shapes. This is because the operation multiplies elements in corresponding positions in the two tensors. Element-wise multiplication is denoted by the ⊙ symbol (a circle with a dot inside).


- In this exercise, you will perform element-wise multiplication, paying careful attention to the shape of the tensors you multiply. Note that multiply(), constant(), and ones_like() have been imported for you.

In [None]:
# Define tensors A1 and A23 as constants
A1 = constant([1, 2, 3, 4])
A23 = constant([[1, 2, 3], [1, 6, 4]])

# Define B1 and B23 to have the same shapes as A1 and A23
B1 = ones_like(A1)
B23 = ones_like(A23)

# Perform element-wise multiplication
C1 = A1*B1
C23 = A23*B23

# Print the tensors C1 and C23
print('\n C1: {}'.format(C1.numpy()))
print('\n C23: {}'.format(C23.numpy()))

### Making predictions with matrix multiplication
- In later chapters, you will learn to train linear regression models. This process will yield a vector of parameters that can be multiplied by the input data to generate predictions. In this exercise, you will use input data, features, and a target vector, bill, which are taken from a credit card dataset we will use later in the course.

- The matrix of input data, features, contains two columns: education level and age. The target vector, bill, is the size of the credit card borrower's bill.

- Since we have not trained the model, you will enter a guess for the values of the parameter vector, params. You will then use matmul() to perform matrix multiplication of features by params to generate predictions, billpred, which you will compare with bill. Note that we have imported matmul() and constant().

In [None]:
# Define features, params, and bill as constants
features = constant([[2, 24], [2, 26], [2, 57], [1, 37]])
params = constant([[1000], [150]])
bill = constant([[3913], [2682], [8617], [64400]])

# Compute billpred using features and params
billpred = matmul(features,params)

# Compute and print the error
error = bill - billpred
print(error.numpy())

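'''
The predictions, billpred, evaluate to:
<tf.Tensor: shape=(4, 1), dtype=int32, numpy=
array([[ 5600],
       [ 5900],
       [10550],
       [ 6550]], dtype=int32)>

so the printed error, bill - billpred, is:
[[-1687]
 [-3218]
 [-1933]
 [57850]]
'''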

### Summing over tensor dimensions
You've been given a matrix, wealth. This contains the value of bond and stock wealth for five individuals in thousands of dollars.

wealth = [[11,50], [7,2], [4,60], [3,0], [25,10]]
 

The first column corresponds to bonds and the second corresponds to stocks. Each row gives the bond and stock wealth for a single individual. Use wealth, reduce_sum(), and .numpy() to determine which statements are correct about wealth.

In [None]:
''' recall 
- reduce_sum(A) # sums over all dimensions of A
- reduce_sum(A, i) # sums over dimension i
'''
# Sum over dimension 0 (down the rows): total bonds and total stocks
reduce_sum(wealth,0).numpy()

# Sum over dimension 1 (across the columns): total wealth per individual
reduce_sum(wealth,1).numpy()

## Advanced operations

1. Advanced operations
In this video, we will cover a selection of advanced operations. Some will be used frequently in later chapters. Others will help you to gain intuition about complex machine learning routines.

2. Overview of advanced operations
We have already covered basic operations in TensorFlow, including add(), multiply(), matmul(), and reduce_sum(). In this lesson, we will move on to more advanced operations, including gradient, reshape, and random.

3. Overview of advanced operations - gradient(), reshape(), random()
The gradient() operation, which we'll use in conjunction with gradient tape, computes the slope of a function at a point. The reshape() operation changes the shape of a tensor. And the random module generates a tensor out of randomly-drawn values.
 - gradient() # computes slope of a function at a point
 - reshape() # reshapes a tensor (e.g. 10x10 to 100x1)
 - random # module whose functions populate a tensor with entries drawn from a probability distribution

4. Finding the optimum
 - In many machine learning problems, you will need to find an optimum--that is, a minimum or maximum. You may, for instance, want to find the model parameters that minimize the loss function or maximize the objective function. 
  - minimum = lowest value of a loss function
  - maximum = highest value of objective function
 - Fortunately, we can do this by using the gradient operation, which tells us the slope of a function at a point. We start this process by passing points to the gradient() operation until we find one where the gradient is zero. Next, we check if the gradient is increasing or decreasing at that point. If it is increasing, we have a minimum. Otherwise, we have a maximum.
  - optimum = point where gradient = 0
  - min = change in gradient > 0
  - max = change in gradient < 0
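
The "change in gradient" test above amounts to checking the sign of the second derivative. One way to sketch this in TensorFlow is to nest two gradient tapes. This is an illustration for the function y = x², not a method taken from the course, and the names `inner`, `outer`, `first`, and `second` are made up for the example.

```python
import tensorflow as tf

# Candidate optimum for y = x**2
x = tf.Variable(0.0)

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = x * x
    # First derivative (slope), recorded by the outer tape
    first = inner.gradient(y, x)

# Second derivative: how the slope changes at x
second = outer.gradient(first, x)

print(first.numpy(), second.numpy())
```

A zero slope together with a positive second derivative indicates a minimum, which matches the rule stated above.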


5. Calculating the gradient
The plot shows the function y equals x. Notice that the gradient--that is, the slope at a given point, is constant. If we increase x by 1 unit, y also increases by 1 unit.

6. Calculating the gradient
This is not true if we instead consider the function y equals x squared. When x is less than 0, y decreases when x increases. When x is greater than 0, y increases when x increases. Thus, the gradient is initially negative, but becomes positive for x larger than 0. This means that x equals 0 minimizes y.

7. Gradients in TensorFlow
Let's use TensorFlow to compute the gradient. We will start by defining a variable, x, which we initialize to minus one point zero. We will then define y to be x squared within an instance of gradient tape. Note that we apply the watch method to an instance of gradient tape and then pass the variable x. This will allow us to compute the rate of change of y with respect to x. Next, we compute the gradient of y with respect to x using the tape instance of gradient tape. Note that y is the first argument and x is the second. As written, the operation computes the slope of y at a point. Running the code and printing, we find that the slope is -2 at x equals -1, which means that y is initially decreasing in x, as we saw on the previous slide. Much of the differentiation you do in deep learning models will be handled by high level APIs; however, gradient tape remains an invaluable tool for building advanced and custom models.

8. Images as tensors
We'll next consider an operation that is particularly useful for image classification problems: reshaping. The grayscale image shown has a natural representation as a matrix with values between 0 and 255. While some algorithms exploit this shape, others require you to reshape matrices into vectors before using them as inputs, as shown in the diagram.

9. How to reshape a grayscale image
Now that you've seen how images can be represented as tensors, let's generate some input images and reshape them. We will create a random grayscale image by drawing numbers from the set of integers between 0 and 255. We will use these to populate a 2 by 2 matrix. We can then reshape this into a 4 by 1 vector, as shown in the diagram.

10. How to reshape a color image
For color images, we will generate 3 such matrices to form a 2 by 2 by 3 tensor. We could then reshape the image into a 4 by 3 tensor, as shown in the diagram.
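
The two reshaping steps described above could look like the sketch below. It assumes `tf.random.uniform` is used to draw the pixel values. Since the upper bound is exclusive for integer draws, `maxval=256` is used here so that 255 can occur.

```python
import tensorflow as tf

# Random 2x2 grayscale image with integer values from 0 to 255
gray = tf.random.uniform([2, 2], maxval=256, dtype=tf.int32)

# Reshape the 2x2 matrix into a 4x1 vector
gray_vector = tf.reshape(gray, [2 * 2, 1])

# Random 2x2 color image with 3 channels
color = tf.random.uniform([2, 2, 3], maxval=256, dtype=tf.int32)

# Reshape the 2x2x3 tensor into a 4x3 tensor (one row per pixel)
color_matrix = tf.reshape(color, [2 * 2, 3])

print(gray_vector.shape, color_matrix.shape)
```

Reshaping only changes how the values are arranged, so the number of elements must stay the same.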

### Reshaping tensors
Later in the course, you will classify images of sign language letters using a neural network. In some cases, the network will take 1-dimensional tensors as inputs, but your data will come in the form of images, which will be either 2- or 3-dimensional tensors, depending on whether they are grayscale or color images.

The figure below shows grayscale and color images of the sign language letter A. The two images have been imported for you and converted to the numpy arrays gray_tensor and color_tensor. Reshape these arrays into 1-dimensional vectors using the reshape operation, which has been imported for you from tensorflow. Note that the shape of gray_tensor is 28x28 and the shape of color_tensor is 28x28x3.

In [None]:
# Reshape the grayscale image tensor from a 28x28 matrix 
# into a 784x1 vector
gray_vector = reshape(gray_tensor, (784, 1))

# Reshape the color image tensor from a 28x28x3 tensor 
# into a 2352x1 vector
color_vector = reshape(color_tensor, (2352, 1))

### Optimizing with gradients
You are given a loss function, y=x**2, which you want to minimize. You can do this by computing the slope using the GradientTape() operation at different values of x. If the slope is positive, you can decrease the loss by lowering x. If it is negative, you can decrease it by increasing x. This is how gradient descent works.

The image shows a plot of y equals x squared. It also shows the gradient at x equals -1, x equals 0, and x equals 1.

In practice, you will use a high level tensorflow operation to perform gradient descent automatically. In this exercise, however, you will compute the slope at x values of -1, 1, and 0. The following operations are available: GradientTape(), multiply(), and Variable().

In [None]:
def compute_gradient(x0):
    # Define x as a variable with an initial value of x0
    x = Variable(x0)
    with GradientTape() as tape:
        tape.watch(x)
        # Define y using the multiply operation
        y = multiply(x,x)
    # Return the gradient of y with respect to x
    return tape.gradient(y, x).numpy()

# Compute and print gradients at x = -1, 1, and 0
print(compute_gradient(-1.0))
print(compute_gradient(1.0))
print(compute_gradient(0.0))

'''
<script.py> output:
    -2.0
    2.0
    0.0
Notice that the slope is positive at x = 1, which means 
that we can lower the loss by reducing x. 
The slope is negative at x = -1, which means that we can 
lower the loss by increasing x. 
The slope at x = 0 is 0, which means that we cannot lower 
the loss by either increasing or decreasing x. 
This is because the loss is minimized at x = 0.
'''
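
The rule described in this exercise (move x against the sign of the slope) can be turned into a small gradient descent loop. This is a minimal sketch. The step size `learning_rate` and the number of steps are arbitrary choices for illustration.

```python
import tensorflow as tf

# Start away from the minimum of y = x**2
x = tf.Variable(-1.0)

# Hypothetical step size, chosen only for this illustration
learning_rate = 0.1

for step in range(50):
    with tf.GradientTape() as tape:
        y = x * x
    slope = tape.gradient(y, x)
    # Positive slope lowers x, negative slope raises x
    x.assign_sub(learning_rate * slope)

print(x.numpy())
```

After repeated steps x moves toward 0, where the slope vanishes and the loss is minimized.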

### Working with image data
You are given a black-and-white image of a letter, which has been encoded as a tensor, letter. You want to determine whether the letter is an X or a K. You don't have a trained neural network, but you do have a simple model, model, which can be used to classify letter.

The 3x3 tensor, letter, and the 1x3 tensor, model, are available in the Python shell. You can determine whether letter is a K by multiplying letter by model, summing over the result, and then checking if it is equal to 1. As with more complicated models, such as neural networks, model is a collection of weights, arranged in a tensor.

Note that the functions reshape(), matmul(), and reduce_sum() have been imported from tensorflow and are available for use.

In [None]:
# Reshape model from a 1x3 to a 3x1 tensor
model = reshape(model, (3,1))

# Multiply letter by model
output = matmul(letter, model)

# Sum over output and print prediction using the numpy method
prediction = reduce_sum(output)
print(prediction.numpy())

# Your model found that prediction=1.0 and correctly classified 
# the letter as a K.

# Input data

1. Input data
 - In the previous chapter, we learned how to perform core TensorFlow operations. In this chapter, we will work towards training a linear model with TensorFlow.

2. Using data in TensorFlow
 - So far, we've only generated data using functions like ones and random uniform; however, when we train a machine learning model, we will want to import data from an external source. This may include numeric, image, or text data. Beyond simply importing the data, numeric data will need to be assigned a type, and text and image data will need to be converted to a usable format.

3. Importing data for use in TensorFlow
 - External datasets can be imported using TensorFlow. While this is useful for complex data pipelines, it will be unnecessarily complicated for what we do in this chapter. For that reason, we will use simpler options to import data. We will then convert the data into a NumPy array, which we can use without further modification in TensorFlow.

4. How to import and convert data
 - Let's start by importing numpy under the alias np and pandas under the alias pd. We will then read housing transaction data from kc_housing.csv using the pandas method read_csv() and assign it to a dataframe called housing. When you are ready to train a model, you will want to convert the data into a NumPy array by passing the pandas dataframe, housing, to np.array(). We will focus on loading data from CSV files in this chapter, but you can also use pandas to load data from other formats, such as JSON, HTML, and Excel.

5. Parameters of read_csv()
 - Let's take a closer look at the read_csv() method of pandas, since you will use it frequently to import data. In the code block, we filled in the only required parameter, which was the filepath or buffer. Note that you could have instead supplied a URL, rather than a filepath, to load your data. Another important parameter is sep, which is the delimiter that separates columns in your dataset. By default, this will be a comma; however, other common choices are semicolons and tabs. Note that if you do use whitespace as a delimiter, you will need to set the delim_whitespace parameter to True. Finally, if you are working with datasets that contain non-ASCII characters, you can specify the appropriate choice of encoding, so that your characters are correctly parsed.
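
To make the sep and encoding parameters concrete, here is a small self-contained sketch. The data is made up and held in an in-memory buffer instead of a file; the column names and values are purely illustrative.

```python
import io

import numpy as np
import pandas as pd

# Made-up, semicolon-delimited data containing a non-ASCII character,
# encoded as Latin-1 bytes to mimic a file on disk
raw_bytes = "city;price\nMalmö;221900.0\nLund;538000.0\n".encode("latin-1")

# sep sets the column delimiter; encoding tells pandas how to decode the bytes
housing_demo = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin-1")

# Convert a numeric column to a NumPy array for later use in TensorFlow
price_demo = np.array(housing_demo["price"], np.float32)

print(housing_demo)
print(price_demo)
```

With the default comma delimiter the same data would be read as a single column, which is a common sign that sep needs to be set.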

6. Using mixed type datasets
 - Finally, we will end this lesson by talking about how to transform imported data for use in TensorFlow. We will use housing data from King County, Washington as an example. Notice how the dataset contains columns with different types. One column contains data on house prices in a floating point format. Another column is a boolean variable, which can either be true, 1, or false, 0. In this case, a 1 indicates that a property is located on the waterfront.

7. Setting the data type
 - Let's say we want to perform TensorFlow operations that require price to be a 32-bit floating point number and waterfront to be a boolean. We can do this in two ways. The first approach uses the array method from numpy. We select the relevant column in the DataFrame, provide it as the first argument to array, and then provide the datatype as the second argument.

8. Setting the data type
 - The second approach uses the cast operation from TensorFlow. Again, we supply the data first and the data type second. While either tf.cast() or np.array() will work, waterfront will be a tf.Tensor under the former option and a NumPy array under the latter.

## Load data

In [None]:
# Import pandas under the alias pd
import pandas as pd

# Assign the path to a string variable named data_path
data_path = 'kc_house_data.csv'

# Load the dataset as a dataframe named housing
housing = pd.read_csv(data_path)

# Print the price column of housing
print(housing.price)

## Set data type
In this exercise, you will both load data and set its type. Note that housing is available and pandas has been imported as pd. You will import numpy and tensorflow, and define tensors that are usable in tensorflow using columns in housing with a given data type. Recall that you can select the price column, for instance, from housing using housing['price'].

In [None]:
# Import numpy and tensorflow with their standard aliases
import numpy as np
import tensorflow as tf

# Use a numpy array to define price as a 32-bit float
price = np.array(housing['price'], np.float32)

# Define waterfront as a Boolean using cast
waterfront = tf.cast(housing['waterfront'], tf.bool)

# Print price and waterfront
print(price)
print(waterfront)

'''
<script.py> output:
    [221900. 538000. 180000. ... 402101. 400000. 325000.]
    tf.Tensor([False False False ... False False False], shape=(21613,), dtype=bool)
'''

## Loss Functions

1. Loss functions
 - We now know how to import datasets and perform TensorFlow operations on them, but how can we use this knowledge to train models? In this video, we'll move closer to that goal by taking a look at loss functions.

2. Introduction to loss functions - train models and measure model fit
 - Loss functions play a fundamental role in machine learning. We need loss functions to train models because they tell us how well our model explains the data. Without this feedback, it is unclear how to adjust model parameters during the training process. 
 - A high loss value indicates that the model fit is poor. Higher value -> worse fit. So minimize the loss function.
 - Typically, we train the model by selecting parameter values that minimize the loss function. In some cases, we may want to maximize a function instead. Fortunately, we can always place a minus sign before the function we want to maximize and instead minimize it. For this reason, we will always talk about loss functions and minimization.

3. Common loss functions in TensorFlow
 - While it is possible to define a custom loss function, this is typically not necessary, since many common options are available in TensorFlow. Typical choices for training linear models include the mean squared error loss, the mean absolute error loss, and the Huber loss. All of these are accessible from tf.keras.losses.
  - MSE = mean squared error
  - MAE = mean absolute error
  - Huber loss
  - tf.keras.losses.mse()
  - tf.keras.losses.mae()
  - tf.keras.losses.Huber()

4. Why do we care about loss functions?
 - Here, we plot the MSE, MAE, and Huber loss for error values between minus two and two. 
 - MSE strongly penalizes outliers and has high sensitivity near the minimum. 
 - MAE scales linearly with the size of the error and has low sensitivity near the minimum.
 - Huber loss is similar to the MSE near zero and similar to the MAE away from zero. 
 - For greater sensitivity near the minimum, you will want to use the MSE or Huber loss. 
 - To minimize the impact of outliers, you will want to use the MAE or Huber loss.
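
The different treatment of outliers can be seen on a tiny made-up example. The sketch below uses the class forms `MeanSquaredError`, `MeanAbsoluteError`, and `Huber` from `tf.keras.losses`, which are assumed here to be interchangeable with the function forms listed above for this purpose. The data values are invented.

```python
import tensorflow as tf

# Made-up targets and predictions; the last prediction is an outlier
targets = tf.constant([0.0, 0.0, 0.0, 0.0])
predictions = tf.constant([0.1, -0.1, 0.1, 4.0])

# Each loss object is called with (targets, predictions)
mse = tf.keras.losses.MeanSquaredError()(targets, predictions)
mae = tf.keras.losses.MeanAbsoluteError()(targets, predictions)
huber = tf.keras.losses.Huber()(targets, predictions)  # default delta of 1.0

print('MSE:', mse.numpy())
print('MAE:', mae.numpy())
print('Huber:', huber.numpy())
```

The single large error dominates the MSE, while the MAE and Huber losses grow only linearly with it.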

5. Defining a loss function
 - Let's say we decide to use the MSE loss. We'll need two tensors to compute it: the actual values or "targets" tensor and the predicted values or "predictions." Passing them to the MSE operation will return a single number: the average of the squared differences between the actual and predicted values.

6. Defining a loss function
 - In many cases, the training process will require us to supply a function that accepts our model's variables and data and returns a loss. Here, we'll first define a model, "linear_regression," which takes the intercept, slope, and features as arguments and returns the model's predictions. We'll next define a loss function called "loss_function" that accepts the slope and intercept of a linear model -- the variables -- and the input data, the targets and the features. It then makes a prediction and computes and returns the associated MSE loss. Note that we've defined both functions to use default argument values for features and targets. We will do this whenever we train on the full sample to simplify the code.

7. Defining the loss function
 - Notice that we've nested TensorFlow's MSE loss function within a function that first uses the model to make predictions and then uses those predictions as an input to the MSE loss function. We can then evaluate this function for a given set of parameter values and input data. Here, we've evaluated the loss function using a test dataset and it returned a loss value of 10.77. If we had omitted the data arguments, test_targets and test_features, the loss function would have instead used the default targets and features arguments we set to evaluate model performance.

### Loss functions in TensorFlow
In this exercise, you will compute the loss using data from the King County housing dataset. You are given a target, price, which is a tensor of house prices, and predictions, which is a tensor of predicted house prices. You will evaluate the loss function and print out the value of the loss.

In [None]:
# Import the keras module from tensorflow
from tensorflow import keras

# Compute the mean squared error (mse)
loss = keras.losses.mse(price, predictions)

# Print the mean squared error (mse)
print(loss.numpy())

<script.py> output:
    141171604777.12717

In [None]:
# Import the keras module from tensorflow
from tensorflow import keras

# Compute the mean absolute error (mae)
loss = keras.losses.mae(price, predictions)

# Print the mean absolute error (mae)
print(loss.numpy())

<script.py> output:
    268827.99302088

You may have noticed that the MAE was much smaller than the MSE, even though price and predictions were the same. 

This is because the different loss functions penalize deviations of predictions from price differently. 

MSE does not like large deviations and punishes them harshly.

### Modifying the loss function
In the previous exercise, you defined a tensorflow loss function and then evaluated it once for a set of actual and predicted values. In this exercise, you will compute the loss within another function called loss_function(), which first generates predicted values from the data and variables. The purpose of this is to construct a function of the trainable model variables that returns the loss. You can then repeatedly evaluate this function for different variable values until you find the minimum. In practice, you will pass this function to an optimizer in tensorflow. Note that features and targets have been defined and are available. Additionally, Variable, float32, and keras are available.

In [None]:
# Initialize a variable named scalar
scalar = Variable(1.0, dtype=float32)

# Define the model
def model(scalar, features = features):
    return scalar * features

# Define a loss function
def loss_function(scalar, features = features, targets = targets):
    # Compute the predicted values
    predictions = model(scalar, features)
    
    # Return the mean absolute error loss
    return keras.losses.mae(targets, predictions)

# Evaluate the loss function and print the loss
print(loss_function(scalar).numpy())

## Linear Regression

1. Linear regression
 - Now that you understand how to construct loss functions, you're well-equipped to start training models. We'll do that for the first time in this video with a linear regression model.

2. What is a linear regression?
 - So what is a linear regression model? We can answer this with a simple illustration. Let's say we want to examine the relationship between house size and price in the King County housing dataset. We might start by plotting the size in square feet against the price in dollars. Note that we've actually plotted the relationship after taking the natural logarithm of each variable, which is useful when we suspect that the relationship is proportional. That is, we might expect an x% increase in size to be associated with a y% increase in price.

3. What is a linear regression?
 - A linear regression model assumes that the relationship between these variables can be captured by a line. That is, two parameters--the line's slope and intercept--fully characterize the relationship between size and price.

4. The linear regression model
 - In our case, we've assumed that the relationship is linear after taking natural logarithms. Training the model will involve recovering the slope of the line and the intercept, where the line intersects the vertical axis. Once we have trained the intercept and slope, we can take a house's size and predict its price. The difference between the predicted price and actual price is the error, which can be used to construct a loss function. The example we've shown is for a univariate regression, which has only one feature, size. A multiple regression has multiple features, such as size and location.

5. Linear regression in TensorFlow
 - Let's look at some code to see how this can be implemented. We will first define our target variable, price, and feature, size. We also initialize the intercept and slope as trainable variables. After that, we define the model, which we'll use to make predictions by multiplying size and slope and then adding the intercept. Again, remember that we can do this using the addition and multiplication symbols, since these are overloaded operators and intercept and slope are TensorFlow variables. Our next step is to define a loss function. This function will take the model's parameters and the data as an input. It will first use the model to compute the predicted values. We then set the function to return the mean squared error loss. We, of course, could have selected a different loss.

6. Linear regression in TensorFlow
 - With the loss function defined, the next step is to define an optimization operation. We'll do this using the Adam optimizer. For now, you can ignore the choice of optimization algorithm. We will discuss the selection of optimizers in greater detail later. For our purposes, it is sufficient to understand that executing this operation will change the slope and intercept in a direction that will lower the value of the loss. We will next perform minimization on the loss function using the optimizer. Notice that we've passed the loss function as a lambda function to the minimize operation. We also supplied a variable list, which contains intercept and slope, the two variables we defined earlier. We will execute our optimization step 1000 times. Printing the loss, we'll see that it tends to decline, moving closer to the minimum value with each step. Finally, we print the intercept and the slope. This is our linear model, which enables us to predict the value of a house given its size.
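
The training procedure described above might be sketched as follows. Several things here are assumptions made for illustration: the data is synthetic (not the King County dataset), the MSE is written out with `reduce_mean` and `square`, the learning rate and step count are arbitrary, and each step uses an explicit gradient tape with `apply_gradients` in place of the `minimize` call mentioned in the text.

```python
import numpy as np
import tensorflow as tf

# Synthetic data for illustration only: targets = 2 + 3 * features
features_demo = tf.constant(np.linspace(0.0, 1.0, 50), dtype=tf.float32)
targets_demo = 2.0 + 3.0 * features_demo

# Trainable parameters of the linear model
intercept = tf.Variable(0.1, dtype=tf.float32)
slope = tf.Variable(0.1, dtype=tf.float32)

def linear_regression(intercept, slope, features=features_demo):
    # Predicted values of the linear model
    return intercept + features * slope

def loss_function(intercept, slope, targets=targets_demo, features=features_demo):
    # Mean squared error between targets and predictions
    predictions = linear_regression(intercept, slope, features)
    return tf.reduce_mean(tf.square(targets - predictions))

# Adam optimizer with an arbitrary learning rate
opt = tf.keras.optimizers.Adam(learning_rate=0.1)

initial_loss = float(loss_function(intercept, slope))
for step in range(500):
    with tf.GradientTape() as tape:
        loss = loss_function(intercept, slope)
    # Gradients of the loss with respect to both parameters
    grads = tape.gradient(loss, [intercept, slope])
    # One optimization step: nudge the parameters to lower the loss
    opt.apply_gradients(zip(grads, [intercept, slope]))
final_loss = float(loss_function(intercept, slope))

print(initial_loss, final_loss)
print(intercept.numpy(), slope.numpy())
```

Because the synthetic data lies exactly on a line, the loss should fall toward zero as the parameters approach the values used to generate it.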

### Set up a linear regression
A univariate linear regression identifies the relationship between a single feature and the target tensor. In this exercise, we will use a property's lot size and price. Just as we discussed in the video, we will take the natural logarithms of both tensors, which are available as price_log and size_log.

In this exercise, you will define the model and the loss function. You will then evaluate the loss function for two different values of intercept and slope. Remember that the predicted values are given by intercept + features*slope. Additionally, note that keras.losses.mse() is available for you. Furthermore, slope and intercept have been defined as variables.

In [None]:
# Define a linear regression model
def linear_regression(intercept, slope, features = size_log):
    return intercept + features*slope

# Set loss_function() to take the variables as arguments
def loss_function(intercept, slope, features = size_log, targets = price_log):
    # Set the predicted values
    predictions = linear_regression(intercept, slope, features)
    
    # Return the mean squared error loss
    return keras.losses.mse(targets,predictions)

# Compute the loss for different slope and intercept values
print(loss_function(0.1, 0.1).numpy())
print(loss_function(0.1, 0.5).numpy())

# 145.44653
# 71.866

### Train a linear model
In this exercise, we will pick up where the previous exercise ended. The variables intercept and slope have been defined and initialized. Additionally, a function has been defined, loss_function(intercept, slope), which computes the loss using the data and model variables.

You will now define an optimization operation as opt. You will then train a univariate linear model by minimizing the loss to find the optimal values of intercept and slope. Note that the opt operation will try to move closer to the optimum with each step, but will require many steps to find it. Thus, you must repeatedly execute the operation.

In [None]:
# Initialize an Adam optimizer
opt = keras.optimizers.Adam(0.5)

for j in range(100):
    # Apply minimize, pass the loss function, and supply the variables
    opt.minimize(lambda: loss_function(intercept, slope), var_list=[intercept, slope])

    # Print every 10th value of the loss
    if j % 10 == 0:
        print(loss_function(intercept, slope).numpy())

# Plot data and regression line
plot_results(intercept, slope)

### Multiple linear regression
In most cases, performing a univariate linear regression will not yield a model that is useful for making accurate predictions. In this exercise, you will perform a multiple regression, which uses more than one feature.

You will use price_log as your target and size_log and bedrooms as your features. Each of these tensors has been defined and is available. You will also switch from using the mean squared error loss to the mean absolute error loss: keras.losses.mae(). Finally, the predicted values are computed as follows:

params[0] + feature1*params[1] + feature2*params[2]. 

Note that we've defined a vector of parameters, params, as a variable, rather than using three variables. Here, params[0] is the intercept and params[1] and params[2] are the slopes.

In [None]:
# Define the linear regression model
def linear_regression(params, feature1 = size_log, feature2 = bedrooms):
    return params[0] + feature1*params[1] + feature2*params[2]

# Define the loss function
def loss_function(params, targets = price_log, feature1 = size_log, feature2 = bedrooms):
    # Set the predicted values
    predictions = linear_regression(params, feature1, feature2)
  
    # Use the mean absolute error loss
    return keras.losses.mae(targets, predictions)

# Define the optimize operation
opt = keras.optimizers.Adam()

# Perform minimization and print trainable variables
for j in range(10):
    opt.minimize(lambda: loss_function(params), var_list=[params])
    print_results(params)
    
'''
Note that params[2] tells us how much the price will increase 
in percentage terms if we add one more bedroom. 
You could train params[2] and the other model parameters 
by increasing the number of times we iterate over opt.
'''

### Batch Training
- handle large datasets by dividing them into batches (one full pass over all batches is an epoch)
- update model weights and optimizer parameters after each batch

1. Batch training
 - Earlier in the chapter, we learned how to train a linear model to predict house prices. In this video, we will use batch training to handle large datasets.

2. What is batch training?
 - So what is batch training exactly? To answer this, let's return to the linear model you used to predict house prices earlier in the chapter. But this time, let's say the dataset is much larger and you want to perform the training on a GPU, which has only a small amount of memory. Since you can't fit the entire dataset in memory, you will instead divide it into batches and train on those batches sequentially. A single pass over all of the batches is called an epoch and the process itself is called batch training. It will be quite useful when you work with large image datasets. Beyond alleviating memory constraints, batch training will also allow you to update model weights and optimizer parameters after each batch, rather than at the end of the epoch.

3. The chunksize parameter
 - Earlier, we discussed using pandas to load data with read_csv(). 
  - The same function can be used to load data in batches. If, for instance, we have a 100 gigabyte dataset, we might want to avoid loading it all at once. 
  - We can do this by using the chunksize parameter. The code block shows how this can be done. Let's first import pandas and numpy. Instead of loading the data in a single one-liner, we'll write a for loop that iterates through the data in steps of 100 examples. Each 100 will be available as batch, which we can use to extract columns, such as price and size in the housing dataset. We can then convert these to numpy arrays and use them to train.
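A minimal sketch of the chunksize loop described above. To keep it self-contained, it first writes a small made-up CSV to a temporary folder; the file name, column names, and values are assumptions, not the housing dataset.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small hypothetical CSV so the example does not need external data
data = pd.DataFrame({
    "price": np.arange(250, dtype=np.float32) * 1000.0,
    "size": np.arange(250, dtype=np.float32) + 500.0,
})
path = os.path.join(tempfile.mkdtemp(), "housing_example.csv")
data.to_csv(path, index=False)

batch_sizes = []
# read_csv() with chunksize yields one DataFrame of up to 100 rows at a time
for batch in pd.read_csv(path, chunksize=100):
    price = np.array(batch["price"], np.float32)
    size = np.array(batch["size"], np.float32)
    batch_sizes.append(len(price))

print(batch_sizes)  # [100, 100, 50]
```

The last batch is smaller because 250 rows do not divide evenly by 100, so training code should not assume a fixed batch length.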

4. Training a linear model in batches
 - We now know how to load data from csv files in fixed-size batches using pandas. This means that we can handle data sets of tens or even hundreds of gigabytes without exceeding the memory constraints of our system. Let's look at a minimal example with a linear model using the King County housing dataset. We will start by loading tensorflow, pandas, and numpy. Next, we'll define variables for the intercept and slope, along with the linear regression model.

5. Training a linear model in batches
 - We then define a loss function, which takes the slope and intercept, and two sources of data: the features and the targets. It then returns the mean squared error loss. After defining the loss function, we instantiate an adam optimizer, which we will use to perform minimization.

6. Training a linear model in batches
 - The next step is to train the model in batches. Again, we do this by using a for loop and supplying a chunksize to the read_csv() function. Note that we take each batch, separate it into features and a target, convert those into numpy arrays, and then pass them to the minimize operation. Within the minimize operation, we pass the loss function as a lambda function and we supply a variable list that contains only the trainable parameters, intercept and slope. This loop will continue until we have stepped through all of the examples in the CSV file. Importantly, we never needed to have more than 100 examples in memory during the entire process. Finally, we print our trained intercept and slope. Note that we did not use default argument values for input data. This is because our input data was generated in batches during the training process.

7. Full sample versus batch training
 - So what is the value of batch training? 
 - Full sample - When we trained with the full sample, we updated the optimizer and model parameters once per training epoch and passed data to the loss function without modification, but were limited by memory constraints. 
 - Batch training - With batch training, we updated the model weights and optimizer parameters multiple times each epoch and divided the data into batches, but no longer faced any memory constraints. 
 - In later chapters, you'll automate batch training by using high level APIs. Importantly, however, high level APIs will not typically load the sample in batches by default, as we have done here.
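The difference in update frequency can be seen in a small NumPy sketch. The data, learning rate, and helper names below are made up for illustration, and plain gradient descent is used rather than a TensorFlow optimizer.

```python
import numpy as np

# Hypothetical data: a noisy line with intercept 2 and slope 3
rng = np.random.default_rng(1)
features = rng.uniform(0.0, 1.0, size=1000)
targets = 2.0 + 3.0 * features + rng.normal(0.0, 0.05, size=1000)

def mse_gradients(intercept, slope, x, y):
    # Gradients of the mean squared error with respect to intercept and slope
    error = intercept + slope * x - y
    return 2.0 * error.mean(), 2.0 * (error * x).mean()

def train_one_epoch(batch_size, learning_rate=0.1):
    intercept, slope, updates = 0.0, 0.0, 0
    # One pass over the data, one parameter update per batch
    for start in range(0, len(features), batch_size):
        x = features[start:start + batch_size]
        y = targets[start:start + batch_size]
        grad_intercept, grad_slope = mse_gradients(intercept, slope, x, y)
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
        updates += 1
    return intercept, slope, updates

_, _, full_updates = train_one_epoch(batch_size=len(features))
_, _, batch_updates = train_one_epoch(batch_size=100)
print(full_updates, batch_updates)  # 1 10
```

With the full sample there is one update per epoch; with batches of 100 there are ten.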

Preparing to batch train
- Before we can train a linear model in batches, we must first define variables, a loss function, and an optimization operation. In this exercise, we will prepare to train a model that will predict price_batch, a batch of house prices, using size_batch, a batch of lot sizes in square feet. In contrast to the previous lesson, we will do this by loading batches of data using pandas, converting it to numpy arrays, and then using it to minimize the loss function in steps.

- Variable(), keras, and float32 have been imported for you. Note that you should not set default argument values for either the model or loss function, since we will generate the data in batches during the training process.

In [None]:
# Define the intercept and slope
intercept = Variable(10.0, dtype=float32)
slope = Variable(0.5, dtype=float32)

# Define the model
def linear_regression(intercept, slope, features):
    # Define the predicted values
    return intercept + features*slope

# Define the loss function
def loss_function(intercept, slope, targets, features):
    # Define the predicted values
    predictions = linear_regression(intercept, slope, features)
    
    # Define the MSE loss
    return keras.losses.mse(targets, predictions)

Training a linear model in batches
- In this exercise, we will train a linear regression model in batches, starting where we left off in the previous exercise. We will do this by stepping through the dataset in batches and updating the model's variables, intercept and slope, after each step. This approach will allow us to train with datasets that are otherwise too large to hold in memory.

- Note that the loss function, loss_function(intercept, slope, targets, features), has been defined for you. Additionally, keras has been imported for you and numpy is available as np. The trainable variables should be entered into var_list in the order in which they appear as loss function arguments.

In [None]:
# Initialize Adam optimizer
opt = keras.optimizers.Adam()

# Load data in batches
for batch in pd.read_csv('kc_house_data.csv', chunksize=100):
    size_batch = np.array(batch['sqft_lot'], np.float32)

    # Extract the price values for the current batch
    price_batch = np.array(batch['price'], np.float32)

    # Complete the loss, fill in the variable list, and minimize
    opt.minimize(lambda: loss_function(intercept, slope, price_batch, size_batch), var_list=[intercept, slope])

# Print trained parameters
print(intercept.numpy(), slope.numpy())

# 10.217888 0.7016

# Dense Layers - neural networks hidden layers

1. Dense layers
 - In this chapter, we will focus on training neural networks in TensorFlow. We will start with an overview of a frequently used component of neural networks: the dense layer.

2. The linear regression model
 - Throughout this chapter, we'll make use of a dataset on credit card default. It contains features, such as marital status and payment amount, which we'll use to predict a target, default. Here, we have the familiar linear regression model. We take marital status, which is 1, and bill amount, which is 3. We then multiply the inputs by their respective weights, zero point one and minus zero point two five, and sum.

3. What is a neural network?
 - So how do we get from a linear regression to a neural network? By adding a hidden layer, which, in this case, consists of two nodes. Each hidden layer node takes our two inputs, multiplies them by their respective weights, and sums them together. We also typically pass the hidden layer output to an activation function, but we will come back to that later. Finally, we sum together the outputs of the two hidden nodes to compute our prediction for default. This entire process of generating a prediction is referred to as forward propagation.
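The arithmetic of forward propagation can be written out in a few lines of NumPy. The inputs (1 and 3) and the regression weights (0.1 and -0.25) come from the example above; the second hidden node's weights are made up, and no activation function is applied yet.

```python
import numpy as np

# Inputs from the text: marital status = 1, bill amount = 3
inputs = np.array([1.0, 3.0])

# Linear regression from the text: multiply by the weights and sum
regression_weights = np.array([0.1, -0.25])
linear_prediction = inputs @ regression_weights  # 1*0.1 + 3*(-0.25)

# Hidden layer with two nodes: one column of weights per node.
# The first column reuses the regression weights; the second is hypothetical.
hidden_weights = np.array([[0.1, 0.3],
                           [-0.25, 0.2]])
hidden = inputs @ hidden_weights  # one value per hidden node

# Sum the hidden node outputs to get the prediction
prediction = hidden.sum()

print(linear_prediction, hidden, prediction)
```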

4. What is a neural network?
 - In this chapter, we will construct neural networks with three types of layers: an input layer, some number of hidden layers, and an output layer. The input layer consists of our features. The output layer contains our prediction. Each hidden layer takes inputs from the previous layer, applies numerical weights to them, sums them together, and then applies an activation function. In the neural network graph, we have applied a particular type of hidden layer called a dense layer. A dense layer applies weights to all nodes from the previous layer. We will use dense layers throughout this chapter to construct networks.

5. A simple dense layer
 - Let's look at a simple example of a dense layer. We'll first define a constant tensor that contains the marital status and age data as the input layer. We then initialize weights as a variable, since we will train those weights to predict the output from the inputs. We also define a bias, which will play a similar role to the intercept in the linear regression model.

6. A simple dense layer
 - Finally, we define a dense layer. Note that we first perform a matrix multiplication of the inputs by the weights and assign that to the tensor named product. We then add product to the bias and apply a non-linear transformation, in this case the sigmoid function. This is called the activation function and we will explore this in more depth in the next video, but do not worry about it for now. Furthermore, note that the bias is not associated with a feature and is analogous to the intercept in a linear regression. We will typically not draw it in neural network diagrams for simplicity.
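A minimal sketch of the low-level dense layer just described. The input values and the starting values of the weights and bias are assumptions chosen for illustration.

```python
import tensorflow as tf

# Hypothetical input layer: one borrower with two features (marital status, age)
inputs = tf.constant([[1.0, 35.0]])

# Trainable weights (2 features -> 1 node) and a bias with arbitrary starting values
weights = tf.Variable([[-0.05], [-0.01]])
bias = tf.Variable([0.5])

# Matrix multiplication of the inputs by the weights
product = tf.matmul(inputs, weights)

# Add the bias and apply the sigmoid activation function
dense = tf.keras.activations.sigmoid(product + bias)

print(dense.shape)
print(dense.numpy())
```

The output has one row per example and one column per node, and the sigmoid keeps every value strictly between 0 and 1.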

7. Defining a complete model
 - Note that TensorFlow also comes with higher level operations, such as tf.keras.layers.Dense, which allows us to skip the linear algebra. In this example, we take input data and convert it to a 32-bit float tensor. We then define a first hidden dense layer using keras.layers.Dense. The first argument specifies the number of outgoing nodes. And the second argument is the activation function. By default, a bias will be included. Note that we've also passed inputs as an argument to the first dense layer.

8. Defining a complete model
 - We can easily define another dense layer, which takes the first dense layer as an argument and then reduces the number of nodes. The output layer reduces this again to one node.

9. High-level versus low-level approach
 - Finally, let's compare the high-level and low-level approaches. The high-level approach relies on complex operations in high-level APIs, such as Keras and Estimators, reducing the amount of code needed. The weights and the mathematical operations will typically be hidden by the layer constructor. The low-level approach uses linear algebra, which allows for the construction of any model. TensorFlow allows us to use either approach or even combine them.
 - high-level approach
    - dense = keras.layers.Dense(10,activation='sigmoid')
 - low-level approach
    - prod = matmul(inputs, weights)
    - dense = keras.activations.sigmoid(prod)
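To check that the two approaches compute the same thing, the sketch below builds a Keras dense layer, reads its randomly initialized weights back with get_weights(), and repeats the calculation by hand in NumPy. The input values are random placeholders, not the credit card data.

```python
import numpy as np
import tensorflow as tf

# Hypothetical inputs: 4 examples with 3 features each
features = np.random.default_rng(0).normal(size=(4, 3)).astype(np.float32)
inputs = tf.constant(features)

# High-level approach: the layer creates and hides its weights and bias
layer = tf.keras.layers.Dense(10, activation="sigmoid")
high_level = layer(inputs).numpy()

# Low-level approach: reuse the same weights and do the linear algebra by hand
kernel, bias = layer.get_weights()  # shapes (3, 10) and (10,)
low_level = 1.0 / (1.0 + np.exp(-(features @ kernel + bias)))

print(high_level.shape)
print(np.allclose(high_level, low_level, atol=1e-5))
```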

## The linear algebra of dense layers

There are two ways to define a dense layer in tensorflow. 
The first involves the use of low-level, linear algebraic 
operations. 
The second makes use of high-level keras operations. 
In this exercise, we will use the first method to construct 
the network shown in the image below.

The image depicts a neural network with an input layer, 
one hidden layer, and an output layer.
The input layer contains 3 features -- education, marital status,
and age -- which are available as borrower_features. 
The hidden layer contains 2 nodes and the 
output layer contains a single node.

For each layer, you will take the previous layer as an input, 
initialize a set of weights, compute the product of the inputs 
and weights, and then apply an activation function. 
Note that Variable(), ones(), and matmul() have been imported 
from tensorflow, along with the keras module.

In [None]:
# Initialize bias1
bias1 = Variable(1.0)

# Initialize weights1 as 3x2 variable of ones
weights1 = Variable(ones((3, 2)))

# Perform matrix multiplication of borrower_features and weights1
product1 = matmul(borrower_features, weights1)

# Apply sigmoid activation function to product1 + bias1
dense1 = keras.activations.sigmoid(product1 + bias1)

# Print shape of dense1
print("\n dense1's output shape: {}".format(dense1.shape))

# dense1's output shape: (1, 2)

In [None]:
# From previous step
bias1 = Variable(1.0)
weights1 = Variable(ones((3, 2)))
product1 = matmul(borrower_features, weights1)
dense1 = keras.activations.sigmoid(product1 + bias1)

# Initialize bias2 and weights2 as 2x1 tensor of ones
bias2 = Variable(1.0)
weights2 = Variable(ones((2, 1)))

# Perform matrix multiplication of dense1 and weights2
product2 = matmul(dense1, weights2)

# Apply activation to product2 + bias2 and print the prediction
prediction = keras.activations.sigmoid(product2 + bias2)
print('\n prediction: {}'.format(prediction.numpy()[0,0]))
print('\n actual: 1')

#  prediction: 0.9525741338729858

#  actual: 1

## The low-level approach with multiple examples

In this exercise, we'll build further intuition for the low-level approach by constructing the first dense hidden layer for the case where we have multiple examples. We'll assume the model is trained and the first layer weights, weights1, and bias, bias1, are available. We'll then perform matrix multiplication of the borrower_features tensor by the weights1 variable. Recall that the borrower_features tensor includes education, marital status, and age. Finally, we'll apply the sigmoid function to the elements of products1 + bias1, yielding dense1.
(picture of products1 matrices)
 
Note that matmul() and the keras module have been imported from tensorflow.

In [None]:
# Compute the product of borrower_features and weights1
products1 = matmul(borrower_features, weights1)

# Apply a sigmoid activation function to products1 + bias1
dense1 = keras.activations.sigmoid(products1 + bias1)

# Print the shapes of borrower_features, weights1, bias1, and dense1
print('\n shape of borrower_features: ', borrower_features.shape)
print('\n shape of weights1: ', weights1.shape)
print('\n shape of bias1: ', bias1.shape)
print('\n shape of dense1: ', dense1.shape)

shape of borrower_features:  (5, 3)
      - 5x3 b/c 5 examples and 3 features
shape of weights1:  (3, 2)
      - 3x2 b/c it doesn't depend on number of examples
shape of bias1:  (1,)
      - a single value, broadcast across all examples and nodes
shape of dense1:  (5, 2)
      - we can multiply by weights2, which is 2x1

## Using the dense layer operation

We've now seen how to define dense layers in tensorflow using linear algebra. In this exercise, we'll skip the linear algebra and let keras work out the details. This will allow us to construct the network below, which has 2 hidden layers and 10 features, using less code than we needed for the network with 1 hidden layer and 3 features.

To construct this network, we'll need to define three dense layers, each of which takes the previous layer as an input, multiplies it by weights, and applies an activation function. Note that input data has been defined and is available as a 100x10 tensor: borrower_features. Additionally, the keras.layers module is available.

In [None]:
# Define the first dense layer - 7 nodes
dense1 = keras.layers.Dense(7, activation='sigmoid')(borrower_features)

# Define a dense layer with 3 output nodes
dense2 = keras.layers.Dense(3,activation='sigmoid')(dense1)

# Define a dense layer with 1 output node
predictions = keras.layers.Dense(1, activation='sigmoid')(dense2)

# Print the shapes of dense1, dense2, and predictions
print('\n shape of dense1: ', dense1.shape)
print('\n shape of dense2: ', dense2.shape)
print('\n shape of predictions: ', predictions.shape)

shape of dense1:  (100, 7)

shape of dense2:  (100, 3)

shape of predictions:  (100, 1)

Note that each layer has 100 rows because the input data contains 100 examples.

## Activation functions

1. Activation functions
 - In the previous video, we discussed dense layers. We also briefly introduced the concept of an activation function through the sigmoid function. We will now return to activation functions.

2. What is an activation function?
 - A typical hidden layer consists of two operations. The first performs matrix multiplication, which is a linear operation, and the second applies an activation function, which is a nonlinear operation.

3. Why nonlinearities are important
 - Why do we need this nonlinear component? Consider a simple model using the credit card data. The features are borrower age and credit card bill amount. The target variable is default.

4. Why nonlinearities are important
 - Let's say we create a scatterplot of age and bill amount. We can see that bill amount usually increases early in life and decreases later in life. This suggests that a high bill for young and older borrowers may mean something different for default. If we want our model to capture this, it can't be linear. It must allow the impact of the bill amount to depend on the borrower's age. This is what an activation function does.

5. A simple example
 - Let's look at a simple example, where we assume that the weight on age is 1 and the weight on the bill amount is 2. Note that ages are divided by 100 and the bill's amount is divided by 10000. We then perform the matrix multiplication step for all combinations of features: young with a high bill, young with a low bill, old with a high bill, and old with a low bill.

6. A simple example
 - If we don't apply an activation function and we assume the bias is zero, we find that the impact of bill size on default does not depend on age. In both cases, moving from a low bill to a high bill changes the prediction by 0.8. Note that our target is a binary variable that is equal to 1 when the borrower defaults; however, predictions will be real numbers between 0 and 1, where values over 0.5 will be treated as predicting default.

7. A simple example
 - But what if we apply a sigmoid activation function? The impact of bill amount on default now depends on the borrower's age. In particular, we can see that the change in the predicted value for default is larger for young borrowers than it is for old borrowers.
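The effect can be reproduced with a few lines of NumPy. The weights (1 for age, 2 for bill amount) come from the example above; the scaled ages and bill amounts are made-up values chosen so that the high and low bills differ by 0.4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights from the text: 1 on age, 2 on bill amount
weights = np.array([1.0, 2.0])

# Hypothetical scaled features: [age / 100, bill amount / 10000]
young_high = np.array([0.25, 0.7])
young_low = np.array([0.25, 0.3])
old_high = np.array([0.65, 0.7])
old_low = np.array([0.65, 0.3])

# Without an activation function, the impact of the bill is the same for both ages
linear_young = young_high @ weights - young_low @ weights
linear_old = old_high @ weights - old_low @ weights

# With a sigmoid activation, the impact of the bill depends on age
sigmoid_young = sigmoid(young_high @ weights) - sigmoid(young_low @ weights)
sigmoid_old = sigmoid(old_high @ weights) - sigmoid(old_low @ weights)

print(linear_young, linear_old)
print(sigmoid_young, sigmoid_old)
```

With these values both linear differences are 0.8 (up to floating-point rounding), while the sigmoid difference is larger for the young borrower than for the old one.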

8. sigmoid activation function - binary classification
 - In this course, we'll use the three most common activation functions: sigmoid, relu, and softmax. 
 - The sigmoid activation function is used primarily in the output layer of binary classification problems. 
 - low-level approach, we'll pass the sum of the product of weights and inputs into:
     - tf.keras.activations.sigmoid()
 - high-level approach, we'll simply pass sigmoid as a parameter to a keras dense layer
     - sigmoid parameter

9. ReLU activation function = rectified linear unit
 - We'll typically use the rectified linear unit or ReLU activation in all layers other than the output layer. This activation simply takes the maximum of the value passed to it and 0.
 - low level: tf.keras.activations.relu()
 - high level: relu

10. softmax activation function - multiclass classification
 - Finally, the softmax activation function is used in the output layer in classification problems with more than two classes.
     - Output layer > 2 classes
     - low level: tf.keras.activations.softmax()
     - high level: softmax
 - The outputs from a softmax activation function can be interpreted as predicted class probabilities in multiclass classification problems.
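For reference, the three activation functions can be written directly in NumPy. This is a sketch of the standard definitions, not the Keras implementations; the softmax subtracts the row maximum before exponentiating, a common trick for numerical stability.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Takes the maximum of each value and 0
    return np.maximum(z, 0.0)

def softmax(z):
    # Converts each row into probabilities that sum to 1
    shifted = z - z.max(axis=-1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=-1, keepdims=True)

z = np.array([[-2.0, 0.0, 3.0]])
print(sigmoid(z))
print(relu(z))
print(softmax(z), softmax(z).sum())
```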

11. Activation functions in neural networks
 - Let's wrap up by applying some activation functions in a neural network. We'll do this using the high-level approach, starting with an input layer. We'll pass this to our first dense layer, which has 16 output nodes and a relu activation. Dense layer 2 then reduces the number of nodes from 16 to 8 and applies a sigmoid activation. Finally, we apply a softmax activation function in the output layer, since there are more than 2 outputs.

In [None]:
# example of activation fxns in neural networks
import tensorflow as tf

# define input layer
inputs = tf.constant(borrower_features, tf.float32)

# define dense layer 1
dense1 = tf.keras.layers.Dense(16, activation='relu')(inputs)

# define dense layer 2
dense2 = tf.keras.layers.Dense(8, activation='sigmoid')(dense1)

# define output layer
outputs = tf.keras.layers.Dense(4, activation='softmax')(dense2)

### Binary classification problems

In this exercise, you will again make use of credit card data. The target variable, default, indicates whether a credit card holder defaults on his or her payment in the following period. Since there are only two options--default or not--this is a binary classification problem. While the dataset has many features, you will focus on just three: the size of the three latest credit card bills. Finally, you will compute predictions from your untrained network, outputs, and compare those to the target variable, default.

The tensor of features has been loaded and is available as bill_amounts. Additionally, the constant(), float32, and keras.layers.Dense() operations are available.

In [None]:
# Construct input layer from features
inputs = constant(bill_amounts, float32)

# Define first dense layer
dense1 = keras.layers.Dense(3, activation='relu')(inputs)

# Define second dense layer
dense2 = keras.layers.Dense(2,activation='relu')(dense1)

# Define output layer
outputs = keras.layers.Dense(1, activation='sigmoid')(dense2)

# Print error for first five examples
error = default[:5] - outputs.numpy()[:5]
print(error)

[[-0.5]
 [-0.5]
 [-0.5]
 [-0.5]
 [-0.5]]

If you run the code several times, you'll notice that the errors change each time. This is because you're using an untrained model with randomly initialized parameters. Furthermore, the errors fall on the interval between -1 and 1 because default is a binary variable that takes on values of 0 and 1 and outputs is a probability between 0 and 1.

### Multiclass classification problems - targets with 3+ values

In this exercise, we expand beyond binary classification to cover multiclass problems. A multiclass problem has targets that can take on three or more values. In the credit card dataset, the education variable can take on 6 different values, each corresponding to a different level of education. We will use that as our target in this exercise and will also expand the feature set from 3 to 10 columns.

As in the previous problem, you will define an input layer, dense layers, and an output layer. You will also print the untrained model's predictions, which are probabilities assigned to the classes. The tensor of features has been loaded and is available as borrower_features. Additionally, the constant(), float32, and keras.layers.Dense() operations are available.

In [None]:
# Construct input layer from borrower features
inputs = constant(borrower_features, float32)

# Define first dense layer
dense1 = keras.layers.Dense(10, activation='sigmoid')(inputs)

# Define second dense layer
dense2 = keras.layers.Dense(8, activation='relu')(dense1)

# Define output layer
outputs = keras.layers.Dense(6, activation='softmax')(dense2)

# Print first five predictions
print(outputs.numpy()[:5])

'''
[[0.15857692 0.08764538 0.24660672 0.23920012 0.14834145 0.11962939]
 [0.15037695 0.08141254 0.31689462 0.22537051 0.12768166 0.09826373]
 [0.19773847 0.11176475 0.16081297 0.22276174 0.18630679 0.12061525]
 [0.18996309 0.12200031 0.2131918  0.19491863 0.16372119 0.11620501]
 [0.15857692 0.08764538 0.24660672 0.23920012 0.14834145 0.11962939]]

Notice that each row of outputs sums to one. 
This is because a row contains the predicted class probabilities
for one example. As with the previous exercise, our predictions 
are not yet informative, since we are using an untrained model 
with randomly initialized parameters. This is why the model 
tends to assign similar probabilities to each class.
'''

## Optimizers

1. Optimizers
 - In chapter 2, you minimized a loss function with an optimizer. We'll revisit that here in the context of training neural networks. This entails finding the set of weights that corresponds to the minimum value of the loss.

2. How to find a minimum
 - So what is a minimization problem? And what can go wrong when we try to solve one? Let's start with a simple thought experiment: you want to find the lowest point in the Grand Canyon, but all you can do is pick a point, measure the elevation, and then repeat the same to nearby points. This is what you do when you train a neural network: you pick a starting point, measure the loss, and then try to move to a lower loss. We will see how a common optimization algorithm, gradient descent, solves this problem.

 - 1 Source: U.S. National Park Service

3. How to find a minimum
 - Let's start by picking a point and measuring the elevation. From that point, we'll move along the slope until we arrive on a flat surface. To understand what's going on, imagine you dropped a ball into the canyon from the point you selected. If you drop the ball on a slope above a plateau, the ball will stop when it reaches the plateau. If this happens, the gradient descent algorithm will fail. It will stop on a local minimum and will progress no further.

4. How to find a minimum
 - Let's say you pick a different spot. This time, the ball lands on a slope with an unobstructed path to the lowest point in the canyon. Here, the gradient descent algorithm works and the ball reaches the global minimum. Notice that gravity performs the role of the gradient descent optimizer.

5. Stochastic gradient descent - SGD
 - Stochastic gradient descent or SGD is an improved version of gradient descent that is less likely to get stuck in local minima. For simple problems, the SGD algorithm performs well. Here, the SGD loss function value quickly falls below the losses for the more recently developed RMS Prop and the Adam optimizers on a simple minimization task. Adam and RMS require 10 times as many iterations to achieve a similar loss.

6. The gradient descent optimizer
 - Let's move on to the TensorFlow implementation for these optimizers, starting with SGD, which you can instantiate using the keras optimizers module. 
 - tf.keras.optimizers.SGD()
 - You can then supply a learning rate, typically between zero point five and zero point zero zero one, which will determine how quickly the model parameters adjust during training. 
 - learning_rate
     - 0.5 to 0.001
 - Think of a higher learning rate as exerting more force on the ball than gravity alone. The ball will move faster and skip over some plateaus, but it may miss the global minimum, too. The main advantage of SGD is that it is simpler and easier to interpret than more modern optimization algorithms.
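The canyon analogy can be sketched with plain (non-stochastic) gradient descent on a one-variable function. The function below is a made-up example with one local and one global minimum; it is not the loss used in the course.

```python
def loss(x):
    # Hypothetical loss: local minimum near x = 1.13, global minimum near x = -1.30
    return x**4 - 3.0 * x**2 + x

def gradient(x):
    # Derivative of the loss above
    return 4.0 * x**3 - 6.0 * x + 1.0

def gradient_descent(x_start, learning_rate=0.01, steps=500):
    x = x_start
    for _ in range(steps):
        # Move against the gradient, scaled by the learning rate
        x -= learning_rate * gradient(x)
    return x

x_right = gradient_descent(2.0)   # starts on the right-hand slope
x_left = gradient_descent(-2.0)   # starts on the left-hand slope

print(x_right, loss(x_right))
print(x_left, loss(x_left))
```

Starting from 2.0, the algorithm settles in the local minimum; starting from -2.0, it reaches the lower, global one. The starting point decides the outcome, just as the drop point decides where the ball comes to rest.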

7. The RMS prop optimizer = Root mean squared propagation optimizer
 - keras.optimizers.RMSprop()
 - 2 advantages over SGD. 
    1. Applies different learning rates to each feature, which can be useful for high dimensional problems.
    2. Lets momentum both build up and decay. Setting a low value for the decay parameter will prevent momentum from accumulating over long periods during the training process.

8. The adam optimizer
 - Finally, the adaptive moment or "adam" optimizer provides further improvements and is generally a good first choice. 
 - Similar to RMS prop, you can set the momentum to decay faster by lowering the beta1 parameter. 
 - Relative to RMS prop, the adam optimizer will tend to perform better with the default parameter values, which we will typically use.
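To make the role of beta1 concrete, here is a sketch of the commonly published Adam update rule applied to a one-variable quadratic. It is an illustration under stated assumptions: the helper name, the test function, and the hyperparameter values are made up, and the Keras implementation may differ in details.

```python
import numpy as np

def adam_minimize(grad_fn, x0, steps=500, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, epsilon=1e-7):
    x = float(x0)
    m = 0.0  # running average of gradients (momentum)
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # Lower beta1 means past gradients are forgotten faster
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Correct the bias from initializing both averages at zero
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        x -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x

# Hypothetical loss (x - 3)^2, whose gradient is 2 * (x - 3)
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

The result should end up close to 3, the minimum of the quadratic.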

9. A complete example
 - Let's return to our credit card default prediction problem and assume that features have been imported and weights have been initialized. We'll then define a model that computes the predictions and a loss function that computes the binary_crossentropy loss, which is the standard for binary classification problems. Finally, we define an RMS prop optimizer with a learning rate of 0.01 and a momentum parameter of 0.9, and then perform minimization.

In [None]:
# Example
import tensorflow as tf

# define the model function
def model(bias, weights, features = borrower_features):
    product = tf.matmul(features, weights)
    return tf.keras.activations.sigmoid(product+bias)

# compute the predicted values and the loss
# binary_crossentropy is the standard loss for binary classification
def loss_function(bias, weights, targets=default, features=borrower_features):
    # pass features through so a non-default argument is not ignored
    predictions = model(bias, weights, features)
    return tf.keras.losses.binary_crossentropy(targets, predictions)

# minimize the loss function with RMS propagation
opt = tf.keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.9)
opt.minimize(lambda: loss_function(bias, weights), var_list=[bias, weights])
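
The two building blocks of this example, the sigmoid activation and the binary cross-entropy loss, can also be written by hand. The sketch below is plain Python with hypothetical one-feature data. It averages the per-example loss and clips predictions away from 0 and 1 to avoid log(0). The clipping constant is an assumption for illustration.

```python
import math

def sigmoid(z):
    # Squash any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(targets, predictions, eps=1e-7):
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)

# Hypothetical data: one feature per borrower, 1 = default
features = [0.0, 2.0, -2.0]
targets = [1, 1, 0]
weight, bias = 1.0, 0.0

predictions = [sigmoid(weight * x + bias) for x in features]
loss = binary_crossentropy(targets, predictions)
print(predictions, loss)
```

A prediction of 0.5 for a true label of 1 costs log(2), about 0.693. Confident correct predictions cost much less, and confident wrong ones cost much more.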

### The dangers of local minima
Consider the plot of the following loss function, loss_function(), which contains a global minimum, marked by the dot on the right, and several local minima, including the one marked by the dot on the left.

The graph is of a single variable function that contains multiple local minima and a global minimum.

In this exercise, you will try to find the global minimum of loss_function() using keras.optimizers.SGD(). You will do this twice, each time with a different initial value of the input to loss_function(). First, you will use x_1, which is a variable with an initial value of 6.0. Second, you will use x_2, which is a variable with an initial value of 0.3. Note that loss_function() has been defined and is available.

In [None]:
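# Initialize x_1 and x_2 values
# (dtype is passed by keyword; the second positional argument of Variable is trainable)
x_1 = Variable(6.0, dtype=float32)
x_2 = Variable(0.3, dtype=float32)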

# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)

for j in range(100):
    # Perform minimization using the loss function and x_1
    opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Perform minimization using the loss function and x_2
    opt.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())

# 4.3801394 0.42052683
'''
Notice that we used the same optimizer and loss function, 
but two different initial values. 
When we started at 6.0 with x_1, 
we found the global minimum at 4.38, 
marked by the dot on the right. 
When we started at 0.3, we stopped around 0.42 with x_2, 
the local minimum marked by a dot on the far left.
'''
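
The same behaviour can be reproduced without TensorFlow. The sketch below is a minimal illustration and does not use the course's loss_function(). It assumes a hypothetical quartic, x**4 - 3*x**2 + x, which has a global minimum near x = -1.30 and a local minimum near x = 1.13, and runs plain gradient descent from two starting points.

```python
def loss(x):
    # Hypothetical loss with one global and one local minimum
    return x ** 4 - 3.0 * x ** 2 + x

def gradient(x):
    # Derivative of the hypothetical loss
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def gradient_descent(x, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

x_left = gradient_descent(-2.0)   # starts in the basin of the global minimum
x_right = gradient_descent(2.0)   # starts in the basin of the local minimum

print(x_left, loss(x_left))
print(x_right, loss(x_right))
```

Both runs use the same optimizer settings. Only the starting point decides which minimum is found.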

### Avoiding local minima

The previous problem showed how easy it is to get stuck in local minima. We had a simple optimization problem in one variable and gradient descent still failed to deliver the global minimum when we had to travel through local minima first. One way to avoid this problem is to use momentum, which allows the optimizer to break through local minima. We will again use the loss function from the previous problem, which has been defined and is available for you as loss_function().

The graph is of a single variable function that contains multiple local minima and a global minimum.

Several optimizers in tensorflow have a momentum parameter, including SGD and RMSprop. You will make use of RMSprop in this exercise. Note that x_1 and x_2 have been initialized to the same value this time. Furthermore, keras.optimizers.RMSprop() has also been imported for you from tensorflow.

In [None]:
# Using momentum param in RMSprop helps avoid local min

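# Initialize x_1 and x_2
# (dtype is passed by keyword; the second positional argument of Variable is trainable)
x_1 = Variable(0.05, dtype=float32)
x_2 = Variable(0.05, dtype=float32)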

# Define the optimization operation for opt_1 and opt_2
opt_1 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.99)
opt_2 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.00)

for j in range(100):
    opt_1.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Define the minimization operation for opt_2
    opt_2.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())

# 4.3150263 0.4205261
'''
Recall that the global minimum is approximately 4.38. 
Notice that opt_1 built momentum, bringing x_1 closer 
to the global minimum. 
To the contrary, opt_2, which had a momentum parameter of 0.0, 
got stuck in the local minimum on the left.
'''

## Training a network in TensorFlow

1. Training a network in TensorFlow
 - In the final video in this chapter, we'll wrap-up by discussing important topics related to training neural networks in TensorFlow.

2. Initializing variables
 - We saw that finding the global minimum can be difficult, even when we're minimizing a simple loss function. We also saw that we could improve our chances by selecting better initial values for variables. But what can we do for more challenging problems with many variables? Take the eggholder function, for example, which has many local minima. It is difficult to see a global minimum on the plot, but it has one. How can we select initial values for x and y, the two inputs to the eggholder function? Even worse, what if we have a loss function that depends on hundreds of variables?

3. Random initializers
 - Often need to initialize hundreds or thousands of variables. 
 - Simply using tf.ones() will either not work or perform poorly. 
 - Selecting initial values individually is tedious and infeasible in many cases. 
 - A natural alternative to this is to use random or algorithmic generation of initial values (from distribution). We can, for instance, draw them from a probability distribution, such as the normal or uniform distributions. 
 - There are also specialized options, such as the Glorot initializers, which are designed for ML algorithms.
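
As an illustration of what a Glorot initializer does, the Glorot uniform scheme draws each weight from a uniform distribution on [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)). The sketch below uses only the Python standard library. The helper name and the 23-by-7 layer shape are illustrative assumptions, and this is not the Keras implementation.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, seed=0):
    # The range shrinks as the layer gets wider
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit

# Hypothetical layer with 23 inputs and 7 output nodes
w1, limit = glorot_uniform(23, 7)
print(limit)
```

Tying the range to the layer size keeps the scale of activations roughly stable from layer to layer, which is the motivation usually given for this family of initializers.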

4. Initializing variables in TensorFlow
 - Let's start by using the low-level approach to initialize a 500x500 variable. We can do this using draws from a random normal distribution by passing the shape 500, 500 to tf.random.normal and passing the result to tf.Variable. Alternatively, we could use the truncated random normal distribution, which discards very large and very small draws.

5. Initializing variables in TensorFlow
 - We can also use the high-level approach by initializing a dense layer using the default keras option, currently the glorot uniform initializer, as we've done in all exercises thus far. If we instead wish to initialize values to zero, we can do this using the kernel initializer parameter.

6. Neural networks and overfitting
 - Overfitting is another important issue you'll encounter when training neural networks. Let's say you have a linear relationship between two variables. You decide to represent this relationship with a linear model, shown in red, and a more complex model, shown in blue. The complex model perfectly predicts the values in the training set, but performs worse in the test set. The complex model performed poorly because it overfit. It simply memorized examples, rather than learning general patterns. Overfitting is especially problematic for neural networks, which contain many parameters and are quite good at memorization.

7. Applying dropout - decrease overfit
 - A simple solution to the overfitting problem is to use dropout, an operation that will randomly drop the weights connected to certain nodes in a layer during the training process, as shown on the right. 
 - This will force your network to develop more robust rules for classification, since it cannot rely on any particular nodes being passed to an activation function. This will tend to improve out-of-sample performance.
 - Example: drop 25% of nodes after a dense layer dense2
 - dropout1 = tf.keras.layers.Dropout(0.25)(dense2)
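
To show the mechanics, here is a minimal sketch of dropout on a single layer's output in plain Python. It follows the inverted dropout convention, which Keras also uses during training: each value is zeroed with probability rate and the survivors are scaled by 1 / (1 - rate), so the expected sum is unchanged. The function name and the 16-node layer are illustrative.

```python
import random

def dropout(values, rate, rng):
    keep_probability = 1.0 - rate
    # Zero a value with probability `rate`, otherwise rescale it
    return [v / keep_probability if rng.random() >= rate else 0.0
            for v in values]

rng = random.Random(0)
layer_output = [1.0] * 16          # hypothetical activations of 16 nodes
dropped = dropout(layer_output, rate=0.25, rng=rng)
print(dropped)
```

Dropout is applied only while training. At prediction time the layer passes its inputs through unchanged.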

8. Implementing dropout in a network
 - Let's look at how dropout works. We first define an input layer using the borrower features from our credit card dataset as an input. We then pass the input layer to a dense layer, which has 32 nodes and uses a relu activation function.

9. Implementing dropout in a network
 - We'll next pass the first dense layer to a second layer, which reduces the number of output nodes to 16. Before passing those nodes to the output layer, we'll apply a dropout layer. The only argument specifies that we want to drop the weights connected to 25% of nodes randomly. We'll then pass this to the output layer, which reduces the 16 nodes to 1 and applies a sigmoid activation function.

In [None]:
# Example initializing variables - low level option
import tensorflow as tf

# Option 1
# define 500x500 random normal variable
weights = tf.Variable(tf.random.normal([500, 500]))

# Option 2
# define 500x500 truncated random normal variable
# truncated random normal distribution, 
# which discards very large and very small draws.
weights = tf.Variable(tf.random.truncated_normal([500, 500]))

In [None]:
# Example initializing variable - high level option

# define a dense layer with the default initializer
dense = tf.keras.layers.Dense(32, activation='relu')

# define a dense layer with the zeros initializer using kernel
dense = tf.keras.layers.Dense(32, activation='relu', kernel_initializer='zeros')

### Initialization in TensorFlow
A good initialization can reduce the amount of time needed to find the global minimum. In this exercise, we will initialize weights and biases for a neural network that will be used to predict credit card default decisions. To build intuition, we will use the low-level, linear algebraic approach, rather than making use of convenience functions and high-level keras operations. We will also expand the set of input features from 3 to 23. Several operations have been imported from tensorflow: Variable(), ones(), and the random module.

In [None]:
# Define the layer 1 weights
w1 = Variable(random.normal([23, 7]))

# Initialize the layer 1 bias
b1 = Variable(ones([7]))

# Define the layer 2 weights
w2 = Variable(random.normal([7,1]))

# Define the layer 2 bias, initial value 0.0
b2 = Variable(0.0)

### Defining the model and loss function
In this exercise, you will train a neural network to predict whether a credit card holder will default. The features and targets you will use to train your network are available in the Python shell as borrower_features and default. You defined the weights and biases in the previous exercise.

Note that the predictions layer is defined as $\sigma(\text{layer1} \cdot w2 + b2)$, where $\sigma$ is the sigmoid activation, layer1 is a tensor of nodes for the first hidden dense layer, w2 is a tensor of weights, and b2 is the bias tensor.

The trainable variables are w1, b1, w2, and b2. Additionally, the following operations have been imported for you: keras.activations.relu() and keras.layers.Dropout().

In [None]:
# Define the model
def model(w1, b1, w2, b2, features = borrower_features):
    # Apply relu activation functions to layer 1
    layer1 = keras.activations.relu(matmul(features, w1) + b1)
    # Apply dropout rate of 0.25
    dropout = keras.layers.Dropout(0.25)(layer1)
    return keras.activations.sigmoid(matmul(dropout, w2) + b2)

# Define the loss function
def loss_function(w1, b1, w2, b2, features = borrower_features, targets = default):
    # Pass features through so a non-default argument is not ignored
    predictions = model(w1, b1, w2, b2, features)
    # Pass targets and predictions to the cross entropy loss
    return keras.losses.binary_crossentropy(targets, predictions)

### Training neural networks with TensorFlow
In the previous exercise, you defined a model, model(w1, b1, w2, b2, features), and a loss function, loss_function(w1, b1, w2, b2, features, targets), both of which are available to you in this exercise. You will now train the model and then evaluate its performance by predicting default outcomes in a test set, which consists of test_features and test_targets and is available to you. The trainable variables are w1, b1, w2, and b2. Additionally, the following operations have been imported for you: keras.activations.relu() and keras.layers.Dropout().

In [None]:
# Train the model
for j in range(100):
    # Complete the optimizer
    opt.minimize(lambda: loss_function(w1, b1, w2, b2), var_list=[w1, b1, w2, b2])

# Make predictions with model using test features
model_predictions = model(w1, b1, w2, b2, test_features)

# Construct the confusion matrix
confusion_matrix(test_targets, model_predictions)

'''
367 correct, 90 incorrect
The diagram shown is called a "confusion matrix." 
The diagonal elements show the number of correct predictions. 
The off-diagonal elements show the number of incorrect 
predictions. We can see that the model performs reasonably well, 
but does so by overpredicting non-default. 
This suggests that we may need to train longer, 
tune the model's hyperparameters, 
or change the model's architecture.
'''
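
The counts in a confusion matrix are simple to compute by hand. The sketch below is plain Python with hypothetical targets and predicted probabilities. It is not the confusion_matrix() helper used above, and it assumes a 0.5 threshold with rows indexed by the true class and columns by the predicted class.

```python
def confusion_counts(targets, probabilities, threshold=0.5):
    # matrix[true_class][predicted_class]
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(targets, probabilities):
        predicted = 1 if p >= threshold else 0
        matrix[y][predicted] += 1
    return matrix

# Hypothetical data: 1 = default, 0 = no default
targets = [0, 0, 0, 1, 1, 0]
probabilities = [0.1, 0.4, 0.7, 0.8, 0.3, 0.2]

matrix = confusion_counts(targets, probabilities)
correct = matrix[0][0] + matrix[1][1]
incorrect = matrix[0][1] + matrix[1][0]
print(matrix, correct, incorrect)
```

Here the diagonal holds 4 correct predictions and the off-diagonal holds 2 incorrect ones. A model that overpredicts non-default would show a large count in the row for true defaults under the column for predicted non-default.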

# Defining neural networks with Keras

1. Defining neural networks with Keras
 - In chapter 3, we saw how to define neural networks in TensorFlow, both using linear algebra and higher level Keras operations. In this lesson, we will introduce the Keras sequential API, and expand on our brief and informal introduction of the Keras functional API.

2. Classifying sign language letters
 - Throughout this chapter, we'll focus on using Keras to classify four letters from the Sign Language MNIST dataset: a, b, c, and d. Note that the images appear to be low resolution because each is represented by a 28x28 matrix.

3. The sequential API
 - Now, let's say we experiment with several different architectures and select the one that makes the most accurate predictions. 
 - Assumes
        - input layer
        - hidden layers - a first hidden layer with 16 nodes, and a second hidden layer with 8 nodes
        - output layer - 4 output nodes, since there are 4 letters in the dataset.
        - ordered in a sequence

4. The sequential API
 - A good way to construct this model in Keras is to use the sequential API. This API is simpler and makes strong assumptions about how you will construct your model. It assumes that you have an input layer, some number of hidden layers, and an output layer. All of these layers are ordered one after the other in a sequence.

5. Building a sequential model
 - We'll start by importing tensorflow. We can then define a sequential model, which we'll name model. Once we have defined this object, we can simply stack layers on top of it sequentially using the add method. Let's start by adding the first hidden layer, which is a dense layer with 16 nodes. We'll select a relu activation function and supply an input_shape, which Keras requires for the first layer. This input shape is simply a tuple that contains the dimensions of our data. Since we'll be using 28 by 28 pixel images, reshaped into a vector, we'll supply (28*28,) as the input shape.

6. Building a sequential model
 - Next, we'll define a second hidden layer according to the desired model architecture. Finally, we specify that the model has 4 output nodes and uses a softmax activation function. If we want to check our model's architecture, we can use the .summary() method, which we'll return to in the upcoming exercises. The model has now been defined, but it is not yet ready to be trained. We must first perform a compilation step, where we specify the optimizer and loss function. Here, we've selected the adam optimizer and the categorical crossentropy loss function, which we'll use for classification problems with more than 2 classes.

7. The functional API
 - But what if you want to train two models jointly to predict the same target? The functional API is for that.

8. Using the functional API
 - As an example, let's say we have a set of 28x28 images and a set of 10 features of metadata. We want to use both to predict the image's class, but restrict how they interact in our model. We'll start by using the Keras Input operation to define the input shapes for model 1 and model 2. Next, we define layer 1 and layer 2 as dense layers for model 1. Note that we have to pass the previous layer as an argument if we use the functional API, but did not with the sequential API. You may remember that we did this in chapter 3. We were also using the functional API then.

9. Using the functional API
 - We now define layers 1 and 2 for model 2 and then use the add layer in keras to combine the outputs in a layer that merges the two models. Finally, we define a functional model. As inputs, it takes both the model 1 and model 2 inputs. As outputs, it takes the merged layer. The only thing left to do is compile it and train.

## Building a sequential model

In [None]:
# import tensorflow
from tensorflow import keras

# define a sequential model
model = keras.Sequential()

# define first hidden layer
# 28x28 pixels
model.add(keras.layers.Dense(16, activation='relu', input_shape=(28*28,)))

# define 2nd hidden layer
model.add(keras.layers.Dense(8, activation='relu'))

# define output layer
model.add(keras.layers.Dense(4, activation='softmax'))

# compile the model
model.compile('adam', loss='categorical_crossentropy')

# Summarize the model - check model architecture
print(model.summary())

### Sequential model in Keras

In chapter 3, we used components of the keras API in tensorflow to define a neural network, but we stopped short of using its full capabilities to streamline model definition and training. In this exercise, you will use the keras sequential model API to define a neural network that can be used to classify images of sign language letters. You will also use the .summary() method to print the model's architecture, including the shape and number of parameters associated with each layer.

Note that the images were reshaped from (28, 28) to (784,), so that they could be used as inputs to a dense layer. Additionally, note that keras has been imported from tensorflow for you.

In [None]:
# Define a Keras sequential model
model = keras.Sequential()

# Define the first dense layer
model.add(keras.layers.Dense(16, activation='relu', input_shape=(784,)))

# Define the second dense layer
model.add(keras.layers.Dense(8, activation='relu'))

# Define the output layer
model.add(keras.layers.Dense(4, activation='softmax'))

# Print the model architecture
print(model.summary())

'''
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                12560     
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 36        
=================================================================
Total params: 12,732
Trainable params: 12,732
Non-trainable params: 0
_________________________________________________________________
None
'''

'''
Notice that we've defined a model, but we haven't compiled it. 
The compilation step in keras allows us to set the optimizer, 
loss function, and other useful training parameters in a 
single line of code. 
Furthermore, the .summary() method allows us to view the 
model's architecture.
'''
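
The Param # column can be checked by hand. A dense layer holds one weight per input-output pair plus one bias per output node, so its count is inputs * units + units. The short sketch below, in plain Python, reproduces the figures in the summary printed above. The helper name dense_params is illustrative and is not a Keras function.

```python
def dense_params(n_inputs, n_units):
    # Weights (n_inputs * n_units) plus one bias per unit
    return n_inputs * n_units + n_units

# (inputs, units) for the three dense layers in the summary
layers = [(784, 16), (16, 8), (8, 4)]
counts = [dense_params(n_inputs, n_units) for n_inputs, n_units in layers]
total = sum(counts)
print(counts, total)  # [12560, 136, 36] 12732
```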

### Compiling a sequential model

In this exercise, you will work towards classifying letters from the Sign Language MNIST dataset; however, you will adopt a different network architecture than what you used in the previous exercise. There will be fewer layers, but more nodes. You will also apply dropout to prevent overfitting. Finally, you will compile the model to use the adam optimizer and the categorical_crossentropy loss. You will also use a method in keras to summarize your model's architecture. Note that keras has been imported from tensorflow for you and a sequential keras model has been defined as model.

In [None]:
# assume keras sequential model

# Define the first dense layer
model.add(keras.layers.Dense(16, activation='sigmoid', input_shape=(784,)))

# Apply dropout to the first layer's output
model.add(keras.layers.Dropout(0.25))

# Define the output layer
model.add(keras.layers.Dense(4, activation='softmax'))

# Compile the model
model.compile('adam', loss='categorical_crossentropy')

# Print a model summary
print(model.summary())

'''
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                12560     
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 68        
=================================================================
Total params: 12,628
Trainable params: 12,628
Non-trainable params: 0
_________________________________________________________________
None
'''

## Functional API - use 2 models jointly to predict the same target

In [None]:
# import tensorflow
import tensorflow as tf

# define model 1 input layer shape
model1_inputs = tf.keras.Input(shape=(28*28,))
# define model 2 input layer shape
model2_inputs = tf.keras.Input(shape=(10,))

# define layer 1 for model 1
model1_layer1 = tf.keras.layers.Dense(12, activation='relu')(model1_inputs)
# define layer 2 for model 1
model1_layer2 = tf.keras.layers.Dense(4, activation='softmax')(model1_layer1)

# define layer 1 for model 2
model2_layer1 = tf.keras.layers.Dense(8, activation='relu')(model2_inputs)
# define layer 2 for model 2
model2_layer2 = tf.keras.layers.Dense(4, activation='softmax')(model2_layer1)

# merge model 1 and model 2
merged = tf.keras.layers.add([model1_layer2, model2_layer2])

# define a functional model
model = tf.keras.Model(inputs=[model1_inputs, model2_inputs], outputs=merged)

# compile the model
model.compile('adam', loss='categorical_crossentropy')

# train the model

### Defining a multiple input model - functional API

In some cases, the sequential API will not be sufficiently flexible to accommodate your desired model architecture and you will need to use the functional API instead. If, for instance, you want to train two models with different architectures jointly, you will need to use the functional API to do this. In this exercise, we will see how to do this. We will also use the .summary() method to examine the joint model's architecture.

Note that keras has been imported from tensorflow for you. Additionally, the input layers of the first and second models have been defined as m1_inputs and m2_inputs, respectively. Note that the two models have the same architecture, but one of them uses a sigmoid activation in the first layer and the other uses a relu.

In [None]:
# For model 1, pass the input layer to layer 1 and layer 1 to layer 2
m1_layer1 = keras.layers.Dense(12, activation='sigmoid')(m1_inputs)
m1_layer2 = keras.layers.Dense(4, activation='softmax')(m1_layer1)

# For model 2, pass the input layer to layer 1 and layer 1 to layer 2
m2_layer1 = keras.layers.Dense(12, activation='relu')(m2_inputs)
m2_layer2 = keras.layers.Dense(4, activation='softmax')(m2_layer1)

# Merge model outputs and define a functional model
merged = keras.layers.add([m1_layer2, m2_layer2])
model = keras.Model(inputs=[m1_inputs, m2_inputs], outputs=merged)

# Print a model summary
print(model.summary())

'''
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 784)]        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 784)]        0                                            
__________________________________________________________________________________________________
dense (Dense)                   (None, 12)           9420        input_1[0][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 12)           9420        input_2[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 4)            52          dense[0][0]                      
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 4)            52          dense_2[0][0]                    
__________________________________________________________________________________________________
add (Add)                       (None, 4)            0           dense_1[0][0]                    
                                                                 dense_3[0][0]                    
==================================================================================================
Total params: 18,944
Trainable params: 18,944
Non-trainable params: 0
__________________________________________________________________________________________________
None
'''
'''
Notice that the .summary() method yields a new column: 
connected to. This column tells you how layers connect 
to each other within the network. 
We can see that dense_2, for instance, 
is connected to the input_2 layer. 
We can also see that the add layer, which merged the two models, 
connected to both dense_1 and dense_3.
'''
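
It may help to see what the add layer computes. keras.layers.add sums its inputs element by element, so merging two softmax outputs gives a vector of four scores that sums to 2 rather than 1. The sketch below mimics this in plain Python with hypothetical logits for the two branches. It only illustrates the arithmetic and is not the Keras implementation.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-activation outputs of the two branches
m1_output = softmax([2.0, 1.0, 0.5, 0.1])
m2_output = softmax([0.3, 1.5, 0.2, 0.9])

# Element-wise sum, as performed by the add layer
merged = [a + b for a, b in zip(m1_output, m2_output)]
predicted_class = max(range(len(merged)), key=merged.__getitem__)
print(merged, sum(merged), predicted_class)
```

The class with the highest merged score can still be read off as the prediction, but the merged values are no longer probabilities.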

## Training and Validation with Keras

1. Training and validation with Keras
 - Earlier in the chapter, we defined neural networks in Keras. In this video, we will discuss how to train and evaluate them.

2. Overview of training and evaluation
 - Whenever we train and evaluate a model in tensorflow, we typically use the same set of steps. First, we'll load and clean the data. Second, we'll define a model, specifying an architecture. Third, we'll train and validate the model. And fourth, we perform evaluation.
    - Steps:
    - Load and clean data
    - Define model and architecture
    - Train and validate model
    - Evaluate model    

3. How to train a model
 - Let's see an example of how this works. We'll start by importing tensorflow and defining a keras sequential model. We'll then add a dense layer to the model with 16 nodes and a relu activation function. Note that our input shape is (784,), since our dataset consists of 28x28 images, reshaped into vectors. We next define the output layer, which has 4 nodes and a softmax activation function.

4. How to train a model
 - We next compile the model, using the adam optimizer and the categorical cross entropy loss. Finally, we train the model using the fit operation.

5. The fit() operation
 - Notice that we only supplied two arguments to fit: features and labels. These are the only two required arguments; 
 - however, there are also many optional arguments, including batch_size, epochs, and validation_split. We will cover each of these.
        - batch_size
        - epochs
        - validation_split

6. Batch size and epochs parameters
 - Let's start with the difference between the batch size and epochs parameters. The number of examples in each batch is the batch size, which is 32 by default. The number of times you train on the full set of batches is called the number of epochs. Here, the batch size is 5 and the number of epochs is 2. Using multiple epochs allows the model to revisit the same batches, but with different model weights and possibly optimizer parameters, since they are updated after each batch.
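
The relationship between batch size, epochs, and the number of weight updates can be sketched as a pair of nested loops. The example below is plain Python. It uses the batch size of 5 and the 2 epochs mentioned above and assumes a hypothetical dataset of 20 examples.

```python
import math

n_examples = 20   # hypothetical dataset size
batch_size = 5
epochs = 2

steps_per_epoch = math.ceil(n_examples / batch_size)

updates = 0
for epoch in range(epochs):
    # One pass over the full dataset, one batch at a time
    for start in range(0, n_examples, batch_size):
        batch = list(range(start, min(start + batch_size, n_examples)))
        updates += 1  # the weights would be updated once per batch

print(steps_per_epoch, updates)  # 4 8
```

The step counter shown in Keras training logs, such as 32/32, is this steps_per_epoch value.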

7. Performing validation
 - So what does the validation_split parameter do? It divides the dataset into two parts. The first part is the train set and the second part is the validation set.
 - Selecting a value of 0.2 will put 20% of the data in the validation set.
 - e.g. model.fit(features, labels, epochs=10, validation_split=0.20)
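
Keras takes the validation set from the last samples of the data passed to fit(), before any shuffling. The sketch below imitates that split in plain Python on a hypothetical dataset of 10 examples. The helper name is illustrative.

```python
import math

def split_validation(examples, validation_split):
    # Index at which the validation block starts
    split_at = int(math.floor(len(examples) * (1.0 - validation_split)))
    return examples[:split_at], examples[split_at:]

examples = list(range(10))  # hypothetical dataset
train, validation = split_validation(examples, validation_split=0.2)
print(train, validation)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Because the split is positional, data that is sorted by class or by date should be shuffled before calling fit(), otherwise the validation set may not resemble the training set.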
 
9. Performing validation
 - The benefit of using a validation split is that you can see how your model performs on both the data it was trained on, the training set, and a separate dataset it was not trained on, the validation set. Here, we can see the first 10 epochs of training. Notice that we can see the training loss and validation loss separately. If the training loss becomes substantially lower than the validation loss, this is an indication that we're overfitting. We should either terminate the training process before that point or add regularization or dropout.

10. Changing the metric
 - Another benefit of the high level keras API is that we can swap less informative metrics, such as the loss, for ones that are easily interpretable, such as the share of accurately classified examples. We can do this by supplying accuracy to the metrics parameter of compile. We then apply fit to the model again with the same settings.
 - recompile the model with the accuracy metric
     - model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])
     - model.fit(features, labels, epochs=10, validation_split=0.20)

11. Changing the metric
 - Using the accuracy metric, we can see that the model performs quite well. In just 10 epochs, it goes from an accuracy of 42% to over 99%. Notice that the model performs equally well in the validation set, which means that we're unlikely to be overfitting.
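
For a multi-class problem with one-hot labels, accuracy is the share of examples whose highest-probability class matches the label. The sketch below computes it in plain Python on hypothetical labels and predictions. The function names are illustrative and are not part of Keras.

```python
def argmax(values):
    # Index of the largest entry
    return max(range(len(values)), key=values.__getitem__)

def categorical_accuracy(labels, predictions):
    hits = sum(argmax(y) == argmax(p) for y, p in zip(labels, predictions))
    return hits / len(labels)

# Hypothetical one-hot labels and softmax outputs for four images
labels = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
predictions = [[0.7, 0.1, 0.1, 0.1],
               [0.2, 0.5, 0.2, 0.1],
               [0.1, 0.6, 0.2, 0.1],   # wrong: predicts class 1, label is class 2
               [0.1, 0.1, 0.1, 0.7]]

accuracy = categorical_accuracy(labels, predictions)
print(accuracy)  # 0.75
```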

12. The evaluate() operation
 - Finally, it is a good idea to split off a test set before you begin to train and validate. You can use the evaluate operation to check performance on the test set at the end of the training process. Since you may tune model parameters in response to validation set performance, using a separate test set will provide you with further assurance that you have not overfitted.
 - model.evaluate(test) # evaluate the test set

### Training with Keras

In this exercise, we return to our sign language letter classification problem. We have 2000 images of four letters--A, B, C, and D--and we want to classify them with a high level of accuracy. We will complete all parts of the problem, including the model definition, compilation, and training.

Note that keras has been imported from tensorflow for you. Additionally, the features are available as sign_language_features and the targets are available as sign_language_labels.

In [None]:
# Define a sequential model
model = keras.Sequential()

# Define a hidden layer
model.add(keras.layers.Dense(16, activation='relu', input_shape=(784,)))

# Define the output layer
model.add(keras.layers.Dense(4, activation='softmax'))

# Compile the model
model.compile('SGD', loss='categorical_crossentropy')

# Complete the fitting operation
model.fit(sign_language_features, sign_language_labels, epochs=5)
'''
Epoch 1/5

 1/32 [..............................] - ETA: 21s - loss: 1.4386
32/32 [==============================] - 1s 2ms/step - loss: 1.2970
Epoch 2/5

 1/32 [..............................] - ETA: 0s - loss: 1.2934
32/32 [==============================] - 0s 1ms/step - loss: 1.1385
Epoch 3/5

 1/32 [..............................] - ETA: 0s - loss: 1.2360
32/32 [==============================] - 0s 2ms/step - loss: 1.0325
Epoch 4/5

 1/32 [..............................] - ETA: 0s - loss: 1.0046
28/32 [=========================>....] - ETA: 0s - loss: 0.9363
32/32 [==============================] - 0s 2ms/step - loss: 0.9335
Epoch 5/5

 1/32 [..............................] - ETA: 0s - loss: 0.8826
32/32 [==============================] - 0s 2ms/step - loss: 0.8446
<keras.callbacks.History at 0x7f1c7f9e1f98>
'''

### Metrics and validation with Keras

We trained a model to predict sign language letters in the previous exercise, but it is unclear how successful we were in doing so. In this exercise, we will try to improve upon the interpretability of our results. Since we did not use a validation split, we only observed performance improvements within the training set; however, it is unclear how much of that was due to overfitting. Furthermore, since we did not supply a metric, we only saw decreases in the loss function, which do not have any clear interpretation.
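
One way to give the loss a reference point is to compute what categorical cross-entropy would be for a model that guesses uniformly among the four classes. The sketch below does this for a single example in plain Python; the probability vectors are made up for illustration, and Keras reports the average of this quantity over the examples in a batch.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy for one example: -sum(target * log(probability))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs)
                if t > 0)


target = [0.0, 0.0, 1.0, 0.0]           # true class is the third letter
uniform_guess = [0.25, 0.25, 0.25, 0.25]  # model with no information
confident_right = [0.05, 0.05, 0.85, 0.05]  # illustrative good prediction

chance_loss = categorical_crossentropy(target, uniform_guess)
good_loss = categorical_crossentropy(target, confident_right)

print(round(chance_loss, 4))  # ln(4), about 1.3863
print(good_loss < chance_loss)
```

A loss close to 1.39 therefore means the four-class model is doing no better than chance, and values well below it indicate that probability mass is moving toward the correct class.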

Note that keras has been imported for you from tensorflow.

In [None]:
# Define sequential model
model = keras.Sequential()

# Define the first layer
model.add(keras.layers.Dense(32, activation='sigmoid', input_shape=(784,)))

# Add activation function to classifier
model.add(keras.layers.Dense(4, activation='softmax'))

# Set the optimizer, loss function, and metrics
model.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])

# Add the number of epochs and the validation split
model.fit(sign_language_features, sign_language_labels, epochs=10, validation_split=0.1)
'''
Epoch 1/10

 1/29 [>.............................] - ETA: 26s - loss: 1.6899 - accuracy: 0.3125
29/29 [==============================] - 1s 18ms/step - loss: 1.3092 - accuracy: 0.3893 - val_loss: 1.2376 - val_accuracy: 0.3000
Epoch 2/10

 1/29 [>.............................] - ETA: 0s - loss: 1.3058 - accuracy: 0.2500
29/29 [==============================] - 0s 2ms/step - loss: 1.0616 - accuracy: 0.6396 - val_loss: 1.1271 - val_accuracy: 0.4900
Epoch 3/10

 1/29 [>.............................] - ETA: 0s - loss: 1.0355 - accuracy: 0.5938
29/29 [==============================] - 0s 2ms/step - loss: 0.9109 - accuracy: 0.7442 - val_loss: 0.8684 - val_accuracy: 0.7700
Epoch 4/10

 1/29 [>.............................] - ETA: 0s - loss: 0.9089 - accuracy: 0.7188
29/29 [==============================] - 0s 2ms/step - loss: 0.7654 - accuracy: 0.7853 - val_loss: 0.8147 - val_accuracy: 0.7200
Epoch 5/10

 1/29 [>.............................] - ETA: 0s - loss: 0.7754 - accuracy: 0.7812
29/29 [==============================] - 0s 2ms/step - loss: 0.6631 - accuracy: 0.8554 - val_loss: 0.6366 - val_accuracy: 0.8400
Epoch 6/10

 1/29 [>.............................] - ETA: 0s - loss: 0.7045 - accuracy: 0.7812
29/29 [==============================] - 0s 2ms/step - loss: 0.5778 - accuracy: 0.8654 - val_loss: 0.7495 - val_accuracy: 0.6200
Epoch 7/10

 1/29 [>.............................] - ETA: 0s - loss: 0.8076 - accuracy: 0.5625
29/29 [==============================] - 0s 2ms/step - loss: 0.5104 - accuracy: 0.9032 - val_loss: 0.4839 - val_accuracy: 0.9600
Epoch 8/10

 1/29 [>.............................] - ETA: 0s - loss: 0.5336 - accuracy: 0.9062
29/29 [==============================] - 0s 2ms/step - loss: 0.4592 - accuracy: 0.9066 - val_loss: 0.6214 - val_accuracy: 0.7100
Epoch 9/10

 1/29 [>.............................] - ETA: 0s - loss: 0.4584 - accuracy: 0.8125
29/29 [==============================] - 0s 2ms/step - loss: 0.4061 - accuracy: 0.9277 - val_loss: 0.4259 - val_accuracy: 0.9200
Epoch 10/10

 1/29 [>.............................] - ETA: 0s - loss: 0.3610 - accuracy: 0.9688
29/29 [==============================] - 0s 2ms/step - loss: 0.3619 - accuracy: 0.9422 - val_loss: 0.3829 - val_accuracy: 0.8900
<keras.callbacks.History at 0x7f1c7c7efa90>
'''
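
The progress bars in these logs count batches, not examples. As a rough sketch of how validation_split changes that count, the code below assumes the Keras default batch_size of 32 and a hypothetical dataset of 1,000 examples; the sample count and the helper names are assumptions for illustration. Keras takes the validation examples from the end of the arrays, before shuffling.

```python
import math


def split_sizes(n_samples, validation_split):
    """Return (n_train, n_val); the last fraction of samples is held out."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return split_at, n_samples - split_at


def steps_per_epoch(n_train, batch_size=32):
    """Number of batches per epoch; the final batch may be smaller."""
    return math.ceil(n_train / batch_size)


n_samples = 1000  # hypothetical sample count
n_train, n_val = split_sizes(n_samples, validation_split=0.1)

print(n_train, n_val)
print(steps_per_epoch(n_samples))  # steps without a validation split
print(steps_per_epoch(n_train))    # steps with validation_split=0.1
```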

### Overfitting detection

In this exercise, we'll work with a small subset of the examples from the original sign language letters dataset. A small sample, coupled with a heavily-parameterized model, will generally lead to overfitting. This means that your model will simply memorize the class of each example, rather than identifying features that generalize to many examples.

You will detect overfitting by checking whether the validation sample loss is substantially higher than the training sample loss and whether it increases with further training. With a small sample and a high learning rate, the model will struggle to converge on an optimum. You will set a low learning rate for the optimizer, which will make it easier to identify overfitting.
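
To make "heavily-parameterized" concrete, the sketch below counts the trainable parameters of a stack of Dense layers, where each layer has one weight per input-unit pair plus one bias per unit. It compares a 784-16-4 network with a 784-1024-4 network; the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return n_inputs * n_units + n_units


def mlp_params(layer_sizes):
    """Total parameters for consecutive Dense layers, e.g. [784, 16, 4]."""
    return sum(dense_params(n_in, n_out)
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


small = mlp_params([784, 16, 4])
large = mlp_params([784, 1024, 4])

print(small)  # 12628
print(large)  # 807940
```

With only a handful of training examples, a model with roughly 800,000 parameters has far more capacity than it needs to memorize every label.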

Note that keras has been imported from tensorflow.

In [None]:
# Define sequential model
model = keras.Sequential()

# Define the first layer
model.add(keras.layers.Dense(1024, activation='relu', input_shape=(784,)))

# Add activation function to classifier
model.add(keras.layers.Dense(4, activation='softmax'))

# Finish the model compilation
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])

# Complete the model fit operation
model.fit(sign_language_features, sign_language_labels,
          epochs=50, validation_split=0.5)

'''
Epoch 1/50

1/1 [==============================] - ETA: 0s - loss: 1.3461 - accuracy: 0.3077
1/1 [==============================] - 1s 1s/step - loss: 1.3461 - accuracy: 0.3077 - val_loss: 3.0842 - val_accuracy: 0.3846
Epoch 2/50

1/1 [==============================] - ETA: 0s - loss: 2.4663 - accuracy: 0.2308
1/1 [==============================] - 0s 24ms/step - loss: 2.4663 - accuracy: 0.2308 - val_loss: 4.2784 - val_accuracy: 0.3846
Epoch 3/50

1/1 [==============================] - ETA: 0s - loss: 2.1878 - accuracy: 0.6154
1/1 [==============================] - 0s 20ms/step - loss: 2.1878 - accuracy: 0.6154 - val_loss: 5.6272 - val_accuracy: 0.3077
Epoch 4/50

1/1 [==============================] - ETA: 0s - loss: 3.6705 - accuracy: 0.3846
1/1 [==============================] - 0s 45ms/step - loss: 3.6705 - accuracy: 0.3846 - val_loss: 4.5968 - val_accuracy: 0.3077
Epoch 5/50

1/1 [==============================] - ETA: 0s - loss: 2.5616 - accuracy: 0.6923
1/1 [==============================] - 0s 22ms/step - loss: 2.5616 - accuracy: 0.6923 - val_loss: 4.3052 - val_accuracy: 0.0769
Epoch 6/50

1/1 [==============================] - ETA: 0s - loss: 2.3775 - accuracy: 0.6154
1/1 [==============================] - 0s 19ms/step - loss: 2.3775 - accuracy: 0.6154 - val_loss: 2.8193 - val_accuracy: 0.0769
Epoch 7/50

1/1 [==============================] - ETA: 0s - loss: 1.5247 - accuracy: 0.6154
1/1 [==============================] - 0s 18ms/step - loss: 1.5247 - accuracy: 0.6154 - val_loss: 1.0500 - val_accuracy: 0.5385
Epoch 8/50

1/1 [==============================] - ETA: 0s - loss: 0.6675 - accuracy: 0.7692
1/1 [==============================] - 0s 17ms/step - loss: 0.6675 - accuracy: 0.7692 - val_loss: 0.9406 - val_accuracy: 0.6154
Epoch 9/50

1/1 [==============================] - ETA: 0s - loss: 1.2254 - accuracy: 0.5385
1/1 [==============================] - 0s 17ms/step - loss: 1.2254 - accuracy: 0.5385 - val_loss: 1.1072 - val_accuracy: 0.4615
Epoch 10/50

1/1 [==============================] - ETA: 0s - loss: 1.6209 - accuracy: 0.5385
1/1 [==============================] - 0s 17ms/step - loss: 1.6209 - accuracy: 0.5385 - val_loss: 0.9477 - val_accuracy: 0.6923
Epoch 11/50

1/1 [==============================] - ETA: 0s - loss: 1.3329 - accuracy: 0.5385
1/1 [==============================] - 0s 18ms/step - loss: 1.3329 - accuracy: 0.5385 - val_loss: 0.8459 - val_accuracy: 0.6923
Epoch 12/50

1/1 [==============================] - ETA: 0s - loss: 0.8220 - accuracy: 0.6154
1/1 [==============================] - 0s 18ms/step - loss: 0.8220 - accuracy: 0.6154 - val_loss: 1.0359 - val_accuracy: 0.5385
Epoch 13/50

1/1 [==============================] - ETA: 0s - loss: 0.5389 - accuracy: 0.8462
1/1 [==============================] - 0s 18ms/step - loss: 0.5389 - accuracy: 0.8462 - val_loss: 1.5257 - val_accuracy: 0.3846
Epoch 14/50

1/1 [==============================] - ETA: 0s - loss: 0.6948 - accuracy: 0.6923
1/1 [==============================] - 0s 17ms/step - loss: 0.6948 - accuracy: 0.6923 - val_loss: 2.0301 - val_accuracy: 0.3846
Epoch 15/50

1/1 [==============================] - ETA: 0s - loss: 0.9550 - accuracy: 0.6923
1/1 [==============================] - 0s 17ms/step - loss: 0.9550 - accuracy: 0.6923 - val_loss: 2.0480 - val_accuracy: 0.3846
Epoch 16/50

1/1 [==============================] - ETA: 0s - loss: 0.9436 - accuracy: 0.6923
1/1 [==============================] - 0s 18ms/step - loss: 0.9436 - accuracy: 0.6923 - val_loss: 1.6047 - val_accuracy: 0.3846
Epoch 17/50

1/1 [==============================] - ETA: 0s - loss: 0.6777 - accuracy: 0.6923
1/1 [==============================] - 0s 19ms/step - loss: 0.6777 - accuracy: 0.6923 - val_loss: 1.1002 - val_accuracy: 0.4615
Epoch 18/50

1/1 [==============================] - ETA: 0s - loss: 0.4211 - accuracy: 0.8462
1/1 [==============================] - 0s 19ms/step - loss: 0.4211 - accuracy: 0.8462 - val_loss: 0.8775 - val_accuracy: 0.6923
Epoch 19/50

1/1 [==============================] - ETA: 0s - loss: 0.4486 - accuracy: 0.8462
1/1 [==============================] - 0s 19ms/step - loss: 0.4486 - accuracy: 0.8462 - val_loss: 0.8258 - val_accuracy: 0.6923
Epoch 20/50

1/1 [==============================] - ETA: 0s - loss: 0.6064 - accuracy: 0.6154
1/1 [==============================] - 0s 19ms/step - loss: 0.6064 - accuracy: 0.6154 - val_loss: 0.7554 - val_accuracy: 0.6923
Epoch 21/50

1/1 [==============================] - ETA: 0s - loss: 0.6040 - accuracy: 0.6154
1/1 [==============================] - 0s 19ms/step - loss: 0.6040 - accuracy: 0.6154 - val_loss: 0.6710 - val_accuracy: 0.7692
Epoch 22/50

1/1 [==============================] - ETA: 0s - loss: 0.4510 - accuracy: 0.7692
1/1 [==============================] - 0s 21ms/step - loss: 0.4510 - accuracy: 0.7692 - val_loss: 0.7021 - val_accuracy: 0.7692
Epoch 23/50

1/1 [==============================] - ETA: 0s - loss: 0.3448 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.3448 - accuracy: 1.0000 - val_loss: 0.8841 - val_accuracy: 0.6154
Epoch 24/50

1/1 [==============================] - ETA: 0s - loss: 0.3646 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.3646 - accuracy: 1.0000 - val_loss: 1.0934 - val_accuracy: 0.4615
Epoch 25/50

1/1 [==============================] - ETA: 0s - loss: 0.4279 - accuracy: 0.7692
1/1 [==============================] - 0s 19ms/step - loss: 0.4279 - accuracy: 0.7692 - val_loss: 1.1553 - val_accuracy: 0.4615
Epoch 26/50

1/1 [==============================] - ETA: 0s - loss: 0.4289 - accuracy: 0.6923
1/1 [==============================] - 0s 20ms/step - loss: 0.4289 - accuracy: 0.6923 - val_loss: 1.0227 - val_accuracy: 0.5385
Epoch 27/50

1/1 [==============================] - ETA: 0s - loss: 0.3458 - accuracy: 0.6923
1/1 [==============================] - 0s 20ms/step - loss: 0.3458 - accuracy: 0.6923 - val_loss: 0.8419 - val_accuracy: 0.6923
Epoch 28/50

1/1 [==============================] - ETA: 0s - loss: 0.2693 - accuracy: 0.9231
1/1 [==============================] - 0s 19ms/step - loss: 0.2693 - accuracy: 0.9231 - val_loss: 0.7599 - val_accuracy: 0.6923
Epoch 29/50

1/1 [==============================] - ETA: 0s - loss: 0.2741 - accuracy: 0.9231
1/1 [==============================] - 0s 19ms/step - loss: 0.2741 - accuracy: 0.9231 - val_loss: 0.7528 - val_accuracy: 0.6923
Epoch 30/50

1/1 [==============================] - ETA: 0s - loss: 0.3189 - accuracy: 0.8462
1/1 [==============================] - 0s 19ms/step - loss: 0.3189 - accuracy: 0.8462 - val_loss: 0.7428 - val_accuracy: 0.7692
Epoch 31/50

1/1 [==============================] - ETA: 0s - loss: 0.3236 - accuracy: 0.8462
1/1 [==============================] - 0s 24ms/step - loss: 0.3236 - accuracy: 0.8462 - val_loss: 0.7158 - val_accuracy: 0.6923
Epoch 32/50

1/1 [==============================] - ETA: 0s - loss: 0.2771 - accuracy: 0.9231
1/1 [==============================] - 0s 20ms/step - loss: 0.2771 - accuracy: 0.9231 - val_loss: 0.7075 - val_accuracy: 0.6923
Epoch 33/50

1/1 [==============================] - ETA: 0s - loss: 0.2321 - accuracy: 1.0000
1/1 [==============================] - 0s 20ms/step - loss: 0.2321 - accuracy: 1.0000 - val_loss: 0.7444 - val_accuracy: 0.6923
Epoch 34/50

1/1 [==============================] - ETA: 0s - loss: 0.2246 - accuracy: 1.0000
1/1 [==============================] - 0s 20ms/step - loss: 0.2246 - accuracy: 1.0000 - val_loss: 0.8045 - val_accuracy: 0.6154
Epoch 35/50

1/1 [==============================] - ETA: 0s - loss: 0.2386 - accuracy: 1.0000
1/1 [==============================] - 0s 20ms/step - loss: 0.2386 - accuracy: 1.0000 - val_loss: 0.8353 - val_accuracy: 0.5385
Epoch 36/50

1/1 [==============================] - ETA: 0s - loss: 0.2408 - accuracy: 1.0000
1/1 [==============================] - 0s 20ms/step - loss: 0.2408 - accuracy: 1.0000 - val_loss: 0.8052 - val_accuracy: 0.5385
Epoch 37/50

1/1 [==============================] - ETA: 0s - loss: 0.2196 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.2196 - accuracy: 1.0000 - val_loss: 0.7360 - val_accuracy: 0.6923
Epoch 38/50

1/1 [==============================] - ETA: 0s - loss: 0.1944 - accuracy: 1.0000
1/1 [==============================] - 0s 24ms/step - loss: 0.1944 - accuracy: 1.0000 - val_loss: 0.6752 - val_accuracy: 0.6923
Epoch 39/50

1/1 [==============================] - ETA: 0s - loss: 0.1877 - accuracy: 1.0000
1/1 [==============================] - 0s 22ms/step - loss: 0.1877 - accuracy: 1.0000 - val_loss: 0.6398 - val_accuracy: 0.7692
Epoch 40/50

1/1 [==============================] - ETA: 0s - loss: 0.1952 - accuracy: 1.0000
1/1 [==============================] - 0s 20ms/step - loss: 0.1952 - accuracy: 1.0000 - val_loss: 0.6147 - val_accuracy: 0.7692
Epoch 41/50

1/1 [==============================] - ETA: 0s - loss: 0.1948 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.1948 - accuracy: 1.0000 - val_loss: 0.5950 - val_accuracy: 0.7692
Epoch 42/50

1/1 [==============================] - ETA: 0s - loss: 0.1787 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.1787 - accuracy: 1.0000 - val_loss: 0.5959 - val_accuracy: 0.7692
Epoch 43/50

1/1 [==============================] - ETA: 0s - loss: 0.1610 - accuracy: 1.0000
1/1 [==============================] - 0s 18ms/step - loss: 0.1610 - accuracy: 1.0000 - val_loss: 0.6252 - val_accuracy: 0.7692
Epoch 44/50

1/1 [==============================] - ETA: 0s - loss: 0.1552 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.1552 - accuracy: 1.0000 - val_loss: 0.6674 - val_accuracy: 0.7692
Epoch 45/50

1/1 [==============================] - ETA: 0s - loss: 0.1578 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.1578 - accuracy: 1.0000 - val_loss: 0.6989 - val_accuracy: 0.7692
Epoch 46/50

1/1 [==============================] - ETA: 0s - loss: 0.1573 - accuracy: 1.0000
1/1 [==============================] - 0s 18ms/step - loss: 0.1573 - accuracy: 1.0000 - val_loss: 0.7078 - val_accuracy: 0.6923
Epoch 47/50

1/1 [==============================] - ETA: 0s - loss: 0.1495 - accuracy: 1.0000
1/1 [==============================] - 0s 19ms/step - loss: 0.1495 - accuracy: 1.0000 - val_loss: 0.6976 - val_accuracy: 0.6923
Epoch 48/50

1/1 [==============================] - ETA: 0s - loss: 0.1394 - accuracy: 1.0000
1/1 [==============================] - 0s 18ms/step - loss: 0.1394 - accuracy: 1.0000 - val_loss: 0.6791 - val_accuracy: 0.7692
Epoch 49/50

1/1 [==============================] - ETA: 0s - loss: 0.1334 - accuracy: 1.0000
1/1 [==============================] - 0s 18ms/step - loss: 0.1334 - accuracy: 1.0000 - val_loss: 0.6602 - val_accuracy: 0.6923
Epoch 50/50

1/1 [==============================] - ETA: 0s - loss: 0.1323 - accuracy: 1.0000
1/1 [==============================] - 0s 18ms/step - loss: 0.1323 - accuracy: 1.0000 - val_loss: 0.6412 - val_accuracy: 0.6923
<keras.callbacks.History at 0x7f1c800f6a20>
'''

You may have noticed that the validation loss, val_loss, was substantially higher than the training loss, loss. Furthermore, if val_loss started to increase before the training process was terminated, then we may have overfitted. When this happens, you will want to try decreasing the number of epochs.
- Try decreasing # epochs
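
One way to choose the number of epochs is to look for the point where validation loss stops improving. model.fit() returns a History object whose history attribute is a dictionary of per-epoch lists such as 'loss' and 'val_loss'. The sketch below applies that idea to the val_loss values from epochs 7 to 14 of the log above; the helper names are hypothetical.

```python
def best_epoch(val_losses):
    """Return (1-based position, loss) of the lowest validation loss."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx + 1, val_losses[idx]


def epochs_since_best(val_losses):
    """How many epochs have passed without beating the best val_loss."""
    position, _ = best_epoch(val_losses)
    return len(val_losses) - position


# val_loss for epochs 7 through 14, copied from the training log
val_losses = [1.0500, 0.9406, 1.1072, 0.9477, 0.8459, 1.0359, 1.5257, 2.0301]

position, loss = best_epoch(val_losses)
print(position, loss)
print(epochs_since_best(val_losses))
```

Keras also provides the keras.callbacks.EarlyStopping callback, which monitors a quantity such as val_loss and stops training after a chosen number of epochs (its patience argument) without improvement.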

### Evaluating models

Two models have been trained and are available: large_model, which has many parameters; and small_model, which has fewer parameters. Both models have been trained using train_features and train_labels, which are available to you. A separate test set, which consists of test_features and test_labels, is also available.

Your goal is to evaluate relative model performance and also determine whether either model exhibits signs of overfitting. You will do this by evaluating large_model and small_model on both the train and test sets. For each model, you can do this by applying the .evaluate(x, y) method to compute the loss for features x and labels y. You will then compare the four losses generated.

In [None]:
# Evaluate the small model using the train data
small_train = small_model.evaluate(train_features, train_labels)

# Evaluate the small model using the test data
small_test = small_model.evaluate(test_features, test_labels)

# Evaluate the large model using the train data
large_train = large_model.evaluate(train_features, train_labels)

# Evaluate the large model using the test data
large_test = large_model.evaluate(test_features, test_labels)

# Print losses
print('\n Small - Train: {}, Test: {}'.format(small_train, small_test))
print('Large - Train: {}, Test: {}'.format(large_train, large_test))
'''
1/4 [======>.......................] - ETA: 0s - loss: 0.1738
4/4 [==============================] - 0s 2ms/step - loss: 0.1698

1/4 [======>.......................] - ETA: 0s - loss: 0.3251
4/4 [==============================] - 0s 2ms/step - loss: 0.2849

1/4 [======>.......................] - ETA: 0s - loss: 0.0425
4/4 [==============================] - 0s 2ms/step - loss: 0.0396

1/4 [======>.......................] - ETA: 0s - loss: 0.1414
4/4 [==============================] - 0s 1ms/step - loss: 0.1454

 Small - Train: 0.16981548070907593, Test: 0.28487256169319153
Large - Train: 0.03957207500934601, Test: 0.14543524384498596
'''

Notice that, for large_model, the test set loss is several times the train set loss (roughly 0.145 versus 0.040), suggesting that overfitting may be an issue. Even so, large_model achieves a lower loss than small_model on both the train and test sets. This suggests that we may want to use large_model, but reduce the number of training epochs.
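
The comparison can be made explicit by computing both the absolute gap and the ratio between test and train loss for each model. The sketch below uses the four losses printed above; the dictionary layout and function name are hypothetical.

```python
# Losses copied from the evaluation output above
losses = {
    'small': {'train': 0.16981548070907593, 'test': 0.28487256169319153},
    'large': {'train': 0.03957207500934601, 'test': 0.14543524384498596},
}


def generalization_gap(train_loss, test_loss):
    """Return (absolute gap, test-to-train ratio) for one model."""
    return test_loss - train_loss, test_loss / train_loss


results = {}
for name, model_losses in losses.items():
    gap, ratio = generalization_gap(model_losses['train'], model_losses['test'])
    results[name] = (gap, ratio)
    print('{}: gap={:.4f}, ratio={:.2f}'.format(name, gap, ratio))
```

The absolute gaps of the two models are similar, but relative to its training loss the large model degrades far more on the test set, which is the pattern one would expect from overfitting.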

## Training models with the Estimators API

1. Training models with the Estimators API
 - In this video, we'll take a look at the high-level Estimators API, which was elevated in importance in TensorFlow 2.0.

2. What is the Estimators API?
 - The Estimators API is a high-level TensorFlow submodule. Relative to the core, lower-level TensorFlow APIs and the high-level Keras API, model building in the Estimators API is less flexible.
 - This is because it enforces a set of best practices by placing restrictions on model architecture and training. 
 - The upside of using the Estimators API is that it allows for faster deployment. Models can be specified, trained, evaluated, and deployed with less code. 
 - Furthermore, there are many premade models that can be instantiated by setting a handful of model parameters.

 - 1 Image taken from https://www.tensorflow.org/guide/premade_estimators

3. Model specification and training
 - So what does the typical model specification and training process look like in the Estimators API? Well, it starts with the definition of feature columns, which specify the shape and type of your data. Next, you load and transform your data within a function. The output of this function will be a dictionary object of features and your labels. The next step is to define an estimator. In this video, we'll use premade estimators, but you can also define custom estimators with different architectures. Finally, you will train the model you defined. Note that all model objects created through the Estimators API have train, evaluate, and predict operations.
 - Define feature columns
 - Load and transform data within a function
 - Define an estimator
 - Apply train operation

4. Defining feature columns
 - Let's step through this procedure to get a sense of how it works. We'll first define the feature columns. If we were working with the housing dataset from chapter 2, we might define a numeric feature column for size using feature_column.numeric_column. Note that we supplied the dictionary key, "size," to the operation. We will do this for each feature column we create. We may also want a categorical feature column for the number of rooms using feature_column.categorical_column_with_vocabulary_list.

5. Defining feature columns
 - We can then merge these into a list of feature columns. Alternatively, if we were using the sign language MNIST dataset, we'd define a list containing a single vector of features.

6. Loading and transforming data
 - We next need to define a function that transforms our data, puts the features in a dictionary, and returns both the features and labels. Note that we've simply taken three examples from the housing dataset for the sake of illustration. Using them, we've defined a dictionary with the keys "size" and "rooms," which maps to the feature columns we defined. Next, we define a list or array of labels, which give the price of the house in this case, and then return the features and labels.

7. Define and train a regression estimator
 - We can now define and train the estimator. But before we do that, we have to define what estimator we actually want to train. If we're predicting house prices, we may want to use a deep neural network with a regression head using estimator.DNNRegressor. This allows us to predict a continuous target. Note that all we had to supply was the list of feature columns and the number of nodes in each hidden layer. The rest is handled automatically. We then apply the train function, supply our input function, and train for 20 steps.

8. Define and train a deep neural network
 - Alternatively, if we want to instead perform a classification task with a deep neural network, we just need to change the estimator to estimator.DNNClassifier, add the number of classes, and then train again. You can also use linear classifiers, boosted trees, and other common options. Just check the TensorFlow Estimators documentation for a complete list.
 - https://www.tensorflow.org/guide/estimators

In [None]:
# Example

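# Define feature columns
# import tensorflow under standard alias
import tensorflow as tf
# define a numeric feature column
# repeat for each feature column
size = tf.feature_column.numeric_column("size")
# define a categorical feature column
# (integer vocabulary, to match the integer room counts supplied in input_fn)
rooms = tf.feature_column.categorical_column_with_vocabulary_list(
    'rooms', [1, 2, 3, 4, 5])

# define feature columns
# create feature column list
# (DNN estimators need dense inputs, so the categorical column is
# one-hot encoded with indicator_column)
feature_list = [size, tf.feature_column.indicator_column(rooms)]
# another option (for the sign language MNIST data)
# define a vector feature column; kept under a separate name so that it
# does not overwrite feature_list
image_feature_list = [tf.feature_column.numeric_column('image', shape=(784,))]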

# loading and transforming data
# define input data function


def input_fn():
    # define feature dictionary
    features = {'size': [1340, 1690, 2720], 'rooms': [1, 3, 4]}
    # define labels
    labels = [221900, 5380000, 180000]
    return features, labels

###########################################
# Define and train a regression estimator
# define a deep neural network regression
# DNNRegressor for continuous target
model0 = tf.estimator.DNNRegressor(
    feature_columns=feature_list, hidden_units=[10, 6, 6, 3])
# train the regression model
model0.train(input_fn, steps=20)

###########################################
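# Classification example
# define a deep neural network classifier
model1 = tf.estimator.DNNClassifier(
    feature_columns=feature_list, hidden_units=[32, 16, 8], n_classes=4)
# train the classifier
# (for a real classifier, input_fn must return integer class labels
# in the range 0 to n_classes - 1 rather than house prices)
model1.train(input_fn, steps=20)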

### Preparing to train with Estimators

For this exercise, we'll return to the King County housing transaction dataset from chapter 2. We will again develop and train a machine learning model to predict house prices; however, this time, we'll do it using the estimator API.

Rather than completing everything in one step, we'll break this procedure down into parts. We'll begin by defining the feature columns and loading the data. In the next exercise, we'll define and train a premade estimator. Note that feature_column has been imported for you from tensorflow. Additionally, numpy has been imported as np, and the King County housing dataset is available as a pandas DataFrame: housing.

In [None]:
# Define feature columns for bedrooms and bathrooms
bedrooms = feature_column.numeric_column("bedrooms")
bathrooms = feature_column.numeric_column('bathrooms')

# Define the list of feature columns
feature_list = [bedrooms, bathrooms]

def input_fn():
    # Define the labels
    labels = np.array(housing.price)
    # Define the features
    features = {'bedrooms':np.array(housing['bedrooms']), 
                'bathrooms':np.array(housing.bathrooms)}
    return features, labels

### Defining Estimators

In the previous exercise, you defined a list of feature columns, feature_list, and a data input function, input_fn(). In this exercise, you will build on that work by defining an estimator that makes use of input data.

Use a deep neural network regressor with 2 nodes in both the first and second hidden layers and 1 training step.

In [None]:
# Define the model and set the number of steps
model = estimator.DNNRegressor(feature_columns=feature_list, hidden_units=[2,2])
model.train(input_fn, steps=1)

Modify the code to use a LinearRegressor(), remove the hidden_units, and set the number of steps to 2.

In [None]:
# Define the model and set the number of steps
model = estimator.LinearRegressor(feature_columns=feature_list)
model.train(input_fn, steps=2)

Note that you have other premade estimator options, such as BoostedTreesRegressor(), and that you can also create your own custom estimators.

# Summary

# TensorFlow extensions
1. TensorFlow Hub
    - Pretrained models
    - Transfer learning - useful for training image classifier using small # of images with a feature extractor trained on larger dataset
2. TensorFlow Probability
    - More statistical distributions - for random number generation
    - Trainable statistical distributions - can be incorporated into your models
    - Extended set of optimizers
3. TensorFlow 2.0
    - eager execution - enabled by default
    - tighter keras integration
    - Estimators API has more important role
    - static graphs through tf.function()


1. Congratulations!
Congratulations! You've now completed this course on the fundamentals of the TensorFlow API in Python. In this final video, we'll review what you've learned, talk about two useful TensorFlow extensions, and then wrap up with a discussion of the transition to TensorFlow 2.0.

2. What you learned
In chapter 1, you learned low-level, basic, and advanced operations in TensorFlow. You learned how to define and manipulate variables and constants. You also learned the graph-based computational model that underlies TensorFlow and how it can be used to compute gradients and solve arbitrary optimization problems. In chapter 2, you learned how to load and transform data for use in your TensorFlow projects. You also saw how to use predefined and custom loss functions. We ended with a discussion of how to train models, and when and how to divide the training into batches.
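
As a short reminder of the gradient-based optimization from chapter 1, here is a minimal sketch. The quadratic loss, the starting value, and the learning rate are assumptions chosen for illustration and are not taken from the course exercises. It uses `tf.GradientTape` to compute the gradient and applies plain gradient descent by hand.

```python
import tensorflow as tf

# Hypothetical toy problem: minimize (x - 3)^2 starting from x = 0
x = tf.Variable(0.0)
learning_rate = 0.1

for _ in range(100):
    # Record the loss computation so its gradient can be taken
    with tf.GradientTape() as tape:
        loss = (x - 3.0) ** 2
    grad = tape.gradient(loss, x)
    # Gradient descent step: x <- x - learning_rate * d(loss)/dx
    x.assign_sub(learning_rate * grad)

print(x.numpy())
```

After the loop, `x` should be close to the minimizer at 3. In practice one of the built-in optimizers would usually replace the manual update step.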

3. What you learned
In chapter 3, we moved on to training neural networks. You learned how to define neural network architecture in TensorFlow, both using low-level linear algebra operations and high-level Keras API operations. We talked about how to select activation functions and optimizers, and, ultimately, how to train models. In chapter 4, you learned how to make full use of the Keras API to train models in TensorFlow. We discussed the training and validation process and also introduced the high-level Estimators API, which can be used to streamline the production process.

4. TensorFlow extensions
In addition to what we covered, there are two important TensorFlow extensions that did not fit into the course but may be worthwhile to explore on your own. The first is TensorFlow Hub, which allows users to import pretrained models that can then be used to perform transfer learning. This is particularly useful when you want to train an image classifier with a small number of images, but want to make use of a feature extractor trained on a much larger set of different images. TensorFlow Probability is another exciting extension, which is also currently available as a standalone module. One benefit of using TensorFlow Probability is that it provides additional statistical distributions that can be used for random number generation. It also enables you to incorporate trainable statistical distributions into your models. Finally, TensorFlow Probability provides an extended set of optimizers that are commonly used in statistical research. This gives you additional tools beyond what the core TensorFlow module provides.

5. TensorFlow 2.0
Finally, I will say a few words about the difference between TensorFlow 2 and TensorFlow 1. If you primarily develop in TensorFlow 1, you may have noticed that in this course you did not need to define static graphs or enable eager execution. Eager execution is enabled automatically in TensorFlow 2. Furthermore, TensorFlow 2 has substantially tighter integration with Keras. In fact, the core functionality of the TensorFlow 1 train module is handled by tf.keras operations in TensorFlow 2. In addition to the centrality of Keras, the Estimators API also plays a more important role in TensorFlow 2. Finally, TensorFlow 2 still allows you to use static graphs, but they are available through the tf.function operation.
