#### Copyright 2020 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Linear Regression with TensorFlow

In previous units we learned about regression and about how to build and apply a regression model using [scikit-learn](https://scikit-learn.org/stable/). For many regression cases `scikit-learn` is more than adequate. However, there are times when more powerful tools are needed. [TensorFlow](https://www.tensorflow.org/) is one of those tools. It is a computational toolkit built to perform machine learning and data science tasks at scale.

## Overview

### Estimated Duration

120 minutes

## Problem Framing

Machine learning is one of a variety of solutions that might work for solving a problem. It is always important to understand the problem space before diving in and starting to clean data and code.

In this lab we would like to be able to **predict the price of a house**.

Questions we should ask ourselves might include:

* Predict the price when? Now? In the past? In the future? For what range?
* Where are we predicting prices for? One market? One state? One country?
* What is our tolerance for being wrong?
* Are we okay with a few huge outliers if the overall model is better?
* What metrics are we using to define success and what are the acceptable values?
* Is there a non-ML way to solve this problem?
* What data is available to solve the problem?

The list of questions is boundless. Eventually you'll need to move on, but understanding the problem and the solution space is vital.

---

For this problem we'll further define the problem by saying:

>  We want to create a system that predicts the prices of houses in California in 1990. We have census data from 1990 available to build and test the system. We will accept a system with a root mean squared error of 200,000 or better.

Since this is a contrived example we'll short-cut and say that our analysis has led us to believe that we want to use a linear regression model to serve as our prediction system.

## Exploratory Data Analysis

The dataset we'll use for this Colab contains California housing data taken from the 1990 census data. This is a popular dataset for experimenting with machine learning models.

As with any data science project it is a good idea to take some time and review the [data schema and description](https://developers.google.com/machine-learning/crash-course/california-housing-data-description). Ask yourself:

* What data is available? What are the columns?
* What do those columns mean?
* What data types are those columns?
* What is the granularity of the data? In this particular case, what is a "block"?
* How many rows of data are there?
* Roughly how big is the data? Kilobytes? Megabytes? Gigabytes? Terabytes? More?
* Are any of the columns highly correlated?
* What bias is contained in the data?
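
Several of these questions can be answered with a few Pandas calls once the data is loaded. The sketch below shows the pattern on a tiny made-up `DataFrame`; the name `toy_df` and its values are illustrative assumptions, not the housing data itself.

```python
import pandas as pd

# Made-up stand-in data; the lab itself loads the housing CSV instead.
toy_df = pd.DataFrame({
    'longitude': [-114.31, -114.47, -114.56],
    'latitude': [34.19, 34.40, 33.69],
    'total_rooms': [5612.0, 7650.0, 720.0],
})

# How many rows and columns are there?
n_rows, n_columns = toy_df.shape

# What data types are the columns?
column_types = toy_df.dtypes

# Roughly how big is the data in memory, in bytes?
size_in_bytes = toy_df.memory_usage(deep=True).sum()

print(n_rows, n_columns)
print(column_types)
print(size_in_bytes)
```

The in-memory size is only a rough guide to the size on disk, but it is usually enough to tell kilobytes from gigabytes.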

### Load the data

Now that we have an understanding of the data that we are going to use in our model, let's load it into this Colab and examine it more closely.

We'll rely on Pandas to read a CSV version of the data from the internet.

In [0]:
import pandas as pd

url = ('https://download.mlcc.google.com/mledu-datasets/' + 
       'california_housing_train.csv')

housing_df = pd.read_csv(url)

housing_df

### Exploration

You should always look at your data and statistics about that data before you begin modelling it. First though, let's see the columns and data types that we have available.

In [0]:
housing_df.dtypes

Eight floating point features and a floating point target, `median_house_value`. This is what we expect based on the [data documentation](https://developers.google.com/machine-learning/crash-course/california-housing-data-description).

It is a good idea to also describe the data set.

#### Statistics

In [0]:
housing_df.describe()

In this case we can see that all of the column counts are the same. That lets us know that every data point has a value. This can sometimes give you a false sense of security because many datasets have default values instead of empty values.
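
One way to look past matching counts is to check for placeholder values as well as truly missing ones. The sketch below uses made-up data, and the placeholder value of `-1` is purely an assumption for illustration; a real dataset's documentation should tell you which default values to look for.

```python
import pandas as pd

# Made-up data: one genuinely missing value and one suspicious placeholder.
toy_df = pd.DataFrame({
    'total_rooms': [5612.0, None, 720.0, 1501.0],
    'population': [1015.0, 1129.0, -1.0, 515.0],
})

# Count truly missing values per column.
missing_counts = toy_df.isna().sum()

# Count values equal to a hypothetical placeholder (-1 is an assumption).
SENTINEL = -1.0
sentinel_counts = (toy_df == SENTINEL).sum()

print(missing_counts)
print(sentinel_counts)
```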

Looking at the min and max can be helpful too. Does a 1.0 value for a minimum number of rooms for a block match your mental model of what a block is? What about that max of 37,937.0 rooms? In cases like this it can be useful to [research your topic area](https://www2.census.gov/geo/pdfs/reference/GARM/Ch11GARM.pdf).

In this particular case those numbers might be okay, as long as the block with the most rooms is in an urban area with very dense and tall buildings. As you probe a dataset you should ask yourself questions like this. When something doesn't look right, investigate it.

##### Exercise 1: Sanity Check

Use Pandas to find the row of data that contains the census block with the largest number of rooms. Search for the latitude and longitude for that location and answer the questions below.

**Student Solution**

In [0]:
# Your Code Goes Here

1. What city is the block located in?
> *Your solution goes here*
1. Are 37937.0 total rooms reasonable? Why?
> *Your solution goes here*


---

###### Answer Key

In [0]:
import pandas as pd

url = ('https://download.mlcc.google.com/mledu-datasets/' + 
       'california_housing_train.csv')

housing_df = pd.read_csv(url)

housing_df.sort_values(by='total_rooms').tail(4)[[
  'total_rooms','latitude', 'longitude']]

1. What city is the block located in?
> Yorba Linda, CA
1. Are 37937.0 total rooms reasonable? Why?
> That's a tough one without knowing the dimensions of the census block, but the answer is most likely that the data is not reasonable. The entire town of Yorba Linda has just over 20,000 residents and there is likely more than one census block in the town. If there were a military base or prison nearby, then it might make sense, but there doesn't seem to be. This does look like bad data.

---

#### Sampling

It is also a good idea to take a look at the actual data. We can use Pandas' `head()`, `tail()`, and/or `sample()` methods to do this.

In [0]:
housing_df.head(10)

Did you gain any insight from peeking at the actual data? Is the data sorted in a manner that might lead to a bad model?

In this case the data seems to be sorted ascending by longitude and possibly secondarily descending by latitude. We need to consider this when sampling or splitting the data.
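
Eyeballing ten rows only goes so far. As a rough sketch, Pandas can report whether an entire column is ordered. The data below is made up for illustration, and the check only detects whole-column ordering, so a secondary sort key would usually show up as unsorted.

```python
import pandas as pd

# Made-up coordinates, not the housing data.
toy_df = pd.DataFrame({
    'longitude': [-124.3, -124.3, -122.1, -118.4],
    'latitude': [41.8, 40.5, 37.7, 34.0],
})

# Record whether each column is ordered from top to bottom.
sort_order = {}
for column in toy_df.columns:
  if toy_df[column].is_monotonic_increasing:
    sort_order[column] = 'ascending'
  elif toy_df[column].is_monotonic_decreasing:
    sort_order[column] = 'descending'
  else:
    sort_order[column] = 'unsorted'

print(sort_order)
```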

#### Correlation

It is important to understand how columns relate to one another. Every feature that you add to your training set increases the amount of work that must be done to train your model. If you can find columns with a high degree of correlation, you can potentially leave one of them out of your training and still get a model that performs well.
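
A heatmap is easy to read for a handful of columns, but a ranked list of pairs can be handy too. The sketch below is one possible way to produce such a list. It uses made-up values in a hypothetical `toy_df`, so the numbers say nothing about the real housing data.

```python
import numpy as np
import pandas as pd

# Made-up values; two columns are built to move together.
toy_df = pd.DataFrame({
    'total_bedrooms': [100.0, 200.0, 300.0, 400.0],
    'households': [90.0, 210.0, 290.0, 410.0],
    'median_age': [30.0, 12.0, 45.0, 20.0],
})

# Absolute correlation, so strong negative relationships rank highly too.
corr = toy_df.corr().abs()

# Keep only the upper triangle so each pair appears once and the
# diagonal (each column with itself) is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Turn the matrix into (column_a, column_b) -> correlation and sort it.
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs)
```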

Let's create a correlation matrix heatmap for our data set.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,10))
_ = sns.heatmap(housing_df.corr(), cmap='coolwarm', annot=True)

##### Exercise 2: Correlated Columns

Answer the following questions about the correlation between columns in our data set.

**Student Solution**

1. What columns are the most highly correlated?
> *Your solution goes here.*
1. Which column is most strongly correlated with `median_house_value`?
> *Your solution goes here.*
1. Which columns have the strongest negative correlation?
> *Your solution goes here.*


---

###### Answer Key

1. What columns are the most highly correlated?
> total_bedrooms and households
1. Which column is most strongly correlated with `median_house_value`?
> median_income
1. Which columns have the strongest negative correlation?
> latitude and longitude


---

### Data Preprocessing

Now is the stage where we would perform model-independent pre-processing to repair any missing data. Since there isn't really any missing data, we don't have much pre-processing to do.

Let's look at those room counts again though. The values seem a little odd.

First we'll plot the room counts in ascending order.

In [0]:
rooms = housing_df['total_rooms'].sort_values().reset_index(drop=True)

plt.figure(figsize=(12,10))
_ = sns.lineplot(x=rooms.index.values, y=rooms)

That's quite a spike there at the end!

Looking at the chart, let's pick a point where the number of rooms starts to slope sharply upward, say 10,000. If we chose to drop the rows with really large values, what would that do to our data?

In [0]:
many_rooms = rooms[rooms > 10000].size

percent = (many_rooms / rooms.size) * 100

print(f'{many_rooms} blocks have more than 10000 rooms ' +
      f'which is {percent:0.2f}% of our data')

So we'd knock out over 1% of our data by trying to remove what we think are outliers. That's not horrible, but is probably not something we would want to do on a hunch.

At this point we'll choose not to make any model-independent changes to our data and will continue building the model.

## Modeling

It is time to actually build our model. In this case we know we are going to build a linear regression model using TensorFlow. We could build the model by hand, but luckily we don't have to. TensorFlow provides many pre-built models in its [estimator](https://www.tensorflow.org/api_docs/python/tf/estimator) library. We are going to use the [`tensorflow.estimator.LinearRegressor`](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor) model.

The [`Estimator`](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator) class is the base for TensorFlow estimators. Its methods define the API for estimators. In the remainder of this lab we will create an instance of `LinearRegressor` and use the `Estimator` API to train the model and make test predictions.

### Prepare the Data

Earlier we considered pre-processing the data. That pre-processing was generic and was intended to correct errors in the data set.

Now that we have chosen a model, we need to do specific processing related to the type of model that we'll be building and how we are going to test and train the model.

Initially we'll be using the `LinearRegressor` with default options. For measuring model quality we will perform hold-out testing with 20% of the data being held out for test.

#### Normalization

The scale and range of data in each column of our dataset varies widely. In many models, larger values will be over-considered in training. In order to combat this we can *normalize* our data.

Note that we only want to normalize the feature data so let's first create variables to hold our feature and target column names.

In [0]:
target_column = 'median_house_value'
feature_columns = [c for c in housing_df.columns if c != target_column]

target_column, feature_columns

To normalize we subtract the minimum value from each column and then divide by the difference between the max and the min. This makes all of our feature values fall into the range of 0.0 to 1.0. You can see in the `describe()` output below that every column now has a min value of 0.0 and a max value of 1.0.

In [0]:
housing_df.loc[:, feature_columns] = (
    housing_df[feature_columns] - 
      housing_df[feature_columns].min()) / (
          housing_df[feature_columns].max() -
            housing_df[feature_columns].min())

housing_df[feature_columns].describe()

Another option would be to *standardize* the data. Standardization is the process of subtracting the mean from each column and then dividing by the standard deviation. We chose not to do that in this case because that creates negative values, which don't work well with this model.
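
For comparison, here is a minimal sketch of standardization on made-up data, so you can see the negative values it produces. The `toy_df` name and its values are assumptions for illustration. Note that Pandas' `std()` computes the sample standard deviation by default.

```python
import pandas as pd

# Made-up feature values, not the housing data.
toy_df = pd.DataFrame({
    'total_rooms': [2.0, 4.0, 6.0],
    'population': [10.0, 20.0, 60.0],
})

# Subtract each column's mean, then divide by its standard deviation.
standardized = (toy_df - toy_df.mean()) / toy_df.std()

print(standardized)
```

Every standardized column ends up centered on zero, so any column with more than one distinct value will contain at least one negative value.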

Should we modify the target in any way?

Let's take a look at the values again.

In [0]:
housing_df[target_column].describe()

Those are some pretty big values. It does look like there is a ceiling of 500,001 applied to the data and a minimum value of 14,999.

Given enough time, our model could train to predict values this large. However, we are going to be using a pretty small learning rate by default with the [`Ftrl` optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Ftrl): 0.0001. In order to speed things up we can shrink the values in the target column by some constant.

In [0]:
TARGET_FACTOR = 100000

housing_df[target_column] = housing_df[target_column] / TARGET_FACTOR

housing_df[target_column].describe()

We've reduced the values from the range of 14,999-500,001 to 0.14999-5.00001. This should allow the model to converge faster. Of course, now our predictions will need to be multiplied by 100,000 in order to reflect real dollar values.

#### Train/Test Split

We want to go ahead and divide our data into testing and training splits. For this example we'll hold out 20% of the data for testing.

One easy way to do that is just to slice the data; however, our data is sorted by latitude and longitude so we need to shuffle it first so that we aren't testing with data from just one location in California.

In [0]:
# Shuffle
housing_df = housing_df.sample(frac=1)

# Calculate test set size
test_set_size = int(len(housing_df) * 0.2)

# Split the data
testing_df = housing_df[:test_set_size]
training_df = housing_df[test_set_size:]

print(f'Holding out {len(testing_df)} records for testing. ')
print(f'Using {len(training_df)} records for training.')
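
Because `sample(frac=1)` shuffles differently on every run, the split above changes each time the cell is executed. If you want a repeatable split, one option is to pass a `random_state`. The sketch below uses a made-up ten-row `DataFrame`, and the seed value of 42 is an arbitrary assumption.

```python
import pandas as pd

# Made-up data with an easy-to-track row id.
toy_df = pd.DataFrame({'row_id': range(10)})

# random_state fixes the shuffle order; 42 is an arbitrary choice.
shuffled_df = toy_df.sample(frac=1, random_state=42)

# Hold out 20% of the shuffled rows for testing.
test_set_size = int(len(shuffled_df) * 0.2)
toy_testing_df = shuffled_df[:test_set_size]
toy_training_df = shuffled_df[test_set_size:]

print(len(toy_testing_df), len(toy_training_df))
```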

### Load TensorFlow

We'll next load the [TensorFlow](http://www.tensorflow.org) library.

TensorFlow released version 2.0 in late 2019. As of the writing of the lab, Colab supports both versions 1 and 2, but it defaults to version 1. In order to tell Colab to use TensorFlow 2 you need to run the magic in the cell below.

In [0]:
%tensorflow_version 2.x

Next we'll load TensorFlow and check to make sure that we are running version 2.

In [0]:
import tensorflow as tf
tf.__version__

Finally, we set some global settings for TensorFlow. In this case we want to ensure that any time there is a question about the size of a floating point value, it is processed as a 64-bit number.

In [0]:
tf.keras.backend.set_floatx('float64')

### TensorFlow Data Set

`DataFrame` is a container for a dataset in Pandas. To process the data with TensorFlow we need to get the data in the `DataFrame` into a TensorFlow [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).

Since our housing data fits in memory, we can use the `from_tensor_slices` class method to create our `Dataset`. There are a few different data formats that we could pass the method, but our model expects a feature map and a list of labels.

A feature map is a Python dictionary with feature names for keys and an iterable of column values as the value. Labels are just an iterable of our target values.
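
To make that shape concrete, here is a small sketch that builds a feature map and labels from a made-up `DataFrame` using Pandas only. The names `toy_df`, `toy_target`, and `toy_features` are assumptions for illustration.

```python
import pandas as pd

# Made-up, already-scaled values.
toy_df = pd.DataFrame({
    'median_income': [0.2, 0.5, 0.9],
    'total_rooms': [0.1, 0.3, 0.7],
    'median_house_value': [0.67, 1.80, 3.50],
})
toy_target = 'median_house_value'
toy_features = [c for c in toy_df.columns if c != toy_target]

# Feature map: feature name -> column of values.
feature_map = {c: toy_df[c] for c in toy_features}

# Labels: the target values, in the same row order as the features.
labels = toy_df[toy_target]

# Row i pairs the i-th value of every feature with the i-th label.
first_row = (
    {name: column.iloc[0] for name, column in feature_map.items()},
    labels.iloc[0],
)
print(first_row)
```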

Below we create the test and training `Dataset` objects.

In [0]:
testing_ds = tf.data.Dataset.from_tensor_slices((
    {c: testing_df[c] for c in feature_columns},  # feature map
    testing_df[target_column]                     # labels
))

training_ds = tf.data.Dataset.from_tensor_slices((
    {c: training_df[c] for c in feature_columns},  # feature map
    training_df[target_column]                     # labels
))

testing_ds, training_ds

The code above runs and displays two `TensorSliceDataset` objects that seem to have the correct columns. However, we can't tell how many rows of data each contains.

Intuitively you'd think this would be as simple as asking for the length of the data sets from Python:

```
 len(testing_ds)
 len(training_ds)
```

This won't work though. TensorFlow data set objects can represent in-memory data, like what we have now. They can also represent data in multiple sources stored in different locations. They can even represent a stream of data that is never-ending. For this reason having a standard `len` is impossible.

Because of this we need to do a little more work to get a count of the data in a TensorFlow data set. To get a count we'll use the `reduce` operation. This operation takes an initial value, in our case 0, and a function that it applies once for each row in the dataset. Each call receives the running value and the current row and returns a new running value, which is fed to the call for the next row. In this case we just add one for each row. The value returned for the final row is the result of the reduction.
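
If the idea of a reduction is new to you, the same counting pattern can be sketched in plain Python with `functools.reduce`. This is only an analogy: the rows are made up, and the argument order differs from the TensorFlow method, which takes the initial value first.

```python
from functools import reduce

# Made-up (features, label) rows standing in for a data set.
rows = [
    ({'total_rooms': 0.1}, 0.67),
    ({'total_rooms': 0.3}, 1.80),
    ({'total_rooms': 0.7}, 3.50),
]

# Start at 0 and add one for every row, ignoring the row's contents.
row_count = reduce(lambda count, _row: count + 1, rows, 0)

print(row_count)  # 3
```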

We can see below that the `reduce` operation counts the number of rows for the testing and training dataset and they both match the values we saw above in the Colab.

In [0]:
import numpy as np

testing_ds_count = testing_ds.reduce(np.int64(0), lambda x, _: x + 1)
training_ds_count = training_ds.reduce(np.int64(0), lambda x, _: x + 1)

print(testing_ds_count.numpy())
print(training_ds_count.numpy())

### LinearRegressor

The model that we'll use is the [LinearRegressor](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor). This class complies with the  TensorFlow [Estimator](https://www.tensorflow.org/get_started/estimator) API. This API takes care of a lot of the low-level model plumbing, and exposes convenient methods for performing model training, evaluation, and inference.

Though the `LinearRegressor` has many configuration options, [only feature columns have to be specified when the regressor is created](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor#__init__).

We provide the regressor [feature columns](https://www.tensorflow.org/guide/feature_columns) as a list of columns that we'd like the model to use for training and prediction. For now that will be every one of our features. These columns are all floating point numbers so we use a list comprehension to create a list of `float64` `numeric_column` objects.

A warning will be issued if you don't specify a `model_dir`. For now that is fine since we don't plan on saving our model and plan to train it completely now. If we do specify a model directory, state will be saved that can cause issues as you iterate on the design of the model.

In [0]:
housing_features = [
    tf.feature_column.numeric_column(c, dtype=tf.dtypes.float64) 
      for c in feature_columns
]

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
)

linear_regressor

If we had multiple workers, we could distribute the training and evaluation of the model by using a distribution strategy. In the example below you can see that we are using a [`MirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) to spread out the work.

More information on distributing `Estimator` work can be found [in the TensorFlow documentation](https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_estimator_limited_support).

In [0]:
housing_features = [
    tf.feature_column.numeric_column(c, dtype=tf.dtypes.float64) 
      for c in feature_columns
]

mirrored_strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(
    train_distribute=mirrored_strategy,
    eval_distribute=mirrored_strategy,
)

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
    config=config,
)

linear_regressor

### Training Input Function

The `LinearRegressor` that we just created is still not trained. To train the model we need to call the [train](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor#train) method and pass it an input function that provides a `Dataset` to extract data from.

We saw how to create a `Dataset` earlier. It would be nice if we could reuse that `Dataset`, but TensorFlow requires that you create the `Dataset` in your function, so we'll use the same `Dataset` creation code from above.

We also need to change a few attributes of the dataset. Our training data only has 13600 records, which isn't a lot of data. We can choose to repeat the data so that it is fed to the model multiple times. In this case we chose to repeat it 100 times. Hopefully this will give the optimizer enough data to find a good solution.

Since we are repeating the same data over and over, we are also going to shuffle it. This will add some variability to the order of the training data.

Finally, we choose to process the data in batches of 100. These mini-batches of 100 are used for a single optimization step.

In [0]:
def training_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: training_df[c] for c in feature_columns},  # feature map
    training_df[target_column]                     # labels
  ))
  ds = ds.repeat(100)
  ds = ds.shuffle(buffer_size=10000)
  ds = ds.batch(100)
  return ds
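
It can help to work out how much training these settings imply. The arithmetic below is a small sketch using the record count, repeat count, and batch size from this section; the variable names are just for illustration.

```python
import math

records = 13600   # training rows after the 80/20 split
repeats = 100     # value passed to ds.repeat()
batch_size = 100  # value passed to ds.batch()

# Total examples fed to the model across all repeats.
examples_seen = records * repeats

# Each batch drives one optimization step.
optimization_steps = math.ceil(examples_seen / batch_size)

print(examples_seen, optimization_steps)
```

Smaller batches mean more optimization steps for the same amount of data, which is one reason batch size affects both training time and model quality.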

### Training

We can now call the `train` method on the regressor, passing it the input function that we defined.

In [0]:
linear_regressor.train(input_fn=training_input)

In the output above, the `LinearRegressor` reports the loss as it trains. This output will be useful later on, when we tweak the learning rate.

### Testing Input Function

In order to evaluate the quality of our model we need to make predictions and see how close they are to reality. To do this we rely on the [`predict()`](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor#predict) method.

Similar to `train`, this method expects an input function. We'll create one similar to the one we created for train, only we won't repeat or shuffle the data and will process the data in batches of 1.

#### Exercise 3: Create a Testing Input Function

Create a testing input function called `testing_input`. The function should accept no arguments and should return a `Dataset`. The `Dataset` should not repeat, should not shuffle, and should have batches of size 1. Also, target/label values aren't needed for testing input.

In [0]:
# Your code goes here

---

##### Answer Key

In [0]:
def testing_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: testing_df[c] for c in feature_columns},  # feature map
    testing_df[target_column]                     # labels
  ))
  ds = ds.batch(1)
  return ds

---

### Make Predictions

Now we need to make predictions using our test features. To do that we pass our testing input function to the `predict` method on our trained linear regressor.

In [0]:
predictions = linear_regressor.predict(input_fn=testing_input)

That runs pretty fast... almost suspiciously fast. The reason is that the model isn't actually making predictions at this point. We have just built the graph to make predictions. TensorFlow is using lazy execution. The predictions won't be made until we ask for them.
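
Python generators behave in a similar way, which makes them a useful mental model here. The sketch below is a plain-Python analogy with a made-up `lazy_predictions` function; it is not how TensorFlow computes predictions.

```python
def lazy_predictions(features):
  """Yields one made-up 'prediction' per feature, only when asked."""
  for value in features:
    print('computing prediction for', value)
    yield value * 2


pending = lazy_predictions([1, 2, 3])  # Nothing is computed yet.
computed = list(pending)               # The work happens here.

print(computed)
```

A generator can only be consumed once. Asking `pending` for values a second time yields nothing, so keep the results in a list if you need them more than once.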

Let's go ahead and get the predictions and put them in a list so that we can calculate our error.

In [0]:
predicted_median_values = [item['predictions'][0] for item in predictions]
print("Our predictions: ", predicted_median_values)

### Evaluate Model

Now that we have predictions we can compare them to our actual values and evaluate the quality of our model.

In [0]:
import math

from sklearn import metrics

mean_squared_error = metrics.mean_squared_error(
    np.array(predicted_median_values) * TARGET_FACTOR,
    testing_df[target_column] * TARGET_FACTOR
)
print("Mean Squared Error (on test data): %0.3f" % mean_squared_error)

root_mean_squared_error = math.sqrt(mean_squared_error)
print("Root Mean Squared Error (on test data): %0.3f" % root_mean_squared_error)

What is this telling us? The mean squared error is somewhat hard to think about because it is in squared units. However, when you take the square root you get a value in the units of the target column. In our test run we were `71020.204` dollars off on our predictions.
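
To see where these two numbers come from, here is the same calculation done by hand on three made-up prices. The values are assumptions chosen for illustration and have nothing to do with the model's actual output.

```python
import math

# Made-up actual and predicted prices, in dollars.
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 120000.0, 300000.0]

# Square each error so that over- and under-predictions both count.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]

# Mean squared error is in squared dollars.
mse = sum(squared_errors) / len(squared_errors)

# Taking the square root brings the value back to dollars.
rmse = math.sqrt(mse)

print(mse, rmse)
```

Because the errors are squared before they are averaged, a single large miss pulls the RMSE up more than several small ones would.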

Is that good?

Let's see what the mean price is in our test data.

In [0]:
testing_df[target_column].mean() * TARGET_FACTOR

About 210,000 dollars. 71,000 is about 34% of 210,000, so our model's root mean squared error is about 34% of the mean price. I probably wouldn't make many bets using the model.

## Exercise 4: Hyperparameters

There are a few hyperparameters that we can adjust in order to try to improve our model. In the code cell below you'll find most of the code that we've used so far in this lab. There are three `TODO` markers in the code. Find them and:

1. Have the model use the [Adam Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam)
1. Configure the training `Dataset`. Experiment with different batch sizes. Leave the batch size that performs the best in the code.
1. Configure the testing `Dataset`.

**Student Solution**

In [0]:
%tensorflow_version 2.x

import math
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn import metrics

tf.keras.backend.set_floatx('float64')

# Load the data
url = ('https://download.mlcc.google.com/mledu-datasets/' + 
       'california_housing_train.csv')

housing_df = pd.read_csv(url)

# Create lists of column names
target_column = 'median_house_value'
feature_columns = [c for c in housing_df.columns if c != target_column]

# Normalize the feature columns
housing_df.loc[:, feature_columns] = (
    housing_df[feature_columns] - 
      housing_df[feature_columns].min()) / (
          housing_df[feature_columns].max() -
            housing_df[feature_columns].min())

# Scale the target column
TARGET_FACTOR = 100000
housing_df[target_column] = housing_df[target_column] / TARGET_FACTOR

# Test/Train split
housing_df = housing_df.sample(frac=1)
test_set_size = int(len(housing_df) * 0.2)
testing_df = housing_df[:test_set_size]
training_df = housing_df[test_set_size:]

# Create TensorFlow features
housing_features = [
    tf.feature_column.numeric_column(c, dtype=tf.dtypes.float64) 
      for c in feature_columns
]

# Create model
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
    # TODO: Set Optimizer
)

# Train the model
def training_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: training_df[c] for c in feature_columns},  # feature map
    training_df[target_column]                     # labels
  ))
  # TODO: Configure DataSet
  return ds

linear_regressor.train(
 input_fn=training_input
)

# Make predictions
def testing_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: testing_df[c] for c in feature_columns},  # feature map
    testing_df[target_column]                     # labels
  ))
  # TODO: Configure DataSet
  return ds

predictions_node = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predicted_median_values = np.array(
    [item['predictions'][0] for item in predictions_node])

# Find the RMSE
root_mean_squared_error = math.sqrt(
    metrics.mean_squared_error(
      predicted_median_values * TARGET_FACTOR,
      testing_df[target_column] * TARGET_FACTOR
))

print("Root Mean Squared Error (on test data): %0.3f" %
      root_mean_squared_error)

---

### Answer Key

In [0]:
%tensorflow_version 2.x

import math
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn import metrics

tf.keras.backend.set_floatx('float64')

# Load the data
url = ('https://download.mlcc.google.com/mledu-datasets/' + 
       'california_housing_train.csv')

housing_df = pd.read_csv(url)

# Create lists of column names
target_column = 'median_house_value'
feature_columns = [c for c in housing_df.columns if c != target_column]

# Normalize the feature columns
housing_df.loc[:, feature_columns] = (
    housing_df[feature_columns] - 
      housing_df[feature_columns].min()) / (
          housing_df[feature_columns].max() -
            housing_df[feature_columns].min())

# Scale the target column
TARGET_FACTOR = 100000
housing_df[target_column] = housing_df[target_column] / TARGET_FACTOR

# Test/Train split
housing_df = housing_df.sample(frac=1)
test_set_size = int(len(housing_df) * 0.2)
testing_df = housing_df[:test_set_size]
training_df = housing_df[test_set_size:]

# Create TensorFlow features
housing_features = [
    tf.feature_column.numeric_column(c, dtype=tf.dtypes.float64) 
      for c in feature_columns
]

# Create model
adam_optimizer = tf.keras.optimizers.Adam()
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
    optimizer=adam_optimizer,
)


# Train the model
def training_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: training_df[c] for c in feature_columns},  # feature map
    training_df[target_column]                     # labels
  ))
  ds = ds.repeat(100)
  ds = ds.shuffle(buffer_size=10000)
  ds = ds.batch(10)
  return ds

linear_regressor.train(
 input_fn=training_input
)

# Make predictions
def testing_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: testing_df[c] for c in feature_columns},  # feature map
    testing_df[target_column]                     # labels
  ))
  ds = ds.batch(1)
  return ds

predictions_node = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predicted_median_values = np.array(
    [item['predictions'][0] for item in predictions_node])

# Find the RMSE
root_mean_squared_error = math.sqrt(
    metrics.mean_squared_error(
      predicted_median_values * TARGET_FACTOR,
      testing_df[target_column] * TARGET_FACTOR
))

print("Root Mean Squared Error (on test data): %0.3f" %
      root_mean_squared_error)

---

## Exercise 5: Weights

The `LinearRegressor` builds a linear model with weights for each feature. Use the `get_variable_names` and `get_variable_value` methods to find the weights. Print the weights in a format similar to that shown below:

```
bias_weights 3.170546
population -12.792054
median_income 5.906482
total_bedrooms 5.3723865
households 4.3297663
longitude -3.7551448
latitude -3.533678
total_rooms -2.850763
housing_median_age 0.66154426
```

The columns are sorted by the relative impact to the formula (absolute value). Notice the bias weights in the list. This is the constant bias and should go first in the list.

**Student Solution**

In [0]:
# Your Code Goes Here

---

### Answer Key

In [0]:
%tensorflow_version 2.x

import math
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn import metrics

tf.keras.backend.set_floatx('float64')

# Load the data
url = ('https://download.mlcc.google.com/mledu-datasets/' + 
       'california_housing_train.csv')

housing_df = pd.read_csv(url)

# Create lists of column names
target_column = 'median_house_value'
feature_columns = [c for c in housing_df.columns if c != target_column]

# Normalize the feature columns
housing_df.loc[:, feature_columns] = (
    housing_df[feature_columns] - 
      housing_df[feature_columns].min()) / (
          housing_df[feature_columns].max() -
            housing_df[feature_columns].min())

# Scale the target column
TARGET_FACTOR = 100000
housing_df[target_column] = housing_df[target_column] / TARGET_FACTOR

# Test/Train split
housing_df = housing_df.sample(frac=1)
test_set_size = int(len(housing_df) * 0.2)
testing_df = housing_df[:test_set_size]
training_df = housing_df[test_set_size:]

# Create TensorFlow features
housing_features = [
    tf.feature_column.numeric_column(c, dtype=tf.dtypes.float64) 
      for c in feature_columns
]

# Create model
adam_optimizer = tf.keras.optimizers.Adam()
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
    optimizer=adam_optimizer,
)


# Train the model
def training_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: training_df[c] for c in feature_columns},  # feature map
    training_df[target_column]                     # labels
  ))
  ds = ds.repeat(100)
  ds = ds.shuffle(buffer_size=10000)
  ds = ds.batch(10)
  return ds

linear_regressor.train(
 input_fn=training_input
)

# Make predictions
def testing_input():
  ds = tf.data.Dataset.from_tensor_slices((
    {c: testing_df[c] for c in feature_columns},  # feature map
    testing_df[target_column]                     # labels
  ))
  ds = ds.batch(1)
  return ds

predictions_node = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predicted_median_values = np.array(
    [item['predictions'][0] for item in predictions_node])

# Find the RMSE
root_mean_squared_error = math.sqrt(
    metrics.mean_squared_error(
      predicted_median_values * TARGET_FACTOR,
      testing_df[target_column] * TARGET_FACTOR
))

print("Root Mean Squared Error (on test data): %0.3f" %
      root_mean_squared_error)

w = {}
for v in linear_regressor.get_variable_names():
  if v.startswith('linear/linear_model/') and v.endswith('/weights'):
    _, _, name, _ = v.split('/')   
    w[name] = linear_regressor.get_variable_value(v)[0][0]
  elif v.startswith('linear/linear_model/'):
    _, _, name = v.split('/')
    print(name, linear_regressor.get_variable_value(v)[0])

for k, v in sorted(w.items(), key=lambda i: abs(i[1]), reverse=True):
  print(k, v)
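
Once you have the weights, it is worth seeing how a linear model turns them into a prediction: the bias plus the sum of each weight times its feature value. The sketch below reuses the sample weights printed in the exercise description. The `predict_price` helper and the `example_block` values are hypothetical, and your own trained weights will differ.

```python
TARGET_FACTOR = 100000

# Sample values copied from the example output in the exercise text.
sample_bias = 3.170546
sample_weights = {
    'population': -12.792054,
    'median_income': 5.906482,
    'total_bedrooms': 5.3723865,
    'households': 4.3297663,
    'longitude': -3.7551448,
    'latitude': -3.533678,
    'total_rooms': -2.850763,
    'housing_median_age': 0.66154426,
}


def predict_price(normalized_features):
  """Applies bias + sum(weight * feature) and rescales to dollars."""
  scaled = sample_bias + sum(
      sample_weights[name] * value
      for name, value in normalized_features.items())
  return scaled * TARGET_FACTOR


# Hypothetical block: every normalized feature is 0 except median_income.
example_block = {name: 0.0 for name in sample_weights}
example_block['median_income'] = 0.5

print(predict_price(example_block))
```

Because the features were normalized to the range 0.0 to 1.0, the size of each weight gives a rough sense of how much that feature can move the prediction.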

---