#### Copyright 2019 Google LLC.

In [1]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression with TensorFlow

In a previous exercise, you worked with [scikit-learn](https://scikit-learn.org/stable/) to define a linear regression model. Recently you were introduced to [TensorFlow](https://www.tensorflow.org/), a powerful computational toolkit. We will now combine the two and create a linear regression model in TensorFlow.

## Overview

### Learning Objectives

  * Review the TensorFlow programming model
  * Use the `LinearRegressor` class in TensorFlow to predict median housing price, at the granularity of city blocks, based on one input feature
  * Evaluate the accuracy of a model's predictions using Root Mean Squared Error (RMSE)
  * Improve the accuracy of a model by tuning its hyperparameters

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Introduction to Pandas
* Visualizations
* Introduction to TensorFlow

### Estimated Duration

60 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There are 3 graded exercises in this Colab, so there are 9 points available. The grading scale will be 9 points.

## Problem Framing

Machine learning is not a solution looking for a problem, but is instead one of a variety of solutions that might work for an existing problem. Given this, we should begin our journey by understanding the problem we are trying to solve.

In this particular case, we would like to be able to **predict the price of a house in California**.

Questions we should ask ourselves might include:

*  Predict the price when? Now? In the past? In the future? For what range?
*  What is our tolerance for being wrong?
*  Are we okay with a few huge outliers if the overall model is better?
*  What metrics are we using to define success and what are the acceptable values?
*  Is there a non-ML way to solve this problem?
*  What data is available to solve the problem?

The list of questions is boundless. Eventually you'll need to move on, but understanding the problem and the solution space is vital.

---

For this problem we'll further define the problem by saying:

>  We want to create a system that predicts the prices of houses in California in 1990. We have census data from 1990 available to build and test the system. We will accept a system with a root mean squared error of 200,000 or better.

Since this is a contrived example we'll short-cut and say that our analysis has led us to believe that we want to use a linear regression model to serve as our prediction system.

## Data

The dataset we'll use for this Colab contains California housing data taken from the 1990 census. This is a popular dataset for experimenting with machine learning models.

As with any data science project it is a good idea to take some time and review the [data schema and description](https://developers.google.com/machine-learning/crash-course/california-housing-data-description). Ask yourself:

* What data is available? What are the columns?
* What do those columns mean?
* What data types are those columns?
* What is the granularity of the data? In this particular case, what is a "block"?
* How many rows of data are there?
* Roughly how big is the data? Kilobytes? Megabytes? Gigabytes? Terabytes? More?

### Load the data

Now that we have a rough understanding of the data that we are going to use in our model, let's load it into this Colab and examine the data a little more closely.

We'll rely on Pandas to read a CSV version of the data from the internet.

In [2]:
import pandas as pd

housing_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")

### Examine the data

You should always look at your data and statistics about that data before you begin modelling it. A great tool for getting a high-level view is to ask Pandas to describe the data.

In [3]:
housing_df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
| std | 2.005166 | 2.13734 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.79 | 33.93 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.566375 | 119400.0 |
| 50% | -118.49 | 34.25 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5446 | 180400.0 |
| 75% | -118.0 | 37.72 | 37.0 | 3151.25 | 648.25 | 1721.0 | 605.25 | 4.767 | 265000.0 |
| max | -114.31 | 41.95 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In this case we can see that all of the column counts are the same. That lets us know that every data point has a value. This can sometimes give you a false sense of security because many datasets have default values instead of empty values.
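
As a minimal sketch of that caveat, the snippet below counts true missing values and also counts how many rows sit exactly at each column's maximum, which can hint at a cap or default standing in for a real value. The small `sample_df` is a hypothetical stand-in for the housing data, and treating a repeated maximum as suspicious is an assumption to verify against the data description, not an established fact about this dataset.

```python
import pandas as pd

# Hypothetical miniature of the housing data; the values are illustrative only.
sample_df = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, 720.0, 1501.0],
    "median_house_value": [66900.0, 500001.0, 85700.0, 500001.0],
})

# Count true missing values (NaN) per column.
missing_per_column = sample_df.isnull().sum()

# Count how many rows sit exactly at each column's maximum. A large share
# can hint at a cap or default value rather than a measured one.
at_max_per_column = (sample_df == sample_df.max()).sum()

print(missing_per_column)
print(at_max_per_column)
```

The same two lines could be run against `housing_df` to see whether any column has an unusually crowded maximum.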

Looking at the min and max can be helpful too. Does a minimum `total_rooms` value of 2 for a block match your mental model of what a block is?

As you probe a dataset you should ask yourself questions like this. When something doesn't look right, investigate it.

We can also identify the column of data that contains our target value. In this case we want to predict home values, so we will use `median_house_value` as our target.

Let's *imagine* that through more data analysis we decide that we'll use `total_rooms` as the feature that will be used to predict the home value.

It is also a good idea to take a look at the actual data. We can use Pandas' `head` and `tail` methods to do this.

In [4]:
housing_df.head(20)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |
| 5 | -114.58 | 33.63 | 29.0 | 1387.0 | 236.0 | 671.0 | 239.0 | 3.3438 | 74000.0 |
| 6 | -114.58 | 33.61 | 25.0 | 2907.0 | 680.0 | 1841.0 | 633.0 | 2.6768 | 82400.0 |
| 7 | -114.59 | 34.83 | 41.0 | 812.0 | 168.0 | 375.0 | 158.0 | 1.7083 | 48500.0 |
| 8 | -114.59 | 33.61 | 34.0 | 4789.0 | 1175.0 | 3134.0 | 1056.0 | 2.1782 | 58400.0 |
| 9 | -114.6 | 34.83 | 46.0 | 1497.0 | 309.0 | 787.0 | 271.0 | 2.1908 | 48100.0 |


In [5]:
housing_df.tail(20)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 16980 | -124.18 | 40.78 | 34.0 | 1592.0 | 364.0 | 950.0 | 317.0 | 2.1607 | 67000.0 |
| 16981 | -124.18 | 40.78 | 33.0 | 1076.0 | 222.0 | 656.0 | 236.0 | 2.5096 | 72200.0 |
| 16982 | -124.18 | 40.62 | 35.0 | 952.0 | 178.0 | 480.0 | 179.0 | 3.0536 | 107000.0 |
| 16983 | -124.19 | 41.78 | 15.0 | 3140.0 | 714.0 | 1645.0 | 640.0 | 1.6654 | 74600.0 |
| 16984 | -124.19 | 40.78 | 37.0 | 1371.0 | 319.0 | 640.0 | 260.0 | 1.8242 | 70000.0 |
| 16985 | -124.19 | 40.77 | 30.0 | 2975.0 | 634.0 | 1367.0 | 583.0 | 2.442 | 69000.0 |
| 16986 | -124.19 | 40.73 | 21.0 | 5694.0 | 1056.0 | 2907.0 | 972.0 | 3.5363 | 90100.0 |
| 16987 | -124.21 | 41.77 | 17.0 | 3461.0 | 722.0 | 1947.0 | 647.0 | 2.5795 | 68400.0 |
| 16988 | -124.21 | 41.75 | 20.0 | 3810.0 | 787.0 | 1993.0 | 721.0 | 2.0074 | 66900.0 |
| 16989 | -124.21 | 40.75 | 32.0 | 1218.0 | 331.0 | 620.0 | 268.0 | 1.6528 | 58100.0 |


Did you gain any insight from peeking at the actual data? Is the data sorted in a manner that might lead to a bad model?

In this case the data seems to be sorted by descending longitude (from about -114 in the first rows to about -124 in the last rows shown) and possibly secondarily by descending latitude. We need to consider this when sampling or splitting the data.
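
Rather than judging a sort order by eye, a suspected ordering can be checked directly. The sketch below uses the first few longitude values from the `head` output and asks Pandas whether the series is monotonic in either direction. The variable names are hypothetical, and a monotonic sample of six rows is only a hint about the full column.

```python
import pandas as pd

# First few longitude values as they appear in the head() output above.
longitude = pd.Series([-114.31, -114.47, -114.56, -114.57, -114.57, -114.58])

# Each property is True when the values never move in the other direction
# (ties are allowed).
is_increasing = bool(longitude.is_monotonic_increasing)
is_decreasing = bool(longitude.is_monotonic_decreasing)

print("never decreases:", is_increasing)
print("never increases:", is_decreasing)
```

Running the same check on `housing_df['longitude']` before and after shuffling is one way to confirm that the ordering is gone.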

### Prepare the data

A considerable amount of time is spent working with the dataset when creating a machine learning solution. In this case, we have looked at the data and it actually seems to be relatively clean.

The largest problem that we've seen is that there is an obvious sorting order to the data. To ensure that the sorting doesn't bite us later on, we should go ahead and randomize it now. A way to do this built into Pandas is to just create a 100% sample of the `DataFrame` in place of the original `DataFrame`.

The scale of the data across columns is also considerably different. It is often useful to normalize this data before feeding it to machine learning algorithms. We'll not do that now though since our intent for this lab is to build a simple linear regression model with one feature.

In [6]:
housing_df = housing_df.sample(frac=1)

housing_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 8575 | -118.51 | 34.0 | 52.0 | 1241.0 | 502.0 | 679.0 | 459.0 | 2.3098 | 500001.0 |
| 14976 | -122.24 | 37.81 | 52.0 | 2513.0 | 502.0 | 1048.0 | 518.0 | 3.675 | 269900.0 |
| 9895 | -119.72 | 34.42 | 49.0 | 1610.0 | 370.0 | 961.0 | 351.0 | 2.6983 | 260100.0 |
| 3617 | -117.91 | 33.61 | 36.0 | 3082.0 | 455.0 | 771.0 | 365.0 | 11.216 | 500001.0 |
| 9162 | -119.03 | 35.41 | 37.0 | 1761.0 | 443.0 | 911.0 | 365.0 | 2.0331 | 53200.0 |
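
We skip normalization in this lab, but for reference here is a minimal sketch of one common approach, z-score scaling, applied to two columns whose scales differ widely. The four rows are copied from the shuffled output above. Choosing z-scores over other schemes such as min-max scaling is an assumption made for illustration.

```python
import pandas as pd

# Four rows of two columns, copied from the shuffled output above.
sample_df = pd.DataFrame({
    "total_rooms": [1241.0, 2513.0, 1610.0, 3082.0],
    "median_income": [2.3098, 3.675, 2.6983, 11.216],
})

# Z-score: subtract each column's mean and divide by its standard deviation.
normalized_df = (sample_df - sample_df.mean()) / sample_df.std()

# Each column now has a mean near 0 and a standard deviation near 1.
print(normalized_df.mean())
print(normalized_df.std())
```

In a real pipeline the mean and standard deviation would typically be computed on the training split only and then reused for the test split, so that no information from the test data leaks into training.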


### Train/Test Split

We want to go ahead and divide our data into testing and training splits. For this example we'll hold out 20% of the data for testing. Since the data is already shuffled, we can just take the first 20% and set it aside for testing and then take the final 80% and use it for training.

In [7]:
test_set_size = int(len(housing_df) * 0.2)

testing_df = housing_df[:test_set_size]
training_df = housing_df[test_set_size:]

print("Holding out {} records for testing. Using {} records for training.".format(len(testing_df), len(training_df)))

Holding out 3400 records for testing. Using 13600 records for training.


### Translating DataFrames to Datasets

`DataFrame` is a container for a dataset in Pandas. To process the data with TensorFlow we need to get the data in the `DataFrame` into a TensorFlow [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).

Since our housing data fits in memory, we can use the `from_tensor_slices` class method to create our `Dataset`.

In [8]:
from tensorflow.data import Dataset

testing_ds = Dataset.from_tensor_slices(testing_df)
training_ds = Dataset.from_tensor_slices(training_df)

testing_ds, training_ds

(<DatasetV1Adapter shapes: (9,), types: tf.float64>,
 <DatasetV1Adapter shapes: (9,), types: tf.float64>)

The code above runs, but did it work? We can see that the shape is (9,), which tells us that the datasets have 9 columns and an unknown number of rows. The nine columns fit with our expectations, but it would be nice to know that our row counts are the same.

Intuitively you'd think this would be as simple as asking for the length of the data sets from Python:

```
 len(testing_ds)
 len(training_ds)
```

This won't work though. Remember that TensorFlow is just building a graph of things to run, but hasn't executed any of our graph yet. To do that we must create a session.

You also can't just ask for the count of rows in the dataset from the dataset itself. Why is this? The dataset doesn't necessarily know and it could be a very expensive operation.

The `Dataset` object can represent in-memory data, like what we have now. It can also represent data in multiple sources stored in different locations. It can even represent a stream of data that is never-ending.

Because of this we need to do a little more work to get a count of the data in a TensorFlow `Dataset`. To get a count we'll use the `reduce` operation. This operation takes an initial value, in our case 0, and then applies a function once for each row in the dataset. The function receives the running result and the current row and returns the new running result, which is passed along to the next row. In our case it ignores the row and adds one. The value returned after the last row is the final count.

We can see below that the `reduce` operation counts the number of rows for the testing and training dataset and they both match the values we saw above in the Colab.

In [9]:
import numpy as np
import tensorflow as tf

session = tf.Session()

testing_ds_count = testing_ds.reduce(np.int64(0), lambda x, _: x + 1)
training_ds_count = training_ds.reduce(np.int64(0), lambda x, _: x + 1)

print(testing_ds_count)
print(training_ds_count)

print(session.run([testing_ds_count, training_ds_count]))

session.close()

Tensor("ReduceDataset:0", shape=(), dtype=int64)
Tensor("ReduceDataset_1:0", shape=(), dtype=int64)
[3400, 13600]
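
The same counting idea can be tried outside TensorFlow. This is a minimal sketch using Python's built-in `functools.reduce` on a generator, which, like a streaming `Dataset`, has no length to ask for. The helper name `count_rows` is hypothetical. Note that the argument order differs from `Dataset.reduce`: here the function comes first and the initial value last.

```python
from functools import reduce


def count_rows(rows):
    """Count items in any iterable without calling len()."""
    # Start at 0 and add one per row; the row itself is ignored.
    return reduce(lambda count, _row: count + 1, rows, 0)


# A generator has no len(), much like a streaming Dataset.
row_stream = (row for row in range(3400))
print(count_rows(row_stream))
```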


## Build and Train the Model

In this section, we'll build a model to try to predict `median_house_value`, which will be our label (often called a target). We'll use `total_rooms` as our input feature.
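
With a single feature, a linear regression model amounts to a line, `label = weight * feature + bias`. As a point of reference, the sketch below fits such a line in closed form using ordinary least squares in plain Python. This is not how the TensorFlow estimator works (it uses an iterative optimizer), and the function name and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (weight, bias)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # The slope is covariance(x, y) divided by variance(x).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    weight = covariance / variance
    bias = mean_y - weight * mean_x
    return weight, bias


# Hypothetical points that lie exactly on the line y = 2x + 1.
weight, bias = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(weight, bias)
```

A closed-form fit like this can serve as a sanity check on the weight and bias that an iteratively trained model arrives at.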

### LinearRegressor

To train our model, we'll use the [LinearRegressor](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor) interface provided by the TensorFlow [Estimator](https://www.tensorflow.org/get_started/estimator) API. This API takes care of a lot of the low-level model plumbing, and exposes convenient methods for performing model training, evaluation, and inference.

Though the `LinearRegressor` has many configuration options, [only feature columns have to be specified when the regressor is created](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor#__init__).

We provide the regressor [feature columns](https://www.tensorflow.org/guide/feature_columns) as a list of columns that we'd like the model to use for training and prediction.

In [10]:
import tensorflow as tf

housing_features = [tf.feature_column.numeric_column("total_rooms")]

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
)

W0809 19:37:13.136689 4742723008 estimator.py:1811] Using temporary folder as model directory: /var/folders/0n/ctf3gvvx57z27lg3_l0nbxzh0000gn/T/tmptthqhkvq


### Input Function

The LinearRegressor that we just created is still not trained. To train the model we need to call the [train](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor#train) method and pass it an input function that feeds the regressor data.

The input function is responsible for creating the TensorFlow [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset). Let's look at a basic input function below.

In [11]:
import tensorflow as tf

from tensorflow.data import Dataset

def training_input():
  # First, we extract the features that we want to use to
  # train the model. In this case we are using the total_rooms
  # series from our housing data DataFrame.
  features = {
    'total_rooms': training_df['total_rooms'],
  }
  
  # Next we extract our labels (also called targets) from
  # the housing data DataFrame.
  labels = training_df['median_house_value']

  # We now create a TensorFlow Dataset object using the features
  # and labels.
  training_ds = Dataset.from_tensor_slices((features,labels))

  # We tell the Dataset to shuffle the order of the rows of data
  # passed to TensorFlow. We already shuffled the data once in
  # Pandas in order to create a training and testing set. We are
  # shuffling again because the data will be fed to TensorFlow
  # multiple times in batches. Shuffling adds some randomness
  # between batches.
  training_ds = training_ds.shuffle(buffer_size=10000)

  # We set the batch size. This will be the number of rows of
  # data that TensorFlow will operate on in each step of the
  # optimization.
  training_ds = training_ds.batch(100)

  # We now tell the Dataset to feed the entire training set five
  # times to the model.
  training_ds = training_ds.repeat(5)

  # And finally we return the Dataset to TensorFlow so that
  # the model can be trained.
  return training_ds
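
To get a feel for how much work this input function hands to the model, the arithmetic can be sketched in plain Python. With 13,600 training rows, batches of 100, and 5 repeats, the model sees 136 batches per pass. The helper below is hypothetical, and it assumes the default `batch` behavior of keeping a final partial batch.

```python
import math


def steps_for(num_rows, batch_size, repeats):
    """Number of batches produced by batch(batch_size) then repeat(repeats)."""
    # A final partial batch is kept, hence the ceiling.
    batches_per_pass = math.ceil(num_rows / batch_size)
    return batches_per_pass * repeats


# 136 batches per pass, repeated for 5 passes.
print(steps_for(13600, 100, 5))
```

Because `batch` is applied before `repeat`, each pass over the data ends with its own batch boundary rather than mixing rows from two passes into one batch.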

### Train

At this point training is as easy as calling the `train` method on the regressor and passing it the input function that we defined.

In [12]:
# Train the model
linear_regressor.train(
 input_fn=training_input,
)

W0809 19:37:13.173047 4742723008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0809 19:37:13.554797 4742723008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/canned/linear.py:308: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0809 19:37:13.717718 4742723008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1354: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is depr

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressor at 0x1357d7fd0>

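The output above shows only deprecation warnings because TensorFlow's default log level hides progress messages. If you raise the verbosity, for example with `tf.logging.set_verbosity(tf.logging.INFO)`, the regressor reports the loss periodically as training progresses. That output can be useful later on when we tweak the learning rate.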

## Evaluate the Model

We have built and trained a `LinearRegressor`. Let's now use our regressor to make predictions about our test data and see how accurate it is.

### Input Function

We need a way to get the features that we'll be using for testing into our model for predictions. To do this we'll create an input function similar to the one above that we created for training.

You'll notice that the input function for prediction is much simpler than that for training. We simply need to create a `Dataset` containing the features that we'd like to use for prediction.

In [13]:
def testing_input():
  # Extract the features that we'd like to use for
  # prediction from our Pandas DataFrame.
  features = {
    'total_rooms': testing_df['total_rooms'],
  }

  # Create a TensorFlow Dataset of those features.
  testing_ds = Dataset.from_tensor_slices(features)

  # Set the batch size. The exact value isn't too
  # important here since we aren't training and only
  # need to send each row of data to TensorFlow once.
  # Batch size is a required setting, so we just set
  # it to one for this case.
  testing_ds = testing_ds.batch(1)

  return testing_ds

### Make Predictions

Now we need to make predictions using our test features. To do that we pass our testing input function to the `predict` method on our trained linear regressor.

In [14]:
predictions_node = linear_regressor.predict(
  input_fn=testing_input,
)
predictions_node

<generator object Estimator.predict at 0x13672d258>

That runs pretty fast... almost suspiciously fast. The reason is that the model isn't actually making predictions at this point. The `predict` method returns a Python generator, so no predictions are computed until we iterate over it.
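
The deferred behavior is easy to see with an ordinary Python generator. The sketch below uses a hypothetical `fake_predict` function as a stand-in for a real predict call. It only mimics the shape of the returned items and does not involve TensorFlow at all.

```python
def fake_predict(features):
    """Hypothetical stand-in for a predict call; yields one result at a time."""
    for value in features:
        print("computing prediction for", value)
        yield {"predictions": [value * 2.0]}


# Creating the generator prints nothing: no work has happened yet.
results = fake_predict([1.0, 2.0, 3.0])

# Iterating over the generator is what triggers the work.
values = [item["predictions"][0] for item in results]
print(values)
```

A generator can be consumed only once, so the results need to be stored (as in the list above) if they will be used more than one time.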

Let's go ahead and get the predictions and put them in a NumPy array so that we can calculate our error.

In [15]:
predictions = np.array([item['predictions'][0] for item in predictions_node])
print("Our predictions: ", predictions)

W0809 19:37:15.182476 4742723008 deprecation.py:323] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


Our predictions:  [ 12039.962  24370.56   15616.998 ... 167839.8    20308.83   30409.84 ]


### Evaluate Model

Now that we have predictions we can compare them to our actual values and evaluate the quality of our model.

In [16]:
import math

from sklearn import metrics

mean_squared_error = metrics.mean_squared_error(testing_df['median_house_value'], predictions)
print("Mean Squared Error (on test data): %0.3f" % mean_squared_error)

root_mean_squared_error = math.sqrt(mean_squared_error)
print("Root Mean Squared Error (on test data): %0.3f" % root_mean_squared_error)

Mean Squared Error (on test data): 45409461180.495
Root Mean Squared Error (on test data): 213094.958
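
To make the metric concrete, here is a minimal plain-Python sketch of the calculation: square each error, average the squares, and take the square root. The function name `rmse` and the sample numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: errors of 10, 10, and 30.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single error of 30 contributes nine times as much as each error of 10. This is why RMSE is sensitive to large outliers, and it is expressed in the same units as the label.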


# Exercises

## Exercise 1

TensorFlow offers a variety of optimizers. We accepted the default in our example above. In this exercise we'll choose our own optimizer.

1. Check out the documentation for the [GradientDescentOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer).
1. Create an instance of `GradientDescentOptimizer` with a learning rate of 0.0000001 in the code block below.
1. Wrap the optimizer with a call to [tf.contrib.estimator.clip_gradients_by_norm](https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm) with a clip norm of 5.0.
1. Create a new `LinearRegressor`, passing it your new optimizer.

Is your root mean squared error better with this new optimizer?
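
For intuition about what clipping by norm does, here is a minimal plain-Python sketch. It treats all gradients as one vector and scales them down together when their combined L2 norm exceeds the limit. Whether the TensorFlow wrapper clips exactly this way is an assumption here, and the function and values are hypothetical.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Scale gradients down so their combined L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for g in gradients))
    if global_norm <= clip_norm:
        # Already within the limit; leave the gradients unchanged.
        return list(gradients)
    scale = clip_norm / global_norm
    return [g * scale for g in gradients]


# Hypothetical gradients with norm 10, clipped to norm 5.
print(clip_by_global_norm([6.0, 8.0], clip_norm=5.0))
```

Clipping keeps the direction of the update but limits its size, which can help when unscaled features such as `total_rooms` produce very large gradients.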

### Student Solution

In [17]:
# Create GradientDescentOptimizer
gd_optimizer = tf.train.GradientDescentOptimizer(learning_rate= 0.0000001)
# Wrap the optimizer with clip norm of 5
gd_optimizer = tf.contrib.estimator.clip_gradients_by_norm(
    gd_optimizer, clip_norm = 5.0)


# Use the gd optimizer in the model
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
    optimizer=gd_optimizer
)

# Train the model
linear_regressor.train(
 input_fn=training_input,
)

# Make predictions
predictions_node = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions_node])

# Find the RMSE
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(predictions, testing_df['median_house_value']))
root_mean_squared_error

W0809 19:37:18.505834 4742723008 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0809 19:37:18.510181 4742723008 estimator.py:1811] Using temporary folder as model directory: /var/folders/0n/ctf3gvvx57z27lg3_l0nbxzh0000gn/T/tmpzgmrv_ok


235544.33021183943

## Exercise 2

In this exercise we will build a model using a different feature. Choose a feature, say `housing_median_age`, and use it in place of `total_rooms`.

To do this you will need to:

1. Create a new training input function that uses the alternative feature
1. Create a new testing input function that uses the alternative feature
1. Create a `LinearRegressor`
1. Train the model
1. Make predictions
1. Measure RMSE

### Student Solution

In [18]:
# Training input function that uses the housing_median_age feature
def training_input():
  features = {
    'housing_median_age': training_df['housing_median_age']
  }
  labels = training_df['median_house_value']
  training_ds = Dataset.from_tensor_slices((features,labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(5)
  return training_ds

# Testing input function that uses the housing_median_age feature
def testing_input():
  features = {
    'housing_median_age': testing_df['housing_median_age']
  }
  testing_ds = Dataset.from_tensor_slices(features)
  testing_ds = testing_ds.batch(1)
  return testing_ds

# Define housing features
housing_features = [tf.feature_column.numeric_column("housing_median_age")]

# Create linear regression model
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
    optimizer=tf.train.FtrlOptimizer(
      learning_rate=1,
      l1_regularization_strength=0.01,
      l2_regularization_strength=0.02
    )
)

# Train the model
linear_regressor.train(
 input_fn=training_input,
)

# Make predictions
predictions_node = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions_node])

# Find the RMSE
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(predictions, testing_df['median_house_value']))
root_mean_squared_error

W0809 19:37:21.363716 4742723008 estimator.py:1811] Using temporary folder as model directory: /var/folders/0n/ctf3gvvx57z27lg3_l0nbxzh0000gn/T/tmpsdoh07la


234211.66946925508

## Exercise 3

In this exercise we will build a model using multiple features. Choose a group of features and then:

1. Create a new training input function that uses the multiple features
1. Create a new testing input function that uses the multiple features
1. Create a `LinearRegressor`
1. Train the model
1. Make predictions
1. Measure RMSE

### Student Solution

In [19]:
# Training input function that uses the multiple features
def training_input():
  features = {
    'housing_median_age': training_df['housing_median_age'],
    'population': training_df['population'],
    'households': training_df['households'],
    'median_income': training_df['median_income']
  }
  labels = training_df['median_house_value']
  training_ds = Dataset.from_tensor_slices((features,labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(5)
  return training_ds

# Testing input function that uses the multiple features
def testing_input():
  features = {
    'housing_median_age': testing_df['housing_median_age'],
    'population': testing_df['population'],
    'households': testing_df['households'],
    'median_income': testing_df['median_income']
  }
  testing_ds = Dataset.from_tensor_slices(features)
  testing_ds = testing_ds.batch(1)
  return testing_ds

# Housing features to pass to ML model
housing_features = [tf.feature_column.numeric_column("housing_median_age"),
                   tf.feature_column.numeric_column("population"),
                   tf.feature_column.numeric_column("households"),
                   tf.feature_column.numeric_column("median_income")] 

# Create Linear regression model 
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=housing_features,
    optimizer= tf.train.RMSPropOptimizer(
      learning_rate=0.1,
      decay = 0.8,
      momentum=0.0,
    )
)

# Train the model
linear_regressor.train(
 input_fn=training_input,
)

# Make predictions
predictions = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions])

# Find the RMSE
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(predictions, testing_df['median_house_value']))
root_mean_squared_error

W0809 19:37:24.555346 4742723008 estimator.py:1811] Using temporary folder as model directory: /var/folders/0n/ctf3gvvx57z27lg3_l0nbxzh0000gn/T/tmpox6yr0st
W0809 19:37:24.919340 4742723008 deprecation.py:506] From /Users/dorishuang/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/rmsprop.py:119: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


171621.4907717866

## Exercise 4: Challenge (Ungraded)

Given the [Kaggle Black Friday Sales](https://www.kaggle.com/mehdidag/black-friday) dataset use a TensorFlow `LinearRegressor` to predict the total price of the purchases for the day. This is the `sum(purchase)` for a shopper.

Features can include some combination of their **age**, **gender**, **occupation**, **city_category**, **stay_in_current_city_years**, and **marital_status**. Product and category data should not be used.

The data should be grouped by user for analysis.
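
One possible way to group by user is sketched below with Pandas `groupby`. It keeps the first value of each per-user attribute and sums the purchases. The rows are a handful copied from the first records of the dataset, and the choice of `first` for the attributes assumes they do not change between a user's purchases.

```python
import pandas as pd

# A few rows copied from the first records of the Black Friday data.
purchases_df = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000001, 1000001,
                1000002, 1000004, 1000004, 1000004],
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Occupation": [10, 10, 10, 10, 16, 7, 7, 7],
    "Marital_Status": [0, 0, 0, 0, 0, 1, 1, 1],
    "Purchase": [8370, 15200, 1422, 1057, 7969, 19215, 15854, 15686],
})

# One row per user: keep the user's attributes and total their purchases.
per_user_df = purchases_df.groupby("User_ID", as_index=False).agg({
    "Gender": "first",
    "Occupation": "first",
    "Marital_Status": "first",
    "Purchase": "sum",
})

print(per_user_df)
```

The resulting `Purchase` column is the `sum(purchase)` per shopper that the exercise asks the model to predict.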

Play with different optimizers, model settings, and data parameters (batch size, repeat, etc.) to achieve the lowest RMSE that you can.

Work will likely include:

* Loading the data into Colab
* Examining the data quality
* Aggregating the data by user
* Examining the data
* Test/train split
* Building an input function for training
* Building an input function for testing
* Train model
* Make predictions
* Measure RMSE
* Iterate

### Student Solution

In [21]:
# Your code goes here
import pandas as pd

filename = './BlackFriday.csv'
blackfriday_df = pd.read_csv(filename)


In [22]:
blackfriday_df.describe()

| | User_ID | Occupation | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|
| count | 537577.0 | 537577.0 | 537577.0 | 537577.0 | 370591.0 | 164278.0 | 537577.0 |
| mean | 1002992.0 | 8.08271 | 0.408797 | 5.295546 | 9.842144 | 12.66984 | 9333.859853 |
| std | 1714.393 | 6.52412 | 0.491612 | 3.750701 | 5.087259 | 4.124341 | 4981.022133 |
| min | 1000001.0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 | 185.0 |
| 25% | 1001495.0 | 2.0 | 0.0 | 1.0 | 5.0 | 9.0 | 5866.0 |
| 50% | 1003031.0 | 7.0 | 0.0 | 5.0 | 9.0 | 14.0 | 8062.0 |
| 75% | 1004417.0 | 14.0 | 1.0 | 8.0 | 15.0 | 16.0 | 12073.0 |
| max | 1006040.0 | 20.0 | 1.0 | 18.0 | 18.0 | 18.0 | 23961.0 |


In [23]:
blackfriday_df.dtypes

User_ID                         int64
Product_ID                     object
Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years     object
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
dtype: object
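
The `object` columns above hold strings, which a numeric feature column cannot consume directly. One possible approach, sketched here as an assumption rather than the required solution, is to one-hot encode them in Pandas before building the input functions. The small `users_df` is hypothetical but uses category values that appear in the data.

```python
import pandas as pd

# Hypothetical rows using category values that appear in the dataset.
users_df = pd.DataFrame({
    "Gender": ["F", "M", "M"],
    "City_Category": ["A", "C", "B"],
    "Occupation": [10, 16, 7],
})

# Replace each string column with one 0/1 column per distinct value.
encoded_df = pd.get_dummies(
    users_df, columns=["Gender", "City_Category"], dtype=float)

print(encoded_df.columns.tolist())
```

Each resulting column is numeric, so it could be passed to the model like any other numeric feature.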

In [24]:
blackfriday_df.head(20)

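| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |
| 5 | 1000003 | P00193542 | M | 26-35 | 15 | A | 3 | 0 | 1 | 2.0 | NaN | 15227 |
| 6 | 1000004 | P00184942 | M | 46-50 | 7 | B | 2 | 1 | 1 | 8.0 | 17.0 | 19215 |
| 7 | 1000004 | P00346142 | M | 46-50 | 7 | B | 2 | 1 | 1 | 15.0 | NaN | 15854 |
| 8 | 1000004 | P0097242 | M | 46-50 | 7 | B | 2 | 1 | 1 | 16.0 | NaN | 15686 |
| 9 | 1000005 | P00274942 | M | 26-35 | 20 | A | 1 | 1 | 8 | NaN | NaN | 7871 |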


In [25]:
blackfriday_df.tail(20)

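| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 537557 | 1004736 | P00175242 | M | 18-25 | 20 | A | 1 | 1 | 2 | NaN | NaN | 12724 |
| 537558 | 1004736 | P00101942 | M | 18-25 | 20 | A | 1 | 1 | 8 | 17.0 | NaN | 7796 |
| 537559 | 1004736 | P00109142 | M | 18-25 | 20 | A | 1 | 1 | 8 | 17.0 | NaN | 7770 |
| 537560 | 1004736 | P00084842 | M | 18-25 | 20 | A | 1 | 1 | 8 | 16.0 | NaN | 5940 |
| 537561 | 1004736 | P00078142 | M | 18-25 | 20 | A | 1 | 1 | 8 | 16.0 | NaN | 7834 |
| 537562 | 1004736 | P00146742 | M | 18-25 | 20 | A | 1 | 1 | 1 | 13.0 | 14.0 | 11508 |
| 537563 | 1004736 | P00154642 | M | 18-25 | 20 | A | 1 | 1 | 8 | NaN | NaN | 6074 |
| 537564 | 1004736 | P00117442 | M | 18-25 | 20 | A | 1 | 1 | 5 | 14.0 | NaN | 7084 |
| 537565 | 1004736 | P00051142 | M | 18-25 | 20 | A | 1 | 1 | 8 | NaN | NaN | 7934 |
| 537566 | 1004736 | P00048742 | M | 18-25 | 20 | A | 1 | 1 | 5 | NaN | NaN | 5350 |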


In [26]:
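# Fill in NaN in product categories. fillna returns a new DataFrame,
# so assign the result back before displaying it.
blackfriday_df = blackfriday_df.fillna(0)
blackfriday_df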

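| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | 0.0 | 0.0 | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | 0.0 | 0.0 | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | 0.0 | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | 0.0 | 0.0 | 7969 |
| 5 | 1000003 | P00193542 | M | 26-35 | 15 | A | 3 | 0 | 1 | 2.0 | 0.0 | 15227 |
| 6 | 1000004 | P00184942 | M | 46-50 | 7 | B | 2 | 1 | 1 | 8.0 | 17.0 | 19215 |
| 7 | 1000004 | P00346142 | M | 46-50 | 7 | B | 2 | 1 | 1 | 15.0 | 0.0 | 15854 |
| 8 | 1000004 | P0097242 | M | 46-50 | 7 | B | 2 | 1 | 1 | 16.0 | 0.0 | 15686 |
| 9 | 1000005 | P00274942 | M | 26-35 | 20 | A | 1 | 1 | 8 | 0.0 | 0.0 | 7871 |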


In [27]:
test_set_size = int(len(blackfriday_df) * 0.2)

testing_df = blackfriday_df[:test_set_size]
training_df = blackfriday_df[test_set_size:]

print("Holding out {} records for testing. Using {} records for training.".format(len(testing_df), len(training_df)))

Holding out 107515 records for testing. Using 430062 records for training.


In [28]:



def training_input():
  features = {
#     'Age': training_df['Age'],
#     'Gender': training_df['Gender'],
    'Occupation': training_df['Occupation'],
#     'Stay_In_Current_City_Years': training_df['Stay_In_Current_City_Years'],
    'Marital_Status': training_df['Marital_Status'],
  }
  labels = training_df['Purchase']
  training_ds = Dataset.from_tensor_slices((features,labels))
  training_ds = training_ds.shuffle(buffer_size=10000)
  training_ds = training_ds.batch(100)
  training_ds = training_ds.repeat(2)
  return training_ds


def testing_input():
  features = {
#     'Age': testing_df['Age'],
#     'Gender': testing_df['Gender'],
    'Occupation': testing_df['Occupation'],
#     'Stay_In_Current_City_Years': testing_df['Stay_In_Current_City_Years'],
    'Marital_Status': testing_df['Marital_Status'],
  }
  testing_ds = Dataset.from_tensor_slices(features)
  testing_ds = testing_ds.batch(1)
  return testing_ds

blackfriday_features = [
#                    tf.feature_column.indicator_column("Age"),
#                    tf.feature_column.indicator_column("Gender"),
                   tf.feature_column.numeric_column("Occupation"),
#                    tf.feature_column.indicator_column("Stay_In_Current_City_Years"),
                   tf.feature_column.numeric_column("Marital_Status"),
                   ] 

linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=blackfriday_features,
    # TODO: Use a custom optimizer and explore other hyperparameters if you would like 
#     optimizer=tf.train.FtrlOptimizer(
#       learning_rate=0.1,
#       l1_regularization_strength=0.01,
#       l2_regularization_strength=0.2
#     )
)

# Train the model
linear_regressor.train(
 input_fn=training_input,
)

# Make predictions
predictions = linear_regressor.predict(
  input_fn=testing_input,
)

# Convert the predictions to a NumPy array
predictions = np.array([item['predictions'][0] for item in predictions])

# Find the RMSE
root_mean_squared_error = math.sqrt(metrics.mean_squared_error(predictions, testing_df['Purchase']))
root_mean_squared_error

W0809 19:37:43.673061 4742723008 estimator.py:1811] Using temporary folder as model directory: /var/folders/0n/ctf3gvvx57z27lg3_l0nbxzh0000gn/T/tmpg12ssjtr
W0809 19:37:55.435832 4742723008 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 5701 vs previous value: 5701. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.


10247.453345303224