#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Bucketized Features

There are times when we would like to split a continuous feature into multiple features with different learned model weights. We do this by adding a new feature that records which region of a scatterplot a given datapoint falls in, that is, which "bucket" we would place it in.

This is often referred to as discretizing or binning a variable (or dichotomizing it, when there are only two buckets). In TensorFlow parlance it is called [bucketizing columns](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column).
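
To make the idea concrete, here is a minimal pure-Python sketch of what a two-bucket feature looks like to a linear model. The boundary of 10.0 and the weights passed to `linear_prediction` are made-up values for illustration only; they are not learned from any data and are not a hint for the exercises below.

```python
def bucket_one_hot(value, boundary):
    """Return a one-hot pair: [1, 0] below the boundary, [0, 1] at or above it."""
    return [0, 1] if value >= boundary else [1, 0]


def linear_prediction(value, boundary, value_weight, bucket_weights, bias):
    """Linear model over the raw value plus its one-hot bucket feature."""
    one_hot = bucket_one_hot(value, boundary)
    # Only the weight of the active bucket contributes to the prediction.
    bucket_term = sum(w * x for w, x in zip(bucket_weights, one_hot))
    return value_weight * value + bucket_term + bias


# Hypothetical boundary, chosen only to show the mechanics.
BOUNDARY = 10.0
print(bucket_one_hot(4.5, BOUNDARY))   # [1, 0]
print(bucket_one_hot(25.0, BOUNDARY))  # [0, 1]
```

Each bucket gets its own weight, so the model can shift its prediction up or down depending on which side of the boundary a value falls.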

## Overview

### Learning Objectives

* Know when to convert a column from continuous values to discrete values.
* Apply bucketization to a feature.

### Prerequisites

* Linear Regression with TensorFlow

### Estimated Duration

30 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There are 3 exercises in this Colab, so there are 9 points available. The grading scale will be 9 points.

## Setup
Run this code block to set up the exercise ;) 

In [0]:
from tensorflow.data import Dataset
from tensorflow.estimator import LinearRegressor
from tensorflow.feature_column import numeric_column
import pandas as pd

df = pd.DataFrame([
  [26.99, 3.33],
  [3.99, 4.16],
  [5.99, 3.83],
  [4.50, 4.11],
  [17.99, 2.33],
  [36.99, 4.66],
  [34.99, 4.5],
  [1.99, 4.5],
  [2.10, 4.33],
  [18.99, 2.42],
  [7.99, 3.75],
  [25.00, 3.33],
  [29.99, 3.67],
  [30.99, 3.83],
  [6.99, 3.75],
  [9.99, 3.63],
  [19.99, 2.67],
], columns=['price', 'satisfaction'])

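def run_model(feature_columns):
  """Trains a LinearRegressor on df using the given feature columns."""
  def training_input():
    # The input function always supplies the raw price; the feature columns
    # decide how the model uses it.
    features = {
      'price': df['price'],
    }
    labels = df['satisfaction']
    return Dataset.from_tensor_slices((features, labels)).batch(1)
  linear_regressor = LinearRegressor(
      feature_columns=feature_columns,
  )
  linear_regressor.train(
      input_fn=training_input,
  )
  return linear_regressor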


## Motivation

Let's set up an artificial example where we would like to create a model that will predict customer satisfaction of items on our menu based on the cost of the item. The prices are continuous and we would like to divide them into price buckets when training our model.

First let's take a look at our data.

In [0]:
df

Cool. Let's scatter-plot it to see if there are any trends:

In [0]:
import matplotlib.pyplot as plt

plt.plot(df.price, df.satisfaction, '.')

Hmm, looks like there are actually *two* linear trends here! For low-price items, satisfaction goes down as the cost goes up. For higher-price items, however, increasing cost correlates with increasing satisfaction! It's possible we're observing two different effects here: people don't like paying more for appetizers, but are happy to buy expensive entrees.

A normal linear regression would be unable to accurately reflect this behavior, which is where bucketization comes in.

Indeed, a linear model predicting satisfaction from price alone doesn't achieve a great loss:


In [0]:
from tensorflow.feature_column import numeric_column
run_model([
  numeric_column('price')
])


If we read the output, we see that the final loss the model ended up at is around 0.88, almost a whole satisfaction point off! We can do better ;)
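
As a cross-check that doesn't depend on the training run above, the following sketch fits an ordinary least-squares line in closed form using only the standard library. This is not what `LinearRegressor` does internally (the estimator trains iteratively), so the number it prints should not be expected to match the loss reported above; it only shows how well the best single straight line can fit this data.

```python
import math

# Same (price, satisfaction) pairs as the DataFrame in the Setup section.
data = [
    (26.99, 3.33), (3.99, 4.16), (5.99, 3.83), (4.50, 4.11), (17.99, 2.33),
    (36.99, 4.66), (34.99, 4.5), (1.99, 4.5), (2.10, 4.33), (18.99, 2.42),
    (7.99, 3.75), (25.00, 3.33), (29.99, 3.67), (30.99, 3.83), (6.99, 3.75),
    (9.99, 3.63), (19.99, 2.67),
]


def fit_line(points):
    """Closed-form least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(points, slope, intercept):
    """Root-mean-square error of the line on the given points."""
    squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared_errors) / len(points))


slope, intercept = fit_line(data)
print(f"slope={slope:.4f} intercept={intercept:.4f} "
      f"rmse={rmse(data, slope, intercept):.4f}")
```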

## Pros and cons of bucketization

Bucketizing (or dichotomizing) feature columns can actually reduce model accuracy when overused: it prevents the model from discovering trends _across_ bucket boundaries, and the divisions we choose might not have much predictive value. Bucketizing also dampens the influence of outliers and increases bias.

So when should you use this tool? When there is a natural division that you can clearly see or when there is some hidden encoding in the continuous data.

For example, Costco is rumored to hide data in their pricing. Items that end in .99 are regularly priced. Items ending in .97 are discounted by the manufacturer. Items ending in .88 are manager markdowns, and so on.

Given only the pricing data, an analyst who knew these encodings might extract them using bucketization so that the model would have a clearer picture of the categories.
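
One way such an analyst might pull out the hidden category is sketched below. Strictly speaking this extracts the cents as a categorical value rather than bucketing by thresholds, and the mapping simply restates the rumor above; it is unverified, and the function names are hypothetical.

```python
# Meanings of price endings as described in the rumor above (unverified).
ENDING_MEANINGS = {
    99: 'regular price',
    97: 'manufacturer discount',
    88: 'manager markdown',
}


def price_ending(price):
    """Return the cents part of a price as an integer from 0 to 99."""
    # Round to whole cents first to avoid floating-point artifacts.
    return round(price * 100) % 100


def price_category(price):
    """Look up the rumored meaning of a price ending, or 'unknown'."""
    return ENDING_MEANINGS.get(price_ending(price), 'unknown')


for price in [26.99, 12.97, 7.88, 4.50]:
    print(price, price_category(price))
```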

Another use case might be restaurant data where the menu changes at a fixed time each day between meals. If we had sales data for that restaurant we might bucketize the time of purchase to delineate breakfast, lunch, and dinner menus for the model.
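
The restaurant case can be sketched with the standard library's `bisect` module. The menu change times below (11:00 and 16:00) are assumptions made up for this illustration, not values from any real dataset.

```python
from bisect import bisect_right

# Hypothetical menu change times, in hours on a 24-hour clock.
MEAL_BOUNDARIES = [11.0, 16.0]
MEAL_NAMES = ['breakfast', 'lunch', 'dinner']


def meal_bucket(hour):
    """Map an hour of the day to a meal name.

    A purchase made exactly at a boundary hour belongs to the later meal.
    """
    return MEAL_NAMES[bisect_right(MEAL_BOUNDARIES, hour)]


for hour in [8.5, 11.0, 13.75, 19.25]:
    print(hour, meal_bucket(hour))
```

Note that two boundaries produce three buckets: a list of n thresholds always splits the range into n + 1 buckets.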

Can you think of other scenarios where bucketization might be useful? Where might it be dangerous?

# Exercises


## Exercise 1

We'll add a new feature that tells the model whether a price is in the 'low' range or 'high' range. To do this, we first need to figure out where to divide our buckets. Let's look at the scatter plot again:


In [0]:
plt.plot(df.price, df.satisfaction, '.')

At what price does the downward trend stop, and the upward trend start?

### Student Solution

In [0]:
###############
## YOUR ANSWER HERE:

threshold_price = ???????

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO

## Exercise 2

To create a bucketized column, we'll use the TensorFlow function [bucketized_column()](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column). Taking a look at the documentation, we need to plug in two parameters:
- The TensorFlow column we're adding buckets for.
- A _list_ of thresholds around which we'll divide our buckets. In this case, we're only splitting the data into two buckets, so we're only going to have one element in our list.

### Student Solution

In [0]:
from tensorflow.feature_column import bucketized_column

price_feature = numeric_column('price')

###############
## YOUR ANSWER HERE:
bucket_feature = ???????

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO

## Exercise 3

Now, let's train a model that includes both the price feature _and_ our bucketization feature.

In [0]:
run_model(
    [price_feature, bucket_feature]
)

How does the loss on your new bucketized model compare to the simple one we trained earlier? :) 

### Student Solution

*Your answer goes here...*

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO