# Feature Crosses
Continuing from the previous exercise, we will improve our linear regression model by adding more synthetic features.

First, let's define the input and create the data loading code.

In [None]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

In [None]:
def preprocess_features(california_housing_dataframe):
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
  processed_features = selected_features.copy()
  # Create a synthetic feature.
  processed_features["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
  return processed_features

def preprocess_targets(california_housing_dataframe):
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = pd.DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets["median_house_value"] = (
    california_housing_dataframe["median_house_value"] / 1000.0)
  return output_targets

In [None]:
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.describe()

In [None]:
training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.describe()

In [None]:
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

In [None]:
validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()

### Feature engineering

Creating relevant features greatly improves ML models, especially for simple models like regression. We learned in a previous exercise that two (or more) independent features often do not provide as much information as a feature derived from them.

We have already used a synthetic feature in our example: `rooms_per_person`.

We can create simple synthetic features by performing operations on certain columns. However, this may become tedious for complex operations like bucketizing or crossing bucketized features. Feature columns are powerful abstractions that make it easy to add synthetic features.

In [None]:
longitude = tf.contrib.layers.real_valued_column("longitude")
latitude = tf.contrib.layers.real_valued_column("latitude")
housing_median_age = tf.contrib.layers.real_valued_column("housing_median_age")
households = tf.contrib.layers.real_valued_column("households")
median_income = tf.contrib.layers.real_valued_column("median_income")
rooms_per_person = tf.contrib.layers.real_valued_column("rooms_per_person")

feature_columns = set([
    longitude,
    latitude,
    housing_median_age,
    households,
    median_income,
    rooms_per_person])

#### The input function

Previously, we passed data to the estimator using Pandas `DataFrame` objects. A more flexible, but more complex, way to pass data is through the input function.

One particularity of the estimators API is that input functions are responsible for splitting the data into batches, so the `batch_size` arg is ignored when using `input_fn`. The batch size will be determined by the number of rows that the input function returns (see below).

Input functions return [Tensor](https://www.tensorflow.org/versions/master/api_docs/python/framework.html#Tensor) objects, which are the core data types used in TensorFlow. More specifically, input functions must return the following `(features, label)` tuple:
* `features`: A `dict` mapping `string` values (the feature name) to `Tensor` values of shape `(n, 1)` where `n` is the number of data rows (and therefore batch size) returned by the input function.
* `label`: A `Tensor` of shape `(n, 1)`, representing the corresponding labels.
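
To make the shape contract above concrete, here is a minimal sketch that uses nested Python lists as stand-ins for `Tensor` objects. The helper name `to_column_vectors` and the sample values are made up for illustration; the real input function in this exercise builds actual tensors.

```python
def to_column_vectors(rows, label_name):
    """Splits row dicts into ({feature_name: column}, label_column).

    Every column is a list of n single-element lists, i.e. shape (n, 1).
    """
    feature_names = [name for name in rows[0] if name != label_name]
    features = {
        name: [[float(row[name])] for row in rows] for name in feature_names
    }
    labels = [[float(row[label_name])] for row in rows]
    return features, labels


# Made-up sample records, not taken from the data set.
sample_rows = [
    {"median_income": 8.3, "rooms_per_person": 2.7, "median_house_value": 452.6},
    {"median_income": 3.1, "rooms_per_person": 1.9, "median_house_value": 158.7},
    {"median_income": 5.6, "rooms_per_person": 2.2, "median_house_value": 241.4},
]
features, labels = to_column_vectors(sample_rows, "median_house_value")
print(sorted(features.keys()))
print(len(labels), len(labels[0]))
```

Here `n` is 3, so each feature column and the label column have three rows of one value each, which is also the batch size the estimator would see.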

As a side note, input functions usually create a queue that reads the data sequentially, which makes them a necessity if your data is too large to be preloaded into memory. This is an advanced topic that is not covered here.

For simplicity, our function will convert the entire `DataFrame` to a `Tensor`. This means we'll use a batch size of `12000` (and `5000` for validation), which is somewhat on the large side but works fine with our small model. This will make training somewhat slower, but thanks to vector optimizations the performance penalty won't be that bad.

Here's the necessary input function:

In [None]:
def input_function(examples_df, targets_df, single_read=False):
  """Converts a pair of examples/targets `DataFrame`s to `Tensor`s.
  
  The `Tensor`s are reshaped to `(N,1)` where `N` is number of examples in the `DataFrame`s.
  
  Args:
    examples_df: A `DataFrame` that contains the input features. All its columns will be
      transformed into corresponding input feature `Tensor` objects.
    targets_df: A `DataFrame` that contains a single column, the targets corresponding to
      each example in `examples_df`.
    single_read: A `bool` that indicates whether this function should stop after reading
      through the dataset once. If `False`, the function will loop through the data set.
      This stop mechanism is used by the estimator's `predict()` to limit the number of
      values it reads.
  Returns:
    A tuple `(input_features, target_tensor)`:
      input_features: A `dict` mapping string values (the column name of the feature) to
        `Tensor`s (the actual values of the feature).
      target_tensor: A `Tensor` representing the target values.
  """
  features = {}
  for column_name in examples_df.keys():
    batch_tensor = tf.to_float(
        tf.reshape(tf.constant(examples_df[column_name].values), [-1, 1]))
    if single_read:
      features[column_name] = tf.train.limit_epochs(batch_tensor, num_epochs=1)
    else:
      features[column_name] = batch_tensor
  target_tensor = tf.to_float(
      tf.reshape(tf.constant(targets_df[targets_df.keys()[0]].values), [-1, 1]))

  return features, target_tensor

As an example, the code below shows the output of the input function when passed a few sample records from the California housing data set.

This snippet is for illustrative purposes only. It is not required for training the model, but you may find it useful to visualize the effect of various feature crosses.

In [None]:
def sample_from_input_function(input_fn):
  """Returns a few samples from the given input function.
  
  Args:
    input_fn: An input function, that meets the `Estimator`'s contract for
      input functions.
  Returns:
    A `DataFrame` that contains a small number of records that are returned
    by this function.
  """
  
  examples, target = input_fn()
  
  example_samples = {
    name: tf.strided_slice(values, [0, 0], [5, 1]) for name, values in examples.items()
  }
  target_samples = tf.strided_slice(target, [0, 0], [5, 1])
  
  with tf.Session() as sess:
    example_sample_values, target_sample_values = sess.run(
        [example_samples, target_samples])
  
  results = pd.DataFrame()
  for name, values in example_sample_values.items():
    results[name] = pd.Series(values.reshape(-1))
  results['target'] = target_sample_values.reshape(-1)

  return results

In [None]:
samples = sample_from_input_function(
  lambda: input_function(training_examples, training_targets))
samples

### FTRL optimization algorithm

High-dimensional linear models benefit from using a variant of gradient-based optimization called FTRL (Follow The Regularized Leader). This algorithm scales the learning rate differently for different coefficients, which can be useful if some features rarely take non-zero values. It is also well suited to L1 regularization. We can apply FTRL using the [FtrlOptimizer](https://www.tensorflow.org/versions/master/api_docs/python/train.html#FtrlOptimizer).
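
To illustrate what a per-coefficient learning rate means, the sketch below computes a rate of the form `alpha / (beta + sqrt(sum of squared gradients))` for each coordinate, which is the schedule commonly associated with FTRL-style methods. This is only the learning-rate schedule, not the full FTRL update, and the names `alpha` and `beta` and the gradient values are assumptions for illustration. The TensorFlow optimizer's exact parameterization may differ.

```python
import math


def per_coordinate_rates(gradient_history, alpha=1.0, beta=1.0):
    """Returns one learning rate per coordinate from accumulated gradients."""
    num_coords = len(gradient_history[0])
    squared_sums = [0.0] * num_coords
    for gradients in gradient_history:
        for i, g in enumerate(gradients):
            squared_sums[i] += g * g
    return [alpha / (beta + math.sqrt(s)) for s in squared_sums]


# Coordinate 0 receives a gradient on every step; coordinate 1 belongs to a
# feature that is non-zero only once, so it accumulates far less gradient.
history = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
rates = per_coordinate_rates(history)
print(rates)
```

The rarely active coordinate keeps the larger rate, so the few examples that do touch it can still move its weight noticeably.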

In [None]:
def train_model(
    learning_rate,
    steps,
    feature_columns,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets):
  """Trains a linear regression model.
  
  In addition to training, this function also prints training progress information,
  as well as a plot of the training and validation loss over time.
  
  Args:
    learning_rate: A `float`, the learning rate.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    feature_columns: A `set` specifying the input feature columns to use.
    training_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for training.
    training_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for training.
    validation_examples: A `DataFrame` containing one or more columns from
      `california_housing_dataframe` to use as input features for validation.
    validation_targets: A `DataFrame` containing exactly one column from
      `california_housing_dataframe` to use as target for validation.
      
  Returns:
    A `LinearRegressor` object trained on the training data.
  """

  periods = 10
  # Integer division keeps the step count an int on both Python 2 and 3.
  steps_per_period = steps // periods

  # Create a linear regressor object.
  linear_regressor = tf.contrib.learn.LinearRegressor(
      feature_columns=feature_columns,
      optimizer=tf.train.FtrlOptimizer(learning_rate=learning_rate),
      gradient_clip_norm=5.0
  )
  
  training_input_function = lambda: input_function(
      training_examples, training_targets)
  training_input_function_for_predict = lambda: input_function(
      training_examples, training_targets, single_read=True)
  validation_input_function_for_predict = lambda: input_function(
      validation_examples, validation_targets, single_read=True)

  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  print("Training model...")
  print("RMSE (on training data):")
  training_rmse = []
  validation_rmse = []
  for period in range(periods):
    # Train the model, starting from the prior state.
    linear_regressor.fit(
        input_fn=training_input_function,
        steps=steps_per_period
    )
    # Take a break and compute predictions.
    training_predictions = list(linear_regressor.predict(
        input_fn=training_input_function_for_predict))
    validation_predictions = list(linear_regressor.predict(
        input_fn=validation_input_function_for_predict))
    # Compute training and validation loss.
    training_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(training_predictions, training_targets))
    validation_root_mean_squared_error = math.sqrt(
        metrics.mean_squared_error(validation_predictions, validation_targets))
    # Occasionally print the current loss.
    print("  period %02d : %0.2f" % (period, training_root_mean_squared_error))
    # Add the loss metrics from this period to our list.
    training_rmse.append(training_root_mean_squared_error)
    validation_rmse.append(validation_root_mean_squared_error)
  print("Model training finished.")

  
  # Output a graph of loss metrics over periods.
  plt.ylabel("RMSE")
  plt.xlabel("Periods")
  plt.title("Root Mean Squared Error vs. Periods")
  plt.tight_layout()
  plt.plot(training_rmse, label="training")
  plt.plot(validation_rmse, label="validation")
  plt.legend()

  return linear_regressor

In [None]:
_ = train_model(
    learning_rate=1.0,
    steps=500,
    feature_columns=feature_columns,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

### One-hot encoding for discrete features

Discrete features (such as strings, enumerations, and integers) are usually converted into families of binary features before training a linear or logistic regression model.

For example, suppose we created a synthetic feature that can take any of the values `0`, `1` or `2`, and that we have a few training points:

| # | feature_value |
|---|---------------|
| 0 |             2 |
| 1 |             0 |
| 2 |             1 |

For each possible categorical value, we make a new **binary** feature of **real values** that can take one of just two possible values: 1.0 if the example has that value, and 0.0 if not. In the example above, the categorical feature would be converted into three features, and the training points now look like:

| # | feature_value_0 | feature_value_1 | feature_value_2 |
|---|-----------------|-----------------|-----------------|
| 0 |             0.0 |             0.0 |             1.0 |
| 1 |             1.0 |             0.0 |             0.0 |
| 2 |             0.0 |             1.0 |             0.0 |
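
The conversion shown in the two tables can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical; in the exercise itself the feature columns perform this step for you.

```python
def one_hot(values, num_categories):
    """Encodes each integer category as a list of 0.0/1.0 indicator values."""
    encoded = []
    for value in values:
        row = [0.0] * num_categories
        row[value] = 1.0  # Switch on the indicator for this category.
        encoded.append(row)
    return encoded


# The three training points from the table above.
encoded = one_hot([2, 0, 1], num_categories=3)
for row in encoded:
    print(row)
```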

### Bucketized (binned) features

Bucketization is also known as binning.

We can bucketize `population` into the following 3 buckets (for instance):
- `bucket_0` (`< 5000`): corresponding to less populated blocks
- `bucket_1` (`5000 - 25000`): corresponding to moderately populated blocks
- `bucket_2` (`> 25000`): corresponding to highly populated blocks

Given the preceding bucket definitions, the following `population` vector:

    [[10001], [42004], [2500], [18000]]

becomes the following bucketized feature vector:

    [[1], [2], [0], [1]]

The feature values are now the bucket indices. Note that these indices are considered to be discrete features. Typically, these will be further converted into one-hot representations as above, but this is done transparently.
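
The `population` example above can be reproduced with the standard library's `bisect` module. This is a minimal sketch using flat lists rather than column vectors. The helper name `bucketize` is hypothetical, and treating a value equal to a boundary as belonging to the higher bucket is an assumption of this sketch.

```python
import bisect


def bucketize(values, boundaries):
    """Maps each value to the index of the bucket it falls into.

    `boundaries` must be sorted; a value equal to a boundary goes to the
    bucket on the right of that boundary.
    """
    return [bisect.bisect_right(boundaries, value) for value in values]


population = [10001, 42004, 2500, 18000]
bucket_indices = bucketize(population, boundaries=[5000, 25000])
print(bucket_indices)
```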

To define bucketized features, use `bucketized_column`, which requires the boundaries separating each bucket. The function in the cell below will calculate these boundaries based on quantiles, so that each bucket contains approximately the same number of elements.

In [None]:
def get_quantile_based_boundaries(feature_values, num_buckets):
  boundaries = np.arange(1.0, num_buckets) / num_buckets
  quantiles = feature_values.quantile(boundaries)
  return [quantiles[q] for q in quantiles.keys()]

# Divide households into 7 buckets.
bucketized_households = tf.contrib.layers.bucketized_column(
  households, boundaries=get_quantile_based_boundaries(
    california_housing_dataframe["households"], 7))

# Divide longitude into 10 buckets.
bucketized_longitude = tf.contrib.layers.bucketized_column(
  longitude, boundaries=get_quantile_based_boundaries(
    california_housing_dataframe["longitude"], 10))
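
To see what quantile-based boundaries look like on concrete numbers, here is a dependency-free sketch that mirrors the idea of `get_quantile_based_boundaries` using linear interpolation between sorted values (the default behaviour of the pandas `quantile` method). The helper name `quantile_boundaries` and the toy values are assumptions for illustration.

```python
import bisect


def quantile_boundaries(values, num_buckets):
    """Returns num_buckets - 1 boundaries at evenly spaced quantiles."""
    ordered = sorted(values)
    boundaries = []
    for i in range(1, num_buckets):
        # Fractional position of the i-th quantile in the sorted list.
        position = (float(i) / num_buckets) * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = position - lower
        boundaries.append(
            ordered[lower] + fraction * (ordered[upper] - ordered[lower]))
    return boundaries


toy_values = [1, 2, 3, 4, 5, 6, 7, 8]
toy_boundaries = quantile_boundaries(toy_values, num_buckets=4)

# Count how many values land in each bucket.
bucket_counts = [0] * 4
for value in toy_values:
    bucket_counts[bisect.bisect_right(toy_boundaries, value)] += 1
print(toy_boundaries, bucket_counts)
```

On this evenly spread toy input every bucket receives two values. On real data with many repeated values the counts are only approximately equal.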

### Task 1: Train the model on bucketized feature columns.
**Bucketize all the real-valued features in our example, train the model and see if the results improve.**

In the preceding code block, two real-valued columns (namely `households` and `longitude`) have been transformed into bucketized feature columns. Your task is to bucketize the rest of the columns, then run the code to train the model. There are various heuristics for choosing the bucket ranges. This exercise uses a quantile-based technique, which chooses the bucket boundaries so that each bucket has roughly the same number of examples.

In [None]:
#
# Your code here: bucketize the following columns below, following the example above.
#
bucketized_latitude = 
bucketized_housing_median_age = 
bucketized_median_income =
bucketized_rooms_per_person =

bucketized_feature_columns = set([
  bucketized_longitude,
  bucketized_latitude,
  bucketized_housing_median_age,
  bucketized_households,
  bucketized_median_income,
  bucketized_rooms_per_person])

_ = train_model(
    learning_rate=1.0,
    steps=500,
    feature_columns=bucketized_feature_columns,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

### Feature crosses

Crossing two (or more) features is a clever way to learn non-linear relations using a linear model. In our problem, if we just use the feature `latitude` for learning, the model might learn that city blocks at a particular latitude (or within a particular range of latitudes, since we have bucketized it) are more likely to be expensive than others. The same holds for the feature `longitude`. However, if we cross `longitude` by `latitude`, the crossed feature represents a well-defined city block. If the model learns that certain city blocks (within a range of latitudes and longitudes) are more likely to be expensive than others, that is a stronger signal than the two features considered individually.

Currently, the feature columns API only supports discrete features for crosses. To cross two continuous values, like `latitude` and `longitude`, we can bucketize them.

If we cross the `latitude` and `longitude` features (supposing, for example, that `longitude` was bucketized into `2` buckets, while `latitude` has `3` buckets), we actually get six crossed binary features. Each of these features will get its own separate weight when we train the model.
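
The 2 x 3 example can be sketched as follows. Each (longitude bucket, latitude bucket) pair is assigned its own position in a six-element one-hot vector. The helper name `cross_one_hot` and the explicit index arithmetic are assumptions for illustration; `crossed_column` takes a `hash_bucket_size` argument, which suggests it hashes combinations into a fixed number of buckets rather than enumerating them like this.

```python
def cross_one_hot(lon_bucket, lat_bucket, num_lon_buckets=2, num_lat_buckets=3):
    """One-hot encodes the combination of two bucket indices."""
    # Each (lon, lat) pair maps to a unique index in [0, num_lon * num_lat).
    crossed_index = lon_bucket * num_lat_buckets + lat_bucket
    vector = [0.0] * (num_lon_buckets * num_lat_buckets)
    vector[crossed_index] = 1.0
    return vector


# Enumerate every possible combination of the two bucketized features.
all_crosses = [cross_one_hot(lon, lat) for lon in range(2) for lat in range(3)]
for vector in all_crosses:
    print(vector)
```

Because every combination has its own binary feature, a linear model can assign a separate weight to each region of the map instead of only one weight per latitude band and one per longitude band.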

### Task 2: Train the model using feature crosses.

**Add a feature cross of `longitude` and `latitude` to your model, train it, and determine whether the results improve.**

In [None]:
long_x_lat = tf.contrib.layers.crossed_column(
  set([bucketized_longitude, bucketized_latitude]), hash_bucket_size=1000)

#
# Your code here: Create a feature column set that includes the cross.
#
feature_columns_with_cross = 

_ = train_model(
    learning_rate=1.0,
    steps=500,
    feature_columns=feature_columns_with_cross,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

### Optional Challenge: Try out more synthetic features.

So far, we've tried simple bucketized columns and feature crosses, but there are many more combinations that could potentially improve the results. For example, you could cross multiple columns. What happens if you vary the number of buckets? What other synthetic features can you think of? Do they improve the model?