# Trying out features

**Learning Objectives:**
  * Improve the accuracy of a model by adding new features with the appropriate representation

The data is based on 1990 census data from California. It is aggregated at the city block level, so features such as total_rooms and population give the total number of rooms in a block and the total number of people who live on that block, respectively.

## Set Up
In this first cell, we'll load the necessary libraries.

In [1]:
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf

print(tf.__version__)
tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

1.8.0


Next, we'll load our data set.

In [2]:
df = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

## Examine and split the data

It's a good idea to get to know your data a little bit before you work with it.

We'll look at the first few rows, then print a quick summary of a few useful statistics on each column. The summary includes the mean, standard deviation, max, min, and various quantiles.

In [3]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.3 | 34.2 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.5 | 66900.0 |
| 1 | -114.5 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8 | 80100.0 |
| 2 | -114.6 | 33.7 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.7 | 85700.0 |
| 3 | -114.6 | 33.6 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.2 | 73400.0 |
| 4 | -114.6 | 33.6 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9 | 65500.0 |


In [4]:
df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.6 | 35.6 | 28.6 | 2643.7 | 539.4 | 1429.6 | 501.2 | 3.9 | 207300.9 |
| std | 2.0 | 2.1 | 12.6 | 2179.9 | 421.5 | 1147.9 | 384.5 | 1.9 | 115983.8 |
| min | -124.3 | 32.5 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.5 | 14999.0 |
| 25% | -121.8 | 33.9 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.6 | 119400.0 |
| 50% | -118.5 | 34.2 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5 | 180400.0 |
| 75% | -118.0 | 37.7 | 37.0 | 3151.2 | 648.2 | 1721.0 | 605.2 | 4.8 | 265000.0 |
| max | -114.3 | 42.0 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0 | 500001.0 |


Now, split the data into two parts -- training and evaluation.

In [5]:
np.random.seed(seed=1) #makes result reproducible
msk = np.random.rand(len(df)) < 0.8
traindf = df[msk]
evaldf = df[~msk]
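
The mask-based split above can be checked on a small scale. The sketch below applies the same technique to a hypothetical 1,000-row frame (`toy_df` and its `value` column are made up for illustration) and confirms that the two parts are disjoint and together cover every row. Because the mask is random, the training share is only approximately 80%.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing dataframe.
toy_df = pd.DataFrame({'value': np.arange(1000)})

np.random.seed(seed=1)  # same seeding approach as the cell above
toy_msk = np.random.rand(len(toy_df)) < 0.8  # True for roughly 80% of rows

toy_train = toy_df[toy_msk]
toy_eval = toy_df[~toy_msk]

# Every row lands in exactly one of the two parts.
print(len(toy_train), len(toy_eval), len(toy_train) + len(toy_eval))
print('train fraction: {:.3f}'.format(len(toy_train) / float(len(toy_df))))
```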

## Training and Evaluation

In this exercise, we'll be trying to predict **median_house_value**. It will be our label (sometimes also called a target).

We'll modify the feature columns and the input function to represent the features we want to use.

Hint: Some of the features in the dataframe aren't directly correlated with median_house_value (e.g. total_rooms), but can you think of a column to divide them by so that the result would be expected to correlate with median_house_value?

In [6]:
def add_more_features(df):
  df['rooms_per_hh'] = df['total_rooms'] / df['households']
  df['bedrooms_per_hh'] = df['total_bedrooms'] / df['households']
  df['pop_per_hh'] = df['population'] / df['households']
  # TODO: Add more features to the dataframe
  return df

add_more_features(df).head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_hh | bedrooms_per_hh | pop_per_hh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.3 | 34.2 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.5 | 66900.0 | 11.9 | 2.7 | 2.2 |
| 1 | -114.5 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8 | 80100.0 | 16.5 | 4.1 | 2.4 |
| 2 | -114.6 | 33.7 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.7 | 85700.0 | 6.2 | 1.5 | 2.8 |
| 3 | -114.6 | 33.6 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.2 | 73400.0 | 6.6 | 1.5 | 2.3 |
| 4 | -114.6 | 33.6 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9 | 65500.0 | 5.5 | 1.2 | 2.4 |
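
One way to add ratio features without modifying the caller's dataframe is `DataFrame.assign`, which returns a new frame. The sketch below is an alternative to `add_more_features`, not the notebook's version; `add_ratio_features` is a hypothetical name, and the input values are taken from the first two rows of the table above.

```python
import pandas as pd

def add_ratio_features(df):
    # assign() returns a new dataframe, so the input is left untouched.
    return df.assign(
        rooms_per_hh=df['total_rooms'] / df['households'],
        bedrooms_per_hh=df['total_bedrooms'] / df['households'],
        pop_per_hh=df['population'] / df['households'])

sample = pd.DataFrame({
    'total_rooms': [5612.0, 7650.0],
    'total_bedrooms': [1283.0, 1901.0],
    'population': [1015.0, 1129.0],
    'households': [472.0, 463.0]})

with_ratios = add_ratio_features(sample)
print(with_ratios[['rooms_per_hh', 'bedrooms_per_hh', 'pop_per_hh']].round(1))
print('rooms_per_hh' in sample.columns)  # the original frame gains no columns
```

Working on a new frame this way is one possible route around the "set on a copy of a slice" warnings that pandas emits when columns are added to a filtered dataframe.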


In [37]:
# Create pandas input function
def make_input_fn(df, num_epochs, shuffle = True):
  return tf.estimator.inputs.pandas_input_fn(
    x = add_more_features(df),
    y = df['median_house_value'] / 100000, # will talk about why later in the course
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = shuffle,
    queue_capacity = 1000,
    num_threads = 1
  )

In [59]:
# Define your feature columns
FEATURE_COLS_NUM = ['housing_median_age', 'median_income', 'rooms_per_hh', 'bedrooms_per_hh', 'pop_per_hh']
FEATURE_COLS_PP = ['latitude']  ## need some kind of preprocessing
FEATURE_COLS = FEATURE_COLS_NUM + FEATURE_COLS_PP
print(FEATURE_COLS)

def create_feature_cols():
  # TODO: Define additional feature columns
  # Hint: Are there any features that would benefit from bucketizing?
  numeric_cols = [tf.feature_column.numeric_column(i) for i in FEATURE_COLS_NUM]
  latitude_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    boundaries = np.arange(32.0, 42.0, 1).tolist())
  return numeric_cols + [latitude_buckets]

print(create_feature_cols())

['housing_median_age', 'median_income', 'rooms_per_hh', 'bedrooms_per_hh', 'pop_per_hh', 'latitude']
[_NumericColumn(key='housing_median_age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='median_income', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='rooms_per_hh', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='bedrooms_per_hh', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='pop_per_hh', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _BucketizedColumn(source_column=_NumericColumn(key='latitude', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0))]
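
To see what bucketizing does to a latitude value, the sketch below reproduces the bucket assignment with NumPy rather than TensorFlow. It assumes the left-inclusive boundary convention documented for `bucketized_column`, under which 10 boundaries produce 11 buckets; `np.digitize` follows the same convention by default. The sample latitudes are made up for illustration.

```python
import numpy as np

boundaries = np.arange(32.0, 42.0, 1).tolist()  # 32.0, 33.0, ..., 41.0
num_buckets = len(boundaries) + 1               # n boundaries give n + 1 buckets

# Hypothetical latitudes: below the first boundary, two interior values,
# and one above the last boundary.
latitudes = np.array([31.5, 34.2, 37.7, 42.0])

# digitize returns i such that boundaries[i-1] <= x < boundaries[i].
bucket_ids = np.digitize(latitudes, boundaries)

# One-hot encode the bucket ids, which is the form a linear model sees.
one_hot = np.eye(num_buckets)[bucket_ids]

print(bucket_ids.tolist())  # [0, 3, 6, 10]
print(one_hot.shape)        # (4, 11)
```

Since each bucket gets its own weight, a linear model can learn a different price offset for each band of latitude instead of a single slope.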


In [60]:
def get_estimator(output_dir):
  estimator = tf.estimator.LinearRegressor(
    feature_columns = create_feature_cols(), 
    #hidden_units = [4, 8, 4], 
    model_dir = output_dir)
  return estimator

# Create estimator train and evaluate function
def train_and_evaluate(output_dir, num_train_steps):
  # TODO: Create tf.estimator.LinearRegressor, train_spec, eval_spec, and train_and_evaluate using your feature columns
  ## define estimator:
  estimator = get_estimator(output_dir)
  
  ## define the train spec, 
  ## which specifies the input function and max_steps
  ## (and possibly some hooks):
  train_spec = tf.estimator.TrainSpec(
    input_fn = make_input_fn(df = traindf, num_epochs = None), ## None: repeat the data indefinitely; max_steps ends training
    max_steps = num_train_steps
  )
  
  ## define the exporter, which is needed for understanding
  ## json data coming in when model is deployed
  ## (serving time inputs); LatestExporter takes the latest
  ## checkpoint of the model:
  #exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
   
  ## define the eval spec (evaluation data input function):
  eval_spec = tf.estimator.EvalSpec(
    input_fn = make_input_fn(df = evaldf, num_epochs = 1),
    steps = None,
    start_delay_secs = 1,
    throttle_secs = 10
  )
  
  ## call train_and_evaluate!
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)


In [55]:
## try without train_and_evaluate first (and w/o tensorboard):
OUTDIR = './trained_model'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.LinearRegressor(create_feature_cols(), OUTDIR) #ADD CODE HERE

#model.train(
#  make_input_fn(traindf, num_epochs = 10),
#  max_steps = 100000
#)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3d8dc7c590>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': './trained_model', '_global_id_in_cluster': 0, '_save_summary_steps': 100}


In [51]:
# Launch tensorboard
from google.datalab.ml import TensorBoard

OUTDIR = './trained_model'
TensorBoard().start(OUTDIR)

8984

In [61]:
# Run the model
shutil.rmtree(OUTDIR, ignore_errors = True)
train_and_evaluate(OUTDIR, 2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3d8ca92e50>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': './trained_model', '_global_id_in_cluster': 0, '_save_summary_steps': 100}


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 10 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into ./trained_model/model.ckpt.
INFO:tensorflow:loss = 599.5413, step = 1
INFO:tensorflow:global_step/sec: 108.34
INFO:tensorflow:loss = 107.04132, step = 101 (0.930 sec)
INFO:tensorflow:global_step/sec: 196.537
INFO:tensorflow:loss = 81.49104, step = 201 (0.504 sec)
INFO:tensorflow:global_step/sec: 190.436
INFO:tensorflow:loss = 110.41925, step = 301 (0.525 sec)
INFO:tensorflow:global_step/sec: 187.43
INFO:tensorflow:loss = 76.41052, step = 401 (0.534 sec)
INFO:tensorflow:global_step/sec: 172.382
[output truncated]

In [62]:
pids_df = TensorBoard.list()
pids_df

| | logdir | pid | port |
|---|---|---|---|
| 0 | ./trained_model | 8984 | 41051 |


In [63]:
pids_df = TensorBoard.list()
if not pids_df.empty:
    for pid in pids_df['pid']:
        TensorBoard().stop(pid)
        print('Stopped TensorBoard with pid {}'.format(pid))

Stopped TensorBoard with pid 8984


In [64]:
## load model from disk:
model = get_estimator(OUTDIR)


## RMSE:
metrics = model.evaluate(input_fn = make_input_fn(df = evaldf, num_epochs = 1, shuffle = False), 
                         steps = None)
print('RMSE on dataset = {}'.format(np.sqrt(metrics['average_loss'])))


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3d8dc3ad50>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': './trained_model', '_global_id_in_cluster': 0, '_save_summary_steps': 100}


INFO:tensorflow:Calling model_fn.


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-02-04-08:53:19
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./trained_model/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-02-04-08:53:19
INFO:tensorflow:Saving dict for global step 2000: average_loss = 0.60335165, global_step = 2000, loss = 75.70946
RMSE on dataset = 0.776757121086
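
Because the label was divided by 100000 in `make_input_fn`, this RMSE is in units of $100,000. The sketch below redoes the arithmetic from the logged `average_loss`, treating it as the mean squared error per example (as the notebook's own RMSE computation does), and converts the result back to dollars. `LABEL_SCALE` is a name introduced here for illustration.

```python
import math

LABEL_SCALE = 100000.0       # divisor applied to median_house_value in make_input_fn
average_loss = 0.60335165    # value copied from the evaluation log above

rmse_scaled = math.sqrt(average_loss)      # the quantity the notebook prints
rmse_dollars = rmse_scaled * LABEL_SCALE   # undo the label scaling

print('RMSE (scaled)  = {:.4f}'.format(rmse_scaled))
print('RMSE (dollars) = {:.0f}'.format(rmse_dollars))
```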


In [65]:
## make prediction iterator:
pred_iter = model.predict(input_fn = make_input_fn(df = evaldf, num_epochs = 1, shuffle = False))
dat_pred = pd.DataFrame(columns = ['v_true', 'v_pred'])

## [[?]]
## how to get correct true labels in distributed training?
## maybe use different input_fn for predict, starting from a 
## pandas df for easier data inspection?

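## predict a few values to get correlation
## (v_true is in dollars, v_pred is in units of $100,000 because the
## label was scaled in make_input_fn; correlation is unaffected by scale):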
for i in range(1000):
  dat_pred = dat_pred.append({
    'v_true' : evaldf['median_house_value'].iloc[i],
    'v_pred' : next(pred_iter)['predictions'][0]
  }, ignore_index = True)
  #print(dat_eval['v'][i], next(pred_iter)['predictions'][0])
  
print(dat_pred.head(n = 5))
dat_pred.corr()


INFO:tensorflow:Calling model_fn.


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./trained_model/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
    v_true  v_pred
0  70400.0     2.2
1  44400.0     1.3
2  59200.0     1.2
3  53500.0     1.3
4 100000.0     2.2


| | v_true | v_pred |
|---|---|---|
| v_true | 1.0 | 0.7 |
| v_pred | 0.7 | 1.0 |
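
The true values above are in dollars while the predictions are in units of $100,000. Pearson correlation does not change when a column is rescaled by a positive constant, so the reported correlation is unaffected by the mismatch. The sketch below illustrates this using only the five rounded rows printed above, so its result will not match the 1,000-row figure. It also collects the rows in a list and builds the dataframe once, as a possible alternative to calling `append` inside the loop.

```python
import pandas as pd

# The five (v_true, v_pred) rows printed above, rounded as displayed.
pairs = [(70400.0, 2.2), (44400.0, 1.3), (59200.0, 1.2),
         (53500.0, 1.3), (100000.0, 2.2)]

# Build the list of rows first, then create the dataframe in one step.
rows = [{'v_true': t, 'v_pred': p} for t, p in pairs]
dat = pd.DataFrame(rows, columns=['v_true', 'v_pred'])

corr_raw = dat['v_true'].corr(dat['v_pred'])
# Scale predictions back to dollars so both columns share the same units.
corr_same_units = dat['v_true'].corr(dat['v_pred'] * 100000.0)

print(corr_raw, corr_same_units)
```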
