# Trying out features

**Learning Objectives:**
  * Improve the accuracy of a model by adding new features with the appropriate representation

The data is based on 1990 census data from California. This data is at the city block level, so these features reflect the total number of rooms in that block, or the total number of people who live on that block, respectively.

## Set Up
In this first cell, we'll load the necessary libraries.

In [1]:
import math
import shutil
import pandas as pd
import numpy as np
import tensorflow as tf
print(tf.__version__)

# Load the TensorBoard notebook extension
%load_ext tensorboard
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

2.0.0


## Load Our Dataset From Server

In [2]:
df = pd.read_csv(filepath_or_buffer="https://storage.googleapis.com/ml_universities/california_housing_train.csv", \
                 sep=",")

## Examine the data

It's a good idea to get to know your data a little bit before you work with it.

We'll print out a quick summary of a few useful statistics on each column.

This will include things like mean, standard deviation, max, min, and various quantiles.

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.3,34.2,15.0,5612.0,1283.0,1015.0,472.0,1.5,66900.0
1,-114.5,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.8,80100.0
2,-114.6,33.7,17.0,720.0,174.0,333.0,117.0,1.7,85700.0
3,-114.6,33.6,14.0,1501.0,337.0,515.0,226.0,3.2,73400.0
4,-114.6,33.6,20.0,1454.0,326.0,624.0,262.0,1.9,65500.0


In [4]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.6,35.6,28.6,2643.7,539.4,1429.6,501.2,3.9,207300.9
std,2.0,2.1,12.6,2179.9,421.5,1147.9,384.5,1.9,115983.8
min,-124.3,32.5,1.0,2.0,1.0,3.0,1.0,0.5,14999.0
25%,-121.8,33.9,18.0,1462.0,297.0,790.0,282.0,2.6,119400.0
50%,-118.5,34.2,29.0,2127.0,434.0,1167.0,409.0,3.5,180400.0
75%,-118.0,37.7,37.0,3151.2,648.2,1721.0,605.2,4.8,265000.0
max,-114.3,42.0,52.0,37937.0,6445.0,35682.0,6082.0,15.0,500001.0


## Split The Data

Now, split the data into two parts -- training and evaluation.

In [5]:
np.random.seed(seed=1) #makes result reproducible
msk = np.random.rand(len(df)) < 0.80
df_train = df[msk]
df_eval = df[~msk]

print("Number Of Training Examples: {}".format(len(df_train)))
print("Number Of Evaluation Examples: {}".format(len(df_eval)))

Number Of Training Examples: 13612
Number Of Evaluation Examples: 3388


## Training and Evaluation

In this exercise, we'll be trying to predict **median_house_value** It will be our label (sometimes also called a target).

We'll modify the feature_cols and input function to represent the features you want to use.

Hint: Some of the features in the dataframe aren't directly correlated with median_house_value (e.g. total_rooms) but can you think of a column to divide it by that we would expect to be correlated with median_house_value?

#### Add More Features:

In [6]:
def add_more_features(df):
    """
    This function will add more feture to our dataframe
    """
    df["num_rooms"] = df["total_rooms"] / df["households"]
    df["num_bedrooms"] = df["total_bedrooms"] / df["households"]
    return df

#### Create Feature's Columns:

In [12]:
def create_features_cols():
    return [
        # As california Latitude between 32 and 42, So
        tf.feature_column.bucketized_column(source_column=tf.feature_column.numeric_column("latitude"), boundaries=np.arange(32.0, 42, 1).tolist()),
        # As california Longitude between -124 and -114, So
        tf.feature_column.bucketized_column(source_column=tf.feature_column.numeric_column("longitude"), boundaries=np.arange(-124.0, -114, 1).tolist()),
        tf.feature_column.numeric_column("housing_median_age"),
        tf.feature_column.numeric_column("num_rooms"),
        tf.feature_column.numeric_column("num_bedrooms"),
        tf.feature_column.numeric_column("median_income")
    ]

#### Create Panda's Input Function:

In [8]:
def make_input_fn(df, num_epochs):
    return tf.compat.v1.estimator.inputs.pandas_input_fn(
        x=add_more_features(df),
        y=df["median_house_value"] / 100000,
        batch_size=128,
        num_epochs=num_epochs,
        shuffle=True,
        queue_capacity=1000,
        num_threads=1
    )

#### Create Estimator Train And Evaluate Function:

In [9]:
def train_and_evaluate(output_dir, num_train_steps):
    estimator = tf.compat.v1.estimator.LinearRegressor(
        feature_columns=create_features_cols(),
        model_dir=output_dir
    )
    
    train_spec = tf.compat.v1.estimator.TrainSpec(
        input_fn=make_input_fn(df_train, None),
        max_steps=num_train_steps
    )
    
    eval_spec =tf.compat.v1.estimator.EvalSpec(
        input_fn=make_input_fn(df_eval, 1),
        steps=None,
        start_delay_secs = 1, # start evaluating after N seconds
        throttle_secs = 5 # evaluate every N seconds
    )
    
    tf.compat.v1.estimator.train_and_evaluate(
        estimator=estimator,
        train_spec=train_spec,
        eval_spec=eval_spec
    )

## Launch Tensorboard

In [10]:
OUTPUT_DIR = './trained_model'
%tensorboard --logdir ./trained_model

## Run The Training:

In [13]:
shutil.rmtree(path=OUTPUT_DIR,ignore_errors=True)# start fresh each time
tf.compat.v1.summary.FileWriterCache.clear()# ensure filewriter cache is clear for TensorBoard events file
train_and_evaluate(output_dir=OUTPUT_DIR,num_train_steps=5000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './trained_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe244e30490>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:te

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./trained_model/model.ckpt.
INFO:tensorflow:loss = 878.85846, step = 0
INFO:tensorflow:global_step/sec: 315.187
INFO:tensorflow:loss = 61.196854, step = 100 (0.320 sec)
INFO:tensorflow:global_step/sec: 284.969
INFO:tensorflow:loss = 58.887238, step = 200 (0.349 sec)
INFO:tensorflow:global_step/sec: 274.167
INFO:tensorflow:loss = 159.2143, step = 300 (0.366 sec)
INFO:tensorflow:global_step/sec: 366.836
INFO:tensorflow:loss = 265.65558, step = 400 (0.275 sec)
INFO:tensorflow:global_step/sec: 280.202
INFO:tensorflow:loss = 198.5219, step = 500 (0.357 sec)
INFO:tensorflow:global_step/sec: 345.207
INFO:tensorflow:loss = 150.45772, step = 600 (0.287 sec)
INFO:tensorflow:global_step/sec: 294.403
INFO:tensorflow:loss = 84.50345, step = 700 (0.339 sec)
INFO:tensorflow:global_step/sec: 361.69
INFO:tensorflow:loss = 102.92279, step =