# Linear Regression with TensorFlow using canned estimators

https://medium.com/coinmonks/linear-regression-with-tensorflow-canned-estimators-6cc4ffddd14f
This project implements linear regression in TensorFlow using canned estimators. Canned estimators are a high-level API, in contrast to the low-level API, which requires you to program everything yourself.

The data is the King County housing transaction dataset. I will develop and train a machine learning model to predict house prices.

We start by importing the required libraries.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [2]:
train_df = pd.read_csv('kc_house_data.csv')
train_df.head(3)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062


Next, we inspect the DataFrame to find out the column names and types.


In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

A key concept when working with TensorFlow is that of feature columns. We need to pass our feature columns to the canned estimator when instantiating it. Let’s create our feature columns.

In [4]:
bedrooms = tf.feature_column.numeric_column('bedrooms', dtype=tf.int64, shape=())
bathrooms = tf.feature_column.numeric_column('bathrooms', dtype=tf.float64, shape=())
sqft_living = tf.feature_column.numeric_column('sqft_living', dtype=tf.int64, shape=())
sqft_lot = tf.feature_column.numeric_column('sqft_lot', dtype=tf.int64, shape=())
floors = tf.feature_column.numeric_column('floors', dtype=tf.float64, shape=())
waterfront = tf.feature_column.numeric_column('waterfront', dtype=tf.int64, shape=())
condition = tf.feature_column.numeric_column('condition', dtype=tf.int64, shape=())
yr_built = tf.feature_column.numeric_column('yr_built', dtype=tf.int64, shape=())
yr_renovated = tf.feature_column.numeric_column('yr_renovated', dtype=tf.int64, shape=())
zipcode = tf.feature_column.numeric_column('zipcode', dtype=tf.int64, shape=())

During instantiation, we will need to pass our feature columns as a list. So, let’s create that.

In [5]:
feature_cols = [bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,
                yr_built,yr_renovated,zipcode]

As usual, when working with data, we create a training and a validation set. So, let’s do that here as well.

In [6]:
feature_names = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','condition',
                'yr_built','yr_renovated','zipcode']

label_name = 'price'

features_ndarray = train_df[feature_names]
label_ndarray = train_df[label_name]
X_train, X_test, y_train, y_test = \
train_test_split(features_ndarray, label_ndarray, random_state=0, test_size=0.3)
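
To make the split sizes concrete, here is a small sketch of the arithmetic. It assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set, which is my understanding of scikit-learn's behaviour. The helper name `split_sizes` is made up for illustration.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the training set
    receives whatever rows are left.
    """
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return n_train, n_test

# 21613 is the number of entries reported by train_df.info()
n_train, n_test = split_sizes(21613, 0.3)
print(n_train, n_test)
```

Under that assumption the training set has 15129 rows, which agrees with the row count that `X_train.info()` reports later in the notebook.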

If you have been working with sklearn, you will be used to simply passing your data into a training function. In TensorFlow, you work with input functions. We will need to create our input functions for training and validation.

As of release 2.0.0-alpha0, tf.data.Dataset.make_one_shot_iterator() is deprecated in V1 and removed from V2; the replacement is tf.compat.v1.data.make_one_shot_iterator().

In [11]:
def train_input():
    _dataset = tf.data.Dataset.from_tensor_slices(({'bedrooms': X_train['bedrooms'], 
                                                   'bathrooms': X_train['bathrooms'], 
                                                   'sqft_living': X_train['sqft_living'],
                                                   'sqft_lot': X_train['sqft_lot'],
                                                   'floors': X_train['floors'],
                                                   'waterfront': X_train['waterfront'],
                                                   'condition': X_train['condition'],
                                                   'yr_built': X_train['yr_built'],
                                                   'yr_renovated': X_train['yr_renovated'],
                                                   'zipcode': X_train['zipcode']
                                                  }, y_train))
    dataset = _dataset.batch(32)
    #iterator = dataset.make_one_shot_iterator()
    iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
    features, labels = iterator.get_next()
    return features, labels

There are many ways of creating _dataset. Because we loaded our data using pandas, we have a DataFrame, so we use from_tensor_slices() to create our dataset. This facility is available in the tf.data module released with TensorFlow 1.8. It’s important to pay special attention to this function, which takes a single tuple. The first element is a dictionary whose values are the Series for each feature. The second element is the Series holding the label in our training data.
The next very important step is to fetch our data in batches. The batch size is passed as the argument to batch(). Without this call to batch(), you will get confusing errors about size().
One more thing we have to do is create an iterator. We do this using make_one_shot_iterator().
We end our function by returning the features and labels from iterator.get_next().
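
As a rough illustration of what batching does to the row count, the sketch below chunks a plain Python list instead of a real `tf.data.Dataset`. The helper `batched` is hypothetical and is not a TensorFlow function. The figure of 15129 rows is the size of `X_train`.

```python
def batched(rows, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(15129))          # stand-in for the 15129 training rows
batches = list(batched(rows, 32))

print(len(batches))       # number of batches in one pass over the data
print(len(batches[-1]))   # size of the final, partial batch
```

One pass gives 473 batches, the last one holding only 25 rows. That lines up with the `global step 473` in the logs further down, since training with `steps=None` runs until the input is exhausted.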
We need a similar function for our evaluation data.

In [12]:
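def val_input():
    # Note: this feeds X_train / y_train, so the "validation" numbers below are
    # computed on the training rows; use X_test / y_test for held-out data.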
    _dataset = tf.data.Dataset.from_tensor_slices(({'bedrooms': X_train['bedrooms'], 
                                                   'bathrooms': X_train['bathrooms'], 
                                                   'sqft_living': X_train['sqft_living'],
                                                   'sqft_lot': X_train['sqft_lot'],
                                                   'floors': X_train['floors'],
                                                   'waterfront': X_train['waterfront'],
                                                   'condition': X_train['condition'],
                                                   'yr_built': X_train['yr_built'],
                                                   'yr_renovated': X_train['yr_renovated'],
                                                   'zipcode': X_train['zipcode']
                                                  }, y_train))
    dataset = _dataset.batch(32)
    iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
    features, labels = iterator.get_next()
    return features, labels

We are now ready to instantiate our canned estimator. Recall the list of feature columns we created earlier, which we pass in to our call to LinearRegressor(). There are a number of different parameters we could pass, such as the optimizer to use. We will use the defaults at this time.

In [13]:
estimator = tf.estimator.LinearRegressor(feature_columns=feature_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\dave_\\AppData\\Local\\Temp\\tmp_rho66g0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
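
For intuition about what the estimator is fitting, the sketch below fits y = w*x + b to toy data by plain gradient descent on mean squared error. It is a stand-in, not the estimator's actual training loop: as far as I know, LinearRegressor defaults to the FTRL optimizer. The function `fit_linear`, the learning rate, and the toy data are all assumptions made for illustration.

```python
def fit_linear(xs, ys, lr=0.05, epochs=1000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 3x + 1 (not the housing data)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 7.0, 10.0, 13.0]

w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

The canned estimator does the same kind of fit with one weight per feature column plus a bias term.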


We are now ready to train our estimator.

In [14]:
estimator.train(input_fn=train_input, steps=None)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\dave_\AppData\Local\Temp\tmp_rho66g0\model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:lo

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressorV2 at 0x195608b2d48>

When training is done, we will be ready to evaluate our model.

In [15]:
train_e = estimator.evaluate(input_fn=train_input)
test_e = estimator.evaluate(input_fn=val_input)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-08-20T12:24:10Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\dave_\AppData\Local\Temp\tmp_rho66g0\model.ckpt-473
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 1.47283s
INFO:tensorflow:Finished evaluation at 2020-08-20-12:24:12
INFO:tensorflow:Saving dict for global step 473: average_loss = 132333490000.0, global_step = 473, label/mean = 540607.06, loss = 132311515000.0, prediction/mean = 510179.9
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 473:
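
The `average_loss` in the log is hard to read at that scale. Assuming it is the mean squared error per example (the usual loss for a regression estimator), its square root puts the error back into dollars. The numbers below are copied from the log line above.

```python
import math

average_loss = 132333490000.0   # average_loss from the evaluation log
label_mean = 540607.06          # label/mean from the same log line

rmse = math.sqrt(average_loss)
print(f"RMSE: {rmse:,.0f}")
print(f"RMSE relative to mean price: {rmse / label_mean:.2f}")
```

Under that assumption, the typical error is roughly 364,000, about two thirds of the mean house price, so the default model fits this data poorly.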

We are ready to run inference. The call to predict() returns a generator rather than the predictions themselves.


CHECK THIS
https://github.com/tensorflow/tensorflow/pull/21703/files

In [16]:
preds = estimator.predict(input_fn=val_input)

We need to iterate over this and convert to a numpy array to get our results.

In [17]:
predictions = np.array([item['predictions'][0] for item in preds])

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\dave_\AppData\Local\Temp\tmp_rho66g0\model.ckpt-473
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
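
One thing to keep in mind is that a generator can only be consumed once. The sketch below uses `fake_predict`, a hypothetical stand-in that yields dictionaries in the same shape the list comprehension above expects. The three values are illustrative.

```python
def fake_predict():
    """Hypothetical stand-in for estimator.predict(): yields one dict per row."""
    for value in (492302.97, 492811.88, 497050.94):
        yield {"predictions": [value]}

preds = fake_predict()
first_pass = [item["predictions"][0] for item in preds]
second_pass = [item["predictions"][0] for item in preds]  # generator is already used up

print(len(first_pass), len(second_pass))
```

The second pass comes back empty, so to use the predictions again, store them (as the notebook does with `np.array`) or call `predict()` again.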


In [19]:
df_predictions = pd.DataFrame(predictions)
print(df_predictions)

                  0
0      492302.96875
1      492811.87500
2      497050.93750
3      484847.84375
4      578004.12500
...             ...
15124  657693.62500
15125  482080.96875
15126  510795.71875
15127  618268.62500
15128  512907.84375

[15129 rows x 1 columns]


In [25]:
X_train.head(3)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,yr_built,yr_renovated,zipcode
1468,4,1.5,1390,7200,1.0,0,3,1965,0,98133
15590,3,1.5,1450,7316,1.0,0,3,1961,0,98133
18552,5,2.75,2860,5379,2.0,0,3,2005,0,98052


In [27]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15129 entries, 1468 to 2732
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   bedrooms      15129 non-null  int64  
 1   bathrooms     15129 non-null  float64
 2   sqft_living   15129 non-null  int64  
 3   sqft_lot      15129 non-null  int64  
 4   floors        15129 non-null  float64
 5   waterfront    15129 non-null  int64  
 6   condition     15129 non-null  int64  
 7   yr_built      15129 non-null  int64  
 8   yr_renovated  15129 non-null  int64  
 9   zipcode       15129 non-null  int64  
dtypes: float64(2), int64(8)
memory usage: 1.3 MB


In [26]:
y_train.head(3)

1468     400000.0
15590    430000.0
18552    720000.0
Name: price, dtype: float64

In [29]:
merged_train=X_train.join(y_train, how='outer')
merged_train.head(3)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,yr_built,yr_renovated,zipcode,price
1468,4,1.5,1390,7200,1.0,0,3,1965,0,98133,400000.0
15590,3,1.5,1450,7316,1.0,0,3,1961,0,98133,430000.0
18552,5,2.75,2860,5379,2.0,0,3,2005,0,98052,720000.0


In [44]:
merged_train_s=merged_train.sort_index()
merged_train_s.head(10)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,yr_built,yr_renovated,zipcode,price
0,3,1.0,1180,5650,1.0,0,3,1955,0,98178,221900.0
2,2,1.0,770,10000,1.0,0,3,1933,0,98028,180000.0
3,4,3.0,1960,5000,1.0,0,5,1965,0,98136,604000.0
4,3,2.0,1680,8080,1.0,0,3,1987,0,98074,510000.0
5,4,4.5,5420,101930,1.0,0,3,2001,0,98053,1225000.0
6,3,2.25,1715,6819,2.0,0,3,1995,0,98003,257500.0
9,3,2.5,1890,6560,2.0,0,3,2003,0,98038,323000.0
10,3,2.5,3560,9796,1.0,0,3,1965,0,98007,662500.0
11,2,1.0,1160,6000,1.0,0,4,1942,0,98115,468000.0
13,3,1.75,1370,9680,1.0,0,4,1977,0,98074,400000.0


In [None]:
All good to this point.


In [35]:
merged_train_pred=merged_train.join(df_predictions)
merged_train_pred.head(5)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,yr_built,yr_renovated,zipcode,price,0
1468,4,1.5,1390,7200,1.0,0,3,1965,0,98133,400000.0,490829.84375
15590,3,1.5,1450,7316,1.0,0,3,1961,0,98133,430000.0,
18552,5,2.75,2860,5379,2.0,0,3,2005,0,98052,720000.0,
10535,2,1.0,1050,4125,1.0,0,4,1909,0,98144,392500.0,484256.125
1069,2,1.0,1240,57000,1.0,0,3,1962,0,98075,505000.0,499694.15625


In [40]:
merged_train_pred_s=merged_train_pred.sort_index()
#merged_train_pred_s.head(5)

In [41]:
merged_train_pred_s.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,yr_built,yr_renovated,zipcode,price,0
0,3,1.0,1180,5650,1.0,0,3,1955,0,98178,221900.0,492302.96875
2,2,1.0,770,10000,1.0,0,3,1933,0,98028,180000.0,497050.9375
3,4,3.0,1960,5000,1.0,0,5,1965,0,98136,604000.0,484847.84375
4,3,2.0,1680,8080,1.0,0,3,1987,0,98074,510000.0,578004.125
5,4,4.5,5420,101930,1.0,0,3,2001,0,98053,1225000.0,491117.125
