## First Steps with TensorFlow

**Objectives**
1. Learn fundamental TensorFlow concepts
2. Use the LinearRegressor class in TensorFlow to predict median housing price, at the granularity of city blocks, based on one input feature
3. Evaluate the accuracy of a model's predictions using Root Mean Squared Error (RMSE)
4. Improve the accuracy of a model by tuning its hyperparameters

## Setup

In [5]:
from __future__ import print_function

import math
from IPython import display

from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt

import numpy as np
import pandas as pd
from sklearn import metrics

import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

In [7]:
california_housing_dataframe = pd.read_csv(
    "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv",
    sep=",")

In [8]:
california_housing_dataframe.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.3 | 34.2 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.5 | 66900.0 |
| 1 | -114.5 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8 | 80100.0 |
| 2 | -114.6 | 33.7 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.7 | 85700.0 |
| 3 | -114.6 | 33.6 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.2 | 73400.0 |
| 4 | -114.6 | 33.6 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9 | 65500.0 |


In [9]:
# Randomize the row order and scale median_house_value to units of thousands.
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
california_housing_dataframe["median_house_value"] /= 1000.0

california_housing_dataframe.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 8589 | -118.5 | 34.2 | 6.0 | 3218.0 | 949.0 | 2295.0 | 876.0 | 3.1 | 418.5 |
| 15036 | -122.2 | 37.9 | 41.0 | 685.0 | 141.0 | 266.0 | 123.0 | 5.2 | 384.6 |
| 2911 | -117.8 | 34.1 | 28.0 | 4086.0 | 871.0 | 1973.0 | 853.0 | 2.6 | 202.2 |
| 3238 | -117.9 | 34.0 | 10.0 | 17820.0 | 2812.0 | 8686.0 | 2666.0 | 6.4 | 310.7 |
| 7451 | -118.3 | 33.9 | 37.0 | 1420.0 | 286.0 | 886.0 | 290.0 | 4.6 | 261.3 |


In [10]:
# Examine the data
california_housing_dataframe.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.6 | 35.6 | 28.6 | 2643.7 | 539.4 | 1429.6 | 501.2 | 3.9 | 207.3 |
| std | 2.0 | 2.1 | 12.6 | 2179.9 | 421.5 | 1147.9 | 384.5 | 1.9 | 116.0 |
| min | -124.3 | 32.5 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.5 | 15.0 |
| 25% | -121.8 | 33.9 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.6 | 119.4 |
| 50% | -118.5 | 34.2 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5 | 180.4 |
| 75% | -118.0 | 37.7 | 37.0 | 3151.2 | 648.2 | 1721.0 | 605.2 | 4.8 | 265.0 |
| max | -114.3 | 42.0 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0 | 500.0 |


In [18]:
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

# Define the label (target): median_house_value.
targets = california_housing_dataframe["median_house_value"]

In [12]:
# Use gradient descent as the optimizer for training the model.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Gradient clipping ensures that the magnitude of the gradients does not become
# too large during training, which can cause gradient descent to fail.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regression model with our feature columns and optimizer.
linear_regessor = tf.estimator.LinearRegressor(feature_columns=feature_columns,
                                               optimizer=my_optimizer)
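
To see what clipping by norm does to a gradient, here is a minimal pure-Python sketch. The helper clip_by_norm_sketch is hypothetical and is not TensorFlow's implementation; it only illustrates the general idea of rescaling a gradient vector whose L2 norm exceeds a threshold (5.0, as in the cell above).

```python
import math


def clip_by_norm_sketch(gradients, clip_norm):
    """Rescale a list of gradient values so their combined L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= clip_norm:
        # Already within the limit: return the gradients unchanged.
        return list(gradients)
    # Shrink every component by the same factor, which preserves the direction.
    scale = clip_norm / norm
    return [g * scale for g in gradients]


# A gradient whose norm is exactly 5.0 is left alone.
small = clip_by_norm_sketch([3.0, 4.0], clip_norm=5.0)
# A gradient whose norm is 50.0 is scaled down to norm 5.0.
large = clip_by_norm_sketch([30.0, 40.0], clip_norm=5.0)
large_norm = math.sqrt(sum(g * g for g in large))

print(small)
print(large, large_norm)
```

Clipping changes only the length of the gradient, not its direction, so each update still moves the weights the same way, just by a bounded amount.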


### Step 4: Define the Input Function

We need to define an input function, which tells TensorFlow how to preprocess the data and how to batch, shuffle, and repeat it during model training.

First, we convert our pandas feature data into a dict of NumPy arrays. We then use the TensorFlow Dataset API to construct a dataset object from our data and break it into batches of batch_size, repeated for the specified number of epochs (num_epochs). With the default num_epochs=None, the data is repeated indefinitely. Finally, if shuffle is True, the data is shuffled so that it reaches the model in random order.

In [32]:
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Builds batches of features and labels for a linear regression model of one feature.

    Args:
        features: pandas DataFrame of features
        targets: pandas DataFrame of targets
        batch_size: Size of the batches passed to the model
        shuffle: True or False. Whether to shuffle the data
        num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
        Tuple of (features, labels) for the next data batch
    """
    # Convert pandas data into a dict of NumPy arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
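
To make the batching and repeating behavior concrete, here is a minimal pure-Python sketch. The generator batch_and_repeat is hypothetical and is not part of TensorFlow; it only mimics what batching followed by repeating does, using the total_rooms values and scaled median_house_value values from the first five rows shown earlier. Shuffling is left out to keep the output easy to follow.

```python
from itertools import islice


def batch_and_repeat(features, targets, batch_size=1, num_epochs=None):
    """Yield (feature_batch, target_batch) pairs; num_epochs=None repeats forever."""
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        for start in range(0, len(features), batch_size):
            stop = start + batch_size
            yield features[start:stop], targets[start:stop]
        epoch += 1


# total_rooms and median_house_value (in thousands) from the first five rows.
rooms = [5612.0, 7650.0, 720.0, 1501.0, 1454.0]
values = [66.9, 80.1, 85.7, 73.4, 65.5]

# One pass over the data in batches of two: the last batch holds the leftover row.
one_epoch = list(batch_and_repeat(rooms, values, batch_size=2, num_epochs=1))

# With num_epochs=None the generator never ends, so take only the first seven batches.
first_seven = list(islice(batch_and_repeat(rooms, values, batch_size=2), 7))

for feature_batch, target_batch in one_epoch:
    print(feature_batch, target_batch)
```

With five rows and a batch size of two, one epoch yields three batches. With num_epochs=None the stream of batches never stops on its own, which is why training is bounded by the steps argument rather than by the size of the data.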
    
    

### Step 5: Train the Model

In [33]:
_ = linear_regessor.train(input_fn=lambda: my_input_fn(my_feature, targets), steps=100)
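
As a rough illustration of what 100 training steps amount to with these settings, here is a pure-Python sketch of one-example-at-a-time gradient descent with norm clipping. It is a sketch under stated assumptions, not the Estimator's implementation: the function clipped_sgd is hypothetical, the loss is assumed to be squared error, weights start at zero, and rows are visited in order rather than shuffled. The data are the first five rows shown earlier.

```python
import math


def clipped_sgd(xs, ys, learning_rate, steps, clip_norm):
    """Fit y = w * x + b by gradient descent on squared error, one example per step."""
    w, b = 0.0, 0.0
    for step in range(steps):
        i = step % len(xs)  # cycle through the rows (assumption: no shuffling)
        error = (w * xs[i] + b) - ys[i]
        # Gradients of error**2 with respect to w and b.
        grad_w, grad_b = 2.0 * error * xs[i], 2.0 * error
        # Clip the gradient vector so its L2 norm is at most clip_norm.
        norm = math.sqrt(grad_w ** 2 + grad_b ** 2)
        if norm > clip_norm:
            grad_w, grad_b = grad_w * clip_norm / norm, grad_b * clip_norm / norm
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(xs, ys, w, b):
    """Average squared difference between predictions w * x + b and targets."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rooms = [5612.0, 7650.0, 720.0, 1501.0, 1454.0]
values = [66.9, 80.1, 85.7, 73.4, 65.5]

mse_before = mean_squared_error(rooms, values, 0.0, 0.0)
w, b = clipped_sgd(rooms, values, learning_rate=0.0000001, steps=100, clip_norm=5.0)
mse_after = mean_squared_error(rooms, values, w, b)

print("weight: %g, bias: %g" % (w, b))
print("MSE before: %0.3f, after: %0.3f" % (mse_before, mse_after))
```

Because each clipped update can move the weight by at most learning_rate times clip_norm, 100 steps can change it by no more than 0.00005 in this sketch. The error goes down, but only slightly, which is one reason to expect that the learning rate and number of steps will need tuning.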

### Step 6: Evaluate the Model

In [34]:
# Create an input function for predictions: a single pass over the data, in order.
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

# Call predict() on the trained model to make predictions.
predictions = linear_regessor.predict(input_fn=prediction_input_fn)

# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])

# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(targets, predictions)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)

Mean Squared Error (on training data): 56308.998
Root Mean Squared Error (on training data): 237.295
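
RMSE is expressed in the same units as the target (here, thousands of dollars), so it can be compared directly with the spread of median_house_value. The sketch below does that using only numbers already reported above: the min, max, mean, and std from describe() and the RMSE just printed. The rmse helper and the zero-prediction baseline are illustrative assumptions, and the baseline is approximate because it is computed from rounded summary statistics.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Values copied from the describe() output and the evaluation output above.
min_value, max_value = 15.0, 500.0
mean_value, std_value = 207.3, 116.0
reported_rmse = 237.295

# How large is the error relative to the full range of the target?
value_range = max_value - min_value
fraction_of_range = reported_rmse / value_range

# Approximate RMSE of a model that always predicts 0:
# mean(y**2) is roughly mean**2 + std**2.
zero_baseline = math.sqrt(mean_value ** 2 + std_value ** 2)

print("Target range: %0.3f" % value_range)
print("RMSE / range: %0.3f" % fraction_of_range)
print("Approximate RMSE when always predicting 0: %0.3f" % zero_baseline)

# Small check of the helper itself: every prediction is off by exactly 3.
check = rmse([0.0, 0.0], [3.0, 3.0])
```

The reported error is nearly half the range of the target and is close to the zero-prediction baseline, which suggests that the model has learned very little so far and that its hyperparameters are worth tuning.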
