# Linear Regression with Tensorflow
<p class='lead'>
Author: Oliveira, Markos F. B. G.<br />
</p>

# Description

This document is based on Chapter 9 content of Geron's book *Hands-On Machine Learning with Scikit-Learn & TensorFlow*. It covers only the basics of the related topics and it's not meant to be a tutorial. Please, refer to the referenced book and scikit-learn/tensorflow online documentation for specific information.

In paticular, this notebook provides an example of applying linear regression using Tensorflow on moderated-size regression problem. It applies stochastic gradient descent using mini-batches of a given size, two approaches for selecting the next mini-batch exist: *1*: random indices from the training set are selected to be inside the mini-batch using a randomized seed, which is the number of the training iteration. In this case, one epoch, in general, does not cover all the training instances. *2*: a random permutation of the examples is passed and a particular slice of this permutation is used as mini-batch. It's guarateed that all examples are used in any epoch (this approach is the most common).

The algorithm that minimizes the cost function is a plain stochastic gradient descent with mean squared error. An automatic search for good hyperparameters is not implemented, nor ensemble methods are used. However, the algorithm employs a naive implementation of early stopping. It just checks if the validaiton error has not decreased by an amount of *n_it* iterations; if dows not, the training is stopped.

In [1]:
import tensorflow as tf
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
import numpy as np
from datetime import datetime
import time
from sklearn.model_selection import train_test_split

### Fetching and Preprocessing data

In [2]:
#Fetching the data: in the first time it's executed, the data is downloaded to ‘~/scikit_learn_data’ subfolders (this may
#take a while); once, downloaded and executed again, the function just read the available dataset.
housing = fetch_california_housing()
print(fetch_california_housing().keys())
print('Entire dataset shape: ', housing.data.shape)

#Splitting the dataset into training, validation (to early stopping and testing):
X_train_val, X_test, y_train_val, y_test = train_test_split(housing.data, housing.target, random_state=0, test_size=.2)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, random_state=0, test_size=.2) #25% of the 80%
#is 20% of the original, so: X_train_ratio = 0.6, X_val_ratio = 0.2, X_test_ratio = 0.2

#Normalizing the data:
scaler = StandardScaler() #instantiates the scaler.
scaler.fit(X_train) #fits the training data, i.e. computes the necessary parameters.
X_train_scaled = scaler.transform(X_train) #Transform the datasets based on parameters evaluated on training set only.
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

#It's important to transform the y_train and y_test 1d-array ((m,) format) to 2d-array ((m,1) format):
y_train = y_train.reshape(-1, 1)
y_val = y_val.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

#Adding the bias feature to each example (necessary for linear regression):
X_train_scaled_plus_bias = np.c_[np.ones((X_train_scaled.shape[0], 1)), X_train_scaled]
X_val_scaled_plus_bias = np.c_[np.ones((X_val_scaled.shape[0], 1)), X_val_scaled]
X_test_scaled_plus_bias = np.c_[np.ones((X_test_scaled.shape[0], 1)), X_test_scaled]
m, n = X_train_scaled_plus_bias.shape

dict_keys(['data', 'feature_names', 'target', 'DESCR'])
Entire dataset shape:  (20640, 8)


### Mini-batch function

In [3]:
def fetch_batch(epoch, batch_index, batch_size, idxs, approach=2):
    """Returns the mini-batch (X, y) for a training step.

    Parameters
    ----------
    epoch : int
        Number of epoch.
    batch_index : int
        Number of batch inside the epoch.
    batch_size : int
        Batch size.
    idxs : list of int
        Permutation of training set indices.
    approach: int, optional
        Approach used for chosing the mini-batches. 
    Returns
    -------
    X_batch : np.ndarray
        Mini-batch of inputs.
    y_batch : np.ndarray
        Mini-batch of outputs.
    """
    
    if approach == 1:
        rnd.seed(epoch * n_batches + batch_index)
        indices = rnd.randint(m, size=batch_size)
    elif approach == 2:
        indices = idxs[batch_index * batch_size: (batch_index + 1) * batch_size]
    
    X_batch = X_train_scaled_plus_bias[indices]
    y_batch = y_train[indices]
    
    return X_batch, y_batch

### Linear Regression using TensorFlow

Below, the user can set some running parameters acording to their preference.

In [4]:
#SGD parameters:
n_epochs = 1000
learning_rate = 0.005
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

#Early-stopping:
early_stopping = True #enable early stopping.
min_val_error = float("inf") #stores the best error so far.
min_val_error_epoch = 0 #epoch when minimum validation error was found.
n_it = 2*batch_size #maximum number of iterations without improvement in validation error.
imp_counter = 0 #counts the umber of iterations there is no improvement in the validation error.
stop_learning = True #enable stop learning after 'n_it' iterations of no improvement in validation error. If 'False', all
#epochs are runned, but the model which least validation error is returned.

#Monitoring:
prt = True #enable print training statistics (in this case MSE).
print_step = 100 # number of epochs to periodically print training statistics.

#Saving/restoring TF model:
restore = False #enable model's restoration.
path_restore = '/tmp/finals/my_model_final.ckpt' #path of original model to be restored.
save_ckpt = False #enable saving training checkpoints.
saver_step = 100 # number of epochs to periodically save model's checkpoint.

#Tensorboard logs:
tb = False #enable log training statistics for tensorboard.
tensorboard_step = 100 #number of epochs to periodically log statistics in tensorboard log files.
root_logdir = "tf_logs" #external folder where all logging stats will be placed (for different session runs).

Constructing and Executing the Tensorflow graph

In [5]:
tf.reset_default_graph() #restoring the default graph.
tf.logging.set_verbosity(tf.logging.WARN) #supress TF logging messages when saving ckpt files.
start_time = time.time()
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
logdir = "{}/run-{}/".format(root_logdir, now) #relative path of tensorboard logs for a particular run (current time
#is used so that each folder has different running stats, comparision between them can be made inside tensorboard).

# TF CONSTRUCTION PHASE

# 1- Creating variables, placeholders and constants.
X = tf.placeholder(tf.float32, shape=(None, n), name="X") #n contains the bias already.
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
theta = tf.Variable(tf.random_uniform([n, 1], -1.0, 1.0, seed=1), name="theta")

# 2- Creating operations.
y_pred = tf.matmul(X, theta, name="predictions")
with tf.name_scope("loss") as scope: # Grouping related nodes to the same scope in the TF-graph.
    error = y_pred - y
    mse = tf.reduce_mean(tf.square(error), name="mse")
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

# 3- Node that initialize the variables.
init = tf.global_variables_initializer()

# 4- Creating the saver.
saver_save = tf.train.Saver(max_to_keep=None) #save all variable values.
if restore:
    saver_restore = tf.train.Saver() #restore all variable values.
    #saver_restore = tf.train.Saver({"weights": theta}) #restore only the old theta variable under the name of 'weights'.

# 5- Tensorboard definitions:
if tb:
    mse_summary = tf.summary.scalar('MSE', mse) #creates a node in the graph that will evaluate the MSE value
    #and write it to a TensorBoard-compatible binary log string called a summary.
    summary_writer = tf.summary.FileWriter(logdir, tf.get_default_graph()) #creates a FileWriter that you will
    #use to write summaries to logfiles in the log directory.


# TF EXECUTION PHASE:

with tf.Session() as sess:
    
    if restore:  #to restore a model the construction phase must be identical than the one used to save it. 
        saver_restore.restore(sess, path_restore)
    else:
        sess.run(init) #initializing the variables.
    
    breaker = False #used to break the epoch loop if the number of iterations without validation improvement increases above
    #the threshold 'n_int'.
    for epoch in range(n_epochs+1): #for each epoch..
        
        idXs = np.random.permutation(range(m)) #creates a random permutation of the instances.
        
        for batch_index in range(n_batches-1): # for each mini-batch.. The '-1' guarantees all batches are equally sized, including
            #the last one, without this the last batch with few examples could be used to update the weights. Hence, a few points
            #(< batch_size) may be out of each epoch.
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size, idXs)
            #Training step:
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            
        #Early-stopping (we could implement early stopping after each update (iteration), however, it would be too slow):
        if (early_stopping):
            val_error = mse.eval(feed_dict={X: X_val_scaled_plus_bias, y: y_val})
            imp_counter = imp_counter + 1 
            if val_error <= min_val_error:
                min_val_error = val_error
                min_val_error_epoch = epoch
                save_path = saver_save.save(sess, "./tmp/best_model-{}.ckpt".format(now))
                imp_counter = 0
            elif (stop_learning) and (imp_counter >= n_it):
                breaker = True
                break    
                    
        if breaker:
            break 
        if (prt) and (epoch % print_step == 0):
            print("Epoch {}, Training MSE = {:.4}, Validation MSE = {:.4}".format(epoch,
                                                                     mse.eval(feed_dict={X: X_train_scaled_plus_bias, y: y_train}),
                                                                     mse.eval(feed_dict={X: X_val_scaled_plus_bias, y: y_val})))
        if (tb) and (epoch % tensorboard_step == 0):
            summary_str_val = mse_summary.eval(feed_dict={X: X_val_scaled_plus_bias, y: y_cal})   
            summary_str_training = mse_summary.eval(feed_dict={X: X_train_scaled_plus_bias, y: y_train}) # This will output a
            #summary that you can then write to the events file using the file_writer.
            step = epoch * n_batches + batch_index
            summary_writer.add_summary(summary_str_training, step)
            summary_writer.add_summary(summary_str_val, step)
        if (save_ckpt) and (epoch % saver_step == 0):
            save_path = saver_save.save(sess, "./tmp/my_model-{}.ckpt".format(epoch))
          
    #Getting the 'best theta': this will probably not be the best theta over the training set (forget about the test
    #set), because we are updating the weights according the mini batches instead of the full training set error.
    #best_theta = theta.eval()
    #final_error = mse.eval(feed_dict={X: X_test_scaled_plus_bias, y: y_test})
    if not (early_stopping):
        save_path = saver_save.save(sess, "./tmp/best_model-{}.ckpt".format(now))

#Flushing and closing FileWriter if necessary:          
if tb:                            
    summary_writer.flush()
    summary_writer.close() 

if early_stopping:
    print('Epoch {} with minimum validation error {:.4}.'.format(min_val_error_epoch, min_val_error))
best_model_path = "./tmp/best_model-{}.ckpt".format(now)
print("Best model saved as:", best_model_path)
print("Training time: %.8s seconds" % (time.time() - start_time)) 

Epoch 0, Training MSE = 1.369, Validation MSE = 1.531
Epoch 100, Training MSE = 0.5234, Validation MSE = 0.5306
Epoch 200, Training MSE = 0.5227, Validation MSE = 0.5341
Epoch 300, Training MSE = 0.5238, Validation MSE = 0.5449
Epoch 139 with minimum validation error 0.5239.
Best model saved as: ./tmp/best_model-20170815205237.ckpt
Training time: 21.45600 seconds


### Testing the best model

In [6]:
saver_restore = tf.train.Saver()
with tf.Session() as sess:
    saver_restore.restore(sess, best_model_path) #restoring the previous model saved.
    best_theta = theta.eval()
    final_error = mse.eval(feed_dict={X: X_test_scaled_plus_bias, y: y_test})

print("\n*-*-*-*-*-*-* Final test results *-*-*-*-*-*-*") 
print("Best theta:\n", best_theta)
print("Final test set error: ", final_error)


*-*-*-*-*-*-* Final test results *-*-*-*-*-*-*
Best theta:
 [[ 2.06383395]
 [ 0.83894938]
 [ 0.11557128]
 [-0.2354842 ]
 [ 0.28999415]
 [-0.01035487]
 [ 0.00870804]
 [-0.86221409]
 [-0.83403569]]
Final test set error:  0.537006


### Conclusions and final remarks

- By descreasing the size of the training set to 60% of the total when including the validation set, using the original learning_rate = 0.01 actually made the algorithm diverge. It was necessary to decrease the LR to 0.005.

- It's a good practice to set 'stop_learning'=True because it decreases a lot the training time. Be careful setting the 'n_it' parameter though to not stop too early.

- In logistic regression the squared error cost function is a bowl-shapped convex surface with respect to model's parameters. Hence, it does not  have flat regions like neural networks cost surface. In this case, early stopping is not very useful. Because we're using mini-batch gradient descent, after the convergence the training error will become to bounce around the minimum value. Moreover, because the linear regression model is a hyperplane, i.e. it has large bias, it seems strange that increasing the number of iterations, the model starts to overfit. It probably doesn't, because it doesn't have degrees of freedom for that; the linear function just bounce around the 'best' linear function. It's useful though to use early stopping to stop the training procedure automatically, avoiding iterating over the maximum number of steps defined.