# Linear Regression

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 30/12/2024   | Martin | Created   | Started  | 

# Content

* [Introduction](#introduction)
* [TensorFlow and Linear Regression](#tensorflow-and-linear-regression)
* [Note on Estimators](#note-on-estimators)
* [Understanding Loss Functions in Linear Regression](#understanding-loss-functions-in-lr)

# Introduction

Linear regression is a fundamental machine learning algorithm. A key benefit is that it is highly interpretable: each coefficient gives the magnitude and direction of the change in the prediction for a one-unit change in the corresponding feature, with the other features held fixed.
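To make the interpretability point concrete, here is a minimal sketch of a two-feature linear model. The feature names and coefficient values are hypothetical, chosen only for illustration and not fitted to any data.

```python
# Hypothetical coefficients for a two-feature linear model (illustrative values only)
intercept = 20.0
coefficients = {"rooms": 4.0, "crime_rate": -0.5}


def predict(features):
    """Return intercept + sum(coefficient * feature value)."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())


base = {"rooms": 6.0, "crime_rate": 2.0}
one_more_room = {"rooms": 7.0, "crime_rate": 2.0}

# Raising one feature by one unit shifts the prediction by exactly its coefficient
delta = predict(one_more_room) - predict(base)
print(delta)  # 4.0
```

The sign of a coefficient gives the direction of the effect and its absolute value gives the size, which is what makes the model easy to read.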

TensorFlow offers two ways to implement linear regression: _Estimators_ and _Keras_.

Goals of module:

* The TensorFlow way of regression
* Turning a Keras model into an Estimator
* Understanding the loss functions in linear regression
* Implementing Lasso and Ridge regression
* Implementing logistic regression
* Resorting to non-linear regression
* Using Wide & Deep models

# TensorFlow and Linear Regression

Estimators and Keras both provide APIs for implementing linear regression.

__Estimators__

Estimators are pre-made, high-level procedures that simplify the training, evaluation, prediction and export of models. A trained Estimator model can be deployed for serving on any hardware.

There are four steps when developing an Estimator model:

1. Acquire data (using the `tf.data` API)
2. Instantiate the feature columns
3. Instantiate and train the Estimator
4. Evaluate the model's performance

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "2"

In [3]:
# Retrieve data
tfds.disable_progress_bar()
housing_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
path = tf.keras.utils.get_file(housing_url.split("/")[-1], housing_url)

columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# `delim_whitespace=True` is deprecated in recent pandas; `sep=r"\s+"` is the equivalent
data = pd.read_table(path, sep=r"\s+", header=None, names=columns)
data.head()


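|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |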


In [5]:
# Split data into training and testing sets
np.random.seed(1)
train = data.sample(frac=0.8).copy()
y_train = train['MEDV']
train.drop('MEDV', axis=1, inplace=True)

test = data.loc[~data.index.isin(train.index)].copy()
y_test = test['MEDV']
test.drop('MEDV', axis=1, inplace=True)

In [9]:
def define_feature_columns(data_df, categorical_cols, numeric_cols):
  feature_columns = []

  for feature_name in numeric_cols:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

  for feature_name in categorical_cols:
    vocabulary = data_df[feature_name].unique()
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))
  
  return feature_columns

In [7]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=256):

  def input_function():
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
    if shuffle:
      ds = ds.shuffle(1000)
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds
  
  return input_function

In [11]:
categorical_cols = ['CHAS', 'RAD']
numeric_cols = ['CRIM', 'ZN', 'INDUS',  'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
feature_columns = define_feature_columns(data, categorical_cols, numeric_cols)

train_input_fn = make_input_fn(train, y_train, num_epochs=1400)
test_input_fn = make_input_fn(test, y_test, shuffle=False, num_epochs=1)
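
As a rough check on how long training runs with these settings, the number of training steps can be worked out by hand. This is a sketch under two assumptions: the housing file has 506 rows, and `data.sample(frac=0.8)` keeps `round(0.8 * 506)` of them. Because the dataset is batched before it is repeated, every epoch ends with one partial batch.

```python
import math

n_rows = 506       # assumption: number of rows in the full housing file
train_frac = 0.8   # fraction passed to data.sample
batch_size = 256   # default batch_size in make_input_fn
num_epochs = 1400  # value passed when building train_input_fn

n_train = round(n_rows * train_frac)                 # rows in the training split
batches_per_epoch = math.ceil(n_train / batch_size)  # the last batch is partial, not dropped
total_steps = batches_per_epoch * num_epochs         # batches yielded before the input runs out

print(n_train, batches_per_epoch, total_steps)  # 405 2 2800
```

Under these assumptions each epoch yields one full batch of 256 rows and one partial batch of 149 rows.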

In [None]:
linear_est = tf.estimator.LinearRegressor(feature_columns=feature_columns)
linear_est.train(train_input_fn)
results = linear_est.evaluate(test_input_fn)

print(results)

‼️ _Note: the Estimator API was removed in TensorFlow 2.16, so the code above requires TF 2.15 or earlier._

## Review

Estimators pull the data from the input functions and convert it to the proper form by matching each feature name to its feature column. Convergence to the line of best fit depends on the number of iterations, the batch size, the learning rate and the loss function.

It is good practice to monitor the loss over time (ideally in real time) to troubleshoot model problems and to see the effect of hyperparameter changes.
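
The dependence on learning rate and batch size can be illustrated without TensorFlow. The sketch below is a plain-Python mini-batch gradient descent on synthetic, noise-free data; the function name `train_linear` and all values are hypothetical, and this is not what the Estimator does internally. It records the loss after each epoch so that two learning rates can be compared.

```python
import random


def train_linear(xs, ys, learning_rate, epochs=50, batch_size=4, seed=0):
    """Mini-batch gradient descent on MSE for y = w * x + b; returns (w, b, per-epoch loss)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the batch mean squared error with respect to w and b
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        # Loss on the full data set, recorded once per epoch
        history.append(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return w, b, history


# Synthetic, noise-free data drawn from y = 3x + 1
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]

for lr in (0.01, 0.1):
    w, b, history = train_linear(xs, ys, learning_rate=lr)
    print(f"lr={lr}: first loss={history[0]:.4f}, last loss={history[-1]:.6f}, w={w:.2f}, b={b:.2f}")
```

With the same number of epochs, the smaller learning rate ends further from the true slope and intercept. A flat or slowly falling loss curve of this kind is the sort of signal that monitoring the loss during training is meant to catch.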

## Interactions to improve performance

Combining two variables into a single feature can explain the target value better than the individual features do. For example, the average number of rooms per house combined with the proportion of lower-income population in an area reveals more about the type of neighbourhood than either does alone.

`tf.feature_column.crossed_column` creates a combined categorical feature and applies a hashing function to limit the cardinality of the output.

In [None]:
def create_interactions(interactions_list, buckets=5):
    interactions = list()
    for (a, b) in interactions_list:
        interactions.append(tf.feature_column.crossed_column([a, b], hash_bucket_size=buckets))
    return interactions

derived_feature_columns = create_interactions([['RM', 'LSTAT']])
linear_est = tf.estimator.LinearRegressor(feature_columns=feature_columns+derived_feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(test_input_fn)

print(result)
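
The hashing step behind a crossed feature can be sketched in plain Python. This is only an illustration of the idea: TensorFlow uses its own hash function, so the bucket ids will differ, and the helper `crossed_bucket` and the sample value pairs are hypothetical.

```python
import hashlib


def crossed_bucket(values, hash_bucket_size):
    """Map a combination of feature values to one of hash_bucket_size buckets."""
    key = "_X_".join(str(v) for v in values)
    # Use a stable hash: Python's built-in hash() is salted per process for strings
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


# Hypothetical pairs of categorical values, e.g. (CHAS, RAD)
pairs = [(0, 1), (0, 2), (1, 24), (0, 1)]
buckets = [crossed_bucket(pair, hash_bucket_size=5) for pair in pairs]
print(buckets)
```

Identical combinations always land in the same bucket. With only five buckets, different combinations can collide, which is the price paid for bounding the cardinality.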

---

# Note on Estimators

Since the `tf.estimator` API was removed in TensorFlow 2.16 (2.15 is the last release that includes it), any sections that rely on it will be skipped.

---

# Understanding Loss Functions in LR 

This section compares the mean absolute error (L1 loss) with the mean squared error (L2 loss).
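
As a quick illustration of the difference, the sketch below computes both losses by hand on made-up values. The L1 loss averages the absolute residuals and the L2 loss averages the squared residuals, so a single large error has a much bigger effect on the L2 loss.

```python
def mean_absolute_error(y_true, y_pred):
    """L1 loss: average absolute residual."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """L2 loss: average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_close = [2.5, 5.5, 7.0, 9.0]     # small errors only
y_outlier = [2.5, 5.5, 7.0, 19.0]  # same small errors plus one error of 10

print(mean_absolute_error(y_true, y_close), mean_squared_error(y_true, y_close))      # 0.25 0.125
print(mean_absolute_error(y_true, y_outlier), mean_squared_error(y_true, y_outlier))  # 2.75 25.125
```

In this example one outlier multiplies the L1 loss by 11 and the L2 loss by 201, which is why the L2 loss pulls the fitted line towards outliers more strongly.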

In [None]:
# Load the tensorboard extension
%load_ext tensorboard

In [None]:
# Clear any logs from previous runs
!rm -rf ./logs/*

In [19]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
import pandas as pd

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "2"

Notes:

* https://www.youtube.com/watch?v=4-O14gOdRso
* https://stackoverflow.com/questions/78654902/what-is-the-alternative-for-keras-layers-densefeatures-in-tensorflow-2-16 (alternatives to `keras.layers.DenseFeatures` in TensorFlow 2.16)
* `FeatureSpace` utility

In [None]:
# Does the iterator need to be separately defined for mixed models?

# Input 1 - Numerical data

# Input 2 - Categorical data
## Perform feature cross

# Concatenation layer
## Batch normalisation

# Output layer - single dense neuron

In [28]:
def define_feature_columns_layers(data_df, categorical_cols, numeric_cols):
  feature_columns = []
  feature_layer_inputs = {}

  for feature_name in numeric_cols:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))
    feature_layer_inputs[feature_name] = tf.keras.Input(shape=(1,), name=feature_name)
  
  for feature_name in categorical_cols:
    vocabulary = data_df[feature_name].unique()
    cat = tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)
    cat_one_hot = tf.feature_column.indicator_column(cat)
    feature_columns.append(cat_one_hot)
    feature_layer_inputs[feature_name] = tf.keras.Input(shape=(1,), name=feature_name, dtype=tf.int32)
  
  return feature_columns, feature_layer_inputs
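
What the vocabulary list plus indicator column amounts to can be sketched as a plain one-hot encoding. The helper `one_hot` below is hypothetical, and the handling of unseen values (all zeros) is a choice made for this sketch.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of value in vocabulary."""
    # Values missing from the vocabulary map to all zeros in this sketch
    return [1.0 if value == item else 0.0 for item in vocabulary]


chas_vocabulary = [0, 1]  # hypothetical result of data_df['CHAS'].unique()

print(one_hot(1, chas_vocabulary))  # [0.0, 1.0]
print(one_hot(7, chas_vocabulary))  # [0.0, 0.0]
```

Each categorical feature therefore contributes as many input columns to the linear model as it has vocabulary entries.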

In [23]:
def create_interactions(interactions_list, buckets=5):
  feature_columns = []
  
  for (a, b) in interactions_list:
    crossed_feature = tf.feature_column.crossed_column([a, b], hash_bucket_size=buckets)
    crossed_feature_one_hot = tf.feature_column.indicator_column(crossed_feature)
    feature_columns.append(crossed_feature_one_hot)
  
  return feature_columns

In [24]:
def create_linreg(feature_columns, feature_layer_inputs, optimizer):
  feature_layer = keras.layers.DenseFeatures(feature_columns)
  feature_layer_outputs = feature_layer(feature_layer_inputs)
  norm = keras.layers.BatchNormalization()(feature_layer_outputs)
  outputs = keras.layers.Dense(1, kernel_initializer='normal', activation='linear')(norm)

  model = keras.Model(inputs=[v for v in feature_layer_inputs.values()], outputs=outputs)
  model.compile(optimizer=optimizer, loss='mean_squared_error')

  return model
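
The batch normalisation followed by a single linear unit can be sketched by hand. This is a simplified, training-time view under stated assumptions: it uses the statistics of the current batch, omits the learnable scale and shift and the moving averages a real layer keeps for inference, and the weights and bias are hypothetical. The batch values are the TAX and RM entries of three rows from the table shown earlier.

```python
import math


def batch_normalise(batch, epsilon=1e-3):
    """Standardise each column of a batch to zero mean and approximately unit variance."""
    n_rows = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n_rows for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n_rows for col, m in zip(columns, means)]
    return [
        [(v - m) / math.sqrt(var + epsilon) for v, m, var in zip(row, means, variances)]
        for row in batch
    ]


def dense_linear(rows, weights, bias):
    """A single linear unit: one weighted sum plus bias per row."""
    return [sum(w * v for w, v in zip(weights, row)) + bias for row in rows]


# Two features on very different scales: TAX and RM
batch = [[296.0, 6.575], [242.0, 6.421], [222.0, 6.998]]
normalised = batch_normalise(batch)
predictions = dense_linear(normalised, weights=[0.5, -0.5], bias=22.0)  # hypothetical weights

print(normalised)
print(predictions)
```

After normalisation both features are on a comparable scale, so neither dominates the gradient updates purely because of its units.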

In [46]:
def create_lingalg(feature_columns, optimiser):
  # feature_columns is treated here as an iterable of feature names (strings);
  # one scalar Keras Input is created per name
  inputs = {colname: keras.layers.Input(shape=(1,), name=colname) for colname in feature_columns}

  # Concatenate the (batch, 1) inputs into a single (batch, n_features) tensor
  # (the Concatenate layer needs at least two inputs)
  preprocessed_inputs = keras.layers.Concatenate(axis=-1)(list(inputs.values()))
  norm = keras.layers.BatchNormalization()(preprocessed_inputs)
  outputs = keras.layers.Dense(1, kernel_initializer='normal', activation='linear')(norm)

  model = keras.Model(inputs=inputs, outputs=outputs)
  model.compile(optimizer=optimiser, loss='mean_squared_error')

  return model


In [None]:
categorical_cols = ['CHAS', 'RAD']
numeric_cols = ['CRIM', 'ZN', 'INDUS',  'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
feature_columns, feature_layer_inputs = define_feature_columns_layers(data, categorical_cols, numeric_cols)
interactions_columns = create_interactions([['RM', 'LSTAT']])

feature_columns += interactions_columns

CRIM
ZN
INDUS
NOX
RM
AGE
DIS
TAX
PTRATIO
B
LSTAT
CHAS_indicator
RAD_indicator
LSTAT_X_RM_indicator


In [48]:
feature_columns

[NumericColumn(key='CRIM', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='ZN', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='INDUS', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='NOX', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='RM', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='AGE', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='DIS', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='TAX', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='PTRATIO', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='B', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='LSTAT',