# Linear Regression

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 30/12/2024   | Martin | Created   | Started Linear Regression section. Completed estimators section. Working on mixed input model to adapt `DenseFeatures` since it's deprecated | 
| 31/12/2024   | Martin | Created   | Completed L1 and L2 error section. Did not create mixed input model, but a `FeatureSpace` model  | 

# Content

* [Introduction](#introduction)
* [Tensorflow and Linear Regression](#tensorflow-and-linear-regression)
* [Note on Estimators](#note-on-estimators)
* [Understanding Loss Functions in Linear Regression](#understanding-loss-functions-in-lr)

# Introduction

Linear regression is a fundamental machine learning algorithm. One key benefit is that it is very interpretable: each coefficient gives the change in the final prediction (in magnitude and direction) for a one-unit change in the corresponding feature, with the other features held fixed.
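
To make the interpretability point concrete, here is a minimal sketch using NumPy on synthetic, noise-free data. The data, the feature names `x1` and `x2`, and the true coefficients are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical synthetic data: y = 3*x1 - 2*x2 + 5, with no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# Append a column of ones so the intercept is learned as a coefficient
X_design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
w1, w2, intercept = coef

# Raising x1 by one unit (x2 held fixed) changes the prediction by w1
base = np.array([1.0, 1.0, 1.0]) @ coef
bumped = np.array([2.0, 1.0, 1.0]) @ coef

print(f"w1={w1:.2f}, w2={w2:.2f}, intercept={intercept:.2f}")
print(f"effect of +1 on x1: {bumped - base:.2f}")
```

The sign of each coefficient gives the direction of the effect and its absolute value gives the size, which is what makes the model easy to read.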

TensorFlow offers two ways to implement linear regression: _Estimators_ and _Keras_.

Goals of module:

* Tensorflow way of regression
* Turning a Keras model into an Estimator
* Understanding the loss functions in linear regression
* Implementing Lasso and Ridge regression
* Implementing logistic regression
* Resorting to non-linear regression
* Using Wide & Deep models

# Tensorflow and Linear Regression

Estimators and Keras provide the APIs to implement linear regression.

__Estimators__

Estimators are pre-made procedures that simplify training, evaluation, prediction and exporting of models. Estimators can be deployed on any hardware to serve a trained model.

4 steps when developing an Estimator model:

1. Acquire data (using `tf.data` functions)
2. Instantiate the feature columns
3. Instantiate and train the Estimator
4. Evaluate the model's performance

In [2]:
import tensorflow as tf
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "2"

In [3]:
# Retrieve data
tfds.disable_progress_bar()
housing_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
path = tf.keras.utils.get_file(housing_url.split("/")[-1], housing_url)

columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_table(path, sep=r"\s+", header=None, names=columns)
data.head()



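|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| - | ---- | -- | ----- | ---- | --- | -- | --- | --- | --- | --- | ------- | - | ----- | ---- |
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |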


In [4]:
# Split data into training and testing sets
np.random.seed(1)
train = data.sample(frac=0.8).copy()
y_train = train['MEDV']
train.drop('MEDV', axis=1, inplace=True)

test = data.loc[~data.index.isin(train.index)].copy()
y_test = test['MEDV']
test.drop('MEDV', axis=1, inplace=True)

In [5]:
def define_feature_columns(data_df, categorical_cols, numeric_cols):
  feature_columns = []

  for feature_name in numeric_cols:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

  for feature_name in categorical_cols:
    vocabulary = data_df[feature_name].unique()
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))
  
  return feature_columns

In [6]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=256):

  def input_function():
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
    if shuffle:
      ds = ds.shuffle(1000)
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds
  
  return input_function

In [8]:
categorical_cols = ['CHAS', 'RAD']
numeric_cols = ['CRIM', 'ZN', 'INDUS',  'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
feature_columns = define_feature_columns(data, categorical_cols, numeric_cols)

train_input_fn = make_input_fn(train, y_train, num_epochs=1400)
test_input_fn = make_input_fn(test, y_test, shuffle=False, num_epochs=1)

In [None]:
linear_est = tf.estimator.LinearRegressor(feature_columns=feature_columns)
linear_est.train(train_input_fn)
results = linear_est.evaluate(test_input_fn)

print(results)

‼️ _Note: Estimators are no longer supported in TF as of 2.16_

## Review

Estimators take the data from the input functions and convert it to the proper form by matching each feature name to its feature column. Convergence to the line of best fit depends on the number of iterations, batch size, learning rate, and loss function.

It is good practice to observe the loss over time (ideally in real time) to troubleshoot model problems or to see the effect of hyperparameter changes.
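
The effect of these hyperparameters can be seen without any framework. The following is a minimal NumPy sketch of mini-batch gradient descent on a mean squared error loss, not the Estimator's actual training loop. The function `fit_linear_sgd`, the synthetic data, and the two learning rates are assumptions made for illustration. It records the loss after every epoch so the two runs can be compared.

```python
import numpy as np

def fit_linear_sgd(X, y, learning_rate=0.1, epochs=50, batch_size=32, seed=0):
    """Mini-batch gradient descent on MSE; returns weights, bias, per-epoch loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    history = []
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the rows every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]
            # Gradient of the mean squared error with respect to w and b
            w -= learning_rate * 2.0 * X[idx].T @ err / len(idx)
            b -= learning_rate * 2.0 * err.mean()
        # Track the full-dataset loss once per epoch
        history.append(float(np.mean((X @ w + b - y) ** 2)))
    return w, b, history

# Hypothetical synthetic regression problem with a small amount of noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

histories = {}
for lr in (0.001, 0.1):
    _, _, histories[lr] = fit_linear_sgd(X, y, learning_rate=lr)
    print(f"lr={lr}: first epoch {histories[lr][0]:.3f}, last epoch {histories[lr][-1]:.3f}")
```

With the same number of epochs, the smaller learning rate is still far from the noise floor while the larger one has converged, which is the kind of behaviour a loss curve makes visible early.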

## Interactions to improve performance

Combining two variables into an interaction can explain the target value better than the individual features do. For example, the average number of rooms per house combined with the proportion of lower-income population in an area reveals more about the type of neighbourhood than either feature alone.

`tf.feature_column.crossed_column` - creates a combined categorical value and applies a hashing function to limit the cardinality of the output
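
The idea of hashing a crossed value into a fixed number of buckets can be sketched in plain Python. This is only an illustration of the principle: the helper `hashed_cross`, the bin labels, and the use of MD5 are assumptions, and MD5 is a stand-in rather than the hash function TensorFlow uses internally.

```python
import hashlib

def hashed_cross(a, b, hash_bucket_size=5):
    """Map a pair of categorical values to one of hash_bucket_size buckets."""
    key = f"{a}_X_{b}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    # The modulo caps the number of distinct outputs, whatever the input cardinality
    return int(digest, 16) % hash_bucket_size

# Hypothetical binned values for RM (rooms) and LSTAT (lower-status share)
pairs = [
    ("rm_high", "lstat_low"),
    ("rm_low", "lstat_high"),
    ("rm_high", "lstat_high"),
]
for rm_bin, lstat_bin in pairs:
    print(rm_bin, lstat_bin, "->", hashed_cross(rm_bin, lstat_bin))
```

A small bucket count keeps the model compact but makes collisions more likely, where unrelated combinations share one bucket and therefore one weight.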

In [None]:
def create_interactions(interactions_list, buckets=5):
    interactions = list()
    for (a, b) in interactions_list:
        interactions.append(tf.feature_column.crossed_column([a, b], hash_bucket_size=buckets))
    return interactions

derived_feature_columns = create_interactions([['RM', 'LSTAT']])
linear_est = tf.estimator.LinearRegressor(feature_columns=feature_columns+derived_feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(test_input_fn)

print(result)

---

# Note on Estimators

Since the `tf.estimator` API is deprecated (TensorFlow 2.15 is the last release that includes it), any sections that rely on it will be skipped.

---

# Understanding Loss Functions in LR 

Comparison between mean absolute error (L1 loss) and mean squared error (L2 loss).
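
Before training with each loss, it helps to see how they react to the same errors. The sketch below uses NumPy and made-up target and prediction values (all hypothetical). L1 averages the absolute errors, L2 averages the squared errors.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """L1 loss: mean of |y_true - y_pred|."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mean_squared_error(y_true, y_pred):
    """L2 loss: mean of (y_true - y_pred)^2."""
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([20.0, 22.0, 25.0, 30.0])
y_pred_uniform = np.array([21.0, 21.0, 26.0, 29.0])  # every error has size 1
y_pred_outlier = np.array([20.0, 22.0, 25.0, 34.0])  # a single error of size 4

for name, y_pred in [("uniform errors", y_pred_uniform), ("single outlier", y_pred_outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name}: MAE={mae}, MSE={mse}")
```

Both sets of predictions have an MAE of 1.0, but the MSE is 1.0 for the uniform errors and 4.0 for the single outlier. Squaring makes L2 penalise large errors much more heavily, so an L2-trained model is pulled harder towards outliers than an L1-trained one.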

In [1]:
# Load the tensorboard extension
%load_ext tensorboard

In [2]:
# Clear any logs from previous runs
!rm -rf ./logs/*

In [15]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
import pandas as pd
from tensorflow.keras.utils import FeatureSpace

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "2"

In [16]:
import tensorflow_datasets as tfds
# Retrieve data
tfds.disable_progress_bar()

housing_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
path = tf.keras.utils.get_file(housing_url.split("/")[-1], housing_url)

columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_table(path, sep=r"\s+", header=None, names=columns)
# data['RM_int'] = round(data['RM']).astype(int) # Converting RM to integer by rounding
# data['LSTAT_int'] = round(data['LSTAT']).astype(int) # Converting LSTAT to integer by rounding

# Split into train and test data
np.random.seed(1)
train = data.sample(frac=0.8).copy()
y_train = train['MEDV']
train.drop('MEDV', axis=1, inplace=True)

test = data.loc[~data.index.isin(train.index)].copy()
y_test = test['MEDV']
test.drop('MEDV', axis=1, inplace=True)



In [17]:
# Convert dataframe into tensor dataset
def dataframe_to_dataset(x, y):
  ds = tf.data.Dataset.from_tensor_slices(( x.to_dict(orient='list'), y ))
  ds = ds.shuffle(buffer_size=len(x))
  return ds

train_ds = dataframe_to_dataset(train, y_train).batch(32).repeat(1400)
test_ds = dataframe_to_dataset(test, y_test).batch(32).repeat(1400)

In [18]:
def create_feature_space(numeric_cols, categorical_cols, crossed_cols):
  """
  A FeatureSpace helps to create a preprocessing pipeline that maps features to the
  right data types and performs additional feature engineering tasks like feature crossing
  """
  feature_space_mapping = {}

  # Define the data type
  for col in numeric_cols:
    feature_space_mapping[col] = 'float'
  
  for col in categorical_cols:
    feature_space_mapping[col] = 'integer_categorical'

  for col in crossed_cols:
    feature_space_mapping[col] = FeatureSpace.float_discretized(num_bins=7)
  
  # Create the FeatureSpace object
  feature_space = FeatureSpace(
    features=feature_space_mapping,
    crosses=[("RM", "LSTAT")], # RM and LSTAT are discretized above, so they can be crossed
    output_mode='concat',
    crossing_dim=5
  )

  return feature_space

categorical_cols = ['CHAS', 'RAD']
numeric_cols = ['CRIM', 'ZN', 'INDUS',  'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
crossed_cols = ['RM', 'LSTAT']

# Instantiate the FeatureSpace class
feature_space = create_feature_space(numeric_cols, categorical_cols, crossed_cols)
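
What `float_discretized(num_bins=7)` does to a column can be approximated in NumPy: learn bin edges from the data, assign each value to a bin, and one-hot encode the bin index. This is a rough sketch under the assumption that the edges sit at evenly spaced quantiles. The helpers `fit_bin_edges` and `discretize_one_hot` and the synthetic `rm_like` column are hypothetical and do not reproduce Keras's exact boundaries.

```python
import numpy as np

def fit_bin_edges(values, num_bins=7):
    """Return num_bins - 1 interior edges at evenly spaced quantiles."""
    quantiles = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def discretize_one_hot(values, edges):
    """Assign each value to a bin index, then one-hot encode it."""
    bin_ids = np.digitize(values, edges)  # integers in [0, len(edges)]
    return np.eye(len(edges) + 1)[bin_ids]

rng = np.random.default_rng(0)
rm_like = rng.normal(loc=6.3, scale=0.7, size=400)  # hypothetical stand-in for RM

edges = fit_bin_edges(rm_like, num_bins=7)
encoded = discretize_one_hot(rm_like, edges)

print(encoded.shape)        # one row per value, one column per bin
print(encoded.sum(axis=0))  # how many values fell into each bin
```

Once a float column is reduced to a handful of bins it behaves like a categorical feature, which is why the discretized `RM` and `LSTAT` can be crossed.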

In [19]:
# "Train" the feature space on training data without labels
train_ds_with_no_labels = train_ds.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)



In [20]:
# Example of feature output of feature space
for x, _ in train_ds.take(1):
    preprocessed_x = feature_space(x)
    print(f"preprocessed_x shape: {preprocessed_x.shape}")
    print(f"preprocessed_x sample: \n{preprocessed_x[0]}")

preprocessed_x shape: (32, 41)
preprocessed_x sample: 
[4.0500e+01 3.9290e+02 0.0000e+00 1.0000e+00 0.0000e+00 1.4320e-02
 8.3248e+00 1.3200e+00 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
 0.0000e+00 0.0000e+00 0.0000e+00 4.1100e-01 1.5100e+01 0.0000e+00
 0.0000e+00 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
 0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00 2.5600e+02 1.0000e+02
 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00]


In [21]:
# Apply the FeatureSpace transformation to each batch while the model trains on the current batch
## num_parallel_calls - number of batches to transform in parallel. AUTOTUNE lets tf.data choose the value
## .prefetch - ensures that the next batch of data is prepared while the current batch is training
preprocessed_train_ds = train_ds.map(
  lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

preprocessed_valid_ds = test_ds.map(
  lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

In [11]:
# Define the model - L2 Error (Mean Squared Error)
# Input layer
encoded_features = feature_space.get_encoded_features()

# Batch normalisation
norm_layer = keras.layers.BatchNormalization()(encoded_features)

# Output
output_layer = keras.layers.Dense(1, kernel_initializer='normal', activation='linear')(norm_layer)

model = keras.Model(inputs=encoded_features, outputs=output_layer)
optimiser = keras.optimizers.Ftrl(learning_rate=0.02)
model.compile(optimizer=optimiser, loss='mean_squared_error')

In [12]:
model.summary()

In [13]:
model.fit(
  preprocessed_train_ds,
  validation_data=preprocessed_valid_ds,
  epochs=5,
  verbose=2
)

Epoch 1/5




18200/18200 - 23s - 1ms/step - accuracy: 0.0000e+00 - loss: 24.2498 - val_accuracy: 0.0000e+00 - val_loss: 24.5685
Epoch 2/5
18200/18200 - 17s - 931us/step - accuracy: 0.0000e+00 - loss: 19.4317 - val_accuracy: 0.0000e+00 - val_loss: 23.9247
Epoch 3/5
18200/18200 - 17s - 931us/step - accuracy: 0.0000e+00 - loss: 19.3068 - val_accuracy: 0.0000e+00 - val_loss: 23.6952
Epoch 4/5
18200/18200 - 18s - 968us/step - accuracy: 0.0000e+00 - loss: 19.2773 - val_accuracy: 0.0000e+00 - val_loss: 23.5556
Epoch 5/5
18200/18200 - 18s - 980us/step - accuracy: 0.0000e+00 - loss: 19.1586 - val_accuracy: 0.0000e+00 - val_loss: 23.4690


<keras.src.callbacks.history.History at 0x7fbc3a6b1390>

In [15]:
inference_model = keras.Model(
  inputs=feature_space.get_inputs(),
  outputs=model(encoded_features)
)

# Sample first 5 data for testing
sample_test = test.iloc[:5]
y_sample_test = y_test[:5]

# Convert to featureSpace format and predict
for i in range(5):
  item = sample_test.to_dict(orient='records')[i]
  input_dict = {
    name: keras.ops.convert_to_tensor([value]) for name, value in item.items()
  }
  prediction = inference_model.predict(input_dict)[0][0]
  actual = y_sample_test.iloc[i]

  print(f"The predicted value is {prediction:.2f} | Actual value: {actual}")

The predicted value is 25.72 | Actual value: 21.6
The predicted value is 34.62 | Actual value: 33.4
The predicted value is 17.04 | Actual value: 27.1
The predicted value is 19.44 | Actual value: 19.9
The predicted value is 14.25 | Actual value: 15.2


In [22]:
# Define the model - L1 Error (Mean Absolute Error)
# Input layer
encoded_features = feature_space.get_encoded_features()

# Batch normalisation
norm_layer = keras.layers.BatchNormalization()(encoded_features)

# Output
output_layer = keras.layers.Dense(1, kernel_initializer='normal', activation='linear')(norm_layer)

model = keras.Model(inputs=encoded_features, outputs=output_layer)
optimiser = keras.optimizers.Ftrl(learning_rate=0.02)
model.compile(optimizer=optimiser, loss='mean_absolute_error')

In [23]:
model.fit(
  preprocessed_train_ds,
  validation_data=preprocessed_valid_ds,
  epochs=5,
  verbose=2
)

Epoch 1/5




18200/18200 - 18s - 1ms/step - loss: 3.3440 - val_loss: 3.1813
Epoch 2/5
18200/18200 - 19s - 1ms/step - loss: 3.0084 - val_loss: 3.1562
Epoch 3/5
18200/18200 - 18s - 969us/step - loss: 2.9919 - val_loss: 3.1438
Epoch 4/5
18200/18200 - 17s - 951us/step - loss: 2.9802 - val_loss: 3.1339
Epoch 5/5
18200/18200 - 17s - 912us/step - loss: 2.9779 - val_loss: 3.1294


<keras.src.callbacks.history.History at 0x7f68ef6c78d0>

In [24]:
inference_model = keras.Model(
  inputs=feature_space.get_inputs(),
  outputs=model(encoded_features)
)

# Sample first 5 data for testing
sample_test = test.iloc[:5]
y_sample_test = y_test[:5]

# Convert to featureSpace format and predict
for i in range(5):
  item = sample_test.to_dict(orient='records')[i]
  input_dict = {
    name: keras.ops.convert_to_tensor([value]) for name, value in item.items()
  }
  prediction = inference_model.predict(input_dict)[0][0]
  actual = y_sample_test.iloc[i]

  print(f"The predicted value is {prediction:.2f} | Actual value: {actual}")

The predicted value is 27.28 | Actual value: 21.6
The predicted value is 33.14 | Actual value: 33.4
The predicted value is 22.10 | Actual value: 27.1
The predicted value is 19.97 | Actual value: 19.9
The predicted value is 20.37 | Actual value: 15.2


Take note of the difference between `FeatureSpace.get_inputs()` and `FeatureSpace.get_encoded_features()`

* `get_inputs()` - retrieves a dictionary of Keras Input objects that correspond to the raw features defined in the `FeatureSpace`. These inputs are used to specify the shape and type of data that the model will accept during training and inference.

* `get_encoded_features()` - returns the corresponding encoded Keras tensors that have been preprocessed according to the specified feature encoding strategies. This is the output that will be fed into the model for training and inference.

In [19]:
feature_space.get_inputs()

{'CRIM': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=CRIM>,
 'ZN': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=ZN>,
 'INDUS': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=INDUS>,
 'NOX': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=NOX>,
 'RM': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=RM>,
 'AGE': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=AGE>,
 'DIS': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=DIS>,
 'TAX': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=TAX>,
 'PTRATIO': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=PTRATIO>,
 'B': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=B>,
 'LSTAT': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=LSTAT>,
 'CHAS': <KerasTensor shape=(None, 1), dtype=int32, sparse=None, name=CHAS>,
 'RAD': <KerasTensor shape=(None, 1), dtype=int32, sparse=None, na

In [20]:
feature_space.get_encoded_features()

<KerasTensor shape=(None, 41), dtype=float32, sparse=False, name=keras_tensor_19>
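
The encoded width of 41 can be accounted for with simple arithmetic. The breakdown below is an assumption inferred from the feature definitions and from the preprocessed sample printed earlier, not something read from the Keras source. In particular, the extra out-of-vocabulary slot for each categorical feature and the count of 9 distinct `RAD` values are assumed.

```python
# Assumed width of each encoded block in the concatenated output
widths = {
    "float features (9 columns, 1 value each)": 9,
    "RM discretized, one-hot over 7 bins": 7,
    "LSTAT discretized, one-hot over 7 bins": 7,
    "CHAS one-hot (2 values + 1 out-of-vocabulary slot)": 3,
    "RAD one-hot (9 values + 1 out-of-vocabulary slot)": 10,
    "RM x LSTAT cross (crossing_dim=5)": 5,
}

total_width = sum(widths.values())
for name, width in widths.items():
    print(f"{width:>2}  {name}")
print(f"{total_width:>2}  total")
```

The total matches the `(None, 41)` shape, which suggests the breakdown is at least consistent with what `FeatureSpace` produced here.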

### Useful resources for code examples

* [End-to-end FeatureSpace Keras Model](https://keras.io/examples/structured_data/feature_space_advanced/)
* [Converting from pandas df to tensorflow dataset](https://www.tensorflow.org/tutorials/load_data/pandas_dataframe)
* [Mixed input model tutorial](https://www.youtube.com/watch?v=4-O14gOdRso)
* [Tensorflow docs on FeatureSpace](https://www.tensorflow.org/api_docs/python/tf/keras/utils/FeatureSpace#feature)

---