# Logistic Regression

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 02/01/2025   | Martin | Created   | Changes Made | 

# Content

* [Introduction](#introduction)
* [Logistic Regression](#logistic-regression)
* [Non-linear Solutions](#non-linear-solutions)

# Introduction

Implement logistic regression to predict the probability of breast cancer using the Breast Cancer Wisconsin dataset.

Predict the diagnosis (benign or malignant) from nine numeric features describing cell samples. The file loaded below contains 699 samples: 458 benign and 241 malignant. The two classes are __NOT__ the same size (balanced classes are desirable for classification models), but the imbalance is not extreme.

⚠️ __ALERT__: Always check whether the __classes are balanced__ before training the model. For imbalanced datasets, you can use [class-weighted training](https://datascience.stackexchange.com/questions/13490/how-to-set-class-weights-for-imbalanced-classes-in-keras).
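
As a rough illustration of class weighting, the sketch below computes one common heuristic: each class is weighted by `n_samples / (n_classes * class_count)`, so the rarer class gets the larger weight. The helper name `balanced_class_weights` and the toy labels are hypothetical and are not part of the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 6 negatives and 3 positives
labels = [0] * 6 + [1] * 3
class_weights = balanced_class_weights(labels)
print(class_weights)  # {0: 0.75, 1: 1.5}
```

A dictionary of this shape (class index to weight) is what Keras `model.fit` accepts through its `class_weight` argument.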

When classifying, there is always a chance of false positives and false negatives. The right classification threshold depends on the use case: a higher threshold produces more false negatives and fewer false positives, and a lower threshold does the opposite.

🥬 __NOTE__: Setting the threshold at 0.5 treats false positives and false negatives as equally costly.
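
To make the threshold trade-off concrete, here is a minimal sketch that counts false positives and false negatives at three thresholds. The probabilities and labels are made-up values for illustration, and `confusion_counts` is a hypothetical helper, not something defined in the notebook.

```python
def confusion_counts(probabilities, labels, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    false_pos = false_neg = 0
    for p, y in zip(probabilities, labels):
        predicted = int(p > threshold)
        if predicted == 1 and y == 0:
            false_pos += 1
        elif predicted == 0 and y == 1:
            false_neg += 1
    return false_pos, false_neg


# Hypothetical predicted probabilities and true labels (1 = malignant)
probabilities = [0.10, 0.35, 0.45, 0.55, 0.65, 0.80, 0.90]
labels        = [0,    0,    1,    0,    1,    1,    1]

for threshold in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(probabilities, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
# threshold=0.3: false positives=2, false negatives=0
# threshold=0.5: false positives=1, false negatives=1
# threshold=0.7: false positives=0, false negatives=2
```

For a cancer screen, where a missed malignant case is usually the costlier error, this kind of table is one way to justify a threshold below 0.5.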

# Logistic Regression

In [14]:
import os

# Set the log-level variables before importing TensorFlow so that they take effect
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "2"

import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
from tensorflow.keras.utils import FeatureSpace

In [15]:
breast_cancer = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
path = tf.keras.utils.get_file(breast_cancer.split("/")[-1], breast_cancer)

columns = ['sample_code', 'clump_thickness', 'cell_size_uniformity', 'cell_shape_uniformity',
           'marginal_adhesion', 'single_epithelial_cell_size', 'bare_nuclei', 'bland_chromatin',
           'normal_nucleoli', 'mitoses', 'class']

data = pd.read_csv(
  path,
  header=None,
  names=columns,
  na_values=[np.nan, '?']
)
data = data.fillna(data.median())  # fill missing values with each column's median

# class == 4 is the malignant (positive) case; class == 2 is benign
np.random.seed(1)
train = data.sample(frac=0.8).copy()
y_train = (train['class']==4).astype(int)
train.drop(['sample_code', 'class'], axis=1, inplace=True)

test = data.loc[~data.index.isin(train.index)].copy()
y_test = (test['class']==4).astype(int)
test.drop(['sample_code', 'class'], axis=1, inplace=True)

In [16]:
# Convert dataframe into tensor dataset
def dataframe_to_dataset(x, y):
  ds = tf.data.Dataset.from_tensor_slices(( x.to_dict(orient='list'), y ))
  ds = ds.shuffle(buffer_size=len(x))
  return ds

def create_feature_space(numeric_cols, categorical_cols):
  """
  A FeatureSpace helps to create a preprocessing pipeline that maps features to the
  right data types and performs additional feature engineering tasks like feature crossing
  """
  feature_space_mapping = {}

  # Define the data type
  for col in numeric_cols:
    feature_space_mapping[col] = 'float'
  
  for col in categorical_cols:
    feature_space_mapping[col] = 'integer_categorical'
  
  # Create the FeatureSpace object
  feature_space = FeatureSpace(
    features=feature_space_mapping,
    output_mode='concat',
    crossing_dim=5
  )

  return feature_space

In [17]:
categorical_cols = []
numeric_cols = ['clump_thickness', 'cell_size_uniformity', 'cell_shape_uniformity',
                'marginal_adhesion', 'single_epithelial_cell_size', 'bare_nuclei', 'bland_chromatin',
                'normal_nucleoli', 'mitoses']

# Instantiate the FeatureSpace class
feature_space = create_feature_space(numeric_cols, categorical_cols)

train_ds = dataframe_to_dataset(train, y_train).batch(32)
test_ds = dataframe_to_dataset(test, y_test).batch(32)

# "Train" the feature space on training data without labels
train_ds_with_no_labels = train_ds.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)

# Apply the feature-space transformation inside the tf.data pipeline, so each batch is preprocessed while the model trains on earlier batches
preprocessed_train_ds = train_ds.map(
  lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

preprocessed_valid_ds = test_ds.map(
  lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

In [18]:
# Example of feature output of feature space
for x, _ in train_ds.take(1):
    preprocessed_x = feature_space(x)
    print(f"preprocessed_x shape: {preprocessed_x.shape}")
    print(f"preprocessed_x sample: \n{preprocessed_x[0]}")

preprocessed_x shape: (32, 9)
preprocessed_x sample: 
[1. 2. 1. 1. 1. 1. 1. 1. 1.]



In [34]:
# Define the model
regulariser = keras.regularizers.l2(0.01)

# Input layer
encoded_features = feature_space.get_encoded_features()

# Batch normalisation
norm_layer = keras.layers.BatchNormalization()(encoded_features)

# Output
output_layer = keras.layers.Dense(1,
  kernel_initializer='normal',
  kernel_regularizer=regulariser,
  activation='sigmoid' # same layout as linear regression; the sigmoid activation maps the output to a probability
)(norm_layer)

model = keras.Model(inputs=encoded_features, outputs=output_layer)
optimiser = keras.optimizers.Ftrl(learning_rate=0.007)
# The loss also changes: binary cross-entropy penalises confident but wrong probability predictions
model.compile(optimizer=optimiser, loss='binary_crossentropy', metrics=['accuracy'])
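
To show what the sigmoid output layer and the `binary_crossentropy` loss in the cell above compute, here is a small pure-Python sketch for a single example. The weights, bias, and feature values are hypothetical, and the helper names are not part of Keras.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p):
    """Loss for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights, bias, and a two-feature example
weights, bias = [0.8, -0.4], -1.0
features = [3.0, 1.0]

p = predict_proba(features, weights, bias)  # z = 2.4 - 0.4 - 1.0 = 1.0
print(f"p = {p:.4f}")                                            # p = 0.7311
print(f"loss if label is 1: {binary_cross_entropy(1, p):.4f}")   # 0.3133
print(f"loss if label is 0: {binary_cross_entropy(0, p):.4f}")   # 1.3133
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confident in the wrong class.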

In [35]:
model.fit(
  preprocessed_train_ds,
  validation_data=preprocessed_valid_ds,
  epochs=300,
  verbose=2
)

Epoch 1/300
18/18 - 1s - 72ms/step - accuracy: 0.6547 - loss: 0.6901 - val_accuracy: 0.6357 - val_loss: 0.6887
Epoch 2/300
18/18 - 0s - 4ms/step - accuracy: 0.6601 - loss: 0.6864 - val_accuracy: 0.6357 - val_loss: 0.6860
Epoch 3/300
18/18 - 0s - 3ms/step - accuracy: 0.6601 - loss: 0.6836 - val_accuracy: 0.6357 - val_loss: 0.6839
Epoch 4/300
18/18 - 0s - 3ms/step - accuracy: 0.6601 - loss: 0.6814 - val_accuracy: 0.6357 - val_loss: 0.6821
Epoch 5/300
18/18 - 0s - 3ms/step - accuracy: 0.6601 - loss: 0.6794 - val_accuracy: 0.6357 - val_loss: 0.6806
Epoch 6/300
18/18 - 0s - 3ms/step - accuracy: 0.6601 - loss: 0.6777 - val_accuracy: 0.6357 - val_loss: 0.6792
Epoch 7/300
18/18 - 0s - 4ms/step - accuracy: 0.6601 - loss: 0.6761 - val_accuracy: 0.6357 - val_loss: 0.6778
Epoch 8/300
18/18 - 0s - 3ms/step - accuracy: 0.6601 - loss: 0.6745 - val_accuracy: 0.6357 - val_loss: 0.6764
Epoch 9/300
18/18 - 0s - 3ms/step - accuracy: 0.6601 - loss: 0.6729 - val_accuracy: 0.6357 - val_loss: 0.6750
Epoch 10/

<keras.src.callbacks.history.History at 0x7ff8bc12f8d0>

In [36]:
inference_model = keras.Model(
  inputs=feature_space.get_inputs(),
  outputs=model(encoded_features)
)

# Sample first 5 data for testing
sample_test = test.iloc[:5]
y_sample_test = y_test[:5]

# Convert to featureSpace format and predict
for i in range(5):
  item = sample_test.to_dict(orient='records')[i]
  input_dict = {
    name: keras.ops.convert_to_tensor([value]) for name, value in item.items()
  }
  prediction = inference_model.predict(input_dict)[0][0]
  prediction = int(prediction > 0.5)
  actual = y_sample_test.iloc[i]

  print(f"The predicted value is {prediction} | Actual value: {actual}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 85ms/step
The predicted value is 0 | Actual value: 0
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
The predicted value is 0 | Actual value: 0
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
The predicted value is 1 | Actual value: 1
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
The predicted value is 1 | Actual value: 1
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
The predicted value is 0 | Actual value: 0


## Performance review

The model reaches about 95% validation accuracy on a reasonably balanced dataset.

---

# Non-linear Solutions

A linear model does not capture more complex relationships well. Non-linear models extend the linear model so that it can represent such relationships, at the cost of some interpretability.
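
A standard illustration of this limit is the XOR pattern, which no single linear threshold on the two inputs can separate. The sketch below uses a tiny network with two ReLU hidden units and hand-picked (not trained) weights to reproduce XOR exactly. The function names and weights are assumptions chosen for the example.

```python
def relu(z):
    return max(0.0, z)


def tiny_network(x1, x2):
    """Two ReLU hidden units followed by a linear output, with hand-picked weights."""
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    return h1 - 2.0 * h2      # cancel the "both on" case


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", tiny_network(x1, x2))
# (0, 0) -> 0.0
# (0, 1) -> 1.0
# (1, 0) -> 1.0
# (1, 1) -> 0.0
```

Removing the ReLU would collapse the two layers into one linear function of `x1` and `x2`, which is why the activation is what adds the extra modelling capacity.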

_Support Vector Machines (SVMs)_ are one family of algorithms that introduce non-linearity: with a suitable __kernel__, they can model more complex data distributions.

`RandomFourierFeatures` applies a non-linear transformation to the input. Depending on the loss function specified, the model is then optimised as a kernel-based classifier or regressor.

In [7]:
import tensorflow_models as tfm

In [13]:
# Data processing steps remain the same

# Add a non-linear layer to the model definition (a Dense ReLU layer here, in place of RandomFeatureGaussianProcess)
regulariser = keras.regularizers.l2(0.01)

# Input layer
encoded_features = feature_space.get_encoded_features()

# Normalisation layer
norm_layer = keras.layers.BatchNormalization()(encoded_features)

# Non-linear layer
## Not exactly a Fourier transform, but it does introduce some non-linearity
nonlinear_layer = keras.layers.Dense(64, activation='relu')(norm_layer)

# Output layer
outputs = keras.layers.Dense(
  1,
  kernel_initializer='normal',
  kernel_regularizer=regulariser,
  activation='sigmoid'
)(nonlinear_layer)

# Define Model
model = keras.Model(
  inputs=encoded_features,
  outputs=outputs
)

optimiser = keras.optimizers.Adam(learning_rate=0.00005)
model.compile(
  optimizer=optimiser,
  loss='hinge',  # Keras converts 0/1 labels to -1/1 for hinge; the sigmoid output stays in (0, 1)
  metrics=['accuracy']
)

In [19]:
model.fit(
  preprocessed_train_ds,
  validation_data=preprocessed_valid_ds,
  epochs=300,
  verbose=2
)

Epoch 1/300
18/18 - 1s - 41ms/step - accuracy: 0.3810 - loss: 1.1817 - val_accuracy: 0.4929 - val_loss: 1.1939
Epoch 2/300
18/18 - 0s - 4ms/step - accuracy: 0.4436 - loss: 1.1780 - val_accuracy: 0.3429 - val_loss: 1.1825
Epoch 3/300
18/18 - 0s - 4ms/step - accuracy: 0.5510 - loss: 1.1738 - val_accuracy: 0.2786 - val_loss: 1.1728
Epoch 4/300
18/18 - 0s - 4ms/step - accuracy: 0.5814 - loss: 1.1698 - val_accuracy: 0.4214 - val_loss: 1.1634
Epoch 5/300
18/18 - 0s - 4ms/step - accuracy: 0.6530 - loss: 1.1662 - val_accuracy: 0.5000 - val_loss: 1.1551
Epoch 6/300
18/18 - 0s - 3ms/step - accuracy: 0.6834 - loss: 1.1618 - val_accuracy: 0.6143 - val_loss: 1.1476
Epoch 7/300
18/18 - 0s - 3ms/step - accuracy: 0.6995 - loss: 1.1582 - val_accuracy: 0.6643 - val_loss: 1.1412
Epoch 8/300
18/18 - 0s - 3ms/step - accuracy: 0.7138 - loss: 1.1537 - val_accuracy: 0.7000 - val_loss: 1.1353
Epoch 9/300
18/18 - 0s - 3ms/step - accuracy: 0.7567 - loss: 1.1501 - val_accuracy: 0.7429 - val_loss: 1.1297
Epoch 10/

<keras.src.callbacks.history.History at 0x7fbfc5fd54d0>

## Performance review

Validation accuracy is slightly higher at 96.43%, which suggests that the relationship may be partly non-linear.

### On random Fourier features

Random Fourier features approximate the work done by SVM kernels at lower computational cost, which makes kernel methods feasible in neural network implementations.

Simple explanation: https://stats.stackexchange.com/questions/327646/how-does-a-random-kitchen-sink-work#327961
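
The idea can be sketched in pure Python for the Gaussian (RBF) kernel: draw random frequencies and phases once, map each input through a cosine, and the dot product of two mapped inputs approximates the kernel value. The function names, the `sigma` bandwidth, the seed, and the inputs are all assumptions for illustration, and this is not the Keras layer's implementation.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Exact Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_random_features(input_dim, num_features, sigma=1.0, seed=0):
    """Draw random frequencies and phases once, and return the feature map z(x)."""
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(input_dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(freq, x)) + phase)
                for freq, phase in zip(freqs, phases)]

    return feature_map


# Hypothetical inputs; more random features give a closer approximation
x, y = [0.5, -0.2, 0.1], [0.1, 0.3, -0.4]
feature_map = make_random_features(input_dim=3, num_features=5000)

exact = rbf_kernel(x, y)
approx = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
print(f"exact kernel:  {exact:.4f}")
print(f"approximation: {approx:.4f}")
```

Once the inputs are mapped this way, a plain linear layer on top of the features behaves approximately like a kernel machine, without ever computing the full kernel matrix.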

### Model types through loss functions

Different loss functions yield different model types:

* __Hinge__ - turns the model into an SVM
* __Logistic__ - logistic regression
* __Mean squared error__ - numeric regression
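
The three losses in the list above can be written out for a single example as follows. This is a minimal sketch with hypothetical function names; the hinge and logistic versions here assume labels encoded as -1 or +1 and a raw (pre-activation) score.

```python
import math


def hinge_loss(y, score):
    """SVM-style loss; y must be -1 or +1."""
    return max(0.0, 1.0 - y * score)


def logistic_loss(y, score):
    """Logistic regression loss on a raw score; y must be -1 or +1."""
    return math.log(1.0 + math.exp(-y * score))


def squared_error(y, prediction):
    """Regression loss; y is a real-valued target."""
    return (y - prediction) ** 2


# The same raw score is judged differently by each loss
score = 2.0
print(hinge_loss(+1, score))               # 0.0: correct and outside the margin
print(hinge_loss(-1, score))               # 3.0: wrong side, penalised linearly
print(round(logistic_loss(+1, score), 4))  # 0.1269: small but never exactly zero
print(squared_error(1.5, score))           # 0.25
```

The hinge loss stops caring once an example is beyond the margin, while the logistic loss keeps rewarding more confident correct scores, which is the practical difference between the SVM and logistic regression objectives.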

🥬 __NOTE__: It is good to start with a larger number of output nodes in the non-linear layer, then test iteratively whether shrinking that number improves the results.

---

# Using Wide & Deep Models

In [None]:
pg 205