# Learning paradigms with TensorFlow

This notebook explores various learning paradigms in deep learning, implemented using TensorFlow. Deep learning has evolved to include diverse techniques that extend beyond traditional supervised learning. These paradigms enable models to perform better on complex tasks, adapt to new tasks with limited data, and leverage shared knowledge across multiple tasks.

In [1]:
import tensorflow as tf
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import KLDivergence
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import numpy as np

## Transfer learning
Transfer learning is a machine learning technique where a model that has already been trained on a large dataset is reused or fine-tuned on a new, often smaller dataset. Instead of starting from scratch, transfer learning allows us to leverage the knowledge captured in a pre-trained model to improve the performance and efficiency of a new model. This approach is particularly valuable because training deep neural networks from scratch typically requires vast amounts of data and computational resources. Transfer learning allows us to start with a pre-trained model, reducing the time and data needed. Key concepts:
- Pre-trained model: A neural network model that has already been trained on a large dataset.
- Fine-tuning: Adjusting the weights of the pre-trained model to adapt it to a new dataset or task.

We will start by training a model from scratch, saving its weights, and then using that model in various transfer learning scenarios.

### Pre-trained model
We will define a simple FFNN model and train it on the synthetic data. After training, we will save the model's weights so that we can use them later in different transfer learning scenarios.

In [2]:
# Generate synthetic data
np.random.seed(42)
X = np.random.rand(1000, 20)
y = np.random.randint(2, size=1000)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Model accuracy: {accuracy}")

# Save the model weights
model.save_weights('pretrained_model_weights.h5')
model.save('pretrained_model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model accuracy: 0.5199999809265137


**Explanation**

In this section, we created a simple feedforward neural network (FFNN) and train it on synthetic binary classification data. Here’s a breakdown of the steps:
- Step 1: Load/generate data for a related task - Here, we generate synthetic data with 20 features and a binary target variable. The data is then split into training and testing sets.
- Step 2: Model definition - Here, We defined a sequential model with three hidden layers and an output layer. The hidden layers use ReLU activation, and the output layer uses the sigmoid activation function for binary classification.
- Step 3: Model training - Here, the model is trained for 10 epochs, using a validation split of 10%.
- Step 4: Saving the model - The trained model's weights and the entire model are saved to files, which will be used later for transfer learning.

### Types of transfer learning
Transfer learning can be applied in several ways, depending on how the pre-trained model is used and the nature of the new task. Let's explore different types of transfer learning techniques using the pre-trained model we just saved.

#### Model as a fixed pre-trained model
In this approach, we use the pre-trained model directly without any changes. This is typically done when the new task is very similar to the original task for which the model was trained. The pre-trained model's layers are kept unchanged, and the model is used as-is without further training to make predictions on the new data.

In [3]:
# Generate new synthetic data
X_fixed = np.random.rand(100, 20)
y_fixed = np.random.randint(2, size=100)

# Define a new model with the same architecture as the trained model
model_fixed = Sequential()
model_fixed.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
model_fixed.add(Dense(64, activation='relu'))
model_fixed.add(Dense(32, activation='relu'))
model_fixed.add(Dense(1, activation='sigmoid'))

# Load the pre-trained weights
model_fixed.load_weights('pretrained_model_weights.h5')

# Compile the loaded model
model_fixed.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Evaluate the model on new data
loss, accuracy = model_fixed.evaluate(X_fixed, y_fixed)
print(f"Fixed pre-trained model accuracy: {accuracy}")

# Use the pre-trained model directly for prediction
predictions = model_fixed.predict(X_fixed)
print(f"Predictions from fixed pre-trained model: {predictions[:5]}")

Fixed pre-trained model accuracy: 0.47999998927116394
Predictions from fixed pre-trained model: [[0.6053627 ]
 [0.56828934]
 [0.58841944]
 [0.54554385]
 [0.6093628 ]]


**Explanation**

Here, we use the previously trained model as-is, without any further training:
- Step 1: Load the data for a similar task.
- Step 2: Model definition: We defined a new model with the same architecture as the pre-trained model to ensure compatibility with the saved weights.
- Step 3: Load the pre-trained weights.
- Step 4: Evaluation - Evaluate the model on the new dataset without further training.
- Step 5: Prediction - The pre-trained model is used to make predictions on the new data, demonstrating its ability to generalize to unseen data.

#### Feature extraction transfer learning
In this approach, we use the pre-trained model as a feature extractor. We freeze the lower layers (which capture general features) and add new layers on top to adapt to the new task. The output of the pre-trained model (before the final layer) is fed into a new model designed for the new task, allowing the model to learn task-specific features without retraining the entire network. This method is particularly useful when the new task is related to the original task but requires a different output or representation.

In [4]:
# Generate new synthetic data (related but different task)
np.random.seed(42)
X_feature = np.random.rand(300, 20)
y_feature = np.random.randint(3, size=300)
y_feature_categorical = to_categorical(y_feature, num_classes=3)

# Split the data into training and testing sets
X_train_feature, X_test_feature, y_train_feature, y_test_feature = train_test_split(X_feature, y_feature_categorical, test_size=0.2, random_state=42)

# Load the pre-trained weights
pretrained_model_feature_extraction = load_model("pretrained_model.h5")

# Define a new model with the pre-trained layers as a feature extractor
feature_extractor = Sequential(pretrained_model_feature_extraction.layers[:-1])  # Exclude the last layer

# Freeze the layers in the feature extractor
for layer in feature_extractor.layers:
    layer.trainable = False

# Use the pre-trained model as a feature extractor and add new layers for the new task
model_feature_extraction = Sequential()
model_feature_extraction.add(feature_extractor)  # The pre-trained feature extractor
model_feature_extraction.add(Dense(16, activation='relu'))  # New dense layer
model_feature_extraction.add(Dense(3, activation='softmax'))  # Output layer for the new task
    
# Compile the new model
model_feature_extraction.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Train the new model
model_feature_extraction.fit(X_train_feature, y_train_feature, epochs=5, batch_size=32, validation_split=0.1)

# Evaluate the model
loss, accuracy = model_feature_extraction.evaluate(X_test_feature, y_test_feature)
print(f"Feature extractor model accuracy: {accuracy}")

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Feature extractor model accuracy: 0.21666666865348816


**Explanation**

In this approach, we use the pre-trained model as a feature extractor for a related but different task:
- Step 1: Load the data for a related but different task.
- Step 2: Load the pre-trained layers as a feature extractor by excluding the final layer(s). The last layer of the pre-trained model is typically a classification layer tailored for the original task. Since we are focusing on feature extraction, we exclude this layer to use the rest of the model as a fixed feature extractor.
- Step 3: Freeze the layers of the feature extractor to ensure the pre-trained weights are not updated during the training process of the new task.
- Step 4: Add new layers on top of the feature extractor to adapt the model to the new task. This typically includes a dense layer (or layers) and an output layer tailored to the number of classes in the new task.
- Step 5: Train the model on the new dataset for a few epochs.
- Step 6: Evaluate the model on the new dataset.

#### Fine-tuning transfer learning
Fine-tuning is a more flexible approach where we start with the pre-trained model but allow some or all layers to be further trained on the new task. This approach allows the model to adapt more closely to the new task while retaining the knowledge learned from the pre-trained modelin the original task. Fine-tuning is often used when the new task is sufficiently different from the original task, and the pre-trained model needs to be adjusted to better fit the new data.

In [5]:
# Generate new synthetic data (related task)
np.random.seed(42)
X_finetune = np.random.rand(400, 20)
y_finetune = np.random.randint(3, size=400)
y_finetune_categorical = to_categorical(y_finetune, num_classes=3)

# Split the data into training and testing sets
X_train_finetune, X_test_finetune, y_train_finetune, y_test_finetune = train_test_split(X_finetune, y_finetune_categorical, test_size=0.2, random_state=42)

# Load the pre-trained model
pretrained_model_finetune = load_model("pretrained_model.h5")

# Make some layers trainable
# Optionally, freeze some layers early in the network
for layer in pretrained_model_finetune.layers[:2]:  # Assuming the first 3 layers should be frozen
    layer.trainable = False
# The remaining layers will be fine-tuned
for layer in pretrained_model_finetune.layers[2:]:
    layer.trainable = True
    
# Add new layers for the new task (if necessary). If the pre-trained model's output layer is not suitable, replace it
pretrained_model_finetune.pop()  # Remove the last layer
pretrained_model_finetune.add(Dense(64, activation='relu'))  # Add a new dense layer
pretrained_model_finetune.add(Dense(3, activation='softmax'))  # New output layer for the new task

# Compile the model
pretrained_model_finetune.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# Fine-tune the model
pretrained_model_finetune.fit(X_train_finetune, y_train_finetune, epochs=5, batch_size=32, validation_split=0.1)

# Evaluate the model
loss, accuracy = pretrained_model_finetune.evaluate(X_test_finetune, y_test_finetune)
print(f"Fine-tuned model accuracy: {accuracy}")

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Fine-tuned model accuracy: 0.3499999940395355


**Explanation**

This approach involves fine-tuning a pre-trained model to better fit a related task:
- Step 1: Load data for a related task
- Step 2: Load the pre-trained model
- Step 3: Decide which layers to fine-tune (if needed) - Typically, the earlier layers are kept frozen because they capture more generic features, while the later layers are fine-tuned since they capture task-specific features.
- Step 4: Add new layers for the new task (if necessary) - If the output layer of the pre-trained model doesn't match the number of classes in the new dataset, we should replace it or add new layers on top of the pre-trained model.
- Step 5: Train the model on the new dataset.
- Step 6: Evaluate the model on the new dataset.

#### Knowledge distillation (teacher-student model)

In knowledge distillation, the knowledge from a large, pre-trained model (the teacher model) is transferred to a smaller and simpler model (the student model). The idea is that the student model learns to mimic the teacher model's behavior, achieving a similar performance with fewer parameters, which makes it more efficient for deployment on devices with limited resources.

##### Response-based knowledge distillation
Response-based knowledge distillation focuses on the output predictions (responses) of the teacher model. The student model is trained to mimic the probability distribution (soft labels) produced by the teacher model, rather than the hard labels. 

We will use a pre-trained model as the teacher model. The teacher model generates a probability distribution over classes (soft labels) for each input. Then, we will define a new, smaller student model. The student model will be trained to reproduce a similar probability distribution to match the output of the teacher model rather than directly training on the target labels. 

The typical loss function used is Kullback-Leibler Divergence (KLD), which measures the difference between the teacher's and student's probability distributions.

In [6]:
# Generate synthetic data
np.random.seed(42)
X = np.random.rand(1000, 20)
y = np.random.randint(2, size=1000)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Teacher model
teacher_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

teacher_model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
teacher_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Generate soft labels (probability distribution) from the Teacher model
teacher_logits = teacher_model.predict(X_train)
teacher_soft_labels = tf.nn.softmax(teacher_logits / 5.0)  # Using temperature scaling

# Define the Student model
student_model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the Student model using KLD loss
student_model.compile(optimizer=Adam(), loss=KLDivergence(), metrics=['accuracy'])

# Train the Student model on the soft labels
student_model.fit(X_train, teacher_soft_labels, epochs=10, batch_size=32, validation_split=0.1)

# Evaluate the Student model
student_loss, student_accuracy = student_model.evaluate(X_test, y_test)
print(f"Response-based Student model accuracy: {student_accuracy}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Response-based Student model accuracy: 0.5149999856948853


**Explanation**

- Step 1: Teacher model definition and training - Here, a FFNN with three hidden layers is defined. It uses ReLU activation functions in the hidden layers and a sigmoid activation function in the output layer for binary classification. The teacher model is compiled with the Adam optimizer and binary cross-entropy loss. It is trained for 10 epochs using the training data.
- Step 2: Generate soft labels - After training, the teacher model's predictions are used to create soft labels. These labels are the probability distributions over classes, scaled by a temperature parameter (5.0 in this case). The temperature scaling helps smooth the probability distributions, making it easier for the student model to learn from them.
- Step 3: Student Model definition andt training - Here, a smaller FFNN with fewer layers and units is defined for the student model. This model has fewer parameters and is simpler than the teacher model. The student model is compiled with the Kullback-Leibler Divergence loss function, which measures the difference between the soft labels provided by the teacher and the student model's predictions. The student model is trained on the soft labels generated by the teacher model for 10 epochs.
- Step 4: Evaluation - Finally, the student model is evaluated on the test set to measure its accuracy. This step demonstrates how effectively the student model has learned to replicate the teacher's behavior by using the response-based distillation approach.

## Multi-task learning

Multi-task learning (MTL) is an technique where a single model is trained to perform multiple tasks simultaneously. Instead of training separate models for each task, MTL uses a unified architecture to learn from related tasks concurrently. The core idea behind MTL is that by learning multiple tasks together, the model can exploit commonalities and shared patterns across tasks, leading to improved generalization and performance. 

In [7]:
# Generate synthetic data
np.random.seed(42)
X = np.random.rand(1000, 20)

# Task 1: Binary classification labels
y_classification = np.random.randint(2, size=1000)

# Task 2: Regression labels
y_regression = np.random.rand(1000)

# Split the data into training and testing sets
X_train, X_test, y_train_class, y_test_class = train_test_split(X, y_classification, test_size=0.2, random_state=42)
_, _, y_train_reg, y_test_reg = train_test_split(X, y_regression, test_size=0.2, random_state=42)

# Define the input layer
input_layer = Input(shape=(X_train.shape[1],))

# Shared layers
shared = Dense(64, activation='relu')(input_layer)
shared = Dense(32, activation='relu')(shared)

# Task 1: Classification
classification_output = Dense(1, activation='sigmoid', name='classification')(shared)

# Task 2: Regression
regression_output = Dense(1, name='regression')(shared)

# Define the model
mtl_model = Model(inputs=input_layer, outputs=[classification_output, regression_output])

# Compile the model with two loss functions: binary_crossentropy for classification and mse for regression
mtl_model.compile(optimizer=Adam(), 
                  loss={'classification': 'binary_crossentropy', 'regression': 'mse'},
                  metrics={'classification': 'accuracy', 'regression': 'mse'})

# Train the model
history = mtl_model.fit(X_train, 
                        {'classification': y_train_class, 'regression': y_train_reg}, 
                        epochs=10, 
                        batch_size=32, 
                        validation_split=0.2)

# Evaluate the model on the test set
evaluation = mtl_model.evaluate(X_test, {'classification': y_test_class, 'regression': y_test_reg})
print(f"Classification accuracy: {evaluation[3]}")  # Accuracy for classification task
print(f"Regression MSE: {evaluation[4]}")  # MSE for regression task

# Generate predictions for both tasks on the test data
predictions = mtl_model.predict(X_test)
# Extract predictions for each task
classification_predictions = predictions[0]  # Predictions for the classification task
regression_predictions = predictions[1]  # Predictions for the regression task
print("Classification Predictions (first 5):", classification_predictions[:5])
print("Regression Predictions (first 5):", regression_predictions[:5])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification accuracy: 0.5400000214576721
Regression MSE: 0.09702430665493011
Classification Predictions (first 5): [[0.5105735 ]
 [0.53726524]
 [0.5058471 ]
 [0.5249949 ]
 [0.5506573 ]]
Regression Predictions (first 5): [[0.5503157 ]
 [0.57404166]
 [0.48006114]
 [0.5600674 ]
 [0.4826324 ]]


**Explanation**

- Step 1: Load/generate data for a related task - Here, we generated synthetic data with 20 features. We created two types of labels, binary classification labels (for the classification task, we generate binary labels) and regression labels (for the regression task, we generate continuous values). Then, the data is splitted into training and testing sets separately for each task.
- Step 2: Model definition:
  - Input layer: An input layer is defined to accept data with 20 features.
  - Shared layers: Two hidden layers are defined, which are shared between both tasks. These layers learn common features from the input data.
  - Task-specific outputs**:
    - Classification output: A dense layer with a sigmoid activation function for binary classification.
    - Regression output: A dense layer for regression with no activation function, suitable for predicting continuous values.
- Step 2: Model compilation - The model is compiled with the Adam optimizer and the specified loss functions and metrics.
  - Loss functions:
    - Binary cross-entropy: Used for the classification task to measure the difference between predicted and true binary labels.
    - Mean squared error: Used for the regression task to measure the difference between predicted and true continuous values.
  - Metrics:
    - Accuracy: For the classification task.
    - MSE: For the regression task.
- Step 3: Training - The model is trained on the training data for both tasks. The training process involves minimizing the combined loss from both tasks.
- Step 4: Evaluation - After training, the model is evaluated on the test set for both tasks. Here, the evaluation metrics include classification accuracy and regression MSE.
- Step 5: Creating predictions - After training, predictions for both tasks can be made using the model on new or test data.