### Import Libraries
---

In [1]:
from multitask_learning import MultitaskTransformer


### Initialize Model
---

In [2]:
vocab_size = 4096
num_classes = 3  # e.g., [positive, negative, neutral]
num_ner_tags = 5  # e.g., [O, B-PER, I-PER, B-ORG, I-ORG]
model = MultitaskTransformer(
    vocab_size=vocab_size,
    num_classes=num_classes,
    num_ner_tags=num_ner_tags
)


**1. Freezing the Entire Network**

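Implications:
- No parameters are updated during training, so the model is effectively used as-is
- Useful only if the model is already well-trained for both tasks
- Memory efficient: no gradients or optimizer state need to be stored
- Fast training steps, because no backward pass is needed (the forward pass itself costs the same)
- Not recommended unless the model has been pre-trained on very similar tasks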

In [3]:
# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False
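
A quick way to confirm that freezing took effect is to count trainable versus total parameters. The helper below is a minimal sketch; the name `count_parameters` is hypothetical and not part of `multitask_learning`. It relies only on the `parameters()`, `numel()`, and `requires_grad` members that PyTorch modules and their parameters expose.

```python
def count_parameters(module):
    """Return (trainable, total) parameter counts for a PyTorch-style module.

    Hypothetical helper: it only assumes `module.parameters()` yields objects
    with `numel()` and `requires_grad`, as torch.nn.Module parameters do.
    """
    trainable = 0
    total = 0
    for param in module.parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total
```

After the cell above, `count_parameters(model)` should report 0 trainable parameters. After freezing only the backbone, the trainable count should equal the combined size of the two heads.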
    

**2. Freezing Only the Transformer Backbone**

Implications:
- Preserves learned language representations
- Allows task-specific adaptation
- Good when the backbone is pre-trained on a large corpus
- Reduces risk of catastrophic forgetting
- Cheaper than full fine-tuning: gradients and optimizer state are kept only for the two heads

In [4]:
# Freeze transformer backbone
for param in model.embedding.parameters():
    param.requires_grad = False
for param in model.pos_encoder.parameters():
    param.requires_grad = False
for param in model.transformer_encoder.parameters():
    param.requires_grad = False

# Keep task-specific heads trainable
for param in model.classification_head.parameters():
    param.requires_grad = True
for param in model.ner_head.parameters():
    param.requires_grad = True
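
The per-submodule loops above repeat one pattern, so they could be folded into a small helper. This is a minimal sketch, assuming only that each submodule exposes `parameters()` the way `nn.Module` does; `set_trainable` is a hypothetical name, not an existing API.

```python
def set_trainable(modules, trainable):
    """Set requires_grad on every parameter of the given modules.

    Returns the number of parameter tensors whose flag actually changed,
    which helps spot a call that had no effect.
    """
    changed = 0
    for module in modules:
        for param in module.parameters():
            if param.requires_grad != trainable:
                param.requires_grad = trainable
                changed += 1
    return changed
```

With this helper, the cell above would reduce to `set_trainable([model.embedding, model.pos_encoder, model.transformer_encoder], False)` followed by `set_trainable([model.classification_head, model.ner_head], True)`.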
    

**3. Freezing One Task-Specific Head**

Implications:
- Useful when one task is well-optimized
- Protects the frozen head's weights; its task performance can still drift if the shared backbone keeps training
- Allows fine-tuning for the other task
- Good for incremental learning scenarios

In [5]:
# Example: Freeze classification head but keep NER head trainable
for param in model.classification_head.parameters():
    param.requires_grad = False
    
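# Keep NER head and backbone trainable. The backbone is re-enabled
# explicitly because the earlier cells froze it.
backbone = (model.embedding, model.pos_encoder, model.transformer_encoder)
for module in (model.ner_head, *backbone):
    for param in module.parameters():
        param.requires_grad = True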

**Transfer Learning Approach**

Rationale for Transfer Learning Choices:

1. Choice of Pre-trained Model:
    - BERT-base as starting point (proven architecture)
    - Trained on general language understanding
    - Well-documented transfer learning success
    - Reasonable model size for fine-tuning

2. Layer Freezing Strategy:
    - Initial phase: Freeze everything except task heads
    - Middle phase: Unfreeze top transformer layers
    - Final phase: Full fine-tuning
    - Staging the unfreezing this way limits catastrophic forgetting and lets the backbone adapt gradually

3. Key Parameters:
    - Low learning rate (2e-5) to prevent destroying pre-trained features
    - Warmup steps to stabilize initial training
    - Weight decay for regularization
    - Task sampling weights based on task complexity

4. Task-Specific Considerations:
    - NER gets higher sampling weight (0.6) due to token-level complexity
    - Classification gets lower weight (0.4) as it's sentence-level
    - Both heads initialized randomly to learn task-specific features
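
The warmup idea in point 3 can be sketched as a plain function of the optimizer step. This is a minimal illustration, assuming linear warmup to the peak rate followed by linear decay to zero. The decay shape and the `total_steps` value are assumptions; the notebook specifies only the peak rate and the warmup length.

```python
def lr_at_step(step, peak_lr=2e-5, warmup_steps=1000, total_steps=10000):
    """Learning rate at a given optimizer step.

    Assumed schedule: linear warmup from 0 to peak_lr over warmup_steps,
    then linear decay back to 0 at total_steps (total_steps is hypothetical).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the rate at a few points along the schedule
for step in (0, 500, 1000, 5500, 10000):
    print(step, lr_at_step(step))
```

In PyTorch, a schedule like this is usually supplied to `torch.optim.lr_scheduler.LambdaLR` as a multiplier, that is, the value above divided by the peak rate.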

In [6]:
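def setup_transfer_learning(model, base_model='bert-base-uncased'):
    """
    Copy compatible pre-trained weights into the model and return the
    layer groups used for gradual unfreezing.

    Assumes the model's hidden size and feedforward size match the
    checkpoint (768 and 3072 for bert-base-uncased).
    """
    import torch
    from transformers import AutoModel

    # 1. Load pre-trained weights
    pretrained = AutoModel.from_pretrained(base_model)

    with torch.no_grad():
        # 2. Copy embeddings only if the shapes match; a model built with
        #    vocab_size=4096 cannot reuse BERT's larger embedding table
        src_emb = pretrained.embeddings.word_embeddings.weight
        if model.embedding.weight.shape == src_emb.shape:
            model.embedding.weight.copy_(src_emb)

        for dst, src in zip(model.transformer_encoder.layers, pretrained.encoder.layer):
            attn = src.attention
            # in_proj stacks the query, key, and value projections along dim 0,
            # so all three must be concatenated (query alone has the wrong shape)
            dst.self_attn.in_proj_weight.copy_(torch.cat(
                [attn.self.query.weight, attn.self.key.weight, attn.self.value.weight]))
            dst.self_attn.in_proj_bias.copy_(torch.cat(
                [attn.self.query.bias, attn.self.key.bias, attn.self.value.bias]))
            dst.self_attn.out_proj.load_state_dict(attn.output.dense.state_dict())
            dst.norm1.load_state_dict(attn.output.LayerNorm.state_dict())

            # Copy both feedforward projections and the second layer norm
            dst.linear1.load_state_dict(src.intermediate.dense.state_dict())
            dst.linear2.load_state_dict(src.output.dense.state_dict())
            dst.norm2.load_state_dict(src.output.LayerNorm.state_dict())

    # 3. Layer groups for gradual unfreezing; the last two entries are the heads
    layers = [
        model.embedding,
        model.transformer_encoder,
        model.classification_head,
        model.ner_head
    ]

    return layers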

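def gradual_unfreeze(layers, current_epoch):
    """
    Gradually unfreeze layers as training progresses.

    Each phase sets requires_grad on every group explicitly, so the result
    does not depend on which epochs were called earlier. Build the optimizer
    over all parameters (or rebuild it when the phase changes) so newly
    unfrozen weights are actually updated.
    """
    *backbone, cls_head, ner_head = layers
    transformer_encoder = backbone[-1]

    def set_trainable(module, flag):
        for param in module.parameters():
            param.requires_grad = flag

    # Task-specific heads are trainable in every phase
    set_trainable(cls_head, True)
    set_trainable(ner_head, True)

    if current_epoch < 2:
        # First 2 epochs: train only task-specific heads
        for layer in backbone:
            set_trainable(layer, False)
    elif current_epoch < 4:
        # Next 2 epochs: also train the last two transformer layers
        for layer in backbone:
            set_trainable(layer, False)
        for layer in transformer_encoder.layers[-2:]:
            set_trainable(layer, True)
    else:
        # After 4 epochs: unfreeze all layers
        for layer in backbone:
            set_trainable(layer, True)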

# Training configuration
training_config = {
    'epochs': 10,
    'initial_lr': 2e-5,
    'warmup_steps': 1000,
    'weight_decay': 0.01,
    'task_sampling_weights': {
        'classification': 0.4,
        'ner': 0.6
    }
}
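
One way to apply `task_sampling_weights` is to draw the task for each batch at random in proportion to its weight. The sketch below assumes that interpretation; `sample_task` is a hypothetical helper, and the notebook does not show how the weights are actually consumed.

```python
import random
from collections import Counter


def sample_task(task_weights, rng=random):
    """Pick the task for the next batch in proportion to its weight."""
    tasks = list(task_weights)
    weights = [task_weights[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]


# Same weights as training_config['task_sampling_weights'] above
task_weights = {'classification': 0.4, 'ner': 0.6}

rng = random.Random(0)  # fixed seed so the demo is repeatable
counts = Counter(sample_task(task_weights, rng) for _ in range(10000))
print(counts)  # roughly 40% classification, 60% ner
```

Sampling per batch keeps each optimizer step single-task, which avoids padding sentence-level and token-level labels into one batch.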


**Recommendations:**

1. For Production Use:
    - Start with frozen backbone
    - Gradually unfreeze layers
    - Monitor validation performance
    - Use early stopping per task

2. For Research/Experimentation:
    - Try different freezing combinations
    - Experiment with layer-wise learning rates
    - Test various pre-trained models
    - Analyze task interference

3. For Resource Constraints:
    - Keep backbone frozen
    - Train only task-specific heads
    - Use smaller pre-trained models
    - Use gradient accumulation to reach a larger effective batch size within limited memory
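
The "early stopping per task" recommendation can be sketched as a small tracker that keeps a separate patience counter for each task. This is an illustrative helper under assumed names (`PerTaskEarlyStopping`, `patience`, `min_delta`), and the validation losses in the demo are made up.

```python
class PerTaskEarlyStopping:
    """Track validation loss per task and flag tasks that stopped improving.

    Hypothetical helper: `patience` is the number of consecutive evaluations
    without improvement (by at least `min_delta`) tolerated per task.
    """

    def __init__(self, tasks, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = {task: float('inf') for task in tasks}
        self.bad_evals = {task: 0 for task in tasks}

    def update(self, task, val_loss):
        """Record a validation loss; return True if this task should stop."""
        if val_loss < self.best[task] - self.min_delta:
            self.best[task] = val_loss
            self.bad_evals[task] = 0
        else:
            self.bad_evals[task] += 1
        return self.should_stop(task)

    def should_stop(self, task):
        return self.bad_evals[task] >= self.patience

    def all_stopped(self):
        return all(self.should_stop(task) for task in self.best)


stopper = PerTaskEarlyStopping(['classification', 'ner'], patience=2)
# Made-up validation losses, purely for illustration
for cls_loss, ner_loss in [(0.9, 1.4), (0.8, 1.2), (0.85, 1.1), (0.82, 1.0)]:
    stopper.update('classification', cls_loss)
    stopper.update('ner', ner_loss)
print(stopper.should_stop('classification'), stopper.should_stop('ner'))
```

When one task stops early, its head can be frozen as in strategy 3 while training continues on the other task.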
