# Pruning Transformers

> A partial re-implementation of _Movement Pruning: Adaptive Sparsity by Fine-Tuning_ by Victor Sanh, Thomas Wolf, and Alexander M. Rush [[arXiv:2005.07683](https://arxiv.org/abs/2005.07683)]

## References

* [_Movement Pruning: Adaptive Sparsity by Fine-Tuning_](https://arxiv.org/abs/2005.07683) by Victor Sanh, Thomas Wolf, and Alexander M. Rush
* The scripts and notebooks that accompany the paper ([link](https://github.com/huggingface/transformers/tree/master/examples/research_projects/movement-pruning))

## Load libraries

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from transformerlab.question_answering import *

In [None]:
from pathlib import Path

import datasets
import transformers

datasets.logging.set_verbosity_error()
transformers.logging.set_verbosity_error()

print(transformers.__version__, datasets.__version__)

4.1.1 1.2.0


In [None]:
import numpy as np
import random

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, default_data_collator, AdamW, get_linear_schedule_with_warmup

import torch
import torch.nn as nn
from torch.nn import init, CrossEntropyLoss
from torch import autograd
import torch.nn.functional as F
import math
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on device: {device}")

Running on device: cuda


## Fix seed for sanity

In [None]:
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    
set_seed()

## Load data

In [None]:
squad = load_dataset("squad")
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

## Evaluate fine-pruned model

HuggingFace has released a PruneBERT checkpoint for SQuAD v1.1 called `prunebert-base-uncased-6-finepruned-w-distil-squad` which is described in their docs as follows:

> Pre-trained BERT-base-uncased fine-pruned with soft movement pruning on SQuAD v1.1. We use an additional distillation signal from `BERT-base-uncased` finetuned on SQuAD. The encoder counts 6% of total non-null weights and reaches 83.8 F1 score. The model can be accessed with: `pruned_bert = BertForQuestionAnswering.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad")`

In this notebook we'll focus on reproducing this model, so let's begin by simply validating that we can obtain the same F1-score. Before doing that, we first need to preprocess the data - let's get started!
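
The "6% of total non-null weights" in the description is the fraction of encoder weights that remain non-zero. Here is a minimal sketch of how such a fraction could be measured for any PyTorch module. `nonzero_fraction` is a hypothetical helper, not part of `transformers`, and counting only the weight matrices of `nn.Linear` layers is an assumption about what the quoted figure covers.

```python
import torch
import torch.nn as nn


def nonzero_fraction(module: nn.Module) -> float:
    """Fraction of non-zero entries across the weights of all nn.Linear layers.

    Hypothetical helper for illustration: biases, embeddings and LayerNorm
    parameters are deliberately left out of the count.
    """
    total, nonzero = 0, 0
    for layer in module.modules():
        if isinstance(layer, nn.Linear):
            total += layer.weight.numel()
            nonzero += int(torch.count_nonzero(layer.weight))
    return nonzero / total if total else 0.0


# Toy check: keep only the first row of a 4x4 weight matrix.
toy = nn.Linear(4, 4)
with torch.no_grad():
    toy.weight.fill_(1.0)
    toy.weight[1:] = 0.0
print(nonzero_fraction(toy))  # 4 of 16 weights are non-zero -> 0.25
```

Once the checkpoint is loaded, a call such as `nonzero_fraction(pruned_model.bert.encoder)` should land near the quoted 6%, provided the assumption about which weights are counted holds.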

### Preprocess data

In [None]:
def convert_examples_to_features(dataset, tokenizer):
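    max_length = 384  # maximum number of tokens per feature (question plus context window)
    doc_stride = 128  # token overlap between consecutive windows of a long context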
    pad_on_right = tokenizer.padding_side == "right"

    fn_kwargs = {
        "tokenizer": tokenizer,
        "max_length": max_length,
        "doc_stride": doc_stride,
        "pad_on_right": pad_on_right
    }
    
    train_enc = dataset['train'].map(prepare_train_features, fn_kwargs=fn_kwargs, batched=True, remove_columns=dataset["train"].column_names)
    valid_enc = dataset['validation'].map(prepare_validation_features, fn_kwargs=fn_kwargs, batched=True, remove_columns=dataset["validation"].column_names)

    return train_enc, valid_enc
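
To make the roles of `max_length` and `doc_stride` concrete, here is a rough sketch of how a long context is cut into overlapping windows. It assumes that `prepare_train_features` follows the usual Hugging Face question-answering recipe, where `doc_stride` is the number of tokens shared by consecutive windows. `window_spans` is a hypothetical helper that ignores the question and special tokens, which take up part of `max_length` in the real tokenizer call.

```python
def window_spans(n_tokens: int, max_length: int, doc_stride: int):
    """Return (start, end) token spans of overlapping windows over a context.

    Simplification: ignores the question and special tokens, which take up
    part of max_length in the real tokenizer call.
    """
    if doc_stride >= max_length:
        raise ValueError("doc_stride must be smaller than max_length")
    step = max_length - doc_stride  # consecutive windows share doc_stride tokens
    spans = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += step
    return spans


print(window_spans(n_tokens=900, max_length=384, doc_stride=128))
# [(0, 384), (256, 640), (512, 896), (768, 900)]
```

Each long example therefore yields several features, which is why the encoded datasets contain more rows than the raw SQuAD splits.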

In [None]:
# pruned_model_name = "huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad"
# pruned_tokenizer = AutoTokenizer.from_pretrained(pruned_model_name)

In [None]:
# train_enc, valid_enc = convert_examples_to_features(squad, pruned_tokenizer)

### Initialize the trainer

Now that the data is preprocessed, let's instantiate a custom trainer and evaluate the model on the validation set:

In [None]:
# pruned_model = AutoModelForQuestionAnswering.from_pretrained(pruned_model_name).to(device)
# batch_size = 8

# eval_ds = valid_enc
# eval_raw_ds = squad["validation"]

# pruned_args = QuestionAnsweringTrainingArguments(
#     output_dir="checkpoints",
#     per_device_eval_batch_size=batch_size)

# data_collator = default_data_collator

In [None]:
# pruned_trainer = QuestionAnsweringTrainer(
#     model=pruned_model,
#     args=pruned_args,
#     eval_dataset=eval_ds,
#     eval_examples=eval_raw_ds,
#     tokenizer=pruned_tokenizer,
#     data_collator=data_collator,
#     compute_metrics=squad_metrics)

In [None]:
# pruned_trainer.evaluate()

Great - we get an F1-score that matches the value quoted by HuggingFace!

## Prune-tuning without distillation

### Preprocess data

In [None]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
train_enc, valid_enc = convert_examples_to_features(squad, tokenizer)

### Create masked variants of BERT

In [None]:
from transformers.configuration_utils import PretrainedConfig

class MaskedBertConfig(PretrainedConfig):
    """
    A class replicating the `~transformers.BertConfig` with additional parameters for pruning/masking configuration.
    """

    model_type = "masked_bert"

    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=0,
        pruning_method="topK",
        mask_init="constant",
        mask_scale=0.0,
        **kwargs
    ):
        super().__init__(pad_token_id=pad_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.pruning_method = pruning_method
        self.mask_init = mask_init
        self.mask_scale = mask_scale

In [None]:
from transformers.modeling_utils import PreTrainedModel, prune_linear_layer
from transformers.models.bert.modeling_bert import load_tf_weights_in_bert, ACT2FN

class MaskedBertPreTrainedModel(PreTrainedModel):
    """An abstract class to handle weights initialization and
    a simple interface for downloading and loading pretrained models.
    """

    config_class = MaskedBertConfig
    load_tf_weights = load_tf_weights_in_bert
    base_model_prefix = "bert"

    def _init_weights(self, module):
        """ Initialize the weights """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, torch.nn.LayerNorm):
            # HACK: replace BertLayerNorm with LayerNorm
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

In [None]:
class MaskedBertModel(MaskedBertPreTrainedModel):
    """
    The `MaskedBertModel` class replicates the :class:`~transformers.BertModel` class
    and adds specific inputs to compute the adaptive mask on the fly.
    Note that we freeze the embeddings modules from their pre-trained values.
    """

    def __init__(self, config):
        super().__init__(config)
        self.config = config

        self.embeddings = BertEmbeddings(config)
        self.embeddings.requires_grad_(requires_grad=False)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)

        self.init_weights()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value

    def _prune_heads(self, heads_to_prune):
        """Prunes heads of the model.
        heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
        See base class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        threshold=None,
    ):

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        if attention_mask is None:
            attention_mask = torch.ones(input_shape, device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        if attention_mask.dim() == 3:
            extended_attention_mask = attention_mask[:, None, :, :]
        elif attention_mask.dim() == 2:
            # Provided a padding mask of dimensions [batch_size, seq_length]
            # - if the model is a decoder, apply a causal mask in addition to the padding mask
            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
            if self.config.is_decoder:
                batch_size, seq_length = input_shape
                seq_ids = torch.arange(seq_length, device=device)
                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
                causal_mask = causal_mask.to(
                    attention_mask.dtype
                )  # causal and attention masks must have same type with pytorch version < 1.3
                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
            else:
                extended_attention_mask = attention_mask[:, None, None, :]
        else:
            raise ValueError(
                "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
                    input_shape, attention_mask.shape
                )
            )

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        # If a 2D or 3D attention mask is provided for the cross-attention
        # we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
        if self.config.is_decoder and encoder_hidden_states is not None:
            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
            if encoder_attention_mask is None:
                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)

            if encoder_attention_mask.dim() == 3:
                encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]
            elif encoder_attention_mask.dim() == 2:
                encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]
            else:
                raise ValueError(
                    "Wrong shape for encoder_hidden_shape (shape {}) or encoder_attention_mask (shape {})".format(
                        encoder_hidden_shape, encoder_attention_mask.shape
                    )
                )

            encoder_extended_attention_mask = encoder_extended_attention_mask.to(
                dtype=next(self.parameters()).dtype
            )  # fp16 compatibility
            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -10000.0
        else:
            encoder_extended_attention_mask = None

        # Prepare head mask if needed
        # 1.0 in head_mask indicates we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        if head_mask is not None:
            if head_mask.dim() == 1:
                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
                head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1)
            elif head_mask.dim() == 2:
                head_mask = (
                    head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)
                )  # We can specify head_mask for each layer
            head_mask = head_mask.to(
                dtype=next(self.parameters()).dtype
            )  # switch to float if need + fp16 compatibility
        else:
            head_mask = [None] * self.config.num_hidden_layers

        embedding_output = self.embeddings(
            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
        )
        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_extended_attention_mask,
            threshold=threshold,
        )
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output)

        outputs = (sequence_output, pooled_output,) + encoder_outputs[
            1:
        ]  # add hidden_states and attentions if they are here
        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)
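
The additive attention mask built in `forward` can be looked at in isolation. The sketch below uses a toy batch and uniform scores (both made up for illustration) to show that adding `-10000.0` to the padded positions drives their attention probability to zero after the softmax.

```python
import torch

# Padding mask for a batch of 2 sequences of length 4 (1 = real token, 0 = padding).
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])

# Add singleton head and query dimensions: [batch, 1, 1, seq_length].
extended = attention_mask[:, None, None, :].to(torch.float32)
# 0.0 where attention is allowed, -10000.0 where it is blocked.
extended = (1.0 - extended) * -10000.0

# Uniform raw scores of shape [batch, heads, query, key]; broadcasting applies the mask.
scores = torch.zeros(2, 1, 4, 4)
probs = torch.softmax(scores + extended, dim=-1)
print(probs[0, 0, 0])  # the padded key position gets probability 0
```

Because the mask has singleton head and query dimensions, the same padding pattern is broadcast to every head and every query position.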

In [None]:
class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
#         self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]
        device = input_ids.device if input_ids is not None else inputs_embeds.device
        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
            position_ids = position_ids.unsqueeze(0).expand(input_shape)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [None]:
class BertEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        threshold=None,
    ):
        all_hidden_states = ()
        all_attentions = ()
        for i, layer_module in enumerate(self.layer):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            layer_outputs = layer_module(
                hidden_states,
                attention_mask,
                head_mask[i],
                encoder_hidden_states,
                encoder_attention_mask,
                threshold=threshold,
            )
            hidden_states = layer_outputs[0]

            if self.output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)

        # Add last layer
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = (hidden_states,)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            outputs = outputs + (all_attentions,)
        return outputs  # last-layer hidden state, (all hidden states), (all attentions)

In [None]:
class MaskedBertForQuestionAnswering(MaskedBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = MaskedBertModel(config)
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        start_positions=None,
        end_positions=None,
        threshold=None,
    ):

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            threshold=threshold,
        )

        sequence_output = outputs[0]

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        outputs = (
            start_logits,
            end_logits,
        ) + outputs[2:]
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split adds a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            outputs = (total_loss,) + outputs

        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)
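
The clamping trick in the loss computation is easy to miss. In this small sketch with made-up logits and positions, an answer position that falls outside the window is clamped to `ignored_index` (the sequence length, which is never a valid class), and `CrossEntropyLoss` then skips it.

```python
import torch
from torch.nn import CrossEntropyLoss

seq_length = 5
start_logits = torch.zeros(2, seq_length)  # uniform logits for 2 examples
start_positions = torch.tensor([2, 9])     # 9 lies outside the 5-token window

ignored_index = start_logits.size(1)       # == seq_length, never a valid class
clamped = start_positions.clamp(0, ignored_index)

loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
loss = loss_fct(start_logits, clamped)     # mean over the non-ignored example only
print(clamped.tolist(), round(loss.item(), 4))  # [2, 5] 1.6094
```

The loss equals `ln(5)`, the cross-entropy of a uniform distribution over five tokens, which confirms that only the first example contributes.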

In [None]:
class BertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = BertAttention(config)
        self.is_decoder = config.is_decoder
        if self.is_decoder:
            self.crossattention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        threshold=None,
    ):
        self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask, threshold=threshold)
        attention_output = self_attention_outputs[0]
        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

        if self.is_decoder and encoder_hidden_states is not None:
            cross_attention_outputs = self.crossattention(
                attention_output,
                attention_mask,
                head_mask,
                encoder_hidden_states,
                encoder_attention_mask,
                threshold=threshold,  # the masked linear layers need the threshold here too
            )
            attention_output = cross_attention_outputs[0]
            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights

        intermediate_output = self.intermediate(attention_output, threshold=threshold)
        layer_output = self.output(intermediate_output, attention_output, threshold=threshold)
        outputs = (layer_output,) + outputs
        return outputs

In [None]:
class BertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)
        self.pruned_heads = set()

    def prune_heads(self, heads):
        if len(heads) == 0:
            return
        mask = torch.ones(self.self.num_attention_heads, self.self.attention_head_size)
        heads = set(heads) - self.pruned_heads  # Convert to set and remove already pruned heads
        for head in heads:
            # Compute how many pruned heads are before the head and move the index accordingly
            head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
            mask[head] = 0
        mask = mask.view(-1).contiguous().eq(1)
        index = torch.arange(len(mask))[mask].long()

        # Prune linear layers
        self.self.query = prune_linear_layer(self.self.query, index)
        self.self.key = prune_linear_layer(self.self.key, index)
        self.self.value = prune_linear_layer(self.self.value, index)
        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)

        # Update hyper params and store pruned heads
        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
        self.pruned_heads = self.pruned_heads.union(heads)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        threshold=None,
    ):
        self_outputs = self.self(
            hidden_states,
            attention_mask,
            head_mask,
            encoder_hidden_states,
            encoder_attention_mask,
            threshold=threshold,
        )
        attention_output = self.output(self_outputs[0], hidden_states, threshold=threshold)
        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
        return outputs
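
The index bookkeeping in `prune_heads` can be sketched on a toy configuration. The example assumes four heads of size two and no previously pruned heads, so the index shift for already removed heads is not needed. The resulting indices are the output features that survive, which is what `prune_linear_layer` receives.

```python
import torch

num_heads, head_size = 4, 2
heads_to_prune = {1, 3}

# One row per head; zero out the rows of the heads to remove.
mask = torch.ones(num_heads, head_size)
for head in heads_to_prune:
    mask[head] = 0

# Flatten to one flag per output feature and keep the surviving indices.
keep = mask.view(-1).eq(1)
index = torch.arange(len(keep))[keep].long()
print(index.tolist())  # [0, 1, 4, 5]: the features of heads 0 and 2
```

The same `index` selects rows of the query, key and value projections and, with `dim=1`, columns of the output projection.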

In [None]:
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )
        self.output_attentions = config.output_attentions

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = MaskedLinear(
            config.hidden_size,
            self.all_head_size,
            pruning_method=config.pruning_method,
            mask_init=config.mask_init,
            mask_scale=config.mask_scale,
        )
        self.key = MaskedLinear(
            config.hidden_size,
            self.all_head_size,
            pruning_method=config.pruning_method,
            mask_init=config.mask_init,
            mask_scale=config.mask_scale,
        )
        self.value = MaskedLinear(
            config.hidden_size,
            self.all_head_size,
            pruning_method=config.pruning_method,
            mask_init=config.mask_init,
            mask_scale=config.mask_scale,
        )

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        threshold=None,
    ):
        mixed_query_layer = self.query(hidden_states, threshold=threshold)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        if encoder_hidden_states is not None:
            mixed_key_layer = self.key(encoder_hidden_states, threshold=threshold)
            mixed_value_layer = self.value(encoder_hidden_states, threshold=threshold)
            attention_mask = encoder_attention_mask
        else:
            mixed_key_layer = self.key(hidden_states, threshold=threshold)
            mixed_value_layer = self.value(hidden_states, threshold=threshold)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in the BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if self.output_attentions else (context_layer,)
        return outputs
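
The shape changes in `transpose_for_scores` and the attention computation can be traced with plain tensors. This is a sketch with made-up sizes that leaves out the masked projections, dropout and masks, and reuses one tensor as query, key and value for brevity.

```python
import math
import torch

batch, seq_len, num_heads, head_size = 2, 5, 3, 4
hidden = torch.randn(batch, seq_len, num_heads * head_size)


def split_heads(x: torch.Tensor) -> torch.Tensor:
    """[batch, seq, heads * head_size] -> [batch, heads, seq, head_size]."""
    return x.view(batch, seq_len, num_heads, head_size).permute(0, 2, 1, 3)


# For brevity the same tensor plays the role of query, key and value.
q = k = v = split_heads(hidden)
scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(head_size)
probs = torch.softmax(scores, dim=-1)  # [batch, heads, seq, seq]
context = torch.matmul(probs, v)       # [batch, heads, seq, head_size]

# Merge the heads back into a single hidden dimension.
context = context.permute(0, 2, 1, 3).contiguous().view(batch, seq_len, -1)
print(scores.shape, context.shape)
```

The scores carry one `seq x seq` matrix per head, while the merged context has the same shape as the input hidden states.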

In [None]:
class MaskedLinear(nn.Linear):
    """
    Fully connected layer with an on-the-fly adaptive mask.
    If needed, a score matrix is created to store the importance of each associated weight.
    """

    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = True,
        mask_init: str = "constant",
        mask_scale: float = 0.0,
        pruning_method: str = "topK",
    ):
        """
        Args:
            in_features (`int`)
                Size of each input sample
            out_features (`int`)
                Size of each output sample
            bias (`bool`)
                If set to ``False``, the layer will not learn an additive bias.
                Default: ``True``
            mask_init (`str`)
                The initialization method for the score matrix if a score matrix is needed.
                Choices: ["constant", "uniform", "kaiming"]
                Default: ``constant``
            mask_scale (`float`)
                The initialization parameter for the chosen initialization method `mask_init`.
                Default: ``0.``
            pruning_method (`str`)
                Method to compute the mask.
                Choices: ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
                Default: ``topK``
        """
        super(MaskedLinear, self).__init__(in_features=in_features, out_features=out_features, bias=bias)
        assert pruning_method in ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
        self.pruning_method = pruning_method

        if self.pruning_method in ["topK", "threshold", "sigmoied_threshold", "l0"]:
            self.mask_scale = mask_scale
            self.mask_init = mask_init
            self.mask_scores = nn.Parameter(torch.empty(self.weight.size()))
            self.init_mask()

    def init_mask(self):
        if self.mask_init == "constant":
            init.constant_(self.mask_scores, val=self.mask_scale)
        elif self.mask_init == "uniform":
            init.uniform_(self.mask_scores, a=-self.mask_scale, b=self.mask_scale)
        elif self.mask_init == "kaiming":
            init.kaiming_uniform_(self.mask_scores, a=math.sqrt(5))

    def forward(self, input: torch.Tensor, threshold: float):
        # Get the mask
        if self.pruning_method == "topK":
            mask = TopKBinarizer.apply(self.mask_scores, threshold)
        elif self.pruning_method in ["threshold", "sigmoied_threshold"]:
            sig = "sigmoied" in self.pruning_method
            mask = ThresholdBinarizer.apply(self.mask_scores, threshold, sig)
        elif self.pruning_method == "magnitude":
            mask = MagnitudeBinarizer.apply(self.weight, threshold)
        elif self.pruning_method == "l0":
            l, r, b = -0.1, 1.1, 2 / 3
            if self.training:
                u = torch.zeros_like(self.mask_scores).uniform_().clamp(0.0001, 0.9999)
                s = torch.sigmoid((u.log() - (1 - u).log() + self.mask_scores) / b)
            else:
                s = torch.sigmoid(self.mask_scores)
            s_bar = s * (r - l) + l
            mask = s_bar.clamp(min=0.0, max=1.0)
        # Mask weights with computed mask
        weight_thresholded = mask * self.weight
        # Compute output (linear layer) with masked weights
        return F.linear(input, weight_thresholded, self.bias)
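
The `l0` branch of `MaskedLinear.forward` is easier to follow in isolation. The sketch below copies that branch into a standalone function. The name `hard_concrete_mask` and the toy scores are illustrative assumptions and are not part of the notebook or the upstream code. It shows that stretching the sigmoid to the interval `(l, r)` and clamping produces exact zeros and ones for scores far from zero.

```python
import torch


def hard_concrete_mask(mask_scores, training, l=-0.1, r=1.1, b=2 / 3):
    """Hypothetical standalone copy of the `l0` branch in `MaskedLinear.forward`."""
    if training:
        # Sample uniform noise and push it through the concrete relaxation
        u = torch.rand_like(mask_scores).clamp(0.0001, 0.9999)
        s = torch.sigmoid((u.log() - (1 - u).log() + mask_scores) / b)
    else:
        # At evaluation time the mask is deterministic
        s = torch.sigmoid(mask_scores)
    # Stretch to (l, r), then clamp so values can hit exactly 0 or 1
    s_bar = s * (r - l) + l
    return s_bar.clamp(min=0.0, max=1.0)


scores = torch.tensor([-6.0, -1.0, 0.0, 1.0, 6.0])
mask = hard_concrete_mask(scores, training=False)
print(mask)
```

Entries whose mask is exactly zero remove the corresponding weight from `F.linear`, while intermediate values only rescale it.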

In [None]:
class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = MaskedLinear(
            config.hidden_size,
            config.hidden_size,
            pruning_method=config.pruning_method,
            mask_init=config.mask_init,
            mask_scale=config.mask_scale,
        )
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
#         self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor, threshold):
        hidden_states = self.dense(hidden_states, threshold=threshold)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

In [None]:
class BertIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = MaskedLinear(
            config.hidden_size,
            config.intermediate_size,
            pruning_method=config.pruning_method,
            mask_init=config.mask_init,
            mask_scale=config.mask_scale,
        )
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states, threshold):
        hidden_states = self.dense(hidden_states, threshold=threshold)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states

In [None]:
class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = MaskedLinear(
            config.intermediate_size,
            config.hidden_size,
            pruning_method=config.pruning_method,
            mask_init=config.mask_init,
            mask_scale=config.mask_scale,
        )
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
#         self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor, threshold):
        hidden_states = self.dense(hidden_states, threshold=threshold)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

In [None]:
class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

In [None]:
class TopKBinarizer(autograd.Function):
    """
    Top-k Binarizer.
    Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j}`
    is among the k% highest values of S.

    Implementation is inspired from:
        https://github.com/allenai/hidden-networks
        What's hidden in a randomly weighted neural network?
        Vivek Ramanujan*, Mitchell Wortsman*, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari
    """

    @staticmethod
    def forward(ctx, inputs: torch.Tensor, threshold: float):
        """
        Args:
            inputs (`torch.FloatTensor`)
                The input matrix from which the binarizer computes the binary mask.
            threshold (`float`)
                The percentage of weights to keep (the rest is pruned).
                `threshold` is a float between 0 and 1.
        Returns:
            mask (`torch.FloatTensor`)
                Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
                retained, 0 - the associated weight is pruned).
        """
        # Get the subnetwork by sorting the inputs and using the top threshold %
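        # `threshold` can arrive as a batched tensor (e.g. when it is stored as a dataset
        # column for evaluation); in that case use its first entry.
        if not isinstance(threshold, float):
            threshold = threshold[0]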
        mask = inputs.clone()
        _, idx = inputs.flatten().sort(descending=True)
        j = int(threshold * inputs.numel())

        # flat_out and mask access the same memory.
        flat_out = mask.flatten()
        flat_out[idx[j:]] = 0
        flat_out[idx[:j]] = 1
        return mask

    @staticmethod
    def backward(ctx, gradOutput):
        return gradOutput, None
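
To see the binarizer and its straight-through gradient on a small tensor, here is a minimal sketch. `TopKMaskSketch` is a hypothetical, reduced stand-in for `TopKBinarizer`, and the toy `scores` and `weight` values are assumptions made for illustration. It builds the mask from zeros rather than cloning the input, which should give the same result.

```python
import torch
from torch import autograd


class TopKMaskSketch(autograd.Function):
    """Hypothetical, reduced stand-in for TopKBinarizer."""

    @staticmethod
    def forward(ctx, scores, keep_fraction):
        # Rank all scores and mark the `keep_fraction` largest ones with 1
        k = int(keep_fraction * scores.numel())
        _, idx = scores.flatten().sort(descending=True)
        flat_mask = torch.zeros(scores.numel(), dtype=scores.dtype, device=scores.device)
        flat_mask[idx[:k]] = 1.0
        return flat_mask.view_as(scores)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: the gradient reaches the scores unchanged;
        # `keep_fraction` is not differentiable, hence None
        return grad_output, None


scores = torch.tensor([[0.3, -1.2, 0.8], [2.0, 0.1, -0.5]], requires_grad=True)
weight = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

mask = TopKMaskSketch.apply(scores, 0.5)  # keep 3 of the 6 entries
(mask * weight).sum().backward()

print(mask)
print(scores.grad)
```

Because the backward pass ignores the binarization, every score receives a gradient, including scores whose weights are currently masked out. This is what lets pruned weights re-enter the network later in training.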

### Create trainer

In [None]:
from torch.utils.data import SequentialSampler, DataLoader

In [None]:
class PruningTrainingArguments(QuestionAnsweringTrainingArguments):
    def __init__(
        self,
        *args,
        initial_threshold=1.0,
        final_threshold=0.1,
        initial_warmup=1,
        final_warmup=2,
        final_lambda=0.0,
        mask_scores_learning_rate=1e-2,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)

        self.initial_threshold = initial_threshold
        self.final_threshold = final_threshold
        self.initial_warmup = initial_warmup
        self.final_warmup = final_warmup
        self.final_lambda = final_lambda
        self.mask_scores_learning_rate = mask_scores_learning_rate

In [None]:
class PruningTrainer(QuestionAnsweringTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
        if self.args.max_steps > 0:
            self.t_total = self.args.max_steps
            # Derive the epoch count from the number of batches per epoch, not the number of examples
            self.args.num_train_epochs = self.args.max_steps // (len(self.get_train_dataloader()) // self.args.gradient_accumulation_steps) + 1
        else:
            self.t_total = len(self.get_train_dataloader()) // self.args.gradient_accumulation_steps * self.args.num_train_epochs
            
#     def get_train_dataloader(self) -> DataLoader:
#         """
#         Returns the training :class:`~torch.utils.data.DataLoader`.

#         Will use no sampler if :obj:`self.train_dataset` does not implement :obj:`__len__`, a random sampler (adapted
#         to distributed training if necessary) otherwise.

#         Subclass and override this method if you want to inject some custom behavior.
#         """
#         if self.train_dataset is None:
#             raise ValueError("Trainer: training requires a train_dataset.")
#         train_sampler = SequentialSampler(self.train_dataset)

#         return DataLoader(
#             self.train_dataset,
#             batch_size=self.args.train_batch_size,
#             sampler=train_sampler,
#             collate_fn=self.data_collator,
#             drop_last=self.args.dataloader_drop_last,
# #             num_workers=self.args.dataloader_num_workers,
#         )
        
    def create_optimizer_and_scheduler(self, num_training_steps: int):
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in self.model.named_parameters() if "mask_score" in n and p.requires_grad],
                "lr": self.args.mask_scores_learning_rate,
            },
            {
                "params": [
                    p
                    for n, p in self.model.named_parameters()
                    if "mask_score" not in n and p.requires_grad and not any(nd in n for nd in no_decay)
                ],
                "lr": self.args.learning_rate,
                "weight_decay": self.args.weight_decay,
            },
            {
                "params": [
                    p
                    for n, p in self.model.named_parameters()
                    if "mask_score" not in n and p.requires_grad and any(nd in n for nd in no_decay)
                ],
                "lr": self.args.learning_rate,
                "weight_decay": 0.0,
            },
        ]

        self.optimizer = AdamW(optimizer_grouped_parameters, lr=self.args.learning_rate, eps=self.args.adam_epsilon)
        self.lr_scheduler = get_linear_schedule_with_warmup(
            self.optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=self.t_total
        )
        
        
    def compute_loss(self, model, inputs):
            
        threshold, regu_lambda = self._schedule_threshold(
            step=self.state.global_step+1,
            total_step=self.t_total,
            warmup_steps=self.args.warmup_steps,
            final_threshold=self.args.final_threshold,
            initial_threshold=self.args.initial_threshold,
            final_warmup=self.args.final_warmup,
            initial_warmup=self.args.initial_warmup,
            final_lambda=self.args.final_lambda,
        )
        inputs["threshold"] = threshold  
#         print(inputs)
        outputs = model(**inputs)
        # model outputs are always tuple in transformers (see doc)
        loss, start_logits_stu, end_logits_stu = outputs
        print(f"Step: {self.state.global_step} | Threshold: {threshold} | Loss: {loss} | t_total: {self.t_total}")
        
        return loss
    
    def _schedule_threshold(
        self,
        step: int,
        total_step: int,
        warmup_steps: int,
        initial_threshold: float,
        final_threshold: float,
        initial_warmup: int,
        final_warmup: int,
        final_lambda: float,
    ):
        if step <= initial_warmup * warmup_steps:
            threshold = initial_threshold
        elif step > (total_step - final_warmup * warmup_steps):
            threshold = final_threshold
        else:
            spars_warmup_steps = initial_warmup * warmup_steps
            spars_schedu_steps = (final_warmup + initial_warmup) * warmup_steps
            mul_coeff = 1 - (step - spars_warmup_steps) / (total_step - spars_schedu_steps)
            threshold = final_threshold + (initial_threshold - final_threshold) * (mul_coeff ** 3)
        regu_lambda = final_lambda * threshold / final_threshold
        return threshold, regu_lambda
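
The cubic schedule in `_schedule_threshold` can be checked without a trainer. The sketch below is a standalone, pure-Python version of the threshold part only; it omits `regu_lambda`. The function name `cubic_threshold` and its default arguments are assumptions chosen to mirror the settings used in this notebook (`warmup_steps=6`, 200 total steps).

```python
def cubic_threshold(
    step,
    total_steps,
    warmup_steps,
    initial_threshold=1.0,
    final_threshold=0.1,
    initial_warmup=1,
    final_warmup=2,
):
    """Hypothetical standalone version of the threshold schedule."""
    start = initial_warmup * warmup_steps    # steps held at the initial threshold
    cooldown = final_warmup * warmup_steps   # steps held at the final threshold
    if step <= start:
        return initial_threshold
    if step > total_steps - cooldown:
        return final_threshold
    # Fraction of the decay phase completed so far
    progress = (step - start) / (total_steps - start - cooldown)
    return final_threshold + (initial_threshold - final_threshold) * (1 - progress) ** 3


# `compute_loss` passes `global_step + 1`, so step 7 here matches "Step: 6" in the training log
for step in (1, 7, 101, 189, 200):
    print(step, cubic_threshold(step, total_steps=200, warmup_steps=6))
```

The threshold stays at its initial value for the first `initial_warmup * warmup_steps` steps, decays cubically, and is then held at `final_threshold` for the last `final_warmup * warmup_steps` steps.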

In [None]:
masked_config = MaskedBertConfig(pruning_method='topK', mask_init='constant', mask_scale=0.)
masked_model = MaskedBertForQuestionAnswering.from_pretrained('bert-base-uncased', config=masked_config).to(device)

batch_size = 8

num_train_examples = 160
num_eval_examples = 20

train_ds = train_enc.select(range(num_train_examples))
eval_ds = valid_enc.select(range(num_eval_examples))
eval_raw_ds = squad["validation"].select(range(num_eval_examples))
warmup_steps = 6
max_steps = 100
num_train_epochs = 10

print(f"Number of training examples: {train_ds.num_rows}")
print(f"Number of validation examples: {eval_ds.num_rows}")
print(f"Number of raw validation examples: {eval_raw_ds.num_rows}")

logging_steps = len(train_ds) // batch_size

print(f"Number of warmup steps: {warmup_steps}")

pruning_training_args = PruningTrainingArguments(
    output_dir="checkpoints",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
#     max_steps=max_steps,
    num_train_epochs=num_train_epochs,
    weight_decay=0.0,
    logging_steps=logging_steps,
    disable_tqdm=False,
    warmup_steps=warmup_steps,
    seed=42,
)

data_collator = default_data_collator

Number of training examples: 160
Number of validation examples: 20
Number of raw validation examples: 20
Number of warmup steps: 6


In [None]:
eval_ds = eval_ds.map(lambda x : {'threshold': 0.1})

In [None]:
pruning_trainer = PruningTrainer(
    model=masked_model,
    args=pruning_training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    eval_examples=eval_raw_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=squad_metrics
)

In [None]:
# pruning_trainer.evaluate()

In [None]:
pruning_trainer.train()

Step: 0 | Threshold: 1.0 | Loss: 5.989073753356934 | t_total: 200


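| Epoch | Training Loss | Validation Loss | Exact Match | F1 |
|-------|---------------|-----------------|-------------|----|
| 1.0 | 5.845221 | No log | 0.0 | 8.383459 |
| 2.0 | 5.990707 | No log | 0.0 | 0.0 |
| 3.0 | 5.9592 | No log | 0.0 | 0.8 |
| 4.0 | 5.954142 | No log | 0.0 | 7.738095 |
| 5.0 | 5.964359 | No log | 0.0 | 2.934783 |
| 6.0 | 5.957894 | No log | 0.0 | 1.702786 |
| 7.0 | 5.959497 | No log | 0.0 | 1.578947 |
| 8.0 | 5.940603 | No log | 0.0 | 1.909091 |
| 9.0 | 5.957425 | No log | 0.0 | 6.091374 |
| 10.0 | 5.954855 | No log | 0.0 | 8.063492 |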


Step: 1 | Threshold: 1.0 | Loss: 5.893919944763184 | t_total: 200
Step: 2 | Threshold: 1.0 | Loss: 5.9488115310668945 | t_total: 200
Step: 3 | Threshold: 1.0 | Loss: 5.98604679107666 | t_total: 200
Step: 4 | Threshold: 1.0 | Loss: 5.863607883453369 | t_total: 200
Step: 5 | Threshold: 1.0 | Loss: 5.869380950927734 | t_total: 200
Step: 6 | Threshold: 0.9852461977703495 | Loss: 5.623629570007324 | t_total: 200
Step: 7 | Threshold: 0.9706545235949898 | Loss: 5.372407913208008 | t_total: 200
Step: 8 | Threshold: 0.9562240817388141 | Loss: 5.281087398529053 | t_total: 200
Step: 9 | Threshold: 0.9419539764667164 | Loss: 5.054854393005371 | t_total: 200
Step: 10 | Threshold: 0.9278433120435897 | Loss: 6.036581039428711 | t_total: 200
Step: 11 | Threshold: 0.9138911927343276 | Loss: 6.019186973571777 | t_total: 200
Step: 12 | Threshold: 0.9000967228038235 | Loss: 5.940598487854004 | t_total: 200
Step: 13 | Threshold: 0.8864590065169706 | Loss: 5.9573073387146 | t_total: 200
Step: 14 | Threshold



Step: 20 | Threshold: 0.7953088527822858 | Loss: 6.08903694152832 | t_total: 200
Step: 21 | Threshold: 0.7828929191808071 | Loss: 5.997733116149902 | t_total: 200
Step: 22 | Threshold: 0.7706256776070204 | Loss: 6.2273664474487305 | t_total: 200
Step: 23 | Threshold: 0.7585062323258194 | Loss: 5.976736068725586 | t_total: 200
Step: 24 | Threshold: 0.7465336876020973 | Loss: 6.014189720153809 | t_total: 200
Step: 25 | Threshold: 0.7347071477007473 | Loss: 5.955245494842529 | t_total: 200
Step: 26 | Threshold: 0.7230257168866635 | Loss: 5.984169006347656 | t_total: 200
Step: 27 | Threshold: 0.7114884994247389 | Loss: 5.970478534698486 | t_total: 200
Step: 28 | Threshold: 0.7000945995798671 | Loss: 5.956234931945801 | t_total: 200
Step: 29 | Threshold: 0.6888431216169413 | Loss: 5.966883182525635 | t_total: 200
Step: 30 | Threshold: 0.677733169800855 | Loss: 6.008230209350586 | t_total: 200
Step: 31 | Threshold: 0.6667638483965016 | Loss: 5.891583442687988 | t_total: 200
Step: 32 | Thres



Step: 40 | Threshold: 0.5742205279927174 | Loss: 6.029837608337402 | t_total: 200
Step: 41 | Threshold: 0.5646082452748316 | Loss: 5.945032596588135 | t_total: 200
Step: 42 | Threshold: 0.5551267398825062 | Loss: 5.928070068359375 | t_total: 200
Step: 43 | Threshold: 0.5457751160806347 | Loss: 5.924349784851074 | t_total: 200
Step: 44 | Threshold: 0.5365524781341108 | Loss: 5.952756881713867 | t_total: 200
Step: 45 | Threshold: 0.5274579303078277 | Loss: 6.009023666381836 | t_total: 200
Step: 46 | Threshold: 0.5184905768666789 | Loss: 5.953790187835693 | t_total: 200
Step: 47 | Threshold: 0.5096495220755575 | Loss: 5.969111442565918 | t_total: 200
Step: 48 | Threshold: 0.5009338701993574 | Loss: 5.95670747756958 | t_total: 200
Step: 49 | Threshold: 0.49234272550297187 | Loss: 5.937241554260254 | t_total: 200
Step: 50 | Threshold: 0.48387519225129416 | Loss: 5.956970691680908 | t_total: 200
Step: 51 | Threshold: 0.4755303747092179 | Loss: 5.897049427032471 | t_total: 200
Step: 52 | Thr



Step: 60 | Threshold: 0.40580142747000614 | Loss: 5.967108726501465 | t_total: 200
Step: 61 | Threshold: 0.39863450159308145 | Loss: 6.0039567947387695 | t_total: 200
Step: 62 | Threshold: 0.3915804383395858 | Loss: 5.914857864379883 | t_total: 200
Step: 63 | Threshold: 0.3846383419744125 | Loss: 5.953590393066406 | t_total: 200
Step: 64 | Threshold: 0.37780731676245516 | Loss: 5.911615371704102 | t_total: 200
Step: 65 | Threshold: 0.3710864669686069 | Loss: 5.941280364990234 | t_total: 200
Step: 66 | Threshold: 0.36447489685776135 | Loss: 5.941457748413086 | t_total: 200
Step: 67 | Threshold: 0.3579717106948118 | Loss: 5.974471092224121 | t_total: 200
Step: 68 | Threshold: 0.3515760127446519 | Loss: 5.998353004455566 | t_total: 200
Step: 69 | Threshold: 0.34528690727217465 | Loss: 5.952250003814697 | t_total: 200
Step: 70 | Threshold: 0.33910349854227395 | Loss: 5.961615085601807 | t_total: 200
Step: 71 | Threshold: 0.3330248908198431 | Loss: 5.93968391418457 | t_total: 200
Step: 72 



Step: 80 | Threshold: 0.28288567036151874 | Loss: 5.952125549316406 | t_total: 200
Step: 81 | Threshold: 0.2778058072829235 | Loss: 5.956075668334961 | t_total: 200
Step: 82 | Threshold: 0.2728208921256258 | Loss: 6.00127649307251 | t_total: 200
Step: 83 | Threshold: 0.2679300291545189 | Loss: 5.961522102355957 | t_total: 200
Step: 84 | Threshold: 0.26313232263449626 | Loss: 5.887432098388672 | t_total: 200
Step: 85 | Threshold: 0.2584268768304513 | Loss: 5.955479145050049 | t_total: 200
Step: 86 | Threshold: 0.25381279600727735 | Loss: 5.992793083190918 | t_total: 200
Step: 87 | Threshold: 0.24928918442986797 | Loss: 5.972589492797852 | t_total: 200
Step: 88 | Threshold: 0.24485514636311648 | Loss: 5.986364841461182 | t_total: 200
Step: 89 | Threshold: 0.24050978607191623 | Loss: 5.969649314880371 | t_total: 200
Step: 90 | Threshold: 0.23625220782116085 | Loss: 5.953175067901611 | t_total: 200
Step: 91 | Threshold: 0.23208151587574363 | Loss: 5.970466613769531 | t_total: 200
Step: 92



Step: 100 | Threshold: 0.1983073758146213 | Loss: 5.97183895111084 | t_total: 200
Step: 101 | Threshold: 0.19495628149172406 | Loss: 5.965817451477051 | t_total: 200
Step: 102 | Threshold: 0.19168222038799265 | Loss: 5.938935279846191 | t_total: 200
Step: 103 | Threshold: 0.18848429676832046 | Loss: 5.958086013793945 | t_total: 200
Step: 104 | Threshold: 0.18536161489760092 | Loss: 5.9330363273620605 | t_total: 200
Step: 105 | Threshold: 0.18231327904072742 | Loss: 6.009718894958496 | t_total: 200
Step: 106 | Threshold: 0.17933839346259342 | Loss: 6.016504287719727 | t_total: 200
Step: 107 | Threshold: 0.17643606242809237 | Loss: 5.903928756713867 | t_total: 200
Step: 108 | Threshold: 0.17360539020211768 | Loss: 5.987347602844238 | t_total: 200
Step: 109 | Threshold: 0.1708454810495627 | Loss: 5.91176700592041 | t_total: 200
Step: 110 | Threshold: 0.16815543923532095 | Loss: 5.964362621307373 | t_total: 200
Step: 111 | Threshold: 0.1655343690242857 | Loss: 5.93583869934082 | t_total: 



Step: 120 | Threshold: 0.1449006629766804 | Loss: 5.95989990234375 | t_total: 200
Step: 121 | Threshold: 0.14292004336684933 | Loss: 5.9453935623168945 | t_total: 200
Step: 122 | Threshold: 0.14099854227405248 | Loss: 5.95749568939209 | t_total: 200
Step: 123 | Threshold: 0.1391352639631833 | Loss: 5.937592506408691 | t_total: 200
Step: 124 | Threshold: 0.13732931269913518 | Loss: 5.97677755355835 | t_total: 200
Step: 125 | Threshold: 0.13557979274680157 | Loss: 5.976527214050293 | t_total: 200
Step: 126 | Threshold: 0.13388580837107586 | Loss: 5.95491361618042 | t_total: 200
Step: 127 | Threshold: 0.13224646383685149 | Loss: 5.97629451751709 | t_total: 200
Step: 128 | Threshold: 0.13066086340902183 | Loss: 5.968482971191406 | t_total: 200
Step: 129 | Threshold: 0.1291281113524804 | Loss: 5.9564924240112305 | t_total: 200
Step: 130 | Threshold: 0.12764731193212053 | Loss: 5.9574432373046875 | t_total: 200
Step: 131 | Threshold: 0.12621756941283568 | Loss: 5.980905532836914 | t_total: 



Step: 140 | Threshold: 0.11549965099506218 | Loss: 5.962859153747559 | t_total: 200
Step: 141 | Threshold: 0.11453121205566563 | Loss: 5.902565956115723 | t_total: 200
Step: 142 | Threshold: 0.11360397693117172 | Loss: 5.969881057739258 | t_total: 200
Step: 143 | Threshold: 0.11271704988647388 | Loss: 5.970367431640625 | t_total: 200
Step: 144 | Threshold: 0.11186953518646553 | Loss: 5.9229207038879395 | t_total: 200
Step: 145 | Threshold: 0.11106053709604005 | Loss: 5.92866325378418 | t_total: 200
Step: 146 | Threshold: 0.11028915988009093 | Loss: 5.9737701416015625 | t_total: 200
Step: 147 | Threshold: 0.10955450780351156 | Loss: 5.946454048156738 | t_total: 200
Step: 148 | Threshold: 0.10885568513119534 | Loss: 5.909182548522949 | t_total: 200
Step: 149 | Threshold: 0.10819179612803573 | Loss: 5.940603733062744 | t_total: 200
Step: 150 | Threshold: 0.1075619450589261 | Loss: 5.964258193969727 | t_total: 200
Step: 151 | Threshold: 0.10696523618875992 | Loss: 5.888424396514893 | t_to



Step: 160 | Threshold: 0.1029384590171331 | Loss: 5.951730728149414 | t_total: 200
Step: 161 | Threshold: 0.10262390670553936 | Loss: 5.930002212524414 | t_total: 200
Step: 162 | Threshold: 0.10233264350671668 | Loss: 5.9388837814331055 | t_total: 200
Step: 163 | Threshold: 0.1020637736855585 | Loss: 5.963850498199463 | t_total: 200
Step: 164 | Threshold: 0.10181640150695821 | Loss: 5.9934587478637695 | t_total: 200
Step: 165 | Threshold: 0.10158963123580925 | Loss: 5.935539245605469 | t_total: 200
Step: 166 | Threshold: 0.10138256713700501 | Loss: 5.936539649963379 | t_total: 200
Step: 167 | Threshold: 0.10119431347543895 | Loss: 5.942351341247559 | t_total: 200
Step: 168 | Threshold: 0.10102397451600446 | Loss: 5.9381232261657715 | t_total: 200
Step: 169 | Threshold: 0.10087065452359499 | Loss: 6.017017364501953 | t_total: 200
Step: 170 | Threshold: 0.10073345776310395 | Loss: 5.961722373962402 | t_total: 200
Step: 171 | Threshold: 0.10061148849942475 | Loss: 5.973657608032227 | t_t



Step: 180 | Threshold: 0.10005120619025945 | Loss: 5.95906925201416 | t_total: 200
Step: 181 | Threshold: 0.10003224646383686 | Loss: 5.941567420959473 | t_total: 200
Step: 182 | Threshold: 0.10001866114805374 | Loss: 5.9982686042785645 | t_total: 200
Step: 183 | Threshold: 0.10000955450780352 | Loss: 5.975482940673828 | t_total: 200
Step: 184 | Threshold: 0.10000403080797961 | Loss: 5.929422855377197 | t_total: 200
Step: 185 | Threshold: 0.10000119431347544 | Loss: 5.914976119995117 | t_total: 200
Step: 186 | Threshold: 0.10000014928918444 | Loss: 5.949907302856445 | t_total: 200
Step: 187 | Threshold: 0.1 | Loss: 5.926374435424805 | t_total: 200
Step: 188 | Threshold: 0.1 | Loss: 5.976995468139648 | t_total: 200
Step: 189 | Threshold: 0.1 | Loss: 5.975089073181152 | t_total: 200
Step: 190 | Threshold: 0.1 | Loss: 5.936020374298096 | t_total: 200
Step: 191 | Threshold: 0.1 | Loss: 5.955297470092773 | t_total: 200
Step: 192 | Threshold: 0.1 | Loss: 5.952559947967529 | t_total: 200
Ste





TrainOutput(global_step=200, training_loss=5.948390426635743)

In [None]:
pruning_trainer.t_total

200