# Reading Comprehension with ALBERT (and similar)
# Using Posit 6 bits for dense layers & 8 bits for the rest


## Introduction

Reading comprehension, otherwise known as question answering, is one of the tasks that NLP tries to solve. The goal of this task is to answer an arbitrary question given a context. For instance, given the following context:

> New Zealand (Māori: Aotearoa) is a sovereign island country in the southwestern Pacific Ocean. It has a total land area of 268,000 square kilometres (103,500 sq mi), and a population of 4.9 million. New Zealand's capital city is Wellington, and its most populous city is Auckland.

We ask the question

> How many people live in New Zealand?

We expect the QA system to respond with something like this:

> 4.9 million

Since 2017, transformer models have been shown to outperform existing approaches for this task. Many pretrained transformer models exist, including BERT, GPT-2, and XLNet. One of the newcomers to the group is ALBERT (A Lite BERT), which was published in September 2019. Its authors claim that it outperforms BERT with far fewer parameters (and shorter training and inference time).

This tutorial demonstrates how you can fine-tune ALBERT for the QA task and use it for inference. We will use the transformers library built by [Hugging Face](https://huggingface.co/), which is an extremely nice implementation of the transformer models (including ALBERT) in both TensorFlow and PyTorch. You can just use a fine-tuned model from their [model repository](https://huggingface.co/models) (which I encourage in general, to save money and reduce emissions). However, for educational purposes I will also show you how to fine-tune it yourself so you can adapt it to your own data.

Note that the goal here is not to build an optimised, production-ready system, but to demonstrate the concept with as little code as possible. Therefore a lot of the code is retrofitted for this purpose.


## 1.0 Setup

Let's check out what kind of GPU our friends at Google gave us. This notebook should be configured to give you a P100 😃 (saved in metadata), although the run recorded below was assigned a Tesla T4.

In [None]:
!nvidia-smi

Mon Dec  2 23:05:20 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

First, we clone the Hugging Face transformers library from GitHub.


Note that it checks out a specific commit, only because that is the one I've tested.

In [None]:
!rm -r transformers
!git clone https://github.com/huggingface/transformers \
&& cd transformers \
&& git checkout a3085020ed0d81d4903c50967687192e3101e770
!pip install ninja
!pip install qtorch-posit==0.1.1


rm: cannot remove 'transformers': No such file or directory
Cloning into 'transformers'...
remote: Enumerating objects: 243432, done.
remote: Counting objects: 100% (705/705), done.
remote: Compressing objects: 100% (387/387), done.
remote: Total 243432 (delta 415), reused 455 (delta 257), pack-reused 242727 (from 1)
Receiving objects: 100% (243432/243432), 255.45 MiB | 14.56 MiB/s, done.
Resolving deltas: 100% (178256/178256), done.
Note: switching to 'a3085020ed0d81d4903c50967687192e3101e770'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable 

In [None]:
# original code
!pip install ./transformers
!pip install tensorboardX
!pip install botocore==1.17


# !pip install --upgrade pip setuptools wheel
# !curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
# import os
# os.environ['PATH'] += ":/root/.cargo/bin"
# !pip uninstall -y sentence-transformers
# !pip install ./transformers
# !pip install tensorboardX boto3 botocore==1.17
# !pip install transformers tokenizers
# !pip check


Processing ./transformers
  Preparing metadata (setup.py) ... done
Collecting tokenizers==0.0.11 (from transformers==2.3.0)
  Downloading tokenizers-0.0.11.tar.gz (30 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting boto3 (from transformers==2.3.0)
  Downloading boto3-1.35.72-py3-none-any.whl.metadata (6.7 kB)
Collecting sacremoses (from transformers==2.3.0)
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting botocore<1.36.0,>=1.35.72 (from boto3->transformers==2.3.0)
  Downloading botocore-1.35.72-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3->transformers==2.3.0)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from boto3->transformers==2.3.0)
  Downloading s3transfer-0.10.4-py3-none-any.whl.metadata (1.7 kB)
Downloading boto

## 2.0 Train Model

This is where we can train our own model. Note you can skip this step if you don't want to wait 1.5 hours!

### 2.1 Get Training and Evaluation Data

The SQuAD dataset contains question/answer pairs for training the ALBERT model on the QA task.

Now get the SQuAD V2.0 dataset. `train-v2.0.json` is for training and `dev-v2.0.json` is for evaluation to see how well your model trained.

Read more about this dataset here: https://rajpurkar.github.io/SQuAD-explorer/
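
Before downloading, it can help to see how a SQuAD v2.0 file is laid out. The sketch below builds a tiny hand-made record (the title, context, ids and questions are made up for illustration, not taken from the real dataset) that follows the nesting of `train-v2.0.json`, and walks it with a hypothetical helper, `iter_questions`. Unanswerable questions, which are what V2.0 adds, carry `is_impossible: true` and an empty `answers` list.

```python
import json

# Hand-made sample following the SQuAD v2.0 nesting; contents are illustrative only.
context = "New Zealand has a population of 4.9 million. Its capital city is Wellington."
sample = {
    "version": "v2.0",
    "data": [{
        "title": "New_Zealand",
        "paragraphs": [{
            "context": context,
            "qas": [
                {
                    "id": "demo-1",
                    "question": "How many people live in New Zealand?",
                    "is_impossible": False,
                    # answer_start is a character offset into the context
                    "answers": [{"text": "4.9 million",
                                 "answer_start": context.index("4.9 million")}],
                },
                {
                    "id": "demo-2",
                    "question": "Who designed the flag of New Zealand?",
                    "is_impossible": True,
                    "answers": [],
                },
            ],
        }],
    }],
}


def iter_questions(squad):
    """Yield (question, answer_text_or_None) for every question in a SQuAD-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            text = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa["is_impossible"] or not qa["answers"]:
                    yield qa["question"], None
                    continue
                first = qa["answers"][0]
                start = first["answer_start"]
                # The answer is a span of the context, recovered from the offset.
                yield qa["question"], text[start:start + len(first["text"])]


# Round-trip through JSON to mimic reading the file from disk.
loaded = json.loads(json.dumps(sample))
for question, answer in iter_questions(loaded):
    print(question, "->", answer)
```

With the real files, replacing `loaded` with `json.load(open("dataset/dev-v2.0.json"))` should work the same way, assuming the download below succeeded.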

In [None]:
!mkdir dataset \
&& cd dataset \
&& wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json \
&& wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json


--2024-12-02 23:06:39--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2024-12-02 23:06:39 (196 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2024-12-02 23:06:39--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2024-12-02 23:06:40 (84.3 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



## 3.0 Setup prediction code and use Posit

Now we can use the Hugging Face library to make predictions using our fine-tuned model. Note that a lot of the code is pulled from `run_squad.py` in the Hugging Face repository, with all the training parts removed. This modified code lets us run predictions on inputs we pass in directly as strings, rather than in the .json format used by the training/test set.

NOTE: if you decided to train your own model, change the flag `use_own_model` to `True`.

**Important**: this step shows how to use posits for inference. Activations are quantized by registering a `forward_pre_hook` and a `forward_hook` on each module, and weights are quantized in place. Please look at the loop:

***for name, module in model.named_modules():***
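
To get a feel for what "6-bit posit with `es=1`" means, here is a minimal pure-Python sketch that decodes every 6-bit pattern into its real value and rounds a number to the nearest one. It is not how `qtorch_posit` is implemented: `decode_posit`, `posit_values` and `quantize` are hypothetical helpers, rounding is simply nearest-by-value, and the `scale` argument is assumed to mean "multiply before quantizing, divide afterwards".

```python
import math


def decode_posit(bits, nsize, es):
    """Decode an nsize-bit posit pattern (given as an int) into a float."""
    mask = (1 << nsize) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nsize - 1):
        return float("nan")  # NaR ("not a real")
    sign = -1.0 if bits >> (nsize - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    # Bits after the sign bit, most significant first.
    body = [(bits >> i) & 1 for i in range(nsize - 2, -1, -1)]
    # Regime: a run of identical bits, ended by the opposite bit (or the end).
    first, run = body[0], 1
    while run < len(body) and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run
    rest = body[run + 1:]  # skip the terminating bit
    # Exponent: up to es bits, padded with zeros if they were cut off.
    exp_bits = rest[:es] + [0] * max(0, es - len(rest[:es]))
    exponent = 0
    for b in exp_bits:
        exponent = exponent * 2 + b
    # Fraction: whatever is left, with an implicit leading 1.
    fraction = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(rest[es:]))
    return sign * 2.0 ** (k * (1 << es) + exponent) * (1.0 + fraction)


def posit_values(nsize, es):
    """All finite values representable with the given posit format, sorted."""
    values = {decode_posit(b, nsize, es) for b in range(1 << nsize)}
    return sorted(v for v in values if not math.isnan(v))


def quantize(x, table, scale=1.0):
    """Round x*scale to the nearest table entry, then undo the scaling."""
    target = x * scale
    return min(table, key=lambda v: abs(v - target)) / scale


table6 = posit_values(6, 1)
print(len(table6))   # 63 finite values (64 patterns minus NaR)
print(max(table6))   # 256.0
print(quantize(0.013, table6), quantize(0.013, table6, scale=64.0))
```

The representable values are densest around ±1 and thin out towards the extremes, which is why the cells below rescale small-magnitude tensors before quantizing them.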


In [None]:

# Only scale weights
import os
import torch
import numpy as np
from tqdm import tqdm
from torch.utils.data import DataLoader, SequentialSampler
from qtorch_posit.quant import posit_quantize
from transformers import (
    AlbertConfig,
    AlbertForQuestionAnswering,
    AlbertTokenizer,
    squad_convert_examples_to_features,
)
from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample
from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate

# Configuration
use_own_model = False
model_name_or_path = "/content/model_output" if use_own_model else "ktrapeznikov/albert-xlarge-v2-squad-v2"
epsilon = 1e-12  # To avoid log(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model, tokenizer, and processor
config = AlbertConfig.from_pretrained(model_name_or_path)
tokenizer = AlbertTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
model = AlbertForQuestionAnswering.from_pretrained(model_name_or_path, config=config)
processor = SquadV2Processor()
model.to(device)

# Define posit quantization functions
def linear_weight_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=6, es=1, scale=scale)

def other_weight_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=8, es=1, scale=scale)

def linear_activation_opt(input_tensor):
    return posit_quantize(input_tensor, nsize=6, es=1)

def other_activation_opt(input_tensor):
    return posit_quantize(input_tensor, nsize=8, es=1)

# Define hooks for processing layers
def forward_pre_hook_linear_opt(m, input):
    return (linear_activation_opt(input[0]),)

def forward_hook_opt(m, input, output):
    return other_activation_opt(output)

def forward_pre_hook_other_opt(m, input):
    if isinstance(input[0], torch.Tensor) and input[0].dtype == torch.float32:
        return (other_activation_opt(input[0]),)
    return input

# Quantize model weights and add hooks
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
        print(f"Processing linear/conv layer: {name}")
        module.weight.data = linear_weight_opt(module.weight.data)
        module.register_forward_pre_hook(forward_pre_hook_linear_opt)
        module.register_forward_hook(forward_hook_opt)
    elif hasattr(module, 'weight') and module.weight is not None:
        print(f"Processing other layer: {name}")
        module.weight.data = other_weight_opt(module.weight.data)
        module.register_forward_pre_hook(forward_pre_hook_other_opt)

print("Quantization complete. Model ready for evaluation.")

# Evaluation function
def evaluate():
    examples = processor.get_dev_examples("/content/dataset", "dev-v2.0.json")[:1000]
    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )
    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=32)

    all_results = []
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "token_type_ids": batch[2]}
            outputs = model(**inputs)
            example_indices = batch[3]
            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                result = SquadResult(
                    unique_id=int(eval_feature.unique_id),
                    start_logits=outputs[0][i].cpu().tolist(),
                    end_logits=outputs[1][i].cpu().tolist(),
                )
                all_results.append(result)

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size=1,
        max_answer_length=30,
        do_lower_case=True,
        output_prediction_file="predictions.json",
        output_nbest_file="nbest_predictions.json",
        output_null_log_odds_file="null_predictions.json",
        verbose_logging=False,
        version_2_with_negative=True,
        null_score_diff_threshold=0.0,
        tokenizer=tokenizer,
    )

    results = squad_evaluate(examples, predictions)
    print(f"Evaluation results: {results}")
    return results

# Run evaluation
results = evaluate()


Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu121/quant_cpu...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/quant_cpu/build.ninja...
Building extension module quant_cpu...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module quant_cpu...
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu121/quant_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/quant_cuda/build.ninja...
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Building extension module quant_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module quant_cud

Some weights of the model checkpoint at ktrapeznikov/albert-xlarge-v2-squad-v2 were not used when initializing AlbertForQuestionAnswering: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Processing other layer: albert.embeddings.word_embeddings
Processing other layer: albert.embeddings.position_embeddings
Processing other layer: albert.embeddings.token_type_embeddings
Processing other layer: albert.embeddings.LayerNorm
Processing linear/conv layer: albert.encoder.embedding_hidden_mapping_in
Processing other layer: albert.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.query
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.key
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.value
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.dense
Processing other layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.ffn
Processing lin

100%|██████████| 35/35 [00:08<00:00,  4.36it/s]
convert squad examples to features: 100%|██████████| 1000/1000 [00:04<00:00, 220.69it/s]
add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 547059.35it/s]
Evaluating: 100%|██████████| 32/32 [04:20<00:00,  8.15s/it]


Evaluation results: OrderedDict([('exact', 83.0), ('f1', 85.9574925043174), ('total', 1000), ('HasAns_exact', 78.3132530120482), ('HasAns_f1', 84.25199298055716), ('HasAns_total', 498), ('NoAns_exact', 87.64940239043824), ('NoAns_f1', 87.64940239043824), ('NoAns_total', 502), ('best_exact', 83.0), ('best_exact_thresh', 0.0), ('best_f1', 85.95749250431736), ('best_f1_thresh', 0.0)])
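
The `linear_weight_opt` and `other_weight_opt` functions in the cell above pick their scale from a histogram of `log2(|w|)`: they take the centre of the most populated bin and use `2 ** (-centre)`, so that the most common magnitude ends up close to 1 after scaling. The sketch below reproduces that idea with the standard library only; `mode_scale` is a hypothetical name, and the binning is a simplified stand-in for `np.histogram`, so the result may differ slightly from the NumPy version at bin edges.

```python
import math


def mode_scale(values, num_bins=100, epsilon=1e-12):
    """Return 2**(-c), where c is the centre of the fullest log2(|v|) histogram bin."""
    logs = [math.log2(abs(v) + epsilon) for v in values]  # epsilon avoids log2(0)
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return 2.0 ** (-lo)  # all magnitudes identical
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in logs:
        # Clamp so that the maximum falls into the last bin.
        counts[min(int((x - lo) / width), num_bins - 1)] += 1
    fullest = counts.index(max(counts))
    centre = lo + (fullest + 0.5) * width
    return 2.0 ** (-centre)


# Toy "weights": most magnitudes sit around 0.01.
weights = [0.01] * 50 + [-0.01] * 50 + [0.02, 0.04, 0.005]
scale = mode_scale(weights)
print(scale, 0.01 * scale)  # the common magnitude is mapped close to 1
```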


In [None]:
# Scale both weights and activations
import os
import torch
import numpy as np
from tqdm import tqdm
from torch.utils.data import DataLoader, SequentialSampler
from qtorch_posit.quant import posit_quantize
from transformers import (
    AlbertConfig,
    AlbertForQuestionAnswering,
    AlbertTokenizer,
    squad_convert_examples_to_features,
)
from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample
from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate

# Configuration
use_own_model = False
model_name_or_path = "/content/model_output" if use_own_model else "ktrapeznikov/albert-xlarge-v2-squad-v2"
epsilon = 1e-12  # To avoid log(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model, tokenizer, and processor
config = AlbertConfig.from_pretrained(model_name_or_path)
tokenizer = AlbertTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
model = AlbertForQuestionAnswering.from_pretrained(model_name_or_path, config=config)
processor = SquadV2Processor()
model.to(device)

# Define posit quantization functions
def linear_weight_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=6, es=1, scale=scale)

def other_weight_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=8, es=1, scale=scale)

def linear_activation_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=6, es=1, scale=scale)


def other_activation_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=8, es=1, scale=scale)

# Define hooks for processing layers
def forward_pre_hook_linear_opt(m, input):
    return (linear_activation_opt(input[0]),)

def forward_hook_opt(m, input, output):
    return other_activation_opt(output)

def forward_pre_hook_other_opt(m, input):
    if isinstance(input[0], torch.Tensor) and input[0].dtype == torch.float32:
        return (other_activation_opt(input[0]),)
    return input

# Quantize model weights and add hooks
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
        print(f"Processing linear/conv layer: {name}")
        module.weight.data = linear_weight_opt(module.weight.data)
        module.register_forward_pre_hook(forward_pre_hook_linear_opt)
        module.register_forward_hook(forward_hook_opt)
    elif hasattr(module, 'weight') and module.weight is not None:
        print(f"Processing other layer: {name}")
        module.weight.data = other_weight_opt(module.weight.data)
        module.register_forward_pre_hook(forward_pre_hook_other_opt)

print("Quantization complete. Model ready for evaluation.")

# Evaluation function
def evaluate():
    examples = processor.get_dev_examples("/content/dataset", "dev-v2.0.json")[:1000]
    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )
    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=32)

    all_results = []
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "token_type_ids": batch[2]}
            outputs = model(**inputs)
            example_indices = batch[3]
            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                result = SquadResult(
                    unique_id=int(eval_feature.unique_id),
                    start_logits=outputs[0][i].cpu().tolist(),
                    end_logits=outputs[1][i].cpu().tolist(),
                )
                all_results.append(result)

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size=1,
        max_answer_length=30,
        do_lower_case=True,
        output_prediction_file="predictions.json",
        output_nbest_file="nbest_predictions.json",
        output_null_log_odds_file="null_predictions.json",
        verbose_logging=False,
        version_2_with_negative=True,
        null_score_diff_threshold=0.0,
        tokenizer=tokenizer,
    )

    results = squad_evaluate(examples, predictions)
    print(f"Evaluation results: {results}")
    return results

# Run evaluation
results = evaluate()


Some weights of the model checkpoint at ktrapeznikov/albert-xlarge-v2-squad-v2 were not used when initializing AlbertForQuestionAnswering: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Processing other layer: albert.embeddings.word_embeddings
Processing other layer: albert.embeddings.position_embeddings
Processing other layer: albert.embeddings.token_type_embeddings
Processing other layer: albert.embeddings.LayerNorm
Processing linear/conv layer: albert.encoder.embedding_hidden_mapping_in
Processing other layer: albert.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.query
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.key
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.value
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.dense
Processing other layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.ffn
Processing lin

100%|██████████| 35/35 [00:05<00:00,  6.57it/s]
convert squad examples to features: 100%|██████████| 1000/1000 [00:05<00:00, 177.83it/s]
add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 554802.12it/s]
Evaluating: 100%|██████████| 32/32 [2:08:39<00:00, 241.24s/it]


Evaluation results: OrderedDict([('exact', 82.9), ('f1', 85.85728354978349), ('total', 1000), ('HasAns_exact', 78.3132530120482), ('HasAns_f1', 84.2515733931397), ('HasAns_total', 498), ('NoAns_exact', 87.45019920318725), ('NoAns_f1', 87.45019920318725), ('NoAns_total', 502), ('best_exact', 82.9), ('best_exact_thresh', 0.0), ('best_f1', 85.85728354978346), ('best_f1_thresh', 0.0)])


In [None]:
# Scale linear weight and activation
import os
import torch
import numpy as np
from tqdm import tqdm
from torch.utils.data import DataLoader, SequentialSampler
from qtorch_posit.quant import posit_quantize
from transformers import (
    AlbertConfig,
    AlbertForQuestionAnswering,
    AlbertTokenizer,
    squad_convert_examples_to_features,
)
from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample
from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate

# Configuration
use_own_model = False
model_name_or_path = "/content/model_output" if use_own_model else "ktrapeznikov/albert-xlarge-v2-squad-v2"
epsilon = 1e-12  # To avoid log(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model, tokenizer, and processor
config = AlbertConfig.from_pretrained(model_name_or_path)
tokenizer = AlbertTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
model = AlbertForQuestionAnswering.from_pretrained(model_name_or_path, config=config)
processor = SquadV2Processor()
model.to(device)

# Define posit quantization functions
def linear_weight_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=6, es=1, scale=scale)

def other_weight_opt(input_tensor):

    return posit_quantize(input_tensor, nsize=8, es=1)

def linear_activation_opt(input_tensor):
    log2_input = np.log2(np.abs(input_tensor.cpu().numpy()) + epsilon)
    counts, bins = np.histogram(log2_input, bins=100)
    x_with_max_frequency = (bins[np.argmax(counts)] + bins[np.argmax(counts) + 1]) / 2
    scale = 2 ** (-x_with_max_frequency)
    return posit_quantize(input_tensor, nsize=6, es=1, scale=scale)


def other_activation_opt(input_tensor):

    return posit_quantize(input_tensor, nsize=8, es=1)

# Define hooks for processing layers
def forward_pre_hook_linear_opt(m, input):
    return (linear_activation_opt(input[0]),)

def forward_hook_opt(m, input, output):
    return other_activation_opt(output)

def forward_pre_hook_other_opt(m, input):
    if isinstance(input[0], torch.Tensor) and input[0].dtype == torch.float32:
        return (other_activation_opt(input[0]),)
    return input

# Quantize model weights and add hooks
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
        print(f"Processing linear/conv layer: {name}")
        module.weight.data = linear_weight_opt(module.weight.data)
        module.register_forward_pre_hook(forward_pre_hook_linear_opt)
        module.register_forward_hook(forward_hook_opt)
    elif hasattr(module, 'weight') and module.weight is not None:
        print(f"Processing other layer: {name}")
        module.weight.data = other_weight_opt(module.weight.data)
        module.register_forward_pre_hook(forward_pre_hook_other_opt)

print("Quantization complete. Model ready for evaluation.")

# Evaluation function
def evaluate():
    examples = processor.get_dev_examples("/content/dataset", "dev-v2.0.json")[:1000]
    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )
    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=32)

    all_results = []
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "token_type_ids": batch[2]}
            outputs = model(**inputs)
            example_indices = batch[3]
            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                result = SquadResult(
                    unique_id=int(eval_feature.unique_id),
                    start_logits=outputs[0][i].cpu().tolist(),
                    end_logits=outputs[1][i].cpu().tolist(),
                )
                all_results.append(result)

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size=1,
        max_answer_length=30,
        do_lower_case=True,
        output_prediction_file="predictions.json",
        output_nbest_file="nbest_predictions.json",
        output_null_log_odds_file="null_predictions.json",
        verbose_logging=False,
        version_2_with_negative=True,
        null_score_diff_threshold=0.0,
        tokenizer=tokenizer,
    )

    results = squad_evaluate(examples, predictions)
    print(f"Evaluation results: {results}")
    return results

# Run evaluation
results = evaluate()


Some weights of the model checkpoint at ktrapeznikov/albert-xlarge-v2-squad-v2 were not used when initializing AlbertForQuestionAnswering: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Processing other layer: albert.embeddings.word_embeddings
Processing other layer: albert.embeddings.position_embeddings
Processing other layer: albert.embeddings.token_type_embeddings
Processing other layer: albert.embeddings.LayerNorm
Processing linear/conv layer: albert.encoder.embedding_hidden_mapping_in
Processing other layer: albert.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.query
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.key
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.value
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.dense
Processing other layer: albert.encoder.albert_layer_groups.0.albert_layers.0.attention.LayerNorm
Processing linear/conv layer: albert.encoder.albert_layer_groups.0.albert_layers.0.ffn
Processing lin

100%|██████████| 35/35 [00:04<00:00,  8.39it/s]
convert squad examples to features: 100%|██████████| 1000/1000 [00:05<00:00, 189.36it/s]
add example index and unique id: 100%|██████████| 1000/1000 [00:00<00:00, 571353.22it/s]
Evaluating: 100%|██████████| 32/32 [1:00:21<00:00, 113.18s/it]


Evaluation results: OrderedDict([('exact', 84.1), ('f1', 86.93163244650081), ('total', 1000), ('HasAns_exact', 80.32128514056225), ('HasAns_f1', 86.0072940692789), ('HasAns_total', 498), ('NoAns_exact', 87.84860557768924), ('NoAns_f1', 87.84860557768924), ('NoAns_total', 502), ('best_exact', 84.1), ('best_exact_thresh', 0.0), ('best_f1', 86.9316324465008), ('best_f1_thresh', 0.0)])
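
To compare the three configurations side by side, the headline numbers printed above can be collected into a small table. The values are copied from the outputs in this notebook (F1 rounded to two decimals, wall-clock time taken from the "Evaluating" progress bar). The short labels are paraphrased from the comments at the top of each cell.

```python
# (label, exact match, F1, evaluation time) for the first 1000 dev examples
runs = [
    ("scale weights only", 83.0, 85.96, "4:20"),
    ("scale weights and activations", 82.9, 85.86, "2:08:39"),
    ("scale linear weights and activations", 84.1, 86.93, "1:00:21"),
]

print(f"{'configuration':<40}{'exact':>8}{'f1':>8}{'time':>10}")
for label, exact, f1, elapsed in runs:
    print(f"{label:<40}{exact:>8.1f}{f1:>8.2f}{elapsed:>10}")

best = max(runs, key=lambda run: run[2])
print("highest F1:", best[0])
```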
