# [Project] BiBERT

## Content

Introduction
<br/>Development Environment
<br/>BiBERT
<br/>Conclusion
<br/>Reference

## Introduction

<br>Dataset: SST-2 (GLUE)
<br><br>Task: Natural Language Processing (sentence-level sentiment classification)
<br><br>Method: Fully binarized quantization (1-bit weights, activations, and embeddings), Straight-Through Estimator (STE)
<br><br>Compression: numpy.packbits
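
To make the STE entry above concrete, here is a minimal sketch of a sign binarizer with a clipped straight-through gradient in PyTorch. The class name `SignSTE` and the clipping range `[-1, 1]` are assumptions chosen for illustration; this is a generic STE, not necessarily the exact estimator used in the BiBERT code.

```python
import torch


class SignSTE(torch.autograd.Function):
    """Binarize to {-1, +1} in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # torch.sign maps 0 to 0, so zeros are sent to +1 explicitly.
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Clipped STE (assumed variant): identity gradient inside [-1, 1], zero outside.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


x = torch.tensor([-1.5, -0.3, 0.0, 0.7, 2.0], requires_grad=True)
y = SignSTE.apply(x)
y.sum().backward()

print(y.detach())  # binarized values in {-1, +1}
print(x.grad)      # 1 where |x| <= 1, otherwise 0
```

The forward pass is not differentiable in any useful sense (its gradient is zero almost everywhere), so the backward pass substitutes the gradient of a clipped identity function. That substitution is what lets 1-bit weights and activations be trained with ordinary backpropagation.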

<br>
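
The compression step listed in the introduction can be sketched as follows: once weights are reduced to signs, each value needs a single bit, and `numpy.packbits` stores eight of them per byte. The array shape and the random weights are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 16)).astype(np.float32)  # stand-in for a weight matrix

# Binarize to {-1, +1}, then map to bits: +1 -> 1, -1 -> 0.
signs = np.where(weights >= 0, 1, -1).astype(np.int8)
bits = (signs > 0).astype(np.uint8)

# Pack 8 bits into each byte along the last axis: 16 bits -> 2 bytes per row.
packed = np.packbits(bits, axis=-1)

# Unpack and map back to {-1, +1}; `count` trims any padding bits.
unpacked = np.unpackbits(packed, axis=-1, count=bits.shape[-1])
restored = unpacked.astype(np.int8) * 2 - 1

print(weights.nbytes, packed.nbytes)
```

Relative to float32 storage this is a 32x reduction in bytes, and the round trip through `np.unpackbits` recovers the sign matrix exactly.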

## Development Environment

In [None]:
%pip install torch==1.13.1  # CUDA 11.7 build (1.13.1+cu117)
%pip install scipy
%pip install seaborn
%pip install openpyxl
%pip install matplotlib
%pip install tensorboard
%pip install scikit-learn
%pip install setuptools==59.5.0
%pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com pytorch-quantization

In [1]:
import os
import glob
import torch
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

## BiBERT

In [3]:
pwd

'/workspace/deep_learning_bibert'

In [2]:
import torch
print(torch.__version__)
print(torch.version.cuda)

1.13.1+cu117
11.7


In [None]:
default_params = {
    "cola":  {"num_train_epochs": 50, "max_seq_length": 64},
    "mnli":  {"num_train_epochs": 6,  "max_seq_length": 128},
    "mrpc":  {"num_train_epochs": 20, "max_seq_length": 128},
    "sst-2": {"num_train_epochs": 10, "max_seq_length": 64},
    "sst-mini": {"num_train_epochs": 10, "max_seq_length": 64},
    "sts-b": {"num_train_epochs": 20, "max_seq_length": 128},
    "qqp":   {"num_train_epochs": 6,  "max_seq_length": 128},
    "qnli":  {"num_train_epochs": 10, "max_seq_length": 128},
    "rte":   {"num_train_epochs": 20, "max_seq_length": 128},
}
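
As a rough illustration of how a per-task table like this could drive the training cells, the sketch below looks up the epoch count for a task and assembles a command line. The helper `build_command` and the `params` copy are hypothetical. Only flags that appear in the training cells of this notebook are used, and `max_seq_length` is left out because those cells do not pass it explicitly.

```python
import shlex

# Copy of two entries from the table above so the sketch runs on its own.
params = {
    "sst-2": {"num_train_epochs": 10, "max_seq_length": 64},
    "mrpc": {"num_train_epochs": 20, "max_seq_length": 128},
}


def build_command(task, learning_rate):
    """Hypothetical helper: build a quant_task_glue.py command for one task."""
    cfg = params[task]
    args = [
        "python", "quant_task_glue.py",
        "--task_name", task,
        "--output_dir", f"output/{task}",
        "--num_train_epochs", str(cfg["num_train_epochs"]),
        "--learning_rate", learning_rate,  # kept as a string, e.g. "1e-4"
    ]
    # Quote each argument so the result is safe to paste into a shell.
    return " ".join(shlex.quote(a) for a in args)


print(build_command("sst-2", "1e-4"))
```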

In [None]:
# https://github.com/htqin/BiBERT/blob/main/scripts/train_sst-2.sh

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs by PCI bus ID (matches nvidia-smi)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"        # expose only GPU 0 to the training script

!python quant_task_glue.py \
    --data_dir 'data' \
    --model_dir 'models/bert-base-uncased' \
    --task_name 'sst-2' \
    --output_dir 'output/sst-2' \
    --log_dir 'log/bibert' \
    --seed 42 \
    --num_train_epochs 10 \
    --eval_step 100 \
    --learning_rate 1e-4 \
    --expriement_number 0 \
    --random_ratio 0.0 \
    --input_bits 1 \
    --weight_bits 1 \
    --embedding_bits 1 \
    --batch_size 16 \
    --pred_distill \
    --intermediate_distill \
    --value_distill \
    --key_distill \
    --query_distill

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1519.)
  next_m.mul_(beta1).add_(1 - beta1, grad)
Epoch 1: 100%|█████████████████████████████▉| 4209/4210 [10:31<00:00,  6.82it/s]Epoch 1 Step 4210 : 0.751
Best : 0.751
Epoch 1: 100%|██████████████████████████████| 4210/4210 [10:34<00:00,  6.64it/s]
Epoch 2: 100%|█████████████████████████████▉| 4209/4210 [10:29<00:00,  6.67it/s]Epoch 2 Step 8420 : 0.817
Best : 0.817
Epoch 2: 100%|██████████████████████████████| 4210/4210 [10:32<00:00,  6.65it/s]
Epoch 3: 100%|█████████████████████████████▉| 4209/4210 [10:41<00:00,  6.52it/s]Epoch 3 Step 12630 : 0.827
Best : 0.827
Epoch 3: 100%|██████████████████████████████| 4210/4210 [10:44<00:00,  6.53it/s]
Epoch 4: 100%|█████████████████████████████▉| 4209/4210 [10:45<00:00,  6.28it/s]Epoch 4 Step 16840 : 0.847
Best : 0.847
Epoch 4: 100%|█████████████████████

In [None]:
# https://github.com/htqin/BiBERT/blob/main/scripts/train_mrpc.sh

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs by PCI bus ID (matches nvidia-smi)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"        # expose only GPU 0 to the training script

!python quant_task_glue.py \
    --data_dir 'data' \
    --model_dir 'models/bert-base-uncased' \
    --task_name 'mrpc' \
    --output_dir 'output/mrpc' \
    --log_dir 'log/bibert' \
    --seed 42 \
    --num_train_epochs 20 \
    --eval_step 100 \
    --learning_rate 2e-4 \
    --expriement_number 0 \
    --random_ratio 0.0 \
    --input_bits 1 \
    --weight_bits 1 \
    --embedding_bits 1 \
    --batch_size 16 \
    --pred_distill \
    --intermediate_distill \
    --value_distill \
    --key_distill \
    --query_distill

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1519.)
  next_m.mul_(beta1).add_(1 - beta1, grad)
Epoch 1: 100%|███████████████████████████████▊| 229/230 [00:37<00:00,  6.07it/s]Epoch 1 Step 230 : 0.684
Best : 0.684
Epoch 1: 100%|████████████████████████████████| 230/230 [00:39<00:00,  5.86it/s]
Epoch 2: 100%|███████████████████████████████▊| 229/230 [00:37<00:00,  6.10it/s]Epoch 2 Step 460 : 0.684
Best : 0.684
Epoch 2: 100%|████████████████████████████████| 230/230 [00:39<00:00,  5.86it/s]
Epoch 3: 100%|███████████████████████████████▊| 229/230 [00:37<00:00,  6.08it/s]Epoch 3 Step 690 : 0.684
Best : 0.684
Epoch 3: 100%|████████████████████████████████| 230/230 [00:39<00:00,  5.85it/s]
Epoch 4: 100%|███████████████████████████████▊| 229/230 [00:37<00:00,  6.08it/s]Epoch 4 Step 920 : 0.684
Best : 0.684
Epoch 4: 100%|███████████████████████████

In [None]:
# https://github.com/htqin/BiBERT/blob/main/scripts/train_rte.sh

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs by PCI bus ID (matches nvidia-smi)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"        # expose only GPU 0 to the training script

!python quant_task_glue.py \
    --data_dir 'data' \
    --model_dir 'models/bert-base-uncased' \
    --task_name 'rte' \
    --output_dir 'output/rte' \
    --log_dir 'log/bibert' \
    --seed 42 \
    --num_train_epochs 20 \
    --eval_step 100 \
    --learning_rate 1e-5 \
    --expriement_number 0 \
    --random_ratio 0.0 \
    --input_bits 1 \
    --weight_bits 1 \
    --embedding_bits 1 \
    --batch_size 16 \
    --pred_distill \
    --intermediate_distill \
    --value_distill \
    --key_distill \
    --query_distill

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1519.)
  next_m.mul_(beta1).add_(1 - beta1, grad)
Epoch 1:  70%|███████████████████████▏         | 99/141 [00:13<00:05,  7.47it/s]Epoch 1 Step 100 : 0.470
Best : 0.470
Epoch 1:  99%|███████████████████████████████▌| 139/141 [00:19<00:00,  7.19it/s]Epoch 1 Step 140 : 0.482
Best : 0.482
Epoch 1:  99%|███████████████████████████████▊| 140/141 [00:20<00:00,  2.44it/s]Epoch 1 Step 141 : 0.494
Best : 0.494
Epoch 1: 100%|████████████████████████████████| 141/141 [00:21<00:00,  6.45it/s]
Epoch 2:  41%|█████████████▌                   | 58/141 [00:07<00:11,  7.39it/s]Epoch 2 Step 200 : 0.506
Best : 0.506
Epoch 2:  99%|███████████████████████████████▊| 140/141 [00:19<00:00,  7.45it/s]Epoch 2 Step 282 : 0.474
Best : 0.506
Epoch 2: 100%|████████████████████████████████| 141/141 [00:20<00:00,  6.94it/s]
Epoc
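
The evaluation lines in the logs above follow the pattern `Epoch <n> Step <s> : <score>`, so they can be pulled out with a regular expression for later tabulation or plotting. The sketch below is one possible way to do that. The function name `parse_eval_lines` is hypothetical, and the sample text reuses two SST-2 lines from the log with the progress bars abbreviated.

```python
import re

# Matches "Epoch 1 Step 4210 : 0.751" but not the progress-bar prefix "Epoch 1: 100%|...".
LOG_LINE = re.compile(r"Epoch (\d+) Step (\d+) : ([0-9.]+)")

# Two evaluation lines from the SST-2 run, progress bars shortened for readability.
sample = """\
Epoch 1: 100%|####| 4209/4210 [10:31<00:00,  6.82it/s]Epoch 1 Step 4210 : 0.751
Best : 0.751
Epoch 2: 100%|####| 4209/4210 [10:29<00:00,  6.67it/s]Epoch 2 Step 8420 : 0.817
Best : 0.817
"""


def parse_eval_lines(text):
    """Return (epoch, step, score) tuples for every evaluation line in the log text."""
    return [(int(e), int(s), float(a)) for e, s, a in LOG_LINE.findall(text)]


records = parse_eval_lines(sample)
best = max(records, key=lambda r: r[2])  # record with the highest score

print(records)
print(best)
```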

## Reference

**Paper**
<br/>[Qin et al. BiBERT: Accurate Fully Binarized BERT, ICLR, 2022](https://arxiv.org/abs/2203.06390)

<br/>**Github**
<br/>[htqin/BiBERT](https://github.com/htqin/BiBERT)
<br/>[Zhen-Dong/BitPack](https://github.com/Zhen-Dong/BitPack)