<a href="https://colab.research.google.com/github/pszemraj/ai-msgbot/blob/update-notebooks/notebooks/colab-huggingface-API/gpt_j_6B_8bit_textgen_playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Fine-tuning 6-Billion GPT-J in colab with LoRA and 8-bit compression

This notebook is for testing text generation with fine-tuned models that have been quantized to 8 bits.

> _The original notebook_ is a proof of concept for fine-tuning [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) with limited memory. A detailed explanation of how it works can be found in [this model card](https://huggingface.co/hivemind/gpt-j-6B-8bit).


---



In [None]:
#@title define huggingface model
#@markdown enter the string ID of the GPT-J 8bit model to test out in the box, 
#@markdown for example `hivemind/gpt-j-6B-8bit`
hf_gptj_model = "ethzanalytics/GPT-J-6B-8bit-Convo-D3E" #@param {type:"string"}
private_model = False #@param {type:"boolean"}

gptj_id = 'hivemind/gpt-j-6B-8bit' if len(hf_gptj_model) < 3 else hf_gptj_model

#@markdown if you are using a private model, _then you need to check the box that says so_

# setup things

In [None]:
#@markdown add auto-Colab formatting with `IPython.display`
from IPython.display import HTML, display
# colab formatting
def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )

get_ipython().events.register("pre_run_cell", set_css)

In [None]:
#@markdown print GPU status
import torch
!nvidia-smi

device = 'cuda' if torch.cuda.is_available() else 'cpu'

print(f"\nwill run computations on {device}")

Mon Jan  3 21:22:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
#@markdown **print out the VM's CPU stats**

#@markdown - a high-RAM runtime is recommended, as the model file itself is around
#@markdown 10 GB. It is loaded into RAM, so a runtime with 12 GB of RAM will not be enough
from psutil import virtual_memory
import os
ram_gb = round(virtual_memory().total / (1024**3), 1)
print(f'Runtime has {ram_gb} gigs of memory and {os.cpu_count()} processors')

if ram_gb < 20:
    print("WARNING - the allocated CPU RAM is less than 20 GB.",
          "You may experience errors loading models or generating text.")
    kleiner_cpu = True
else:
    kleiner_cpu = False

Runtime has 12.7 gigs of memory and 2 processors


In [None]:
#@title install packages
!pip install transformers[fairscale] -U -q
!pip install -U sentencepiece -q
!pip install -U datasets -q
!pip install bitsandbytes-cuda111==0.26.0 -q
!pip install -U joblib
!pip install -U huggingface_hub -q




In [None]:
#@markdown import packages

import transformers
import datasets
import torch
import joblib

import torch.nn.functional as F
from torch import nn
from torch.cuda.amp import custom_fwd, custom_bwd

from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

from tqdm.auto import tqdm

In [None]:
#@title Sign in to HF
#@markdown <font color="orange"> **you have to sign in if you are using a private model
#@markdown you can use username/pass or get a token in your account.**</font>
from huggingface_hub import (
    # User management
    login,
    logout,
    notebook_login,
    whoami,
    # Repository creation and management
    create_repo,
    delete_repo,
    update_repo_visibility,
    # And some methods to retrieve/change information about the content
    list_models,
    list_datasets,
    list_metrics,
    list_repo_files,
    upload_file,
    delete_file,
)

notebook_login()



---

### Converting the model to 8 bits.

Original author's note:

> We convert EleutherAI's GPT-J-6B model to 8 bits using Facebook's [bitsandbytes](https://github.com/facebookresearch/bitsandbytes) library. This reduces the model's size from 20 GB down to just 6 GB.

> Note that we don't convert linear layer biases to 8 bits, as they take up less than 1% of the model's weight anyway.
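
The size reduction quoted in the note can be sanity-checked with some rough arithmetic. The sketch below is only an estimate: the parameter count of about 6.05 billion is an assumption, and the figures will not match the note exactly because real checkpoints also store biases and layer norms in higher precision.

```python
# Back-of-the-envelope weight memory for a GPT-J-sized model.
# ASSUMPTION: ~6.05 billion parameters; non-quantized biases and
# layer norms are ignored here.
N_PARAMS = 6_050_000_000
BLOCK_SIZE = 4096  # one absmax scale per 4096 values, as in convert_to_int8


def weights_gb(n_params: int, bytes_per_param: float) -> float:
    """Size in decimal gigabytes for a given storage width."""
    return n_params * bytes_per_param / 1e9


fp32_gb = weights_gb(N_PARAMS, 4)
fp16_gb = weights_gb(N_PARAMS, 2)
int8_gb = weights_gb(N_PARAMS, 1)
# each block stores one extra float32 scale (4 bytes)
absmax_gb = (N_PARAMS / BLOCK_SIZE) * 4 / 1e9

print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.2f} GB")
print(f"absmax overhead: {absmax_gb * 1000:.1f} MB")
```

Under these assumptions the per-block scales add only a few megabytes, so the 8-bit weights dominate the footprint.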

In [None]:

class FrozenBNBLinear(nn.Module):
    def __init__(self, weight, absmax, code, bias=None):
        assert isinstance(bias, nn.Parameter) or bias is None
        super().__init__()
        self.out_features, self.in_features = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
        self.bias = bias
 
    def forward(self, input):
        output = DequantizeAndLinear.apply(input, self.weight, self.absmax, self.code, self.bias)
        if self.adapter:
            output += self.adapter(input)
        return output
 
    @classmethod
    def from_linear(cls, linear: nn.Linear) -> "FrozenBNBLinear":
        weights_int8, state = quantize_blockise_lowmemory(linear.weight)
        return cls(weights_int8, *state, linear.bias)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.in_features}, {self.out_features})"
 
 
class DequantizeAndLinear(torch.autograd.Function): 
    @staticmethod
    @custom_fwd
    def forward(ctx, input: torch.Tensor, weights_quantized: torch.ByteTensor,
                absmax: torch.FloatTensor, code: torch.FloatTensor, bias: torch.FloatTensor):
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        ctx.save_for_backward(input, weights_quantized, absmax, code)
        ctx._has_bias = bias is not None
        return F.linear(input, weights_deq, bias)
 
    @staticmethod
    @custom_bwd
    def backward(ctx, grad_output: torch.Tensor):
        assert not ctx.needs_input_grad[1] and not ctx.needs_input_grad[2] and not ctx.needs_input_grad[3]
        input, weights_quantized, absmax, code = ctx.saved_tensors
        # grad_output: [*batch, out_features]
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        grad_input = grad_output @ weights_deq
        grad_bias = grad_output.flatten(0, -2).sum(dim=0) if ctx._has_bias else None
        return grad_input, None, None, None, grad_bias
 
 
class FrozenBNBEmbedding(nn.Module):
    def __init__(self, weight, absmax, code):
        super().__init__()
        self.num_embeddings, self.embedding_dim = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
 
    def forward(self, input, **kwargs):
        with torch.no_grad():
            # note: both quantized weights and input indices are *not* differentiable
            weight_deq = dequantize_blockwise(self.weight, absmax=self.absmax, code=self.code)
            output = F.embedding(input, weight_deq, **kwargs)
        if self.adapter:
            output += self.adapter(input)
        return output 
 
    @classmethod
    def from_embedding(cls, embedding: nn.Embedding) -> "FrozenBNBEmbedding":
        weights_int8, state = quantize_blockise_lowmemory(embedding.weight)
        return cls(weights_int8, *state)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.num_embeddings}, {self.embedding_dim})"
 
 
def quantize_blockise_lowmemory(matrix: torch.Tensor, chunk_size: int = 2 ** 20):
    assert chunk_size % 4096 == 0
    code = None
    chunks = []
    absmaxes = []
    flat_tensor = matrix.view(-1)
    for i in range((matrix.numel() - 1) // chunk_size + 1):
        input_chunk = flat_tensor[i * chunk_size: (i + 1) * chunk_size].clone()
        quantized_chunk, (absmax_chunk, code) = quantize_blockwise(input_chunk, code=code)
        chunks.append(quantized_chunk)
        absmaxes.append(absmax_chunk)
 
    matrix_i8 = torch.cat(chunks).reshape_as(matrix)
    absmax = torch.cat(absmaxes)
    return matrix_i8, (absmax, code)
 
 
def convert_to_int8(model):
    """Convert linear and embedding modules to 8-bit with optional adapters"""
    for module in list(model.modules()):
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                print(name, child)
                setattr( 
                    module,
                    name,
                    FrozenBNBLinear(
                        weight=torch.zeros(child.out_features, child.in_features, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                        bias=child.bias,
                    ),
                )
            elif isinstance(child, nn.Embedding):
                setattr(
                    module,
                    name,
                    FrozenBNBEmbedding(
                        weight=torch.zeros(child.num_embeddings, child.embedding_dim, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                    )
                )
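
The cell above stores each weight tensor as 8-bit codes plus one `absmax` scale per block of 4096 values. The toy sketch below illustrates that idea in plain Python so it can be run without a GPU. It is a simplification and not what `bitsandbytes` does internally: it assumes a linear mapping onto 0..255 instead of the 256-entry `code` lookup table, and the function names and the block size of 4 are hypothetical, chosen for readability.

```python
def quantize_blockwise_toy(values, block_size=4):
    """Toy blockwise absmax quantization to 8-bit codes (linear mapping, an assumption)."""
    codes, absmaxes = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        absmaxes.append(absmax)
        # map [-absmax, absmax] linearly onto the integers 0..255
        codes.extend(round((v / absmax + 1.0) * 127.5) for v in block)
    return codes, absmaxes


def dequantize_blockwise_toy(codes, absmaxes, block_size=4):
    """Invert the toy mapping using the scale stored for each block."""
    return [
        (c / 127.5 - 1.0) * absmaxes[i // block_size]
        for i, c in enumerate(codes)
    ]


weights = [0.02, -0.5, 0.31, 0.07, 4.0, -3.2, 0.9, 0.0]
codes, absmaxes = quantize_blockwise_toy(weights)
restored = dequantize_blockwise_toy(codes, absmaxes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, absmaxes, max_error)
```

Because every block has its own scale, the small values in the first block are not crushed by the large values in the second one, which is the main reason for quantizing blockwise rather than per tensor.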

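### create patched GPT-J classes

These subclasses call `convert_to_int8` in their constructors, so every `nn.Linear` and `nn.Embedding` inside the GPT-J block, model, and causal-LM head is replaced by its frozen 8-bit counterpart when the model is built.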
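
The last line of the next cell assigns the subclass back onto the `modeling_gptj` module. A minimal sketch of why that works, using a hypothetical stand-in module (`toy_lib`, `Block`, `Model`, and `PatchedBlock` are made-up names, not part of `transformers`): a class looks up the names it uses in its module's globals at call time, so swapping the module attribute changes what later constructors build.

```python
import types

# Hypothetical stand-in for a library module such as modeling_gptj.
lib = types.ModuleType("toy_lib")
exec(
    """
class Block:
    kind = "full-precision"

class Model:
    def __init__(self, n_layers):
        # `Block` is resolved in the module's globals when this line runs
        self.layers = [Block() for _ in range(n_layers)]
""",
    lib.__dict__,
)


class PatchedBlock(lib.Block):
    kind = "8-bit"


before = lib.Model(2)
lib.Block = PatchedBlock  # monkey-patch the module attribute
after = lib.Model(2)

kinds_before = [layer.kind for layer in before.layers]
kinds_after = [layer.kind for layer in after.layers]
print(kinds_before, kinds_after)
```

Models created before the assignment keep their original blocks; only models constructed afterwards pick up the patched class.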

In [None]:
class GPTJBlock(transformers.models.gptj.modeling_gptj.GPTJBlock):
    def __init__(self, config):
        super().__init__(config)

        convert_to_int8(self.attn)
        convert_to_int8(self.mlp)


class GPTJModel(transformers.models.gptj.modeling_gptj.GPTJModel):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)
        

class GPTJForCausalLM(transformers.models.gptj.modeling_gptj.GPTJForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)


transformers.models.gptj.modeling_gptj.GPTJBlock = GPTJBlock  # monkey-patch GPT-J

### load pretrained model, config, etc files

In [None]:
#@markdown tokenizer and config stay the same, loaded from `EleutherAI/gpt-j-6B`
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

In [None]:
#@title load the model into a modified `GPTJForCausalLM` class
#@markdown <font color="orange"> **if you get an error here saying that something was not found,
#@markdown go back up to the cell where
#@markdown you can log in to your HF account**</font>
gpt = GPTJForCausalLM.from_pretrained(gptj_id, 
                                      use_auth_token=private_model,
                                      low_cpu_mem_usage=kleiner_cpu)

gpt.to(device)

print(f"\n\nRunning computations on {device}")

k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, bias=False)
fc_in Linear(in_features=4096, out_features=16384, bias=True)
fc_out Linear(in_features=16384, out_features=4096, bias=True)
k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, bias=False)
fc_in Linear(in_features=4096, out_features=16384, bias=True)
fc_out Linear(in_features=16384, out_features=4096, bias=True)
k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, 

### Text generation example

In [None]:
%%time
import pprint as pp

#@markdown in this relatively basic example, text is generated the _old-school_ way
#@markdown by tokenizing the prompt, calling `generate`, and decoding the result by hand.
#@markdown the pipeline object introduced later handles all that.


std_test = "Elon musk is pretty annoying because" #@param {type:"string"}
prompt = tokenizer(std_test, return_tensors='pt')
prompt = {key: value.to(device) for key, value in prompt.items()}
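# min_length and max_length are counted in tokens, not characters
ex_min = prompt["input_ids"].shape[1] + 64
out = gpt.generate(**prompt,
                   min_length=ex_min,
                   max_length=ex_min + 64,
                   do_sample=True,
                   top_k=50,
                   top_p=0.9,
                   no_repeat_ngram_size=2,
                   remove_invalid_values=True,
                   )

# clean_up_tokenization_spaces is a decoding option, so it is passed to decode
example_res = tokenizer.decode(out[0], clean_up_tokenization_spaces=True)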
print("Total generated text is: \n")
pp.pprint(example_res)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total generated text is: 

('Elon musk is pretty annoying because he makes up shit all the time. I just '
 'got back to my desk and found out we are making an electric car\n'
 '\n'
 "Yes, he's pretty much the Steve Jobs of SpaceX and Tesla, only with the "
 'money, money and more money.\n'
 "But he was never known to be very smart and he doesn't get people and "
 'business. He is just famous for the stupid things he does. Most of which '
 'have no chance of working out and are just a way to make a name for '
 'himself.<|endoftext|>')
CPU times: user 1h 39min 5s, sys: 1min 17s, total: 1h 40min 23s
Wall time: 49min 6s


# pipeline for textgen

In [None]:
from transformers import pipeline
my_chatbot = pipeline('text-generation', 
                      model=gpt, tokenizer=tokenizer,
                      device=0 if device == 'cuda' else -1,
                    )

## add prompts

In [None]:
#@title define speaker and responder
#@markdown for testing the models this should not need to be changed. 
#@markdown if testing a model related to [ai-msgbot](https://github.com/pszemraj/ai-msgbot)
#@markdown trained on data that **was not** using the entries below, update as needed.
speaker = "person alpha" #@param {type:"string"}
responder = "person beta" #@param {type:"string"}

## define prompt messages

`f"{responder}:\n"` is appended to the end of each prompt so that the text-generation model actually _responds_ to the prompt instead of continuing the speaker's message.
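
As a rough illustration of the prompt format, the sketch below joins the pieces the same way the prompt lists in this notebook do and then pulls the responder's first turn out of a generated string. `build_prompt` and `extract_first_reply` are hypothetical helpers, not part of ai-msgbot, and the "generated" text is a hand-written stand-in rather than real model output.

```python
speaker = "person alpha"
responder = "person beta"


def build_prompt(message: str, speaker: str, responder: str) -> str:
    """Join the pieces in the same order as the notebook's prompt lists."""
    return "".join([f"{speaker}:\n", f"{message}\n", "\n", f"{responder}:\n"])


def extract_first_reply(generated_text: str, prompt: str) -> str:
    """Hypothetical helper: drop the echoed prompt, keep text up to the next blank line."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.split("\n\n", 1)[0].strip()


prompt = build_prompt("hi! how are you doing?", speaker, responder)
# Hand-written stand-in for pipeline output, which echoes the prompt first.
fake_output = prompt + "i am fine, thanks for asking.\n\nperson alpha:\nare you doing all right?\n"
reply = extract_first_reply(fake_output, prompt)
print(repr(prompt))
print(reply)
```

Ending the prompt with the responder's label means the first thing the model has to produce is that person's message, and the blank line between turns gives a simple place to cut the reply.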

In [None]:

prompts = [
           [f"{speaker}:\n", "hi! how are you doing?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what should I bring to the party?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "do you like memes?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "can we go on a date together this weekend?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what's up homie?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "do you know how can I make friends here?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "so what do you like to do for fun?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what is your favorite brand of cereal?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what is the meaning of existence?\n", "\n", f"{responder}:\n"],
]

# generate text!

In [None]:
#@markdown set amount of text to generate (higher number = longer runtime)
resp_len = 256 #@param {type:"integer"}

In [None]:
# note that the generated text includes the prompt, and the prompt's tokens count
# toward the length limit, which is measured in tokens rather than characters
for i, prompt in enumerate(prompts):
    this_prompt = "".join(prompt)
    n_prompt_tokens = len(tokenizer(this_prompt)["input_ids"])
    result = my_chatbot(
                        this_prompt,
                        do_sample=True,
                        top_k=50,
                        top_p=0.9,
                        max_length=n_prompt_tokens + resp_len,
                        clean_up_tokenization_spaces=True,
                        no_repeat_ngram_size=3,
                    )
    
    print(f"==========Testing Prompt-ID #{i} ==========")
    print(f"PROMPT TEXT:\n{''.join(prompt)}")
    print("----------FULL GENERATED TEXT:")
    print(result[0]['generated_text'])
    print("\n" * 4)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
hi! how are you doing?

person beta:

----------FULL GENERATED TEXT:
person alpha:
hi! how are you doing?

person beta:
i am fine, thanks for asking.

person Alpha:
are you doing all right?
beta: i'm okay. how are you?









Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
what should I bring to the party?

person beta:

----------FULL GENERATED TEXT:
person alpha:
what should I bring to the party?

person beta:
I would bring: beer, wine and good times:

Person alpha:
Awesome!  That just leaves the question of what to bring....
 







Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
do you like memes?

person beta:

----------FULL GENERATED TEXT:
person alpha:
do you like memes?

person beta:
you probably do, huh?

person alpha:
no comment (also no comment)

person beta:
no comment, either. (again)









Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
can we go on a date together this weekend?

person beta:

----------FULL GENERATED TEXT:
person alpha:
can we go on a date together this weekend?

person beta:
i like ur girlfriend, but i think she's a lesbian

person gamma:
i love lesbians

person delta:
I'm an







Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
what's up homie?

person beta:

----------FULL GENERATED TEXT:
person alpha:
what's up homie?

person beta:
what's up?

person omega:
i can't hear you

person epsilon:
no one's gonna answer

person zeta:







Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
do you know how can I make friends here?

person beta:

----------FULL GENERATED TEXT:
person alpha:
do you know how can I make friends here?

person beta:
well, i don't even know my name..

person alpha:
you get a profile card.

person beta:
that's







Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
so what do you like to do for fun?

person beta:

----------FULL GENERATED TEXT:
person alpha:
so what do you like to do for fun?

person beta:
not really, but I can play some guitar. 

person alpha: 
Oh, yeah. Do you have any favorite guitar chords?







Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


PROMPT TEXT:
person alpha:
what is your favorite brand of cereal?

person beta:

----------FULL GENERATED TEXT:
person alpha:
what is your favorite brand of cereal?

person beta:
does this matter? what does it matter to you? is that what you
want to say? is there anything else that you want to say? is anybody





PROMPT TEXT:
person alpha:
what is the meaning of existence?

person beta:

----------FULL GENERATED TEXT:
person alpha:
what is the meaning of existence?

person beta:
what is the meaning of existence?

person gamma:
what is the meaning of existence?

person delta:
what is the meaning of existence?





