## Notebook-1 
## Fine-tune Llama 2 for Inverse Information Generation 
(Reverse-Thinking) 
An instruction is a piece of text (a prompt) given to an LLM so that it generates an answer.
>
***The goal is to create a model that can perform inverse information generation, or filtering, based on user input.*** The idea is that the model should be able to filter a large amount of information down to its key points, based on the fine-tuning dataset.


### 1. Environment Setup and Info

In [None]:
!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "t

In [19]:
def show_cuda_info():
    import torch
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('Using device:*', device)
    #Additional Info when using cuda
    if device.type == 'cuda':
        print('Device count:', torch.cuda.device_count())    
        print('Device  Name:',torch.cuda.get_device_name(0))
        print('Memory Usage:')
        print('   Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
        print('      Cached:', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

show_cuda_info()

Using device:* cuda
Device count: 1
Device  Name: NVIDIA GeForce RTX 3090
Memory Usage:
   Allocated: 4.1 GB
      Cached: 4.4 GB


#### Install dependencies
This step may take more than 10 minutes, because flash-attn has to be compiled.

In [20]:
!pip install ninja packaging
# `!export` runs in its own subshell and does not persist, so set MAX_JOBS on the same command line
!MAX_JOBS=3 pip install flash-attn --no-build-isolation

^C
ERROR: Operation cancelled by user


### 2. Prepare Dataset

In [21]:
# required to download the dataset from the Hugging Face Hub
from huggingface_hub import notebook_login
notebook_login()


To load the databricks/databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.

In [22]:
## load from hub
from datasets import load_dataset
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"dataset size: {len(dolly_dataset)}")

Found cached dataset json (/home/pop/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)


dataset size: 15011


In [23]:
def format_training_instruction(sample):
	return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""

In [24]:
## spot check an original sample record
print(dolly_dataset[2])

{'instruction': 'Why can camels survive for long without water?', 'context': '', 'response': 'Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.', 'category': 'open_qa'}


In [25]:
## sample formatted training instruction
print(format_training_instruction(dolly_dataset[2]))

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.

### Response:
Why can camels survive for long without water?
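
At inference time the same template would be used with the `### Response:` section left empty, so that the model completes it with a generated instruction. The helper below is a minimal sketch of that idea; `format_inference_prompt` is a hypothetical name that does not appear elsewhere in this notebook.

```python
def format_inference_prompt(text):
    """Build a prompt in the training layout, leaving the response open.

    Hypothetical helper: it mirrors format_training_instruction but stops
    right after the '### Response:' header so the model fills in the rest.
    """
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{text}

### Response:
"""


prompt = format_inference_prompt(
    "Camels use the fat in their humps to keep them filled with energy "
    "and hydration for long periods of time."
)
print(prompt)
```

Keeping the prompt layout identical to the training layout matters because the model only ever saw this exact structure during fine-tuning.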



### 3. Instruction-tune Llama 2 using trl and the SFTTrainer
##### How QLoRA works:

1. Quantize the pre-trained model to 4 bits and freeze it.
2. Attach small, trainable adapter layers (LoRA).
3. Fine-tune only the adapter layers while using the frozen quantized model for context.
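
To get a feel for how small the adapter layers are, the arithmetic can be sketched as follows. For a frozen weight matrix of shape `d_out x d_in`, a LoRA adapter of rank `r` adds two matrices of shape `d_out x r` and `r x d_in`. The 4096 x 4096 layer size below is an assumption chosen for illustration; the rank matches the `r=64` used later in this notebook.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in the two low-rank matrices B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


# Assumed layer size for illustration only
d_out, d_in, r = 4096, 4096, 64

adapter = lora_trainable_params(d_out, d_in, r)
dense = full_params(d_out, d_in)
print(f"adapter params: {adapter:,}")
print(f"dense params:   {dense:,}")
print(f"ratio:          {adapter / dense:.4%}")
```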

If you want to learn more about QLoRA and how it works, I recommend reading the blog post "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA".

**Flash Attention** is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. It is based on the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". The TL;DR: it accelerates training by up to 3x, which is a **big cost saving when renting a cloud environment**.

However, Flash Attention is not used in this notebook because of a hardware limitation: it has limited support and targets specific NVIDIA cards such as the A and H series.
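
The tiling idea can be illustrated in plain Python for a single query vector. The sketch below is not the Flash Attention kernel, which is a fused GPU implementation. It only shows how a running ("online") softmax lets keys and values be processed block by block without ever holding the full row of attention scores. The function names and the `block_size` parameter are illustrative assumptions.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then softmax and mix values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalised weighted sum of values
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

Only one block of scores exists at any time, which is where the memory saving in sequence length comes from.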

#### Base Model

In [26]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Select Hugging Face model id gated or non-gated
model_id = "NousResearch/Llama-2-7b-hf" # non-gated
# model_id = "meta-llama/Llama-2-7b-hf" # gated

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
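
As a rough sanity check on why 4-bit loading fits on a single consumer GPU, the weight memory can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch: the round figure of 7 billion parameters is an assumption, and the estimate ignores quantization constants, activations, gradients and optimizer state.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage, using 1024**3 bytes per GB as in show_cuda_info()."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 7_000_000_000  # assumed round number for a "7B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(n_params, bits):.2f} GB")
```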

#### PEFT Configuration

The SFTTrainer supports a native integration with peft, which makes it easy to instruction-tune LLMs efficiently. We only need to create our `LoraConfig` and provide it to the trainer.

In [27]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
# LoRA config based on QLoRA paper
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
)
# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

#### Training Hyperparameters

In [28]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-7-int4-dolly",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True # disable tqdm since with packing the values are incorrect
)
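
The batch-related settings above combine into an effective batch size per optimizer step. The sketch below works through that arithmetic, assuming the single GPU reported earlier and assuming that, with packing enabled, every training example is a full sequence of 2048 tokens (the `max_seq_length` used in the next section). `effective_batch_size` is a hypothetical helper name.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    """Sequences that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * n_devices


# Values taken from the TrainingArguments above; n_devices=1 is an assumption
sequences_per_step = effective_batch_size(4, 2, n_devices=1)
tokens_per_step = sequences_per_step * 2048  # assumes fully packed sequences

print("sequences per optimizer step:", sequences_per_step)
print("tokens per optimizer step:   ", tokens_per_step)
```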

#### Create SFTTrainer 

In [29]:
%%time
from trl import SFTTrainer

# max sequence length for model and packing of the dataset
max_seq_length = 2048 

trainer = SFTTrainer(
    model=model,
    train_dataset=dolly_dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_training_instruction,
    args=args,
)

CPU times: user 2.76 ms, sys: 810 µs, total: 3.57 ms
Wall time: 5.71 ms
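
With `packing=True`, short samples are concatenated and cut into fixed-length sequences instead of being padded individually. The toy function below is a simplified illustration of that idea using made-up token ids; it is not trl's actual implementation, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(tokenized_samples, eos_id, max_seq_length):
    """Concatenate samples (each followed by EOS) and slice into full-length chunks.

    Any leftover tokens that do not fill a complete chunk are dropped
    in this simplified sketch.
    """
    buffer = []
    for ids in tokenized_samples:
        buffer.extend(ids)
        buffer.append(eos_id)
    n_full = len(buffer) // max_seq_length
    return [
        buffer[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(n_full)
    ]


# Toy token ids; 0 stands in for the EOS token
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
for chunk in pack_sequences(samples, eos_id=0, max_seq_length=4):
    print(chunk)
```

Because every chunk is completely filled, no compute is spent on padding tokens, which is why packing tends to speed up training on datasets with many short samples.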


#### Perform Training

In [14]:
%%time
# train
trainer.train() 
# save model
trainer.save_model()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: mychen76. Use `wandb login --relogin` to force relogin


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 1.6449, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 1.388, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 1.3566, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 1.2871, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 1.29, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 1.2353, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 1.2358, 'learning_rate': 0.0002, 'epoch': 0.04}
{'loss': 1.2327, 'learning_rate': 0.0002, 'epoch': 0.04}
{'loss': 1.2411, 'learning_rate': 0.0002, 'epoch': 0.05}
{'loss': 1.2298, 'learning_rate': 0.0002, 'epoch': 0.05}
{'loss': 1.2336, 'learning_rate': 0.0002, 'epoch': 0.06}
{'loss': 1.2789, 'learning_rate': 0.0002, 'epoch': 0.06}
{'loss': 1.1953, 'learning_rate': 0.0002, 'epoch': 0.07}
{'loss': 1.198, 'learning_rate': 0.0002, 'epoch': 0.07}
{'loss': 1.2051, 'learning_rate': 0.0002, 'epoch': 1.0}
{'loss': 1.2241, 'learning_rate': 0.0002, 'epoch': 1.01}
{'loss': 1.2001, 'learning_rate': 0.0002, 'epoch': 1.01}
{'loss': 1.2015, 'learning_rate': 0.

### Load Trained LoRA Model

In [14]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

args.output_dir = "llama-7-int4-dolly"
# load base LLM model and tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

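### Merge and Save Model
After training is done, we want to run and test our model. We use peft and transformers to load the LoRA adapter together with the base model, merge the adapter weights into it, and save the merged model so that it can be used on its own.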

In [15]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
)
# Merge LoRA and base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")

# push merged model to the hub
# merged_model.push_to_hub("user/repo")
# tokenizer.push_to_hub("user/repo")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/tokenizer.model',
 'merged_model/added_tokens.json',
 'merged_model/tokenizer.json')
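
Conceptually, merging folds the low-rank update into the frozen weight: `W_merged = W + (alpha / r) * B @ A`. The toy example below checks on tiny matrices that the merged weight gives the same output as running the base weight and the adapter separately. It is a sketch of the standard LoRA formula in plain Python, not of peft's internal code, and all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, x):
    """Multiply a matrix by a vector."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in m]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy values: W is 2x2, A is r x in (1x2), B is out x r (2x1)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[3.0], [4.0]]
alpha, r = 2, 1
x = [1.0, 1.0]

# Base output plus the scaled adapter output
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
separate = [b + (alpha / r) * a for b, a in zip(base_out, adapter_out)]

# Output of the merged weight
merged = matvec(merge_lora(W, A, B, alpha, r), x)

print(separate)
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the merged model can be saved and loaded like any regular transformers checkpoint.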

In [30]:
## End of Fine-Tuning

### see notebook-2 for model testing and usage