## Run it!

### Prepare Environment

We first create a virtual environment and install the required packages.

```shell
cat /etc/os-release
nvcc -V
cd ../personal_copilot
python3.11 -m venv .copilot
source .copilot/bin/activate
pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio
pip install packaging
pip install flash-attn
pip install -r training/requirements.txt
pip install -r dataset_generation/requirements.txt
```

### Generate Dataset

Follow `personal_copilot/README.md`. 

```shell
export GH_ACCESS_TOKEN=xxxx
```

In [1]:
import os
# os.getcwd()

In [12]:
# os.chdir("../dataset_generation")
# os.getcwd()

Clone repos

In [13]:
# !python clone_hf_repos.py

Check repos

In [14]:
# !ls hf_public_repos

In [15]:
# import nltk
# nltk.download('punkt')

Run data processing pipeline

In [16]:
# !python pipeline.py

We can then collate the processed data and push it to the Hub.

```shell
python prepare_hf_dataset.py
```

Alternatively, we can just download the dataset from the Hub.

### Train Model

```shell
python train.py \
    --model_name_or_path "bigcode/starcoder2-7b" \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16" \
    --use_flash_attn \
    --use_peft_lora \
    --use_4bit_quantization \
    --dataset_name "smangrul/hug_stack" \
    --dataset_text_field "text" \
    --max_seq_length 1024 \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --splits "train" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --bf16 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --max_steps 1000 \
    --warmup_steps 30 \
    --dataloader_num_workers 4 \
    --evaluation_strategy "steps" \
    --eval_steps 50 \
    --save_steps 50 \
    --logging_steps 25 \
    --output_dir "peft-lora-starcoder2-7b-personal-copilot-dual-3090-local" 
```

If the training is interrupted, we can resume it by adding `--resume_from_checkpoint "path/to/checkpoint"`.

```shell
python train.py \
    --model_name_or_path "bigcode/starcoder2-7b" \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16" \
    --use_flash_attn \
    --use_peft_lora \
    --use_4bit_quantization \
    --dataset_name "smangrul/hug_stack" \
    --dataset_text_field "text" \
    --max_seq_length 1024 \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --splits "train" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --bf16 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --max_steps 1000 \
    --warmup_steps 30 \
    --dataloader_num_workers 4 \
    --evaluation_strategy "steps" \
    --eval_steps 50 \
    --save_steps 50 \
    --logging_steps 25 \
    --output_dir "peft-lora-starcoder2-7b-personal-copilot-dual-3090-local" \
    --resume_from_checkpoint "peft-lora-starcoder2-7b-personal-copilot-dual-3090-local/checkpoint-450"
```
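To avoid typing the checkpoint path by hand, the newest checkpoint can be looked up from the output directory. The sketch below assumes checkpoints are saved as `checkpoint-<step>` subfolders, as in the path above; `latest_checkpoint` is a hypothetical helper and is not part of `train.py`.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Demo on a throwaway directory that mimics the layout written with --save_steps 50.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (50, 400, 450):
        os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
    print(os.path.basename(latest_checkpoint(demo_dir)))
```

The returned path can then be passed to `--resume_from_checkpoint`.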

### Using TensorBoard

```shell
cd personal_copilot/training/peft-lora-starcoder2-7b-personal-copilot-dual-3090-local
tensorboard --logdir=runs --bind_all
```

## Deep Dive 

### Dependencies

Now that we can run the training, let's go back to understand what is actually going on.

In [17]:
import sys
# sys.path

In [18]:
import os
# os.getcwd()

In [19]:
# add the parent directory to the path
sys.path.append('../training')
# sys.path

In [20]:
packages = ['ipywidgets']  # Add your packages here

for package in packages:
    !pip show {package} > /dev/null || pip install {package}

In [21]:
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
os.environ['WANDB_NOTEBOOK_NAME'] = 'code_copilot.ipynb'
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import random
import sys
import humanfriendly
from typing import Optional
from dataclasses import dataclass, field

import numpy as np
import torch
from datasets import load_dataset
from torch.utils.data import IterableDataset
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    HfArgumentParser,
    set_seed,
    BitsAndBytesConfig,
)

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import fim
from train import ModelArguments, DataTrainingArguments, chars_token_ratio, ConstantLengthDataset, create_datasets, create_and_prepare_model

We start by defining an `HfArgumentParser`. This class from the Hugging Face `transformers` library parses the arguments for the model, data, and training configurations into dataclasses.

* We can place all the arguments in a JSON file and use `parse_json_file`,
* or pass them on the command line and use `parse_args_into_dataclasses`.
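The idea of turning dataclass fields into command-line flags can be sketched with the standard library alone. This is a simplified stand-in and not how `HfArgumentParser` is implemented; `TinyModelArgs`, `TinyDataArgs`, and `parse_into_dataclasses` are hypothetical names used only for illustration.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class TinyModelArgs:  # hypothetical stand-in for ModelArguments
    model_name_or_path: str = "bigcode/starcoder2-7b"
    lora_r: int = 8


@dataclass
class TinyDataArgs:  # hypothetical stand-in for DataTrainingArguments
    dataset_name: str = "smangrul/hug_stack"
    max_seq_length: int = 512


def parse_into_dataclasses(dataclass_types, argv):
    """Register one --flag per dataclass field, then rebuild each dataclass."""
    parser = argparse.ArgumentParser()
    for dc_type in dataclass_types:
        for f in fields(dc_type):
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    parsed = vars(parser.parse_args(argv))
    return tuple(
        dc_type(**{f.name: parsed[f.name] for f in fields(dc_type)})
        for dc_type in dataclass_types
    )


model_args_demo, data_args_demo = parse_into_dataclasses(
    (TinyModelArgs, TinyDataArgs),
    ["--lora_r", "32", "--max_seq_length", "1024"],
)
print(model_args_demo)
print(data_args_demo)
```

Each flag ends up in the dataclass that declares it, which is why one flat argument list can populate several argument groups at once.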

### Inputs 

#### from command line

In [22]:
args = [
    "--model_name_or_path", "bigcode/starcoder2-7b",
    "--lora_r", "32",
    "--lora_alpha", "64",
    "--lora_dropout", "0.0",
    "--lora_target_modules", "c_proj,c_attn,q_attn,c_fc,c_proj",
    "--use_nested_quant",
    "--bnb_4bit_compute_dtype", "bfloat16",
    "--use_flash_attn",
    "--use_peft_lora",
    "--use_4bit_quantization",
    "--dataset_name", "smangrul/hug_stack",
    "--dataset_text_field", "text",
    "--max_seq_length", "1024",
    "--fim_rate", "0.5",
    "--fim_spm_rate", "0.5",
    "--splits", "train",
    "--per_device_train_batch_size", "2",
    "--per_device_eval_batch_size", "2",
    "--gradient_accumulation_steps", "4",
    "--bf16",
    "--learning_rate", "5e-4",
    "--lr_scheduler_type", "cosine",
    "--weight_decay", "0.01",
    "--max_steps", "1000",
    "--warmup_steps", "30",
    "--dataloader_num_workers", "4",
    "--eval_strategy", "steps",
    "--eval_steps", "50",
    "--save_steps", "50",
    "--logging_steps", "25",
    "--output_dir", "peft-lora-starcoder2-7b-personal-copilot-test"
]

In [23]:
# Parse arguments
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses(args)

In [24]:
model_args

ModelArguments(model_name_or_path='bigcode/starcoder2-7b', lora_alpha=64, lora_dropout=0.0, lora_r=32, lora_target_modules='c_proj,c_attn,q_attn,c_fc,c_proj', use_nested_quant=True, bnb_4bit_compute_dtype='bfloat16', bnb_4bit_quant_type='nf4', use_flash_attn=True, use_peft_lora=True, use_8bit_qunatization=False, use_4bit_quantization=True, use_reentrant=False, use_unsloth=False, use_loftq=False, use_loftq_callback=False)

In [25]:
data_args

DataTrainingArguments(dataset_name='smangrul/hug_stack', dataset_text_field='text', max_seq_length=1024, test_size=0.1, fim_rate=0.5, fim_spm_rate=0.5, splits='train')

In [26]:
training_args;

#### Use JSON to get input

In [27]:
input_json_path = "data/copilot_train_input.json"

In [28]:
os.getcwd()

'/home/charles/github/LLM-Workshop/personal_copilot/notebooks'

In [29]:
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_json_file(json_file=input_json_path)

In [30]:
model_args

ModelArguments(model_name_or_path='bigcode/starcoder2-7b', lora_alpha=64, lora_dropout=0.0, lora_r=32, lora_target_modules='c_proj,c_attn,q_attn,c_fc,c_proj', use_nested_quant=True, bnb_4bit_compute_dtype='bfloat16', bnb_4bit_quant_type='nf4', use_flash_attn=True, use_peft_lora=True, use_8bit_qunatization=False, use_4bit_quantization=True, use_reentrant=False, use_unsloth=False, use_loftq=False, use_loftq_callback=False)

In [31]:
data_args

DataTrainingArguments(dataset_name='smangrul/hug_stack', dataset_text_field='text', max_seq_length=1024, test_size=0.1, fim_rate=0.5, fim_spm_rate=0.5, splits='train')

In [32]:
training_args;

In [33]:
training_args.output_dir

'peft-lora-starcoder2-7b-personal-copilot-test'

### Tokenizer

In [34]:
training_args.seed

42

In [35]:
model_args.model_name_or_path

'bigcode/starcoder2-7b'

In [36]:
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
tokenizer

GPT2TokenizerFast(name_or_path='bigcode/starcoder2-7b', vocab_size=49152, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'additional_special_tokens': ['<|endoftext|>', '<fim_prefix>', '<fim_middle>', '<fim_suffix>', '<fim_pad>', '<repo_name>', '<file_sep>', '<issue_start>', '<issue_comment>', '<issue_closed>', '<jupyter_start>', '<jupyter_text>', '<jupyter_code>', '<jupyter_output>', '<jupyter_script>', '<empty_output>', '<code_to_intermediate>', '<intermediate_to_code>', '<pr>', '<pr_status>', '<pr_is_merged>', '<pr_base>', '<pr_file>', '<pr_base_code>', '<pr_diff>', '<pr_diff_hunk>', '<pr_comment>', '<pr_event_id>', '<pr_review>', '<pr_review_state>', '<pr_review_comment>', '<pr_in_reply_to_review_id>', '<pr_in_reply_to_comment_id>', '<pr_diff_hunk_comment_line>', '<NAME>', '<EMAIL>', '<KEY>', '<PASSWORD>']}, clean_u

In [37]:
vars(data_args)

{'dataset_name': 'smangrul/hug_stack',
 'dataset_text_field': 'text',
 'max_seq_length': 1024,
 'test_size': 0.1,
 'fim_rate': 0.5,
 'fim_spm_rate': 0.5,
 'splits': 'train'}

### Datasets

#### Load dataset

In [38]:
seed = training_args.seed
seed

42

In [39]:
data_args.dataset_name

'smangrul/hug_stack'

In [40]:
data_args.splits

'train'

In [41]:
dataset = load_dataset(data_args.dataset_name, split=data_args.splits)
dataset

Dataset({
    features: ['text', 'id', 'metadata', '__index_level_0__'],
    num_rows: 6579
})

Split the dataset into training and validation 

In [42]:
test_size = data_args.test_size
test_size

0.1

In [43]:
dataset = dataset.train_test_split(
    test_size=test_size, seed=seed, shuffle=True
)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'id', 'metadata', '__index_level_0__'],
        num_rows: 5921
    })
    test: Dataset({
        features: ['text', 'id', 'metadata', '__index_level_0__'],
        num_rows: 658
    })
})
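The split sizes can be checked by hand. The sketch below assumes the test split is rounded up and the remainder goes to the training split, which is consistent with the counts shown above; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(num_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test split up."""
    n_test = math.ceil(num_rows * test_size)
    return num_rows - n_test, n_test


print(split_sizes(6579, 0.1))
```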

In [44]:
train_data = dataset["train"]
train_data

Dataset({
    features: ['text', 'id', 'metadata', '__index_level_0__'],
    num_rows: 5921
})

In [45]:
valid_data = dataset["test"]
valid_data

Dataset({
    features: ['text', 'id', 'metadata', '__index_level_0__'],
    num_rows: 658
})

In [46]:
print(
    f"Size of the train set: {len(train_data)}. Size of the validation set: {len(valid_data)}"
)

Size of the train set: 5921. Size of the validation set: 658


In [47]:
data_column = data_args.dataset_text_field
data_column

'text'

#### Check the number of tokens 

In [48]:
def total_tokens(dataset, tokenizer, data_column):
    """
    Compute the total number of tokens in the dataset.
    """
    total_tokens = 0
    for example in tqdm(dataset):
        total_tokens += len(tokenizer(example[data_column]).tokens())

    return total_tokens

In [49]:
# total_tokens_train = total_tokens(train_data, tokenizer, data_column)
# total_tokens_train

Let's cache the results since it takes time to run:

In [50]:
# Create a memory object for caching
import shutil
cache_dir = 'data/cache/total_tokens'

In [51]:
## if we want to delete the cache
# if os.path.exists(cache_dir):
#     shutil.rmtree(cache_dir)

In [52]:
from joblib import Memory

os.makedirs(cache_dir, exist_ok=True)
memory = Memory(cache_dir, verbose=0)

@memory.cache
def total_tokens(dataset, tokenizer, data_column):
    """
    Compute the total number of tokens in the dataset.
    """
    total_tokens = 0
    for example in tqdm(dataset):
        total_tokens += len(tokenizer(example[data_column]).tokens())
    
    return total_tokens

In [53]:
total_tokens_train = total_tokens(train_data, tokenizer, data_column)
total_tokens_train

23328978

In [54]:
# total_tokens_train = humanfriendly.format_number(total_tokens_train)
# total_tokens_train

In [55]:
print(f"The total number of tokens in the training dataset is: {total_tokens_train:,}")

The total number of tokens in the training dataset is: 23,328,978


The training dataset contains about 23M tokens.
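This count puts the training budget into perspective. The sketch below uses the batch size, gradient accumulation, sequence length, and `max_steps` from the training command; the number of devices is an assumption (set to 1 here), and the estimate ignores the extra separator and FIM sentinel tokens added during packing.

```python
def tokens_per_optimizer_step(per_device_batch, grad_accum, seq_length, num_devices=1):
    """Tokens consumed by one optimizer step across all devices."""
    return per_device_batch * grad_accum * seq_length * num_devices


total_train_tokens = 23_328_978  # measured above
max_steps = 1000

per_step = tokens_per_optimizer_step(per_device_batch=2, grad_accum=4, seq_length=1024)
tokens_seen = per_step * max_steps
print(f"tokens per optimizer step: {per_step:,}")
print(f"tokens seen in {max_steps} steps: {tokens_seen:,}")
print(f"fraction of one epoch: {tokens_seen / total_train_tokens:.2f}")
```

With two devices the same run would cover twice as many tokens.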

In [56]:
def chars_token_ratio(dataset, tokenizer, data_column, nb_examples=400):
    """
    Estimate the average number of characters per token in the dataset.
    """
    total_characters, total_tokens = 0, 0
    for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):
        total_characters += len(example[data_column])
        total_tokens += len(tokenizer(example[data_column]).tokens())

    return total_characters / total_tokens


In [57]:
chars_per_token = chars_token_ratio(train_data, tokenizer, data_column)
chars_per_token

  0%|          | 0/400 [00:00<?, ?it/s]

100%|██████████| 400/400 [00:01<00:00, 212.55it/s]


3.6223575039906772

In [58]:
print(f"The character to token ratio of the dataset is: {chars_per_token:.2f}")

The character to token ratio of the dataset is: 3.62


#### Format train and validation datasets

In [59]:
# train_dataset, eval_dataset = create_datasets(
#     tokenizer, data_args, training_args.seed
# )

In [60]:
ConstantLengthDataset.__init__??

[0;31mSignature:[0m
[0mConstantLengthDataset[0m[0;34m.[0m[0m__init__[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mself[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mtokenizer[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdataset[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0minfinite[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mseq_length[0m[0;34m=[0m[0;36m1024[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mnum_of_sequences[0m[0;34m=[0m[0;36m1024[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mchars_per_token[0m[0;34m=[0m[0;36m3.6[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mcontent_field[0m[0;34m=[0m[0;34m'content'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mfim_rate[0m[0;34m=[0m[0;36m0.5[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mfim_spm_rate[0m[0;34m=[0m[0;36m0.5[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mseed[0m[0;34m=[0m[0;36m0[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mshuffle[0m[0;34m=[0m[0;32mFalse[0m

In [61]:
max_seq_length = data_args.max_seq_length
max_seq_length

1024

In [62]:
fim_rate = data_args.fim_rate
fim_rate

0.5

In [63]:
fim_spm_rate = data_args.fim_spm_rate
fim_spm_rate

0.5
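The two rates control the fill-in-the-middle (FIM) transformation: `fim_rate` is the probability that a sample is rearranged at all, and `fim_spm_rate` is the probability that a rearranged sample uses the suffix-prefix-middle (SPM) layout instead of prefix-suffix-middle (PSM). Below is a character-level sketch of that idea. The `fim` module used by `train.py` works on token ids, so `fim_transform` is a hypothetical simplification; only the sentinel strings are taken from the tokenizer output above.

```python
import random

FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"


def fim_transform(text, rng, fim_rate=0.5, fim_spm_rate=0.5):
    """Randomly rearrange text into PSM or SPM order, or leave it unchanged."""
    if rng.random() >= fim_rate:
        return text  # keep the sample as ordinary left-to-right text
    # Pick two cut points that split the text into prefix, middle, and suffix.
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    if rng.random() < fim_spm_rate:
        # SPM: the suffix comes first, then prefix and middle together.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}{middle}"
    # PSM: prefix, then suffix, then the middle the model has to fill in.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


rng = random.Random(42)
for _ in range(3):
    print(fim_transform("def add(a, b):\n    return a + b\n", rng))
```

In both layouts the middle segment ends up last, so the usual next-token objective teaches the model to infill code between a given prefix and suffix.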

In [64]:
train_dataset = ConstantLengthDataset(
    tokenizer,
    train_data,
    infinite=True,
    seq_length=max_seq_length,
    chars_per_token=chars_per_token,
    content_field=data_column,
    fim_rate=fim_rate,
    fim_spm_rate=fim_spm_rate,
    seed=seed,
    shuffle=True,
)
train_dataset
print(f"A sample of train dataset: {next(iter(train_dataset))}")

A sample of train dataset: {'input_ids': tensor([   63, 20455,    53,  ...,    45,  1612,    46]), 'labels': tensor([   63, 20455,    53,  ...,    45,  1612,    46])}


In [65]:
eval_dataset = ConstantLengthDataset(
    tokenizer,
    valid_data,
    infinite=False,
    seq_length=max_seq_length,
    chars_per_token=chars_per_token,
    content_field=data_column,
    fim_rate=fim_rate,
    fim_spm_rate=fim_spm_rate,
    seed=seed,
)
print(f"A sample of valid dataset: {next(iter(eval_dataset))}")

A sample of valid dataset: {'input_ids': tensor([   40, 10633,    66,  ...,  6878,    49,   327]), 'labels': tensor([   40, 10633,    66,  ...,  6878,    49,   327])}
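Both samples have identical `input_ids` and `labels` of a fixed length because the dataset packs many documents into one token stream and cuts it into equal chunks. A minimal sketch of that packing step, assuming documents are joined with a separator token and the incomplete tail is dropped; `pack_constant_length` is a hypothetical helper, not the `ConstantLengthDataset` code itself.

```python
def pack_constant_length(token_lists, seq_length, sep_id):
    """Concatenate tokenized documents and cut the stream into fixed-size chunks."""
    buffer = []
    for tokens in token_lists:
        buffer.extend(tokens + [sep_id])  # separator marks the document boundary
    usable = len(buffer) - len(buffer) % seq_length  # drop the incomplete tail
    return [buffer[i:i + seq_length] for i in range(0, usable, seq_length)]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
for chunk in pack_constant_length(docs, seq_length=4, sep_id=0):
    # For causal language modeling the labels are a copy of the inputs.
    print({"input_ids": chunk, "labels": list(chunk)})
```

Packing avoids padding entirely, so every position in a batch contributes to the loss.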


In [66]:
train_dataset.start_iteration = 0

ConstantLengthDataset deepdive

### Load Pre-trained Model

In [67]:
device_map = None
bnb_config = None

In [68]:
load_in_8bit = model_args.use_8bit_qunatization
load_in_8bit

False

In [69]:
model_args.use_unsloth

False

In [70]:
if model_args.use_unsloth:
    from unsloth import FastLanguageModel

In [71]:
load_in_4bit = model_args.use_4bit_quantization
load_in_4bit

True

#### Quantization & bnb config

We are using [QLoRA](https://huggingface.co/papers/2305.14314), a fine-tuning method that combines two ideas.

First, it quantizes the pretrained model to 4 bits, which sharply reduces the memory needed to hold the weights.

Second, it adds a set of Low-Rank Adaptation (LoRA) weights to the model. The quantized base weights stay frozen and only the LoRA weights are trained, with gradients backpropagated through the quantized weights.

In addition to the conventional Float4 data type (LinearFP4), QLoRA introduces a 4-bit NormalFloat (LinearNF4) data type. It is designed for quantizing normally distributed data and can improve the quantized model's performance.
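To make "quantizing to 4 bits" concrete, here is a minimal sketch of blockwise absmax quantization. It is a simplification: it uses 16 evenly spaced levels, whereas NF4 uses 16 levels placed according to a normal distribution. The function names are hypothetical and this is not the bitsandbytes implementation.

```python
def quantize_block(values, levels=16):
    """Map each value to the index of the nearest of `levels` points in [-1, 1]."""
    absmax = max(abs(v) for v in values) or 1.0  # per-block quantization constant
    codebook = [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]
    codes = [
        min(range(levels), key=lambda i: abs(codebook[i] - v / absmax))
        for v in values
    ]
    return codes, absmax, codebook


def dequantize_block(codes, absmax, codebook):
    """Recover approximate values from the 4-bit codes and the block constant."""
    return [codebook[c] * absmax for c in codes]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]
codes, absmax, codebook = quantize_block(weights)
restored = dequantize_block(codes, absmax, codebook)
print(codes)
print([round(w, 3) for w in restored])
```

Each weight is stored as a code between 0 and 15 (4 bits), plus one shared constant per block.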

##### 4bit quantization

In [72]:
bnb_4bit_compute_dtype = model_args.bnb_4bit_compute_dtype
bnb_4bit_compute_dtype

'bfloat16'

In [73]:
bnb_4bit_quant_type = model_args.bnb_4bit_quant_type
bnb_4bit_quant_type

'nf4'

In [74]:
bnb_4bit_use_double_quant = model_args.use_nested_quant
bnb_4bit_use_double_quant

True

In [75]:
# if load_in_4bit:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
compute_dtype

torch.bfloat16

In [76]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
)
bnb_config

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

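##### what does `compute_dtype` do?

The 4-bit weights are only a storage format; they are dequantized to `compute_dtype` for the actual matrix multiplications. We can change this data type from the default `float32` to `bfloat16` to speed up computation. This requires a GPU whose CUDA compute capability supports `torch.bfloat16`.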

In [77]:
if compute_dtype == torch.float16 and load_in_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print(
            "Your GPU supports bfloat16, you can accelerate training with the argument --bf16"
        )
        print("=" * 80)

In [78]:
torch.cuda.get_device_capability()

(8, 6)
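To see what computing in bfloat16 trades away, note that a bfloat16 value is the upper 16 bits of a float32: the same 8 exponent bits, but only 7 mantissa bits. The sketch below emulates the conversion by truncation using only the standard library. Real hardware rounds to nearest rather than truncating, and `to_bfloat16` is a hypothetical helper.

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


for value in (1.0, 3.14159, 3.0e38):
    approx = to_bfloat16(value)
    print(f"{value!r} -> {approx!r} (relative error {abs(approx - value) / value:.2e})")
```

The range of float32 is preserved, so large values do not overflow; only precision is reduced.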

##### quantization type

In [79]:
bnb_4bit_quant_type = model_args.bnb_4bit_quant_type
bnb_4bit_quant_type

'nf4'

[NF4](https://huggingface.co/docs/transformers/main/en/quantization?bnb=4-bit) is a 4-bit data type adapted for weights initialized from a normal distribution.

In [80]:
from bitsandbytes.nn import modules

In [81]:
modules.Linear4bit.__init__??

[0;31mSignature:[0m
[0mmodules[0m[0;34m.[0m[0mLinear4bit[0m[0;34m.[0m[0m__init__[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mself[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0minput_features[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0moutput_features[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mbias[0m[0;34m=[0m[0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mcompute_dtype[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mcompress_statistics[0m[0;34m=[0m[0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mquant_type[0m[0;34m=[0m[0;34m'fp4'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mquant_storage[0m[0;34m=[0m[0mtorch[0m[0;34m.[0m[0muint8[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdevice[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mSource:[0m   
    [0;32mdef[0m [0m__init__[0m[0;34m([0m[0;34m[0m
[0;34m[0m        [0mself[0m[0;34m,[0m

In [82]:
modules.Linear4bit.set_compute_type??

[0;31mSignature:[0m [0mmodules[0m[0;34m.[0m[0mLinear4bit[0m[0;34m.[0m[0mset_compute_type[0m[0;34m([0m[0mself[0m[0;34m,[0m [0mx[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m <no docstring>
[0;31mSource:[0m   
    [0;32mdef[0m [0mset_compute_type[0m[0;34m([0m[0mself[0m[0;34m,[0m [0mx[0m[0;34m)[0m[0;34m:[0m[0;34m[0m
[0;34m[0m        [0;32mif[0m [0mx[0m[0;34m.[0m[0mdtype[0m [0;32min[0m [0;34m[[0m[0mtorch[0m[0;34m.[0m[0mfloat32[0m[0;34m,[0m [0mtorch[0m[0;34m.[0m[0mbfloat16[0m[0;34m][0m[0;34m:[0m[0;34m[0m
[0;34m[0m            [0;31m# the input is in a dtype that is safe to compute in, we switch[0m[0;34m[0m
[0;34m[0m            [0;31m# to this type for speed and stability[0m[0;34m[0m
[0;34m[0m            [0mself[0m[0;34m.[0m[0mcompute_dtype[0m [0;34m=[0m [0mx[0m[0;34m.[0m[0mdtype[0m[0;34m[0m
[0;34m[0m        [0;32melif[0m [0mx[0m[0;34m.[0m[0mdtype[0m [0;34m==[0m

We've set the `compute_dtype` for bnb to `torch.bfloat16`.

##### Nested quantization

[Nested quantization](https://huggingface.co/docs/transformers/main/en/quantization?bnb=4-bit) performs a second round of quantization, this time on the quantization constants produced by the first round, to save roughly an additional 0.4 bits per parameter.

In [83]:
bnb_4bit_use_double_quant

True
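The 0.4 bits/parameter figure can be reproduced with a little arithmetic. The sketch assumes the block sizes described in the QLoRA paper: one 32-bit constant per block of 64 weights, which nested quantization replaces with an 8-bit constant plus one 32-bit constant per 256 blocks. The function name is hypothetical.

```python
def overhead_bits_per_param(block_size=64, const_bits=32, nested=False,
                            nested_block=256, nested_const_bits=8):
    """Extra bits per weight spent on storing quantization constants."""
    if not nested:
        return const_bits / block_size
    # First-level constants shrink to 8 bits; second-level constants are shared
    # across `nested_block` first-level blocks.
    return nested_const_bits / block_size + const_bits / (block_size * nested_block)


plain = overhead_bits_per_param()
nested = overhead_bits_per_param(nested=True)
print(f"without nested quantization: {plain:.3f} bits/param")
print(f"with nested quantization:    {nested:.3f} bits/param")
print(f"savings:                     {plain - nested:.3f} bits/param")
```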

##### Device Map (either 4bit or 8bit quantization)

```python
if args.use_4bit_quantization or args.use_8bit_qunatization:
    device_map = (
        int(os.environ.get("LOCAL_RANK", -1))
        if torch.distributed.is_available() and torch.distributed.is_initialized()
        else "auto"
    )  # {"": 0}
```

In [84]:
os.environ.get("LOCAL_RANK", -1)

-1

In [85]:
torch.distributed.is_available() 

True

In [86]:
torch.distributed.is_initialized()

False

`torch.distributed.is_initialized()` is false so the `device_map` is set to "auto".

In [87]:
device_map = (
    int(os.environ.get("LOCAL_RANK", -1))
    if torch.distributed.is_available() and torch.distributed.is_initialized()
    else "auto"
)  # {"": 0}
device_map

'auto'

The `device_map` variable is used to determine the device mapping for distributed training when using quantization.

In the context of distributed training, each process runs on a specific device (like a GPU). The `device_map` variable is used to specify which device the current process should run on.

`int(os.environ.get("LOCAL_RANK", -1))` tries to get the `LOCAL_RANK` environment variable, which distributed launchers typically set to indicate the rank of the current process. The local rank identifies a process among those running on the same machine, so it doubles as the index of the GPU that the process should use. If `LOCAL_RANK` is not set, it defaults to -1.

The `torch.distributed.is_available()` and `torch.distributed.is_initialized()` checks ensure that the PyTorch distributed package is available and has been initialized. If both conditions are met, the code is running in a distributed training setup.

Otherwise `device_map` is set to "auto", and the model's layers are placed automatically on the available devices, GPUs first.
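The decision can be isolated into a small function that is easy to exercise without a GPU or a process group. `choose_device_map` is a hypothetical helper that takes the distributed state and the environment as plain arguments; it mirrors the expression above rather than replacing it.

```python
def choose_device_map(distributed_initialized, environ):
    """Return the local rank in a distributed run, otherwise the string 'auto'."""
    if distributed_initialized:
        # Each process loads the whole model onto the GPU matching its local rank.
        return int(environ.get("LOCAL_RANK", -1))
    return "auto"


print(choose_device_map(False, {}))                   # single process, as in this notebook
print(choose_device_map(True, {"LOCAL_RANK": "1"}))   # second process of a distributed run
```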

When using the 8-bit quantized model, it is possible to [offload weights between the CPU and GPU](https://huggingface.co/docs/transformers/main/en/quantization?bnb=4-bit#offloading) with a custom `device_map` setting such as:

```python
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
```

`0` is the index of the first GPU. Offloading part of the model to the CPU in this way makes it possible to fit very large models into memory.

#### Load model 

Depending on whether `unsloth` is used, we use different methods to load the model. 

We also specify different attention mechanisms.

In [88]:
model_args.use_unsloth

False

If `unsloth` is not used, we initialize the model with `AutoModelForCausalLM`.

```python
if args.use_unsloth:
    # Load model
    model, _ = FastLanguageModel.from_pretrained(
        model_name=args.model_name_or_path,
        max_seq_length=data_args.max_seq_length,
        dtype=None,
        load_in_4bit=load_in_4bit,
    )
else:
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        load_in_8bit=load_in_8bit,
        quantization_config=bnb_config,
        device_map=device_map,
        trust_remote_code=True,
        attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
    )
```

In [89]:
model_args.model_name_or_path

'bigcode/starcoder2-7b'

In [90]:
bnb_config

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

See also [quantization with bits and bytes](https://huggingface.co/docs/transformers/main/en/quantization?bnb=4-bit)

In [91]:
device_map

'auto'

##### flash attention

Using [Flash Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#Flash-Attention-2) in transformers can help speed up the training throughput. 

In [92]:
model_args.use_flash_attn

True

So we are using flash attention.

In [93]:
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    load_in_8bit=load_in_8bit,
    quantization_config=bnb_config,
    device_map=device_map,
    trust_remote_code=True,
    attn_implementation="flash_attention_2" if model_args.use_flash_attn else "eager",
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [94]:
model

Starcoder2ForCausalLM(
  (model): Starcoder2Model(
    (embed_tokens): Embedding(49152, 4608)
    (layers): ModuleList(
      (0-31): 32 x Starcoder2DecoderLayer(
        (self_attn): Starcoder2FlashAttention2(
          (q_proj): Linear4bit(in_features=4608, out_features=4608, bias=True)
          (k_proj): Linear4bit(in_features=4608, out_features=512, bias=True)
          (v_proj): Linear4bit(in_features=4608, out_features=512, bias=True)
          (o_proj): Linear4bit(in_features=4608, out_features=4608, bias=True)
          (rotary_emb): Starcoder2RotaryEmbedding()
        )
        (mlp): Starcoder2MLP(
          (c_fc): Linear4bit(in_features=4608, out_features=18432, bias=True)
          (c_proj): Linear4bit(in_features=18432, out_features=4608, bias=True)
          (act): PytorchGELUTanh()
        )
        (input_layernorm): LayerNorm((4608,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((4608,), eps=1e-05, elementwise_affine=True)
      )

In [95]:
print(f"The memory footprint of the model is: {model.get_memory_footprint():,}")

The memory footprint of the model is: 4,197,640,192


So it is about 4.2 GB (3.9 GiB).

### Prep peft_lora with quantization and no unsloth

#### LORA PEFT

Parameter-Efficient Fine-Tuning (PEFT) is a family of techniques for fine-tuning large models with limited resources. It freezes the pretrained model parameters during fine-tuning and adds a small set of trainable parameters, called adapters, on top of them. This significantly [reduces the memory](https://huggingface.co/docs/transformers/model_memory_anatomy#anatomy-of-models-memory) required to fine-tune the model.

Low-Rank Adaptation [(LoRA)](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) is a popular adapter-based method. It represents the weight updates with two smaller "update matrices" obtained through low-rank decomposition. The original weight matrix stays frozen while the update matrices are trained on the new data. At the end, the original weights and the adapter weights are combined to produce the new weights.

The performance of LoRA fine-tuned models has been found to be comparable to that of fully fine-tuned models. Once the adapter weights are merged with the base model, LoRA introduces no additional inference latency.

LoRA is orthogonal to other PEFT methods and can be combined with them.

When fine-tuning transformer models, LoRA is typically applied only to the attention blocks for simplicity. The number of parameters in the adapter is determined by the rank `r` and the shape of the original weight matrix.
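As a worked example of that last point, a weight matrix of shape `out_features x in_features` gets two update matrices of shapes `r x in_features` and `out_features x r`. The sketch below counts their parameters for two layer shapes taken from the model printout above, using `r=32`; biases are ignored and the helper names are hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in the two update matrices: A is r x in, B is out x r."""
    return r * (in_features + out_features)


def full_param_count(in_features, out_features):
    """Parameters in the frozen weight matrix itself."""
    return in_features * out_features


layer_shapes = {"q_proj": (4608, 4608), "c_fc": (4608, 18432)}
for name, (d_in, d_out) in layer_shapes.items():
    lora = lora_param_count(d_in, d_out, r=32)
    full = full_param_count(d_in, d_out)
    print(f"{name}: {lora:,} LoRA parameters vs {full:,} frozen weights ({lora / full:.2%})")
```

Halving `r` halves the adapter size, while the frozen matrix is unaffected.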





When we use 4-bit or 8-bit quantization together with LoRA, and we are not using unsloth, we prepare the model for k-bit training as follows.

```python
if (
    (args.use_4bit_quantization or args.use_8bit_qunatization)
    and args.use_peft_lora
    and not args.use_unsloth
):
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=training_args.gradient_checkpointing,
        gradient_checkpointing_kwargs={"use_reentrant": model_args.use_reentrant},
    )
```

In [96]:
prepare_model_for_kbit_training?

[0;31mSignature:[0m
[0mprepare_model_for_kbit_training[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mmodel[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0muse_gradient_checkpointing[0m[0;34m=[0m[0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mgradient_checkpointing_kwargs[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m
Note this method only works for `transformers` models.

This method wraps the entire protocol for preparing a model before running a training. This includes:
    1- Cast the layernorm in fp32 2- making output embedding layer require grads 3- Add the upcasting of the lm
    head to fp32

Args:
    model (`transformers.PreTrainedModel`):
        The loaded model from `transformers`
    use_gradient_checkpointing (`bool`, *optional*, defaults to `True`):
        If True, use gradient checkpointing to save memory at the expense of slower backward pass.
    gradient_checkpointing_kwargs

### Create peft model 

Depending on whether unsloth is used, we use different methods:

```python
if args.use_peft_lora and not args.use_unsloth:
    peft_config = LoraConfig(
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
        r=args.lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=args.lora_target_modules.split(",")
        if args.lora_target_modules != "all-linear"
        else args.lora_target_modules,
    )
    model = get_peft_model(model, peft_config)
elif args.use_peft_lora and args.use_unsloth:
    # Do model patching and add fast LoRA weights
    model = FastLanguageModel.get_peft_model(
        model,
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
        r=args.lora_r,
        target_modules=args.lora_target_modules.split(",")
        if args.lora_target_modules != "all-linear"
        else args.lora_target_modules,
        use_gradient_checkpointing=training_args.gradient_checkpointing,
        random_state=training_args.seed,
        max_seq_length=data_args.max_seq_length,
    )
```

##### lora_config

In [97]:
model_args.use_peft_lora

True

In [98]:
model_args.use_unsloth

False

##### lora_alpha

Scaling factor for the LoRA update. With the default settings the update is multiplied by `lora_alpha / r`.

In [99]:
model_args.lora_alpha

64

In [100]:
model_args.lora_dropout

0.0

##### lora_r

Rank of the "update matrices", as an integer. A lower rank leads to smaller update matrices and fewer trainable parameters.

In [101]:
model_args.lora_r

32
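With `lora_alpha=64` and `r=32`, the update is scaled by 2. The sketch below shows how the frozen weight, the two update matrices, and the scaling fit together in a forward pass. It uses plain Python lists and tiny made-up matrices, assumes the standard `lora_alpha / r` scaling, and is not the PEFT implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) with W frozen and A, B trainable."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1, 0, 0], [0, 1, 0]]   # frozen 2 x 3 weight
A = [[1, 1, 1]]              # r x 3 with r = 1 in this toy example
B_init = [[0], [0]]          # B starts at zero, so training starts from the base model
x = [1, 2, 3]

print(64 / 32)                                              # scaling used in this notebook
print(lora_forward(W, A, B_init, x, lora_alpha=2, r=1))     # identical to W x
print(lora_forward(W, A, [[1], [0]], x, lora_alpha=2, r=1)) # after B has changed
```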

##### bias

Whether `bias` parameters should be trained. Here `bias="none"`, so none of them are.

##### target modules

The modules (e.g., attention blocks etc.) to which the LoRA weights are applied.

In [102]:
model_args.lora_target_modules.split(",")

['c_proj', 'c_attn', 'q_attn', 'c_fc', 'c_proj']
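It is worth checking which layers these names actually select. The sketch below assumes that a target matches a module when it equals the module's full name or its last dotted component, and it builds the module names by hand from the StarCoder2 printout above. `matches_target` is a hypothetical helper, not a PEFT function.

```python
def matches_target(module_name, target_modules):
    """True if the module name equals a target or ends with '.<target>'."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


# Linear layers of the first decoder layer, following the model printout.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.c_fc",
    "model.layers.0.mlp.c_proj",
]
target_modules = "c_proj,c_attn,q_attn,c_fc,c_proj".split(",")

matched = [name for name in module_names if matches_target(name, target_modules)]
print(matched)
```

Under that assumption only the MLP layers `c_fc` and `c_proj` are selected for this model; `c_attn` and `q_attn` do not occur in the printout, and the attention projections `q_proj`, `k_proj`, `v_proj`, and `o_proj` are not listed as targets.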

In [103]:
peft_config = LoraConfig(
    lora_alpha=model_args.lora_alpha,
    lora_dropout=model_args.lora_dropout,
    r=model_args.lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=model_args.lora_target_modules.split(",")
    if model_args.lora_target_modules != "all-linear"
    else model_args.lora_target_modules,
)
peft_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=32, target_modules={'q_attn', 'c_attn', 'c_proj', 'c_fc'}, lora_alpha=64, lora_dropout=0.0, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None)

In [104]:
vars(peft_config)

{'peft_type': <PeftType.LORA: 'LORA'>,
 'auto_mapping': None,
 'base_model_name_or_path': None,
 'revision': None,
 'task_type': 'CAUSAL_LM',
 'inference_mode': False,
 'r': 32,
 'target_modules': {'c_attn', 'c_fc', 'c_proj', 'q_attn'},
 'lora_alpha': 64,
 'lora_dropout': 0.0,
 'fan_in_fan_out': False,
 'bias': 'none',
 'use_rslora': False,
 'modules_to_save': None,
 'init_lora_weights': True,
 'layers_to_transform': None,
 'layers_pattern': None,
 'rank_pattern': {},
 'alpha_pattern': {},
 'megatron_config': None,
 'megatron_core': 'megatron.core',
 'loftq_config': {},
 'use_dora': False,
 'layer_replication': None}

In [105]:
from peft.tuners.lora.config import LoraConfig
LoraConfig?

Init signature:
LoraConfig(
    peft_type: Union[str, peft.utils.peft_types.PeftType, NoneType] = None,
    auto_mapping: Optional[dict] = None,
    base_model_name_or_path: Optional[str] = None,
    revision: Optional[str] = None,
    task_type: Union[str, peft.utils.peft_types.TaskTyp

##### get_peft_model

In [106]:
model = get_peft_model(model, peft_config)

In [107]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Starcoder2ForCausalLM(
      (model): Starcoder2Model(
        (embed_tokens): Embedding(49152, 4608)
        (layers): ModuleList(
          (0-31): 32 x Starcoder2DecoderLayer(
            (self_attn): Starcoder2FlashAttention2(
              (q_proj): Linear4bit(in_features=4608, out_features=4608, bias=True)
              (k_proj): Linear4bit(in_features=4608, out_features=512, bias=True)
              (v_proj): Linear4bit(in_features=4608, out_features=512, bias=True)
              (o_proj): Linear4bit(in_features=4608, out_features=4608, bias=True)
              (rotary_emb): Starcoder2RotaryEmbedding()
            )
            (mlp): Starcoder2MLP(
              (c_fc): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4608, out_features=18432, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
    

In [108]:
get_peft_model?

Signature:
get_peft_model(
    model: 'PreTrainedModel',
    peft_config: 'PeftConfig',
    adapter_name: 'str' = 'default',
    mixed: 'bool' = False,
) -> 'PeftModel | PeftMixedModel'
Docstring:
Returns a Peft model object from a model and a config.

Args:
    model ([`transformers.PreTrainedModel`]):
        Model to be wrapped.
    peft_config ([`PeftConfig`]):
        Configuration object containing the parameters of the Peft model.
    adapter_name (`str`, `optional`, defaults to `"default"`):
        The name of the adapter to be injected, if not provided, the default 

### Configure gradient checkpointing

Gradient checkpointing is a technique used to reduce the memory usage when training deep learning models, at the cost of increased computation time. It's useful when training large models that would otherwise not fit in memory.

```python
    model.config.use_cache = not training_args.gradient_checkpointing
```

This line disables the model's cache when gradient checkpointing is enabled. Here `use_cache` controls the key/value cache, which stores past attention states to speed up autoregressive generation. The cache is not needed during training and costs extra memory, so it is turned off whenever gradient checkpointing is used.


In [109]:
model.config.use_cache = not training_args.gradient_checkpointing
model.config.use_cache

True


```python
    training_args.gradient_checkpointing = (
        training_args.gradient_checkpointing and not model_args.use_unsloth
    )
    if training_args.gradient_checkpointing:
        training_args.gradient_checkpointing_kwargs = {
            "use_reentrant": model_args.use_reentrant
        }
```

We enable gradient checkpointing only if it was initially enabled and `use_unsloth` is not set in the model arguments.

If gradient checkpointing is enabled, we set the `use_reentrant` argument according to the provided input arguments.

In [110]:
training_args.gradient_checkpointing

False

In [111]:
training_args.gradient_checkpointing and not model_args.use_unsloth

False

In [112]:
model_args.use_reentrant

False

In [113]:
training_args.gradient_checkpointing = (
    training_args.gradient_checkpointing and not model_args.use_unsloth
)
if training_args.gradient_checkpointing:
    training_args.gradient_checkpointing_kwargs = {
        "use_reentrant": model_args.use_reentrant
    }

In [114]:
training_args.gradient_checkpointing

False

### Review all the arguments

In [116]:
vars(model_args)

{'model_name_or_path': 'bigcode/starcoder2-7b',
 'lora_alpha': 64,
 'lora_dropout': 0.0,
 'lora_r': 32,
 'lora_target_modules': 'c_proj,c_attn,q_attn,c_fc,c_proj',
 'use_nested_quant': True,
 'bnb_4bit_compute_dtype': 'bfloat16',
 'bnb_4bit_quant_type': 'nf4',
 'use_flash_attn': True,
 'use_peft_lora': True,
 'use_8bit_qunatization': False,
 'use_4bit_quantization': True,
 'use_reentrant': False,
 'use_unsloth': False,
 'use_loftq': False,
 'use_loftq_callback': False}

In [118]:
vars(data_args)

{'dataset_name': 'smangrul/hug_stack',
 'dataset_text_field': 'text',
 'max_seq_length': 1024,
 'test_size': 0.1,
 'fim_rate': 0.5,
 'fim_spm_rate': 0.5,
 'splits': 'train'}

In [119]:
vars(training_args)

{'output_dir': 'peft-lora-starcoder2-7b-personal-copilot-test',
 'overwrite_output_dir': False,
 'do_train': False,
 'do_eval': True,
 'do_predict': False,
 'eval_strategy': <IntervalStrategy.STEPS: 'steps'>,
 'prediction_loss_only': False,
 'per_device_train_batch_size': 2,
 'per_device_eval_batch_size': 2,
 'per_gpu_train_batch_size': None,
 'per_gpu_eval_batch_size': None,
 'gradient_accumulation_steps': 4,
 'eval_accumulation_steps': None,
 'eval_delay': 0,
 'learning_rate': 0.0005,
 'weight_decay': 0.01,
 'adam_beta1': 0.9,
 'adam_beta2': 0.999,
 'adam_epsilon': 1e-08,
 'max_grad_norm': 1.0,
 'num_train_epochs': 3.0,
 'max_steps': 1000,
 'lr_scheduler_type': <SchedulerType.COSINE: 'cosine'>,
 'lr_scheduler_kwargs': {},
 'warmup_ratio': 0.0,
 'warmup_steps': 30,
 'log_level': 'passive',
 'log_on_each_node': True,
 'logging_dir': 'peft-lora-starcoder2-7b-personal-copilot-test/runs/May11_15-17-44_peace',
 'logging_strategy': <IntervalStrategy.STEPS: 'steps'>,
 'logging_first_step': F

Let's discuss the parameters that we have not yet covered.

#### batch size

A batch size that is a power of two (2^N) is recommended, often a multiple of 8.

[Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) define the multiplier based on the dtype and the hardware. For instance:
* for the fp16 data type, a multiple of 8 is recommended
* on an A100 GPU, a multiple of 64 is recommended
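As a small illustration of the rule of thumb above, the helper below rounds a candidate size up to the next multiple. The function name `round_up_to_multiple` is hypothetical and only meant as a sketch; which multiple is appropriate depends on the dtype and the GPU.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round a positive integer up to the nearest multiple of `multiple`."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


# fp16 example: align a candidate size of 30 to a multiple of 8.
aligned_fp16 = round_up_to_multiple(30, 8)

# A100 example: align a candidate size of 65 to a multiple of 64.
aligned_a100 = round_up_to_multiple(65, 64)

print(aligned_fp16, aligned_a100)
```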

#### gradient accumulation

Gradient accumulation is a technique designed to compute gradients in smaller, more manageable increments rather than processing the entire batch simultaneously. This method involves a series of forward and backward passes through the model, during which gradients are calculated and accumulated. After a sufficient number of gradients have been gathered, the optimization step of the model is carried out. 

The advantage of using gradient accumulation is that it allows for an increase in the effective batch size, surpassing the constraints set by the GPU's memory. However, it's crucial to be aware that the extra forward and backward passes required by this method can potentially decelerate the training process.
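The idea can be checked on a toy problem. The sketch below is pure Python and is not how `Trainer` implements accumulation: it fits a single weight `w` with a mean squared error loss and compares the full-batch gradient with the gradient accumulated over micro-batches. The function names and data are made up for illustration, and the two results agree only because every micro-batch has the same size.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    num_micro_batches = len(starts)
    grad = 0.0
    for start in starts:
        stop = start + micro_batch_size
        # One forward/backward pass on a small slice of the batch.
        grad += mse_grad(w, xs[start:stop], ys[start:stop]) / num_micro_batches
    return grad  # an optimizer step would use this accumulated value


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full = mse_grad(w, xs, ys)                                # all 8 samples at once
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)   # 4 micro-batches of 2
print(full, accum)
```

Only one micro-batch of activations has to be held in memory at a time, which is where the memory saving comes from.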

In [121]:
training_args.gradient_accumulation_steps

4

In [122]:
training_args.per_device_train_batch_size

2

The above results in a 4x2 = 8 effective batch size on a single GPU.
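The same arithmetic as a short sketch. The `num_devices` parameter is an assumption added for illustration: with data-parallel training on several GPUs, each device processes its own micro-batches, so the effective batch size scales with the device count. The token figure is an upper bound that assumes every sequence is filled to `max_seq_length`.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


single_gpu = effective_batch_size(2, 4)                  # values from the run above
two_gpus = effective_batch_size(2, 4, num_devices=2)     # assuming data parallelism

max_seq_length = 1024
max_steps = 1000
max_tokens_per_step = single_gpu * max_seq_length  # upper bound per optimizer step
sequences_seen = single_gpu * max_steps            # sequences consumed in the whole run

print(single_gpu, two_gpus, max_tokens_per_step, sequences_seen)
```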

#### gradient checkpointing

Gradient checkpointing is a technique that balances memory usage and computational speed during model training. Instead of storing all activations from the forward pass for gradient computation, which can consume significant memory, or discarding and recalculating them, which can slow down training, gradient checkpointing selectively saves certain activations. This means only a subset of activations need to be recalculated, optimizing both memory and computation resources.

This comes at the cost of [slowing down training by approximately 20%](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one).

In [124]:
training_args.gradient_checkpointing

False

#### Mixed precision

Mixed precision training is a method that enhances computational efficiency in model training by using lower-precision numerical formats for certain variables. While most models traditionally use 32-bit floating point precision (fp32), not all variables need this level of precision. By lowering the precision of some variables to formats like 16-bit floating point (fp16), computations can be sped up.

Typically in mixed precision training: 
* Activations are computed in half precision (fp16).
* Gradients are also computed in half precision, but they are converted back to full precision for the optimization step, so no memory is saved there.
* GPU memory usage can even increase, especially for small batch sizes, because the model is held on the GPU in both 16-bit and 32-bit precision.

Newer GPU architectures, like the Ampere architecture, offer the bf16 and tf32 data types. The traditional one is fp16.
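The practical difference between fp16 and bf16 is the trade-off between range and precision. The standard-library sketch below makes this visible: `struct` supports IEEE half precision directly (format `e`), and bf16 is emulated by keeping only the top 16 bits of a float32. The emulation truncates instead of rounding to nearest, so it is an approximation of what hardware does, and the helper names are made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision (fp16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bf16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_fp16(x: float) -> bool:
    """Return False when the value is too large to be encoded as fp16."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False


# Precision: fp16 keeps 10 mantissa bits, bf16 only 7.
print(to_fp16(1.001), to_bf16(1.001))

# Range: fp16 tops out at 65504, bf16 shares the exponent range of float32.
print(fits_in_fp16(65504.0), fits_in_fp16(70000.0), to_bf16(70000.0))
```

The wider range is the reason bf16 usually trains without the loss scaling that fp16 needs, at the price of coarser values.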

In [126]:
print(training_args.tf32)

None


In [127]:
print(training_args.bf16)

True


In [128]:
training_args.optim

<OptimizerNames.ADAMW_TORCH: 'adamw_torch'>

### Trainer

In [192]:
# trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

max_steps is given, it will override any value given in num_train_epochs


In [91]:
Trainer?

Init signature:
Trainer(
    model: Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None,
    args: transformers.training_args.TrainingArguments = None,
    data_collator: Optional[transformers.data.data_collator.DataCollator] = None,
    train_dataset: Union[torch.utils.data.d

In [194]:
trainer.accelerator.print(f"{trainer.model}")

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Starcoder2ForCausalLM(
      (model): Starcoder2Model(
        (embed_tokens): Embedding(49152, 4608)
        (layers): ModuleList(
          (0-31): 32 x Starcoder2DecoderLayer(
            (self_attn): Starcoder2FlashAttention2(
              (q_proj): Linear4bit(in_features=4608, out_features=4608, bias=True)
              (k_proj): Linear4bit(in_features=4608, out_features=512, bias=True)
              (v_proj): Linear4bit(in_features=4608, out_features=512, bias=True)
              (o_proj): Linear4bit(in_features=4608, out_features=4608, bias=True)
              (rotary_emb): Starcoder2RotaryEmbedding()
            )
            (mlp): Starcoder2MLP(
              (c_fc): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4608, out_features=18432, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
    

In [195]:
model_args.use_peft_lora

True

In [197]:
if model_args.use_peft_lora:
    trainer.model.print_trainable_parameters()

trainable params: 47,185,920 || all params: 7,221,109,760 || trainable%: 0.6534
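The trainable parameter count can be reproduced by hand. A LoRA adapter on a linear layer adds a matrix `A` of shape `(r, in_features)` and a matrix `B` of shape `(out_features, r)`. The sketch below assumes that adapters sit only on the two MLP projections of each of the 32 layers, and that `c_proj` maps 18432 features back to 4608; the printout is truncated before `c_proj`, so that shape is an assumption, although the result matches the reported number.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one adapter: A is (r, in), B is (out, r)."""
    return r * in_features + out_features * r


hidden_size = 4608          # from the model printout
intermediate_size = 18432   # from the c_fc layer in the printout
num_layers = 32
r = 32                      # lora_r

per_layer = (
    lora_params(hidden_size, intermediate_size, r)      # mlp.c_fc
    + lora_params(intermediate_size, hidden_size, r)    # mlp.c_proj (assumed shape)
)
trainable = per_layer * num_layers

all_params = 7_221_109_760  # "all params" as reported above
trainable_percent = round(100 * trainable / all_params, 4)

print(f"{trainable:,} trainable parameters ({trainable_percent}%)")
```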


##### loftq

When training with QLoRA on a quantized base model, it is worth considering LoftQ initialization, which has been shown to improve performance when combined with quantization. The underlying idea is to initialize the LoRA weights in a way that minimizes the quantization error.

In [198]:
model_args.use_loftq

False

In [199]:
# LoftQ initialization when using QLoRA
if model_args.use_4bit_quantization and model_args.use_loftq:
    loftq_init(trainer.model, tokenizer, train_dataset, data_args.max_seq_length, model_args)

In [4]:
from peft.utils.loftq_utils import loftq_init, replace_lora_weights_loftq
loftq_init?

Signature:
loftq_init(
    weight: 'Union[torch.Tensor, torch.nn.Parameter]',
    num_bits: 'int',
    reduced_rank: 'int',
    num_iter=1,
)
Docstring: <no docstring>
File:      ~/github/LLM-Workshop/personal_copilot/.copilot/lib/python3.11/site-packages/peft/utils/loftq_utils.py
Type:      function

In [5]:
replace_lora_weights_loftq?

Signature:
replace_lora_weights_loftq(
    peft_model,
    model_path: 'Optional[str]' = None,
    adapter_name: 'str' = 'default',
    callback: 'Optional[Callable[[torch.nn.Module, str], bool]]' = None,
)
Docstring:
Replace the LoRA weights of a model quantized with bitsandbytes, using the LoftQ technique.

The replacement is done on the fly by loading in the non-quantized weights from a locally stored safetensors model
file and initializing the LoRA weights such that the quantization error between the original and quantized weights
is minimized.

As lazy loading is not possible with pickle, nor

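If enabled, the `loftq_init` helper called in the training code calls `replace_lora_weights_loftq` to replace the LoRA weights with LoftQ-initialized weights. Judging by its arguments (model, tokenizer, dataset, sequence length, and model arguments), that helper is not the same function as `peft.utils.loftq_utils.loftq_init` inspected above, which operates on a single weight tensor.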

##### checkpoint

In [201]:
print(training_args.resume_from_checkpoint)

None


In [202]:
checkpoint = None
if training_args.resume_from_checkpoint is not None:
    checkpoint = training_args.resume_from_checkpoint

### Train

In [204]:
# trainer.train(resume_from_checkpoint=checkpoint)

### Save model result

In [205]:
trainer.is_fsdp_enabled

False

In [207]:
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

#trainer.save_model()