# V 0.0

This is the next "version" of [simple finetuning](https://www.kaggle.com/code/yannchikk/bart-large-cnn-dialoguesum-booksum-full-finetuning). Here I try to train bart-large-cnn with the LoRA fine-tuning method, using the PEFT library.

I am going to train one LoRA adapter on processed text and one on text without processing, and then make a Model Soup from these two versions.
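
To make the Model Soup idea concrete: a uniform soup simply averages parameters that share a name across checkpoints. The sketch below is only an illustration with toy NumPy arrays standing in for two adapters. The names `uniform_soup`, `adapter_raw` and `adapter_processed` are made up for this example and are not part of this notebook.

```python
import numpy as np

def uniform_soup(state_dicts):
    """Average parameters with the same name across several checkpoints."""
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(state_dicts)
    # The same expression also works when the values are torch tensors
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}

# Toy stand-ins for two LoRA adapters (hypothetical values)
adapter_raw = {
    "lora_A": np.array([[1.0, 2.0]]),
    "lora_B": np.array([[0.0], [4.0]]),
}
adapter_processed = {
    "lora_A": np.array([[3.0, 2.0]]),
    "lora_B": np.array([[2.0], [0.0]]),
}

soup = uniform_soup([adapter_raw, adapter_processed])
print(soup["lora_A"])
print(soup["lora_B"])
```

One caveat to keep in mind: averaging the `lora_A` and `lora_B` matrices separately is in general not the same as averaging the products `B @ A`, so the soup of two adapters is an approximation of the averaged weight update.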

In this notebook: 

- ### Train the original '[bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)' on the dataset without text processing plus the preprocessed dataset

- ### Use the peft library to add LoRA layers to the model

- ### Fine-tune only the 'target_modules'

- ### Use DataParallel to speed up training

- ### Custom checkpointing

In [1]:
class Config:
    
    max_length = 1024
    target_max_length = 512

    epochs = 6
    
    batch_size = 8

#     model_preset_trained = "doublecringe123/bardt-large-cnn-dialoguesum-booksum"
    
    try: 
        model_preset = model_preset_trained 
    except: 
        model_preset = "facebook/bart-large-cnn"
    
    
    lora_params = {
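        # Note: 'cf1' and 'cf2' match no module in BART (its feed-forward layers are named 'fc1' and 'fc2'),
        # so in this run only the attention projections receive LoRA layers.
        'target_modules':['out_proj', 'v_proj', 'q_proj', 'cf1', 'cf2'], 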
        'r':8, 
        'lora_alpha': 16, 
    }
    
    save_frecuency = 2

    inp = 'input_content'
    target = 'target'

cfg = Config()
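
A note on the `try`/`except` in `Config`: since `model_preset_trained` is commented out, looking it up raises a `NameError`, and the fallback to `"facebook/bart-large-cnn"` is taken. The same intent can be written without catching exceptions. This is only a sketch; `resolve_preset` and `ExampleConfig` are hypothetical names, not part of the notebook.

```python
DEFAULT_PRESET = "facebook/bart-large-cnn"

def resolve_preset(trained_preset=None, default=DEFAULT_PRESET):
    """Return the fine-tuned checkpoint name if one is given, else the base model."""
    return trained_preset or default

class ExampleConfig:  # hypothetical stand-in for Config
    # Pass a checkpoint name here to continue from a previously trained model
    model_preset = resolve_preset()

print(ExampleConfig.model_preset)
print(resolve_preset("some-user/some-finetuned-checkpoint"))  # placeholder name
```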

# First, let's load the datasets



In [2]:
! pip install -q --upgrade pip
! pip install -q transformers[torch]
! pip install -q -U transformers==4.38.2 datasets==2.18.0 evaluate rouge_score

from kaggle_secrets import UserSecretsClient
secret_label = "HF_TOKEN"
secret_value = UserSecretsClient().get_secret(secret_label)

import os
os.environ["HF_TOKEN"] = secret_value

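try: 
    import wandb
    wandb.init(mode='disabled')  # keep Weights & Biases logging switched off
except ImportError: 
    pass  # wandb is not installed, so there is nothing to disable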

This is my own code from this [GitHub repo](https://github.com/goin2crazy/multy-dataset/blob/main/main.py) 

In [3]:
! wget -O "mds.py" "https://raw.githubusercontent.com/goin2crazy/multy-dataset/main/main.py" 

--2024-04-13 15:31:52--  https://raw.githubusercontent.com/goin2crazy/multy-dataset/main/main.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2333 (2.3K) [text/plain]
Saving to: 'mds.py'


2024-04-13 15:31:53 (23.5 MB/s) - 'mds.py' saved [2333/2333]



# Prepare Dataset

In [4]:
from mds import NewDataset
dataset_params = {
    "knkarthick/dialogsum": ("dialogue", "summary"), 
    "doublecringe123/dialoguesum-npc-dialoguesum-stemmed-augmented": ('inp', 'target')
}

dataset = NewDataset(dataset_params, input_col_name = cfg.inp, target_col_name = cfg.target)

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 11.3M/11.3M [00:01<00:00, 7.75MB/s]
Downloading data: 100%|██████████| 442k/442k [00:00<00:00, 1.99MB/s]
Downloading data: 100%|██████████| 1.35M/1.35M [00:01<00:00, 1.12MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Downloading readme:   0%|          | 0.00/2.13k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 23.5M/23.5M [00:02<00:00, 11.3MB/s]
Downloading data: 100%|██████████| 1.18M/1.18M [00:00<00:00, 2.73MB/s]
Downloading data: 100%|██████████| 2.28M/2.28M [00:00<00:00, 5.04MB/s]


Generating train split:   0%|          | 0/59070 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7000 [00:00<?, ? examples/s]
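
The source of `NewDataset` is not shown in this notebook, so the following is only my guess at the idea behind it: take several datasets, rename each one's input and target columns to common names, and concatenate the rows. The function `merge_datasets` and the toy rows are hypothetical and written in plain Python, not with the real `datasets` API.

```python
def merge_datasets(sources, column_map, input_name="input_content", target_name="target"):
    """Rename each source's (input, target) columns to shared names and concatenate."""
    merged = []
    for name, rows in sources.items():
        in_col, tgt_col = column_map[name]
        for row in rows:
            merged.append({input_name: row[in_col], target_name: row[tgt_col]})
    return merged

# Toy rows (made up) using the two column layouts from dataset_params
sources = {
    "dataset_a": [{"dialogue": "A: hi\nB: hello", "summary": "A greets B."}],
    "dataset_b": [{"inp": "a: bye b: bye", "target": "They say goodbye."}],
}
column_map = {
    "dataset_a": ("dialogue", "summary"),
    "dataset_b": ("inp", "target"),
}

train = merge_datasets(sources, column_map)
print(train)
```

The sizes printed later by `map` (71530 / 3500 / 8500) are consistent with a plain concatenation: subtracting the second dataset's 59070 / 3000 / 7000 examples leaves 12460 / 500 / 1500 for the first one.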

# Load model and tokenizer

In [5]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(cfg.model_preset)
model = AutoModelForSeq2SeqLM.from_pretrained(cfg.model_preset)

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

# Creating LoRA Model with PEFT

PEFT (parameter-efficient fine-tuning) refers to a group of techniques that enable efficient adaptation of large language models (LLMs) to specific tasks or domains. It fine-tunes only a small subset of parameters rather than modifying the entire model, so far fewer weights have to be trained and stored.

I learned how to build PEFT and LoRA models from [this notebook](https://www.kaggle.com/code/ajinkyabhandare2002/fine-tune-flan-t5-base-for-chat-with-peft-lora#Setup-the-PEFT/LoRA-model-for-Fine-Tuning)
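
As a rough illustration of what a LoRA layer computes: the pretrained weight `W` stays frozen, and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added on top. The NumPy sketch below is not PEFT's implementation. The layer size and the initialization scale are assumptions chosen for the example; only `r` and `lora_alpha` are taken from `Config`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 64          # toy layer size (assumption)
r, lora_alpha = 8, 16         # same values as in Config.lora_params

W = rng.normal(size=(d_out, d_in))               # frozen pretrained weight
A = rng.normal(scale=1.0 / r, size=(r, d_in))    # trainable, random init
B = np.zeros((d_out, r))                         # trainable, zero init

def lora_forward(x, W, A, B, r, lora_alpha):
    """Base projection plus the scaled low-rank update."""
    scale = lora_alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, r, lora_alpha)

# With B initialized to zero the adapted layer starts out identical to the base layer
print(np.allclose(y, x @ W.T))
# Trainable parameters (A and B) versus the full weight matrix
print(A.size + B.size, W.size)
```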

In [6]:
! pip install -q peft

In [7]:
from peft import LoraConfig, get_peft_model, TaskType

lora_conf = LoraConfig(
    **cfg.lora_params, 
    lora_dropout = 0.05,
    bias = 'none', 
    task_type = TaskType.SEQ_2_SEQ_LM,  # BART is an encoder-decoder (seq2seq) model
    init_lora_weights = 'gaussian', 
)

In [8]:
lora_model = get_peft_model(model=model, peft_config=lora_conf)

lora_model.print_trainable_parameters()

trainable params: 1,769,472 || all params: 408,059,904 || trainable%: 0.4336304504938569
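
The printed number can be checked by hand. bart-large has a hidden size of 1024 with 12 encoder and 12 decoder layers, and every matched projection gets two LoRA matrices of shape `r x 1024` and `1024 x r`. The count below matches the output only if the attention projections alone are adapted, which fits the fact that the 'cf1' and 'cf2' entries in `target_modules` do not correspond to any BART module name (the feed-forward layers are called `fc1` and `fc2`). The last part of the sketch is a hypothetical estimate of what including `fc1`/`fc2` would add.

```python
d_model = 1024
ffn_dim = 4096
encoder_layers = decoder_layers = 12
r = 8

# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = encoder_layers * 1 + decoder_layers * 2
lora_modules = attention_blocks * 3            # q_proj, v_proj, out_proj
params_per_module = r * d_model + d_model * r  # lora_A plus lora_B
trainable = lora_modules * params_per_module

all_params = 408_059_904                       # taken from the printed output
base_params = all_params - trainable
print(trainable, base_params, 100 * trainable / all_params)

# Hypothetical: extra trainable parameters if fc1 and fc2 were matched as well
ffn_extra = (encoder_layers + decoder_layers) * 2 * r * (d_model + ffn_dim)
print(ffn_extra, trainable + ffn_extra)
```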


In [9]:
from torch import nn
import torch 

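device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Note: the Trainer also wraps the model in nn.DataParallel on its own when several GPUs are visible.
lora_model.model = nn.DataParallel(lora_model.model)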

In [10]:
def preprocess_function(examples):
    try: 
        inputs = [doc for doc in examples[cfg.inp]]
        # Tokenize the source texts, truncated to the encoder limit
        model_inputs = tokenizer(inputs, max_length=cfg.max_length, truncation=True)

        # Tokenize the reference summaries as targets
        labels = tokenizer(text_target=examples[cfg.target], max_length=cfg.target_max_length, truncation=True)

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    except TypeError as e:
        # Show the offending batch, then re-raise instead of silently returning None
        print(e)
        print(examples[cfg.inp])
        raise

dataset = dataset.map(preprocess_function, batched = True)
tokenized_train, tokenized_val, tokenized_test = dataset.splits

Map:   0%|          | 0/71530 [00:00<?, ? examples/s]

Map:   0%|          | 0/3500 [00:00<?, ? examples/s]

Map:   0%|          | 0/8500 [00:00<?, ? examples/s]

In [11]:
from transformers import DataCollatorForSeq2Seq

# Pass the model object (not its name) so the collator can build decoder_input_ids from the labels
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

2024-04-13 15:34:04.110462: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-13 15:34:04.110563: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-13 15:34:04.209763: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
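
What the collator does to a batch can be sketched in plain Python: inputs are padded with the tokenizer's pad token, while labels are padded with -100 so that the padded positions are ignored by the loss. This is a simplified illustration, not the `DataCollatorForSeq2Seq` code. The helper `pad_batch` and the token ids are made up, and the pad id of 1 is BART's pad token id.

```python
LABEL_PAD = -100  # default label padding value, ignored by the loss

def pad_batch(input_ids, labels, pad_token_id):
    """Right-pad inputs with pad_token_id and labels with -100 to the batch maximum."""
    max_in = max(len(x) for x in input_ids)
    max_lab = max(len(y) for y in labels)
    return {
        "input_ids": [x + [pad_token_id] * (max_in - len(x)) for x in input_ids],
        "attention_mask": [[1] * len(x) + [0] * (max_in - len(x)) for x in input_ids],
        "labels": [y + [LABEL_PAD] * (max_lab - len(y)) for y in labels],
    }

batch = pad_batch(
    input_ids=[[0, 11, 12, 2], [0, 13, 2]],
    labels=[[0, 21, 2], [0, 22, 23, 24, 2]],
    pad_token_id=1,
)
for key, value in batch.items():
    print(key, value)
```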


# Define Metrics

In [12]:
import evaluate

rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [13]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
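
The `np.where` step in `compute_metrics` is needed because -100 is not a valid token id and cannot be decoded, so it is swapped back to the pad token before `batch_decode`. A small example with made-up token ids (assuming a pad id of 1, as in BART) shows the effect, together with the same non-pad counting that is used for `gen_len`:

```python
import numpy as np

pad_token_id = 1
labels = np.array([
    [0, 21, 2, -100, -100],
    [0, 22, 23, 24, 2],
])

# Replace the ignore index with the pad id so the rows can be decoded
restored = np.where(labels != -100, labels, pad_token_id)
print(restored)

# Count the non-pad tokens per row, as done for the predictions in compute_metrics
lengths = [int(np.count_nonzero(row != pad_token_id)) for row in restored]
print(lengths, np.mean(lengths))
```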

# Define training arguments

In [14]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

In [15]:
eps = cfg.epochs // cfg.save_frecuency

for i in range(eps): 
    i += 1
    
    print(f"{i}/{eps} Training Initiallization...")
    training_args = Seq2SeqTrainingArguments(
        output_dir="bardt-large-cnn-dialoguesum-booksum-lora",
        evaluation_strategy="epoch",
        save_strategy='no',
    #     save_safetensors = True,
    #     save_steps = 100, 
        learning_rate=2e-5,
        per_device_train_batch_size=cfg.batch_size,
        per_device_eval_batch_size=cfg.batch_size,
        weight_decay=0.01,
        num_train_epochs=cfg.save_frecuency,
        predict_with_generate=True,

        fp16=True,
    )

    trainer = Seq2SeqTrainer(
        model=lora_model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    print(f"{i}/{eps} start Training...")    
    trainer.train()
    
    print(f"{i}/{eps} Saving model...")    
    lora_model.save_pretrained("bardt-large-cnn-dialoguesum-booksum-lora")
    lora_model.push_to_hub("bardt-large-cnn-dialoguesum-booksum-lora", commit_message = f"Original+Augmented+Stemmed Dataset, {i * cfg.save_frecuency} epochs")

1/3 Training Initiallization...
1/3 start Training...


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.3452 | 1.400197 | 0.3924 | 0.1854 | 0.3018 | 0.3018 | 62.5623 |
| 2 | 1.2936 | 1.359301 | 0.3975 | 0.19 | 0.3068 | 0.3068 | 62.2106 |


1/3 Saving model...


adapter_model.safetensors:   0%|          | 0.00/14.2M [00:00<?, ?B/s]

2/3 Training Initiallization...
2/3 start Training...


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.2593 | 1.343348 | 0.3985 | 0.1911 | 0.3065 | 0.3064 | 62.006 |
| 2 | 1.2305 | 1.320177 | 0.3991 | 0.1912 | 0.3075 | 0.3073 | 61.6114 |


2/3 Saving model...


README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/14.2M [00:00<?, ?B/s]

3/3 Training Initiallization...
3/3 start Training...


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.2125 | 1.320761 | 0.399 | 0.1922 | 0.3082 | 0.308 | 61.5617 |
| 2 | 1.1942 | 1.302516 | 0.4017 | 0.1944 | 0.3098 | 0.3097 | 61.2103 |


3/3 Saving model...


adapter_model.safetensors:   0%|          | 0.00/14.2M [00:00<?, ?B/s]
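
The custom checkpointing above boils down to splitting the total number of epochs into rounds of `save_frecuency` epochs and saving after each round. The sketch below reproduces only that arithmetic. `checkpoint_schedule` is a hypothetical helper, not something the notebook defines.

```python
def checkpoint_schedule(epochs, save_frequency):
    """Return (round_number, epochs_trained_so_far) for every save point."""
    rounds = epochs // save_frequency
    return [(i, i * save_frequency) for i in range(1, rounds + 1)]

# The configuration used in this notebook: 6 epochs, saving every 2
schedule = checkpoint_schedule(6, 2)
print(schedule)

# If epochs is not a multiple of the save frequency, the remainder is never trained
leftover = 7 - checkpoint_schedule(7, 2)[-1][1]
print(leftover)
```

Because a new `Seq2SeqTrainer` is created in every round, the optimizer state and the learning-rate schedule start over each time. That is one possible difference from a single uninterrupted 6-epoch run.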