## Fine-tuning Gemma with LoRA (r=32)

In [1]:
# Based on this tutorial:
# https://medium.com/@bnjmn_marie/googles-gemma-fine-tuning-quantization-and-inference-on-your-computer-83066b25791b

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)

# Tokenizer preprocessing: reuse EOS as the padding token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

# Load the fine-tuning dataset
ds = load_dataset("timdettmers/openassistant-guanaco")

# Quantization config: 4-bit NF4 weights with double quantization
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map={"": 0}
)

# Prepare the quantized model for k-bit (QLoRA) training
model = prepare_model_for_kbit_training(model)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False  # disable the KV cache while training

training_arguments = TrainingArguments(
        output_dir="./results_qlora",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_steps=50,
        logging_steps=50,
        learning_rate=2e-5,
        eval_steps=50,
        max_steps=100,
        warmup_steps=30,
        lr_scheduler_type="linear",
)

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=32,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()
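
The `TrainingArguments` above combine `learning_rate=2e-5`, `warmup_steps=30`, `max_steps=100` and `lr_scheduler_type="linear"`. The sketch below shows the learning rate this implies at a given step, assuming the usual linear-warmup-then-linear-decay rule. The function name `linear_warmup_lr` is hypothetical and written here for illustration only; it is not a `transformers` API.

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=30, total_steps=100):
    """Learning rate after `step` optimizer updates.

    Assumes linear warmup from 0 to base_lr over `warmup_steps`,
    then linear decay back to 0 at `total_steps`.
    """
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


# Print the schedule at a few steps of the 100-step run
for step in (0, 15, 30, 50, 100):
    print(step, linear_warmup_lr(step))
```

With only 100 steps, almost a third of the run is spent warming up, so the peak rate of 2e-5 is reached only once, at step 30.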




Repo card metadata block was not found. Setting CardData to empty.



Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.



max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 100
  Number of trainable parameters = 39,223,296


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 2.0995        | 2.026574        |
| 100  | 1.9178        | 1.953279        |


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./results_qlora/checkpoint-50
loading configuration file config.json from cache at /home/jn2814/.cache/huggingface/hub/models--google--gemma-2b/snapshots/2ac59a5d7bf4e1425010f0d457dde7d146658953/config.json
Model config GemmaConfig {
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_activation": null,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.2",
  "use_cache": true,
  "vocab_size": 256000
}
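
The shapes in the `GemmaConfig` above, together with `r=32` and the seven `target_modules` from the `LoraConfig`, account for the `Number of trainable parameters = 39,223,296` line in the training log. A minimal sketch of that arithmetic, assuming each targeted projection is a bias-free linear layer and that LoRA adds one `r x in_features` and one `out_features x r` matrix per layer (the helper `lora_params` is hypothetical):

```python
# Values copied from the GemmaConfig and LoraConfig in this notebook
hidden_size = 2048
intermediate_size = 16384
num_attention_heads = 8
num_key_value_heads = 1
head_dim = 256
num_hidden_layers = 18
r = 32

q_out = num_attention_heads * head_dim    # 2048
kv_out = num_key_value_heads * head_dim   # 256 (a single key/value head)

# (in_features, out_features) of every targeted linear layer in one decoder block
target_shapes = {
    "q_proj": (hidden_size, q_out),
    "k_proj": (hidden_size, kv_out),
    "v_proj": (hidden_size, kv_out),
    "o_proj": (q_out, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
total = per_layer * num_hidden_layers
print(per_layer, total)
```

The three MLP projections dominate: each contributes 589,824 parameters per layer, compared with 131,072 for `q_proj` or `o_proj` and 73,728 for `k_proj` or `v_proj`.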

tokenizer config fil

TrainOutput(global_step=100, training_loss=2.0086538696289065, metrics={'train_runtime': 757.6556, 'train_samples_per_second': 0.528, 'train_steps_per_second': 0.132, 'total_flos': 2092234534600704.0, 'train_loss': 2.0086538696289065, 'epoch': 0.04061738424045491})
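
The `epoch` and throughput figures in `TrainOutput` follow from the numbers already in the log (9,846 training examples, batch size 4, 100 steps, about 757.66 s of runtime). A short sketch of the arithmetic, assuming the final partial batch of each epoch is kept rather than dropped:

```python
import math

# Values copied from the training log and TrainOutput above
num_examples = 9846
batch_size = 4
max_steps = 100
train_runtime = 757.6556  # seconds

# Assumption: the last, smaller batch counts as a step (no drop_last)
steps_per_epoch = math.ceil(num_examples / batch_size)

epoch_fraction = max_steps / steps_per_epoch
samples_per_second = max_steps * batch_size / train_runtime
steps_per_second = max_steps / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"epoch fraction:     {epoch_fraction:.6f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

In other words, the run saw 400 of the 9,846 examples, roughly 4% of a single epoch, so the losses reported above reflect a very short fine-tune.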

In [2]:
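m_name = 'gemma-lora-r32'
# Note: this saves and uploads the full model (model.safetensors, see the log below).
# To store only the LoRA adapter weights, trainer.model.save_pretrained(m_name)
# could be used instead.
model.save_pretrained(m_name)
tokenizer.save_pretrained(m_name)
model.push_to_hub(m_name)
tokenizer.push_to_hub(m_name)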

Configuration saved in gemma-lora-r32/config.json
Configuration saved in gemma-lora-r32/generation_config.json
Model weights saved in gemma-lora-r32/model.safetensors
tokenizer config file saved in gemma-lora-r32/tokenizer_config.json
Special tokens file saved in gemma-lora-r32/special_tokens_map.json
Configuration saved in gemma-lora-r32/config.json
Configuration saved in gemma-lora-r32/generation_config.json
Model weights saved in gemma-lora-r32/model.safetensors
Uploading the following files to jn2814/gemma-lora-r32: README.md,generation_config.json,config.json,model.safetensors


model.safetensors:   0%|          | 0.00/3.28G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

tokenizer config file saved in gemma-lora-r32/tokenizer_config.json
Special tokens file saved in gemma-lora-r32/special_tokens_map.json
Uploading the following files to jn2814/gemma-lora-r32: tokenizer.json,README.md,tokenizer_config.json,tokenizer.model,special_tokens_map.json


Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/jn2814/gemma-lora-r32/commit/a6d6b1da75f4c605ee952f6a56ad9312f455de9d', commit_message='Upload tokenizer', commit_description='', oid='a6d6b1da75f4c605ee952f6a56ad9312f455de9d', pr_url=None, pr_revision=None, pr_num=None)

In [3]:
!lm_eval --model hf --model_args pretrained=gemma-lora-r32 --tasks winogrande,arc_challenge --device cuda:0 --num_fewshot 1 --batch_size 2 --output_path ./eval_harness/gemma-lora-r32

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2024-05-08:17:55:54,159 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-08:17:56:00,206 INFO     [__main__.py:341] Selected Tasks: ['arc_challenge', 'winogrande']
2024-05-08:17:56:00,208 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-08:17:56:00,208 INFO     [evaluator.py:178] Initializing hf model, with arguments: {'pretrained': 'gemma-lora-r32'}
2024-05-08:17:56:00,226 INFO     [huggingface.py:165] Using device 'cuda:0'
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more d
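
The `huggingface/tokenizers` warning at the top of this output appears because the `!lm_eval` shell command forks the kernel process after the tokenizer has already been used. The warning itself suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell before the first `!` command or `%memit` call:

```python
import os

# Disable tokenizers' internal thread pool so that forking the kernel
# (shell escapes, memory_profiler) no longer triggers the deadlock warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the value to `"true"` instead keeps parallel tokenization but leaves the fork caveat described in the warning.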

In [4]:
%load_ext memory_profiler
import time

prompt = "Will AI take over the world?"

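# Tokenize first so that the timer covers generation only
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

start_time = time.perf_counter()
# %memit reports the memory of the Python process while generate() runs;
# its profiling overhead is included in the measured time
%memit outputs = model.generate(**inputs, do_sample=True, max_new_tokens=150)
inference_time = time.perf_counter() - start_time
print("Inference time:", inference_time)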

result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(result)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


peak memory: 6871.66 MiB, increment: 2.63 MiB
Inference time: 12.290580749511719
Will AI take over the world?2020年11月10日 星期日，北京时间06:06:55

As more technology companies continue to pour billions into research into human-like AIs, how could the coming generation of AI’s impact on the course of human history play out?

点击查看全文

What are AI chips?  What are the benefits of AI chips? And how do AI chips work? As you know, AI (artificial intelligence) has been around for decades now; still the hype around AI is on an all time high. The more time that we all spend in social media, the more we hear about AI. But the more we wonder about what exactly are AI chips. And what are the
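
The generation above drifts into unrelated web text. One likely contributor is that the prompt is passed as raw text, whereas the `timdettmers/openassistant-guanaco` examples used for fine-tuning are, as far as I know, formatted as `### Human: ...### Assistant: ...`. The tokenizer was also created with `add_eos_token=True`, which may append an end-of-sequence token to the prompt before generation. The sketch below wraps a question in that template; the helper name `format_guanaco_prompt` is hypothetical and the exact template is an assumption that should be checked against `ds['train'][0]['text']`.

```python
def format_guanaco_prompt(user_message: str) -> str:
    """Wrap a user message in the assumed Guanaco-style chat template."""
    return f"### Human: {user_message}### Assistant:"


prompt = format_guanaco_prompt("Will AI take over the world?")
print(prompt)
```

The resulting `prompt` can be passed to `tokenizer(...)` and `model.generate(...)` exactly as in the cell above. After only 100 training steps the effect may still be small.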
