In [1]:
!pip install accelerate peft trl datasets bitsandbytes auto-gptq optimum -q

In [2]:
import torch

if torch.cuda.is_available():
    device = torch.cuda.get_device_name()
    print(f"CUDA device: {device}")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("CUDA is not available on this system.")

CUDA device: Tesla T4
CUDA version: 11.8


In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig, TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer



In [4]:
dataset = load_dataset("DR-DRR/Medical_Customer_care", split='train')
dataset['text'][0]

Downloading and preparing dataset text/DR-DRR--Medical_Customer_care to /root/.cache/huggingface/datasets/text/DR-DRR--Medical_Customer_care-e19aadc132a87f29/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/DR-DRR--Medical_Customer_care-e19aadc132a87f29/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.


"<s>[INST] Explain the importance of staying hydrated and its benefits for overall health. [/INST] Staying hydrated is crucial for maintaining your overall health. It helps regulate your body temperature, keeps your joints functioning smoothly, aids in digestion, and flushes out toxins. When you're properly hydrated, you'll notice improvements in your skin's appearance and feel more energetic. So, remember to drink water regularly throughout the day to reap these health benefits and keep dehydration symptoms like a dry mouth and dizziness at bay.</s>"
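The sample above follows the Llama-2 chat layout: `<s>[INST] prompt [/INST] answer</s>`. A minimal sketch of producing that layout from a raw instruction/response pair is shown below. The helper name `format_llama2_example` and the sample strings are hypothetical and are not part of the dataset or of any library.

```python
def format_llama2_example(instruction: str, response: str) -> str:
    """Wrap one instruction/response pair in the Llama-2 chat markers.

    Hypothetical helper: it mirrors the layout of the dataset row printed
    above, but it is not how the dataset itself was built.
    """
    return f"<s>[INST] {instruction.strip()} [/INST] {response.strip()}</s>"


sample = format_llama2_example(
    "Explain the importance of staying hydrated.",
    "Staying hydrated helps regulate body temperature.",
)
print(sample)
```

Rows prepared this way can be passed to `SFTTrainer` through a single `text` column, which is the column name this notebook uses.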

In [5]:
model_ckpt = "TheBloke/Llama-2-7b-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt
)
tokenizer.pad_token = tokenizer.eos_token


In [6]:
# use_exllama=False replaces the deprecated disable_exllama=True (removed in transformers 4.37).
quantization_config = GPTQConfig(bits=4, use_exllama=False, tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    revision='main',
    quantization_config=quantization_config,
    device_map='auto')
model.config.use_cache = False  # the generation cache is not needed while training
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)




You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. use_exllama, exllama_config, use_cuda_fp16, max_input_length) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.



In [7]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    # Attach adapters to every attention and MLP projection.
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
model = get_peft_model(model, lora_config)
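To get a feel for how many parameters this LoRA configuration trains, the count can be worked out by hand: each adapted linear layer gains `r * (in_features + out_features)` weights. The sketch below assumes the usual Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers), which are not stated anywhere in this notebook.

```python
# Assumed Llama-2-7B shapes; they are not printed anywhere in this notebook.
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32
RANK = 16  # matches r=16 in the LoraConfig above

# (in_features, out_features) for each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")
print(f"approximate fp32 size: {total * 4 / 1e6:.1f} MB")
```

Under these assumptions the adapter holds roughly 40 million parameters, or about 160 MB in fp32, which is in line with the size of `adapter_model.safetensors` in the upload log further down.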

In [8]:
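training_args = TrainingArguments(
    output_dir='.',
    dataloader_drop_last=True,
    save_strategy='epoch',
    num_train_epochs=1,   # ignored in practice: max_steps takes precedence
    logging_steps=100,
    max_steps=2000,       # training stops after 2000 optimizer steps
    per_device_train_batch_size=1,
    learning_rate=3e-4,
    lr_scheduler_type='cosine',
    warmup_steps=100,
    fp16=True,
    #gradient_accumulation_steps=2,
    weight_decay=0.05,
    # Note: None does not switch logging off here (wandb still asks for a key
    # during training); pass report_to='none' to disable all integrations.
    report_to=None,
    run_name='finetuning-llama2-chat-7b')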
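With `lr_scheduler_type='cosine'`, `warmup_steps=100` and `max_steps=2000`, the learning rate ramps up linearly for 100 steps and then decays along half a cosine wave. The sketch below reimplements that shape in plain Python for illustration. It is an approximation of the schedule's behaviour, not the code the trainer runs, and `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(step: int, base_lr: float = 3e-4,
              warmup_steps: int = 100, total_steps: int = 2000) -> float:
    """Linear warmup followed by a half-cosine decay to zero (illustrative)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Print the learning rate at a few points of the 2000-step run.
for step in (0, 50, 100, 1050, 2000):
    print(f"step {step:>4}: lr = {cosine_lr(step):.2e}")
```

The peak value of `3e-4` is reached at step 100, half of it remains at the midpoint of the decay (step 1050), and the rate reaches zero at the final step.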

In [9]:
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field='text',
    max_seq_length=1024,
    tokenizer=tokenizer,
    packing=False)

  0%|          | 0/208 [00:00<?, ?ba/s]



In [10]:
trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,2.5988
200,2.412
300,2.375
400,2.3531
500,2.3965
600,2.4129
700,2.3585
800,2.3211
900,2.3114
1000,2.3381


TrainOutput(global_step=2000, training_loss=2.3137620620727537, metrics={'train_runtime': 4515.1375, 'train_samples_per_second': 0.443, 'train_steps_per_second': 0.443, 'total_flos': 492945782784000.0, 'train_loss': 2.3137620620727537, 'epoch': 0.01})
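The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The throughput follows directly from the reported runtime and step count. The epoch fraction also needs the dataset size, which the notebook never prints: the value below is a guess based on the `0/208` tokenization progress bar, assuming it counts batches of 1,000 rows.

```python
train_runtime = 4515.1375  # seconds, from TrainOutput above
global_step = 2000         # from TrainOutput above
batch_size = 1             # per_device_train_batch_size, single GPU

steps_per_second = global_step / train_runtime
print(f"steps per second: {steps_per_second:.3f}")

# ASSUMPTION: about 208,000 rows, inferred from the 208-batch progress bar.
# The true dataset size is not shown in this notebook.
assumed_num_examples = 208_000
epoch_fraction = global_step * batch_size / assumed_num_examples
print(f"fraction of one epoch seen: {epoch_fraction:.2f}")
```

If that size estimate is roughly right, the run saw only about 1% of the data, which matches the reported `'epoch': 0.01` and is worth keeping in mind when judging the loss curve.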

In [11]:
from huggingface_hub import notebook_login, HfApi

notebook_login()


In [12]:
api = HfApi()

api.upload_folder(
    folder_path='/working/checkpoint-2000',
    path_in_repo=".",
    repo_id="Neupane9Sujal/llama-gptq-medical-finetuned-chatbot",
    repo_type='model',
    create_pr=True
)

rng_state.pth:   0%|          | 0.00/14.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

scheduler.pt:   0%|          | 0.00/627 [00:00<?, ?B/s]

optimizer.pt:   0%|          | 0.00/320M [00:00<?, ?B/s]

Upload 6 LFS files:   0%|          | 0/6 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/160M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.16k [00:00<?, ?B/s]

'https://huggingface.co/Neupane9Sujal/llama-gptq-medical-finetuned-chatbot/tree/refs%2Fpr%2F1/.'

## Inference

In [13]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    revision='main',
    quantization_config=quantization_config,
    device_map='auto')

You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. use_exllama, exllama_config, use_cuda_fp16, max_input_length) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.


In [14]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "/working/checkpoint-2000")

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt, add_bos_token=True, trust_remote_code=True)

In [16]:
eval_prompt = "[INST] Discuss the role of exercise in maintaining a healthy weight and its impact on overall well-being. [/INST]"
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    output_ids = ft_model.generate(**model_input, max_new_tokens=100, repetition_penalty=1.15)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))



[INST] Discuss the role of exercise in maintaining a healthy weight and its impact on overall well-being. [/INST] Exercise is an important component of any weight loss program, as it helps to burn calories and build muscle mass. Regular physical activity can also help to improve cardiovascular function, reduce stress levels, boost mood, and enhance sleep quality. In addition, regular exercise has been shown to have numerous other benefits for overall health and wellness, including reducing blood pressure, improving insulin sensitivity, and lowering cholesterol levels. While di
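The decoded text echoes the prompt before the answer, and the answer stops mid-word because generation is capped at `max_new_tokens=100`. A small post-processing step can strip the prompt. The helper below is a hypothetical sketch that simply splits on the last `[/INST]` marker, and the example string is shortened for illustration.

```python
def extract_response(decoded: str, marker: str = "[/INST]") -> str:
    """Return the text after the last instruction marker (hypothetical helper)."""
    _, found, tail = decoded.rpartition(marker)
    # If the marker is missing, fall back to the whole string.
    return tail.strip() if found else decoded.strip()


decoded_example = (
    "[INST] Discuss the role of exercise in maintaining a healthy weight. [/INST] "
    "Exercise is an important component of any weight loss program."
)
print(extract_response(decoded_example))
```

Raising `max_new_tokens` in the `generate` call is the usual way to avoid the truncated ending, at the cost of longer generation time.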

