In [1]:
%%capture
!mamba install --force-reinstall aiohttp -y
!pip install -U "xformers<0.0.26" --index-url https://download.pytorch.org/whl/cu121
!pip install "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"

# Temporary fix for https://github.com/huggingface/datasets/issues/6753
!pip install datasets==2.16.0 fsspec==2023.10.0 gcsfs==2023.10.0

import os
os.environ["WANDB_DISABLED"] = "true"

In [2]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from transformers import TextStreamer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.



In [3]:
max_seq_length = 2048 # Any length works; Unsloth handles RoPE scaling internally.
dtype = None # None auto-detects: float16 on Tesla T4/V100, bfloat16 on Ampere or newer.
load_in_4bit = True # 4-bit quantization reduces memory usage. Can be False.

# Pre-quantized 4-bit models download faster and help avoid out-of-memory errors.
fourbit_models = [
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Switch to inference mode so the base model can be queried before finetuning.
FastLanguageModel.for_inference(model)

==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
# Attach LoRA adapters (rank 16) to the attention and MLP projection layers of the model.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj","up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Unsloth's gradient checkpointing mode, reduces memory use
    random_state = 1,
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
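
The LoRA configuration above adds two small matrices to each targeted projection: A with shape r x in_features and B with shape out_features x r, so each adapted layer contributes r * (in_features + out_features) trainable weights. The sketch below counts them under assumed Llama-3-8B layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024). These shapes are not printed anywhere in this notebook and are assumptions; under them the total matches the number of trainable parameters reported in the training log further down.

```python
# Assumed Llama-3-8B layer shapes (not shown in this notebook).
HIDDEN = 4096         # model width
INTERMEDIATE = 14336  # MLP width
KV_DIM = 1024         # key/value projection width (8 KV heads * 128 dims)
NUM_LAYERS = 32       # matches "patched 32 layers" in the output above

r = 16
lora_alpha = 16

# (in_features, out_features) of every linear layer listed in target_modules
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # A is (rank x in_features), B is (out_features x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
scaling = lora_alpha / r  # standard LoRA scales the update B @ A by alpha / r

print(f"{per_layer:,} per layer, {total:,} in total, scaling = {scaling}")
# prints: 1,310,720 per layer, 41,943,040 in total, scaling = 1.0
```

With r = 16 and lora_alpha = 16 the scaling factor is 1, so the low-rank update is added to the frozen weights without extra rescaling.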


In [5]:
# Load OpenAI's grade-school-math dataset from Cobbe et al., 'Training Verifiers to Solve Math Word Problems', 2021.
dataset = load_dataset("qwedsacf/grade-school-math-instructions", split = "train")
print(f"The dataset has {len(dataset)} entries.")

dataset = dataset.select(range(500)) # Keep only the first 500 examples
print(f"We keep {len(dataset)} entries for this homework.")

print(f"This is what a dataset entry looks like: {dataset[0]}")

The dataset has 8792 entries.
We keep 500 entries for this homework.
This is what a dataset entry looks like: {'INSTRUCTION': 'This math problem has got me stumped: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?\nCan you show me the way?', 'RESPONSE': 'Natalia sold 48/2 = 24 clips in May.\nNatalia sold 48+24 = 72 clips altogether in April and May.', 'SOURCE': 'grade-school-math'}


In [6]:
# Prepare the data for finetuning: concatenate the INSTRUCTION and RESPONSE fields
# of each grade-school-math-instructions entry into one prompt string.

prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts(examples):
    instructions = examples["INSTRUCTION"]
    responses     = examples["RESPONSE"]
    texts = []
    for instruction, response in zip(instructions, responses):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instruction, response) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts, batched = True)
print(dataset[0]['text'])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
This math problem has got me stumped: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Can you show me the way?

### Response:
Natalia sold 48/2 = 24 clips in May.
Natalia sold 48+24 = 72 clips altogether in April and May.<|end_of_text|>
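
The same template serves two purposes: for training it is filled with both the instruction and the response and closed with the end-of-sequence token, while for inference the response slot is left empty so the model continues from "### Response:". Here is a minimal sketch of that difference that runs without a tokenizer. The toy example and the name `format_batch` are hypothetical, and the EOS string is copied from the printed example above rather than read from `tokenizer.eos_token`.

```python
PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

# Assumption: hard-coded here; the notebook reads it from tokenizer.eos_token.
EOS_TOKEN = "<|end_of_text|>"

def format_batch(examples):
    """Build one training string per (instruction, response) pair in a batch."""
    texts = [
        PROMPT.format(instruction, response) + EOS_TOKEN
        for instruction, response in zip(examples["INSTRUCTION"], examples["RESPONSE"])
    ]
    return {"text": texts}

# Hypothetical toy batch in the same column layout as the dataset.
batch = {"INSTRUCTION": ["What is 2 + 3?"], "RESPONSE": ["2 + 3 = 5"]}

training_text = format_batch(batch)["text"][0]
inference_text = PROMPT.format("What is 2 + 3?", "")  # empty response slot

print(training_text)
print("---")
print(inference_text)
```

The inference prompt is a strict prefix of the training text, which is why the finetuned model learns to continue it with a worked answer and then stop at the EOS token.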


In [7]:
# Check how the pretrained model answers three simple questions before finetuning.
Q1 = "I have 10 apples, my brother took half of them from me, I lost 1, and my friend gave me 3. How many do I have now?"
Q2 = "I earn five euros per hour. I worked two hours yesterday and five hours today. How much did I earn in total?"
Q3 = "In year 2000 I was 20 years old. My sister is 5 years younger than me. How old is she in 2020?"

input1 = tokenizer([prompt.format(Q1, "",)],return_tensors = "pt").to("cuda")
input2 = tokenizer([prompt.format(Q2, "",)],return_tensors = "pt").to("cuda")
input3 = tokenizer([prompt.format(Q3, "",)],return_tensors = "pt").to("cuda")

In [8]:
# Response to the first question before finetuning
_= model.generate(**input1, streamer = TextStreamer(tokenizer), max_new_tokens = 70, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
I have 10 apples, my brother took half of them from me, I lost 1, and my friend gave me 3. How many do I have now?

### Response:
I have 10 apples, my brother took half of them from me, I lost 1, and my friend gave me 3. How many do I have now?

### Explanation:
I have 10 apples, my brother took half of them from me, I lost 1, and my friend gave me 3. How many do I


In [9]:
# Response to the second question before finetuning
_= model.generate(**input2, streamer = TextStreamer(tokenizer), max_new_tokens = 70, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
I earn five euros per hour. I worked two hours yesterday and five hours today. How much did I earn in total?

### Response:
I earned 15 euros in total.<|end_of_text|>


In [10]:
# Response to the third question before finetuning
_= model.generate(**input3, streamer = TextStreamer(tokenizer), max_new_tokens = 70, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
In year 2000 I was 20 years old. My sister is 5 years younger than me. How old is she in 2020?

### Response:
In year 2000 I was 20 years old. My sister is 5 years younger than me. How old is she in 2020?

### Explanation:
In year 2000 I was 20 years old. My sister is 5 years younger than me. How old is she in 2020?

### Instruction:
In year


In [16]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # effective batch size = 4 * 4 = 16
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(), # fall back to fp16 on GPUs without bf16 (e.g. Tesla T4)
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1, # log the training loss at every step
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
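
The batch settings above determine how many optimizer updates one epoch produces. The following sketch works through that arithmetic for the 500 selected examples. It assumes that a trailing micro-batch which does not fill a complete accumulation group is not counted as an extra step; that rounding rule is an assumption (it can differ between library versions), chosen here because it agrees with the "Total steps = 31" line in the training log.

```python
import math

num_examples = 500
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1
num_train_epochs = 1

# Examples consumed per optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Forward/backward passes per epoch
micro_batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_gpus))

# Assumption: incomplete accumulation groups are dropped from the step count.
optimizer_steps = (micro_batches_per_epoch // gradient_accumulation_steps) * num_train_epochs

print(f"effective batch size = {effective_batch_size}, optimizer steps = {optimizer_steps}")
# prints: effective batch size = 16, optimizer steps = 31
```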


In [17]:
# Show current GPU memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.922 GB of memory reserved.


In [18]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 31
 "-____-"     Number of trainable parameters = 41,943,040


Step  Training Loss
1     1.9976
2     1.993
3     1.9062
4     1.7918
5     1.691
6     1.5941
7     1.3372
8     1.3501
9     1.2141
10    1.1995


In [19]:
import os

# Save only the LoRA adapter weights; the base model is not copied.
model.save_pretrained("lora_model")
current_path = os.getcwd()
print(f"Address: {current_path}/lora_model")

Address: /kaggle/working/lora_model


In [20]:
# Reload the saved LoRA adapter from disk and switch to inference mode.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [21]:
# Response to the first question after finetuning
_= model.generate(**input1, streamer = TextStreamer(tokenizer), max_new_tokens = 70, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
I have 10 apples, my brother took half of them from me, I lost 1, and my friend gave me 3. How many do I have now?

### Response:
I have 10 apples - half of them = 10/2 = 5
I have 5 apples - 1 = 5-1 = 4
I have 4 apples + 3 = 4+3 = 7<|end_of_text|>


In [22]:
# Response to the second question after finetuning
_= model.generate(**input2, streamer = TextStreamer(tokenizer), max_new_tokens = 70, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
I earn five euros per hour. I worked two hours yesterday and five hours today. How much did I earn in total?

### Response:
I earned 5 euros per hour * 2 hours = 10 euros yesterday.
I earned 5 euros per hour * 5 hours = 25 euros today.
I earned 10 euros yesterday + 25 euros today = 35 euros in total.<|end_of_text|>


In [24]:
# Response to the third question after finetuning
_= model.generate(**input3, streamer = TextStreamer(tokenizer), max_new_tokens = 70, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
In year 2000 I was 20 years old. My sister is 5 years younger than me. How old is she in 2020?

### Response:
In 2020 I will be 20+20=40 years old.
My sister will be 5 years younger than me, so she will be 40-5=35 years old.<|end_of_text|>
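
One rough way to compare the answers before and after finetuning is to pull the final number out of each response and check it against the expected result (7, 35 and 35 for the three questions). The helper names `extract_response` and `last_number` are hypothetical, and treating the last number in the response as the answer is a crude heuristic that happens to fit these questions, not a general grading method. The sample strings are copied from the generation outputs above.

```python
import re

RESPONSE_MARKER = "### Response:"
EOS_TOKEN = "<|end_of_text|>"

def extract_response(generated):
    """Return the text after the first response marker, without the EOS token."""
    _, _, response = generated.partition(RESPONSE_MARKER)
    return response.replace(EOS_TOKEN, "").strip()

def last_number(text):
    """Heuristic: treat the last integer in the text as the final answer."""
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None

before_q2 = "### Response:\nI earned 15 euros in total.<|end_of_text|>"
after_q1 = ("### Response:\nI have 10 apples - half of them = 10/2 = 5\n"
            "I have 5 apples - 1 = 5-1 = 4\n"
            "I have 4 apples + 3 = 4+3 = 7<|end_of_text|>")
after_q2 = ("### Response:\nI earned 5 euros per hour * 2 hours = 10 euros yesterday.\n"
            "I earned 5 euros per hour * 5 hours = 25 euros today.\n"
            "I earned 10 euros yesterday + 25 euros today = 35 euros in total.<|end_of_text|>")
after_q3 = ("### Response:\nIn 2020 I will be 20+20=40 years old.\n"
            "My sister will be 5 years younger than me, so she will be 40-5=35 years old.<|end_of_text|>")

samples = [
    ("Q2 before finetuning", before_q2, 35),
    ("Q1 after finetuning", after_q1, 7),
    ("Q2 after finetuning", after_q2, 35),
    ("Q3 after finetuning", after_q3, 35),
]

for name, generated, expected in samples:
    answer = last_number(extract_response(generated))
    verdict = "correct" if answer == expected else "wrong"
    print(f"{name}: got {answer}, expected {expected} -> {verdict}")
```

On these samples the pretrained model's direct answer to the second question (15) is wrong, while all three finetuned responses end in the expected number.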
