<a href="https://colab.research.google.com/github/ismoil27/jaydariGPT/blob/main/jaydari_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers torch bitsandbytes datasets peft trl



In [None]:
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM, TrainingArguments
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

In [None]:
model_id = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# print('Vocab size:', tokenizer.vocab_size)
# print('Special tokens:', tokenizer.special_tokens_map)

# quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type='nf4'
)

bnb_config

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto', # place the model on the GPU when one is available, otherwise on the CPU
    # dtype=torch.bfloat16
)
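
To make the effect of `load_in_4bit=True` concrete, here is a rough estimate of weight memory at different precisions. It is only a sketch: the parameter count is an approximation taken from the "1.1B" in the model name, and the estimate ignores activations, the quantization constants, and the layers that stay in higher precision.

```python
# Back-of-the-envelope weight memory for a ~1.1B-parameter model.
# ASSUMPTION: every weight is stored at the stated precision. A real 4-bit load
# keeps some layers in higher precision and adds small quantization constants.
N_PARAMS = 1_100_000_000  # approximate, from the "1.1B" in the model name

BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}


def weight_memory_gib(n_params: int, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


estimates = {name: weight_memory_gib(N_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gib in estimates.items():
    print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the bf16 footprint, which is why the model fits comfortably on a small GPU.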

In [None]:
# Before Fine-tuning
prompt = "Explain what a tokenizer is?"
# prompt = "A tokenizer is a tool in natural language processing that"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device) # GPU, CPU

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        temperature=0.7
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# print(model)
first_block = model.model.layers[0]
print('first_block:', first_block)
print('=======')
print(first_block.self_attn)
print('=======')
print(model.config)



Explain what a tokenizer is? And give an example of its use in Python.
first_block: LlamaDecoderLayer(
  (self_attn): LlamaAttention(
    (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
    (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
    (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
    (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
  )
  (mlp): LlamaMLP(
    (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
    (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
    (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
    (act_fn): SiLUActivation()
  )
  (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
  (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
LlamaAttention(
  (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
  (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
  (v_pr
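
The `generate` call above samples with `temperature=0.7`. As a minimal illustration of what that setting does, the sketch below divides logits by the temperature before the softmax. The logits are made-up values for three hypothetical tokens, and this is not the library's own sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.7)

print("T=1.0:", [round(p, 3) for p in p_default])
print("T=0.7:", [round(p, 3) for p in p_sharp])
```

A temperature below 1 concentrates probability on the highest-scoring token, so sampled text is less random than at `temperature=1.0`.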

In [None]:
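def count_parameters(model):
    # Note: bitsandbytes packs two 4-bit weights into one stored element, so numel()
    # reports only half of the real weight count for every quantized Linear4bit layer.
    return sum(p.numel() for p in model.parameters())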

total_params = count_parameters(model)
print(f"Total parameters (including frozen 4-bit): {total_params:,}")


Total parameters (including frozen 4-bit): 615,606,272
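
The count above is well below 1.1 billion because the 4-bit layers are stored packed. The following sketch rebuilds the number from the layer shapes printed in this notebook. It assumes the input embedding and the output head are separate, unquantized matrices of size 32000 x 2048; the printed model is truncated before the output head, so that part is an assumption.

```python
# (in_features, out_features) copied from the printed decoder layer.
LINEAR_SHAPES = {
    "q_proj": (2048, 2048), "k_proj": (2048, 256),
    "v_proj": (2048, 256), "o_proj": (2048, 2048),
    "gate_proj": (2048, 5632), "up_proj": (2048, 5632), "down_proj": (5632, 2048),
}
NUM_LAYERS, HIDDEN, VOCAB = 22, 2048, 32000

linear_total = NUM_LAYERS * sum(i * o for i, o in LINEAR_SHAPES.values())  # quantized to 4-bit
norms = NUM_LAYERS * 2 * HIDDEN + HIDDEN  # two RMSNorms per layer plus the final norm
embeddings = 2 * VOCAB * HIDDEN           # ASSUMPTION: untied embed_tokens and output head, not quantized

true_total = linear_total + norms + embeddings
packed_total = linear_total // 2 + norms + embeddings  # two 4-bit weights per stored element

print(f"real weight count:   {true_total:,}")
print(f"numel() after 4-bit: {packed_total:,}")
```

Under these assumptions the packed total reproduces the 615,606,272 reported above, while the real weight count is about 1.1 billion.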


## datasets library | load_dataset
* instruction tuning

In [None]:
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset
dataset[1]

{'output': 'The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB).',
 'input': '',
 'instruction': 'What are the three primary colors?'}

In [None]:
def generate_prompt(example):
  instruction = example['instruction']
  input_text = example['input']
  output_text = example['output']

  if input_text:
    return(
        "### Instruction:\n"
        f"{instruction}\n\n"
        "### Input:\n"
        f"{input_text}\n\n"
        "### Response:\n"
        f"{output_text}"
    )
  else:
    return(
       "### Instruction:\n"
       f"{instruction}\n\n"
       "### Response:\n"
       f"{output_text}"
    )

# generate_prompt(dataset[0])

def formatting_func(example):
  return {'text': generate_prompt(example)}

dataset = dataset.map(formatting_func)


In [None]:
dataset[0]['text']

'### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'

In [None]:
dataset = dataset.select(range(7000))

In [None]:
dataset = dataset.shuffle(seed=42)

In [None]:
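# shuffle(seed=42) is deterministic: the same seed gives the same order on every run, e.g.
# run 1: [7, 3, 2, 8, 5, 6, 9, 4, 0, 1]
# run 2: [7, 3, 2, 8, 5, 6, 9, 4, 0, 1]  (exact same order)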

In [None]:
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# [10, 231, 342, 3453, 3464, 5123, 6456, 7, 8, 9]
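
The reproducibility of a seeded shuffle can be shown with the standard library alone. This is an analogy rather than the `datasets` implementation: `dataset.shuffle(seed=42)` uses its own random generator, so the concrete order it produces will differ from the one printed here.

```python
import random


def shuffled_indices(n: int, seed: int) -> list:
    """Return the indices 0..n-1 in a pseudo-random order fixed by the seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)

print(first)
print(second)
print("same order:", first == second)
```

Fixing the seed means every run trains on the examples in the same order, which makes loss curves comparable between runs.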


In [None]:
dataset

Dataset({
    features: ['output', 'input', 'instruction', 'text'],
    num_rows: 7000
})

In [None]:
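# Full fine-tuning  => update every weight of the model; highest memory cost
# Cheap fine-tuning => train only a small number of added parameters and keep the base model frozen
# PEFT => Parameter-Efficient Fine-Tuning
# OOM  => Out Of Memory (the GPU runs out of VRAM)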

In [None]:
lora_config = LoraConfig(
    r=8, # rank
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

In [None]:
model = get_peft_model(model, lora_config)

In [None]:
model.print_trainable_parameters()

trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023
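
The trainable count printed above can be checked by hand. For each targeted linear layer, LoRA adds a matrix A of shape r x in_features and a matrix B of shape out_features x r. The sketch below uses the `q_proj` and `v_proj` shapes and the 22 layers printed in this notebook, and takes the "all params" figure from the output above.

```python
R = 8             # rank from LoraConfig
LORA_ALPHA = 16   # lora_alpha from LoraConfig
NUM_LAYERS = 22   # decoder layers in the printed model
TARGETS = {"q_proj": (2048, 2048), "v_proj": (2048, 256)}  # (in_features, out_features)
ALL_PARAMS = 1_101_174_784  # "all params" as printed by print_trainable_parameters()


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


trainable = NUM_LAYERS * sum(lora_params(i, o, R) for i, o in TARGETS.values())
percent = 100 * trainable / ALL_PARAMS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable params: {trainable:,} || trainable%: {percent:.4f} || scaling: {scaling}")
```

Each layer contributes 32,768 parameters for `q_proj` and 18,432 for `v_proj`, which over 22 layers gives the 1,126,400 reported by `print_trainable_parameters()`.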


In [None]:
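# QLoRA => LoRA adapters trained on top of a 4-bit quantized base model (the setup used here)
# LoRA  => low-rank adapter matrices trained while the base weights stay frozen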

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4, # accumulate 4 micro-batches per optimizer step to save VRAM
    num_train_epochs=2, # kept low to limit overfitting
    logging_steps=20,
    output_dir="./jaydari_gpt",
    save_strategy="epoch",
    bf16=True,
    fp16=False,
    report_to="none"
)
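
These settings determine how many optimizer steps the run takes. The arithmetic below is a sketch that assumes a single GPU and the 7,000-example subset selected earlier.

```python
NUM_EXAMPLES = 7000            # rows kept by dataset.select(range(7000))
PER_DEVICE_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 2
LOGGING_STEPS = 20
NUM_DEVICES = 1                # ASSUMPTION: a single GPU, as on a standard Colab runtime

effective_batch = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
steps_per_epoch = NUM_EXAMPLES // effective_batch
total_steps = steps_per_epoch * NUM_EPOCHS
log_lines = total_steps // LOGGING_STEPS

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} in total")
print(f"loss is logged {log_lines} times")
```

The total of 3,500 steps is consistent with the `checkpoint-3500` directory that appears when the output folder is zipped at the end of the notebook.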

In [None]:
print(dataset.column_names)

['output', 'input', 'instruction', 'text']


In [None]:
# SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=lambda x: x['text'],
    args=training_args
)


trainer.train()
model.save_pretrained('jaydari_gpt')
tokenizer.save_pretrained('jaydari_gpt')



Step,Training Loss
20,1.4868
40,1.5263
60,1.2792
80,1.3378
100,1.2894
120,1.3097
140,1.3359
160,1.2351
180,1.2958
200,1.2054


('jaydari_gpt/tokenizer_config.json',
 'jaydari_gpt/special_tokens_map.json',
 'jaydari_gpt/chat_template.jinja',
 'jaydari_gpt/tokenizer.model',
 'jaydari_gpt/added_tokens.json',
 'jaydari_gpt/tokenizer.json')

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer_base = AutoTokenizer.from_pretrained(base_model_id)
model_base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto"
)

model_base.eval()


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rot

In [None]:
prompt = "Explain what machine learning is in simple words."

inputs_base = tokenizer_base(prompt, return_tensors="pt").to(model_base.device)

with torch.no_grad():
    output_base = model_base.generate(
        **inputs_base,
        max_new_tokens=120,
        temperature=0.7,
        do_sample=True
    )

print("===== BASE MODEL OUTPUT =====")
print(tokenizer_base.decode(output_base[0], skip_special_tokens=True))


===== BASE MODEL OUTPUT =====
Explain what machine learning is in simple words.


# FINE-TUNED MODEL INFERENCE

In [None]:
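model_path = "jaydari_gpt"

# The saved directory holds the LoRA adapter. With peft installed, transformers loads
# the base model named in adapter_config.json and attaches the adapter to it.
# The *_ft names match the inference cell below.
tokenizer_ft = AutoTokenizer.from_pretrained(model_path)
model_ft = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto"
)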

In [None]:
prompt = """### Instruction:
Explain what machine learning is in simple words.

### Response:
"""


inputs_ft = tokenizer_ft(prompt, return_tensors="pt").to(model_ft.device)

with torch.no_grad():
    output_ft = model_ft.generate(
        **inputs_ft,
        max_new_tokens=120,
        temperature=0.7,
        do_sample=True
    )

print("===== FINE-TUNED MODEL OUTPUT =====")
print(tokenizer_ft.decode(output_ft[0], skip_special_tokens=True))


===== FINE-TUNED MODEL OUTPUT =====
### Instruction:
Explain what machine learning is in simple words.

### Response:
Machine learning is the process of using algorithms and statistical models to learn from data and make predictions or decisions based on that data. It involves the use of algorithms to learn and improve from historical data, which helps to identify patterns and trends in the data, and to predict future outcomes. 

Machine learning techniques are used in various applications across different industries, including:
- Predicting customer behavior and preferences
- Training algorithms that can improve the performance of industrial robots
- Identifying the most effective marketing strategies for a business
- Personalizing user interfaces


In [None]:
!zip -r jaydari_gpt.zip jaydari_gpt

  adding: jaydari_gpt/ (stored 0%)
  adding: jaydari_gpt/README.md (deflated 45%)
  adding: jaydari_gpt/checkpoint-3500/ (stored 0%)
  adding: jaydari_gpt/checkpoint-3500/README.md (deflated 65%)
  adding: jaydari_gpt/checkpoint-3500/trainer_state.json (deflated 83%)
  adding: jaydari_gpt/checkpoint-3500/scheduler.pt (deflated 61%)
  adding: jaydari_gpt/checkpoint-3500/special_tokens_map.json (deflated 79%)
  adding: jaydari_gpt/checkpoint-3500/tokenizer.model (deflated 55%)
  adding: jaydari_gpt/checkpoint-3500/rng_state.pth (deflated 26%)
  adding: jaydari_gpt/checkpoint-3500/adapter_config.json (deflated 57%)
  adding: jaydari_gpt/checkpoint-3500/tokenizer.json (deflated 85%)
  adding: jaydari_gpt/checkpoint-3500/chat_template.jinja (deflated 60%)
  adding: jaydari_gpt/checkpoint-3500/optimizer.pt (deflated 22%)
  adding: jaydari_gpt/checkpoint-3500/adapter_model.safetensors (deflated 23%)
  adding: jaydari_gpt/checkpoint-3500/training_args.bin (deflated 53%)
  adding: jaydari_gpt/c

In [None]:
from google.colab import files
files.download("jaydari_gpt.zip")


In [None]:
###########################
#      MODEL TESTING      #
###########################