**Installing Packages**

In [1]:
!pip install git+https://github.com/huggingface/peft
!pip install trl
!pip install datasets
!pip install -U bitsandbytes

Collecting git+https://github.com/huggingface/peft
  Cloning https://github.com/huggingface/peft to /tmp/pip-req-build-dihuvq6t
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft /tmp/pip-req-build-dihuvq6t
  Resolved https://github.com/huggingface/peft to commit b180ae46f8cf9663f3bf786b41b2eb41c2512031
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: peft
  Building wheel for peft (pyproject.toml) ... done
  Created wheel for peft: filename=peft-0.12.1.dev0-py3-none-any.whl size=316782 sha256=7b87bb0f91a890b46d4d8e28e97a6f51c77d1eb55f5162735bb2f10ca724386b
  Stored in directory: /tmp/pip-ephem-wheel-cache-gg979ho6/wheels/4c/16/67/1002a2d4daa822eff130e6d85b90051b75d2ce0d26b9448e4a
Successfully built peft
Installing collected packages: peft
Successfully installed peft-0.12.1.d

**Imports and model selection**

In [2]:
import torch
import transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, pipeline, LlamaTokenizer
import bitsandbytes
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset, Dataset
import json
import matplotlib.pyplot as plt
from trl import SFTTrainer


# model_id = "./open_llama_7b_v2"
#model_id = "./Genz-70b-GPTQ"
#model_id = "/home/ek826@drexel.edu/Yi-34B-Chat"
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# model_id = "PY007/TinyLlama-1.1B-Chat-v0.1"

n_gpus = torch.cuda.device_count()
max_memory = f'{32960}MB'
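`n_gpus` and `max_memory` are set in this cell but are not used anywhere else in the notebook. One common way to use such values is to build a per-GPU memory map and pass it as `max_memory=` alongside `device_map="auto"` when loading the model. The sketch below only builds that map; the helper name `build_max_memory` and the 2-GPU machine are illustrative assumptions, not part of the original code.

```python
def build_max_memory(n_gpus: int, per_gpu: str) -> dict:
    """Map each visible GPU index to the same memory cap (hypothetical helper)."""
    return {gpu_index: per_gpu for gpu_index in range(n_gpus)}


# Same cap as in the cell above, applied to an assumed 2-GPU machine.
memory_map = build_max_memory(2, f"{32960}MB")
print(memory_map)
```

On a machine without GPUs the map is empty, so the caller would need a CPU fallback before passing it on.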


**Kanye West 1000-prompt dataset preparation**

In [4]:
def load_json_file(file_path):
    with open(file_path, "r") as file:
        data = json.load(file)
    return data

def convert_to_dataset(data):
    roles = []
    contents = []

    for item in data:
        roles.append("user")
        contents.append(item['Prompt'])

        roles.append("assistant")
        contents.append(item['Response'])

    data_dict = {"role": roles, "content": contents}

    dataset = Dataset.from_dict(data_dict)

    return dataset

file_path = "kanye_1000_conversations.json"

data = load_json_file(file_path)

dataset = convert_to_dataset(data)

**Formatting function**

In [5]:
def formatting_func(example):
    if example["role"] == "user":
        text = f"### Human: {example['content']}\n"
    elif example["role"] == "assistant":
        text = f"### Assistant: {example['content']}\n"
    else:
        text = ""

    return {"text": text}

formatted_dataset = dataset.map(formatting_func)
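`convert_to_dataset` and `formatting_func` put every prompt and every response in its own row, so each training example holds only one side of an exchange. A possible alternative, sketched below, keeps each prompt and its response together in a single string. The helper name `pair_to_text` and the sample record are illustrative assumptions; only the `Prompt` and `Response` keys come from the code above.

```python
def pair_to_text(item: dict) -> str:
    """Render one prompt/response record as a single training string."""
    return (
        f"### Human: {item['Prompt']}\n"
        f"### Assistant: {item['Response']}\n"
    )


# Hypothetical record with the same keys as the entries in the JSON file.
sample = {"Prompt": "Why do you create songs?", "Response": "To be heard."}
print(pair_to_text(sample))
```

With this layout, `Dataset.from_dict({"text": [pair_to_text(x) for x in data]})` would yield one row per conversation, half as many rows as the role/content layout.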


Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

**Tokenizer**

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



**Quantization and LoRA definition**

In [7]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

qlora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
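To make the `r=64`, `lora_alpha=16` choice concrete, the sketch below computes the standard LoRA scaling factor (`lora_alpha / r`, assuming rank-stabilized scaling is not enabled) and a rough count of trainable adapter parameters for `q_proj` and `v_proj`. The layer shapes are assumptions about TinyLlama-1.1B (hidden size 2048, key/value projection width 256, 22 layers) and are not stated in this notebook; the helper `lora_param_count` is hypothetical.

```python
def lora_param_count(r: int, in_features: int, out_features: int) -> int:
    """Parameters in one LoRA pair: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r


r, lora_alpha = 64, 16        # values from qlora_config above
scaling = lora_alpha / r      # factor applied to the low-rank update

# Assumed TinyLlama-1.1B shapes (not confirmed by this notebook).
hidden, kv_dim, n_layers = 2048, 256, 22
per_layer = lora_param_count(r, hidden, hidden) + lora_param_count(r, hidden, kv_dim)
total_trainable = per_layer * n_layers

print(f"scaling={scaling}, trainable adapter parameters={total_trainable:,}")
```

Under these assumptions the adapter update is scaled by 0.25, which damps its effect relative to a configuration where `lora_alpha` equals `r`.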


**Trainer object**

In [8]:

supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=formatted_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,   # 4 samples per optimizer step on each device
        learning_rate=2e-4,
        max_steps=118,                   # takes precedence over num_train_epochs
        output_dir=".",
        optim="paged_adamw_8bit",
        fp16=True,                       # note: bnb_config above computes in bfloat16
    ),
    tokenizer=tokenizer,
    # Causal-LM collator: labels are copied from input_ids (no masked-LM objective).
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=2048
)

supervised_finetuning_trainer.train()



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
max_steps is given, it will override any value given in num_train_epochs




TrainOutput(global_step=118, training_loss=1.6460401890641552, metrics={'train_runtime': 124.8884, 'train_samples_per_second': 3.779, 'train_steps_per_second': 0.945, 'total_flos': 266499226976256.0, 'train_loss': 1.6460401890641552, 'epoch': 0.236})
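The `'epoch': 0.236` value in `TrainOutput` follows from the training arguments: each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples per device. The sketch below reproduces that arithmetic, assuming a single device and the 2000 rows reported by the `Map` progress output; the function name `epochs_covered` is hypothetical.

```python
def epochs_covered(max_steps, per_device_batch, grad_accum, n_devices, n_examples):
    """Fraction of the dataset seen after max_steps optimizer steps."""
    samples_seen = max_steps * per_device_batch * grad_accum * n_devices
    return samples_seen / n_examples


# 118 steps, batch size 1, 4 accumulation steps, 1 device (assumed), 2000 rows.
print(epochs_covered(118, 1, 4, 1, 2000))
```

So the run stops after seeing 472 of the 2000 rows, a little under a quarter of one epoch.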

**Saving model and tokenizer**

In [9]:
save_directory = "./trained_model"
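# The trainer's model is PEFT-wrapped, so this writes the LoRA adapter weights
# and adapter config rather than a full copy of the base model.
supervised_finetuning_trainer.model.save_pretrained(save_directory)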
# base_model.save_pretrained(save_directory)

tokenizer.save_pretrained(save_directory)

# loaded_model = AutoModelForCausalLM.from_pretrained(save_directory)
# loaded_tokenizer = LlamaTokenizer.from_pretrained(save_directory)


('./trained_model/tokenizer_config.json',
 './trained_model/special_tokens_map.json',
 './trained_model/tokenizer.model',
 './trained_model/added_tokens.json',
 './trained_model/tokenizer.json')

**Loading Model and Tokenizer**

In [10]:
#save_directory = "./trained_model"
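# save_directory holds a LoRA adapter; loading it this way relies on the PEFT
# integration in transformers to fetch the base model named in the adapter config.
model = AutoModelForCausalLM.from_pretrained(save_directory)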
tokenizer = AutoTokenizer.from_pretrained(save_directory)

In [11]:
device = 0 if torch.cuda.is_available() else -1
text_gen_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

prompts = [
    "What do you think about the deep state",
    "If you were to write a song about lawyers what will it be called",
    "Why is chicken unhealthy for people",
    "why is money happiness?",
    "Is timbaland better than justin timberlake",
    "Why do you create songs?",
    "Why not jazz over rap?",
    "Fifty called you fake, response?",
    "Why do you wanna run for office?",
    "Why was Vultures 2 a bad album?"
]

results = []
for prompt in prompts:
    result = text_gen_pipeline(f"### Human: {prompt} ### Assistant:",
                               max_length=200,
                               num_return_sequences=1,
                               do_sample=True,
                               temperature=0.7,
                               top_p=0.9)

    response = result[0]['generated_text'].split("### Assistant:")[-1].strip()
    results.append({"prompt": prompt, "response": response})

output_file = "predictions.txt"
with open(output_file, "w") as file:
    for item in results:
        file.write(f"### Human: {item['prompt']}\n")
        file.write(f"### Assistant: {item['response']}\n")
        file.write("\n")

print(f"Predictions written to {output_file}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Predictions written to predictions.txt
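The plain-text format written above puts each response after a single `### Assistant:` marker. If a generated response contains a newline, a reader that works line by line, such as `read_predictions_file` later in this notebook, keeps only the first line of it. One way around this, sketched below under the assumption that changing the file format is acceptable, is JSON Lines, where `json.dumps` escapes embedded newlines so each record stays on one line. The helper names and the sample record are illustrative.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line; embedded newlines are escaped."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read the records back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical multi-line response to exercise the round trip.
records = [{"prompt": "Why not jazz over rap?", "response": "Line one.\nLine two."}]
path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
write_jsonl(path, records)
assert read_jsonl(path) == records
```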


In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def read_predictions_file(file_path):
    conversations = []
    with open(file_path, 'r') as file:
        prompt, response = None, None
        for line in file:
            line = line.strip()
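            if line.startswith("### Human:"):
                # Strip only the leading marker; str.replace would also alter matches inside the text.
                prompt = line.removeprefix("### Human:").strip()
            elif line.startswith("### Assistant:"):
                response = line.removeprefix("### Assistant:").strip()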
                if prompt and response:
                    conversations.append({"Prompt": prompt, "Response": response})
                    prompt, response = None, None
    return conversations

def calculate_perplexity(prompt, response):
    model.eval()

    input_text = f"### Human: {prompt} ### Assistant: {response}"

    inputs = tokenizer(input_text, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)  # Move inputs to the same device as the model

    with torch.no_grad():
        outputs = model(input_ids=input_ids, labels=input_ids)

    loss = outputs.loss
    perplexity = torch.exp(loss).item()

    return perplexity

file_path = "predictions.txt"
test_data = read_predictions_file(file_path)

total_perplexity = 0
for item in test_data:
    prompt = item["Prompt"]
    response = item["Response"]
    perplexity = calculate_perplexity(prompt, response)
    print(f"Perplexity for prompt '{prompt}': {perplexity}")
    total_perplexity += perplexity

average_perplexity = total_perplexity / len(test_data)
print(f"Average Perplexity: {average_perplexity}")


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


Perplexity for prompt 'What do you think about the deep state': 4.087143898010254
Perplexity for prompt 'If you were to write a song about lawyers what will it be called': 16.05428123474121
Perplexity for prompt 'Why is chicken unhealthy for people': 5.387056350708008
Perplexity for prompt 'why is money happiness?': 2.834333896636963
Perplexity for prompt 'Is timbaland better than justin timberlake': 3.1128461360931396
Perplexity for prompt 'Why do you create songs?': 3.0847272872924805
Perplexity for prompt 'Why not jazz over rap?': 11.311464309692383
Perplexity for prompt 'Fifty called you fake, response?': 8.820530891418457
Perplexity for prompt 'Why do you wanna run for office?': 22.397838592529297
Perplexity for prompt 'Why was Vultures 2 a bad album?': 3.9611785411834717
Average Perplexity: 8.105140113830567
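`calculate_perplexity` passes `labels=input_ids`, so prompt tokens and response tokens both contribute to the loss. If the goal is to score only the response, a common approach is to set the label of every prompt position to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists with made-up per-token negative log-likelihoods; the helper names are hypothetical, and it leaves out the one-position label shift that the model applies internally.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids, hiding the first prompt_len positions from the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def perplexity(token_nlls, labels):
    """exp(mean negative log-likelihood) over positions that are not ignored."""
    kept = [nll for nll, label in zip(token_nlls, labels) if label != IGNORE_INDEX]
    return math.exp(sum(kept) / len(kept))


# Toy example: 3 prompt tokens followed by 2 response tokens (values are made up).
input_ids = [11, 12, 13, 21, 22]
token_nlls = [2.0, 2.0, 2.0, 0.5, 1.5]

full = perplexity(token_nlls, input_ids)                             # prompt + response
response_only = perplexity(token_nlls, mask_prompt_labels(input_ids, 3))
print(full, response_only)
```

In this toy case the prompt tokens are harder to predict than the response tokens, so including them raises the reported perplexity. Note also that the responses scored above were generated by the same model, so the figures describe how confident the model is in its own output rather than how well it fits held-out text.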


In [11]:
text_gen_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer,
                             device=0 if torch.cuda.is_available() else -1)

sample_prompt = "What's your key to success?"

result = text_gen_pipeline(f"### Human: {sample_prompt} ### Assistant:",
                           max_length=200,
                           num_return_sequences=1,
                           do_sample=True,
                           temperature=0.7,
                           top_p=0.9)

response = result[0]['generated_text'].split("### Assistant:")[-1].strip()
print(response)

I believe that success is not just about achieving your goals, but about how you approach them and how you create value for others along the way. Success to me is not just about the number of accomplishments, but about the impact you leave on the world. How can you use your experiences and creativity to inspire and empower others?
