# LLM Finetuning

The dataset is a 10-row collection of sustainability prompt-completion pairs.
The base model is EleutherAI/pythia-70m. We use AutoTokenizer for tokenization and AutoModelForCausalLM to load the model for training.


- Data Preparation
    - Collect data
    - Tokenize data (pad and truncate)
    - Split data into train and test sets
- Use Base Model
- Train
    - Train, save model
- Inference
    - Load model
    - Make predictions
- Evaluation
    - Load model
    - Calculate BLEU score on test data

## Data Preparation

**Collect prompt-completion pairs and create a jsonl file**

In [1]:
import pandas as pd
import numpy as np
import datasets

In [2]:
df = pd.read_excel("data_10.xlsx")
df.head()

|   | prompt | completion |
|---|---|---|
| 0 | Why is sustainability important? | Sustainability is crucial because it ensures a... |
| 1 | How can individuals contribute to sustainability? | Individuals can contribute to sustainability b... |
| 2 | What are some sustainable practices in agricul... | Sustainable agriculture practices include crop... |
| 3 | Describe the concept of a circular economy. | A circular economy is an economic model that a... |
| 4 | How does climate change relate to sustainability? | Climate change is a major threat to sustainabi... |


In [3]:
# Define the output JSONL file name
filename = 'output.jsonl'

# Iterate through the rows and write each row as a JSON object to the JSONL file
with open(filename, 'w') as jsonl_file:
    for _, row in df.iterrows():
        json_data = row.to_json(orient='columns')
        jsonl_file.write(json_data + '\n')
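As a minimal sketch of what the jsonl format looks like, the snippet below writes two of the dataset's rows as one JSON object per line and reads them back. It uses only the standard library and an in-memory buffer in place of `output.jsonl`; the names `rows`, `buffer`, and `parsed_rows` are illustrative and not part of the notebook.

```python
import io
import json

# Two prompt-completion pairs taken from the dataset shown in this notebook.
rows = [
    {
        "prompt": "Why is sustainability important?",
        "completion": "Sustainability is crucial because it ensures a balance between meeting our current needs and preserving resources for future generations.",
    },
    {
        "prompt": "What is the triple bottom line concept in sustainability?",
        "completion": "The triple bottom line concept in sustainability evaluates the performance of organizations or projects based on three criteria: social, environmental, and economic, emphasizing a holistic approach to success.",
    },
]

# In-memory stand-in for the output file (assumption: avoids touching the disk).
buffer = io.StringIO()
for row in rows:
    # One JSON object per line is what makes the file "JSON Lines".
    buffer.write(json.dumps(row) + "\n")

# Reading it back: parse each line independently.
jsonl_lines = buffer.getvalue().splitlines()
parsed_rows = [json.loads(line) for line in jsonl_lines]
print(len(jsonl_lines), parsed_rows[0]["prompt"])
```

Because every line is a complete JSON object, the file can be streamed or appended to without re-parsing earlier rows.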

**Create a tokenizer**

In [4]:
#!pip install transformers
from transformers import AutoTokenizer

In [5]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

**Tokenize the jsonl data**

In [6]:
def tokenize_function(examples):
    # With batched=True and batch_size=1, each column is a one-element list,
    # so [0] selects the only example in the batch.
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    elif "prompt" in examples and "completion" in examples:
        text = examples["prompt"][0] + examples["completion"][0]
    else:
        text = examples["text"][0]

    # Reuse the EOS token as the pad token so that short inputs can be padded
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Use the tokenized length, capped at 2048 tokens
    # (the model's max_position_embeddings)
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )

    # Truncate from the left if the text is longer than max_length
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs
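To make the pad and truncate steps concrete, here is a small sketch on plain Python lists. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation; it assumes padding is added on the right with token id 0 and that truncation drops tokens from the left, as `truncation_side = "left"` requests above.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0, truncation_side="left"):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")

    if len(token_ids) > max_length:
        if truncation_side == "left":
            kept = token_ids[-max_length:]  # drop the oldest tokens
        else:
            kept = token_ids[:max_length]   # drop the newest tokens
        return kept, [1] * max_length

    # Assumption: pad on the right; padded positions get attention mask 0.
    n_pad = max_length - len(token_ids)
    return token_ids + [pad_id] * n_pad, [1] * len(token_ids) + [0] * n_pad


# First five token ids of the first example in this notebook.
ids = [4967, 310, 32435, 1774, 32]
truncated_ids, truncated_mask = pad_or_truncate(ids, max_length=3)
padded_ids, padded_mask = pad_or_truncate(ids, max_length=7)
print(truncated_ids, padded_ids, padded_mask)
```

Left truncation keeps the end of the text, which for prompt + completion pairs means the completion is preserved when something has to be cut.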

In [7]:
finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")

tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print(tokenized_dataset)


Dataset({
    features: ['prompt', 'completion', 'input_ids', 'attention_mask'],
    num_rows: 10
})


In [8]:
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])
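The labels are a plain copy of `input_ids` because a causal language model learns to predict each next token. As I understand it, the shift by one position happens inside the model's loss computation, so the copy does not need to be shifted by hand. The sketch below only illustrates that pairing on a short list; the variable names are illustrative.

```python
# First five token ids of the first example in this notebook.
input_ids = [4967, 310, 32435, 1774, 32]
labels = list(input_ids)  # same values, as in add_column("labels", ...)

# The prediction made at position t is compared with the label at position t + 1.
context_tokens = input_ids[:-1]
next_token_targets = labels[1:]
pairs = list(zip(context_tokens, next_token_targets))

for seen, target in pairs:
    print(f"after token {seen} the target is {target}")
```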

**Analyse tokenized dataset**

In [9]:
tokenized_dataset

Dataset({
    features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 10
})

In [10]:
tokenized_dataset["prompt"][0]

'Why is sustainability important?'

In [11]:
tokenized_dataset["completion"][0]

'Sustainability is crucial because it ensures a balance between meeting our current needs and preserving resources for future generations.'

In [12]:
tokenized_dataset["input_ids"][0]

[4967, 310, 32435, 1774, 32, 52, 28216, 1430, 310, 9560, 984, 352, 20096, 247,
 6654, 875, 4804, 776, 1655, 3198, 285, 24279, 5300, 323, 2852, 14649, 15]

In [13]:
tokenized_dataset["attention_mask"][0]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

**Train test split**

In [14]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 9
    })
    test: Dataset({
        features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1
    })
})
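The 9/1 split follows from `test_size=0.1` on 10 rows. The sketch below reproduces only the size arithmetic with the standard library. `split_indices` is a hypothetical helper, it assumes the test fraction is rounded up to a whole number of rows, and its shuffle will not reproduce the exact rows that `train_test_split` picks.

```python
import math
import random


def split_indices(num_rows, test_size, seed):
    # Assumption: the test fraction is rounded up to a whole number of rows.
    num_test = math.ceil(num_rows * test_size)
    indices = list(range(num_rows))
    # Seeded shuffle so the split is repeatable (not the same permutation
    # as the datasets library would produce).
    random.Random(seed).shuffle(indices)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(num_rows=10, test_size=0.1, seed=123)
print(len(train_idx), len(test_idx))
```

With a single test row, every evaluation number later in the notebook is based on one example, so it should be read as a smoke test rather than a measurement.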


In [15]:
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(train_dataset)
print(test_dataset)

Dataset({
    features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 9
})
Dataset({
    features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1
})


**Push to hub**

In [16]:
# This is how to push your own dataset to the Hugging Face Hub
# (dataset_path_hf is a placeholder for your own repository id)
# !pip install huggingface_hub
# !huggingface-cli login
# split_dataset.push_to_hub(dataset_path_hf)

## Use Base Model

In [17]:
import datasets
import logging
import random
import torch
import transformers
import pandas as pd

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments, Trainer


In [18]:
model_name = "EleutherAI/pythia-70m"

In [19]:
base_model = AutoModelForCausalLM.from_pretrained(model_name)

device_count = torch.cuda.device_count()
if device_count > 0:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
    
base_model.to(device)
print(device)

cpu


In [20]:
test_text = test_dataset[0]['prompt']
max_input_tokens = 1000
# Passed to generate() as max_length, which counts prompt tokens plus generated tokens
max_output_tokens = 100
# Tokenize
input_ids = tokenizer.encode(
      test_text,
      return_tensors="pt",
      truncation=True,
      max_length=max_input_tokens
)

# Generate
device = base_model.device
generated_tokens_with_prompt = base_model.generate(input_ids=input_ids.to(device), max_length=max_output_tokens)

# Decode
generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

# Strip the prompt
generated_text_answer = generated_text_with_prompt[0][len(test_text):]


print("Question input (test):", test_text)
print(f"Correct answer from docs: {test_dataset[0]['completion']}")
print("Model's answer: ")
print(generated_text_answer)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Question input (test): What is the triple bottom line concept in sustainability?
Correct answer from docs: The triple bottom line concept in sustainability evaluates the performance of organizations or projects based on three criteria: social, environmental, and economic, emphasizing a holistic approach to success.
Model's answer: 


A:

The answer is that you need to be able to use the following two concepts:

The first is the definition of the bottom line concept.
The second is the definition of the bottom line concept.
The third is the definition of the bottom line concept.
The fourth is the definition of the bottom line concept.
The fourth is the definition of the bottom line concept.
The fifth is the definition of
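Two details of the generation cell are easy to miss: `max_length` limits the prompt and the generated tokens together, and the prompt is removed from the decoded text by character count. The sketch below illustrates both with hypothetical helpers (`new_token_budget`, `strip_prompt`); the prompt token count of 11 is an assumed value for illustration, not a measured one.

```python
def new_token_budget(prompt_token_count, max_length):
    """Tokens left for generation when max_length covers prompt + output."""
    return max(max_length - prompt_token_count, 0)


def strip_prompt(generated_text_with_prompt, prompt):
    """Remove the prompt only if the decoded text really starts with it."""
    if generated_text_with_prompt.startswith(prompt):
        return generated_text_with_prompt[len(prompt):]
    return generated_text_with_prompt


# Assumed prompt length of 11 tokens, with max_length=100 as in the cell above.
budget = new_token_budget(prompt_token_count=11, max_length=100)

prompt = "Why is sustainability important?"
decoded = prompt + "Sustainability is crucial"
answer = strip_prompt(decoded, prompt)
print(budget, repr(answer))
```

The `startswith` check is a small safety margin over plain slicing: if decoding ever alters the prompt text, slicing by length would silently cut into the answer.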


## Train

In [21]:
from transformers import TrainingArguments, Trainer

In [22]:
max_steps = 30

trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name

In [23]:
training_args = TrainingArguments(

  # Learning rate
  learning_rate=1.0e-5,

  # Number of training epochs
  num_train_epochs=1,

  # Max steps to train for (each step is one optimizer update,
  # which covers gradient_accumulation_steps batches)
  # Overrides num_train_epochs, if not -1
  max_steps=max_steps,

  # Batch size for training
  per_device_train_batch_size=1,

  # Directory to save model checkpoints
  output_dir=output_dir,

  # Other arguments
  overwrite_output_dir=False, # Overwrite the content of the output directory
  disable_tqdm=False, # Disable progress bars
  eval_steps=120, # Number of update steps between two evaluations
  save_steps=120, # After # steps model is saved
  warmup_steps=1, # Number of warmup steps for learning rate scheduler
  per_device_eval_batch_size=1, # Batch size for evaluation
  evaluation_strategy="steps",
  save_strategy="steps",
  logging_strategy="steps",
  logging_steps=1,
  optim="adafactor",
  gradient_accumulation_steps=4,
  gradient_checkpointing=False,

  # Parameters for early stopping
  load_best_model_at_end=True,
  save_total_limit=1,
  metric_for_best_model="eval_loss",
  greater_is_better=False
)



trainer = Trainer(
    model=base_model,
    # model_flops=model_flops,
    # total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)


max_steps is given, it will override any value given in num_train_epochs
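The settings above interact in a way that is worth working out by hand. The sketch below is plain arithmetic on the values from this notebook, assuming a single device; it ignores how partial accumulation windows at epoch boundaries are handled, so the number of passes is approximate.

```python
# Values taken from the TrainingArguments and the train split above.
max_steps = 30
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
eval_steps = 120
train_rows = 9
num_devices = 1  # assumption: single CPU or GPU

# Examples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples seen over the whole run, and rough passes over the data.
examples_seen = max_steps * effective_batch_size
approx_passes = examples_seen / train_rows

# How many times a step count that is a multiple of eval_steps is reached.
periodic_evals = max_steps // eval_steps

print(effective_batch_size, examples_seen, round(approx_passes, 2), periodic_evals)
```

Under these assumptions the model sees roughly 13 passes over 9 rows, and because `eval_steps` and `save_steps` (120) exceed `max_steps` (30), no step-based evaluation or checkpoint appears to be triggered during the run.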


In [24]:
# Uncomment the next line 
# training_output = trainer.train()

**Save Model**

In [25]:
# Uncomment the next 3 lines
# save_dir = f'{output_dir}/final'
# trainer.save_model(save_dir)
# print("Saved model to:", save_dir)

**Load Model**

In [26]:
max_steps = 30
device_count = torch.cuda.device_count()
if device_count > 0:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name
save_dir = f'{output_dir}/final'



finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)
finetuned_slightly_model.to(device) 

loading configuration file lamini_docs_30_steps/final\config.json
Model config GPTNeoXConfig {
  "_name_or_path": "lamini_docs_30_steps/final",
  "architectures": [
    "GPTNeoXForCausalLM"
  ],
  "bos_token_id": 0,
  "eos_token_id": 0,
  "hidden_act": "gelu",
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neox",
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "rotary_emb_base": 10000,
  "rotary_pct": 0.25,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "use_cache": true,
  "use_parallel_residual": true,
  "vocab_size": 50304
}

loading weights file lamini_docs_30_steps/final\pytorch_model.bin
Generate config GenerationConfig {
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.26.1"
}

All model checkpoint weights were used when initializing GPTNeoXForCausalLM.

All the weights of GPTNeoXForCa

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (layers): ModuleList(
      (0): GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
          (act): GELUActivation()
        )
      )
      (1): GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (

## Inference

In [27]:
# Prediction
test_question = test_dataset[0]['prompt']
print("Question input (test):", test_question)

# Tokenize
input_ids = tokenizer.encode(
      test_question,
      return_tensors="pt",
      truncation=True,
      max_length=max_input_tokens
)

# Generate
device = finetuned_slightly_model.device
generated_tokens_with_prompt = finetuned_slightly_model.generate(input_ids=input_ids.to(device), max_length=max_output_tokens)

# Decode
generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

# Strip the prompt
generated_text_answer = generated_text_with_prompt[0][len(test_question):]


print(f"Correct answer from docs: {test_dataset[0]['completion']}")
print(" ")
print("Model's answer: ")
print(generated_text_answer)


Generate config GenerationConfig {
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.26.1"
}


Question input (test): What is the triple bottom line concept in sustainability?
Correct answer from docs: The triple bottom line concept in sustainability evaluates the performance of organizations or projects based on three criteria: social, environmental, and economic, emphasizing a holistic approach to success.
 
Model's answer: 
Sustainability is a major challenge in sustainability as it provides a means for reducing waste, recycling, and conserving resources.Renewable energy sources like solar, wind, and geothermal energy play a vital role in sustainability as it provides a means for reducing waste, recycling, and conserving resources.Sustainable sustainability is essential for sustainability as it provides a means for reducing waste, recycling, recycling, and conserving resources.Sustainable sustainability is


In [28]:
# Prediction for a new question that is not in the dataset
test_question = "Can you describe sustainability"
print("Question input (test):", test_question)

# Tokenize
input_ids = tokenizer.encode(
      test_question,
      return_tensors="pt",
      truncation=True,
      max_length=max_input_tokens
)

# Generate
device = finetuned_slightly_model.device
generated_tokens_with_prompt = finetuned_slightly_model.generate(input_ids=input_ids.to(device), max_length=max_output_tokens)

# Decode
generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

# Strip the prompt
generated_text_answer = generated_text_with_prompt[0][len(test_question):]
print("Model's answer: ")
print(generated_text_answer)


Question input (test): Can you describe sustainability
Model's answer: 
 as a concept, considering the sustainability of sustainability as a concept in sustainability.A sustainability concept is a concept that aims to minimize waste, promote sustainability, promote sustainability, and promote sustainability by designing, preserving, and conserving resources for human survival.Sustainable sustainability is a concept that aims to minimize waste, promote sustainability, promote sustainability, and promote sustainability by designing, preserving, and conserving resources for human survival.Sustainable sustainability is a concept that aims to minimize waste, promote


## Evaluate

In [29]:
def generate_output(test_question, model):

    # Tokenize
    input_ids = tokenizer.encode(
          test_question,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(input_ids=input_ids.to(device), max_length=max_output_tokens)

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(test_question):]
    return generated_text_answer

In [30]:
# Check that the function works
test_question = test_dataset[0]['prompt']
print('Fine tuned model: ', generate_output(test_question, finetuned_slightly_model))
print('Base model: ', generate_output(test_question, base_model))
print('Actual Answer: ', test_dataset[0]['completion'])


Fine tuned model:  Sustainability is a major challenge in sustainability as it provides a means for reducing waste, recycling, and conserving resources.Renewable energy sources like solar, wind, and geothermal energy play a vital role in sustainability as it provides a means for reducing waste, recycling, and conserving resources.Sustainable sustainability is essential for sustainability as it provides a means for reducing waste, recycling, recycling, and conserving resources.Sustainable sustainability is
Base model:  

A:

The answer is that you need to be able to use the following two concepts:

The first is the definition of the bottom line concept.
The second is the definition of the bottom line concept.
The third is the definition of the bottom line concept.
The fourth is the definition of the bottom line concept.
The fourth is the definition of the bottom line concept.
The fifth is the definition of
Actual Answer:  The triple bottom line concept in sustainability evaluates the pe

In [31]:
# Collect the predictions
tuned_predicted_text_list = []
actual_test_list = []
base_predicted_text_list = []
for i in range(len(test_dataset)):
    test_q = test_dataset[i]['prompt']
    completion_q = test_dataset[i]['completion']
    predicted_text = generate_output(test_q, finetuned_slightly_model)
    base_predicted_text = generate_output(test_q, base_model)
    actual_test_list.append(completion_q)
    tuned_predicted_text_list.append(predicted_text)
    base_predicted_text_list.append(base_predicted_text)


**Calculate the BLEU score**

In [32]:
# !pip install evaluate

import evaluate
bleu = evaluate.load("bleu")

results = bleu.compute(predictions=base_predicted_text_list, references=actual_test_list)
print("Base Model Predictions Results")
print(results)

results = bleu.compute(predictions=tuned_predicted_text_list, references=actual_test_list)
print("Fine Tuned Model Predictions Results")
print(results)

Base Model Predictions Results
{'bleu': 0.0, 'precisions': [0.11392405063291139, 0.02564102564102564, 0.012987012987012988, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 2.393939393939394, 'translation_length': 79, 'reference_length': 33}
Fine Tuned Model Predictions Results
{'bleu': 0.0, 'precisions': [0.1, 0.02531645569620253, 0.0, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 2.4242424242424243, 'translation_length': 80, 'reference_length': 33}
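Both scores are exactly 0.0 even though some unigram and bigram precisions are positive. A sketch of the standard unsmoothed BLEU combination shows why: the score is the brevity penalty times the geometric mean of the n-gram precisions, so a single zero precision zeroes the whole score. The helpers below are illustrative and assume no smoothing; they are not the `evaluate` library's code.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """1.0 when the prediction is longer than the reference, else exp(1 - 1/ratio)."""
    ratio = translation_length / reference_length
    return 1.0 if ratio > 1.0 else math.exp(1.0 - 1.0 / ratio)


def bleu_from_precisions(precisions, penalty):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    if min(precisions) <= 0:
        return 0.0  # one empty n-gram order is enough to zero the score
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return penalty * math.exp(log_mean)


# Precisions and lengths copied from the results printed above.
base_precisions = [0.11392405063291139, 0.02564102564102564, 0.012987012987012988, 0.0]
tuned_precisions = [0.1, 0.02531645569620253, 0.0, 0.0]

base_bleu = bleu_from_precisions(base_precisions, brevity_penalty(79, 33))
tuned_bleu = bleu_from_precisions(tuned_precisions, brevity_penalty(80, 33))

# Hypothetical precisions, only to show a non-zero case.
example_bleu = bleu_from_precisions([0.5, 0.5, 0.5, 0.5], 1.0)
print(base_bleu, tuned_bleu, round(example_bleu, 3))
```

With one short test example, a missing 4-gram match is very likely, so a BLEU of 0.0 here says little about either model.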


## Examine the Process

**How to read the jsonl file?**

In [33]:
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df

|   | prompt | completion |
|---|---|---|
| 0 | Why is sustainability important? | Sustainability is crucial because it ensures a... |
| 1 | How can individuals contribute to sustainability? | Individuals can contribute to sustainability b... |
| 2 | What are some sustainable practices in agricul... | Sustainable agriculture practices include crop... |
| 3 | Describe the concept of a circular economy. | A circular economy is an economic model that a... |
| 4 | How does climate change relate to sustainability? | Climate change is a major threat to sustainabi... |
| 5 | What are the benefits of sustainable transport... | Sustainable transportation, such as public tra... |
| 6 | Explain the role of renewable energy in sustai... | Renewable energy sources like solar, wind, and... |
| 7 | What is the triple bottom line concept in sust... | The triple bottom line concept in sustainabili... |
| 8 | How can businesses integrate sustainability in... | Businesses can integrate sustainability by ado... |
| 9 | Discuss the connection between biodiversity an... | Biodiversity is essential for sustainability a... |


**Turn the DataFrame into a dict**

In [34]:
examples = instruction_dataset_df.to_dict()
examples['prompt']

{0: 'Why is sustainability important?',
 1: 'How can individuals contribute to sustainability?',
 2: 'What are some sustainable practices in agriculture?',
 3: 'Describe the concept of a circular economy.',
 4: 'How does climate change relate to sustainability?',
 5: 'What are the benefits of sustainable transportation?',
 6: 'Explain the role of renewable energy in sustainability.',
 7: 'What is the triple bottom line concept in sustainability?',
 8: 'How can businesses integrate sustainability into their operations?',
 9: 'Discuss the connection between biodiversity and sustainability.'}

In [35]:
finetuning_dataset_loaded['prompt']

['Why is sustainability important?',
 'How can individuals contribute to sustainability?',
 'What are some sustainable practices in agriculture?',
 'Describe the concept of a circular economy.',
 'How does climate change relate to sustainability?',
 'What are the benefits of sustainable transportation?',
 'Explain the role of renewable energy in sustainability.',
 'What is the triple bottom line concept in sustainability?',
 'How can businesses integrate sustainability into their operations?',
 'Discuss the connection between biodiversity and sustainability.']
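The two outputs above hold the same prompts in different containers: `to_dict()` gives a dict keyed by row index, while the dataset column is a plain list. A minimal sketch of converting one into the other, using the first three prompts and illustrative variable names:

```python
# Index-keyed form, as returned by instruction_dataset_df.to_dict()['prompt'].
prompt_by_index = {
    0: 'Why is sustainability important?',
    1: 'How can individuals contribute to sustainability?',
    2: 'What are some sustainable practices in agriculture?',
}

# List form, as returned by finetuning_dataset_loaded['prompt'].
prompt_list = [prompt_by_index[i] for i in sorted(prompt_by_index)]

# And back again: enumerate rebuilds the index keys.
rebuilt = dict(enumerate(prompt_list))
print(prompt_list[0], rebuilt == prompt_by_index)
```

The list form is what `tokenize_function` relies on when it indexes a column with `[0]`.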