## Table of Contents
- Loading Dataset
- Testing Finetuned LLama2
- Testing Mistral
- Testing Phi-2
- Evaluating Models
- Task 2: Result Table
- Hyperparameter Evaluation
- Hyperparameter Testing of LLama2
- Hyperparameters of Mistral
- Hyperparameters of Phi-2
- Task 3: Result Table

In [1]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
%pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    %pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    %pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

In [2]:
%pip install evaluate
%pip install rouge_score
%pip install bert_score

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


## Loading Dataset

In [3]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset


Dataset({
    features: ['output', 'input', 'instruction'],
    num_rows: 51760
})

In [4]:
import random

random.seed(1)

test_indexes = random.sample(range(len(dataset)), 20)
test_indexes

[8805,
 37303,
 50054,
 4135,
 16716,
 7727,
 32468,
 49870,
 29457,
 30949,
 42702,
 24878,
 51689,
 13759,
 6151,
 31972,
 1857,
 25546,
 28361,
 39809]
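
Fixing the seed is what lets all three models be scored on the same 20 rows. The sketch below illustrates that property with a hypothetical `sample_test_indexes` helper (not part of the notebook) and the dataset size reported above. It uses a local `random.Random` instance so the global generator state is left alone.

```python
import random

def sample_test_indexes(num_rows, k=20, seed=1):
    """Draw k distinct row indexes reproducibly (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(range(num_rows), k)

first = sample_test_indexes(51760)
second = sample_test_indexes(51760)

# Same seed, same rows; sampling is without replacement, so no duplicates.
print(first == second, len(set(first)))
```

Re-running the selection in a fresh kernel should therefore pick the same rows, provided the dataset size and seed are unchanged.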

## Testing Finetuned LLama2

In [8]:
from unsloth import FastLanguageModel

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "javijer/llama2-alpaca",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
FastLanguageModel.for_inference(model)

==((====))==  Unsloth: Fast Llama patching release 2024.3
   \\   /|    GPU: GRID A100X-40C. Max memory: 39.996 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Unsloth 2024.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [9]:
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

In [10]:
test_dataset = dataset.select(test_indexes)
test_dataset = test_dataset.map(formatting_prompts_func, batched = True,)

Map: 100%|██████████| 20/20 [00:00<00:00, 3023.58 examples/s]


In [12]:
responses = []

for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
  prompt = alpaca_prompt.format(
    instruction,
    input,
    "",
  )

  inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, do_sample=True)
  response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
  responses.append(response.replace(prompt, ''))

responses[0]

'According to Box Office Mojo, the highest-grossing film of 2020 was "The Witches," which was released in October and grossed over $26 million worldwide. However, the film was released in a limited theatrical run and was mostly available for streaming on HBO Max, so it is unclear how much of that revenue came from theatrical ticket sales.'
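
The loop above isolates the answer with `response.replace(prompt, '')`, which relies on the decoded text reproducing the prompt character for character. As a hedged alternative, the sketch below strips the prompt as a prefix and falls back to splitting on the `### Response:` marker. `extract_response` is a hypothetical helper, and the short prompt is a stand-in for the full Alpaca template.

```python
def extract_response(decoded: str, prompt: str, marker: str = "### Response:") -> str:
    """Return only the generated continuation from a decoded sequence."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback: keep whatever follows the last response marker.
    return decoded.rsplit(marker, 1)[-1].strip()

# Shortened stand-in for the Alpaca prompt used in this notebook.
prompt = "### Instruction:\nSay hi\n\n### Input:\n\n\n### Response:\n"
decoded = prompt + "Hello there!"
print(extract_response(decoded, prompt))
```

Stripping whitespace here also matches what the Phi-2 cell does with `.strip()`, which keeps the three models' outputs comparable.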

In [13]:
model_predictions = {"llama2": responses}

## Testing Mistral

In [14]:
from unsloth import FastLanguageModel

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "javijer/mistral-alpaca",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )

FastLanguageModel.for_inference(model)

==((====))==  Unsloth: Fast Mistral patching release 2024.3
   \\   /|    GPU: GRID A100X-40C. Max memory: 39.996 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


In [15]:
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

In [16]:
test_dataset = dataset.select(test_indexes)
test_dataset = test_dataset.map(formatting_prompts_func, batched = True,)

In [17]:
responses = []

for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
  prompt = alpaca_prompt.format(
    instruction,
    input,
    "",
  )

  inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, do_sample=True)
  response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
  responses.append(response.replace(prompt, ''))

responses[0]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

'There were many popular movies released in 2020, but since popularity is a subjective term, it is difficult to say which one was the most popular. Some of the notable movies of 2020 include "Tenet", "Sound of Metal", "The Trial of the Chicago 7", "I Care a Lot", "Judas and the Black Messiah", etc. However, many people consider "Tenet" to be the most popular because of its high commercial success, worldwide recognition, and critical acclaim. To provide the most accurate information, we suggest checking the sources below for more information:\n\n1. [List of most popular 2020 movies - Box Office Mojo](https://www.boxofficemojo.com/year/?yearID=2020).\n2. [Top-grossing films of 2020](https://en.wikipedia.org/wiki/List_of_highest-grossing_films_of_2020).\n3. [Audiences around the world can\'t get enough of Meryl Streep in \'Let Them All Talk\'](https://www.boston.com/arts-entertainment-lifestyle/movies-tv/2020/12/18/meryl-streep-apple-movie-let-them-all-talk/).'

In [18]:
model_predictions["mistral"] = responses

## Testing Phi-2

In [19]:
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True

model = AutoModelForCausalLM.from_pretrained(
    "javijer/phi2-alpaca",
    # max_seq_length = max_seq_length,
    # dtype = dtype,
    load_in_4bit = load_in_4bit,
)
tokenizer = AutoTokenizer.from_pretrained("javijer/phi2-alpaca")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Downloading shards: 100%|██████████| 2/2 [01:47<00:00, 53.52s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.13s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [20]:
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

In [21]:
test_dataset = dataset.select(test_indexes)
test_dataset = test_dataset.map(formatting_prompts_func, batched = True,)

Map: 100%|██████████| 20/20 [00:00<00:00, 3991.72 examples/s]


In [22]:
responses = []

for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
  prompt = alpaca_prompt.format(
    instruction,
    input,
    "",
  )

  inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, do_sample=True)
  response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
  responses.append(response.replace(prompt, '').strip())

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [23]:
responses

['###',
 "|Instruction|Input|Response|\n|:---:|:---:|:---:|\n|Instructions in English|Excessive cell phone use|Excessive cell phone use can lead to sleep deprivation, social isolation, and negative impacts on one's mental health.|\n|Incorporating visual aids|A visual representation of a cell-like structure showing negative impacts of excessive cell phone use|The visual aids can help the person better visualize and understand the negative effects of excessive cell phone use, making it easier for them to remember the negative impacts.|\n|Using active learning methods|Using real-world examples| By using real-world examples of the negative effects of excessive cell phone use, such as decreased productivity and social isolation, you can create a more effective learning experience.|",
 'A wide variety of materials can be considered magnetic, including metals such as iron, nickel, and cobalt, as well as some hard rocks like magnetite. There are also a few types of plastic and ceramics which a

In [24]:
model_predictions["phi2"] = responses

## Evaluating Models

In [25]:
import evaluate

# Note: Not evaluating CodeBLEU because this is not code
references = test_dataset['output']
scores = {
    "Model Name": ["LLama2", "Mistral", "Phi-2"],
    "BLEU": [],
    "Rouge-L": [],
    "BERTScore": [],
    "Human Evaluation": []
}

In [26]:
model_predictions.keys()

dict_keys(['llama2', 'mistral', 'phi2'])

In [27]:
# BLEU Score
bleu = evaluate.load("bleu")

for predictions in model_predictions.values():
  results = bleu.compute(predictions=predictions, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))
  print(results)


{'bleu': 0.09936880781026783, 'precisions': [0.5493945188017846, 0.22401549386701097, 0.11510791366906475, 0.06688741721854305], 'brevity_penalty': 0.5663658804742363, 'length_ratio': 0.6375457131247461, 'translation_length': 1569, 'reference_length': 2461}
{'bleu': 0.08075830955136315, 'precisions': [0.42095416276894293, 0.14305949008498584, 0.0548141086749285, 0.02358036573628489], 'brevity_penalty': 0.8597825489632324, 'length_ratio': 0.8687525396180414, 'translation_length': 2138, 'reference_length': 2461}
{'bleu': 0.04755541787453927, 'precisions': [0.34599406528189913, 0.1021021021021021, 0.04012158054711246, 0.02276923076923077], 'brevity_penalty': 0.6309465513970726, 'length_ratio': 0.6846810239739943, 'translation_length': 1685, 'reference_length': 2461}
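
The dictionaries above contain everything needed to see why the scores are low: BLEU is the geometric mean of the four n-gram precisions multiplied by a brevity penalty, and the penalty is substantial because the predictions are shorter than the references. A minimal sketch that recomputes the first row (LLama2) from its printed components, using only the standard library:

```python
import math

# Values copied from the first result dict printed above.
precisions = [0.5493945188017846, 0.22401549386701097,
              0.11510791366906475, 0.06688741721854305]
translation_length, reference_length = 1569, 2461

# Brevity penalty: 1 if the output is at least as long as the reference,
# otherwise exp(1 - reference_length / translation_length).
if translation_length >= reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the 1- to 4-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu_score = brevity_penalty * geo_mean
print(round(brevity_penalty, 3), round(bleu_score, 3))
```

Without the length penalty the LLama2 score would be roughly 0.175 rather than 0.099, so part of the gap between models reflects response length rather than n-gram overlap alone.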





In [28]:
# Rouge-L Score
rouge = evaluate.load('rouge')

for predictions in model_predictions.values():
  results = rouge.compute(predictions=predictions, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))
  print(results)

{'rouge1': 0.3952062028842738, 'rouge2': 0.16487606168009816, 'rougeL': 0.2855462642740738, 'rougeLsum': 0.3299770931941084}
{'rouge1': 0.37302210597060537, 'rouge2': 0.14881661717313946, 'rougeL': 0.25735776979810193, 'rougeLsum': 0.30475258769901736}
{'rouge1': 0.29851285511449527, 'rouge2': 0.09325615855970326, 'rougeL': 0.19999278244929797, 'rougeLsum': 0.23694337681415786}
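
For intuition about the `rougeL` column, the sketch below computes an F-measure from the longest common subsequence (LCS) of two token lists. It assumes plain lowercase whitespace tokenization, so it will not reproduce the `rouge_score` package exactly; `lcs_length` and `rouge_l_f1` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """LCS-based F1 with naive whitespace tokenization (an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat is on the mat", "the cat sat on the mat"))
```

Because the LCS ignores gaps but preserves order, a prediction that covers the same facts in a different order scores lower than one that follows the reference's structure.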


In [29]:
[(i, out) for i, out in enumerate(model_predictions['phi2']) if out == '']

[]

In [30]:
# BERTScore Score
import numpy as np

bertscore = evaluate.load('bertscore')


for predictions in model_predictions.values():
  results = bertscore.compute(predictions=predictions, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))
  print(result)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


precision: 0.899, recall: 0.873, f1: 0.885
precision: 0.883, recall: 0.881, f1: 0.882
precision: 0.864, recall: 0.85, f1: 0.856


In [31]:
# Llama2 Human Evaluation

print("Evaluating LLama2 Predictions:")
for i, (instruction, input, prediction, reference) in enumerate(zip(test_dataset['instruction'], test_dataset['input'], model_predictions['llama2'], references)):
  print(f"\n******************* Sample {i} *******************\n")
  print("Instructions:", instruction)
  print("Input:", input, "\n")
  print("Prediction:", prediction, "\n")
  print("Reference:", reference)


Evaluating LLama2 Predictions:

******************* Sample 0 *******************

Instructions: What was the most popular movie in 2020?
Input:  

Prediction: According to Box Office Mojo, the highest-grossing film of 2020 was "The Witches," which was released in October and grossed over $26 million worldwide. However, the film was released in a limited theatrical run and was mostly available for streaming on HBO Max, so it is unclear how much of that revenue came from theatrical ticket sales. 

Reference: The most popular movie of 2020 was "Bad Boys for Life" starring Will Smith and Martin Lawrence. It was the highest-grossing film of 2020, making $426.5 million worldwide. However, it's important to note due to the COVID-19 pandemic, the box office for many films was severely impacted and a majority of productions were delayed or released through digital platforms.

******************* Sample 1 *******************

Instructions: Name three negative effects of excessive cellphone use.


In [32]:
# Each tuple: (grammatical correctness, coherence, answer correctness), each scored from 0 to 1
human_scores = [(1, 1, 0.5), (1, 1, 1), (1, 1, 0.5), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0.4), (1, 1, 0.7), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0.9), (1, 1, 0.7), (1, 1, 1)]

avg_score = 0
for evals in human_scores:
  avg_score += sum(evals) / 3

avg_score = avg_score / len(human_scores)
avg_score

0.9616666666666666
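
The single average hides which criterion loses points. A small sketch that breaks the same LLama2 ratings down per criterion; the criterion names are labels chosen here for illustration.

```python
from statistics import mean

# LLama2 ratings copied from the cell above:
# (grammatical correctness, coherence, answer correctness)
human_scores = [(1, 1, 0.5), (1, 1, 1), (1, 1, 0.5), (1, 1, 1), (1, 1, 1),
                (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
                (1, 1, 0.4), (1, 1, 0.7), (1, 1, 1), (1, 1, 1), (1, 1, 1),
                (1, 1, 1), (1, 1, 1), (1, 1, 0.9), (1, 1, 0.7), (1, 1, 1)]

criteria = ("grammar", "coherence", "correctness")
# zip(*human_scores) transposes the rows into one column per criterion.
per_criterion = {name: mean(col) for name, col in zip(criteria, zip(*human_scores))}
overall = mean(list(per_criterion.values()))

print(per_criterion, round(overall, 3))
```

Since every sample has the same three criteria, averaging the per-criterion means gives the same overall value as averaging the per-sample means computed above.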

In [33]:
scores['Human Evaluation'] = []

In [34]:
scores['Human Evaluation'].append(round(avg_score, 3))

In [35]:
# Mistral Human Evaluation

print("Evaluating Mistral Predictions:")
for i, (instruction, input, prediction, reference) in enumerate(zip(test_dataset['instruction'], test_dataset['input'], model_predictions['mistral'], references)):
  print(f"\n******************* Sample {i} *******************\n")
  print("Instructions:", instruction)
  print("Input:", input, "\n")
  print("Prediction:", prediction, "\n")
  print("Reference:", reference)


Evaluating Mistral Predictions:

******************* Sample 0 *******************

Instructions: What was the most popular movie in 2020?
Input:  

Prediction: There were many popular movies released in 2020, but since popularity is a subjective term, it is difficult to say which one was the most popular. Some of the notable movies of 2020 include "Tenet", "Sound of Metal", "The Trial of the Chicago 7", "I Care a Lot", "Judas and the Black Messiah", etc. However, many people consider "Tenet" to be the most popular because of its high commercial success, worldwide recognition, and critical acclaim. To provide the most accurate information, we suggest checking the sources below for more information:

1. [List of most popular 2020 movies - Box Office Mojo](https://www.boxofficemojo.com/year/?yearID=2020).
2. [Top-grossing films of 2020](https://en.wikipedia.org/wiki/List_of_highest-grossing_films_of_2020).
3. [Audiences around the world can't get enough of Meryl Streep in 'Let Them All Ta

In [36]:
# Each tuple: (grammatical correctness, coherence, answer correctness), each scored from 0 to 1
human_scores = [(1, 1, 0.5), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 1, 1), (1, 1, 0.9), (1, 1, 0.7), (1, 1, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0.7), (1, 1, 1)]

avg_score = 0
for evals in human_scores:
  avg_score += sum(evals) / 3

avg_score = avg_score / len(human_scores)
avg_score

0.9466666666666665

In [37]:
scores['Human Evaluation'].append(round(avg_score, 3))

In [38]:
# Phi-2 Human Evaluation

print("Evaluating Phi-2 Predictions:")
for i, (instruction, input, prediction, reference) in enumerate(zip(test_dataset['instruction'], test_dataset['input'], model_predictions['phi2'], references)):
  print(f"\n******************* Sample {i} *******************\n")
  print("Instructions:", instruction)
  print("Input:", input, "\n")
  print("Prediction:", prediction, "\n")
  print("Reference:", reference)


Evaluating Phi-2 Predictions:

******************* Sample 0 *******************

Instructions: What was the most popular movie in 2020?
Input:  

Prediction: ### 

Reference: The most popular movie of 2020 was "Bad Boys for Life" starring Will Smith and Martin Lawrence. It was the highest-grossing film of 2020, making $426.5 million worldwide. However, it's important to note due to the COVID-19 pandemic, the box office for many films was severely impacted and a majority of productions were delayed or released through digital platforms.

******************* Sample 1 *******************

Instructions: Name three negative effects of excessive cellphone use.
Input:  

Prediction: |Instruction|Input|Response|
|:---:|:---:|:---:|
|Instructions in English|Excessive cell phone use|Excessive cell phone use can lead to sleep deprivation, social isolation, and negative impacts on one's mental health.|
|Incorporating visual aids|A visual representation of a cell-like structure showing negative i

In [39]:
# Each tuple: (grammatical correctness, coherence, answer correctness), each scored from 0 to 1
human_scores = [(0.5, 1, 0), (1, 1, 1), (1, 1, 0.5), (1, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 1), (1, 1, 1), (0.5, 1, 0), (1, 1, 0.8), (1, 1, 1), (1, 1, 0.9), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0.5), (1, 1, 0.5), (1, 1, 1)]

avg_score = 0
for evals in human_scores:
  avg_score += sum(evals) / 3

avg_score = avg_score / len(human_scores)
avg_score

0.8700000000000001

In [40]:
scores['Human Evaluation'].append(round(avg_score, 3))

In [41]:
scores

{'Model Name': ['LLama2', 'Mistral', 'Phi-2'],
 'BLEU': [0.099, 0.081, 0.048],
 'Rouge-L': [0.286, 0.257, 0.2],
 'BERTScore': ['precision: 0.899, recall: 0.873, f1: 0.885',
  'precision: 0.883, recall: 0.881, f1: 0.882',
  'precision: 0.864, recall: 0.85, f1: 0.856'],
 'Human Evaluation': [0.962, 0.947, 0.87]}

In [42]:
# Keep only the F1 value; earlier runs stored BERTScore as a formatted string
for i, score in enumerate(scores['BERTScore']):
    if isinstance(score, str):
        scores['BERTScore'][i] = float(score.split("f1: ")[-1])

In [43]:
scores

{'Model Name': ['LLama2', 'Mistral', 'Phi-2'],
 'BLEU': [0.099, 0.081, 0.048],
 'Rouge-L': [0.286, 0.257, 0.2],
 'BERTScore': [0.885, 0.882, 0.856],
 'Human Evaluation': [0.962, 0.947, 0.87]}

## Task 2: Result Table

In [44]:
import pandas as pd

# Comparison Table
df = pd.DataFrame(scores)

print(df)

  Model Name   BLEU  Rouge-L  BERTScore  Human Evaluation
0     LLama2  0.099    0.286      0.885             0.962
1    Mistral  0.081    0.257      0.882             0.947
2      Phi-2  0.048    0.200      0.856             0.870


## Hyperparameter Evaluation

In [12]:
# Hyperparameters

temperatures = [0.001, 0.2, 0.4, 0.8]
top_k_values = [2, 10, 20, 40]
beam_sizes = [1, 3, 5, 10]

In [13]:
references = test_dataset['output']
model_names = ["LLama2", "Mistral", "Phi-2"]

row_names = []
for name in model_names:
  row_names.extend([f"{name} (Temperature={temp})" for temp in temperatures])
  row_names.extend([f"{name} (Top K={top_k})" for top_k in top_k_values])
  row_names.extend([f"{name} (Beam Size={beam_size})" for beam_size in beam_sizes])

scores = {
    "Model Name": row_names,
    "BLEU": [],
    "Rouge-L": [],
    "BERTScore": [],
    "Human Evaluation": []
}
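
Each row label corresponds to one set of keyword arguments passed to `model.generate`. The sketch below lays out that mapping explicitly, in the same order as `row_names`. `sweep_configs` is a hypothetical helper; the keyword names (`temperature`, `top_k`, `num_beams`, `do_sample`) are the ones used in the generation cells of this notebook.

```python
temperatures = [0.001, 0.2, 0.4, 0.8]
top_k_values = [2, 10, 20, 40]
beam_sizes = [1, 3, 5, 10]

def sweep_configs():
    """Yield (label, generate_kwargs) pairs, one per row of a model's results."""
    for temp in temperatures:
        yield f"Temperature={temp}", {"temperature": temp, "do_sample": True}
    for top_k in top_k_values:
        yield f"Top K={top_k}", {"top_k": top_k, "do_sample": True}
    for beam_size in beam_sizes:
        yield f"Beam Size={beam_size}", {"num_beams": beam_size, "do_sample": True}

configs = list(sweep_configs())
for label, kwargs in configs:
    print(label, kwargs)
```

Driving a single generation loop from such a list would avoid repeating the scoring code three times per model, at the cost of a slightly less linear notebook.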

In [47]:
scores

{'Model Name': ['LLama2 (Temperature=0.001)',
  'LLama2 (Temperature=0.2)',
  'LLama2 (Temperature=0.4)',
  'LLama2 (Temperature=0.8)',
  'LLama2 (Top K=2)',
  'LLama2 (Top K=10)',
  'LLama2 (Top K=20)',
  'LLama2 (Top K=40)',
  'LLama2 (Beam Size=1)',
  'LLama2 (Beam Size=3)',
  'LLama2 (Beam Size=5)',
  'LLama2 (Beam Size=10)',
  'Mistral (Temperature=0.001)',
  'Mistral (Temperature=0.2)',
  'Mistral (Temperature=0.4)',
  'Mistral (Temperature=0.8)',
  'Mistral (Top K=2)',
  'Mistral (Top K=10)',
  'Mistral (Top K=20)',
  'Mistral (Top K=40)',
  'Mistral (Beam Size=1)',
  'Mistral (Beam Size=3)',
  'Mistral (Beam Size=5)',
  'Mistral (Beam Size=10)',
  'Phi-2 (Temperature=0.001)',
  'Phi-2 (Temperature=0.2)',
  'Phi-2 (Temperature=0.4)',
  'Phi-2 (Temperature=0.8)',
  'Phi-2 (Top K=2)',
  'Phi-2 (Top K=10)',
  'Phi-2 (Top K=20)',
  'Phi-2 (Top K=40)',
  'Phi-2 (Beam Size=1)',
  'Phi-2 (Beam Size=3)',
  'Phi-2 (Beam Size=5)',
  'Phi-2 (Beam Size=10)'],
 'BLEU': [],
 'Rouge-L': [],
 '

## Hyperparameter Testing of LLama2

In [48]:
from unsloth import FastLanguageModel

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "javijer/llama2-alpaca",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
FastLanguageModel.for_inference(model)

==((====))==  Unsloth: Fast Llama patching release 2024.3
   \\   /|    GPU: GRID A100X-40C. Max memory: 39.996 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


In [49]:
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

In [50]:
test_dataset = dataset.select(test_indexes)
test_dataset = test_dataset.map(formatting_prompts_func, batched = True,)

In [51]:
import evaluate, numpy as np

bleu = evaluate.load("bleu")
rouge = evaluate.load('rouge')
bertscore = evaluate.load('bertscore')

In [52]:
# Testing temperature

for temperature in temperatures:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, temperature = temperature, do_sample=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
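
To see what the temperature sweep changes, recall that sampling divides the logits by the temperature before the softmax. The sketch below applies that to a made-up three-token distribution; the logits are hypothetical and chosen only to show the trend.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
for t in [0.001, 0.2, 0.4, 0.8]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.001 practically all probability mass sits on the top token, so that setting behaves almost like greedy decoding even though `do_sample=True`; higher values spread the mass and make outputs more varied.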


In [53]:
# Testing top_k

for top_k in top_k_values:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, top_k = top_k, do_sample=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))
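
The `top_k` argument restricts sampling to the k most likely tokens at each step. A minimal sketch of that filtering step on hypothetical logits, written with the standard library rather than the tensor operations a real implementation would use:

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and renormalise; all other tokens get probability 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in keep)
    weights = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]

logits = [3.0, 2.5, 1.0, 0.2, -1.0]  # hypothetical next-token logits
probs = top_k_filter(logits, 2)

rng = random.Random(0)
token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print([round(p, 3) for p in probs], token)
```

With `top_k=2` only two continuations are ever possible per step, which is why small values tend to give more conservative text than `top_k=40`.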



In [54]:
# Testing beam_size

for beam_size in beam_sizes:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, num_beams = beam_size, do_sample=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))
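
`num_beams` controls how many partial sequences are kept alive at each step. The sketch below shows the deterministic form of beam search on a hypothetical two-step toy model. Note that the cell above also sets `do_sample=True`, which, as far as the Transformers documentation describes it, selects a sampled variant of beam search when `num_beams > 1`; the toy example does not model that sampling.

```python
import math

# Hypothetical toy model: next-token probabilities given the tokens so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams, steps=2):
    """Return the best (sequence, log-probability) found with num_beams hypotheses."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + math.log(p))
            for seq, score in beams
            for tok, p in TOY_MODEL[seq].items()
        ]
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

for n in (1, 3):
    seq, score = beam_search(n)
    print(n, seq, round(math.exp(score), 2))
```

With one beam the search commits to "A" because it is the likeliest first token, and ends with a sequence probability of 0.30; with three beams it keeps "B" alive long enough to find the 0.36 sequence.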



In [55]:
scores

{'Model Name': ['LLama2 (Temperature=0.001)',
  'LLama2 (Temperature=0.2)',
  'LLama2 (Temperature=0.4)',
  'LLama2 (Temperature=0.8)',
  'LLama2 (Top K=2)',
  'LLama2 (Top K=10)',
  'LLama2 (Top K=20)',
  'LLama2 (Top K=40)',
  'LLama2 (Beam Size=1)',
  'LLama2 (Beam Size=3)',
  'LLama2 (Beam Size=5)',
  'LLama2 (Beam Size=10)',
  'Mistral (Temperature=0.001)',
  'Mistral (Temperature=0.2)',
  'Mistral (Temperature=0.4)',
  'Mistral (Temperature=0.8)',
  'Mistral (Top K=2)',
  'Mistral (Top K=10)',
  'Mistral (Top K=20)',
  'Mistral (Top K=40)',
  'Mistral (Beam Size=1)',
  'Mistral (Beam Size=3)',
  'Mistral (Beam Size=5)',
  'Mistral (Beam Size=10)',
  'Phi-2 (Temperature=0.001)',
  'Phi-2 (Temperature=0.2)',
  'Phi-2 (Temperature=0.4)',
  'Phi-2 (Temperature=0.8)',
  'Phi-2 (Top K=2)',
  'Phi-2 (Top K=10)',
  'Phi-2 (Top K=20)',
  'Phi-2 (Top K=40)',
  'Phi-2 (Beam Size=1)',
  'Phi-2 (Beam Size=3)',
  'Phi-2 (Beam Size=5)',
  'Phi-2 (Beam Size=10)'],
 'BLEU': [0.114,
  0.12,
  0.12

In [65]:
import json

with open('data.json', 'w') as f:
    json.dump(scores, f)
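
Saving the partial results makes it possible to restart the kernel between models, which is what the out-of-order cell numbers that follow suggest happened. A hedged sketch of the round trip, with hypothetical helper names and a two-value excerpt standing in for the real `scores` dictionary:

```python
import json
import os
import tempfile

def save_scores(scores, path):
    """Write the scores dictionary to disk as JSON."""
    with open(path, "w") as f:
        json.dump(scores, f, indent=2)

def load_scores(path):
    """Read a scores dictionary back from disk."""
    with open(path) as f:
        return json.load(f)

def filled_rows(scores):
    """How many results each metric column holds so far."""
    return {k: len(v) for k, v in scores.items() if k != "Model Name"}

# Small stand-in for the real dictionary (only the first two BLEU values).
example = {"Model Name": ["LLama2 (Temperature=0.001)", "LLama2 (Temperature=0.2)"],
           "BLEU": [0.114, 0.12], "Rouge-L": [], "BERTScore": [], "Human Evaluation": []}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    save_scores(example, path)
    restored = load_scores(path)

print(restored == example, filled_rows(restored))
```

Checking `filled_rows` after reloading is a cheap way to confirm how far the sweep has progressed before appending another model's results.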

In [30]:
import json

# Use a context manager so the file handle is closed after loading
with open('data.json') as f:
    old_scores = json.load(f)
old_scores

{'Model Name': ['LLama2 (Temperature=0.001)',
  'LLama2 (Temperature=0.2)',
  'LLama2 (Temperature=0.4)',
  'LLama2 (Temperature=0.8)',
  'LLama2 (Top K=2)',
  'LLama2 (Top K=10)',
  'LLama2 (Top K=20)',
  'LLama2 (Top K=40)',
  'LLama2 (Beam Size=1)',
  'LLama2 (Beam Size=3)',
  'LLama2 (Beam Size=5)',
  'LLama2 (Beam Size=10)',
  'Mistral (Temperature=0.001)',
  'Mistral (Temperature=0.2)',
  'Mistral (Temperature=0.4)',
  'Mistral (Temperature=0.8)',
  'Mistral (Top K=2)',
  'Mistral (Top K=10)',
  'Mistral (Top K=20)',
  'Mistral (Top K=40)',
  'Mistral (Beam Size=1)',
  'Mistral (Beam Size=3)',
  'Mistral (Beam Size=5)',
  'Mistral (Beam Size=10)',
  'Phi-2 (Temperature=0.001)',
  'Phi-2 (Temperature=0.2)',
  'Phi-2 (Temperature=0.4)',
  'Phi-2 (Temperature=0.8)',
  'Phi-2 (Top K=2)',
  'Phi-2 (Top K=10)',
  'Phi-2 (Top K=20)',
  'Phi-2 (Top K=40)',
  'Phi-2 (Beam Size=1)',
  'Phi-2 (Beam Size=3)',
  'Phi-2 (Beam Size=5)',
  'Phi-2 (Beam Size=10)'],
 'BLEU': [0.114,
  0.12,
  0.12

## Hyperparameters of Mistral

In [7]:
from unsloth import FastLanguageModel

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "javijer/mistral-alpaca",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        # device_map='auto'
    )

FastLanguageModel.for_inference(model)

==((====))==  Unsloth: Fast Mistral patching release 2024.3
   \\   /|    GPU: GRID A100X-40C. Max memory: 39.996 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth 2024.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [8]:
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Render each (instruction, input, output) triple with the Alpaca template."""
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    # input_text avoids shadowing the built-in input()
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

In [9]:
test_dataset = dataset.select(test_indexes)
test_dataset = test_dataset.map(formatting_prompts_func, batched = True,)

In [10]:
import evaluate, numpy as np

bleu = evaluate.load("bleu")
rouge = evaluate.load('rouge')
bertscore = evaluate.load('bertscore')

In [15]:
# Testing temperature

for temperature in temperatures:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, temperature = temperature, do_sample=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
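
The repeated `Setting pad_token_id to eos_token_id` message comes from `generate` when no padding token is configured. One way to silence it is to pass `pad_token_id` explicitly. The sketch below assumes the tokenizer exposes `eos_token_id`, as Hugging Face tokenizers do. The helper name `generation_kwargs` is hypothetical and not part of this notebook, and a stub tokenizer stands in so the snippet runs without a model.

```python
from types import SimpleNamespace


def generation_kwargs(tokenizer, **overrides):
    """Build keyword arguments for model.generate with an explicit pad token.

    Hypothetical helper: the defaults mirror the values used in the cells above.
    """
    kwargs = {
        "max_new_tokens": 512,
        "use_cache": True,
        "do_sample": True,
        # Reuse EOS as the padding token, which is what generate falls back to.
        "pad_token_id": tokenizer.eos_token_id,
    }
    # Per-run settings such as temperature, top_k or num_beams override the defaults.
    kwargs.update(overrides)
    return kwargs


# Stand-in tokenizer for illustration only; the notebook's tokenizer would be used instead.
stub_tokenizer = SimpleNamespace(eos_token_id=2)
example_kwargs = generation_kwargs(stub_tokenizer, temperature=0.4)
print(example_kwargs)
```

In the sweep cells this would be used as `model.generate(**inputs, **generation_kwargs(tokenizer, temperature=temperature))`.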

In [16]:
# Testing top_k

for top_k in top_k_values:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, top_k = top_k, do_sample=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [17]:
# Testing beam_size

for beam_size in beam_sizes:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

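    # Note: with do_sample=True, num_beams > 1 runs beam-search multinomial sampling,
    # not deterministic beam search, so results vary between runs.
    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, num_beams = beam_size, do_sample=True)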
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
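
The temperature, top_k and beam-size cells differ only in the keyword passed to `generate`, so the loop could be factored into one function. The following is a minimal sketch, not the notebook's own code: `sweep_setting`, `fake_generate` and `exact_match` are hypothetical names, and the stubs replace the model and the `evaluate` metrics so the example runs on its own.

```python
def sweep_setting(generate_fn, prompts, references, metric_fns, param_name, values):
    """Generate one response per prompt for each value and score every run.

    generate_fn(prompt, **kwargs) -> str wraps the model call.
    metric_fns maps a metric name to fn(predictions, references) -> float.
    """
    results = {name: [] for name in metric_fns}
    for value in values:
        # Only the swept keyword changes between runs.
        responses = [generate_fn(prompt, **{param_name: value}) for prompt in prompts]
        for name, metric_fn in metric_fns.items():
            results[name].append(round(metric_fn(responses, references), 3))
    return results


def fake_generate(prompt, **generation_kwargs):
    """Stub standing in for tokenizer + model.generate + decode."""
    return prompt.upper()


def exact_match(predictions, references):
    """Stub metric: fraction of predictions equal to their reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


demo = sweep_setting(
    fake_generate,
    prompts=["a", "b"],
    references=["A", "x"],
    metric_fns={"ExactMatch": exact_match},
    param_name="top_k",
    values=[2, 10],
)
print(demo)
```

With the real model, each entry of `metric_fns` would wrap a call such as `bleu.compute(...)["bleu"]`, and the returned lists could be appended to `scores`.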

## Hyperparameters of Phi-2

In [18]:
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

max_seq_length = 2048 # Not passed below: Phi-2 is loaded with transformers rather than Unsloth
dtype = None # Not passed below
load_in_4bit = True

model = AutoModelForCausalLM.from_pretrained(
    "javijer/phi2-alpaca",
    # max_seq_length = max_seq_length,
    # dtype = dtype,
    load_in_4bit = load_in_4bit,
)
tokenizer = AutoTokenizer.from_pretrained("javijer/phi2-alpaca")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.02it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [19]:
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Render each (instruction, input, output) triple with the Alpaca template."""
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    # input_text avoids shadowing the built-in input()
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

In [20]:
test_dataset = dataset.select(test_indexes)
test_dataset = test_dataset.map(formatting_prompts_func, batched = True,)

In [21]:
import evaluate, numpy as np

bleu = evaluate.load("bleu")
rouge = evaluate.load('rouge')
bertscore = evaluate.load('bertscore')

In [22]:
# Testing temperature

for temperature in temperatures:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, temperature = temperature, do_sample=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
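
`response.replace(prompt, '')` only removes the prompt if the decoded text reproduces it character for character. For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so an alternative is to drop the first `prompt_length` token ids before decoding. A minimal sketch with plain lists standing in for tensors; `strip_prompt_ids` is a hypothetical name and the token ids are made up for illustration.

```python
def strip_prompt_ids(output_ids, prompt_length):
    """Return only the newly generated token ids from each output sequence."""
    return [sequence[prompt_length:] for sequence in output_ids]


# Illustrative ids only: four prompt tokens followed by three generated tokens.
prompt_ids = [101, 7, 8, 9]
output_ids = [prompt_ids + [42, 43, 2]]

new_ids = strip_prompt_ids(output_ids, len(prompt_ids))
print(new_ids)
```

With tensors, the equivalent slice would be `outputs[:, inputs["input_ids"].shape[1]:]`, passed to `tokenizer.batch_decode(..., skip_special_tokens=True)`.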

In [23]:
# Testing top_k

for top_k in top_k_values:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, top_k = top_k, do_sample=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [24]:
# Testing beam_size

for beam_size in beam_sizes:
  responses = []

  # Generate Predictions
  for instruction, input in zip(test_dataset['instruction'], test_dataset['input']):
    prompt = alpaca_prompt.format(
      instruction,
      input,
      "",
    )

    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

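    # Note: with do_sample=True, num_beams > 1 runs beam-search multinomial sampling,
    # not deterministic beam search, so results vary between runs.
    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, num_beams = beam_size, do_sample=True)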
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    responses.append(response.replace(prompt, ''))

  # Calculate BLEU Score
  results = bleu.compute(predictions=responses, references=references)
  scores['BLEU'].append(round(results['bleu'], 3))

  # Calculate Rouge-L Score
  results = rouge.compute(predictions=responses, references=references)
  scores['Rouge-L'].append(round(results['rougeL'], 3))

  # Calculate BERTScore
  results = bertscore.compute(predictions=responses, references=references, lang="en")
  precision = np.average(results["precision"])
  recall = np.average(results["recall"])
  f1 = np.average(results["f1"])
  result = f"precision: {round(precision, 3)}, recall: {round(recall, 3)}, f1: {round(f1, 3)}"
  scores['BERTScore'].append(round(f1, 3))



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

## Task 3: Result Table

In [74]:
# Average Human Scores per Hyperparameter
scores['Human Evaluation'] = [0.989, 0.991, 0.993, 0.964, 0.972, 0.973, 0.987, 0.998, 0.996, 0.959, 0.949, 0.951, 0.998, 0.994, 0.994, 0.983, 0.993, 0.994, 0.958, 0.961, 0.979, 0.968, 0.975, 0.954, 0.958, 0.943, 0.953, 0.901, 0.943, 0.915, 0.901, 0.932, 0.926, 0.962, 0.961, 0.963]
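
Because the human scores are typed in by hand, a quick length check before building the table can catch a missing or extra value. `pd.DataFrame` raises a `ValueError` when the lists in a dict differ in length, but a dedicated check reports which column is off. The sketch below is an assumption-labelled example: `check_column_lengths` is a hypothetical helper, and `demo_scores` holds only the first two LLama2 rows of the table.

```python
def check_column_lengths(columns):
    """Return each column's length, raising if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Small stand-in for the notebook's scores dict (first two rows only).
demo_scores = {
    "Model Name": ["LLama2 (Temperature=0.001)", "LLama2 (Temperature=0.2)"],
    "BLEU": [0.114, 0.120],
    "Human Evaluation": [0.989, 0.991],
}
lengths = check_column_lengths(demo_scores)
print(lengths)
```

In the notebook this would be called as `check_column_lengths(scores)` just before `pd.DataFrame(scores)`.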

In [75]:
import pandas as pd

# Comparison Table
df = pd.DataFrame(scores)

print(df)

                     Model Name   BLEU  Rouge-L  BERTScore  Human Evaluation
0    LLama2 (Temperature=0.001)  0.114    0.288      0.884             0.989
1      LLama2 (Temperature=0.2)  0.120    0.285      0.887             0.991
2      LLama2 (Temperature=0.4)  0.121    0.290      0.890             0.993
3      LLama2 (Temperature=0.8)  0.099    0.226      0.877             0.964
4              LLama2 (Top K=2)  0.103    0.249      0.876             0.972
5             LLama2 (Top K=10)  0.113    0.241      0.879             0.973
6             LLama2 (Top K=20)  0.125    0.269      0.885             0.987
7             LLama2 (Top K=40)  0.104    0.318      0.895             0.998
8          LLama2 (Beam Size=1)  0.136    0.286      0.890             0.996
9          LLama2 (Beam Size=3)  0.081    0.243      0.862             0.959
10         LLama2 (Beam Size=5)  0.069    0.241      0.841             0.949
11        LLama2 (Beam Size=10)  0.076    0.229      0.854             0.951