**Team Programming Project: Building and Evaluating PolicyChat**

**Task**:

- Fine-tune the `meta-llama/Llama-2-7b-hf` model on a custom training dataset of question-answer pairs.
- Evaluate its performance on the test dataset using BLEU and ROUGE scores.

**Table of contents**:

- **Installs and imports**
- **Loading and pre-processing the dataset**
- **Fine-tuning the model**
  - Loading the quantized model and the tokenizer
  - Setting up the training config
  - Re-training and saving the model
- **Merging the original and new (re-trained) model**
- **Running the inference using the fine-tuned model**: the fine-tuned model is saved to Google Drive, so you can jump straight to this section to run inference

## **Installs and imports**

### Installing libraries/packages

In [None]:
!pip install "transformers==4.35" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" "tiktoken"

In [None]:
!pip install sacrebleu rouge_score

In [None]:
from    datasets import Dataset, load_dataset
import  gc
import  os
import  pandas as pd
from    peft import (
            LoraConfig,
            get_peft_model,
            AutoPeftModelForCausalLM,
            PeftModel,
            PeftConfig,)
from    pprint import pprint
from    random import randrange
import  torch
import  torch.nn as nn
from    tqdm.notebook import tqdm
import  transformers
from    transformers import (
            AutoTokenizer,
            AutoModelForCausalLM,
            BitsAndBytesConfig,
            TrainingArguments,
            LlamaForCausalLM,
            LlamaTokenizer,
            pipeline,
            logging)
from transformers.pipelines.pt_utils import KeyDataset
from    trl import SFTTrainer
import  yaml
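from    transformers import TextDataset, DataCollatorForLanguageModeling
from    datasets import load_metric
from    tqdm import tqdm  # plain-text progress bar; replaces the tqdm.notebook import above

# Evaluation metrics (load_metric is deprecated in newer `datasets` releases in favor of the `evaluate` library)
bleu  = load_metric("sacrebleu")
rouge = load_metric("rouge")

# Only show transformers log messages at CRITICAL level
logging.set_verbosity(logging.CRITICAL)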


In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


### Key configs and directories

In [None]:
# Hugging Face info
HF_TOKEN            = "name"  # with write permissions
MODEL_NAME_BASE     = 'meta-llama/Llama-2-7b-hf'
MODEL_NAME_FINETUNE = 'meta-llama/Llama-2-7b-hf-finetuned-policychat'
HF_MODEL_REPO       = "name" 

# Data info
DATA_DIR            = "/content/drive/MyDrive/94812 - NLX and LLM/Final project/finetuning-data"
OUTPUT_DIR          = "/content/drive/MyDrive/94812 - NLX and LLM/Final project/finetuning-output"
MERGED_MODEL_DIR    = OUTPUT_DIR + '/merged-model'
device_map          = {"": 0} # Load the entire model on the GPU 0

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## **Loading and pre-processing the dataset**

In [None]:
# Specify the path to the data
data_0_path = os.path.join(DATA_DIR, "finetune-dataset-0.csv") # Used as the training data for fine-tuning
data_1_path = os.path.join(DATA_DIR, "finetune-dataset-1.csv")
data_2_path = os.path.join(DATA_DIR, "finetune-dataset-2.csv")
data_3_path = os.path.join(DATA_DIR, "finetune-dataset-3.csv")
data_4_path = os.path.join(DATA_DIR, "finetune-dataset-4.csv")
data_5_path = os.path.join(DATA_DIR, "finetune-dataset-5.csv")

In [None]:
# Merge with prompting types
prompting_types = {
    "No Prompting": "",

    "Task instruction": """Give the most concise answers possible to questions about AI policy, considering you are an expert of AI policy. \n
    """,

    "One-shot prompting": """Give an answer to my query by modeling the following example:
My Query: Summarize the UK's legislation on AI.
Your Answer: The UK's legislation on AI is currently decentralized, with no specific comprehensive law governing AI. Instead, existing laws such as data protection legislation (e.g., the Data Protection Act 2018), equalities and privacy laws (e.g., the Equality Act 2010 and the Human Rights Act 1998), and intellectual property laws (e.g., the Copyright, Designs and Patents Act 1988) play a role in regulating various aspects of AI development and usage. These laws impact data collection, discrimination, human rights implications, intellectual property rights, and the limitations on AI decision-making and surveillance tools in the workplace.\n
    """,

    "Few-shot prompting": """Give an answer to my query by modeling the following examples:
Summarize the UK's legislation on AI: The UK's legislation on AI is currently decentralized, with no specific comprehensive law governing AI. Instead, existing laws such as data protection legislation (e.g., the Data Protection Act 2018), equalities and privacy laws (e.g., the Equality Act 2010 and the Human Rights Act 1998), and intellectual property laws (e.g., the Copyright, Designs and Patents Act 1988) play a role in regulating various aspects of AI development and usage. These laws impact data collection, discrimination, human rights implications, intellectual property rights, and the limitations on AI decision-making and surveillance tools in the workplace.
What is the predicted impact of generative AI on jobs?: LinkedIn's analysis predicts that the jobs of 55% of the platform’s users will be impacted in some way by the adoption of generative AI.
What is the regulation around training powerful AI models in Europe?: Regulations in Europe mandate that developers must include drafting technical documentation, adhere to EU copyright laws, and provide detailed summaries of the content used for training. Moreover, for high-impact general purpose AI models that pose systemic risks, additional obligations apply. These obligations entail conducting model evaluations, assessing and mitigating systemic risks, performing adversarial testing, reporting serious incidents to the Commission, ensuring cybersecurity measures, and reporting on energy efficiency.
What are some drawbacks to the CASC approach?: Its rulemakings are inherently retroactive, it does not broadly ensure algorithmic rights for ADSs that do not qualify as CASC ADSs, and it does not resolve capacity issues at federal agencies.
How is AI related to the United States' geopolitical relations with China?: The U.S.-China relationship looms large over AI governance: as Beijing pursues a national strategy aimed at making China the global leader in “AI theories, technologies, and applications” by 2030, policymakers in Washington are struggling with how to place guardrails around AI development without undermining the United States’ technological edge.\n
    """,

    "Prompt chaining": """First, analyze the keywords in the query. Secondly, decipher the purpose of the query. Don’t explicitly write these. Your final answer should be a maximum of 3 sentences long. The first sentence should summarize what the question is asking. The second sentence should give the main answer to the query. The third sentence can be an additional point if you think some information is very important to the query. Format all the sentences into a single paragraph.

    """,
    "Active prompting": """Think of 3 possible different answers to the query but do not output them all. Only output the answer that is the shortest, most concrete, relevant to the query, and easy to understand by a college student. Do not include the reason for your pick.\n
    """}

prompting_techniques = {"No Prompting": data_0_path,
                        "Task instruction": data_1_path,
                        "One-shot prompting": data_2_path,
                        "Few-shot prompting": data_3_path,
                        "Prompt chaining": data_4_path,
                        "Active prompting": data_5_path}
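The prefixes above are meant to be merged with each question. A minimal sketch of one way to do that, assuming the prefix is simply placed in front of the question. `build_prompt` and the shortened `example_prefixes` dictionary are hypothetical stand-ins for illustration and are not part of the original notebook:

```python
# Hypothetical helper: prepend a prompting-technique prefix to a question.
# `example_prefixes` is a shortened stand-in for the `prompting_types` dictionary.
example_prefixes = {
    "No Prompting": "",
    "Task instruction": "Give the most concise answers possible to questions about AI policy.\n",
}

def build_prompt(technique: str, question: str, prefixes: dict = example_prefixes) -> str:
    """Return the question with the prefix for the chosen technique in front."""
    prefix = prefixes[technique].strip()
    # With an empty prefix ("No Prompting") the question passes through unchanged
    return f"{prefix}\n{question}" if prefix else question

print(build_prompt("No Prompting", "What is the primary purpose of the EU AI Act?"))
print(build_prompt("Task instruction", "What is the primary purpose of the EU AI Act?"))
```

Keeping the prefix separate from the question makes it straightforward to generate one dataset per technique from the same list of questions.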

In [None]:
# Path to the full dataset
FINETUNE_DATA_PATH = prompting_techniques["No Prompting"]

# Load dataset
full_dataset = load_dataset('csv', data_files=FINETUNE_DATA_PATH)['train']

# Split dataset into train and test sets
split_dataset = full_dataset.train_test_split(test_size=0.2, seed=1)

# Assign train and test datasets
finetune_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

# Process to the expected format of Llama
"""
Dataset should be structured in a way that's compatible with the model's expected input format.
For a causal language model like Llama, the typical input is a sequence of text,
and the model predicts the next token in the sequence.
In a question-answering setup, we might want to concatenate the question and answer
with some separator to form this sequence.
"""
def preprocess_data(examples): # Concatenate question and answer with a separator (like "\n")
    return {'text': [q + "\n" + a for q, a in zip(examples['question'], examples['answer'])]}

# Apply the preprocessing function to the datasets
finetune_dataset = finetune_dataset.map(preprocess_data, batched=True)
eval_dataset = eval_dataset.map(preprocess_data, batched=True)

# Print shapes to verify
print("Finetune (Train) Dataset Shape:", finetune_dataset.shape)
print("Evaluation (Test) Dataset Shape:", eval_dataset.shape)
pprint(finetune_dataset)
pprint(eval_dataset)

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-643117e56e32fae8/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-643117e56e32fae8/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/333 [00:00<?, ? examples/s]

Map:   0%|          | 0/84 [00:00<?, ? examples/s]

Finetune (Train) Dataset Shape: (333, 3)
Evaluation (Test) Dataset Shape: (84, 3)
Dataset({
    features: ['question', 'answer', 'text'],
    num_rows: 333
})
Dataset({
    features: ['question', 'answer', 'text'],
    num_rows: 84
})


In [None]:
# Example
for i in finetune_dataset:
  print(i)
  break

{'question': "What's the role of good leadership?", 'answer': 'Setting ethical standards, risk mitigation, and public trust and accountability.', 'text': "What's the role of good leadership?\nSetting ethical standards, risk mitigation, and public trust and accountability."}


## **Fine-tuning the model**

### Loading the quantized model and the tokenizer

In [None]:
################################################################################
# Quantization - bitsandbytes parameters
################################################################################

use_4bit               = True # Activate 4-bit precision base model loading
bnb_4bit_compute_dtype = "float16" # Compute dtype for 4-bit base models
compute_dtype          = getattr(torch, bnb_4bit_compute_dtype) # Resolve the dtype name to a torch dtype (torch.float16)
bnb_4bit_quant_type    = "nf4" # Quantization type (fp4 or nf4)
use_nested_quant       = False # Activate nested quantization for 4-bit base models (double quantization)

bnb_config = BitsAndBytesConfig(load_in_4bit              = use_4bit,
                              bnb_4bit_quant_type       = bnb_4bit_quant_type,
                              bnb_4bit_compute_dtype    = compute_dtype,
                              bnb_4bit_use_double_quant = use_nested_quant)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
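To see why 4-bit loading matters on a single Colab GPU, here is a back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count of roughly 6.74 billion is an assumption about Llama-2-7b, and the estimate ignores quantization constants, layers kept in higher precision, activations, and optimizer state:

```python
# Rough weight-memory estimate; N_PARAMS is an assumed, approximate parameter count.
N_PARAMS = 6.74e9

def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{name:>12}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the float16 footprint, which is what leaves room for the LoRA adapters and activations during training.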

In [None]:
%%time
# Load the pretrained base model
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME_BASE,
                                             quantization_config = bnb_config,
                                             device_map         = device_map) #token = HF_TOKEN
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_BASE,
                                          trust_remote_code = True) #token = HF_TOKEN
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Downloading config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

CPU times: user 17 s, sys: 19.4 s, total: 36.4 s
Wall time: 58.9 s


### Setting up the training config

In [None]:
################################################################################
# QLoRA parameters
################################################################################

lora_alpha   = 16 # Alpha parameter for LoRA scaling
lora_dropout = 0.1 # Dropout probability for LoRA layers
lora_r       = 64 # LoRA attention dimension

peft_config = LoraConfig(lora_alpha   = lora_alpha,
                       lora_dropout = lora_dropout,
                       r            = lora_r,
                       bias         = "none",
                       task_type    = "CAUSAL_LM")

################################################################################
# TrainingArguments parameters
################################################################################

num_train_epochs              = 5 # Number of training epochs
fp16                          = True # Enable fp16 training
bf16                          = False # Enable bf16 training (set bf16 to True with an A100)
per_device_train_batch_size   = 1 # Batch size per GPU for training
per_device_eval_batch_size    = 1 # Batch size per GPU for evaluation
gradient_accumulation_steps   = 1 # Number of update steps to accumulate gradients over (previously 4)
gradient_checkpointing        = False # Enable gradient checkpointing
max_grad_norm                 = 0.3 # Maximum gradient norm (gradient clipping)
learning_rate                 = 2e-4 # Initial learning rate (AdamW optimizer)
weight_decay                  = 0.001 # Weight decay to apply to all layers except bias/LayerNorm weights
optim                         = "paged_adamw_32bit" # Optimizer to use
lr_scheduler_type             = "cosine" # Learning rate schedule
max_steps                     = -1 # Number of training steps (overrides num_train_epochs)
warmup_ratio                  = 0.03 # Ratio of steps for a linear warmup (from 0 to learning rate)
group_by_length               = False # Group sequences into batches with same length
save_steps                    = 0 # Save a checkpoint every X update steps
logging_steps                 = 25 # Log every X update steps
seed                          = 42

training_arguments = TrainingArguments(
    output_dir                        = OUTPUT_DIR,
    num_train_epochs                  = num_train_epochs,
    per_device_train_batch_size       = per_device_train_batch_size,
    gradient_accumulation_steps       = gradient_accumulation_steps,
    optim                             = optim,
    save_steps                        = save_steps,
    logging_steps                     = logging_steps,
    learning_rate                     = learning_rate,
    weight_decay                      = weight_decay,
    fp16                              = fp16,
    bf16                              = bf16,
    max_grad_norm                     = max_grad_norm,
    max_steps                         = max_steps,
    warmup_ratio                      = warmup_ratio,
    group_by_length                   = group_by_length,
    lr_scheduler_type                 = lr_scheduler_type,
    report_to                         = "tensorboard",
    seed                              = seed) # other: save_strategy="epoch", #epoch or steps; evaluation_strategy="epoch"

################################################################################
# SFT parameters
################################################################################

max_seq_length = 500 # Maximum sequence length to use
packing        = True # Pack multiple short examples in the same input sequence to increase efficiency
data_collator  = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=False)
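def compute_metrics(eval_preds):
    # Only called when evaluation is enabled (e.g. evaluation_strategy="epoch")
    import numpy as np
    logits, labels = eval_preds
    # The Trainer passes raw logits; take the arg-max token ID at each position
    predictions = np.argmax(logits, axis=-1)
    # Label positions set to -100 are ignored by the loss; map them to the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # sacrebleu expects a list of reference strings for each prediction
    bleu_score = bleu.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])
    rouge_score = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": bleu_score["score"], "rougeL": rouge_score["rougeL"].mid.fmeasure}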

trainer = SFTTrainer(model             = model,
                     train_dataset     = finetune_dataset,
                     eval_dataset      = eval_dataset,
                     dataset_text_field = 'text',
                     peft_config        = peft_config,
                     max_seq_length    = max_seq_length,
                     tokenizer         = tokenizer,
                     args              = training_arguments,
                     packing           = packing,
                     data_collator     = data_collator,
                     compute_metrics   = compute_metrics)
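The LoRA settings above determine how many parameters are actually trained. A rough sketch of that arithmetic, under three labeled assumptions that are not confirmed by the notebook: adapters are attached only to the `q_proj` and `v_proj` projections (no `target_modules` is set in the config, so the library default applies), the hidden size is 4096, and the model has 32 decoder layers:

```python
# Rough count of trainable LoRA parameters; the three constants below are assumptions.
HIDDEN_SIZE = 4096                      # assumed hidden size of Llama-2-7b
NUM_LAYERS = 32                         # assumed number of decoder layers
TARGET_MODULES = ("q_proj", "v_proj")   # assumed default adapter targets

def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    """A rank-r adapter adds A (r x d_in) and B (d_out x r) to one weight matrix."""
    return r * (d_in + d_out)

lora_r, lora_alpha = 64, 16             # values from the config cell above
per_module = lora_param_count(lora_r, HIDDEN_SIZE, HIDDEN_SIZE)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = lora_alpha / lora_r           # factor applied to the adapter update

print(f"trainable LoRA parameters: {total:,}")
print(f"adapter scaling (alpha / r): {scaling}")
```

Under these assumptions only a small fraction of the roughly 7 billion base parameters receives gradient updates, which is why a single GPU is enough for this run.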



### Re-training and saving the model

In [None]:
%%time
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(MODEL_NAME_FINETUNE)



{'loss': 2.3428, 'learning_rate': 0.0001, 'epoch': 0.08}
{'loss': 2.1699, 'learning_rate': 0.0002, 'epoch': 0.15}
{'loss': 1.9408, 'learning_rate': 0.00019988177233314888, 'epoch': 1.03}
{'loss': 1.8664, 'learning_rate': 0.0001995273688882197, 'epoch': 1.11}
{'loss': 1.9481, 'learning_rate': 0.0001989376276710608, 'epoch': 1.18}
{'loss': 1.7738, 'learning_rate': 0.00019811394315623522, 'epoch': 2.06}
{'loss': 1.6656, 'learning_rate': 0.00019705826298971113, 'epoch': 2.14}
{'loss': 1.7189, 'learning_rate': 0.00019577308338354906, 'epoch': 3.02}
{'loss': 1.4852, 'learning_rate': 0.00019432621375734685, 'epoch': 3.09}
{'loss': 1.5341, 'learning_rate': 0.00019260052755697783, 'epoch': 3.17}
{'loss': 1.4475, 'learning_rate': 0.00019065588247016394, 'epoch': 4.05}
{'loss': 1.3236, 'learning_rate': 0.0001884968767139345, 'epoch': 4.12}
{'loss': 1.3057, 'learning_rate': 0.00018612861537255505, 'epoch': 4.2}
{'train_runtime': 653.9012, 'train_samples_per_second': 2.546, 'train_steps_per_second'
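The learning rates in the first few log lines can be reproduced with a linear warmup followed by cosine decay. The sketch below is a re-implementation for illustration, not the library's code. It assumes 1,665 total optimizer steps (333 training examples × 5 epochs at batch size 1) and warmup steps rounded up from `warmup_ratio`; with packing enabled the number of steps actually executed may differ:

```python
import math

def cosine_lr_with_warmup(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then half-cosine decay towards 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 333 * 5                          # assumed: examples x epochs at batch size 1
WARMUP_STEPS = math.ceil(0.03 * TOTAL_STEPS)   # warmup_ratio = 0.03

# logging_steps = 25, so the first log lines are taken to fall on steps 25, 50, 75, 100
for step in (25, 50, 75, 100):
    print(step, cosine_lr_with_warmup(step, 2e-4, WARMUP_STEPS, TOTAL_STEPS))
```

The printed values for steps 25 and 50 correspond to the warmup phase (half of, then the full, learning rate); from step 75 onward the cosine decay sets in.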

In [None]:
# Free the memory held by the training objects
del model
del trainer
gc.collect()
torch.cuda.empty_cache()  # Release cached GPU memory back to the driver
gc.collect()

20602

## **Merging the original and new (re-trained) model**

In [None]:
device_map = torch.device("cpu")

# Load the fine-tuned model
new_model = AutoPeftModelForCausalLM.from_pretrained(MODEL_NAME_FINETUNE,
                                                     low_cpu_mem_usage=True,
                                                     return_dict=True,
                                                     torch_dtype=torch.float16,
                                                     device_map=device_map)

# Merge the models
merged_model = new_model.merge_and_unload()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
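Conceptually, merging folds each low-rank adapter back into the base weight it was attached to: the merged weight is `W + (alpha / r) * B @ A`. The toy example below illustrates that formula with plain Python lists; it is not the library's implementation, and the matrices are made-up values chosen so the result is easy to check by hand:

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(w, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * B @ A for a single weight matrix."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scaling * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Toy values: a 2x2 base weight with a rank-1 adapter
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]      # shape (2, r) with r = 1
lora_a = [[0.5, -0.5]]       # shape (r, 2)
merged = merge_lora(w, lora_b, lora_a, alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, so the saved model can be loaded with `AutoModelForCausalLM` like any other checkpoint.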

In [None]:
# Save the merged model
merged_model.save_pretrained(MERGED_MODEL_DIR, safe_serialization=True)
tokenizer.save_pretrained(MERGED_MODEL_DIR)

('/content/drive/MyDrive/94812 - NLX and LLM/Final project/finetuning-output/merged-model/tokenizer_config.json',
 '/content/drive/MyDrive/94812 - NLX and LLM/Final project/finetuning-output/merged-model/special_tokens_map.json',
 '/content/drive/MyDrive/94812 - NLX and LLM/Final project/finetuning-output/merged-model/tokenizer.model',
 '/content/drive/MyDrive/94812 - NLX and LLM/Final project/finetuning-output/merged-model/added_tokens.json',
 '/content/drive/MyDrive/94812 - NLX and LLM/Final project/finetuning-output/merged-model/tokenizer.json')

In [None]:
# Push merged model to the Hugging Face Hub
#merged_model.push_to_hub(HF_MODEL_REPO)
#tokenizer.push_to_hub(HF_MODEL_REPO)

## **Running the inference using the fine-tuned model**

In [None]:
# Load the merged model and tokenizer
final_finetuned_model = AutoModelForCausalLM.from_pretrained(MERGED_MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MERGED_MODEL_DIR)

# Ensure model is in evaluation mode
final_finetuned_model.eval()

In [None]:
# Take a smaller sample of the test set for evaluation: just 10 q-a pairs
eval_dataset = eval_dataset.select(range(10))
eval_dataset

Dataset({
    features: ['question', 'answer', 'text'],
    num_rows: 10
})

In [None]:
# Run inference on the CPU (use the commented-out alternative to pick a GPU when one is available)
device = torch.device("cpu") #device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
final_finetuned_model.to(device)

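# Function to generate responses
def generate_response(model, tokenizer, text, max_length=100):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    # max_length counts the prompt tokens too, and the decoded output starts with the prompt itself
    with torch.no_grad():
        outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)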

# Lists to hold data
model_names = []
questions = []
ground_truths = []
predictions = []

# Iterate over the test dataset
for example in tqdm(eval_dataset, desc="Generating predictions"):
    question = example['question']
    ground_truth = example['answer']

    # Generate prediction
    generated_text = generate_response(final_finetuned_model, tokenizer, question)

    # Append to lists
    model_names.append(MODEL_NAME_FINETUNE)
    questions.append(question)
    ground_truths.append(ground_truth)
    predictions.append(generated_text)

# Create a DataFrame
df = pd.DataFrame({
    'model': model_names,
    'question': questions,
    'ground_truth': ground_truths,
    'prediction': predictions})

# Save to CSV
csv_output_path = OUTPUT_DIR + '/predictions.csv'
df.to_csv(csv_output_path, index = False)

Generating predictions: 100%|██████████| 10/10 [08:49<00:00, 52.97s/it]


In [None]:
df['prediction'][0]

'What are some possible benefits of the widespread adoption and use of Generative AI?\n "The widespread adoption and use of Generative AI could lead to a transformation of many industries and aspects of daily life, bringing significant benefits to society. These benefits include:\n - Improved efficiency and productivity: Generative AI has the potential to automate many tasks and processes, leading to increased efficiency and productivity across a wide range of'
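As the output above shows, the decoded prediction begins with the question itself, so any overlap metric computed against the reference answer also scores the echoed question. A minimal sketch of stripping that echo before evaluation; `strip_prompt` is a hypothetical helper, and the strings in the example are illustrative rather than taken from the dataset:

```python
def strip_prompt(question: str, generated: str) -> str:
    """Remove the echoed question from the start of a decoded generation."""
    if generated.startswith(question):
        return generated[len(question):].strip()
    # Leave the text untouched (apart from whitespace) if the question is not echoed verbatim
    return generated.strip()

# Illustrative strings only
question = "What is an example question?"
generated = question + "\nAn illustrative answer."
print(strip_prompt(question, generated))
```

Applied to the `prediction` column, this would leave only the model's continuation to be compared with `ground_truth`.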

In [None]:
# Load the predictions back for evaluations
csv_file_path = OUTPUT_DIR + '/predictions.csv'
predictions = pd.read_csv(csv_file_path)
predictions

| | model | question | ground_truth | prediction |
|---|---|---|---|---|
| 0 | meta-llama/Llama-2-7b-hf-finetuned-policychat | What are some possible benefits of the widespr... | The widespread adoption and use of Generative ... | What are some possible benefits of the widespr... |
| 1 | meta-llama/Llama-2-7b-hf-finetuned-policychat | What are some recommendations for the governme... | Foster the sharing of AI knowledge internation... | What are some recommendations for the governme... |
| 2 | meta-llama/Llama-2-7b-hf-finetuned-policychat | What is the primary purpose of the EU AI Act? | The EU AI Act aims to regulate the sale and u... | What is the primary purpose of the EU AI Act?... |
| 3 | meta-llama/Llama-2-7b-hf-finetuned-policychat | What is the potential global significance of ... | The article suggests that the week's AI polic... | What is the potential global significance of ... |
| 4 | meta-llama/Llama-2-7b-hf-finetuned-policychat | Who were the featured speakers at the 2023 NAI... | The Honorable Denis McDonough, Secretary of Ve... | Who were the featured speakers at the 2023 NAI... |
| 5 | meta-llama/Llama-2-7b-hf-finetuned-policychat | How are states and municipalities addressing g... | States and municipalities are actively legisla... | How are states and municipalities addressing g... |
| 6 | meta-llama/Llama-2-7b-hf-finetuned-policychat | What challenges do China's AI regulations pose... | Ensuring compliance with detailed regulatory r... | What challenges do China's AI regulations pose... |
| 7 | meta-llama/Llama-2-7b-hf-finetuned-policychat | What are the three exceptions(AI systems) that... | AI systems exclusively developed or used for m... | What are the three exceptions(AI systems) that... |
| 8 | meta-llama/Llama-2-7b-hf-finetuned-policychat | How does the inclusion of law enforcement in ... | The White House's Blueprint excludes law enfo... | How does the inclusion of law enforcement in ... |
| 9 | meta-llama/Llama-2-7b-hf-finetuned-policychat | In which instances can the UN act as the arbit... | Challenges to international security - help e... | In which instances can the UN act as the arbit... |


In [None]:
# Evaluate the predictions using bleu and rouge scores
bleu = load_metric("sacrebleu")
rouge = load_metric("rouge")

# Tokenize the predictions and references on whitespace
# TODO: revisit this step; sacrebleu applies its own tokenization and is designed to take raw strings
tokenized_predictions = [prediction.split() for prediction in df['prediction']]
tokenized_references = [[reference.split()] for reference in df['ground_truth']]

# Compute the scores
bleu_score = bleu.compute(predictions=tokenized_predictions, references=tokenized_references)
rouge_score = rouge.compute(predictions=df['prediction'], references=df['ground_truth'])

print(f"BLEU Score: {bleu_score['score']}")
print(f"ROUGE Score: {rouge_score}")

BLEU Score: 18.25893742672674
ROUGE Score: {'rouge1': AggregateScore(low=Score(precision=0.17349027218252763, recall=0.29436916138468594, fmeasure=0.20990369758036584), mid=Score(precision=0.2560153747944406, recall=0.3846792597665253, fmeasure=0.2706036680790316), high=Score(precision=0.3574460497909951, recall=0.4537667569997975, fmeasure=0.3309454184180267)), 'rouge2': AggregateScore(low=Score(precision=0.0498689832465408, recall=0.0703148005804415, fmeasure=0.05386433986711077), mid=Score(precision=0.08434059757589171, recall=0.1355389564043083, fmeasure=0.09749159927763462), high=Score(precision=0.12137174146445244, recall=0.2134901310664869, fmeasure=0.14755483292266458)), 'rougeL': AggregateScore(low=Score(precision=0.12219623888339921, recall=0.19741563166650436, fmeasure=0.13973367176576035), mid=Score(precision=0.17251071713669744, recall=0.2619776924088954, fmeasure=0.18318265076101808), high=Score(precision=0.22388038189108628, recall=0.3238827187443342, fmeasure=0.22520357
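To make the `precision`, `recall`, and `fmeasure` fields in the ROUGE output easier to interpret, here is a simplified ROUGE-1 computation for a single prediction-reference pair. It is a sketch that assumes lowercasing and whitespace tokenization; the metric used above applies its own tokenization and aggregates over examples into `low`, `mid`, and `high` estimates, so its numbers will not match this toy version exactly. The two sentences are made-up examples:

```python
from collections import Counter

def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F-measure for one pair of texts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference token can be matched at most as often as it occurs
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

# Made-up sentences for illustration
scores = rouge1("the act regulates ai systems", "the act regulates the sale of ai")
print(scores)
```

Precision is the share of predicted words that appear in the reference, recall is the share of reference words that the prediction covers, and the F-measure balances the two.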