# COMP3361 Part 1: Building a Transformer Encoder

Note: You should finish your code solution of Part 1 & 2 with A2p12.tgz. For Q2 & Q3, you should include your writeup in this notebook.

## Q2:
Training model with d_model=70, d_internal=100, num_layers=1
The model appears to attend to other occurrences of the same letter. This matches my expectation, since the model has to predict each label from the whole sequence.
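As a rough illustration of this pattern (a toy sketch, not the trained model from Part 1), suppose each letter's query and key are the same one-hot vector. The dot product is then 1 for matching letters and 0 otherwise, so the softmax puts most of its weight on repeats of the same letter. The function name `toy_attention` and the `scale` factor are made up for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(text, scale=4.0):
    """Attention weights when query and key are the same one-hot letter vector.

    The score between two positions is `scale` if the letters match and 0
    otherwise. `scale` is a hypothetical sharpness factor standing in for
    learned projection weights.
    """
    weights = []
    for q in text:
        scores = [scale * (1.0 if q == k else 0.0) for k in text]
        weights.append(softmax(scores))
    return weights

attn = toy_attention("abca")
for letter, row in zip("abca", attn):
    # Each row shows how much that position attends to every position.
    print(letter, [round(w, 2) for w in row])
```

In the row for the first `a`, the two `a` positions receive equal and dominant weight, which is the kind of map described above.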

## Q3:
Training model with d_model=70, d_internal=100, num_layers=4
This is not what I expected. I expected that with more layers, every layer would still focus on occurrences of the same letter. Instead, each layer produces a fairly different attention map and the overall pattern is unclear: the model seems to attend to all letters rather than only to the duplicate occurrences.

# COMP3361 Part 3: Generation with Large Language Model

## Load model and tokenizer

In this section, we will use [CodeLlama-7B](https://huggingface.co/codellama/CodeLlama-7b-hf) as the language model.

In [3]:
!pip install transformers datasets evaluate accelerate bitsandbytes


In [4]:
from abc import ABC, abstractmethod
from typing import List, Dict, Any
import os
import json
import evaluate
from datasets import load_dataset
from tqdm import tqdm
import re
import locale
from transformers import AutoTokenizer, AutoModelForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"
locale.getpreferredencoding = lambda: "UTF-8"
# os.environ['LC_ALL'] = 'en_US.UTF-8'

In [None]:
# You will load this model in a 4-bit inference mode to optimize resource usage.
# This setup involves implementing a class named LLM that facilitates loading the model
# and generating text completions based on provided prompts
class LLM(object):
    def __init__(self, model_name="codellama/CodeLlama-7b-hf"):
        self.model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
        # pass

    def generate(self, prompts: List[str], **kwargs) -> List[str]:
        # Add a pad token if the tokenizer has none, so prompts can be batched.
        if self.tokenizer.pad_token is None:
            self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
            self.model.resize_token_embeddings(len(self.tokenizer))
        modelInput = self.tokenizer(
            prompts, return_tensors="pt", padding=True
        ).to("cuda")

        # Honour caller-supplied generation settings; default to 256 new tokens.
        kwargs.setdefault("max_new_tokens", 256)
        generated_ids = self.model.generate(**modelInput, **kwargs)
        # With left padding every prompt ends at the same index, so slicing there
        # keeps only the newly generated tokens and drops the echoed prompt.
        new_token_ids = generated_ids[:, modelInput["input_ids"].shape[1]:]
        return self.tokenizer.batch_decode(new_token_ids, skip_special_tokens=True)
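
The tokenizer is created with `padding_side="left"` because a decoder-only model continues from the last position of each row, so that position has to hold a real token rather than padding. The sketch below imitates what left padding does to a batch, in plain Python. The helper `left_pad` is hypothetical (the tokenizer does this internally), and the token ids are the ones this notebook prints for its two-prompt example, where 32016 is the id of the added `[PAD]` token.

```python
from typing import List, Tuple

def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence on the left to the length of the longest one.

    Returns (input_ids, attention_mask); the mask is 0 on padding and 1 on
    real tokens, so the model can ignore the padded positions.
    """
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask

batch = [
    [1, 319, 1051, 310, 11955, 29901, 2654, 29892, 7254],  # "A list of colors: red, blue"
    [1, 12077, 338],                                        # "Portugal is"
]
ids, mask = left_pad(batch, pad_id=32016)
print(ids[1])   # the shorter prompt, padded on the left
print(mask[1])  # zeros mark the padded positions
```

Because all rows end at the same index after left padding, the generated tokens of every row start at that same index.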

In [10]:
llm = LLM()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
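
The deprecation warning above asks for a `BitsAndBytesConfig` passed as `quantization_config`. A minimal sketch of that form follows. The helper name `load_4bit_model`, the `float16` compute dtype, and `device_map="auto"` are assumptions for illustration, not settings taken from this notebook.

```python
def load_4bit_model(model_name: str = "codellama/CodeLlama-7b-hf"):
    """Load a causal LM in 4-bit mode through an explicit quantization config."""
    # Imported lazily so that defining this helper does not need the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,  # assumed choice, not from the notebook
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # assumed: let accelerate place the weights
    )
```

Calling `load_4bit_model()` inside `LLM.__init__` in place of the `load_in_4bit=True` argument should avoid the warning, assuming `bitsandbytes` and a CUDA device are available.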



In [20]:
# llm.generate returns decoded strings, not token ids.
completions = llm.generate(["A list of colors: red, blue", "Portugal is"])
print(completions)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


prompts ['A list of colors: red, blue', 'Portugal is']
input {'input_ids': tensor([[    1,   319,  1051,   310, 11955, 29901,  2654, 29892,  7254],
        [32016, 32016, 32016, 32016, 32016, 32016,     1, 12077,   338]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 1, 1, 1]], device='cuda:0')}
gen id tensor([[    1,   319,  1051,   310, 11955, 29901,  2654, 29892,  7254, 29892,
          7933, 29892, 13328, 29892, 24841, 29892,  3708,   552, 29892, 17354,
         29892,   282,   682, 29892,  4628, 29892,  4796, 29892, 16749, 29892,
         17354, 29892, 24841, 29892,  3708,   552, 29892, 17354, 29892,   282,
           682, 29892,  4628, 29892,  4796, 29892, 16749, 29892, 17354, 29892,
         24841, 29892,  3708,   552, 29892, 17354, 29892,   282,   682, 29892,
          4628, 29892,  4796, 29892, 16749, 29892, 17354, 29892, 24841, 29892,
          3708,   552, 29892, 17354, 29892,   282,   682, 29892,  4628, 29892,
   

In [27]:
class Evaluator(ABC):
    def __init__(self, llm):
        self.llm = llm

    @abstractmethod
    def load_data(self, evalset_file):
        pass

    @abstractmethod
    def build_prompts(self, dataset) -> List[str]:
        pass

    @abstractmethod
    def postprocess_output(self, output: str) -> str:
        pass

    def generate_completions(self, prompts: List[str], batch_size=4, **kwargs) -> List[str]:
        generated_completions = []

        # Generate in batches, forwarding generation settings such as max_new_tokens.
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            generated_completions += self.llm.generate(batch_prompts, **kwargs)

        return generated_completions

    def evaluate(self, evalset_file, batch_size=4, save_dir="outputs", max_new_tokens=128, **kwargs):
        dataset = self.load_data(evalset_file)
        prompts = self.build_prompts(dataset)
        outputs = self.generate_completions(prompts, batch_size=batch_size, max_new_tokens=max_new_tokens, **kwargs)

        predictions = []
        for i, (example, prompt, output) in enumerate(zip(dataset, prompts, outputs)):
            # HumanEvalEvaluator.calculate_metrics keys problems by task_id, so keep
            # it in each prediction; datasets without one get a positional id.
            has_task_id = isinstance(example, dict) and "task_id" in example
            prediction = {
                "task_id": example["task_id"] if has_task_id else f"task_{i}",
                "prompt": prompt,
                "completion": self.postprocess_output(output)
            }
            predictions.append(prediction)

        # Save predictions to file
        os.makedirs(save_dir, exist_ok=True)
        prediction_save_path = os.path.join(save_dir, f"{type(self).__name__}_predictions.jsonl")
        with open(prediction_save_path, "w") as fout:
            for pred in predictions:
                fout.write(json.dumps(pred) + "\n")

        # Calculate metrics and print results
        results = self.calculate_metrics(predictions, dataset)
        print(f"Results for {type(self).__name__}: {results}")

    @abstractmethod
    def calculate_metrics(self, predictions, dataset):
        pass

In [None]:
# # FOR TESTING ONLY
# # Delete before submission

# # Free GPU
# from numba import cuda
# device = cuda.get_current_device()
# device.reset()

## Zero-shot Code Generation

In [None]:
!mkdir -p human_eval
!wget -O human_eval/__init__.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/__init__.py
!wget -O human_eval/data.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/data.py
!wget -O human_eval/evaluation.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/evaluation.py
!wget -O human_eval/execution.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/execution.py

!mkdir -p data/humaneval
!wget -O data/humaneval/HumanEval.jsonl.gz https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz

In [None]:
from human_eval.data import read_problems
from human_eval.evaluation import evaluate_functional_correctness

class HumanEvalEvaluator(Evaluator):
    def load_data(self, evalset_file="data/humaneval/HumanEval.jsonl.gz") -> List[Dict[str, Any]]:
        """
        Load the humaneval dataset
        :param evalset_file: path to the humaneval dataset file
        :return: list of examples
        """
        return list(read_problems(evalset_file).values())

    def build_prompts(self, dataset) -> List[str]:
        """
        Build zero-shot prompts from the humaneval dataset.
        """
        prompts = [example["prompt"] for example in dataset]
        return prompts

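    def postprocess_output(self, output: str) -> str:
        stop_sequences=["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

        # Truncate at whichever stop sequence appears first in the text,
        # not at whichever comes first in the list.
        cut = len(output)
        for seq in stop_sequences:
            idx = output.find(seq)
            if idx != -1:
                cut = min(cut, idx)

        # Strip only trailing whitespace: leading indentation is part of the function body.
        return output[:cut].rstrip()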

    def calculate_metrics(self, predictions, dataset):
        pass_at_k_results = evaluate_functional_correctness(
            sample_file=os.path.join("outputs", f"{type(self).__name__}_predictions.jsonl"),
            k=[1],
            problems={example["task_id"]: example for example in dataset},
            n_workers=64
        )
        return pass_at_k_results
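
For reference, `k=[1]` above requests pass@1. The estimator commonly used with HumanEval is $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ is the number of samples per problem and $c$ the number that pass the tests. The sketch below implements that formula; whether the course's copy of `human_eval` computes it exactly this way is an assumption, and the function name `pass_at_k` is chosen for this example.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem, pass@1 is 1.0 if it passed and 0.0 if not.
print(pass_at_k(n=1, c=1, k=1))
print(pass_at_k(n=1, c=0, k=1))
# With 10 samples of which 3 pass, pass@1 is the pass rate.
print(pass_at_k(n=10, c=3, k=1))
```

Since this notebook generates one completion per problem, the reported pass@1 reduces to the fraction of problems whose completion passes.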


In [None]:
# human_eval_evaluator = HumanEvalEvaluator(llm)
# human_eval_evaluator.evaluate(evalset_file="data/humaneval/HumanEval.jsonl.gz", batch_size=16)

## Few-shot Math Reasoning

In [22]:
GSM_EXAMPLARS = [
    {
        "question": "There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?",
        "cot_answer": "There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. So the answer is 6.",
        "pot_answer": "def solution():\n    \"\"\"There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\"\"\"\n    trees_initial = 15\n    trees_after = 21\n    trees_added = trees_after - trees_initial\n    result = trees_added\n    return result",
        "short_answer": "6"
    },
    {
        "question": "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
        "cot_answer": "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. So the answer is 5.",
        "pot_answer": "def solution():\n    \"\"\"If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\"\"\"\n    cars_initial = 3\n    cars_arrived = 2\n    total_cars = cars_initial + cars_arrived\n    result = total_cars\n    return result",
        "short_answer": "5"
    },
    {
        "question": "Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?",
        "cot_answer": "Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. So the answer is 39.",
        "pot_answer": "def solution():\n    \"\"\"Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\"\"\"\n    leah_chocolates = 32\n    sister_chocolates = 42\n    total_chocolates = leah_chocolates + sister_chocolates\n    chocolates_eaten = 35\n    chocolates_left = total_chocolates - chocolates_eaten\n    result = chocolates_left\n    return result",
        "short_answer": "39"
    },
    {
        "question": "Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?",
        "cot_answer": "Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\"\"\"\n    jason_lollipops_initial = 20\n    jason_lollipops_after = 12\n    denny_lollipops = jason_lollipops_initial - jason_lollipops_after\n    result = denny_lollipops\n    return result",
        "short_answer": "8"
    },
    {
        "question": "Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?",
        "cot_answer": "Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. So the answer is 9.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "9"
    },
    {
        "question": "There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?",
        "cot_answer": "There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. So the answer is 29.",
        "pot_answer": "def solution():\n    \"\"\"There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\"\"\"\n    computers_initial = 9\n    computers_per_day = 5\n    num_days = 4  # monday to thursday\n    computers_added = computers_per_day * num_days\n    computers_total = computers_initial + computers_added\n    result = computers_total\n    return result",
        "short_answer": "29"
    },
    {
        "question": "Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?",
        "cot_answer": "Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. So the answer is 33.",
        "pot_answer": "def solution():\n    \"\"\"Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\"\"\"\n    golf_balls_initial = 58\n    golf_balls_lost_tuesday = 23\n    golf_balls_lost_wednesday = 2\n    golf_balls_left = golf_balls_initial - golf_balls_lost_tuesday - golf_balls_lost_wednesday\n    result = golf_balls_left\n    return result",
        "short_answer": "33"
    },
    {
        "question": "Olivia has $23. She bought five bagels for $3 each. How much money does she have left?",
        "cot_answer": "Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\"\"\"\n    money_initial = 23\n    bagels = 5\n    bagel_cost = 3\n    money_spent = bagels * bagel_cost\n    money_left = money_initial - money_spent\n    result = money_left\n    return result",
        "short_answer": "8"
    }
]
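
Each exemplar's `pot_answer` is a small program that defines `solution()`. One way such a program-of-thought answer could be turned into a number is to execute it and call the function. The sketch below does that with `exec`; the helper `run_solution` is hypothetical and is not part of the assignment code. Running model-generated code this way is unsafe outside an isolated environment.

```python
def run_solution(program: str):
    """Execute code that defines `solution()` and return what it returns.

    Any error (syntax error, missing function, runtime failure) yields None.
    """
    namespace = {}
    try:
        exec(program, namespace)        # defines solution() inside `namespace`
        return namespace["solution"]()
    except Exception:
        return None

# Shortened version of the first exemplar's pot_answer (docstring omitted).
demo_program = (
    "def solution():\n"
    "    trees_initial = 15\n"
    "    trees_after = 21\n"
    "    trees_added = trees_after - trees_initial\n"
    "    result = trees_added\n"
    "    return result"
)
print(run_solution(demo_program))
```

The returned value can then be compared with the exemplar's `short_answer` after converting both to the same type.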

In [28]:
from sklearn.metrics import accuracy_score

class GSM8KEvaluator(Evaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        dataset = load_dataset(evalset_file, "main", split="test")

        # Slicing with dataset[:100] returns a dict of columns, so iterating it
        # yields column names. select() keeps a Dataset of 100 example dicts.
        return dataset.select(range(100))

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use the first n_shot
        demos, each shown as a question followed by its short answer.
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        prompts = []
        prompt = "Answer the following questions.\n"
        for demo in demos[:n_shot]:
            prompt += "Question: " + demo["question"] + "\n"
            prompt += "Answer: " + demo["short_answer"] + "\n"

        for example in dataset["question"]:
            # End on "Answer:" so the model continues with the answer itself.
            prompts.append(prompt + "Question: " + example + "\n" + "Answer:")

        return prompts

    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        Expects only the generated continuation (no prompt). Keeps the text
        before the model starts a new question and returns the first number in it.
        """
        output = output.split("\nQuestion:")[0]
        match = re.search(r"-?\d[\d,]*(?:\.\d+)?", output)
        return match.group(0).replace(",", "") if match else ""

    def calculate_metrics(self, predictions, dataset):
        """
        Calculate accuracy metrics for the GSM8K dataset
        """
        # GSM8K reference answers end with "#### <final answer>"; compare only that part.
        references = [answer.split("####")[-1].strip().replace(",", "") for answer in dataset["answer"]]
        completions = [prediction["completion"].strip() for prediction in predictions]
        accuracy = accuracy_score(references, completions)
        return accuracy
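
The exemplars also carry a `cot_answer` field whose text ends in "So the answer is N." A chain-of-thought variant of the prompt could use that field in place of `short_answer` and read the final number back out of the completion. The sketch below shows one way to do this; the helper names `build_cot_prompt` and `parse_cot_answer` and the regular expression are assumptions for illustration, not part of the evaluator above.

```python
import re
from typing import Dict, List, Optional

def build_cot_prompt(demos: List[Dict[str, str]], question: str) -> str:
    """Format demos with their reasoning, then append the new question."""
    parts = ["Answer the following questions."]
    for demo in demos:
        parts.append(f"Question: {demo['question']}")
        parts.append(f"Answer: {demo['cot_answer']}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)

def parse_cot_answer(completion: str) -> Optional[str]:
    """Return the number following 'answer is', without thousands separators."""
    match = re.search(r"answer is \$?(-?[\d,]*\.?\d+)", completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")

demos = [{
    "question": "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
    "cot_answer": "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. So the answer is 5.",
}]
prompt = build_cot_prompt(demos, "Olivia has $23. She bought five bagels for $3 each. How much money does she have left?")
print(prompt)
print(parse_cot_answer(demos[0]["cot_answer"]))
```

Parsing on the phrase "answer is" relies on the model copying the exemplars' closing sentence, so a completion that phrases its conclusion differently returns `None` here.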

In [None]:
gsm8k_evaluator = GSM8KEvaluator(llm)
gsm8k_evaluator.evaluate(evalset_file="gsm8k")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


prompts ["Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\nAnswer: 9\nQuestion: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\nAnswer: 29\nQuestion: Michael had 58 golf balls. 

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[    1,   673,   278,  ..., 29901,  1670,   526],
        [32016, 32016, 32016,  ...,   526,   297,   278],
        [32016, 32016, 32016,  ..., 29871, 29896, 29900],
        [32016, 32016, 32016,  ..., 29871, 29896, 29900]], device='cuda:0')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[32016, 32016, 32016,  ...,   960, 29871, 29896],
        [    1,   673,   278,  ..., 14311, 29889, 29871],
        [32016, 32016, 32016,  ..., 29901,  1670,   526],
        [32016, 32016, 32016,  ..., 29901,  1670,   526]], device='cuda:0')
prompts ['Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[32016, 32016, 32016,  ...,  1670,   526, 29871],
        [32016, 32016, 32016,  ..., 29871, 29896, 29900],
        [    1,   673,   278,  ...,  2305,   526,   297],
        [32016, 32016, 32016,  ...,  2305,   526,   297]], device='cuda:0')
prompts ['Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[32016, 32016, 32016,  ..., 29900, 29900, 29900],
        [32016, 32016, 32016,  ...,    13, 16492, 29901],
        [    1,   673,   278,  ...,   526, 29871, 29896],
        [32016, 32016, 32016,  ..., 29900, 29900, 29900]], device='cuda:0')
prompts ["Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[    1,   673,   278,  ..., 29901,  1670,   526],
        [32016, 32016, 32016,  ...,  2305,   526,   297],
        [32016, 32016, 32016,  ...,   310,   963,   526],
        [32016, 32016, 32016,  ...,  4515,  3250, 29892]], device='cuda:0')
prompts ["Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[32016, 32016, 32016,  ..., 29871, 29896, 29900],
        [    1,   673,   278,  ...,  2305,   297,   278],
        [    1,   673,   278,  ..., 29896, 29900,  2305],
        [32016, 32016, 32016,  ...,  2305,  5967,   278]], device='cuda:0')
prompts ['Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[32016, 32016, 32016,  ...,  2305,   297,   278],
        [32016, 32016, 32016,  ...,  2305,  5967,   278],
        [    1,   673,   278,  ..., 29901, 29871, 29896],
        [32016, 32016, 32016,  ...,  1670,   526, 29871]], device='cuda:0')
prompts ['Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[    1,   673,   278,  ..., 29901,  1670,   526],
        [32016, 32016, 32016,  ..., 29892,   920,  1784],
        [32016, 32016, 32016,  ..., 29896, 29900, 29900],
        [32016, 32016, 32016,  ..., 29896, 29900, 29900]], device='cuda:0')
prompts ['Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[32016, 32016, 32016,  ...,  1784,  2305,   526],
        [    1,   673,   278,  ...,  2185, 29889, 29871],
        [32016, 32016, 32016,  ..., 29900,   310,   963],
        [32016, 32016, 32016,  ..., 29871, 29896, 29900]], device='cuda:0')
prompts ['Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


gen id tensor([[32016, 32016, 32016,  ..., 29896, 29900, 29900],
        [    1,   673,   278,  ...,   963,   526,  3805],
        [32016, 32016, 32016,  ...,  2305,   762,  1269],
        [32016, 32016, 32016,  ..., 16492, 29901,  1670]], device='cuda:0')
prompts ["Answer the following questions.\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: 6\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: 5\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: 39\nQuestion: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nAnswer: 8\nQuestion: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many t

## Few-shot Chain-of-Thought Math Reasoning

In [None]:

from sklearn.metrics import accuracy_score

class GSM8KCoTEvaluator(GSM8KEvaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        dataset = load_dataset("gsm8k", "main", split="test")

        # Slicing a Dataset returns a dict of columns: {"question": [...], "answer": [...]}.
        return dataset[:100]

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use the first n_shot demonstrations.
        :param dataset: dict of columns returned by load_data ("question", "answer")
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        # Shared few-shot prefix: the instruction followed by n_shot worked examples.
        prompt = "Answer the following questions.\n"
        for demo in demos[:n_shot]:
            prompt += "Question: " + demo["question"] + "\n"
            prompt += "Answer: " + demo["cot_answer"] + "\n"

        # One prompt per test question, each starting with the same prefix.
        prompts = []
        for question in dataset["question"]:
            prompts.append(prompt + "Question: " + question + "\n")

        return prompts

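    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        Keep only the answer to the current question.
        """
        # After answering, the model tends to continue with another "Question:" block.
        output = output.split("\nQuestion:")[0]

        return output.strip()

    def calculate_metrics(self, predictions, dataset):
        """
        Calculate accuracy metrics for the GSM8K dataset
        """
        import re  # local import so this cell does not depend on earlier imports

        def last_number(text):
            # Last number in the text, with thousands separators removed.
            numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
            return str(float(numbers[-1])) if numbers else ""

        # GSM8K reference answers end with "#### <final answer>".
        references = [last_number(answer.split("####")[-1]) for answer in dataset["answer"]]
        # Assumes the chain of thought ends with its final answer, so the last number is the prediction.
        completions = [last_number(prediction["completion"]) for prediction in predictions]
        accuracy = accuracy_score(references, completions)
        return accuracy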

In [None]:
gsm8k_cot_evaluator = GSM8KCoTEvaluator(llm)
gsm8k_cot_evaluator.evaluate(evalset_file="gsm8k")
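
As a rough illustration of the prompt shape that `build_prompts` produces, the sketch below assembles a few-shot prompt from two made-up demonstrations. `TOY_DEMOS` and `build_few_shot_prompt` are hypothetical names used only for this example; the real demonstrations come from `GSM_EXAMPLARS`, whose `cot_answer` wording may differ.

```python
# Toy demonstrations; the field names mirror those read from GSM_EXAMPLARS above.
TOY_DEMOS = [
    {"question": "There are 3 cars and 2 more arrive. How many cars are there?",
     "cot_answer": "There are 3 cars. 2 more arrive. 3 + 2 = 5. The answer is 5."},
    {"question": "Leah had 32 chocolates and ate 2. How many are left?",
     "cot_answer": "Leah had 32 chocolates. She ate 2. 32 - 2 = 30. The answer is 30."},
]


def build_few_shot_prompt(question, demos, n_shot):
    """Concatenate the first n_shot demonstrations, then append the new question."""
    prompt = "Answer the following questions.\n"
    for demo in demos[:n_shot]:
        prompt += "Question: " + demo["question"] + "\n"
        prompt += "Answer: " + demo["cot_answer"] + "\n"
    return prompt + "Question: " + question + "\n"


prompt = build_few_shot_prompt(
    "Tom has 4 apples and buys 3. How many apples does he have?", TOY_DEMOS, n_shot=1
)
print(prompt)
```

With `n_shot=1` only the first demonstration is included, so the prompt contains two `Question:` blocks: one demonstration and the new question.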

## Few-shot Program-of-Thought Math Reasoning

In [None]:
!pip install timeout-decorator Pebble
!wget -O python_executor.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/python_executor.py

In [None]:
from python_executor import PythonExecutor
executor = PythonExecutor(get_answer_expr='solution()')

codes = [
    "def solution():\n    return 1 + 1",
    "def solution():\n    return 2 * 2",
]

predictions = []
runtime_errors = []
for pred, err in executor.batch_apply(codes):
    predictions.append(str(pred))
    runtime_errors.append(str(err['exec_info']).strip())

In [None]:
predictions
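
`PythonExecutor` is downloaded from the course repository and its internals are not shown here. As a minimal sketch of the idea behind program-of-thought, assuming each generated program defines a `solution()` function, the hypothetical helper `run_solution` below executes a code string in a fresh namespace and returns the result together with any error message. It has no timeout or sandboxing, so it is not a safe way to run untrusted model output.

```python
def run_solution(code):
    """Execute a code string and call its solution() function.

    Hypothetical stand-in for illustration only; this is not the PythonExecutor API.
    Returns (result, error_message); error_message is None on success.
    """
    namespace = {}
    try:
        exec(code, namespace)  # defines solution() inside the fresh namespace
        return namespace["solution"](), None
    except Exception as exc:  # syntax errors, a missing solution(), runtime errors
        return None, f"{type(exc).__name__}: {exc}"


codes = [
    "def solution():\n    return 1 + 1",
    "def solution():\n    return 2 * 2",
    "def solution():\n    return 1 / 0",
]
results = [run_solution(code) for code in codes]
print(results)
```

The third program is included to show that a failing program yields an error message instead of crashing the evaluation loop.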

In [None]:
from sklearn.metrics import accuracy_score

class GSM8KPoTEvaluator(Evaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        dataset = load_dataset("gsm8k", "main", split="test")

        # Slicing a Dataset returns a dict of columns: {"question": [...], "answer": [...]}.
        return dataset[:100]

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use the first n_shot demonstrations.
        :param dataset: dict of columns returned by load_data ("question", "answer")
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        # Shared few-shot prefix: the instruction followed by n_shot worked programs.
        prompt = "Answer the following questions.\n"
        for demo in demos[:n_shot]:
            prompt += "Question: " + demo["question"] + "\n"
            prompt += "Answer: " + demo["pot_answer"] + "\n"

        # One prompt per test question, each starting with the same prefix.
        prompts = []
        for question in dataset["question"]:
            prompts.append(prompt + "Question: " + question + "\n")

        return prompts

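    def postprocess_output(self, output: str) -> str:
        """
        Postprocess the output from the language model.
        Keep only the first generated program.
        """
        output = output.strip()
        # Drop an echoed "Answer:" label so the remaining text is plain Python.
        if output.startswith("Answer:"):
            output = output[len("Answer:"):].lstrip()

        # Cut at the earliest stop sequence, not the first one in list order.
        stop_sequences = ["\nQuestion:", "\nclass", "\ndef", "\n#", "\nif", "\nprint"]
        cut_points = [output.find(seq) for seq in stop_sequences if seq in output]
        if cut_points:
            output = output[:min(cut_points)]

        return output

    def calculate_metrics(self, predictions, dataset):
        """
        Calculate accuracy metrics for the GSM8K dataset
        """
        # Run each generated program with the PythonExecutor created above.
        codes = [prediction["completion"] for prediction in predictions]
        results = [pred for pred, _ in executor.batch_apply(codes)]

        def normalize(value):
            # Compare numerically so that 18 and 18.0 count as the same answer.
            try:
                return str(float(str(value).replace(",", "")))
            except ValueError:
                return ""

        # GSM8K reference answers end with "#### <final answer>".
        references = [normalize(answer.split("####")[-1]) for answer in dataset["answer"]]
        accuracy = accuracy_score(references, [normalize(result) for result in results])
        return accuracy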

In [None]:
gsm8k_pot_evaluator = GSM8KPoTEvaluator(llm)
gsm8k_pot_evaluator.evaluate(evalset_file="gsm8k")
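
GSM8K reference answers contain a worked rationale followed by `####` and the final number, so a prediction has to be reduced to a number before it can be compared. The sketch below shows one way to do this for both settings. `gold_number`, `last_number`, and `is_correct` are hypothetical helpers, and the sample strings are made up for illustration.

```python
import math
import re


def gold_number(answer):
    """Final number of a GSM8K reference answer, i.e. the text after '####'."""
    return float(answer.split("####")[-1].strip().replace(",", ""))


def last_number(text):
    """Last number mentioned in free text (chain-of-thought output), or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(predicted, gold):
    """Numeric comparison that tolerates 18 vs 18.0 and small float error."""
    return predicted is not None and math.isclose(predicted, gold, abs_tol=1e-6)


gold = gold_number("She sells 9 eggs at $2 each. 9 * 2 = 18\n#### 18")
cot_prediction = last_number("9 eggs at $2 each is 9 * 2 = 18 dollars. The answer is 18.")
pot_prediction = 18.0  # e.g. the value returned by an executed solution()

print(is_correct(cot_prediction, gold), is_correct(pot_prediction, gold))
```

Taking the last number assumes the chain of thought ends with its final answer; a completion that keeps generating after the answer would need to be truncated first.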

|                    | GSM8K |
|--------------------|-------|
| Direct Prompting   |       |
| Chain-of-Thought   |       |
| Program-of-Thought |       |