## Downsampled Training
Before we potentially (and expensively) train on all 2M examples, let's see how training works on a smaller, downsampled dataset. We'll evaluate our main `exact_match` result and compare it to the initial benchmark. From here we can begin trying to better understand the trends within the partial correctness scores.

Just as we did in the initial Benchmarking notebook, we install NVIDIA `apex`. As noted before, its `optimizers` and `normalization` modules speed up training: the `normalization.FusedRMSNorm` class accelerates the normalization computations, and the fused-kernel `FusedAdam` optimizer (Adam or AdamW) replaces the standard `torch.optim.AdamW` optimizer.
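
To make concrete what the fused normalization kernel computes: T5's layer norm is an RMS norm (a learned scale only, with no mean subtraction and no bias). The following is a minimal pure-Python sketch of that arithmetic, not apex's implementation. The function name `rms_norm` and the sample values are illustrative assumptions; `eps=1e-6` matches the value printed in the model summary later in this notebook.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

hidden = [3.0, -4.0]   # toy hidden state (hypothetical values)
weight = [1.0, 1.0]    # per-feature scale, all ones for illustration
normalized = rms_norm(hidden, weight)
print(normalized)      # the output has a root mean square of roughly 1
```

A fused kernel performs these steps in a single GPU launch instead of several, which is where the speed-up comes from.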

In [0]:
%sql
-- create a volume to store our wheels that are time consuming to build
CREATE VOLUME workspace_dogfood.jgr.wheels;

In [0]:
# install Nvidia Apex for improved normalization computation and optimization with fused kernels
# ensure we have ninja installed to speed up Nvidia apex source compilation
!pip install ninja

# first build and save the binary for speed up in future installs
!pip wheel git+https://github.com/NVIDIA/apex.git --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" -w /tmp/wheels/

# Move the wheel file to the Unity Catalog volume
dbutils.fs.mv("file:/tmp/wheels/apex-0.1-cp311-cp311-linux_x86_64.whl", "dbfs:/Volumes/workspace_dogfood/jgr/wheels/apex-0.1-cp311-cp311-linux_x86_64.whl")

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/f2/3a/8bdab26e09c5a242182b7ba9152e216d5ab4ae2d78c4298eb4872549cd35/transformers-4.47.1-py3-none-any.whl.metadata
  Using cached transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers)
  Obtaining dependency information for huggingface-hub<1.0,>=0.24.0 from https://files.pythonhosted.org/packages/61/8c/fbdc0a88a622d9fa54e132d7bf3ee03ec602758658a2db5b339a65be2cfe/huggingface_hub-0.27.0-py3-none-any.whl.metadata
  Using cached huggingface_hub-0.27.0-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Obtaining dependency information for tokenizers<0.22,>=0.21 from https://files.pythonhosted.org/packages/22/06/69d7ce374747edaf1695a4f61b83570d91cc8bbfc51ccfecf76f56ab4aac/tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached tokenizer

2024-12-21 05:51:02.583206: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


  Building wheel for apex (pyproject.toml): still running...
  Building wheel for apex (pyproject.toml): finished with status 'done'
  Created wheel for apex: filename=apex-0.1-cp311-cp311-linux_x86_64.whl size=35430838 sha256=c4f1044261ccc7ee405c2d058c401f9dc6d3c41052af431fba8d587b79dc37a4
  Stored in directory: /tmp/pip-ephem-wheel-cache-org6fkc8/wheels/79/b8/83/5235f93f5bca64242106bf00bd06a198b5c54b8df578ca2f99
Successfully built apex


In [0]:
#install / upgrade transformers and install apex
!pip install -U transformers
!pip install /Volumes/workspace_dogfood/jgr/wheels/apex-0.1-cp311-cp311-linux_x86_64.whl
dbutils.library.restartPython()

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/f2/3a/8bdab26e09c5a242182b7ba9152e216d5ab4ae2d78c4298eb4872549cd35/transformers-4.47.1-py3-none-any.whl.metadata
  Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers)
  Obtaining dependency information for huggingface-hub<1.0,>=0.24.0 from https://files.pythonhosted.org/packages/61/8c/fbdc0a88a622d9fa54e132d7bf3ee03ec602758658a2db5b339a65be2cfe/huggingface_hub-0.27.0-py3-none-any.whl.metadata
  Downloading huggingface_hub-0.27.0-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Obtaining dependency information for tokenizers<

In [0]:
# confirm the apex library is available
from apex import normalization

In [0]:
# load the full preprocessed training and evaluation datasets
from datasets import load_dataset

tokenized_train_dataset = load_dataset("MarioBarbeque/DeepMind-LinAlg-1D-train")
tokenized_eval_dataset = load_dataset("MarioBarbeque/DeepMind-LinAlg-1D-eval")



In [0]:
tokenized_train_dataset = tokenized_train_dataset["train"]
tokenized_train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 1999998
})

In [0]:
# we downsample the training dataset to 100k examples for training and use another 10k for evaluation
train_size = 100_000
test_size = int(0.1 * train_size)

downsampled_dataset = tokenized_train_dataset.train_test_split(
    train_size=train_size, test_size=test_size, seed=20
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
})

In [0]:
# set the downsampled format to numpy in order to pass it to the seq2seq datacollator
# the DataCollatorForSeq2Seq uses numpy arrays to pad the labels
downsampled_dataset.set_format("numpy")

In [0]:
# peek some records
downsampled_dataset["train"][:5]

{'input_ids': array([array([5175,  162,    3,   18, 4201, 1935,    9, 1768,  668, 1935,    9,
               1768,  204, 3951, 3274, 4475, 1935,    9, 1768, 1401, 3166,   21,
                  3,    9,    5,    1,    0,    0,    0,    0,    0,    0,    0,
                  0,    0,    0])                                               ,
        array([5175,  162,    3,  632, 3274,    3, 3708, 1935,  208,    3,   18,
                850, 1935,  208,    3,   18,  850, 5062, 1768,  898, 4305,   21,
                  3,  208,    5,    1,    0,    0,    0])                       ,
        array([5175,  162, 1630, 3274,    3,   18,  102, 1768, 2059,   21,    3,
                102,    5,    1,    0,    0,    0,    0,    0,    0,    0,    0,
                  0,    0,    0])                                               ,
        array([ 5175,   162,  1003,  3274,     3,    18, 16169,  1935,     9,
                   3,    18,     3,  4729,    21,     3,     9,     5,     1,
                  

In [0]:
# reinstantiate our tokenizer and model (weights load in float32; bf16 is applied as mixed precision at launch)
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

checkpoint = "google/flan-t5-large"
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
# load the model onto the CPU and let 🤗 accelerate take care of device placement in our training loop
# model = T5ForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

2024-12-31 17:35:13.483371: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [0]:
# confirm our model weights are in bfloat16
# model.dtype

In [0]:
# grab our exact match metric
from datasets import load_metric

exact_match_metric = load_metric("exact_match")

  exact_match_metric = load_metric("exact_match")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [0]:
# define our hyperparameters

# the T5 documentation states: 
# T5 models need a slightly higher learning rate than the default one set in the Trainer when using the AdamW optimizer. Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question answering, question generation). Note that T5 was pre-trained using the AdaFactor optimizer.

hyperparameters = {
    "learning_rate": 1e-4, # see the T5 documentation on finetuning learning rate for AdamW
    "num_epochs": 3,
    "train_batch_size": 256, # NOTE: 32 originally # per-device batch size; the effective batch size is this x number of GPUs
    "eval_batch_size": 256, # per-device batch size; the effective batch size is this x number of GPUs
}
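
Because the batch sizes above are per device, the number of optimizer steps per epoch depends on the GPU count as well. The sketch below works through that arithmetic for the 100k-example training split on 2 GPUs; `steps_per_epoch` is a hypothetical helper, not part of any library. The two results match the progress-bar totals (1563 and 196) in the training logs of this notebook.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, num_processes):
    """Optimizer steps per epoch when every process consumes one batch per step."""
    effective_batch_size = per_device_batch_size * num_processes
    return math.ceil(num_examples / effective_batch_size)

# 100k training examples on 2 GPUs
print(steps_per_epoch(100_000, 32, 2))   # original per-device batch size -> 1563
print(steps_per_epoch(100_000, 256, 2))  # current per-device batch size -> 196
```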

In [0]:
import torch

def mem_status_distributed():
    rank = torch.cuda.current_device()
    properties = torch.cuda.get_device_properties(rank)
    total_memory = properties.total_memory / (1024 ** 3)  # Convert to GB
    allocated_memory = torch.cuda.memory_allocated(rank) / (1024 ** 3)
    reserved_memory = torch.cuda.memory_reserved(rank) / (1024 ** 3)
    available_memory = total_memory - reserved_memory
    print(f"GPU {rank}: | ")
    print(f"  Total memory: {total_memory:.2f} GB |")
    print(f"  Allocated memory: {allocated_memory:.2f} GB |")
    print(f"  Reserved memory: {reserved_memory:.2f} GB |")
    print(f"  Available memory: {available_memory:.2f} GB |")


### Original Downsampled Training and Evaluation without Decoding

In [0]:
def accelerated_training_function(model, downsampled_dataset, tokenizer, metric, hyperparameters):
    
    from accelerate import Accelerator
    from apex.optimizers import FusedAdam
    import datasets
    import torch
    # from torch.optim import AdamW
    from torch.utils.data import DataLoader
    from tqdm.notebook import tqdm
    import transformers
    from transformers import DataCollatorForSeq2Seq, get_scheduler

    # initialize our accelerator as early as possible for configuring the distributed backend
    accelerator = Accelerator()

    # To have only one message (and not 2) per logs of Transformers or Datasets, we set the logging verbosity to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    # define the output dir
    output_dir = "/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-downsampled"

    # Collate our datasets
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

    downsampled_train_dataloader = DataLoader(
        downsampled_dataset["train"],
        shuffle=True, # add shuffling
        batch_size=hyperparameters["train_batch_size"],
        collate_fn=data_collator
    )
    downsampled_eval_dataloader = DataLoader(
        downsampled_dataset["test"], 
        batch_size=hyperparameters["eval_batch_size"], 
        collate_fn=data_collator
    )

    # use the apex optimized version of AdamW with a fused kernel
    # NOTE T5 was pretrained with the AdaFactor optimizer - perhaps we should compare this optimizer in a separate training
    optimizer = FusedAdam(model.parameters(), lr=hyperparameters["learning_rate"], adam_w_mode=True)
    # optimizer = AdamW(model.parameters(), lr=hyperparameters["learning_rate"])

    model.to(accelerator.device)

    downsampled_train_dataloader, downsampled_eval_dataloader, model, optimizer = accelerator.prepare(downsampled_train_dataloader, downsampled_eval_dataloader, model, optimizer)

    num_epochs = hyperparameters["num_epochs"]
    num_training_steps = num_epochs * len(downsampled_train_dataloader)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )

    mem_status_distributed()
    for epoch in range(num_epochs):
        # training
        model.train()
        for batch in tqdm(downsampled_train_dataloader, desc=f"Epoch {epoch}", position=0, leave=True):
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # evaluation
        model.eval()
        all_predictions = []
        all_labels = []
        for step, batch in enumerate(downsampled_eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 2 GPUs to combine them all
            all_predictions.append(accelerator.gather(predictions))
            all_labels.append(accelerator.gather(batch["labels"]))

        # Concatenate all predictions and labels
        # The last thing we need to do is to truncate the predictions and labels we concatenated
        # together as the prepared evaluation dataloader has a little bit more elements to make
        # batches of the same size on each process.
        all_predictions = torch.cat(all_predictions)[:len(downsampled_dataset["test"])]
        all_labels = torch.cat(all_labels)[:len(downsampled_dataset["test"])]

        mem_status_distributed() # show us the mem status of each GPU at the end of each epoch
        # use the metric passed into the function and avoid overwriting it with its own result
        results = metric.compute(predictions=all_predictions, references=all_labels)
        accelerator.print(f"epoch {epoch}:", results)

    # be sure to save our trained model to a given path
    # first wait for all processes to reach the same stage
    accelerator.wait_for_everyone()
    # unwraps the model from accelerate.prepare() to reintroduce the save_pretrained() fn for saving
    unwrapped_model = accelerator.unwrap_model(model)
    # accelerator.save() instead of torch.save()
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
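
The `"linear"` scheduler with `num_warmup_steps=0` used in the function above decays the learning rate in a straight line from its initial value to zero over `num_training_steps`. A minimal sketch of that rule follows, assuming 3 epochs of 1563 steps as in the run logged in this section; `linear_decay_lr` is a hypothetical stand-in written for illustration, not the library function.

```python
def linear_decay_lr(base_lr, step, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * (step / max(1, num_warmup_steps))
    remaining = max(0, num_training_steps - step)
    return base_lr * (remaining / max(1, num_training_steps - num_warmup_steps))

total_steps = 3 * 1563  # assumed: 3 epochs at 1563 steps per epoch
for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay_lr(1e-4, step, total_steps))
```

With no warmup, the very first step already uses the full `1e-4`, and the last epoch trains at less than a third of that rate.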


In [0]:
import torch.multiprocessing as mp

# Set the multiprocessing start method to 'spawn'
mp.set_start_method('spawn', force=True) # essential for spawning the multiprocessing properly

In [0]:
from accelerate import notebook_launcher

notebook_launcher(accelerated_training_function, (model, downsampled_dataset, tokenizer, exact_match_metric, hyperparameters), num_processes=2, mixed_precision="bf16")

Launching training on 2 GPUs.
GPU 1: | 
  Total memory: 79.15 GB |
  Allocated memory: 5.98 GB |
  Reserved memory: 8.48 GB |
  Available memory: 70.67 GB |
GPU 0: | 
  Total memory: 79.15 GB |
  Allocated memory: 5.98 GB |
  Reserved memory: 8.73 GB |
  Available memory: 70.42 GB |


Epoch 0:   0%|          | 0/1563 [00:00<?, ?it/s]

Epoch 0:   0%|          | 0/1563 [00:00<?, ?it/s]

  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


GPU 1: | GPU 0: | 

  Total memory: 79.15 GB |  Total memory: 79.15 GB |

  Allocated memory: 12.08 GB |  Allocated memory: 12.08 GB |

  Reserved memory: 18.50 GB |  Reserved memory: 18.50 GB |

  Available memory: 60.65 GB |  Available memory: 60.65 GB |



Epoch 1:   0%|          | 0/1563 [00:00<?, ?it/s]

epoch 0: {'exact_match': 32.21}


Epoch 1:   0%|          | 0/1563 [00:00<?, ?it/s]

GPU 1: | GPU 0: | 

  Total memory: 79.15 GB |  Total memory: 79.15 GB |

  Allocated memory: 12.08 GB |  Allocated memory: 12.08 GB |

  Reserved memory: 18.50 GB |  Reserved memory: 18.50 GB |

  Available memory: 60.65 GB |  Available memory: 60.65 GB |



Epoch 2:   0%|          | 0/1563 [00:00<?, ?it/s]

epoch 1: {'exact_match': 39.69}


Epoch 2:   0%|          | 0/1563 [00:00<?, ?it/s]

GPU 0: | GPU 1: | 

  Total memory: 79.15 GB |  Total memory: 79.15 GB |

  Allocated memory: 12.08 GB |  Allocated memory: 12.08 GB |

  Reserved memory: 18.50 GB |  Reserved memory: 18.50 GB |

  Available memory: 60.65 GB |  Available memory: 60.65 GB |

epoch 2: {'exact_match': 44.99}
[2024-12-20 18:31:34,441] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 18:31:34,448] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
/usr/bin/ld: cannot find -laio: No such file or directory




collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status







Configuration saved in /Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-downsampled/config.json
Configuration saved in /Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-downsampled/generation_config.json
W1220 18:31:40.813138 140155197497344 torch/multiprocessing/spawn.py:145] Terminating process 63869 via signal SIGTERM
E1220 18:31:43.530642 140155197497344 torch/distributed/elastic/multiprocessing/api.py:695] failed (exitcode: 1) local_rank: 0 (pid: 63866) of fn: accelerated_training_function (start_method: fork)
E1220 18:31:43.530642 140155197497344 torch/distributed/elastic/multiprocessing/api.py:695] Traceback (most recent call last):
E1220 18:31:43.530642 140155197497344 torch/distributed/elastic/multiprocessing/api.py:695]   File "/databricks/python/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 656, in _poll
E1220 18:31:43.530642 140155197497344 torch/distributed/elastic/multiproc

---------------------------------------------------------------------------
ChildFailedError                          Traceback (most recent call last)
File <command-955088565215259>, line 3
      1 from accelerate import notebook_launcher
----> 3 notebook_launcher(accelerated_training_function, (model, downsampled_dataset, tokenizer, exact_match_metric, hyperparameters), num_processes=2, mixed_precision="bf16")

File /databricks/python/lib/python3.11/site-packages/accelerate/launchers.py:239, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes, rdzv_backend, rdzv_endpoint, rdzv_conf, rdzv_id, max_restarts, monitor_interval)
    225             rdzv_endpoint = 

### Updated Downsampled Training and Evaluation with Decoding

In [0]:
!mkdir /tmp/machine_dir

In [0]:
!ls /tmp

Rserv
Rtmpf8qfoW
chauffeur-daemon-params
chauffeur-daemon.pid
chauffeur-env.sh
custom-spark.conf
driver-daemon-params
driver-daemon.pid
driver-env.sh
hsperfdata_root
machine_dir
python_lsp_logs
systemd-private-b81863601b2e4c9e819cc56024f38457-systemd-logind.service-1ezHoY
systemd-private-b81863601b2e4c9e819cc56024f38457-systemd-resolved.service-ikrxlP
tmp.aXcBvtVXck
tmpemz2zxud


In [0]:
import os

# Set the NCCL_SOCKET_IFNAME environment variable
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

In [0]:
def accelerated_training_function(model, downsampled_dataset, tokenizer, metric, hyperparameters):
    
    from accelerate import Accelerator
    from apex.optimizers import FusedAdam
    import datasets
    import torch
    # from torch.optim import AdamW
    from torch.utils.data import DataLoader
    from tqdm.notebook import tqdm
    import transformers
    from transformers import DataCollatorForSeq2Seq, get_scheduler

    # initialize our accelerator as early as possible for configuring the distributed backend
    accelerator = Accelerator()

    # To have only one message (and not 2) per logs of Transformers or Datasets, we set the logging verbosity to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    # local directory on the machine itself, used for saving instead of the Volume
    tmp_dir = "/tmp/machine_dir_2"
    # define the output dir
    output_dir = "/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-downsampled-v3"

    # Collate our datasets
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

    downsampled_train_dataloader = DataLoader(
        downsampled_dataset["train"],
        shuffle=True, # add shuffling
        batch_size=hyperparameters["train_batch_size"],
        collate_fn=data_collator
    )
    downsampled_eval_dataloader = DataLoader(
        downsampled_dataset["test"], 
        batch_size=hyperparameters["eval_batch_size"], 
        collate_fn=data_collator
    )

    # use the apex optimized version of AdamW with a fused kernel
    # NOTE T5 was pretrained with the AdaFactor optimizer - perhaps we should compare this optimizer in a separate training
    optimizer = FusedAdam(model.parameters(), lr=hyperparameters["learning_rate"], adam_w_mode=True)
    # optimizer = AdamW(model.parameters(), lr=hyperparameters["learning_rate"])

    model.to(accelerator.device)

    downsampled_train_dataloader, downsampled_eval_dataloader, model, optimizer = accelerator.prepare(downsampled_train_dataloader, downsampled_eval_dataloader, model, optimizer)

    num_epochs = hyperparameters["num_epochs"]
    num_training_steps = num_epochs * len(downsampled_train_dataloader)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )

    mem_status_distributed()
    for epoch in range(num_epochs):
        # training
        model.train()
        for batch in tqdm(downsampled_train_dataloader, desc=f"Epoch {epoch}", position=0, leave=True):
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # evaluation
        model.eval()
        # all_predictions = []
        # all_labels = []
        for step, batch in enumerate(downsampled_eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 2 GPUs to combine them all
            # all_predictions.append(accelerator.gather(predictions))
            # all_labels.append(accelerator.gather(batch["labels"]))
            gathered_predictions = accelerator.gather_for_metrics(predictions)
            gathered_labels = accelerator.gather_for_metrics(batch["labels"])

            # decode to text so that exact match compares answer strings rather than padded token ids
            for pred, label in zip(gathered_predictions, gathered_labels):
                metric.add(predictions=tokenizer.decode(pred, skip_special_tokens=True), references=tokenizer.decode(label, skip_special_tokens=True))

        mem_status_distributed() # show us the mem status of each GPU at the end of each epoch
        # results = metric.compute(predictions=all_predictions, references=all_labels)
        results = metric.compute()
        accelerator.print(f"epoch {epoch}:", results)

    # be sure to save our trained model to a given path
    # first wait for all processes to reach the same stage
    accelerator.wait_for_everyone()
    # unwraps the model from accelerate.prepare() to reintroduce the save_pretrained() fn for saving
    unwrapped_model = accelerator.unwrap_model(model)
    # accelerator.save() instead of torch.save()
    unwrapped_model.save_pretrained(tmp_dir, save_function=accelerator.save)
    # why do we need this at all?
    if accelerator.is_main_process:
        tokenizer.save_pretrained(tmp_dir)
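
The first version of this function gathered predictions with `accelerator.gather` and truncated them to the dataset length by hand, while this version relies on `gather_for_metrics` to drop the duplicated samples. The sketch below is a simplified model of why duplicates appear at all. It assumes indices wrap around so that every process gets a full batch on every step; `shard_batches` is a hypothetical helper and does not reproduce Accelerate's actual sharding code.

```python
def shard_batches(num_examples, batch_size, num_processes):
    """Yield, per step, one list of example indices for each process.

    Simplified model: indices wrap around to the start of the dataset so
    that every process receives a full batch on every step.
    """
    per_step = batch_size * num_processes
    num_steps = -(-num_examples // per_step)  # ceiling division
    padded = [i % num_examples for i in range(num_steps * per_step)]
    for step in range(num_steps):
        chunk = padded[step * per_step:(step + 1) * per_step]
        yield [chunk[p * batch_size:(p + 1) * batch_size] for p in range(num_processes)]

num_examples = 10
gathered = []
for per_process in shard_batches(num_examples, batch_size=4, num_processes=2):
    # "gather": concatenate what each process produced on this step
    for batch in per_process:
        gathered.extend(batch)

print(len(gathered))                     # 16: six wrapped-around duplicates at the end
deduplicated = gathered[:num_examples]   # the manual truncation from the first version
print(deduplicated)
```

Without the truncation (or `gather_for_metrics`), the duplicated examples would be counted twice in the metric.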


In [0]:
import torch.multiprocessing as mp

# Set the multiprocessing start method to 'spawn'
mp.set_start_method('spawn', force=True) # essential for spawning the multiprocessing properly

In [0]:
from accelerate import notebook_launcher

notebook_launcher(accelerated_training_function, (model, downsampled_dataset, tokenizer, exact_match_metric, hyperparameters), num_processes=2, mixed_precision="bf16")

Launching training on 2 GPUs.
GPU 0: | 
  Total memory: 79.15 GB |
  Allocated memory: 5.98 GB |
  Reserved memory: 8.48 GB |
  Available memory: 70.67 GB |


Epoch 0:   0%|          | 0/196 [00:00<?, ?it/s]

GPU 1: | 
  Total memory: 79.15 GB |
  Allocated memory: 5.98 GB |
  Reserved memory: 8.73 GB |
  Available memory: 70.42 GB |


Epoch 0:   0%|          | 0/196 [00:00<?, ?it/s]

  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


GPU 0: | 
  Total memory: 79.15 GB |
  Allocated memory: 14.11 GB |
  Reserved memory: 55.60 GB |
  Available memory: 23.55 GB |
GPU 1: | 
  Total memory: 79.15 GB |
  Allocated memory: 14.11 GB |
  Reserved memory: 51.62 GB |
  Available memory: 27.53 GB |


Epoch 1:   0%|          | 0/196 [00:00<?, ?it/s]

epoch 0: {'exact_match': 25.019999999999996}


Epoch 1:   0%|          | 0/196 [00:00<?, ?it/s]

GPU 1: | 
  Total memory: 79.15 GB |
  Allocated memory: 14.11 GB |
  Reserved memory: 51.62 GB |
  Available memory: 27.53 GB |
GPU 0: | 
  Total memory: 79.15 GB |
  Allocated memory: 14.11 GB |
  Reserved memory: 55.60 GB |
  Available memory: 23.55 GB |
epoch 1: {'exact_match': 28.38}


Epoch 2:   0%|          | 0/196 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/196 [00:00<?, ?it/s]

GPU 1: | GPU 0: | 

  Total memory: 79.15 GB |  Total memory: 79.15 GB |

  Allocated memory: 14.11 GB |  Allocated memory: 14.11 GB |

  Reserved memory: 51.62 GB |  Reserved memory: 55.60 GB |

  Available memory: 27.53 GB |  Available memory: 23.55 GB |

epoch 2: {'exact_match': 29.74}
[2024-12-31 18:25:47,468] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-31 18:25:47,493] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status









Configuration saved in /tmp/machine_dir_2/config.json
Configuration saved in /tmp/machine_dir_2/generation_config.json
Model weights saved in /tmp/machine_dir_2/model.safetensors
tokenizer config file saved in /tmp/machine_dir_2/tokenizer_config.json
Special tokens file saved in /tmp/machine_dir_2/special_tokens_map.json
added tokens file saved in /tmp/machine_dir_2/added_tokens.json
W1231 18:26:13.488908 140074288316416 torch/distributed/elastic/multiprocessing/api.py:727] Closing process 16633 via signal SIGTERM


### Original Downsampled Training and Eval Postprocessing

In [0]:
# NOTE: Original

# make sure we can load our model from the saved location
from transformers import T5ForConditionalGeneration

trained_model = T5ForConditionalGeneration.from_pretrained("/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-downsampled")

In [0]:
trained_model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
       

In [0]:
trained_model.dtype

torch.float32

In [0]:
# tokenized_eval_dataset = tokenized_eval_dataset["test"]
tokenized_eval_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 10000
})

In [0]:
# collate our true eval set 
from transformers import DataCollatorForSeq2Seq
from torch.utils.data import DataLoader

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

eval_dataloader = DataLoader(
    tokenized_eval_dataset, 
    batch_size=hyperparameters["eval_batch_size"], 
    collate_fn=data_collator
)

In [0]:
# put the model on the GPU
trained_model.to(torch.device("cuda"))

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): FusedRMSNorm(torch.Size([1024]), eps=1e-06, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(i

In [0]:
from evaluate import load

exact_match_metric = load("exact_match")

In [0]:
# evaluation
from tqdm.notebook import tqdm
from datasets import Dataset

partials = {"predicted_tokens": [], "label_tokens": [], "decoded_prediction": [], "decoded_label": []}
device = torch.device("cuda")

trained_model.eval()
for batch in tqdm(eval_dataloader, desc="Evaluating", position=0, leave=True):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        # labels are part of the batch, so these logits come from a teacher-forced forward pass
        outputs = trained_model(**batch)

    for pred, label in zip(outputs.logits.argmax(dim=-1), batch["labels"]):
        # decode each sequence once and reuse the strings below
        decoded_pred = tokenizer.decode(pred, skip_special_tokens=True)
        decoded_label = tokenizer.decode(label, skip_special_tokens=True)
        # add decoded predictions and labels to the metric object
        exact_match_metric.add(predictions=decoded_pred, references=decoded_label)
        # populate the partial correctness dict for detailed, individual eval
        # (store plain Python lists rather than GPU tensors)
        partials["predicted_tokens"].append(pred.tolist())
        partials["label_tokens"].append(label.tolist())
        partials["decoded_prediction"].append(decoded_pred)
        partials["decoded_label"].append(decoded_label)

print(exact_match_metric.compute())
partial_correctness_dataset = Dataset.from_dict(partials)

Evaluating:   0%|          | 0/1250 [00:00<?, ?it/s]

{'exact_match': 0.2121}
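
As a reading aid for the score above, the following is a minimal sketch of the idea behind exact match: the fraction of predictions that are string-identical to their references. The function below is a hypothetical stand-in written for illustration; the `evaluate` metric offers additional options and is not reproduced here.

```python
def exact_match(predictions, references):
    """Fraction of predictions that are string-identical to their reference.

    Hypothetical helper for illustration only.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)


# toy answers, not taken from the eval set
score = exact_match(["4", "-2", "17"], ["4", "2", "17"])
print(round(score, 4))  # 0.6667
```

On the 10,000-row eval set, a score of 0.2121 corresponds to 2,121 answers that match their label exactly.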


In [0]:
partial_correctness_dataset

Dataset({
    features: ['predicted_tokens', 'label_tokens', 'decoded_prediction', 'decoded_label'],
    num_rows: 10000
})
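
The notebook does not define a partial correctness score at this point, so the following is only one possible sketch of how the stored token columns could be compared row by row. The helper `token_accuracy` and the pad id of 0 are assumptions made for illustration.

```python
def token_accuracy(predicted_tokens, label_tokens, pad_id=0):
    """Share of non-padding label positions where the predicted token id matches.

    Hypothetical measure; `pad_id=0` is an assumed pad token id.
    """
    positions = [i for i, tok in enumerate(label_tokens) if tok != pad_id]
    if not positions:
        return 0.0
    correct = sum(
        1
        for i in positions
        if i < len(predicted_tokens) and predicted_tokens[i] == label_tokens[i]
    )
    return correct / len(positions)


# toy row shaped like an entry of the partial correctness dataset
row = {"predicted_tokens": [3, 9, 1, 0], "label_tokens": [3, 7, 1, 0]}
print(round(token_accuracy(row["predicted_tokens"], row["label_tokens"]), 4))  # 0.6667
```

A per-row score like this separates predictions that are close to the label from those that miss entirely, which exact match alone cannot do.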

In [0]:
# let's push the partial downsampled finetuned model and the evaluation dataset to the hub for saving
dbutils.widgets.text("hf_token", "", "hf_token")

In [0]:
hf_token = dbutils.widgets.get("hf_token")
!huggingface-cli login --token $hf_token

usage: huggingface-cli <command> [<args>] login [-h] [--token TOKEN]
                                                [--add-to-git-credential]
huggingface-cli <command> [<args>] login: error: argument --token: expected one argument
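
The usage error above is what the CLI prints when `--token` receives no value, which is most likely what happens when the `hf_token` widget is left empty. A minimal sketch of a guard, assuming one prefers to log in from Python with `huggingface_hub.login`; the helper names `require_token` and `login_to_hub` are hypothetical.

```python
def require_token(value):
    """Return the stripped token, or raise if nothing was provided."""
    token = (value or "").strip()
    if not token:
        raise ValueError("hf_token is empty; fill in the notebook widget before logging in")
    return token


def login_to_hub(value):
    """Validate the token first, then log in to the Hugging Face Hub."""
    token = require_token(value)
    # imported lazily so the validation above works even without the library installed
    from huggingface_hub import login

    login(token=token)
```

In the notebook this would be called as `login_to_hub(dbutils.widgets.get("hf_token"))`, so an empty widget fails with a clear message instead of a CLI usage error.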


In [0]:
trained_model.push_to_hub("CyberSolve-DeepMind-LinAlg-1D-downsample", commit_message="A initial finetuing of the flan-T5-large model on a downsampled version of the DeepMind LingAlg 1D Dataset. We call this CyberSolve")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample/commit/a8f4d7d9c07c8dca81a71fc03d05e85e83bdb8e7', commit_message='A initial finetuing of the flan-T5-large model on a downsampled version of the DeepMind LingAlg 1D Dataset. We call this CyberSolve', commit_description='', oid='a8f4d7d9c07c8dca81a71fc03d05e85e83bdb8e7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample', endpoint='https://huggingface.co', repo_type='model', repo_id='MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample'), pr_revision=None, pr_num=None)

In [0]:
partial_correctness_dataset.push_to_hub("CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark", commit_message="a dataset for evaluating the partial correctness of the initial finetuning of the flan-T5-large model on a downsampled version of the DeepMind LingAlg 1D Dataset (which we subesequently call CyberSolve)")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/417 [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark/commit/5665466e11700bb36670c7b28018aea96b8f3749', commit_message='a dataset for evaluating the partial correctness of the initial finetuning of the flan-T5-large model on a downsampled version of the DeepMind LingAlg 1D Dataset (which we subesequently call CyberSolve)', commit_description='', oid='5665466e11700bb36670c7b28018aea96b8f3749', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark', endpoint='https://huggingface.co', repo_type='dataset', repo_id='MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark'), pr_revision=None, pr_num=None)

### Updated Downsampled Training and Eval Postprocessing

In [0]:
# NOTE: Updated

# make sure we can load our model from the saved location
from transformers import T5ForConditionalGeneration

v2_trained_model = T5ForConditionalGeneration.from_pretrained("/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-downsampled-v2")

In [0]:
v2_trained_model # this should be the model whose weights we just fine-tuned

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): FusedRMSNorm(torch.Size([1024]), eps=1e-06, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(i

In [0]:
tokenized_eval_dataset = tokenized_eval_dataset["test"]
tokenized_eval_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 10000
})

In [0]:
# collate our true eval set 
from transformers import DataCollatorForSeq2Seq
from torch.utils.data import DataLoader

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

eval_dataloader = DataLoader(
    tokenized_eval_dataset, 
    batch_size=hyperparameters["eval_batch_size"], 
    collate_fn=data_collator
)

In [0]:
# put the model on the GPU
v2_trained_model.to(torch.device("cuda"))

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): FusedRMSNorm(torch.Size([1024]), eps=1e-06, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(i

In [0]:
from evaluate import load

exact_match_metric = load("exact_match")

In [0]:
# evaluation
from tqdm.notebook import tqdm
from datasets import Dataset

partials = {"predicted_tokens": [], "label_tokens": [], "decoded_prediction": [], "decoded_label": []}
device = torch.device("cuda")

v2_trained_model.eval()
for batch in tqdm(eval_dataloader, desc="Evaluating", position=0, leave=True):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        # labels are part of the batch, so these logits come from a teacher-forced forward pass
        outputs = v2_trained_model(**batch)

    for pred, label in zip(outputs.logits.argmax(dim=-1), batch["labels"]):
        # decode each sequence once and reuse the strings below
        decoded_pred = tokenizer.decode(pred, skip_special_tokens=True)
        decoded_label = tokenizer.decode(label, skip_special_tokens=True)
        # add decoded predictions and labels to the metric object
        exact_match_metric.add(predictions=decoded_pred, references=decoded_label)
        # populate the partial correctness dict for detailed, individual eval
        # (store plain Python lists rather than GPU tensors)
        partials["predicted_tokens"].append(pred.tolist())
        partials["label_tokens"].append(label.tolist())
        partials["decoded_prediction"].append(decoded_pred)
        partials["decoded_label"].append(decoded_label)

print(exact_match_metric.compute())
partial_correctness_dataset = Dataset.from_dict(partials)

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


{'exact_match': 0.0692}
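
The progress bar totals in the two evaluation runs (1250 batches earlier, 40 here) follow from the eval set size and the batch size. The value of `hyperparameters["eval_batch_size"]` is not shown in this section, so the batch sizes of 8 and 256 below are inferred assumptions, and the sketch assumes the `DataLoader` keeps the last partial batch.

```python
import math


def num_batches(num_rows, batch_size):
    """Number of batches when the final, smaller batch is kept."""
    return math.ceil(num_rows / batch_size)


print(num_batches(10_000, 8))    # 1250
print(num_batches(10_000, 256))  # 40
```

Any batch size from 250 to 256 would also give 40 batches, so the second figure is consistent with 256 but does not pin it down.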


In [0]:
partial_correctness_dataset

Dataset({
    features: ['predicted_tokens', 'label_tokens', 'decoded_prediction', 'decoded_label'],
    num_rows: 10000
})

In [0]:
v2_trained_model.push_to_hub("CyberSolve-DeepMind-LinAlg-1D-downsample-v2", commit_message="A second fine-tuning of the flan-T5-large model on the downsampled DeepMind LinAlg 1D dataset, this time with a GPU batch size of 256 as opposed to the 32 used before")

In [0]:
partial_correctness_dataset.push_to_hub("CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark-v2", commit_message="a second dataset for evaluating the partial correctness of the second finetuning of the flan-T5-large model on a downsampled version of the DeepMind LingAlg 1D dataset")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark-v2/commit/c0462b69888b93ee06cb892683ca3f64dc4cffe4', commit_message='a second dataset for evaluating the partial correctness of the second finetuning of the flan-T5-large model on a downsampled version of the DeepMind LingAlg 1D dataset', commit_description='', oid='c0462b69888b93ee06cb892683ca3f64dc4cffe4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark-v2', endpoint='https://huggingface.co', repo_type='dataset', repo_id='MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark-v2'), pr_revision=None, pr_num=None)