## Full Training and Evaluation - DeepMind LinAlg 1D

Now that we have conducted a preliminary downsampled training, we can improve our model by feeding it more tokens. For that smaller, initial fine-tuning, we downsampled our dataset of 2M records to only 100K records (5% of the original data). This showed us that the T5 model can indeed improve when fed properly preprocessed data, but the results of the downsampled training were still unsatisfactory. We hope to attain results similar to (or potentially better than) those reported in the original [DeepMind Mathematics Dataset paper](https://arxiv.org/abs/1904.01557).

We use the same architecture and optimizer as in the downsampled training. We use NVIDIA `Apex` for improved computation and memory utilization with the `flan-T5-large` model.

In [0]:
# install / upgrade transformers and install apex
!pip install -U transformers
!pip install /Volumes/workspace_dogfood/jgr/wheels/apex-0.1-cp311-cp311-linux_x86_64.whl

dbutils.library.restartPython()

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/f2/3a/8bdab26e09c5a242182b7ba9152e216d5ab4ae2d78c4298eb4872549cd35/transformers-4.47.1-py3-none-any.whl.metadata
  Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers)
  Obtaining dependency information for huggingface-hub<1.0,>=0.24.0 from https://files.pythonhosted.org/packages/6c/3f/50f6b25fafdcfb1c089187a328c95081abf882309afd86f4053951507cd1/huggingface_hub-0.27.1-py3-none-any.whl.metadata
  Downloading huggingface_hub-0.27.1-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Obtaining dependency information for tokenizers<

In [0]:
# confirm the apex library is available
from apex import normalization

In [0]:
# load the full preprocessed training and evaluation datasets
from datasets import load_dataset

tokenized_train_dataset = load_dataset("MarioBarbeque/DeepMind-LinAlg-1D-train")
tokenized_eval_dataset = load_dataset("MarioBarbeque/DeepMind-LinAlg-1D-eval")



In [0]:
tokenized_train_dataset = tokenized_train_dataset["train"]
tokenized_eval_dataset = tokenized_eval_dataset["test"]

In [0]:
# set the dataset format to numpy in order to pass it to the seq2seq data collator
# the DataCollatorForSeq2Seq uses numpy arrays to pad the labels
tokenized_train_dataset.set_format("numpy")
tokenized_eval_dataset.set_format("numpy")
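
The `DataCollatorForSeq2Seq` used later in the training function is configured with `pad_to_multiple_of=2` and pads labels with the tokenizer's pad token. The following is a minimal pure-Python sketch of that label-padding idea, not the library's implementation: the function name `pad_labels` and the token ids are hypothetical, and right-side padding with pad id `0` is assumed.

```python
def pad_labels(label_batch, pad_id=0, pad_to_multiple_of=2):
    """Right-pad every label sequence in a batch to a common length.

    The common length is the longest sequence in the batch, rounded up to
    the nearest multiple of `pad_to_multiple_of` (hypothetical helper).
    """
    max_len = max(len(labels) for labels in label_batch)
    if pad_to_multiple_of:
        # ceiling division, then scale back up to the next multiple
        max_len = -(-max_len // pad_to_multiple_of) * pad_to_multiple_of
    return [list(labels) + [pad_id] * (max_len - len(labels)) for labels in label_batch]


# illustrative token ids only; they are not taken from the dataset
padded = pad_labels([[3, 18, 948, 1], [314, 1], [3, 18, 1]])
print(padded)
```

Every row in `padded` ends up with the same even length, which is what lets the batch be stacked into a single tensor.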

In [0]:
# peek some records
tokenized_train_dataset[:5], tokenized_eval_dataset[:5]

({'input_ids': array([[ 5175,   162,   997,  3274,   898,  4542,  1935,    75,     3,
             18,   898,  3076,  1935,    75,    21,     3,    75,     5,
              1,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0],
         [ 5175,   162,   431,  3436,  3274,     3,  4949,  1755,  1935,
             17,  1768,   335,  3840,  1935,    17,  1768,  1630,  1458,
            940,    21,     3,    17,     5,     1,     0,     0,     0,
              0,     0,     0,     0,     0,     0],
         [ 5175,   162,     3,  9169,  1935,    63,     3,    18,   204,
           3891,  1935,    63,  1768,  2664,  4056,  3274,     3,    18,
           4060,  1935,    63,    21,     3,    63,     5,     1,     0,
              0,     0,     0,     0,     0,     0],
         [ 5175,   162,     3,   632,  3274,     3,  9169,  1935,   115,
              3,    18,   314, 24748,  1768,   314, 20489,    21,     3,
            115,     5,  

In [0]:
# reinstantiate our tokenizer and model on the CPU - let the accelerator handle device placement
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

checkpoint = "google/flan-t5-large"
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
# load the model onto the CPU and let 🤗 accelerate take care of device placement in our training loop
# model = T5ForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

2024-12-31 20:00:40.140896: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [0]:
# grab our exact match metric
from datasets import load_metric

exact_match_metric = load_metric("exact_match")

  exact_match_metric = load_metric("exact_match")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
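
To make the metric concrete, here is a minimal sketch of what an exact-match score measures: the percentage of decoded predictions that equal their decoded reference string. The function `exact_match_percent` is a hypothetical stand-in written for illustration; it assumes plain string equality and does not reproduce any normalization options the loaded metric may offer.

```python
def exact_match_percent(predictions, references):
    """Percentage of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# illustrative strings only: two of the three answers match
score = exact_match_percent(["-6", "4", "12"], ["-6", "4", "-12"])
print(round(score, 2))
```

Because the comparison is all-or-nothing per record, an answer that differs only by a sign, as in the third pair, counts as a miss.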


In [0]:
# choose an intermediate batch size of 64 for our training loop
# we saw best results at batch size 32, and overfitting at batch size 256
# also bump the learning rate to 3e-4 from 1e-4 to see if we can get better results

hyperparameters = {
    "learning_rate": 3e-4, # see the T5 documentation on finetuning learning rate for AdamW
    "num_epochs": 3,
    "train_batch_size": 64, # actual batch size will be this x num GPUs
    "eval_batch_size": 256, # actual batch size will be this x num GPUs
}
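
As a quick check on these settings, the sketch below works out the effective batch size and the number of optimizer steps per epoch. It assumes roughly 2M training records (the figure quoted in the introduction) and the 2 processes used by the launcher later in this notebook; `steps_per_epoch` is a hypothetical helper name.

```python
import math


def steps_per_epoch(num_records, per_device_batch_size, num_processes):
    """Optimizer steps per epoch when each process consumes its own batch per step."""
    effective_batch_size = per_device_batch_size * num_processes
    return math.ceil(num_records / effective_batch_size)


# assumed values: 2M records, per-device batch size 64, 2 GPUs
train_steps = steps_per_epoch(2_000_000, 64, 2)
print(train_steps)  # 15625
```

This agrees with the `0/15625` total shown by the progress bars in the training output.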

In [0]:
import torch

def mem_status_distributed():
    rank = torch.cuda.current_device()
    properties = torch.cuda.get_device_properties(rank)
    total_memory = properties.total_memory / (1024 ** 3)  # Convert to GB
    allocated_memory = torch.cuda.memory_allocated(rank) / (1024 ** 3)
    reserved_memory = torch.cuda.memory_reserved(rank) / (1024 ** 3)
    available_memory = total_memory - reserved_memory
    print(f"GPU {rank}: | ")
    print(f"Total memory: {total_memory:.2f} GB |")
    print(f"Allocated memory: {allocated_memory:.2f} GB |")
    print(f"Reserved memory: {reserved_memory:.2f} GB |")
    print(f"Available memory: {available_memory:.2f} GB |")


## Initial 3 epoch checkpoint training

In [0]:
# create a dir on the local machine to save our model after distributed training
!mkdir /tmp/machine_dir
!ls /tmp

Rserv
RtmppEQUgN
chauffeur-daemon-params
chauffeur-daemon.pid
chauffeur-env.sh
custom-spark.conf
driver-daemon-params
driver-daemon.pid
driver-env.sh
hsperfdata_root
machine_dir
python_lsp_logs
systemd-private-260b01789a5045ea8496a775359e3934-systemd-logind.service-CtBx3v
systemd-private-260b01789a5045ea8496a775359e3934-systemd-resolved.service-d0WEGP
tmp.yIJ4letUmp
tmpfabskme2


In [0]:
def accelerated_training_function(model, tokenized_train_dataset, tokenized_eval_dataset, tokenizer, exact_match_metric, hyperparameters):
    
    from accelerate import Accelerator
    from apex.optimizers import FusedAdam
    import datasets
    import torch
    from torch.utils.data import DataLoader
    from tqdm.notebook import tqdm
    import transformers
    from transformers import DataCollatorForSeq2Seq, get_scheduler

    # initialize our accelerator as early as possible for configuring the distributed backend
    accelerator = Accelerator()

    # To have only one message (and not 2) per logs of Transformers or Datasets, we set the logging verbosity to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    # on machine dir where we will save the model
    tmp_dir = "/tmp/machine_dir"

    # Collate our datasets
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

    train_dataloader = DataLoader(
        tokenized_train_dataset,
        shuffle=True, # add shuffling
        batch_size=hyperparameters["train_batch_size"],
        collate_fn=data_collator
    )
    eval_dataloader = DataLoader(
        tokenized_eval_dataset, 
        batch_size=hyperparameters["eval_batch_size"], 
        collate_fn=data_collator
    )

    # use the apex optimized version of AdamW with a fused kernel
    # NOTE T5 was pretrained with the AdaFactor optimizer - perhaps we should compare this optimizer in a separate training
    optimizer = FusedAdam(model.parameters(), lr=hyperparameters["learning_rate"], adam_w_mode=True)
    # optimizer = AdamW(model.parameters(), lr=hyperparameters["learning_rate"])

    model.to(accelerator.device)

    train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(train_dataloader, eval_dataloader, model, optimizer)

    num_epochs = hyperparameters["num_epochs"]
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )

    mem_status_distributed()
    for epoch in range(num_epochs):
        # training
        model.train()
        for batch in tqdm(train_dataloader, desc=f"Epoch {epoch}", position=0, leave=True):
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # evaluation
        model.eval()
        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 2 GPUs to combine them all
            gathered_predictions = accelerator.gather_for_metrics(predictions)
            gathered_labels = accelerator.gather_for_metrics(batch["labels"])

            for pred, label in zip(gathered_predictions, gathered_labels):
                exact_match_metric.add(predictions=tokenizer.decode(pred, skip_special_tokens=True), references=tokenizer.decode(label, skip_special_tokens=True))

        mem_status_distributed() # show us the mem status of each GPU at the end of each epoch
        metric = exact_match_metric.compute()
        accelerator.print(f"epoch {epoch}:", metric)

    # be sure to save our trained model to a given path
    # first wait for all processes to reach the same stage
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(tmp_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(tmp_dir)
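
The training function above requests a `"linear"` schedule with `num_warmup_steps=0`, so the learning rate decays in a straight line from its initial value to zero over all training steps. Below is a minimal pure-Python sketch of that decay rule for intuition; `linear_lr` is a hypothetical helper, and the step count assumes 3 epochs of 15625 steps each, as shown by the progress bars.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


total_steps = 3 * 15625  # assumed: 3 epochs x 15625 steps per epoch
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 3e-4, total_steps))
```

With no warmup, the first optimizer step already runs at the full learning rate, which is worth keeping in mind when comparing the `3e-4` and `1e-4` runs.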


In [0]:
import torch.multiprocessing as mp

# Set the multiprocessing start method to 'spawn'
mp.set_start_method('spawn', force=True) # essential for spawning the multiprocessing properly



In [0]:
from accelerate import notebook_launcher

notebook_launcher(accelerated_training_function, (model, tokenized_train_dataset, tokenized_eval_dataset, tokenizer, exact_match_metric, hyperparameters), num_processes=2, mixed_precision="bf16")

Launching training on 2 GPUs.
GPU 1: |
Total memory: 79.15 GB |
Allocated memory: 5.98 GB |
Reserved memory: 8.48 GB |
Available memory: 70.67 GB |
GPU 0: |
Total memory: 79.15 GB |
Allocated memory: 5.98 GB |
Reserved memory: 8.48 GB |
Available memory: 70.67 GB |


Epoch 0:   0%|          | 0/15625 [00:00<?, ?it/s]

Epoch 0:   0%|          | 0/15625 [00:00<?, ?it/s]

  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


GPU 1: |
Total memory: 79.15 GB |
Allocated memory: 13.89 GB |
Reserved memory: 28.15 GB |
Available memory: 51.00 GB |
GPU 0: |
Total memory: 79.15 GB |
Allocated memory: 13.90 GB |
Reserved memory: 29.94 GB |
Available memory: 49.21 GB |

epoch 0: {'exact_match': 55.35}


Epoch 1:   0%|          | 0/15625 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/15625 [00:00<?, ?it/s]



GPU 0: |
Total memory: 79.15 GB |
Allocated memory: 13.90 GB |
Reserved memory: 29.94 GB |
Available memory: 49.21 GB |
GPU 1: |
Total memory: 79.15 GB |
Allocated memory: 13.89 GB |
Reserved memory: 28.15 GB |
Available memory: 51.00 GB |
epoch 1: {'exact_match': 73.8}


Epoch 2:   0%|          | 0/15625 [00:00<?, ?it/s]

Epoch 2:   0%|          | 0/15625 [00:00<?, ?it/s]

GPU 0: |
Total memory: 79.15 GB |
Allocated memory: 13.90 GB |
Reserved memory: 29.94 GB |
Available memory: 49.21 GB |
GPU 1: |
Total memory: 79.15 GB |
Allocated memory: 13.89 GB |
Reserved memory: 28.15 GB |
Available memory: 51.00 GB |

epoch 2: {'exact_match': 86.56}
[2025-01-01 02:30:00,050] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-01 02:30:00,059] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


df: /root/.triton/autotune: No such file or directory
df: /root/.triton/autotune: No such file or directory




/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory




collect2: error: ld returned 1 exit status




Configuration saved in /tmp/machine_dir/config.json
Configuration saved in /tmp/machine_dir/generation_config.json
Model weights saved in /tmp/machine_dir/model.safetensors
tokenizer config file saved in /tmp/machine_dir/tokenizer_config.json
Special tokens file saved in /tmp/machine_dir/special_tokens_map.json
added tokens file saved in /tmp/machine_dir/added_tokens.json
W0101 02:30:30.890990 140480460886016 torch/distributed/elastic/multiprocessing/api.py:727] Closing process 3806 via signal SIGTERM


In [0]:
# peek the saved files on the local machine
!ls /tmp/machine_dir/

added_tokens.json	model.safetensors	 tokenizer_config.json
config.json		special_tokens_map.json
generation_config.json	spiece.model


In [0]:
# Copy the contents of the temp directory to the permanent volume
dbutils.fs.cp("file:/tmp/machine_dir/", "dbfs:/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D", recurse=True)

True

## Further fine-tuning our model checkpoint

We've reached an `exact_match` score of 86.6%; let's see if we can push this past 95% with additional epochs. We will train for 2 more epochs, starting from the checkpoint that reached the 86.6% score.

In [0]:
# reinstantiate our tokenizer and model on the CPU - let the accelerator handle device placement
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

checkpoint = "/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D"
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
# load the model onto the CPU and let 🤗 accelerate take care of device placement in our training loop
# model = T5ForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
checkpoint_model = T5ForConditionalGeneration.from_pretrained(checkpoint)

2025-01-01 16:35:09.974783: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [0]:
# update our hyperparameters to train for an additional 2 epochs with a slightly lower initial learning rate

hyperparameters = {
    "learning_rate": 1e-4, # see the T5 documentation on finetuning learning rate for AdamW
    "num_epochs": 2,
    "train_batch_size": 64, # actual batch size will be this x num GPUs
    "eval_batch_size": 256, # actual batch size will be this x num GPUs
}

In [0]:
# recreate dirs on the local machine to save our model since we're spinning up a new compute instance
!mkdir /tmp/machine_dir_1
!mkdir /tmp/machine_dir_2
!ls /tmp

Rserv
RtmptL6LDu
chauffeur-daemon-params
chauffeur-daemon.pid
chauffeur-env.sh
custom-spark.conf
driver-daemon-params
driver-daemon.pid
driver-env.sh
hsperfdata_root
machine_dir_1
machine_dir_2
python_lsp_logs
systemd-private-41be431891994dfb8161431ee8b41975-systemd-logind.service-1BQg7x
systemd-private-41be431891994dfb8161431ee8b41975-systemd-resolved.service-fm378o
tmp.XNysxuc1Bg
tmp5zp6ec06
tmppi6h52cn


Update our training loop to write out the weights at the end of each epoch.

In [0]:
def second_accelerated_training_function(model, tokenized_train_dataset, tokenized_eval_dataset, tokenizer, exact_match_metric, hyperparameters):
    
    from accelerate import Accelerator
    from apex.optimizers import FusedAdam
    import datasets
    import torch
    from torch.utils.data import DataLoader
    from tqdm.notebook import tqdm
    import transformers
    from transformers import DataCollatorForSeq2Seq, get_scheduler

    # initialize our accelerator as early as possible for configuring the distributed backend
    accelerator = Accelerator()

    # To have only one message (and not 2) per logs of Transformers or Datasets, we set the logging verbosity to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    # on machine dir where we will save the model
    tmp_dir_1 = "/tmp/machine_dir_1"
    tmp_dir_2 = "/tmp/machine_dir_2"

    # Collate our datasets
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

    train_dataloader = DataLoader(
        tokenized_train_dataset,
        shuffle=True, # add shuffling
        batch_size=hyperparameters["train_batch_size"],
        collate_fn=data_collator
    )
    eval_dataloader = DataLoader(
        tokenized_eval_dataset, 
        batch_size=hyperparameters["eval_batch_size"], 
        collate_fn=data_collator
    )

    # use the apex optimized version of AdamW with a fused kernel
    # NOTE T5 was pretrained with the AdaFactor optimizer - perhaps we should compare this optimizer in a separate training
    optimizer = FusedAdam(model.parameters(), lr=hyperparameters["learning_rate"], adam_w_mode=True)

    model.to(accelerator.device)

    train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(train_dataloader, eval_dataloader, model, optimizer)

    num_epochs = hyperparameters["num_epochs"]
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )

    mem_status_distributed()
    for epoch in range(num_epochs):
        # training
        model.train()
        for batch in tqdm(train_dataloader, desc=f"Epoch {epoch}", position=0, leave=True):
            batch = {k: v.to(accelerator.device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # evaluation
        model.eval()
        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 2 GPUs to combine them all
            gathered_predictions = accelerator.gather_for_metrics(predictions)
            gathered_labels = accelerator.gather_for_metrics(batch["labels"])

            for pred, label in zip(gathered_predictions, gathered_labels):
                exact_match_metric.add(predictions=tokenizer.decode(pred, skip_special_tokens=True), references=tokenizer.decode(label, skip_special_tokens=True))

        mem_status_distributed() # show us the mem status of each GPU at the end of each epoch
        metric = exact_match_metric.compute()
        accelerator.print(f"epoch {epoch}:", metric)

        # be sure to save our trained model to a given path
        # first wait for all processes to reach the same stage
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        if epoch == 0:
            output_dir = tmp_dir_1
        else:
            output_dir = tmp_dir_2
        unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
        if accelerator.is_main_process:
            tokenizer.save_pretrained(output_dir)
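
The `if epoch == 0 ... else ...` branch above works for exactly two epochs. As a small sketch of how the same mapping could be generalized to any number of epochs, the hypothetical helper `checkpoint_dir_for_epoch` below derives the directory name from the epoch index; the directories would still need to be created beforehand, as in the `mkdir` cell.

```python
def checkpoint_dir_for_epoch(epoch, base_dir="/tmp/machine_dir"):
    """Map a zero-based epoch index to its output directory (hypothetical helper)."""
    return f"{base_dir}_{epoch + 1}"


# epoch 0 -> /tmp/machine_dir_1, epoch 1 -> /tmp/machine_dir_2
epoch_dirs = [checkpoint_dir_for_epoch(epoch) for epoch in range(2)]
print(epoch_dirs)
```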


In [0]:
import torch.multiprocessing as mp

# Set the multiprocessing start method to 'spawn'
mp.set_start_method('spawn', force=True) # essential for spawning the multiprocessing properly



In [0]:
from accelerate import notebook_launcher

notebook_launcher(second_accelerated_training_function, (checkpoint_model, tokenized_train_dataset, tokenized_eval_dataset, tokenizer, exact_match_metric, hyperparameters), num_processes=2, mixed_precision="bf16")

Launching training on 2 GPUs.
GPU 1: | 
Total memory: 79.15 GB |
Allocated memory: 5.98 GB |
Reserved memory: 8.48 GB |
Available memory: 70.67 GB |
GPU 0: | 
Total memory: 79.15 GB |
Allocated memory: 5.98 GB |
Reserved memory: 8.48 GB |
Available memory: 70.67 GB |


Epoch 0:   0%|          | 0/15625 [00:00<?, ?it/s]

Epoch 0:   0%|          | 0/15625 [00:00<?, ?it/s]

  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


GPU 0: |
Total memory: 79.15 GB |
Allocated memory: 13.90 GB |
Reserved memory: 29.94 GB |
Available memory: 49.21 GB |
GPU 1: |
Total memory: 79.15 GB |
Allocated memory: 13.89 GB |
Reserved memory: 28.15 GB |
Available memory: 51.00 GB |

epoch 0: {'exact_match': 83.12}
[2025-01-01 18:46:45,823] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-01 18:46:45,826] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)





df: /root/.triton/autotune: No such file or directory
df: /root/.triton/autotune: No such file or directory
/usr/bin/ld: cannot find -laio: No such file or directory
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status




Configuration saved in /tmp/machine_dir_1/config.json
Configuration saved in /tmp/machine_dir_1/generation_config.json
Model weights saved in /tmp/machine_dir_1/model.safetensors
tokenizer config file saved in /tmp/machine_dir_1/tokenizer_config.json
Special tokens file saved in /tmp/machine_dir_1/special_tokens_map.json
added tokens file saved in /tmp/machine_dir_1/added_tokens.json


Epoch 1:   0%|          | 0/15625 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/15625 [00:00<?, ?it/s]



GPU 1: |
Total memory: 79.15 GB |
Allocated memory: 13.89 GB |
Reserved memory: 28.15 GB |
Available memory: 51.00 GB |
GPU 0: |
Total memory: 79.15 GB |
Allocated memory: 13.90 GB |
Reserved memory: 29.94 GB |
Available memory: 49.21 GB |
epoch 1: {'exact_match': 90.75}


Configuration saved in /tmp/machine_dir_2/config.json
Configuration saved in /tmp/machine_dir_2/generation_config.json
Model weights saved in /tmp/machine_dir_2/model.safetensors
tokenizer config file saved in /tmp/machine_dir_2/tokenizer_config.json
Special tokens file saved in /tmp/machine_dir_2/special_tokens_map.json
added tokens file saved in /tmp/machine_dir_2/added_tokens.json
W0101 20:53:46.322427 140278112034816 torch/distributed/elastic/multiprocessing/api.py:727] Closing process 4869 via signal SIGTERM


In [0]:
# Copy the contents of the temp directories to the permanent volumes
dbutils.fs.cp("file:/tmp/machine_dir_1/", "dbfs:/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-v2", recurse=True)
dbutils.fs.cp("file:/tmp/machine_dir_2/", "dbfs:/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-v3", recurse=True)

True

## Peek our trained models

In [0]:
# make sure we can load our models from the saved location
from transformers import T5ForConditionalGeneration

trained_model_1 = T5ForConditionalGeneration.from_pretrained("/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D")

2025-01-09 21:35:17.039721: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [0]:
trained_model_2 = T5ForConditionalGeneration.from_pretrained("/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-v3")

In [0]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("/Volumes/workspace_dogfood/jgr/hugging_face_cache/CyberSolve-DeepMind-LinAlg-1D-v3") # tokenizer is the same for both checkpoints

In [0]:
# test the inference on CPU

input_text = "Solve 24 = 1601*c - 1605*c for c."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = trained_model_2.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

-6
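
As a sanity check on the model's answer, the test equation can be solved directly: `24 = 1601*c - 1605*c` reduces to `24 = -4*c`, so `c = -6`. The sketch below does the same arithmetic with exact fractions; `solve_linear` is a hypothetical helper that only handles equations of this specific shape.

```python
from fractions import Fraction


def solve_linear(lhs_const, coeff_a, coeff_b):
    """Solve lhs_const = coeff_a*c - coeff_b*c for c (hypothetical helper)."""
    net_coeff = coeff_a - coeff_b
    if net_coeff == 0:
        raise ValueError("the equation has no unique solution")
    return Fraction(lhs_const, net_coeff)


answer = solve_linear(24, 1601, 1605)
print(answer)  # -6
```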


In [0]:
trained_model_2.to("cuda")
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = trained_model_2.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

-6
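
The inference cells above call `generate`, whereas the evaluation loop in the training functions takes the argmax of the logits from a forward pass that is given the labels, so each predicted token there is conditioned on the reference prefix rather than on the model's own previous outputs. The two can score differently. The sketch below shows one way an exact-match score could be computed from generated sequences instead. It is an assumption-laden illustration: `generation_exact_match` and `max_new_tokens=16` are hypothetical choices, and it assumes the labels are padded with the tokenizer's pad token (as the collator in this notebook is configured), so they can be decoded directly.

```python
def generation_exact_match(model, tokenizer, batches, max_new_tokens=16):
    """Exact-match percentage computed from autoregressively generated answers.

    `batches` is any iterable of dicts with `input_ids`, `attention_mask`
    and `labels`, already placed on the same device as `model`.
    """
    hits, total = 0, 0
    for batch in batches:
        generated = model.generate(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            max_new_tokens=max_new_tokens,  # assumed cap for short numeric answers
        )
        predictions = tokenizer.batch_decode(generated, skip_special_tokens=True)
        references = tokenizer.batch_decode(batch["labels"], skip_special_tokens=True)
        hits += sum(p.strip() == r.strip() for p, r in zip(predictions, references))
        total += len(references)
    return 100.0 * hits / total if total else 0.0
```

A call such as `generation_exact_match(trained_model_2, tokenizer, eval_batches)` would then report the score under generation, where `eval_batches` stands for batches from an evaluation dataloader moved to the model's device.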


In [0]:
# let's push the fully fine-tuned models to the hub for safekeeping
dbutils.widgets.text("hf_token", "", "hf_token")

In [0]:
hf_token = dbutils.widgets.get("hf_token")
!huggingface-cli login --token $hf_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `Personal Hub Token` has been saved to /Volumes/workspace_dogfood/jgr/hugging_face_cache/stored_tokens
Your token has been saved to /Volumes/workspace_dogfood/jgr/hugging_face_cache/token
Login successful.
The current active token is: `Personal Hub Token`


In [0]:
trained_model_1.push_to_hub("CyberSolve-LinAlg-1.1", commit_message="We introduce CyberSolve, the flan-t5-large model fintuned on all 2M records of the DeepMind LinAlg 1D dataset. This is the first model checkpoint, scoring 86.56 on the eval dataset")

[2025-01-09 20:46:33,648] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


df: /root/.triton/autotune: No such file or directory
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.1/commit/71a7616de8128a0665f022616c5e965bf72b4ec4', commit_message='We introduce CyberSolve, the flan-t5-large model fintuned on all 2M records of the DeepMind LinAlg 1D dataset. This is the first model checkpoint, scoring 86.56 on the eval dataset', commit_description='', oid='71a7616de8128a0665f022616c5e965bf72b4ec4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.1', endpoint='https://huggingface.co', repo_type='model', repo_id='MarioBarbeque/CyberSolve-LinAlg-1.1'), pr_revision=None, pr_num=None)

In [0]:
trained_model_2.push_to_hub("CyberSolve-LinAlg-1.2", commit_message="Second CyberSolve model checkpoint; scoring 90.75 on the eval dataset")

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.2/commit/c1d20ad152b8a8f634c2862359d58fe11bc14076', commit_message='Second CyberSolve model checkpoint; scoring 90.75 on the eval dataset', commit_description='', oid='c1d20ad152b8a8f634c2862359d58fe11bc14076', pr_url=None, repo_url=RepoUrl('https://huggingface.co/MarioBarbeque/CyberSolve-LinAlg-1.2', endpoint='https://huggingface.co', repo_type='model', repo_id='MarioBarbeque/CyberSolve-LinAlg-1.2'), pr_revision=None, pr_num=None)

## Partial Correctness Evaluation

In [0]:
# put the models on the GPU so that we can create the partial correctness datasets in the same loop
trained_model_1.to("cuda")

# interestingly, the printed architecture also confirms that apex's FusedRMSNorm layer is in place of the T5 layer norm
# TODO: show this directly on the model above, before training

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): FusedRMSNorm(torch.Size([1024]), eps=1e-06, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(i

In [0]:
trained_model_2.to("cuda")

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): FusedRMSNorm(torch.Size([1024]), eps=1e-06, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(i

In [0]:
# construct the partial correctness dataset in a separate evaluation loop so as to not interfere with the larger, more fragile, more expensive full training
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id, pad_to_multiple_of=2)

eval_dataloader = DataLoader(
    tokenized_eval_dataset, 
    batch_size=64, # multiple of 8 for tensor cores
    collate_fn=data_collator
)
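
The collator above pads every label sequence in a batch to a shared length, rounded up to a multiple of 2, using the pad token id (0 for the T5 tokenizer). The pure-Python sketch below mimics that padding behavior for two label sequences. It is an illustration only, not the `DataCollatorForSeq2Seq` implementation, and the helper name `pad_to_multiple` is hypothetical.

```python
def pad_to_multiple(sequences, pad_id=0, multiple=2):
    """Right-pad each sequence to the longest length, rounded up to `multiple`."""
    longest = max(len(seq) for seq in sequences)
    # round the longest length up to the nearest multiple (ceiling division)
    target = -(-longest // multiple) * multiple
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]

# two tokenized labels ending in the EOS token id 1: "7" and "-8"
labels = [[489, 1], [3, 6039, 1]]
padded = pad_to_multiple(labels)
print(padded)  # [[489, 1, 0, 0], [3, 6039, 1, 0]]
```

The longest label has 3 tokens, so both are padded to length 4. This is why the `label_tokens` rows in the datasets built below end in trailing zeros.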

In [0]:
# Build the partial correctness datasets by running inference with each of our checkpoints
from tqdm.auto import tqdm
import torch
# NOTE: `load_metric` is deprecated in `datasets`; newer environments should use `evaluate.load` from the `evaluate` library
from datasets import Dataset, load_metric

progress_bar = tqdm(range(len(eval_dataloader)))
# recompute the exact match of each checkpoint - we expect 86.56 and 90.75, respectively
exact_match_1 = load_metric("exact_match")
exact_match_2 = load_metric("exact_match")

# create empty dicts to populate with the results for partial correctness evaluation
partials_1 = {"predicted_tokens": [], "label_tokens": [], "decoded_prediction": [], "decoded_label": []}
partials_2 = {"predicted_tokens": [], "label_tokens": [], "decoded_prediction": [], "decoded_label": []}

trained_model_1.eval()
trained_model_2.eval()
for batch in eval_dataloader:
    batch = {k: v.to("cuda") for k, v in batch.items()}
    # the batch contains the labels, so these forward passes are teacher-forced rather than free-running generation
    with torch.no_grad():
        outputs_1 = trained_model_1(**batch)
        outputs_2 = trained_model_2(**batch)

    # process both checkpoints with the same logic
    for outputs, exact_match, partials in (
        (outputs_1, exact_match_1, partials_1),
        (outputs_2, exact_match_2, partials_2),
    ):
        for pred, label in zip(outputs.logits.argmax(dim=-1), batch["labels"]):
            # decode once and reuse the strings for both the metric and the dataset
            decoded_pred = tokenizer.decode(pred, skip_special_tokens=True)
            decoded_label = tokenizer.decode(label, skip_special_tokens=True)
            # accumulate the exact match for this checkpoint's predictions
            exact_match.add(predictions=decoded_pred, references=decoded_label)
            # populate the partial correctness dict for detailed, individual eval
            partials["predicted_tokens"].append(pred.tolist())
            partials["label_tokens"].append(label.tolist())
            partials["decoded_prediction"].append(decoded_pred)
            partials["decoded_label"].append(decoded_label)

    progress_bar.update(1)

print(exact_match_1.compute())
print(exact_match_2.compute())
partial_correctness_dataset_1 = Dataset.from_dict(partials_1)
partial_correctness_dataset_2 = Dataset.from_dict(partials_2)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.

{'exact_match': 86.56}
{'exact_match': 90.75}
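
As a rough illustration of what these scores measure, the sketch below computes an exact-match percentage by hand: the share of predictions whose decoded string is identical to the decoded label. It assumes plain string equality with no normalization, the helper name `exact_match_percent` is hypothetical, and the five string pairs are illustrative rather than the full evaluation set.

```python
def exact_match_percent(predictions, references):
    """Percentage of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * matches / len(references)

# five illustrative decoded predictions and labels; only the last pair differs
preds = ["7", "2", "23", "-8", "-18"]
refs = ["7", "2", "23", "-8", "-17"]
score = exact_match_percent(preds, refs)
print(score)  # 80.0
```

Under this definition a prediction such as `-18` against a label of `-17` scores zero, which is the gap the partial correctness datasets are meant to expose.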


In [0]:
# peek at the created datasets
partial_correctness_dataset_1, partial_correctness_dataset_2

(Dataset({
     features: ['predicted_tokens', 'label_tokens', 'decoded_prediction', 'decoded_label'],
     num_rows: 10000
 }),
 Dataset({
     features: ['predicted_tokens', 'label_tokens', 'decoded_prediction', 'decoded_label'],
     num_rows: 10000
 }))

In [0]:
partial_correctness_dataset_1[:5], partial_correctness_dataset_2[:5]

({'predicted_tokens': [[489, 1, 0, 0],
   [204, 1, 0, 0],
   [1902, 1, 0, 0],
   [3, 6039, 1, 0],
   [3, 6996, 1, 0]],
  'label_tokens': [[489, 1, 0, 0],
   [204, 1, 0, 0],
   [1902, 1, 0, 0],
   [3, 6039, 1, 0],
   [3, 10794, 1, 0]],
  'decoded_prediction': ['7', '2', '23', '-8', '-18'],
  'decoded_label': ['7', '2', '23', '-8', '-17']},
 {'predicted_tokens': [[489, 1, 0, 0],
   [204, 1, 0, 0],
   [1902, 1, 0, 0],
   [3, 6039, 1, 0],
   [3, 10794, 1, 0]],
  'label_tokens': [[489, 1, 0, 0],
   [204, 1, 0, 0],
   [1902, 1, 0, 0],
   [3, 6039, 1, 0],
   [3, 10794, 1, 0]],
  'decoded_prediction': ['7', '2', '23', '-8', '-17'],
  'decoded_label': ['7', '2', '23', '-8', '-17']})
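
The fifth row shows the kind of near miss these datasets capture: the first checkpoint predicts `-18` where the label is `-17`, while the second checkpoint matches exactly. The notebook does not define a partial correctness score at this point, so the sketch below is only one hypothetical option: the fraction of non-padding label positions where the predicted token id matches. The helper name `token_overlap` and the choice to count the end-of-sequence token are assumptions.

```python
def token_overlap(predicted_tokens, label_tokens, pad_id=0):
    """Fraction of non-padding label positions where the predicted token matches."""
    # keep only positions where the label is a real token (not padding)
    pairs = [(p, l) for p, l in zip(predicted_tokens, label_tokens) if l != pad_id]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)

# fifth row from the first checkpoint: '-18' predicted against a label of '-17'
near_miss = token_overlap([3, 6996, 1, 0], [3, 10794, 1, 0])
# first row: '7' predicted against a label of '7'
exact = token_overlap([489, 1, 0, 0], [489, 1, 0, 0])
print(near_miss, exact)
```

Here the near miss gets credit for two of its three non-padding positions (the sign prefix token and the end-of-sequence token), whereas exact match would score the whole row as wrong.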

In [0]:
partial_correctness_dataset_1.push_to_hub("CyberSolve-LinAlg-1.1-correctness-benchmark", commit_message="dataset constructed for benchmarking the partial correctness of our finetuned CyberSolve-LingAlg-1.1 model")

CommitInfo(commit_url='https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.1-correctness-benchmark/commit/350b1ba2d109755f287fcf92b50dd2ee98564010', commit_message='dataset constructed for benchmarking the partial correctness of our finetuned CyberSolve-LingAlg-1.1 model', commit_description='', oid='350b1ba2d109755f287fcf92b50dd2ee98564010', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.1-correctness-benchmark', endpoint='https://huggingface.co', repo_type='dataset', repo_id='MarioBarbeque/CyberSolve-LinAlg-1.1-correctness-benchmark'), pr_revision=None, pr_num=None)

In [0]:
partial_correctness_dataset_2.push_to_hub("CyberSolve-LinAlg-1.2-correctness-benchmark", commit_message="dataset constructed for benchmarking the partial correctness of our finetuned CyberSolve-LingAlg-1.2 model")

CommitInfo(commit_url='https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark/commit/f85e943523a1c35a1834acb8ec4effdb2b9d8026', commit_message='dataset constructed for benchmarking the partial correctness of our finetuned CyberSolve-LingAlg-1.2 model', commit_description='', oid='f85e943523a1c35a1834acb8ec4effdb2b9d8026', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark', endpoint='https://huggingface.co', repo_type='dataset', repo_id='MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark'), pr_revision=None, pr_num=None)