<a href="https://colab.research.google.com/github/pszemraj/ai-msgbot/blob/update-notebooks/notebooks/colab-huggingface-API/Train_GPT_for_Conversation_w_Huggingface_and_Deepspeed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Train GPT for Conversation w Huggingface and Deepspeed </center>

> purpose: train a GPT generation model that can be used for conversation using the huggingface `trainer` API

material in this notebook is covered in:
- [fine-tune on your own data](https://huggingface.co/docs/transformers/training)
- [train bigger models faster](https://huggingface.co/docs/transformers/performance) also by huggingface

---

In [None]:
#@title check  system stats
from psutil import virtual_memory
import os
ram_gb = round(virtual_memory().total / (1024**3), 1)
print(f'Runtime has {ram_gb} gigs of memory and {os.cpu_count()} processors')

Runtime has 51.0 gigs of memory and 8 processors


In [None]:
#@title check GPU stats
!nvidia-smi

Sat Jan 15 19:17:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    25W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
%%capture
#@title set up auto-formatting of cells in notebook

from IPython.display import HTML, display


def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )
get_ipython().events.register("pre_run_cell", set_css)

In [None]:
#@title install packages 
#@markdown git-lfs for saving, transformers and deepspeed
!pip install transformers[fairscale] -U -q
!pip install deepspeed -q
!sudo apt-get install git-lfs -q


Reading package lists...
Building dependency tree...
Reading state information...
git-lfs is already the newest version (2.3.4-1).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


In [None]:
import shutil, os, gc

#@markdown clear out existing checkpoints (if restarting runtime)
chkpt_path = "/content/checkpoints"
if os.path.exists(chkpt_path): 
    shutil.rmtree(chkpt_path, True)
    print(f"removed all checkpoints in {chkpt_path}")    

fin_path = "/content/final_model"
if os.path.exists(fin_path): 
    shutil.rmtree(fin_path, True)
    print(f"removed all final models in {fin_path}")    

fin_zero_path = "/content/final_zero_weights"
if os.path.exists(fin_zero_path): 
    shutil.rmtree(fin_zero_path, True)
    print(f"removed the zero weights folder in {fin_zero_path}")    

gc.collect()

removed all checkpoints in /content/checkpoints


64

# setup

In [None]:
#@title <font color="orange"> Sign in to HF </font>
#@markdown create an account on their website if you don't have one - you need somewhere to put this.

#@markdown also imports a lot of the functions from the package
from huggingface_hub import (
    # User management
    login,
    logout,
    notebook_login,
    whoami,
    # Repository creation and management
    create_repo,
    delete_repo,
    update_repo_visibility,
    # And some methods to retrieve/change information about the content
    list_models,
    list_datasets,
    list_metrics,
    list_repo_files,
    upload_file,
    delete_file,
)

notebook_login()


VBox(children=(HTML(value='<center>\n<img src=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
#@markdown **Enter the huggingface model ID to use as a starting point:**

#@markdown generic starting points would be `EleutherAI/gpt-neo-2.7B` or `EleutherAI/gpt-neo-1.3B`
hf_name = "pszemraj/GPT-Converse-1pt3B-Neo-WoW-DD-17" #@param {type:"string"}


In [None]:
N_EPOCHS =  10#@param {type:"number"}
BATCH_SIZE =  32#@param {type:"integer"}


In [None]:
model_name_header = "C2" #@param {type:"string"}
dataset = "WeW" #@param {type:"string"}


In [None]:
#@markdown the model name is ...
full_out_name = f"{hf_name.split('/')[-1]}-{model_name_header}_DS-{dataset}_Ep-{N_EPOCHS}_Bs-{BATCH_SIZE}"
# model.push_to_hub(full_out_name, auth)
full_out_name = full_out_name.replace(".", "pt")
print(f"model will be saved on huggingface with the name:\n\t{full_out_name}")
print("note that this name can be changed in the model card later")

model will be saved on huggingface with the name:
	GPT-Converse-1pt3B-Neo-WoW-DD-17-C2_DS-WeW_Ep-10_Bs-32
note that this name can be changed in the model card later


### ideas for future setups


1. try doing tokenization by method. I.e. we have person alpha, person beta, and _similar to the netflix example_ we split starting with custom tokens from one speaker to the other

---

- article [on padding](https://huggingface.co/docs/transformers/preprocessing)


In [None]:
#@title import basic packages
import os
from urllib import request
from os.path import join
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPTNeoForCausalLM
from tqdm.auto import tqdm 

torch.manual_seed(42)

<torch._C.Generator at 0x7f789ee28ab0>

---

## set up DeepSpeed config

- As recommended [here](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero3-example) from hf
- the article above has recommendations on what to change for either improving runtime or adjustments to make less powerful GPUs work


In [None]:
%%bash
#@markdown **zero2 config.**

#@markdown file will be saved under ds_config_zero2.json
cat <<'EOT' > ds_config_zero2.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT

In [None]:
%%bash
#@markdown **zero3 config - this is slower, but offloads both parameters and optimizer state to CPU**

#@markdown file will be saved under ds_config_zero3.json
cat <<'EOT' > ds_config_zero3.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT
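
The `"auto"` values in both configs are placeholders that the Hugging Face `Trainer` fills in from `TrainingArguments` when training starts. As a quick sanity check, a minimal sketch like the one below lists which keys are still set to `"auto"`. The helper name `find_auto_keys` and the trimmed-down config string are illustrative assumptions, not part of DeepSpeed or transformers.

```python
import json


def find_auto_keys(config, prefix=""):
    """Recursively collect the dotted paths of every value set to "auto"."""
    found = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            found.extend(find_auto_keys(value, path))
        elif value == "auto":
            found.append(path)
    return found


# trimmed-down stand-in for ds_config_zero2.json (illustrative only)
example_config = json.loads("""
{
    "fp16": {"enabled": "auto", "loss_scale": 0},
    "zero_optimization": {"stage": 2, "reduce_bucket_size": 2e8},
    "train_batch_size": "auto"
}
""")

auto_keys = find_auto_keys(example_config)
print(auto_keys)
```

To inspect the real file, load it with `json.load(open("ds_config_zero2.json"))` instead of the inline string.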

In [None]:
#@markdown create logging folder
from os.path import join
os.makedirs(join(os.getcwd(), "logs"), exist_ok=True )

# Prepare the dataset and build a ``TextDataset``

The next step is to load the training and test text files and build a `TextDataset` from each. `TextDataset` is an implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) provided by the transformers library. 


may need to adjust [how the dataset is split](https://huggingface.co/docs/datasets/splits.html) later.
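
For intuition, `TextDataset` tokenizes the whole file into one long sequence and cuts it into consecutive blocks of `block_size` tokens, dropping a trailing partial block so no padding is needed. A minimal sketch of that chunking on plain integers follows; the function name `chunk_into_blocks` is hypothetical, and real token IDs would come from the tokenizer.

```python
def chunk_into_blocks(token_ids, block_size=128):
    """Split a flat list of token IDs into consecutive blocks of block_size.

    The final partial block is dropped, so every example has the same length.
    """
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


fake_ids = list(range(300))  # stand-in for a tokenized text file
blocks = chunk_into_blocks(fake_ids, block_size=128)
print(len(blocks), len(blocks[0]))
```

With 300 tokens and a block size of 128, the last 44 tokens do not fill a block and are discarded.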

In [None]:
#@markdown import basics: `train_test_split` etc
import re
import json
from sklearn.model_selection import train_test_split


In [None]:
train_link = "https://www.dropbox.com/s/olnx438omur7j72/wow-train.txt.txt?dl=1" #@param {type:"string"}
test_link = "https://www.dropbox.com/s/t2hhawpsiocypyt/ScriptParse-wow-train-kilt_4.txt?dl=1" #@param {type:"string"}


In [None]:
#@markdown download text dataset files

vm_wd = os.getcwd()
train_path = join(vm_wd, "train_dataset.txt")
request.urlretrieve(train_link, train_path)

# test file
test_path = join(vm_wd, "test_dataset.txt")
request.urlretrieve(test_link, test_path)

('/content/test_dataset.txt', <http.client.HTTPMessage at 0x7f799ff27610>)

In [None]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import AutoTokenizer
#@markdown create helper function  for test/train data. if the current function 
#@markdown stops working (i.e. the TextDataset etc) just update it to [this script](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py)
#@markdown on the HF repo
def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

_tokenizer = AutoTokenizer.from_pretrained(hf_name)

train_dataset,test_dataset,data_collator = load_dataset(train_path,
                                                        test_path,
                                                        _tokenizer)

# note that we are using the NEW tokenizer here (_tokenizer)

Downloading:   0%|          | 0.00/682 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.0 [00:00<?, ?B/s]



In [None]:
%%time
from transformers import AutoModelForCausalLM
#@title load the text gen model 
#@markdown - downloads model file if needed
#@markdown - resize the token embeddings as needed
model = AutoModelForCausalLM.from_pretrained(hf_name, 
                                             use_cache=False,
                                             low_cpu_mem_usage=True,
                                        ).cuda()
model.resize_token_embeddings(len(_tokenizer))



Downloading:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

CPU times: user 1min 31s, sys: 20.6 s, total: 1min 51s
Wall time: 3min 9s


# Initialize `Trainer` with `TrainingArguments` and the GPT model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. It is used in most of the [example scripts](https://huggingface.co/transformers/examples.html) from Huggingface. Before we can instantiate our `Trainer` we need the model loaded above and a [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) object, which exposes all the points of customization during training. In the `TrainingArguments`, we define the hyperparameters used in the training process, such as `learning_rate`, `num_train_epochs`, or `per_device_train_batch_size`. You can find a complete list [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [None]:
#@title create `os.environ` setups for the notebook
# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

# # Now proceed as normal, plus pass the deepspeed config file


In [None]:
#@title choose deepspeed config
#@markdown ZeRO-3 is slower, but is more GPU friendly. If just trying to get it to work, use that.
# from filenames defined earlier
ds_config = "/content/ds_config_zero3.json" #@param ["/content/ds_config_zero3.json", "/content/ds_config_zero2.json"] {allow-input: true}

In [None]:
#@title configure hyperparameters - training
GRAD_ACC_STEPS =  32#@param {type:"integer"}
MAX_SAVES_ON_HD =  1#@param {type:"integer"}
RATIO_WARMUP =  0.05#@param {type:"number"}
WEIGHT_DECAY =  0.1#@param {type:"number"}
LEARNING_RATE =  7e-5#@param {type:"number"}
LR_SCHEDULER =  "constant_with_warmup"#@param {type:"string"}
MAX_GRADIENT_NORM =  0.5#@param {type:"number"}
USE_HUB =  True #@param {type:"boolean"}


In [None]:
from transformers import Trainer, TrainingArguments
#@title create the `trainer`
#@markdown create the `trainer` object using `TrainingArguments()`

training_args = TrainingArguments(

    output_dir='./checkpoints', 
    save_total_limit=MAX_SAVES_ON_HD,
    logging_dir='/content/logs',
    num_train_epochs=N_EPOCHS, 
    evaluation_strategy='epoch',
    save_strategy='epoch',
    overwrite_output_dir=True, #overwrite the content of the output directory
    per_device_train_batch_size=BATCH_SIZE, 
    per_device_eval_batch_size=BATCH_SIZE,
    # eval_steps = 500, # Number of update steps between two evaluations.
    # save_steps=100, # after # steps model is saved 
    gradient_accumulation_steps=GRAD_ACC_STEPS, # working A100 + 2.7B was 32
    eval_accumulation_steps=max(int(GRAD_ACC_STEPS/2), 1),
    gradient_checkpointing=True,
    max_grad_norm=MAX_GRADIENT_NORM,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type = LR_SCHEDULER,
    warmup_ratio=RATIO_WARMUP,
    weight_decay=WEIGHT_DECAY,
    # bf16=True, 
    # bf16_full_eval=True,
    fp16_full_eval=True,
    fp16=True,
    fp16_opt_level='O1',
    deepspeed=ds_config, # use deepspeed.
    push_to_hub=USE_HUB,
    hub_model_id=full_out_name if USE_HUB else None,
    hub_strategy='checkpoint' if USE_HUB else 'every_save', # None is not a valid hub strategy; 'every_save' is the default
    
)


trainer = Trainer(model=model, 
                  args=training_args, 
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset, 
                  data_collator=data_collator,
            )


[2022-01-15 19:21:03,674] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl


Cloning https://huggingface.co/pszemraj/GPT-Converse-1pt3B-Neo-WoW-DD-17-C2_DS-WeW_Ep-10_Bs-32 into local empty directory.
Using amp half precision backend


# Train GPT Model

API is simple: `Trainer.train()`.

## working configs

```

A100

2.7 B
- zero2
- batch size 8

1.3 B 
- zero2
- batch size 16

V100

1.3 B
- zero3
- batch size 8

```



In [None]:
trainer.train()

[2022-01-15 19:21:12,018] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.5.9, git-hash=unknown, git-branch=unknown
[2022-01-15 19:21:12,035] [INFO] [logging.py:69:log_dist] [Rank 0] initializing deepspeed groups
[2022-01-15 19:21:12,036] [INFO] [logging.py:69:log_dist] [Rank 0] initializing deepspeed model parallel group with size 1
[2022-01-15 19:21:12,039] [INFO] [logging.py:69:log_dist] [Rank 0] initializing deepspeed expert parallel group with size 1
[2022-01-15 19:21:12,041] [INFO] [logging.py:69:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0]
[2022-01-15 19:21:12,042] [INFO] [logging.py:69:log_dist] [Rank 0] creating expert parallel process group with ranks: [0]
[2022-01-15 19:21:12,075] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py37_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py3

***** Running training *****
  Num examples = 53555
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 32
  Total optimization steps = 520
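
The figures in the log above follow directly from the batch settings. Here is a short worked check, assuming a single GPU and assuming that the number of optimizer updates per epoch is the number of batches floor-divided by the accumulation steps (this reproduces the 520 steps reported):

```python
import math

num_examples = 53555    # "Num examples" from the log
per_device_batch = 32   # BATCH_SIZE
grad_acc_steps = 32     # GRAD_ACC_STEPS
num_epochs = 10         # N_EPOCHS
num_gpus = 1            # assumption: single-GPU Colab runtime

# examples consumed per optimizer update
effective_batch = per_device_batch * grad_acc_steps * num_gpus
# forward/backward passes needed to see the dataset once
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
# optimizer updates per epoch (leftover batches do not form a full update)
updates_per_epoch = max(batches_per_epoch // grad_acc_steps, 1)
total_updates = updates_per_epoch * num_epochs

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
```

This gives 52 optimizer updates per epoch, consistent with the `checkpoint-52`, `checkpoint-104`, ... folder names written at the end of each epoch.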


Using /root/.cache/torch_extensions/py37_cu111 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.002501249313354492 seconds


Epoch,Training Loss,Validation Loss
0,No log,0.142578


***** Running Evaluation *****
  Num examples = 17913
  Batch size = 32
Saving model checkpoint to ./checkpoints/checkpoint-52
Configuration saved in ./checkpoints/checkpoint-52/config.json
Model weights saved in ./checkpoints/checkpoint-52/pytorch_model.bin


[2022-01-15 20:16:21,740] [INFO] [engine.py:3053:save_fp16_model] Saving model weights to ./checkpoints/checkpoint-52/pytorch_model.bin
[2022-01-15 20:16:34,716] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-52/global_step52/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 20:19:33,347] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-52/global_step52/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2022-01-15 20:19:33,392] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-52/global_step52/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 20:22:29,384] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-52/global_step52/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2022-01-15 20:37:01,733] [INFO] [timer.py:184:stop] 0/2000, SamplesPerSec=17.746132318863697


Epoch,Training Loss,Validation Loss
0,No log,0.142578
1,No log,0.140259
2,No log,0.139404
3,No log,0.144653


***** Running Evaluation *****
  Num examples = 17913
  Batch size = 32
Saving model checkpoint to ./checkpoints/checkpoint-104
Configuration saved in ./checkpoints/checkpoint-104/config.json
Model weights saved in ./checkpoints/checkpoint-104/pytorch_model.bin


[2022-01-15 21:22:23,566] [INFO] [engine.py:3053:save_fp16_model] Saving model weights to ./checkpoints/checkpoint-104/pytorch_model.bin
[2022-01-15 21:22:39,191] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-104/global_step104/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 21:25:32,258] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-104/global_step104/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2022-01-15 21:25:32,284] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-104/global_step104/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 21:28:28,019] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-104/global_step104/zero_pp_rank_0_mp_rank_00_optim_states.pt


Deleting older checkpoint [checkpoints/checkpoint-52] due to args.save_total_limit


[2022-01-15 21:48:04,845] [INFO] [timer.py:184:stop] 0/4000, SamplesPerSec=17.708308059064347


***** Running Evaluation *****
  Num examples = 17913
  Batch size = 32
Saving model checkpoint to ./checkpoints/checkpoint-156
Configuration saved in ./checkpoints/checkpoint-156/config.json
Model weights saved in ./checkpoints/checkpoint-156/pytorch_model.bin


[2022-01-15 22:23:10,688] [INFO] [engine.py:3053:save_fp16_model] Saving model weights to ./checkpoints/checkpoint-156/pytorch_model.bin
[2022-01-15 22:23:25,860] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-156/global_step156/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 22:26:20,481] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-156/global_step156/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2022-01-15 22:26:21,453] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-156/global_step156/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 22:29:15,343] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-156/global_step156/zero_pp_rank_0_mp_rank_00_optim_states.pt


Deleting older checkpoint [checkpoints/checkpoint-104] due to args.save_total_limit


[2022-01-15 22:58:20,005] [INFO] [timer.py:184:stop] 0/6000, SamplesPerSec=17.76980356549299


***** Running Evaluation *****
  Num examples = 17913
  Batch size = 32
Saving model checkpoint to ./checkpoints/checkpoint-208
Configuration saved in ./checkpoints/checkpoint-208/config.json
Model weights saved in ./checkpoints/checkpoint-208/pytorch_model.bin


[2022-01-15 23:23:23,585] [INFO] [engine.py:3053:save_fp16_model] Saving model weights to ./checkpoints/checkpoint-208/pytorch_model.bin
[2022-01-15 23:23:37,432] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-208/global_step209/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 23:26:31,650] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-208/global_step209/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2022-01-15 23:26:31,676] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/checkpoint-208/global_step209/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 23:29:27,237] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved ./checkpoints/checkpoint-208/global_step209/zero_pp_rank_0_mp_rank_00_optim_states.pt


Deleting older checkpoint [checkpoints/checkpoint-156] due to args.save_total_limit


KeyboardInterrupt: ignored

## save & convert

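After training is done you can save the model by calling `save_model()`. By default this saves the trained model to the `output_dir` from our `TrainingArguments`; here a separate `output_dir` is passed so that the ZeRO weights land in their own folder before conversion.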
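
During training the `Trainer` also writes intermediate checkpoints to `checkpoint-<step>` folders under `output_dir`. If a run is interrupted, a small helper along the lines of this sketch can locate the most recent one. `latest_checkpoint` is a hypothetical name, and the demo uses a temporary directory in place of `./checkpoints`.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in Path(output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # compare by numeric step so that 208 ranks above 52
    return max(candidates, key=lambda item: item[0])[1]


# demo with a throwaway directory instead of ./checkpoints
with tempfile.TemporaryDirectory() as tmp:
    for step in (52, 104, 208):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    newest_name = latest_checkpoint(tmp).name
    print(newest_name)
```

The steps are compared as integers because a plain string sort would rank `checkpoint-52` above `checkpoint-208`. transformers also provides `get_last_checkpoint` in `transformers.trainer_utils` for the same purpose.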

In [None]:
#@markdown prep work - imports, create directories for conversion
import os
from os.path import join
from google.colab import files 
import gc
fin_zero = join(os.getcwd(), "final_zero_weights")
os.makedirs(fin_zero, exist_ok=True )

fin_loc = join(os.getcwd(), "final_model")
os.makedirs(fin_loc, exist_ok=True )

print(f"final model file will be saved to {fin_loc}")

final model file will be saved to /content/final_model


In [None]:
import shutil
from pathlib import Path
_in = Path('/content/checkpoints/config.json')
_out = Path(fin_loc) / _in.name
shutil.copyfile(_in, _out)

PosixPath('/content/final_model/config.json')

In [None]:
trainer.save_model(output_dir=fin_zero) # save to one directory with zero weights
trainer.evaluate()

Saving model checkpoint to /content/final_zero_weights
Configuration saved in /content/final_zero_weights/config.json
Model weights saved in /content/final_zero_weights/pytorch_model.bin


[2022-01-15 23:46:22,882] [INFO] [engine.py:3053:save_fp16_model] Saving model weights to /content/final_zero_weights/pytorch_model.bin
[2022-01-15 23:46:36,242] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /content/final_zero_weights/global_step226/zero_pp_rank_0_mp_rank_00_model_states.pt
[2022-01-15 23:49:30,873] [INFO] [engine.py:2947:_save_zero_checkpoint] zero checkpoint saved /content/final_zero_weights/global_step226/zero_pp_rank_0_mp_rank_00_optim_states.pt


***** Running Evaluation *****
  Num examples = 17913
  Batch size = 32


Epoch,Training Loss,Validation Loss
0,No log,0.142578
1,No log,0.140259
2,No log,0.139404
3,No log,0.144653


In [None]:
del trainer 
gc.collect() # free up RAM

In [None]:
#@markdown convert zero weights
!python /content/final_zero_weights/zero_to_fp32.py /content/final_zero_weights /content/final_model/pytorch_model.bin

<font color="yellow"> note to self - may need to copy over other files like the config.json and so forth </font>

# Save and Share

- [transformers docs](https://huggingface.co/docs/transformers/model_sharing) on how to share it.

if issues, try just working with the cells as a pseudo command-line a la:

```
!huggingface-cli login
!git config --global credential.helper store
!transformers-cli repo create your-model-name
```


### push tokenizer to hub

In [None]:
#@markdown re-load tokenizer from the original model if something happened and it 
#@markdown does not exist
from transformers import AutoTokenizer

if "_tokenizer" not in globals():
    _tokenizer = AutoTokenizer.from_pretrained(hf_name, use_fast=False,
                                            max_length=2048,
                                            model_max_length=2048,
                                            )

In [None]:
# Push the tokenizer to your namespace with the name full_out_name with no local clone.
_tokenizer.push_to_hub(full_out_name,
                       use_auth_token=True,
                       use_temp_dir=True)

### push model to hub

In [None]:
%%capture

from transformers import AutoModelForCausalLM, AutoTokenizer
finetuned_model = AutoModelForCausalLM.from_pretrained(fin_loc)

In [None]:
finetuned_model.push_to_hub(full_out_name,
                            use_auth_token=True,
                            use_temp_dir=True)

### save to google drive 

- this is not required, just a utility in case pushing to hf is not working or for convenience

In [None]:
from datetime import datetime
save_gdrive = False #@param {type:"boolean"}

def get_timestamp(exact=False):
    """
    get_timestamp - return a timestamp in the format Mon-DD-YYYY_-HH (exact=False)
        or Mon-DD-YYYY_-HH-MM-SS (exact=True)
    exact : bool, optional, by default False,  if True, return a timestamp with minutes and seconds
    """
    ts = (
        datetime.now().strftime("%b-%d-%Y_-%H-%M-%S")
        if exact
        else datetime.now().strftime("%b-%d-%Y_-%H")
    )
    return ts

In [None]:
#@markdown connect to your drive with `google.colab`
from google.colab import drive
if save_gdrive:
    drive.mount('/content/drive')

In [None]:
#@markdown save and copy the directory to drive
import os, shutil
from os.path import join
if save_gdrive:
    # Drive is mounted at the absolute path /content/drive; user files live under MyDrive
    drive_folder = join('/content', 'drive', 'MyDrive', "Programming", get_timestamp(), full_out_name)
    os.makedirs(drive_folder, exist_ok=True )
    outpath = shutil.copytree(fin_loc, join(drive_folder, "new"))
    print(outpath)

# testing the new model

In [None]:
#@markdown print GPU status
import torch
!nvidia-smi

device = 'cuda' if torch.cuda.is_available() else 'cpu'

print(f"\nwill run computations on {device}")

In [None]:
from transformers import pipeline
my_chatbot = pipeline('text-generation', 
                      model=finetuned_model, tokenizer=_tokenizer,
                      device=0 if device == 'cuda' else -1,
                    )

In [None]:
my_chatbot("hello")

## add prompts

In [None]:
#@title define speaker and responder
#@markdown for testing the models this should not need to be changed. 
#@markdown if testing a model related to [ai-msgbot](https://github.com/pszemraj/ai-msgbot)
#@markdown trained on data that **was not** using the entries below, update as needed.
speaker = "person alpha" #@param {type:"string"}
responder = "person beta" #@param {type:"string"}

## define prompt messages

the reason `f"{responder}:\n"` is added at the end of each prompt is to force the text-gen model to actually _respond_ to the prompt as opposed to adding on to it.
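
As an illustration of this formatting, the sketch below builds a prompt in the same layout as the list that follows and trims a generated continuation down to the responder's first turn. Both helper names (`build_prompt`, `extract_response`) and the made-up model output are assumptions for demonstration, not part of the notebook's pipeline.

```python
def build_prompt(speaker, message, responder):
    """Format one message so that the model continues as the responder."""
    return f"{speaker}:\n{message}\n\n{responder}:\n"


def extract_response(generated_text, prompt, speaker):
    """Keep only the responder's first turn from the generated text."""
    # the pipeline returns the prompt followed by the continuation
    continuation = generated_text[len(prompt):]
    # stop where the model starts writing the next speaker turn, if it does
    cut = continuation.find(f"{speaker}:")
    if cut != -1:
        continuation = continuation[:cut]
    return continuation.strip()


prompt = build_prompt("person alpha", "do you like memes?", "person beta")
# made-up model output, only to exercise the helper
fake_output = prompt + "i love them!\n\nperson alpha:\nme too\n"
reply = extract_response(fake_output, prompt, "person alpha")
print(reply)
```

Ending the prompt with the responder tag means the first generated tokens belong to the reply, so cutting at the next speaker tag isolates a single response.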

In [None]:

prompts = [
           [f"{speaker}:\n", "hi! how are you doing?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what should I bring to the party?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "do you like memes?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "can we go on a date together this weekend?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what's up homie?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "do you know how can I make friends here?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "so what do you like to do for fun?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what is your favorite brand of cereal?\n", "\n", f"{responder}:\n"],
           [f"{speaker}:\n", "what is the meaning of existence?\n", "\n", f"{responder}:\n"],
]

## generate text!

In [None]:
#@markdown set amount of text to generate (higher # = longer RT)
resp_len =  1024#@param {type:"integer"}

In [None]:
#@title model generated chats
#@markdown - note that responses output the prompt as part of the output (and that counts 
#@markdown for part of the max length reqs)
for i, prompt in enumerate(prompts):
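    this_prompt = "".join(prompt)
    # generation lengths are counted in tokens, not characters
    n_prompt_tokens = len(_tokenizer(this_prompt)["input_ids"])
    result = my_chatbot(
                        this_prompt, 
                        do_sample=True,
                        top_k=50,
                        top_p=0.9, 
                        max_length=n_prompt_tokens + resp_len,
                        no_repeat_ngram_size=3,
                    )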
    
    print(f"==========Testing Prompt-ID #{i} ==========")
    print(f"PROMPT TEXT:\n{''.join(prompt)}")
    print("----------FULL GENERATED TEXT:")
    print(result[0]['generated_text'])
    print("\n" * 4)

# Metadata

- export all the parameters you could ever want, and more

In [None]:
import pprint as pp

metadata = training_args.to_sanitized_dict()
metadata["configs_src"] = hf_name
pp.pprint(metadata)

In [None]:
from google.colab import files
import json

metadata_path = f"{model_name_header}_training_metadata.json"
with open(metadata_path, "w") as write_file:
    json.dump(metadata, write_file)

files.download(metadata_path)

# Extras

## full docstring for TrainingArgs()

In [None]:
ta_docstring = """
    TrainingArguments is the subset of the arguments we use in our example scripts **which relate to the training loop
    itself**.
    Using :class:`~transformers.HfArgumentParser` we can turn this class into `argparse
    <https://docs.python.org/3/library/argparse.html#module-argparse>`__ arguments that can be specified on the command
    line.
    Parameters:
        output_dir (:obj:`str`):
            The output directory where the model predictions and checkpoints will be written.
        overwrite_output_dir (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If :obj:`True`, overwrite the content of the output directory. Use this to continue training if
            :obj:`output_dir` points to a checkpoint directory.
        do_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to run training or not. This argument is not directly used by :class:`~transformers.Trainer`, it's
            intended to be used by your training/evaluation scripts instead. See the `example scripts
            <https://github.com/huggingface/transformers/tree/master/examples>`__ for more details.
        do_eval (:obj:`bool`, `optional`):
            Whether to run evaluation on the validation set or not. Will be set to :obj:`True` if
            :obj:`evaluation_strategy` is different from :obj:`"no"`. This argument is not directly used by
            :class:`~transformers.Trainer`, it's intended to be used by your training/evaluation scripts instead. See
            the `example scripts <https://github.com/huggingface/transformers/tree/master/examples>`__ for more
            details.
        do_predict (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to run predictions on the test set or not. This argument is not directly used by
            :class:`~transformers.Trainer`, it's intended to be used by your training/evaluation scripts instead. See
            the `example scripts <https://github.com/huggingface/transformers/tree/master/examples>`__ for more
            details.
        evaluation_strategy (:obj:`str` or :class:`~transformers.trainer_utils.IntervalStrategy`, `optional`, defaults to :obj:`"no"`):
            The evaluation strategy to adopt during training. Possible values are:
                * :obj:`"no"`: No evaluation is done during training.
                * :obj:`"steps"`: Evaluation is done (and logged) every :obj:`eval_steps`.
                * :obj:`"epoch"`: Evaluation is done at the end of each epoch.
        prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
            When performing evaluation and generating predictions, only returns the loss.
        per_device_train_batch_size (:obj:`int`, `optional`, defaults to 8):
            The batch size per GPU/TPU core/CPU for training.
        per_device_eval_batch_size (:obj:`int`, `optional`, defaults to 8):
            The batch size per GPU/TPU core/CPU for evaluation.
        gradient_accumulation_steps (:obj:`int`, `optional`, defaults to 1):
            Number of update steps to accumulate the gradients for, before performing a backward/update pass.
            .. warning::
                When using gradient accumulation, one step is counted as one step with backward pass. Therefore,
                logging, evaluation, save will be conducted every ``gradient_accumulation_steps * xxx_step`` training
                examples.
        eval_accumulation_steps (:obj:`int`, `optional`):
            Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU. If
            left unset, the whole predictions are accumulated on GPU/TPU before being moved to the CPU (faster but
            requires more memory).
        learning_rate (:obj:`float`, `optional`, defaults to 5e-5):
            The initial learning rate for :class:`~transformers.AdamW` optimizer.
        weight_decay (:obj:`float`, `optional`, defaults to 0):
            The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in
            :class:`~transformers.AdamW` optimizer.
        adam_beta1 (:obj:`float`, `optional`, defaults to 0.9):
            The beta1 hyperparameter for the :class:`~transformers.AdamW` optimizer.
        adam_beta2 (:obj:`float`, `optional`, defaults to 0.999):
            The beta2 hyperparameter for the :class:`~transformers.AdamW` optimizer.
        adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
            The epsilon hyperparameter for the :class:`~transformers.AdamW` optimizer.
        max_grad_norm (:obj:`float`, `optional`, defaults to 1.0):
            Maximum gradient norm (for gradient clipping).
        num_train_epochs (:obj:`float`, `optional`, defaults to 3.0):
            Total number of training epochs to perform (if not an integer, will perform the decimal part percents of
            the last epoch before stopping training).
        max_steps (:obj:`int`, `optional`, defaults to -1):
            If set to a positive number, the total number of training steps to perform. Overrides
            :obj:`num_train_epochs`. In case of using a finite iterable dataset the training may stop before reaching
            the set number of steps when all data is exhausted
        lr_scheduler_type (:obj:`str` or :class:`~transformers.SchedulerType`, `optional`, defaults to :obj:`"linear"`):
            The scheduler type to use. See the documentation of :class:`~transformers.SchedulerType` for all possible
            values.
        warmup_ratio (:obj:`float`, `optional`, defaults to 0.0):
            Ratio of total training steps used for a linear warmup from 0 to :obj:`learning_rate`.
        warmup_steps (:obj:`int`, `optional`, defaults to 0):
            Number of steps used for a linear warmup from 0 to :obj:`learning_rate`. Overrides any effect of
            :obj:`warmup_ratio`.
        log_level (:obj:`str`, `optional`, defaults to ``passive``):
            Logger log level to use on the main process. Possible choices are the log levels as strings: 'debug',
            'info', 'warning', 'error' and 'critical', plus a 'passive' level which doesn't set anything and lets the
            application set the level.
        log_level_replica (:obj:`str`, `optional`, defaults to ``passive``):
            Logger log level to use on replicas. Same choices as ``log_level``.
        log_on_each_node (:obj:`bool`, `optional`, defaults to :obj:`True`):
            In multinode distributed training, whether to log using :obj:`log_level` once per node, or only on the main
            node.
        logging_dir (:obj:`str`, `optional`):
            `TensorBoard <https://www.tensorflow.org/tensorboard>`__ log directory. Will default to
            `output_dir/runs/**CURRENT_DATETIME_HOSTNAME**`.
        logging_strategy (:obj:`str` or :class:`~transformers.trainer_utils.IntervalStrategy`, `optional`, defaults to :obj:`"steps"`):
            The logging strategy to adopt during training. Possible values are:
                * :obj:`"no"`: No logging is done during training.
                * :obj:`"epoch"`: Logging is done at the end of each epoch.
                * :obj:`"steps"`: Logging is done every :obj:`logging_steps`.
        logging_first_step (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to log and evaluate the first :obj:`global_step` or not.
        logging_steps (:obj:`int`, `optional`, defaults to 500):
            Number of update steps between two logs if :obj:`logging_strategy="steps"`.
        logging_nan_inf_filter (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to filter :obj:`nan` and :obj:`inf` losses for logging. If set to :obj:`True` the loss of every step
            that is :obj:`nan` or :obj:`inf` is filtered and the average loss of the current logging window is taken
            instead.
            .. note::
                :obj:`logging_nan_inf_filter` only influences the logging of loss values, it does not change
                how the gradient is computed or applied to the model.
        save_strategy (:obj:`str` or :class:`~transformers.trainer_utils.IntervalStrategy`, `optional`, defaults to :obj:`"steps"`):
            The checkpoint save strategy to adopt during training. Possible values are:
                * :obj:`"no"`: No save is done during training.
                * :obj:`"epoch"`: Save is done at the end of each epoch.
                * :obj:`"steps"`: Save is done every :obj:`save_steps`.
        save_steps (:obj:`int`, `optional`, defaults to 500):
            Number of update steps between two checkpoint saves if :obj:`save_strategy="steps"`.
        save_total_limit (:obj:`int`, `optional`):
            If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
            :obj:`output_dir`.
        save_on_each_node (:obj:`bool`, `optional`, defaults to :obj:`False`):
            When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on
            the main one.
            This should not be activated when the different nodes use the same storage as the files will be saved with
            the same names for each node.
        no_cuda (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to avoid using CUDA even when it is available.
        seed (:obj:`int`, `optional`, defaults to 42):
            Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the
            :func:`~transformers.Trainer.model_init` function to instantiate the model if it has some randomly
            initialized parameters.
        bf16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training. Requires Ampere or higher
            NVIDIA architecture. This is an experimental API and it may change.
        fp16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
        fp16_opt_level (:obj:`str`, `optional`, defaults to 'O1'):
            For :obj:`fp16` training, Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details
            on the `Apex documentation <https://nvidia.github.io/apex/amp.html>`__.
        fp16_backend (:obj:`str`, `optional`, defaults to :obj:`"auto"`):
            This argument is deprecated. Use ``half_precision_backend`` instead.
        half_precision_backend (:obj:`str`, `optional`, defaults to :obj:`"auto"`):
            The backend to use for mixed precision training. Must be one of :obj:`"auto"`, :obj:`"amp"` or
            :obj:`"apex"`. :obj:`"auto"` will use AMP or APEX depending on the PyTorch version detected, while the
            other choices will force the requested backend.
        bf16_full_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory but can harm
            metric values. This is an experimental API and it may change.
        fp16_full_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to use full float16 evaluation instead of 32-bit. This will be faster and save memory but can harm
            metric values.
        tf32 (:obj:`bool`, `optional`):
            Whether to enable tf32 mode, available in Ampere and newer GPU architectures. This is an experimental API
            and it may change.
        local_rank (:obj:`int`, `optional`, defaults to -1):
            Rank of the process during distributed training.
        xpu_backend (:obj:`str`, `optional`):
            The backend to use for xpu distributed training. Must be one of :obj:`"mpi"` or :obj:`"ccl"`.
        tpu_num_cores (:obj:`int`, `optional`):
            When training on TPU, the number of TPU cores (automatically passed by launcher script).
        dataloader_drop_last (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size)
            or not.
        eval_steps (:obj:`int`, `optional`):
            Number of update steps between two evaluations if :obj:`evaluation_strategy="steps"`. Will default to the
            same value as :obj:`logging_steps` if not set.
        dataloader_num_workers (:obj:`int`, `optional`, defaults to 0):
            Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the
            main process.
        past_index (:obj:`int`, `optional`, defaults to -1):
            Some models like :doc:`TransformerXL <../model_doc/transformerxl>` or :doc:`XLNet <../model_doc/xlnet>` can
            make use of the past hidden states for their predictions. If this argument is set to a positive int, the
            ``Trainer`` will use the corresponding output (usually index 2) as the past state and feed it to the model
            at the next training step under the keyword argument ``mems``.
        run_name (:obj:`str`, `optional`):
            A descriptor for the run. Typically used for `wandb <https://www.wandb.com/>`_ logging.
        disable_tqdm (:obj:`bool`, `optional`):
            Whether or not to disable the tqdm progress bars and table of metrics produced by
            :class:`~transformers.notebook.NotebookTrainingTracker` in Jupyter Notebooks. Will default to :obj:`True`
            if the logging level is set to warn or lower (default), :obj:`False` otherwise.
        remove_unused_columns (:obj:`bool`, `optional`, defaults to :obj:`True`):
            If using :obj:`datasets.Dataset` datasets, whether or not to automatically remove the columns unused by the
            model forward method.
            (Note that this behavior is not implemented for :class:`~transformers.TFTrainer` yet.)
        label_names (:obj:`List[str]`, `optional`):
            The list of keys in your dictionary of inputs that correspond to the labels.
            Will eventually default to :obj:`["labels"]` except if the model used is one of the
            :obj:`XxxForQuestionAnswering` in which case it will default to :obj:`["start_positions",
            "end_positions"]`.
        load_best_model_at_end (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to load the best model found during training at the end of training.
            .. note::
                When set to :obj:`True`, the parameters :obj:`save_strategy` needs to be the same as
                :obj:`evaluation_strategy`, and in the case it is "steps", :obj:`save_steps` must be a round multiple of
                :obj:`eval_steps`.
        metric_for_best_model (:obj:`str`, `optional`):
            Use in conjunction with :obj:`load_best_model_at_end` to specify the metric to use to compare two different
            models. Must be the name of a metric returned by the evaluation with or without the prefix :obj:`"eval_"`.
            Will default to :obj:`"loss"` if unspecified and :obj:`load_best_model_at_end=True` (to use the evaluation
            loss).
            If you set this value, :obj:`greater_is_better` will default to :obj:`True`. Don't forget to set it to
            :obj:`False` if your metric is better when lower.
        greater_is_better (:obj:`bool`, `optional`):
            Use in conjunction with :obj:`load_best_model_at_end` and :obj:`metric_for_best_model` to specify if better
            models should have a greater metric or not. Will default to:
            - :obj:`True` if :obj:`metric_for_best_model` is set to a value that isn't :obj:`"loss"` or
              :obj:`"eval_loss"`.
            - :obj:`False` if :obj:`metric_for_best_model` is not set, or set to :obj:`"loss"` or :obj:`"eval_loss"`.
        ignore_data_skip (:obj:`bool`, `optional`, defaults to :obj:`False`):
            When resuming training, whether or not to skip the epochs and batches to get the data loading at the same
            stage as in the previous training. If set to :obj:`True`, the training will begin faster (as that skipping
            step can take a long time) but will not yield the same results as the interrupted training would have.
        sharded_ddp (:obj:`bool`, :obj:`str` or list of :class:`~transformers.trainer_utils.ShardedDDPOption`, `optional`, defaults to :obj:`False`):
            Use Sharded DDP training from `FairScale <https://github.com/facebookresearch/fairscale>`__ (in distributed
            training only). This is an experimental feature.
            A list of options along the following:
            - :obj:`"simple"`: to use first instance of sharded DDP released by fairscale (:obj:`ShardedDDP`) similar
              to ZeRO-2.
            - :obj:`"zero_dp_2"`: to use the second instance of sharded DDP released by fairscale
              (:obj:`FullyShardedDDP`) in Zero-2 mode (with :obj:`reshard_after_forward=False`).
            - :obj:`"zero_dp_3"`: to use the second instance of sharded DDP released by fairscale
              (:obj:`FullyShardedDDP`) in Zero-3 mode (with :obj:`reshard_after_forward=True`).
            - :obj:`"offload"`: to add ZeRO-offload (only compatible with :obj:`"zero_dp_2"` and :obj:`"zero_dp_3"`).
            If a string is passed, it will be split on space. If a bool is passed, it will be converted to an empty
            list for :obj:`False` and :obj:`["simple"]` for :obj:`True`.
        deepspeed (:obj:`str` or :obj:`dict`, `optional`):
            Use `Deepspeed <https://github.com/microsoft/deepspeed>`__. This is an experimental feature and its API may
            evolve in the future. The value is either the location of DeepSpeed json config file (e.g.,
            ``ds_config.json``) or an already loaded json file as a :obj:`dict`.
        label_smoothing_factor (:obj:`float`, `optional`, defaults to 0.0):
            The label smoothing factor to use. Zero means no label smoothing, otherwise the underlying onehot-encoded
            labels are changed from 0s and 1s to :obj:`label_smoothing_factor/num_labels` and :obj:`1 -
            label_smoothing_factor + label_smoothing_factor/num_labels` respectively.
        debug (:obj:`str` or list of :class:`~transformers.debug_utils.DebugOption`, `optional`, defaults to :obj:`""`):
            Enable one or more debug features. This is an experimental feature.
            Possible options are:
            - :obj:`"underflow_overflow"`: detects overflow in model's input/outputs and reports the last frames that
              led to the event
            - :obj:`"tpu_metrics_debug"`: print debug metrics on TPU
            The options should be separated by whitespaces.
        adafactor (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use the :class:`~transformers.Adafactor` optimizer instead of
            :class:`~transformers.AdamW`.
        group_by_length (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to group together samples of roughly the same length in the training dataset (to minimize
            padding applied and be more efficient). Only useful if applying dynamic padding.
        length_column_name (:obj:`str`, `optional`, defaults to :obj:`"length"`):
            Column name for precomputed lengths. If the column exists, grouping by length will use these values rather
            than computing them on train startup. Ignored unless :obj:`group_by_length` is :obj:`True` and the dataset
            is an instance of :obj:`Dataset`.
        report_to (:obj:`str` or :obj:`List[str]`, `optional`, defaults to :obj:`"all"`):
            The list of integrations to report the results and logs to. Supported platforms are :obj:`"azure_ml"`,
            :obj:`"comet_ml"`, :obj:`"mlflow"`, :obj:`"tensorboard"` and :obj:`"wandb"`. Use :obj:`"all"` to report to
            all integrations installed, :obj:`"none"` for no integrations.
        ddp_find_unused_parameters (:obj:`bool`, `optional`):
            When using distributed training, the value of the flag :obj:`find_unused_parameters` passed to
            :obj:`DistributedDataParallel`. Will default to :obj:`False` if gradient checkpointing is used, :obj:`True`
            otherwise.
        dataloader_pin_memory (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether you want to pin memory in data loaders or not. Will default to :obj:`True`.
        skip_memory_metrics (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to skip adding of memory profiler reports to metrics. This is skipped by default because it slows
            down the training and evaluation speed.
        push_to_hub (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to upload the trained model to the hub after training. If this is activated, and
            :obj:`output_dir` exists, it needs to be a local clone of the repository to which the
            :class:`~transformers.Trainer` will be pushed.
        resume_from_checkpoint (:obj:`str`, `optional`):
            The path to a folder with a valid checkpoint for your model. This argument is not directly used by
            :class:`~transformers.Trainer`, it's intended to be used by your training/evaluation scripts instead. See
            the `example scripts <https://github.com/huggingface/transformers/tree/master/examples>`__ for more
            details.
        hub_model_id (:obj:`str`, `optional`):
            The name of the repository to keep in sync with the local `output_dir`. It can be a simple model ID in
            which case the model will be pushed in your namespace. Otherwise it should be the whole repository name,
            for instance :obj:`"user_name/model"`, which allows you to push to an organization you are a member of with
            :obj:`"organization_name/model"`. Will default to :obj:`user_name/output_dir_name` with `output_dir_name`
            being the name of :obj:`output_dir`.
            Will default to the name of :obj:`output_dir`.
        hub_strategy (:obj:`str` or :class:`~transformers.trainer_utils.HubStrategy`, `optional`, defaults to :obj:`"every_save"`):
            Defines the scope of what is pushed to the Hub and when. Possible values are:
            - :obj:`"end"`: push the model, its configuration, the tokenizer (if passed along to the
              :class:`~transformers.Trainer`) and a draft of a model card at the end of training.
            - :obj:`"every_save"`: push the model, its configuration, the tokenizer (if passed along to the
              :class:`~transformers.Trainer`) and a draft of a model card each time there is a model save. The pushes
              are asynchronous to not block training, and in case the save are very frequent, a new push is only
              attempted if the previous one is finished. A last push is made with the final model at the end of
              training.
            - :obj:`"checkpoint"`: like :obj:`"every_save"` but the latest checkpoint is also pushed in a subfolder
              named last-checkpoint, allowing you to resume training easily with
              :obj:`trainer.train(resume_from_checkpoint="last-checkpoint")`.
            - :obj:`"all_checkpoints"`: like :obj:`"checkpoint"` but all checkpoints are pushed like they appear in the
              output folder (so you will get one checkpoint folder per folder in your final repository)
        hub_token (:obj:`str`, `optional`):
            The token to use to push the model to the Hub. Will default to the token in the cache folder obtained with
            :obj:`huggingface-cli login`.
        gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If True, use gradient checkpointing to save memory at the expense of slower backward pass.
    """
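
The `label_smoothing_factor` entry in the docstring describes how one-hot targets are softened. Here is a worked sketch of that arithmetic in plain Python. The helper name `smooth_one_hot` and the example values are assumptions for illustration.

```python
def smooth_one_hot(true_index, num_labels, label_smoothing_factor):
    """Apply the formula quoted in the docstring to a one-hot target vector."""
    off_value = label_smoothing_factor / num_labels  # replaces the 0s
    on_value = 1 - label_smoothing_factor + off_value  # replaces the 1
    return [on_value if i == true_index else off_value for i in range(num_labels)]


targets = smooth_one_hot(true_index=2, num_labels=4, label_smoothing_factor=0.1)
print(targets)  # the entries still sum to 1
```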


In [None]:
import pprint as pp

print(ta_docstring)  # plain print keeps the docstring's line breaks readable
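
Several arguments in the docstring interact: `per_device_train_batch_size`, `gradient_accumulation_steps`, and the device count together set how many examples contribute to one optimizer update, and `warmup_ratio` is converted into a step count. Below is a back-of-the-envelope sketch with made-up numbers. The dataset size and settings are hypothetical, and the rounding is an approximation of what the `Trainer` does rather than a copy of its logic.

```python
import math

# hypothetical settings, chosen only to make the arithmetic concrete
num_examples = 10_000
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1
num_train_epochs = 3
warmup_ratio = 0.05

# examples that contribute to a single optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = math.ceil(num_train_epochs * updates_per_epoch)

# warmup_ratio only matters when warmup_steps is left at 0
warmup_steps = math.ceil(total_updates * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer updates:    {total_updates}")
print(f"warmup steps:         {warmup_steps}")
```

Values such as `logging_steps`, `save_steps`, and `eval_steps` are counted in these optimizer updates, which is why raising `gradient_accumulation_steps` makes each interval cover more training examples.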


## SchedulerType for learning rate 

In [None]:

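# reference copy of the enum defined in transformers.trainer_utils
from transformers.trainer_utils import ExplicitEnum  # base class for the library's enums


class SchedulerType(ExplicitEnum):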
    LINEAR = "linear"
    COSINE = "cosine"
    COSINE_WITH_RESTARTS = "cosine_with_restarts"
    POLYNOMIAL = "polynomial"
    CONSTANT = "constant"
    CONSTANT_WITH_WARMUP = "constant_with_warmup"
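
To get a feel for what these names mean, here is a rough pure-Python sketch of the learning-rate multiplier for the `linear` and `cosine` schedules with warmup. The function names are illustrative, and the formulas follow the commonly used definitions rather than being copied from the library.

```python
import math


def linear_multiplier(step, warmup_steps, total_steps):
    """Ramp from 0 to 1 during warmup, then decay linearly back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def cosine_multiplier(step, warmup_steps, total_steps):
    """Ramp from 0 to 1 during warmup, then follow half a cosine wave down to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 5e-5  # the documented default for learning_rate
for step in (0, 50, 100, 550, 1000):
    lin = base_lr * linear_multiplier(step, warmup_steps=100, total_steps=1000)
    cos = base_lr * cosine_multiplier(step, warmup_steps=100, total_steps=1000)
    print(f"step {step:>4}: linear={lin:.2e} cosine={cos:.2e}")
```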