# Fine Tune BLOOM for Summarization

## Objective

Experiment with custom training of the decoder-only BLOOM/BLOOMZ models to generate summaries from prompts.

## References
### Huggingface BLOOM Discussion Forum & GitHub

* [Huggingface Bloom Discussions](https://huggingface.co/bigscience/bloom/discussions)

* [Text summarization with Bloom#122](https://huggingface.co/bigscience/bloom/discussions/122)

* [Training or Fine-tuning the Bloom AI Model on my own Dataset#187](https://huggingface.co/bigscience/bloom/discussions/187)

> In the [official example for text classification](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) README:
> replace ```--model_name_or_path bert-base-multilingual-cased``` with ```--model_name_or_path bigscience/bloom-560m```

* [Fine-tuning BLOOM for Summarization with Trainer API #234](https://huggingface.co/bigscience/bloom/discussions/234)

* [Huge Num Epochs (9223372036854775807) when using Trainer API with streaming dataset #22757](https://github.com/huggingface/transformers/issues/22757)

* [Data Collator class to use for BLOOM#238](https://huggingface.co/bigscience/bloom/discussions/238)

* [TrainingArguments class - max_steps formula when using streaming dataset](https://discuss.huggingface.co/t/training-max-steps-formula-when-using-streaming-dataset/36531)

### Huggingface Causal Language Model

* [Huggingface Task Guide - Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling)

### Huggingface Task Parameters 

* [Detailed parameters](https://huggingface.co/docs/api-inference/detailed_parameters#text2text-generation-task)

## BLOOM Prompt Example

* [Learn how to use Bloom like chatGPT for free.#183](https://huggingface.co/bigscience/bloom/discussions/183)

```
User: Number BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks.
AI: 
```

<img src="./image/bloom_prompt_example.png" align="left" width=400/>


In [14]:
! pip install torch transformers datasets evaluate scikit-learn rouge rouge-score promptsource --quiet


In [15]:
import re
import gc
import datetime
from typing import (
    List,
    Dict,
    Callable,
)
import multiprocessing

import numpy as np
import pandas as pd
from datasets import (
    load_dataset,
    get_dataset_split_names
)
import torch
import transformers
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModel,
    AutoModelForCausalLM,
    DataCollatorWithPadding,
    DataCollatorForLanguageModeling,
    BloomForCausalLM,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback, 
    IntervalStrategy
)
import evaluate
from promptsource.templates import (
    DatasetTemplates,
    Template
)

In [16]:
def get_datetime_string(
        date_time: datetime.datetime = None,
        format_string: str = '%Y%m%d%H%M%S'
) -> str:
    """Generate date/time string from datetime instance
    Args:
        date_time: datetime.datetime instance. Defaults to the time of the call.
        format_string: default '%Y%m%d%H%M%S' e.g. 20230217113330 for 11:33:30 AM on 2023 Feb 17
    Returns: date/time string
    """
    # A default of datetime.datetime.now() in the signature is evaluated only once,
    # when the function is defined, so resolve the current time at call time instead.
    if date_time is None:
        date_time = datetime.datetime.now()
    return date_time.strftime(format_string)

# Environment

In [17]:
NUM_CPUS: int = multiprocessing.cpu_count()

In [18]:
torch.cuda.empty_cache()

In [19]:
!nvidia-smi

Tue Apr 18 10:16:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   43C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [20]:
!transformers-cli env


Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.28.1
- Platform: Linux-4.14.311-233.529.amzn2.x86_64-x86_64-with-debian-10.6
- Python version: 3.7.10
- Huggingface_hub version: 0.13.4
- Safetensors version: not installed
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>



# Constant

In [21]:
RUN_DATE_TIME: str = get_datetime_string()
print(RUN_DATE_TIME)

## Huggingface Datasets
DATASET_NAME: str = "xsum"
DATASET_TRAIN_NUM_ROWS: int = 204045      # Number of rows in the original train dataset
DATASET_STREAMING: bool = True                    # If using Dataset streaming
DATASET_TRAIN_NUM_SELECT: int = 4098       # Number of rows to use for training
DATASET_VALIDATE_NUM_SELECT: int = 256

# Huggingface Tokenizer (BLOOM default token length is 2048)
MAX_TOKEN_LENGTH: int = 512         # Max token length to avoid out of memory
MAX_RESPONSE_LENGTH: int = 50
BUFFER = 64
MAX_REQUEST_LENGTH: int = MAX_TOKEN_LENGTH - MAX_RESPONSE_LENGTH - BUFFER
PER_DEVICE_BATCH_SIZE: int = 1       # GPU batch size

# Huggingface Model
# MODEL = "bigscience/bloomz-560m"
MODEL = "bigscience/bloomz-560m"
USE_FLOAT16: bool = True

# Training
LEARNING_RATE: float = 3e-5
NUM_EPOCHS: int = 3
MAX_STEPS: int = NUM_EPOCHS * DATASET_TRAIN_NUM_SELECT if DATASET_STREAMING else -1
EVAL_STEPS: int = 500
MODEL_DIR_FINE_TUNED: str = f"finetuned_bloom_model_{RUN_DATE_TIME}"
MODEL_DIR_CHECKPOINT: str = "bloom_fine_tuning"
RESUME_FROM_MODEL_DIR_CHECKPOINT: bool = False

20230418101644
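
A streaming dataset has no known length, so the number of training steps has to be given explicitly through `MAX_STEPS` rather than derived from the number of epochs. The sketch below shows one way to estimate it. The helper `estimate_max_steps` and its `num_devices` and `gradient_accumulation_steps` parameters are hypothetical names introduced here for illustration; with a batch size of 1 on a single GPU the result matches the `NUM_EPOCHS * DATASET_TRAIN_NUM_SELECT` formula in the constants cell.

```python
import math


def estimate_max_steps(
    num_examples: int,
    num_epochs: int,
    per_device_batch_size: int,
    num_devices: int = 1,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Estimate the optimizer steps needed to pass over the data num_epochs times."""
    # Number of examples consumed by one optimizer step.
    examples_per_step = per_device_batch_size * num_devices * gradient_accumulation_steps
    # The last, partially filled batch of an epoch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / examples_per_step)
    return num_epochs * steps_per_epoch


# Values taken from the constants cell: 4098 rows, 3 epochs, batch size 1.
max_steps = estimate_max_steps(num_examples=4098, num_epochs=3, per_device_batch_size=1)
print(max_steps)  # 12294
```

If `PER_DEVICE_BATCH_SIZE` is raised later, the step count shrinks accordingly, which the plain product of epochs and rows would not reflect.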


## Logging

In [22]:
transformers.logging.set_verbosity_info()

# Dataset

Use Huggingface dataset [xsum](https://huggingface.co/datasets/xsum) which has PromptSource templates.

<img src="./image/xsum.png" align="left" width=600/>

<img src="./image/xsum_promptsource_templates.png" align="left"/>

In [23]:
get_dataset_split_names(path=DATASET_NAME)

['train', 'validation', 'test']

In [24]:
train = load_dataset("xsum", split="train", streaming=DATASET_STREAMING)
train

<datasets.iterable_dataset.IterableDataset at 0x7fc801ef7d90>

There are two fields that you'll want to use:

- `document`: the text of the news.
- `summary`: a condensed version of `document` which'll be the model target.

In [25]:
if DATASET_STREAMING:
    example: Dict[str, str]  = list(train.take(50))[0]
else:
    example: Dict[str, str] = train.select(range(DATASET_TRAIN_NUM_SELECT)).shuffle(seed=42)[49]

example

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

# Prompt Template

Use [PromptSource](https://github.com/bigscience-workshop/promptsource) summarization template for XSUM dataset to generate prompts and feed the prompts to the model as the training data.

In [26]:
prompt_templates = DatasetTemplates( dataset_name=DATASET_NAME)  
prompt_templates.all_template_names

['DOC_boils_down_to_simple_idea_that',
 'DOC_given_above_write_one_sentence',
 'DOC_how_would_you_rephrase_few_words',
 'DOC_tldr',
 'DOC_write_summary_of_above',
 'article_DOC_summary',
 'college_roommate_asked_DOC_so_I_recap',
 'read_below_DOC_write_abstract',
 'summarize_DOC',
 'summarize_this_DOC_summary']

In [27]:
template: Template = prompt_templates['summarize_DOC']
print(template.jinja)

Summarize: {{document}}|||
{{summary}}


### Prompt example from the template

In [28]:
prompt, response = template.apply(example=example, truncate=False)
print('-' * 80)
print("Prompt")
print('-' * 80)
print(re.sub(r'[\s\'\"]+', ' ', prompt))

print('-' * 80)
print("Response")
print('-' * 80)
print(re.sub(r'[\s\'\"]+', ' ', response))

--------------------------------------------------------------------------------
Prompt
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Response
--------------------------------------------------------------------------------
Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.


---
# Preprocess

1. Generate prompt from the dataset.
2. Remove characters such as quotes.
3. Trim the text to the maximum length if it exceeds it (note: BLOOMZ skips sentences that exceed 2048 tokens).
4. Tokenize.


To apply the preprocessing function over the entire dataset, use Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once.

* [Datasets - select / filter](https://huggingface.co/docs/datasets/process#select-and-filter)
* [Datasets - select](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.select)
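
With `batched=True` the map function no longer receives one row but one dictionary of lists, which is why the functions in this notebook are written for `batched=False`. The sketch below mimics that layout in plain Python without calling the Datasets library; `to_batch` and `add_prefix_batched` are hypothetical names and the rows are made up.

```python
from typing import Dict, List


def to_batch(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Merge n row dictionaries into one dictionary of lists (column-wise layout)."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def add_prefix_batched(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """A batched function iterates over the lists; here it returns one prompt per row."""
    return {"prompt": ["USER: " + document for document in batch["document"]]}


rows = [
    {"document": "first article", "summary": "first"},
    {"document": "second article", "summary": "second"},
]
batch = to_batch(rows)
print(batch["document"])                    # ['first article', 'second article']
print(add_prefix_batched(batch)["prompt"])  # ['USER: first article', 'USER: second article']
```

A function that treats `batch["document"]` as a single string fails in batched mode, which is what the type assertions in the map functions below guard against.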

### Deep Learning Framework Tensor Format

* [Use with PyTorch - Dataset Format](https://huggingface.co/docs/datasets/use_with_pytorch)
> By default, datasets return regular python objects: integers, floats, strings, lists, etc. To get PyTorch tensors instead, you can set the format of the dataset to pytorch using Dataset.with_format():

```
ds = ds.with_format("torch")
```

* [Using Datasets with TensorFlow](https://huggingface.co/docs/datasets/use_with_tensorflow)

> By default, datasets return regular Python objects: integers, floats, strings, lists, etc. To get TensorFlow tensors instead, you can set the format of the dataset to tf:

```
ds = ds.with_format("tf")
```

The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so the model knows this is a summarization task (the Huggingface guide uses T5; this notebook uses BLOOMZ). Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the keyword `text_target` argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

### Map function for prompt generation

In [29]:
# NOT USING NOW
def get_convert_to_request_response(template: Template) -> Callable:
    def _convert_to_prompt_response(example: Dict[str, str]) -> Dict[str, str]:
        """Generate prompt, response as a dictionary:
        {
            "prompt": "Summarize: ...",
            "response": "..."
        }

        NOTE: DO NOT use with the dataset map function with batched=True. Use batched=False.

        Args:
            example: single {document, summary} pair to be able to apply template
        Returns: a dictionary of prompt and response
        """
        # assert isinstance(example, dict), f"expected dict but {type(example)}.\n{example}" # does not work with streaming
        assert isinstance(example['document'], str), f"expected str but {type(example['document'])}."
        
        prompt, response = template.apply(example=example, truncate=False)
        if len(prompt) <=1 or len(response) <= 1:
            return {
                "prompt": "NA",
                "response": "NA"                
            }

        return {
            "prompt": " ".join(
                re.sub(r'[\s\'\"]+', ' ', prompt).split(' ')[:MAX_REQUEST_LENGTH]
            ),
            "response": " ".join(
                re.sub(r'[\s\'\"]+', ' ', response).split(' ')[:MAX_RESPONSE_LENGTH]
            )
        }

    return _convert_to_prompt_response

convert_to_request_response: Callable = get_convert_to_request_response(template=template)

In [30]:
def get_convert_to_prompt(template: Template) -> Callable:
    def _convert_to_prompt(example: Dict[str, str]) -> Dict[str, str]:
        """Generate prompt as a dictionary:
        {
            "prompt": "Summarize: <document>\n<summary>"
        }

        NOTE: DO NOT use the dataset map function with batched=True. Use batched=False.

        Args:
            example: single {document, summary} pair to be able to apply template
        Returns: a dictionary of prompt
        """
        # assert isinstance(example, dict), f"expected dict but {type(example)}.\n{example}"
        assert isinstance(example['document'], str), f"expected str but {type(example['document'])}."

        prompt, response = template.apply(example=example, truncate=False)
        if len(prompt) <=1 or len(response) <= 1:
            return {
                "prompt": "NA\nNA\n"
            }
        
        # prompt and response are separated with the '\n' character.
        prompt = re.sub(r'^Summarize:', 'USER:', prompt)
        return {
            "prompt": " ".join(
                re.sub(r'[\s\'\"]+', ' ', prompt).split(' ')[:MAX_REQUEST_LENGTH-1]  # -1 for \n
            ) + "\nAI: " + " ".join(
                re.sub(r'[\s\'\"]+', ' ', response).split(' ')[:MAX_RESPONSE_LENGTH-7] # -7 for \nAI:<...>\n\n
            ) + "\n\n"
        }

    return _convert_to_prompt

convert_to_prompt: Callable = get_convert_to_prompt(template=template)

In [31]:
prompt = convert_to_prompt(example=example)
prompt



## Tokenization

* [Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling)

> Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
> Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

```
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

In [32]:
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
# TODO: Needs to verify if using EOS is correct for BLOOM/Z and with DataCollatorWithPadding
# (perhaps should use DataCollatorForLanguageModeling which causes an error)
tokenizer.pad_token = tokenizer.eos_token

loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloomz-560m/snapshots/25f241f41c04f08d658a1dd3b49ad41390109a8e/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloomz-560m/snapshots/25f241f41c04f08d658a1dd3b49ad41390109a8e/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloomz-560m/snapshots/25f241f41c04f08d658a1dd3b49ad41390109a8e/tokenizer_config.json


### Map function to tokenize prompts

A causal LM internally shifts the inputs by one position to use them as labels, according to the explanation below. As such, using the **summary** as the **label** for the sentence would not be correct.

* [Huggingface - Data processing for Causal Language Modeling](https://youtu.be/ma1TrR7gE7I?t=191)

> In causal language modeling, the input sequences themselves are the labels. The input sequence is the label, just shifted by one. All **the shifting is handled by the model internally**.
> <img src="./image/transformers_casual_lm_inputs_are_labels.png" align="left" width=500/>
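
The effect of the internal shift can be made concrete with plain lists. This is an illustrative sketch only: the token ids are made up and `next_token_pairs` is a hypothetical helper that lists which token each position is trained to predict when the labels are a copy of the inputs.

```python
from typing import List, Tuple


def next_token_pairs(input_ids: List[int], labels: List[int]) -> List[Tuple[List[int], int]]:
    """Pair each prefix of the input with the label token it should predict."""
    assert len(input_ids) == len(labels), "labels must be a copy of the inputs"
    # Position i sees tokens 0..i and is scored against label i + 1.
    # The last position has no target, so it produces no pair.
    return [(input_ids[: i + 1], labels[i + 1]) for i in range(len(input_ids) - 1)]


input_ids = [101, 7, 42, 9]   # made-up token ids
labels = list(input_ids)      # labels are a plain copy of the inputs
for context, target in next_token_pairs(input_ids, labels):
    print(context, "->", target)
# [101] -> 7
# [101, 7] -> 42
# [101, 7, 42] -> 9
```

With a `USER: ...\nAI: ...` prompt, the summary tokens after `AI:` are therefore learned as continuations of the document, without a separate label sequence.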

Ideally we would use ```(input,label)=(document,summary)``` as a Seq2Seq task. However, neither the Huggingface Seq2Seq task classes nor the summarization pipeline accept the BLOOM model, because classes such as ```AutoModelForSeq2SeqLM``` expect an encoder-decoder model such as T5. We still need to find out how the BLOOM/BLOOMZ models are trained for seq2seq tasks.

> The model 'BloomForCausalLM' is not supported for summarization. Supported models are ```['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration']```.

In [33]:
# NOT USING NOW
def tokenize_prompt_response(example):
    """Generate the model inputs in the dictionary with format:
    {
        "input_ids": List[int], 
        "attention_mask": List[int],
        "labels": List[int]
    }

    This function only works with the Dataset.map(batched=False).

    Note: Huggingface dataset map(batched=True, batch_size=n) merges the values of 
    n dictionaries into a list per key. If you have n instances of {"key": "v"}, then
    you will get {"key": ["v", "v", "v", ...] }.
    
    Args:
        example:   a dictionary of format {
            "prompt": prompt to summarize a sentence,
            "response": summary for the sentence
        }
    """    
    # NOTE: It is a bug to use 'max_length=MAX_TOKEN_LENGTH' when using 'batched=True'.
    # example["prompt"] with 'batched=True' has N instances of prompts each of which 
    # can have MAX_TOKEN_LENGTH length. Chopping N * MAX_TOKEN_LENGTH to
    # MAX_TOKEN_LENGTH means only using the first prompt out of N.
    # The argument is named 'example', so index it (not the undefined 'examples').
    inputs: Dict[str, List[int]] = tokenizer(
        text=example["prompt"], 
        max_length=MAX_TOKEN_LENGTH,
        truncation=True,
        # padding='max_length',
        # padding=True
    )

    labels: Dict[str, List[int]] = tokenizer(
        text=example["response"], 
        max_length=MAX_TOKEN_LENGTH,
        truncation=True,
        # DataCollatorWithPadding does not pad the "labels" element, hence need to pad here.
        # padding='max_length',
        padding=True
    )
    inputs["labels"] = labels["input_ids"]
    
    return inputs

In [34]:
def tokenize_prompt(example):
    """Generate the model inputs in the dictionary with format:
    {
        "input_ids": List[int], 
        "attention_mask": List[int],
        "labels": List[int]
    }
    
    This function only works with the Dataset.map(batched=False).
    
    Note: Huggingface dataset map(batched=True, batch_size=n) merges the values of 
    n dictionaries into a list per key. If you have n instances of {"key": "v"}, then
    you will get {"key": ["v", "v", "v", ...] }.
    
    Args:
        example:   a dictionary of format {
            "prompt": "Summarize:<document>\n<summary>\n",
        }
    """    
    assert isinstance(example['prompt'], str), f"expected str, got {type(example['prompt'])}"
    inputs: Dict[str, List[int]] = tokenizer(
        example['prompt'], 
        max_length=MAX_TOKEN_LENGTH,
        truncation=True,
        # padding='max_length',
        padding=True
    )
    inputs["labels"] = inputs["input_ids"].copy()   # Causal LM gets the same tokens as inputs and labels
    
    return inputs

In [35]:
tokenized: Dict[str, List[int]] = tokenize_prompt(example=convert_to_prompt(example=example))
tokenizer.decode(token_ids=tokenized['input_ids'])



## Apply preprocessing

In [36]:
if DATASET_STREAMING:
    train = train.take(DATASET_TRAIN_NUM_SELECT)
else:
    train = train.select(
        indices=range(DATASET_TRAIN_NUM_SELECT)
    )

remove_column_names: List[str] = list(train.features.keys())

tokenized_train = train.map(
    function=convert_to_prompt, 
    batched=False,
    #batch_size=2048,
    #drop_last_batch=False,
    remove_columns=remove_column_names,
    #num_proc=NUM_CPUS
).map(
    function=tokenize_prompt, 
    batched=False,
    # batch_size=32,
    # drop_last_batch=True,
    # remove_columns=['prompt', 'response']
    remove_columns=['prompt'],
    #num_proc=NUM_CPUS
).shuffle(
    seed=42
).with_format(
    "torch"
)

del train

In [37]:
validation =  load_dataset(
    path="xsum", 
    split="validation", 
    streaming=DATASET_STREAMING
)

if DATASET_STREAMING:
    validation =  validation.take(DATASET_VALIDATE_NUM_SELECT)
else:
    validation = validation.select(
        indices=range(DATASET_VALIDATE_NUM_SELECT)
    )

tokenized_validation =  validation.map(
    function=convert_to_prompt, 
    batched=False,
    # batch_size=2048,
    # drop_last_batch=False,
    remove_columns=remove_column_names,
    # num_proc=NUM_CPUS
).map(
    function=tokenize_prompt, 
    batched=False,
    # batch_size=32,
    # drop_last_batch=True,
    remove_columns=['prompt'],
    # num_proc=NUM_CPUS
).shuffle(
    seed=42
).with_format(
    "torch"
)

## Verify preprocessing

In [38]:
examples = list(tokenized_train.take(50)) if DATASET_STREAMING else tokenized_train[:50]

In [39]:
print('-' * 80)
print("prompt")
print('-' * 80)

if DATASET_STREAMING:
    print(tokenizer.decode(token_ids=examples[49]['input_ids']).split('\n')[0])
else:
    print(tokenizer.decode(token_ids=examples['input_ids'][49]).split('\n')[0])

print('-' * 80)
print("response")
print('-' * 80)
if DATASET_STREAMING:
    print(tokenizer.decode(token_ids=examples[49]['input_ids']).split('\n')[1])
else:
    print(tokenizer.decode(token_ids=examples['input_ids'][49]).split('\n')[1])


--------------------------------------------------------------------------------
prompt
--------------------------------------------------------------------------------
USER: Scotland s chief statistician estimated services grew by 0.5% and production by 0.3% between April and June, while construction contracted by 1.9%. UK output as a whole grew by 0.7% over the same period. Over the past year, the Scottish economy grew by 0.7% - a third of the UK rate of 2.1%. In the first three months of the year, there was no growth in Scotland. Scottish GDP per person - which takes population changes into account - grew by 0.3% during the second quarter, compared with 0.4% for the UK. The report indicated that growth in Scottish GDP over the past year was driven by growth in the services industry, particularly in business services and finance. However, that was tempered by contractions in the construction and production industries, especially electricity and gas, following the closure in March of 

---
# Training

The hyperparameters to use still need investigation. Those used in [Fine-tune the model? #46 by NXBY - opened Jul 16, 2022](https://huggingface.co/bigscience/bloom/discussions/46#633d452d48ab6a0add2b61bd) might be a starting point.

```
!python run_qa.py \
  --model_name_or_path bigscience/bloom-560m \
  --dataset_name squad_v2 \
  --do_train \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_seq2seq_squad/ \
  --eval_accumulation_steps 1 \
  --version_2_with_negative \
  --overwrite_output_dir
```

## Model

We may need a task-specific model class. For example, to use BERT for classifying pairs of sentences, use ```AutoModelForSequenceClassification```: BERT has not been pretrained on such a task, so the head of the pretrained model is discarded and a new head suitable for sequence classification is added instead.

```
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
```

Note that BLOOM is a decoder-only model, not an encoder-decoder, hence it cannot be used with ```AutoModelForSeq2SeqLM```, which causes:
```
ValueError: Unrecognized configuration class <class 'transformers.models.bloom.configuration_bloom.BloomConfig'> for this kind of AutoModel: AutoModelForSeq2SeqLM. Model type should be one of BartConfig, BigBirdPegasusConfig, BlenderbotConfig, BlenderbotSmallConfig, EncoderDecoderConfig, FSMTConfig, LEDConfig, LongT5Config, M2M100Config, MarianConfig, MBartConfig, MT5Config, MvpConfig, PegasusConfig, PegasusXConfig, PLBartConfig, ProphetNetConfig, SwitchTransformersConfig, T5Config, XLMProphetNetConfig.
```

Have a solid understanding of the model architecture and of the task to fine-tune for, and choose the appropriate model class accordingly. BLOOM is still a new model with a decoder architecture such as GPT.

Note that we cannot use AutoModel as it causes the error:

```
TypeError: The current model class (BloomModel) is not compatible with `.generate()`, as it doesn't have a language model head. Please use one of the following classes instead: {'BloomForCausalLM'}
```

<img src="./image/bloom_model_classes.png" align="left" width=200/>

In [40]:
# model = BloomForCausalLM.from_pretrained(MODEL)
torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.cuda()

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bigscience--bloomz-560m/snapshots/25f241f41c04f08d658a1dd3b49ad41390109a8e/config.json
Model config BloomConfig {
  "_name_or_path": "bigscience/bloomz-560m",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "masked_softmax_fusion": true,
  "model_type": "bloom",
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "offset_alibi": 100,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "seq_length": 2048,
  "skip_bias_add": true,
  "skip_bias_add_qkv": false,
  "slow_but_exact": false,
  "transformers_version": "4.28.1",
  "unk_token_id": 0,
  "use_cache": true,
  "vocab_size": 250880
}

loading weights fi

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0): BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (1): BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementw

In [41]:
type(model)

transformers.models.bloom.modeling_bloom.BloomForCausalLM

In [42]:
dir(model)

['T_destination',
 '__annotations__',
 '__call__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_apply',
 '_auto_class',
 '_backward_compatibility_gradient_checkpointing',
 '_backward_hooks',
 '_buffers',
 '_call_impl',
 '_convert_head_mask_to_5d',
 '_convert_to_bloom_cache',
 '_convert_to_standard_cache',
 '_create_repo',
 '_expand_inputs_for_generation',
 '_extract_past_from_model_output',
 '_forward_hooks',
 '_forward_pre_hooks',
 '_from_config',
 '_get_backward_hooks',
 '_get_decoder_start_token_id',
 '_get_files_timestamps',
 '_get_logits_processor',
 '_get_logits_warper',
 '_get_name',
 '_get_resized_embeddings',
 '_get_resize

In [43]:
model.config

BloomConfig {
  "_name_or_path": "bigscience/bloomz-560m",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "masked_softmax_fusion": true,
  "model_type": "bloom",
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "offset_alibi": 100,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "seq_length": 2048,
  "skip_bias_add": true,
  "skip_bias_add_qkv": false,
  "slow_but_exact": false,
  "transformers_version": "4.28.1",
  "unk_token_id": 0,
  "use_cache": true,
  "vocab_size": 250880
}

## Prediction

See the prediction parameters to use for the Huggingface model tasks.

* [Generation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

See Huggingface BLOOM model discussion for **do_sample** parameter requirement.

* [Change seed in interference API #131](https://huggingface.co/bigscience/bloom/discussions/131#6368f28950a665fa20d35cc0)

> Yes, you need to provide the do_sample parameter as @TimeRobber explained. This endpoint only supports:
> * temperature
> * topK
> * topP
> * do_sample
> * max_new_tokens

Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to create the summarization. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/text_generation) API.
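
To get a feel for what `top_k` and `top_p` do when `do_sample=True`, the sketch below applies both filters to a made-up next-token distribution. It is a simplified illustration in plain Python, not the library's implementation: `filter_top_k_top_p` is a hypothetical helper, and the order (top-k first, then top-p on the renormalized remainder) is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def filter_top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    # Top-k: keep only the k most probable tokens and renormalize them.
    ranked: List[Tuple[str, float]] = sorted(
        probs.items(), key=lambda item: item[1], reverse=True
    )[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Top-p: walk down the ranking until the cumulative probability reaches top_p.
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution to sample from.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}   # made-up distribution
filtered = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.8)
print(sorted(filtered))   # ['a', 'b']
```

Smaller `top_k` or `top_p` values narrow the candidate set and make the output more predictable; larger values allow more varied continuations.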

In [44]:
def predict(model, prompt) -> str:
    # return_tensors='pt' returns PyTorch tensors (a BatchEncoding), not lists of ints.
    inputs = tokenizer(
        text=prompt, 
        max_length=MAX_TOKEN_LENGTH, 
        truncation=True,
        # padding='max_length',
        return_tensors='pt'
    )

    length: int = tuple(inputs['input_ids'].shape)[1]
    response_tokens = model.generate(
        inputs["input_ids"].cuda(), 
        attention_mask=inputs["attention_mask"].cuda(),
        min_new_tokens=32,
        max_new_tokens=length+32,
        do_sample=True, 
        top_k=50, 
        top_p=0.9,
        num_return_sequences=1,
        # return_full_text=False, Not available
        repetition_penalty=50.0,
    )[0]
    response = tokenizer.decode(response_tokens, skip_special_tokens=True)
    return response

### Test prediction before training

In [45]:
# prompt = "Summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
prompt = "USER: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
print(prompt)

USER: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes.


In [46]:
print(predict(model=model, prompt=prompt))

Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.28.1"
}



USER: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes. Now it's time for your friends that are richer than you do! As Americans' government is taking responsibility ... we must step back from our economic growth policies because they have failed us as people...


## Data Collator

Tensors going into a model must have the same shape. Hence we pad all the examples to the length of the longest element when we batch elements together, a technique referred to as dynamic padding. We delay the padding to the last moment; otherwise we carry around padded data, which wastes memory and computation time.

The function that is responsible for packaging examples into a batch is a collate function, which you pass to a DataLoader as an argument when you instantiate it. The collate function converts examples to PyTorch tensors and concatenates them (recursively if your elements are lists, tuples, or dictionaries).

The collator used here is [DataCollatorWithPadding(tokenizer=tokenizer)](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding), which takes a tokenizer as an argument so that it knows which padding token to use and whether the model expects padding on the left or on the right.

```
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

If there are eight tokenized sentences whose lengths are ```[50, 59, 47, 67, 59, 50, 62, 32]```, the collator pads them all to the longest length, 67, giving ```[67,  67,  67,  67,  67,  67,  67,  67 ]```.
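
To illustrate what dynamic padding does to a batch, here is a minimal pure-Python sketch, not the `DataCollatorWithPadding` implementation. The helper name `pad_batch` is hypothetical, it pads on the right for simplicity, and it also pads `labels` with `-100`, the value PyTorch's cross-entropy loss ignores by default. The pad token id 3 matches the `pad_token_id` reported in the generation config above.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def pad_batch(batch: List[Dict[str, List[int]]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every example to the length of the longest one in the batch."""
    longest = max(len(example["input_ids"]) for example in batch)
    padded: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        gap = longest - len(example["input_ids"])
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * gap)
        # 1 marks a real token, 0 marks padding.
        padded["attention_mask"].append([1] * len(example["input_ids"]) + [0] * gap)
        # Padded label positions are set to IGNORE_INDEX so they do not contribute to the loss.
        padded["labels"].append(example["labels"] + [IGNORE_INDEX] * gap)
    return padded


# Eight dummy examples with the lengths used in the text above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
batch = [{"input_ids": [5] * n, "labels": [5] * n} for n in lengths]
padded = pad_batch(batch, pad_token_id=3)
print([len(ids) for ids in padded["input_ids"]])
```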

In [47]:
# DataCollatorWithPadding does not pad 'labels' which causes an error at train()
# https://stackoverflow.com/a/74228547/4281353
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer, 
    # padding='max_length',
    padding=True,
    # pad_to_multiple_of=8,
    # max_length=MAX_TOKEN_LENGTH,
    return_tensors='pt'
)

```DataCollatorForLanguageModeling``` does not work here; the Trainer fails with the following error.

```
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
```

In [48]:
# data_collator = DataCollatorForLanguageModeling(
#    tokenizer=tokenizer, 
#    mlm=False,
#    return_tensors='pt'
# )

In [49]:
# This does not work with DataCollatorForLanguageModeling either.
# Only works with DataCollatorWithPadding
collated = data_collator(list(tokenized_train.take(1))[0]) if DATASET_STREAMING else data_collator(tokenized_train[0])
for key in collated.keys():
    print(f"{key}:{len(collated[key])}")
    
assert len(collated['input_ids']) == len(collated['labels']), \
    f"expected the same length of input_ids:[{len(collated['input_ids'])}] and labels:{len(collated['labels'])}"

You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids:512
attention_mask:512
labels:512


In [50]:
tokenizer.decode(token_ids=collated['input_ids'], skip_special_tokens=True)

'USER: Carlos Vela and Juanmi, formerly of Arsenal and Southampton respectively, scored the hosts goals as Granada suffered a fourth successive defeat under Adams. We are all sad, the players, the fans, everybody, said Adams. There s been a lot of mistakes. We re going to try to rectify it and rebound very quickly. The 50-year-old, who took charge on 10 April, has a contract to the end of the current campaign. However Adams has been working at the Spanish club since November and is vice president of the company owned by Granada s club president. If the team played like this at the beginning of the season, there s no way we d be in this situation, he added. I thought they were incredible today, but it s not a day for incredible, it s too late, you re down, you re finished, it s over. Granada s relegation ends a six-season spell in the top flight. They play Real Madrid at home in their next match on 6 May with fans having walked out of previous defeats in protest at how the club is being

## Evaluation Metrics

Use ROUGE for now.

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [51]:
rouge = evaluate.load("rouge")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the ROUGE metric:

In [52]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
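
As a rough intuition for what ROUGE measures, the sketch below computes a simplified ROUGE-1 F1 score, i.e. the unigram overlap between a prediction and a reference. It is an illustration only: `rouge1_f1` is a hypothetical helper that tokenizes on whitespace and does no stemming, so its scores will not match the `evaluate` implementation exactly.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat", "the cat sat on the mat")
print(round(score, 4))
```

Here every predicted word appears in the reference (precision 1.0), but only two of the six reference words are covered (recall 1/3), which is why a very short summary is penalized.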

## Trainer API

* [TrainingArguments class](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments)

The first step before we can define our Trainer is to define a TrainingArguments class that will contain all the hyperparameters the Trainer will use for training and evaluation. 

```
from transformers import TrainingArguments
training_args = TrainingArguments("bloom-trainer")
```

We can then define a Trainer by passing it all the objects constructed - the model, the training_args, the training and validation datasets, our data_collator, and our tokenizer.

* [Trainer class](https://huggingface.co/docs/transformers/main_classes/trainer)

> The ```Trainer``` class provides an API for training in **PyTorch**  for most standard use cases. It’s used in most of the [example scripts](https://github.com/huggingface/transformers/tree/main/examples). 

[TFTrainer is deprecated](https://discuss.huggingface.co/t/tensorflow-trainer/6383) for TensorFlow, and Keras should be used instead. See the Hugging Face [TensorFlow examples](https://github.com/huggingface/transformers/tree/main/examples/tensorflow) on GitHub.

> TFTrainer will be deprecated and removed in v5, we will focus on better integrating with Keras (though the means of Keras callbacks if we need to add functionality). Checkout the new [classification example](https://github.com/huggingface/transformers/blob/main/examples/tensorflow/text-classification/run_text_classification.py) for an example of where we are going.

The Trainer contains the basic training loop which supports the above features. To customize its behavior, you can subclass it and override its methods, for example ```compute_loss```:
```
import torch
from torch import nn
from transformers import Trainer


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0]))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```




At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/trainer#transformers.TrainingArguments). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the ROUGE metric and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

### Loss function

Use the loss function built into the Hugging Face pretrained model; there is no need to provide one. When ```labels``` are passed to the model, its forward pass computes the loss itself.

* [What is the loss function used in Trainer from the Transformers library of Hugging Face?](https://stackoverflow.com/a/71585375/4281353)
* [Specify Loss for Trainer / TrainingArguments](https://discuss.huggingface.co/t/specify-loss-for-trainer-trainingarguments/10481)
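
For a causal language model such as BLOOM, the built-in loss is a cross-entropy between the logits at position t and the label at position t+1, with label positions equal to `-100` ignored. The following pure-Python sketch shows that computation for a single sequence; `causal_lm_loss` is a hypothetical helper for illustration, and the real model performs the same steps on tensors.

```python
import math
from typing import Sequence

IGNORE_INDEX = -100  # label positions with this value are skipped


def causal_lm_loss(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean next-token cross-entropy for one sequence.

    logits[t] holds the scores over the vocabulary produced at position t,
    and is compared against labels[t + 1], the token that actually follows.
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp gives the log of the softmax denominator.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    # With no scored positions there is nothing to average over.
    return total / count if count else float("nan")


# Uniform scores over a 4-token vocabulary: every prediction costs log(4).
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = causal_lm_loss(uniform, labels=[1, 2, 3])
print(round(loss, 4))
```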


### max_steps for streaming dataset

* [TrainingArguments class - max_steps formula when using streaming dataset](https://discuss.huggingface.co/t/training-max-steps-formula-when-using-streaming-dataset/36531)
* [Explicitly set number of training steps using Trainer](https://discuss.huggingface.co/t/explicitly-set-number-of-training-steps-using-trainer/1127)
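
A streaming dataset has no length, so the number of training steps has to be supplied by hand. One common convention, sketched below under the assumption that the number of examples is known or can be estimated, is to divide the example count by the effective batch size and multiply by the desired number of passes. `compute_max_steps` is a hypothetical helper; the example count 12,294 is the one reported in the training log of this notebook.

```python
import math


def compute_max_steps(num_examples: int, per_device_batch_size: int, num_epochs: int = 1,
                      gradient_accumulation_steps: int = 1, num_devices: int = 1) -> int:
    """Number of optimizer steps needed to see num_examples examples num_epochs times."""
    examples_per_step = per_device_batch_size * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / examples_per_step)
    return steps_per_epoch * num_epochs


max_steps = compute_max_steps(num_examples=12_294, per_device_batch_size=1)
print(max_steps)
```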

### num epochs for streaming dataset

* [Huge Num Epochs (9223372036854775807) when using Trainer API with streaming dataset #22757](https://github.com/huggingface/transformers/issues/22757)

### Training with streaming dataset

* [Streaming Dataset of Sequence Length 2048](https://discuss.huggingface.co/t/streaming-dataset-of-sequence-length-2048/17649)

### Early Stopping

* [Early stopping in Bert Trainer instances](https://stackoverflow.com/questions/69087044/early-stopping-in-bert-trainer-instances)

> You need to:
> * Use load_best_model_at_end = True (EarlyStoppingCallback() requires this to be True).
> * evaluation_strategy = 'steps' or IntervalStrategy.STEPS instead of 'epoch'.
> * eval_steps = 50 (evaluate the metrics after N steps).

```
from transformers import EarlyStoppingCallback, IntervalStrategy
...
...
# Defining the TrainingArguments() arguments
args = TrainingArguments(
   f"training_with_callbacks",
   evaluation_strategy = IntervalStrategy.STEPS, # "steps"
   eval_steps = 50, # Evaluation and Save happens every 50 steps
   save_total_limit = 5, # Only last 5 models are saved. Older ones are deleted.
   learning_rate=2e-5,
   per_device_train_batch_size=batch_size,
   per_device_eval_batch_size=batch_size,
   num_train_epochs=5,
   weight_decay=0.01,
   push_to_hub=False,
   metric_for_best_model = 'f1',
   load_best_model_at_end=True)
```

> In your Trainer():

```
trainer = Trainer(
    model,
    args,
    ...
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)
```

> Of course, when you use compute_metrics(), for example it can be a function below:
> The return of the compute_metrics() should be a dictionary and you can access whatever metric you want/compute inside the function and return.

```
def compute_metrics(p):    
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)    
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

> Note: In newer transformers version, the usage of Enum IntervalStrategy.steps is recommended (see TrainingArguments()) instead of plain steps string, the latter being soon subject to deprecation.

### Out of Memory

Using validation causes ```OutOfMemoryError: CUDA out of memory```. Some workarounds are suggested on the Hugging Face forums.

* [CUDA out of memory only during validation not training](https://discuss.huggingface.co/t/cuda-out-of-memory-only-during-validation-not-training/18378)

> Use ```eval_accumulation_steps``` to regularly offload the predictions on the GPU to the CPU (slower but will avoid this OOM error).  
> Setting ```predict_with_generate``` to True in the training args seemed to solve it for me

* [CUDA out of memory when using Trainer with compute_metrics](https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941/3)

> use ```eval_accumulation_steps``` to set a number of steps after which your predictions are sent back to the CPU (slower but uses less device memory). This should avoid your OOM.
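
A back-of-the-envelope estimate suggests why evaluation can exhaust GPU memory: the Trainer collects the logits of every evaluation batch, and unless `eval_accumulation_steps` is set they stay on the device until evaluation finishes. The sketch below assumes float32 logits (4 bytes per value) and uses values that appear in this notebook: batch size 1, sequences of 512 tokens, and the BLOOM vocabulary size of 250,880. `logits_bytes` is a hypothetical helper.

```python
def logits_bytes(batch_size: int, sequence_length: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of one batch of logits with shape (batch, sequence, vocab)."""
    return batch_size * sequence_length * vocab_size * bytes_per_value


# Assumed values: batch size 1, 512 tokens per example, vocabulary of 250,880 tokens.
per_batch = logits_bytes(batch_size=1, sequence_length=512, vocab_size=250_880)
print(f"{per_batch / 2**20:.0f} MiB per evaluation batch")
```

At this size a few dozen accumulated batches already amount to many gigabytes, which is consistent with the out-of-memory error appearing only during validation.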

### Compute Metrics Error

Computing the metrics fails with the error ```argument 'ids': 'list' object cannot be interpreted as an integer```. The suggested workaround is ```predict_with_generate=True```, but it is not available for BLOOM/BLOOMZ with ```TrainingArguments``` (it causes ```unexpected keyword argument 'predict_with_generate'```) because it is a parameter of the class ```Seq2SeqTrainingArguments```.

* [Type Error: list object cannot be interpreted as integer’ while evaluating a summarization model (seq2seq,BART)](https://discuss.huggingface.co/t/type-error-list-object-cannot-be-interpreted-as-integer-while-evaluating-a-summarization-model-seq2seq-bart/11590)

> the code runs albeit without the metrics (rouge etc) 

> the reason why you get an error with ```predict_with_generate=False``` is because the Trainer won’t call the model’s ```generate()``` method in that case (it just computes the loss / logits, which is why you don’t see the metrics). So if you want to compute things like ROUGE during training, you’ll need to generate the summaries with ```predict_with_generate=True```

We still need to find a way to use ```compute_metrics``` with BLOOM, or to examine whether the ```compute_metrics``` implementation is correct.
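
One possible cause, offered here as an assumption rather than a confirmed diagnosis: for a causal LM the Trainer hands raw logits of shape (batch, sequence, vocabulary) to `compute_metrics`, so `batch_decode` receives lists of floats where it expects token ids. Taking the argmax over the vocabulary axis turns logits into token ids. The sketch below does this on nested lists; with NumPy arrays the equivalent is `np.argmax(logits, axis=-1)`. `greedy_token_ids` is a hypothetical helper. The Trainer also accepts a `preprocess_logits_for_metrics` callable, which may be a convenient place to apply this reduction before the logits are accumulated.

```python
from typing import List, Sequence


def greedy_token_ids(logits: Sequence[Sequence[Sequence[float]]]) -> List[List[int]]:
    """Map logits of shape (batch, sequence, vocab) to token ids of shape (batch, sequence)."""
    return [
        [max(range(len(step)), key=step.__getitem__) for step in sequence]
        for sequence in logits
    ]


# Hypothetical logits: 1 sequence, 3 positions, vocabulary of 4 tokens.
logits = [[[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]]]
token_ids = greedy_token_ids(logits)
print(token_ids)
```

Note that these ids are teacher-forced next-token predictions, not free-running generated summaries, so a ROUGE score computed on them is only a proxy for summarization quality.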

### Early Stopping Callback

Early stopping requires two training arguments:

1. ```metric_for_best_model='eval_<metric>'```
2. ```load_best_model_at_end=True```


The metric name passed to ```metric_for_best_model``` is specified with the ```eval_``` prefix, e.g. ```metric_for_best_model="eval_loss"```. ```load_best_model_at_end=True``` ensures the best model checkpoint is preserved.

* [Trainer not keeping best model checkpoint with save_total_limit=1 #15089](https://github.com/huggingface/transformers/issues/15089)

> load_best_model_at_end=True makes sure the best model checkpoint is always kept. That means the absolute best model checkpoint, so if at step 500, you get a model worse then at step 450, and the best model checkpoint was at step 350, the Trainer will delete the checkpoint at step 450 indeed, and only keep the checkpoint at step 350 for the best model.
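
The patience mechanism behind early stopping can be sketched in a few lines. This is a simplified illustration, not the `EarlyStoppingCallback` source: it assumes a metric where lower is better (such as `eval_loss`), ignores any improvement threshold, and uses made-up loss values. `early_stop_index` is a hypothetical helper.

```python
from typing import List, Optional


def early_stop_index(eval_losses: List[float], patience: int) -> Optional[int]:
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for index, loss in enumerate(eval_losses):
        if loss < best:
            best = loss       # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1    # no improvement at this evaluation
            if bad_evals >= patience:
                return index
    return None


# Hypothetical validation losses, one per evaluation.
losses = [3.70, 3.56, 3.49, 3.42, 3.45, 3.44, 3.43, 3.41]
stop_at = early_stop_index(losses, patience=3)
print(stop_at)
```

In this made-up run the best loss (3.42) is reached at the fourth evaluation, and training stops three evaluations later, before the final 3.41 would have been seen. A larger patience trades extra compute for a lower chance of stopping too early.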

In [53]:
# For a streaming=True dataset, *max_steps* is required because the Trainer cannot determine the number of rows.
# https://discuss.huggingface.co/t/streaming-dataset-into-trainer-does-not-implement-len-max-steps-has-to-be-specified/32893/5
# ValueError: train_dataset does not implement __len__, max_steps has to be specified
# 
# Enabling evaluation can cause OutOfMemoryError
training_args = TrainingArguments(
    output_dir=MODEL_DIR_CHECKPOINT,
    overwrite_output_dir=True,
    max_steps=MAX_STEPS,
    num_train_epochs=-1 if DATASET_STREAMING else NUM_EPOCHS,
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,
    per_device_eval_batch_size=PER_DEVICE_BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01, 
    fp16=USE_FLOAT16,
    no_cuda=False,
    # predict_with_generate=True,  # This is for Seq2SeqTrainingArguments only
    # evaluation_strategy="epoch",
    evaluation_strategy="steps",
    eval_steps=EVAL_STEPS,
    eval_accumulation_steps=1,
    # fp16_full_eval=True,
    # save_strategy="epoch",
    save_strategy="steps",
    save_total_limit=10,
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end=True,
    log_level="debug",
    disable_tqdm=False,
    push_to_hub=False,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [54]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation,
    tokenizer=tokenizer,
    data_collator=data_collator,
    # Disable compute_metrics
    # Causes Error: argument 'ids': 'list' object cannot be interpreted as an integer
    # compute_metrics=compute_metrics,
    callbacks = [
        EarlyStoppingCallback(early_stopping_patience=5)
    ]
)

max_steps is given, it will override any value given in num_train_epochs
Using cuda_amp half precision backend


## Run Training

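When using a streaming dataset, the number of epochs is reported as a huge value (9,223,372,036,854,775,807, the largest signed 64-bit integer) because the dataset length is unknown; training is bounded by ```max_steps``` instead.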

* [Huge Num Epochs (9223372036854775807) when using Trainer API with streaming dataset #22757](https://github.com/huggingface/transformers/issues/22757)

In [55]:
# Keep the checkpoint directory so that training can resume from a saved checkpoint
# !rm -rf $MODEL_DIR_CHECKPOINT

In [56]:
if RESUME_FROM_MODEL_DIR_CHECKPOINT:
    # True resumes from the most recent checkpoint-<step> directory under output_dir
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()

trainer.save_model(MODEL_DIR_FINE_TUNED)
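
Resuming relies on the `checkpoint-<step>` directories that the Trainer writes under the output directory. The sketch below shows one way to locate the most recent one by comparing step numbers numerically, since a plain string sort would rank `checkpoint-500` above `checkpoint-10000`. `latest_checkpoint` is a hypothetical helper and the directory names in the demonstration are made up; `transformers.trainer_utils.get_last_checkpoint` serves the same purpose in the library.

```python
import os
import re
import tempfile
from typing import Optional

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the checkpoint-<step> directory with the highest step, or None."""
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Demonstration on a temporary directory with made-up checkpoint folders.
with tempfile.TemporaryDirectory() as output_dir:
    for step in (500, 1000, 10000):
        os.mkdir(os.path.join(output_dir, f"checkpoint-{step}"))
    latest_name = os.path.basename(latest_checkpoint(output_dir))
print(latest_name)
```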

***** Running training *****
  Num examples = 12,294
  Num Epochs = 9,223,372,036,854,775,807
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 12,294
  Number of trainable parameters = 559,214,592


Step,Training Loss,Validation Loss
500,3.7823,3.702616
1000,3.6638,3.56049
1500,3.5523,3.487336
2000,3.5119,3.422432
2500,3.4868,3.376068
3000,3.413,3.345376
3500,3.3927,3.311706
4000,3.3187,3.292373
4500,2.6689,3.401282
5000,2.3356,3.401787


***** Running Evaluation *****
  Num examples: Unknown
  Batch size = 1
Saving model checkpoint to bloom_fine_tuning/checkpoint-500
Configuration saved in bloom_fine_tuning/checkpoint-500/config.json
Configuration saved in bloom_fine_tuning/checkpoint-500/generation_config.json
Model weights saved in bloom_fine_tuning/checkpoint-500/pytorch_model.bin
tokenizer config file saved in bloom_fine_tuning/checkpoint-500/tokenizer_config.json
Special tokens file saved in bloom_fine_tuning/checkpoint-500/special_tokens_map.json
Deleting older checkpoint [bloom_fine_tuning/checkpoint-10000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples: Unknown
  Batch size = 1
Saving model checkpoint to bloom_fine_tuning/checkpoint-1000
Configuration saved in bloom_fine_tuning/checkpoint-1000/config.json
Configuration saved in bloom_fine_tuning/checkpoint-1000/generation_config.json
Model weights saved in bloom_fine_tuning/checkpoint-1000/pytorch_model.bin
tokenizer config file save

# Inference

In [57]:
prompt

"USER: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [58]:
print(predict(model=model, prompt=prompt))

Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.28.1"
}



USER: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes. At risk from losing your life now as part of an ongoing scheme at high levels with many small business owners living below 10% so that people do not look like this world is going into problems for our future . So are we too afraid? Why was all those who were affected forced out by austerity reduced or moved back home instead when they re paid something different over another 10 years? . I don t mean there arenics but here s what other parties believe about how it feel. 
AIer: All you need says Texas President Muhammad Ali after facing big data challe

## Inference with saved model

In [59]:
del model
torch.cuda.empty_cache()
gc.collect()

244

In [60]:
finetuned_model = AutoModelForCausalLM.from_pretrained(MODEL_DIR_FINE_TUNED).cuda()

loading configuration file finetuned_bloom_model_20230418101644/config.json
Model config BloomConfig {
  "_name_or_path": "finetuned_bloom_model_20230418101644",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "masked_softmax_fusion": true,
  "model_type": "bloom",
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "offset_alibi": 100,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "seq_length": 2048,
  "skip_bias_add": true,
  "skip_bias_add_qkv": false,
  "slow_but_exact": false,
  "torch_dtype": "float32",
  "transformers_version": "4.28.1",
  "unk_token_id": 0,
  "use_cache": true,
  "vocab_size": 250880
}

loading weights file finetuned_bloom_model_20230418101644/pytorch_model.bin
G

In [61]:
predict(model=finetuned_model, prompt=prompt)

Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.28.1"
}



"USER: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes. A big rise means that after about 5 years it becomes cheaper for everyone (from ordinary Americans); but this is likely not true with companies - rather many are surprised when they do actually have higher prices as long term . In recent seasons of rising house price sales were also an important part ways out policy changes such as: Mr Obama said we ve always made progress , But today he changed his mind by raising government money from next week s rate cap into 1p every 25th anniversary since 1999 because so much revenue had been collected than tax

# Try pipeline

In [62]:
# BLOOM is a causal LM, so the matching pipeline task is "text-generation"
# summarizer = pipeline(task="text-generation", model=finetuned_model, tokenizer=tokenizer)

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for summarization with your model, and pass your text to it: