# Fine Tune BLOOM for Summarization

## Huggingface BLOOM Discussion Forum & GitHub

* [Huggingface Bloom Discussions](https://huggingface.co/bigscience/bloom/discussions)

* [Text summarization with Bloom#122](https://huggingface.co/bigscience/bloom/discussions/122)

* [Training or Fine-tuning the Bloom AI Model on my own Dataset#187](https://huggingface.co/bigscience/bloom/discussions/187)

> In the [official example for text classification](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) README:
> replace ```--model_name_or_path bert-base-multilingual-cased``` with ```--model_name_or_path bigscience/bloom-560m```

* [Fine-tuning BLOOM for Summarization with Trainer API #234](https://huggingface.co/bigscience/bloom/discussions/234)

* [Huge Num Epochs (9223372036854775807) when using Trainer API with streaming dataset #22757](https://github.com/huggingface/transformers/issues/22757)

* [Data Collator class to use for BLOOM#238](https://huggingface.co/bigscience/bloom/discussions/238)

* [TrainingArguments class - max_steps formula when using streaming dataset](https://discuss.huggingface.co/t/training-max-steps-formula-when-using-streaming-dataset/36531)

## Huggingface Causal Language Model

* [Huggingface Task Guide - Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling)

## Huggingface Task Parameters 

* [Detailed parameters](https://huggingface.co/docs/api-inference/detailed_parameters#text2text-generation-task)

## BLOOM Prompt Example

* [Learn how to use Bloom like chatGPT for free.#183](https://huggingface.co/bigscience/bloom/discussions/183)

```
User: Number BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks.
AI: 
```

<img src="./image/bloom_prompt_example.png" align="left" width=400/>


In [2]:
! pip install torch transformers datasets evaluate scikit-learn rouge rouge-score promptsource --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytest-astropy 0.8.0 requires pytest-cov>=2.0, which is not installed.
pytest-astropy 0.8.0 requires pytest-filter-subpackage>=0.1, which is not installed.
sagemaker 2.145.0 requires importlib-metadata<5.0,>=1.4.0, but you have importlib-metadata 6.3.0 which is incompatible.
sagemaker 2.145.0 requires PyYAML==5.4.1, but you have pyyaml 6.0 which is incompatible.
docker-compose 1.29.2 requires PyYAML<6,>=3.10, but you have pyyaml 6.0 which is incompatible.

In [44]:
import re
from typing import (
    List,
    Dict,
    Callable,
)
import multiprocessing

import numpy as np
import pandas as pd
from datasets import (
    load_dataset,
    get_dataset_split_names
)
import torch
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForCausalLM,
    DataCollatorWithPadding,
    DataCollatorForLanguageModeling,
    BloomForCausalLM,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback, 
    IntervalStrategy
)
import evaluate
from promptsource.templates import (
    DatasetTemplates,
    Template
)

# Environment

In [4]:
NUM_CPUS: int = multiprocessing.cpu_count()

# Constant

In [5]:
## Huggingface Datasets
DATASET_NAME: str = "xsum"
DATASET_TRAIN_NUM_ROWS: int = 204045      # Number of rows in the original train dataset
DATASET_STREAMING: bool = False                    # If using Dataset streaming
DATASET_TRAIN_NUM_SELECT: int = 4096       # Number of rows to use for training
DATASET_VALIDATE_NUM_SELECT: int = 32      # Number of rows to use for validation

# Huggingface Tokenizer (BLOOM default token length is 2048)
MAX_TOKEN_LENGTH: int = 512         # Max token length to avoid out of memory
MAX_RESPONSE_LENGTH: int = 64
BUFFER = 64
MAX_REQUEST_LENGTH: int = MAX_TOKEN_LENGTH - MAX_RESPONSE_LENGTH - BUFFER
PER_DEVICE_BATCH_SIZE: int = 1       # GPU batch size

# Huggingface Model
# MODEL = "bigscience/bloomz-560m"
MODEL = "bigscience/bloom-560m"
USE_FLOAT16: bool = True

# Training
NUM_EPOCHS: int = 3
MAX_STEPS: int = NUM_EPOCHS * DATASET_TRAIN_NUM_SELECT if DATASET_STREAMING else -1
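
The `MAX_STEPS` expression above equals the number of optimizer steps only when the batch size is 1 on a single device without gradient accumulation. The sketch below shows a more general approximation; `compute_max_steps` and its parameters are hypothetical names introduced for illustration and are not part of the Trainer API.

```python
import math


def compute_max_steps(
    num_rows: int,
    num_epochs: int,
    per_device_batch_size: int,
    num_devices: int = 1,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Approximate number of optimizer steps to see `num_rows` examples `num_epochs` times."""
    # Number of examples consumed by one optimizer step
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    # Round up so that a partial last batch still counts as a step
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    return num_epochs * steps_per_epoch


# Values used in this notebook: 4096 rows, 3 epochs, batch size 1
print(compute_max_steps(num_rows=4096, num_epochs=3, per_device_batch_size=1))  # 12288
```

With a streaming dataset the Trainer cannot infer the dataset length, which is why a step count has to be supplied explicitly.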

## Load dataset

Use the [xsum](https://huggingface.co/datasets/xsum) dataset, which has PromptSource templates.

<img src="./image/xsum.png" align="left" width=600/>

<img src="./image/xsum_promptsource_templates.png" align="left"/>

In [6]:
get_dataset_split_names(path=DATASET_NAME)

['train', 'validation', 'test']

In [7]:
train = load_dataset("xsum", split="train", streaming=DATASET_STREAMING)
train

Found cached dataset xsum (/root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71)


Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

There are two fields that you'll want to use:

- `document`: the text of the news.
- `summary`: a condensed version of `document` which'll be the model target.

In [8]:
if DATASET_STREAMING:
    example: Dict[str, str]  = list(train.take(50))[0]
else:
    example: Dict[str, str] = train.select(range(DATASET_TRAIN_NUM_SELECT)).shuffle(seed=42)[49]

example

{'document': 'Gatwickmeetandgreet.net also said it had been approved by Gatwick Police and Trading Standards.\nIt said it "never" overbooked customers and parked cars in a police-inspected, fenced and floodlit compound.\nOne reader complained cars were parked entirely in a quiet residential road.\nUrban Parking, owner of the service, did not respond to the Advertising Standards Authority (ASA) questions about the complaint.\nThe ASA said there was no evidence to support customers\' understanding that their cars would be routinely parked at the compound and would remain there for the duration of their stay.\nGatwickmeetandgreet.net\'s claim of having been approved by Gatwick Police and Trading Standards was misleading and unsubstantiated, the ASA ruled.\nIt said the advert must not appear again in its current form, saying: "We told Urban Parking to ensure their future advertising did not mislead in relation to where consumers\' vehicles would be parked."',
 'summary': 'An advert for car

# Prompt Template

In [9]:
prompt_templates = DatasetTemplates( dataset_name=DATASET_NAME)  
prompt_templates.all_template_names

['DOC_boils_down_to_simple_idea_that',
 'DOC_given_above_write_one_sentence',
 'DOC_how_would_you_rephrase_few_words',
 'DOC_tldr',
 'DOC_write_summary_of_above',
 'article_DOC_summary',
 'college_roommate_asked_DOC_so_I_recap',
 'read_below_DOC_write_abstract',
 'summarize_DOC',
 'summarize_this_DOC_summary']

In [10]:
template: Template = prompt_templates['summarize_DOC']
print(template.jinja)

Summarize: {{document}}|||
{{summary}}
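
In the template, the text before `|||` becomes the prompt and the text after it becomes the response. The following is a minimal sketch of that behaviour using plain string replacement; `apply_summarize_doc` is a hypothetical stand-in and is not how PromptSource is implemented (it renders real Jinja templates).

```python
from typing import Dict, Tuple

JINJA_TEMPLATE = "Summarize: {{document}}|||\n{{summary}}"


def apply_summarize_doc(example: Dict[str, str]) -> Tuple[str, str]:
    """Fill the placeholders, then split the rendered text into (prompt, response)."""
    rendered = JINJA_TEMPLATE
    for field, value in example.items():
        # Fields that do not appear in the template (e.g. "id") are simply ignored
        rendered = rendered.replace("{{" + field + "}}", value)
    prompt, response = (part.strip() for part in rendered.split("|||", 1))
    return prompt, response


prompt, response = apply_summarize_doc(
    {"document": "A long article.", "summary": "Short.", "id": "1"}
)
print(prompt)    # Summarize: A long article.
print(response)  # Short.
```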


In [11]:
prompt, response = template.apply(example=example, truncate=False)
print('-' * 80)
print("Prompt")
print('-' * 80)
print(re.sub(r'[\s\'\"]+', ' ', prompt))

print('-' * 80)
print("Response")
print('-' * 80)
print(re.sub(r'[\s\'\"]+', ' ', response))

--------------------------------------------------------------------------------
Prompt
--------------------------------------------------------------------------------
Summarize: Gatwickmeetandgreet.net also said it had been approved by Gatwick Police and Trading Standards. It said it never overbooked customers and parked cars in a police-inspected, fenced and floodlit compound. One reader complained cars were parked entirely in a quiet residential road. Urban Parking, owner of the service, did not respond to the Advertising Standards Authority (ASA) questions about the complaint. The ASA said there was no evidence to support customers understanding that their cars would be routinely parked at the compound and would remain there for the duration of their stay. Gatwickmeetandgreet.net s claim of having been approved by Gatwick Police and Trading Standards was misleading and unsubstantiated, the ASA ruled. It said the advert must not appear again in its current form, saying: We told Urb

---
# Preprocess

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once, provided the mapped function accepts a batch.

* [Datasets - select / filter](https://huggingface.co/docs/datasets/process#select-and-filter)
* [Datasets - select](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.select)

### Framework Tensor Format

* [Use with PyTorch - Dataset Format](https://huggingface.co/docs/datasets/use_with_pytorch)
> By default, datasets return regular python objects: integers, floats, strings, lists, etc. To get PyTorch tensors instead, you can set the format of the dataset to pytorch using Dataset.with_format():

```
ds = ds.with_format("torch")
```

* [Using Datasets with TensorFlow](https://huggingface.co/docs/datasets/use_with_tensorflow)

> By default, datasets return regular Python objects: integers, floats, strings, lists, etc. To get TensorFlow tensors instead, you can set the format of the dataset to tf:

```
ds = ds.with_format("tf")
```

The preprocessing function you want to create needs to:

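1. Prefix the input with a prompt (here the PromptSource `summarize_DOC` template, `Summarize: ...`) so the model knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the keyword `text_target` argument when tokenizing labels for a sequence-to-sequence model such as T5. For a causal LM such as BLOOM, the prompt and the summary are tokenized as one sequence and the labels are a copy of `input_ids`.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.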

## Tokenization

* [Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling)

> Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
> Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

```
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```
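
The "shifted by one element" relation in the quote can be pictured with plain lists. This is only an illustrative sketch: in practice the labels are a copy of `input_ids` and the causal LM classes in transformers apply the shift when computing the loss. `next_token_pairs` is a hypothetical helper.

```python
from typing import List, Tuple


def next_token_pairs(input_ids: List[int]) -> List[Tuple[int, int]]:
    """Pair the token at each position with the token the model should predict next."""
    labels = list(input_ids)  # labels start as a plain copy of the inputs
    # Position i is trained to predict labels[i + 1]; the last position has no target
    return list(zip(input_ids[:-1], labels[1:]))


print(next_token_pairs([10, 11, 12, 13]))  # [(10, 11), (11, 12), (12, 13)]
```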

In [12]:
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

In [13]:
def get_convert_to_request_response(template: Template) -> Callable:
    def _convert_to_prompt_response(example: Dict[str, str]) -> Dict[str, str]:
        """Generate prompt, response as a dictionary:
        {
            "prompt": "Summarize: ...",
            "response": "..."
        }

        NOTE: Do NOT use with Dataset.map(batched=True). Use batched=False.

        Args:
            example: single {document, summary} pair to be able to apply template
        Returns: a dictionary of prompt and response
        """
        # assert isinstance(example, dict), f"expected dict but {type(example)}.\n{example}"
        assert isinstance(example['document'], str), f"expected str but {type(example['document'])}."
        
        prompt, response = template.apply(example=example, truncate=False)
        if len(prompt) <=1 or len(response) <= 1:
            return {
                "prompt": "NA",
                "response": "NA"                
            }

        return {
            "prompt": " ".join(
                re.sub(r'[\s\'\"]+', ' ', prompt).split(' ')[:MAX_REQUEST_LENGTH]
            ),
            "response": " ".join(
                re.sub(r'[\s\'\"]+', ' ', response).split(' ')[:MAX_RESPONSE_LENGTH]
            )
        }

    return _convert_to_prompt_response

convert_to_request_response: Callable = get_convert_to_request_response(template=template)

In [14]:
def get_convert_to_prompt(template: Template) -> Callable:
    def _convert_to_prompt(example: Dict[str, str]) -> Dict[str, str]:
        """Generate prompt as a dictionary:
        {
            "prompt": "Summarize: <document>\n<summary>"
        }

        NOTE: Do NOT use Dataset.map with batched=True. Use batched=False.

        Args:
            example: single {document, summary} pair to be able to apply template
        Returns: a dictionary of prompt
        """
        # assert isinstance(example, dict), f"expected dict but {type(example)}.\n{example}"
        assert isinstance(example['document'], str), f"expected str but {type(example['document'])}."

        prompt, response = template.apply(example=example, truncate=False)
        if len(prompt) <=1 or len(response) <= 1:
            return {
                "prompt": "NA\nNA\n"
            }
        
        return {
            "prompt": " ".join(
                re.sub(r'[\s\'\"]+', ' ', prompt).split(' ')[:MAX_REQUEST_LENGTH-1]  # -1 for \n
            ) + "\n" + " ".join(
                re.sub(r'[\s\'\"]+', ' ', response).split(' ')[:MAX_RESPONSE_LENGTH-1]
            ) + "\n"
        }

    return _convert_to_prompt

convert_to_prompt: Callable = get_convert_to_prompt(template=template)

In [15]:
prompt = convert_to_prompt(example=example)
prompt

{'prompt': 'Summarize: Gatwickmeetandgreet.net also said it had been approved by Gatwick Police and Trading Standards. It said it never overbooked customers and parked cars in a police-inspected, fenced and floodlit compound. One reader complained cars were parked entirely in a quiet residential road. Urban Parking, owner of the service, did not respond to the Advertising Standards Authority (ASA) questions about the complaint. The ASA said there was no evidence to support customers understanding that their cars would be routinely parked at the compound and would remain there for the duration of their stay. Gatwickmeetandgreet.net s claim of having been approved by Gatwick Police and Trading Standards was misleading and unsubstantiated, the ASA ruled. It said the advert must not appear again in its current form, saying: We told Urban Parking to ensure their future advertising did not mislead in relation to where consumers vehicles would be parked. \nAn advert for car parking at Gatwick

In [16]:
def tokenize_prompt_response(examples):
    """Generate the model inputs in the dictionary with format:
    {
        "input_ids": List[int], 
        "attention_mask": List[int],
        "labels": List[int]
    }
    
    Note: Huggingface Dataset.map(batched=True, batch_size=n) merges the values of 
    n dictionaries into one list per key. If you have n instances of {"key": "v"}, then
    you will get {"key": ["v", "v", "v", ...] }.
    
    Args:
        examples:   a dictionary of format {
            "prompt": [prompt+],
            "response": [response+]
        } where + means one or more instances because of Dataset.map(batched=True)
    """    
    # NOTE: With batched=True, examples["prompt"] is a list of N prompts. The tokenizer
    # truncates and pads each prompt in the list separately, so max_length=MAX_TOKEN_LENGTH
    # limits every prompt individually, not the batch as a whole.
    # The labels here are tokenized from the response alone and are not aligned with
    # input_ids. The preprocessing below uses tokenize_prompt instead.
    inputs: Dict[str, List[int]] = tokenizer(
        text=examples["prompt"], 
        max_length=MAX_TOKEN_LENGTH,
        truncation=True,
        padding='max_length',
    )

    labels: Dict[str, List[int]] = tokenizer(
        text=examples["response"], 
        max_length=MAX_TOKEN_LENGTH,
        truncation=True,
        padding='max_length',
    )
    inputs["labels"] = labels["input_ids"]
    
    return inputs
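
The docstring above describes how `map(batched=True)` hands the function one dictionary of lists instead of a list of dictionaries. A minimal sketch of that reshaping follows; `to_batched` is a hypothetical helper for illustration and is not a Datasets API.

```python
from typing import Dict, List


def to_batched(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Turn a list of per-example dicts into one dict with a list of values per key."""
    columns: Dict[str, List[str]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


rows = [
    {"prompt": "Summarize: A", "response": "a"},
    {"prompt": "Summarize: B", "response": "b"},
]
print(to_batched(rows))
# {'prompt': ['Summarize: A', 'Summarize: B'], 'response': ['a', 'b']}
```

This is why a function written for a single example (with `str` values) fails under `batched=True`: every value arrives as a list.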

In [17]:
def tokenize_prompt(example):
    """Generate the model inputs in the dictionary with format:
    {
        "input_ids": List[int], 
        "attention_mask": List[int],
        "labels": List[int]
    }
    
    Note: Huggingface Dataset.map(batched=True, batch_size=n) merges the values of 
    n dictionaries into one list per key. If you have n instances of {"key": "v"}, then
    you will get {"key": ["v", "v", "v", ...] }.
    
    Args:
        example:   a dictionary of format {
            "prompt": "Summarize:<document>\n<summary>\n",
        }
    """    
    assert isinstance(example['prompt'], str), f"expected str, got {type(example['prompt'])}"
    inputs: Dict[str, List[int]] = tokenizer(
        example['prompt'], 
        max_length=MAX_TOKEN_LENGTH,   
        truncation=True,
        padding='max_length',
    )
    inputs["labels"] = inputs["input_ids"].copy()   # Causal LM uses the same tokens as inputs and labels
    
    return inputs

In [18]:
tokenized: Dict[str, List[int]] = tokenize_prompt(example=convert_to_prompt(example=example))
tokenizer.decode(token_ids=tokenized['input_ids'])

'</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s

In [19]:
len(tokenized['input_ids'])

512
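
Because the sequence is padded to 512 tokens with the end-of-sequence token and the labels are a straight copy of `input_ids`, the padding positions also appear in the labels. One possible refinement, which is an assumption on my part and not something this notebook does, is to replace the labels at padded positions with `-100`, the label value that the loss in transformers models ignores. `mask_padding_in_labels` is a hypothetical helper.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_padding_in_labels(inputs: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Copy input_ids into labels, but ignore positions where attention_mask is 0."""
    labels = [
        token_id if mask == 1 else IGNORE_INDEX
        for token_id, mask in zip(inputs["input_ids"], inputs["attention_mask"])
    ]
    return {**inputs, "labels": labels}


# Toy example: three left-padding tokens (id 2) followed by three real tokens
toy = {"input_ids": [2, 2, 2, 55, 66, 77], "attention_mask": [0, 0, 0, 1, 1, 1]}
print(mask_padding_in_labels(toy)["labels"])  # [-100, -100, -100, 55, 66, 77]
```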

## Apply preprocessing

In [20]:
if DATASET_STREAMING:
    train = train.take(DATASET_TRAIN_NUM_SELECT)
    # A streaming (iterable) dataset has no len(), so report the requested size instead
    print(f"size of train: {DATASET_TRAIN_NUM_SELECT}")
else:
    train = train.select(
        indices=range(DATASET_TRAIN_NUM_SELECT)
    )

remove_column_names: List[str] = list(train.features.keys())

tokenized_train = train.map(
    function=convert_to_prompt, 
    batched=False,
    #batch_size=2048,
    #drop_last_batch=False,
    remove_columns=remove_column_names,
    num_proc=NUM_CPUS
).map(
    function=tokenize_prompt, 
    batched=False,
    # batch_size=32,
    # drop_last_batch=True,
    # remove_columns=['prompt', 'response']
    remove_columns=['prompt'],
    num_proc=NUM_CPUS
).shuffle(
    seed=42
).with_format(
    "torch"
)

del train

Map (num_proc=16):   0%|          | 0/4096 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/4096 [00:00<?, ? examples/s]

In [21]:
len(tokenized_train)

4096

In [22]:
validation =  load_dataset(
    path="xsum", 
    split="validation", 
    streaming=DATASET_STREAMING
)

if DATASET_STREAMING:
    validation =  validation.take(DATASET_VALIDATE_NUM_SELECT)
else:
    validation = validation.select(
        indices=range(DATASET_VALIDATE_NUM_SELECT)
    )

tokenized_validation =  validation.map(
    function=convert_to_prompt, 
    batched=False,
    # batch_size=2048,
    # drop_last_batch=False,
    remove_columns=remove_column_names,
    num_proc=NUM_CPUS
).map(
    function=tokenize_prompt, 
    batched=False,
    # batch_size=32,
    # drop_last_batch=True,
    remove_columns=['prompt'],
    num_proc=NUM_CPUS
).shuffle(
    seed=42
).with_format(
    "torch"
)

Found cached dataset xsum (/root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71)


Map (num_proc=16):   0%|          | 0/32 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/32 [00:00<?, ? examples/s]

In [23]:
len(tokenized_validation)

32

## Verify preprocessing

In [24]:
examples = list(tokenized_train.take(50)) if DATASET_STREAMING else tokenized_train[:50]
len(examples['input_ids'])

50

In [25]:
print('-' * 80)
print("prompt")
print('-' * 80)

if DATASET_STREAMING:
    print(tokenizer.decode(token_ids=examples[49]['input_ids']).split('\n')[0])
else:
    print(tokenizer.decode(token_ids=examples['input_ids'][49]).split('\n')[0])

print('-' * 80)
print("response")
print('-' * 80)
if DATASET_STREAMING:
    print(tokenizer.decode(token_ids=examples[49]['input_ids']).split('\n')[1])
else:
    print(tokenizer.decode(token_ids=examples['input_ids'][49]).split('\n')[1])


--------------------------------------------------------------------------------
prompt
--------------------------------------------------------------------------------
</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s

---
# Training

The hyperparameters to use still need investigation. Those used in [Fine-tune the model? #46 by NXBY - opened Jul 16, 2022](https://huggingface.co/bigscience/bloom/discussions/46#633d452d48ab6a0add2b61bd) might be a starting point.

```
!python run_qa.py \
  --model_name_or_path bigscience/bloom-560m \
  --dataset_name squad_v2 \
  --do_train \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_seq2seq_squad/ \
  --eval_accumulation_steps 1 \
  --version_2_with_negative \
  --overwrite_output_dir
```

## Model

We may need a task-specific model class. For example, to classify pairs of sentences with BERT, use ```AutoModelForSequenceClassification```: BERT has not been pretrained on that task, so the head of the pretrained model is discarded and a new head suitable for sequence classification is added instead.

```
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
```

Note that BLOOM is a decoder-only model, not an encoder-decoder, so it cannot be loaded with ```AutoModelForSeq2SeqLM```, which raises:
```
ValueError: Unrecognized configuration class <class 'transformers.models.bloom.configuration_bloom.BloomConfig'> for this kind of AutoModel: AutoModelForSeq2SeqLM. Model type should be one of BartConfig, BigBirdPegasusConfig, BlenderbotConfig, BlenderbotSmallConfig, EncoderDecoderConfig, FSMTConfig, LEDConfig, LongT5Config, M2M100Config, MarianConfig, MBartConfig, MT5Config, MvpConfig, PegasusConfig, PegasusXConfig, PLBartConfig, ProphetNetConfig, SwitchTransformersConfig, T5Config, XLMProphetNetConfig.
```

Make sure you understand the model architecture and the fine-tuning task, then choose the model class accordingly. BLOOM is still a new model with a decoder-only architecture, like GPT.

Note that we cannot use ```AutoModel``` either, because the bare ```BloomModel``` has no language model head and ```generate()``` raises the error:

```
TypeError: The current model class (BloomModel) is not compatible with `.generate()`, as it doesn't have a language model head. Please use one of the following classes instead: {'BloomForCausalLM'}
```

<img src="./image/bloom_model_classes.png" align="left" width=200/>

In [26]:
# model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)
# model = AutoModel.from_pretrained(MODEL)
model = BloomForCausalLM.from_pretrained(MODEL)
model.cuda()

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0): BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (1): BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementw

In [27]:
type(model)

transformers.models.bloom.modeling_bloom.BloomForCausalLM

In [28]:
dir(model)

['T_destination',
 '__annotations__',
 '__call__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_apply',
 '_auto_class',
 '_backward_compatibility_gradient_checkpointing',
 '_backward_hooks',
 '_buffers',
 '_call_impl',
 '_convert_head_mask_to_5d',
 '_convert_to_bloom_cache',
 '_convert_to_standard_cache',
 '_create_repo',
 '_expand_inputs_for_generation',
 '_extract_past_from_model_output',
 '_forward_hooks',
 '_forward_pre_hooks',
 '_from_config',
 '_get_backward_hooks',
 '_get_decoder_start_token_id',
 '_get_files_timestamps',
 '_get_logits_processor',
 '_get_logits_warper',
 '_get_name',
 '_get_resized_embeddings',
 '_get_resize

In [29]:
model.config

BloomConfig {
  "_name_or_path": "bigscience/bloom-560m",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "BloomForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_dropout": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "masked_softmax_fusion": true,
  "model_type": "bloom",
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "offset_alibi": 100,
  "pad_token_id": 3,
  "pretraining_tp": 1,
  "skip_bias_add": true,
  "skip_bias_add_qkv": false,
  "slow_but_exact": false,
  "transformers_version": "4.28.1",
  "unk_token_id": 0,
  "use_cache": true,
  "vocab_size": 250880
}

## Prediction

See the prediction parameters to use for the Huggingface model tasks.

* [Generation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

See Huggingface BLOOM model discussion for **do_sample** parameter requirement.

* [Change seed in interference API #131](https://huggingface.co/bigscience/bloom/discussions/131#6368f28950a665fa20d35cc0)

> Yes, you need to provide the do_sample parameter as @TimeRobber explained. This endpoint only supports:
> * temperature
> * topK
> * topP
> * do_sample
> * max_new_tokens

Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate the summary. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/text_generation) API.

In [48]:
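def predict(model, prompt) -> str:
    """Generate a continuation of the prompt by sampling and return the decoded text."""
    # No max_length is given, so truncation=True has no effect (the tokenizer warns about it)
    inputs: Dict[str, torch.Tensor] = tokenizer(
        text=prompt,
        # max_length=MAX_TOKEN_LENGTH,
        truncation=True,
        # padding='max_length',
        return_tensors='pt'
    )

    # Number of tokens in the prompt
    length: int = tuple(inputs['input_ids'].shape)[1]
    response_tokens = model.generate(
        inputs["input_ids"].cuda(),
        min_new_tokens=length,       # generate at least as many new tokens as the prompt has
        max_new_tokens=length+32,
        do_sample=True,              # sample instead of greedy decoding
        top_k=50,                    # consider only the 50 most likely tokens
        top_p=0.9,                   # ... restricted to the top 90% of probability mass
    )[0]
    # The decoded text contains the prompt followed by the generated tokens
    response = tokenizer.decode(response_tokens, skip_special_tokens=True)
    return response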
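
To make the `top_k` and `top_p` arguments more concrete, here is a simplified sketch of the filtering they perform on a next-token distribution before sampling. It works on a toy dictionary of probabilities rather than on logits, and `top_k_top_p_filter` is a hypothetical helper, not the transformers implementation.

```python
from typing import Dict, List, Tuple


def top_k_top_p_filter(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    # Step 1 (top-k): keep the k most likely tokens and renormalise
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Step 2 (top-p): keep tokens until their cumulative probability reaches top_p
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise the survivors so they form a distribution to sample from
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy = {"car": 0.5, "parking": 0.3, "advert": 0.15, "banana": 0.04, "xyz": 0.01}
filtered = top_k_top_p_filter(toy, top_k=4, top_p=0.9)
print(list(filtered))  # ['car', 'parking', 'advert']
```

Lower values of `top_k` or `top_p` make the output more conservative; higher values allow more varied continuations.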

In [31]:
prompt = "Summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
print(prompt)

Summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes.


In [32]:
print(predict(model=model, prompt=prompt))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes. We should be proud of our new law, which would lift the federal budget by $400 billion. That is the way it should go. But it is all of this at the expense of a large portion of our population. It is the way this bill was originally written. It was intended to lift the budget, but it was put to a slump in the face of the COVID pandemic. It is designed to cover a single bill, and it is designed to protect people against social insurance companies, employers and Americans. And I can't get it right. I want it to cover everyone.
I cannot deny the fac

## Data Collator

Tensors going into a model must have the same shape. Hence, when we batch elements together, we pad all the examples to the length of the longest element, a technique referred to as dynamic padding. We delay the padding to the last moment; otherwise we carry padded data around, which wastes memory and computation time.

The function responsible for packaging examples into a batch is a collate function, which you pass to a DataLoader as an argument when you instantiate it. The collate function converts examples to PyTorch tensors and concatenates them (recursively if your elements are lists, tuples, or dictionaries).

The collator used here is [DataCollatorWithPadding(tokenizer=tokenizer)](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding), which takes a tokenizer as an argument to know which padding token to use and whether the model expects padding on the left or on the right.

```
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

If there are eight tokenized sentences whose lengths are ```[50, 59, 47, 67, 59, 50, 62, 32]```, the collator pads them so that all have length 67: ```[67,  67,  67,  67,  67,  67,  67,  67 ]```.
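
The padding arithmetic can be sketched with plain lists. This is an illustration only: `pad_batch` is a hypothetical helper, it pads on the right for simplicity, and the pad token id `0` is an arbitrary placeholder.

```python
import math
from typing import List, Optional


def pad_batch(
    batch: List[List[int]],
    pad_token_id: int,
    pad_to_multiple_of: Optional[int] = None,
) -> List[List[int]]:
    """Pad every sequence to the longest length in the batch (dynamic padding)."""
    target = max(len(seq) for seq in batch)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple, e.g. 67 -> 72 for 8
        target = math.ceil(target / pad_to_multiple_of) * pad_to_multiple_of
    return [seq + [pad_token_id] * (target - len(seq)) for seq in batch]


lengths = [50, 59, 47, 67, 59, 50, 62, 32]
batch = [[1] * n for n in lengths]

print({len(seq) for seq in pad_batch(batch, pad_token_id=0)})                        # {67}
print({len(seq) for seq in pad_batch(batch, pad_token_id=0, pad_to_multiple_of=8)})  # {72}
```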

In [33]:
# DataCollatorWithPadding does not pad 'labels' which causes an error at train()
# https://stackoverflow.com/a/74228547/4281353
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer, 
    padding='max_length',
    pad_to_multiple_of=8,
    max_length=MAX_TOKEN_LENGTH,
    return_tensors='pt'
)

```DataCollatorForLanguageModeling``` does not work here; the Trainer fails with the following error.

```
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
```
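
The error reports that the `labels` of different examples cannot be stacked into one rectangular tensor. One common way to avoid this is to pad `labels` alongside `input_ids`, filling the extra label positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea in plain Python; `collate_with_labels` is a hypothetical helper (not a transformers API), the pad token id `0` is a placeholder, and each feature is assumed to be a dict of Python lists.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_with_labels(features, pad_token_id=0):
    """Pad input_ids, attention_mask and labels to one common length."""
    longest = max(
        max(len(f["input_ids"]), len(f["labels"])) for f in features
    )
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_inputs = longest - len(f["input_ids"])
        pad_labels = longest - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad_inputs)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad_inputs)
        # Padded label positions must not contribute to the loss.
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * pad_labels)
    return batch


features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [5, 6, 7]},
    {"input_ids": [8, 9], "attention_mask": [1, 1], "labels": [8, 9]},
]
batch = collate_with_labels(features)
print(batch["labels"])
# prints [[5, 6, 7], [8, 9, -100]]
```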

In [45]:
# data_collator = DataCollatorForLanguageModeling(
#    tokenizer=tokenizer, 
#    mlm=False,
#    return_tensors='pt'
# )

In [None]:
# This does not work with DataCollatorForLanguageModeling either.
# Only works with DataCollatorWithPadding
collated = data_collator(list(tokenized_train.take(1))[0]) if DATASET_STREAMING else data_collator(tokenized_train[0])
for key in collated.keys():
    print(f"{key}:{len(collated[key])}")
    
assert len(collated['input_ids']) == len(collated['labels']), \
    f"expected the same length of input_ids:[{len(collated['input_ids'])}] and labels:{len(collated['labels'])}"

In [None]:
tokenizer.decode(token_ids=collated['input_ids'], skip_special_tokens=True)

## Evaluation

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [35]:
rouge = evaluate.load("rouge")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the ROUGE metric:

In [36]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
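
To make the metric less opaque, here is a simplified sketch of what ROUGE-1 measures: the F1 score of unigram overlap between a prediction and a reference. It is not the implementation behind `evaluate.load("rouge")`, which uses its own tokenization and optional stemming, so real scores may differ. `rouge1_f1` is a hypothetical helper that assumes whitespace tokenization.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the act lowers drug costs", "the act lowers energy costs")
print(round(score, 4))
# prints 0.8 (4 of 5 words overlap in both directions)
```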

## Trainer API

* [TrainingArguments class](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments)

The first step before we can define our Trainer is to define a TrainingArguments class that will contain all the hyperparameters the Trainer will use for training and evaluation. 

```
from transformers import TrainingArguments
training_args = TrainingArguments("bloom-trainer")
```

We can then define a Trainer by passing it all the objects constructed - the model, the training_args, the training and validation datasets, our data_collator, and our tokenizer.

* [Trainer class](https://huggingface.co/docs/transformers/main_classes/trainer)

> The ```Trainer``` class provides an API for training in **PyTorch**  for most standard use cases. It’s used in most of the [example scripts](https://github.com/huggingface/transformers/tree/main/examples). 

[TFTrainer is deprecated](https://discuss.huggingface.co/t/tensorflow-trainer/6383) for TensorFlow, so use Keras instead. See the Hugging Face [TensorFlow examples](https://github.com/huggingface/transformers/tree/main/examples/tensorflow) on GitHub.

> TFTrainer will be deprecated and removed in v5, we will focus on better integrating with Keras (though the means of Keras callbacks if we need to add functionality). Checkout the new [classification example](https://github.com/huggingface/transformers/blob/main/examples/tensorflow/text-classification/run_text_classification.py) for an example of where we are going.

The Trainer contains the basic training loop which supports the above features. To customize its behavior, subclass it and override methods such as `compute_loss`:
```
import torch
from torch import nn
from transformers import Trainer


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0]))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```




At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). BLOOM is a causal language model, so this notebook uses the plain `TrainingArguments` and `Trainer` classes rather than `Seq2SeqTrainingArguments` and `Seq2SeqTrainer`. The only required parameter is `output_dir`, which specifies where to save your model. Set `push_to_hub=True` to push the model to the Hub (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the ROUGE metric and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

### Loss function

The Trainer uses the loss that the Hugging Face pretrained model computes itself when `labels` are passed to it, so there is no need to provide a loss function.

* [What is the loss function used in Trainer from the Transformers library of Hugging Face?](https://stackoverflow.com/a/71585375/4281353)
* [Specify Loss for Trainer / TrainingArguments](https://discuss.huggingface.co/t/specify-loss-for-trainer-trainingarguments/10481)
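
For a causal language model, that built-in loss is a next-token cross-entropy: the scores at position `t` are compared with the token at position `t + 1`, and label positions set to `-100` are skipped. The sketch below illustrates the calculation in plain Python; it is not the library's code, and `causal_lm_loss` with its toy inputs is hypothetical.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy.

    logits: one list of vocabulary scores per position.
    labels: one token id per position (IGNORE_INDEX entries are skipped).
    """
    total, count = 0.0, 0
    # Shift by one: the scores at position t predict the token at t + 1.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3 tokens, sequence of 3 positions.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
labels = [0, 0, IGNORE_INDEX]
loss = causal_lm_loss(logits, labels)
print(round(loss, 4))
```

Only one position contributes here: the first row of scores is compared with the second label, and the final label is ignored.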


### max_steps for streaming dataset

* [TrainingArguments class - max_steps formula when using streaming dataset](https://discuss.huggingface.co/t/training-max-steps-formula-when-using-streaming-dataset/36531)
* [Explicitly set number of training steps using Trainer](https://discuss.huggingface.co/t/explicitly-set-number-of-training-steps-using-trainer/1127)
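
A streaming dataset has no length, so the Trainer cannot derive the number of steps and `max_steps` must be set explicitly. If the row count is known from elsewhere, one way to estimate it is sketched below. `estimate_max_steps` and the numbers passed to it are hypothetical and not taken from this notebook.

```python
import math


def estimate_max_steps(num_rows, per_device_batch_size, num_epochs=1,
                       num_devices=1, gradient_accumulation_steps=1):
    """Estimate the optimizer steps needed to see `num_rows` rows `num_epochs` times."""
    # Rows consumed by one optimizer step across all devices.
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    # A final partial batch still costs one step, hence the ceiling.
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    return steps_per_epoch * num_epochs


# Hypothetical numbers for illustration only.
max_steps = estimate_max_steps(num_rows=10_000, per_device_batch_size=8, num_epochs=2)
print(max_steps)
# prints 2500
```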

### num epochs for streaming dataset

* [Huge Num Epochs (9223372036854775807) when using Trainer API with streaming dataset #22757](https://github.com/huggingface/transformers/issues/22757)

### Training with streaming dataset

* [Streaming Dataset of Sequence Length 2048](https://discuss.huggingface.co/t/streaming-dataset-of-sequence-length-2048/17649)

### Early Stopping

* [Early stopping in Bert Trainer instances](https://stackoverflow.com/questions/69087044/early-stopping-in-bert-trainer-instances)

> You need to:
> * Use load_best_model_at_end = True (EarlyStoppingCallback() requires this to be True).
> * evaluation_strategy = 'steps' or IntervalStrategy.STEPS instead of 'epoch'.
> * eval_steps = 50 (evaluate the metrics after N steps).

```
from transformers import EarlyStoppingCallback, IntervalStrategy
...
...
# Defining the TrainingArguments() arguments
args = TrainingArguments(
   f"training_with_callbacks",
   evaluation_strategy = IntervalStrategy.STEPS, # "steps"
   eval_steps = 50, # Evaluation and Save happens every 50 steps
   save_total_limit = 5, # Only last 5 models are saved. Older ones are deleted.
   learning_rate=2e-5,
   per_device_train_batch_size=batch_size,
   per_device_eval_batch_size=batch_size,
   num_train_epochs=5,
   weight_decay=0.01,
   push_to_hub=False,
   metric_for_best_model = 'f1',
   load_best_model_at_end=True)
```

> In your Trainer():

```
trainer = Trainer(
    model,
    args,
    ...
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)
```

> Of course, when you use compute_metrics(), for example it can be a function below:
> The return of the compute_metrics() should be a dictionary and you can access whatever metric you want/compute inside the function and return.

```
def compute_metrics(p):    
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)    
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

> Note: In newer transformers version, the usage of Enum IntervalStrategy.steps is recommended (see TrainingArguments()) instead of plain steps string, the latter being soon subject to deprecation.
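
The effect of `early_stopping_patience` can be pictured with a small sketch: training stops once the monitored metric has failed to improve for that many consecutive evaluations. This illustrates the idea only and is not the code of `EarlyStoppingCallback`. `stopping_step` is a hypothetical helper, the loss values are made up, and a lower-is-better metric is assumed.

```python
def stopping_step(eval_losses, patience=3, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best - loss > threshold:
            # Improvement: remember the new best and reset the counter.
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Made-up evaluation losses, one per evaluation round.
losses = [2.6, 2.5, 2.4, 2.45, 2.41, 2.43, 2.2]
print(stopping_step(losses, patience=3))
# prints 5 (three evaluations in a row without beating 2.4)
```

With `patience=3` the final value of 2.2 is never reached, which is the trade-off of a small patience: a late improvement can be missed.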

In [37]:
# For a streaming=True dataset, *max_steps* is required because the dataset has no length,
# so the Trainer cannot derive the number of training steps from it.
# https://discuss.huggingface.co/t/streaming-dataset-into-trainer-does-not-implement-len-max-steps-has-to-be-specified/32893/5
# ValueError: train_dataset does not implement __len__, max_steps has to be specified
# 
# Enabling evaluation causes an OutOfMemory error.
training_args = TrainingArguments(
    output_dir="bloom_finetuned",
    max_steps=MAX_STEPS,
    num_train_epochs=-1 if DATASET_STREAMING else NUM_EPOCHS,
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,
#    per_device_eval_batch_size=PER_DEVICE_BATCH_SIZE,
    learning_rate=2e-5,
    weight_decay=0.01, 
    fp16=USE_FLOAT16,
    no_cuda=False,
#    evaluation_strategy="epoch",
    evaluation_strategy="steps",
    save_strategy="epoch",
    save_total_limit=3,
    load_best_model_at_end=True,
#    log_level="debug",
    disable_tqdm=False,
    push_to_hub=False,
)

In [38]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
#    eval_dataset=tokenized_validation,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=5)]
)

In [39]:
trainer.train()
trainer.save_model("finetuned_bloom_model")

You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500 | 2.597 |
| 1000 | 2.5284 |
| 1500 | 2.4995 |
| 2000 | 2.4728 |
| 2500 | 2.3475 |
| 3000 | 2.3757 |
| 3500 | 2.4215 |
| 4000 | 2.3439 |
| 4500 | 1.8957 |
| 5000 | 1.7223 |


## Inference

In [40]:
prompt

"Summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [49]:
print(predict(model=model, prompt=prompt))

Summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes. The federal government said the actions, announced in late December, would mean that by the summer of 2030 the federal deficit would be as low as $5.8 trillion dollars. The reduction in health care costs, from $500 billion to $450 billion, means that there will be just 2.5 million more people in the United States eligible for some form of private healthcare. And energy companies will be able to spend more. The federal government says that, from a carbon footprint perspective, it has increased the United States carbon footprint by 29% since 1990.

In [42]:
finetuned_model = AutoModelForCausalLM.from_pretrained("finetuned_bloom_model").cuda()

In [43]:
predict(model=finetuned_model, prompt=prompt)

"Summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes. So it will raise the whole world s hopes for the next 50 years and drive up the likelihood of a baby. American households are paying an average of $9,184, but the average for corporate America is $2,162. And if the federal government cuts its budget to $1.5 trillion in 2017 ($8,400 billion if the US is a nation, not a state), how will the world end? A huge portion of the world s energy is generated by coal and oil. American coal exports about 30% of the world s coal, while American oil exports about 80% of the world s oil. And while the US has 

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for summarization with your model, and pass your text to it: