<a href="https://colab.research.google.com/github/muo-ahn/ML/blob/main/summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# # Transformers installation
# ! pip install transformers datasets
# # To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

# !pip install evaluate
# !pip install rouge_score

# Summarization

In [4]:
# from huggingface_hub import notebook_login

# notebook_login()

## Load BillSum dataset

In [5]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

In [6]:
billsum = billsum.train_test_split(test_size=0.2)
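As a quick sanity check on the split above, the following sketch works out the expected split sizes for `test_size=0.2` on the 1,237-example `ca_test` split. It assumes the test count is rounded up and the remainder goes to training (a scikit-learn-style convention; this is an assumption about the rounding, not a quote of the `datasets` source). The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_examples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test size.

    Assumption: the test count is rounded up and training gets the rest.
    """
    n_test = math.ceil(test_size * n_examples)
    n_train = n_examples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(1237, 0.2)
print(n_train, n_test)  # 989 248
```

Because `train_test_split` shuffles by default, passing a fixed `seed=` argument is one way to make the split reproducible across runs.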

In [7]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) Educators and policymakers have long acknowledged that the skills and competencies needed to be an effective teacher are supported through early and structured mentoring and assessment.\n(b) Induction programs help beginning teachers transition into the profession by providing standards-based, individualized assistance that combines the application of theory with intensive mentor-based support and formative assessment.\n(c) In 1998, California created its two-tiered teaching credential system and established the completion of a statewide, standards-based induction program, Beginning Teacher Support and Assessment (BTSA), as a path toward a clear credential.\n(d) Until 2009, the state provided $4,000 per participating teacher to BTSA providers as part of the Teacher Credentialing Block Grant.\n(e) In order to receive state funding, a local e

## Preprocess

In [8]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [9]:
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

In [10]:
tokenized_billsum = billsum.map(preprocess_function, batched=True, remove_columns=billsum["train"].column_names)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]

In [11]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

## Evaluate

In [12]:
import evaluate

rouge = evaluate.load("rouge")

In [13]:
import numpy as np
import tensorflow as tf  # compute_metrics checks for tf.Tensor below

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    if isinstance(predictions, dict):
        predictions = predictions['logits']

    if isinstance(predictions, tf.Tensor):
        predictions = predictions.numpy()
    if isinstance(labels, tf.Tensor):
        labels = labels.numpy()

    if predictions.ndim == 3:  # shape (batch_size, seq_length, vocab_size)
        predictions = np.argmax(predictions, axis=-1)

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
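To make the metric more concrete, here is a simplified sketch of what a ROUGE-1 F-measure captures: unigram overlap between a prediction and a reference. It is an illustration only; the `rouge` metric loaded above applies its own tokenization and optional stemming (`use_stemmer=True`) and also reports other ROUGE variants, none of which this sketch reproduces. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the act lowers drug costs",
    "the act lowers prescription drug costs",
)
print(round(score, 4))  # 0.9091
```

Here all five predicted tokens appear in the six-token reference, so precision is 1, recall is 5/6, and the F1 is 10/11.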

## Train

In [14]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
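For intuition about the two hyperparameters above, the sketch below applies one Adam step with decoupled weight decay to a single scalar parameter. It is a generic AdamW-style update under assumed defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`, with bias correction), not the exact implementation of `AdamWeightDecay`; the function name `adamw_step` is hypothetical.

```python
import math

def adamw_step(param, grad, m, v, step, lr=2e-5, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW-style update for a scalar parameter; returns (param, m, v)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised moving averages.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    adam_update = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay: shrink the parameter directly instead of
    # adding an L2 term to the gradient.
    param = param - lr * (adam_update + weight_decay * param)
    return param, m, v

new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(new_param)
```

On the first step the normalised Adam update is close to 1 in magnitude, so the parameter moves by roughly `lr * (1 + weight_decay * param)`, which is about 2.02e-5 here.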

In [15]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [16]:
import tensorflow as tf

def cast_to_int64(tensor):
    return tf.cast(tensor, dtype=tf.int64)

def ensure_data_types(example):
    example["input_ids"] = cast_to_int64(example["input_ids"])
    example["attention_mask"] = cast_to_int64(example["attention_mask"])
    example["labels"] = cast_to_int64(example["labels"])
    return example

tokenized_billsum["train"] = tokenized_billsum["train"].map(ensure_data_types)
tokenized_billsum["test"] = tokenized_billsum["test"].map(ensure_data_types)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]

In [17]:
# print(tokenized_billsum["train"][:3])

In [18]:
def convert_to_tf_dataset(hf_dataset, batch_size, max_length, shuffle=False):
    def gen():
        for ex in hf_dataset:
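            # Pad or truncate at the end ("post") so the "summarize: " prefix and the start of each bill are kept.
            input_ids = tf.keras.preprocessing.sequence.pad_sequences([ex["input_ids"]], maxlen=max_length, padding='post', truncating='post')[0]
            attention_mask = tf.keras.preprocessing.sequence.pad_sequences([ex["attention_mask"]], maxlen=max_length, padding='post', truncating='post')[0]
            # Pad labels with -100 so padded positions are ignored by the loss (compute_metrics maps -100 back to the pad token).
            labels = tf.keras.preprocessing.sequence.pad_sequences([ex["labels"]], maxlen=max_length, padding='post', truncating='post', value=-100)[0]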
            yield {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

    output_signature = {
        "input_ids": tf.TensorSpec(shape=(max_length,), dtype=tf.int64),
        "attention_mask": tf.TensorSpec(shape=(max_length,), dtype=tf.int64),
        "labels": tf.TensorSpec(shape=(max_length,), dtype=tf.int64)
    }

    tf_dataset = tf.data.Dataset.from_generator(gen, output_signature=output_signature)

    if shuffle:
        tf_dataset = tf_dataset.shuffle(buffer_size=len(hf_dataset))

    tf_dataset = tf_dataset.batch(batch_size)
    return tf_dataset
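The padding logic in `convert_to_tf_dataset` can be illustrated without TensorFlow. The sketch below pads or truncates a token list at the end ("post") to a fixed length, and pads labels with -100 rather than the pad token so that padded positions can be told apart from real tokens (the `compute_metrics` function above maps -100 back to the pad token before decoding). The helper name `pad_or_truncate` is hypothetical, the token ids are made up, and treating -100 as the ignored label value is an assumption about the loss being used.

```python
def pad_or_truncate(ids, max_length, pad_value=0):
    """Keep the first `max_length` items, then right-pad with `pad_value`."""
    kept = list(ids[:max_length])
    return kept + [pad_value] * (max_length - len(kept))

input_ids = [21603, 10, 37, 151, 1]  # made-up token ids for illustration
labels = [37, 2876, 1]               # made-up token ids for illustration

padded_inputs = pad_or_truncate(input_ids, max_length=8, pad_value=0)
attention_mask = pad_or_truncate([1] * len(input_ids), max_length=8, pad_value=0)
padded_labels = pad_or_truncate(labels, max_length=8, pad_value=-100)

print(padded_inputs)   # [21603, 10, 37, 151, 1, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(padded_labels)   # [37, 2876, 1, -100, -100, -100, -100, -100]
```

Truncating at the end matters for long inputs: cutting from the front instead would discard the task prefix and the opening of the document.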

In [19]:
batch_size = 8
max_length = 128

tf_train_set = convert_to_tf_dataset(
    tokenized_billsum["train"],
    batch_size=batch_size,
    max_length=max_length,
    shuffle=True
)

tf_test_set = convert_to_tf_dataset(
    tokenized_billsum["test"],
    batch_size=batch_size,
    max_length=max_length,
    shuffle=False
)
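With `batch_size = 8`, the number of batches per epoch follows from the split sizes reported by the `map` step (989 training and 248 test examples). A small sketch of that arithmetic, assuming the final partial batch is kept (`tf.data.Dataset.batch` uses `drop_remainder=False` by default):

```python
import math

batch_size = 8
train_examples, test_examples = 989, 248  # counts shown in the Map output above

train_steps = math.ceil(train_examples / batch_size)
test_steps = math.ceil(test_examples / batch_size)
print(train_steps, test_steps)  # 124 31
```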

In [20]:
model.compile(optimizer=optimizer)  # No loss argument: Transformers TF models fall back to their built-in task loss

In [21]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)



In [22]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_billsum_model",
    tokenizer=tokenizer,
)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/muo-ahn/my_awesome_billsum_model into local empty directory.


In [23]:
callbacks = [metric_callback, push_to_hub_callback]

In [24]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=15, callbacks=callbacks)

Epoch 1/15


Cause: for/else statement not yet supported


Cause: for/else statement not yet supported
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tf_keras.src.callbacks.History at 0x7ab4c0644730>

## Inference

In [25]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [26]:
from transformers import pipeline

summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")
summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]

In [27]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
inputs = tokenizer(text, return_tensors="tf").input_ids

In [28]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model", from_pt=True)
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFT5ForConditionalGeneration: ['encoder.embed_tokens.weight', 'lm_head.weight', 'decoder.embed_tokens.weight']
- This IS expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFT5ForConditionalGeneration from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [29]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

"The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history. It'll ask the ultra-wealthy and corporations to pay their fair share."
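The `generate` call above sets `do_sample=False`, so tokens are chosen deterministically rather than sampled. Assuming a single beam, this amounts to greedy decoding: at each step the highest-scoring next token is appended until an end-of-sequence token or the token budget is reached. The toy sketch below shows only that loop; `toy_next_token_scores`, the vocabulary and the scores are invented for illustration and have nothing to do with the real model.

```python
EOS = "</s>"

def toy_next_token_scores(generated):
    """Hypothetical stand-in for a model: scores depend only on output length."""
    table = [
        {"the": 0.7, "act": 0.2, EOS: 0.1},
        {"the": 0.1, "act": 0.8, EOS: 0.1},
        {"lowers": 0.6, "costs": 0.3, EOS: 0.1},
        {"costs": 0.9, EOS: 0.1},
    ]
    if len(generated) < len(table):
        return table[len(generated)]
    return {EOS: 1.0}

def greedy_decode(score_fn, max_new_tokens=100):
    """Append the highest-scoring token at each step until EOS or the budget."""
    generated = []
    for _ in range(max_new_tokens):
        scores = score_fn(generated)
        token = max(scores, key=scores.get)  # argmax over the candidate tokens
        if token == EOS:
            break
        generated.append(token)
    return generated

print(" ".join(greedy_decode(toy_next_token_scores)))  # the act lowers costs
```

The `max_new_tokens` argument plays the same role as in the `generate` call: it caps the output length even if no end-of-sequence token is produced.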

## Save model

In [35]:
# Mount Google Drive so the path below points at Drive rather than the local Colab disk
from google.colab import drive
drive.mount('/content/drive')

# Save the model to Google Drive in TensorFlow SavedModel format
model.save('/content/drive/My Drive/models/summarization_eng', save_format='tf')