In [None]:
# Transformers installation
# ! pip install transformers
# To install the latest release instead of the source build, comment out the command below and uncomment the one above.
! pip install git+https://github.com/huggingface/transformers.git

# The cells below import `datasets` and `evaluate` and load the ROUGE metric, so install those packages too.
! pip install datasets evaluate rouge_score

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-4hxnz8f3
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-4hxnz8f3
  Resolved https://github.com/huggingface/transformers.git to commit a564d10afe1a78c31934f0492422700f61a0ffc0
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


# Summarization

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

## Load BillSum dataset

In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

In [None]:
billsum = billsum.train_test_split(test_size=0.2)

In [None]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 14132.725 of the Welfare and Institutions Code is amended to read:\n14132.725.\n(a) To the extent that federal financial participation is available, face-to-face contact between a health care provider and a patient is not required under the Medi-Cal program for teleophthalmology, teledermatology, and teledentistry, and reproductive health care provided by store and forward. Services appropriately provided through the store and forward process are subject to billing and reimbursement policies developed by the department. A Medi-Cal managed care plan that contracts with the department pursuant to this chapter and Chapter 8 (commencing with Section 14200) shall be required to cover\nthe services described in this section.\nreproductive health care provided by store and forward.\n(b) For purposes of this section, “teleophthalmology, teledermatology, and teledentistry, and reproductive health care 

## Preprocess

In [None]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

In [None]:
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

In [None]:
tokenized_billsum = billsum.map(preprocess_function, batched=True, remove_columns=billsum["train"].column_names)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]
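The two `Map` progress bars report 989 training and 248 test examples. A quick sanity check of those counts, assuming the loaded split holds 989 + 248 = 1,237 examples and that the test side is rounded up while the train side is rounded down (an assumption about the rounding rule, not something the notebook states):

```python
import math

n_samples = 989 + 248  # total inferred from the two Map progress bars
test_size = 0.2        # same fraction passed to train_test_split

# Assumed rounding: test count rounded up, train count rounded down.
n_test = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(n_train, n_test)
```

Under these assumptions the two counts add up to the full split, so no example is dropped by the 80/20 split.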

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

## Evaluate

In [None]:
import evaluate

rouge = evaluate.load("rouge")

In [None]:
import numpy as np
import tensorflow as tf  # needed for the tf.Tensor checks below

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # The model may return a dict of outputs; keep only the logits.
    if isinstance(predictions, dict):
        predictions = predictions['logits']

    # Convert TensorFlow tensors to NumPy arrays.
    if isinstance(predictions, tf.Tensor):
        predictions = predictions.numpy()
    if isinstance(labels, tf.Tensor):
        labels = labels.numpy()

    # Logits of shape (batch_size, seq_length, vocab_size): take the most likely token per position.
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace the -100 loss mask with the pad token so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average number of non-padding tokens per prediction.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
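The array handling inside `compute_metrics` can be tried on toy data without a model or tokenizer. This is a minimal sketch: `PAD_TOKEN_ID = 0` is an assumed stand-in for `tokenizer.pad_token_id`, and the logits and label ids are made up for illustration.

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumption: stands in for tokenizer.pad_token_id

# Toy logits with shape (batch_size=1, seq_length=2, vocab_size=3).
logits = np.array([[[0.1, 0.7, 0.2],
                    [0.9, 0.05, 0.05]]])
# Pick the highest-scoring token id at each position.
token_ids = np.argmax(logits, axis=-1)

# Toy labels where -100 marks positions that were padding.
labels = np.array([[37, 1045, 1, -100, -100],
                   [19, 1, -100, -100, -100]])
# Swap -100 for the pad id so a tokenizer could decode the rows.
restored = np.where(labels != -100, labels, PAD_TOKEN_ID)

# Count non-padding tokens per row, as done for gen_len.
lengths = [int(np.count_nonzero(row != PAD_TOKEN_ID)) for row in restored]

print(token_ids.tolist(), restored.tolist(), lengths)
```

Because `gen_len` counts every token that differs from the pad id, a prediction that legitimately contains the pad id mid-sequence would be undercounted; for a rough length statistic that is usually acceptable.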

## Train

In [None]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


In [None]:
import tensorflow as tf

def cast_to_int64(tensor):
    return tf.cast(tensor, dtype=tf.int64)

def ensure_data_types(example):
    example["input_ids"] = cast_to_int64(example["input_ids"])
    example["attention_mask"] = cast_to_int64(example["attention_mask"])
    example["labels"] = cast_to_int64(example["labels"])
    return example

tokenized_billsum["train"] = tokenized_billsum["train"].map(ensure_data_types)
tokenized_billsum["test"] = tokenized_billsum["test"].map(ensure_data_types)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]

In [None]:
# print(tokenized_billsum["train"][:3])

In [None]:
def convert_to_tf_dataset(hf_dataset, batch_size, max_length, shuffle=False):
    def pad_to_max(ids, value):
        # Pad or truncate at the end of the sequence. pad_sequences truncates from the
        # front by default, which would drop the "summarize: " prefix of long inputs.
        return tf.keras.preprocessing.sequence.pad_sequences(
            [ids], maxlen=max_length, dtype="int64",
            padding="post", truncating="post", value=value,
        )[0]

    def gen():
        for ex in hf_dataset:
            yield {
                "input_ids": pad_to_max(ex["input_ids"], tokenizer.pad_token_id),
                "attention_mask": pad_to_max(ex["attention_mask"], 0),
                # Pad labels with -100 so padded positions are ignored by the loss;
                # compute_metrics maps -100 back to the pad token before decoding.
                "labels": pad_to_max(ex["labels"], -100),
            }

    output_signature = {
        "input_ids": tf.TensorSpec(shape=(max_length,), dtype=tf.int64),
        "attention_mask": tf.TensorSpec(shape=(max_length,), dtype=tf.int64),
        "labels": tf.TensorSpec(shape=(max_length,), dtype=tf.int64),
    }

    tf_dataset = tf.data.Dataset.from_generator(gen, output_signature=output_signature)

    if shuffle:
        tf_dataset = tf_dataset.shuffle(buffer_size=len(hf_dataset))

    return tf_dataset.batch(batch_size)
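To see what padding every field to a fixed `max_length` does, here is a pure-Python sketch. `pad_or_truncate` is a hypothetical helper written for illustration, not a Keras or Transformers function. It keeps the first `max_length` ids and pads on the right, whereas Keras `pad_sequences` truncates from the front unless `truncating="post"` is passed. Using -100 as the label fill value is one option, chosen here because `compute_metrics` already maps -100 back to the pad token.

```python
def pad_or_truncate(ids, max_length, pad_value):
    """Keep the first max_length ids, then right-pad with pad_value."""
    clipped = list(ids[:max_length])
    return clipped + [pad_value] * (max_length - len(clipped))

# A short sequence is padded on the right.
short_ids = pad_or_truncate([5, 6, 7], max_length=5, pad_value=0)

# A long sequence loses its tail, not its head.
long_ids = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_value=0)

# Labels padded with -100 instead of the pad id.
label_ids = pad_or_truncate([8, 9, 1], max_length=5, pad_value=-100)

print(short_ids, long_ids, label_ids)
```

Every returned list has exactly `max_length` entries, which is what a fixed `tf.TensorSpec(shape=(max_length,))` signature requires.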

In [None]:
batch_size = 16
max_length = 128

tf_train_set = convert_to_tf_dataset(
    tokenized_billsum["train"],
    batch_size=batch_size,
    max_length=max_length,
    shuffle=True
)

tf_test_set = convert_to_tf_dataset(
    tokenized_billsum["test"],
    batch_size=batch_size,
    max_length=max_length,
    shuffle=False
)

In [None]:
model.compile(optimizer=optimizer)  # No loss argument: the model uses its built-in loss

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)



In [None]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_billsum_model",
    tokenizer=tokenizer,
)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/muo-ahn/my_awesome_billsum_model into local empty directory.


Download file tf_model.h5:   0%|          | 8.00k/357M [00:00<?, ?B/s]

Download file spiece.model:   4%|4         | 32.0k/773k [00:00<?, ?B/s]

Clean file spiece.model:   0%|          | 1.00k/773k [00:00<?, ?B/s]

Clean file tf_model.h5:   0%|          | 1.00k/357M [00:00<?, ?B/s]

In [None]:
callbacks = [metric_callback, push_to_hub_callback]

In [None]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=30, callbacks=callbacks)

Epoch 1/30


Cause: for/else statement not yet supported


Cause: for/else statement not yet supported
     62/Unknown - 160s 924ms/step - loss: 19.3968
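The progress line reports step 62 out of an unknown total, which is what one would expect when the dataset comes from a generator and Keras cannot know its length in advance. A quick check of the step count, assuming the 989 training examples reported by the `Map` output and the `batch_size = 16` set earlier:

```python
import math

n_train = 989    # training examples, from the Map progress bar
batch_size = 16  # value set before building the tf.data datasets

# The last batch is smaller than batch_size but still counts as one step.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch_size)
```

So step 62 is the final, partially filled batch of the first epoch.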

## Inference

In [None]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")
summarizer(text)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
inputs = tokenizer(text, return_tensors="tf").input_ids

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model", from_pt=True)
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

In [None]:
tokenizer.decode(outputs[0], skip_special_tokens=True)